Spiliopoulou et al. (Eds.): From Data and Information Analysis to Knowledge Engineering
Selected Papers from the 29th Annual Conference of the German Classification Society (GfKl)
Magdeburg, ISSN 1431-8814, pp. 430-437, © Springer 2006.

Near Similarity Search and Plagiarism Analysis

Benno Stein¹ and Sven Meyer zu Eissen²

¹ Faculty of Media, Media Systems, Bauhaus University Weimar, 99421 Weimar, Germany
² Faculty of Computer Science, Paderborn University, 33098 Paderborn, Germany
Abstract. Existing methods for text plagiarism analysis are mainly based on “chunking”, a process of grouping a text into meaningful units, each of which is encoded by an integer number. Together these numbers form a document’s signature or fingerprint. An overlap of two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which is bound up with two problems: (i) it is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which additionally increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage. This paper proposes a new class of fingerprints that can be considered as an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints leads to a speed-up by a factor of five and higher, without compromising the recall performance.

1 Plagiarism Analysis

Plagiarism is the act of claiming to be the author of material that someone else actually wrote (Encyclopædia Britannica, 2005). This definition covers text documents, which are also the focus of this paper. Clearly, a question of central importance is to what extent such and similar tasks can be automated. Several techniques for plagiarism analysis have been proposed in the past; most of them rely on one of the following ideas.

Substring Matching. Substring matching approaches try to identify maximal matches in pairs of strings, which are then used as plagiarism indicators (Gusfield (1997)). Typically, the substrings are represented in suffix trees, and graph-based measures are employed to capture the fraction of the plagiarized sections (Baker (1993); Monostori et al. (2002, 2000)). Alternatively, Finkel et al. (2002) as well as Baker (1993) propose the use of text compression algorithms to identify matches.
Keyword Similarity. The idea here is to extract and to weight topic-identifying keywords from a document and to compare them to the keywords of other documents. If the similarity exceeds a threshold, the candidate documents
are divided into smaller pieces, which are then compared recursively (Si et al. (1997); Fullam and Park (2002)). Note that this approach assumes that plagiarism usually happens in topically similar documents.
Fingerprint Analysis. The most popular approach to plagiarism analysis is the detection of overlapping text sequences by means of fingerprinting: documents are partitioned into term sequences, called chunks, from which digital digests are computed that form the document’s fingerprint. When the digests are inserted into a hashtable, collisions indicate matching sequences. Recent work that describes details and variants of this approach includes Brin et al. (1995); Shivakumar and Garcia-Molina (1996); Finkel et al. (2002).
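
A minimal sketch in Python may illustrate the chunking-plus-hashing idea; the tokenization, the overlapping five-word chunks, and the truncated MD5 digests are illustrative assumptions, not a prescription taken from the work cited above:

    import hashlib

    def fingerprint(text: str, chunk_size: int = 5) -> set[int]:
        # Overlapping word n-grams ("shingles"), so that matching passages
        # align regardless of where they start in each document.
        words = text.lower().split()
        chunks = (" ".join(words[i:i + chunk_size])
                  for i in range(max(1, len(words) - chunk_size + 1)))
        # Encode each chunk as an integer digest, here a truncated MD5 hash.
        return {int(hashlib.md5(c.encode("utf-8")).hexdigest()[:8], 16)
                for c in chunks}

    doc_a = "plagiarism is the act of claiming to be the author of material that someone else wrote"
    doc_b = "she was accused of claiming to be the author of material that someone else wrote earlier"
    matches = fingerprint(doc_a) & fingerprint(doc_b)  # collisions flag matching passages

Inserting all digests of a collection into a single hashtable generalizes this pairwise comparison to many documents.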

1.1 Contributions of this Paper

The overall contribution of this paper relates to the usage of fuzzy-fingerprints as an effective tool for plagiarism analysis. To understand the different intentions behind similarity search and plagiarism analysis we first introduce the distinction between local and global similarity. In fact, fuzzy-fingerprints can be understood as a combination of both paradigms, where the parameter “chunk size” controls the degree of locality. In particular, we use this distinction to develop a taxonomy of methods for plagiarism analysis. These considerations are presented in the following section. Section 3 reports on experiments that quantify interesting properties of our approach.

2 Fingerprinting, Similarity, and Plagiarism Analysis

In the context of information retrieval a fingerprint h(d) of a document d can be considered as a set of encoded substrings taken from d, which serve to identify d uniquely.¹ Following Hoad and Zobel (2003), the process of creating a fingerprint comprises four areas that need consideration.

¹ The term “signature” is sometimes also used in this connection.

1. Substring Selection. From the original document substrings (chunks) are extracted according to some selection strategy. Such a strategy may consider positional, frequency-based, or structural information.
2. Substring Number. The substring number defines the fingerprint resolution. There is an obvious trade-off between fingerprint quality, processing effort, and storage requirements, which must be carefully balanced: the more information of a document is encoded in the fingerprint, the more reliably a possible collision of two fingerprints can be interpreted.
3. Substring Size. The substring size defines the fingerprint granularity. A fine granularity makes a fingerprint more susceptible to false matches, while a coarse granularity makes fingerprinting very sensitive to changes.
4. Substring Encoding. The selected substrings are mapped onto integer numbers. Substring conversion is essentially a hash operation, for which, aside from uniqueness and uniformity, efficiency is an important issue (Ramakrishna and Zobel (1997)). For this purpose the popular MD5 hashing algorithm is often employed (Rivest (1992)).

If the main issue is similarity analysis rather than unique identification, the entire document d is used during the substring formation step, i.e., the union of all chunks covers the entire document. The total set of integer numbers represents the fingerprint h(d). Note that the chunks need not be of uniform length but should be formed with the analysis task in mind.

2.1 Local and Global Similarity Analysis

For two documents A and B let h(A) and h(B) be their fingerprints with the respective resolutions |h(A)| and |h(B)|. Following Finkel et al. (2002), a similarity analysis between A and B that is based on h(A) and h(B) measures the portion of the fingerprint intersection:

ϕlocal(A, B) = |h(A) ∩ h(B)| / |h(A) ∪ h(B)|

We call such a kind of similarity measure local similarity or overlap similarity, because it directly relates to the number of identical regions. By contrast, the vector space model along with the cosine measure does not depend on identical regions: two documents may have a similarity of 1 even though they do not share a single 2-gram. The vector space model along with the cosine measure has a global characteristic because it quantifies the term frequencies of the entire document; in particular, the model neglects word order. Figure 1 contrasts the principles of local and global similarity analysis pictorially.
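
For illustration, both measures can be sketched in a few lines of Python; the fingerprint arguments are assumed to be sets of chunk digests as outlined above, and the helper names are ours, not the cited authors’:

    from collections import Counter
    from math import sqrt

    def phi_local(fp_a: set[int], fp_b: set[int]) -> float:
        # Local (overlap) similarity: the shared portion of the two fingerprints.
        return len(fp_a & fp_b) / len(fp_a | fp_b)

    def phi_global(text_a: str, text_b: str) -> float:
        # Global similarity: cosine of the term-frequency vectors; word order is ignored.
        va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
        norm = sqrt(sum(x * x for x in va.values())) * sqrt(sum(x * x for x in vb.values()))
        return dot / norm if norm else 0.0

    print(phi_global("a b c d e f", "f e d c b a"))  # 1.0, although no 2-gram is shared

Reordering the words leaves the global similarity at 1, whereas a fingerprint over contiguous word chunks would report a local similarity near 0.
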
Basically, a fingerprint h(d) of a document d is nothing more than a special document model of d. In this sense, every information retrieval task that is based on a standard document model can also be operationalized with fingerprints. However, fingerprint methods are more flexible since they can be targeted specifically towards one of the following objectives:

1. compactness, with respect to the document length
2. fidelity, with respect to a local similarity analysis

It is difficult to argue whether a fingerprint should be preferred to a standard document model in order to tackle a given information retrieval task. To better understand this problem of choosing an adequate document model we have developed a taxonomy of approaches to plagiarism analysis, which is shown in Figure 2. The approaches as well as the methods can be divided into local and global strategies. Note that in the literature on the subject local plagiarism analysis methods are encountered more often than global analysis methods. This is in the nature of things, since expropriating the exact wording of another author often relates to text passages rather than to the entire text. At the second level our taxonomy differentiates the local approaches with respect to the comparison rigor, and the global approaches with respect to statistical analysis versus style analysis.

[Figure 1 (illustration): the full texts of two documents A and B are shown twice, side by side; in the left pair all matching contiguous word sequences are highlighted, in the right pair the shared word stems are highlighted.]

Fig. 1. Two documents A and B which are analyzed regarding their similarity. The left-hand side illustrates a measure of local similarity: all matching contiguous sequences (chunks) with a length ≥ 5 words are highlighted. The right-hand side illustrates a measure of global similarity: here the common word stems (without stopwords) of documents A and B are highlighted. Observe that both similarity analyses may lead to the same similarity assessment.
Among the shown approaches, chunk identity analysis, usually operationalized with the MD5 hashing algorithm, is the most popular approach to plagiarism analysis. Nevertheless, the method comes along with inherent disadvantages: (i) it is computationally expensive, and (ii) a small chunk size must be chosen (3-10 words), which has a negative impact on both retrieval and storage performance. Observe that all mentioned problems can be countered if the chunk size is drastically increased. This, however, requires some kind of fingerprint that operationalizes a “relaxed” comparison concept.
The following subsection addresses this problem. It introduces fuzzy-fingerprints, which are specifically tailored to text documents and which provide the desired feature: an efficient means for near similarity analysis.

2.2 Fingerprints that Capture Near Similarity

While most fingerprint approaches rely on the original document d, from which chunks are selected and given to a hashing algorithm, our approach is based on the vector space model representation of the chunks. The key idea is a comparison of the distribution of the index terms in each chunk with respect to their expected term frequency classes.

[Figure 2 (diagram): a taxonomy tree rooted at “plagiarism analysis”. The approach level splits into local similarity (chunk identity analysis, chunk similarity analysis) and global similarity (term occurrence analysis, style analysis, the latter comprising text structure analysis and linguistic analysis). The method level comprises MD5 hashing, shingling, and fuzzy-fingerprints as well as order-preserving and order-neglecting methods such as the suffix-tree model with tree cover and the vector space model with cosine measure.]

Fig. 2. A taxonomy of approaches and methods to plagiarism analysis.

In particular, we abstract the concept of term frequency classes towards prefix frequency classes by grouping index terms into a small number of equivalence classes, such that all terms from the same equivalence class start with a particular prefix. Then, grounded on the analysis of large corpora, a reference distribution of index term frequencies can be computed, and, for a predefined set of prefixes, the a priori probability of a term being a member of a certain prefix class can be stated. The deviation of a chunk’s term distribution from these a priori probabilities forms a chunk-specific characteristic that can be encoded as a small integer.
The basic procedure for constructing a fuzzy-fingerprint hϕ(d) for a document d is as follows:²

1. Formation of a set C of chunks for d such that the extracted substrings c ∈ C cover d.
2. For each chunk c ∈ C:
(a) Computation of the vector space model representation c of c.
(b) Computation of pf, the vector of relative frequencies of the prefix classes for the index terms in c.
(c) Computation of Δpf, the vector of relative deviations of pf with respect to the expected prefix class distribution in the British National Corpus.
(d) Fuzzification of Δpf by abstracting the exact deviations towards a fuzzy deviation scheme with r intervals, and computation of a hash value γ:

γ = Σ_{i=0}^{k−1} δ_i · r^i,   with δ_i ∈ {0, …, r − 1},

where k is the number of prefix classes and δ_i is the fuzzified deviation of the frequency of prefix class i. For example, with k = 4, r = 3, and fuzzified deviations (δ_0, δ_1, δ_2, δ_3) = (2, 0, 1, 1), one obtains γ = 2·1 + 0·3 + 1·9 + 1·27 = 38.
3. Formation of hϕ(d) as the union of the hash values γc, c ∈ C.

Remarks. The granularity of the fingerprint is controlled within three dimensions: by the number of chunks, |C|, in Step 1, by the number of equivalence classes, k, in Step 2b, and by the resolution of the fuzzy deviation scheme, r, in Step 2d.

² Actually, the procedure is technically much more involved. It includes an algorithm for chunking, the determination of suited prefix classes, the computation of a reference distribution, and the identification as well as application of fuzzification schemes. Details can be found in Stein (2005).
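
The following sketch condenses Steps 1 to 3 under strongly simplifying assumptions: the four prefix classes, the reference probabilities, and the two deviation bounds are made-up placeholders, whereas the actual procedure derives them from the British National Corpus and a tuned fuzzification scheme (Stein (2005)):

    PREFIXES = ["a", "b", "c", "d"]      # k = 4 assumed prefix classes
    REF = [0.30, 0.25, 0.25, 0.20]       # assumed a priori class probabilities
    BOUNDS = [-0.2, 0.2]                 # r = 3 deviation intervals

    def fuzzify(dev: float) -> int:
        # Map a relative deviation onto one of the r fuzzy intervals 0 .. r-1.
        return sum(dev > b for b in BOUNDS)

    def chunk_hash(chunk: str) -> int:
        # Steps 2a-2d for one chunk: prefix class frequencies -> deviations -> gamma.
        terms = chunk.lower().split()
        counts = [sum(t.startswith(p) for t in terms) for p in PREFIXES]
        total = sum(counts) or 1
        pf = [c / total for c in counts]                # relative frequencies (Step 2b)
        delta = [(f - e) / e for f, e in zip(pf, REF)]  # relative deviations (Step 2c)
        r = len(BOUNDS) + 1
        return sum(fuzzify(d) * r**i for i, d in enumerate(delta))  # Step 2d

    def fuzzy_fingerprint(text: str, chunk_size: int = 100) -> set[int]:
        # Step 1 and Step 3: non-overlapping chunks cover the text; the
        # fingerprint is the union of the chunk hash values.
        words = text.split()
        return {chunk_hash(" ".join(words[i:i + chunk_size]))
                for i in range(0, max(1, len(words)), chunk_size)}

Note that two chunks with merely similar wording tend to fall into the same deviation intervals and hence collide, which is exactly the “relaxed” comparison concept motivated above.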

3 Runtime Performance and Classification Characteristic

This section presents results from a comparative analysis of the fuzzy-fingerprinting approach. In particular we investigate the following questions:

1. Runtime Performance. To what extent is plagiarism identification accelerated compared to MD5 fingerprinting?
2. Classification Characteristic. How does fuzzy-fingerprinting relate to other local (MD5 fingerprinting) and global (vector space model) similarity measures?

To answer these questions we set up different plagiarism experiments. The following plots result from a setting in which the RFC collection of the Internet Society was chosen as the basis: it comprises about 3000 documents with a considerable portion of versioned sections. From this collection 50 documents were drawn randomly and compared to eight collection subsets with sizes between 100 and 800 documents. Since this comparison relied on the documents’ fingerprint representations, the number of observed collisions corresponds directly to the runtime of the plagiarism analysis. Figure 3 reflects this fact: it shows the development of the hash collisions (left) as well as the entire analysis time (right). The main reason for the large performance difference stems from the fact that fuzzy-fingerprinting allows for chunk sizes of 100 words on average, while MD5 fingerprinting works acceptably only for chunk sizes of 3 to 10 words.
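
In terms of code, such an experiment boils down to a collision count in a shared hashtable; the sketch below is our own reading of the setup and assumes a fingerprinting routine such as the ones outlined in Section 2:

    from collections import defaultdict

    def build_index(collection: dict[str, str], fingerprint) -> dict[int, list[str]]:
        # Map every digest to the collection documents it occurs in.
        index = defaultdict(list)
        for doc_id, text in collection.items():
            for digest in fingerprint(text):
                index[digest].append(doc_id)
        return index

    def collisions(query: str, index, fingerprint) -> dict[str, int]:
        # Count, per collection document, how many of the query's digests
        # collide; this count is what governs the runtime of the analysis.
        hits = defaultdict(int)
        for digest in fingerprint(query):
            for doc_id in index.get(digest, []):
                hits[doc_id] += 1
        return hits

Because fuzzy-fingerprints produce far fewer digests per document (one per 100-word chunk instead of one per 3-10-word chunk), both the index size and the number of collisions shrink accordingly.
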
The plot on the left-hand side of Figure 4 answers the question of how a document’s local and global similarity analyses are related. It shows the deviation of fingerprint-based similarity values from the respective cosine similarity values under the vector space model, averaged over all documents of the RFC collection; observe that fuzzy-fingerprints resemble the cosine similarity better than MD5 fingerprints do. Especially against this background the plot on the right-hand side of Figure 4 must be interpreted:

[Figure 3 (plots): number of collisions (×1000, left panel) and runtime of the plagiarism analysis in seconds (right panel), each plotted over collection sizes from 100 to 800 for MD5 fingerprints and fuzzy-fingerprints.]

Fig. 3. Runtime performance of a plagiarism analysis task: 50 documents are compared to different subsets of the RFC collection. The figures show the runtime expressed in the number of fingerprint collisions (left) as well as in seconds (right).

[Figure 4 (plots): left panel: deviation from the global similarity plotted over the global similarity (vector space model with cosine measure) for MD5 fingerprints and fuzzy-fingerprints; right panel: deviation from the local similarity plotted over the local similarity (MD5 fingerprint) for fuzzy-fingerprints; both axes range from 0.2 to 1.0.]

Fig. 4. Classification characteristics within the above plagiarism analysis task. Left: similarity deviation of fingerprinting compared to the cosine similarity under the vector space model. Right: similarity deviation of fuzzy-fingerprinting compared to optimum MD5 fingerprinting.

The rather small deviation between fuzzy-fingerprints and the “optimum” fingerprint, which is a fine-grained MD5 fingerprint of chunk size 3, illustrates the robustness of fuzzy-fingerprinting.

4 Summary

To identify plagiarized versions of a document or of some parts of it, similarity analyses must be performed. In this connection the paper introduced the distinction between local and global similarity measures. Local similarity measures answer the question of what percentage of two documents is identical; global similarity measures answer the question of to what percentage the entire documents are equal. This is a subtle but important difference, which leads to a taxonomy of methods for plagiarism analysis.
Local methods for plagiarism analysis are based on fingerprinting, and in this paper we propose a new class of fuzzy-fingerprints that can be considered as an abstraction of the classical vector space model. These fingerprints allow for chunk sizes that are an order of magnitude larger than the typical MD5
digesting chunk sizes. As a consequence, the identification of plagiarism candidates is accelerated significantly (by more than a factor of five), while at the same time the size of the fingerprint database is reduced.
Our experiments also show the robustness of these fingerprints with respect to both large variations in the chunk size and the similarity range. Altogether, these properties make the concept of fuzzy-fingerprinting an ideal tool for plagiarism analysis and near similarity search in large document collections.

References

Brenda S. Baker. On Finding Duplication in Strings and Software. http://cm.bell-labs.com/cm/cs/papers.html, 1993.
Sergey Brin, James Davis, and Hector Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In SIGMOD ’95, pages 398-409, New York, NY, USA, 1995. ACM Press. ISBN 0-89791-731-6.
Encyclopædia Britannica. New Frontiers in Cheating. http://www.britannica.com/eb/article?tocId=228894, 2005.
Raphael A. Finkel, Arkady Zaslavsky, Krisztian Monostori, and Heinz Schmidt. Signature Extraction for Overlap Detection in Documents. In Proc. of the 25th Australasian Conference on Computer Science, pages 59-64. Australian Computer Society, 2002.
K. Fullam and J. Park. Improvements for Scalable and Accurate Plagiarism Detection in Digital Documents. http://www.lips.utexas.edu/~kfullam/pdf/DataMiningReport.pdf, 2002.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
Timothy C. Hoad and Justin Zobel. Methods for Identifying Versioned and Plagiarised Documents. Journal of the American Society for Information Science and Technology, 54(3):203-215, 2003.
K. Monostori, R. Finkel, A. Zaslavsky, G. Hodász, and M. Pataki. Comparison of Overlap Detection Techniques. In LNCS, volume 2329, 2002.
Krisztián Monostori, Arkady Zaslavsky, and Heinz Schmidt. Document Overlap Detection System for Distributed Digital Libraries. In DL ’00, pages 226-227, New York, NY, USA, 2000. ACM Press.
M. V. Ramakrishna and J. Zobel. Performance in Practice of String Hashing Functions. In Proc. of the Intl. Conf. on Database Systems for Advanced Applications, Australia, 1997.
Ronald L. Rivest. The MD5 Message-Digest Algorithm. http://theory.lcs.mit.edu/~rivest/rfc1321.txt, April 1992.
Narayanan Shivakumar and Hector Garcia-Molina. Building a Scalable and Accurate Copy Detection Mechanism. In DL ’96, pages 160-168, New York, NY, USA, 1996. ACM Press.
Antonio Si, Hong Va Leong, and Rynson W. H. Lau. CHECK: A Document Plagiarism Detection System. In SAC ’97, pages 70-77, New York, NY, USA, 1997. ACM Press.
Benno Stein. Fuzzy-Fingerprints for Text-Based Information Retrieval. In Tochtermann and Maurer, editors, 5th Intl. Conf. on Knowledge Management (I-KNOW 05), Graz, Austria, JUCS. Know-Center, 2005.