throbber
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`53
`
`Features Based Text Similarity Detection
`
`Chow Kok Kent, Naomie Salim
`
`Faculty of Computer Science and Information Systems, University Teknologi Malaysia, 81310 Skudai,
`Johor, Malaysia
`
`Abstract— As the Internet help us cross cultural border by providing different information, plagiarism issue is bound to arise. As a result,
`plagiarism detection becomes more demanding in overcoming this issue. Different plagiarism detection tools have been developed based on
`various detection techniques. Nowadays, fingerprint matching technique plays an important role in those detection tools. However, in handling
`some large content articles, there are some weaknesses in fingerprint matching technique especially in space and time consumption issue. In
`this paper, we propose a new approach to detect plagiarism which integrates the use of fingerprint matching technique with four key features
`to assist in the detection process. These proposed features are capable to choose the main point or key sentence in the articles to be
`compared. Those selected sentence will be undergo the fingerprint matching process in order to detect the similarity between the sentences.
`Hence, time and space usage for the comparison process is reduced without affecting the effectiveness of the plagiarism detection.
`
`—————————— (cid:139) ——————————
`
`1 INTRODUCTION
`
`In plagiarism detection, the content of a suspected 
`document may be represented as a collection of terms, 
`words, stems, phrases, or other units derived or inferred 
`from the text of the document. Different techniques will 
`lead to vary efficiency and effectiveness in plagiarism 
`detection based on different document descriptors. Before 
`a document is taken to be compared, it is necessary for us 
`to choose the most appropriate representation techniques 
`to retrieve the main points of the document. When the 
`documents contain primarily unrestricted text such as 
`newspaper articles, legal documents and so on, the 
`relevance of a document is established through ʹfull‐textʹ 
`retrieval. This has been usually accomplished by 
`identifying key terms in the documents. There are a few 
`techniques that have been developed or adapted for 
`plagiarism detection in natural language documents. The 
`most common   technique used nowadays is the 
`Fingerprint Matching technique [1][2]that consists of the 
`process of scanning and examining the fingerprint of two 
`documents in order to detect plagiarism. 
`
`2 FINGERPRINT MATCHING TECHNIQUE
`Fingerprinting  techniques  mostly  rely  on  the  use  of  K‐
`grams  (Manuel  et  al.  2006)  because  the  process  of 
`fingerprinting divides the document into grams of certain 
`length k. Then, the fingerprints of two documents can be 
`
`compared  in  order  to  detect  plagiarism.  It  has  been 
`observed 
`through 
`the 
`literature 
`that 
`fingerprints 
`matching approach differs based on what representation 
`or comparison unit (i.e.grams) is used. 
`
`Fig.1  Fingerprint Matching Technique 
`
`Character-based Fingerprint Matching
`2.1
`The conventional fingerprinting technique uses sequence
`of characters to form the fingerprint for the whole
`document. During 1996, Heintze divides fingerprinting
`techniques into two types which are full and selective. In
`full fingerprinting, document fingerprint consists of the
`set of all possible substrings of length K. For example, if
`we have a document of length |D| = 5 consisting only
`one statement that has only one word “touch”, then we
`can see that “touc” and “ouch” are the all possible
`substrings of length K = 4. In general, there are |D| – k +
`1 substrings or k-grams, where |D| is the length of the
`document. Basically, comparing two documents under
`
`EX1057
`Roku V. Media Chain
`U.S. Patent No. 10,489,560
`
`

`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`54
`
`for larger values of K. Therefore, we can conclude that K
`= 4 is an optimal or near optimal choice. Here is an
`explanation of how this 3-least frequent 4- grams works.
`A 4-gram of a string is a set of all possible 4-character
`substrings. For example, let take a string S = “English
`Word”, then the possible set of 4-grams include ”engl,
`ngli, glis, lish, ishw, shwo, hwor, word” with ignoring
`spaces.
`Secondly, three least-frequent 4-grams are the best option
`to represent thesentence uniquely. To illustrate the three
`least-frequent 4-gram construction process, consider the
`following sentence S “soccer game is fantastic”. The 4-
`grams are socc, occe, ccer, cerg, etc. In this method, instead
`of comparing all possible 4-grams, only three 4-grams
`which have the least frequency over all 4-grams will be
`chosen. Gauge the frequency or the weight of each n-
`gram was stated by as follows [3].
`
`Let the document contain J distinct n-grams, with mi
`occurrences of n-gram number
`i.
`Then the weight assigned to the ith n-gram will
`
` Where
`
`this technique is counting the number of substrings that
`are common in both fingerprints [1].
`
`Hence, if we compare a document A of size |A| against a
`document B, and if N is the number of substrings
`common in both, then the resemblance measure R of how
`much of A is contained in B can be computed as follows:
`
` where 0 <= R <= 1
`
`It is important to choose the right value of k to provide
`good discrimination among documents. If the value of k
`is chosen appropriately, then full fingerprinting gives
`reliable exact match results. The value given by Heintze
`(1996) was effectively 30-45 characters. Although full
`fingerprinting can be considered as a time and space
`consuming burden, it is a very useful measure for
`document copy detection.
`
`Phrase-based Fingerprint Matching
`2.2
`In 2001, Lyon et al. generates fingerprint using phrase‐
`mechanism to measure the resemblance between two 
`documents. During the early stage, we have to convert 
`each document to a set of trigrams (three words). Hence, 
`a sentence such as “Web Based Cross Language Plagiarism 
`Detection” will be converted to the set trigrams {“Web 
`Based Cross”, “Based Cross Language”, “Cross Language 
`Plagiarism”, “Language Plagiarism Detection”}. Then, the set 
`of trigrams for each document is compared with all other 
`using the matching algorithm. Finally, the measure of the 
`resemblance for each pair of documents is calculated as 
`follows:  
`
` where   0 <= R <= 1 
`
`Thirdly, the three least-frequent 4-grams are concatenated
`to represent the fingerprint of a sentence from the query
`document to be compared with the three least-frequent 4-
`gram representations of sentences
`in
`the corpora
`documents. Thus, in our example, if the three least-
`frequent 4-grams from Snew are occe, ccer, cerg will be
`concatenated
`to
`form
`the
`fingerprint F of Snew
`“occeccercerg”. Finally, two sentences are treated the same
`where S(A) and S(B) are the set of all trigrams in  
`if their corresponding three
`least frequent 4-gram
`documents A and B, respectively. When the value of R is 
`representations are the same. A measure of resemblance
`approaching 1, it means that the document is identical to         
`for each pair of documents is computed as follows:
`the respective pair of document. Phrase‐based fingerprint 
`matching  will  works  better  and  faster  than  character‐
`based fingerprinting since it deals with words rather than 
`letters. 
`
`0 <= R <=1
`
`Statement-based Fingerprint Matching
`2.3
`The pros and cons of character-based and phrase-based
`fingerprinting have led Yerra and Ng (2005) to represent
`the fingerprints of each statement (and thence the whole
`document) by three least-frequent 4-grams. Although any
`value of K can be considered, yet K = 4 was stated as an
`ideal choice by Yerra and Ng (2005). This is because
`smaller values of K (i.e., K = 1, 2, or 3), do not provide
`good discrimination between sentences. On the other
`hand, the larger the values of K (i.e., K = 5, 6, 7...etc), the
`better discrimination of words in one sentence from
`words in another. However each K-gram requires K bytes
`of storage and hence space-consuming becomes too large
`
`where F(A) and F(B) are the common fingerprints in
`documents A and B, respectively.
`The main advantages of the statement-based fingerprint
`method are the time and space consumption issue. The
`processing time and space usage for this method are
`much better than the character-based and phrase-based
`techniques. However, in the case of rewording and
`restructuring of statements, all fingerprinting techniques
`will fail in detecting that kind of plagiarism.
`
`

`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`55
`
`3 FEATURES BASED TEXT SIMILARITY
`DETECTION
`In previous section, we discuss the implementation of the
`fingerprint matching
`technique
`in
`the plagiarism
`detection tools. It cannot be denied that those matching
`techniques play an essential role
`in detecting the
`plagiarism documents. However, there are still some
`weaknesses in these techniques that affect the efficiency
`of the detection process. For instance, time and space
`consumption issue. As we know, both of the plagiarized
`ad reference documents are divided into n-grams before
`the comparison process. The process time of splitting the
`whole documents into n-grams and the space needed to
`store those n-grams will be a burden. In our proposed
`research, we will integrate the fingerprint matching
`technique with our proposed
`features based
`text
`similarity detection method. Rather than split the whole
`documents into n-gram, we will only choose the main
`ideas or important contents of the documents. Several
`features are considered to extract the main idea from the
`documents and integrated with the fingerprint matching
`technique for the comparison technique. All those
`proposed features are important elements to detect the
`plagiarized documents.
`Below are four features that used by our proposed
`research to assist the plagiarism detection.
`
`3.1 Top Keyword Feature
`In order to detect translation plagiarisms, it is essential to 
`translate  the  plagiarized  Malay  documents  into  English 
`before used as the query documents for further detection 
`process.  After  the  plagiarized  documents  have  been 
`translated into English, it will improve the effectiveness of 
`the detection process as the source documents are also in 
`English.  We  use  Google  Translate  API  which  is  a  well‐
`known  translation  tool  developed  by  the  Google  and  is 
`freely  distributed.  With  this API,  the  language  blocks  of 
`text  can  be  easily  detected  and  translated  to  other 
`preferred languages. The API is designed to be a simple 
`and  easy  to  detect  or  translate  languages  when  offline 
`translation is not available. 
`
`3.2 First Sentence Similarity
`In  general,  the  first  sentence  of  a document  contains  the 
`important  points  of  the  overall  document.  The  first 
`sentence can act as a general extraction of the documents 
`especially in news article, first sentence in the article is a 
`very  important  sentence  which  can  simply  represent  the 
`overall contents of the article [4]. For this feature, we use 
`the  first  sentence  of  the  reference  document  to  be 
`compared  with  the  plagiarized  documents.  Fingerprint 
`matching  technique  is  used  to  represent  the  sentence  by 
`dividing the first sentence and the plagiarized text into n‐
`grams  for  comparison.  The  score  for  the  first  sentence 
`similarity feature is computed by,   
`
`
`
`         0 <= R <= 1 
`

`                    

`Where S(A) is the set of all n‐grams in the first sentence of 
`the reference document and S(B) is the set of all n‐grams 
`in  the  plagiarized  document.  .  When  the  value  of  R  is 
`approaching 1, it means that the document is identical to 
`the respective pair of document. 
`
`3.3 Query Phrase
`In  some  condition,  by  taking  the  first  sentence  of  an 
`article  to  be  the  extraction  or  main  ideas  of  the  overall 
`article,  the  effectiveness  of  the  plagiarism  detection  will 
`decreased.  The  first  sentence  similarity  features 
`is 
`efficiently applied to most of the news articles, but not all 
`types  of  the  documents.  Hence,  we  propose  the  query 
`phrase  feature  that  can  be  applied  to  all  types  of  the 
`documents.  Generally, 
`the  query  phase 
`features 
`implement  the  same  flow  of  process  with  the  first 
`sentence similarity. The only difference is the sentence to 
`be  chosen  as  the  extraction  or  main  point  of  the 
`document.  Rather  than  taking  the  first  sentence  as  the 
`extraction, we choose the sentences which are considered 
`important  that  appear  after  the  query  phrase.  In  this 
`feature,  we  determine  a  few  of  the  query  phrases  as 
`below. 

`In conclusion, 
`In general, 
`We conclude that... 
`We find that... 
`The survey shows that... 
`The experiment shows that... 

`Based on this feature, the sentences that come after these 
`query phrases are taken to be the extraction of the overall 
`articles. These sentences can represent the overall ideas of 
`the whole article effectively. For examples, 

`S1:  We  conclude  that  the  main  cause  of  the  social  ills  is  the 
`family problem. 
`S2:  In  conclusion,  it  cannot  be  denied  that  teachers  play  an 
`important role in producing a group of good quality leader for 
`the future. 

`After  determine  the  sentence,  fingerprint  matching 
`technique  is  used  to  represent  the  sentence  by  dividing 
`the chosen sentence and the plagiarized text into n‐grams 
`for comparison. The score for the first sentence similarity 
`feature is computed by,     

`                      

`Where  S(A)  is  the  set  of  all  n‐grams  in  the  chosen 
`sentence of the reference document and S(B) is the set of 
`all n‐grams in the plagiarized document. . When the value 
`of  R  is  approaching  1,  it  means  that  the  document  is 
`identical to the respective pair of document. 
`
`         0 <= R <= 1 
`
`

`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`56
`
`
`Top Keyword Issue
`In  general,  after  removal  of  stop  words  and  stemming 
`process,  the  word  with  higher  frequency  in  a  document 
`can  be  considered  as  the  key  word  that  represent  the 
`overall ideas of the articles. However, some of the highly 
`frequency words in the articles do not necessary provide 
`the overall ideas of the document. This can be due to the 
`author’s style of writing who frequently writes the words 
`that does not bring any significant meaning to the articles. 
`In  this  situation,  those  words  will  be  considered  as  top 
`key  words  despite  it  does  not  provide  any  important 
`meaning  regards  to  the  articles.  The  effectiveness  of  the 
`plagiarism  detection  will  affected  indirectly  due  to  this 
`issue. 
`
`First Sentence Similarity Issue
`As discussed in the previous section, the first sentence 
`similarity issue is only effectively applied to particular 
`types of articles especially the news articles. First sentence 
`in most of the news articles contains the general ideas of 
`the overall news. However, articles such as journal 
`papers, thesis reports, and conference papers have the 
`unique way to identify their most important sentence. 
`First sentence of those types of articles do not provide the 
`right idea on what the whole report is discuss about. 
`Hence, by selecting the first sentence to be compared, the 
`result of the detection cannot achieve its optimum level. 
`
`
`Query Phrase
`Query Phrase feature is an effective solution to overcome 
`the  weaknesses  in  the  first  sentence  similarity  feature.  It 
`provides  a  different  way  to  capture  the  important 
`sentence  in  an  article  and  take  those  sentences  to  be 
`compared  using  the  fingerprint  matching  technique.  In 
`our  proposed  research,  we  determine  6  major  query 
`phrases to be used in the method. However, there are still 
`a lot of phrase can be considered as query phrase which 
`can provide a more effective plagiarism detection process. 
`Future  works  should  be  done  to  increase  the  number  of 
`suitable query phrase that used in out proposed method. 
`
` 5
`
` CONCLUSION
`In this paper, we presented our method in text similarity 
`detection. Feature based text similarity detection is a new 
`approach  to  detect  plagiarism.  This  approach  integrates 
`the  use  of  fingerprint  matching  technique  with  several 
`key  features  to  detect  the  similarity  level  between 
`sentences.  With  further  experiment,  we  feel  that  our 
`approach  can  detect  the  similarity  between  sentences 
`with high efficiency and effectiveness.
`
`3.4 Longest Common Subsequence (LCS)
`Longest Common Subsequence is one of the techniques
`used in ROUGE which is a well-known summary
`evaluation method. Given two sequences X and Y, the
`longest common subsequence (LCS) for X and Y is the
`common subsequence with maximum length [5]. To
`apply LCS
`in plagiarism detection, we view
`the
`documents as a sequence of words. The longer LCS of
`two documents sentences is, the more similar the two
`documents are. We use LCS-based F-measure to estimate
`the similarity between two documents X of length m and
`Y of length n, assuming X is a reference document
`sentences and Y is a plagiarize document.

`
`Rlcs =   
`

`Plcs = 
`

`Flcs = 
`

`

`

`

`
`Where LCS(X,Y) is the length of a longest common
` = Plcs /Rlcs. We notice that
`subsequence of X and Y, and
`the score for the LCS is 1 when X=Y while score is zero
`when LCS(X,Y) =0 where there is nothing common in
`both between X and Y.
`The main advantage of LCS is that it does not require
`consecutive matches but in-sequence matches that reflect
`sentence
`level word order as n-grams. The other
`advantage is that it automatically includes the longest in-
`sequence common n-grams, therefore no predefined n-
`gram length is necessary [5].
`For example,
`
`S1: Player kicked the ball.
`S2: Player kick the ball.
`S3: The ball kick player.
`
`S1 is the reference sentence and S2 and S3 is two different
`plagiarized sentences respectively. If we implement the
`fingerprint matching technique with 2-gram, the score for
`S2 and S3 is the same since both have the same bigrams
`“the ball”. However, S2 and S3 have significant difference
`in sentence meaning. When using the Longest Common
`Subsequence (LCS), S2 has a score of 3/4=0.75 and S3 has
`. Hence, S2 has a higher
`a score of 2/4=0.5, with
`similarity score than S3.
`
`4 DISCUSSION
`Feature based text similarity detection is a method to
`detect the plagiarism by integrating the use of fingerprint
`matching technique with several key features to assist the
`detection process. However, several aspects still need to
`be improved in order to increase the effectiveness and
`efficiency of the detection.
`
`
`
`

`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`57
`
`Dr. Naomie Salim is an Assoc.Prof. presently working as 
`a Deputy Dean of Postgraduate Studies in the Faculty of 
`Computer  Science  and  Information  System  in  Universiti 
`Teknologi Malaysia. She received her degree in Computer 
`Science  from  Universiti  Teknologi  Malaysia  in  1989.  She 
`received her Master degree from University of Illinois and 
`Ph.D  Degree  from  University  of  Sheffield  in  1992  and 
`2002  respectively.  Her  current  research  interest  includes 
`Information  Retrieval,  Distributed  Database 
`and 
`Chemoinformatic. 

`


`
` .
`
`
`
`
`6 ACKNOWLEDGEMENT
`This project is sponsored partly by the Ministry of 
`Science, Technology and Innovation under the E‐Science 
`grant 01‐01‐06‐SF‐0502. 
`
`.  
`
`[4]
`
`7 REFERENCES
`[1] HEINTZE, Nevin. Scalable document fingerprinting. 
`Proceedings of the Second USENIX Workshop on 
`Electronic Commerce. Oakland, California. 1996. 
`[2] Yerra and Ng. A Sentence‐Based Copy Detection 
`Approach for Web Documents. Fuzzy Systems and 
`Knowledge Discovery. 2005. 
`[3] Damashek, M. Gaushing Similarity with n‐Grams: 
`Language –Independent Categorization of Text. Science, 
`267(20), pp. 834‐848. 1995. 
` Madhavi Ganapathiraju. Million Books to Web: 
`technological Challenges and Research Issues. Proc Tamil 
`Internet Conference, pp227‐242. December 2004. 
`[5] Chin‐Yew Lin. ROUGE: A Package for Automatic 
`Evaluation of Summaries. University of Southern 
`California.   
`[6] Chin‐Yew Lin and Eduard Hovy. Automatic Evaluation of 
`Summaries Using N‐gram Co‐Occurrence Statistic. 
`University of Southern California.  
`[7] Salha Mohammed Alzahrani. Plagiarism Auto‐Detection 
`in Arabic Scripts using Statement‐based Fingerprint 
`Matching. University Teknologi Malaysia. 2009.     
`[8] Lyon, C. Barrett, R. Malcolm, J.A. Experiments in 
`plagiarism detection. Technical report 388. School of 
`Computer Science, University of Hertfordshire. 2003. 
`[9] Computer Algorithm for plagiarism detection. Education, 
`IEEE Transaction on, 32(2), 94‐99. 
`[10] Salem, M. Comparison and Fusion of Retrieval Schemes 
`Based on Different Structures, Similarity Measures and 
`Weighting Schemes. University Teknologi Malaysia. 2006. 
`[11] Lin, C‐Y. And E, Hovy. Manual and Automatic Evaluation 
`of Summaries. In Proceeding of the Workshop on 
`Automatic Summarization, post conference workshop of 
`ACL‐2002, pp. 45‐51, Philadelphia, PA. 2002.   
`[12] Cormen, T. R, C.E. Leiserson, and R.L. Rivest. Introduction 
`to Algorihms. The MIT Press. 1989.   
`


`Mr. Chow Kok Kent received his B.Sc. degree in 
`Universiti Teknologi Malaysia, Malaysia in 2009. He is 
`currently purchasing Master degree in Faculty of 
`Computer Science and Information System, Universiti 
`Teknologi Malaysia. His current research interest includes 
`information retrieval, Plagiarism Detection and Soft 
`Computting.   

`

`
`
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket