IPR2022-00391, No. 1057-49 Exhibit - EX1057 Kent 2010 (P.T.A.B. Jan. 10, 2022)

JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`53
`
`Features Based Text Similarity Detection
`
`Chow Kok Kent, Naomie Salim
`
`Faculty of Computer Science and Information Systems, University Teknologi Malaysia, 81310 Skudai,
`Johor, Malaysia
`
`Abstract— As the Internet help us cross cultural border by providing different information, plagiarism issue is bound to arise. As a result,
`plagiarism detection becomes more demanding in overcoming this issue. Different plagiarism detection tools have been developed based on
`various detection techniques. Nowadays, fingerprint matching technique plays an important role in those detection tools. However, in handling
`some large content articles, there are some weaknesses in fingerprint matching technique especially in space and time consumption issue. In
`this paper, we propose a new approach to detect plagiarism which integrates the use of fingerprint matching technique with four key features
`to assist in the detection process. These proposed features are capable to choose the main point or key sentence in the articles to be
`compared. Those selected sentence will be undergo the fingerprint matching process in order to detect the similarity between the sentences.
`Hence, time and space usage for the comparison process is reduced without affecting the effectiveness of the plagiarism detection.
`
`—————————— (cid:139) ——————————
`
`1 INTRODUCTION
`
`In plagiarism detection, the content of a suspected
`document may be represented as a collection of terms,
`words, stems, phrases, or other units derived or inferred
`from the text of the document. Different techniques will
`lead to vary efficiency and effectiveness in plagiarism
`detection based on different document descriptors. Before
`a document is taken to be compared, it is necessary for us
`to choose the most appropriate representation techniques
`to retrieve the main points of the document. When the
`documents contain primarily unrestricted text such as
`newspaper articles, legal documents and so on, the
`relevance of a document is established through ʹfull‐textʹ
`retrieval. This has been usually accomplished by
`identifying key terms in the documents. There are a few
`techniques that have been developed or adapted for
`plagiarism detection in natural language documents. The
`most common technique used nowadays is the
`Fingerprint Matching technique [1][2]that consists of the
`process of scanning and examining the fingerprint of two
`documents in order to detect plagiarism.
`
`2 FINGERPRINT MATCHING TECHNIQUE
`Fingerprinting techniques mostly rely on the use of K‐
`grams (Manuel et al. 2006) because the process of
`fingerprinting divides the document into grams of certain
`length k. Then, the fingerprints of two documents can be
`
`compared in order to detect plagiarism. It has been
`observed
`through
`the
`literature
`that
`fingerprints
`matching approach differs based on what representation
`or comparison unit (i.e.grams) is used.
`
`Fig.1 Fingerprint Matching Technique
`
`Character-based Fingerprint Matching
`2.1
`The conventional fingerprinting technique uses sequence
`of characters to form the fingerprint for the whole
`document. During 1996, Heintze divides fingerprinting
`techniques into two types which are full and selective. In
`full fingerprinting, document fingerprint consists of the
`set of all possible substrings of length K. For example, if
`we have a document of length |D| = 5 consisting only
`one statement that has only one word “touch”, then we
`can see that “touc” and “ouch” are the all possible
`substrings of length K = 4. In general, there are |D| – k +
`1 substrings or k-grams, where |D| is the length of the
`document. Basically, comparing two documents under
`
`EX1057
`Roku V. Media Chain
`U.S. Patent No. 10,489,560
`
`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`54
`
`for larger values of K. Therefore, we can conclude that K
`= 4 is an optimal or near optimal choice. Here is an
`explanation of how this 3-least frequent 4- grams works.
`A 4-gram of a string is a set of all possible 4-character
`substrings. For example, let take a string S = “English
`Word”, then the possible set of 4-grams include ”engl,
`ngli, glis, lish, ishw, shwo, hwor, word” with ignoring
`spaces.
`Secondly, three least-frequent 4-grams are the best option
`to represent thesentence uniquely. To illustrate the three
`least-frequent 4-gram construction process, consider the
`following sentence S “soccer game is fantastic”. The 4-
`grams are socc, occe, ccer, cerg, etc. In this method, instead
`of comparing all possible 4-grams, only three 4-grams
`which have the least frequency over all 4-grams will be
`chosen. Gauge the frequency or the weight of each n-
`gram was stated by as follows [3].
`
`Let the document contain J distinct n-grams, with mi
`occurrences of n-gram number
`i.
`Then the weight assigned to the ith n-gram will
`
` Where
`
`this technique is counting the number of substrings that
`are common in both fingerprints [1].
`
`Hence, if we compare a document A of size |A| against a
`document B, and if N is the number of substrings
`common in both, then the resemblance measure R of how
`much of A is contained in B can be computed as follows:
`
` where 0 <= R <= 1
`
`It is important to choose the right value of k to provide
`good discrimination among documents. If the value of k
`is chosen appropriately, then full fingerprinting gives
`reliable exact match results. The value given by Heintze
`(1996) was effectively 30-45 characters. Although full
`fingerprinting can be considered as a time and space
`consuming burden, it is a very useful measure for
`document copy detection.
`
`Phrase-based Fingerprint Matching
`2.2
`In 2001, Lyon et al. generates fingerprint using phrase‐
`mechanism to measure the resemblance between two
`documents. During the early stage, we have to convert
`each document to a set of trigrams (three words). Hence,
`a sentence such as “Web Based Cross Language Plagiarism
`Detection” will be converted to the set trigrams {“Web
`Based Cross”, “Based Cross Language”, “Cross Language
`Plagiarism”, “Language Plagiarism Detection”}. Then, the set
`of trigrams for each document is compared with all other
`using the matching algorithm. Finally, the measure of the
`resemblance for each pair of documents is calculated as
`follows:
`
` where   0 <= R <= 1
`
`Thirdly, the three least-frequent 4-grams are concatenated
`to represent the fingerprint of a sentence from the query
`document to be compared with the three least-frequent 4-
`gram representations of sentences
`in
`the corpora
`documents. Thus, in our example, if the three least-
`frequent 4-grams from Snew are occe, ccer, cerg will be
`concatenated
`to
`form
`the
`fingerprint F of Snew
`“occeccercerg”. Finally, two sentences are treated the same
`where S(A) and S(B) are the set of all trigrams in
`if their corresponding three
`least frequent 4-gram
`documents A and B, respectively. When the value of R is
`representations are the same. A measure of resemblance
`approaching 1, it means that the document is identical to
`for each pair of documents is computed as follows:
`the respective pair of document. Phrase‐based fingerprint
`matching will works better and faster than character‐
`based fingerprinting since it deals with words rather than
`letters.
`
`0 <= R <=1
`
`Statement-based Fingerprint Matching
`2.3
`The pros and cons of character-based and phrase-based
`fingerprinting have led Yerra and Ng (2005) to represent
`the fingerprints of each statement (and thence the whole
`document) by three least-frequent 4-grams. Although any
`value of K can be considered, yet K = 4 was stated as an
`ideal choice by Yerra and Ng (2005). This is because
`smaller values of K (i.e., K = 1, 2, or 3), do not provide
`good discrimination between sentences. On the other
`hand, the larger the values of K (i.e., K = 5, 6, 7...etc), the
`better discrimination of words in one sentence from
`words in another. However each K-gram requires K bytes
`of storage and hence space-consuming becomes too large
`
`where F(A) and F(B) are the common fingerprints in
`documents A and B, respectively.
`The main advantages of the statement-based fingerprint
`method are the time and space consumption issue. The
`processing time and space usage for this method are
`much better than the character-based and phrase-based
`techniques. However, in the case of rewording and
`restructuring of statements, all fingerprinting techniques
`will fail in detecting that kind of plagiarism.
`
`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`55
`
`3 FEATURES BASED TEXT SIMILARITY
`DETECTION
`In previous section, we discuss the implementation of the
`fingerprint matching
`technique
`in
`the plagiarism
`detection tools. It cannot be denied that those matching
`techniques play an essential role
`in detecting the
`plagiarism documents. However, there are still some
`weaknesses in these techniques that affect the efficiency
`of the detection process. For instance, time and space
`consumption issue. As we know, both of the plagiarized
`ad reference documents are divided into n-grams before
`the comparison process. The process time of splitting the
`whole documents into n-grams and the space needed to
`store those n-grams will be a burden. In our proposed
`research, we will integrate the fingerprint matching
`technique with our proposed
`features based
`text
`similarity detection method. Rather than split the whole
`documents into n-gram, we will only choose the main
`ideas or important contents of the documents. Several
`features are considered to extract the main idea from the
`documents and integrated with the fingerprint matching
`technique for the comparison technique. All those
`proposed features are important elements to detect the
`plagiarized documents.
`Below are four features that used by our proposed
`research to assist the plagiarism detection.
`
`3.1 Top Keyword Feature
`In order to detect translation plagiarisms, it is essential to
`translate the plagiarized Malay documents into English
`before used as the query documents for further detection
`process. After the plagiarized documents have been
`translated into English, it will improve the effectiveness of
`the detection process as the source documents are also in
`English. We use Google Translate API which is a well‐
`known translation tool developed by the Google and is
`freely distributed. With this API, the language blocks of
`text can be easily detected and translated to other
`preferred languages. The API is designed to be a simple
`and easy to detect or translate languages when offline
`translation is not available.
`
`3.2 First Sentence Similarity
`In general, the first sentence of a document contains the
`important points of the overall document. The first
`sentence can act as a general extraction of the documents
`especially in news article, first sentence in the article is a
`very important sentence which can simply represent the
`overall contents of the article [4]. For this feature, we use
`the first sentence of the reference document to be
`compared with the plagiarized documents. Fingerprint
`matching technique is used to represent the sentence by
`dividing the first sentence and the plagiarized text into n‐
`grams for comparison. The score for the first sentence
`similarity feature is computed by,
`
`
`
`         0 <= R <= 1
`
`
`
`
`Where S(A) is the set of all n‐grams in the first sentence of
`the reference document and S(B) is the set of all n‐grams
`in the plagiarized document. . When the value of R is
`approaching 1, it means that the document is identical to
`the respective pair of document.
`
`3.3 Query Phrase
`In some condition, by taking the first sentence of an
`article to be the extraction or main ideas of the overall
`article, the effectiveness of the plagiarism detection will
`decreased. The first sentence similarity features
`is
`efficiently applied to most of the news articles, but not all
`types of the documents. Hence, we propose the query
`phrase feature that can be applied to all types of the
`documents. Generally,
`the query phase
`features
`implement the same flow of process with the first
`sentence similarity. The only difference is the sentence to
`be chosen as the extraction or main point of the
`document. Rather than taking the first sentence as the
`extraction, we choose the sentences which are considered
`important that appear after the query phrase. In this
`feature, we determine a few of the query phrases as
`below.
`
`In conclusion,
`In general,
`We conclude that...
`We find that...
`The survey shows that...
`The experiment shows that...
`
`Based on this feature, the sentences that come after these
`query phrases are taken to be the extraction of the overall
`articles. These sentences can represent the overall ideas of
`the whole article effectively. For examples,
`
`S1: We conclude that the main cause of the social ills is the
`family problem.
`S2: In conclusion, it cannot be denied that teachers play an
`important role in producing a group of good quality leader for
`the future.
`
`After determine the sentence, fingerprint matching
`technique is used to represent the sentence by dividing
`the chosen sentence and the plagiarized text into n‐grams
`for comparison. The score for the first sentence similarity
`feature is computed by,
`
`
`
`Where S(A) is the set of all n‐grams in the chosen
`sentence of the reference document and S(B) is the set of
`all n‐grams in the plagiarized document. . When the value
`of R is approaching 1, it means that the document is
`identical to the respective pair of document.
`
`         0 <= R <= 1
`
`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`56
`
`
`Top Keyword Issue
`In general, after removal of stop words and stemming
`process, the word with higher frequency in a document
`can be considered as the key word that represent the
`overall ideas of the articles. However, some of the highly
`frequency words in the articles do not necessary provide
`the overall ideas of the document. This can be due to the
`author’s style of writing who frequently writes the words
`that does not bring any significant meaning to the articles.
`In this situation, those words will be considered as top
`key words despite it does not provide any important
`meaning regards to the articles. The effectiveness of the
`plagiarism detection will affected indirectly due to this
`issue.
`
`First Sentence Similarity Issue
`As discussed in the previous section, the first sentence
`similarity issue is only effectively applied to particular
`types of articles especially the news articles. First sentence
`in most of the news articles contains the general ideas of
`the overall news. However, articles such as journal
`papers, thesis reports, and conference papers have the
`unique way to identify their most important sentence.
`First sentence of those types of articles do not provide the
`right idea on what the whole report is discuss about.
`Hence, by selecting the first sentence to be compared, the
`result of the detection cannot achieve its optimum level.
`
`
`Query Phrase
`Query Phrase feature is an effective solution to overcome
`the weaknesses in the first sentence similarity feature. It
`provides a different way to capture the important
`sentence in an article and take those sentences to be
`compared using the fingerprint matching technique. In
`our proposed research, we determine 6 major query
`phrases to be used in the method. However, there are still
`a lot of phrase can be considered as query phrase which
`can provide a more effective plagiarism detection process.
`Future works should be done to increase the number of
`suitable query phrase that used in out proposed method.
`
` 5
`
` CONCLUSION
`In this paper, we presented our method in text similarity
`detection. Feature based text similarity detection is a new
`approach to detect plagiarism. This approach integrates
`the use of fingerprint matching technique with several
`key features to detect the similarity level between
`sentences. With further experiment, we feel that our
`approach can detect the similarity between sentences
`with high efficiency and effectiveness.
`
`3.4 Longest Common Subsequence (LCS)
`Longest Common Subsequence is one of the techniques
`used in ROUGE which is a well-known summary
`evaluation method. Given two sequences X and Y, the
`longest common subsequence (LCS) for X and Y is the
`common subsequence with maximum length [5]. To
`apply LCS
`in plagiarism detection, we view
`the
`documents as a sequence of words. The longer LCS of
`two documents sentences is, the more similar the two
`documents are. We use LCS-based F-measure to estimate
`the similarity between two documents X of length m and
`Y of length n, assuming X is a reference document
`sentences and Y is a plagiarize document.
`
`
`Rlcs =
`
`
`Plcs =
`
`
`Flcs =
`
`
`
`
`
`
`
`
`
`Where LCS(X,Y) is the length of a longest common
` = Plcs /Rlcs. We notice that
`subsequence of X and Y, and
`the score for the LCS is 1 when X=Y while score is zero
`when LCS(X,Y) =0 where there is nothing common in
`both between X and Y.
`The main advantage of LCS is that it does not require
`consecutive matches but in-sequence matches that reflect
`sentence
`level word order as n-grams. The other
`advantage is that it automatically includes the longest in-
`sequence common n-grams, therefore no predefined n-
`gram length is necessary [5].
`For example,
`
`S1: Player kicked the ball.
`S2: Player kick the ball.
`S3: The ball kick player.
`
`S1 is the reference sentence and S2 and S3 is two different
`plagiarized sentences respectively. If we implement the
`fingerprint matching technique with 2-gram, the score for
`S2 and S3 is the same since both have the same bigrams
`“the ball”. However, S2 and S3 have significant difference
`in sentence meaning. When using the Longest Common
`Subsequence (LCS), S2 has a score of 3/4=0.75 and S3 has
`. Hence, S2 has a higher
`a score of 2/4=0.5, with
`similarity score than S3.
`
`4 DISCUSSION
`Feature based text similarity detection is a method to
`detect the plagiarism by integrating the use of fingerprint
`matching technique with several key features to assist the
`detection process. However, several aspects still need to
`be improved in order to increase the effectiveness and
`efficiency of the detection.
`
`
`
`

`JOURNAL OF COMPUTING, VOLUME 2, ISSUE 1, JANUARY 2010, ISSN 2151-9617
`HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
`
`57
`
`Dr. Naomie Salim is an Assoc.Prof. presently working as
`a Deputy Dean of Postgraduate Studies in the Faculty of
`Computer Science and Information System in Universiti
`Teknologi Malaysia. She received her degree in Computer
`Science from Universiti Teknologi Malaysia in 1989. She
`received her Master degree from University of Illinois and
`Ph.D Degree from University of Sheffield in 1992 and
`2002 respectively. Her current research interest includes
`Information Retrieval, Distributed Database
`and
`Chemoinformatic.
`
`
`
`
`
` .
`
`
`
`
`6 ACKNOWLEDGEMENT
`This project is sponsored partly by the Ministry of
`Science, Technology and Innovation under the E‐Science
`grant 01‐01‐06‐SF‐0502.
`
`.
`
`[4]
`
`7 REFERENCES
`[1] HEINTZE, Nevin. Scalable document fingerprinting.
`Proceedings of the Second USENIX Workshop on
`Electronic Commerce. Oakland, California. 1996.
`[2] Yerra and Ng. A Sentence‐Based Copy Detection
`Approach for Web Documents. Fuzzy Systems and
`Knowledge Discovery. 2005.
`[3] Damashek, M. Gaushing Similarity with n‐Grams:
`Language –Independent Categorization of Text. Science,
`267(20), pp. 834‐848. 1995.
` Madhavi Ganapathiraju. Million Books to Web:
`technological Challenges and Research Issues. Proc Tamil
`Internet Conference, pp227‐242. December 2004.
`[5] Chin‐Yew Lin. ROUGE: A Package for Automatic
`Evaluation of Summaries. University of Southern
`California.
`[6] Chin‐Yew Lin and Eduard Hovy. Automatic Evaluation of
`Summaries Using N‐gram Co‐Occurrence Statistic.
`University of Southern California.
`[7] Salha Mohammed Alzahrani. Plagiarism Auto‐Detection
`in Arabic Scripts using Statement‐based Fingerprint
`Matching. University Teknologi Malaysia. 2009.
`[8] Lyon, C. Barrett, R. Malcolm, J.A. Experiments in
`plagiarism detection. Technical report 388. School of
`Computer Science, University of Hertfordshire. 2003.
`[9] Computer Algorithm for plagiarism detection. Education,
`IEEE Transaction on, 32(2), 94‐99.
`[10] Salem, M. Comparison and Fusion of Retrieval Schemes
`Based on Different Structures, Similarity Measures and
`Weighting Schemes. University Teknologi Malaysia. 2006.
`[11] Lin, C‐Y. And E, Hovy. Manual and Automatic Evaluation
`of Summaries. In Proceeding of the Workshop on
`Automatic Summarization, post conference workshop of
`ACL‐2002, pp. 45‐51, Philadelphia, PA. 2002.
`[12] Cormen, T. R, C.E. Leiserson, and R.L. Rivest. Introduction
`to Algorihms. The MIT Press. 1989.
`
`
`
`Mr. Chow Kok Kent received his B.Sc. degree in
`Universiti Teknologi Malaysia, Malaysia in 2009. He is
`currently purchasing Master degree in Faculty of
`Computer Science and Information System, Universiti
`Teknologi Malaysia. His current research interest includes
`information retrieval, Plagiarism Detection and Soft
`Computting.
`
`
`
`
`
`
`

This document is available on Docket Alarm but you must sign up to view it.

Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

Up-to-date information for this case.
Email alerts whenever there is an update.
Full text search for other cases.
Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.

Access Government Site

We are redirecting you
to a mobile optimized page.

Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket

Supplemental Search

Search for PTAB Motions

PTAB Analytics

TTAB Analytics

Basic Search

Filters

Party Search

Advanced

Selected Courts

Recently Selected Courts

Find PTAB Decisions

PTAB Analytics

Special PTAB Alerts

Orange Book

Directly Search Federal Courts

Search Trademark ...

This document is available on Docket Alarm but you must sign up to view it.

Accessing this document will incur an additional charge of $.

Still Working On It

A few More Minutes ... Still Working

This document could not be displayed.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

One Moment Please

Your document is on its way!

Sealed Document

We are redirecting youto a mobile optimized page.

Document Unreadable or Corrupt

We are unable to display this document.

STEP 2 of 2

Choose your membership type

Flat-Fee

Pay-As-You-Go

Add your payment information

Login or Join

Enter your corporate Email

Thousands of your peers are saving time and gaining a competitive advantage with Docket Alarm.

Join Docket Alarm to perform smarter legal research.

Download this document and millions of others instantly with a Docket Alarm membership.

Join Docket Alarm and start performing smarter legal research.

Start tracking this docket instantly with a Docket Alarm membership.

Join thousands of your peers and start performing smarter legal research.

STEP 1 of 2

Millions of Documents | 15 Seconds to Signup

Hi !

Welcome to Docket Alarm

Welcome to Docket Alarm!

Explore Litigation Insights andManage Your Cases

Reset Password

What is PACER?

Why do I need it?

What will I be charged?

Do other courts have fees?

Basic Free Access

Welcome

Thank you

Check Firm Account

We are redirecting you
to a mobile optimized page.

Explore Litigation Insights and
Manage Your Cases