(12) United States Patent
Cuomo et al.
(10) Patent No.: US 6,185,614 B1
(45) Date of Patent: Feb. 6, 2001
`
`
`(54) METHOD AND SYSTEM FOR COLLECTING
`USER PROFILE INFORMATION OVER THE
`WORLD-WIDE WEB IN THE PRESENCE OF
`DYNAMIC CONTENT USING DOCUMENT
`COMPARATORS
`
(75) Inventors: Gennaro A. Cuomo, Apex; Binh Q. Nguyen, Cary; Sandeep K. Singhal, Raleigh, all of NC (US)

(73) Assignee: International Business Machines Corp., Armonk, NY (US)

(*) Notice: Under 35 U.S.C. 154(b), the term of this patent shall be extended for 0 days.

(21) Appl. No.: 09/084,452
(22) Filed: May 26, 1998
(51) Int. Cl. G06F 15/173; G06F 15/16; G06F 7/00
(52) U.S. Cl. 709/224; 709/203; 707/104
(58) Field of Search 709/203, 224; 707/6, 10, 104, 501, 513, 3, 5
`
(56) References Cited

U.S. PATENT DOCUMENTS

5,649,186 *   7/1997  Ferguson           707/10
5,732,218     3/1998  Bland              709/204
5,740,430 *   4/1998  Rosenberg et al.   707/200
5,745,900 *   4/1998  Burrows            707/102
5,813,007 *   9/1998  Nielsen            707/10
5,890,164 *   3/1999  Nielsen            707/201
5,892,917     4/1999  Myerson            709/204
5,893,908 *   4/1999  Cullen et al.      707/5
5,895,470 *   4/1999  Pirolli et al.     707/102
5,898,836 *   4/1999  Freivald et al.    709/218
[entries illegible in source]
5,978,842 *  11/1999  Noble et al.       709/218
5,983,268 *  11/1999  Freivald et al.    709/218
5,987,480 *  11/1999  Donohue et al.     707/501
5,999,929 *  12/1999  Goodman            707/7
6,012,087 *   1/2000  Freivald et al.    709/218

FOREIGN PATENT DOCUMENTS

9831155   7/1998 (WO).

OTHER PUBLICATIONS

Brin, S., et al., “Copy Detection Mechanisms for Digital
Documents,” Proc. of the 1995 ACM SIGMOD Int’l. Conf.
on Management of Data, ACM, pp. 398-409, May 1995.*
Garcia-Molina, H., et al., “dSCAM: Finding Document
Copies Across Multiple Databases,” Proc. of the 4th Int’l.
Conf. on Parallel and Distributed Information Systems,
IEEE, pp. 68-79, May 1995.*

* cited by examiner
`
Primary Examiner: Ahmad F. Matar
Assistant Examiner: Andrew Caldwell
(74) Attorney, Agent, or Firm: A. Bruce Clay

(57) ABSTRACT
`
Disclosed is a method and system for collecting profile
information about users accessing dynamically generated
content from one or more servers. In a specific embodiment,
a server dynamically generates a web page in response to a
user request. The server customizes the web page content
based on the requested universal resource identifier (URI)
and one or more of: the user’s identity, access permissions,
demographic information, and previous behavior at the site.
The web server then passes the URI, user identity, and
dynamically generated web page content to an access
information collector. The access information collector
generates document comparators from the current web page
content and compares them to document comparators associated
with previously retrieved web pages. If the current web page is
sufficiently similar to some previously retrieved web page,
the access information collector logs the URI, user identity,
and a document key associated with the matching previously
retrieved page. Otherwise, the access information collector
generates a new key; stores the new key and the document
comparators in a database; and logs the URI, user identity,
and the newly generated document key.
`
`27 Claims, 4 Drawing Sheets
`
`
`
`SAMSUNG 1022
`
`
`
`
[Sheet 1 of 4: FIG. 1, pictorial representation of a data processing system]
`
`
`
[Sheet 2 of 4: FIG. 2, block diagram. Clients 200, 201, 202; network 205; Web Servers 210, 211, 212; Static Content Databases 220, 221, 222; Dynamic Content Generators 230, 231, 232; Access Information Collector 240]
`
`
`
[Sheet 3 of 4: FIG. 3, data structures. Log File 300; Retrieved Document Database 310 holding Document Records (Document Key, URI, document text, Document Comparator); Document Comparator Index 320]
`
`
`
`
`
`
`
[Sheet 4 of 4: FIG. 4, flowchart]
400: Receive URI, request time, client identity, and document content
402: Compute Document Comparator
404: Select Candidate Document and Comparator
406: Candidate document found? If no, go to 420
408: Comparators differ by less than threshold? If no, return to 404
410: Retrieve Document Key for Candidate Document
415: Add entry to Log File with Candidate Document's Key, then go to 490
420: Generate new Key for retrieved document
425: Add new entry to Retrieved Document Database
430: Add new entry to Document Comparator Index Database
435: Add entry to Log File with retrieved Document's Key
490: End
`
`
`
`METHOD AND SYSTEM FOR COLLECTING
`USER PROFILE INFORMATION OVER THE
`WORLD-WIDE WEB IN THE PRESENCE OF
`DYNAMIC CONTENT USING DOCUMENT
`COMPARATORS
`
FIELD OF THE INVENTION

This invention relates in general to computer software,
and in particular to a method and system for collecting
profile information about users accessing Web pages from a
plurality of Web servers. More particularly, the present
invention relates to a method and system by which user
profile information can be collected when the Web content
is generated dynamically for each request at the Web server.

BACKGROUND OF THE INVENTION

In the World-Wide Web, a content provider deploys a
plurality of Web servers that deliver Web pages to clients.
When requesting a Web page, the client supplies a Uniform
Resource Locator (URL) or Universal Resource Identifier
(URI) to the server. The server associates this URI with a
particular page of content and delivers that information to
the requesting client.
As the World-Wide Web is being used increasingly to
support commerce and targeted advertising, content providers
desire to collect information about which users are
accessing the site and what site content those users are
accessing. This information can be used to establish “profiles”
for each site visitor and enable tuning of the Web site
content to meet the visitors’ interests. Traditionally, this
visitor information is collected by the Web server or a proxy
server in the form of a log file. This log file contains, among
other things, the requesting host address, the requested URI,
and the time at which the request was received. Because
each URI represents a particular piece of static content at the
Web site, the URI is sufficient for a user profile analyzer to
evaluate which content was received by each user and to
detect similarities among the behavior of different users.
Recent Web servers are providing support for server-side
scripting, whereby the URI is associated with a program or
script that is executed at the Web server. This script is
responsible for receiving the URI and the user identity and
using this information to dynamically generate the content
that should be returned to the requesting user. This generated
content may account for the user’s previous behavior at the
site, his access permissions, his demographic information, or
any number of other factors. Dynamic server content is
supported by most Web servers today, including Microsoft’s
Active Server Pages, Sun’s Dynamic Server Pages, industry-standard
servlets, Common Gateway Interface (CGI)
executables, and other mechanisms.
As a result of this direction, a particular URI can no longer
be associated with particular content at the Web site. On
different requests, the URI may return wholly different
content depending on the requesting user and the context in
which the request was issued. Consequently, existing methods
for capturing user information are insufficient for producing
meaningful user profiles. More specifically, the reliance
on URIs alone prevents the accurate characterization of
which users are exhibiting similar access behavior.
Therefore, a method is needed for efficiently collecting user
access information in the presence of dynamically-generated
content at a Web server, in order to support the accurate
generation of user profiles.

SUMMARY OF THE INVENTION

One object of the present invention is to provide, within
a networked environment, a method of associating each
user’s request for World-Wide Web information to the
content of the retrieved document when that document was
generated dynamically.
Another object of the present invention is to group
together user requests that retrieve the same document
content. Yet another object of the present invention is to
ignore minor variations in document content as might occur
when the documents differ only in the presence of the
requesting user’s name. Still yet another object of the
present invention is to enable the use of a range of metrics
for comparing two documents for similarity.
`To achieve the foregoing objects and in accordance with
`the purpose of the invention as broadly described herein, a
`method and system are disclosed for collecting information
`about user accesses by analyzing the content of retrieved
`documents and associating Document Comparators with
`each document. These and other features, aspects, and
`advantages of the present
`invention will become better
`understood with reference to the following description,
`appended claims, and accompanying drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`For a more complete understanding of the present inven-
`tion and for further advantages thereof, reference is now
`made to the following Detailed Description taken in con-
`junction with the accompanying Drawings, in which:
FIG. 1 is a pictorial representation of a data processing
system which may be utilized to implement a method and
system of the present invention;
`FIG. 2 shows a block diagram of a World-Wide Web
`environment
`in which user access information may be
`generated in accordance with the present invention;
`FIG. 3 shows a sample data structure for representing the
`information collected by the Access Information Collector in
`accordance with the present invention; and
FIG. 4 is a flowchart showing how an Access Information
Collector analyzes a document retrieved from a Web server
and updates its data structures.
`
`DETAILED DESCRIPTION OF THE
`INVENTION
`
Referring to FIG. 1, there is depicted a graphical representation
of a data processing system 8, which may be
utilized to implement the present invention. As may be seen,
data processing system 8 may include a plurality of
networks, such as Local Area Networks (LAN) 10 and 32,
each of which preferably includes a plurality of individual
computers 12 and 30, respectively. Of course, those skilled
in the art will appreciate that a plurality of Intelligent Work
Stations (IWS) coupled to a host processor may be utilized
for each such network. Each said network may also consist
of a plurality of processors coupled via a communications
medium, such as shared memory, shared storage, or an
interconnection network. As is common in such data processing
systems, each individual computer may be coupled
to a storage device 14 and/or a printer/output device 16 and
may be provided with a pointing device such as a mouse 17.
`The data processing system 8 may also include multiple
`mainframe computers, such as mainframe computer 18,
`which may be preferably coupled to LAN 10 by means of
`communications link 22. The mainframe computer 18 may
`also be coupled to a storage device 20 which may serve as
`remote storage for LAN 10. Similarly, LAN 10 may be
`coupled via communications link 24 through a sub-system
`control unit/communications controller 26 and communica-
`tions link 34 to a gateway server 28. The gateway server 28
`is preferably an IWS which serves to link LAN 32 to LAN
`10.
`
With respect to LAN 32 and LAN 10, a plurality of
documents or resource objects may be stored within storage
device 20 and controlled by mainframe computer 18, as
resource manager or library service for the resource objects
thus stored. Of course, those skilled in the art will appreciate
that mainframe computer 18 may be located a great geographic
distance from LAN 10 and similarly, LAN 10 may
be located a substantial distance from LAN 32. For example,
LAN 32 may be located in California while LAN 10 may be
located within North Carolina and mainframe computer 18
may be located in New York.
Software program code which employs the present invention
is typically stored in the memory of a storage device 14
of a stand alone workstation or LAN server from which a
developer may access the code for distribution purposes. The
software program code may be embodied on any of a variety
of known media for use with a data processing system such
as a diskette or CD-ROM or may be distributed to users from
a memory of one computer system over a network of some
type to other computer systems for use by users of such other
systems. Such techniques and methods for embodying software
code on media and/or distributing software code are
well-known and will not be further discussed herein.
`
Referring now to FIG. 2, components of a World-Wide
Web system are shown in which user information may be
gathered in accordance with the present invention. A plurality
of clients (generally indicated by reference numerals
200, 201, and 202) access information over a network 205
using World-Wide Web browsers such as NETSCAPE
NAVIGATOR, a trademark of Netscape, Inc. or
MICROSOFT INTERNET EXPLORER, a trademark of
Microsoft, Inc. These clients access a plurality of Web
servers (generally indicated by reference numerals 210, 211,
and 212) such as LOTUS GO, a trademark of Lotus, Inc.,
MICROSOFT INTERNET INFORMATION SERVICE
(IIS), a trademark of Microsoft, Inc. or NETSCAPE
FASTTRACK, a trademark of Netscape, Inc.
In accessing these Web servers, the clients 200, 201 and
202 specify a URI. Each of these Web servers 210, 211, and
212 accesses a Static Content Database (generally indicated
by reference numerals 220, 221, and 222) and a Dynamic
Content Generator (generally indicated by reference numerals
230, 231, and 232) that receives a URI and other
information about the user and generates Web content suitable
for display by the browsers at the clients 200, 201, and
202. These Dynamic Content Generators 230, 231, and 232
may take many forms, including Active Server Pages,
servlets, Common Gateway Interface (CGI) binaries, or
Dynamic Server Pages.
Upon receiving a URI request from a client, the Web
server 210, 211, or 212 retrieves the content either from the
Static Content Database 220, 221, or 222 or from the
Dynamic Content Generator 230, 231, or 232. An Access
Information Collector 240 receives client requests and content
returned from the Static Content Database 220, 221, or
222 or from the Dynamic Content Generator 230, 231, or
232 and collects log information that can be used to analyze
the access patterns of various users. It should be understood
that the physical location of the components shown in FIG.
2 may vary. In particular, the Access Information Collector
240 may be embedded in the Web servers 210, 211, and 212.
Moreover, the Dynamic Content Generators 230, 231, and
232 and Static Content Databases 220, 221, and 222 may be
co-located with the Web servers 210, 211, and 212.
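
As an illustration only, the arrangement of FIG. 2 might be sketched as follows for the case where the collector is embedded in the Web server. The class names, the `observe` and `handle` methods, and the sample URIs are hypothetical and do not appear in the specification; the sketch shows only that the collector is handed the client identity, the URI, and the content actually returned, whether that content was static or generated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AccessInformationCollector:
    """Hypothetical stand-in for collector 240: notes what each client received."""
    observations: List[Tuple[str, str, str]] = field(default_factory=list)

    def observe(self, client_id: str, uri: str, content: str) -> None:
        self.observations.append((client_id, uri, content))


@dataclass
class WebServer:
    """Hypothetical Web server with an embedded collector."""
    static_content: Dict[str, str]                            # Static Content Database
    dynamic_generators: Dict[str, Callable[[str, str], str]]  # Dynamic Content Generators
    collector: AccessInformationCollector

    def handle(self, client_id: str, uri: str) -> str:
        if uri in self.static_content:
            content = self.static_content[uri]
        else:
            # The generator receives the URI and information about the user.
            content = self.dynamic_generators[uri](uri, client_id)
        # The collector sees the request and the returned content.
        self.collector.observe(client_id, uri, content)
        return content


collector = AccessInformationCollector()
server = WebServer(
    static_content={"/about.html": "<p>About us</p>"},
    dynamic_generators={"/home": lambda uri, client: f"<p>Welcome back, {client}</p>"},
    collector=collector,
)
server.handle("alice", "/about.html")
server.handle("bob", "/home")
```

A collector placed outside the server, for example at a proxy, would receive the same three items; only the call site would move.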
`
`FIG. 3 illustrates the information collected by the Access
`Information Collector in accordance with the present inven-
`tion. A Log File 300 contains a sequence of Access Records.
Each Access Record includes at least a time stamp 301, a
requested URI 313, and a Document Key 312.
`A Retrieved Document Database 310 contains a reposi-
`tory of Document Records corresponding to documents
`retrieved by users. Each Document Record 311 is indexed by
`a Document Key 312 and contains an associated URI 313,
`document text 314, and a Document Comparator 315. The
Document Key 312, when combined with the URI 313,
serves to uniquely identify the Document Record 311. Document
Keys may be assigned sequentially or by any other
appropriate method.
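
By way of example only, the Log File and Retrieved Document Database described above might be represented as below. The Python class and field names are assumptions chosen to mirror the reference numerals of FIG. 3; the `client_id` field is an added assumption (the description lists only the minimum fields of an Access Record), and sequential key assignment is just one of the methods the text allows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AccessRecord:
    """One entry of Log File 300."""
    time_stamp: float    # time stamp 301
    uri: str             # requested URI 313
    document_key: int    # Document Key 312
    client_id: str = ""  # assumption: requesting user identity, if logged


@dataclass
class DocumentRecord:
    """Document Record 311, indexed by its Document Key 312."""
    uri: str         # associated URI 313
    text: str        # document text 314
    comparator: str  # Document Comparator 315 (here simply the text itself)


log_file: List[AccessRecord] = []                    # Log File 300
retrieved_documents: Dict[int, DocumentRecord] = {}  # Retrieved Document Database 310


def next_document_key() -> int:
    """Assign Document Keys sequentially (one permitted method)."""
    return len(retrieved_documents) + 1


key = next_document_key()
retrieved_documents[key] = DocumentRecord("/home", "<p>Hello</p>", "<p>Hello</p>")
log_file.append(AccessRecord(time_stamp=0.0, uri="/home", document_key=key, client_id="ann"))
```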
The Document Comparator 315 is a representation of the
document’s contents and is used by a Document Comparator
Function to determine whether there are substantial predefined
similarities, as will be subsequently described in
greater detail, between the current document and other
previously retrieved documents. The Document Comparator
Function receives the Document Comparators for two documents
and determines whether the two documents are substantially
similar. To make this determination, the Function
may employ a Document Difference Threshold, a numeric
value that indicates how much two documents may differ
before they are no longer deemed to be substantially similar.
The use of the Document Difference Threshold depends on
the particular Document Comparator Function being used.
The use of a Document Difference Threshold allows the
Document Comparator Function to ignore minor differences
between two documents. Such minor differences include
timestamps, client name, or client-specific data.
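
The threshold test described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Document Comparator is taken to be the page text, the difference metric is a character-level edit distance (Levenshtein distance, computed here by ordinary dynamic programming), and the function names and the threshold value of 5 are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def substantially_similar(comparator_a: str, comparator_b: str, threshold: int) -> bool:
    """Assumed Document Comparator Function: difference below the threshold."""
    return edit_distance(comparator_a, comparator_b) < threshold


page_for_ann = "Welcome, Ann! Today's offers: books, music."
page_for_bob = "Welcome, Bob! Today's offers: books, music."
page_other = "Your account settings"

same_page = substantially_similar(page_for_ann, page_for_bob, threshold=5)
different_page = substantially_similar(page_for_ann, page_other, threshold=5)
```

Here the two welcome pages differ only in the client name, so they fall under the threshold and are treated as one document, while the settings page does not.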
In the present embodiment of this invention, the Document
Comparator 315 is the actual content of the document
itself, and the Document Comparator Function for any two
documents is defined to be the number of character
insertions, deletions, or modifications required to convert
one document to the other. This computation is well understood
in the prior art (see, for example, the use of tries, as
described in Chapter 11 of Alan Tharp, File Organization
and Processing, Wiley, 1988) and will not be discussed
further. Alternative embodiments of this invention may
compute a Document Comparator 315 by mapping each
word, paragraph, or section of the document to a binary
token. In this case, the Document Comparator Function
might count the number of matching binary tokens, and the
Document Difference Threshold would designate what percentage
of the tokens must match (see, for example, “Copy
Detection Mechanisms for Digital Documents,” by Sergey
Brin, James Davis, and Hector Garcia-Molina, in Proceedings
of the 1995 SIGMOD International Conference on
Management of Data, pages 398-409, May 1995). Yet
another embodiment of this invention may define a Document
Comparator 315 as a list of the most significant (as
predefined) words or phrases in the document; the Document
Comparator Function may simply count how many
words or phrases occur in both documents, and the Document
Difference Threshold would designate what percentage
of words in each document must appear in the other. Other
comparison methods are well established in the prior art.
The essential element of a Document Comparator 315 is that
a metric (i.e. the Document Comparator Function) must
exist for comparing two different Document Comparators to
determine by how much their respective documents differ.
Indeed, a Document Comparator 315 may actually comprise
multiple Comparators, one per each predefined section of
the document, each having an associated Document Comparator
Function.
Finally, a Document Comparator Index 320 associates
each Document Comparator 315 with the corresponding
Document Key 312. The Index 320 is used to improve the
performance of the Document Comparator 315 evaluations
and the selection of Candidate Documents (see FIG. 4).
However, it is a performance optimization that may be
omitted by alternative embodiments of this invention.
Though the data structures have been illustrated in FIG. 3
with a particular embodiment, alternative representations of
this information are possible. The essential attributes of
these implementations are the association of each Document
Comparator 315 to a Document Key 312, the association of
each user URI 313 retrieval with a particular Document Key
312, and the association of each Document Key 312 with
particular document content. It should be noted that various
optimizations are also possible. For example, instead of
storing each document’s full content, the Retrieved Document
Database 310 may store only a list of most significant
words or phrases.
When a document is accessed from the Web server (with
a particular URI), the Access Information Collector 240
analyzes the retrieved document (using the Document Comparator
Function) to determine whether it is substantially
similar to another document that has been previously
retrieved from that Web server using the same URI. If a
substantially similar document has already been generated
by the Web server, then the user’s access is associated with
that previous document; however, if a substantially similar
document has not been previously generated by the Web
server, then the user’s access is associated with this new
document. In this way, the Access Information Collector 240
distinguishes between different dynamically-generated
documents retrieved using the same URI while also merging
access information about documents that are nearly identical.

Referring now to FIG. 4, a flowchart depicts the steps
taken by the Access Information Collector 240 to analyze a
document retrieved from a Web server and to update the Log
File 300, Retrieved Document Database 310, and Document
Comparator Index 320 (as shown in FIG. 3). At block 400,
the Access Information Collector 240 receives the requested
URI, the time of the request, the identity of the requesting
client, and the content of the retrieved document. At block
402, a Document Comparator 315 is computed for the
retrieved document. At block 404, a Candidate Document
and Candidate Document Comparator are selected from the
Retrieved Document Database 310. The Candidate Document
is a document in the Retrieved Document Database
310 whose URI matches that of the retrieved document. (It
should be understood that alternative embodiments of this
invention may remove the restriction that the URI of the
retrieved document and the URI of the Candidate Document
match. Alternative embodiments of this invention may also
introduce additional restrictions on what constitutes a Candidate
Document.) At decision block 406, it is determined
whether or not a Candidate Document has been found. If the
answer to decision block 406 is yes, then at decision block
408, the Document Comparator Function is invoked with the
Document Comparators of the retrieved document and of the
Candidate Document to determine whether or not the
retrieved document and the Candidate Document are substantially
similar.
Continuing with FIG. 4, if the answer to decision block
408 is yes, then it is determined that the retrieved document
is sufficiently similar to the Candidate Document and no new
entry is required to either the Retrieved Document Database
310 or to the Document Comparator Index 320. At block
410, the Document Key is retrieved for the Candidate
Document. At block 415, a new entry is added to the Log
File, including the time stamp, requested URI, and candidate
document’s Document Key. The process then terminates at
block 490. If the answer to decision block 408 is no, then
control returns to block 404, where another Candidate
Document is selected for evaluation.
If the answer to decision block 406 is no, then it is
determined that the retrieved document is new. At block 420,
a new Document Key is generated for the retrieved document.
At block 425, a new entry is added to the Retrieved
Document Database 310 to associate the retrieved document’s
Document Key with a new Document Record containing
the retrieved URI, retrieved document, and retrieved
document’s Document Comparator. At block 430, a new
entry is added to the Document Comparator Index 320
database to associate the retrieved document’s Document
Comparator with the retrieved document’s Document Key.
At block 435, a new entry is added to the Log File, including
the time stamp, requested URI, and retrieved document’s
Document Key. The process then terminates at block 490.
Thus, each user access is associated with a Document Key
representing a document in the Retrieved Document Database
with a sufficiently close Document Comparator. Each
URI is, therefore, potentially linked with multiple
documents, each having different content. At the same time,
the analysis ignores minor differences between documents,
as might arise when page content is customized in minor
ways to reflect the identity of the requesting user.
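
The flow of FIG. 4 can be sketched in a few lines, purely as an illustration. The sketch assumes one of the alternative comparators rather than the character-difference embodiment: the Document Comparator is the set of lowercase words in the page, and the difference metric is the fraction of words not shared (a Jaccard distance). The names `compute_comparator`, `difference`, `record`, and the threshold of 0.25 are hypothetical; block numbers in the comments refer to the flowchart.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple

Comparator = FrozenSet[str]


def compute_comparator(content: str) -> Comparator:
    """Assumed comparator: the set of lowercase words in the page."""
    return frozenset(content.lower().split())


def difference(a: Comparator, b: Comparator) -> float:
    """Assumed metric: fraction of words not shared by both pages."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


@dataclass
class AccessInformationCollector:
    threshold: float = 0.25                                               # Document Difference Threshold
    documents: Dict[int, Tuple[str, str]] = field(default_factory=dict)   # 310: key -> (URI, text)
    index: Dict[int, Comparator] = field(default_factory=dict)            # 320: key -> comparator
    log: List[Tuple[float, str, str, int]] = field(default_factory=list)  # 300

    def record(self, uri: str, client: str, content: str,
               when: Optional[float] = None) -> int:
        when = time.time() if when is None else when          # block 400
        comparator = compute_comparator(content)              # block 402
        for key, (doc_uri, _text) in self.documents.items():  # blocks 404 and 406
            if doc_uri != uri:
                continue                                      # candidates share the URI
            if difference(comparator, self.index[key]) < self.threshold:  # block 408
                self.log.append((when, client, uri, key))     # blocks 410 and 415
                return key
        key = len(self.documents) + 1                         # block 420
        self.documents[key] = (uri, content)                  # block 425
        self.index[key] = comparator                          # block 430
        self.log.append((when, client, uri, key))             # block 435
        return key


c = AccessInformationCollector()
k1 = c.record("/home", "ann", "welcome ann here are today's offers on books and music", when=1.0)
k2 = c.record("/home", "bob", "welcome bob here are today's offers on books and music", when=2.0)
k3 = c.record("/home", "eve", "your account is suspended please contact support", when=3.0)
```

In this run the pages served to ann and bob differ only in the name, so both accesses are logged under the same Document Key, while the page served to eve receives a new key even though the URI is identical.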
Although the present invention has been described with
respect to a specific preferred embodiment thereof, various
changes and modifications may be suggested to one skilled
in the art and it is intended that the present invention
encompass such changes and modifications as fall within the
scope of the appended claims.
What we claim is:
1. A method of collecting information about document
retrievals over the World-Wide Web, comprising the steps
of:
receiving a requesting user identity, requested Universal
Resource Identifier (URI), and a content of a retrieved
document;
selecting a Candidate Document from a Retrieved Document
Database, said Candidate Document associated
with a Candidate Document Key;
comparing said retrieved document to said Candidate
Document to determine a sufficiency of said Candidate
Document;
associating said retrieved document with a newly generated
Retrieved Document Key if said Candidate Document
is not deemed to be sufficient;
adding said retrieved document to said Received Document
Database; and
adding a Log File Entry including said requesting user
identity, said requested URI, and said Retrieved Document
Key.
2. The method of claim 1, wherein each of a plurality of
documents in said Retrieved Document Database is associated
with a Document Comparator and wherein a first
Document Comparator may be compared to a second Document
Comparator using a Document Comparator Function.
`3. The method of claim 2, wherein said step of comparing
`to determine a sufficiency of said Candidate Document
`further comprises the steps of:
`computing said first Document Comparator for said
`retrieved document;
`retrieving said second Document Comparator for said
`Candidate Document;
`computing with said Document Comparator Function a
`numeric measure of a difference between said first
`Document Comparator and said second Document
`Comparator; and
`comparing said numeric measure against a predefined
`Document Difference Threshold.
`4. The method of claim 2, wherein each said Document
`Comparator comprises content of said each of a plurality of
`documents associated therewith.
`
5. The method of claim 4, wherein a URI for said
Candidate Document is equal to a URI for said retrieved
document.
`
`6. The method of claim 2, wherein each said Document
`Comparator is computed by associating predefined portions
`of said each of a plurality of documents to a binary token.
`7. The method of claim 2, wherein each said Document
`Comparator comprises a list of significant words or phrases
`in said each of a plurality of documents.
`8. The method of claim 2, wherein each said Document
`Comparator comprises a Comparator for each of a plurality
`of predefined sections of said each of a plurality of docu-
`ments.
`
`9. The method of claim 2, wherein said step of selecting
`a Candidate Document comprises selecting from a Docu-
`ment Comparator Database.
`
`10. A system for collecting information about document
`retrievals over the World-Wide Web, comprising:
`means for receiving a requesting user identity, requested
`Universal Resource Identifier (URI), and a content of a
`retrieved document;
`means for selecting a Candidate Document from a
`Retrieved Document Database, said Candidate Docu-
`ment associated with a Candidate Document Key;
means for comparing said retrieved document to said
Candidate Document to determine a sufficiency of said
Candidate Document;
means for associating said retrieved document with a
newly generated Retrieved Document Key if said Candidate
Document is not deemed to be sufficient;
means for adding said retrieved document to said
Received Document Database; and
means for adding a Log File Entry including said requesting
user identity, said requested URI, and said
Retrieved Document Key.
11. The system of claim 10, wherein each of a plurality of
documents in said Retrieved Document Database is associated
with a Document Comparator and wherein a first
Document Comparator may be compared to a second Document
Comparator using a Document Comparator Function.
`12. The system of claim 11, wherein said means for
`comparing to determine a sufficiency of said Candidate
`Document further comprises:
`means for computing said first Document Comparator for
`said retrieved document;
means for retrieving said second Document Comparator
for said Candidate Document;
means for computing with said Document Comparator
Function a numeric measure of a difference between
said first Document Comparator and said second Document
Comparator; and
`means for comparing said numeric measure against a
`predefined Document Difference Threshold.
`13. The system of claim 11, wherein each said Document
`Comparator comprises content of said each of a plurality of
`documents associated therewith.
`
`14. The system of claim 13, wherein a URI for said
`Candidate Document is equal to a URI for said retrieved
`document.
`
`15. The system of claim 11, wherein each said Document
`Comparator is computed by associating predefined portions
`of said each of a plurality of documents to a binary token.
`16. The system of claim 11, wherein each said Document
`Comparator comprises a list of significant words or phrases
`in said each of a plurality of documents.
`17. The system of claim 11, wherein each said Document
`Comparator comprises a Comparator for each of a plurality
`of predefined sections of said each of a plurality of docu-
`ments.
`
`18. The system of claim 11, wherein said means for
`selecting a Candidate Document comprises selecting from a
`Document Comparator Database.
`19. A computer program product recorded on computer
`readable medium for collecting information about document
`retrievals over the World-Wide Web, comprising:
`computer readable means for receiving a requesting user
`identity, requested Universal Resource Identifier (URI),
`and a content of a retrieved document;
`
computer readable means for selecting a Candidate Document
from a Retrieved Document Database, said Candidate
Document associated with a Candidate Document Key;
computer readable means for comparing said retrieved
document to said Candidate Document to determine a
sufficiency of said Candidate Document;
computer readable means for associating said retrieved
document with a newly generated Retrieved Document
Key if said Candidate Document is not deemed to be
sufficient;
computer readable means for adding said retrieved document
to said Received Document Database; and
`computer readable means for adding a Log File Entry
`including said requesting user identity, said requested
`URI, and said Retrieved Document Key.
`20. The program product of claim 19, wherein each of a
`plurality of documents in said Retrieved Document Data-
`base is associated with a Document Comparator and wherein
`a first Document Comparator may be compared to a second
`Document Comparator according to a predefined distance
`metric.
`
`21. The program product of claim 20, wherein said
`computer readable means for comparing to determine a
`sufficiency of said Candidate Document further comprises:
`computer readable means for computing said first Docu-
`ment Comparator for said retrieved document;
computer readable means for retrieving said second
Document Comparator for said Candidate Document;
computer readable means for computing with said Document
Comparator Function a numeric measure of a
difference between said first Document Comparator
and said second Document Comparator; and
computer readable means for comparing said numeric
measure against a predefined Document Difference
Threshold.
`
`22. The program product of claim 20, wherein each said
`Document Comparator comprises content of said each of a
`plurality of documents associated therewith