`Chakrabarti et al.
`
US006418433B1
(10) Patent No.: US 6,418,433 B1
(45) Date of Patent: Jul. 9, 2002
`
(54) SYSTEM AND METHOD FOR FOCUSSED WEB CRAWLING
`
`(75) Inventors: Soumen Chakrabarti, San Jose; Byron
`Edward Dom, Los Gatos; Martin
`Henk van den Berg, Palo Alto, all of
`CA (US)
`(73) Assignee: International Business Machines
`Corporation, Armonk, NY (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
`(21) Appl. No.: 09/239,921
(22) Filed: Jan. 28, 1999
(51) Int. Cl. G06F 17/30
(52) U.S. Cl. 707/5; 707/513; 709/218
(58) Field of Search 707/1-6, 10, 104, 707/501, 513; 709/200, 203, 217-219
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`5,369,577 A 11/1994 Kadashevich et al. .. 364/419.13
`5.530,852 A 6/1996 Meske, Jr. et al. .......... 395/600
`5,708,829 A
`1/1998 Kadashevich et al. ...... 395/793
`5,717,912 A 2/1998 Millett et al. ............... 395/603
`5,784,608 A 7/1998 Meske, Jr. et al........... 395/602
`5,787,417 A
`7/1998 Hargrove ....................... 707/4
`5,796,952 A * 8/1998 Davis et al. ......
`... 709/234
`5,832,494. A * 11/1998 Egger et al. ................ 707/102
OTHER PUBLICATIONS

"Fab: Content-Based, Collaborative Recommendation," Balabanovic and Shoham, Communications of the ACM, Mar. 1997, vol. 40, no. 3, pp. 66-72.*

"Annotated Reference List Agents," David C. Blight, Proceedings of the 1997 IEEE Conference on Communications, Power and Computing, May 22-23, 1997, pp. 7-12.*

"Silk from a Sow's Ear: Extracting Usable Structures from the Web," Pirolli et al., Xerox PARC, Proceedings of the 1996 Conference on Human Factors and Computing Systems, Canada, Apr. 13, 1996, pp. 118-125. Available at and downloaded from http://www.acm.*

"Enhanced Hypertext Categorization Using Hyperlinks," Chakrabarti et al., Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, USA, Jun. 1, 1998, pp. 307-318.*

"Automated Reference List Agents," David C. Blight, TRLabs, Winnipeg, Canada, Proceedings of the 1997 IEEE Conference on Communications, Power and Computing, May 22-23, 1997, 97CH36117, pages 7-12.*
`
`* cited by examiner
`
Primary Examiner: Hosain T. Alam
(74) Attorney, Agent, or Firm: John L. Rogitz
`(57)
`ABSTRACT
`A focussed Web crawler learns to recognize Web pages that
`are relevant to the interest of one or more users, from a Set
`of examples provided by the users. It then explores the Web
`Starting from the example Set, using the Statistics collected
`from the examples and other analysis on the link graph of the
`growing crawl database, to guide itself towards relevant,
`valuable resources and away from irrelevant and/or low
`quality material on the Web. Thereby, the Web crawler
`builds a comprehensive topic-specific library for the benefit
`of Specific users.
`
`32 Claims, 5 Drawing Sheets
`
(Front page figure: OVERALL FLOW chart of the present logic, corresponding to FIG. 2.)
(Sheet 1 of 5: FIG. 1, block diagram of the present system 10, showing the administrative interface, quality rater, taxonomy tables, example tables, crawl tables, category model, hypertext classifier and trainer, crawler priority controller (watchdog), and crawler worker threads, together with the field layouts of the crawl table entry 32 and the (link) edge table 34.)
`
`
`
(Sheet 2 of 5: FIG. 2, the OVERALL FLOW chart of the present logic, blocks 100 through 120.)
`
`
`
(Sheet 3 of 5: FIG. 3, flow chart of the watchdog module logic.)
`
`
`
(Sheet 4 of 5: FIG. 4, flow chart of the worker module logic.)
`
`
`
(Sheet 5 of 5: FIG. 5, flow chart of the worker module classification logic, and FIG. 6, flow chart of the logic for relaxing match criteria.)
`
`
`
`1
`SYSTEMAND METHOD FOR FOCUSSED
`WEB CRAWLING
`
`US 6,418,433 B1
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
The present invention relates generally to searching wide area computer networks for information, and more particularly to searching the World Wide Web for topical information.
`2. Description of the Related Art
The system known as the "World Wide Web", or simply "Web", as implemented by the wide area computer network known as the Internet, contains a vast amount of information in the form of Web pages. Each Web page is electronically stored in a respective Web site on a computer, referred to as a Web server, with the Web itself including many Web servers that are interconnected by means of the Internet. A person can connect a computer to the Internet via, e.g., a telephone line, and thereby electronically access the Web pages on the Web servers.
As the Web has grown, many millions of Web pages have been created. In other words, the Web contains a vast amount of information, and the content of the Web grows and changes minute by minute. It will accordingly be appreciated that some means must be provided for a person to sort through the vast quantities of Web pages to find a particular item of interest.
With the above consideration in mind, most users employ software known as Web browsers when accessing the Web. To search the Web for a particular topic of information, the user causes their Web browser to access a Web site of a centralized search engine that is maintained by a search company. Examples of currently popular search engines are Alta Vista(TM) and Hotbot(TM).
Centralized search engines use software referred to as "crawlers" to continuously access Web pages and construct a centralized keyword index. When a person wishes to retrieve information, the person's browser accesses a centralized search engine using a query, for example, "luxury cars". In response, software at the centralized engine accesses its index to retrieve names of Web sites considered by the search engine to be appropriate sources for the sought-after information. The search engine transmits to the browser hyperlinks to the retrieved sites, along with brief summaries of each site, with the browser presenting the information to the user. The user can then select the site or sites they want by causing the browser to access the site or sites.
Owing to the burgeoning of the Web and the ever-growing amount of its information, and the fact that the above-described centralized crawler schemes posture themselves to respond to any possible query (i.e., to be all things to all people), centralized crawler/searchers require large investments in hardware and software and must never cease crawling the Web, to index new pages and to periodically revisit old pages that might have changed. Indeed, one Web search company currently requires the use of 16 of the most powerful computers made by a major computer manufacturer, each computer having 8 gigabytes of memory. Another search company currently uses a cluster of 300 powerful workstations and over one terabyte of memory to crawl over 10 million Web pages per day. Despite these heroic efforts, however, it is estimated that a single search company is able to index only 30%-40% of the Web, owing to the size of the Web which, incidentally, shows no signs of
`15
`
`25
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
slowing its rate of expansion (currently at about one million new pages per day).
Accordingly, one problem with current technology that is recognized and addressed by the present invention is the need to reduce the vast amount of Web search hardware and software that is inherently required by a centralized search scheme.
Additionally, evaluating whether a particular Web page contains relevant information with respect to a user query is sometimes difficult. Moreover, user queries may not be effectively articulated, or they may be overbroad. Consequently, a Web search engine frequently responds to a query by returning a large number of Web pages that are of little or no interest to the requester. Nonetheless, a user must laboriously sort through hundreds and perhaps thousands of returned Web pages, which, as discussed above, can be considered to represent only 30%-40% of the total Web content in any case. Moreover, because a centralized crawler seeks the capability to respond to any query, most of the index of any single centralized system contains information that is of little or no value to any single user or indeed to any single interrelated group of users.
Thus, two other problems recognized and addressed by the present invention are the lack of focus of search results, and the fact that centralized crawlers are not tailored to any particular user or to any particular interrelated group of users and, thus, contain mostly irrelevant information, from the point of view of a single user or group of users.
In addition to the above considerations, the present invention recognizes that many if not most Web pages refer to other Web pages by means of hyperlinks, which a user can select to move from a referring Web page to a referred-to Web page. The present invention further recognizes that such hyperlinks are more than simply navigation tools; they are important tools for relevant data acquisition as well.
It happens that with the existing Web communication protocol (hypertext transfer protocol, or "http"), when a user clicks on a hyperlink to a referred-to Web page v from a referring Web page u, the user's browser sends the identity of the referring Web page u to the Web server that hosts the referred-to Web page v, and this information can be recorded or logged. Unfortunately, current logs of which Web pages refer to which other Web pages are mostly unused and indeed mostly not enabled by Web site managers, and the logs moreover consume a relatively large amount of electronic data storage space. Also, no standard way exists for a remote user to access and use the information in the logs. The present invention, however, recognizes the above-noted problem and addresses how to exploit this currently unused but potentially valuable information in the context of resolving the unfocussed, centralized crawling problems noted above.
`
`SUMMARY OF THE INVENTION
The invention is a general purpose computer programmed according to the inventive steps herein to generate a database of Web pages that is focussed on a predefined topic or topics, for subsequent efficient searching of the database by users. The invention can also be embodied as an article of manufacture (a machine component) that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to generate the focussed database. This invention is realized in a critical machine component that causes a digital processing apparatus to undertake the inventive logic herein.
`
In accordance with the present invention, the computer includes computer readable code means for receiving a seed set of Web pages in a crawl database, with the seed set being representative of at least one topic. The computer also includes computer readable code means for identifying outlink Web pages from one or more Web pages in the crawl database, and computer readable code means for evaluating the outlink Web pages for relevance to the topic. Further, the computer includes computer readable code means for causing outlinks only of Web pages that are evaluated as being relevant to the topic to be stored in the crawl database, for subsequent evaluation.
In a preferred embodiment, computer readable code means assign a revisitation priority to a Web page, based on the means for evaluating. To embody the code means, at least one "watchdog" module is provided, and the watchdog module periodically determines new and old pages to consider. The new pages are selected from the outlink Web pages, preferably those that have not yet been visited by the present logic, and the old pages are selected from pages in the crawl database. One or more worker modules respond to the watchdog module to access the new and old pages to consider. In one embodiment, in "considering" a page the present logic fetches and analyzes the page. As disclosed in detail below, each Web page is associated with a respective field (referred to in the preferred embodiment as the "Num Tries" field) that represents the number of times the respective page has been accessed, and the Num Tries field of a Web page is incremented each time the Web page is considered.
Additionally, the preferred worker module includes means for determining whether a gathering rate of relevant pages is below a "panic" threshold. In one embodiment, the computer includes computer readable code means for considering all outlinks and inlinks of a Web page in the crawl database when the gathering rate is at or below the panic threshold. When the gathering rate is at or below the threshold, the scope of the topic can be increased to an expanded scope.
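A minimal sketch of that gathering-rate check follows, assuming a sliding window over recently fetched pages; the class name, window size, and threshold value are illustrative placeholders rather than values taken from the patent.

    from collections import deque

    class PanicMonitor:
        """Tracks the fraction of recently fetched pages judged relevant
        ("good") and signals when the gathering rate falls to or below a
        panic threshold."""

        def __init__(self, window_size: int = 100, panic_threshold: float = 0.2):
            self.recent = deque(maxlen=window_size)   # 1 = relevant page, 0 = not
            self.panic_threshold = panic_threshold

        def record(self, is_relevant: bool) -> None:
            self.recent.append(1 if is_relevant else 0)

        def gathering_rate(self) -> float:
            return sum(self.recent) / len(self.recent) if self.recent else 1.0

        def should_panic(self) -> bool:
            # When the rate is at or below the threshold, the crawler widens its
            # scope: it considers all outlinks and inlinks of pages already in
            # the crawl database and relaxes the topic-match criteria.
            return self.gathering_rate() <= self.panic_threshold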
In another aspect, a computer system for focussed searching of the World Wide Web includes a computer that in turn includes a watchdog module for scheduling worker thread work and at least one worker module for undertaking the work. The work to be undertaken includes adding outlinks of a Web page to a crawl database, but only when the Web page is relevant to a predefined topic or topics. The crawl database is accessible by the computer and is focussed only on the topic such that the system includes no topically comprehensive database of the World Wide Web.
In still another aspect, a computer-implemented method is disclosed for building a focussed database of Web pages. The method includes receiving a search query from a user, and in response to the search query, accessing a crawl database containing only information pertaining to Web pages related to a limited number of predefined topics. A computer program product executing the method steps is also disclosed.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:
`BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of the present system, showing an example Web page table entry and associated link table entry in an exploded relationship to the crawl database;
FIG. 2 is an overall flow chart of the present logic;
FIG. 3 is a flow chart of the watchdog module logic;
FIG. 4 is a flow chart of a worker module logic;
FIG. 5 is a flow chart of the worker module classification logic; and
FIG. 6 is a flow chart of the logic for relaxing match criteria to recover from the panic mode.
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
Referring initially to FIG. 1, a system for focussed Web crawling is shown, generally designated 10. In the particular architecture shown, the system 10 includes a digital processing apparatus, such as a computer 12, which accesses the World Wide Web via the Internet 13. In one intended embodiment, the computer 12 may be a personal computer made by International Business Machines Corporation (IBM) of Armonk, N.Y. as shown, or the computer 12 may be any computer, including computers sold under trademarks such as AS400, with accompanying IBM Network Stations. Or, the computer 12 may be a Unix computer, or OS/2 server, or Windows NT server, or IBM RS/6000 250 workstation with 128 MB of main memory running AIX 3.2.5, or an IBM laptop computer.
The computer 12 includes a focussed crawler 14 which may be executed by a processor within the computer 12 as a series of computer-executable instructions. These instructions may reside, for example, in RAM of the computer 12, and may be programmed as a C++ ODBC application. Alternatively, the instructions may be contained on a data storage device with a computer readable medium, such as a computer diskette 16 shown in FIG. 1. As shown, the diskette 16 includes a data storage medium 18 holding computer program code elements A-D. Or, the instructions may be stored on a DASD array, magnetic tape, conventional hard disk drive, electronic read-only memory, optical storage device, or other appropriate data storage device. In an illustrative embodiment of the invention, the computer-executable instructions may be lines of compiled C++ compatible code. As yet another equivalent alternative, the logic can be embedded in an application specific integrated circuit (ASIC) chip or other electronic circuitry.
FIG. 1 also shows that the system 10 can include peripheral computer equipment known in the art, including an output device such as a video monitor 20 and input devices such as a computer keyboard 22 and mouse 24. Other output devices can be used, such as printers, other computers, and so on. Likewise, input devices other than the keyboard 22 can be used, e.g., trackballs, keypads, touch screens, and voice recognition devices.
As described in detail below, the focussed crawler 14 accesses scheduler controls 26 that control when the focussed crawler 14 executes its logic. Accordingly, the scheduler controls 26 are referred to below as a "watchdog module", and the focussed crawler is from time to time referred to as a "worker thread" or threads 27. Indeed, in the preferred embodiment the watchdog module controls multiple worker threads 27.
Additionally, the focussed crawler 14 accesses a topic analyzer 28 (also referred to herein as "hypertext classifier"). The topic analyzer 28 compares the content of a Web page with a predefined topic or topics and generates a response representative of how relevant the Web page is to the topic. A relevant Web page is referred to as a "good" Web page. Details of one preferred topic analyzer are set forth in
`35
`
`40
`
`45
`
`50
`
`60
`
`65
`
`Petitioner Google Ex-1032, 0008
`
`
`
`US 6,418,433 B1
`
`15
`
`25
`
`35
`
`40
`
`S
co-pending U.S. patent application Ser. No. 09/143,733, filed Aug. 29, 1998, for an invention entitled "Method for Interactively Creating an Information Database Including Preferred Information Elements, such as Preferred Authority, Worldwide Web Page", owned by the same assignee as is the present invention and incorporated herein by reference. Or, the following references provide topic analyzers: Chakrabarti et al., "Enhanced Hypertext Categorization Using Hyperlinks", SIGMOD ACM, 1998, and Chakrabarti et al., "Scalable Feature Selection, Classification, and Signature Generation for Organizing Large Text Databases into Hierarchical Taxonomies", VLDB Journal, invited paper, August, 1998.
As disclosed below, it is the purpose of the system 10 to generate a database that contains information on only Web pages that pertain to the topic or topics of interest, i.e., a focussed database. This database is depicted in FIG. 1 as a crawl database 30. As shown, the crawl database 30 includes a Web page ("crawl") table 32 that includes corresponding link tables 34, each of which is an edge table relative to the Web page table 32. Also, the database 30 includes example Web page tables 35A and taxonomy tables 35B. A user can search the database 30 efficiently for Web pages of interest, i.e., for only Web pages relating to the topic on which the database 30 is focussed. Advantageously, Web document meta data such as the data in the present tables is stored in a relational database, and more preferably in the database referred to as DB2/UDB, made by the present assignee.
As shown in FIG. 1, the Web page table 32 includes plural fields, each having at least a name, a data type, and, for disclosure purposes, a field description. More specifically, the Web page table 32 includes a Uniform Resource Locator (URL) field 36, having a variable character data type, that represents a Web page URL. Also, the Web page table 32 includes various other fields associated with the URL, including an OID field 38 that is an eight character, 64 bit hash of the URL, and a Num Tries field 40 that is a count of how many times the Web page associated with the URL has been considered by the crawler 14, both as a "new" page and as an "old" page as described below. Furthermore, the Web page table 32 includes a priority field 42 that represents how often the Web page is to be revisited by the crawler 14, as determined in consonance with the topic analyzer 28, to determine whether any changes in the Web page have occurred, and to "refresh" the entry in the Web page table 32 as appropriate therefor.
In addition, the preferred Web page table 32 includes various administrative fields, including an integer IP address field 44 representing the address of the Web server from which the Web page represented by the URL field 36 was acquired. Also, three time stamp fields are provided, namely, a "found" field 46, indicating the date and time when the page was initially found, an "indexed" field 48, indicating the date and time when the page was last indexed in the table 32, and a "modified" field 50, indicating the date and time the Web page was last modified by the provider of the content of the page. Moreover, a relevance field 51 indicates the relevance of the Web page as more fully disclosed below, and a category id ("cid") field 51A indicates the topic category of the page.
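The field layout just described can be illustrated in code. The sketch below uses Python's sqlite3 in place of the DB2/UDB relational store named above; the exact table name, column names, and column types are assumptions made for illustration only.

    import sqlite3

    # Illustrative schema for the Web page ("crawl") table 32 of FIG. 1.
    CRAWL_TABLE_DDL = """
    CREATE TABLE IF NOT EXISTS crawl (
        url       TEXT,                -- VARCHAR: URL of the page
        oid       BLOB PRIMARY KEY,    -- CHAR(8): 64-bit hash of the URL
        num_tries INTEGER DEFAULT 0,   -- times this URL has been considered by the crawler
        priority  INTEGER DEFAULT 0,   -- crawl/refresh priority set by the topic analyzer
        ipaddr    INTEGER,             -- IP address of the server the page came from
        found     TIMESTAMP,           -- when the page was first found
        indexed   TIMESTAMP,           -- when the page was last indexed
        modified  TIMESTAMP,           -- when the page was last modified
        relevance REAL,                -- relevance of the page (log R(d), see below)
        cid       INTEGER              -- category id: best-matching taxonomy leaf
    )
    """

    conn = sqlite3.connect(":memory:")
    conn.execute(CRAWL_TABLE_DDL)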
It is to be understood that information pertaining to a "seed" set of Web pages is initially stored in the Web page table 32. The seed set can be gathered from, e.g., the temporary Internet file directories of the employees of a company or from some other group that can be expected to have shared interests and that, consequently, can be expected to have collectively downloaded Web pages that are related
`45
`
`50
`
`55
`
`60
`
`65
`
`6
to a limited number of predefined topics. Thus, the seed set does not define a comprehensive, universal set of all topics on the Web, but rather a relatively narrow topic or range of topics that are of interest to the particular source (e.g., employee group) from which the seed set is derived or otherwise defined. As but one example, a seed set of Web pages might all be related to the topic "mining". The topic itself can be defined by a user or by considering the seed set using the topic analyzer 28, including an associated classifier trainer 28A, to derive the predefined topic or topics from the seed set in the example tables 35A and from data in the taxonomy tables 35B, but in any case, only Web pages relating to the topic are contained in the database 30. Once derived, the target topic is stored in a category model data structure 35C.
The link table 34 is an edge table (e.g., graph edge) associated with the URL field 36. It includes an eight character source field 52, representing the source of the hyperlink that was followed by the crawler 14 to initially consider the Web page represented by the URL field 36. Such links are referred to as "inlinks", because they lead in to the Web page represented by the URL field 36. Similarly, an eight character target field 54 represents any links to other Web pages contained in the Web page represented by the URL field 36. Such links are referred to as "outlinks" because they lead to Web pages other than the Web page represented by the URL field 36. A "type" field 56 that is a short integer string is also included in the link table 34. The type field 56 indicates the type of outlink in the target field 54, e.g., whether the link is to a subdirectory of the directory in which the URL 36 resides, or in another directory (a "cross directory") in the same server, or on another Web site altogether, or whether the link redirects the user back to the URL 36.
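A companion sketch for the link (edge) table 34 and its type field 56 follows. The patent names the link categories but not their encodings, so the numeric type codes and the classify_link_type helper below are illustrative assumptions.

    from urllib.parse import urlparse
    import posixpath

    LINK_TABLE_DDL = """
    CREATE TABLE IF NOT EXISTS link (
        srcoid BLOB,      -- CHAR(8): hash of the referring (source) page
        dstoid BLOB,      -- CHAR(8): hash of the referred-to (target) page
        type   INTEGER    -- SHORTINT: kind of link, see classify_link_type
    )
    """

    # Hypothetical type codes for the categories named in the text.
    SUBDIR, CROSS_DIR, OTHER_SITE, REDIRECT = 0, 1, 2, 3

    def classify_link_type(src_url: str, dst_url: str, is_redirect: bool = False) -> int:
        """Classify a hyperlink in the manner the type field 56 describes."""
        if is_redirect:
            return REDIRECT                         # link redirects back to the source URL
        src, dst = urlparse(src_url), urlparse(dst_url)
        if src.netloc != dst.netloc:
            return OTHER_SITE                       # link to another Web site altogether
        src_dir = posixpath.dirname(src.path)
        if posixpath.dirname(dst.path).startswith(src_dir):
            return SUBDIR                           # target lives under the source's directory
        return CROSS_DIR                            # same server, different directory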
With the above overall architecture in mind, a user can generate a query for information using the keyboard 22 or mouse 24, and in response a conventional browser or searcher 58 associated with the computer 12 accesses the crawl database 30 to retrieve a list of relevant Web pages therefrom in accordance with well-known principles. It will readily be appreciated that the database 30 does not contain a listing of Web pages unrelated to the predefined topic, but only relevant Web pages. Consequently, the browser 58 will quickly respond to the query when the query is related to the predefined topic; otherwise, for queries unrelated to the topic, no response will be available from the database 30. Under such a circumstance, the browser 58 can access a conventional crawler database. If desired, a conventional quality rating system 60 can evaluate the quality of Web sites in the database 30 for relevancy to a user's query.
FIG. 2 shows the overall logic of the present invention. Starting at state 100, the seed set of example Web pages is received from the user. Moving to block 102, the seed set is mapped to appropriate nodes in the taxonomy by invoking a classification program such as Taper or HyperClass on each document in the seed set. Proceeding to block 104, frequently occurring seed set categories in the taxonomy are highlighted, and these categories are marked or otherwise indicated as being "good" categories at block 106.
Next, at block 108, starting with the seed set, the URL of each page is selected based on its Num Tries field 40 (FIG. 1) and its relevance field 51. Alternatively, the priority field 42 can be used. Initially, the relevance and priority fields can have default values.
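A sketch of the selection at block 108, under the assumption that pages considered fewer times are preferred and that ties are broken by descending relevance (the priority field could be substituted, as just noted); conn is an open connection to the crawl table sketched earlier.

    from typing import Optional

    def select_next_url(conn) -> Optional[str]:
        """Pick the next URL to consider: fewest prior tries first, and among
        those, the most relevant page."""
        row = conn.execute(
            "SELECT url FROM crawl ORDER BY num_tries ASC, relevance DESC LIMIT 1"
        ).fetchone()
        return row[0] if row else None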
From block 108 the process moves to block 110 to tokenize the URL to identify its textual content and outlinks. Proceeding to block 112, the relevance and priority of the document are determined as follows.
To determine the relevance of a document, it is assumed that the category taxonomy imposes a hierarchical partition of Web documents. Categories in the taxonomy tree, also referred to as nodes, are denoted "c". The predicate good(c) denotes whether a node "c" has been marked as good. By definition, for any document "d", the probability that it was generated from the category c0 corresponding to the root node, denoted Pr[c0|d], is one. In general, Pr[c|d] = Pr[parent(c)|d] * Pr[c|d, parent(c)]. Pr[parent(c)|d] is computed recursively, whereas Pr[c|d, parent(c)] is computed using Bayes Rule as Pr[c|parent(c)] Pr[d|c] / Σ_c' Pr[c'|parent(c)] Pr[d|c'], where the sum ranges over all siblings c' of c. Finally, the probability that a page is relevant is Σ_{good(c)} Pr[c|d]. This quantity, denoted R(d), is typically very small, so the logarithm of R(d) can be stored if desired.
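The recursion can be written out directly. In the sketch below the taxonomy is a dictionary keyed by node id, cond_prob(c, d) stands for the Bayes Rule term Pr[c|d, parent(c)], and prior and doc_likelihood are placeholder functions for Pr[c|parent(c)] and Pr[d|c]; none of these names come from the patent.

    import math

    # taxonomy: node id -> {"parent": parent id or None, "children": [ids], "good": bool}

    def cond_prob_factory(prior, doc_likelihood, taxonomy):
        """Builds Pr[c | d, parent(c)] via Bayes Rule:
        Pr[c|parent(c)] * Pr[d|c] / sum over siblings c' of Pr[c'|parent(c)] * Pr[d|c']."""
        def cond_prob(c, d):
            parent = taxonomy[c]["parent"]
            siblings = taxonomy[parent]["children"]          # includes c itself
            z = sum(prior(s, parent) * doc_likelihood(d, s) for s in siblings)
            return prior(c, parent) * doc_likelihood(d, c) / z if z > 0 else 0.0
        return cond_prob

    def relevance(d, taxonomy, cond_prob):
        """R(d) = sum over good nodes c of Pr[c|d], where
        Pr[c|d] = Pr[parent(c)|d] * Pr[c|d, parent(c)] and Pr[c0|d] = 1 at the root."""
        def pr_c_given_d(c):
            parent = taxonomy[c]["parent"]
            if parent is None:                               # root node c0
                return 1.0
            return pr_c_given_d(parent) * cond_prob(c, d)
        r = sum(pr_c_given_d(c) for c, node in taxonomy.items() if node["good"])
        return math.log(r) if r > 0 else float("-inf")       # the text stores log R(d)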
To model the way a document is generated by a category, the following Bernoulli document model for text is used. First select a category for the document and pick a length for the document. Notionally associated with each category is a many-sided coin. Each face of the coin represents a word; the probability that the face comes up corresponds with the probability that the corresponding word occurs in a document of this particular category. This coin is repeatedly tossed (figuratively) to write out the document until the length is reached.
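A minimal sketch of that coin model, scoring how likely category c is to have generated document d; working in log space and smoothing unseen words with a small constant are assumptions made for numerical convenience, not details from the patent.

    import math
    from collections import Counter

    def log_doc_likelihood(doc_words, word_probs, vocab_size, alpha=1e-6):
        """log Pr[d|c] under the coin model: each word occurrence in the document
        is one toss of category c's many-sided coin; word_probs maps each word
        (coin face) to the probability that the face comes up for this category."""
        counts = Counter(doc_words)
        return sum(
            n * math.log(word_probs.get(w, alpha / vocab_size))   # smooth unseen faces
            for w, n in counts.items()
        )

Exponentiated, or kept in log space inside the Bayes Rule ratio, this plays the role of the doc_likelihood term in the relevance sketch above.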
The category taxonomy is represented in a relational table as follows:
kcid smallint not null primary key,
pcid smallint not null,
kcname varchar(250) not null,
good smallint not null
where kcid is the ID of the current node ("kid" class ID), pcid is the ID of the parent class, kcname is the textual name, and good is a bit set to express whether the node is good for the current focus. The document table has two fields coupling it with the taxonomy table: relevance, which is set to log Σ_{good(c)} Pr[c|d], and cid, which represents the best matching leaf for the class. The priority with which a particular page is revisited can then be directly correlated to its relevance.
Moreover, the priority of a document not only can be determined by determining its relevance, but also by determining its "popularity", a measure of the quality of the document. Based on the recall-precision trade-off of the application and the availability of disk and network resources, a suitable threshold µ on relevance is defined, e.g., 10%-20%. Of the pages fetched in the current crawl, only those pages exceeding the threshold are selected as candidate "authorities". Any page pointing to at least one "authority" is a candidate "hub". "Hubs" are not thresholded.
Next, an edge set E is constructed using only those links that are between pages on different sites. Unlike the conventional HITS algorithm, in the present invention the edges have weights; edge (u,v) has three associated numbers: the relevance score of "u", called R[u]; the number of distinct pages on the same server as "u" that point to "v", called h[u,v]; and the number of distinct pages on the same server as "v" to which "u" points, called a[u,v]. Iterations of the following form can then be performed using the edge weights R[u]/h[u,v] while computing a[v] and R[u]/a[u,v] while computing h[u], with the authority assignment being changed to only consider pages that pass the relevance bar:
`RX
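The excerpt breaks off in the middle of the iteration formulas, so the following is only one plausible reading of the relevance-weighted HITS-style update described above; the normalization step and the exact placement of the weights are assumptions.

    def weighted_hits(E, R, h_dup, a_dup, num_iters=20):
        """One reading of the relevance-weighted iteration.
        E     : set of (u, v) edges between pages on different sites
        R     : R[u], relevance score of page u
        h_dup : h[u, v], distinct pages on u's server pointing to v
        a_dup : a[u, v], distinct pages on v's server that u points to
        Only pages passing the relevance bar were admitted as candidate
        authorities when E was built, per the selection step above."""
        hub = {u: 1.0 for (u, v) in E}
        auth = {v: 1.0 for (u, v) in E}
        for _ in range(num_iters):
            # authority update, using the edge weight R[u]/h[u,v]
            for v in auth:
                auth[v] = sum(hub[u] * R[u] / h_dup[(u, vv)]
                              for (u, vv) in E if vv == v)
            # hub update, using the edge weight R[u]/a[u,v]
            for u in hub:
                hub[u] = sum(auth[v] * R[u] / a_dup[(uu, v)]
                             for (uu, v) in E if uu == u)
            # normalize to keep the scores bounded between iterations
            for d in (hub, auth):
                s = sum(d.values()) or 1.0
                for k in d:
                    d[k] /= s
        return hub, auth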