throbber
(12) United States Patent
`Chakrabarti et al.
`
`I lllll llllllll Ill lllll lllll lllll lllll lllll 111111111111111111111111111111111
`US006418433Bl
`US 6,418,433 Bl
`Jul. 9, 2002
`
`(10) Patent No.:
`(45) Date of Patent:
`
`(54) SYSTEM AND METHOD FOR FOCUSSED
`WEB CRAWLING
`
`(75)
`
`Inventors: Soumen Chakrabarti, San Jose; Byron
`Edward Dom, Los Gatos; Martin
`Henk van den Berg, Palo Alto, all of
`CA(US)
`
`(73) Assignee: International Business Machines
`Corporation, Armonk, NY (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 09/239,921
`
`(22) Filed:
`
`Jan.28, 1999
`
`Int. Cl.7 ................................................ G06F 17/30
`(51)
`(52) U.S. Cl. ............................ 707/5; 707/513; 709/218
`(58) Field of Search .......................... 707/1-6, 10, 104,
`707/501, 513; 709/200, 203, 217-219
`
`(56)
`
`References Cited
`
`"Annotated Reference List Agents," David C. Blight, Pro(cid:173)
`ceedings of the 1997 IEEE Conference on Communications,
`Power and Computing, May 22-23, 1997, pp. 7-12.*
`
`"Silk from a Sow'Ear: Extracting Usable Structures from the
`Web," Pirolli et al., Xerox PARC, Proceedings of the 1996
`Conference on Human Factors and Computing Systems,
`Canada, Apr. 13, 1996, pp. 118-125 Available at and down(cid:173)
`loaded from http:/www.acm.*
`
`"Enhanced Hypertext Categorization Using Hyperlinks,"
`Chakrabarti et al., Proceedings of the 1998 ACM SU GM OD
`International Conference on Management of Data, Seattle,
`USA, Jun. 1, 1998, pp. 307-318.*
`
`"Automated Reference List Agents," David C. Blight,
`TRlabs, Winnipeg, Canada, Proceedings of the 1997 IEEE
`Conference or Communications, Power and Computing,
`May 22-23, 1997 97CH36117, apges 7-12.*
`
`* cited by examiner
`
`Primary Examiner---Hosain T. Alam
`(74) Attorney, Agent, or Firm-John L. Rogitz
`
`U.S. PATENT DOCUMENTS
`
`(57)
`
`ABSTRACT
`
`. 364/419.13
`11/1994 Kadashevich et al.
`5,369,577 A
`6/1996 Meske, Jr. et al. .......... 395/600
`5,530,852 A
`1/1998 Kadashevich et al.
`...... 395/793
`5,708,829 A
`2/1998 Millett et al. ............... 395/603
`5,717,912 A
`7/1998 Meske, Jr. et al. .......... 395/602
`5,784,608 A
`7/1998 Hargrove ....................... 707/4
`5,787,417 A
`5,796,952 A * 8/1998 Davis et al.
`................ 709/234
`5,832,494 A * 11/1998 Egger et al. ................ 707/102
`
`OIBER PUBLICATIONS
`
`"Fab: Content-Based, Collaborative Recommendation,"
`Balabanovic and Shoham, Communications of the ACM,
`Mar. 1997, vol. 40, No. 3, pp. 66-72.*
`
`A focussed Web crawler learns to recognize Web pages that
`are relevant to the interest of one or more users, from a set
`of examples provided by the users. It then explores the Web
`starting from the example set, using the statistics collected
`from the examples and other analysis on the link graph of the
`growing crawl database, to guide itself towards relevant,
`valuable resources and away from irrelevant and/or low
`quality material on the Web. Thereby, the Web crawler
`builds a comprehensive topic-specific library for the benefit
`of specific users.
`
`32 Claims, 5 Drawing Sheets
`
`OVERALL FLOW
`
`100
`RECEIVE SEED
`SET OF
`DOCUMENTS
`
`, 102
`MAP SEED SET
`TO NODES IN
`TAXONOMY
`
`104
`HIGHLIGHT
`FREQUENTLY
`OCCURRING
`CATEGORIES
`
`106
`MARK SELECTED
`CATEGORIES
`AS "GOOD"
`
`112
`DETERMINE
`"RElEVANCE",
`"PRIORITY"
`
`110
`
`TOKENIZE URl TO FIND
`TEXT, OUTUNKS
`
`114
`
`116
`
`108
`
`STARTING WITH SEED
`SET. SELECT URl BY
`(NUM_IRIES, RElEYANCE)
`(COUlD USE PRIORITY
`INSTEAD)
`
`118
`
`BASED OH RELEVANCE,
`USE SOii OR HARD CRAWl
`TO INSERT OUTUHKS
`
`DETERMINE PRIORITY
`OF OUTllNKS
`(USING HITS)
`
`REVISIT PAGE,
`LINKS BASED
`ON PRIORITY
`
`120
`PERIODICALLY INVOKE
`QUAllTY RATING
`TO RATE CRAWl.
`ADD NEW EXAMPLES
`
`

`

`U.S. Patent
`
`Jul. 9, 2002
`
`Sheet 1 of 5
`
`US 6,418,433 Bl
`
`22
`
`Fig. 1
`
`26
`
`CRAWlER PRIORITY
`CONTROUER
`(WATCHDOG)
`CRAWlER
`WORKER THREADS
`
`14
`
`10
`OUTPUT DEVICES
`/
`KnBOARD t--=t~=-rn~_NJf_~~-f-1· ____ Mo_Nn_oR_--120 .-QU-ALIT_.__,Y 6D
`MOUSE 58
`RATER
`INSPECT
`BROWSE
`24
`SHECT
`& ADD
`TOPICS
`r-------------------,
`I
`I
`I
`I
`: TAXONOMY
`:
`EXAMPlE
`CRAWl
`CATEGORY
`TABlES
`TABlES
`TABlES
`MODH
`1
`1
`I
`I
`L, __________________ J
`30 ~'
`358 35A
`32
`35C
`28A
`
`27
`
`32
`/
`
`28
`
`HYPERTEXT
`ClASSIFIER
`
`18
`
`HYPERTEXT
`CLASSIFIER _____ _.
`TRAINER
`
`16
`
`CRAWl TABlE ENTRY
`
`FIELD NAME DATA TYPE DESCRIPTION
`URl
`VARCHAR
`URl OF PAGE
`CHAR(8)
`DID
`64-BIT HASH OF URL (PRIMARY KEY)
`NUM_TRIES SHORTINT
`COUNT HOW MANY TIMES THIS URL HAS BEEN CONSIDERED BY CRAWlER
`SHORTINT
`CRAWl OR REFRESH PRIORITY SET BY TOPIC ANAlYZER
`PRIORITY
`IP ADDRESS OF SERVER FROM WHICH PAGE WAS ACQUIRED
`IPADDR
`INTEGER
`FOUND
`TIMESTAMP
`TIME WHEN PAGE WAS FIRST FOUND
`TIMESTAMP
`INDEXED
`TIME WHEN PAGE WAS lAST INDEXED
`TIMESTAMP
`MODlflED
`TIME WHEN PAGE WAS lAST MODlflED
`RElEVANCE
`FLOAT
`RElEVANCE OF PAGE
`CID
`CATEGORY ID
`SH.INT.
`
`-....._ -
`
`52
`54
`56
`
`FIELD NAME
`SRCOID
`DSTOID
`TYPE
`
`DATA TYPE DESCRIPTION
`CHAR(8)
`SOURCE OF UNK
`CHAR(8)
`TARGET Of UNK
`SHORTINT
`TYPE Of UNK: SUBDIR. CROSS. OTHERSITE. REDIRECT
`(llNK) EDGE TABLE
`
`34 ,
`
`36
`"'-.
`38
`' ..._
`40
`42
`._
`44
`46
`48
`50
`51
`51A
`
`........
`
`--
`
`.-
`.,,...
`
`/
`
`

`

`U.S. Patent
`
`Jul. 9, 2002
`
`Sheet 2 of 5
`
`US 6,418,433 Bl
`
`OVERALL now
`
`Fig. 2
`
`~--'---
`
`( 100
`RECEIVE SEED
`SET Of
`DOCUMENTS
`
`( 106
`( 104
`/ 102
`MARK SELECTED
`HIGHLIGHT
`MAP SEED SET
`r-. TO NODES IN r-. fREQUENTLY r-. CATEGORIES
`TAXONOMY
`OCCURRING
`AS "GOOD"
`CATEGORIES
`
`112
`\
`DETERMINE
`"RELEVANCE",~--
`"PRIORITY"
`
`l lO \
`TOKENIZE URL TO flND
`TEXT. OUTLINKS
`
`I
`
`( 108
`STARTING WITH SEED
`SET. SELECT URL BY
`~---1 (NUM_TRIES, RElEYANCE)
`(COULD USE PRIORITY
`INSTEAD)
`
`BASED ON RELEVANCE,
`USE son OR HARD CRAWL
`TO INSERT OUTLINKS
`
`-
`
`PERIODICALLY INVOKE
`QUALITY RA TING
`TO RATE CRAWL,
`ADD NEW EXAMPLES
`
`

`

`U.S. Patent
`
`Jul. 9, 2002
`
`Sheet 3 of 5
`
`US 6,418,433 Bl
`
`Fig. 3
`WATCHDOG
`
`62
`
`SLEEP
`
`68
`PRINT CRAWL
`DIAGNOSTICS
`
`NO
`
`70
`YES
`DETERMINE # NEW,
`OLD PAGES TO f ETCH
`THIS ACCESS
`
`74
`
`SORT OLD PAGES
`IN ORDER
`(NUM_TRIES.
`PRIORITY)
`
`SORT OUTLINKS
`IN ORDER
`(NUM_TRIES.
`PRIORITY)
`
`72
`
`NO
`
`84
`INCREMENT
`NUM_TRIES
`
`INSERT WORK
`INTO QUEUE
`,... _ __._____.__,
`80
`SHECT WORKER
`L---_.... WI LOW WORK ___ ___.
`IN QUEUE
`
`

`

`U.S. Patent
`
`Jul. 9, 2002
`
`Sheet 4 of 5
`
`US 6,418,433 Bl
`
`-----~ SlEEP UNTll WORK
`IN QUEUE
`
`fig. 4
`WORKER
`
`86
`
`88
`
`TOKENIZE
`PAGE
`
`104
`ClASSIFY
`(flG. 5)
`
`108
`UPDATE CRAWL
`- - - - - - - - - l TABLE ENTRY FOR
`THIS PAGE ONLY
`
`NO
`
`YES
`112
`~:r~vNf~ i~Jl
`INSERT NEW LINK
`TABLE ENTRIES - - -1 OUTLINKS (EXPAND)
`
`110
`
`

`

`U.S. Patent
`
`Jul. 9, 2002
`
`Sheet 5 of 5
`
`US 6,418,433 Bl
`
`.----------'-~114
`ClASSlfY
`CURRENT PAGE
`
`Fig. 5
`
`(EXPAND) ~-<
`FIG. 4
`
`118
`
`NO
`
`FIG. 4
`(PRUNE)
`
`120
`
`COllECT
`OUTUNKS &
`INUNKS
`
`..---........i...._.J.....--.
`
`124
`RElAX MATCH
`CRITERIA
`(FIG. 6)
`
`126
`BACKWARDS
`CRAWl
`
`PRUNE
`(FIG. 4)
`
`YES
`
`130
`
`122
`
`REClASSlfY
`~I USING
`llNK INFO
`
`FIND# GOOD
`PAGES THIS SITE
`
`134
`ClASSlfY
`CURRENT PAGE
`
`136
`FIND lEAST COMMON
`ANCESTOR OF ClASS ...____.......
`OF PAGE AND EACH
`DESIRED ClASS
`
`PRUNE IN
`FIG. 4
`
`

`

`US 6,418,433 Bl
`
`1
`SYSTEM AND METHOD FOR FOCUSSED
`WEB CRAWLING
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`The present invention relates generally to searching wide
`area computer networks for information, and more particu(cid:173)
`larly to searching the World Wide Web for topical informa(cid:173)
`tion.
`2. Description of the Related Art
`The system known as the "World Wide Web", or simply
`"Web", as implemented by the wide area computer network
`known as the Internet, contains a vast amount of information
`in the form of Web pages. Each Web page is electronically
`stored in a respective Web site on a computer, referred to as
`a Web server, with the Web itself including many Web
`servers that are interconnected by means of the Internet. A
`person can connect a computer to the Internet via, e.g., a
`telephone line, and thereby electronically access the Web
`pages on the Web servers.
`As the Web has grown, many millions of Web pages have
`been created. In other words, the Web contains a vast amount
`of information, and the content of the Web grows and
`changes minute by minute. It will accordingly be appreci- 25
`ated that some means must be provided for a person to sort
`through the vast quantities of Web pages to find a particular
`item of interest.
`With the above consideration in mind, most users employ 30
`software known as Web browsers when accessing the Web.
`To search the Web for a particular topic of information, the
`user causes their Web browser to access a Web site of a
`centralized search engine that is maintained by a search
`company. Examples of currently popular search engines are 35
`Alta Vista™ and Hotbot™.
`Centralized search engines use software referred to as
`"crawlers" to continuously access Web pages and construct
`a centralized keyword index. When a person wishes to
`retrieve information, the person's browser accesses a cen(cid:173)
`tralized search engine using a query, for example, "luxury
`cars". In response, software at the centralized engine
`accesses its index to retrieve names of Web sites considered
`by the search engine to be appropriate sources for the
`sought-after information. The search engine transmits to the
`browser hyperlinks to the retrieved sites, along with brief
`summaries of each site, with the browser presenting the
`information to the user. The user can then select the site or
`sites they want by causing the browser to access the site or
`sites.
`Owing to the burgeoning of the Web and the ever-growing
`amount of its information, and the fact that the above(cid:173)
`described centralized crawler schemes posture themselves to
`respond to any possible query (i.e., to be all things to all
`people), centralized crawler/searchers require large invest- 55
`ments in hardware and software and must never cease
`crawling the Web, to index new pages and to periodically
`revisit old pages that might have changed. Indeed, one Web
`search company currently requires the use of 16 of the most
`powerful computers made by a major computer 60
`manufacturer, each computer having 8 gigabytes of memory.
`Another search company currently uses a cluster of 300
`powerful workstations and over one terabyte of memory to
`crawl over 10 million Web pages per day. Despite these
`heroic efforts, however, it is estimated that a single search 65
`company is able to index only 30%-40% of the Web, owing
`to the size of the Web which, incidentally, shows no signs of
`
`2
`slowing its rate of expansion (currently at about one million
`new pages per day).
`Accordingly, one problem with current technology that is
`recognized and addressed by the present invention is the
`5 need to reduce the vast amount of Web search hardware and
`software that is inherently required by a centralized search
`scheme.
`Additionally, evaluating whether a particular Web page
`contains relevant information with respect to a user query is
`10 sometimes difficult. Moreover, user queries may not be
`effectively articulated, or they may be overbroad.
`Consequently, a Web search engine frequently responds to a
`query by returning a large number of Web pages that are of
`little or no interest to the requester. Nonetheless, a user must
`15 laboriously sort through hundreds and perhaps thousands of
`returned Web pages, which, as discussed above, can be
`considered to represent only 30%-40% of the total Web
`content in any case. Moreover, because a centralized crawler
`seeks the capability to respond to any query, most of the
`20 index of any single centralized system contains information
`that is of little or no value to any single user or indeed to any
`single interrelated group of users.
`Thus, two other problems recognized and addressed by
`the present invention are the lack of focus of search results,
`and the fact that centralized crawlers are not tailored to any
`particular user or to any particular interrelated group of users
`and, thus, contain mostly irrelevant information, from the
`point of view of a single user or group of users.
`In addition to the above considerations, the present inven(cid:173)
`tion recognizes that many if not most Web pages refer to
`other Web pages by means of hyperlinks, which a user can
`select to move from a referring Web page to a referred-to
`Web page. The present invention further recognizes that
`such hyperlinks are more than simply navigation tools; they
`are important tools for relevant data acquisition as well.
`It happens that with the existing Web communication
`protocol (hypertext transfer protocol, or "http"), when a user
`clicks on a hyperlink to a referred-to Web page v from a
`40 referring Web page u, the user's browser sends the identity
`of the referring Web page u to the Web server that hosts the
`referred-to Web page v, and this information can be recorded
`or logged. Unfortunately, current logs of which Web pages
`refer to which other Web pages are mostly unused and
`45 indeed mostly not enabled by Web site managers, and the
`logs moreover consume a relatively large amount of elec(cid:173)
`tronic data storage space. Also, no standard way exists for a
`remote user to access and use the information in the logs.
`The present invention, however, recognizes the above-
`50 noted problem and addresses how to exploit this currently
`unused but potentially valuable information in the context of
`resolving the unfocussed, centralized crawling problems
`noted above.
`
`SUMMARY OF THE INVENTION
`The invention is a general purpose computer programmed
`according to the inventive steps herein to generate a data(cid:173)
`base of Web pages that is focussed on a predefined topic or
`topics, for subsequent efficient searching of the database by
`users. The invention can also be embodied as an article of
`manufacture-a machine component-that is used by a
`digital processing apparatus and which tangibly embodies a
`program of instructions that are executable by the digital
`processing apparatus to generate the focussed database. This
`invention is realized in a critical machine component that
`causes a digital processing apparatus to undertake the inven(cid:173)
`tive logic herein.
`
`

`

`3
`In accordance with the present invention, the computer
`includes computer readable code means for receiving a seed
`set of Web pages in a crawl database, with the seed set being
`representative of at least one topic. The computer also
`includes computer readable code means for identifying 5
`outlink Web pages from one or more Web pages in the crawl
`database, and computer readable code means for evaluating
`the outlink Web pages for relevance to the topic. Further, the
`computer includes computer readable code means for caus(cid:173)
`ing outlinks only of Web pages that are evaluated as being 10
`relevant to the topic to be stored in the crawl database, for
`subsequent evaluation.
`In a preferred embodiment, computer readable code
`means assign a revisitation priority to a Web page, based on
`the means for evaluating. To embody the code means, at 15
`least one "watchdog" module is provided, and the watchdog
`module periodically determines new and old pages to con(cid:173)
`sider. The new pages are selected from the outlink Web
`pages, preferably those that have not yet been visited by the
`present logic, and the old pages are selected from pages in 20
`the crawl database. One or more worker modules respond to
`the watchdog module to access the new and old pages to
`consider. In one embodiment, in "considering" a page the
`present logic fetches and analyzes the page. As disclosed in
`detail below, each Web page is associated with a respective 25
`field (referred to in the preferred embodiment as the "Num_
`Tries" field) that represents the number of times the respec(cid:173)
`tive page has been accessed, and the Num_Tries field of a
`Web page is incremented each time the Web page is con(cid:173)
`sidered.
`Additionally, the preferred worker module includes
`means for determining whether a gathering rate of relevant
`pages is below a "panic" threshold. In one embodiment, the
`computer includes computer readable code means for con(cid:173)
`sidering all outlinks and inlinks of a Web page in the crawl 35
`database when the gathering rate is at or below the panic
`threshold. When the gathering rate is at or below the
`threshold, the scope of the topic can be increased to an
`expanded scope.
`In another aspect, a computer system for focussed search(cid:173)
`ing of the World Wide Web includes a computer that in turn
`includes a watchdog module for scheduling worker thread
`work and at least one worker module for undertaking the
`work. The work to be undertaken includes adding outlinks of 45
`a Web page to a crawl database, but only when the Web page
`is relevant to a predefined topic or topics. The crawl database
`is accessible by the computer and is focussed only on the
`topic such that the system includes no topically comprehen(cid:173)
`sive database of the World Wide Web.
`In still another aspect, a computer-implemented method is
`disclosed for building a focussed database of Web pages.
`The method includes receiving a search query from a user,
`and in response to the search query, accessing a crawl
`database containing only information pertaining to Web 55
`pages related to a limited number of predefined topics. A
`computer program product executing the method steps is
`also disclosed.
`The details of the present invention, both as to its structure
`and operation, can best be understood in reference to the
`accompanying drawings, in which like reference numerals
`refer to like parts, and in which:
`
`40
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram of the present system, showing
`an example Web page table entry and associated link table
`entry in an exploded relationship to the crawl database;
`
`US 6,418,433 Bl
`
`4
`FIG. 2 is an overall flow chart of the present logic;
`FIG. 3 is a flow chart of the watchdog module logic;
`FIG. 4 is a flow chart of a worker module logic;
`FIG. 5 is a flow chart of the worker module classification
`logic; and
`FIG. 6 is a flow chart of the logic for relaxing match
`criteria to recover from the panic mode.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
`Referring initially to FIG. 1, a system for focussed Web
`crawling is shown, generally designated 10. In the particular
`architecture shown, the system 10 includes a digital pro(cid:173)
`cessing apparatus, such as a computer 12, which accesses
`the World Wide Web via the Internet 13. In one intended
`embodiment, the computer 12 may be a personal computer
`made by International Business Machines Corporation
`(IBM) of Armonk, N.Y. as shown, or the computer 12 may
`be any computer, including computers sold under trade(cid:173)
`marks such as AS400, with accompanying IBM Network
`Stations. Or, the computer 12 may be a Unix computer, or
`OS/2 server, or Windows NT server, or IBM RS/6000 250
`workstation with 128 MB of main memory running AIX
`3.2.5., or an IBM laptop computer.
`The computer 12 includes a focussed crawler 14 which
`may be executed by a processor within the computer 12 as
`a series of computer-executable instructions. These instruc(cid:173)
`tions may reside, for example, in RAM of the computer 12,
`30 and may be programmed as a c++ ODBC application.
`Alternatively, the instructions may be contained on a data
`storage device with a computer readable medium, such as a
`computer diskette 16 shown in FIG. 1. As show, the diskette
`16 includes a data storage medium 18 holding computer
`program code elements A-D. Or, the instructions may be
`stored on a DASD array, magnetic tape, conventional hard
`disk drive, electronic read-only memory, optical storage
`device, or other appropriate data storage device. In an
`illustrative embodiment of the invention, the computer(cid:173)
`executable instructions may be lines of compiled c++ com(cid:173)
`patible code. As yet another equivalent alternative, the logic
`can be embedded in an application specific integrated circuit
`(ASIC) chip or other electronic circuitry.
`FIG. 1 also shows that the system 10 can include periph-
`eral computer equipment known in the art, including an
`output device such as a video monitor 20 and input devices
`such as a computer keyboard 22 and mouse 24. Other output
`devices can be used, such as printers, other computers, and
`50 so on. Likewise, input devices other than the keyboard 22
`can be used, e.g., trackballs, keypads, touch screens, and
`voice recognition devices.
`As described in detail below, the focussed crawler 14
`accesses scheduler controls 26 that control when the
`focussed crawler 14 executes its logic. Accordingly, the
`scheduler controls 26 are referred to below as a "watchdog
`module", and the focussed crawler is from time to time
`referred to as a "worker thread" or threads 27. Indeed, in the
`preferred embodiment the watchdog module controls mul-
`60 tiple worker threads 27.
`Additionally, the focussed crawler 14 accesses a topic
`analyzer 28 (also referred to herein as "hypertext classifier").
`The topic analyzer 28 compares the content of a Web page
`with a predefined topic or topics and generates a response
`65 representative of how relevant the Web page is to the topic.
`A relevant Web page is referred to as a "good" Web page.
`Details of one preferred topic analyzer are set forth in
`
`

`

`US 6,418,433 Bl
`
`20
`
`25
`
`15
`
`5
`co-pending U.S. patent application Ser. No. 09/143,733,
`filed Aug. 29, 1998, for an invention entitled "Method for
`Interactively Creating an Information Database Including
`Preferred Information Elements, such as Preferred(cid:173)
`Au~hority, ~orldwide Web Page", owned by the same 5
`assignee as is the present invention and incorporated herein
`by reference. Or, the following references provide topic
`a?al~zers: C:hakrabarti et al., "Enhanced Hypertext Catego(cid:173)
`nzat10n Usmg Hyperlinks", SIGMOD ACM, 1998, and
`Chakrabarti et al., "Scalable Feature Selection
`Classification, and Signature Generation for Organizin~ 10
`Large Text Databases into Hierarchical Taxonomies"
`VLDB Journal, invited paper, August, 1998.
`'
`As disclosed below, it is the purpose of the system 10 to
`generate a database that contains information on only Web
`pages that pertain to the topic or topics of interest, i.e., a
`focussed database. This database is depicted in FIG. 1 as a
`crawl database 30. As shown, the crawl database 30 includes
`a. Web page ("crawl") table 32 that includes corresponding
`lmk tables 34 each of which is an edge table relative to the
`Web page table 32. Also, the database 30 includes example
`Web page tables 35A and taxonomy tables 35B. A user can
`~earch the database 30 efficiently for Web pages of interest,
`i.e., for only Web pages relating to the topic on which the
`database 30 is focussed. Advantageously, Web document
`meta data such as the data in present tables is stored in a
`relational database, and more preferably in the database
`referred to as DB2/UDB, made by the present assignee.
`As shown in FIG. 1, the Web page table 32 includes plural
`fields, each having at least a name, a data type, and, for 30
`disclosure purposes, a field description. More specifically,
`the Web page table 32 includes a Uniform Resource Locator
`(URL) field 36, having a variable character data type, that
`represents a Web page URL. Also, the Web page table 32
`includes various other fields associated with the URL 35
`including an om field 38 that is an eight character, 64 bi~
`hash of the URL, and a Num_Tries field 40 that is a count
`of how many times the Web page associated with the URL
`has been considered by the crawler 14, both as a "new" page
`and as an "old" page as described below. Furthermore, the 40
`Web page table 32 includes a priority field 42 that represents
`how often the Web page is to be revisited by the crawler 14,
`as dete~mined in consonance with the topic analyzer 28, to
`determme whether any changes in the Web page have
`occurred, and to "refresh" the entry in the Web page table 32 45
`as appropriate therefor.
`In addition, the preferred Web page table 32 includes
`various administrative fields, including an integer IP address
`field 44 representing the address of the Web server from
`which the Web page represented by the URL field 36 was 50
`acquired. Also, three time stamp fields are provided, namely,
`a "found" field 46, indicating the date and time when the
`page was ini~ially found, an "indexed" field 48, indicating
`the date and time when the page was last indexed in the table
`32, and a "modified" field 50, indicating the date and time 55
`the Web page was last modified by the provider of the
`content of the page. Moreover, a relevance filed 51 indicates
`the relevance of the Web page as more fully disclosed below,
`and a category id ("cid") field 51A indicates the topic
`category of the page.
`It is to be understood that information pertaining to a
`"seed" set of Web pages is initially stored in the Web page
`table 32. The seed set can be gathered from, e.g., the
`temporary Internet file directories of the employees of a
`company or from some other group that can be expected to 65
`have shared interests and that, consequently, can be expected
`to have collectively downloaded Web pages that are related
`
`60
`
`6
`to a limited number of predefined topics. Thus, the seed set
`does not define a comprehensive, universal set of all topics
`on the Web, but rather a relatively narrow topic or range of
`topics that are of interest to the particular source (e.g.,
`employee group) from which the seed set is derived or
`otherwise defined. As but one example, a seed set of Web
`~ages might all be related to the topic "mining". The topic
`itself can be defined by a user or by considering the seed set
`using the topic analyzer 28, including an associated classi(cid:173)
`fier trainer 28A, to derive the predefined topic or topics from
`the seed set in the example tables 35A and from data in the
`taxonomy tables 35B, but in any case, only Web pages
`relating to the topic are contained in the database 30. Once
`derived, the target topic is stored in a category model data
`structure 35C.
`Th~ link t~ble 34 is an edge table (e.g., graph edge)
`associated with the URL field 36. It includes an eight
`character source field 52, representing the source of the
`hyperlink that was followed by the crawler 14 to initially
`consider the Web page represented by the URL field 36.
`Such links are referred to as "inlinks", because they lead in
`to t~e Web page represented by the URL field 36. Similarly,
`an eight character target field 54 represents any links to other
`Web pages contained in the Web page represented by the
`URL field 36. Such links are referred to as "outlinks"
`because they lead to Web pages other than the Web page
`represented by the URL field 36. A "type" field 56 that is a
`short integer string is also included in the link table 34. The
`type field 56 indicates the type of outlink in the target field
`?4, e.!?:., whether the link is to a subdirectory of the directory
`m which the URL 36 resides, or in another directory (a
`"cross directory") in the same server, or on another Web site
`altogether, or whether the link redirects the user to back to
`the URL 36.
`With the above overall architecture in mind, a user can
`generate a query for information using the keyboard 22 or
`mouse 24, and in response a conventional browser or
`searcher 58 associated with the computer 12 accesses the
`crawl database 30 to retrieve a list of relevant Web pages
`therefrom in accordance with well-known principles. It will
`readily be appreciated that the database 30 does not contain
`a listing of Web pages unrelated to the predefined topic, but
`only relevant Web pages. Consequently, the browser 58 will
`quickly respond to the query when the query is related to the
`predefined topic; otherwise, for queries unrelated to the
`topic, no response will be available from the database 30.
`Under such a circumstance, the browser 58 can access a
`conventional crawler database. If desired, a conventional
`quality rating system 60 can evaluate the quality of Web sites
`in the database 30 for relevancy to a user's query.
`FIG. 2 shows the overall logic of the present invention.
`Starting at state 100, the seed set of example Web pages is
`received from the user. Moving to block 102, the seed set is
`mapped to appropriate nodes in the taxonomy by invoking
`a classification program such as Taper or HyperClass on
`each document in the seed set. Proceeding to block 104,
`fr.equ~ntly occurring seed set categories in the taxonomy are
`highlighted, and these categories are marked or otherwise
`indicated as being "good" categories at block 106.
`Next, at block 108, starting with the seed set the URL of
`each pa.ge is selected based on its Num_Tries field 40 (FIG.
`1) and its relevancy field 51. Alternatively, the priority field
`42 can be used. Initially, the relevance and priority fields can
`have default values.
`From block 108 the process moves to block 110 to
`tokenize the URL to identify its textual content and out-
`
`

`

`US 6,418,433 Bl
`
`5
`
`10
`
`7
`links. Proceeding to block 112, the relevance and priority of
`the document are determined as follows.
`To determine the relevance of a document, it is assumed
`that the category taxonomy imposes a hierarchical partition
`of web documents. Categories in the taxonomy tree, also
`referred to as nodes, are denoted "c". The predicate good(c)
`denotes whether a node "c" has been marked as good. By
`definition, for any document "d", the probability that it was
`generated from the category c0 corresponding to the root
`node, denoted Pr[ c0 ld], is one. In general Pr[ cld]=Pr[parent
`(c)ld]Pr[cld,parent(c)]. Pr[parent(c)ld] is computer
`recursively, whereas Pr[cld,parent(c)] is computed using
`Bayes Rule as Pr[clparent(c)]Pr[dlc/IcPr[c'lparent(c')]Pr
`[die'], where the sum ranges over all siblings c' of c. Finally,
`the probability that a page is relevant is ~good(C)Pr[ cld]. This
`quantity, denoted R( d), is typically very small, so the loga(cid:173)
`rithm of R( d) can be stored if desired.
`To model the way a document is generated by a category
`the following Bernoulli document model for text is used.
`First select a category for the document and pick a length for
`the document. Notionally associated with each category is a
`many-sided coin. Each face of the coin represents a word;
`the probability that the face comes up corresponds with the
`probability that the corresponding word occurs in a docu(cid:173)
`ment of this particular category. This coin is repeatedly
`tossed (figuratively) to write out the document until the
`length is reached. The category taxonomy is represented in
`a relational table as follows:
`kcid smallint not null primary key,
`pcid smallint not null,
`kcname varchar(250) not null,
`good smallint not null
`where kcid is the ID of the current node ("kid" class ID),
`pcid is the ID of the parent class, kcname is the textual name,
`and good is a bit set to express whether the node is good for 35
`the current focus. The document table has two fields cou-
`pling it with the taxonomy table: relevance, which is set to
`log ~good(clr[ cld], and cid, which represents the best match(cid:173)
`ing leaf for the class. The priority with which a particular
`page is revisited can then be directly correlated to its
`relevance.
`Moreover, the priority of a document not only can be
`determined by determining its relevance, but also by deter(cid:173)
`mining its "popularity", a measure of the quality of the
`document. Based on the recall precision trade-off of the 45
`application and the availability of disk and network
`resources, a suitable thresholdµ on relevance is defined, e.g.,
`10%-20% Of the pages fetched in the current crawl, only
`those pages exceeding the threshold are selected as candi(cid:173)
`date "authorities". Any page pointing to at least one "author(cid:173)
`ity" is a candidate "hub". "Hubs" are not thresholded.
`Next, an edge set E is constructed using only those links
`that are between pages on different sites. Unlike the con(cid:173)
`ventional HITS algorithm, the present invention have
`weights; edge (u,v) has three associated numbers: the rel(cid:173)
`evance score of "u" called R[ u ], the number of distinct pages
`on the same server as "u" that point to "v", called h'[u,v],
`and the number of distinct pages on the same server as "v"
`to which "u" points, called a'[ u,v]. Iterat

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket