`Law et al.
`
`lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
`
`US006754873Bl
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 6, 754,873 Bl
`Jun.22,2004
`
`(54) TECHNIQUES FOR FINDING RElATED
`HYPERLINKED DOCUMENTS USING LINK(cid:173)
`BASED ANALYSIS
`
`cache:PJIPOZ9wxQsJ:gatekeeper.dec.com/pub/DEC/SRC/
`publicatious/monika/sigir98/pdf+lmoved+Algorilhms+for+
`topic&hl=eo&ie=UIF-8. *
`
`(75)
`
`Inventors: Kin Lun Law, Redwood City, CA
`(US); Georges R._Harlk, Mountain _
`View, CA (US)
`
`(73) Assignee: Google Inc., Moutain View, CA (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 09/544,733
`
`(22) Filed:
`
`Apr. 6, 2000
`
`Related U.S. Application Data
`( 60) Provisional application No. 60/155,277, filed on Sep. 20,
`1999.
`Int. Cl? .......................... G06F 17/00; G06F 17/30
`(51)
`(52) U.S. Cl •.................... 715/501.1; 715/514; 715/513;
`707/3; 707/4; 707/5
`(58) Field of Search .............................. 715/501.1, 514,
`715/513; 707/4, 5, 3
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,895,470 A • 4/1999 Pirolli et al ................. 707/102
`6,285,999 Bl • 9/2001 Page ................•..••........ 707/5
`6,457,028 Bl • 9/2002 Pitkow et al. .............. 715/513
`6,591,261 B1 • 7/2003 Arthurs ......................... 707/2
`6,665,837 B1
`12/2003 Dean et al ............... 715/501.1
`
`01HER PUBLICPJIONS
`
`Bharat & Henzinger, "Improved Algorithms for Topic Dis(cid:173)
`tillation in a Hyperlinked Environment" 1998 http://
`216.239.37.104/searcb?q=
`
`Dean & Henzinger, "Finding related pages in the World
`- -Wide Web" Mar. 1; 1999, http:/!216.23939.104/search?q=
`cache:tWTI2B22VcEJ:www.unizh.ch/home/mazw/re(cid:173)
`ports/www8conf/2148/pdf!pdl.pdf +Finding+Relaled+
`Pages+in+the+ World+ Wide+ Web&hl=en&ie-UTF-8. *
`
`Bharat & Henzinger, "Improved Algorithms for Topic Dis(cid:173)
`tillation in a Hyperlinked Environment" 1998 http://
`216.239 .37.104/search?q=
`cache:PJIPOZ9wxQsJ:gatekeeper.dec.com/pub/DEC/SRC/
`publicatious/moriika/sigir98/pdf+lmoved+Algorithms+for+
`topic&hlsen&ie=UTF-8. *
`
`* cited by examiner
`
`Primary Examiner-Stephen S. Hong
`Assistant Examiner~dam L. Basehoar
`(74) Attorney, Agent, or Firm--Harrity & Snyder LLP
`
`(57)
`
`ABSTRACT
`
`Techniques for finding related hyperlinked documents using
`link-based analysis are provided. Backlink and forwardlink
`sets can be utilized to find web pages that are related to a
`selected web page. The scores for links from web pages that
`are from the same host and links from web pages with
`numerous links can be reduced to achieve a better list of
`related web pages. The list of related web pages can be
`utilized as a feature to a word-based search engine or an
`addition to a web browser.
`
`19 Claims, 14 Drawing Sheets
`
`EXHIBIT 2107
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 1 of 14
`
`US 6,754,873 Bl
`
`..,..---1
`
`11
`
`FIG. 1
`
`51
`
`53
`
`55
`
`57
`
`..,..---1
`
`DISPLAY
`ADAPTER
`
`59
`
`3
`
`SOUND CARD
`
`61
`
`63
`
`65
`
`NETWORK
`INTERFACE
`
`9
`
`11
`
`FIG. 2
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 2 of 14
`
`US 6,754,873 Bl
`
`1
`
`NETWORK
`
`1
`
`D
`
`-
`~o:
`
`-
`
`1
`
`D
`
`-
`I oJo:
`
`-
`
`FIG. 3
`
`
`
`U.S. Patent
`US. Patent
`
`Jun.22,2004
`Jun. 22, 2004
`
`Sheet 3 0f 14
`Sheet 3 of 14
`
`US 6,754,873 Bl
`US 6,754,873 B1
`
`-------- - -- -
`
`FIG.‘{
`
`,--
`
`I-~
`,<""'
`I
`I
`I
`
`I
`I
`
`-------
`
`'=I
`c+
`
`()o -
`"9|?
`~
`
`.
`~
`
`
`
`U.S. Patent
`US. Patent
`
`Jun.22,2004
`Jun. 22, 2004
`
`Sheet 4 0f 14
`Sheet 4 of 14
`
`US 6,754,873 Bl
`US 6,754,873 B1
`
`- - -
`
`-- (
`
`
`
`I ' I
`
`--
`
`l
`
`\
`
`\
`
`\
`
`\
`
`I
`\
`
`
`
`U.S. Patent
`US. Patent
`
`Jun.22,2004
`Jun. 22, 2004
`
`Sheet 5 0f 14
`Sheet 5 of 14
`
`US 6,754,873 Bl
`US 6,754,873 B1
`
`.\"
`
`~I
`:4
`0
`
`I
`
`If?
`
`I
`
`K
`
`6
`~· ----
`:9;
`‘L
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 6 of 14
`
`US 6,754,873 Bl
`
`SCAN WEB PAGES AND SAVE
`FORWARDLINK AND BACKLINK
`SETS
`
`CALCULATE RELATED LISTS FOR
`THE WEB PAGES
`
`GENERATE A POPULAR LIST OF
`WEB PAGES THAT OCCUR MOST
`OFTEN IN THE RELATED LISTS
`
`301
`
`303
`
`305
`
`DONE
`
`FIG. 7
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 7 of 14
`
`US 6,754,873 Bl
`
`GENERATING A
`RELATED LIST
`
`+
`PROVIDE A BACKLINK SET OF WEB V
`PAGES FOR A SELECTED WEB
`PAGE
`
`+
`PROVIDE A FORWARDLINK SET OF V
`WEB PAGES FROM THE BACKLINK
`SET
`
`+
`
`ASSIGN A VALUE TO EACH
`FORWARD LINK IN THE WEB PAGES
`OF THE BACKLINK SET
`
`~
`GENERATE A SCORE FOR EACH
`WEB PAGE IN THE FORWARD LINK
`SET
`
`/
`
`/
`
`351
`
`353
`
`355
`
`357
`
`+
`
`GENERATE A LIST OF RELATED
`WEB PAGES
`
`+
`
`DISPLAY THE LIST OF RELATED
`WEB PAGES
`
`v 359
`
`361
`
`/
`
`+
`
`DONE
`
`FIG. 8
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 8 of 14
`
`US 6,754,873 Bl
`
`GET A FORWARD LINK FROM A
`WEB PAGE IN THE BACKLINK SET
`
`401
`
`403
`
`>----YES------+1
`
`REDUCE VALUE ASSIGNED TO
`FORWARD LINK
`
`407
`
`>----YES,------+1
`
`REDUCE VALUE ASSIGNED TO
`FORWARD LINK
`
`405
`
`409
`
`YES
`
`NO
`
`NO
`
`ASSIGN THE FORWARD LINK A
`VALUE
`
`411
`
`413
`
`NO +
`
`(_Do -NE )
`
`FIG. 9
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 9 of 14
`
`US 6,754,873 Bl
`
`DISPLAYING THE
`LIST
`
`GENERATE SYMMETRIC AND
`UNSYMMETRIC LISTS
`
`DISPLAY LINKS TO WEB PAGES IN
`SYMMETRIC LIST
`
`DISPLAY LINKS TO WEB PAGES IN
`UNSYMMETRIC LIST IF NOT IN
`POPULAR LIST
`
`451
`
`453
`
`455
`
`DONE
`
`FIG. 10
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 10 of 14
`
`US 6,754,873 Bl
`
`Try our special searches:
`
`Uncle Sam Search milllons of US govemmeni documents
`
`Linux Search the web's Linux resources
`
`F I &-. ! f
`
`http://v.n.vw.google.com/
`
`11/J/W
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 11 of 14
`
`US 6,754,873 Bl
`
`About 104143 matches for New York Times
`S hm,.j ng re.sults 1-10, Search took 0. 08 seconds
`How do I interpret the results?
`
`~~S"li"i
`~ ')1
`ww.nytim.es.co-m/
`Ncm! Tty out Goog!eScom ~ ·
`
`www .. nytimg,&2m/~~ght!ttml
`New! Try out GoogleScout
`
`www.nvtim.es.comJinfotwntents/~
`New, Tryout.~~
`
`\.VWW.nytimes.oomlyr/mo/dayJ
`New! Try out GoogleScout
`
`www.nytimes.oomllearningl
`Newt Try out GoogleScout
`
`The New York Times Syndicate
`... a. ne11· era, The New York Thnl!!l Syndicate is ready to ...
`.. mism, either. The New York TimllS Magazine is celebrating ...
`nytl!Jin.oomf Cached (Sk} New! Try out GooglcScout
`
`New York Today citv guide and event calendar
`... Q,d~ home delivery ofThe New York Times JUNE 30- JULY 6 ...
`... doors in New Yorit," says William Grimes in roday's Times ...
`www.nytoday.roml Cacl;ed ($91::) New! Try out GoogleSoout
`
`www.nytcoml
`New! Try uut GoogleSoout
`
`Hom~ Page
`nytimeofax.coml Cached Ok) Newl Try out GoosleSrout
`
`n:vtime.s. i;QJti/
`Newt Try out ~.ikS.~\tt
`
`f f G. I J A
`
`GotHJooo~:)ovog l e ~.-..
`
`1 ~ :J. 1 l'i
`
`!'!
`
`'1.
`
`!l. Ei
`
`:til: ~
`
`Resui1Page:
`
`. Eew Yo~::k Time~
`
`,
`
`ewquery,
`N
`
`http:l/w"'w. goo gle. cornlsearch?q= New+ York:+' l'Imes
`
`11/3JW
`
`'fry your query on other mgines
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 12 of 14
`
`US 6,754,873 Bl
`
`Alta Vista Excite HQW.Qt Info seek~ ~ fib.oo!. Amazon Open Directory e<iro\q!s
`
`Copyright 01999 Google Inc. - AbQyt
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 13 of 14
`
`US 6,754,873 Bl
`
`G. '
`
`-'1·-·e_· __ .,,
`
`. ·,
`.·•'·'
`.·.
`
`,:•
`
`GoogleS&9ill has automatically scouted the web for pages related to ~11$ime:>,oomf
`!Jf:::,'!!e ('TOogieScaut ~?..!f!.~t! list of a company~5 r:ompetitors . .. ---~------------
`24 related pages for www..nytimeli.com/
`Showing related pages 1-10, Search took 0_05 seconds
`How do I intemr~ltlll=<_I:e.@!ts!
`
`www.nytirus.cgmt
`Newt Try out GOQ.SIP.~llt
`
`CNN Intemc.tiY~
`category books flowers music office travel video MA1N PAGE WORLD US_ LOCAL POLITICS WEATHER
`BUSINESS SPORTS SCI· TECH NATURE ENTERTAINMENT BOOKS TRAVEL FOOD HEALTH STYLE IN-__ _
`www.cMooml Ca.cl:wd (61kl Newt Try oot ~!S9om
`
`U.SATODAY
`USA iODAYdeskTOPnew~ Swcll the web: Nationline Wasbington World Stocks Scores Ba!eball NBA NHL Tech
`Books Ca~ Small Business Travel Mi!J~rmium Hot Site$ W~b tech Politic .. _
`www.WU!t<Jday.com/ C~ N!!Wl Try out GooWeSoout
`
`washlngt~.~om: News Front
`-----;---About !he $lte
`Advertising lnfo_ --:::-:::----- Send us feedback and
`correctio!lll
`Olb.e~ Po:rt Co. W<ID sires Wednesday, .fune 30, ...
`www . .....-ashingtQnpoot.rom/ Newl Try out Goo¢eSoout
`
`Los Angeles Tima Web Site
`Wedn~y. Jun~ 30, 1999 06:59PM News--~--- Main Page News Wite:<J AP The Wire--------- Today's Paper
`----~--- Co)QQm~ C<;JiliJnelliMJi Food Gov't & Polmcs ...
`WWI\'.Iatime5.coml Cached (35k) New! 'fry oot Googh:Soout
`
`~
`New! Try out Qpog{e!kou:t
`
`W'OCi'/&_hlgtgo.tnbune.com/
`New! fry out Gm!deSoout
`
`www.abcnews.coml
`New! Try out QQ.ggl~:SJmrt
`
`The Christian ~en~ MQnitor Electronic Edition
`This page uses frames, but your browser doesn'l suppo-rt them.
`www.(smonitor.com.l Cached (Ok) New, Try out ~~ut
`
`:W.!ll~<QJlle to Mercury Cente~
`Silicon Valley news and information from the PWi~ Prize.winning San Jo-se M~rcmy News. ·-Last updEit~ Ell S:21 p.m.
`PDT Wednesday, June 30, 1999 Fed raises rate.s as expei)L
`http:ltwv.w. google.cornlsearch'fslte=sea:rch&nrun~ I O&q=related~bJ91 U&. ... :www.nytunes.com
`
`l 1 /319'1
`
`
`
`U.S. Patent
`
`Jun.22,2004
`
`Sheet 14 of 14
`
`US 6,754,873 Bl
`
`Gooogle~ ..... ~
`R>:sult Page;
`
`~- ~
`
`1
`
`;!
`
`Copyright .01999 Coogle Inc. -~
`
`·""'-'--"~~--=--~---·---'~~-···-·---·~-==
`
`prG./38
`
`http:/ /w'A'W. googl e. com/search'lstte=s-earch&num= l O&q=related:oJYl u&_ .. : www. nyt1mes.com
`
`ll/3!9Y
`
`
`
`US 6,754,873 Bl
`
`1
`TECHNIQUES FOR FINDING RELATED
`HYPERLINKED DOCUMENTS USING LINK(cid:173)
`BASED ANALYSIS
`
`2
`Therefore, what is needed are innovative techniques for
`finding related hyperlinked documents without requiring
`human categorization of the information.
`
`This application claims the benefit of U.S. Provisional 5
`Application No. 60/155,277, filed Sep. 20, 1999, which is
`hereby incorporated by reference.
`
`SUMMARY OF THE INVENTION
`
`The present invention provides innovative techniques for
`finding related hyperlinked ocuments using link-based
`analysis. The link structure of the hyperlinked documents is
`analyzed in order to find hyperlinked documents that are
`related to and at the same level of generality of a hyperlinked
`document. The invention can be utilized any number of
`ways including as an additional feature for a word-based
`search engine or as an addition on a web browser. Some
`specific embodiments of the invention are described below.
`In one embodiment, the invention provides a computer
`implemented method of generating lists of hyperlinked
`documents that are related to a given or selected hyperlinked
`document. A first set of hyperlinked documents that have a
`20 forward link to the selected hyperlinked document is pro(cid:173)
`vided. Additionally, a second set of hyperlinked documents
`that are pointed to by the forward links in the hyperlinked
`documents in the first set is provided. A value is assigned to
`each forward link in each of the hyperlinked documents in
`25 the first set, with the value being reduced for a forward link
`if there are multiple hyperlinked documents from the same
`host as the hyperlinked document that includes the forward
`link. A score is generated for each hyperlinked document in
`the second set according to the values of the forward links
`30 pointing to the hyperlinked document. Accordingly, a list of
`related hyperlinked documents is generated from the second
`set according to the score of the hyperlinked documents. In
`a preferred embodiment, the related hyperlinked documents
`are displayed in an order based on their score.
`In another embodiment, the invention provides a com(cid:173)
`puter implemented method of generating lists of related
`hyperlinked documents. A first set of hyperlinked documents
`that have a forward link to a selected hyperlinked document
`is provided. A second set of hyperlinked documents that are
`pointed to by the forward links in the hyperlinked docu(cid:173)
`ments of the first set is also provided. A value is assigned to
`each forward link in each of the hyperlinked documents in
`the first set, with the value being reduced according to the
`number of forward links in the hyperlinked document that
`includes the forward link. A score is generated for each
`hyperlinked document in the second set according to the
`values of the forward links pointing to the hyperlinked
`document. Lastly, a list of related hyperlinked documents is
`generated from the second set according to the score of the
`hyperlinked documents.
`Other features and advantages of the invention will
`become readily apparent upon review of the following
`description in association with the accompanying drawings,
`where the same or similar structures are designated with the
`55 same reference numerals.
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`This application is related to U.S. patent application No.
`08/655,149, filed May 30, 1996, now U.S. Pat. No. 5,754,
`760, which is hereby incorporated by reference.
`
`BACKGROUND OF THE INVENTION
`
`10
`
`15
`
`35
`
`The present invention relates to hyperlinked document
`systems. More specifically, the invention relates to tech(cid:173)
`niques for finding related hyperlinked documents using
`link-based analysis.
`The Internet, and more specifically the World Wide Web,
`provides users all over the world with virtually unlimited
`amounts of information in the form of hyperlinked docu(cid:173)
`ments. As new information is added to the Web, more
`hyperlinked documents are added that include links to the
`existing web of information.
`One of the reasons for the almost explosive growth of
`information on the Web is that virtually anyone can add
`hyperlinked documents, which will be immediately avail(cid:173)
`able to users around the world. For better or worse, the Web
`is virtually unstructured, meaning that users are free to add
`information to the Web in almost any way they desire.
`Although this provides great flexibility in adding inform a(cid:173)
`tion to the Web, it can significantly increase the difficulty in
`finding information that is desired.
`Probably the most popular mechanism for finding infor(cid:173)
`mation on the Web is to use word-based search engines.
`Word-based search engines allow a user to enter words,
`phrases, and other search criteria so that the search engine
`can retrieve the hyperlinked documents that best match the 40
`user's search criteria.
`Word-based search engines have been tremendously suc(cid:173)
`cessful in allowing users to find the information they desire
`on the Web. There are times, however, when a user wants to
`find hyperlinked documents that are related to and at the 45
`same level of generality to a selected hyperlinked document.
`For example, a user may be viewing a company's web site
`and wish to see other web sites for competitive companies.
`As another example, a user may have found a university's
`computer science department web site and the user may 50
`desire to see computer science department web sites of other
`universities. Traditional word-based search engines may not
`provide satisfactory results for these types of desired infor(cid:173)
`mation.
`Some web sites have recognized this deficiency and have
`taken on the pain staking process of categorizing the infor(cid:173)
`mation on the Web. Although it is possible that the related
`hyperlinked documents that are desired are in a single
`category, it often happens that the related hyperlinked docu(cid:173)
`ments are spread throughout multiple categories. For 60
`example, if information regarding each university is placed
`in a separate category, one will not find a single category that
`includes information regarding the computer science depart(cid:173)
`ments of multiple universities. Additionally, categorizing the
`information on the Web takes a considerable amount of time 65
`and typically requires human decision making to categorize
`the information.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 illustrates an example of a computer system that
`can be utilized to execute the software of an embodiment of
`the invention.
`FIG. 2 illustrates a system block diagram of the computer
`system of FIG. 1.
`FIG. 3 illustrates a network of multiple computer systems
`such as the Internet.
`FIG. 4 shows an example of linked web pages in order to
`demonstrate scoring techniques of the present invention.
`
`
`
`US 6,754,873 Bl
`
`3
`FIG. 5 shows linked web pages of FIG. 4 to more clearly
`show a scoring technique that reduces values of links if the
`web pages are from the same host.
`FIG. 6 shows a web page of FIG. 4 in order to more
`clearly indicate the scoring technique that reduces a value of 5
`a link if there are multiple links in a hyperlinked document.
`FIG. 7 shows a flowchart of a process of preprocessing a
`web of information.
`FIG. 8 shows a flowchart of a process of generating a list
`of related web pages.
`FIG. 9 shows a flowchart of a process of assigning values
`to forward links in a backlink set of web pages.
`FIG. 10 shows a flowchart of a process of displaying the
`list of related web pages.
`FIG. 11 shows an example of a web page for a word-based
`search engine that includes an embodiment of the invention.
`FIGS. 12A and 12B show a web page including word(cid:173)
`based search results and a link to find related web pages
`according to a link-based analysis.
`FIGS. 13A and 13B show a web page including related 20
`web pages from a link-based analysis.
`
`10
`
`4
`The system bus architecture of computer system 1 is
`represented by arrows 67. However, these arrows are illus(cid:173)
`trative of any interconnection scheme serving to link the
`subsystems. For example, a local bus could be utilized to
`connect the central processor to the system memory and
`display adapter. Computer system 1 shown in FIG. 2 is but
`an example of a computer system suitable for use with the
`invention. Other computer architectures having different
`configurations of subsystems can also be utilized.
`FIG. 3 shows a network of multiple computer systems. A
`network 101 provides communication between multiple
`computer systems 1. In a wide area network such as the
`Internet, some of the computer systems are servers (or hosts)
`and provide access to resources or services to client com(cid:173)
`puter systems on the network. With respect to web pages,
`15 there are multiple server computer systems that store the
`web pages that make up the Web. The web pages typically
`include links in the form of uniform resource locators
`(URLs) that are a link to another web page, whether it is on
`the same server or a different one.
`As described above, the Web is a distributed network of
`web pages. Networks of hyperlinked documents can also be
`present in local area networks (e.g., intranets). The operation
`of these intranets is very similar to the Internet except that
`it is not uncommon for all or a majority of the hyperlinked
`documents of an intranet to be stored on a single server
`computer system.
`Now that typical computer systems and networks have
`been described, it may be beneficial to show an example of
`30 related web pages. FIG. 4 shows an example of linked web
`pages to demonstrate how related web pages can be obtained
`from linked-based analysis. A web page 201 is the selected
`web page of interest in which it is desirable to find related
`web pages.
`Web pages 203-207 include one or more links to web
`page 201 as indicated by the arrows. A backlink set 209
`includes web pages 203-207 meaning that the backlink set
`of web pages is the set of web pages that include a link to
`a specific web page, shown here as web page 201. A host 211
`40 stores web pages 205-207 as indicated, the importance of
`which will be described in more detail below.
`Web pages 213-216 are web pages that are pointed to by
`at least one web page in backlink set 209. In other words,
`there is at least one web page in backlink set 209 that
`45 includes a link that points to one of web pages 213-216. A
`forwardlink set 218 includes web pages 213-216 and is
`called such because it is derived from forward links from the
`backlink set. In order to facilitate describing the invention,
`web page 201 will not be described as being a member of
`50 forwardlink set 218. However, web page 201 can be con(cid:173)
`sidered to be a ember of forwardlink set 218 in some
`embodiments as will be described below.
`For some very popular web pages, the backlink set can be
`quite large (e.g., a million links or more). Therefore, in a
`preferred embodiment, for backlink sets that have more than
`a predetermined number of links (e.g., 10,000), a random
`sampling of the links is utilized. In other words, if the
`backlink set has more than a predetermined number of links,
`fewer links can be selected at random (or selected in another
`manner) to be processed.
`The web pages of forwardlink set 218 can be thought of
`as being related to and at the same level of generality of web
`page 201. This is because there is at least one web page that
`includes a link to both web page 201 and each of the web
`pages in forwardlink set 218. Thus, a list of related web
`pages can be generated from the web pages and the backlink
`set.
`
`25
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`In the description that follows, the present invention will
`be described in reference to embodiments that generate lists
`of related web pages from the Word Wide Web. More
`specifically, the embodiments will be described in reference
`to generating lists of related web pages utilizing link-based
`analysis requiring little or no human decision making.
`However, embodiments of the invention are not limited to
`any particular environment, application or specific imple(cid:173)
`mentation. For example, the embodiments described below
`are in reference to web pages but the invention can be
`advantageously applied to any type of hyperlinked docu- 35
`ment. Therefore, the description of the embodiments that
`follows is for purposes of illustration and not limitation.
`FIG. 1 illustrates an example of a computer system that
`can be used to execute the software of an embodiment of the
`invention. FIG. 1 shows a computer system 1 that includes
`a display 3, screen 5, cabinet 7, keyboard 9, and mouse 11.
`Mouse 11 can have one or more buttons for interacting with
`a graphical user interface. Cabinet 7 houses a CD-ROM
`drive 13, system memory and a hard drive (see FIG. 2)
`which can be utilized to store and retrieve software pro(cid:173)
`grams incorporating computer code that implements the
`invention, data for use with the invention, and the like.
`Although CD-ROM 15 is shown as an exemplary computer
`readable storage medium, other computer readable storage
`media including floppy disk, tape, flash memory, system
`memory, and hard drive can be utilized. Additionally, a data
`signal embodied in a carrier wave (e.g., in a network
`including the Internet) can be the computer readable storage
`medium.
`FIG. 2 shows a system block diagram of computer system 55
`1 used to execute the software of an embodiment of the
`invention. As in FIG. 1, computer system 1 includes monitor
`3 and keyboard 9, and mouse 11. Computer system 1 further
`includes subsystems such as a central processor 51, system
`memory 53, fixed storage 55 (e.g., hard drive), removable 60
`storage 57 (e.g., CD-ROM drive), display adapter 59, sound
`card 61, speakers 63, and network interface 65. Other
`computer systems suitable for use with the invention can
`include additional or fewer subsystems. For example,
`another computer system could include more than one 65
`processor 51 (i.e., a multi-processor system) or a cache
`memory.
`
`
`
`US 6,754,873 Bl
`
`6
`the number of links in the web page. As shown in FIG. 6,
`web page 204 includes 4 links and so the value of each link
`from the web page is V.. In this way, web pages that have
`relatively many links do not get more than their fair share of
`"votes" for related web pages. In a preferred embodiment,
`the value for each link is 1 divided by the number of links
`in the web page plus a predetermined amount (e.g., 10).
`Although the techniques described in reference to FIGS.
`5 and 6 can be used alone, in preferred embodiments, the
`10 techniques are used in conjunction to generate an accurate
`list of related web pages. The techniques can be combined
`by multiplying all the values for a given link in order to
`determine the final value for the link. As an example, the
`following table shows the scores that would be generated for
`each of the web pages of forwardlink set 218 utilizing this
`technique:
`
`Web Page
`
`Links
`
`213
`214
`215
`216
`
`(1/2) + (1/4)
`(1/4) + (1/3 * 1/2)
`(1/4) + (1/3 * 1/2)
`(1/4) + (1/3 * 1/2)
`
`Score
`
`0.75
`0.415
`0.415
`0.415
`
`25
`
`5
`One way that the web pages of forwardlink set 218 can be
`scored according to "relatedness" to web page 201 is accord(cid:173)
`ing to the number of links to each web page from backlink
`set 209. Utilizing this technique, each web page in forward(cid:173)
`link set 218 has a score of 2 because there are two links to 5
`each web page from the web pages of the backlink set.
`Accordingly, the same scores indicate that all of the web
`pages of web page 218 are equally related to web page 201.
`For at least the following reasons, this result may not be
`satisfactory.
`Within a single host, the web pages typically include
`many links to other web pages within the same host. For
`example, a company's web site may include many links that
`interlink different web pages on the same host (or domain or
`other grouping of web pages). Thus, referring back to FIG. 15
`4, web page 201 could describe a product of the company
`while web pages 214, 215, and 216 could correspond to
`other web pages of the company on the same host that are
`relatively unrelated to the product from web page 201.
`Assuming that web page 213 describes a competing product 20
`from a competitor company (and hence likely a different
`host), it would be desirable for web page 213 to get a higher
`score of "relatedness" than web pages 214-216 because it is
`more likely the user is interested in a competing product than
`a relatively unrelated web page from the same host.
`Another problem is that some web pages have relatively
`few links while others have relatively many links. If each
`link is counted equally, the web pages with relatively many
`links gets more "votes" for the "relatedness" of the web
`pages. For example, referring to FIG. 4, web pages 203 and
`205-207 all include two links. However, web page 204 has
`four links, which is twice as many as in the other web pages
`of backlink set 209. Because web page 204 has more links,
`the web page has a greater impact on selecting related web
`pages than the other web pages of the backlink set. It would 35
`be desirable if the web pages of the backlink set were
`considered relatively equal regardless of the number of links
`each web page includes.
`It should be noted that FIG. 4 shows a very simple 40
`example in order to illustrate some of the problems that can
`be encountered in selecting related web pages. As one can
`imagine, typically the web pages include many more links
`and the number of web pages involved in determining
`related web pages is far greater than that shown in FIG. 4.
`Nevertheless, FIG. 4 is useful in illustrating some problems
`related to generating a list of related web pages. Also, the
`links were shown as coming from the bottom of the web
`pages for simplicity, but links are typically embedded within
`the text of the web pages as will be shown in subsequent
`figures.
`FIG. 5 shows one technique that can be utilized to reduce
`the importance of web pages from the same host. As
`described above in reference to FIG. 4, web pages 205-207
`are in the same host. Instead of giving each link from these
`web pages a value of, for example, 1, the links are given a
`value of 1 divided by the number of web pages that are from
`the same host, which in this case is 3. Accordingly, each link
`from web pages 205-207 are given a value of Ij3 as shown.
`By dividing the value for a link by the number of web pages 60
`in the same host, the amalgam of web pages from the same
`host have a total of 6* lf3=2 "votes," which would be the
`same as a single web page with two links from another host.
`FIG. 6 shows a technique in which to reduce the impor(cid:173)
`tance of the individual links from a web page with relatively
`many links. In order to reduce the value for each link from
`a web page with many links, the value of one is divided by
`
`45
`
`One or more of these techniques can also be combined with
`a measure of text-based similarity of the web pages.
`FIG. 7 shows a flowchart of a process of preprocessing the
`Web in order to generate banklink sets. The Web does not
`30 include backlink sets so it is beneficial to scan the Web and
`identify backlink sets for each of the web pages on the Web.
`This allows related web pages to be identified in a much
`more efficient manner because the backlink sets are already
`calculated.
`At step 301, web pages are scanned and forwardlink and
`backlink sets are saved. As each web pages is processed, the
`forward links in the web page are identified and saved as a
`forwardlink set. Additionally, the current web page is added
`to the backlink set of each of the web pages that are pointed
`to by links in the current web page. The generation of
`forwardlink and backlink sets is well suited for an automated
`process and can be continually run to identify and save
`changes in the Web.
`In addition to calculating backlink sets, lists of related
`web pages can be calculated for each of the web pages on the
`Web at a step 303. The list of related web pages can be
`calculated as described below in reference to FIGS. 8 and 9.
`Although optional, the lists of related web pages can be
`utilized to generate a popular list of web pages that occur
`most often in the related lists at a step 305.
`The popular list includes web pages that occur most
`frequently in the lists of related web pages. For example, a
`popular site like the yahoo web page may occur very
`frequently in the lists of related web pages. The popular list
`55 may be a predetermined number (e.g., 2000) of web pages
`that occur most often in the lists of related web pages. The
`popular list can be utilized to more accurately display the list
`of related web pages as described in more detail in reference
`to FIG. 10.
`FIG. 8 shows a flowchart of a process of generating a list
`of related web pages. For example, the list of related web
`pages can be generated upon request from a user for a web
`page being displayed in a web browser or for a web page that
`is being displayed (e.g., as a link) in results from a web
`65 search.
`At a step 351, a backlink set of web pages is provided for
`a given or selected web page. The selected web page is the
`
`50
`
`
`
`US 6,754,873 Bl
`
`7
`web page for which the list of related web pages should be
`generated. The backlink set of web pages can be easily
`determined from the backlink set generated during prepro(cid:173)
`cessing of the web as described in FIG. 7.
`A forwardlink set of web pages is provided from the 5
`backlink set at a step 353. The forwardlink set of web pages
`are the web pages that are pointed to by the forward links in
`the backlink set for the selected web page. The forwardlink
`set can be generated during preprocessing but since the time
`to generate the forwardlink set is minimal, in some embodi- 10
`ments the forwardlink set is generated in real time when the
`user asks for a list of related web pages for the selected web
`page.
`At a step 355, a value is assigned to each forward link of
`the web pages of a backlink set. A process of assigning a 15
`value to the forward links will be described in more detail in
`reference to FIG. 9.
`A score is generated for each web page in the forwardlink
`set at a step 357. The score can be generated by adding
`together all the values for each of the forward links that
`points to each web page in the forwardlink set.
`Once the scores are generated for each of the web pages
`in the forwardlink set, a list of related web pages can be
`generated at a step 359. The list of related web pages can be
`generated from the forwardlink set according to the score of 25
`the web pages. In other words, the score is an indication of
`the relatedness to the selected web page and the higher the
`score, the more related the web page is.
`At a step 361, the list of related web pages is displayed.
`The displayed list of related web pages can be a predeter(cid:173)
`mined number of the most highly related web pages