`US008516357Bl
`
`(12) United States Patent
`Harik
`
`(IO) Patent No.:
`(45) Date ofPatent:
`
`US 8,516,357 Bl
`*Aug. 20, 2013
`
`(54) LINK BASED CLUSTERING OF
`BYPERLINKED DOCUMENTS
`
`(75)
`
`Inventor: Georges R. Harik, Mountain View, CA
`(US)
`
`(73) Assignee: Google Inc., Mountain View, CA (US)
`( * ) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 3 5
`U.S.C. 154(b) by 446 days.
`This patent is subject to a terminal dis(cid:173)
`claimer.
`
`(21) Appl. No.: 121959,141
`
`(22) Filed:
`
`Dec. 2,2010
`
`Related U.S. Application Data
`(63) Continuation of application No. 11/693,522, filed on
`Mar. 29, 2007, now abandoned, which
`is a
`continuation of application No. 09/636,052, filed on
`Aug. 10,2000, now Pat. No. 7,213,198.
`(60) Provisional application No. 60/148,854, filed on Aug.
`12, 1999.
`
`(51)
`
`(2006.01)
`(2006.01)
`
`Int. Cl.
`G06F 17100
`G06F 17120
`(52) U.S. CI.
`USPC ............ 715/200; 715/206; 715/207; 715/208
`(58) Field of Classification Search
`USPC .................................. 715/200,206,207,208
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`5,761,436 A
`6/1998 Nielsen
`5,835,905 A + 1111998 Pirolli et al ........................... 1/1
`5,847,708 A + 1211998 Wolff ............................ 715/764
`5,875,446 A
`2/1999 Brown etal.
`
`401
`
`--
`
`5,890,172 A
`5,920,859 A
`5,941,944 A
`5,999,664 A "'
`6,038,574 A
`6,049,799 A *
`--6;078;913 A +
`6,112,203 A +
`6,122,647 A
`6,167,397 A
`6,182,066 Bl *
`6,272,505 B I *
`6,285,999 B I
`6,345,253 B1
`6,353,827 B1 •
`6,356,898 B2
`
`3/1999 Borman et al.
`7/1999 Li
`8/1999 Messerly
`1211999 Mahoney et al .............. 382/305
`3/2000 Pitkow et al.
`4/2000 Mangat et a!. ........................ 1/1
`612000 Aoki-et ai;· . .-.. : ............... ~ .. ~.-.. ; 1!1-
`8/2000 Bharat eta!. .......................... 1/1
`9/2000 Horowitz et at.
`12/2000 Jacobson et al.
`lf2001 Marques ............................... 1/1
`8/2001 DeLaHuerga .............. 715/235
`912001 Page
`2/2002 Viswanalhan
`3/2002 Davies eta!. ................. 707/769
`3/2002 Cohen et al.
`(Continued)
`
`OTHER PUBLICATIONS
`
`Judit Bar-IIan et al., "3"' International Workshop on Web Dynamics",
`May 18, 2004, 125 pages.
`
`(Continued)
`
`Primary Examiner- Doug Hutton, Jr.
`Assistant Examiner- Soumya Dasgupta
`(74) Attorney, Agent, or Firm- Harrity & Harrity, LLP
`
`(57)
`
`ABSTRACT
`
`Techniques for grouping hyperlinked documents are pro(cid:173)
`vided. Links near or in the neighborhood of the hyperlinked
`documents are analyzed in order to group the hyperlinked
`documents by topic. For example, links that are search results
`can be grouped by identi:fYing other hyperlinked documents
`that have multiple forward links to the search results. The
`search results can then be grouped according to the forward
`links of the other hyperlinked documents.
`
`20 Claims, 18 Drawing Sheets
`
`SATURN
`PLANET
`PAGE
`
`. .
`
`/
`/
`
`/
`
`401
`
`SATURN
`PLANET
`PAGE
`
`403
`
`EXHIBIT 2108
`Facebook, Inc. et at.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`
`
`US 8,516,357 Bl
`Page 2
`
`(56)
`
`References Cited
`
`200110047375 A1
`200110049698 A1
`2002/0035611 A1
`
`1112001 Fest
`12/2001 Hsu eta!.
`3/2002 Dooley
`
`OTHER PUBLICATIONS
`
`Haffner eta!., "Request-Prediction and Hyperlink-Proposals", Nov.
`2000, 177 pages.
`Blank et al., "Interactive Gradebook: The Missing (Hyper)Link",
`1998, 5 pages.
`Rennie et al., "Using Reinforcement Learning to Spider the Web
`Efficiently", 1999, 10 pages.
`Rennie eta!., "Efficient Web Spidering with Reinforcement Learn(cid:173)
`ing", 1999, 13 pages.
`* cited by examiner
`
`U.S. PATENT DOCUMENTS
`6,385,602 B1
`5/2002 Tso et al.
`6,446,061 B1 * 9/2002 Doerre eta!.
`6,457,028 B1
`9/2002 Pitkow eta!.
`6,510,406 B1
`112003 Marchisio
`6,549,935 B1
`4/2003 Lapstun et a!.
`6,562,077 B2 * 5/2003 Bobrow eta!.
`6,574,644 B2 * 6/2003 Hsu eta!.
`6,665,715 B1
`12/2003 Houri
`6,675,170 B1
`112004 Flake
`6,728,752 B1
`4/2004 Chen eta!.
`6,990,628 B1 *
`112006 Palmer eta!.
`7,213,198 B1
`5/2007 Harik
`
`................. 707/738
`
`............... 715/204
`715/205
`
`................. 715/234
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 1 of 18
`
`US 8,516,357 Bl
`
`,.--1
`
`11
`
`FIG.1
`
`51
`
`53
`
`55
`
`,.--1
`
`57
`
`67
`
`65
`
`61
`
`63
`
`59
`
`9
`
`11
`
`FIG. 2
`
`
`
`U.S. Patent
`US. Patent
`
`Aug. 20, 2013
`Aug. 20, 2013
`
`Sheet 2 of 18
`Sheet 2 of 18
`
`US 8,516,357 B1
`US 8,516,357 Bl
`
`1
`
`
`
`1
`
`D_
`
`~a:
`
`D_
`
`~o:
`
`FIG. 3
`HG. 3
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 3 of 18
`
`US 8,516,357 Bl
`
`205
`
`CARS
`PAGE
`
`207
`
`CARS
`PAGE
`
`215
`
`217
`
`ASTRONOMY
`PAGE
`
`SCIENCE
`PAGE
`
`SATURN
`CAR
`PAGE
`
`SATURN
`CAR
`PAGE
`
`SATURN
`PLANET
`PAGE
`
`SATURN
`PLANET
`PAGE
`
`201
`
`203
`
`209
`
`211
`
`SftJURN
`PLANET
`PAGE
`
`213
`
`FIG.4
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 4 of 18
`
`US 8,516,357 Bl
`
`RECEIVE LINKS TO HYPERLINKED
`DOCUMENTS
`
`IDENTIFY OTHER HYPERLINKED
`DOCUMENTS THAT HAVE MULTIPLE
`FORWARD LINKS TO THE
`HYPERLINKED DOCUMENTS
`
`GROUP THE HYPERLINKED
`DOCUMENTSACCORrnNGTOTHE
`FORWARD LINKS OF THE OTHER
`HYPERLINKEDDOCUMENTS
`
`301
`
`303
`
`305
`
`FIG. 5
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 5 of 18
`
`US 8,516,357 Bl
`
`401
`
`403
`
`FIG. 6A
`
`401
`
`SATURN
`PLANET
`PAGE
`
`I
`
`I
`
`I
`I
`
`I
`
`I
`
`I
`
`I
`I
`
`;
`
`SATURN
`PLANET
`PAGE
`
`403
`
`401
`
`FIG. 68
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 6 of 18
`
`US 8,516,357 Bl
`
`501
`
`503
`
`505
`
`507
`
`509
`
`511
`
`515
`
`RECEIVE LINKS TO HYPERLINKED
`DOCUMENTS AS THE SEARCH SET
`
`EXPAND THE SEARCH SET TO
`GENERATE AN EXPANDED SET
`
`GENERATE A BACK LINK LIST OF
`HYPERLINKED DOCUMENTS THAT
`POINT TO THE EXPANDED SET
`
`CALCULATE SIMILARITY MEASURE
`FOR THE HYPERLINKED
`DOCUMENTS IN THE EXPANDED
`SET
`
`MERGE HYPERLINKED
`DOCUMENTS (OR GROUPS) IN THE
`EXPANDED SET
`
`CALCULATE SIMILARITY MEASURE
`FOR THE MERGED GROUPS IN THE
`EXPANDED SET
`
`NO
`
`513
`
`YES
`
`DISPLAY SEARCH RESULTS IN
`GROUPS
`
`DONE
`
`FIG. 7
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 7 of 18
`
`US 8,516,357 Bl
`
`jaguar
`
`I 0 results
`
`Google Searchi'm feeling lucky
`
`At least 9749 matches for jaguar
`Showing results 1-10, Search took 0.14 seconds
`Clickinz on a red bor searches tbr backlinks (cim:ions).
`How do I interpret the results?
`
`--=:::1 Jaguar Cars Global Home Page
`... Jaguar Cars Official Web Site. Jaguar Cars Official Web Site ...
`... Jaguar Cars Official Web Site. Jaguar Cars Official Web Site ...
`www.jaguarcars.com/ Cached (5k)
`
`t USS Jaguar
`-
`... to the USS Jaguar, the first and largest Star Trek site for kids. I am ...
`... be like a fictitious starship named Jaguar. Here anyone under 18 can join ...
`www.worldkids.net/jaguar/ Cached (5k)
`
`1 The Jaguar Collection - Official Web site
`-
`www.collection.co.ukf Cached (2k)
`
`• Welcome to Jag-lovers- the Jaguar Enthusiasts place on the Internet
`-
`... -INFORMATION Brochures 1 Buyer's Guide I Jaguar Specialists Web!.-
`... (Windows) I Windows Wallpaper I-JAGUAR MODEL RELATED E-Type I Lump ...
`www.jag-lovers.org/ Cached (l4k)
`
`1 Jaguar/Daimler used parts exchange
`-
`... Brochures Buyer's Guide Jaguar Specialists Web ...
`... (Windows) Windows Wallpaper -JAGUAR MODEL RELATED E-Type ...
`www.jag-lovers.org/HyperNews/get/jagswap.html Cached (87k)
`
`-
`
`• Jaguar Interactive II-- The Premier 24-Hour Atari Jaguar Discussion Board
`. Jaguar Interactive II- Hosted by Atari Gaming Headquarters- Other.. .
`.. .Jag-@·Zine I JagFest '98 1 JagFest '991 Jaguar, MA I Jaguar's ...
`www.atarihq.com/interactive/ Cached (30k)
`
`1 Classic Car Source -- Welsh Jaguar Classic Car Museum
`-
`... WELSH JAGUARCLASS1C CAR MUSEUM And JAGGIN' AROUND RESTAURANT AND PUB "A ...
`... CELEBRATION OF THE GLORY DAYS OF THE JAGUAR" The WELSH JAGUAR. ..
`www.ciassicar.com/museums/welshjag/welshjag.htm Cached (3k)
`
`1 Jaguar Club of Florida
`-
`... Welcome to the Jaguar Club of Florida! This i.~ nur place on the ...
`... meetings, sponsored events and activities! Jaguar Cooks 1 -A brand new ...
`www.magicnet.net/-jaguartjcof.html Cached (6k)
`
`-
`
`1 Tomb of the Jade Jaguar Slide Lecture, Maya treasures ofTikal, Guatemala
`
`FIG. 8A
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 8 of 18
`
`US 8,516,357 Bl
`
`... the JADE JAGVAR, TIKAL, GUATEMALA an informative slide lecture on Vlaya ...
`... available. The Tomb of the Jade Jaguar at Tikal (Burial 196, Structure ...
`www.maya-art-books.org/html/jade_Jecture.html Cached (9k)
`
`• Jaguar Sovereign
`-
`... us] Alexandra a 1986 Jaguar XJ6 Sovereign 4.2 IMPORTANT.. .
`... way affiliated with or connected to Jaguar Cars Ltd. The official website ...
`www.bitcon.no/-gunnarisovereign.htmlCached (5k)
`GOOOOOOOOOGLE--+
`Resu!! Page: I 2 3 4 5 6 7 8 9 10 Next ;)age
`
`New query: .
`Jaguar Google Search
`Search within results?
`Try your query on other engines
`AltaVista Excite HotBot lnfoseek Lycos DejaNews Yahoo! Amazon Open Directory eGroups
`
`About Google!
`Copyright© 1999 Google Inc.
`
`FIG. 88
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 9 of 18
`
`US 8,516,357 Bl
`
`603
`
`605
`
`607
`
`FORD MOTOR
`COMPANY
`PAGE
`
`100 HOT
`CAR& TRUCK
`PAGE
`
`PRN
`AUTOMOBILE
`LINKS PAGE
`
`JAGUAR CARS
`GLOBAL PAGE
`
`601
`
`JAGUAR
`CARS EUROPE
`PAGE
`
`JAGUAR
`CARS ASIA
`PAGE
`
`JAGUAR CARS
`SOUTH AMERICA
`PAGE
`
`609
`
`611
`
`613
`
`FIG. 9
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 10 of 18
`
`US 8,516,357 Bl
`
`EXPANDED SEARCH
`SET
`
`FIG. 10
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 11 of 18
`
`US 8,516,357 Bl
`
`801
`\
`
`HASHTABLE
`0
`
`2
`
`E,D
`
`A,B,C
`
`A
`
`• • •
`FIG. 11
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 12 of 18
`
`US 8,516,357 Bl
`
`A
`
`B
`c
`
`D
`
`E
`
`CO-CITATION MATRIX
`A B C D E
`
`1
`
`1
`
`2
`
`851
`
`I
`
`1
`
`FIG. 12A
`
`901
`
`RELATIONSHIP VECTOR
`
`I
`U) (~) O) (n
`FIG. 128
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 13 of 18
`
`US 8,516,357 Bl
`
`CO-CITATION (A, B)_ a NumBL(A) • NumBL(B)
`SIMILARITY (A, B) =
`w
`NumBL(A) • NumBL(B}
`
`EXAMPLE:
`CO-CITATION (A, B) = 1
`NumBL(A) = 2
`NumBL(B) = 2
`a= 10
`w = 100,000,000
`
`SIMILARITY (A, B) =
`
`10. 2. 2
`1 - - -
`100,000,000
`
`F
`
`~ 112
`
`FIG. 13
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 14 of 18
`
`US 8,516,357 Bl
`
`AUGMENTED RELATIONSHIP
`VECTOR
`
`951
`/
`f
`
`(J.s) Us) Uo) Uo)
`FIG. 14
`
`SORTED RELATIONSHIP
`VECTOR
`
`1001
`/
`f
`
`Uo) Oo) (Js) Us)
`FIG. 15
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 15 of 18
`
`US 8,516,357 Bl
`
`MERGE VECTOR
`A B C D E
`
`1051
`
`I F I F I G I G I ~
`FIG. 16
`
`1101 I
`
`CO-CITATION MATRIX AFTER
`THE MERGE
`A F G
`
`2
`
`0
`
`0
`
`A
`
`F
`
`G
`
`FIG. 17
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 16 of 18
`
`US 8,516,357 Bl
`
`Clustering for 'saturn'
`
`Invoking Project Georges ... please wait!
`
`Running grouper ...
`
`Waiting for it to finish ...
`
`Looking for results ...
`
`Reading the results ...
`
`Cleaning up ...
`
`Waiting for cleanup ...
`
`~1201
`
`• Saturn
`pds.jplnasa.gov/planets/welcome/saturn.htm (C:0.866863 P:0.55lll)
`• Cassini: Voyage to Saturn
`www.jpl.nasa.gov/cassini/ (C:0.855673 P:O.S8622l)
`• NSSDC Photo Gallery: Saturn
`nssdc.gsfc.nasa.gov/photo_gallery/photogallery-satum.html (C:0.855624 P:0.492393)
`• Saturn
`ww w.hawastsoc.org/solar/eng/satum.h!Jn (C:0.834942 P:0.51 0658)
`• The 1995-6 Saturn Ring Plane Crossings
`rillgside.arc.nasa.gov/wwwlrpxlrpx.html (C:0.810368 P:0.473472)
`• Saturn
`www.seds.org/nineplanets/nineplanets/sarurn.html (C:0.805337 P:0.55l736)
`• Saturn
`nssdc.gsfc.nasa.gov/planetary/planets/satumpage.html (C:O. 774866 P:0.492622)
`• No Title
`ringside.arc.na~a.govfwwwfsaturn/saturn.html (C:0,755906 P:0.45388)
`• Saturn Fact Sheet
`nssdc.gsfc.nasa.gov/p!anetary/factsheet/satumfact.hnnl (C :0.7 49!32 P:0.4 7 4296)
`• No Tille
`www.jpl.nasa.gov/cassini (C:0.7l0059 P:0.575479)
`• No Title
`
`FIG. 18A
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 17 of 18
`
`US 8,516,357 Bl
`
`newproducts.jpLnasa.govfsatuml (C:0.7055! P:0.483345)
`• Saturn Picture List
`www.seds.org/nineplanets/nineplanetsfpxsat.html (C:0.695906 P:0.49987)
`• Saturn
`spacer.com/nineplanetsfsaturn.html (C:0.688069 P:O .3 61 059)
`• Saturn's spokes in rings
`www.hawas:soc.org/solar!cap!satfspoke.htm (C:0.657886 P:O .416526)
`• Saturn Ring Plane Crossings of 1995-1996 (JPL)
`www.jpl.nnsn.gov/saturnf (C:0.656777 P:0.383i34)
`• Hubble Space Telescope Views Major Storm on Saturn
`www.hawa~tsoc.org/solar/cap/sat/vsatstrm.htm (C:0.65J!63 P:0.386542)
`• Saturn
`www.star.le.ac.uk!edu/nineplanetslsaturn.html (C:0.650382 P:0.341283)
`• Saturn Events
`www.isc.tamu.edu/-astrofsaturn.html (C:0.646718 P:0.464393)
`• Cassini UVIS Inernet Site
`lasp.colorado.edu/cassinif (C:0.645669 P:0.493858)
`• StarDate Online I Solar System Guide I Saturn
`stardate.utexas.edu/resourcesfssguidefsatum.html (C:0.64 P:0.482643)
`• SATURN IMAGES
`vraptor.jpl.nasa.govlvoyager/vgrsat_irr.g.html (C:0.639535 P:0.456519)
`• The Planet Saturn
`wwwflag.wr.usgs.gov/USGSF!ag/Space/wallisaturn.html (C:0.639053 P:0.452598)
`• No Title
`ringside.a:-c.nasa.gov/saturn/satrun.html (C:0.634215 P:0.43450 1)
`• Voyager Saturn Science Summary
`www.hawastosc.org/so lar/eng/v grsat.htm (C:0.628713 P:0.453529)
`• The Solar System - Saturn
`www.jpl.nasa.gov/solarsystem/satum/ (C:0.627907 P:0.523598)
`• Saturn Image & Animation Index
`www.hawastsoc.org/solar/capfsatJ (C:0.61 0045 P:0.386236)
`• Artwork of Saturn and Its Satellites
`www.jpLnasa.gov/cassini/Images/artwork! (C:0.608!87 P:0.446021)
`• Exploring The Planets - Pioneer Encounters Saturn
`www.nasm.edufceps/ETP/SATURN/satpioneer.html (C:0.606299 P:0.408835)
`• Saturn
`www.hpcc.astro. washiogton.edu/mirrors/nineplanets/saturn.html (C:O .587209 P:O .397482)
`• Saturn [OuluJ
`www.oulu.fi/-spaceweb/textbook/saturn.htm (C:0.581992 P:0.36847 5)
`
`• www.jpl.nasa.gov/sarurnfiau6!92.html (C:0.566929 P:0.355337)
`
`1201~
`
`• No Title
`www.sierranet.netf-elengyellastro.html (C:0.566929 P:0.3i5875)
`• No Title
`bang.lanl.govfsolarsys/saturn.htm (C:0.566929 P:0.430716)
`• No Title
`esther.la.asu.edu/asu _tes/TES _Editor/SOLAR_ SYST _ TOUR/Saturn.html (C:O .566929 P:0.442466)
`[nfo about Saturn and Outer Space
`www.jpl.nasa.gov/cassini/Morelnfofoutrchinfo.html (C:0.56 P:0.4!6022)
`• Pseudo-image of Saturn's Ring
`www.hawastsoc.org/solar/cap/sat/s-ring.htm ( C: 0. 5 56922 P:0.292241)
`• Saturn Without a Ring
`
`•
`
`FIG. 188
`
`
`
`U.S. Patent
`
`Aug. 20, 2013
`
`Sheet 18 of 18
`
`US 8,516,357 Bl
`
`www.hawastsoc.org/solar/cap/sat/satnr.htm (C:0.556922 P:0.29224l)
`• Saturn
`www.pitagoras.g 12.br/planetas/nineplanets/satum.html (C:0.554887 P:0.291173)
`• Saturn Images Menu
`emma.la.asu.edu/SOLAR~SYST_TOUR/Satum.html (C:0.551181 P:0.45784')
`• Saturn
`www.star.le.ac.uk/edu/solar/saturn.html (C:0.54 386 P:0.385916)
`• Saturn Viewer 1.1 Short Form
`ringside.arc.nasa.gov/www/rpx/viewer/shortform .html (C:O. 543 86 P:0.393 744)
`
`• No Title
`nssdc.gsfc.nasa.gov/photo _gallery/PhotoGallery-Saturn.html (C:0.535433 P:0.3 81491)
`• Saturn's Possible New Moons
`www.seds.org/nineplanets/nineplanets/newsats.html (C:0.535433 P:0.482292)
`
`~
`1201
`
`• Welcome to the Saturn Cars Web Site
`www.saturncars.com/ (C:0.997514 P:0.557076)
`
`• www.saturncars.com/communicationlindex3.htrnl (C:0.952756 P:0.508095)
`
`• The 1998 Saturn Line
`www.saturncars.com/car/models/ (C:0.666667 P:0.4589 15)
`e Saturn's Frequently Asked Questions
`www.saturncars.com/communication/faq.html (C:0.597633 P:0.432074)
`• Saturn of Arkansas
`www.saturnar.com/ (C:0.594143 P:0.31 I 772)
`• No Title
`www.gmcanada.com/eng!ishlvehicles/satu.html (C:0.584 795 P:O .418006)
`• Saturnalia -the Saturn Enthusiasts Site
`www.erols.com/core/ (C:0.581 395 P:0.353033)
`• Saturn of Fremeont Home Page
`www.saturnfr.com/ (C:0.579882 ?:0.392782)
`• About the SATURN web site
`www.saturncars.com/about/ {C:0.55118J P:0.411856)
`• Saturn Albany Home Pao-e
`www.saturnalbat;y.com/ (C:0.5?J 181 P:0.317662)
`
`1203
`
`• Sega Online- Products: Saturn
`www.sega.com/products/hardware/saturn/ (C:0.700085 P:0.402884
`• Game Revolution Sega Saturn Reviews
`www.game-revoluliou.com/games/saturn/saturn.htm (C:0.685039 P:0.465736
`• Phil's site--Sega Saturn World
`www.mcn.net'-montanasites/segaSaturn.htm (C:0.657798 P:0.345 I 74)
`e New and Used Video Games at Used GAMES!- SEGA SATURN:-)
`www.usedvideogame.com/segalsaturn/sat.htm (C:0.62992J P:0.343644)
`• Sega Saturn Index (Sega Force: Index: Systems: Saturn)
`www.cyberdrive.netf-gskalbalsatum.htm (C:0.543307 P:0.325 I 24)
`
`>- 1205
`
`FIG. 18C
`
`
`
`US 8,516,357 Bl
`
`1
`LINK BASED CLUSTERING OF
`HYPERLINKED DOCUMENTS
`
`RELATED APPLICATIONS
`
`This application is a continuation of U.S. patent applica(cid:173)
`tion Ser. No. 11/693,522, filed Mar. 29, 2007, which is a
`continuation of U.S. patent application Ser. No. 09/636,052,
`filed Aug. 10, 2000 (now U.S. Pat. No. 7,213,198), which
`claims the benefit of U.S. Provisional Application No.
`60/148,854, filed Aug. 12, 1999, the disclosures of which are
`hereby incorporated herein by reference.
`
`BACKGROUND OF THE INVENTION
`
`2
`forces the user to go through extra hoops in specifying her
`query. The user is in fact forced to think like a textual search
`engine and attempt to guess what words would be included in
`the topic she is interested in and not in other topics matching
`5 her original query. Guessing at the search engine's behavior
`can become an iterative and counterproductive nightmare.
`As the size of the Web continues to increase, it becomes
`increasingly more desirable to have innovative techniques for
`efficiently grouping hyperlinked documents by topic. By dis-
`10 playing the links to web pages grouped by topic, a search
`engine can provide some coherence (i.e., not jump from one
`topic to another and then back again) to the search results.
`Additionally, it would be beneficial to have techniques that
`are efficient at grouping search results by topic by analyzing
`15 the link structure of the Web.
`
`The present invention relates to grouping (or clustering)
`hyperlinked documents. More specifically, the invention
`relates to techniques for grouping hyperlinked documents
`from a search so that the links to hyperlinked documents
`about the same (or similar) topic can be displayed together.
`The World Wide Web (or "Web") contains a vast amount of
`information in the form ofhyperlinked documents (e.g., web
`pages). One of the reasons for the virtually explosive growth
`in the number ofhyperlinked documents on the Web is that
`just about anyone can upload hyperlinked documents, which 25
`can include links to other hyperlinked documents. Although
`there is no doubt that there is a vast amount of useful infor(cid:173)
`mation on the Web, the unstructured nature of the Web can
`make it difficult to find the information that is desired.
`A search engine attempts to return relevant information in
`response to requests from users. These requests usually come
`in the form of queries (e.g., sets of words that are related to the
`desired topic). Search engines typically return a number of
`links to web pages, with a brief description of those pages.
`Because the number of pages on the Web is huge, ensuring 35
`that the returned pages are relevant to the topic the user had in
`mind is a central problem in web searching. Possibly the
`simplest and most prevalent way of doing this is to find web
`pages containing all or many of the words included in the
`query, which can be called text-based searching.
`Text-based searching over the Web can be notoriously
`imprecise and several problems arise in its use. To begin with,
`usually a large number of web pages match the user's query.
`Displaying all of these pages to the user becomes impractical
`and some method of ordering these results should be used. 45
`Methods that assess and use the quality of the pages, returning
`only well-linked pages for example, can significantly
`improve the quality of the returned results. However, the
`returned results can still range over a number of different
`topics, only one of which the user had in mind.
`Consider, for example, a search query including only the
`word "Saturn." This query can refer to the Saturn brand of car,
`the planet Saturn, the Sega Saturn game system, or the Roman
`god Saturn. Most likely, the user is interested in only one of
`the above topics. However, a search engine searching for this 55
`word would come up with a mishmash of results from among
`all of these topics. A user interested in the Saturn car would
`have to wade through many irrelevant search results to get the
`information she desires.
`One solution to this problem utilizes user feedback. After 60
`the user notices that the results include links from a number of
`different topics, the user can narrow or focus the search to
`only the one topic in which the user is interested. For example,
`the user could add the word "car" to the query. This solution
`has at least two negative aspects. First, it can exclude high
`quality relevant pages, which include the word Saturn but not
`"car" (e.g., uses "automobile" instead). More importantly, it
`
`SUMMARY OF THE INVENTION
`
`20
`
`The present invention provides innovative techniques for
`grouping hyperlinked documents. In general, links near or in
`the neighborhood of the hyperlinked documents are analyzed
`in order to group the hyperlinked documents by topic. For
`example, links that are search results can be grouped by
`identifying other hyperlinked documents that have multiple
`forward links to the search results. The search results can then
`be grouped according to the forward links of the other hyper-
`linked documents. Embodiments of the invention can there(cid:173)
`fore utilize the existing link structure of the Web to group the
`search results according to topic. Some specific embodiments
`30 of the invention are described below.
`In one embodiment, the invention provides a computer
`implemented method of grouping hyperlinked documents.
`Links to hyperlinked documents are received and other hyper(cid:173)
`linked documents are identified that have forward links to the
`initial hyperlinked documents. The hyperlinked documents
`are grouped according to the forward links of the other hyper(cid:173)
`linked documents. Where the hyperlinked documents are
`results from a search, the links to the hyperlinked documents
`can be displayed in groups separated graphically on the
`40 screen.
`In another embodiment, the invention provides a computer
`implemented method of grouping hyperlinked documents.
`An initial set of links to hyperlinked documents is received.
`The initial set is expanded to include links to additional
`hyperlinked documents that have a forward link to or are
`pointed to by a hyperlinked document in the initial set,
`thereby generating an expanded set oflinks. A back link list of
`links to hyper! inked documents is identified that have forward
`links to the hyperlinked documents in the expanded set. The
`50 hyperlinked documents in the initial set are then grouped
`according to the forward links of the hyperlinked documents
`in the back link list of links. In other cases, the hyperlinked
`documents may be grouped according to link information that
`is more than one level back from those documents.
`In another embodiment, the invention provides a computer
`implemented method of grouping hyperlinked documents.
`Groups of hyperlinked documents are received, each group
`including at least one hyperlinked document. A similarity
`measure is calculated between a pair of groups ofhyperlinked
`documents, the similarity measure based on the number of
`hyperlinked documents that have forward links to each of the
`pair of groups ofhyperlinked documents. The pair of groups
`ofhyperlinked documents are merged if the similarity mea(cid:173)
`sure crosses a threshold. In some embodiments, the similarity
`65 measure can be reduced by the product of the number of
`forward links to each of the pair of groups of hyperlinked
`documents.
`
`
`
`US 8,516,357 Bl
`
`3
`In another embodiment, the invention provides a computer
`implemented method of grouping hyperlinked documents.
`Links to hyperlinked documents are received. The hyper(cid:173)
`linked documents are grouped according to forward links to
`the hyperlinked documents in other hyperlinked documents.
`The links to the hyperlinked documents are displayed in
`groups.
`Other features and advantages of the invention will become
`readily apparent upon review of the following description and
`association with the accompanying drawings, where the same 10
`or similar structures are designated with the same reference
`numerals.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`4
`geously applied to grouping hyperlinked documents on a
`local intranet for a number of diverse uses, such as for orga(cid:173)
`nizing hyperlinked documents at an institution. Therefore,
`the description of the embodiments that follows is for pur(cid:173)
`poses of illustration and not limitation.
`FIG.1 illustrates an example of a computer system that can
`be used to execute the software of an embodiment of the
`invention. FIG. 1 shows a computer system 1 that includes a
`display 3, screen 5, cabinet 7, keyboard 9, and mouse 11.
`Mouse 11 can have one or more buttons for interacting with a
`graphical user interface. Cabinet 7 houses a CD-ROM drive
`13, system memory andaharddrive(seeFIG. 2)whichcan be
`utilized to store and retrieve software programs incorporating
`computer code that implements the invention, data for use
`15 with the invention, and the like. Although CD-ROM 15 is
`shown as an exemplary computer readable storage medium,
`other computer readable storage media including floppy disk,
`tape, flash memory, system memory, and hard drive can be
`utilized. Additionally, a data signal embodied in a carrier
`20 wave (e.g., in a network including the Internet) can be the
`computer readable storage medium.
`FIG. 2 shows a system block diagram of computer system
`1 used to execute the software of an embodiment of the
`invention. As in FIG. 1, computer system 1 includes monitor
`3 and keyboard 9, and mouse 11. Computer system 1 further
`includes subsystems such as a central processor 51, system
`memory 53, fixed storage 55 (e.g., hard drive), removable
`storage 57 (e.g., CD-ROM drive), display adapter 59, sound
`card 61, speakers 63, and network interface 65. Other com-
`30 puter systems suitable for use with the invention can include
`additional or fewer subsystems. For example, another com(cid:173)
`puter system could include more than one processor 51 (i.e.,
`a multi-processor system) or a cache memory.
`The system bus architecture of computer system 1 is rep-
`35 resented by arrows 67. However, these arrows are illustrative
`of any interconnection scheme serving to link the subsystems.
`For example, a local bus could be utilized to connect the
`central processor to the system memory and display adapter.
`Computer system 1 shown in FIG. 2 is but an example of a
`40 computer system suitable for use with the invention. Other
`computer architectures having different configurations of
`subsystems can also be utilized.
`FIG. 3 shows a network of multiple computer systems. A
`network 101 provides communication between multiple
`45 computer systems 1. In a wide area network such as the
`Internet, some of the computer systems are servers (or hosts)
`and provide access to resources or services to client computer
`systems on the network. With respect to web pages, there are
`multiple server computer systems that store the web pages
`50 that make up the Web. The web pages typically include links
`in the form of uniform resource locators (URLs) that are a link
`to another web page, whether it is on the same server or a
`different one.
`As described above, the Web is a distributed network of
`55 web pages. Networks ofhyperlinked documents can also be
`present in local area networks (e.g., intranets ). The operation
`of these intranets is very similar to the Internet except that it
`is not uncommon for all or a majority of the hyperlinked
`documents of an intranet to be stored on a single server
`60 computer system.
`Clustering is the general term for a technology that identi(cid:173)
`fies groups of similar objects among a given set. For example,
`consider the words {cat, dog, airplane, bus, automobile,
`giraffe }.A clustering of this set could divide it into two parts.
`65 One part, {cat, dog, giraffe}, would describe animals, while
`the other, {automobile, bus, airplane} would describe
`vehicles of travel. This is essentially what the invention can
`
`FIG. 1 illustrates an example of a computer system that can
`be utilized to execute the software of an embodiment of the
`invention.
`FIG. 2 illustrates a system block diagram of the computer
`system of FIG. 1.
`FIG. 3 illustrates a network of multiple computer systems
`including wide area networks and local area networks.
`FIG. 4 shows an example of links between web pages
`relating to Saturn cars and the planet Saturn.
`FIG. 5 shows a flow chart of a process of grouping hyper- 25
`linked documents.
`FIG. 6A shows of a web page on the planet Saturn having
`a forward link to another web page on the planet Saturn and
`FIG. 6B shows how the links can be considered to enhance
`grouping the web pages by topic.
`FIG. 7 shows a flow chart of a process of grouping hyper(cid:173)
`linked documents that have been obtained from a search and
`utilizes a similarity measure to penalize popular web pages
`and groups.
`FIGS. SA and SB show a web page example of the search
`results for the query ')aguar."
`FIG. 9 shows additional web pages in the neighborhood of
`a Jaguar cars global page.
`FIG. 10 shows an example of a back link list ofhyperlinked
`documents that have forward links to an expanded search set
`ofhyperlinked documents.
`FIG. 11 shows a hash table that can be utilized to organize
`the relationships between the web pages in FIG. 10.
`FIG. 12A shows a co-citation matrix and FIG. 12B shows
`a relationship vector.
`FIG. 13 shows an example of an equation that can be
`utilized to measure the similarity of hyperlinked documents
`or groups ofhyperlinked documents.
`FIG. 14 shows an augmented relationship vector.
`FIG. 15 shows a sorted relationship vector.
`FIG. 16 shows a merge vector.
`FIG. 17 shows a co-citation matrix after the merge.
`FIGS.18A-18C show a web pageexampleofsearchresults
`being grouped by topic.
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
`In the description that follows, the present invention will be
`described in reference to embodiments that group hyper(cid:173)
`linked documents (e.g., web pages). More specifically, the
`embodiments will be described in reference to embodiments
`that group the web pages from a search by topic so that the
`user can better find the desired information. However,
`embodiments of the invention are not limited to any particular
`environment, application or specific implementation. For
`example, the embodiments described below can be advanta-
`
`
`
`US 8,516,357 Bl
`
`5
`provide with respect to web pages, grouping together sets of
`web pages that are on the same topic. In our example however,
`we used our background knowledge of the given entities to do
`this grouping ourselves. In the case of web pages, it would be
`desirable to provide a suitable alternative to this knowledge 5
`base that lends itself to being automated. One aspect of the
`invention is that the link structure of the Web itself provides
`such information.
`Consider that pages on the same general topic are likely to
`be linked to from the same set of pages. Take, for example, the 10
`Saturn query described above. It makes sense that two pages
`on the Saturn car could be linked to from a common page
`(e.g., a page reviewing cars). However, it is extremely
`unlikely that any page on the Web links to one page talking
`about the Saturn automobile and another about the Sega Sat- 15
`urn game system. Although this can happen, statistical meth(cid:173)
`ods can be used to discount the effects of such an unlikely
`occurrence. Thus, by utilizing the link structure of the Web,
`pages that are on the same topic can be grouped together.
`Standard clustering techniques usually require the defini- 20
`tion of a distance between pairs of the objects to be clustered.
`Pages that are close together are then more likely to be about
`the same topic than those that are far apart. One way to define
`this distance is to consider the co-citation number of a pair of
`web pages. The co-citation number is the number of pages 25
`that link to both pages simultaneously. The higher this num(cid:173)
`ber, the more related are the pair of pages, and thus the smaller
`the distance between them should be. Therefore, the inverse
`of the co-citation number can be utilized as a definition of
`distance.
`FIG. 4 shows an example of links between web pages
`relating to Saturn cars and the planet Saturn.