US008631094B1

(12) United States Patent
     Alpert et al.

(10) Patent No.:     US 8,631,094 B1
(45) Date of Patent: Jan. 14, 2014

(54) DISTRIBUTED PARALLEL DETERMINATION OF SINGLE AND MULTIPLE SOURCE SHORTEST PATHS IN LARGE DIRECTED GRAPHS

(75) Inventors: Jesse Alpert, Berkeley, CA (US); Nissan Hajaj, San Mateo, CA (US)

(73) Assignee: Google Inc., Mountain View, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 713 days.

(21) Appl. No.: 12/537,681

(22) Filed: Aug. 7, 2009

Related U.S. Application Data

(60) Provisional application No. 61/087,623, filed on Aug. 8, 2008.

(51) Int. Cl.
     G06F 15/16 (2006.01)
     G06F 7/00  (2006.01)
     G06F 17/00 (2006.01)
     G06F 17/30 (2006.01)

(52) U.S. Cl.
     USPC ............ 709/219; 707/631; 707/637; 707/798

(58) Field of Classification Search
     USPC .......................................................... 709/219
     See application file for complete search history.
(56) References Cited

U.S. PATENT DOCUMENTS

2003/0033582 A1*  2/2003  Klein et al. .................... 716/4
2005/0073962 A1*  4/2005  Zabele et al. ................ 370/254
2005/0088965 A1*  4/2005  Atlas et al. ................. 370/216
2007/0297332 A1* 12/2007  Broberg et al. ............... 370/235
2008/0288580 A1* 11/2008  Wang et al. .................. 709/203
2009/0007127 A1*  1/2009  Roberts et al. ............... 718/104
2009/0210489 A1*  8/2009  Deb et al. ................... 709/204
2009/0316697 A1* 12/2009  Dakshinamoorthy et al. ....... 370/390
2010/0017368 A1*  1/2010  Mao et al. ..................... 707/3

* cited by examiner
Primary Examiner - Kevin Bates
Assistant Examiner - Tom Y Chang
(74) Attorney, Agent, or Firm - Fish & Richardson P.C.

(57) ABSTRACT
Systems and methods for checkpointing a computation distributed over multiple peer servers. On each server, sequentially storing checkpoints collectively representing a current state of the computation on that server as of a most recent checkpoint, each checkpoint having a checkpoint timestamp. When restarting a first server, rebuilding a most recent state of the first server from the checkpoints written by the first server through a most recent checkpoint having a most recent checkpoint timestamp, and requesting from each of the other peer servers updates from the most recent checkpoint timestamp of the first server. On each server, in response to a first request for updates as of a particular time, deriving the requested updates from the state data in the server uncommitted to a checkpoint and the state data in checkpoints of the server that have a timestamp no earlier than the particular time of the first request, and providing the requested updates to the first server.
19 Claims, 8 Drawing Sheets
[Representative front-page figure: example distance tables for Shard 1 (Node A and Node B) and Shard 2 (Node G), showing entries being replaced when shorter seed distances arrive, e.g., "Node A | Seed Z | Dist 1.5" replaces an entry because 1.5 is less than 2, "Node B | Seed Z | Dist .5" replaces an entry because .5 is less than 1, and "Node G | Seed Z | Dist 1.5" replaces a value because 1.5 is less than 3.]

EXHIBIT 2085 - Facebook, Inc. et al. v. Software Rights Archive, LLC - CASE IPR2013-00479
U.S. Patent    Jan. 14, 2014    Sheet 1 of 8    US 8,631,094 B1
[FIG. 1A: block diagram of an example directed graph analysis system 150, showing node subsets 152a-152c, per-server Distance Tables 156a-156c connected through network 101, and a Merged Distance Table.]

U.S. Patent    Jan. 14, 2014    Sheet 2 of 8    US 8,631,094 B1
[FIG. 1B: example directed graph 100 (including Seed W 102, Seed Y, Seed Z, and Nodes A through G) with distance table portions for Shard 1 (Node A and Node B) and Shard 2 (Node G: Seed X Dist 3, Seed Y Dist 2, Seed Z Dist 3); an update "Node G | Seed X | Dist 2" replaces the stored value because 2 is less than 3.]

U.S. Patent    Jan. 14, 2014    Sheet 3 of 8    US 8,631,094 B1
[FIG. 1C: the same graph 100 and distance table 120 after further updates; entries are replaced where the new distances are shorter, e.g., "Node A | Seed Z | Dist 1.5" replaces an entry because 1.5 is less than 2, "Node B | Seed Z | Dist .5" replaces an entry because .5 is less than 1, and "Node G | Seed Z | Dist 1.5" replaces a value because 1.5 is less than 3.]

U.S. Patent    Jan. 14, 2014    Sheet 4 of 8    US 8,631,094 B1
[FIG. 2: flow chart - divide a directed edge graph of nodes into shards (310); assign each of the shards to a server (320); calculate a distance table for each of the nodes in each shard (330); rank the nodes based on the calculation (340).]

U.S. Patent    Jan. 14, 2014    Sheet 5 of 8    US 8,631,094 B1
[FIG. 3: flow chart - receive a distance update including an updated distance (402); determine whether the distance table includes one of the seeds; if not, determine whether the updated distance is shorter than the three nearest seeds and either ignore the distance update (416) or delete one of the three nearest seeds from the distance table and add the seed and the updated distance; if the seed is present, either ignore the distance update or replace the distance with the updated distance; propagate the updated distance along the directed edge graph (414).]

U.S. Patent    Jan. 14, 2014    Sheet 6 of 8    US 8,631,094 B1
[FIG. 4A: shard servers A, B, ..., N (502a-502n) writing checkpoints (504a-504n) to a global file system (GFS) holding Checkpoint(s) A, B, ..., N (508a-508n) at a first time, t0.]

U.S. Patent    Jan. 14, 2014    Sheet 7 of 8    US 8,631,094 B1
[FIG. 4B: the same shard servers and GFS (506) at a second time, t1, writing checkpoints 514a-514n.]

U.S. Patent    Jan. 14, 2014    Sheet 8 of 8    US 8,631,094 B1
[FIG. 5: schematic diagram of an example generic computer system, including a processor and associated components.]
US 8,631,094 B1

DISTRIBUTED PARALLEL DETERMINATION OF SINGLE AND MULTIPLE SOURCE SHORTEST PATHS IN LARGE DIRECTED GRAPHS
CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit under 35 U.S.C. § 119(e) of the filing date of U.S. Patent Application No. 61/087,623, filed Aug. 8, 2008, and entitled "Distributed Parallel Determination of Single and Multiple Source Shortest Paths In Large Directed Graphs," the contents of which are incorporated herein by reference.
BACKGROUND

The present specification is directed to systems, components of systems, and methods performed by them, that can find multiple shortest paths in very large graphs.

Graph analysis methods are used to compute a shortest path on a weighted directed graph. A number of techniques for solving shortest paths problems have been implemented. The Dijkstra and the Bellman-Ford algorithms for the single source shortest paths problem have no parallelism and/or are not scalable. While Bellman-Ford can easily be parallelized, it is not scalable because it requires too many iterations, each propagating messages through all the edges. Others' work on parallelizing the Dijkstra algorithm has resulted in system designs that rely on the use of shared memory, random access to in-memory graph data, and reliable machines. Such designs cannot run across many machines, cannot be realized in the absence of shared memory access, cannot work with large graphs stored on disk, and cannot routinely handle machine failures. Examples of such systems are described in "Parallel Shortest Paths Algorithms for Solving Large-Scale Instances": http://www.cc.gatech.edu/research/reports/GT-CSE-06-19 and http://www.cc.gatech.edu/~bader/papers/ShortestPaths-ALENEX2007.pdf. These systems use shared memory models in which the graph data is all in memory, and are described as working on large graph instances having 2 billion edges.

The systems described in this specification can compute single source and multiple source shortest paths for graph instances having trillions of edges and have the capacity to scale to even larger size graphs.

SUMMARY

Computing shortest paths on a weighted digraph is required by many commercial applications. Some applications must solve the problem for extremely large digraph instances, i.e., instances of more than a trillion (10^12) edges. Such applications generally represent entities in the application domain using the nodes and directed edges of a weighted or unweighted directed graph, and compute single or multiple seed, nearest seed paths in the graph in performing various analyses of the underlying entities. The systems that are described in this specification can process such graphs in a distributed environment in which thousands of computers are used, and in which at least one machine failure during the course of a computation will generally be common. The systems described here can efficiently overcome such failures. Given a large weighted digraph and a set of seed nodes (which may include hundreds of seeds) on the graph, the systems find for each node the n seeds with the shortest paths to the node and the lengths of these paths, where n is a predetermined small positive integer, such as one or three. The systems can handle graphs with hundreds of billions of nodes and trillions of edges. The systems run in a distributed environment, across many machines, and efficiently recover from machine failures.

The details of one or more implementations of the subject matter are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a block diagram showing an example directed graph analysis system.
FIGS. 1B and 1C are block diagrams showing example directed graphs and corresponding distance tables.
FIG. 2 is a flow chart of an example process for determining values of a distance table.
FIG. 3 is a flow chart of an example process for modifying a distance table in response to a distance update.
FIG. 4A is a block diagram showing example shard servers, checkpoints, and a global file system at a first time interval.
FIG. 4B is a block diagram showing example shard servers, checkpoints, and a global file system at a second time interval.
FIG. 5 is a schematic diagram of an example of a generic computer system.
Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1A is a block diagram showing an example directed graph analysis system 150. The system 150 works on data defining a directed graph with directed edges. The directed graph can include nodes and directed links. Each node can represent a single physical entity, or can alternatively represent many physical entities that share a common attribute. In some implementations, each node is represented by a representative number, such as a hash of a unique descriptor of an entity represented by the node.

The system divides the identified nodes into subsets. In the depicted example, the system 150 has divided nodes A, B, C, D, E, F, G, H, I, K, and L into three subsets 152a, 152b, and 152c. Nodes that have directed edges out to other nodes ("outgoing edges") will be referred to as "source nodes." In the depicted example, node A has directed edges to nodes D and B and is a source node. Nodes that have no such directed edges will be referred to as "leaves" or "leaf nodes."

After the system 150 divides the data describing identified nodes and their outgoing directed edges into subsets, which may also be referred to as shards, the system assigns the nodes to servers, which may also be referred to as shard servers or peer servers. In the depicted example, the system 150 assigns the subsets 152a, 152b, and 152c to servers 154a, 154b, and 154c, respectively. The servers will generally be in communication with a master server 157 and each other through a network 101. Thus, each server will have data describing a portion of a directed graph. Because it will be necessary for the servers to determine, for any arbitrary node, which server the node belongs to, i.e., which server the node has been assigned to, the assignment of nodes to servers is accomplished by a computation that can be performed very quickly. In some implementations, this mapping computation is a hash of a node identifier modulo the number of servers. In other implementations, domain information is used so that nodes in
the same domains have an increased likelihood (over random assignment) of being assigned to the same servers.
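
The node-to-server mapping described above can be pictured with the short Python sketch below. It is an illustrative sketch only, not the patent's implementation: the SHA-1 hash, the function names, and the server count of three are assumptions chosen for the example.

    import hashlib

    NUM_SERVERS = 3  # assumed number of shard servers for this example

    def node_number(descriptor: str) -> int:
        # Represent a node by a number, e.g., a hash of a unique descriptor of the entity.
        return int.from_bytes(hashlib.sha1(descriptor.encode("utf-8")).digest()[:8], "big")

    def owning_server(descriptor: str, num_servers: int = NUM_SERVERS) -> int:
        # The quick mapping computation: hash of the node identifier modulo the number of servers.
        return node_number(descriptor) % num_servers

    for node in ["A", "B", "G"]:
        print(node, "is owned by server", owning_server(node))

Any server can rerun this computation for an arbitrary node and arrive at the same owner, which is what makes the assignment cheap to consult.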
The ultimate task of each server is to compute nearest seed distances for the nodes assigned to the server, and for the servers collectively, operating in parallel, to compute nearest seed distances for the entire input graph. For example, server 154c computes a distance table 156c of nearest seeds for the source nodes in the subset 152c. Nearest seed computations are described in more detail below. The servers also compute nearest seed distances for the leaf nodes in leaf tables, as will also be described below. Once the computations by the servers 154a-154c are complete, the system 150 combines (actually or virtually) the distance tables 156a-156c and the leaf tables into a single combined distance table 158, which contains the shortest paths computed by the system for each of the nodes represented in the input data.
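
One way to picture the (actual or virtual) combination into table 158 is sketched below; it keeps, for each node, the shortest distance reported for each seed. The dictionary layout and names are assumptions made for the illustration, not the patent's data format.

    def combine_tables(per_server_tables):
        # per_server_tables: iterable of {node: {seed: distance}} mappings produced by the
        # shard servers (distance tables and leaf tables alike). Returns one combined mapping
        # that keeps the shortest distance recorded for each node-seed pair.
        combined = {}
        for table in per_server_tables:
            for node, seeds in table.items():
                best = combined.setdefault(node, {})
                for seed, dist in seeds.items():
                    if seed not in best or dist < best[seed]:
                        best[seed] = dist
        return combined

    shard1 = {"A": {"W": 1, "X": 1, "Y": 2}, "B": {"W": 1, "X": 1, "Y": 1}}
    shard2 = {"G": {"X": 2, "Y": 2, "Z": 3}}
    print(combine_tables([shard1, shard2]))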
In some scenarios, the edges of the graph are all of equal length (distance). In other scenarios, information about the directed edges, or about the nodes represented by the target nodes of the edges, is represented by weights assigned to the edges. In some scenarios, the input data to the servers includes edge weight data (which in effect modifies the lengths of the edges) as well as graph structure data.

The input to the system also includes data identifying seeds. In some scenarios, only one or a few seeds are identified. In others, hundreds or even thousands of seeds may be used. A seed in the graph context is a particular node that is preselected according to one or more of its characteristics. For example, seed resources may be chosen based on a node's importance or other characteristics, or those of the underlying application entity. Seeds and seed selection are described in more detail below. Thus, the linked graph 100 may be a weighted, directed, and cyclical graph in which some of the nodes are designated as seeds.
Each server ultimately computes the n closest seeds to each of the server's nodes and the distance values from each node to each of its respective closest seeds. To identify seeds, the set of nodes may be analyzed and one or more seeds selected from the set of nodes according to a selection criterion. In some implementations, seeds are identified using a partially or fully manual process and provided to the system as a list of node identifiers. As already mentioned, in the system the data and the processing are distributed across many servers, in some implementations, across more than one thousand servers. Each server processes a shard of the input and output data corresponding to its nodes.

Because the system is designed for processing very large graphs, the graph data shards are too large to be stored in random access memory (RAM) and so must be stored on disk memory or some other form of mass storage memory; however, the number of servers is selected so that the distance table and the current portions of the leaf table of each server can be stored in the RAM of the server.
The system can store the link graph 100 using any number of storage mechanisms. For example, in some implementations, the system may store the linked graph 100 on a distributed global file system (GFS). One implementation of such a file system is described in Ghemawat et al., The Google File System, 19th ACM Symposium on Operating Systems Principles. The linked graph 100 may be stored in one or more link map files on the GFS. Each link map file contains a representation of the nodes and their outgoing directed edges (identifying the target nodes) for a portion of the linked graph 100. The link map files may be replicated to provide redundancy if a disk fails or is otherwise inaccessible.
Each server processes distance update messages, which it receives from other servers or generates itself, and updates the nearest seed data in its distance and leaf tables as appropriate. Each server generates distance update messages, as will be described, as a consequence of updating the nearest seed information for its nodes. A server receiving an update message determines whether the update includes better information (i.e., a shorter distance) than is represented in the receiving server's distance table, applies the update if it does, and ignores it otherwise.
Each server uses three tables:

1) A link table, stored on disk, represents the shard's part of the directed edge graph. Each row in the link table represents a node and identifies all the outgoing directed edges from the node. To identify the target nodes that may have to be notified of a change in a nearest seed distance of a node N1 owned by a server, the server will look up the node N1 in its link table and find all the nodes to which N1 has outgoing directed edges.

2) A distance table, stored in RAM, has a structure that parallels that of the link table, having the same ordering of nodes as the link table does. For each of the server's nodes, the distance table has n seed and distance pairs. This data represents the server's best information as to which n seeds are closest to each of its nodes and the distance from each node to its closest n seeds. The distance table represents the state of the computation and is updated in response to update messages. The table also includes a dirty bit, which when set indicates, for a node and a nearest seed to that node, that the nearest distance information for that node-seed combination has to be propagated to the nodes on the outgoing directed edges from the node. When the server is determining what updates to send, this table is scanned sequentially, and the nodes that have to be looked up in the directed edge table are therefore identified in the right order for sequential reading.

3) A leaf table is similar to the distance table, except not all of it is kept in RAM. Also, because leaves have no outgoing directed edges, no dirty bit information needs to be maintained. In systems where there are many more leaf nodes than source nodes, the leaf table data is accumulated in RAM in a hashed data structure, the node identifier being the hashed key, and when RAM fills up, the data is written to disk in node-sorted order and operation continues. At the end of the computation for the entire graph, all the disk data is merged, keeping the shortest distances recorded for each leaf.
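
The three tables can be pictured with the following sketch. It is a simplified illustration, not the patent's on-disk or in-memory format: the dictionary shapes, the class name LeafTable, and the in-RAM limit are assumptions, and the "disk write" is simulated with an in-memory list of sorted runs.

    N_NEAREST = 3  # n (seed, distance) pairs kept per node; n = 3 is one value named in the text

    # 1) Link table (stored on disk in the described system): node -> outgoing directed edges.
    link_table = {"A": [("D", 1.0), ("B", 1.0)], "B": [("G", 1.0)]}

    # 2) Distance table (stored in RAM, same node ordering as the link table):
    #    node -> up to N_NEAREST seed/distance pairs plus the seeds whose distances are dirty.
    distance_table = {
        "A": {"seeds": {"W": 1.0, "X": 1.0, "Y": 2.0}, "dirty": {"W"}},
        "B": {"seeds": {"W": 1.0, "X": 1.0, "Y": 1.0}, "dirty": set()},
    }

    # 3) Leaf table: accumulated in RAM in a hash keyed by node identifier; when RAM fills,
    #    the data is written out in node-sorted order, and the runs are merged at the end,
    #    keeping the shortest distance recorded for each leaf.
    class LeafTable:
        def __init__(self, max_in_ram=1000):
            self.in_ram, self.max_in_ram, self.runs = {}, max_in_ram, []

        def update(self, leaf, seed, dist):
            seeds = self.in_ram.setdefault(leaf, {})
            if seed not in seeds or dist < seeds[seed]:
                seeds[seed] = dist
            if len(self.in_ram) >= self.max_in_ram:
                self.runs.append(sorted(self.in_ram.items()))  # stands in for a disk write
                self.in_ram = {}

        def merged(self):
            merged = {}
            for run in self.runs + [sorted(self.in_ram.items())]:
                for leaf, seeds in run:
                    best = merged.setdefault(leaf, {})
                    for seed, dist in seeds.items():
                        if seed not in best or dist < best[seed]:
                            best[seed] = dist
            return merged

    leaves = LeafTable(max_in_ram=2)
    leaves.update("D", "W", 2.0)
    leaves.update("D", "W", 1.5)
    print(leaves.merged())  # {'D': {'W': 1.5}}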
In order to reduce the disk seek latency, local copies of the link table are maintained by the server and generally spread over multiple disks. Link table lookups are performed on separate I/O threads, one thread per local disk. Distribution of the link table among multiple local disks of the server is described below. Therefore, other threads are not blocked waiting for I/O unless they have nothing else to do. Worker threads scan the distance table in order, looking for dirty entries, i.e., entries that have a dirty bit set, that satisfy a distance threshold, which will be described. The nodes belonging to such dirty entries are placed on a queue for the lookup threads to fetch. The queue is kept as full as possible, so that the seek results (outgoing directed edges for the marked nodes) are ready when they are needed for distance update propagation. This allows for efficient processing of nodes in sorted order and substantially reduces disk access bottlenecks.
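
A minimal sketch of the scan-and-queue pattern described in the preceding paragraph follows. The single-threaded driver, the queue size, and the dictionary layout are assumptions made for the illustration; in the described system the scanning and the per-disk lookups run on separate threads.

    import queue

    lookup_queue = queue.Queue(maxsize=1024)  # kept as full as possible in the described system

    def scan_for_dirty(distance_table, propagation_threshold):
        # Worker-thread pass: scan the in-RAM distance table in node order and queue the
        # nodes that have a dirty seed distance satisfying the propagation threshold.
        for node, entry in distance_table.items():
            if any(entry["seeds"][s] < propagation_threshold for s in entry["dirty"]):
                lookup_queue.put(node)

    def fetch_outgoing_edges(link_table):
        # Lookup-thread pass (one thread per local disk in the described system): fetch the
        # outgoing directed edges for queued nodes so they are ready for update propagation.
        fetched = []
        while not lookup_queue.empty():
            node = lookup_queue.get()
            fetched.append((node, link_table.get(node, [])))  # a disk seek in practice
        return fetched

    distance_table = {"A": {"seeds": {"W": 1.0}, "dirty": {"W"}}}
    link_table = {"A": [("D", 1.0), ("B", 1.0)]}
    scan_for_dirty(distance_table, propagation_threshold=2.0)
    print(fetch_outgoing_edges(link_table))  # [('A', [('D', 1.0), ('B', 1.0)])]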
FIGS. 1B and 1C are block diagrams showing an example graph 100 and corresponding distance tables 120. Seed W 102 has a single outgoing directed edge to Node A 104. Seed Y 106 has outgoing directed edges to Seed X 108, Node B 110, Node E 112, and Node C 114 with lengths or distance values of 1, 1, 2, and 0.5, respectively.

As illustrated by FIGS. 1B and 1C, a first shard server is configured to store in RAM a portion of distance table 120 corresponding to distances from any number of seeds to Node A 104 and Node B 110 (distance table portion 122) and the other nodes that are assigned to the first shard server. A second shard server is similarly configured to store in RAM a portion of distance table 120 corresponding to distances from any number of seeds to Node G (distance table portion 124) and many other nodes.

Each portion of the distance table computed by the shard servers will ultimately be combined to generate a complete distance table (distance table 120). In some implementations, the combination of the portions is performed virtually. Thus, because the system knows which shard servers store the nearest seed distances in accordance with the shard assignment, the system may access the portion of the distance table on the appropriate shard server without having to combine all of the distance table portions into a single distance table. In some implementations, one use of the results is in ranking nodes, where a shorter distance to an n-th nearest seed indicates a higher quality.
A shard server will receive distance updates, which may come from the server itself, but will generally come from other servers. A distance update sent from a sending shard server to a receiving shard server includes the identification of a node owned by the receiving shard server, the identification of the seed, and the distance from the node to the seed as determined by the sending shard server. The update can be accepted or ignored by the receiving shard server. In some implementations, when an update is sent, the corresponding entry in the sending shard server's distance table is marked as clean. In some implementations, the entry in the sending shard server is marked as clean without regard to whether receipt of the update was acknowledged.

For example, in reference to FIG. 1B, a distance update is generated for the distance value of 2 for the distance from Seed X 108 to Node G 116. As depicted in the example, the new distance value of 2 is less than the current distance value of 3 for the distance between Seed X and Node G in the distance table 124 of the shard server that owns Node G. Because the distance is less than what that server has as the distance, it is used to modify the node-to-seed distance value, which is consequently marked as dirty. The information in the dirty entries is generally propagated by update messages to the servers owning the corresponding target nodes. Thus, any distance update may initiate a cascade of distance update messages.
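
The accept-or-ignore rule and the resulting cascade can be sketched as follows. The function signature and the message tuples are assumptions for the illustration; for brevity the sketch keeps a single distance per node-seed pair and omits the dirty bit and the limit of n nearest seeds.

    def handle_distance_update(distance_table, link_table, owning_server, node, seed, dist):
        # Accept the update only if it is shorter than the value this server already holds;
        # if accepted, generate follow-on updates for the targets of the node's outgoing edges.
        entry = distance_table.setdefault(node, {})
        current = entry.get(seed)
        if current is not None and dist >= current:
            return []  # ignore: no better than the current distance
        entry[seed] = dist  # replace (or add) the node-to-seed distance
        cascade = []
        for target, edge_len in link_table.get(node, []):
            # one accepted update can trigger updates to the servers owning the target nodes
            cascade.append((owning_server(target), target, seed, dist + edge_len))
        return cascade

    distance_table = {"G": {"X": 3.0}}
    link_table = {"G": [("F", 1.0)]}
    print(handle_distance_update(distance_table, link_table, lambda n: hash(n) % 3, "G", "X", 2.0))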
In reference to FIG. 1C, the server can send an update to Node B, as illustrated by the updated distance value 0.5 and represented by connector 128. In response, distance updates are transmitted to the shard servers assigned to Node A 104, Node B 110, and Node G 116. As depicted by the example, the shard servers may use the updates both to replace distance values (e.g., as described above) and to replace entire entries in the distance table portions. The largest value of the distance table portion 122 may be replaced (e.g., the previous entry removed and a new entry added) because the new distance value for a new seed is smaller than the distance values for one of the current seeds (e.g., the distance value for Seed Z 126 is shorter than the distance value for Seed Y 106). In both cases, the entry is marked as dirty.

In some implementations, the distance table maintains, for each nearest seed of a node, the identity of the previous node in the path from the nearest seed to the node. In such implementations, each update message includes, in addition to seed and distance, the identifier of the source node on the path. With this information, when the computation is complete, for each node the path to each nearest seed can be determined.
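
With predecessor information stored as just described, a path can be recovered by walking back from the node toward the seed, as in the sketch below. The tuple layout (distance, previous node) is an assumption made for the illustration.

    def path_to_seed(distance_table, node, seed):
        # distance_table: node -> {seed: (distance, previous node on the path from the seed)}
        path = [node]
        while node != seed:
            _, prev = distance_table[node][seed]
            path.append(prev)
            node = prev
        return list(reversed(path))  # the path from the seed to the node

    table = {
        "G": {"Z": (1.5, "B")},
        "B": {"Z": (0.5, "Z")},
        "Z": {"Z": (0.0, "Z")},
    }
    print(path_to_seed(table, "G", "Z"))  # ['Z', 'B', 'G']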
In some implementations, each server filters outgoing update messages to reduce the sending of unnecessary messages. In some implementations, a table keyed by target node contains the best shortest path data sent to the target node by the server thus far. Before an update message is sent by the server, the target node of the message is looked up in the table, and if it shows that the server has already sent a message with the same or a shorter distance between the target and the seed of the message, the message is not sent. The keys of this filter table are selected either statically before the computation or dynamically during the computation. The keys are selected so as to include, as nearly as practicable, the most popular target nodes in the graph (i.e., the nodes with the most directed edges directed to them).
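
The send-side filter just described might look like the sketch below; the class name, the fixed key set, and the method names are assumptions made for the illustration.

    class UpdateFilter:
        # Tracks, per popular target node, the best (shortest) distance already sent per seed,
        # and suppresses messages that cannot improve on it.
        def __init__(self, popular_targets):
            self.best_sent = {target: {} for target in popular_targets}

        def should_send(self, target, seed, distance):
            sent = self.best_sent.get(target)
            if sent is None:
                return True  # target not covered by the filter keys: always send
            previous = sent.get(seed)
            if previous is not None and previous <= distance:
                return False  # the same or a shorter distance was already sent
            sent[seed] = distance
            return True

    f = UpdateFilter(popular_targets={"G"})
    print(f.should_send("G", "X", 2.0))  # True: first message for this target and seed
    print(f.should_send("G", "X", 2.5))  # False: 2.0 was already sent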
In some implementations, the system uses an adaptive propagation threshold that may be different for each shard server and that determines which updates the shard server will actually propagate. A shard server will generate update messages only if the distance in the message is less than a certain threshold. This means that as the server is scanning the distance table, the only nodes it will look up in the link table (and, most importantly, the only nodes for which it will perform disk operations) are the nodes for which a dirty seed value satisfies the propagation threshold. As the computation proceeds, the information about near seeds will become stable (i.e., not be updated and become dirty), so the threshold will increase to cause information about farther seeds to be propagated, until the threshold is at a maximum value (no threshold limit) and the entire computation is completed.
The propagation threshold may be adjusted dynamically by the shard server itself to maintain a specified propagation ratio, such as a propagation ratio of 1-in-6 (i.e., only 1-in-6 distance updates (dirty records) are actually transmitted). The propagation threshold value determines a distance threshold for the distance updates. If more than 1-in-6 distance updates are being transmitted, the threshold value is lowered. Conversely, if fewer than 1-in-6 distance updates are transmitted, the threshold value is increased. The propagation threshold may be used to reduce the bandwidth requirements for processing the distance updates. The propagation threshold generally reduces the number of redundant distance updates the server sends (that is, updates containing distances that will later be overridden by shorter distances). The higher the threshold, the more parallelism and better disk throughput that is achieved, at the cost of sending more redundant updates to the peers.
The propagation threshold for a shard server may be adjusted, in response to system-wide conditions, by a master server or process that monitors the progress of the computation. In some implementations, the progress of the computation is represented by propagation threshold values provided by the shard servers to the master. A shard server having a threshold substantially lower than that of the other servers, for example, within the bottom 10% of the thresholds, would likely proceed more quickly with a higher threshold. The master can determine that the increased burden on the other servers of receiving more updates will be outweighed by the quicker conclusion of the entire computation resulting from the lagging server processing its own dirty data more quickly, in which case the master server will instruct the lagging server to increase its propagation threshold.

FIG. 2 is a flow chart of an example process 300 for determining values of a distance table. For convenience, process 300 will be described in reference to a system that is configured with program instructions (e.g., one or more computer software modules) that, when executed, performs process 300.

The system divides a graph of nodes into shards (310) and assigns each of the shards to a server (320). The system calculates, in parallel, a nearest seeds distance table for each of the nodes in each shard (330). Using the final result of the calculation, the system ranks the nodes (340). For example, nodes associated with shorter nearest seed distances may receive higher rankings. Generally, a distance other than the distance to the closest seed is used. In some implementations, the third closest seed is used in calculating a ranking.

Process 300 will generally be re-executed after the link graph 100 changes, i.e., new directed edges are added, or nodes are added or deleted, or weights change. The shard servers can access the local copies of the respective information and retrieve the additional update information from the GFS.
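
The ranking step can be illustrated with the sketch below, which orders nodes by the distance to their n-th closest seed (the third closest seed, as mentioned above). The tie-breaking and the handling of nodes with fewer than n seeds are assumptions made for the illustration.

    def rank_nodes(combined_table, nth=3):
        # combined_table: node -> {seed: distance}. A shorter distance to the nth nearest
        # seed ranks the node higher; nodes without an nth seed are ranked last here.
        def nth_distance(node):
            distances = sorted(combined_table[node].values())
            return distances[nth - 1] if len(distances) >= nth else float("inf")
        return sorted(combined_table, key=nth_distance)

    table = {
        "A": {"W": 1, "X": 1, "Y": 2},
        "B": {"W": 1, "X": 1, "Y": 1},
        "G": {"X": 2, "Y": 2, "Z": 3},
    }
    print(rank_nodes(table))  # ['B', 'A', 'G']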
FIG. 3 is a flow chart of an example process 400 for modifying a distance table in response to a distance update. For convenience, process 400 will be described in reference to a shard server that is configured with program instructions (e.g., one or more computer applications) that, when executed, performs process 400. Process 400 may be used to modify a portion of a distance table. For example, in reference to FIG. 1B, process 400 may be used to modify portion 122 of distance table 120.

The shard server receives a distance update including an updated distance (402). The distance update may be received in a message from another shard. The distance update includes a distance, a seed identifier, and a destination node identifier. For example, in reference to FIG. 1B, the shard server assigned to process distance updates for Node G 116 receives a distance update that includes a distance value of 2, a seed identifier identifying Seed X 108, and a destination identifier of Node G 116. This tells the shard server that receives the update (and that owns Node G) that the shortest distance between Node G and Seed X is 2, and maybe less.
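
The decision flow of process 400, as walked through in the following paragraph and shown in FIG. 3, can be sketched as below. The dictionary layout and the limit of three nearest seeds per node are assumptions drawn from the example; the step numbers in the comments refer to the flow-chart boxes.

    def process_distance_update(distance_table, update, n_nearest=3):
        node, seed, new_dist = update["node"], update["seed"], update["distance"]  # receive (402)
        entry = distance_table.setdefault(node, {})
        if seed in entry:  # the table already includes this seed for the node (404)
            if new_dist >= entry[seed]:  # not shorter than the current distance (408)
                return "ignored"  # ignore the received distance update (410)
            entry[seed] = new_dist  # replace the distance with the updated distance
        else:
            if len(entry) >= n_nearest:
                farthest = max(entry, key=entry.get)
                if new_dist >= entry[farthest]:
                    return "ignored"  # not shorter than any of the nearest seeds
                del entry[farthest]  # delete one of the nearest seeds from the table
            entry[seed] = new_dist  # add the seed and the updated distance to the table
        return "propagate"  # propagate the updated distance along the directed edge graph (414)

    distance_table = {"G": {"X": 3, "Y": 2, "Z": 3}}
    update = {"node": "G", "seed": "X", "distance": 2}
    print(process_distance_update(distance_table, update), distance_table["G"])
    # propagate {'X': 2, 'Y': 2, 'Z': 3}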
The shard server also determines whether the distance table includes, for the destination node, one of the seeds in the distance update (404). If the seed is found in the table for the node (406), the shard server determines whether the distance value is shorter than the current distance for the seed and the node (408). If the distance value is not shorter, the shard server ignores the received distance update (410). Otherwise, the shard server replaces the

[The specification is cut off at this point in the available copy of the document.]
