`
`1111111111111111111111111111111111111111111111111111111111111
`US007698509Bl
`
`c12) United States Patent
`Koster et al.
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 7,698,509 Bl
`Apr. 13, 2010
`
`(54) SNOOPING-BASED CACHE-COHERENCE
`FILTER FORA POINT-TO-POINT
`CONNECTED MULTIPROCESSING NODE
`
`(75)
`
`Inventors: Michael J. Koster, Freemont, CA (US);
`Christopher L. Johnson, San Francisco,
`CA (US); Brian W. O'Krafka, Austin,
`TX (US)
`
`(73) Assignee: Oracle America, Inc., Redwood Shores,
`CA (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 873 days.
`
`(21) Appl. No.: 10/889,952
`
`(22) Filed:
`
`Jul. 13, 2004
`
`(51)
`
`Int. Cl.
`G06F 12100
`(2006.01)
`(52) U.S. Cl. ....................... 7111146; 711/144; 711/145;
`711/128; 711/119; 711/E12.032; 711/E12.042
`(58) Field of Classification Search .................. 711/146
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`6,018,763 A *
`6,055,610 A *
`6,671,780 B1 *
`6,721,848 B2 *
`
`112000 Hughes eta!. .............. 709/213
`4/2000 Smith eta!. ................. 7111156
`12/2003 Lu et al ...................... 7111136
`4/2004 Gaither ....................... 7111122
`
`6,810,467 B1 *
`................ 7111146
`10/2004 Khare eta!.
`6,959,364 B2 *
`10/2005 Safraneket a!. ....... ...... 7111146
`2002/0087811 A1 *
`7/2002 Khare eta!.
`................ 7111146
`2002/0126704 A1 *
`.................. 370/503
`9/2002 Camet a!.
`2003/0009626 A1 *
`112003 Gruner eta!. ..... .. ... ..... 7111130
`2004/0068616 A1 *
`4/2004 Tierney eta!. .............. 7111141
`2004/02307 52 A1 *
`1112004 Blake eta!. ................. 7111147
`2005/0044195 A1 *
`............ .. ....... 709/223
`2/2005 Westfall
`2005/0228952 A1 *
`10/2005 Mayhew eta!. ............. 7111133
`FOREIGN PATENT DOCUMENTS
`889403 A2 * 111999
`EP
`* cited by examiner
`Primary Examiner-Reginald G Bragdon
`Assistant Examiner-Aracelis Ruiz
`(74) Attorney, Agent, or Firm-Osha • Liang LLP
`
`(57)
`
`ABSTRACT
`
`A multiprocessing node has a plurality of point-to-point con(cid:173)
`nected microprocessors. Each of the microprocessors is also
`point-to-point connected to a filter. In response to a local
`cache miss, a microprocessor issues a broadcast for the
`requested data to the filter. The filter, using memory that
`stores a copy of the tags of data stored in the local cache
`memories of each of the microprocessors, relays the broad(cid:173)
`cast to those/microprocessors having copies of the requested
`data. If the snoop filter memory indicates that none of the
`microprocessors have a copy of the requested data, the snoop
`filter may either (i) cancel the broadcast and issue a message
`back to the requesting microprocessor, or (ii) relay the broad(cid:173)
`cast to a connected multiprocessing node.
`
`15 Claims, 11 Drawing Sheets
`
`"up" adjacent
`node
`
`1~0
`
`"left'' adjacent
`node
`
`Microprocessor
`
`1~2
`
`1~6
`
`Microprocessor
`
`~
`
`~ 1~2
`
`,---..
`
`Microprocessor
`
`Snoop Filter
`I Shadow Tag I
`Memory
`1M
`
`i i
`
`1~4
`
`1 ~8 -
`
`Microprocessor
`
`" right" adjacent
`node
`
`"down" adJC!cent
`node
`
`Petition for Inter Partes Review of
`U.S. Pat. No. 7,296,121
`IPR2015‐00158
`EXHIBIT
`Sony‐
`
`1
`
`
`
`U.S. Patent
`
`Apr. 13, 2010
`
`Sheet 1 of 11
`
`US 7,698,509 Bl
`
`I'-(X)
`..--
`
`Q)
`N.C
`-~~
`(.)
`,.
`
`7
`
`""
`
`Q)
`
`..---£
`-l<O
`(.)
`
`IDGJ
`
`..--
`
`..__.
`
`~J
`...--
`
`>-
`c 0
`·n; E
`~Q)
`~
`,.
`
`.<
`
`7
`
`"'
`
`.._
`-
`Q)
`0
`~
`c
`0
`(.)
`Q)
`.c
`(.)
`<0
`(.)
`
`~J
`N
`
`.._
`0
`fJ)
`fJ)
`Q)
`(.)
`.._
`0
`0..
`.._
`0
`(.)
`
`~
`
`. /
`
`/ 0
`
`2
`
`
`
`U.S. Patent
`
`Apr. 13, 2010
`
`Sheet 2 of 11
`
`US 7,698,509 Bl
`
`(0
`("()
`
`'-
`...........
`())
`c
`:::l
`()) a.
`u E
`0 u
`
`co
`("()
`
`'-
`....... _.
`())
`c
`:::l
`()) a.
`u E
`0 u
`
`0
`'V
`
`'-
`())
`..............
`c
`:J
`()) a.
`0 E
`0
`u
`
`N
`'V
`
`'-
`_. .......
`())
`c
`:J
`()) a.
`0 E
`0
`u
`
`/
`
`0
`("()
`
`N
`("()
`
`'-
`
`())
`
`~ '-
`CD en
`CC>
`0 c
`<i:: ·-
`'
`en
`-o!IJ
`c
`())
`<U 0
`.... o
`cno_
`=
`:::l
`:iE
`
`3
`
`
`
`U.S. Patent
`
`Apr. 13, 2010
`
`Sheet 3 of 11
`
`US 7,698,509 Bl
`
`Q)
`0
`~ .....
`
`Q) -
`c --:::::J
`g.
`:::::J g
`:::::J a.
`c
`-
`
`-\
`v
`
`~J
`<0
`
`~._/
`
`gJ
`
`-\
`v
`
`>-.....
`0
`E
`Q)
`~
`c
`ro
`~
`
`._/
`
`<0
`._/
`1.0
`
`._/
`
`J
`
`.....
`0
`(/)
`(/)
`Q)
`0
`0
`.....
`a.
`0 .....
`0
`~
`
`.....
`0
`(/)
`(/)
`Q)
`0
`0 .....
`a.
`0 .....
`0
`~
`
`.....
`0
`(/)
`(/)
`Q)
`0
`0
`.....
`a.
`0 .....
`0
`~
`
`.....
`0
`(/)
`(/)
`Q)
`0
`0 .....
`a.
`0 .....
`0
`~
`
`Vt
`I\
`
`Vt
`l'v
`
`Vt
`l'v
`
`If
`l'v
`
`/
`
`0
`1.0
`
`4
`
`
`
`70
`
`84
`'
`
`Microprocessor
`
`Microprocessor
`
`~
`00
`•
`~
`~
`~
`
`~ = ~
`
`-
`72
`
`Cache
`Controller
`
`t \_..Cache-Me~ory
`
`Transaction
`
`t"'--i
`
`Local
`Cache
`
`---
`I./ Cache
`Controller
`
`1ooJi
`
`I
`
`80
`
`Local
`Cache
`
`~
`
`l
`
`82
`\
`
`Memory
`
`I
`
`FIG. 4
`
`>
`'e :-: ....
`0 ....
`
`~(H
`N
`
`0
`
`rFJ =-('D
`.....
`
`('D
`
`.j;o.
`
`0 .....
`....
`....
`
`d
`rJl
`-....l
`0..,
`\C
`
`00 u. = \C = """"'
`
`5
`
`
`
`94
`
`96
`
`Microprocessor
`
`Microprocessor
`
`98
`
`Cache I
`
`Controller ~ ~ Local
`Cache
`
`Cache
`
`I
`Controller H Local
`
`'100
`~
`
`Cache
`
`Network
`104
`
`)
`
`108
`
`90
`
`/
`
`Directory
`
`Memory
`
`""'-
`
`~
`
`~
`
`I
`
`Memory
`
`92
`
`l.l Dire:tory 1
`
`FIG. 5
`
`~
`00
`•
`~
`~
`~
`
`~ = ~
`
`>
`'e :-:
`......
`(,H
`
`~
`
`N
`0 ......
`0
`
`('D
`
`rFJ =-('D
`......
`Ul
`0 ......
`......
`......
`
`d
`rJl
`-....l
`0..,
`\C
`
`00 u. = \C = """"'
`
`6
`
`
`
`110
`
`1\2 •
`~ 4- Multiprocessing
`
`Node
`
`1\4 •
`
`Multiprocessing
`Node
`
`1\6 •
`
`Multiprocessing
`Node
`
`1\8 •
`
`Multiprocessing
`Node
`
`~
`
`~
`00
`•
`~
`~
`~
`
`~ = ~
`
`1 ~0
`
`1 ~2
`
`1~4
`
`1~6
`
`_. Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`~
`
`1~8
`
`1~0
`
`1~2
`
`1~4
`
`I
`I
`
`_. Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`..__
`
`116
`
`1~8
`
`1~0
`
`1~2
`
`~ Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`Multiprocessing
`Node
`
`..__
`
`t
`
`t
`
`FIG. 6
`
`t
`
`t
`
`(.H
`
`~
`
`N
`
`>
`'e :-: ....
`0 ....
`
`0
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`0\
`0 .....
`....
`....
`
`d
`rJl
`-....l
`0..,
`\C
`
`00 u. = \C = """"'
`
`7
`
`
`
`"up" adjacent
`node
`I'
`
`1 ~0
`
`~
`00
`•
`~
`~
`~
`
`~ = ~
`
`"left" adjacent
`node
`
`...
`
`Microprocessor
`
`I'
`
`1£2
`
`1~6
`
`Microprocessor
`
`- Microprocessor
`_____.
`
`I'
`
`1£4
`
`1~8
`
`Microprocessor
`
`.
`
`I
`
`'right" adjacent
`node
`
`~
`
`~ 1 ~2
`
`Snoop Filter
`
`Shadow Tag
`Memory
`164
`
`i i
`
`"down" adjacent
`node
`
`FIG. 7
`
`>
`'e :-: ....
`0 ....
`
`(.H
`
`~
`
`N
`
`0
`
`('D
`('D
`
`rFJ =(cid:173)
`.....
`-....l
`0 .....
`....
`....
`
`d
`rJl
`-....l
`0..,
`\C
`
`00 u. = \C = """"'
`
`8
`
`
`
`U.S. Patent
`
`Apr. 13, 2010
`
`Sheet 8 of 11
`
`US 7,698,509 Bl
`
`'-
`0
`U)
`U)
`(J)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`
`J • I
`
`r-v;!
`
`~J
`
`~ ~
`
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`<CO
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`
`,
`
`•
`
`I
`
`'-
`0
`U)
`U)
`Q)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`
`~-o---
`re-m---
`1---<( - ·
`
`-
`
`..
`...
`.....
`f'.-~ ~J
`
`'-
`0
`U)
`U)
`Q)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`
`I • J
`
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`Cl <(
`'
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`
`• I
`
`I
`
`'-
`0
`U)
`U)
`(J)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`
`9
`
`
`
`U.S. Patent
`
`Apr. 13, 2010
`
`Sheet 9 of 11
`
`US 7,698,509 Bl
`
`'-
`0
`U)
`U)
`Q)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`•
`
`~
`
`~../
`
`T"""
`
`~
`
`r - - - - -
`I .---+
`I I
`"'
`I I
`-
`all
`I<{
`I I
`I I
`I I
`I I
`I I
`I I
`
`,, +: r
`
`'-
`0
`U)
`U)
`Q)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`
`..
`
`-
`--
`"~ ~../
`
`T"""
`
`T"""
`
`'-
`0
`U)
`U)
`Q)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`
`~~. : ~
`
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`~
`I
`<{I
`1m
`'-
`2
`I
`I
`- ~~ ~
`·-
`I
`I
`I
`__ __ J
`lL
`s: 0 ~~
`a. oEm ..-----~
`0
`'tJQ)T""
`<O::;E
`0
`..c:
`c:
`(/)
`(/)
`
`0)
`
`-
`
`T"""
`
`"~ ~./
`..
`
`T"""
`
`-
`
`r
`
`r
`
`'-
`0
`U)
`U)
`Q)
`0
`0
`'-
`a.
`0
`'-
`0
`~
`
`10
`
`
`
`U.S. Patent
`
`Apr. 13, 2010
`
`Sheet 10 of 11
`
`US 7,698,509 Bl
`
`~
`
`0
`(/)
`(/)
`Q)
`u
`0
`~ c..
`0
`\... u
`~
`
`J
`
`0
`0
`N
`
`..
`
`..
`
`""'"
`'-0
`N
`
`~-/
`N
`
`0)
`
`~
`Q)
`
`.. ~~
`-·-u.
`3t 0 VI
`c.. oE..-
`ua>N
`0
`0
`~:2:
`c:
`en
`en
`
`~J
`N
`
`p - - -
`I r-~
`I I
`Cll.
`I~
`I I
`I I
`I I
`,, t I ,,
`
`N
`'-O
`N
`
`..
`
`~-/
`N
`..
`
`~
`
`0
`(/)
`rJ)
`Q)
`u
`0 .....
`c..
`0
`\... u
`~
`
`-
`-
`
`~
`
`0
`(/)
`(/)
`Q)
`0
`0
`\... c..
`0
`~ 0
`:2
`
`~
`
`.....
`0
`(/)
`rJ)
`Q)
`0
`0
`~ c..
`0
`~
`(.)
`~
`
`.h
`
`,
`
`.
`~
`Ll..:
`
`11
`
`
`
`U.S. Patent
`
`Apr. 13, 2010
`
`Sheet 11 of 11
`
`US 7,698,509 Bl
`
`Microprocessor
`
`..---------8----,
`-------A----, I
`I
`I
`t: 2F
`I I~
`
`202
`
`2~6
`
`Snoop Filter
`
`Microprocessor
`
`Microprocessor
`
`i : I i I
`
`I
`I
`I
`I
`
`I
`I
`
`I
`I
`I
`I
`I
`8 A
`I
`I
`I
`I
`I
`I
`
`+ I
`
`I
`I
`
`,
`
`~+
`
`:
`
`I
`
`I
`I
`
`200 '
`
`Microprocessor
`
`204
`
`~
`
`2~8
`
`Microprocessor
`
`.. Microprocessor
`
`..........
`220
`
`2£2
`
`2~6
`
`Snoop Filter
`
`......_232
`
`2£4
`
`2~8
`
`• i i
`
`------8----'
`Microprocessor ...
`I
`I
`..,. ___ -----A---~
`-
`
`~ Microprocessor
`
`FIG. 11
`
`12
`
`
`
`US 7,698,509 Bl
`
`1
`SNOOPING-BASED CACHE-COHERENCE
`FILTER FOR A POINT-TO-POINT
`CONNECTED MULTIPROCESSING NODE
`
`BACKGROUND OF INVENTION
`
`As shown in FIG.1, a typical computer system 10 includes
`at least a microprocessor 12 and a main memory 14. The main
`memory 14 contains data for use by the microprocessor 12 to
`perform the operations of the computer system 10. However,
`because the speed of the microprocessor 12 is typically sig(cid:173)
`nificantly faster than that of the main memory 14, memory of
`smaller size and faster speed (referred to and known as
`"cache" memory) is often implemented to allow the micro(cid:173)
`processor 12 to access frequently and/or recently requested
`data faster than it would otherwise take to obtain such data
`from the main memory 14.
`Still referring to FIG. 1, the microprocessor 12 has an
`"on-chip" (i.e., on the same semiconductor die as the micro(cid:173)
`processor 12), or "L1," cache memory 16 and an "off-chip,"
`or "L2," cache memory 18. When the microprocessor 12
`requests data, a cache controller 20 causes the L1 cache
`memory 16 to be searched for the requested data, and if that
`search does not "hit" (i.e., a cache "miss" occurs), the L2
`cache memory 18 is searched for the requested data. If the 25
`requested data is not found in the cache memories 16, 18, the
`requested data is retrieved from the relatively slow main
`memory 14.
`Those skilled in the art will recognize that a microproces(cid:173)
`sor may have any number of cache memory levels, which are 30
`typically referred to by number in order of decreasing prox(cid:173)
`imity to the microprocessor. Further, those skilled in the art
`will recognize that any number of cache memories may be
`on-chip and any number of cache memories may be off-chip.
`A computer system, like the one shown in FIG. 1, may be 35
`used as a system that services requests from and provides data
`to other computers connected over a network. Such a client(cid:173)
`server network model 30 is shown in FIG. 2. In FIG. 2, a
`stand-alone server 32 is connected over a network 34 to
`several client computers 36, 38, 40, 42. The server 32 may be
`used to store data, programs, etc. for use by the client com(cid:173)
`puters 36, 38, 40, 42. Those skilled in the art will recognize
`that the server 32 may also be used to manage and control the
`client computers 36, 38, 40, 42.
`Although some computer systems, like the one shown in
`FIG. 1, have a single microprocessor 12 (such a computer
`system referred to and known as a "uniprocessor" computer
`system), other computer systems, like the server 32 shown in
`FIG. 2, may be formed of multiple microprocessors. FIG. 3
`shows such a multiprocessing computer system 50.
`The computer system 50 of FIG. 3 is shown as having
`multiple microprocessors 52, 54, 56, 58. The microproces(cid:173)
`sors 52, 54, 56, 58 communicate with one another and with a
`main memory 60 over a network (e.g., a bus) 62. The network
`62 is implemented as a set of bits that propagate data in
`parallel from one location to another. The "bandwidth" of the
`network 62 (i.e., the number of bits propagated in parallel by
`the network 62) is an important factor in the overall perfor(cid:173)
`mance of the computer system 50. FIG. 3 also shows an
`input/output interface 64 that is connected to the network 62
`and serves to input and output data to other portions of the
`computer system 50 and/or components external to the com(cid:173)
`puter system 50.
`Those skilled in the art will recognize that the multipro(cid:173)
`cessing computer system 50 of FIG. 3 may represent a par(cid:173)
`ticular type of multiprocessing computer system used in net(cid:173)
`working and known and referred to as a symmetric
`
`2
`multiprocessing (SMP) computer system. A SMP computer
`system is one in which multiple microprocessors share, for
`example, the same memory and input/output interface. Those
`skilled in the art will also recognize that a SMP computer
`system may share the same operating system. Although the
`multiple microprocessors in a SMP computer system share
`the same resources, each microprocessor may act indepen(cid:173)
`dently. For example, while one microprocessor searches for
`data in memory, other microprocessors may update the
`10 memory and perform other tasks, thereby increasing the abil(cid:173)
`ity of the SMP computer system to handle intensive network(cid:173)
`ing demands.
`Those skilled in the art will recognize that SMP computer
`systems provide good scalability in that additional micropro-
`15 cessors may be added or removed with minimal changes to
`the system. Despite the benefits of SMP computer systems,
`bottlenecks may occur when several microprocessors on a
`board share a single memory bus. Rather than put too many
`microprocessors on the same SMP board, designers of net-
`20 work elements often distribute applications across a net(cid:173)
`worked cluster of SMP boards, where each board has its own
`memory, I/0 interface, and operating system.
`
`SUMMARY OF INVENTION
`
`According to one aspect of one or more embodiments of
`the present invention, a computer system comprises: a first
`integrated circuit having a local cache memory; a second
`integrated circuit connected to the first integrated circuit; and
`a filter connected by point-to-point interconnect to the first
`integrated circuit and the second integrated circuit, where, in
`response to a miss for requested data in the local cache
`memory, a broadcast for the requested data is propagated to
`the snoop filter, and where the snoop filter is configured to
`relay the broadcast to the second microprocessor dependent
`on whether the second integrated circuit has a copy of the
`requested data.
`According to another aspect of one or more embodiments
`of the present invention, a method of performing computer
`40 system operations comprises: requesting data from a cache
`memory of a first integrated circuit; if the requested data is not
`found in the cache memory, issuing a broadcast for the
`requested data to a filter point-to-point connected to the first
`integrated circuit; and if an address of the requested data is
`45 found in the filter, relaying the broadcast to a second inte(cid:173)
`grated circuit associated with the address.
`According to another aspect of one or more embodiments
`of the present invention, a computer network comprises: a
`cluster of individual SMP computer systems that are connect-
`50 able using point-to-point interconnect, at least one of the
`individual SMP computer systems having a filter, where, in
`response to a cache miss for requested data in an integrated
`circuit of the at least one of the individual SMP computer
`systems, the integrated circuit is configured to issue a broad-
`55 cast for the requested data to the filter, and where the filter is
`configured to relay the broadcast to another integrated circuit
`in the at least one of the individual SMP computer systems if
`the another integrated circuit has a copy of the requested data
`According to another aspect of one or more embodiments
`60 of the present invention, a computer system comprises a
`plurality of integrated circuits, a filter point-to-point con(cid:173)
`nected to the plurality of integrated circuits and having a
`memory that stores addresses of data stored in cache memo(cid:173)
`ries of the plurality of integrated circuits, and memory com-
`65 prising instructions to: requesting data from a cache memory
`of one of plurality of integrated circuits; if the requested data
`is not found in the cache memory, issuing a broadcast for the
`
`13
`
`
`
`US 7,698,509 Bl
`
`3
`requested data to the filter; and if an address of the requested
`data is found in the filter, relaying the broadcast to another one
`of the plurality of integrated circuits, the another one of the
`plurality of integrated circuits being associated with the
`address.
`Other aspects and advantages of the invention will be
`apparent from the following description and the appended
`claims.
`
`BRIEF DESCRIPTION OF DRAWINGS
`
`FIG. 1 shows a typical computer system.
`FIG. 2 shows a typical computer network.
`FIG. 3 shows a typical multiprocessor computer system.
`FIG. 4 shows a snooping cache-coherent multiprocessor
`computer system.
`FIG. 5 shows a directory-based cache-coherent multipro(cid:173)
`cessor computer system.
`FIG. 6 shows a computer system in accordance with an
`embodiment of the present invention.
`FIG. 7 shows a portion of a computer system in accordance
`with an embodiment of the present invention.
`FIG. 8 shows a flow of messages in a portion of a typical
`computer system.
`FIG. 9 shows an exemplary flow of messages in a portion of
`a computer system in accordance with an embodiment of the
`present invention.
`FIG.10 shows an exemplary flow of messages in a portion
`of a computer system in accordance with an embodiment of
`the present invention.
`FIG. 11 shows an exemplary flow of messages in a portion
`of a computer system in accordance with an embodiment of
`the present invention.
`
`DETAILED DESCRIPTION
`
`In a SMP computer system, such as that shown in FIG. 3,
`each microprocessor has its own cache memory (see discus(cid:173)
`sion of cache memories above with reference to FIG. 1).
`Thus, because a particular cache memory data item acted
`upon by one microprocessor may cause the copy of that data
`item to differ from other copies of that data item stored in the
`cache memories of the various microprocessors in a SMP
`computer system, "cache-coherency" techniques are imple(cid:173)
`mented to ensure that the local cache memory of each micro(cid:173)
`processor is consistent with respect to values that are stored in
`the cache memories of other microprocessors in the SMP
`computer system.
`Cache-coherence problems arise in SMP computer sys(cid:173)
`tems when more than one microprocessor cache memory
`holds a copy of a data item. One type of cache-coherency
`technique known and referred to as a "snooping" relies on all
`cache memories to monitor a common network (e.g., a bus)
`that connects microprocessors to memory. In other words, a
`snooping-based cache-coherency technique depends on the
`ability of cache memories to observe every transaction on a
`network (e.g., a bus) common to the cache memories.
`Now referring to FIG. 4, when microprocessor 70 requests
`data, a cache controller 72 local to microprocessor 70
`searches for the requested data in a cache memory 7 4 local to
`microprocessor 70. If the requested data is not found in the
`local cache memory 74, the cache controller 72 broadcasts a
`data request on a network (e.g., a bus) 76 connected to other
`cache controllers (e.g., cache controller 78) (others not
`shown). The cache controllers (e.g., cache controller 78)
`"snoop" on the network 76 to monitor all transactions on the
`network 76. If a particular cache memory (e.g., cache
`
`4
`memory 80 associated with microprocessor 84) has the data
`requested by the requesting cache controller 72, the cache
`controller (e.g., cache controller 78) associated with the cache
`memory (e.g., cache memory 80) having the requested data
`forwards (i.e., returns) the requested data to the requesting
`cache controller 72, which, in turn, updates its associated
`cache memory 74 with the returned requested data and pro(cid:173)
`vides the returned requested data to requesting microproces(cid:173)
`sor 70. Alternatively, if the requested data is not held in any of
`10 the cache memories 74, 80, a copy of the requested data in the
`main memory 82 is returned to and used by the requesting
`microprocessor 70.
`Further, a cache controller, connected to the network 76
`that observes data being written from one cache memory to
`15 another may invalidate or update its own copy of that data.
`The next time the cache controller's microprocessor requests
`that data, the most recent value of the data is provided to the
`microprocessor, either because its local cache memory has
`the most recent value of the data or through obtaining that
`20 data by generating a data request on the network 76.
`Those skilled in the art will recognize that although a
`snooping-based cache-coherency technique obtains data rela(cid:173)
`tively quickly (i.e., has relatively low latency), such a tech(cid:173)
`nique consumes relatively high bandwidth due to the parallel
`25 broadcast nature of its requests. As a result, snooping-based
`cache-coherency techniques are typically limited to small(cid:173)
`scale systems.
`Now referring to FIG. 5, in another type of cache-coher(cid:173)
`ency technique known and referred to as "directory-based
`30 cache-coherence," when a cache miss occurs in a local cache
`memory (e.g., local cache memory 98 or 100) of a micropro(cid:173)
`cessor (e.g., microprocessor 94 or 96), a cache controller
`(e.g., cache controller 102 or 106) issues a data request over a
`network 104 to a "home" directory (e.g., directory 90 or 92)
`35 of the requested data, the "home" directory typically being
`associated with the "home" memory (e.g., memory 108 or
`109) of the requested data. The "home" directory may indi(cid:173)
`cate to the cache controller a location of the requested data.
`Alternatively, if the "home" directory indicates that no other
`40 cache memories connected to the network 104 have the
`requested data, the requested data may be returned by the
`"home" memory of the requested data.
`One advantage of directory-based cache-coherency tech(cid:173)
`niques with respect to snooping-based cache-coherency tech-
`45 niques is that they keep track of which microprocessor nodes
`have copies of particular data, thereby eliminating the need
`for a high-bandwidth data request broadcast. This is valuable
`on read misses because a data request is subsequently satis(cid:173)
`fied either by the directory indicating the location of a copy of
`50 the requested data or by accessing the main memory.
`Further, because directory-based cache-coherent tech(cid:173)
`niques may rely on low-bandwidth interconnect rather than
`on high-bandwidth networks (e.g., buses) that are necessary
`for broadcasting in snooping-based cache-coherency tech-
`55 niques, directory-based cache-coherent SMP computer sys(cid:173)
`tems may be scalable to a large number of microprocessors.
`However, the indirection overheads associated with directory
`queries make directory-based cache-coherency techniques
`slower (i.e., have higher latency) than snooping-based cache-
`60 coherency techniques (e.g., a directory-based cache-coher(cid:173)
`ence technique may often require three times the number of
`"hops" otherwise taken in a snooping-based cache-coherence
`technique).
`For example, in a snooping-based cache-coherency tech-
`65 nique, upon a cache miss, one set of parallel messages is
`broadcast over a bus and one response message with the
`requested data is sent back to the requesting processing node.
`
`14
`
`
`
`US 7,698,509 Bl
`
`5
`On the other hand, in a directory-based cache-coherent tech(cid:173)
`nique, upon a cache miss, a data request message is sent to the
`home processing node, the home processing node forwards
`the data request message to the owning cache memory, and
`the owning cache memory returns the requested data to the
`requesting processing node. Thus, generally, in snooping(cid:173)
`based cache-coherency techniques, there are more messages
`in parallel (relatively low average latency), while in directory(cid:173)
`based cache-coherency techniques, there are more messages
`in series (relatively high average latency).
`Often, several small SMP servers (e.g., a near-commodity
`modular shelf server) are connected together to provide
`increased processing capabilities. Due to the limited band(cid:173)
`width of the cables connecting the servers, directory-based
`cache-coherency techniques are required to ensure cache- 15
`coherence among the servers. However, as discussed above,
`directory-based cache-coherency techniques have relatively
`high average latency compared to snooping-based cache(cid:173)
`coherency techniques.
`Embodiments of the present invention relate to a technique 20
`for implementing a snooping-based cache-coherence filter
`device (also referred to herein as "snoop filter") in a point-to(cid:173)
`point connected multiprocessing node. FIG. 6 shows an
`exemplary computer system 110 in accordance with an
`embodiment of the present invention. In FIG. 6, a plurality of 25
`multiprocessing nodes 112, 114, 116, 118, 120, 122, 124,
`126, 128, 130, 132, 134, 136, 138, 140, 142 are point-to-point
`connected using high -bandwidth interconnect (shown but not
`labeled). Particularly, each multiprocessing node 112, 114,
`116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 30
`140, 142 is connected to an adjacent multiprocessing node (in
`FIG. 6, each peripheral multiprocessing node is shown as
`being connected to the opposite peripheral multiprocessing
`node, e.g., multiprocessing node 112 is connected to multi(cid:173)
`processing node 118). In one or more other embodiments of 35
`the present invention, a multiprocessing node may be con(cid:173)
`nected to a non-adjacent multiprocessing node. Further, a
`multiprocessing node is not limited to a server and may be any
`type of computer system.
`In one or more embodiments of the present invention, 40
`high-bandwidth interconnect for point-to-point connected
`multiprocessing nodes may be implemented using intercon(cid:173)
`nect technologies such as, for example, Infiniband or PCI
`Express. In one or more other embodiments of the present
`invention, high-bandwidth interconnect used to point-to- 45
`point connect multiprocessing nodes may have a bandwidth
`greater than that of 16-bit 1 GHz interconnect.
`Further, in one or more embodiments of the present inven(cid:173)
`tion, point-to-point interconnect may be used in cabling a
`plurality of multiprocessing nodes (e.g., near-commodity 50
`shelf servers) together. Moreover, in one or more embodi(cid:173)
`ments of the present invention, point-to-point interconnect
`may be used to connect a plurality of multiprocessing nodes
`to a passive backplane.
`FIG. 7 shows an exemplary multiprocessing node 150 in 55
`accordance with an embodiment of the present invention. In
`FIG. 7, the multiprocessing node 150 has four microproces(cid:173)
`sors 152, 154, 156, 158. However, in one or more other
`embodiments of the present invention, a multiprocessing
`node may have any number of microprocessors. In FIG. 7, 60
`each of the microprocessors 152, 154, 156, 158 is connected
`to a snoop filter 162 via high-bandwidth point-to-point inter(cid:173)
`connect. The snoop filter 162 may be connected over high(cid:173)
`bandwidth point-to-point interconnect to one or more snoop
`filters (not shown) of connected multiprocessing nodes (not 65
`shown). Further, each microprocessor 152, 154, 156, 158 may
`be connected, using point-to-point interconnect, to every
`
`6
`other microprocessor 152, 154, 156, 158 in the multiprocess(cid:173)
`ing node 150 for, for example, chip-to-chip communication.
`Further, although the snoop filter 162 in FIG. 7 is shown as
`being connected to adjacent nodes, in one or more other
`embodiments of the present invention, a snoop filter may be
`connected to one or more non-adjacent nodes.
`The snoop filter 162 observes snooping-based cache-co(cid:173)
`herence broadcasts for requested data and the responses
`thereto. At least partly in order to determine whether to for-
`10 ward or cancel snooping-based cache-coherence broadcasts,
`the snoop filter 162 has local state memory (referred to and
`shown in FIG. 7 as "shadow tag memory") 164 that stores the
`tags of data stored in the local cache memories (e.g., "L2"
`cache memories) (not shown) of each of the microprocessors
`152, 154, 156, 158. In other words, the shadow tag memory
`164 has a copy of the tags of data stored in the local cache
`memories of the microprocessors 152, 154, 156, 158.
`By having the shadow tag memory 164, the snoop filter 162
`forwards a received broadcast for requested data (by one of
`the microprocessors 152, 154, 156, 158 or from another mul(cid:173)
`tiprocessing node (not shown)) to a particular one of the
`microprocessors 152, 154, 156, 158 only if its shadow tag
`memory 164 indicates that the particular microprocessor has
`a copy of the requested data. Otherwise, if the snoop filter 162
`determines that none of the microprocessors 152, 154, 156,
`158 has a copy of the requested data, the snoop filter 162 is
`configured to cancel any subsequent relays of the broadcast to
`the microprocessors 152, 154, 156, 158, and instead, sends a
`message back to the requesting microprocessor (or multipro(cid:173)
`cessing node (not shown)) indicating that none of the other
`microprocessors (or none of the microprocessors) in the mul-
`tiprocessing node 150 has a copy of the requested data.
`In one or more embodiments of the present invention, a
`shadow tag memory may be optimistically maintained as a
`set -associative cache. Further, in one or more embodiments of
`the present invention, the set-associative cache may use a
`MOESI (Modified Owner Exclusive Shared Invalid) cache(cid:173)
`coherency protocol.
`At least partly to illustrate the difference between an imple(cid:173)
`mentation of a multiprocessing node that uses a snoop filter in
`accordance with one or more embodiments of the present
`invention and one that does not, FIG. 8 shows an exemplary
`flow of messages in a multiprocessing node 170 (having
`microprocessors 172, 174, 176, 178) that does not have a
`snoop filter. In FIG. 8, microprocessor 172 ("the requesting
`microprocessor") issues a broadcast for requested data. The
`broadcast is initially forwarded as request A to microproces(cid:173)
`sors 174, 176 and is then relayed to microprocessor 178 from
`microprocessor 176. In response to request A, microproces(cid:173)
`sors 174, 176, 178 send their responses B, C, D, respectively,
`back to the requesting microprocessor 172.
`Those skilled in the art will recognize that because in the
`case shown in FIG. 8, if only microprocessor 174 has a copy
`of the requested data, the bandwidth of the multiprocessing
`node 170 used for responses A and Dis unnecessarily con(cid:173)
`sumed.
`However, as discussed above with reference to FIG. 7, by
`using a snoop filter in accordance with one or more embodi(cid:173)
`ments of the present invention, requests for data are sent only
`to those microprocessors having copies of the requested data.
`For example, FIG. 9 shows an exemplary flow of messages in
`a multiprocessing node 180 in accordance with an embodi(cid:173)
`ment of the present invention.
`The multiprocessing node 180 is shown as having a snoop
`filter 192 that is connected via high-bandwidth interconnect
`(shown, but not labeled) to microprocessors 182, 184, 186,
`188. In FIG. 9, microprocessor 182 ("the requesting micro-
`
`15
`
`
`
`US 7,698,509 Bl
`
`7
`processor") issues a broadcast A for requested data. Due to
`the presence of the snoop filter 192, broadcast A is first routed
`to the snoop filter 192. Referencing its shadow tag memory
`194, the snoop filter 192 determines whether any of micro(cid:173)
`processors 184, 186, 188 have a copy of the data requested in
`broadcast A. In the exemplary case shown in FIG. 9, the snoop
`filter 192, referencing its shadow tag memory 194, deter(cid:173)
`mines that microprocessor 188 has a copy of the data
`requested in broadcast A, and thus, the snoop filter 192 for(cid:173)
`wards broadcast A to microprocessor 188. In response to
`broadcast A being forwarded to microprocessor 188, micro(cid:173)
`processor 188 issues a response B (having a copy of the
`requested data) that is forwarded back to the requesting
`microprocessor 182 through the snoop filter 192.
`By forwarding response B through the snoop filter 192, the
`snoop filter 192 is able to update its shadow tag memory 194.