(12) Patent Application Publication    (10) Pub. No.: US 2002/0053004 A1
PONG                                   (43) Pub. Date: May 2, 2002
(54) ASYNCHRONOUS CACHE COHERENCE
     ARCHITECTURE IN A SHARED MEMORY
     MULTIPROCESSOR WITH POINT-TO-POINT
     LINKS

(76) Inventor: FONG PONG, MOUNTAIN VIEW,
     CA (US)
Correspondence Address:
IP ADMINISTRATION
LEGAL DEPARTMENT 20BN
HEWLETT PACKARD COMPANY
CA 94303-0890
(21) Appl. No.: 09/444,173

(22) Filed: Nov. 19, 1999

Publication Classification

(51) Int. Cl.7 ................................................ G06F 12/08
(52) U.S. Cl. .................. 711/119; 711/146; 711/144; 711/147

(57) ABSTRACT
In a shared memory, multiprocessor system, an asynchronous cache coherence method associates state information with each data block to indicate whether a copy of the data block is valid or invalid. When a processor in the multiprocessor system requests a data block, it issues the request to one or more other processors and the shared memory. Depending on the implementation, the request may be broadcast, or specifically targeted to processors having a copy of the requested data block. Each of the processors and memory that receive the request independently check to determine whether they have a valid copy of the requested data block based on the state information. Only the processor or memory having a valid copy of the requested data block responds to the request. The memory control path between each processor and a shared memory controller may be implemented with two unidirectional and dedicated point-to-point links for sending and receiving requests for blocks of data.
[Representative drawing (FIG. 3): processors 0 and 1 with scheduling windows, parallel links to a memory controller 300 containing an address bus/switch 304 and a data bus/switch 302, and memory banks 0-3.]
Patent Application Publication    May 2, 2002  Sheet 1 of 5    US 2002/0053004 A1

[FIG. 1: shared memory system 100 with processors 102-108 (caches 116, 118; request buffers 112, 114; control paths 120-126), interconnect 132, memory controller 130, and memory 110; data blocks 133, 134 carry state information 136, 138 and data content 140, 142.]
[FIG. 2: multiprocessor 200 with processors 202, 204 (caches 212, 214), dedicated links 208, 210, and memory controller 206 containing request queues (ReqQ 220, 222), snoop queues (SnoopQ 224, 226), an internal high-speed bus/switch 223, memory queues, and memory 232.]
Patent Application Publication    May 2, 2002  Sheet 2 of 5    US 2002/0053004 A1

[FIG. 3: data path implementation for the multiprocessor shown in FIG. 2; see the description of FIG. 3 below.]
Patent Application Publication    May 2, 2002  Sheet 3 of 5    US 2002/0053004 A1

[FIG. 4: multiprocessor with a directory filter/cache 400 integrated into the memory controller; see the description of FIG. 4 below.]
Patent Application Publication    May 2, 2002  Sheet 4 of 5    US 2002/0053004 A1

[FIG. 5: multiprocessor with an external cache or directory filter component connected to the memory controller; see the description of FIG. 5 below.]
Patent Application Publication    May 2, 2002  Sheet 5 of 5    US 2002/0053004 A1

[FIG. 6: a data block with state information 602 and a contiguous series of data bytes 604 . . . 606, each carrying an ECC bit; selected ECC bits hold the presence bit vector.]
ASYNCHRONOUS CACHE COHERENCE ARCHITECTURE IN A SHARED MEMORY MULTIPROCESSOR WITH POINT-TO-POINT LINKS

TECHNICAL FIELD

[0001] The invention relates to shared memory, multiprocessor systems, and in particular, cache coherence protocols.
BACKGROUND

[0002] A shared memory multiprocessor system is a type of computer system having two or more processors, each sharing the memory system and capable of executing its own program. These systems are referred to as “shared memory” because the processors can each access the system’s memory. There are a variety of different types of memory models, such as the Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA) and Cache Only Memory Architecture (COMA) models.
[0003] Both single and multiprocessors typically use caches to reduce the time required to access data in memory (the memory latency). A cache improves access time because it enables a processor to keep frequently used instructions or data nearby, where it can access them more quickly than from memory. Despite this benefit, cache schemes create a different challenge called the cache coherence problem. The cache coherence problem refers to the situation where different versions of the same data can have different values. For example, a newly revised copy of the data in the cache may be different than the old, stale copy in the memory. This problem is more complicated in multiprocessors, where each processor typically has its own cache.
[0004] The protocols used to maintain coherence for multiple processors are called cache-coherence protocols. The objective of these protocols is to track the state of any sharing of a data block. One type of protocol is called “snooping.” In this type of protocol, every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block. The caches are typically on a shared-memory bus, and all cache controllers monitor or “snoop” on the bus to determine whether or not they have a copy of a block that is requested on the bus.
[0005] One example of a traditional snooping protocol is the P6/P7 bus architecture from Intel Corporation. In Intel’s scheme, when a processor issues a memory access request and has a miss in its local cache, the request (address and command) is broadcast on the control bus. Subsequently, all other processors and the memory controller listening to the bus will latch in the request. The processors then each probe their local caches to see if they have the data. Also, the memory controller starts a “speculative” memory access. The memory access is termed “speculative” because it proceeds without knowing whether the data copy from the memory request will be used.
[0006] After a fixed number of cycles, all processors report their snoop results by asserting a HIT or HITM signal. The HIT signal means that the processor has a clean copy of the data in its local cache. The HITM signal means that the processor has an exclusive and modified copy of the data in its local cache. If a processor cannot report its snoop result in time, it will assert both the HIT and HITM signals. This results in insertions of the wait state until the processor completes its snoop activity.
[0007] Generally speaking, the snoop results serve two purposes: 1) they provide sharing information; and 2) they identify which entity should provide the missed data block, i.e., either one of the processors or the memory. In processing a read miss, a processor may load the missed block in the exclusive or shared state depending on whether the HIT or HITM signal is asserted. For example, in the case where another processor has the most recently modified copy of the requested data in a modified state, it asserts the HITM signal. Consequently, it prevents the memory from responding with the data.
[0008] Anytime a processor asserts the HITM signal, it must provide the data copy to the requesting processor. Importantly, the speculative memory access must be aborted. If no processor asserts the HITM signal, the memory controller will provide the data.
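The resolution rule in the preceding paragraphs can be sketched as a small function. This is an illustrative sketch of the prior-art scheme, not Intel's implementation; the snoop_report type, its field names, and the resolve_snoop helper are assumptions made for the example.

```c
#include <stdbool.h>

/* Per-processor snoop report in the traditional scheme: HIT = clean
 * copy, HITM = modified copy, both asserted = "not ready yet". */
typedef struct { bool hit; bool hitm; } snoop_report;

enum supplier { SUPPLIER_MEMORY, SUPPLIER_PROCESSOR, SUPPLIER_WAIT };

/* Resolve who supplies the data once all n processors have reported.
 * Any HITM aborts the speculative memory access; a processor with
 * both signals asserted has not finished snooping, so wait states
 * are inserted. With no HITM, the memory controller responds. */
enum supplier resolve_snoop(const snoop_report *r, int n, int *owner)
{
    for (int i = 0; i < n; i++)
        if (r[i].hit && r[i].hitm)
            return SUPPLIER_WAIT;       /* insert wait states */
    for (int i = 0; i < n; i++)
        if (r[i].hitm) {
            *owner = i;                 /* modified copy wins */
            return SUPPLIER_PROCESSOR;
        }
    return SUPPLIER_MEMORY;
}
```

Note how every processor must report before the supplier is known; this is the synchronization burden the asynchronous protocol of the invention removes.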
[0009] The traditional snooping scheme outlined above has limitations in that it requires all processors to synchronize their response. The design may synchronize the response by requiring all processors to generate their snoop results in exactly the same cycle. This requirement imposes a fixed latency time constraint between receiving bus requests and producing the snoop results.
[0010] The fixed latency constraint presents a number of challenges for the design of processors with multiple-level cache hierarchies. In order to satisfy the fixed latency constraint, the processor may require a special purpose, ultra fast snooping logic path. The processor may have to adopt a priority scheme in which it assigns a higher priority to snoop requests than to requests from the processor’s execution unit. If the processor cannot be made fast enough, the fixed time between snoop request and snoop report may be increased. Some combination of these approaches may be necessary to implement synchronized snooping.
[0011] The traditional snooping scheme may not save memory bandwidth. In order to reduce memory access latency, the scheme fetches the memory copy of the requested data in parallel with the processor cache look-up operations. As a result, unnecessary accesses to memory occur. Even if a processor asserts a HITM signal indicating that it will provide the requested data, the speculative access to memory still occurs, but the memory does not return its copy.
SUMMARY

[0012] The invention provides an asynchronous cache coherence method and a multiprocessor system that employs an asynchronous cache coherence protocol. One particular implementation uses point-to-point links to communicate memory requests between the processors and memory in a shared memory, multiprocessor system.
[0013] In the asynchronous cache coherence method, state information associated with each data block indicates whether a copy of the data block is valid or invalid. When a processor in the multiprocessor system requests a data block, it issues the request to one or more other processors and the shared memory. Depending on the implementation, the request may be broadcast, or specifically targeted to processors having a copy of the requested data block. Each of the processors and memory that receive the request independently check to determine whether they have a valid
copy of the requested data block based on the state information. Only the processor or memory having a valid copy of the requested data block responds to the request.
[0014] A multiprocessor that employs the asynchronous cache coherence protocol has two or more processors that communicate with a shared memory via a memory controller. Each of the processors and the shared memory is capable of storing a copy of a data block, and each data block is associated with state indicating whether the copy is valid. The processors communicate a request for a data block to the memory controller. The other processors and shared memory process the request by checking whether they have a valid copy of the data block. The processor or shared memory having the valid copy of the requested data block responds, and the other processors drop the request silently.
[0015] One implementation utilizes point-to-point links in the memory control path to send and receive requests for blocks of data. In particular, each processor communicates with the memory controller via two dedicated and unidirectional links. One link issues requests for data blocks, while the other receives requests. Similar point-to-point links may be used to communicate blocks of data between processors and the memory controller.
[0016] Further features and advantages of the invention will become apparent with reference to the following detailed description and accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

[0017] FIG. 1 is a block diagram illustrating a shared memory multiprocessor that employs an asynchronous cache protocol.

[0018] FIG. 2 is a block diagram illustrating an example of a multiprocessor that implements a memory control path with point-to-point links between processors and the memory controller.

[0019] FIG. 3 is a block diagram illustrating an example of a data path implementation for the multiprocessor shown in FIG. 2.

[0020] FIG. 4 is a block diagram of a multiprocessor with a memory controller that uses an internal cache for buffering frequently accessed data blocks.
[0021] FIG. 5 is a block diagram of a multiprocessor with a memory controller that uses an external cache for buffering frequently accessed data blocks.

[0022] FIG. 6 illustrates a data block that incorporates a directory identifying which processors have a copy of the block.

DESCRIPTION

Introduction
[0023] For the sake of this discussion, the code and data in a multiprocessor system are generally called “data.” The system organizes this data into blocks. Each of these blocks is associated with state information (sometimes referred to as “state”) that indicates whether the block is valid or invalid. This state may be implemented using a single bit per memory block. Initially, blocks in memory are in the valid state. When one of the processors in the multiprocessor system modifies a block of data from memory, the system changes the state of the copy in memory to the invalid state.
[0024] This approach avoids the need for processors to report snoop results. Each processor processes requests for a block of data independently. In particular, each processor propagates a read or write request through its cache hierarchy independently. When a processor probes its local cache and discovers that it does not have a data block requested by another processor, it simply drops the request without responding. Conversely, if the processor has the requested block, it proceeds to provide it to the requesting processor. This scheme is sometimes referred to as “asynchronous” because the processors do not have to synchronize a response to a request for a data block.
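The drop-or-respond behavior described above can be sketched in a few lines of C, assuming the single valid bit per block from paragraph [0023]. The block_copy type and handle_request helper are names invented for this illustration.

```c
#include <stdbool.h>

/* Each agent (processor cache or main memory) keeps a single valid
 * bit per block copy. */
typedef struct { bool valid; int data; } block_copy;

/* Process an incoming request independently of all other agents:
 * an agent without a valid copy drops the request silently (returns
 * false, no snoop report); the agent with the valid copy responds
 * with the data. */
bool handle_request(const block_copy *b, int *out)
{
    if (!b->valid)
        return false;   /* silent drop: no synchronized report */
    *out = b->data;
    return true;
}
```

Because each agent answers only from its own state, no fixed-latency snoop window is needed.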
[0025] FIG. 1 illustrates an example of a shared memory system 100 that employs this approach. The system has a number of processors 102-108 that each accesses a shared memory 110. Two of the processors 102, 104 are expanded slightly to reveal an internal FIFO buffer (e.g., 112, 114), a cache system (e.g., 116, 118) and control paths (e.g., 120-126) for sending and receiving memory access requests. For example, a processor 102 in this architecture has a control path 120 for issuing a request, and a control path 122 for receiving requests. FIG. 1 does not illustrate a specific example of the cache hierarchy because it is not critical to the design.
[0026] Each of the processors communicates with each other and a memory controller 130 via an interconnect 132. FIG. 1 depicts the interconnect 132 generally because it may be implemented in a variety of ways, including, but not limited to, a shared bus or switch. In this model, the main memory 110 is treated like a cache. At any given time, one or more processors may have a copy of a data block from memory in its cache. When a processor modifies a copy of the block, the other copies in main memory and other caches become invalid. The state information associated with each block reflects its status as either valid or invalid.
[0027] FIG. 1 shows two examples of data blocks 133, 134 and their associated state information (see, e.g., state information labeled with reference nos. 136 and 138). In the two examples, the state information is appended to the data block. Each block has at least one bit for state information and the remainder of the block is data content 140, 142. Although the state information is shown appended to the copy of the block, it is possible to keep the state information separately from the block as long as it is associated with the block.
[0028] Point-to-Point Links in the Memory Control Path

[0029] While the interconnect illustrated in FIG. 1 can be implemented using a conventional bus design based on shared wires, such a design has limited scalability. The electrical loading of devices on the bus, in particular, limits the speed of the bus clock as well as the number of devices that can be attached to the bus. A better approach is to use high speed point-to-point links as the physical medium interconnecting processors with the memory controller. The topology of a point-to-point architecture may be made transparent to the devices utilizing it by emulating a shared bus type of protocol.
[0030] FIG. 2 is a block diagram of a shared memory multiprocessor 200 that employs point-to-point links for the memory control path. In the design shown in FIG. 2, the processors 202, 204 and memory controller 206 communicate through two dedicated and unidirectional links (e.g., links 208 and 210 for processor 202).
[0031] Like FIG. 1, FIG. 2 simplifies the internal details of the processors because they are not particularly pertinent to the memory system. The processors include one or more caches (e.g., 212-214), and a FIFO queue for buffering incoming requests for data (e.g., 216, 218).
[0032] When a processor issues a request for a block of data, the request first enters a request queue (ReqQ, 220, 222) in the memory controller 206. The memory controller has one request queue per processor. The queues may be designed to broadcast the request to all other processors and the memory, or alternatively may target the request to a specific processor or set of processors known to have a copy of the requested block. In the latter case, the system has additional support for keeping track of which processors have a data block, as explained further below.
[0033] Preferably, the request queues communicate requests via a high-speed internal address bus or switch 223 (referred to generally as a “bus” or “control path interconnect”). Each of the processors and main memory devices is capable of storing a copy of a requested data block. Therefore, each has a corresponding destination buffer (e.g., queues 224, 226, 228, 230) in the memory controller for receiving memory requests from the bus 223. The buffers for receiving requests destined for processors are referred to as snoop queues (e.g., SnoopQs 224 and 226).
[0034] The main memory 232 may be comprised of a number of discrete memory devices, such as the memory banks 234, 236 shown in FIG. 2. These devices may be implemented in conventional DRAM, SDRAM, RAMBUS DRAM, etc. The buffers for receiving requests destined for these memory devices are referred to as memory queues (e.g., memoryQs 228, 230).
[0035] The snoopQs and memoryQs process memory requests independently in a First In, First Out manner. Unless specified otherwise, each of the queues and buffers in the multiprocessor system processes requests and data in a First In, First Out manner. The snoopQs process requests one by one and issue them to the corresponding processor. For example, the snoopQ labeled 224 in FIG. 2 sends requests to the processor labeled 202, which then buffers the requests in its internal buffer 216, and ultimately checks its cache hierarchy for a valid copy of the requested block.
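The First In, First Out discipline these queues follow can be sketched with a small ring buffer. The fifo type, its capacity, and the helper names below are illustrative assumptions, not part of the patent; the false return on a full queue stands in for back-pressure on the link.

```c
#include <stdbool.h>

/* Minimal FIFO ring buffer sketching a snoopQ/memoryQ. Requests are
 * represented as plain ints for the example. */
#define QCAP 8
typedef struct { int buf[QCAP]; int head, tail, count; } fifo;

/* Enqueue a request; refuse when full (back-pressure on the link). */
bool fifo_push(fifo *q, int req)
{
    if (q->count == QCAP) return false;
    q->buf[q->tail] = req;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    return true;
}

/* Dequeue the oldest request, preserving First In, First Out order. */
bool fifo_pop(fifo *q, int *req)
{
    if (q->count == 0) return false;
    *req = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return true;
}
```

As paragraph [0037] notes, as long as the destination buffer has space, a new request can be accepted every cycle.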
[0036] Just as a request is queued in the snoopQs, it is also queued in the memoryQ, which initiates the memory accesses to the appropriate memory banks.
[0037] The point-to-point links in the memory control path have a number of advantages over a conventional bus design based on shared wires. First, the relatively complex bus protocol required for a shared bus is reduced to a simple point-to-point protocol. As long as the destination buffers have space, a request can be pumped into the link every cycle. Second, the point-to-point links can be clocked at a higher frequency (e.g., 400 MHz) than the traditional system bus (e.g., 66 MHz to 100 MHz). Third, more processors can be attached to a single memory controller, provided that the memory bandwidth is not a bottleneck. The point-to-point links allow more processors to be connected to the memory controller because they are narrower links (i.e., they have fewer wires) than a full-width bus.
[0038] The Data Path

[0039] The system illustrated in FIG. 2 only shows the memory control path. The path for transferring data blocks between memory and each of the processors is referred to as the data path. The data path may be implemented with data switches, point-to-point links, a shared bus, etc. FIG. 3 illustrates one possible implementation of the data path for the architecture shown in FIG. 2.
[0040] In FIG. 3, the memory controller 300 is expanded to show a data path implemented with a data bus or switch 302. The control path is implemented using an address bus or switch 304 as described above. In response to the request queues (e.g., 306, 308), the control path communicates requests to the snoop queues (e.g., 310, 312) for the processors 314-320 and to the memory queues (e.g., 322-328) for the memory banks 330-336.
[0041] In this design, each of the processors has two dedicated and unidirectional point-to-point links 340-346 with the memory controller 300 for transferring data blocks. The data blocks transferred along these links are buffered at each end. For example, a data block coming from the data bus 302 and destined for a processor is buffered in an incoming queue 350 corresponding to that processor in the memory controller. Conversely, a data block coming from the processor and destined for the data bus is buffered in an outgoing queue 352 corresponding to that processor in the memory controller. The data bus, in turn, has a series of high speed data links (e.g., 354) with each of the memory banks (330-336).
[0042] Two of the processors 314, 316 are expanded to reveal an example of a cache hierarchy. For example, the cache hierarchy in processor 0 has a level two cache, and separate data and instruction caches 362, 364. This diagram depicts only one possible example of a cache hierarchy. The processor receives control and data in memory control and data buffers 366, 368, respectively. The level two cache includes control logic to process requests for data blocks from the memory control buffer 366. In addition, it has a data path for receiving data blocks from the data buffer 368. The level two cache partitions code and data into the instruction and data caches, respectively. The execution unit 370 within the processor fetches and executes instructions from the instruction cache and controls transfers of data between the data cache and its internal register files.
[0043] When the processor needs to access a data block and does not have it in its cache, the level two cache issues a request for the block to its internal request queue 372, which in turn sends the request to a corresponding request queue 306 in the memory controller. When the processor is responding to a request for a data block, the level two cache transfers the data block to an internal data queue 374. This data queue, in turn, processes data blocks in FIFO order, and transfers them to the corresponding data queue 352 in the memory controller.
[0044] Further Optimizations

[0045] The performance of the control path may be improved by keeping track of which processors have copies of a data block and limiting traffic in the control path by specifically addressing other processors or memory rather than broadcasting commands.
[0046] Directory Based Filter for Read Misses

[0047] Since data blocks are associated with additional state information, this state information can be extended to include the ID of the processor that currently has a particular data block. This ID can be used to target a processor when a requesting processor makes a read request and finds that its cache does not have a valid copy of the requested data block. Using the processor ID associated with the requested data block, the requesting processor specifically addresses the read request to the processor that has the valid copy. All other processors are shielded from receiving the request.
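A minimal sketch of this read-miss filter, assuming the state kept with each block is the valid bit plus an owning processor's ID. The block_state type, the NO_OWNER sentinel, and the read_miss_target helper are names invented for the example.

```c
/* State kept with a block: valid bit for the memory copy, plus the
 * ID of the processor holding the current copy (if any). */
#define NO_OWNER (-1)

typedef struct { int valid; int owner; } block_state;

/* On a read miss, return the single processor to target, or
 * NO_OWNER when memory itself holds the valid copy, in which case
 * no processor is snooped and all are shielded from the request. */
int read_miss_target(const block_state *s)
{
    return s->valid ? NO_OWNER : s->owner;
}
```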
[0048] While this approach improves the performance of memory accesses on read requests, it does not address the issue of cache coherence for write requests. Shared memory multiprocessors typically implement a cache coherence protocol to make sure that the processors access the correct copy of a data block after it is modified. There are two primary protocols for cache coherence: write invalidation and write update. The write invalidation protocol invalidates other copies of a data block in response to a write operation. The write update (sometimes referred to as write broadcast) protocol updates all of the cached copies of a data block when it is modified in a write operation.
[0049] In the specific approach outlined above for using the processor ID to address a processor on a read request, the multiprocessor system may implement a write update or write invalidation protocol. In the case of a write invalidation protocol, the memory controller broadcasts write invalidations to all processors, or uses a directory to reduce traffic in the control path as explained in the next section.
[0050] Directory Based Filter for All Traffic
[0051] To further reduce traffic in the control path, the memory controller can use a directory to track the processors that have a copy of a particular data block. A directory, in this context, is a mechanism for identifying which processors have a copy of a data block. One way to implement the directory is with a presence bit vector. Each processor has a bit in the presence bit vector for a data block. When the bit corresponding to a processor is set in the bit vector, the processor has a copy of the data block.
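A presence bit vector of this kind maps directly onto ordinary bit operations. The sketch below assumes at most 32 processors so the vector fits in one 32-bit word; the type and helper names are invented for the example.

```c
#include <stdint.h>

/* One bit per processor per data block: bit i set means processor i
 * has a copy of the block. */
typedef uint32_t presence_t;

static inline presence_t presence_add(presence_t v, int cpu)
{
    return v | (1u << cpu);     /* processor cpu now has a copy */
}

static inline presence_t presence_remove(presence_t v, int cpu)
{
    return v & ~(1u << cpu);    /* copy invalidated or evicted */
}

static inline int presence_has(presence_t v, int cpu)
{
    return (v >> cpu) & 1u;
}
```

In a write invalidation protocol (paragraph [0052]), the controller would iterate over the set bits to multicast an invalidation only to the sharers.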
[0052] In a write invalidation protocol, the memory controller can utilize the directory to determine which processors have a copy of a data block, and then multi-cast a write invalidation only to the processors that have a copy of the data block. The directory acts as a filter in that it reduces the number of processors that are targeted for a write invalidation request.
[0053] Implementation of the Memory Directory

[0054] There are a variety of ways to implement a memory directory. Some possible examples are discussed below.
[0055] Separate Memory Depository

[0056] One way to implement the memory directory is to use a separate memory bank for the directory information. In this implementation, the memory controller directs a request from the request queue to the directory, which filters the request and addresses it to the appropriate processors (and possibly memory devices). FIGS. 4 and 5 show alternative implementations of the multiprocessor system depicted in FIG. 2. Since these figures contain similar components to those depicted in FIGS. 2 and 3, only components of interest to the following discussion are labeled with reference numbers. Unless otherwise noted, the description of the components is the same as provided above.
[0057] As shown in these figures, the directory may be stored in a memory device that is either integrated into the memory controller or implemented in a separate component. In FIG. 4, the directory is stored on a memory device 400 integrated into the memory controller. The directory filter 400 receives requests from the request queues (e.g., 402, 404) in the memory controller, determines which processors have a copy of the data block of interest, and forwards the request to the snoopQ(s) (e.g., 406, 408) corresponding to these processors via the address bus 410. In addition, the directory filter forwards the request to the memoryQ (e.g., 412) of the memory device that stores the requested data block via the address bus 410.
[0058] In FIG. 5, the directory is stored on a separate memory component 500. The operation of the directory filter is similar to the one shown in FIG. 4, except that a controller 502 is used to interconnect the request queues 504, 506 and the address bus 508 with the directory filter 500.
[0059] Folding the Directory into Data Blocks
[0060] Rather than using a component to maintain the directory, it may be incorporated into the data blocks. For example, the directory may be incorporated into the Error Correction Code (ECC) bits of the block. Memory is typically addressed in units of bytes. A byte is an 8 bit quantity. In addition to the 8 bits of data within a byte, each byte is usually associated with an additional ECC bit. In the case where a data block is comprised of 64 bytes, there are 64 ECC bits. In practice, nine bits of ECC are used to protect 128 bits of data. Thus, only 36 ECC bits are necessary to protect a block of 64 bytes. The remaining 28 ECC bits may be used to store the directory.
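The bit accounting above can be checked with a few constants. This is simply the arithmetic from the text restated as code; the identifier names are invented for the example.

```c
/* ECC bit budget for a 64-byte block, per the scheme in the text:
 * one ECC bit per byte gives 64 ECC bits total; 9 ECC bits protect
 * each 128 data bits, so 512 data bits need 4 * 9 = 36 ECC bits,
 * leaving 64 - 36 = 28 bits free for the presence bit vector. */
enum {
    BLOCK_BYTES     = 64,
    ECC_BITS_TOTAL  = BLOCK_BYTES,               /* one per byte   */
    DATA_BITS       = BLOCK_BYTES * 8,           /* 512 data bits  */
    ECC_BITS_NEEDED = (DATA_BITS / 128) * 9,     /* 36 bits        */
    DIR_BITS_FREE   = ECC_BITS_TOTAL - ECC_BITS_NEEDED  /* 28 bits */
};
```

Note the 28 free bits cap the directory at 28 processors under this particular block and ECC geometry.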
[0061] FIG. 6 illustrates an example of a data block that incorporates a presence bit vector in selected ECC bits of the block. The block is associated with state information 602, such as a bit indicating whether the block is valid or invalid, and the processor ID of a processor currently having a valid copy of the block. The data content section of the block is shown as a contiguous series of bytes (e.g., 604 . . . 606), each having an ECC bit. Some of these bits serve as part of the block’s error correction code, while others are bits in the presence bit vector. Each bit in the presence bit vector corresponds to a processor in the system and indicates whether that processor has a copy of the block.
[0062] Reducing Latency and Demand for Memory Bandwidth
[0063] The directory scheme does not solve the problem of memory bandwidth. Due to the directory information, a request to access a block may potentially require two memory accesses: one access for the data, and another for updating the directory.
[0064] A further optimization to reduce accesses to memory is to buffer frequently accessed blocks in a shared cache as shown in FIGS. 4 and 5. The use of a cache reduces accesses to memory because many of the requests can be satisfied by accessing the memory controller’s cache instead of the main memory. The blocks 400, 500 in FIGS.
4 and 5 that illustrate the directory filter also illustrate a possible implementation of a cache.
[0065] FIG. 4 illustrates a cache 400 that is integrated into the memory controller. The cache is a fraction of the size of main memory and stores the most frequently used data blocks. The memory controller issues requests to the cache directly from the request queues 402, 404. When the requested block is in the cache, the cache provides it to the requesting processor via the data bus and the data queue of the requesting processor. When a block is requested that is not in the cache, the cache replaces an infrequently used block in the cache with the requested block. The cache uses a link 420 between it and the data bus 422 to transfer data blocks to and from memory 424 and to and from the data queues (e.g., 426, 428) corresponding to the processors.
[0066] FIG. 5 illustrates a cache that is implemented in a separate component from the memory controller. The operation of the cache is similar to the cache in FIG. 4, except that the controller 502 is responsible for receiving requests from the request queues 504, 506 and forwarding them to the cache 500. In addition, the cache communicates with data queues and memory on the data bus 510 via a link 512 between the controller and the data bus.
[0067] Conclusion

[0068] While the invention is described with reference to specific implementations, the scope of the invention is not limited to these implementations. There are a variety of ways to implement the invention. For example, the examples provided above show point-to-point links in the control and data path between the processors and memory. However, it is possible to implement a similar asynchronous cache coherence scheme without using point-to-point control or data links. It is possible to use a shared bus instead of indepen