Keller et al.

(10) Patent No.: US 6,490,661 B1
(45) Date of Patent: Dec. 3, 2002
`
`(54) MAINTAINING CACHE COHERENCY
`DURING A MEMORY READ OPERATION IN
`A MULTIPROCESSING COMPUTER
`SYSTEM
`
`(75) Inventors: James B. Keller, Palo Alto, CA (US);
`Derrick R. Meyer, Austin, TX (US)
`
`(73) Assignee: Advanced Micro Devices, Inc.,
`Sunnyvale, CA (US)
`
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
`
(21) Appl. No.: 09/217,212
(22) Filed: Dec. 21, 1998
`
(51) Int. Cl.7 .............................. G06F 13/14
(52) U.S. Cl. ................. 711/150; 711/141; 711/130
(58) Field of Search .............. 711/141, 144, 145, 146,
         711/130, 119, 150; 709/213, 214, 216, 200, 238
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,303,362 A
`5,412,788 A
`
`4/1994 Butts, Jr. et 211.
`5/1995 Collins et 211.
`
`(List continued on neXt page.)
`
FOREIGN PATENT DOCUMENTS

EP  0 379 771    8/1990
EP  0 412 353    2/1991
EP  0 611 026    8/1994
EP  0 777 184    6/1997
EP    817 076    1/1998
`
OTHER PUBLICATIONS

Kumar, et al., "Efficient and Scalable Cache Coherence Schemes for Shared Memory Hypercube Multiprocessors," IEEE, XP000533913, Pub. Date Nov. 14, 1994, pp. 498-507.

Laudon, et al., "The SGI Origin: A ccNUMA Highly Scalable Server," XP000656582, Silicon Graphics, Inc., ACM, 1997, pp. 241-251.

Jhang et al., "A New Write-Invalidate Snooping Cache Coherence Protocol for Split Transaction Bus-Based Multiprocessor Systems," IEEE TENCON, Oct. 1993, pp. 229-232.

Prete, "RST Cache Memory Design for a Tightly Coupled Multiprocessor System," IEEE Micro, vol. 11, No. 2, Apr. 1991, pp. 16-19 and 40-52.
`
Primary Examiner: Hong Kim
(74) Attorney, Agent, or Firm: Lawrence J. Merkel
`
`(57)
`
`ABSTRACT
`
A messaging scheme that accomplishes cache-coherent data transfers during a memory read operation in a multiprocessing computer system is described. A source processing node sends a read command to a target processing node to read data from a designated memory location in a system memory associated with the target processing node. In response to the read command, the target processing node transmits a probe command to all the remaining processing nodes in the computer system, regardless of whether one or more of the remaining nodes have a copy of the data cached in their respective cache memories. The probe command causes each node to maintain cache coherency by appropriately changing the state of the cache block containing the requested data and by causing the node having an updated copy of the cache block to send the cache block to the source node. Each processing node that receives a probe command sends, in return, a probe response indicating whether that processing node has a cached copy of the data and, if so, the state of the cached copy. The target node sends a read response including the requested data to the source node. The source node waits for responses from the target node and from each of the remaining nodes in the system and acknowledges the receipt of the requested data by sending a source done response to the target node.
`
`35 Claims, 13 Drawing Sheets
`
`
U.S. PATENT DOCUMENTS (continued)

5,517,494 A      5/1996  Green
5,537,575 A *    7/1996  Foley et al. .............. 711/141
5,560,038 A      9/1996  Haddock ................... 709/236
5,659,708 A      8/1997  Arimilli et al. ........... 711/146
5,673,413 A      9/1997  Deshpande et al.
5,684,977 A     11/1997  Van Loo et al.
5,749,095 A      5/1998  Hagersten
5,859,983 A      1/1999  Heller et al. ............. 709/251
5,878,268 A      3/1999  Hagersten ................. 709/215
5,887,138 A      3/1999  Hagersten et al.
5,893,144 A      4/1999  Wood et al. ............... 711/122
5,927,118 A      7/1999  Minote et al.
5,966,729 A     10/1999  Phelps
5,987,544 A *   11/1999  Bannon et al. ............. 710/100
5,991,819 A     11/1999  Young ..................... 709/253
6,012,127 A      1/2000  McDonald et al. ........... 711/141
6,018,791 A      1/2000  Arimilli et al. ........... 711/141
6,038,644 A *    3/2000  Irie et al. ............... 711/141
6,049,851 A *    4/2000  Bryg et al. ............... 711/141
6,070,231 A      5/2000  Ottinger .................. 711/141
6,085,263 A      7/2000  Sharma et al.
6,098,115 A      8/2000  Eberhard et al. ........... 710/7
6,101,420 A      8/2000  VanDoren et al.
6,108,737 A *    8/2000  Sharma et al. ............. 710/107
6,108,752 A      8/2000  Van Doren et al.
6,112,281 A      8/2000  Bamford et al.
6,138,218 A *   10/2000  Arimilli et al. ........... 711/146
6,199,153 B1 *   3/2001  Razdan et al. ............. 711/172
6,209,065 B1 *   3/2001  Van Doren et al. .......... 711/150
6,249,846 B1 *   6/2001  Van Doren et al. .......... 710/39
6,275,905 B1     8/2001  Keller et al. ............. 711/141
6,286,090 B1 *   9/2001  Steely et al. ............. 711/152
6,292,705 B1     9/2001  Wang et al. ............... 700/5
6,295,853 B1     9/2001  Razdan et al.
6,370,621 B1     4/2002  Keller
6,393,529 B1     5/2002  Keller

* cited by examiner
`
`2
`
`
`
`U.S. Patent
`
`Dec. 3, 2002
`
`Sheet 1 0f 13
`
`US 6,490,661 B1
`
`Memory
`Memory
`m
`E
`I f16A
`I r168
`MC
`MC
`18C—\— |_I/ —/—18A
`—— |_—r ——/—18F
`/
`/ 24A
`/
`Processing
`‘
`Processing
`Node
`IF
`Node
`IF
`E i E
`18D /
`
`IF
`
`IF
`
`IF M\
`\ 18B
`
`M\
`\ 18E
`
`24E \ /— 24F
`
`24C \ /— 24D
`
`18H
`[/-
`
`"
`
`IF
`
`I
`
`IF
`
`//- 18K
`18L
`/
`/- 18I
`. /_?'>
`Processing + H0
`Processing
`Node
`IF
`IF
`Bridge
`Node
`IF
`IF
`22 i Q + a
`I MC l
`I MC l
`|/()
`I \16c
`I \16D
`Bus
`Memory
`Memory
`14C
`14D
`
`18G \
`\
`IF
`
`—
`
`—
`
`1&1 /—
`
`22
`
`10 —/
`
`Fig. 1
`
`3
`
`
`
[FIG. 2 (Sheet 2 of 13): Detail of the dual unidirectional link between processing nodes 12A and 12B. Each unidirectional link carries a clock line (CLK), a control line (CTL), and a command/address/data bus (CAD), e.g. lines 24AA-24AC in one direction and 24BA-24BC in the other.]
`
`
`
[FIG. 3 (Sheet 3 of 13): Information packet 30, two bit times of eight bits each, carrying CMD[5:0] in bit time 1 with the remaining bits reserved or packet-specific.]

[FIG. 4 (Sheet 3 of 13): Address packet 32, eight bit times, carrying CMD[5:0], DestNode[3:0], SrcNode[3:0], SrcTag[6:0], and Addr[39:0] (Addr[7:0] through Addr[39:32] in the last five bit times).]
`
`
`
[FIG. 5 (Sheet 4 of 13): Response packet 34, four bit times, carrying CMD[5:0], DestNode[3:0], SrcNode[3:0], SrcTag[6:0], and reserved or packet-specified fields.]

[FIG. 6 (Sheet 4 of 13): Command packet 36, eight bit times, carrying CMD[5:0], TgtNode[3:0], SrcNode[3:0], SrcTag[6:0], a Count field, and Addr[39:0] in the last five bit times.]
`
`
`
[FIG. 7 (Sheet 5 of 13): Data packet 38, eight bit times carrying Data[7:0] through Data[63:56], i.e. a 64-bit data quantity.]
`
`
`
[FIG. 8 (Sheet 6 of 13): Table 46 of exemplary packet types, reconstructed below from the figure.]

CMD Code   Packet Type             Command
000000     Info                    Nop
000001     Command                 Interrupt Broadcast
000011     -                       Reserved
000100     Command/Address         Probe/Src
000101     Command/Address         Probe/Tgt
00011x     -                       Reserved
001000     Command/Address         RdBlkS
001001     Command/Address         RdBlk
001010     Command/Address         RdBlkMod
001011     Command/Address         ChangeToDirty
001100     Command/Address         ValidateBlk
001101     Command                 Interrupt Target
001110     Command/Address         CleanVicBlk
001111     Command/Address/Data    VicBlk
01xxxx     Command/Address         Read(Sized)
10xxxx     Command/Address/Data    Write(Sized)
11000x     Rd Response             RdResponse
11001x     -                       Reserved
11010x     Response                ProbeResp
11011x     -                       Reserved
111000     Response                SrcDone
111001     Response                MemCancel
111010     Response                TgtStart
111011     -                       Reserved
11110x     Response                TgtDone
111110     Response                IntrResponse
111111     Info                    Error
`
`
`
[FIG. 9 (Sheet 7 of 13): Diagram illustrating an example flow of packets corresponding to a memory read operation.]
`
`
`
[FIG. 10A (Sheet 8 of 13): Probe command packet, eight bit times, carrying CMD[5:0], TgtNode[3:0], SrcNode[3:0], SrcTag[6:0], NextState[1:0] and DM fields, and Addr[39:0].]

[FIG. 10B (Sheet 8 of 13): Encoding of the NextState[1:0] field, reconstructed below.]

Next State [1:0]   Next State
0                  No change
1                  Shared (Clean -> Shared; Dirty -> Shared/Dirty)
2                  Invalid
`
`
`
[FIG. 11A (Sheet 9 of 13): Read response packet 48, carrying CMD[5:0], RespNode[3:0], SrcNode[3:0], SrcTag[6:0], a Count field, and Probe, Tgt, and Type fields.]

[FIG. 11B (Sheet 9 of 13): Table 50 relating the Probe, Tgt, and Type/Count fields of the read response packet, reconstructed below.]

Probe   Tgt   Type, Count
0       x     Encode length of data packet
1       0     Cache block: Type must be 1, Count must be 7
1       1     Cache block: Type must be 1, Count must be 7
`
`
`
[FIG. 12 (Sheet 10 of 13): Probe response packet 52, carrying CMD[5:0], RespNode[3:0], SrcNode[3:0], SrcTag[6:0], a Hit bit, and reserved or packet-specific fields.]
`
`
`
[FIG. 13 (Sheet 11 of 13): Diagram 54 illustrating an example flow of packets involving a memory cancel response, with probe responses and read responses exchanged among the nodes.]
`
`
`
[FIG. 14 (Sheet 12 of 13): Diagram illustrating an example flow of packets showing a messaging scheme that combines probe commands and a memory cancel response.]
`
`
`
[FIG. 15 (Sheet 13 of 13): Flowchart 66 of the transactions involved in a memory read operation, reconstructed from the figure:

1. The source node sends a read command to the target node.
2. The target node sends a probe command to all the remaining system nodes and a read response (RdResponse) to the source node.
3. If no probe hits a dirty block, the remaining system nodes send probe responses to the source node, and the read response from the target node supplies the data.
4. If a probe hits a dirty block, the node with the dirty block sends a RdResponse to the source node and a MemCancel response to the target node; the other remaining nodes send probe responses to the source node. Whether or not the target node sent its RdResponse prior to receiving the MemCancel, the target node sends a TgtDone response to the source node.
5. The source node waits until all responses are received, then sends a SrcDone response to the target node.]
`
`
`
`MAINTAINING CACHE COHERENCY
`DURING A MEMORY READ OPERATION IN
`A MULTIPROCESSING COMPUTER
`SYSTEM
`
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is related to the following patent applications: Ser. No. 09/217,367, filed Dec. 21, 1998; Ser. No. 09/217,649, filed Dec. 21, 1998, now U.S. Pat. No. 6,275,905; Ser. No. 09/217,699, filed Dec. 21, 1998, now U.S. Pat. No. 6,370,621; and Ser. No. 09/220,487, filed Dec. 23, 1998, now U.S. Pat. No. 6,167,492.
`
`BACKGROUND OF THE INVENTION
`
1. Field of the Invention
The present invention broadly relates to computer systems, and more particularly, to a messaging scheme to accomplish cache-coherent data transfers in a multiprocessing computing environment.
2. Description of the Related Art
Generally, personal computers (PCs) and other types of computer systems have been designed around a shared bus system for accessing memory. One or more processors and one or more input/output (I/O) devices are coupled to memory through the shared bus. The I/O devices may be coupled to the shared bus through an I/O bridge, which manages the transfer of information between the shared bus and the I/O devices. The processors are typically coupled directly to the shared bus or through a cache hierarchy.
Unfortunately, shared bus systems suffer from several drawbacks. For example, since there are multiple devices attached to the shared bus, the bus is typically operated at a relatively low frequency. Further, system memory read and write cycles through the shared system bus take substantially longer than information transfers involving a cache within a processor or involving two or more processors. Another disadvantage of the shared bus system is a lack of scalability to larger numbers of devices. As mentioned above, the amount of bandwidth is fixed (and may decrease if adding additional devices reduces the operable frequency of the bus). Once the bandwidth requirements of the devices attached to the bus (either directly or indirectly) exceed the available bandwidth of the bus, devices will frequently be stalled when attempting to access the bus. Overall performance may be decreased unless a mechanism is provided to conserve the limited system memory bandwidth.
A read or a write operation addressed to a non-cache system memory takes more processor clock cycles than similar operations between two processors or between a processor and its internal cache. The limitations on bus bandwidth, coupled with the lengthy access time to read or write to a system memory, negatively affect computer system performance.
One or more of the above problems may be addressed using a distributed memory system. A computer system employing a distributed memory system includes multiple nodes. Two or more of the nodes are connected to memory, and the nodes are interconnected using any suitable interconnect. For example, each node may be connected to each other node using dedicated lines. Alternatively, each node may connect to a fixed number of other nodes, and transactions may be routed from a first node to a second node to which the first node is not directly connected via one or more intermediate nodes. The memory address space is assigned across the memories in each node.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`55
`
`60
`
`65
`
`2
Nodes may additionally include one or more processors. The processors typically include caches that store cache blocks of data read from the memories. Furthermore, a node may include one or more caches external to the processors. Since the processors and/or nodes may be storing cache blocks accessed by other nodes, a mechanism for maintaining coherency within the nodes is desired.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a computer system as described herein. The computer system may include multiple processing nodes, two or more of which may be coupled to separate memories which may form a distributed memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system.
In one embodiment, the present invention relates to a multiprocessing computer system where the processing nodes are interconnected through a plurality of dual unidirectional links. Each pair of unidirectional links forms a coherent link structure that connects only two of the processing nodes. One unidirectional link in the pair sends signals from a first processing node to a second processing node connected through that pair of unidirectional links. The other unidirectional link in the pair carries a reverse flow of signals, i.e., it sends signals from the second processing node to the first processing node. Thus, each unidirectional link forms a point-to-point interconnect that is designed for packetized information transfer. Communication between two processing nodes may be routed through one or more remaining nodes in the system.
Each processing node may be coupled to a respective system memory through a memory bus. The memory bus may be bidirectional. Each processing node comprises at least one processor core and may optionally include a memory controller for communicating with the respective system memory. Other interface logic may be included in one or more processing nodes to allow connectivity with various I/O devices through one or more I/O bridges.
In one embodiment, one or more I/O bridges may be coupled to their respective processing nodes through a set of non-coherent dual unidirectional links. These I/O bridges communicate with their host processors through this set of non-coherent dual unidirectional links in much the same way as two directly-linked processors communicate with each other through a coherent dual unidirectional link.
In one embodiment, when a first processing node sends a read command to a second processing node to read data from a designated memory location associated with the second processing node, the second processing node responsively transmits a probe command to all the remaining processing nodes in the system. The probe command is transmitted regardless of whether one or more of the remaining nodes have a copy of the data cached in their respective cache memories. Each processing node that has a cached copy of the designated memory location updates its cache tag associated with that cached data to reflect the current status of the data. Each processing node that receives a probe command sends, in return, a probe response indicating whether that processing node has a cached copy of the data. In the event that a processing node has a cached copy of the designated memory location, the probe response from that processing node further includes the state of the cached data, i.e., modified, shared, etc.
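To make the probe handling concrete, here is a minimal sketch in Python of what a receiving node might do; the types and names (ProbeResponse, the cache dictionary, next_state) are illustrative assumptions, not the patent's implementation:

```python
# Minimal sketch (hypothetical names): a node receiving a probe
# command updates the cache tag for the probed block and reports
# whether it holds a cached copy and, if so, in what state.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProbeResponse:
    hit: bool
    state: Optional[str] = None   # prior state of the cached copy, when hit

def handle_probe(cache: dict, addr: int, next_state: str) -> ProbeResponse:
    line = cache.get(addr)
    if line is None:
        return ProbeResponse(hit=False)   # no cached copy of the block
    old_state = line["state"]
    line["state"] = next_state            # update the cache tag, e.g.
                                          # clean -> shared, per the probe
    return ProbeResponse(hit=True, state=old_state)
```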
The target processing node, i.e., the second processing node, sends a read response to the source processing node, i.e., the first processing node. This read response contains the data requested by the source node through the read command. The first processing node acknowledges receipt of the data by transmitting a source done response to the second processing node. When the second processing node receives the source done response, it removes the read command (received from the first processing node) from its command buffer queue. The second processing node may, at that point, start to respond to a command to the same designated memory location. This sequence of messaging is one step in maintaining cache-coherent system memory reads in a multiprocessing computer system. The data read from the designated memory location may be less than the whole cache block in size if the read command so specifies.
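A sketch of the target node's side of this exchange, under the same illustrative assumptions (a node object with a command buffer and send/memory helpers; none of these names come from the patent):

```python
# Minimal sketch (hypothetical names): the target node queues the
# read command, probes every other node, returns the requested data,
# and retires the command only when the source done response arrives.
def target_handle_read(node, cmd):
    node.command_buffer.append(cmd)
    for other in node.other_nodes():             # all nodes except itself
        node.send(other, ("Probe", cmd.addr, cmd.src_node))
    data = node.memory.read(cmd.addr, cmd.size)  # may be less than a cache block
    node.send(cmd.src_node, ("RdResponse", cmd.addr, data))

def target_handle_srcdone(node, src_done):
    # Only now may a later command to the same designated memory
    # location be serviced; this ordering keeps the read coherent.
    node.command_buffer.remove(src_done.matching_command)
```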
Upon receiving the probe command, all of the remaining nodes check the status of the cached copy, if any, of the designated memory location as described before. In the event that a processing node, other than the source and the target nodes, finds a cached copy of the designated memory location that is in a modified state, that processing node responds with a memory cancel response sent to the target node, i.e., the second processing node. This memory cancel response causes the second processing node to abort further processing of the read command, and to stop transmission of the read response if it hasn't sent the read response yet. All the other remaining processing nodes still send their probe responses to the first processing node. The processing node that has the modified cached data sends that modified data to the first processing node through its own read response. The messaging scheme involving probe responses and read responses thus maintains cache coherency during a system memory read operation.
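The dirty-block path can be sketched the same way; the message names mirror FIG. 8, while the surrounding helpers are again assumptions:

```python
# Minimal sketch (hypothetical names): a third node that finds a
# modified copy supplies the data itself and cancels the (slower)
# memory read at the target node.
def handle_probe_at_third_node(node, probe):
    line = node.cache.get(probe.addr)
    if line is not None and line["state"] == "modified":
        # Send the up-to-date block straight to the source node...
        node.send(probe.src_node, ("RdResponse", probe.addr, line["data"]))
        # ...and tell the target node to abort its memory read.
        node.send(probe.tgt_node, ("MemCancel", probe.addr))
    else:
        hit = line is not None
        node.send(probe.src_node, ("ProbeResp", probe.addr, hit))
```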
The memory cancel response further causes the second processing node to transmit a target done response to the first processing node, regardless of whether it earlier sent the read response to the first processing node. The first processing node waits for all the responses to arrive, i.e., the probe responses, the target done response, and the read response from the processing node having the modified cached data, prior to completing the data read cycle by sending a source done response to the second processing node. In this embodiment, the memory cancel response conserves system memory bandwidth by causing the second processing node to abort a time-consuming memory read operation when a modified copy of the requested data is cached at a different processing node. Data transfer latencies are also reduced, since a data transfer between two processing nodes over the high-speed dual unidirectional link is substantially faster than a similar data transfer between a processing node and a system memory over the relatively slow system memory bus.
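Putting the pieces together from the source node's point of view, a minimal sketch of the completion rule; the "one final response per other node" counting is our simplification, not the patent's bookkeeping:

```python
# Minimal sketch (hypothetical names): the source node collects one
# final response from every other node (probe responses from the
# third-party nodes, the data-carrying read response, and, in the
# cancel case, the target done response), then closes the
# transaction with a source done response.
def source_read(node, target, addr):
    node.send(target, ("RdBlk", addr, node.node_id))
    outstanding = set(node.other_nodes())   # one final response per node
    data = None
    while outstanding:
        msg = node.receive()
        if msg.kind == "RdResponse":
            data = msg.data   # from target or dirty node; a real
                              # implementation must prefer the modified copy
        if msg.kind in ("ProbeResp", "RdResponse", "TgtDone"):
            outstanding.discard(msg.sender)
    node.send(target, ("SrcDone", addr, node.node_id))
    return data
```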
`
BRIEF DESCRIPTION OF THE DRAWINGS
A better understanding of the present invention can be obtained when the following detailed description of the preferred embodiment is considered in conjunction with the following drawings, in which:
FIG. 1 is a block diagram of one embodiment of a computer system.
FIG. 2 shows in detail one embodiment of the interconnect between a pair of processing nodes from FIG. 1.
FIG. 3 is a block diagram of one embodiment of an information packet.
FIG. 4 is a block diagram of one embodiment of an address packet.
FIG. 5 is a block diagram of one embodiment of a response packet.
FIG. 6 is a block diagram of one embodiment of a command packet.
FIG. 7 is a block diagram of one embodiment of a data packet.
FIG. 8 is a table illustrating exemplary packet types that may be employed in the computer system of FIG. 1.
FIG. 9 is a diagram illustrating an example flow of packets corresponding to a memory read operation.
FIG. 10A is a block diagram of one embodiment of a probe command packet.
FIG. 10B is a block diagram for one embodiment of the encoding for the NextState field in the probe command packet of FIG. 10A.
FIG. 11A is a block diagram of one embodiment of a read response packet.
FIG. 11B shows, in one embodiment, the relationship between the Probe, Tgt, and Type fields of the read response packet of FIG. 11A.
FIG. 12 is a block diagram of one embodiment of a probe response packet.
FIG. 13 is a diagram illustrating an example flow of packets involving a memory cancel response.
FIG. 14 is a diagram illustrating an example flow of packets showing a messaging scheme that combines probe commands and a memory cancel response.
FIG. 15 is an exemplary flowchart for the transactions involved in a memory read operation.
`
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Turning now to FIG. 1, one embodiment of a multiprocessing computer system 10 is shown. In the embodiment of FIG. 1, computer system 10 includes several processing nodes 12A, 12B, 12C, and 12D. Each processing node is coupled to a respective memory 14A-14D via a memory controller 16A-16D included within each respective processing node 12A-12D. Additionally, processing nodes 12A-12D include one or more interface ports 18, also known as interface logic, to communicate among the processing nodes 12A-12D, and to also communicate between a processing node and a corresponding I/O bridge. For example, processing node 12A includes interface logic 18A for communicating with processing node 12B, interface logic 18B for communicating with processing node 12C, and a third interface logic 18C for communicating with yet another processing node (not shown). Similarly, processing node 12B includes interface logic 18D, 18E, and 18F; processing node 12C includes interface logic 18G, 18H, and 18I; and processing node 12D includes interface logic 18J, 18K, and 18L. Processing node 12D is coupled to communicate with an I/O bridge 20 via interface logic 18L. Other processing nodes may communicate with other I/O bridges in a similar fashion. I/O bridge 20 is coupled to an I/O bus 22.
The interface structure that interconnects processing nodes 12A-12D includes a set of dual-unidirectional links. Each dual-unidirectional link is implemented as a pair of packet-based unidirectional links to accomplish high-speed packetized information transfer between any two processing nodes in the computer system 10. Each unidirectional link may be viewed as a pipelined, split-transaction interconnect. Each unidirectional link 24 includes a set of coherent unidirectional lines. Thus, each pair of unidirectional links may be viewed as comprising one transmission bus carrying a first plurality of binary packets and one receiver bus carrying a second plurality of binary packets. The content of a binary packet will primarily depend on the type of operation being requested and the processing node initiating the operation. One example of a dual-unidirectional link structure is links 24A and 24B. The unidirectional lines 24A are used to transmit packets from processing node 12A to processing node 12B, and lines 24B are used to transmit packets from processing node 12B to processing node 12A. Other sets of lines 24C-24H are used to transmit packets between their corresponding processing nodes as illustrated in FIG. 1.
A similar dual-unidirectional link structure may be used to interconnect a processing node and its corresponding I/O device, or a graphic device, or an I/O bridge, as is shown with respect to the processing node 12D. A dual-unidirectional link may be operated in a cache-coherent fashion for communication between processing nodes, or in a non-coherent fashion for communication between a processing node and an external I/O or graphic device or an I/O bridge. It is noted that a packet to be transmitted from one processing node to another may pass through one or more remaining nodes. For example, a packet transmitted by processing node 12A to processing node 12D may pass through either processing node 12B or processing node 12C in the arrangement of FIG. 1. Any suitable routing algorithm may be used. Other embodiments of computer system 10 may include more or fewer processing nodes than those shown in FIG. 1.
Processing nodes 12A-12D, in addition to a memory controller and interface logic, may include other circuit elements such as one or more processor cores, an internal cache memory, a bus bridge, a graphics logic, a bus controller, a peripheral device controller, etc. Broadly speaking, a processing node comprises at least one processor and may optionally include a memory controller for communicating with a memory and other logic as desired. Further, each circuit element in a processing node may be coupled to one or more interface ports depending on the functionality being performed by the processing node. For example, some circuit elements may only couple to the interface logic that connects an I/O bridge to the processing node, some other circuit elements may only couple to the interface logic that connects two processing nodes, etc. Other combinations may be easily implemented as desired.
Memories 14A-14D may comprise any suitable memory devices. For example, a memory 14A-14D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), static RAM, etc. The memory address space of the computer system 10 is divided among memories 14A-14D. Each processing node 12A-12D may include a memory map used to determine which addresses are mapped to which memories 14A-14D, and hence to which processing node 12A-12D a memory request for a particular address should be routed. In one embodiment, the coherency point for an address within computer system 10 is the memory controller 16A-16D coupled to the memory that is storing the bytes corresponding to the address. In other words, the memory controller 16A-16D is responsible for ensuring that each memory access to the corresponding memory 14A-14D occurs in a cache-coherent fashion. Memory controllers 16A-16D may comprise control circuitry for interfacing to memories 14A-14D. Additionally, memory controllers 16A-16D may include request queues for queuing memory requests.
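As a toy illustration of such a memory map, assuming (purely for the example) an equal interleave of the address space across the four nodes:

```python
# Minimal sketch (hypothetical scheme): route each address to the
# node whose memory controller owns it; that controller is the
# coherency point for the address.
NODES = ["12A", "12B", "12C", "12D"]
REGION = 1 << 30                 # example: 1 GB of address space per node

def home_node(addr: int) -> str:
    return NODES[(addr // REGION) % len(NODES)]

assert home_node(0x0000_0000) == "12A"
assert home_node(0x4000_0000) == "12B"
```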
Generally, interface logic 18A-18L may comprise a variety of buffers for receiving packets from one unidirectional link and for buffering packets to be transmitted upon another unidirectional link. Computer system 10 may employ any suitable flow control mechanism for transmitting packets. For example, in one embodiment, each transmitting interface logic 18 stores a count of the number of each type of buffer within the receiving interface logic at the other end of the link to which the transmitting interface logic is connected. The interface logic does not transmit a packet unless the receiving interface logic has a free buffer to store the packet. As a receiving buffer is freed by routing a packet onward, the receiving interface logic transmits a message to the sending interface logic to indicate that the buffer has been freed. Such a mechanism may be referred to as a "coupon-based" system.
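A minimal sketch of such a coupon (credit) scheme; the class and buffer-type names are illustrative assumptions:

```python
# Minimal sketch (hypothetical names) of coupon-based flow control:
# the sender holds one credit per free buffer of each packet type at
# the receiver, spends a credit to transmit, and regains it when the
# receiver frees the buffer by routing the packet onward.
class LinkSender:
    def __init__(self, initial_credits):
        self.credits = dict(initial_credits)   # e.g. {"command": 4, "data": 2}
        self.backlog = []                      # packets waiting for a credit

    def send(self, link, pkt_type, packet):
        if self.credits[pkt_type] == 0:
            self.backlog.append((pkt_type, packet))  # stall: no free buffer
            return
        self.credits[pkt_type] -= 1
        link.transmit(pkt_type, packet)

    def on_buffer_freed(self, pkt_type):
        # The receiver routed a packet onward and returned the coupon.
        self.credits[pkt_type] += 1
```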
Turning next to FIG. 2, a block diagram illustrating processing nodes 12A and 12B is shown to illustrate in more detail one embodiment of the dual unidirectional link structure connecting the processing nodes 12A and 12B. In the embodiment of FIG. 2, lines 24A (the unidirectional link 24A) include a clock line 24AA, a control line 24AB, and a command/address/data bus 24AC. Similarly, lines 24B (the unidirectional link 24B) include a clock line 24BA, a control line 24BB, and a command/address/data bus 24BC.
A clock line transmits a clock signal that indicates a sample point for its corresponding control line and the command/address/data bus. In one particular embodiment, data/control bits are transmitted on each edge (i.e., rising edge and falling edge) of the clock signal. Accordingly, two data bits per line may be transmitted per clock cycle. The amount of time employed to transmit one bit per line is referred to herein as a "bit time". The above-mentioned embodiment includes two bit times per clock cycle. A packet may be transmitted across two or more bit times. Multiple clock lines may be used depending upon the width of the command/address/data bus. For example, two clock lines may be used for a 32-bit command/address/data bus (with one half of the command/address/data bus referenced to one of the clock lines, and the other half of the command/address/data bus and the control line referenced to the other one of the clock lines).
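A quick worked example of this timing under the double-data-rate sampling just described (the helper function is ours, purely illustrative):

```python
# With two bit times per clock cycle, a packet of N bytes on a bus
# of W lines takes N*8/W bit times, i.e. N*4/W clock cycles.
def bit_times(packet_bytes: int, bus_width_bits: int) -> int:
    return (packet_bytes * 8) // bus_width_bits

assert bit_times(8, 8) == 8        # 8-byte data packet of FIG. 7, 8-bit bus
assert bit_times(8, 32) == 2       # same packet on a 32-bit bus
assert bit_times(8, 8) // 2 == 4   # clock cycles at two bit times per cycle
```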
The control line indicates whether the data transmitted upon the command/address/data bus is a bit time of a control packet or a bit time of a data packet. The control line is asserted to indicate a control packet, and deasserted to indicate a data packet. Certain control packets indicate that a data packet follows. The data packet may immediately follow the corresponding control packet. In one embodiment, other control packets may interrupt the transmission of a data packet. Such an interruption may be performed by asserting the control line for a number of bit times during transmission of the data packet and transmitting the bit times of the control packet while the control line is asserted. Control packets that interrupt a data packet may not indicate that a data packet will be following.
The command/address/data bus comprises a set of lines for transmitting the data, command, response, and address bits. In one embodiment, the command/address/data bus may comprise 8, 16, or 32 lines. Each processing node or I/O bridge may employ any one of the supported numbers of lines according to design choice. Other embodiments may support other sizes of command/address/data bus as desired.
According to one embodiment, the command/address/data bus lines and the clock line may carry inverted data (i.e., a logical one is represented as a low voltage on the line, and a logical zero is represented as a high voltage). Alternatively, these lines may carry non-inverted data (in which a logical one is represented as a high voltage on the line, and a logical zero is represented as a low voltage). A suitable positive and negative logic combination may also be implemented.
`
`18
`
`
`
`US 6,490,661 B1
`
`7
Turning now to FIGS. 3-7, exemplary packets employed in a cache-coherent communication (i.e., the communication between processing nodes) according to one embodiment of computer system 10 are shown. FIGS. 3-6 illustrate