US006490661B1

(12) United States Patent
Keller et al.

(10) Patent No.: US 6,490,661 B1
(45) Date of Patent: Dec. 3, 2002

(54) MAINTAINING CACHE COHERENCY DURING A MEMORY READ OPERATION IN A MULTIPROCESSING COMPUTER SYSTEM

(75) Inventors: James B. Keller, Palo Alto, CA (US); Derrick R. Meyer, Austin, TX (US)

(73) Assignee: Advanced Micro Devices, Inc., Sunnyvale, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/217,212
(22) Filed: Dec. 21, 1998

(51) Int. Cl.7 .................... G06F 13/14
(52) U.S. Cl. .................... 711/150; 711/141; 711/130
(58) Field of Search ............ 711/141, 144, 145, 146, 130, 119, 150; 709/213, 214, 216, 200, 238

(56) References Cited

U.S. PATENT DOCUMENTS

5,303,362 A    4/1994  Butts, Jr. et al.
5,412,788 A    5/1995  Collins et al.

(List continued on next page.)

FOREIGN PATENT DOCUMENTS

EP  0 379 771   8/1990
EP  0 412 353   2/1991
EP  0 611 026   8/1994
EP  0 777 184   6/1997
EP  817 076     1/1998

OTHER PUBLICATIONS

Kumar et al., "Efficient and Scalable Cache Coherence Schemes for Shared Memory Hypercube Multiprocessors," IEEE, XP000533913, Nov. 14, 1994, pp. 498-507.

Laudon et al., "The SGI Origin: A ccNUMA Highly Scalable Server," XP000656582, Silicon Graphics, Inc., ACM, 1997, pp. 241-251.

Jhang et al., "A New Write-Invalidate Snooping Cache Coherence Protocol for Split Transaction Bus-Based Multiprocessor Systems," IEEE TENCON, Oct. 1993, pp. 229-232.

Prete, "RST Cache Memory Design for a Tightly Coupled Multiprocessor System," IEEE Micro, vol. 11, No. 2, Apr. 1991, pp. 16-19 and 40-52.

Primary Examiner—Hong Kim
(74) Attorney, Agent, or Firm—Lawrence J. Merkel

(57) ABSTRACT

A messaging scheme that accomplishes cache-coherent data transfers during a memory read operation in a multiprocessing computer system is described. A source processing node sends a read command to a target processing node to read data from a designated memory location in a system memory associated with the target processing node. In response to the read command, the target processing node transmits a probe command to all the remaining processing nodes in the computer system regardless of whether one or more of the remaining nodes have a copy of the data cached in their respective cache memories. The probe command causes each node to maintain cache coherency by appropriately changing the state of the cache block containing the requested data and by causing the node having an updated copy of the cache block to send the cache block to the source node. Each processing node that receives a probe command sends, in return, a probe response indicating whether that processing node has a cached copy of the data and, if so, the state of the cached copy. The target node sends a read response including the requested data to the source node. The source node waits for responses from the target node and from each of the remaining nodes in the system and acknowledges the receipt of the requested data by sending a source done response to the target node.

35 Claims, 13 Drawing Sheets

[Front-page representative drawing: FIG. 1, a block diagram of computer system 10 (processing nodes 12A-12D with memory controllers 16A-16D, memories 14A-14D, interface logic 18A-18L, dual unidirectional links, I/O bridge 20, and I/O bus 22).]

U.S. PATENT DOCUMENTS

5,517,494 A     5/1996  Green
5,537,575 A *   7/1996  Foley et al. ............ 711/141
5,560,038 A     9/1996  Haddock ................. 709/236
5,659,708 A     8/1997  Arimilli et al. ......... 711/146
5,673,413 A     9/1997  Deshpande et al.
5,684,977 A    11/1997  Van Loo et al.
5,749,095 A     5/1998  Hagersten
5,859,983 A     1/1999  Heller et al. ........... 709/251
5,878,268 A     3/1999  Hagersten ............... 709/215
5,887,138 A     3/1999  Hagersten et al.
5,893,144 A     4/1999  Wood et al. ............. 711/122
5,927,118 A     7/1999  Minote et al.
5,966,729 A    10/1999  Phelps
5,987,544 A *  11/1999  Bannon et al. ........... 710/100
5,991,819 A    11/1999  Young ................... 709/253
6,012,127 A     1/2000  McDonald et al. ......... 711/141
6,018,791 A     1/2000  Arimilli et al. ......... 711/141
6,038,644 A *   3/2000  Irie et al. ............. 711/141
6,049,851 A *   4/2000  Bryg et al. ............. 711/141
6,070,231 A     5/2000  Ottinger ................ 711/141
6,085,263 A     7/2000  Sharma et al.
6,098,115 A     8/2000  Eberhard et al. ......... 710/7
6,101,420 A     8/2000  VanDoren et al.
6,108,737 A *   8/2000  Sharma et al. ........... 710/107
6,108,752 A     8/2000  Van Doren et al.
6,112,281 A     8/2000  Bamford et al.
6,138,218 A *  10/2000  Arimilli et al. ......... 711/146
6,199,153 B1 *  3/2001  Razdan et al. ........... 711/172
6,209,065 B1 *  3/2001  Van Doren et al. ........ 711/150
6,249,846 B1 *  6/2001  Van Doren et al. ........ 710/39
6,275,905 B1    8/2001  Keller et al. ........... 711/141
6,286,090 B1 *  9/2001  Steely et al. ........... 711/152
6,292,705 B1    9/2001  Wang et al. ............. 700/5
6,295,853 B1    9/2001  Razdan et al.
6,370,621 B1    4/2002  Keller
6,393,529 B1    5/2002  Keller

* cited by examiner

FIG. 1 (sheet 1 of 13): Block diagram of one embodiment of computer system 10. Processing nodes 12A-12D each contain a memory controller (MC) 16A-16D coupled to a respective memory 14A-14D, plus interface logic (IF) 18A-18L; dual unidirectional links 24A-24H interconnect the nodes, and an I/O bridge 20 couples node 12D to I/O bus 22.

FIG. 2 (sheet 2 of 13): Detail of the dual unidirectional link between processing nodes 12A and 12B. Lines 24A carry a clock line 24AA, a control line 24AB, and a command/address/data bus 24AC from node 12A to node 12B; lines 24B carry the corresponding signals 24BA, 24BB, and 24BC in the reverse direction.

FIG. 3 (sheet 3 of 13, diagram 30): One embodiment of an information packet: two bit times of eight bits (7-0) each, with CMD[5:0] in bit time 1 and the remaining fields reserved or packet-specific.

FIG. 4 (sheet 3 of 13, diagram 32): One embodiment of an address packet: eight bit times carrying CMD[5:0], DestNode, SrcNode[3:0], SrcTag, and the address bytes Addr[7:0] through Addr[39:32].

FIG. 5 (sheet 4 of 13, diagram 34): One embodiment of a response packet: four bit times carrying CMD[5:0], DestNode[3:0], SrcNode[3:0], SrcTag[6:0], and reserved or packet-specific fields.

FIG. 6 (sheet 4 of 13, diagram 36): One embodiment of a command packet: eight bit times carrying CMD[5:0], TgtNode[3:0], SrcNode[3:0], SrcTag[6:0], a Count field, and the address bytes Addr[7:0] through Addr[39:32].

FIG. 7 (sheet 5 of 13, diagram 38): One embodiment of a data packet: eight bit times carrying Data[7:0] through Data[63:56], i.e., 64 bits of data, one byte per bit time.

FIG. 8 (sheet 6 of 13, table 46): Exemplary packet types:

CMD Code   Packet Type            Command
000000     Info                   Nop
000001     Command                Interrupt Broadcast
000011     -                      Reserved
000100     Command/Address        Probe/Src
000101     Command/Address        Probe/Tgt
00011x     -                      Reserved
001000     Command/Address        RdBlkS
001001     Command/Address        RdBlk
001010     Command/Address        RdBlkMod
001011     Command/Address        ChangeToDirty
001100     Command/Address        ValidateBlk
001110     Command/Address        CleanVicBlk
001101     Command                Interrupt Target
001111     Command/Address/Data   VicBlk
01xxxx     Command/Address        Read(Sized)
10xxxx     Command/Address/Data   Write(Sized)
11000x     Rd Response            RdResponse
11001x     -                      Reserved
11010x     Response               ProbeResp
11011x     -                      Reserved
111000     Response               SrcDone
111001     Response               MemCancel
111010     Response               TgtStart
111011     -                      Reserved
11110x     Response               TgtDone
111110     Response               IntrResponse
111111     Info                   Error
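The CMD[5:0] encodings tabulated above can be exercised with a small table-driven decoder. The sketch below is illustrative only and not part of the patent: the rows (including the "x" don't-care bits) are transcribed from FIG. 8, while the decoding function, its name, and the fallback to Reserved are assumptions made for the sketch.

```python
# Illustrative decoder for the CMD[5:0] encodings of FIG. 8 (table 46).
# The table rows are transcribed from the figure; the decoding function
# itself is a hypothetical aid, not something the patent defines.

CMD_TABLE = [
    # (bit pattern, packet type, command); 'x' marks a don't-care bit
    ("000000", "Info",                 "Nop"),
    ("000001", "Command",              "Interrupt Broadcast"),
    ("000011", "-",                    "Reserved"),
    ("000100", "Command/Address",      "Probe/Src"),
    ("000101", "Command/Address",      "Probe/Tgt"),
    ("00011x", "-",                    "Reserved"),
    ("001000", "Command/Address",      "RdBlkS"),
    ("001001", "Command/Address",      "RdBlk"),
    ("001010", "Command/Address",      "RdBlkMod"),
    ("001011", "Command/Address",      "ChangeToDirty"),
    ("001100", "Command/Address",      "ValidateBlk"),
    ("001110", "Command/Address",      "CleanVicBlk"),
    ("001101", "Command",              "Interrupt Target"),
    ("001111", "Command/Address/Data", "VicBlk"),
    ("01xxxx", "Command/Address",      "Read(Sized)"),
    ("10xxxx", "Command/Address/Data", "Write(Sized)"),
    ("11000x", "Rd Response",          "RdResponse"),
    ("11001x", "-",                    "Reserved"),
    ("11010x", "Response",             "ProbeResp"),
    ("11011x", "-",                    "Reserved"),
    ("111000", "Response",             "SrcDone"),
    ("111001", "Response",             "MemCancel"),
    ("111010", "Response",             "TgtStart"),
    ("111011", "-",                    "Reserved"),
    ("11110x", "Response",             "TgtDone"),
    ("111110", "Response",             "IntrResponse"),
    ("111111", "Info",                 "Error"),
]

def decode_cmd(cmd):
    """Match a 6-bit CMD value against the FIG. 8 patterns."""
    bits = format(cmd & 0x3F, "06b")
    for pattern, packet_type, command in CMD_TABLE:
        if all(p == "x" or p == b for p, b in zip(pattern, bits)):
            return packet_type, command
    return "-", "Reserved"   # unlisted encodings treated as reserved
```

For example, `decode_cmd(0b001001)` yields the RdBlk command/address encoding used to read a cache block, and `decode_cmd(0b111001)` yields the MemCancel response.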
`
`
`

`

FIG. 9 (sheet 7 of 13): Diagram illustrating an example flow of packets corresponding to a memory read operation among the processing nodes and memory of computer system 10.

FIG. 10A (sheet 8 of 13): One embodiment of a probe command packet: eight bit times carrying CMD[5:0], TgtNode[3:0], SrcNode[3:0], SrcTag[6:0], a NextState field, a DM bit, and the address bytes Addr[7:0] through Addr[39:32].

FIG. 10B (sheet 8 of 13): Encoding of the NextState[1:0] field of the probe command packet:

Next State [1:0]   Next State
0                  No Change
1                  Shared (Clean -> Shared; Dirty -> Shared/Dirty)
2                  Invalid

FIG. 11A (sheet 9 of 13, diagram 48): One embodiment of a read response packet: CMD[5:0], RespNode[3:0], SrcNode[3:0], SrcTag[6:0], a Count field, a reserved (Rsv) field, and a Probe Type field.

FIG. 11B (sheet 9 of 13, table 50): Relationship between the Probe, Tgt, and Type/Count fields of the read response packet:

Probe   Tgt   Type, Count
0       X     Encode length of data packet
1       0     Cache block: Type must be 1, Count must be 7
1       1     Cache block: Type must be 1, Count must be 7

FIG. 12 (sheet 10 of 13, diagram 52): One embodiment of a probe response packet: CMD[5:0], RespNode[3:0], SrcNode[3:0], SrcTag[6:0], a reserved (Rsv) field, a Hit bit, and reserved or packet-specific fields.

FIG. 13 (sheet 11 of 13, diagram 54): Diagram illustrating an example flow of packets (probe responses and read responses among the nodes and memory) involving a memory cancel response.

FIG. 14 (sheet 12 of 13): Diagram illustrating an example flow of packets showing a messaging scheme that combines probe commands and the memory cancel response.

FIG. 15 (sheet 13 of 13, flowchart 66): Exemplary flowchart for the transactions involved in a coherent memory read:

1. SrcNode sends a Read command to TgtNode.
2. TgtNode sends a Probe command to all the remaining system nodes and a RdResponse to SrcNode.
3. Probe hit on a dirty block?
   - No: probe responses from all the remaining system nodes, and the Read response from TgtNode, are sent to SrcNode.
   - Yes: the node with the dirty block sends a RdResp to SrcNode and a MemCancel response to TgtNode; the other remaining nodes send probe responses to SrcNode. TgtNode checks whether it sent its RdResponse prior to receiving the MemCancel, and sends a TgtDone response to SrcNode.
4. SrcNode waits until all the responses are received.
5. SrcNode sends a SrcDone response to TgtNode.

MAINTAINING CACHE COHERENCY DURING A MEMORY READ OPERATION IN A MULTIPROCESSING COMPUTER SYSTEM

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to the following patent applications: Ser. No. 09/217,367 filed Dec. 21, 1998; Ser. No. 09/217,649 filed Dec. 21, 1998, now U.S. Pat. No. 6,275,905; Ser. No. 09/217,699 filed Dec. 21, 1998, now U.S. Pat. No. 6,370,621; and Ser. No. 09/220,487 filed Dec. 23, 1998, now U.S. Pat. No. 6,167,492.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention broadly relates to computer systems, and more particularly, to a messaging scheme to accomplish cache-coherent data transfers in a multiprocessing computing environment.

2. Description of the Related Art

Generally, personal computers (PCs) and other types of computer systems have been designed around a shared bus system for accessing memory. One or more processors and one or more input/output (I/O) devices are coupled to memory through the shared bus. The I/O devices may be coupled to the shared bus through an I/O bridge, which manages the transfer of information between the shared bus and the I/O devices. The processors are typically coupled directly to the shared bus or through a cache hierarchy.

Unfortunately, shared bus systems suffer from several drawbacks. For example, since there are multiple devices attached to the shared bus, the bus is typically operated at a relatively low frequency. Further, system memory read and write cycles through the shared system bus take substantially longer than information transfers involving a cache within a processor or involving two or more processors. Another disadvantage of the shared bus system is a lack of scalability to larger numbers of devices. As mentioned above, the amount of bandwidth is fixed (and may decrease if adding additional devices reduces the operable frequency of the bus). Once the bandwidth requirements of the devices attached to the bus (either directly or indirectly) exceed the available bandwidth of the bus, devices will frequently be stalled when attempting to access the bus. Overall performance may be decreased unless a mechanism is provided to conserve the limited system memory bandwidth.

A read or a write operation addressed to a non-cache system memory takes more processor clock cycles than similar operations between two processors or between a processor and its internal cache. The limitations on bus bandwidth, coupled with the lengthy access time to read or write to a system memory, negatively affect computer system performance.

One or more of the above problems may be addressed using a distributed memory system. A computer system employing a distributed memory system includes multiple nodes. Two or more of the nodes are connected to memory, and the nodes are interconnected using any suitable interconnect. For example, each node may be connected to each other node using dedicated lines. Alternatively, each node may connect to a fixed number of other nodes, and transactions may be routed from a first node to a second node to which the first node is not directly connected via one or more intermediate nodes. The memory address space is assigned across the memories in each node.

Nodes may additionally include one or more processors. The processors typically include caches that store cache blocks of data read from the memories. Furthermore, a node may include one or more caches external to the processors. Since the processors and/or nodes may be storing cache blocks accessed by other nodes, a mechanism for maintaining coherency within the nodes is desired.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a computer system as described herein. The computer system may include multiple processing nodes, two or more of which may be coupled to separate memories which may form a distributed memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system.

In one embodiment, the present invention relates to a multiprocessing computer system where the processing nodes are interconnected through a plurality of dual unidirectional links. Each pair of unidirectional links forms a coherent link structure that connects only two of the processing nodes. One unidirectional link in the pair sends signals from a first processing node to a second processing node connected through that pair of unidirectional links. The other unidirectional link in the pair carries a reverse flow of signals, i.e., it sends signals from the second processing node to the first processing node. Thus, each unidirectional link forms a point-to-point interconnect that is designed for packetized information transfer. Communication between two processing nodes may be routed through one or more of the remaining nodes in the system.

Each processing node may be coupled to a respective system memory through a memory bus. The memory bus may be bidirectional. Each processing node comprises at least one processor core and may optionally include a memory controller for communicating with the respective system memory. Other interface logic may be included in one or more processing nodes to allow connectivity with various I/O devices through one or more I/O bridges.

In one embodiment, one or more I/O bridges may be coupled to their respective processing nodes through a set of non-coherent dual unidirectional links. These I/O bridges communicate with their host processors through this set of non-coherent dual unidirectional links in much the same way as two directly-linked processors communicate with each other through a coherent dual unidirectional link.

In one embodiment, when a first processing node sends a read command to a second processing node to read data from a designated memory location associated with the second processing node, the second processing node responsively transmits a probe command to all the remaining processing nodes in the system. The probe command is transmitted regardless of whether one or more of the remaining nodes have a copy of the data cached in their respective cache memories. Each processing node that has a cached copy of the designated memory location updates its cache tag associated with that cached data to reflect the current status of the data. Each processing node that receives a probe command sends, in return, a probe response indicating whether that processing node has a cached copy of the data. In the event that a processing node has a cached copy of the designated memory location, the probe response from that processing node further includes the state of the cached data, i.e., modified, shared, etc.

The target processing node, i.e., the second processing node, sends a read response to the source processing node,
i.e., the first processing node. This read response contains the data requested by the source node through the read command. The first processing node acknowledges receipt of the data by transmitting a source done response to the second processing node. When the second processing node receives the source done response, it removes the read command (received from the first processing node) from its command buffer queue. The second processing node may, at that point, start to respond to a command to the same designated memory location. This sequence of messaging is one step in maintaining cache-coherent system memory reads in a multiprocessing computer system. The data read from the designated memory location may be less than the whole cache block in size if the read command so specifies.
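The sequence just described (read command, broadcast probes, read response, probe responses, source done) can be sketched as an ordered message trace. The Python below is a minimal illustration under stated assumptions: node names are arbitrary, messages are plain tuples rather than the patent's packets, and probes are shown going only to nodes other than the source and target.

```python
# Minimal sketch of the cache-coherent read flow described above, for the
# case where no node holds a modified copy. Node IDs and the flat trace
# list are illustrative assumptions, not the patent's packet formats.

def read_transaction(source, target, others):
    """Return the (sender, receiver, message) trace of a block read."""
    trace = [(source, target, "RdBlk")]                  # read command
    trace += [(target, n, "Probe") for n in others]      # broadcast probes
    trace += [(target, source, "RdResponse")]            # data from target's memory
    trace += [(n, source, "ProbeResp") for n in others]  # coherency responses
    # the source waits for the read response and every probe response,
    # then acknowledges so the target can retire the command:
    trace += [(source, target, "SrcDone")]
    return trace
```

For a four-node system, `read_transaction("12A", "12B", ["12C", "12D"])` produces a seven-message trace ending in the SrcDone acknowledgment.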
Upon receiving the probe command, all of the remaining nodes check the status of the cached copy, if any, of the designated memory location as described before. In the event that a processing node, other than the source and the target nodes, finds a cached copy of the designated memory location that is in a modified state, that processing node responds with a memory cancel response sent to the target node, i.e., the second processing node. This memory cancel response causes the second processing node to abort further processing of the read command, and to stop transmission of the read response if it hasn't sent the read response yet. All the other remaining processing nodes still send their probe responses to the first processing node. The processing node that has the modified cached data sends that modified data to the first processing node through its own read response. The messaging scheme involving probe responses and read responses thus maintains cache coherency during a system memory read operation.

The memory cancel response further causes the second processing node to transmit a target done response to the first processing node regardless of whether it earlier sent the read response to the first processing node. The first processing node waits for all the responses to arrive—i.e., the probe responses, the target done response, and the read response from the processing node having the modified cached data—prior to completing the data read cycle by sending a source done response to the second processing node. In this embodiment, the memory cancel response conserves system memory bandwidth by causing the second processing node to abort a time-consuming memory read operation when a modified copy of the requested data is cached at a different processing node. Reduced data transfer latencies are thus achieved because a data transfer between two processing nodes over the high-speed dual unidirectional link is substantially faster than a similar data transfer between a processing node and a system memory that involves a relatively slow speed system memory bus.
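The modified-copy path can be sketched the same way. This is again an illustrative trace, not the patent's implementation: node names and the ordering within the list are assumptions, and, per the text above, the target issues a target done response whether or not its own read response went out before the memory cancel arrived.

```python
# Sketch of the read flow when one probed node holds a modified (dirty)
# copy, per the memory cancel behavior described above. Node names and
# message ordering are illustrative assumptions.

def read_with_dirty_copy(source, target, dirty_node, clean_nodes):
    """Return the (sender, receiver, message) trace for the dirty-copy case."""
    trace = [(source, target, "RdBlk")]
    trace += [(target, n, "Probe") for n in [dirty_node] + clean_nodes]
    trace += [(dirty_node, source, "RdResponse")]  # modified data to the source
    trace += [(dirty_node, target, "MemCancel")]   # abort the target's memory read
    trace += [(n, source, "ProbeResp") for n in clean_nodes]
    trace += [(target, source, "TgtDone")]         # sent regardless of whether the
                                                   # target's RdResponse went out first
    trace += [(source, target, "SrcDone")]         # after all responses arrive
    return trace
```

The source counts the dirty node's read response, the other nodes' probe responses, and the target done response before issuing SrcDone, mirroring the waiting rule stated above.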
`
BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the present invention can be obtained when the following detailed description of the preferred embodiment is considered in conjunction with the following drawings, in which:

FIG. 1 is a block diagram of one embodiment of a computer system.

FIG. 2 shows in detail one embodiment of the interconnect between a pair of processing nodes from FIG. 1.

FIG. 3 is a block diagram of one embodiment of an information packet.

FIG. 4 is a block diagram of one embodiment of an address packet.

FIG. 5 is a block diagram of one embodiment of a response packet.
FIG. 6 is a block diagram of one embodiment of a command packet.

FIG. 7 is a block diagram of one embodiment of a data packet.

FIG. 8 is a table illustrating exemplary packet types that may be employed in the computer system of FIG. 1.

FIG. 9 is a diagram illustrating an example flow of packets corresponding to a memory read operation.

FIG. 10A is a block diagram of one embodiment of a probe command packet.

FIG. 10B is a block diagram for one embodiment of the encoding for the NextState field in the probe command packet of FIG. 10A.

FIG. 11A is a block diagram of one embodiment of a read response packet.

FIG. 11B shows in one embodiment the relationship between the Probe, Tgt and Type fields of the read response packet of FIG. 11A.

FIG. 12 is a block diagram of one embodiment of a probe response packet.

FIG. 13 is a diagram illustrating an example flow of packets involving a memory cancel response.

FIG. 14 is a diagram illustrating an example flow of packets showing a messaging scheme that combines probe commands and the memory cancel response.

FIG. 15 is an exemplary flowchart for the transactions involved in a memory read operation.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Turning now to FIG. 1, one embodiment of a multiprocessing computer system 10 is shown. In the embodiment of FIG. 1, computer system 10 includes several processing nodes 12A, 12B, 12C, and 12D. Each processing node is coupled to a respective memory 14A-14D via a memory controller 16A-16D included within each respective processing node 12A-12D. Additionally, processing nodes 12A-12D include one or more interface ports 18, also known as interface logic, to communicate among the processing nodes 12A-12D, and to also communicate between a processing node and a corresponding I/O bridge. For example, processing node 12A includes interface logic 18A for communicating with processing node 12B, interface logic 18B for communicating with processing node 12C, and a third interface logic 18C for communicating with yet another processing node (not shown). Similarly, processing node 12B includes interface logic 18D, 18E, and 18F; processing node 12C includes interface logic 18G, 18H, and 18I; and processing node 12D includes interface logic 18J, 18K, and 18L. Processing node 12D is coupled to communicate with an I/O bridge 20 via interface logic 18L. Other processing nodes may communicate with other I/O bridges in a similar fashion. I/O bridge 20 is coupled to an I/O bus 22.

The interface structure that interconnects processing nodes 12A-12D includes a set of dual-unidirectional links. Each dual-unidirectional link is implemented as a pair of packet-based unidirectional links to accomplish high-speed packetized information transfer between any two processing nodes in the computer system 10. Each unidirectional link may be viewed as a pipelined, split-transaction interconnect. Each unidirectional link 24 includes a set of coherent unidirectional lines. Thus, each pair of unidirectional links may be viewed as comprising one transmission bus carrying a first plurality of binary packets and one receiver bus
carrying a second plurality of binary packets. The content of a binary packet will primarily depend on the type of operation being requested and the processing node initiating the operation. One example of a dual-unidirectional link structure is links 24A and 24B. The unidirectional lines 24A are used to transmit packets from processing node 12A to processing node 12B, and lines 24B are used to transmit packets from processing node 12B to processing node 12A. Other sets of lines 24C-24H are used to transmit packets between their corresponding processing nodes as illustrated in FIG. 1.

A similar dual-unidirectional link structure may be used to interconnect a processing node and its corresponding I/O device, a graphics device, or an I/O bridge, as is shown with respect to the processing node 12D. A dual-unidirectional link may be operated in a cache-coherent fashion for communication between processing nodes or in a non-coherent fashion for communication between a processing node and an external I/O or graphics device or an I/O bridge. It is noted that a packet to be transmitted from one processing node to another may pass through one or more remaining nodes. For example, a packet transmitted by processing node 12A to processing node 12D may pass through either processing node 12B or processing node 12C in the arrangement of FIG. 1. Any suitable routing algorithm may be used. Other embodiments of computer system 10 may include more or fewer processing nodes than those shown in FIG. 1.

Processing nodes 12A-12D, in addition to a memory controller and interface logic, may include other circuit elements such as one or more processor cores, an internal cache memory, a bus bridge, graphics logic, a bus controller, a peripheral device controller, etc. Broadly speaking, a processing node comprises at least one processor and may optionally include a memory controller for communicating with a memory and other logic as desired. Further, each circuit element in a processing node may be coupled to one or more interface ports depending on the functionality being performed by the processing node. For example, some circuit elements may only couple to the interface logic that connects an I/O bridge to the processing node, some other circuit elements may only couple to the interface logic that connects two processing nodes, etc. Other combinations may be easily implemented as desired.

Memories 14A-14D may comprise any suitable memory devices. For example, a memory 14A-14D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), static RAM, etc. The memory address space of the computer system 10 is divided among memories 14A-14D. Each processing node 12A-12D may include a memory map used to determine which addresses are mapped to which memories 14A-14D, and hence to which processing node 12A-12D a memory request for a particular address should be routed. In one embodiment, the coherency point for an address within computer system 10 is the memory controller 16A-16D coupled to the memory that is storing the bytes corresponding to the address. In other words, the memory controller 16A-16D is responsible for ensuring that each memory access to the corresponding memory 14A-14D occurs in a cache-coherent fashion. Memory controllers 16A-16D may comprise control circuitry for interfacing to memories 14A-14D. Additionally, memory controllers 16A-16D may include request queues for queuing memory requests.

Generally, interface logic 18A-18L may comprise a variety of buffers for receiving packets from one unidirectional link and for buffering packets to be transmitted upon another unidirectional link. Computer system 10 may employ any
suitable flow control mechanism for transmitting packets. For example, in one embodiment, each transmitting interface logic 18 stores a count of the number of each type of buffer within the receiving interface logic at the other end of the link to which the transmitting interface logic is connected. The interface logic does not transmit a packet unless the receiving interface logic has a free buffer to store the packet. As a receiving buffer is freed by routing a packet onward, the receiving interface logic transmits a message to the sending interface logic to indicate that the buffer has been freed. Such a mechanism may be referred to as a "coupon-based" system.
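The coupon mechanism described in this paragraph amounts to credit-based flow control, which can be sketched as a per-buffer-type counter on the transmitting side. The class name, the buffer-type strings, and the capacities below are illustrative assumptions; only the counting discipline comes from the text.

```python
# Illustrative "coupon-based" transmitter: it mirrors the receiver's free
# buffer counts and refuses to send when no buffer of the right type is
# free. Buffer types and capacities here are assumed for the sketch.

class CouponLink:
    def __init__(self, buffer_counts):
        # credits start at the receiver's advertised buffer capacity,
        # e.g. {"command": 4, "data": 2}
        self.credits = dict(buffer_counts)

    def send(self, packet_type):
        """Transmit one packet if a receive buffer is free; else stall."""
        if self.credits.get(packet_type, 0) == 0:
            return False                # no free buffer: do not transmit
        self.credits[packet_type] -= 1
        return True

    def buffer_freed(self, packet_type):
        """Receiver routed a packet onward and returned the coupon."""
        self.credits[packet_type] += 1
```

A stalled sender resumes as soon as `buffer_freed` restores a credit, matching the freed-buffer message described above.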
Turning next to FIG. 2, a block diagram illustrating processing nodes 12A and 12B is shown to illustrate in more detail one embodiment of the dual unidirectional link structure connecting the processing nodes 12A and 12B. In the embodiment of FIG. 2, lines 24A (the unidirectional link 24A) include a clock line 24AA, a control line 24AB, and a command/address/data bus 24AC. Similarly, lines 24B (the unidirectional link 24B) include a clock line 24BA, a control line 24BB, and a command/address/data bus 24BC.

A clock line transmits a clock signal that indicates a sample point for its corresponding control line and the command/address/data bus. In one particular embodiment, data/control bits are transmitted on each edge (i.e., rising edge and falling edge) of the clock signal. Accordingly, two data bits per line may be transmitted per clock cycle. The amount of time employed to transmit one bit per line is referred to herein as a "bit time". The above-mentioned embodiment includes two bit times per clock cycle. A packet may be transmitted across two or more bit times. Multiple clock lines may be used depending upon the width of the command/address/data bus. For example, two clock lines may be used for a 32-bit command/address/data bus (with one half of the command/address/data bus referenced to one of the clock lines and the other half of the command/address/data bus and the control line referenced to the other one of the clock lines).
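Since one bit per line moves each bit time and there are two bit times per clock cycle, transfer time follows directly from packet size and bus width. A small illustrative calculation (the function names are ours, not the patent's):

```python
# Illustrative timing arithmetic for the double-data-rate link described
# above: one bit per line per bit time, two bit times per clock cycle.

BIT_TIMES_PER_CLOCK = 2

def bit_times_for_packet(packet_bytes, bus_width_bits):
    """Bit times needed to move a packet across a bus of the given width."""
    return (packet_bytes * 8 + bus_width_bits - 1) // bus_width_bits

def clock_cycles(bit_times):
    return bit_times / BIT_TIMES_PER_CLOCK

# The 8-byte data packet of FIG. 7 on an 8-bit command/address/data bus
# needs 8 bit times, i.e. 4 clock cycles; on a 32-bit bus, 2 bit times.
```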
The control line indicates whether the data transmitted upon the command/address/data bus is a bit time of a control packet or a bit time of a data packet. The control line is asserted to indicate a control packet, and deasserted to indicate a data packet. Certain control packets indicate that a data packet follows. The data packet may immediately follow the corresponding control packet. In one embodiment, other control packets may interrupt the transmission of a data packet. Such an interruption may be performed by asserting the control line for a number of bit times during transmission of the data packet and transmitting the bit times of the control packet while the control line is asserted. Control packets that interrupt a data packet may not indicate that a data packet will be following.

The command/address/data bus comprises a set of lines for transmitting the data, command, response and address bits. In one embodiment, the command/address/data bus may comprise 8, 16, or 32 lines. Each processing node or I/O bridge may employ any one of the supported numbers of lines according to design choice. Other embodiments may support other sizes of command/address/data bus as desired.

According to one embodiment, the command/address/data bus lines and the clock line may carry inverted data (i.e., a logical one is represented as a low voltage on the line, and a logical zero is represented as a high voltage). Alternatively, these lines may carry non-inverted data (in which a logical one is represented as a high voltage on the line, and a logical zero is represented as a low voltage). A suitable positive and negative logic combination may also be implemented.

Turning now to FIGS. 3-7, exemplary packets employed in a cache-coherent communication (i.e., the communication between processing nodes) according to one embodiment of computer system 10 are shown. FIGS. 3-6 illustrate