`
`United States Patent
`Baumgartner et al.
`
(10) Patent No.: US 6,338,122 B1
(45) Date of Patent: Jan. 8, 2002
`
(54) NON-UNIFORM MEMORY ACCESS (NUMA) DATA PROCESSING SYSTEM THAT SPECULATIVELY FORWARDS A READ REQUEST TO A REMOTE PROCESSING NODE
`
(75) Inventors: Yoanna Baumgartner; Mark Edward Dean; Anna Elman, all of Austin, TX (US)
`
`(73) Assignee: International Business Machines
`Corporation, Armonk, NY (US)
`
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
`
(21) Appl. No.: 09/211,351
(22) Filed: Dec. 15, 1998
(51) Int. Cl.7 .................. G06F 12/00; G06F 13/00
(52) U.S. Cl. ................... 711/141; 711/100; 711/124; 711/147; 711/154
(58) Field of Search ............ 711/124, 122, 141, 100, 147, 154
`
`(56)
`
`References Cited
`
`US. PATENT DOCUMENTS
`
`712/29
`5,754,877 A * 5/1998 Hagersten et a1.
`5,892,970 A * 4/1999 Hagersten ................. .. 710/5
`5,950,226 A * 9/1999 Hagersten et a1.
`711/124
`5,958,019 A * 9/1999 Hagersten et a1. ........ .. 709/400
`
          FOREIGN PATENT DOCUMENTS

EP    0817072 A2    1/1998
`
`* cited by examiner
Primary Examiner—Tuan V. Thai
(74) Attorney, Agent, or Firm—Casimer K. Salys; Bracewell & Patterson, LLP
`(57)
`ABSTRACT
`
A non-uniform memory access (NUMA) computer system includes at least a local processing node and a remote processing node that are each coupled to a node interconnect. The local processing node includes a local interconnect, a processor and a system memory coupled to the local interconnect, and a node controller interposed between the local interconnect and the node interconnect. In response to receipt of a read request from the local interconnect, the node controller speculatively transmits the read request to the remote processing node via the node interconnect. Thereafter, in response to receipt of a response to the read request from the remote processing node, the node controller handles the response in accordance with a resolution of the read request at the local processing node. For example, in one processing scenario, data contained in the response received from the remote processing node is discarded by the node controller if the read request received a Modified Intervention coherency response at the local processing node.
`
`24 Claims, 8 Drawing Sheets
`
[FIG. 1 (front page and Sheet 1 of 8): block diagram of the NUMA computer system, showing a processing node containing processors with processor cores and cache hierarchies, an arbiter, a node controller, a memory controller and system memory, and a mezzanine bus bridge coupling the local interconnect to a mezzanine bus with I/O devices and storage devices; the processing nodes are coupled by the node interconnect.]
[FIG. 2 (Sheet 2 of 8): more detailed block diagram of the node controller shown in FIG. 1, illustrating its address and data paths between the local interconnect and the node interconnect.]
`
[FIG. 3A (Sheet 3 of 8): flowchart of read request processing at the requesting processing node, including issuing the request on the local interconnect, compiling AStatOut/AStatIn and ARespOut/ARespIn votes, speculative forwarding of the request transaction to the home node by the node controller, issuing a write-with-clean to the home node on a Modified Intervention vote, discarding or forwarding the data returned from the remote node, and loading the requested cache line into the requesting processor's cache hierarchy.]
`
[FIG. 3B (Sheet 4 of 8): flowchart of read request processing at the home node, including receipt of the transaction by the node controller of the home node, issuing it on the home node's local interconnect, servicing of the read request by a snooper that supplies a copy of the requested cache line to the node controller, transmission of the requested cache line to the requesting processing node, and updating of system memory with the cache line contained in a write transaction.]
`
`
[FIGS. 4A-4D (Sheets 5-8 of 8): successive views of an exemplary processing scenario in which a read request is speculatively forwarded to the home node in accordance with the method of FIGS. 3A and 3B.]
`
`NON-UNIFORM MEMORY ACCESS (NUMA)
`DATA PROCESSING SYSTEM THAT
`SPECULATIVELY FORWARDS A READ
`REQUEST TO A REMOTE PROCESSING
`NODE
`
`BACKGROUND OF THE INVENTION
`
`2
`either non-coherent or cache coherent, depending upon
`Whether or not data coherency is maintained betWeen caches
`in different nodes. The complexity of cache coherent NUMA
`(CC-NUMA) systems is attributable in large measure to the
`additional communication required for hardWare to maintain
`data coherency not only betWeen the various levels of cache
`memory and system memory Within each node but also
`betWeen cache and system memories in different nodes.
`NUMA computer systems do, hoWever, address the scal
`ability limitations of conventional SMP computer systems
`since each node Within a NUMA computer system can be
`implemented as a smaller SMP system. Thus, the shared
`components Within each node can be optimiZed for use by
`only a feW processors, While the overall system bene?ts
`from the availability of larger scale parallelism While main
`taining relatively loW latency.
`A principal performance concern With CC-NUMA com
`puter systems is the latency associated With communication
`transactions transmitted via the interconnect coupling the
`nodes. In particular, read transactions, Which are by far the
`most common type of transaction, may have tWice the
`latency When targeting data resident in remote system
`memory as compared to read transactions targeting data
`resident in local system memory. Because of the relatively
`high latency associated With read transactions transmitted on
`the nodal interconnect versus read transactions on the local
`interconnects, it is useful and desirable to reduce the latency
`of read transactions transmitted over the nodal interconnect.
`
`SUMMARY OF THE INVENTION
`In accordance With the present invention, a non-uniform
`memory access (NUMA) computer system includes at least
`a local processing node and a remote processing node that
`are each coupled to a node interconnect. The local process
`ing node includes a local interconnect, a processor and a
`system memory coupled to the local interconnect, and a
`node controller interposed betWeen the local interconnect
`and the node interconnect. In response to receipt of a read
`request from the local interconnect, the node controller
`speculatively transmits the read request to the remote pro
`cessing node via the node interconnect. Thereafter, in
`response to receipt of a response to the read request from the
`remote processing node, the node controller handles the
`response in accordance With a resolution of the read request
`at the local processing node. For example, in one processing
`scenario, data contained in the response received from the
`remote processing node is discarded by the node controller
`if the read request received a Modi?ed Intervention coher
`ency response at the local processing node.
`All objects, features, and advantages of the present inven
`tion Will become apparent in the folloWing detailed Written
`description.
`BRIEF DESCRIPTION OF THE DRAWINGS
`The novel features believed characteristic of the invention
`are set forth in the appended claims. The invention itself
`hoWever, as Well as a preferred mode of use, further objects
`and advantages thereof, Will best be understood by reference
`to the folloWing detailed description of an illustrative
`embodiment When read in conjunction With the accompa
`nying draWings, Wherein:
`FIG. 1 depicts an illustrative embodiment of a NUMA
`computer system in accordance With the present invention;
`FIG. 2 is a more detailed block diagram of the node
`controller shoWn in FIG. 1;
`FIGS. 3A and 3B are high level logical ?oWcharts that
`together illustrate an exemplary method of processing
`
SUMMARY OF THE INVENTION

In accordance with the present invention, a non-uniform memory access (NUMA) computer system includes at least a local processing node and a remote processing node that are each coupled to a node interconnect. The local processing node includes a local interconnect, a processor and a system memory coupled to the local interconnect, and a node controller interposed between the local interconnect and the node interconnect. In response to receipt of a read request from the local interconnect, the node controller speculatively transmits the read request to the remote processing node via the node interconnect. Thereafter, in response to receipt of a response to the read request from the remote processing node, the node controller handles the response in accordance with a resolution of the read request at the local processing node. For example, in one processing scenario, data contained in the response received from the remote processing node is discarded by the node controller if the read request received a Modified Intervention coherency response at the local processing node.

All objects, features, and advantages of the present invention will become apparent in the following detailed written description.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 depicts an illustrative embodiment of a NUMA computer system in accordance with the present invention;

FIG. 2 is a more detailed block diagram of the node controller shown in FIG. 1;

FIGS. 3A and 3B are high level logical flowcharts that together illustrate an exemplary method of processing request transactions in which read requests at a source processing node are speculatively forwarded to a remote processing node; and

FIGS. 4A-4D together illustrate an exemplary processing scenario in accordance with the method depicted in FIGS. 3A and 3B.
`
`DETAILED DESCRIPTION OF ILLUSTRATIVE
`EMBODIMENT
`
System Overview
With reference now to the figures and in particular with reference to FIG. 1, there is depicted an illustrative embodiment of a NUMA computer system in accordance with the present invention. The depicted embodiment can be realized, for example, as a workstation, server, or mainframe computer. As illustrated, NUMA computer system 6 includes a number (N≥2) of processing nodes 8a-8n, which are interconnected by node interconnect 22. Processing nodes 8a-8n may each include M (M≥0) processors 10, a local interconnect 16, and a system memory 18 that is accessed via a memory controller 17. Processors 10a-10m are preferably (but not necessarily) identical and may comprise a processor within the PowerPC™ line of processors available from International Business Machines (IBM) Corporation of Armonk, N.Y. In addition to the registers, instruction flow logic and execution units utilized to execute program instructions, which are generally designated as processor core 12, each of processors 10a-10m also includes an on-chip cache hierarchy that is utilized to stage data to the associated processor core 12 from system memories 18. Each cache hierarchy 14 may include, for example, a level one (L1) cache and a level two (L2) cache having storage capacities of between 8-32 kilobytes (kB) and 1-16 megabytes (MB), respectively.
Each of processing nodes 8a-8n further includes a respective node controller 20 coupled between local interconnect 16 and node interconnect 22. Each node controller 20 serves as a local agent for remote processing nodes 8 by performing at least two functions. First, each node controller 20 snoops the associated local interconnect 16 and facilitates the transmission of local communication transactions to remote processing nodes 8. Second, each node controller 20 snoops communication transactions on node interconnect 22 and masters relevant communication transactions on the associated local interconnect 16. Communication on each local interconnect 16 is controlled by an arbiter 24. Arbiters 24 regulate access to local interconnects 16 based on bus request signals generated by processors 10 and compile coherency responses for snooped communication transactions on local interconnects 16, as discussed further below.

Local interconnect 16 is coupled, via mezzanine bus bridge 26, to a mezzanine bus 30, which may be implemented as a Peripheral Component Interconnect (PCI) local bus, for example. Mezzanine bus bridge 26 provides both a low latency path through which processors 10 may directly access devices among I/O devices 32 and storage devices 34 that are mapped to bus memory and/or I/O address spaces and a high bandwidth path through which I/O devices 32 and storage devices 34 may access system memory 18. I/O devices 32 may include, for example, a display device, a keyboard, a graphical pointer, and serial and parallel ports for connection to external networks or attached devices. Storage devices 34, on the other hand, may include optical or magnetic disks that provide non-volatile storage for operating system and application software.
Memory Organization
All of processors 10 in NUMA computer system 6 share a single physical memory space, meaning that each physical address is associated with only a single location in one of system memories 18. Thus, the overall contents of the system memory, which can generally be accessed by any processor 10 in NUMA computer system 6, can be viewed as partitioned between system memories 18. For example, in an illustrative embodiment of the present invention having four processing nodes 8, NUMA computer system may have a 16 gigabyte (GB) physical address space including both a general purpose memory area and a reserved area. The general purpose memory area is divided into 500 MB segments, with each of the four processing nodes 8 being allocated every fourth segment. The reserved area, which may contain approximately 2 GB, includes system control and peripheral memory and I/O areas that are each allocated to a respective one of processing nodes 8.

For purposes of the present discussion, the processing node 8 that stores a particular datum in its system memory 18 is said to be the home node for that datum; conversely, others of processing nodes 8a-8n are said to be remote nodes with respect to the particular datum.
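By way of illustration only, the segment-interleaved allocation described above can be sketched as follows (in C). The 500 MB segment size and the four-node round-robin assignment come from the example in the text; the routine and its names are assumptions, since the patent does not specify how the memory map is realized.

    #include <stdint.h>

    #define SEGMENT_BYTES (500ull * 1024 * 1024)   /* 500 MB segments (from the example above) */
    #define NUM_NODES     4                        /* four processing nodes in the example     */

    /* Returns the home node of a physical address in the general purpose
     * memory area: segments are allocated to the nodes round-robin, so
     * node k owns every fourth segment starting with segment k.          */
    static unsigned home_node(uint64_t phys_addr)
    {
        uint64_t segment = phys_addr / SEGMENT_BYTES;
        return (unsigned)(segment % NUM_NODES);
    }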
`Memory Coherency
Because data stored within each system memory 18 can be requested, accessed, and modified by any processor 10 within NUMA computer system 6, NUMA computer system 6 implements a cache coherence protocol to maintain coherence both between caches in the same processing node and between caches in different processing nodes. Thus, NUMA computer system 6 is properly classified as a CC-NUMA computer system. The cache coherence protocol that is implemented is implementation-dependent and may comprise, for example, the well-known Modified, Exclusive, Shared, Invalid (MESI) protocol or a variant thereof. Hereafter, it will be assumed that cache hierarchies 14 and arbiters 24 implement the conventional MESI protocol, of which node controllers 20 recognize the M, S and I states and consider the E state to be merged into the M state for correctness. That is, node controllers 20 assume that data held exclusively by a remote cache has been modified, whether or not the data has actually been modified.
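A minimal sketch of this state folding follows; the enum and function names are illustrative, since the patent describes the behavior but not any particular encoding.

    /* Node controllers 20 track only M, S and I, so a line that a remote
     * cache may hold in the Exclusive state is conservatively treated as
     * Modified. */
    typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state_t;
    typedef enum { NC_M, NC_S, NC_I } nc_state_t;

    static nc_state_t node_controller_view(mesi_state_t remote_state)
    {
        switch (remote_state) {
        case MESI_M:
        case MESI_E:   /* assume exclusively held data has been modified */
            return NC_M;
        case MESI_S:
            return NC_S;
        default:
            return NC_I;
        }
    }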
`Interconnect Architecture
Local interconnects 16 and node interconnect 22 can each be implemented with any bus-based broadcast architecture, switch-based broadcast architecture, or switch-based non-broadcast architecture. However, in a preferred embodiment, at least node interconnect 22 is implemented as a switch-based non-broadcast interconnect governed by the 6xx communication protocol developed by IBM Corporation. Local interconnects 16 and node interconnect 22 permit split transactions, meaning that no fixed timing relationship exists between the address and data tenures comprising a communication transaction and that data packets can be ordered differently than the associated address packets. The utilization of local interconnects 16 and node interconnect 22 is also preferably enhanced by pipelining communication transactions, which permits a subsequent communication transaction to be sourced prior to the master of a previous communication transaction receiving coherency responses from each recipient.

Regardless of the type or types of interconnect architecture that are implemented, at least three types of "packets" (packet being used here generically to refer to a discrete unit of information)—address, data, and coherency response—are utilized to convey information between processing nodes 8 via node interconnect 22 and between snoopers via local interconnects 16. Referring now to Tables I and II, a summary of relevant fields and definitions are given for address and data packets, respectively.
`
`11
`
`
`
`US
`6,338,122 B1
`
`5
`
TABLE I

Field Name         Description

Address <0:7>      Modifiers defining attributes of a communication transaction for coherency, write thru, and protection
Address <8:15>     Tag used to identify all packets within a communication transaction
Address <16:63>    Address portion that indicates the physical, virtual or I/O address in a request
AParity <0:2>      Indicates parity for address bits <0:63>
TDescriptors       Indicate size and type of communication transaction

TABLE II

Field Name           Description

Data <0:127>         Data for read and write transactions
Data parity <0:15>   Indicates parity for data lines <0:127>
DTag <0:7>           Tag used to match a data packet with an address packet
DValid <0:1>         Indicates if valid information is present in Data and DTag fields
`
As indicated in Tables I and II, to permit a recipient node or snooper to determine the communication transaction to which each packet belongs, each packet in a communication transaction is identified with a transaction tag. Those skilled in the art will appreciate that additional flow control logic and associated flow control signals may be utilized to regulate the utilization of the finite communication resources.
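For illustration, the packet fields of Tables I and II and the tag-matching rule just described might be represented as follows; the C structures are only a sketch, since the interconnect carries these fields as wire signals rather than software data structures.

    #include <stdint.h>
    #include <stdbool.h>

    struct address_packet {
        uint8_t  modifiers;     /* Address <0:7>: coherency, write thru, protection  */
        uint8_t  tag;           /* Address <8:15>: transaction tag                   */
        uint64_t address;       /* Address <16:63>: physical, virtual or I/O address */
        uint8_t  aparity;       /* AParity <0:2>: parity for address bits <0:63>     */
        uint8_t  tdescriptors;  /* TDescriptors: size and type of the transaction    */
    };

    struct data_packet {
        uint8_t  data[16];      /* Data <0:127>                                      */
        uint16_t data_parity;   /* Data parity <0:15>                                */
        uint8_t  dtag;          /* DTag <0:7>: matches the tag of an address packet  */
        uint8_t  dvalid;        /* DValid <0:1>                                      */
    };

    /* A data packet belongs to the communication transaction whose address
     * packet carries the same transaction tag. */
    static bool same_transaction(const struct address_packet *a,
                                 const struct data_packet *d)
    {
        return a->tag == d->dtag;
    }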
`Within each processing node 8, status and coherency
`responses are communicated between each snooper and the
`local arbiter 24. The signal lines within local interconnects
`16 that are utilized for status and coherency communication
`are summarized below in Table III.
`
TABLE III

Signal Name       Description

AStatOut <0:1>    Encoded signals asserted by each bus receiver to indicate flow control or error information to arbiter
AStatIn <0:1>     Encoded signals asserted by arbiter in response to tallying the AStatOut signals asserted by the bus receivers
ARespOut <0:2>    Encoded signals asserted by each bus receiver to indicate coherency information to arbiter
ARespIn <0:2>     Encoded signals asserted by arbiter in response to tallying the ARespOut signals asserted by the bus receivers
`
Status and coherency responses transmitted via the AResp and AStat lines of local interconnects 16 preferably have a fixed but programmable timing relationship with the associated address packets. For example, the AStatOut votes, which provide a preliminary indication of whether or not each snooper has successfully received an address packet transmitted on local interconnect 16, may be required in the second cycle following receipt of the address packet. Arbiter 24 compiles the AStatOut votes and then issues the AStatIn vote a fixed but programmable number of cycles later (e.g., 1 cycle). Possible AStat votes are summarized below in Table IV.
`
TABLE IV

AStat vote   Meaning

Null         Idle
Ack          Transaction accepted by snooper
Error        Parity error detected in transaction
Retry        Retry transaction, usually for flow control
`
Following the AStatIn period, the ARespOut votes may then be required a fixed but programmable number of cycles (e.g., 2 cycles) later. Arbiter 24 also compiles the ARespOut votes of each snooper and delivers an ARespIn vote, preferably during the next cycle. The possible AResp votes preferably include the coherency responses listed in Table V.
`
TABLE V

Coherency responses     Meaning

Retry                   Source of request must retry transaction usually for flow control reasons
Modified intervention   Line is modified in cache and will be sourced to requestor
Shared                  Line is held shared in cache
Null                    Line is invalid in cache
ReRun                   Snooped request has long latency and source of request will be instructed to reissue transaction at a later time
`
`The ReRun AResp vote, which is usually issued by a node
`controller 20, indicates that the snooped request has a long
`latency and that the source of the request will be instructed
`to reissue the transaction at a later time. Thus, in contrast to
`a Retry AResp vote, a ReRun makes the recipient of a
`transaction that voted ReRun (and not the originator of the
`transaction) responsible for causing the communication
`transaction to be reissued at a later time.
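The text does not specify how arbiter 24 combines the individual ARespOut votes into a single ARespIn vote. The sketch below assumes one plausible rule, namely that the highest-precedence vote wins, with Retry given the highest precedence; that ordering is an assumption made purely for illustration.

    /* AResp votes of Table V, listed here from lowest to highest assumed
     * precedence. */
    typedef enum {
        ARESP_NULL = 0,
        ARESP_RERUN,
        ARESP_SHARED,
        ARESP_MODIFIED_INTERVENTION,
        ARESP_RETRY
    } aresp_t;

    /* Arbiter 24 tallies the ARespOut votes of all snoopers into one
     * ARespIn vote; keeping the highest-precedence vote seen is assumed. */
    static aresp_t compile_arespin(const aresp_t votes[], int num_snoopers)
    {
        aresp_t result = ARESP_NULL;
        for (int i = 0; i < num_snoopers; i++)
            if (votes[i] > result)
                result = votes[i];
        return result;
    }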
`Node Controller
Referring now to FIG. 2, there is illustrated a more detailed block diagram of a node controller 20 in NUMA computer system 6 of FIG. 1. As shown in FIG. 2, each node controller 20, which is coupled between a local interconnect 16 and node interconnect 22, includes a transaction receive unit (TRU) 40, a transaction send unit (TSU) 42, a data receive unit (DRU) 44, and a data send unit (DSU) 46. TRU 40, TSU 42, DRU 44 and DSU 46 can be implemented, for example, with field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs). As indicated, the address and data paths through node controller 20 are bifurcated, with address (and coherency) packets being processed by TRU 40 and TSU 42 and data packets being processed by DRU 44 and DSU 46.

TRU 40, which is so designated to indicate transaction flow off of node interconnect 22, is responsible for accepting address and coherency packets from node interconnect 22, issuing transactions on local interconnect 16, and forwarding responses to TSU 42. TRU 40 includes response multiplexer (mux) 52, which receives packets from node interconnect 22 and passes selected packets to both bus master 54 and coherency response logic 56 within TSU 42. In response to receipt of an address packet from response multiplexer 52, bus master 54 can initiate a communication transaction on its local interconnect 16 that is the same as or different from the type of communication transaction indicated by the received address packet.
`
TSU 42, which as indicated by its nomenclature is a conduit for transactions flowing onto node interconnect 22, includes a multiple-entry pending buffer 60 that temporarily stores attributes of communication transactions sourced onto node interconnect 22 that have yet to be completed. The transaction attributes stored in an entry of pending buffer 60 preferably include at least the address (including tag) of the transaction, the type of the transaction, and the number of expected coherency responses. Each pending buffer entry has an associated status, which can be set either to Null, indicating that the pending buffer entry can be deleted, or to ReRun, indicating that the transaction is still pending. In addition to sourcing address packets on node interconnect 22, TSU 42 interacts with TRU 40 to process memory request transactions and issues commands to DRU 44 and DSU 46 to control the transfer of data between local interconnect 16 and node interconnect 22. TSU 42 also implements the selected (i.e., MSI) coherency protocol for node interconnect 22 with coherency response logic 56 and maintains coherence directory 50 with directory control logic 58.
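An entry of pending buffer 60 as described above might be sketched as follows; the field types and names are illustrative, since the patent gives only the attributes an entry must hold.

    #include <stdint.h>

    enum entry_status {
        ENTRY_NULL,    /* entry may be deleted          */
        ENTRY_RERUN    /* transaction is still pending  */
    };

    struct pending_buffer_entry {
        uint64_t          address;         /* address of the transaction             */
        uint8_t           tag;             /* transaction tag                        */
        uint8_t           type;            /* type of the transaction                */
        uint8_t           expected_resps;  /* number of expected coherency responses */
        enum entry_status status;          /* Null or ReRun, as described above      */
    };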
Coherence directory 50 stores indications of the system memory addresses of data (e.g., cache lines) checked out to caches in remote nodes for which the local processing node is the home node. The address indication for each cache line is stored in association with an identifier of each remote processing node having a copy of the cache line and the coherency status of the cache line at each such remote processing node. Possible coherency states for entries in coherency directory 50 are summarized in Table VI.
`
TABLE VI

Coherence          Possible state(s)   Possible state(s)
directory state    in local cache      in remote cache     Meaning

Modified (M)       I                   M, E, or I          Cache line may be modified at a remote node with respect to system memory at home node
Shared (S)         S or I              S or I              Cache line may be held non-exclusively at remote node
Invalid (I)        M, E, S, or I       I                   Cache line is not held by any remote node
Pending-shared     S or I              S or I              Cache line is in the process of being invalidated at remote nodes
Pending-modified   I                   M, E, or I          Cache line, which may be modified remotely, is in process of being written back to system memory at home node, possibly with invalidation at remote node
`
As indicated in Table VI, the knowledge of the coherency states of cache lines held by remote processing nodes is imprecise. This imprecision is due to the fact that a cache line held remotely can make a transition from S to I, from E to I, or from E to M without notifying the node controller 20 of the home node.
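For illustration, an entry of coherence directory 50 might be represented as follows. The states come from Table VI; recording the remote copy holders as a node bit mask is an assumption made only for the sketch.

    #include <stdint.h>

    typedef enum {
        DIR_MODIFIED,          /* line may be modified at a remote node                     */
        DIR_SHARED,            /* line may be held non-exclusively at a remote node         */
        DIR_INVALID,           /* line is not held by any remote node                       */
        DIR_PENDING_SHARED,    /* invalidation at remote nodes in progress                  */
        DIR_PENDING_MODIFIED   /* write-back to system memory at the home node in progress  */
    } dir_state_t;

    struct directory_entry {
        uint64_t    line_address;    /* system memory address of the cache line           */
        uint32_t    remote_holders;  /* bit mask of remote nodes holding a copy (assumed)  */
        dir_state_t state;           /* directory state, imprecise as noted above          */
    };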
`Processing Read Request Transactions
Referring now to FIGS. 3A and 3B, there are illustrated two high level logical flowcharts that together depict an exemplary method for processing read request transactions in accordance with the present invention. Referring first to FIG. 3A, the process begins at block 70 and thereafter proceeds to block 72, which depicts a processor 10, such as processor 10a of processing node 8a, issuing a read request transaction on its local interconnect 16. The read request transaction is received by node controller 20 and the rest of the snoopers coupled to local interconnect 16 of processing node 8a. In response to receipt of the read request, the snoopers drive AStatOut votes, which are compiled by arbiter 24 to generate an AStatIn vote, as shown at block 74. Before node controller 20 supplies an Ack AStatOut vote to permit the read request to proceed, node controller 20 allocates both a read entry and a write-with-clean entry in pending buffer 60, if the read request specifies an address in a remote system memory 18. As discussed further below, by allocating both entries, node controller 20 is able to speculatively forward the read request to the home node of the requested cache line and correctly handle the response to the read request regardless of the outcome of the subsequent AResp vote at processing node 8a.
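A minimal sketch of this allocation step follows. The helper routines stand in for node controller internals that the patent does not detail and are assumptions made only so the control flow can be shown.

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-ins (assumed) for node controller internals. */
    bool is_remote(uint64_t addr);                     /* does the address map to a remote node?    */
    int  alloc_read_entry(uint64_t addr);              /* reserve a read entry in pending buffer 60 */
    int  alloc_wwc_entry(uint64_t addr);               /* reserve a write-with-clean entry          */
    void free_entry(int entry);
    void forward_to_home(uint64_t addr, uint8_t tag);  /* speculative forward via node interconnect */

    /* Invoked when a read request is snooped on the local interconnect,
     * before node controller 20 supplies its Ack AStatOut vote.  Returns
     * true if the request was speculatively forwarded to the home node. */
    static bool prepare_speculative_read(uint64_t addr, uint8_t tag)
    {
        if (!is_remote(addr))
            return false;                      /* local address: nothing to forward    */

        int rd  = alloc_read_entry(addr);      /* both entries are allocated so either */
        int wwc = alloc_wwc_entry(addr);       /* outcome of the AResp vote is handled */
        if (rd < 0 || wwc < 0) {
            if (rd  >= 0) free_entry(rd);
            if (wwc >= 0) free_entry(wwc);
            return false;                      /* no pending buffer space available    */
        }

        forward_to_home(addr, tag);
        return true;
    }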
Referring now to block 76, if the AStatIn vote generated at block 74 is Retry, the read request is essentially killed, allocated entries, if any, in pending buffer 60 are freed, and the process returns to block 72, which has been described. In this case, processor 10a must reissue the read request at a later time. If, on the other hand, the AStatIn vote generated at block 74 is not Retry, the process proceeds from block 76 to block 78, which depicts node controller 20 determining by reference to the memory map whether or not its processing node 8 is the home node of the physical address specified in the read request. If so, the process proceeds to block 80; however, if the local processing node 8 is not the home node for the read request, the process proceeds to block 100.
Referring now to block 80, the snoopers within processing node 8a then provide their ARespOut votes, which arbiter 24 compiles to generate an ARespIn vote. If coherency directory 50 indicates that the cache line identified by the address specified in the read request is checked out to at least one remote processing node 8, node controller 20 will vote ReRun if servicing the read request requires communication with a remote processing node 8. For example, if coherency directory 50 indicates that a requested