(12) United States Patent
Baumgartner et al.

(10) Patent No.: US 6,338,122 B1
(45) Date of Patent: Jan. 8, 2002

(54) NON-UNIFORM MEMORY ACCESS (NUMA) DATA PROCESSING SYSTEM THAT SPECULATIVELY FORWARDS A READ REQUEST TO A REMOTE PROCESSING NODE

(75) Inventors: Yoanna Baumgartner; Mark Edward Dean; Anna Elman, all of Austin, TX (US)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/211,351

(22) Filed: Dec. 15, 1998

(51) Int. Cl.: G06F 12/00; G06F 13/00

(52) U.S. Cl.: 711/141; 711/100; 711/124; 711/147; 711/154

(58) Field of Search: 711/124, 122, 141, 100, 147, 154

(56) References Cited

U.S. PATENT DOCUMENTS

5,754,877 A * 5/1998 Hagersten et al. ........ 712/29
5,892,970 A * 4/1999 Hagersten ............... 710/5
5,950,226 A * 9/1999 Hagersten et al. ........ 711/124
5,958,019 A * 9/1999 Hagersten et al. ........ 709/400

FOREIGN PATENT DOCUMENTS

EP 0817072 A2 1/1998

* cited by examiner

Primary Examiner—Tuan V. Thai
(74) Attorney, Agent, or Firm—Casimer K. Salys; Bracewell & Patterson, LLP

(57) ABSTRACT

A non-uniform memory access (NUMA) computer system includes at least a local processing node and a remote processing node that are each coupled to a node interconnect. The local processing node includes a local interconnect, a processor and a system memory coupled to the local interconnect, and a node controller interposed between the local interconnect and the node interconnect. In response to receipt of a read request from the local interconnect, the node controller speculatively transmits the read request to the remote processing node via the node interconnect. Thereafter, in response to receipt of a response to the read request from the remote processing node, the node controller handles the response in accordance with a resolution of the read request at the local processing node. For example, in one processing scenario, data contained in the response received from the remote processing node is discarded by the node controller if the read request received a Modified Intervention coherency response at the local processing node.

24 Claims, 8 Drawing Sheets
[Representative drawing (FIG. 1): block diagram of NUMA computer system 6, showing a processing node 8 containing processors 10 (each with a processor core 12 and cache hierarchy 14), an arbiter 24, a node controller 20 coupled to node interconnect 22, a memory controller 17 and system memory 18, and a mezzanine bus bridge 26 coupling I/O devices 32 and storage devices 34 to local interconnect 16 via mezzanine bus 30.]
[Sheet 1 of 8, FIG. 1: the block diagram of NUMA computer system 6 reproduced on the cover page.]

[Sheet 2 of 8, FIG. 2: a more detailed block diagram of node controller 20, showing the transaction receive and send units, data receive and send units, bus master, coherency response logic, directory control logic, pending buffer, and coherence directory between local interconnect 16 and node interconnect 22.]
[Sheet 3 of 8, FIG. 3A: high level logical flowchart of read-request processing at the requesting processing node. Legible blocks include: processor issues request transaction on its local interconnect; snoopers provide AStatOut votes, which the arbiter compiles to generate the AStatIn vote; node controller speculatively forwards the request transaction to the home node; snoopers provide ARespOut votes, which the arbiter compiles to generate the ARespIn vote; depending on whether the ARespIn vote was Retry, Modified Intervention, or Shared Intervention, data received from the remote processing node is discarded, a local snooper supplies the requested cache line, or the node controller transmits the requested cache line to the requesting processor, which loads it into its cache hierarchy; request transaction serviced.]
[Sheet 4 of 8, FIG. 3B: flowchart of request processing at the home node. Legible blocks include: receive transaction at home node; node controller of home node transmits transaction on local interconnect of home node; read request serviced by a snooper supplying a copy of the requested cache line to the node controller; node controller transmits requested cache line to requesting processing node; update system memory with cache line contained in write transaction; perform action indicated by transaction.]
[Sheets 5 through 8 of 8, FIGS. 4A-4D: successive stages of an exemplary processing scenario in which a read request issued at one processing node is speculatively forwarded to the home node, showing the processors, system memories, and node controllers of the two nodes and the transfer of the requested copy of the cache line.]
US 6,338,122 B1

NON-UNIFORM MEMORY ACCESS (NUMA) DATA PROCESSING SYSTEM THAT SPECULATIVELY FORWARDS A READ REQUEST TO A REMOTE PROCESSING NODE

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates in general to a method and system for data processing and, in particular, to data processing within a non-uniform memory access (NUMA) data processing system. Still more particularly, the present invention relates to a NUMA data processing system and method of communication in a NUMA data processing system in which read requests are speculatively forwarded to remote memory.

2. Description of the Related Art

It is well-known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multi-processor (MP) computer systems can be designed with a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One of the most common MP computer topologies is a symmetric multi-processor (SMP) configuration in which multiple processors share common resources, such as a system memory and input/output (I/O) subsystem, which are typically coupled to a shared system interconnect. Such computer systems are said to be symmetric because all processors in an SMP computer system ideally have the same access latency with respect to data stored in the shared system memory.

Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. In other words, while performance of a typical SMP computer system can generally be expected to improve with scale (i.e., with the addition of more processors), inherent bus, memory, and input/output (I/O) bandwidth limitations prevent significant advantage from being obtained by scaling an SMP beyond an implementation-dependent size at which the utilization of these shared resources is optimized. Thus, the SMP topology itself suffers to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases. SMP computer systems also do not scale well from the standpoint of manufacturing efficiency. For example, although some components can be optimized for use in both uniprocessor and small-scale SMP computer systems, such components are often inefficient for use in large-scale SMPs. Conversely, components designed for use in large-scale SMPs are impractical for use in smaller systems from a cost standpoint.

As a result, an MP computer system topology known as non-uniform memory access (NUMA) has emerged as an alternative design that addresses many of the limitations of SMP computer systems at the expense of some additional complexity. A typical NUMA computer system includes a number of interconnected nodes that each include one or more processors and a local "system" memory. Such computer systems are said to have a non-uniform memory access because each processor has lower access latency with respect to data stored in the system memory at its local node than with respect to data stored in the system memory at a remote node. NUMA systems can be further classified as either non-coherent or cache coherent, depending upon whether or not data coherency is maintained between caches in different nodes. The complexity of cache coherent NUMA (CC-NUMA) systems is attributable in large measure to the additional communication required for hardware to maintain data coherency not only between the various levels of cache memory and system memory within each node but also between cache and system memories in different nodes. NUMA computer systems do, however, address the scalability limitations of conventional SMP computer systems since each node within a NUMA computer system can be implemented as a smaller SMP system. Thus, the shared components within each node can be optimized for use by only a few processors, while the overall system benefits from the availability of larger scale parallelism while maintaining relatively low latency.

A principal performance concern with CC-NUMA computer systems is the latency associated with communication transactions transmitted via the interconnect coupling the nodes. In particular, read transactions, which are by far the most common type of transaction, may have twice the latency when targeting data resident in remote system memory as compared to read transactions targeting data resident in local system memory. Because of the relatively high latency associated with read transactions transmitted on the nodal interconnect versus read transactions on the local interconnects, it is useful and desirable to reduce the latency of read transactions transmitted over the nodal interconnect.

SUMMARY OF THE INVENTION

In accordance with the present invention, a non-uniform memory access (NUMA) computer system includes at least a local processing node and a remote processing node that are each coupled to a node interconnect. The local processing node includes a local interconnect, a processor and a system memory coupled to the local interconnect, and a node controller interposed between the local interconnect and the node interconnect. In response to receipt of a read request from the local interconnect, the node controller speculatively transmits the read request to the remote processing node via the node interconnect. Thereafter, in response to receipt of a response to the read request from the remote processing node, the node controller handles the response in accordance with a resolution of the read request at the local processing node. For example, in one processing scenario, data contained in the response received from the remote processing node is discarded by the node controller if the read request received a Modified Intervention coherency response at the local processing node.

All objects, features, and advantages of the present invention will become apparent in the following detailed written description.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 depicts an illustrative embodiment of a NUMA computer system in accordance with the present invention;

FIG. 2 is a more detailed block diagram of the node controller shown in FIG. 1;

FIGS. 3A and 3B are high level logical flowcharts that together illustrate an exemplary method of processing
request transactions in which read requests at a source processing node are speculatively forwarded to a remote processing node; and

FIGS. 4A-4D together illustrate an exemplary processing scenario in accordance with the method depicted in FIGS. 3A and 3B.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT

System Overview

With reference now to the figures and in particular with reference to FIG. 1, there is depicted an illustrative embodiment of a NUMA computer system in accordance with the present invention. The depicted embodiment can be realized, for example, as a workstation, server, or mainframe computer. As illustrated, NUMA computer system 6 includes a number (N≥2) of processing nodes 8a-8n, which are interconnected by node interconnect 22. Processing nodes 8a-8n may each include M (M≥0) processors 10, a local interconnect 16, and a system memory 18 that is accessed via a memory controller 17. Processors 10a-10m are preferably (but not necessarily) identical and may comprise a processor within the PowerPC line of processors available from International Business Machines (IBM) Corporation of Armonk, N.Y. In addition to the registers, instruction flow logic and execution units utilized to execute program instructions, which are generally designated as processor core 12, each of processors 10a-10m also includes an on-chip cache hierarchy that is utilized to stage data to the associated processor core 12 from system memories 18. Each cache hierarchy 14 may include, for example, a level one (L1) cache and a level two (L2) cache having storage capacities of between 8-32 kilobytes (kB) and 1-16 megabytes (MB), respectively.

Each of processing nodes 8a-8n further includes a respective node controller 20 coupled between local interconnect 16 and node interconnect 22. Each node controller 20 serves as a local agent for remote processing nodes 8 by performing at least two functions. First, each node controller 20 snoops the associated local interconnect 16 and facilitates the transmission of local communication transactions to remote processing nodes 8. Second, each node controller 20 snoops communication transactions on node interconnect 22 and masters relevant communication transactions on the associated local interconnect 16. Communication on each local interconnect 16 is controlled by an arbiter 24. Arbiters 24 regulate access to local interconnects 16 based on bus request signals generated by processors 10 and compile coherency responses for snooped communication transactions on local interconnects 16, as discussed further below.

Local interconnect 16 is coupled, via mezzanine bus bridge 26, to a mezzanine bus 30, which may be implemented as a Peripheral Component Interconnect (PCI) local bus, for example. Mezzanine bus bridge 26 provides both a low latency path through which processors 10 may directly access devices among I/O devices 32 and storage devices 34 that are mapped to bus memory and/or I/O address spaces and a high bandwidth path through which I/O devices 32 and storage devices 34 may access system memory 18. I/O devices 32 may include, for example, a display device, a keyboard, a graphical pointer, and serial and parallel ports for connection to external networks or attached devices. Storage devices 34, on the other hand, may include optical or magnetic disks that provide non-volatile storage for operating system and application software.

Memory Organization

All of processors 10 in NUMA computer system 6 share a single physical memory space, meaning that each physical
address is associated with only a single location in one of system memories 18. Thus, the overall contents of the system memory, which can generally be accessed by any processor 10 in NUMA computer system 6, can be viewed as partitioned between system memories 18. For example, in an illustrative embodiment of the present invention having four processing nodes 8, NUMA computer system may have a 16 gigabyte (GB) physical address space including both a general purpose memory area and a reserved area. The general purpose memory area is divided into 500 MB segments, with each of the four processing nodes 8 being allocated every fourth segment. The reserved area, which may contain approximately 2 GB, includes system control and peripheral memory and I/O areas that are each allocated to a respective one of processing nodes 8.

For purposes of the present discussion, the processing node 8 that stores a particular datum in its system memory 18 is said to be the home node for that datum; conversely, others of processing nodes 8a-8n are said to be remote nodes with respect to the particular datum.
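To make the interleaved allocation concrete, the short sketch below shows how a physical address in the general purpose memory area could be mapped to its home node. It is an illustration only, not part of the patent; the constants and the function name home_node are assumptions chosen to match the four-node, 500 MB example above.

# Hypothetical sketch of the interleaved home-node mapping described above.
# Assumes 4 processing nodes and 500 MB general purpose segments, as in the
# illustrative embodiment; none of these names appear in the patent itself.

SEGMENT_SIZE = 500 * 2**20   # 500 MB segments
NUM_NODES = 4                # four processing nodes

def home_node(phys_addr: int) -> int:
    """Return the index of the processing node whose system memory holds
    this address (its home node), assuming segments of the general purpose
    area are handed out round-robin to the nodes."""
    segment = phys_addr // SEGMENT_SIZE
    return segment % NUM_NODES

# Example: consecutive segments rotate through the nodes.
assert home_node(0) == 0
assert home_node(SEGMENT_SIZE) == 1
assert home_node(4 * SEGMENT_SIZE) == 0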
Memory Coherency

Because data stored within each system memory 18 can be requested, accessed, and modified by any processor 10 within NUMA computer system 6, NUMA computer system 6 implements a cache coherence protocol to maintain coherence both between caches in the same processing node and between caches in different processing nodes. Thus, NUMA computer system 6 is properly classified as a CC-NUMA computer system. The cache coherence protocol that is implemented is implementation-dependent and may comprise, for example, the well-known Modified, Exclusive, Shared, Invalid (MESI) protocol or a variant thereof. Hereafter, it will be assumed that cache hierarchies 14 and arbiters 24 implement the conventional MESI protocol, of which node controllers 20 recognize the M, S and I states and consider the E state to be merged into the M state for correctness. That is, node controllers 20 assume that data held exclusively by a remote cache has been modified, whether or not the data has actually been modified.
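A minimal sketch of this conservative state folding is given below, assuming a simple Python enumeration; the names are illustrative and do not appear in the patent.

# Hypothetical sketch of how a node controller might fold the MESI states of
# remote caches into the M/S/I view described above.

from enum import Enum

class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def node_controller_view(cache_state: MESI) -> MESI:
    """Conservatively map a remote cache's MESI state to the M/S/I subset
    tracked by the node controller: an Exclusive line is treated as Modified,
    since it may be silently modified without notice."""
    if cache_state in (MESI.MODIFIED, MESI.EXCLUSIVE):
        return MESI.MODIFIED
    return cache_state

assert node_controller_view(MESI.EXCLUSIVE) is MESI.MODIFIED
assert node_controller_view(MESI.SHARED) is MESI.SHARED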
Interconnect Architecture

Local interconnects 16 and node interconnect 22 can each be implemented with any bus-based broadcast architecture, switch-based broadcast architecture, or switch-based non-broadcast architecture. However, in a preferred embodiment, at least node interconnect 22 is implemented as a switch-based non-broadcast interconnect governed by the 6xx communication protocol developed by IBM Corporation. Local interconnects 16 and node interconnect 22 permit split transactions, meaning that no fixed timing relationship exists between the address and data tenures comprising a communication transaction and that data packets can be ordered differently than the associated address packets. The utilization of local interconnects 16 and node interconnect 22 is also preferably enhanced by pipelining communication transactions, which permits a subsequent communication transaction to be sourced prior to the master of a previous communication transaction receiving coherency responses from each recipient.

Regardless of the type or types of interconnect architecture that are implemented, at least three types of "packets" (packet being used here generically to refer to a discrete unit of information)—address, data, and coherency response—are utilized to convey information between processing nodes 8 via node interconnect 22 and between snoopers via local interconnects 16. Referring now to Tables I and II, a summary of relevant fields and definitions are given for address and data packets, respectively.
TABLE I

Field Name         Description

Address <0:7>      Modifiers defining attributes of a communication transaction for coherency, write thru, and protection
Address <8:15>     Tag used to identify all packets within a communication transaction
Address <16:63>    Address portion that indicates the physical, virtual or I/O address in a request
AParity <0:2>      Indicates parity for address bits <0:63>
TDescriptors       Indicate size and type of communication transaction

TABLE II

Field Name          Description

Data <0:127>        Data for read and write transactions
Data parity <0:15>  Indicates parity for data lines <0:127>
DTag <0:7>          Tag used to match a data packet with an address packet
DValid <0:1>        Indicates if valid information is present in Data and DTag fields

As indicated in Tables I and II, to permit a recipient node or snooper to determine the communication transaction to which each packet belongs, each packet in a communication transaction is identified with a transaction tag. Those skilled in the art will appreciate that additional flow control logic and associated flow control signals may be utilized to regulate the utilization of the finite communication resources.
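As an illustration of how the tag fields of Tables I and II let a recipient reassemble split transactions, the sketch below matches arriving data packets to previously seen address packets by tag. The class and field names are hypothetical; the patent does not prescribe any software representation.

# Hypothetical sketch: matching data packets to address packets by tag, as
# permitted by the split-transaction interconnects described above.
from dataclasses import dataclass

@dataclass
class AddressPacket:
    tag: int          # Address <8:15>: identifies all packets of a transaction
    address: int      # Address <16:63>: physical, virtual, or I/O address
    ttype: str        # TDescriptors: size and type of the transaction

@dataclass
class DataPacket:
    dtag: int         # DTag <0:7>: matches the tag of an address packet
    data: bytes       # Data <0:127>
    dvalid: bool      # DValid <0:1>: Data/DTag fields hold valid information

def match_data_to_address(addr_pkts, data_pkt):
    """Return the pending address packet, if any, that this data packet
    belongs to. Data may arrive in a different order than the addresses."""
    if not data_pkt.dvalid:
        return None
    return next((a for a in addr_pkts if a.tag == data_pkt.dtag), None)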
Within each processing node 8, status and coherency responses are communicated between each snooper and the local arbiter 24. The signal lines within local interconnects 16 that are utilized for status and coherency communication are summarized below in Table III.

TABLE III

Signal Name       Description

AStatOut <0:1>    Encoded signals asserted by each bus receiver to indicate flow control or error information to arbiter
AStatIn <0:1>     Encoded signals asserted by arbiter in response to tallying the AStatOut signals asserted by the bus receivers
ARespOut <0:2>    Encoded signals asserted by each bus receiver to indicate coherency information to arbiter
ARespIn <0:2>     Encoded signals asserted by arbiter in response to tallying the ARespOut signals asserted by the bus receivers

Status and coherency responses transmitted via the AResp and AStat lines of local interconnects 16 preferably have a fixed but programmable timing relationship with the associated address packets. For example, the AStatOut votes, which provide a preliminary indication of whether or not each snooper has successfully received an address packet transmitted on local interconnect 16, may be required in the second cycle following receipt of the address packet. Arbiter 24 compiles the AStatOut votes and then issues the AStatIn vote a fixed but programmable number of cycles later (e.g., 1 cycle). Possible AStat votes are summarized below in Table IV.

TABLE IV

AStat vote    Meaning

Null          Idle
Ack           Transaction accepted by snooper
Error         Parity error detected in transaction
Retry         Retry transaction, usually for flow control

Following the AStatIn period, the ARespOut votes may then be required a fixed but programmable number of cycles (e.g., 2 cycles) later. Arbiter 24 also compiles the ARespOut votes of each snooper and delivers an ARespIn vote, preferably during the next cycle. The possible AResp votes preferably include the coherency responses listed in Table V.

TABLE V

Coherency responses      Meaning

Retry                    Source of request must retry transaction, usually for flow control reasons
Modified intervention    Line is modified in cache and will be sourced to requestor
Shared                   Line is held shared in cache
Null                     Line is invalid in cache
ReRun                    Snooped request has long latency and source of request will be instructed to reissue transaction at a later time

The ReRun AResp vote, which is usually issued by a node controller 20, indicates that the snooped request has a long latency and that the source of the request will be instructed to reissue the transaction at a later time. Thus, in contrast to a Retry AResp vote, a ReRun makes the recipient of a transaction that voted ReRun (and not the originator of the transaction) responsible for causing the communication transaction to be reissued at a later time.
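How the arbiter reduces the individual votes to a single compiled vote is not spelled out here, so the sketch below assumes one plausible precedence order (Retry strongest, then Modified Intervention, Shared, ReRun, Null) purely for illustration; the actual encoding and precedence are implementation-dependent.

# Hypothetical sketch of an arbiter tallying coherency votes. The precedence
# order below is an assumption; the patent only states that the arbiter
# compiles the ARespOut votes into a single ARespIn vote.

ARESP_PRECEDENCE = ["Retry", "Modified Intervention", "Shared", "ReRun", "Null"]

def compile_aresp_in(aresp_out_votes):
    """Return the single ARespIn vote the arbiter would drive, taken here as
    the highest-precedence vote asserted by any snooper."""
    for vote in ARESP_PRECEDENCE:
        if vote in aresp_out_votes:
            return vote
    return "Null"

# Example: if one snooper votes Modified Intervention while the node
# controller votes ReRun, the compiled ARespIn vote is Modified Intervention,
# matching the scenario in which remotely supplied data is later discarded.
assert compile_aresp_in({"Null", "ReRun", "Modified Intervention"}) == "Modified Intervention"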
Node Controller

Referring now to FIG. 2, there is illustrated a more detailed block diagram of a node controller 20 in NUMA computer system 6 of FIG. 1. As shown in FIG. 2, each node controller 20, which is coupled between a local interconnect 16 and node interconnect 22, includes a transaction receive unit (TRU) 40, a transaction send unit (TSU) 42, a data receive unit (DRU) 44, and a data send unit (DSU) 46. TRU 40, TSU 42, DRU 44 and DSU 46 can be implemented, for example, with field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs). As indicated, the address and data paths through node controller 20 are bifurcated, with address (and coherency) packets being processed by TRU 40 and TSU 42 and data packets being processed by DRU 44 and DSU 46.

TRU 40, which is so designated to indicate transaction flow off of node interconnect 22, is responsible for accepting address and coherency packets from node interconnect 22, issuing transactions on local interconnect 16, and forwarding responses to TSU 42. TRU 40 includes response multiplexer (mux) 52, which receives packets from node interconnect 22 and passes selected packets to both bus master 54 and coherency response logic 56 within TSU 42. In response to receipt of an address packet from response multiplexer 52, bus master 54 can initiate a communication transaction on its local interconnect 16 that is the same as or different from the type of communication transaction indicated by the received address packet.
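As a structural sketch only (the patent contemplates hardware such as FPGAs or ASICs, not software), the bifurcated address and data paths through node controller 20 can be pictured as below; the Packet type and queue names are invented for illustration.

# Hypothetical sketch of the bifurcated paths through node controller 20:
# address/coherency packets go through TRU 40 and TSU 42, data packets
# through DRU 44 and DSU 46.
from collections import namedtuple

Packet = namedtuple("Packet", "kind payload")

class NodeController:
    def __init__(self):
        # Queues standing in for the four units of FIG. 2.
        self.tru, self.tsu = [], []   # address/coherency packets (TRU 40, TSU 42)
        self.dru, self.dsu = [], []   # data packets (DRU 44, DSU 46)

    def from_node_interconnect(self, pkt):
        """Route a packet arriving from node interconnect 22."""
        (self.tru if pkt.kind in ("address", "coherency") else self.dru).append(pkt)

    def to_node_interconnect(self, pkt):
        """Route a packet snooped on local interconnect 16 and bound for node interconnect 22."""
        (self.tsu if pkt.kind in ("address", "coherency") else self.dsu).append(pkt)

nc = NodeController()
nc.from_node_interconnect(Packet("address", 0x1000))
nc.to_node_interconnect(Packet("data", b"\x00" * 16))
assert len(nc.tru) == 1 and len(nc.dsu) == 1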
TSU 42, which as indicated by its nomenclature is a conduit for transactions flowing onto node interconnect 22, includes a multiple-entry pending buffer 60 that temporarily stores attributes of communication transactions sourced onto node interconnect 22 that have yet to be completed. The transaction attributes stored in an entry of pending buffer 60 preferably include at least the address (including tag) of the transaction, the type of the transaction, and the number of expected coherency responses. Each pending buffer entry has an associated status, which can be set either to Null, indicating that the pending buffer entry can be deleted, or to ReRun, indicating that the transaction is still pending. In addition to sourcing address packets on node interconnect 22, TSU 42 interacts with TRU 40 to process memory request transactions and issues commands to DRU 44 and DSU 46 to control the transfer of data between local interconnect 16 and node interconnect 22. TSU 42 also implements the selected (i.e., MSI) coherency protocol for node interconnect 22 with coherency response logic 56 and maintains coherence directory 50 with directory control logic 58.
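A minimal sketch of a pending buffer entry as just described, holding the address (including tag), transaction type, expected response count, and a Null/ReRun status, appears below; the representation is assumed rather than specified by the patent.

# Hypothetical sketch of pending buffer 60 in TSU 42. Field names follow the
# attributes listed above; the data structure itself is an assumption.
from dataclasses import dataclass

@dataclass
class PendingBufferEntry:
    address: int                # address (including tag) of the transaction
    ttype: str                  # type of the transaction, e.g. "READ" or "WRITE_WITH_CLEAN"
    expected_responses: int     # number of expected coherency responses
    status: str = "ReRun"       # "ReRun" while pending, "Null" when it may be deleted

class PendingBuffer:
    def __init__(self):
        self.entries = []

    def allocate(self, address, ttype, expected_responses):
        entry = PendingBufferEntry(address, ttype, expected_responses)
        self.entries.append(entry)
        return entry

    def retire(self, entry):
        """Mark an entry deletable once its transaction has completed."""
        entry.status = "Null"
        self.entries = [e for e in self.entries if e.status != "Null"]

buf = PendingBuffer()
entry = buf.allocate(0x2000, "READ", expected_responses=1)
buf.retire(entry)
assert buf.entries == []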
Coherence directory 50 stores indications of the system memory addresses of data (e.g., cache lines) checked out to caches in remote nodes for which the local processing node is the home node. The address indication for each cache line is stored in association with an identifier of each remote processing node having a copy of the cache line and the coherency status of the cache line at each such remote processing node. Possible coherency states for entries in coherency directory 50 are summarized in Table VI.

TABLE VI

Coherence          Possible state(s)   Possible state(s)
directory state    in local cache      in remote cache     Meaning

Modified (M)       I                   M, E, or I          Cache line may be modified at a remote node with respect to system memory at home node
Shared (S)         S or I              S or I              Cache line may be held non-exclusively at remote node
Invalid (I)        M, E, S, or I       I                   Cache line is not held by any remote node
Pending-shared     S or I              S or I              Cache line is in the process of being invalidated at remote nodes
Pending-modified   I                   M, E, or I          Cache line, which may be modified remotely, is in process of being written back to system memory at home node, possibly with invalidation at remote node

As indicated in Table VI, the knowledge of the coherency states of cache lines held by remote processing nodes is imprecise. This imprecision is due to the fact that a cache line held remotely can make a transition from S to I, from E to I, or from E to M without notifying the node controller 20 of the home node.
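The directory states of Table VI could be modeled as in the following sketch (names assumed). It records only the home node's conservative view, which, as the preceding paragraph notes, may lag the true state of a remotely held line.

# Hypothetical sketch of coherence directory 50 at a home node. Only the
# states of Table VI are modeled; silent S->I, E->I and E->M transitions at
# remote caches are what make this view imprecise.

class CoherenceDirectory:
    def __init__(self):
        # cache line address -> (directory state, ids of remote nodes holding a copy)
        self.lines = {}

    def record_checkout(self, line_addr, node_id, exclusive):
        """Note that a line homed here was checked out by a remote node."""
        state = "Modified" if exclusive else "Shared"
        _, holders = self.lines.get(line_addr, ("Invalid", set()))
        self.lines[line_addr] = (state, holders | {node_id})

    def state_of(self, line_addr):
        return self.lines.get(line_addr, ("Invalid", set()))[0]

directory = CoherenceDirectory()
directory.record_checkout(0x3000, node_id=1, exclusive=True)
assert directory.state_of(0x3000) == "Modified"   # remotely it may in fact be E or even I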
Processing Read Request Transactions

Referring now to FIGS. 3A and 3B, there are illustrated two high level logical flowcharts that together depict an exemplary method for processing read request transactions in accordance with the present invention. Referring first to FIG. 3A, the process begins at block 70 and thereafter proceeds to block 72, which depicts a processor 10, such as processor 10a of processing node 8a, issuing a read request transaction on its local interconnect 16. The read request transaction is received by node controller 20 and the rest of the snoopers coupled to local interconnect 16 of processing node 8a. In response to receipt of the read request, the snoopers drive AStatOut votes, which are compiled by arbiter 24 to generate an AStatIn vote, as shown at block 74. Before node controller 20 supplies an Ack AStatOut vote to permit the read request to proceed, node controller 20 allocates both a read entry and a write-with-clean entry in pending buffer 60, if the read request specifies an address in a remote system memory 18. As discussed further below, by allocating both entries, node controller 20 is able to speculatively forward the read request to the home node of the requested cache line and correctly handle the response to the read request regardless of the outcome of the subsequent AResp vote at processing node 8a.
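The allocation-then-forward step just described can be summarized in the following sketch, which reuses the hypothetical home_node and PendingBuffer sketches given earlier; the helper names and the expected response count are assumptions, and the patent describes the behavior, not this code.

# Hypothetical sketch of the speculative forwarding step described above.
# home_node(), PendingBuffer, and forward_to_home_node() are assumed names
# carried over from the earlier illustrative sketches.

def on_local_read_request(node_controller, request, local_node_id):
    if home_node(request.address) == local_node_id:
        return "Ack"                      # address is homed locally; nothing to forward
    # Allocate both entries up front so that either outcome of the later
    # ARespIn vote at the requesting node can be handled correctly.
    node_controller.pending_buffer.allocate(request.address, "READ", expected_responses=1)
    node_controller.pending_buffer.allocate(request.address, "WRITE_WITH_CLEAN", expected_responses=1)
    node_controller.forward_to_home_node(request)   # speculative forward via node interconnect 22
    return "Ack"                          # AStatOut vote permitting the read request to proceed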
Referring now to block 76, if the AStatIn vote generated at block 74 is Retry, the read request is essentially killed, allocated entries, if any, in pending buffer 60 are freed, and the process returns to block 72, which has been described. In this case, processor 10a must reissue the read request at a later time. If, on the other hand, the AStatIn vote generated at block 74 is not Retry, the process proceeds from block 76 to block 78, which depicts node controller 20 determining by reference to the memory map whether or not its processing node 8 is the home node of the physical address specified in the read request. If so, the process proceeds to block 80; however, if the local processing node 8 is not the home node for the read request, the process proceeds to block 100.

Referring now to block 80, the snoopers within processing node 8a then provide their ARespOut votes, which arbiter 24 compiles to generate an ARespIn vote. If coherency directory 50 indicates that the cache line identified by the address specified in the read request is checked out to at least one remote processing node 8, node controller 20 will vote ReRun if servicing the read request requires communication with a remote processing node 8. For example, if coherency directory 50 indicates that a requested cache line is Modified at a remote processing node 8, servicing a read request will entail forwarding the read request to the remote processing node 8. Similarly, if coherency directory 50 indicates that a requested cache lin