Glasco

(10) Patent No.: US 7,103,725 B2
(45) Date of Patent: *Sep. 5, 2006

US007103725B2
`
`(54) METHODS AND APPARATUS FOR
`SPECULATIVE PROBING WITH EARLY
`COMPLETION AND DELAYED REQUEST
`
(75) Inventor: David B. Glasco, Austin, TX (US)
`
(73) Assignee: Newisys, Inc., Austin, TX (US)
`
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 529 days.

This patent is subject to a terminal disclaimer.
`
(21) Appl. No.: 10/106,430
`
(22) Filed: Mar. 22, 2002

(65) Prior Publication Data
US 2003/0182514 A1    Sep. 25, 2003

(51) Int. Cl.
G06F 12/08 (2006.01)
G06F 12/16 (2006.01)
`
(52) U.S. Cl. ..................... 711/141; 711/118; 711/120; 711/128; 711/130; 711/146; 711/147; 711/148; 711/135
(58) Field of Classification Search ................ 710/106, 315
See application file for complete search history.
`
(56) References Cited

U.S. PATENT DOCUMENTS

5,195,089 A *   3/1993  Sindhu et al. ........... 370/235
6,067,603 A *   5/2000  Carpenter et al. ........ 711/141
6,167,492 A    12/2000  Keller et al. ........... 711/154
6,292,705 B1*   9/2001  Wang et al. ............. 700/5
6,338,122 B1*   1/2002  Baumgartner et al. ...... 711/141
6,385,705 B1    5/2002  Keller et al. ........... 711/154
6,490,661 B1   12/2002  Keller et al. ........... 711/150
6,615,319 B1*   9/2003  Khare et al. ............ 711/141
6,633,945 B1*  10/2003  Fu et al. ............... 710/316
6,754,782 B1*   6/2004  Arimilli et al. ......... 711/146
6,760,819 B1*   7/2004  Dhong et al. ............ 711/149
6,799,252 B1*   9/2004  Bauman .................. 711/144
6,839,808 B1*   1/2005  Gruner et al. ........... 711/130
`
OTHER PUBLICATIONS

HyperTransport™ I/O Link Specification Revision 1.03, HyperTransport™ Consortium, Oct. 10, 2001, Copyright © 2001 HyperTransport Technology Consortium.

* Cited by examiner
`
Primary Examiner: Matthew Kim
Assistant Examiner: Zhuo H. Li
(74) Attorney, Agent, or Firm: Beyer Weaver & Thomas, LLP
`
(57) ABSTRACT

According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in multiple processor, multiple cluster systems. A cache coherence controller associated with a first cluster of processors can determine whether speculative probing can be performed before forwarding a data access request to a second cluster. The cache coherence controller can send the data access request to the second cluster if the data access request cannot be completed locally.
`
`
`51 Claims, 15 Drawing Sheets
`
[Representative drawing (FIG. 9): Request Cluster 900 with CPUs 901-1 and 901-2, caches 903-1 through 903-6, and links 907 and 909; Home Cluster 920; Remote Cluster 940]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 1 of 15    US 7,103,725 B2

[Figure 1A: Processing Clusters 101, 103, 105, and 107 interconnected through point-to-point links 111a through 111f]

[Figure 1B: Processing Clusters 121, 123, 125, and 127 coupled through a switch]
`
`
`
U.S. Patent    Sep. 5, 2006    Sheet 2 of 15    US 7,103,725 B2

[Figure 2: multiple processor cluster with processors, memory, service processor, and cache coherence controller]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 3 of 15    US 7,103,725 B2

[Figure 3: cache coherence controller]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 4 of 15    US 7,103,725 B2

[Figure 4: transaction flow for a data access request]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 5 of 15    US 7,103,725 B2

[Figure 5A: cache coherence controller functionality]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 6 of 15    US 7,103,725 B2

[Figure 5B: cache coherence controller functionality]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 7 of 15    US 7,103,725 B2

[Figure 5C: cache coherence controller functionality, including non-local nodes]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 8 of 15    US 7,103,725 B2

[Figure 5D: cache coherence controller functionality]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 9 of 15    US 7,103,725 B2

[Figure 6: transaction flow for a data access request transmitted to a home cache coherency controller]
`
`
`
U.S. Patent    Sep. 5, 2006    Sheet 10 of 15    US 7,103,725 B2

[Figure 7: transaction flow for speculative probing at a request cluster, showing Request Cluster 700, Home Cluster 720, and Remote Cluster 740]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 11 of 15    US 7,103,725 B2

[Figure 8, process flow: Start; 801 Identify memory line associated with a request from a request cluster processor; 803 Can speculative probing be performed?; 805 Proceed with speculative probing; Proceed without speculative probing; 809 Provide probe information to intervening processor; 811 Provide probe information to request cluster processor; 815 Wait for responses; 813; 823 Proceed without speculative probing; End]
`
`
`
U.S. Patent    Sep. 5, 2006    Sheet 12 of 15    US 7,103,725 B2

[Figure 9: transaction flow for speculative probing with delayed request, showing Request Cluster 900 (CPU 901-1, cache 903-1), Home Cluster 920, and Remote Cluster 940]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 13 of 15    US 7,103,725 B2

[Figure 10, process flow: 1001 Identify cache state in controller; 1003 Cache state = Shared?; 1005 Cache state = Owned?; 1007 Cache state = Exclusive?; 1009 Cache state = Modified?; 1015 Type of access requested? (Read/Write); 1017 Can complete transaction locally; 1011 Cannot complete transaction locally; End]
`
U.S. Patent    Sep. 5, 2006    Sheet 14 of 15    US 7,103,725 B2

[Figure 11: transaction flow for speculative probing with early request, showing Request Cluster 1100, Home Cluster 1120, and Remote Cluster 1140]
`
`
U.S. Patent    Sep. 5, 2006    Sheet 15 of 15    US 7,103,725 B2

[Figure 12, process flow: 1201 Allocate transaction identifier; 1203 Probe local and remote clusters; 1205 Local transaction completes; 1207 Maintain transaction identifier; 1209 All remote probe responses received?; 1211 Clear transaction identifier]
`
`METHODS AND APPARATUS FOR
`SPECULATIVE PROBING WITH EARLY
`COMPLETION AND DELAYED REQUEST
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
This application is related to concurrently filed U.S. application Ser. No. 10/106,426, entitled METHODS AND APPARATUS FOR SPECULATIVE PROBING AT A REQUEST CLUSTER and to concurrently filed U.S. application Ser. No. 10/106,299, entitled METHODS AND APPARATUS FOR SPECULATIVE PROBING WITH EARLY COMPLETION AND EARLY REQUEST, the disclosures of which are incorporated by reference herein for all purposes.
`
`BACKGROUND OF THE INVENTION
`
1. Field of the Invention
The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.
2. Description of Related Art
Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache can be on-chip or off-chip. Each cache block can be read or written by the processor. However, cache coherency problems can arise because multiple copies of the same data can co-exist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the cache of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache. Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor can check or snoop on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.
Bus arbitration can be used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic can decide which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the
`
number of processors that can be implemented in a single system with a single memory space.
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processor, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems, each with its own address space, can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication, including transactions such as sends and receives, are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication, including transactions such as loads and stores, are referred to herein as using a single address space.
Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.
Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.
Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple clusters of multiple processors connected using point-to-point links.
`
`SUMMARY OF THE INVENTION
`
According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. A cache coherence controller associated with a first cluster of processors can determine whether speculative probing can be performed before forwarding a data access request to a second cluster. The cache coherence controller can send the data access request to the second cluster if the data access request cannot be completed locally.
According to specific embodiments, a computer system is provided. A first cluster includes a first plurality of processors and a first cache coherence controller. The first plurality of processors and the first cache coherence controller are interconnected in a point-to-point architecture. A second cluster includes a second plurality of processors and a second cache coherence controller. The second plurality of processors and the second cache coherence controller are interconnected in a point-to-point architecture. The first cache coherence controller is coupled to the second cache coherence controller. The first cache coherence controller is
configured to receive a cache access request originating from the first plurality of processors and send a probe to the first plurality of processors in the first cluster before the cache access request is received by a serialization point in the second cluster. The first cache coherence controller can be further configured to send the cache access request to the second cluster after receiving a probe response from the first node.
In one embodiment, the serialization point is a memory controller in the second cluster. The probe can be associated with the memory line corresponding to the cache access request. The first cache coherence controller can be further configured to respond to the probe originating from the second cluster using information obtained from the probe of the first plurality of processors. The first cache coherence controller can also be associated with a pending buffer.
The first cache coherence controller can send the cache access request to the second cluster after receiving a probe response from the first node. The first cache coherence controller can send the cache access request after determining that the cache access request cannot be completed locally. Whether or not the cache access request can be completed locally may depend on the state of the cache.
According to another embodiment, a cache coherence controller is provided. The cache coherence controller includes interface circuitry coupled to a plurality of local processors in a local cluster and a non-local cache coherence controller in a non-local cluster. The plurality of local processors are arranged in a point-to-point architecture. The cache coherence controller can also include a protocol engine coupled to the interface circuitry. The protocol engine can be configured to receive a cache access request from a first processor in the local cluster and speculatively probe a local node.
According to another embodiment, a method for a cache coherence controller to manage data access in a multiprocessor system is provided. A cache access request is received from a local processor associated with a local cluster of processors connected through a point-to-point architecture. It is determined if speculative probing of a local node associated with a cache can be performed before forwarding the cache request to a non-local cache coherence controller. The non-local cache coherence controller is associated with a remote cluster of processors connected through a point-to-point architecture. The remote cluster of processors shares an address space with the local cluster of processors.
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.
FIGS. 1A and 1B are diagrammatic representations depicting a system having multiple clusters.
FIG. 2 is a diagrammatic representation of a cluster having a plurality of processors.
FIG. 3 is a diagrammatic representation of a cache coherence controller.
FIG. 4 is a diagrammatic representation showing a transaction flow for a data access request.
FIGS. 5A–5D are diagrammatic representations showing cache coherence controller functionality.
`
FIG. 6 is a diagrammatic representation depicting a transaction flow for a data access request from a processor transmitted to a home cache coherency controller.
FIG. 7 is a diagrammatic representation showing a transaction flow for speculative probing at a request cluster.
FIG. 8 is a process flow diagram depicting the handling of intervening requests.
FIG. 9 is a diagrammatic representation showing a transaction flow for speculative probing with delayed request.
FIG. 10 is a process flow diagram depicting the determination of whether a data access request can complete locally.
FIG. 11 is a diagrammatic representation showing a transaction flow for speculative probing with early request.
FIG. 12 is a process flow diagram depicting the maintenance of transaction information.
`
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
Reference will now be made in detail to some specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
Techniques are provided for increasing data access efficiency in a multiple processor, multiple cluster system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters are used.
According to various embodiments, the multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.
By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster
`
`
`
system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests is referred to herein as a serialization point. One example of a serialization point is a memory controller. Various processors in the single cluster system send data access requests to the memory controller. The memory controller can be configured to serialize the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
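The per-memory-line serialization described above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the class and method names (`MemoryController`, `try_access`, `complete`) are assumptions introduced here.

```python
# Sketch of a serialization point: a memory controller that allows only one
# outstanding data access request per memory line at a time. Names are
# illustrative assumptions, not from the patent.

class MemoryController:
    def __init__(self):
        self.locked_lines = set()  # memory lines with a request in flight

    def try_access(self, line):
        """Serialize requests: block access to a line already in use."""
        if line in self.locked_lines:
            return False  # blocked until the line is unlocked
        self.locked_lines.add(line)
        return True

    def complete(self, line):
        """Unlock the line once its data access request finishes."""
        self.locked_lines.discard(line)

mc = MemoryController()
assert mc.try_access(0x1000) is True   # first request for the line proceeds
assert mc.try_access(0x1000) is False  # concurrent request is blocked
mc.complete(0x1000)
assert mc.try_access(0x1000) is True   # line unlocked, access allowed again
```

Because only one request per memory line is in flight, two processors can never commit conflicting writes to the same line concurrently, which is the coherency property the serialization point provides.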
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster, such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.
Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a single second. Any delay can adversely impact processor performance.
According to various embodiments, speculative probing is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for sending probes to nodes associated with cache blocks before a request associated with the probes is received at a serialization point is referred to herein as speculative probing.
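The idea of probing local nodes before the request reaches the serialization point can be sketched as below. All names (`LocalNode`, `CacheCoherenceController`, `handle_request`) are assumptions introduced for illustration, and the completion test is deliberately simplified; the patent's controller applies a richer, state-dependent check.

```python
# Hedged sketch of speculative probing: the cache coherence controller probes
# local nodes *before* forwarding the request toward the home cluster's
# serialization point, so local probe results are in hand early.

class LocalNode:
    def __init__(self, lines):
        self.lines = lines  # memory lines this node caches in a valid state

    def probe(self, line):
        # A probe elicits a response used to maintain cache coherency.
        return line in self.lines

class CacheCoherenceController:
    def __init__(self, local_nodes):
        self.local_nodes = local_nodes

    def handle_request(self, line):
        # 1. Speculatively probe the local nodes right away.
        responses = [node.probe(line) for node in self.local_nodes]
        # 2. If a local node can satisfy the request, complete early.
        if any(responses):
            return "completed locally"
        # 3. Otherwise forward the (delayed) request to the home cluster.
        return "forwarded to home cluster"

ccc = CacheCoherenceController([LocalNode({0x10}), LocalNode({0x20})])
assert ccc.handle_request(0x10) == "completed locally"
assert ccc.handle_request(0x30) == "forwarded to home cluster"
```

The benefit is latency overlap: when the transaction cannot complete locally, the local probe work has already been done by the time the request is forwarded, and when it can complete locally, the intercluster round trip is avoided entirely.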
Techniques of the present invention recognize that the reordering or elimination of certain data access requests does not adversely affect cache coherency. That is, the end value in the cache is the same whether or not snooping occurs. For example, a local processor attempting to read the cache data block can be allowed to access the data block without sending the requests through a serialization point in certain circumstances. In one example, read access can be permitted when the cache block is valid and the associated memory line is not locked. The techniques of the present invention provide mechanisms for determining when speculative
`
probing can be performed and also provide mechanisms for determining when speculative probing can be completed without sending a request through a serialization point. Speculative probing will be described in greater detail below. By completing a data access transaction within a local cluster, the delay associated with transactions in a multiple cluster system can be reduced or eliminated.
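The state-dependent check for local completion, outlined in the Figure 10 flow, can be sketched as follows. The policy per state shown here is an assumption in MOESI style rather than a verbatim transcription of the patent's flow: reads are assumed to complete locally from any valid copy, writes only with exclusive ownership.

```python
# Hedged sketch of the local-completion decision (cf. the Figure 10 flow).
# ASSUMPTION: MOESI-style policy; the patent's exact per-state rules may differ.

def can_complete_locally(cache_state, access_type):
    valid_states = {"shared", "owned", "exclusive", "modified"}
    if cache_state not in valid_states:
        return False                      # no valid local copy: go remote
    if access_type == "read":
        return True                       # any valid copy can service a read
    # Writes require exclusive ownership of the memory line.
    return cache_state in {"exclusive", "modified"}

assert can_complete_locally("shared", "read") is True
assert can_complete_locally("shared", "write") is False
assert can_complete_locally("modified", "write") is True
assert can_complete_locally("invalid", "read") is False
```

Under this check, the controller forwards the request to the home cluster's serialization point only when the answer is false, which is exactly the delayed-request path described above.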
To allow even more efficient speculative probing, the techniques of the present invention also provide mechanisms for handling transactions that may result from speculatively probing a local node before locking a particular memory line. In one example, a cache coherence protocol used in a point-to-point architecture may not allow for speculative probing. Nonetheless, mechanisms are provided to allow various nodes such as processors and memory controllers to continue operations within the cache coherence protocol without knowing that any protocol variations have occurred.
FIG. 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a–f. In one embodiment, the multiple processors in the multiple cluster architecture shown in FIG. 1A share the same memory space. In this example, the point-to-point links 111a–f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point coherence protocol.
FIG. 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a–d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in FIG. 1A is expanded using a switch 131 as shown in FIG. 1B.
FIG. 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in FIG. 1A. Cluster 200 includes processors 202a–202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a–206d, point-to-point communication links 208a–208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a–202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a–202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in FIG. 2 by links 214a–214f. It should be noted that other interfaces are supported. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220.
According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor, which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections between which are
`
`
`
defined therein. The service processor and computer system partitioning are described in U.S. patent application Ser. No. 09/932,456, titled Computer System Partitioning Using Data Transfer Routing Mechanism, filed on Aug. 16, 2001, the entirety of which is incorporated by reference for all purposes.
The processors 202a–d are also coupled to a cache coherence controller 230 through point-to-point links 232a–d. Any mechanism or apparatus that can be used to provide communication between multiple processor clusters while maintaining cache coherence is referred to herein as a cache coherence controller. The cache coherence controller 230 can be coupled to cache coherence controllers associated with other multiprocessor clusters. It should be noted that there can be more than one cache coherence controller in one cluster. The cache coherence controller 230 communicates with both processors 202a–d as well as remote clusters using a point-to-point protocol.
More generally, it should be understood that the specific architecture shown in FIG. 2 is merely exemplary and that embodiments of the present invention are contemplated having different configurations and resource interconnections, and a variety of alternatives for each of the system resources shown. However, for purposes of illustration, specific details of server 200 will be assumed. For example, most of the resources shown in FIG. 2 are assumed to reside on a single electronic assembly. In addition, memory banks 206a–206d may comprise double data rate (DDR) memory which is physically provided as dual in-line memory modules (DIMMs). I/O adapter 216 may be, for example, an ultra direct memory access (UDMA) controller or a small computer system interface (SCSI) controller which provides access to a permanent storage device. I/O adapter 220 may be an Ethernet card adapted to provide communications with a network such as, for example, a local area network (LAN) or the Internet.
According to a specific embodiment and as shown in FIG. 2, both of I/O adapters 216 and 220 provide symmetric I/O access. That is, each provides access to equivalent sets of I/O. As will be understood, such a configuration would facilitate a partitioning scheme in which multiple partitions have access to the same types of I/O. However, it should also be understood that embodiments are envisioned in which partitions without I/O are created. For example, a partition including one or more processors and associated memory resources, i.e., a memory complex, could be created for the purpose of testing the memory complex.
According to one embodiment, service processor 212 is a Motorola MPC855T microprocessor which includes integrated chipset functions. The cache coherence controller 230 can be an Application Specific Integrated Circuit (ASIC) supporting the local point-to-point coherence protocol. The cache coherence controller 230 can also be configured to handle a non-coherent protocol to allow communication with I/O devices. In one embodiment, the cache coherence controller 230 is a specially configured programmable chip such as a programmable logic device or a field programmable gate array.
FIG. 3 is a diagrammatic representation of one example of a cache coherence controller 230. The cache coherence controller can include a protocol engine 305 configured to handle packets such as probes and requests received from processors in various clusters of a multiprocessor system. The functionality of the protocol engine 305 can be partitioned across several engines to improve performance. In one example, partitioning can be done based on individual transaction flows, packet type (request, probe and
`
response), direction (incoming and outgoing), or transaction flow (request flows, probe flows, etc.).
The protocol engine 305 has access to a pending buffer 309 that allows the cache coherence controller to track transactions such as recent requests and probes and associated