US007107409B2

(12) United States Patent                         (10) Patent No.:      US 7,107,409 B2
     Glasco                                       (45) Date of Patent:  *Sep. 12, 2006

(54) METHODS AND APPARATUS FOR SPECULATIVE PROBING AT A REQUEST CLUSTER

(75) Inventor:  David B. Glasco, Austin, TX (US)

(73) Assignee:  Newisys, Inc., Austin, TX (US)

( * ) Notice:   Subject to any disclaimer, the term of this patent is extended or
                adjusted under 35 U.S.C. 154(b) by 459 days.

                This patent is subject to a terminal disclaimer.

(21) Appl. No.: 10/106,426

(22) Filed:     Mar. 22, 2002

(65) Prior Publication Data

(51) Int. Cl.
     G06F 12/08   (2006.01)
     G06F 12/16   (2006.01)

(52) U.S. Cl. ........ 711/141; 711/118; 711/120; 711/128; 711/130; 711/146;
     711/147; 711/148; 711/135

(58) Field of Classification Search ........ 711/141-146, 711/130, 147-149,
     117-118, 135, 136, 119-122
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,195,089 A *    3/1993   Sindhu et al. .............. 370/235
     5,958,019 A      9/1999   Hagersten et al.
     6,067,603 A *    5/2000   Carpenter et al. ........... 711/141
     6,141,692 A     10/2000   Loewenstein et al.
     6,162,492 A     12/2000   Keller et al.
     6,292,705 B1 *   9/2001   Wang et al. ................ 711/141
     6,338,122 B1 *   1/2002   Baumgartner et al. ......... 711/141
     6,374,331 B1 *   4/2002   Janakiraman et al. ......... 711/154
     6,385,705 B1     5/2002   Keller et al.
     6,490,661 B1    12/2002   Keller et al. .............. 711/150
     6,615,319 B1 *   9/2003   Khare et al. ............... 711/141
     6,631,448 B1    10/2003   Weber
     6,633,945 B1 *  10/2003   Fu et al. .................. 710/316
     6,754,782 B1 *   6/2004   Arimilli et al. ............ 711/144
     6,760,319 B1 *   7/2004   Dhong et al. ............... 711/146
     2002/0053004 A1  5/2002   Pong

     OTHER PUBLICATIONS

     HyperTransport I/O Link Specification Revision 1.03, HyperTransport
     Consortium, Oct. 10, 2001, Copyright (C) 2001 HyperTransport Technology
     Consortium.
     U.S. Appl. No. 10/106,299, filed Mar. 22, 2002, Office Action mailed
     Nov. 21, 2005.
     U.S. Appl. No. 10/145,433, filed May 13, 2002, Office Action mailed
     Nov. 21, 2005.
     U.S. Appl. No. 10/106,430, Office Action dated Nov. 2, 2005.

     * cited by examiner

Primary Examiner: Matthew Kim
Assistant Examiner: Zhuo H. Li
(74) Attorney, Agent, or Firm: Beyer Weaver & Thomas, LLP

(57) ABSTRACT

According to the present invention, methods and apparatus are provided for
increasing the efficiency of data access in a multiple processor, multiple
cluster system. A cache coherence controller associated with a first cluster
of processors can determine whether speculative probing at a first cluster can
be performed to improve overall transaction efficiency. Intervening requests
from a second cluster can be handled using information from the speculative
probe at the first cluster.

52 Claims, 15 Drawing Sheets

[Front-page representative drawing]
U.S. Patent          Sep. 12, 2006          Sheet 1 of 15          US 7,107,409 B2

Figure 1A  [Drawing: processing clusters 101, 103, 105, and 107 interconnected by point-to-point links, including links 111c and 111d]

Figure 1B  [Drawing: processing clusters 121, 123, 125, and 127 coupled to switch 131 through point-to-point links, including links 141c and 141d]
U.S. Patent          Sep. 12, 2006          Sheet 2 of 15          US 7,107,409 B2

Figure 2  [Drawing: multiple processor cluster]
U.S. Patent          Sep. 12, 2006          Sheet 3 of 15          US 7,107,409 B2

Figure 3  [Drawing: cache coherence controller, showing coherent and noncoherent interfaces]
U.S. Patent          Sep. 12, 2006          Sheet 4 of 15          US 7,107,409 B2

Figure 4  [Drawing: transaction flow for a data access request from a processor in a single cluster]
U.S. Patent          Sep. 12, 2006          Sheet 5 of 15          US 7,107,409 B2

Figure 5A  [Drawing: cache coherence controller functionality]
U.S. Patent          Sep. 12, 2006          Sheet 6 of 15          US 7,107,409 B2

Figure 5B  [Drawing: cache coherence controller functionality]
U.S. Patent          Sep. 12, 2006          Sheet 7 of 15          US 7,107,409 B2

Figure 5C  [Drawing: cache coherence controller functionality]
U.S. Patent          Sep. 12, 2006          Sheet 8 of 15          US 7,107,409 B2

Figure 5D  [Drawing: cache coherence controller functionality]
U.S. Patent          Sep. 12, 2006          Sheet 9 of 15          US 7,107,409 B2

Figure 6  [Drawing: transaction flow for a data access request transmitted to a home cache coherency controller]
U.S. Patent          Sep. 12, 2006          Sheet 10 of 15          US 7,107,409 B2

Figure 7  [Drawing: transaction flow for speculative probing at a request cluster, showing request, home, and remote clusters]
U.S. Patent          Sep. 12, 2006          Sheet 11 of 15          US 7,107,409 B2

Figure 8  [Flowchart: handling of intervening requests. Blocks legible in the drawing:]

801   Identify memory line associated with a request from a request cluster processor
803   Can speculative probing be performed? (No: proceed without speculative probing)
805   Proceed with speculative probing
      Intervening request associated with request cluster?
809   Provide probe information to intervening processor
823   Proceed without speculative probing
811   Provide probe information to request cluster processor
815   Wait for responses
813   End
U.S. Patent          Sep. 12, 2006          Sheet 12 of 15          US 7,107,409 B2

Figure 9  [Drawing: transaction flow for speculative probing with delayed request]
U.S. Patent          Sep. 12, 2006          Sheet 13 of 15          US 7,107,409 B2

Figure 10  [Flowchart: determining whether a data access request can complete locally. Blocks legible in the drawing:]

1001  Identify cache state in controller
1003  Cache state = shared?
1005  Cache state = owned?
1007  Cache state = exclusive?
1009  Cache state = modified?
1015  Types of access requested? (read / write)
1017  Can complete transaction locally
1011  Can not complete transaction locally
      End
U.S. Patent          Sep. 12, 2006          Sheet 14 of 15          US 7,107,409 B2

Figure 11  [Drawing: transaction flow for speculative probing with early request]
U.S. Patent          Sep. 12, 2006          Sheet 15 of 15          US 7,107,409 B2

Figure 12  [Flowchart: maintenance of transaction information. Blocks legible in the drawing:]

1201  Allocate transaction identifier
1203  Probe local and remote clusters
1205  Local transaction completes
1207  Maintain transaction identifier
1209  All remote probe responses received?
1211  Clear transaction identifier
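The transaction-identifier lifecycle laid out in Figure 12 can be sketched in Python. This is an illustrative reading of the flowchart only; the class and method names (`TransactionTracker`, `allocate`, `remote_response`) are invented for illustration and do not appear in the patent.

```python
# Illustrative sketch of the Figure 12 flow: a transaction identifier is
# allocated when probes are sent (1201/1203), survives local completion
# (1205/1207), and is cleared only after every remote probe response has
# arrived (1209/1211). All names here are invented for illustration.

class TransactionTracker:
    def __init__(self):
        self.active = {}    # transaction id -> clusters with responses outstanding
        self.next_id = 0

    def allocate(self, remote_clusters):          # block 1201
        tid = self.next_id
        self.next_id += 1
        self.active[tid] = set(remote_clusters)   # block 1203: probes outstanding
        return tid

    def remote_response(self, tid, cluster):      # blocks 1207/1209
        self.active[tid].discard(cluster)
        if not self.active[tid]:                  # all remote responses received
            del self.active[tid]                  # block 1211: clear identifier
            return True
        return False

tracker = TransactionTracker()
tid = tracker.allocate({"home", "remote1"})
# The local transaction completes here (block 1205); the identifier is kept.
assert tracker.remote_response(tid, "home") is False
assert tracker.remote_response(tid, "remote1") is True   # identifier cleared
```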
METHODS AND APPARATUS FOR SPECULATIVE PROBING AT A REQUEST CLUSTER

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to concurrently filed U.S. application Ser. No. 10/106,430, entitled METHODS AND APPARATUS FOR SPECULATIVE PROBING WITH EARLY COMPLETION AND DELAYED REQUEST, and to concurrently filed U.S. application Ser. No. 10/106,299, entitled METHODS AND APPARATUS FOR SPECULATIVE PROBING WITH EARLY COMPLETION AND EARLY REQUEST, the disclosures of which are incorporated by reference herein for all purposes.
`
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.

2. Description of Related Art

Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache can be on-chip or off-chip. Each cache block can be read or written by the processor. However, cache coherency problems can arise because multiple copies of the same data can coexist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the caches of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.
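The divergence described above can be made concrete with a minimal sketch. This is not any mechanism from the patent; it simply models two protocol-free caches so the hazard is visible (the `Cache` class and line names are invented for illustration):

```python
# Minimal sketch of the coherency hazard: two per-processor caches each hold
# a copy of the same memory line. With no coherency protocol, independent
# writes diverge, and memory cannot know which value to accept.

memory = {"line0": 0}

class Cache:
    def __init__(self):
        self.blocks = {}

    def load(self, line):           # fill the cache block from system memory
        self.blocks[line] = memory[line]

    def write(self, line, value):   # the write hits only this cache
        self.blocks[line] = value

cpu0, cpu1 = Cache(), Cache()
cpu0.load("line0")
cpu1.load("line0")

cpu0.write("line0", 42)    # both processors write the shared line
cpu1.write("line0", 99)    # "at the same time"

# Two different values now coexist for a single memory line:
print(cpu0.blocks["line0"], cpu1.blocks["line0"])   # 42 99
```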
A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache. Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks or snoops on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.

Bus arbitration can be used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic can decide which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processor, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems, each with its own address space, can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication, including transactions such as sends and receives, are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication, including transactions such as loads and stores, are referred to herein as using a single address space.

Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.

Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.

Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple clusters of multiple processors connected using point-to-point links.
`
SUMMARY OF THE INVENTION

According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. A cache coherence controller associated with a first cluster of processors can determine whether speculative local probing at a first cluster can be performed to improve overall transaction efficiency. Intervening requests from a second cluster can be handled using information from the speculative probe at the first cluster.

According to specific embodiments, a computer system is provided. A first cluster includes a first plurality of processors and a first cache coherence controller. The first plurality of processors and the first cache coherence controller are interconnected in a point-to-point architecture. A second cluster includes a second plurality of processors and a second cache coherence controller. The second plurality of processors and the second cache coherence controller are interconnected in a point-to-point architecture. The first cache coherence controller is coupled to the second cache coherence controller. The first cache coherence controller is configured to receive a cache access request originating from the first plurality of processors and send a probe to the first plurality of processors in the first cluster before the cache access request is received by a serialization point in the second cluster.

In one embodiment, the serialization point is a memory controller in the second cluster. The probe can be associated with the memory line corresponding to the cache access request. The first cache coherence controller can be further configured to respond to the probe originating from the second cluster using information obtained from the probe of the first plurality of processors. The first cache coherence controller can also be associated with a pending buffer.

According to another embodiment, a cache coherence controller is provided. The cache coherence controller includes interface circuitry coupled to a plurality of local processors in a local cluster and a non-local cache coherence controller in a non-local cluster. The plurality of local processors are arranged in a point-to-point architecture. The cache coherence controller can also include a protocol engine coupled to the interface circuitry. The protocol engine can be configured to receive a cache access request from a first processor in the local cluster and speculatively probe a local node.

According to another embodiment, a method for a cache coherence controller to manage data access in a multiprocessor system is provided. A cache access request is received from a local processor associated with a local cluster of processors connected through a point-to-point architecture. It is determined if speculative probing of a local node associated with a cache can be performed before forwarding the cache request to a non-local cache coherence controller. The non-local cache coherence controller is associated with a remote cluster of processors connected through a point-to-point architecture. The remote cluster of processors shares an address space with the local cluster of processors.
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.

FIGS. 1A and 1B are diagrammatic representations depicting a system having multiple clusters.
FIG. 2 is a diagrammatic representation of a cluster having a plurality of processors.
FIG. 3 is a diagrammatic representation of a cache coherence controller.
FIG. 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.
FIGS. 5A-5D are diagrammatic representations showing cache coherence controller functionality.
FIG. 6 is a diagrammatic representation depicting a transaction flow for a data access request from a processor transmitted to a home cache coherency controller.
FIG. 7 is a diagrammatic representation showing a transaction flow for speculative probing at a request cluster.
FIG. 8 is a process flow diagram depicting the handling of intervening requests.
FIG. 9 is a diagrammatic representation showing a transaction flow for speculative probing with delayed request.
FIG. 10 is a process flow diagram depicting the determination of whether a data access request can complete locally.
FIG. 11 is a diagrammatic representation showing a transaction flow for speculative probing with early request.
FIG. 12 is a process flow diagram depicting the maintenance of transaction information.
`
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to some specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multiprocessor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well known process operations have not been described in detail in order not to unnecessarily obscure the present invention.

Techniques are provided for increasing data access efficiency in a multiple processor, multiple cluster system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters are used.
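The link growth described above can be quantified with a full-mesh count. The clustered figure below assumes fully meshed clusters whose cache coherence controllers are themselves meshed together; that topology is an illustrative assumption for the sketch, not the patent's specific interconnect:

```python
# Link counts for one flat full-mesh point-to-point fabric versus the same
# processors split into fully meshed clusters, each with one cache coherence
# controller, controllers meshed across clusters. Illustrative topology only.

def full_mesh_links(n):
    # a full mesh of n nodes needs n*(n-1)/2 point-to-point links
    return n * (n - 1) // 2

def clustered_links(clusters, per_cluster):
    # inside each cluster: mesh of the processors plus their controller
    intra = clusters * full_mesh_links(per_cluster + 1)
    # between clusters: mesh of the controllers only
    inter = full_mesh_links(clusters)
    return intra + inter

print(full_mesh_links(16))     # 120 links for a flat 16-processor mesh
print(clustered_links(4, 4))   # 4*10 + 6 = 46 links for 4 clusters of 4
```

The comparison shows why clustering modularizes the system: the intercluster mesh involves only the controllers, so far fewer links are needed as processor count grows.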
According to various embodiments, the multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.

By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.

In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests is referred to herein as a serialization point. One example of a serialization point is a memory controller.
Various processors in the single cluster system send data access requests to the memory controller. The memory controller can be configured to serialize the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
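The per-line locking behavior of a serialization point can be sketched as follows. This is a minimal illustration of the blocking rule described above, not the patent's memory controller; the `MemoryController` class and its methods are invented names:

```python
# Sketch of a serialization point: a memory controller admits only one
# outstanding data access per memory line, rejecting later attempts until
# the line is unlocked. Illustrative only.

class MemoryController:
    def __init__(self):
        self.locked = set()

    def request(self, line):
        if line in self.locked:
            return False           # another access in flight: attempt blocked
        self.locked.add(line)      # line locked for the duration of the access
        return True

    def complete(self, line):
        self.locked.discard(line)  # unlock: a blocked requester may now retry

mc = MemoryController()
assert mc.request("line0") is True    # first access wins the line
assert mc.request("line0") is False   # concurrent access to the line blocks
mc.complete("line0")
assert mc.request("line0") is True    # line unlocked, access proceeds
```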
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster, such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.

Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a single second. Any delay can adversely impact processor performance.

According to various embodiments, speculative probing is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for sending probes to nodes associated with cache blocks before a request associated with the probes is received at a serialization point is referred to herein as speculative probing.
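The ordering that makes probing "speculative" can be sketched as an event sequence: local probes are issued before the request ever reaches the home cluster's serialization point. The function and event names below are invented for illustration and are not the patent's protocol:

```python
# Sketch of speculative probing: the request cluster's coherence controller
# probes its local nodes ahead of the serialization point, so local probe
# responses are in hand by the time the serialized request completes.
# Event names are invented for illustration.

def handle_request(request, local_nodes, forward_to_home):
    events = []
    # speculative step: probe local cache blocks before serialization
    for node in local_nodes:
        events.append(("probe", node, request["line"]))
    # the request is forwarded toward the home cluster, where it will
    # eventually reach the serialization point
    events.append(forward_to_home(request))
    return events

log = handle_request(
    {"line": "0x40"},
    ["cpu0", "cpu1"],
    lambda r: ("forward_home", r["line"]),
)
assert log[0] == ("probe", "cpu0", "0x40")    # local probes issued first
assert log[-1] == ("forward_home", "0x40")    # serialization happens later
```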
Techniques of the present invention recognize that the reordering or elimination of certain data access requests does not adversely affect cache coherency. That is, the end value in the cache is the same whether or not snooping occurs. For example, a local processor attempting to read the cached data block can be allowed to access the data block without sending the request through a serialization point in certain circumstances. In one example, read access can be permitted when the cache block is valid and the associated memory line is not locked. The techniques of the present invention provide mechanisms for determining when speculative probing can be performed and also provide mechanisms for determining when speculative probing can be completed without sending a request through a serialization point. Speculative probing will be described in greater detail below. By completing a data access transaction within a local cluster, the delay associated with transactions in a multiple cluster system can be reduced or eliminated.
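The local-completion test described above, together with the cache-state decisions laid out in FIG. 10, can be sketched as a small predicate. The state-to-permission mapping below is an illustrative reading of the figure (any valid copy satisfies a read; a write needs exclusive ownership), not a normative protocol definition:

```python
# Sketch of the "can this transaction complete locally?" decision: a read
# completes locally when the speculative probe finds the block in any valid
# state; a write is assumed here to need exclusive or modified ownership.
# Illustrative mapping only, not the patent's protocol.

def can_complete_locally(cache_state, access):
    valid = {"shared", "owned", "exclusive", "modified"}
    if cache_state not in valid:
        return False                     # invalid block: must go remote
    if access == "read":
        return True                      # any valid copy satisfies a read
    # a write without ownership must still pass the serialization point
    return cache_state in {"exclusive", "modified"}

assert can_complete_locally("shared", "read") is True
assert can_complete_locally("shared", "write") is False
assert can_complete_locally("modified", "write") is True
assert can_complete_locally("invalid", "read") is False
```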
To allow even more efficient speculative probing, the techniques of the present invention also provide mechanisms for handling transactions that may result from speculatively probing a local node before locking a particular memory line. In one example, a cache coherence protocol used in a point-to-point architecture may not allow for speculative probing. Nonetheless, mechanisms are provided to allow various nodes such as processors and memory controllers to continue operations within the cache coherence protocol without knowing that any protocol variations have occurred.
FIG. 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a-f. In one embodiment, the multiple processors in the multiple cluster architecture shown in FIG. 1A share the same memory space. In this example, the point-to-point links 111a-f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point coherence protocol.

FIG. 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a-d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in FIG. 1A is expanded using a switch 131 as shown in FIG. 1B.
FIG. 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in FIG. 1A. Cluster 200 includes processors 202a-202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a-206d, point-to-point communication links 208a-208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a-202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a-202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in FIG. 2 by links 214a-214f. It should be noted that other interfaces are supported. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220.

According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor, which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections between which are defined therein. The service processor and computer system partitioning are described in patent application Ser. No. 09/932,456, titled Computer System Partitioning Using Data Transfer Routing Mechanism, filed on Aug. 16, 2001, the entirety of which is incorporated by reference for all purposes.
The processors 202a-d are also coupled to a cache coherence controller 230 through point-to-point links 232a-d. Any mechanism or apparatus that can be used to provide communication between multiple processor clusters while maintaining cache coherence is referred to herein as a cache coherence controller. The cache coherence controller 230 can be coupled to cache coherence controllers associated with other multiprocessor clusters. It should be noted that there can be more than one cache coherence controller in one cluster. The cache coherence controller 230 communicates with both processors 202a-d as well as remote clusters using a point-to-point protocol.
More generally, it should be understood that the specific architecture shown in FIG. 2 is merely exemplary and that embodiments of the present invention are contemplated having different configurations and resource interconnections, and a variety of alternatives for each of the system resources shown. However, for purposes of illustration, specific details of server 200 will be assumed. For example, most of the resources shown in FIG. 2 are assumed to reside on a single electronic assembly. In addition, memory banks 206a-206d may comprise double data rate (DDR) memory which is physically provided as dual in-line memory modules (DIMMs). I/O adapter 216 may be, for example, an ultra direct memory access (UDMA) controller or a small computer system interface (SCSI) controller which provides access to a permanent storage device. I/O adapter 220 may be an Ethernet card adapted to provide communications with a network such as, for example, a local area network (LAN) or the Internet.

According to a specific embodiment and as shown in FIG. 2, both of I/O adapters 216 and 220 provide symmetric I/O access. That is, each provides access to equivalent sets of I/O. As will be understood, such a configuration would facilitate a partitioning scheme in which multiple partitions have access to the same types of I/O. However, it should also be understood that embodiments are envisioned in which partitions without I/O are created. For example, a partition including one or more processors and associated memory resources, i.e., a memory complex, could be created for the purpose of testing the memory complex.

According to one embodiment, service processor 212 is a Motorola MPC855T microprocessor which includes integrated chipset functions. The cache coherence controller 230 can be an Application Specific Integrated Circuit (ASIC) supporting the local point-to-point coherence protocol. The cache coherence controller 230 can also be configured to handle a non-coherent protocol to allow communication with I/O devices. In one embodiment, the cache coherence controller 230 is a specially configured programmable chip such as a programmable logic device or a field programmable gate array.
FIG. 3 is a diagrammatic representation of one example of a cache coherence controller 230. The cache coherence controller can include a protocol engine 305 configured to handle packets such as probes and requests received from processors in various clusters of a multiprocessor system. The functionality of the protocol engine 305 can be partitioned across several engines to improve performance. In one example, partitioning can be done based on individual transaction flows, packet type (request, probe, and response), direction (incoming and outgoing), or transaction flow (request flows, probe flows, etc.).
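One of the partitioning options mentioned above, splitting the protocol engine by packet type, can be sketched as a dispatch table. The table and the engine callables below are invented for illustration; the patent does not specify this structure:

```python
# Sketch of partitioning protocol-engine work by packet type: each incoming
# packet is routed to a request, probe, or response engine. Illustrative
# dispatch structure only.

def make_dispatcher(engines):
    def dispatch(packet):
        engine = engines.get(packet["type"])
        if engine is None:
            raise ValueError("unknown packet type: %r" % packet["type"])
        return engine(packet)
    return dispatch

handled = []
dispatch = make_dispatcher({
    "request":  lambda p: handled.append(("request-engine", p["line"])),
    "probe":    lambda p: handled.append(("probe-engine", p["line"])),
    "response": lambda p: handled.append(("response-engine", p["line"])),
})

dispatch({"type": "probe", "line": "0x80"})
assert handled == [("probe-engine", "0x80")]
```

Partitioning this way lets independent engines work on request, probe, and response traffic concurrently, which is the performance motivation the text gives.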
The protocol engine 305 has access to a pending buffer 309 that allows the cache coherence controller to track transactions such as recent requests and probes and associate the transactions with specific processors. Transaction information maintained in the pending buffer 309 can include transaction destination nodes, the addresses of requests for subsequent collision de