`
(12) United States Patent
`Glasco
`
(10) Patent No.: US 7,395,379 B2
(45) Date of Patent: *Jul. 1, 2008
`
`(54) METHODS AND APPARATUS FOR
RESPONDING TO A REQUEST CLUSTER
`
6,351,791 B1 * 2/2002 Freerksen et al. ........ 711/146
6,374,331 B1 * 4/2002 Janakiraman et al. ...... 711/141
`
(75) Inventor: David B. Glasco, Austin, TX (US)
`
6,385,705 B1 5/2002 Keller et al. ............ 711/154
6,490,661 B1 12/2002 Keller et al. ........... 711/154
`
(73) Assignee: Newisys, Inc., Austin, TX (US)
`
`(*) Notice:
`
Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 862 days.
`
6,615,319 B2 * 9/2003 Khare et al. ............ 711/141
6,631,401 B1 * 10/2003 Keller et al. ........... 709/213
6,631,448 B2 * 10/2003 Weber .................. 711/141
6,633,945 B1 * 10/2003 Fu et al. ............... 710/316
`
This patent is subject to a terminal disclaimer.
`
(21) Appl. No.: 10/145,439
`
(22) Filed: May 13, 2002
`
`(65)
`
`Prior Publication Data
`US 2003/0212741 A1
`Nov. 13, 2003
`
(51) Int. Cl.
G06F 12/00    (2006.01)
(52) U.S. Cl. .......... 711/146; 711/141; 711/144; 711/147; 711/148
(58) Field of Classification Search .......... 711/141, 711/146-148, 144
See application file for complete search history.
(56) References Cited

U.S. PATENT DOCUMENTS
5,195,089 A 3/1993 Sindhu et al.
5,659,710 A * 8/1997 Sherman et al. .......... 711/146
5,893,151 A 4/1999 Merchant
5,958,019 A * 9/1999 Hagersten et al. ........ 713/375
5,966,729 A 10/1999 Phelps
6,038,644 A 3/2000 Irie et al.
6,067,603 A 5/2000 Carpenter et al.
6,141,692 A 10/2000 Loewenstein et al.
6,167,492 A 12/2000 Keller et al. ........... 711/154
6,292,705 B1 9/2001 Wang et al.
6,295,583 B1 9/2001 Razdan et al.
6,336,169 B1 * 1/2002 Arimilli et al. ......... 711/144
6,338,122 B1 * 1/2002 Baumgartner et al. ...... 711/141
`
`(Continued)
`
`OTHER PUBLICATIONS
`
Alan Charlesworth, "Starfire: Extending the SMP Envelope", published in Feb. 1998 by IEEE, pp. 39-49.*
`
(Continued)

Primary Examiner–Sanjiv Shah
Assistant Examiner–Zhuo H Li
(74) Attorney, Agent, or Firm–Weaver Austin Villeneuve & Sampson
`
`(57)
`
`ABSTRACT
`
According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. A home cluster of processors receives a cache access request from a request cluster. The home cluster includes mechanisms for instructing probed remote clusters to respond to the request cluster instead of to the home cluster. The home cluster can also include mechanisms for reducing the number of probes sent to remote clusters. Techniques are also included for providing the requesting cluster with information to determine the number of responses to be transmitted to the requesting cluster as a result of the reduction in the number of probes sent at the home cluster.
`
`35 Claims, 15 Drawing Sheets
`
[Front-page figure: CPU 1001-1 in Request Cluster 1000, with Home Cluster 1020, Remote Cluster 1040, and Remote Cluster 1060]
`
`
`
U.S. PATENT DOCUMENTS
`
6,728,843 B1 * 4/2004 Pong et al. ............. 711/150
6,754,782 B2 6/2004 Arimilli et al.
6,760,819 B2 7/2004 Dhong et al.
6,799,252 B1 9/2004 Bauman
6,839,808 B2 * 1/2005 Gruner et al. ........... 711/130
6,973,543 B1 12/2005 Hughes
2002/0053004 A1 * 5/2002 Pong ................... 711/119
2003/0095557 A1 5/2003 Keller et al.
`
`OTHER PUBLICATIONS
`
`1. 03,
`HyperTransportTM I/O Link Speci?cation Revision
`HyperTranspoItTM ConsoItium, Oct. 10, 2001, Copyright © 2001
`HyperTranspoIt Technology ConsoItium.
U.S. Appl. No. 10/106,426, Office Action dated Sep. 22, 2004.
U.S. Appl. No. 10/106,426, Office Action dated Mar. 7, 2005.
U.S. Appl. No. 10/106,426, Office Action dated Jul. 21, 2005.
U.S. Appl. No. 10/106,426, Office Action dated Nov. 21, 2005.
U.S. Appl. No. 10/106,430, Office Action dated Sep. 23, 2004.
U.S. Appl. No. 10/106,430, Office Action dated Mar. 10, 2005.
U.S. Appl. No. 10/106,430, Office Action dated Jul. 21, 2005.
U.S. Appl. No. 10/106,430, Office Action dated Nov. 2, 2005.
U.S. Appl. No. 10/106,299, Office Action dated Sep. 22, 2004.
U.S. Appl. No. 10/106,299, Office Action dated Mar. 10, 2005.
U.S. Appl. No. 10/106,299, Office Action dated Jul. 21, 2005.
U.S. Appl. No. 10/106,299, Office Action dated Nov. 21, 2005.
U.S. Appl. No. 10/145,438, Office Action dated Nov. 21, 2005.
U.S. Appl. No. 10/106,426, filed Mar. 22, 2002, Notice of Allowance, mailed Apr. 21, 2006.
U.S. Appl. No. 10/106,426, filed Mar. 22, 2002, Allowed claims.
U.S. Appl. No. 10/106,430, filed Mar. 22, 2002, Notice of Allowance mailed Apr. 21, 2006.
U.S. Appl. No. 10/106,430, filed Mar. 22, 2002, Allowed claims.
U.S. Appl. No. 10/106,299, filed Mar. 22, 2002, Notice of Allowance mailed Apr. 28, 2006.
U.S. Appl. No. 10/106,299, filed Mar. 22, 2002, Allowed claims.
U.S. Appl. No. 10/145,438, filed May 13, 2002, Office Action mailed Jun. 20, 2007.
U.S. Appl. No. 10/145,438, filed May 13, 2002, Office Action mailed Mar. 9, 2007.
U.S. Appl. No. 10/145,438, filed May 13, 2002, Office Action mailed Aug. 22, 2006.
U.S. Appl. No. 10/145,438, filed May 13, 2002, Office Action mailed May 4, 2006.
U.S. Appl. No. 10/145,438, filed May 13, 2002, Office Action mailed Nov. 21, 2005.
Alan Charlesworth, "Starfire: Extending the SMP Envelope", published in Feb. 1998 by IEEE, pp. 39-49.
U.S. Appl. No. 10/145,438, Final Office Action mailed Nov. 28, 2007.
`
`* cited by examiner
`
`
`
U.S. Patent    Jul. 1, 2008    Sheet 1 of 15    US 7,395,379 B2

[Figure 1A: Processing Clusters 101, 103, 105, and 107 interconnected through point-to-point links 111a-111f]

[Figure 1B: Processing Clusters 121, 123, 125, and 127 coupled to a switch through point-to-point links 141a-141d]
`
`
`
[Figure 2: a multiple processor cluster, including processors, a memory subsystem, an I/O switch, a service processor, and a cache coherence controller]
`
`
[Figure 3: a cache coherence controller]
`
`
[Figure 4: transaction flow for a data access request from a processor in a single cluster]
`
`
`
[Figure 5A: cache coherence controller functionality]
`
`
`
[Figure 5B: cache coherence controller functionality]
`
`
`
[Figure 5C: cache coherence controller functionality]
`
`
`
[Figure 5D: cache coherence controller functionality]
`
`
`
[Figure 6: transaction flow for a remote cluster sending a probe response to a home cluster]
`
`
`
[Figure 7: transaction flow for a remote cluster sending a probe response to a requesting cluster]
`
`
`
[Figure 8: flow process diagram, "Tag Handling At A Home Cache Coherence Controller"]
Receive Cache Access Request
803: Generate New Tag
805: Maintain Tag In Pending Buffer
Forward Request To Serialization Point
Receive Probe From Serialization Point
813: Probe Resulting From Locally Generated Request?
  Yes: Use Newly Generated Tag From Home Cluster Tag Space
  No: Use Tag Corresponding To Tag From Request Cluster
Broadcast Probes To Nonlocal Clusters With Selected Tag Information
`
`
`
[Figure 9: flow process diagram, "Tag Handling At A Request Cache Coherence Controller"]
901: Send Cache Access Request To Home Cluster
903: Receive Probe From Home Cluster
905: Probe Local Nodes
907: Receive A Plurality Of Probe Responses
909: Signal Processor Associated With The Request After Expected Probe Responses Are Received
`
`
`
[Figure 10: transaction flow for a remote cluster sending a probe response to a requesting cluster (Request Cluster 1000, Home Cluster 1020, Remote Clusters 1040 and 1060)]
`
`
`
[Figure 11: flow process diagram, "Tag Management Before Probe Transmission" in a system with a coherence directory]
1101: Receive Cache Access Request
1105: Maintain Tag In Pending Buffer
1107: Forward Request To Serialization Point
1111: Receive Probe From Serialization Point
1113: Probe Resulting From Locally Generated Request?
  Yes: 1121: Use Newly Generated Tag From Home Cluster Tag Space
  No: 1115: Use Tag Corresponding To Tag From Request Cluster
1123: Select Clusters To Send Probes To Based On Directory
1131: Send Probe To Home Cluster With Coherence Information
1133: Forward Probes To Selected Clusters With Tag Information
`
`
`
[Figure 12: flow process diagram, "Tag Handling Upon Receiving Probe Responses"]
1201: Send Cache Access Request To Home Cluster
1203: Receive Probe From Home Cluster
1205: Extract Information From Probe To Determine Number Of Expected Probe Responses
1207: Probe Local Nodes
1209: Receive A Plurality Of Probe Responses
1211: Signal Processor Associated With The Request After Expected Probe Responses Are Received
`
`
`
`METHODS AND APPARATUS FOR
`RESPONDING TO A REQUEST CLUSTER
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
The present application is related to filed U.S. application Ser. No. 10/106,426 titled Methods And Apparatus For Speculative Probing At A Request Cluster, U.S. application Ser. No. 10/106,430 titled Methods And Apparatus For Speculative Probing With Early Completion And Delayed Request, and U.S. application Ser. No. 10/106,299 titled Methods And Apparatus For Speculative Probing With Early Completion And Early Request, the entireties of which are incorporated by reference herein for all purposes. The present application is also related to concurrently filed U.S. application Ser. No. 10/145,438 titled Methods And Apparatus For Responding To A Request Cluster by David B. Glasco, the entirety of which is incorporated by reference for all purposes.
`
`BACKGROUND OF THE INVENTION
`
1. Field of the Invention

The present invention generally relates to accessing data in a multiple processor system. More specifically, the present invention provides techniques for improving data access efficiency while maintaining cache coherency in a multiple processor system having a multiple cluster architecture.

2. Description of Related Art

Data access in multiple processor systems can raise issues relating to cache coherency. Conventional multiple processor computer systems have processors coupled to a system memory through a shared bus. In order to optimize access to data in the system memory, individual processors are typically designed to work with cache memory. In one example, each processor has a cache that is loaded with data that the processor frequently accesses. The cache is read or written by a processor. However, cache coherency problems arise because multiple copies of the same data can co-exist in systems having multiple processors and multiple cache memories. For example, a frequently accessed data block corresponding to a memory line may be loaded into the cache of two different processors. In one example, if both processors attempt to write new values into the data block at the same time, different data values may result. One value may be written into the first cache while a different value is written into the second cache. A system might then be unable to determine what value to write through to system memory.

A variety of cache coherency mechanisms have been developed to address such problems in multiprocessor systems. One solution is to simply force all processor writes to go through to memory immediately and bypass the associated cache. The write requests can then be serialized before overwriting a system memory line. However, bypassing the cache significantly decreases the efficiency gained by using a cache. Other cache coherency mechanisms have been developed for specific architectures. In a shared bus architecture, each processor checks or snoops on the bus to determine whether it can read or write a shared cache block. In one example, a processor only writes an object when it owns or has exclusive access to the object. Each corresponding cache object is then updated to allow processors access to the most recent version of the object.
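The snooping scheme just described can be sketched as a toy simulation (class names and the bus model are illustrative, not from the patent): a cache takes exclusive access before writing, and the other caches on the bus snoop the write and update their copies.

```python
class SnoopingCache:
    """Toy model of bus snooping: a cache writes a block only when it
    has exclusive access, and other caches on the bus update their
    copies when they snoop the write (names are illustrative)."""
    def __init__(self, bus):
        self.data = {}
        self.exclusive = set()
        self.bus = bus
        bus.append(self)

    def write(self, block, value):
        if block not in self.exclusive:
            # Acquire exclusive access: every other cache gives it up.
            for cache in self.bus:
                cache.exclusive.discard(block)
            self.exclusive.add(block)
        self.data[block] = value
        # Other caches snoop the bus write and update their copies.
        for cache in self.bus:
            if cache is not self and block in cache.data:
                cache.data[block] = value

bus = []
a, b = SnoopingCache(bus), SnoopingCache(bus)
a.write("x", 1)
b.data["x"] = 1          # b now holds a cached copy
b.write("x", 2)          # b takes exclusive access; a's copy is updated
print(a.data["x"])       # 2
```

Without the snoop step, cache `a` would still hold the stale value 1 and the write-back problem described above would arise.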
Bus arbitration is used when both processors attempt to write the same shared data block in the same clock cycle. Bus arbitration logic decides which processor gets the bus first. Although cache coherency mechanisms such as bus arbitration are effective, using a shared bus limits the number of processors that can be implemented in a single system with a single memory space.
Other multiprocessor schemes involve individual processor, cache, and memory systems connected to other processor, cache, and memory systems using a network backbone such as Ethernet or Token Ring. Multiprocessor schemes involving separate computer systems each with its own address space can avoid many cache coherency problems because each processor has its own associated memory and cache. When one processor wishes to access data on a remote computing system, communication is explicit. Messages are sent to move data to another processor and messages are received to accept data from another processor using standard network protocols such as TCP/IP. Multiprocessor systems using explicit communication including transactions such as sends and receives are referred to as systems using multiple private memories. By contrast, multiprocessor systems using implicit communication including transactions such as loads and stores are referred to herein as using a single address space.
Multiprocessor schemes using separate computer systems allow more processors to be interconnected while minimizing cache coherency problems. However, it would take substantially more time to access data held by a remote processor using a network infrastructure than it would take to access data held by a processor coupled to a system bus. Furthermore, valuable network bandwidth would be consumed moving data to the proper processors. This can negatively impact both processor and network performance.

Performance limitations have led to the development of a point-to-point architecture for connecting processors in a system with a single memory space. In one example, individual processors can be directly connected to each other through a plurality of point-to-point links to form a cluster of processors. Separate clusters of processors can also be connected. The point-to-point links significantly increase the bandwidth for coprocessing and multiprocessing functions. However, using a point-to-point architecture to connect multiple processors in a multiple cluster system sharing a single memory space presents its own problems.

Consequently, it is desirable to provide techniques for improving data access and cache coherency in systems having multiple clusters of multiple processors connected using point-to-point links.
`
`SUMMARY OF THE INVENTION
`
According to the present invention, methods and apparatus are provided for increasing the efficiency of data access in a multiple processor, multiple cluster system. A home cluster of processors receives a cache access request from a request cluster. The home cluster includes mechanisms for instructing probed remote clusters to respond to the request cluster instead of to the home cluster. The home cluster can also include mechanisms for reducing the number of probes sent to remote clusters. Techniques are also included for providing the requesting cluster with information to determine the number of responses to be transmitted to the requesting cluster as a result of the reduction in the number of probes sent from the home cluster.

According to various embodiments, a computer system is provided. A home cluster includes a first plurality of processors and a home cache coherence controller. The first plurality of processors and the home cache coherence controller are interconnected in a point-to-point architecture. The home cache coherence controller is configured to send a probe to a remote cluster upon receiving a cache access request from a request cluster. The probe includes information directing the remote cluster to send a probe response corresponding to the request to the request cluster.
According to other embodiments, another computer system is provided. The computer system includes a first cluster and a second cluster. The first cluster includes a first plurality of processors and a first cache coherence controller. The first plurality of processors and the first cache coherence controller are interconnected in a point-to-point architecture. The second cluster includes a second plurality of processors and a second cache coherence controller. The second plurality of processors and the second cache coherence controller are interconnected in a point-to-point architecture. The first cache coherence controller is coupled to the second cache coherence controller and configured to send a request to the second cluster. The first cache coherence controller is configured to receive a plurality of probe responses corresponding to the request.

According to still other embodiments, a method for a cache coherence controller to manage data access in a multiprocessor system is provided. A cache access request originating from a first cluster of processors is sent to a second cluster of processors. A plurality of probe responses corresponding to the cache access request is received from a plurality of clusters.

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which are illustrative of specific embodiments of the present invention.

FIGS. 1A and 1B are diagrammatic representations depicting a system having multiple clusters.

FIG. 2 is a diagrammatic representation of a cluster having a plurality of processors.

FIG. 3 is a diagrammatic representation of a cache coherence controller.

FIG. 4 is a diagrammatic representation showing a transaction flow for a data access request from a processor in a single cluster.

FIGS. 5A-5D are diagrammatic representations showing cache coherence controller functionality.

FIG. 6 is a diagrammatic representation depicting a transaction flow for a remote cluster sending a probe response to a home cluster.

FIG. 7 is a diagrammatic representation showing a transaction flow for a remote cluster sending a probe response to a requesting cluster.

FIG. 8 is a flow process diagram showing tag management before probe transmission to remote nodes.

FIG. 9 is a process flow diagram showing a technique for receiving probe responses.

FIG. 10 is a diagrammatic representation showing a transaction flow for a remote cluster sending a probe response to a requesting cluster.

FIG. 11 is a flow process diagram showing tag management before probe transmission to remote nodes in a system with a coherence directory.

FIG. 12 is a process flow diagram showing a technique for receiving probe responses in a system with a coherence directory.
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`DETAILED DESCRIPTION OF SPECIFIC
`EMBODIMENTS
`
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Multi-processor architectures having point-to-point communication among their processors are suitable for implementing specific embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. Well known process operations have not been described in detail in order not to unnecessarily obscure the present invention. Furthermore, the present application's reference to a particular singular entity includes the possibility that the methods and apparatus of the present invention can be implemented using more than one entity, unless the context clearly dictates otherwise.
Techniques are provided for increasing data access efficiency in a multiple processor, multiple cluster system. In a point-to-point architecture, a cluster of processors includes multiple processors directly connected to each other through point-to-point links. By using point-to-point links instead of a conventional shared bus or external network, multiple processors are used efficiently in a system sharing the same memory space. Processing and network efficiency are also improved by avoiding many of the bandwidth and latency limitations of conventional bus and external network based multiprocessor architectures. According to various embodiments, however, linearly increasing the number of processors in a point-to-point architecture leads to an exponential increase in the number of links used to connect the multiple processors. In order to reduce the number of links used and to further modularize a multiprocessor system using a point-to-point architecture, multiple clusters are used.
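The link-count growth motivating clusters can be illustrated with a small calculation (function names are illustrative; a fully connected mesh of n nodes uses n(n-1)/2 pairwise links):

```python
def full_mesh_links(n):
    # A fully connected point-to-point mesh needs one link per node pair.
    return n * (n - 1) // 2

def clustered_links(clusters, per_cluster):
    # Each cluster is a small full mesh; the clusters themselves are then
    # interconnected through a full mesh of cache coherence controllers.
    intra = clusters * full_mesh_links(per_cluster)
    inter = full_mesh_links(clusters)
    return intra + inter

# 16 processors as one mesh vs. four 4-processor clusters:
print(full_mesh_links(16))      # 120 links
print(clustered_links(4, 4))    # 30 links (4*6 intra + 6 inter)
```

The clustered arrangement needs far fewer links for the same processor count, which is the modularization benefit the passage describes.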
According to various embodiments, the multiple processor clusters are interconnected using a point-to-point architecture. Each cluster of processors includes a cache coherence controller used to handle communications between clusters. In one embodiment, the point-to-point architecture used to connect processors is used to connect clusters as well.

By using a cache coherence controller, multiple cluster systems can be built using processors that may not necessarily support multiple clusters. Such a multiple cluster system can be built by using a cache coherence controller to represent non-local nodes in local transactions so that local nodes do not need to be aware of the existence of nodes outside of the local cluster. More detail on the cache coherence controller will be provided below.
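The proxying role of the cache coherence controller can be sketched as follows (a hypothetical model; class and method names are not from the patent). Local nodes see the controller as a single node, while the controller fans the probe out to the remote clusters it represents:

```python
class Node:
    """A local node that can respond to coherence probes."""
    def __init__(self, name):
        self.name = name

    def probe(self, line):
        return f"{self.name}: response for line {hex(line)}"

class CacheCoherenceController(Node):
    """Appears to local nodes as just another node, but transparently
    forwards probes to the remote clusters it represents, so local
    nodes need not know that non-local nodes exist."""
    def __init__(self, remote_clusters):
        super().__init__("coherence-controller")
        self.remote_clusters = remote_clusters

    def probe(self, line):
        # The fan-out to remote clusters is hidden behind the controller;
        # local nodes observe a single aggregated response.
        replies = [cluster.probe(line) for cluster in self.remote_clusters]
        return f"{self.name}: aggregated {len(replies)} remote responses"

remote = [Node("cluster-1040"), Node("cluster-1060")]
cc = CacheCoherenceController(remote)
print(cc.probe(0x100))   # coherence-controller: aggregated 2 remote responses
```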
In a single cluster system, cache coherency can be maintained by sending all data access requests through a serialization point. Any mechanism for ordering data access requests is referred to herein as a serialization point. One example of a serialization point is a memory controller. Various processors in the single cluster system send data access requests to the memory controller. In one example, the memory controller is configured to serialize or lock the data access requests so that only one data access request for a given memory line is allowed at any particular time. If another processor attempts to access the same memory line, the data access attempt is blocked until the memory line is unlocked. The memory controller allows cache coherency to be maintained in a multiple processor, single cluster system.
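A minimal sketch of such a serialization point, assuming a lock-per-memory-line policy (class and method names are illustrative, not from the patent):

```python
class MemoryController:
    """Sketch of a serialization point: at most one outstanding data
    access request per memory line at any particular time."""
    def __init__(self):
        self.locked_lines = set()

    def request(self, line):
        # Serialize: only one request for a given memory line at a time.
        if line in self.locked_lines:
            return False          # blocked until the line is unlocked
        self.locked_lines.add(line)
        return True               # request proceeds and locks the line

    def complete(self, line):
        # Unlock the line so a blocked requester can retry.
        self.locked_lines.discard(line)

mc = MemoryController()
print(mc.request(0x40))   # True: line locked for this request
print(mc.request(0x40))   # False: second requester is blocked
mc.complete(0x40)
print(mc.request(0x40))   # True: line was unlocked
```

A second request for a locked line is simply blocked until `complete` unlocks it, which is what serializes access to the line.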
A serialization point can also be used in a multiple processor, multiple cluster system where the processors in the various clusters share a single address space. By using a single address space, internal point-to-point links can be used to significantly improve intercluster communication over traditional external network based multiple cluster systems. Various processors in various clusters send data access requests to a memory controller associated with a particular cluster such as a home cluster. The memory controller can similarly serialize all data requests from the different clusters. However, a serialization point in a multiple processor, multiple cluster system may not be as efficient as a serialization point in a multiple processor, single cluster system. That is, delay resulting from factors such as latency from transmitting between clusters can adversely affect the response times for various data access requests. It should be noted that delay also results from the use of probes in a multiple processor environment.

Although delay in intercluster transactions in an architecture using a shared memory space is significantly less than the delay in conventional message passing environments using external networks such as Ethernet or Token Ring, even minimal delay is a significant factor. In some applications, there may be millions of data access requests from a processor in a fraction of a second. Any delay can adversely impact processor performance.
According to various embodiments, speculative probing is used to increase the efficiency of accessing data in a multiple processor, multiple cluster system. A mechanism for eliciting a response from a node to maintain cache coherency in a system is referred to herein as a probe. In one example, a mechanism for snooping a cache is referred to as a probe. A response to a probe can be directed to the source or target of the initiating request. Any mechanism for sending probes to nodes associated with cache blocks before a request associated with the probes is received at a serialization point is referred to herein as speculative probing.
According to various embodiments, the reordering or elimination of certain data access requests does not adversely affect cache coherency. That is, the end value in the cache is the same whether or not snooping occurs. For example, a local processor attempting to read the cache data block can be allowed to access the data block without sending the requests through a serialization point in certain circumstances. In one example, read access can be permitted when the cache block is valid and the associated memory line is not locked. Techniques for performing speculative probing generally are described in U.S. application Ser. No. 10/106,426 titled Methods And Apparatus For Speculative Probing At A Request Cluster, U.S. application Ser. No. 10/106,430 titled Methods And Apparatus For Speculative Probing With Early Completion And Delayed Request, and U.S. application Ser. No. 10/106,299 titled Methods And Apparatus For Speculative Probing With Early Completion And Early Request, the entireties of which are incorporated by reference herein for all purposes. By completing a data access transaction within a local cluster, the delay associated with transactions in a multiple cluster system can be reduced or eliminated.
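The local read fast path in the example above reduces to a simple predicate (a sketch; argument names are illustrative):

```python
def can_read_locally(block_valid, line_locked):
    """Local read fast path: a read may bypass the serialization point
    when the cached block is valid and the associated memory line is
    not locked (predicate names are illustrative, not from the patent)."""
    return block_valid and not line_locked

print(can_read_locally(True, False))   # True: complete the read in the local cluster
print(can_read_locally(True, True))    # False: must go through the serialization point
print(can_read_locally(False, False))  # False: block is stale or absent
```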
The techniques of the present invention recognize that other efficiencies can be achieved, particularly when speculative probing can not be completed at a local cluster. In one example, a cache access request is forwarded from a local cluster to a home cluster. A home cluster then proceeds to send probes to remote clusters in the system. In typical implementations, the home cluster gathers the probe responses corresponding to the probe before sending an aggregated response to the request cluster. The aggregated response typically includes the results of the home cluster probes and the results of the remote cluster probes. The techniques of the present invention provide techniques for more efficiently aggregating responses at the request cluster instead of a home cluster. According to various embodiments, remote clusters send probe responses directly to the request cluster instead of sending the probe responses to the request cluster through a home cluster. In one embodiment, techniques are provided for enabling a home cluster to send a reduced number of probes to remote clusters. Mechanisms are provided for allowing a home cluster to inform the request cluster that a reduced number of probes are being transmitted. The mechanisms can be implemented in a manner entirely transparent to remote clusters.
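The request-side bookkeeping described here can be sketched as follows (a hypothetical model; message fields and names are not from the patent). The probe from the home cluster carries the number of responses to expect, so the request cluster knows when aggregation is complete even when the home cluster sends a reduced number of probes:

```python
class RequestCluster:
    """Sketch of request-side aggregation: the home cluster's probe
    tells the request cluster how many probe responses to expect, and
    the requesting processor is signaled once they have all arrived."""
    def __init__(self):
        self.expected = None
        self.received = 0
        self.done = False

    def on_probe_from_home(self, probe):
        # The probe encodes the response count, reflecting any reduction
        # in the number of probes the home cluster actually sent.
        self.expected = probe["expected_responses"]

    def on_probe_response(self, source):
        self.received += 1
        if self.received == self.expected:
            # Signal the processor associated with the request.
            self.done = True

rc = RequestCluster()
rc.on_probe_from_home({"expected_responses": 2})
rc.on_probe_response("remote cluster 1040")
rc.on_probe_response("remote cluster 1060")
print(rc.done)   # True
```

Because remote clusters respond directly to the request cluster, no round trip back through the home cluster is needed before the processor can be signaled.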
FIG. 1A is a diagrammatic representation of one example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 101, 103, 105, and 107 can include a plurality of processors. The processing clusters 101, 103, 105, and 107 are connected to each other through point-to-point links 111a-f. In one embodiment, the multiple processors in the multiple cluster architecture shown in FIG. 1A share the same memory space. In this example, the point-to-point links 111a-f are internal system connections that are used in place of a traditional front-side bus to connect the multiple processors in the multiple clusters 101, 103, 105, and 107. The point-to-point links may support any point-to-point coherence protocol.

FIG. 1B is a diagrammatic representation of another example of a multiple cluster, multiple processor system that can use the techniques of the present invention. Each processing cluster 121, 123, 125, and 127 can be coupled to a switch 131 through point-to-point links 141a-d. It should be noted that using a switch and point-to-point links allows implementation with fewer point-to-point links when connecting multiple clusters in the system. A switch 131 can include a processor with a coherence protocol interface. According to various implementations, a multicluster system shown in FIG. 1A is expanded using a switch 131 as shown in FIG. 1B.
FIG. 2 is a diagrammatic representation of a multiple processor cluster, such as the cluster 101 shown in FIG. 1A. Cluster 200 includes processors 202a-202d, one or more Basic I/O systems (BIOS) 204, a memory subsystem comprising memory banks 206a-206d, point-to-point communication links 208a-208e, and a service processor 212. The point-to-point communication links are configured to allow interconnections between processors 202a-202d, I/O switch 210, and cache coherence controller 230. The service processor 212 is configured to allow communications with processors 202a-202d, I/O switch 210, and cache coherence controller 230 via a JTAG interface represented in FIG. 2 by links 214a-214f. It should be noted that other interfaces are supported. I/O switch 210 connects the rest of the system to I/O adapters 216 and 220.

According to specific embodiments, the service processor of the present invention has the intelligence to partition system resources according to a previously specified partitioning schema. The partitioning can be achieved through direct manipulation of routing tables associated with the system processors by the service processor which is made possible by the point-to-point communication infrastructure. The routing tables are used to control and isolate various system resources, the connections b