(12) United States Patent
     Joy et al.

(10) Patent No.:     US 7,587,581 B2
(45) Date of Patent: Sep. 8, 2009

(54) MULTIPLE-THREAD PROCESSOR WITH IN-PIPELINE, THREAD SELECTABLE
     STORAGE

(75) Inventors: William N. Joy, Aspen, CO (US); Marc Tremblay,
     Menlo Park, CA (US); Gary Lauterbach, Los Altos, CA (US);
     Joseph I. Chamdani, Santa Clara, CA (US)

(73) Assignee: Sun Microsystems, Inc., Palo Alto, CA (US)

(*)  Notice: Subject to any disclaimer, the term of this patent is
     extended or adjusted under 35 U.S.C. 154(b) by 211 days.

(21) Appl. No.: 11/710,112

(22) Filed: Feb. 23, 2007

(65) Prior Publication Data
     US 2007/0174597 A1      Jul. 26, 2007

     Related U.S. Application Data

(63) Continuation of application No. 10/403,406, filed on
     Mar. 31, 2003, now Pat. No. 7,185,185, which is a
     continuation of application No. 09/309,734, filed on
     May 11, 1999, now Pat. No. 6,542,991.

(51) Int. Cl.
     G06F 9/38      (2006.01)
(52) U.S. Cl. ........ 712/229
(58) Field of Classification Search ........ None
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,361,337 A     11/1994   Okin
     5,452,452 A      9/1995   Gaetner et al.
     5,513,130 A      4/1996   Redmond
     5,584,023 A     12/1996   Hsu
     5,590,359 A     12/1996   Sharangpani
     5,680,641 A     10/1997   Sidman
     5,684,993 A     11/1997   Willman

     (Continued)

     FOREIGN PATENT DOCUMENTS

     WO   WO 99/21082    4/1999

     OTHER PUBLICATIONS

     Fillo, M. et al., The M-Machine Multicomputer, 1997, Plenum
     Publishing, International Journal of Parallel Programming,
     vol. 25, No. 3, pp. 193-212.*

     (Continued)

Primary Examiner: Eric Coleman
(74) Attorney, Agent, or Firm: Gunnison, McKay & Hodgson, L.L.P.;
     Forrest Gunnison

(57) ABSTRACT
`
A processor reduces wasted cycle time resulting from stalling
and idling, and increases the proportion of execution time, by
supporting and implementing both vertical multithreading and
horizontal multithreading. Vertical multithreading permits
overlapping or “hiding” of cache miss wait times. In vertical
multithreading, multiple hardware threads share the same
processor pipeline. A hardware thread is typically a process, a
lightweight process, a native thread, or the like in an operating
system that supports multithreading. Horizontal multithreading
increases parallelism within the processor circuit structure, for
example within a single integrated circuit die that makes up a
single-chip processor. To further increase system parallelism in
some processor embodiments, multiple processor cores are formed
in a single die. Advances in on-chip multiprocessor horizontal
threading are gained as processor core sizes are reduced through
technological advancements.
`
`1 Claim, 21 Drawing Sheets
`
`
U.S. PATENT DOCUMENTS

5,692,193 A     11/1997   Jagannathan et al.
5,721,868 A      2/1998   Yung et al.
5,724,565 A      3/1998   Dubey et al.
5,742,806 A      4/1998   Reiner et al.
5,752,027 A      5/1998   Familiar
5,761,285 A      6/1998   Stent
5,778,247 A      7/1998   Tremblay
5,809,415 A      9/1998   Rossmann
5,860,138 A      1/1999   Engebretsen et al.
5,861,761 A      1/1999   Kean
5,893,159 A      4/1999   Schneider
5,960,458 A      9/1999   Kametani
6,038,647 A      3/2000   Shimizu
6,052,708 A      4/2000   Flynn et al.
6,058,466 A      5/2000   Panwar et al.
6,061,710 A      5/2000   Eickemeyer et al.
6,101,599 A      8/2000   Wright et al.
6,105,051 A      8/2000   Borkenhagen et al.
6,122,712 A      9/2000   Torii
6,167,507 A     12/2000   Mahalingaiah et al.
6,205,519 B1     3/2001   Aglietti et al.
6,233,599 B1     5/2001   Nation et al.
6,298,431 B1    10/2001   Gottlieb
6,420,903 B1     7/2002   Singh et al.
6,507,862 B1     1/2003   Joy et al.

OTHER PUBLICATIONS

Olukotun, K., et al., The Case for a Single-Chip Multiprocessor,
1996, ACM, pp. 2-10.*
Hammond, L., et al., A Single-Chip Multiprocessor, Sep. 1997,
IEEE, pp. 79-85.*
Gulati et al., “Performance Study of a Multithreaded Superscalar
Microprocessor”, Proceedings. International Symposium on
High-Performance Computer Architecture, 1996, pp. 291-301.
(XP000572068).
Gunther, “Multithreading with Distributed Functional Units”, IEEE
Transactions on Computers, vol. 46, No. 4, IEEE Inc., New York,
Apr. 1, 1997, pp. 399-411. (XP00656016).
Klass et al., “A New Family of Semidynamic and Dynamic
Flip-Flops with Embedded Logic for High-Performance Processors”,
IEEE Journal of Solid-State Circuits, vol. 34, No. 5, IEEE Inc.,
New York, Jun. 11, 1998, pp. 712-716. (XP002156316).
Tremblay et al., “A Three Dimensional Register File for
Superscalar Processors”, Proceedings of the 28th Annual Hawaii
International Conf. on Systems Sciences, Jan. 1995, pp. 191-201.

* cited by examiner
DRAWINGS (Sheets 1 to 21 of 21; graphics not reproduced, legible
labels only)

Sheet 1 of 21: FIGS. 1A and 1B.
Sheet 2 of 21: FIGS. 2A, 2B, and 2C.
Sheet 3 of 21: FIG. 3. Labels: Shared I$, BP, NFR + IMMU;
  Pipeline; Thread 0 Machine State; UPA Bus 526; E$ Bus 524.
Sheet 4 of 21: FIGS. 4A, 4B, and 4C. Labels: se, tid, clk, sclk;
  reference numerals 412, 414.
Sheet 5 of 21: FIG. 5 (per-thread bit storage, TID; reference
  numerals 500, 510) and FIG. 6 (inputs L1_load_miss,
  L1_inst_miss, inst_buf_empty, exception_mode, MT_mode,
  thread_priority; Thread Select Logic 610).
Sheet 6 of 21: FIG. 7A (Data Cache-Thread 0, Data Cache-Thread 1;
  reference numerals 700, 710, 712) and FIG. 7B (cache VA tag
  bits VA[63:7]; index bits TID + VA[6:5]; byte offset VA[4:0];
  reference numerals 720, 722, 724, 726).
Sheet 7 of 21: FIG. 8. Labels: 16kB DM cache; Thread 0: VA=A,
  PA=B; Thread 1: VA=A', PA=B at entry n+256; cache index, byte
  offset, new cache index.
Sheet 8 of 21: FIG. 9 (labels illegible).
Sheet 9 of 21: FIG. 10. Labels: Thread pipeline; E$ Control Unit
  (ECU); L2$ SRAM 1024; DRAM; UPA 1026; reference numeral 1000.
Sheet 10 of 21: FIG. 11 (labels illegible).
Sheet 11 of 21: FIG. 12. Labels: Instruction Cache (16K plus next
  field, branch history); Instr TLB (64 entries); Predecode Unit;
  Integer Registers (8 windows, 7 read, 3 write ports);
  Floating-Point Registers (5 read ports, 3 write ports); Data
  Cache (16K); Cache Control/System Interface.
Sheet 12 of 21: FIG. 13, “A Three Dimensional Register File”.
  Labels: Bit Line; Register Index; Current Window Decode;
  reference numerals 1312, 1314.
Sheet 13 of 21: FIGS. 14 and 15.
Sheets 14 and 15 of 21: FIG. 16 (sub-figure labels illegible).
Sheet 16 of 21: FIG. 17A. Labels: current window (caller) 1710;
  function call 1712; window (callee) 1720.
Sheet 17 of 21: FIG. 17B.
Sheet 18 of 21: Key to FIG. 18 (FIGS. 18A, 18B, 18C, 18D) and
  FIG. 18A. Signal labels: rps0_out to rps6_out, rps0_in to
  rps6_in, rcwp0 to rcwp7.
Sheet 19 of 21: FIG. 18B.
Sheet 20 of 21: FIG. 18C (reference numeral 1800).
Sheet 21 of 21: FIG. 18D (reference numerals 1810, 1820).

`MULTIPLE-THREAD PROCESSOR WITH
`IN-PIPELINE, THREAD SELECTABLE
`STORAGE
`
`CROSS-REFERENCE TO RELATED
`APPLICATION
`
This application is a continuation of U.S. patent application
Ser. No. 10/403,406, entitled “MULTIPLE-THREAD PROCESSOR WITH
IN-PIPELINE, THREAD SELECTABLE STORAGE,” in the name of inventors
William N. Joy, Marc Tremblay, Gary Lauterbach, and Joseph I.
Chamdani, filed on Mar. 31, 2003, now U.S. Pat. No. 7,185,185,
which is incorporated herein by reference in its entirety, which
was a continuation of application Ser. No. 09/309,734, filed May
11, 1999, now U.S. Pat. No. 6,542,991, which application is
incorporated herein by reference in its entirety.
In addition, the present invention is related to subject matter
disclosed in the following, commonly-owned, co-pending patent
applications:
1. U.S. patent application Ser. No. 09/309,732, entitled,
“Processor with Multiple-Thread, Vertically-Threaded
Pipeline”, naming William Joy, Marc Tremblay, Gary
Lauterbach, and Joseph Chamdani as inventors and filed
May 11, 1999;
2. U.S. patent application Ser. No. 09/309,731, entitled,
“Vertically-Threaded Processor with Multi-Dimensional
Storage”, naming William Joy, Marc Tremblay, Gary
Lauterbach, and Joseph Chamdani as inventors and filed
May 11, 1999;
3. U.S. patent application Ser. No. 09/309,735, entitled,
“Switching Method in a Multi-Threaded Processor”,
naming William Joy, Marc Tremblay, Gary Lauterbach,
and Joseph Chamdani as inventors and filed May 11,
1999; and
4. U.S. patent application Ser. No. 09/309,733, entitled,
“Thread Switch Logic in a Multiple-Thread Processor”,
naming William Joy, Marc Tremblay, Gary Lauterbach,
and Joseph Chamdani as inventors and filed May 11,
1999.
`
BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`The present invention relates to processor or computer
`architecture. More specifically, the present invention relates
`to multiple-threading processor architectures and methods of
`operation and execution.
`2. Description of the Related Art
In many commercial computing applications, a large percentage
of time elapses during pipeline stalling and idling, rather than
in productive execution, due to cache misses and latency in
accessing external caches or external memory following the cache
misses. Stalling and idling are most detrimental, due to frequent
cache misses, in database handling operations such as OLTP, DSS,
data mining, financial forecasting, mechanical and electronic
computer-aided design (MCAD/ECAD), web servers, data servers, and
the like. Thus, although a processor may execute at high speed,
much time is wasted while idly awaiting data.
One technique for reducing stalling and idling is hardware
multithreading to achieve processor execution during otherwise
idle cycles. Hardware multithreading involves replication of some
processor resources, for example replication of architected
registers, for each thread. Replication is not required for most
processor resources, including instruction and data caches,
translation look-aside buffers (TLB), instruction fetch and
dispatch elements, branch units, execution units, and the like.
Unfortunately, duplication of resources is costly in terms of
integrated circuit consumption and performance.
Accordingly, improved multithreading circuits and operating
methods are needed that are economical in resources and avoid
costly overhead which reduces processor performance.
`
SUMMARY OF THE INVENTION
`
A processor reduces wasted cycle time resulting from stalling
and idling, and increases the proportion of execution time, by
supporting and implementing both vertical multithreading and
horizontal multithreading. Vertical multithreading permits
overlapping or “hiding” of cache miss wait times. In vertical
multithreading, multiple hardware threads share the same
processor pipeline. A hardware thread is typically a process, a
lightweight process, a native thread, or the like in an operating
system that supports multithreading. Horizontal multithreading
increases parallelism within the processor circuit structure, for
example within a single integrated circuit die that makes up a
single-chip processor. To further increase system parallelism in
some processor embodiments, multiple processor cores are formed
in a single die. Advances in on-chip multiprocessor horizontal
threading are gained as processor core sizes are reduced through
technological advancements.
`
The described processor structure and operating method may be
implemented in many structural variations. For example, two
processor cores are combined with an on-chip set-associative L2
cache in one system. In another example, four processor cores are
combined with a direct RAMBUS interface with no external L2
cache. A countless number of variations are possible. In some
systems, each processor core is a vertically-threaded pipeline.
In a further aspect of some multithreading system and method
embodiments, a computing system may be configured in many
different processor variations that allocate execution among a
plurality of execution threads. For example, in a “1C2T”
configuration, a single processor die includes two vertical
threads. In a “4C4T” configuration, a four-processor
multiprocessor is formed on a single die with each of the four
processors being four-way vertically threaded. Countless other
“nCkT” structures and combinations may be implemented on one or
more integrated circuit dies depending on the fabrication process
employed and the applications envisioned for the processor.
Various systems may include caches that are selectively
configured, for example as segregated L1 caches and segregated L2
caches, or segregated L1 caches and shared L2 caches, or shared
L1 caches and shared L2 caches.
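
As an illustration of the “nCkT” naming only, the following Python sketch parses such a label and reports the number of processor cores, the vertical threads per core, and the resulting total of hardware threads. The function name `hardware_threads` is hypothetical and does not appear in the patent; the sketch assumes the label always has the form of digits, “C”, digits, “T”.

```python
import re


def hardware_threads(config: str) -> tuple[int, int, int]:
    """Parse an 'nCkT' label into (cores, threads_per_core, total_threads).

    Hypothetical helper: 'n' is read as the number of processor cores on
    the die and 'k' as the number of vertical threads per core.
    """
    match = re.fullmatch(r"(\d+)C(\d+)T", config)
    if match is None:
        raise ValueError(f"not an nCkT label: {config!r}")
    cores = int(match.group(1))
    threads_per_core = int(match.group(2))
    return cores, threads_per_core, cores * threads_per_core


if __name__ == "__main__":
    for label in ("1C2T", "4C4T"):
        print(label, hardware_threads(label))
```

Under this reading, “4C4T” describes four cores, each four-way vertically threaded, for sixteen hardware threads on one die.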
`
In an aspect of some multithreading system and method
embodiments, in response to a cache miss stall a processor
freezes the entire pipeline state of an executing thread. The
processor executes instructions and manages the machine state of
each thread separately and independently. The functional
properties of an independent thread state are stored throughout
the pipeline, extending to the pipeline registers, to enable the
processor to postpone execution of a stalling thread, relinquish
the pipeline to a previously idle thread, and later resume
execution of the postponed stalling thread at the precise state
of the stalling thread immediately prior to the thread switch.
`
In another aspect of some multithreading system and method
embodiments, a processor includes a “four-dimensional” register
structure in which register file structures are replicated by N
for vertical threading in combination with a three-dimensional
storage circuit. The multi-dimensional storage is formed by
constructing a storage, such as a register file or memory, as a
plurality of two-dimensional storage planes.
In another aspect of some multithreading system and method
embodiments, a processor implements N-bit flip-flop global
substitution. To implement multiple machine states, the processor
converts 1-bit flip-flops in storage cells of the stalling
vertical thread to an N-bit global flip-flop where N is the
number of vertical threads.
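
The substitution can be pictured with a small behavioral model. The Python sketch below is an assumption-laden illustration, not the circuit of the patent: the class name `ThreadSelectableFlipFlop` and its methods are hypothetical, and the model ignores clocking, scan, and timing. It only shows the idea that one storage position holds N bits, one per vertical thread, and that a thread identifier selects which bit is written and read.

```python
class ThreadSelectableFlipFlop:
    """Behavioral model of an N-bit, thread-selectable storage element.

    Hypothetical sketch: each vertical thread owns one stored bit, and
    the thread identifier (tid) selects the bit that is captured or read.
    """

    def __init__(self, num_threads: int) -> None:
        if num_threads < 1:
            raise ValueError("at least one thread is required")
        self.bits = [0] * num_threads  # one storage bit per thread

    def clock(self, tid: int, d: int) -> None:
        """Capture input d into the active thread's bit only."""
        self.bits[tid] = d & 1

    def q(self, tid: int) -> int:
        """Return the stored bit of the selected thread."""
        return self.bits[tid]


if __name__ == "__main__":
    ff = ThreadSelectableFlipFlop(num_threads=2)
    ff.clock(tid=0, d=1)   # thread 0 writes a 1, then stalls
    ff.clock(tid=1, d=0)   # thread 1 runs and writes its own bit
    print(ff.q(0), ff.q(1))  # thread 0's bit is untouched by thread 1
```

Because the inactive thread's bit is never overwritten, switching the thread identifier back restores that thread's state without a separate save or restore step.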
In one aspect of some processor and processing method
embodiments, the processor improves throughput efficiency and
exploits increased parallelism by introducing multithreading to
an existing and mature processor core. The multithreading is
implemented in two steps including vertical multithreading and
horizontal multithreading. The processor core is retrofitted to
support multiple machine states. System embodiments that exploit
retrofitting of an existing processor core advantageously
leverage hundreds of man-years of hardware and software
development by extending the lifetime of a proven processor
pipeline generation.
In another aspect of some multithreading system and method
embodiments, a processor includes a thread switching control
logic that performs a fast thread-switching operation in response
to an L1 cache miss stall. The fast thread-switching operation
implements one or more of several thread-switching methods. A
first thread-switching operation is “oblivious” thread-switching
every N cycles, in which the individual flip-flops locally
determine a thread-switch without notification of stalling. The
oblivious technique avoids usage of an extra global
interconnection between threads for thread selection. A second
thread-switching operation is “semi-oblivious” thread-switching
for use with an existing “pipeline stall” signal (if any). The
pipeline stall signal operates in two capacities, first as a
notification of a pipeline stall, and second as a thread select
signal between threads so that, again, usage of an extra global
interconnection between threads for thread selection is avoided.
A third thread-switching operation is an “intelligent global
scheduler” thread-switching in which a thread switch decision is
based on a plurality of signals including: (1) an L1 data cache
miss stall signal, (2) an instruction buffer empty signal, (3) an
L2 cache miss signal, (4) a thread priority signal, (5) a thread
timer signal, (6) an interrupt signal, or other sources of
triggering. In some embodiments, the thread select signal is
broadcast as fast as possible, similar to a clock tree
distribution. In some systems, a processor derives a thread
select signal that is applied to the flip-flops by overloading a
scan enable (SE) signal of a scannable flip-flop.
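
Two of the thread-switching methods can be sketched in software to make the decision rules concrete. The Python below is a minimal sketch under stated assumptions: the names `SwitchSignals`, `oblivious_tid`, and `scheduled_tid` are hypothetical, and the selection policy of the scheduler (switch to the highest-priority other thread whenever any trigger is asserted) is an assumption, since the text lists the input signals but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class SwitchSignals:
    """Trigger inputs named in the text; field names are hypothetical."""
    l1_data_miss_stall: bool = False
    inst_buffer_empty: bool = False
    l2_miss: bool = False
    thread_timer_expired: bool = False
    interrupt: bool = False


def oblivious_tid(cycle: int, n: int, num_threads: int) -> int:
    """Oblivious switching: rotate threads every n cycles, with no stall input."""
    return (cycle // n) % num_threads


def scheduled_tid(current: int, signals: SwitchSignals,
                  priority: list[int]) -> int:
    """Assumed scheduler policy: on any trigger, pick the highest-priority
    thread other than the current one (lowest thread ID breaks ties)."""
    triggered = (
        signals.l1_data_miss_stall
        or signals.inst_buffer_empty
        or signals.l2_miss
        or signals.thread_timer_expired
        or signals.interrupt
    )
    others = [t for t in range(len(priority)) if t != current]
    if not triggered or not others:
        return current
    return max(others, key=lambda t: (priority[t], -t))


if __name__ == "__main__":
    print([oblivious_tid(c, n=4, num_threads=2) for c in range(0, 12, 4)])
    print(scheduled_tid(0, SwitchSignals(l1_data_miss_stall=True), [1, 1]))
```

The oblivious rule needs only a local cycle count, which is consistent with the stated goal of avoiding an extra global interconnection; the scheduler rule needs the listed signals gathered in one place.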
In an additional aspect of some multithreading system and
method embodiments, a processor includes anti-aliasing logic
coupled to an L1 cache so that the L1 cache is shared among
threads via anti-aliasing. The L1 cache is a virtually-indexed,
physically-tagged cache that is shared among threads. The
anti-aliasing logic avoids hazards that result from multiple
virtual addresses mapping to one physical address. The
anti-aliasing logic selectively invalidates or updates duplicate
L1 cache entries.
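
The aliasing hazard and one possible remedy can be sketched as follows. This Python model is illustrative only and rests on assumptions not stated in the text: a 16 KB direct-mapped cache with 32-byte lines, an 8 KB page size (so one index bit lies above the page offset), and a policy of invalidating a duplicate rather than updating it. The class name `VIPTCache` and its methods are hypothetical.

```python
LINE_BYTES = 32            # assumed line size
NUM_LINES = 512            # 16 KB direct-mapped / 32-byte lines
PAGE_BYTES = 8192          # assumed page size
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES  # 256 lines share a page offset


class VIPTCache:
    """Virtually-indexed, physically-tagged cache with alias invalidation."""

    def __init__(self) -> None:
        # Each entry holds the physical line address it caches, or None.
        self.lines = [None] * NUM_LINES

    @staticmethod
    def index(va: int) -> int:
        """Cache index taken from virtual address bits."""
        return (va // LINE_BYTES) % NUM_LINES

    def fill(self, va: int, pa: int) -> None:
        """Install the line for (va, pa); invalidate any duplicate of the
        same physical line held at another index (an alias)."""
        idx = self.index(va)
        pline = pa // LINE_BYTES
        # Aliases agree in the page-offset part of the index, so only
        # indices that differ by a multiple of LINES_PER_PAGE are checked.
        for alias in range(idx % LINES_PER_PAGE, NUM_LINES, LINES_PER_PAGE):
            if alias != idx and self.lines[alias] == pline:
                self.lines[alias] = None
        self.lines[idx] = pline

    def lookup(self, va: int, pa: int) -> bool:
        """Hit if the indexed entry's physical tag matches."""
        return self.lines[self.index(va)] == pa // LINE_BYTES


if __name__ == "__main__":
    cache = VIPTCache()
    cache.fill(va=0x0000, pa=0x40000)   # thread 0 maps VA A to PA B
    cache.fill(va=0x2000, pa=0x40000)   # thread 1 maps VA A' to the same PA
    print(cache.lookup(0x0000, 0x40000), cache.lookup(0x2000, 0x40000))
```

With these assumed sizes the two aliases sit 256 entries apart, and after the second fill only one copy of the physical line remains valid.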
In another aspect of some multithreading system and method
embodiments, a processor includes logic for attaining a very fast
exception handling functionality while executing non-threaded
programs by invoking a multithreaded-type functionality in
response to an exception condition. The processor, while
operating in multithreaded conditions or while executing
non-threaded programs, progresses through multiple machine states
during execution. The very fast exception handling logic includes
connection of an exception signal line to thread select logic,
causing an exception signal to evoke a switch in thread and
machine state. The switch in thread and machine state causes the
processor to enter and to exit the exception handler immediately,
without waiting to drain the pipeline or queues and without the
inherent timing penalty of the operating system’s software saving
and restoring of registers.
An additional aspect of some multithreading systems and methods
is a thread reservation system or thread locking system in which
a thread pathway is reserved for usage by a selected thread. A
thread control logic may select a particular thread that is to
execute with priority in comparison to other threads. A high
priority thread may be associated with an operation with strict
time constraints, or an operation that is frequently and
predominantly executed in comparison to other threads. The thread
control logic controls thread-switching operation so that a
particular hardware thread is reserved for usage by the selected
thread.
In another aspect of some multithreading system and method
embodiments, a processor includes logic supporting lightweight
processes and native threads. The logic includes a block that
disables thread ID tagging and disables cache segregation since
lightweight processes and native threads share the same virtual
tag space.
In another aspect of some multithreading system and method
embodiments, a processor includes logic for tagging a thread
identifier (TID) for usage with processor blocks that are not
stalled. Pertinent non-stalling blocks include caches,
translation look-aside buffers (TLB), a load buffer asynchronous
interface, an external memory management unit (MMU) interface,
and others.
In another aspect of some multithreading system and method
embodiments, a processor includes a cache that is segregated into
a plurality of N cache parts. Cache segregation avoids
interference, “pollution”, or “cross-talk” between threads. One
technique for cache segregation utilizes logic for storing and
communicating thread identification (TID) bits. The cache
utilizes cache indexing logic. For example, the TID bits can be
inserted at the most significant bits of the cache index.
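
A short sketch may help show what inserting the TID bits at the most significant bits of the cache index means in practice. The Python function below is a hypothetical illustration: the name `segregated_index` and its parameters are not from the patent, and it assumes that the thread count, the line count, and the line size are all powers of two. With two threads, eight lines, and 32-byte lines it yields a three-bit index formed from one TID bit followed by VA[6:5], which appears to match the field split labeled in FIG. 7B.

```python
def segregated_index(tid: int, va: int, num_threads: int,
                     num_lines: int, line_bytes: int) -> int:
    """Form a cache index whose most significant bits are the thread ID.

    Hypothetical sketch; assumes num_threads, num_lines, and line_bytes
    are powers of two and num_lines >= num_threads.
    """
    tid_bits = (num_threads - 1).bit_length()      # bits needed for the TID
    index_bits = (num_lines - 1).bit_length()      # total index width
    va_index_bits = index_bits - tid_bits          # bits left for the address
    va_part = (va // line_bytes) & ((1 << va_index_bits) - 1)
    return (tid << va_index_bits) | va_part


if __name__ == "__main__":
    # Same virtual address, different threads: the indices land in
    # different halves of the cache, so the threads cannot evict each other.
    for tid in (0, 1):
        print(tid, segregated_index(tid, 0x60, num_threads=2,
                                    num_lines=8, line_bytes=32))
```

The design trade-off is that each thread sees only 1/N of the cache capacity in exchange for freedom from interference by the other threads.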
`
`In a further additional aspect of some embodiments of the
`multithreading system and method, some processors include
`a thread reservation functionality.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The features of the described embodiments are specifically set
forth in the appended claims. However, embodiments of the
invention relating to both structure and method of operation may
best be understood by referring to the following description and
accompanying drawings.
FIGS. 1A and 1B are timing diagrams respectively illustrating
execution flow of a single-thread processor and a vertical
multithread processor.
`FIGS. 2A, 2B, and 2C are timing diagrams respectively
`illustrating execution flow of a single-thread processor, a
`vertical multithread processor, and a vertical and horizontal
`multithread processor.
FIG. 3 is a schematic functional block diagram depicting a
design configuration for a single-processor vertically-threaded
processor that is suitable for implementing various
multithreading techniques and system implementations that
improve multithreading performance and functionality.
FIGS. 4A, 4B, and 4C are diagrams showing an embodiment of a
pulse-based high-speed flip-flop that is advantageously used to
attain multithreading in an integrated circuit. FIG. 4A is a
schematic block diagram illustrating control and storage blocks
of a circuit employing high-speed multiple-bit flip-flops. FIG.
4B is a schematic circuit diagram that shows a multiple-bit
bistable multivibrator (flip-flop) circuit. FIG. 4C is a timing
diagram illustrating timing of the multiple-bit flip-flop.
FIG. 5 is a schematic block diagram illustrating an N-bit
“thread selectable” flip-flop substitution logic that is used to
create vertically multithreaded functionality in a processor
pipeline while maintaining the same circuit size as a
single-threaded pipeline.
`FIG. 6 is a schematic block diagram illustrating a thread
`switch logic which rapidly generates a thread identifier (TID)
`signal identifying an active thread of a plurality of threads.
FIGS. 7A and 7B are, respectively, a schematic block diagram
showing an example of a segregated cache and a pictorial diagram
showing an example of an addressing technique for the segregated
cache.
FIG. 8 is a schematic block diagram showing a suitable
anti-aliasing logic for usage in various processor
implementations including a cache, such as an L1 cache, an L2
cache, or others.
FIG. 9 is a schematic functional block diagram depicting a
design configuration for a single-chip dual-processor
vertically-threaded processor that is suitable for implementing
various multithreading techniques and system implementations
that improve multithreading performance and functionality.
FIG. 10 is a schematic functional block diagram depicting an
alternative design configuration for a single-processor
vertically-threaded processor that is suitable for implementing
various multithreading techniques and system implementations
that improve multithreading performance and functionality.
`FIG. 11 is a schematic functional block diagram depicting
`an alternative design configuration for a single-chip dual-
`processor vertically-threaded processor that is suitable for
`implementing various multithreading techniques and system
`implementations that improve multithreading performance
`and functionality.
`FIG. 12 is a schematic block diagram illustrating a proces-
`sor and processor architecture that are suitable for imple-
`menting various multithreading techniques and system
`implementations that improve multithreading performance
`and functionality.
FIG. 13 is a schematic perspective diagram showing a
multi-dimensional register file.
FIG. 14 is a schematic circuit diagram showing a conventional
implementation of register windows.
FIG. 15 is a schematic circuit diagram showing a plurality of
bit cells of a register window of the multi-dimensional register
file that avoids waste of integrated circuit area by exploiting
the condition that only one window is read and only one window is
written at one time.
FIG. 16 is a schematic circuit diagram illustrating a suitable
bit storage circuit storing one bit of the local registers for
the multi-dimensional register file with eight windows.
FIGS. 17A and 17B are, respectively, a schematic pictorial
diagram and a schematic block diagram illustrating sharing of
registers among adjacent windows.
FIG. 18 is a schematic circuit diagram illustrating an
implementation of a multi-dimensional register file for registers
shared across a plurality of windows.
`The use of the same reference symbols in different draw-
`ings indicates similar or identical items.
`
`DESCRIPTION OF THE EMBODIMENT(S)
`
Referring to FIGS. 1A and 1B, two timing diagrams respectively
illustrate execution flow 110 in a single-thread processor and
instruction flow 120 in a vertical multithread processor.
Processing applications such as database applications spend a
significant portion of execution time stalled awaiting memory
servicing. FIG. 1A is a highly schematic timing diagram showing
execution flow 110 of a single-thread processor executing a
database application. In an illustrative example, the
single-thread processor is a four-way superscalar processor.
Shaded areas 112 correspond to periods of execution in which the
single-thread processor core issues instructions. Blank areas 114
correspond to time periods in which the single-thread processor
core is stalled waiting for data or instructions from memory or
an external cache. A typical single-thread processor executing a
typical database application executes instructions about 30% of
the time with the remaining 70% of the time elapsed in a stalled
condition. The 30% utilization rate exemplifies the inefficient
usage of resources by a single-thread processor.
FIG. 1B is a highly schematic timing diagram showing execution
flow 120 of similar database operations by a multithread
processor. Applications such as database applications have a
large amount of inherent parallelism due to the heavy throughput
orientation of database applications and the common database
functionality of processing several independent transactions at
one time. The basic concept of exploiting multithread
functionality involves utilizing processor resources efficiently
when a thread is stalled by executing other threads while the
stalled thread remains stalled. The execution flow 120 depicts a
first thread 122, a second thread 124, a third thread 126 and a
fourth thread 128, all of which are shown with shading in the
timing diagram. As one thread stalls, for example first thread
122, another thread, such as second thread 124, switches into
execution on the otherwise unused or idle pipeline. Blank areas
130 correspond to idle times when all threads are stalled.
Overall processor utilization is significantly improved by
multithreading. The illustrative technique of multithreading
employs replication of architected registers for each thread and
is called “vertical multithreading”.
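
To give a rough sense of the improvement, the following Python sketch estimates pipeline utilization as a function of the number of vertical threads. The model is an assumption introduced here, not a result from the patent: each thread is taken to be ready to execute 30% of the time, independently of the others, and the pipeline is counted as busy whenever at least one thread is ready. The function name `utilization` is hypothetical, and thread-switch overhead is ignored.

```python
def utilization(p_ready: float, num_threads: int) -> float:
    """Estimated fraction of cycles in which at least one thread can issue.

    Assumed model: each thread is independently ready with probability
    p_ready; the pipeline idles only when every thread is stalled.
    """
    if not 0.0 <= p_ready <= 1.0:
        raise ValueError("p_ready must be a probability")
    return 1.0 - (1.0 - p_ready) ** num_threads


if __name__ == "__main__":
    for threads in (1, 2, 4):
        print(threads, round(utilization(0.30, threads), 4))
```

Under these assumptions the idle fraction shrinks geometrically with the thread count, which corresponds to the blank areas 130 in which all threads are stalled at once. Real workloads with correlated stalls would deviate from this estimate.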
Vertical multithreading is advantageous in processing
applications in which frequent cache misses result in heavy clock
penalties. When cache misses cause a first thread to stall,
vertical multithreading permits a second thread to execute when
the processor would otherwise remain idle. The second thread thus
takes over execution of the pipeline. A context switch from the
first thread to the second thread involves saving the useful
states of the first thread and assigning new states to the second
thread. When the first thread restarts after stalling, the saved
states are returned and the first thread proceeds in execution.
Vertical multithreading imposes costs on a processor in resources
used for saving and restoring thread states.
Referring to FIGS. 2A, 2B, and 2C, three highly schematic
timing diagrams respectively illustrate execution flow 210 of a
single-thread processor, execution flow 230 of a vertical
multithread processor, and execution flow 250 of a combined
vertical and horizontal multithread processor. In FIG. 2A, shaded
areas 212 showing periods of execution and blank areas 214
showing time periods in which the single-thread processor core is
idle due to stall illustrate the inefficiency of a single-thread
processor.
In FIG. 2B, execution flow 230 in a vertical threaded processor
includes execution of a first thread 232, and a second thread
234, both shaded in the timing diagram, and an idle time shown in
a blank area 240. Efficient instruction execution proceeds as one
thread stalls and, in response to the stall, another thread
switches into execution on the otherwise unused or idle pipeline.
In the blank areas 240, an idle time occurs when all threads are
stalled. A vertical multithread processor maintains a separate
processing state for T executing threads. Only one of the threads
is active at one time. The vertical multithreaded processor
switches execution to another thread on a cache miss, for example
an L1 cache miss.
A horizontal threaded processor, using a technique called
chip-multiple processing, combines multiple processors on a
single integrated circuit die. The multiple processors are
vertically threaded to form a processor with both vertical and
horizontal threading, augmenting executing efficiency and
decreasing latency in a multiplicative fash