US007587581B2
`
`(12) United States Patent
`Joy et al.
`
`(10) Patent No.: US 7,587,581 B2
`(45) Date of Patent: Sep. 8, 2009
`
`(54) MULTIPLE-THREAD PROCESSOR WITH
`IN-PIPELINE, THREAD SELECTABLE
`STORAGE
`
`(75) Inventors: William N. Joy, Aspen, CO (US); Marc
`Tremblay, Menlo Park, CA (US); Gary
`Lauterbach, Los Altos, CA (US);
`Joseph I. Chamdani, Santa Clara, CA
`(US)
`
`(73) Assignee: Sun Microsystems, Inc., Palo Alto, CA
`(US)
`
`(*) Notice: Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 211 days.
`
`(21) Appl. No.: 11/710,112
`(22) Filed: Feb. 23, 2007
`
`(65) Prior Publication Data
`US 2007/0174597 A1    Jul. 26, 2007
`
`Related U.S. Application Data
`(63) Continuation of application No. 10/403,406, filed on
`Mar. 31, 2003, now Pat. No. 7,185,185, which is a
`continuation of application No. 09/309,734, filed on
`May 11, 1999, now Pat. No. 6,542,991.
`
`(51) Int. Cl.
`G06F 9/38 (2006.01)
`(52) U.S. Cl. 712/229
`(58) Field of Classification Search: None
`See application file for complete search history.
`
`(56) References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,361,337 A   11/1994  Okin
`5,452,452 A    9/1995  Gaetner et al.
`5,513,130 A    4/1996  Redmond
`5,584,023 A   12/1996  Hsu
`5,590,359 A   12/1996  Sharangpani
`5,680,641 A   10/1997  Sidman
`5,684,993 A   11/1997  Willman
`
`(Continued)
`
`FOREIGN PATENT DOCUMENTS
`
`WO   WO 99/21082   4/1999
`
`OTHER PUBLICATIONS
`
`Fillo, M. et al., The M-Machine Multicomputer, 1997, Plenum
`Publishing, International Journal of Parallel Programming, vol. 25,
`No. 3, pp. 193-212.*
`
`(Continued)
`
`Primary Examiner: Eric Coleman
`(74) Attorney, Agent, or Firm: Gunnison, McKay &
`Hodgson, L.L.P.; Forrest Gunnison
`
`(57) ABSTRACT
`
`A processor reduces wasted cycle time resulting from stalling
`and idling, and increases the proportion of execution time, by
`supporting and implementing both vertical multithreading
`and horizontal multithreading. Vertical multithreading permits
`overlapping or “hiding” of cache miss wait times. In
`vertical multithreading, multiple hardware threads share the
`same processor pipeline. A hardware thread is typically a
`process, a lightweight process, a native thread, or the like in
`an operating system that supports multithreading. Horizontal
`multithreading increases parallelism within the processor
`circuit structure, for example within a single integrated circuit
`die that makes up a single-chip processor. To further increase
`system parallelism in some processor embodiments, multiple
`processor cores are formed in a single die. Advances in
`on-chip multiprocessor horizontal threading are gained as
`processor core sizes are reduced through technological
`advancements.
`
`1 Claim, 21 Drawing Sheets
`
`[Front-page representative drawing: block diagram; content not recoverable from the scan]
`
`SAMSUNG 1056
`SAMSUNG v. SMART MOBILE
`IPR2022-01004

`

`U.S. PATENT DOCUMENTS
`
`5,692,193 A   11/1997  Jagannathan et al.
`5,721,868 A    2/1998  Yung et al.
`5,724,565 A    3/1998  Dubey et al.
`5,742,806 A    4/1998  Reiner et al.
`5,752,027 A    5/1998  Familiar
`5,761,285 A    6/1998  Stent
`5,778,247 A    7/1998  Tremblay
`5,809,415 A    9/1998  Rossmann
`5,860,138 A    1/1999  Engebretsen et al.
`5,861,761 A    1/1999  Kean
`5,893,159 A    4/1999  Schneider
`5,960,458 A    9/1999  Kametani
`6,038,647 A    3/2000  Shimizu
`6,052,708 A    4/2000  Flynn et al.
`6,058,466 A    5/2000  Panwar et al.
`6,061,710 A    5/2000  Eickemeyer et al.
`6,101,599 A    8/2000  Wright et al.
`6,105,051 A    8/2000  Borkenhagen et al.
`6,122,712 A    9/2000  Torii
`6,167,507 A   12/2000  Mahalingaiah et al.
`6,205,519 B1   3/2001  Aglietti et al.
`6,233,599 B1   5/2001  Nation et al.
`6,298,431 B1  10/2001  Gottlieb
`6,420,903 B1   7/2002  Singh et al.
`6,507,862 B1   1/2003  Joy et al.
`
`OTHER PUBLICATIONS
`
`Olukotum, K., et al., The Case for a Single-Chip Multiprocessor,
`1996, ACM, pp. 2-10.*
`Hammond, L., et al., A Single-Chip Multiprocessor, Sep. 1997, IEEE,
`pp. 79-85.*
`Gulati et al., “Performance Study of a Multithreaded Superscalar
`Microprocessor”, Proceedings, International Symposium on High-
`Performance Computer Architecture, 1996, pp. 291-301.
`(XP000572068).
`Gunther, “Multithreading with Distributed Functional Units”, IEEE
`Transactions on Computers, vol. 46, No. 4, IEEE Inc., New York,
`Apr. 1, 1997, pp. 399-411. (XP000656016).
`Klass et al., “A New Family of Semidynamic and Dynamic Flip-
`Flops with Embedded Logic for High-Performance Processors”,
`IEEE Journal of Solid-State Circuits, vol. 34, No. 5, IEEE Inc., New
`York, Jun. 11, 1998, pp. 712-716. (XP002156316).
`Tremblay et al., “A Three Dimensional Register File for Superscalar
`Processors”, Proceedings of the 28th Annual Hawaii International
`Conf. on Systems Sciences, Jan. 1995, pp. 191-201.
`
`* cited by examiner
`
`

`

`U.S. Patent    Sep. 8, 2009    Sheet 1 of 21    US 7,587,581 B2
`
`[Sheet 1 of 21: FIGS. 1A and 1B, timing diagrams; drawing content not recoverable from the scan]
`
`
`
`
`
`

`

`[Sheet 2 of 21: FIGS. 2A, 2B, and 2C, timing diagrams; drawing content not recoverable from the scan]
`
`
`
`
`

`

`[Sheet 3 of 21: FIG. 3. Legible labels: Shared I$, BP, NFR + IMMU; Pipeline; Thread 0 Machine State; Shared; UPA Bus 526; E$ Bus 524]
`
`

`

`[Sheet 4 of 21: FIGS. 4A, 4B, and 4C. Legible labels: se, tid, clk, sclk; reference numerals 412, 414]
`
`

`

`[Sheet 5 of 21: FIG. 5 (500, 510): per-thread bit storage selected by TID. FIG. 6: Thread Select Logic 610 with inputs L1_load_miss, L1_inst_miss, inst_buf_empty, exception_mode, MT_mode, thread_priority]
`
`

`

`[Sheet 6 of 21: FIG. 7A (700): Data Cache, Thread 0 (710) and Data Cache, Thread 1 (712). FIG. 7B: Cache VA 720 with tag bits VA[63:7] (722), index bits TID + VA[6:5] (724), byte offset VA[4:0] (726)]
`
`

`

`[Sheet 7 of 21: FIG. 8. Legible labels: address fields with cache index and byte offset (bit positions 63, 14, 13, 12, 5, 4, 0); new cache index; 16 kB DM cache; Thread 0: VA=A, PA=B; Thread 1 (n+256): VA=A', PA=B; reference numerals 812, 820, 822, 823, 824, 826]
`
`

`

`[Sheet 8 of 21: FIG. 9 (rotated on the page). Legible labels: shared, pipeline, E$ bus, UPA bus; remaining content not recoverable from the scan]
`
`

`

`[Sheet 9 of 21: FIG. 10 (1000). Legible labels: Thread (USIII pipeline); E$ Control Unit (ECU); L2$ SRAM 1024; DRAM 1054 and UPA 1026; reference numeral 1032]
`
`

`

`[Sheet 10 of 21: FIG. 11 (rotated on the page); drawing content not recoverable from the scan]
`
`
`

`

`[Sheet 11 of 21: FIG. 12. Legible labels: Instruction Cache (16K plus next field, branch hist); Instr TLB (64 entries); Predecode Unit; Integer Registers (8 windows, 7 read, 3 write ports); Floating-Point Registers (5 read ports, 3 write ports); Data Cache (16K); Cache Control/System Interface; Physical Addr; reference numerals 1210, 1212, 1220, 1222, 1224]
`
`

`

`[Sheet 12 of 21: FIG. 13, “A Three Dimensional Register File”. Legible labels: Bit Line; Register Index; Current Window pointer Decode; reference numerals 1312, 1314]
`
`

`

`[Sheet 13 of 21: FIGS. 14 and 15 (rotated on the page); drawing content not recoverable from the scan]
`
`
`
`
`
`
`

`

`[Sheet 14 of 21: FIG. 16 (part, rotated on the page); drawing content not recoverable from the scan]
`
`

`

`[Sheet 15 of 21: FIG. 16 (continued, rotated on the page); drawing content not recoverable from the scan]
`
`

`

`[Sheet 16 of 21: FIG. 17A. Legible labels: Window (callee); locals 1720; function call 1712; Current Window (caller) 1710]
`
`

`

`[Sheet 17 of 21: FIG. 17B (rotated on the page). Legible labels: window 1, window 2, window 3; remaining content not recoverable from the scan]
`
`
`
`

`

`[Sheet 18 of 21: Key to FIG. 18 (FIG. 18A | FIG. 18B over FIG. 18C | FIG. 18D). FIG. 18A: signals rps0_out to rps6_out, rps0_in to rps6_in, rcwp0 to rcwp7]
`
`

`

`[Sheet 19 of 21: FIG. 18B; drawing content not recoverable from the scan]
`
`

`

`[Sheet 20 of 21: FIG. 18C; reference numeral 1800; remaining content not recoverable from the scan]
`
`

`

`[Sheet 21 of 21: FIG. 18D; reference numerals 1810, 1820; remaining content not recoverable from the scan]
`
`

`

`MULTIPLE-THREAD PROCESSOR WITH
`IN-PIPELINE, THREAD SELECTABLE
`STORAGE
`
`CROSS-REFERENCE TO RELATED
`APPLICATION
`
`This application is a continuation of U.S. patent application
`Ser. No. 10/403,406, entitled “MULTIPLE-THREAD PROCESSOR
`WITH IN-PIPELINE, THREAD SELECTABLE STORAGE,” in the name
`of inventors William N. Joy, Marc Tremblay, Gary Lauterbach,
`and Joseph I. Chamdani, filed on Mar. 31, 2003, now U.S. Pat.
`No. 7,185,185, which is incorporated herein by reference in
`its entirety, which was a continuation of application Ser.
`No. 09/309,734, filed May 11, 1999, now U.S. Pat. No.
`6,542,991, which application is incorporated herein by
`reference in its entirety.
`In addition, the present invention is related to subject
`matter disclosed in the following commonly-owned, co-pending
`patent applications:
`1. U.S. patent application Ser. No. 09/309,732, entitled
`“Processor with Multiple-Thread, Vertically-Threaded
`Pipeline”, naming William Joy, Marc Tremblay, Gary
`Lauterbach, and Joseph Chamdani as inventors and filed
`May 11, 1999;
`2. U.S. patent application Ser. No. 09/309,731, entitled
`“Vertically-Threaded Processor with Multi-Dimensional
`Storage”, naming William Joy, Marc Tremblay,
`Gary Lauterbach, and Joseph Chamdani as inventors
`and filed May 11, 1999;
`3. U.S. patent application Ser. No. 09/309,735, entitled
`“Switching Method in a Multi-Threaded Processor”,
`naming William Joy, Marc Tremblay, Gary Lauterbach,
`and Joseph Chamdani as inventors and filed May 11,
`1999; and
`4. U.S. patent application Ser. No. 09/309,733, entitled
`“Thread Switch Logic in a Multiple-Thread Processor”,
`naming William Joy, Marc Tremblay, Gary Lauterbach,
`and Joseph Chamdani as inventors and filed May 11,
`1999.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`The present invention relates to processor or computer
`architecture. More specifically, the present invention relates
`to multiple-threading processor architectures and methods of
`operation and execution.
`2. Description of the Related Art
`In many commercial computing applications, a large
`percentage of time elapses during pipeline stalling and idling,
`rather than in productive execution, due to cache misses and
`latency in accessing external caches or external memory
`following the cache misses. Stalling and idling are most
`detrimental, due to frequent cache misses, in database handling
`operations such as OLTP, DSS, data mining, financial
`forecasting, mechanical and electronic computer-aided design
`(MCAD/ECAD), web servers, data servers, and the like.
`Thus, although a processor may execute at high speed, much
`time is wasted while idly awaiting data.
`One technique for reducing stalling and idling is hardware
`multithreading to achieve processor execution during
`otherwise idle cycles. Hardware multithreading involves
`replication of some processor resources, for example replication
`of architected registers, for each thread. Replication is not
`required for most processor resources, including instruction
`and data caches, translation look-aside buffers (TLB),
`instruction fetch and dispatch elements, branch units,
`execution units, and the like.
`Unfortunately, duplication of resources is costly in terms of
`integrated circuit consumption and performance.
`Accordingly, improved multithreading circuits and operating
`methods are needed that are economical in resources and
`avoid costly overhead which reduces processor performance.
`
`SUMMARY OF THE INVENTION
`
`A processor reduces wasted cycle time resulting from stalling
`and idling, and increases the proportion of execution time,
`by supporting and implementing both vertical multithreading
`and horizontal multithreading. Vertical multithreading permits
`overlapping or “hiding” of cache miss wait times. In
`vertical multithreading, multiple hardware threads share the
`same processor pipeline. A hardware thread is typically a
`process, a lightweight process, a native thread, or the like in
`an operating system that supports multithreading. Horizontal
`multithreading increases parallelism within the processor
`circuit structure, for example within a single integrated circuit
`die that makes up a single-chip processor. To further increase
`system parallelism in some processor embodiments, multiple
`processor cores are formed in a single die. Advances in
`on-chip multiprocessor horizontal threading are gained as
`processor core sizes are reduced through technological
`advancements.
`
`The described processor structure and operating method
`may be implemented in many structural variations. For
`example, two processor cores are combined with an on-chip
`set-associative L2 cache in one system. In another example,
`four processor cores are combined with a direct RAMBUS
`interface with no external L2 cache. A countless number of
`variations are possible. In some systems, each processor core
`is a vertically-threaded pipeline.
`In a further aspect of some multithreading system and
`method embodiments, a computing system may be config-
`ured in many different processor variations that allocate
`execution among a plurality of execution threads. For
`example, in a “1C2T” configuration, a single processor die
`includes two vertical threads. In a “4C4T” configuration, a
`four-processor multiprocessor is formed on a single die with
`each of the four processors being four-way vertically
`threaded. Countless other “nCkT” structures and combinations
`may be implemented on one or more integrated circuit
`dies depending on the fabrication process employed and the
`applications envisioned for the processor. Various systems
`may include caches that are selectively configured, for
`example as segregated L1 caches and segregated L2 caches,
`or segregated L1 caches and shared L2 caches, or shared L1
`caches and shared L2 caches.
`
`In an aspect of some multithreading system and method
`embodiments, in response to a cache miss stall a processor
`freezes the entire pipeline state of an executing thread. The
`processor executes instructions and manages the machine
`state of each thread separately and independently. The
`functional properties of an independent thread state are stored
`throughout the pipeline, extending to the pipeline registers, to
`enable the processor to postpone execution of a stalling
`thread, relinquish the pipeline to a previously idle thread, and
`later resume execution of the postponed stalling thread at the
`precise state of the stalling thread immediately prior to the
`thread switch.
`
`In another aspect of some multithreading system and
`method embodiments, a processor includes a “four-dimensional”
`register structure in which register file structures are
`replicated by N for vertical threading in combination with a
`three-dimensional storage circuit. The multi-dimensional
`storage is formed by constructing a storage, such as a register
`file or memory, as a plurality of two-dimensional storage
`planes.
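As a rough behavioral illustration of the indexing implied above, the following Python sketch models storage addressed by thread, storage plane (window), and register index, with the bit position inside each value as the remaining dimension. The class name, method names, and sizes are assumptions chosen for illustration and do not come from the patent text.

```python
class FourDRegisterFile:
    """Toy model: storage[thread][window][register] holds a multi-bit value.

    Hypothetical illustration only; a hardware register file is an array of
    bit cells, not nested Python lists.
    """

    def __init__(self, threads: int, windows: int, regs: int, width: int = 64) -> None:
        self.mask = (1 << width) - 1
        # One stack of two-dimensional planes (window x register) per thread.
        self.storage = [
            [[0] * regs for _ in range(windows)] for _ in range(threads)
        ]

    def write(self, tid: int, window: int, reg: int, value: int) -> None:
        self.storage[tid][window][reg] = value & self.mask

    def read(self, tid: int, window: int, reg: int) -> int:
        return self.storage[tid][window][reg]


rf = FourDRegisterFile(threads=2, windows=8, regs=16)
rf.write(0, 3, 5, 0xABCD)   # thread 0, window 3, register 5
rf.write(1, 3, 5, 0x1234)   # same window and register, different thread
print(hex(rf.read(0, 3, 5)), hex(rf.read(1, 3, 5)))
```

The point of the sketch is only that the thread identifier acts as one more index, so the same window and register number name distinct storage for each thread.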
`In another aspect of some multithreading system and
`method embodiments, a processor implements N-bit flip-flop
`global substitution. To implement multiple machine states,
`the processor converts 1-bit flip-flops in storage cells of the
`stalling vertical thread to an N-bit global flip-flop where N is
`the number of vertical threads.
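The N-bit flip-flop substitution can be pictured with a small behavioral model. The sketch below is an assumption-laden illustration, not the circuit of the patent: the class and method names are hypothetical, and a clock edge is reduced to a method call.

```python
class ThreadSelectableFlipFlop:
    """Behavioral model: one stored bit per thread, selected by a thread ID."""

    def __init__(self, num_threads: int) -> None:
        if num_threads < 1:
            raise ValueError("num_threads must be at least 1")
        self.bits = [0] * num_threads  # one storage bit per vertical thread
        self.tid = 0                   # currently selected thread

    def select(self, tid: int) -> None:
        if not 0 <= tid < len(self.bits):
            raise ValueError("thread ID out of range")
        self.tid = tid

    def clock(self, d: int) -> None:
        """Capture input d into the selected thread's bit only."""
        self.bits[self.tid] = d & 1

    @property
    def q(self) -> int:
        """Output reflects the selected thread's stored bit."""
        return self.bits[self.tid]


ff = ThreadSelectableFlipFlop(num_threads=2)
ff.clock(1)     # thread 0 stores 1
ff.select(1)
ff.clock(0)     # thread 1 stores 0; thread 0's bit is untouched
ff.select(0)
print(ff.q)     # thread 0 sees its own saved bit again
```

Because every state element keeps a private bit per thread, switching threads in this model is just a change of the select value; nothing is copied out or restored.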
`In one aspect of some processor and processing method
`embodiments, the processor improves throughput efficiency
`and exploits increased parallelism by introducing
`multithreading to an existing and mature processor core. The
`multithreading is implemented in two steps including vertical
`multithreading and horizontal multithreading. The processor
`core is retrofitted to support multiple machine states. System
`embodiments that exploit retrofitting of an existing processor
`core advantageously leverage hundreds of man-years of
`hardware and software development by extending the lifetime of
`a proven processor pipeline generation.
`In another aspect of some multithreading system and
`method embodiments, a processor includes a thread switching
`control logic that performs a fast thread-switching operation
`in response to an L1 cache miss stall. The fast
`thread-switching operation implements one or more of several
`thread-switching methods. A first thread-switching operation
`is “oblivious” thread-switching every N cycles, in which the
`individual flip-flops locally determine a thread switch without
`notification of stalling. The oblivious technique avoids
`usage of an extra global interconnection between threads for
`thread selection. A second thread-switching operation is
`“semi-oblivious” thread-switching for use with an existing
`“pipeline stall” signal (if any). The pipeline stall signal
`operates in two capacities, first as a notification of a pipeline
`stall, and second as a thread select signal between threads so
`that, again, usage of an extra global interconnection between
`threads for thread selection is avoided. A third
`thread-switching operation is an “intelligent global scheduler”
`thread-switching in which a thread switch decision is based on a
`plurality of signals including: (1) an L1 data cache miss stall
`signal, (2) an instruction buffer empty signal, (3) an L2 cache
`miss signal, (4) a thread priority signal, (5) a thread timer
`signal, (6) an interrupt signal, or other sources of triggering.
`In some embodiments, the thread select signal is broadcast as
`fast as possible, similar to a clock tree distribution. In some
`systems, a processor derives a thread select signal that is
`applied to the flip-flops by overloading a scan enable (SE)
`signal of a scannable flip-flop.
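Two of the switching policies named above can be sketched in software to make the decision inputs concrete. This is a minimal sketch under stated assumptions: the field names, the tie-breaking rule, and the choice to stay on the current thread when no other thread is ready are all hypothetical, and the interrupt input is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThreadStatus:
    """Per-thread inputs to the select logic (hypothetical field names)."""
    l1_data_miss: bool = False
    inst_buf_empty: bool = False
    l2_miss: bool = False
    timer_expired: bool = False
    priority: int = 0

    def stalled(self) -> bool:
        return self.l1_data_miss or self.inst_buf_empty or self.l2_miss


def oblivious_select(cycle: int, num_threads: int, n: int) -> int:
    """'Oblivious' policy: rotate threads every n cycles, ignoring stalls."""
    return (cycle // n) % num_threads


def intelligent_select(current: int, threads: List[ThreadStatus]) -> int:
    """'Intelligent' policy: switch away only on a stall or timer expiry."""
    cur = threads[current]
    if not cur.stalled() and not cur.timer_expired:
        return current
    ready = [i for i, t in enumerate(threads) if i != current and not t.stalled()]
    if not ready:
        return current  # nothing better to run (assumed behavior)
    # Highest priority wins; ties go to the lowest thread ID (assumed rule).
    return max(ready, key=lambda i: (threads[i].priority, -i))


threads = [ThreadStatus(l1_data_miss=True), ThreadStatus(priority=1), ThreadStatus(priority=2)]
print(intelligent_select(0, threads))                       # thread 0 missed, so pick another
print([oblivious_select(c, 2, 4) for c in (0, 3, 4, 8)])    # rotation every 4 cycles
```

The oblivious function needs only a cycle count, which mirrors the text's point that no global stall notification is required, while the intelligent function consumes the per-thread status signals.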
`In an additional aspect of some multithreading system and
`method embodiments, a processor includes anti-aliasing
`logic coupled to an L1 cache so that the L1 cache is shared
`among threads via anti-aliasing. The L1 cache is a
`virtually-indexed, physically-tagged cache that is shared among
`threads. The anti-aliasing logic avoids hazards that result
`from multiple virtual addresses mapping to one physical
`address. The anti-aliasing logic selectively invalidates or
`updates duplicate L1 cache entries.
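The aliasing hazard and the invalidate-on-duplicate response can be illustrated with a toy direct-mapped cache model. The sketch assumes 512 lines of 32 bytes (a 16 KB cache) and scans every line for a duplicate physical tag, which is a software simplification; hardware would be expected to probe only the few candidate alias locations. All names are hypothetical.

```python
class VIPTCache:
    """Toy virtually-indexed, physically-tagged, direct-mapped cache."""

    def __init__(self, num_lines: int = 512, line_bytes: int = 32) -> None:
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.lines = {}  # cache index -> physical line address (the tag)

    def _index(self, va: int) -> int:
        # The index comes from the virtual address.
        return (va // self.line_bytes) % self.num_lines

    def fill(self, va: int, pa: int) -> int:
        """Install the line for (va, pa), invalidating aliased duplicates."""
        pline = pa // self.line_bytes
        idx = self._index(va)
        for other, tag in list(self.lines.items()):
            if other != idx and tag == pline:
                del self.lines[other]  # same physical line cached elsewhere
        self.lines[idx] = pline
        return idx

    def copies_of(self, pa: int) -> list:
        pline = pa // self.line_bytes
        return sorted(i for i, tag in self.lines.items() if tag == pline)


cache = VIPTCache()
cache.fill(va=0x0000, pa=0x8000)   # one thread maps VA 0x0000 to PA 0x8000
cache.fill(va=0x2000, pa=0x8000)   # another maps VA 0x2000 to the same PA
print(cache.copies_of(0x8000))     # only one copy of the physical line remains
```

The sketch shows the invalidate option only; the text also allows updating the duplicate entry instead.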
`In another aspect of some multithreading system and
`method embodiments, a processor includes logic for attaining
`a very fast exception handling functionality while executing
`non-threaded programs by invoking a multithreaded-type
`functionality in response to an exception condition. The
`processor, while operating in multithreaded conditions or while
`executing non-threaded programs, progresses through
`multiple machine states during execution. The very fast exception
`handling logic includes connection of an exception signal line
`to thread select logic, causing an exception signal to evoke a
`switch in thread and machine state. The switch in thread and
`machine state causes the processor to enter and to exit the
`exception handler immediately, without waiting to drain the
`pipeline or queues and without the inherent timing penalty of
`the operating system’s software saving and restoring of
`registers.
`An additional aspect of some multithreading systems and
`methods is a thread reservation system or thread locking
`system in which a thread pathway is reserved for usage by a
`selected thread. A thread control logic may select a particular
`thread that is to execute with priority in comparison to other
`threads. A high priority thread may be associated with an
`operation with strict time constraints, or an operation that is
`frequently and predominantly executed in comparison to
`other threads. The thread control logic controls the
`thread-switching operation so that a particular hardware thread is
`reserved for usage by the selected thread.
`In another aspect of some multithreading system and
`method embodiments, a processor includes logic supporting
`lightweight processes and native threads. The logic includes a
`block that disables thread ID tagging and disables cache
`segregation since lightweight processes and native threads
`share the same virtual tag space.
`In another aspect of some multithreading system and
`method embodiments, a processor includes logic for tagging
`a thread identifier (TID) for usage with processor blocks that
`are not stalled. Pertinent non-stalling blocks include caches,
`translation look-aside buffers (TLB), a load buffer
`asynchronous interface, an external memory management unit (MMU)
`interface, and others.
`In another aspect of some multithreading system and
`method embodiments, a processor includes a cache that is
`segregated into a plurality of N cache parts. Cache
`segregation avoids interference, “pollution”, or “cross-talk” between
`threads. One technique for cache segregation utilizes logic for
`storing and communicating thread identification (TID) bits.
`The cache utilizes cache indexing logic. For example, the TID
`bits can be inserted at the most significant bits of the cache
`index.
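Placing the TID bits at the top of the cache index can be shown with a short address-arithmetic sketch. The default field widths below (5 offset bits, 9 index bits, 1 TID bit, as for a 16 KB direct-mapped cache with 32-byte lines shared by two threads) are assumptions made for illustration, as is the function name.

```python
def segregated_cache_index(va: int, tid: int, offset_bits: int = 5,
                           index_bits: int = 9, tid_bits: int = 1) -> int:
    """Form a cache index whose most significant bits are the thread ID.

    The remaining low-order index bits come from the virtual address just
    above the byte offset. All widths are illustrative assumptions.
    """
    if not 0 <= tid < (1 << tid_bits):
        raise ValueError("thread ID does not fit in tid_bits")
    va_index_bits = index_bits - tid_bits
    va_index = (va >> offset_bits) & ((1 << va_index_bits) - 1)
    return (tid << va_index_bits) | va_index


va = 0x1234
print(segregated_cache_index(va, tid=0), segregated_cache_index(va, tid=1))
```

With these widths each thread owns a private half of the cache: the same virtual address maps to index n for thread 0 and to n + 256 for thread 1, so one thread cannot evict the other's lines.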
`
`In a further additional aspect of some embodiments of the
`multithreading system and method, some processors include
`a thread reservation functionality.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The features of the described embodiments are specifically
`set forth in the appended claims. However, embodiments of
`the invention relating to both structure and method of
`operation may best be understood by referring to the following
`description and accompanying drawings.
`FIGS. 1A and 1B are timing diagrams respectively
`illustrating execution flow of a single-thread processor and a
`vertical multithread processor.
`FIGS. 2A, 2B, and 2C are timing diagrams respectively
`illustrating execution flow of a single-thread processor, a
`vertical multithread processor, and a vertical and horizontal
`multithread processor.
`FIG. 3 is a schematic functional block diagram depicting a
`design configuration for a single-processor vertically-threaded
`processor that is suitable for implementing various
`multithreading techniques and system implementations that
`improve multithreading performance and functionality.
`FIGS. 4A, 4B, and 4C are diagrams showing an embodiment
`of a pulse-based high-speed flip-flop that is advantageously
`used to attain multithreading in an integrated circuit.
`FIG. 4A is a schematic block diagram illustrating control and
`storage blocks of a circuit employing high-speed multiple-bit
`flip-flops. FIG. 4B is a schematic circuit diagram that shows
`a multiple-bit bistable multivibrator (flip-flop) circuit. FIG.
`4C is a timing diagram illustrating timing of the multiple-bit
`flip-flop.
`FIG. 5 is a schematic block diagram illustrating an N-bit
`“thread selectable” flip-flop substitution logic that is used to
`create vertically multithreaded functionality in a processor
`pipeline while maintaining the same circuit size as a
`single-threaded pipeline.
`FIG. 6 is a schematic block diagram illustrating a thread
`switch logic which rapidly generates a thread identifier (TID)
`signal identifying an active thread of a plurality of threads.
`FIGS. 7A and 7B are, respectively, a schematic block
`diagram showing an example of a segregated cache and a
`pictorial diagram showing an example of an addressing technique
`for the segregated cache.
`FIG. 8 is a schematic block diagram showing a suitable
`anti-aliasing logic for usage in various processor
`implementations including a cache, such as an L1 cache, an L2 cache,
`or others.
`FIG. 9 is a schematic functional block diagram depicting a
`design configuration for a single-chip dual-processor verti-
`cally-threaded processor that is suitable for implementing
`various multithreading techniques and system implementa-
`tions that improve multithreading performance and function-
`ality.
`FIG. 10 is a schematic functional block diagram depicting
`an alternative design configuration for a single-processor
`vertically-threaded processor that is suitable for implementing
`various multithreading techniques and system implementa-
`tions that improve multithreading performance and function-
`ality.
`FIG. 11 is a schematic functional block diagram depicting
`an alternative design configuration for a single-chip dual-
`processor vertically-threaded processor that is suitable for
`implementing various multithreading techniques and system
`implementations that improve multithreading performance
`and functionality.
`FIG. 12 is a schematic block diagram illustrating a proces-
`sor and processor architecture that are suitable for imple-
`menting various multithreading techniques and system
`implementations that improve multithreading performance
`and functionality.
`FIG. 13 is a schematic perspective diagram showing a
`multi-dimensional register file.
`FIG. 14 is a schematic circuit diagram showing a
`conventional implementation of register windows.
`FIG. 15 is a schematic circuit diagram showing a plurality
`of bit cells of a register window of the multi-dimensional
`register file that avoids waste of integrated circuit area by
`exploiting the condition that only one window is read and
`only one window is written at one time.
`FIG. 16 is a schematic circuit diagram illustrating a suitable
`bit storage circuit storing one bit of the local registers for the
`multi-dimensional register file with eight windows.
`FIGS. 17A and 17B are, respectively, a schematic pictorial
`diagram and a schematic block diagram illustrating sharing of
`registers among adjacent windows.
`FIG. 18 is a schematic circuit diagram illustrating an
`implementation of a multi-dimensional register file for
`registers shared across a plurality of windows.
`The use of the same reference symbols in different
`drawings indicates similar or identical items.
`
`DESCRIPTION OF THE EMBODIMENT(S)
`
`Referring to FIGS. 1A and 1B, two timing diagrams
`respectively illustrate execution flow 110 in a single-thread
`processor and instruction flow 120 in a vertical multithread
`processor. Processing applications such as database
`applications spend a significant portion of execution time stalled
`awaiting memory servicing. FIG. 1A is a highly schematic
`timing diagram showing execution flow 110 of a
`single-thread processor executing a database application. In an
`illustrative example, the single-thread processor is a four-way
`superscalar processor. Shaded areas 112 correspond to
`periods of execution in which the single-thread processor core
`issues instructions. Blank areas 114 correspond to time
`periods in which the single-thread processor core is stalled
`waiting for data or instructions from memory or an external cache.
`A typical single-thread processor executing a typical
`database application executes instructions about 30% of the time
`with the remaining 70% of the time elapsed in a stalled
`condition. The 30% utilization rate exemplifies the inefficient
`usage of resources by a single-thread processor.
`FIG. 1B is a highly schematic timing diagram showing
`execution flow 120 of similar database operations by a
`multithread processor. Applications such as database applications
`have a large amount of inherent parallelism due to the heavy
`throughput orientation of database applications and the
`common database functionality of processing several
`independent transactions at one time. The basic concept of exploiting
`multithread functionality involves utilizing processor
`resources efficiently when a thread is stalled by executing
`other threads while the stalled thread remains stalled. The
`execution flow 120 depicts a first thread 122, a second thread
`124, a third thread 126 and a fourth thread 128, all of which
`are shown with shading in the timing diagram. As one thread
`stalls, for example first thread 122, another thread, such as
`second thread 124, switches into execution on the otherwise
`unused or idle pipeline. Blank areas 130 correspond to idle
`times when all threads are stalled. Overall processor
`utilization is significantly improved by multithreading. The
`illustrative technique of multithreading employs replication of
`architected registers for each thread and is called “vertical
`multithreading”.
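A back-of-the-envelope estimate shows why adding threads raises utilization. The sketch below assumes, purely for illustration, that each thread is independently ready to issue 30% of the time (the single-thread figure quoted earlier) and that switching is free; the pipeline is then idle only when every thread is stalled at once. The function name and the independence assumption are not from the source, and the result is a rough estimate rather than a measured figure.

```python
def pipeline_utilization(busy_fraction: float, threads: int) -> float:
    """Estimate utilization as 1 - P(all threads stalled).

    Assumes each thread is ready with probability busy_fraction,
    independently of the others, and that thread switches cost nothing.
    """
    if not 0.0 <= busy_fraction <= 1.0 or threads < 1:
        raise ValueError("invalid arguments")
    return 1.0 - (1.0 - busy_fraction) ** threads


for t in (1, 2, 4):
    print(t, round(pipeline_utilization(0.3, t), 4))
```

Under these assumptions the idle fraction shrinks geometrically with the thread count, which matches the qualitative picture of FIG. 1B, where blank areas remain only when all four threads are stalled.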
`Vertical multithreading is advantageous in processing
`applications in which frequent cache misses result in heavy
`clock penalties. When cache misses cause a first thread to
`stall, vertical multithreading permits a second thread to
`execute when the processor would otherwise remain idle. The
`second thread thus takes over execution of the pipeline. A
`context switch from the first thread to the second thread
`involves saving the useful states of the first thread and assigning
`new states to the second thread. When the first thread
`restarts after stalling, the saved states are returned and the first
`thread proceeds in execution. Vertical multithreading
`imposes costs on a processor in resources used for saving and
`restoring thread states.
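The save-and-restore step of a context switch can be sketched in software as follows. The names `ThreadState`, `Pipeline`, `pc`, and `regs` are hypothetical stand-ins for whatever "useful states" a real processor keeps; the sketch assumes eight registers per thread and says nothing about how the hardware actually stores thread state.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """Hypothetical per-thread architected state (program counter + registers)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 8)


class Pipeline:
    """One pipeline with a single active state and a saved copy per thread."""

    def __init__(self, num_threads):
        self.saved = [ThreadState() for _ in range(num_threads)]
        self.active_id = 0
        self.active = ThreadState()

    def context_switch(self, next_id):
        # Save the useful state of the thread that is stalling ...
        self.saved[self.active_id] = ThreadState(self.active.pc, list(self.active.regs))
        # ... and load the state of the thread taking over the pipeline.
        nxt = self.saved[next_id]
        self.active = ThreadState(nxt.pc, list(nxt.regs))
        self.active_id = next_id
```

The two copies made on every switch correspond to the cost the text attributes to saving and restoring thread states.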
`Referring to FIGS. 2A, 2B, and 2C, three highly schematic
`timing diagrams respectively illustrate execution flow 210 of
`a single-thread processor, execution flow 230 of a vertical
`multithread processor, and execution flow 250 of a combined
`vertical and horizontal multithread processor. In FIG. 2A,
`shaded areas 212 showing periods of execution and blank
`areas 214 showing time periods in which the single-thread
`processor core is idle due to stall illustrate the inefficiency of
`a single-thread processor.
`In FIG. 2B, execution flow 230 in a vertical threaded processor
`includes execution of a first thread 232, and a second
`thread 234, both shaded in the timing diagram, and an idle
`time shown in a blank area 240. Efficient instruction execution
`proceeds as one thread stalls and, in response to the stall,
`another thread switches into execution on the otherwise
`unused or idle pipeline. In the blank areas 240, an idle time
`occurs when all threads are stalled. A vertical multithread
`processor maintains a separate processing state for T executing
`threads. Only one of the threads is active at one time. The
`vertical multithreaded processor switches execution to
`another thread on a cache miss, for example an L1 cache miss.
`A horizontal threaded processor, using a technique called
`chip-multiple processing, combines multiple processors on a
`single integrated circuit die. The multiple processors are vertically
`threaded to form a processor with both vertical and
`horizontal threading, augmenting executing efficiency and
`decreasing latency in a multiplicative fash
