(12) United States Patent
Tremblay et al.

(10) Patent No.: US 6,615,338 B1
(45) Date of Patent: Sep. 2, 2003

(54) CLUSTERED ARCHITECTURE IN A VLIW PROCESSOR

(75) Inventors: Marc Tremblay, Menlo Park, CA (US); William Joy, Aspen, CO (US)

(73) Assignee: Sun Microsystems, Inc., Palo Alto, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/204,584

(22) Filed: Dec. 3, 1998

(58) Field of Search: 712/24, 23, 217, ...

(56) References Cited

U.S. PATENT DOCUMENTS

[illegible]
5,301,340 A  *   4/1994   Cook ............... 395/800
5,467,476 A     11/1995   Kawasaki ........... 395/800
5,530,817 A      6/1996   Masubuchi .......... 395/375
5,542,059 A      7/1996   Blomgren ........... 395/375
5,657,291 A      8/1997   Podlesny et al.
5,721,868 A      2/1998   Yung et al. ........ 395/476
5,761,475 A      6/1998   Yung et al. ........ 395/394
5,764,943 A      6/1998   Wechsler ........... 395/394
[illegible]
5,901,301 A  *   5/1999   Matsuo et al. ...... 395/388
6,076,159 A  *   6/2000   Fleck et al. ....... 712/241
6,170,051 B1 *   1/2001   Dowling ............ 712/225

FOREIGN PATENT DOCUMENTS

EP   0 730 223   9/1994   G06F/9/38
EP   0 653 703   5/1995   G06F/9/38

OTHER PUBLICATIONS

Findlay et al., “HARP: A VLIW RISC Processor”, IEEE, pp. 368-372, 1991.*

Keckler et al.: “Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism”, Proceedings of the Annual International Symposium on Computer Architecture, US, New York, IEEE, vol. Symp. 19, 1992, pp. 202-213, XP000325804, ISBN: 0-89791-510-6.

Steven et al.: “iHARP: a multiple instruction issue processor”, IEE Proceedings E. Computers & Digital Techniques, vol. 139, No. 5, Sep. 1992, pp. 439-449, XP000319892, Institution of Electrical Engineers, Stevenage, GB, ISSN: 1350-2387.

* cited by examiner

Primary Examiner: Emanuel Todd Voeltz
(74) Attorney, Agent, or Firm: Zagorin, O'Brien & Graham, LLP

(57) ABSTRACT

A Very Long Instruction Word (VLIW) processor has a clustered architecture including a plurality of independent functional units and a multi-ported register file that is divided into a plurality of separate register file segments, the register file segments being individually associated with the plurality of independent functional units. The functional units access the respective associated register file segments using read operations that are local to the functional unit/register file segment pairs. In contrast, the functional units access the register file segments using write operations that are broadcast to a plurality of register file segments. Independence between clusters is attained since the separate clustered functional unit/register file segment pairs have local (internal) bypassing that allows internal computations to proceed, but have only limited bypassing between different functional unit/register file segment pair clusters. Thus a particular functional unit/register segment pair does not bypass to all other functional unit/register segment pairs.

25 Claims, 18 Drawing Sheets
`
[Front-page drawing. Recoverable labels: 110, 112; MPU1, MPU2; Instruction Cache (210); Instruction Aligner (212); Instruction Buffer (214); PCU (226); Register Files; Load/Store Unit; Shared Data Cache and Synchronization Area.]
`
`
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 1 of 18    US 6,615,338 B1

[FIG. 1: graph. Axis labels: ILP, SIZE. Plots: 10, 12.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 2 of 18    US 6,615,338 B1

[FIG. 2: drawing text not legible.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 3 of 18    US 6,615,338 B1

[FIG. 3: block diagram. Labels: 110, 112; Instruction Cache (210); Instruction Aligner (212); Instruction Buffer (214); PCU; MFU3, MFU2, MFU1, GFU; Register Files; Load/Store Unit (218); Shared Data Cache and Synchronization Area.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 4 of 18    US 6,615,338 B1

[FIG. 4: labels 216; Broadcast Writes (5); 3 Read Ports (four instances).]

[Drawing without a legible figure label: Global Registers; 12R/4W or 12R/5W.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 5 of 18    US 6,615,338 B1

[FIG. 6: labels Data In (up to 3 32-bit registers); 5 Cycles (globally visible latency); Hardware bypasses provide shortest internal latency (4, 2 and 1 cycle latency); Data Out (one 32-bit register).]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 6 of 18    US 6,615,338 B1

[FIG. 7: drawing text not legible.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 7 of 18    US 6,615,338 B1

[FIG. 8A: labels rs2; 226; Pipe Ld/St; Located in the Pipe control Unit.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 8 of 18    US 6,615,338 B1

[FIG. 8B: drawing text not legible.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 9 of 18    US 6,615,338 B1

[FIG. 8C and FIG. 9: reference numeral 800; remaining drawing text not legible.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 10 of 18    US 6,615,338 B1

[FIG. 10A: table with columns group, mfu3, mfu2, mfu1, gfu.]

[FIG. 10B: pipeline timing diagram by cycle. Stage labels: D, E1, E2, X, E3, E4, T, WB and D, E, A1, A2, A3, T, WB.]

[FIG. 10C: drawing text not legible.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 11 of 18    US 6,615,338 B1

[FIG. 11A: pipeline timing diagram, cycles 0 to 12. Rows include mfu1, helper, and gfu entries. Stage labels: D, E1, E2, X1, X2, X3, E3, E4, T, WB and A1, A2, A3.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 12 of 18    US 6,615,338 B1

[FIG. 11B: drawing text not legible.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 13 of 18    US 6,615,338 B1

[FIG. 11C: pipeline timing diagram. Rows include mfu3_1, helper, and gfu entries. Stage labels: D, E1, E2, X1, X2, X3, E3, E4, T, WB and A1, A2, A3.]
`
`
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 14 of 18    US 6,615,338 B1

[FIG. 12A: routing diagram. Labels: RF0; 226; GFU; 222; PCU_GFUX_DP; PCU_GFUX_MC; 1110; 1130; horizontal = 3 x 32; horizontal buses = 11 x 32; vertical routing = 13 x 32. Signals: ifu_pcu_pc; mfu2x_data_t; mfu3x_data_t; spr_data_a2; mfu1_pcu_data_e; mfu1_pcu_data_e2; mfu1_pcu_data_e4; dcu_pcu_dc_data[31:0]; dcu_pcu_dc_data[63:32]; lsu_pcu_ndc_data[31:0]; lsu_pcu_ndc_data[63:32]; pcu_rs1_data; pcu_rs2_data; pcu_strd_data; pcu_strd1_data.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 15 of 18    US 6,615,338 B1

[FIG. 12B: routing diagram. Labels: RF1 (decoding); 226; MFU1; 220; PCU_MFU1X_MC; PCU_MFU1X_DP; 1120; 1132; horizontal = 4; horizontal buses = 13. Signals: spr_data_a2; mfu1x_data_t; mfu2x_data_t; mfu3x_data_t; rf1_pcu_rs1_data; rf1_pcu_rs2_data; rf1_pcu_rs3_data; dcu_pcu_dc_data[31:0]; dcu_pcu_dc_data[63:32]; lsu_pcu_ndc_data[31:0]; lsu_pcu_ndc_data[63:32]; gfux_data_a1; gfux_data_a4; ldx1_data; ldx1m_data; pcu_mfu1_rs1_data; pcu_mfu1_rs2_data; pcu_mfu1_rs3_data; mfu1_pcu_data_e; mfu1_pcu_data_e2; mfu1_pcu_data_e4; gfu_pcu_data_e; gfu_pcu_data_e6e34; pcu_strd_data.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 16 of 18    US 6,615,338 B1

[FIG. 12C: routing diagram. Labels: 226; MFU2; 220; PCU_MFU2X_MC; PCU_MFU2X_DP; 1122; Horizontal buses = 4 (two instances). Signals: mfu1x_data_t; mfu2x_data_t; mfu3x_data_t; spr_data_a2; rf2_pcu_rs1_data; rf2_pcu_rs2_data; rf2_pcu_rs3_data; ldx1_data; ldx1m_data; gfux_data_a1; gfux_data_a4; mfu2_pcu_data_e; mfu2_pcu_data_e2; mfu2_pcu_data_e4; pcu_mfu2_rs1_data; pcu_mfu2_rs2_data; pcu_mfu2_rs3_data; pcu_strd_data.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 17 of 18    US 6,615,338 B1

[FIG. 12D: routing diagram. Labels: 226; 220; PCU_MFU1X_MC; PCU_MFU1X_DP; 1122; 1132; 22 outputs; Horizontal buses in channel = 5; Horizontal buses in channel = 19. Signals: gfux_data_t; ldx3_data; ldx3m_data; mfu2x_data_t; mfu3x_data_t; rf1_pcu_rs1_data; rf1_pcu_rs2_data; rf1_pcu_rs3_data; pcu_mfu1_rs1_data; pcu_mfu1_rs2_data; pcu_mfu1_rs3_data; dcu_pcu_dc_data[31:0]; dcu_pcu_dc_data[63:32]; lsu_pcu_ndc_data[31:0]; lsu_pcu_ndc_data[63:32]; mfu1_pcu_data_e; mfu1_pcu_data_e2; mfu1_pcu_data_e4; gfu_pcu_data_e; gfu_pcu_data_e6e34; gfux_data_a1; gfux_data_a2; gfux_data_a3; mfu1x_data_a1; mfu1x_data_a2; mfu1x_data_a3; ldx1_data.]
`
`
`
U.S. Patent    Sep. 2, 2003    Sheet 18 of 18    US 6,615,338 B1

[Drawing without a legible figure label: LSU; 1200; 1210; pcu_trap; entries ldx1, ldx2, ldx3, ldx4 with fields ldx1_rd[7:0], ldx1_data[63:0], dsize, valid (likewise for ldx2 and ldx3).]

[FIG. 14A: pipeline timing diagram by cycle. Rows: membar, inst. Stage labels: D, E, A1, A2, A3, T, WB.]

[FIG. 14B: pipeline timing diagram by cycle. Rows: load, load, inst. Stage labels: D, E, A1, A2, A3, T, WB.]
`
`
`
CLUSTERED ARCHITECTURE IN A VLIW PROCESSOR

CROSS-REFERENCE TO RELATED APPLICATIONS

The present invention is related to subject matter disclosed in the following co-pending patent applications:

1. U.S. patent application Ser. No. 09/204,480, entitled, “A Multiple-Thread Processor for Threaded Software Applications”, naming Marc Tremblay and William Joy as inventors and filed on even date herewith;

2. U.S. patent application Ser. No. 09/204,481, now U.S. Pat. No. 6,343,348, entitled, “Apparatus and Method for Optimizing Die Utilization and Speed Performance by Register File Splitting”, naming Marc Tremblay and William Joy as inventors and filed on even date herewith;

3. U.S. patent application Ser. No. 09/204,536, entitled, “Variable Issue-Width VLIW Processor”, naming Marc Tremblay as inventor and filed on even date herewith;

4. U.S. patent application Ser. No. 09/204,586, now U.S. Pat. No. 6,205,543, entitled, “Efficient Handling of a Large Register File for Context Switching”, naming Marc Tremblay and William Joy as inventors and filed on even date herewith;

5. U.S. patent application Ser. No. 09/205,121, now U.S. Pat. No. 6,321,325, entitled, “Dual In-line Buffers for an Instruction Fetch Unit”, naming Marc Tremblay and Graham Murphy as inventors and filed on even date herewith;

6. U.S. patent application Ser. No. 09/204,781, now U.S. Pat. No. 6,249,861, entitled, “An Instruction Fetch Unit Aligner”, naming Marc Tremblay and Graham Murphy as inventors and filed on even date herewith;

7. U.S. patent application Ser. No. 09/204,535, now U.S. Pat. No. 6,279,100, entitled, “Local Stall Control Method and Structure in a Microprocessor”, naming Marc Tremblay and Sharada Yeluri as inventors and filed on even date herewith;

8. U.S. patent application Ser. No. 09/204,858, entitled, “Local and Global Register Partitioning in a VLIW Processor”, naming Marc Tremblay and William Joy as inventors and filed on even date herewith; and

9. U.S. patent application Ser. No. 09/204,479, entitled, “Implicitly Derived Register Specifiers in a Processor”, naming Marc Tremblay and William Joy as inventors and filed on even date herewith.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to processors. More specifically, the present invention relates to architectures for Very Long Instruction Word (VLIW) processors.

2. Description of the Related Art

One technique for improving the performance of processors is parallel execution of multiple instructions to allow the instruction execution rate to exceed the clock rate. Various types of parallel processors have been developed including Very Long Instruction Word (VLIW) processors that use multiple, independent functional units to execute multiple instructions in parallel. VLIW processors package multiple operations into one very long instruction, the multiple operations being determined by sub-instructions that are applied to the independent functional units. An instruction has a set of fields corresponding to each functional unit. Typical bit lengths of a subinstruction commonly range from 16 to 64 bits per functional unit to produce an instruction length often in a range from 64 to 512 bits for VLIW groups from four to eight subinstructions.

The multiple functional units are kept busy by maintaining a code sequence with sufficient operations to keep instructions scheduled. A VLIW processor often uses a technique called trace scheduling to maintain scheduling efficiency by unrolling loops and scheduling code across basic function blocks. Trace scheduling also improves efficiency by allowing instructions to move across branch points.

Limitations of VLIW processing include limited parallelism, limited hardware resources, and a vast increase in code size. A limited amount of parallelism is available in instruction sequences. Unless loops are unrolled a very large number of times, insufficient operations are available to fill the instruction capacity of the functional units. The operational capacity of a VLIW processor is not determined by the number of functional units alone. The capacity also depends on the depth of the operational pipeline of the operational units. Several operational units such as the memory, branching controller, and floating point functional units, are pipelined and perform a much larger number of operations than can be executed in parallel. For example, a floating point pipeline with a depth of eight steps has two operations issued on a clock cycle that cannot depend on any of the operations already within the floating point pipeline. Accordingly, the actual number of independent operations is approximately equal to the average pipeline depth times the number of execution units. Consequently, the number of operations needed to maintain a maximum efficiency of operation for a VLIW processor with four functional units is twelve to sixteen.

Limited hardware resources are a problem, not only because of duplication of functional units but more importantly due to a large increase in memory and register file bandwidth. A large number of read and write ports are necessary for accessing the register file, imposing a bandwidth that is difficult to support without a large cost in the size of the register file and degradation in clock speed. As the number of ports increases, the complexity of the memory system further increases. To allow multiple memory accesses in parallel, the memory is divided into multiple banks having different addresses to reduce the likelihood that multiple operations in a single instruction have conflicting accesses that cause the processor to stall since synchrony must be maintained between the functional units.

Code size is a problem for several reasons. The generation of sufficient operations in a nonbranching code fragment requires substantial unrolling of loops, increasing the code size. Also, instructions that are not full include unused subinstructions that waste code space, increasing code size. Furthermore, the increase in the size of storages such as the register file increase the number of bits in the instruction for addressing registers in the register file.

A challenge in the design of VLIW processors is effective exploitation of instruction-level parallelism. Highly parallel computing applications that have few data dependencies and few branches are executed most efficiently using a wide VLIW processor with a greater number of subinstructions in a VLIW group. However many computing applications are not highly parallel and include branches or data dependencies that waste space in instruction memory and cause stalling. Referring to FIG. 1, a graph illustrates a comparison of instruction issue efficiency and processor size as VLIW group width is varied. The left axis of the graph relates to an instruction-level parallelism plot 10 that depicts the number of instructions executed per cycle against VLIW issue width. The right axis of the graph relates to a relative processor size plot 12 that shows relative processor size in relation to VLIW issue width.
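The two numeric relationships quoted in the Description of the Related Art (instruction length as subinstruction width times group size, and independent operations as average pipeline depth times the number of execution units) can be checked with a short Python sketch. The function names are hypothetical, and the average pipeline depths of 3 and 4 are an assumption inferred from the stated "twelve to sixteen" figure for four functional units; the text does not give the depths explicitly.

```python
def instruction_length_bits(bits_per_subinstruction, subinstructions):
    """Length of one VLIW instruction built from equal-width subinstructions."""
    return bits_per_subinstruction * subinstructions


def independent_operations_needed(average_pipeline_depth, execution_units):
    """Approximate number of independent operations needed to keep units busy."""
    return average_pipeline_depth * execution_units


# Range quoted in the text: 16 to 64 bits, four to eight subinstructions.
shortest = instruction_length_bits(16, 4)
longest = instruction_length_bits(64, 8)

# Four functional units; depths 3 and 4 are assumed, not stated in the text.
ops_low = independent_operations_needed(3, 4)
ops_high = independent_operations_needed(4, 4)

print(shortest, longest, ops_low, ops_high)
```

With these inputs the sketch reproduces the 64 to 512 bit range and the twelve to sixteen operation range given in the text.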
What are needed are a technique and processor architecture that increase the capacity for instru