Tremblay et al.

(10) Patent No.: US 6,615,338 B1
(45) Date of Patent: Sep. 2, 2003

(54) CLUSTERED ARCHITECTURE IN A VLIW PROCESSOR

(75) Inventors: Marc Tremblay, Menlo Park, CA (US); William Joy, Aspen, CO (US)

(73) Assignee: Sun Microsystems, Inc., Palo Alto, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/204,584

(22) Filed: Dec. 3, 1998

(51) Int. Cl.7
(52) U.S. Cl.
(58) Field of Search .............................

(56) References Cited

U.S. PATENT DOCUMENTS

5,301,340 A    4/1994  Cook .................. 395/800
5,467,476 A   11/1995  Kawasaki .............. 395/800
5,530,817 A    6/1996  Masubuchi ............. 395/375
5,542,059 A    7/1996  Blomgren .............. 395/375
5,657,291 A    8/1997  Podlesny et al.
5,721,868 A    2/1998  Yung et al. ........... 395/476
5,761,475 A    6/1998  Yung et al. ........... 395/394
5,764,943 A    6/1998  Wechsler .............. 395/394
5,901,301 A    5/1999  Matsuo et al. ......... 395/388
6,076,159 A    6/2000  Fleck et al. .......... 712/241
6,170,051 B1 * 1/2001  Dowling ............... 712/225

FOREIGN PATENT DOCUMENTS

EP  0 730 223   9/1994 ............. G06F/9/38
EP  0 653 703   5/1995 ............. G06F/9/38

OTHER PUBLICATIONS

Findlay et al., "HARP: A VLIW RISC Processor", IEEE, pp. 368-372, 1991.*
Keckler et al., "Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism", Proceedings of the Annual International Symposium on Computer Architecture, US, New York, IEEE, vol. Symp. 19, 1992, pp. 202-213, XP000325804, ISBN: 0-89791-510-6.
Steven et al., "iHARP: a multiple instruction issue processor", IEE Proceedings E: Computers & Digital Techniques, vol. 139, No. 5, Sep. 1992, pp. 439-449, XP000319892, Institution of Electrical Engineers, Stevenage, GB, ISSN: 1350-2387.

* cited by examiner

Primary Examiner: Emanuel Todd Voeltz
(74) Attorney, Agent, or Firm: Zagorin, O'Brien & Graham, LLP

(57) ABSTRACT

A Very Long Instruction Word (VLIW) processor has a clustered architecture including a plurality of independent functional units and a multi-ported register file that is divided into a plurality of separate register file segments, the register file segments being individually associated with the plurality of independent functional units. The functional units access the respective associated register file segments using read operations that are local to the functional unit/register file segment pairs. In contrast, the functional units access the register file segments using write operations that are broadcast to a plurality of register file segments. Independence between clusters is attained since the separate clustered functional unit/register file segment pairs have local (internal) bypassing that allows internal computations to proceed, but have only limited bypassing between different functional unit/register file segment pair clusters. Thus a particular functional unit/register segment pair does not bypass to all other functional unit/register segment pairs.

25 Claims, 18 Drawing Sheets
(Representative drawing: two media processing units 110 and 112, each with an instruction cache 210, instruction aligner 212, instruction buffer 214, pipeline control unit (PCU), register files with segments 224, and a load/store unit, coupled to a shared data cache and synchronization area.)

IPR2023-00037
Apple EX1027 Page 1
Sheet 1 of 18: FIG. 1 (graph comparing an instruction-level parallelism plot 10 and a relative processor size plot 12 against VLIW issue width)
Sheet 2 of 18: FIG. 2 (single integrated circuit chip processor block diagram)
Sheet 3 of 18: FIG. 3 (processor core: two media processing units 110 and 112, each with an instruction cache 210, instruction aligner 212, instruction buffer 214, PCU, functional units MFU3, MFU2, MFU1, and GFU, register files 218 with segments 224, and a load/store unit, sharing a shared data cache and synchronization area)
Sheet 4 of 18: FIG. 4 (split register file: broadcast writes to all segments, four segments with 3 read ports each; global registers, 12R/4W)
Sheet 5 of 18: FIG. 6 (media functional unit datapath: data in of up to 3 32-bit registers; 5-cycle globally visible latency; hardware bypasses provide shortest internal latency of 4, 2, and 1 cycles; data out of one 32-bit register)
Sheet 6 of 18: FIG. 7 (processor pipeline timing diagram)
Sheet 7 of 18: FIG. 8A (pipe control unit block diagram)
Sheet 8 of 18: FIG. 8B (general functional unit pipeline timing diagram)
Sheet 9 of 18: FIG. 8C (bypass diagram 800 for the general functional unit) and FIG. 9 (media functional unit pipeline timing diagram)
Sheet 10 of 18: FIG. 10A (instruction sequence table for the mfu3, mfu2, mfu1, and gfu groups); FIGS. 10B and 10C (pipeline diagrams of stall operation for pair instructions)
Sheet 11 of 18: FIG. 11A (pipeline diagram of synchronization of pair instructions in a group)
Sheet 12 of 18: FIG. 11B (pipeline diagram of synchronization of pair instructions in a group)
Sheet 13 of 18: FIG. 11C (pipeline diagram of synchronization of pair instructions in a group)
Sheet 14 of 18: FIG. 12A (pipeline control unit segment for the GFU 222: PCU_GFUX_DP 1110, PCU_GFUX_MC 1130; horizontal buses = 11 x 32, vertical routing = 13 x 32)
Sheet 15 of 18: FIG. 12B (pipeline control unit segment for MFU1 220: register file RF1 226, PCU_MFU1X_MC, PCU_MFU1X_DP 1132, 1120; horizontal buses = 13)
Sheet 16 of 18: FIG. 12C (pipeline control unit segment for MFU2 220: PCU_MFU2X_MC 1122, PCU_MFU2X_DP; horizontal buses = 4)
Sheet 17 of 18: FIG. 12D (pipeline control unit segment channel routing for MFU1 220: PCU_MFU1X_MC, PCU_MFU1X_DP 1132, 22 outputs; horizontal buses in channel = 5 and 19)
Sheet 18 of 18: FIG. 13 (load annex block 1200 within the pipeline control unit, with LSU load data ports ldx1, ldx2, and ldx3); FIGS. 14A and 14B (pipeline diagrams of stall conditions for membar and load instructions)
US 6,615,338 B1

CLUSTERED ARCHITECTURE IN A VLIW PROCESSOR

CROSS-REFERENCE TO RELATED APPLICATIONS

The present invention is related to subject matter disclosed in the following co-pending patent applications:

1. U.S. patent application Ser. No. 09/204,480, entitled "A Multiple-Thread Processor for Threaded Software Applications", naming Marc Tremblay and William Joy as inventors and filed on even date herewith;

2. U.S. patent application Ser. No. 09/204,481, now U.S. Pat. No. 6,343,348, entitled "Apparatus and Method for Optimizing Die Utilization and Speed Performance by Register File Splitting", naming Marc Tremblay and William Joy as inventors and filed on even date herewith;

3. U.S. patent application Ser. No. 09/204,536, entitled "Variable Issue-Width VLIW Processor", naming Marc Tremblay as inventor and filed on even date herewith;

4. U.S. patent application Ser. No. 09/204,586, now U.S. Pat. No. 6,205,543, entitled "Efficient Handling of a Large Register File for Context Switching", naming Marc Tremblay and William Joy as inventors and filed on even date herewith;

5. U.S. patent application Ser. No. 09/205,121, now U.S. Pat. No. 6,321,325, entitled "Dual In-line Buffers for an Instruction Fetch Unit", naming Marc Tremblay and Graham Murphy as inventors and filed on even date herewith;

6. U.S. patent application Ser. No. 09/204,781, now U.S. Pat. No. 6,249,861, entitled "An Instruction Fetch Unit Aligner", naming Marc Tremblay and Graham Murphy as inventors and filed on even date herewith;

7. U.S. patent application Ser. No. 09/204,535, now U.S. Pat. No. 6,279,100, entitled "Local Stall Control Method and Structure in a Microprocessor", naming Marc Tremblay and Sharada Yeluri as inventors and filed on even date herewith;

8. U.S. patent application Ser. No. 09/204,858, entitled "Local and Global Register Partitioning in a VLIW Processor", naming Marc Tremblay and William Joy as inventors and filed on even date herewith; and

9. U.S. patent application Ser. No. 09/204,479, entitled "Implicitly Derived Register Specifiers in a Processor", naming Marc Tremblay and William Joy as inventors and filed on even date herewith.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to processors. More specifically, the present invention relates to architectures for Very Long Instruction Word (VLIW) processors.

2. Description of the Related Art

One technique for improving the performance of processors is parallel execution of multiple instructions to allow the instruction execution rate to exceed the clock rate. Various types of parallel processors have been developed including Very Long Instruction Word (VLIW) processors that use multiple, independent functional units to execute multiple instructions in parallel. VLIW processors package multiple operations into one very long instruction, the multiple operations being determined by sub-instructions that are applied to the independent functional units. An instruction has a set of fields corresponding to each functional unit. Typical bit lengths of a subinstruction commonly range from 16 to 64 bits per functional unit to produce an instruction length often in a range from 64 to 512 bits for VLIW groups from four to eight subinstructions.

The multiple functional units are kept busy by maintaining a code sequence with sufficient operations to keep instructions scheduled. A VLIW processor often uses a technique called trace scheduling to maintain scheduling efficiency by unrolling loops and scheduling code across basic function blocks. Trace scheduling also improves efficiency by allowing instructions to move across branch points.

Limitations of VLIW processing include limited parallelism, limited hardware resources, and a vast increase in code size. A limited amount of parallelism is available in instruction sequences. Unless loops are unrolled a very large number of times, insufficient operations are available to fill the instruction capacity of the functional units. The operational capacity of a VLIW processor is not determined by the number of functional units alone. The capacity also depends on the depth of the operational pipeline of the operational units. Several operational units, such as the memory, branching controller, and floating point functional units, are pipelined and perform a much larger number of operations than can be executed in parallel. For example, a floating point pipeline with a depth of eight steps has two operations issued on a clock cycle that cannot depend on any of the operations already within the floating point pipeline. Accordingly, the actual number of independent operations is approximately equal to the average pipeline depth times the number of execution units. Consequently, the number of operations needed to maintain a maximum efficiency of operation for a VLIW processor with four functional units is twelve to sixteen.

Limited hardware resources are a problem, not only because of duplication of functional units but more importantly due to a large increase in memory and register file bandwidth. A large number of read and write ports are necessary for accessing the register file, imposing a bandwidth that is difficult to support without a large cost in the size of the register file and degradation in clock speed. As the number of ports increases, the complexity of the memory system further increases. To allow multiple memory accesses in parallel, the memory is divided into multiple banks having different addresses to reduce the likelihood that multiple operations in a single instruction have conflicting accesses that cause the processor to stall, since synchrony must be maintained between the functional units.

Code size is a problem for several reasons. The generation of sufficient operations in a nonbranching code fragment requires substantial unrolling of loops, increasing the code size. Also, instructions that are not full include unused subinstructions that waste code space, increasing code size. Furthermore, increases in the size of storages such as the register file increase the number of bits in the instruction for addressing registers in the register file.

A challenge in the design of VLIW processors is effective exploitation of instruction-level parallelism. Highly parallel computing applications that have few data dependencies and few branches are executed most efficiently using a wide VLIW processor with a greater number of subinstructions in a VLIW group. However, many computing applications are not highly parallel and include branches or data dependencies that waste space in instruction memory and cause stalling. Referring to FIG. 1, a graph illustrates a comparison
of instruction issue efficiency and processor size as VLIW group width is varied. The left axis of the graph relates to an instruction-level parallelism plot 10 that depicts the number of instructions executed per cycle against VLIW issue width. The right axis of the graph relates to a relative processor size plot 12 that shows relative processor size in relation to VLIW issue width.
What are needed are a technique and processor architecture that increase the capacity for instruction-level parallelism while efficiently using resources so that the number of functional units kept busy in each cycle and the number of useful operations in a VLIW group are increased.
SUMMARY OF THE INVENTION

A Very Long Instruction Word (VLIW) processor has a clustered architecture including a plurality of independent functional units and a multi-ported register file that is divided into a plurality of separate register file segments, the register file segments being individually associated with the plurality of independent functional units. The functional units access the respective associated register file segments using read operations that are local to the functional unit/register file segment pairs. In contrast, the functional units access the register file segments using write operations that are broadcast to a plurality of register file segments.

In an illustrative embodiment, independence between clusters is attained since the separate clustered functional unit/register file segment pairs have local (internal) bypassing that allows internal computations to proceed, but have only limited bypassing between different functional unit/register file segment pair clusters. Thus a particular functional unit/register segment pair does not bypass to all other functional unit/register segment pairs.
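The local-read, broadcast-write discipline described above can be sketched as a small behavioral model. The class below is a hypothetical illustration for exposition only, not circuitry or naming from the patent; segment and register counts are assumed parameters.

```python
# Behavioral sketch of a clustered register file: reads are local to a
# cluster's own segment, while writes are broadcast to every segment
# so the separate copies remain coherent.
class SegmentedRegisterFile:
    def __init__(self, num_segments=4, regs_per_segment=32):
        # Each functional unit owns one segment; all segments mirror
        # the same architectural registers.
        self.segments = [[0] * regs_per_segment for _ in range(num_segments)]

    def read(self, cluster, reg):
        # Local read: only the cluster's own segment is accessed, so
        # each segment needs only a few read ports.
        return self.segments[cluster][reg]

    def write(self, reg, value):
        # Broadcast write: every segment is updated in lock step.
        for segment in self.segments:
            segment[reg] = value

rf = SegmentedRegisterFile()
rf.write(5, 42)
assert all(rf.read(cluster, 5) == 42 for cluster in range(4))
```

The model makes the asymmetry explicit: a read touches one segment, while a write fans out to all of them, which is why read ports can be divided among the segments while each segment keeps the full complement of write ports.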
Usage of local bypassing rather than global bypassing greatly reduces the interconnection structures within the processor, advantageously reducing the length of interconnect lines, reducing processor size, and increasing processor speed by shortening the distance of signal transfer. Independent clustering of functional units advantageously forms a highly scaleable structure in VLIW processor architecture using distributed functional units.
In some embodiments, a clustered functional unit/register file segment pair also includes one or more annexes that stage or delay intermediate results of an instruction, thereby controlling data hazard conditions. The annexes contain storage for storing destination register (rd) specifiers for all annex stages, valid bits for the stages of a pipeline, and priority logic that determines a most recent value of a register in the register file.

The annexes include multiplexers that select matching stages among bypass levels in a priority logic that selects data based on priority matching within a priority level. The annexes include compare logic that compares destination specifiers of an instruction executing within the annex pipeline against source and destination specifiers of other instructions currently executing in the local bypass range of a functional unit/register file segment pair cluster.
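The annex priority selection described above can be sketched as follows. This is a hypothetical software analogy of the multiplexer and compare logic, with assumed data shapes; the hardware performs all comparisons in parallel rather than in a loop.

```python
# Annex-style bypass priority: among pipeline stages whose destination
# register (rd) matches the requested source register, the youngest
# valid stage wins; otherwise the committed register file value is used.
def bypass_select(stages, src_reg, regfile_value):
    """stages: list of (valid, rd, value) tuples, ordered youngest first."""
    for valid, rd, value in stages:
        if valid and rd == src_reg:
            return value  # most recent in-flight value of src_reg
    return regfile_value  # no matching stage: fall back to register file

stages = [
    (True, 7, 300),   # youngest in-flight write to r7
    (True, 7, 200),   # older write to r7, shadowed by the stage above
    (False, 5, 999),  # invalid stage, ignored by the priority logic
]
assert bypass_select(stages, 7, 100) == 300
assert bypass_select(stages, 5, 100) == 100
```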
A multi-ported register file is typically metal-limited, so the area consumed by the circuit is proportional to the square of the number of ports. The multi-ported register file is divided into register file segments that are individually allocated among functional unit/register file segment pair clusters. The plurality of separate and independent register files forms a layout structure with an improved layout efficiency. The read ports of the total register file structure are allocated among the separate and individual register files. The separate and individual register files have write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent.

In one illustrative embodiment, a 16-port register file structure with twelve read ports and four write ports is split into four separate and individual 7-port register files with three read ports and four write ports. The area of a single 16-port register file would have a size proportional to 16 times 16, or 256. The separate and individual register files each have a size proportional to 7 times 7, or 49, for a total of 4 times 49, or 196. The capacity of a single 16-port register file and the four 7-port register files is identical, with the split register file structure advantageously having a significantly reduced area. The reduced area advantageously corresponds to an improvement in access time of a register file and thus speed performance due to a reduction in the length of word lines and bit lines connecting the array cells that reduces the time for a signal to pass on the lines. The improvement in speed performance is highly advantageous due to strict time budgets that are imposed by the specification of high-performance processors and also to attain a large-capacity register file that is operational at high speed.
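The port-area arithmetic in the illustrative embodiment can be restated directly; the quadratic area model is the one stated in the text, and the function name is ours.

```python
# Register file area grows roughly with the square of the port count,
# so splitting one 16-port file into four 7-port segments shrinks the
# total area even though the register capacity is unchanged.
def regfile_area(num_files, ports_per_file):
    # Area model from the text: proportional to ports**2 per file.
    return num_files * ports_per_file ** 2

unified = regfile_area(1, 12 + 4)  # one file: 12 read + 4 write ports
split = regfile_area(4, 3 + 4)     # four segments: 3 read + 4 write each
assert unified == 256
assert split == 196
assert split < unified
```

The same model also shows why the split must preserve all four write ports per segment (for broadcast coherence) while read ports can be divided: only the read-port reduction drives the area savings.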
BRIEF DESCRIPTION OF THE DRAWINGS

The features of the described embodiments are specifically set forth in the appended claims. However, embodiments of the invention, relating to both structure and method of operation, may best be understood by referring to the following description and accompanying drawings.

FIG. 1 is a graph illustrating a comparison of instruction issue efficiency and processor size as VLIW group width is varied.

FIG. 2 is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.

FIG. 3 is a schematic block diagram showing the core of the processor.

FIG. 4 is a schematic block diagram that illustrates an embodiment of the split register file that is suitable for usage in the processor.

FIG. 5 is a schematic block diagram that shows a logical view of the register file and functional units in the processor.

FIG. 6 is a pictorial schematic diagram depicting an example of instruction execution among a plurality of media functional units.

FIG. 7 is a schematic timing diagram that illustrates timing of the processor pipeline.

FIGS. 8A-8C are, respectively, a schematic block diagram showing an embodiment of a general functional unit, a simplified schematic timing diagram showing timing of a general functional unit pipeline, and a bypass diagram showing possible bypasses for the general functional unit.

FIG. 9 is a simplified schematic timing diagram illustrating timing of media functional unit pipelines.

FIGS. 10A-10C respectively show an instruction sequence table and two pipeline diagrams illustrating execution of a VLIW group, which show stall operation for a five-cycle pair instruction and a seven-cycle pair instruction.

FIGS. 11A-11C are pipeline diagrams illustrating synchronization of pair instructions in a group.
FIGS. 12A, 12B, 12C, and 12D are respective schematic block diagrams illustrating the pipeline control unit segments allocated to all of the functional units GFU, MFU1, MFU2, and MFU3.
FIG. 13 is a schematic block diagram that illustrates a load annex block within the pipeline control unit.

FIGS. 14A and 14B illustrate respective stall conditions in accordance with operation of some embodiments of the present invention.

The use of the same reference symbols in different drawings indicates similar or identical items.
DESCRIPTION OF THE EMBODIMENT(S)

Referring to FIG. 2, a schematic block diagram illustrates a single integrated circuit chip implementation of a processor 100 that includes a memory interface 102, a geometry decompressor 104, two media processing units 110 and 112, a shared data cache 106, and several interface controllers. The interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die. The components are mutually linked and closely linked to the processor core with high-bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time. The interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120. The illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller.

The shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each media processing unit. The data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown). The data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.
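The four-way set-associative organization mentioned above works as sketched below. The line size and set count are assumptions chosen for the example; the patent does not state the geometry of data cache 106.

```python
# Address decomposition and lookup for a four-way set-associative
# cache: an address selects one set, and the tag is compared against
# the four ways held in that set.
LINE_SIZE = 32   # bytes per cache line (assumed)
NUM_SETS = 128   # number of sets (assumed)
WAYS = 4         # four-way set associative, as stated in the text

def split_address(addr):
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, index, offset

def lookup(cache_sets, addr):
    tag, index, _ = split_address(addr)
    # A hit means some way in the indexed set holds a valid matching tag.
    return any(valid and t == tag for valid, t in cache_sets[index])

sets = [[(False, 0)] * WAYS for _ in range(NUM_SETS)]
tag, index, _ = split_address(0x12345)
sets[index][0] = (True, tag)   # install one line
assert lookup(sets, 0x12345)
assert not lookup(sets, 0x54321)
```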
The UPA controller 116 is a custom interface that attains a suitable balance between high-performance computational and graphic subsystems. The UPA is a cache-coherent, processor-memory interconnect. The UPA attains several advantageous characteristics including a scaleable bandwidth through support of multiple bused interconnects for data and addresses, packets that are switched for improved bus utilization, higher bandwidth, and precise interrupt processing. The UPA performs low-latency memory accesses with high-throughput paths to memory. The UPA includes a buffered cross-bar memory interface for increased bandwidth and improved scaleability. The UPA supports high-performance graphics with two-cycle single-word writes on the 64-bit UPA interconnect. The UPA interconnect architecture utilizes point-to-point packet-switched messages from a centralized system controller to maintain cache coherence. Packet switching improves bus bandwidth utilization by removing the latencies commonly associated with transaction-based designs.

The PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume, low-cost peripheral devices, although other standard interfaces may also be used. The PCI bus effectively transfers data among high-bandwidth peripherals and low-bandwidth peripherals, such as CD-ROM players, DVD players, and digital cameras.
Two media processing units 110 are included in a single integrated circuit chip to support an execution environment exploiting thread-level parallelism in which two independent threads can execute simultaneously. The threads may arise from any sources such as the same application, different applications, the operating system, or the runtime environment. Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general-purpose code. For example, the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions. A typical "general-purpose" processing code has an instruction-level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time. The illustrative processor 100 employs thread-level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.

Thread-level parallelism is particularly useful for Java™ applications, which are bound to have multiple threads of execution. Java™ methods including "suspend", "resume", "sleep", and the like include effective support for threaded program code. In addition, Java™ class libraries are thread-safe to promote parallelism. (Java™, Sun, Sun Microsystems and the Sun Logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks, including UltraSPARC I and UltraSPARC II, are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.) Furthermore, the thread model of the processor 100 supports a dynamic compiler which runs as a separate thread using one media processing unit 110 while the second media processing unit 110 is used by the current application. In the illustrative system, the compiler applies optimizations based on "on-the-fly" profile feedback information while dynamically modifying the executing code to improve execution on each subsequent run. For example, a "garbage collector" may be executed on a first media processing unit 110, copying objects or gathering pointer information, while the application is executing on the other media processing unit 110.

Although the processor 100 shown in FIG. 2 includes two processing units on an integrated circuit chip, the architecture is highly scaleable so that one to several closely coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution. Thus, in the processor 100, a limitation on the number of processors formed on a single die arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
Referring to FIG. 3, a schematic block diagram shows the core of the processor 100. The media processing units 110 each include an instruction cache 210, an instruction aligner 212, an instruction buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit 218. In the illustrative processor 100, the media processing units 110 use a plurality of execution units for executing instructions. The execution units for a media processing unit 110 include three media functional units (MFU) 220 and one general functional unit (GFU) 222. The media functional units 220 are multiple single-instruction-multiple-datapath (MSIMD) media functional units. Each of the media functional units 220 is capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-datapath capability for the processor 100 including add, multiply-add, shift, compare, and the like. The media functional units 220 operate in combination as tightly-coupled digital signal processors (DSPs). Each media functional unit 220 has a separate and individual sub-instruction stream, but all three media functional units 220 execute synchronously so that the subinstructions progress lock-step through pipeline stages.
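The parallel 16-bit component processing mentioned above can be illustrated with a partitioned add. This is a software sketch of the general SIMD idea, not the MFU circuit design; the 64-bit word width and four-lane split are assumptions for the example.

```python
# Partitioned SIMD add: one 64-bit word treated as four independent
# 16-bit lanes; carries do not propagate across lane boundaries.
MASK16 = 0xFFFF

def simd_add16(a, b):
    """Lane-wise 16-bit add of two 64-bit words."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        lane_sum = (((a >> shift) & MASK16) + ((b >> shift) & MASK16)) & MASK16
        result |= lane_sum << shift
    return result

# 0xFFFF + 0x0001 wraps to 0x0000 in its own lane without disturbing
# the neighboring lanes.
a = 0x0001_0002_0003_FFFF
b = 0x0001_0001_0001_0001
assert simd_add16(a, b) == 0x0002_0003_0004_0000
```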
The general functional unit 222 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others. The general functional unit 222 supports less common parallel operations such as the parallel reciprocal square root instruction. The illustrative instruction cache 210 has a