(12) United States Patent
Tremblay et al.

(10) Patent No.: US 6,615,338 B1
(45) Date of Patent: Sep. 2, 2003

(54) CLUSTERED ARCHITECTURE IN A VLIW PROCESSOR

(75) Inventors: Marc Tremblay, Menlo Park, CA (US); William Joy, Aspen, CO (US)

(73) Assignee: Sun Microsystems, Inc., Palo Alto, CA (US)

Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/204,584
(22) Filed: Dec. 3, 1998

(51) ...
(52) ...
(58) Field of Search ...

(56) References Cited

U.S. PATENT DOCUMENTS

[illegible entry]
5,301,340 A      4/1994   Cook ............... 395/800
5,467,476 A     11/1995   Kawasaki ........... 395/800
5,530,817 A      6/1996   Masubuchi .......... 395/375
5,542,059 A      7/1996   Blomgren ........... 395/375
5,657,291 A      8/1997   Podlesny et al.
5,721,868 A      2/1998   Yung et al. ........ 395/476
5,761,475 A      6/1998   Yung et al. ........ 395/394
5,764,943 A      6/1998   Wechsler ........... 395/394
[illegible entry]
5,901,301 A      5/1999   Matsuo et al. ...... 395/388
6,076,159 A      6/2000   Fleck et al. ....... 712/241
6,170,051 B1 *   1/2001   Dowling ............ 712/225

FOREIGN PATENT DOCUMENTS

EP   0 730 223   9/1994 ........ G06F/9/38
EP   0 653 703   5/1995 ........ G06F/9/38

OTHER PUBLICATIONS

Findlay et al., "HARP: A VLIW RISC Processor", IEEE, pp. 368-372, 1991.*
Keckler et al., "Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism", Proceedings of the Annual International Symposium on Computer Architecture, US, New York, IEEE, vol. Symp. 19, 1992, pp. 202-213, XP000325804, ISBN: 0-89791-510-6.
Steven et al., "iHARP: a multiple instruction issue processor", IEE Proceedings E. Computers & Digital Techniques, vol. 139, No. 5, Sep. 1992, pp. 439-449, XP000319892, Institution of Electrical Engineers, Stevenage, GB, ISSN: 1350-2387.

* cited by examiner

Primary Examiner: Emanuel Todd Voeltz
(74) Attorney, Agent, or Firm: Zagorin, O'Brien & Graham, LLP

(57) ABSTRACT

A Very Long Instruction Word (VLIW) processor has a clustered architecture including a plurality of independent functional units and a multi-ported register file that is divided into a plurality of separate register file segments, the register file segments being individually associated with the plurality of independent functional units. The functional units access the respective associated register file segments using read operations that are local to the functional unit/register file segment pairs. In contrast, the functional units access the register file segments using write operations that are broadcast to a plurality of register file segments. Independence between clusters is attained since the separate clustered functional unit/register file segment pairs have local (internal) bypassing that allows internal computations to proceed, but have only limited bypassing between different functional unit/register file segment pair clusters. Thus a particular functional unit/register segment pair does not bypass to all other functional unit/register segment pairs.

25 Claims, 18 Drawing Sheets

[Representative drawing: two processing units, each with an instruction cache, instruction aligner, instruction buffer, PCU, register files, and load/store unit, connected to a shared data cache and synchronization area.]
`
`Amazon / Zentian Limited
`Exhibit 1027
`Page 1
`
`

`

U.S. Patent    Sep. 2, 2003    Sheets 1 to 18 of 18    US 6,615,338 B1

[Drawing sheets; only labels recoverable from the extracted text are listed.]

Sheet 1 of 18: FIG. 1. Graph with axes labeled ILP and SIZE, showing plots 10 and 12.
Sheet 2 of 18: no text recoverable.
Sheet 3 of 18: FIG. 3. Two processing units 110 and 112, each with instruction cache 210, instruction aligner 212, instruction buffer 214, PCU, functional units MFU3, MFU2, MFU1, and GFU, register files, and load/store unit 218, connected to a shared data cache and synchronization area.
Sheet 4 of 18: FIG. 4. Labels: broadcast writes, four groups of 3 read ports, global registers, 12R/4W.
Sheet 5 of 18: FIG. 6. Labels: Data In (up to 3 32-bit registers); 5 cycles (globally visible latency); hardware bypasses provide shortest internal latency (4, 2 and 1 cycle latency); Data Out (one 32-bit register).
Sheet 6 of 18: no text recoverable.
Sheet 7 of 18: FIG. 8A. Label: pipe control unit.
Sheet 8 of 18: no legible text.
Sheet 9 of 18: FIG. 8C and FIG. 9. Reference numeral 800.
Sheet 10 of 18: FIGS. 10A, 10B, and 10C. Instruction group table with columns mfu3, mfu2, mfu1, and gfu, and pipeline diagrams by cycle.
Sheet 11 of 18: FIG. 11A. Pipeline diagram over cycles 1 to 12 for mfu1, mfu2, and gfu instructions and helper instructions.
Sheet 12 of 18: FIG. 11B.
Sheet 13 of 18: FIG. 11C. Pipeline diagram for mfu3 and gfu instructions and helper instructions.
Sheet 14 of 18: FIG. 12A. Pipeline control unit 226 segment for the GFU 222. Labels: horizontal buses = 11 x 32, vertical routing = 13 x 32.
Sheet 15 of 18: FIG. 12B. Pipeline control unit 226 segment for MFU1 220.
Sheet 16 of 18: FIG. 12C. Pipeline control unit 226 segment for MFU2 220.
Sheet 17 of 18: FIG. 12D. Pipeline control unit 226 segment for a media functional unit 220.
Sheet 18 of 18: LSU load signals ldx1, ldx2, and ldx3 (reference numeral 1200); FIG. 14A and FIG. 14B, pipeline diagrams by cycle for membar and load instructions.
`
`

`

CLUSTERED ARCHITECTURE IN A VLIW PROCESSOR

CROSS-REFERENCE TO RELATED APPLICATIONS

The present invention is related to subject matter disclosed in the following co-pending patent applications:

1. U.S. patent application Ser. No. 09/204,480, entitled, "A Multiple-Thread Processor for Threaded Software Applications", naming Marc Tremblay and William Joy as inventors and filed on even date herewith;
2. U.S. patent application Ser. No. 09/204,481, now U.S. Pat. No. 6,343,348, entitled, "Apparatus and Method for Optimizing Die Utilization and Speed Performance by Register File Splitting", naming Marc Tremblay and William Joy as inventors and filed on even date herewith;
3. U.S. patent application Ser. No. 09/204,536, entitled, "Variable Issue-Width VLIW Processor", naming Marc Tremblay as inventor and filed on even date herewith;
4. U.S. patent application Ser. No. 09/204,586, now U.S. Pat. No. 6,205,543, entitled, "Efficient Handling of a Large Register File for Context Switching", naming Marc Tremblay and William Joy as inventors and filed on even date herewith;
5. U.S. patent application Ser. No. 09/205,121, now U.S. Pat. No. 6,321,325, entitled, "Dual In-line Buffers for an Instruction Fetch Unit", naming Marc Tremblay and Graham Murphy as inventors and filed on even date herewith;
6. U.S. patent application Ser. No. 09/204,781, now U.S. Pat. No. 6,249,861, entitled, "An Instruction Fetch Unit Aligner", naming Marc Tremblay and Graham Murphy as inventors and filed on even date herewith;
7. U.S. patent application Ser. No. 09/204,535, now U.S. Pat. No. 6,279,100, entitled, "Local Stall Control Method and Structure in a Microprocessor", naming Marc Tremblay and Sharada Yeluri as inventors and filed on even date herewith;
8. U.S. patent application Ser. No. 09/204,858, entitled, "Local and Global Register Partitioning in a VLIW Processor", naming Marc Tremblay and William Joy as inventors and filed on even date herewith; and
9. U.S. patent application Ser. No. 09/204,479, entitled, "Implicitly Derived Register Specifiers in a Processor", naming Marc Tremblay and William Joy as inventors and filed on even date herewith.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to processors. More specifically, the present invention relates to architectures for Very Long Instruction Word (VLIW) processors.

2. Description of the Related Art

One technique for improving the performance of processors is parallel execution of multiple instructions to allow the instruction execution rate to exceed the clock rate. Various types of parallel processors have been developed including Very Long Instruction Word (VLIW) processors that use multiple, independent functional units to execute multiple instructions in parallel. VLIW processors package multiple operations into one very long instruction, the multiple operations being determined by subinstructions that are applied to the independent functional units. An instruction has a set of fields corresponding to each functional unit. Typical bit lengths of a subinstruction commonly range from 16 to 64 bits per functional unit to produce an instruction length often in a range from 64 to 512 bits for VLIW groups from four to eight subinstructions.

The multiple functional units are kept busy by maintaining a code sequence with sufficient operations to keep instructions scheduled. A VLIW processor often uses a technique called trace scheduling to maintain scheduling efficiency by unrolling loops and scheduling code across basic function blocks. Trace scheduling also improves efficiency by allowing instructions to move across branch points.

Limitations of VLIW processing include limited parallelism, limited hardware resources, and a vast increase in code size. A limited amount of parallelism is available in instruction sequences. Unless loops are unrolled a very large number of times, insufficient operations are available to fill the instruction capacity of the functional units. The operational capacity of a VLIW processor is not determined by the number of functional units alone. The capacity also depends on the depth of the operational pipeline of the operational units. Several operational units, such as the memory, branching controller, and floating point functional units, are pipelined and perform a much larger number of operations than can be executed in parallel. For example, a floating point pipeline with a depth of eight steps has two operations issued on a clock cycle that cannot depend on any of the operations already within the floating point pipeline. Accordingly, the actual number of independent operations is approximately equal to the average pipeline depth times the number of execution units. Consequently, the number of operations needed to maintain a maximum efficiency of operation for a VLIW processor with four functional units is twelve to sixteen.

Limited hardware resources are a problem, not only because of duplication of functional units but more importantly due to a large increase in memory and register file bandwidth. A large number of read and write ports are necessary for accessing the register file, imposing a bandwidth that is difficult to support without a large cost in the size of the register file and degradation in clock speed. As the number of ports increases, the complexity of the memory system further increases. To allow multiple memory accesses in parallel, the memory is divided into multiple banks having different addresses to reduce the likelihood that multiple operations in a single instruction have conflicting accesses that cause the processor to stall, since synchrony must be maintained between the functional units.

Code size is a problem for several reasons. The generation of sufficient operations in a nonbranching code fragment requires substantial unrolling of loops, increasing the code size. Also, instructions that are not full include unused subinstructions that waste code space, increasing code size. Furthermore, the increase in the size of storages such as the register file increases the number of bits in the instruction for addressing registers in the register file.

A challenge in the design of VLIW processors is effective exploitation of instruction-level parallelism. Highly parallel computing applications that have few data dependencies and few branches are executed most efficiently using a wide VLIW processor with a greater number of subinstructions in a VLIW group. However, many computing applications are not highly parallel and include branches or data dependencies that waste space in instruction memory and cause stalling. Referring to FIG. 1, a graph illustrates a comparison of instruction issue efficiency and processor size as VLIW group width is varied. The left axis of the graph relates to an instruction-level parallelism plot 10 that depicts the number of instructions executed per cycle against VLIW issue width. The right axis of the graph relates to a relative processor size plot 12 that shows relative processor size in relation to VLIW issue width.

What are needed are a technique and processor architecture that increase the capacity for instruction-level parallelism while efficiently using resources so that the number of functional units kept busy in each cycle and the number of useful operations in a VLIW group are increased.
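The two back-of-the-envelope estimates in the background (instruction length, and the number of independent operations needed to keep the functional units busy) can be checked with a short sketch. The function names are hypothetical, and the average pipeline depths of 3 and 4 are an assumption inferred from the stated result of twelve to sixteen operations for four functional units; the text does not give the depths directly.

```python
def operations_in_flight(avg_pipeline_depth, num_units):
    """Approximate count of independent operations needed to keep all
    units busy: average pipeline depth times number of execution units."""
    return avg_pipeline_depth * num_units


def instruction_bits(bits_per_subinstruction, subinstructions_per_group):
    """Length of a VLIW instruction built from equal-width subinstructions."""
    return bits_per_subinstruction * subinstructions_per_group


if __name__ == "__main__":
    # Assumed average depths of 3 and 4 with four functional units.
    for depth in (3, 4):
        print(depth, operations_in_flight(depth, 4))
    # Narrowest and widest cases named in the text.
    print(instruction_bits(16, 4), instruction_bits(64, 8))
```

Under these assumptions the sketch reproduces the ranges quoted in the text: twelve to sixteen operations, and instruction lengths from 64 to 512 bits.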
SUMMARY OF THE INVENTION

A Very Long Instruction Word (VLIW) processor has a clustered architecture including a plurality of independent functional units and a multi-ported register file that is divided into a plurality of separate register file segments, the register file segments being individually associated with the plurality of independent functional units. The functional units access the respective associated register file segments using read operations that are local to the functional unit/register file segment pairs. In contrast, the functional units access the register file segments using write operations that are broadcast to a plurality of register file segments.
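The local-read, broadcast-write arrangement can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name, the segment count of four, and the register count of 32 are hypothetical choices for illustration, and the model ignores ports, timing, and bypassing.

```python
class SplitRegisterFile:
    """Behavioral model: one register file segment per functional unit."""

    def __init__(self, num_segments=4, num_registers=32):
        # Each functional unit owns a private copy (segment) of the registers.
        self.segments = [[0] * num_registers for _ in range(num_segments)]

    def read(self, unit, reg):
        # Reads are local: a unit only ever touches its own segment.
        return self.segments[unit][reg]

    def write(self, reg, value):
        # Writes are broadcast: every segment receives the same update.
        for segment in self.segments:
            segment[reg] = value

    def coherent(self):
        # All segments hold identical contents after broadcast writes.
        return all(segment == self.segments[0] for segment in self.segments)


if __name__ == "__main__":
    rf = SplitRegisterFile()
    rf.write(reg=7, value=42)      # result produced by any one unit
    print(rf.read(unit=3, reg=7))  # visible to another unit through its own segment
    print(rf.coherent())
```

Because every write reaches every segment, each unit can satisfy all of its reads from its own segment without any cross-cluster read path.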
In an illustrative embodiment, independence between clusters is attained since the separate clustered functional unit/register file segment pairs have local (internal) bypassing that allows internal computations to proceed, but have only limited bypassing between different functional unit/register file segment pair clusters. Thus a particular functional unit/register segment pair does not bypass to all other functional unit/register segment pairs.

Usage of local bypassing rather than global bypassing greatly reduces the interconnection structures within the processor, advantageously reducing the length of interconnect lines, reducing processor size and increasing processor speed by shortening the distance of signal transfer. Independent clustering of functional units advantageously forms a highly scaleable structure in VLIW processor architecture using distributed functional units.
In some embodiments, a clustered functional unit/register file segment pair also includes one or more annexes that stage or delay intermediate results of an instruction, thereby controlling data hazard conditions. The annexes contain storage for storing destination register (rd) specifiers for all annex stages, valid bits for the stages of a pipeline, and priority logic that determines a most recent value of a register in the register file.

The annexes include multiplexers that select matching stages among bypass levels in a priority logic that selects data based on priority matching within a priority level. The annexes include compare logic that compares destination specifiers of an instruction executing within the annex pipeline against source and destination specifiers of other instructions currently executing in the local bypass range of a functional unit/register file segment pair cluster.
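The role of the valid bits, destination specifiers, and priority logic can be sketched in software. This is a simplified model and not the circuit described: the names `AnnexStage` and `read_operand` are hypothetical, the annex is assumed to be an ordered list with the youngest stage first, and the multiplexer tree is replaced by a linear search.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnnexStage:
    valid: bool  # valid bit for this annex stage
    rd: int      # destination register specifier held in this stage
    value: int   # staged result not yet written to the register file


def read_operand(rs: int, annex: List[AnnexStage], register_file: Dict[int, int]) -> int:
    """Return the most recent value of source register rs.

    Assumes annex[0] is the youngest stage, so the first valid match has
    priority over older matches and over the register file itself.
    """
    for stage in annex:
        if stage.valid and stage.rd == rs:
            return stage.value  # bypass from the annex
    return register_file[rs]    # no in-flight producer: read the register file


if __name__ == "__main__":
    regs = {5: 10}
    annex = [AnnexStage(True, 5, 30), AnnexStage(True, 5, 20)]
    print(read_operand(5, annex, regs))
```

The comparison `stage.rd == rs` plays the part of the compare logic, and the search order plays the part of the priority logic.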
A multi-ported register file is typically metal-limited, with the area consumed by the circuit proportional to the square of the number of ports. The multi-ported register file is divided into register file segments that are individually allocated among functional unit/register file segment pair clusters. The plurality of separate and independent register files forms a layout structure with an improved layout efficiency. The read ports of the total register file structure are allocated among the separate and individual register files. The separate and individual register files have write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent.
In one illustrative embodiment, a 16-port register file structure with twelve read ports and four write ports is split into four separate and individual 7-port register files, each with three read ports and four write ports. The area of a single 16-port register file would have a size proportional to 16 times 16, or 256. Each of the separate and individual register files has a size proportional to 7 times 7, or 49, for a total of 4 times 49, or 196. The capacity of a single 16-port register file and the four 7-port register files is identical, with the split register file structure advantageously having a significantly reduced area. The reduced area advantageously corresponds to an improvement in access time of a register file, and thus speed performance, due to a reduction in the length of word lines and bit lines connecting the array cells that reduces the time for a signal to pass on the lines. The improvement in speed performance is highly advantageous due to strict time budgets that are imposed by the specification of high-performance processors and also to attain a large capacity register file that is operational at high speed.
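The area comparison above follows directly from the ports-squared rule and can be reproduced in a few lines. The function names are hypothetical, and the result is in the same relative units the text uses, not physical area.

```python
def relative_area(read_ports, write_ports):
    """Relative area of one register file under the ports-squared rule."""
    ports = read_ports + write_ports
    return ports * ports


def split_relative_area(total_read_ports, write_ports, num_segments):
    """Relative area when read ports are divided among the segments and
    every segment keeps all write ports (writes are broadcast)."""
    reads_per_segment = total_read_ports // num_segments
    return num_segments * relative_area(reads_per_segment, write_ports)


if __name__ == "__main__":
    single = relative_area(12, 4)          # one 16-port file
    split = split_relative_area(12, 4, 4)  # four 7-port files
    print(single, split, split / single)
```

The ratio of 196 to 256 means the split structure occupies roughly three quarters of the relative area of the single 16-port file under this rule.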
`
BRIEF DESCRIPTION OF THE DRAWINGS

The features of the described embodiments are specifically set forth in the appended claims. However, embodiments of the invention relating to both structure and method of operation may best be understood by referring to the following description and accompanying drawings.

FIG. 1 is a graph illustrating a comparison of instruction issue efficiency and processor size as VLIW group width is varied.

FIG. 2 is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.

FIG. 3 is a schematic block diagram showing the core of the processor.

FIG. 4 is a schematic block diagram that illustrates an embodiment of the split register file that is suitable for usage in the processor.

FIG. 5 is a schematic block diagram that shows a logical view of the register file and functional units in the processor.

FIG. 6 is a pictorial schematic diagram depicting an example of instruction execution among a plurality of media functional units.

FIG. 7 is a schematic timing diagram that illustrates timing of the processor pipeline.

FIGS. 8A-8C are, respectively, a schematic block diagram showing an embodiment of a general functional unit, a simplified schematic timing diagram showing timing of a general functional unit pipeline, and a bypass diagram showing possible bypasses for the general functional unit.

FIG. 9 is a simplified schematic timing diagram illustrating timing of media functional unit pipelines.

FIGS. 10A-10C respectively show an instruction sequence table, and two pipeline diagrams illustrating execution of a VLIW group which shows stall operation for a five-cycle pair instruction and a seven-cycle pair instruction.

FIGS. 11A-11C are pipeline diagrams illustrating synchronization of pair instructions in a group.

FIGS. 12A, 12B, 12C, and 12D are respective schematic block diagrams illustrating the pipeline control unit segments allocated to all of the functional units GFU, MFU1, MFU2, and MFU3.

FIG. 13 is a schematic block diagram that illustrates a load annex block within the pipeline control unit.

FIGS. 14A and 14B illustrate respective stall conditions in accordance with operation of some embodiments of the present invention.

The use of the same reference symbols in different drawings indicates similar or identical items.
DESCRIPTION OF THE EMBODIMENT(S)

Referring to FIG. 2, a schematic block diagram illustrates a single integrated circuit chip implementation of a processor 100 that includes a memory interface 102, a geometry decompressor 104, two media processing units 110 and 112, a shared data cache 106, and several interface controllers. The interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die. The components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time. The interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120. The illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller. The shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each media processing unit. The data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown). The data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.
The UPA controller 116 is a custom interface that attains a suitable balance between high-performance computational and graphic subsystems. The UPA is a cache-coherent, processor-memory interconnect. The UPA attains several advantageous characteristics including a scaleable bandwidth through support of multiple bused interconnects for data and addresses, packets that are switched for improved bus utilization, higher bandwidth, and precise interrupt processing. The UPA performs low latency memory accesses with high throughput paths to memory. The UPA includes a buffered cross-bar memory interface for increased bandwidth and improved scaleability. The UPA supports high-performance graphics with two-cycle single-word writes on the 64-bit UPA interconnect. The UPA interconnect architecture utilizes point-to-point packet switched messages from a centralized system controller to maintain cache coherence. Packet switching improves bus bandwidth utilization by removing the latencies commonly associated with transaction-based designs.

The PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume, low-cost peripheral devices, although other standard interfaces may also be used. The PCI bus effectively transfers data among high bandwidth peripherals and low bandwidth peripherals, such as CD-ROM players, DVD players, and digital cameras.
Two media processing units 110 are included in a single integrated circuit chip to support an execution environment exploiting thread level parallelism in which two independent threads can execute simultaneously. The threads may arise from any sources such as the same application, different applications, the operating system, or the runtime environment. Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general purpose code. For example, the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions. A typical "general-purpose" processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time. The illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.

Thread level parallelism is particularly useful for Java™ applications which are bound to have multiple threads of execution. Java™ methods including "suspend", "resume", "sleep", and the like include effective support for threaded program code. In addition, Java™ class libraries are thread-safe to promote parallelism. (Java™, Sun, Sun Microsystems and the Sun Logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks, including UltraSPARC I and UltraSPARC II, are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.) Furthermore, the thread model of the processor 100 supports a dynamic compiler which runs as a separate thread using one media processing unit 110 while the second media processing unit 110 is used by the current application. In the illustrative system, the compiler applies optimizations based on "on-the-fly" profile feedback information while dynamically modifying the executing code to improve execution on each subsequent run. For example, a "garbage collector" may be executed on a first media processing unit 110, copying objects or gathering pointer information, while the application is executing on the other media processing unit 110.
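The utilization argument for the eight-wide machine can be made concrete with simple arithmetic. This is a rough sketch only: the function names are hypothetical, and it assumes each thread sustains the same average instruction level parallelism of about two and that threads never contend for the same execution units.

```python
def idle_units(issue_width, ilp):
    """Execution units left idle, on average, by a single thread."""
    return max(issue_width - ilp, 0)


def busy_units(issue_width, ilp_per_thread, threads):
    """Execution units kept busy, on average, by several independent threads."""
    return min(issue_width, ilp_per_thread * threads)


if __name__ == "__main__":
    print(idle_units(8, 2))     # single thread on the eight-wide machine
    print(busy_units(8, 2, 1))  # one thread
    print(busy_units(8, 2, 2))  # two independent threads
```

Going from one thread to two doubles the number of busy units in this model, which corresponds to the "possibly attaining twice the performance" statement in the text.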
Although the processor 100 shown in FIG. 2 includes two processing units on an integrated circuit chip, the architecture is highly scaleable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution. Thus, in the processor 100, a limitation on the number of processors formed on a single die arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
Referring to FIG. 3, a schematic block diagram shows the core of the processor 100. The media processing units 110 each include an instruction cache 210, an instruction aligner 212, an instruction buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit 218. In the illustrative processor 100, the media processing units 110 use a plurality of execution units for executing instructions. The execution units for a media processing unit 110 include three media functional units (MFU) 220 and one general functional unit (GFU) 222. The media functional units 220 are multiple single-instruction-multiple-datapath (MSIMD) media functional units. Each of the media functional units 220 is capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-datapath capability for the processor 100 including add, multiply-add, shift, compare, and the like. The media functional units 220 operate in combination as tightly-coupled digital signal processors (DSPs). Each media functional unit 220 has a separate and individual sub-instruction stream, but all three media functional units 220 execute synchronously so that the subinstructions progress lock-step through pipeline stages.
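A parallel 16-bit add of the kind listed above can be sketched as follows. The packing of two 16-bit components into one 32-bit word and the wraparound (non-saturating) behavior are assumptions made for illustration; the text only states that the units process parallel 16-bit components. All names are hypothetical.

```python
MASK16 = 0xFFFF


def pack16(hi, lo):
    """Pack two 16-bit components into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16(word):
    """Split a 32-bit word into its high and low 16-bit components."""
    return (word >> 16) & MASK16, word & MASK16


def parallel_add16(a, b):
    """Add the two 16-bit lanes independently; no carry crosses lanes."""
    a_hi, a_lo = unpack16(a)
    b_hi, b_lo = unpack16(b)
    return pack16(a_hi + b_hi, a_lo + b_lo)


if __name__ == "__main__":
    result = parallel_add16(pack16(1, 0xFFFF), pack16(2, 1))
    print(hex(result))  # low lane wraps to 0 without disturbing the high lane
```

A plain 32-bit add of the same operands would carry out of the low lane into the high lane, which is exactly what the lane-wise operation avoids.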
The general functional unit 222 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others. The general functional unit 222 supports less common parallel operations such as the parallel reciprocal square root instruction.
`The illustrat
