US007847803B1

(12) United States Patent
     Van Hook

(10) Patent No.:      US 7,847,803 B1
(45) Date of Patent:  Dec. 7, 2010

(54) METHOD AND APPARATUS FOR INTERLEAVED GRAPHICS PROCESSING

(75) Inventor:  Timothy J. Van Hook, Atherton, CA (US)

(73) Assignee:  ATI Technologies ULC (CA)

( * ) Notice:   Subject to any disclaimer, the term of this patent is extended or
                adjusted under 35 U.S.C. 154(b) by 255 days.

(21) Appl. No.: 09/625,812

(22) Filed:     Jul. 26, 2000

(51) Int. Cl.
     G09G 5/36  (2006.01)
     G06F 9/46  (2006.01)

(52) U.S. Cl. ........................ 345/559; 718/107; 718/102

(58) Field of Classification Search ................. 345/505, 345/560, 422, 606, 426,
     501, 506, 502, 530, 345/559; 712/2, 32, 207, 229, 235, 202, 712/214, 220, 216;
     717/149, 124; 709/105; 718/107, 102, 100
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,115,513 A *   5/1992  Nemirovsky et al. ....... 712/36
     5,261,063 A *  11/1993  Kohn et al. ............. 712/229
     5,357,617 A *  10/1994  Davis et al. ............ 712/245
     5,710,912 A *   1/1998  Schlansker et al. ....... 712/220
     5,768,594 A *   6/1998  Blelloch et al. ......... 717/149
     5,778,250 A *   7/1998  Dye ..................... 712/32
     5,961,628 A *  10/1999  Nguyen et al. ........... 712/2
     5,973,705 A *  10/1999  Narayanaswami ........... 345/505
     6,061,710 A *   5/2000  Eickemeyer et al. ....... 718/107
     6,161,173 A *  12/2000  Krishna et al. .......... 712/214
     6,209,083 B1 *   3/2001  Naini et al. ............ 712/229
     6,223,276 B1 *   4/2001  Lee et al. .............. 712/207
     6,330,584 B1 *  12/2001  Joffe et al. ............ 718/107
     6,341,347 B1 *   1/2002  Joy et al. .............. 712/228
     6,493,820 B2 *  12/2002  Akkary et al. ........... 712/235

     * cited by examiner

Primary Examiner - Joni Hsu
(74) Attorney, Agent, or Firm - Vedder Price, P.C.

(57)                    ABSTRACT

The present invention provides for programmable interleaved graphics processing. The invention provides an execution pipeline and a number of registers. Each register holds instructions from a separate program. Instructions from the registers are interleaved in the execution pipeline such that the average latency is one instruction per cycle. This is accomplished even when there is conditional branching and execution latency. When one instruction has a dependency based on execution of a previous instruction, that second instruction is not provided to the execution pipeline until completion of the first instruction. However, in the meantime interleaved instructions from other programs are still being executed while the first instruction of the first program is executing. Thus the pipeline is always full and the processor is always working at peak capacity. The automatic interleaving of instructions permits simplified graphics software routines to be written. There is no need for the programmer or developer to anticipate or attempt to eliminate conditional branching or to worry about instruction latency. The design of the programmable interleaved graphics processing system provides a solution to those problems.

7 Claims, 5 Drawing Sheets
U.S. Patent          Dec. 7, 2010          Sheet 1 of 5          US 7,847,803 B1

[FIG. 1 (block diagram): graphics I/F, DMA, or graphics processor 102; programmable interleaved graphics processor (PIGP) 103A containing instruction memory 104, data memory 105, and registers 106; other graphics processing 107; additional PIGPs 103B through 103N.]
U.S. Patent          Dec. 7, 2010          Sheet 2 of 5          US 7,847,803 B1

[FIG. 2 (block diagram): target program counter 201; program counters 202A-202N; instruction memory 203; instruction decode 204; registers 205A, 205B, 205C, ... 205N; operand route 206; arithmetic datapath 207; result route 208; data memory 209.]
U.S. Patent          Dec. 7, 2010          Sheet 3 of 5          US 7,847,803 B1

[FIG. 3 (flow diagram): receive instruction stream 301; identify N programs 302; assign program counters and registers 303; load instructions 304.]
U.S. Patent          Dec. 7, 2010          Sheet 4 of 5          US 7,847,803 B1

[FIG. 4 (flow diagram): new program desired? 400; start new program 401; assign output register slot to program 402; execute instructions 403; program done? 404; load output in reserved space 405; this and all programs done? 406; output slot available? 407; instruction available? 408; no-op steps when no instruction or output slot is available.]
U.S. Patent          Dec. 7, 2010          Sheet 5 of 5          US 7,847,803 B1

[FIG. 5 (drawing): a 4 by 32b vector register with elements 0 through 3, usable as a 4 element vector, a 3 element vector, or a scalar.]

[FIG. 6 (drawing): an example 64-bit instruction format (bits 63 to 0) with opcode, subopcode, immediate, and register fields.]
METHOD AND APPARATUS FOR INTERLEAVED GRAPHICS PROCESSING

BACKGROUND

1. Field of the Invention

This invention relates to the field of graphics processing.

Portions of the disclosure of this patent document contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyright rights whatsoever.
2. Background

Computer systems are often used to generate and display graphics on a display. Display images are made up of thousands of tiny dots, where each dot is one of thousands or millions of colors. These dots are known as picture elements, or "pixels". Each pixel has a color, with the color of each pixel being represented by a number value stored in the computer system.

A three dimensional display image, although displayed using a two dimensional array of pixels, may in fact be created by rendering of a plurality of graphical objects. Examples of graphical objects include points, lines, polygons, and three dimensional solid objects. Points, lines, and polygons represent rendering "primitives" which are the basis for most rendering instructions. More complex structures, such as three dimensional objects, are formed from a combination or mesh of such primitives. To display a particular scene, the visible primitives associated with the scene are drawn individually by determining those pixels that fall within the edges of the primitive, and obtaining the attributes of the primitive that correspond to each of those pixels. The obtained attributes are used to determine the displayed color values of applicable pixels.

Sometimes, a three dimensional display image is formed from overlapping primitives or surfaces. A blending function based on an opacity value associated with each pixel of each primitive is used to blend the colors of overlapping surfaces or layers when the top surface is not completely opaque. The final displayed color of an individual pixel may thus be a blend of colors from multiple surfaces or layers.

In some cases, graphical data is rendered by executing instructions from an application that is drawing data to a display. During image rendering, three dimensional data is processed into a two dimensional image suitable for display. The three dimensional image data represents attributes such as color, opacity, texture, depth, and perspective information. The draw commands from a program drawing to the display may include, for example, X and Y coordinates for the vertices of the primitive, as well as some attribute parameters for the primitive, and a drawing command. The execution of drawing commands to generate a display image is known as graphics processing.

The prior art has provided two solutions to accomplish graphics processing. One solution is to build special processing hardware to provide high speed graphics processing capability. The other is to provide programmable graphics processing by executing graphics processing software on a general purpose processing platform. Both prior art solutions have drawbacks that limit their flexibility and performance.

Hardware solutions typically provide special purpose hardware that implements hardwired graphics processing algorithms that can provide very fast processing capabilities. However, the design and debugging of hardware is a complex, expensive, and time consuming process. Hardware solutions
are also inflexible. Should a new algorithm become known, the only way to implement it is to build a new hardware product. Thus, the hardware solution lacks the flexibility needed to respond to changing conditions. In addition, hardware solutions generally are only available to provide processing capability to graphics processing tasks. For non-graphics processing tasks, additional processing capabilities are required, adding to the expense of a graphics processing system.

Prior art software solutions provide a programming language that can be executed on a general purpose processing system. Rendering commands from a program drawing to the display are interpreted and executed in software. Software solutions are more flexible than hardware solutions in that new algorithms and techniques can be implemented by writing new software, which is easier than designing and building new hardware. However, existing software solutions also suffer from a number of disadvantages.

One prior art software problem is instruction execution latency. Many graphics algorithms consist of a small number of reduced instruction set computing (RISC) single instruction, multiple data (SIMD) instructions (a few dozen instructions or less). The instructions may be independent or dependent. A dependent instruction is an instruction that contains operand dependencies, that is, the source operands of one instruction are the result operands of a prior instruction. For example, a typical quadratic polynomial d=ax**2+bx+c might be coded as

e=x*a+b

d=x*e+c

Since the result (e) of the first instruction is a source (e) in the second instruction, the second instruction cannot be executed until completion of the first instruction, creating an operand dependency. If the pipeline latency is, for example, 7 clocks, the second instruction could not begin until 7 clocks had transpired, which is very inefficient (15% of peak throughput).
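As a rough illustration of the figure quoted above, the following short Python sketch is not from the patent; it assumes a single program, a 7 clock issue-to-writeback latency, and a peak rate of one instruction per clock, and shows how a chain of dependent instructions falls to roughly one seventh of peak.

    def dependent_chain_throughput(n_instructions, latency=7):
        """Fraction of peak throughput for a chain of fully dependent instructions."""
        # Each instruction can issue only after the previous result is written back,
        # so consecutive issues are `latency` clocks apart.
        clocks = 1 + (n_instructions - 1) * latency
        return n_instructions / clocks

    print(f"{dependent_chain_throughput(1000):.1%} of peak")   # roughly 14-15% of peak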
There are several approaches to alleviating this problem. One is the scheduling of independent instructions between the dependent instructions, so that processing continues even when a dependent instruction is waiting for data. This requires optimizing of the execution of the program by a programmer or by an optimizing compiler. Requiring the programmer to optimize during development is a complex task. An optimizing compiler is difficult to write and may not always optimize.

Another problem is that with many short graphics programs there are often not enough independent instructions to schedule. If there are no independent instructions to execute, the processor is idle, reducing efficiency.

Another attempt to optimize software execution of graphics commands is to interleave instructions from multiple vertices and pixels, such as different loop passes. This is even more complex than prescheduling, and leads to inefficiencies when modes change.

Another approach is to shorten the pipeline latency, or pipeline bypassing (so that operands still in the pipeline can be used by other instructions before being written back to the registers). Both of these solutions require complex hardware to control and route all the operands in the pipeline, and are of benefit only when operations can execute in a single clock cycle. Operations that take multiple clocks, such as floating point multiply and add, are not optimized since the partial results within them cannot be bypassed to other instructions.
Software solutions also suffer from conditional execution. When a program executes a conditional branch dependent on the results of a previous instruction, the next instruction cannot begin execution until the result is computed, inefficiently waiting up to the depth of the pipeline. There are several approaches to alleviating this problem. Results in the pipeline can be bypassed to the branch control, or the pipeline depth can be shortened, with similar hardware complexities to operand dependency above. Another approach is branch delay slots, in which several instructions following the branch instruction are executed, which leads to software complexity in scheduling those instructions, especially in short graphics programs.

Another approach is speculative execution and/or branch prediction of instructions after the branch and/or at the branch target, executing the branch whether needed or not. This leads to inefficiencies when the speculative or predicted instructions are not needed.
SUMMARY OF THE INVENTION

The present invention provides efficient graphics processing in a software environment by interleaving of graphics processing instructions. The invention provides an execution pipeline and a number of registers. Each register holds instructions from a separate program. (The definition of "program" here is an operation performed on data. For example, the same operation on different data is treated as two programs, different operations on the same data represent two programs, or different operations on different data are two programs.)

Instructions from the registers are interleaved in the execution pipeline such that the average latency is one instruction per cycle. This is accomplished even when there is conditional branching and execution latency. When one instruction of a program has a dependency based on execution of a previous instruction, that second instruction is not provided to the execution pipeline until completion of the first instruction. In the meantime, interleaved instructions from other programs are still being executed while the first instruction of the first program is executing. Thus the pipeline is always full and the processor is always working at peak capacity.

The automatic interleaving of instructions permits simplified graphics software routines to be written. There is no need for the programmer or developer to anticipate or attempt to eliminate conditional branching or to worry about instruction latency. The programmable interleaved graphics processing system of the invention provides a solution to those problems.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system for implementing the present invention.

FIG. 2 is an example embodiment of the programmable interleaved graphics processor of the present invention.

FIG. 3 illustrates a flow diagram of one embodiment of an interleaver for use with the present invention.

FIG. 4 illustrates the operation of the system for a set of programs where order is important.

FIG. 5 illustrates the organization of a register file.

FIG. 6 is an example of an instruction format.
DETAILED DESCRIPTION OF THE INVENTION

A method and apparatus for interleaved graphics processing is described. In the following description, numerous specific details are set forth in order to provide a more detailed
description of the invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well known details have not been provided so as to not unnecessarily obscure the invention.

The processing of three dimensional graphics is characterized by large amounts of processing. Graphics processors can process tens of millions of vertices and geometric primitives, and hundreds of millions of pixels per second. Because there is so much processing, any delay or inefficiency in program execution is harmful, and can be propagated very rapidly. But, because there is so much processing, with many of the tasks being independent from other tasks, there is the opportunity to continue processing other tasks when one task must be delayed.

This invention takes advantage of the large amount of processing inherent in graphics processing by interleaving the programs for several vertices, primitives, pixels, or other graphical elements in a processing pipeline. In other words, at each clock cycle, a different one of the programs is executing its instructions. When a program is waiting for the results of a first instruction for data needed in a second instruction, the processor is not idle; it is executing instructions of other programs. When a number of programs equal to or greater than the depth of the processing pipeline is executing, the apparent instruction latency in each program can be as little as one instruction. Consider where there are four programs each with five instructions. Using interleaving, the four programs finish execution in twenty clock cycles, giving an average latency of one instruction per clock cycle. Using the invention, the results of an instruction in a program are available as sources to the next instruction in that program, without the hardware complexity of pipeline bypasses, the software complexity of instruction reordering, or the inefficiency of idle cycles. This results in a simple software programming model at low hardware cost and high efficiency.
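The twenty clock example above can be sketched directly. The following Python fragment is only an illustration; it assumes simple round-robin issue of one instruction per clock and a pipeline deep enough that each program's previous result is ready when its next instruction issues.

    def round_robin_schedule(n_programs=4, instrs_per_program=5):
        """Yield (clock, program, instruction_index) for round-robin interleaving."""
        clock = 0
        for instr in range(instrs_per_program):
            for program in range(n_programs):
                clock += 1
                yield clock, program, instr

    schedule = list(round_robin_schedule())
    print(f"{len(schedule)} instructions issued in {schedule[-1][0]} clocks")
    # 20 instructions in 20 clocks: an apparent latency of one instruction per clock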
For interleaved processing, the storage elements (registers) and program counters (PCs) are unique for each program. There are as many copies of registers and PCs as there are programs executing. In one embodiment, there are as many programs executing as the depth of the instruction execution pipeline between an arithmetic result and source dependency, or a load source address and data dependency, or a branch source condition and the target instruction fetch. If branch instructions include Call and Return instructions, then a PC stack per program is also required.

Since only one instruction or instruction group is fetched per program per cycle, a single shared instruction memory or cache can be used, and since only one load/store instruction or group accesses data memory per cycle, a single shared data memory or cache can be used.

An advantage of the invention is that since mode changes (program or data changes that affect the processing of a graphical element) are frequent in graphics applications, each program executes independently on the program and data of its own element. Each element's mode is independent, and modes can be changed per element without performance degradation.

If data sharing between programs is required, the invention can be extended to include external synchronization mechanisms, such as stalling execution at specific instruction or data addresses, or internal synchronization mechanisms such as conditional branches on the state of other programs.
For example, a typical graphics processor pipeline might consist of seven execution stages (although the number of execution stages in practice varies widely with frequency, device characteristics, and instruction set.) As an example,
three general types of instructions which are latency sensitive are shown. One is an arithmetic instruction such as a floating point multiply add

D=A*B+C;

with a three cycle float multiply add operation. Another is a conditional branch instruction based on a register value less than zero, a decrement of that register, and a branch target with a register offset from the current PC,

if (A--<0) goto (PC+B);

where PC is the program counter. A third is a load instruction from a local memory at a computed address

D=*(A+B);

The following types are fairly representative of general RISC instructions.

        Arithmetic           Branch                   Load

  1     Source Fetch         Source Fetch             Source Fetch
  2     Operand Route        Operand Route            Operand Route
  3     Madd A               Compare                  Address Add
  4     Madd B               Add PC, Dec Operand      Address Route
  5     Madd C               Instruction Fetch        Memory Access
  6     Result Route         Instruction Decode       Data Route
  7     Destination Write    Destination Write        Destination Write

Memory store instructions are similar to loads, except the store data is a source operand fetch and no destination route and write occurs, so stores are typically not latency sensitive. Subroutine calls and returns are similar to branches, with the read or write of a PC stack with the previous or target PC.
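For reference, the three stage sequences in the table can be restated as plain data. The Python sketch below is purely illustrative; the stage names are those of the table, and the check at the end simply confirms that each instruction type occupies seven stages.

    # Stage sequences from the table above; treating them as Python lists is illustrative.
    PIPELINE_STAGES = {
        "arithmetic": ["Source Fetch", "Operand Route", "Madd A", "Madd B",
                       "Madd C", "Result Route", "Destination Write"],
        "branch":     ["Source Fetch", "Operand Route", "Compare",
                       "Add PC, Dec Operand", "Instruction Fetch",
                       "Instruction Decode", "Destination Write"],
        "load":       ["Source Fetch", "Operand Route", "Address Add",
                       "Address Route", "Memory Access", "Data Route",
                       "Destination Write"],
    }

    # All three latency sensitive types occupy seven stages, which is why interleaving
    # seven programs (one per stage) keeps the pipeline full.
    assert all(len(stages) == 7 for stages in PIPELINE_STAGES.values())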
With the seven cycles of pipeline latency shown in the example, consider where programs for seven graphics primitives are executing interleaved in the pipeline. With the arithmetic instruction as an example, the program execution of 7 programs 0-6 would interleave like the following over clocks 1 to 8.

  1      2      3      4      5      6      7      8

  SF0    SF1    SF2    SF3    SF4    SF5    SF6    SF0
         OR0    OR1    OR2    OR3    OR4    OR5
                MA0    MA1    MA2    MA3    MA4
                       MB0    MB1    MB2    MB3
                              MC0    MC1    MC2
                                     RR0    RR1
                                            DW0

The source fetch of program 0 (SF0) occurs at clock 1. At clock 2, the source fetch of program 1 (SF1) is executed, along with the operand route step of program 0 (OR0). At each clock, the source fetch of each succeeding program is placed in the pipeline, while earlier programs continue execution. The result is such that in the 8th clock the next instruction of program 0 begins execution, and the result of the previous instruction in that program, or the load data or branch target location of that program, is available to that instruction. The functional latency of the instructions visible to the program and programmer or compiler is one instruction, even though the hardware pipeline latency of the instruction is many more. (Note that without interleaving, the subsequent instructions of program 0 would immediately follow the first instruction of program 0, but would need to stall at some point to wait for dependent data, creating pipeline inefficiencies.)
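The diagram above can be regenerated mechanically. The Python sketch below is an illustration only; the abbreviations SF, OR, MA, MB, MC, RR, DW follow the arithmetic stage sequence in the table, and program p enters the pipeline at clock p+1.

    STAGES = ["SF", "OR", "MA", "MB", "MC", "RR", "DW"]

    def stage_at(program, clock):
        """Pipeline stage occupied at `clock` (1-based) by `program` (0-based)."""
        index = clock - 1 - program            # program p enters the pipeline at clock p+1
        if 0 <= index < len(STAGES):
            return f"{STAGES[index]}{program}"
        if index == len(STAGES):
            return f"SF{program}"              # next instruction of the same program
        return "   "

    for clock in range(1, 9):                  # clocks 1 to 8, as in the diagram
        print(f"clock {clock}:", " ".join(stage_at(p, clock) for p in range(7)))
    # At clock 7 program 0 writes its destination (DW0); at clock 8 its next
    # instruction fetches (SF0), so the previous result is already available.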
The interleaved graphics processor of the present invention may be part of a larger graphics pipeline. Referring to FIG. 1, a programmable interleaved graphics processor 103A includes registers 106, data memories 105, and/or instruction memory 104. These may be written with graphics data by a general purpose CPU 100 by means of a command processor or other interface or by DMA from memory (102). Some output registers may input data and commands to further hardwired graphics processing such as polygon rasterization, texturing, or framebuffer blending and Z buffering, or a subsequent programmable graphics processor 107. Parallelism can be provided by providing multiple programmable interleaved graphics processors (PIGPs) 103B through 103N.

An example embodiment of the programmable interleaved graphics processor is shown in FIG. 2. A target program counter 201 is coupled to a plurality of program counters 202A through 202N. The number of program counters (PCs) should match the number of programs that are being interleaved at one time (e.g., seven in the above example). The program counters are coupled to instruction memory 203. Instructions from instruction memory 203 are provided to instruction decode 204 and then to an appropriate one of registers 205A through 205N. There are as many registers as there are programs being interleaved (and ideally, as many programs as stages in the processing pipeline).

The registers 205A through 205N are coupled to operand route 206. The output of operand route 206 is coupled to the arithmetic datapath 207. Data memory 209 and output from datapath 207 are provided to result route 208. The output of result route 208 is made available to multiple registers 205A through 205N.
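To make the per-program state of FIG. 2 concrete, the following data-structure sketch gives each interleaved program its own program counter (202A-202N) and register set (205A-205N). The register count and the names used here are illustrative assumptions, not taken from the patent.

    from dataclasses import dataclass, field
    from typing import List

    N_INTERLEAVED = 7        # e.g. seven programs, matching the example pipeline depth
    REGS_PER_PROGRAM = 16    # illustrative register file size

    @dataclass
    class ProgramContext:
        pc: int = 0                                            # per-program PC (202x)
        registers: List[int] = field(
            default_factory=lambda: [0] * REGS_PER_PROGRAM)    # per-program registers (205x)

    contexts = [ProgramContext() for _ in range(N_INTERLEAVED)]
    active = 0   # rotated by the target program counter (201) each clock

    def next_fetch():
        """Return (program, pc) selecting which context fetches from instruction memory (203)."""
        global active
        program, pc = active, contexts[active].pc
        contexts[active].pc += 1                       # advance that program's PC
        active = (active + 1) % N_INTERLEAVED          # round-robin to the next program
        return program, pc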
In graphics processing, the graphical elements that are being processed are known by the CPU. Each element includes its processing mode, that is, which program is being applied to the element. The interleaving process queues up a number of programs appropriate for the size of the pipeline. The program counters and registers are assigned to each program and one instruction for each program is routed into the pipeline each clock. A typical three dimensional graphic scene can have hundreds of thousands of programs available for batching in a group for interleaving.

FIG. 3 illustrates a flow diagram of one embodiment of an interleaver for use with the present invention. At step 301 an instruction stream is received, with each instruction consisting of a program and data to be used by a graphical element. At step 302 the interleaver identifies N programs where N is the depth of the pipeline. (Note that in some circumstances, the number of programs identified could be less than N.)

At step 303 the interleaver assigns program counters and registers to the N selected programs. At step 304 the instructions from those programs are loaded as appropriate in the instruction stream of the graphics processor.
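A minimal sketch of the interleaver flow of FIG. 3 follows (steps 301 through 304). The shape of the incoming instruction stream and the context fields are assumptions made only for illustration.

    PIPELINE_DEPTH = 7   # N, the depth of the pipeline

    def interleave(instruction_stream):
        """instruction_stream: iterable of (program_id, instructions, element_data) tuples."""
        stream = list(instruction_stream)                  # step 301: receive the stream

        batch = stream[:PIPELINE_DEPTH]                    # step 302: identify up to N programs

        contexts = {                                       # step 303: assign PCs and registers
            slot: {"pc": 0, "registers": {}, "program_id": program_id}
            for slot, (program_id, _, _) in enumerate(batch)
        }

        interleaved = []                                   # step 304: load instructions,
        longest = max((len(instrs) for _, instrs, _ in batch), default=0)
        for index in range(longest):                       # one instruction per program per pass
            for slot, (_, instrs, _) in enumerate(batch):
                if index < len(instrs):
                    interleaved.append((slot, instrs[index]))
        return contexts, interleaved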
Loads

The present invention could also support loads because the latency of loads is approximately the same as register latency. With a cache, load latency on a cache hit would fall in line with vertex batching quite well (6 clock fetch, mux, add, mux, read, mux, writeback/bypass). However, cache misses might have long latency. A solution would be register dependency checking, renaming, etc. Another solution is to have the batch (or the thread) stall on a miss. A prefetch variant of load could be supported to hide miss latency. Since the load destination could be only a temp register, it is possible to keep a list of load targets (for each thread), and compare each of 3 sources to each pending load, and stall on collision or the Nth pending load. Since only two temp even and two temp odd sources (from a
vector/scalar pair) are possible in each cycle, another scheme is to keep a 1 b tag per temp register (dual port or two copies) with one bit per thread, where the bit is set on load, cleared on write, and read by pending instructions that stall if the bit is set. A 2 or 4 by 16 by 6-8 bit load use tag ram may be cheaper than the compares of pending loads by 4 accesses by 4 b registers, and allows unlimited pending loads.
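The one-bit load-use tag just described can be sketched as follows; the thread and register counts are illustrative, and the function names are not from the patent.

    N_THREADS = 7
    N_TEMP_REGS = 16

    # load_pending[thread][reg] is set while a load into that temp register is outstanding
    load_pending = [[False] * N_TEMP_REGS for _ in range(N_THREADS)]

    def issue_load(thread, dest_reg):
        load_pending[thread][dest_reg] = True       # bit set on load

    def load_writeback(thread, dest_reg):
        load_pending[thread][dest_reg] = False      # bit cleared on write

    def must_stall(thread, source_regs):
        """A pending instruction stalls if any of its sources still has a load outstanding."""
        return any(load_pending[thread][reg] for reg in source_regs)

    issue_load(0, 3)                     # thread 0 loads into temp register 3
    assert must_stall(0, [1, 3])         # an instruction reading register 3 must stall
    load_writeback(0, 3)
    assert not must_stall(0, [1, 3])     # once the load data is written, it may proceed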
The invention also can operate efficiently when no-ops (no operations) may be required. For example, in rasterizing, the order of data delivery is important. This may require the insertion of no-ops into the instruction stream or retarding the launching of new programs until the first program finishes. The invention provides a solution by reserving space in an output buffer where data can be placed in the buffer out of order. The programs are pre-sorted in the desired order, with spaces in the output buffer reserved for each program. Thus, even if a program finishes early, before a previously issued program, its output is stored in the correct location in the output buffer, with the output buffer not accessed until a collection of "in-order" data is available.

In this situation, no new programs are started until an output buffer for each new program is available. This requires inserting no-ops in the instruction stream for this condition.
FIG. 4 illustrates the operation of the system for a set of programs where order is important. At decision block 400 it is determined if a new program is desired. If not, the system loops; if so, the system proceeds to step 401. At step 401, the new program is started. At step 402, output slots are assigned to the program to provide the appropriate order. At step 403, an instruction is executed. At decision block 404 it is determined if one of the programs is done. If so, the output of the program is loaded into its reserved space at step 405. If a program is not done at decision block 404, the system returns to step 403 to execute the next instruction.

After the output of a completed program is loaded into the appropriate reserved register at step 405, decision block 406 determines if this program and all of the programs of the current set are complete. If all of the programs have not completed, the system looks for the next instruction. At decision block 408 it is determined if an instruction is available, that is, for a program that has not yet finished execution. If there is an instruction available, the system returns to step 403 to execute the next instruction. If there is no instruction available at decision block 408, the system no-ops at step 409 and returns to block 408.

If all of the programs are complete at decision block 406, the system proceeds to decision block 407 to determine if an output slot is available. If not, the system no-ops at step 410. If a slot is available, the system returns to step 401 and starts a new program.
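The reserved-slot scheme of FIG. 4 can be sketched as a small buffer in which each newly started program is assigned the next output slot and results are drained only in order. The class name, method names, and buffer size below are assumptions for illustration.

    class OrderedOutputBuffer:
        def __init__(self, n_slots):
            self.slots = [None] * n_slots       # reserved spaces, in the desired output order
            self.next_slot = 0                  # next slot to hand out (step 402)
            self.next_read = 0                  # next in-order slot to drain

        def reserve_slot(self):
            """Assign an output slot to a newly started program, or None if none is free."""
            if self.next_slot - self.next_read >= len(self.slots):
                return None                     # no output slot available: caller no-ops
            slot = self.next_slot % len(self.slots)
            self.next_slot += 1
            return slot

        def complete(self, slot, result):
            """Load a finished program's output into its reserved space (step 405)."""
            self.slots[slot] = result

        def drain_in_order(self):
            """Yield results only while the next in-order slot has been filled."""
            while self.slots[self.next_read % len(self.slots)] is not None:
                slot = self.next_read % len(self.slots)
                yield self.slots[slot]
                self.slots[slot] = None
                self.next_read += 1

    buf = OrderedOutputBuffer(4)
    first, second = buf.reserve_slot(), buf.reserve_slot()
    buf.complete(second, "output B")            # a later program finishes early
    assert list(buf.drain_in_order()) == []     # nothing drains until the first slot is filled
    buf.complete(first, "output A")
    assert list(buf.drain_in_order()) == ["output A", "output B"]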
As noted above, one embodiment of the invention includes a number of copies of registers at least equal to the number of interleaved programs, but a number of registers greater than the number of programs may also be used for system throughput optimization. With a larger number of copies of registers than the number of interleaved programs, some copies can operate as multiple input buffers and some copies can operate as multiple output buffers.

With multibuffered input, another system component, such as a CPU command interface or a memory command DMA unit or a rasterization and texturing unit, can write input registers with data for subsequent programs in parallel with current interleaved program execution. With multibuffered output, another system component, such as a setup and rasterization unit or a framebuffer blending and Z buffering unit, can read output registers from previous completed programs in parallel with current interleaved program execution.
With multibuffered input and output registers, continuous program execution can occur without idle cycles lost waiting for input or output data. In one embodiment, a new interleaved program begins execution whenever a valid input data register copy is available after having been written by another system unit, and when an output data register copy is free after having been read by another system unit.

One embodiment of the invention also includes instruction memory larger than that needed to hold all the programs being executed by the current interleaved programs, and data memory larger than that needed to hold the data set accessed by all current interleaved programs. With larger instruction and data memories, other system units can update the contents of these memories at locations not used by currently executing interleaved programs so that subsequent interleaved programs can begin execution without waiting for instruction or data memory updates from other system components.

In one embodiment, the input and output registers are double buffered, containing twice as many copies of input and output registers as the number of executing interleaved programs, and the instruction and data memories are also double buffered, containing twice as many instructions or as much data as are accessed by the executing interleaved programs.
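As an illustration of the double-buffered arrangement described above (the names and structure here are assumptions, not the patent's), each program can be given two copies of its input and output registers so that another system unit fills or drains the idle copy while the active copy is used for execution.

    N_PROGRAMS = 7
    COPIES = 2   # double buffered, per the embodiment above

    class DoubleBufferedRegisters:
        def __init__(self, n_programs=N_PROGRAMS, copies=COPIES):
            self.inputs = [[None] * copies for _ in range(n_programs)]
            self.outputs = [[None] * copies for _ in range(n_programs)]
            self.active = [0] * n_programs          # copy currently used by execution

        def write_input(self, program, data):
            """Another unit (CPU interface, DMA, rasterizer) fills the idle input copy."""
            self.inputs[program][1 - self.active[program]] = data

        def read_output(self, program):
            """Another unit drains the idle output copy of a previously completed program."""
            idle = 1 - self.active[program]
            data, self.outputs[program][idle] = self.outputs[program][idle], None
            return data

        def swap(self, program):
            """When execution completes and a valid input copy is ready, flip the copies."""
            self.active[program] = 1 - self.active[program]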
In addition to typical microprocessor types of instruction sets which are well known, such as arithmetic, logical, load store, and branch, the programmable interleaved graphics processor instruction set in one embodiment contains compound instructions such as are appropriate for performing common but complicated graphics operations. These compound instructions may perform several arithmetic, logical, and memory access operations.

A frequently useful compound instruction is a texture sample instruction, in which the processor supplies a texture coordinate to a texturing system component, and the component returns a filtered texture image sample at that coordinate. Other possibly useful compound instructions include a surface evaluation instruction, which returns a surface sample from a surface parameter or coordinate, and a visibility evaluation instruction, which returns a visibility status or coverage from an image depth coordinate.

The graphics processor instruction set may also be enhanced with complex instructions to perform common graphical algorithms (instead of performing those algorithms in a number of simpler instructions) such as matrix multiply, vector normalization, trigonometric functions, and exponentiation. Both compound and complex instructions can offer advantages in requiring fewer instructions to be accessed and providing higher p
