`
(12) United States Patent
     Van Hook

(10) Patent No.:     US 7,847,803 B1
(45) Date of Patent:     Dec. 7, 2010

(54) METHOD AND APPARATUS FOR INTERLEAVED GRAPHICS PROCESSING

(75) Inventor:  Timothy J. Van Hook, Atherton, CA (US)

(73) Assignee:  ATI Technologies ULC (CA)

( * ) Notice:   Subject to any disclaimer, the term of this patent is
                extended or adjusted under 35 U.S.C. 154(b) by 255 days.

(21) Appl. No.: 09/625,812

(22) Filed:     Jul. 26, 2000

(51) Int. Cl.
     G09G 5/36   (2006.01)
     G06F 9/46   (2006.01)

(52) U.S. Cl. ........................ 345/559; 718/107; 718/102

(58) Field of Classification Search ................. 345/505,
     345/560, 422, 606, 426, 501, 506, 502, 530, 559; 712/2, 32,
     207, 229, 235, 202, 214, 220, 216; 717/149, 124; 709/105;
     718/107, 102, 100
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,115,513 A  *   5/1992  Nemirovsky et al. ........... 712/36
     5,261,063 A  *  11/1993  Kohn et al. ................. 712/229
     5,357,617 A  *  10/1994  Davis et al. ................ 712/245
     5,710,912 A  *   1/1998  Schlansker et al. ........... 712/220
     5,768,594 A  *   6/1998  Blelloch et al. ............. 717/149
     5,778,250 A  *   7/1998  Dye ......................... 712/32
     5,961,628 A  *  10/1999  Nguyen et al. ............... 712/2
     5,973,705 A  *  10/1999  Narayanaswami ............... 345/505
     6,061,710 A  *   5/2000  Eickemeyer et al. ........... 718/107
     6,161,173 A  *  12/2000  Krishna et al. .............. 712/214
     6,209,083 B1 *   3/2001  Naini et al. ................ 712/229
     6,223,276 B1 *   4/2001  Lee et al. .................. 712/207
     6,330,584 B1 *  12/2001  Joffe et al. ................ 718/107
     6,341,347 B1 *   1/2002  Joy et al. .................. 712/228
     6,493,820 B2 *  12/2002  Akkary et al. ............... 712/235

     * cited by examiner

Primary Examiner - Joni Hsu
(74) Attorney, Agent, or Firm - Vedder Price, P.C.

(57) ABSTRACT
`
The present invention provides for programmable interleaved graphics processing. The invention provides an execution pipeline and a number of registers. Each register holds instructions from a separate program. Instructions from the registers are interleaved in the execution pipeline such that the average latency is one instruction per cycle. This is accomplished even when there is conditional branching and execution latency. When one instruction has a dependency based on execution of a previous instruction, that second instruction is not provided to the execution pipeline until completion of the first instruction. However, in the meantime interleaved instructions from other programs are still being executed while the first instruction of the first program is executing. Thus the pipeline is always full and the processor is always working at peak capacity. The automatic interleaving of instructions permits simplified graphics software routines to be written. There is no need for the programmer or developer to anticipate or attempt to eliminate conditional branching or to worry about instruction latency. The design of the programmable interleaved graphics processing system provides a solution to those problems.

7 Claims, 5 Drawing Sheets
`
[Representative drawing: FIG. 2 - target program counter 201, instruction memory 203, instruction decode 204, registers 205A-205N, operand route 206, arithmetic datapath 207, result route 208, data memory 209]
[FIG. 1 (Sheet 1 of 5): graphics I/F, DMA, or graphics processor 102; PIGP 103A and PIGPs 103B-103N; instruction memory 104; data memory 105; registers 106; other graphics processing 107]
[FIG. 2 (Sheet 2 of 5): target program counter 201; instruction memory 203; instruction decode 204; registers 205A, 205B, 205C, ..., 205N; arithmetic datapath 207; data memory 209]
[FIG. 3 (Sheet 3 of 5): receive instruction stream 301; identify N programs 302; assign program counters and registers 303; load instructions 304]
[FIG. 4 (Sheet 4 of 5): new program desired? 400; start new program 401; assign output register slot to program 402; execute instructions 403; program done? 404; load output in reserved space 405; this and all programs done? 406; output slot available? 407; instruction available? 408; no-op]
[FIG. 5 (Sheet 5 of 5): 4 by 32b vector register, elements 0-3; 4 element vector; 3 element vector; scalar]

[FIG. 6: instruction format, bits 63-0, with opcode, subopcode, and immediate fields]
`
`
`
`
METHOD AND APPARATUS FOR INTERLEAVED GRAPHICS PROCESSING

BACKGROUND

1. Field of the Invention
`This invention relates to the field of graphics processing.
Portions of the disclosure of this patent document contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyright rights whatsoever.
`2. Background
Computer systems are often used to generate and display graphics on a display. Display images are made up of thousands of tiny dots, where each dot is one of thousands or millions of colors. These dots are known as picture elements, or "pixels". Each pixel has a color, with the color of each pixel being represented by a number value stored in the computer system.
A three dimensional display image, although displayed using a two dimensional array of pixels, may in fact be created by rendering of a plurality of graphical objects. Examples of graphical objects include points, lines, polygons, and three dimensional solid objects. Points, lines, and polygons represent rendering "primitives" which are the basis for most rendering instructions. More complex structures, such as three dimensional objects, are formed from a combination or mesh of such primitives. To display a particular scene, the visible primitives associated with the scene are drawn individually by determining those pixels that fall within the edges of the primitive, and obtaining the attributes of the primitive that correspond to each of those pixels. The obtained attributes are used to determine the displayed color values of applicable pixels.
Sometimes, a three dimensional display image is formed from overlapping primitives or surfaces. A blending function based on an opacity value associated with each pixel of each primitive is used to blend the colors of overlapping surfaces or layers when the top surface is not completely opaque. The final displayed color of an individual pixel may thus be a blend of colors from multiple surfaces or layers.
In some cases, graphical data is rendered by executing instructions from an application that is drawing data to a display. During image rendering, three dimensional data is processed into a two dimensional image suitable for display. The three dimensional image data represents attributes such as color, opacity, texture, depth, and perspective information. The draw commands from a program drawing to the display may include, for example, X and Y coordinates for the vertices of the primitive, as well as some attribute parameters for the primitive, and a drawing command. The execution of drawing commands to generate a display image is known as graphics processing.
The prior art has provided two solutions to accomplish graphics processing. One solution is to build special processing hardware to provide high speed graphics processing capability. The other is to provide programmable graphics processing by executing graphics processing software on a general purpose processing platform. Both prior art solutions have drawbacks that limit their flexibility and performance.
Hardware solutions typically provide special purpose hardware that implements hardwired graphics processing algorithms that can provide very fast processing capabilities. However, the design and debugging of hardware is a complex, expensive, and time consuming process. Hardware solutions
are also inflexible. Should a new algorithm become known, the only way to implement it is to build a new hardware product. Thus, the hardware solution lacks the flexibility needed to respond to changing conditions. In addition, hardware solutions generally are only available to provide processing capability to graphics processing tasks. For non-graphics processing tasks, additional processing capabilities are required, adding to the expense of a graphics processing system.
Prior art software solutions provide a programming language that can be executed on a general purpose processing system. Rendering commands from a program drawing to the display are interpreted and executed in software. Software solutions are more flexible than hardware solutions in that new algorithms and techniques can be implemented by writing new software, which is easier than designing and building new hardware. However, existing software solutions also suffer from a number of disadvantages.
One prior art software problem is instruction execution latency. Many graphics algorithms consist of a small number of reduced instruction set computing (RISC) single instruction, multiple data (SIMD) instructions (a few dozen instructions or less). The instructions may be independent or dependent. A dependent instruction is an instruction that contains operand dependencies, that is, the source operands of one instruction are the result operands of a prior instruction. For example, a typical quadratic polynomial d = a*x**2 + b*x + c might be coded as

e = x*a + b

d = x*e + c

Since the result (e) of the first instruction is a source (e) in the second instruction, the second instruction cannot be executed until completion of the first instruction, creating an operand dependency. If the pipeline latency is, for example, 7 clocks, the second instruction could not begin until 7 clocks had transpired, which is very inefficient (15% of peak throughput).
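As an illustration only (this sketch is not taken from the patent), the same Horner-form evaluation written as C, with comments showing how a hypothetical 7-clock execution latency turns two dependent instructions into roughly fourteen clocks of work when nothing else can be issued in between:

/* Hypothetical sketch: the quadratic d = a*x*x + b*x + c in Horner
 * form, annotated with the stall a 7-clock execution latency would
 * impose if no other instructions could be interleaved.             */
float quadratic(float a, float b, float c, float x)
{
    float e = x * a + b;  /* issues at clock 1, result ready clock 7 */
    float d = x * e + c;  /* depends on e, so cannot issue before    */
                          /* clock 8; ~2 instructions in ~14 clocks  */
    return d;
}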
There are several approaches to alleviating this problem. One is the scheduling of independent instructions between the dependent instructions, so that processing continues even when a dependent instruction is waiting for data. This requires optimizing of the execution of the program by a programmer or by an optimizing compiler. Requiring the programmer to optimize during development is a complex task. An optimizing compiler is difficult to write and may not always optimize.
Another problem is that with many short graphics programs there are often not enough independent instructions to schedule. If there are no independent instructions to execute, the processor is idle, reducing efficiency.
Another attempt to optimize software execution of graphics commands is to interleave instructions from multiple vertices and pixels, such as different loop passes. This is even more complex than prescheduling, and leads to inefficiencies when modes change.
Another approach is to shorten the pipeline latency, or pipeline bypassing (so that operands still in the pipeline can be used by other instructions before being written back to the registers). Both of these solutions require complex hardware to control and route all the operands in the pipeline, and are of benefit only when operations can execute in a single clock cycle. Operations that take multiple clocks, such as floating point multiply and add, are not optimized since the partial results within them cannot be bypassed to other instructions.
`
Software solutions also suffer from conditional execution. When a program executes a conditional branch dependent on the results of a previous instruction, the next instruction cannot begin execution until the result is computed, inefficiently waiting up to the depth of the pipeline. There are several approaches to alleviating this problem. Results in the pipeline can be bypassed to the branch control, or the pipeline depth can be shortened, with similar hardware complexities to operand dependency above. Another approach is branch delay slots, in which several instructions following the branch instruction are executed, which leads to software complexity in scheduling those instructions, especially in short graphics programs.
Another approach is speculative execution and/or branch prediction of instructions after the branch and/or at the branch target, executing the branch whether needed or not. This leads to inefficiencies when the speculative or predicted instructions are not needed.
`
`SUMMARY OF THE INVENTION
`
The present invention provides efficient graphics processing in a software environment by interleaving of graphics processing instructions. The invention provides an execution pipeline and a number of registers. Each register holds instructions from a separate program. (The definition of "program" here is an operation performed on data. For example, the same operation on different data is treated as two programs, different operations on the same data represent two programs, or different operations on different data are two programs.)
Instructions from the registers are interleaved in the execution pipeline such that the average latency is one instruction per cycle. This is accomplished even when there is conditional branching and execution latency. When one instruction of a program has a dependency based on execution of a previous instruction, that second instruction is not provided to the execution pipeline until completion of the first instruction. In the meantime, interleaved instructions from other programs are still being executed while the first instruction of the first program is executing. Thus the pipeline is always full and the processor is always working at peak capacity.
The automatic interleaving of instructions permits simplified graphics software routines to be written. There is no need for the programmer or developer to anticipate or attempt to eliminate conditional branching or to worry about instruction latency. The programmable interleaved graphics processing system of the invention provides a solution to those problems.
`
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram of a system for implementing the
`present invention.
`FIG. 2 is an example embodiment of the programmable
`interleaved graphics processor of the present invention.
`FIG. 3 illustrates a flow diagram of one embodiment of an
`interleaver for use with the present invention.
`FIG. 4 illustrates the operation of the system for a set of
`programs where order is important.
`FIG. 5 illustrates the organization of a register file.
`FIG. 6 is an example of an instruction format.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
A method and apparatus for interleaved graphics processing is described. In the following description, numerous specific details are set forth in order to provide a more detailed
description of the invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well known details have not been provided so as to not unnecessarily obscure the invention.
The processing of three dimensional graphics is characterized by large amounts of processing. Graphics processors can process tens of millions of vertices and geometric primitives, and hundreds of millions of pixels per second. Because there is so much processing, any delay or inefficiency in program execution is harmful, and can be propagated very rapidly. But, because there is so much processing, with many of the tasks being independent from other tasks, there is the opportunity to continue processing other tasks when one task must be delayed.
This invention takes advantage of the large amount of processing inherent in graphics processing by interleaving the programs for several vertices, primitives, pixels, or other graphical elements in a processing pipeline. In other words, at each clock cycle, a different one of the programs is executing its instructions. When a program is waiting for the results of a first instruction for data needed in a second instruction, the processor is not idle; it is executing instructions of other programs. When a number of programs equal to or greater than the depth of the processing pipeline is executing, the apparent instruction latency in each program can be as little as one instruction. Consider where there are four programs, each with five instructions. Using interleaving, the four programs finish execution in twenty clock cycles, giving an average latency of one instruction per clock cycle. Using the invention, the results of an instruction in a program are available as sources to the next instruction in that program, without the hardware complexity of pipeline bypasses, the software complexity of instruction reordering, or the inefficiency of idle cycles. This results in a simple software programming model at low hardware cost and high efficiency.
`For interleaved processing, the storage elements (registers)
`and program counters (PC) are unique for each program.
`There are as many copies of registers and PCs as there are
`programs executing. In one embodiment, there are as many
`programs executing as the depth of the instruction execution
`pipeline between an arithmetic result and source dependency,
`or a load source address and data dependency, or a branch
`source condition and the target instruction fetch. If branch
`instructions include Call and Return instructions then a PC
`stack per program is also required.
`Since only one instruction or instruction group is fetched
`per program per cycle, a single shared instruction memory or
`cache can be used, and since only one load/store instruction or
`group accesses data memory per cycle, a single shared data
`memory or cache can be used.
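As a rough sketch only (the type, field names, and sizes below are illustrative assumptions, not taken from the patent), the per-program state just described, one program counter and one register copy per program, plus a small PC stack if Call and Return are supported, might be modeled as:

#define NUM_PROGRAMS    7   /* assumed equal to the pipeline depth      */
#define NUM_REGS       32   /* illustrative register file size          */
#define PC_STACK_DEPTH  4   /* only needed if Call/Return are supported */

/* One copy of this state exists for each interleaved program. */
struct program_context {
    unsigned pc;                        /* program counter               */
    unsigned pc_stack[PC_STACK_DEPTH];  /* return addresses for Call/Ret */
    unsigned pc_sp;                     /* PC stack pointer              */
    float    regs[NUM_REGS];            /* this program's register copy  */
};

/* Instruction and data memories are shared, since only one program
 * fetches an instruction, and only one issues a load or store, in any
 * given cycle.                                                          */
struct program_context contexts[NUM_PROGRAMS];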
An advantage of the invention is that since mode changes (program or data changes that affect the processing of a graphical element) are frequent in graphics applications, each program executes independently on the program and data of its own element. Each element's mode is independent, and modes can be changed per element without performance degradation.
`If data sharing between programs is required, the invention
`can be extended to include external synchronization mecha—
`nisms, such as stalling execution at specific instruction or data
`addresses, or internal synchronization mechanisms such as
`conditional branches on the state of other programs.
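Purely as an illustration of the second, internal style of synchronization (reusing the hypothetical program_context sketch above; the patent does not define such an interface), a branch condition could test another program's progress:

/* Illustrative only: a condition a program could branch on to wait
 * until program `other` has advanced past instruction address `addr`. */
int other_program_reached(const struct program_context *ctx,
                          int other, unsigned addr)
{
    return ctx[other].pc >= addr;
}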
`For example, a typical graphics processor pipeline might
`consist of seven execution stages (although the number of
`execution stages in practice varies widely with frequency,
`device characteristics, and instruction set.) As an example,
`
three general types of instructions which are latency sensitive are shown. One is an arithmetic instruction such as a floating point multiply add

D = A*B + C;

with a three cycle float multiply add operation. Another is a conditional branch instruction based on a register value less than zero, a decrement of that register, and a branch target with a register offset from the current PC

if (A-- < 0) goto (PC+B);

where PC is the program counter. A third is a load instruction from a local memory at a computed address

D = *(A+B);

The following types are fairly representative of general RISC instructions.
`
        Arithmetic             Branch                    Load

1       Source Fetch           Source Fetch              Source Fetch
2       Operand Route          Operand Route             Operand Route
3       Madd A                 Compare                   Address Add
4       Madd B                 Add PC, Dec Operand       Address Route
5       Madd C                 Instruction Fetch         Memory Access
6       Result Route           Instruction Decode        Data Route
7       Destination Write      Destination Write         Destination Write
`
Memory store instructions are similar to loads, except the store data is a source operand fetch and no destination route and write occurs, so stores are typically not latency sensitive. Subroutine calls and returns are similar to branches, with the read or write of a PC stack with the previous or target PC.
With the seven cycles of pipeline latency shown in the example, consider where programs for seven graphics primitives are executing interleaved in the pipeline. With the arithmetic instruction as an example, the program execution of 7 programs 0-6 would interleave like the following over clocks 1 to 8.
`
Clock:              1     2     3     4     5     6     7     8

Source Fetch       SF0   SF1   SF2   SF3   SF4   SF5   SF6   SF0
Operand Route            OR0   OR1   OR2   OR3   OR4   OR5
Madd A                         MA0   MA1   MA2   MA3   MA4
Madd B                               MB0   MB1   MB2   MB3
Madd C                                     MC0   MC1   MC2
Result Route                                     RR0   RR1
Destination Write                                      DW0
`
The source fetch of program 0 (SF0) occurs at clock 1. At clock 2, the source fetch of program 1 (SF1) is executed, along with the operand route step of program 0 (OR0). At each clock, the source fetch of each succeeding program is placed in the pipeline, while earlier programs continue execution. The result is such that in the 8th clock the next instruction of program 0 begins execution, and the result of the previous instruction in that program, or the load data or branch target location of that program, is available to that instruction. The functional latency of the instructions visible to the program and programmer or compiler is one instruction, even though the hardware pipeline latency of the instruction is many more. (Note that without interleaving, the subsequent instructions of program 0 would immediately follow the first instruction of program 0, but would need to stall at some point to wait for dependent data, creating pipeline inefficiencies.)
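The schedule above can be reproduced with a short loop. The sketch below is illustrative only (the stage mnemonics follow the arithmetic column of the table; the program count and loop bounds are assumptions): it prints, for each clock, which program occupies each pipeline stage, with the issue slot simply rotating round-robin over the seven programs.

#include <stdio.h>

#define DEPTH 7   /* pipeline stages, and number of interleaved programs */

static const char *stage[DEPTH] =
    { "SF", "OR", "MA", "MB", "MC", "RR", "DW" };

int main(void)
{
    /* Program p issues a new instruction every DEPTH clocks; an
     * instruction issued at clock c reaches stage s at clock c + s.    */
    for (int clock = 1; clock <= 8; clock++) {
        printf("clock %d:", clock);
        for (int s = 0; s < DEPTH; s++) {
            int issue_clock = clock - s;        /* when this op issued  */
            if (issue_clock >= 1) {
                int prog = (issue_clock - 1) % DEPTH;
                printf("  %s%d", stage[s], prog);
            }
        }
        printf("\n");
    }
    return 0;
}

At clock 7 this prints SF6 OR5 MA4 MB3 MC2 RR1 DW0, matching the last full column of the table, and at clock 8 program 0's next instruction issues (SF0) just as its previous result has been written back.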
`
The interleaved graphics processor of the present invention may be part of a larger graphics pipeline. Referring to FIG. 1, a programmable interleaved graphics processor 103A includes registers 106, data memories 105, and/or instruction memory 104. These may be written with graphics data by a general purpose CPU 100 by means of a command processor or other interface or by DMA from memory (102). Some output registers may input data and commands to further hardwired graphics processing such as polygon rasterization, texturing, or framebuffer blending and Z buffering, or a subsequent programmable graphics processor 107. Parallelism can be provided by providing multiple programmable interleaved graphics processors (PIGPs) 103B through 103N.
An example embodiment of the programmable interleaved graphics processor is shown in FIG. 2. A target program counter 201 is coupled to a plurality of program counters 202A through 202N. The number of program counters (PCs) should match the number of programs that are being interleaved at one time (e.g., seven in the above example). The program counters are coupled to instruction memory 203. Instructions from instruction memory 203 are provided to instruction decode 204 and then to an appropriate one of registers 205A through 205N. There are as many registers as there are programs being interleaved (and ideally, as many programs as stages in the processing pipeline).
`The registers 205A through 205N are coupled to operand
`route 206. The output of operand route 206 is coupled to the
`arithmetic datapath 207. Data memory 209 and output from
`datapath 207 are provided to result route 208. The output of
`result route 208 is made available to multiple registers 205A
`through 205N.
In graphics processing, the graphical elements that are being processed are known by the CPU. Each element includes its processing mode, that is, which program is being applied to the element. The interleaving process queues up a number of programs appropriate for the size of the pipeline. The program counters and registers are assigned to each program and one instruction for each program is routed into the pipeline each clock. A typical three dimensional graphic scene can have hundreds of thousands of programs available for batching in a group for interleaving.
FIG. 3 illustrates a flow diagram of one embodiment of an interleaver for use with the present invention. At step 301 an instruction stream is received, with each instruction consisting of a program and data to be used by a graphical element. At step 302 the interleaver identifies N programs where N is the depth of the pipeline. (Note that in some circumstances, the number of programs identified could be less than N.)
At step 303 the interleaver assigns program counters and registers to the N selected programs. At step 304 the instructions from those programs are loaded as appropriate in the instruction stream of the graphics processor.
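One way the interleaver steps of FIG. 3 might look in code is sketched below. The function and field names are illustrative assumptions (it reuses the hypothetical program_context sketch above), not the patent's actual interface:

/* Sketch of the FIG. 3 interleaver: take up to N programs from the
 * incoming stream (steps 301-302), bind a program counter and a
 * register copy to each (step 303), and return how many were set up
 * so their instructions can then be issued round-robin (step 304).  */
struct element {
    unsigned     start_pc;  /* first instruction of this element's program */
    const float *data;      /* the element data the program will use       */
};

int setup_batch(const struct element *stream, int available,
                struct program_context *ctx, int pipeline_depth)
{
    int n = available < pipeline_depth ? available : pipeline_depth;

    for (int i = 0; i < n; i++) {
        ctx[i].pc = stream[i].start_pc;   /* assign a program counter */
        ctx[i].pc_sp = 0;
        /* register copy i is implicitly reserved for program i */
    }
    return n;
}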
`Loads
`
The present invention could also support loads because the latency of loads is approximately the same as register latency. With a cache, load latency on a cache hit would fall in line with vertex batching quite well (6 clock fetch, mux, add, mux, read, mux, writeback/bypass). However, cache misses might have long latency. One solution would be register dependency checking, renaming, etc. Another solution is to have the batch (or the thread) stall on a miss. A prefetch variant of load could be supported to hide miss latency. Since the load destination could be only a temp, it is possible to keep a list of load targets (for each thread), and compare each of 3 sources to each pending load, and stall on collision or the Nth pending load. Since only two temp even and two temp odd sources (from a
vector/scalar pair) are possible in each cycle, another scheme is to keep a 1 bit tag per temp register (dual port or two copies) with one bit per thread, where the bit is set on load, cleared on write, and read by pending instructions that stall if the bit is set. A 2 or 4 by 16 by 6-8 bit load use tag ram may be cheaper than the compares of pending loads by 4 accesses by 4 bit registers, and allows unlimited pending loads.
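A minimal sketch of that tag-bit idea, with illustrative sizes and names (the patent does not give an implementation): one bit per temp register per thread is set when a load targets the register, cleared when the load data is written back, and checked against an instruction's source registers so the thread stalls only on a real collision.

#define NUM_THREADS 7
#define NUM_TEMPS  16

/* pending_load[t][r] is set while a load into temp r of thread t is
 * outstanding, and cleared when the load data is written back.      */
static unsigned char pending_load[NUM_THREADS][NUM_TEMPS];

void load_issued(int thread, int dest)       { pending_load[thread][dest] = 1; }
void load_written_back(int thread, int dest) { pending_load[thread][dest] = 0; }

/* Returns nonzero if the instruction must stall because one of its
 * (up to 3) source registers still has a load outstanding.          */
int must_stall(int thread, const int *src, int nsrc)
{
    for (int i = 0; i < nsrc; i++)
        if (pending_load[thread][src[i]])
            return 1;
    return 0;
}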
The invention also can operate efficiently when no-ops (no operations) may be required. For example, in rasterizing, the order of data delivery is important. This may require the insertion of no-ops into the instruction stream or retarding the launch of new programs until the first program finishes. The invention provides a solution by reserving space in an output buffer where data can be placed in the buffer out of order. The programs are pre-sorted in the desired order, with spaces in the output buffer reserved for each program. Thus, even if a program finishes early, before a previously issued program, its output is stored in the correct location in the output buffer, with the output buffer not accessed until a collection of "in-order" data is available.
In this situation, no new programs are started until an output buffer for each new program is available. This requires inserting no-ops in the instruction stream for this condition.
FIG. 4 illustrates the operation of the system for a set of programs where order is important. At decision block 400 it is determined if a new program is desired. If not, the system loops; if so, the system proceeds to step 401. At step 401, the new program is started. At step 402, output slots are assigned to the program to provide the appropriate order. At step 403, an instruction is executed. At decision block 404 it is determined if one of the programs is done. If so, the output of the program is loaded into its reserved space at step 405. If a program is not done at decision block 404, the system returns to step 403 to execute the next instruction.
After the output of a completed program is loaded into the appropriate reserved register at step 405, decision block 406 determines if this program and all of the programs of the current set are complete. If all of the programs have not completed, the system looks for the next instruction. At decision block 408 it is determined if an instruction is available, that is, for a program that has not yet finished execution. If there is an instruction available, the system returns to step 403 to execute the next instruction. If there is no instruction available at decision block 408, the system no-ops at step 409 and returns to block 408.
If all of the programs are complete at decision block 406, the system proceeds to decision block 407 to determine if an output slot is available. If not, the system no-ops at step 410. If a slot is available, the system returns to step 401 and starts a new program.
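A sketch of the reserved-slot output buffer behind FIG. 4 is given below. It is illustrative only (the slot count, types, and function names are assumptions): slots are reserved in program-issue order, a program that finishes early writes its own slot (step 405), and downstream units consume only the contiguous in-order prefix of completed slots.

#define MAX_SLOTS 64

struct output_buffer {
    float         results[MAX_SLOTS];
    unsigned char done[MAX_SLOTS];  /* set when a program's output lands  */
    int           next_issue;       /* next slot to give to a new program */
    int           next_drain;       /* oldest slot not yet consumed       */
};

/* Step 402: reserve the next slot in issue order; returns -1 when no
 * slot is free, in which case the system must no-op (step 410).       */
int reserve_slot(struct output_buffer *ob)
{
    if (ob->next_issue - ob->next_drain >= MAX_SLOTS)
        return -1;
    return ob->next_issue++ % MAX_SLOTS;
}

/* Step 405: a program stores its result in its own reserved slot,
 * even if it finishes before earlier-issued programs.                 */
void complete_program(struct output_buffer *ob, int slot, float value)
{
    ob->results[slot] = value;
    ob->done[slot]    = 1;
}

/* The buffer is read only up to the first incomplete slot, so data
 * leaves in the original program order.                               */
int drain_in_order(struct output_buffer *ob, float *out)
{
    int n = 0;
    while (ob->done[ob->next_drain % MAX_SLOTS]) {
        int s = ob->next_drain % MAX_SLOTS;
        out[n++]    = ob->results[s];
        ob->done[s] = 0;
        ob->next_drain++;
    }
    return n;
}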
As noted above, one embodiment of the invention includes a number of copies of registers at least equal to the number of interleaved programs, but a number of registers greater than the number of programs may also be used for system throughput optimization. With a larger number of copies of registers than the number of interleaved programs, some copies can operate as multiple input buffers and some copies can operate as multiple output buffers.
With multibuffered input, another system component, such as a CPU command interface or a memory command DMA unit or a rasterization and texturing unit, can write input registers with data for subsequent programs in parallel with current interleaved program execution. With multibuffered output, another system component, such as a setup and rasterization unit or a framebuffer blending and Z buffering unit, can read output registers from previous completed programs in parallel with current interleaved program execution.
`
`With multibuffered input and output registers, continuous
`program execution can occur without idle cycles lost waiting
`for input or output data. In one embodiment, a new inter-
`leaved program begins execution whenever a valid input data
`register copy is available after having been written by another
`system unit, and when an output data register copy is free after
`having been read by another system unit.
One embodiment of the invention also includes instruction memory larger than that needed to hold all the programs being executed by the current interleaved programs, and data memory larger than that needed to hold the data set accessed by all current interleaved programs. With larger instruction and data memories, other system units can update the contents of these memories at locations not used by currently executing interleaved programs so that subsequent interleaved programs can begin execution without waiting for instruction or data memory updates from other system components.
In one embodiment, the input and output registers are double buffered, containing twice as many copies of input and output registers as the number of executing interleaved programs, and the instruction and data memories are also double buffered, containing twice as many instructions or data as are accessed by the executing interleaved programs.
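As a sketch of the double-buffering idea (names and sizes are illustrative assumptions, not from the patent), two copies of the per-program input and output registers let another system unit fill or drain one copy while the interleaved programs execute against the other:

#define N_PROGRAMS 7
#define N_IO_REGS  8

struct io_bank {
    float in[N_PROGRAMS][N_IO_REGS];   /* written by CPU/DMA/rasterizer */
    float out[N_PROGRAMS][N_IO_REGS];  /* read by setup/blend/Z units   */
};

/* Two banks: while the processor runs against banks[active], the other
 * bank's inputs are being filled and its outputs drained in parallel.  */
static struct io_bank banks[2];
static int active = 0;

/* Swap when the current batch completes and the other bank is ready
 * (inputs written by another unit, outputs already read by another).   */
void swap_io_banks(void) { active ^= 1; }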
In addition to typical microprocessor types of instruction sets which are well known, such as arithmetic, logical, load store, and branch, the programmable interleaved graphics processor instruction set in one embodiment contains compound instructions such as are appropriate for performing common but complicated graphics operations. These compound instructions may perform several arithmetic, logical, and memory access operations.
`A frequently useful compound instruction is a texture
`sample instruction, in which the processor supplies a texture
`coordinate to a texturing system component, and the compo-
`nent returns a filtered texture image sample at that coordinate.
`Other possibly useful compound instructions include a sur-
`face evaluation instruction, which returns a surface sample
`from a surface parameter or coordinate, and a visibility evalu-
`ation instruction, which returns a visibility status or coverage
`from an image depth coordinate.
`The graphics processor
`instruction set may also be
`enhanced with complex instructions to perform common
`graphical algorithms (instead of performing those algorithms
`in a number of simpler instructions) such as matrix multiply,
`vector normalization, trigonometric functions, and exponen-
`tiation. Both compound and complex instructions can offer
`advantages in requiring fewer instructions to be accessed and
`providing higher p