`
(12) United States Patent
     Lindholm

(10) Patent No.:     US 7,038,685 B1
(45) Date of Patent: May 2, 2006
`
(54) PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED
     EXECUTION OF PROGRAMS

(75) Inventor: John Erik Lindholm, Saratoga, CA (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)
`
( * ) Notice: Subject to any disclaimer, the term of this patent is
      extended or adjusted under 35 U.S.C. 154(b) by 134 days.
`
(21) Appl. No.: 10/609,967

(22) Filed: Jun. 30, 2003
`
(51) Int. Cl.
     G06F 15/00  (2006.01)
     G06F 13/00  (2006.01)
     G06F 12/02  (2006.01)
     G06F 9/46   (2006.01)
     G06T 1/00   (2006.01)
(52) U.S. Cl. ................. 345/501; 345/543; 345/536; 718/104
(58) Field of Classification Search ................... 345/501,
     345/502, 530, 531, 522, 418, 419, 426, 427,
     345/543, 505, 536; 718/100, 104, 103
     See application file for complete search history.
`
(56)                References Cited

            U.S. PATENT DOCUMENTS

   5,020,115 A *   5/1991  Black ...................... 382/298
   5,969,726 A    10/1999  Rentschler et al.
   6,630,935 B1*  10/2003  Taylor et al. .............. 345/522
   6,731,289 B1*   5/2004  Peercy et al. .............. 345/503
2003/0041173 A1    2/2003  Hoyle

* cited by examiner
`
Primary Examiner: Kee M. Tung
(74) Attorney, Agent, or Firm: Patterson & Sheridan, LLP
`
`(57)
`
`ABSTRACT
`
`A programmable graphics processor for multithreaded
`execution of program instructions including a thread control
`unit. The programmable graphics processor is programmed
`with program instructions for processing primitive, pixel
`and vertex data. The thread control unit has a thread storage
`resource including locations allocated to store thread state
`data associated with samples of two or more types. Sample
`types include primitive, pixel and vertex. A number of
`threads allocated to processing a sample type may be
`dynamically modified.
`
`45 Claims, 9 Drawing Sheets
`
[Cover figure: Execution Pipeline 240 containing a Multithreaded Processing Unit with an Instruction Cache, Thread Selection Unit, Thread Control Unit (with TSR and Register File), Resource Scoreboard, Sequencer, Instruction Scheduler, Instruction Dispatcher, Execution Unit, and PCU; interfaces from 215 and 220, to and from 225, and to 260 and 270.]
`LG Ex. 1003, pg 1
`
`LG Ex. 1003
`LG v. ATI
`IPR2017-01225
`
`
`
U.S. Patent    May 2, 2006    Sheet 1 of 9    US 7,038,685 B1

[FIG. 1: Computing System 100 including Host Computer 110 (Host Memory 112, Host Processor 114, System Interface 115) and Graphics Subsystem 170 (Graphics Interface 117, Graphics Processor 105, Memory Controller 120, Local Memory 140, Front End 130, IDX 135, Programmable Graphics Processing Pipeline 150, Raster Analyzer 160, Output Controller 180, Output 185).]
`
`
`
U.S. Patent    May 2, 2006    Sheet 2 of 9    US 7,038,685 B1

[FIG. 2: Programmable Graphics Processing Pipeline 150 receiving samples from 135; Vertex Input Buffer 220 feeding four Execution Pipelines 240; Primitive Assembly/Setup 205, Raster Unit 210, and Pixel Input Buffer 215; Vertex Output Buffer 260 and Pixel Output Buffer 270 outputting to 160; Texture Unit 225 and Texture Cache 230 connected to and from 120.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 3 of 9    US 7,038,685 B1

[FIG. 3: Execution Pipeline 240 containing Multithreaded Processing Unit 300 with Thread Control Unit 320 (including TSR 325), Register File 350, and PCU 375; interfaces from 215 and 220, to and from 225, and to 260 and 270.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 4 of 9    US 7,038,685 B1

[FIG. 4: Execution Pipeline 240 containing Multithreaded Processing Unit 400 with Instruction Cache 410, Thread Selection Unit 415, Thread Control Unit 420 (including TSR), Register File 350, Resource Scoreboard 460, IWU 435, Sequencer 425, Instruction Scheduler 430, Instruction Dispatcher 440, Execution Unit 470, and a PCU; interfaces from 215 and 220, to and from 225, and to 260 and 270.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 5 of 9    US 7,038,685 B1

[FIGS. 5A and 5B: flow diagrams of thread assignment, including an Assign Vertex Thread step 535.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 6 of 9    US 7,038,685 B1

[FIGS. 6A and 6B: portions of the Thread Storage Resource storing thread state data, with entries 605, 610, 611, 612, 613, 620, 625, 630, 635, 640, and 645.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 7 of 9    US 7,038,685 B1

[FIG. 7A: Allocating threads to a first sample type 710; Allocating threads to a second sample type 715; Execute First Program Instructions 720; Execute Second Program Instructions 725. FIG. 7B: Determine allocations 750; Allocating threads to a first sample type 755; Allocating threads to a second sample type 760; Execute First Program Instructions 765; Execute Second Program Instructions 770; Allocating threads to the first sample type 775.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 8 of 9    US 7,038,685 B1

[FIG. 8A: Receive Sample 810; Assign Thread 825. FIG. 8B: Execute Thread 880; Deallocate Resources 850.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 9 of 9    US 7,038,685 B1

[FIG. 9A: Identify assigned thread(s) 910; Select Thread(s); Read Program Counter(s) 920; Update Program Counter(s) 925. FIG. 9B: Determine thread priority 950; Identify assigned thread(s) for priority 955; Read Program Counter(s) 970; Update Program Counter(s) 975.]
`
`
`
`PROGRAMMABLE GRAPHICS PROCESSOR
`FOR MULTITHREADED EXECUTION OF
`PROGRAMS
`
`FIELD OF THE INVENTION
`
`One or more aspects of the invention generally relate to
`multithreaded processing, and more particularly to process-
`ing graphics data in a programmable graphics processor.
`
`BACKGROUND
`
`Current graphics data processing includes systems and
`methods developed to perform a specific operation on graph-
`ics data, e.g., linear interpolation, tessellation, rasterization,
`texture mapping, depth testing, etc. These graphics proces-
`sors include several fixed function computation units to
`perform such specific operations on specific types of graph-
`ics data, such as vertex data and pixel data. More recently,
`the computation units have a degree of programmability to
`perform user specified operations such that the vertex data is
`processed by a vertex processing unit using vertex programs
`and the pixel data is processed by a pixel processing unit
using pixel programs. When the amount of vertex data being
processed is low relative to the amount of pixel data being
processed, the vertex processing unit may be underutilized.
Conversely, when the amount of vertex data being processed
is high relative to the amount of pixel data being processed,
the pixel processing unit may be underutilized.
`Accordingly, it would be desirable to provide improved
`approaches to processing different types of graphics data to
`better utilize one or more processing units within a graphics
`processor.
`
`SUMMARY
`
`
A method and apparatus for processing and allocating threads for multithreaded execution of graphics programs is described. A graphics processor for multithreaded execution of program instructions associated with threads to process at least two sample types includes a thread control unit including a thread storage resource configured to store thread state data for each of the threads to process the at least two sample types.

Alternatively, the graphics processor includes a multithreaded processing unit. The multithreaded processing unit includes a thread control unit configured to store pointers to program instructions associated with threads, each thread processing a sample type of vertex, pixel or primitive. The multithreaded processing unit also includes at least one programmable computation unit configured to process data under control of the program instructions.

A method of multithreaded processing of graphics data includes receiving a pointer to a vertex program to process vertex samples. A first thread is assigned to a vertex sample. A pointer to a shader program to process pixel samples is received. A second thread is assigned to a pixel sample. The vertex program is executed to process the vertex sample and produce a processed vertex sample. The shader program is executed to process the pixel sample and produce a processed pixel sample.

Alternatively, the method of multithreaded processing of graphics data includes allocating a first number of processing threads for a first sample type. A second number of processing threads is allocated for a second sample type. First program instructions associated with the first sample type are executed to process the graphics data and produce processed graphics data.
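The allocation method summarized above can be sketched in a few lines: a fixed pool of processing threads is split between two sample types, and the split can be changed at run time. This is only an illustrative model; the class and method names below are assumptions, not terms from the patent.

```python
# Hypothetical sketch of per-sample-type thread allocation from a shared pool.
class ThreadPool:
    def __init__(self, total_threads):
        self.total_threads = total_threads
        self.allocation = {}  # sample type -> number of threads allocated

    def allocate(self, sample_type, count):
        # Refuse an allocation that would exceed the size of the pool.
        used = sum(n for t, n in self.allocation.items() if t != sample_type)
        if used + count > self.total_threads:
            raise ValueError("not enough threads in the pool")
        self.allocation[sample_type] = count

pool = ThreadPool(total_threads=32)
pool.allocate("vertex", 8)    # first sample type
pool.allocate("pixel", 24)    # second sample type
# Dynamically modify the split, e.g. when vertex work dominates:
# shrink the pixel allocation before growing the vertex allocation.
pool.allocate("pixel", 16)
pool.allocate("vertex", 16)
```

A real thread control unit would also track which individual threads are busy; this sketch only models the counts.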
`
`A method of assigning threads for processing of graphics
`data includes receiving a sample to be processed. A sample
`type of vertex, pixel, or primitive, associated with the
`sample is determined. A thread is determined to be available
`for assignment to the sample. The thread is assigned to the
`sample.
`A method of selecting at least one thread for execution
`includes identifying one or more assigned threads from
`threads including at least a thread assigned to a pixel sample
`and a thread assigned to a vertex sample. At least one of the
`one or more assigned threads is selected for processing.
`A method of improving performance of multithreaded
`processing of graphics data using at least two sample types
`includes dynamically allocating a first number of threads for
`processing a first portion of the graphics data to a first
`sample type and dynamically allocating a second number of
`threads for processing a second portion of the graphics data
`to a second sample type.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`Accompanying drawing(s) show exemplary
`embodiment(s) in accordance with one or more aspects of
`the present invention; however, the accompanying
`drawing(s) should not be taken to limit the present invention
`to the embodiment(s) shown, but are for explanation and
`understanding only.
`FIG. 1 illustrates one embodiment of a computing system
`according to the invention including a host computer and a
`graphics subsystem.
`FIG. 2 is a block diagram of an embodiment of the
`Programmable Graphics Processing Pipeline of FIG. 1.
`FIG. 3 is a block diagram of an embodiment of the
`Execution Pipeline of FIG. 1.
`FIG. 4 is a block diagram of an alternate embodiment of
`the Execution Pipeline of FIG. 1.
`FIGS. 5A and 5B are flow diagrams of exemplary
`embodiments of thread assignment in accordance with one
`or more aspects of the present invention.
`FIGS. 6A and 6B are exemplary embodiments of a portion
`of the Thread Storage Resource storing thread state data
`within an embodiment of the Thread Control Unit of FIG. 3
`or FIG. 4.
`FIGS. 7A and 7B are flow diagrams of exemplary
`embodiments of thread allocation and processing in accor-
`dance with one or more aspects of the present invention.
`FIGS. 8A and 8B are flow diagrams of exemplary
`embodiments of thread assignment in accordance with one
`or more aspects of the present invention.
`FIGS. 9A and 9B are flow diagrams of exemplary
`embodiments of thread selection in accordance with one or
`more aspects of the present invention.
`
`DETAILED DESCRIPTION
`
`In the following description, numerous specific details are
`set forth to provide a more thorough understanding of the
`present invention. However, it will be apparent to one of
`skill in the art that the present invention may be practiced
`without one or more of these specific details. In other
`instances, well-known features have not been described in
`order to avoid obscuring the present invention.
`FIG. 1 is an illustration of a Computing System generally
`designated 100 and including a Host Computer 110 and a
`Graphics Subsystem 170. Computing System 100 may be a
`desktop computer, server, laptop computer, palm-sized com-
`puter, tablet computer, game console, cellular telephone,
`computer based simulator, or the like. Host Computer 110
`
`
`
`
includes Host Processor 114 that may include a system memory controller to interface directly to Host Memory 112 or may communicate with Host Memory 112 through a System Interface 115. System Interface 115 may be an I/O (input/output) interface or a bridge device including the system memory controller to interface directly to Host Memory 112. Examples of System Interface 115 known in the art include Intel® Northbridge and Intel® Southbridge.

Host Computer 110 communicates with Graphics Subsystem 170 via System Interface 115 and a Graphics Interface 117 within a Graphics Processor 105. Data received at Graphics Interface 117 can be passed to a Front End 130 or written to a Local Memory 140 through Memory Controller 120. Graphics Processor 105 uses graphics memory to store graphics data and program instructions, where graphics data is any data that is input to or output from components within the graphics processor. Graphics memory can include portions of Host Memory 112, Local Memory 140, register files coupled to the components within Graphics Processor 105, and the like.

Graphics Processor 105 includes, among other components, Front End 130 that receives commands from Host Computer 110 via Graphics Interface 117. Front End 130 interprets and formats the commands and outputs the formatted commands and data to an IDX (Index Processor) 135. Some of the formatted commands are used by Programmable Graphics Processing Pipeline 150 to initiate processing of data by providing the location of program instructions or graphics data stored in memory. IDX 135, Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 each include an interface to Memory Controller 120 through which program instructions and data can be read from memory, e.g., any combination of Local Memory 140 and Host Memory 112. When a portion of Host Memory 112 is used to store program instructions and data, the portion of Host Memory 112 can be uncached so as to increase performance of access by Graphics Processor 105.

IDX 135 optionally reads processed data, e.g., data written by Raster Analyzer 160, from memory and outputs the data, processed data and formatted commands to Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 each contain one or more programmable processing units to perform a variety of specialized functions. Some of these functions are table lookup, scalar and vector addition, multiplication, division, coordinate-system mapping, calculation of vector normals, tessellation, calculation of derivatives, interpolation, and the like. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 are each optionally configured such that data processing operations are performed in multiple passes through those units or in multiple passes within Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 also each include a write interface to Memory Controller 120 through which data can be written to memory.

In a typical implementation Programmable Graphics Processing Pipeline 150 performs geometry computations, rasterization, and pixel computations. Therefore Programmable Graphics Processing Pipeline 150 is programmed to operate on surface, primitive, vertex, fragment, pixel, sample or any other data. For simplicity, the remainder of this description will use the term "samples" to refer to graphics data such as surfaces, primitives, vertices, pixels, fragments, or the like.

Samples output by Programmable Graphics Processing Pipeline 150 are passed to a Raster Analyzer 160, which optionally performs near and far plane clipping and raster operations, such as stencil, z test, and the like, and saves the results or the samples output by Programmable Graphics Processing Pipeline 150 in Local Memory 140. When the data received by Graphics Subsystem 170 has been completely processed by Graphics Processor 105, an Output 185 of Graphics Subsystem 170 is provided using an Output Controller 180. Output Controller 180 is optionally configured to deliver data to a display device, network, electronic control system, other Computing System 100, other Graphics Subsystem 170, or the like. Alternatively, data is output to a film recording device or written to a peripheral device, e.g., disk drive, tape, compact disk, or the like.

FIG. 2 is an illustration of Programmable Graphics Processing Pipeline 150 of FIG. 1. At least one set of samples is output by IDX 135 and received by Programmable Graphics Processing Pipeline 150 and the at least one set of samples is processed according to at least one program, the at least one program including graphics program instructions. A program can process one or more sets of samples. Conversely, a set of samples can be processed by a sequence of one or more programs.

Samples, such as surfaces, primitives, or the like, are received from IDX 135 by Programmable Graphics Processing Pipeline 150 and stored in a Vertex Input Buffer 220 including a register file, FIFO (first in first out), cache, or the like (not shown). The samples are broadcast to Execution Pipelines 240, four of which are shown in the figure. Each Execution Pipeline 240 includes at least one multithreaded processing unit, to be described further herein. The samples output by Vertex Input Buffer 220 can be processed by any one of the Execution Pipelines 240. A sample is accepted by an Execution Pipeline 240 when a processing thread within the Execution Pipeline 240 is available as described further herein. Each Execution Pipeline 240 signals to Vertex Input Buffer 220 when a sample can be accepted or when a sample cannot be accepted. In one embodiment Programmable Graphics Processing Pipeline 150 includes a single Execution Pipeline 240 containing one multithreaded processing unit. In an alternative embodiment, Programmable Graphics Processing Pipeline 150 includes a plurality of Execution Pipelines 240.

Execution Pipelines 240 may receive first samples, such as higher-order surface data, and tessellate the first samples to generate second samples, such as vertices. Execution Pipelines 240 may be configured to transform the second samples from an object-based coordinate representation (object space) to an alternatively based coordinate system such as world space or normalized device coordinates (NDC) space. Each Execution Pipeline 240 may communicate with Texture Unit 225 using a read interface (not shown in FIG. 2) to read program instructions and graphics data such as texture maps from Local Memory 140 or Host Memory 112 via Memory Controller 120 and a Texture Cache 230. Texture Cache 230 is used to improve memory read performance by reducing read latency. In an alternate embodiment Texture Cache 230 is omitted. In another alternate embodiment, a Texture Unit 225 is included in each Execution Pipeline 240. In another alternate embodiment program instructions are stored within Programmable Graphics Processing Pipeline 150. In another alternate embodiment each Execution Pipeline 240 has a dedicated instruction read interface to read program instructions from Local Memory 140 or Host Memory 112 via Memory Controller 120.

Execution Pipelines 240 output processed samples, such as vertices, that are stored in a Vertex Output Buffer 260 including a register file, FIFO, cache, or the like (not
`
`
`
`
`shown). Processed vertices output by Vertex Output Buffer
`260 are received by a Primitive Assembly/Setup Unit 205.
`Primitive Assembly/Setup Unit 205 calculates parameters,
`such as deltas and slopes, to rasterize the processed vertices
`and outputs parameters and samples, such as vertices, to a
`Raster Unit 210. Raster Unit 210 performs scan conversion
`on samples, such as vertices, and outputs samples, such as
`fragments, to a Pixel Input Buffer 215. Alternatively, Raster
`Unit 210 resamples processed vertices and outputs addi-
`tional vertices to Pixel Input Buffer 215.
`Pixel Input Buffer 215 outputs the samples to each Execu-
`tion Pipeline 240. Samples, such as pixels and fragments,
`output by Pixel Input Buffer 215 are each processed by only
`one of the Execution Pipelines 240. Pixel Input Buffer 215
`determines which one of the Execution Pipelines 240 to
`output each sample to depending on an output pixel position,
`e.g., (x,y), associated with each sample. In this manner, each
`sample is output to the Execution Pipeline 240 designated to
`process samples associated with the output pixel position. In
`an alternate embodiment, each sample output by Pixel Input
`Buffer 215 is processed by one of any available Execution
`Pipelines 240.
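The position-based routing described above can be sketched as a pure function from output pixel position to pipeline index. The patent fixes the mapping per position but does not specify its form, so the modulo scheme below is only an illustrative assumption.

```python
# Hypothetical mapping from an output pixel position (x, y) to the one
# Execution Pipeline designated to process samples at that position.
def pipeline_for_position(x, y, num_pipelines):
    """Return the index of the pipeline that owns pixel position (x, y)."""
    return (x + y) % num_pipelines

# Every sample at a given position maps to the same pipeline, so all
# fragments covering one pixel are handled by a single pipeline.
owner = pipeline_for_position(3, 6, num_pipelines=4)
```

The essential property is determinism: repeated lookups for the same position always return the same pipeline, which keeps per-pixel processing ordered.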
Each Execution Pipeline 240 signals to Pixel Input Buffer
215 when a sample can be accepted or when a sample cannot
`be accepted as described further herein. Program instruc-
`tions configure programmable computation units (PCUs)
`within an Execution Pipeline 240 to perform operations such
`as tessellation, perspective correction, texture mapping,
`shading, blending, and the like. Processed samples are
`output from each Execution Pipeline 240 to a Pixel Output
`Buffer 270. Pixel Output Buffer 270 optionally stores the
`processed samples in a register file, FIFO, cache, or the like
`(not shown). The processed samples are output from Pixel
`Output Buffer 270 to Raster Analyzer 160.
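The accept/reject signaling between an input buffer and the Execution Pipelines can be sketched as a simple handshake: a pipeline accepts a sample only while it has a free processing thread. The class and function names below are illustrative assumptions, not structures from the patent.

```python
# Hypothetical model of the sample-acceptance handshake.
class ExecutionPipeline:
    def __init__(self, num_threads):
        self.free_threads = num_threads

    def can_accept(self):
        # Signal "accept" only while a processing thread is available.
        return self.free_threads > 0

    def accept(self, sample):
        self.free_threads -= 1
        return sample

def broadcast(sample, pipelines):
    # The input buffer offers the sample; any pipeline signaling
    # "accept" may take it. If none can, the buffer retries later.
    for p in pipelines:
        if p.can_accept():
            return p.accept(sample)
    return None

pipes = [ExecutionPipeline(num_threads=1), ExecutionPipeline(num_threads=0)]
accepted = broadcast("vertex sample", pipes)  # taken by the first pipeline
rejected = broadcast("vertex sample", pipes)  # every thread is now busy
```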
`FIG. 3 is a block diagram of an embodiment of Execution
`Pipeline 240 of FIG. 1 including at least one Multithreaded
`Processing Unit 300. An Execution Pipeline 240 can contain
`a plurality of Multithreaded Processing Units 300, each
`Multithreaded Processing Unit 300 containing at least one
`PCU 375. PCUs 375 are configured using program instruc-
`tions read by a Thread Control Unit 320 via Texture Unit
`225. Thread Control Unit 320 gathers source data specified
`by the program instructions and dispatches the source data
`and program instructions to at least one PCU 375. PCUs 375
perform computations specified by the program instruc-
tions and output data to at least one destination, e.g., Pixel
Output Buffer 270, Vertex Output Buffer 260 and Thread
Control Unit 320.
`A single program may be used to process several sets of
`samples. Thread Control Unit 320 receives samples or
`pointers to samples stored in Pixel Input Buffer 215 and
`Vertex Input Buffer 220. Thread Control Unit 320 receives
`a pointer to a program to process one or more samples.
`Thread Control Unit 320 assigns a thread to each sample to
`be processed. A thread includes a pointer to a program
`instruction (program counter), such as the first instruction
`within the program, thread state information, and storage
`resources for storing intermediate data generated during
`processing of the sample. Thread state information is stored
`in a TSR (Thread Storage Resource) 325. TSR 325 may be
`a register file, FIFO, circular buffer, or the like. An instruc-
`tion specifies the location of source data needed to execute
`the instruction. Source data, such as intermediate data gen-
`erated during processing of the sample is stored in a Register
`File 350. In addition to Register File 350, other source data
`may be stored in Pixel Input Buffer 215 or Vertex Input
`
`Buffer 220. In an alternate embodiment source data is stored
`in Local Memory 140, locations in Host Memory 112, and
`the like.
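The thread described above, a program counter plus thread state and storage resources, can be modeled as a small record, with assignment finding a free entry in the thread storage resource. This is a minimal sketch; the field and function names are assumptions, not the patent's.

```python
# Hypothetical model of per-thread state held in a Thread Storage Resource.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ThreadState:
    program_counter: int                  # pointer to the next instruction
    busy: bool = False                    # assigned to a sample, or free
    sample_ptr: Optional[int] = None      # source sample in an input buffer
    registers: list = field(default_factory=list)  # intermediate data

def assign_thread(tsr, sample_ptr, program_start):
    """Assign a free thread from the thread storage resource to a sample."""
    for thread in tsr:
        if not thread.busy:
            thread.busy = True
            thread.sample_ptr = sample_ptr
            thread.program_counter = program_start
            return thread
    return None  # no free thread: the sample cannot be accepted yet

tsr = [ThreadState(program_counter=0) for _ in range(4)]
thread = assign_thread(tsr, sample_ptr=7, program_start=0x100)
```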
Alternatively, in an embodiment permitting multiple programs for two or more thread types, Thread Control Unit 320 also receives a program identifier specifying which one of the two or more programs the program counter is associated with. Specifically, in an embodiment permitting simultaneous execution of four programs for a thread type, two bits of thread state information are used to store the program identifier for a thread. Multithreaded execution of programs is possible because each thread may be executed independent of other threads, regardless of whether the other threads are executing the same program or a different program. PCUs 375 update each program counter associated with the threads in Thread Control Unit 320 following the execution of an instruction. For execution of a loop, call, return, or branch instruction the program counter may be updated based on the loop, call, return, or branch instruction.

For example, each fragment or group of fragments within a primitive can be processed independently from the other fragments or from the other groups of fragments within the primitive. Likewise, each vertex within a surface can be processed independently from the other vertices within the surface. For a set of samples being processed using the same program, the sequence of program instructions associated with each thread used to process each sample within the set will be identical, although the program counter for each thread may vary. However, it is possible that, during execution, the threads processing some of the samples within a set will diverge following the execution of a conditional branch instruction. After the execution of a conditional branch instruction, the sequence of executed instructions associated with each thread processing samples within the set may differ and each program counter stored in TSR 325 within Thread Control Unit 320 for the threads may differ accordingly.
FIG. 4 is an illustration of an alternate embodiment of Execution Pipeline 240 containing at least one Multithreaded Processing Unit 400. Thread Control Unit 420 includes a TSR 325 to retain thread state data. In one embodiment TSR 325 stores thread state data for each of at least two thread types, where the at least two thread types may include pixel, primitive, and vertex. Thread state data for a thread may include, among other things, a program counter, a busy flag that indicates if the thread is either assigned to a sample or available to be assigned to a sample, a pointer to a source sample to be processed by the instructions associated with the thread or the output pixel position and output buffer ID of the source sample to be processed, and a pointer specifying a destination location in Vertex Output Buffer 260 or Pixel Output Buffer 270. Additionally, thread state data for a thread assigned to a sample may include the sample type, e.g., pixel, vertex, primitive, or the like. The type of data a thread processes identifies the thread type, e.g., pixel, vertex, primitive, or the like. For example, a thread may process a primitive, producing a vertex. After the vertex is rasterized and fragments are generated, the thread may process a fragment.

Source samples are stored in either Pixel Input Buffer 215 or Vertex Input Buffer 220. Thread allocation priority, as described further herein, is used to assign a thread to a source sample. A thread allocation priority is specified for each sample type and Thread Control Unit 420 is configured to assign threads to samples or allocate locations in a Register File 350 based on the priority assigned to each sample type. The thread allocation priority may be fixed,
`
`
`
`
programmable, or dynamic. In one embodiment the thread allocation priority may be fixed, always giving priority to allocating vertex threads, and pixel threads are only allocated if vertex samples are not available for assignment to a thread.

In an alternate embodiment, Thread Control Unit 420 is configured to assign threads to source samples or allocate locations in Register File 350 using thread allocation priorities based on an amount of sample data in Pixel Input Buffer 215 and another amount of sample data in Vertex Input Buffer 220. Dynamically modifying a thread allocation priority for vertex samples based on the amount of sample data in Vertex Input Buffer 220 permits Vertex Input Buffer 220 to drain faster and fill Vertex Output Buffer 260 and Pixel Input Buffer 215 faster, or drain slower and fill Vertex Output Buffer 260 and Pixel Input Buffer 215 slower. Dynamically modifying a thread allocation priority for pixel samples based on the amount of sample data in Pixel Input Buffer 215 permits Pixel Input Buffer 215 to drain faster and fill Pixel Output Buffer 270 faster, or drain slower and fill Pixel Output Buffer 270 slower. In a further alternate embodiment, Thread Control Unit 420 is configured to assign threads to source samples or allocate locations in Register File 350 using thread allocation priorities based on graphics primitive size (number of pixels or fragments included in a primitive) or a number of graphics primitives in Vertex Output Buffer 260. For example, a dynamically determined thread allocation priority may be determined based on a number of "pending" pixels, i.e., the number of pixels to be rasterized from the primitives in Primitive Assembly/Setup 205 and in Vertex Output Buffer 260. Specifically, the thread allocation priority may be tuned such that the number of pending pixels produced by processing vertex threads is adequate to achieve maximum utilization of the computation resources in Execution Pipelines 240 processing pixel threads.

Once a thread is assigned to a source sample, the thread is allocated storage resources such as locations in a Register File 350 to retain intermediate data generated during execution of program instructions associated with the thread. Alternatively, source data is stored in storage resources including Local Memory 140, locations in Host Memory 112, and the like.

A Thread Selection Unit 415 reads one or more thread [...] sample data in Vertex Input Buffer 220. In a further alternate embodiment, Thread Selection Unit 415 is configured to read thread entries using a priority based on graphics primitive size (number of pixels or fragments included in a primitive) or a number of graphics primitives in Vertex Output Buffer 260. For example, a dynamically determined thread execution priority is determined based on a number of "pending" pixels, i.e., the number of pixels to be rasterized from the primitives in Primitive Assembly/Setup 205 and in Vertex Output Buffer 260. Specifically, the thread execution priority may be tuned such that the number of pending pixels produced by processing vertex threads is adequate to achieve maximum utilization of the computation resources in Execution Pipelines 240 processing pixel threads.

Thread Selection Unit 415 reads one or more thread entries based on thread execution priorities and outputs selected thread entries to Instruction Cache 410. Instruction Cache 410 determines if the program instructions corresponding to the program counters and sample type included in the thread state data for each thread entry are available in Instruction Cache 410. When a requested program instruction is not available in Instruction Cache 410 it is read (possibly along with other program instructions stored in adjacent memory locations) from graphics memory. A base address, corresponding to the graphics memory location where a first instruction in a program is stored, may be used in conjunction with a program counter to determine the location in graphics memory where a program instruction corresponding to the program counter is stored. In an alternate embodiment, Instruction Cache 410 can be shared between Multithreaded Processing Units 400 within Execution Pipeline 240.

The program instructions corresponding to the program counters from the one or more thread entries are output by Instruction Cache 410 to a scheduler, Instruction Scheduler 430. The number of instructions output each clock cycle from Instruction Cache 410 to Instruction Scheduler 430 can vary depending on whether or not the instructions are available in the cache. The number of instructions that can be output each clock cycle from Instruction Cache 410 to Instruction Scheduler 430 may also vary between different embodiments. In one embodiment, Instruction Cache 410 outputs one instruction per clock cycle to Instruction Scheduler 430.
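The fetch-address computation described above, a program's base address in graphics memory combined with the thread's program counter, can be sketched as follows. The instruction size and cache-line width are illustrative assumptions; the patent does not specify them.

```python
# Hypothetical fetch-address computation for an instruction-cache miss.
INSTRUCTION_SIZE = 8    # bytes per instruction (assumed)
LINE_INSTRUCTIONS = 4   # instructions brought in per cache line (assumed)

def instruction_address(base_address, program_counter):
    """Locate an instruction from the program base and a thread's counter."""
    return base_address + program_counter * INSTRUCTION_SIZE

def line_address(addr):
    # A miss reads the aligned line containing addr, bringing in the
    # neighbouring instructions stored at adjacent memory locations.
    line_bytes = INSTRUCTION_SIZE * LINE_INSTRUCTIONS
    return addr - (addr % line_bytes)

addr = instruction_address(base_address=0x1000, program_counter=5)
fill = line_address(addr)   # start of the line fetched on a miss
```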