`UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office
`
`May 9, 2023
`
THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM
THE RECORDS OF THIS OFFICE OF:

PATENT NUMBER: 7,038,685
ISSUE DATE: May 2, 2006
`
`
`
`
`
By Authority of the
Under Secretary of Commerce for Intellectual Property
and Director of the United States Patent and Trademark Office

Miguel Tarver
Certifying Officer
`
`TCL 1005
`
`
`
US007038685B1

(12) United States Patent
Lindholm

(10) Patent No.: US 7,038,685 B1
(45) Date of Patent: May 2, 2006
`
(54) PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED EXECUTION OF PROGRAMS
`
(56) References Cited

U.S. PATENT DOCUMENTS
`
(75) Inventor: John Erik Lindholm, Saratoga, CA (US)
`
(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)
`
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 134 days.
`
(21) Appl. No.: 10/609,967

(22) Filed: Jun. 30, 2003
`
(51) Int. Cl.
G06F 15/00 (2006.01)
G06F 13/00 (2006.01)
G06F 12/02 (2006.01)
G06F 9/46 (2006.01)
G06T 1/00 (2006.01)

(52) U.S. Cl. .......... 345/501; 345/543; 345/536; 718/104

(58) Field of Classification Search .......... 345/501, 345/502, 530, 531, 522, 418, 419, 426, 427, 345/543, 505, 536; 718/100, 104, 103
See application file for complete search history.
`
`
`
5,020,115 A * 5/1991 Black .................... 382/298
5,969,726 A 10/1999 Rentschler et al.
6,630,935 B1 * 10/2003 Taylor et al. ............ 345/522
6,731,289 B1 * 5/2004 Peercy et al. ............ 345/503
2003/0041173 A1 2/2003 Hoyle
`
`* cited by examiner
`
`Primary Examiner—Kee M. Tung
`(74) Attorney, Agent, or Firm—Patterson & Sheridan, LLP
`
(57) ABSTRACT
`
A programmable graphics processor for multithreaded execution of program instructions including a thread control unit. The programmable graphics processor is programmed with program instructions for processing primitive, pixel and vertex data. The thread control unit has a thread storage resource including locations allocated to store thread state data associated with samples of two or more types. Sample types include primitive, pixel and vertex. A number of threads allocated to processing a sample type may be dynamically modified.
`
`45 Claims, 9 Drawing Sheets
`
[Front-page figure: Execution Pipeline 240 containing Multithreaded Processing Unit 400, with Instruction Cache 410, Thread Control Unit, Thread Selection Unit, Instruction Scheduler 430, Resource Scoreboard 460, Instruction Dispatcher, and Register File 350 (see FIG. 4).]
`
`
`
`U.S. Patent
`
May 2, 2006
`
`Sheet 1 of 9
`
`US 7,038,685 B1
`
[FIG. 1: Computing System 100, comprising Host Computer 110 (Host Processor, System Interface, Host Memory) and Graphics Subsystem 170 containing Graphics Processor 105 with Graphics Interface 117, Memory Controller 120, Front End 130, IDX 135, Programmable Graphics Processing Pipeline 150, Raster Analyzer 160, and Output Controller 180 driving Output 185.]
`
`
`
[Sheet 2 of 9 — FIG. 2: Programmable Graphics Processing Pipeline 150, with Primitive Assembly/Setup 205, Raster Unit 210, Pixel Input Buffer, Vertex Input Buffer 220, Texture Unit 225, Texture Cache 230, four Execution Pipelines 240, Vertex Output Buffer 260, and Pixel Output Buffer.]
`
`
`
[Sheet 3 of 9 — FIG. 3: Execution Pipeline 240 containing a Multithreaded Processing Unit with Thread Control Unit 320 and Register File 350; inputs from 215 and 220.]
`
`
`
[Sheet 4 of 9 — FIG. 4: Execution Pipeline 240 containing Multithreaded Processing Unit 400, with Instruction Cache 410, Thread Control Unit, Thread Selection Unit, Instruction Sequencer 425, Instruction Scheduler 430, Resource Scoreboard 460, Instruction Dispatcher 440, Register File 350, and Execution Unit 470 containing PCU 375; interface to Texture Unit 225, inputs from 215 and 220, outputs to 260 and 270.]
`
`
`
[Sheet 5 of 9 — FIGS. 5A and 5B: flow diagrams of thread assignment. Both begin with "Receive Pointer to a Program" 510 and branch on "Pixel or Vertex?" 515 to "Vertex Thread Available?" 525 or "Pixel Thread Available?" 540, then "Assign Vertex Thread" 530 or "Assign Pixel Thread" 545; FIG. 5B adds a "Pass Priority Test?" check on each branch.]
`
`
`
[Sheet 6 of 9 — FIGS. 6A and 6B: exemplary portions of the Thread Storage Resource storing thread state data, with entries 620, 625, 630, 635, 640, and 645.]
`
`
`
[Sheet 7 of 9 — FIGS. 7A and 7B: flow diagrams of thread allocation and processing: "Allocating threads to a first sample type" 710, "Allocating threads to a second sample type" 715, "Execute First Program Instructions" 720, "Execute Second Program Instructions" 725; FIG. 7B adds "Determine allocations" 750 and repeats the sequence (755–775), ending with "Allocating threads to the first sample type".]
`
`
`
[Sheet 8 of 9 — FIGS. 8A and 8B: flow diagrams of thread assignment. FIG. 8A: "Receive Sample" 810, "Identify Sample Type" 815, "Thread Available?" 820, "Assign Thread" 825. FIG. 8B: "Receive Sample" 850, "Identify Sample Type" 855, "PIOR disabled?" 860, "Position Hazard?" 865, "Thread Available?" 870, "Assign Thread" 875, "Resources Available?" 877, "Execute Thread" 880, "Deallocate Resources".]
`
`
`
[Sheet 9 of 9 — FIGS. 9A and 9B: flow diagrams of thread selection. FIG. 9A: "Identify assigned thread(s)" 910, "Select Thread(s)" 915, "Read Program Counter(s)" 920, "Update Program Counter(s)" 925. FIG. 9B: "Determine thread priority" 950, "Identify assigned thread(s) for priority" 955, "No threads?" 960, "Select Thread(s)" 965, "Read Program Counter(s)" 970, "Update Program Counter(s)" 975, "Identify next priority" 980.]
`
`
`
US 7,038,685 B1

1

PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED EXECUTION OF PROGRAMS

FIELD OF THE INVENTION

One or more aspects of the invention generally relate to multithreaded processing, and more particularly to processing graphics data in a programmable graphics processor.

BACKGROUND

Current graphics data processing includes systems and methods developed to perform a specific operation on graphics data, e.g., linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. These graphics processors include several fixed function computation units to perform such specific operations on specific types of graphics data, such as vertex data and pixel data. More recently, the computation units have a degree of programmability to perform user specified operations such that the vertex data is processed by a vertex processing unit using vertex programs and the pixel data is processed by a pixel processing unit using pixel programs. When the amount of vertex data being processed is low relative to the amount of pixel data being processed, the vertex processing unit may be underutilized. Conversely, when the amount of vertex data being processed is high relative to the amount of pixel data being processed, the pixel processing unit may be underutilized.

Accordingly, it would be desirable to provide improved approaches to processing different types of graphics data to better utilize one or more processing units within a graphics processor.

SUMMARY

A method and apparatus for processing and allocating threads for multithreaded execution of graphics programs is described. A graphics processor for multithreaded execution of program instructions associated with threads to process at least two sample types includes a thread control unit including a thread storage resource configured to store thread state data for each of the threads to process the at least two sample types.

Alternatively, the graphics processor includes a multithreaded processing unit. The multithreaded processing unit includes a thread control unit configured to store pointers to program instructions associated with threads, each thread processing a sample type of vertex, pixel or primitive. The multithreaded processing unit also includes at least one programmable computation unit configured to process data under control of the program instructions.

A method of multithreaded processing of graphics data includes receiving a pointer to a vertex program to process vertex samples. A first thread is assigned to a vertex sample. A pointer to a shader program to process pixel samples is received. A second thread is assigned to a pixel sample. The vertex program is executed to process the vertex sample and produce a processed vertex sample. The shader program is executed to process the pixel sample and produce a processed pixel sample.

Alternatively, the method of multithreaded processing of graphics data includes allocating a first number of processing threads for a first sample type. A second number of processing threads is allocated for a second sample type. First program instructions associated with the first sample type are executed to process the graphics data and produce processed graphics data.

2

A method of assigning threads for processing of graphics data includes receiving a sample to be processed. A sample type of vertex, pixel, or primitive, associated with the sample is determined. A thread is determined to be available for assignment to the sample. The thread is assigned to the sample.

A method of selecting at least one thread for execution includes identifying one or more assigned threads from threads including at least a thread assigned to a pixel sample and a thread assigned to a vertex sample. At least one of the one or more assigned threads is selected for processing.

A method of improving performance of multithreaded processing of graphics data using at least two sample types includes dynamically allocating a first number of threads for processing a first portion of the graphics data to a first sample type and dynamically allocating a second number of threads for processing a second portion of the graphics data to a second sample type.

BRIEF DESCRIPTION OF THE DRAWINGS

Accompanying drawing(s) show exemplary embodiment(s) in accordance with one or more aspects of the present invention; however, the accompanying drawing(s) should not be taken to limit the present invention to the embodiment(s) shown, but are for explanation and understanding only.

FIG. 1 illustrates one embodiment of a computing system according to the invention including a host computer and a graphics subsystem.

FIG. 2 is a block diagram of an embodiment of the Programmable Graphics Processing Pipeline of FIG. 1.

FIG. 3 is a block diagram of an embodiment of the Execution Pipeline of FIG. 1.

FIG. 4 is a block diagram of an alternate embodiment of the Execution Pipeline of FIG. 1.

FIGS. 5A and 5B are flow diagrams of exemplary embodiments of thread assignment in accordance with one or more aspects of the present invention.

FIGS. 6A and 6B are exemplary embodiments of a portion of the Thread Storage Resource storing thread state data within an embodiment of the Thread Control Unit of FIG. 3 or FIG. 4.

FIGS. 7A and 7B are flow diagrams of exemplary embodiments of thread allocation and processing in accordance with one or more aspects of the present invention.

FIGS. 8A and 8B are flow diagrams of exemplary embodiments of thread assignment in accordance with one or more aspects of the present invention.

FIGS. 9A and 9B are flow diagrams of exemplary embodiments of thread selection in accordance with one or more aspects of the present invention.
`DETAILED DESCRIPTION
`
In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.

FIG. 1 is an illustration of a Computing System generally designated 100 and including a Host Computer 110 and a Graphics Subsystem 170. Computing System 100 may be a desktop computer, server, laptop computer, palm-sized computer, tablet computer, game console, cellular telephone, computer based simulator, or the like. Host Computer 110
`
`
`
US 7,038,685 B1

3

includes Host Processor 114 that may include a system memory controller to interface directly to Host Memory 112 or may communicate with Host Memory 112 through a System Interface 115. System Interface 115 may be an I/O (input/output) interface or a bridge device including the system memory controller to interface directly to Host Memory 112. Examples of System Interface 115 known in the art include Intel® Northbridge and Intel® Southbridge.

Host Computer 110 communicates with Graphics Subsystem 170 via System Interface 115 and a Graphics Interface 117 within a Graphics Processor 105. Data received at Graphics Interface 117 can be passed to a Front End 130 or written to a Local Memory 140 through Memory Controller 120. Graphics Processor 105 uses graphics memory to store graphics data and program instructions, where graphics data is any data that is input to or output from components within the graphics processor. Graphics memory can include portions of Host Memory 112, Local Memory 140, register files coupled to the components within Graphics Processor 105, and the like.

Graphics Processor 105 includes, among other components, Front End 130 that receives commands from Host Computer 110 via Graphics Interface 117. Front End 130 interprets and formats the commands and outputs the formatted commands and data to an IDX (Index Processor) 135. Some of the formatted commands are used by Programmable Graphics Processing Pipeline 150 to initiate processing of data by providing the location of program instructions or graphics data stored in memory. IDX 135, Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 each include an interface to Memory Controller 120 through which program instructions and data can be read from memory, e.g., any combination of Local Memory 140 and Host Memory 112. When a portion of Host Memory 112 is used to store program instructions and data, the portion of Host Memory 112 can be uncached so as to increase performance of access by Graphics Processor 105.

IDX 135 optionally reads processed data, e.g., data written by Raster Analyzer 160, from memory and outputs the data, processed data and formatted commands to Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 each contain one or more programmable processing units to perform a variety of specialized functions. Some of these functions are table lookup, scalar and vector addition, multiplication, division, coordinate-system mapping, calculation of vector normals, tessellation, calculation of derivatives, interpolation, and the like. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 are each optionally configured such that data processing operations are performed in multiple passes through those units or in multiple passes within Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 also each include a write interface to Memory Controller 120 through which data can be written to memory.

In a typical implementation Programmable Graphics Processing Pipeline 150 performs geometry computations, rasterization, and pixel computations. Therefore Programmable Graphics Processing Pipeline 150 is programmed to operate on surface, primitive, vertex, fragment, pixel, sample or any other data. For simplicity, the remainder of this description will use the term "samples" to refer to graphics data such as surfaces, primitives, vertices, pixels, fragments, or the like.

Samples output by Programmable Graphics Processing Pipeline 150 are passed to a Raster Analyzer 160, which optionally performs near and far plane clipping and raster

4

operations, such as stencil, z test, and the like, and saves the results or the samples output by Programmable Graphics Processing Pipeline 150 in Local Memory 140. When the data received by Graphics Subsystem 170 has been completely processed by Graphics Processor 105, an Output 185 of Graphics Subsystem 170 is provided using an Output Controller 180. Output Controller 180 is optionally configured to deliver data to a display device, network, electronic control system, other Computing System 100, other Graphics Subsystem 170, or the like. Alternatively, data is output to a film recording device or written to a peripheral device, e.g., disk drive, tape, compact disk, or the like.

FIG. 2 is an illustration of Programmable Graphics Processing Pipeline 150 of FIG. 1. At least one set of samples is output by IDX 135 and received by Programmable Graphics Processing Pipeline 150 and the at least one set of samples is processed according to at least one program, the at least one program including graphics program instructions. A program can process one or more sets of samples. Conversely, a set of samples can be processed by a sequence of one or more programs.

Samples, such as surfaces, primitives, or the like, are received from IDX 135 by Programmable Graphics Processing Pipeline 150 and stored in a Vertex Input Buffer 220 including a register file, FIFO (first in first out), cache, or the like (not shown). The samples are broadcast to Execution Pipelines 240, four of which are shown in the figure. Each Execution Pipeline 240 includes at least one multithreaded processing unit, to be described further herein. The samples output by Vertex Input Buffer 220 can be processed by any one of the Execution Pipelines 240. A sample is accepted by an Execution Pipeline 240 when a processing thread within the Execution Pipeline 240 is available as described further herein. Each Execution Pipeline 240 signals to Vertex Input Buffer 220 when a sample can be accepted or when a sample cannot be accepted. In one embodiment Programmable Graphics Processing Pipeline 150 includes a single Execution Pipeline 240 containing one multithreaded processing unit. In an alternative embodiment, Programmable Graphics Processing Pipeline 150 includes a plurality of Execution Pipelines 240.

Execution Pipelines 240 may receive first samples, such as higher-order surface data, and tessellate the first samples to generate second samples, such as vertices. Execution Pipelines 240 may be configured to transform the second samples from an object-based coordinate representation (object space) to an alternatively based coordinate system such as world space or normalized device coordinates (NDC) space. Each Execution Pipeline 240 may communicate with Texture Unit 225 using a read interface (not shown in FIG. 2) to read program instructions and graphics data such as texture maps from Local Memory 140 or Host Memory 112 via Memory Controller 120 and a Texture Cache 230. Texture Cache 230 is used to improve memory read performance by reducing read latency. In an alternate embodiment Texture Cache 230 is omitted. In another alternate embodiment, a Texture Unit 225 is included in each Execution Pipeline 240. In another alternate embodiment program instructions are stored within Programmable Graphics Processing Pipeline 150. In another alternate embodiment each Execution Pipeline 240 has a dedicated instruction read interface to read program instructions from Local Memory 140 or Host Memory 112 via Memory Controller 120.

Execution Pipelines 240 output processed samples, such as vertices, that are stored in a Vertex Output Buffer 260 including a register file, FIFO, cache, or the like (not
`
`
`
US 7,038,685 B1

5

shown). Processed vertices output by Vertex Output Buffer 260 are received by a Primitive Assembly/Setup Unit 205. Primitive Assembly/Setup Unit 205 calculates parameters, such as deltas and slopes, to rasterize the processed vertices and outputs parameters and samples, such as vertices, to a Raster Unit 210. Raster Unit 210 performs scan conversion on samples, such as vertices, and outputs samples, such as fragments, to a Pixel Input Buffer 215. Alternatively, Raster Unit 210 resamples processed vertices and outputs additional vertices to Pixel Input Buffer 215.

Pixel Input Buffer 215 outputs the samples to each Execution Pipeline 240. Samples, such as pixels and fragments, output by Pixel Input Buffer 215 are each processed by only one of the Execution Pipelines 240. Pixel Input Buffer 215 determines which one of the Execution Pipelines 240 to output each sample to depending on an output pixel position, e.g., (x,y), associated with each sample. In this manner, each sample is output to the Execution Pipeline 240 designated to process samples associated with the output pixel position. In an alternate embodiment, each sample output by Pixel Input Buffer 215 is processed by one of any available Execution Pipelines 240.

Each Execution Pipeline 240 signals to Pixel Input Buffer 215 when a sample can be accepted or when a sample cannot be accepted as described further herein. Program instructions configure programmable computation units (PCUs) within an Execution Pipeline 240 to perform operations such as tessellation, perspective correction, texture mapping, shading, blending, and the like. Processed samples are output from each Execution Pipeline 240 to a Pixel Output Buffer 270. Pixel Output Buffer 270 optionally stores the processed samples in a register file, FIFO, cache, or the like (not shown). The processed samples are output from Pixel Output Buffer 270 to Raster Analyzer 160.

FIG. 3 is a block diagram of an embodiment of Execution Pipeline 240 of FIG. 1 including at least one Multithreaded Processing Unit 300. An Execution Pipeline 240 can contain a plurality of Multithreaded Processing Units 300, each Multithreaded Processing Unit 300 containing at least one PCU 375. PCUs 375 are configured using program instructions read by a Thread Control Unit 320 via Texture Unit 225. Thread Control Unit 320 gathers source data specified by the program instructions and dispatches the source data and program instructions to at least one PCU 375. PCUs 375 perform computations specified by the program instructions and output data to at least one destination, e.g., Pixel Output Buffer 270, Vertex Output Buffer 260 and Thread Control Unit 320.

A single program may be used to process several sets of samples. Thread Control Unit 320 receives samples or pointers to samples stored in Pixel Input Buffer 215 and Vertex Input Buffer 220. Thread Control Unit 320 receives a pointer to a program to process one or more samples. Thread Control Unit 320 assigns a thread to each sample to be processed. A thread includes a pointer to a program instruction (program counter), such as the first instruction within the program, thread state information, and storage resources for storing intermediate data generated during processing of the sample. Thread state information is stored in a TSR (Thread Storage Resource) 325. TSR 325 may be a register file, FIFO, circular buffer, or the like. An instruction specifies the location of source data needed to execute the instruction. Source data, such as intermediate data generated during processing of the sample, is stored in a Register File 350. In addition to Register File 350, other source data may be stored in Pixel Input Buffer 215 or Vertex Input

6

Buffer 220. In an alternate embodiment source data is stored in Local Memory 140, locations in Host Memory 112, and the like.

Alternatively, in an embodiment permitting multiple programs for two or more thread types, Thread Control Unit 320 also receives a program identifier specifying which one of the two or more programs the program counter is associated with. Specifically, in an embodiment permitting simultaneous execution of four programs for a thread type, two bits of thread state information are used to store the program identifier for a thread. Multithreaded execution of programs is possible because each thread may be executed independent of other threads, regardless of whether the other threads are executing the same program or a different program. PCUs 375 update each program counter associated with the threads in Thread Control Unit 320 following the execution of an instruction. For execution of a loop, call, return, or branch instruction the program counter may be updated based on the loop, call, return, or branch instruction.

For example, each fragment or group of fragments within a primitive can be processed independently from the other fragments or from the other groups of fragments within the primitive. Likewise, each vertex within a surface can be processed independently from the other vertices within the surface. For a set of samples being processed using the same program, the sequence of program instructions associated with each thread used to process each sample within the set will be identical, although the program counter for each thread may vary. However, it is possible that, during execution, the threads processing some of the samples within a set will diverge following the execution of a conditional branch instruction. After the execution of a conditional branch instruction, the sequence of executed instructions associated with each thread processing samples within the set may differ and each program counter stored in TSR 325 within Thread Control Unit 320 for the threads may differ accordingly.

FIG. 4 is an illustration of an alternate embodiment of Execution Pipeline 240 containing at least one Multithreaded Processing Unit 400. Thread Control Unit 420 includes a TSR 325 to retain thread state data. In one embodiment TSR 325 stores thread state data for each of at least two thread types, where the at least two thread types may include pixel, primitive, and vertex. Thread state data for a thread may include, among other things, a program counter, a busy flag that indicates if the thread is either assigned to a sample or available to be assigned to a sample, a pointer to a source sample to be processed by the instructions associated with the thread or the output pixel position and output buffer ID of the source sample to be processed, and a pointer specifying a destination location in Vertex Output Buffer 260 or Pixel Output Buffer 270. Additionally, thread state data for a thread assigned to a sample may include the sample type, e.g., pixel, vertex, primitive, or the like. The type of data a thread processes identifies the thread type, e.g., pixel, vertex, primitive, or the like. For example, a thread may process a primitive, producing a vertex. After the vertex is rasterized and fragments are generated, the thread may process a fragment.

Source samples are stored in either Pixel Input Buffer 215 or Vertex Input Buffer 220. Thread allocation priority, as described further herein, is used to assign a thread to a source sample. A thread allocation priority is specified for each sample type and Thread Control Unit 420 is configured to assign threads to samples or allocate locations in a Register File 350 based on the priority assigned to each sample type. The thread allocation priority may be fixed,
`
`
`
US 7,038,685 B1

7

programmable, or dynamic. In one embodiment the thread allocation priority may be fixed, always giving priority to allocating vertex threads; pixel threads are only allocated if vertex samples are not available for assignment to a thread.
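A minimal sketch of this fixed allocation policy, assuming plain Python lists stand in for Vertex Input Buffer 220 and Pixel Input Buffer 215 (the function name is ours, not the patent's):

```python
def next_sample(vertex_buffer, pixel_buffer):
    """Fixed thread allocation priority: vertex samples always win; a pixel
    sample is taken only when no vertex sample is waiting."""
    if vertex_buffer:
        return ("vertex", vertex_buffer.pop(0))
    if pixel_buffer:
        return ("pixel", pixel_buffer.pop(0))
    return None  # nothing waiting in either input buffer

vertices = ["v0", "v1"]
pixels = ["p0"]
order = [next_sample(vertices, pixels) for _ in range(3)]
print(order)  # vertices drain first, then the pixel sample
```

Under this policy the vertex buffer always drains completely before any pixel sample is assigned a thread.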
In an alternate embodiment, Thread Control Unit 420 is configured to assign threads to source samples or allocate locations in Register File 350 using thread allocation priorities based on an amount of sample data in Pixel Input Buffer 215 and another amount of sample data in Vertex Input Buffer 220. Dynamically modifying a thread allocation priority for vertex samples based on the amount of sample data in Vertex Input Buffer 220 permits Vertex Input Buffer 220 to drain faster and fill Vertex Output Buffer 260 and Pixel Input Buffer 215 faster, or drain slower and fill Vertex Output Buffer 260 and Pixel Input Buffer 215 slower. Dynamically modifying a thread allocation priority for pixel samples based on the amount of sample data in Pixel Input Buffer 215 permits Pixel Input Buffer 215 to drain faster and fill Pixel Output Buffer 270 faster, or drain slower and fill Pixel Output Buffer 270 slower.
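One way to picture a dynamic thread allocation priority driven by input-buffer occupancy; the fill-fraction encoding and the fuller-buffer-wins rule are assumptions for illustration, not details from the patent:

```python
def allocation_priorities(pixel_fill, vertex_fill):
    """Dynamic thread allocation priority derived from input-buffer occupancy.
    Each fill argument is a fraction in [0, 1]; the fuller buffer receives the
    higher priority so that it drains faster."""
    return {"pixel": pixel_fill, "vertex": vertex_fill}

def pick_type(pixel_fill, vertex_fill):
    """Choose the sample type to assign the next free thread to."""
    prio = allocation_priorities(pixel_fill, vertex_fill)
    return max(prio, key=prio.get)

print(pick_type(pixel_fill=0.9, vertex_fill=0.2))  # "pixel"
print(pick_type(pixel_fill=0.1, vertex_fill=0.6))  # "vertex"
```

Raising the vertex priority drains Vertex Input Buffer 220 faster, which in turn fills Pixel Input Buffer 215; lowering it has the opposite effect, as the text above describes.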
In a further alternate embodiment, Thread Control Unit 420 is configured to assign threads to source samples or allocate locations in Register File 350 using thread allocation priorities based on graphics primitive size (number of pixels or fragments included in a primitive) or a number of graphics primitives in Vertex Output Buffer 260. For example, a dynamically determined thread allocation priority may be determined based on a number of "pending" pixels, i.e., the number of pixels to be rasterized from the primitives in Primitive Assembly/Setup 205 and in Vertex Output Buffer 260. Specifically, the thread allocation priority may be tuned such that the number of pending pixels produced by processing vertex threads is adequate to achieve maximum utilization of the computation resources in Execution Pipelines 240 processing pixel threads.
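A sketch of the pending-pixel heuristic just described; representing each queued primitive by its pixel count, and the 4096-pixel target, are assumptions for illustration only:

```python
def pending_pixels(primitives):
    """Estimate of "pending" pixels: pixels still to be rasterized from the
    primitives queued in Primitive Assembly/Setup and the vertex output buffer.
    Each primitive is represented here by its screen-space pixel count."""
    return sum(primitives)

def vertex_allocation_priority(primitives, target=4096):
    """Raise vertex-thread allocation priority when too few pixels are pending
    to keep the pixel pipelines busy; lower it once enough work is queued."""
    return "high" if pending_pixels(primitives) < target else "low"

print(vertex_allocation_priority([100, 250]))    # few pending pixels -> "high"
print(vertex_allocation_priority([3000, 2500]))  # backlog exists -> "low"
```

The target would be tuned, as the text says, so that vertex threads generate just enough rasterization work to keep the pixel-processing resources fully utilized.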
Once a thread is assigned to a source sample, the thread is allocated storage resources such as locations in a Register File 350 to retain intermediate data generated during execution of program instructions associated with the thread. Alternatively, source data is stored in storage resources including Local Memory 140, locations in Host Memory 112, and the like.
A Thread Selection Unit 415 reads one or more thread entries, each containing thread state data, from Thread Control Unit 420. Thread Selection Unit 415 may read thread entries to process a group of samples. For example, in one embodiment a group of samples, e.g., a number of vertices defining a primitive, four adjacent fragments arranged in a square, or the like, are processed simultaneously. In the one embodiment computed values such as derivatives are shared within the group of samples, thereby reducing the number of computations needed to process the group of samples compared with processing the group of samples without sharing the computed values.
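The sharing of computed values within a group, e.g. four adjacent fragments arranged in a square, can be sketched as follows; the finite-difference derivative formulas are a common illustration of the idea, not taken from the patent:

```python
def quad_derivatives(values):
    """For a 2x2 quad of fragment values laid out as
        [[top_left, top_right], [bottom_left, bottom_right]],
    compute one ddx and one ddy by finite differences and share them across
    all four fragments, instead of computing derivatives per fragment."""
    (tl, tr), (bl, br) = values
    ddx = tr - tl  # horizontal difference shared by the quad
    ddy = bl - tl  # vertical difference shared by the quad
    return ddx, ddy

ddx, ddy = quad_derivatives([[1.0, 3.0], [2.0, 4.0]])
print(ddx, ddy)  # 2.0 1.0
```

One pair of differences serves the whole group, which is the computation saving the text attributes to grouped processing.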
In Multithreaded Processing Unit 400, a thread execution priority is specified for each thread type and Thread Selection Unit 415 is configured to read thread entries based on the thread execution priority assigned to each thread type. A thread execution priority may be fixed, programmable, or dynamic. In one embodiment the thread execution priority may be fixed, always giving priority to execution of vertex threads; pixel threads are only executed if vertex threads are not available for execution.
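The fixed thread execution priority described above amounts to scanning assigned threads in type-priority order; in this sketch a TSR entry is reduced to a `(thread_type, busy)` pair, an encoding we chose for illustration:

```python
def select_threads(tsr_entries, execution_priority=("vertex", "pixel"), count=1):
    """Select up to `count` busy threads for execution, visiting thread types
    in fixed priority order: pixel threads run only when no further vertex
    threads are available."""
    selected = []
    for wanted in execution_priority:
        for index, (thread_type, busy) in enumerate(tsr_entries):
            if busy and thread_type == wanted and index not in selected:
                selected.append(index)
                if len(selected) == count:
                    return selected
    return selected

entries = [("pixel", True), ("vertex", False), ("vertex", True)]
print(select_threads(entries, count=1))  # the busy vertex thread at index 2 wins
```

A programmable or dynamic variant would reorder `execution_priority` at run time rather than fixing it.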
In another embodiment, Thread Selection Unit 415 is configured to read thread entries based on the amount of sample data in Pixel Input Buffer 215 and the amount of

8

sample data in Vertex Input Buffer 220. In a further alternate embodiment, Thread Selection Unit 415 is configured to read thread entries using a priority based on graphics primitive size (number of pixels or fragments included in a primitive) or a number of graphics primitives in Vertex Output Buffer 260. For example, a dynamically determined thread execution priority is determined based on a number of "pending" pixels, i.e., the number of pixels to be rasterized from the primitives in Primitive Assembly/Setup 205 and in Vertex Output Buffer 260. Specifically, the thread execution priority may be tuned such that the number of pending pixels produced by processing vertex threads is adequate to achieve maximum utilization of the computation