TO ALL TO WHOM THESE PRESENTS SHALL COME:

UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office

May 9, 2023

THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM
THE RECORDS OF THIS OFFICE OF:

PATENT NUMBER: 7,038,685
ISSUE DATE: May 2, 2006

By Authority of the
Under Secretary of Commerce for Intellectual Property
and Director of the United States Patent and Trademark Office

Miguel Tarver
Certifying Officer

TCL 1005

(12) United States Patent
     Lindholm

(10) Patent No.: US 7,038,685 B1
(45) Date of Patent: May 2, 2006

US007038685B1

(54) PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED
     EXECUTION OF PROGRAMS

(75) Inventor: John Erik Lindholm, Saratoga, CA (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is
    extended or adjusted under 35 U.S.C. 154(b) by 134 days.

(21) Appl. No.: 10/609,967

(22) Filed: Jun. 30, 2003

(51) Int. Cl.
     G06F 15/00 (2006.01)
     G06F 13/00 (2006.01)
     G06F 12/02 (2006.01)
     G06F 9/46 (2006.01)
     G06T 1/00 (2006.01)

(52) U.S. Cl. ................ 345/501; 345/543; 345/536; 718/104

(58) Field of Classification Search ................ 345/501,
     502, 530, 531, 522, 418, 419, 426, 427, 543, 505, 536;
     718/100, 104, 103
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,020,115 A *   5/1991  Black ..................... 382/298
     5,969,726 A     10/1999 Rentschler et al.
     6,630,935 B1 *  10/2003 Taylor et al. ............. 345/522
     6,731,289 B1 *  5/2004  Peercy et al. ............. 345/503
     2003/0041173 A1 2/2003  Hoyle

     * cited by examiner

Primary Examiner—Kee M. Tung
(74) Attorney, Agent, or Firm—Patterson & Sheridan, LLP

(57) ABSTRACT

A programmable graphics processor for multithreaded execution of
program instructions including a thread control unit. The
programmable graphics processor is programmed with program
instructions for processing primitive, pixel and vertex data. The
thread control unit has a thread storage resource including locations
allocated to store thread state data associated with samples of two or
more types. Sample types include primitive, pixel and vertex. A number
of threads allocated to processing a sample type may be dynamically
modified.

45 Claims, 9 Drawing Sheets
[Front-page drawing: Execution Pipeline containing Multithreaded
Processing Unit 400, with Instruction Cache 410, Thread Selection
Unit, Thread Control Unit, Instruction Scheduler 430, Resource
Scoreboard 460, Instruction Dispatcher, and Register File 350;
corresponds to FIG. 4.]

`

U.S. Patent          May 2, 2006          Sheet 1 of 9          US 7,038,685 B1

[FIG. 1: Computing System 100, comprising Host Computer 110 (Host
Processor, System Interface, Host Memory) coupled through Graphics
Interface 117 to Graphics Subsystem 170, which contains Graphics
Processor 105 with Memory Controller 120, Front End 130, IDX 135,
Programmable Graphics Processing Pipeline 150, Raster Analyzer 160,
and Output Controller 180 driving Output 185.]
`

`

U.S. Patent          May 2, 2006          Sheet 2 of 9          US 7,038,685 B1

[FIG. 2: Programmable Graphics Processing Pipeline 150, receiving
samples from IDX 135: Primitive Assembly/Setup 205, Raster Unit 210,
Pixel Input Buffer, Vertex Input Buffer 220, four Execution Pipelines
240, Texture Unit 225 with Texture Cache 230, Vertex Output Buffer
260, and Pixel Output Buffer, coupled to Memory Controller 120.]
`

`

U.S. Patent          May 2, 2006          Sheet 3 of 9          US 7,038,685 B1

[FIG. 3: Execution Pipeline 240 containing a Multithreaded Processing
Unit with Thread Control Unit 320 and Register File 350, receiving
samples from 215 and 220.]
`

`

U.S. Patent          May 2, 2006          Sheet 4 of 9          US 7,038,685 B1

[FIG. 4: Execution Pipeline 240 containing Multithreaded Processing
Unit 400: Instruction Cache 410, Thread Control Unit 420, Thread
Selection Unit, Sequencer 425, Instruction Scheduler 430, Resource
Scoreboard 460, Instruction Dispatcher 440, Register File 350, and
Execution Unit 470 with PCU 375; coupled to Texture Unit 225, with
inputs from 215 and 220 and outputs to 260 and 270.]
`

`

U.S. Patent          May 2, 2006          Sheet 5 of 9          US 7,038,685 B1

[FIGS. 5A and 5B: thread assignment flow diagrams. In FIG. 5A, a
pointer to a program is received (510), the sample is identified as
pixel or vertex (515), a thread-availability check is made (525, 540),
and a vertex thread (530) or pixel thread (545) is assigned. FIG. 5B
adds a priority test before the thread-availability check for each
sample type.]
`

`

U.S. Patent          May 2, 2006          Sheet 6 of 9          US 7,038,685 B1

[FIGS. 6A and 6B: portions of the Thread Storage Resource storing
thread state data (entries 620, 625, 630, 635, 640, 645).]
`

`

U.S. Patent          May 2, 2006          Sheet 7 of 9          US 7,038,685 B1

[FIGS. 7A and 7B: thread allocation and processing flow diagrams. In
FIG. 7A, threads are allocated to a first sample type (710) and a
second sample type (715), and first and second program instructions
are executed. In FIG. 7B, allocations are determined (750), threads
are allocated to the first (755) and second (760) sample types, the
first (765) and second (770) program instructions are executed, and
threads are reallocated to the first sample type.]
`

`

U.S. Patent          May 2, 2006          Sheet 8 of 9          US 7,038,685 B1

[FIGS. 8A and 8B: thread assignment flow diagrams. In FIG. 8A a sample
is received (810), its sample type is identified (815), thread
availability is checked (820), and a thread is assigned (825). In FIG.
8B a sample is received (850) and its type identified (855); checks
for a position hazard (860), PIOR disabled (865), thread availability,
and resource availability (877) precede thread assignment (875),
thread execution (880), and resource deallocation.]
`

`

U.S. Patent          May 2, 2006          Sheet 9 of 9          US 7,038,685 B1

[FIGS. 9A and 9B: thread selection flow diagrams. In FIG. 9A assigned
threads are identified (910) and selected (915), and their program
counters are read (920) and updated (925). In FIG. 9B thread priority
is determined (950) and assigned threads for that priority are
identified (955); if there are no such threads (960) the next priority
is identified (980), otherwise threads are selected (965) and their
program counters are read (970) and updated (975).]
`

`

PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED EXECUTION OF
PROGRAMS

FIELD OF THE INVENTION

One or more aspects of the invention generally relate to multithreaded
processing, and more particularly to processing graphics data in a
programmable graphics processor.

BACKGROUND

Current graphics data processing includes systems and methods
developed to perform a specific operation on graphics data, e.g.,
linear interpolation, tessellation, rasterization, texture mapping,
depth testing, etc. These graphics processors include several fixed
function computation units to perform such specific operations on
specific types of graphics data, such as vertex data and pixel data.
More recently, the computation units have a degree of programmability
to perform user specified operations such that the vertex data is
processed by a vertex processing unit using vertex programs and the
pixel data is processed by a pixel processing unit using pixel
programs. When the amount of vertex data being processed is low
relative to the amount of pixel data being processed, the vertex
processing unit may be underutilized. Conversely, when the amount of
vertex data being processed is high relative to the amount of pixel
data being processed, the pixel processing unit may be underutilized.

Accordingly, it would be desirable to provide improved approaches to
processing different types of graphics data to better utilize one or
more processing units within a graphics processor.

SUMMARY

A method and apparatus for processing and allocating threads for
multithreaded execution of graphics programs is described. A graphics
processor for multithreaded execution of program instructions
associated with threads to process at least two sample types includes
a thread control unit including a thread storage resource configured
to store thread state data for each of the threads to process the at
least two sample types.

Alternatively, the graphics processor includes a multithreaded
processing unit. The multithreaded processing unit includes a thread
control unit configured to store pointers to program instructions
associated with threads, each thread processing a sample type of
vertex, pixel or primitive. The multithreaded processing unit also
includes at least one programmable computation unit configured to
process data under control of the program instructions.

A method of multithreaded processing of graphics data includes
receiving a pointer to a vertex program to process vertex samples. A
first thread is assigned to a vertex sample. A pointer to a shader
program to process pixel samples is received. A second thread is
assigned to a pixel sample. The vertex program is executed to process
the vertex sample and produce a processed vertex sample. The shader
program is executed to process the pixel sample and produce a
processed pixel sample.

Alternatively, the method of multithreaded processing of graphics data
includes allocating a first number of processing threads for a first
sample type. A second number of processing threads is allocated for a
second sample type. First program instructions associated with the
first sample type are executed to process the graphics data and
produce processed graphics data.

A method of assigning threads for processing of graphics data includes
receiving a sample to be processed. A sample type of vertex, pixel, or
primitive, associated with the sample is determined. A thread is
determined to be available for assignment to the sample. The thread is
assigned to the sample.

A method of selecting at least one thread for execution includes
identifying one or more assigned threads from threads including at
least a thread assigned to a pixel sample and a thread assigned to a
vertex sample. At least one of the one or more assigned threads is
selected for processing.

A method of improving performance of multithreaded processing of
graphics data using at least two sample types includes dynamically
allocating a first number of threads for processing a first portion of
the graphics data to a first sample type and dynamically allocating a
second number of threads for processing a second portion of the
graphics data to a second sample type.

BRIEF DESCRIPTION OF THE DRAWINGS

Accompanying drawing(s) show exemplary embodiment(s) in accordance
with one or more aspects of the present invention; however, the
accompanying drawing(s) should not be taken to limit the present
invention to the embodiment(s) shown, but are for explanation and
understanding only.

FIG. 1 illustrates one embodiment of a computing system according to
the invention including a host computer and a graphics subsystem.

FIG. 2 is a block diagram of an embodiment of the Programmable
Graphics Processing Pipeline of FIG. 1.

FIG. 3 is a block diagram of an embodiment of the Execution Pipeline
of FIG. 1.

FIG. 4 is a block diagram of an alternate embodiment of the Execution
Pipeline of FIG. 1.

FIGS. 5A and 5B are flow diagrams of exemplary embodiments of thread
assignment in accordance with one or more aspects of the present
invention.

FIGS. 6A and 6B are exemplary embodiments of a portion of the Thread
Storage Resource storing thread state data within an embodiment of the
Thread Control Unit of FIG. 3 or FIG. 4.

FIGS. 7A and 7B are flow diagrams of exemplary embodiments of thread
allocation and processing in accordance with one or more aspects of
the present invention.

FIGS. 8A and 8B are flow diagrams of exemplary embodiments of thread
assignment in accordance with one or more aspects of the present
invention.

FIGS. 9A and 9B are flow diagrams of exemplary embodiments of thread
selection in accordance with one or more aspects of the present
invention.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth
to provide a more thorough understanding of the present invention.
However, it will be apparent to one of skill in the art that the
present invention may be practiced without one or more of these
specific details. In other instances, well-known features have not
been described in order to avoid obscuring the present invention.

FIG. 1 is an illustration of a Computing System generally designated
100 and including a Host Computer 110 and a Graphics Subsystem 170.
Computing System 100 may be a desktop computer, server, laptop
computer, palm-sized computer, tablet computer, game console, cellular
telephone, computer based simulator, or the like. Host Computer 110
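The Summary's final method dynamically splits a thread budget between two sample types. A minimal sketch of one way such a split could be computed; the proportional rule, the clamping, and the function name are illustrative assumptions, not taken from the patent:

```python
def allocate_threads(total_threads, vertex_pending, pixel_pending):
    """Split a fixed thread budget between vertex and pixel samples in
    proportion to the pending work of each type (illustrative policy).

    Returns (vertex_threads, pixel_threads).
    """
    pending = vertex_pending + pixel_pending
    if pending == 0:
        half = total_threads // 2
        return half, total_threads - half
    vertex_share = round(total_threads * vertex_pending / pending)
    # Keep at least one thread per type so neither unit starves.
    vertex_share = max(1, min(total_threads - 1, vertex_share))
    return vertex_share, total_threads - vertex_share
```

Recomputing the split as the mix of pending vertex and pixel samples changes is what makes the allocation dynamic, addressing the underutilization problem described in the Background.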
`

`

includes Host Processor 114 that may include a system memory
controller to interface directly to Host Memory 112 or may communicate
with Host Memory 112 through a System Interface 115. System Interface
115 may be an I/O (input/output) interface or a bridge device
including the system memory controller to interface directly to Host
Memory 112. Examples of System Interface 115 known in the art include
Intel® Northbridge and Intel® Southbridge.

Host Computer 110 communicates with Graphics Subsystem 170 via System
Interface 115 and a Graphics Interface 117 within a Graphics Processor
105. Data received at Graphics Interface 117 can be passed to a Front
End 130 or written to a Local Memory 140 through Memory Controller
120. Graphics Processor 105 uses graphics memory to store graphics
data and program instructions, where graphics data is any data that is
input to or output from components within the graphics processor.
Graphics memory can include portions of Host Memory 112, Local Memory
140, register files coupled to the components within Graphics
Processor 105, and the like.

Graphics Processor 105 includes, among other components, Front End 130
that receives commands from Host Computer 110 via Graphics Interface
117. Front End 130 interprets and formats the commands and outputs the
formatted commands and data to an IDX (Index Processor) 135. Some of
the formatted commands are used by Programmable Graphics Processing
Pipeline 150 to initiate processing of data by providing the location
of program instructions or graphics data stored in memory. IDX 135,
Programmable Graphics Processing Pipeline 150 and a Raster Analyzer
160 each include an interface to Memory Controller 120 through which
program instructions and data can be read from memory, e.g., any
combination of Local Memory 140 and Host Memory 112. When a portion of
Host Memory 112 is used to store program instructions and data, the
portion of Host Memory 112 can be uncached so as to increase
performance of access by Graphics Processor 105.

IDX 135 optionally reads processed data, e.g., data written by Raster
Analyzer 160, from memory and outputs the data, processed data and
formatted commands to Programmable Graphics Processing Pipeline 150.
Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160
each contain one or more programmable processing units to perform a
variety of specialized functions. Some of these functions are table
lookup, scalar and vector addition, multiplication, division,
coordinate-system mapping, calculation of vector normals,
tessellation, calculation of derivatives, interpolation, and the like.
Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160
are each optionally configured such that data processing operations
are performed in multiple passes through those units or in multiple
passes within Programmable Graphics Processing Pipeline 150.
Programmable Graphics Processing Pipeline 150 and a Raster Analyzer
160 also each include a write interface to Memory Controller 120
through which data can be written to memory.

In a typical implementation Programmable Graphics Processing Pipeline
150 performs geometry computations, rasterization, and pixel
computations. Therefore Programmable Graphics Processing Pipeline 150
is programmed to operate on surface, primitive, vertex, fragment,
pixel, sample or any other data. For simplicity, the remainder of this
description will use the term "samples" to refer to graphics data such
as surfaces, primitives, vertices, pixels, fragments, or the like.

Samples output by Programmable Graphics Processing Pipeline 150 are
passed to a Raster Analyzer 160, which optionally performs near and
far plane clipping and raster operations, such as stencil, z test, and
the like, and saves the results or the samples output by Programmable
Graphics Processing Pipeline 150 in Local Memory 140. When the data
received by Graphics Subsystem 170 has been completely processed by
Graphics Processor 105, an Output 185 of Graphics Subsystem 170 is
provided using an Output Controller 180. Output Controller 180 is
optionally configured to deliver data to a display device, network,
electronic control system, other Computing System 100, other Graphics
Subsystem 170, or the like. Alternatively, data is output to a film
recording device or written to a peripheral device, e.g., disk drive,
tape, compact disk, or the like.

FIG. 2 is an illustration of Programmable Graphics Processing Pipeline
150 of FIG. 1. At least one set of samples is output by IDX 135 and
received by Programmable Graphics Processing Pipeline 150 and the at
least one set of samples is processed according to at least one
program, the at least one program including graphics program
instructions. A program can process one or more sets of samples.
Conversely, a set of samples can be processed by a sequence of one or
more programs.

Samples, such as surfaces, primitives, or the like, are received from
IDX 135 by Programmable Graphics Processing Pipeline 150 and stored in
a Vertex Input Buffer 220 including a register file, FIFO (first in
first out), cache, or the like (not shown). The samples are broadcast
to Execution Pipelines 240, four of which are shown in the figure.
Each Execution Pipeline 240 includes at least one multithreaded
processing unit, to be described further herein. The samples output by
Vertex Input Buffer 220 can be processed by any one of the Execution
Pipelines 240. A sample is accepted by an Execution Pipeline 240 when
a processing thread within the Execution Pipeline 240 is available as
described further herein. Each Execution Pipeline 240 signals to
Vertex Input Buffer 220 when a sample can be accepted or when a sample
cannot be accepted. In one embodiment Programmable Graphics Processing
Pipeline 150 includes a single Execution Pipeline 240 containing one
multithreaded processing unit. In an alternative embodiment,
Programmable Graphics Processing Pipeline 150 includes a plurality of
Execution Pipelines 240.

Execution Pipelines 240 may receive first samples, such as
higher-order surface data, and tessellate the first samples to
generate second samples, such as vertices. Execution Pipelines 240 may
be configured to transform the second samples from an object-based
coordinate representation (object space) to an alternatively based
coordinate system such as world space or normalized device coordinates
(NDC) space. Each Execution Pipeline 240 may communicate with Texture
Unit 225 using a read interface (not shown in FIG. 2) to read program
instructions and graphics data such as texture maps from Local Memory
140 or Host Memory 112 via Memory Controller 120 and a Texture Cache
230. Texture Cache 230 is used to improve memory read performance by
reducing read latency. In an alternate embodiment Texture Cache 230 is
omitted. In another alternate embodiment, a Texture Unit 225 is
included in each Execution Pipeline 240. In another alternate
embodiment program instructions are stored within Programmable
Graphics Processing Pipeline 150. In another alternate embodiment each
Execution Pipeline 240 has a dedicated instruction read interface to
read program instructions from Local Memory 140 or Host Memory 112 via
Memory Controller 120.

Execution Pipelines 240 output processed samples, such as vertices,
that are stored in a Vertex Output Buffer 260 including a register
file, FIFO, cache, or the like (not
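The acceptance protocol described above, where a pipeline takes a vertex sample only when one of its processing threads is free and otherwise the sample waits in Vertex Input Buffer 220, can be sketched as follows; the class shape and method names are hypothetical, not taken from the patent:

```python
class ExecutionPipeline:
    """Toy model of an Execution Pipeline that accepts a sample only
    when one of its processing threads is free (illustrative)."""

    def __init__(self, num_threads):
        self.free_threads = num_threads
        self.accepted = []

    def can_accept(self):
        # Corresponds to the pipeline signaling the input buffer
        # whether a sample can be accepted.
        return self.free_threads > 0

    def accept(self, sample):
        assert self.can_accept()
        self.free_threads -= 1
        self.accepted.append(sample)


def dispatch_vertex(sample, pipelines):
    """Vertex samples may be processed by any pipeline; take the first
    one that signals acceptance, else leave the sample queued."""
    for pipeline in pipelines:
        if pipeline.can_accept():
            pipeline.accept(sample)
            return pipeline
    return None  # all pipelines busy; sample waits in the input buffer
```

This back-pressure scheme matches the text's broadcast model for vertex samples; pixel samples, as described later, are instead routed to a specific pipeline by output pixel position.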
`

`

shown). Processed vertices output by Vertex Output Buffer 260 are
received by a Primitive Assembly/Setup Unit 205. Primitive
Assembly/Setup Unit 205 calculates parameters, such as deltas and
slopes, to rasterize the processed vertices and outputs parameters and
samples, such as vertices, to a Raster Unit 210. Raster Unit 210
performs scan conversion on samples, such as vertices, and outputs
samples, such as fragments, to a Pixel Input Buffer 215.
Alternatively, Raster Unit 210 resamples processed vertices and
outputs additional vertices to Pixel Input Buffer 215.

Pixel Input Buffer 215 outputs the samples to each Execution Pipeline
240. Samples, such as pixels and fragments, output by Pixel Input
Buffer 215 are each processed by only one of the Execution Pipelines
240. Pixel Input Buffer 215 determines which one of the Execution
Pipelines 240 to output each sample to depending on an output pixel
position, e.g., (x,y), associated with each sample. In this manner,
each sample is output to the Execution Pipeline 240 designated to
process samples associated with the output pixel position. In an
alternate embodiment, each sample output by Pixel Input Buffer 215 is
processed by one of any available Execution Pipelines 240.

Each Execution Pipeline 240 signals to Pixel Input Buffer 215 when a
sample can be accepted or when a sample cannot be accepted as
described further herein. Program instructions configure programmable
computation units (PCUs) within an Execution Pipeline 240 to perform
operations such as tessellation, perspective correction, texture
mapping, shading, blending, and the like. Processed samples are output
from each Execution Pipeline 240 to a Pixel Output Buffer 270. Pixel
Output Buffer 270 optionally stores the processed samples in a
register file, FIFO, cache, or the like (not shown). The processed
samples are output from Pixel Output Buffer 270 to Raster Analyzer
160.

FIG. 3 is a block diagram of an embodiment of Execution Pipeline 240
of FIG. 1 including at least one Multithreaded Processing Unit 300. An
Execution Pipeline 240 can contain a plurality of Multithreaded
Processing Units 300, each Multithreaded Processing Unit 300
containing at least one PCU 375. PCUs 375 are configured using program
instructions read by a Thread Control Unit 320 via Texture Unit 225.
Thread Control Unit 320 gathers source data specified by the program
instructions and dispatches the source data and program instructions
to at least one PCU 375. PCUs 375 perform computations specified by
the program instructions and output data to at least one destination,
e.g., Pixel Output Buffer 270, Vertex Output Buffer 260 and Thread
Control Unit 320.

A single program may be used to process several sets of samples.
Thread Control Unit 320 receives samples or pointers to samples stored
in Pixel Input Buffer 215 and Vertex Input Buffer 220. Thread Control
Unit 320 receives a pointer to a program to process one or more
samples. Thread Control Unit 320 assigns a thread to each sample to be
processed. A thread includes a pointer to a program instruction
(program counter), such as the first instruction within the program,
thread state information, and storage resources for storing
intermediate data generated during processing of the sample. Thread
state information is stored in a TSR (Thread Storage Resource) 325.
TSR 325 may be a register file, FIFO, circular buffer, or the like. An
instruction specifies the location of source data needed to execute
the instruction. Source data, such as intermediate data generated
during processing of the sample, is stored in a Register File 350. In
addition to Register File 350, other source data may be stored in
Pixel Input Buffer 215 or Vertex Input Buffer 220. In an alternate
embodiment source data is stored in Local Memory 140, locations in
Host Memory 112, and the like.

Alternatively, in an embodiment permitting multiple programs for two
or more thread types, Thread Control Unit 320 also receives a program
identifier specifying which one of the two or more programs the
program counter is associated with. Specifically, in an embodiment
permitting simultaneous execution of four programs for a thread type,
two bits of thread state information are used to store the program
identifier for a thread. Multithreaded execution of programs is
possible because each thread may be executed independent of other
threads, regardless of whether the other threads are executing the
same program or a different program. PCUs 375 update each program
counter associated with the threads in Thread Control Unit 320
following the execution of an instruction. For execution of a loop,
call, return, or branch instruction the program counter may be updated
based on the loop, call, return, or branch instruction.

For example, each fragment or group of fragments within a primitive
can be processed independently from the other fragments or from the
other groups of fragments within the primitive. Likewise, each vertex
within a surface can be processed independently from the other
vertices within the surface. For a set of samples being processed
using the same program, the sequence of program instructions
associated with each thread used to process each sample within the set
will be identical, although the program counter for each thread may
vary. However, it is possible that, during execution, the threads
processing some of the samples within a set will diverge following the
execution of a conditional branch instruction. After the execution of
a conditional branch instruction, the sequence of executed
instructions associated with each thread processing samples within the
set may differ and each program counter stored in TSR 325 within
Thread Control Unit 320 for the threads may differ accordingly.

FIG. 4 is an illustration of an alternate embodiment of Execution
Pipeline 240 containing at least one Multithreaded Processing Unit
400. Thread Control Unit 420 includes a TSR 325 to retain thread state
data. In one embodiment TSR 325 stores thread state data for each of
at least two thread types, where the at least two thread types may
include pixel, primitive, and vertex. Thread state data for a thread
may include, among other things, a program counter, a busy flag that
indicates if the thread is either assigned to a sample or available to
be assigned to a sample, a pointer to a source sample to be processed
by the instructions associated with the thread or the output pixel
position and output buffer ID of the source sample to be processed,
and a pointer specifying a destination location in Vertex Output
Buffer 260 or Pixel Output Buffer 270. Additionally, thread state data
for a thread assigned to a sample may include the sample type, e.g.,
pixel, vertex, primitive, or the like. The type of data a thread
processes identifies the thread type, e.g., pixel, vertex, primitive,
or the like. For example, a thread may process a primitive, producing
a vertex. After the vertex is rasterized and fragments are generated,
the thread may process a fragment.

Source samples are stored in either Pixel Input Buffer 215 or Vertex
Input Buffer 220. Thread allocation priority, as described further
herein, is used to assign a thread to a source sample. A thread
allocation priority is specified for each sample type and Thread
Control Unit 420 is configured to assign threads to samples or
allocate locations in a Register File 350 based on the priority
assigned to each sample type. The thread allocation priority may be
fixed,

TCL 1005
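The thread state data enumerated above for TSR 325 (program counter, busy flag, source pointer or output pixel position, destination pointer, and sample type) maps naturally onto a small record. This sketch uses illustrative field and function names, not names from the patent:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ThreadState:
    """One TSR entry; field names are illustrative."""
    program_counter: int = 0
    busy: bool = False                 # assigned to a sample vs. available
    source: Any = None                 # pointer to the source sample, or its
                                       # output pixel position and buffer ID
    destination: Any = None            # location in Vertex or Pixel Output Buffer
    sample_type: Optional[str] = None  # "pixel", "vertex", or "primitive"


def assign(entry, source, destination, sample_type, first_instruction):
    """Mark a free TSR entry as assigned to a sample, pointing its
    program counter at the first instruction of the sample's program."""
    entry.busy = True
    entry.source = source
    entry.destination = destination
    entry.sample_type = sample_type
    entry.program_counter = first_instruction
```

A TSR would then be a fixed-size collection of such entries, with the busy flag distinguishing assignable entries from those already processing samples.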

`

`US 7,038,685 B1
`
`
`
`
`
`
`7
`programmable, or dynamic. In one embodiment the thread
`allocation priority may be fixed, always giving priority to
`allocating vertex threads andpixel threads are only allocated
`if vertex samples are not available for assignment to a
`thread.
`In analternate embodiment, Thread Control Unit 420 is
`configured to assign threads to source samples or allocate
`locations in Register File 350 using thread allocation pri-
`orities based on an amount of sample data in Pixel Input
`Buffer 215 and another amount of sample data in Vertex
`Input Buffer 220. Dynamically modifying a thread alloca-
`tion priority for vertex samples based. on the amount of
`
`
`sample data in Vertex Input Buffer 220 permits Vertex Input
`
`
`Buffer 220 to drain faster andfill Vertex Output Buffer 260
`
`and Pixel Input Buffer 215 faster or drain slower andfill
`Vertex Output Buffer 260 and Pixel Input Buffer 215 slower.
`Dynamically modifying a thread allocationpriority for pixel
`samples based on the amount of sample data in Pixel Input
`Buffer 215 permits Pixel Input Buffer 215 to drain faster and
`fill Pixel Output Buffer 270 faster or drain slower andfill
`Pixel Output Buffer 270 slower.
`In a further alternate
`embodiment, Thread Control Unit 420 is configured to
`assign threads to source samples or allocate locations in
`Register File 350 using thread allocation priorities based on
`graphics primitive size (number of pixels or fragments
`included ina primitive) or a numberof graphics primitives
`in Vertex Output Buffer 260. For example a dynamically
`determined thread allocation priority may be determined
`based on a numberof “pending” pixels, 1.e., the number of
`pixels to be rasterized from the primitives in Primitive
`Assembly/Setup 205 and in Vertex Output Buffer 260.
`Specifically, the thread allocation priority maybe tuned such
`that the number of pending pixels produced by processing
`vertex threadsis adequate lo achieve maximum ulilization of
`the computation resources in Execution Pipelines 240 pro-
`cessing pixel threads.
`Once a thread is assigned to a source sample,the thread
`is allocated storage resources such as locations in a Register
`File 350 to retain intermediate data generated during execu-
`tion of program instructions associated with the thread.
`Alternatively, source data is stored in storage resources
`including Local Memory 140, locations in Host Memory
`112, andthelike.
A Thread Selection Unit 415 reads one or more thread entries, each containing thread state data, from Thread Control Unit 420. Thread Selection Unit 415 may read thread entries to process a group of samples. For example, in one embodiment a group of samples, e.g., a number of vertices defining a primitive, four adjacent fragments arranged in a square, or the like, is processed simultaneously. In the one embodiment, computed values such as derivatives are shared within the group of samples, thereby reducing the number of computations needed to process the group of samples compared with processing the group of samples without sharing the computed values.
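As an illustrative sketch of sharing computed values within a group, consider four adjacent fragments arranged in a square (a 2x2 "quad"). Screen-space derivatives can be formed once from the group's values and shared by all four fragments rather than computed per fragment. The forward-difference formulation and the fragment ordering are assumptions for illustration.

```python
def quad_derivatives(values):
    """Given values computed at four adjacent fragments arranged in a
    square, ordered [top_left, top_right, bottom_left, bottom_right],
    derive screen-space derivatives once and share them across the
    group, saving per-fragment derivative computations."""
    tl, tr, bl, br = values
    ddx = tr - tl  # derivative along x, shared by all four fragments
    ddy = bl - tl  # derivative along y, shared by all four fragments
    return ddx, ddy
```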
In Multithreaded Processing Unit 400, a thread execution priority is specified for each thread type, and Thread Selection Unit 415 is configured to read thread entries based on the thread execution priority assigned to each thread type. A thread execution priority may be fixed, programmable, or dynamic. In one embodiment the thread execution priority may be fixed, always giving priority to execution of vertex threads; pixel threads are only executed if vertex threads are not available for execution.
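The fixed-priority rule in the embodiment above reduces to a simple selection policy, sketched here for illustration only (the function name and list-based thread queues are assumptions).

```python
def select_thread(vertex_threads, pixel_threads):
    """Fixed thread execution priority: always select a vertex thread
    when one is available; execute a pixel thread only when no vertex
    thread is available for execution."""
    if vertex_threads:
        return vertex_threads.pop(0)
    if pixel_threads:
        return pixel_threads.pop(0)
    return None  # nothing ready to execute
```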
In another embodiment, Thread Selection Unit 415 is configured to read thread entries based on the amount of sample data in Pixel Input Buffer 215 and the amount of sample data in Vertex Input Buffer 220. In a further alternate embodiment, Thread Selection Unit 415 is configured to read thread entries using a priority based on graphics primitive size (the number of pixels or fragments included in a primitive) or on a number of graphics primitives in Vertex Output Buffer 260. For example, a dynamically determined thread execution priority is determined based on a number of "pending" pixels, i.e., the number of pixels to be rasterized from the primitives in Primitive Assembly/Setup 205 and in Vertex Output Buffer 260. Specifically, the thread execution priority may be tuned such that the number of pending pixels produced by processing vertex threads is adequate to achieve maximum utilization of the computation resources in Execution Pipelines 240 processing pixel threads.
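For the embodiment that selects threads from buffer occupancy, a minimal illustrative sketch (not part of the disclosure; the tie-break toward pixel threads is an assumption) is:

```python
def choose_thread_type(pixel_input_count, vertex_input_count):
    """Dynamic thread execution priority from buffer occupancy: read
    entries for the thread type whose input buffer currently holds
    more sample data (Pixel Input Buffer 215 vs. Vertex Input Buffer
    220), so the fuller buffer drains first."""
    if pixel_input_count >= vertex_input_count:
        return "pixel"
    return "vertex"
```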
