(12) United States Patent
Lindholm

(10) Patent No.: US 7,038,685 B1
(45) Date of Patent: May 2, 2006

(54) PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED EXECUTION OF PROGRAMS

(75) Inventor: John Erik Lindholm, Saratoga, CA (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 134 days.

(21) Appl. No.: 10/609,967

(22) Filed: Jun. 30, 2003

(51) Int. Cl.
    G06F 15/00 (2006.01)
    G06F 13/00 (2006.01)
    G06F 12/02 (2006.01)
    G06F 9/46 (2006.01)
    G06T 1/00 (2006.01)

(52) U.S. Cl. ......... 345/501; 345/543; 345/536; 718/104

(58) Field of Classification Search ......... 345/501, 502, 530, 531, 522, 418, 419, 426, 427, 543, 505, 536; 718/100, 104, 103
    See application file for complete search history.

(56) References Cited

    U.S. PATENT DOCUMENTS
    5,020,115 A *   5/1991  Black ................ 382/298
    5,969,726 A     10/1999 Rentschler et al.
    6,630,935 B1 *  10/2003 Taylor et al. ....... 345/522
    6,731,289 B1 *  5/2004  Peercy et al. ....... 345/503
    2003/0041173 A1 2/2003  Hoyle

    * cited by examiner

Primary Examiner: Kee M. Tung
(74) Attorney, Agent, or Firm: Patterson & Sheridan, LLP

(57) ABSTRACT

A programmable graphics processor for multithreaded execution of program instructions including a thread control unit. The programmable graphics processor is programmed with program instructions for processing primitive, pixel and vertex data. The thread control unit has a thread storage resource including locations allocated to store thread state data associated with samples of two or more types. Sample types include primitive, pixel and vertex. A number of threads allocated to processing a sample type may be dynamically modified.

45 Claims, 9 Drawing Sheets
`
[Front-page drawing: block diagram of an Execution Pipeline 240 containing a Multithreaded Processing Unit, with an Instruction Cache, Thread Selection Unit, Thread Control Unit with TSR, Sequencer, Instruction Scheduler, Instruction Dispatcher, Register File, Resource Scoreboard, and Execution Unit with PCUs; see FIG. 4.]
`
`LG Ex. 1003, pg 1
`
`LG Ex. 1003
`LG v. ATI
`IPR2017-01225
`
`

`

[Sheet 1 of 9, FIG. 1: Computing System 100, comprising Host Computer 110 (Host Memory 112, Host Processor 114, System Interface 115) and Graphics Subsystem 170 (Graphics Interface 117, Graphics Processor 105, Memory Controller 120, Local Memory 140, Front End 130, IDX 135, Programmable Graphics Processing Pipeline 150, Raster Analyzer 160, Output Controller 180, Output 185).]
`
`

`

[Sheet 2 of 9, FIG. 2: Programmable Graphics Processing Pipeline 150, comprising Vertex Input Buffer 220, Primitive Assembly/Setup 205, Raster Unit 210, Pixel Input Buffer 215, four Execution Pipelines 240, Vertex Output Buffer 260, Pixel Output Buffer 270, and Texture Unit 225 with Texture Cache 230; input from 135, outputs to 160, memory traffic to/from 120.]
`
`

`

[Sheet 3 of 9, FIG. 3: Execution Pipeline 240 containing Multithreaded Processing Unit 300, with Thread Control Unit 320 (TSR 325), Register File 350, and PCUs 375; inputs from 215 and 220, interface to/from 225, outputs to 260 and 270.]
`
`

`

[Sheet 4 of 9, FIG. 4: Execution Pipeline 240 containing Multithreaded Processing Unit 400, with Instruction Cache 410, Thread Selection Unit 415, Thread Control Unit 420 (TSR), Register File 350, Resource Scoreboard 460, IWU 435, Sequencer 425, Instruction Scheduler 430, Instruction Dispatcher 440, and Execution Unit 470 (PCUs); inputs from 215 and 220, interface to/from 225, outputs to 260 and 270.]
`
`

`

[Sheet 5 of 9, FIGS. 5A and 5B: flow diagrams of thread assignment, including an Assign Vertex Thread step 535.]
`
`

`

[Sheet 6 of 9, FIGS. 6A and 6B: exemplary portions of the Thread Storage Resource storing thread state data; entries 605, 610-613, 620, 625, 630, 635, 640, 645.]
`
`

`

[Sheet 7 of 9, FIGS. 7A and 7B: flow diagrams of thread allocation and processing. FIG. 7A: Allocating threads to a first sample type 710, Allocating threads to a second sample type 715, Execute First Program Instructions 720, Execute Second Program Instructions 725. FIG. 7B: Determine allocations 750, Allocating threads to a first sample type 755, Allocating threads to a second sample type 760, Execute First Program Instructions 765, Execute Second Program Instructions 770, Allocating threads to the first sample type 775.]
`
`

`

[Sheet 8 of 9, FIGS. 8A and 8B: flow diagrams of thread assignment. FIG. 8A: Receive Sample 810, Assign Thread 825. FIG. 8B: Execute Thread 880, Deallocate Resources 850.]
`
`

`

[Sheet 9 of 9, FIGS. 9A and 9B: flow diagrams of thread selection. FIG. 9A: Identify assigned thread(s) 910, Select Thread(s), Read Program Counter(s) 920, Update Program Counter(s) 925. FIG. 9B: Determine thread priority 950, Identify assigned thread(s) for priority 955, Read Program Counter(s) 970, Update Program Counter(s) 975.]
`
`

`

PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED EXECUTION OF PROGRAMS

FIELD OF THE INVENTION

One or more aspects of the invention generally relate to multithreaded processing, and more particularly to processing graphics data in a programmable graphics processor.
`
BACKGROUND

Current graphics data processing includes systems and methods developed to perform a specific operation on graphics data, e.g., linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. These graphics processors include several fixed function computation units to perform such specific operations on specific types of graphics data, such as vertex data and pixel data. More recently, the computation units have a degree of programmability to perform user specified operations such that the vertex data is processed by a vertex processing unit using vertex programs and the pixel data is processed by a pixel processing unit using pixel programs. When the amount of vertex data being processed is low relative to the amount of pixel data being processed, the vertex processing unit may be underutilized. Conversely, when the amount of vertex data being processed is high relative to the amount of pixel data being processed, the pixel processing unit may be underutilized.

Accordingly, it would be desirable to provide improved approaches to processing different types of graphics data to better utilize one or more processing units within a graphics processor.
`
SUMMARY
`
A method and apparatus for processing and allocating threads for multithreaded execution of graphics programs is described. A graphics processor for multithreaded execution of program instructions associated with threads to process at least two sample types includes a thread control unit including a thread storage resource configured to store thread state data for each of the threads to process the at least two sample types.

Alternatively, the graphics processor includes a multithreaded processing unit. The multithreaded processing unit includes a thread control unit configured to store pointers to program instructions associated with threads, each thread processing a sample type of vertex, pixel or primitive. The multithreaded processing unit also includes at least one programmable computation unit configured to process data under control of the program instructions.

A method of multithreaded processing of graphics data includes receiving a pointer to a vertex program to process vertex samples. A first thread is assigned to a vertex sample. A pointer to a shader program to process pixel samples is received. A second thread is assigned to a pixel sample. The vertex program is executed to process the vertex sample and produce a processed vertex sample. The shader program is executed to process the pixel sample and produce a processed pixel sample.

Alternatively, the method of multithreaded processing of graphics data includes allocating a first number of processing threads for a first sample type. A second number of processing threads is allocated for a second sample type. First program instructions associated with the first sample type are executed to process the graphics data and produce processed graphics data.
`
A method of assigning threads for processing of graphics data includes receiving a sample to be processed. A sample type of vertex, pixel, or primitive, associated with the sample is determined. A thread is determined to be available for assignment to the sample. The thread is assigned to the sample.

A method of selecting at least one thread for execution includes identifying one or more assigned threads from threads including at least a thread assigned to a pixel sample and a thread assigned to a vertex sample. At least one of the one or more assigned threads is selected for processing.

A method of improving performance of multithreaded processing of graphics data using at least two sample types includes dynamically allocating a first number of threads for processing a first portion of the graphics data to a first sample type and dynamically allocating a second number of threads for processing a second portion of the graphics data to a second sample type.
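The dynamic allocation just summarized can be sketched in software. The sketch below is an illustrative model only, not the disclosed hardware; the class name, pool size, and API are assumptions introduced for illustration.

```python
# Illustrative model of dynamically allocating a fixed pool of
# processing threads between two sample types. All names and the
# pool size are invented for illustration.

class ThreadPool:
    def __init__(self, total_threads):
        self.total = total_threads
        self.allocation = {}  # sample type -> number of threads

    def allocate(self, sample_type, count):
        # Re-allocate threads to one sample type, keeping the total
        # across all sample types within the pool size.
        others = sum(n for t, n in self.allocation.items() if t != sample_type)
        if others + count > self.total:
            raise ValueError("allocation exceeds thread pool")
        self.allocation[sample_type] = count

pool = ThreadPool(total_threads=24)
pool.allocate("vertex", 16)  # geometry-heavy phase
pool.allocate("pixel", 8)
pool.allocate("vertex", 4)   # later, rebalance toward pixel work
pool.allocate("pixel", 20)
```

Because the per-type counts are ordinary state rather than fixed wiring, the split between sample types can be modified at any time, which is the point of the dynamic allocation described above.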
`
BRIEF DESCRIPTION OF THE DRAWINGS

Accompanying drawing(s) show exemplary embodiment(s) in accordance with one or more aspects of the present invention; however, the accompanying drawing(s) should not be taken to limit the present invention to the embodiment(s) shown, but are for explanation and understanding only.

FIG. 1 illustrates one embodiment of a computing system according to the invention including a host computer and a graphics subsystem.

FIG. 2 is a block diagram of an embodiment of the Programmable Graphics Processing Pipeline of FIG. 1.

FIG. 3 is a block diagram of an embodiment of the Execution Pipeline of FIG. 1.

FIG. 4 is a block diagram of an alternate embodiment of the Execution Pipeline of FIG. 1.

FIGS. 5A and 5B are flow diagrams of exemplary embodiments of thread assignment in accordance with one or more aspects of the present invention.

FIGS. 6A and 6B are exemplary embodiments of a portion of the Thread Storage Resource storing thread state data within an embodiment of the Thread Control Unit of FIG. 3 or FIG. 4.

FIGS. 7A and 7B are flow diagrams of exemplary embodiments of thread allocation and processing in accordance with one or more aspects of the present invention.

FIGS. 8A and 8B are flow diagrams of exemplary embodiments of thread assignment in accordance with one or more aspects of the present invention.

FIGS. 9A and 9B are flow diagrams of exemplary embodiments of thread selection in accordance with one or more aspects of the present invention.
`
DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.

FIG. 1 is an illustration of a Computing System generally designated 100 and including a Host Computer 110 and a Graphics Subsystem 170. Computing System 100 may be a desktop computer, server, laptop computer, palm-sized computer, tablet computer, game console, cellular telephone, computer based simulator, or the like. Host Computer 110
`
`

`

includes Host Processor 114 that may include a system memory controller to interface directly to Host Memory 112 or may communicate with Host Memory 112 through a System Interface 115. System Interface 115 may be an I/O (input/output) interface or a bridge device including the system memory controller to interface directly to Host Memory 112. Examples of System Interface 115 known in the art include Intel® Northbridge and Intel® Southbridge.

Host Computer 110 communicates with Graphics Subsystem 170 via System Interface 115 and a Graphics Interface 117 within a Graphics Processor 105. Data received at Graphics Interface 117 can be passed to a Front End 130 or written to a Local Memory 140 through Memory Controller 120. Graphics Processor 105 uses graphics memory to store graphics data and program instructions, where graphics data is any data that is input to or output from components within the graphics processor. Graphics memory can include portions of Host Memory 112, Local Memory 140, register files coupled to the components within Graphics Processor 105, and the like.

Graphics Processor 105 includes, among other components, Front End 130 that receives commands from Host Computer 110 via Graphics Interface 117. Front End 130 interprets and formats the commands and outputs the formatted commands and data to an IDX (Index Processor) 135. Some of the formatted commands are used by Programmable Graphics Processing Pipeline 150 to initiate processing of data by providing the location of program instructions or graphics data stored in memory. IDX 135, Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 each include an interface to Memory Controller 120 through which program instructions and data can be read from memory, e.g., any combination of Local Memory 140 and Host Memory 112. When a portion of Host Memory 112 is used to store program instructions and data, the portion of Host Memory 112 can be uncached so as to increase performance of access by Graphics Processor 105.

IDX 135 optionally reads processed data, e.g., data written by Raster Analyzer 160, from memory and outputs the data, processed data and formatted commands to Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 each contain one or more programmable processing units to perform a variety of specialized functions. Some of these functions are table lookup, scalar and vector addition, multiplication, division, coordinate-system mapping, calculation of vector normals, tessellation, calculation of derivatives, interpolation, and the like. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 are each optionally configured such that data processing operations are performed in multiple passes through those units or in multiple passes within Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 also each include a write interface to Memory Controller 120 through which data can be written to memory.

In a typical implementation Programmable Graphics Processing Pipeline 150 performs geometry computations, rasterization, and pixel computations. Therefore Programmable Graphics Processing Pipeline 150 is programmed to operate on surface, primitive, vertex, fragment, pixel, sample or any other data. For simplicity, the remainder of this description will use the term "samples" to refer to graphics data such as surfaces, primitives, vertices, pixels, fragments, or the like.

Samples output by Programmable Graphics Processing Pipeline 150 are passed to a Raster Analyzer 160, which optionally performs near and far plane clipping and raster operations, such as stencil, z test, and the like, and saves the results or the samples output by Programmable Graphics Processing Pipeline 150 in Local Memory 140. When the data received by Graphics Subsystem 170 has been completely processed by Graphics Processor 105, an Output 185 of Graphics Subsystem 170 is provided using an Output Controller 180. Output Controller 180 is optionally configured to deliver data to a display device, network, electronic control system, other Computing System 100, other Graphics Subsystem 170, or the like. Alternatively, data is output to a film recording device or written to a peripheral device, e.g., disk drive, tape, compact disk, or the like.

FIG. 2 is an illustration of Programmable Graphics Processing Pipeline 150 of FIG. 1. At least one set of samples is output by IDX 135 and received by Programmable Graphics Processing Pipeline 150 and the at least one set of samples is processed according to at least one program, the at least one program including graphics program instructions. A program can process one or more sets of samples. Conversely, a set of samples can be processed by a sequence of one or more programs.

Samples, such as surfaces, primitives, or the like, are received from IDX 135 by Programmable Graphics Processing Pipeline 150 and stored in a Vertex Input Buffer 220 including a register file, FIFO (first in first out), cache, or the like (not shown). The samples are broadcast to Execution Pipelines 240, four of which are shown in the figure. Each Execution Pipeline 240 includes at least one multithreaded processing unit, to be described further herein. The samples output by Vertex Input Buffer 220 can be processed by any one of the Execution Pipelines 240. A sample is accepted by an Execution Pipeline 240 when a processing thread within the Execution Pipeline 240 is available as described further herein. Each Execution Pipeline 240 signals to Vertex Input Buffer 220 when a sample can be accepted or when a sample cannot be accepted. In one embodiment Programmable Graphics Processing Pipeline 150 includes a single Execution Pipeline 240 containing one multithreaded processing unit. In an alternative embodiment, Programmable Graphics Processing Pipeline 150 includes a plurality of Execution Pipelines 240.

Execution Pipelines 240 may receive first samples, such as higher-order surface data, and tessellate the first samples to generate second samples, such as vertices. Execution Pipelines 240 may be configured to transform the second samples from an object-based coordinate representation (object space) to an alternatively based coordinate system such as world space or normalized device coordinates (NDC) space. Each Execution Pipeline 240 may communicate with Texture Unit 225 using a read interface (not shown in FIG. 2) to read program instructions and graphics data such as texture maps from Local Memory 140 or Host Memory 112 via Memory Controller 120 and a Texture Cache 230. Texture Cache 230 is used to improve memory read performance by reducing read latency. In an alternate embodiment Texture Cache 230 is omitted. In another alternate embodiment, a Texture Unit 225 is included in each Execution Pipeline 240. In another alternate embodiment program instructions are stored within Programmable Graphics Processing Pipeline 150. In another alternate embodiment each Execution Pipeline 240 has a dedicated instruction read interface to read program instructions from Local Memory 140 or Host Memory 112 via Memory Controller 120.

Execution Pipelines 240 output processed samples, such as vertices, that are stored in a Vertex Output Buffer 260 including a register file, FIFO, cache, or the like (not
shown). Processed vertices output by Vertex Output Buffer 260 are received by a Primitive Assembly/Setup Unit 205. Primitive Assembly/Setup Unit 205 calculates parameters, such as deltas and slopes, to rasterize the processed vertices and outputs parameters and samples, such as vertices, to a Raster Unit 210. Raster Unit 210 performs scan conversion on samples, such as vertices, and outputs samples, such as fragments, to a Pixel Input Buffer 215. Alternatively, Raster Unit 210 resamples processed vertices and outputs additional vertices to Pixel Input Buffer 215.

Pixel Input Buffer 215 outputs the samples to each Execution Pipeline 240. Samples, such as pixels and fragments, output by Pixel Input Buffer 215 are each processed by only one of the Execution Pipelines 240. Pixel Input Buffer 215 determines which one of the Execution Pipelines 240 to output each sample to depending on an output pixel position, e.g., (x,y), associated with each sample. In this manner, each sample is output to the Execution Pipeline 240 designated to process samples associated with the output pixel position. In an alternate embodiment, each sample output by Pixel Input Buffer 215 is processed by one of any available Execution Pipelines 240.
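One possible routing rule consistent with the description above, in which each output pixel position maps to a designated Execution Pipeline, is to interleave pixel positions across the pipelines. The mapping below is an assumption introduced for illustration; the patent does not specify the function.

```python
# Hypothetical position-to-pipeline mapping for four Execution
# Pipelines: each (x, y) output pixel position always routes to the
# same pipeline, and the four pixels of a 2x2 quad land on four
# different pipelines.

NUM_PIPELINES = 4

def pipeline_for_position(x, y):
    return (x % 2) + 2 * (y % 2)  # value in 0..NUM_PIPELINES-1
```

Because the mapping depends only on (x, y), every sample for a given output pixel position is processed by the same designated pipeline, as the text requires.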
Each Execution Pipeline 240 signals to Pixel Input Buffer 215 when a sample can be accepted or when a sample cannot be accepted as described further herein. Program instructions configure programmable computation units (PCUs) within an Execution Pipeline 240 to perform operations such as tessellation, perspective correction, texture mapping, shading, blending, and the like. Processed samples are output from each Execution Pipeline 240 to a Pixel Output Buffer 270. Pixel Output Buffer 270 optionally stores the processed samples in a register file, FIFO, cache, or the like (not shown). The processed samples are output from Pixel Output Buffer 270 to Raster Analyzer 160.

FIG. 3 is a block diagram of an embodiment of Execution Pipeline 240 of FIG. 1 including at least one Multithreaded Processing Unit 300. An Execution Pipeline 240 can contain a plurality of Multithreaded Processing Units 300, each Multithreaded Processing Unit 300 containing at least one PCU 375. PCUs 375 are configured using program instructions read by a Thread Control Unit 320 via Texture Unit 225. Thread Control Unit 320 gathers source data specified by the program instructions and dispatches the source data and program instructions to at least one PCU 375. PCUs 375 perform computations specified by the program instructions and output data to at least one destination, e.g., Pixel Output Buffer 270, Vertex Output Buffer 260 and Thread Control Unit 320.

A single program may be used to process several sets of samples. Thread Control Unit 320 receives samples or pointers to samples stored in Pixel Input Buffer 215 and Vertex Input Buffer 220. Thread Control Unit 320 receives a pointer to a program to process one or more samples. Thread Control Unit 320 assigns a thread to each sample to be processed. A thread includes a pointer to a program instruction (program counter), such as the first instruction within the program, thread state information, and storage resources for storing intermediate data generated during processing of the sample. Thread state information is stored in a TSR (Thread Storage Resource) 325. TSR 325 may be a register file, FIFO, circular buffer, or the like. An instruction specifies the location of source data needed to execute the instruction. Source data, such as intermediate data generated during processing of the sample is stored in a Register File 350. In addition to Register File 350, other source data may be stored in Pixel Input Buffer 215 or Vertex Input
Buffer 220. In an alternate embodiment source data is stored in Local Memory 140, locations in Host Memory 112, and the like.
Alternatively, in an embodiment permitting multiple programs for two or more thread types, Thread Control Unit 320 also receives a program identifier specifying which one of the two or more programs the program counter is associated with. Specifically, in an embodiment permitting simultaneous execution of four programs for a thread type, two bits of thread state information are used to store the program identifier for a thread. Multithreaded execution of programs is possible because each thread may be executed independent of other threads, regardless of whether the other threads are executing the same program or a different program. PCUs 375 update each program counter associated with the threads in Thread Control Unit 320 following the execution of an instruction. For execution of a loop, call, return, or branch instruction the program counter may be updated based on the loop, call, return, or branch instruction.
For example, each fragment or group of fragments within a primitive can be processed independently from the other fragments or from the other groups of fragments within the primitive. Likewise, each vertex within a surface can be processed independently from the other vertices within the surface. For a set of samples being processed using the same program, the sequence of program instructions associated with each thread used to process each sample within the set will be identical, although the program counter for each thread may vary. However, it is possible that, during execution, the threads processing some of the samples within a set will diverge following the execution of a conditional branch instruction. After the execution of a conditional branch instruction, the sequence of executed instructions associated with each thread processing samples within the set may differ and each program counter stored in TSR 325 within Thread Control Unit 320 for the threads may differ accordingly.
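The divergence just described can be illustrated with a toy interpreter in which each thread keeps its own program counter. The instruction set below is invented purely for illustration.

```python
# Two threads execute the same tiny program on different samples.
# After a conditional branch, their program counters differ.

program = [
    ("branch_if_negative", 3),  # 0: if sample < 0, jump to index 3
    ("mul2", None),             # 1: sample *= 2
    ("halt", None),             # 2
    ("neg", None),              # 3: sample = -sample
    ("halt", None),             # 4
]

def step(pc, sample):
    op, arg = program[pc]
    if op == "branch_if_negative":
        return (arg if sample < 0 else pc + 1), sample
    if op == "mul2":
        return pc + 1, sample * 2
    if op == "neg":
        return pc + 1, -sample
    return pc, sample  # halt: program counter no longer advances

threads = [{"pc": 0, "sample": 5}, {"pc": 0, "sample": -7}]
for t in threads:
    t["pc"], t["sample"] = step(t["pc"], t["sample"])
```

After one step the first thread's program counter is 1 while the second thread's is 3: the same program, but per-thread program counters that now differ, which is why a per-thread program counter must be stored.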
FIG. 4 is an illustration of an alternate embodiment of Execution Pipeline 240 containing at least one Multithreaded Processing Unit 400. Thread Control Unit 420 includes a TSR 325 to retain thread state data. In one embodiment TSR 325 stores thread state data for each of at least two thread types, where the at least two thread types may include pixel, primitive, and vertex. Thread state data for a thread may include, among other things, a program counter, a busy flag that indicates if the thread is either assigned to a sample or available to be assigned to a sample, a pointer to a source sample to be processed by the instructions associated with the thread or the output pixel position and output buffer ID of the source sample to be processed, and a pointer specifying a destination location in Vertex Output Buffer 260 or Pixel Output Buffer 270. Additionally, thread state data for a thread assigned to a sample may include the sample type, e.g., pixel, vertex, primitive, or the like. The type of data a thread processes identifies the thread type, e.g., pixel, vertex, primitive, or the like. For example, a thread may process a primitive, producing a vertex. After the vertex is rasterized and fragments are generated, the thread may process a fragment.
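One plausible software model of a TSR entry holding the thread state data enumerated above is sketched below. The field names and the assignment routine are assumptions for illustration; the patent does not define a storage format.

```python
# Hypothetical model of thread state data in a Thread Storage
# Resource (TSR): program counter, busy flag, source and destination
# pointers, and sample type.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ThreadState:
    program_counter: int = 0
    busy: bool = False                  # assigned vs. available
    source: Optional[int] = None        # source sample pointer/position
    destination: Optional[int] = None   # output buffer location
    sample_type: Optional[str] = None   # "pixel", "vertex", or "primitive"

tsr = [ThreadState() for _ in range(8)]  # storage for eight threads

def assign_thread(tsr, sample_type, source):
    # Assign the first available (not busy) thread to a sample.
    for thread in tsr:
        if not thread.busy:
            thread.busy = True
            thread.sample_type = sample_type
            thread.source = source
            thread.program_counter = 0
            return thread
    return None  # no thread available; the sample must wait

t = assign_thread(tsr, "vertex", source=42)
```

The busy flag plays the role described above: a thread is either assigned to a sample or available for assignment, and recording a sample type per entry lets one TSR hold state for threads of two or more types.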
Source samples are stored in either Pixel Input Buffer 215 or Vertex Input Buffer 220. Thread allocation priority, as described further herein, is used to assign a thread to a source sample. A thread allocation priority is specified for each sample type and Thread Control Unit 420 is configured to assign threads to samples or allocate locations in a Register File 350 based on the priority assigned to each sample type. The thread allocation priority may be fixed,
programmable, or dynamic. In one embodiment the thread allocation priority may be fixed, always giving priority to allocating vertex threads, and pixel threads are only allocated if vertex samples are not available for assignment to a thread.

In an alternate embodiment, Thread Control Unit 420 is configured to assign threads to source samples or allocate locations in Register File 350 using thread allocation priorities based on an amount of sample data in Pixel Input Buffer 215 and another amount of sample data in Vertex Input Buffer 220. Dynamically modifying a thread allocation priority for vertex samples based on the amount of sample data in Vertex Input Buffer 220 permits Vertex Input Buffer 220 to drain faster and fill Vertex Output Buffer 260 and Pixel Input Buffer 215 faster, or drain slower and fill Vertex Output Buffer 260 and Pixel Input Buffer 215 slower. Dynamically modifying a thread allocation priority for pixel samples based on the amount of sample data in Pixel Input Buffer 215 permits Pixel Input Buffer 215 to drain faster and fill Pixel Output Buffer 270 faster, or drain slower and fill Pixel Output Buffer 270 slower. In a further alternate embodiment, Thread Control Unit 420 is configured to assign threads to source samples or allocate locations in Register File 350 using thread allocation priorities based on graphics primitive size (number of pixels or fragments included in a primitive) or a number of graphics primitives in Vertex Output Buffer 260. For example a dynamically determined thread allocation priority may be determined based on a number of "pending" pixels, i.e., the number of pixels to be rasterized from the primitives in Primitive Assembly/Setup 205 and in Vertex Output Buffer 260. Specifically, the thread allocation priority may be tuned such that the number of pending pixels produced by processing vertex threads is adequate to achieve maximum utilization of the computation resources in Execution Pipelines 240 processing pixel threads.

Once a thread is assigned to a source sample, the thread is allocated storage resources such as locations in a Register File 350 to retain intermediate data generated during execution of program instructions associated with the thread. Alternatively, source data is stored in storage resources including Local Memory 140, locations in Host Memory 112, and the like.

A Thread Selection Unit 415 reads one or more thread entries using thread execution priorities based on an amount of sample data in Pixel Input Buffer 215 and another amount of sample data in Vertex Input Buffer 220. In a further alternate embodiment, Thread Selection Unit 415 is configured to read thread entries using a priority based on graphics primitive size (number of pixels or fragments included in a primitive) or a number of graphics primitives in Vertex Output Buffer 260. For example a dynamically determined thread execution priority is determined based on a number of "pending" pixels, i.e., the number of pixels to be rasterized from the primitives in Primitive Assembly/Setup 205 and in Vertex Output Buffer 260. Specifically, the thread execution priority may be tuned such that the number of pending pixels produced by processing vertex threads is adequate to achieve maximum utilization of the computation resources in Execution Pipelines 240 processing pixel threads.

Thread Selection Unit 415 reads one or more thread entries based on thread execution priorities and outputs selected thread entries to Instruction Cache 410. Instruction Cache 410 determines if the program instructions corresponding to the program counters and sample type included in the thread state data for each thread entry are available in Instruction Cache 410. When a requested program instruction is not available in Instruction Cache 410 it is read (possibly along with other program instructions stored in adjacent memory locations) from graphics memory. A base address, corresponding to the graphics memory location where a first instruction in a program is stored, may be used in conjunction with a program counter to determine the location in graphics memory where a program instruction corresponding to the program counter is stored. In an alternate embodiment, Instruction Cache 410 can be shared between Multithreaded Processing Units 400 within Execution Pipeline 240.

The program instructions corresponding to the program counters from the one or more thread entries are output by Instruction Cache 410 to a scheduler, Instruction Scheduler 430. The number of instructions output each clock cycle from Instruction Cache 410 to Instruction Scheduler 430 can vary depending on whether or not the instructions are available in the cache. The number of instructions that can be output each clock cycle from Instruction Cache 410 to Instruction Scheduler 430 may also vary between different embodiments. In one embodiment, Instruction Cache 410 outputs one instruction per clock cycle to Instruction Scheduler 430.
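A dynamically determined priority of the kind described above can be sketched as a simple function of buffer occupancy. The rule below (prefer the sample type whose input buffer is fuller) is one assumed heuristic for illustration, not the disclosed circuit.

```python
# Hypothetical dynamic priority: order sample types by input buffer
# occupancy so the fuller buffer drains faster. The rule and tie
# handling are invented for illustration.

def allocation_priority(pixel_buffer_fill, vertex_buffer_fill):
    if vertex_buffer_fill > pixel_buffer_fill:
        return ["vertex", "pixel"]  # drain Vertex Input Buffer faster
    return ["pixel", "vertex"]      # drain Pixel Input Buffer faster
```

The same shape of rule could instead use the number of pending pixels, tuning vertex-thread priority so that enough pixel work is produced to keep the pixel-processing computation resources fully utilized.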
