TO ALL TO WHOM THESE PRESENTS SHALL COME:

UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office

May 9, 2023

By Authority of the
Under Secretary of Commerce for Intellectual Property
and Director of the United States Patent and Trademark Office

THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM
THE RECORDS OF THIS OFFICE OF:

PATENT NUMBER: 7,015,913
ISSUE DATE: March 21, 2006

Certifying Officer

US007015913B1

(12) United States Patent
     Lindholm et al.

(10) Patent No.:     US 7,015,913 B1
(45) Date of Patent: Mar. 21, 2006

(54) METHOD AND APPARATUS FOR MULTITHREADED PROCESSING OF DATA IN A
     PROGRAMMABLE GRAPHICS PROCESSOR

(75) Inventors: John Erik Lindholm, Saratoga, CA (US); Rui M. Bastos, Santa Clara,
     CA (US); Harold Robert Feldman Zatz, Palo Alto, CA (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or
     adjusted under 35 U.S.C. 154(b) by 164 days.

(21) Appl. No.: 10/608,346

(22) Filed: Jun. 27, 2003

(51) Int. Cl.
     G06F 15/00  (2006.01)
     G06F 12/02  (2006.01)
     G06F 9/46   (2006.01)
     G06T 1/00   (2006.01)

(52) U.S. Cl. ........ 345/501; 345/543; 718/102; 718/104

(58) Field of Classification Search ........ 345/501, 345/504, 520, 522, 420, 423,
     503, 519, 506, 345/543, 557; 709/208, 231; 712/23, 28, 712/31-32; 718/102, 104
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,818,469 A *     10/1998  Lawless et al. ........ 345/522
     5,946,487 A *      8/1999  Dangelo ............... 717/148
     6,088,044 A *      7/2000  Kwok et al. ........... 345/505
     6,753,878 B1*      6/2004  Heirich et al. ........ 345/629
     6,765,571 B1*      7/2004  Sowizral et al. ....... 345/420
     2003/0140179 A1*   7/2003  Wilt et al. ........... 709/321
     2004/0160446 A1*   8/2004  Gosalia et al. ........ 345/503

     * cited by examiner

     Primary Examiner—Kee M. Tung
     (74) Attorney, Agent, or Firm—Patterson & Sheridan, LLP

(57) ABSTRACT

A graphics processor and method for executing a graphics program as a plurality
of threads where each sample to be processed by the program is assigned to a
thread. Although threads share processing resources within the programmable
graphics processor, the execution of each thread can proceed independent of any
other threads. For example, instructions in a second thread are scheduled for
execution while execution of instructions in a first thread are stalled waiting
for source data. Consequently, a first received sample (assigned to the first
thread) may be processed after a second received sample (assigned to the second
thread). A benefit of independently executing each thread is improved
performance because a stalled thread does not prevent the execution of other
threads.

33 Claims, 9 Drawing Sheets

[Representative drawing on the front page: Execution Pipeline 240 containing Multithreaded Processing Unit 400 (Instruction Cache 410, Thread Control Buffer 420, Instruction Scheduler 430, Instruction Dispatcher 440, Register File 450, Resource Scoreboard 460, Execution Unit); see FIG. 4.]

[Sheet 1 of 9, FIG. 1: block diagram of Computing System 100 with Host Computer 110 (Host Processor 114, Host Memory 112, System Interface 115) and Graphics Subsystem 170 (Graphics Processor 105: Graphics Interface 117, Front End 130, IDX 135, Memory Controller 120, Local Memory 140, Programmable Graphics Processing Pipeline 150, Raster Analyzer 160, Output Controller 180, Output 185).]

[Sheet 2 of 9, FIG. 2: block diagram of Programmable Graphics Processing Pipeline 150 with Primitive Assembly/Setup 205, Raster Unit 210, Pixel Input Buffer 215, Vertex Input Buffer 220, Texture Unit 225 (Texture Cache 230), four Execution Pipelines 240, Vertex Output Buffer 260, and Pixel Output Buffer 270.]

[Sheet 3 of 9, FIG. 3: conceptual diagram of the relationship between a program and threads.]

[Sheet 4 of 9, FIG. 4: block diagram of Execution Pipeline 240 containing Multithreaded Processing Unit 400, including Instruction Cache 410, Thread Control Buffer 420, Instruction Scheduler 430, Instruction Dispatcher 440, Register File 450, Resource Scoreboard 460, and an Execution Unit.]

[Sheet 5 of 9, FIG. 5A: flowchart for dispatching instructions to process a sample and another sample based on source-data availability (steps 503-519).]

[Sheet 6 of 9, FIG. 5B: flowchart for assigning a thread to a received sample and processing it (Receive Sample 520, Identify Thread Type Needed 521, Assign Thread 530, Allocate Resources 533, Fetch Instruction 535, Update PC and Resource Scoreboard 540, Dispatch Instruction 543, Process Sample 545, Deallocate Resources 550).]

[Sheet 7 of 9, FIG. 6: flowchart for scheduling an instruction (instruction window and synchronization checks 605-640, Read scoreboard 645, Check resources 650, Schedule instruction 655, Update scoreboard 660, Update PC 670, Output instruction 680).]

[Sheet 8 of 9, FIGS. 7A and 7B: flowcharts for issuing a function call to enable PIOR (703, 706) and for rendering intersecting objects with PIOR enabled, then non-intersecting objects with PIOR disabled (710-740).]

[Sheet 9 of 9, FIG. 7C: flowchart for rendering opaque objects with PIOR enabled, then non-opaque objects with PIOR disabled.]

METHOD AND APPARATUS FOR MULTITHREADED PROCESSING OF DATA IN A PROGRAMMABLE GRAPHICS PROCESSOR

FIELD OF THE INVENTION

One or more aspects of the invention generally relate to multithreaded processing, and more particularly to processing graphics data in a programmable graphics processor.

BACKGROUND

Current graphics data processing is exemplified by systems and methods developed to perform a specific operation on several graphics data elements, e.g., linear interpolation, tessellation, texture mapping, depth testing. Traditionally graphics processing systems were implemented as fixed function computation units and more recently the computation units are programmable to perform a limited set of operations. In either system, the graphics data elements are processed in the order in which they are received by the graphics processing system. Within the graphics processing system, when a resource, e.g., computation unit or data, required to process a graphics data element is unavailable, the processing of the element stalls, i.e., does not proceed, until the resource becomes available. Because the system is pipelined, the stall propagates back through the pipeline, stalling the processing of later received elements that may not require the resource and reducing the throughput of the system.

For the foregoing reasons, there is a need for improved approaches to processing graphics data elements.

SUMMARY

The present invention is directed to a system and method that satisfies the need for a programmable graphics processor that supports processing of graphics data elements in an order independent from the order in which the graphics data elements are received by the programmable graphics processing pipeline within the programmable graphics processor.

Various embodiments of the invention include a computing system comprising a host processor, a host memory, a system interface configured to interface with the host processor, and the programmable graphics processor for multithreaded execution of program instructions. The graphics processor includes at least one multithreaded processing unit configured to receive samples in a first order to be processed by program instructions associated with at least one thread. Each multithreaded processing unit includes a scheduler configured to receive the program instructions, determine availability of source data, and schedule the program instructions for execution in a second order independent of the first order. Each multithreaded processing unit further includes a resource tracking unit configured to track the availability of the source data, and a dispatcher configured to output the program instructions in the second order to be executed by the at least one multithreaded processing unit.
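As a minimal sketch of this arrangement (class and member names are illustrative assumptions chosen for clarity, not taken from the patent), a scheduler can buffer instructions in the order their samples arrive and issue the oldest one whose source data a scoreboard reports as available:

    // Minimal sketch of a scheduler, resource scoreboard, and dispatch step
    // that issue instructions in a second order, independent of the order in
    // which their samples were received, based on source-data availability.
    #include <cstdint>
    #include <deque>

    struct Instruction {
        int threadId;        // thread this instruction belongs to
        uint32_t srcMask;    // source registers (one bit per register) it reads
    };

    class ResourceScoreboard {                 // tracks source-data availability
    public:
        bool ready(uint32_t srcMask) const { return (pending_ & srcMask) == 0; }
        void markPending(uint32_t mask)   { pending_ |= mask; }
        void markAvailable(uint32_t mask) { pending_ &= ~mask; }
    private:
        uint32_t pending_ = 0;                 // set bit = data not yet written
    };

    class Scheduler {
    public:
        void receive(const Instruction& i) { window_.push_back(i); }  // first order

        // Dispatch the oldest instruction whose source data is available,
        // skipping (but not discarding) instructions of stalled threads.
        bool dispatchOne(const ResourceScoreboard& sb, Instruction* out) {
            for (auto it = window_.begin(); it != window_.end(); ++it) {
                if (sb.ready(it->srcMask)) {
                    *out = *it;
                    window_.erase(it);
                    return true;               // may be out of arrival order
                }
            }
            return false;                      // every buffered thread is stalled
        }
    private:
        std::deque<Instruction> window_;
    };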
Further embodiments of the invention include an application programming interface for a programmable graphics processor comprising a function call to configure a multithreaded processing unit within the programmable graphics processor to enable processing of samples independent of an order in which the samples are received.
Yet further embodiments of the invention include an application programming interface for a programmable graphics processor comprising a function call to configure a multithreaded processing unit within the programmable graphics processor to disable processing of samples independent of an order in which the samples are received.
Various embodiments of a method of the invention include processing a first program instruction associated with a first thread and a second program instruction associated with a second thread. A first sample to be processed by a program instruction associated with a first thread is received before a second sample to be processed by a program instruction associated with a second thread is received. First source data required to process the program instruction associated with the first thread are determined to be not available. Second source data required to process the program instruction associated with the second thread are determined to be available. The program instruction associated with the second thread to process the second sample in the execution unit is dispatched prior to dispatching the program instruction associated with the first thread to process the first sample in the execution unit.
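Continuing the hypothetical sketch above (same assumed types, same translation unit), the ordering described in this embodiment falls out of the availability check:

    // Hypothetical trace: the first thread's source data is pending, so the
    // second thread's instruction is dispatched first; once the data arrives,
    // the first thread's instruction follows.
    int main() {
        ResourceScoreboard sb;
        Scheduler sched;

        sb.markPending(0x1);                                // first sample's data not ready
        sched.receive({/*threadId=*/0, /*srcMask=*/0x1});   // first received sample
        sched.receive({/*threadId=*/1, /*srcMask=*/0x2});   // second received sample

        Instruction next{};
        sched.dispatchOne(sb, &next);                       // next.threadId == 1
        sb.markAvailable(0x1);                              // first sample's data arrives
        sched.dispatchOne(sb, &next);                       // next.threadId == 0
        return 0;
    }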
Further embodiments of a method of the invention include using a function call to configure the graphics processor. Support for processing samples of at least one sample type independent of an order in which the samples are received by a multithreaded processing unit within the graphics processor is detected. The function call to configure the multithreaded processing unit within the graphics processor to enable processing of the samples independent of an order in which the samples are received is issued for the at least one sample type.
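A hypothetical sketch of how an application might use such a function call follows; the API names below are invented stand-ins for the detection and configuration calls, not an actual driver interface:

    // Hypothetical API sketch: stub stand-ins for a capability query and a
    // configuration call enabling order-independent processing.
    enum class SampleKind { Pixel, Vertex, Primitive };

    bool gpuSupportsOrderIndependentProcessing(SampleKind) { return true; }
    void gpuSetOrderIndependentProcessing(SampleKind, bool /*enable*/) {}

    void configurePixelProcessing() {
        // Detect support, then issue the function call enabling processing of
        // pixel-type samples independent of the order in which they are received.
        if (gpuSupportsOrderIndependentProcessing(SampleKind::Pixel)) {
            gpuSetOrderIndependentProcessing(SampleKind::Pixel, /*enable=*/true);
        }
    }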
Yet further embodiments of a method of the invention include rendering a scene using the graphics processor. The multithreaded processing unit within the graphics processor is configured to enable processing of samples independent of an order in which the samples are received. The multithreaded processing unit within the graphics processor processes the samples independent of the order in which the samples are received to render at least a portion of the scene.

BRIEF DESCRIPTION OF THE VARIOUS VIEWS OF THE DRAWINGS

Accompanying drawing(s) show exemplary embodiment(s) in accordance with one or more aspects of the present invention; however, the accompanying drawing(s) should not be taken to limit the present invention to the embodiment(s) shown, but are for explanation and understanding only.
FIG. 1 illustrates one embodiment of a computing system according to the invention including a host computer and a graphics subsystem;

FIG. 2 is a block diagram of an embodiment of the Programmable Graphics Processing Pipeline of FIG. 1;

FIG. 3 is a conceptual diagram of the relationship between a program and threads;

FIG. 4 is a block diagram of an embodiment of the Execution Pipeline of FIG. 2;

FIGS. 5A and 5B illustrate embodiments of methods utilizing the Execution Pipeline illustrated in FIG. 4;

FIG. 6 illustrates an embodiment of a method utilizing the Execution Pipeline illustrated in FIG. 4;

FIGS. 7A, 7B, and 7C illustrate embodiments of methods utilizing the Computing System illustrated in FIG. 1.
DISCLOSURE OF THE INVENTION

The current invention involves new systems and methods for processing graphics data elements in an order independent from the order in which the graphics data elements are received by a multithreaded processing unit within a graphics processor.

FIG. 1 is an illustration of a Computing System generally designated 100 and including a Host Computer 110 and a Graphics Subsystem 170. Computing System 100 may be a desktop computer, server, laptop computer, palm-sized computer, tablet computer, game console, cellular telephone, computer based simulator, or the like. Host Computer 110 includes Host Processor 114 which may include a system memory controller to interface directly to Host Memory 112 or may communicate with Host Memory 112 through a System Interface 115. System Interface 115 may be an I/O (input/output) interface or a bridge device including the system memory controller to interface directly to Host Memory 112. Examples of System Interface 115 known in the art include Intel® Northbridge and Intel® Southbridge.

Host Computer 110 communicates with Graphics Subsystem 170 via System Interface 115 and a Graphics Interface 117 within a Graphics Processor 105. Data received at Graphics Interface 117 can be passed to a Front End 130 or written to a Local Memory 140 through Memory Controller 120. Graphics Processor 105 uses graphics memory to store graphics data and program instructions, where graphics data is any data that is input to or output from components within the graphics processor. Graphics memory can include portions of Host Memory 112, Local Memory 140, register files coupled to the components within Graphics Processor 105, and the like.

Graphics Processor 105 includes, among other components, Front End 130 that receives commands from Host Computer 110 via Graphics Interface 117. Front End 130 interprets and formats the commands and outputs the formatted commands and data to an IDX (Index Processor) 135. Some of the formatted commands are used by Programmable Graphics Processing Pipeline 150 to initiate processing of data by providing the location of program instructions or graphics data stored in memory. IDX 135, Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 each include an interface to Memory Controller 120 through which program instructions and data can be read from memory, e.g., any combination of Local Memory 140 and Host Memory 112. When a portion of Host Memory 112 is used to store program instructions and data, the portion of Host Memory 112 can be uncached so as to increase performance of access by Graphics Processor 105.

IDX 135 optionally reads processed data, e.g., data written by Raster Analyzer 160, from memory and outputs the data, processed data and formatted commands to Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 each contain one or more programmable processing units to perform a variety of specialized functions. Some of these functions are table lookup, scalar and vector addition, multiplication, division, coordinate-system mapping, calculation of vector normals, tessellation, calculation of derivatives, interpolation, and the like. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 are each optionally configured such that data processing operations are performed in multiple passes through those units or in multiple passes within Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 also each include a write interface to Memory Controller 120 through which data can be written to memory.

In a typical implementation Programmable Graphics Processing Pipeline 150 performs geometry computations, rasterization, and pixel computations. Therefore Programmable Graphics Processing Pipeline 150 is programmed to operate on surface, primitive, vertex, fragment, pixel, sample or any other data. A fragment is at least a portion of a pixel, i.e., a pixel includes at least one fragment. For simplicity, the remainder of this description will use the term "samples" to refer to surfaces, primitives, vertices, pixels, or fragments. Samples output by Programmable Graphics Processing Pipeline 150 are passed to a Raster Analyzer 160, which optionally performs near and far plane clipping and raster operations, such as stencil, z test, and the like, and saves the results or the samples output by Programmable Graphics Processing Pipeline 150 in Local Memory 140. When the data received by Graphics Subsystem 170 has been completely processed by Graphics Processor 105, an Output 185 of Graphics Subsystem 170 is provided using an Output Controller 180. Output Controller 180 is optionally configured to deliver data to a display device, network, electronic control system, other Computing System 100, other Graphics Subsystem 170, or the like.
FIG. 2 is an illustration of Programmable Graphics Processing Pipeline 150 of FIG. 1. At least one set of samples is output by IDX 135 and received by Programmable Graphics Processing Pipeline 150 and the at least one set of samples is processed according to at least one program, the at least one program including graphics program instructions. A program can process one or more sets of samples. Conversely, a set of samples can be processed by a sequence of one or more programs.

Samples, such as surfaces, primitives, or the like, are received from IDX 135 by Programmable Graphics Processing Pipeline 150 and stored in a Vertex Input Buffer 220 in a register file, FIFO (first in first out), cache, or the like (not shown). The samples are broadcast to Execution Pipelines 240, four of which are shown in the figure. Each Execution Pipeline 240 includes at least one multithreaded processing unit, to be described further herein. The samples output by Vertex Input Buffer 220 can be processed by any one of the Execution Pipelines 240. A sample is accepted by an Execution Pipeline 240 when a processing thread within the Execution Pipeline 240 is available as described further herein. Each Execution Pipeline 240 signals to Vertex Input Buffer 220 when a sample can be accepted or when a sample cannot be accepted. In one embodiment Programmable Graphics Processing Pipeline 150 includes a single Execution Pipeline 240 containing one multithreaded processing unit. In an alternative embodiment, Programmable Graphics Processing Pipeline 150 includes a plurality of Execution Pipelines 240.
Execution Pipelines 240 can receive first samples, such as higher-order surface data, and tessellate the first samples to generate second samples, such as vertices. Execution Pipelines 240 can be configured to transform the second samples from an object-based coordinate representation (object space) to an alternatively based coordinate system such as world space or normalized device coordinates (NDC) space. Each Execution Pipeline 240 communicates with Texture Unit 225 using a read interface (not shown in FIG. 2) to read program instructions and graphics data such as texture maps from Local Memory 140 or Host Memory 112 via Memory Controller 120 and a Texture Cache 230. Texture Cache 230 is used to improve memory read performance by reducing read latency. In an alternate embodiment Texture Cache 230 is omitted. In another alternate embodiment, a Texture Unit 225 is included in each Execution Pipeline 240. In yet another alternate embodiment program instructions are stored within Programmable Graphics Processing Pipeline 150.

Execution Pipelines 240 output processed samples, such as vertices, that are stored in a Vertex Output Buffer 260 in a register file, FIFO, cache, or the like (not shown). Processed vertices output by Vertex Output Buffer 260 are received by a Primitive Assembly/Setup 205. This unit calculates parameters, such as deltas and slopes, to rasterize the processed vertices. Primitive Assembly/Setup 205 outputs parameters and samples, such as vertices, to Raster Unit 210. The Raster Unit 210 performs scan conversion on samples, such as vertices, and outputs samples, such as fragments, to a Pixel Input Buffer 215. Alternatively, Raster Unit 210 resamples processed vertices and outputs additional vertices to Pixel Input Buffer 215.

Pixel Input Buffer 215 outputs the samples to each Execution Pipeline 240. Samples, such as pixels and fragments, output by Pixel Input Buffer 215 are each processed by only one of the Execution Pipelines 240. Pixel Input Buffer 215 determines which one of the Execution Pipelines 240 to output each sample to depending on an output pixel position, e.g., (x,y), associated with each sample. In this manner, each sample is output to the Execution Pipeline 240 designated to process samples associated with the output pixel position. In an alternate embodiment, each sample output by Pixel Input Buffer 215 is processed by an available Execution Pipeline 240.
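One possible position-to-pipeline mapping, shown only as a sketch (the text requires a position-dependent assignment but does not specify one), interleaves 2x2 screen tiles across four pipelines so every sample at a given (x, y) always reaches the same Execution Pipeline:

    // Sketch of a fixed output-pixel-position to Execution Pipeline mapping
    // for four pipelines; any deterministic position-dependent scheme would do.
    constexpr int kNumExecutionPipelines = 4;

    int pipelineForPixel(int x, int y) {
        // Low bit of x and low bit of y select one of four pipelines, so samples
        // with the same (x, y) are always processed by the same pipeline.
        return (x & 1) | ((y & 1) << 1);
    }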
A sample is accepted by an Execution Pipeline 240 when a processing thread within the Execution Pipeline 240 is available as described further herein. Each Execution Pipeline 240 signals to Pixel Input Buffer 215 when a sample can be accepted or when a sample cannot be accepted. Program instructions associated with a thread configure programmable computation units within an Execution Pipeline 240 to perform operations such as texture mapping, shading, blending, and the like. Processed samples are output from each Execution Pipeline 240 to a Pixel Output Buffer 270. Pixel Output Buffer 270 optionally stores the processed samples in a register file, FIFO, cache, or the like (not shown). The processed samples are output from Pixel Output Buffer 270 to Raster Analyzer 160.

Execution Pipelines 240 are optionally configured using program instructions read by Texture Unit 225 such that data processing operations are performed in multiple passes through at least one multithreaded processing unit, to be described further herein, within Execution Pipelines 240. Intermediate data generated during multiple passes can be stored in graphics memory.
FIG. 3 is a conceptual diagram illustrating the relationship between a program and threads. A single program is used to process several sets of samples. Each program, such as a vertex program or shader program, includes a sequence of program instructions such as a Sequence 330 of program instructions 331 to 344. The at least one multithreaded processing unit within an Execution Pipeline 240 supports multithreaded execution. Therefore the program instructions in instruction Sequence 330 can be used by the at least one multithreaded processing unit to process each sample or each group of samples independently, i.e., the at least one multithreaded processing unit may process each sample asynchronously relative to other samples. For example, each fragment or group of fragments within a primitive can be processed independently from the other fragments or from the other groups of fragments within the primitive. Likewise, each vertex within a surface can be processed independently from the other vertices within the surface. For a set of samples being processed using the same program, the sequence of program instructions associated with each thread used to process each sample within the set will be identical. However, it is possible that, during execution, the threads processing some of the samples within a set will diverge following the execution of a conditional branch instruction. After the execution of a conditional branch instruction, the sequence of executed instructions associated with each thread processing samples within the set may differ.

In FIG. 3 program instructions within instruction Sequence 330 are stored in graphics memory, i.e., Host Memory 112, Local Memory 140, register files coupled to the components within Graphics Processor 105, and the like. Each program counter (0 through 13) in instruction Sequence 330 corresponds to a program instruction within instruction Sequence 330. The program counters are conventionally numbered sequentially and can be used as an index to locate a specific program instruction within Sequence 330. The first instruction 331 in Sequence 330 is the program instruction corresponding to program counter 0. A base address, corresponding to the graphics memory location where the first instruction 331 in a program is stored, can be used in conjunction with a program counter to determine the location where a program instruction corresponding to the program counter is stored.
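As a sketch of that lookup (assuming a fixed instruction size, which the text does not state), the storage location is the base address plus the program counter scaled by the instruction size:

    // Sketch: locate an instruction from a program's base address and a
    // program counter, assuming fixed-size instructions.
    #include <cstdint>

    uint64_t instructionAddress(uint64_t baseAddress,
                                uint32_t programCounter,
                                uint32_t bytesPerInstruction) {
        return baseAddress +
               static_cast<uint64_t>(programCounter) * bytesPerInstruction;
    }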
In this example, program instructions within Sequence 330 are associated with three threads. A Thread 350, a Thread 360 and a Thread 370 are each assigned to a different sample and each thread is uniquely identified by a thread identification code. A program instruction within Sequence 330 is associated with a thread using a program counter that is stored as a portion of thread state data, as described further herein. Thread 350 thread state data includes a program counter of 1 as shown in Sequence 330. The program counter associated with Thread 350 is a pointer to the program instruction in Sequence 330 corresponding to program counter 1 and stored at location 332. The instruction stored at location 332 is the next instruction to be used to process the sample assigned to Thread 350. Alternatively, an instruction stored at location 332 is the most recently executed instruction to process the sample assigned to Thread 350.
The thread state data for Thread 360 and Thread 370 each include a program counter of 11, as shown in FIG. 3, referencing the program instruction corresponding to program counter 11 in Sequence 330 and stored at location 342. Program counters associated with threads to process samples within a primitive, surface, or the like, are not necessarily identical because the threads can be executed independently. When branch instructions are not used, Thread 350, Thread 360 and Thread 370 each execute all of the program instructions in Sequence 330.
The number of threads that can be executed simultaneously is limited to a predetermined number in each embodiment and is related to the number of Execution Pipelines 240, the amount of storage required for thread state data, the latency of Execution Pipelines 240, and the like. Each sample is a specific type, e.g., primitive, vertex, or pixel, corresponding to a program type. A primitive type sample, e.g., primitive, is processed by a primitive program, a vertex type sample, e.g., surface or vertex, is processed by a vertex program, and a pixel type sample, e.g., fragment or pixel, is processed by a shader program. Likewise, a primitive thread is associated with program instructions within a primitive program, a vertex thread is associated with program instructions within a vertex program, and a pixel thread is associated with program instructions within a shader program.
A number of threads of each thread type that may be executed simultaneously is predetermined in each embodiment. Therefore, not all samples within a set of samples of a type can be processed simultaneously when the number of threads of the type is less than the number of samples. Conversely, when the number of threads of a type exceeds the number of samples of the type within a set, more than one set can be processed simultaneously. Furthermore, when the number of threads of a type exceeds the number of samples of the type within one or more sets, more than one program of the type can be executed on the one or more sets and the thread state data can include data indicating the program associated with each thread.
FIG. 4 is an illustration of an Execution Pipeline 240 containing at least one Multithreaded Processing Unit 400. An Execution Pipeline 240 can contain a plurality of Multithreaded Processing Units 400. Within each Multithreaded Processing Unit 400, a Thread Control Buffer 420 receives samples from Pixel Input Buffer 215 or Vertex Input Buffer 220. Thread Control Buffer 420 includes storage resources to retain thread state data for a subset of the predetermined number of threads. In one embodiment Thread Control Buffer 420 includes storage resources for each of at least two thread types, where the at least two thread types can include pixel, primitive, and vertex. At least a portion of Thread Control Buffer 420 is a register file, FIFO, circular buffer, or the like. Thread state data for a thread can include, among other things, a program counter, a busy flag that indicates if the thread is either assigned to a sample or available to be assigned to a sample, a pointer to the source sample to be processed by the instructions associated with the thread or the output pixel position and output buffer ID of the sample to be processed, and a pointer specifying a destination location in Vertex Output Buffer 260 or Pixel Output Buffer 270. Additionally, thread state data for a thread assigned to a sample can include the sample type, e.g., pixel, vertex, primitive, or the like.
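An illustrative layout of this per-thread state follows; the field names and widths are assumptions, since the text lists only the contents:

    // Illustrative per-thread state mirroring the items listed above, as it
    // might be retained in Thread Control Buffer 420. Field widths are assumed.
    #include <cstdint>

    enum class SampleType : uint8_t { Pixel, Vertex, Primitive };

    struct ThreadState {
        uint32_t   programCounter;  // next (or most recently executed) instruction
        bool       busy;            // assigned to a sample vs. available
        uint32_t   sourcePtr;       // source sample in Pixel/Vertex Input Buffer,
                                    // or output pixel position and output buffer ID
        uint32_t   destPtr;         // destination in Vertex/Pixel Output Buffer 260/270
        SampleType sampleType;      // meaningful only while the thread is busy
    };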
The source sample is stored in either Pixel Input Buffer 215 or Vertex Input Buffer 220. When a thread is assigned to a sample, the thread is allocated storage resources to retain intermediate data generated during execution of program instructions associated with the thread. The thread identification code for a thread may be the address of a location in Thread Control Buffer 420 in which the thread state data for the thread is stored. In one embodiment, priority is specified for each thread type and Thread Control Buffer 420 is configured to assign threads to samples or allocate storage resources based on the priority assigned to each thread type. In an alternate embodiment, Thread Control Buffer 420 is configured to assign threads to samples or allocate storage resources based on an amount of sample data in Pixel Input Buffer 215 and another amount of sample data in Vertex Input Buffer 220.
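A sketch of the two assignment policies just described (priority-based and occupancy-based); the helper and its signature are assumptions made for illustration:

    // Sketch: choose which thread type to assign next, either by a fixed
    // per-type priority or by how much sample data waits in each input buffer.
    #include <cstddef>

    enum class ThreadType { Pixel, Vertex };

    ThreadType selectThreadType(bool usePriority,
                                int pixelPriority, int vertexPriority,
                                std::size_t pixelBufferCount,
                                std::size_t vertexBufferCount) {
        if (usePriority) {
            // Priority-based embodiment: higher priority thread type wins.
            return (pixelPriority >= vertexPriority) ? ThreadType::Pixel
                                                     : ThreadType::Vertex;
        }
        // Occupancy-based embodiment: favor the input buffer holding more samples.
        return (pixelBufferCount >= vertexBufferCount) ? ThreadType::Pixel
                                                       : ThreadType::Vertex;
    }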
An Instruction Cache 410 reads one or more thread entries, each containing thread state data, from Thread Control Buffer 420. Instruction Cache 410 may read thread entries to process a group of samples. For example, in one embodiment a group of samples, e.g., a number of vertices defining a primitive, four adjacent fragments arranged in a square, or the like, are processed simultaneously. In the one embodiment computed values such as derivatives are shared within the group of samples thereby reducing the number of computations needed to process the group of samples compared with processing the group of samples without sharing the computed values.
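For example, a screen-space derivative for a group of four adjacent fragments can be computed once by differencing and reused by all four threads; the sketch below shows one common scheme, not necessarily the method used in the embodiment:

    // Sketch: share one derivative computation across a group of four adjacent
    // fragments arranged in a 2x2 square instead of computing it per fragment.
    struct Quad {
        float value[4];  // quantity at [0]=top-left, [1]=top-right,
                         //             [2]=bottom-left, [3]=bottom-right
    };

    void quadDerivatives(const Quad& q, float* ddx, float* ddy) {
        *ddx = q.value[1] - q.value[0];  // horizontal difference
        *ddy = q.value[2] - q.value[0];  // vertical difference
        // All four fragments in the group reuse ddx and ddy.
    }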
In an embodiment of Multithreaded Processing Unit 400, priority is specified for each thread type and Instruction Cache 410 is configured to read thread entries based on the priority assigned to each thread type. In another embodiment, Instruction Cache 410 is configured to read thread entries based on the amount of sample data in Pixel Input Buffer 215 and the amount of sample data in Vertex Input Buffer 220. Instruction Cache 410 determines if the program instructions corresponding to the program counters and sample type included in the thread state data for