
UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office

May 9, 2023

TO ALL PERSONS TO WHOM THESE PRESENTS SHALL COME:

By Authority of the
Under Secretary of Commerce for Intellectual Property
and Director of the United States Patent and Trademark Office

THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM THE RECORDS OF THIS OFFICE OF:

PATENT NUMBER: 7,015,913
ISSUE DATE: March 21, 2006

Certifying Officer

US007015913B1

(12) United States Patent
Lindholm et al.

(10) Patent No.: US 7,015,913 B1
(45) Date of Patent: Mar. 21, 2006

(54) METHOD AND APPARATUS FOR MULTITHREADED PROCESSING OF DATA IN A PROGRAMMABLE GRAPHICS PROCESSOR

(75) Inventors: John Erik Lindholm, Saratoga, CA (US); Rui M. Bastos, Santa Clara, CA (US); Harold Robert Feldman Zatz, Palo Alto, CA (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 164 days.

(21) Appl. No.: 10/608,346

(22) Filed: Jun. 27, 2003

(51) Int. Cl.
    G06F 15/00 (2006.01)
    G06F 12/02 (2006.01)
    G06F 9/46 (2006.01)
    G06T 1/00 (2006.01)

(52) U.S. Cl. ........ 345/501; 345/543; 718/102; 718/104

(58) Field of Classification Search ........ 345/501, 345/504, 520, 522, 420, 423, 503, 519, 506, 345/543, 557; 709/208, 231; 712/23, 28, 712/31-32; 718/102, 104
    See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

5,818,469 A *      10/1998   Lawless et al. ...... 345/522
5,946,487 A *       8/1999   Dangelo ............. 717/148
6,088,044 A *       7/2000   Kwok et al. ......... 345/505
6,753,878 B1 *      6/2004   Heirich et al. ...... 345/629
6,765,571 B1 *      7/2004   Sowizral et al. ..... 345/420
2003/0140179 A1 *   7/2003   Wilt et al. ......... 709/321
2004/0160446 A1 *   8/2004   Gosalia et al. ...... 345/503

* cited by examiner

Primary Examiner—Kee M. Tung
(74) Attorney, Agent, or Firm—Patterson & Sheridan, LLP

(57) ABSTRACT

A graphics processor and method for executing a graphics program as a plurality of threads where each sample to be processed by the program is assigned to a thread. Although threads share processing resources within the programmable graphics processor, the execution of each thread can proceed independent of any other threads. For example, instructions in a second thread are scheduled for execution while execution of instructions in a first thread is stalled waiting for source data. Consequently, a first received sample (assigned to the first thread) may be processed after a second received sample (assigned to the second thread). A benefit of independently executing each thread is improved performance because a stalled thread does not prevent the execution of other threads.

33 Claims, 9 Drawing Sheets

[Representative cover drawing: Execution Pipeline 240 containing a Multithreaded Processing Unit 400 with Instruction Cache 410, Thread Control Buffer 420, Instruction Scheduler, Instruction Dispatcher, Register File 450, Resource Scoreboard 460, and an Execution Unit; inputs from Pixel Input Buffer 215 and Vertex Input Buffer 220, interfaces to Texture Unit 225 and Pixel Output Buffer 270.]

[Drawing sheet 1 of 9, FIG. 1: block diagram of the computing system, showing Host Computer 110 (Host Processor 114, Host Memory 112, System Interface 115) and Graphics Subsystem 170 with a Graphics Processor containing Graphics Interface 117, Front End 130, IDX 135, Programmable Graphics Processing Pipeline 150, Raster Analyzer, Memory Controller 120, and Local Memory 140.]

[Drawing sheet 2 of 9, FIG. 2: block diagram of Programmable Graphics Processing Pipeline 150, showing Primitive Assembly/Setup 205, Raster Unit 210, Pixel Input Buffer 215, Vertex Input Buffer 220, Texture Unit 225, Texture Cache 230, four Execution Pipelines 240, Vertex Output Buffer 260, and Pixel Output Buffer 270.]

[Drawing sheet 3 of 9, FIG. 3: conceptual diagram of the relationship between a program and threads.]

[Drawing sheet 4 of 9, FIG. 4: block diagram of an Execution Pipeline 240 containing a Multithreaded Processing Unit 400 with Instruction Cache 410, Thread Control Buffer 420, Instruction Scheduler 430, Instruction Dispatcher 440, Register File 450, Resource Scoreboard 460, and an Execution Unit.]

[Drawing sheet 5 of 9, FIG. 5A: flow diagram for dispatching instructions for two received samples depending on source data availability (steps 503 through 519).]

[Drawing sheet 6 of 9, FIG. 5B: flow diagram for assigning a thread to a received sample and processing it (Receive Sample 520, Identify Thread Type Needed 521, PIOR disabled for pixel threads? 523, Thread Position Hazard? 525, Thread Available? 527, Assign Thread 530, Allocate Resources 533, Fetch Instruction 535, Schedule? 537, Update PC and Resource Scoreboard 540, Dispatch Instruction 543, Process Sample 545, More Instructions? 547, Deallocate Resources 550).]

[Drawing sheet 7 of 9, FIG. 6: flow diagram for instruction scheduling (Instruction in window? 605, Timeout? 610, Remove instruction from window 615, Synch mode enabled? 620, Check synchronization 625, Instruction synched? 630, Remove instruction from window 635, Sort by thread age 640, Read scoreboard 645, Check resources 650, Schedule instruction 655, Update scoreboard 660, Update PC 670, Output instruction 680).]

[Drawing sheet 8 of 9, FIGS. 7A and 7B: FIG. 7A shows issuing a function call to enable PIOR 703 and PIOR configuration complete 706; FIG. 7B shows configuring to enable PIOR 710, rendering intersecting objects 720, configuring to disable PIOR 730, and rendering non-intersecting objects 740.]

[Drawing sheet 9 of 9, FIG. 7C: configuring to enable PIOR 710, rendering opaque objects 725, configuring to disable PIOR 730, and rendering non-opaque objects 745.]

US 7,015,913 B1

METHOD AND APPARATUS FOR MULTITHREADED PROCESSING OF DATA IN A PROGRAMMABLE GRAPHICS PROCESSOR

FIELD OF THE INVENTION

One or more aspects of the invention generally relate to multithreaded processing, and more particularly to processing graphics data in a programmable graphics processor.

BACKGROUND

Current graphics data processing is exemplified by systems and methods developed to perform a specific operation on several graphics data elements, e.g., linear interpolation, tessellation, texture mapping, depth testing. Traditionally graphics processing systems were implemented as fixed function computation units; more recently the computation units are programmable to perform a limited set of operations. In either system, the graphics data elements are processed in the order in which they are received by the graphics processing system. Within the graphics processing system, when a resource, e.g., a computation unit or data, required to process a graphics data element is unavailable, the processing of the element stalls, i.e., does not proceed, until the resource becomes available. Because the system is pipelined, the stall propagates back through the pipeline, stalling the processing of later received elements that may not require the resource and reducing the throughput of the system.

For the foregoing reasons, there is a need for improved approaches to processing graphics data elements.

SUMMARY

The present invention is directed to a system and method that satisfies the need for a programmable graphics processor that supports processing of graphics data elements in an order independent from the order in which the graphics data elements are received by the programmable graphics processing pipeline within the programmable graphics processor.

Various embodiments of the invention include a computing system comprising a host processor, a host memory, a system interface configured to interface with the host processor, and the programmable graphics processor for multithreaded execution of program instructions. The graphics processor includes at least one multithreaded processing unit configured to receive samples in a first order to be processed by program instructions associated with at least one thread. Each multithreaded processing unit includes a scheduler configured to receive the program instructions, determine availability of source data, and schedule the program instructions for execution in a second order independent of the first order. Each multithreaded processing unit further includes a resource tracking unit configured to track the availability of the source data, and a dispatcher configured to output the program instructions in the second order to be executed by the at least one multithreaded processing unit.

Further embodiments of the invention include an application programming interface for a programmable graphics processor comprising a function call to configure a multithreaded processing unit within the programmable graphics processor to enable processing of samples independent of an order in which the samples are received.

Yet further embodiments of the invention include an application programming interface for a programmable graphics processor comprising a function call to configure a multithreaded processing unit within the programmable graphics processor to disable processing of samples independent of an order in which the samples are received.

Various embodiments of a method of the invention include processing a first program instruction associated with a first thread and a second program instruction associated with a second thread. A first sample to be processed by a program instruction associated with a first thread is received before a second sample to be processed by a program instruction associated with a second thread is received. First source data required to process the program instruction associated with the first thread are determined to be not available. Second source data required to process the program instruction associated with the second thread are determined to be available. The program instruction associated with the second thread to process the second sample in the execution unit is dispatched prior to dispatching the program instruction associated with the first thread to process the first sample in the execution unit.
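
As a hedged illustration of this dispatch policy (not text from the patent), the following C++ sketch queues one pending instruction per thread and dispatches the oldest thread whose source data are ready, so a later-received sample can be processed before an earlier-received one. All type and function names are assumptions made for the example.

// Minimal sketch (not the patent's implementation) of the dispatch policy described
// above: each pending thread carries one program instruction, and an instruction is
// dispatched only when its source data are available, so a later-received sample may
// be processed before an earlier-received one. All names are hypothetical.
#include <cstdio>
#include <deque>
#include <string>

struct PendingThread {
    int threadId;             // uniquely identifies the thread
    std::string instruction;  // next instruction for the sample assigned to this thread
    bool sourceDataReady;     // would come from a resource-tracking unit in hardware
};

// Dispatch the oldest thread whose source data are ready; skipped threads stay queued.
bool dispatchOne(std::deque<PendingThread>& pending) {
    for (auto it = pending.begin(); it != pending.end(); ++it) {
        if (it->sourceDataReady) {
            std::printf("dispatch thread %d: %s\n", it->threadId, it->instruction.c_str());
            pending.erase(it);
            return true;
        }
    }
    return false;  // every pending thread is stalled waiting for source data
}

int main() {
    std::deque<PendingThread> pending = {
        {0, "MUL r0, v0, c0", /*sourceDataReady=*/false},  // first received sample, stalled
        {1, "ADD r1, v1, c1", /*sourceDataReady=*/true},   // second received sample, ready
    };
    dispatchOne(pending);                    // dispatches thread 1 even though thread 0 arrived first
    pending.front().sourceDataReady = true;  // e.g., the missing source data for thread 0 arrives
    dispatchOne(pending);                    // now thread 0 is dispatched
    return 0;
}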

Further embodiments of a method of the invention include using a function call to configure the graphics processor. Support for processing samples of at least one sample type independent of an order in which the samples are received by a multithreaded processing unit within the graphics processor is detected. The function call to configure the multithreaded processing unit within the graphics processor to enable processing of the samples independent of an order in which the samples are received is issued for the at least one sample type.

Yet further embodiments of a method of the invention include rendering a scene using the graphics processor. The multithreaded processing unit within the graphics processor is configured to enable processing of samples independent of an order in which the samples are received. The multithreaded processing unit within the graphics processor processes the samples independent of the order in which the samples are received to render at least a portion of the scene.
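
The following C++ sketch illustrates the application-facing pattern suggested by this summary and by FIGS. 7B and 7C: detect support, enable processing independent of the order received (PIOR) for a sample type, render the order-independent portion of the scene, then disable PIOR for the portion that depends on submission order. The function names and the capability query are hypothetical stand-ins, not an actual driver API.

// Hedged sketch of the application-facing pattern described above, using hypothetical
// API names (PIOR = processing independent of order received); the actual function
// call and sample-type identifiers are not specified in this summary.
#include <cstdio>

enum class SampleType { Vertex, Pixel };

// Stand-ins for a driver query and a configuration entry point (assumed, not real APIs).
bool piorSupported(SampleType type) { return type == SampleType::Pixel; }
void setPiorEnabled(SampleType type, bool enable) {
    std::printf("PIOR %s for sample type %d\n", enable ? "enabled" : "disabled",
                static_cast<int>(type));
}

void renderOpaqueObjects()    { /* draw calls whose results do not depend on order */ }
void renderNonOpaqueObjects() { /* e.g., blended geometry that must keep submission order */ }

int main() {
    // Detect support, then enable order-independent processing for pixel samples.
    if (piorSupported(SampleType::Pixel)) {
        setPiorEnabled(SampleType::Pixel, true);
        renderOpaqueObjects();                  // order-independent portion of the scene
        setPiorEnabled(SampleType::Pixel, false);
        renderNonOpaqueObjects();               // portion that relies on the received order
    }
    return 0;
}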

BRIEF DESCRIPTION OF THE VARIOUS VIEWS OF THE DRAWINGS

Accompanying drawing(s) show exemplary embodiment(s) in accordance with one or more aspects of the present invention; however, the accompanying drawing(s) should not be taken to limit the present invention to the embodiment(s) shown, but are for explanation and understanding only.

FIG. 1 illustrates one embodiment of a computing system according to the invention including a host computer and a graphics subsystem;

FIG. 2 is a block diagram of an embodiment of the Programmable Graphics Processing Pipeline of FIG. 1;

FIG. 3 is a conceptual diagram of the relationship between a program and threads;

FIG. 4 is a block diagram of an embodiment of the Execution Pipeline of FIG. 2;

FIGS. 5A and 5B illustrate embodiments of methods utilizing the Execution Pipeline illustrated in FIG. 4;

FIG. 6 illustrates an embodiment of a method utilizing the Execution Pipeline illustrated in FIG. 4;

FIGS. 7A, 7B, and 7C illustrate embodiments of methods utilizing the Computing System illustrated in FIG. 1.

DISCLOSURE OF THE INVENTION

The current invention involves new systems and methods for processing graphics data elements in an order independent from the order in which the graphics data elements are received by a multithreaded processing unit within a graphics processor.

FIG. 1 is an illustration of a Computing System generally designated 100 and including a Host Computer 110 and a Graphics Subsystem 170. Computing System 100 may be a desktop computer, server, laptop computer, palm-sized computer, tablet computer, game console, cellular telephone, computer based simulator, or the like. Host Computer 110 includes Host Processor 114 which may include a system memory controller to interface directly to Host Memory 112 or may communicate with Host Memory 112 through a System Interface 115. System Interface 115 may be an I/O (input/output) interface or a bridge device including the system memory controller to interface directly to Host Memory 112. Examples of System Interface 115 known in the art include Intel® Northbridge and Intel® Southbridge.

Host Computer 110 communicates with Graphics Subsystem 170 via System Interface 115 and a Graphics Interface 117 within a Graphics Processor 105. Data received at Graphics Interface 117 can be passed to a Front End 130 or written to a Local Memory 140 through Memory Controller 120. Graphics Processor 105 uses graphics memory to store graphics data and program instructions, where graphics data is any data that is input to or output from components within the graphics processor. Graphics memory can include portions of Host Memory 112, Local Memory 140, register files coupled to the components within Graphics Processor 105, and the like.

Graphics Processor 105 includes, among other components, Front End 130 that receives commands from Host Computer 110 via Graphics Interface 117. Front End 130 interprets and formats the commands and outputs the formatted commands and data to an IDX (Index Processor) 135. Some of the formatted commands are used by Programmable Graphics Processing Pipeline 150 to initiate processing of data by providing the location of program instructions or graphics data stored in memory. IDX 135, Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 each include an interface to Memory Controller 120 through which program instructions and data can be read from memory, e.g., any combination of Local Memory 140 and Host Memory 112. When a portion of Host Memory 112 is used to store program instructions and data, the portion of Host Memory 112 can be uncached so as to increase performance of access by Graphics Processor 105.

IDX 135 optionally reads processed data, e.g., data written by Raster Analyzer 160, from memory and outputs the data, processed data and formatted commands to Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 each contain one or more programmable processing units to perform a variety of specialized functions. Some of these functions are table lookup, scalar and vector addition, multiplication, division, coordinate-system mapping, calculation of vector normals, tessellation, calculation of derivatives, interpolation, and the like. Programmable Graphics Processing Pipeline 150 and Raster Analyzer 160 are each optionally configured such that data processing operations are performed in multiple passes through those units or in multiple passes within Programmable Graphics Processing Pipeline 150. Programmable Graphics Processing Pipeline 150 and a Raster Analyzer 160 also each include a write interface to Memory Controller 120 through which data can be written to memory.

In a typical implementation Programmable Graphics Processing Pipeline 150 performs geometry computations, rasterization, and pixel computations. Therefore Programmable Graphics Processing Pipeline 150 is programmed to operate on surface, primitive, vertex, fragment, pixel, sample or any other data. A fragment is at least a portion of a pixel, i.e., a pixel includes at least one fragment. For simplicity, the remainder of this description will use the term "samples" to refer to surfaces, primitives, vertices, pixels, or fragments.

Samples output by Programmable Graphics Processing Pipeline 150 are passed to a Raster Analyzer 160, which optionally performs near and far plane clipping and raster operations, such as stencil, z test, and the like, and saves the results or the samples output by Programmable Graphics Processing Pipeline 150 in Local Memory 140. When the data received by Graphics Subsystem 170 has been completely processed by Graphics Processor 105, an Output 185 of Graphics Subsystem 170 is provided using an Output Controller 180. Output Controller 180 is optionally configured to deliver data to a display device, network, electronic control system, other Computing System 100, other Graphics Subsystem 170, or the like.

FIG. 2 is an illustration of Programmable Graphics Processing Pipeline 150 of FIG. 1. At least one set of samples is output by IDX 135 and received by Programmable Graphics Processing Pipeline 150 and the at least one set of samples is processed according to at least one program, the at least one program including graphics program instructions. A program can process one or more sets of samples. Conversely, a set of samples can be processed by a sequence of one or more programs.

Samples, such as surfaces, primitives, or the like, are received from IDX 135 by Programmable Graphics Processing Pipeline 150 and stored in a Vertex Input Buffer 220 in a register file, FIFO (first in first out), cache, or the like (not shown). The samples are broadcast to Execution Pipelines 240, four of which are shown in the figure. Each Execution Pipeline 240 includes at least one multithreaded processing unit, to be described further herein. The samples output by Vertex Input Buffer 220 can be processed by any one of the Execution Pipelines 240. A sample is accepted by an Execution Pipeline 240 when a processing thread within the Execution Pipeline 240 is available as described further herein. Each Execution Pipeline 240 signals to Vertex Input Buffer 220 when a sample can be accepted or when a sample cannot be accepted. In one embodiment Programmable Graphics Processing Pipeline 150 includes a single Execution Pipeline 240 containing one multithreaded processing unit. In an alternative embodiment, Programmable Graphics Processing Pipeline 150 includes a plurality of Execution Pipelines 240.

Execution Pipelines 240 can receive first samples, such as higher-order surface data, and tessellate the first samples to generate second samples, such as vertices. Execution Pipelines 240 can be configured to transform the second samples from an object-based coordinate representation (object space) to an alternatively based coordinate system such as world space or normalized device coordinates (NDC) space. Each Execution Pipeline 240 communicates with Texture Unit 225 using a read interface (not shown in FIG. 2) to read program instructions and graphics data such as texture maps from Local Memory 140 or Host Memory 112 via Memory Controller 120 and a Texture Cache 230. Texture Cache 230 is used to improve memory read performance by reducing read latency. In an alternate embodiment Texture Cache 230 is omitted. In another alternate embodiment, a Texture Unit 225 is included in each Execution Pipeline 240. In yet another alternate embodiment program instructions are stored within Programmable Graphics Processing Pipeline 150.

Execution Pipelines 240 output processed samples, such as vertices, that are stored in a Vertex Output Buffer 260 in a register file, FIFO, cache, or the like (not shown). Processed vertices output by Vertex Output Buffer 260 are received by a Primitive Assembly/Setup 205. This unit calculates parameters, such as deltas and slopes, to rasterize the processed vertices. Primitive Assembly/Setup 205 outputs parameters and samples, such as vertices, to Raster Unit 210. The Raster Unit 210 performs scan conversion on samples, such as vertices, and outputs samples, such as fragments, to a Pixel Input Buffer 215. Alternatively, Raster Unit 210 resamples processed vertices and outputs additional vertices to Pixel Input Buffer 215.

Pixel Input Buffer 215 outputs the samples to each Execution Pipeline 240. Samples, such as pixels and fragments, output by Pixel Input Buffer 215 are each processed by only one of the Execution Pipelines 240. Pixel Input Buffer 215 determines which one of the Execution Pipelines 240 to output each sample to depending on an output pixel position, e.g., (x,y), associated with each sample. In this manner, each sample is output to the Execution Pipeline 240 designated to process samples associated with the output pixel position. In an alternate embodiment, each sample output by Pixel Input Buffer 215 is processed by an available Execution Pipeline 240.
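
The patent does not specify the mapping from output pixel position to a designated Execution Pipeline 240; the C++ sketch below assumes, purely for illustration, a repeating 2x2 interleave across four pipelines so that every sample at a given pixel position always reaches the same pipeline.

// Hedged sketch of one possible position-based routing rule; the interleave pattern
// (2x2 screen tiles spread across four pipelines) and all names are assumptions.
#include <cstdio>

constexpr int kNumExecutionPipelines = 4;

// Map an output pixel position (x, y) to the pipeline designated for that position.
// Every sample landing on the same pixel always routes to the same pipeline.
int pipelineForPixel(int x, int y) {
    return (x & 1) + 2 * (y & 1);  // four pipelines in a repeating 2x2 pattern
}

int main() {
    for (int y = 0; y < 2; ++y)
        for (int x = 0; x < 4; ++x)
            std::printf("pixel (%d,%d) -> pipeline %d of %d\n",
                        x, y, pipelineForPixel(x, y), kNumExecutionPipelines);
    return 0;
}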

A sample is accepted by an Execution Pipeline 240 when a processing thread within the Execution Pipeline 240 is available as described further herein. Each Execution Pipeline 240 signals to Pixel Input Buffer 215 when a sample can be accepted or when a sample cannot be accepted. Program instructions associated with a thread configure programmable computation units within an Execution Pipeline 240 to perform operations such as texture mapping, shading, blending, and the like. Processed samples are output from each Execution Pipeline 240 to a Pixel Output Buffer 270. Pixel Output Buffer 270 optionally stores the processed samples in a register file, FIFO, cache, or the like (not shown). The processed samples are output from Pixel Output Buffer 270 to Raster Analyzer 160.

Execution Pipelines 240 are optionally configured using program instructions read by Texture Unit 225 such that data processing operations are performed in multiple passes through at least one multithreaded processing unit, to be described further herein, within Execution Pipelines 240. Intermediate data generated during multiple passes can be stored in graphics memory.

FIG. 3 is a conceptual diagram illustrating the relationship between a program and threads. A single program is used to process several sets of samples. Each program, such as a vertex program or shader program, includes a sequence of program instructions such as a Sequence 330 of program instructions 331 to 344. The at least one multithreaded processing unit within an Execution Pipeline 240 supports multithreaded execution. Therefore the program instructions in instruction Sequence 330 can be used by the at least one multithreaded processing unit to process each sample or each group of samples independently, i.e., the at least one multithreaded processing unit may process each sample asynchronously relative to other samples. For example, each fragment or group of fragments within a primitive can be processed independently from the other fragments or from the other groups of fragments within the primitive. Likewise, each vertex within a surface can be processed independently from the other vertices within the surface. For a set of samples being processed using the same program, the sequence of program instructions associated with each thread used to process each sample within the set will be identical. However, it is possible that, during execution, the threads processing some of the samples within a set will diverge following the execution of a conditional branch instruction. After the execution of a conditional branch instruction, the sequence of executed instructions associated with each thread processing samples within the set may differ.

In FIG. 3 program instructions within instruction Sequence 330 are stored in graphics memory, i.e., Host Memory 112, Local Memory 140, register files coupled to the components within Graphics Processor 105, and the like. Each program counter (0 through 13) in instruction Sequence 330 corresponds to a program instruction within instruction Sequence 330. The program counters are conventionally numbered sequentially and can be used as an index to locate a specific program instruction within Sequence 330. The first instruction 331 in the sequence 330 is the program instruction corresponding to program counter 0. A base address, corresponding to the graphics memory location where the first instruction 331 in a program is stored, can be used in conjunction with a program counter to determine the location where a program instruction corresponding to the program counter is stored.
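
A minimal C++ sketch of that lookup follows, assuming a fixed instruction size (the patent does not state one): the address of the instruction referenced by a program counter is the program's base address plus the program counter times the instruction size.

// Minimal sketch of the base-address-plus-program-counter lookup described above; the
// fixed instruction size and the chosen base address are assumptions for illustration.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kInstructionSizeBytes = 16;  // assumed fixed-size encoding

// Locate the graphics-memory address of the instruction a program counter refers to.
uint64_t instructionAddress(uint64_t programBaseAddress, uint32_t programCounter) {
    return programBaseAddress + static_cast<uint64_t>(programCounter) * kInstructionSizeBytes;
}

int main() {
    const uint64_t base = 0x10000;  // where the first instruction (program counter 0) is stored
    std::printf("pc 0  -> 0x%llx\n", (unsigned long long)instructionAddress(base, 0));
    std::printf("pc 11 -> 0x%llx\n", (unsigned long long)instructionAddress(base, 11));
    return 0;
}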

In this example, program instructions within Sequence 330 are associated with three threads. A Thread 350, a Thread 360 and a Thread 370 are each assigned to a different sample and each thread is uniquely identified by a thread identification code. A program instruction within Sequence 330 is associated with a thread using a program counter that is stored as a portion of thread state data, as described further herein. Thread 350 thread state data includes a program counter of 1 as shown in Sequence 330. The program counter associated with Thread 350 is a pointer to the program instruction in Sequence 330 corresponding to program counter 1 and stored at location 332. The instruction stored at location 332 is the next instruction to be used to process the sample assigned to Thread 350. Alternatively, an instruction stored at location 332 is the most recently executed instruction to process the sample assigned to Thread 350.

The thread state data for Thread 360 and Thread 370 each include a program counter of 11, as shown in FIG. 3, referencing the program instruction corresponding to program counter 11 in Program 330 and stored at location 342. Program counters associated with threads to process samples within a primitive, surface, or the like, are not necessarily identical because the threads can be executed independently. When branch instructions are not used, Thread 350, Thread 360 and Thread 370 each execute all of the program instructions in Sequence 330.
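
As a hedged illustration of such divergence, the C++ sketch below tracks a separate program counter per thread; a conditional branch (placed at program counter 5 purely for the example) sends one thread ahead while the other falls through, after which the two threads execute different instruction sequences.

// Hedged sketch of how independently tracked program counters let threads running the
// same instruction sequence diverge at a conditional branch; the branch location and
// target are illustrative assumptions only.
#include <cstdio>

struct Thread {
    int id;
    int programCounter;  // index into the shared instruction sequence (0..13 in FIG. 3)
};

// Advance one thread by one instruction; a conditional branch at program counter 5
// (assumed) jumps threads whose sample takes the branch ahead to program counter 11.
void step(Thread& t, bool branchConditionTaken) {
    if (t.programCounter == 5 && branchConditionTaken)
        t.programCounter = 11;   // skip instructions 6..10 for this thread only
    else
        t.programCounter += 1;   // all other instructions fall through sequentially
}

int main() {
    Thread a{350, 5}, b{360, 5};  // both threads reach the branch together
    step(a, /*branchConditionTaken=*/false);
    step(b, /*branchConditionTaken=*/true);
    std::printf("thread %d pc=%d, thread %d pc=%d\n",
                a.id, a.programCounter, b.id, b.programCounter);
    return 0;
}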

The number of threads that can be executed simultaneously is limited to a predetermined number in each embodiment and is related to the number of Execution Pipelines 240, the amount of storage required for thread state data, the latency of Execution Pipelines 240, and the like. Each sample is a specific type, e.g., primitive, vertex, or pixel, corresponding to a program type. A primitive type sample, e.g., primitive, is processed by a primitive program, a vertex type sample, e.g., surface or vertex, is processed by a vertex program, and a pixel type sample, e.g., fragment or pixel, is processed by a shader program. Likewise, a primitive thread is associated with program instructions within a primitive program, a vertex thread is associated with program instructions within a vertex program, and a pixel thread is associated with program instructions within a shader program.

A number of threads of each thread type that may be executed simultaneously is predetermined in each embodiment. Therefore, not all samples within a set of samples of a type can be processed simultaneously when the number of threads of the type is less than the number of samples. Conversely, when the number of threads of a type exceeds the number of samples of the type within a set, more than one set can be processed simultaneously. Furthermore, when the number of threads of a type exceeds the number of samples of the type within one or more sets, more than one program of the type can be executed on the one or more sets and the thread state data can include data indicating the program associated with each thread.

FIG. 4 is an illustration of an Execution Pipeline 240 containing at least one Multithreaded Processing Unit 400. An Execution Pipeline 240 can contain a plurality of Multithreaded Processing Units 400. Within each Multithreaded Processing Unit 400, a Thread Control Buffer 420 receives samples from Pixel Input Buffer 215 or Vertex Input Buffer 220. Thread Control Buffer 420 includes storage resources to retain thread state data for a subset of the predetermined number of threads. In one embodiment Thread Control Buffer 420 includes storage resources for each of at least two thread types, where the at least two thread types can include pixel, primitive, and vertex. At least a portion of Thread Control Buffer 420 is a register file, FIFO, circular buffer, or the like. Thread state data for a thread can include, among other things, a program counter, a busy flag that indicates if the thread is either assigned to a sample or available to be assigned to a sample, a pointer to the source sample to be processed by the instructions associated with the thread or the output pixel position and output buffer ID of the sample to be processed, and a pointer specifying a destination location in Vertex Output Buffer 260 or Pixel Output Buffer 270. Additionally, thread state data for a thread assigned to a sample can include the sample type, e.g., pixel, vertex, primitive, or the like.
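
A compact C++ sketch of a record holding the thread state data listed above follows; the field widths, the union layout, and the enumeration are illustrative assumptions rather than the patent's storage format.

// Hedged sketch of a thread state record with the fields listed above; all widths and
// the union layout are assumptions for illustration, not the patent's storage format.
#include <cstdint>

enum class SampleType : uint8_t { Pixel, Vertex, Primitive };

struct ThreadStateData {
    uint32_t   programCounter;  // next (or most recently executed) instruction for the thread
    bool       busy;            // set when the thread is assigned to a sample
    SampleType sampleType;      // type of the sample assigned to the thread
    // Where the source sample lives: either a pointer into Pixel/Vertex Input Buffer,
    // or an output pixel position plus output buffer ID.
    union {
        uint32_t sourceSampleIndex;
        struct { uint16_t x, y; uint8_t outputBufferId; } pixel;
    } source;
    uint32_t destinationIndex;  // location in Vertex Output Buffer 260 or Pixel Output Buffer 270
};

int main() {
    ThreadStateData t{};
    t.busy = true;
    t.sampleType = SampleType::Pixel;
    t.programCounter = 1;          // e.g., Thread 350 in FIG. 3
    t.source.pixel = {64, 32, 0};  // output pixel position and output buffer ID
    t.destinationIndex = 7;
    return 0;
}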

The source sample is stored in either Pixel Input Buffer 215 or Vertex Input Buffer 220. When a thread is assigned to a sample, the thread is allocated storage resources to retain intermediate data generated during execution of program instructions associated with the thread. The thread identification code for a thread may be the address of a location in Thread Control Buffer 420 in which the thread state data for the thread is stored. In one embodiment, priority is specified for each thread type and Thread Control Buffer 420 is configured to assign threads to samples or allocate storage resources based on the priority assigned to each thread type. In an alternate embodiment, Thread Control Buffer 420 is configured to assign threads to samples or allocate storage resources based on an amount of sample data in Pixel Input Buffer 215 and another amount of sample data in Vertex Input Buffer 220.

An Instruction Cache 410 reads one or more thread entries, each containing thread state data, from Thread Control Buffer 420. Instruction Cache 410 may read thread entries to process a group of samples. For example, in one embodiment a group of samples, e.g., a number of vertices defining a primitive, four adjacent fragments arranged in a square, or the like, are processed simultaneously. In the one embodiment computed values such as derivatives are shared within the group of samples, thereby reducing the number of computations needed to process the group of samples compared with processing the group of samples without sharing the computed values.
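
As a hedged illustration of that sharing, the C++ sketch below computes one horizontal and one vertical finite difference for a group of four adjacent fragments arranged in a square, so the derivative work is done once for the group instead of once per fragment; the quad layout and helper names are assumptions.

// Hedged sketch of sharing derivative computations across a group of four adjacent
// fragments arranged in a 2x2 square: one finite-difference pair serves all four
// fragments. The layout and helper names are assumptions for illustration.
#include <cstdio>

struct Quad {
    // values[y][x] holds a shaded quantity for the fragment at (x, y) within the 2x2 group
    float values[2][2];
};

// One horizontal and one vertical difference are computed once and shared by the group,
// instead of computing per-fragment derivatives four separate times.
void sharedDerivatives(const Quad& q, float& ddx, float& ddy) {
    ddx = q.values[0][1] - q.values[0][0];
    ddy = q.values[1][0] - q.values[0][0];
}

int main() {
    Quad q{{{1.0f, 1.5f}, {2.0f, 2.5f}}};
    float ddx, ddy;
    sharedDerivatives(q, ddx, ddy);
    std::printf("ddx=%.2f ddy=%.2f (shared by all four fragments)\n", ddx, ddy);
    return 0;
}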

In an embodiment of Multithreaded Processing Unit 400, priority is specified for each thread type and Instruction Cache 410 is configured to read thread entries based on the priority assigned to each thread type. In another embodiment, Instruction Cache 410 is configured to read thread entries based on the amount of sample data in Pixel Input Buffer 215 and the amount of sample data in Vertex Input Buffer 220. Instruction Cache 410 determines if the program instructions corresponding to the program counters and sample type included in the thread state data for
