TO ALL TO WHOM THESE PRESENTS SHALL COME:
`UNITED STATES DEPARTMENT OF COMMERCE
`United States Patent and Trademark Office
`
THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM
THE RECORDS OF THIS OFFICE OF:
`
`May 9, 2023
`
`
`
By Authority of the
Under Secretary of Commerce for Intellectual Property
and Director of the United States Patent and Trademark Office
`Miguel Tarver
`Certifying Officer
`
PATENT NUMBER: 7,038,685
`ISSUE DATE: May 2, 2006
`
`
`
`Realtek Ex. 1005
`Case No. IPR2023-00922
`Page 1 of 22
`
`
`
(12) United States Patent
     Lindholm
(10) Patent No.: US 7,038,685 B1
(45) Date of Patent: May 2, 2006

(54) PROGRAMMABLE GRAPHICS PROCESSOR
     FOR MULTITHREADED EXECUTION OF
     PROGRAMS

(75) Inventor: John Erik Lindholm, Saratoga, CA (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

(*) Notice: Subject to any disclaimer, the term of this
     patent is extended or adjusted under 35
     U.S.C. 154(b) by 134 days.

(21) Appl. No.: 10/609,967

(22) Filed: Jun. 30, 2003

(51) Int. Cl.
     G06F 15/00 (2006.01)
     G06F 13/00 (2006.01)
     G06F 12/02 (2006.01)
     G06F 9/46 (2006.01)
     G06T 1/00 (2006.01)
(52) U.S. Cl. 345/501; 345/543; 345/536; 718/104
(58) Field of Classification Search 345/501,
     345/502, 530, 531, 522, 418, 419, 426, 427,
     345/543, 505, 536; 718/100, 104, 103
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
5,020,115 A *      5/1991   Black              382/298
5,969,726 A       10/1999   Rentschler et al.
6,630,935 B1 *    10/2003   Taylor et al.      345/522
6,731,289 B1 *     5/2004   Peercy et al.      345/503
2003/0041173 A1    2/2003   Hoyle

* cited by examiner
`
`Primary Examiner—Kee M. Tung
`(74) Attorney, Agent, or Firm—Patterson & Sheridan, LLP
`
(57) ABSTRACT

A programmable graphics processor for multithreaded
execution of program instructions including a thread control
`unit. The programmable graphics processor is programmed
`with program instructions for processing primitive, pixel
`and vertex data. The thread control unit has a thread storage
`resource including locations allocated to store thread state
`data associated with samples of two or more types. Sample
`types include primitive, pixel and vertex. A number of
`threads allocated to processing a sample type may be
`dynamically modified.
`
`45 Claims, 9 Drawing Sheets
`
[Representative drawing (reproduced as FIG. 4 on Sheet 4): Execution Pipeline with Multithreaded Processing Unit 400, Thread Selection Unit, Thread Control Unit, Resource Scoreboard, Instruction Scheduler 430, Instruction Dispatcher 440, Execution Unit 470 and Register File 350.]
`
`
`
U.S. Patent    May 2, 2006    Sheet 1 of 9    US 7,038,685 B1

[FIG. 1: block diagram of Computing System 100 showing Host Computer 110 (Host Memory 112, Host Processor 114, System Interface 115) and Graphics Subsystem 170 with Graphics Processor 105 (Graphics Interface 117, Memory Controller 120, Front End 130, IDX 135, Programmable Graphics Processing Pipeline 150, Raster Analyzer 160, Output Controller 180).]
`
`
`
[Sheet 2 of 9, FIG. 2: block diagram of Programmable Graphics Processing Pipeline 150 showing Primitive Assembly/Setup 205, Raster Unit 210, Pixel Input Buffer 215, Vertex Input Buffer 220, four Execution Pipelines 240, Vertex Output Buffer 260, Pixel Output Buffer, Texture Unit 225 and Texture Cache 230.]
`
`
`
[Sheet 3 of 9, FIG. 3: block diagram of Execution Pipeline 240 showing a Multithreaded Processing Unit with Thread Control Unit 320 and a Register File, inputs from 215 and 220, and an output to 225.]
`
`
`
[Sheet 4 of 9, FIG. 4: block diagram of an Execution Pipeline with a Multithreaded Processing Unit showing an Instruction Cache, Thread Selection Unit 415, Thread Control Unit 420 with TSR 325, Resource Scoreboard 460, Sequencer 425, Instruction Scheduler 430, Instruction Dispatcher 440, Execution Unit 470 and Register File 350.]
`
`
`
[Sheet 5 of 9, FIGS. 5A and 5B: flow diagrams of thread assignment. FIG. 5A: Receive Pointer to a Program (510); Pixel or Vertex? (515); Assign Vertex Thread or Assign Pixel Thread. FIG. 5B: Receive Pointer to a Program (510); Pixel or Vertex? (515); for each sample type, Priority Test? and Thread Available? checks; Assign Vertex Thread or Assign Pixel Thread.]
`
`
`
[Sheet 6 of 9, FIG. 6B: only the reference numerals 620, 625, 630, 635, 640 and 645 survive from this sheet.]
`
`
`
[Sheet 7 of 9, FIGS. 7A and 7B: flow diagrams. FIG. 7A: Allocating threads to a first sample type (710); Allocating threads to a second sample type (715); Execute First Program Instructions (720); Execute Second Program Instructions (725). FIG. 7B: Determine allocations (750); Allocating threads to a first sample type; Allocating threads to a second sample type (760); Execute First Program Instructions (765); Execute Second Program Instructions (770); Allocating threads to the first sample type.]
`
`
`
[Sheet 8 of 9, FIGS. 8A and 8B: flow diagrams. FIG. 8A: Receive Sample (810); Identify Sample Type (815); Thread Available? (820); Assign Thread (825). FIG. 8B: Receive Sample (850); Identify Sample Type (855); PIOR disabled? (860); Position Hazard? (865); Thread Available? (870); Resources Available? (877); Assign Thread; Execute Thread (880); Deallocate Resources.]
`
`
`
[Sheet 9 of 9, FIGS. 9A and 9B: flow diagrams. FIG. 9A: Identify assigned thread(s) (910); Select Thread(s) (915); Read Program Counter(s) (920); Update Program Counter(s) (925). FIG. 9B: Determine thread priority (950); Identify assigned thread(s) for priority (955); No threads? (960); Select Thread(s) (965); Read Program Counter(s) (970); Update Program Counter(s) (975); Identify next priority (980).]
`
`
`
PROGRAMMABLE GRAPHICS PROCESSOR
FOR MULTITHREADED EXECUTION OF
PROGRAMS

FIELD OF THE INVENTION

One or more aspects of the invention generally relate to
multithreaded processing, and more particularly to processing
graphics data in a programmable graphics processor.

BACKGROUND

Current graphics data processing includes systems and
methods developed to perform a specific operation on graphics
data, e.g., linear interpolation, tessellation, rasterization,
texture mapping, depth testing, etc. These graphics processors
include several fixed function computation units to
perform such specific operations on specific types of graphics
data, such as vertex data and pixel data. More recently,
the computation units have a degree of programmability to
perform user specified operations such that the vertex data is
processed by a vertex processing unit using vertex programs
and the pixel data is processed by a pixel processing unit
using pixel programs. When the amount of vertex data being
processed is low relative to the amount of pixel data being
processed, the vertex processing unit may be underutilized.
Conversely, when the amount of vertex data being processed
is high relative to the amount of pixel data being processed, the
pixel processing unit may be underutilized.

Accordingly, it would be desirable to provide improved
approaches to processing different types of graphics data to
better utilize one or more processing units within a graphics
processor.

SUMMARY

A method and apparatus for processing and allocating
threads for multithreaded execution of graphics programs is
described. A graphics processor for multithreaded execution
of program instructions associated with threads to process at
least two sample types includes a thread control unit including
a thread storage resource configured to store thread state
data for each of the threads to process the at least two sample
types.

Alternatively, the graphics processor includes a multithreaded
processing unit. The multithreaded processing unit
includes a thread control unit configured to store pointers to
program instructions associated with threads, each thread
processing a sample type of vertex, pixel or primitive. The
multithreaded processing unit also includes at least one
programmable computation unit configured to process data
under control of the program instructions.

A method of multithreaded processing of graphics data
includes receiving a pointer to a vertex program to process
vertex samples. A first thread is assigned to a vertex sample.
A pointer to a shader program to process pixel samples is
received. A second thread is assigned to a pixel sample. The
vertex program is executed to process the vertex sample and
produce a processed vertex sample. The shader program is
executed to process the pixel sample and produce a processed
pixel sample.

Alternatively, the method of multithreaded processing of
graphics data includes allocating a first number of processing
threads for a first sample type. A second number of
processing threads is allocated for a second sample type.
First program instructions associated with the first sample
type are executed to process the graphics data and produce
processed graphics data.

A method of assigning threads for processing of graphics
data includes receiving a sample to be processed. A sample
type of vertex, pixel, or primitive, associated with the
sample is determined. A thread is determined to be available
for assignment to the sample. The thread is assigned to the
sample.

A method of selecting at least one thread for execution
includes identifying one or more assigned threads from
threads including at least a thread assigned to a pixel sample
and a thread assigned to a vertex sample. At least one of the
one or more assigned threads is selected for processing.

A method of improving performance of multithreaded
processing of graphics data using at least two sample types
includes dynamically allocating a first number of threads for
processing a first portion of the graphics data to a first
sample type and dynamically allocating a second number of
threads for processing a second portion of the graphics data
to a second sample type.

BRIEF DESCRIPTION OF THE DRAWINGS

Accompanying drawing(s) show exemplary
embodiment(s) in accordance with one or more aspects of
the present invention; however, the accompanying
drawing(s) should not be taken to limit the present invention
to the embodiment(s) shown, but are for explanation and
understanding only.

FIG. 1 illustrates one embodiment of a computing system
according to the invention including a host computer and a
graphics subsystem.

FIG. 2 is a block diagram of an embodiment of the
Programmable Graphics Processing Pipeline of FIG. 1.

FIG. 3 is a block diagram of an embodiment of the
Execution Pipeline of FIG. 1.

FIG. 4 is a block diagram of an alternate embodiment of
the Execution Pipeline of FIG. 1.

FIGS. 5A and 5B are flow diagrams of exemplary
embodiments of thread assignment in accordance with one
or more aspects of the present invention.

FIGS. 6A and 6B are exemplary embodiments of a portion
of the Thread Storage Resource storing thread state data
within an embodiment of the Thread Control Unit of FIG. 3
or FIG. 4.

FIGS. 7A and 7B are flow diagrams of exemplary
embodiments of thread allocation and processing in accordance
with one or more aspects of the present invention.

FIGS. 8A and 8B are flow diagrams of exemplary
embodiments of thread assignment in accordance with one
or more aspects of the present invention.

FIGS. 9A and 9B are flow diagrams of exemplary
embodiments of thread selection in accordance with one or
more aspects of the present invention.

DETAILED DESCRIPTION

In the following description, numerous specific details are
set forth to provide a more thorough understanding of the
present invention. However, it will be apparent to one of
skill in the art that the present invention may be practiced
without one or more of these specific details. In other
instances, well-known features have not been described in
order to avoid obscuring the present invention.

FIG. 1 is an illustration of a Computing System generally
designated 100 and including a Host Computer 110 and a
Graphics Subsystem 170. Computing System 100 may be a
desktop computer, server, laptop computer, palm-sized computer,
tablet computer, game console, cellular telephone,
computer based simulator, or the like. Host Computer 110
`
`
`
includes Host Processor 114 that may include a system
memory controller to interface directly to Host Memory 112
or may communicate with Host Memory 112 through a
System Interface 115. System Interface 115 may be an I/O
(input/output) interface or a bridge device including the
system memory controller to interface directly to Host
Memory 112. Examples of System Interface 115 known in
the art include Intel® Northbridge and Intel® Southbridge.

Host Computer 110 communicates with Graphics Subsystem
170 via System Interface 115 and a Graphics Interface
117 within a Graphics Processor 105. Data received at
Graphics Interface 117 can be passed to a Front End 130 or
written to a Local Memory 140 through Memory Controller
120. Graphics Processor 105 uses graphics memory to store
graphics data and program instructions, where graphics data
is any data that is input to or output from components within
the graphics processor. Graphics memory can include portions
of Host Memory 112, Local Memory 140, register files
coupled to the components within Graphics Processor 105,
and the like.

Graphics Processor 105 includes, among other components,
Front End 130 that receives commands from Host
Computer 110 via Graphics Interface 117. Front End 130
interprets and formats the commands and outputs the formatted
commands and data to an IDX (Index Processor)
135. Some of the formatted commands are used by Programmable
Graphics Processing Pipeline 150 to initiate
processing of data by providing the location of program
instructions or graphics data stored in memory. IDX 135,
Programmable Graphics Processing Pipeline 150 and a
Raster Analyzer 160 each include an interface to Memory
Controller 120 through which program instructions and data
can be read from memory, e.g., any combination of Local
Memory 140 and Host Memory 112. When a portion of Host
Memory 112 is used to store program instructions and data,
the portion of Host Memory 112 can be uncached so as to
increase performance of access by Graphics Processor 105.

IDX 135 optionally reads processed data, e.g., data written
by Raster Analyzer 160, from memory and outputs the
data, processed data and formatted commands to Programmable
Graphics Processing Pipeline 150. Programmable
Graphics Processing Pipeline 150 and Raster Analyzer 160
each contain one or more programmable processing units to
perform a variety of specialized functions. Some of these
functions are table lookup, scalar and vector addition, multiplication,
division, coordinate-system mapping, calculation
of vector normals, tessellation, calculation of derivatives,
interpolation, and the like. Programmable Graphics
Processing Pipeline 150 and Raster Analyzer 160 are each
optionally configured such that data processing operations
are performed in multiple passes through those units or in
multiple passes within Programmable Graphics Processing
Pipeline 150. Programmable Graphics Processing Pipeline
150 and a Raster Analyzer 160 also each include a write
interface to Memory Controller 120 through which data can
be written to memory.

In a typical implementation Programmable Graphics Processing
Pipeline 150 performs geometry computations, rasterization,
and pixel computations. Therefore Programmable
Graphics Processing Pipeline 150 is programmed to operate
on surface, primitive, vertex, fragment, pixel, sample or any
other data. For simplicity, the remainder of this description
will use the term “samples” to refer to graphics data such as
surfaces, primitives, vertices, pixels, fragments, or the like.

Samples output by Programmable Graphics Processing
Pipeline 150 are passed to a Raster Analyzer 160, which
optionally performs near and far plane clipping and raster
operations, such as stencil, z test, and the like, and saves the
results or the samples output by Programmable Graphics
Processing Pipeline 150 in Local Memory 140. When the
data received by Graphics Subsystem 170 has been completely
processed by Graphics Processor 105, an Output 185
of Graphics Subsystem 170 is provided using an Output
Controller 180. Output Controller 180 is optionally configured
to deliver data to a display device, network, electronic
control system, other Computing System 100, other Graphics
Subsystem 170, or the like. Alternatively, data is output
to a film recording device or written to a peripheral device,
e.g., disk drive, tape, compact disk, or the like.

FIG. 2 is an illustration of Programmable Graphics Processing
Pipeline 150 of FIG. 1. At least one set of samples
is output by IDX 135 and received by Programmable
Graphics Processing Pipeline 150 and the at least one set of
samples is processed according to at least one program, the
at least one program including graphics program instructions.
A program can process one or more sets of samples.
Conversely, a set of samples can be processed by a sequence
of one or more programs.

Samples, such as surfaces, primitives, or the like, are
received from IDX 135 by Programmable Graphics Processing
Pipeline 150 and stored in a Vertex Input Buffer 220
including a register file, FIFO (first in first out), cache, or the
like (not shown). The samples are broadcast to Execution
Pipelines 240, four of which are shown in the figure. Each
Execution Pipeline 240 includes at least one multithreaded
processing unit, to be described further herein. The samples
output by Vertex Input Buffer 220 can be processed by any
one of the Execution Pipelines 240. A sample is accepted by
an Execution Pipeline 240 when a processing thread within
the Execution Pipeline 240 is available as described further
herein. Each Execution Pipeline 240 signals to Vertex Input
Buffer 220 when a sample can be accepted or when a sample
cannot be accepted. In one embodiment Programmable
Graphics Processing Pipeline 150 includes a single Execution
Pipeline 240 containing one multithreaded processing
unit. In an alternative embodiment, Programmable Graphics
Processing Pipeline 150 includes a plurality of Execution
Pipelines 240.

Execution Pipelines 240 may receive first samples, such
as higher-order surface data, and tessellate the first samples
to generate second samples, such as vertices. Execution
Pipelines 240 may be configured to transform the second
samples from an object-based coordinate representation
(object space) to an alternatively based coordinate system
such as world space or normalized device coordinates
(NDC) space. Each Execution Pipeline 240 may communicate
with Texture Unit 225 using a read interface (not shown
in FIG. 2) to read program instructions and graphics data
such as texture maps from Local Memory 140 or Host
Memory 112 via Memory Controller 120 and a Texture
Cache 230. Texture Cache 230 is used to improve memory
read performance by reducing read latency. In an alternate
embodiment Texture Cache 230 is omitted. In another
alternate embodiment, a Texture Unit 225 is included in each
Execution Pipeline 240. In another alternate embodiment
program instructions are stored within Programmable
Graphics Processing Pipeline 150. In another alternate
embodiment each Execution Pipeline 240 has a dedicated
instruction read interface to read program instructions from
Local Memory 140 or Host Memory 112 via Memory
Controller 120.

Execution Pipelines 240 output processed samples, such
as vertices, that are stored in a Vertex Output Buffer 260
including a register file, FIFO, cache, or the like (not
shown). Processed vertices output by Vertex Output Buffer
260 are received by a Primitive Assembly/Setup Unit 205.
`Primitive Assembly/Setup Unit 205 calculates parameters,
such as deltas and slopes, to rasterize the processed vertices
`and outputs parameters and samples, such as vertices, to a
`Raster Unit 210. Raster Unit 210 performs scan conversion
`on samples, such as vertices, and outputs samples, such as
`fragments, to a Pixel Input Buffer 215. Alternatively, Raster
`Unit 210 resamples processed vertices and outputs addi-
`tional vertices to Pixel Input Buffer 215.
`
`Pixel Input Buffer 215 outputs the samples to each Execu-
`tion Pipeline 240. Samples, such as pixels and fragments,
`output by Pixel Input Buffer 215 are each processed by only
`one of the Execution Pipelines 240. Pixel Input Buffer 215
`determines which one of the Execution Pipelines 240 to
`output each sample to depending on an output pixel position,
`e.g., (x,y), associated with each sample. In this manner, each
sample is output to the Execution Pipeline 240 designated to
process samples associated with the output pixel position. In
`an alternate embodiment, each sample output by Pixel Input
`Buffer 215 is processed by one of any available Execution
`Pipelines 240.
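
The position-based routing described above can be sketched in Python. The patent does not state how an output pixel position is mapped to a pipeline, so the 2x2 interleave below, the function name `pipeline_for_pixel`, and the constant `NUM_PIPELINES` are assumptions made only for illustration.

```python
NUM_PIPELINES = 4  # FIG. 2 shows four Execution Pipelines 240


def pipeline_for_pixel(x: int, y: int) -> int:
    """Return the index of the pipeline designated for output position (x, y).

    Assumed mapping (not from the patent): pipelines are interleaved in a
    2x2 pattern, so horizontally and vertically adjacent pixels go to
    different pipelines.
    """
    return (y % 2) * 2 + (x % 2)


# Samples that share an output position always reach the same pipeline.
first = pipeline_for_pixel(10, 7)
second = pipeline_for_pixel(10, 7)
```

A fixed mapping of this kind keeps all samples for one output position in a single pipeline; the alternate embodiment, in which any available pipeline takes the sample, would drop the mapping entirely.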
Each Execution Pipeline 240 signals to Pixel Input Buffer
215 when a sample can be accepted or when a sample cannot
`be accepted as described further herein. Program instruc-
`tions configure programmable computation units (PCUs)
`within an Execution Pipeline 240 to perform operations such
`as tessellation, perspective correction,
`texture mapping,
`shading, blending, and the like. Processed samples are
`output from each Execution Pipeline 240 to a Pixel Output
`Buffer 270. Pixel Output Buffer 270 optionally stores the
processed samples in a register file, FIFO, cache, or the like
`(not shown). The processed samples are output from Pixel
`Output Buffer 270 to Raster Analyzer 160.
FIG. 3 is a block diagram of an embodiment of Execution
Pipeline 240 of FIG. 1 including at least one Multithreaded
Processing Unit 300. An Execution Pipeline 240 can contain
a plurality of Multithreaded Processing Units 300, each
Multithreaded Processing Unit 300 containing at least one
PCU 375. PCUs 375 are configured using program instructions
read by a Thread Control Unit 320 via Texture Unit
225. Thread Control Unit 320 gathers source data specified
by the program instructions and dispatches the source data
and program instructions to at least one PCU 375. PCUs 375
perform computations specified by the program instructions
and output data to at least one destination, e.g., Pixel
Output Buffer 270, Vertex Output Buffer 260 and Thread
Control Unit 320.
`
`A single program may be used to process several sets of
`samples. Thread Control Unit 320 receives samples or
`pointers to samples stored in Pixel Input Buffer 215 and
`Vertex Input Buffer 220. Thread Control Unit 320 receives
`a pointer to a program to process one or more samples.
Thread Control Unit 320 assigns a thread to each sample to
be processed. A thread includes a pointer to a program
instruction (program counter), such as the first instruction
within the program, thread state information, and storage
resources for storing intermediate data generated during
processing of the sample. Thread state information is stored
in a TSR (Thread Storage Resource) 325. TSR 325 may be
a register file, FIFO, circular buffer, or the like. An instruction
specifies the location of source data needed to execute
the instruction. Source data, such as intermediate data generated
during processing of the sample, is stored in a Register
File 350. In addition to Register File 350, other source data
may be stored in Pixel Input Buffer 215 or Vertex Input
Buffer 220. In an alternate embodiment source data is stored
in Local Memory 140, locations in Host Memory 112, and
the like.
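
As a rough illustration of the assignment step just described, the Python sketch below gives each sample its own thread, with a program counter initialised to the first instruction of a shared program. The class names `Thread` and `ThreadControlUnit`, the field names, and the use of a plain list for the TSR are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    program_counter: int   # pointer to the next program instruction
    sample: object         # the sample (or a pointer to it) being processed
    registers: list = field(default_factory=list)  # intermediate data storage


class ThreadControlUnit:
    """Toy stand-in for Thread Control Unit 320 (illustrative only)."""

    def __init__(self) -> None:
        self.tsr = []      # stands in for TSR 325

    def assign(self, sample, program_start: int) -> Thread:
        """Assign a thread to a sample, starting at the program's first instruction."""
        thread = Thread(program_counter=program_start, sample=sample)
        self.tsr.append(thread)
        return thread


tcu = ThreadControlUnit()
# One program (assumed to start at address 64) processes several samples.
threads = [tcu.assign(s, program_start=64) for s in ("v0", "v1", "v2")]
```

Each thread gets its own register storage, which is what lets one program process several samples independently.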
`Alternatively, in an embodiment permitting multiple pro-
`grams for two or more thread types, Thread Control Unit 320
`also receives a program identifier specifying which one of
`the two or more programs the program counter is associated
with. Specifically, in an embodiment permitting simultaneous
execution of four programs for a thread type, two bits
of thread state information are used to store the program
identifier for a thread. Multithreaded execution of programs
is possible because each thread may be executed independently
of other threads, regardless of whether the other threads
are executing the same program or a different program.
PCUs 375 update each program counter associated with the
threads in Thread Control Unit 320 following the execution
of an instruction. For execution of a loop, call, return, or
branch instruction the program counter may be updated
based on the loop, call, return, or branch instruction.
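
The two-bit program identifier can be illustrated with a small packing routine in Python. Only the bit count comes from the text; the layout (identifier in the low bits, program counter above it) and the names `pack_state` and `unpack_state` are assumptions for this sketch.

```python
PROGRAM_ID_BITS = 2                           # two bits identify one of four programs
PROGRAM_ID_MASK = (1 << PROGRAM_ID_BITS) - 1  # 0b11


def pack_state(program_id: int, program_counter: int) -> int:
    """Pack a program identifier and a program counter into one state word."""
    if not 0 <= program_id <= PROGRAM_ID_MASK:
        raise ValueError("program_id must fit in two bits")
    if program_counter < 0:
        raise ValueError("program_counter must be non-negative")
    return (program_counter << PROGRAM_ID_BITS) | program_id


def unpack_state(word: int) -> tuple:
    """Return (program_id, program_counter) from a packed state word."""
    return word & PROGRAM_ID_MASK, word >> PROGRAM_ID_BITS


word = pack_state(program_id=3, program_counter=100)
```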
`For example, each fragment or group of fragments within
`a primitive can be processed independently from the other
`fragments or from the other groups of fragments within the
`primitive. Likewise, each vertex within a surface can be
`processed independently from the other vertices within the
`surface. For a set of samples being processed using the same
`program, the sequence of program instructions associated
`with each thread used to process each sample within the set
`will be identical, although the program counter for each
`thread may vary. However, it is possible that, during execu-
`tion, the threads processing some of the samples within a set
`will diverge following the execution of a conditional branch
`instruction. After the execution of a conditional branch
`instruction, the sequence of executed instructions associated
`with each thread processing samples within the set may
`differ and each program counter stored in TSR 325 within
`Thread Control Unit 320 for the threads may differ accord-
`ingly.
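
The divergence described above can be shown with a toy interpreter in Python. The instruction set (`BRANCH_IF_NEG`, `NOP`), the program, and the sample values are invented for illustration; the only point taken from the text is that threads running the same program can hold different program counters after a conditional branch.

```python
# Each instruction is (opcode, operand). "BRANCH_IF_NEG" jumps to the operand
# when the thread's sample value is negative; every other case falls through.
PROGRAM = [
    ("BRANCH_IF_NEG", 3),  # address 0
    ("NOP", None),         # address 1
    ("NOP", None),         # address 2
    ("NOP", None),         # address 3
]


def step(program_counter: int, sample_value: float) -> int:
    """Execute one instruction and return the updated program counter."""
    opcode, operand = PROGRAM[program_counter]
    if opcode == "BRANCH_IF_NEG" and sample_value < 0:
        return operand
    return program_counter + 1


samples = [2.5, -1.0, 0.0]
counters = [0 for _ in samples]   # all threads start at the first instruction
counters = [step(pc, value) for pc, value in zip(counters, samples)]
print(counters)  # [1, 3, 1]
```

After one step the second thread has taken the branch while the other two have not, so the per-thread program counters no longer agree.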
FIG. 4 is an illustration of an alternate embodiment of
Execution Pipeline 240 containing at least one Multithreaded
Processing Unit 400. Thread Control Unit 420
`includes a TSR 325 to retain thread state data. In one
`embodiment TSR 325 stores thread state data for each of at
`least two thread types, where the at least two thread types
`may include pixel, primitive, and vertex. Thread state data
`for a thread may include, among other things, a program
`counter, a busy flag that indicates if the thread is either
`assigned to a sample or available to be assigned to a sample,
a pointer to a source sample to be processed by the instructions
associated with the thread or the output pixel position
`and output buffer ID of the source sample to be processed,
`and a pointer specifying a destination location in Vertex
`Output Buffer 260 or Pixel Output Buffer 270. Additionally,
`thread state data for a thread assigned to a sample may
`include the sample type, e.g., pixel, vertex, primitive, or the
`like. The type of data a thread processes identifies the thread
`type, e.g., pixel, vertex, primitive, or the like. For example,
`a thread may process a primitive, producing a vertex. After
`the vertex is rasterized and fragments are generated, the
`thread may process a fragment.
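
One way to picture a TSR location holding the thread state data listed above is the Python sketch below. The field names, the `SampleType` enum, and the `assign` and `release` helpers are hypothetical; the patent lists the kinds of data stored but not a concrete layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SampleType(Enum):
    PIXEL = "pixel"
    VERTEX = "vertex"
    PRIMITIVE = "primitive"


@dataclass
class ThreadEntry:
    """One TSR location; the layout is illustrative, not from the patent."""
    busy: bool = False                         # assigned to a sample, or free
    program_counter: int = 0
    source_pointer: Optional[int] = None       # where the source sample lives
    destination_pointer: Optional[int] = None  # location in an output buffer
    sample_type: Optional[SampleType] = None   # set only while assigned


def assign(entry: ThreadEntry, sample_type: SampleType, source: int,
           destination: int, program_start: int) -> None:
    """Mark a free entry busy and record what it is processing."""
    if entry.busy:
        raise RuntimeError("thread entry is already assigned")
    entry.busy = True
    entry.program_counter = program_start
    entry.source_pointer = source
    entry.destination_pointer = destination
    entry.sample_type = sample_type


def release(entry: ThreadEntry) -> None:
    """Free the entry so it can later be assigned to a sample of any type."""
    entry.busy = False
    entry.sample_type = None


tsr = [ThreadEntry() for _ in range(4)]
assign(tsr[0], SampleType.VERTEX, source=0, destination=0, program_start=16)
```

Because the sample type is recorded per entry rather than fixed per location, the same entry can serve a vertex now and a fragment later, which mirrors the example in the paragraph above.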
Source samples are stored in either Pixel Input Buffer 215
or Vertex Input Buffer 220. Thread allocation priority, as
described further herein, is used to assign a thread to a
source sample. A thread allocation priority is specified for
each sample type and Thread Control Unit 420 is configured
to assign threads to samples or allocate locations in a
Register File 350 based on the priority assigned to each
sample type. The thread allocation priority may be fixed,
programmable, or dynamic. In one embodiment the thread
allocation priority may be fixed, always giving priority to
allocating vertex threads, and pixel threads are only allocated
if vertex samples are not available for assignment to a
thread.
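
A minimal Python sketch of the fixed-priority embodiment follows: vertex samples are always served first, and a pixel sample receives a thread only when no vertex sample is waiting. The function name and the use of `collections.deque` to model the input buffers are assumptions.

```python
from collections import deque


def next_sample_fixed_priority(vertex_buffer: deque, pixel_buffer: deque):
    """Pick the next sample to receive a thread, always preferring vertices.

    Returns a (sample_type, sample) pair, or None when both buffers are empty.
    """
    if vertex_buffer:
        return "vertex", vertex_buffer.popleft()
    if pixel_buffer:
        return "pixel", pixel_buffer.popleft()
    return None


vertex_input = deque(["v0"])
pixel_input = deque(["p0", "p1"])
order = []
while (choice := next_sample_fixed_priority(vertex_input, pixel_input)) is not None:
    order.append(choice)
print(order)  # [('vertex', 'v0'), ('pixel', 'p0'), ('pixel', 'p1')]
```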
In an alternate embodiment, Thread Control Unit 420 is
configured to assign threads to source samples or allocate
locations in Register File 350 using thread allocation priorities
based on an amount of sample data in Pixel Input
Buffer 215 and another amount of sample data in Vertex
Input Buffer 220. Dynamically modifying a thread allocation
priority for vertex samples based on the amount of
sample data in Vertex Input Buffer 220 permits Vertex Input
Buffer 220 to drain faster and fill Vertex Output Buffer 260
and Pixel Input Buffer 215 faster or drain slower and fill
Vertex Output Buffer 260 and Pixel Input Buffer 215 slower.
Dynamically modifying a thread allocation priority for pixel
samples based on the amount of sample data in Pixel Input
Buffer 215 permits Pixel Input Buffer 215 to drain faster and
fill Pixel Output Buffer 270 faster or drain slower and fill
Pixel Output Buffer 270 slower. In a further alternate
embodiment, Thread Control Unit 420 is configured to
assign threads to source samples or allocate locations in
Register File 350 using thread allocation priorities based on
graphics primitive size (number of pixels or fragments
included in a primitive) or a number of graphics primitives
in Vertex Output Buffer 260. For example a dynamically
determined thread allocation priority may be determined
based on a number of “pending” pixels, i.e., the number of
pixels to be rasterized from the primitives in Primitive
Assembly/Setup 205 and in Vertex Output Buffer 260.
Specifically, the thread allocation priority may be tuned such
that the number of pending pixels produced by processing
vertex threads is adequate to achieve maximum utilization of
the computation resources in Execution Pipelines 240 processing
pixel threads.
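
The dynamic embodiment based on input-buffer occupancy might be sketched in Python as follows. The text says only that priorities depend on the amount of sample data in each input buffer; the proportional rule, the tie-break in favour of vertices, and the function names are assumptions chosen to keep the example small.

```python
def allocation_priorities(vertex_fill: int, pixel_fill: int) -> dict:
    """Return a priority in [0, 1] per sample type, proportional to buffer fill.

    Assumed rule: a fuller input buffer gets a higher priority so that it
    drains faster. Empty buffers share the priority equally.
    """
    total = vertex_fill + pixel_fill
    if total == 0:
        return {"vertex": 0.5, "pixel": 0.5}
    return {"vertex": vertex_fill / total, "pixel": pixel_fill / total}


def preferred_type(vertex_fill: int, pixel_fill: int) -> str:
    """Pick the sample type that should receive the next free thread."""
    priorities = allocation_priorities(vertex_fill, pixel_fill)
    return "vertex" if priorities["vertex"] >= priorities["pixel"] else "pixel"


print(preferred_type(vertex_fill=1, pixel_fill=5))  # pixel
```

A priority driven by pending pixels, as in the further alternate embodiment, would replace the fill counts with an estimate of pixels still to be rasterized.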
`Once a thread is assigned to a source sample, the thread
`is allocated storage resources such as locations in a Register
`File 350 to retain intermediate data generated during execu-
`tion of program instructions associated with the thread.
`Alternatively, source data is stored in storage resources
`including Local Memory 140, locations in Host Memory
112, and the like.
`A Thread Selection Unit 415 reads one or more thread
`entries, each containing thread state data,
`from Thread
`Control Unit 420. Thread Selection Unit 415 may read
`thread entries to process a group of samples. For example,
`in one embodiment a group of samples, e.g., a number of
`vertices defining a primitive,
`four adjacent
`fragments
`arranged in a square, or the like, are processed simulta-
`neously. In the one embodiment computed values such as
`derivatives are shared within the group of samples thereby
`reducing the number of computations needed to process the
`group of samples compared with processing the group of
`samples without sharing the computed values.
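
To illustrate sharing computed values within a group, the Python sketch below computes screen-space derivatives once for four adjacent fragments arranged in a square. The finite-difference averaging and the name `quad_derivatives` are assumptions; the patent states only that values such as derivatives are shared within the group.

```python
def quad_derivatives(quad):
    """Compute (d/dx, d/dy) once for a 2x2 block of fragment values.

    quad = ((v00, v10), (v01, v11)), where the first index is x and the
    second is y. The differences along each axis are averaged (an assumed
    choice), and the result is shared by all four fragments.
    """
    (v00, v10), (v01, v11) = quad
    ddx = ((v10 - v00) + (v11 - v01)) / 2
    ddy = ((v01 - v00) + (v11 - v10)) / 2
    return ddx, ddy


# A texture coordinate that grows by 1.0 per pixel in x and 2.0 per pixel in y.
shared = quad_derivatives(((0.0, 1.0), (2.0, 3.0)))
print(shared)  # (1.0, 2.0)
```

Processing the four fragments separately would require each one to obtain its neighbours' values again; computing the pair once per group avoids that repetition.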
`In Multithreaded Processing Unit 400, a thread execution
`priority is specified for each thread type and Thread Selec-
`tion Unit 415 is configured to read thread entries based on
the thread execution priority assigned to each thread type. A
thread execution priority may be fixed, programmable, or
dynamic. In one embodiment the thread execution priority
may be fixed, always giving priority to execution of vertex
threads, and pixel threads are only executed if vertex threads
are not available for execution.
`In anothe