`
`
UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office

May 9, 2023

THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM
THE RECORDS OF THIS OFFICE OF:

PATENT NUMBER: 7,015,913
ISSUE DATE: March 21, 2006

By Authority of the
Under Secretary of Commerce for Intellectual Property
and Director of the United States Patent and Trademark Office

Certifying Officer
`
`Realtek Ex. 1006
`Case No. IPR2023-00922
`Page 1 of 21
`
`
`
US007015913B1

(12) United States Patent
Lindholm et al.

(10) Patent No.: US 7,015,913 B1
(45) Date of Patent: Mar. 21, 2006

(54) METHOD AND APPARATUS FOR MULTITHREADED PROCESSING OF DATA
IN A PROGRAMMABLE GRAPHICS PROCESSOR

(75) Inventors: John Erik Lindholm, Saratoga, CA (US);
Rui M. Bastos, Santa Clara, CA (US);
Harold Robert Feldman Zatz, Palo Alto, CA (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 164 days.

(21) Appl. No.: 10/608,346

(22) Filed: Jun. 27, 2003

(51) Int. Cl.
G06F 15/00 (2006.01)
G06F 12/02 (2006.01)
G06F 9/46 (2006.01)
G06T 1/00 (2006.01)

(52) U.S. Cl.: 345/501; 345/543; 718/102; 718/104

(58) Field of Classification Search: 345/501,
345/504, 520, 522, 420, 423, 503, 519, 506,
345/543, 557; 709/208, 231; 712/23, 28,
712/31-32; 718/102, 104
See application file for complete search history.

(56) References Cited
U.S. PATENT DOCUMENTS

5,818,469 A *       10/1998  Lawless et al.   345/522
5,946,487 A *        8/1999  Dangelo          717/148
6,088,044 A *        7/2000  Kwok et al.      345/505
6,753,878 B1 *       6/2004  Heirich et al.   345/629
6,765,571 B1 *       7/2004  Sowizral et al.  345/420
2003/0140179 A1 *    7/2003  Wilt et al.      709/321
2004/0160446 A1 *    8/2004  Gosalia et al.   345/503

* cited by examiner

Primary Examiner: Kee M. Tung
(74) Attorney, Agent, or Firm: Patterson & Sheridan, LLP

(57) ABSTRACT
`
A graphics processor and method for executing a graphics
program as a plurality of threads where each sample to be
processed by the program is assigned to a thread. Although
threads share processing resources within the programmable
graphics processor, the execution of each thread can proceed
independent of any other threads. For example, instructions
in a second thread are scheduled for execution while
execution of instructions in a first thread is stalled waiting for
source data. Consequently, a first received sample (assigned
to the first thread) may be processed after a second received
sample (assigned to the second thread). A benefit of
independently executing each thread is improved performance
because a stalled thread does not prevent the execution of
other threads.
`
`33 Claims, 9 Drawing Sheets
`
[Front-page drawing: block diagram of an Execution Pipeline
containing Multithreaded Processing Unit 400, with Instruction
Cache 410, Thread Control Buffer 420, Instruction Scheduler,
Instruction Dispatcher, Resource Scoreboard 460, Register File 450,
and Execution Unit. Inputs are labeled From 215, From 220, and
From 225; outputs are labeled To 225, To 260, and To 270.]
`
`
`
[Sheet 1 of 9. FIG. 1: block diagram of Computing System 100.
Labeled elements: Host Computer 110 (Host Memory 112, Host
Processor 114, System Interface 115); Graphics Subsystem 170
(Graphics Processor 105 with Graphics Interface 117, Front End 130,
IDX 135, Programmable Graphics Processing Pipeline 150, Memory
Controller 120, Raster Analyzer 160, Output Controller 180);
Local Memory 140.]
`
`
`
[Sheet 2 of 9. FIG. 2: block diagram of Programmable Graphics
Processing Pipeline 150. Labeled elements: input From 135;
Primitive Assembly/Setup 205; Raster Unit 210; Pixel Input
Buffer 215; Vertex Input Buffer 220; Texture Unit 225; Texture
Cache 230; four Execution Pipelines 240; Vertex Output Buffer 260;
Pixel Output Buffer 270; connections to 120.]
`
`
`
[Sheet 3 of 9. FIG. 3: conceptual diagram of a program and its
threads. Labeled elements: 330; 331, 332, 333; 342, 343, 344;
350; 360; 370.]
`
`
`
[Sheet 4 of 9. FIG. 4: block diagram of an Execution Pipeline
containing a Multithreaded Processing Unit. Labeled elements:
Instruction Cache 410; Thread Control Buffer 420; Instruction
Scheduler 430; Instruction Dispatcher 440; Register File 450;
Resource Scoreboard 460; Execution Unit. Inputs are labeled
From 215, From 220, and From 225; an output is labeled To 225.]
`
`
`
[Sheet 5 of 9. FIG. 5A: flowchart. Step labels: Receive a Sample;
Receive Another Sample; Source Data Available for the Sample?;
Source Data Available for the Other Sample?; Dispatch Instruction
for Processing the Sample; Dispatch Instruction for Processing the
Other Sample.]
`
`
`
[Sheet 6 of 9. FIG. 5B: flowchart. Step labels: Receive Sample;
Identify Thread Type Needed; PIOR disabled for pixel threads?;
Pixel Thread Position Hazard?; Thread Available?; Assign
Thread 530; Allocate Resources 533; Fetch Instruction 535;
Can Schedule?; Update PC and Resource Scoreboard 540; Dispatch
Instruction 543; Process Sample 545; More Instructions?;
Deallocate Resources.]
`
`
`
[Sheet 7 of 9. FIG. 6: flowchart. Step labels: Instruction in
window?; Timeout?; Remove instruction from window; Check synch
mode; synchronization enabled?; Instruction synched?; Remove
instruction from window; Sort by thread age 640; Read
scoreboard 645; Check resources 650; Schedule instruction; Update
scoreboard 660; Update PC 670; Output instruction.]
`
`
`
[Sheet 8 of 9. FIG. 7A: flowchart. Step labels: Issue Function Call
to enable PIOR; PIOR configuration complete.
FIG. 7B: flowchart. Step labels: Configure enable PIOR; Render
intersecting objects; Configure to disable PIOR; Render
non-intersecting objects.]
`
`
`
[Sheet 9 of 9. FIG. 7C: flowchart. Step labels: Configure enable
PIOR; Render opaque objects; Configure to disable PIOR; Render
non-opaque objects.]
`
`
`
`METHOD AND APPARATUS FOR
`MULTITHREADED PROCESSING OF DATA
`IN A PROGRAMMABLE GRAPHICS
`PROCESSOR
`
`FIELD OF THE INVENTION
`
`One or more aspects of the invention generally relate to
`multithreaded processing, and more particularly to process-
`ing graphics data in a programmable graphics processor.
`
`BACKGROUND
`
Current graphics data processing is exemplified by systems
and methods developed to perform a specific operation
on several graphics data elements, e.g., linear interpolation,
tessellation, texture mapping, depth testing. Traditionally
graphics processing systems were implemented as fixed
function computation units and more recently the computation
units are programmable to perform a limited set of
operations. In either system, the graphics data elements are
processed in the order in which they are received by the
graphics processing system. Within the graphics processing
system, when a resource, e.g., computation unit or data,
required to process a graphics data element is unavailable,
the processing of the element stalls, i.e., does not proceed,
until the resource becomes available. Because the system is
pipelined, the stall propagates back through the pipeline,
stalling the processing of later received elements that may
not require the resource and reducing the throughput of the
system.
`For the foregoing reasons, there is a need for improved
`approaches to processing graphics data elements.
`
`SUMMARY
`
`The present invention is directed to a system and method
`that satisfies the need for a programmable graphics proces-
`sor that supports processing of graphics data elements in an
`order independent from the order in which the graphics data
`elements are received by the programmable graphics pro-
`cessing pipeline within the programmable graphics proces-
`sor.
`
Various embodiments of the invention include a computing
system comprising a host processor, a host memory, a
system interface configured to interface with the host
processor, and the programmable graphics processor for
multithreaded execution of program instructions. The graphics
processor includes at least one multithreaded processing unit
configured to receive samples in a first order to be processed
by program instructions associated with at least one thread.
Each multithreaded processing unit includes a scheduler
configured to receive the program instructions, determine
availability of source data, and schedule the program
instructions for execution in a second order independent of
the first order. Each multithreaded processing unit further
includes a resource tracking unit configured to track the
availability of the source data, and a dispatcher configured
to output the program instructions in the second order to be
executed by the at least one multithreaded processing unit.
Further embodiments of the invention include an application
programming interface for a programmable graphics
processor comprising a function call to configure a
multithreaded processing unit within the programmable graphics
processor to enable processing of samples independent of an
order in which the samples are received.
Yet further embodiments of the invention include an
application programming interface for a programmable
graphics processor comprising a function call to configure a
multithreaded processing unit within the programmable
graphics processor to disable processing of samples
independent of an order in which the samples are received.
Various embodiments of a method of the invention
include processing a first program instruction associated
with a first thread and a second program instruction
associated with a second thread. A first sample to be processed
by a program instruction associated with a first thread is
received before a second sample to be processed by a
program instruction associated with a second thread is
received. First source data required to process the program
instruction associated with the first thread are determined to
be not available. Second source data required to process the
program instruction associated with the second thread are
determined to be available. The program instruction
associated with the second thread to process the second sample
in the execution unit is dispatched prior to dispatching the
program instruction associated with the first thread to
process the first sample in the execution unit.
Further embodiments of a method of the invention include
using a function call to configure the graphics processor.
Support for processing samples of at least one sample type
independent of an order in which the samples are received
by a multithreaded processing unit within the graphics
processor is detected. The function call to configure the
multithreaded processing unit within the graphics processor
to enable processing of the samples independent of an order
in which the samples are received is issued for the at least
one sample type.
Yet further embodiments of a method of the invention
include rendering a scene using the graphics processor. The
multithreaded processing unit within the graphics processor
is configured to enable processing of samples independent of
an order in which the samples are received. The
multithreaded processing unit within the graphics processor
processes the samples independent of the order in which the
samples are received to render at least a portion of the scene.

BRIEF DESCRIPTION OF THE VARIOUS
VIEWS OF THE DRAWINGS

Accompanying drawing(s) show exemplary
embodiment(s) in accordance with one or more aspects of
the present invention; however, the accompanying
drawing(s) should not be taken to limit the present invention
to the embodiment(s) shown, but are for explanation and
understanding only.
FIG. 1 illustrates one embodiment of a computing system
according to the invention including a host computer and a
graphics subsystem;
FIG. 2 is a block diagram of an embodiment of the
Programmable Graphics Processing Pipeline of FIG. 1;
FIG. 3 is a conceptual diagram of the relationship between
a program and threads;
FIG. 4 is a block diagram of an embodiment of the
Execution Pipeline of FIG. 2;
FIGS. 5A and 5B illustrate embodiments of methods
utilizing the Execution Pipeline illustrated in FIG. 4;
FIG. 6 illustrates an embodiment of a method utilizing the
Execution Pipeline illustrated in FIG. 4;
FIGS. 7A, 7B, and 7C illustrate embodiments of methods
utilizing the Computing System illustrated in FIG. 1.
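
As an illustration of the method embodiment summarized in this
document (an instruction of a second thread dispatched ahead of an
instruction of a stalled first thread), the following Python sketch
models a dispatcher that skips a thread whose source data is not yet
available. The names Thread, ready_at, and dispatch_order, the cycle
counts, and the one-dispatch-per-cycle rule are assumptions made for
this example only and do not come from the patent; it is a simplified
model, not the patented implementation.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    """One thread assigned to one sample (hypothetical model)."""
    thread_id: int  # order in which the thread's sample was received
    ready_at: int   # cycle at which the thread's source data is available


def dispatch_order(threads):
    """Return thread ids in the order their instructions are dispatched.

    Assumption: at most one instruction is dispatched per cycle, and the
    oldest thread whose source data is available is chosen. A stalled
    thread does not block threads received after it.
    """
    pending = sorted(threads, key=lambda t: t.thread_id)
    order = []
    cycle = 0
    while pending:
        ready = [t for t in pending if t.ready_at <= cycle]
        if ready:
            chosen = ready[0]  # oldest thread with available source data
            pending.remove(chosen)
            order.append(chosen.thread_id)
        cycle += 1  # time advances whether or not something was dispatched
    return order


# The first received sample (thread 0) waits for source data until cycle 3;
# the second received sample (thread 1) has its source data immediately.
print(dispatch_order([Thread(0, ready_at=3), Thread(1, ready_at=0)]))
```

In this model the second thread is dispatched first, so the dispatch
order differs from the order in which the samples were received.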
`
`
`
DISCLOSURE OF THE INVENTION

The current invention involves new systems and methods
for processing graphics data elements in an order independent
from the order in which the graphics data elements are
received by a multithreaded processing unit within a graphics
processor.
FIG. 1 is an illustration of a Computing System generally
designated 100 and including a Host Computer 110 and a
Graphics Subsystem 170. Computing System 100 may be a
desktop computer, server, laptop computer, palm-sized
computer, tablet computer, game console, cellular telephone,
computer based simulator, or the like. Host Computer 110
includes Host Processor 114 which may include a system
memory controller to interface directly to Host Memory 112
or may communicate with Host Memory 112 through a
System Interface 115. System Interface 115 may be an I/O
(input/output) interface or a bridge device including the
system memory controller to interface directly to Host
Memory 112. Examples of System Interface 115 known in
the art include Intel® Northbridge and Intel® Southbridge.
Host Computer 110 communicates with Graphics
Subsystem 170 via System Interface 115 and a Graphics
Interface 117 within a Graphics Processor 105. Data received at
Graphics Interface 117 can be passed to a Front End 130 or
written to a Local Memory 140 through Memory Controller
120. Graphics Processor 105 uses graphics memory to store
graphics data and program instructions, where graphics data
is any data that is input to or output from components within
the graphics processor. Graphics memory can include
portions of Host Memory 112, Local Memory 140, register files
coupled to the components within Graphics Processor 105,
and the like.
Graphics Processor 105 includes, among other
components, Front End 130 that receives commands from Host
Computer 110 via Graphics Interface 117. Front End 130
interprets and formats the commands and outputs the
formatted commands and data to an IDX (Index Processor)
135. Some of the formatted commands are used by
Programmable Graphics Processing Pipeline 150 to initiate
processing of data by providing the location of program
instructions or graphics data stored in memory. IDX 135,
Programmable Graphics Processing Pipeline 150 and a
Raster Analyzer 160 each include an interface to Memory
Controller 120 through which program instructions and data
can be read from memory, e.g., any combination of Local
Memory 140 and Host Memory 112. When a portion of Host
Memory 112 is used to store program instructions and data,
the portion of Host Memory 112 can be uncached so as to
increase performance of access by Graphics Processor 105.
IDX 135 optionally reads processed data, e.g., data
written by Raster Analyzer 160, from memory and outputs the
data, processed data and formatted commands to
Programmable Graphics Processing Pipeline 150. Programmable
Graphics Processing Pipeline 150 and Raster Analyzer 160
each contain one or more programmable processing units to
perform a variety of specialized functions. Some of these
functions are table lookup, scalar and vector addition,
multiplication, division, coordinate-system mapping, calculation
of vector normals, tessellation, calculation of derivatives,
interpolation, and the like. Programmable Graphics
Processing Pipeline 150 and Raster Analyzer 160 are each
optionally configured such that data processing operations
are performed in multiple passes through those units or in
multiple passes within Programmable Graphics Processing
Pipeline 150. Programmable Graphics Processing Pipeline
150 and a Raster Analyzer 160 also each include a write
interface to Memory Controller 120 through which data can
be written to memory.
In a typical implementation Programmable Graphics
Processing Pipeline 150 performs geometry computations,
rasterization, and pixel computations. Therefore Programmable
Graphics Processing Pipeline 150 is programmed to operate
on surface, primitive, vertex, fragment, pixel, sample or any
other data. A fragment is at least a portion of a pixel, i.e., a
pixel includes at least one fragment. For simplicity, the
remainder of this description will use the term “samples” to
refer to surfaces, primitives, vertices, pixels, or fragments.
Samples output by Programmable Graphics Processing
Pipeline 150 are passed to a Raster Analyzer 160, which
optionally performs near and far plane clipping and raster
operations, such as stencil, z test, and the like, and saves the
results or the samples output by Programmable Graphics
Processing Pipeline 150 in Local Memory 140. When the
data received by Graphics Subsystem 170 has been
completely processed by Graphics Processor 105, an Output 185
of Graphics Subsystem 170 is provided using an Output
Controller 180. Output Controller 180 is optionally
configured to deliver data to a display device, network, electronic
control system, other Computing System 100, other
Graphics Subsystem 170 or the like.
FIG. 2 is an illustration of Programmable Graphics
Processing Pipeline 150 of FIG. 1. At least one set of samples
is output by IDX 135 and received by Programmable
Graphics Processing Pipeline 150 and the at least one set of
samples is processed according to at least one program, the
at least one program including graphics program
instructions. A program can process one or more sets of samples.
Conversely, a set of samples can be processed by a sequence
of one or more programs.
Samples, such as surfaces, primitives, or the like, are
received from IDX 135 by Programmable Graphics
Processing Pipeline 150 and stored in a Vertex Input Buffer 220 in
a register file, FIFO (first in first out), cache, or the like (not
shown). The samples are broadcast to Execution Pipelines
240, four of which are shown in the figure. Each Execution
Pipeline 240 includes at least one multithreaded processing
unit, to be described further herein. The samples output by
Vertex Input Buffer 220 can be processed by any one of the
Execution Pipelines 240. A sample is accepted by an
Execution Pipeline 240 when a processing thread within the
Execution Pipeline 240 is available as described further
herein. Each Execution Pipeline 240 signals to Vertex Input
Buffer 220 when a sample can be accepted or when a sample
cannot be accepted. In one embodiment Programmable
Graphics Processing Pipeline 150 includes a single
Execution Pipeline 240 containing one multithreaded processing
unit. In an alternative embodiment, Programmable Graphics
Processing Pipeline 150 includes a plurality of Execution
Pipelines 240.
Execution Pipelines 240 can receive first samples, such as
higher-order surface data, and tessellate the first samples to
generate second samples, such as vertices. Execution
Pipelines 240 can be configured to transform the second samples
from an object-based coordinate representation (object
space) to an alternatively based coordinate system such as
world space or normalized device coordinates (NDC) space.
Each Execution Pipeline 240 communicates with Texture
Unit 225 using a read interface (not shown in FIG. 2) to read
program instructions and graphics data such as texture maps
from Local Memory 140 or Host Memory 112 via Memory
Controller 120 and a Texture Cache 230. Texture Cache 230
is used to improve memory read performance by reducing
read latency. In an alternate embodiment Texture Cache 230
is omitted. In another alternate embodiment, a Texture Unit
225 is included in each Execution Pipeline 240. In yet
another alternate embodiment program instructions are
stored within Programmable Graphics Processing Pipeline
150.
`Execution Pipelines 240 output processed samples, such
`as vertices, that are stored in a Vertex Output Buffer 260 in
`a register file, FIFO, cache, or the like (not shown). Pro-
`cessed vertices output by Vertex Output Buffer 260 are
`received by a Primitive Assembly/Setup 205. This unit
`calculates parameters, such as deltas and slopes, to rasterize
the processed vertices. Primitive Assembly/Setup 205
outputs parameters and samples, such as vertices, to Raster Unit
`210. The Raster Unit 210 performs scan conversion on
`samples, such as vertices, and outputs samples, such as
`fragments, to a Pixel Input Buffer 215. Alternatively, Raster
`Unit 210 resamples processed vertices and outputs addi-
`tional vertices to Pixel Input Buffer 215.
`Pixel Input Buffer 215 outputs the samples to each Execu-
`tion Pipeline 240. Samples, such as pixels and fragments,
`output by Pixel Input Buffer 215 are each processed by only
`one of the Execution Pipelines 240. Pixel Input Buffer 215
`determines which one of the Execution Pipelines 240 to
output each sample to depending on an output pixel position,
e.g., (x,y), associated with each sample. In this manner, each
sample is output to the Execution Pipeline 240 designated to
process samples associated with the output pixel position. In
`an alternate embodiment, each sample output by Pixel Input
`Buffer 215 is processed by an available Execution Pipeline
`240.
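
The two routing choices described above can be sketched in Python as
follows. The patent does not give the function that maps an output
pixel position to an Execution Pipeline, so the modulo mapping below
is purely an assumption for illustration, as are the function names
pipeline_for_pixel and first_available.

```python
def pipeline_for_pixel(x, y, num_pipelines=4):
    """Pick the pipeline designated for an output pixel position (x, y).

    Hypothetical mapping: any deterministic function of (x, y) would do.
    Four pipelines are assumed because four are shown in FIG. 2.
    """
    return (x + y) % num_pipelines


def first_available(accepting):
    """Alternate embodiment: use any pipeline that can accept a sample.

    accepting[i] is True when pipeline i signals that it can accept a
    sample. Returns the index of the first such pipeline, or None.
    """
    for index, can_accept in enumerate(accepting):
        if can_accept:
            return index
    return None


# Samples at the same output pixel position always reach the same pipeline.
print(pipeline_for_pixel(5, 2))
print(first_available([False, True, True, False]))
```

Routing by position keeps every sample for a given pixel on one
pipeline, while the alternate embodiment trades that property for the
freedom to use whichever pipeline is free.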
A sample is accepted by an Execution Pipeline 240 when
a processing thread within the Execution Pipeline 240 is
available as described further herein. Each Execution
Pipeline 240 signals to Pixel Input Buffer 215 when a sample can
be accepted or when a sample cannot be accepted. Program
instructions associated with a thread configure programmable
computation units within an Execution Pipeline 240 to
perform operations such as texture mapping, shading,
blending, and the like. Processed samples are output from each
`Execution Pipeline 240 to a Pixel Output Buffer 270. Pixel
`Output Buffer 270 optionally stores the processed samples in
`a register file, FIFO, cache, or the like (not shown). The
`processed samples are output from Pixel Output Buffer 270
`to Raster Analyzer 160.
`Execution Pipelines 240 are optionally configured using
`program instructions read by Texture Unit 225 such that data
`processing operations are performed in multiple passes
through at least one multithreaded processing unit, to be
described further herein, within Execution Pipelines 240.
`Intermediate data generated during multiple passes can be
`stored in graphics memory.
FIG. 3 is a conceptual diagram illustrating the relationship
between a program and threads. A single program is used to
process several sets of samples. Each program, such as a
vertex program or shader program, includes a sequence of
program instructions such as a Sequence 330 of program
instructions 331 to 344. The at least one multithreaded
processing unit within an Execution Pipeline 240 supports
multithreaded execution. Therefore the program instructions
in instruction Sequence 330 can be used by the at least one
multithreaded processing unit to process each sample or
each group of samples independently, i.e., the at least one
multithreaded processing unit may process each sample
asynchronously relative to other samples. For example, each
fragment or group of fragments within a primitive can be
processed independently from the other fragments or from
`
`
the other groups of fragments within the primitive.
Likewise, each vertex within a surface can be processed
independently from the other vertices within the surface. For a
set of samples being processed using the same program, the
sequence of program instructions associated with each
thread used to process each sample within the set will be
identical. However, it is possible that, during execution, the
threads processing some of the samples within a set will
diverge following the execution of a conditional branch
instruction. After the execution of a conditional branch
instruction, the sequence of executed instructions associated
with each thread processing samples within the set may
differ.
In FIG. 3 program instructions within instruction
Sequence 330 are stored in graphics memory, i.e., Host
Memory 112, Local Memory 140, register files coupled to
the components within Graphics Processor 105, and the like.
Each program counter (0 through 13) in instruction
Sequence 330 corresponds to a program instruction within
instruction Sequence 330. The program counters are
conventionally numbered sequentially and can be used as an
index to locate a specific program instruction within
Sequence 330. The first instruction 331 in the sequence 330
is the program instruction corresponding to program
counter 0. A base address, corresponding to the graphics
memory location where the first instruction 331 in a
program is stored, can be used in conjunction with a
program counter to determine the location where a program
instruction corresponding to the program counter is stored.
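
One plausible reading of how a base address and a program counter
combine is sketched below in Python. The patent does not state an
instruction width or an addressing formula, so the fixed 16-byte
instruction size and the name instruction_address are assumptions
made only for this example.

```python
INSTRUCTION_SIZE_BYTES = 16  # assumed fixed instruction width, not from the patent


def instruction_address(base_address, program_counter,
                        instruction_size=INSTRUCTION_SIZE_BYTES):
    """Locate the instruction that a program counter refers to.

    base_address is where the first instruction (program counter 0) is
    stored; later instructions are assumed to follow contiguously.
    """
    if program_counter < 0:
        raise ValueError("program counter must be non-negative")
    return base_address + program_counter * instruction_size


# Program counter 0 is stored at the base address itself.
print(hex(instruction_address(0x1000, 0)))
# Program counter 11 is stored 11 instruction widths past the base.
print(hex(instruction_address(0x1000, 11)))
```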
In this example, program instructions within Sequence
330 are associated with three threads. A Thread 350, a
Thread 360 and a Thread 370 are each assigned to a different
sample and each thread is uniquely identified by a thread
identification code. A program instruction within Sequence
330 is associated with a thread using a program counter that
is stored as a portion of thread state data, as described further
herein. Thread 350 thread state data includes a program
counter of 1 as shown in Sequence 330. The program
counter associated with Thread 350 is a pointer to the
program instruction in Sequence 330 corresponding to
program counter 1 and stored at location 332. The instruction
stored at location 332 is the next instruction to be used to
process the sample assigned to Thread 350. Alternatively, an
instruction stored at location 332 is the most recently
executed instruction to process the sample assigned to
Thread 350.
The thread state data for Thread 360 and Thread 370 each
include a program counter of 11, as shown in FIG. 3,
referencing the program instruction corresponding to
program counter 11 in Sequence 330 and stored at location 342.
Program counters associated with threads to process samples
within a primitive, surface, or the like, are not necessarily
identical because the threads can be executed independently.
When branch instructions are not used, Thread 350, Thread
360 and Thread 370 each execute all of the program
instructions in Sequence 330.
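
The FIG. 3 example, three threads sharing one instruction sequence
while each keeps its own program counter, can be modeled with the
short Python sketch below. The instruction names are placeholders,
the dictionary layout is an assumption, and the sketch adopts the
reading in which a program counter points at the next instruction to
execute (the text also allows it to mark the most recently executed
one).

```python
# Placeholder names for program instructions 331 to 344 of Sequence 330,
# indexed by program counters 0 through 13.
SEQUENCE_330 = [f"instruction_{n}" for n in range(331, 345)]

# Per-thread program counters, matching the FIG. 3 example.
thread_pc = {350: 1, 360: 11, 370: 11}


def next_instruction(thread_id):
    """Look up the instruction that a thread's program counter points at."""
    return SEQUENCE_330[thread_pc[thread_id]]


def step(thread_id):
    """Execute one instruction for one thread; other threads are unaffected."""
    instruction = next_instruction(thread_id)
    thread_pc[thread_id] += 1
    return instruction


for tid in (350, 360, 370):
    print(tid, next_instruction(tid))
```

Calling step(350) advances only Thread 350, which is the sense in
which the threads execute independently of one another.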
The number of threads that can be executed
simultaneously is limited to a predetermined number in each
embodiment and is related to the number of Execution
Pipelines 240, the amount of storage required for thread state
data, the latency of Execution Pipelines 240, and the like.
Each sample is a specific type, e.g., primitive, vertex, or
pixel, corresponding to a program type. A primitive type
sample, e.g., primitive, is processed by a primitive program,
a vertex type sample, e.g., surface or vertex, is processed by
a vertex program, and a pixel type sample, e.g., fragment or
pixel, is processed by a shader program. Likewise, a
primitive thread is associated with program instructions within a
primitive program, a vertex thread is associated with
program instructions within a vertex program, and a pixel
thread is associated with program instructions within a
shader program.
`A number of threads of each thread type that may be
`executed simultaneously is predetermined in each embodi-
`ment. Therefore, not all samples within a set of samples of
`a type can be processed simultaneously when the number of
`threads of the type is less than the number of samples.
Conversely, when the number of threads of a type exceeds
the number of samples of the type within a set, more than
one set can be processed simultaneously. Furthermore, when
the number of threads of a type exceeds the number of
samples of the type within one or more sets, more than one
program of the type can be executed on the one or more sets
`and the thread state data can include data indicating the
`program associated with each thread.
FIG. 4 is an illustration of an Execution Pipeline 240
containing at least one Multithreaded Processing Unit 400.
An Execution Pipeline 240 can contain a plurality of
Multithreaded Processing Units 400. Within each Multithreaded
Processing Unit 400, a Thread Control Buffer 420 receives
samples from Pixel Input Buffer 215 or Vertex Input Buffer
220. Thread Control Buffer 420 includes storage resources
to retain thread state data for a subset of the predetermined
number of threads. In one embodiment Thread Control
Buffer 420 includes storage resources for each of at least two
thread types, where the at least two thread types can include
pixel, primitive, and vertex. At least a portion of Thread
Control Buffer 420 is a register file, FIFO, circular buffer, or
the like. Thread state data for a thread can include, among
other things, a program counter, a busy flag that indicates if
the thread is either assigned to a sample or available to be
assigned to a sample, a pointer to the source sample to be
processed by the instructions associated with the thread or
the output pixel position and output buffer ID of the sample
to be processed, and a pointer specifying a destination
location in Vertex Output Buffer 260 or Pixel Output Buffer
270. Additionally, thread state data for a thread assigned to
a sample can include the sample type, e.g., pixel, vertex,
primitive, or the like.
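
The thread state fields listed above can be pictured as a small
record, sketched here in Python. The field names, the use of integer
indices as pointers, and the ThreadControlBuffer class with its
first-free assignment policy are assumptions made for illustration;
the patent describes hardware storage such as a register file or
FIFO, not this data structure. The alternative of storing an output
pixel position and output buffer ID instead of a source pointer is
omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThreadState:
    """Per-thread state, following the fields named in the text."""
    program_counter: int = 0
    busy: bool = False                   # True when assigned to a sample
    source_sample: Optional[int] = None  # pointer to the source sample
    destination: Optional[int] = None    # location in an output buffer
    sample_type: Optional[str] = None    # e.g. "pixel", "vertex", "primitive"


class ThreadControlBuffer:
    """Hypothetical software model of storage for a fixed number of threads."""

    def __init__(self, num_threads):
        self.entries = [ThreadState() for _ in range(num_threads)]

    def assign(self, source_sample, destination, sample_type):
        """Assign the first free thread to a sample.

        Returns the entry index, used here as the thread identifier,
        or None when every thread is busy.
        """
        for thread_id, state in enumerate(self.entries):
            if not state.busy:
                self.entries[thread_id] = ThreadState(
                    0, True, source_sample, destination, sample_type)
                return thread_id
        return None

    def release(self, thread_id):
        """Mark a thread as available again."""
        self.entries[thread_id] = ThreadState()


tcb = ThreadControlBuffer(num_threads=2)
print(tcb.assign(source_sample=7, destination=3, sample_type="pixel"))
```

When assign returns None the sample cannot be accepted yet, which
corresponds to the case where no processing thread is available.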
The source sample is stored in either Pixel Input Buffer
215 or Vertex Input Buffer 220. When a thread is assigned
to a sample, the thread is allocated storage resources to
retain intermediate data generated during execution of
program instructions associated with the thread. The thread
identification code for a thread may be the address of a
location in Thread Control Buffer 420 in which the thread
state data for the thread is stored. In one embodiment,
priority is specified for each thread type and Thread Control
Buffer 420 is configured to assign threads to samples or
allocate storage resources based on the priority assigned to
each thread type. In an alternate embodiment, Thread
Control Buffer 420 is configured to assign threads to samples or
allocate storage resources based on an amount of sample
data in Pixel Input Buffer 215 and another amount of sample
`data in Vert