Case 3:14-cv-00757-REP-DJN Document 87-4 Filed 04/16/15 Page 1 of 28 PageID# 14337
`
`
EXHIBIT D
`
`
US008174531B1

(12) United States Patent
Lindholm et al.

(10) Patent No.: US 8,174,531 B1
(45) Date of Patent: May 8, 2012
`
`(54) PROGRAMMABLE GRAPHICS PROCESSOR
`FOR MULTITHREADED EXECUTION OF
`PROGRAMS
`
(75) Inventors: John Erik Lindholm, Saratoga, CA (US); Brett W. Coon, San Jose, CA (US); Stuart F. Oberman, Sunnyvale, CA (US); Ming Y. Siu, Santa Clara, CA (US); Matthew P. Gerlach, Commerce Township, MI (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 12/649,201

(22) Filed: Dec. 29, 2009

Related U.S. Application Data
`
(60) Division of application No. 11/458,633, filed on Jul. 19, 2006, which is a continuation-in-part of application No. 10/696,714, filed on Oct. 29, 2003, now Pat. No. 7,103,720, and a continuation-in-part of application No. 10/736,437, filed on Dec. 15, 2003, now Pat. No. 7,139,003, and a continuation-in-part of application No. 11/292,614, filed on Dec. 2, 2005, now Pat. No. 7,836,276.
`
(51) Int. Cl.
     G06F 15/16 (2006.01)
     G06F 15/80 (2006.01)
     G06F 13/14 (2006.01)
     G06T 1/20 (2006.01)
(52) U.S. Cl. ......... 345/505; 345/502; 345/506; 345/520
(58) Field of Classification Search .................. 345/502, 345/505, 520, 506, 522
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
5,421,028 A     5/1995  [name illegible]
5,579,473 A    11/1996  [name illegible] et al.
5,815,166 A     9/1998  Baldwin
5,838,988 A    11/1998  Panwar et al.
5,860,018 A     1/1999  Panwar et al.
[two entries illegible]
5,958,047 A     9/1999  Panwar et al.
5,978,864 A    11/1999  [name illegible]
5,996,060 A    11/1999  [name illegible] et al.
5,999,727 A    12/1999  Panwar et al.
`
`(Continued)
`
FOREIGN PATENT DOCUMENTS

JP    2003-35589    5/2003
`
`OTHER PUBLICATIONS
`
Intel, IA-32 Intel Architecture Software Developer’s Manual, vol. 1,
pp. 11-23 through 11-25. 2004.
`
`(Continued)
`
Primary Examiner - Hau Nguyen
(74) Attorney, Agent, or Firm - Patterson & Sheridan, LLP.
`
`(57)
`
`ABSTRACT
`
`A processing unit includes multiple execution pipelines, each
`of which is coupled to a first input section for receiving input
`data for pixel processing and a second input section for
`receiving input data for vertex processing and to a first output
`section for storing processed pixel data and a second output
`section for storing processed vertex data. The processed ver-
`tex data is rasterized and scan converted into pixel data that is
`used as the input data for pixel processing. The processed
pixel data is output to a raster analyzer.
`
`10 Claims, 14 Drawing Sheets
`
[Representative drawing: block diagram of the Programmable Graphics Processing Pipeline showing Vertex Input Buffer, Pixel Input Buffer, Primitive Assembly/Setup, Raster Unit, four Execution Pipelines, Texture Unit, Texture Cache, Vertex Output Buffer, and Pixel Output Buffer.]

`U.S. PATENT DOCUMENTS
6,178,481 B1        1/2001  Krueger et al.
6,204,856 B1        3/2001  Wood et al.
6,222,550 B1        4/2001  [name illegible] et al.
6,266,733 B1        7/2001  [name illegible]
6,279,086 B1        8/2001  [name illegible]
6,279,100 B1        8/2001  [name illegible]
6,288,730 B1        9/2001  [name illegible]
6,397,300 B1        5/2002  Arimilli et al.
6,405,285 B1        6/2002  Arimilli et al.
6,418,513 B1        7/2002  Arimilli et al.
6,434,667 B1        8/2002  Arimilli et al.
6,446,166 B1        9/2002  Arimilli et al.
6,463,507 B1       10/2002  Arimilli et al.
6,559,852 B1        5/2003  Ashburn et al.
6,658,447 B2       12/2003  Cota-Robles
6,704,925 B1        3/2004  Bugnion
6,750,869 B1        6/2004  Dawson
6,771,264 B1        8/2004  Duluk et al.
6,816,161 B2       11/2004  Lavelle et al.
6,819,325 B2       11/2004  Boyd et al.
6,919,896 B2        7/2005  Sasaki et al. .................. 345/505
[two entries illegible]
7,103,720 B1        9/2006  Moy et al.
7,139,003 B1       11/2006  Kirk et al.
7,237,094 B2        6/2007  Curran et al.
7,254,697 B2        8/2007  Bishop et al.
7,278,011 B2       10/2007  Elsen et al.
7,328,438 B2        2/2008  Armstrong et al.
7,447,873 B1       11/2008  Nordquist
7,577,869 B2        8/2009  Mantor et al.
2001/0056456 A1    12/2001  Cota-Robeles
2003/0097395 A1     5/2003  Peterson
.................. 714/11
2004/0024993 A1     2/2004  Parthasarathy
2004/0194096 A1     9/2004  Armstrong et al.
2004/0207623 A1    10/2004  Isard et al.
2004/0208066 A1    10/2004  Burkey et al.
2005/0108720 A1     5/2005  Cervini
2005/0122330 A1     6/2005  Boyd et al.
2006/0020772 A1     1/2006  Hussain
2006/0155966 A1     7/2006  Burky et al.

`OTHER PUBLICATIONS
`
Intel, IA-32 Intel Architecture Software Developer’s Manual, vol. 2B, p. 4-72. 2004.
Lo, et al. “Converting Thread-Level Parallelism to Instruction-Level Parallelism Via Simultaneous Multithreading,” ACM Transactions on Computer Systems, vol. 15, No. 3, Aug. 1997, pp. 322-354.
Tullsen, et al. “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 1-12.
Eggers, et al. “Simultaneous Multithreading: A Platform for Next-Generation Processors,” IEEE Micro, vol. 17, No. 5, pp. 12-19, Sep./Oct. 1997.
Abstract of JP 2003-35589 with additional translated information.
Translated copy of Japanese Office Action dated Jun. 9, 2008 (provided as an explanation of relevance of Citation No. B1).
Hinton, et al. “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal Q1, 2001, pp. 1-12.
Sen et al., “Shadow Silhouette Maps” Jul. 2003, ACM Transactions on Graphics 22, 3, pp. 521-526.
`
`* cited by examiner
`
`
`
[Drawing sheet 1 of 14. FIG. 1: block diagram of the computing system, showing Host Computer (Host Memory, Host Processor, System Interface) and Graphics Subsystem (Graphics Interface, Graphics Processor, Front End, Memory Controller, Programmable Graphics Processing Pipeline, Raster Analyzer, Local Memory, Output Controller).]

[Drawing sheet 2 of 14. FIG. 2: block diagram of the Programmable Graphics Processing Pipeline, showing Vertex Input Buffer, Pixel Input Buffer, Primitive Assembly/Setup, Raster Unit, four Execution Pipelines, Vertex Output Buffer, and Pixel Output Buffer.]

[Drawing sheet 3 of 14. FIG. 3: block diagram of the Execution Pipeline, showing a Multithreaded Processing Unit with Thread Control Unit and Register File.]

[Drawing sheet 4 of 14. FIG. 4: block diagram of the Execution Pipeline, showing a Multithreaded Processing Unit with Instruction Cache, Thread Selection Unit, Thread Control Unit, TSR, Sequencer, Resource Scheduler, Instruction Dispatcher, Execution Unit, and Register File.]

[Drawing sheet 5 of 14. FIGS. 5A and 5B: flow diagrams with steps Receive Pointer to a Program; Pixel or Vertex?; Assign Pixel Thread; Assign Vertex Thread. FIG. 5B also includes Pass Priority Test? and Thread Available? decisions.]

[Drawing sheet 6 of 14. FIGS. 6A and 6B: no figure text recoverable.]

[Drawing sheet 7 of 14. FIGS. 7A and 7B: flow diagrams with steps Determine allocations; Allocating threads to a first sample type; Allocating threads to a second sample type; Execute First Program Instructions; Execute Second Program Instructions. FIG. 7B also includes Allocating threads to the first sample type.]

[Drawing sheet 8 of 14. FIGS. 8A and 8B: flow diagrams with steps Receive Sample; Identify Sample Type; Position Hazard?; PIOR disabled?; Thread Available?; Assign Thread; Resources Available?; Execute Thread; Deallocate Resources.]

[Drawing sheet 9 of 14. FIGS. 9A and 9B: flow diagrams with steps Determine thread priority; Identify assigned thread(s) for priority; No threads?; Select Thread(s); Identify assigned thread(s); Read Program Counter(s); Update Program Counter(s).]

[Drawing sheet 10 of 14. FIG. 10: block diagram of the Programmable Graphics Processing Pipeline.]

[Drawing sheet 11 of 14. FIG. 11: block diagram showing Raster Unit, Execution Pipelines, and Vertex Output Buffer.]

[Drawing sheet 12 of 14. FIG. 12: block diagram of the Execution Pipeline, showing Instruction Dispatch, Register File, Operand Collection Unit, and Accumulators, with connections to/from 225 and to 260/270.]

[Drawing sheet 13 of 14. FIG. 13: block diagram of Instruction Dispatch, showing Instruction Cache, Issue, Scoreboard RAM, Scoreboard Processing, Instruction Buffer, Instruction Completion Signal, Pipeline Configuration Signals, and output to Register File.]

[Drawing sheet 14 of 14. FIG. 14: flow diagram, Unified Graphics Data Processing: Receive vertex data (1410); Process vertex data through SIMD execution pipeline (1412); Rasterize processed vertex data (1414); Generate pixel data by scan converting rasterized vertex data (1416); Process pixel data through the same SIMD execution pipeline used in step 1412 (1418); Output processed pixel data to raster analyzer (1420).]

`PROGRAMMABLE GRAPHICS PROCESSOR
`FOR MULTITHREADED EXECUTION OF
`PROGRAMS
`
`RELATED APPLICATIONS
`
This application is a divisional of U.S. patent application
Ser. No. 11/458,633, filed Jul. 19, 2006, which is a continuation-in-part
of U.S. patent application Ser. No. 10/696,714,
filed Oct. 29, 2003, issued as U.S. Pat. No. 7,103,720, a
continuation-in-part of U.S. patent application Ser. No.
10/736,437, filed Dec. 15, 2003, issued as U.S. Pat. No.
7,139,003, and a continuation-in-part of U.S. patent application
Ser. No. 11/292,614, filed Dec. 2, 2005, now U.S. Pat. No.
7,836,276. The entire contents of the foregoing applications
are hereby incorporated herein by reference.
`
`FIELD OF THE INVENTION
`
`One or more aspects of the invention relate generally to
`multithreaded processing, and more particularly to process-
`ing graphics data in a programmable graphics processor.
`
`BACKGROUND
`
`Current graphics data processing includes systems and
`methods developed to perform a specific operation on graph-
`ics data, e.g., linear interpolation, tessellation, rasterization,
`texture mapping, depth testing, etc. These graphics proces-
`sors include several fixed function computation units to per-
`form such specific operations on specific types of graphics
`data, such as vertex data and pixel data.
`More recently, the computation units have a degree of
`programmability to perform user specified operations such
`that the vertex data is processed by a vertex processing unit
`using vertex programs and the pixel data is processed by a
pixel processing unit using pixel programs. When the amount
of vertex data being processed is low relative to the amount of
pixel data being processed, the vertex processing unit may be
underutilized. Conversely, when the amount of vertex data
being processed is high relative to the amount of pixel data being
processed, the pixel processing unit may be underutilized.
`Accordingly, it would be desirable to provide improved
`approaches to processing different types of graphics data to
`better utilize one or more processing units within a graphics
`processor.
`
`SUMMARY OF THE INVENTION
`
`The present invention provides a unified approach for
`graphics data processing. Sample data of different types, e.g.,
`vertex data and pixel data, are processed through the same
`execution pipeline.
`A processing unit according to an embodiment of the
`present invention includes multiple execution pipelines, each
`of which is coupled to a first input section for receiving input
`data for pixel processing and a second input section for
`receiving input data for vertex processing and to a first output
`section for storing processed pixel data and a second output
`section for storing processed vertex data. The processed ver-
`tex data is rasterized and scan converted into pixel data that is
`used as the input data for pixel processing. The processed
`pixel data is output to a raster analyzer.
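
The unified data flow summarized above can be sketched in a few lines of Python. This is only an illustrative model, not the patented hardware: the names `Sample`, `execution_pipeline`, `rasterize_and_scan_convert` and `process_frame` are hypothetical, and the rasterizer is a toy stand-in that emits one pixel sample per processed vertex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    kind: str    # "vertex" or "pixel"
    data: tuple  # payload; the layout here is illustrative only


def execution_pipeline(sample, program):
    """One shared pipeline: runs the supplied program on either sample type."""
    return Sample(sample.kind, program(sample.data))


def rasterize_and_scan_convert(vertices):
    """Toy stand-in: emit one pixel sample at each processed vertex position."""
    return [Sample("pixel", (round(v.data[0]), round(v.data[1]))) for v in vertices]


def process_frame(vertex_inputs, vertex_program, pixel_program):
    # Vertex pass through the shared pipeline.
    processed_vertices = [execution_pipeline(v, vertex_program) for v in vertex_inputs]
    # Processed vertex data becomes the input data for pixel processing.
    pixel_inputs = rasterize_and_scan_convert(processed_vertices)
    # Pixel pass through the same pipeline function.
    return [execution_pipeline(p, pixel_program) for p in pixel_inputs]
```

The point of the sketch is that `execution_pipeline` is called for both passes, so no separate vertex or pixel unit sits idle when the workload mix changes.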
Each execution pipeline has a plurality of sets of parallel
data execution paths that run at a higher clock speed than the
clock speed of the processing unit. As a result, a large number
of pixels or vertices can be processed in parallel through the
execution pipeline. The total number of pixels or vertices that
can be processed through the execution pipelines per clock
cycle of the processing unit is equal to: (the number of execution
pipelines)×(the number of sets of parallel data execution
paths in each execution pipeline)×(the number of parallel data
execution paths in each set)×(the ratio of the clock speed of
the parallel data execution paths to the processing unit clock
speed).
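
As a worked illustration of the product above, the following Python sketch multiplies the four factors. The function name and the example figures (8 pipelines, 2 sets, 8 paths per set, clock ratio 2) are assumptions chosen for the arithmetic and are not taken from the specification.

```python
def samples_per_clock(num_pipelines, sets_per_pipeline, paths_per_set, clock_ratio):
    """Peak pixels or vertices processed per processing-unit clock cycle."""
    for value in (num_pipelines, sets_per_pipeline, paths_per_set, clock_ratio):
        if value <= 0:
            raise ValueError("all factors must be positive")
    return num_pipelines * sets_per_pipeline * paths_per_set * clock_ratio


if __name__ == "__main__":
    # Hypothetical configuration: 8 x 2 x 8 x 2 = 256 samples per clock.
    print(samples_per_clock(8, 2, 8, 2))
```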
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`Accompanying drawing(s) show exemplary embodiment
`(s) in accordance with one or more aspects of the present
`invention; however, the accompanying drawing(s) should not
`be taken to limit the present invention to the embodiment(s)
`shown, but are for explanation and understanding only.
`FIG. 1 illustrates one embodiment of a computing system
`according to the invention including a host computer and a
`graphics subsystem.
`FIG. 2 is a block diagram of an embodiment of the pro-
`grammable graphics processing pipeline of FIG. 1.
`FIG. 3 is a block diagram of an embodiment of the execu-
`tion pipeline of FIG. 2.
FIG. 4 is a block diagram of an alternate embodiment of the
execution pipeline of FIG. 2.
`FIGS. 5A and 5B are flow diagrams of exemplary embodi-
`ments of thread assignment in accordance with one or more
`aspects of the present invention.
`FIGS. 6A and 6B are exemplary embodiments of a portion
`of the thread storage resource storing thread state data within
`an embodiment of the thread control unit of FIG. 3 or FIG. 4.
`
`FIGS. 7A and 7B are flow diagrams of exemplary embodi-
`ments of thread allocation and processing in accordance with
`one or more aspects of the present invention.
`FIGS. 8A and 8B are flow diagrams of exemplary embodi-
`ments of thread assignment in accordance with one or more
`aspects of the present invention.
FIGS. 9A and 9B are flow diagrams of exemplary embodiments
of thread selection in accordance with one or more
aspects of the present invention.
FIG. 10 is a block diagram of another embodiment of the
programmable graphics processing pipeline of FIG. 1.
FIG. 11 illustrates an embodiment of the texture processing
cluster of FIG. 10.
`
`FIG. 12 is a block diagram of another embodiment of the
`execution pipeline of FIG. 2 or FIG. 11.
`FIG. 13 is a block diagram of an embodiment of the
`instruction dispatch unit of FIG. 12.
`FIG. 14 is a flow diagram that illustrates the steps of pro-
`cessing graphics data in accordance with one or more aspects
`of the present invention.
`
`DETAILED DESCRIPTION
`
`In the following description, numerous specific details are
`set forth to provide a more thorough understanding of the
`present invention. However, it will be apparent to one of skill
`in the art that the present invention may be practiced without
one or more of these specific details. In other instances, well-known
features have not been described in order to avoid
obscuring the present invention.
FIG. 1 is an illustration of a computing system generally
designated 100 and including a host computer 110 and a
graphics subsystem 170. Computing system 100 may be a
desktop computer, server, laptop computer, palm-sized computer,
tablet computer, game console, cellular telephone,
computer based simulator, or the like. Host computer 110
`
includes host processor 114 that may include a system
memory controller to interface directly to host memory 112 or
may communicate with host memory 112 through a system
interface 115. System interface 115 may be an I/O (input/output)
interface or a bridge device including the system
memory controller to interface directly to host memory 112.
Examples of system interface 115 known in the art include
Intel® Northbridge and Intel® Southbridge.
Host computer 110 communicates with graphics subsystem
170 via system interface 115 and a graphics interface
`117 within a graphics processor 105. Data received at graph-
`ics interface 117 can be passed to a front end 130 or written to
`a local memory 140 through memory controller 120. Graph-
`ics processor 105 uses graphics memory to store graphics
`data and program instructions, where graphics data is any
`data that is input to or output from components within the
`graphics processor. Graphics memory can include portions of
`host memory 112, local memory 140, register files coupled to
`the components within graphics processor 105, and the like.
`Graphics processor 105 includes, among other compo-
`nents, front end 130 that receives commands from host com-
`puter 110 via graphics interface 117. Front end 130 interprets
`and formats the commands and outputs the formatted com-
`mands and data to an IDX (index processor) 135. Some of the
`formatted commands are used by programmable graphics
`processing pipeline 150 to initiate processing of data by pro-
`viding the location of program instructions or graphics data
`stored in memory. IDX 135, programmable graphics process-
`ing pipeline 150 and a raster analyzer 160 each include an
`interface to memory controller 120 through which program
`instructions and data can be read from memory, e.g., any
`combination of local memory 140 and host memory 112.
`When a portion of host memory 112 is used to store program
`instructions and data, the portion of host memory 112 can be
`uncached so as to increase performance of access by graphics
`processor 105.
`IDX 135 optionally reads processed data, e.g., data written
`by raster analyzer 160, from memory and outputs the data,
`processed data and formatted commands to programmable
`graphics processing pipeline 150. Programmable graphics
`processing pipeline 150 and raster analyzer 160 each contain
`one or more programmable processing units to perform a
`variety of specialized functions. Some of these functions are
`table lookup, scalar and vector addition, multiplication, divi-
`sion, coordinate-system mapping, calculation of vector nor-
`mals, tessellation, calculation of derivatives, interpolation,
`and the like. Programmable graphics processing pipeline 150
`and raster analyzer 160 are each optionally configured such
`that data processing operations are performed in multiple
`passes through those units or in multiple passes within pro-
`grammable graphics processing pipeline 150. Programmable
`graphics processing pipeline 150 and a raster analyzer 160
`also each include a write interface to memory controller 120
`through which data can be written to memory.
`In a typical implementation programmable graphics pro-
`cessing pipeline 150 performs geometry computations, ras-
`terization, and pixel computations. Therefore, programmable
`graphics processing pipeline 150 is programmed to operate
`on surface, primitive, vertex, fragment, pixel, sample or any
`other data. For simplicity, the remainder of this description
`will use the term “samples” to refer to graphics data such as
`surfaces, primitives, vertices, pixels, fragments, or the like.
`Samples output by programmable graphics processing
`pipeline 150 are passed to a raster analyzer 160, which
`optionally performs near and far plane clipping and raster
`operations, such as stencil, z test, and the like, and saves the
results or the samples output by programmable graphics processing
pipeline 150 in local memory 140. When the data
received by graphics subsystem 170 has been completely
processed by graphics processor 105, an output 185 of graphics
subsystem 170 is provided using an output controller 180.
`Output controller 180 is optionally configured to deliver data
`to a display device, network, electronic control system, other
`computing system 100, other graphics subsystem 170, or the
`like. Alternatively, data is output to a film recording device or
`written to a peripheral device, e.g., disk drive, tape, compact
`disk, or the like.
FIG. 2 is an illustration of programmable graphics processing
pipeline 150 of FIG. 1. At least one set of samples is output
by IDX 135 and received by programmable graphics process-
`ing pipeline 150 and the at least one set of samples is pro-
`cessed according to at least one program, the at least one
`program including graphics program instructions. A program
`can process one or more sets of samples. Conversely, a set of
`samples can be processed by a sequence of one or more
`programs.
`Samples, such as surfaces, primitives, or the like, are
`received from IDX 135 by programmable graphics process-
`ing pipeline 150 and stored in a vertex input buffer 220
`including a register file, FIFO (first in first out), cache, or the
`like (not shown). The samples are broadcast to execution
`pipelines 240, four of which are shown in the figure. Each
`execution pipeline 240 includes at least one multithreaded
`processing unit, to be described further herein. The samples
`output by vertex input buffer 220 can be processed by any one
`of the execution pipelines 240. A sample is accepted by an
`execution pipeline 240 when a processing thread within the
`execution pipeline 240 is available as described further
`herein. Each execution pipeline 240 signals to vertex input
`buffer 220 when a sample can be accepted or when a sample
`cannot be accepted.
In one embodiment, programmable graphics processing
pipeline 150 includes a single execution pipeline 240
containing one multithreaded processing unit. In an
alternative embodiment, programmable graphics processing
pipeline 150 includes a plurality of execution pipelines 240.
`Execution pipelines 240 may receive first samples, such as
`higher-order surface data, and tessellate the first samples to
`generate second samples, such as vertices. Execution pipe-
`lines 240 may be configured to transform the second samples
from an object-based coordinate representation (object
`space) to an alternatively based coordinate system such as
`world space or normalized device coordinates (NDC) space.
`Each execution pipeline 240 may communicate with texture
`unit 225 using a read interface (not shown in FIG. 2) to read
`program instructions and graphics data such as texture maps
`from local memory 140 or host memory 112 via memory
controller 120 and a texture cache 230. Texture cache 230
serves to increase effective memory bandwidth. In an alternate
embodiment texture cache 230 is omitted. In another
`alternate embodiment, a texture unit 225 is included in each
`execution pipeline 240. In another alternate embodiment,
`program instructions are stored within programmable graph-
`ics processing pipeline 150. In another alternate embodiment,
`each execution pipeline 240 has a dedicated instruction read
`interface to read program instructions from local memory 140
`or host memory 112 via memory controller 120.
`Execution pipelines 240 output processed samples, such as
`vertices, that are stored in a vertex output buffer 260 including
`a register file, FIFO, cache, or the like (not shown). Processed
`vertices output by vertex output buffer 260 are received by a
`primitive assembly/setup unit 205. Primitive assembly/setup
`unit 205 calculates parameters, such as deltas and slopes, to
`rasterize the processed vertices and outputs parameters and
`
`samples, such as vertices, to a raster unit 210. Raster unit 210
`performs scan conversion on samples, such as vertices, and
`outputs samples, such as fragments, to a pixel input buffer
`215. Alternatively, raster unit 210 resamples processed verti-
`ces and outputs additional vertices to pixel input buffer 215.
`Pixel input buffer 215 outputs the samples to each execu-
`tion pipeline 240. Samples, such as pixels and fragments,
`output by pixel input buffer 215 are each processed by only
`one of the execution pipelines 240. Pixel input buffer 215
determines which one of the execution pipelines 240 to output
`each sample to depending on an output pixel position, e.g.,
`(x,y), associated with each sample. In this manner, each
`sample is output to the execution pipeline 240 designated to
`process samples associated with the output pixel position. In
`an alternate embodiment, each sample output by pixel input
`buffer 215 is processed by one of any available execution
`pipelines 240.
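
The position-based routing described above can be pictured with a small Python sketch. The specification says only that the destination depends on the output pixel position (x,y); the 8×8 tile size and the `(tile_x + tile_y) % num_pipelines` mapping below are assumptions chosen for illustration, and the function names are hypothetical.

```python
def pipeline_for_position(x, y, num_pipelines, tile=8):
    """Map an output pixel position to its designated pipeline (hypothetical tiling)."""
    tile_x, tile_y = x // tile, y // tile
    return (tile_x + tile_y) % num_pipelines


def route(samples, num_pipelines):
    """Place each (x, y) sample in the queue of exactly one pipeline."""
    queues = [[] for _ in range(num_pipelines)]
    for sample in samples:
        x, y = sample[0], sample[1]
        queues[pipeline_for_position(x, y, num_pipelines)].append(sample)
    return queues
```

Because the mapping is a pure function of position, every sample for a given output pixel lands in the same pipeline and is processed by only one of them.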
Each execution pipeline 240 signals to pixel input buffer
215 when a sample can be accepted or when a sample cannot
be accepted as described further herein. Program instructions
configure programmable computation units (PCUs) within an
execution pipeline 240 to perform operations such as perspective
correction, texture mapping, shading, blending, and the
like. Processed samples are output from each execution pipe-
`line 240 to a pixel output buffer 270. Pixel output buffer 270
`optionally stores the processed samples in a register file,
`FIFO, cache, or the like (not shown). The processed samples
`are output from pixel output buffer 270 to raster analyzer 160.
FIG. 3 is a block diagram of an embodiment of execution
pipeline 240 of FIG. 2 including at least one multithreaded
processing unit 300. An execution pipeline 240 can contain a
plurality of multithreaded processing units 300, each multithreaded
processing unit 300 containing at least one PCU
375. PCUs 375 are configured using program instructions
read by a thread control unit 320. Thread control unit 320
gathers source data specified by the program instructions and
dispatches the source data and program instructions to at least
one PCU 375. PCUs 375 perform computations specified by
the program instructions and output data to at least one
destination, e.g., pixel output buffer 270, vertex output buffer
260 and thread control unit 320.
`
`A single program may be used to process several sets of
`samples. Thread control unit 320 receives samples or pointers
`to samples stored in pixel input buffer 215 and vertex input
`buffer 220. Thread control unit 320 receives a pointer to a
`program to process one or more samples. Thread control unit
`320 assigns a thread to each sample to be processed. A thread
`includes a pointer to a program instruction (program counter),
`such as the first instruction within the program, thread state
`information, and storage resources for storing intermediate
`data generated during processing of the sample. Thread state
`information is stored in a TSR (thread storage resource) 325.
`TSR 325 may be a register file, FIFO, circular buffer, or the
`like. An instruction specifies the location of source data
`needed to execute the instruction. Source data, such as inter-
`mediate data generated during processing of the sample is
`stored in a register file 350. In addition to register file 350,
`other source data may be stored in pixel input buffer 215 or
`vertex input buffer 220. In an alternate embodiment source
`data is stored in local memory 140, locations in host memory
`112, and the like.
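
A minimal Python sketch of the thread assignment just described follows. It assumes a fixed number of thread slots and a first-free-slot policy, neither of which is stated in the text; the class and method names (`Thread`, `ThreadControlUnit`, `assign`, `release`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Thread:
    program_counter: int   # pointer to the next program instruction
    sample: object         # the sample (or a pointer to it) being processed
    state: dict = field(default_factory=dict)      # thread state information
    registers: list = field(default_factory=list)  # intermediate data storage


class ThreadControlUnit:
    def __init__(self, max_threads):
        # Models the thread storage resource (TSR) as a fixed set of slots.
        self.tsr = [None] * max_threads

    def assign(self, sample, program_start) -> Optional[int]:
        """Assign a thread to the sample; return its slot, or None if none is free."""
        for slot, entry in enumerate(self.tsr):
            if entry is None:
                self.tsr[slot] = Thread(program_start, sample)
                return slot
        return None

    def release(self, slot):
        """Free the slot once processing of the sample is complete."""
        self.tsr[slot] = None
```

A `None` result from `assign` corresponds to the case where the execution pipeline cannot accept a sample because no processing thread is available.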
`Alternatively, in an embodiment permitting multiple pro-
`grams for two or more thread types, thread control unit 320
also receives a program identifier specifying which one of the
two or more programs the program counter is associated with.
Specifically, in an embodiment permitting simultaneous
execution of four programs for a thread type, two bits of
thread state information are used to store the program identifier
for a thread. Multithreaded execution of programs is
`possible because each thread may be executed independent of
`other threads, regardless of whether the other threads are
`executing the same program or a different program. PCUs
`375 update each program counter associated with the threads
`in thread control unit 320 following the execution of an
`instruction. For execution of a loop, call, return, or