Case 3:14-cv-00757-REP-DJN Document 87-4 Filed 04/16/15 Page 1 of 28 PageID# 14337
`
`
`EXHIBIT D
`
`

`

`
`US008174531B1
`
`(12) United States Patent
`Lindholm et al.
`
`(10) Patent No.: US 8,174,531 B1
`(45) Date of Patent: May 8, 2012
`
`(54) PROGRAMMABLE GRAPHICS PROCESSOR
`FOR MULTITHREADED EXECUTION OF
`PROGRAMS
`
`(75) Inventors: John Erik Lindholm, Saratoga, CA
`(US); Brett W. Coon, San Jose, CA
`(US); Stuart F. Oberman, Sunnyvale,
`CA (US); Ming Y. Siu, Santa Clara, CA
`(US); Matthew P. Gerlach, Commerce
`Township, MI (US)
`
`(73) Assignee: NVIDIA Corporation, Santa Clara, CA
`(US)
`
`( * ) Notice: Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 12/649,201
`
`(22) Filed: Dec. 29, 2009
`
`Related U.S. Application Data
`
`(60) Division of application No. 11/458,633, filed on Jul.
`19, 2006, which is a continuation-in-part of
`application No. 10/696,714, filed on Oct. 29, 2003,
`now Pat. No. 7,103,720, and a continuation-in-part of
`application No. 10/736,437, filed on Dec. 15, 2003,
`now Pat. No. 7,139,003, and a continuation-in-part of
`application No. 11/292,614, filed on Dec. 2, 2005, now
`Pat. No. 7,836,276.
`
`(51) Int. Cl.
`G06F 15/16 (2006.01)
`G06F 15/80 (2006.01)
`G06F 13/14 (2006.01)
`G06T 1/20 (2006.01)
`(52) U.S. Cl. ......... 345/505; 345/502; 345/506; 345/520
`(58) Field of Classification Search .................. 345/502,
`345/505, 520, 506, 522
`See application file for complete search history.
`
`
`Primary Examiner - Hau Nguyen
`(74) Attorney, Agent, or Firm - Patterson & Sheridan, LLP.
`
`(57) ABSTRACT
`
`A processing unit includes multiple execution pipelines, each
`of which is coupled to a first input section for receiving input
`data for pixel processing and a second input section for
`receiving input data for vertex processing and to a first output
`section for storing processed pixel data and a second output
`section for storing processed vertex data. The processed ver-
`tex data is rasterized and scan converted into pixel data that is
`used as the input data for pixel processing. The processed
`pixel data is output to a raster analyzer.
`
`10 Claims, 14 Drawing Sheets
`
`[Representative drawing. Labeled blocks: Programmable Graphics
`Processing Pipeline, Primitive Assembly/Setup, Raster Unit,
`Pixel Input Buffer, Vertex Input Buffer, Execution Pipeline
`(four instances), Texture Unit, Texture Cache, Vertex Output
`Buffer, Pixel Output Buffer.]

`

`
`
`

`

`[Drawing sheet 1 of 14: FIG. 1. Labeled blocks: Host Computer,
`Host Memory, Host Processor, System Interface, Graphics
`Subsystem, Graphics Interface, Graphics Processor, Front End,
`Memory Controller, Programmable Graphics Processing Pipeline,
`Raster Analyzer, Local Memory, Output Controller.]
`
`
`

`

`[Drawing sheet 2 of 14: FIG. 2. Labeled blocks: Programmable
`Graphics Processing Pipeline, Vertex Input Buffer, Pixel Input
`Buffer, Primitive Assembly/Setup, Raster Unit, Execution
`Pipeline (four instances), Vertex Output Buffer, Pixel Output
`Buffer.]
`
`

`

`[Drawing sheet 3 of 14: FIG. 3. Labeled blocks: Execution
`Pipeline, Multithreaded Processing Unit, Thread Control Unit,
`Register File.]
`
`

`

`[Drawing sheet 4 of 14 (figure label not legible; by sheet
`order, FIG. 4). Labeled blocks: Execution Pipeline,
`Multithreaded Processing Unit, Instruction Cache, Thread
`Selection Unit, Thread Control Unit, TSR, Sequencer, Resource
`Scheduler, Instruction Dispatcher, Execution Unit, Register
`File.]
`
`

`

`[Drawing sheet 5 of 14: FIGS. 5A and 5B (flow diagrams).
`FIG. 5A steps: Receive Pointer to a Program; Pixel or Vertex?;
`Assign Pixel Thread; Assign Vertex Thread.
`FIG. 5B steps: Receive Pointer to a Program; Pixel or Vertex?;
`Pass Priority Test?; Vertex Thread Available? / Pixel Thread
`Available?; Assign Pixel Thread; Assign Vertex Thread.]
`
`

`

`[Drawing sheet 6 of 14: FIGS. 6A and 6B. The only legible
`reference numeral is 640; no block labels survive.]
`
`

`

`[Drawing sheet 7 of 14: FIGS. 7A and 7B (flow diagrams).
`FIG. 7A steps: Determine allocations; Allocating threads to a
`first sample type; Allocating threads to a second sample type;
`Execute First Program Instructions; Execute Second Program
`Instructions.
`FIG. 7B steps: Allocating threads to a first sample type;
`Allocating threads to a second sample type; Execute First
`Program Instructions; Execute Second Program Instructions;
`Allocating threads to the first sample type.]
`
`

`

`[Drawing sheet 8 of 14: FIGS. 8A and 8B (flow diagrams).
`Legible steps: Receive Sample; Identify Sample Type; Thread
`Available?; Assign Thread; Position Hazard?; PIOR disabled?;
`Resource Available?; Execute Thread; Deallocate Resources.]
`
`

`

`[Drawing sheet 9 of 14: FIGS. 9A and 9B (flow diagrams).
`Legible steps: Determine thread priority; Identify assigned
`thread(s) for priority; No threads?; Identify [illegible]
`priority; Identify assigned thread(s); Select Thread(s); Read
`Program Counter(s); Update Program Counter(s).]
`
`

`

`[Drawing sheet 10 of 14: FIG. 10. Legible labels: Programmable
`Graphics Processing Pipeline; "To 120".]
`
`

`

`[Drawing sheet 11 of 14: FIG. 11. Legible labels: Raster Unit,
`Execution Pipeline (two instances), Vertex Output Buffer.]
`
`

`

`[Drawing sheet 12 of 14: FIG. 12. Legible labels: Execution
`Pipeline 240, Instruction Dispatch, Register File, Operand
`Collection Unit, Accumulator (two instances), "to/from 225",
`"to 260/270"; reference numerals 1210 and 1220.]
`
`

`

`[Drawing sheet 13 of 14: FIG. 13. Legible labels: Instruction
`Dispatch 1212, Instruction Cache, Issue, Scoreboard RAM,
`Scoreboard Processing, Instruction Buffer, Instruction
`Completion Signal, "to Register File 1214", Pipeline
`Configuration Signals.]
`
`

`

`[Drawing sheet 14 of 14: FIG. 14, Unified Graphics Data
`Processing (flow diagram). Steps: 1410 Receive vertex data;
`1412 Process vertex data through SIMD execution pipeline;
`1414 Rasterize processed vertex data; 1416 Generate pixel data
`by scan converting rasterized vertex data; 1418 Process pixel
`data through the same SIMD execution pipeline used in step
`1412; 1420 Output processed pixel data to raster analyzer.]
`
`

`

`PROGRAMMABLE GRAPHICS PROCESSOR
`FOR MULTITHREADED EXECUTION OF
`PROGRAMS
`
`RELATED APPLICATIONS
`
`This application is a divisional of U.S. patent application
`Ser. No. 11/458,633, filed Jul. 19, 2006, which is a
`continuation-in-part of U.S. patent application Ser. No. 10/696,714,
`filed Oct. 29, 2003, issued as U.S. Pat. No. 7,103,720, a
`continuation-in-part of U.S. patent application Ser. No.
`10/736,437, filed Dec. 15, 2003, issued as U.S. Pat. No.
`7,139,003, and a continuation-in-part of U.S. patent
`application Ser. No. 11/292,614, filed Dec. 2, 2005, now U.S. Pat. No.
`7,836,276. The entire contents of the foregoing applications
`are hereby incorporated herein by reference.
`
`FIELD OF THE INVENTION
`
`One or more aspects of the invention relate generally to
`multithreaded processing, and more particularly to process-
`ing graphics data in a programmable graphics processor.
`
`BACKGROUND
`
`Current graphics data processing includes systems and
`methods developed to perform a specific operation on graph-
`ics data, e.g., linear interpolation, tessellation, rasterization,
`texture mapping, depth testing, etc. These graphics proces-
`sors include several fixed function computation units to per-
`form such specific operations on specific types of graphics
`data, such as vertex data and pixel data.
`More recently, the computation units have a degree of
`programmability to perform user specified operations such
`that the vertex data is processed by a vertex processing unit
`using vertex programs and the pixel data is processed by a
`pixel processing unit using pixel programs. When the amount
`of vertex data being processed is low relative to the amount of
`pixel data being processed, the vertex processing unit may be
`underutilized. Conversely, when the amount of vertex data
`being processed is high relative to the amount of pixel data being
`processed, the pixel processing unit may be underutilized.
`Accordingly, it would be desirable to provide improved
`approaches to processing different types of graphics data to
`better utilize one or more processing units within a graphics
`processor.
`
`SUMMARY OF THE INVENTION
`
`The present invention provides a unified approach for
`graphics data processing. Sample data of different types, e.g.,
`vertex data and pixel data, are processed through the same
`execution pipeline.
`A processing unit according to an embodiment of the
`present invention includes multiple execution pipelines, each
`of which is coupled to a first input section for receiving input
`data for pixel processing and a second input section for
`receiving input data for vertex processing and to a first output
`section for storing processed pixel data and a second output
`section for storing processed vertex data. The processed ver-
`tex data is rasterized and scan converted into pixel data that is
`used as the input data for pixel processing. The processed
`pixel data is output to a raster analyzer.
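The unified flow summarized above can be pictured with a small software sketch. This is only an illustration under stated assumptions, not the patented hardware: `Sample`, `execution_pipeline`, `rasterize`, and `unified_process` are hypothetical names, and the bounding-box "rasterizer" is a deliberately trivial stand-in for rasterization and scan conversion. The point it shows is that one shared routine handles both vertex and pixel samples.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    kind: str    # "vertex" or "pixel"
    data: tuple


def execution_pipeline(sample, program):
    """One shared pipeline: runs the supplied program on a sample of either type."""
    return Sample(sample.kind, program(sample.data))


def rasterize(vertices):
    """Toy stand-in for rasterization and scan conversion (an assumption):
    emits one pixel sample per integer position in the vertices' bounding box."""
    xs = [int(v.data[0]) for v in vertices]
    ys = [int(v.data[1]) for v in vertices]
    return [Sample("pixel", (x, y))
            for y in range(min(ys), max(ys) + 1)
            for x in range(min(xs), max(xs) + 1)]


def unified_process(vertex_inputs, vertex_program, pixel_program):
    # Vertex input section -> shared pipeline -> vertex output section.
    vertex_out = [execution_pipeline(v, vertex_program) for v in vertex_inputs]
    # Processed vertex data is converted into the pixel input data.
    pixel_in = rasterize(vertex_out)
    # Pixel input section -> the same pipeline -> pixel output section.
    pixel_out = [execution_pipeline(p, pixel_program) for p in pixel_in]
    return pixel_out  # in the patent, this is what goes to the raster analyzer
```

In this sketch the vertex and pixel programs are ordinary callables, which mirrors the idea that the pipeline itself is agnostic to the sample type it is processing.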
`Each execution pipeline has a plurality of sets of parallel
`data execution paths that run at a higher clock speed than the
`clock speed of the processing unit. As a result, a large number
`of pixels or vertices can be processed in parallel through the
`execution pipeline. The total number of pixels or vertices that
`can be processed through the execution pipelines per clock
`cycle of the processing unit is equal to: (the number of
`execution pipelines)×(the number of sets of parallel data execution
`paths in each execution pipeline)×(the number of parallel data
`execution paths in each set)×(the ratio of the clock speed of
`the parallel data execution paths to the processing unit clock
`speed).
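The product described above is easy to evaluate directly. The sketch below is a plain restatement of that formula; the function name and the example figures (8 pipelines, 2 sets, 8 paths per set, a 2:1 clock ratio) are illustrative assumptions and do not come from the patent.

```python
def samples_per_clock(num_pipelines, sets_per_pipeline, paths_per_set, clock_ratio):
    """Pixels or vertices processed per processing-unit clock cycle.

    clock_ratio is (clock speed of the parallel data execution paths)
    divided by (clock speed of the processing unit).
    """
    return num_pipelines * sets_per_pipeline * paths_per_set * clock_ratio


# Hypothetical configuration, chosen only to show the arithmetic:
# 8 pipelines x 2 sets x 8 paths per set x clock ratio 2.
example = samples_per_clock(8, 2, 8, 2)
```

With these assumed numbers the product is 8 × 2 × 8 × 2 = 256 samples per processing-unit clock cycle.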
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`Accompanying drawing(s) show exemplary embodiment(s)
`in accordance with one or more aspects of the present
`invention; however, the accompanying drawing(s) should not
`be taken to limit the present invention to the embodiment(s)
`shown, but are for explanation and understanding only.
`FIG. 1 illustrates one embodiment of a computing system
`according to the invention including a host computer and a
`graphics subsystem.
`FIG. 2 is a block diagram of an embodiment of the pro-
`grammable graphics processing pipeline of FIG. 1.
`FIG. 3 is a block diagram of an embodiment of the execu-
`tion pipeline of FIG. 2.
`FIG. 4 is a block diagram of an alternate embodiment of the
`execution pipeline of FIG. 2.
`FIGS. 5A and 5B are flow diagrams of exemplary embodi-
`ments of thread assignment in accordance with one or more
`aspects of the present invention.
`FIGS. 6A and 6B are exemplary embodiments of a portion
`of the thread storage resource storing thread state data within
`an embodiment of the thread control unit of FIG. 3 or FIG. 4.
`
`FIGS. 7A and 7B are flow diagrams of exemplary embodi-
`ments of thread allocation and processing in accordance with
`one or more aspects of the present invention.
`FIGS. 8A and 8B are flow diagrams of exemplary embodi-
`ments of thread assignment in accordance with one or more
`aspects of the present invention.
`FIGS. 9A and 9B are flow diagrams of exemplary embodiments
`of thread selection in accordance with one or more
`aspects of the present invention.
`FIG. 10 is a block diagram of another embodiment of the
`programmable graphics processing pipeline of FIG. 1.
`FIG. 11 illustrates an embodiment of the texture processing
`cluster of FIG. 10.
`
`FIG. 12 is a block diagram of another embodiment of the
`execution pipeline of FIG. 2 or FIG. 11.
`FIG. 13 is a block diagram of an embodiment of the
`instruction dispatch unit of FIG. 12.
`FIG. 14 is a flow diagram that illustrates the steps of pro-
`cessing graphics data in accordance with one or more aspects
`of the present invention.
`
`DETAILED DESCRIPTION
`
`In the following description, numerous specific details are
`set forth to provide a more thorough understanding of the
`present invention. However, it will be apparent to one of skill
`in the art that the present invention may be practiced without
`one or more of these specific details. In other instances,
`well-known features have not been described in order to avoid
`obscuring the present invention.
`FIG. 1 is an illustration of a computing system generally
`designated 100 and including a host computer 110 and a
`graphics subsystem 170. Computing system 100 may be a
`desktop computer, server, laptop computer, palm-sized
`computer, tablet computer, game console, cellular telephone,
`computer based simulator, or the like. Host computer 110
`includes host processor 114 that may include a system
`memory controller to interface directly to host memory 112 or
`may communicate with host memory 112 through a system
`interface 115. System interface 115 may be an I/O (input/
`output) interface or a bridge device including the system
`memory controller to interface directly to host memory 112.
`Examples of system interface 115 known in the art include
`Intel® Northbridge and Intel® Southbridge.
`Host computer 110 communicates with graphics subsystem
`170 via system interface 115 and a graphics interface
`117 within a graphics processor 105. Data received at graph-
`ics interface 117 can be passed to a front end 130 or written to
`a local memory 140 through memory controller 120. Graph-
`ics processor 105 uses graphics memory to store graphics
`data and program instructions, where graphics data is any
`data that is input to or output from components within the
`graphics processor. Graphics memory can include portions of
`host memory 112, local memory 140, register files coupled to
`the components within graphics processor 105, and the like.
`Graphics processor 105 includes, among other compo-
`nents, front end 130 that receives commands from host com-
`puter 110 via graphics interface 117. Front end 130 interprets
`and formats the commands and outputs the formatted com-
`mands and data to an IDX (index processor) 135. Some of the
`formatted commands are used by programmable graphics
`processing pipeline 150 to initiate processing of data by pro-
`viding the location of program instructions or graphics data
`stored in memory. IDX 135, programmable graphics process-
`ing pipeline 150 and a raster analyzer 160 each include an
`interface to memory controller 120 through which program
`instructions and data can be read from memory, e.g., any
`combination of local memory 140 and host memory 112.
`When a portion of host memory 112 is used to store program
`instructions and data, the portion of host memory 112 can be
`uncached so as to increase performance of access by graphics
`processor 105.
`IDX 135 optionally reads processed data, e.g., data written
`by raster analyzer 160, from memory and outputs the data,
`processed data and formatted commands to programmable
`graphics processing pipeline 150. Programmable graphics
`processing pipeline 150 and raster analyzer 160 each contain
`one or more programmable processing units to perform a
`variety of specialized functions. Some of these functions are
`table lookup, scalar and vector addition, multiplication, divi-
`sion, coordinate-system mapping, calculation of vector nor-
`mals, tessellation, calculation of derivatives, interpolation,
`and the like. Programmable graphics processing pipeline 150
`and raster analyzer 160 are each optionally configured such
`that data processing operations are performed in multiple
`passes through those units or in multiple passes within pro-
`grammable graphics processing pipeline 150. Programmable
`graphics processing pipeline 150 and a raster analyzer 160
`also each include a write interface to memory controller 120
`through which data can be written to memory.
`In a typical implementation programmable graphics pro-
`cessing pipeline 150 performs geometry computations, ras-
`terization, and pixel computations. Therefore, programmable
`graphics processing pipeline 150 is programmed to operate
`on surface, primitive, vertex, fragment, pixel, sample or any
`other data. For simplicity, the remainder of this description
`will use the term “samples” to refer to graphics data such as
`surfaces, primitives, vertices, pixels, fragments, or the like.
`Samples output by programmable graphics processing
`pipeline 150 are passed to a raster analyzer 160, which
`optionally performs near and far plane clipping and raster
`operations, such as stencil, z test, and the like, and saves the
`results or the samples output by programmable graphics
`processing pipeline 150 in local memory 140. When the data
`received by graphics subsystem 170 has been completely
`processed by graphics processor 105, an output 185 of
`graphics subsystem 170 is provided using an output controller 180.
`Output controller 180 is optionally configured to deliver data
`to a display device, network, electronic control system, other
`computing system 100, other graphics subsystem 170, or the
`like. Alternatively, data is output to a film recording device or
`written to a peripheral device, e.g., disk drive, tape, compact
`disk, or the like.
`FIG. 2 is an illustration of programmable graphics processing
`pipeline 150 of FIG. 1. At least one set of samples is output
`by IDX 135 and received by programmable graphics process-
`ing pipeline 150 and the at least one set of samples is pro-
`cessed according to at least one program, the at least one
`program including graphics program instructions. A program
`can process one or more sets of samples. Conversely, a set of
`samples can be processed by a sequence of one or more
`programs.
`Samples, such as surfaces, primitives, or the like, are
`received from IDX 135 by programmable graphics process-
`ing pipeline 150 and stored in a vertex input buffer 220
`including a register file, FIFO (first in first out), cache, or the
`like (not shown). The samples are broadcast to execution
`pipelines 240, four of which are shown in the figure. Each
`execution pipeline 240 includes at least one multithreaded
`processing unit, to be described further herein. The samples
`output by vertex input buffer 220 can be processed by any one
`of the execution pipelines 240. A sample is accepted by an
`execution pipeline 240 when a processing thread within the
`execution pipeline 240 is available as described further
`herein. Each execution pipeline 240 signals to vertex input
`buffer 220 when a sample can be accepted or when a sample
`cannot be accepted. In one embodiment, programmable
`graphics processing pipeline 150 includes a single execution
`pipeline 240 containing one multithreaded processing unit. In
`an alternative embodiment, programmable graphics processing
`pipeline 150 includes a plurality of execution pipelines
`240.
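The accept / cannot-accept signalling between the input buffer and the execution pipelines can be sketched as a simple handshake. This is a minimal model under assumptions: the class and function names are hypothetical, a free-thread counter stands in for thread availability, and "first pipeline that can accept" is an assumed arbitration policy that the text does not specify.

```python
from collections import deque


class ExecutionPipeline:
    def __init__(self, num_threads):
        self.free_threads = num_threads   # threads still available for new samples
        self.accepted = []

    def can_accept(self):
        # Models the signal sent back to the input buffer.
        return self.free_threads > 0

    def accept(self, sample):
        self.free_threads -= 1
        self.accepted.append(sample)


def drain(buffer, pipelines):
    """Hand each buffered sample to the first pipeline signalling it can accept.

    Stops when no pipeline can accept; the remaining samples stay buffered.
    """
    while buffer:
        target = next((p for p in pipelines if p.can_accept()), None)
        if target is None:
            break
        target.accept(buffer.popleft())
    return buffer
```

Samples left in the buffer would be offered again once a pipeline frees a thread, which is the back-pressure behavior the signalling implies.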
`
`Execution pipelines 240 may receive first samples, such as
`higher-order surface data, and tessellate the first samples to
`generate second samples, such as vertices. Execution pipe-
`lines 240 may be configured to transform the second samples
`from an object-based coordinate representation (object
`space) to an alternatively based coordinate system such as
`world space or normalized device coordinates (NDC) space.
`Each execution pipeline 240 may communicate with texture
`unit 225 using a read interface (not shown in FIG. 2) to read
`program instructions and graphics data such as texture maps
`from local memory 140 or host memory 112 via memory
`controller 120 and a texture cache 230. Texture cache 230
`serves to increase effective memory bandwidth. In an alternate
`embodiment texture cache 230 is omitted. In another
`alternate embodiment, a texture unit 225 is included in each
`execution pipeline 240. In another alternate embodiment,
`program instructions are stored within programmable graph-
`ics processing pipeline 150. In another alternate embodiment,
`each execution pipeline 240 has a dedicated instruction read
`interface to read program instructions from local memory 140
`or host memory 112 via memory controller 120.
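The object-space to NDC transform mentioned in this passage is conventionally done with a 4×4 matrix in homogeneous coordinates followed by a perspective divide. The sketch below assumes that convention; the patent does not specify the math, and `mat_vec`, `object_to_ndc`, and the combined `model_view_projection` matrix are hypothetical names.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def object_to_ndc(position, model_view_projection):
    """Transform an object-space (x, y, z) position to NDC.

    Assumes a combined model-view-projection matrix and a nonzero
    clip-space w; the perspective divide maps clip space to NDC.
    """
    x, y, z = position
    cx, cy, cz, cw = mat_vec(model_view_projection, [x, y, z, 1.0])
    return (cx / cw, cy / cw, cz / cw)
```

Stopping before the divide and using only a model matrix would yield world-space coordinates instead, the other target space the text names.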
`Execution pipelines 240 output processed samples, such as
`vertices, that are stored in a vertex output buffer 260 including
`a register file, FIFO, cache, or the like (not shown). Processed
`vertices output by vertex output buffer 260 are received by a
`primitive assembly/setup unit 205. Primitive assembly/setup
`unit 205 calculates parameters, such as deltas and slopes, to
`rasterize the processed vertices and outputs parameters and
`samples, such as vertices, to a raster unit 210. Raster unit 210
`performs scan conversion on samples, such as vertices, and
`outputs samples, such as fragments, to a pixel input buffer
`215. Alternatively, raster unit 210 resamples processed verti-
`ces and outputs additional vertices to pixel input buffer 215.
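To make the setup and scan-conversion steps concrete, here is a small sketch for a single triangle. It is an assumption-heavy illustration: the patent does not say how deltas are used, and the edge-function test below is a commonly used technique swapped in for clarity, not the patent's method. `setup_edges` and `scan_convert` are hypothetical names, and pixels are sampled at their centers.

```python
def setup_edges(v0, v1, v2):
    """Setup: per-edge start point and deltas (dx, dy) for a triangle."""
    edges = []
    for (ax, ay), (bx, by) in ((v0, v1), (v1, v2), (v2, v0)):
        edges.append((ax, ay, bx - ax, by - ay))
    return edges


def scan_convert(v0, v1, v2):
    """Scan conversion: return (x, y) fragments whose centers lie in the triangle."""
    edges = setup_edges(v0, v1, v2)
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    fragments = []
    for y in range(int(min(ys)), int(max(ys)) + 1):
        for x in range(int(min(xs)), int(max(xs)) + 1):
            px, py = x + 0.5, y + 0.5   # sample at the pixel center
            # Edge function: its sign tells which side of the edge the point is on.
            w = [dx * (py - ay) - dy * (px - ax) for ax, ay, dx, dy in edges]
            # Accept either winding order.
            if all(e >= 0 for e in w) or all(e <= 0 for e in w):
                fragments.append((x, y))
    return fragments
```

Each returned position corresponds to one fragment that would be written to the pixel input buffer.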
`Pixel input buffer 215 outputs the samples to each execu-
`tion pipeline 240. Samples, such as pixels and fragments,
`output by pixel input buffer 215 are each processed by only
`one of the execution pipelines 240. Pixel input buffer 215
`determines which one of the execution pipelines 240 to output
`each sample to depending on an output pixel position, e.g.,
`(x,y), associated with each sample. In this manner, each
`sample is output to the execution pipeline 240 designated to
`process samples associated with the output pixel position. In
`an alternate embodiment, each sample output by pixel input
`buffer 215 is processed by one of any available execution
`pipelines 240.
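One way to picture position-based routing is a fixed mapping from the output pixel position (x, y) to a pipeline index. The mapping below is purely an assumption for illustration: the text only says the choice depends on the position, so the tile interleave, the tile size of 8, and the function name are all hypothetical.

```python
def pipeline_for_position(x, y, num_pipelines=4, tile_size=8):
    """Pick the execution pipeline designated for output pixel position (x, y).

    Assumed policy: the screen is split into square tiles, and tiles are
    interleaved across pipelines so neighboring tiles go to different pipelines.
    """
    tile_x, tile_y = x // tile_size, y // tile_size
    return (tile_x + tile_y) % num_pipelines
```

Because the mapping is a pure function of position, every sample for a given output pixel lands in the same pipeline, which is the property the text relies on.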
`Each execution pipeline 240 signals to pixel input buffer
`215 when a sample can be accepted or when a sample cannot
`be accepted as described further herein. Program instructions
`configure programmable computation units (PCUs) within an
`execution pipeline 240 to perform operations such as perspective
`correction, texture mapping, shading, blending, and the
`like. Processed samples are output from each execution pipe-
`line 240 to a pixel output buffer 270. Pixel output buffer 270
`optionally stores the processed samples in a register file,
`FIFO, cache, or the like (not shown). The processed samples
`are output from pixel output buffer 270 to raster analyzer 160.
`FIG. 3 is a block diagram of an embodiment of execution
`pipeline 240 of FIG. 2 including at least one multithreaded
`processing unit 300. An execution pipeline 240 can contain a
`plurality of multithreaded processing units 300, each multi-
`threaded processing unit 300 containing at least one PCU
`375. PCUs 375 are configured using program instructions
`read by a thread control unit 320. Thread control unit 320
`gathers source data specified by the program instructions and
`dispatches the source data and program instructions to at least
`one PCU 375. PCUs 375 performs computations specified by
`the program instructions and outputs data to at least one
`destination, e.g., pixel output buffer 270, vertex output buffer
`260 and thread control unit 320.
`
`A single program may be used to process several sets of
`samples. Thread control unit 320 receives samples or pointers
`to samples stored in pixel input buffer 215 and vertex input
`buffer 220. Thread control unit 320 receives a pointer to a
`program to process one or more samples. Thread control unit
`320 assigns a thread to each sample to be processed. A thread
`includes a pointer to a program instruction (program counter),
`such as the first instruction within the program, thread state
`information, and storage resources for storing intermediate
`data generated during processing of the sample. Thread state
`information is stored in a TSR (thread storage resource) 325.
`TSR 325 may be a register file, FIFO, circular buffer, or the
`like. An instruction specifies the location of source data
`needed to execute the instruction. Source data, such as inter-
`mediate data generated during processing of the sample, is
`stored in a register file 350. In addition to register file 350,
`other source data may be stored in pixel input buffer 215 or
`vertex input buffer 220. In an alternate embodiment, source
`data is stored in local memory 140, locations in host memory
`112, and the like.
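The per-sample thread assignment described above can be modeled with a small data structure. This is an illustrative sketch only: the field names, the dictionary standing in for TSR 325, the per-thread register list standing in for register file 350, and the `registers_per_thread` parameter are all assumptions and do not reflect the actual hardware layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Thread:
    program_counter: int  # pointer to the next program instruction
    state: Dict[str, int] = field(default_factory=dict)  # thread state information


class ThreadControlUnit:
    """Toy model: one thread is assigned to each sample to be processed."""

    def __init__(self, registers_per_thread: int = 4):
        self.registers_per_thread = registers_per_thread  # hypothetical size
        self.tsr: Dict[int, Thread] = {}                  # stands in for TSR 325
        self.register_file: Dict[int, List[float]] = {}   # stands in for register file 350
        self._next_id = 0

    def assign_thread(self, program_start: int) -> int:
        """Create a thread whose program counter points at the first instruction."""
        thread_id = self._next_id
        self._next_id += 1
        self.tsr[thread_id] = Thread(program_counter=program_start)
        # Storage for intermediate data generated while processing the sample.
        self.register_file[thread_id] = [0.0] * self.registers_per_thread
        return thread_id
```

Because each thread carries its own program counter and its own storage, one program can process several samples, with every thread advancing independently of the others.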
`Alternatively, in an embodiment permitting multiple pro-
`grams for two or more thread types, thread control unit 320
`also receives a program identifier specifying which one of the
`two or more programs the program counter is associated with.
`Specifically, in an embodiment permitting simultaneous
`execution of four programs for a thread type, two bits of
`thread state information are used to store the program iden-
`tifier for a thread. Multithreaded execution of programs is
`possible because each thread may be executed independent of
`other threads, regardless of whether the other threads are
`executing the same program or a different program. PCUs
`375 update each program counter associated with the threads
`in thread control unit 320 following the execution of an
`instruction. For execution of a loop, call, return, or
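The two-bit program identifier mentioned above (four simultaneous programs for a thread type) can be illustrated with a short bit-packing sketch. The text states only that two bits of thread state information hold the identifier; placing those bits in the low end of a state word, and the names `pack_state`, `program_id_of`, and `other_state_of`, are assumptions made for illustration.

```python
PROGRAM_ID_BITS = 2                           # two bits select one of four programs
PROGRAM_ID_MASK = (1 << PROGRAM_ID_BITS) - 1  # 0b11


def pack_state(program_id: int, other_state: int) -> int:
    """Combine a program identifier with other thread state in one word.

    Assumed layout: the identifier occupies the two low-order bits.
    """
    if not 0 <= program_id <= PROGRAM_ID_MASK:
        raise ValueError("program identifier must fit in two bits")
    if other_state < 0:
        raise ValueError("other_state must be non-negative")
    return (other_state << PROGRAM_ID_BITS) | program_id


def program_id_of(state_word: int) -> int:
    """Recover which of the four programs the program counter belongs to."""
    return state_word & PROGRAM_ID_MASK


def other_state_of(state_word: int) -> int:
    """Recover the remaining thread state bits."""
    return state_word >> PROGRAM_ID_BITS
```

With this layout, a thread's program counter is interpreted relative to whichever program `program_id_of` returns for that thread's state word.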