AMD
GFX9 SPI Specification
Rev 1.0 - Last Edit: 3-Nov-16

Author: Randy Ramsey
Originate: 10-Feb-15
Edit Date: 3-Nov-16
Document Ver. Num.: 1.0
Page: 1 of 62
`
This document is issued to you alone. Do not transfer it to or share it with another person, even within your organization.

THIS DOCUMENT CONTAINS INFORMATION THAT COULD BE SUBSTANTIALLY DETRIMENTAL TO THE INTEREST OF AMD THROUGH UNLICENSED USE OR UNAUTHORIZED DISCLOSURE.

Preserve this document's integrity:
=> Do not reproduce any portions of it.
=> Do not separate any pages from this cover.

Store this document in a locked cabinet accessible only by authorized users. Do not leave it unattended.

When you no longer need this document, return it to AMD. Please do not discard it.

"Copyright 2012, Advanced Micro Devices, Inc. ("AMD"). All rights reserved. This work contains confidential, proprietary information and trade secrets of AMD. No part of this document may be used, reproduced, or transmitted in any form or by any means without the prior written permission of AMD."

AMD, the AMD Arrow Logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG. HDMI is a trademark of HDMI Licensing, LLC.

AMD (NYSE: AMD) is a semiconductor design innovator leading the next era of vivid digital experiences with its ground-breaking AMD Fusion Accelerated Processing Units (APU). AMD's graphics and computing technologies power a variety of devices including PCs, game consoles and the powerful computers that drive the internet and businesses. For more information, visit http://www.amd.com.
ATI Ex. 2027
IPR2023-00922
Page 1 of 62
`
`

`

`
Revision History

| Date | Revision | Description |
`

`
Table of Contents

1 Introduction
1.1 Open Issues
1.2 Definitions
1.2.1 Acronyms
1.2.2 Terminology
1.3 Top Level Description
1.3.1 SPI Chip Level Data Flow Diagram
1.3.2 Chip Level Diagram
1.3.3 SPI Block Diagram
2 Features / Functionality
2.1 Stage and Organize Data for Shader Launch
2.2 Compute Shader (CS)
2.2.1 Resource Probing
2.2.2 Threadgroup Ordering
2.2.3 Threadgroup Halting and Discarding
2.2.4 Queue Status
2.2.5 Unordered Dispatches
2.2.6 State Forwarding to SQG
2.2.7 First Wave of Dispatch
2.2.8 Compute Shader Index Terms
2.3 VGT-SPI "Vert" Shaders
2.3.1 ES, GS, VS Processing
2.3.2 On-chip GS
2.3.3 Tessellation
2.3.4 Distributed Tessellation
2.3.4.1 Work Creation Description
2.3.4.2 Offchip LDS ID Changes
2.3.4.3 Offchip LDS Deallocation Changes
2.4 Pixel Shader (PS)
2.4.1 Pixel Data Flow
2.4.1.1 Calculate Per-Pixel Barycentric Coordinates
2.4.1.2 Pull Model
2.4.2 Scale Resolution Based on Screen Location
2.4.2.1 Visualizing the Scaling
2.4.2.2 Impacts to BCI Equation
3 End of Spec Updates, Beyond This Point Info May Be Out of Date
3.1.1 Unique Sample Positions per Pixel
3.1.2 LDS Parameter Data Loading for Pixels
3.2 Organization of Data in the Parameter Cache
    Point Sprite Override
    PARAMGEN
    Support Deeper Parameter Cache and Avoid Duplicate Data
    Pixel Shader VGPR Initialization
    Vertex VGPR Initialization
`
3.6 Resource Allocation
3.6.1 CU and SIMD Assignment
3.6.1.1 SIMD Assignment for Work Distribution and Input Bandwidth
3.6.4 Wave Buffer
3.6.5 Barrier
3.6.6 Bulky CS Threadgroups
3.6.7 Position Buffer and Parameter Cache
3.6.8 Late VS Allocation
3.6.8.1 Allocation Priority
3.6.9 Virtualization of Compute Unit Masks
3.6.10 Resource Reservations
3.6.12 Multiplier for Resource Limits
3.7.1 Maintaining …
3.7.2 Export Granting
3.10 Wave/Event Ordering
3.11 Event Collection
3.13 Wavefront Lifetime Status Counters
4.2 Parameter Cache …
4.5 Graphics Balanced Throughput Cases
Gating
`
Table of Figures

Figure 1 - SPI Chip Level Data Flow Diagram
Figure 2 - Chip Level Diagram
Figure 3 - Top Level Connectivity Block Diagram
Figure 4 - Block Diagram
Figure 5 - CS Data Flow
Figure 6 - Async Compute Block Diagram
Figure 7 - CS Threadgroup Ordering
Figure 8 - CS Thread Count Increment Example
Figure 9 - "Vertex" Data Flow VGT-SPI
Figure 10 - VGT ES, GS, VS Vertex Input
Figure 11 - LS, HS, ES, GS, VS Vertex Input
Figure 12 - Pixel Input Data
Figure 13 - Color Export Bus Arbitration, 1RB
Figure 14 - Color Export Bus Arbitration, 2RB
Figure 15 - Color Export Bus Arbitration, 4RB
Figure 16 - Color Export Bus Arbitration, 8RB
Figure 17 - LDS Logical Layout
Figure 18 - Parameter Cache Data Organization
Figure 19 - Combined Data Flow
Figure 20 - Persistent State Update FIFOs
Figure 21 - Persistent State Update FIFOs
Figure 22 - Performance, Balanced Throughput Case, VS-PS
Figure 23 - Performance, Balanced Throughput Case, ES-GS-VS-PS
Figure 24 - Performance, Balanced Throughput Case, LS-HS-ES-GS-VS-PS
`
`
1 Introduction

This document describes the requirements, functionality, and target performance of the Shader Processor Input (SPI) block.
`
1.1 Open Issues

1.2 Definitions
`
1.2.1 Acronyms

SPI - Shader Processor Input
SC - Scan Converter
SQ - Sequencer
SQC - Sequencer Cache
SQG - SQ Global Block, instanced in SPI
SX - Shader Export
SP - Shader Processor
CP - Command Processor
CPG - Command Processor, Graphics
CPC - Command Processor, Compute
SE - Shader Engine
SH - Shader Array
CU - Compute Unit
SIMD - Single Instruction Multiple Data unit in the shader processor (SP)
UL - Upper Left
UR - Upper Right
LL - Lower Left
LR - Lower Right
VGPR - Vector General Purpose Register in the SP
SGPR - Scalar General Purpose Register in the SQ
CS - Compute Shader
LS - API Vertex shader stage when doing tessellation, writes to LDS
HS - Hull shader stage of tessellation
VS - Vertex shader, could be normal vertices, final pass of a Geometry Shader, or domain shader
GS - Geometry Shader, processes primitives
ES - Export Shader, first vertex pass of a Geometry Shader that processes vertices
PS - Pixel Shader, processes pixels
VSR - Vertex Input Staging Register, holds input data for vertex threads
PSR - Pixel Input Staging Register, holds input data for pixel threads
LDS - Local Data Store
se_id - Shader Engine Identification Number
sh_id - Shader Array Identification Number
MSAA - Multi-Sample Anti-Aliasing
EQAA - Enhanced Quality Anti-Aliasing
`
`
1.2.2 Terminology

Event: a special token sent through the graphics pipeline which can be used to enforce synchronization, flush caches, and report status back to the CP. All blocks pipeline these tokens and keep them ordered with other graphics data.
Thread: one instance of a shader program being executed on a wavefront. Each thread has its own data which is unique from any other thread.
Wavefront: the basic unit of work. There are 64 threads per wavefront. It is a group of threads that can be executed simultaneously on a SIMD.
Threadgroup, Subgroup: a group of threads that may span several wavefronts. All threads are guaranteed to run on the same CU. This allows for shared CU resources such as the Local Data Store (LDS) and synchronization resources across all threads.
Tessellation Engine: a VGT module that implements DX11 tessellation functionality.
Pixel Quad: a 2x2 pixel region.
Pixel Center: the current pixel's screen coordinates, given as PIX_X.5, PIX_Y.5.
Pixel Centroid: the current pixel's centroid in screen coordinates, defined as the covered sample location closest to pixel center. If all samples of a pixel are hit, center will be used for centroid even if center is not one of the current sample locations.
Pixel Sample: location of the sample ID of the current iteration when running at sample frequency.
Facedness: the PA-determined face flag indicating front or back facing.
Param_gen: automatically generated ST texture coordinates, typically used with points.
SIMD: Single Instruction Multiple Data unit in the shader processor (SP).
Shader Array: a combination of blocks separate and unique for shader processing, including a shader core consisting of Compute Units.
newvector (aka fpos, first_prim_of_slot): parameter cache sync token received from the SC for pixels, used to make sure the SPI waits for VS to finish exporting parameter data before pixels start trying to read it.
Helper pixel: any non-hit pixel being processed as a part of a quad with other hit pixels.
`
1.3 Top Level Description

The main purpose of the SPI is to manage shader resources and provide shader input data to the GPRs and wavefronts to the SQ. It accumulates "vertex" type shader input data from the VGT (VS, GS, ES, HS, LS) into wavefronts. It receives compute shader (CS) data and state from the CPG and CPC on csdata interfaces. Resources required to process wavefronts and CU/SIMD assignment in the shader array (SH) are managed by the SPI in terms of allocation and de-allocation. SPI passes data through for the VGT verts and prims. For HS and GS, SPI unrolls threadgroups and subgroups into wavefronts. For CS, SPI unrolls threadgroups into wavefronts and generates an index per thread based on the threadgroup size. Pixel quad data delivered from the SC is accumulated into wavefronts. The SPI processes this data, per pixel, to interpolate and produce barycentric gradient data (I,J), or screen X,Y, and/or primitive facedness data. The SPI loads this data into VGPRs and coordinates moving primitive attribute data from the parameter caches into a CU Local Data Store (LDS) for the pixel shader to use for attribute interpolation. SPI synchronizes the vertex shader attribute exports with the pixel shader reading those attributes, guaranteeing the attribute data has been written to the parameter cache before allowing PS to read.
`
1.3.1 SPI Chip Level Data Flow Diagram

Figure 1 shows the blocks and major data paths directly and functionally associated with the SPI.

Inputs from the VGT: subgroups, waves, events, and vertex input data for the data types VS, GS, ES, HS, LS.
Inputs from the SC: pixel data including coverage, primitive information, and events.
From the CPG: compute state, events, threadgroups for GFX.
From the CPC: compute state, events, threadgroups for async compute.
Shader input data into the GPRs and wavefront input to the SQ.
VS position and parameter cache data writes to the SX and PC.
Parameter cache read and LDS write controls.
`
`
Figure 1 - SPI Chip Level Data Flow Diagram
`
Referencing Figure 1, for doing just vertex and pixel shading, vertex and primitive type processing are associated with the green colored lines. The VGT initially starts off sending vertex indices in the form of vsverts to the SPI, while at the same time sending the primitive connectivity to the PA, identifying how these vertices will get built back into primitives. The SPI buffers up the vsverts into a wavefront, and once it has received a full wavefront of data, the wave transfer from the VGT will trigger the SPI to release the data to the SQ and feed associated data into the GPRs. When the vertex shader starts processing position data, it will typically send out position early to the position buffers in the SX, which then allows the PA to read that position data and start building the primitives, producing primitives which go through the Scan Converter (SC) to produce pixels. Once the SC has primitives, it will start producing pixels which are fed to the SPI. Once the SPI has a full wavefront of pixels, it will try to send associated data into the GPRs with the wavefront to the SQ. Reads are made to copy parameter data out of the parameter cache and write it into an SPI-determined range of LDS in a particular CU.
`
`
1.3.2 Chip Level Diagram

Figure 2 shows the SPI block and its associated relationship to chip level inter-connections. Here, the physical partitioning of barycentric logic is shown by the BCI blocks. For the purpose of this document, the BCI logic will be considered part of the logical SPI design.
`
Figure 2 - Chip Level Diagram
`
`
1.3.3 SPI Block Diagram

Figure 3 - Top Level Connectivity Block Diagram
`
`

`

`
Diagram copied from …/doc/design/blocks/spi/gfx9SPI_BlockDiagram.vsd

Figure 4 - Block Diagram
`
2 Features / Functionality

2.1 Stage And Organize Data for Shader Launch

The SPI logical block stages and organizes efficient loading of shader input data to the Vector/Scalar General Purpose Registers (VGPR/SGPR) and Local Data Store (LDS) in the Shader Array and manages resources required to run those shader programs. The VGT will have several types of inputs to the SPI: normal vertices that will create positions and parameters for rasterization and pixel processing (VS, which could be normal vertices or the final pass of a Geometry Shader), Geometry Shader (GS) primitives, vertices that only export to memory (ES, which is the first vertex pass of a Geometry Shader), vertices acting as the first stage of tessellation processing (LS), and patch data associated with the Hull Shader (HS). The VS, GS, ES, HS, and LS are often generalized into the category of "verts" when discussing data moving through the SPI. The Scan Converter (SC) delivers pixel quads to the SPI for pixel shading. The CPG block delivers DX11 Compute threadgroups to the SPI for launching compute shaders. The CPC delivers Async Compute threadgroups to the SPI for launching compute shaders.
`
2.2 Compute Shader (CS)

As shown in Figure 1, Compute Shader input data can come from either the CPG (GFX-CS) or the CPC (Async CS). CS waves go through the same resource arbitration and allocation as all other supported SPI wavefront types.
`
`
DX11 requires support for compute shaders, and the SPI plays a role in getting compute shaders into the shader array. Both the CPG and CPC deliver threadgroups to the SPI along with persistent state data that tells the SPI how to process those threadgroups. The SPI is responsible for unrolling each threadgroup into the number of wavefronts required to process all of the threads for the threadgroup.
`
Figure 5 - CS Data Flow
`
`
Figure 6 - Async Compute Block Diagram
`
2.2.1 Resource Probing

If there are more than 4 Async Compute pipes present in a configuration (more than 1 compute ME), then pairs of compute wave controllers will share a single probe to Resource Alloc (RA) for allocating resources. Each of the pair takes alternating turns using the probe to request resources. This probing opportunity will alternate between the two pipelines once every four clocks until a probing pipeline has a work group that fits and is selected by RA. Once a pipeline is selected, it will allocate resources for all waves in its threadgroup before releasing the probe. If only one pipe of a pair has a threadgroup ready to allocate, it will have exclusive use of the probe for requesting resources and can continue requesting on every four-clock cycle.

Each CS controller should check its tg_per_cu limit, wave_per_sh limit, scratch limit, and crawler space before requesting resources so it doesn't take cycles away from the other CS controller sharing a common probe.
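The alternating-probe behavior above can be sketched as a small Python model. This is illustrative only, not RTL; the function name and the encoding of readiness are invented for the sketch.

```python
# Model of the shared RA probe: two compute wave controllers alternate
# ownership of one probe, switching once every four-clock slot; a sole
# requester keeps exclusive use of the probe.

def probe_owner(pipe_ready, last_owner):
    """Pick which pipe of the pair drives the probe this 4-clock slot.

    pipe_ready: (bool, bool) - which pipes have a threadgroup ready.
    last_owner: 0 or 1       - pipe that used the probe last slot.
    Returns the pipe index granted the probe, or None if neither is ready.
    """
    a, b = pipe_ready
    if a and b:
        return 1 - last_owner   # both ready: alternate between the pipelines
    if a:
        return 0                # only one ready: exclusive use of the probe
    if b:
        return 1
    return None

# Example: with both pipes ready, ownership ping-pongs each 4-clock slot.
owner = 1
slots = []
for _ in range(4):
    owner = probe_owner((True, True), owner)
    slots.append(owner)
# slots == [0, 1, 0, 1]
```

The model stops alternating as soon as one side stops requesting, matching the "exclusive use" case in the text.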
`
2.2.2 Threadgroup Ordering

Ordering of threadgroups for a given async compute pipe needs to be maintained across all SEs. The Dispatch Controller (DC) assigns threadgroups round-robin to all SEs in the chip, and the SPIs from each of those SEs must cooperate to ensure that a threadgroup from a given SE is not allowed to probe until the threadgroup before it has won allocation. The SPI needs to wait until the first wave of the previous group allocates, but does not need to wait for all waves of the previous threadgroup.

The Dispatch Controller will send two signals with each threadgroup, first_group and last_group, to tell the SPI when each dispatch starts and finishes. If a group is marked as first_group, the CS controller can start requesting immediately without waiting on any previous group. If a task is pre-empted and restarted, the first threadgroup of
`
the restart should be marked as first_group even if it is not the first of the dispatch. Once that first_group allocates, the allocating controller sends a tg_alloc pulse to the next SPI in the dispatch sequence so that it can start requesting for its group. For allocating groups marked as last_group, no tg_alloc pulse is sent. This scheme avoids any problems that can arise from an implicit ordering scheme where the DC and the SPI both independently manage threadgroup ordering. First_groups can be sent to any SPI, regardless of where the previous group was sent, and last_groups won't create any left-over status in the SPI. Power gating and soft_reset issues are also avoided since no duplicate state needs to be kept in sync between DC and SPI, which are physically in separate tiles.

SPI also supports a mode where a DISPATCH_INITIATOR write clears the baton for that async compute pipe such that the last_tg from the dispatch controller is not necessary. This is the default behavior for SPI, but it can be disabled by setting SPI_CONFIG_CNTL_1.BATON_RESET_DISABLE to 1.
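The first_group/last_group baton rule can be summarized with a minimal sketch. This is a behavioral model only; the function and flag names are illustrative, not the actual signal names.

```python
# Model of threadgroup launch ordering: the DC assigns groups round-robin
# across SEs; a group may only probe once the previous group has allocated
# (a tg_alloc "baton"), except a first_group, which starts immediately.
# A last_group sends no baton onward.

def launch_order(num_se, groups):
    """groups: list of (first_group, last_group) flags in dispatch order.
    Returns the SE index permitted to allocate for each group, in order."""
    order = []
    baton_held = False                   # does some SPI owe a tg_alloc pulse?
    for i, (first, last) in enumerate(groups):
        se = i % num_se                  # DC round-robin assignment
        # first_group may allocate immediately; otherwise the baton from the
        # previous group's allocation must have arrived.
        assert first or baton_held, "group must wait for previous tg_alloc"
        order.append(se)
        baton_held = not last            # last_group sends no pulse
    return order

# A 5-group dispatch over 2 SEs: first group flagged first, last flagged last.
gs = [(True, False), (False, False), (False, False), (False, False), (False, True)]
# launch_order(2, gs) == [0, 1, 0, 1, 0]
```

The steering optimization in the text (single enabled SH, every group marked both first and last) corresponds to every group passing the `first` check with no baton ever held.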
`
`
`
Figure 7 - CS Threadgroup Ordering
`
The compute controllers also support disabling of entire SHs for a given pipe using the COMPUTE_STATIC_THREAD_MGMT register. This feature is also known as "steering", and allows a dispatch to be sent only to a subset of the possible SHs in a given config. The DC will shadow CU_EN settings and only send threadgroups to SPIs with at least one CU enabled for the dispatch. When passing/receiving tg_alloc, each SPI needs to check its own CU_EN settings. If the receiving SPI has a CU_EN of 0, then it should pass the token along to the next SPI. This passing of the token through disabled SPIs adds extra time between threadgroups starting. The DC will optimize for the case where only a single SH is enabled for a dispatch by marking every threadgroup sent to that single SH as both first and last of group. This way no ordering tokens are passed by the SPIs and the single enabled SH is allowed to launch threadgroups as fast as possible.
`
2.2.3 Threadgroup Halting and Discarding

The CS controller will also respond to halt signaling to accomplish precise launch pre-emption. Upon being commanded to halt by the CPC, the controller will finish out any wavefronts from partially started work groups and then stall any subsequent traffic from that pipe.

CLIENT_TARGET_halt_req: If asserted, the receiving block must halt the production of compute work at a well-defined pipeline location. After halting, the receiver must return a halt_ack.

If a discard is then requested, any other entries in the input fifo will be popped and discarded before signaling back to the grid dispatcher that the SPI has prepared to switch. A discard_req will always happen within a halt_req/halt_ack window. The SPI must be halted before it can be told to discard.
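The constraint that discard is only legal inside a halt_req/halt_ack window can be modeled with a tiny sketch. The class and method names are hypothetical; this is not the actual interface.

```python
# Model of the halt/discard handshake: a discard request outside the
# halt_req/halt_ack window is a protocol violation.

class CsController:
    def __init__(self):
        self.halt_req = False
        self.halt_ack = False
        self.pending = []          # input fifo of not-yet-allocated groups

    def halt(self):
        """CPC commands a halt; finish partial groups, then acknowledge."""
        self.halt_req = True
        self.halt_ack = True

    def discard(self):
        """Pop and discard pending entries; only legal while halted."""
        if not (self.halt_req and self.halt_ack):
            raise RuntimeError("discard_req outside halt_req/halt_ack window")
        dropped, self.pending = self.pending, []
        return dropped             # entries popped before switch-ready

c = CsController()
c.pending = ["tg0", "tg1"]
c.halt()
assert c.discard() == ["tg0", "tg1"]
```

The model enforces the ordering stated in the text: the SPI must be halted before it can be told to discard.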
`
CLIENT_TARGET_discard_req: If asserted, the receiving block discards any pending compute work that has not yet been allocated shader resources. A client should only assert this when both CLIENT_TARGET_halt_req and TARGET_CLIENT_halt_ack are asserted.

The CS controller will drive a tg_allocated signal to the CP notifying the DC when a threadgroup allocates. This is needed so the DC can track the exact number of groups that launch versus those that are discarded after a halt.
`
2.2.4 Queue Status

Each CS controller maintains a count of active waves for all 8 queues that can drive that pipe. SPI provides that status through GRBM reads using several register decodes. One register, SPI_CSQ_WF_ACTIVE_STATUS, contains a single ACTIVE bit for each queue of each pipe of a given ME. SPI_CSQ_WF_ACTIVE_STATUS is indexed by GRBM ME ID. SPI_CSQ_WF_ACTIVE_COUNT_{0-7}.COUNT provides the actual number of wavefronts that are in flight for a specific queue. SPI_CSQ_WF_ACTIVE_COUNT_{0-7}.EVENTS provides the actual number of events that are in flight for a specific queue. SPI_CSQ_WF_ACTIVE_COUNT is indexed by GRBM ME ID and GRBM PIPE ID.
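The register decodes above can be modeled roughly as follows. Only the register names come from the text; the storage layout, bit packing, and 8-queues-per-pipe assumption are invented for the sketch.

```python
# Rough model of the queue-status decodes: one ACTIVE bit per queue per pipe
# of the selected ME, plus per-queue wave/event counts indexed by ME and PIPE.

status = {
    # (me_id, pipe_id, queue_id) -> in-flight counts (example data)
    (0, 0, 0): {"waves": 3, "events": 1},
    (0, 1, 5): {"waves": 0, "events": 0},
}

def csq_wf_active_status(me_id):
    """SPI_CSQ_WF_ACTIVE_STATUS: ACTIVE bit per queue of each pipe of an ME."""
    bits = 0
    for (me, pipe, q), s in status.items():
        if me == me_id and (s["waves"] or s["events"]):
            bits |= 1 << (pipe * 8 + q)     # assumes 8 queues per pipe
    return bits

def csq_wf_active_count(me_id, pipe_id, queue_id):
    """SPI_CSQ_WF_ACTIVE_COUNT_{q}: waves and events in flight for a queue."""
    return status.get((me_id, pipe_id, queue_id), {"waves": 0, "events": 0})

# Only (me 0, pipe 0, queue 0) has anything in flight in the example data.
assert csq_wf_active_status(0) == 1 << 0
```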
`
2.2.5 Unordered Dispatches

DC and SPI also support an Unordered Dispatch mode using the ORDER_MODE field of the DISPATCH_INITIATOR. When launching an Unordered dispatch, the Dispatch Controller will send every threadgroup marked with both first/last_group. This allows the SPI in each SH to launch threadgroups independently without passing or expecting the order baton.

Unordered mode also changes the way SPI responds to halt requests. In the ordered mode, SPI can halt on any threadgroup boundary and return halt_ack with threadgroups still pending in its input fifo. In the unordered mode, SPI will allocate all threadgroups that have been sent from the DC before returning halt_ack.
`
2.2.6 State Forwarding to SQG

All state traffic to each compute pipe needs to be passed to the SQG for logging. State writes are sent from the outputs of credit/debit fifos with arbitration and backpressure to ensure that only 1 controller sends per clock.
`
2.2.7 First Wave of Dispatch

SPI supports SQ/SQC volatile cache deallocation control by marking the first threadgroup of a dispatch that is sent to each CU and SQC (group of CUs). The scoreboard logic used to track when threadgroups are sent to CU/SQC needs to be reset at the start of each dispatch, so each CS wave controller needs to provide this information. The CS wave controller will signal "first wave of dispatch" to RA for the first wave request of the first threadgroup after each DISPATCH_INITIATOR.

SPI is aware of SQ to SQC mapping, both for this invalidate-volatile feature as well as CU busy signaling for clk-gate control. The SPI is ifdefed to handle both different numbers of CUs (GPU_GC_NUM_CU_PER_SH) and different numbers of CUs-per-SQC (GPU_GC_MAX_CU_PER_SQC).
`
2.2.8 Compute Shader Index Terms

For CS, the SPI can load up to 3 index terms as input into the VGPRs. This is a 1- to 3-dimensional incrementing index that represents the relative ID of the thread within its threadgroup, known as ThreadIDinGroup. COMPUTE_PGM_RSRC2.TIDIG_COMP_CNT is used to control the number of components written by the SPI. Here is a simple example of how the SPI would generate the ThreadIDinGroup across the wavefronts with incrementing X,Y,Z indices.

For a threadgroup with dimension X=3, Y=16, Z=2, the SPI would create 2 wavefronts to process the 96 valid threads (3*16*2). Sequentially, the thread input values would look like this, where the X increments first and wraps back to zero. At each wrap point, the Y term would increment, all the way up to the Z term incrementing.
`
`
Thread0 (X,Y,Z) = 0,0,0
Thread1 = 1,0,0
Thread2 = 2,0,0
Thread3 = 0,1,0
...
Thread95 = 2,15,1
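The X-first wrap order above amounts to a mixed-radix counter, which can be sketched in a few lines of Python (illustrative only; `thread_ids` is an invented name, not the hardware implementation):

```python
# Generate ThreadIDinGroup tuples for an x*y*z threadgroup: X increments
# first and wraps, then Y, then Z, and the stream is unrolled into
# 64-thread wavefronts.

def thread_ids(x, y, z):
    ids = [(i % x, (i // x) % y, i // (x * y)) for i in range(x * y * z)]
    return [ids[i:i + 64] for i in range(0, len(ids), 64)]   # 64/wavefront

waves = thread_ids(3, 16, 2)
assert len(waves) == 2            # 96 threads -> 2 wavefronts
assert waves[0][0] == (0, 0, 0)   # Thread0
assert waves[0][3] == (0, 1, 0)   # Thread3: X wrapped, Y incremented
assert waves[1][-1] == (2, 15, 1) # Thread95
```

This reproduces exactly the Thread0 through Thread95 sequence listed above for the X=3, Y=16, Z=2 example.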
`
This is demonstrated, 16 threads wide and 4 clocks deep, in Figure 8.

Figure 8 - CS Thread Count Increment Example
`
2.3 VGT-SPI "Vert" Shaders

The SPI can receive one thread per clock from the VGT for each of LS, HS, ES, GS, and VS. The LS, ES, and VS interfaces are all 128 bits wide, GS is 87 bits wide, and HS is 43 bits. The SPI takes a serial stream of up to 64 threads from the VGT (one wavefront) and accumulates it into four parallel lines in the Vertex Staging Register (VSR), matching the VGPR write format and allowing the SPI to minimize the VGPR input cycles for vertex data. The interface between the SPI and the VGPRs is 16 verts * 1 component wide, and the SPI is always trying to write 16 threads per cycle into the GPRs. The SPI arbitrates on 4-clock cycles, so every time a particular type gets to write into the VGPRs it really wants to write 64 threads, 16 at a time, over 4 cycles. If the SPI tried to write immediately to the VGPRs every time the VGT came in with 1 serial thread, the other 63 threads of the 4-clock cycle would be wasted.
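The serial-to-parallel staging can be sketched as follows. This is a minimal model that assumes threads fill the 16-wide rows in arrival order; `pack_wavefront` is an invented name.

```python
# Pack 64 serially arriving vertex indices into four 16-wide rows, matching
# the 16-threads-per-cycle-over-4-clocks VGPR write pattern described above.

def pack_wavefront(serial_verts):
    assert len(serial_verts) == 64, "one full wavefront required"
    return [serial_verts[row * 16:(row + 1) * 16] for row in range(4)]

rows = pack_wavefront(list(range(64)))
assert len(rows) == 4 and all(len(r) == 16 for r in rows)
assert rows[0][0] == 0 and rows[3][15] == 63
```

Each row corresponds to one of the 4 clocks of a VGPR write burst, so a full wavefront drains in 4 cycles instead of 64.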
`
Figure 9 shows the serial stream from the VGT being packed into this 16-wide over 4-clock wavefront.
`
Figure 9 - "Vertex" Data Flow VGT-SPI
`
`
2.3.1 ES, GS, VS Processing

In GFX9 the change was made to combine ES and GS processing into a single shader stage, so there is no need to synchronize ES-done to GS-start. There is also no need for SPI to pass parent CU information from ES to GS groups like was necessary for onchip-GS processing in previous families. The synchronization of GS to VS processing is handled outside of the SPI (VGT waits on gs_count_done from the GS shader before generating VS). If GS is passing data to VS using onchip LDS (onchip-GS), then SPI must pass subgroup information from the producing GS to the consuming VS subgroup.
`
Each vertex controller runs independently, with the only interaction being the arbitration for writes to a particular VSR, until wavefronts request resource allocation. There is only one copy of VSR memory, composed of multiple banks which hold the different components. There is a simple priority arbitration here to make sure there are no data collisions when multiple controllers need to write to the same memory banks. The priority order is a fixed lowest-to-highest of LS, HS, ES, GS, VS. Space for multiple wavefronts exists for each type in the SPI, which allows the SPI to start copying one wavefront while the VGT starts sending the next wavefront.

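The fixed-priority arbitration can be modeled as a one-line selection over the controllers requesting the same bank (an illustrative sketch; the request representation is an assumption, and "lowest-to-highest of LS, HS, ES, GS, VS" is read here as LS lowest priority and VS highest):

```python
# Fixed-priority arbitration for VSR bank writes, per the order above.
# Index 0 = lowest priority, last index = highest priority.
PRIORITY = ["LS", "HS", "ES", "GS", "VS"]

def arbitrate(requests):
    """Pick the winner among controllers requesting the same bank."""
    pending = [r for r in requests if r in PRIORITY]
    if not pending:
        return None
    # Highest-priority requester wins; losers retry on a later cycle.
    return max(pending, key=PRIORITY.index)

assert arbitrate(["LS", "GS", "ES"]) == "GS"
assert arbitrate(["HS"]) == "HS"
```

A fixed order keeps the arbiter trivial in hardware; fairness is not a concern here because losing controllers simply stall their copy into the VSR, not the shader itself.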
Once a full wavefront of vertex indices is written into the VSR, and the associated wave transfer from the VGT has occurred to let the SPI know it is OK to issue that wavefront, the Vertex Wave Controllers will try to allocate the resources that the shader needs to execute in the shader complex. In the case of LS/HS and ES/GS groups, SPI waits until all transfers of all waves of the group (LS-vert/HS-vert or ES-vert/GS-prim) have been received before trying to launch the group. This means the VSR must be able to hold an entire group's worth of data, up to a max of 4 wavefronts, for each of these group types. If the wave/group wins resource allocation, the wave control information (resource bases/sizes, state_id, pipe_id, etc.) is sent to the shader input write controllers to load the wavefront to the Shader Array.

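The launch-gating rule for merged groups can be sketched as a counter that holds off resource allocation until every wave of the group is in the VSR (a hypothetical sketch; the class and method names are illustrative, not hardware signals):

```python
MAX_WAVES_PER_GROUP = 4  # VSR holds up to 4 wavefronts per group type

class GroupLaunchGate:
    """Sketch: an LS/HS or ES/GS group may only attempt resource
    allocation once all of its waves have been transferred into the VSR."""

    def __init__(self, expected_waves):
        assert 1 <= expected_waves <= MAX_WAVES_PER_GROUP
        self.expected_waves = expected_waves
        self.received = 0

    def wave_transferred(self):
        # Called once per wave transfer received from the VGT.
        self.received += 1

    def ready_to_launch(self):
        return self.received >= self.expected_waves

gate = GroupLaunchGate(expected_waves=3)
gate.wave_transferred()
gate.wave_transferred()
assert not gate.ready_to_launch()
gate.wave_transferred()
assert gate.ready_to_launch()
```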
Figure 10 - VGT ES, GS, VS Vertex Input [block diagram: a Vertex Controller (LS/HS, ES/GS/VS) writes incoming vertex data from the VGT into the VSR; when the VSR is ready, Wave Allocation issues a New Wave cmd to the Wave Write, SGPR Write, and VGPR Write controllers, which load the SGPR and VGPR data]

`
`ATI Ex. 2027
`IPR2023-00922
`Page 17 of 62
`
2.3.2 On-chip GS
Onchip GS mode allows the use of onchip LDS space to store the ESGS and GSVS ring
