Author: Randy Ramsey

ORIGINATE
10-Feb-15

EDIT DATE
3-Nov-16

DOCUMENT-VER. NUM.
1.0

PAGE
1 of 62

GFX9 SPI Specification

Rev 1.0 - Last Edit: 3-Nov-16

This document is issued to you alone. Do not transfer it to or share it with another person, even within your
organization.

THIS DOCUMENT CONTAINS INFORMATION THAT COULD BE SUBSTANTIALLY DETRIMENTAL TO
THE INTEREST OF AMD THROUGH UNLICENSED USE OR UNAUTHORIZED DISCLOSURE.

Preserve this document's integrity:

=> Do not reproduce any portions of it.

=> Do not separate any pages from this cover.

Store this document in a locked cabinet accessible only by authorized users. Do not leave it unattended.

When you no longer need this document, return it to AMD. Please do not discard it.

"Copyright 2012, Advanced Micro Devices, Inc. ("AMD"). All rights reserved. This work contains confidential, proprietary information and trade
secrets of AMD. No part of this document may be used, reproduced, or transmitted in any form or by any means without the prior written permission of AMD."

AMD, the AMD Arrow Logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a
registered trademark of PCI-SIG. HDMI is a trademark of HDMI Licensing, LLC.

AMD (NYSE: AMD) is a semiconductor design innovator leading the next era of vivid digital experiences with its ground-breaking AMD Fusion Accelerated
Processing Units (APU). AMD's graphics and computing technologies power a variety of devices including PCs, game consoles and the powerful computers that
drive the internet and businesses. For more information, visit http://www.amd.com.

ATI Ex. 2027
IPR2023-00922
Page 1 of 62
`
`
`
Revision History

Date | Revision | Description
`
`
`
`
`
Table of Contents

1 Introduction
  1.1 Open Issues
  1.2 Definitions
    1.2.1 Acronyms
    1.2.2 Terminology
  1.3 Top Level Description
    1.3.1 SPI Chip Level Data Flow Diagram
    1.3.2 Chip Level Diagram
    1.3.3 SPI Block Diagram
2 Features / Functionality
  2.1 Stage and Organize Data for Shader Launch
  2.2 Compute Shader (CS)
    2.2.1 Resource Probing
    2.2.2 Threadgroup Ordering
    2.2.3 Threadgroup Halting and Discarding
    2.2.4 Queue Status
    2.2.5 Unordered Dispatches
    2.2.6 State Forwarding to SQG
    2.2.7 First Wave of Dispatch
    2.2.8 Compute Shader Index Terms
  2.3 VGT-SPI "Vert" Shaders
    2.3.1 ES, GS, VS Processing
    2.3.2 On-chip GS
    2.3.3 Tessellation
    2.3.4 Distributed Tessellation
      2.3.4.1 Work Creation Description
      2.3.4.2 Offchip LDS ID Changes
      2.3.4.3 Offchip LDS Deallocation Changes
  2.4 Pixel Shader (PS)
    2.4.1 Pixel Data Flow
      2.4.1.1 Calculate Per-Pixel Barycentric Coordinates
      2.4.1.2 Pull Model
    2.4.2 Scale Resolution Based on Screen Location (9.125)
      2.4.2.1 Visualizing the Scaling
      2.4.2.2 Impacts to BCI Equation
3 End of Spec Updates, Beyond This Point Info May Be Out of Date
  3.1.2 Unique Sample Positions per Pixel
  3.2 LDS Parameter Data Loading for Pixels
    3.2.1 Organization of Data in the Parameter Cache
    3.2.4 Point Sprite Override
    3.2.5 PARAM_GEN
    3.2.6 Support Deeper Parameter Cache and Avoid Duplicate Data
  3.4 Pixel Shader VGPR Initialization
  3.5 Vertex Shader VGPR Initialization
  3.6 Resource Allocation
    3.6.1 CU and SIMD Assignment
      3.6.1.1 SIMD Assignment for Work Distribution and Input Bandwidth
    3.6.4 Wave Buffer
    3.6.5 Barrier
    3.6.6 Bulky CS Threadgroups
    3.6.7 Position Buffer and Parameter Cache
    3.6.8 Late VS Allocation
      3.6.8.1 Allocation Priority
    3.6.9 Virtualization of Compute Unit Masks
    3.6.10 Resource Reservations
    3.6.12 Multiplier for Resource Limits
    3.7.2 Export Granting
  3.10 Wave/Event Ordering
  3.11 Event Collection
  3.13 Wavefront Lifetime Status Counters
4 Performance
  4.5 Graphics Balanced Throughput Cases
5 Clock Gating
`
`
`
Table of Figures

Figure 1 - SPI Chip Level Data Flow Diagram
Figure 2 - Chip Level Diagram
Figure 3 - Top Level Connectivity Block Diagram
Figure 4 - Block Diagram
Figure 5 - CS Data Flow
Figure 6 - Async Compute Block Diagram
Figure 7 - CS Threadgroup Ordering
Figure 8 - CS Thread Count Increment Example
Figure 9 - "Vertex" Data Flow VGT-SPI
Figure 10 - VGT ES, GS, VS Vertex Input
Figure 11 - LS, HS, ES, GS, VS Vertex Input
Figure 12 - Pixel Input Data
Figure 13 - Color Export Bus Arbitration, 1 RB
Figure 14 - Color Export Bus Arbitration, 2 RB
Figure 15 - Color Export Bus Arbitration, 4 RB
Figure 16 - Color Export Bus Arbitration, 8 RB
Figure 17 - LDS Logical Layout
Figure 18 - Parameter Cache Data Organization
Figure 19 - Combined Data Flow
Figure 20 - Persistent State Update FIFOs
Figure 21 - Persistent State Update FIFOs
Figure 22 - Performance, Balanced Throughput Case, VS-PS
Figure 23 - Performance, Balanced Throughput Case, ES-GS-VS-PS
Figure 24 - Performance, Balanced Throughput Case, LS-HS-ES-GS-VS-PS
`
`
`
1 Introduction
This document describes the requirements, functionality, and target performance of the Shader Processor Input
(SPI) block.
`
1.1 Open Issues

1.2 Definitions

1.2.1 Acronyms
SPI - Shader Processor Input
SC - Scan Converter
SQ - Sequencer
SQC - Sequencer Cache
SQG - SQ Global Block, instanced in SPI
SX - Shader Export
SP - Shader Processor
CP - Command Processor
CPG - Command Processor, Graphics
CPC - Command Processor, Compute
SE - Shader Engine
SH - Shader Array
CU - Compute Unit
SIMD - Single Instruction Multiple Data unit in the shader processor (SP)
UL - Upper Left
UR - Upper Right
LL - Lower Left
LR - Lower Right
VGPR - Vector General Purpose Register in the SP
SGPR - Scalar General Purpose Register in the SQ
CS - Compute Shader
LS - API Vertex shader stage when doing tessellation, writes to LDS
HS - Hull shader stage of tessellation
VS - Vertex shader; could be normal vertices, the final pass of a Geometry Shader, or a domain shader
GS - Geometry Shader, processes primitives
ES - Export Shader, first vertex pass of a Geometry Shader that processes vertices
PS - Pixel Shader, processes pixels
VSR - Vertex Input Staging Register, holds input data for vertex threads
PSR - Pixel Input Staging Register, holds input data for pixel threads
LDS - Local Data Store
se_id - Shader Engine Identification Number
sh_id - Shader Array Identification Number
MSAA - Multi-Sample Anti-Aliasing
EQAA - Enhanced Quality Anti-Aliasing
`
`
1.2.2 Terminology
Event: a special token sent through the graphics pipeline which can be used to enforce
synchronization, flush caches, and report status back to the CP. All blocks pipeline these tokens and keep them
ordered with other graphics data.
Thread: one instance of a shader program being executed on a wavefront. Each thread has its own data which
is unique from any other thread.
Wavefront: the basic unit of work. There are 64 threads per wavefront. It is a group of threads that can
be executed simultaneously on a SIMD.
Threadgroup, Subgroup: group of threads that may span several wavefronts. All threads are guaranteed to
run on the same CU. This allows for shared CU resources such as the Local Data Store (LDS) and
synchronization resources across all threads.
Tessellation Engine: a VGT module that implements DX11 tessellation functionality.
Pixel Quad: a 2x2 pixel region.
Pixel Center: current pixel's screen coordinates, given as PIX_X.5, PIX_Y.5.
Pixel Centroid: current pixel's centroid in screen coordinates, defined as the covered sample location closest to
pixel center. If all samples of a pixel are hit, center will be used for centroid even if center is not one of the
current sample locations.
Pixel Sample: location of the sample ID of the current iteration when running at sample frequency.
Facedness: the PA-determined face flag indicating front or back facing.
Param_gen: automatically generated ST texture coordinates, typically used with points.
SIMD: Single Instruction Multiple Data unit in the shader processor (SP).
Shader Array: a combination of blocks separate and unique for shader processing, including a shader core
consisting of Compute Units.
newvector aka fpos, first_prim_of_slot: parameter cache sync token received from the SC for pixels and
used to make sure the SPI waits for VS to finish exporting parameter data before pixels start trying to read it.
Helper pixel: any non-hit pixel being processed as a part of a quad with other hit pixels.
`
1.3 Top Level Description
The main purpose of the SPI is to manage shader resources and provide shader input data to the GPRs and
wavefronts to the SQ. It accumulates "vertex" type shader input data from the VGT (VS, GS, ES, HS, LS) into
wavefronts. It receives compute shader (CS) data and state from the CPG and CPC on csdata interfaces.
Resources required to process wavefronts and CU/SIMD assignment in the shader array (SH) are managed by the
SPI in terms of allocation and de-allocation. SPI passes data through for the VGT verts and prims. For HS and
GS, SPI unrolls threadgroups and subgroups into wavefronts. For CS, SPI unrolls threadgroups into wavefronts
and generates an index per thread based on the threadgroup size. Pixel quad data delivered from the SC is
accumulated into wavefronts. The SPI processes this data, per pixel, to interpolate and produce barycentric
gradient data (I,J) or screen X,Y, and/or primitive facedness data. The SPI loads this data into VGPRs and
coordinates moving primitive attribute data from the parameter caches into a CU Local Data Store (LDS) for the
pixel shader to use for attribute interpolation. SPI synchronizes the vertex shader attribute exports with the pixel
shader reading those attributes, guaranteeing the attribute data has been written to the parameter cache before
allowing PS to read.
`
1.3.1 SPI Chip Level Data Flow Diagram
Figure 1 shows the blocks and major data paths directly and functionally associated with the SPI.

Inputs from the VGT: subgroups, waves, events, and vertex input data for the data types VS, GS, ES, HS, LS.
Inputs from the SC: pixel data including coverage, primitive information and events.
From the CPG: compute state, events, threadgroups for GFX.
From the CPC: compute state, events, threadgroups for async compute.
Shader input data into the SGPRs and wavefront input to the SQ.
VS position and parameter cache data writes to the SX and PC.
Parameter cache read and LDS write controls.
`
`
[Figure: block diagram showing the VGT sending primitive connectivity to the PA and LS, HS, ES, GS, VS input data to the SPI; the SC delivering pixel quads with prim info; the CPG/CPC delivering DX11 CS data; and the SPI driving GPR input, wavefront data, LDS write data, param cache reads, and position/param cache writes]

Figure 1 - SPI Chip Level Data Flow Diagram
`
Referencing Figure 1, for doing just vertex and pixel shading, vertex and primitive type processing are associated
with the green colored lines. The VGT initially starts off sending vertex indices in the form of vsverts to the SPI
and at the same time sending the primitive connectivity to the PA identifying how these vertices will get built
back into primitives. The SPI buffers up the vsverts into a wavefront and once it has received a full wavefront of
data, the wave transfer from the VGT will trigger the SPI to release the data to the SQ and feed associated data
into the GPRs. When the vertex shader starts processing position data, typically it will send out position early to
the position buffers in the SX, which then allows the PA to read that position data and start building the primitives
and producing those primitives which go through the Scan Converter (SC) to produce pixels. Once the SC has
primitives, it will start producing pixels which are fed to the SPI. Once the SPI has a full wavefront of pixels, it
will try and send associated data into the GPRs with the wavefront to the SQ. Reads are made to copy parameter
data out of the parameter cache and write it into an SPI-determined range of LDS in a particular CU.
`
`
1.3.2 Chip Level Diagram
Figure 2 shows the SPI block and its associated relationship to chip level inter-connections. Here, the physical
partitioning of barycentric logic is shown by the BCI blocks. For the purpose of this document, the BCI logic will be
considered as part of the logical SPI design.
`
[Figure: chip-level diagram showing Shader Engines, each with up to 16 Compute Units]

Figure 2 - Chip Level Diagram
`
`
1.3.3 SPI Block Diagram

[Figure: top level connectivity diagram showing the SPI's Resource Allocator, Wave Controllers, and Wave Buffer, with interfaces to the SQ/SP (wave launch; LDS, VGPR, SGPR writes), the CPG/CPC (threadgroups and state), the VGT (vert and wave inputs), the SC (pixel inputs via the BCI), and the GRBM]

Figure 3 - Top Level Connectivity Block Diagram
`
`
`
`
`
Diagram copied from //etapfgts/w/doc/design/blocks/spi/gfx9SPIBlockDiagram.vsd

Figure 4 - Block Diagram
`
2 Features / Functionality

2.1 Stage And Organize Data for Shader Launch
The SPI logical block stages and organizes efficient loading of shader input data to the Vector/Scalar General
Purpose Registers (VGPR/SGPR) and Local Data Store (LDS) in the Shader Array and manages resources
required to run those shader programs. The VGT will have several types of inputs to the SPI: normal vertices that
will create positions and parameters for rasterization and pixel processing (VS, which could be normal vertices or
the final pass of a Geometry Shader), Geometry Shader (GS) primitives, vertices that only export to memory (ES,
which is the first vertex pass of a Geometry Shader), vertices acting as the first stage of tessellation processing
(LS), and patch data associated with the Hull Shader (HS). The VS, GS, ES, HS, and LS are often generalized into
the category of "verts" when discussing data moving through the SPI. The Scan Converter (SC) delivers pixel
quads to the SPI for pixel shading. The CPG block delivers DX11 Compute threadgroups to the SPI for launching
compute shaders. The CPC delivers Async Compute threadgroups to the SPI for launching compute shaders.
`
2.2 Compute Shader (CS)
As shown in Figure 1, Compute Shader input data can come from either the CPG (GFX-CS) or the CPC (Async
CS). CS waves go through the same resource arbitration and allocation as all other supported SPI wavefront types.
`
`
DX11 requires support for compute shaders, and the SPI plays a role in getting compute shaders into the shader
array. Both the CPG and CPC deliver threadgroups to the SPI along with persistent state data that tells the SPI
how to process those threadgroups. The SPI is responsible for unrolling each threadgroup into the number of
wavefronts required to process all of the threads for the threadgroup.
`
[Figure: CS input flowing through the CS Input Controller to Resource Allocation, then to the Wave Write Cntl and SGPR/VGPR write blocks, producing the New Wave Cmd, SGPR Data, and VGPR Data outputs]

Figure 5 - CS Data Flow
`
`
`
`
`
`
`
`
`
`
`
`
`
Figure 6 - Async Compute Block Diagram
`
2.2.1 Resource Probing
If there are more than 4 Async Compute Pipes present in a configuration (more than 1 compute ME) then pairs of
compute wave controllers will share a single probe to Resource Alloc (RA) for allocating resources. Each of the
pair takes alternating turns using the probe to request resources. This probing opportunity will alternate between
the two pipelines once every four clocks until a probing pipeline has a work group that fits and is selected by RA.
Once a pipeline is selected, it will allocate resources for all waves in its threadgroup before releasing the probe. If
only one pipe of a pair has a threadgroup ready to allocate, it will have exclusive use of the probe for requesting
resources and can continue requesting on every four-clock cycle.
Each CS controller should check its tg_per_cu limit, wave_per_sh limit, scratch limit, and crawler space before
requesting resources so it doesn't take cycles away from the other CS controller sharing a common probe.
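The alternating-probe scheme above can be modeled with a short sketch. This is a toy model, not RTL: the function name and ready flags are hypothetical, and the real arbitration also applies the limit checks mentioned above before a pipe is allowed onto the probe.

```python
def probe_schedule(pipe_a_ready, pipe_b_ready, slots):
    """Model two paired compute pipes sharing one probe to the Resource
    Allocator.  Returns which pipe owns the probe on each 4-clock slot:
    they alternate when both have work, a lone requester keeps the probe
    every slot, and the probe idles when neither has a threadgroup."""
    owner = "A"                # alternates between the paired pipes
    schedule = []
    for _ in range(slots):
        if pipe_a_ready and pipe_b_ready:
            schedule.append(owner)                   # take turns
            owner = "B" if owner == "A" else "A"
        elif pipe_a_ready:
            schedule.append("A")                     # exclusive use
        elif pipe_b_ready:
            schedule.append("B")
        else:
            schedule.append(None)                    # probe idle
    return schedule
```

In the model, a selected pipe would hold the probe across all waves of its threadgroup; only the slot-by-slot turn-taking is shown here.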
`
2.2.2 Threadgroup Ordering
Ordering of threadgroups for a given async compute pipe needs to be maintained across all SEs. The Dispatch
Controller (DC) assigns threadgroups round-robin to all SEs in the chip, and the SPIs from each of those SEs must
cooperate to ensure that a threadgroup from a given SE is not allowed to probe until the threadgroup before it has
won allocation. The SPI needs to wait until the first wave of the previous group allocates, but does not need to
wait for all waves of the previous threadgroup.
The Dispatch Controller will send two signals with each threadgroup, first_group and last_group, to tell the SPI
when each dispatch starts and finishes. If a group is marked as first_group, the CS controller can start requesting
immediately without waiting on any previous group. If a task is pre-empted and restarted, the first threadgroup of
`
the restart should be marked as first_group even if it is not the first of the dispatch. Once that first_group allocates,
the allocating controller sends a tg_alloc pulse to the next SPI in the dispatch sequence so that it can start
requesting for its group. For allocating groups marked as last_group, no tg_alloc pulse is sent. This scheme avoids
any problems that can arise from an implicit ordering scheme where the DC and the SPI both independently
manage threadgroup ordering. First_groups can be sent to any SPI, regardless of where the previous group was
sent, and last_groups won't create any left-over status in the SPI. Power gating and soft_reset issues are also
avoided since no duplicate state needs to be kept in sync between DC and SPI, which are physically in separate
tiles.
SPI also supports a mode where a DISPATCH_INITIATOR write clears the baton for that async compute pipe
such that the last_tg from the dispatch controller is not necessary. This is the default behavior for SPI, but it can
be disabled by setting SPI_CONFIG_CNTL_1.BATON_RESET_DISABLE to 1.
`
`
`
Figure 7 - CS Threadgroup Ordering
`
The compute controllers also support disabling of entire SHs for a given pipe using the
COMPUTE_STATIC_THREAD_MGMT register. This feature is also known as "steering", and allows a dispatch
to be sent only to a subset of the possible SHs in a given config. The DC will shadow CU_EN settings and only
send threadgroups to SPIs with at least one CU enabled for the dispatch. When passing/receiving tg_alloc, each
SPI needs to check its own CU_EN settings. If the receiving SPI has a CU_EN of 0 then it should pass the token
along to the next SPI. This passing of the token through disabled SPIs adds extra time between threadgroups
starting. The DC will optimize for the case where only a single SH is enabled for a dispatch by marking every
threadgroup sent to that single SH as both first and last of group. This way no ordering tokens are passed by the
SPIs and the single enabled SH is allowed to launch threadgroups as fast as possible.
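The tg_alloc baton forwarding through disabled SPIs can be sketched as follows. This is a hypothetical helper, assuming one CU_EN flag per SE with at least one SE enabled for the dispatch, as in the steering description above.

```python
def next_probe_se(current_se, cu_en, num_se):
    """Return the SE whose SPI receives the tg_alloc baton next.

    The baton moves round-robin through the dispatch sequence; an SPI
    whose CU_EN mask is 0 for this dispatch just forwards the token to
    the next SPI, which models the extra hops through disabled SHs."""
    assert any(cu_en), "at least one SE must be enabled for the dispatch"
    se = (current_se + 1) % num_se
    while not cu_en[se]:           # disabled SPI passes the token along
        se = (se + 1) % num_se
    return se
```

A first_group would start probing without waiting for the baton, and a last_group would simply not forward it, matching the explicit start/finish signaling described above.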
`
2.2.3 Threadgroup Halting and Discarding
The CS controller will also respond to halt signaling to accomplish precise launch pre-emption. Upon being
commanded to halt by the CPC, the controller will finish out any wavefronts from partially started work groups
and then stall any subsequent traffic from that pipe.
`
`
CLIENT_TARGET_halt_req
If asserted, the receiving block must halt the production of
compute work at a well-defined pipeline location. After halting,
the receiver must return a halt_ack.
`
If a discard is then requested, any other entries in the input fifo will be popped and discarded before signaling
back to the grid dispatcher that the SPI has prepared to switch. A discard_req will always happen within a
halt_req/halt_ack window. The SPI must be halted before it can be told to discard.
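The halt-then-discard ordering can be sketched as a toy model; the class and method names are hypothetical stand-ins for the per-pipe wires, and draining of partially started groups is not modeled.

```python
class CsController:
    """Toy model of the CS controller halt/discard handshake."""

    def __init__(self, pending_groups):
        self.fifo = list(pending_groups)   # threadgroups not yet allocated
        self.halted = False

    def halt_req(self):
        # Finish out waves of partially started groups (not modeled),
        # then stop launching new work and acknowledge the halt.
        self.halted = True
        return "halt_ack"

    def discard_req(self):
        # A discard is only legal inside a halt_req/halt_ack window.
        assert self.halted, "SPI must be halted before it can discard"
        dropped, self.fifo = self.fifo, []
        return dropped                     # popped-and-discarded entries
```

The assert models the protocol rule that a client only requests discard while both halt_req and halt_ack are asserted.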
`
`
`
`
CLIENT_TARGET_discard_req
If asserted, the receiving block discards any pending compute work
that has not yet been allocated shader resources.

A client should only assert this when both
CLIENT_TARGET_halt_req and TARGET_CLIENT_halt_ack
are asserted.

The CS controller will drive a tg_allocated signal to the CP notifying the DC when a threadgroup allocates. This is
needed so the DC can track the exact number of groups that launch versus those that are discarded after a halt.
`
2.2.4 Queue Status
Each CS controller maintains a count of active waves for all 8 queues that can drive that pipe. SPI provides that
status through GRBM reads using several register decodes. One register, SPI_CSQ_WF_ACTIVE_STATUS,
contains a single ACTIVE bit for each queue of each pipe of a given ME. SPI_CSQ_WF_ACTIVE_STATUS is
indexed by GRBM ME ID. SPI_CSQ_WF_ACTIVE_COUNT_{0-7}.COUNT provides the actual number of
wavefronts that are in flight for a specific queue. SPI_CSQ_WF_ACTIVE_COUNT_{0-7}.EVENTS provides the
actual number of events that are in flight for a specific queue. SPI_CSQ_WF_ACTIVE_COUNT is indexed by
GRBM ME ID and GRBM PIPE ID.
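As an illustration of the per-queue ACTIVE bits, the sketch below packs wave counts into a one-bit-per-queue status word. The bit layout is illustrative only, an assumption rather than the documented register encoding, and the ME/PIPE indexing is not modeled.

```python
def queue_active_bits(wave_counts):
    """Derive a status word with one ACTIVE bit per queue: bit q is set
    when queue q has at least one wavefront in flight on this pipe."""
    status = 0
    for q, count in enumerate(wave_counts):
        if count > 0:
            status |= 1 << q
    return status
```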
`
2.2.5 Unordered Dispatches
DC and SPI also support an Unordered Dispatch mode using the ORDER_MODE field of the
DISPATCH_INITIATOR. When launching an Unordered dispatch, the Dispatch Controller will send every
threadgroup marked with both first/last_group. This allows the SPI in each SH to launch threadgroups
independently without passing or expecting the order baton.
Unordered mode also changes the way SPI responds to halt requests. In the ordered mode, SPI can halt on any
threadgroup boundary and return halt_ack with threadgroups still pending in its input fifo. In the unordered mode,
SPI will allocate all threadgroups that have been sent from the DC before returning halt_ack.
`
2.2.6 State Forwarding to SQG
All state traffic to each compute pipe needs to be passed to the SQG for logging. State writes are sent from the
outputs of credit/debit fifos with arbitration and backpressure to ensure that only 1 controller sends per clock.
`
2.2.7 First Wave of Dispatch
SPI supports SQ/SQC volatile cache deallocation control by marking the first threadgroup of a dispatch that is
sent to each CU and SQC (group of CUs). The scoreboard logic used to track when threadgroups are sent to a
CU/SQC needs to be reset at the start of each dispatch, so each CS wave controller needs to provide this
information. The CS wave controller will signal "first wave of dispatch" to RA for the first wave request of the
first threadgroup after each DISPATCH_INITIATOR.
SPI is aware of CU to SQC mapping, both for this invalidate-volatile feature as well as CU busy signaling for clk-
gate control. The SPI is ifdef'ed to handle both different numbers of CUs (GPU_GC_NUM_CU_PER_SH) and
different numbers of CUs-per-SQC (GPU_GC_MAX_CU_PER_SQC).
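The per-dispatch scoreboarding can be sketched as follows; a toy model in which the CU-to-SQC grouping ratio is an assumed parameter (in hardware it comes from the ifdefs named above), and calling the function anew models the reset at each DISPATCH_INITIATOR.

```python
def mark_first_waves(dispatch_targets, cus_per_sqc=2):
    """For a stream of CU targets within one dispatch, flag the first
    threadgroup sent to each CU and the first sent to each SQC (group
    of CUs).  Returns (first_for_cu, first_for_sqc) per target; the
    scoreboards start empty, modeling the per-dispatch reset."""
    seen_cu, seen_sqc = set(), set()
    flags = []
    for cu in dispatch_targets:
        sqc = cu // cus_per_sqc            # assumed CU-to-SQC grouping
        flags.append((cu not in seen_cu, sqc not in seen_sqc))
        seen_cu.add(cu)
        seen_sqc.add(sqc)
    return flags
```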
`
2.2.8 Compute Shader Index Terms
For CS, the SPI can load up to 3 index terms as input into the VGPRs. This is a 1 to 3-dimensional incrementing
index that represents the relative ID of the thread within its threadgroup, known as ThreadIDinGroup.
COMPUTE_PGM_RSRC2.TIDIG_COMP_CNT is used to control the number of components written by the SPI.
Here is a simple example of how the SPI would generate the ThreadIDinGroup across the wavefronts with
incrementing X,Y,Z indices.

For a threadgroup with dimension X=3, Y=16, Z=2, the SPI would create 2 wavefronts to process the 96 valid
threads (3*16*2). Sequentially, the thread input values would look like this, where the X increments first and
wraps back to zero. At each X wrap point, the Y term would increment, all the way up to the Z term incrementing.
`
`
Thread0 (X,Y,Z) = 0,0,0
Thread1 = 1,0,0
Thread2 = 2,0,0
Thread3 = 0,1,0
...
Thread95 = 2,15,1
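The unrolling above can be reproduced with a short sketch that generates ThreadIDinGroup tuples in X-fastest order and splits them into 64-thread wavefronts; the function name is illustrative, not an SPI interface.

```python
def thread_ids(x_dim, y_dim, z_dim, wave_size=64):
    """Generate (x, y, z) ThreadIDinGroup tuples with X incrementing
    fastest, then Y, then Z, and pack them into wavefronts of
    wave_size threads, mirroring the example in the text."""
    ids = [(x, y, z)
           for z in range(z_dim)
           for y in range(y_dim)
           for x in range(x_dim)]
    # Split the serial thread stream into wavefronts.
    return [ids[i:i + wave_size] for i in range(0, len(ids), wave_size)]
```

For the X=3, Y=16, Z=2 example this yields 2 wavefronts covering 96 threads, from (0,0,0) up to (2,15,1).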
`
The threads are loaded 16 wide and 4 clocks deep, with the counts demonstrated in Figure 8.

[Figure: table of ThreadIDinGroup X,Y,Z values packed 16 threads wide by 4 clocks deep across the two wavefronts]

Figure 8 - CS Thread Count Increment Example
`
2.3 VGT-SPI "Vert" Shaders

The SPI can receive one thread per clock from the VGT for each of LS, HS, ES, GS, and VS. The LS, ES, and
VS interfaces are all 128 bits wide, GS is 87 bits wide, and HS is 43 bits. The SPI takes a serial stream of up to 64
threads from the VGT (one wavefront) and accumulates it into four parallel lines in the Vertex Staging Register
(VSR), matching the VGPR write format and allowing the SPI to minimize the VGPR input cycles for vertex data.
The interface between the SPI and the VGPRs is 16 verts * 1 component wide and the SPI is always trying to
write 16 threads per cycle into the GPRs. The SPI arbitrates on 4-clock cycles, so every time a particular type gets
to write into the VGPRs it really wants to write 64 threads, 16 at a time, over 4 cycles. If the SPI tried to write
immediately to the VGPRs every time the VGT came in with 1 serial thread, the other 63 threads of the 4-clock
cycle would be wasted.
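The serial-to-parallel staging described above can be sketched as a simple regrouping; a toy model in which threads are represented by their arrival index rather than real 128-bit vertex data.

```python
def pack_into_vsr(serial_verts, width=16, depth=4):
    """Accumulate a serial stream of up to width*depth vert threads into
    'depth' rows of 'width' threads each, matching the 16-threads-per-
    cycle VGPR write pattern described above."""
    assert len(serial_verts) <= width * depth
    return [serial_verts[i:i + width]
            for i in range(0, len(serial_verts), width)]
```

A full wavefront of 64 serial threads becomes 4 rows of 16, one row per VGPR write cycle.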
`
Figure 9 shows the serial stream from the VGT being packed into this 16-wide over 4-clock wavefront.
`
[Figure: serial vert stream from the VGT accumulated into a 16-wide by 4-clock wavefront in the VSR]

Figure 9 - "Vertex" Data Flow VGT-SPI
`
`
2.3.1 ES, GS, VS Processing
In GFX9 the change was made to combine ES and GS processing into a single shader stage, so there is no need to
synchronize ES-done to GS-start. There is also no need for SPI to pass parent CU information from ES to GS
groups like was necessary for onchip-GS processing in previous families. The synchronization of GS to VS
processing is handled outside of the SPI (VGT waits on gs_count_done from the GS shader before generating VS). If
GS is passing data to VS using onchip LDS (onchip-GS) then SPI must pass subgroup information from the
producing GS to the consuming VS subgroup.

Each vertex controller runs independently, with the only interaction being the arbitration for writes to a particular
VSR, until wavefronts request resource allocation. There is only one copy of the VSR memory, composed of
multiple banks which hold the different components. There is a simple priority arbitration here to make sure there
are no data collisions when multiple controllers need to write to the same memory banks. The priority order is a
fixed lowest-to-highest of LS, HS, ES, GS, VS. Space for multiple wavefronts exists for each type in the SPI,
which allows the SPI to start copying one wavefront while the VGT starts sending the next wavefront.

Once a full wavefront of vertex indices is written into the VSR, and the associated wave transfer from the VGT
has occurred to let the SPI know it is ok to issue that wavefront, the Vertex Wave Controllers will try to allocate
the resources that the shader needs to execute in the shader complex. In the case of LS/HS and ES/GS groups,
SPI waits until all transfers of all waves of the group (LS-vert/HS-vert or ES-vert/GS-prim) have been received
before trying to launch the group. This means the VSR must be able to hold an entire group's worth of data, up to
a max of 4 wavefronts, for each of these group types. If the wave/group wins resource allocation, the wave
control information (resource bases/sizes, state_id, pipe_id, etc.) is sent to the shader input write controllers to load
the wavefront to the Shader Array.
`
[Figure: vertex input from the VGT flowing through the per-type Vertex Controllers into the VSR, then through Resource Allocation to the Wave/SGPR/VGPR write blocks, producing the New Wave Cmd, SGPR Data, and VGPR Data outputs]

Figure 10 - VGT ES, GS, VS Vertex Input
`
`
2.3.2 On-chip GS
Onchip GS mode allows the use of onchip LDS space to store the ESGS and GSVS ring