`
`EDIT DATE
`
`DOCUMENT-VER. NUM.
`
`PAGE
`
`| CREATEDATE
`
`1.0
`
`| PAGE | of|
`
`Author: Todd Martin, Mangesh Nijasure
`
`ISSUED TO:
`
`COPY NO.
`
`WD/IA/VGT
`
`Micro-Architecture Specification
`
Rev 1.0, Last Edit: [ SAVEDATE \@ "d-MMM-yy" \* MERGEFORMAT ]
`
THIS DOCUMENT CONTAINS INFORMATION THAT COULD BE SUBSTANTIALLY DETRIMENTAL TO THE INTEREST OF AMD THROUGH UNLICENSED USE OR UNAUTHORIZED DISCLOSURE.

Preserve this document's integrity:

- Do not reproduce any portions of it.
- Do not separate any pages from this cover.

This document is issued to you alone. Do not transfer it to or share it with another person, even within your organization.

Store this document in a locked cabinet accessible only by authorized users. Do not leave it unattended.

When you no longer need this document, return it to AMD. Please do not discard it.

"Copyright 2011, Advanced Micro Devices, Inc. ("AMD"). All rights reserved. This work contains confidential, proprietary information and trade secrets of AMD. No part of this document may be used, reproduced, or transmitted in any form or by any means without the prior written permission of AMD."

AMD, the AMD Arrow Logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG. HDMI is a trademark of HDMI Licensing, LLC.

AMD (NYSE: AMD) is a semiconductor design innovator leading the next era of vivid digital experiences with its ground-breaking AMD Fusion Accelerated Processing Units (APU). AMD's graphics and computing technologies power a variety of devices including PCs, game consoles and the powerful computers that drive the Internet and businesses. For more information, visit http://www.amd.com.
`
`
`AMD1044_0048455
`
`ATI Ex. 2026
`IPR2023-00922
`Page 1 of 110
`
`
`
`Revision History
`
1 Introduction

This document describes the features and hardware implementation of the WD, IA, and VGT blocks and how they fit into the overall graphics architecture.

1.1 Open Issues
There are no known open issues for the WD, IA, or VGT.

1.2 Scope
This document details the feature requirements and the hardware implementation for the WD, IA, and VGT blocks.

1.3 Reference
No external documents other than those explicitly linked in this specification are necessary to understand the material presented here. This micro-architecture specification is self-sufficient in describing the design and features of the WD, IA, and VGT blocks.
`
1.4 Definitions / glossary of terms

- WD - Work Distributor, receives all the draw commands and breaks them up into work groups which are sent to one or more IA units
- IA - Input Assembler, receives work groups and breaks them up into prim groups for the VGT. Fetches indices from memory.
- VGT - Vertex Geometry Tessellator, this is the main block responsible for supporting all DX and OGL draw packets
- THREAD - A thread is a single entity in a wavefront; this can be vertices, primitives, etc.
- WAVEFRONT - A group of threads that execute in SIMD fashion.
- SE - Shader Engine
- PA - Primitive Assembler
- LS - Local data Shader
- HS - Hull Shader
- DS - Domain Shader
- ES - Export Shader
- GS - Geometry Shader
- VS - Vertex Shader
- CS - Compute Shader
- EOP - End Of Packet
- EOPG - End Of Primgroup
`
`
`1.5 Top Level Diagram
`
This diagram shows a 4 shader engine configuration.

[Figure: top level block diagram (embedded Visio drawing, not reproduced)]
`
`
`2 Delta Requirements
`
`All delta features have been folded into this spec.
`
`3 Features / Functionality
`
`3.1 Overview
`
The WD, IA, and VGT are responsible for creating thread data and wavefront assignment for nearly all the graphics shader stages in use by the graphics core. These are Vertex Shader (VS), Export Shader (ES), Geometry Shader (GS), Local Data Shader (LS), and Hull Shader (HS). The remaining shader type, Pixel Shader (PS), is not controlled directly by the VGT, but the VGT does provide primitive information to the Primitive Assembly (PA) block, which begins a process of clipping and sends clipped primitives to a Scan Converter (SC) that produces Pixel Shader (PS) data and wavefronts.

By providing these workloads in a deterministic, state-controlled order, the WD/IA/VGT enables and controls the various graphics pipelines (from a simple VS->PS pipeline to a complicated pipeline such as LS->HS->TESS_FIXED_FUNCTION->ES (as DS)->GS->VS). Primarily the WD/IA/VGT accomplishes this by decomposing input packets which yield all of the shader types. In addition to controlling the shader stages, the VGT in particular also provides a fixed function tessellation stage of the graphics pipeline. The register VGT_SHADER_STAGES_EN indicates which shader stages are enabled for a given Draw Initiator.

There is one WD block per chip and one VGT per Shader Engine. A single IA is paired with two VGT blocks, so a two SE chip has one IA while a four SE chip has two IAs. A single VGT has a throughput of one primitive per cycle, so the number of VGTs present directly controls the maximum geometry throughput of the system.
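
The block-count rules in the paragraph above can be sketched as a small calculation. This is an illustrative model only: the function name and the returned keys are hypothetical, and rounding up to one IA for a single-SE chip is an assumption that the text does not state.

```python
def front_end_config(num_shader_engines: int) -> dict:
    """Derive front-end block counts from the number of shader engines (SEs)."""
    if num_shader_engines < 1:
        raise ValueError("need at least one shader engine")
    num_vgt = num_shader_engines      # one VGT per SE
    num_ia = (num_vgt + 1) // 2       # one IA serves two VGTs (rounded up: assumption)
    return {
        "wd": 1,                          # one WD per chip
        "ia": num_ia,
        "vgt": num_vgt,
        "peak_prims_per_clock": num_vgt,  # one primitive per cycle per VGT
    }
```

For example, `front_end_config(4)` gives two IAs, four VGTs and a peak rate of four primitives per clock.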
`
The WD/IA/VGT is responsible for:

COMPUTE:

- Compute support has completely moved to the CP.

GRAPHICS:

- Receiving graphics state, draw requests, and synchronization events from the GRBM bus.
- Fetching, from memory, the individual vertex indices (16 or 32 bit pointers to vertex data) requested by a draw call.
- Grouping the indices into primitives such as lines, triangles, or patches.
- Determining index reuse within a fixed window of indices. This avoids redundant vertex shading.
- Providing primitive information to the Primitive Assembler (PA) block.
- Providing statistics and synchronizing events to the Command Processor (CP).
- Alternating graphics workloads between shader engines.
- Supporting legacy tessellation mode (does not use the LS/HS/DS shader stages, and has a different fixed function tessellation algorithm).
- Arbitrating (at a packet boundary) between high and normal priority draw calls.
- Independent pipeline reset such that work assigned to a given VMID is stopped as quickly as possible.
- Front end harvesting, including deactivation of individual VGTs and the associated IA.
- Ordering data for streamout.
- If Geometry Shading is enabled
  o Generate ES wavefronts/vertices/GS primitives and send them to the SPI
  o Upon completion of ES, generate GS wavefronts and send them to the SPI
  o Receive Geometry information output from the GS
`
  o Upon completion of GS, generate VS wavefronts/vertices (including streamout data) and send them to the SPI. In this situation the VS is also known as the Copy Shader since it is copying data from the GSVS ring buffer to the position buffer and parameter cache.
  o Generate Primitive information and send it to the PA
  o ES/GS output data is either all off chip or all on chip
- If Tessellation is enabled
  o Generate LS wavefronts/vertices and send them to the SPI
  o Upon completion of LS, generate HS wavefronts and send them to the SPI
  o Upon completion of HS, retrieve tessellation factors from memory, and execute the fixed function tessellator stage.
  o Create DS wavefronts/data from output of tessellator. Send the DS wavefronts to the SPI as either ES wavefronts (Geometry Shader enabled) or as Vertex Shader wavefronts (Geometry Shader disabled)
  o Optionally (if Geometry Shader enabled) create GS wavefronts/data and send them to the SPI
  o Generate Primitive information and send it to the PA
`
3.2 Data Flow based on SHADER_STAGES_EN programming

Shader_en mode | VS  | HS  | DS  | GS  | Hardware data flow
F              | on  | on  | on  | on  | LS->HS->TE->ES->GS->VS
E              | on  | on  | on  | off | API VS runs as LS. LS->HS->TE->VS
D              | on  | on  | off | on  | Not valid
C              | on  | on  | off | off | Not valid
B              | on  | off | on  | on  | Not valid
A              | on  | off | on  | off | Not valid
9              | on  | off | off | on  | API VS runs as the ES. ES->GS->VS
8              | on  | off | off | off | VS->PS
7              | off | on  | on  | on  | Not possible, VS has to be on
6              | off | on  | on  | off | Not possible, VS has to be on
5              | off | on  | off | on  | Not possible, VS has to be on
4              | off | on  | off | off | Not possible, VS has to be on
3              | off | off | on  | on  | Not possible, VS has to be on
2              | off | off | on  | off | Not possible, VS has to be on
1              | off | off | off | on  | Not possible, VS has to be on
0              | off | off | off | off | Not possible, VS has to be on
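
The Shader_en mode table in this section can be read as a decode of a 4-bit value. The sketch below is a hedged illustration: the bit positions (VS = bit 3, HS = bit 2, DS = bit 1, GS = bit 0) are inferred from the row codes 0 to F in the table and are an assumption, not the actual field layout of the VGT_SHADER_STAGES_EN register. The function name is hypothetical.

```python
def hardware_data_flow(mode: int) -> str:
    """Map a 4-bit Shader_en mode (0x0..0xF) to the data flow listed in the table."""
    if not 0 <= mode <= 0xF:
        raise ValueError("mode must be a 4-bit value")
    vs = bool(mode & 0x8)   # assumed bit order, taken from the table's row codes
    hs = bool(mode & 0x4)
    ds = bool(mode & 0x2)
    gs = bool(mode & 0x1)
    if not vs:
        return "Not possible, VS has to be on"
    if hs != ds:            # HS and DS must be enabled together
        return "Not valid"
    if hs and gs:
        return "LS->HS->TE->ES->GS->VS"
    if hs:
        return "LS->HS->TE->VS"
    if gs:
        return "ES->GS->VS"
    return "VS->PS"
```

Every valid row follows from two rules visible in the table: VS must be on, and HS and DS are only meaningful as a pair.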
`
`
`
`
`
`
`
`
`
`3.3 Distribution of work amongst shader engines.
`
If there is one IA, the WD passes draw calls through; however, if there are two IAs, the WD breaks draws into work groups which are twice the size of a prim group. The IA sends an entire prim group to a VGT before switching to the next VGT. An end of prim group (eopg) signal follows each prim group and the SC looks for these to know when to switch input FIFOs.

The WD will discard any draw call that contains 0 indices or a primitive type of DI_PT_NONE. Beginning in gfx8, the WD will also discard any draw call that sets the number of instances to 0. If there is one IA, the draw will be dropped and nothing will be sent down the pipe, but if there are two IAs, the WD will send a null eop down the pipe and toggle the IA that will receive work next.
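
The discard rules above can be summarised as a small decision function. This is a sketch under stated assumptions: the function name, the string return values, and the `gfx_level` parameter are hypothetical and exist only to illustrate the behaviour described in the text.

```python
def wd_draw_action(num_indices: int, prim_type: str, num_instances: int,
                   num_ia: int, gfx_level: int = 8) -> str:
    """Return what the WD does with a draw call (illustrative model only)."""
    empty = num_indices == 0 or prim_type == "DI_PT_NONE"
    if gfx_level >= 8 and num_instances == 0:
        empty = True                      # zero-instance draws are discarded from gfx8 on
    if not empty:
        return "process"
    if num_ia == 1:
        return "drop"                     # nothing is sent down the pipe
    return "send_null_eop_and_toggle_ia"  # two IAs: keep the IA alternation in step
```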
`
`
Below is an example of processing primitive groups from IA to SCs.

The Input Assembler (IA) block processes a serial stream of primitives. The uppercase P's (bold in the original) represent primitives that will be sent to SE0.

P, P, P (eopg), p, p, p (eopg), P (eop/eopg), event, p, p (eop/eopg), event, eop, event, P, P (eopg), p, eop/eopg, event

Four packets processed by VGT modules are shown below with different colors. This represents data sent from the IA to VGTs 0 and 1. PA_0 will receive the same sequence as VGT_0 and PA_1 will receive the same sequence as VGT_1. After reset, processing starts with VGT module 0.

Eopg marks the end of the prim group and this is what tells the IA block to switch SEs. Eopg is only sent to the active SE.

Eop is sent to both SEs in unordered mode. Otherwise the PA drops the eop that isn't accompanied by eopg.

Each line represents a point in time, so items on the same line occur simultaneously.
Read the data flow from top to bottom.
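
The switching rules above can be modelled with a short simulation. This is a sketch, not the hardware algorithm: the token strings and the function name are hypothetical, and broadcasting events to every VGT is an assumption based on events appearing in both columns of the diagram.

```python
def distribute(stream, num_vgt=2):
    """Split a serial IA stream into per-VGT sequences.

    Tokens are strings such as "p", "p/eopg", "p/eop/eopg", "eop" or "event".
    """
    queues = [[] for _ in range(num_vgt)]
    active = 0                                # processing starts with VGT 0 after reset
    for token in stream:
        parts = token.split("/")
        if parts[0] == "event":
            for q in queues:                  # assumption: events go to every VGT
                q.append("event")
            continue
        has_prim = parts[0] == "p"
        has_eopg = "eopg" in parts
        has_eop = "eop" in parts
        for vgt, q in enumerate(queues):
            out = []
            if vgt == active:
                if has_prim:
                    out.append("p")
                if has_eopg:
                    out.append("eopg")        # eopg is only sent to the active SE
            if has_eop:
                out.append("eop")             # eop is sent to both SEs
            if out:
                q.append("/".join(out))
        if has_eopg:
            active = (active + 1) % num_vgt   # switch after each prim group
    return queues
```

Feeding it the first packet of the example, `["p", "p", "p/eopg", "p", "p", "p/eopg", "p/eop/eopg", "event"]`, sends the first and third prim groups to VGT 0 and the second to VGT 1, with the eop and the event visible on both sides.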
`
[Figure: sequences received by each VGT, in order. The side-by-side time alignment of the two columns is not reproduced.]

VGT 0: P P P/eopg, P/eopg/eop, event, eop, event, event, P P/eopg, eop, event
VGT 1: P P P/eopg, eop, event, P P/eopg/eop, event, event, P, eopg/eop, event
`
`
The table below shows how PA outputs are loaded into SCs. Primitives that span the scan windows of two SCs are loaded into the FIFOs of both SCs.

Each SC processes its FIFOs independently of the other SC. The diagram below shows the order of processing within FIFOs for a given SC.

The SC switches the FIFO it is reading from after reading out an eopg. The SC synchronizes FIFOs when processing an event or eop, though when ordered mode (default) is used the PA will only send an eop if it is accompanied by an eopg.
`
`
Read the data flow from top to bottom. Each line shows what is being read out by the SC during a clock cycle.

[Figure: contents of FIFO_0 and FIFO_1 for SC_0 and SC_1 in read-out order; the column layout is not recoverable]
`
The following table shows what each SC sees after reading from the input FIFOs.
Read the data flow from top to bottom. Packets are separated for clarity.

[Table: primitive, eop and event sequences seen by SC 0 and SC 1; the column layout is not recoverable]
`
`3.4 Vertex Reuse
`
The intent of the vertex reuse determination is to use the Vertex Shader efficiently by preventing it from processing the same vertex multiple times if that vertex is used in multiple primitives that occur relatively close together in the input stream. The VGT must detect vertex reuse within the previous 30 (or fewer) vertex indices. In other words, vertex reuse is determined by the redundant occurrence of an external vertex index within a limited scope of the external vertex index list. If a "hit" is detected for a given vertex index, that vertex index is not resubmitted for vertex processing.
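
A minimal sketch of windowed reuse detection follows. It assumes the reuse buffer behaves as a FIFO of the most recent unique (missed) indices, which is one plausible reading of "the previous 30 vertex indices"; the actual replacement policy is not described here, and the function name is hypothetical.

```python
from collections import deque

def vertices_to_shade(indices, reuse_depth=30):
    """Return the indices that miss in the reuse window and must be shaded."""
    window = deque(maxlen=reuse_depth)   # oldest entry falls out when full
    to_shade = []
    for idx in indices:
        if idx in window:
            continue                     # hit: vertex is not resubmitted
        window.append(idx)               # miss: remember it and shade it
        to_shade.append(idx)
    return to_shade
```

With a triangle list such as `[0, 1, 2, 2, 1, 3]` only four vertices are shaded instead of six; shrinking `reuse_depth` shows how a smaller window causes re-shading.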
`
Reuse checks are performed in multiple sub-blocks in the VGT. Here is a list of shader stages and where vertex reuse occurs.

VS -> PS: The Vertex Reuse Block performs the reuse. If Streamout is enabled, reuse is automatically disabled.

ES -> GS -> VS -> PS: The GS Reuse Check Module performs reuse checks to remove redundant vertices from ES wavefronts. GS prims output strips, but there is no reuse between the strips. If Streamout is enabled, reuse is automatically disabled prior to the VS and the strips are converted to lists.

LS -> HS -> VS -> PS: The tessellator performs reuse prior to the VS stage. There is no reuse between patches, only between primitives output by a single patch. If Streamout is enabled, reuse is automatically disabled.

LS -> HS -> ES -> GS -> VS -> PS: The tessellator performs reuse prior to the ES stage. GS prims output strips, but there is no reuse between the strips. If Streamout is enabled, reuse is automatically disabled prior to the VS and the strips are converted to lists.
`
`3.4.1 Bank Conflict Detection
`
With the increase in reuse depth from 16 to 30 in gfx8, it became possible for there to be bank conflicts in the parameter cache. In order to simplify logic in the SPI and the Parameter Cache, bank conflict prevention code is added to some of the VGT reuse checkers. Any intra-prim relative indices that would cause bank conflicts in the parameter cache will not be allowed.
`
[...] 21 they are stored in the parameter cache as follows. Each of the columns shown is a bank; there are 16 banks in the PC.

|    |    |    |    |    |    |    |    |    |    | 21 | 20 | 19 | 18 | 17 | 16 |
| 15 | 14 | 13 | 12 | 11 | 10 |  9 |  8 |  7 |  6 |  5 |  4 |  3 |  2 |  1 |  0 |

If the new incoming primitive (a triangle) has the indices 22, 3 and 6, the indices 22 and 6 will be fetched from the same bank and will cause a conflict.

The new conflict detection code will eliminate this condition by replicating the index 6 as follows:

[Table: parameter cache contents after replication; only the row holding indices 15 to 0 is legible]
| 15 | 14 | 13 | 12 | 11 | 10 |  9 |  8 |  7 |  6 |  5 |  4 |  3 |  2 |  1 |  0 |

This produces more unique vertices than ideal, but it eliminates the need for complex bank conflict detection in the SPI and the PC.

Indices from different primitives are allowed to request indices from the same bank; this conflict checking is handled downstream. The VGT will only be responsible for eliminating any intra-prim conflicts.
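
The intra-prim check can be illustrated as follows. This sketch assumes that a vertex stored in slot `s` of the parameter cache lives in bank `s mod 16`, which matches the example layout (slots 6 and 22 share a bank) but is otherwise an assumption; the function name is hypothetical, and the sketch only detects conflicts rather than modelling how the hardware replicates the vertex.

```python
NUM_BANKS = 16  # the text states there are 16 banks in the PC

def intra_prim_conflicts(slots):
    """Return (first_slot, conflicting_slot) pairs within one primitive."""
    seen = {}                       # bank -> first slot that claimed it
    conflicts = []
    for s in slots:
        bank = s % NUM_BANKS        # assumed slot-to-bank mapping
        if bank in seen and seen[bank] != s:
            conflicts.append((seen[bank], s))
        else:
            seen.setdefault(bank, s)
    return conflicts
```

For the triangle in the example, `intra_prim_conflicts([22, 3, 6])` reports the pair `(22, 6)`, which is the vertex the text says is replicated.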
`
This conflict detection step is not necessary in the GS RCM or in the TE11 when the GS is enabled. This is because the primitives at these pipeline stages do not go to the parameter cache and will not cause bank conflicts when fetched.

The VGT_GS_VERTEX_REUSE register is now deprecated. The VGT_VERTEX_REUSE_BLOCK_CNTL register now controls reuse depth for all the reuse buffers (DX9, GS and TE11).

Setting VGT_REUSE_OFF.REUSE_OFF turns off reuse in all blocks. Previously this did not turn reuse off in the GS block.

To support this increased reuse depth, reuse is now turned off for any degenerate primitives (any primitive with repeated indices). This is an implementation level detail to save schedule.
`
`3.5 Dealloc Distance and Reuse Depth
`
The shader always writes 16 vertex parameters. Therefore dealloc_distance and reuse depth are always set to 16 (points) or 14 (triangles). This also shows that for legacy (non-DX11) tessellation we can set up HOS_REUSE_DEPTH to 16. The following changes were made to remove the dealloc_slot issue independent of quad_pipes.
`
1. The VGT submits 64 vertices per wave unless half pack is switched on.
       if (half_packed flag)
           create 32 verts per vector
       else
           create 64 verts per vector

2. De-allocate distance will always be 16 and reuse can always be 14 unless the driver wants to limit it to be less.

3. The VGT changes to create:
   a. 1 New Flag per vertex vector, attached to the first primitive containing the first vert of a new vertex vector (same as today)
   b. 1 De-allocate signal per vertex vector submitted instead of 4.
      i. [...] of the previous vector
      ii. It will be variable based on the resulting number of vertices per vector set up in number 1.
          1. vert 80 and every 64 thereafter
   c. 1 last signal that is sent in the msb of the de-allocate signal
      i. This will be attached to the primitive containing the first reference to a new specific vert for each vector, depending on the vector size
          1. vert 61 and every 64 thereafter
      ii. This signal is only for SC usage

4. The largest actual de-allocate count the VGT will send now will be 3.

5. The scan converter changes its partial vector submit circuit to emit a partial vector when it has a nonzero de-allocate count and gets a last flag. This will prevent hang conditions when there is a lot of culling going on. The scan converter will also remove the previous partial submit for this reason and add asserts for the occurrence of the partial submit and a fail assert if de-allocate is ever larger than 2.

6. The PA changes to pack parameters into the parameter cache. It will use the bad pipe flags and number of parameters to determine next offsets.

7. The SQ allocates and de-allocates based on num_quad_pipes and number of parameters. It will do a special case limiting: when num_quad_pipes > 2 and num_parameters >= 16, it will actually act like two quad pipes. So
       if (num_quad_pipes > 2 && half_packed flag)
           alloc_amount = num_parameters * 2
       else
           alloc_amount = num_quad_pipes * num_parameters
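
The two pieces of pseudocode in the list above (items 1 and 7) translate directly into the following sketch. The function names are hypothetical; the conditions are copied from the pseudocode as written, including its use of the half-packed flag in the allocation rule.

```python
def verts_per_vector(half_packed: bool) -> int:
    """Item 1: 64 vertices per vector, or 32 when half pack is switched on."""
    return 32 if half_packed else 64

def alloc_amount(num_quad_pipes: int, num_parameters: int, half_packed: bool) -> int:
    """Item 7: allocation amount used by the SQ."""
    if num_quad_pipes > 2 and half_packed:
        return num_parameters * 2              # behaves like two quad pipes
    return num_quad_pipes * num_parameters
```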
`
`3.6 VertexID
`
VertexID is a 32 bit unsigned integer value created by the VGT and loaded into a VGPR for the API vertex shader. It is unique per vertex, though each VGT maintains its own count. This feature is not required by DX or OpenGL. The count is reset by a RESET_VTX_CNT event.
`
3.7 PrimitiveID

PrimitiveID is a 32 bit unsigned integer value created by the VGT and loaded into a VGPR. The register VGT_PRIMITIVEID_RESET specifies the reset value to be used at the beginning of each instance. Typically it's programmed to 0. For special GS modes like scenario A and B, the VGT_PRIMITIVEID_EN register specifies that the primitive ID value should be loaded into a VGPR at the expense of an instance step rate value.

PrimitiveID is automatically available to the HS, DS and GS. If it's needed in the PS it must be passed as a parameter. It is expected that the situations where primitiveID is used by the PS but there is no GS instantiated are rare. To avoid having the hardware pipe the full 32-bit primitiveID through hundreds of clocks of pipeline, the driver will be expected to change the VS into a GS_A, which is basically a VS which gets primitiveID on the input and outputs the primitiveID on the vector/component where the PS expects it. The only other unique processing associated with a GS_A is that the VGT must guarantee that the leading vertex is unique (i.e. does not hit in the vertex reuse cache). This is required so that unique data for the primitive (i.e. primitiveID) is available for constant interpolation for the primitive.
`
3.8 InstanceID

InstanceID is a 32 bit unsigned integer value created by the VGT and loaded into a VGPR. It starts at 0 for all of the vertices of the first instance, and increments thereafter for each instance. It should also be 0 for non-instanced draw calls.
`
`
The value will be supplied to the API VS and is available to the GS and PS. The path to the VS is the only hardware-dedicated path for instanceID, as the driver is expected to create a VS (which would pass it along if necessary) if there is no VS instantiated.

The VGT will also supply up to two step-rate divide values to assist the fetch shader for cases with a small number of unique step-rates. In case the required number of step rates exceeds what is supported by the VGT, the remaining instanceID/step-rate values will be calculated by the fetch shader. A vertex wavefront may consist of vertices with different instanceIDs.
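
The division of labour between the VGT and the fetch shader can be sketched as below. The function name and the `hw_dividers` parameter are hypothetical, and the sketch assumes every step rate is a positive integer and that the divide is an integer (floor) division of instanceID by the step rate.

```python
def step_rate_values(instance_id: int, step_rates, hw_dividers: int = 2):
    """Split instanceID/step-rate work between hardware and the fetch shader.

    Returns (values supplied by the VGT, values left to the fetch shader).
    """
    if any(rate <= 0 for rate in step_rates):
        raise ValueError("step rates must be positive in this sketch")
    from_vgt = [instance_id // rate for rate in step_rates[:hw_dividers]]
    from_fetch_shader = [instance_id // rate for rate in step_rates[hw_dividers:]]
    return from_vgt, from_fetch_shader
```

With three unique step rates, the first two quotients come from the VGT and the third is computed by the fetch shader.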
`
In a multiple VGT system, the end of instance flag needs to be propagated to all VGTs in order to correctly increment the instanceID and reset the reuse buffer. In variants with more than one IA, the WD sends a null_eoi to the 'other' IA, which propagates the flag to the VGTs connected to it. When the entire instance fits on one IA, the null_eois sent to the other side can add up as dead cycles and show up as performance glitches.

If the entire draw is smaller than a primgroup, the null_eois are suppressed and instance_id is still handled correctly. This is done by adding an interface bit from the WD to the IA indicating that the draw was a candidate for optimization. Any null_eois that are not eops will be discarded gracefully and will not exit the IA.
`
`3.9 Reset Index
`
A reset index is a special index value that signifies the end of a primitive. It's typically used to break strips, but may be enabled for lists. Reset index is not supported with patches. Reset index checking occurs in the IA and it's enabled by setting the VGT_MULTI_PRIM_IB_RESET_EN register. The index value to check for is specified in VGT_MULTI_PRIM_IB_RESET_INDX.

Enabling reset index limits performance for designs that have greater than 2x prim rate (2 or more IAs) as it requires WD_SWITCH_ON_EOP to be set. For this reason we recommend that our developer relations personnel evangelize using lists instead of strips with reset indices.

Partial primitives that result from a reset index or at the end of a packet are silently discarded.

Prior to gfx8, drivers modified the value of the reset index check register VGT_MULTI_PRIM_IB_RESET_INDX based on the index type (8, 16 or 32 bit). In order to alleviate the software validation that is performed, in gfx8 and later projects the hardware masks out the register bits depending on the number of bits in the current index.

For 16 bit indices, the driver previously needed to program 0x0000VVVV, where VVVV is the reset index.

Now the driver can use 0xXXXXVVVV, where XXXX are don't care.
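
The gfx8 masking behaviour can be sketched as a comparison on the low bits only. The function name is hypothetical, and the sketch models only the comparison described above, not the surrounding IA logic.

```python
def is_reset_index(index: int, reset_indx_reg: int, index_bits: int) -> bool:
    """Compare an index with the reset value, ignoring bits above the index width."""
    if index_bits not in (8, 16, 32):
        raise ValueError("index type must be 8, 16 or 32 bit")
    mask = (1 << index_bits) - 1          # e.g. 0xFFFF for 16 bit indices
    return (index & mask) == (reset_indx_reg & mask)
```

With 16 bit indices a register value of `0xABCDFFFF` matches index `0xFFFF`, because the upper half of the register is a don't care.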
`
Besides the performance implications, other caveats to using reset index are:
1. Line stipple will not produce the correct visual result with this mode. The line stipple pattern will not reset between strips (which it would if the strips were sent with separate VGT_DRAW_INITIATOR commands).
2. Edge flags will not be correct for the prim order VGT_GRP_POLYGON. This will have a visual impact in OpenGL for this primitive order if POLY_MODE is set to LINES or POINTS. (This applies mostly to the OpenGL polygon primitive.)
`
`3.10 Provoking Vertex
`
If flat shading is enabled for a primitive, then the provoking vertex is the vertex whose color is used to shade the entire primitive. OpenGL and Direct3D differ (for most primitive types) in their respective selections of the provoking vertex. The VGT will be designed so that the OpenGL primitives will always program the provoking vertex select to "last vertex" and the Direct3D primitives will always program the provoking vertex select to "first vertex".

OpenGL Specification
The following table is based directly on table 4-2 from OpenGL Programming Guide, Second Edition. (The version in the OpenGL spec counts vertices and primitives starting at 1, whereas this version counts vertices and primitives starting at 0.) After swapping for specified vertex order within the primitive, the provoking vertex is the last vertex in the primitive, with the exception of the polygon primitive, where the first vertex is the provoking vertex.
`
Table 1. OpenGL Provoking Vertex.

Type of Polygon  | OpenGL: vertex used to select the color for the ith polygon | Direct3D: vertex used to select the color for the ith polygon
triangle strip   | i+2 (last vtx in tri)                                       | i (first vtx in tri)
triangle fan     | i+2 (last vtx in tri)                                       | i (first vtx in tri)
quad strip (1)   | 2i+3 (next-to-last vtx in quad)                             | N/A; 2i (first vtx in quad)
independent quad | 4i+3 (last vtx in quad)                                     | N/A; 4i (first vtx in quad)

(1) For OpenGL quad strips, the provoking vertex is the last vertex in the vertex buffer that forms the primitive; however, it is the next-to-last vertex of the primitive using the primitive-relative vertex order. For example, if the vertex buffer contains V0, V1, V2, and V3 in that order, then the first quad primitive from that strip will have the vertex order V0, V1, V3, V2. The provoking vertex for the quad is V3. See [ REF _Ref687713 \r \h ] for more detail.
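
The OpenGL column of Table 1 can be expressed as a lookup on the zero-based primitive number i. This is an illustrative sketch: the function name and the primitive-type strings are hypothetical, and only the OpenGL entries are modelled.

```python
def opengl_provoking_vertex(prim_type: str, i: int) -> int:
    """Zero-based vertex-buffer index of the provoking vertex of the ith primitive."""
    formulas = {
        "triangle strip": lambda n: n + 2,        # last vtx in tri
        "triangle fan": lambda n: n + 2,          # last vtx in tri
        "quad strip": lambda n: 2 * n + 3,        # last vtx in buffer order
        "independent quad": lambda n: 4 * n + 3,  # last vtx in quad
    }
    if prim_type not in formulas:
        raise ValueError(f"no Table 1 entry for {prim_type!r}")
    return formulas[prim_type](i)
```

For the quad strip example in the footnote, the first quad (i = 0) yields index 3, that is V3.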
`
3.10.1 Primitive Vertex Ordering and Provoking Vertex Summary

Table 2. Primitive Vertex Order and Provoking Vertex Summary

Primitive               | OpenGL vertex order                                                                   | Direct3D vertex order
Line List               | V0, V1; V2, V3                                                                        | V0, V1; V2, V3
Line Strip              | V0, V1; V1, V2                                                                        | V0, V1; V1, V2
Line Loop               | V0, V1; V1, V2; V2, V0 (created by VGT)                                               | V0, V1; V1, V2; V2, V0 (created by VGT)
Tri List                | V0, V1, V2; V3, V4, V5                                                                | V0, V1, V2; V3, V4, V5
Tri Strip               | V0, V1, V2; V2, V1, V3 (VGT swaps first two)                                          | V0, V1, V2; V1, V3, V2 (VGT swaps last two)
Tri Fan                 | V0, V1, V2; V0, V2, V3                                                                | V1, V2, V0; V2, V3, V0 (VGT rotates first to last)
Quad List (Native)      | V0, V1, V2, V3; V4, V5, V6, V7                                                        | Does not exist; assumed V0, V1, V2, V3; V4, V5, V6, V7
Quad List (Decomposed)  | [illegible]                                                                           | Does not exist; assumed V0, V1, V2 and V0, V2, V3; V4, V5, V6 and V4, V6, V7
Quad Strip (Native)     | V0, V1, V3, V2; V2, V3, V5, V4; V4, V5, V7, V6 (VGT swaps last two)                   | Does not exist; assumed V0, V1, V3, V2; V2, V3, V5, V4; V4, V5, V7, V6 (VGT swaps last two)
Quad Strip (Decomposed) | V0, V1, V3 and V1, V2, V3; V2, V3, V5 and V3, V4, V5; V4, V5, V7 and V5, V6, V7       | Does not exist; assumed V0, V1, V3 and V0, V3, V2; V2, V3, V5 and V2, V5, V4; V4, V5, V7 and V4, V7, V6
Polygon (Decomposed)    | V1, V2, V0; V2, V3, V0; V3, V4, V0 (VGT rotates first to last)                        | Does not exist; assumed V0, V1, V2; V0, V2, V3; V0, V3, V4 etc.
`
Direct X Specification
The DirectX 8.0 documentation states "When flat shading is enabled, the system shades the triangle with the color from its first vertex." There is no direct mention of flat shading lines, but the VGT design assumes that lines also use the first vertex in each line segment as the provoking vertex.
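
The strip and fan reorderings listed in Table 2 ("swaps first two", "swaps last two", "rotates first to last") can be generated as follows. This is a sketch under an assumption: the reorderings are attributed to each API by the provoking vertex rule (Direct3D keeps the first vertex, OpenGL keeps the last), because the column headings of Table 2 did not survive extraction. The function names and the `api` strings are hypothetical.

```python
def tri_strip_order(num_verts: int, api: str):
    """Vertex order of each triangle in a strip, per provoking vertex convention."""
    tris = []
    for i in range(num_verts - 2):
        a, b, c = i, i + 1, i + 2
        if i % 2 == 1:                    # odd triangles need a winding fix
            if api == "d3d":
                b, c = c, b               # swap last two: first vertex stays first
            else:
                a, b = b, a               # swap first two: last vertex stays last
        tris.append((a, b, c))
    return tris

def tri_fan_order(num_verts: int, api: str):
    """Vertex order of each triangle in a fan, per provoking vertex convention."""
    tris = []
    for i in range(1, num_verts - 1):
        if api == "d3d":
            tris.append((i, i + 1, 0))    # rotate first to last
        else:
            tris.append((0, i, i + 1))
    return tris
```

For four vertices, the Direct3D strip yields (V0, V1, V2), (V1, V3, V2) and the OpenGL strip yields (V0, V1, V2), (V2, V1, V3), matching the orders quoted in the table.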
`
`
`3.11 Primitive Types
`
`3.11.1 Triangle List
`
The first edge in each triangle is a bold line. For OpenGL, the last vertex (shown with a square box in [ REF _Ref685804 \r \h \* MERGEFORMAT ]) is used as the provoking vertex. For Direct3D, the first vertex (shown in a circle in [ REF _Ref685804 \r \h \* MERGEFORMAT ]) in each triangle in a triangle list is the provoking vertex.

[Figure: triangle list, OpenGL and D3D order]
`
`
`3.11.2 Triangle Strip
`
The first edge in each triangle is a bold line. Note for OpenGL, only the last vertex (shown with a square box in [ REF _Ref685843 \r \h \* MERGEFORMAT ]) in each triangle progresses in a series (V2, V3, V4, etc). For OpenGL, the last vertex is used as the provoking vertex.

[Figure: triangle strip, OpenGL order, with the OpenGL provoking vertex marked]
`
`
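The first edge in each triangle is a bold line. Note for Direct3D, only the first vertex (shown in a circle in [ REF _Ref685858 \r \h \* MERGEFORMAT ]) in each triangle progresses in a series (V0, V1, V2, etc). For Direct3D, the first vertex is used as the provoking vertex.

[Figure: triangle strip, D3D order]

3.11.3 Triangle Fan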
`
The first edge in each triangle is a bold line. Note for OpenGL, the last vertex (shown with a square box in [ REF _Ref685872 \r \h \* MERGEFORMAT ]) is used as the provoking vertex.

The first edge in each triangle is a bold line. Note for Direct3D, the first vertex (shown in a circle in [ REF _Ref685888 \r \h \* MERGEFORMAT ]) is used as the provoking vertex.

[Figure: triangle fan, OpenGL order]
`
`
[Figure: triangle fan, D3D order]
`
`3.11.4 Quad List
`
The first edge in each quad is a bold line. Note for OpenGL, the last vertex (shown with a square box [