ORIGINATE: 29-Nov-11
EDIT DATE: 12-Feb-14
DOCUMENT-VER. NUM.: 1.0
PAGE: 1 of 35

GFXIP_9x SX

Micro-Architecture Specification

Rev 1.0, Last Edit: 12-Feb-14

THIS DOCUMENT CONTAINS INFORMATION THAT COULD BE SUBSTANTIALLY DETRIMENTAL TO THE INTEREST OF AMD THROUGH UNLICENSED USE OR UNAUTHORIZED DISCLOSURE.

1. Preserve this document's integrity:
   * Do not reproduce any portions of it.
   * Do not separate any pages from this cover.
2. This document is issued to you alone. Do not transfer it to or share it with another person, even within your organization.
3. Store this document in a locked cabinet accessible only by authorized users. Do not leave it unattended.
4. When you no longer need this document, return it to AMD. Please do not discard it.

"Copyright 2011, Advanced Micro Devices, Inc. ("AMD"). All rights reserved. This work contains confidential, proprietary to the reader information and trade secrets of AMD. No part of this document may be used, reproduced, or transmitted in any form or by any means without the prior written permission of AMD."

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG. HDMI is a trademark of HDMI Licensing, LLC.

AMD (NYSE: AMD) is a semiconductor design innovator leading the next era of vivid digital experiences with its ground-breaking AMD Fusion Accelerated Processing Units (APU). AMD's graphics and computing technologies power a variety of devices including PCs, game consoles and the powerful computers that drive the Internet and businesses. For more information, visit http://www.amd.com.
`AMD1044_0104744
`
`ATI Ex. 2025
`IPR2023-00922
`Page 1 of 35
`Revision History
`
Date | By | Revision | Description
(illegible) | (illegible) | (illegible) | Initial copy from gfx8
(illegible) | (illegible) | (illegible) | First edit for out-of-date contents
`
Table of Contents

PREFACE
1 INTRODUCTION
  1.1 Definitions / Glossary of Terms
  1.2 Top Level Diagram
      Figure 1: SX in chip context
  1.3 Required Features
      1.3.1 SX support for 2 PC/GDS redirect busses instead of 1 in 1 SE configurations (16 PIX PER SH)
      1.3.2 SX support for 4 RBs (16 PIX PER SH)
      1.3.3 SX support for deeper Position buffer, Color buffers and position alloc storage (16 PIX PER SH)
      1.3.4 Streaming of performance counters (STREAM PERF CNTRS (GPU.01 & GPU.02))
1 DETAILED CHANGE DESCRIPTION
  1.1 SX support for 2 PC/GDS redirect busses instead of 1 in 1 or 2 SE configurations (1 SH per SE only)
      1.1.1 Testing
  1.2 SX support for 4 RBs
      1.2.1 Testing
  1.3 SX support for deeper Position buffer, Color buffers and position alloc storage
2 PERFORMANCE
3 POWER
4 HARDWARE IMPLEMENTATION: TOP LEVEL
  4.1 Top Level Drawing
      Figure 2: SX Top Level Diagram
  4.2 Addressing the buffers
  4.3 Format of the data
  4.4 Controls of an export
  4.5 Color export
      4.5.1 The color scoreboard
      4.5.2 Export buffer address computation
      Figure 3: An example of addressing the export buffer
  4.6 Position exports
  4.7 Redirect exports
5 HARDWARE IMPLEMENTATION: INTERFACE
  5.1 Shader Core Interfaces (SPI/SQ/SP)
      5.1.1 SX_SPI_EXPREQ
      5.1.2 SPI_SX_EXPGRANT
      5.1.3 SQ_SX_EXPCMD
      5.1.4 SPI_SX_EXPADDR
      5.1.5 SX_SPI Free Signals
      5.1.6 SP_SX_VDATA
  5.2 SX to PC Interfaces
      SX_PC_EXPCMD
  5.3 SX to PA Interfaces
  5.4 SX to DB Interfaces
`
A micro-architecture (or block level) specification serves many purposes.

* The micro-architecture specification is used by design verification teams to build block and system level verification environments (test benches, test scenarios, test cases and testing methodologies/strategies). Although a document is clearly not the ONLY vehicle used by design verification teams, it is a critical piece of the verification planning process in that it provides valuable data to support the test plan meetings between design and verification teams.
* The micro-architecture specification is also a useful tool for peer and block design reviews. Interfacing blocks require explicit details of control and handling of data being transferred or transformed between blocks. The specification is the obvious resource that peer teams go to for this type of information.
* This specification is also used by post-silicon verification teams and in the creation of documentation (such as programming guidelines, etc.) that must be prepared for external customers.
* This documentation is also useful when designs are transferred to other teams. For example, derivatives of a graphics core are used in many other products that include an integrated core, hand held devices, etc.

Each new major architecture should contain a micro-architecture specification for each block in the subsystem. In the case where a design is derived from a previous project, that previous project specification would be updated and checked into the new project revision control documentation area. All the delta features would be described in the feature section of the micro-architecture specification, and the document in general should be updated to match the new project's updated block design.

A template is provided as a means of describing the detail required by all teams that use the micro-architecture specification and to drive consistency between the specifications from one block to another. This template was created by the Design Verification Workgroup, which is comprised of representatives across all AMD Business Units. This template has been and will be distributed to other Business Units and Design, Design Verification and Architecture teams for review and feedback. This template was created from a review of many of the existing micro-architecture specifications. Examples are pulled from these documents and presented here to illustrate the type of information that is required. To distinguish between the descriptions of content and examples, all examples appear in italics.
`
The SX (shader export) block is responsible for receiving and re-ordering color and position exports from the shader core and forwarding them to the correct client: the PA for position, and the correct DB for color. The SX is also a conduit for parameter and GDS exports and in this role forwards the data unchanged to the PC (parameter cache) block.

The main input for the SX block is the shader output bus, which is 2 busses each 16x32 bits wide. The SX output busses are:
1) A 128 bit wide bus to the PA (primitive assembler) block, supporting 1 position per clock.
2) 256 bits to each DB (supporting up to 4 DBs per SX), thus supporting 4 "compressed" pixels per clock (64 bpp) or 2 uncompressed pixels per clock (128 bpp).
3) A 16x32 bit bus to PC/GDS.
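
The bus figures above follow from simple width arithmetic. The sketch below only restates that arithmetic; the function and constant names are illustrative and do not come from the design.

```python
# Widths taken from the text above; names are illustrative only.
DB_BUS_BITS = 256                  # SX-to-DB bus width
SHADER_INPUT_BITS = 2 * 16 * 32    # two input busses, each 16x32 bits


def pixels_per_clock(bits_per_pixel: int, bus_bits: int = DB_BUS_BITS) -> int:
    """Whole pixels that fit in one bus transfer per clock."""
    if bits_per_pixel <= 0:
        raise ValueError("bits_per_pixel must be positive")
    return bus_bits // bits_per_pixel


print(pixels_per_clock(64))    # compressed pixels per clock
print(pixels_per_clock(128))   # uncompressed pixels per clock
print(SHADER_INPUT_BITS)       # total shader input bits per clock
```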
`
1.1 Definitions / glossary of terms

Thread: one instance of a shader program being executed on a vector of pixels, vertices, or primitives. Each thread has its own state which is unique from any other thread.
Clause: a group of instructions all of the same type (all ALU, all texture-fetch, etc.) executed as a group; part of a thread.
Wave: one instruction operates on a wave of pixels/vertices/primitives over 4 clock cycles. This is the basic unit of work. The size of a vector depends on the system configuration, but is always 4 clock cycles.
SIMD: Single Instruction, Multiple Data. Here, a SIMD refers to one slice of the SP machine which all receives the same instruction and operates on a vector of data. Most implementations will have multiple SIMDs. Each SIMD receives a separate instruction from the SQ.
`
`1.2 Top Level Diagram
`
Figure 1: SX in chip context

This is the SX block in full chip context. The SX sits at the bottom of the SH array and as such there will be as many SX blocks as there are SH blocks (there can be one SH block or 2 SH blocks per SE depending on the configuration chosen). Up to 4 DBs can be connected to a single SX block, and two 512 bit input busses are the sole data inputs to the block. The outputs of the SX, other than the DBs, are the PA for position and the PC/GDS for parameter/GDS transactions.
`
Requirements (Delta Requirements)

For more details, please see GFXIP_9x_SX_delta.doc

1.3 Required Features

1.3.1 Support RB+ feature for 2RB setting
a. Motivation: gfx9 9.87, Enable dual RB+ per SE (Scalable pixel rate per shader array)
b. Area: 35%
c. Schedule: 2 week EMU, 4 week RTL, 6 week DV
d. Impacted blocks: SPI, SX, SC, etc.
`
`Detailed Change Description
`
1.1 SX support for 2 PC/GDS redirect busses instead of 1 in 1 or 2 SE configurations (1 SH per SE only)

The SX will decode the SQ_SX_expcmd bus and replicate the control data for 4 clocks on 2 separate SX_PC_expcmd busses. This is to make this change transparent to the PC blocks.

The 2 SX busses configuration is enabled when GPU__GC__DUAL_PC_EXPORT_BUS == 1. This feature should only be enabled if NUM_SE <= 2 and NUM_SH_PER_SE == 1. In this case only 1 of the vdata busses can carry GDS information at a time, as the GDS will only have 2 input ports to the actual logic (the other ports are just pass through for the PC).
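
The enable condition and the GDS restriction above can be written as two small predicates. This is a minimal sketch of the rules as stated in the text; the function names are hypothetical and are not taken from the RTL or any test bench.

```python
def dual_pc_export_bus_allowed(num_se: int, num_sh_per_se: int) -> bool:
    """True when the 2-bus configuration may be enabled (NUM_SE <= 2, NUM_SH_PER_SE == 1)."""
    return num_se <= 2 and num_sh_per_se == 1


def gds_transfer_legal(bus0_is_gds: bool, bus1_is_gds: bool) -> bool:
    """Only one of the two vdata busses may carry GDS information at a time."""
    return not (bus0_is_gds and bus1_is_gds)


print(dual_pc_export_bus_allowed(num_se=2, num_sh_per_se=1))
print(dual_pc_export_bus_allowed(num_se=4, num_sh_per_se=1))
print(gds_transfer_legal(True, True))
```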
`
1.1.1 Testing

1. Test with lots of parameters and no pixels, and a HW coverage assert, to make sure the SPI is able to use both busses at the same time for PC transfers.
2. Test with lots of parameters and GDS ops to make sure the SPI doesn't allow 2 GDS transactions at the same time.
`
`1.2 SX support for 4 RBs.
`
This feature is controlled by:

gpu.gc.num_rb_per_sx=4 // gpu.gc.num_rb_per_se / gpu.gc.num_sh_per_se

Everything should already be ready in the SX to support this, but we need to test it and make sure everything works as expected, as the 4 RB per SX option was not tested much.

1.2.1 Testing

1) PERF: Fill tests with various formats to make sure we can peg the export busses. With 4 RBs per SX we are balanced, so we need to fully utilize the export busses to enable all 4 RBs at once.
`
1.3 SX support for deeper Position buffer, Color buffers and position alloc storage

Those features are all controlled by feature flags:

gpu.sx.color_scoreboard_slots=64 // number of color waves
gpu.sx.pos_scoreboard_slots=32 // number of vertex waves
gpu.sx.color_export_buffer_size=256 // this is the depth of the memory in the SX
gpu.sx.pos_export_buffer_size=512 // this is the depth of the memory in the SX
gpu.sx.color_export_reg_buffer_size=1024 // this is the logical size of the buffer, should be 4x color_export_buffer_size
gpu.sx.pos_export_reg_buffer_size=2048 // this is the logical size of the buffer, should be 4x pos_export_buffer_size

case Thebes
The Kryptos project is using the above settings.
endcase

We need to support values of 16, 32 and 64 for gpu.sx.pos_scoreboard_slots, 256 and 512 for gpu.sx.color_export_buffer_size, and 256 and 512 for gpu.sx.pos_export_buffer_size. Those affect memory depth, pointer widths and wrap points in the SX. The last 2 settings affect the SPI but must be changed to always be 4x the above values.

Right now, we are not planning to make the color buffer any deeper than it is, for area reasons, but we will make the position buffer deeper to enable the same number of VS waves per SE as Tahiti used to have.
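
The constraints on these flags (the supported values, and the 4x relationship between physical depth and logical size) can be checked mechanically. The sketch below is an illustration only: the dictionary keys mirror the flag names above without the gpu.sx. prefix, and check_sx_config is a hypothetical helper, not part of any AMD tool.

```python
# Supported values as listed in the text above.
ALLOWED_VALUES = {
    "pos_scoreboard_slots": {16, 32, 64},
    "color_export_buffer_size": {256, 512},
    "pos_export_buffer_size": {256, 512},
}


def check_sx_config(cfg: dict) -> list:
    """Return a list of human-readable violations (empty when the config is consistent)."""
    errors = []
    for key, allowed in ALLOWED_VALUES.items():
        if cfg[key] not in allowed:
            errors.append(f"{key}={cfg[key]} not in {sorted(allowed)}")
    for kind in ("color", "pos"):
        depth = cfg[f"{kind}_export_buffer_size"]
        logical = cfg[f"{kind}_export_reg_buffer_size"]
        if logical != 4 * depth:  # logical size must be 4x the memory depth
            errors.append(f"{kind}_export_reg_buffer_size={logical}, expected {4 * depth}")
    return errors


settings = {
    "color_scoreboard_slots": 64,
    "pos_scoreboard_slots": 32,
    "color_export_buffer_size": 256,
    "pos_export_buffer_size": 512,
    "color_export_reg_buffer_size": 1024,
    "pos_export_reg_buffer_size": 2048,
}
print(check_sx_config(settings))
```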
`
2 Performance

The SX has peak data input and output rate requirements:
* Sustain the receipt of shader data exports at the maximum rate of 1024 bits per clock.
* Sustain {2-4} pixels per clock peak output to each DB, depending on surface format.
* Sustain position exports to maintain 1 primitive per clock operation in the PA per SX.

3 Power

We have picked out 4 main busy signals to generate the CAC signal which will be used by the power team. They have the same weight and are added together; the MSB will be used as the CAC output.
`
`4 Hardware Implementation: Top Level
`
The SX works on 3 classes of exports from the shader: color, position and redirect exports. We will now explain all 3 classes in detail.
`
4.1 Top Level Drawing

[Figure 2 image not reproduced. Legible annotation: position can only be exported on 1 bus at a given time. The remaining annotations are not recoverable.]

Figure 2: SX Top Level Diagram

4.2 Addressing the buffers
`
Before we go into the details of how each export mode works, it is important to understand how the various export buffers are addressed. The SPI is the master controller for all the export buffers: it controls the allocation and supplies the base addresses for each wavefront, and does so on every export. The addresses supplied by the SPI are always per 128 bit chunk, and as such (since the export buffers really consist of 4 128 bit memories side by side) it takes an increment of 4 in the buffer address before a new address is used in the memories. This is done so we can pack multiple exports on a single memory address. Here is the format of the address as supplied by the SPI:

Memory address (10:2) | Memory ID (1:0)
`
`
The 2 LSBs are used to address which memory you want to write to, and the upper 9 bits are the memory address itself. As such, the computed address needs to wrap at GPU_SX_{COL/POS}_EXPORT_BUFFER_DEPTH*4. The SPI will take care of wrapping the base addresses, but the SX is responsible for wrapping any internal address it computes (within the wave).
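
As an illustration of the address format, the sketch below splits an SPI-supplied address into its memory ID (bits 1:0) and memory address (bits 10:2), and wraps a computed address at four times the buffer depth. The function names are hypothetical; the bit positions and the wrap point come from the text above.

```python
def split_export_address(addr: int) -> tuple:
    """Return (memory_id, memory_address) for an 11-bit export buffer address."""
    memory_id = addr & 0x3                # bits 1:0 select one of the 4 memories
    memory_address = (addr >> 2) & 0x1FF  # bits 10:2 address within that memory
    return memory_id, memory_address


def wrap_export_address(addr: int, buffer_depth: int) -> int:
    """Wrap an internally computed address at EXPORT_BUFFER_DEPTH * 4."""
    return addr % (buffer_depth * 4)


print(split_export_address(5))         # memory 1, address 1
print(wrap_export_address(1027, 256))  # wraps past the end of a 256-deep buffer
```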
`
4.3 Format of the data

The shader core, because of its scalar nature, changed a lot from the previous architectures. Because of this it will export the data differently than it used to. The native form of exports will now be per component instead of per pixel. So instead of receiving 4 XYZW pixels, the SX will receive 16 X components from 16 different pixels, followed by 16 Ys and so on (until all the required components, not necessarily all 4, are exported). The render backends still expect to see the data in the same 4 XYZW format, so the SX must now reformat the data that was sent from the shader.
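
The reformatting described above is essentially a transpose from component-major to pixel-major order. The sketch below shows that idea on plain Python lists; it ignores bit widths, write masks and timing, and the function name is illustrative only.

```python
def components_to_pixels(component_rows):
    """Transpose component-major data (one row per component, one entry per pixel)
    into pixel-major data (one entry per pixel holding its components in order)."""
    return [list(pixel) for pixel in zip(*component_rows)]


# Two exported components (X then Y) for 16 pixels; values are arbitrary markers.
x_row = list(range(16))
y_row = [100 + p for p in range(16)]
pixels = components_to_pixels([x_row, y_row])
print(pixels[3])
```

Only the components that were actually exported appear in each pixel, which matches the point that not all 4 components are necessarily sent.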
`
For color we are planning to support 3 native sizes across 6 formats (all specified in the SPI_SHADER_COL_FORMAT and SPI_SHADER_Z_FORMAT registers). Those formats are:

00 - SPI_SHADER_ZERO: No exports done (0C)
01 - SPI_SHADER_32_R: Can be FP32 or SINT32/UINT32 Red Component (1C)
02 - SPI_SHADER_32_GR: Can be FP32 or SINT32/UINT32 GR Components (2C)
03 - SPI_SHADER_32_AR: Can be FP32 or SINT32/UINT32 AR Components (2C)
04 - SPI_SHADER_FP16_ABGR: FP16 ABGR Components (2C)
05 - SPI_SHADER_UNORM16_ABGR: UNORM16 ABGR Components (2C)
06 - SPI_SHADER_SNORM16_ABGR: SNORM16 ABGR Components (2C)
07 - SPI_SHADER_UINT16_ABGR: UINT16 ABGR Components (2C)
08 - SPI_SHADER_SINT16_ABGR: SINT16 ABGR Components (2C)
09 - SPI_SHADER_32_ABGR: Can be FP32 or SINT32/UINT32 ABGR Components (4C)

For position, the formats are a little simpler and are stored in the SPI_SHADER_POS_FORMAT register. They are:

00 - SPI_SHADER_NONE: SPI_SHADER_NONE (0C)
01 - SPI_SHADER_1COMP: SPI_SHADER_1COMP (1C)
02 - SPI_SHADER_2COMP: SPI_SHADER_2COMP (2C)
03 - SPI_SHADER_4COMPRESS: SPI_SHADER_4COMPRESS (2C)
04 - SPI_SHADER_4COMP: SPI_SHADER_4COMP (4C)

This format can be different per MRT. It is the responsibility of the shader to output the data in the right format as specified by the SPI_SHADER_*_FORMAT registers. While the register lists all those modes (for future compatibility), currently only modes 0 or 4 are supported by the HW for position exports.
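
A small decoder makes the per-MRT format encoding concrete. The sketch assumes one 4-bit format code per MRT with MRT0 in the least significant nibble; that packing is an assumption, chosen because it reproduces the 0x1291 example used in section 4.5.2. The table of component counts is copied from the color format list above, and the function name is hypothetical.

```python
# Components exported for each color format code (from the list above).
COLOR_FORMAT_COMPONENTS = {0: 0, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 4}


def mrt_component_counts(col_format: int, num_mrts: int = 8) -> list:
    """Decode a packed color format value into a component count per MRT.

    Assumes 4 bits per MRT, MRT0 in bits 3:0 (an assumption, see lead-in).
    """
    counts = []
    for mrt in range(num_mrts):
        code = (col_format >> (4 * mrt)) & 0xF
        if code not in COLOR_FORMAT_COMPONENTS:
            raise ValueError(f"unknown format code {code} for MRT{mrt}")
        counts.append(COLOR_FORMAT_COMPONENTS[code])
    return counts


counts = mrt_component_counts(0x1291)
print(counts[:4], sum(counts))
```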
`
There will be a maximum of 2 512 bit data busses supported, controlled by 1 SPI expaddr bus (phase 0: bus 0, phase 2: bus 1) and 1 SQ expcmd bus (same rules as the SPI bus).

4.4 Controls of an export

An export to the SX, whether it is color, position or redirect, always starts with an SPI_SX_expaddr transaction. This 1 clock transaction tells the SX about the state of the export and which data bus will carry the data. The simd field is used for this purpose (simd 0/1 tells the SX to look at vdata bus 0, and the simd 2/3 setting is for vdata bus 1).

A fixed number of clocks later, the SQ_SX_expcmd bus will trigger. This is also a 1 clock bus and obeys the same simd rules as the SPI bus. This bus further qualifies the state for the export to happen.
`
An export transaction is concluded with a 4 clock transaction on either of the SP_SX_vdata{0,1} busses. This bus carries the data to be written and the write enable masks.
`
4.5 Color export

Because the shader core now exports all of its data per component, there was an opportunity to compress the export buffers. Before, room for 4 components was always reserved even if we were using only 1 or 2 of them. Now that only the components that are being used are exported (and the shader is responsible for making sure the format of the shader export is consistent with the format specified in the SPI_SHADER_*_FORMAT registers), we can compress the buffers and only write the needed data (without holes). This has multiple advantages:

1) Since only the valid channels are written, the same buffer depth that we used to have in older architectures will hide more shader latency in the compressed format cases.
2) Only the used data is written/read, so we will save power and always read/write as fast as we can.
3) All converters and alpha test can be moved to shader code to save area and some more power.

In order to correctly pack the data into the color/position buffer, the SX will have to rely on some state in order to leave enough room in between components to write the next components of the same pixel. The same is true for MRTs: since the DB expects the MRTs of a given quad to all be sent before we move to the next quad, the data needs to be written in consecutive (or close to consecutive) memory locations. In order to achieve this, the SX will implement the following address equation (per DB).
`
4.5.1 The color scoreboard

Since waves come from the shader out of order, we have a scoreboard to maintain the order of the incoming waves. Only those waves which have been tagged in the scoreboard (the SPI_EXPADDR bus will contain scoreboard id information) will be sent out of the SX, from smaller to bigger range. The scoreboard should be crawled one by one. The scoreboard size differs from project to project.
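
One way to picture the scoreboard is as a ring of slots with a single crawl pointer: waves are tagged in any order, and the pointer only advances over tagged slots. The sketch below is a behavioral illustration under stated assumptions (slot ids are handed out in increasing order with wrap-around, and a slot is cleared when released); the class and method names are hypothetical and do not describe the RTL.

```python
class ColorScoreboard:
    """Ring of slots; waves are released strictly in slot order."""

    def __init__(self, slots: int):
        self.tagged = [False] * slots
        self.head = 0  # next slot allowed to leave the SX

    def tag(self, slot_id: int) -> None:
        """Mark a wave's slot as ready (exports may complete out of order)."""
        self.tagged[slot_id] = True

    def crawl(self) -> list:
        """Release every consecutive tagged slot starting at the head, one by one."""
        released = []
        while self.tagged[self.head]:
            self.tagged[self.head] = False
            released.append(self.head)
            self.head = (self.head + 1) % len(self.tagged)
        return released


sb = ColorScoreboard(slots=4)
sb.tag(1)            # wave 1 finishes first
print(sb.crawl())    # nothing released, wave 0 is still outstanding
sb.tag(0)
print(sb.crawl())    # waves 0 and 1 leave in order
```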
`
4.5.2 Export buffer address computation

4.5.2.1 Before gfx8.1

Address = base (from the SPI) + MRT_FULL_SIZE + quad# in phase (compressed) * MRT_CUR_SIZE + MRT_PREV_SIZE + COMP

Where:

MRT_FULL_SIZE = SUM_FORMATS * # quads in prev_phases
MRT_CUR_SIZE = size of current MRT per SPI_SHADER_*_FORMAT register
MRT_PREV_SIZE = # quads in phase * (sum of prev MRTs per SPI_SHADER_*_FORMAT register (have to use DST to know))

if (MRT_FORMAT == 1C)
    COMP = 0
if (MRT_FORMAT == 2C)
    COMP = (component + quad#/2) % 2
if (MRT_FORMAT == 4C)
    COMP = (component + quad#) % 4
`
`
For example, let's say that we examine DB0's export buffer (each DB has its own export buffer and is uniquely addressed). Let's say the SPI_SHADER_COL_FORMAT is programmed as 0x1291, so 4 MRTs are exported by the shader and their format is:

MRT0 1C
MRT1 4C
MRT2 2C
MRT3 1C

Let's further say that DB0 will get 1 quad in phase 0, 3 in phase 1, 4 in phase 2 and 1 in phase 3. Then, starting at address 0, the export buffer would look like:

Figure 3: An example of addressing the export buffer

The SPI always allocates # quads written to this DB * SUM of all MRT formats, so in this case 9 quads * 8C = 72 locations.
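
The allocation in this example, and the MRT_FULL_SIZE term of the pre-gfx8.1 address equation, can be reproduced with a few lines of arithmetic. This sketch covers only those two quantities, not the full address equation; the function names are illustrative.

```python
def spi_allocation(mrt_sizes, quads_per_phase) -> int:
    """Locations allocated by the SPI: (# quads to this DB) * (sum of all MRT formats)."""
    return sum(quads_per_phase) * sum(mrt_sizes)


def phase_base_offsets(mrt_sizes, quads_per_phase) -> list:
    """MRT_FULL_SIZE per phase: SUM_FORMATS * (# quads in previous phases)."""
    sum_formats = sum(mrt_sizes)
    offsets, quads_before = [], 0
    for quads in quads_per_phase:
        offsets.append(sum_formats * quads_before)
        quads_before += quads
    return offsets


mrt_sizes = [1, 4, 2, 1]        # MRT0..MRT3 component counts for 0x1291
quads_per_phase = [1, 3, 4, 1]  # quads DB0 receives in phases 0..3
print(spi_allocation(mrt_sizes, quads_per_phase))
print(phase_base_offsets(mrt_sizes, quads_per_phase))
```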
4.5.2.2 Gfx8.1 and later

In order to support RB+, we slightly modify the buffer management behavior. The address calculation order is changed from Loop Phase -> Loop MRT -> Loop Quad (Figure 4) to Loop MRT -> Loop Phase -> Loop Quad (Figure 5).

[Figure 4 image not reproduced: buffer layout ordered by phase (phase 0 to phase 3), then MRT, then quad.]

Figure 4: Old memory address calculation

[Figure 5 image not reproduced: buffer layout ordered by MRT, then phase, then quad.]

Figure 5: New memory address calculation
`
With this new memory organization, we avoid some bank conflict issues that arise from supporting the new RB+ feature. More detail can be found in the document PLSX LinkQuad_investigation.docx
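
The difference between the two orderings is only the nesting of the loops. The sketch below enumerates (phase, MRT, quad) tuples in the old and new order for a small configuration; it illustrates the traversal order only and does not compute real buffer addresses. The function names are hypothetical.

```python
def old_order(num_mrts: int, quads_per_phase: list) -> list:
    """Before gfx8.1: loop phase -> loop MRT -> loop quad."""
    return [(phase, mrt, quad)
            for phase, quads in enumerate(quads_per_phase)
            for mrt in range(num_mrts)
            for quad in range(quads)]


def new_order(num_mrts: int, quads_per_phase: list) -> list:
    """Gfx8.1 and later: loop MRT -> loop phase -> loop quad."""
    return [(phase, mrt, quad)
            for mrt in range(num_mrts)
            for phase, quads in enumerate(quads_per_phase)
            for quad in range(quads)]


print(old_order(2, [1, 2]))
print(new_order(2, [1, 2]))
```

Both functions visit the same set of entries; in the new order all quads of one MRT are contiguous across phases.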
`
4.6 Position exports

Position exports work in a similar fashion to the old color exports, but since there is only 1 position buffer (4 memories, 128 bits wide each) and 1 client reading the buffer (the PA), its management is slightly easier.

In the current plan of record, the SPI will always allocate 64 positions and the SX will always free 64, no matter how many vertices are actually in the buffer. The deallocation will be done once a position export for the full vector is completed.
`
4.7 Redirect exports

Since the parameter caches moved outside of the SX block and the SX is the datapath for the GDS writes, the SX needs to redirect some of the data it receives from the shader to the PC/GDS blocks. It knows what to redirect and when to do so by looking at the SPI_SX_expaddr passthrough bit and the SQ_SX_expcmd_target field. When set to a location outside the SX (PC or GDS), the SX will write the input data on the SX_PC_expcmd/vdata busses instead of processing the transaction internally.

What is different from CI to SI is that we can have 2 redirect busses when the dual_pc_bus feature is on.
`
4.8 RB+ feature

The data bus between the SX and the DB is 256 bits wide, and if one quad is 32 bpp, the bus could transfer 2 quads if it were fully utilized. That is what we want to improve with the RB+ feature.

The SX will receive pixel wave export information (one wave includes 16 quads) from the SPI as in the previous project. In order to support the RB+ feature, the SX will look at the relationship of adjacent quads. If they are tagged as paired, there is a potential chance to enter 8 pixel per clock mode. We will look at the source data detection result (inspect the export data and do a comparison according to some predefined rules) and check the down-convert table. If both operations pass, those 2 quads will be sent together in one 256 bit transaction.
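
The pairing decision can be sketched as a pass over the quads of a wave. This is an illustration under assumptions: each quad is given a pairing tag relative to the next quad, and the detection and down-convert checks are modeled as per-quad booleans that must hold for both quads of a pair. The Quad fields and build_transactions are hypothetical names, not signals from the design.

```python
from dataclasses import dataclass


@dataclass
class Quad:
    index: int
    paired_with_next: bool   # tagged as paired with the following quad
    detection_pass: bool     # source data detection result
    downconvert_pass: bool   # down-convert table check result


def build_transactions(quads: list) -> list:
    """Group quads into transactions: a tuple of 2 indices when both checks pass
    for a tagged pair, otherwise a tuple holding a single index."""
    transactions = []
    i = 0
    while i < len(quads):
        q = quads[i]
        nxt = quads[i + 1] if i + 1 < len(quads) else None
        can_pair = (
            nxt is not None
            and q.paired_with_next
            and q.detection_pass and q.downconvert_pass
            and nxt.detection_pass and nxt.downconvert_pass
        )
        if can_pair:
            transactions.append((q.index, nxt.index))
            i += 2
        else:
            transactions.append((q.index,))
            i += 1
    return transactions


wave = [
    Quad(0, True, True, True),
    Quad(1, False, True, True),
    Quad(2, True, True, False),   # fails the down-convert check, so no pairing
    Quad(3, False, True, True),
]
print(build_transactions(wave))
```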
`
4.8.1 Source data detection

The first step of source data detection generates a check result between the export data and a reference value (0/1). For example, if the setting is to look for the value 0, the export data will be checked to see if it is 0. If it is, we regard the check result as a pass.

For different formats, a small value epsilon will be used to judge whether the data is close to 0 or 1 (and regard it as 0 or 1). This is also defined in a register.
`
Per Context Per MRT Register: Add the following registers per state set and per each of the 8 MRTs.

BLEND_OPT_EPSILON: selects how close a source value must be to 0.0 or 1.0 to be treated as 0.0 or 1.0. The settings run from 00 to 15, each of the form "Must be within 1.0*2^-N", and some settings are marked for particular format depths (for example, 15 is marked "set for 4 bit formats"). [Most per-setting exponents are not recoverable from this copy; 08 reads "Must be within 1.0*2^-8".]

Open #1: get rid of this and have a single bit that says whether the source data is checked before or after the format down-conversion?

The detection of 0.0 or 1.0 can be done via compares on the unbiased exponent only of the FP16 or FP32 channels and the MSB of the mantissa, or else the upper bits of the *NORM channels. The intention is to match the precision of the zero and one detection to the precision of the destination format. For example, if a value of 2^-12 is being added into an 8 bit surface which can only represent 2^-8, there will be no change to the destination result even though the source data is non-zero.
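
The intent of the epsilon check can be shown with ordinary floating point arithmetic. The sketch below is a numerical illustration only: the hardware is described as comparing exponent and mantissa bits, whereas this code uses a plain magnitude comparison, and treating "within" as less-than-or-equal is an assumption. The function name is hypothetical.

```python
def classify_near_zero_or_one(value: float, n: int):
    """Return 0 if value is within 2^-n of 0.0, 1 if within 2^-n of 1.0, else None."""
    epsilon = 2.0 ** -n
    if abs(value) <= epsilon:
        return 0
    if abs(value - 1.0) <= epsilon:
        return 1
    return None


# The example from the text: 2^-12 is invisible to a destination with 2^-8 precision.
print(classify_near_zero_or_one(2.0 ** -12, 8))
print(classify_near_zero_or_one(1.0, 8))
print(classify_near_zero_or_one(0.5, 8))
```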
Source data detection is needed for these SPI_SHADER export formats:

SPI_SHADER export format | Match on
FP16_ABGR | Exponent and MSB of mantissa
32_ABGR (FP32) | Exponent and MSB of mantissa
SNORM16_ABGR | N+1 bits from the 2^-N epsilon
[Other rows of this table are not recoverable from this copy.]

The result is a set of masks of which pixels have "alpha" and/or "color" close enough to 0.0 and/or 1.0. If alpha does not exist in the shader export format, it can be considered 1.0. If all of the shader exported color channels are close to 0.0 or 1.0, then "color" is 1.0 or 0.0, otherwise it is neither.

Such a check is done on color and alpha independently. Then the second step is to generate the bypass/don't_rd_dst check result by the rules below:
`
Per Context Per MRT Registers: Add SX_MRT#_BLEND_OPT per state set and per each of the 8 MRTs, which controls how to detect specific optimizations on source and destination data and how to combine those optimizations independently for alpha and color.

Field Name | Bits | Description

COLOR_SRC_OPT | 2:0 | Set which values in color or alpha will preserve the source color data and which values will allow the color source data to be ignored. Settings can be derived from COLOR_SRCBLEND.
POSSIBLE VALUES:
00 - BLEND_OPT_PRESERVE_NONE_IGNORE_ALL: Set with ZERO
01 - BLEND_OPT_PRESERVE_ALL_IGNORE_NONE: Set with ONE
02 - BLEND_OPT_PRESERVE_C1_IGNORE_C0: Set with SRC_COLOR
03 - BLEND_OPT_PRESERVE_C0_IGNORE_C1: Set with ONE_MINUS_SRC_COLOR
04 - BLEND_OPT_PRESERVE_A1_IGNORE_A0: Set with SRC_ALPHA
05 - BLEND_OPT_PRESERVE_A0_IGNORE_A1: Set with ONE_MINUS_SRC_ALPHA
06 - BLEND_OPT_PRESERVE_NONE_IGNORE_A0: Set with ALPHA_SATURATE
07 - BLEND_OPT_PRESERVE_NONE_IGNORE_NONE: Set with any other BLEND_* mode

COLOR_DST_OPT | 6:4 | Set which values in color or alpha will preserve the dest color data and which values will allow the color dest data to be ignored. [The "Settings can be derived from" reference for this field is not legible in this copy.]
POSSIBLE VALUES: same encodings 00 to 07 as COLOR_SRC_OPT.

COLOR_COMB_FCN | 10:8 | Set how to combine the source and destination optimizations to figure out when an overwrite can bypass the blender, destination reads can be skipped, and/or the whole pixel can be discarded. Settings can be derived from COLOR_COMB_FCN.
POSSIBLE VALUES:
00 - OPT_COMB_NONE: No optimizations are enabled.
01 - OPT_COMB_ADD: Set with DST_PLUS_SRC
02 - OPT_COMB_SUBTRACT: Set with SRC_MINUS_DST
03 - OPT_COMB_MIN: Set with MIN_DST_SRC
04 - OPT_COMB_MAX: Set with MAX_DST_SRC
05 - OPT_COMB_REVSUBTRACT: Set with DST_MINUS_SRC
06 - OPT_COMB_BLEND_DISABLED: Set this or *_OPT_DISABLE when blend is disabled
07 - OPT_COMB_SAFE_ADD: Same as legacy CB auto mode

ALPHA_SRC_OPT | 18:16 | Set which values in color or alpha will preserve the source alpha data and which values will allow the alpha source data to be ignored. Note that ALPHA_SATURATE in alpha should be set as ONE and COLOR is the same as ALPHA.
POSSIBLE VALUES: same encodings 00 to 07 as COLOR_SRC_OPT.

ALPHA_DST_OPT | 22:20 | Set which values in color or alpha will preserve the dest alpha data and which values will allow the alpha dest data to be ignored. Note that ALPHA_SATURATE in alpha should be set as ONE and COLOR is the same as ALPHA.
POSSIBLE VALUES: same encodings 00 to 07 as COLOR_SRC_OPT.

ALPHA_COMB_FCN | 26:24 | Set how to combine the source and destination optimizations to figure out when an overwrite can bypass the blender, destination reads can be skipped, and/or the whole pixel can be discarded. Settings can be derived from ALPHA_COMB_FCN.
POSSIBLE VALUES: same encodings 00 to 07 as COLOR_COMB_FCN.
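
Extracting the six fields from a register value is straightforward bit slicing. The sketch below uses the bit ranges read from the field table above (2:0, 6:4, 10:8, 18:16, 22:20, 26:24); the dictionary and function names are illustrative and do not come from any driver or header.

```python
# (high bit, low bit) for each field, as read from the field table above.
BLEND_OPT_FIELDS = {
    "COLOR_SRC_OPT": (2, 0),
    "COLOR_DST_OPT": (6, 4),
    "COLOR_COMB_FCN": (10, 8),
    "ALPHA_SRC_OPT": (18, 16),
    "ALPHA_DST_OPT": (22, 20),
    "ALPHA_COMB_FCN": (26, 24),
}


def decode_blend_opt(value: int) -> dict:
    """Split a 32-bit SX_MRT#_BLEND_OPT value into its named fields."""
    fields = {}
    for name, (high, low) in BLEND_OPT_FIELDS.items():
        width = high - low + 1
        fields[name] = (value >> low) & ((1 << width) - 1)
    return fields


# Example: color uses SRC_COLOR / SRC_ALPHA style options with ADD, alpha differs.
example = (2 << 0) | (4 << 4) | (1 << 8) | (4 << 16) | (5 << 20) | (7 << 24)
print(decode_blend_opt(example))
```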
`
After 0.0 and 1.0 are detected for alpha and color, the above BLEND_OPT tests are executed to get "PreserveSrc", "IgnoreSrc", "PreserveDst", and "IgnoreDst" flags for each of color and alpha. In A0, C0, A1, C1, A means Alpha, C means Color, and 0 means ==0.0 while 1 means ==1.0. For the alpha blend opt tests, color is treated as alpha as well, aliasing 2 and 3 with 4 and 5.

Then combine the Src and Dst of each of these with the COMB_FCNs and CB_COLOR#_INFO.NUMBER_TYPE to get "BlendBypass", "Don't_rd_dst", and "Discard_pixel" flags for each of color and alpha:

COMB_FCN | BlendBypass | Don't_rd_dst | Discard_pixel
OPT_COMB_NONE | Never | Never | Never
OPT_COMB_ADD | (Preserve Src || SRC==0) && Ignore Dst | Ignore Dst | (Ignore Src || SRC==0) && Preserve Dst
OPT_COMB_SUBTRACT | (Preserve Src || SRC==0) && Ignore Dst | (illegible) | Never
OPT_COMB_MIN | (illegible; legible terms: Ignore Dst ||, Preserve Src && SRC==1, SRC==0 && unorm, u/snorm)
OPT_COMB_MAX | (illegible; legible terms: Ignore Dst ||, Preserve Src && SRC==1, SRC==0 && unorm)
OPT_COMB_REVSUBTRACT | SRC==0 && Ignore Dst | Ignore Dst | (Ignore Src || SRC==0) && Preserve Dst
OPT_COMB_BLEND_DISABLED | Always | Always | Never
OPT_COMB_SAFE_ADD | Preserve Src && Ignore Dst | Ignore Dst | Ignore Src && Preserve Dst
`
Add SX_MRT#_BLEND_OPT_CONTROL per state set, which controls when to pay attention to each of the color and alpha optimizations per MRT.
DESCRIPTION: Enables/Disables the BLEND_OPTs per MRT. Program these derived from the TARGET_MASK, BLEND_ENABLE, and FORCE_DST_ALPHA_1.
CB_SHADER_MASK register is used for mapping from shader export to MRT SX_MRT[1-7]_CONTR