`
ORIGINATE DATE: 11 March, 2001
EDIT DATE: [date \@ "d MMMM
DOCUMENT-REV. NUM: GEN-CXXXXX-REVA
PAGE: 1 of 34

Author: Steve Morein
Issue To:
Copy No:

R400 Top Level Specification

ver 0.2

Overview: This replaces the R400 architecture specification

AUTOMATICALLY UPDATED FIELDS
Document Location:
Current Intranet Search Title: R400 Top Level Spec

APPROVALS
Name/Dept          Signature/Date

Remarks:
THIS DOCUMENT CONTAINS INFORMATION THAT COULD BE SUBSTANTIALLY DETRIMENTAL TO THE INTEREST OF ATI TECHNOLOGIES INC. THROUGH UNAUTHORIZED USE OR DISCLOSURE.

"Copyright 2000, ATI Technologies Inc. All rights reserved. The material in this document constitutes an unpublished work created in 2000. The use of this copyright notice is intended to provide notice that ATI owns a copyright in this unpublished work. The copyright notice is not an admission that publication has occurred. This work contains confidential, proprietary information and trade secrets of ATI. No part of this document may be used, reproduced, or transmitted in any form or by any means without the prior written permission of ATI Technologies Inc."
`
`
`AMD1044_0152586
`
`ATI Ex. 2028
`IPR2023-00922
`Page 1 of 34
`
`
`
`
`
Table Of Contents

1. Features
1.1 AGP 8x
1.2 256 Bit Memory Interface
1.3 Unified Processing pipe
1.4 Front end scaling
1.5 Real-Time drawing command ability
1.6 3D Features
1.6.1 Noise Textures
1.6.2 Shadow buffers
1.6.3 Sort Independent Transparency
1.6.4 Anti-Aliasing
1.6.5 Texture compression
1.6.6 Z compression
1.6.7 Texture Filtering
1.6.8 Curved Surface Support
1.6.9 Displacement maps
1.7 High color depth
2. Performance
3. Schedule
4. Process
5. General Chip Operation
5.1 Unified Shader
5.2 3D Rendering
5.3 Real Time Rendering
State Management
Bad Data
Display operation
[illegible entry]
8. Blocks
[illegible entry]
8.1 HBIU - host bus interface unit
8.1.1 Description
8.1.2 Major interfaces
8.1.3 Block diagram
8.2 CP - control processor
8.2.1 Description
8.2.2 Major interfaces
8.2.3 Block diagram
8.3 RBBM - register interface manager
8.3.1 Description
8.3.2 Major interfaces
8.3.3 Block diagram
8.3.4 RBBM operation
8.4 CLK - clock generator
8.4.1 Description
8.4.2 Major interfaces
8.4.3 Block diagram
8.5 TC - test controller
8.5.1 Description
8.5.2 Major interfaces
8.5.3 Block diagram
8.6 VIP - Video input port
8.6.1 Description
8.6.2 Major interfaces
8.6.3 Block diagram
8.7 ROM - boot rom
8.7.1 Description
8.7.2 Major interfaces
8.7.3 Block diagram
8.8 I2C - I2C interface
8.8.1 Description
8.8.2 Major interfaces
8.8.3 Block diagram
8.9 DU - Display
8.9.1 Description
8.9.2 Major interfaces
8.9.3 Block diagram
8.10 MH - Memory Hub
8.10.1 Description
8.10.2 Major interfaces
8.10.3 Block diagram
8.11 HDP - Host Data Path
8.11.1 Description
8.11.2 Major interfaces
8.11.3 Block diagram
8.12 IDCT - Mpeg decoder
8.12.1 Description
8.12.2 Major interfaces
8.12.3 Block diagram
8.13 PA - Primitive Assembly
8.13.1 Description
8.13.2 Major interfaces
8.13.3 Block diagram
8.14 TD - Texture Decompression
8.14.1 Description
8.14.2 Major interfaces
8.14.3 Block diagram
8.15 RE - Raster Engine
8.15.1 Description
8.15.2 Major interfaces
8.15.3 Block diagram
8.16 SP - Shader Pipe
8.16.1 Description
8.16.2 Major interfaces
8.16.3 Block diagram
8.17 TP - Texture Pipe
8.17.1 Description
8.17.2 Major interfaces
8.17.3 Block diagram
8.18 RB - Render Backend
8.18.1 Description
8.18.2 Major interfaces
8.18.3 Block diagram
8.19 MC - Memory Controller
8.19.1 Description
8.19.2 Major interfaces
8.19.3 Block diagram
9. [illegible entry]
9.1 Logic Design
9.1.1 Data formats
9.1.2 Register Bus
9.1.3 Block Communication protocol
9.2 Software
`
Revision Changes:

Rev 0.0 (Steve Morein)
Date: March 11, 2001
Initial revision
Document recreated from earlier documents

Date: March 14, 2001
April 21, 2001

Finally got back to editing
Updated texture path, hopefully this one works
`
Introduction

The R400 will be the high end standalone graphics chip product when it is introduced. It will be followed very rapidly with two variants:
The RV400, aimed at the volume PC space
The R450, aimed at a volume high end market
The targets for the three chips are:
`
`Part
`
`pioehe'clk
`
`Memory
`akiopsiclkk Memory
`bexture
`Clock
`| speed —
`__wikdily
`| fetcheschk
`|
`| Speed
`|
`| July 2002
`
`Ra00|400 MHz 4 16 32 206 | A00OM Re
`
`
`
`
`Now 2002
`PRVa00 | 500 Mrz
`| 4
`ls
`16
`128
`600 Miz
`Feb 2003
`R450
`500 Mrz
`| 8
`16
`4
`2567
`S00 Miz
`
`dic size
`
`Tapeout
`
`
`
1. Features

1.1 AGP 8x
The chip will support the 32 bit AGP interface at speeds up to 8x. I expect that we will need to support AGP 1x and 2x, which require 3.3 V on I/O (AGP 4x is 1.5 V and AGP 8x is 750 mV). AGP fast writes are supported for access to the frame buffer.

Open issue: 64 bit address space support
`
1.2 256 Bit Memory Interface
The R400 and R450 support four memory channels, which can be 32 or 64 bits wide; the maximum memory bus width is a total of 256 bits. The RV400 supports two memory channels and a maximum total width of 128 bits.

All channels need to be configured identically; 1, 2 or 4 channels can be configured.

Memory standards supported
`
`
`Memory type
`vo
`
`SsTi25 135
`[oor
`SSTLI8
`= 18
`ODeinfineon
`
`Elpida
`L18(1.57)
`| Elpida
`
`
`Infineon
`12,10V
`ienfirion e-<diram
`
`Speed
`[100 to 500 MHz
`| 300 to 500 MHz
`| 300 te 400 MHz
`
`| 500 MHz
`
No support for SSTL3.3 or SDRAM (LVTTL = 3.3 V) is planned.
`
1.3 Unified Processing pipe
The most ambitious feature in this design is the "truly unified pipe": a single programmable pipeline is used for 2D, Video, 3D vertex, and 3D pixel operations. The unified pipeline does all of its calculations in 32 bit floating point, the same as the existing vertex transform in previous chips, and the next step in the precision of the color/pixel calculations, which have increased from 8 bits (R100), through 16 bits (R200), to the 20 bits in the R300.

There is an area cost to the unified pipeline since we are forced to go to 32 bit precision for color, when application requirements may need less (22 to 24 bits). However the unified pipeline results in a single math/register structure compared to the separate structures in a more traditional design. It is hoped that by only needing to design the one structure we can make the investment in design time and effort to really optimize the area.

Some of the benefits of merging the pipelines include allowing the vertex operations to do texture fetches (which we could not afford to add logic to the transform pipe to do), a single programming model for both operations, more precision on color than we would normally provide, and the ability to support significantly more registers and instructions in pixel shaders.

One important benefit is load balancing. In the current pipeline, when the app is transform bound the pixel pipeline is idle some significant portion of the time, and when the app is raster bound the transform hardware is idle. The unified pipeline presented here dynamically allocates its processing power between transform and raster.
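
The load balancing argument can be illustrated with a rough model. This is only a sketch: the unit counts and workload numbers below are hypothetical and do not come from this specification. It compares a fixed split of execution units against a single shared pool.

```python
def frame_time_split(vertex_work, pixel_work, vertex_units, pixel_units):
    """Fixed split: the slower of the two dedicated pipelines sets the pace."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def frame_time_unified(vertex_work, pixel_work, total_units):
    """Unified pool: all units can work on whichever kind of work is pending."""
    return (vertex_work + pixel_work) / total_units


# Hypothetical workloads in arbitrary "unit-cycles" (assumed values).
transform_bound = (900, 100)   # (vertex work, pixel work)
raster_bound = (100, 900)

for vertex_work, pixel_work in (transform_bound, raster_bound):
    split = frame_time_split(vertex_work, pixel_work, vertex_units=4, pixel_units=12)
    unified = frame_time_unified(vertex_work, pixel_work, total_units=16)
    print(f"split={split:.1f} unified={unified:.1f}")
```

In this toy model the unified pool is never slower than the fixed split, because no unit sits idle while the other kind of work is queued. Real hardware has scheduling overheads that the model ignores.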
`
1.4 Front end scaling
We will remove the back end scaling capability from the display, and replace it with a non-scaling overlay. This will require us to be able to implement scaling using the unified pipeline. Key features that will need to be supported are large filter kernels, de-interlacing, frame rate conversion, and good support for YUV and color conversion.
`
1.5 Real-Time drawing command ability
To allow for the emulation of backend scaling as well as support new features we need to be able to interrupt the 3D pipe and be able to execute high priority commands with low latency. The point of interruption is in the primitive assembly; the maximum latency will be about the time it takes to render 4096 pixels. Real time commands are inserted into the 3D pipeline after transform, clipping, and setup. Those functions need to be performed by the driver. There are also limits on the number of constant registers available.
`
1.6 3D Features
There are a number of new 3D features we are considering for inclusion. Additional features may be added, and some of these may be dropped.
`
1.6.1 Noise Textures

Perlin style noise is useful for a number of applications. It is generated on chip and consumes no external memory bandwidth. It is also larger than any physical texture can be: 256x256x256 lattice points, and it still has detail when the resolution is 4Kx4Kx4K. There is an opportunity to get this adopted as part of DX.
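
As an illustration of why lattice noise needs no stored texture, here is a minimal software sketch of Perlin style gradient noise with a 256-entry lattice period. The permutation table, gradient set and fade curve are assumptions taken from the commonly published formulation, not from this specification, and say nothing about how the hardware would generate noise.

```python
import math
import random

# Assumed hash table: a fixed pseudo-random permutation of 0..255, doubled
# so that chained lookups never index past the end.
_rng = random.Random(0)
PERM = list(range(256))
_rng.shuffle(PERM)
PERM += PERM

# Twelve edge-direction gradients (a common choice, assumed here).
GRADS = [(1, 1, 0), (-1, 1, 0), (1, -1, 0), (-1, -1, 0),
         (1, 0, 1), (-1, 0, 1), (1, 0, -1), (-1, 0, -1),
         (0, 1, 1), (0, -1, 1), (0, 1, -1), (0, -1, -1)]


def _fade(t):
    """Smooth interpolation weight with zero slope at 0 and 1."""
    return t * t * t * (t * (t * 6 - 15) + 10)


def _lerp(a, b, t):
    return a + t * (b - a)


def _corner(ix, iy, iz, dx, dy, dz):
    """Dot product of the hashed lattice gradient with the offset to the sample."""
    h = PERM[PERM[PERM[ix & 255] + (iy & 255)] + (iz & 255)]
    gx, gy, gz = GRADS[h % 12]
    return gx * dx + gy * dy + gz * dz


def noise3(x, y, z):
    """Gradient noise computed only from the coordinates; nothing is fetched."""
    x0, y0, z0 = math.floor(x), math.floor(y), math.floor(z)
    fx, fy, fz = x - x0, y - y0, z - z0
    u, v, w = _fade(fx), _fade(fy), _fade(fz)

    def c(i, j, k):
        return _corner(x0 + i, y0 + j, z0 + k, fx - i, fy - j, fz - k)

    x00 = _lerp(c(0, 0, 0), c(1, 0, 0), u)
    x10 = _lerp(c(0, 1, 0), c(1, 1, 0), u)
    x01 = _lerp(c(0, 0, 1), c(1, 0, 1), u)
    x11 = _lerp(c(0, 1, 1), c(1, 1, 1), u)
    return _lerp(_lerp(x00, x10, v), _lerp(x01, x11, v), w)
```

Because every value is derived from the sample coordinates and a small table, the function can be evaluated at any resolution between lattice points, which is the property the paragraph above relies on.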
`
`
1.6.2 Shadow buffers

[illegible] shadow volumes. John Carmack is using shadow volumes to generate shadow effects in Doom. Shadow volumes use the stencil buffer to compute whether a pixel is in shadow. [illegible]

Our preferred shadowing method is (will add more detail here later) shadow buffers, which store the nearest depth value at an array of positions, relative to each light. Shadow buffers have two key limitations: very high resolutions are required to avoid aliasing, and traditional mipmap filtering cannot be applied to shadow [illegible]. R400 solves the first problem through a combination of [illegible]
`
1.6.3 Sort Independent Transparency

[illegible] sort independent transparency. [illegible]

First, render the opaque pixels and store translucent pixels into a list in local memory. Second, replay that list multiple times to successively render and remove the backmost translucent pixels until the list is empty. The cost of this technique is relative only to the number of translucent pixels and their degree [illegible], so it will be highly efficient for images that contain a small percentage of translucent pixels.

The two plans are either the dual Z-buffer approach or the approach described [illegible]; need to decide where the [illegible] should be placed.
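
The two-step scheme (store translucent fragments, then repeatedly peel off the backmost one) can be sketched for a single pixel as follows. This is an illustrative software model only. The fragment layout, the "larger depth is farther" convention and the blend equation are assumptions, and the source text for this section is partly illegible.

```python
def resolve_pixel(opaque_color, translucent):
    """Blend translucent fragments over an opaque color, backmost first.

    translucent: list of (depth, color, alpha) in arbitrary submission order;
    a larger depth is assumed to mean farther from the viewer.
    Returns the final color and the number of replay passes used.
    """
    color = opaque_color
    pending = list(translucent)      # the stored list of translucent fragments
    passes = 0
    while pending:
        # One replay pass: find and remove the backmost remaining fragment.
        backmost = max(pending, key=lambda fragment: fragment[0])
        pending.remove(backmost)
        _, src, alpha = backmost
        color = tuple(alpha * s + (1 - alpha) * d for s, d in zip(src, color))
        passes += 1
    return color, passes


fragments = [(1.0, (1.0, 0.0, 0.0), 0.5), (2.0, (0.0, 1.0, 0.0), 0.5)]
print(resolve_pixel((0.0, 0.0, 1.0), fragments))
print(resolve_pixel((0.0, 0.0, 1.0), list(reversed(fragments))))
```

Both calls print the same color, showing that the result does not depend on submission order. The number of passes grows with the number of overlapping translucent fragments, which matches the cost statement above.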
`
`1.6.4 Anti-Aliasing
`Lcdhevnrpandbeta Mah agcherrnay des bepeseab esata ahs~ oe icine eee aneee
`
`Spaced Serve SUIT GIN SIEeT Wells She WEST Ony SONNY Our go
`
1.6.5 Texture compression
To further reduce bandwidth we need to improve texture compression. We need to achieve both better compression than S3TC, and a high enough quality that textures that would lose too much detail with S3TC can be compressed. Both of these goals do not need to be achieved simultaneously on all textures. We also need to look at compression of non-traditional surfaces such as normal maps. Advances here are dependent on the availability of [illegible]; if we are unable to find resources we will support only the S3TC compression currently [illegible].
`
1.6.6 Depth Compression

[illegible] of specifying the plane (possibly with [illegible]) [illegible] number of Z planes (perhaps 16) before falling back to a Z value per sample. R400 will probably also [illegible] depth compression in the fallback case.
`
1.6.7 Texture Filtering
The texture pipes can fetch a 2x2 region from the texture map and filter it.
The data per pixel can either be four eight bit values, two sixteen bit values, or one 32 bit value. All data needs to be fixed point.
Linear filters are completely built in, and it takes 1 cycle for bi-linear, 2 for trilinear, and four for quadra-linear (filtered mip-mapping of volume textures). Variable depth anisotropy is supported in hardware with the texture pipe calculating the number of samples needed. Optionally the pixel shader can calculate the number of samples, and how to increment the texture address, and provide this to the texture pipe.
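
A minimal sketch of the filtering arithmetic on a 2x2 region is shown below. The function names and the use of floating point weights are assumptions made for illustration; the hardware works on fixed point data as stated above. The cycle table simply restates the figures given in the text.

```python
def lerp(a, b, t):
    return a + (b - a) * t


def bilinear(quad, fx, fy):
    """Filter one 2x2 region. quad = (t00, t10, t01, t11); fx, fy in [0, 1]."""
    t00, t10, t01, t11 = quad
    top = lerp(t00, t10, fx)
    bottom = lerp(t01, t11, fx)
    return lerp(top, bottom, fy)


def trilinear(fine_quad, fine_frac, coarse_quad, coarse_frac, lod_frac):
    """Blend bilinear results from two adjacent mip levels."""
    fine = bilinear(fine_quad, *fine_frac)
    coarse = bilinear(coarse_quad, *coarse_frac)
    return lerp(fine, coarse, lod_frac)


# Cycles per filtered sample as stated in the text.
FILTER_CYCLES = {"bilinear": 1, "trilinear": 2, "quadrilinear": 4}

print(bilinear((0, 10, 20, 30), 0.5, 0.5))
print(trilinear((0, 10, 20, 30), (0.5, 0.5), (4, 4, 4, 4), (0.25, 0.25), 0.5))
```

Trilinear filtering is two bilinear operations plus one blend, and quadrilinear is two trilinear operations plus one blend, which is consistent with the cycle counts doubling at each step.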
`
1.6.8 Curved Surface Support
We will support curved surfaces through a combination of vertex shader code and a tessellation engine to generate new vertices.

The tessellation engine generates new vertex indices from an input vertex index array. The new indices contain both the coordinate in parametric space of the vertex, and the indices to the surface, or to data from which the surface can be derived. More information is available in the programming guide.
`
1.6.9 Displacement maps
The tessellation engine for curved surfaces can dice triangles into micropolygons; the vertex shaders for the vertices can then access a displacement map and change the location of the points.
`
1.7 High color depth

We will support a 64 bit color buffer (16:16:16:16). We will support two formats: a fixed point format and a floating point format.
`
[illegible] value and two slopes [illegible], together with a mask that specifies which plane to use at each sample point. Changes from the R100 scheme include a different way of specifying the plane (possibly with more bits per plane), 8x8 tiles instead of 4x4 tiles, and a larger maximum number of Z planes (perhaps 16) before falling back to a Z value per sample. R400 will probably also support min/max depth compression in the fallback case.
2. Performance

The basic performance is

[Table values not recoverable; legible headings: R400, MHz, fill rate (bilinear equivalent), peak tri/sec, bilinear texture]

Under normal conditions, and when not further limited by memory bandwidth, we expect to be > 75% efficient.
`
3. Schedule

R400
RV400
R450
`
4. Process

At the moment this looks like an easy choice: .13 will be in production for over a year, and .10 does not show up until the very end of 2002 according to the TSMC and UMC roadmaps.

We will probably want to be in a flip chip packaging approach to meet power distribution goals. With the 256 bit bus we will have at least 600 signal I/Os (404 in memory). We may be as much as 104 at TY for average power, which will require very good power distribution; area bond flip chip is probably the only option.
`
5. General Chip operation

5.1 Unified Shader

The unified shader is a SIMD/vector engine that performs the same instructions on four sets of four (16 total) elements. For pixel shader operations the elements are pixels, with the sets of four required to be 2x2 footprints. For vertex shader operations the sixteen elements are sixteen vertices. The basic element is a 4 value vector, frequently interpreted as x y z w or r g b a.

The user model for the unified shader is composed of a variable number of general purpose registers, a subset of which are usually initialized with data. An ALU can do simple math, conditional moves, and permutations on the registers, and has the ability to do a limited number of memory reads using the texture cache. The number of registers is variable, and the number of registers required for an operation is specified when the task is submitted to the unified shader. The unified shader will not start the task until there is enough free room for the task's registers.

The unified shader is based on the R300 pixel shader.
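
The rule that a task does not start until its registers are free can be sketched as a simple counter. The class and method names, and the register file size of 64, are hypothetical and chosen only to illustrate the gating behaviour.

```python
class RegisterFile:
    """Tracks free general purpose registers for task admission (illustrative)."""

    def __init__(self, total_registers):
        self.free = total_registers

    def try_start(self, registers_needed):
        """Reserve registers for a task; return False if it must keep waiting."""
        if registers_needed > self.free:
            return False
        self.free -= registers_needed
        return True

    def finish(self, registers_held):
        """A terminating task returns its registers to the pool."""
        self.free += registers_held


rf = RegisterFile(total_registers=64)   # assumed size
print(rf.try_start(48))   # True: 48 of 64 registers reserved
print(rf.try_start(32))   # False: only 16 free, the task waits
rf.finish(48)
print(rf.try_start(32))   # True once the first task has released its registers
```

A task that declares fewer registers can be admitted sooner and leaves room for more tasks in flight, which is the trade-off between per-task register count and latency hiding.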
`
5.2 3D Rendering

For 3D rendering, data is passed twice through the unified shader: once to transform the vertices and a second time to determine the color of the pixels.
`
The input to the 3D pipe is expected to be indexed vertex arrays. Linear vertex arrays can easily be supported by the CP generating sequential indices. Inline vertex data is an open issue; I would prefer to write it to memory and then fetch it as a vertex array rather than add a direct path.
`
The stream of indices is sent to the Primitive Assembly block by the CP. The front of the primitive assembly block maintains the tag for the vertex cache. The vertex cache stores transformed vertices. As misses are detected in the tag, the indices that miss are placed into 16 entry vectors. Each vector contains a state pointer, a pointer to the vertex shader to be used, and the 16 indices to vertices that need to be transformed. When either a vector is filled with 16 vertices or a state change happens (so that the next vertex does not share the state and vertex shader with the previous vertex) the vector is issued to one of the "shader" pipelines for transformation. Which of the four shader pipelines it is issued to is determined either by some effort of load balancing or a simple round robin. All that is submitted to the pixel pipeline is the state, the vertex program, and the indices. The shader pipeline will fetch the vertex array data through the cache infrastructure that is also used for texture fetches. After the tag, the indices (actually now the indices into the vertex cache) are placed into a latency FIFO to hide the latency of transforming the vertices.
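
The batching of missed indices into 16 entry vectors can be sketched as below. Several simplifications are assumptions, not statements from the specification: the tag is modelled as an unbounded set with no eviction, it is keyed on (state, index), a partial vector is flushed at the end of the stream, and issue is plain round robin.

```python
def build_vectors(stream, vector_size=16):
    """Group vertex cache misses into vectors that share one state.

    stream: iterable of (vertex_index, state_id) pairs in submission order.
    Returns a list of (state_id, [indices]) vectors.
    """
    seen = set()            # stand-in for the vertex cache tag (no eviction)
    vectors = []
    current = []
    current_state = None
    for index, state in stream:
        # A state change closes the vector being filled.
        if current and state != current_state:
            vectors.append((current_state, current))
            current = []
        current_state = state
        key = (state, index)
        if key in seen:     # tag hit: the vertex is already transformed
            continue
        seen.add(key)
        current.append(index)
        if len(current) == vector_size:
            vectors.append((current_state, current))
            current = []
    if current:             # assumed: flush the partial vector at end of stream
        vectors.append((current_state, current))
    return vectors


def assign_round_robin(vectors, pipelines=4):
    """Issue vectors to shader pipelines in simple round robin order."""
    return [(n % pipelines, state, indices)
            for n, (state, indices) in enumerate(vectors)]


stream = [(i, 0) for i in range(20)] + [(3, 0), (0, 1), (1, 1)]
for pipe, state, indices in assign_round_robin(build_vectors(stream)):
    print(pipe, state, len(indices))
```

The example produces one full vector of 16, a partial vector of 4 closed by the state change, and a vector of 2 for the new state. The repeated index 3 hits in the tag and is not transformed again.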
`
The shader pipeline receives the vector of 16 indices from the primitive assembly block. The shader pipeline operates, when rendering pixels, by processing a vector of four 2x2 pixel footprints, a total of 16 pixels. For vertex processing each of the pixels is replaced with a vertex. The vertex program includes information on how many local variables it will need. The rasterizer waits until that many local variables are free (as each executing thread in the shader pipeline terminates it frees its local variables). With the proposed shader data path the maximum number of local variables per vertex is 256. However this leaves no ability to hide latency; 16 to 32 local variables will probably maximize latency hiding and therefore performance. The vertex shader program can use all the capabilities of the shader pipeline including texture fetches and dependent lookups. At the end of the vertex program, the transformed coordinates must be output. One output will be the x, y, z, w position, which will be stored in the position cache of the vertex cache. The vertex program may also output a number of parameter values (colors, texture coordinates, other interpolated inputs into the pixel shader). The parameter values must be output as a multiple of four 128 bit words, as the parameter cache is designed for this.
`
The primitive assembly block reads the indices back out of the latency FIFO and accesses the position cache portion of the vertex cache. It assembles the vertices into primitives (lines, triangles, rectangles, quads?, points, ?). Barycentric values are assigned to the vertices, and will be used later in the rasterizer to interpolate the parameters. The parameters are not accessed by the primitive assembly logic, which only works from the position data. The primitive is clipped against both the viewing volume and the user clip planes, with fractional barycentric coordinates assigned to the clipped primitive sections. The primitive goes through the perspective divide and the viewport transform. The resulting screen space primitive is set up (plane equations for 1/w, z, and the barycentric coordinates). The resulting primitive data, including the indices back into the parameter portion of the vertex cache, are broadcast to the four pipes. The final time that an index is output that accesses the oldest vertex cache line, a token is also sent. When all of the four pipelines return the token the primitive assembly block can free that cacheline and allow it to be used for a new vector of vertices. The performance goal in the primitive assembly block is a triangle every two clocks.
An alternative option is for the vertex shader to generate screen coordinates and clip codes. If a primitive needs to be clipped, which cannot be determined until primitive assembly, then the vertices are reverse transformed back into clip space by logic in the primitive assembly block, clipped, and then transformed back into screen space.
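
The use of barycentric values to interpolate vertex parameters can be sketched in 2D as follows. This is a simplified illustration: it works in screen space without perspective correction, and the function names are assumptions.

```python
def barycentric(p, a, b, c):
    """Barycentric weights of point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)   # twice the signed area
    wb = ((px - ax) * (cy - ay) - (cx - ax) * (py - ay)) / area
    wc = ((bx - ax) * (py - ay) - (px - ax) * (by - ay)) / area
    return 1.0 - wb - wc, wb, wc


def interpolate(weights, param_a, param_b, param_c):
    """Weighted sum of one per-vertex parameter (for example a color)."""
    wa, wb, wc = weights
    return tuple(wa * x + wb * y + wc * z
                 for x, y, z in zip(param_a, param_b, param_c))


tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
w = barycentric((1.0, 1.0), *tri)
print(w)
print(interpolate(w, (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)))
```

Each vertex has weight 1 at its own position and 0 at the other two, so any number of parameters can be interpolated from the same three weights. This is why primitive assembly only needs the position data.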
`
To help meet marketing BS numbers we can look into doing backface culling at a rate of one triangle per clock. This will boost us to a peak number of 500 million triangles per second.
`
Each pipe has a FIFO in front of the rasterizer to load balance. Each pipe will handle 16x16 tiles of the screen which are interleaved between the pipes. To maximize the effective size of the FIFO we will probably cull the triangle list before the FIFO. The rasterizer will request the parameter data from the parameter cache for the primitives. A small latency hiding FIFO will hide the latency of the access to the parameter cache. The parameter cache is 512 bits wide, and the interfaces from the parameter cache to the rasterizer are 128 bits wide; this allows the parameter cache to output one pipeline's request per clock, which is serialized over four clocks, keeping all four interfaces busy. The rasterizer keeps a small cache of three to four vertices; this allows only the new parameter to be fetched when adjacent triangles are processed. The parameter cache interface imposes a second performance limit. In the worst case each polygon covers all four pipelines and there are no vertices shared from triangle to triangle. In this case the peak performance is 500 MHz / (4 pipelines * 3 vertices) = 500/12 = 41.6 million triangles per second. In the best case triangles are perfectly stripped and never cross over pipeline boundaries. In this case the peak performance (if we ignore the setup limit) is 500 million triangles per second. As a practical matter we should be able to approach the setup limit of 250 million triangles per second.
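
The three triangle rate figures follow from simple arithmetic, sketched below. The 500 MHz clock is inferred from the 500/12 = 41.6 calculation in the text, and the setup rate uses the stated goal of one triangle every two clocks; the function itself is only an illustration.

```python
def triangle_rate_limits(clock_mhz=500, pipelines=4, vertices_per_triangle=3,
                         clocks_per_setup=2):
    """Return (worst, best, setup) limits in millions of triangles per second."""
    # Worst case: every triangle touches all pipelines and shares no vertices,
    # so each triangle costs pipelines * vertices parameter cache requests.
    worst = clock_mhz / (pipelines * vertices_per_triangle)
    # Best case: perfect strips inside one pipeline need one new vertex,
    # so one parameter cache request, per triangle.
    best = clock_mhz / 1
    # Setup limit: one triangle every clocks_per_setup clocks.
    setup = clock_mhz / clocks_per_setup
    return worst, best, setup


worst, best, setup = triangle_rate_limits()
print(round(worst, 1), best, setup, min(best, setup))
```

The practical ceiling is the smaller of the parameter cache best case and the setup limit, which is why the text expects performance to approach 250 million triangles per second.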
`
`
`
The rasterizer also contains a portion of the hierarchical Z memory. We are looking into moving this into a cache based approach, but that is far from certain at this point. We would like to be able to do hierarchical Z culling at a speed in excess of 64 pixels per clock per pipeline (256 pixels per clock total). We are also going to consider some of the improved latency hierarchical Z options to improve culling efficiency.
`
The rasterizer will generate four pixels per clock if there are no more than eight interpolated parameters. The rasterizer generates vectors of four 2x2 footprints (16 pixels). Each 2x2 footprint must be screen aligned and from the same triangle (with a single shared z slope). The four footprints only need to share the same state and shader program.
`
Before starting the processing of a vector the rasterizer (which includes the sequencer for the shader pipeline) checks to make sure that there are enough free registers in the shader pipeline for the pixel shader program. If not, it stalls until there are enough. The rasterizer also needs to arbitrate between the three streams of vectors to be shaded: the vertex stream, the pixel stream, and the real time stream. I think it will be sufficient for the real time stream to have priority over the vertex stream, which has priority over the pixel stream. This will meet the real-time demands, and keep the vertex cache filled.
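
The proposed fixed priority between the three streams can be sketched as a small arbiter. The stream names and the dictionary interface are hypothetical; only the priority order comes from the text.

```python
# Highest priority first, as proposed in the text.
PRIORITY = ("realtime", "vertex", "pixel")


def arbitrate(pending):
    """Pick the next stream to issue a vector from.

    pending: dict mapping stream name to the number of vectors waiting.
    Returns the chosen stream name, or None when nothing is waiting.
    """
    for stream in PRIORITY:
        if pending.get(stream, 0) > 0:
            return stream
    return None


print(arbitrate({"realtime": 0, "vertex": 2, "pixel": 5}))   # vertex
print(arbitrate({"realtime": 1, "vertex": 2, "pixel": 5}))   # realtime
print(arbitrate({"pixel": 3}))                                # pixel
```

A strict priority scheme like this can starve the pixel stream while vertex work is pending. The text accepts that in exchange for keeping the vertex cache filled.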
`
The vector is then processed by the shader pipeline. We will probably support up to eight sequentially dependent texture fetches (to use the R300 terminology, eight clauses). 16 (8?) textures are supported, but each texture can be accessed multiple times by a single pixel shader, which can provide a different address each time. This is especially useful for complex filters.
`
The output of the pixel shader is the final color of the fragment. The pixel shader may also replace the Z value. Fog and stippling must be done in the pixel shader program.

The render backend does the Z compare, stencil operation and color alpha blend.
`
The texture fetch path has a number of design options. One option is an approach where the local, multiported texture cache is small (1 to 4 KB), contains uncompressed color in a canonical format (32 bits per pixel) and uses 4x2 or 4x4 caching. This is backed up by a large (> 16 KB) L2 cache which also stores uncompressed 8x8 cachelines. The decompression logic lives between the memory controller and the L2 cache.
`
An alternative design uses the L2 cache to contain data in memory format (compressed), which is decompressed as needed to fulfil L1 texture cache misses. This will increase the effective size of the L2. The L2 cache is distributed, with 1/4 of it residing in each memory controller. The texture decompression logic can either be located in each shader pipeline, or exist as a shared block(s) that receive data from all four memory controllers and send the decompressed 4x4 cachelines to each shader pipeline. The unified decompression block will result in better performance, and possibly less area, at the cost of some of the scalability.

Assuming that we chose the L2 in memory controller and the unified decompression logic, the texture path would work as follows:
`
In a four pipeline design there are two texture decompression blocks, one for the "left" texture units in each shader pipeline, and the second for the "right" texture units. In the two pipeline, lower cost, version of the chip only a single decompression pipeline is used, serving the left and right texture units.
`
The L1 texture cache receives a texture request from its shader pipeline. The usual tag and latency FIFO is used to generate the misses. These are sent to the shared texture decompression block, which looks up the texture to find the physical address and then sends the request to the L2 cache in the memory controller. The L2 also has a latency FIFO and tag, and will return the data in order (but there is no order guaranteed between the data returning from each L2). The decompression block has a buffer which is used to place the data from the memory controllers back in order. The decompression logic decompresses the texture and returns, in order, the 4x4 cachelines that the L1 caches are requesting. Most of the compression techniques we are considering are based on an 8x8 tile (or 4x4x4); when necessary the decompression logic will decompress an entire 64 pixel tile and only return the requested 16 pixels to the L1 cache. This will tend to increase the bandwidth between the decompression logic and the L2 cache as 8x8 blocks are repeatedly requested to provide different 4x4 subtiles to the L1. The L2 cache will prevent the repeated reads from going to memory, and we will probably implement an "L0" style cache in front of the L2 to also catch the redundant requests.
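
The buffer that places data from several memory controllers back in request order can be sketched as a small reorder buffer. The class name and the sequential tag scheme are assumptions for illustration; the text only states that such a buffer exists.

```python
class ReorderBuffer:
    """Releases completed requests strictly in the order they were issued."""

    def __init__(self):
        self.next_issue = 0     # tag handed to the next request
        self.next_return = 0    # oldest tag not yet released
        self.done = {}          # completed data waiting for older requests

    def issue(self):
        """Allocate a tag for a new request sent to some L2."""
        tag = self.next_issue
        self.next_issue += 1
        return tag

    def complete(self, tag, data):
        """Record returned data; return everything that is now in order."""
        self.done[tag] = data
        released = []
        while self.next_return in self.done:
            released.append(self.done.pop(self.next_return))
            self.next_return += 1
        return released


rob = ReorderBuffer()
t0, t1, t2 = rob.issue(), rob.issue(), rob.issue()
print(rob.complete(t1, "line B"))   # [] because request 0 is still outstanding
print(rob.complete(t0, "line A"))   # ['line A', 'line B']
print(rob.complete(t2, "line C"))   # ['line C']
```

Each L2 can return its own data in order while the decompression block still sees a single ordered stream, at the cost of buffering results that arrive early.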
`
Each memory controller will have two 64 bit read return buses, one to each of the two decompression blocks; each decompression block drives a separate 128 bit bus to each of the four shader pipelines. This will tend to have better utilization and load balancing than having the memory controller drive a 32 bit bus to the decompression logic in each shader pipeline. While the total number of wires is similar (128 bits per memory controller, 128 bits into each texture cache) we are less likely to leave the texture pipes starved when there is some imbalance.
`
5.3 Real Time Rendering
The real time rendering interface allows primitives to be inserted into the rendering pipeline at a very late stage, therefore providing very low latency. The expected use is for scale blts timed by the display refresh; this suggests a small number of large primitives. We take advantage of this to simplify hardware by forcing the interface to be post setup; a real-time primitive needs to be transformed and set up by software.
`
Real time primitives also do not have access to the state management hardware used by non-real-time 3D commands. A single set of state registers, some constant registers, and one full parameter set is available. The real-time command stream will generally need to wait for the current real-time drawing operation to complete before it can start the next real-time command. The driver can statically allocate some of the physical constant registers to the real-time stream; these are not available to the RBBM for renaming use, are written by the real-time command stream, and are read by the 3D pipe at the direct physical addresses. There are two options for the parameter memory. The parameter memory is not visible to non-real-time commands; for normal operation it is entirely managed by hardware. For real time rendering there will be dedicated space for three vertices, each with sixteen 128 bit interpolants. If the real-time primitive requires more than eight interpolants there will only be enough room for one primitive at a time, even if they need the same state and constants. If less than eight interpolants are needed then there is room to manually double buffer the interpolants, and allow pipelining of primitives. The real time command stream will still need to manually check that the pipeline has finished with the previous primitive before writing new data to the parameter memory for the next primitive, while the pipeline works on the current primitive.
`
For example, a drawing command in a real-time command buffer might look like this:

Wait_for_realtime_pipe_idle                  // make sure no real-time command is in the pipeline
Write state reg m in context 7 with data     // set rendering state for command
Write state reg m in context 7 with data     // set rendering state for command
Write state reg m in context 7 with data     // set rendering state for command
Write state reg m in context 7 with data     // set rendering state for command
Write const reg at physical address k        // write constant register
Write const reg at physical address k+1      // write constant register
Write const reg at physical address k+2      // etc.
Write vertex 0, parameter 0, in re