ORIGINATE DATE: 11 March, 2001
EDIT DATE: {date \@ "d MMMM
DOCUMENT-REV. NUM: GEN-CXXXXX-REVA
PAGE: 1 of 34

Author: Steve Morein

Issue To:

Copy No:

R400 Top Level Specification

ver 0.2

Overview: This replaces the R400 architecture specification
`
`
`
AUTOMATICALLY UPDATED FIELDS
Document Location:
Current Intranet Search Title: R400 Top Level Spec

APPROVALS
Name/Dept:
Signature/Date:

Remarks:
THIS DOCUMENT CONTAINS INFORMATION THAT COULD BE SUBSTANTIALLY DETRIMENTAL TO THE INTEREST OF ATI TECHNOLOGIES INC. THROUGH UNAUTHORIZED USE OR DISCLOSURE.

"Copyright 2000, ATI Technologies Inc. All rights reserved. The material in this document constitutes an unpublished work created in 2000. The use of this copyright notice is intended to provide notice that ATI owns a copyright in this unpublished work. The copyright notice is not an admission that publication has occurred. This work contains confidential, proprietary information and trade secrets of ATI. No part of this document may be used, reproduced, or transmitted in any form or by any means without the prior written permission of ATI Technologies Inc."
`
`
`AMD1044_0152586
`
`ATI Ex. 2028
`IPR2023-00922
`Page 1 of 34
Table Of Contents

1. Features
1.1 AGP 8x
1.2 256 Bit Memory Interface
1.3 Unified Processing pipe
1.4 Front end scaling
1.5 Real-Time drawing command ability
1.6 3D Features
1.6.1 Noise Textures
1.6.2 Shadow buffers
1.6.3 Sort Independent Transparency
1.6.4 Anti-Aliasing
1.6.5 Texture compression
1.6.6 Z compression
1.6.7 Texture Filtering
1.6.8 Curved Surface Support
1.6.9 Displacement maps
1.7 High color depth
2. Performance
3. Schedule
4. Process
5. General Chip Operation
5.1 Unified Shader
5.2 3D Rendering
5.3 Real Time Rendering
State Management
Bad Data
Display operation
[illegible]
8. Blocks
[illegible]
8.1 HBIU - host bus interface unit
8.1.1 Description
8.1.2 Major interfaces
8.1.3 Block diagram
8.2 CP - control processor
8.2.1 Description
8.2.2 Major interfaces
8.2.3 Block diagram
8.3 RBBM - register interface manager
8.3.1 Description
8.3.2 Major interfaces
8.3.3 Block diagram
8.3.4 RBBM operation
8.4 CLK - clock generator
8.4.1 Description
8.4.2 Major interfaces
8.4.3 Block diagram
8.5 TC - test controller
8.5.1 Description
8.5.2 Major interfaces
8.5.3 Block diagram
8.6 VIP - Video input port
8.6.1 Description
8.6.2 Major interfaces
8.6.3 Block diagram
8.7 ROM - boot rom
8.7.1 Description
8.7.2 Major interfaces
8.7.3 Block diagram
8.8 I2C - I2C interface
8.8.1 Description
8.8.2 Major interfaces
8.8.3 Block diagram
8.9 DU - Display
8.9.1 Description
8.9.2 Major interfaces
8.9.3 Block diagram
8.10 MH - Memory Hub
8.10.1 Description
8.10.2 Major interfaces
8.10.3 Block diagram
8.11 HDP - Host Data Path
8.11.1 Description
8.11.2 Major interfaces
8.11.3 Block diagram
8.12 IDCT - Mpeg decoder
8.12.1 Description
8.12.2 Major interfaces
8.12.3 Block diagram
8.13 PA - Primitive Assembly
8.13.1 Description
8.13.2 Major interfaces
8.13.3 Block diagram
8.14 TD - Texture Decompression
8.14.1 Description
8.14.2 Major interfaces
8.14.3 Block diagram
8.15 RE - Raster Engine
8.15.1 Description
8.15.2 Major interfaces
8.15.3 Block diagram
8.16 SP - Shader Pipe
8.16.1 Description
8.16.2 Major interfaces
8.16.3 Block diagram
8.17 TP - Texture Pipe
8.17.1 Description
8.17.2 Major interfaces
8.17.3 Block diagram
8.18 RB - Render Backend
8.18.1 Description
8.18.2 Major interfaces
8.18.3 Block diagram
8.19 MC - Memory Controller
8.19.1 Description
8.19.2 Major interfaces
8.19.3 Block diagram
9. [illegible]
9.1 Logic Design
9.1.1 Data formats
9.1.2 Register Bus
9.1.3 Block Communication protocol
9.2 Software
`
Revision Changes:

Rev 0.0 (Steve Morein)
Date March 11, 2001
Initial revision

Document recreated from earlier documents

Date March 14, 2001
April 21, 2001

Finally got back to editing a
Updated texture path, hopefully this one works
`
Introduction

The R400 will be the high end standalone graphics chip product when it is introduced.
It will be followed very rapidly with two variants:
The RV400, aimed at the volume PC space
The R450, aimed at a volume high end market
The targets for the three chips are:
`
Part   Tapeout    Clock speed  pixels/clk  alu ops/clk  texture fetches/clk  Memory width  Memory speed  die size
R400   July 2002  400 MHz      4           16           32                   256           [illegible]
RV400  Nov 2002   500 MHz      4           [illegible]  16                   128           600 MHz
R450   Feb 2003   500 MHz      8           16           4                    256           500 MHz
`
`
`
`
1. Features

1.1 AGP 8x
The chip will support the 32 bit AGP interface at speeds up to 8x. I expect that we will need to support AGP 1x and 2x, which require 3.3 V on I/O (AGP 4x is 1.5 V and AGP 8x is 750 mV). AGP fast writes are supported for access to the frame buffer.

Open issue: 64 bit address space support

1.2 256 Bit Memory Interface
The R400 and R450 support four memory channels, which can be 32 or 64 bits wide; the maximum memory bus width is a total of 256 bits. The RV400 supports two memory channels and a maximum total width of 128 bits.

All channels need to be configured identically; 1, 2 or 4 channels can be configured.
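The channel rules above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the design: the function name `total_bus_width` and the `max_channels` parameter are illustrative assumptions, and it encodes only the constraints stated in this section (identical channels, 32 or 64 bits each, 1, 2 or 4 channels).

```python
def total_bus_width(channels: int, channel_bits: int, max_channels: int = 4) -> int:
    """Return the total memory bus width in bits for identically configured channels.

    max_channels is 4 for R400/R450 and 2 for RV400 (hypothetical parameter name).
    """
    if channels not in (1, 2, 4):
        raise ValueError("only 1, 2 or 4 channels can be configured")
    if channels > max_channels:
        raise ValueError("this part does not have that many channels")
    if channel_bits not in (32, 64):
        raise ValueError("each channel is 32 or 64 bits wide")
    return channels * channel_bits


r400_max = total_bus_width(4, 64)                    # four 64 bit channels
rv400_max = total_bus_width(2, 64, max_channels=2)   # two 64 bit channels
```

With four 64 bit channels the result is the 256 bit maximum quoted for R400 and R450; with two it is the 128 bit maximum quoted for RV400.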
`
Memory standards supported

[Table: memory type, I/O standard and speed. Legible entries include SSTL25 and SSTL18 I/O, Infineon and Elpida parts, Infineon e-dram, and speeds of 100 to 500 MHz, 300 to 500 MHz, 300 to 400 MHz and 500 MHz. The row pairing is not recoverable.]
`
No support for SSTL3.3, or SDRAM (LVTTL = 3.3 V) is planned.
`
1.3 Unified Processing pipe
The most ambitious feature in this design is the "truly unified pipe"; a single programmable pipeline is used for 2D, Video, 3D vertex, and 3D pixel operations. The unified pipeline does all of its calculations in 32 bit floating point, the same as the existing vertex transform in previous chips, and the next step in the precision of the color/pixel calculations, which have increased from 8 bits (R100), through 16 bits (R200), to the 20 bits in the R300.

There is an area cost to the unified pipeline since we are forced to go to 32 bit precision for color, when application requirements may need less (22 to 24 bits). However the unified pipeline results in a single math/register structure compared to the separate structures in a more traditional design. It is hoped that by only needing to design the one structure we can make the investment in design time and effort to really optimize the area.

Some of the benefits to merging the pipelines include allowing the vertex operations to do texture fetches, which we could not afford to add logic to the transform pipe to do, a single programming model for both operations, more precision on color than we would normally provide, and the ability to support significantly more registers and instructions in pixel shaders.

One important benefit is load balancing. In the current pipeline when the app is transform bound the pixel pipeline is idle some significant portion of the time, and when the app is raster bound the transform hardware is idle. The unified pipeline presented here dynamically allocates its processing power between transform and raster.
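The load balancing argument can be illustrated with a toy model. This is only a sketch under strong assumptions that are not in the spec: work is measured in abstract units, every processing unit retires one unit of work per time step, and the unified pool is shared perfectly with no scheduling overhead. The function names are hypothetical.

```python
def frame_time_split(vertex_work, pixel_work, vertex_units, pixel_units):
    """Fixed-function split: the slower stage bounds the frame, the other idles."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def frame_time_unified(vertex_work, pixel_work, total_units):
    """Unified pool: all units work on whatever is pending (idealized)."""
    return (vertex_work + pixel_work) / total_units


# A transform-bound frame: three times as much vertex work as pixel work.
split = frame_time_split(600, 200, vertex_units=4, pixel_units=4)
unified = frame_time_unified(600, 200, total_units=8)
```

In this made-up transform-bound case the split design takes 150 time steps because the pixel units sit idle, while the idealized unified pool takes 100. Real gains depend on how well the allocation tracks the workload.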
`
1.4 Front end scaling
We will remove the back end scaling capability from the display, and replace it with a non-scaling overlay. This will require us to be able to implement scaling using the unified pipeline. Key features that will need to be supported are large filter kernels, de-interlacing, frame rate conversion, and good support for YUV and color conversion.
`
1.5 Real-Time drawing command ability
To allow for the emulation of backend scaling as well as support new features we need to be able to interrupt the 3D pipe and be able to execute high priority commands with low latency. The point of interruption is in the primitive assembly; the maximum latency will be about the time it takes to render 4096 pixels. Real time commands are inserted into the 3D pipeline after transform, clipping, and setup. Those functions need to be performed by the driver. There are also limits on the number of constant registers available.
`
1.6 3D Features
There are a number of new 3D features we are considering for inclusion. Additional features may be added, and some of these may be dropped.

1.6.1 Noise Textures

Perlin style noise is useful for a number of applications. It is generated on chip and consumes no external memory bandwidth. It is also larger than any physical texture can be: 256x256x256 lattice points, and still has detail when the resolution is 4Kx4Kx4K. There is an opportunity to get this adopted as part of DX.
`
`
1.6.2 Shadow buffers

[illegible] shadow volumes. John Carmack is using shadow volumes to generate shadow effects in Doom. [illegible]

Our preferred shadowing method is (will add more detail here later) shadow buffers, which store the nearest depth value at an array of positions, relative to each light. Shadow buffers have two key limitations: very high resolutions are required to avoid aliasing, and traditional mipmap filtering cannot be applied to shadow [illegible]. R400 solves the first problem through a combination of [illegible]
`
`
1.6.3 Sort Independent Transparency

[illegible]

First, render the opaque pixels and store translucent pixels into a list in local memory. Second, replay that list multiple times to successively render and remove the backmost translucent pixels until the list is empty. The cost of this technique is relative only to the number of translucent pixels and their degree, so it will be highly efficient for images that contain a small percentage of translucent pixels.

The two plans are either the dual Z-buffer approach or the approach described in [illegible]; we need to decide where the [illegible] should be placed.
`
1.6.4 Anti-Aliasing
[illegible]
`
1.6.5 Texture compression
To further reduce bandwidth we need to improve texture compression. We need to achieve both better compression than S3TC, and have a high enough quality that textures that would lose too much detail with S3TC can be compressed. Both of these goals do not need to be achieved simultaneously on all textures. We also need to look at compression of non-traditional surfaces such as normal maps. Advances here are dependent on the availability of [illegible]; if we are unable to find resources we will support only the current S3TC compression.
`
1.6.6 Depth Compression

[illegible] value and two slopes, together with a mask that specifies which plane is used at each sample. Changes from the R100 scheme include a different way of specifying the plane (possibly with more bits per Z plane), 8x8 tiles instead of 4x4 tiles, and a larger maximum number of Z planes (perhaps 16) before falling back to a Z value per sample. R400 will probably also support min/max depth compression in the fallback case.

1.6.7 Texture Filtering
The texture pipes can fetch a 2x2 region from the texture map and filter it.
The data per pixel can either be four eight bit values, two sixteen bit values, or one 32 bit value. All data needs to be fixed point.
Linear filters are completely built in, and it takes 1 cycle for bi-linear, 2 for trilinear, four for quadra-linear (filtered mip-mapping of volume textures). Variable depth anisotropy is supported in hardware with the texture pipe calculating the number of samples needed. Optionally the pixel shader can calculate the number of samples, and how to increment the texture address, and provide this to the texture pipe.

1.6.8 Curved Surface Support
We will support curved surfaces through a combination of vertex shader code and a tessellation engine to generate new vertices.

The tessellation engine generates new vertex indices from an input vertex index array. The new indices contain both the coordinate in parametric space of the vertex, and the indices to the surface, or to data from which the surface can be derived. More information is available in the programming guide.

1.6.9 Displacement maps
The tessellation engine for curved surfaces can dice triangles into micropolygons; the vertex shaders for the vertices can then access into a displacement map and change the location of the points.

1.7 High color depth

We will support a 64 bit color buffer (16:16:16:16). We will support two formats: sRGB64 and a floating point format.
2. Performance

The basic performance is

[Table: R400 performance. Column headings: MHz, fill rate, bi-linear equivalent, peak tri/sec, bilinear texture. No values survive.]

Under normal conditions, and when not further limited by memory bandwidth, we expect to be > 75% efficient.

3. Schedule

[Table: schedule rows for R400, RV400 and R450. No values survive.]
`
`
4. Process

At the moment this looks like an easy choice: .13 will be in production for over a year, and .10 does not show up until the very end of 2002 according to the TSMC and UMC roadmaps.

We will probably want to be in a flip chip packaging approach to meet power distribution goals. With the 256 bit bus we will have at least 600 signal I/Os (404 in memory). We may be as much as 10A at 1V for average power, which will require very good power distribution; area bond flip chip is probably the only option.

5. General Chip operation
`
5.1 Unified Shader

The unified shader is a simd/vector engine that performs the same instructions on four sets of four (16 total) elements. For pixel shader operations the elements are pixels, with the sets of four required to be 2x2 footprints. For vertex shader operations the sixteen elements are sixteen vertices. The basic element is a 4 value vector, frequently interpreted as x y z w or r g b a.

The user model for the unified shader is composed of a variable number of general purpose registers, a subset of which are usually initialized with data. An ALU can do simple math, conditional moves, and permutations on the registers, and has the ability to do a limited number of memory reads using the texture cache. The number of registers is variable, and the number of registers required for an operation is specified when the task is submitted to the unified shader. The unified shader will not start the task until there is enough free room for the task's registers.
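The register reservation rule in the previous paragraph can be sketched as a small software model. This is an illustration only: the class name `RegisterPool`, the in-order start policy and the register counts used in the usage lines are assumptions, not details from this spec.

```python
from collections import deque


class RegisterPool:
    """Toy model: a task starts only when its declared register count is free."""

    def __init__(self, total_registers):
        self.total = total_registers
        self.free = total_registers
        self.waiting = deque()   # (task_id, registers_needed), in submission order
        self.running = {}        # task_id -> registers held

    def submit(self, task_id, registers_needed):
        """Queue a task; returns the ids of any tasks that could start."""
        if registers_needed > self.total:
            raise ValueError("task can never fit in the register pool")
        self.waiting.append((task_id, registers_needed))
        return self._start_ready()

    def finish(self, task_id):
        """Release a finished task's registers; returns newly started task ids."""
        self.free += self.running.pop(task_id)
        return self._start_ready()

    def _start_ready(self):
        started = []
        # Assumption: tasks start strictly in order, so a large task at the
        # head of the queue blocks smaller ones behind it.
        while self.waiting and self.waiting[0][1] <= self.free:
            task_id, need = self.waiting.popleft()
            self.free -= need
            self.running[task_id] = need
            started.append(task_id)
        return started


pool = RegisterPool(64)
first = pool.submit("a", 48)    # fits, starts immediately
second = pool.submit("b", 32)   # only 16 free, must wait
third = pool.finish("a")        # frees 48, so "b" can start
```

An out-of-order start policy would keep more of the pool busy at the cost of extra bookkeeping; the text does not say which is intended.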
`
`The unified shader is based on the R300 pixel shader
`
5.2 3D Rendering

For 3D rendering data is passed twice through the unified shader: once to transform the vertices and a second time to determine the color of the pixels.
`
`
The input to the 3D pipe is expected to be indexed vertex arrays. Linear vertex arrays can easily be supported by the CP generating sequential indices. Inline vertex data is an open issue; I would prefer to write it to memory and then fetch it as a vertex array rather than add a direct path.

The stream of indices is sent to the Primitive Assembly block by the CP. The front of the primitive assembly block maintains the tag for the vertex cache. The vertex cache stores transformed vertices. As misses are detected in the tag, the indices that miss are placed into 16 entry vectors. Each vector contains a state pointer, a pointer to the vertex shader to be used, and the 16 indices to vertices that need to be transformed. When either a vector is filled with 16 vertices or a state change happens (so that the next vertex does not share the state and vertex shader with the previous vertex) the vector is issued to one of the "shader" pipelines for transformation. Which of the four shader pipelines it is issued to is determined either by some effort of load balancing or a simple round robin. All that is submitted to the pixel pipeline is the state, the vertex program, and the indices. The shader pipeline will fetch the vertex array data through the cache infrastructure that is also used for texture fetches. After the tag the indices (actually now the indices into the vertex cache) are placed into a latency FIFO to hide the latency of transforming the vertices.
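The vector-building rule described above (issue on 16 entries or on a state change) can be sketched in software as follows. This is a minimal model under assumptions: the function names and the tuple layout are hypothetical, a partial vector is flushed at the end of the stream, and round robin is used for pipeline selection, which is only one of the two options the text mentions.

```python
VECTOR_SIZE = 16
SHADER_PIPELINES = 4


def batch_misses(misses):
    """Group vertex cache misses into vectors of up to 16 indices.

    misses: iterable of (state_ptr, shader_ptr, index) in stream order.
    Returns a list of ((state_ptr, shader_ptr), [indices]) vectors.
    """
    vectors = []
    current = []
    current_key = None

    def flush():
        nonlocal current
        if current:
            vectors.append((current_key, current))
            current = []

    for state_ptr, shader_ptr, index in misses:
        key = (state_ptr, shader_ptr)
        if key != current_key:
            flush()                  # state change: issue the partial vector
            current_key = key
        current.append(index)
        if len(current) == VECTOR_SIZE:
            flush()                  # vector is full: issue it
    flush()                          # assumption: flush leftovers at end of stream
    return vectors


def assign_round_robin(vectors):
    """Pair each vector with a shader pipeline number, simple round robin."""
    return [(i % SHADER_PIPELINES, v) for i, v in enumerate(vectors)]


stream = [(0, 0, i) for i in range(20)] + [(1, 0, i) for i in range(3)]
issued = assign_round_robin(batch_misses(stream))
```

For this stream the model issues a full vector of 16, then a partial vector of 4 when the state changes, then a final partial vector of 3.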
`
The shader pipeline receives the vector of 16 indices from the primitive assembly block. The shader pipeline operates, when rendering pixels, by processing a vector of four 2x2 pixel footprints, a total of 16 pixels. For vertex processing each of the pixels is replaced with a vertex. The vertex program includes information of how many local variables it will need. The rasterizer waits until that many local variables are free (as each executing thread in the shader pipeline terminates it frees its local variables). With the proposed shader data path the maximum number of local variables per vertex is 256. However this leaves no ability to hide latency; 16 to 32 local variables will probably maximize latency hiding and therefore performance. The vertex shader program can use all the capabilities of the shader pipeline including texture fetches and dependent lookups. At the end of the vertex program, the transformed coordinates must be output. One output will be the x, y, z, w position, which will be stored in the position cache of the vertex cache. The vertex program may also output a number of parameter values (colors, texture coordinates, other interpolated inputs into the pixel shader). The parameter values must be output as a multiple of four 128 bit words, as the parameter cache is designed for this.
`
The primitive assembly block reads the indices back out of the latency FIFO and accesses the position cache portion of the vertex cache. It assembles the vertices into primitives (lines, triangles, rectangles, quads?, points, ?). Barycentric values are assigned to the vertices, and will be used later in the rasterizer to interpolate the parameters. The parameters are not accessed by the primitive assembly logic, which only works from the position data. The primitive is clipped against both the viewing volume as well as user clip planes, with fractional barycentric coordinates assigned to the clipped primitive sections. The primitive goes through the perspective divide and the viewport transform. The resulting screen space primitive is set up (plane equations for 1, z, and the barycentric coordinates). The resulting primitive data, including the indices back into the parameter portion of the vertex cache, are broadcast to the four pipes. The final time that an index is output that accesses the oldest vertex cache line, a token is also sent. When all of the four pipelines return the token the primitive assembly block can free that cacheline and allow it to be used for a new vector of vertices. The performance goal in the primitive assembly block is a triangle every two clocks.
An alternative option is for the vertex shader to generate screen coordinates and clip codes. If a primitive needs to be clipped, which can not be determined until primitive assembly, then the vertices are reverse transformed back into clip space by logic in the primitive assembly block, clipped, and then transformed back into screen space.
`
To help meet marketing BS numbers we can look into doing backface culling at a rate of one triangle per clock. This will boost us to a peak number of 500 million triangles per second.

Each pipe has a FIFO in front of the rasterizer to load balance. Each pipe will handle 16x16 tiles of the screen which are interleaved between the pipes. To maximize the effective size of the FIFO we will probably cull the triangle list before the FIFO. The rasterizer will request the parameter data from the parameter cache for the primitives. A small latency hiding FIFO will hide the latency of the access to the parameter cache. The parameter cache is 512 bits wide, and the interfaces from the parameter cache to the rasterizer are 128 bits wide; this allows the parameter cache to output one pipeline's request per clock, which is serialized over four clocks, keeping all four interfaces busy. The rasterizer keeps a small cache of three to four vertices; this allows only the new parameter to be fetched when adjacent triangles are processed. The parameter cache interface imposes a second performance limit. In the worst case each polygon covers all four pipelines and there are no vertices shared from triangle to triangle. In this case the peak performance is (500 MHz / (4 pipelines * 3 vertices)) = (500/12) = 41.6 million triangles per second. In the best case triangles are perfectly stripped and never cross over pipeline boundaries. In this case the peak performance (if we ignore the setup limit) is 500 million triangles per second. As a practical matter we should be able to approach the setup limit of 250 million triangles per second.
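The three triangle-rate limits quoted above follow from simple division. The sketch below reproduces them; the 500 MHz clock is an assumption, chosen because it is the value that yields the 41.6, 500 and 250 million figures in the text, and the constant and function names are illustrative.

```python
CLOCK_MHZ = 500                  # assumed clock, see the note above
PIPELINES = 4
VERTICES_PER_TRIANGLE = 3
SETUP_CLOCKS_PER_TRIANGLE = 2    # primitive assembly goal: a triangle every two clocks


def worst_case_mtris():
    """Every triangle covers all four pipes and shares no vertices.

    The parameter cache serves one pipeline request per clock, so each
    triangle costs 4 pipelines * 3 vertices = 12 clocks.
    """
    return CLOCK_MHZ / (PIPELINES * VERTICES_PER_TRIANGLE)


def best_case_mtris():
    """Perfect strips inside one pipe: one new vertex fetch per triangle."""
    return CLOCK_MHZ / 1


def setup_limit_mtris():
    """Limit imposed by primitive assembly and setup."""
    return CLOCK_MHZ / SETUP_CLOCKS_PER_TRIANGLE


limits = (worst_case_mtris(), best_case_mtris(), setup_limit_mtris())
```

The worst case works out to about 41.67 million triangles per second, which the text truncates to 41.6.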
`
`
`
The rasterizer also contains a portion of the hierarchical Z memory. We are looking into moving this into a cache based approach, but that is far from certain at this point. We would like to be able to do hierarchical Z culling at a speed in excess of 64 pixels per clock per pipeline (256 pixels per clock total). We are also going to consider some of the improved latency hierarchical Z options to improve culling efficiency.

The rasterizer will generate four pixels per clock if there are no more than eight interpolated parameters. The rasterizer generates vectors of four 2x2 footprints (16 pixels). Each 2x2 footprint must be screen aligned and from the same triangle (with a single shared z slope). The four footprints only need to share the same state and shader program.

Before starting the processing of a vector the rasterizer (which includes the sequencer for the shader pipeline) checks to make sure that there are enough free registers in the shader pipeline for the pixel shader program. If not, it stalls until there are enough. The rasterizer also needs to arbitrate between the three streams of vectors to be shaded: the vertex stream, the pixel stream, and the real time stream. I think it will be sufficient for the real time stream to have priority over the vertex stream, which has priority over the pixel stream. This will meet the real-time demands, and keep the vertex cache filled.
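The fixed priority proposed above (real time over vertex over pixel) can be sketched as a strict-priority arbiter. This is an illustrative model only: the stream names, the `arbitrate` function and the use of software queues are assumptions, and the register-availability stall described in the same paragraph is left out.

```python
from collections import deque

# Highest priority first, as proposed in the text.
PRIORITY = ("real_time", "vertex", "pixel")


def arbitrate(queues):
    """Pick the next vector to shade.

    queues: dict mapping each stream name in PRIORITY to a deque of vectors.
    Returns (stream_name, vector), or None if every stream is empty.
    """
    for stream in PRIORITY:
        pending = queues[stream]
        if pending:
            return stream, pending.popleft()
    return None


queues = {
    "real_time": deque(["r0"]),
    "vertex": deque(["v0"]),
    "pixel": deque(["p0", "p1"]),
}
order = []
while (choice := arbitrate(queues)) is not None:
    order.append(choice[1])
```

Strict priority can starve the pixel stream if vertex work never drains; in this pipeline the vertex stream is bounded by the vertex cache, which is presumably why the simple scheme is thought to be sufficient.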
`
The vector is then processed by the shader pipeline. We will probably support up to eight sequentially dependent texture fetches (to use the R300 terminology, eight clauses). 16 (8?) textures are supported, but each texture can be accessed multiple times by a single pixel shader, which can provide a different address each time. This is especially useful for complex filters.

The output of the pixel shader is the final color of the fragment. The pixel shader may also replace the Z value. Fog and stippling must be done in the pixel shader program.

The render backend does the Z compare, stencil operation and color alpha blend.
`
The texture fetch path has a number of design options. One option is an approach where the local, multiported, texture cache is small (1 to 4 KB), and contains uncompressed color in a canonical format (32 bits per pixel) and uses a 4x2 or 4x4 caching. This is backed up by a large (> 16KB) L2 cache which also stores uncompressed 8x8 cachelines. The decompression logic lives between the memory controller and the L2 cache.

An alternative design uses the L2 cache to contain data in memory format (compressed), which is decompressed as needed to fulfill L1 texture cache misses. This will increase the effective size of the L2. The L2 cache is distributed, with 1/4 of it residing in each memory controller. The texture decompression logic can either be located in each shader pipeline, or exist as a shared block(s) that receive data from all four memory controllers and send the decompressed 4x4 cachelines to each shader pipeline. The unified decompression block will result in better performance, and possibly less area, at the cost of some of the scalability.

Assuming that we choose the L2 in memory controller and the unified decompression logic, the texture path would work as follows:

In a four pipeline design there are two texture decompression blocks, one for the "left" texture units in each shader pipeline, and the second for the "right" texture units. In the two pipeline, lower cost, version of the chip only a single decompression pipeline is used, serving the left and right texture units.

The L1 texture cache receives a texture request from its shader pipeline. The usual tag and latency FIFO is used to generate the misses. These are sent to the shared texture decompression block, which looks up the texture to find the physical address and then sends the request to the L2 cache in the memory controller. The L2 also has a latency FIFO and tag, and will return the data in order (but there is no order guaranteed between the data returning from each L2). The decompression block has a buffer which is used to place the data from the memory controllers back in order. The decompression logic decompresses the texture and returns, in order, the 4x4 cachelines that the L1 caches are requesting. Most of the compression techniques we are considering are based on an 8x8 tile (or 4x4x4); when necessary the decompression logic will decompress an entire 64 pixel tile and only return the requested 16 pixels to the L1 cache. This will tend to increase the bandwidth between the decompression logic and the L2 cache as 8x8 blocks are repeatedly requested to provide different 4x4 subtiles to the L1. The L2 cache will prevent the repeated reads from going to memory, and we will probably implement an "L0" style cache in front of the L2 to also catch the redundant requests.
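The step where a whole 64 pixel tile is decompressed but only 16 pixels are returned can be sketched as a quadrant selection. This is an illustration under assumptions: a decompressed tile is modelled as an 8x8 list of rows, the four 4x4 subtiles are addressed by a (sub_x, sub_y) pair, and the function name is hypothetical; the spec does not define the actual addressing.

```python
TILE = 8        # decompressed tile is 8x8 pixels
SUBTILE = 4     # L1 cacheline is 4x4 pixels


def extract_subtile(tile, sub_x, sub_y):
    """Return the 4x4 cacheline at quadrant (sub_x, sub_y) of an 8x8 tile.

    tile: list of 8 rows, each a list of 8 pixel values.
    sub_x, sub_y: 0 or 1, selecting the left/right and top/bottom half.
    """
    if sub_x not in (0, 1) or sub_y not in (0, 1):
        raise ValueError("an 8x8 tile has only four 4x4 subtiles")
    rows = tile[sub_y * SUBTILE:(sub_y + 1) * SUBTILE]
    return [row[sub_x * SUBTILE:(sub_x + 1) * SUBTILE] for row in rows]


# Stand-in for a decompressed tile: each pixel holds its own linear offset.
tile = [[y * TILE + x for x in range(TILE)] for y in range(TILE)]
top_right = extract_subtile(tile, 1, 0)
bottom_left = extract_subtile(tile, 0, 1)
```

Each call discards 48 of the 64 decompressed pixels, which is why the text expects repeated requests for the same compressed block when neighbouring 4x4 cachelines miss in the L1.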
`
Each memory controller will have two 64 bit read return buses, one to each of the two decompression blocks; each decompression block drives a separate 128 bit bus to each of the four shader pipelines. This will tend to have better utilization and load balancing than having the memory controller drive a 32 bit bus to the decompression logic in each shader pipeline. While the total number of wires is similar (128 bits per memory controller, 128 bits into each texture cache) we are less likely to leave the texture pipes starved when there is some imbalance.
`
`5.3 Real Time Rendering
`The real time remdenng interface allows primitives to be inserted into the rendering pipeline at a very late stage
`therefore prowiding very kw ketency. The expected use is for scale bits timed by the display refresh.
`this suggests a
`smal number of lange prientives. We take advantage of this to simplity hardware by forcing the terface to be post
`setup, a real-time primitive needs 19 be transficemed and setup by sofware,
`
`Real-time primitives also do not have access to the state management hardware used by non-real-time 3D
`commands. A single set of state registers, some constant registers, and one full parameter set is available. The real-
`time command stream will generally need to wait for the current real-time drawing operation to complete before it can
`start the next real-time command. The driver can statically allocate some of the physical constant registers to the real-
`time stream; these are not available to the RBGM for renaming use, and are written by the real-time command stream
`and read by the 3D pipe at the direct physical addresses. There are two options for the parameter memory. The
`parameter memory is not visible to non-real-time commands; for normal operation it is entirely managed by hardware.
`For real time rendering there will be dedicated space for three vertices, each with sixteen 128 bit interpolants. If the
`real-time primitive requires more than eight interpolants there will only be enough room for one primitive at a time,
`even if successive primitives need the same state and constants. If fewer than eight interpolants are needed then there
`is room to manually double buffer the interpolants and allow pipelining of primitives. The real time command stream
`will still need to manually check that the pipeline has finished with the previous primitive before writing new data to
`the parameter memory for the next primitive, while the pipeline works on the current primitive.
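
The capacity rule in the paragraph above can be sketched as a small calculation. The sizes (three vertices, sixteen 128 bit interpolants each) are taken from the text; the function names are hypothetical, and treating exactly eight interpolants as fitting twice is an assumption based on the arithmetic, since the text only covers "more than eight" and "fewer than eight".

```python
# Hypothetical model of the dedicated real-time parameter space described above.
VERTICES = 3                   # dedicated space for three vertices
INTERPOLANTS_PER_VERTEX = 16   # sixteen interpolant slots per vertex
INTERPOLANT_BITS = 128         # each interpolant is 128 bits wide


def parameter_space_bytes():
    """Total dedicated real-time parameter storage, in bytes."""
    return VERTICES * INTERPOLANTS_PER_VERTEX * INTERPOLANT_BITS // 8


def primitives_resident(interpolants_needed):
    """Number of primitives whose interpolants can be resident at once (1 or 2).

    Assumption: exactly eight interpolants is treated as double-bufferable,
    because two sets of eight fill the sixteen slots. The text does not say.
    """
    if not 1 <= interpolants_needed <= INTERPOLANTS_PER_VERTEX:
        raise ValueError("interpolant count must be between 1 and 16")
    # The text only describes double buffering, so the result is capped at 2.
    return 2 if 2 * interpolants_needed <= INTERPOLANTS_PER_VERTEX else 1


if __name__ == "__main__":
    print(parameter_space_bytes())   # 768
    print(primitives_resident(4))    # 2: room to double buffer
    print(primitives_resident(12))   # 1: one primitive at a time
```

Even when two primitives fit, the command stream still has to confirm that the pipeline has finished with the older one before overwriting its half of the space.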
`
`For example, a drawing command in a real-time command buffer might look like this:
`
`Wait_for_realtime_pipe_idle                 // make sure no realtime command is in the pipeline
`Write state reg m in context 7 with data    // set rendering state for command
`Write state reg m in context 7 with data    // set rendering state for command
`Write state reg m in context 7 with data    // set rendering state for command
`Write state reg m in context 7 with data    // set rendering state for command
`Write const reg at physical address k       // write constant register
`Write const reg at physical address k+1     // write constant register
`Write const reg at physical address k+2     // etc.
`Write vertex 0, parameter 0, in re
