R400 Top Level Specification
ver 0.2
Overview: This replaces the R400 architecture specification.
R400 Top Level Spec
11 March, 2001
4 September, 2015
8.3.2 Major interfaces
8.3.3 Block diagram
8.3.4 RBBM operation
8.4 CLK - clock generator
8.4.1 Description
8.4.2 Major interfaces
8.4.3 Block diagram
8.5 [illegible] - test [illegible]
8.5.1 Description
8.5.2 Major interfaces
8.5.3 Block diagram
8.6 VIP - Video input port
8.6.1 Description
8.6.2 Major interfaces
8.6.3 Block diagram
8.7 ROM - Boot ROM
8.7.1 Description
8.7.2 Major interfaces
8.7.3 Block diagram
8.8 [illegible]
8.8.1 Description
8.8.2 Major interfaces
8.8.3 Block diagram
8.9 [illegible] - Display
8.9.1 Description
8.9.2 Major interfaces
8.9.3 Block diagram
8.10 [illegible] - Memory [illegible]
8.10.1 Description
8.10.2 Major interfaces
8.10.3 Block diagram
8.11 HDP - Host Data Path
8.11.1 Description
8.11.2 Major interfaces
8.11.3 Block diagram
8.12 IDCT - MPEG decoder
8.12.1 Description
8.12.2 Major interfaces
8.12.3 Block diagram
8.13 PA - Primitive Assembly
8.13.1 Description
8.13.2 Major interfaces
8.13.3 Block diagram
8.14 TD - Texture Decompression
8.14.1 Description
8.14.2 Major interfaces
8.14.3 Block diagram
8.15 RE - Raster Engine
8.15.1 Description
8.15.2 Major interfaces
8.15.3 Block diagram
8.16 SP - Shader Pipe
8.16.1 Description
8.16.2 Major interfaces
8.16.3 Block diagram
8.17 TP - Texture Pipe
8.17.1 Description
8.17.2 Major interfaces
8.17.3 Block diagram
8.18 RB - Render Backend
8.18.1 Description
8.18.2 Major interfaces
8.18.3 Block diagram
8.19 MC - Memory Controller
8.19.1 Description
8.19.2 Major interfaces
8.19.3 Block diagram
Logic Design
Data formats
Register Bus
Block Communication protocol
[illegible]
Revision Changes:
Rev 0.0 (Steve Morein)
Date: March 11, 2001
Initial revision.
Date: March 14, 2001
Document recreated from earlier documents
Finally got back to editing it.
The R400 will be the high end standalone graphics chip product when it is introduced.
It will be followed very rapidly with two variants:
The RV400, aimed at the volume PC space
The R450, aimed at a volume high end market.
The targets for the three chips are:
Part  | Clock speed | pixels/clk | texture fetches/clk | alu ops/clk | Memory width | Memory speed | die size | Tapeout
R400  | 400 MHz     | 8          | 16                  | 32          | 256          | 400 MHz      | 11.5     | July, 2002
RV400 | 500 MHz     | 4          | 8                   | 16          | 128          | 500 MHz      | 8.5      | Nov 2002
R450  | 500 MHz     | 8          | 16                  | 32          | 256?         | 500 MHz      | 9.5      | Feb 2003
1. Features
1.1 AGP 8x
The chip will support the 32 bit AGP interface at speeds up to 8x. I expect that we will need to support AGP 1x and 2x,
which require 3.3 Volt I/O (AGP 4x is 1.5V and AGP 8x is 750mV). AGP fast writes are supported for access to the
frame buffer.
Open issue: 64 bit address space support.
1.2 256 Bit Memory Interface
The R400 and R450 support four memory channels, which can be 32 or 64 bits wide; the maximum memory bus
width is a total of 256 bits. The RV400 supports two memory channels and a maximum total width of 128 bits.
All channels need to be configured identically; 1, 2 or 4 channels can be configured.
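The channel rules above can be restated as a small configuration check. This is only an illustrative Python sketch: the names `MAX_CHANNELS` and `total_bus_width` are hypothetical and do not come from the design.

```python
# Illustrative model of the channel rules in section 1.2 (names are hypothetical).
# Per-chip limit taken from the text: the maximum number of memory channels.
MAX_CHANNELS = {"R400": 4, "R450": 4, "RV400": 2}


def total_bus_width(chip, channels, bits_per_channel):
    """Return the total memory bus width in bits for a legal configuration.

    All channels are configured identically, so one width applies to every
    channel. Raises ValueError for a configuration the text does not allow.
    """
    if channels not in (1, 2, 4):
        raise ValueError("only 1, 2 or 4 channels can be configured")
    if channels > MAX_CHANNELS[chip]:
        raise ValueError(f"{chip} supports at most {MAX_CHANNELS[chip]} channels")
    if bits_per_channel not in (32, 64):
        raise ValueError("a channel is 32 or 64 bits wide")
    return channels * bits_per_channel


print(total_bus_width("R400", 4, 64))   # widest R400/R450 configuration
print(total_bus_width("RV400", 2, 64))  # widest RV400 configuration
```

With four 64 bit channels the R400 and R450 reach the 256 bit maximum, and with two 64 bit channels the RV400 reaches its 128 bit maximum.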
Memory standards supported (columns: Memory type, I/O, Speed):
SSTL2.5, 100 to 500 MHz
SSTL1.8, 300 to 500 MHz
Elpida
Infineon
1.8 (1.5?)
Infineon e-DRAM
300 to 400 MHz
500 MHz
No support for SSTL3.3, or SDRAM (LVTTL, 3.3V) is planned.
1.3 Unified Processing pipe
The most ambitious feature in this design is the “truly unified pipe”: a single programmable pipeline is used for 2D,
video, 3D vertex, and 3D pixel operations. The unified pipeline does all of its calculations in 32 bit floating point, the
same as the existing vertex transform in previous chips, and the next step in the precision of the color/pixel
calculations, which have increased from 8 bits (R100), through 16 bits (R200), to the 20 bits in the R300.
There is an area cost to the unified pipeline since we are forced to go to 32 bit precision for color, when application
requirements may need less (22 to 24 bits). However the unified pipeline results in a single math/register structure
compared to the separate structures in a more traditional design. It is hoped that by only needing to design the one
structure we can make the investment in design time and effort to really optimize the area.
Some of the benefits of merging the pipelines include allowing the vertex operations to do texture fetches, which we
could not afford to add logic to the transform pipe to do, a single programming model for both operations, more precision
on color than we would normally provide, and the ability to support significantly more registers and instructions in
pixel shaders.
One important benefit is load balancing. In the current pipeline, when the app is transform bound the pixel pipeline is
idle some significant portion of the time, and when the app is raster bound the transform hardware is idle. The unified
pipeline presented here dynamically allocates its processing power between transform and raster.
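The load balancing argument can be illustrated with a toy model. The sketch below is an idealization, not a description of the hardware: work is measured in abstract unit-cycles, the 2 + 6 split of units in the fixed design is an invented example, and scheduling overhead is ignored.

```python
def frame_time_split(vertex_work, pixel_work, vertex_units, pixel_units):
    """Time for a design with dedicated transform and raster hardware.

    Each kind of work can only run on its own units, so the slower side
    sets the frame time while the other side sits idle.
    """
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def frame_time_unified(vertex_work, pixel_work, total_units):
    """Time for a unified pipe that can spend any unit on either kind of work."""
    return (vertex_work + pixel_work) / total_units


# Hypothetical workloads, in unit-cycles of work per frame.
for name, v, p in [("transform bound", 80, 20), ("raster bound", 10, 90)]:
    split = frame_time_split(v, p, vertex_units=2, pixel_units=6)
    unified = frame_time_unified(v, p, total_units=8)
    print(f"{name}: split={split:.1f} unified={unified:.1f}")
```

In this model the unified pipe is never slower than the split design with the same total number of units, and the gap is largest when the workload is far from the ratio the split design was sized for.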
1.4 Front end scaling
We will remove the back end scaling capability from the display, and replace it with a non-scaling overlay. This will
require us to be able to implement scaling using the unified pipeline. Key features that will need to be supported are
large filter kernels, de-interlacing, frame rate conversion, and good support for YUV and color conversion.
1.5 Real-Time drawing command ability
To allow for the emulation of backend scaling as well as support new features we need to be able to interrupt the 3D
pipe and be able to execute high priority commands with low latency. The point of interruption is in the primitive
assembly; the maximum latency will be about the time it takes to render 4096 pixels. The real time commands are
inserted into the 3D pipeline after transform, clipping, and setup. Those functions need to be performed by the driver.
There are also limits on the number of constant registers available.
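To put the 4096 pixel figure in time units, the following sketch converts it using the clock and pixels/clk targets from the table of chip targets. It assumes the pipe is draining at its peak pixel rate, which is an assumption of this example rather than a statement in the text; the function name is hypothetical.

```python
def interrupt_latency_us(pixels=4096, pixels_per_clk=8, clock_mhz=400):
    """Approximate worst-case latency in microseconds, assuming peak pixel rate."""
    clocks = pixels / pixels_per_clk
    # clocks divided by MHz gives microseconds
    return clocks / clock_mhz


print(interrupt_latency_us())                                 # R400 targets
print(interrupt_latency_us(pixels_per_clk=4, clock_mhz=500))  # RV400 targets
```

With the R400 targets this is 512 clocks, or about 1.28 microseconds; a pipe running below its peak rate would take proportionally longer.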
1.6 3D Features
There are a number of new 3D features we are considering for inclusion. Additional features may be added, and
some of these may be dropped.
1.6.1 Noise Textures
Perlin style noise is useful for a number of applications. It is generated on chip and consumes no external memory
bandwidth. It is also larger than any physical texture can be: 256x256x256 lattice points, and it still has detail when the
resolution is 4Kx4Kx4K. There is an opportunity to get this adopted as part of DX9.
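For readers unfamiliar with Perlin style noise, here is a minimal software sketch. It follows the widely published "improved noise" formulation (a 256 entry permutation table, gradient hashing, and a quintic fade curve); the source does not describe the on-chip algorithm, so this should not be read as the hardware method. The 256 entry table is what gives a 256x256x256 lattice that repeats. `make_permutation` and `noise3` are names chosen for this example.

```python
import math
import random


def make_permutation(seed=0):
    """Build a shuffled 0..255 table, doubled so index sums never overflow it."""
    rng = random.Random(seed)
    p = list(range(256))
    rng.shuffle(p)
    return p + p


def _fade(t):
    # Quintic fade curve: zero first and second derivative at t = 0 and t = 1.
    return t * t * t * (t * (t * 6.0 - 15.0) + 10.0)


def _lerp(t, a, b):
    return a + t * (b - a)


def _grad(h, x, y, z):
    # Use the low 4 bits of the hash to pick one of 12 gradient directions.
    h &= 15
    u = x if h < 8 else y
    v = y if h < 4 else (x if h in (12, 14) else z)
    return (u if (h & 1) == 0 else -u) + (v if (h & 2) == 0 else -v)


def noise3(p, x, y, z):
    """Gradient noise at (x, y, z); the lattice repeats every 256 units per axis."""
    xi, yi, zi = math.floor(x), math.floor(y), math.floor(z)
    X, Y, Z = xi & 255, yi & 255, zi & 255          # lattice cell
    x, y, z = x - xi, y - yi, z - zi                # position inside the cell
    u, v, w = _fade(x), _fade(y), _fade(z)
    A = p[X] + Y
    AA, AB = p[A] + Z, p[A + 1] + Z
    B = p[X + 1] + Y
    BA, BB = p[B] + Z, p[B + 1] + Z
    # Blend the gradient contributions of the eight cell corners.
    return _lerp(
        w,
        _lerp(v,
              _lerp(u, _grad(p[AA], x, y, z), _grad(p[BA], x - 1, y, z)),
              _lerp(u, _grad(p[AB], x, y - 1, z), _grad(p[BB], x - 1, y - 1, z))),
        _lerp(v,
              _lerp(u, _grad(p[AA + 1], x, y, z - 1), _grad(p[BA + 1], x - 1, y, z - 1)),
              _lerp(u, _grad(p[AB + 1], x, y - 1, z - 1), _grad(p[BB + 1], x - 1, y - 1, z - 1))),
    )


perm = make_permutation(seed=0)
print(noise3(perm, 1.5, 2.25, 3.75))
```

Because the value is computed from the coordinates, nothing is read from external memory, and sampling 16 points per lattice cell along each axis corresponds to the 4Kx4Kx4K resolution mentioned above.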
1.6.2 Shadow buffers
John Carmack is using shadow volumes to generate shadow effects in Doom 3. Shadow volumes are a very poor way
to use modern 3D pipelines. (will add more detail here later). Shadow buffers have two key limitations: very high
resolutions are required to avoid aliasing, and traditional shadow buffers can not be mip-mapped, so filtering is a real
problem. We are able to solve the first problem through a combination of our improved anti-aliasing Z compression
and a new method of implementing the shadow map probe.
1.6.3 Sort Independent Transparency
We are currently looking into how best to support sort independent transparency. The two plans are either the dual Z
buffer approach, or the approach described in <need to decide where the email should be placed so others can see>
1.6.4 Anti-Aliasing
The changes from the R300 include an increased number of samples per pixel, probably eight, and support for an
allocated frame buffer size smaller than the worst case maximum.
1.6.5 Texture compression
To further reduce bandwidth we need to improve texture compression. We need to achieve both better compression
than S3TC, and a high enough quality that textures that would lose too much detail with S3TC can be
compressed. These two goals do not need to be achieved simultaneously on all textures. We also need to look at
compression of non-traditional surfaces such as normal maps. Advances here are dependent on the availability of
resources to work on this. If we are unable to find resources we will support the S3TC compression currently in D3D.
1.6.6 Z compression
<larry needs to give me a paragraph here>
1.6.7 Texture Filtering
The texture pipes can fetch a 2x2 region from the texture map and filter it.
The data per pixel can be four eight bit values, two sixteen bit values, or one 32 bit value. All data needs to be
fixed point.
Linear filters are completely built in, and it takes 1 cycle for bi-linear, 2 for tri-linear, and 4 for quadra-linear (filtered
mip-mapping of volume textures). Variable depth anisotropy is supported in hardware, with the texture pipe calculating the
number of samples needed. Optionally the pixel shader can calculate the number of samples, and how to increment
the texture address, and provide this to the texture pipe.
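The cycle counts above line up with the number of 2x2 fetches each filter needs, as the following sketch illustrates. This is a reading of the text rather than a stated rule, and the code uses floating point weights for clarity even though the hardware data is fixed point. The names `FILTER_CYCLES`, `bilinear` and `trilinear` are hypothetical.

```python
# Cycle counts quoted in the text for the built-in linear filters.
FILTER_CYCLES = {"bilinear": 1, "trilinear": 2, "quadralinear": 4}


def bilinear(t00, t10, t01, t11, fx, fy):
    """Filter one 2x2 footprint; fx and fy are the fractional texel offsets in [0, 1]."""
    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy


def trilinear(quad_fine, quad_coarse, fx, fy, lod_frac):
    """Blend bilinear results from two mip levels: two 2x2 fetches, hence 2 cycles.

    For simplicity the same fractional offsets are used on both levels; a real
    filter would compute them per level.
    """
    fine = bilinear(*quad_fine, fx, fy)
    coarse = bilinear(*quad_coarse, fx, fy)
    return fine + (coarse - fine) * lod_frac


print(bilinear(0, 100, 0, 100, 0.5, 0.5))
print(trilinear((0, 100, 0, 100), (20, 20, 20, 20), 0.5, 0.5, 0.5))
```

Quadra-linear filtering of a mip-mapped volume texture extends the same pattern to two slices on each of two mip levels, which is four 2x2 fetches.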
1.6.8 Curved Surface Support
We will support curved surfaces through a combination of vertex shader code and a tessellation engine to generate
new vertices.
The tessellation engine generates new vertex indices from an input vertex index array. The new indices contain both
the coordinate in parametric space of the vertex, and the indices to the surface, or to data from which the surface can
be derived. More information is available in the programming guide.
1.6.9 Displacement maps
The tessellation engine for curved surfaces can dice triangles into micropolygons; the vertex shaders for the vertices
can then access a displacement map and change the location of the points.
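As a rough illustration of dicing and displacement, the sketch below splits a triangle into n*n smaller triangles using barycentric parametric coordinates, then moves a point along a direction by a height value. The uniform dicing pattern, the displacement along a normal, and all function names are assumptions made for the example; the actual tessellation engine is described in the programming guide.

```python
def dice_triangle(n):
    """Dice a triangle into n*n micro-triangles.

    Returns (verts, tris): verts are barycentric coordinates (a, b, c) with
    a + b + c == 1, and tris are index triples into verts.
    """
    index = {}
    verts = []
    for i in range(n + 1):
        for j in range(n + 1 - i):
            index[(i, j)] = len(verts)
            verts.append((i / n, j / n, (n - i - j) / n))
    tris = []
    for i in range(n):
        for j in range(n - i):
            # "Upward" micro-triangle of this cell.
            tris.append((index[(i, j)], index[(i + 1, j)], index[(i, j + 1)]))
            # "Downward" micro-triangle, absent along the diagonal edge.
            if i + j < n - 1:
                tris.append((index[(i + 1, j)], index[(i + 1, j + 1)], index[(i, j + 1)]))
    return verts, tris


def interpolate(bary, corners):
    """Position of a parametric point from the three corner positions."""
    return tuple(sum(w * c[k] for w, c in zip(bary, corners)) for k in range(3))


def displace(position, normal, height):
    """Move a point along the normal by a height read from a displacement map."""
    return tuple(p + height * n for p, n in zip(position, normal))


verts, tris = dice_triangle(4)
corners = [(0, 0, 0), (4, 0, 0), (0, 4, 0)]
flat = [interpolate(b, corners) for b in verts]
bumped = [displace(p, (0, 0, 1), 0.5) for p in flat]
print(len(verts), len(tris), bumped[0])
```

In the pipeline described here the interpolation and displacement steps would run as vertex shader code, with the height coming from a texture fetch.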
1.7 High color depth
We will support a 64 bit color buffer (16:16:16:16). We will support two formats: sRGB64 and a floating point format.
<need to insert format details>
2. Performance
The basic performance is:
Part | MHz | Fill rate     | Bi-linear texture fetches | Peak tri/sec
R400 | 400 | 3.2 gigapixel | 6.4 billion               | 400 million
R450 | 500 |               |                           | 500 million
Under normal conditions, and when not further limited by memory bandwidth, we expect to be > 75% efficient.
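The R400 row can be reproduced from the clock and the per-clock targets given earlier (8 pixels/clk, 16 texture fetches/clk). The sketch below does that arithmetic; the one triangle per clock rate is an assumption chosen because it matches the 400 million figure, and `peak_rates` is a hypothetical name.

```python
def peak_rates(clock_mhz, pixels_per_clk, fetches_per_clk, tris_per_clk=1):
    """Peak per-second rates from the clock and per-clock throughput."""
    hz = clock_mhz * 1_000_000
    return {
        "fill": hz * pixels_per_clk,
        "bilinear_fetches": hz * fetches_per_clk,
        "tris": hz * tris_per_clk,
    }


def sustained(rate, efficiency=0.75):
    """Expected rate at the efficiency quoted in the text (when not bandwidth limited)."""
    return rate * efficiency


r400 = peak_rates(400, pixels_per_clk=8, fetches_per_clk=16)
print(r400)
print(sustained(r400["fill"]))
```

At 75% efficiency the R400 peak fill rate of 3.2 gigapixels per second corresponds to about 2.4 gigapixels per second sustained.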
3. Schedule
R400  | July, 2002 | Oct, 2002 | Dec, 2002
RV400 | Nov, 2002  | Jan, 2003 | March, 2003
R450  | May 2003   | Jan, 2003 | April 2003
4. Process
At the moment this looks like an easy choice: 0.13 micron will have been in production for over a year, and 0.10 micron
does not show up until the very end of 2002 according to the TSMC and UMC roadmaps.
We will probably want to be in a flip chip packaging approach to meet power distribution goals. With the 256 bit bus
we will have at least 600 signal I/Os (404 in memory). We may be as much as 10A at 1V for average power, which
will require very good power distribution; area bond flip chip is probably the only option.
5. General Chip operation
5.1 Unified Shader
The unified shader is a SIMD/vector engine that performs the same instructions on four sets of four (16 total)
elements. For pixel shader operations the elements are pixels, with the sets of four required to be 2x2 footprints. For
vertex shader operations the sixteen elements are sixteen vertices. The basic element is a 4 value vector, frequently
interpreted as x,y,z,w or r,g,b,a.
The user model for the unified shader is composed of a variable number of general purpose registers, a subset of
which are usually initialized with data. An ALU can do simple math, conditional moves, and permutations on the
registers, and there is the ability to do a limited number of memory reads using the texture cache. The number of registers is
variable, and the number of registers required for an operation is specified when the task is submitted to the unified
shader. The unified shader will not start the task until there is enough free room for the task’s registers.
The unified shader is based on the R300 pixel shader.
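The rule that a task does not start until its registers are free can be sketched as a simple pool. Everything about this model beyond that rule is an assumption: the pool size, the in-order issue policy (a waiting task blocks the ones behind it), and the `RegisterPool` name are all invented for illustration.

```python
from collections import deque


class RegisterPool:
    """Toy model of register gating: a task starts only when its registers fit."""

    def __init__(self, total_registers):
        self.total = total_registers
        self.free = total_registers
        self.waiting = deque()   # (task_id, registers_needed), in submission order
        self.running = {}        # task_id -> registers held

    def submit(self, task_id, registers_needed):
        if registers_needed > self.total:
            raise ValueError("task can never fit in the register file")
        self.waiting.append((task_id, registers_needed))
        self._start_ready()

    def finish(self, task_id):
        # A finished task returns its registers, which may unblock waiting tasks.
        self.free += self.running.pop(task_id)
        self._start_ready()

    def _start_ready(self):
        # Assumed policy: strict in-order issue from the head of the queue.
        while self.waiting and self.waiting[0][1] <= self.free:
            task_id, needed = self.waiting.popleft()
            self.free -= needed
            self.running[task_id] = needed


pool = RegisterPool(total_registers=64)
pool.submit("a", 32)
pool.submit("b", 24)
pool.submit("c", 16)      # only 8 registers free, so "c" waits
print(sorted(pool.running), pool.free)
pool.finish("a")          # frees 32 registers, "c" can now start
print(sorted(pool.running), pool.free)
```

Smaller register requests let more tasks be resident at once, which is the trade-off between registers per task and latency hiding discussed for vertex programs in section 5.2.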
5.2 3D Rendering
For 3D rendering, data is passed twice through the unified shader: once to transform the vertices and a second time
to determine the color of the pixels.
The input to the 3D pipe is expected to be indexed vertex arrays. Linear vertex arrays can easily be supported by the
CP generating sequential indices. Inline vertex data is an open issue; I would prefer to write it to memory and then
fetch it as a vertex array rather than add a direct path.
The stream of indices is sent to the Primitive Assembly block by the CP. The front of the primitive assembly block
maintains the tag for the vertex cache; the vertex cache stores transformed vertices. As misses are detected in the
tag, the indices that miss are placed into 16 entry vectors. Each vector contains a state pointer, a pointer to the vertex
shader to be used, and the 16 indices to vertices that need to be transformed. When either a vector is filled with 16
entries or a state change happens (so that the next vertex does not share the state and vertex shader with the
previous vertex) the vector is issued to one of the “shader” pipelines for transformation. Which of the four shader
pipelines it is issued to is determined either by some effort of load balancing or a simple round robin. All that is
submitted to the pixel pipeline is the state, the vertex program, and the indices. The shader pipeline will fetch the
vertex array data through the cache infrastructure that is also used for texture fetches. After the tag, the indices
(actually now the indices into the vertex cache) are placed into a latency FIFO to hide the latency of transforming the
vertices.
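The batching of vertex cache misses into 16 entry vectors can be sketched as follows. The tag is modelled as a set with no eviction, keyed by state, shader and index, and a partially filled vector is flushed at the end of the stream; both are simplifications assumed for this example, since the text does not give the cache size or replacement policy. `batch_misses` is a hypothetical name.

```python
VECTOR_SIZE = 16  # indices per vector, from the text


def batch_misses(stream):
    """Group vertex cache misses into vectors of up to 16 indices.

    stream yields (state, shader, index) tuples. Returns a list of
    (state, shader, [indices]) vectors in issue order.
    """
    tags = set()          # assumed: unbounded tag store, no eviction
    vectors = []
    current_key = None
    pending = []
    for state, shader, index in stream:
        key = (state, shader)
        if pending and key != current_key:
            # State change: issue the partial vector before starting a new one.
            vectors.append((current_key[0], current_key[1], pending))
            pending = []
        current_key = key
        tag = (state, shader, index)
        if tag in tags:
            continue          # hit: the vertex is already transformed
        tags.add(tag)
        pending.append(index)
        if len(pending) == VECTOR_SIZE:
            vectors.append((state, shader, pending))
            pending = []
    if pending:
        vectors.append((current_key[0], current_key[1], pending))
    return vectors


stream = [(0, "vs0", i) for i in range(20)] + [(0, "vs0", 5)] + [(1, "vs1", i) for i in range(3)]
for state, shader, indices in batch_misses(stream):
    print(state, shader, len(indices))
```

In the example the first vector is issued because it is full, the second because of the state change, and the third because the stream ends.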
The shader pipeline receives the vector of 16 indices from the primitive assembly block. The shader pipeline
operates, when rendering pixels, by processing a vector of four 2x2 pixel footprints, a total of 16 pixels. For vertex
processing each of the pixels is replaced with a vertex. The vertex program includes information on how many local
variables it will need. The rasterizer waits until that many local variables are free (as each executing thread in the
shader pipeline terminates, it frees its local variables). With the proposed shader data path the maximum number of
local variables per vertex is 256. However this leaves no ability to hide latency; 16 to 32 local variables will probably
maximize latency hiding and therefore performance. The vertex shader program can use all the capabilities of the
shader pipeline including texture fetches and dependent lookups. At the end of the vertex program, the transformed
coordinates must be output. One output will be the x, y, z, w position, which will be stored in the position cache of the
vertex cache. The vertex program may also output a number of parameter values (colors, texture coordinates, other
interpolated inputs into the pixel shader). The parameter values must be output as a multiple of four 128 bit words, as
the parameter cache is designed for this.
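Two pieces of arithmetic follow from this paragraph. The sketch below assumes that the 256 local variables per vertex are a shared budget divided among the vectors in flight, which is an inference from "leaves no ability to hide latency" rather than something the text states, and it reads "a multiple of four 128 bit words" as rounding the parameter count up to the next multiple of four. Both function names are hypothetical.

```python
MAX_LOCALS_PER_VERTEX = 256  # from the text


def vectors_in_flight(locals_per_vertex):
    """How many vectors can be resident at once under the shared-budget assumption."""
    if not 1 <= locals_per_vertex <= MAX_LOCALS_PER_VERTEX:
        raise ValueError("local variable count out of range")
    return MAX_LOCALS_PER_VERTEX // locals_per_vertex


def padded_parameter_words(words_128bit):
    """Round a count of 128 bit parameter words up to a multiple of four."""
    return -(-words_128bit // 4) * 4


for n in (256, 32, 16):
    print(n, "locals ->", vectors_in_flight(n), "vectors in flight")
print(padded_parameter_words(5), "words output for 5 parameters")
```

Under these assumptions a program using all 256 local variables runs alone, while one using 16 to 32 leaves room for 8 to 16 vectors whose execution can overlap each other's fetch latency.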
The primitive assembly block reads the indices back out of the latency FIFO and accesses the position cache portion
of the vertex cache. It assembles the vertices into primitives (lines, triangles, rectangles, quads?, points, ?).
Barycentric values are assigned to the vertices, and will be used later in the rasterizer to interpolate the parameters.
The parameters are not accessed by the primitive assembly logic, which only works from the position data. The
primitive is clipped against both the viewing volume and the user clip planes, with fractional barycentric coordinates
assigned to the clipped primitive sections. The primitive goes through the perspective divide and the viewport
transform. The resulting screen space primitive is set up (plane equations for 1/W, Z, and the barycentric coordinates).
The resulting primitive data, including the indices back into the parameter portion of the vertex cache, are broadcast to
the four pipes. The final time that an index is output that accesses the oldest vertex cache line, a token is also sent.
When all of the four pipelines return the token, the primitive assembly block can free that cache line and allow it to be
used for a new vector of vertices. The performance goal in the primitive assembly block is a triangle every two clocks.
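The token handshake that frees a vertex cache line can be modelled as a per-line set of pipes that have not yet answered. This is a sketch of the bookkeeping only; the class and method names are hypothetical and the real mechanism may track tokens differently.

```python
NUM_PIPES = 4  # the primitive data is broadcast to four pipes


class CacheLineTracker:
    """Tracks which pipes still owe a token for each vertex cache line."""

    def __init__(self):
        self.outstanding = {}  # cache line -> set of pipes that have not returned the token

    def send_token(self, line):
        """Called with the last index that references the line: every pipe gets a token."""
        self.outstanding[line] = set(range(NUM_PIPES))

    def return_token(self, line, pipe):
        """Record a returned token. Returns True once the line can be freed."""
        pending = self.outstanding[line]
        pending.discard(pipe)
        if not pending:
            del self.outstanding[line]
            return True
        return False


tracker = CacheLineTracker()
tracker.send_token(line=7)
for pipe in range(NUM_PIPES):
    freed = tracker.return_token(7, pipe)
    print("pipe", pipe, "returned token, line freed:", freed)
```

The line is reused only after the slowest pipe has finished with it, so a pipe that is still reading parameters from that line never sees them overwritten.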
An alternative option is for the vertex shader to generate screen coordinates and clip codes. If a primitive needs to be
clipped, which can not be determined until primitive assembly, then the vertices are reverse transformed back into clip
space by logic in the primitive assembly block, clipped, and then transformed back into screen space.
To help meet marketing BS numbers we can look into doing backface culling at a rate of one triangle per clock. This
will boost us to a peak BS number of 500 million triangles per second.
Each pipe has a FIFO in front of the rasterizer to load balance. Each pipe will handle 16x16 tiles of the screen, which
are interleaved between the pipes. To maximize the effective size of the FIFO we will probably cull the triangle list
before the FIFO. The rasterizer will request the parameter data from the parameter cache for the primitives. A small
latency hiding FIFO will hide the latency of the access to the parameter cache. The parameter cache is 512 bits wide,
and the interfaces from the parameter cache to the rasterizer are 128 bits wide; this allows the parameter cache to
output one pipeline's request per clock, which is serialized over four clocks, keeping all four interfaces busy. The
rasterizer keeps a small cache of three to four ver

