Beyond3D - ATI Xenos: Xbox 360 Graphics Demystified
Published on 13th Jun 2005, written by Dave Baumann for Consumer Graphics

Those that have followed the development of 3D graphics over the past ten years or so
will have seen a continual development of the capabilities of the processors, but
fundamentally following the path of the OpenGL pipeline model. 3dfx really ignited the
market with their "Voodoo Graphics" add-in boards, which were not much more than
just a raster engine: it utilised one chip for texture sampling and another for pixel
processing (a simple Render Output unit - ROP); 3dfx further evolved that by adding
an extra texture unit, allowing for slightly more complex effects in the raster pipeline.
And so it was that this model was followed for a number of years with the main
developments being the number of pixel pipelines and textures supported per pipeline,
until NVIDIA took the step of moving further forward on the OpenGL pipeline and giving
accelerated support to the Transformation and Lighting process with GeForce 256.
Whilst graphics processors had varying degrees of the geometry process, from clipping
to setup, handled in hardware, adding a T&L engine was a significant step up the
OpenGL pipeline, but didn't really fundamentally change our thinking of graphics
processors.

At the same time as the graphics vendors started giving us T&L engines the pixel
processors gradually increased in flexibility as well, up until the point that
"programmable shader architectures" were all anyone could talk about. The pixel
pipelines became more flexible such that they had limited programmability, as did
vertex processing, with vertex shaders operating in parallel with T&L engines.
Nowadays the level of programmability of both vertex and pixel shaders has increased
significantly, with vertex shaders enveloping the T&L processors entirely and pixel
shaders consuming the texture processors. However, despite an increasingly important
onus being placed on the arrangement and capabilities of the shader Arithmetic Logic
Units (ALU's) in this programmable era, the designs of contemporary graphics
processors still bear the fundamental similarities to their forebears: vertex processing
up one end of the pipeline, pixel processing down the other and still very much aligned
with multiples of pixel pipelines.

Conceivably there is no reason why this development model couldn't continue to exist
in the PC space and it certainly seems like it will from all vendors for at least the next
year. However, ATI have multiple design teams working on different architectures
concurrently, so whilst their PC processors may follow a fairly familiar lineage other
parts of the company have been tackling this shader era with a completely fresh
perspective in order to consider the needs of a "Programmable Graphics Processor" and
extract as much of the potential of the ALU's as possible by trying to minimise the
wasted cycles. In doing so they will force us to reconsider how we think of the overall
pipeline and how we make initial performance assessments based upon "pipelines" alone.

Ever since the announcement that ATI were working with Microsoft on "Future XBox
technologies" the rumour mill has been working overtime as to the graphics behind it.
Some of the messages since the announcement of the XBOX 360, the eventual console
ATI's work will appear in, have not necessarily been reflective of the actual operation
and have even been a little contradictory from representatives directly from ATI. With
strict NDA's and designs being built for two different competitive consoles, very tight
controls of what could be talked about had to be implemented within ATI, and the XBOX
group operated very much within their own silo; it wasn't until Microsoft lifted the NDA's
that ATI could even speak of it on a wider internal basis, let alone externally, and even
then there is a lot of information to gather.

Since XBOX 360's announcement and ATI's unleashing from the non disclosure agreements
we've had the chance to not just chat with Robert Feldstein, VP of Engineering, but also
Joe Cox, Director of Engineering overseeing the XBOX graphics design team, and two lead
architects of the graphics processor, Clay Taylor and Mark Fowler. Here we hope to
accurately impart a slightly deeper understanding of the XBOX 360 graphics processor,
how it sits within the system, understand more about its operation as well as give some
insights into the capabilities of the processor. Bear in mind that, in order to gain an
understanding of how it differs from current platforms, we are under NDA for some of the
operational details of the graphics processor, so some of the specifics won't be revealed
in full detail in this article.

Throughout this article we'll attempt to piece together the operation of the graphics
processor based on our conversations with ATI and some developers who have already
had some knowledge of XBOX 360's capabilities, however we'll also offer some opinions
on certain elements. Sections typed in blue indicate Beyond3D's suppositions and have
not been directly indicated to us by ATI.

Xbox 360 System Overview

The "XBOX 360" console was officially unveiled at a show on MTV the week prior to E3
2005, and at the unveiling Microsoft revealed a few technical details of the platform.
The primary specifications for the system are:

* 3.2GHz Custom IBM Central Processor
  o Three CPU Cores
  o Two Threads Per Core
  o VMX Unit Per Core
  o 128 VMX Registers Per Thread
  o 1MB L2 Cache (Lockable by Graphics Processor)
* 500MHz Custom ATI Graphics Processor
  o Unified Shader Core
  o 48 ALU's for Vertex or Pixel Shader processing
  o 16 Filtered & 16 Unfiltered Texture samples per clock
  o 10MB eDRAM Framebuffer
* 512MB System RAM
  o Unified Memory Architecture (UMA)
  o 128-bit interface
  o 700MHz GDDR3 RAM

Of these core components obviously we are going to be most concerned with the
graphics processing element. Whilst the graphics processor is different from others
seen before in the PC space, and is very different from even ATI's impending new PC
graphics components, it will be interesting to take a look at the graphics processor for
the very reason that it doesn't directly correspond to any current graphics processor
but also because we feel that this will give hints as to the architectural direction ATI are
likely to be taking in the future for PC and other applications.

ATI C1 / Xenos

A name that has long since been mentioned in relation to the graphics behind Xenon
(the development name for XBOX 360) is R500. Although this name has appeared from
various sources, the actual development name ATI uses for Xenon's graphics is "C1",
whilst the more "PR friendly" codename that has surfaced is "Xenos". ATI are probably
fairly keen not to use the R500 name as this draws parallels with their upcoming series
of PC graphics processors starting with R520, however R520 and Xenos are very
distinct parts. R520 is obviously designed to meet the needs of the PC space and to have
Shader Model 3.0 capabilities, as this is currently the highest DirectX API specification
available on the PC, and as such these new parts still have their lineage derived from the
R300 core, with discrete Vertex and Pixel Shaders; Xenos, on the other hand, is a custom
design specifically built to address the needs and unique characteristics of the game
console. ATI had a clean slate with which to design on and no specified API to target.
These factors have led to the Unified Shader design, something which ATI have prototyped
and tested prior to its eventual implementation (with the rumoured R400 development?),
with capabilities that don't fall within any corresponding API specification. Whilst
ostensibly Xenos has been hailed as a Shader Model 3.0 part, its capabilities don't fall
directly in line with it and exceed it in some areas, giving this more than a whiff of
WGF2.0 (Windows Graphics Foundation 2.0 - the new name for DirectX Next / DirectX 10)
about it.

The Xenos graphics processor is not a single element, but actually consists of two
distinct elements: the graphics core (shader core) and the eDRAM module. The shader
core is a 90nm chip manufactured by TSMC and is currently slated to run at 500MHz*,
whilst the eDRAM module is another 90nm chip, manufactured by NEC, and runs at
500MHz* as well. These two chips both exist side by side, together on a single
package, ensuring a fast interlink between the two. The main graphics chip, the parent
core, could be considered as a "shader core" as this is one of its primary tasks. The
eDRAM module is a separate, daughter chip which contains the elements for reading
and writing colour, Z and stencil and performing all of the alpha blending and Z and
stencil ops, including the FSAA logic. We'll explore the capabilities and operations of
both these chips in greater detail throughout the article.

(*) Note: We understand the clockspeeds for the shader core and daughter die are
target clockspeeds at present and there may be some room for small movement either
way on both dies dependent on yields. As Microsoft have now announced 500MHz
speeds it is more likely that these will be the eventual release speeds.

One element that has been reported on is the number of 150M transistors in relation to
the graphics processing elements of Xenon, however according to ATI this is not
correct as the shader core itself comprises in the order of 232M transistors. It
may be that the 150M transistor figure pertains only to the eDRAM module as with
10MB of DRAM, requiring one transistor per bit, 80M transistors will be dedicated to
just the memory; when we add the memory control logic, Render Output Controllers
(ROP's) and FSAA logic on top of that it may be conceivable to see an extra 70M
transistors of logic in the eDRAM module.
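
The arithmetic behind that estimate can be sketched as below. This is only a rough illustration: it assumes exactly one transistor per bit, as the text does, and shows both decimal and binary megabytes since we don't know which convention applies. The function and variable names are ours.

```python
def edram_cell_transistors(megabytes, bytes_per_mb=1_000_000, transistors_per_bit=1):
    """Transistors needed for the DRAM cells alone (no control or ROP logic)."""
    return megabytes * bytes_per_mb * 8 * transistors_per_bit

memory_only = edram_cell_transistors(10)                      # decimal megabytes
memory_only_binary = edram_cell_transistors(10, 1024 * 1024)  # binary megabytes

reported_total = 150_000_000                 # the figure reported at launch
logic_budget = reported_total - memory_only  # what would be left for logic

print(memory_only, memory_only_binary, logic_budget)
```

With decimal megabytes this gives the 80M memory and 70M logic split mentioned above; with binary megabytes the memory share rises to roughly 84M.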
Update: We've recently been given an image of the Xenos graphics chip package
(above) that highlights the dual die nature, with the parent die quite clearly to the
centre of the package and the daughter over to the left. While the 232M transistor
figure for the parent was given to us by ATI we are still trying to establish a more
official figure for the daughter (even though these types of transistor counts are very
much estimates anyway). We've speculated that the 150M figure that appeared when
XBOX 360 was first announced may just relate to the daughter die, however another figure
that has arisen is 100M - judging from the die sizes the daughter die doesn't have
more than half the area of the parent, which would give indications towards the 100M
side, although 80M of those transistors are DRAM which may be more dense than the
logic circuitry that will dominate the parent die. We are trying to get further

One of the mistakes that Microsoft made with the original XBox was to contract their
component providers into supplying entire chips with, evidently, no development path -
at least, this was the case with NVIDIA's NV2A graphics processor, which resulted in
Microsoft and NVIDIA going through a legal arbitration process. Although the
components in the XBOX 360 in its initial form are hardly low cost, the cost of the unit
over the course of its lifetime is one that has quite obviously been addressed with
contracts that pay via royalties for chips sold and with Microsoft in charge of ordering
the chips from the various Fabs, however the original semiconductor manufacturers are
likely to still be in charge of further developments in terms of putting the cores on to
smaller processes and we believe that this is part of the contract that ATI has with
Microsoft. An obvious area for cost reduction of the Xenos processor is by merging the
shader and daughter die on to a single core - we suspect that this will not happen until
there is a process shrink available (that can also cater for both the complex logic and
eDRAM) as two cores on 90nm mitigate some of the yield risks of a single, large die on

Bandwidths and Interconnects
As we discussed earlier, the XBOX 360 carries a unified memory architecture and
Xenos's parent die is acting as the Northbridge controller as well as the graphics
processing device. The system memory bandwidth is 22.4GB/s courtesy of the 128-bit
GDDR3 memory interface running at 700MHz. At 232M transistors the Xenos parent die
isn't an enormous chip so internal memory communication isn't going to be too latency
bound, hence the memory interface only needs to be a standard crossbar, which is
partitioned into two 64-bit blocks. Xenos's parent die also has a 32GB/s connection to
the daughter eDRAM die. Connection to the Southbridge audio and I/O controller is
achieved via two PCI Express lanes which results in 500MB/s of both upstream and
downstream bandwidth.
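
For readers who want to check these figures, the quoted bandwidths follow from simple arithmetic. The sketch below assumes GDDR3 transfers data twice per clock (double data rate) and that each first-generation PCI Express lane carries 250MB/s in each direction; the helper name is ours.

```python
def bus_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9

system_memory = bus_bandwidth_gb_s(128, 700, 2)    # full 128-bit interface
per_64bit_block = bus_bandwidth_gb_s(64, 700, 2)   # one half of the crossbar

PCIE_LANE_MB_S = 250                 # assumption: first-generation PCI Express
southbridge_mb_s = 2 * PCIE_LANE_MB_S  # two lanes, per direction

print(system_memory, per_64bit_block, southbridge_mb_s)
```

The 128-bit result matches the 22.4GB/s quoted above, with each 64-bit block contributing half of it.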
As the CPU is going to be using Xenos to handle all its memory transfers, the
connection between the two has 10.8GB/s of bandwidth both upstream and
downstream simultaneously. Additionally the Xenos graphics processor is able to
directly lock the cache of the CPU in order to retrieve data directly from it without it
having to go to system memory beforehand. The purpose of this is that one (or more,
if wanted) of the three CPU cores could be generating very high levels of geometry that
the developer doesn't want to, or can't, preserve in the memory footprints available on
the system when in use. High-resolution dynamic geometry such as grass, leaves, hair,
particles, water droplets and explosion effects are all examples of one type of scenario
that the cache locking may be used in.
Xenos Daughter Die
The one key area of bandwidth that has caused a fair quantity of controversy in its
inclusion in the specifications is the bandwidth available from the ROPs to the eDRAM,
which stands at 256GB/s. The eDRAM is always going to be the primary location for
any of the bandwidth intensive frame buffer operations and so it is specifically designed
to remove the frame buffer memory bandwidth bottleneck - additionally, Z and colour
access patterns tend not to be particularly optimal for traditional DRAM controllers
where there are frequent read/write penalties, so placing all of these operations in
the eDRAM daughter die, aside from the system calls, leaves the system memory
bus free for texture and vertex data fetches which are both read only and are therefore
highly efficient. Of course, 10MB of frame buffer space isn't sufficient
to fit the entire frame buffer in with 4x FSAA enabled at High Definition resolutions and
we'll cover how this is handled later in the article.
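
To see why 10MB falls short, a back-of-the-envelope size calculation helps. The sketch below is our own estimate rather than ATI's: it assumes a 1280x720 target, 32 bits of colour and 32 bits of Z/stencil per sample, and a 10MB eDRAM counted in binary megabytes.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # assumption: binary megabytes

def framebuffer_bytes(width, height, samples, colour_bytes=4, z_stencil_bytes=4):
    """Colour plus Z/stencil storage for a multi-sampled render target."""
    return width * height * samples * (colour_bytes + z_stencil_bytes)

for samples in (1, 2, 4):
    size = framebuffer_bytes(1280, 720, samples)
    portions = math.ceil(size / EDRAM_BYTES)  # lower bound on eDRAM-sized portions
    print(f"{samples}x: {size / 2**20:.1f} MiB, fits: {size <= EDRAM_BYTES}, "
          f"portions needed: {portions}")
```

Under these assumptions only the non-anti-aliased buffer fits in one go; the 4x case needs at least three eDRAM-sized portions.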
Both XBOX 360 and Playstation 3 feature UMA and graphics busses, respectively, that
have been announced to use fairly fast 700MHz GDDR3 memory, but both only have a
128-bit interface. This is less of a surprise for XBOX 360, as Xenos's use of
eDRAM will move the vast majority of the frame buffer bandwidth to the eDRAM
interface, leaving the system memory bandwidth available primarily for texturing.
It does seem odd, though, that by the time the consoles are released the
likelihood is that high end PC graphics will be using at least the same speed RAM but on
double wide busses. The primary issue here is, again, one of cost - the lifetime of a
console will be much greater than that of PC graphics and process shrinks are used to
reduce the costs of the internal components; 256-bit busses may actually prevent
process shrinks beyond a certain level as, with the number of pins required to support
busses of this width, the chip could quickly become pad limited as the die size is
reduced. 128-bit busses result in far fewer pins than 256-bit busses, thus allowing the
chip to shrink to smaller die sizes before becoming pad limited - by this point it is also
likely that Xenos's daughter die will have been integrated into the shader core, further
reducing the number of pins that are required.

Pixel and eDRAM Operation
Despite references to 192 processing elements in the ROP's within the eDRAM we
can actually resolve that to equating to 8 pixel writes per cycle, as well as having the
capability to double the Z rate when there are no colour operations. However, as the
ROP's have been targeted to provide 4x Multi-Sampling FSAA at no penalty this
equates to a total capability of 32 colour samples or 64 Z and stencil operations per
cycle.

Most PC graphics processors have to balance their output with the available bandwidth
and as such their ROP units usually only cater for 2 Multi-Samples per pixel in a single
cycle, and the Z output doesn't double with the number of Multi-Samples being
produced either. Z and colour compression techniques are also employed in order to
get close to the output capabilities with the bandwidth available. ATI's calculations lead
to a colour and Z bandwidth demand of around 26-134GB/s at 8 pixels with 4x Multi-
Sampling AA enabled at High Definition TV resolutions. The lower end of that
bandwidth figure is derived from having 4:1 colour and Z compression, however the
lossless compression techniques are only optimal when there are no triangle edges
intersecting a pixel, and with the presumed high geometry detail within next
generation console titles the opportunities for achieving this compression ratio across
the entire frame will be reduced. So, with 256GB/s of bandwidth available in the
eDRAM frame buffer there should always be sufficient bandwidth for achieving 8 pixels
per clock with 4x Multi-Sampling FSAA enabled and as such this also means that Xenos
does not need any lossless compression routines for Z or colour when writing to the
eDRAM frame buffer.
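
One way to reproduce the sample rates and the 256GB/s figure is sketched below. The per-sample breakdown is our own assumption, not something ATI have confirmed: 32 bits of colour and 32 bits of Z/stencil per sample, each read and written back once per clock, at the 500MHz target clock.

```python
PIXELS_PER_CLOCK = 8
SAMPLES_PER_PIXEL = 4
CLOCK_HZ = 500e6
COLOUR_BYTES = 4     # assumption: 32-bit colour per sample
Z_STENCIL_BYTES = 4  # assumption: 32-bit Z/stencil per sample

colour_samples_per_clock = PIXELS_PER_CLOCK * SAMPLES_PER_PIXEL
z_only_samples_per_clock = colour_samples_per_clock * 2  # doubled Z rate, no colour

# Assumption: every sample is both read and written each clock (blend and Z test).
bytes_per_clock = colour_samples_per_clock * (COLOUR_BYTES + Z_STENCIL_BYTES) * 2
peak_gb_s = bytes_per_clock * CLOCK_HZ / 1e9

print(colour_samples_per_clock, z_only_samples_per_clock, peak_gb_s)
```

Under those assumptions the uncompressed worst case comes to exactly the 256GB/s on offer, which is consistent with the claim that no compression is needed when writing to the eDRAM.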
So, as far as the operation is concerned, once pixel data has come through the shader
array and is ready to be processed into colour values in memory the Z data of the pixel
is matched with the correct colour data coming out of the shaders. Xenos supports an
"Alpha to Mask" feature, which allows for the use of Multi-Sampling for sort-
independent translucency. All of this processing is performed on the parent die and the
pixels are then transferred to the daughter die in the form of source colour per pixel
and loss-less compressed Z, per 2x2 pixel quad. The interconnect bandwidth between
the parent and daughter die is only an eighth of the eDRAM bandwidth because the
source colour data value is common to all samples of a pixel here, and the Z is
compressed. Once on the daughter die the pixels are unpacked to their Multi-Sample
level and each sample is driven through its Z and Alpha computations and the final
data is stored on the eDRAM until either the entire frame or current tile (we'll cover this
in more detail later) being rendered is finished.
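
As a purely conceptual illustration of that unpacking step, the sketch below expands one pixel into its samples and depth-tests each one. Everything here is hypothetical: the function name, the coverage mask, the "less than" depth test and the plain Python lists standing in for eDRAM contents are our assumptions, and alpha blending is left out for brevity.

```python
def unpack_and_write(source_colour, sample_depths, coverage_mask,
                     stored_colour, stored_depth):
    """Expand one pixel to sample level and Z-test each covered sample.

    source_colour : single colour shared by all samples of the pixel
    sample_depths : per-sample Z values (assumed already decompressed)
    coverage_mask : bit i is set if sample i is covered by the triangle
    stored_*      : per-sample lists, updated in place
    """
    for i, depth in enumerate(sample_depths):
        covered = (coverage_mask >> i) & 1
        if covered and depth < stored_depth[i]:  # assumption: nearer wins
            stored_depth[i] = depth
            stored_colour[i] = source_colour

colours = [(0, 0, 0)] * 4
depths = [1.0, 1.0, 0.2, 1.0]
unpack_and_write((255, 0, 0), [0.5] * 4, 0b0111, colours, depths)
print(colours, depths)
```

Only one colour value crosses the interconnect per pixel, yet up to four sample writes result from it on the daughter die, which is the reason the link between the dies can be so much narrower than the eDRAM bandwidth.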
When the frame or tile has finished rendering, the colour data will then be resolved on
the daughter die, with the Multi-Samples being blended down to their pixel level. The
resolved buffer information is then passed back from the daughter die to the parent
which then outputs to system RAM such that, when all the tiles are finished, this can
then be outputted to the display device. Although the resolved colour data has to be
stored in system RAM, which uses some bandwidth during the transfer, the efficiency of
the write as the resolved data comes out of the daughter die to be written to system
RAM is very high. This high efficiency is due to the fact that it is dealing with a
significant quantity of non-fragmented data and the bus isn't as busy with lots of other
bandwidth consuming, high frequency and inefficient frame buffer read / write / modify
operations for the back buffer. This helps in alleviating the fact that the parent die is
also handling system memory requests. Also note that data can be written to the
eDRAM at the same time as it is being cleared of the previous data that resided
there, meaning there should be little to no wait when removing the previous data from
the eDRAM. (We've heard comments from developers familiar with both designs that this
element of Xenos bears similarities to the "Flipper" design for Nintendo's Gamecube, a
part that was originally designed by ArtX, who of course were subsequently purchased
by ATI, however ATI are keen to point out that while there may be apparent similarities
the designs are entirely independent as there are distinct virtual and physical barriers
between the groups working on the various console developments, past and present,
and no members of the Flipper architecture team were involved in Xenos's
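
The resolve step described at the start of this section, blending Multi-Samples down to pixel level, can be illustrated with a simple equal-weight average. The actual filter Xenos applies hasn't been stated to us, so the box filter and the function name below are assumptions.

```python
def resolve_pixel(samples):
    """Average the per-sample colours of one pixel (equal-weight box filter)."""
    count = len(samples)
    return tuple(sum(channel) // count for channel in zip(*samples))

# A pixel on a triangle edge: two samples covered by red, two left black.
edge_pixel = [(255, 0, 0), (255, 0, 0), (0, 0, 0), (0, 0, 0)]
print(resolve_pixel(edge_pixel))  # (127, 0, 0)
```

Four samples go in and a single colour comes out, which is why the resolved data sent back to system RAM is so much smaller than what was held in the eDRAM.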
As all the sampling units for frame buffer operations are multiplied to work optimally
with 4x FSAA this is actually the maximum mode available. Although the developer can
choose to use 2x or no FSAA, there are no FSAA levels available higher than 4x. The
sampling pattern is not programmable but fixed, although it does use a sample pattern
that doesn't have any of the sample points intersecting one another on either the
vertical or horizontal axis. Although we don't know the exact sample pattern shape, we
suspect it will be similar to that seen on other sparse sampled / jittered / rotated grid
FSAA mechanisms we've seen over the past few years, such as this.
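
The property described here, no two sample points sharing a row or a column, is easy to state in code. The sample positions below are hypothetical, chosen only to contrast an ordered grid with a rotated one; they are not the Xenos pattern, which we don't know.

```python
def is_sparse(pattern):
    """True if no two sample points share an x or a y coordinate."""
    xs = [x for x, _ in pattern]
    ys = [y for _, y in pattern]
    return len(set(xs)) == len(xs) and len(set(ys)) == len(ys)

# Hypothetical 4-sample patterns within a unit pixel.
ordered_grid = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
rotated_grid = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

print(is_sparse(ordered_grid), is_sparse(rotated_grid))
```

A sparse pattern gives four distinct steps of coverage on near-horizontal and near-vertical edges, whereas the ordered grid only gives two.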
The ROP's can handle several different formats, including a special FP10 mode. FP10 is
a floating point precision mode in the format of 10-10-10-2 (bits for Red, Green, Blue,
Alpha). The 10 bit colour storage has a 3 bit exponent and 7 bit mantissa, with an
available range of -32.0 to 32.0. Whilst this mode does have some limitations it can
offer HDR effects at the same cost in performance and size as standard 32-bit (8-
8-8-8) integer formats, which will probably result in this format being used quite
frequently on XBOX 360 titles. Other formats such as INT16 and FP16 are also
available, but they obviously have space implications. As with the resolve of the MSAA
samples, there is a conversion step to change the front buffer format to a displayable
8-8-8-8 format when moving the completed frame buffer portion from the eDRAM
memory out to system RAM.
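
The space implications can be made concrete with a little arithmetic. The sketch below assumes INT16 and FP16 are four-channel formats with 16 bits per channel and uses a 1280x720 colour buffer for illustration; both are our assumptions rather than stated specifications.

```python
FORMATS = {
    "8-8-8-8 integer": (8, 8, 8, 8),
    "FP10 (10-10-10-2)": (10, 10, 10, 2),
    "INT16": (16, 16, 16, 16),  # assumption: four 16-bit channels
    "FP16": (16, 16, 16, 16),   # assumption: four 16-bit channels
}

def bits_per_pixel(channels):
    return sum(channels)

def colour_buffer_mib(channels, width=1280, height=720, samples=1):
    """Colour storage only, in binary megabytes."""
    return width * height * samples * bits_per_pixel(channels) / 8 / 2**20

for name, channels in FORMATS.items():
    print(f"{name}: {bits_per_pixel(channels)} bits/pixel, "
          f"{colour_buffer_mib(channels):.2f} MiB at 1280x720")
```

FP10 lands on the same 32 bits per pixel as the integer format, while the 16-bit per channel formats double the colour footprint in the eDRAM.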
The ROP's are fully orthogonal so Multi-Sampling can operate with all pixel formats.

Render to texture operations will also be rendered out to the eDRAM first and then read
out to UMA memory, when complete, in order to be used as a texture surface for the
final frame rendering. Render to texture operations can also have Multi-Sample FSAA
applied and the result can either be resolved on the way out to system memory or kept
at the high resolution Multi-Sample level. As with standard pixel operations, the eDRAM
memory can be written to with either another render to texture operation or pixel data
whilst the data from the previous render to texture is being pushed out to UMA

Z-Only Rendering Pass
Some games these days make use of graphics chips' abilities to fast reject workload
based on Z information. Engines such as Doom 3 or Source have the capabilities to, on
each frame, run a geometry only pass which is for the purpose of pre-filling the Z
buffer with the final Z depths of that frame. When the full frame is ready to be
rendered, pixel information that has a higher Z depth than the information in the Z
buffer is rejected before any pixel operations are carried out on it, meaning that there
are no pixels written that are wasted due to overdraw. This Z-only prepass is expected
to be commonly used on Xenos as it has additional advantages for tiling, explained

A geometry pass to populate Z information is going to gain from a processor that has
double the Z compare / write units in relation to its pure pixel fill-rate, which Xenos
does. However another factor is that this pass is actually going to require geometry
processing over the vertex shaders. In a traditional shader capable graphics processor
the number of vertex units can often be many times less than the pixel shader ALU's,
however in the case of Xenos all of the shader units will be tasked purely with the
geometry processing which should also ensure a fast operation of this early Z pass.
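
The benefit of a Z-only prepass can be shown with a toy software model. This is a sketch of the general technique rather than of Xenos's hardware: fragments are reduced to an (x, depth) pair on a one-dimensional "screen", lower depth means nearer, and the function simply counts how many fragments would reach the pixel shader.

```python
def render(fragments, width, use_z_prepass):
    """Return how many fragments run the (expensive) pixel shader.

    fragments: list of (x, depth) in submission order; lower depth is nearer.
    """
    z_buffer = [float("inf")] * width

    if use_z_prepass:
        # Pass 1: geometry only, fill the Z buffer with the final depths.
        for x, depth in fragments:
            z_buffer[x] = min(z_buffer[x], depth)

    shaded = 0
    for x, depth in fragments:
        if depth > z_buffer[x]:
            continue  # rejected before any pixel work is done
        z_buffer[x] = depth
        shaded += 1
    return shaded

# Three surfaces covering the same 4 pixels, submitted back to front.
frags = [(x, d) for d in (0.9, 0.5, 0.1) for x in range(4)]
print(render(frags, 4, use_z_prepass=False), render(frags, 4, use_z_prepass=True))
```

In this worst-case ordering the prepass cuts shaded fragments from twelve to four, one per visible pixel, at the price of processing the geometry twice.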
As with ATI's current desktop parts, Xenos features a Hierarchical Z buffer. Hierarchical
Z buffers contain "coarser" Z information than the full resolution Z buffer - usually

