XBOX 360 SYSTEM ARCHITECTURE

This article covers the Xbox 360's high-level technical requirements, a short system overview, and details of the CPU and the GPU. The authors describe their architectural trade-offs and summarize the system's software programming support.
`
Microsoft's Xbox 360 game console is the first of the latest generation of game consoles. Historically, game console architecture and design implementations have provided large discrete jumps in system performance, approximately at five-year intervals. Over the last several generations, game console systems have increasingly become graphics supercomputers in their own right, particularly at the launch of a given game console generation.
The Xbox 360, pictured in Figure 1, contains an aggressive hardware architecture and implementation targeted at game console workloads. The core silicon implements the product designers' goal of providing game developers a hardware platform to implement their next-generation game ambitions. The core chips include the standard conceptual blocks of CPU, graphics processing unit (GPU), memory, and I/O. Each of these components and their interconnections are customized to provide a user-friendly game console product.
`
Design principles
One of the Xbox 360's main design principles is the next-generation gaming principle: that is, a new game console must provide value to customers for five to seven years. Thus, as for any true next-generation game console hardware, the Xbox 360 delivers a huge discrete jump in hardware performance for gaming.
The Xbox 360 hardware design team had to translate the next-generation gaming principle into useful feature requirements and next-generation game workloads. For the game workloads, the designers' direction came from interaction with game developers, including game engine developers, middleware developers, tool developers, API and driver developers, and game performance experts, both inside and outside Microsoft.
One key next-generation game feature requirement was that the Xbox 360 system must implement a 720p (progressive scan), pervasive high-definition (HD), 16:9 aspect ratio screen in all Xbox 360 games. This feature's architectural implication was that the Xbox 360 required a huge, reliable fill rate.
Another design principle of the Xbox 360 architecture was that it must be flexible to suit the dynamic range of game engines and game developers. The Xbox 360 has a balanced hardware architecture for the software game pipeline, with homogeneous, reallocatable hardware resources that adapt to different game genres, different developer emphases, and even to varying workloads within a frame of a game. In contrast, heterogeneous hardware resources lock software game pipeline performance in each stage and are not reallocatable. Flexibility helps make the design "future proof." The Xbox 360's three CPU cores, 48 unified shaders, and 512-Mbyte DRAM main memory will enable developers
`
Jeff Andrews
Nick Baker
Microsoft Corp.

0272-1732/06/$20.00 © 2006 IEEE
Published by the IEEE Computer Society
`
`
`
`
HOT CHIPS 17

Figure 1. The Xbox 360 game console.
`
to create innovative games for the next five to seven years.
A third design principle was programmability; that is, the Xbox 360 architecture must be easy to program and develop software for. The silicon development team spent much time listening to software developers (we are hardware folks at a software company, after all). There was constant interaction and iteration with software developers at the very beginning of the project and all along the architecture and implementation phases.
This interaction had an interesting dynamic. The software developers weren't shy about their hardware likes and dislikes. Likewise, the hardware team wasn't shy about where next-generation hardware architecture and design were going as a result of changes in silicon processes, hardware architecture, and system design. What followed was further iteration on planned and potential workloads.
An important part of Xbox 360 programmability is that the hardware must present the simplest APIs and programming models to let game developers use hardware resources effectively. We extended programming models that developers liked. Because software developers liked the first Xbox, using it as a working model was natural for the teams. In listening to developers, we did not repackage or include hardware features that developers did not like, even though that may have simplified the hardware implementation. We considered the software tool chain from the very beginning of the project.
Another major design principle was that the Xbox 360 hardware be optimized for achievable performance. To that end, we designed a scalable architecture that provides the greatest usable performance per square millimeter while remaining within the console's system power envelope.
As we continued to work with game developers, we scaled chip implementations to result in balanced hardware for the software game pipeline. Examples of higher-level implementation scalability include the number of CPU cores, the number of GPU shaders, CPU L2 size, bus bandwidths, and main memory size. Other scalable items represented smaller optimizations in each chip.
`
Hardware designed for games
Figure 2 shows a top-level diagram of the Xbox 360 system's core silicon components. The three identical CPU cores share an 8-way set-associative, 1-Mbyte L2 cache and run at 3.2 GHz. Each core contains a complement of four-way single-instruction, multiple-data (SIMD) vector units. The CPU L2 cache, cores, and vector units are customized for Xbox 360 game and 3D graphics workloads.
The front-side bus (FSB) runs at 5.4 Gbit/pin/s, with 16 logical pins in each direction, giving a 10.8-Gbyte/s read and a 10.8-Gbyte/s write bandwidth. The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache.
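As a quick sanity check on the FSB numbers, the quoted per-direction bandwidth follows directly from the per-pin rate and pin count. The helper name and the Gbit = 10^9 bits convention below are our own; only the figures come from the text:

```python
# FSB bandwidth sketch: 16 logical pins per direction at 5.4 Gbit/pin/s.
PIN_RATE_GBIT = 5.4       # Gbit/pin/s
PINS_PER_DIRECTION = 16

def fsb_bandwidth_gbytes(pin_rate_gbit=PIN_RATE_GBIT, pins=PINS_PER_DIRECTION):
    """Per-direction bandwidth in Gbytes/s (8 bits per byte)."""
    return pin_rate_gbit * pins / 8

print(fsb_bandwidth_gbytes())  # 10.8 Gbytes/s read, and the same again for write
```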
`
As Figure 2 shows, the I/O chip supports abundant I/O components. The Xbox media audio (XMA) decoder, custom-designed by Microsoft, provides on-the-fly decoding of a large number of compressed audio streams in hardware. Other custom I/O features include
`
`
`
`
`
`
`DVD (SATA)
`HDD port (SATA)
`Front controllers
`<> Wireless controller
`
`MU ports (2 USB)
`
`Rear panel USB
`
`
`
`Ethernet
`:
`:
`
`eo
`
`Audio out
`“Flash _
`
`
`
`
`
`
`
`
`
`
`
`
`Analog |g. Video oul
`chi oe
`
`Bus interface unit
`
`> Memory controller
`
`Hard disk drive
`Memoryuni
`
`infrared receiver
`; System managernent controller
`
`Xbox media audio
`
`
`
the NAND flash controller and the system management controller (SMC).
The GPU 3D core has 48 parallel, unified shaders. The GPU also includes 10 Mbytes of embedded DRAM (EDRAM), which runs at 256 Gbytes/s for reliable frame and z-buffer bandwidth. The GPU includes interfaces between the CPU, I/O chip, and the GPU internals.
The 512-Mbyte unified main memory controlled by the GPU is a 700-MHz graphics-double-data-rate-3 (GDDR3) memory, which operates at 1.4 Gbit/pin/s and provides a total main memory bandwidth of 22.4 Gbytes/s.
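The quoted totals are consistent with a 128-pin (128-bit) GDDR3 data bus; that bus width is our inference from the numbers, not something the article states. A back-of-envelope sketch reproduces the 22.4-Gbytes/s figure:

```python
# GDDR3 main-memory bandwidth sketch.
# The 128-pin bus width is inferred from the quoted totals (22.4 Gbytes/s
# at 1.4 Gbit/pin/s), not stated explicitly in the text.
PER_PIN_GBIT = 1.4   # Gbit/pin/s (700 MHz, double data rate)
DATA_PINS = 128

def memory_bandwidth_gbytes(per_pin=PER_PIN_GBIT, pins=DATA_PINS):
    return per_pin * pins / 8

print(memory_bandwidth_gbytes())  # 22.4
```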
The DVD and HDD ports are serial ATA (SATA) interfaces. The analog chip drives the HD video out.
`
CPU chip
Figure 3 shows the CPU chip in greater detail. Microsoft's partner for the Xbox 360 CPU is IBM. The CPU implements the PowerPC instruction set architecture, with the VMX SIMD vector instruction set (VMX128) customized for graphics workloads.
The shared L2 allows fine-grained, dynamic allocation of cache lines between the six threads. Commonly, game workloads significantly vary in working-set size. For example, scene management requires walking larger, random-miss-dominated data structures, similar to database searches. At the same time, audio, Xbox procedural synthesis (described later), and many other game processes that require smaller working sets can run concurrently. The shared L2 allows workloads needing larger working sets to allocate significantly more of the L2 than would be available if the system used private L2s (of the same total L2 size) instead.
The CPU core has two-per-cycle, in-order instruction issuance. A separate vector/scalar issue queue (VIQ) decouples instruction issuance between integer and vector instructions for nondependent work. There are two simultaneous multithreading (SMT), fine-grained hardware threads per core. The L1 caches include a two-way set-associative, 32-Kbyte L1 instruction cache and a four-way set-associative, 32-Kbyte L1 data cache. The write-through data cache does not allocate cache lines on writes.
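To see why a no-write-allocate data cache matters for streaming, consider a toy cache model. This is entirely our own construction for illustration, not the Xbox 360's cache logic: a large write-only stream passes through without displacing the read working set.

```python
# Toy model of a write-through data cache with optional write-allocate.
class TinyCache:
    def __init__(self, lines=8, line_bytes=128, write_allocate=False):
        self.lines, self.line_bytes = lines, line_bytes
        self.write_allocate = write_allocate
        self.tags = []                    # LRU order, most recent last

    def _touch(self, addr):
        tag = addr // self.line_bytes
        if tag in self.tags:
            self.tags.remove(tag)
        elif len(self.tags) >= self.lines:
            self.tags.pop(0)              # evict LRU line
        self.tags.append(tag)

    def read(self, addr):
        self._touch(addr)

    def write(self, addr):
        # Write-through: data always goes down the hierarchy.
        # Allocate a line on a write miss only if write_allocate is set.
        tag = addr // self.line_bytes
        if self.write_allocate or tag in self.tags:
            self._touch(tag * self.line_bytes)

no_alloc = TinyCache(write_allocate=False)
alloc = TinyCache(write_allocate=True)
for cache in (no_alloc, alloc):
    for a in range(0, 8 * 128, 128):                   # read working set: 8 lines
        cache.read(a)
    for a in range(10_000, 10_000 + 64 * 128, 128):    # big write-only stream
        cache.write(a)

# No-write-allocate keeps the read working set; write-allocate is thrashed.
print(len([t for t in no_alloc.tags if t < 8]))   # 8
print(len([t for t in alloc.tags if t < 8]))      # 0
```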
`
MARCH–APRIL 2006
`
`
`
`
Figure 3. Xbox 360 CPU block diagram. (VSU: vector/scalar unit; MMU: main-memory unit; PIC: programmable interrupt controller; FPU: floating-point unit; VIQ: vector/scalar issue queue.)
`
The integer execution pipelines include branch, integer, and load/store units. In addition, each core contains an IEEE-754-compliant scalar floating-point unit (FPU), which includes single- and double-precision support at full hardware throughput of one operation per cycle for most operations. Each core also includes the four-way SIMD VMX128 units: floating-point (FP), permute, and simple. As the name implies, the VMX128 includes 128 registers, of 128 bits each, per hardware thread to maximize throughput.
The VMX128 implementation includes an added dot product instruction, common in graphics applications. The dot product implementation adds minimal latency to a multiply-add by simplifying the rounding of intermediate multiply results. The dot product instruction takes far less latency than discrete instructions.
Another addition we made to the VMX128 was direct 3D (D3D) compressed data formats, the same formats supported by the GPU. This allows graphics data to be generated in the CPU and then compressed before being stored in the L2 or memory. Typical use of the compressed formats allows an approximate 50 percent savings in required bandwidth and memory footprint.
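As a rough illustration of where the roughly 50 percent savings comes from, consider compressing 32-bit floats to a 16-bit encoding. The snippet below uses IEEE half precision as a stand-in; the actual D3D formats include several fixed- and floating-point encodings with different range/precision trade-offs, and this sketch is ours, not the console's code path:

```python
import struct

def compress_vertices(floats):
    """Pack 32-bit floats into 16-bit half-precision values ('e' format)."""
    return struct.pack(f'<{len(floats)}e', *floats)

def full_precision(floats):
    """Pack the same values as 32-bit floats for comparison."""
    return struct.pack(f'<{len(floats)}f', *floats)

verts = [0.0, 1.0, -0.5, 0.25] * 256   # toy vertex stream, 1,024 values
half = compress_vertices(verts)
full = full_precision(verts)
print(len(half), len(full))  # 2048 4096: half the bytes to store or ship
```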
`
`
`
`
CPU data streaming
In the Xbox 360, we paid considerable attention to enabling data-streaming workloads, which are not typical PC or server workloads. We added features that allow a given CPU core to execute a high-bandwidth workload (both read and write, but particularly write), while avoiding thrashing its own cache and the shared L2.
First, some features shared among the CPU cores help data streaming. One of these is the 128-byte cache line size in all the CPU L1 and L2 caches. Larger cache line sizes increase FSB and memory efficiency. The L2 includes a cache-set-locking functionality, common in embedded systems but not in PCs.
Specific features that improve streaming bandwidth for writes and reduce thrashing include the write-through L1 data caches. Also, there is no write allocation of L1 data cache lines when writes miss in the L1 data cache. This is important for write streaming because it keeps the L1 data cache from being thrashed by high-bandwidth, transient, write-only data streams.
We significantly upgraded write gathering in the L2. The shared L2 has an uncached unit for each CPU core. Each uncached unit has four noncached write-gathering buffers that allow multiple streams to concurrently gather and dump their gathered payloads to the FSB yet maintain very high uncached write-streaming bandwidth.
The cacheable write streams are gathered by eight nonsequential gathering buffers per CPU core. This allows programming flexibility in the write patterns of cacheable, very high bandwidth write streams into the L2. The write streams can randomly write within a window of a few cache lines without the writes backing up and causing stalls. The cacheable write-gathering buffers effectively act as a bandwidth compression scheme for writes. This is because the L2 data arrays see a much lower bandwidth than the raw bandwidth required by a program's store pattern, which would have low utilization of the L2 cache arrays. Data transformation workloads commonly don't generate the data in a way that allows sequential write behavior. If the write-gathering buffers were not present, software would have to effectively gather write data in the register set before storing. This would put a large amount of pressure on the number of registers and increase latency (and thus throughput) of inner loops of computation kernels.
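A toy model (ours, not the actual hardware) shows the "bandwidth compression" effect: scattered small stores that land within the same line-sized window coalesce into far fewer full-line transactions at the L2 arrays.

```python
import random

LINE = 128   # bytes per gathering window (matches the 128-byte line size)

def l2_transactions(store_addrs, gathered=True):
    """Count L2 array transactions for a sequence of 4-byte stores."""
    if not gathered:
        return len(store_addrs)                      # one transaction per store
    return len({a // LINE for a in store_addrs})     # one per touched line

rng = random.Random(0)
base = 0
stores = []
for _ in range(4096):
    # Random 4-byte stores within a window of a few cache lines...
    stores.append(base + rng.randrange(0, 4 * LINE, 4))
    if rng.random() < 0.05:
        base += LINE                                 # ...that slowly advances

print(l2_transactions(stores, gathered=False))  # 4096
print(l2_transactions(stores, gathered=True))   # far fewer line-sized writes
```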
We applied similar customization to read streaming. For each CPU core, there are eight outstanding loads/prefetches. A custom prefetch instruction, extended data cache block touch (xDCBT), prefetches data, but delivers to the requesting CPU core's L1 data cache and never puts data in the L2 cache as regular prefetch instructions do. This modification seems minor, but it is very important because it allows higher-bandwidth read-streaming workloads to run on as many threads as desired without thrashing the L2 cache. Another option we considered for read streaming would be to lock a set of the L2 per thread for read streaming. In that case, if a user wanted to run four threads concurrently, half the L2 cache would be locked down, hurting workloads requiring a large L2 working-set size. Instead, read streaming occurs through the L1 data cache of the CPU core on which the given thread is operating, effectively giving a private read-streaming first-in, first-out (FIFO) area per thread.
A system feature planned early in the Xbox 360 project was to allow the GPU to directly read data produced by the CPU, with the data never going through the CPU cache's backing store of main memory. In a specific case of this data streaming, called Xbox procedural synthesis (XPS), the CPU is effectively a data decompressor, procedurally generating geometry on the fly for consumption by the GPU 3D core. For 3D games, XPS allows a far greater amount of differentiated geometry than simple traditional instancing allows, which is very important for filling large HD screen worlds with highly detailed geometry.
We added two features specifically to support XPS. The first was support in the GPU and the FSB for a 128-byte GPU read from the CPU. The other was to directly lower communication latency from the GPU back to the CPU by extending the GPU's tail pointer write-back feature.
Tail pointer write-back is a method of controlling communication from the GPU to the CPU by having the CPU poll on a cacheable location, which is updated when a GPU instruction writes an update to the pointer. The system coherency scheme then updates the polling read with the GPU's updated
`
`
`
`
Figure 4. CPU cached data-streaming example. Per-instance input data is prefetched from memory around the L2 into Core 0's L1 data cache with xDCBT 128-byte prefetches; compressed data is generated by VMX stores to a locked set in the L2 (nonsequential gathering); the GPU then performs 128-byte reads from the L2 over the FSB.
`
pointer value. Tail write-backs reduce communication latency compared to using interrupts. We lowered GPU-to-CPU communication latency even further by implementing the tail pointer's backing-store target on the CPU die. This avoids the round-trip from CPU to memory when the GPU pointer update causes a probe and castout of the CPU cache data, requiring the CPU to refetch the data all the way from memory. Instead, the refetch never leaves the CPU die. This lower latency translates into smaller streaming FIFOs in the L2's locked set.
A previously mentioned feature very important to XPS is the addition of the D3D compressed formats that we implemented in both the CPU and the GPU. To get an idea of this feature's usefulness, consider this: Given a typical average of 2:1 compression and an XPS-targeted 9-Gbytes/s FSB bandwidth, the CPU cores can generate up to 18 Gbytes/s of effective geometry and other graphics data and ship it to the GPU 3D core. Main memory sees none of this data traffic (or footprint).
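The arithmetic behind that claim is simple enough to restate directly; the 2:1 ratio and the 9-Gbytes/s target are the article's figures, while the helper below is our own wrapper:

```python
def effective_bandwidth_gbytes(fsb_gbytes=9.0, compression_ratio=2.0):
    """Effective post-decompression data rate seen by the GPU 3D core."""
    return fsb_gbytes * compression_ratio

print(effective_bandwidth_gbytes())  # 18.0
```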
`
CPU cached data-streaming example
Figure 4 illustrates an example of the Xbox 360 using its data-streaming features for an XPS workload. Consider the XPS workload, acting as a decompression kernel running on one or more CPU SMT hardware threads.
First, the XPS kernel must fetch new, unique
`
`
`
data from memory to enable generation of the given piece of geometry. This likely includes world space coordinate data and specific data to make each geometry instance unique. The XPS kernel prefetches this read data during a previous geometry generation iteration to cover the fetch's memory latency. Because none of the per-instance read data is typically reused between threads, the XPS kernel fetches it using the xDCBT prefetch instruction around the L2, which puts it directly into the requesting CPU core's L1 data cache. Prefetching around the L2 separates the read data stream from the write data stream, avoiding L2 cache thrashing. Figure 4 shows this step as a solid-line arc from memory to Core 0's L1 data cache.
The XPS kernel then crunches the data, primarily using the VMX128 computation ability to generate far more geometry data than the amount read from memory. Before the data is written out, the XPS kernel compresses it, using the D3D compressed data formats, which offer simple trade-offs between number of bits, range, and precision. The XPS kernel stores these results as generated to the locked set in the L2, with only minimal attention to the write access pattern's randomness (for example, the kernel places write accesses within a few cache lines of each other for efficient gathering). Furthermore, because of the write-through and no-write-allocate nature of the L1 data caches, none of the write data will thrash the L1 data cache of the CPU core. The diagram shows this step as a dashed-line arc from load/store in Core 0 to the locked set in L2.
Once the CPU core has issued the stores, the store data sits in the gathering buffers waiting for more data until timed out or forced out by incoming write data demanding new 64-byte ranges. The XPS output data is written to software-managed FIFOs in the L2 data arrays in a locked set in the L2 (the unshaded box in Figure 4). There are multiple FIFOs in one locked set, so multiple threads can share one L2 set. This is possible within the 128 Kbytes of one set because tail pointer write-back communication frees completed FIFO area with lowered latency. Using the locked set is important: otherwise, high-bandwidth write streams would thrash the L2 working set.
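The software-managed FIFO protocol can be sketched as a ring buffer in which the producer (the CPU) advances a head index and watches a tail index that the consumer (standing in for the GPU's tail pointer write-back) publishes as it finishes reading. Everything here, from names to sizes to the single-threaded driver loop, is our illustrative construction, not the console's code:

```python
class StreamFifo:
    """Toy ring-buffer FIFO modeled on the L2 locked-set streaming buffers."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0   # next slot the producer writes
        self.tail = 0   # published by the consumer (the "tail pointer write-back")

    def free_slots(self):
        return len(self.buf) - (self.head - self.tail)

    def produce(self, item):
        # Real code would poll the cacheable tail location until space frees up.
        if self.free_slots() == 0:
            raise BlockingIOError("FIFO full; producer must poll and wait")
        self.buf[self.head % len(self.buf)] = item
        self.head += 1

    def consume(self, n):
        n = min(n, self.head - self.tail)
        out = [self.buf[(self.tail + i) % len(self.buf)] for i in range(n)]
        self.tail += n          # publishing the new tail frees FIFO space
        return out

fifo = StreamFifo(capacity=4)
for i in range(4):
    fifo.produce(i)
got = fifo.consume(2)            # consumer reads, publishes a new tail
fifo.produce(4)                  # producer immediately reuses the freed space
fifo.produce(5)
print(got, fifo.consume(4))      # [0, 1] [2, 3, 4, 5]
```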
`Next, when more data is available to the
Figure 5. Xbox 360 CPU die photograph.
GPU, the CPU notifies the GPU that the GPU can advance within the FIFO, and the GPU performs 128-byte reads to the FSB. This step is shown in the diagram as the dotted-line arc starting in the L2 and going to the GPU. The GPU design incorporates special features allowing it to read from the FSB, in contrast with the normal GPU read from main memory. The GPU also has an added 128-byte fetch, which enables maximum FSB and L2 data array utilization.
The two final steps are not shown in the diagram. First, the GPU uses the corresponding D3D compressed data format support to expand the compressed D3D formats into single-precision floating-point formats native to the 3D core. Then, the GPU commands tail pointer write-backs to the CPU to indicate that the GPU has finished reading the data. This tells the streaming FIFO's CPU software control that the given FIFO space is now free to be written with new geometry or index data.
Figure 5 shows a photo of the CPU die, which contains 163 million transistors in an IBM second-generation 90-nm silicon-on-insulator (SOI) enhanced transistor process.
`
`
`
`
`
Graphics processing unit
The GPU is the latest-generation graphics processor from ATI. It runs at 500 MHz and consists of 48 parallel, combined vector and scalar shader ALUs. Unlike earlier graphics engines, the shaders are dynamically allocated, meaning that there are no distinct vertex or pixel shader engines; the hardware automatically adjusts to the load on a fine-grained basis. The hardware is fully compatible with D3D 9.0 and High-Level Shader Language (HLSL) 3.0, with extensions.
The ALUs are 32-bit IEEE 754 floating-point ALUs, with relatively common graphics simplifications of rounding modes, denormalized numbers (flush to zero on reads), NaN handling, and exception handling. They are capable of vector (including dot product) and scalar operations with single-cycle throughput; that is, all operations issue every cycle. The superscalar instructions encode vector, scalar, texture load, and vertex fetch within one instruction. This allows peak processing of 96 shader calculations per cycle while fetching textures and vertices.
Feeding the shaders are 16 texture fetch engines, each capable of producing a filtered result in each cycle. In addition, there are 16 programmable vertex fetch engines with built-in tessellation that the system can use instead of CPU geometry generation. Finally, there are 16 interpolators in dedicated hardware.
The render back end can sustain eight pixels per cycle, or 16 pixels per cycle for depth- and stencil-only rendering (used in z-prepass or shadow buffers). The dedicated z or blend logic and the EDRAM guarantee that eight pixels per cycle can be maintained even with 4x antialiasing and transparency. The z-prepass is a technique that performs a first-pass rendering of a command list, with no rendering features applied except occlusion determination. The z-prepass initializes the z-buffer so that on a subsequent rendering pass with full texturing and shaders applied, the hardware won't spend shader and texturing resources on occluded pixels. With modern scene depth complexity, this technique significantly improves rendering performance, especially with complex shader programs.
As an example benchmark, the GPU can render each pixel with 4x antialiasing, a z-buffer, six shader operations, and two texture fetches, and can sustain this at eight pixels per cycle. This blazing fill rate enables the Xbox 360 to deliver HD-resolution rendering simultaneously with many state-of-the-art effects that traditionally would be mutually exclusive because of fill rate limitations. For example, games can mix particle effects, high-dynamic-range (HDR) lighting, fur, depth-of-field, motion blur, and other complex effects.
For next-generation geometric detail, shading, and fill rate, the pipeline's front end can process one triangle or vertex per cycle. These are essentially full-featured vertices (rather than a single parameter), with the practical limitation of required memory bandwidth and storage. To overcome this limitation, several compressed formats are available for each data type. In addition, XPS can transiently generate data on the fly within the CPU and pass it efficiently to the GPU without a main memory pass.
The EDRAM removes the render-target and z-buffer fill rate from the bandwidth equation. The EDRAM resides on a separate die from the main portion of GPU logic. The EDRAM die also contains dedicated alpha blend, z-test, and antialiasing logic. The interface to the EDRAM macro runs at 256 Gbytes/s: (8 pixels/cycle + 8 z-compares/cycle) x (read + write) x 32 bits/sample x 4 samples/pixel x 500 MHz.
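Multiplying that expression out reproduces the quoted figure; this is a direct restatement of the article's own formula, with only the helper name being ours:

```python
def edram_bandwidth_gbytes():
    samples_per_cycle = 8 + 8        # 8 pixels + 8 z-compares per cycle
    directions = 2                   # read + write
    bits_per_sample = 32
    samples_per_pixel = 4            # 4x multisampling
    clock_hz = 500e6
    bits_per_s = (samples_per_cycle * directions * bits_per_sample
                  * samples_per_pixel * clock_hz)
    return bits_per_s / 8 / 1e9      # bits/s -> Gbytes/s

print(edram_bandwidth_gbytes())  # 256.0
```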
The GPU supports several pixel depths; 32 bits per pixel (bpp) and 64 bpp are the most common, but there is support for up to 128 bpp for multiple-render-target (MRT) or floating-point output. MRT is a graphics technique of outputting more than one piece of data per sample to the effective frame buffer, interleaved efficiently to minimize the performance impact of having more data. The data is used later for a variety of advanced graphics effects. To optimize space, the GPU supports 32-bpp and 64-bpp HDR lighting formats. The EDRAM only supports rendering operations to the render target and z-buffer. For render-to-texture, the GPU must "flush" the appropriate buffer to main memory before using the buffer as a texture.
Unlike a fine-grained tiler architecture, the GPU can achieve common HD resolutions and bit depths within a couple of EDRAM tiles. This simplifies the problem substantially. Traditional tiling architectures typically include a
`
`
`
`
Figure 6. GPU block diagram (main die and EDRAM die, connected by the high-speed I/O bus).
`
whole process inserted in the traditional graphics pipeline for binning the geometry into a large number of bins. Handling the bins in a high-performance manner is complicated (for example, overflow cases, memory footprint, and bandwidth). Because the GPU's EDRAM usually requires only a couple of bins, bin handling is greatly simplified, allowing more-optimal hardware-software partitioning.
With a binning architecture, the full command list must be presented before rendering. The hardware uses a few tricks to speed this process up. Rendering increasingly relies on a z-prepass to prepare the z-buffer before executing complex pixel shader algorithms. We take advantage of this by collecting object extent information during this pass, as well as priming a full-resolution hierarchical z-buffer. We use the extent information to set flags to skip command list sections not needed within a tile. The full-resolution hi-z buffer retains its state between tiles.
In another interesting extension to normal D3D, the GPU supports a shader export feature that allows data to be output directly from the shader to a buffer in memory. This lets the GPU serve as a vector math engine if needed, as well as allowing multipass shaders. The latter can be useful for subdivision surfaces. In addition, the display pipeline includes an in-line scaler that resizes the frame buffer on the fly as it is output. This feature allows games to pick a rendering resolution to work with and then lets the display hardware make the best match to the display resolution.
As Figure 6 shows, the GPU consists of the following blocks:

• Bus interface unit. This interface to the FSB handles CPU-initiated transactions, as well as GPU-initiated transactions such as snoops and L2 cache reads.
• I/O controller. Handles all internal memory-mapped I/O accesses, as well as transactions to and from the I/O chip via the two-lane PCI-Express bus (PCI-E).
• Memory controllers (MC0, MC1). These 128-byte interleaved GDDR3 memory controllers contain aggressive address tiling for graphics and a fast path to minimize CPU latency.
• Memory interface. Memory crossbar and buffering for non-CPU initiators (such as graphics, I/O, and display).
`
`
`
`
Figure 7. Xbox 360 GPU die (photo courtesy of Taiwan Semiconductor Manufacturing Co.).

Figure 8. Xbox 360 GPU EDRAM die (photo courtesy of NEC Electronics).
`
• Graphics. This block, the largest on the chip, contains the rendering engine.
• High-speed I/O bus. This bus between the graphics core and the EDRAM die is a chip-to-chip bus (via substrate) operating at 1.8 GHz and 28.8 Gbytes/s. When multisample antialiasing is used, only pixel center data and coverage information is transferred and then expanded on the EDRAM die.
• Antialiasing and AlphaZ (AA+AZ). Handles pixel-to-sample expansion, as well as z-test and alpha blend.
• Display.

Figures 7 and 8 show photos of the GPU "parent" and EDRAM ("daughter") dies. The parent die contains 232 million transistors in a TSMC 90-nm GT process. The EDRAM die contains 100 million transistors in an NEC 90-nm process.
`
Architectural choices
The major choices we made in designing the Xbox 360 architecture were to use chip multiprocessing (CMP), in-order issuance cores, and EDRAM.
`
Chip multiprocessing
Our reasons for using multiple CPU cores on one chip in the Xbox 360 were relatively straightforward. The combination of power consumption and diminishing returns from instruction-level parallelism (ILP) is driving the industry in general to multicore. CMP is a natural twist on traditional symmetric multiprocessing (SMP), in which all the CPU cores are symmetric and have a common view of main memory but are on the same die versus separate chips. Modern process geometries afford hardware designers the flexibility of CMP, which was usually too costly in die area previously. Having multiple cores on one chip is more cost-effective. It enables a shared L2 implementation and minimizes communication latency between cores, resulting in higher overall performance for the same die area and power consumption.
In addition, we wanted to optimize the architecture for the workload, optimize in-game utilization of silicon area, and keep the system easy to program. These goals made CMP a good choice for several reasons.
First, for the game workload, both integer and floating-point performance are important. The high-level game code is generally a database management problem, with plenty of object-oriented code and pointer manipulation. Such a workload needs a large L2 and high integer performance. The CMP shared L2 with its fine-grained, dynamic allocation
`
`
`
`
means this workload can use a large working set in the L2 while running. In addition, several sections of the application lend themselves well to vector floating-point acceleration.
Second, to optimize silicon area, we can take advantage of two factors. To start with, we are presenting a stable platform for the product's lifetime. This means tools and programming expertise will mature significantly, so we can rely more on generating code than optimizing performance at runtime. Moreover, all Xbox 360 games (as opposed to Xbox games from Microsoft's first game console, which are emulated on Xbox 360) are compiled from scratch and optimized for the current microarchitecture. We don't have the problem of running legacy, but compatible, instruction set architecture executables that were compiled and optimized for a completely different microarchitecture. This problem has significant implications for CPU microarchitectures in the PC and server markets.
Third, although we knew multicore was the way to go, the tools and programming expertise for multithreaded programming are certainly not mature, presenting a problem for our goal of keeping programming easy. For the types of workloads present in a game engine, we could justify at most six to eight threads in the system. The solution was to adapt the "more-but-simpler" philosophy to the CPU core topology. The key was keeping the number of hardware threads limited, thus increasing the chance that they would be used effectively. We decided the best approach was to tightly couple dedicated vector math engines to integer cores rather than making them autonomous. This keeps the number of threads low and allows vector math routines to be optimized and run on separate threads if necessary.
`
In-order issuance cores
The Xbox 360 CPU contains three two-issue, in-order instruction issuance cores. Each core has two SMT hardware threads, which support fine-grained instruction issuance. The cores allow out-of-order execution in the common cases of loads and vector/floating-point versus integer instructions. Loads, which are treated as prefetches, don't stall until a load dependency is present. Vector and floating-point operations have their own, decoupled vector/float issue queue (VIQ),