XBOX 360
SYSTEM ARCHITECTURE

THIS ARTICLE COVERS THE XBOX 360'S HIGH-LEVEL TECHNICAL REQUIREMENTS, A SHORT SYSTEM OVERVIEW, AND DETAILS OF THE CPU AND THE GPU. THE AUTHORS DESCRIBE THEIR ARCHITECTURAL TRADE-OFFS AND SUMMARIZE THE SYSTEM'S SOFTWARE PROGRAMMING SUPPORT.

Jeff Andrews
Nick Baker
Microsoft Corp.
Microsoft's Xbox 360 game console is the first of the latest generation of game consoles. Historically, game console architecture and design implementations have provided large discrete jumps in system performance, approximately at five-year intervals. Over the last several generations, game console systems have increasingly become graphics supercomputers in their own right, particularly at the launch of a given game console generation.

The Xbox 360, pictured in Figure 1, contains an aggressive hardware architecture and implementation targeted at game console workloads. The core silicon implements the product designers' goal of providing game developers a hardware platform to implement their next-generation game ambitions. The core chips include the standard conceptual blocks of CPU, graphics processing unit (GPU), memory, and I/O. Each of these components and their interconnections are customized to provide a user-friendly game console product.
Design principles

One of the Xbox 360's main design principles is the next-generation gaming principle—that is, a new game console must provide value to customers for five to seven years. Thus, as for any true next-generation game console hardware, the Xbox 360 delivers a huge discrete jump in hardware performance for gaming.
The Xbox 360 hardware design team had to translate the next-generation gaming principle into useful feature requirements and next-generation game workloads. For the game workloads, the designers' direction came from interaction with game developers, including game engine developers, middleware developers, tool developers, API and driver developers, and game performance experts, both inside and outside Microsoft.

One key next-generation game feature requirement was that the Xbox 360 system must implement a 720p (progressive scan), pervasive high-definition (HD), 16:9 aspect ratio screen in all Xbox 360 games. This feature's architectural implication was that the Xbox 360 required a huge, reliable fill rate.
Another design principle of the Xbox 360 architecture was that it must be flexible to suit the dynamic range of game engines and game developers. The Xbox 360 has a balanced hardware architecture for the software game pipeline, with homogeneous, reallocatable hardware resources that adapt to different game genres, different developer emphases, and even to varying workloads within a frame of a game. In contrast, heterogeneous hardware resources lock software game pipeline performance in each stage and are not reallocatable. Flexibility helps make the design "futureproof." The Xbox 360's three CPU cores, 48 unified shaders, and 512-Mbyte DRAM main memory will enable developers to create innovative games for the next five to seven years.
Figure 1. Xbox 360 game console.
A third design principle was programmability; that is, the Xbox 360 architecture must be easy to program and develop software for. The silicon development team spent much time listening to software developers (we are hardware folks at a software company, after all). There was constant interaction and iteration with software developers at the very beginning of the project and all along the architecture and implementation phases.

This interaction had an interesting dynamic. The software developers weren't shy about their hardware likes and dislikes. Likewise, the hardware team wasn't shy about where next-generation hardware architecture and design were going as a result of changes in silicon processes, hardware architecture, and system design. What followed was further iteration on planned and potential workloads.

An important part of Xbox 360 programmability is that the hardware must present the simplest APIs and programming models to let game developers use hardware resources effectively. We extended programming models that developers liked. Because software developers liked the first Xbox, using it as a working model was natural for the teams. In listening to developers, we did not repackage or include hardware features that developers did not like, even though that may have simplified the hardware implementation. We considered the software tool chain from the very beginning of the project.
Another major design principle was that the Xbox 360 hardware be optimized for achievable performance. To that end, we designed a scalable architecture that provides the greatest usable performance per square millimeter while remaining within the console's system power envelope.

As we continued to work with game developers, we scaled chip implementations to result in balanced hardware for the software game pipeline. Examples of higher-level implementation scalability include the number of CPU cores, the number of GPU shaders, CPU L2 size, bus bandwidths, and main memory size. Other scalable items represented smaller optimizations in each chip.
Hardware designed for games

Figure 2 shows a top-level diagram of the Xbox 360 system's core silicon components. The three identical CPU cores share an 8-way set-associative, 1-Mbyte L2 cache and run at 3.2 GHz. Each core contains a complement of four-way single-instruction, multiple-data (SIMD) vector units. The CPU L2 cache, cores, and vector units are customized for Xbox 360 game and 3D graphics workloads.
The front-side bus (FSB) runs at 5.4 Gbit/pin/s, with 16 logical pins in each direction, giving a 10.8-Gbyte/s read and a 10.8-Gbyte/s write bandwidth. The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache.
As Figure 2 shows, the I/O chip supports abundant I/O components. The Xbox media audio (XMA) decoder, custom-designed by Microsoft, provides on-the-fly decoding of a large number of compressed audio streams in hardware. Other custom I/O features include the NAND flash controller and the system management controller (SMC).
Figure 2. Xbox 360 system block diagram.
The GPU 3D core has 48 parallel, unified shaders. The GPU also includes 10 Mbytes of embedded DRAM (EDRAM), which runs at 256 Gbytes/s for reliable frame and z-buffer bandwidth. The GPU includes interfaces between the CPU, I/O chip, and the GPU internals.
The 512-Mbyte unified main memory controlled by the GPU is a 700-MHz graphics-double-data-rate-3 (GDDR3) memory, which operates at 1.4 Gbit/pin/s and provides a total main memory bandwidth of 22.4 Gbytes/s.
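The quoted bus and memory figures follow directly from signaling rate times interface width. The following is a minimal sketch (C++, not from the article) that checks the arithmetic; the 128-bit GDDR3 interface width is an inference from the quoted 22.4-Gbytes/s total, not a number stated in the text.

    // Hedged sanity check of the quoted bandwidths (signaling rate x width / 8).
    constexpr long kFsbMBytesPerSec   = 5400L * 16 / 8;   // 10,800 MB/s (10.8 GB/s) each direction
    constexpr long kGddr3MBytesPerSec = 1400L * 128 / 8;  // 22,400 MB/s (22.4 GB/s); 128-bit width assumed
    static_assert(kFsbMBytesPerSec == 10800, "FSB read (and write) bandwidth");
    static_assert(kGddr3MBytesPerSec == 22400, "GDDR3 main-memory bandwidth");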
The DVD and HDD ports are serial ATA (SATA) interfaces. The analog chip drives the HD video out.
CPU chip

Figure 3 shows the CPU chip in greater detail. Microsoft's partner for the Xbox 360 CPU is IBM. The CPU implements the PowerPC instruction set architecture, with the VMX SIMD vector instruction set (VMX128) customized for graphics workloads.
The shared L2 allows fine-grained, dynamic allocation of cache lines between the six threads. Commonly, game workloads significantly vary in working-set size. For example, scene management requires walking larger, random-miss-dominated data structures, similar to database searches. At the same time, audio, Xbox procedural synthesis (described later), and many other game processes that require smaller working sets can run concurrently. The shared L2 allows workloads needing larger working sets to allocate significantly more of the L2 than would be available if the system used private L2s (of the same total L2 size) instead.
The CPU core has two-per-cycle, in-order instruction issuance. A separate vector/scalar issue queue (VIQ) decouples instruction issuance between integer and vector instructions for nondependent work. There are two symmetric multithreading (SMT), fine-grained hardware threads per core. The L1 caches include a two-way set-associative, 32-Kbyte L1 instruction cache and a four-way set-associative, 32-Kbyte L1 data cache. The write-through data cache does not allocate cache lines on writes.
Figure 3. Xbox 360 CPU block diagram (VSU: vector/scalar unit; MMU: main-memory unit; PIC: programmable interrupt controller; FPU: floating-point unit; VIQ: vector/scalar issue queue).
The integer execution pipelines include branch, integer, and load/store units. In addition, each core contains an IEEE-754-compliant scalar floating-point unit (FPU), which includes single- and double-precision support at full hardware throughput of one operation per cycle for most operations. Each core also includes the four-way SIMD VMX128 units: floating-point (FP), permute, and simple. As the name implies, the VMX128 includes 128 registers, of 128 bits each, per hardware thread to maximize throughput.
The VMX128 implementation includes an added dot product instruction, common in graphics applications. The dot product implementation adds minimal latency to a multiply-add by simplifying the rounding of intermediate multiply results. The dot product instruction takes far less latency than discrete instructions.

Another addition we made to the VMX128 was Direct3D (D3D) compressed data formats, the same formats supported by the GPU. This allows graphics data to be generated in the CPU and then compressed before being stored in the L2 or memory. Typical use of the compressed formats allows an approximate 50 percent savings in required bandwidth and memory footprint.
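The individual compressed formats are not spelled out here, so the following is only an illustrative sketch: packing 32-bit floats into 16-bit signed-normalized components, which halves the size of CPU-generated vertex data and matches the roughly 2:1 saving described above. The names and encoding are hypothetical, not the actual Xbox 360 D3D formats.

    // Illustrative only: a generic 16-bit signed-normalized packing, standing in
    // for the (unspecified here) D3D compressed vertex formats.
    #include <cstdint>

    // Convert one float in [-1.0, 1.0] to a 16-bit signed-normalized value.
    static inline int16_t PackSnorm16(float x)
    {
        if (x > 1.0f)  x = 1.0f;      // clamp to the representable range
        if (x < -1.0f) x = -1.0f;
        return static_cast<int16_t>(x * 32767.0f);
    }

    struct PackedVector4              // 8 bytes instead of 16 for four floats
    {
        int16_t x, y, z, w;
    };

    static inline PackedVector4 PackVector4(float x, float y, float z, float w)
    {
        PackedVector4 v = { PackSnorm16(x), PackSnorm16(y), PackSnorm16(z), PackSnorm16(w) };
        return v;
    }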
CPU data streaming

In the Xbox, we paid considerable attention to enabling data-streaming workloads, which are not typical PC or server workloads. We added features that allow a given CPU core to execute a high-bandwidth workload (both read and write, but particularly write), while avoiding thrashing its own cache and the shared L2.
First, some features shared among the CPU cores help data streaming. One of these is 128-byte cache line sizes in all the CPU L1 and L2 caches. Larger cache line sizes increase FSB and memory efficiency. The L2 includes a cache-set-locking functionality, common in embedded systems but not in PCs.

Specific features that improve streaming bandwidth for writes and reduce thrashing include the write-through L1 data caches. Also, there is no write allocation of L1 data cache lines when writes miss in the L1 data cache. This is important for write streaming because it keeps the L1 data cache from being thrashed by high-bandwidth, transient, write-only data streams.
We significantly upgraded write gathering in the L2. The shared L2 has an uncached unit for each CPU core. Each uncached unit has four noncached write-gathering buffers that allow multiple streams to concurrently gather and dump their gathered payloads to the FSB yet maintain very high uncached write-streaming bandwidth.

The cacheable write streams are gathered by eight nonsequential gathering buffers per CPU core. This allows programming flexibility in the write patterns of cacheable, very high bandwidth write streams into the L2. The write streams can randomly write within a window of a few cache lines without the writes backing up and causing stalls. The cacheable write-gathering buffers effectively act as a bandwidth compression scheme for writes. This is because the L2 data arrays see a much lower bandwidth than the raw bandwidth required by a program's store pattern, which would have low utilization of the L2 cache arrays. Data transformation workloads commonly don't generate the data in a way that allows sequential write behavior. If the write-gathering buffers were not present, software would have to effectively gather write data in the register set before storing. This would put a large amount of pressure on the number of registers and increase latency (and thus throughput) of inner loops of computation kernels.
We applied similar customization to read streaming. For each CPU core, there are eight outstanding loads/prefetches. A custom prefetch instruction, extended data cache block touch (xDCBT), prefetches data but delivers it to the requesting CPU core's L1 data cache and never puts data in the L2 cache as regular prefetch instructions do. This modification seems minor, but it is very important because it allows higher-bandwidth read-streaming workloads to run on as many threads as desired without thrashing the L2 cache. Another option we considered for read streaming was to lock a set of the L2 per thread for read streaming. In that case, if a user wanted to run four threads concurrently, half the L2 cache would be locked down, hurting workloads requiring a large L2 working-set size. Instead, read streaming occurs through the L1 data cache of the CPU core on which the given thread is operating, effectively giving a private read-streaming first-in, first-out (FIFO) area per thread.
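The text does not give the compiler-level spelling of the xDCBT prefetch, so the sketch below uses a hypothetical PrefetchAroundL2() stand-in simply to show the shape of a streaming-read loop: stay a few 128-byte lines ahead of the consumer, filling only the requesting core's L1 data cache.

    // Hedged sketch of a streaming-read loop. PrefetchAroundL2() is a hypothetical
    // stand-in for whatever intrinsic exposes xDCBT; the real name is not given here.
    #include <cstddef>

    void PrefetchAroundL2(const void* addr);                  // assumed: L1-only prefetch
    void ProcessLine(const unsigned char* line, size_t len);  // application-specific work

    void ConsumeStream(const unsigned char* src, size_t bytes)
    {
        const size_t kLine  = 128;        // Xbox 360 L1/L2 cache-line size
        const size_t kAhead = 4 * kLine;  // run a few lines ahead of the consumer

        for (size_t i = 0; i < bytes; i += kLine)
        {
            if (i + kAhead < bytes)
                PrefetchAroundL2(src + i + kAhead);  // fills the L1, bypasses the shared L2
            ProcessLine(src + i, kLine);
        }
    }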
A system feature planned early in the Xbox 360 project was to allow the GPU to directly read data produced by the CPU, with the data never going through the CPU cache's backing store of main memory. In a specific case of this data streaming, called Xbox procedural synthesis (XPS), the CPU is effectively a data decompressor, procedurally generating geometry on the fly for consumption by the GPU 3D core. For 3D games, XPS allows a far greater amount of differentiated geometry than simple traditional instancing allows, which is very important for filling large HD screen worlds with highly detailed geometry.

We added two features specifically to support XPS. The first was support in the GPU and the FSB for a 128-byte GPU read from the CPU. The other was to directly lower communication latency from the GPU back to the CPU by extending the GPU's tail pointer write-back feature.
Tail pointer write-back is a method of controlling communication from the GPU to the CPU by having the CPU poll on a cacheable location, which is updated when a GPU instruction writes an update to the pointer. The system coherency scheme then updates the polling read with the GPU's updated pointer value.
Figure 4. CPU cached data-streaming example: xDCBT 128-byte prefetches go around the L2 into the L1 data cache; compressed VMX store data is gathered nonsequentially into a locked set in the L2; the GPU performs 128-byte reads from the L2 over the FSB.
Tail write-backs reduce communication latency compared to using interrupts. We lowered GPU-to-CPU communication latency even further by implementing the tail pointer's backing-store target on the CPU die. This avoids the round trip from CPU to memory when the GPU pointer update causes a probe and castout of the CPU cache data, requiring the CPU to refetch the data all the way from memory. Instead, the refetch never leaves the CPU die. This lower latency translates into smaller streaming FIFOs in the L2's locked set.
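A minimal sketch of the software side of this flow control follows. All names are hypothetical; the point is only that the CPU producer polls a cacheable tail pointer that the GPU writes back as it finishes reading, and reuses FIFO space only once the tail has passed it.

    // Hedged sketch of tail-pointer flow control (all names hypothetical).
    #include <cstdint>

    struct StreamFifo
    {
        uint32_t           writeOffset;  // CPU's next write position, in bytes
        volatile uint32_t* gpuTail;      // cacheable location written back by the GPU
        uint32_t           size;         // FIFO size, e.g. a slice of the locked L2 set
    };

    // Bytes the CPU may write without overrunning data the GPU has not read yet.
    static uint32_t FreeSpace(const StreamFifo& f)
    {
        uint32_t tail = *f.gpuTail;      // polling read; coherency delivers the GPU's update
        return (tail + f.size - f.writeOffset - 1) % f.size;
    }

    static void WaitForSpace(const StreamFifo& f, uint32_t bytes)
    {
        while (FreeSpace(f) < bytes)
        {
            // Spin; the GPU's tail pointer write-back lands in the CPU cache,
            // so no interrupt round trip is needed.
        }
    }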
A previously mentioned feature very important to XPS is the addition of D3D compressed formats, which we implemented in both the CPU and the GPU. To get an idea of this feature's usefulness, consider this: Given a typical average of 2:1 compression and an XPS-targeted 9-Gbytes/s FSB bandwidth, the CPU cores can generate up to 18 Gbytes/s of effective geometry and other graphics data and ship it to the GPU 3D core. Main memory sees none of this data traffic (or footprint).
CPU cached data-streaming example

Figure 4 illustrates an example of the Xbox 360 using its data-streaming features for an XPS workload. Consider the XPS workload, acting as a decompression kernel running on one or more CPU SMT hardware threads.
First, the XPS kernel must fetch new, unique data from memory to enable generation of the given piece of geometry. This likely includes world-space coordinate data and specific data to make each geometry instance unique. The XPS kernel prefetches this read data during a previous geometry-generation iteration to cover the fetch's memory latency. Because none of the per-instance read data is typically reused between threads, the XPS kernel fetches it using the xDCBT prefetch instruction around the L2, which puts it directly into the requesting CPU core's L1 data cache. Prefetching around the L2 separates the read data stream from the write data stream, avoiding L2 cache thrashing. Figure 4 shows this step as a solid-line arc from memory to Core 0's L1 data cache.
The XPS kernel then crunches the data, primarily using the VMX128 computation ability to generate far more geometry data than the amount read from memory. Before the data is written out, the XPS kernel compresses it, using the D3D compressed data formats, which offer simple trade-offs between number of bits, range, and precision. The XPS kernel stores these results as generated to the locked set in the L2, with only minimal attention to the write access pattern's randomness (for example, the kernel places write accesses within a few cache lines of each other for efficient gathering). Furthermore, because of the write-through and no-write-allocate nature of the L1 data caches, none of the write data will thrash the L1 data cache of the CPU core. The diagram shows this step as a dashed-line arc from load/store in Core 0 to the locked set in the L2.
Once the CPU core has issued the stores, the store data sits in the gathering buffers waiting for more data until timed out or forced out by incoming write data demanding new 64-byte ranges. The XPS output data is written to software-managed FIFOs in the L2 data arrays in a locked set in the L2 (the unshaded box in Figure 4). There are multiple FIFOs in one locked set, so multiple threads can share one L2 set. This is possible within the 128 Kbytes of one set because tail pointer write-back communication frees completed FIFO area with lowered latency. Using the locked set is important; otherwise, high-bandwidth write streams would thrash the L2 working set.
Next, when more data is available to the GPU, the CPU notifies the GPU that the GPU can advance within the FIFO, and the GPU performs 128-byte reads to the FSB. This step is shown in the diagram as the dotted-line arc starting in the L2 and going to the GPU. The GPU design incorporates special features allowing it to read from the FSB, in contrast with the normal GPU read from main memory. The GPU also has an added 128-byte fetch, which enables maximum FSB and L2 data array utilization.
The two final steps are not shown in the diagram. First, the GPU uses the corresponding D3D compressed data format support to expand the compressed D3D formats into single-precision floating-point formats native to the 3D core. Then, the GPU commands tail pointer write-backs to the CPU to indicate that the GPU has finished reading the data. This tells the streaming FIFOs' CPU software control that the given FIFO space is now free to be written with new geometry or index data.
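Pulling the four steps together, the following is a schematic sketch of one XPS producer iteration. Every helper named below is hypothetical (declarations only, standing in for the mechanisms described above); the sketch shows only the order of operations, not real Xbox 360 APIs.

    // Hedged, schematic view of one XPS producer iteration (hypothetical names).
    void PrefetchAroundL2(const void* p);                            // xDCBT-style, L1 only
    void GenerateGeometry(const void* instance, float* out);         // VMX128 expansion
    unsigned CompressToD3DFormats(const float* in, void* out);       // roughly 2:1 packing
    void WaitForFifoSpace(unsigned bytes);                           // poll tail pointer write-back
    void GatheredStoreToLockedSet(const void* src, unsigned bytes);  // write-gathered L2 stores
    void AdvanceGpuReadWindow(unsigned bytes);                       // tell the GPU it may read on

    void XpsIteration(const void* thisInstance, const void* nextInstance)
    {
        PrefetchAroundL2(nextInstance);            // 1. hide the next instance's memory latency

        float expanded[1024];                      // 2. expand instance data into geometry
        GenerateGeometry(thisInstance, expanded);  //    (buffer sizes are arbitrary here)

        unsigned char packed[2048];                // 3. compress before storing
        unsigned bytes = CompressToD3DFormats(expanded, packed);

        WaitForFifoSpace(bytes);                   // 4. respect what the GPU has consumed,
        GatheredStoreToLockedSet(packed, bytes);   //    stream into the locked L2 set,
        AdvanceGpuReadWindow(bytes);               //    and notify the GPU
    }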
Figure 5 shows a photo of the CPU die, which contains 163 million transistors in an IBM second-generation 90-nm silicon-on-insulator (SOI) enhanced transistor process.

Figure 5. Xbox 360 CPU die photo.
Graphics processing unit

The GPU is the latest-generation graphics processor from ATI. It runs at 500 MHz and consists of 48 parallel, combined vector and scalar shader ALUs. Unlike earlier graphics engines, the shaders are dynamically allocated, meaning that there are no distinct vertex or pixel shader engines—the hardware automatically adjusts to the load on a fine-grained basis. The hardware is fully compatible with D3D 9.0 and High-Level Shader Language (HLSL) 3.0, with extensions.

The ALUs are 32-bit IEEE 754 floating-point ALUs, with relatively common graphics simplifications of rounding modes, denormalized numbers (flush to zero on reads), NaN handling, and exception handling. They are capable of vector (including dot product) and scalar operations with single-cycle throughput—that is, all operations issue every cycle. The superscalar instructions encode vector, scalar, texture load, and vertex fetch within one instruction. This allows peak processing of 96 shader calculations per cycle while fetching textures and vertices.
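The 96-per-cycle figure is consistent with each of the 48 ALUs co-issuing one vector and one scalar operation per cycle; note that this co-issue reading is an inference from the description above, not something the text states outright.

    // Inference, not stated explicitly in the text: 48 ALUs x (1 vector + 1 scalar
    // co-issued operation) per cycle = 96 shader calculations per cycle.
    constexpr int kShaderAlus        = 48;
    constexpr int kOpsPerAluPerCycle = 2;   // assumed vector + scalar co-issue
    static_assert(kShaderAlus * kOpsPerAluPerCycle == 96, "peak shader calculations per cycle");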
Feeding the shaders are 16 texture fetch engines, each capable of producing a filtered result in each cycle. In addition, there are 16 programmable vertex fetch engines with built-in tessellation that the system can use instead of CPU geometry generation. Finally, there are 16 interpolators in dedicated hardware.

The render back end can sustain eight pixels per cycle, or 16 pixels per cycle for depth- and stencil-only rendering (used in z-prepass or shadow buffers). The dedicated z or blend logic and the EDRAM guarantee that eight pixels per cycle can be maintained even with 4x antialiasing and transparency. The z-prepass is a technique that performs a first-pass rendering of a command list, with no rendering features applied except occlusion determination. The z-prepass initializes the z-buffer so that on a subsequent rendering pass with full texturing and shaders applied, discarded pixels won't spend shader and texturing resources on occluded pixels. With modern scene depth complexity, this technique significantly improves rendering performance, especially with complex shader programs.
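The z-prepass idea maps onto ordinary D3D9-style render states. The following is a minimal sketch of the two-pass structure using common D3D9 states, not Xbox 360 SDK code; DrawSceneGeometry() and DrawSceneShaded() are placeholder calls for the application's own submission.

    // Minimal z-prepass sketch with common D3D9 render states (not Xbox 360 SDK
    // code). The scene-drawing calls are hypothetical placeholders.
    #include <d3d9.h>

    void DrawSceneGeometry(IDirect3DDevice9* dev);  // positions only, no materials
    void DrawSceneShaded(IDirect3DDevice9* dev);    // full texturing and shaders

    void RenderFrame(IDirect3DDevice9* dev)
    {
        // Pass 1: lay down depth only; no color writes, no expensive pixel work.
        dev->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
        dev->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
        dev->SetRenderState(D3DRS_ZFUNC, D3DCMP_LESSEQUAL);
        DrawSceneGeometry(dev);

        // Pass 2: full shading; only visible pixels pass the z-test, so occluded
        // pixels never consume shader or texture resources.
        dev->SetRenderState(D3DRS_COLORWRITEENABLE,
                            D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
                            D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
        dev->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
        dev->SetRenderState(D3DRS_ZFUNC, D3DCMP_EQUAL);
        DrawSceneShaded(dev);
    }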
As an example benchmark, the GPU can render each pixel with 4x antialiasing, a z-buffer, six shader operations, and two texture fetches, and can sustain this at eight pixels per cycle. This blazing fill rate enables the Xbox 360 to deliver HD-resolution rendering simultaneously with many state-of-the-art effects that traditionally would be mutually exclusive because of fill rate limitations. For example, games can mix particles, high-dynamic-range (HDR) lighting, fur, depth-of-field, motion blur, and other complex effects.

For next-generation geometric detail, shading, and fill rate, the pipeline's front end can process one triangle or vertex per cycle. These are essentially full-featured vertices (rather than a single parameter), with the practical limitation of required memory bandwidth and storage. To overcome this limitation, several compressed formats are available for each data type. In addition, XPS can transiently generate data on the fly within the CPU and pass it efficiently to the GPU without a main memory pass.
The EDRAM removes the render target and z-buffer fill rate from the bandwidth equation. The EDRAM resides on a separate die from the main portion of GPU logic. The EDRAM die also contains dedicated alpha blend, z-test, and antialiasing logic. The interface to the EDRAM macro runs at 256 Gbytes/s: (8 pixels/cycle + 8 z-compares/cycle) × (read + write) × 32 bits/sample × 4 samples/pixel × 500 MHz.
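Multiplying the terms out confirms the quoted figure; a small sketch of the arithmetic:

    // Checking the EDRAM interface bandwidth equation quoted above.
    constexpr long long kSamplesPerCycle = 8 + 8;        // 8 pixels + 8 z-compares
    constexpr long long kReadWrite       = 2;            // read + write
    constexpr long long kBytesPerSample  = 32 / 8;       // 32 bits/sample
    constexpr long long kSamplesPerPixel = 4;            // 4x multisampling
    constexpr long long kClockHz         = 500000000LL;  // 500 MHz
    static_assert(kSamplesPerCycle * kReadWrite * kBytesPerSample * kSamplesPerPixel * kClockHz
                      == 256000000000LL,                 // 256 Gbytes/s
                  "EDRAM interface bandwidth");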
The GPU supports several pixel depths; 32 bits per pixel (bpp) and 64 bpp are the most common, but there is support for up to 128 bpp for multiple-render-target (MRT) or floating-point output. MRT is a graphics technique of outputting more than one piece of data per sample to the effective frame buffer, interleaved efficiently to minimize the performance impact of having more data. The data is used later for a variety of advanced graphics effects. To optimize space, the GPU supports 32-bpp and 64-bpp HDR lighting formats. The EDRAM only supports rendering operations to the render target and z-buffer. For render-to-texture, the GPU must "flush" the appropriate buffer to main memory before using the buffer as a texture.
Unlike a fine-grained tiler architecture, the GPU can achieve common HD resolutions and bit depths within a couple of EDRAM tiles. This simplifies the problem substantially. Traditional tiling architectures typically include a whole process inserted in the traditional graphics pipeline for binning the geometry into a large number of bins. Handling the bins in a high-performance manner is complicated (for example, overflow cases, memory footprint, and bandwidth). Because the GPU's EDRAM usually requires only a couple of bins, bin handling is greatly simplified, allowing more-optimal hardware-software partitioning.

With a binning architecture, the full command list must be presented before rendering. The hardware uses a few tricks to speed this process up. Rendering increasingly relies on a z-prepass to prepare the z-buffer before executing complex pixel shader algorithms. We take advantage of this by collecting object extent information during this pass, as well as priming a full-resolution hierarchical z-buffer. We use the extent information to set flags to skip command list sections not needed within a tile. The full-resolution hi-z buffer retains its state between tiles.

In another interesting extension to normal D3D, the GPU supports a shader export feature that allows data to be output directly from the shader to a buffer in memory. This lets the GPU serve as a vector math engine if needed, as well as allowing multipass shaders. The latter can be useful for subdivision surfaces. In addition, the display pipeline includes an in-line scaler that resizes the frame buffer on the fly as it is output. This feature allows games to pick a rendering resolution to work with and then lets the display hardware make the best match to the display resolution.

Figure 6. GPU block diagram (main die and EDRAM die connected by a high-speed I/O bus).

As Figure 6 shows, the GPU consists of the following blocks:

• Bus interface unit. This interface to the FSB handles CPU-initiated transactions, as well as GPU-initiated transactions such as snoops and L2 cache reads.
• I/O controller. Handles all internal memory-mapped I/O accesses, as well as transactions to and from the I/O chip via the two-lane PCI-Express bus (PCI-E).
• Memory controllers (MC0, MC1). These 128-byte interleaved GDDR3 memory controllers contain aggressive address tiling for graphics and a fast path to minimize CPU latency.
• Memory interface. Memory crossbar and buffering for non-CPU initiators (such as graphics, I/O, and display).
• Graphics. This block, the largest on the chip, contains the rendering engine.
• High-speed I/O bus. This bus between the graphics core and the EDRAM die is a chip-to-chip bus (via substrate) operating at 1.8 GHz and 28.8 Gbytes/s. When multisample antialiasing is used, only pixel center data and coverage information is transferred and then expanded on the EDRAM die.
• Antialiasing and alpha/Z (AA+AZ). Handles pixel-to-sample expansion, as well as z-test and alpha blend.
• Display.

Figures 7 and 8 show photos of the GPU "parent" and EDRAM ("daughter") dies. The parent die contains 232 million transistors in a TSMC 90-nm GT process. The EDRAM die contains 100 million transistors in an NEC 90-nm process.

Figure 7. Xbox 360 GPU parent die.

Figure 8. Xbox 360 GPU EDRAM die (courtesy of NEC Electronics).
Architectural choices

The major choices we made in designing the Xbox 360 architecture were to use chip multiprocessing (CMP), in-order issuance cores, and EDRAM.
Chip multiprocessing

Our reasons for using multiple CPU cores on one chip in the Xbox 360 were relatively straightforward. The combination of power consumption and diminishing returns from instruction-level parallelism (ILP) is driving the industry in general to multicore. CMP is a natural twist on traditional symmetric multiprocessing (SMP), in which all the CPU cores are symmetric and have a common view of main memory but are on the same die versus separate chips. Modern process geometries afford hardware designers the flexibility of CMP, which was usually too costly in die area previously. Having multiple cores on one chip is more cost-effective. It enables a shared L2 implementation and minimizes communication latency between cores, resulting in higher overall performance for the same die area and power consumption.
In addition, we wanted to optimize the architecture for the workload, optimize in-game utilization of silicon area, and keep the system easy to program. These goals made CMP a good choice for several reasons:

First, for the game workload, both integer and floating-point performance are important. The high-level game code is generally a database management problem, with plenty of object-oriented code and pointer manipulation. Such a workload needs a large L2 and high integer performance. The CMP shared L2, with its fine-grained, dynamic allocation,
means this workload can use a large working set in the L2 while running. In addition, several sections of the application lend themselves well to vector floating-point acceleration.

Second, to optimize silicon area, we can take advantage of two factors. To start with, we are presenting a stable platform for the product's lifetime. This means tools and programming expertise will mature significantly, so we can rely more on generating code than optimizing performance at runtime. Moreover, all Xbox 360 games (as opposed to Xbox games from Microsoft's first game console, which are emulated on Xbox 360) are compiled from scratch and optimized for the current microarchitecture. We don't have the problem of running legacy, but compatible, instruction set architecture executables that were compiled and optimized for a completely different microarchitecture. This problem has significant implications for CPU microarchitectures in the PC and server markets.

Third, although we knew multicore was the way to go, the tools and programming expertise for multithreaded programming are certainly not mature, presenting a problem for our goal of keeping programming easy. For the types of workloads present in a game engine, we could justify at most six to eight threads in the system. The solution was to adapt the "more-but-simpler" philosophy to the CPU core topology. The key was keeping the number of hardware threads limited, thus increasing the chance that they would be used effectively. We decided the best approach was to tightly couple dedicated vector math engines to integer cores rather than making them autonomous. This keeps the number of threads low and allows vector math routines to be optimized and run on separate threads if necessary.
In-order issuance cores

The Xbox 360 CPU contains three two-issue, in-order instruction issuance cores. Each core has two SMT hardware threads, which support fine-grained instruction issuance. The cores allow out-of-order execution in the common cases of loads and vector/floating-point versus integer instructions. Loads, which are treated as prefetches, don't stall until a load dependency is present. Vector and floating-point operations have their own, decoupled vector/float issue queue (VIQ),

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket