`
`
`
`
Introduction to the Cell multiprocessor
`
`This paper provides an introductory overview of the Cell
`multiprocessor. Cell represents a revolutionary extension of
`conventional microprocessor architecture and organization. The
`paper discusses the history of the project, the program objectives
`and challenges, the design concept, the architecture and
`programming models, and the implementation.
`
`J. A. Kahle
`M. N. Day
`H. P. Hofstee
`C. R. Johns
`T. R. Maeurer
`D. Shippy
`
`Introduction: History of the project
Initial discussions on the collaborative effort to develop
Cell began with support from the CEOs of Sony and IBM:
Sony as a content provider and IBM as a leading-edge
technology and server company. Collaboration was initiated
between SCEI (Sony Computer Entertainment Incorporated)
and IBM for microprocessor development, with Toshiba as a
development and high-volume manufacturing technology
partner. This led to high-level architectural discussions
among the three companies during the summer of 2000.
`During a critical meeting in Tokyo, it was determined
`that traditional architectural organizations would not
`deliver the computational power that SCEI sought
`for their future interactive needs. SCEI brought to
`the discussions a vision to achieve 1,000 times the
`performance of PlayStation2** [1, 2]. The Cell objectives
`were to achieve 100 times the PlayStation2 performance
`and lead the way for the future. At this stage of the
interaction, the IBM Research Division became involved
to explore new organizational approaches to the design.
IBM process technology was
`also involved, contributing state-of-the-art 90-nm process
`with silicon-on-insulator (SOI), low-k dielectrics, and
`copper interconnects [3]. The new organization would
`make possible a digital entertainment center that would
`bring together aspects from broadband interconnect,
`entertainment systems, and supercomputer structures.
`During this interaction, a wide variety of multi-core
`proposals were discussed, ranging from conventional
`chip multiprocessors (CMPs) to dataflow-oriented
`multiprocessors.
`By the end of 2000 an architectural concept had been
`agreed on that combined the 64-bit Power Architecture*
`[4] with memory flow control and ‘‘synergistic’’
`
`processors in order to provide the required
`computational density and power efficiency. After
`several months of architectural discussion and contract
`negotiations, the STI (SCEI–Toshiba–IBM) Design
`Center was formally opened in Austin, Texas, on
`March 9, 2001. The STI Design Center represented
`a joint investment in design of about $400,000,000.
`Separate joint collaborations were also set in place
`for process technology development.
`A number of key elements were employed to drive the
`success of the Cell multiprocessor design. First, a holistic
`design approach was used, encompassing processor
`architecture, hardware implementation, system
`structures, and software programming models. Second,
`the design center staffed key leadership positions from
`various IBM sites. Third, the design incorporated
`many flexible elements ranging from reprogrammable
`synergistic processors to reconfigurable I/O interfaces
in order to support many system configurations with
`one high-volume chip.
`Although the STI design center for this ambitious,
`large-scale project was based in Austin (with IBM, the
`Sony Group, and Toshiba as partners), the following
`IBM sites were also critical to the project: Rochester,
`Minnesota; Yorktown Heights, New York; Boeblingen
`(Germany); Raleigh, North Carolina; Haifa (Israel);
`Almaden, California; Bangalore (India); Yasu (Japan);
`Burlington, Vermont; Endicott, New York; and a joint
`technology team located in East Fishkill, New York.
`
`Program objectives and challenges
`The objectives for the new processor were the following:
`
` Outstanding performance, especially on game/
`multimedia applications.
`
`
` Real-time responsiveness to the user and the network.
` Applicability to a wide range of platforms.
` Support for introduction in 2005.
`
`Outstanding performance, especially on
`game/multimedia applications
`The first of these objectives, outstanding performance,
`especially on game/multimedia applications, was expected
`to be challenged by limits on performance imposed by
`memory latency and bandwidth, power (even more than
`chip size), and diminishing returns from increased
`processor frequencies achieved by reducing the amount
`of work per cycle while increasing pipeline depth.
`The first major barrier to performance is increased
`memory latency as measured in cycles, and latency-
`induced limits on memory bandwidth. Also known as the
‘‘memory wall’’ [5], the problem is that higher processor
frequencies are not matched by decreased dynamic random
access memory (DRAM) latencies; hence, the effective
`DRAM latency increases with every generation. In a
`multi-GHz processor it is common for DRAM latencies
`to be measured in the hundreds of cycles; in symmetric
`multiprocessors with shared memory, main memory
`latency can tend toward a thousand processor cycles.
`A conventional microprocessor with conventional
`sequential programming semantics will sustain only a
`limited number of concurrent memory transactions. In
`a sequential model, every instruction is assumed to be
`completed before execution of the next instruction begins.
`If a data or instruction fetch misses in the caches,
`resulting in an access to main memory, instruction
`processing can only proceed in a speculative manner,
`assuming that the access to main memory will succeed.
`The processor must also record the non-speculative state
`in order to safely be able to continue processing. When a
`dependency on data from a previous access that missed in
`the caches arises, even deeper speculation is required in
`order to continue processing. Because of the amount
`of administration required every time computation is
`continued speculatively, and because the probability that
`useful work is being speculatively completed decreases
`rapidly with the number of times the processor must
`speculate in order to continue, it is very rare to see more
`than a few speculative memory accesses being performed
`concurrently on conventional microprocessors. Thus, if a
`microprocessor has, e.g., eight 128-byte cache-line fetches
`in flight (a very optimistic number) and memory latency is
`1,024 processor cycles, the maximum sustainable memory
`bandwidth is still a paltry one byte per processor cycle. In
`such a system, memory bandwidth limitations are
`latency-induced, and increasing memory bandwidth at
`the expense of memory latency can be counterproductive.
`The challenge therefore is to find a processor organization
`that allows for more memory bandwidth to be used
`
`effectively by allowing more memory transactions to be in
`flight simultaneously.
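The bound in this example is just Little's law: sustained bandwidth cannot exceed the number of transactions in flight times the transfer size, divided by the round-trip latency. A minimal sketch of the calculation (the function and the SPE comparison are our own illustration, using figures quoted elsewhere in this paper):

#include <stdio.h>

/* Latency-induced bandwidth ceiling (Little's law):
 * bytes/cycle <= (transactions in flight * bytes per transaction) / latency. */
static double max_bytes_per_cycle(double in_flight, double line_bytes,
                                  double latency_cycles)
{
    return in_flight * line_bytes / latency_cycles;
}

int main(void)
{
    /* The text's example: eight 128-byte line fetches in flight against a
     * 1,024-cycle latency yields only one byte per processor cycle. */
    printf("conventional core: %.2f bytes/cycle\n",
           max_bytes_per_cycle(8, 128, 1024));

    /* Raising concurrency lifts the same bound; e.g., eight SPEs, each
     * sustaining 16 outstanding 128-byte DMA transfers (see the SPE
     * section below), against the same latency. */
    printf("8 SPEs x 16 DMAs:  %.2f bytes/cycle\n",
           max_bytes_per_cycle(8 * 16, 128, 1024));
    return 0;
}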
`Power and power density in CMOS processors have
`increased steadily to a point at which we find ourselves
`once again in need of the sophisticated cooling techniques
`we had left behind at the end of the bipolar era [6].
`However, for consumer applications, the size of the box,
`the maximum airspeed, and the maximum allowable
`temperature for the air leaving the system impose
`fundamental first-order limits on the amount of power
`that can be tolerated, independent of engineering
`ingenuity to improve the thermal resistance. With respect
`to technology, the situation is worse this time for two
`reasons. First, the dimensions of the transistors are
`now so small that tunneling through the gate and sub-
`threshold leakage currents prevent following constant-
`field scaling laws and maintaining power density for
`scaled designs [7]. Second, an alternative lower-power
`technology is not available. The challenge is therefore
`to find means to improve power efficiency along with
`performance [8].
`A third barrier to improving performance stems
`from the observation that we have reached a point
`of diminishing return for improving performance by
`further increasing processor frequencies and pipeline
`depth [9]. The problem here is that when pipeline depths
`are increased, instruction latencies increase owing to the
`overhead of an increased number of latches. Thus, the
`performance gained by the increased frequency, and
`hence the ability to issue more instructions in any given
`amount of time, must exceed the time lost due to the
`increased penalties associated with the increased
`instruction execution latencies. Such penalties include
instruction issue slots¹ that cannot be utilized because
`of dependencies on results of previous instructions
`and penalties associated with mispredicted branch
`instructions. When the increase in frequency cannot be
`fully realized because of power limitations, increased
`pipeline depth and therefore execution latency can
`degrade rather than improve performance. It is worth
`noting that processors designed to issue one or two
`instructions per cycle can effectively and efficiently sustain
`higher frequencies than processors designed to issue
`larger numbers of instructions per cycle. The challenge is
`therefore to develop processor microarchitectures and
implementations that minimize pipeline depth and that
can efficiently use the issue slots available to them.

¹ An instruction issue slot is an opportunity to issue an instruction.
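The trade-off can be illustrated with a toy model (entirely our own simplification, with invented parameters): let frequency scale linearly with pipeline depth while every mispredicted branch flushes a pipeline's worth of work.

#include <stdio.h>

/* Toy model: performance = frequency / CPI, where frequency grows with
 * pipeline depth (fewer gate delays per stage) but each misprediction
 * costs roughly one full pipeline of cycles. Illustrative only. */
static double relative_perf(double depth, double mispredicts_per_inst)
{
    double frequency = depth;                        /* arbitrary units */
    double cpi = 1.0 + mispredicts_per_inst * depth; /* base CPI of 1 */
    return frequency / cpi;
}

int main(void)
{
    /* With ~2% of instructions mispredicting, doubling depth gives far
     * less than double the performance, and the curve flattens. */
    for (int depth = 8; depth <= 40; depth += 8)
        printf("depth %2d -> relative perf %.2f\n",
               depth, relative_perf(depth, 0.02));
    return 0;
}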
`
`Real-time responsiveness to the user and the
`network
`From the beginning, it was envisioned that the Cell
processor should be designed to provide the best possible
experience to the human user and the best possible
`response to the network. This ‘‘outward’’ focus differs
`from the ‘‘inward’’ focus of processor organizations that
`stem from the era of batch processing, when the primary
`concern was to keep the central processor unit busy. As
`all game developers know, keeping the players satisfied
`means providing continuously updated (real-time)
modeling of a virtual environment with consistent and
continuous visual, sound, and other sensory feedback.
`Therefore, the Cell processor should provide extensive
`real-time support. At the same time we anticipated that
`most devices in which the Cell processor would be used
`would be connected to the (broadband) Internet. At an
`early stage we envisioned blends of the content (real or
`virtual) as presented by the Internet and content from
`traditional game play and entertainment. This requires
`concurrent support for real-time operating systems
`and the non-real-time operating systems used to run
applications to access the Internet. Being responsive to
the Internet means not only that the processor should
be optimized for handling communication-oriented
workloads, but also that it should run well the kinds of
application workloads the Internet presents. Because the
Internet supports a wide variety of
`standards, such as the various standards for streaming
`video, any acceleration function must be programmable
`and flexible. With the opportunities for sharing data and
`computation power come the concerns of security, digital
`rights management, and privacy.
`
`Applicability to a wide range of platforms
`The Cell project was driven by the need to develop a
`processor for next-generation entertainment systems.
`However, a next-generation architecture with strength
`in the game/media arena that is designed to interface
`optimally with a user and broadband network in real time
`could, if architected and designed properly, be effective in
`a wide range of applications in the digital home and
`beyond. The Broadband Processor Architecture [10] is
`intended to have a life well beyond its first incarnation
`in the first-generation Cell processor. In order to extend
`the reach of this architecture, and to foster a software
`development community in which applications are
`optimized to this architecture, an open (Linux**-based)
`software development environment was developed along
`with the first-generation processor.
`
`Support for introduction in 2005
`The objective of the partnership was to develop this new
`processor with increased performance, responsiveness,
`and security, and to be able to introduce it in 2005. Thus,
`only four years were available to meet the challenges
`outlined above. A concept was needed that would
`
`allow us to deliver impressive processor performance,
`responsiveness to the user and network, and the flexibility
`to ensure a broad reach, and to do this without making
`a complete break with the past. Indications were that a
completely new architecture could easily require ten years
`to develop, especially if one includes the time required for
`software development. Hence, the Power Architecture*
`was used as the basis for Cell.
`
`Design concept and architecture
`The Broadband Processor Architecture extends the 64-bit
`Power Architecture with cooperative offload processors
`(‘‘synergistic processors’’), with the direct memory
`access (DMA) and synchronization mechanisms to
`communicate with them (‘‘memory flow control’’),
`and with enhancements for real-time management.
`The first-generation Cell processor (Figure 1) combines
`a dual-threaded, dual-issue, 64-bit Power-Architecture-
`compliant Power processor element (PPE) with eight
`newly architected synergistic processor elements (SPEs)
`[11], an on-chip memory controller, and a controller
`for a configurable I/O interface. These units are
`interconnected with a coherent on-chip element
`interconnect bus (EIB). Extensive support for pervasive
`functions such as power-on, test, on-chip hardware
`debug, and performance-monitoring functions is also
`included.
`The key attributes of this concept are the following:
`
` A high design frequency (small number of gates per
`cycle), allowing the processor to operate at a low
`voltage and low power while maintaining high
`frequency and high performance.
` Power Architecture compatibility to provide a
`conventional entry point for programmers, for
`virtualization, multi-operating-system support, and
`the ability to utilize IBM experience in designing
`and verifying symmetric multiprocessors.
` Single-instruction, multiple-data (SIMD)
`architecture, supported by both the vector media
`extensions on the PPE and the instruction set of the
`SPEs, as one of the means to improve game/media
`and scientific performance at improved power
`efficiency.
` A power- and area-efficient PPE that supports the
`high design frequency.
` SPEs for coherent offload. SPEs have local memory,
`asynchronous coherent DMA, and a large unified
`register file to improve memory bandwidth and to
`provide a new level of combined power efficiency and
`performance. The SPEs are dynamically configurable
`to provide support for content protection and
privacy.
 A high-bandwidth on-chip coherent bus and high-
bandwidth memory to deliver performance on memory-
bandwidth-intensive applications and to allow for
high-bandwidth on-chip interactions between the
processor elements. The bus is coherent to allow a
single address space to be shared by the PPEs and
SPEs for efficient communication and ease of
programming.
` High-bandwidth flexible I/O configurable to support
`a number of system organizations, including a single-
`chip configuration with dual I/O interfaces and a
`‘‘glueless’’ coherent dual-processor configuration that
`does not require additional switch chips to connect
`the two processors.
` Full-custom modular implementation to maximize
`performance per watt and performance per square
`millimeter of silicon and to facilitate the design of
`derivative products.
` Extensive support for chip power and thermal
`management, manufacturing test, hardware and
`software debugging, and performance analysis.
` High-performance, low-cost packaging technology.
` High-performance, low-power 90-nm SOI
`technology.
`
`High design frequency and low supply voltage
`To deliver the greatest possible performance, given a
`silicon and power budget, one challenge is to co-optimize
`the chip area, design frequency, and product operating
`voltage. Since efficiency improves dramatically (faster
`than quadratic) when the supply voltage is lowered,
`performance at a power budget can be improved by using
`more transistors (larger chip) while lowering the supply
`voltage. In practice the operating voltage has a minimum,
`often determined by on-chip static RAM, at which the
`chip ceases to function correctly. This minimum
`operating voltage, the size of the chip, the switching
`factors that measure the percentage of transistors that
`will dissipate switching power in a given cycle, and
`technology parameters such as capacitance and leakage
`currents determine the power the processor will dissipate
`as a function of processor frequency. Conversely, a power
`budget, a given technology, a minimum operating
`voltage, and a switching factor allow one to estimate a
`maximum operating frequency for a given chip size. As
`long as this frequency can be achieved without making
`the design so inefficient that one would be better off with
`a smaller chip operating at a higher supply voltage, this is
`the design frequency the project should aim to achieve. In
`other words, an optimally balanced design will operate at
`the minimum voltage supported by the circuits and at the
`maximum frequency at that minimum voltage. The chip
`should not exceed the maximum power tolerated by the
`application. In the case of the Cell processor, having
`eliminated most of the barriers that cause inefficiency in
`high-frequency designs, the initial design objective was
a cycle time no more than that of ten fan-out-of-four
inverters (10 FO4). This was later adjusted to 11 FO4
when it became clear that removing that last FO4 would
incur a substantial area and power penalty.

Figure 1

(a) Cell processor block diagram and (b) die photo. The first-generation Cell processor contains a power processor element (PPE) with a Power core and first- and second-level caches (L1 and a 0.5-MB L2); eight synergistic processor elements (SPEs), each containing a direct memory access (DMA) unit, a local store memory (LS), and an execution unit (SXU); a memory controller with a dual Rambus XDR** DRAM interface; and a bus interface controller (Rambus FlexIO**), all interconnected by an on-chip coherent bus carrying up to 96 bytes per cycle. (Cell die photo courtesy of Thomas Way, IBM Burlington.)
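A back-of-envelope version of this co-optimization (our own illustrative model; every parameter is invented, and real leakage varies strongly with voltage) treats power as switching plus leakage and solves for the frequency a fixed budget allows:

#include <stdio.h>

/* Illustrative model: P = s*C*V^2*f + V*I_leak. For a fixed power budget,
 * a larger chip (more capacitance and leakage) run at a lower voltage can
 * deliver more total throughput than a smaller chip at a higher voltage. */
static double max_frequency(double power_budget, double v, double cap,
                            double switching_factor, double i_leak)
{
    double leakage = v * i_leak;
    if (leakage >= power_budget)
        return 0.0;                      /* budget consumed by leakage */
    return (power_budget - leakage) / (switching_factor * cap * v * v);
}

int main(void)
{
    const double budget = 80.0;          /* watts, assumed */

    /* Doubling the chip doubles capacitance and leakage; throughput is
     * taken as area * frequency in this toy model. */
    double f_small = max_frequency(budget, 1.1, 1.0, 0.1, 10.0);
    double f_large = max_frequency(budget, 0.9, 2.0, 0.1, 20.0);

    printf("small chip @ 1.1 V: f = %.0f, throughput = %.0f\n",
           f_small, 1.0 * f_small);
    printf("large chip @ 0.9 V: f = %.0f, throughput = %.0f\n",
           f_large, 2.0 * f_large);
    return 0;
}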
`
`Power Architecture compatibility
`The Broadband Processor Architecture maintains full
`compatibility with 64-bit Power Architecture [4]. The
`implementation on the Cell processor has aimed to
`include all recent innovations of Power technology such
`as virtualization support and support for large page sizes.
`By building on Power and by focusing the innovation on
`those aspects of the design that brought new advantages,
`it became feasible to complete a complex new design on a
`tight schedule. In addition, compatibility with the Power
`Architecture provides a base for porting existing software
`(including the operating system) to Cell. Although
`additional work is required to unlock the performance
`potential of the Cell processor, existing Power
`applications can be run on the Cell processor without
`modification.
`
`Single-instruction, multiple-data architecture
`The Cell processor uses a SIMD organization in the
`vector unit on the PPE and in the SPEs. SIMD units
`have been demonstrated to be effective in accelerating
`multimedia applications and, because all mainstream PC
`processors now include such units, software support,
`including compilers that generate SIMD instructions for
`code not explicitly written to use SIMD, is maturing. By
`opting for the SIMD extensions in both the PPE and the
`SPE, the task of developing or migrating software to Cell
has been greatly simplified. Typically, an application may
start out single-threaded and without using SIMD. A
`first step to improving performance may be to use SIMD
`on the PPE, and a typical second step is to make use
`of the SPEs. Although the SIMD architecture on the
`SPEs differs from the one on the PPE, there is enough
`overlap so that programmers can reasonably construct
`programs that deliver consistent performance on both
`the PPE and (after recompilation) on the SPEs. Because
`the single-threaded PPE provides a debugging and
`testing environment that is (still) most familiar, many
`programmers prefer this type of approach to
`programming Cell.
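As a sketch of that first step, the loop below vectorizes a float addition with the VMX/AltiVec C intrinsics available on the PPE. The function names and alignment assumptions are ours; this presumes the standard altivec.h interface rather than anything specific to this paper.

#include <altivec.h>

/* Scalar version: one addition per iteration. */
void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* VMX version: four 32-bit additions per instruction on 128-bit vectors.
 * Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
void add_vmx(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, &a[i]);   /* aligned 128-bit loads */
        vector float vb = vec_ld(0, &b[i]);
        vec_st(vec_add(va, vb), 0, &c[i]);    /* add 4 floats, store */
    }
}

A near-identical loop can be recompiled for the SPEs with their SIMD intrinsics, which is the overlap between the two instruction sets that the text relies on.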
`
`Power processor element
`The PPE (Figure 2) is a 64-bit Power-Architecture-
`compliant core optimized for design frequency and power
efficiency. While the processor matches the 11 FO4 design
frequency of the SPEs in a fully compliant Power core, its
pipeline depth is only 23 stages, significantly
`less than what one might expect for a design that
`reduces the amount of time per stage by nearly a factor
`of 2 compared with earlier designs [12, 13]. The
`
`microarchitecture and floorplan of this processor avoid
`long wires and limit the amount of communication delay
`in every cycle and can therefore be characterized as
`‘‘short-wire.’’ The design of the PPE is simplified in
`comparison to more recent four-issue out-of-order
`processors. The PPE is a dual-issue design that does not
dynamically reorder instructions at issue time (i.e., ‘‘in-
`order issue’’). The core interleaves instructions from two
`computational threads at the same time to optimize the
`use of issue slots, maintain maximum efficiency, and
`reduce pipeline depth. Simple arithmetic functions
`execute and forward their results in two cycles. Owing
`to the delayed-execution fixed-point pipeline, load
`instructions also complete and forward their results
`in two cycles. A double-precision floating-point
`instruction executes in ten cycles.
`The PPE supports a conventional cache hierarchy
`with 32-KB first-level instruction and data caches and
`a 512-KB second-level cache. The second-level cache
`and the address-translation caches use replacement
`management tables to allow the software to direct entries
`with specific address ranges at a particular subset of the
`cache. This mechanism allows for locking data in the
`cache (when the size of the address range is equal to the
`size of the set) and can also be used to prevent overwriting
`data in the cache by directing data that is known to be
`used only once at a particular set. Providing these
`functions enables increased efficiency and increased
`real-time control of the processor.
`The processor provides two simultaneous threads of
`execution within the processor and can be viewed as a
`two-way multiprocessor with shared dataflow. This gives
`software the effective appearance of two independent
processing units. All architected state is duplicated,
`including all architected registers and special-purpose
`registers, with the exception of registers that deal with
`system-level resources, such as logical partitions,
`memory, and thread control. Non-architected resources
`such as caches and queues are generally shared for both
`threads, except in cases where the resource is small or
`offers a critical performance improvement to
`multithreaded applications.
`The processor is composed of three units [Figure 2(a)].
`The instruction unit (IU) is responsible for instruction
`fetch, decode, branch, issue, and completion. A fixed-
`point execution unit (XU) is responsible for all fixed-
`point instructions and all load/store-type instructions.
`A vector scalar unit (VSU) is responsible for all vector
`and floating-point instructions.
`The IU fetches four instructions per cycle per thread
`into an instruction buffer and dispatches the instructions
`from this buffer. After decode and dependency checking,
`instructions are dual-issued to an execution unit. A
4-KB by 2-bit branch history table with 6 bits of global
history per thread is used to predict the outcome of
`branches. The IU can issue up to two instructions per
`cycle. All dual-issue combinations are possible except for
`two instructions to the same unit and the following
`exceptions. Simple vector, complex vector, vector
`floating-point, and scalar floating-point arithmetic cannot
`be dual-issued with the same type of instructions (for
`example, a simple vector with a complex vector is not
`allowed). However, these instructions can be dual-issued
`with any other form of load/store, fixed-point branch, or
`vector-permute instruction. A VSU issue queue decouples
`the vector and floating-point pipelines from the remaining
`pipelines. This allows vector and floating-point
`instructions to be issued out of order with respect to
`other instructions.
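A branch history table indexed with global history is commonly modeled as a gshare-style predictor. The sketch below is our own illustration of that scheme, not the PPE's actual logic; the entry count is assumed, since the ‘‘4-KB by 2-bit’’ sizing maps to entries in a way the text does not spell out.

#include <stdint.h>
#include <stdbool.h>

#define HISTORY_BITS  6      /* global history bits, per the text */
#define TABLE_ENTRIES 4096   /* assumed number of 2-bit counters */

static uint8_t counters[TABLE_ENTRIES]; /* 0..3; >= 2 predicts taken */
static uint8_t history;                 /* last 6 outcomes (the PPE keeps
                                           one such history per thread) */

static unsigned index_of(uint64_t pc)
{
    /* XOR branch address bits with global history (gshare-style). */
    return (unsigned)((pc >> 2) ^ history) % TABLE_ENTRIES;
}

bool predict(uint64_t pc)
{
    return counters[index_of(pc)] >= 2;
}

void update(uint64_t pc, bool taken)
{
    uint8_t *c = &counters[index_of(pc)];
    if (taken  && *c < 3) (*c)++;   /* saturate toward taken */
    if (!taken && *c > 0) (*c)--;   /* saturate toward not-taken */
    history = (uint8_t)(((history << 1) | (taken ? 1 : 0))
                        & ((1u << HISTORY_BITS) - 1));
}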
`The XU consists of a 32- by 64-bit general-purpose
`register file per thread, a fixed-point execution unit, and
`a load/store unit. The load/store unit consists of the L1
`D-cache, a translation cache, an eight-entry miss queue,
`and a 16-entry store queue. The load/store unit supports a
`non-blocking L1 D-cache which allows cache hits under
`misses.
`The VSU floating-point execution unit consists of a 32-
`by 64-bit register file per thread, as well as a ten-stage
`double-precision pipeline. The VSU vector execution
`units are organized around a 128-bit dataflow. The vector
`unit contains four subunits: simple, complex, permute,
`and single-precision floating point. There is a 32-entry by
`128-bit vector register file per thread, and all instructions
are 128-bit SIMD with varying element width (2 × 64-bit,
4 × 32-bit, 8 × 16-bit, 16 × 8-bit, and 128 × 1-bit).

Figure 2

Power processor element (a) major units and (b) pipeline diagram. Instruction fetch and decode fetches and decodes four instructions in parallel from the first-level instruction cache for two simultaneously executing threads in alternating cycles. When both threads are active, two instructions from one of the threads are issued in program order in alternate cycles. The core contains one instance of each of the major execution units [branch, fixed-point, load/store, floating-point (FPU), and vector-media (VMX)]. Processing latencies are indicated in part (b), color-coded to correspond to part (a). Simple fixed-point instructions execute in two cycles. Because execution of fixed-point instructions is delayed, the load-to-use penalty is limited to one cycle. The branch miss penalty is 23 cycles, comparable to the penalty in designs with a much lower operating frequency. (Pipeline stage abbreviations: IC, instruction cache; IB, instruction buffer; BP, branch prediction; MC, microcode; ID, instruction decode; IS, instruction issue; DLY, delay stage; RF, register file access; EX, execution; WB, write back.)
`
`Synergistic processing element
`The SPE [11] implements a new instruction-set
`architecture optimized for power and performance on
`computing-intensive and media applications. The SPE
`(Figure 3) operates on a local store memory (256 KB)
`that stores instructions and data. Data and instructions
`are transferred between this local memory and system
`memory by asynchronous coherent DMA commands,
`executed by the memory flow control unit included in
`each SPE. Each SPE supports up to 16 outstanding DMA
`commands. Because these coherent DMA commands use
`the same translation and protection governed by the page
`and segment tables of the Power Architecture as the PPE,
`addresses can be passed between the PPE and SPEs, and
`the operating system can share memory and manage all of
`the processing resources in the system in a consistent
`manner. The DMA unit can be programmed in one of
`three ways: 1) with instructions on the SPE that insert
`DMA commands in the queues; 2) by preparing (scatter-
`gather) lists of commands in the local store and issuing
`a single ‘‘DMA list’’ of commands; or 3) by inserting
`commands in the DMA queue from another processor
`
`in the system (with the appropriate privilege) by using
`store or DMA-write commands. For programming
`convenience, and to allow local-store-to-local-store DMA
`transactions, the local store is mapped into the memory
`map of the processor, but this memory (if cached) is not
`coherent in the system.
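As an illustration of the first method, the fragment below enqueues a get command from an SPE and blocks until its tag group completes. It is written against the spu_mfcio.h intrinsics of the later Cell SDK; the buffer, size, and tag are our choices, and the exact interface should be treated as an assumption rather than something this paper defines.

#include <spu_mfcio.h>

#define TAG 3   /* DMA tag group (0-31), our choice */

/* A 16-KB buffer in the 256-KB local store. DMA transfers require
 * 16-byte alignment; 128-byte alignment gives the best bus efficiency. */
static char buf[16384] __attribute__((aligned(128)));

void fetch(unsigned long long effective_addr)
{
    /* Enqueue a get (system memory -> local store) on this SPE's
     * memory flow control unit. */
    mfc_get(buf, effective_addr, sizeof(buf), TAG, 0, 0);

    /* Useful computation could overlap here: each SPE supports up to
     * 16 outstanding DMA commands, e.g., for double buffering. */

    /* Block until every command in the tag group has completed. */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}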
`The local store organization introduces another level of
`memory hierarchy beyond the registers that provide local
`storage of data in most processor architectures. This is
`to provide a mechanism to combat the ‘‘memory wall,’’
`since it allows for a large number of memory transactions
`to be in flight simultaneously without requiring the deep
`speculation that drives high degrees of inefficiency
`on other processors. With main memory latency
`approaching a thousand cycles, the few cycles it takes to
set up a DMA command become acceptable overhead to
`access main memory. Obviously, this organization of the
`processor can provide good support for streaming, but
`because the local store is large enough to store more than
`a simple streaming kernel, a wide variety of programming
`models can be supported, as discussed later.
`The local store is the largest component in the SPE,
`and it was important to implement it efficiently [14]. A
`single-port SRAM cell is used to minimize area. In order
`to provide good performance, in spite of the fact that the
`local store must arbitrate among DMA reads, writes,
`instruction fetches, loads, and stores, the local store was
`designed with both narrow (128-bit) and wide (128-byte)
`read and write ports. The wide access is used for DMA
`reads and writes as well as instruction (pre)fetch. Because
`a typical 128-byte DMA read or write requires 16
`processor cycles to place the data on the on-chip coherent
`bus, even when DMA reads and writes occur at full
`bandwidth, seven of every eight cycles remain available
`for loads, stores, and instruction fetch. Similarly,
`instructions are fetched 128 bytes at a time, and pressure
`on the local store is minimized. The highest priority is
`given to DMA commands, the next highest priority to
`loads and stores, and instruction (pre)fetch occurs
`whenever there is a cycle available. A special no-
`operation instruction exists to force the availability
`of a slot to instruction fetch when necessary.
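A minimal sketch of the stated priority scheme (our own illustration of the policy, not the hardware arbiter):

/* Local-store port arbitration, per the stated policy: DMA first, then
 * loads/stores, then instruction (pre)fetch in leftover cycles. */
enum request { REQ_NONE, REQ_DMA, REQ_LOADSTORE, REQ_IFETCH };

enum request grant(int dma_pending, int ldst_pending, int ifetch_pending)
{
    if (dma_pending)                /* wide 128-byte port; at full DMA
                                       bandwidth this claims only one
                                       local-store cycle in eight */
        return REQ_DMA;
    if (ldst_pending)               /* narrow 128-bit port */
        return REQ_LOADSTORE;
    if (ifetch_pending)             /* 128 bytes per fetch; a special
                                       no-op can force a fetch slot */
        return REQ_IFETCH;
    return REQ_NONE;
}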
`The execution units of the SPU are organized around
`a 128-bit dataflow. A large register file with 128 entries
`provides