Introduction to the Cell multiprocessor

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, D. Shippy

This paper provides an introductory overview of the Cell multiprocessor. Cell represents a revolutionary extension of conventional microprocessor architecture and organization. The paper discusses the history of the project, the program objectives and challenges, the design concept, the architecture and programming models, and the implementation.
Introduction: History of the project

Initial discussion of the collaborative effort to develop Cell began with support from the CEOs of Sony and IBM: Sony as a content provider and IBM as a leading-edge technology and server company. A collaboration was initiated among SCEI (Sony Computer Entertainment Incorporated) and IBM for microprocessor development, with Toshiba as a development and high-volume manufacturing technology partner. This led to high-level architectural discussions among the three companies during the summer of 2000. During a critical meeting in Tokyo, it was determined that traditional architectural organizations would not deliver the computational power that SCEI sought for its future interactive needs. SCEI brought to the discussions a vision of achieving 1,000 times the performance of PlayStation2** [1, 2]. The Cell objectives were to achieve 100 times the PlayStation2 performance and lead the way for the future. At this stage of the interaction, the IBM Research Division became involved for the purpose of exploring new organizational approaches to the design. IBM process technology was also involved, contributing a state-of-the-art 90-nm process with silicon-on-insulator (SOI), low-k dielectrics, and copper interconnects [3]. The new organization would make possible a digital entertainment center that would bring together aspects of broadband interconnect, entertainment systems, and supercomputer structures. During this interaction, a wide variety of multi-core proposals were discussed, ranging from conventional chip multiprocessors (CMPs) to dataflow-oriented multiprocessors.

By the end of 2000, an architectural concept had been agreed on that combined the 64-bit Power Architecture* [4] with memory flow control and "synergistic" processors in order to provide the required computational density and power efficiency. After several months of architectural discussion and contract negotiations, the STI (SCEI–Toshiba–IBM) Design Center was formally opened in Austin, Texas, on March 9, 2001. The STI Design Center represented a joint investment in design of about $400,000,000. Separate joint collaborations were also set in place for process technology development.

A number of key elements drove the success of the Cell multiprocessor design. First, a holistic design approach was used, encompassing processor architecture, hardware implementation, system structures, and software programming models. Second, the design center staffed key leadership positions from various IBM sites. Third, the design incorporated many flexible elements, ranging from reprogrammable synergistic processors to reconfigurable I/O interfaces, in order to support many system configurations with one high-volume chip.

Although the STI design center for this ambitious, large-scale project was based in Austin (with IBM, the Sony Group, and Toshiba as partners), the following IBM sites were also critical to the project: Rochester, Minnesota; Yorktown Heights, New York; Boeblingen (Germany); Raleigh, North Carolina; Haifa (Israel); Almaden, California; Bangalore (India); Yasu (Japan); Burlington, Vermont; Endicott, New York; and a joint technology team located in East Fishkill, New York.
Program objectives and challenges

The objectives for the new processor were the following:

• Outstanding performance, especially on game/multimedia applications.
• Real-time responsiveness to the user and the network.
• Applicability to a wide range of platforms.
• Support for introduction in 2005.
Outstanding performance, especially on game/multimedia applications

The first of these objectives, outstanding performance, especially on game/multimedia applications, was expected to be challenged by limits on performance imposed by memory latency and bandwidth, by power (even more than chip size), and by diminishing returns from increased processor frequencies achieved by reducing the amount of work per cycle while increasing pipeline depth.

The first major barrier to performance is increased memory latency as measured in cycles, and latency-induced limits on memory bandwidth. Also known as the "memory wall" [5], the problem is that higher processor frequencies are not met by decreased dynamic random access memory (DRAM) latencies; hence, the effective DRAM latency increases with every generation. In a multi-GHz processor it is common for DRAM latencies to be measured in hundreds of cycles; in symmetric multiprocessors with shared memory, main-memory latency can approach a thousand processor cycles.

A conventional microprocessor with conventional sequential programming semantics will sustain only a limited number of concurrent memory transactions. In a sequential model, every instruction is assumed to be completed before execution of the next instruction begins. If a data or instruction fetch misses in the caches, resulting in an access to main memory, instruction processing can proceed only speculatively, assuming that the access to main memory will succeed. The processor must also record the non-speculative state in order to be able to continue processing safely. When a dependency arises on data from a previous access that missed in the caches, even deeper speculation is required to continue processing. Because of the bookkeeping required every time computation continues speculatively, and because the probability that useful work is being speculatively completed decreases rapidly with the number of times the processor must speculate in order to continue, it is rare to see more than a few speculative memory accesses performed concurrently on conventional microprocessors. Thus, if a microprocessor has, for example, eight 128-byte cache-line fetches in flight (a very optimistic number) and memory latency is 1,024 processor cycles, the maximum sustainable memory bandwidth is still a paltry one byte per processor cycle. In such a system, memory bandwidth limitations are latency-induced, and increasing memory bandwidth at the expense of memory latency can be counterproductive. The challenge, therefore, is to find a processor organization that allows more memory bandwidth to be used effectively by allowing more memory transactions to be in flight simultaneously.
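To make this back-of-envelope limit concrete, the following minimal C sketch applies the same arithmetic (a form of Little's law: sustainable bandwidth = bytes in flight / latency). The numbers are the illustrative ones from the paragraph above; the 16-DMAs-per-SPE figure in the second calculation anticipates the SPE description later in this paper.

    #include <stdio.h>

    /* Latency-induced bandwidth cap: sustained bandwidth cannot exceed
     * (concurrent transactions x line size) / memory latency.
     * Numbers are the paper's illustrative examples, not measurements. */
    int main(void)
    {
        const double line_bytes  = 128.0;   /* cache-line fetch size */
        const double in_flight   = 8.0;     /* concurrent misses (optimistic) */
        const double latency_cyc = 1024.0;  /* memory latency in CPU cycles */

        printf("conventional core: %.2f bytes/cycle\n",
               in_flight * line_bytes / latency_cyc);          /* 1.00 */

        /* Raising concurrency raises the cap proportionally at the same
         * latency; e.g., 8 SPEs x 16 outstanding DMA commands each. */
        printf("128 transactions in flight: %.2f bytes/cycle\n",
               (8.0 * 16.0) * line_bytes / latency_cyc);       /* 16.00 */
        return 0;
    }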
Power and power density in CMOS processors have increased steadily to a point at which we find ourselves once again in need of the sophisticated cooling techniques we had left behind at the end of the bipolar era [6]. For consumer applications, however, the size of the box, the maximum airspeed, and the maximum allowable temperature for the air leaving the system impose fundamental first-order limits on the amount of power that can be tolerated, independent of engineering ingenuity to improve the thermal resistance. With respect to technology, the situation is worse this time for two reasons. First, the dimensions of the transistors are now so small that tunneling through the gate and sub-threshold leakage currents prevent following constant-field scaling laws and maintaining power density for scaled designs [7]. Second, an alternative lower-power technology is not available. The challenge is therefore to find means to improve power efficiency along with performance [8].

A third barrier to improving performance stems from the observation that we have reached a point of diminishing returns for improving performance by further increasing processor frequencies and pipeline depth [9]. The problem here is that when pipeline depths are increased, instruction latencies increase owing to the overhead of an increased number of latches. Thus, the performance gained from the increased frequency, and hence the ability to issue more instructions in any given amount of time, must exceed the time lost to the increased penalties associated with the longer instruction execution latencies. Such penalties include instruction issue slots (an instruction issue slot is an opportunity to issue an instruction) that cannot be utilized because of dependencies on the results of previous instructions, as well as penalties associated with mispredicted branch instructions. When the increase in frequency cannot be fully realized because of power limitations, increased pipeline depth, and therefore execution latency, can degrade rather than improve performance. It is worth noting that processors designed to issue one or two instructions per cycle can sustain higher frequencies more effectively and efficiently than processors designed to issue larger numbers of instructions per cycle. The challenge is therefore to develop processor microarchitectures and implementations that minimize pipeline depth and that can efficiently use the issue slots available to them.
Real-time responsiveness to the user and the network

From the beginning, it was envisioned that the Cell processor should be designed to provide the best possible experience to the human user and the best possible response to the network. This "outward" focus differs from the "inward" focus of processor organizations that stem from the era of batch processing, when the primary concern was keeping the central processing unit busy. As all game developers know, keeping players satisfied means providing continuously updated (real-time) modeling of a virtual environment with consistent and continuous visual, sound, and other sensory feedback. The Cell processor should therefore provide extensive real-time support. At the same time, we anticipated that most devices in which the Cell processor would be used would be connected to the (broadband) Internet. At an early stage we envisioned blends of content (real or virtual) as presented by the Internet and content from traditional game play and entertainment. This requires concurrent support for real-time operating systems and the non-real-time operating systems used to run applications that access the Internet. Being responsive to the Internet means not only that the processor should be optimized for handling communication-oriented workloads; it also implies that the processor should be responsive to the types of workloads presented by the Internet. Because the Internet supports a wide variety of standards, such as the various standards for streaming video, any acceleration function must be programmable and flexible. With the opportunities for sharing data and computation power come concerns of security, digital rights management, and privacy.
Applicability to a wide range of platforms

The Cell project was driven by the need to develop a processor for next-generation entertainment systems. However, a next-generation architecture with strength in the game/media arena that is designed to interface optimally with a user and a broadband network in real time could, if architected and designed properly, be effective in a wide range of applications in the digital home and beyond. The Broadband Processor Architecture [10] is intended to have a life well beyond its first incarnation in the first-generation Cell processor. In order to extend the reach of this architecture, and to foster a software development community in which applications are optimized for this architecture, an open (Linux**-based) software development environment was developed along with the first-generation processor.
Support for introduction in 2005

The objective of the partnership was to develop this new processor with increased performance, responsiveness, and security, and to be able to introduce it in 2005. Thus, only four years were available to meet the challenges outlined above. A concept was needed that would allow us to deliver impressive processor performance and responsiveness to the user and network, with the flexibility to ensure broad reach, and to do this without making a complete break with the past. Indications were that a completely new architecture could easily require ten years to develop, especially if one includes the time required for software development. Hence, the Power Architecture* was used as the basis for Cell.
Design concept and architecture

The Broadband Processor Architecture extends the 64-bit Power Architecture with cooperative offload processors ("synergistic processors"), with the direct memory access (DMA) and synchronization mechanisms to communicate with them ("memory flow control"), and with enhancements for real-time management. The first-generation Cell processor (Figure 1) combines a dual-threaded, dual-issue, 64-bit Power-Architecture-compliant Power processor element (PPE) with eight newly architected synergistic processor elements (SPEs) [11], an on-chip memory controller, and a controller for a configurable I/O interface. These units are interconnected with a coherent on-chip element interconnect bus (EIB). Extensive support for pervasive functions such as power-on, test, on-chip hardware debug, and performance monitoring is also included.

The key attributes of this concept are the following:

• A high design frequency (a small number of gates per cycle), allowing the processor to operate at a low voltage and low power while maintaining high frequency and high performance.
• Power Architecture compatibility, providing a conventional entry point for programmers, support for virtualization and multiple operating systems, and the ability to utilize IBM experience in designing and verifying symmetric multiprocessors.
• A single-instruction, multiple-data (SIMD) architecture, supported by both the vector media extensions on the PPE and the instruction set of the SPEs, as one of the means to improve game/media and scientific performance at improved power efficiency.
• A power- and area-efficient PPE that supports the high design frequency.
• SPEs for coherent offload. SPEs have local memory, asynchronous coherent DMA, and a large unified register file to improve memory bandwidth and to provide a new level of combined power efficiency and performance. The SPEs are dynamically configurable to provide support for content protection and privacy.
• A high-bandwidth on-chip coherent bus and high-bandwidth memory to deliver performance on memory-bandwidth-intensive applications and to allow for high-bandwidth on-chip interactions between the processor elements. The bus is coherent to allow a single address space to be shared by the PPEs and SPEs for efficient communication and ease of programming.
• High-bandwidth flexible I/O, configurable to support a number of system organizations, including a single-chip configuration with dual I/O interfaces and a "glueless" coherent dual-processor configuration that does not require additional switch chips to connect the two processors.
• A full-custom modular implementation to maximize performance per watt and performance per square millimeter of silicon and to facilitate the design of derivative products.
• Extensive support for chip power and thermal management, manufacturing test, hardware and software debugging, and performance analysis.
• High-performance, low-cost packaging technology.
• High-performance, low-power 90-nm SOI technology.
High design frequency and low supply voltage

To deliver the greatest possible performance within a given silicon and power budget, one challenge is to co-optimize the chip area, design frequency, and product operating voltage. Since efficiency improves dramatically (faster than quadratically) when the supply voltage is lowered, performance at a given power budget can be improved by using more transistors (a larger chip) while lowering the supply voltage. In practice the operating voltage has a minimum, often determined by on-chip static RAM, below which the chip ceases to function correctly. This minimum operating voltage, the size of the chip, the switching factors that measure the percentage of transistors that dissipate switching power in a given cycle, and technology parameters such as capacitance and leakage currents determine the power the processor will dissipate as a function of processor frequency. Conversely, a power budget, a given technology, a minimum operating voltage, and a switching factor allow one to estimate a maximum operating frequency for a given chip size. As long as this frequency can be achieved without making the design so inefficient that one would be better off with a smaller chip operating at a higher supply voltage, this is the design frequency the project should aim to achieve. In other words, an optimally balanced design will operate at the minimum voltage supported by the circuits and at the maximum frequency at that minimum voltage, while not exceeding the maximum power tolerated by the application. In the case of the Cell processor, having eliminated most of the barriers that cause inefficiency in high-frequency designs, the initial design objective was a cycle time no greater than that of ten fan-out-of-four inverters (10 FO4).
This was later adjusted to 11 FO4 when it became clear that removing that last FO4 would incur a substantial area and power penalty.

[Figure 1: (a) Cell processor block diagram and (b) die photo. The first-generation Cell processor contains a Power processor element (PPE) with a Power core and first- and second-level caches (L1 and L2); eight synergistic processor elements (SPEs), each containing a direct memory access (DMA) unit, a local store memory (LS), and an execution unit (SXU); and memory and bus interface controllers, all interconnected by a coherent on-chip bus. (Cell die photo courtesy of Thomas Way, IBM Burlington.)]
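To illustrate the "faster than quadratic" efficiency argument above, here is a minimal C sketch of the standard dynamic-power relation P ≈ C·V²·f. The numbers are invented for illustration and are not Cell design data; leakage is ignored.

    #include <stdio.h>

    /* Dynamic switching power in arbitrary consistent units. */
    static double dyn_power(double cap, double volts, double ghz)
    {
        return cap * volts * volts * ghz;
    }

    int main(void)
    {
        /* At a fixed power budget, a larger chip (here 2x the switched
         * capacitance) at a reduced supply voltage can match the power of
         * a smaller chip at nominal voltage while doing more work per cycle. */
        printf("small chip, 1.20 V: %.2f\n", dyn_power(1.0, 1.20, 4.0)); /* 5.76 */
        printf("large chip, 0.85 V: %.2f\n", dyn_power(2.0, 0.85, 4.0)); /* 5.78 */
        return 0;
    }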
Power Architecture compatibility

The Broadband Processor Architecture maintains full compatibility with the 64-bit Power Architecture [4]. The implementation in the Cell processor aims to include all recent innovations of Power technology, such as virtualization support and support for large page sizes. By building on Power and focusing innovation on those aspects of the design that brought new advantages, it became feasible to complete a complex new design on a tight schedule. In addition, compatibility with the Power Architecture provides a base for porting existing software (including the operating system) to Cell. Although additional work is required to unlock the full performance potential of the Cell processor, existing Power applications can run on the Cell processor without modification.
Single-instruction, multiple-data architecture

The Cell processor uses a SIMD organization in the vector unit on the PPE and in the SPEs. SIMD units have been demonstrated to be effective in accelerating multimedia applications, and because all mainstream PC processors now include such units, software support, including compilers that generate SIMD instructions for code not explicitly written to use SIMD, is maturing. By opting for SIMD extensions in both the PPE and the SPE, the task of developing or migrating software to Cell is greatly simplified. Typically, an application may start out single-threaded and not using SIMD. A first step toward improving performance may be to use SIMD on the PPE, and a typical second step is to make use of the SPEs. Although the SIMD architecture on the SPEs differs from the one on the PPE, there is enough overlap that programmers can reasonably construct programs that deliver consistent performance on both the PPE and (after recompilation) the SPEs. Because the single-threaded PPE provides the debugging and testing environment that is (still) most familiar, many programmers prefer this type of approach to programming Cell.
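As an illustration of the first migration step described above (using SIMD on the PPE), here is a minimal C sketch written with the VMX/AltiVec intrinsics that later toolchains expose for the PPE vector unit (vec_add and vec_st from altivec.h, compiled with -maltivec on GCC). The toolchain details are an assumption, since this paper predates the Cell software development kit.

    #include <altivec.h>
    #include <stdio.h>

    int main(void)
    {
        /* Four single-precision lanes processed by one instruction. */
        vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
        vector float b = {0.5f, 0.5f, 0.5f, 0.5f};
        vector float c = vec_add(a, b);        /* element-wise add */

        float out[4] __attribute__((aligned(16)));
        vec_st(c, 0, out);                     /* store the 128-bit result */
        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }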
Power processor element

The PPE (Figure 2) is a 64-bit Power-Architecture-compliant core optimized for design frequency and power efficiency. While the processor matches the 11 FO4 design frequency of the SPEs in a fully compliant Power processor, its pipeline depth is only 23 stages, significantly less than what one might expect for a design that reduces the amount of time per stage by nearly a factor of two compared with earlier designs [12, 13]. The microarchitecture and floorplan of this processor avoid long wires and limit the amount of communication delay in every cycle, and the design can therefore be characterized as "short-wire." The design of the PPE is simplified in comparison with more recent four-issue out-of-order processors: the PPE is a dual-issue design that does not dynamically reorder instructions at issue time (i.e., it is an "in-order issue" design). The core interleaves instructions from two computational threads at the same time to optimize the use of issue slots, maintain maximum efficiency, and reduce pipeline depth. Simple arithmetic functions execute and forward their results in two cycles. Owing to the delayed-execution fixed-point pipeline, load instructions also complete and forward their results in two cycles. A double-precision floating-point instruction executes in ten cycles.
The PPE supports a conventional cache hierarchy, with 32-KB first-level instruction and data caches and a 512-KB second-level cache. The second-level cache and the address-translation caches use replacement management tables that allow software to direct entries with specific address ranges to a particular subset of the cache. This mechanism allows data to be locked in the cache (when the size of the address range equals the size of the set) and can also be used to prevent the overwriting of data in the cache, by directing data that is known to be used only once to a particular set. Providing these functions enables increased efficiency and increased real-time control of the processor.
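The paper does not describe the software interface to the replacement management tables, so the following C model is hypothetical: it only illustrates the idea of software tagging address ranges so that each range may replace only a subset of the ways in a set. All names, the table layout, and the fallback policy are invented for illustration.

    #include <stdint.h>

    #define WAYS 8                     /* illustrative 8-way set */

    typedef struct {
        uint64_t base, limit;          /* address range the entry covers */
        uint8_t  way_mask;             /* bit i set => way i may be replaced */
    } rmt_entry;                       /* hypothetical RMT entry */

    /* Choose a victim way for 'addr': only ways permitted by a matching
     * RMT entry are candidates; with no match, any way may be replaced. */
    int pick_victim(const rmt_entry *rmt, int n, uint64_t addr, int lru_way)
    {
        uint8_t mask = 0xFF;           /* default: all ways replaceable */
        for (int i = 0; i < n; i++) {
            if (addr >= rmt[i].base && addr < rmt[i].limit) {
                mask = rmt[i].way_mask;
                break;
            }
        }
        if (mask & (1u << lru_way))    /* honor LRU when it is allowed */
            return lru_way;
        for (int w = 0; w < WAYS; w++) /* otherwise, first permitted way */
            if (mask & (1u << w))
                return w;
        return lru_way;                /* empty mask: a real design would bypass */
    }

In this model, locking an address range amounts to giving every other range a mask that excludes the locked ways, and streaming data that is used only once can be confined to a single way so that it cannot sweep the rest of the cache.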
The processor provides two simultaneous threads of execution and can be viewed as a two-way multiprocessor with shared dataflow; this gives software the effective appearance of two independent processing units. All architected state is duplicated, including all architected registers and special-purpose registers, with the exception of registers that deal with system-level resources such as logical partitions, memory, and thread control. Non-architected resources such as caches and queues are generally shared by both threads, except where the resource is small or offers a critical performance improvement for multithreaded applications.

The processor is composed of three units [Figure 2(a)]. The instruction unit (IU) is responsible for instruction fetch, decode, branch, issue, and completion. A fixed-point execution unit (XU) is responsible for all fixed-point instructions and all load/store-type instructions. A vector scalar unit (VSU) is responsible for all vector and floating-point instructions.
[Figure 2: Power processor element: (a) major units and (b) pipeline diagram. Instruction fetch and decode fetches and decodes four instructions in parallel from the first-level instruction cache for two simultaneously executing threads in alternating cycles. When both threads are active, two instructions from one of the threads are issued in program order in alternate cycles. The core contains one instance of each of the major execution units (branch, fixed-point, load/store, floating-point (FPU), and vector-media (VMX)). Processing latencies are color-coded in part (b) to correspond to part (a). Simple fixed-point instructions execute in two cycles. Because execution of fixed-point instructions is delayed, the load-to-use penalty is limited to one cycle. The branch miss penalty is 23 cycles, comparable to the penalty in designs with a much lower operating frequency.]

The IU fetches four instructions per cycle per thread into an instruction buffer and dispatches the instructions from this buffer. After decode and dependency checking, instructions are dual-issued to an execution unit. A 4-KB by 2-bit branch history table with 6 bits of global history per thread is used to predict the outcome of branches. The IU can issue up to two instructions per cycle. All dual-issue combinations are possible, except for two instructions to the same unit and the following exceptions: simple vector, complex vector, vector floating-point, and scalar floating-point arithmetic instructions cannot be dual-issued with instructions of the same type (for example, a simple vector instruction with a complex vector instruction is not allowed). However, these instructions can be dual-issued with any other form of load/store, fixed-point, branch, or vector-permute instruction. A VSU issue queue decouples the vector and floating-point pipelines from the remaining pipelines. This allows vector and floating-point instructions to be issued out of order with respect to other instructions.
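The exact indexing of this branch history table is not given here; the following C sketch models one plausible, gshare-style reading ("4-KB by 2-bit" taken as 4,096 two-bit saturating counters, hashed with 6 bits of per-thread global history). It is a generic stand-in for illustration, not the PPE's actual predictor.

    #include <stdint.h>

    #define BHT_ENTRIES 4096           /* "4-KB by 2-bit" table, assumed 4K entries */
    #define HIST_MASK   0x3Fu          /* 6 bits of global history */

    typedef struct {
        uint8_t bht[BHT_ENTRIES];      /* 2-bit saturating counters, 0..3 */
        uint8_t ghr;                   /* per-thread global history */
    } predictor;

    static unsigned bht_index(const predictor *p, uint64_t pc)
    {
        return ((unsigned)(pc >> 2) ^ (p->ghr & HIST_MASK)) & (BHT_ENTRIES - 1);
    }

    int predict_taken(const predictor *p, uint64_t pc)
    {
        return p->bht[bht_index(p, pc)] >= 2;   /* weakly/strongly taken */
    }

    void train(predictor *p, uint64_t pc, int taken)
    {
        uint8_t *c = &p->bht[bht_index(p, pc)];
        if (taken && *c < 3) (*c)++;            /* saturate toward taken */
        if (!taken && *c > 0) (*c)--;           /* saturate toward not taken */
        p->ghr = (uint8_t)(((p->ghr << 1) | (taken ? 1 : 0)) & HIST_MASK);
    }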
The XU consists of a 32-entry by 64-bit general-purpose register file per thread, a fixed-point execution unit, and a load/store unit. The load/store unit consists of the L1 D-cache, a translation cache, an eight-entry miss queue, and a 16-entry store queue. The load/store unit supports a non-blocking L1 D-cache, which allows cache hits under misses.

The VSU floating-point execution unit consists of a 32-entry by 64-bit register file per thread, as well as a ten-stage double-precision pipeline. The VSU vector execution units are organized around a 128-bit dataflow. The vector unit contains four subunits: simple, complex, permute, and single-precision floating point. There is a 32-entry by 128-bit vector register file per thread, and all instructions are 128-bit SIMD with varying element width (2 × 64-bit, 4 × 32-bit, 8 × 16-bit, 16 × 8-bit, and 128 × 1-bit).
Synergistic processing element

The SPE [11] implements a new instruction-set architecture optimized for power and performance on compute-intensive and media applications. The SPE (Figure 3) operates on a local store memory (256 KB) that holds instructions and data. Data and instructions are transferred between this local memory and system memory by asynchronous coherent DMA commands, executed by the memory flow control unit included in each SPE. Each SPE supports up to 16 outstanding DMA commands. Because these coherent DMA commands use the same translation and protection as the PPE, governed by the page and segment tables of the Power Architecture, addresses can be passed between the PPE and SPEs, and the operating system can share memory and manage all of the processing resources in the system in a consistent manner. The DMA unit can be programmed in one of three ways: (1) with instructions on the SPE that insert DMA commands into the queues; (2) by preparing (scatter-gather) lists of commands in the local store and issuing a single "DMA list" of commands; or (3) by inserting commands into the DMA queue from another processor in the system (with the appropriate privilege), using store or DMA-write commands. For programming convenience, and to allow local-store-to-local-store DMA transactions, the local store is mapped into the memory map of the processor, but this memory (if cached) is not coherent in the system.
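As a sketch of the first of these three methods (the SPE enqueuing its own DMA commands), the following C fragment uses the SPE-side intrinsics that the later Cell SDK exposed in spu_mfcio.h (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). That interface postdates this paper and is assumed here; the buffer and function names are illustrative.

    #include <spu_mfcio.h>             /* SPE-side MFC intrinsics (Cell SDK) */

    /* Local-store buffer; DMA transfers favor 128-byte alignment. */
    static volatile char buf[16384] __attribute__((aligned(128)));

    /* Pull 'size' bytes from effective address 'ea' into the local store
     * and block until the transfer's tag group completes. */
    void fetch_chunk(unsigned long long ea, unsigned int size)
    {
        const unsigned int tag = 0;            /* DMA tag group 0 */
        mfc_get(buf, ea, size, tag, 0, 0);     /* enqueue get: EA -> LS */
        mfc_write_tag_mask(1u << tag);         /* select the tag group */
        mfc_read_tag_status_all();             /* wait for completion */
    }

A double-buffered variant would issue the next mfc_get on a second tag before waiting on the first, overlapping transfer with computation; this is how the high transaction concurrency discussed earlier is obtained in practice.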
The local store organization introduces another level of memory hierarchy beyond the registers that provide local storage of data in most processor architectures. Its purpose is to provide a mechanism to combat the "memory wall," since it allows a large number of memory transactions to be in flight simultaneously without requiring the deep speculation that drives high degrees of inefficiency on other processors. With main-memory latency approaching a thousand cycles, the few cycles it takes to set up a DMA command become acceptable overhead for accessing main memory. This organization of the processor obviously provides good support for streaming, but because the local store is large enough to hold more than a simple streaming kernel, a wide variety of programming models can be supported, as discussed later.
The local store is the largest component in the SPE, and it was important to implement it efficiently [14]. A single-port SRAM cell is used to minimize area. To provide good performance despite the fact that the local store must arbitrate among DMA reads, DMA writes, instruction fetches, loads, and stores, the local store was designed with both narrow (128-bit) and wide (128-byte) read and write ports. The wide access is used for DMA reads and writes as well as for instruction (pre)fetch. Because a typical 128-byte DMA read or write requires 16 processor cycles to place the data on the on-chip coherent bus, even when DMA reads and writes occur at full bandwidth, seven of every eight cycles remain available for loads, stores, and instruction fetch. Similarly, instructions are fetched 128 bytes at a time, minimizing pressure on the local store. The highest priority is given to DMA commands, the next highest priority to loads and stores; instruction (pre)fetch occurs whenever a cycle is available. A special no-operation instruction exists to force the availability of a slot for instruction fetch when necessary.
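A worked version of that arbitration arithmetic, assuming (as the text implies) that the wide port satisfies a 128-byte transfer in a single local-store cycle while the coherent bus needs 16 processor cycles to move the same 128 bytes:

    #include <stdio.h>

    int main(void)
    {
        const double bus_cycles_per_line = 16.0; /* cycles per 128-B bus transfer */
        const double ls_accesses         = 2.0;  /* DMA read + DMA write streams */

        /* Even with DMA in both directions at full bus bandwidth, the
         * local store is occupied only 2 cycles out of every 16. */
        double busy = ls_accesses / bus_cycles_per_line;
        printf("local-store cycles spent on DMA: %.3f\n", busy); /* 0.125 */
        printf("cycles free per 8 for loads/stores/fetch: %.0f\n",
               8.0 * (1.0 - busy));                              /* 7 */
        return 0;
    }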
The execution units of the SPU are organized around a 128-bit dataflow. A large register file with 128 entries provides
