throbber
ALTIVEC EXTENSION TO POWERPC
`ACCELERATES MEDIA PROCESSING
`
`DESIGNED AROUND THE PREMISE THAT MULTIMEDIA WILL BE THE PRIMARY
`
`CONSUMER OF PROCESSING CYCLES IN FUTURE PCS, ALTIVEC—WHICH APPLE
`
`CALLS THE VELOCITY ENGINE—INCREASES PERFORMANCE ACROSS A BROAD
`
`SPECTRUM OF MEDIA PROCESSING APPLICATIONS.
`
`There is a clear trend in personal com-
`puting toward multimedia-rich applications.
`These applications will incorporate a wide vari-
`ety of multimedia technologies, including audio
`and video compression, 2D image processing,
`3D graphics, speech and handwriting recogni-
`tion, media mining, and narrow-/broadband
`signal processing for communication.
`In response to this demand, major micro-
`processor vendors have announced architec-
`tural extensions to their general-purpose
`processors in an effort to improve their multi-
`media performance. Intel extended IA-32 with
`MMX1 and SSE (alias KNI),2 Sun enhanced
`Sparc with VIS,3 Hewlett-Packard added
`MAX4 to its PA-RISC architecture, Silicon
`Graphics extended the MIPS architecture with
`MDMX,5 and Digital (now Compaq) added
`MVI to Alpha. This article describes the most
`recent, and what we believe to be the most
`comprehensive, addition to this list: Power-
`PC’s AltiVec.6,7 AltiVec speeds not only media
`processing but also nearly any application in
`which data parallelism exists, as demonstrat-
`ed by a cycle-accurate simulation of Motoro-
`la’s MPC 7400, the heart of Apple G4 systems.
`
`Highlights and performance summary
`Like all the other extensions, AltiVec is a
`SIMD (single-instruction, multiple-data)
`
`extension to a general-purpose architecture.
`But the similarity ends there. Whereas the
`other extensions were obviously constrained
`by backward compatibility and a desire to
`limit silicon investment to a small fraction of
`the processor die area, the primary goal for
`AltiVec was high functionality. It was designed
`from scratch around the premise that multi-
`media will become the primary consumer of
`processing cycles8 in future PCs and therefore
`deserves first-class treatment in the CPU.
`Unlike most other extensions, which over-
`load their floating-point (FP) registers to
`accommodate multimedia data, AltiVec ded-
`icates a large new register file exclusively to it.
`Although overloading the FP registers avoids
`new architectural state, eliminating the need
`to modify the operating system, it also signif-
`icantly compromises performance, which was
`not acceptable for AltiVec.
`AltiVec treats multimedia data as first-class
`data in the form of vectors. Vector elements
`include all of the major data types found in
`3D graphics, image processing, digital audio
`and video, speech recognition, data mining,
`and other multimedia applications.
`AltiVec’s powerful data reorganization capa-
`bility goes far beyond that of any previous
`SIMD engine, making AltiVec uniquely well
`suited to the bit-parallel algorithms found in
`
`Keith Diefendorff
`Microprocessor Report
`
`Pradeep K. Dubey
`IBM Research Division
`
`Ron Hochsprung
`Apple Computer
`
`Hunter Scales
`Motorola Corporation
`
`0272-1732/00/$10.00 (cid:211) 2000 IEEE
`
`85
`
`SAMSUNG-1027
`Page 1 of 12
`
`

`

`ALTIVEC EXTENSION
`
`Table 1. Data types for various media tasks.
`
` Data type
` 8-bit integer
` 16-bit integer
`Single-precision float
`Unsigned
`Signed
`Unsigned
`Signed
`Signed
`Low quality
`High quality
`Low quality
`High quality
`Low quality
`Low quality
`
`High quality
`
`High quality
`High quality
`
`Task
`Video
`Audio
`Image processing
`3D graphics
`Speech recognition
`Communication
`Media mining
`
`Low quality
`
`128-bit vector, loop overheads
`tend to be small, giving
`AltiVec processors perfor-
`mance approaching that of
`true vector machines.
`On the basis of cycle-
`accurate simulations of more
`than 40 media processing ker-
`nels, we found that AltiVec
`delivered an average speedup
`of 6.5 on integer kernels and
`5.1 on floating-point kernels,
`over the same PowerPC
`processor without AltiVec.
`The speedups often approach—and sometimes
`even exceed—the theoretical SIMD paral-
`lelism, which is 16 on 8-bit data (for example,
`video), eight on 16-bit data (for example,
`modem filters), and four on 32-bit integers and
`floats (for example, 3D graphics and high-
`fidelity audio). Speedups greater than the theo-
`retical parallelism arise from the ability to use
`new algorithms that are inappropriate for scalar
`processors or for less capable SIMD processors.
`
`Crypto
`
`Crypto
`
`High quality
`
`digital signal processing (DSP) domains.
`These include error correction, bit-packing
`kernels, and many others.
`AltiVec extends the scalar PowerPC archi-
`tecture with a powerful new set of SIMD
`instructions. These instructions execute from
`the same instruction stream as the PowerPC’s
`scalar integer, floating-point, and branch
`instructions.
`AltiVec’s major architectural characteristics
`include
`
`• fixed-length 128-bit vectors, each com-
`prising four, eight, or 16 data elements;
`• a separate vector register file with a 32-
`register namespace, each register holding
`one 128-bit vector;
`• vector-element data types of 8-, 16-, and
`32-bit signed or unsigned integers, as
`well as IEEE single-precision floats;
`• 162 new SIMD-style instructions opti-
`mized for digital signal processing;
`• saturation or modulo arithmetic;
`• a four-operand, nondestructive instruc-
`tion format (three sources, one destina-
`tion); and
`• modeless operation for zero overhead use
`of AltiVec instructions.
`
`SIMD parallelism is well matched to the
`parallelism found in the packed-data streams
`of media applications. To use SIMD process-
`ing, algorithms typically break long data
`streams into sequences of short fixed-length
`vector operands. SIMD instructions then
`process these vectors iteratively in loops, each
`instruction performing the same operation on
`all corresponding elements in the source-
`operand vectors in parallel. With AltiVec’s long
`
`86
`
`IEEE MICRO
`
`AltiVec architecture
`One of the attributes that enable large
`speedups across such a broad spectrum of
`media processing applications is AltiVec’s sup-
`port for all of the important media data types.
`Table 1 shows the various data types that a
`processor must support if it is to perform well
`on media processing tasks. To date, AltiVec is
`the only SIMD architectural extension to sup-
`port all these types.
`AltiVec’s large vector register file provides
`quick access to a large number of values, such
`as the transform or filter coefficients that are
`accessed frequently in signal processing loops.
`The large register namespace facilitates soft-
`ware pipelining and loop unrolling necessary
`to cover the long latencies associated with
`media streams. With a separate register file,
`the general-purpose and floating-point regis-
`ters are not encumbered with multimedia
`data, so media processing doesn’t interfere
`with scalar processing. The separate file also
`permits the vector registers to be physically
`optimized for the wide SIMD execution units.
`Another important AltiVec feature is its
`four-operand instruction format (three source
`operands, one destination). This feature gives
`each instruction extraordinarily high operand
`
`SAMSUNG-1027
`Page 2 of 12
`
`

`

`Table 2. AltiVec instruction-set summary.*
`
` Arithmetic
`
` Source elements
`
` Destination elements
`
`Vectors
`
`X
`X
`
`X
`
`X
`
`Floats
`
`X
`
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`X
`
`Words
`
`Halfwords
`
`X
`
`X
`X
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`
`X
`
`X
`
`X
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`
`Bytes
`
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`
`X
`X
`X
`
`X
`
`Vectors
`
`Floats
`
`Words
`
`Halfwords
`
`Bytes
`
`Operands
`
`Saturate
`
`Modulo
`
`Unsigned
`
`Signed
`
`X
`X
`X
`X
`X
`X
`X
`
`X
`X
`
`X
`X
`X
`
`X
`X
`X
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`
`X
`X
`
`X
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`X
`
`X
`X
`X
`
`X
`X
`X
`X
`
`X
`
`2
`2
`3
`3
`2
`2
`2
`2
`2
`2
`2
`2
`2
`2
`3
`2
`1
`1
`2
`1
`1
`1
`
`X
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`
`X
`
`X
`X
`
`X
`
`X
`
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`X
`X
`
`Instruction class
`Load/store
`Stream prefetch
`Add/sub
`Multiply
`Multiply-add
`Multiply-sum
`Sum across
`Partial sum across
`Average
`Logicals
`Rotate/shift
`Compare
`Select
`Pack
`Unpack/merge
`Splat
`Permute
`Shift elements
`Round to integer
`Convert w/scale
`Max/min
`1/xestimate
`1/sqrt(x) estimate
`Log/power estimates
`
`*This table summarizes AltiVec capabilities in a concise form. Not all combinations shown are available for every instruction in a given class.
`
`bandwidth and supports the encoding of pow-
`erful instructions such as multiply-add, per-
`mute, and select (described later). Since the
`four-operand format is nondestructive, it also
`eliminates the excess register shuffling and
`copying that comes with destructive two-
`operand formats like that of the x86 architec-
`ture. Thus, AltiVec’s instruction format allows
`programs to use registers efficiently, minimiz-
`ing spill/fill traffic to memory and producing
`a short instruction path, which are both
`important for efficient signal processing loops.
`AltiVec is based on a simple RISC-style
`load/store architecture, but instructions oper-
`ate on vector operands rather than on the sim-
`ple scalar operands of classical RISC engines.
`The AltiVec instruction set was distilled from
`
`many digital-media-processing algorithms
`into a set of generalized primitives that sup-
`port common operations such as saturation
`arithmetic. Using this approach, the design
`can support a wide spectrum of media appli-
`cations while avoiding the highly specialized
`instructions commonly found in traditional
`DSPs. Counting all variations of data types
`and arithmetic (modulo, saturation, signed,
`and unsigned), AltiVec adds 162 new instruc-
`tions to the PowerPC architecture, as sum-
`marized in Table 2.
`The AltiVec design criteria called for all
`instructions to be easily pipelined and suitable
`for superscalar, out-of-order dispatch. All
`AltiVec processors are expected to implement
`the full architectural vector width and to fully
`
`MARCH–APRIL 2000
`
`87
`
`SAMSUNG-1027
`Page 3 of 12
`
`

`

`ALTIVEC EXTENSION
`
`01
`
`04
`
`08
`
`00
`
`1F
`
`15
`
`09
`
`0A 05
`
`1F
`
`02
`
`03
`
`07
`
`0D 0B 0E
`
`vector instructions with Pow-
`erPC scalar instructions.
`
`E
`
`F
`
`Permute power
`Much of AltiVec’s perfor-
`mance and flexibility derives
`from the permute instruction
`(vperm), illustrated in Figure
`1a. This instruction performs
`two essential functions: data
`reorganization and
`table
`lookup. One of the historical
`problems with SIMD archi-
`tectures is that if the data
`structures do not precisely
`match the hardware organiza-
`tion, the program must pre-
`process the data to conform to
`the hardware. This prepro-
`cessing overhead
`severely
`reduces the potential SIMD
`speedup. Permute eliminates
`this problem by providing a
`method for arbitrarily re-
`arranging vector elements
`with a single instruction. Per-
`mute’s dual source-vector
`operands
`(VA and VB)
`enhance this capability by
`allowing rearrangement of
`data across vector boundaries.
`The other important use
`for permute (table lookup) is
`discussed later. Although per-
`mute requires a full 32 · 16
`bytewise hardware crossbar,
`the enormous value of per-
`mute’s table lookup function
`alone makes it worth the
`incremental hardware cost
`over that of the more restric-
`tive data shuffling operations found in other
`SIMD machines.
`In addition to the full generality of the per-
`mute instruction, AltiVec provides specific
`variants for unpacking (expanding) small ele-
`ments into larger fields, packing (truncating)
`large elements into smaller, tightly packed
`fields, merging (shuffling) elements from two
`vectors into one vector, replicating an element
`across a vector (splat), and double-vector shift-
`ing and rotating. These variations specify the
`permute-control vector implicitly or as an
`
`VC
`
`1 0
`
`VB
`
`VA
`
`VT
`
`(a)
`
`VC
`
`VT
`
`(b)
`
`0
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9
`
`A
`
`B
`
`C
`
`D
`
`vperm VT, VA, VB, VC
`
`VA
`
`VB
`
`****************
`
`+
`
`+
`
`+
`
`vmsumubm VT, VA, VB, VC
`
`+
`
`Figure 1. Vector permute (a) and vector dot-product (b) primitives in AltiVec.
`
`pipeline all instructions; that is, all instructions
`will issue back-to-back with a throughput of
`at least one instruction per cycle. Most imple-
`mentations will support simultaneous dispatch
`of one ALU-class vector instruction along with
`one permute-class vector instruction, or either
`of
`these paired with a vector
`load
`or store. Thus, the peak throughput of AltiVec
`instructions will typically be two per cycle, as
`it is in Motorola’s MPC 7400 processor. No
`AltiVec implementations will ever impose any
`restriction on, or suffer any penalty for, mixing
`
`88
`
`IEEE MICRO
`
`SAMSUNG-1027
`Page 4 of 12
`
`

`

`explicit literal within the instruction opcode,
`thus avoiding the overhead of creating and
`storing the permute-control vector in these
`common cases. Special forms of pack and
`unpack are provided for 16-bit pixels in a
`1/5/5/5 format.
`
`Vector dot product (multiply-sum)
`One of the most common DSP operations
`is the vector dot product. AltiVec accomplish-
`es this operation with two instructions: multi-
`ply-sum and sum-across. Multiply-sum,
`illustrated in Figure 1b, multiplies corre-
`sponding elements in two vector registers (VA
`and VB), sums those products with four values
`from a third vector register (VC), and deposits
`the four 32-bit partial sums into the destina-
`tion vector register (VT). VC serves to accu-
`mulate partial sums for taking the dot product
`of long vectors. AltiVec processors carry out the
`multiplications to full precision and then sub-
`ject the individual partial sums to saturation
`(clamp to max on overflow, min on underflow).
`As a final step, the sum-across instruction can
`be used to sum the four accumulated partial
`sums into a single 32-bit scalar result.
`AltiVec provides multiply-sum in byte- and
`halfword-element forms. The byte-element
`forms support motion estimation in video
`compression. During motion search, the mul-
`tiply-sum instruction is used to locate the
`closest matching macroblock using the sum-
`of-differences-squared (ai – bj)2 measure. This
`approach produces a higher quality compari-
`son than one based on the sum-of-absolute-
`difference (|ai – bj|) instruction used by
`architectures such as VIS and SSE, while still
`achieving a throughput of 16 pixels per cycle.
`
`Multiply accumulate
`Another common DSP operation is multi-
`ply-accumulate. This operation underlies
`many digital filters, mathematical transforms,
`matrix-arithmetic operations, and so on.
`AltiVec provides multiply-add, a more pow-
`erful version of multiply-accumulate, with
`three source operands and one destination.
`With multiply-add, the respective elements
`in two source vectors are multiplied and the
`products added to corresponding elements of
`a third source vector. The intermediate cal-
`culations are carried out to infinite precision,
`and the final product sums are truncated to
`
`fit into destination-register fields the same size
`as the source elements.
`For halfword elements, AltiVec’s multiply-
`add has two forms: multiply-add-low and
`multiply-add-high. In the low form, the
`unsigned accumulator elements are added to
`the full product, and the intermediate prod-
`uct-sums are truncated modulo 216. In the
`high form, the signed accumulator elements
`are left justified (7-bit left shift), added to the
`32-bit intermediate products, and then satu-
`rated to fit into the 16-bit destination element
`fields. The 7-bit left shift (as opposed to 8-
`bit) results in one extra bit of precision by tak-
`ing advantage of the fact that the most
`significant bit of each signed 32-bit interme-
`diate product is redundant. A variant of this
`form rounds the intermediate product-sums
`to squeeze out an additional half bit of preci-
`sion. These tricks provide additional precision
`that is important to several algorithms, espe-
`cially audio processing.
`Multiply-add is not provided for byte ele-
`ments because 8 bits of precision is not suffi-
`cient for most DSP algorithms. In most
`algorithms involving 8-bit data, the elements
`are first expanded to 16 bits, where most com-
`putations are carried out, and the final results
`are truncated back to 8 bits. Video compres-
`sion and decompression, which use 8-bit val-
`ues throughout, are exceptions, but the
`operations involved in those algorithms are
`more suited to multiply-sum, which does have
`8-bit forms. In the few cases that require mul-
`tiply-add of byte elements, vector-multiply
`and vector-add instructions are provided.
`As a concession to silicon area, AltiVec does
`not provide a vector-multiply or multiply-add
`for 32-bit integers. For applications requiring
`more than 16 bits of precision, such as high-
`fidelity digital audio, 24-bit precision is usu-
`ally sufficient. AltiVec provides this level of
`precision with its floating-point instructions,
`which do include a multiply-add. Floating
`point also provides the wide dynamic range
`that is often the real motivation to use inte-
`gers larger than 24 bits.
`Conversion between integer and floating-
`point formats is very fast in AltiVec. A single
`instruction converts and scales a vector of four
`signed or unsigned 32-bit fixed-point words to
`a vector of four single-precision floating-point
`values, or vice versa. An instruction for round-
`
`MARCH–APRIL 2000
`
`89
`
`SAMSUNG-1027
`Page 5 of 12
`
`

`

`ALTIVEC EXTENSION
`
`ing floating-point values to integral values via
`any of the four IEEE-754 rounding modes9 is
`also provided. Since all AltiVec instructions are
`fully pipelined, SIMD floating-point arith-
`metic throughput is similar to that of SIMD
`integer arithmetic as long as the floating-point
`unit’s pipeline latency can be covered, which it
`usually can. With these floating-point features,
`AltiVec sacrifices little by not directly support-
`ing 32-bit-integer multiply.
`For division and square root, AltiVec uses
`Newton-Raphson refinement of a reciprocal
`seed. The high accuracy achieved with
`AltiVec’s fused multiply-add instruction (sin-
`gle rounding after the add operation) allows
`rapid convergence to an IEEE-754–accurate
`reciprocal value. This approach provides
`divide performance equaling or exceeding that
`of processors with expensive division hard-
`ware. Unlike with hardware dividers, AltiVec’s
`approach carries no hidden-state information
`from cycle to cycle; thus, division can be fully
`pipelined and intermediate instructions can
`be easily rescheduled.
`
`Conditionals
`Changes in control flow present a serious
`performance problem for any processor, espe-
`cially those running DSP applications where
`loops tend to be tight and the data-dependent
`decisions difficult to predict. The latter are
`nearly intractable in SIMD architectures,
`which process multiple data elements in a sin-
`gle instruction.
`Although AltiVec provides a means for con-
`ditional branching, it places more emphasis on
`avoiding branches. To this end, AltiVec offers
`a type of conditional-move instruction, called
`select (vsel), which, in concert with vector-
`compare instructions, operates efficiently on
`SIMD vectors. The vector-compare instruc-
`tions generate a predicate vector that can be
`stored in any vector register and used by sub-
`sequent vsel instructions to choose elements
`from one of two registers, depending on the
`value of the corresponding predicate elements.
`With this mechanism, data-dependent deci-
`sions can be made on all elements in a vector
`in parallel, making it unnecessary to test and
`branch on each element individually. AltiVec
`can use the vsel instruction to simulate predi-
`cated execution, which can eliminate many
`branches. Since vsel selects values on a bit-by-
`
`bit basis, it is also useful for selecting and merg-
`ing vector subfields that do not fall on element
`boundaries.
`In cases where data-dependent redirection
`of program control flow cannot be avoided,
`AltiVec’s vector-compare instructions option-
`ally update PowerPC’s condition register
`(CR06) field. Subsequent PowerPC condi-
`tional-branch instructions can test this regis-
`ter in the normal manner. A special form of
`vector compare—vector compare-bounds—
`speeds up 3D-graphics clipping operations.
`
`Loads and stores
`The philosophy behind AltiVec’s memory
`operations is to support only basic load and
`store primitives in an effort to keep the mem-
`ory path as fast as possible. The load-vector
`and store-vector instructions transfer full
`quadword vectors between memory and the
`vector registers. Load-vector-element and
`store-vector-element instructions transfer
`individual byte, halfword, and word scalar
`elements between memory and the vector reg-
`isters. Vector loads and stores use the index-
`addressing mode (RA|0 + RB) only.
`All memory accesses are aligned on their nat-
`ural size boundary. If a load or store’s address
`is not size aligned, the appropriate number of
`least-significant address bits is ignored and an
`aligned transfer occurs. AltiVec provides assis-
`tance for extracting misaligned data once it is
`in the registers. Special load-vector-for-shift-
`left/right instructions assist in this process by
`computing a permute-control vector based on
`the misaligned memory address. Random iso-
`lated unaligned-vector loads can be simulated
`with just four instructions, but the average cost
`of unaligned-vector loads in a long linear
`sequence, which is the more important case,
`approaches an average of only two instructions
`(one load vector and one vector permute).
`These two instructions will issue simultane-
`ously in most AltiVec implementations.
`
`Software-directed prefetch streams
`AltiVec allows software to manage the band-
`width between processor and memory with
`explicit cache management instructions. With
`these instructions, software can indicate to the
`cache hardware how it should prefetch data and
`prioritize replacement. The principal instruc-
`tion for this purpose is the cache-prefetch
`
`90
`
`IEEE MICRO
`
`SAMSUNG-1027
`Page 6 of 12
`
`

`

`instruction, called data-stream touch. This
`instruction specifies the starting address, a block
`size (one to three vectors), the number of blocks
`to prefetch (one to 256 blocks), a signed stride
`(± 32,768 bytes), and a 2-bit tag that unique-
`ly identifies one of four prefetch streams that
`can run simultaneously. Other forms of data-
`stream touch are provided for writing data into
`streams and marking data as transient, that is,
`as having poor temporal locality.
`
`AltiVec coding examples
`AltiVec has proven adept at accelerating a
`wide range of multimedia and DSP applica-
`tions. To illustrate its utility, we present cod-
`ing examples from the areas of image
`processing, wavelet signal processing, and
`Galois-field arithmetic.
`
`Median filter
`Median filter is an algorithm commonly
`used in image processing to smooth out noise.
`The support region of the pixel neighborhood
`in such a filter is normally a sliding filter win-
`dow. Since the data type is usually 8-bit
`unsigned (intensity) values, the sequence of
`scalar memory accesses typically involves
`numerous misaligned memory accesses. Fig-
`ure 2 provides a high-level description of a
`5 · 5 median filter based on a simple algo-
`rithm. The two most important functions are
`pixel load and compare-and-swap.
`In Figure 2, one pixel neighborhood appears
`in the dashed box. Twenty-five vectors are
`needed to calculate 16 output pixels in the
`third row. Even though pixel sets are mis-
`aligned, the use of AltiVec’s load and shift fea-
`tures completely avoids expensive unaligned
`memory accesses. Additionally, with this
`approach each pixel is fetched only once, thus
`minimizing the number of memory accesses.
`This example illustrates AltiVec’s double-
`shift instruction, vsldi. The critical innermost
`loop of compare-and-swap (V1, V2) can be
`implemented using a simple three-instruction
`sequence of vmaxub Vx, V1, V2; vminub V1,
`V1, V2; vor V2, V1, Vx. This sequence,
`which accumulates the minimum pixel value
`in V1, can be executed in just two cycles on a
`typical AltiVec superscalar implementation,
`resulting in a significant speedup over scalar
`processors for median filters, even with this
`simple algorithm. With a more sophisticated
`
`Sliding window
`
`Aligned
`
`(a)
`
`Aligned
`
`for each pixel vector { /* i.e., a set of 16 pixels */
` load appropriate 25 pixel values P[0..24]
` for i : = 0 to 12 {
` MIN = P [i]
` for j := i upto 24 {
` Compare_and_Swap (Min, P[j])
` /* MIN <-min (MIN, P[j]);
` P[j] <- max (MIN, P[j]) */
` }
` }
` /* MIN holds the median of P[0..24] at this point */
`}
`
`1. lvx
`
`2. lvx
`
`V0, [1600]
`
`; 10% gray pixels from memory, Row 1
`
`V31, [1616]
`
`; Next set of 10% gray pixels, Row 1
`
`3. vsldi
`
`V1, V0, V31, 1
`
`; Funnel shift for 20% gray pixels, Row 1
`
`4. vsldi
`
`V2, V0, V31, 2
`
`; Funnel shift for 30% gray pixels, Row 1
`
`5. vsldi
`
`V3, V0, V31, 3
`
`; Funnel shift for 40% gray pixels, Row 1
`
`6. vsldi
`
`V4, V0, V31, 4
`
`; Funnel shift for 50% gray pixels, Row 1
`
`(b)
`
`(c)
`
`Figure 2. Pixel access during 5 · 5 median filtering.
`
`algorithm, AltiVec can execute median filter at
`1.2 cycles per output pixel.
`
`Haar transform
`The applications of wavelet technology
`have increased markedly in areas such as dig-
`ital image processing, data compression,
`astronomy, acoustics, sub-band coding, imag-
`ing, speech discrimination, and fractal com-
`pression. The Haar transform, introduced in
`1910, is the oldest wavelet transform. The
`2 · 2 transform decomposes an image into
`four bands whose spatial frequencies and spa-
`tial contents differ.
`Figure 3 (next page) illustrates this trans-
`form, along with its implementation in
`AltiVec code. This example illustrates the use
`of the multiply-sum operation. Support for
`
`MARCH–APRIL 2000
`
`91
`
`SAMSUNG-1027
`Page 7 of 12
`
`

`

`ALTIVEC EXTENSION
`
`2· 2 input
`block
`
`Input pixels
`(unsigned bytes)
`
`Output pixels
`(signed 16 bits)
`
`P00 P01 P10 P11 P20 P21 P30 P31
`
`Band00
`
`Band01
`
`Band02
`
`Band03
`
`P02 P03 P12 P13 P22 P23 P32 P33
`
`Band10
`
`Band11
`
`Band12
`
`Band13
`
`Transform
`Band0i = Pi0 + Pi1 + Pi2 + Pi3
`Band1i = Pi0 + Pi1 - Pi2 - Pi3
`Band2i = Pi0 - Pi1 + Pi2 - Pi3
`Band3i = Pi0 - Pi1 - Pi2 + Pi3
`
`(a)
`
`Band20
`
`Band21
`
`Band22
`
`Band23
`
`Band30
`
`Band31
`
`Band32
`
`Band33
`
`0.; V0 and V1 hold two 2· 2 blocks of input pixels, V30 holds null
`0.; V4 holds 4-byte sequence: 1, 1, 1, 1 splatted
`0.; V5 holds 4-byte sequence: 1, 1, - 1, - 1 splatted
`0.; V6 holds 4-byte sequence: 1, - 1, 1, - 1 splatted
`0.; V7 holds 4-byte sequence: 1, - 1, - 1, 1 splatted
`; V1 = P71 P70 … P01 P00
`1. lvx
`V0, [image]
`V1, [image +XX]
`; V2 = P73 P72 … P03 P02
`2. lvx
`; V3 = … P43 P42 P41 P40
`3. vmrghh
`V2, V1, V0
`; V4 = … P03 P02 P01 P00
`4. vmrglh
`V0, V1, V0
`; V10 = Band07 … Band04
`5. vmsummbm
`V11, V2, V4, V30
`; V11 = Band03 … Band01
`6. vmsummbm
`V10, V0, V4, V30
`; V10 = Band07 … Band00
`7. vpkswss
`V10, V11, V10
`
`0.; other bands are similarly calculated
`
`(b)
`
`Figure 3. A 2 · 2 Haar transform (a) implemented in AltiVec code (b).
`
`vperm
`Index0…3
`
`vsel
`Index4
`vsel
`Index5
`vsel
`Index6
`vsel
`Index7
`
`Figure 4. 256-entry, 8-bit-element table lookup using AltiVec.
`
`one signed and one unsigned operand is cru-
`cial for filtering applications like this, where
`the signed filter coefficients are small enough
`to fit in a byte and the input is unsigned 8-bit
`data. A video-loop filter, which performs
`smoothing in YUV color space, is another
`such application. Here the convolution ker-
`nel is separable and has coefficients small
`enough to fit in a byte. In the absence of archi-
`tectural support for byte data types, the pro-
`
`92
`
`IEEE MICRO
`
`gram would have to unpack the data and
`extend the coefficients to 16 bits, losing half
`the potential parallelism.1,3
`
`Galois-field arithmetic
`Table lookups involving small tables (256 or
`fewer elements) and small elements (8 or fewer
`bits) are common in multimedia and commu-
`nication kernels such as cyclic redundancy
`check (CRC) code generation, convolution
`encoding, Galois-field (GF) multiplication, and
`many nonlinear functions, such as gamma cor-
`rection. Also, bit-level manipulations, includ-
`ing various simple functions such as bit-stream
`interleaving or bit (un)packing, are often pro-
`grammed using table lookup functions. Most
`traditional scalar and SIMD architectures do
`not support parallel implementation of such
`functions, so, for example, a sequence of 16
`index extractions, 16 address calculations, and
`16 data loads would be needed to load 16 data
`items based on 16 indices. AltiVec accom-
`plishes all of these operations with one vperm
`instruction.
`This powerful capability enables sophisticat-
`ed techniques like GF arithmetic, unlocking
`entirely new algorithms that are computation-
`ally intractable on conventional scalar and
`SIMD engines. Figure 4 illustrates the steps in
`performing a 16-way parallel lookup in a 256-
`element table (with byte elements).1,3
`GF arithmetic, that is, arithmetic over a
`finite field, has many important applications.
`One example is Reed-Solomon coding, which
`is a type of block code commonly used for
`error correction in broadband communica-
`tions. With add-form representation, GF addi-
`tion is just an exclusive-or operation. GF
`multiplication, however, is expensive. For
`example, a GF (24) multiplication involves
`three table lookups into two 16-element tables
`and a modulo-15 addition, as shown in Fig-
`ure 5. Note that the more commonly used GF
`(28) multiplication can be carried out using
`three GF (24) multiplications and four GF (24)
`additions.10
`
`Performance results
`The AltiVec performance data presented in
`this article is based on cycle-accurate simulation
`of Motorola’s MPC 7400 microprocessor with
`AltiVec. The simulated processor has several
`important microarchitecture characteristics:
`
`SAMSUNG-1027
`Page 8 of 12
`
`

`

`First input operand
`
`Second input operand
`
`Add-form to multiply-form table
`
`Mod-15
`addition
`
`Products in add-form
`
`Multiply-form to add-form table
`
`(a)
`
`0. ; X and Y: 16 pairs of numbers (byte wide) in add-form to be multiplied
`0. ; Z: 16 corresponding products (byte wide) in add-form
`0. ; FW table converts add-form to multiply-form, RE table does the reverse
`; V_X holds X, V_#n holds scalar n
`1. vcmpequ8 V4, V_X, V_#0
`; V_Y holds Y, check if X or Y is 0
`2. vcmpequ8 V5, V_Y, V_#0
`; product is 0 if either X or Y is 0
`3. vor V_Mask, V4, V5
`; lookup multiply-form of X
`4. vperm V1, V_FW, V0, V_X
`; lookup mulitply-form of Y
`5. vperm V2, V_FW, V0, V_Y
`; V3: product in multiply-form
`6. vaddum8 V3, V1, V2
`; fix-up for modulo 15 addition
`7. vcmpgtu8 V_mask, V3, V_#15
`8. vsubum8 V8, V3, V_#15
` subtract 15 for products ‡ 15
`9. vsel V3, V3, V8, V_mask
`; reverse_lookup add-form of prod
`10. vperm V3, V_RE, V_RE, V3
`; V3: final product in add-form
`11. vsel V3, V3, V_#0, V_Mask
`
`;;
`
`(b)
`
`• all instructions are fully
`pipelined with single-
`cycle throughput;
`• simple operations (sim-
`ple arithmetic, logical,
`permute) have a 1-cycle
`latency;
`• compound operations
`(for example, multiply-
`sum) have 3 to 4 cycles
`of latency;
`• the processor can issue
`one ALU-class and one
`permute-class instruc-
`tion each cycle (permute
`class
`includes most
`nonarithmetic opera-
`tions such as pack,
`unpack, splat, permute,
`and so on); and
`• there is no restriction on
`issuing AltiVec instruc-
`tions in the same cycle as
`PowerPC scalar instruc-
`tions.
`
`Table 3 (next page) sum-
`marizes performance on a
`variety of media functions,
`chosen to cover a wide spec-
`trum of media processing,
`including audio, video, graph-
`ics, speech recognition, and communication.
`The optimal scalar algorithm for a media func-
`tion is often different from the optimal AltiVec
`algorithm. Therefore, to present a fair com-
`parison, we made every effort to pick the best
`algorithm for each case.
`
`Figure 5. Galois-field (16) multiplication (a) implemented in AltiVec code (b).
`
`The AltiVec extension to the PowerPC
`
`architecture was designed to offer archi-
`tectural support for the entire range of multi-
`media processing. Multimedia processing has
`its roots in signal and image processing, where
`overall performance is often determined by
`the performance of critical loops iterating over
`a large input data set. AltiVec’s large vector
`register file, full-range data-type support, four-
`operand nondestructive instruction format,
`unmatched data-reorganization capability
`(permute), and powerful SIMD instruction
`set enable it to significantly speed processing
`in these critical loops. At the same time,
`
`AltiVec’s data-prefetch streams can keep the
`SIMD processing engine fed with data. The
`speedups achievable with AltiVec are sufficient
`to enable media-rich applications to run
`entirely on general-purpose PowerPC micro-
`processors without the aid of hard-to-use spe-
`cial-purpose media processors or dedicated
`hardware accelerators.
`MICRO
`
`References
`1. A. Peleg and U. Weiser, “MMX Technology
`Extension to the Intel Architecture,” IEEE
`Micro, July/Aug. 1996, pp. 42-50.
`2. K. Diefendorff, “Katmai Enhances MMX,”
`Micro

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket