`ACCELERATES MEDIA PROCESSING
`
`DESIGNED AROUND THE PREMISE THAT MULTIMEDIA WILL BE THE PRIMARY
`
`CONSUMER OF PROCESSING CYCLES IN FUTURE PCS, ALTIVEC—WHICH APPLE
`
`CALLS THE VELOCITY ENGINE—INCREASES PERFORMANCE ACROSS A BROAD
`
`SPECTRUM OF MEDIA PROCESSING APPLICATIONS.
`
`There is a clear trend in personal com-
`puting toward multimedia-rich applications.
`These applications will incorporate a wide vari-
`ety of multimedia technologies, including audio
`and video compression, 2D image processing,
`3D graphics, speech and handwriting recogni-
`tion, media mining, and narrow-/broadband
`signal processing for communication.
`In response to this demand, major micro-
`processor vendors have announced architec-
`tural extensions to their general-purpose
`processors in an effort to improve their multi-
`media performance. Intel extended IA-32 with
`MMX1 and SSE (alias KNI),2 Sun enhanced
`Sparc with VIS,3 Hewlett-Packard added
`MAX4 to its PA-RISC architecture, Silicon
`Graphics extended the MIPS architecture with
`MDMX,5 and Digital (now Compaq) added
`MVI to Alpha. This article describes the most
`recent, and what we believe to be the most
`comprehensive, addition to this list: Power-
`PC’s AltiVec.6,7 AltiVec speeds not only media
`processing but also nearly any application in
`which data parallelism exists, as demonstrat-
`ed by a cycle-accurate simulation of Motoro-
`la’s MPC 7400, the heart of Apple G4 systems.
`
`Highlights and performance summary
`Like all the other extensions, AltiVec is a
`SIMD (single-instruction, multiple-data)
`
`extension to a general-purpose architecture.
`But the similarity ends there. Whereas the
`other extensions were obviously constrained
`by backward compatibility and a desire to
`limit silicon investment to a small fraction of
`the processor die area, the primary goal for
`AltiVec was high functionality. It was designed
`from scratch around the premise that multi-
`media will become the primary consumer of
`processing cycles8 in future PCs and therefore
`deserves first-class treatment in the CPU.
`Unlike most other extensions, which over-
`load their floating-point (FP) registers to
`accommodate multimedia data, AltiVec ded-
`icates a large new register file exclusively to it.
`Although overloading the FP registers avoids
`new architectural state, eliminating the need
`to modify the operating system, it also signif-
`icantly compromises performance, which was
`not acceptable for AltiVec.
`AltiVec treats multimedia data as first-class
`data in the form of vectors. Vector elements
`include all of the major data types found in
`3D graphics, image processing, digital audio
`and video, speech recognition, data mining,
`and other multimedia applications.
`AltiVec’s powerful data reorganization capa-
`bility goes far beyond that of any previous
`SIMD engine, making AltiVec uniquely well
`suited to the bit-parallel algorithms found in
`
`Keith Diefendorff
`Microprocessor Report
`
`Pradeep K. Dubey
`IBM Research Division
`
`Ron Hochsprung
`Apple Computer
`
`Hunter Scales
`Motorola Corporation
`
`0272-1732/00/$10.00 (cid:211) 2000 IEEE
`
`85
`
`SAMSUNG-1027
`Page 1 of 12
`
`
`
`ALTIVEC EXTENSION
`
`Table 1. Data types for various media tasks.
`
` Data type
` 8-bit integer
` 16-bit integer
`Single-precision float
`Unsigned
`Signed
`Unsigned
`Signed
`Signed
`Low quality
`High quality
`Low quality
`High quality
`Low quality
`Low quality
`
`High quality
`
`High quality
`High quality
`
`Task
`Video
`Audio
`Image processing
`3D graphics
`Speech recognition
`Communication
`Media mining
`
`Low quality
`
`128-bit vector, loop overheads
`tend to be small, giving
`AltiVec processors perfor-
`mance approaching that of
`true vector machines.
`On the basis of cycle-
`accurate simulations of more
`than 40 media processing ker-
`nels, we found that AltiVec
`delivered an average speedup
`of 6.5 on integer kernels and
`5.1 on floating-point kernels,
`over the same PowerPC
`processor without AltiVec.
`The speedups often approach—and sometimes
`even exceed—the theoretical SIMD paral-
`lelism, which is 16 on 8-bit data (for example,
`video), eight on 16-bit data (for example,
`modem filters), and four on 32-bit integers and
`floats (for example, 3D graphics and high-
`fidelity audio). Speedups greater than the theo-
`retical parallelism arise from the ability to use
`new algorithms that are inappropriate for scalar
`processors or for less capable SIMD processors.
`
`Crypto
`
`Crypto
`
`High quality
`
`digital signal processing (DSP) domains.
`These include error correction, bit-packing
`kernels, and many others.
`AltiVec extends the scalar PowerPC archi-
`tecture with a powerful new set of SIMD
`instructions. These instructions execute from
`the same instruction stream as the PowerPC’s
`scalar integer, floating-point, and branch
`instructions.
`AltiVec’s major architectural characteristics
`include
`
`• fixed-length 128-bit vectors, each com-
`prising four, eight, or 16 data elements;
`• a separate vector register file with a 32-
`register namespace, each register holding
`one 128-bit vector;
`• vector-element data types of 8-, 16-, and
`32-bit signed or unsigned integers, as
`well as IEEE single-precision floats;
`• 162 new SIMD-style instructions opti-
`mized for digital signal processing;
`• saturation or modulo arithmetic;
`• a four-operand, nondestructive instruc-
`tion format (three sources, one destina-
`tion); and
`• modeless operation for zero overhead use
`of AltiVec instructions.
`
`SIMD parallelism is well matched to the
`parallelism found in the packed-data streams
`of media applications. To use SIMD process-
`ing, algorithms typically break long data
`streams into sequences of short fixed-length
`vector operands. SIMD instructions then
`process these vectors iteratively in loops, each
`instruction performing the same operation on
`all corresponding elements in the source-
`operand vectors in parallel. With AltiVec’s long
`
`86
`
`IEEE MICRO
`
`AltiVec architecture
`One of the attributes that enable large
`speedups across such a broad spectrum of
`media processing applications is AltiVec’s sup-
`port for all of the important media data types.
`Table 1 shows the various data types that a
`processor must support if it is to perform well
`on media processing tasks. To date, AltiVec is
`the only SIMD architectural extension to sup-
`port all these types.
`AltiVec’s large vector register file provides
`quick access to a large number of values, such
`as the transform or filter coefficients that are
`accessed frequently in signal processing loops.
`The large register namespace facilitates soft-
`ware pipelining and loop unrolling necessary
`to cover the long latencies associated with
`media streams. With a separate register file,
`the general-purpose and floating-point regis-
`ters are not encumbered with multimedia
`data, so media processing doesn’t interfere
`with scalar processing. The separate file also
`permits the vector registers to be physically
`optimized for the wide SIMD execution units.
`Another important AltiVec feature is its
`four-operand instruction format (three source
`operands, one destination). This feature gives
`each instruction extraordinarily high operand
`
`SAMSUNG-1027
`Page 2 of 12
`
`
`
`Table 2. AltiVec instruction-set summary.*
`
` Arithmetic
`
` Source elements
`
` Destination elements
`
`Vectors
`
`X
`X
`
`X
`
`X
`
`Floats
`
`X
`
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`X
`
`Words
`
`Halfwords
`
`X
`
`X
`X
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`
`X
`
`X
`
`X
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`
`Bytes
`
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`
`X
`X
`X
`
`X
`
`Vectors
`
`Floats
`
`Words
`
`Halfwords
`
`Bytes
`
`Operands
`
`Saturate
`
`Modulo
`
`Unsigned
`
`Signed
`
`X
`X
`X
`X
`X
`X
`X
`
`X
`X
`
`X
`X
`X
`
`X
`X
`X
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`
`X
`X
`
`X
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`X
`
`X
`X
`X
`
`X
`X
`X
`X
`
`X
`
`2
`2
`3
`3
`2
`2
`2
`2
`2
`2
`2
`2
`2
`2
`3
`2
`1
`1
`2
`1
`1
`1
`
`X
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`
`X
`X
`
`X
`X
`
`X
`X
`X
`
`X
`
`X
`
`X
`X
`
`X
`X
`
`X
`
`X
`
`X
`X
`
`X
`
`X
`
`X
`X
`X
`
`X
`
`X
`X
`X
`X
`X
`X
`
`Instruction class
`Load/store
`Stream prefetch
`Add/sub
`Multiply
`Multiply-add
`Multiply-sum
`Sum across
`Partial sum across
`Average
`Logicals
`Rotate/shift
`Compare
`Select
`Pack
`Unpack/merge
`Splat
`Permute
`Shift elements
`Round to integer
`Convert w/scale
`Max/min
`1/xestimate
`1/sqrt(x) estimate
`Log/power estimates
`
`*This table summarizes AltiVec capabilities in a concise form. Not all combinations shown are available for every instruction in a given class.
`
`bandwidth and supports the encoding of pow-
`erful instructions such as multiply-add, per-
`mute, and select (described later). Since the
`four-operand format is nondestructive, it also
`eliminates the excess register shuffling and
`copying that comes with destructive two-
`operand formats like that of the x86 architec-
`ture. Thus, AltiVec’s instruction format allows
`programs to use registers efficiently, minimiz-
`ing spill/fill traffic to memory and producing
`a short instruction path, which are both
`important for efficient signal processing loops.
`AltiVec is based on a simple RISC-style
`load/store architecture, but instructions oper-
`ate on vector operands rather than on the sim-
`ple scalar operands of classical RISC engines.
`The AltiVec instruction set was distilled from
`
`many digital-media-processing algorithms
`into a set of generalized primitives that sup-
`port common operations such as saturation
`arithmetic. Using this approach, the design
`can support a wide spectrum of media appli-
`cations while avoiding the highly specialized
`instructions commonly found in traditional
`DSPs. Counting all variations of data types
`and arithmetic (modulo, saturation, signed,
`and unsigned), AltiVec adds 162 new instruc-
`tions to the PowerPC architecture, as sum-
`marized in Table 2.
`The AltiVec design criteria called for all
`instructions to be easily pipelined and suitable
`for superscalar, out-of-order dispatch. All
`AltiVec processors are expected to implement
`the full architectural vector width and to fully
`
`MARCH–APRIL 2000
`
`87
`
`SAMSUNG-1027
`Page 3 of 12
`
`
`
`ALTIVEC EXTENSION
`
`01
`
`04
`
`08
`
`00
`
`1F
`
`15
`
`09
`
`0A 05
`
`1F
`
`02
`
`03
`
`07
`
`0D 0B 0E
`
`vector instructions with Pow-
`erPC scalar instructions.
`
`E
`
`F
`
`Permute power
`Much of AltiVec’s perfor-
`mance and flexibility derives
`from the permute instruction
`(vperm), illustrated in Figure
`1a. This instruction performs
`two essential functions: data
`reorganization and
`table
`lookup. One of the historical
`problems with SIMD archi-
`tectures is that if the data
`structures do not precisely
`match the hardware organiza-
`tion, the program must pre-
`process the data to conform to
`the hardware. This prepro-
`cessing overhead
`severely
`reduces the potential SIMD
`speedup. Permute eliminates
`this problem by providing a
`method for arbitrarily re-
`arranging vector elements
`with a single instruction. Per-
`mute’s dual source-vector
`operands
`(VA and VB)
`enhance this capability by
`allowing rearrangement of
`data across vector boundaries.
`The other important use
`for permute (table lookup) is
`discussed later. Although per-
`mute requires a full 32 · 16
`bytewise hardware crossbar,
`the enormous value of per-
`mute’s table lookup function
`alone makes it worth the
`incremental hardware cost
`over that of the more restric-
`tive data shuffling operations found in other
`SIMD machines.
`In addition to the full generality of the per-
`mute instruction, AltiVec provides specific
`variants for unpacking (expanding) small ele-
`ments into larger fields, packing (truncating)
`large elements into smaller, tightly packed
`fields, merging (shuffling) elements from two
`vectors into one vector, replicating an element
`across a vector (splat), and double-vector shift-
`ing and rotating. These variations specify the
`permute-control vector implicitly or as an
`
`VC
`
`1 0
`
`VB
`
`VA
`
`VT
`
`(a)
`
`VC
`
`VT
`
`(b)
`
`0
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9
`
`A
`
`B
`
`C
`
`D
`
`vperm VT, VA, VB, VC
`
`VA
`
`VB
`
`****************
`
`+
`
`+
`
`+
`
`vmsumubm VT, VA, VB, VC
`
`+
`
`Figure 1. Vector permute (a) and vector dot-product (b) primitives in AltiVec.
`
`pipeline all instructions; that is, all instructions
`will issue back-to-back with a throughput of
`at least one instruction per cycle. Most imple-
`mentations will support simultaneous dispatch
`of one ALU-class vector instruction along with
`one permute-class vector instruction, or either
`of
`these paired with a vector
`load
`or store. Thus, the peak throughput of AltiVec
`instructions will typically be two per cycle, as
`it is in Motorola’s MPC 7400 processor. No
`AltiVec implementations will ever impose any
`restriction on, or suffer any penalty for, mixing
`
`88
`
`IEEE MICRO
`
`SAMSUNG-1027
`Page 4 of 12
`
`
`
`explicit literal within the instruction opcode,
`thus avoiding the overhead of creating and
`storing the permute-control vector in these
`common cases. Special forms of pack and
`unpack are provided for 16-bit pixels in a
`1/5/5/5 format.
`
`Vector dot product (multiply-sum)
`One of the most common DSP operations
`is the vector dot product. AltiVec accomplish-
`es this operation with two instructions: multi-
`ply-sum and sum-across. Multiply-sum,
`illustrated in Figure 1b, multiplies corre-
`sponding elements in two vector registers (VA
`and VB), sums those products with four values
`from a third vector register (VC), and deposits
`the four 32-bit partial sums into the destina-
`tion vector register (VT). VC serves to accu-
`mulate partial sums for taking the dot product
`of long vectors. AltiVec processors carry out the
`multiplications to full precision and then sub-
`ject the individual partial sums to saturation
`(clamp to max on overflow, min on underflow).
`As a final step, the sum-across instruction can
`be used to sum the four accumulated partial
`sums into a single 32-bit scalar result.
`AltiVec provides multiply-sum in byte- and
`halfword-element forms. The byte-element
`forms support motion estimation in video
`compression. During motion search, the mul-
`tiply-sum instruction is used to locate the
`closest matching macroblock using the sum-
`of-differences-squared (ai – bj)2 measure. This
`approach produces a higher quality compari-
`son than one based on the sum-of-absolute-
`difference (|ai – bj|) instruction used by
`architectures such as VIS and SSE, while still
`achieving a throughput of 16 pixels per cycle.
`
`Multiply accumulate
`Another common DSP operation is multi-
`ply-accumulate. This operation underlies
`many digital filters, mathematical transforms,
`matrix-arithmetic operations, and so on.
`AltiVec provides multiply-add, a more pow-
`erful version of multiply-accumulate, with
`three source operands and one destination.
`With multiply-add, the respective elements
`in two source vectors are multiplied and the
`products added to corresponding elements of
`a third source vector. The intermediate cal-
`culations are carried out to infinite precision,
`and the final product sums are truncated to
`
`fit into destination-register fields the same size
`as the source elements.
`For halfword elements, AltiVec’s multiply-
`add has two forms: multiply-add-low and
`multiply-add-high. In the low form, the
`unsigned accumulator elements are added to
`the full product, and the intermediate prod-
`uct-sums are truncated modulo 216. In the
`high form, the signed accumulator elements
`are left justified (7-bit left shift), added to the
`32-bit intermediate products, and then satu-
`rated to fit into the 16-bit destination element
`fields. The 7-bit left shift (as opposed to 8-
`bit) results in one extra bit of precision by tak-
`ing advantage of the fact that the most
`significant bit of each signed 32-bit interme-
`diate product is redundant. A variant of this
`form rounds the intermediate product-sums
`to squeeze out an additional half bit of preci-
`sion. These tricks provide additional precision
`that is important to several algorithms, espe-
`cially audio processing.
`Multiply-add is not provided for byte ele-
`ments because 8 bits of precision is not suffi-
`cient for most DSP algorithms. In most
`algorithms involving 8-bit data, the elements
`are first expanded to 16 bits, where most com-
`putations are carried out, and the final results
`are truncated back to 8 bits. Video compres-
`sion and decompression, which use 8-bit val-
`ues throughout, are exceptions, but the
`operations involved in those algorithms are
`more suited to multiply-sum, which does have
`8-bit forms. In the few cases that require mul-
`tiply-add of byte elements, vector-multiply
`and vector-add instructions are provided.
`As a concession to silicon area, AltiVec does
`not provide a vector-multiply or multiply-add
`for 32-bit integers. For applications requiring
`more than 16 bits of precision, such as high-
`fidelity digital audio, 24-bit precision is usu-
`ally sufficient. AltiVec provides this level of
`precision with its floating-point instructions,
`which do include a multiply-add. Floating
`point also provides the wide dynamic range
`that is often the real motivation to use inte-
`gers larger than 24 bits.
`Conversion between integer and floating-
`point formats is very fast in AltiVec. A single
`instruction converts and scales a vector of four
`signed or unsigned 32-bit fixed-point words to
`a vector of four single-precision floating-point
`values, or vice versa. An instruction for round-
`
`MARCH–APRIL 2000
`
`89
`
`SAMSUNG-1027
`Page 5 of 12
`
`
`
`ALTIVEC EXTENSION
`
`ing floating-point values to integral values via
`any of the four IEEE-754 rounding modes9 is
`also provided. Since all AltiVec instructions are
`fully pipelined, SIMD floating-point arith-
`metic throughput is similar to that of SIMD
`integer arithmetic as long as the floating-point
`unit’s pipeline latency can be covered, which it
`usually can. With these floating-point features,
`AltiVec sacrifices little by not directly support-
`ing 32-bit-integer multiply.
`For division and square root, AltiVec uses
`Newton-Raphson refinement of a reciprocal
`seed. The high accuracy achieved with
`AltiVec’s fused multiply-add instruction (sin-
`gle rounding after the add operation) allows
`rapid convergence to an IEEE-754–accurate
`reciprocal value. This approach provides
`divide performance equaling or exceeding that
`of processors with expensive division hard-
`ware. Unlike with hardware dividers, AltiVec’s
`approach carries no hidden-state information
`from cycle to cycle; thus, division can be fully
`pipelined and intermediate instructions can
`be easily rescheduled.
`
`Conditionals
`Changes in control flow present a serious
`performance problem for any processor, espe-
`cially those running DSP applications where
`loops tend to be tight and the data-dependent
`decisions difficult to predict. The latter are
`nearly intractable in SIMD architectures,
`which process multiple data elements in a sin-
`gle instruction.
`Although AltiVec provides a means for con-
`ditional branching, it places more emphasis on
`avoiding branches. To this end, AltiVec offers
`a type of conditional-move instruction, called
`select (vsel), which, in concert with vector-
`compare instructions, operates efficiently on
`SIMD vectors. The vector-compare instruc-
`tions generate a predicate vector that can be
`stored in any vector register and used by sub-
`sequent vsel instructions to choose elements
`from one of two registers, depending on the
`value of the corresponding predicate elements.
`With this mechanism, data-dependent deci-
`sions can be made on all elements in a vector
`in parallel, making it unnecessary to test and
`branch on each element individually. AltiVec
`can use the vsel instruction to simulate predi-
`cated execution, which can eliminate many
`branches. Since vsel selects values on a bit-by-
`
`bit basis, it is also useful for selecting and merg-
`ing vector subfields that do not fall on element
`boundaries.
`In cases where data-dependent redirection
`of program control flow cannot be avoided,
`AltiVec’s vector-compare instructions option-
`ally update PowerPC’s condition register
`(CR06) field. Subsequent PowerPC condi-
`tional-branch instructions can test this regis-
`ter in the normal manner. A special form of
`vector compare—vector compare-bounds—
`speeds up 3D-graphics clipping operations.
`
`Loads and stores
`The philosophy behind AltiVec’s memory
`operations is to support only basic load and
`store primitives in an effort to keep the mem-
`ory path as fast as possible. The load-vector
`and store-vector instructions transfer full
`quadword vectors between memory and the
`vector registers. Load-vector-element and
`store-vector-element instructions transfer
`individual byte, halfword, and word scalar
`elements between memory and the vector reg-
`isters. Vector loads and stores use the index-
`addressing mode (RA|0 + RB) only.
`All memory accesses are aligned on their nat-
`ural size boundary. If a load or store’s address
`is not size aligned, the appropriate number of
`least-significant address bits is ignored and an
`aligned transfer occurs. AltiVec provides assis-
`tance for extracting misaligned data once it is
`in the registers. Special load-vector-for-shift-
`left/right instructions assist in this process by
`computing a permute-control vector based on
`the misaligned memory address. Random iso-
`lated unaligned-vector loads can be simulated
`with just four instructions, but the average cost
`of unaligned-vector loads in a long linear
`sequence, which is the more important case,
`approaches an average of only two instructions
`(one load vector and one vector permute).
`These two instructions will issue simultane-
`ously in most AltiVec implementations.
`
`Software-directed prefetch streams
`AltiVec allows software to manage the band-
`width between processor and memory with
`explicit cache management instructions. With
`these instructions, software can indicate to the
`cache hardware how it should prefetch data and
`prioritize replacement. The principal instruc-
`tion for this purpose is the cache-prefetch
`
`90
`
`IEEE MICRO
`
`SAMSUNG-1027
`Page 6 of 12
`
`
`
`instruction, called data-stream touch. This
`instruction specifies the starting address, a block
`size (one to three vectors), the number of blocks
`to prefetch (one to 256 blocks), a signed stride
`(± 32,768 bytes), and a 2-bit tag that unique-
`ly identifies one of four prefetch streams that
`can run simultaneously. Other forms of data-
`stream touch are provided for writing data into
`streams and marking data as transient, that is,
`as having poor temporal locality.
`
`AltiVec coding examples
`AltiVec has proven adept at accelerating a
`wide range of multimedia and DSP applica-
`tions. To illustrate its utility, we present cod-
`ing examples from the areas of image
`processing, wavelet signal processing, and
`Galois-field arithmetic.
`
`Median filter
`Median filter is an algorithm commonly
`used in image processing to smooth out noise.
`The support region of the pixel neighborhood
`in such a filter is normally a sliding filter win-
`dow. Since the data type is usually 8-bit
`unsigned (intensity) values, the sequence of
`scalar memory accesses typically involves
`numerous misaligned memory accesses. Fig-
`ure 2 provides a high-level description of a
`5 · 5 median filter based on a simple algo-
`rithm. The two most important functions are
`pixel load and compare-and-swap.
`In Figure 2, one pixel neighborhood appears
`in the dashed box. Twenty-five vectors are
`needed to calculate 16 output pixels in the
`third row. Even though pixel sets are mis-
`aligned, the use of AltiVec’s load and shift fea-
`tures completely avoids expensive unaligned
`memory accesses. Additionally, with this
`approach each pixel is fetched only once, thus
`minimizing the number of memory accesses.
`This example illustrates AltiVec’s double-
`shift instruction, vsldi. The critical innermost
`loop of compare-and-swap (V1, V2) can be
`implemented using a simple three-instruction
`sequence of vmaxub Vx, V1, V2; vminub V1,
`V1, V2; vor V2, V1, Vx. This sequence,
`which accumulates the minimum pixel value
`in V1, can be executed in just two cycles on a
`typical AltiVec superscalar implementation,
`resulting in a significant speedup over scalar
`processors for median filters, even with this
`simple algorithm. With a more sophisticated
`
`Sliding window
`
`Aligned
`
`(a)
`
`Aligned
`
`for each pixel vector { /* i.e., a set of 16 pixels */
` load appropriate 25 pixel values P[0..24]
` for i : = 0 to 12 {
` MIN = P [i]
` for j := i upto 24 {
` Compare_and_Swap (Min, P[j])
` /* MIN <-min (MIN, P[j]);
` P[j] <- max (MIN, P[j]) */
` }
` }
` /* MIN holds the median of P[0..24] at this point */
`}
`
`1. lvx
`
`2. lvx
`
`V0, [1600]
`
`; 10% gray pixels from memory, Row 1
`
`V31, [1616]
`
`; Next set of 10% gray pixels, Row 1
`
`3. vsldi
`
`V1, V0, V31, 1
`
`; Funnel shift for 20% gray pixels, Row 1
`
`4. vsldi
`
`V2, V0, V31, 2
`
`; Funnel shift for 30% gray pixels, Row 1
`
`5. vsldi
`
`V3, V0, V31, 3
`
`; Funnel shift for 40% gray pixels, Row 1
`
`6. vsldi
`
`V4, V0, V31, 4
`
`; Funnel shift for 50% gray pixels, Row 1
`
`(b)
`
`(c)
`
`Figure 2. Pixel access during 5 · 5 median filtering.
`
`algorithm, AltiVec can execute median filter at
`1.2 cycles per output pixel.
`
`Haar transform
`The applications of wavelet technology
`have increased markedly in areas such as dig-
`ital image processing, data compression,
`astronomy, acoustics, sub-band coding, imag-
`ing, speech discrimination, and fractal com-
`pression. The Haar transform, introduced in
`1910, is the oldest wavelet transform. The
`2 · 2 transform decomposes an image into
`four bands whose spatial frequencies and spa-
`tial contents differ.
`Figure 3 (next page) illustrates this trans-
`form, along with its implementation in
`AltiVec code. This example illustrates the use
`of the multiply-sum operation. Support for
`
`MARCH–APRIL 2000
`
`91
`
`SAMSUNG-1027
`Page 7 of 12
`
`
`
`ALTIVEC EXTENSION
`
`2· 2 input
`block
`
`Input pixels
`(unsigned bytes)
`
`Output pixels
`(signed 16 bits)
`
`P00 P01 P10 P11 P20 P21 P30 P31
`
`Band00
`
`Band01
`
`Band02
`
`Band03
`
`P02 P03 P12 P13 P22 P23 P32 P33
`
`Band10
`
`Band11
`
`Band12
`
`Band13
`
`Transform
`Band0i = Pi0 + Pi1 + Pi2 + Pi3
`Band1i = Pi0 + Pi1 - Pi2 - Pi3
`Band2i = Pi0 - Pi1 + Pi2 - Pi3
`Band3i = Pi0 - Pi1 - Pi2 + Pi3
`
`(a)
`
`Band20
`
`Band21
`
`Band22
`
`Band23
`
`Band30
`
`Band31
`
`Band32
`
`Band33
`
`0.; V0 and V1 hold two 2· 2 blocks of input pixels, V30 holds null
`0.; V4 holds 4-byte sequence: 1, 1, 1, 1 splatted
`0.; V5 holds 4-byte sequence: 1, 1, - 1, - 1 splatted
`0.; V6 holds 4-byte sequence: 1, - 1, 1, - 1 splatted
`0.; V7 holds 4-byte sequence: 1, - 1, - 1, 1 splatted
`; V1 = P71 P70 … P01 P00
`1. lvx
`V0, [image]
`V1, [image +XX]
`; V2 = P73 P72 … P03 P02
`2. lvx
`; V3 = … P43 P42 P41 P40
`3. vmrghh
`V2, V1, V0
`; V4 = … P03 P02 P01 P00
`4. vmrglh
`V0, V1, V0
`; V10 = Band07 … Band04
`5. vmsummbm
`V11, V2, V4, V30
`; V11 = Band03 … Band01
`6. vmsummbm
`V10, V0, V4, V30
`; V10 = Band07 … Band00
`7. vpkswss
`V10, V11, V10
`
`0.; other bands are similarly calculated
`
`(b)
`
`Figure 3. A 2 · 2 Haar transform (a) implemented in AltiVec code (b).
`
`vperm
`Index0…3
`
`vsel
`Index4
`vsel
`Index5
`vsel
`Index6
`vsel
`Index7
`
`Figure 4. 256-entry, 8-bit-element table lookup using AltiVec.
`
`one signed and one unsigned operand is cru-
`cial for filtering applications like this, where
`the signed filter coefficients are small enough
`to fit in a byte and the input is unsigned 8-bit
`data. A video-loop filter, which performs
`smoothing in YUV color space, is another
`such application. Here the convolution ker-
`nel is separable and has coefficients small
`enough to fit in a byte. In the absence of archi-
`tectural support for byte data types, the pro-
`
`92
`
`IEEE MICRO
`
`gram would have to unpack the data and
`extend the coefficients to 16 bits, losing half
`the potential parallelism.1,3
`
`Galois-field arithmetic
`Table lookups involving small tables (256 or
`fewer elements) and small elements (8 or fewer
`bits) are common in multimedia and commu-
`nication kernels such as cyclic redundancy
`check (CRC) code generation, convolution
`encoding, Galois-field (GF) multiplication, and
`many nonlinear functions, such as gamma cor-
`rection. Also, bit-level manipulations, includ-
`ing various simple functions such as bit-stream
`interleaving or bit (un)packing, are often pro-
`grammed using table lookup functions. Most
`traditional scalar and SIMD architectures do
`not support parallel implementation of such
`functions, so, for example, a sequence of 16
`index extractions, 16 address calculations, and
`16 data loads would be needed to load 16 data
`items based on 16 indices. AltiVec accom-
`plishes all of these operations with one vperm
`instruction.
`This powerful capability enables sophisticat-
`ed techniques like GF arithmetic, unlocking
`entirely new algorithms that are computation-
`ally intractable on conventional scalar and
`SIMD engines. Figure 4 illustrates the steps in
`performing a 16-way parallel lookup in a 256-
`element table (with byte elements).1,3
`GF arithmetic, that is, arithmetic over a
`finite field, has many important applications.
`One example is Reed-Solomon coding, which
`is a type of block code commonly used for
`error correction in broadband communica-
`tions. With add-form representation, GF addi-
`tion is just an exclusive-or operation. GF
`multiplication, however, is expensive. For
`example, a GF (24) multiplication involves
`three table lookups into two 16-element tables
`and a modulo-15 addition, as shown in Fig-
`ure 5. Note that the more commonly used GF
`(28) multiplication can be carried out using
`three GF (24) multiplications and four GF (24)
`additions.10
`
`Performance results
`The AltiVec performance data presented in
`this article is based on cycle-accurate simulation
`of Motorola’s MPC 7400 microprocessor with
`AltiVec. The simulated processor has several
`important microarchitecture characteristics:
`
`SAMSUNG-1027
`Page 8 of 12
`
`
`
`First input operand
`
`Second input operand
`
`Add-form to multiply-form table
`
`Mod-15
`addition
`
`Products in add-form
`
`Multiply-form to add-form table
`
`(a)
`
`0. ; X and Y: 16 pairs of numbers (byte wide) in add-form to be multiplied
`0. ; Z: 16 corresponding products (byte wide) in add-form
`0. ; FW table converts add-form to multiply-form, RE table does the reverse
`; V_X holds X, V_#n holds scalar n
`1. vcmpequ8 V4, V_X, V_#0
`; V_Y holds Y, check if X or Y is 0
`2. vcmpequ8 V5, V_Y, V_#0
`; product is 0 if either X or Y is 0
`3. vor V_Mask, V4, V5
`; lookup multiply-form of X
`4. vperm V1, V_FW, V0, V_X
`; lookup mulitply-form of Y
`5. vperm V2, V_FW, V0, V_Y
`; V3: product in multiply-form
`6. vaddum8 V3, V1, V2
`; fix-up for modulo 15 addition
`7. vcmpgtu8 V_mask, V3, V_#15
`8. vsubum8 V8, V3, V_#15
` subtract 15 for products ‡ 15
`9. vsel V3, V3, V8, V_mask
`; reverse_lookup add-form of prod
`10. vperm V3, V_RE, V_RE, V3
`; V3: final product in add-form
`11. vsel V3, V3, V_#0, V_Mask
`
`;;
`
`(b)
`
`• all instructions are fully
`pipelined with single-
`cycle throughput;
`• simple operations (sim-
`ple arithmetic, logical,
`permute) have a 1-cycle
`latency;
`• compound operations
`(for example, multiply-
`sum) have 3 to 4 cycles
`of latency;
`• the processor can issue
`one ALU-class and one
`permute-class instruc-
`tion each cycle (permute
`class
`includes most
`nonarithmetic opera-
`tions such as pack,
`unpack, splat, permute,
`and so on); and
`• there is no restriction on
`issuing AltiVec instruc-
`tions in the same cycle as
`PowerPC scalar instruc-
`tions.
`
`Table 3 (next page) sum-
`marizes performance on a
`variety of media functions,
`chosen to cover a wide spec-
`trum of media processing,
`including audio, video, graph-
`ics, speech recognition, and communication.
`The optimal scalar algorithm for a media func-
`tion is often different from the optimal AltiVec
`algorithm. Therefore, to present a fair com-
`parison, we made every effort to pick the best
`algorithm for each case.
`
`Figure 5. Galois-field (16) multiplication (a) implemented in AltiVec code (b).
`
`The AltiVec extension to the PowerPC
`
`architecture was designed to offer archi-
`tectural support for the entire range of multi-
`media processing. Multimedia processing has
`its roots in signal and image processing, where
`overall performance is often determined by
`the performance of critical loops iterating over
`a large input data set. AltiVec’s large vector
`register file, full-range data-type support, four-
`operand nondestructive instruction format,
`unmatched data-reorganization capability
`(permute), and powerful SIMD instruction
`set enable it to significantly speed processing
`in these critical loops. At the same time,
`
`AltiVec’s data-prefetch streams can keep the
`SIMD processing engine fed with data. The
`speedups achievable with AltiVec are sufficient
`to enable media-rich applications to run
`entirely on general-purpose PowerPC micro-
`processors without the aid of hard-to-use spe-
`cial-purpose media processors or dedicated
`hardware accelerators.
`MICRO
`
`References
`1. A. Peleg and U. Weiser, “MMX Technology
`Extension to the Intel Architecture,” IEEE
`Micro, July/Aug. 1996, pp. 42-50.
`2. K. Diefendorff, “Katmai Enhances MMX,”
`Micro