`WITH MAX-2
`T
`
`Ruby B. Lee
`
`Hewlett-Packard
`
`parallel programs that benefit from suhword
`parallelism tend to process data that are of
`the same size. For example, if the word size
`is 64 bits, some useful subword sizes are 8,
`16, and 32 hits. Hence, an instruction oper-
`ates on eight %bit subwords, four 16-bit sub-
`words, two 32-bit subwords, or one 64-bit
`subword (a word) in parallel. The degree of
`SIMD parallelism within an instruction, then,
`depends upon the size of the subwords.
`1)ata parallelism refers to an algorithm's
`execution of the same program module on
`different sets of daca. Subword parallelism
`is an efficient and flexible solution for media
`processing, because the algorithms exhibit
`a great deal of data parallelism on lower pre-
`cision data. The basic components of multi-
`media objects are usually simple integers
`with 8 , 12, or 16 hits of precision. Subword
`parallelism is also useful for computations
`unrelated to multimedia that exhibit data
`parallelism on lower precision data.
`One key advantage of subword paral-
`lelism is that it allows general-purpose
`processors to exploit wider word sizes even
`when not processing high-precision data.
`The processor can achieve more subword
`parallelism on lower precision data rather
`than wasting milch of the word-oriented
`data paths and registers. Media processors-
`processors specially designed for media
`rather than general-purpose processing-
`and DSPs allow more flexibility in data path
`widths. Even for these processors, howev-
`er. organizing the data paths for words that
`support subword parallelism provides low-
`overhead parallelism with less duplication
`of control. A more efficient use of memory
`also results, since a single load- or store-
`word instruction moves multiple packed
`subwords bctwecn memory and processor
`registers. Hence, subword parallelism is an
`efficient organization for media processing,
`whether on a general-purpose micro-
`processor or a specially designed media
`processor or DSP.
`
`he general-purpose computing work-
`load is changing to include more pro-
`cessing of multimedia information.
`We define media processing as the process-
`ing of digital multimedia information, such
`as images. video, audio, 2D and 3D graph-
`ics, animation, and text. This multimedia
`data, at the lowest component level, tend to
`be 16 hits or less. However, general-purpose
`microprocessors are generally optimized for
`processing data in units of words, where a
`word is currently at least 32 or 64 bits.
`This article proposes that suhword paral-
`lelism-parallel computation on lower pre-
`cision data packed into a word-is
`an
`efficient and effective solution for accelerat-
`ing media processing. As an example, it
`describes MAX-2, a very lean, RISC-like set
`of media acceleration primitives included in
`the &-bit PA-RISC 2.0 architecture.' Because
`MAX2 strives to he a minimal set of instruc-
`tions, the article discusses both instructions
`included and excluded. Several examples
`illustrate the use of MAX-2 instructions,
`which provide sulnvord parallelism in a
`word-oriented general-purpose processor at
`essentiallv no incremenral cost.
`
`-
`
`M M - 2 illustrates how
`a small set qf
`instruction extensions
`can provide subword
`parallelism to
`accelerate mediu
`processing and other
`data -pa ru Del
`
`Subword para I le1 ism
`A subword i s a lower precision unit of
`&&a contained within a word. In subword
`parallelism, we pack multiple subwords into
`a word and then process whole words. With
`the appropriate subword houndaries, this
`technique results in parallel processing of
`subwords. Since the same instruction applies
`to all the subwords within the word, this is
`a form of small-scale SIMn (single-instruc-
`tion. multiple-data) processing.'
`It is possible to apply subword parallelism
`to noncontiguous subwords of different sizcs
`within a word. In practice, however, imple-
`mentations are much simpler if we allow
`only a few suhword sizes and if a single
`instruction operates on contiguous subwords
`that are all the same size. Furthermore, data-
`
`0272-1732/96/$5.00 0 1996 IEEE
`
`August 1996 51
`
`Authorized licensed use limited to: University of Virginia Libraries. Downloaded on September 08,2020 at 07:12:04 UTC from IEEE Xplore. Restrictions apply.
`
`Oracle-1028 p. 1
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`tions already work. For example, the load-word instruc-
`tion loads multiple packed subwords into a register from
`memory, in a single instruction.)
`
`Efficient support of inner-loop operations is more important
`than one-time operations For example, the data expansion, con-
`traction, clipping, and reduction functions just mentioned tend
`to occur outside the inner loop, so their support is not as per-
`formance critical. There are obviously many possibilities for
`selecting instructions and features that meet these needs. The
`next section presents an example of a very small set of instnic-
`tions supporting subword parallelism that we added to a gener-
`al-purpose RISC processor architecture.
`
`Data-parallel algorithms with lower precision data map m ell
`into subword-parallel programs. The support required for such
`subword-parallel computations then mirrors the needs of the
`data-parallel algorithms.
`To exploit data parallelism, we need subword parallel com-
`pute primitives, which perform the same operation simulta-
`neously on subwords packed into a word. These may include
`basic arithmetic operations like add, subtract, and multiply, as
`well as some form of divide, logical, and other compute oper-
`ations. Data-parallel computations also need
`
`0 data alignment before or after certain operations for sub-
`words representing fixed-point numbers or fractions;
`0 subword rearrangement within a register so that algo-
`rithms can continue par-allel processing at full clip;
`a way to expand data into larger containers for more
`precision in intermediate computations; similarly, a way
`to contract it to a fewer number of bits after the com-
`putation’s completion and before its output:
`conditional execution;
`0 reduction operations that combine the packed subwords
`in a register into a single value or a smaller set of val-
`ues; and
`0 a way to clip higher precision numbers to fewer bits for
`storage or transmission;
`* the ability to move data between processor registers and
`memory, as well as the ability to loop and branch to an
`arbitrary program location. (General-purpose proces-
`sors require no new instructions for these functions;
`since the load-word, store-word, and branch instruc-
`
`Table 1. MAX-2 instructions in PA-RISC 2.0.
`
`Parallel subword instruction*
`
`Description
`
`Cycles
`
`r u ~ ~ ~ ~ ~ s
`MAX-2 i ~ ~ t
`Iz/w( is a small set of Multimedia Acceleration extensions
`implemented in PA-RISC processors. For a general-purpose
`processor architecture like PA-RISC, the trend of both techni-
`cal and business computations to include an increasing amount
`of multimedia information motivated the inclusion of such
`instructions.
`We first implemented MAX-13 in the PA-71OOLC micro-
`processor. a 32-bit PA-RISC 1.1 architecture.* MAX-2 is the sec-
`ond generation of multimedia instructions’ and an integral part
`of the 64-bit PA-RISC 2.0 ar~hitecture.~ Hence, all 64-bit PA-
`RISC processors will contain MAX-2.
`Table 1 summarizes iVM-2. The first five parallel subword
`arithmetic instructions are the same as the MAX-1 instructions,
`n-hile the rest are new. In addition to four new instruction
`types, 1 W - 2 has twice the subword parallelism per instruc-
`tion. since it assumes 64-bit words; MAX-1 operated on proces-
`sors with 32-bit words. Since W - 1 is
`a proper subset of MAX-2, PA-RISC
`programs including MAX-1 instruc-
`tions will run unchanged on PA-RISC
`2.0 processors. (When something is
`common to both M A X 1 and MAX-2,
`we will use the notation MAX.)
`Although M A X 2 is the second gen-
`eration of multimedia instructions for
`PA-RISC, it is still much smaller than
`the sets proposed for other micro-
`processors. (See Kohn et al.6 and the
`articles by Tremblay et al. and Peleg
`and Weiser in this issue.) Our design
`goal was to leverage existing micro-
`processor capabilities as much as pos-
`sible. We only added new features
`that have both significant perfor-
`mance acceleration for media pro-
`cessing and potential general-purpose
`utility. In addition, the new instruc-
`tions do not have significant impact
`on the cycle time, area, and design
`time of the PA-RlSC processor.
`We considered supporting 8- and
`32-bit subwords in addition to 16-bit
`subwords, but rejected this for our
`workstation and server markets. We
`
`Adds corresponding subwords
`(saturation arithmetic speeds up
`and simplifies overflow handling)
`
`Subtracts corresponding subwords
`
`Multiplies subwords by integer constant
`
`Multiplies subwords by fractional constant
`
`Calculates arithmetic means of subwords
`Aligns data
`divides signed integer; extends sign
`divides unsigned integer; zero fill on left
`Aligns data; zero fill on right
`Rearranges subwords from two source
`registers
`Rearranges subwords within a register
`
`1
`1
`1
`
`1
`1
`1
`1
`
`1
`
`1
`
`1
`1
`1
`1
`
`1
`
`Parallel add
`with modulo arithmetic, HADD
`with signed saturation, HADD,ss
`with unsigned saturation, HADD,us
`Parallel subtract
`with modulo arithmetic, HSUB
`with signed saturation, HSUB,ss
`with unsigned saturation, HSUB,us
`Parallel shift left and add,
`with signed saturation, HSLADD
`Parallel shift right and add,
`with signed saturation, HSHRADD
`Parallel average, HAVG
`Parallel shift right
`signed, HSHR
`unsigned, HSHR,u
`Parallel shift left, HSHL
`Mix, MIXH, MlXW
`
`Permute, PERMH
`
`52 /€€€Micro
`
`Authorized licensed use limited to: University of Virginia Libraries. Downloaded on September 08,2020 at 07:12:04 UTC from IEEE Xplore. Restrictions apply.
`
`Oracle-1028 p. 2
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`rejected %hits subwords for insufficient precision, and 32-bits
`subwords for insufficient parallelism compared to 32-bit (sin-
`gle-precision) floating point.
`Although pixel components may be input and output as 8
`bits, using 16-hit subwords in intermediate calculations is usu-
`ally desirable. Color components of less than 8 bits represent
`very low-end graphics, which may he less important in the
`future. For medical imaging, pixel components are already 12
`rather than 8 bits, again requiring at least 16-bit computation-
`al precision.
`LJsing 32-bit integer subwords provides a parallelism of only
`two operations per subword instruction. Media computations
`that need this level of precision often work better with single-
`precision floating-point operations such as FMAC (floating-
`point multiply and accumulate), which already combine two
`operations into one instruction.
`
`Parallel subword compute instructions
`We found that the most common compute operations in
`media processing programs are the basic integer add and sub-
`tract and simple variations of these.:* The first five entries in
`Table 1 represent parallel subword versions of such arithmetic
`operations.
`The parallel-add and pardk-Subtract instructions each have
`three variants that describe what happens on an overflow. The
`default action is modulo arithmetic, which discards any over-
`flow. An instruction that specifies signed saturation clips the result
`to the largest or smallest signed integer in the result range,
`depending on the overflow direction. Similarly, an instruction
`specifying unsigned saturation clips a result that overflows to the
`largest or sinallest unsigned integer in the result range.3,'
`Often, the multiplications required in media processing are by
`constants. MAX speeds this up with parallel shift left and add,
`and parakl shift right and add. These two instructions very
`effectively implement multiplication by integer and fractional
`con~tants.',~ They require just a minor modification to the exist-
`ing preshifter in the integer arithmetic logic unit, rather than a
`whole new multiplier functional unit on the integer data path.
`The parallel-average instruction performs a very common and
`useful function in image and video processing: the arithmetic
`mean (Figure 1). It adds two source operands then performs a
`divide by two. This is a combined-operation instruction, involv-
`ing an add and a right shift of one bit. In the process, the instnic-
`tion shifts in the carry-out bit as the most significant bit of the
`result, so the instruction has the added advantage that no over-
`flow can occur. In addition, it rounds the least significant bit to
`ion in cascaded average operations.
`MAX-2 includes new parallel (subword) shift instructions,
`which use the existing 64-bit shifter but block any bits shifted
`out of one subword from being shifted into the adjoining sub-
`word. These instructions are useful for data alignment. Here,
`each subword represents some fixed-point (integer or fraction-
`al) number that must be pre- or post-aligned after certain arith-
`mctic operations. Parallel shift 1-ight (signcd 01- unsigned) can
`also speed up division of signed and unsigned subwords bp
`powers of two. It is also useful for sign or zero extension. (While
`the parallel shift and add instructions allow shifting of only one,
`two, or three bits, the parallel shift instructions allow shifting of
`any number of bits.)
`
`A
`
`B
`
`C
`
`D
`
`Input register 1
`
`A+E
`-
`2
`
`B+F
`2
`
`C+G
`2
`
`D+H
`- ~ ~
`2
`
`Figure 1. Parallel subword average instruction
`
`We can use the parallel shift-right instructions freely for inte-
`ger division. However, we can L I S ~ the parallel shift-left instruc-
`tion for multiplication only when we know that the subword
`values are small enough to not overflow during each subword's
`left shift. These instructions do no checking for overflow-that
`is, for signlficant hits shlfted out on the left of each subword. This
`is because we use parallel shift instructions mainly for dJtd align-
`ment and data rearrangement.
`We considered adding subword integer-multiply hardware,
`but decided against it for the following reasons. Media-pro-
`cessing computations like audio and graphics transformations.
`which use multiply-accumulate or multiply-by-a~variable oper-
`ations extensively, usually require intermediate calculation pre-
`cision greater than 16 bits. When audio samples come :is 16-bit
`linear audio input. internal computational precision greater than
`16 bits is desirable.
`Given the choice of 32-bit integer versus 32-bit floating-point
`precision, most audio and graphics programmers prefer the lat-
`ter for its automatic alignment features, accuracy, and greater
`dynamic range. FMAC instructions exist in PA-RISC 2.0 proces-
`sors in both single and double precision. These already provide
`the combined-operation paraklisin of two Operations (a multi-
`ply and an accumulate) per pipeline cycle. Furthermore, PA-RISC
`2.0 processors usually have two such floating-point multiply-
`accumulate units, giving a parallelism of four operations per
`cycle. By using the existing floating-point FMAC units, we save
`considerable area otherwise needed for new integer multiply
`units in the integer data path. We also save pipeline complexi-
`ty, since integer multiply is a inulticycle operation while all exist-
`ing integer instructions are single cycle.
`A bonus of using the floating-point register file rather than
`the integer register file is that it makes available twice as inany
`addressable registers (64 rather than 32). This is because in PA-
`RISC, though each integer and floating-point register file has 32
`registers, we can address each 64-bit floating-point register as
`two ?&bit registers. Another advantage is that this arrangernent
`sometimes allows video and audio processing to occur simul-
`taneously: one on the integer side and the other on the floating-
`point side, each with its own register file and conipptation
`hardware.
`Hence, we implement graphics transformations and audio
`compurations as single-precision floating-point programs. Data
`is loaded directly to one of the sixty-four addressable 32-bit
`floating-point registers. These programs do not need to move
`data back and forth helween the integer ancl floating-point reg-
`ister files. Alternative solutions to mu1t;ply-by-a-variable involve
`either new, costly suhword multiply hardware or some form
`
`August 1996 53
`
`Authorized licensed use limited to: University of Virginia Libraries. Downloaded on September 08,2020 at 07:12:04 UTC from IEEE Xplore. Restrictions apply.
`
`Oracle-1028 p. 3
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`general registers 1
`
`Integer
`
`registers
`
`li_i
`
`FSQRT
`
`HSHL
`HADD
`HSHR
`HSUB
`HSHLADD HSHR, U
`HSHRADD MIX
`HAVG
`PERMH
`
`Figure 2. M A X instructions, listed under t h e units t h a t
`execute them, use existing processor resources.
`
`Mix left
`
`MIX right
`
`Figure 3. M i x interleaves alternate subwords f r o m t w o
`registers.
`
`isters. The challenge is to find a small set of primitives that ful-
`fills most frequent subword regrouping or rearrangement
`needs. In MAX-2, we introduce just two new data rearrange-
`ment instructions: mix and permute.
`Mix. These instructions take subwords from two registers
`and interleave alternate subwords from each register in the
`result register. Mix left starts from the leftmost subword in each
`of the two source registers, while mix right ends with the right-
`most subn-ords from each source register. Figure 3 illustrates
`this for 16-hit subwords.
`Mix implements an even-odd paradigm for parallel process-
`ing: Even elements are processed in parallel, then odd elements,
`or v-ice versa. (We use the names mix-left and mix-right instead
`of mix-even and mix-odd because whether an element is odd
`or even depends on whether elements are numbered from zero
`or one; svarting from the left or right.) Mix is more powerful than
`interleaving sequential subwords from each source register
`(A4aBb rather than AaCc) because it can combine subwords from
`both halves of each source register in the result register. It also
`has a lon-er implementation cost, requiring only very minor
`changes to the existing two-input, unidirectional shifter.
`Permute. The permute instruction takes one source regis-
`ter and produces a permutation of that register's subwords.
`With 16-bit subwords, this instruction can generate all possi-
`ble permutations. with and without repetitions, of the four sub-
`n-ords in the source register. Figure 4 shows some possible
`permutations. To specify a particular permutation, we use a
`permute index. The instruction numbers subwords in the
`source register starting with zero for the leftmost subword. A
`permute index identifies which subword in the source regis-
`ter the instruction places in each subword of the destination
`register.
`
`Permute index
`(4
`
`Permute index
`(b)
`
`3
`
`2
`
`1
`
`0
`
`Figure 4. Permute allows rearrangement and repetition
`(a) and reversal (b) of subwords.
`
`
`l o ~ ~ ~ ~ ~ -
`Integer or f
`We implemented the M A X 2 instructions in the integer data
`path for several reasons. First, they require only very minor
`modifications to the integer ALU and SMU units; implementa-
`tion on the floating-point side would require new subword
`integer functional units using the floating-point registers.
`In addition. to implement MAX on the floating-point side, we
`w-ould have to replicate many useful integer functions or not
`use them. These include all field manipulation instructions like
`shift-pair, extract, and deposit, and logical instructions like
`AND. AND-complement, OR, and exclusive-OR (see Table 2).
`The shift-pair instruction is particularly useful, since it allows
`a 0- to 63-bit shift on any two source registers. Common shift
`instructions work on only one source tegistei-. For example,
`shift-pair can properly align an unaligned sequence of sub-
`words to a 64-bit word boundary.
`of less costly but less capable multiply hardware.6
`Figure 2 shows some of the resources of the PA-8000.'O The
`The area savings of not adding new integer units on the
`floating-point side outweighed the relatively minor disadvan-
`existing integer ALUs (arithmetic logic units) and SMUs (shift
`tage of sharing general-purpose registers with address and
`merge units) implement the MAX-2 instructions. M A X 1 used
`only the ALUs. However, different types of media-processing
`loop counter variables. Load and store instructions do not nec-
`essarily compete for superscalar issue slots with MAX instruc-
`computations use the entire processor, including integer and
`tions. For example, the PA-8000 allows two load or store
`floating-point register files, integer and floating-point func-
`instructions to issue with two bLO-2 instructions, even though
`tional units, and the enhanced cache memory system.
`they all use the general registers. Furthermore, the floating-
`r @ ~ ~ r a t i ~ s t r ~ ~ ~ ~ o n ~
`point registers and functional units can operate simultaneous-
`ly with M A X 2 instructions for possible parallel audio
`Many algorithms require subword rearrangement within reg-
`
`54 IEEE Micro
`
`Authorized licensed use limited to: University of Virginia Libraries. Downloaded on September 08,2020 at 07:12:04 UTC from IEEE Xplore. Restrictions apply.
`
`Oracle-1028 p. 4
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`processing or graphics transforina-
`tions, making a total of 96 registers
`available for simultaneous processing
`of different media streams.
`
`Table 2. Other useful PA-RISC features.
`
`Feature
`
`Description/motivation
`
`Shift right a pair of registers,
`SHRPD, SHRPW
`
`Extract a field, EXTRD, EXTRW
`
`Deposit a field into a register,
`DEPD, DEPDI, DEPW, DEPWI
`
`Logical operations,
`AND, ANDCM, OR, XOR
`FMAC
`
`Multiple floating-point
`condition bits
`Cache line prefetch,
`prefetch for read, LDD RO
`prefetch for write, LDW RO
`Cache hint spatial locality
`
`Other PA-RISC features
`Several other PA-KISC features that
`enhance floating-point, cache memo-
`ry, and branching performance' also
`contribute to higher performance
`niedia processing (see Table 2). We
`have already mentioned the useful-
`ness of
`the FMAC
`instructions.
`Multiple floating-point condition bits
`allow simultaneous testing of several
`conditions. They are also used to per-
`form faster boundary box "trivial"
`accept and reject tests in graphics
`computations.
`Cache prefetch instructions allow
`prefetching cache lines from memory
`into the cache to reduce cache miss
`penalties when the processor actual-
`ly needs the data. This is very useful
`for the predictable streaming data of
`many multimedia memory accesses.ll If a TLR miss occurs for
`such a cache prefetch instruction, it is ignored, and the instruc-
`tion reduces t o a one-cycle NOP (no operation). This reduces
`the downside of prefetching a cache line that is not actually
`used, allowing more aggressive prefetch policies.
`PA-RISC processors also have a cache hint in load and store
`instructions that indicates that the data has spatial locality hut
`no temporal locality. Thus, a processor implementation can
`fetch the cache line containing the desired data into a buffer
`without displacing useful data in the cache. This can reduce
`conflict misses in the cache due to streaming data that the
`processor does not reuse.
`
`Concatenate and shift 64-bit (SHRPD) or
`rightmost 32-bit (SHRPW) contents of two
`registers into one result register
`Select any field in the source register and places
`it right-aligned in the target register
`Select a right-aligned field from the source
`register or an immediate, and places it
`anywhere in the target register
`Existing logical operations in integer ALUs
`
`Floating-point multiply accumulate combined-
`operation instruction
`Enable Concurrency in floating-point
`comparisons and tests
`Reduce cache miss penalty and cache prefetch
`penalty by disallowing TLB miss
`
`Prevent cache pollution when data has no reuse
`
`high-level-language loop may look like thls:
`
`short x[2001, y[2001, 2[2001. d2001;
`int i:
`for (i = 0, i < 200, i++)i
`z[i] = x[il + y[il;
`w[i] = x[il - ylil; 1
`
`Figure i a (next page) shows a PA-RISC assembler version of
`the loop only, generated by a C compiler. Figure 5b shows the
`desired subword-parallel assembly version, which is very sim-
`ilar. The programmer has the choice of either modifying com-
`piler-generated assembly code to get the subword-parallel
`version, or modifying the C code to assist the compiler in gen-
`erating the subword-parallel version directly. One difficulty for
`cannot be sure that the 16-
`the compiler is data alignment-it
`hit shorts align on the 64-bit boundaries. We propose that the
`programmer specifically indicate that such alignment is neces-
`sary. using C's existing "union" feature, which superimposes
`the elements in the union onto the same storage locations. The
`programmer can use the feature to force four 16-bit shorts to
`be 64-bit aligned (where "long" is 64 bits).
`The compiler could recognize statements in for-loops that
`use variables declared in earlier "union" constructs as hints to
`generate subword-parallel code. However, for the program-
`mer, it is not necessarily easier to write such stylized C code,
`since it usually takes multiple C statements t o express a single
`MAX instruction.
`Consider the C statements specifying saturation arithmetic
`for each subword operation, for example. It is often easier for
`the programmer to just write the MAX-2 assembly instruction
`itself, or a C macro that corresponds to this assembly instruc-
`tion. Using macro calls M-HADU and M-HSIJR to generate
`the corresponding MAX-2 instruction. the following C code
`
`August 1996 55
`
`Mapping data parallelism
`A data-parallel computation includes a piece of code that it
`must execute many times for different sets of data. The goal is
`to map data-parallel computations that operate on lower pre-
`cision data onto subword-parallel instructions. A basic tech-
`nique is to execute multiple iterations of a loop in parallel,
`rather than to find opportunities for using subword-parallel
`instructions within the loop itself.
`For example, for an 8x8 discrete cosine transform (DCT),
`the processor must perform the same loop on eight rows and
`eight columns. If we apply our basic technique, a subword
`parallel instruction would work on four rows or four columns
`is, on four loop iterations at a time-rather
`at a time-that
`than
`trying to restructure the code for a single DCT loop using sub-
`word-parallel instructions.
`While it is beyond this article's scope to describe compre-
`hensive techniques for exploiting data parallelism in general
`and mapping to subword-parallel instructions in particular, the
`following example illustrates the process.
`High-level programming languages (for example, C) often
`capture data-parallel computations as for- or while-loops. A
`
`Authorized licensed use limited to: University of Virginia Libraries. Downloaded on September 08,2020 at 07:12:04 UTC from IEEE Xplore. Restrictions apply.
`
`Oracle-1028 p. 5
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`count to 200-1
`halfword from address x, postincrement by 2 bytes
`y, postincrement by 2 bytes
`
`;store 1 result in z, and postincrement
`tore 1 result in w, and postincrement
`rement loop count and loop back if not i 0
`
`haltwords from x, postincrement by 8 bytes
`haltwords from y, postincrement by 8 bytes
`
`subword subtracts
`sults in z, and postincrement
`sults in w, and postincrement
`;decrement loop count and loop back if not < 0
`
`Figure 5. High-level-language loop: PA-RISC assembler version w i t h o u t (a) and w i t h
`(b) subword-parallel instructions. In PA-RISC three-register assembly instructions,
`t h e order of t h e fields is opcode, sourcel, source2, destination. For load instruc-
`tions, t h e order is opcode, displacement (base), target. A halfword (h) denotes 16
`bits, and a doubleword (d) denotes 64 bits in PA-RISC load and store instructions.
`
`a1 examples of their use Often, an algo-
`rithm must rearrange the subwords
`packed into registers to fully uttlize sub-
`sequent subword-parallel instructions
`hfatnx transpose Suppose that an
`algorithm
`performs
`a
`certain
`sequence of operations on all the
`rows and columns of a matrix The
`first four rows of Table 4 show d 4x4
`matrlx of 16 bit subwords contained
`in four 64-bit registers We can apply
`parallel subword instructions to the
`elements of four columns in parallel,
`since these are in the four separate
`subword tracks in the registers Then,
`to apply the same algorithm to the
`elements of four rows in parallel, we
`must transpose the 4x4 matrix. The
`last four rows in Table 4 show the
`transposed matrix.
`The conventional way to achieve
`this is to store the subwords into dif-
`ferent memory locations, then read
`them back as a word with the sub-
`words in the rearranged positions
`within the word. This requires 16 store (halfword) and four
`load (doubleword) instructions. We can achieve considerable
`speedup if m-e rearrange the subwords within the register file
`rather than going through memory, which can incur potential
`cache misses. Using PA-RISC extract and deposit instructions
`for the reanangement would require more than 20 instructions.
`Table 4 shows such a 4x4 matrix transform done with just
`eight mix instructions. To transpose each row of the matrix, the
`processor must read four registers. Hence, for a processor with
`two register reads and one register write per cycle and no inter-
`mediate storage, eight is the minimum number of instructions
`for 4x4 matrix transpose. Table 4 shows the transformation using
`four temporary registers. Though we can accomplish the task
`with only two temporary registers, this does not permit the max-
`imum superscalar use of two SMUs processing two mix instruc-
`tions per cycle. This matrix transpose takes four cycles with two
`SMUs implementing mix instructions, and only two cycles with
`four such SMUs.
`Transposing an 8x8 matrix would require four such 4x4
`matrix transposes, taking 16 cycles with two SMUs. Sixteen 64-
`bit registers would be needed to house an 8x8 matrix of 16-
`bit subwords.
`Expanding and contracting. Mix instructions are also useful
`for data formatting. For example, the mix instruction with regis-
`ter zero as one source can expand 2-byte subwords in Ra into
`4-byte subwords in Rx and Ry. After the desired computation
`with this expanded precision, the mix instruction can also con-
`tract the 4-byte subwords contained in Rx and Ry back into
`packed 2-byte subwords in a single register (see Table 3). By
`using mix instructions to perform both expand and contract oper-
`ations. we preserve the data’s original order after contraction.
`Replicating subword constants. The permute instruction is
`useful for replicating subword constants (see Table 3). First, the
`load offset instruction, LDO, loads a 16-bit signed-immediate
`
`can generate the desired subword-parallel assembly code
`shown in Figure 5b:
`
`Union (long a[501; short x[20011 e;
`Union {long b[501; short y[20011 f;
`Union {long c[501; short z[20011 g,
`Union {long d[501; short w[20011 h;
`
`int i;
`for (i = 0, i<50, i++>l
`M-HADD (e.a[il, f.b[il, g.c[il);
`M-HSUB (e.a[il, f.b[il, h.d[il); I
`
`Since the parameters to the macro calls M-HADD and
`M-HSUB are 64-bit longs, the compiler will issue load-double
`(&bit) and store-double instructions. The programmer may
`freely intermix such in-line assembly macros with C statements.
`For the compiler, macro expansion is much easier and faster
`than using pattern matching for code generation. The coinpil-
`er performs register allocation and renaming, loop unrolling,
`scheduling, and other optimizations. Hence, PA-RISC compil-
`ers support M A X 2 instructions (and other PA-IUSC assembly
`instructions) through macros. Programmers can also include in
`header files the macro calls for optimized MAX-2 code
`sequences in Table 3 and any larger macros of their choice;
`such as the DCT.
`
`This section gives very short code examples pulled from
`larger programs, that illustrate how 1\/w(-2 instructions sup-
`port key needs of data-parallel computations (as enumerated
`earlier in the section on support for subword parallelism)
`Subword reamangement. Since m x and permute are new
`MA?-2 instructions not covered in earlier work,’ we give sever-
`
`56
`
`IEEE Micro
`
`Authorized licensed use limited to: University of Virginia Libraries. Downloaded on September 08,2020 at 07:12:04 UTC from IEEE Xplore. Restrictions apply.
`
`Oracle-1028 p. 6
`Oracle v. Teleputers
`IPR2021-0