`
`MicroUnity Systems
`Engineering
`
This media processor extends general-purpose computer systems to communicating and processing digital video, audio, data, and radio frequency signals at broadband rates.
`
A broadband media processor is a general-purpose processor system with sufficient computing resources to communicate and process digital video, audio, data, and radio frequency signals at broadband rates (more than 1.5 Mbps). Because media processors reduce systems' initial cost, they can accelerate our progress toward media-rich mobile communications services. They also enable network operators to create and maintain new services by broadcasting software through networks rather than installing successive generations of dedicated hardware.
Our MediaProcessor is a series of instruction-set-compatible processors enabling development of sophisticated software tools. MediaProcessor security and memory management features allow a broadband media processor to be the sole processor in a remotely programmable system, to scale to huge memory and I/O systems, and to support multiuser operating systems.
The MediaProcessor is the core of several cost-effective system designs. These include battery-powered handheld devices, compact network termination devices, multimedia personal computers, and large multiprocessing systems, all oriented toward flexible broadband communications. The architecture itself is scalable toward multiple implementations; this article focuses on the instruction set, system facilities, and software environments common to these implementations.
`
`Bandwidth, agility, and cost
Over time, telecommunications has exhibited a general trend toward communicating richer and more realistic images of ideas to physically remote locations. Interfaces have advanced from telegraphy at a few bits per second to speech, audio, video, and cinema-grade digital video, reaching gigabit-per-second rates. To reduce the cost of broadband communications, the industry has focused on digital communications with sophisticated source (compression) and channel (modulation) coding.
`In turn, general-purpose computers must
`perform these increasingly sophisticated
`communications tasks, but the computation-
`al requirements are quite demanding.
`Extracting information from raw images or
`sound data requires hundreds of operations
`to separate redundant from unique informa-
`tion. Similarly, the modulation and demodu-
`lation of digital data onto analog channels
`involves hundreds of operations per symbol.
`These operations add digital redundancy to
`make the channel reliable and modulate the
`digital data to yield efficient analog channel
`use. Even with small symbols, general-pur-
`pose computers require operand and instruc-
`tion bandwidth about a thousand times their
`communications interface bandwidth.
`Compared with ASIC designs, media-
`processor-based designs have lower system
`costs. This is because they aggregate numer-
`ous ASIC memories and logic blocks into a
`unified hierarchy of memory arrays and a
`single multiprecision data path. In addition,
`companies can amortize broadband media
`processors' development costs over many
`applications. Thus, while the first applica-
`tions involve greater development for a gen-
`eral-purpose processor and sophisticated
`software tools, new applications leverage
`this effort. Life cycle costs of broadband
`media processor designs are also lower, as
`the user can change the design dramatical-
`ly by downloading new software into exist-
`ing devices. Moreover, for the new software,
`developers can use high-level language
`compilers and debuggers, whose sophisti-
`cation will continue to improve.
`
34 IEEE Micro
0272-1732/96/$5.00 © 1996 IEEE

Oracle-1034 p. 1
Oracle v. Teleputers
IPR2021-00078

We could add a broadband media processor's new instructions to existing microprocessor architectures, but this would only further encumber already complex designs. MicroUnity's MediaProcessor eliminates multiple register files, condition codes, and complex instruction formats to simplify bypass, interlock, and exception logic in deeply pipelined and highly parallel implementations.
Figure 1 shows the user state: a 64-bit x 64-register file (which can be accessed as a 128-bit x 32-register pair file) and a 64-bit program counter. A MediaProcessor may execute multiple threads sharing the memory hierarchy, each with individual copies of the register file and program counter. The instructions are all 32 bits with an 8-bit major opcode and up to four 6-bit register specifiers in fixed locations. The remaining space is for immediate and suboperation specifiers. The high-order three bits of the major opcode classify instructions, pointing them toward specific functional units to simplify a critical logic path in multiple-issue implementations.
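As a rough illustration of the fixed-field format just described, the following C sketch splits a 32-bit instruction word into an 8-bit major opcode and four 6-bit register specifiers. Only the field widths (8, 6, 6, 6, 6) come from the article. The bit positions (opcode in bits 31..24, then Rd, Rc, Rb, Ra from high to low) and the names `mp_fields` and `mp_decode` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: [31..24] major opcode, [23..18] Rd, [17..12] Rc,
   [11..6] Rb, [5..0] Ra. The ordering is an assumption, not the
   documented encoding. */
typedef struct {
    uint8_t major;       /* 8-bit major opcode */
    uint8_t unit_class;  /* high-order three bits of the major opcode */
    uint8_t rd, rc, rb, ra;
} mp_fields;

static void mp_decode(uint32_t word, mp_fields *out)
{
    assert(out != NULL);
    out->major      = (uint8_t)(word >> 24);
    out->unit_class = (uint8_t)(out->major >> 5);
    out->rd = (uint8_t)((word >> 18) & 0x3F);
    out->rc = (uint8_t)((word >> 12) & 0x3F);
    out->rb = (uint8_t)((word >> 6) & 0x3F);
    out->ra = (uint8_t)(word & 0x3F);
}
```

Because every field sits at a fixed position, a decoder can extract all register specifiers in parallel before it knows which of them a given opcode actually uses.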
`Table 1 (next page) summarizes the instruction set. The
`absence of condition codes, the user state, and multiple
`instruction formats result in a large number of instructions,
`as the instructions now indicate these choices. We have coa-
`lesced many instructions into orthogonally organized classes.
All system state is memory mapped; no instructions or system state whatsoever is privileged, as all protection is associated with the memory system. The virtual memory system provides four-level protection for read, write, execute, and gateway accesses to memory and the memory-mapped system state. This enables construction of very secure systems with small, trustworthy kernels and less-trusted supporting code. A very lightweight context switch upon synchronous exceptions and asynchronous events permits rapid handling of virtual memory system exceptions and I/O events. Multiple threads with independent register file contexts interleave real-time and other software and reduce the operation and memory latency of each thread.
I/O devices are often the source of hard real-time constraints; microprocessor-based systems must meet the bandwidth and latency demands of disk, network, and video interfaces. To this end, designers have built I/O devices with embedded processors that access memory autonomously. Thus, a multimedia PC is a complex heterogeneous multiprocessor. System designers and users must deal with the complex issues of cache coherence, synchronization, and locking between I/O processors and the general-purpose processor.
At the system level, the MediaProcessor memory maps I/O devices with integral buffers and eliminates DMA. Software directly accesses I/O devices, eliminating excess trips through main memory by processing I/O data as it is communicated through the I/O system. This reduces demand on main memory bandwidth (a significant system-level cost) and eliminates the coherence issues that DMA introduces to processors with caches.
Direct access may also reduce latency, because when the processor guarantees real-time computational bandwidth, computation can begin before an I/O transfer is complete. This choice also enables more efficient memory use than would separate frame buffers. It defines portions of the video display with different display depths, and even eliminates the frame buffer entirely, constructing the video display on the fly from the visible portions of window buffers.

Figure 1. User state (a) and instruction format (b). Ra through Rd are register specifiers; imm indicates immediate specifier. Instruction format field widths, in bits: 8, 6, 6, 6, 6.
`
`Group instructions
As microprocessor designs have progressed from 8- to 16-, 32-, and 64-bit processors, designers have extended registers to handle larger, or higher precision, scalar operands. However, since media processing involves mostly low-precision arithmetic, what advantage is there to supporting 128-bit operands in a media processor? MediaProcessor divides the 128-bit operands into groups of smaller operands (2x64, 4x32, 8x16, 16x8, 32x4, 64x2, or 128x1 bits), on which it performs independent operations. This permits peak operand bandwidth for computation at the 128-bit word size. Each halving of the operand size doubles the number of operations per instruction.
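The partitioning idea can be sketched in portable C. This is only a software model under stated assumptions: the type `u128` and the functions `add_lanes64` and `group_add` are hypothetical names, the 128-bit operand is modeled as two 64-bit halves, and the loop stands in for what the hardware does in a single instruction.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;  /* lo holds bits 0..63 */

/* Add two 64-bit values lane by lane. Carries never cross a lane
   boundary, so each lane wraps independently. */
static uint64_t add_lanes64(uint64_t a, uint64_t b, unsigned lane_bits)
{
    uint64_t mask = (lane_bits == 64) ? ~0ULL : ((1ULL << lane_bits) - 1);
    uint64_t r = 0;
    for (unsigned s = 0; s < 64; s += lane_bits) {
        uint64_t sum = ((a >> s) & mask) + ((b >> s) & mask);
        r |= (sum & mask) << s;
    }
    return r;
}

/* Group add on a 128-bit operand split into 128/lane_bits lanes;
   lane_bits must be a power of two between 1 and 64. */
static u128 group_add(u128 a, u128 b, unsigned lane_bits)
{
    assert(lane_bits >= 1 && lane_bits <= 64);
    assert((lane_bits & (lane_bits - 1)) == 0);
    u128 r = { add_lanes64(a.lo, b.lo, lane_bits),
               add_lanes64(a.hi, b.hi, lane_bits) };
    return r;
}
```

With `lane_bits` set to 8 the call performs sixteen independent byte additions; with 16 it performs eight, which is the doubling relationship the paragraph above describes.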
MediaProcessor group instructions specify an operation on four 128-bit register pairs, for a total operand and result bandwidth of 512 bits per instruction. One of the four operands is a register pair result up to 128 bits. Figure 2 (page 37) shows the group-floating-point-multiply-and-add-half instruction, which performs eight floating-point multiplies and eight floating-point adds in a single instruction.
The architecture supports 32-bit and 64-bit (single and double) floating-point formats compliant with IEEE Std 754 [1], and 754-like 128-bit and 16-bit (quadword and halfword) floating-point formats. Group instruction operand widths are 128 bits for add, subtract, logical, and floating point, and 64 bits for integer multiply (with 128-bit results). Floating-point data types of 16 and 32 bits allow arithmetic with simplified scaling on intermediate-precision symbol streams (12 and 24 bits) at 8- and 16-bit integer rates.
`
`August 1996 35
`
`
`
`
`Table 1. Mediaprocessor instruction set summary.
`
`Instructions
`
`Optional
`features
`
`Interval-issue-
`latency (cycles)
`
`2-1-2
`
`4-1 -0
`
`8-7-7
`
`2-1 -1 *
`
`2-1-1
`
`Aligned, immediate
`
`Immediate, -and-swap
`
`Immediate, -and-link
`Immediate
`
or base-plus-index register address modes. Atomic synchronization instructions (store-add-and-swap, store-multiplex-and-swap, store-compare-and-swap, and store-multiplex) enable efficient sharing of memory and processing resources among multiple threads of execution. Group operations together with these load and store operations perform parallel operations on sequentially organized data.
MediaProcessor provides additional 128- and 64-bit register-to-register instructions for extended math operations, such as multiply over 8-bit Galois fields, GF(256). These instructions are useful for computing the syndrome bytes in Reed-Solomon error-correcting code blocks. They also support the nonarithmetic and finite-field arithmetic operations of broadband tasks without squandering the machine's bandwidth resources.
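To make the GF(256) use case concrete, here is a minimal scalar sketch of a Galois-field multiply and of one Reed-Solomon syndrome byte computed by Horner's rule. The article does not name the field polynomial; the value 0x11D used here is an assumption (it is a common choice in Reed-Solomon codes), and `gf256_mul` and `rs_syndrome` are hypothetical helper names, not MediaProcessor intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GF_POLY 0x11D  /* assumed polynomial: x^8 + x^4 + x^3 + x^2 + 1 */

/* Shift-and-add multiply in GF(2^8), reducing by GF_POLY. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    unsigned acc = 0, aa = a;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            acc ^= aa;          /* addition in GF(2^8) is XOR */
        b >>= 1;
        aa <<= 1;
        if (aa & 0x100)
            aa ^= GF_POLY;      /* reduce when degree reaches 8 */
    }
    return (uint8_t)acc;
}

/* One syndrome byte: the received block, read as a polynomial with
   block[0] as the highest-order coefficient, evaluated at alpha. */
static uint8_t rs_syndrome(const uint8_t *block, size_t n, uint8_t alpha)
{
    assert(block != NULL);
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = (uint8_t)(gf256_mul(s, alpha) ^ block[i]);
    return s;
}
```

Each step of the syndrome loop is one multiply and one XOR per received byte, which is the kind of inner loop a single-instruction field multiply shortens.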
`
`Storage (8, 16, 32, 64, or 128 bits) and synchronization (64 bits)
`Load 8, 16, 32, 64, or 128 bits,
`Unsigned, aligned, immediate
`little- or big-endian
`Store 8, 16, 32, 64, or 128 bits,
`little- or big-endian
`Store add, compare, or multiplex
`64 bits
`Branch (64 bits)
`Branch and-equal, and-not-equal,
`less, or less-equal-zero
`Branch equal, not-equal, less,
`or greater-equal
`Branch floating-point equal,
`not-equal, less, or greater-equal
(16, 32, 64, or 128 bits)
`2-1-1
`Branch
`2-2-1
`Branch gateway
`2-1-1
`Branch down or back
Fixed point (64 bits) and group (128x1, 64x2, 32x4, 16x8, 8x16, 4x32, or 2x64 bits)
Add or subtract
Immediate, overflow
`1-1-1
`Multiply
`Unsigned, -and-add
`1-2-4**
`Divide
`Unsigned
`AND, OR, AND-NOT, OR-NOT,
`Immediate
`XOR, XNOR, NOR, or NAND
`Shuffle, deal, or swizzle
`Compress or expand
`Extract
`Deposit or withdraw immediate
`Shift or rotate right or left
`4- or 8-way multiplex
`Select bytes
`Set or sub, equal, not-equal, less,
`or greater-equal
`Multiplex
`AND sum of bits
`Log most significant bit
`Galois-field multiply, polynomial
`multiply-divide, 8 or 64 bits
Floating-point scalar (16, 32, 64, or 128 bits) and group (8x16, 4x32, or 2x64 bits)
`Near, truncate, floor, ceiling, or exact
`Add, subtract, multiply, or divide
`Near, truncate, floor, ceiling, or exact
`Multiply-and-add or -subtract
`Square-root, sink, float, or deflate
`Near, truncate, floor, ceiling, or exact
`Absolute, negate, inflate
`Exception
`Set equal, not-equal, less,
`Exception
`greater-equal
`
`Unsigned, immediate
`Unsigned, immediate
`Unsigned, merge
`Unsigned, immediate, overflow
`Shuffle, transpose
`
`Unsigned, immediate
`
`1-1-1
`
`1-1-2
`1-1-2
`1-2-3
`1-1-2
`1-1-2
`1-1 -2
`1-1-2
`1-1-2
`
`1-1-1
`1-1-3
`1-1-2
`1-4-5
`
Switching instructions
The more challenging aspect of media processing is dealing with nonsequentially organized or mixed-precision data. MediaProcessor switching instructions alter the arrangement of symbols within 64- to 256-bit operands. Single instructions perform many commonly required rearrangements, and a sequence of three instructions can rearrange the contents of a register operand arbitrarily.
Shuffling. This is perhaps the most useful switching instruction. Shuffling separates multielement symbols into elemental parts and performs the reverse (for example, on real and imaginary parts of a complex-valued symbol, or on red-green-blue-alpha components of a color pixel). Figure 3a (next page) illustrates a specific form of a group-shuffle instruction (x = 7, y = 4, z = 1), which catenates 64-bit operands a and b, then rearranges the symbols so that groups of symbols are interleaved.
In its general form, the group-shuffle instruction specifies the size in bits over which symbols are shuffled ($2^x$), the size in bits of the symbols ($2^y$), and the degree of shuffling ($2^z$). Using parameters x, y, and z decoded from an immediate field of the instruction, the shuffling operation selects each bit (i, where $i \leftarrow 0$ to 127) of
* 2-1-4 for unpredicted branches
** 1-5-7 for 32-bit multiply (or multiply-and-add); 1-20-22 for 64-bit multiply; 1-23-25 for 64-bit multiply-and-add.
`
`Load and store instructions operate on signed or unsigned
`symbols of 8, 16, 32, 64, or 128 bits, aligned or unaligned,
`with big- or little-endian byte ordering, and base-plus-offset
`
`
`
`
Figure 2. Group-floating-point-multiply-add-half instruction. Eight halfword lanes across 128 bits compute a×i+q, b×j+r, c×k+s, d×l+t, e×m+u, f×n+v, g×o+w, and h×p+x.
`
result c from bit j of the catenated source operand values (a || b): $c_i \leftarrow (a \,\|\, b)_j$, where $j \leftarrow i_{6..x} \,\|\, i_{y+z-1..y} \,\|\, i_{x-1..y+z} \,\|\, i_{y-1..0}$. Here, $c_i$ denotes bit i of c (where bit 0 is the least-significant bit); $i_{6..x}$ denotes a bit field extracted from bits 6 through x of i (an empty set when x is 7). The symbol $\|$ denotes bit field catenation. For the example illustrated, $j \leftarrow i_4 \,\|\, i_{6..5} \,\|\, i_{3..0}$.
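The index mapping above can be modeled bit by bit in C. This is a software sketch, not the hardware implementation: the 128-bit catenated operand is represented by a hypothetical `u128` struct (which half holds a and which holds b is left to the caller), and `shuffle_index` and `group_shuffle` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;  /* lo holds bits 0..63 */

static unsigned get_bit(u128 v, unsigned i)
{
    return (unsigned)(((i < 64) ? (v.lo >> i) : (v.hi >> (i - 64))) & 1u);
}

static void put_bit(u128 *v, unsigned i, unsigned bit)
{
    if (i < 64) v->lo |= (uint64_t)bit << i;
    else        v->hi |= (uint64_t)bit << (i - 64);
}

/* j = i[6..x] || i[y+z-1..y] || i[x-1..y+z] || i[y-1..0] */
static unsigned shuffle_index(unsigned i, unsigned x, unsigned y, unsigned z)
{
    assert(x <= 7 && y + z <= x);
    unsigned top    = (i >> x) << x;                            /* i[6..x]     */
    unsigned mid_lo = (i >> y) & ((1u << z) - 1u);              /* i[y+z-1..y] */
    unsigned mid_hi = (i >> (y + z)) & ((1u << (x - y - z)) - 1u); /* i[x-1..y+z] */
    unsigned low    = i & ((1u << y) - 1u);                     /* i[y-1..0]   */
    return top | (mid_lo << (x - z)) | (mid_hi << y) | low;
}

/* Result bit i is source bit shuffle_index(i, x, y, z). */
static u128 group_shuffle(u128 ab, unsigned x, unsigned y, unsigned z)
{
    u128 c = { 0, 0 };
    for (unsigned i = 0; i < 128; i++)
        put_bit(&c, i, get_bit(ab, shuffle_index(i, x, y, z)));
    return c;
}
```

With x = 7, y = 4, z = 1, eight 16-bit symbols numbered 0 through 7 come out in the order 0, 4, 1, 5, 2, 6, 3, 7, which interleaves the two 64-bit halves symbol by symbol.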
Swizzling. Symbols may appear in memory in reversed order, or may need to be copied into multiple locations to form vector operands from scalar operands. Figure 3b illustrates a specific form of a group-swizzle instruction (icopy = 127, iswap = 112), which catenates one or two 64-bit operands, then swaps and copies partitions of the 128-bit value, producing a variety of permutations and copies of symbols. Group-swizzle can also reverse bits or bit fields within a symbol, for example, reversing each group of two bits of indexes in a radix-four fast Fourier transform. In its general form, the group-swizzle instruction uses two 7-bit immediate values (icopy, iswap) to compute each bit i of the result b from a bit in source operand value a: $b_i \leftarrow a_{(i \,\&\, icopy) \,\wedge\, iswap}$. Here, & denotes bitwise AND, and $\wedge$ denotes bitwise exclusive-OR.
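The swizzle rule is compact enough to model directly. The sketch below assumes a hypothetical `u128` struct for the 128-bit value and a hypothetical function name `group_swizzle`; it applies the index expression (i AND icopy) XOR iswap one bit at a time.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;  /* lo holds bits 0..63 */

static unsigned get_bit(u128 v, unsigned i)
{
    return (unsigned)(((i < 64) ? (v.lo >> i) : (v.hi >> (i - 64))) & 1u);
}

static void put_bit(u128 *v, unsigned i, unsigned bit)
{
    if (i < 64) v->lo |= (uint64_t)bit << i;
    else        v->hi |= (uint64_t)bit << (i - 64);
}

/* Result bit i is source bit ((i & icopy) ^ iswap). Clearing a bit in
   icopy replicates data across that index bit; setting a bit in iswap
   exchanges the two partitions that index bit distinguishes. */
static u128 group_swizzle(u128 a, unsigned icopy, unsigned iswap)
{
    assert(icopy < 128 && iswap < 128);
    u128 b = { 0, 0 };
    for (unsigned i = 0; i < 128; i++)
        put_bit(&b, i, get_bit(a, (i & icopy) ^ iswap));
    return b;
}
```

For example, icopy = 127 with iswap = 112 flips index bits 4 through 6 and so reverses the order of the eight 16-bit symbols, while icopy = 15 with iswap = 0 copies the lowest 16-bit symbol into every position.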
Size conversions. Other switching instructions convert groups of symbols from one size to another for group operations on mixed-precision symbols, expanding operands to a large working precision, or reducing a result's precision after computation. Figure 3c illustrates a group-extract instruction, which extracts a 128-bit subset of a 256-bit operand (two register pairs), taking half-size values from each symbol and shifting by a specified amount.
Group-compress instructions perform the same operation on a 128-bit operand, yielding a 64-bit result. Group-expand instructions perform the reverse of a group-compress, shifting symbols a specified amount and placing them into double-size symbols, zero- or sign-extending the result.

Figure 3. Group switching instructions: group-shuffle-doublets 128,16,8 (a); group-swizzle 127,112 (b); group-extract or group-compress (c); and group-deposit (d).

Additional switching instructions rearrange bit fields within symbols. Figure 3d illustrates a group-deposit instruction, which places a specified right-aligned bit field from each symbol into a new position in the result. Group-merge-deposit combines the result with the original contents of the result register. Group-withdraw instructions perform the reverse of group-deposit, taking a bit field from a specified position in each symbol, then right-aligning and zero- or sign-extending the result. Group-shift and group-rotate instructions shift and rotate groups of symbols.
Arbitrary permutations. Applications as diverse as cryptography and QAM-64 demodulation require a completely arbitrary permutation of bits. To achieve this, MediaProcessor uses a sequence of instructions, as no single instruction has enough operand information to specify the 64! possible bit permutations of a 64-bit symbol. A technique suggested by a Benes network structure [2] divides the problem into a sequence of 8-bit symbol permutations. Group-8-mux and group-transpose-8-mux instructions used in a three-instruction sequence order 64 bits arbitrarily. In fact, the instruction sequence simultaneously orders two such 64-bit symbols. Group-8-mux (see Figure 4a) selects each bit i of the result d from bit j in source operand a by bits in control operands b and c: $d_i \leftarrow a_j$, where $j \leftarrow i_{6..3} \,\|\, b_{(i \& 63)+64} \,\|\, c_{i \& 63} \,\|\, b_{i \& 63}$.
Group-transpose-8-mux (Figure 4b) first transposes the 64-bit symbols of source operand a (64-bit transpose switches rows for columns when bits are arranged in an 8x8 array,
`
`
`
Table 2. Table values for gamma correction.

x          Intercept   Slope      x    Bypass
0-15           1         55       0       0
16-31         26         30       1       5
32-47         39         23       2       9
48-63         50         19       3      14
64-79         62         16       4      18
80-95         72         14       5      23
96-111        78         13       6      27
112-127       85         12       7      30
128-143       93         11       8      34
144-159       93         11       9      37
160-175       92         11      10      40
176-191      102         10      11      43
192-207      114          9      12      46
208-223      114          9      13      48
224-239      128          8      14      51
240-255      128          8      15      53
`
by element. This avoids hard-to-predict conditional branches within group-organized code. (The Gamma correction section demonstrates the use of these instructions.) Scalar compare-and-branch, branch, and branch-and-link operations provide data-dependent and modular control structure.
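The branch-free selection idea can be sketched with two small C helpers that model a per-byte compare-and-set followed by a bitwise multiplex. The function names `set_less_u8` and `mux_u8` and the byte-array representation are assumptions for illustration; the real instructions operate on whole 128-bit registers at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-element unsigned compare-and-set: mask[i] is 0xFF where
   a[i] < b[i], and 0x00 otherwise. */
static void set_less_u8(const uint8_t *a, const uint8_t *b,
                        uint8_t *mask, size_t n)
{
    assert(a != NULL && b != NULL && mask != NULL);
    for (size_t i = 0; i < n; i++)
        mask[i] = (a[i] < b[i]) ? 0xFF : 0x00;
}

/* Bitwise multiplex: take bits of t where the mask is 1, bits of f
   where it is 0. No data-dependent branch is needed. */
static void mux_u8(const uint8_t *mask, const uint8_t *t,
                   const uint8_t *f, uint8_t *out, size_t n)
{
    assert(mask != NULL && t != NULL && f != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((mask[i] & t[i]) | (~mask[i] & f[i]));
}
```

Both candidate results are computed for every element, and the mask picks between them, so the instruction stream is identical regardless of the data.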
Branch-gateway atomically fetches 128 bits from memory into a register pair of code and data pointers, while checking translation lookaside buffer priority and protection permissions. It then branches to the code pointer, storing a result link in its place. The gateway instruction's design ensures that the target routine can trust the data pointer. Branch-gateway instructions permit extremely rapid access to secure code. They also enable secure resource sharing among trusted and untrusted modules to support robust access control, authentication, and encryption for digital communications systems.
Gamma correction
This example uses several group instructions, demonstrating group table lookup, multiply, and data-dependent conditional control. Gamma correction is a nonlinear function applied to pixel values to correct for a video display's nonlinear amplitude intensity response. Because the definition of gamma correction varies, here we use the function defined by ITU-R Recommendation 709 [5], scaled for 8-bit values with domain and range 0 to 255, and defined as

$$\mathrm{Rec709}(x) = 255 \times \begin{cases} 4.5\,(x/255), & x/255 < 0.018 \\ 1.099\,(x/255)^{0.45} - 0.099, & x/255 \ge 0.018 \end{cases}$$

Generally, systems perform this correction using a 256-entry, 8-bit table containing precomputed values. In this example, to correct sixteen 8-bit values concurrently using group instructions, MediaProcessor approximates Rec709(x) using a 16-piece linear interpolation of the function. As the linear approximation is inaccurate for values in the range 0 to 15, MediaProcessor conditionally uses a bypass table to
`Figure 4. Group-8-mux (a), group-transpose-8-mux (b), and
`group-select-bytes (c) instructions.
`
and is a shuffle where x = 6, y = 0, z = 3): $t_i \leftarrow a_k$, where $k \leftarrow i_6 \,\|\, i_{2..0} \,\|\, i_{5..3}$. Then, it computes result d by $d_i \leftarrow t_j$, with j selected as in group-8-mux.
Group-shuffle-4-mux can operate in a sequence to rearrange 128 bits. Group-shuffle-4-mux decodes (x, y, z) from an instruction field and shuffles the 128 bits of a by $t_i \leftarrow a_k$, where $k \leftarrow i_{6..x} \,\|\, i_{y+z-1..y} \,\|\, i_{x-1..y+z} \,\|\, i_{y-1..0}$. It then computes result c by $c_i \leftarrow t_j$, where $j \leftarrow i_{6..2} \,\|\, b_{(i \& 63)+64} \,\|\, b_{i \& 63}$.
The group-select-bytes instruction (Figure 4c) uses a 16x4-bit operand to select one of 16 bytes of a second operand for each of the result operand's 16 bytes. This instruction has two distinct uses. It can 1) permute or copy 16 bytes to a 16-byte result with the bytes arbitrarily rearranged, or 2) operate as a general-purpose table-lookup operation performed entirely within the processor registers. The instruction selects each byte i of result d from a byte in the source operand value (a || b) by a value in a nibble of source control value c: $d_{8i+7..8i} \leftarrow (a \,\|\, b)_{8j+7..8j}$, where $j \leftarrow c_{4i+3..4i}$.
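A byte-level model of this selection rule follows. It is a sketch under assumptions: the 16-byte source (a || b) is held in a byte array with byte 0 least significant, the sixteen 4-bit selectors are packed into a `uint64_t`, and the name `group_select_bytes` is chosen for illustration rather than taken from the compiler's intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* d[i] = src[j], where j is nibble i of the control value c
   (nibble 0 is the least-significant four bits). */
static void group_select_bytes(const uint8_t src[16], uint64_t c,
                               uint8_t d[16])
{
    assert(src != NULL && d != NULL);
    for (unsigned i = 0; i < 16; i++) {
        unsigned j = (unsigned)((c >> (4 * i)) & 0xF);
        d[i] = src[j];
    }
}
```

When `src` holds a 16-entry table and `c` holds sixteen 4-bit keys, one call performs sixteen table lookups; when `c` holds a permutation of 0 through 15, the same call rearranges the bytes of `src`.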
Control instructions
An orthogonal collection of group integer and floating-point compare-and-set instructions and the group-multiplex instruction provide the ability to select alternative results element
`
`
`
determine these values directly. Table 2 shows the table values and indicates a slope and intercept for the linear approximations and bypass values of the function, computed as

$$(x < 16)\ ?\ \mathrm{bypass}(x)\ :\ \frac{\mathrm{slope}(x) \cdot x}{16} + \mathrm{intercept}(x)$$

We scaled the curve's slope by 16, as it must be approximated by an integer.
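A scalar C version of this computation may help in checking the table values. The function name `gamma_approx` is hypothetical. The intercept and bypass arrays follow Table 2; the slope array is taken from the bytes of the slope constant in Figure 5a (least-significant byte first), which is an assumption about how that constant is ordered.

```c
#include <assert.h>
#include <stdint.h>

/* Sixteen-entry tables, indexed by the high nibble of x (slope,
   intercept) or by x itself when x < 16 (bypass). */
static const uint8_t slope[16] =
    { 55, 30, 23, 19, 16, 14, 13, 12, 11, 11, 11, 10, 9, 9, 8, 8 };
static const uint8_t intercept[16] =
    { 1, 26, 39, 50, 62, 72, 78, 85, 93, 93, 92, 102, 114, 114, 128, 128 };
static const uint8_t bypass[16] =
    { 0, 5, 9, 14, 18, 23, 27, 30, 34, 37, 40, 43, 46, 48, 51, 53 };

/* Piecewise-linear approximation of the gamma curve for one pixel.
   The slope is stored scaled by 16, hence the shift by 4. */
static uint8_t gamma_approx(uint8_t x)
{
    if (x < 16)
        return bypass[x & 15];
    return (uint8_t)(((slope[x >> 4] * x) >> 4) + intercept[x >> 4]);
}
```

The group-instruction version performs the same arithmetic on sixteen pixels at once and replaces the `if` with a compare-and-set mask followed by a multiplex.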
A future release of the C compiler will vectorize the computation when written as a simple C expression. An example is the C expression (x[i] < 16) ? bypass[x[i] & 15] : ((slope[x[i] >> 4]*x[i]) >> 4) + intercept[x[i] >> 4]. Figure 5a shows C code acceptable to the current compiler, with intrinsic functions representing the instructions. Figure 5b shows assembly code with symbolic register names. The lo64(x) and hi64(x) functions select the low-order and high-order 64 bits of a 128-bit value or register.
Figure 5c shows the original function and the piecewise-linear approximation, which are nearly coincident, and an expanded error function. We scaled the error function up by 16 times to show the function's accuracy; it is within ±1.5 worst case and 0.44 mean squared error, generally adequate for video display purposes. The function requires only 10 instructions to correct 16 symbols, or 0.6 instructions per symbol.
The group-deal-nibbles (G.SHUFFLEI 128,4,16) instruction separates the four high-order bits from the four low-order bits of each of sixteen 8-bit input symbols, producing two 16x4-bit key symbols in a register pair. Two group-select-bytes (G.SELECT.8) instructions use the high-order key to produce sixteen 8-bit values for the intercept and slope. A third group-select-bytes instruction uses the low-order key to produce sixteen 8-bit values for bypass values. Two group-unsigned-multiply-bytes (G.UMUL.8) instructions multiply the sixteen 8-bit slope symbols with the sixteen 8-bit input symbols, producing sixteen 16-bit products. The group-extract-bytes (G.EXTRACTI.8) instruction divides these products by 16 and reduces them to sixteen 8-bit offset symbols. Next, the group-add-bytes (G.ADD.8) instruction adds the sixteen 8-bit offset symbols to the sixteen 8-bit intercept symbols, producing sixteen 8-bit output symbols. The group-set-unsigned-less-byte (G.SET.UL.8) instruction compares the sixteen 8-bit input symbols against a threshold value, producing a 128-bit mask. Finally, the 128-bit mask selects symbols from the bypass table to replace the computed value by the group-multiplex (G.MUX) instruction, producing sixteen 8-bit output symbols.
`
`MediaProcessor structure
MicroUnity has developed and implemented a set of media-processing building blocks that compose a variety of systems (see Figure 6), ranging from simple network devices to large multiprocessing systems.
`MediaProcessor integrates a high-bandwidth register file
`with a data path that performs group, branch, and gateway
`operations; load, store, and synchronization; group arith-
`metic; group switching; and extended mathematics. These
`operations produce a powerful capability for processing
`media data streams.
`The data path connects to an on-chip memory system that
`
typedef unsigned long long uint128;

uint128 x, y;
uint128 key, sx, ix, bx, p0, p1, pr;

const uint128 intercept = 0x80807272665c5d5d
                          554e483e32271a01;
const uint128 slope     = 0x080809090a0b0b0b
                          0c0d0e1013171e37;
const uint128 bypass    = 0x3533302e2b282522
                          1e1b17120e090500;
const uint128 threshold = 0x1010101010101010
                          1010101010101010;

key = gdeal4(hi64(x),lo64(x));
sx = gselect8(hi64(slope),lo64(slope),hi64(key));
ix = gselect8(hi64(intercept),lo64(intercept),hi64(key));
bx = gselect8(hi64(bypass),lo64(bypass),lo64(key));
p0 = gumul8(hi64(sx),hi64(x));
...

(a)

G.SHUFFLEI   key,hi64(x),lo64(x),128,4,16
...
G.SELECT.8   bx,hi64(bypass),lo64(bypass),lo64(key)
G.UMUL.8     p0,lo64(sx),lo64(x)
G.UMUL.8     p1,hi64(sx),hi64(x)
...
G.MUX        y,mask,bx,sum

(b)

[Plot: output value (axis ticks from -16 to 256) against input pixel value.]

(c)

Figure 5. Gamma correction C code (a), assembly code (b), and function and residual error (c).
`
`
`
`
[Block diagram labels: MediaCodec; audio, video, radio, net; MediaProcessor; SDRAM, Flash, serial.]

Figure 6. Structure of a MediaProcessor-based system (L: load, S: store).

`
`in a 1-GHz range), video and stereo
`audio input and output, telephony,
`infrared, and smart-card interfaces.
`The 128-Kbit buffer memory direct-
`ly connects the filtered samples to
`the processor with very low latency
`via the Mediachannel interface. A
`simple version has a single RF trans-
`ceiver and local-area network inter-
`faces.
`MediaBridge devices connect the
`Mediachannel interface to industry-
`standard DRAM or the PCI bus, allow-
`ing systems to use low-cost memories
`and a wide variety of existing I/O
interfaces. A 1-Mbit memory operates either as a secondary cache when interfacing to DRAMs, or as I/O buffer memory to PCI devices. MediaRam devices provide DRAM storage with integrated Mediachannel interfaces [8].
`
includes both dedicated buffer space and high-bandwidth caches. These satisfy the twin media-processing requirements of real-time and expanded memory spaces. A TLB with extended functions provides support for mapping, protection, and priority of memory operations. The memory system includes on-chip support for external memory in highly integrated systems, and connects to additional support devices via an extremely high-bandwidth Mediachannel interface [6]. Mediachannel interfaces use a simple packet protocol for 8-byte read and write transactions from a single master to up to four slave devices per interface. A memory-mapped protocol layer supports cache-coherent multiprocessor capabilities pioneered by the IEEE Std 1596 Scalable Coherent Interface [7].
`The first MediaProcessor implementation is a 0.5-micron,
`BiCMOS 1-cm2 die. It issues instructions at 1 GHz into a five-
`way interleaved pipeline, providing five independent 200-
`MHz threads. For this fixed-point subset implementation, the
`Interval-issue-latency column in Table 1 summarizes the per-
`formance of each thread (in 200-MHz cycles). Interval is the
`minimum interval between similar instructions; issue is time
`consumed by the instruction; and latency is time to a depen-
`dent instruction (all in cycles).
This MediaProcessor implementation has a 512-Gbps register bandwidth. A 128-Gbps memory bandwidth feeds on-chip 256-Kbit instruction and data memories that are partitionable to cache or dedicated buffer space. An integrated SDRAM interface reaches 3.2-Gbps peak bandwidth, and Flash RAM and serial bus interfaces support agile and downloadable bootstrap code. Two Mediachannel interfaces support 32-Gbps communication to other system building blocks: the MediaCodec, MediaBridge, and MediaRam devices. Other CMOS interleaved and noninterleaved MediaProcessor versions extend the range of possible implementations.
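The register bandwidth figure agrees with the group instruction format described earlier, on the assumption that one instruction issues per 1-GHz cycle and each moves four 128-bit register-pair operands:

$$4 \times 128\ \text{bits} \times 10^{9}\ \text{s}^{-1} = 512 \times 10^{9}\ \text{bits/s} = 512\ \text{Gbps}$$

Under the same assumption, each of the five interleaved 200-MHz threads sees one fifth of this, or $4 \times 128 \times 200 \times 10^{6} = 102.4$ Gbps.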
`MediaCodec devices are mixed-signal converters and dig-
`ital linear filters that allow the interfaces to use any commu-
`nications protocol. One version of MediaCodec has two
`broadband RF receivers (tuning 6- to 8-MHz channels with-
`
`Software tools
We also provide a development environment for MediaProcessor software. This environment includes

• C and C++ compilers;
• source-code debuggers and profilers;
• media and communications software libraries for standards such as MPEG, Nat'l Television Standards Committee decode, Dolby, QAM and QPSK (quadrature amplitude modulation and quadrature phase shift keying), Viterbi and Reed-Solomon FEC (forward error correction), and DES (Data Encryption Standard);
• a very small real-time microkernel for client devices; and
• 64-bit Open Software Foundation Unix for server applications [9].
`
Currently the C compiler can vectorize simple loops, and we plan to further improve its vectorization strength. A Mathematica [10] library of functions representing MediaProcessor instructions permits symbolic development and verification of MediaProcessor software. Abbott et al. [11] give further examples of algorithms developed for the MediaProcessor.
`
`THE BROADBAND MEDIA PROCESSOR is a platform
`for developing the next generation of communications sys-
`tems and for future generations of communications algo-
`rithms. By meeting the challenges of bandwidth, agility, low
`cost, and simplicity, it enables the development of broad-
`band communications systems that use software-centered
design. This, in turn, reduces system cost and enables a rich variety of new media services.
`
Acknowledgments
Thanks go to the guest editors and referees, whose comments and encouragement helped refine and focus the article, and to the many people at MicroUnity who helped to bring this architecture to fruition.
`
`References
1. ANSI/IEEE Std 754-1985, Binary Floating-Point Arithmetic, IEEE, Piscataway, N.J., 1985.
2. V.E. Benes, Mathematical Theory of Communication Networks and Telephone Traffic, Academic Press, New York, 1965.
3. C. Poynton, A Technical Introduction to Digital Video, John Wiley & Sons, New York, 1996.
4. C. Poynton, "Gamma and Its Disguises: The Nonlinear Mappings of Intensity in Perception, CRTs, Film and Video," SMPTE J., Vol. 102, No. 12, Dec. 1993, pp. 1099-1108.
5. ITU Recommendation BT.709-1, Basic Parameter Values for the HDTV Standard for the Studio and for International Programme, Int'l Telecommunications Union, Geneva.
6. C. Hansen, "Architecture of a Broadband Mediaprocessor," Proc. Compcon, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 334-340.
7. IEEE Std 1596-1992, Scalable Coherent Interface, IEEE, 1992.
8. T. Robinson et al., "Multi-Gigabyte/sec DRAMs with the MicroUnity Mediachannel Interface," Proc. Compcon, IEEE CS Press, 1996, pp. 387-381.
9. R. Hayes et al., "MicroUnity Software Development Environment," Proc. Compcon, IEEE CS Press, 1996, pp. 341-348.
10. S. Wolfram, Mathematica: A System for Doing Mathematics by Computer, 2nd ed., Addison-Wesley Publishing Co., Redwood City, Calif., 1991.
11. C. Abbott et al., "Broadband Algorithms with the MicroUnity Mediaprocessor," Proc. Compcon, IEEE CS Press, 1996, pp. 349-354.
`
Craig Hansen is chief architect at MicroUnity Systems Engineering Inc., in Sunnyvale, California. His current work is defining further MediaProcessor architecture enhancements and implementations. Previously, he was a designer of processor architectures and systems at NeXT, Mips Computer Systems, Weitek, and Hewlett Packard.
Hansen received a BS from Cornell University and an MS from Stanford University, both