Craig Hansen
MicroUnity Systems Engineering

This media processor extends general-purpose computer systems for communicating and processing digital video, audio, data, and radio frequency signals at broadband rates.
`
A broadband media processor is a general-purpose processor system with sufficient computing resources to communicate and process digital video, audio, data, and radio frequency signals at broadband rates (more than 1.5 Mbps). Because media processors reduce systems' initial cost, they can accelerate our progress toward media-rich, mobile communications services. They also enable network operators to create and maintain new services by broadcasting software through networks rather than installing successive generations of dedicated hardware.

Our MediaProcessor is a series of instruction-set-compatible processors enabling development of sophisticated software tools. MediaProcessor security and memory management features allow a broadband media processor to be the sole processor in a remotely programmable system, to scale to huge memory and I/O systems, and to support multiuser operating systems.

The MediaProcessor is the core of several cost-effective system designs. These include battery-powered handheld devices, compact network termination devices, multimedia personal computers, and large multiprocessing systems, all oriented toward flexible broadband communications. The architecture itself is scalable toward multiple implementations; this article focuses on the instruction set, system facilities, and software environments common to these implementations.
`
Bandwidth, agility, and cost
Over time, telecommunications has exhibited a general trend toward communicating richer and more realistic images of ideas to physically remote locations. Interfaces have advanced from telegraphy at a few bits per second to speech, audio, video, and cinema-grade digital video, reaching gigabit-per-second rates. To reduce the cost of broadband communications, the industry has focused on digital communications with sophisticated source (compression) and channel (modulation) coding.

In turn, general-purpose computers must perform these increasingly sophisticated communications tasks, but the computational requirements are quite demanding. Extracting information from raw images or sound data requires hundreds of operations to separate redundant from unique information. Similarly, the modulation and demodulation of digital data onto analog channels involves hundreds of operations per symbol. These operations add digital redundancy to make the channel reliable and modulate the digital data to yield efficient analog channel use. Even with small symbols, general-purpose computers require operand and instruction bandwidth about a thousand times their communications interface bandwidth.

Compared with ASIC designs, media-processor-based designs have lower system costs. This is because they aggregate numerous ASIC memories and logic blocks into a unified hierarchy of memory arrays and a single multiprecision data path. In addition, companies can amortize broadband media processors' development costs over many applications. Thus, while the first applications involve greater development for a general-purpose processor and sophisticated software tools, new applications leverage this effort. Life cycle costs of broadband media processor designs are also lower, as the user can change the design dramatically by downloading new software into existing devices. Moreover, for the new software, developers can use high-level language compilers and debuggers, whose sophistication will continue to improve.
`
We could add a broadband media processor's new instructions to existing microprocessor architectures, but this would only further encumber already complex designs. MicroUnity's MediaProcessor eliminates multiple register files, condition codes, and complex instruction formats to simplify bypass, interlock, and exception logic in deeply pipelined and highly parallel implementations.

IEEE Micro
0272-1732/96/$5.00 © 1996 IEEE

Oracle-1034 p. 1
Oracle v. Teleputers
IPR2021-00078

Figure 1 shows the user state, a 64-bit × 64-register file (which can be accessed as a 128-bit × 32-register pair file) and a 64-bit program counter. A MediaProcessor may execute multiple threads sharing the memory hierarchy, each with individual copies of the register file and program counter. The instructions are all 32 bits with an 8-bit major opcode and up to four 6-bit register specifiers in fixed locations. The remaining space is for immediate and suboperation specifiers. The high-order three bits of the major opcode classify instructions, pointing them toward specific functional units to simplify a critical logic path in multiple-issue implementations.
Table 1 summarizes the instruction set. The absence of condition codes, the user state, and multiple instruction formats result in a large number of instructions, as the instructions now indicate these choices. We have coalesced many instructions into orthogonally organized classes.

All system state is memory mapped; no instructions or system state whatsoever is privileged, as all protection is associated with the memory system. The virtual memory system provides four-level protection for read, write, execute, and gateway accesses to memory and the memory-mapped system state. This enables construction of very secure systems with small, trustworthy kernels and less-trusted supporting code. A very lightweight context switch upon synchronous exceptions and asynchronous events permits rapid handling of virtual memory system exceptions and I/O events. Multiple threads with independent register file contexts interleave real-time and other software and reduce the operation and memory latency of each thread.
I/O devices are often the source of hard real-time constraints; microprocessor-based systems must meet the bandwidth and latency demands of disk, network, and video interfaces. To this end, designers have built I/O devices with embedded processors that access memory autonomously. Thus, a multimedia PC is a complex heterogeneous multiprocessor. System designers and users must deal with the complex issues of cache coherence, synchronization, and locking between I/O processors and the general-purpose processor.
At the system level, the MediaProcessor memory-maps I/O devices with integral buffers and eliminates DMA. Software accesses I/O devices directly, eliminating excess trips through main memory by processing I/O data as it is communicated through the I/O system. This reduces demand on main memory bandwidth (a significant system-level cost) and eliminates the coherence issues that DMA introduces to processors with caches.
Direct access may also reduce latency, because when the processor guarantees real-time computational bandwidth, computation can begin before an I/O transfer is complete. This choice also enables more efficient memory use than would separate frame buffers. It defines portions of the video display with different display depths, and even eliminates the frame buffer entirely, constructing the video display on the fly from the visible portions of window buffers.

Figure 1. User state (a) and instruction format (b). Ra through Rd are register specifiers; imm indicates immediate specifier. In (b), the field widths are 8, 6, 6, 6, and 6 bits.

`
Group instructions
As microprocessor designs have progressed from 8- to 16-, 32-, and 64-bit processors, designers have extended registers to handle larger, or higher precision, scalar operands. However, since media processing involves mostly low-precision arithmetic, what advantage is there to supporting 128-bit operands in a media processor? MediaProcessor divides the 128-bit operands into groups of smaller operands (2×64, 4×32, 8×16, 16×8, 32×4, 64×2, or 128×1 bits), on which it performs independent operations. This permits peak operand bandwidth for computation at the 128-bit word size. Each halving of the operand size doubles the number of operations per instruction.
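
As a rough illustration of how one wide word can carry several independent operations, the following C sketch emulates a group add on eight 8-bit symbols packed into a 64-bit word. The function name `group_add_8x8` and the masking technique are assumptions made for illustration; they are not MediaProcessor intrinsics, and the hardware operates on 128-bit register pairs directly.

```c
#include <stdint.h>

/* Hypothetical helper (not a MediaProcessor intrinsic): add eight 8-bit
 * lanes of a and b independently, modulo 256 per lane, so that no carry
 * crosses a lane boundary. */
uint64_t group_add_8x8(uint64_t a, uint64_t b)
{
    const uint64_t high = 0x8080808080808080ULL; /* top bit of each lane */

    /* Add the low 7 bits of every lane; a carry can reach bit 7 of its
     * own lane but never the next lane. */
    uint64_t low_sum = (a & ~high) + (b & ~high);

    /* The top bit of each lane is a XOR b XOR carry-in, and the carry-in
     * is already sitting in bit 7 of low_sum. */
    return low_sum ^ ((a ^ b) & high);
}
```

For example, adding `0x01` to a lane holding `0xFF` wraps that lane to `0x00` and leaves the neighboring lane untouched, which is the behavior a partitioned adder needs.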
MediaProcessor group instructions specify an operation on four 128-bit register pairs, for a total operand and result bandwidth of 512 bits per instruction. One of the four operands is a register pair result up to 128 bits. Figure 2 shows the group-floating-point-multiply-and-add-half instruction, which performs eight floating-point multiplies and eight floating-point adds in a single instruction.

The architecture supports 32-bit and 64-bit (single and double) floating-point formats compliant with IEEE Std 754 [1], and 754-like 128-bit and 16-bit (quadword and halfword) floating-point formats. Group instruction operand widths are 128 bits for add, subtract, logical, and floating point, and 64 bits for integer multiply (with 128-bit results). Floating-point data types of 16 and 32 bits allow arithmetic with simplified scaling on intermediate-precision symbol streams (12 and 24 bits) at 8- and 16-bit integer rates.
`
August 1996
[Figure 2 shows eight results, a×i+q, b×j+r, c×k+s, d×l+t, e×m+u, f×n+v, g×o+w, and h×p+x, packed into 128 bits.]
Figure 2. Group-floating-point-multiply-add-half instruction.

Table 1. MediaProcessor instruction set summary.

Instructions | Optional features | Interval-issue-latency (cycles)

Storage (8, 16, 32, 64, or 128 bits) and synchronization (64 bits)
Load 8, 16, 32, 64, or 128 bits, little- or big-endian | Unsigned, aligned, immediate | 2-1-2
Store 8, 16, 32, 64, or 128 bits, little- or big-endian | Aligned, immediate | 4-1-0
Store add, compare, or multiplex 64 bits | Immediate, -and-swap | 8-7-7

Branch (64 bits)
Branch and-equal, and-not-equal, less, or less-equal-zero | | 2-1-1*
Branch equal, not-equal, less, or greater-equal | | 2-1-1
Branch floating-point equal, not-equal, less, or greater-equal (16, 32, 64, or 128 bits) | |
Branch | Immediate, -and-link | 2-1-1
Branch gateway | Immediate | 2-2-1
Branch down or back | | 2-1-1

Fixed point (64 bits) and group (128×1, 64×2, 32×4, 16×8, 8×16, 4×32, or 2×64 bits)
Add or subtract | Immediate, overflow | 1-1-1
Multiply | Unsigned, -and-add | 1-2-4**
Divide | Unsigned |
AND, OR, AND-NOT, OR-NOT, XOR, XNOR, NOR, or NAND | Immediate | 1-1-1
Shuffle, deal, or swizzle | | 1-1-2
Compress or expand | Unsigned, immediate | 1-1-2
Extract | Unsigned, immediate | 1-2-3
Deposit or withdraw immediate | Unsigned, merge | 1-1-2
Shift or rotate right or left | Unsigned, immediate, overflow | 1-1-2
4- or 8-way multiplex | Shuffle, transpose | 1-1-2
Select bytes | | 1-1-2
Set or sub, equal, not-equal, less, or greater-equal | Unsigned, immediate | 1-1-2
Multiplex | | 1-1-1
AND sum of bits | | 1-1-3
Log most significant bit | | 1-1-2
Galois-field multiply, polynomial multiply-divide, 8 or 64 bits | | 1-4-5

Floating-point scalar (16, 32, 64, or 128 bits) and group (8×16, 4×32, or 2×64 bits)
Add, subtract, multiply, or divide | Near, truncate, floor, ceiling, or exact |
Multiply-and-add or -subtract | Near, truncate, floor, ceiling, or exact |
Square-root, sink, float, or deflate | Near, truncate, floor, ceiling, or exact |
Absolute, negate, inflate | Exception |
Set equal, not-equal, less, greater-equal | Exception |

* 2-1-4 for unpredicted branches
** 1-5-7 for 32-bit multiply or multiply-and-add; 1-20-22 for 64-bit multiply; 1-23-25 for 64-bit multiply-and-add.

Load and store instructions operate on signed or unsigned symbols of 8, 16, 32, 64, or 128 bits, aligned or unaligned, with big- or little-endian byte ordering, and base-plus-offset or base-plus-index register address modes. Atomic synchronization instructions (store-add-and-swap, store-multiplex-and-swap, store-compare-and-swap, and store-multiplex) enable efficient sharing of memory and processing resources among multiple threads of execution. Group operations together with these load and store operations perform parallel operations on sequentially organized data.

MediaProcessor provides additional 128- and 64-bit register-to-register instructions for extended math operations, such as multiply over 8-bit Galois fields, GF(256). These instructions are useful for computing the syndrome bytes in Reed-Solomon error-correcting code blocks. They also support the nonarithmetic and finite-field arithmetic operations of broadband tasks without squandering the machine's bandwidth resources.

Switching instructions
The more challenging aspect of media processing is dealing with nonsequentially organized or mixed-precision data. MediaProcessor switching instructions alter the arrangement of symbols within 64- to 256-bit operands. Single instructions perform many commonly required rearrangements, and a sequence of three instructions can rearrange the contents of a register operand arbitrarily.

Shuffling. This is perhaps the most useful switching instruction. Shuffling separates multielement symbols into elemental parts and performs the reverse (for example, on real and imaginary parts of a complex-valued symbol, or on red-green-blue-alpha components of a color pixel). Figure 3a illustrates a specific form of a group-shuffle instruction (x = 7, y = 4, z = 1), which catenates 64-bit operands a and b, then rearranges the symbols so that groups of symbols are interleaved.

In its general form, the group-shuffle instruction specifies the size in bits over which symbols are shuffled ($2^x$), the size in bits of the symbols ($2^y$), and the degree of shuffling ($2^z$). Using parameters x, y, and z decoded from an immediate field of the instruction, the shuffling operation selects each bit i (where i runs from 0 to 127) of result c from bit j of the catenated source operand values (a || b):

$$c_i \leftarrow (a \,\|\, b)_j, \quad \text{where } j \leftarrow i_{6..x} \,\|\, i_{y+z-1..y} \,\|\, i_{x-1..y+z} \,\|\, i_{y-1..0}$$

Here, $c_i$ denotes bit i of c (where bit 0 is the least-significant bit); $i_{6..x}$ denotes a bit field extracted from bits 6 through x of i (an empty set when x is 7). The symbol || denotes bit field catenation. For the example illustrated, $j \leftarrow i_4 \,\|\, i_{6..5} \,\|\, i_{3..0}$.
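
The bit-selection rule for shuffling can be sketched in C as a bit-by-bit emulation. This is a minimal sketch under stated assumptions: it takes the rule to be j = i[6..x] || i[y+z-1..y] || i[x-1..y+z] || i[y-1..0], the `u128` type and the function names are hypothetical, and the loop is written for clarity rather than speed.

```c
#include <stdint.h>

/* Hypothetical 128-bit container: w[0] holds bits 0..63, w[1] bits 64..127. */
typedef struct { uint64_t w[2]; } u128;

static unsigned get_bit(const u128 *v, unsigned i)
{
    return (unsigned)((v->w[i >> 6] >> (i & 63)) & 1u);
}

/* Extract bits hi..lo of i (inclusive); the caller ensures hi >= lo. */
static unsigned field(unsigned i, unsigned hi, unsigned lo)
{
    return (i >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

/* Emulated group shuffle: result bit i is source bit j, with
 * j = i[6..x] || i[y+z-1..y] || i[x-1..y+z] || i[y-1..0].
 * src holds the catenated operands.
 * Assumed preconditions: z >= 1 and y + z <= x <= 7. */
u128 group_shuffle(u128 src, unsigned x, unsigned y, unsigned z)
{
    u128 dst = { { 0, 0 } };
    for (unsigned i = 0; i < 128; i++) {
        unsigned j = 0;
        if (x < 7)
            j |= field(i, 6, x) << x;             /* bits above the block stay put */
        j |= field(i, y + z - 1, y) << (x - z);   /* z bits move to the top of the block */
        if (x > y + z)
            j |= field(i, x - 1, y + z) << y;     /* remaining index bits shift down */
        if (y > 0)
            j |= field(i, y - 1, 0);              /* bits inside a symbol stay put */
        dst.w[i >> 6] |= (uint64_t)get_bit(&src, j) << (i & 63);
    }
    return dst;
}
```

With (x, y, z) = (7, 4, 1) this interleaves the 16-bit symbols of the two halves; with (7, 2, 4) it separates the low nibble of every byte from the high nibble, which corresponds to the deal-nibbles step used in the gamma correction example.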
Swizzling. Symbols may appear in memory in reversed order, or may need to be copied into multiple locations to form vector operands from scalar operands. Figure 3b illustrates a specific form of a group-swizzle instruction (icopy = 127, iswap = 112), which catenates one or two 64-bit operands, then swaps and copies partitions of the 128-bit value, producing a variety of permutations and copies of symbols. Group-swizzle can also reverse bits or bit fields within a symbol, for example, reversing each group of two bits of indexes in a radix-four fast Fourier transform. In its general form, the group-swizzle instruction uses two 7-bit immediate values (icopy, iswap) to compute each bit i of the result b from a bit in source operand value a: $b_i \leftarrow a_{(i \,\&\, icopy) \,\oplus\, iswap}$. Here, & denotes bitwise AND, and $\oplus$ denotes bitwise exclusive-OR.
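
The swizzle rule, in which result bit i comes from source bit (i AND icopy) XOR iswap, translates almost directly into C. The sketch below is illustrative only; the `u128` type and the function name `group_swizzle` are assumptions, not MediaProcessor intrinsics.

```c
#include <stdint.h>

/* Hypothetical 128-bit container: w[0] holds bits 0..63, w[1] bits 64..127. */
typedef struct { uint64_t w[2]; } u128;

/* Emulated group swizzle: result bit i is source bit (i & icopy) ^ iswap.
 * icopy and iswap are 7-bit values. */
u128 group_swizzle(u128 a, unsigned icopy, unsigned iswap)
{
    u128 b = { { 0, 0 } };
    for (unsigned i = 0; i < 128; i++) {
        unsigned j = ((i & icopy) ^ iswap) & 127u;       /* source bit index */
        uint64_t bit = (a.w[j >> 6] >> (j & 63)) & 1u;
        b.w[i >> 6] |= bit << (i & 63);
    }
    return b;
}
```

With icopy = 127 and iswap = 112, the three high-order index bits are inverted, so the eight 16-bit symbols come out in reverse order. With icopy = 15 and iswap = 0, every 16-bit symbol of the result is a copy of symbol 0, which is the scalar-to-vector copy the text mentions.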
Size conversions. Other switching instructions convert groups of symbols from one size to another for group operations on mixed-precision symbols, expanding operands to a large working precision, or reducing a result's precision after computation. Figure 3c illustrates a group-extract instruction, which extracts a 128-bit subset of a 256-bit operand (two register pairs), taking half-size values from each symbol and shifting by a specified amount.

Group-compress instructions perform the same operation on a 128-bit operand, yielding a 64-bit result. Group-expand instructions perform the reverse of a group-compress, shifting symbols a specified amount and placing them into double-size symbols, zero- or sign-extending the result.
Additional switching instructions rearrange bit fields within symbols. Figure 3d illustrates a group-deposit instruction, which places a specified right-aligned bit field from each symbol into a new position in the result. Group-merge-deposit combines the result with the original contents of the result register. Group-withdraw instructions perform the reverse of group-deposit, taking a bit field from a specified position in each symbol, then right-aligning and zero- or sign-extending the result. Group-shift and group-rotate instructions shift and rotate groups of symbols.

Figure 3. Group switching instructions: group-shuffle-doublets 128,16,8 (a); group-swizzle 127,112 (b); group-extract or group-compress (c); and group-deposit (d).
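
To make the deposit and withdraw operations concrete, here is a minimal C sketch that applies them to four 16-bit symbols packed in a 64-bit word. The function names and the `width` and `pos` parameters are hypothetical and chosen for illustration; the actual instruction encoding is not given in the article, and only the zero-extending form of withdraw is shown.

```c
#include <stdint.h>

/* Hypothetical per-symbol helpers on 16-bit symbols.
 * Assumed preconditions: 1 <= width and pos + width <= 16. */

/* Deposit: take the low `width` bits of s and place them at bit `pos`. */
static uint16_t deposit16(uint16_t s, unsigned width, unsigned pos)
{
    uint32_t mask = (1u << width) - 1u;
    return (uint16_t)(((uint32_t)s & mask) << pos);
}

/* Withdraw: take `width` bits starting at `pos`, right-align, zero-extend. */
static uint16_t withdraw16(uint16_t s, unsigned width, unsigned pos)
{
    uint32_t mask = (1u << width) - 1u;
    return (uint16_t)(((uint32_t)s >> pos) & mask);
}

/* Apply deposit to each of four 16-bit symbols packed in a 64-bit word. */
uint64_t group_deposit_4x16(uint64_t a, unsigned width, unsigned pos)
{
    uint64_t r = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint16_t s = (uint16_t)(a >> (16 * lane));
        r |= (uint64_t)deposit16(s, width, pos) << (16 * lane);
    }
    return r;
}

/* Apply withdraw to each of four 16-bit symbols packed in a 64-bit word. */
uint64_t group_withdraw_4x16(uint64_t a, unsigned width, unsigned pos)
{
    uint64_t r = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint16_t s = (uint16_t)(a >> (16 * lane));
        r |= (uint64_t)withdraw16(s, width, pos) << (16 * lane);
    }
    return r;
}
```

Withdrawing with the same `width` and `pos` that were used for a deposit recovers the original right-aligned field, which matches the description of withdraw as the reverse of deposit.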
Arbitrary permutations. Applications as diverse as cryptography and QAM-64 demodulation require a completely arbitrary permutation of bits. To achieve this, MediaProcessor uses a sequence of instructions, as no single instruction has enough operand information to specify the 64! possible bit permutations of a 64-bit symbol. A technique suggested by a Benes network structure [2] divides the problem into a sequence of 8-bit symbol permutations. Group-8-mux and group-transpose-8-mux instructions used in a three-instruction sequence order 64 bits arbitrarily. In fact, the instruction sequence orders two such 64-bit symbols at once. Group-8-mux (see Figure 4a) selects each bit i of the result d from bit j in source operand a by bits in control operands b and c: $d_i \leftarrow a_j$, where $j \leftarrow i_{6..3} \,\|\, c_{i \& 63} \,\|\, b_{(i \& 63)+64} \,\|\, b_{i \& 63}$.

Group-transpose-8-mux (Figure 4b) first transposes the 64-bit symbols of source operand a (64-bit transpose switches rows for columns when bits are arranged in an 8×8 array,
and is a shuffle where x = 6, y = 0, z = 3): $t_i \leftarrow a_k$, where $k \leftarrow i_6 \,\|\, i_{2..0} \,\|\, i_{5..3}$. Then, it computes result d by $d_i \leftarrow t_j$.

Group-shuffle-4-mux can operate in a sequence to rearrange 128 bits. Group-shuffle-4-mux decodes (x, y, z) from an instruction field and shuffles the 128 bits of a by $t_i \leftarrow a_k$, where $k \leftarrow i_{6..x} \,\|\, i_{y+z-1..y} \,\|\, i_{x-1..y+z} \,\|\, i_{y-1..0}$. It then computes result c by $c_i \leftarrow t_j$, where $j \leftarrow i_{6..2} \,\|\, b_{(i \& 63)+64} \,\|\, b_{i \& 63}$.

The group-select-bytes instruction (Figure 4c) uses a 16×4-bit operand to select one of 16 bytes of a second operand for each of the result operand's 16 bytes. This instruction has two distinct uses. It can 1) permute or copy 16 bytes to a 16-byte result with the bytes arbitrarily rearranged, or 2) operate as a general-purpose table-lookup operation performed entirely within the processor registers. The instruction selects each byte i of result d from a byte in the source operand value (a || b) by a value in a nibble of source control value c: $d_{8i+7..8i} \leftarrow (a \,\|\, b)_{8j+7..8j}$, where $j \leftarrow c_{4i+3..4i}$.

Figure 4. Group-8-mux (a), group-transpose-8-mux (b), and group-select-bytes (c) instructions.

Control instructions
An orthogonal collection of group integer and floating-point compare-and-set instructions and the group-multiplex instruction provide the ability to select alternative results element by element. This avoids hard-to-predict conditional branches within group-organized code. (The Gamma correction section demonstrates the use of these instructions.) Scalar compare-and-branch, branch, and branch-and-link operations provide data-dependent and modular control structure.

Branch-gateway atomically fetches 128 bits from memory into a register pair of code and data pointers, while checking translation lookaside buffer priority and protection permissions. It then branches to the code pointer, storing a result link in its place. The gateway instruction's design ensures that the target routine can trust the data pointer. Branch-gateway instructions permit extremely rapid access to secure code. They also enable secure resource sharing among trusted and untrusted modules to support robust access control, authentication, and encryption for digital communications systems.

Gamma correction
This example uses several group instructions, demonstrating group table lookup, multiply, and data-dependent conditional control. Gamma correction is a nonlinear function applied to pixel values to correct for a video display's nonlinear amplitude intensity response. Because the definition of gamma correction varies [3, 4], here we use the function defined by ITU-R Recommendation 709 [5], scaled for 8-bit values in the domain and range 0 to 255, and defined as

$$\mathrm{Rec709}(x) = \begin{cases} 255 \cdot 4.5\,(x/255), & x/255 < 0.018 \\ 255\,\bigl(1.099\,(x/255)^{0.45} - 0.099\bigr), & x/255 \ge 0.018 \end{cases}$$

Generally, systems perform this correction using a 256-entry, 8-bit table containing precomputed values. In this example, to correct sixteen 8-bit values concurrently using group instructions, MediaProcessor approximates Rec709(x) using a 16-segment piecewise-linear interpolation of the function. As the linear approximation is inaccurate for values in the range 0 to 15, MediaProcessor conditionally uses a bypass table to determine these values directly. Table 2 shows the table values and indicates a slope and intercept for the linear approximations and bypass values of the function, computed as

$$(x < 16)\ ?\ \mathrm{bypass}(x) : \frac{\mathrm{slope}(x) \cdot x}{16} + \mathrm{intercept}(x)$$

We scaled the curve's slope by 16, as it must be approximated by an integer.

Table 2. Table values for gamma correction.

x | Intercept | Slope | x | Bypass
0-15 | 1 | 55 | 0 | 0
16-31 | 26 | 30 | 1 | 5
32-47 | 39 | 23 | 2 | 9
48-63 | 50 | 19 | 3 | 14
64-79 | 62 | 16 | 4 | 18
80-95 | 72 | 14 | 5 | 23
96-111 | 78 | 13 | 6 | 27
112-127 | 85 | 12 | 7 | 30
128-143 | 93 | 11 | 8 | 34
144-159 | 93 | 11 | 9 | 37
160-175 | 92 | 11 | 10 | 40
176-191 | 102 | 10 | 11 | 43
192-207 | 114 | 9 | 12 | 46
208-223 | 114 | 9 | 13 | 48
224-239 | 128 | 8 | 14 | 51
240-255 | 128 | 8 | 15 | 53
A future release of the C compiler will vectorize the computation when written as a simple C expression. An example is the C expression (x[i] < 16) ? bypass[x[i] & 15] : ((slope[x[i] >> 4]*x[i]) >> 4) + intercept[x[i] >> 4]. Figure 5a shows C code acceptable to the current compiler, with intrinsic functions representing the instructions. Figure 5b shows assembly code with symbolic register names. The lo64(x) and hi64(x) functions select the low-order and high-order 64 bits of a 128-bit value or register.

Figure 5c shows the original function and the piecewise-linear approximation, which are nearly coincident, and an expanded error function. We scaled the error function up by 16 times to show the function's accuracy; it is within ±1.5 worst case and 0.44 mean squared error, generally adequate for video display purposes. The function requires only 10 instructions to correct 16 symbols, or about 0.6 instructions per symbol.
The group-deal-nibbles (G.SHUFFLEI 128,4,16) instruction separates the four high-order bits from the four low-order bits of each of sixteen 8-bit input symbols, producing two 16×4-bit key symbols in a register pair. Two group-select-bytes (G.SELECT.8) instructions use the high-order key to produce sixteen 8-bit values for the intercept and slope. A third group-select-bytes instruction uses the low-order key to produce sixteen 8-bit values for bypass values. Two group-unsigned-multiply-bytes (G.UMUL.8) instructions multiply the sixteen 8-bit slope symbols with the sixteen 8-bit input symbols, producing sixteen 16-bit products. The group-extract-bytes (G.EXTRACTI.8) instruction divides these products by 16 and reduces them to sixteen 8-bit offset symbols. Next, the group-add-bytes (G.ADD.8) instruction adds the sixteen 8-bit offset symbols to the sixteen 8-bit intercept symbols, producing sixteen 8-bit output symbols. The group-set-unsigned-less-byte (G.SET.UL.8) instruction compares the sixteen 8-bit input symbols against a threshold value, producing a 128-bit mask. Finally, the 128-bit mask selects symbols from the bypass table to replace the computed value by the group-multiplex (G.MUX) instruction, producing sixteen 8-bit output symbols.
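
The ten-instruction sequence can be followed lane by lane in portable C. The sketch below is an emulation written for illustration, not the intrinsic-based code of Figure 5a: the function name `gamma16` is hypothetical, each statement stands in for one group instruction acting on a single lane, and the table contents are the slope, intercept, and bypass values of Table 2 as encoded in the constants of Figure 5a.

```c
#include <stdint.h>

/* Table values, index 0 first (from Table 2 and the Figure 5a constants). */
static const uint8_t slope[16]     = {55, 30, 23, 19, 16, 14, 13, 12,
                                      11, 11, 11, 10,  9,  9,  8,  8};
static const uint8_t intercept[16] = { 1, 26, 39, 50, 62, 72, 78, 85,
                                      93, 93, 92, 102, 114, 114, 128, 128};
static const uint8_t bypass[16]    = { 0,  5,  9, 14, 18, 23, 27, 30,
                                      34, 37, 40, 43, 46, 48, 51, 53};

/* Hypothetical lane-by-lane emulation for sixteen 8-bit pixels. */
void gamma16(const uint8_t x[16], uint8_t y[16])
{
    for (unsigned i = 0; i < 16; i++) {
        unsigned hi = x[i] >> 4;                    /* deal nibbles: high-order key */
        unsigned lo = x[i] & 15;                    /* deal nibbles: low-order key */
        unsigned sx = slope[hi];                    /* select bytes */
        unsigned ix = intercept[hi];                /* select bytes */
        unsigned bx = bypass[lo];                   /* select bytes */
        unsigned product = sx * x[i];               /* unsigned multiply, 16-bit product */
        unsigned offset = (product >> 4) & 0xFF;    /* extract: divide by 16 */
        unsigned sum = (offset + ix) & 0xFF;        /* add bytes */
        unsigned mask = (x[i] < 16) ? 0xFFu : 0x00u; /* set unsigned less than 16 */
        y[i] = (uint8_t)((bx & mask) | (sum & ~mask)); /* multiplex */
    }
}
```

A caller fills a 16-byte input array and reads the corrected pixels from the output array. Inputs below 16 take the bypass value; all others take the slope-and-intercept result, so an input of 255 maps to 255.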
`
typedef unsigned long long uint128;

uint128 x, y;
uint128 key, sx, ix, bx, p0, p1, pr;

const uint128 slope     = 0x080809090a0b0b0b0c0d0e1013171e37;
const uint128 intercept = 0x80807272665c5d5d554e483e32271a01;
const uint128 bypass    = 0x3533302e2b2825221e1b17120e090500;
const uint128 threshold = 0x10101010101010101010101010101010;

key = gdeal4(hi64(x),lo64(x));
sx = gselect8(hi64(slope),lo64(slope),hi64(key));
ix = gselect8(hi64(intercept),lo64(intercept),hi64(key));
bx = gselect8(hi64(bypass),lo64(bypass),lo64(key));
p0 = gumul8(hi64(sx),hi64(x));
...
(a)

G.SHUFFLEI  key,hi64(x),lo64(x),128,4,16
...
G.SELECT.8  bx,hi64(bypass),lo64(bypass),lo64(key)
G.UMUL.8    p0,lo64(sx),lo64(x)
G.UMUL.8    p1,hi64(sx),hi64(x)
...
G.MUX       y,mask,bx,sum
(b)

[Plot (c): horizontal axis "Input pixel value"; vertical axis ticks from 0 to 256 in steps of 32; the curves are not recoverable.]

Figure 5. Gamma correction C code (a), assembly code (b), and function and residual error (c).

MediaProcessor structure
MicroUnity has developed and implemented a set of media-processing building blocks that compose a variety of systems (see Figure 6), ranging from simple network devices to large multiprocessing systems.

[Figure 6 block diagram; recoverable labels: MediaCodec (audio, video, radio), network, MediaProcessor, and SDRAM, Flash, and serial interfaces.]
Figure 6. Structure of a MediaProcessor-based system (L: load, S: store).

MediaProcessor integrates a high-bandwidth register file with a data path that performs group, branch, and gateway operations; load, store, and synchronization; group arithmetic; group switching; and extended mathematics. These operations produce a powerful capability for processing media data streams.

The data path connects to an on-chip memory system that includes both dedicated buffer space and high-bandwidth caches. These satisfy the twin media-processing requirements of real-time and expanded memory spaces. A TLB with extended functions provides support for mapping, protection, and priority of memory operations. The memory system includes on-chip support for external memory in highly integrated systems, and connects to additional support devices via an extremely high-bandwidth MediaChannel interface [6].

MediaChannel interfaces use a simple packet protocol for 8-byte read and write transactions from a single master to up to four slave devices per interface. A memory-mapped protocol layer supports cache-coherent multiprocessor capabilities pioneered by the IEEE Std 1596 Scalable Coherent Interface [7].

The first MediaProcessor implementation is a 0.5-micron, BiCMOS 1-cm² die. It issues instructions at 1 GHz into a five-way interleaved pipeline, providing five independent 200-MHz threads. For this fixed-point subset implementation, the Interval-issue-latency column in Table 1 summarizes the performance of each thread (in 200-MHz cycles). Interval is the minimum interval between similar instructions; issue is time consumed by the instruction; and latency is time to a dependent instruction (all in cycles).

This MediaProcessor implementation has a 512-Gbps register bandwidth. A 128-Gbps memory bandwidth feeds on-chip 256-Kbit instruction and data memories that are partitionable to cache or dedicated buffer space. An integrated SDRAM interface reaches 3.2-Gbps peak bandwidth, and Flash RAM and serial bus interfaces support agile and downloadable bootstrap code. Two MediaChannel interfaces support 32-Gbps communication to other system building blocks: the MediaCodec, MediaBridge, and MediaRam devices. Other CMOS interleaved and noninterleaved MediaProcessor versions extend the range of possible implementations.

MediaCodec devices are mixed-signal converters and digital linear filters that allow the interfaces to use any communications protocol. One version of MediaCodec has two broadband RF receivers (tuning 6- to 8-MHz channels within a 1-GHz range), video and stereo audio input and output, telephony, infrared, and smart-card interfaces. The 128-Kbit buffer memory directly connects the filtered samples to the processor with very low latency via the MediaChannel interface. A simple version has a single RF transceiver and local-area network interfaces.

MediaBridge devices connect the MediaChannel interface to industry-standard DRAM or the PCI bus, allowing systems to use low-cost memories and a wide variety of existing I/O interfaces. A 1-Mbit memory operates either as a secondary cache when interfacing to DRAMs, or as I/O buffer memory to PCI devices. MediaRam devices provide DRAM storage with integrated MediaChannel interfaces [8].
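
The 512-Gbps register bandwidth and the 200-MHz thread rate quoted for the first implementation are consistent with figures given elsewhere in the article. The arithmetic below is a back-of-the-envelope check, assuming one group instruction with 512 bits of operand and result traffic issues on every 1-GHz cycle; the article does not spell out this derivation.

```latex
\[
512\ \frac{\text{bits}}{\text{instruction}} \times 10^{9}\ \frac{\text{instructions}}{\text{s}}
  = 512\ \text{Gbps},
\qquad
\frac{1\ \text{GHz}}{5\ \text{threads}} = 200\ \text{MHz per thread}
\]
```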
Software tools
We also provide a development environment for MediaProcessor software. This environment includes

• C and C++ compilers;
• source-code debuggers and profilers;
• media and communications software libraries for standards such as MPEG, National Television Standards Committee decode, Dolby, QAM and QPSK (quadrature amplitude modulation and quadrature phase shift keying), Viterbi and Reed-Solomon FEC (forward error correction), and DES (Data Encryption Standard);
• a very small real-time microkernel for client devices; and
• 64-bit Open Software Foundation Unix for server applications [9].

Currently the C compiler can vectorize simple loops, and we plan to further improve its vectorization strength. A Mathematica [10] library of functions representing MediaProcessor instructions permits symbolic development and verification of MediaProcessor software. Abbott et al. [11] give further examples of algorithms developed for the MediaProcessor.
THE BROADBAND MEDIA PROCESSOR is a platform for developing the next generation of communications systems and for future generations of communications algorithms. By meeting the challenges of bandwidth, agility, low cost, and simplicity, it enables the development of broadband communications systems that use software-centered design. This, in turn, reduces system cost and enables a rich variety of new media services.
`
Acknowledgments
Thanks go to the guest editors and referees, whose comments and encouragement helped refine and focus the article, and to the many people at MicroUnity who helped to bring this architecture to fruition.
`
References
1. ANSI/IEEE Std 754-1985, Binary Floating-Point Arithmetic, IEEE, Piscataway, N.J., 1985.
2. V.E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965.
3. C. Poynton, A Technical Introduction to Digital Video, John Wiley & Sons, New York, 1996.
4. C. Poynton, "Gamma and Its Disguises: The Nonlinear Mappings of Intensity in Perception, CRTs, Film and Video," SMPTE J., Vol. 102, No. 12, Dec. 1993, pp. 1099-1108.
5. ITU-R Recommendation BT.709-1, Basic Parameter Values for the HDTV Standard for the Studio and for International Programme Exchange, Int'l Telecommunications Union, Geneva.
6. C. Hansen, "Architecture of a Broadband Mediaprocessor," Proc. Compcon, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 334-340.
7. IEEE Std 1596-1992, Scalable Coherent Interface, IEEE, 1992.
8. T. Robinson et al., "Multi-Gigabyte/sec DRAMs with the MicroUnity MediaChannel Interface," Proc. Compcon, IEEE CS Press, 1996, pp. 387-381.
9. R. Hayes et al., "MicroUnity Software Development Environment," Proc. Compcon, IEEE CS Press, 1996, pp. 341-348.
10. S. Wolfram, Mathematica: A System for Doing Mathematics by Computer, 2nd ed., Addison-Wesley Publishing Co., Redwood City, Calif., 1991.
11. C. Abbott et al., "Broadband Algorithms with the MicroUnity Mediaprocessor," Proc. Compcon, IEEE CS Press, 1996, pp. 349-354.
`
Craig Hansen is chief architect at MicroUnity Systems Engineering Inc., in Sunnyvale, California. His current work is defining further MediaProcessor architecture enhancements and implementations. Previously, he was a designer of processor architectures and systems at NeXT, Mips Computer Systems, Weitek, and Hewlett Packard. Hansen received a BS from Cornell University and an MS from Stanford University, both
