`HTC and LG v. PUMA, IPR2015-01501
`
`
`
`Petitioners HTC and LG - Exhibit 1005, p. cover-2
`
`An Architectural Overview of the Programmable
`Multimedia Processor, TM-1
`
`Selliah Rathnam, Gert Slavenburg
`
`Philips Semiconductors
`811 E. Arques Avenue, Sunnyvale, CA 94088
`
`ABSTRACT
`
TM-1 is the first in a family of programmable multimedia processors from the Trimedia product group of Philips Semiconductors. This "C" programmable processor has a high performance VLIW-CPU core with video and audio peripheral units designed to support the popular multimedia applications. TM-1 is designed to concurrently process video, audio, graphics, and communication data. The VLIW-CPU core is capable of executing a maximum of twenty seven operations per cycle, and the sustained execution rate is about five operations per cycle for the tuned applications. The audio unit easily handles different audio formats including 16-bit stereo. The video unit is capable of processing different YUV and RGB pixel formats with horizontal and vertical scaling and color space conversion. TM-1 applications can range from low-cost, stand alone systems such as video phones to programmable, multipurpose plug-in cards for traditional computers.
`
1.0 INTRODUCTION
`
TM-1 is a building block for high-performance multimedia applications that deal with high-quality video and audio. TM-1 easily implements popular multimedia standards such as MPEG-1 and MPEG-2, but its orientation around a powerful general-purpose CPU makes it capable of implementing a variety of multimedia algorithms, whether open or proprietary.

More than just an integrated microprocessor with unusual peripherals, the TM-1 microprocessor is a fluid computer system controlled by a small real-time OS kernel that runs on the VLIW processor core. TM-1 contains a CPU, a high-bandwidth internal bus, and internal bus-mastering DMA peripherals.

[Figure 1: TM-1 block diagram. Legible labels: Main Memory Interface; VLD coprocessor (Huffman decoder, slice-at-a-time, MPEG-1 & 2); video in and video out (CCIR 601/656, YUV 4:2:2); audio in and audio out (I2S, DC-80 kHz, stereo digital audio); I2C interface (I2C bus to camera, etc.); V.34 or ISDN front end; image coprocessor (down & up scaling, YUV to RGB); PCI interface.]

3'99/96 $5.00 © 1996 IEEE
Proceedings of COMPCON '96

[Figure 2 labels: CCIR 601/656, YUV 4:2:2; Stereo Audio Out; V.34 Modem Front End]

Figure 2. TM-1 system connections. A minimal TM-1 system requires few supporting components.
`
TM-1 is the first member of a family of chips that will carry investments in software forward in time. Compatibility between family members is at the source-code level; binary compatibility between family members is not guaranteed. All family members, however, will be able to perform the most important multimedia functions, such as running MPEG-2 software.

Defining software compatibility at the source-code level gives Philips the freedom to strike the optimum balance between cost and performance for all the chips in the TM-1 family. Powerful compilers ensure that programmers seldom need to resort to non-portable assembler programming. Programmers use TM-1's powerful low-level operations from source code; these DSP-like operations are invoked with a familiar function-call syntax. Trimedia also provides hand-coded and tuned multimedia libraries which can be used to increase the performance of multimedia applications.
`
As the first member of the family, TM-1 is tailored for use in PC-based applications. Because it is based on a general-purpose CPU, TM-1 can serve as a multi-function PC enhancement vehicle. Typically, a PC must deal with multi-standard video and audio streams, and users desire both decompression and compression, if possible. While the CPU chips used in PCs are becoming capable of low-resolution real-time video decompression, high-quality video decompression, not to mention compression, is still out of reach. Further, users demand that their systems provide live video and audio without sacrificing the responsiveness of the system.
TM-1 enhances a PC system to provide real-time multimedia, and it does so with the advantages of a special-purpose, embedded solution (low cost and chip count) and the advantages of a general-purpose processor (reprogrammability). For PC applications, TM-1 far surpasses the capabilities of fixed-function multimedia chips.
`
Other Trimedia family members will have different sets of interfaces appropriate for their intended use. For example, a TM-1 chip for a cable-TV decoder box would eliminate the video-in interface.
`
2.0 TM-1 CHIP OVERVIEW
`
`The key features of TM-1 are:
`
- A very powerful, general-purpose VLIW processor core that coordinates all on-chip activities. In addition to implementing the non-trivial parts of multimedia algorithms, this processor runs a small real-time operating system that is driven by interrupts from the other units.
- DMA-driven multimedia input/output units that operate independently and that properly format data to make processing efficient.
- DMA-driven multimedia coprocessors that operate independently and perform operations specific to important multimedia algorithms.
- A high-performance bus and memory system that provides communication between TM-1's processing units.
`
Figure 1 shows a block diagram of the TM-1 chip. The bulk of a TM-1 system consists of the TM-1 microprocessor itself, a block of synchronous DRAM (SDRAM), and minimal external circuitry to interface to the incoming and/or outgoing multimedia data streams. TM-1 can gluelessly interface to the standard PCI bus for personal-computer-based applications; thus, TM-1 can be placed directly on the PC mainboard or on a plug-in card.
`
Figure 2 shows a possible TM-1 system application. A video-input stream, if present, might come directly from a CCIR 601-compliant digital video camera chip in YUV 4:2:2 format; the interface is glueless in this case. A non-standard camera chip can be connected via a video decoder chip (such as the Philips SAA7111). A CCIR 601 output video stream is provided directly from the TM-1 to drive a dedicated video monitor. Stereo audio input and output require external ADC and DAC support. The operation of the video and audio interface units is highly customizable through programmable parameters.
`
The glueless PCI interface allows the TM-1 to display video via a host PC's video card and to play audio via a host PC's sound hardware. The Image Coprocessor provides display support for live video in an arbitrary number of arbitrarily overlapped windows.
`
`Finally, the V.34 interface requires only an external
`modem front-end chip and phone line interface to pro-
`vide remote communication support. The modem can be
`used to connect TM-1-based systems for video phone or
`video conferencing applications, or it can be used for
general-purpose data communication in PC systems.
`
`3.0 BRIEF EXAMPLES OF OPERATION
`
The key to understanding TM-1 operation is observing that the CPU and peripherals are time-shared and that communication between units is through SDRAM memory. The CPU switches from one task to the next; first it decompresses a video frame, then it decompresses a slice of the audio stream, then back to video, etc. As necessary, the CPU issues commands to the peripheral units to orchestrate their operation.
`
the PCI bus for archival on local mass storage, or the host can transfer the compressed video over a network, such as ISDN. The data can also be sent to a remote system using the integrated V.34 interface to create, for example, a video phone or video conferencing system.

The TM-1 CPU can enlist the ICP and video-in units to help with some of the straightforward, tedious tasks associated with video processing. The function of these units is programmable. For example, some video streams are (or need to be) scaled horizontally, so these units can handle the most common cases of horizontal down- and up-scaling without intervention from the TM-1 CPU.
`
`3.1 Video Decompression in a PC
`
A typical mode of operation for a TM-1 system is to serve as a video-decompression engine on a PCI card in a PC. In this case, the PC doesn't know the TM-1 has a powerful, general-purpose CPU; rather, the PC just treats the hardware on the PCI card as a "black-box" engine.
`
Video decompression begins when the PC operating system hands the TM-1 a pointer to compressed video data in the PC's memory (the details of the communication protocol are typically handled by a software driver installed in the PC's operating system).
`
The TM-1 CPU fetches data from the compressed video stream via the PCI bus, decompresses frames from the video stream, and places them into local SDRAM. Decompression may be aided by the VLD (variable-length decoder) unit, which implements Huffman decoding and is controlled by the TM-1 CPU.
`
When a frame is ready for display, the TM-1 CPU gives the ICP (image coprocessor) a display command. The ICP then autonomously fetches the decompressed frame data from SDRAM and transfers it over the PCI bus to the frame buffer in the PC's video display card (or the frame buffer in PC system memory if the PC uses a UMA (Unified Memory Architecture) frame buffer). The ICP accommodates arbitrary window size, position, and overlaps.
`
`3.2 Video Compression
Another typical application for TM-1 is in video compression. In this case, uncompressed video is usually supplied directly to the TM-1 system via the video-in unit. A camera chip connected directly to the video-in unit supplies YUV data in eight-bit, 4:2:2 format. The video-in unit takes care of sampling the data from the camera chip and demultiplexing the raw video to SDRAM in three separate areas, one each for Y, U, and V.
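
The demultiplexing step can be pictured with a small software sketch. This is only an illustration of the idea, not the video-in hardware: the function name demux_422_line and the U, Y, V, Y byte order within each pixel pair are assumptions, since the text states only that the stream is eight-bit, time-multiplexed 4:2:2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split one line of interleaved 4:2:2 video into separate Y, U and V arrays.
 * Assumed byte order (not stated in the text): U0 Y0 V0 Y1, U2 Y2 V2 Y3, ...
 * 'pixels' is the number of luma samples in the line and must be even. */
void demux_422_line(const uint8_t *in, size_t pixels,
                    uint8_t *y, uint8_t *u, uint8_t *v)
{
    assert(pixels % 2 == 0);
    for (size_t i = 0; i < pixels / 2; i++) {
        u[i]         = in[4 * i + 0];  /* one U sample per pixel pair */
        y[2 * i]     = in[4 * i + 1];
        v[i]         = in[4 * i + 2];  /* one V sample per pixel pair */
        y[2 * i + 1] = in[4 * i + 3];
    }
}
```

After the call, the Y array holds one sample per pixel while the U and V arrays each hold half as many, which is why three separate areas of different sizes are natural for this format.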
`
When a complete video frame has been read from the camera chip by the video-in unit, it interrupts the TM-1 CPU. The CPU compresses the video data in software (using a set of powerful data-parallel operations) and writes the compressed data to a separate area of SDRAM.
`
Since the powerful, general-purpose TM-1 CPU is available, the compressed data can be encrypted before being transferred for security.
`
4.0 VLIW CORE AND PERIPHERAL UNITS
`
`4.1 VLIW Processor Core
`
The heart of TM-1 is its powerful 32-bit CPU core. The CPU implements a 32-bit linear address space and 128 fully general-purpose 32-bit registers. The registers are not separated into banks; any operation can use any register for any operand.

The core uses a VLIW instruction-set architecture and is fully general-purpose. TM-1 uses a VLIW instruction length that allows up to five simultaneous operations to be issued. These operations can target any five of the 27 functional units in the CPU, including integer and floating-point arithmetic units and data-parallel DSP-like units.
`
[Figure 3 blocks, top to bottom: Instruction Cache (32 KB); Instr. Fetch Buffer; Decompression Hardware; Issue Register (5 Ops); Operation Routing Network; Execution Unit (27 Functions); Register Routing and Forwarding Network; Register File (128 x 32)]
`
`The compressed video data can now be disposed of in
`any of several ways. It can be sent to a host system over
`
`Figure 3. VLIW Processor Core and Instruction
`Cache.
`
`
Although the processor core runs a tiny real-time operating system to coordinate all activities in the TM-1 system, the processor core is not intended for true general-purpose use as the only CPU in a computer system. For example, the processor core does not implement virtual memory address translation, an essential feature in a general-purpose computer system.
`
TM-1 uses a VLIW architecture to maximize processor throughput at the lowest possible cost. VLIW architectures have performance exceeding that of superscalar general-purpose CPUs without the extreme complexity of a superscalar implementation. The hardware saved by eliminating superscalar logic reduces cost and allows the integration of multimedia-specific features that enhance the power of the processor core.
`
The TM-1 operation set includes all traditional microprocessor operations. In addition, multimedia-specific operations are included that dramatically accelerate standard video compression and decompression algorithms. As just one of the five operations issued in a single TM-1 instruction, a single special or "custom" operation can implement up to 11 traditional microprocessor operations. Multimedia-specific operations combined with the VLIW architecture result in tremendous throughput for multimedia applications.
`
4.2 Internal "Data Highway" Bus
The internal data bus connects all internal blocks together and provides access to internal control registers (in each on-chip peripheral unit), external SDRAM, and the external PCI bus. The internal bus consists of separate 32-bit data and address buses, and transactions on the bus use a block-transfer protocol. Peripherals can be masters or slaves on the bus.
`
Access to the internal bus is controlled by a central arbiter, which has a request line from each potential bus master. The arbiter is configurable in a number of different modes so that the arbitration algorithm can be tailored for different applications. Peripheral units make requests to the arbiter for bus access, and depending on the arbitration mode, bus bandwidth is allocated to the units in different amounts. Each mode allocates bandwidth differently, but each mode guarantees each unit a minimum bandwidth and maximum service latency. All unused bandwidth is allocated to the TM-1 CPU.
`
The bus allocation mechanism is one of the features of TM-1 that makes it a true real-time system instead of just a highly integrated microprocessor with unusual peripherals.
`
`4.3 Memory and Cache Units
TM-1's memory hierarchy satisfies the low cost and high bandwidth requirements of multimedia markets. Since multimedia video streams can require relatively large temporary storage, a significant amount of DRAM is required.
`
TM-1 has a glueless interface with synchronous DRAM (SDRAM) or synchronous graphics RAM (SGRAM), which provide higher bandwidth than standard DRAM. Because SDRAM is supported by the major DRAM vendors, competition among those vendors will keep the SDRAM price on par with that of standard DRAM. TM-1's DRAM memory size can range from 2 Mbytes to 64 Mbytes.
`
The TM-1 CPU core is supported by separate 16-KB data and 32-KB instruction caches. The data cache is dual-ported in order to allow two simultaneous load/store accesses, and both caches are eight-way set-associative with a 64-byte block size.
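
The stated sizes fix the number of sets in each cache. The short sketch below works through that arithmetic; the helper names and the address layout used by cache_set_index (block-offset bits lowest, set-index bits just above) are illustrative assumptions, as the text does not describe how addresses are mapped to sets.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sets = cache size / (associativity * block size). */
unsigned cache_sets(unsigned size_bytes, unsigned ways, unsigned block_bytes)
{
    assert(ways > 0 && block_bytes > 0);
    return size_bytes / (ways * block_bytes);
}

/* Set selected by an address, assuming the usual layout in which the
 * block-offset bits are lowest and the set-index bits sit just above. */
unsigned cache_set_index(uint32_t addr, unsigned sets, unsigned block_bytes)
{
    return (unsigned)((addr / block_bytes) % sets);
}
```

With the figures from the text, cache_sets(16 * 1024, 8, 64) gives 32 sets for the data cache and cache_sets(32 * 1024, 8, 64) gives 64 sets for the instruction cache.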
`
`4.4 Video-In Unit
`
The video-in unit interfaces directly to any CCIR 601/656-compliant device that outputs eight-bit parallel, 4:2:2 time-multiplexed data. Such devices include direct digital camera systems, which can connect gluelessly to TM-1 or through the standard CCIR 656 connector with only the addition of ECL level converters. Non-CCIR-compliant devices can use a digital decoder chip, such as the Philips SAA7111, to interface to TM-1. Older front ends with a 16-bit interface can connect with a small amount of glue logic.
`
The video-in unit demultiplexes the captured YUV data before writing it into local TM-1 SDRAM. Separate data structures are maintained for Y, U, and V.
`
The video-in unit can be programmed to perform on-the-fly horizontal resolution subsampling by a factor of two if needed. Many camera systems capture a 640-pixel/line or 720-pixel/line image; with subsampling, direct conversion to a 320-pixel/line or a 360-pixel/line image can be performed with no CPU intervention. Further, if subsampling is required eventually, performing this function during data capture reduces initial storage requirements.
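
As a rough software picture of factor-of-two horizontal subsampling, the sketch below halves one line of samples. The text does not say how the video-in unit combines samples, so averaging each adjacent pair (with rounding) is an assumption made for illustration, as is the function name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Halve the horizontal resolution of one line of samples.
 * Assumption: each output sample is the rounded average of two adjacent
 * input samples; the hardware's actual filter is not given in the text. */
void subsample_by_two(const uint8_t *in, size_t in_pixels, uint8_t *out)
{
    assert(in_pixels % 2 == 0);
    for (size_t i = 0; i < in_pixels; i += 2)
        out[i / 2] = (uint8_t)((in[i] + in[i + 1] + 1) / 2);
}
```

A 640-sample line becomes a 320-sample line, so the captured frame needs half the storage per line.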
`
`4.5 Video-Out Unit
`
The video-out unit essentially performs the inverse function of the video-in unit. Video-out generates an eight-bit, multiplexed YUV data stream by gathering bits from the separate Y, U, and V data structures in SDRAM. While generating the multiplexed stream, the video-out unit can also up-scale horizontally by a factor of two to convert from CIF to native CCIR resolution.
`
Since the video-out unit likely drives a separate video monitor, not the PC's video screen, the PC itself cannot be used to generate the graphics and text of a user interface. To remedy this, the video-out unit can generate graphics overlays in a limited number of configurations.
`
4.6 Image Coprocessor (ICP)
The image coprocessor (ICP) is used for several purposes to off-load tasks from the TM-1 CPU, such as copying an image from SDRAM to the host's video frame buffer. Although these tasks can be easily performed by the CPU, they are a poor use of the relatively expensive CPU resource. When performed in parallel by the ICP, these tasks are performed efficiently by simple hardware, which allows the CPU to continue with more complex tasks.
`
`
The ICP can operate as either a memory-to-memory or a memory-to-PCI coprocessor device.

In memory-to-memory mode, the ICP can perform either horizontal or vertical image filtering and resizing. The ICP implements 32 FIR filters of five adjacent pixel input values. The filter coefficients are fully programmable, and the position of the output pixel in the output raster determines which of the 32 filters is applied to generate that output pixel value. Thus, the output raster is on a 32-times finer grid than the input raster. The filtering is done in either the horizontal or vertical direction but not both. Two applications of the ICP are required to filter and scale in both directions.
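
The filter selection described above amounts to a polyphase resampler: the fractional part of the output position (in 1/32ths of an input pixel) picks one of 32 five-tap filters. The sketch below illustrates that selection in one dimension. Only the 32 phases and five taps come from the text; the coefficient values (plain linear interpolation), the fixed-point scale of 32, the edge clamping, and the step32 parameter are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PHASES 32   /* one filter per 1/32-pixel output position */
#define TAPS   5    /* five adjacent input pixels per output pixel */

/* Hypothetical coefficient table: linear interpolation between the centre
 * pixel and its right neighbour. Each row sums to 32. */
void make_linear_coefs(int16_t coef[PHASES][TAPS])
{
    for (int p = 0; p < PHASES; p++) {
        coef[p][0] = 0;
        coef[p][1] = 0;
        coef[p][2] = (int16_t)(32 - p);
        coef[p][3] = (int16_t)p;
        coef[p][4] = 0;
    }
}

/* Clamp an index to the valid range so taps near the edges reuse the
 * border pixel (an assumption about edge handling). */
static size_t clamp_index(long i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* Resample one line. 'step32' is the spacing of output pixels in 1/32ths
 * of an input pixel: 32 copies, 64 halves the width, 16 doubles it. */
void resample_line(const uint8_t *in, size_t in_n, uint8_t *out, size_t out_n,
                   unsigned step32, int16_t coef[PHASES][TAPS])
{
    assert(in_n > 0);
    for (size_t k = 0; k < out_n; k++) {
        unsigned long pos = (unsigned long)k * step32;
        long centre = (long)(pos / PHASES);          /* whole input pixel */
        unsigned phase = (unsigned)(pos % PHASES);   /* selects the filter */
        long acc = 0;
        for (int t = 0; t < TAPS; t++)
            acc += coef[phase][t] * in[clamp_index(centre + t - 2, in_n)];
        acc = (acc + 16) / 32;                       /* rows sum to 32 here */
        out[k] = (uint8_t)(acc < 0 ? 0 : acc > 255 ? 255 : acc);
    }
}
```

Filtering in both directions would call such a routine twice, once along rows and once along columns, matching the two ICP passes the text describes.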
`
In memory-to-PCI mode, the ICP can perform horizontal resizing followed by color-space conversion. For example, assume an n x m pixel array is to be displayed in a window on the PC video screen while the PC is running a graphical user interface. The first step (if necessary) would use the ICP in memory-to-memory mode to perform a vertical resizing. The second step would use the ICP in memory-to-PCI mode to perform a horizontal resizing (if necessary) and colorspace conversion from YUV to RGB.
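
For readers unfamiliar with the conversion step, the sketch below converts a single pixel from YUV to RGB in integer arithmetic. The coefficients are a widely used fixed-point approximation for CCIR 601 video with a 16-235 luma range; they are an assumption here, since the text does not give the values the ICP uses, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of an 8-bit channel. */
static uint8_t clip8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Convert one YUV pixel to RGB using common CCIR 601 integer coefficients
 * scaled by 256. These values are assumed, not taken from the text. */
void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = (int)y - 16;    /* luma offset */
    int d = (int)u - 128;   /* chroma values are centred on 128 */
    int e = (int)v - 128;
    *r = clip8((298 * c + 409 * e + 128) / 256);
    *g = clip8((298 * c - 100 * d - 208 * e + 128) / 256);
    *b = clip8((298 * c + 516 * d + 128) / 256);
}
```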
`
While sending the final, resampled and converted pixels over the PCI bus to the video frame buffer, the ICP uses a full, per-pixel occlusion bit mask (accessed in destination coordinates) to determine which pixels are actually stored in the frame buffer for display. Conditioning the transfer with the bit mask allows TM-1 to accommodate an arbitrary arrangement of overlapping windows on the PC video screen.
`
Figure 4 illustrates a possible display situation and the data structures in SDRAM that support the ICP's operation. On the left in Figure 4, the PC's video screen has four overlapping windows. Two, Image 1 and Image 2, are being used to display video generated by TM-1.

The right side of Figure 4 shows a conceptual view of SDRAM contents. Two data structures are present, one for Image 1 and the other for Image 2. Figure 4 represents a point in time during which the ICP is displaying one of the two images.
`
When the ICP is displaying an image (i.e., copying it from SDRAM to a frame buffer), it maintains four pointers to the data structures in SDRAM. Three pointers locate the Y, U, and V data arrays, and the fourth locates the per-pixel occlusion bit map. The Y, U, and V arrays are indexed by source coordinates while the occlusion bit map is accessed with screen coordinates.
`
As the ICP generates pixels for display, it performs horizontal scaling and colorspace conversion. The final RGB pixel value is then copied to the destination address in the screen's frame buffer only if the corresponding bit in the occlusion bitmap is a one.
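
The conditional store can be sketched in a few lines. This is a software illustration only: the function name, the 32-bit pixel type, and the bit layout of the mask (least significant bit first within each byte) are assumptions. The one behaviour taken from the text is that a pixel is written only where the mask bit is one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy one row of converted pixels into the frame buffer, storing a pixel
 * only when its bit in the occlusion map is 1.
 * Assumed layout: pixel x uses bit (x % 8) of byte (x / 8), LSB first.
 * 'dst' and 'mask_row' are both addressed in screen coordinates. */
void masked_row_copy(const uint32_t *src, uint32_t *dst, size_t width,
                     const uint8_t *mask_row)
{
    assert(src != NULL && dst != NULL && mask_row != NULL);
    for (size_t x = 0; x < width; x++) {
        if ((mask_row[x / 8] >> (x % 8)) & 1u)
            dst[x] = src[x];   /* visible: store the new pixel */
        /* occluded: leave the frame buffer untouched */
    }
}
```

Pixels whose mask bit is zero keep whatever another window has already drawn there, which is how overlapping windows stay intact.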
`
As shown in the conceptual diagram, the occlusion bit map has a pattern of 1s and 0s that corresponds to the shape of the visible area of the destination window in the frame buffer. When the arrangement of windows on the PC screen is changed, modifications to the occlusion bit maps may be necessary.
`
It is important to note that there is no preset limit on the number and sizes of windows that can be handled by the ICP. The only limit is the available bandwidth. Thus, the ICP can handle a few large windows or many small windows. The ICP can sustain a transfer rate of 50 megapixels per second, which is more than enough to saturate PCI when transferring images to video frame buffers.

[Figure 4 panels: PC Screen; In SDRAM]

Figure 4. ICP operation. Windows on the PC screen and data structures in SDRAM for two live video windows.
`
The ICP has a micro-programmable engine. All ICP operations, such as filtering, scaling and color space conversion, and their formats are programmable. The ICP loads its microprogram from SDRAM.
`
4.7 Variable-Length Decoder (VLD)
The variable-length decoder (VLD) is included to relieve the TM-1 CPU of the task of decoding Huffman-encoded video data streams. It can be used to help decode MPEG-1 and MPEG-2 video streams.
`
The VLD is a memory-to-memory coprocessor. The TM-1 CPU hands the VLD a pointer to a Huffman-encoded bit stream, and the VLD produces a tokenized bit stream that is very convenient for the TM-1 image decompression software to use. The format of the output token stream is optimized for the MPEG-2 decompression software so that communication between the CPU and VLD is minimized.
`
As with the other processing-intensive coprocessors, the VLD is included mainly to relieve the CPU of a task that wastes its performance potential. When dealing with the high bit rates of MPEG-2 data streams, too much of the CPU's time would otherwise be devoted to this task, which prevents its special capabilities from being used.
`
`4.8 Audio-In and Audio-Out Units
The audio-in and audio-out units are similar to the video units. They connect to most serial ADC and DAC chips, and are programmable enough to handle most reasonable protocols. These units can transfer MSB or LSB first and left or right channel first.
`
The sampling clock is driven by TM-1 and is software programmable within a wide range from DC to 80 kHz with a resolution of 0.02 Hz. The clock circuit allows the programmer subtle control over the sampling frequency so that audio and video synchronization can be achieved in any system configuration. When changing the frequency, the instantaneous phase does not change, which allows frequency manipulation without introducing distortion.
`
As with the video units, the audio-in and audio-out units buffer incoming and outgoing audio data in SDRAM. The audio-in unit buffers samples in either eight- or 16-bit format, mono or stereo. The audio-out unit simply transfers sample data from memory to the external DAC; any manipulation of sound data is performed by the TM-1 CPU since this processing will require at most a few percent of the CPU resource.
`
4.9 PCI Bus Interface Unit (BIU)
`
This unit connects the internal Data Highway Bus to an external PCI bus. It has a PCI master to initiate memory read/write cycles for TM-1-CPU requested read/write transactions, including burst read/write DMA transactions. The PCI target within the BIU responds to the transactions initiated by external PCI master devices to read/write the TM-1's memory space, and it satisfies their requests. External devices can access the TM-1's MMIO registers through this unit.
`
The ICP unit has a direct connection to the BIU unit in order to transfer the pixel image data efficiently from TM-1 to the graphics device or host memory through the PCI bus.
`
The DMA transactions are considered background transactions. To reduce the latency of single-word read/write transactions on the PCI bus, the BIU interleaves the burst read/write DMA cycles with single-word read/write transactions.
`
`5.0 CUSTOM OPERATIONS
`
Custom operations in the TM-1 CPU architecture are specialized, high-function operations designed to dramatically improve performance in important multimedia applications. Custom operations enable an application to take advantage of the high performance VLIW-CPU core.
`
Important multimedia applications, such as the decompression of MPEG video streams, spend significant amounts of execution time dealing with eight-bit data items. Using 32-bit operations to manipulate small data items makes inefficient use of 32-bit execution hardware in the implementation. There are custom operations designed to operate on four eight-bit data items simultaneously in order to improve the performance about four to ten times compared with that of the general purpose CPU. Furthermore, some custom operations are defined to combine multiple arithmetic and control instructions into a single custom operation. These custom operations can be used easily in the C language as function calls. Custom operation syntax is consistent with the C programming language, and just as with all other operations generated by the compiler, the scheduler takes care of register allocation, operation packing, and flow analysis.

    unsigned char A[16][16];
    unsigned char B[16][16];

    for (row = 0; row < 16; row += 1)
    {
        for (col = 0; col < 16; col += 1)
            cost += abs(A[row][col] - B[row][col]);
    }

Figure 5. Match-cost loop for MPEG motion estimation.
`
Multimedia application development has been additionally improved by providing hand-coded and well-tuned multimedia code in the form of C library functions.
`
5.1 Example: Motion-Estimation Kernel
One part of the MPEG coding algorithm is motion estimation. The purpose of motion estimation is to reduce the cost of storing a frame of video by expressing the contents of the frame in terms of adjacent frames.
`
A given frame is reduced to small blocks, and a subsequent frame is represented by specifying how these small blocks change position and appearance; usually, storing the difference information is less expensive than storing a whole block. For example, in a video sequence in which the camera pans across a static scene, some frames can be expressed simply as displaced versions of their predecessor frames. To create a subsequent frame, most blocks are simply displaced relative to the output screen.

The code in this example is for a match-cost calculation, a small kernel of the complete motion-estimation code. This code provides an excellent example of how to transform source code in order to make the best use of TM-1's custom operations.
`
Figure 5 shows the original source code for the match-cost loop. The code is not a self-contained function. At some location early in the code, the arrays A[][] and B[][] are declared; at some location between those declarations and the loop of interest, the arrays are filled with data.
We start by noticing that the computation in the loop of Figure 5 involves the absolute value of the difference of two unsigned characters (bytes). The TM-1 operation set includes several operations that process all four bytes in a 32-bit word simultaneously. Since the match-cost calculation is fundamental to the MPEG algorithm, it is not surprising to find a custom operation, ume8uu, that implements this operation exactly. The definition of the ume8uu operation is shown in Figure 8.
`
If we hope to use a custom operation that processes four pixel values simultaneously, we first need to create four parallel pixel computations. Also, to use the ume8uu operation, the code must access the arrays with 32-bit word pointers instead of with 8-bit byte pointers.
`
Figure 6 shows a parallel version of the code from Figure 5. By unrolling the loop and simply giving each computation its own cost variable and then summing the costs all at once, each cost computation is completely independent.
`
Figure 7 shows the loop recoded to access A[][] and B[][] as one-dimensional instead of as two-dimensional arrays. We take advantage of our knowledge of C-language array storage conventions in order to perform this code transformation. Recoding to use one-dimensional arrays prepares the code for the transformation to 32-bit array accesses.

Figure 7 also shows the loop of Figure 6 recoded to use ume8uu. Once again taking advantage of our knowledge of the C-language array storage conventions, the one-dimensional byte array is now accessed as a one-dimensional 32-bit-word array.
`
`Of course, since we are now using one-dimensional ar-
`rays to access the pixel data, it is natural to use a single
`‘for’ loop instead of two. Figure 9 shows this streamlined
`version of the code without the inner loop. Since C-lan-
`guage arrays are stored as a linear vector of values, we
`can simply increase the number of iterations of the outer
`loop from 16 to 64 to traverse the entire array.
`
The recoding and use of the ume8uu operation has resulted in a substantial improvement in the performance of the match-cost loop. In the original version, the code executed 1280 operations (256 iterations, each with two loads, a subtract, an absolute value, and an add); in the restructured version, there are only 256 operations: 128 loads, 64 ume8uu operations, and 64 additions. This is a factor of five reduction in the number of operations executed. Also, the overhead of the inner loop has been eliminated, further increasing the performance advantage.
`
    unsigned char A[16][16];
    unsigned char B[16][16];

    for (row = 0; row < 16; row += 1)
    {
        for (col = 0; col < 16; col += 4)
        {
            cost0 = abs(A[row][col+0] - B[row][col+0]);
            cost1 = abs(A[row][col+1] - B[row][col+1]);
            cost2 = abs(A[row][col+2] - B[row][col+2]);
            cost3 = abs(A[row][col+3] - B[row][col+3]);
            cost += cost0 + cost1 + cost2 + cost3;
        }
    }

Figure 6. Unrolled and Parallel version of Figure 5.

    unsigned char A[16][16];
    unsigned char B[16][16];

    unsigned int *IA = (unsigned int *) A;
    unsigned int *IB = (unsigned int *) B;
    for (row = 0, rowoffset = 0; row < 16; row += 1, rowoffset += 4)
    {
        for (col4 = 0; col4 < 4; col4 += 1)
            cost += UME8UU(IA[rowoffset + col4], IB[rowoffset + col4]);
    }

Figure 7. Using the custom operation ume8uu to speedup the loop of Figure 6 resulted in a performance speedup of about
    ume8uu
    Sum of absolute values of unsigned 8-bit differences

    'C' function prototype:
        unsigned int ume8uu(unsigned int a, unsigned int b);

    Function of ume8uu:
        abs(zero_extto32(a<31:24>) - zero_extto32(b<31:24>)) +
        abs(zero_extto32(a<23:16>) - zero_extto32(b<23:16>)) +
        abs(zero_extto32(a<15:8>) - zero_extto32(b<15:8>)) +
        abs(zero_extto32(a<7:0>) - zero_extto32(b<7:0>))

Figure 8. Custom Operation ume8uu

    unsigned char A[16][16];
    unsigned char B[16][16];

    unsigned int *IA = (unsigned int *) A;
    unsigned int *IB = (unsigned int *) B;
    for (i = 0; i < 64; i += 1)
        cost += UME8UU(IA[i], IB[i]);

Figure 9. The loop of Figure 7 with the inner loop eliminated.
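
To check the transformation on an ordinary host compiler, the sketch below supplies a portable stand-in for ume8uu that follows the definition in Figure 8 and compares the byte-wise loop of Figure 5 with the word-wise loop of Figure 9. The names ume8uu_ref, match_cost_bytes and match_cost_words are hypothetical. The word loads use memcpy rather than the pointer casts shown in the figures, a substitution made here to stay within standard C aliasing rules.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Portable stand-in for the custom operation, following Figure 8:
 * the sum of absolute differences of the four byte lanes. */
uint32_t ume8uu_ref(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int da = (int)((a >> shift) & 0xFF);
        int db = (int)((b >> shift) & 0xFF);
        sum += (uint32_t)abs(da - db);
    }
    return sum;
}

/* Original match-cost loop (Figure 5): 256 byte-wide steps. */
unsigned match_cost_bytes(uint8_t A[16][16], uint8_t B[16][16])
{
    unsigned cost = 0;
    for (int row = 0; row < 16; row++)
        for (int col = 0; col < 16; col++)
            cost += (unsigned)abs(A[row][col] - B[row][col]);
    return cost;
}

/* Restructured loop (Figure 9): 64 word-wide steps over the same bytes. */
unsigned match_cost_words(uint8_t A[16][16], uint8_t B[16][16])
{
    const unsigned char *pa = (const unsigned char *)A;
    const unsigned char *pb = (const unsigned char *)B;
    unsigned cost = 0;
    for (int i = 0; i < 64; i++) {
        uint32_t wa, wb;
        memcpy(&wa, pa + 4 * i, sizeof wa);   /* load four pixels of A */
        memcpy(&wb, pb + 4 * i, sizeof wb);   /* load four pixels of B */
        cost += ume8uu_ref(wa, wb);
    }
    return cost;
}
```

Because the lanes are summed independently, the result does not depend on the host's byte order, so the two loops agree on any platform.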
`
`6.0 APPLICATIONS
`
`8.0 REFERENCES
`
TM-1 has the potential to be used in many multimedia applications; only a few of them are discussed here.
`
6.1 Video Teleconferencing/Digital White Board
`
Businesses are increasingly turning towards interactive computing as a means of becoming more efficient. Collaborative computing, for instance, involves sharing applications amongst multiple personal computers and multipoint video teleconferencing.
`
TM-1 is a single chip video teleconferencing solution that runs all current video codecs across all common transport mechanisms. This may also include H.324 (POTS), H.320 (ISDN) and H.323 (LAN).
`
6.2 Multimedia Card for Consumer Multimedia Applications
`The achievement of true computer based realism is
`only