Petitioners HTC and LG - Exhibit 1005, p. cover-1
`HTC and LG v. PUMA, IPR2015-01501
`An Architectural Overview of the Programmable
`Multimedia Processor, TM-1
`
`Selliah Rathnam, Gert Slavenburg
`
`Philips Semiconductors
`811 E. Arques Avenue, Sunnyvale, CA 94088
`
`ABSTRACT
`
TM-1 is the first in a family of programmable multimedia processors
from the Trimedia product group of Philips Semiconductors. This "C"
programmable processor has a high performance VLIW-CPU core with video
and audio peripheral units designed to support the popular multimedia
applications. TM-1 is designed to concurrently process video, audio,
graphics, and communication data. The VLIW-CPU core is capable of
executing a maximum of twenty seven operations per cycle, and the
sustained execution rate is about five operations per cycle for the
tuned applications. The audio unit easily handles different audio
formats including the 16-bit stereo format. The video unit is capable
of processing different YUV and RGB pixel formats with horizontal and
vertical scaling and color space conversion. TM-1 applications can
range from low-cost, stand alone systems such as video phones to
programmable, multipurpose plug-in cards for traditional computers.
`
1.0 INTRODUCTION
`
TM-1 is a building-block for high-performance multimedia applications
that deal with high-quality video and audio. TM-1 easily implements
popular multimedia standards such as MPEG-1 and MPEG-2, but its
orientation around a powerful general-purpose CPU makes it capable of
implementing a variety of multimedia algorithms, whether open or
proprietary.

More than just an integrated microprocessor with unusual peripherals,
the TM-1 microprocessor is a fluid
`
[Figure 1. TM-1 block diagram. Recoverable block labels: Main Memory
Interface; VLD coprocessor (Huffman decoder, slice-at-a-time, MPEG-1
& 2); Video In (CCIR601/656, YUV 4:2:2); Video Out (CCIR601/656, YUV
4:2:2); Audio In and Audio Out (I2S, DC-80 kHz, stereo digital audio);
I2C Interface (I2C bus to camera, etc.); V.34 or ISDN front end;
Image Coprocessor (down and up scaling, YUV to RGB).]
[...]/96 $5.00 © 1996 IEEE
Proceedings of COMPCON '96
`
[Figure 2 diagram. Recoverable labels: CCIR601/656 YUV 4:2:2; Stereo
Audio Out; V.34 Modem Front End.]

Figure 2. TM-1 system connections. A minimal TM-1 system requires few
supporting components.
`
computer system controlled by a small real-time OS kernel that runs
on the VLIW processor core. TM-1 contains a CPU, a high-bandwidth
internal bus, and internal bus-mastering DMA peripherals.

TM-1 is the first member of a family of chips that will carry
investments in software forward in time. Compatibility between family
members is at the source-code level; binary compatibility between
family members is not guaranteed. All family members, however, will be
able to perform the most important multimedia functions, such as
running MPEG-2 software.

Defining software compatibility at the source-code level gives Philips
the freedom to strike the optimum balance between cost and performance
for all the chips in the TM-1 family. Powerful compilers ensure that
programmers seldom need to resort to non-portable assembler
programming. Programmers use TM-1's powerful low-level operations from
source code; these DSP-like operations are invoked with a familiar
function-call syntax. Trimedia also provides hand-coded and tuned
multimedia libraries which can be used to increase the performance of
multimedia applications.
`
As the first member of the family, TM-1 is tailored for use in
PC-based applications. Because it is based on a general-purpose CPU,
TM-1 can serve as a multi-function PC enhancement vehicle. Typically,
a PC must deal with multi-standard video and audio streams, and users
desire both decompression and compression, if possible. While the CPU
chips used in PCs are becoming capable of low-resolution real-time
video decompression, high-quality video decompression (not to mention
compression) is still out of reach. Further, users demand that their
systems provide live video and audio without sacrificing the
responsiveness of the system.
`
TM-1 enhances a PC system to provide real-time multimedia, and it does
so with the advantages of a special-purpose, embedded solution (low
cost and chip count) and the advantages of a general-purpose processor
(reprogrammability). For PC applications, TM-1 far surpasses the
capabilities of fixed-function multimedia chips.
`
Other Trimedia family members will have different sets of interfaces
appropriate for their intended use. For example, a TM-1 chip for a
cable-TV decoder box would eliminate the video-in interface.
`
2.0 TM-1 CHIP OVERVIEW
`
The key features of TM-1 are:

- A very powerful, general-purpose VLIW processor core that
  coordinates all on-chip activities. In addition to implementing the
  non-trivial parts of multimedia algorithms, this processor runs a
  small real-time operating system that is driven by interrupts from
  the other units.
- DMA-driven multimedia input/output units that operate independently
  and that properly format data to make processing efficient.
- DMA-driven multimedia coprocessors that operate independently and
  perform operations specific to important multimedia algorithms.
- A high-performance bus and memory system that provides
  communication between TM-1's processing units.
`
Figure 1 shows a block diagram of the TM-1 chip. The bulk of a TM-1
system consists of the TM-1 microprocessor itself, a block of
synchronous DRAM (SDRAM), and minimal external circuitry to interface
to the incoming and/or outgoing multimedia data streams. TM-1 can
gluelessly interface to the standard PCI bus for personal-computer-based
applications; thus, TM-1 can be placed directly on the PC mainboard or
on a plug-in card.
`
Figure 2 shows a possible TM-1 system application. A video-input
stream, if present, might come directly from a CCIR 601-compliant
digital video camera chip in YUV 4:2:2 format; the interface is
glueless in this case. A non-standard camera chip can be connected via
a video decoder chip (such as the Philips SAA7111). A CCIR 601 output
video stream is provided directly from the TM-1 to drive a dedicated
video monitor. Stereo audio input and output require external ADC and
DAC support. The operation of the video and audio interface units is
highly customizable through programmable parameters.
`
The glueless PCI interface allows the TM-1 to display video via a host
PC's video card and to play audio via a host PC's sound hardware. The
Image Coprocessor provides display support for live video in an
arbitrary number of arbitrarily overlapped windows.
`
`Finally, the V.34 interface requires only an external
`modem front-end chip and phone line interface to pro-
`vide remote communication support. The modem can be
`used to connect TM-1-based systems for video phone or
`video conferencing applications, or it can be used for
general-purpose data communication in PC systems.
`
`3.0 BRIEF EXAMPLES OF OPERATION
`
The key to understanding TM-1 operation is observing that the CPU and
peripherals are time-shared and that communication between units is
through SDRAM memory. The CPU switches from one task to the next;
first it decompresses a video frame, then it decompresses a slice of
the audio stream, then back to video, etc. As necessary, the CPU
issues commands to the peripheral units to orchestrate their
operation.
`
The TM-1 CPU can enlist the ICP and video-in units to help with some
of the straightforward, tedious tasks associated with video
processing. The function of these units is programmable. For example,
some video streams are (or need to be) scaled horizontally, so these
units can handle the most common cases of horizontal down- and
up-scaling without intervention from the TM-1 CPU.

3.1 Video Decompression in a PC

A typical mode of operation for a TM-1 system is to serve as a
video-decompression engine on a PCI card in a PC. In this case, the PC
doesn't know the TM-1 has a powerful, general-purpose CPU; rather, the
PC just treats the hardware on the PCI card as a "black-box" engine.

Video decompression begins when the PC operating system hands the TM-1
a pointer to compressed video data in the PC's memory (the details of
the communication protocol are typically handled by a software driver
installed in the PC's operating system).

The TM-1 CPU fetches data from the compressed video stream via the PCI
bus, decompresses frames from the video stream, and places them into
local SDRAM. Decompression may be aided by the VLD (variable-length
decoder) unit, which implements Huffman decoding and is controlled by
the TM-1 CPU.

When a frame is ready for display, the TM-1 CPU gives the ICP (image
coprocessor) a display command. The ICP then autonomously fetches the
decompressed frame data from SDRAM and transfers it over the PCI bus
to the frame buffer in the PC's video display card (or the frame
buffer in PC system memory if the PC uses a UMA (Unified Memory
Architecture) frame buffer). The ICP accommodates arbitrary window
size, position, and overlaps.

3.2 Video Compression

Another typical application for TM-1 is in video compression. In this
case, uncompressed video is usually supplied directly to the TM-1
system via the video-in unit. A camera chip connected directly to the
video-in unit supplies YUV data in eight-bit, 4:2:2 format. The
video-in unit takes care of sampling the data from the camera chip and
demultiplexing the raw video to SDRAM in three separate areas, one
each for Y, U, and V.

When a complete video frame has been read from the camera chip by the
video-in unit, it interrupts the TM-1 CPU. The CPU compresses the
video data in software (using a set of powerful data-parallel
operations) and writes the compressed data to a separate area of
SDRAM.

The compressed video data can now be disposed of in any of several
ways. It can be sent to a host system over the PCI bus for archival on
local mass storage, or the host can transfer the compressed video over
a network, such as ISDN. The data can also be sent to a remote system
using the integrated V.34 interface to create, for example, a video
phone or video conferencing system.

Since the powerful, general-purpose TM-1 CPU is available, the
compressed data can be encrypted before being transferred for
security.

4.0 VLIW CORE AND PERIPHERAL UNITS

4.1 VLIW Processor Core

The heart of TM-1 is its powerful 32-bit CPU core. The CPU implements
a 32-bit linear address space and 128 fully general-purpose 32-bit
registers. The registers are not separated into banks; any operation
can use any register for any operand.

The core uses a VLIW instruction-set architecture and is fully
general-purpose. TM-1 uses a VLIW instruction length that allows up to
five simultaneous operations to be issued. These operations can target
any five of the 27 functional units in the CPU, including integer and
floating-point arithmetic units and data-parallel DSP-like units.

[Figure 3 diagram, top to bottom: Instruction Cache (32 KB); Instr.
Fetch Buffer; Decompression Hardware; Issue Register (5 Ops);
Operation Routing Network; Execution Unit (27 Functions); Register
Routing and Forwarding Network; Register File (128 x 32).]

Figure 3. VLIW Processor Core and Instruction Cache.
`
Although the processor core runs a tiny real-time operating system to
coordinate all activities in the TM-1 system, the processor core is
not intended for true general-purpose use as the only CPU in a
computer system. For example, the processor core does not implement
virtual memory address translation, an essential feature in a
general-purpose computer system.
`
TM-1 uses a VLIW architecture to maximize processor throughput at the
lowest possible cost. VLIW architectures have performance exceeding
that of superscalar general-purpose CPUs without the extreme
complexity of a superscalar implementation. The hardware saved by
eliminating superscalar logic reduces cost and allows the integration
of multimedia-specific features that enhance the power of the
processor core.
`
The TM-1 operation set includes all traditional microprocessor
operations. In addition, multimedia-specific operations are included
that dramatically accelerate standard video compression and
decompression algorithms. As just one of the five operations issued in
a single TM-1 instruction, a single special or "custom" operation can
implement up to 11 traditional microprocessor operations.
Multimedia-specific operations combined with the VLIW architecture
result in tremendous throughput for multimedia applications.
`
4.2 Internal "Data Highway" Bus

The internal data bus connects all internal blocks together and
provides access to internal control registers (in each on-chip
peripheral unit), external SDRAM, and the external PCI bus. The
internal bus consists of separate 32-bit data and address buses, and
transactions on the bus use a block-transfer protocol. Peripherals can
be masters or slaves on the bus.
`
Access to the internal bus is controlled by a central arbiter, which
has a request line from each potential bus master. The arbiter is
configurable in a number of different modes so that the arbitration
algorithm can be tailored for different applications. Peripheral units
make requests to the arbiter for bus access, and depending on the
arbitration mode, bus bandwidth is allocated to the units in different
amounts. Each mode allocates bandwidth differently, but each mode
guarantees each unit a minimum bandwidth and maximum service latency.
All unused bandwidth is allocated to the TM-1 CPU.
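The allocation rule described above can be sketched as a small model. This is only an illustration under stated assumptions: the `unit_demand` structure, the `allocate_bus` function, and the abstract bandwidth units are hypothetical and are not part of TM-1's hardware or software interface. The model assumes that, in a given mode, each unit consumes at most its guaranteed share and that whatever is left over goes to the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-unit entry for one arbitration mode. Both fields are
 * in the same abstract bandwidth units per arbitration period. */
typedef struct {
    unsigned guaranteed; /* minimum bandwidth the mode guarantees the unit */
    unsigned requested;  /* bandwidth the unit actually asks for           */
} unit_demand;

/* Grants each unit the smaller of its request and its guarantee, stores
 * the grants in granted[], and returns the bandwidth left for the CPU. */
unsigned allocate_bus(const unit_demand *units, size_t n,
                      unsigned total, unsigned *granted)
{
    unsigned used = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned g = units[i].requested < units[i].guaranteed
                         ? units[i].requested
                         : units[i].guaranteed;
        granted[i] = g;
        used += g;
    }
    /* The guarantees of a mode must fit within the bus capacity. */
    assert(used <= total);
    return total - used; /* unused bandwidth goes to the CPU */
}
```

In this model an idle peripheral costs the CPU nothing, which matches the statement that all unused bandwidth is allocated to the TM-1 CPU.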
`
The bus allocation mechanism is one of the features of TM-1 that makes
it a true real-time system instead of just a highly integrated
microprocessor with unusual peripherals.
`
4.3 Memory and Cache Units

TM-1's memory hierarchy satisfies the low cost and high bandwidth
requirement of multimedia markets. Since multimedia video streams can
require relatively large temporary storage, a significant amount of
DRAM is required.
`
TM-1 has a glueless interface with synchronous DRAM (SDRAM) or
synchronous graphics RAM (SGRAM), which provide higher bandwidth than
standard DRAM. As SDRAM is supported by major DRAM vendors, the
competition among those vendors will keep the SDRAM price on par with
that of standard DRAM. TM-1's DRAM memory size can range from
2 Mbytes to 64 Mbytes.
`
The TM-1 CPU core is supported by separate 16-KB data and 32-KB
instruction caches. The data cache is dual-ported in order to allow
two simultaneous load/store accesses, and both caches are eight-way
set-associative with a 64-byte block size.
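The cache geometry implied by these figures follows from the usual set-associative relation, sets = capacity / (ways x block size). The helper below is a minimal sketch of that arithmetic; the function name `cache_sets` is hypothetical, and the paper itself does not state the number of sets.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets in a set-associative cache:
 * sets = capacity / (associativity * block size). */
size_t cache_sets(size_t capacity_bytes, size_t ways, size_t block_bytes)
{
    assert(ways > 0 && block_bytes > 0);
    return capacity_bytes / (ways * block_bytes);
}
```

With the stated parameters, `cache_sets(16 * 1024, 8, 64)` gives 32 sets for the data cache and `cache_sets(32 * 1024, 8, 64)` gives 64 sets for the instruction cache.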
`
4.4 Video-In Unit

The video-in unit interfaces directly to any CCIR 601/656-compliant
device that outputs eight-bit parallel, 4:2:2 time-multiplexed data.
Such devices include direct digital camera systems, which can connect
gluelessly to TM-1 or through the standard CCIR 656 connector with
only the addition of ECL level converters. Non-CCIR-compliant devices
can use a digital decoder chip, such as the Philips SAA7111, to
interface to TM-1. Older front ends with a 16-bit interface can
connect with a small amount of glue logic.
`
The video-in unit demultiplexes the captured YUV data before writing
it into local TM-1 SDRAM. Separate data structures are maintained for
Y, U, and V.
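What the demultiplexing amounts to can be shown in software. The sketch below is not the video-in hardware's actual behavior in detail: it assumes the common CCIR 656 byte order U0 Y0 V0 Y1 for each pair of pixels, and the function name `demux_yuv422` and the plane layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits an interleaved 8-bit 4:2:2 stream into separate Y, U and V
 * arrays. Assumed input order per pixel pair: U, Y0, V, Y1.
 * 'pixels' is the number of luma samples and must be even; y[] needs
 * 'pixels' entries, u[] and v[] need pixels / 2 entries each. */
void demux_yuv422(const uint8_t *in, size_t pixels,
                  uint8_t *y, uint8_t *u, uint8_t *v)
{
    assert(pixels % 2 == 0);
    for (size_t i = 0; i < pixels / 2; i++) {
        u[i]         = in[4 * i + 0];
        y[2 * i]     = in[4 * i + 1];
        v[i]         = in[4 * i + 2];
        y[2 * i + 1] = in[4 * i + 3];
    }
}
```

Keeping Y, U, and V in separate arrays lets later processing walk each component as a contiguous block instead of striding through interleaved bytes.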
`
The video-in unit can be programmed to perform on-the-fly horizontal
resolution subsampling by a factor of two if needed. Many camera
systems capture a 640-pixel/line or 720-pixel/line image; with
subsampling, direct conversion to a 320-pixel/line or a 360-pixel/line
image can be performed with no CPU intervention. Further, if
subsampling is required eventually, performing this function during
data capture reduces initial storage requirements.
`
4.5 Video-Out Unit

The video-out unit essentially performs the inverse function of the
video-in unit. Video-out generates an eight-bit, multiplexed YUV data
stream by gathering bits from the separate Y, U, and V data structures
in SDRAM. While generating the multiplexed stream, the video-out unit
can also up-scale horizontally by a factor of two to convert from CIF
to native CCIR resolution.
`
Since the video-out unit likely drives a separate video monitor (not
the PC's video screen), the PC itself cannot be used to generate the
graphics and text of a user interface. To remedy this, the video-out
unit can generate graphics overlays in a limited number of
configurations.
`
4.6 Image Coprocessor (ICP)

The image coprocessor (ICP) is used for several purposes to off-load
tasks from the TM-1 CPU, such as copying an image from SDRAM to the
host's video frame buffer. Although these tasks can be easily
performed by the CPU, they are a poor use of the relatively expensive
CPU resource. When performed in parallel by the ICP, these tasks are
performed efficiently by simple hardware, which allows the CPU to
continue with more complex tasks.
`
The ICP can operate as either a memory-to-memory or a memory-to-PCI
coprocessor device.
`
In memory-to-memory mode, the ICP can perform either horizontal or
vertical image filtering and resizing. The ICP implements 32 FIR
filters of five adjacent pixel input values. The filter coefficients
are fully programmable, and the position of the output pixel in the
output raster determines which of the 32 filters is applied to
generate that output pixel value. Thus, the output raster is on a
32-times finer grid than the input raster. The filtering is done in
either the horizontal or vertical direction but not both. Two
applications of the ICP are required to filter and scale in both
directions.
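The selection of one of 32 five-tap filters by output position is a polyphase resampling scheme, and a software sketch may make it concrete. Everything below is an assumption for illustration: the function name `resize_row`, the fixed-point coefficient format (256 represents 1.0), the edge clamping, and the rounding are not taken from the paper and need not match the ICP's microprogram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICP_PHASES 32 /* number of selectable filters (from the text)  */
#define ICP_TAPS   5  /* adjacent input pixels per filter (from text)  */

/* Resizes one row of 8-bit samples from in_w to out_w pixels.
 * coeff[phase][tap] is assumed to be fixed point with 8 fractional
 * bits. The output position, expressed on a grid 32 times finer than
 * the input raster, selects both the center pixel and the filter. */
void resize_row(const uint8_t *in, size_t in_w,
                uint8_t *out, size_t out_w,
                int16_t coeff[ICP_PHASES][ICP_TAPS])
{
    assert(in_w > 0 && out_w > 0);
    for (size_t x = 0; x < out_w; x++) {
        size_t pos    = (x * in_w * ICP_PHASES) / out_w; /* 1/32-pixel units */
        size_t center = pos / ICP_PHASES;
        size_t phase  = pos % ICP_PHASES;
        int32_t acc = 0;
        for (int t = 0; t < ICP_TAPS; t++) {
            /* Tap t reads pixel center + t - 2, clamped to the row. */
            long idx = (long)center + t - 2;
            if (idx < 0) idx = 0;
            if (idx >= (long)in_w) idx = (long)in_w - 1;
            acc += (int32_t)coeff[phase][t] * in[idx];
        }
        acc += 128;                /* round to nearest */
        if (acc < 0) acc = 0;      /* clamp before shifting */
        acc >>= 8;
        if (acc > 255) acc = 255;
        out[x] = (uint8_t)acc;
    }
}
```

Running the same routine over columns instead of rows gives the vertical pass, which mirrors the statement that two applications are needed to scale in both directions.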
`
In memory-to-PCI mode, the ICP can perform horizontal resizing
followed by color-space conversion. For example, assume an n x m pixel
array is to be displayed in a window on the PC video screen while the
PC is running a graphical user interface. The first step (if
necessary) would use the ICP in memory-to-memory mode to perform a
vertical resizing. The second step would use the ICP in memory-to-PCI
mode to perform a horizontal resizing (if necessary) and colorspace
conversion from YUV to RGB.
`
While sending the final, resampled and converted pixels over the PCI
bus to the video frame buffer, the ICP uses a full, per-pixel
occlusion bit mask (accessed in destination coordinates) to determine
which pixels are actually stored in the frame buffer for display.
Conditioning the transfer with the bit mask allows TM-1 to accommodate
an arbitrary arrangement of overlapping windows on the PC video
screen.
`
Figure 4 illustrates a possible display situation and the data
structures in SDRAM that support the ICP's operation. On the left in
Figure 4, the PC's video screen has four overlapping windows. Two,
Image 1 and Image 2, are being used to display video generated by
TM-1.

The right side of Figure 4 shows a conceptual view of SDRAM contents.
Two data structures are present, one for Image 1 and the other for
Image 2. Figure 4 represents a point in time during which the ICP is
displaying one of these images.
`
When the ICP is displaying an image (i.e., copying it from SDRAM to a
frame buffer), it maintains four pointers to the data structures in
SDRAM. Three pointers locate the Y, U, and V data arrays, and the
fourth locates the per-pixel occlusion bit map. The Y, U, and V arrays
are indexed by source coordinates while the occlusion bit map is
accessed with screen coordinates.
`
As the ICP generates pixels for display, it performs horizontal
scaling and colorspace conversion. The final RGB pixel value is then
copied to the destination address in the screen's frame buffer only if
the corresponding bit in the occlusion bit map is a one.
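The conditional copy can be sketched for a single scan line. This is an illustration under assumptions, not the ICP's implementation: the function name `masked_copy_row`, the 32-bit pixel type, and the MSB-first packing of mask bits are all hypothetical. Only the rule itself (write the pixel when its mask bit, addressed in screen coordinates, is one) comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies 'width' already-converted RGB pixels into one frame-buffer
 * row, starting at screen column dst_x. mask_row holds one bit per
 * screen pixel for this row (assumed MSB-first within each byte); a
 * pixel is written only where its bit is 1. Returns the number of
 * pixels actually written. */
size_t masked_copy_row(const uint32_t *rgb, uint32_t *fb_row,
                       const uint8_t *mask_row,
                       size_t dst_x, size_t width)
{
    assert(rgb != NULL && fb_row != NULL && mask_row != NULL);
    size_t written = 0;
    for (size_t i = 0; i < width; i++) {
        size_t sx = dst_x + i; /* screen (destination) x coordinate */
        int visible = (mask_row[sx / 8] >> (7 - sx % 8)) & 1;
        if (visible) {
            fb_row[sx] = rgb[i];
            written++;
        }
    }
    return written;
}
```

Because the mask is addressed by screen position rather than source position, the same source image can be redrawn under a new window arrangement by changing only the mask.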
`
As shown in the conceptual diagram, the occlusion bit map has a
pattern of 1s and 0s that corresponds to the shape of the visible area
of the destination window in the frame buffer. When the arrangement of
windows on the PC screen is changed, modifications to the occlusion
bit maps may be necessary.
`
It is important to note that there is no preset limit on the number
and sizes of windows that can be handled by the ICP. The only limit is
the available bandwidth. Thus, the ICP can handle a few large windows
or many small windows. The ICP can sustain a transfer rate of 50
megapixels per second, which is more than enough to saturate PCI when
transferring images to video frame buffers.

[Figure 4 diagram: PC screen (left) and data structures in SDRAM
(right).]

Figure 4. ICP operation. Windows on the PC screen and data structures
in SDRAM for two live video windows.
`
The ICP has a micro-programmable engine. All ICP operations, such as
filtering, scaling, and color space conversion, and their formats are
programmable. The ICP's microprogram is loaded from SDRAM.
`
4.7 Variable-Length Decoder (VLD)

The variable-length decoder (VLD) is included to relieve the TM-1 CPU
of the task of decoding Huffman-encoded video data streams. It can be
used to help decode MPEG-1 and MPEG-2 video streams.
`
The VLD is a memory-to-memory coprocessor. The TM-1 CPU hands the VLD
a pointer to a Huffman-encoded bit stream, and the VLD produces a
tokenized bit stream that is very convenient for the TM-1 image
decompression software to use. The format of the output token stream
is optimized for the MPEG-2 decompression software so that
communication between the CPU and VLD is minimized.
`
As with the other processing-intensive coprocessors, the VLD is
included mainly to relieve the CPU of a task that wastes its
performance potential. When dealing with the high bit rates of MPEG-2
data streams, too much of the CPU's time would otherwise be devoted to
this task, which would prevent its special capabilities from being
used.
`
4.8 Audio-In and Audio-Out Units

The audio-in and audio-out units are similar to the video units. They
connect to most serial ADC and DAC chips, and are programmable enough
to handle most reasonable protocols. These units can transfer MSB or
LSB first and left or right channel first.
`
The sampling clock is driven by TM-1 and is software programmable
within a wide range from DC to 80 kHz with a resolution of 0.02 Hz.
The clock circuit allows the programmer subtle control over the
sampling frequency so that audio and video synchronization can be
achieved in any system configuration. When changing the frequency, the
instantaneous phase does not change, which allows frequency
manipulation without introducing distortion.
`
As with the video units, the audio-in and audio-out units buffer
incoming and outgoing audio data in SDRAM. The audio-in unit buffers
samples in either eight- or 16-bit format, mono or stereo. The
audio-out unit simply transfers sample data from memory to the
external DAC; any manipulation of sound data is performed by the TM-1
CPU since this processing will require at most a few percent of the
CPU resource.
`
4.9 PCI Bus Interface Unit (BIU)

This unit connects the internal Data Highway Bus to an external PCI
bus. It has a PCI master to initiate memory read/write cycles for
TM-1-CPU requested read/write transactions including burst read/write
DMA transactions. The PCI target within the BIU responds to the
transactions initiated by external PCI master devices to read/write
the TM-1's memory space, and it satisfies their requests. External
devices can access the TM-1's MMIO registers through this unit.
`
The ICP unit has a direct connection to the BIU in order to transfer
the pixel image data efficiently from TM-1 to the graphics device or
host memory through the PCI bus.
`
The DMA transactions are considered background transactions. To reduce
the latency of the single word read/write transactions on the PCI bus,
the BIU interleaves the burst read/write DMA cycles with single word
read/write transactions.
`
`5.0 CUSTOM OPERATIONS
`
Custom operations in the TM-1 CPU architecture are specialized,
high-function operations designed to dramatically improve performance
in important multimedia applications. Custom operations enable an
application to take advantage of the high performance VLIW-CPU core.
`
Important multimedia applications, such as the decompression of MPEG
video streams, spend significant amounts of execution time dealing
with eight-bit data items. Using 32-bit operations to manipulate small
data items makes inefficient use of 32-bit execution hardware in the
implementation. There are custom operations designed to operate on
four eight-bit data items simultaneously in order to improve the
performance about four to ten times compared with that of the general
purpose CPU. Furthermore, some custom operations are defined to
combine multiple arithmetic and control instructions into a single
custom operation. These custom operations can be used easily in the C
language as function calls. Custom operation syntax is consistent with
the C programming language, and just as with all other operations
generated by the compiler, the scheduler takes care of register
allocation, operation packing, and flow analysis.

    unsigned char A[16][16];
    unsigned char B[16][16];

    for (row = 0; row < 16; row += 1)
    {
        for (col = 0; col < 16; col += 1)
            cost += abs(A[row][col] - B[row][col]);
    }

Figure 5. Match-cost loop for MPEG motion estimation.
`
Multimedia application development has been additionally improved by
providing hand coded and well tuned multimedia code in the form of C
library functions.
`
5.1 Example: Motion-Estimation Kernel

One part of the MPEG coding algorithm is motion estimation. The
purpose of motion estimation is to reduce the cost of storing a frame
of video by expressing the contents of the frame in terms of adjacent
frames.
`
A given frame is reduced to small blocks, and a subsequent frame is
represented by specifying how these small blocks change position and
appearance; usually, storing the difference information is less
expensive than storing a whole block. For example, in a video sequence
in which the camera pans across a static scene, some frames can be
expressed simply as displaced versions of their predecessor frames. To
create a subsequent frame, most blocks are simply displaced relative
to the output screen.

The code in this example is for a match-cost calculation, a small
kernel of the complete motion-estimation code. This code provides an
excellent example of how to transform source code in order to make the
best use of TM-1's custom operations.
`
Figure 5 shows the original source code for the match-cost loop. The
code is not a self-contained function. At some location early in the
code, the arrays A[][] and B[][] are declared; at some location
between those declarations and the loop of interest, the arrays are
filled with data.
We start by noticing that the computation in the loop of Figure 5
involves the absolute value of the difference of two unsigned
characters (bytes). The TM-1 operation set includes several operations
that process all four bytes in a 32-bit word simultaneously. Since the
match-cost calculation is fundamental to the MPEG algorithm, it is not
surprising to find a custom operation, ume8uu, that implements this
operation exactly. The definition of the ume8uu operation is shown in
Figure 8.
`
If we hope to use a custom operation that processes four pixel values
simultaneously, we first need to create four parallel pixel
computations. Also, to use the ume8uu operation, the code must access
the arrays with 32-bit word pointers instead of with 8-bit byte
pointers.
`
Figure 6 shows a parallel version of the code from Figure 5. By
unrolling the loop and simply giving each computation its own cost
variable and then summing the costs all at once, each cost computation
is completely independent.
`
Figure 7 shows the loop recoded to access A[][] and B[][] as
one-dimensional instead of as two-dimensional arrays. We take
advantage of our knowledge of C-language array storage conventions in
order to perform this code transformation. Recoding to use
one-dimensional arrays prepares the code for the transformation to
32-bit array accesses.
`
Figure 7 also shows the loop of Figure 6 recoded to use ume8uu. Once
again taking advantage of our knowledge of the C-language array
storage conventions, the one-dimensional byte array is now accessed as
a one-dimensional 32-bit-word array.
`
`Of course, since we are now using one-dimensional ar-
`rays to access the pixel data, it is natural to use a single
`‘for’ loop instead of two. Figure 9 shows this streamlined
`version of the code without the inner loop. Since C-lan-
`guage arrays are stored as a linear vector of values, we
`can simply increase the number of iterations of the outer
`loop from 16 to 64 to traverse the entire array.
`
The recoding and use of the ume8uu operation has resulted in a
substantial improvement in the performance of the match-cost loop. In
the original version, the code executed 1280 operations (including
loads, adds, subtracts, and absolute values); in the restructured
version, there are only 256 operations: 128 loads, 64 ume8uu
operations, and 64 additions. This is a factor of five reduction in
the number of operations executed. Also, the overhead of the inner
loop has been eliminated, further increasing the performance
advantage.
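The operation counts quoted here can be checked with a few lines of arithmetic. The per-iteration breakdown of the original loop (two loads, one subtract, one absolute value, one add) is an inference from the stated total of 1280 over 256 iterations and is not spelled out in the text; the function names are hypothetical.

```c
#include <assert.h>

/* Original loop: one byte pair per iteration.
 * Assumed cost per iteration: 2 loads + subtract + abs + add = 5. */
unsigned ops_original(unsigned rows, unsigned cols)
{
    return rows * cols * 5u;
}

/* Restructured loop: four byte pairs per 32-bit word.
 * Cost per word (from the text): 2 loads + 1 ume8uu + 1 add = 4. */
unsigned ops_restructured(unsigned rows, unsigned cols)
{
    assert(cols % 4 == 0);
    return (rows * cols / 4u) * 4u;
}
```

For the 16 x 16 block this gives 1280 and 256 operations respectively, which reproduces the factor of five stated above.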
`
`unsigned char A[16][16];
`unsigned char B[16][16];
`
`for (row = 0; row < 16; row += 1)
`    for (col = 0; col < 16; col += 4)
`    {
`        cost0 = abs(A[row][col+0] - B[row][col+0]);
`        cost1 = abs(A[row][col+1] - B[row][col+1]);
`        cost2 = abs(A[row][col+2] - B[row][col+2]);
`        cost3 = abs(A[row][col+3] - B[row][col+3]);
`        cost += cost0 + cost1 + cost2 + cost3;
`    }
`
`Figure 6. Unrolled and Parallel version of Figure 5.
`
`unsigned char A[16][16];
`unsigned char B[16][16];
`
`unsigned int *IA = (unsigned int *) A;
`unsigned int *IB = (unsigned int *) B;
`
`for (row = 0, rowoffset = 0; row < 16; row += 1, rowoffset += 4)
`{
`    for (col4 = 0; col4 < 4; col4 += 1)
`        cost += UME8UU(IA[rowoffset + col4],
`                       IB[rowoffset + col4]);
`}
`
`Figure 7. Using the custom operation ume8uu to speed up the
`loop of Figure 6 resulted in a performance speedup of about
`
`ume8uu
`
`Sum of absolute values of
`unsigned 8-bit differences
`
`'C' function prototype:
`unsigned int
`ume8uu(unsigned int a, unsigned int b);
`
`Function of ume8uu:
`abs(zero_extto32(a<31:24>) - zero_extto32(b<31:24>)) +
`abs(zero_extto32(a<23:16>) - zero_extto32(b<23:16>)) +
`abs(zero_extto32(a<15:8>) - zero_extto32(b<15:8>)) +
`abs(zero_extto32(a<7:0>) - zero_extto32(b<7:0>))
`
`Figure 8. Custom Operation ume8uu
`
`unsigned char A[16][16];
`unsigned char B[16][16];
`
`unsigned int *IA = (unsigned int *) A;
`unsigned int *IB = (unsigned int *) B;
`
`for (i = 0; i < 64; i += 1)
`    cost += UME8UU(IA[i], IB[i]);
`
`Figure 9. The loop of Figure 7 with the inner loop eliminated.
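On a machine without the custom operation, the behaviour described in Figure 8 can be emulated in portable C to check that the single-loop form of Figure 9 computes the same match cost as the byte-by-byte loop. The sketch below is an illustration under stated assumptions, not the TM-1 implementation: `ume8uu_emul`, `match_cost_bytes`, and `match_cost_words` are hypothetical names, and the words are loaded with `memcpy` rather than the pointer cast used in the figures.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Emulation of the Figure 8 semantics: sum of the absolute differences
 * of the four unsigned bytes packed in a and b. */
static uint32_t ume8uu_emul(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int byte_a = (int)((a >> shift) & 0xFFu);  /* zero-extended byte */
        int byte_b = (int)((b >> shift) & 0xFFu);
        sum += (uint32_t)abs(byte_a - byte_b);
    }
    return sum;
}

/* Reference version: one pixel pair per iteration, nested loops. */
static uint32_t match_cost_bytes(unsigned char (*A)[16], unsigned char (*B)[16])
{
    uint32_t cost = 0;
    for (int row = 0; row < 16; row++)
        for (int col = 0; col < 16; col++)
            cost += (uint32_t)abs((int)A[row][col] - (int)B[row][col]);
    return cost;
}

/* Single loop of 64 iterations over 32-bit words, as in Figure 9. */
static uint32_t match_cost_words(unsigned char (*A)[16], unsigned char (*B)[16])
{
    uint32_t cost = 0;
    for (size_t i = 0; i < 64; i++) {
        uint32_t wa, wb;
        memcpy(&wa, (const unsigned char *)A + 4 * i, sizeof wa);
        memcpy(&wb, (const unsigned char *)B + 4 * i, sizeof wb);
        cost += ume8uu_emul(wa, wb);
    }
    return cost;
}
```

Because the sum is taken over all four byte lanes, the result does not depend on the byte order of the host, provided both arrays are loaded the same way.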
`
`6.0 APPLICATIONS
`
`TM-1 has the potential to be used in many multimedia
`applications; only a few of them are discussed here.
`
`6.1 Video Teleconferencing/Digital White
`Board
`
`Businesses are increasingly turning towards interactive
`computing as a means of becoming more efficient.
`Collaborative computing, for instance, involves sharing
`applications amongst multiple personal computers and
`multipoint video teleconferencing.
`
`TM-1 is a single chip video teleconferencing solution
`that runs all current video codecs across all common
`transport mechanisms. This includes H.324
`(POTS), H.320 (ISDN) and H.323 (LAN).
`
`6.2 Multimedia Card for Consumer
`Multimedia Applications
`The achievement of true computer based realism is
`only
