THE COMPUTER ENGINEERING HANDBOOK
`
`Edited by
`VOJIN G. OKLOBDZIJA
`
`CRC PRESS
`
`Boca Raton London New York Washington, D.C.
`
`META 1035
`IPR2022-01308
`META V. THALES
`
`

`

`Cover photos from Molecular Expressions Website (www.microscopy.fsu.edu), National High Magnetic
`Field Laboratory, Optical Microscopy Division, The Florida State University, Tallahassee, FL.
`
`©1995-2001 Michael W. Davidson and The Florida State University. With permission.
`
`Library of Congress Cataloging-in-Publication Data
`
The computer engineering handbook / Vojin G. Oklobdzija, editor-in-chief.
`p. cm.—(Electrical engineering handbook series)
`Includes bibliographical references and index.
`ISBN 0-8493-0885-2 (alk. paper)
`1. Computer engineering. 2. Electronic digital computers. I. Oklobdzija, Vojin G. II.
`Series.
`
`TK7885 .C645 2001
`004—dc21 2001043891
`
`This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with
`permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish
`reliable data and information, but the authors and the publisher cannot assume responsibility for the validity of all materials
`or for the consequences of their use.
`
`Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical,
`including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior
`permission in writing from the publisher.
`
`All rights reserved. Authorization to photocopy items for internal or personal use, or the personal or internal use of specific
`clients, may be granted by CRC Press LLC, provided that $1.50 per page photocopied is paid directly to Copyright Clearance
Center, 222 Rosewood Drive, Danvers, MA 01923 USA. The fee code for users of the Transactional Reporting Service is
`ISBN 0-8493-0885-2/02/$0.00+$1.50. The fee is subject to change without notice. For organizations that have been granted
`a photocopy license by the CCC, a separate system of payment has been arranged.
`
`The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works,
`or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.
`
`Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
`
`Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
`identification and explanation, without intent to infringe.
`
`Visit the CRC Press Web site at www.crcpress.com
`
`© 2002 by CRC Press LLC
`
`No claim to original U.S. Government works
`International Standard Book Number 0-8493-0885-2
`Library of Congress Card Number 2001043891
`Printed in the United States of America 1234567890
`Printed on acid-free paper
`
[Figure: design levels ordered from Algorithm through Architecture and Circuit to Process/Device Level; the higher levels have higher impact and more options.]

FIGURE 19.1 Each level impact for low-power design.
`
A typical example of an algorithm-level contribution is motion estimation in an MPEG encoder. Motion
estimation is an extremely critical function of MPEG encoding. Implementing fundamental MPEG2 motion
estimation using a full search block matching algorithm requires a huge amount of computation [3,4]. It
reaches 4.5 teraoperations per second (TOPS) when realizing a very wide search range (±288 pixels
horizontal and ±96 pixels vertical), whereas the rest of the functions take about 2 GOPS. Therefore, motion
estimation is the key problem to solve in designing a single-chip MPEG2 encoder LSI. Reference [5]
describes a good example of dramatically reducing the performance actually required for motion estimation
with a very wide search range, which was implemented as part of a 1.2 W single-chip MPEG2 MP@ML
video encoder. Two adaptive algorithms are applied. One is an 8:1 adaptive subsampling algorithm that
adaptively selects subsampled pixel locations using characteristics of maximum and minimum values
instead of using fixed subsampled pixel locations. This algorithm effectively chooses sampled pixels and
reduces the computation requirements by seven-eighths. The other is an adaptive search area control
algorithm, which has two independent search areas of H: ±32 and V: ±16 pixels, each using a full search
block matching algorithm. The center locations of these search areas are decided based on a distribution
history of the motion vectors, and this algorithm substantially expands the search area up to H: ±288 and
V: ±96 pixels. As a result, the total computation requirement is reduced from 4.5 TOPS to 20 GOPS (216:1),
which is possible to implement on a single chip. The first search area can follow a focused object close
to the center of the camera finder with small motion. The second one can cope with a background object
with large motion during camera panning. This adaptive algorithm attains high picture quality with a very
wide search range because it can efficiently track moving objects, that is, obtain correct motion vectors. As
shown in this example, algorithm improvement can drastically reduce the computation requirement and
enable low-power design.
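
The 216:1 figure quoted above can be reproduced with a little arithmetic. The sketch below is only an illustration and is not taken from Reference [5]: it assumes that the cost of full search scales with the number of candidate positions, that this number can be approximated as twice the range in each dimension, and that the savings from the two techniques multiply. All function and variable names are hypothetical.

```python
def search_positions(h_range, v_range):
    """Approximate candidate positions for a +/-h_range by +/-v_range window.

    Assumption: positions are counted as (2*h) * (2*v). An exact count of
    (2*h + 1) * (2*v + 1) would give a slightly different ratio.
    """
    return (2 * h_range) * (2 * v_range)


# One wide window versus two small, adaptively placed windows.
full = search_positions(288, 96)
adaptive = 2 * search_positions(32, 16)
area_ratio = full / adaptive

# 8:1 adaptive subsampling removes seven-eighths of the remaining work.
subsampling_ratio = 8
total_ratio = area_ratio * subsampling_ratio

full_search_tops = 4.5
reduced_gops = full_search_tops * 1000 / total_ratio

print(f"area ratio {area_ratio:.0f}:1, total {total_ratio:.0f}:1, "
      f"about {reduced_gops:.1f} GOPS")
```

Under these assumptions the search-area control contributes a factor of 27 and the subsampling a factor of 8, which together give 216 and bring 4.5 TOPS down to roughly 20 GOPS.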
`
`19.4 Architecture Level Impact
`
The architecture level comes next after the algorithm level, both in the design hierarchy and in terms of
impact on power consumption. At the architecture level there are still many options and wide freedom in
implementation. The architecture level is explained here in terms of the CPU (microprocessor), the DSP
(digital signal processor), the ASIC (dedicated hardwired logic), reconfigurable logic, and the special
purpose DSP.
`
The CPU is the most widely used general-purpose architecture, as shown in Fig. 19.2. Fundamentally,
anything can be performed by software. It is the least power-efficient architecture, however. The main
features of the CPU are the following: (1) It is completely sequential in operation, with instruction fetch
and decode in every cycle. Basically, this is not essential for the computation itself and is just overhead.
(2) There is no dedicated address generator for memory access. The regular ALU is used to calculate
memory addresses. Data cannot be fed in every cycle because of the load/store architecture via registers
(RISC-based architecture). This means cycles are consumed for data movement and not just for the
computation itself. (CISC allows memory access operations, but this doesn't mean it is more effective; it
is a different story, not explained in detail here.) (3) Many temporal storage operations are included in
the computation procedure.
`
`Implementation-Level Impact on Low-Power Design
`
[Figure: CPU block diagram and pipeline (stages including Fetch and Mem/WriteBack) with annotations: sequential operation with instruction fetch and decode in every cycle; no address generator, address calculated using the ALU; data supplied via registers, not fed in every cycle; usually no fully parallel multiplier, requiring multi-cycle operation; limited and fixed general resources; many temporal storage operations. ALU: Arithmetic Logical Unit; MPY: Multiplier; MUX: Multiplexer.]

FIGURE 19.2 CPU structure.

FIGURE 19.3 Dynamic instruction statistics.
`
This is a completely justified overhead. (4) Usually, a fully parallel multiplier is not used, causing
multicycle operation. This also wastes power because clocking, memory, and extra circuits are activated
multiple times for one multiply operation. (5) Resources are limited and fixed in advance. This results
in overhead operations being executed because the resources are general purpose. Figure 19.3 shows
dynamic run-time instruction statistics [6]. This indicates that essential computation instructions such as
arithmetic operations occupy just 33% of the entire dynamic run-time instruction stream. Data movement
and control instructions such as branches take the remaining two-thirds, which is a large overhead
consuming extra power.

The DSP is a processor enhanced for multiply-accumulate computation. It is general-purpose in structure
and more effective for signal processing than the CPU, but it is still not very power efficient. Figure
19.4 shows the basic structure, and its features are as follows. (1) The DSP is also sequential in operation,
with instruction fetch and decode in every cycle, similar to the CPU. It causes overhead in the same way,
but as an exception the DSP has a hardware loop, which eliminates continued instruction fetch in repeated
operations, reducing the power penalty. (2) Many temporal storage operations are also used. (3) Resources
are limited and fixed in advance for general purpose as well. This is a major reason for the temporal
storage operations. (4) A fully parallel multiplier is used, making one-cycle operation possible. In addition,
an accumulator with guard bits is applied, which is very important for accumulating continuously without
accuracy degradation and without undesired temporal storing to registers. This improves power efficiency
for multiply-accumulate-based
`
[Figure: DSP block diagram (instruction memory and decode, two data memories with address generators, ALU, MAC) and pipeline (Fetch, Dec, Read Mem, Exe (Writeback Mem)) with annotations: sequential operation, with instruction fetch and activation of a large area in every cycle (except hardware looping); dedicated address generators; memory access from two memories at once (two data fed in every cycle); fully parallel multiplier with one-cycle operation; accumulator with guard bits; limited and fixed resources; still many temporal storage operations. MAC: Multiply Accumulator.]

FIGURE 19.4 DSP structure.

[Figure: ASIC block diagram (data memories with address generators feeding dedicated arithmetic) with annotations: directly mapped operation in optimal form, e.g., out = X*(A+B)+B*C; minimum temporal storage operations; fixed function, no flexibility.]

FIGURE 19.5 ASIC structure.
`
computations. (5) It is equipped with dedicated address generators for memory access. This realizes more
complex memory addressing without using the regular ALU and consuming extra cycles, and two data
items can be fed in every cycle directly from memory. This is very important for DSP operation. Features
(4) and (5) are the advantages of the DSP over the CPU in improving power efficiency.
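
The role of the guard bits in feature (4) can be illustrated with a small simulation. This is a minimal sketch under assumed widths (16-bit operands, 32-bit products, and 8 guard bits for a 40-bit accumulator); the text does not give the widths of any particular DSP, and the helper names are hypothetical.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a signed two's-complement field of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(samples, coeffs, acc_bits):
    """Multiply-accumulate using an accumulator that is acc_bits wide."""
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = wrap_signed(acc + x * c, acc_bits)
    return acc


# Assumed widths for illustration: 16-bit operands give 32-bit products.
samples = [32767] * 4
coeffs = [32767] * 4
exact = sum(x * c for x, c in zip(samples, coeffs))

no_guard = mac(samples, coeffs, acc_bits=32)    # accumulator as wide as a product
with_guard = mac(samples, coeffs, acc_bits=40)  # 8 extra guard bits

print("exact:", exact)
print("32-bit accumulator:", no_guard)
print("40-bit accumulator:", with_guard)
```

With only four large products, the 32-bit accumulator already wraps around to a wrong value, while the 40-bit accumulator matches the exact sum. Without guard bits, software would have to scale or spill intermediate sums to registers, which is the kind of temporal storage the accumulator avoids.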
`
We define the ASIC as dedicated hardware here. It is the most power efficient because the structure
can be designed and optimized for the specific function. Figure 19.5 shows the basic image, and the
features are as follows: (1) Required functions can be directly mapped in optimal form. This is the essential
feature and the source of power efficiency, because it minimizes overhead. (2) Temporal storage operations,
which are a large overhead in general-purpose architectures, can be minimized. Basically, this comes from
feature (1). (3) It is not sequential in operation. Instruction fetch and decode are not required. This
eliminates the fundamental overhead of general-purpose processors. (4) The function is fixed at design
time. There is no flexibility. This is the most significant drawback of dedicated hardware solutions.
`
There is another category known as reconfigurable logic. A typical architecture is the field programmable
gate array (FPGA). This is gate-level, fine-grained programmable logic. It consists of a programmable
network structure and logic blocks that have a look-up table (LUT)-based programmable unit, a flip-flop,
and selectors, as shown in Fig. 19.6. The features are: (1) It is quite flexible. Basically, the FPGA can be
configured to any dedicated function if the integrated gate capacity is enough to map it. (2) The structure
can be optimized without being limited to a fixed data width and a fixed variety of function units, such as
the general 32-bit ALU of a CPU. Therefore, the FPGA is used not only for prototyping but also where high
performance and high throughput are targeted. (3) It is very inefficient in power. The switch network for
fine-grained flexibility causes a large power overhead. Each gate function, for example NAND, NOR, and
so on, is realized by a LUT programmed as a truth table. The interconnect accounts for 65% of the chip's
power consumption, while the logic part consumes only 5% [7]. This means the major part of FPGA power
is burned in unessential portions.
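
The idea of a gate realized by a LUT programmed as a truth table can be sketched in a few lines. The example below assumes a 2-input LUT purely for brevity (the text does not state a LUT size), and `make_lut` is a hypothetical helper, not part of any FPGA tool.

```python
def make_lut(truth_table):
    """Return a 2-input gate backed by a 4-entry look-up table.

    truth_table[i] is the output for inputs (a, b) with i = 2*a + b.
    """
    if len(truth_table) != 4:
        raise ValueError("a 2-input LUT needs exactly 4 entries")

    def gate(a, b):
        # The inputs only select an entry; no logic is computed here.
        return truth_table[(a << 1) | b]

    return gate


# The same structure becomes a different gate just by changing its contents.
nand = make_lut([1, 1, 1, 0])
nor = make_lut([1, 0, 0, 0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NAND:", nand(a, b), "NOR:", nor(a, b))
```

Every gate costs a full table plus the selection logic regardless of how simple the function is, which gives some intuition for why this fine-grained flexibility is expensive.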
`
FIGURE 19.6 FPGA simplified structure.

[Figure: comparison chart with categories CPU, DSP, ASIC, and Reconfig.]

FIGURE 19.7 IIR comparison.
`
The FPGA sacrifices power efficiency in order to attain wide-ranging flexibility. It is a trade-off between
flexibility and power efficiency. Lately, however, another class of reconfigurable architecture has appeared:
the coarse-grained or heterogeneous reconfigurable architecture. A typical work is Maia of the Pleiades
project at U.C. Berkeley [8-12]. This architecture consists of heterogeneous modules that are mainly
coarse-grained, similar to an ALU, multiplier, memory, etc. The flexibility is limited to some computation
or application domain, but power efficiency is dramatically improved. This type of architecture might gain
acceptance because of the strong demand for both low power and flexibility.
Figure 19.7 shows a comparison of the cycles needed to execute a fourth-order infinite impulse response
(IIR) filter on the CPU, DSP, ASIC, and reconfigurable logic. The ASIC and reconfigurable logic are assumed
to be two parallel implementations. The CPU incurs more overhead than the DSP, which is enhanced for
multiply computation as mentioned previously. Also, dedicated hardware structures such as the ASIC and
reconfigurable logic can reduce computational overhead more than the others.
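
To make the workload in this comparison concrete, the following is a minimal sketch of a fourth-order IIR filter in direct form I. The direct-form structure and the coefficient values in the usage lines are assumptions for illustration only; the source does not say which filter structure or coefficients were used.

```python
def iir4_direct(x, b, a):
    """Fourth-order IIR filter, direct form I.

    b: five feedforward coefficients b0..b4
    a: four feedback coefficients a1..a4 (a0 is assumed to be 1)
    Returns the output samples and the number of multiplies per sample.
    """
    x_hist = [0.0] * 5   # x[n], x[n-1], ..., x[n-4]
    y_hist = [0.0] * 4   # y[n-1], ..., y[n-4]
    out = []
    for sample in x:
        x_hist = [sample] + x_hist[:-1]
        acc = sum(bk * xk for bk, xk in zip(b, x_hist))
        acc -= sum(ak * yk for ak, yk in zip(a, y_hist))
        y_hist = [acc] + y_hist[:-1]
        out.append(acc)
    return out, len(b) + len(a)


# Placeholder coefficients: y[n] = x[n] + 0.5 * y[n-1].
response, multiplies = iir4_direct([1.0, 0.0, 0.0, 0.0],
                                   b=[1.0, 0.0, 0.0, 0.0, 0.0],
                                   a=[-0.5, 0.0, 0.0, 0.0])
print(response, multiplies)
```

In this form each output sample needs nine multiply-accumulate steps. The essential arithmetic is the same on every architecture; what differs is how many additional cycles go to instruction fetch, address calculation, and data movement around it.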
The last one is the special purpose DSP. Figure 19.8 shows an example of a programmable DSP for
MPEG2 video encoding [13]. This architecture applies three-level parallel processing at the macro-block
level, block level, and pixel level, reducing the performance requirement from 1.1 GHz to
[Figure: block diagram with labels Next Macro-block, Current Macro-block, and Previous Macro-block.]

FIGURE 19.8 Special purpose DSP for MPEG2 video encoding.
`
81 MHz with 13 operations in parallel on average. The macro-blocks are processed in a 3-stage pipeline
with MIMD control by two RISCs. The 6 blocks of a macro-block are handled in SIMD fashion by 6 vector
processing engines (PEs), one assigned to each block. The pixels of a block are computed by the PE, which
consists of an extended ALU, a multiplier, an accumulator, three barrel shifters with a truncating/rounding
function, and a 6-port register file. This specialized DSP performs MPEG2 MP@ML video encoding at
1.3 W in a 3 V, 0.4 µm process, with software programmability. Architecture improvement for a dedicated
application can reduce the performance requirement and the overhead of the general-purpose approach,
and it plays an important role in low-power design.
`
`19.5 Circuit Level Impact
`
The circuit level is the most detailed implementation layer. This level is explained in terms of the module
level, such as a multiplier or memory, and the basement level, such as voltage control, which affects a wide
range of the chip. The circuit level is quite important for performance but usually has less impact on power
consumption than the higher levels discussed previously. One reason is that each component is just a part
of the entire chip. Therefore, it is necessary to focus on the critical and major factors (the most
power-hungry modules, etc.) in order to contribute to power reduction at the chip level.
`
`Module Level
`
The module level covers individual components such as the adder, multiplier, and memory. It has relatively
less impact on power than the algorithm and architecture levels, as mentioned above. Even if the power
consumption of one component is reduced by half, in many cases it is difficult to improve the total chip
power consumption drastically. On the other hand, it is still important to focus on circuit-level components,
because the sum of all units is the total power. Memory components especially occupy a large portion of
many chips. Two examples at the module level are shown here.
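
The point that halving one component's power rarely changes the chip total much follows from simple proportions. The sketch below assumes the rest of the chip is unaffected; the module shares are hypothetical values chosen only for illustration, and `chip_power_saving` is not a function from the source.

```python
def chip_power_saving(module_share, module_reduction):
    """Fraction of total chip power saved by improving one module.

    module_share: fraction of total chip power drawn by the module (0..1)
    module_reduction: fraction by which that module's power is cut (0..1)
    Assumes the power of all other modules is unchanged.
    """
    return module_share * module_reduction


# Hypothetical shares, with the module's own power reduced by half.
for share in (0.1, 0.2, 0.5):
    saving = chip_power_saving(share, 0.5)
    print(f"module share {share:.0%}: chip power falls by {saving:.0%}")
```

Only a module that already dominates the power budget yields a large chip-level saving, which is why the text recommends focusing on the most power-hungry modules, such as memory.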
`
Usually, many glitches occur in a logic block, consuming extra power that averages 15 to 20% of the total
power dissipation [14]. A multiplier has a large adder-based array to sum partial products, which generates
many glitches. Figure 19.9 is an example of a multiplier improvement to eliminate those glitches [13].
There are time skews between the X-side input signals and the Y-side Booth-encoded signals (Booth select),
creating many glitches at the Booth selectors. These glitches propagate in the Wallace tree and consume
extra power. The glitch preventive Booth (GPB) scheme (Figure 19.9) blocks the X-signals until the
Booth-encoded signals (Y-signals) are ready, by delaying the clock in order to synchronize the X-signals
and Y-signals. During this blocking period, the Booth selectors keep the previous data as dynamic latches.
This scheme reduces Wallace tree power consumption by 44% without extra devices in the Booth selectors.
`
Another example is memory power reduction [13]. Normally, in an ASIC embedded SRAM, the whole
memory cell array is activated. However, the memory cells actually utilized, whose data are read out, are just part
`