EXHIBIT 1047

A. F. Murray and P. B. Denyer, “A CMOS Design Strategy for Bit-Serial Signal Processing,” IEEE JSSC, Vol. 20, Issue 3, pp. 746-753 (1985)

TRW Automotive U.S. LLC: EXHIBIT 1047
PETITION FOR INTER PARTES REVIEW OF U.S. PATENT NUMBER 8,599,001
IPR2015-00436

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-20, NO. 3, JUNE 1985

A CMOS Design Strategy for Bit-Serial Signal Processing

ALAN F. MURRAY AND PETER B. DENYER, MEMBER, IEEE

Abstract: We present a summary of the features and successes of a Silicon Compiler (FIRST) for LSI nMOS bit-serial signal processors. A replacement cell library of CMOS operators has been designed for the compilation of true VLSI bit-serial signal processors. The cell library is implemented in 2.5-µm bulk CMOS technology, and maintains a consistent performance of 20 MHz. We describe the design philosophy and style behind the CMOS cells, detailing the dynamic logic style used, its layout, and its testability. As an example of the capability of the library, we discuss a full-precision complex multiplier.

I. INTRODUCTION

DESIGN complexity has become a dominant cost limit in the development of VLSI systems. Without new design methods and tools, advances in manufacturing capability and algorithm development will far exceed our capacity for design [1]. This effect is nowhere more pronounced than in the field of real-time signal processing; the continuous flow of data, together with the complexity of many of the algorithms, imposes severe computational demands that often cannot be satisfied by general-purpose machines or components.

We have addressed the issue of design complexity through the development of a powerful system synthesis tool: a silicon compiler called FIRST. FIRST is more specialized than most other silicon compilation/autolayout systems. This restricts its application (to real-time signal processing tasks), but at the same time permits a functional integration density that is closer to hand-crafted custom design than to other examples of automatic layout generation.

We describe a cell library implemented in 2.5-µm bulk CMOS technology (two-layer metal) whose operators are amenable to either silicon compilation or manual assembly to generate bit-serial signal processing functions, analogous to those served by FIRST.

Section II of this paper constitutes a summary of the aims, capabilities, and achievements of the FIRST Silicon Compiler, as implemented in 5-µm nMOS technology. Section III describes the reimplementation of the cell library in 2.5-µm CMOS technology, giving details of the design and layout style, and illustrates the library’s scope and performance by detailing the design of two of the major operators. Section IV shows how the inherently high level of testability exhibited by bit-serial primitives is carried through into the normally problematic regime of stuck-open faults in CMOS by the use of a dynamic design style.

Manuscript received November 5, 1984; revised January 16, 1985.
The authors are with the Department of Electrical Engineering, University of Edinburgh, The King’s Buildings, Mayfield Road, Edinburgh EH9 3JL, Scotland.

II. THE FIRST SILICON COMPILER

FIRST synthesizes signal processing chips and systems using exclusively bit-serial architectures. This architectural restriction has significant consequences in the simplicity and success of the environment. We shall later identify how restrictions and conventions such as these are key to a successful design methodology.

Bit-serial architectures [2] have major advantages over bit-parallel alternatives, especially in the areas of signal routing and communication (issues which are often important in signal processing applications), and in efficiency of computation through bit-level pipelining. These features are exemplified in a range of case studies given later.

A. An Example

We begin with a simple example system to give an impression of the scope and use of FIRST. For the purpose of this example we consider an implementation of the four-region approximation [3] to the magnitude M of a complex number A + jB:

$$M = \mathrm{greater}\left[\left(\tfrac{7}{8}G + \tfrac{1}{2}L\right),\ G\right] \tag{1}$$

where

$$G = \mathrm{greater}\left[\,|A|,\ |B|\,\right], \qquad L = \mathrm{lesser}\left[\,|A|,\ |B|\,\right].$$
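
As a minimal sketch, assuming Eq. (1) takes the form M = greater[G, 7G/8 + L/2] (which is what the subtract, shift, and add operators named in Fig. 1 suggest), the approximation can be written using only comparisons, shifts, and additions. The function name and the test values are illustrative and are not taken from FIRST.

```python
import math


def four_region_magnitude(a: int, b: int) -> int:
    """Approximate |a + jb| as greater(G, 7/8*G + 1/2*L) using shifts only."""
    g = max(abs(a), abs(b))   # greater of |A| and |B|
    l = min(abs(a), abs(b))   # lesser of |A| and |B|
    eighth = g >> 3           # right shift by 3 gives G/8 (truncated)
    half_min = l >> 1         # right shift by 1 gives L/2 (truncated)
    candidate = (g - eighth) + half_min  # subtract, then add
    return max(g, candidate)  # keep the greater of the two estimates


if __name__ == "__main__":
    for a, b in [(1000, 0), (800, 600), (-512, 512)]:
        print(a, b, four_region_magnitude(a, b), round(math.hypot(a, b), 1))
```

Truncating shifts stand in for the scaling operators here; a hardware implementation may round differently.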

We view this algorithm as a functional flow-graph, as shown in Fig. 1. The oblong shapes are functional elements that perform fixed operations on the (bit-serial) data flowing through them. Bit-serial elements exhibit a delay, or latency, of an integer number of bit-times. It is important to equalize these delays and synchronize the data paths entering each element. Simple delay elements are shown as circles in Fig. 1. We have deliberately reduced this algorithm to a network of functional operators (add, scale, delay, etc.) which are supported by FIRST. We call this the set of primitive operators. FIRST supports a finite set of primitives from which all systems are constructed; these are relatively high-level functions, well above the transistor or gate level.

0018-9200/85/0600-0746$01.00 ©1985 IEEE

[Fig. 1 graphic not reproduced. Operators named in the flow-graph: ABS [24,0] (two instances), ORD [24,0,0] (two instances), DSHIFT [1,0], DSHIFT [3,0], SUB [1,0,0,0], and ADD [1,0,0,0].]

Fig. 1. Flow-graph of complex-to-magnitude example. Simple delay elements are shown as circles, while computational operators are represented by name.
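
The delay equalization described above can be sketched as a longest-path calculation over the flow-graph. This is only an illustration of the idea, not the procedure FIRST uses; the node names and latency values are hypothetical and are not the latencies of the actual primitives.

```python
from graphlib import TopologicalSorter

# Hypothetical latencies (in bit-times) and connectivity, for illustration only.
LATENCY = {"abs_re": 1, "abs_im": 1, "ord1": 2, "half_min": 1,
           "eighth": 3, "sub": 1, "add": 1, "ord2": 2}
INPUTS = {"ord1": ["abs_re", "abs_im"], "half_min": ["ord1"], "eighth": ["ord1"],
          "sub": ["ord1", "eighth"], "add": ["sub", "half_min"],
          "ord2": ["ord1", "add"]}


def equalize_delays(latency, inputs):
    """Return (ready, padding) for a flow-graph of bit-serial operators.

    ready[n] is the bit-time at which node n's output appears;
    padding[(src, dst)] is the extra delay needed on that connection
    so that all inputs of dst arrive in the same bit-time.
    """
    ready, padding = {}, {}
    for node in TopologicalSorter(inputs).static_order():
        preds = inputs.get(node, [])
        # An operator can start only when its slowest input has arrived.
        start = max((ready[p] for p in preds), default=0)
        for p in preds:
            padding[(p, node)] = start - ready[p]
        ready[node] = start + latency[node]
    return ready, padding


if __name__ == "__main__":
    ready, padding = equalize_delays(LATENCY, INPUTS)
    print(ready["ord2"], {k: v for k, v in padding.items() if v})
```

Each nonzero padding value corresponds to a simple delay element of the kind drawn as a circle in Fig. 1.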

FIRST is a tool which takes this functional flow-graph as input, instantiates the required set of bit-serial primitives, assembles these primitives, and routes the network of flow-graph connections between them. Finally, FIRST wires input/output pads, and power and clock services, to complete a custom chip design. In effect this environment acts as a compiler, delivering low-level code (in this case VLSI mask artwork) from an initial high-level program (the flow-graph), while simultaneously simulating the chip’s function.

The corresponding FIRST implementation of the complex-to-magnitude algorithm in silicon is shown in Fig. 2. As a VLSI device this is not particularly impressive; it contains only a few thousand transistors. However, as a demonstration of the design methodology it is impressive: our design cycle was very fast and required no intimate knowledge of integrated circuit design.

The primitives themselves are irregular blocks of high-density layout, with I/O strategies configured to suit the routing channel convention. However, they are not held in FIRST as single “macrocells,” but rather as collections of leaf cells and procedures for their assembly. In this way the primitive set is parametrized so that users may select, for example, the size of a multiplier, the length of a delay element, etc.

This simple floor-plan style has proved effective in efficiently implementing many of the systems we have attempted. Later we show a selection of such designs which exemplify this feature. We find a typical overhead of around 25–30 percent of unused silicon area.

Fig. 2. FIRST (5-µm nMOS) chip synthesized from Fig. 1.

B. Secrets of Success

The nMOS version of FIRST has been in use at Edinburgh University for some 18 months. Several designers have completed complex system designs in remarkably short timescales, vindicating our aims and approach.

We begin by setting rigorous conventions covering all aspects of the timing, electrical, and numerical formats of signals throughout the system. If we now implement a base set of primitives that uniformly respect these conventions, then any connection of these primitives will communicate successfully, without requiring any further design attention to be devoted to the individual interfaces. Thus we can connect a flow-graph of elements, as in Fig. 1, and automatically achieve correct hardware functionality.

These advantages extend to elegant hierarchical design, because the communication conventions are automatically inherited by any network or subnetwork built exclusively from such components. Thus we are able to construct higher level operators as general hardware “routines.” These features make for a very speedy system design environment.

C. A Resume of Case Studies

Table I summarizes the scope of applications addressed by FIRST as a selection of system design case studies. The information on transistor and chip counts relates to 5-µm nMOS technology. The advent of a CMOS primitive library (as discussed below) dramatically improves packing density and chip counts.

TABLE I
SYSTEM CASE STUDIES

Columns: System; Computation (MegaOps/sec); Chip Designs; Chips; Total Transistors (K); FIRST Code (lines)

Systems: Speech Echo Canceller; Adaptive Lattice Filter; LDI Filter, 5th Order; FFT, 16-point, 8 MHz

Values (in extraction order; row and column assignment not recoverable): 250, 252, 3, 18, 55, 10, 15, 150, 5, 1, 6, 5, 1, 30, 7.5, 43, 160, 500, 160, 300

Most significantly, this table shows that impressive total computation rates can be realized by many slow bit-serial elements operating in parallel. The final two columns provide a comparison of the size of the system description code (FIRST input) and the transistor count of the resulting system.

III. REIMPLEMENTATION IN CMOS TECHNOLOGY

Significant gains may be expected in mapping the techniques that we have demonstrated in nMOS technology to an advanced 2.5-µm CMOS technology. Our design style uses a mixture of dynamic and static logic, using dynamic techniques where the benefits of the increased speed and reduction in transistor count (and therefore silicon area) can be realized and are significant. Where the fan-in of a logic gate is low, a static implementation may involve no more transistors than a dynamic one, particularly if the number of minterms in the Boolean expansion of the gate’s function is also low. In such cases we use a static gate, to avoid additional clock distribution. We use dynamic latches everywhere. The dynamic logic structure used is based on that known variously as “NP” or “NORA” CMOS, modified to render it less sensitive to clock edge times [4], [5].

A. Design Style

The logic evaluator section of a gate consists of either n- or p-type transistors. The desired function is evaluated by “discharging” a precharged capacitance, conditional on either a positive logic (n-gate) or negative logic (p-gate) function of the gate’s input variables. The technique is related to the “Domino CMOS” style, in which only n-gates are involved [6]. The mixed-gate approach is superior in that n- and p-gates may be cascaded without the need for the intervening inverter of Domino CMOS, and greater versatility is brought about by the positive and negative logic functions. In fact n-p, Domino, and static gates may be mixed, provided the rules for inclusion of each type are obeyed.

Fig. 3(a) shows an exclusive-NOR (XNOR) gate taken from part of the library. The form of the dynamic latch used can be seen from Fig. 3(b) to consist of a “clocked inverter.” Input values are sampled during $\phi_i$, during which time the nodes C and D are precharged (or, more correctly, preset) low and high, respectively. During $\bar{\phi}_i$, these nodes are conditionally “discharged,” depending on the voltages on the gates of the transistors in the evaluation logic tree.

Fig. 3. Dynamic (mixed-gate) exclusive-NOR gate. [Graphic not reproduced; its timing labels read: sample A, B and preset C, D; hold A, B and evaluate C, D; sample the output.]

Fig. 4. Data corruption due to slow clock edges in NORA/NP-CMOS. [Graphic not reproduced.]

During the evaluation phase ($\phi_i = 0$), node C is pulled high if A and B are both low, forming the minterm $\bar{A}\bar{B}$. Simultaneously, node D is discharged, conditional on either C or both A and B being high, forming the exclusive-OR function. During $\phi_j$, this value is sampled by the clocked inverter to give a “held” XNOR during $\bar{\phi}_j$. The effect of this is shown in Fig. 3(c) and 3(d). In common with all synchronous design systems, the output of a complete gate is presented one full clock cycle after its inputs. In addition to the $\phi_i \rightarrow$ [n-p-gates] $\rightarrow \phi_j$ ordering illustrated, gates may also be constructed which sample inputs on $\phi_j$ and change outputs on $\phi_i$. Furthermore, n- and p-sections may be cascaded internal to a gate, provided ripple-through times do not become excessive. With these capabilities, extremely tight pipelining of logic may be achieved.
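
The preset, evaluate, and hold sequence of the XNOR gate can be followed in a purely logical model. This is a behavioral sketch only, assuming ideal switches and ignoring charge, timing, and all analog effects; the function name is hypothetical.

```python
def dynamic_xnor(a: int, b: int) -> int:
    """Behavioral model of one sample/evaluate/hold cycle of the gate."""
    # Sample phase: inputs are held while C is preset low and D is preset high.
    held_a, held_b = a, b
    node_c, node_d = 0, 1
    # Evaluate phase: the p-section pulls C high only if A and B are both low.
    if held_a == 0 and held_b == 0:
        node_c = 1
    # The n-section discharges D if C is high or if A and B are both high,
    # so D is left holding A XOR B.
    if node_c == 1 or (held_a == 1 and held_b == 1):
        node_d = 0
    # Output phase: the clocked inverter samples D and holds its complement.
    return 1 - node_d


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, dynamic_xnor(a, b))
```

Because both internal nodes are preset at the start of every cycle, the result depends only on the current inputs and never on a previous output.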

In the original n-p or NORA scheme, $\phi_j$ is simply $\bar{\phi}_i$. This apparently attractive simplification carries the danger that slow clock edges can result in corruption of the held output values from a gate, as the precharge nodes begin to be precharged before the output is fully held. This effect is shown by SPICE simulation in Fig. 4. The onset of the precharge on node D in the XNOR circuit of Fig. 4 causes a corruption of the incompletely “held” value on node C (the output of the clocked inverter) if the clock edges are “slow.” Our simulations show that clock rise/fall times faster than 3 ns are required (appreciably less than a typical gate delay) if undesirably complicated rules for cell design are to be avoided.

Fig. 5. Corruption of dynamically stored logic values due to capacitive (gate-drain) coupling.

Caution must, however, be exercised when designing dynamic sections even when our more conservative two-phase (plus inverses) nonoverlapping clocking scheme is adopted. The major effects to be avoided are as follows.

Charge Redistribution

Charge sharing can occur between the nodes of an n- or p-section which has inputs from a preceding p- or n-section (i.e., from an “internal” precharge/discharge node, rather than from a fully sampled and held node). This may be avoided by careful placement of such inputs within the logic evaluation tree such that they are as far as possible from the precharge node of the gate [5].
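
The size of a charge-sharing disturbance follows from charge conservation when two capacitances are connected. The sketch below applies that standard relation; the capacitance and voltage values are hypothetical and are not figures from the cell library.

```python
def shared_voltage(c_pre: float, v_pre: float, c_int: float, v_int: float) -> float:
    """Voltage after two capacitors are connected and total charge is conserved."""
    return (c_pre * v_pre + c_int * v_int) / (c_pre + c_int)


if __name__ == "__main__":
    # Hypothetical values: a 50 fF precharge node at 5 V shares charge
    # with a 10 fF internal node that was at 0 V.
    print(round(shared_voltage(50e-15, 5.0, 10e-15, 0.0), 3))
```

The smaller the internal capacitance exposed to the precharge node, the smaller the droop.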

Corruption of Sampled Values at the Inputs to a Dynamic Section

Capacitive coupling (gate-drain) to the precharge node of a gate can cause corruption of the dynamically held values at the gate’s inputs. In Fig. 5, the effect of a rapid pulldown of node F is shown, in the partial corruption of the dynamically held logical 1 on node D. Choice of transistor sizes within a logic tree matched to the capacitance of the input node minimizes this effect.

Clock Breakthrough Leading to “Charge-Pumping”

The effect of gate-drain coupling at a dynamic latch (i.e., gate-drain coupling at the clocked transistors) can lead to voltages outside the supply rails. This may be logically insignificant, but creates the circumstances within which latchup can occur. This effect only comes into play when there is some stagger between the clock and its inverse, which there usually will be. The result is that the clocked inverter’s output ripples around its correct value. In Fig. 5, an example of this form of charge-pumping is shown, where node C is made to ripple around its correct value of 5 V by the breakthrough of $\phi_j$ and $\bar{\phi}_j$. Again, carefully chosen transistor sizes minimize the coupling while preserving speed.

These rules are not difficult to apply, and the clock signals are not as difficult to generate or distribute over the device as would be signals requiring a maximum limit of 3 ns on clock edges.

B. Layout Style

A structured layout style is used for the design of the cell library, to ease checking and to impose some consistency in the layout of low-level gates. The layout style is a relaxed version of that known as “gate-matrix” layout [7], within which the p-channel devices of a logic gate are grouped together and the n-channel devices are similarly grouped, to minimize the effect of the large minimum p-channel-to-n-channel device separation [7]. Furthermore, the devices are arranged in rows with the source-drain direction horizontal (say), such that polysilicon gate “wires” may run vertically between p- and n-channel rows.

The resultant layout is dense, neat, and amenable to pitch-matching between gates, particularly with regard to power and clock distribution. In Fig. 6, a schematic representation is given, showing:

a) (mainly horizontal) strips of transistors forming the logic or, as in this example, the clocked inverter structure;
b) (mainly vertical) polysilicon, forming transistor gates and short-range intergate connections;
c) (mainly vertical) first-layer metal, for longer-distance intercell interconnect, and to span the p-type to n-type boundary;
d) (almost exclusively horizontal) second-layer metal, for supply and clock lines, utilizing a global vertical pitch.

Thus, low-level entities fit together like LEGO bricks, recouping in denser intercell interconnect any loss incurred in adopting a fixed pitch.

[Fig. 7 graphic not reproduced. Each stage computes SUM IN + A + CARRY IN, producing SUM OUT and CARRY OUT; the stages are labeled ADDER 1 through ADDER 7, followed by a SUBTRACTER, and form MULTIPLICAND × COEFFICIENT.]

Fig. 7. Details of serial multiplier operation, timing, showing activity of distinct adders, and resultant “computational wavefront.”

C. A Bit-Serial Multiplier

We shall refer to the inputs to a multiplier as the multiplicand (n bits) and the coefficient (m bits); fixed-point, two’s complement arithmetic is used. Any bit-serial multiplier is constructed as a set of stages through which the data flow sequentially, the number of stages being determined by the coefficient length [8]. Therefore, any multiplier is restricted to multiplying fixed-length coefficients by multiplicands of arbitrary length. It is not efficient to attempt to specify a single multiplier to satisfy all needs, as this will result in a “lowest common denominator” approach, yielding nonoptimal implementations of particular processing functions.

We have developed a set of subcells to enable the design of a family of multipliers. The computational latency of a bit-serial multiplier is either 2n (for a “straightforward” serial-parallel-serial multiplier) or 1 + 3n/2 (for a Modified Booth’s “recoding” multiplier). We have concentrated on a nonrecoding multiplier family, as this allows for tighter data pipelining due to the lack of the sign bit extension required by a Booth operator [8]. As an example we present a full-precision complex operator handling bipolar data with no sign bit extension (i.e., not even the singly extended sign bit required by a normal serial-parallel-serial multiplier), as this represents a typical, tightly pipelined cell. It is capable of multiplying n-bit multiplicands by m-bit coefficients and of being clocked at a maximum of 20 MHz. An incidental benefit of using a nonrecoding multiplier is the enhancement of the random pattern testability of the processors in which it is included, although this is inherently high even when a Booth’s algorithm operator is used [9].

The algorithm is that of a serial pipeline multiplier [8], with some extra circuitry to obviate the need for sign bit extension (two XOR gates for each of the first m – 1 multiplier stages). There is also some extra multiplexing to capture the n – m – 1 lowest significant bits of the output, which are normally “thrown away” in a standard pipeline multiplier, to form the n + m – 1 full-precision product.

Fig. 8. Organization of four serial multipliers to perform complex calculation of (S + iS′) × (R + iR′). [Graphic not reproduced; its labels include a shared delay line section and Im(product) = S′R + SR′.]

The operation of a single multiplier (n = 12, m = 8) is shown in Fig. 7. Each atomic operation consists of a full bitwise addition, in which the sum(out) from adder(i) corresponding to inputs present at time(t) is passed to adder(i + 1) as its sum(in), and the carry(out) from adder(i) is saved to be used by adder(i) itself in the next time-frame as carry(in). The only disturbance to this activity occurs at Least Significant Bit time, when an additional XOR is used to calculate the “wrap around” sum(in).
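
The sum-forward, carry-saved behavior of the adder stages can be illustrated with a small model. This is a simplified sketch for unsigned operands only: it does not model the authors’ sign-handling XOR circuitry, the “wrap around” sum, or the pipeline latency between stages, and all function names are hypothetical.

```python
def to_bits(value: int, width: int) -> list[int]:
    """Unsigned integer to an LSB-first bit list of the given width."""
    return [(value >> t) & 1 for t in range(width)]


def serial_add(sum_in: list[int], addend: list[int]) -> list[int]:
    """One adder stage: add two LSB-first streams, saving the carry each bit-time."""
    carry, sum_out = 0, []
    for s, a in zip(sum_in, addend):
        total = s + a + carry
        sum_out.append(total & 1)  # sum(out) goes on to the next stage
        carry = total >> 1         # carry(out) is reused by this stage next bit-time
    return sum_out


def serial_multiply(multiplicand: int, coefficient: int, n: int, m: int) -> int:
    """Unsigned n-bit by m-bit product formed by m cascaded serial adders."""
    width = n + m                  # enough bit-times for the full product
    x = to_bits(multiplicand, n)
    stream = [0] * width           # sum(in) of the first stage
    for i, c in enumerate(to_bits(coefficient, m)):
        # Partial product: the multiplicand delayed by i bit-times, gated by bit i.
        partial = [0] * i + [b & c for b in x] + [0] * (width - n - i)
        stream = serial_add(stream, partial)
    return sum(bit << t for t, bit in enumerate(stream))


if __name__ == "__main__":
    print(serial_multiply(214, 211, 8, 8), 214 * 211)
```

Each stage keeps only one bit of carry state, which is why the number of stages depends on the coefficient length while the multiplicand length is unconstrained.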

Fig. 8 shows how a complex multiplier is formed, effectively, from four serial multipliers, which share some of the partial product formation circuitry in a central (delay line) section. The four components of the complex product are then combined in a final add/subtract section.

The latency of the multiplier, to presentation of the least significant bit of the product, is m + 3 bits, of which one bit is attributable to the addition of the four raw product components to form the complex product.

Fig. 9. Schematic diagram and layout of a 4-bit complex full-precision bit-serial multiplier. [Graphic not reproduced; its labels include LS WORD, MS WORD, SR – S′R′ (REAL), S′R + SR′ (IMAG), and FORM REAL AND IMAGINARY PARTS OF PRODUCT.]

Fig. 10. Schematic diagram and layout of a 64 (12-bit) FIFO word delay. Parallel-in/serial-out (PISO) shift register. Serial-in/parallel-out (SIPO) shift register. [Graphic not reproduced; its labels include INPUT SIPO, INPUT LATCH, DATA INPUT, 64-BIT ADDRESS SHIFT REGISTER, 64 × 12 RAM CELLS, SUPPLY BUS, OUTPUT PISO, and DATA OUTPUT.]

In Fig. 9, we show the detailed multiplier architecture and tessellation scheme, and its layout (m = 4). The multiplicand (S + iS′) and coefficient (R + iR′) words flow through a central shift register network, within which the partial products are formed. The partial products are subsequently summed in four separate, parallel data streams, to form Least and Most significant words (Ls word and Ms word) of SR, SR′, S′R, S′R′. These are added and subtracted together to form the components of the complex product [SR – S′R′ + i(S′R + SR′)] in a final stage. There are n-bit Ls words, m-bit Ms words, and only n + m – 1 bits of product. When m ≤ n, the upper n – m + 1 bits of the product Ms word are sign extensions, which allow for arithmetic growth in subsequent calculations. An m-bit multiplier is formed by m – 1 ADD sections and one SUBTRACT section, followed by the final real/imaginary generator. The individual partial product summation sections, which form the bulk of the multiplier, are identical to those which would be used for a noncomplex multiplier, illustrating the modularity of the approach. A complete 16-bit complex multiply operator comprises some 10 000 transistors.
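
The final add/subtract step can be checked with ordinary integer arithmetic. The sketch below forms the four real products and combines them as SR – S′R′ and S′R + SR′; it says nothing about the serial hardware itself, and the function name is hypothetical.

```python
def complex_multiply(s: int, s_im: int, r: int, r_im: int) -> tuple[int, int]:
    """Return (real, imag) of (s + i*s_im) * (r + i*r_im) from four real products."""
    sr, s_im_r_im = s * r, s_im * r_im     # products feeding the real part
    s_im_r, sr_im = s_im * r, s * r_im     # products feeding the imaginary part
    return sr - s_im_r_im, s_im_r + sr_im  # final subtract and add


if __name__ == "__main__":
    print(complex_multiply(3, -2, 5, 7), (3 - 2j) * (5 + 7j))
```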

D. Other Library Elements

The other computational elements necessary to provide a comprehensive primitive set for a wide range of signal processing functions (add, subtract, multiplex, etc.) are realized in a similar manner to the multiplier family, as are formatting functions. They form, in general, similar pipelined structures, in which modularity is almost unavoidable.

The other major requirement in implementing bit-serial signal processing operators is for bit- and word-oriented First-In, First-Out (FIFO) storage blocks (signal and control delay elements). For small numbers of (bit) delays, a simple shift register suffices. However, this rapidly becomes inefficient for larger (word) delays. Instead, a Random Access Memory is used, consisting of an array of three-transistor dynamic RAM cells, each occupying 25 × 25 µm, along with peripheral control, drive, and output circuitry. The RAM is constructed in a manner consistent with the remainder of the cell library, such that any reasonable size and organization of FIFO memory may be straightforwardly assembled.

Fig. 10 shows a 64-word by 12-bit FIFO RAM. Data do not actually move in the RAM. Rather, they are accessed and written according to a recirculating address, which appears at the top of the figure.
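
A word delay built this way behaves like a circular buffer: the stored words stay in place and only the address advances. The class below is a minimal behavioral sketch of that idea, assuming one read and one write per word-time at the same address; the class and method names are hypothetical.

```python
class FifoWordDelay:
    """Fixed word delay built on a RAM and a recirculating address."""

    def __init__(self, depth: int = 64, initial: int = 0):
        self.ram = [initial] * depth  # stored words never move between cells
        self.address = 0              # models the recirculating address

    def clock(self, word_in: int) -> int:
        """Read the oldest word, overwrite it with word_in, advance the address."""
        word_out = self.ram[self.address]
        self.ram[self.address] = word_in
        self.address = (self.address + 1) % len(self.ram)
        return word_out


if __name__ == "__main__":
    fifo = FifoWordDelay(depth=4)
    print([fifo.clock(w) for w in range(1, 9)])
```

Each word written reappears at the output exactly `depth` word-times later, regardless of the memory size.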

IV. TESTABILITY

Testing complex digital CMOS circuits is a problem, due to the need for deterministically derived test vector sequences to sensitize and detect “stuck-open” faults [10]. As circuit complexities increase, random pattern testing and self-test represent elegant and efficient methods of testing VLSI devices and systems. However, the intrinsically random nature of the test vectors conflicts with the sequencing required for high coverage of stuck-open faults in CMOS. We have already indicated that fault-tolerance and self-test are of great importance to our overall design philosophy [9]. It is necessary, therefore, that we use a CMOS circuit design style which circumvents the stuck-open problem, if we are to preserve the valuable testable properties of bit-serial architecture.

Fig. 11. (a) Static and (b) dynamic CMOS NOR gates, with the truth table for fault-free gates, and with transistors 4 and 6 stuck open, respectively. (Tick = closed, X = open.)

Fig. 12. Dynamic CMOS (mixed-gate) adder. [Graphic not reproduced; its terminals are labeled A(in), B(in), C(in), SUM, and CARRY.]

A. Stuck-At and Stuck-Open Faults

The single-stuck-at fault model involves comparing the results of a “good machine” simulation with those of the same circuit with each node alternately stuck-at logical 1 and 0. A fault is “covered” by the test if it causes the output of the circuit to differ from that of a good machine. This crude approach has been successful in developing tests for nMOS circuits. However, stuck-open faults, whereby a transistor’s drain and source are permanently disconnected from each other, create parasitic latches in static CMOS circuits, and the single-stuck-at model is inadequate [10].

A static CMOS gate relies on the Boolean logic function of the p-channel pullup tree being the dual of the n-channel pulldown function. For the two-input static NOR gate shown in Fig. 11(a), if any of the transistors 1–4 is stuck open, this duality is broken for at least one of the minterms in the expansion of the function. For these terms, there is neither a pullup nor a pulldown for the output node C, and the previous output value is effectively stored. The truth table with transistor 4 stuck open shows that this stuck-open fault will only be detected if the sequence AB = 00 to AB = 01 occurs, as otherwise a “correct” C = 0 will already be latched at the output. Therefore, a SEQUENCE of input combinations is required to detect a stuck-open fault. This is always true for irredundant static CMOS gates, and the problem is exacerbated as fan-in increases and the gate’s output becomes less symmetric.

Fig. 11(b) shows a mixed-gate dynamic NOR gate. A stuck-open fault on either of the clocked transistors causes a permanent logical 1 or 0 to appear at the output, and the truth table in Fig. 11(b) shows that transistor 6 stuck open causes the output for AB = 01 to be wrong.

Fig. 13. Comparison between the fault coverage for static and dynamic full adders, from probabilistic analysis (solid lines) and simulation experiments (broken lines). [Graphic not reproduced; the horizontal axis is the number of random test vectors.]

The sequencing problem does not exist for stuck-open faults in dynamic gates. For every stuck-open fault, there will always be at least one single input combination for which the output of a dynamic gate is wrong. This has the effect of increasing greatly the testability of circuits designed in this way, as will be shown in the following section.
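
The difference between the two gate styles can be seen in a switch-level sketch. The static model retains its previous output whenever neither network conducts, while the simplified dynamic model is preset every cycle. The transistor labels ('pA', 'pB', 'nA', 'nB') are hypothetical and do not follow the numbering of Fig. 11, and the dynamic gate is a generic precharged n-type gate rather than the authors’ exact circuit.

```python
def static_nor(a, b, previous, stuck_open=frozenset()):
    """Two-input static CMOS NOR with optional stuck-open transistors.

    'pA' and 'pB' are the series p-channel pullups;
    'nA' and 'nB' are the parallel n-channel pulldowns.
    """
    pullup = (a == 0 and "pA" not in stuck_open) and (b == 0 and "pB" not in stuck_open)
    pulldown = (a == 1 and "nA" not in stuck_open) or (b == 1 and "nB" not in stuck_open)
    if pullup:
        return 1
    if pulldown:
        return 0
    return previous  # floating output node keeps its stored charge


def run_static(sequence, stuck_open=frozenset(), initial=0):
    """Apply a sequence of (a, b) inputs and return the list of outputs."""
    out, trace = initial, []
    for a, b in sequence:
        out = static_nor(a, b, out, stuck_open)
        trace.append(out)
    return trace


def dynamic_nor(a, b, stuck_open=frozenset()):
    """Precharged NOR: the node is preset high every cycle, so no history is kept."""
    node = 1
    if (a == 1 and "nA" not in stuck_open) or (b == 1 and "nB" not in stuck_open):
        node = 0
    return node


if __name__ == "__main__":
    fault = frozenset({"nB"})
    for seq in ([(1, 0), (0, 1)], [(0, 0), (0, 1)]):
        print(seq, run_static(seq), run_static(seq, fault))
    print(dynamic_nor(0, 1), dynamic_nor(0, 1, fault))
```

With the pulldown on B stuck open, the static gate gives the wrong answer for AB = 01 only when the preceding vector was AB = 00, whereas the dynamic gate is wrong for AB = 01 on every cycle.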

B. Test Coverage

Let us look at the probability of detection of stuck-open faults in static and dynamic circuits. If a gate has k inputs, there are $2^k$ possible input combinations. Any particular COMBINATION has therefore a probability of occurrence in any one cycle of a random sequence of $2^{-k}$, and any given SEQUENCE of two combinations has a probability of occurrence of $2^{-2k}$. It can be shown [11] that N(m), the number of faults detected in a circuit containing l transistors by the m’th cycle, is

$$N(m) = l - \sum_{j=1}^{l} \left[ 1 - \frac{1}{2^{p\,k_j}} \right]^{m} \tag{2}$$

where $k_j$ is the number of inputs to the gate containing transistor j. For static gates, p = 2, and for dynamic gates, p = 1, since in the case of a dynamic gate, detection of a fault is independent of sequence and depends only on the occurrence of an input combination.
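
The effect of the exponent p on test length can be illustrated numerically. The sketch below assumes the coverage expression has the form N(m) = l − Σ [1 − 2^(−p·k)]^m, summed over the transistors, with per-cycle detection probabilities of 2^(−k) for dynamic gates and 2^(−2k) for static gates as stated above. The example circuit (ten transistors, each in a three-input gate) is hypothetical and is not the full adder analyzed in the paper.

```python
def expected_detected(inputs_per_transistor, m, p):
    """Expected number of faults detected after m random vectors."""
    return sum(1 - (1 - 2.0 ** (-p * k)) ** m for k in inputs_per_transistor)


def vectors_for_coverage(inputs_per_transistor, p, target=0.95):
    """Smallest m whose expected coverage reaches the target fraction."""
    total, m = len(inputs_per_transistor), 0
    while expected_detected(inputs_per_transistor, m, p) < target * total:
        m += 1
    return m


if __name__ == "__main__":
    gates = [3] * 10  # hypothetical: ten transistors, each in a 3-input gate
    print("dynamic:", vectors_for_coverage(gates, p=1))
    print("static: ", vectors_for_coverage(gates, p=2))
```

Under these assumptions the static circuit needs several times as many random vectors as the dynamic one to reach the same expected coverage.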

C. Comparison With Experiment

In this context, comparison is made between the probabilistic expressions derived above and the results of simulation “experiments” (cf. [12]) using pseudorandom patterns, with respect to a full adder circuit. A full adder is a useful piece of circuitry which does not consist of too many transistors to be analytically manageable. The mixed-gate dynamic CMOS circuit for a full adder is shown in Fig. 12. In Fig. 13 are shown the forms of (2) (solid lines). The results of simulation experiments in which the respective circuits are subjected to three different pseudorandom input sequences, with injected stuck-open faults, are also shown (broken lines). The agreement between theory and experiment is gratifying, since the small number of transistors present makes the simulated coverage graphs far from smooth. It is anticipated that the agreement for larger and more complex blocks of circuitry will be even better, as the increase in the number of circuit nodes decreases the significance of individual events. Unfortunately, the increase in transistor count quickly makes an accurate analytical treatment tedious and prohibitively time consuming.

The results shown for this simple example demonstrate that dynamic CMOS circuits in general are much more random pattern testable than their static counterparts for gates of any appreciable complexity. The fact that, even for a relatively simple dynamic adder cell, the test length for the all-important 95–100-percent coverage goal is less than half that for a static gate is an extra reason to decide that dynamic design techniques in general, and mixed-gate dynamic design in particular, offer a great deal in the implementation of efficient VLSI CMOS systems.

`V.
`
`SUMMARY
`
`We have described the methods behind the design of a
`comprehensive CMOS cell library of primitive operators
`for a wide variety of bit-serial signal processing applications.
`We have also indicated how these cells may be
`integrated into the environment of the FIRST silicon compiler,
`such that we are able to implement signal processors
`in silicon rapidly, efficiently, and with a high degree of
`confidence in their correctness. With bit-rates of 20 MHz,
`and the low-power consumption and high density afforded
`by the dynamic CMOS design technique, we have improved
`upon the previous nMOS operator capability by
`almost trebling the speed, and increasing by a factor of
`around 10–20 the functional throughput per unit silicon
`area. We have also preserved the property of high testability
`and amenability to self test exhibited by the prototype
`nMOS cells in the presence of the normally problematic
`CMOS stuck-open faults.
`Major application areas are apparent in:
`
`a) SONAR and RADAR filtering;
`b) real-time FFT systems;
`c) filters and equalizers for digital telecommunications; and
`d) image-processing filters for image enhancement.
`
`ACKNOWLEDGMENT
`
`The authors are grateful for the invaluable advice and
`comments of W. Donaldson and D. Renshaw. FIRST was
`developed in the Department of Electrical Engineering at
`the University of Edinburgh. The CMOS cell library has
`been developed cooperatively in the Wolfson Microelec-
`tronics Institute, University of Edinburgh.
`
