`
A. F. MURRAY and P. B. DENYER, "A CMOS Design Strategy for Bit-Serial Signal Processing," IEEE JSSC, Vol. 20, Issue 3, pp. 746-753 (1985)
`
`TRW Automotive U.S. LLC: EXHIBIT 1047
`PETITION FOR INTER PARTES REVIEW
`OF U.S. PATENT NUMBER 8,599,001
`IPR2015-00436
`
`
`
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-20, NO. 3, JUNE 1985
`
A CMOS Design Strategy for Bit-Serial Signal Processing

ALAN F. MURRAY AND PETER B. DENYER, MEMBER, IEEE
`
Abstract: We present a summary of the features and successes of a Silicon Compiler (FIRST) for LSI nMOS bit-serial signal processors. A replacement cell library of CMOS operators has been designed for the compilation of true VLSI bit-serial signal processors. The cell library is implemented in 2.5-μm bulk CMOS technology, and maintains a consistent performance of 20 MHz. We describe the design philosophy and style behind the CMOS cells, detailing the dynamic logic style used, its layout and testability. As an example of the capability of the library, we discuss a full-precision complex multiplier.
`
I. INTRODUCTION
`
DESIGN complexity has become a dominant cost limit in the development of VLSI systems. Without new design methods and tools, advances in manufacturing capability and algorithm development will far exceed our capacity for design [1]. This effect is nowhere more pronounced than in the field of real-time signal processing; the continuous flow of data, together with the complexity of many of the algorithms, imposes severe computational demands that often cannot be satisfied by general purpose machines or components.
We have addressed the issue of design complexity through the development of a powerful system synthesis tool: a silicon compiler, called FIRST. FIRST is more specialized than are most other silicon compilation/autolayout systems. This restricts its application (to real-time signal processing tasks), but at the same time permits functional integration density that is closer to hand-crafted custom than to other examples of automatic layout generation.
We describe a cell library implemented in 2.5-μm bulk CMOS technology (two-layer metal) whose operators are amenable to either silicon compilation or manual assembly to generate bit-serial signal processing functions, analogous to those served by FIRST.
Section II of this paper constitutes a summary of the aims, capabilities and achievements of the FIRST Silicon Compiler, as implemented in 5-μm nMOS technology. Section III describes the reimplementation of the cell library in 2.5-μm CMOS technology, giving details of the design and layout style, and illustrates the library's scope and performance by detailing the design of two of the major operators. Section IV shows how the inherently high level of testability exhibited by bit-serial primitives is carried through into the normally problematic regime of stuck-open faults in CMOS by the use of a dynamic design style.

Manuscript received November 5, 1984; revised January 16, 1985.
The authors are with the Department of Electrical Engineering, University of Edinburgh, The King's Buildings, Mayfield Road, Edinburgh EH9 3JL, Scotland.
`
II. THE FIRST SILICON COMPILER
`
FIRST synthesizes signal processing chips and systems exclusively using bit-serial architectures. This architectural restriction has significant consequences in the simplicity and success of the environment. We shall later identify how restrictions and conventions such as these are key to a successful design methodology.
Bit-serial architectures [2] have major advantages over bit-parallel alternatives, especially in the areas of signal routing and communication (issues which are often important in signal processing applications), and in efficiency of computation through bit-level pipelining. These features are exemplified in a range of case studies given later.
`
`A. An Example
`
We begin with a simple example system to give an impression of the scope and use of FIRST. For the purpose of this example we consider an implementation of the four-region approximation [3] to the magnitude M of a complex number A + jB:

$$M = \mathrm{greater}\left[\left(\tfrac{7}{8}G + \tfrac{1}{2}L\right),\ G\right] \tag{1}$$

where

$$G = \mathrm{greater}[|A|, |B|], \qquad L = \mathrm{lesser}[|A|, |B|].$$
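As a numerical illustration, the approximation can be sketched in a few lines of Python. This is a minimal sketch, assuming the form suggested by the operators in Fig. 1 (a scale by 1/2, a scale by 1/8, a subtract, an add, and two ordering operators), that is, M = greater[(7/8)G + (1/2)L, G]. The function name is hypothetical and floating-point arithmetic stands in for the bit-serial fixed-point data path.

```python
def magnitude_four_region(a: float, b: float) -> float:
    """Approximate |a + jb| as greater(7/8*G + 1/2*L, G).

    G and L are the greater and lesser of |a| and |b|.
    """
    g = max(abs(a), abs(b))   # first ORD operator: greater of the two
    l = min(abs(a), abs(b))   # first ORD operator: lesser of the two
    seven_eighths_g = g - g / 8.0   # SUB of G and its 3-bit right shift
    half_l = l / 2.0                # 1-bit right shift of L
    return max(seven_eighths_g + half_l, g)   # ADD, then final ORD


if __name__ == "__main__":
    print(magnitude_four_region(3, 4))   # exact magnitude is 5
    print(magnitude_four_region(-8, 0))  # exact magnitude is 8
```

The only arithmetic needed is shifting, adding, and comparing, which is why the algorithm maps so directly onto a small set of bit-serial primitives.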
`
We view this algorithm as a functional flow-graph, as shown in Fig. 1. The oblong shapes are functional elements that perform fixed operations on the (bit-serial) data flowing through them. Now bit-serial elements exhibit a delay or latency of integer numbers of bit-times. It is important to equalize these delays and synchronize data paths entering each element. Simple delay elements are shown as circles in Fig. 1. We have deliberately reduced this algorithm to a network of functional operators (add, scale, delay, etc.) which are supported by FIRST. We call this the set of primitive operators. FIRST supports primitives from
0018-9200/85/0600-0746$01.00 ©1985 IEEE
`
`
[Figure 1: flow-graph. Operators shown: ABS [24,0] (two instances), ORD [24,0,0] (two instances), DSHIFT [1,0], DSHIFT [3,0], SUB [1,0,0,0], ADD [1,0,0,0].]

Fig. 1. Flow-graph of complex-to-magnitude example. Simple delay elements are shown as circles, while computational operators are represented by name.
`
which all systems are constructed, which form a finite set of relatively high-level functions, well above the transistor or gate level.
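The delay-equalization step described above can be sketched as a small scheduling pass over the flow-graph. This is an illustrative sketch only: the node names loosely follow Fig. 1, but every latency value below is a hypothetical placeholder rather than a figure from the FIRST primitive library, and the function name is an assumption.

```python
def equalize_delays(graph, latency):
    """Compute arrival times and the padding delay needed on each edge.

    graph:   dict mapping node -> list of input nodes, listed in
             topological order (inputs before the nodes that use them).
    latency: dict mapping node -> latency in bit-times.
    """
    arrival, padding = {}, {}
    for node, inputs in graph.items():
        # The operator can only start when its slowest input arrives.
        start = max((arrival[src] for src in inputs), default=0)
        for src in inputs:
            # Earlier inputs need a delay element of this many bit-times.
            padding[(src, node)] = start - arrival[src]
        arrival[node] = start + latency[node]
    return arrival, padding


# Flow-graph loosely modelled on Fig. 1; all latencies are hypothetical.
graph = {
    "re": [], "im": [],
    "absre": ["re"], "absim": ["im"],
    "order1": ["absre", "absim"],     # yields max and min
    "halfmin": ["order1"],            # DSHIFT by 1
    "eighth": ["order1"],             # DSHIFT by 3
    "sub": ["order1", "eighth"],      # max - max/8
    "add": ["sub", "halfmin"],
    "order2": ["add", "order1"],      # greater of sum and delayed max
}
latency = {"re": 0, "im": 0, "absre": 1, "absim": 1, "order1": 2,
           "halfmin": 1, "eighth": 3, "sub": 1, "add": 1, "order2": 2}

arrival, padding = equalize_delays(graph, latency)
for edge, bits in padding.items():
    if bits:
        print(edge, "needs a delay of", bits, "bit-times")
```

Each nonzero padding value corresponds to one of the circular delay elements in a flow-graph such as Fig. 1.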
FIRST is a tool which takes this functional flow-graph as input, instantiates the required set of bit-serial primitives, assembles these primitives, and routes the network of flow-graph connections between them. Finally, FIRST wires input/output pads, and power and clock services to complete a custom chip design. In effect this environment acts as a compiler, delivering low-level code (in this case VLSI mask artwork) from an initial high-level program (the flow-graph), while simultaneously simulating the chip's function.
`The corresponding FIRST implementation of the com-
`plex-to-magnitude
`algorithm in silicon is shown in Fig. 2.
`As a VLSI device this is not particularly impressive;
`it
`contains only a few thousand transistors. However, as a
`demonstration of the design methodology it is; our design
`cycle was very fast and required no intimate knowledge of
`integrated circuit design.
The primitives themselves are irregular blocks of high-density layout, with I/O strategies configured to suit the routing channel convention. However, they are not held in FIRST as single "macrocells," but rather as collections of leaf cells and procedures for their assembly. In this way the primitive set is parametrized so that users may select, for example, the size of a multiplier, the length of a delay element, etc.
This simple floor-plan style has proved effective in efficiently implementing many of the systems we have attempted. Later we show a selection of such designs which exemplify this feature. We find a typical overhead of around 25-30 percent of unused silicon area.
`
Fig. 2. FIRST (5-μm nMOS) chip synthesized from Fig. 1.
`
`B. Secrets of Success
`
`The nMOS version of FIRST has been in use at Edin-
`burgh University for some 18 months. Several designers
`have completed complex system designs in remarkably
`short timescales, vindicating our aims and approach.
`We begin by setting rigorous conventions covering all
`aspects of the timing, electrical and numerical formats of
`signals throughout the system. If we now implement a base
`set of primitives that uniformly respect these conventions,
`then any connection of these primitives will communicate
`successfully, without requiring any further design attention
`to be devoted to the individual
`interfaces. Thus we can
`connect a flow-graph of elements, as in Fig. 1, and auto-
`matically achieve correct hardware functionality.
`These advantages extend to elegant hierarchical design,
`because the communication conventions are automatically
`inherited by any network or subnetwork built exclusively
`from such components. Thus we are able to construct
`higher level operators as general hardware “routines.”
`These features make for a very speedy system design
`environment.
`
`C. A Resume of Case Studies
`
Table I summarizes the scope of applications addressed by FIRST as a selection of system design case studies. The information on transistor and chip counts relates to 5-μm nMOS technology. The advent of a CMOS primitive library (as discussed below) dramatically improves packing density and chip counts.
`
`
`TABLE I
`SYSTEM CASE STUDIES
`
Columns: Computation (MegaOps/sec) | FIRST Chip Designs | Total Chips | Transistors (K) | FIRST Code (lines)
`250
`252
`3
`18
`55
`
`10
`
`15
`
`150
`
`5
`
`1
`
`6
`
`5
`
`1
`
`30
`
`7.5
`
`43
`
`160
`
`500
`
`160
`
`300
`
Systems: Speech Echo Canceller | Adaptive Lattice Filter | LDI Filter, 5th Order | FFT, 16-point, 8 MHz
`
`Most significantly this table shows that impressive total
`computation rates can be realized by many slow bit-serial
`elements operating in parallel. The final two columns pro-
`vide a comparison of the size of the system description
`code (FIRST input), and the transistor count of the result-
`ing system.
`
III. RE-IMPLEMENTATION IN CMOS TECHNOLOGY
`
Significant gains may be expected in mapping the techniques that we have demonstrated in nMOS technology to an advanced 2.5-μm CMOS technology. Our design style uses a mixture of dynamic and static logic, using dynamic techniques where the benefits of the increased speed and reduction in transistor count (and therefore silicon area) can be realized and are significant. Where the fan-in of a logic gate is low, a static implementation may involve no more transistors than a dynamic one, particularly if the number of minterms in the Boolean expansion of the gate's function is also low. In such cases we use a static gate, to avoid additional clock distribution. We use dynamic latches everywhere. The dynamic logic structure used is based on that known variously as "NP" or "NORA" CMOS, modified to render it less sensitive to clock edge times [4], [5].
`
`A. Design Style
`
The logic evaluator section of a gate consists of either n- or p-type transistors. The desired function is evaluated by "discharging" a precharged capacitance, conditional on either a positive logic (n-gate) or negative logic (p-gate) function of the gate's input variables. The technique is related to the "Domino CMOS" style, in which only n-gates are involved [6]. The mixed-gate approach is superior in that n- and p-gates may be cascaded without the need for the intervening inverter of Domino CMOS, and greater versatility is brought about by the positive and negative logic functions. In fact n-p, Domino, and static gates may be mixed, provided the rules for inclusion of each type are obeyed.
Fig. 3(a) shows an exclusive-NOR (XNOR) gate taken from part of the library. The form of the dynamic latch
`
[Figure 3: circuit and timing panels. Timing labels: sample A, B while C, D are preset; hold A, B while C, D are evaluated; sample the output.]

Fig. 3. Dynamic (mixed-gate) exclusive NOR gate.

[Figure 4: SPICE waveforms for nodes C and D with slow clock edges, for A = 1, B = 1.]

Fig. 4. Data corruption due to slow clock edges in NORA/NP-CMOS.
`
used can be seen from Fig. 3(b) to consist of a "clocked inverter." Input values are sampled during $\phi_i$, during which time the nodes C and D are precharged (or, more correctly, preset) low and high, respectively. During $\bar{\phi}_i$, these nodes are conditionally "discharged," depending on the voltages on the gates of the transistors in the evaluation logic tree.
During the evaluation phase ($\phi_i = 0$), node C is pulled high if A and B are both low, forming the minterm $\bar{A}\bar{B}$. Simultaneously, node D is discharged, conditional on either C or both A and B being high, forming the exclusive OR function. During $\phi_j$, this value is sampled by the clocked inverter to give a "held" XNOR during $\bar{\phi}_j$. The effect of this is shown in Fig. 3(c) and 3(d). In common with all synchronous design systems, the output of a complete gate is presented one full clock cycle after its inputs. In addition to the $\phi_i \rightarrow$ [n-p-gates] $\rightarrow \phi_j$ ordering illustrated, gates may also be constructed which sample inputs on $\phi_j$, and change outputs on $\phi_i$. Furthermore, n- and p-sections may be cascaded internal to a gate provided ripple-through times do not become excessive. With these capabilities, extremely tight pipelining of logic may be achieved.
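The preset/evaluate/sample sequence of the mixed-gate XNOR can be followed step by step in a purely logical model. The sketch below is an assumption-laden illustration: it models only the Boolean behavior of nodes C and D as described in the text (C preset low, D preset high), ignores all timing and electrical effects, and uses a hypothetical function name.

```python
def dynamic_xnor(a: int, b: int) -> int:
    """Logical model of one cycle of the mixed-gate dynamic XNOR."""
    # Preset phase: inputs are sampled, C is preset low and D preset high.
    c, d = 0, 1

    # Evaluation phase, p-section: C is pulled high if A and B are both low.
    if a == 0 and b == 0:
        c = 1

    # Evaluation phase, n-section: D is discharged if C is high
    # or if A and B are both high. D now holds A XOR B.
    if c == 1 or (a == 1 and b == 1):
        d = 0

    # Sample phase: the clocked inverter inverts D and holds the XNOR.
    return 1 - d


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, dynamic_xnor(a, b))
```

Because each node is only ever moved away from its preset value, a single evaluation network per node is enough, which is the source of the transistor saving over a static gate.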
In the original n-p or NORA scheme, $\phi_j$ is simply $\bar{\phi}_i$. This apparently attractive simplification carries the danger
`
Fig. 5. Corruption of dynamically stored logic values due to capacitive (gate-drain) coupling.
`
that slow clock edges can result in corruption of the held output values from a gate, as the precharge nodes begin to be precharged before the output is fully held. This effect is shown by SPICE simulation in Fig. 4. The onset of the precharge on node "D" in the XNOR circuit of Fig. 4 causes a corruption of the incompletely "held" value on node C (the output of the clocked inverter), if the clock edges are "slow." Our simulations show that clock rise/fall times faster than 3 ns are required (appreciably less than a typical gate delay) if undesirably complicated rules for cell design are to be avoided.
`Caution must, however, be exercised when designing
`dynamic sections even when our more conservative two-
`phase (plus inverses) nonoverlapping clocking scheme is
`adopted. The major effects to be avoided are as follows.
`
Charge Redistribution
`
`Charge sharing can occur between the nodes of an n- or
`p-section which has inputs from a preceding p- or n-section
`(i.e., from an “internal” precharge/discharge node, rather
`than from a fully sampled and held node). This may be
`avoided by careful placement of such inputs within the
`logic evaluation tree such that they are as far as possible
`from the precharge node of the gate [5].
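The size of the charge-sharing disturbance follows from charge conservation between the precharged node and the internal node it becomes connected to. The sketch below is a generic calculation, not taken from the cell library: the capacitance and voltage values are hypothetical, chosen only to show the scale of the effect.

```python
def shared_voltage(c_node, v_node, c_internal, v_internal):
    """Final voltage after two capacitances are connected together.

    Total charge is conserved: V = (C1*V1 + C2*V2) / (C1 + C2).
    """
    return (c_node * v_node + c_internal * v_internal) / (c_node + c_internal)


if __name__ == "__main__":
    # Hypothetical values: a 50 fF precharged node at 5 V shares charge
    # with a 10 fF internal node of the logic tree sitting at 0 V.
    v = shared_voltage(50e-15, 5.0, 10e-15, 0.0)
    print(f"precharged node droops to {v:.2f} V")
```

The larger the internal capacitance exposed to the precharge node, the larger the droop, which is why such inputs are best kept far from the precharge node.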
`
Corruption of Sampled Values at the Inputs to a Dynamic Section

Capacitive coupling (gate-drain) to the precharge node of a gate can cause corruption of the dynamically held values at the gate's inputs. In Fig. 5, the effect of a rapid pulldown of node F is shown, in the partial corruption of the dynamically held logical 1 on node D. Choice of transistor sizes within a logic tree matched to the capacitance of the input node minimizes this effect.
`
`Clock Breakthrough Leading to “Charge-Pumping”
`
The effect of gate-drain coupling at a dynamic latch (i.e., gate-drain coupling at the clocked transistors) can lead to voltages outside the supply rails. This may be logically insignificant, but creates the circumstances within which latchup can occur. This effect only comes into play when there is some stagger between the clock and its inverse, which there usually will be. The result is that the clocked inverter's output ripples around its correct value. In Fig. 5, an example of this form of charge-pumping is shown, where node C is made to ripple around its correct value of 5 V by the breakthrough of $\phi_j$ and $\bar{\phi}_j$. Again, carefully chosen transistor sizes minimize the coupling while preserving speed.
These rules are not difficult to apply, and the clock signals are not as difficult to generate or distribute over the device as would be signals requiring a maximum limit of 3 ns on clock edges.
`
`B. Layout Style
`
`A structured layout style is used for the design of the cell
`library, to ease checking and to impose some consistency in
`the layout of low-level gates. The layout style is a relaxed
`version of that known as “gate-matrix”
`layout [7], within
`which the p-channel devices of a logic gate are grouped
`together and the n-channel devices are similarly grouped to
`minimize the effect of the large minimum p-channel-to-n-
`channel device separation [7]. Furthermore, the devices are
`arranged in rows with the source–drain direction horizon-
`tal (say), such that polysilicon gate “wires” may run verti-
`cally between p- and n-channel rows.
`The resultant layout is dense, neat, and amenable to
`pitch-matching between gates, particularly with regard to
`power and clock distribution. In Fig. 6, a schematic repre-
`sentation is given, showing:
`
a) (mainly horizontal) strips of transistors forming the logic, or as in this example, the clocked inverter structure;
b) (mainly vertical) polysilicon, forming transistor gates and short-range intergate connections;
c) (mainly vertical) first-layer metal, for longer-distance intercell interconnect, and to span the p-type to n-type boundary;
d) (almost exclusively horizontal) second-layer metal, for supply and clock lines, utilizing a global vertical pitch. Thus, low-level entities fit together like LEGO bricks, recouping in denser inter-cell interconnect any loss incurred in adopting a fixed pitch.
`
`
[Figure 7: timing diagram. Each adder cell has inputs SUM IN, A, and CARRY IN, and outputs SUM OUT and CARRY OUT. Rows shown: MULTIPLICAND, COEFFICIENT, ADDER 1 to ADDER 7, SUBTRACTER, MULTIPLICAND × COEFFICIENT.]

Fig. 7. Details of serial multiplier operation, showing timing, activity of distinct adders, and resultant computational "wavefront."
`
`C. A Bit Serial Multiplier
`
`We shall refer to the inputs to a multiplier as the
`multiplicand
`(n-bits), and the coefficient
`(m-bits), and
`fixed point,
`two’s complement arithmetic is used. Any
`bit-serial multiplier is constructed as a set of stages through
`which the data flow sequentially, the number of stages
`being determined by the coefficient
`length [8]. Therefore,
`any multiplier is restricted to multiplying fixed-length coef-
`ficients by multiplicands
`of arbitrary length.
`It is not
`efficient to attempt to specify a single multiplier to satisfy
`all needs, as this will result in a “lowest
`common de-
`nominator”
`approach, yielding nonoptimal
`implementa-
`tions of particular processing functions.
We have developed a set of subcells to enable the design of a family of multipliers. The computational latency of a bit-serial multiplier is either 2n (for a "straightforward" serial-parallel-serial multiplier) or 1 + 3n/2 (for a Modified Booth's "recoding" multiplier). We have concentrated on a nonrecoding multiplier family, as this allows for tighter data pipelining due to the lack of the sign bit extension required by a Booth operator [8]. As an example we present a full-precision complex operator handling bipolar data with no sign bit extension (i.e., not even the singly extended sign bit required by a normal serial-parallel-serial multiplier), as this represents a typical, tightly pipelined cell. It is capable of multiplying n-bit multiplicands by m-bit coefficients and of being clocked at a maximum of 20 MHz. An incidental benefit of using a nonrecoding multiplier is the enhancement of the random pattern testability of the processors in which it is included, although this is inherently high even when a Booth's algorithm operator is used [9].
`The algorithm is that of a serial pipeline multiplier [8],
`with some extra circuitry to obviate the need for sign bit
`extension (two XOR gates for each of
`the first m – 1
`multiplier stages). There is also some extra multiplexing to
`capture the n – m – 1 lowest significant bits of the output,
`which are normally “thrown away” in a standard pipeline
`
[Figure 8: block diagram of four multipliers around a shared delay line section, followed by the add/subtract stage.]

Fig. 8. Organization of four serial multipliers to perform complex calculation of (S + iS') × (R + iR').
`
multiplier, to form the n + m - 1 full precision product. The operation of a single multiplier (n = 12, m = 8) is shown in Fig. 7. Each atomic operation consists of a full bitwise addition, in which the sum(out) from adder(i) corresponding to inputs present at time(t) is passed to adder(i + 1) as its sum(in), and the carry(out) from adder(i) is saved to be used by adder(i) itself in the next time-frame as carry(in). The only disturbance to this activity occurs at Least Significant Bit time, when an additional XOR is used to calculate the "wrap around" sum(in).
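The idea of a chain of adders, each keeping its own carry from one bit-time to the next, can be illustrated with a simplified functional model. The sketch below is not the library's circuit: it handles unsigned operands only, omits the sign-handling XOR gates and the pipeline registers, and all function names are hypothetical. Each stage adds a delayed, coefficient-gated copy of the multiplicand stream to the running sum using a one-bit full adder with a saved carry.

```python
def to_bits(value, width):
    """LSB-first bit list of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Unsigned integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(x_bits, y_bits):
    """Add two LSB-first streams with one full adder and a saved carry."""
    carry, out = 0, []
    for x, y in zip(x_bits, y_bits):
        total = x + y + carry
        out.append(total & 1)   # sum(out) for this bit-time
        carry = total >> 1      # carry saved for the next bit-time
    return out


def serial_multiply(multiplicand, coefficient, n, m):
    """Unsigned n-bit by m-bit product built from m serial-adder stages."""
    width = n + m
    data = to_bits(multiplicand, width)   # zero-padded input stream
    acc = [0] * width
    for i in range(m):
        if (coefficient >> i) & 1:
            partial = ([0] * i + data)[:width]   # stream delayed by i bits
        else:
            partial = [0] * width                # gated off by coefficient bit
        acc = serial_add(acc, partial)           # stage i passes sum to stage i+1
    return from_bits(acc)


if __name__ == "__main__":
    print(serial_multiply(0b11010110, 0b1011, 8, 4), 0b11010110 * 0b1011)
```

The number of stages equals the coefficient length m, while the multiplicand length only affects how long the stream runs, which mirrors the fixed-coefficient, arbitrary-multiplicand property described earlier.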
`Fig. 8 shows how a complex multiplier is formed, effec-
`tively, from four serial multipliers, which share some of the
`partial product
`formation circuitry in a central (delay line)
`section. The four components of the complex product are
`then combined in a final add/subtract
`section.
`The latency of the multiplier, to presentation of the least
`significant bit of the product,
`is m + 3 bits, of which one
`bit is attributable to the addition of the four raw product
`components
`to form the complex product.
In Fig. 9, we show the detailed multiplier architecture and tessellation scheme, and its layout (m = 4). The multiplicand (S + iS') and coefficient (R + iR') words flow through a central shift register network, within which the partial products are formed. The partial products are subsequently summed in four separate, parallel data streams, to form Least and Most significant words (Ls word and Ms word) of SR, SR', S'R, S'R'. These are added and subtracted together to form the components of the complex
`
`
[Figure 9: schematic and layout. Labels: Ls word and Ms word streams; final stage forming the real part (SR - S'R') and imaginary part (S'R + SR') of the product.]

Fig. 9. Schematic diagram and layout of a 4-bit complex full-precision bit-serial multiplier.

[Figure 10: schematic and layout. Labels: data input, input SIPO, input latch, 64-bit address shift register, 64 × 12 RAM cells, supply bus, output PISO, data output.]

Fig. 10. Schematic diagram and layout of a 64 (12-bit) FIFO word delay. Parallel-in/Serial-out (PISO) shift register. Serial-in/Parallel-out (SIPO) shift register.
`
product [SR - S'R' + i(S'R + SR')] in a final stage. There are n-bit Ls words, m-bit Ms words, and only n + m - 1 bits of product. When m ≤ n, the upper n - m + 1 bits of the product Ms word are sign extensions, which allow for arithmetic growth in subsequent calculations. An m-bit multiplier is formed by m - 1 ADD sections, and one SUBTRACT section, followed by the final real/imaginary generator. The individual partial product summation sections, which form the bulk of the multiplier, are identical to those which would be used for a noncomplex multiplier, illustrating the modularity of the approach. A complete 16-bit complex multiply operator comprises some 10000 transistors.
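The way the four real products combine into the complex product can be checked with a short sketch. It is a plain arithmetic illustration of (S + iS')(R + iR') = (SR - S'R') + i(S'R + SR'); the function and variable names are hypothetical and ordinary Python integers stand in for the bit-serial words.

```python
def complex_multiply(s, s_im, r, r_im):
    """Form (s + i*s_im) * (r + i*r_im) from four real products."""
    # The four raw products formed by the four serial multipliers.
    sr = s * r
    s_rim = s * r_im
    sim_r = s_im * r
    sim_rim = s_im * r_im
    # Final add/subtract section.
    real = sr - sim_rim
    imag = sim_r + s_rim
    return real, imag


if __name__ == "__main__":
    print(complex_multiply(1, 2, 3, 4))
    print(complex(1, 2) * complex(3, 4))
```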
`
`D. Other Library Elements
`
The other computational elements necessary to provide a comprehensive primitive set for a wide range of signal processing functions (add, subtract, multiplex, etc.) are realized in a similar manner to the multiplier family, as are formatting functions. They form, in general, similar pipelined structures, in which modularity is almost unavoidable.
The other major requirement in implementing bit-serial signal processing operators is for bit- and word-oriented First-In, First-Out (FIFO) storage blocks (signal and control delay elements). For small numbers of (bit) delays, a simple shift register suffices. However, this rapidly becomes inefficient for larger (word) delays. Instead, a Random Access Memory is used, consisting of an array of three-transistor dynamic RAM cells, each occupying 25 × 25 μm, along with peripheral control, drive, and output circuitry. The RAM is constructed in a manner consistent with the remainder of the cell library, such that any reasonable size and organization of FIFO memory may be straightforwardly assembled.
`Fig. 10 shows a 64-word by 12-bit FIFO RAM. Data do
`not actually move in the RAM. Rather, they are accessed
`and written according to a recirculating address, which
`appears at the top of the figure.
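The recirculating-address scheme can be modelled behaviorally: the stored words stay in place while a single address pointer walks around the memory, and each location is read just before it is overwritten. The class below is a minimal sketch under that assumption; its name and interface are hypothetical, and the serial-to-parallel and parallel-to-serial conversion at the RAM boundary is not modelled.

```python
class FifoDelay:
    """Word delay built from a RAM and a recirculating address."""

    def __init__(self, depth):
        self.ram = [0] * depth   # storage cells; contents never shift
        self.address = 0         # the single circulating address

    def step(self, word):
        """Read the oldest word, then overwrite it with the new one."""
        out = self.ram[self.address]
        self.ram[self.address] = word
        self.address = (self.address + 1) % len(self.ram)
        return out


if __name__ == "__main__":
    fifo = FifoDelay(depth=4)
    print([fifo.step(w) for w in range(1, 9)])
```

A word written at one step reappears exactly `depth` steps later, so the delay length is set by the size of the RAM rather than by a chain of registers.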
`
IV. TESTABILITY
`
Testing complex digital CMOS circuits is a problem, due to the need for deterministically derived test vector sequences to sensitize and detect "stuck-open" faults [10]. As circuit complexities increase, random pattern testing and self-test represent elegant and efficient methods of testing VLSI devices and systems. However, the intrinsically random nature of the test vectors conflicts with the sequencing required for high coverage of stuck-open faults in CMOS.
`
`
[Figure 12: circuit diagram with inputs A(in), B(in), C(in) and outputs SUM and CARRY.]

Fig. 12. Dynamic CMOS (mixed-gate) adder.

Fig. 11. (a) Static and (b) dynamic CMOS NOR gates, with the truth table for fault-free gates, and with transistors 4 and 6 stuck open, respectively. (Tick = closed, X = open).
`
We have already indicated that fault-tolerance and self-test are of great importance to our overall design philosophy [9]. It is necessary, therefore, that we use a CMOS circuit design style which circumvents the stuck-open problem, if we are to preserve the valuable testable properties of bit-serial architecture.
`
A. Stuck-At and Stuck-Open Faults
`
The single-stuck-at fault model involves comparing the results of a "good machine" simulation with those of the same circuit with each node alternately stuck-at logical 1 and 0. A fault is "covered" by the test if it causes the output of the circuit to differ from that of a good machine. This crude approach has been successful in developing tests for nMOS circuits. However, stuck-open faults, whereby a transistor's drain and source are permanently disconnected from each other, create parasitic latches in static CMOS circuits, and the single-stuck-at model is inadequate [10].
A static CMOS gate relies on the Boolean logic function of the p-channel pullup tree being the dual of the n-channel pulldown function. For the two-input static NOR gate shown in Fig. 11(a), if any of the transistors 1-4 is stuck open, this duality is broken for at least one of the minterms in the expansion of the function. For these terms, there is neither a pullup nor pulldown for the output node C, and the previous output value is effectively stored. The truth table with transistor-4 stuck open shows that this stuck-open fault will only be detected if the sequence AB = 00 to AB = 01 occurs, as otherwise a "correct" C = 0 will already be latched at the output. Therefore, a SEQUENCE of input combinations is required to detect a stuck-open fault. This is always true for irredundant static CMOS gates, and the problem is exacerbated as fan-in increases, and the gate's output becomes less symmetric.
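The sequence dependence can be reproduced with a small behavioral model of the faulty static NOR gate. The sketch assumes, consistently with the AB = 01 case described above, that the stuck-open device is the n-channel pulldown driven by input B; the transistor numbering of Fig. 11 is not recoverable here, so that mapping and the function names are assumptions.

```python
def faulty_static_nor(a, b, previous):
    """Static two-input NOR whose pulldown driven by B is stuck open."""
    pull_up = (a == 0 and b == 0)   # series p-channel path conducts
    pull_down = (a == 1)            # only A's pulldown still works
    if pull_up:
        return 1
    if pull_down:
        return 0
    return previous                 # floating node keeps its old value


def run(sequence, initial=0):
    """Apply a sequence of (a, b) inputs and return the outputs."""
    outputs, state = [], initial
    for a, b in sequence:
        state = faulty_static_nor(a, b, state)
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    # 00 followed by 01 exposes the fault: the last output should be 0.
    print(run([(0, 0), (0, 1)]))
    # 10 followed by 01 hides it: a correct 0 is already stored.
    print(run([(1, 0), (0, 1)]))
```

Only the first sequence produces an output that differs from the fault-free NOR, which is why a random pattern source must happen to generate a particular pair of consecutive vectors to detect the fault.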
Fig. 11(b) shows a mixed-gate dynamic NOR gate. A stuck-open fault on either of the clocked transistors causes a permanent logical 1 or 0 to appear at the output, and the truth table in Fig. 11(b) shows that transistor-6 stuck open causes the output for AB = 01 to be wrong.
The sequencing problem does not exist for stuck-open faults in a dynamic gate. For every stuck-open fault, there will always be at least one single input combination for which the output of a dynamic gate is wrong. This has the effect of increasing
[Figure 13: fault coverage plotted against number of random test vectors.]

Fig. 13. Comparison between the fault coverage for static and dynamic full adders, from probabilistic analysis (solid lines) and simulation experiments (broken lines).
`
`greatly the testability of circuits designed in this way, as
`will be shown in the following section.
`
`B. Test Coverage
`
Let us look at the probability of detection of stuck-open faults in static and dynamic circuits. If a gate has k inputs, there are $2^k$ possible input combinations. Any particular COMBINATION has therefore a probability of occurrence in any one cycle of a random sequence of $2^{-k}$, and any given SEQUENCE of combinations has a probability of occurrence of $2^{-2k}$. It can be shown [11] that N(m), the number of faults detected in a circuit containing I transistors by the m'th cycle, is

$$N(m) = I - \sum_{j=1}^{I}\left[1 - \frac{1}{2^{p k_j}}\right]^{m} \tag{2}$$

where $k_j$ is the number of inputs of the gate containing transistor j. For static gates, p = 2, and for dynamic gates, p = 1, since in the case of a dynamic gate, detection of a fault is independent of sequence and depends only on the occurrence of an input combination.
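The consequence of p = 1 versus p = 2 can be made concrete with a simplified calculation. The sketch below assumes a single k-input gate in which each fault is detected in any one cycle with probability 2^(-pk), independently from cycle to cycle, so that the expected coverage after m random vectors is 1 - (1 - 2^(-pk))^m. This uniform, single-gate model and the function names are assumptions made for illustration; it is not a reproduction of the analysis in [11] or of the curves in Fig. 13.

```python
def expected_coverage(m, k, p):
    """Expected fraction of faults detected after m random vectors.

    k: number of gate inputs.
    p: 1 if one input combination suffices (dynamic gate),
       2 if a pair of consecutive combinations is needed (static gate).
    """
    detect = 2.0 ** (-p * k)            # per-cycle detection probability
    return 1.0 - (1.0 - detect) ** m


def vectors_for(target, k, p):
    """Smallest number of random vectors reaching the target coverage."""
    m = 0
    while expected_coverage(m, k, p) < target:
        m += 1
    return m


if __name__ == "__main__":
    k = 3  # a full adder has three inputs
    print("dynamic:", vectors_for(0.95, k, p=1), "vectors for 95% coverage")
    print("static: ", vectors_for(0.95, k, p=2), "vectors for 95% coverage")
```

Even in this crude model the static case needs many more vectors, because requiring a specific sequence squares the per-cycle probability of hitting the detecting condition.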
`
`C. Comparison With Experiment
`
In this context, comparison is made between the probabilistic expressions derived above, and the results of simulation "experiments" (cf. [12]) using pseudorandom patterns, with respect to a full adder circuit. A full adder is a useful piece of circuitry which does not consist of too many transistors to be analytically manageable. The mixed-gate dynamic CMOS circuit for a full adder is shown in Fig. 12. In Fig. 13 are shown the forms of (2) (solid lines). The results of simulation experiments in which the respective
`
[Figure 11 truth tables: for the static gate, columns A, B, transistors 1 to 4, output C; for the dynamic gate, columns A, B, transistors 5 and 6, output E; each shown fault-free and with one transistor stuck open.]
`
`
circuits are subjected to three different pseudorandom input sequences, with injected stuck-open faults, are also shown (broken lines). The agreement between theory and experiment is gratifying, since the small number of transistors present makes the simulated coverage graphs far from smooth. It is anticipated that the agreement for larger and more complex blocks of circuitry will be even better, as the increase in the number of circuit nodes decreases the significance of individual events. Unfortunately, the increase in transistor count quickly makes an accurate analytical treatment tedious and prohibitively time consuming.
The results shown for this simple example demonstrate that dynamic CMOS circuits in general are much more random pattern testable than their static counterparts for gates of any appreciable complexity. The fact that, even for a relatively simple dynamic adder cell, the test length for the all-important 95-100-percent coverage goal is less than half that for a static gate is an extra reason to decide that dynamic design techniques in general, and mixed-gate dynamic design in particular, offer a great deal in the implementation of efficient VLSI CMOS systems.
`
V. SUMMARY
`
We have described the methods behind the design of a comprehensive CMOS cell library of primitive operators for a wide variety of bit-serial signal processing applications. We have also indicated how these cells may be integrated into the environment of the FIRST silicon compiler, such that we are able to implement signal processors in silicon rapidly, efficiently, and with a high degree of confidence in their correctness. With bit-rates of 20 MHz, and the low power consumption and high density afforded by the dynamic CMOS design technique, we have improved upon the previous nMOS operator capability by almost trebling the speed, and increasing by a factor of around 10-20 the functional throughput per unit silicon area. We have also preserved the property of high testability and amenability to self-test exhibited by the prototype nMOS cells in the presence of the normally problematic CMOS stuck-open faults.
Major application areas are apparent in:

a) SONAR and RADAR filtering;
b) real-time FFT systems;
c) filters and equalizers for digital telecommunications; and
d) image-processing filters for image enhancement.
`
`ACKNOWLEDGMENT
`
`The authors are grateful for the invaluable advice and
`comments of W. Donaldson and D. Renshaw. FIRST was
`developed in the Department of Electrical Engineering at
`the University of Edinburgh. The CMOS cell library has
`been developed cooperatively in the Wolfson Microelec-
`tronics Institute, University of Edinburgh.
`
REFERENCES

[1] Systems on Silicon. 1984.
[2] R. F. Lyon, "A bit-serial VLSI architectural methodology for signal processing," VLSI-81, pp. 131-140, 1981.
[3] A. E. Filip, "A baker's dozen magnitude approximations and their detection statistics," IEEE Trans. Aerosp. Electron. Syst., vol. AES-12, pp. 87-89, 1976.
[4] N. F. Goncalves and H. G. De Man, "NP-CMOS: A racefree dynamic CMOS technique for pipelined logic structures," in Proc. 8th European Solid-State Circuits Conf. (Brussels, Belgium), Sept. 1982, pp. 141-144.
[5] N. F. Goncalves and H. G. De Man, "NORA: A racefree dynamic CMOS technique for pipelined logic structures," IEEE J. Solid-State Circuits, vol. 18, pp. 261-266, June 1983.
[6] B. Murphy, R. Edwards, L. Thomas, and J. Molinelli, "A CMOS 32b single