© 1990 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
`
`A Clock-Free Chip Set for High—Sampling Rate Adaptive Filters
`
`TERESA H.-Y. MENG*
Department of Electrical Engineering, Stanford University, Stanford, California 94305.
`
ROBERT W. BRODERSEN AND DAVID G. MESSERSCHMITT
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720.
`
`Received May 8, 1989. Revised October 24, 1989.
`
`Abstract. As digital signal processing systems become larger and clock rates increase, the typical design approach
`using global clock synchronization will become increasingly difficult. The application of asynchronous clock-free
designs to high-performance digital signal processing systems is one promising approach to alleviating this problem.
`To demonstrate this approach for a typical signal processing task, the system architecture and circuit design of
a chip set implementing high-rate adaptive lattice filters using asynchronous design techniques is presented.
`
`1. Introduction
`
`The issues in designing computing machines, both in
`hardware and in software, have always shifted in
`response to the evolution in technology. VLSI promises
great processing power at low cost, but there are also
`new constraints that potentially prevent us from taking
`advantage of technology advances. The increase in
`processing power available is a direct consequence of
`scaling the digital IC process. As the scaling of the IC
`process continues, it is doubtful that the benefits of the
`faster devices can be fully exploited due to other fun-
`damental limitations. System clock speeds are starting
`to lag behind logic speeds in recent chip designs. While
gate delays are well below 1 ns in advanced CMOS
technologies, clock rates of more than 50 MHz are
difficult to obtain, and where they have been attained,
they have required extensive design and simulation effort.
`One important design consideration that limits clock
speeds is clock skew [1], which is the difference in phase
`of a global synchronization signal observed at different
`locations in the system. Clock skew can be reduced to
`a minimum with proper clock distribution [2], [3] but
`because of this global constraint high-performance cir-
`cuitry is confined to a small chip area.
The alternative fully asynchronous design synthesis
`approach eliminates the need for a global clock and cir-
`cumvents problems associated with clock skew. At the
chip level, since there are no global timing considerations,
the design time spent on layout and circuit timing
simulation for the worst-case design is greatly reduced.
At the board level, since the asynchronous approach
provides design modularity, systems built using this
approach can be easily extended without problems in
global synchronization [4], [5].

*This research was sponsored in part by the Semiconductor Research
Corporation and by DARPA.
`Several recent dedicated hardware designs have
`achieved high-speed performance by eliminating clock
skew problems through asynchronous designs [6], [7],
and globally-asynchronous locally-synchronous systems
[8]-[10] have been considered. Our work is highly
`motivated by feasibility and implementation. We are
`particularly concerned with performance, since we
`believe that the usefulness of a system will be eventually
`judged by its cost-effectiveness. Hence, we put special
`emphasis on exploiting maximum concurrency at both
`the architecture and the circuit levels, so that high-
performance implementation can be obtained without
dependence on advanced technology.
`Motivations for adopting an asynchronous design
`methodology can be summarized as follows.
Scaling and Technological Reasons. The main motiva-
`tion for going to asynchronous circuitry is to eliminate
`the requirement for a global clock. Clock skew is a
`direct result of the RC time constants of the wires and
`the capacitive loading on the clock lines. While scal-
`ing the IC process reduces the area requirements for
`a given circuit, more circuitry is usually placed on the
chip to take advantage of the extra space and to further
reduce system costs [11]. Hence, for a global signal such
`as a clock, the capacitive load tends to stay constant
`with scaling. Therefore, the capacitance associated with
`the interconnect layers cannot scale below a lower limit.
`
`
`
`Page 1 of 21
`
`SAMSUNG EXHIBIT 1011
`
`
`
`
`As the devices get smaller, the clock load represents
`a larger proportional load to the scaling factor and more
`buffering is needed to drive it, complicating the clock
skew problem. Calculations show that it would take
roughly 1 nsec for a very large transistor (1/g_m ≈ 50 Ω)
to charge a capacitance of 10 pF in advanced CMOS
technology [12]. This delay constitutes a constant speed
`limitation if a global signal is to be transmitted through
`the whole chip, not to mention the whole board.
`
CAD Tools and Layout Factors. Increased circuit com-
`plexity has been addressed by the development of CAD
`tools to aid the designer. These tools often allow the
`designer to specify the functionality of a system at a
`structural level and have a computer program generate
`the actual chip layout according to the interconnection
`specified. The routing of clock wires as well as the load
`on the clock signal are global considerations which are
`difficult to extract from local connectivity. From a
`system design point of view, asynchrony means design
`modularity. A modular design approach greatly sim-
`plifies the design efforts of complicated systems and
`fits well with the macro-cell design methodology often
`adopted by ASIC CAD tools.
`
`Board Level Design. One of the primary advantages
`of using fully asynchronous design in implementing
`high-performance DSP systems is the ease of design
`at the board level, where clock distribution is not an
`
issue. Inter-board communications have long used asyn-
chronous links (for example, the VME bus protocol)
and interchip intra-board communications are becoming
asynchronous too (for example, the MC68000 series).
As DSP systems become more complex, it is advan-
tageous to simplify the design task to only local con-
siderations by using asynchronous components. This
`is particularly important for signal processing applica-
`tions using pipelined architectures, where computation
`can be extended by pipelining without any degradation
`in overall system throughput.
`One goal of this work is to demonstrate the ease with
`which asynchronous systems can be designed with a
minimum amount of design effort. We have thus
`designed a chip set that realizes a vectorized adaptive
`lattice filter, with arbitrary vector size, using an asyn-
chronous design methodology. Our adaptive filter archi-
tecture has localized forward-only interconnections, full
pipelining in which the communication paths between
chips constitute pipeline stages, and asynchronous inter-
connection eliminating the global constraint of clock
distribution. Hence our architecture can achieve arbi-
trary throughput consistent with input/output rate
limitations (such as the speed of A/D converters). The
`experimental implementation showed that asynchronous
`design simplifies the design process since no effort was
`devoted to clock distribution, nor were any timing
`simulations to verify clock skewing necessary.
`Section 2 gives a short description of the vectorized
`adaptive lattice filter structure and the chip partition
`of the adaptation algorithm. Section 3 reviews the asyn-
`chronous design methodology used in the chip imple-
`mentation. Section 4 illustrates the design of a fast
self-timed array multiplier as an example of designing
`computation blocks with completion signal generation.
`Section 5 describes the design of the chip set for an
`adaptive lattice filter along with chip performance
`evaluations, and Section 6 gives the conclusions.
`
`2. Vectorized Adaptive Filters
`
`The realization of adaptive filters at high sampling rates
`is important in many applications but made inherently
`difficult by the recursive nature of the algorithms. The
`block adaptive filter scheme was proposed in which
filters adapt coefficients at time T using the coefficients
at time T−L, where L is the block size. In applications
that require both fast convergence and fast tracking,
such as some radar signal processing systems, intro-
ducing delays in the adaptation algorithm will degrade
the filter tracking capability and destabilize the filter
dynamics.
`The vectorized adaptive filter scheme introduced in
`[13] achieves two objectives simultaneously. The first
`objective is to allow arbitrarily high-sampling rates for
`a given speed of hardware, at the expense of parallel
`hardware. The second objective is to not modify the
`input—output characteristics of the algorithm, and hence
`not affect the convergence and tracking capability. We
`chose to use a lattice filter [14] in our design because
`the adaptation feedback in a lattice filter can be limited
`to a single stage. This results in much less computa-
`tion within the feedback path and higher inherent speed
in a given technology. The successive stages of a lattice
filter are independent and can therefore be pipelined.
`
`2.1. The Adaptation Algorithm
`
`Linear adaptive filtering is a special case of a first-
`order linear dynamical system. The adaptation of a
`
`
`single stage of a lattice filter structure can be des-
`cribed by two equations, the state-update equation
`
`km = a(T)k(T-1) + I30").
`
`and the output computation
`
`y(T) = f(k(T),T),
`
`(1)
`
`(2)
`
where a(T) and b(T) are read-out (memoryless) func-
tions of the input signals at time T, and y(T) is a read-
out function of the state k(T) and the input signals. Since
the computation of y(T) is memoryless, the calculation
can be pipeline-interleaved [15] and the computation
throughput can be increased without theoretical limit,
if input signals and state information can be provided
at a sufficiently high speed. However, the state-update
represents a recursive calculation which results in an
iteration bound on the system throughput [16]. In order
to relax the iteration bound without changing the
system's dynamical properties, Equation (1) can be writ-
ten as
`
k(T+L−1) = a(T+L−1)a(T+L−2) ··· a(T+1)a(T)k(T−1)
         + a(T+L−1)a(T+L−2) ··· a(T+1)b(T) + ···
         + a(T+L−1)b(T+L−2) + b(T+L−1)
         = c(T,L)k(T−1) + d(T,L)     (3)
`
where c(T, L) and d(T, L) are functions
of input signals, but independent of the state k(T).
c(T, L) and d(T, L) can be calculated with high through-
put using pipelining and parallelism, and the recursion
k(T+L−1) = c(T, L)k(T−1) + d(T, L) need only
be computed every L samples. Therefore, the iteration
bound is increased by a factor of L. L is called the vec-
tor size, since a vector of L samples will be processed
simultaneously.
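The equivalence between iterating Equation (1) and the block update of Equation (3) can be checked behaviorally. The sketch below uses scalar states and made-up coefficient sequences; the numerical values are illustrative, not taken from the filter.

```python
# Behavioral check that the vectorized update of Equation (3) matches
# L iterations of Equation (1). Coefficient values are illustrative.

def iterate_state(k, a_seq, b_seq):
    """Apply Equation (1), k(T) = a(T) k(T-1) + b(T), once per sample."""
    for a, b in zip(a_seq, b_seq):
        k = a * k + b
    return k

def vectorized_state(k, a_seq, b_seq):
    """Accumulate c(T, L) and d(T, L), then do one update per block of L."""
    c, d = 1.0, 0.0
    for a, b in zip(a_seq, b_seq):   # c and d depend only on the inputs,
        c = a * c                     # never on the state k, so they can be
        d = a * d + b                 # computed with pipelined parallel hardware
    return c * k + d                  # k(T+L-1) = c(T, L) k(T-1) + d(T, L)

a_seq = [0.9, 0.8, 0.7]               # L = 3: one vector of three samples
b_seq = [0.1, 0.2, 0.3]
assert abs(iterate_state(1.0, a_seq, b_seq)
           - vectorized_state(1.0, a_seq, b_seq)) < 1e-12
```

Only the final line, c(T, L)k(T−1) + d(T, L), is recursive; everything inside the loop is feed-forward, which is what relaxes the iteration bound by the factor L.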
`
`2.2. The Chip Partition for the Normalized LMS
`Lattice Filter
`
A normalized least mean squares (LMS) or stochastic
gradient adaptation algorithm is chosen for our exam-
ple because of its simplicity and the absence of
numerical overflow. A vectorized LMS lattice filter
stage with L = 3 is shown in figure 1, and the opera-
tions that each processor needs to perform are shown
in figure 2. The derivation of the algorithm for nor-
malized LMS lattice filters can be found in [17]. Proc-
essors A1 and A2 jointly calculate the c(T, L) and
d(T, L) in Equation (3), processors B1 and B2 calculate
the state-update for state k, and processor C1 performs
the output computation. For every processing cycle, a
`
Fig. 1. An LMS adaptive lattice filter stage with a vector size of three.
`
`
k' = ak + b, with a = 1 − E⁻¹(e_f² + e_b²) and b = 2E⁻¹ e_f e_b
e_f' = e_f − k e_b
e_b' = e_b − k e_f

Fig. 2. Operations required of each processor shown in figure 1.
`
vector of L input samples is processed and a vector
of L output samples is calculated. The processing
rate is thus L times lower than the filter sampling rate.
`Since there is no theoretical limit to the number of
`
samples allowed in a vector, the filter sampling rate is
not constrained by the processing speed, but rather is
limited only by the I/O bandwidth (such as the speed
of data converters). However, the higher sampling rate
is achieved at the expense of additional hardware com-
plexity and system latency, and very high-sampling rates
`will require multiple-chip realizations. For example,
`at the sampling rate of 100 MHz the computational re-
`quirement is approximately 3 billion operations per se-
`cond per stage. Since there is no global clock routing
`at the board level with asynchronous design, pipelined
`computation hardware can be easily extended without
`any degradation in overall throughput as the vector size
`is increased and the number of lattice filter stages is
`increased.
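The computational requirement quoted above can be checked with rough arithmetic. The per-sample operation count of 30 below is our assumed round number (the paper does not state it), chosen so the quoted figure is reproduced.

```python
# Rough check of the quoted requirement of ~3 billion operations per
# second per stage at a 100 MHz sampling rate. The per-sample operation
# count is an assumption for illustration only.
sampling_rate_hz = 100e6                  # 100 MHz filter sampling rate
ops_per_sample_per_stage = 30             # assumed per-sample operation count
ops_per_second = sampling_rate_hz * ops_per_sample_per_stage
assert ops_per_second == 3e9              # ~3 billion operations/s per stage

# With a hypothetical 10 MHz processing-cycle rate per chip, the vector
# size L needed to sustain the sampling rate follows directly:
processing_rate_hz = 10e6
L = sampling_rate_hz / processing_rate_hz
assert L == 10.0                          # ten samples processed per vector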
`
`Our goal is to design a set of chips that can be con-
`nected at the board level to achieve any sampling rate
consistent with I/O limitations. This requires the par-
`titioning of a single lattice filter stage into a set of chips
`which can be replicated to achieve any vector size. In
`this partitioning we attempted to minimize the number
`of different chips that had to be designed, allow flex-
`ibility in the vector size (that is, amount of parallelism),
`and minimize the data bandwidth (number of pins) on
`each chip.
`We found a partitioning that uses five different chips
`and meets these requirements. Block diagrams of four
of them are shown in figure 3, and a fifth chip simply
implements variable-length pipeline stages. Two of the
four chips (PE1 and PE4) have built-in hardware
multiplexers so that the chip function can be controlled
by external signals to form a reconfigurable data path;
otherwise eight different chip designs would have been
required. The same chip set can also be used to con-
`struct a lattice LMS joint process estimator. These chips
`will be replicated as required and interconnected at the
`board level to achieve any specified system sampling
`rate.
`
`3. Asynchronous System Design
`
`An asynchronous processor is composed of two types
`of basic blocks: computation blocks and interconnec-
`tion blocks. Computation blocks perform processor
operations, which include any combinational logic such
`as multipliers and ALUs, and memory elements. A
`computation block must be designed such that the com-
`putation is started by an external request signal and
`generates an output signal to indicate that the computa-
`tion has completed. The completion signal can be read-
`ily generated by using a class of logic family called Dif-
`ferential Cascode Voltage Switch Logic (DCVSL). A
comprehensive discussion of DCVSL is given in [12].
`The completion signal can be viewed as a locally
`generated clock to indicate to the succeeding blocks
when the output data is ready to be fetched and to re-
quest a data transfer.
When computation blocks are to be interconnected,
`asynchronous interconnection blocks must be inserted
`among them to ensure correct data transfers under all
`temporal variations of the various completion signals.
`
`-
`5
`D.«,.:1afll
`
`@ : Squarer
`® : Multiplier
`G-)*: Adder/Subtracter
`O : Multiplexing
`
`® : Shifter
`® : 16 bits «:4 bits converter
`H :Pipelining Stage
`
`Fig. 3. A chip set designed to implement the LMS adaptive lattice filter shown in figure l. The partition was done in such a way that any
`required filter sampling rate can be achieved by replicating these chips at the board level without redesign. PEI and PE4 have built-in hardware
`multiplexers so that the chip function can be controlled by external signals to form a reconfigurable data path.
`
`These interconnection blocks take the completion
`signals generated by computation blocks and issue re-
`quest signals to start the computation in other computa-
`tion blocks. It can be shown that the two binary signals,
`the request signal for start of computation and the com-
`pletion signal for finish of computation, are necessary
`and sufficient for realizing general asynchronous com-
`putation with unbounded delays [18]. Interconnection
`blocks are composed of hazard—free asynchronous logic,
`which can be synthesized from a behavioral descrip-
`tion specifying the sequence of operation that each in-
`terconnection block needs to perform [19]. We will now
`discuss the properties of asynchronous computation
`from a system point of view.
`
`3.1. Properties of Asynchronous Computation Blocks
`
DCVSL computation blocks use so-called dual-rail
[20] coded signals inside the logic, which compute both
the data signal and its complement. Contrary to past
expectations, if there is no carry-chain in the design,
asynchronous computation using DCVSL blocks always
gives the same worst-case computation delay, indepen-
dent of input data patterns, since both rising and fall-
ing transitions have to be completed before the com-
pletion signal can go high. Hence, the computational
latency of each computation block for different input
data patterns is approximately constant. This lack of
data dependency, to first order, is actually an advan-
tage for real-time signal processing, since the worst-
case computation delay can be easily determined.
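The completion property just described can be sketched behaviorally: in dual-rail coding each bit travels on a true/false wire pair, both wires are low during precharge, and computation is complete once every pair has exactly one wire asserted. This is a software sketch of the coding convention only, not of a DCVSL circuit.

```python
# Behavioral sketch of dual-rail coding and completion detection.
# Convention assumed: (t, f) = (1, 0) encodes logic 1, (0, 1) encodes
# logic 0, and (0, 0) means the block is still precharged/computing.

def dual_rail_encode(bits):
    """Encode a list of logic values as (true_rail, false_rail) pairs."""
    return [(1, 0) if b else (0, 1) for b in bits]

def done(rails):
    """Completion: every bit has exactly one of its two rails asserted."""
    return all(t != f for t, f in rails)

precharged = [(0, 0)] * 4                 # all rails low: not finished yet
assert not done(precharged)
assert done(dual_rail_encode([1, 0, 1, 1]))
```

Because completion requires every pair to resolve, the detector fires only after the slowest bit settles, which is why the latency is the worst-case delay regardless of the data pattern.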
`
`Once an asynchronous processor has been designed,
`there is no way to slow down the internal computation.
The throughput can be controlled by issuing the request
signal at a certain rate, but the computation within the
processor is performed at full speed. This prop-
erty has an impact on testing: unlike synchronous cir-
cuits, we cannot slow down the circuitry to make it work
with a slower clock.
`
`In asynchronous design it is tempting to introduce
`shortcuts that improve performance at the expense of
`speed-independence. However, these shortcuts neces-
`sitate extensive timing simulations to verify correctness
under varying conditions. To minimize design complex-
ity we therefore generally stayed true to the speed-
independent design, with the result that timing simulations
were required only to predict performance and not to
verify correctness.
`
`3.2. Properties of Asynchronous Interconnection
`Blocks
`
When computation blocks communicate with one
another, an interconnection block is designed to control
the data transfer mechanism. We require that an inter-
connection circuit be delay-insensitive; that is, the cir-
cuit's behavior is independent of the relative gate delays
among all the physical elements in the circuit. Delay-
insensitive circuits demand that the Boolean functions
of these circuits allow no hazard conditions.
`An automated synthesis procedure for designing hazard-
`free asynchronous interconnection circuits from a
`
`
`behavioral level description has been developed [19] and
`used to generate all the interconnection circuits required
`in the adaptive filter chip set.
One example of an interconnection block, a four-
phase pipeline handshake circuit, is shown in figure 4.
The pipeline handshake circuit allows the two computa-
tion blocks connected to it to process data simul-
taneously. The C-element shown in the figure represents
a Boolean function c = ab + bc + ca, where a and
b are the two input signals and c is the output signal.
The circuit is optimum in the sense that it allows the
maximum concurrency under the hazard-free condition
and requires a minimum amount of hardware for im-
plementation. Other more complicated interconnection
`circuits such as multiplexers or bus controllers have
`been synthesized to realize a fully asynchronous pro-
`grammable processor [21]. The hardware overhead in-
`curred by these interconnection circuits is usually small,
`often on the order of tens of transistors. However, the
`time-delay overhead incurred by the circuit delays can-
`not be ignored, and will be discussed in Section 5 as
`part of the chip performance evaluation.
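A behavioral model of the Muller C-element may make the function c = ab + bc + ca concrete: the output rises only when both inputs are high, falls only when both are low, and otherwise holds its previous value. This models the logic function only, not the hazard-free circuit implementation.

```python
# Behavioral model of the Muller C-element used in the handshake circuit.
# Its next output is c = ab + bc + ca, i.e., agreement of both inputs
# moves the output, and disagreement holds the previous state.

class CElement:
    def __init__(self, c=0):
        self.c = c                    # stored output state

    def update(self, a, b):
        self.c = (a & b) | (b & self.c) | (self.c & a)
        return self.c

ce = CElement()
assert ce.update(1, 0) == 0   # inputs disagree: output holds low
assert ce.update(1, 1) == 1   # both high: output rises
assert ce.update(0, 1) == 1   # inputs disagree: output holds high
assert ce.update(0, 0) == 0   # both low: output falls
```

This hold behavior is what lets a pair of C-elements sequence the four-phase request/acknowledge exchange without a clock.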
`
Fig. 4. An optimum four-phase pipeline handshake circuit that allows
the two computation blocks connected to it to process data at the
same time.
`
3.3. Connection of Registers to Interconnection Blocks
`
The connection between computation blocks and inter-
connection blocks is governed by the request-
completion signal pairs. Registers are usually con-
sidered part of the interconnection blocks since they
implement a pipeline stage in the signal processing
sense. For example, the connection of the pipeline hand-
shake circuit (figure 4) to the corresponding register
is shown in figure 5. The handshake circuit uses the
rising edge of the output acknowledge signal A_out to con-
trol register latching. The register shown in figure 5
is an edge-triggered latch. The reason that an edge-
triggered latch is needed instead of a level-triggered
latch is for fast acknowledge signal feedback. Notice
that in figure 4, R_out is fed back to the first C-element
without waiting for the completion of the block con-
nected to it. It can be easily verified that if a feedback
`
`Fig. 5. The connection between the pipeline register, the pipeline
`handshake circuit, and computation blocks.
`
`signal is generated after the completion of the suc-
`ceeding block, the handshake circuit will enforce that
`only alternate blocks can compute at the same time and
`that the hardware utilization is reduced to at most 50%.
`
`In figure 5, the first C-element of the full handshake
`controls the feedback acknowledge signal, which in-
`dicates to the previous block whether its output data
`has been acknowledged, or received, by the register
`so that the previous block can proceed to compute the
`next sample. The second C-element controls the request
`to the next block, which indicates to the next block
`when to start computation. A completion detection
`mechanism [22] is used to detect the completion of the
`register latching.
The data latching mechanism shown in figure 5 is
logic-delay-insensitive, but not propagation-delay-
insensitive, as the propagation delays of the data lines
might differ from that of the request signal R_in. If the
request signal and data lines do not match in propaga-
tion delays, differential pairs of data lines, or coded data
lines, will have to be transmitted and the completion
signal R_in will be generated locally at the input to the
C-element. In our design, we also assume that the pro-
pagation delays within each logic block are negligible.
To gain some overall performance improvement, we
could match the delays of register latching and the sec-
ond C-element. However, simulation showed that
attempts at logic delay matching proved unreliable,
and thus we chose to stay with a delay-insensitive
design.
`
`3.4. System Initialization
`
`An asynchronous system can be initialized in several
`states, depending on whether initial samples are neces-
`sary or not [21]. For a feed-forward calculation, the ini-
`tial samples do not have any impact on output behavior,
`so we might as well set all the initial conditions (the
`
`Page 6 of 21
`
`
`
`-A-€loclc—vFree-Gldp-Set-forl-Iighvsarnpling~Rate1#rdaptive"Fi:lters'“"3S1"””‘“"*“””""
`
`request and acknowledge signals controlled by inter-
`connection blocks) to zero. For feedback configura-
`tions, samples within the feedback loop are necessary
`to avoid deadlock; hence, either the request or the
`acknowledge signal has to be initially set high to repre-
`sent a sample in a pipeline stage within a feedback loop.
`The output of a memory element, for example the C-
`element, can be easily set high initially, but when com-
`bined with DCVSL computation blocks the initializa-
`tion process is slightly more complicated than just set-
`ting the C-element.
`When the request signal of a DCVSL block is set
`high while its input register is being cleared at initializa-
`tion, the output data of the DCVSL will be undefined,
`which may result in a garbage sample that will propa-
gate around the feedback loop. This is not a problem for
`data samples in our particular adaptive filtering applica-
`tion, since the effect of an arbitrary initial sample will
`fade away because of the adaptation procedure. On the
`other hand, an undefined logic state may deadlock the
`system right at the start. A two-phase initialization is
`adopted in our design to cope with this problem.
`In the first initialization phase, every handshake
signal is reset to zero to ensure that the input data to
`each DCVSL block is stable, and then in the second
`phase, those request signals corresponding to a sam-
`ple delay in a feedback loop are set high so that a clean
`completion signal can be subsequently generated. The
`initialization can be easily implemented with a set-reset
memory element. In our design, a set memory element
can be set high by an init signal, and a reset memory
element can be reset low by the same init signal, but
set memory elements and reset memory elements are
`disjoint. A simple logic network as shown in figure 6
`is used to realize the two-phase feedback initialization.
During the first phase, when init is high and feed is
low, the logic output R_set is low, and therefore a clean
precharged signal R_comp can be generated. During the
second phase, feed is set high, which sets the output
of the NAND gate low (Req is the request signal,
which is also high initially), and R_set is set high through
the select logic. A completion signal is subsequently
generated by the DCVSL (R_comp), indicating that there
is a sample in the DCVSL block. init can then be reset
and the system can be started without deadlock.
`
3.5. System Throughput
`
`For a pipelined architecture, the system throughput is
`the reciprocal of the additive delays of the longest pipe
`
Fig. 6. The logic used to allow the two-phase initialization set-up. Dur-
ing the first phase, init is high and all the reset memory elements
and register contents are reset low. During the second phase, feed
is set high and the request signals (R_set) corresponding to a sample
delay in a feedback loop are set high through the select logic. A com-
pletion signal is subsequently generated by the DCVSL (R_comp), indi-
cating that there is a sample in the DCVSL block.
`
plus one handshake (two set-reset cycles of a C-element)
plus the precharge. The delays of the handshake and
the precharge can be considered overhead as com-
pared to a synchronous design (although the synchro-
nous design will also have some overhead in register
latching), and may range from 8 ns to 20 ns (1.6 μm
CMOS) depending on the data bit width and handshake
circuits used. A nominal delay of 10 ns was estimated
from simulation using the 1.6 μm CMOS parameters.

A 10 ns overhead may seem too large to be prac-
tically acceptable. But this overhead will be reduced in
direct proportion with the gate delay in more advanced
technologies. We expect this overhead to be below 2
ns in 0.8 μm CMOS technology, which will allow a system
throughput up to a few hundred MHz without dis-
tributing a global clock. Also worth mentioning is that
the system throughput of a pipelined architecture is in-
dependent of the system complexity if the delay of the
longest stage remains constant. A pipelined system
can be extended without throughput degradation, since
only local communication counts toward the throughput
limitation and there is no global clock to constitute a
measure of the system's physical size.
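The throughput relation can be illustrated numerically. Pairing a 40 ns longest stage with the 10 ns nominal overhead is our illustrative combination of figures quoted elsewhere in the paper, not a measurement of the assembled system.

```python
# Pipeline throughput = 1 / (longest stage delay + handshake + precharge).
# The key point: the result does not depend on how many stages the pipe has.
longest_stage_ns = 40.0   # e.g., the multiplier-core latency from Section 4
overhead_ns = 10.0        # nominal handshake + precharge overhead (1.6 um CMOS)
throughput_mhz = 1e3 / (longest_stage_ns + overhead_ns)
assert throughput_mhz == 20.0   # 20 MHz processing-cycle rate for any pipe length
```

With a vector size L, the filter sampling rate would then be L times this processing-cycle rate.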
`
`4. A Self-Timed Array Multiplier
`
`In this section we will describe the design of a fast self-
`timed array multiplier to illustrate the circuit design
`considerations often encountered in designing DCVSL
`computation blocks and the solutions to these problems.
Divisions in LMS stochastic gradient adaptive filtering
are often used for step-size normalization and can
`be approximated by shifters without much degradation
`in convergence properties as verified by simulations
`[23]. Thus, in our application the most complicated
`computation block needed is a fast multiplier. The
`design of a self-timed multiplier will now be described.
`
`
4.1. Architecture of the Multiplier
`
`A multiplication in adaptive filters is usually followed
`by an accumulation operation, and the final addition
`in the multiplication can be combined with the accum-
`ulation to save the delay time of one carry-propagate
`add. As shown in figure 7, the multiplier core outputs
`two partial products to be added in the succeeding
`accumulate-adder, which consists of one carry-save add
`and one final carry-propagate add. The multiplier core
`is composed of three basic cells: registers, Booth en-
coders, and carry-save adders. The 16 multiplier bits
are grouped into eight groups and each group is Booth
encoded to either shift or negate the multiplicand [24].
Eight partial products are computed and added using
a modified Wallace tree structure [25]. Four of these
partial products can be added concurrently and therefore
the computation latency for eight additions is four times
the delay of one carry-save add. Figure 7 shows the ar-
chitecture of the multiplier core, which consists of eight
Booth encoders operating concurrently and six carry-
save adders with four of them operating sequentially.
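The recoding step can be sketched in software. The following is a behavioral model of radix-4 (modified) Booth recoding, not the chip's encoder circuit: the 16 multiplier bits are scanned in eight overlapping 3-bit groups, each yielding a digit in {−2, −1, 0, +1, +2} that selects a shifted, negated, or zero multiple of the multiplicand.

```python
# Behavioral model of radix-4 (modified) Booth recoding for a 16-bit
# two's-complement multiplier: eight digits, one per overlapping 3-bit group.

def booth_digits(m, bits=16):
    """Return the 8 radix-4 Booth digits of a two's-complement multiplier m."""
    if m < 0:
        m += 1 << bits                # two's-complement bit pattern of m
    digits = []
    prev = 0                          # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        group = ((m >> i) & 0b11) << 1 | prev   # bits (2i+1, 2i, 2i-1)
        digits.append({0: 0, 1: 1, 2: 1, 3: 2,
                       4: -2, 5: -1, 6: -1, 7: 0}[group])
        prev = (m >> (i + 1)) & 1
    return digits

def booth_multiply(x, m, bits=16):
    """Recombine: the sum of digit_i * x * 4**i equals x * m."""
    return sum(d * x * (4 ** i) for i, d in enumerate(booth_digits(m, bits)))

assert booth_multiply(123, 45) == 123 * 45
assert booth_multiply(123, -45) == 123 * -45
```

Summing the digit-weighted multiples is what the Wallace tree of carry-save adders does in hardware; each digit costs only a shift and an optional negation of the multiplicand.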
`
The inv signals from encoders to carry-save adders im-
plement the plus-one operation in a two's complement
representation of a negative number. The maximum
quantization error is designed to be less than one least-
significant bit by adding a 0.5 to one of the carry-save
adders for rounding. The multiplier core occupies a sil-
icon area of 2.7 mm × 3 mm in a 1.6 μm CMOS design,
and a computational latency of 40 ns was estimated from
simulation by irsim [26] using the 1.6 μm parameters.
`The test results of the design will be given in Section 5.
`To increase the system throughput, a pipeline stage was
`added between the multiplier core and the accumulate-
`adder. We could have deep-pipelined the multiplier
`core, because pipelined computation within a feedback
`loop can be compensated by look-ahead computation
`without changing the system’s input-output behavior
`[27],
`[28]. However, since the total multiplication
`latency has been reduced to only one Booth encoding
`plus four carry-save adds for 16-bit data, the overhead
`incurred in adding pipeline stages, such as the circuitry
`for pipeline registers and the delay time for register latch-
`ing, completion detection, and handshake operation,
`
Fig. 7. The architecture of the multiplier core, which consists of eight Booth encoders operating concurrently and six carry-save adders with
four of them operating sequentially.
`
`
rules out the practicability of having one more stage
of pipelining within the multiplier core.
`The accumulate-adder consists of a carry-save adder
`at the front and a carry-propagate adder at the end to
`add three input operands per processing cycle, two of
`three operands being the output partial products from
`the multiplier core, with the third coming from a sec-
ond source. The accumulate-adder occupies a silicon
area of 1.1 mm × 1 mm in a 1.6 μm CMOS design. A
propagation delay of 30 ns was estimated by simula-
tion using the 1.6 μm parameters, and the test results
will be given in Section 5.
`
`4.2. Circuit Design for the Multiplier Core
`
`The goal in designing the multiplier core was to reduce
`the DCVSL delay overhead by allowing maximum con-
`current operation in both precharge and evaluation
`phases. Layout is another important design factor since
a Wallace tree structure consumes more routing area
`than sequential adds. Data bits have to be shifted two
`places between each Booth encoder array to be aligned
`at the proper position for addition. Since each data bit
`is coded in a differential form, four metal lines (two
`for the sum and two for the carry) have to be shifted
two places per array, which constitutes a 50% routing
`area overhead as compared to the same structure in a
`synchronous design. In this subsection, the basic cells
`used in the multiplier core, modified from a design pro-
`vided by Jacobs & Broder