throbber
A complex multiplier with low logic depth
`
`Berkeman, Anders; Öwall, Viktor; Torkelson, Mats
`
`Published in:
`[Host publication title missing]
`
`DOI:
`10.1109/ICECS.1998.813933
`
`1998
`
`Link to publication
`
`Citation for published version (APA):
`Berkeman, A., Öwall, V., & Torkelson, M. (1998). A complex multiplier with low logic depth. In [Host publication
`title missing] (Vol. 3, pp. 47-50) https://doi.org/10.1109/ICECS.1998.813933
`
`Total number of authors:
`3
`
`General rights
`Unless other specific re-use rights are stated the following general rights apply:
`Copyright and moral rights for the publications made accessible in the public portal are retained by the authors
`and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the
`legal requirements associated with these rights.
`• Users may download and print one copy of any publication from the public portal for the purpose of private study
`or research.
`• You may not further distribute the material or use it for any profit-making activity or commercial gain
`• You may freely distribute the URL identifying the publication in the public portal
`Read more about Creative commons licenses: https://creativecommons.org/licenses/
`Take down policy
`If you believe that this document breaches copyright please contact us providing details, and we will remove
`access to the work immediately and investigate your claim.
`
`LUND UNIVERSITY
`
`PO Box 117
`221 00 Lund
`+46 46-222 00 00
`
`IPR2023-00796
`Apple EX1029 Page 1
`
`

`

`A Complex Multiplier with Low Logic Depth
`
`Anders Berkeman, Viktor Owall and Mats Torkelson
`Department of Applied Electronics, Lund University
`Box 118, SE-221 00 Lund, Sweden
`Tel. $46 46 2229377, Fax. +46 46 129948
`E-mail: Anders.Berkeman@ t de .It h.se
`
`Abstract
`A complex multiplier has been designed for use in
`a pipelined fast fourier transform processor. The
`performance in terms of throughput of the processor
`is limited by the multiplication.
`Therefore, the
`multiplier is optimized to make the input to output
`delay as short as possible. A new architecture based
`o n distributed arithmetic and Wallace-trees has been
`developed and is compared t o a previous multiplier
`realized as a regular distributed arithmetic array.
`The simulated gain i n speed f o r the presented mul-
`tiplier is approximately 100%. For verification, the
`multiplier is currently under fabrication in a three
`metal-layer 0 . 5 ~ CMOS process using a standard
`cell library.
`
`1. Introduction
`A pipelined Fast Fourier Transform (FFT) proces-
`sor has been designed for use in an Orthogonal Fre-
`quency Division Multiplex (OFDM) system. Mul-
`tiplication is often the most time-critical and area
`consuming operation in a digital signal processor.
`Therefore, effort has to be made to decrease the
`number of multipliers and to increase their speed.
`In the designed FFT processor the critical path con-
`sists of a complex multiplier in series with a butterfly
`performing addition and subtraction. A part of the
`FFT pipeline is shown in figure 1. Since the butterfly
`processors are much faster than the complex multi-
`plier, the maximum clock frequency of the processor
`strongly depends of the multiplier delay.
`This paper present a novel multiplier architec-
`ture based on distributed arithmetic and Wallace
`trees. The new architecture is compared to a
`multiplier realized as a regular array and the speed
`improvement is 100% while power consumption is
`decreased. The multiplier is fully parameterized, so
`any configuration of input and output wordlengths
`could be elaborated. Both the array and the tree
`multiplier are under fabrication on the same die and
`will be evaluated for speed and power consumption
`on return.
`
`w
`Figure 1: Part of the FFT processor pipeline. The but-
`terfly processors are named “BF I” and “BF 11”. Shaded
`boxes are combinatorial logic.
`
`2. The FFT processor
`In the early versions of the FFT-processor, a complex
`array multiplier was used [2]. The array multiplier
`is a highly regular structure resulting in a minimal
`wire-length, which is important for high-speed design
`in sub-micron processes where wiring delay gives a
`significant contribution to the overall delay. How-
`ever, in a process where cell delay dominates wire
`delay, the logic depth of the design is more important
`than regularity. In the complex array multiplier the
`logic depth is proportional to the input wordlength
`N . In the adder tree multiplier, on the other hand,
`the depth is proportional to logN [3]. Even for a
`small wordlength, this is a significant improvement.
`This also affects the amount of energy per opera-
`tion for the multiplier. A lower logic depth will lead
`to a lower power consumption since the unnecessary
`switching activity is reduced.
`A way to decrease the critical path of the FFT pro-
`cessor would be to pipeline the multiplier in two or
`more stages. However, due to the pipelined struc-
`ture of the FFT processor, complexity of the control-
`ling hardware would increase [l]. Furthermore, the
`word lengths of the data paths are wide due to the
`application of the processor and the usage of com-
`plex arithmetic. A multiplier in this application has
`between 44 and 52 inputs, and a pipeline register
`inserted somewhere in the middle of the multiplier
`would need a word length of more than a hundred
`bits. This would increase area, routing and clock
`
`47
`
`0-7803-5008-1/98/$10.0001998 EEE.
`
`IPR2023-00796
`Apple EX1029 Page 2
`
`

`

`load and is not a preferable solution. Instead, the
`multiply operation is entirely combinatorial.
`The FFT processor is implemented using the
`R22DIF FFT-algorithm 111.
`In this algorithm,
`every second multiplication can be exchanged to a
`multiply by - j , which for an 8192-point FFT leaves
`only six complex multipliers. This is to be compared
`to thirteen using a straightforward implementation.
`The multiplication by - j is realized without a
`multiply by real-imaginary swap and negation of the
`imaginary part. This is the reason for two different
`butterfly processors, “BF I” and “BF 11” in figure 1.
`By using this algorithm, the number of instanciated
`multipliers is minimized compared to an ordinary
`radix-2 FFT without any loss in throughput.
`3. Multiplier algorithm
`A complex multiplier calculates two inner products,
`ZR = ARWR - AIWI
`
`{ ZI = ARWI + AIWR.
`
`(1)
`
`In the case of the FFT-processor, W = WR +jWr are
`the twiddle-factors stored in a ROM. The wordlength
`of WR and WI is denoted M . According to equa-
`tion (l), four real multiplications and two additions
`are needed.
`There are two methods to decrease multiplication de-
`lay if it is assumed that multiplication is performed
`by summation of partial products, with the excep-
`tion of logic minimization. The first is to reduce the
`number of partial products, and the second is to use
`a faster adder strategy to sum all the partial prod-
`ucts together [3]. Both methods have been combined
`in the presented architecture.
`Distributed arithmetic [4] was chosen as a means to
`reduce the number of partial products, and a Wal-
`lace tree adder was selected for adding the partial
`products together. By using distributed arithmetic,
`the complex multiplication is treated as two inde-
`pendent inner products ZR and 21. Each of the in-
`ner products will be calculated using one distributed
`arithmetic multiplier, as explained in section 3. This
`should be compared to a multiplier realized using
`equation (l), in which case four real multipliers are
`needed.
`As an alternative to distributed arithmetic, modi-
`fied Booth-encoding was considered. However, as
`the number of partial products are about the same
`for both methods, modified Booth-encoding requires
`more logic gates to implement. This is due to that in
`the modified Booth algorithm, three variables have
`to be decoded to select the proper partial product.
`In a complex multiplier based on distributed arith-
`metic, a simple two-input xor-gate does the selection.
`
`When using distributed arithmetic, the twiddle-
`factors have to be transformed from WR and WI to
`Ws and WO, where
`{ W D = WR - WI
`WS = WR + WI
`
`(2)
`
`This transformation does not cause any problems
`in the implementation, since the twiddle-factors are
`pre-calculated in the WS and WD format before
`realization. However, it is important that Ws and
`WD are calculated using floating-point arithmetic
`before they are converted to fixed point. Otherwise,
`accuracy is reduced.
`4. Mathematical background
`This section gives a mathematical background to the
`operation of the multiplier. In the equations that fol-
`low a bit-variable is treated as a variable holding the
`arithmetic value 0 or 1. In this way bits can be used
`together with arithmetic variables and operators. If
`A is an N-bit fractional number in two’s complement,
`the value of A is calculated as
`N-1
`A = -a0 +
`
`ai 2-i.
`
`(3)
`
`By using the identity
`
`i=l
`
`(4)
`
`(5)
`
`1
`A = - [ A - (-A)]
`2
`and the rule for negating a two’s complement number
`- A = + 2--(N-1),
`equation (3) can be written as
`(ai -z) 2-i-1 -2-N.
`A = -(a0 --%) 2-1 +
`i=l
`Introduce a0 = (G - ao), and for IC # 0, CY^ =
`(uk - a). Note that all cyk E (-1, +l). Using this
`notation, A can be written as
`A = A’ - zwN,
`
`N-1
`
`(6)
`
`(7)
`
`where
`
`N-1
`A‘ =
`
`cyi 2-a-l.
`
`i=O
`The relationship between ai and ai is
`{ -1 , if a+o = 0 or a0 = 1
`+1 , if a+o = 1 or a0 = 0
`
`ai =
`
`(8)
`
`(9)
`
`Using this encoding the complex product can be writ-
`ten as
`
`N - 1
`ZR = x(wfiaRi-wIar<) z-‘-’-(wR-wI) 2 - N (10)
`i=O
`
`48
`
`IPR2023-00796
`Apple EX1029 Page 3
`
`

`

`M + N - l ..........
`..........
`...........
`..........
`..........
`..........
`..........
`..........
`..........
`..........
`..........
`..........
`..........
`..........
`..........
`.......... ................
`..........
`
`N
`
`Figure 3: All partial product bits by significance for ZR
`or 21. Input wordlength is N and coefficient wordlength
`is M .
`
`is suitable for hardware mapping. It is realized as
`a multiplexer selecting &Ws or &WO, depending on
`the value of p = ( a ~ i @ a ~ i ) . If UR+O = 0 (or URO =
`l ) , an inverted version of the coefficients is chosen,
`and a '1' in the least significant position is added,
`corresponding to a two's complement negation. The
`expression
`
`is treated similarly. Figure 3 shows all the partial
`product bits that has to be added to generate ZR
`or 21. The wordlength for the twiddle factor, W,
`is M bits and for the data, A, it is N bits, in this
`case 10 and 16 bits respectively. The top sixteen
`lines in the figure is the partial products generated
`inside the sum of equation (12) or (13), and the
`third line from bottom is the ones that form the
`corresponding two's complement of these products.
`The last two lines is the - W s l ~ 2 - ~
`term.
`5. Implementation
`The proposed multiplier consists of two distributed
`arithmetic blocks, one calculating ZR, and the other
`21. The two blocks are similar and the difference
`is basically the sign in equation (1). Each block is
`divided into three parts, partial inner product gen-
`erator, adder tree and carry lookahead adder, see
`figure 2.
`The multiplier is synthesized to a 0 . 5 ~ cell library
`that does not contain any dedicated half or full adder
`cells. Estimated delay for a 10+10 by 16+16 mul-
`tiplier using a worst case industrial environment is
`about 16 nanoseconds, compared to 34 nanoseconds
`for the array multiplier. About 55% of this delay is
`due to the adder tree. The partial inner-product gen-
`erator takes 20% and the carry-lookahead adder uses
`25% of the total delay. Most of the delay is spent in
`the adder tree, and by using dedicated adder cells
`
`Figure 2: The multiplier for ZR or 21. The complete
`complex multiplier consists of two of these. Partial inner
`product generator at top, adder tree in the middle and
`fast carry-lookahead adder at the bottom.
`
`1
`
`W I ~ R I ~ : ~
`a ~ i
`
`2: 1
`
`;;
`
`N - 1
`2 - N . (11)
`2 1 = x ( m a R i + W R a r i ) 2 - " ' - ( m + W R )
`i=O
`The expression WIaRi +WRaIi is for z # 0 examined
`in the following table.
`a i i a;i
`
`-1
`
`- WD
`WS
`1
`Where Ws and WD were introduced in equation 2.
`From the table it is clear that p = ( a ~ i @ a ~ i ) can
`be used to select Ws or WO. Using p , Ws and WD,
`equation (10) and (11) can be written as
`
`N-I
`ZR =
`
`i=O
`
`( - 1 ) q I W s VpWo] 2-i-' - WO 2-N =
`
`(=@ [pws v FWD] f G) 2-2-l -WO 2-N (12)
`
`N - 1
`
`i=O
`
`N - 1
`ZI =
`
`i=O
`
`2-"' - WS 2 - N =
`(-1)""i~pwD V ~ W ~ I
`
`N - 1
`
`(G@[PWo V P W s ] + G ) 2-i-1-Ws 2 - N . (13)
`
`i=O
`When evaluating the sum, the powers G and ali
`should be replaced with U R ~ and a ~ i for the case
`i = 0, since these bits represent the sign in two's
`complement representation. The partial inner prod-
`uct
`-
`aRi @ [PWD v pWS] f G
`
`(14)
`
`49
`
`IPR2023-00796
`Apple EX1029 Page 4
`
`

`

`this delay could be decreased. However, the target
`cell-library does not contain any such cells and such
`improvements have not been made.
`When designing the adder tree, a generic tree gener-
`ator was used. This generator produces a tree with
`y inputs of wordlength x, that is a rectangle of x by
`y input bits. This rectangle has to be large enough
`to cover all the partial product bits of figure 3, i.e.
`x = M + N - 1 and y = N + 3. For certain sizes of
`N and M , the two last lines in figure 3 can be joined
`with two of the N first lines, minimizing y to N + 1.
`Unfortunately a lot of inputs to the adder tree are
`unused, and extra logic will be generated. There-
`fore, the area for the tree multiplier is about 75%
`larger than for the array multiplier. The number of
`gates for the array multiplier is 3000, while the tree
`multiplier uses 6200 gates, of which 4400 belongs to
`the two adder trees. Theoretically for a dedicated
`tree generator, the area should be only slightly larger
`than for the array multiplier. Both multipliers are
`currently under fabrication and the layout plots are
`shown in figure 4.
`When data flows through the pipeline of the FFT
`processor, the wordlength has to increase to keep ac-
`curacy in the calculations. For the current applica-
`tion the input wordlength is 12+12 bits (real + imag-
`inary) and the output wordlength is 16+16 bits. The
`twiddle-factors are kept constant at 10+10 bits at all
`stages of the pipeline. Different wordlengths in the
`datapath means that a set of multipliers of differ-
`ent wordlengths has to be instantiated if the longest
`wordlength is not to be used for all multipliers with
`a corresponding increase in area. Also, as FFT pro-
`cessors will be built for different applications the
`wordlength is subject to change. Therefore, the mul-
`tiplier is fully parameterized and a multiplier of spe-
`cific wordlength can be elaborated when needed.
`For our application, the output wordlength should
`equal the input wordlength, that is, some of the least
`significant bits of the result are cut away. A simple
`rounding scheme is applied to lower the distorsion
`when the output is truncated. A rounding bit is
`added to the right of the rightmost bit to be kept
`after truncation, causing a carry to propagate when
`the most significant position of the bits cut away is
`a one. A feature of the adder tree is that this bit
`can be inserted together with the partial inner prod-
`ucts at the top of the tree, see figure 3. In the array
`multiplier, an additional row of half-adders had to be
`included to handle rounding. As rounding includes
`addition of a one with the product, arithmetic over-
`flow at the output is possible. Therefore, a saturation
`unit is placed at the output of the carry-lookahead
`adder. This unit checks the most significant bits of
`
`50
`
`Figure 4: Plots of the two multipliers. Tree multiplier to
`the left and array multiplier to the right. The pad-frames
`are 3.2x2.9 mm2 and equal for both designs.
`
`the result and modifies the output if an overflow has
`occurred.
`
`5. Conclusion
`A Wallace-tree based complex multiplier has been
`designed and simulated with a speed improvement
`of approximately 100% compared to a previously de-
`signed array multiplier. In a worst case industrial en-
`vironment, the delay of a 10+10 by 16+16 multiplier
`is about 16 nanoseconds. This is when synthesized to
`a three metal-layer 0 . 5 , ~ process with a standard cell
`library that does not contain any dedicated half- or
`full-adder cells. The figure is an estimation without
`post-layout delay information. Under equal condi-
`tions the complex array multiplier currently being
`used has a delay of 34 nanoseconds.
`an
`Since
`the multiplier
`together with
`adder/subtractor is located in the critical path
`of the FFT-processor, throughput is expected to
`increase with approximately 80% while energy
`per operation is decreased.
`The multiplier is
`fully parameterized so any configuration of input
`and output wordlengths can be elaborated and
`synthesized. Both the array and the tree multi-
`plier are currently under fabrication on the same die.
`
`References
`“A New Approach to
`S. He and M. Torkelson.
`Pipeline FFT Processor”. In Proc. of IEEE Inter-
`national Parallel Processing Symposium, 1996.
`S. He and M. Torkelson. “A Complex Array Mul-
`tiplier Using Distributed Arithmetic”. In Proc. of
`IEEE Custom Integrated Circuits Conference, 1991.
`C.S. Wallace. “A Suggestion for a Fast Multiplier”.
`IEEE Transactions on Electronic Components, Vol.
`EC-13, Feb 1964.
`S.G. Smith and P.B. Denyer. “Efficient Bit-Serial
`Complex Multiplication and Sum-Of Products Com-
`putation Using Distributed Arithmetic”. In Proc. of
`IEEE ICASSP, 1986.
`
`IPR2023-00796
`Apple EX1029 Page 5
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket