`
`Journal of VLSI Signal Processing 16, 41–55 (1997)
© 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.
`
`Architectural Synthesis of Digital Signal Processing Algorithms Using “IRIS”
`
`D.W. TRAINOR, R.F. WOODS AND J.V. McCANNY
`Department of Electrical & Electronic Engineering, The Queen’s University of Belfast, Ashby Building,
`Stranmillis Road, Belfast BT9 5AH, Northern Ireland
`
`Received March 4, 1996; Revised August 14, 1996
`
`Abstract.
`In this paper, we present the IRIS architectural synthesis system for high-performance digital signal
`processing. This tool allows non-specialists to automatically derive VLSI circuit architectures from high-level,
`algorithmic representations, and provides a quick route to silicon implementation. By incorporating a novel synthesis
`methodology, called the Modular Design Procedure, within the IRIS system, parameterised models of complex and
`innovative DSP hardware can be derived and automatically assembled to create new DSP systems. The nature
`of this synthesis methodology is such that designers can explore a large range of architectural alternatives, whilst
`considering all the architectural implications of using specific hardware to realise the circuit. The applicability
`of IRIS is demonstrated using the design examples of a second order Infinite Impulse Response filter and a one-
`dimensional Discrete Cosine Transform circuit.
`
`1.
`
`Introduction
`
`In recent years, considerable research has been car-
`ried out into the design of novel VLSI architectures for
`digital signal processing (DSP) applications [1]. The
`highly structured nature of many DSP functions makes
`it possible to derive regular circuit architectures, such
`as systolic arrays, that are ideal for VLSI implementa-
`tion. Exploitation of algorithm parallelism, and using
`techniques such as pipelining, can tailor architectures
`to particular sampling rate, area and power require-
`ments.
`In the field of electronic systems design generally,
`and DSP design in particular, systems are becoming
`increasingly complex, performance specifications are
becoming more stringent, and time-to-market pressures
are shortening design cycles. In response, attention
`is now being directed at employing design automa-
`tion to carry out more abstract, system-level design
`tasks, by using CAD tools to automatically derive cir-
`cuits from algorithmic descriptions and required per-
`formance specifications. Since this process involves
`translating a required behaviour into an equivalent cir-
`cuit structure, it is referred to as architectural synthesis.
`
`The purpose of this paper is to describe the IRIS ar-
`chitectural synthesis tool for high performance DSP.
`Unlike other approaches [1–7], the emphasis in IRIS is
`to allow architectural exploration of algorithms. The
`process involves using processing unit models to gen-
`erate optimal architectures based on those units. The
`main advantage of IRIS is that it offers the designer
`full freedom to investigate a wide range of architec-
`tures using user-preferred blocks or novel processing
`techniques. It is also tightly coupled to conventional
`VHDL synthesis tools, and has been used to produce
`practical and realistic designs.
`The remaining text of this paper describes the syn-
`thesis methodology and the operation of the IRIS ar-
`chitectural synthesis tool. In Section 2, an overview of
`commonly applied architectural synthesis techniques is
`given, together with some important limitations exhib-
`ited by these techniques when applied to the synthesis
`of novel, high performance DSP circuits. This is illus-
`trated using an IIR filter example in Section 3. A solu-
`tion to these difficulties is also presented in Section 3,
`which introduces a radical new synthesis methodol-
`ogy, the Modular Design Procedure, and discusses its
`automation within IRIS. The capabilities of the IRIS
`
`
`
`
`
`
`
`
`
system are discussed in Section 4, and demon-
strated in Sections 3 and 5, by carrying out
`novel implementations of second order recursive filter-
`ing and Discrete Cosine Transform algorithms. Finally,
`Section 6 offers some conclusions that can be drawn
`from the results of this research.
`
`2. Architectural Synthesis Methodologies
`
`Considerable effort has been focused on deriving meth-
`ods for mapping DSP algorithms onto VLSI archi-
`tectures, including techniques based on sets of Re-
currence Equations [2], dependence graphs [1] and
`algebraic methods [3]. Whilst many of these spe-
`cialist tools produce architectural representations from
`high levels of abstraction, they are unable to produce
functionally-correct, implementable circuits, as issues
such as internal word growth, truncation and data or-
ganisation have not been considered. These issues
affect the latency and timing of data and thereby in-
validate any derived architecture. It is left to the IC
designer to modify and refine the initial architectures,
taking these design details into consideration. The
`second order Infinite Impulse Response filter example
`in Section 3 demonstrates the radical differences that
`implementation-level performance criteria can create
`between abstract algorithmic representations and a cor-
`responding physical circuit.
Synthesis systems, such as CATHEDRAL,
PHIDEO, HYPER and MARS [4–7] can take algo-
`rithmic descriptions and apply scheduling, assignment
`and hardware mapping techniques to synthesize an ar-
`chitecture. With these systems, hardware mapping, the
`process that maps a flow graph onto the available hard-
`ware blocks, is carried out after the various scheduling
`and assignment procedures. The assumption is there-
`fore made that hardware mapping does not alter the
`structure or functionality of the design. For the hard-
`ware units generally used in reported design examples
`synthesized by these tools, this may be a valid assump-
`tion. However, it has been demonstrated [8] that if
`the hardware units are complex pipelined processors,
`hardware mapping can invalidate the architecture. This
`needs to be resolved if the circuit is to operate correctly
`once implemented using the chosen hardware, and has
`been shown to be a complex task [8].
`A vital issue relating to the various synthesis method-
`ologies is the manner in which the design space may be
`explored. Most of the current architectural synthesis
`tools apply a fixed set of synthesis procedures, regard-
`
`less of the different properties of the circuit in question.
`Hence, with these tools, a number of algorithmic trans-
`formation methods are derived and applied uniformly
`to all problems. This approach may not be appropriate
`for many designers, who explore design trade-offs in
`different ways, depending on the specifications and de-
`sign properties for the particular problem currently un-
`der consideration. For these designers, a tool based on
`fixed synthesis principles is undesirable, since what is
`required is a system that allows the designer to explore
`and trade-off different design issues in a completely
`open manner.
`We believe that there is considerable scope for a de-
`sign methodology that allows designers to easily imple-
`ment and optimise architectures based on processing
`elements, and hence specific hardware, of their own
`choosing. Designers could use favourite processing
`units (which allows the re-use of previously optimised
`circuit layouts) or clever novel technologies, such as
`redundant arithmetic processors [9], which can offer
`clear advantages for some applications. Therefore, the
`IRIS synthesis system has been developed, which en-
`ables the extraction of parameterised expressions from
`complex VLSI processing elements, and uses these ex-
`pressions to achieve functionally-correct solutions for
`circuits built from these processors. Designers can
`quickly create and evaluate architectures that utilise
`existing hardware blocks and can be easily realised us-
`ing commercial silicon design tools. Also, designers
`can take advantage of novel circuit designs, gaining ar-
`chitectural knowledge as well as synthesis capability.
`
`3. Architectural Synthesis Using the IRIS System
`
`The design input of the IRIS system is a Signal Flow
`Graph (SFG) representation of the algorithm, consist-
`ing of zero-delay processing nodes connected by edges
`which may be weighted with an appropriate number of
`delays. For example, consider the algorithm carried
`out by a second order Infinite Impulse Response filter,
`which is defined by Eq. (1).
yn = a0·xn + a1·xn-1 + a2·xn-2 + b1·yn-1 + b2·yn-2      (1)
`
`Figure 1 shows a SFG that is functionally equivalent
`to the algorithm of Eq. (1). Each processing node rep-
`resents a Multiply/Accumulate (MAC) operation, and
the black circles on particular graph edges represent a
`single delay, which is necessary to compute the algo-
`rithm.
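As a point of reference, the behaviour defined by Eq. (1) can be modelled directly in software. The sketch below is a minimal, illustrative model; the coefficient values in the usage example are assumptions for demonstration, not values drawn from the paper.

```python
# Minimal software model of the second order IIR filter of Eq. (1):
# y[n] = a0*x[n] + a1*x[n-1] + a2*x[n-2] + b1*y[n-1] + b2*y[n-2]
def iir2(x, a0, a1, a2, b1, b2):
    y = []
    for n in range(len(x)):
        xn1 = x[n - 1] if n >= 1 else 0   # x[n-1], zero before the sequence starts
        xn2 = x[n - 2] if n >= 2 else 0   # x[n-2]
        yn1 = y[n - 1] if n >= 1 else 0   # y[n-1], the recursive feedback term
        yn2 = y[n - 2] if n >= 2 else 0   # y[n-2]
        y.append(a0 * x[n] + a1 * xn1 + a2 * xn2 + b1 * yn1 + b2 * yn2)
    return y

# Impulse response for an illustrative coefficient set:
print(iir2([1, 0, 0], a0=1, a1=0.5, a2=0.25, b1=0.5, b2=0))
# [1, 1.0, 0.75]
```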
`
`0002
`
`
`
`
`
`P1: KCU
`Journal of VLSI Signal Processing
`
`KL430-05-Trainor
`
`April 9, 1997
`
`12:28
`
`Architectural Synthesis
`
`43
`
`issues, using the second order IIR filter as a design
`example.
`
`3.1. Derivation of IRIS Processor Models
`
`In order to utilise complex processors as blocks that can
`be used to synthesize DSP systems in IRIS, models of
`the various processors need to be derived and placed in
`a library within the tool. It is these models that replace
`the mathematical operations of the SFG, and it is the
`information contained within the models that is used by
`IRIS to determine the data timing changes caused by
`replacing the zero-delay SFG operation with complex,
`pipelined hardware.
`The processor models abstract much of the structural
`detail of the processor architecture, but retain enough
`performance-related information so that the effects of
`placing the models into the SFG can be determined.
`When a processor model is derived, the MDP demands
`that two processor performance measurements must be
`associated with the model. The first of these is the data
`format, or “time shape” of data entering or leaving each
`input or output of the processor. The time shape may
`be defined as the position in time of the bits or digits
`of the data value relative to each other [8]. Figure 2
`shows some examples of typical data time shapes.
`The detailed structure of the particular processor,
`and particularly the placement of internal pipelining
`latches, will determine the data time shape at each pro-
`cessor input and output.
`It is necessary to maintain
`information on the processor time shapes to determine
`what extra circuitry, if any, needs to be placed between
`connected processors to convert between the format of
`the data at the output of the first processor, and the
`format expected at the input of the second.
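The role of this conversion circuitry can be sketched numerically. In the illustrative model below, a time shape is represented as a list giving, for each bit position, the clock cycle (relative to the first bit) at which that bit appears at a port; the per-bit register counts needed between two connected processors then follow directly. The shapes used are assumed examples, not the specific shapes of Fig. 2.

```python
# Illustrative model of time-shape conversion between two processors.
# out_shape[i]: cycle at which bit i leaves the first processor.
# in_shape[i]:  cycle at which bit i is expected by the second.
def conversion_delays(out_shape, in_shape):
    diffs = [want - have for have, want in zip(out_shape, in_shape)]
    # Bits can only be delayed, never advanced, so shift every delay up
    # by a common offset until none is negative.
    offset = -min(diffs)
    return [d + offset for d in diffs]

# A word-parallel output (all bits at cycle 0) feeding a bit-skewed
# input (bit i expected one cycle after bit i-1):
print(conversion_delays([0, 0, 0, 0], [0, 1, 2, 3]))  # [0, 1, 2, 3]
# The reverse direction needs the mirror-image skewing registers:
print(conversion_delays([0, 1, 2, 3], [0, 0, 0, 0]))  # [3, 2, 1, 0]
```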
`The second processor performance issue that must
`be addressed by the equivalent IRIS processor model
`is the latency through each datapath in the processor.
`These latency values must be incorporated in the pro-
`cessor model since the number of clock cycles required
`for a particular processor to produce its results has pro-
`found effects on the timing of data throughout the entire
`circuit.
`An important design feature of the IRIS system is
`that the information on data time shapes and data-
`path latencies within the processor model can be pa-
`rameterised in terms of other important processor de-
`sign criteria, such as the data wordlengths used and
`the number of internal pipelining stages. Therefore,
`design changes, such as changes in data wordlengths,
`
`Figure 1. SFG for second order IIR filter.
`
`The synthesis technique involves replacing the
`generic processing nodes of the SFG with models of
`heavily-parameterised pipelined processing units from
`a library and applying a methodology called the Modu-
`lar Design Procedure (MDP) [8]. Unlike other systems
[4–7], within IRIS the processor parameters exactly
`model the performance of the individual processor.
`The process of synthesis thus involves replacing the
`zero-delay nodes in the original signal flow graph by the
`model of the practical processor. The original Signal
`Flow Graph with zero-delay operators defines the algo-
`rithmic functionality but this is changed when practical
`processors are inserted that have different latencies. If
the timing effects of using practical processors are not
`addressed then the resulting architecture will imple-
`ment a different algorithm.
`IRIS addresses this problem by retiming the cir-
cuit to preserve the original algorithm, but in such a
way as to minimise the number of delays that are added.
`This is achieved by calculating the maximum num-
`ber of delays within all the loops in the Signal Flow
`Graph and then using retiming [10] to ensure the global
`timing in the synthesized architecture is equivalent to
`that in the original SFG. The circuit will then have
`a maximum possible sampling rate.
`If this maxi-
`mum possible sampling rate meets the specified sam-
`pling rate, then the design has been successful and
`the user has achieved an optimised solution for the
`hardware used.
`If the maximum possible sampling
`rate is in excess of the sampling rate required, then
`the user can investigate alternative designs using dif-
ferent, slower but more hardware-efficient, blocks, or
by using hardware sharing, to achieve a more ef-
`ficient solution. Subsections 3.1 and 3.2 discuss these
`
`0003
`
`
`
` P1: KCU
`Journal of VLSI Signal Processing
`
`KL430-05-Trainor
`
`April 9, 1997
`
`12:28
`
`44
`
`Trainor, Woods and McCanny
`
`Figure 2. Typical data time shapes.
`
`can be easily taken into account. This parameterisa-
`tion is also complementary to the recent trend of the
`production, and purchase, of libraries of parameterised
`“mega-functions” [11], which are usually written in a
`Hardware Description Language. These libraries can
`increase design re-use and reduce time to market, by
allowing systems to be constructed by connecting com-
plex re-usable blocks of hardware.
`Whilst a library of processor elements is available
`to the user, a processor interface capability is available
`to allow the incorporation of new processing blocks
`within IRIS. These blocks could be application-speci-
`fic components from commercial vendors, allowing a
`tight coupling between IRIS and conventional compil-
`ers. Alternatively, these processors could use novel
`processing techniques, such as redundant arithmetic
`[9], which would be difficult to capture using conven-
`tional synthesis tools.
`
`3.1.1. Derivation of IRIS Processor Models for the
`Second Order IIR Filter. In order to demonstrate the
`construction of IRIS processor models, consider the
`implementation of the second order IIR filter SFG
`shown in Fig. 1. This design requires the insertion of
`blocks of hardware, capable of the MAC operation,
`into the processing nodes of the SFG. Examples of
`high-performance modules, structurally defined at the
`bit level, that could be used to realise the filter circuit
`are the Carry-Save and Signed Binary Number Repre-
`sentation (SBNR) MAC processors [9] shown in Figs. 3
`and 4.
`
In Fig. 3, the two signals p and q are multiplied, and
`the result added to the signal s. The labels pi , qi and si ,
`refer to data entering the processor, whilst po, qo and
`so refer to output data. The superscript of each label
`refers to the significance of the data bit to which that
`
`Figure 3. Structure of carry-save MAC processor.
`
`0004
`
`
`
` P1: KCU
`
`Journal of VLSI Signal Processing
`
`KL430-05-Trainor
`
`April 9, 1997
`
`12:28
`
`Architectural Synthesis
`
`45
`
`depicted by black dots in Fig. 3. These latches define
`the timing of the various operations within the MAC
`processor. An important point to notice is the extra
`latches that are added to the datapath of the s signal
`at the right hand side of the structure. These latches
`ensure that each bit of the input si travels through the
`same number of pipeline stages before emerging at so,
`therefore the relative timing of the various bits of data
`entering si is maintained at so. This allows several
`MAC processors to be cascaded without having to con-
`vert the data timing from the output of one processor
`to the input of the next. This arrangement is useful for
`several important DSP circuits, particularly digital fil-
`ters, which can be easily implemented using cascaded
`MAC processors.
`Figure 1 shows that the IIR filter SFG exhibits re-
`cursion, where previously generated results form the
`inputs for future iterations of the algorithm. Previous
`research has shown that for such structures, processors
`that exhibit low, wordlength-independent latency and
`use redundant number systems to produce results most
`significant digit first can produce more efficient imple-
`mentations of the feedback paths [9]. The SBNR MAC
`processor shown in Fig. 4 represents such a structure,
`multiplying the pi signal and the qi signal (which is rep-
`resented by the dotted lines in the figure) and adding
`the si signal. The use of SBNR allows the removal
`of carry propagation chains within the processor and
`hence the production of results in a most significant
`digit first format.
`Examples of IRIS processor models are shown in
`Figs. 5 and 6. These models have been derived from
`the MAC structures of Figs. 3 and 4, and will be used
`
`Figure 4. Structure of SBNR MAC processor.
`
superscript is attached, e.g., if a bit is identified with
a superscript “1”, that bit is of significance 2⁻¹, i.e.,
0.5. The subscript of each label indicates the relative
`clock cycle at which the bit signal enters or leaves the
`processor e.g., a bit labelled “n ¡ 1” enters or leaves
`the processor one clock cycle after a bit labelled “n”.
`This notation defines the timing of the various bits of
`data as they pass through the processor.
`The scheduling of the various operations in the
`Carry-Save MAC processor results in a number of
`pipeline stages, implemented by the addition of latches,
`
`Figure 5. Carry-save MAC model.
`
`0005
`
`
`
` P1: KCU
`
`Journal of VLSI Signal Processing
`
`KL430-05-Trainor
`
`April 9, 1997
`
`12:28
`
`46
`
`Trainor, Woods and McCanny
`
`is applied. The reason for this is that, under truncation,
`a number of bits from data emerging from the output
`of a processor may not be useful and, if the data time
`shape is of a skewed format, extra clock cycles may
`have to be applied before the first useable bit emerges.
`These extra cycles are reflected in the truncation term,
`the value of which depends on the output time shape
and on the output bit at which the truncation is applied.
`The derivation of the two IRIS MAC processor mod-
`els gives a practical demonstration of the performance
discrepancies that exist between the zero-delay, ab-
`stract MAC operations shown in the SFG of Fig. 1, and
`actual processing hardware, capable of performing the
`MAC operation. The physical MAC processors exhibit
`non-zero latency, expect data in specific formats, and
`exhibit performance characteristics that depend upon
`design parameters such as data wordlengths and lev-
`els of pipelining. At the SFG level of abstraction, no
`such issues have been considered, hence the need to
`translate the SFG into a practical circuit based on spe-
`cific hardware. This highlights the problems associated
`with many high-level design techniques [1–3], which
`derive SFG-like representations from algorithms, but
`leave the designer with the problem of refining the cir-
`cuit, based on the hardware chosen.
`
`3.2.
`
`IRIS Retiming Procedure
`
`As previously stated, once IRIS processor models have
`been placed into the SFG nodes, a rescheduling or re-
`timing of the circuit is now required, due to the changes
`in data timing. This procedure consists of two stages,
`namely delay scaling and retiming. The IRIS scaling
`process determines the maximum allowable sampling
`period and retimes the SFG taking into account this
`value. The maximum allowable sampling period, also
`referred to as the pipelining period [12], is defined in
`Eq. (2).
`
α = max ⌈TC / DC⌉  ∀ C      (2)
`
`In Eq. (2), TC is the total number of clock cycles
`taken by the processors in a loop C of the SFG, and
`DC is the total number of delay elements in the same
`loop. In order to preserve algorithmic functionality, the
`sampling rate of the architecture must be synchronized
`to the rate of the slowest loop. In order to determine
TC, and hence αC, for each loop, IRIS must calculate
`the latency of each processor in loop C which means it
`must be aware of performance characteristics of each
`
`Figure 6. SBNR MAC model.
`
`to demonstrate the IRIS synthesis and retiming proce-
`dures, by producing a practical IIR filter architecture
`from the SFG of Fig. 1. Whilst these models have been
`manually derived from the structures shown in Figs. 3
`and 4, schemes by which processor models may be
`automatically generated from suitably structured HDL
`descriptions are currently being developed.
`The graphical representations of the processor mod-
`els in Figs. 5 and 6 demonstrate the parameterisation of
`the important performance characteristics of the pro-
`cessors. At each input and output of the block, a rep-
`resentation of the data time shape is displayed, and
`through each datapath, a parameterised expression for
`the latency is shown. The parameters n and m represent
`the wordlengths of the p and q data, whilst the vari-
`able x represents the number of rows of cells between
`successive pipeline stages in the processor. When de-
termining the latency expressions, the smallest integer
greater than or equal to the ratio specified is used; this
is designated by the ceiling symbol ⌈a⌉ in Figs. 5
and 6. An important issue relating to the
`Carry-Save model is that it is based on a slightly mod-
`ified version of the architecture shown in Fig. 3, which
`extends the functionality of the processor to utilise twos
`complement arithmetic [8]. The effect of these modi-
`fications on the processor model is an increase in the
`latency of the S datapath by one clock cycle.
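A parameterised latency of this kind can be sketched as follows. The expression ⌈n/x⌉ is an assumed illustration of the general form such models take, not the exact formula of Figs. 5 or 6.

```python
import math

# Illustrative parameterised latency for a pipelined MAC datapath:
# n is the data wordlength, x the number of rows of cells between
# successive pipeline stages, and t an additive truncation term.
def datapath_latency(n, x, t=0):
    return math.ceil(n / x) + t

# With the parameter values used in Section 3.2.1 (12-bit data,
# pipeline stages every 4 rows) and no truncation:
print(datapath_latency(12, 4))  # 3
```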
`In addition to wordlength parameters, some of the la-
`tency values of the models depend on an additive trun-
`cation term, t, to reflect the fact that the latency through
`the datapath can be increased if numerical truncation
`
`0006
`
`
`
` P1: KCU
`
`Journal of VLSI Signal Processing
`
`KL430-05-Trainor
`
`April 9, 1997
`
`12:28
`
`Architectural Synthesis
`
`47
`
`parameterised processing block. When the pipelining
`period has been determined, all delay elements within
`the SFG are scaled up by this value which is equivalent
`to adjusting the rate of the entire structure to its slowest
`loop.
`After delay scaling, the retiming procedure needs
`to be applied, in order to compensate for the different
`performance characteristics between the generic SFG
`processors and the practical models, as demonstrated
`in Section 3.1. In addition to the processor latencies,
`the retiming problem must model the effects on timing
`of truncation and any required data organisation con-
`version circuitry (the nature of which is determined by
`IRIS before retiming occurs). In the retiming proce-
`dure, the latency of each datapath through a processor
`is realised by removing the appropriate number of de-
`lays from the edges of the scaled SFG and incorporating
`them within the processor (referred to as embedding the
`processor block). Latches are transferred as required
`to ensure that embedding occurs correctly, whilst min-
`imising the total number of latches used. This has been
`achieved by formulating the retiming routine as a linear
`programming problem [10].
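The retiming formulation cited from [10] can be sketched as a feasibility check: each node receives an integer retiming value, and embedding succeeds when every retimed edge carries at least the delays its processor datapath requires. The graph, retiming values and per-edge requirements below are illustrative only.

```python
# Leiserson-Saxe style retiming check: each node v gets an integer
# retiming value r[v]; an edge u -> v that originally held w delays
# ends up with w + r[v] - r[u] registers after retiming.
def retimed_weights(edges, r):
    # edges: iterable of (u, v, w) tuples; r: dict node -> int
    return {(u, v): w + r[v] - r[u] for u, v, w in edges}

def is_legal_retiming(edges, r, required):
    # Legal when every retimed edge meets the delay count its
    # embedded processor datapath needs.
    wr = retimed_weights(edges, r)
    return all(wr[(u, v)] >= required.get((u, v), 0) for u, v, _ in edges)

# Two-node loop holding 2 delays; the processor on edge a -> b
# needs 1 delay embedded within it:
edges = [("a", "b", 0), ("b", "a", 2)]
print(is_legal_retiming(edges, {"a": 0, "b": 1}, {("a", "b"): 1}))  # True
print(is_legal_retiming(edges, {"a": 0, "b": 0}, {("a", "b"): 1}))  # False
```

In the full formulation, an objective minimising the total number of registers is added and the problem is solved as a linear program; the sketch above only checks a candidate solution.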
`
`3.2.1. Retiming of the IIR Filter Example. For the
`purposes of the IIR filter example, assume that the
circuit is to be implemented using twos comple-
`ment Carry-Save MAC processors for the three non-
`recursive SFG nodes, and SBNR MAC processors for
`the two recursive nodes, since the SBNR modules gen-
`erally have low latency, which leads to smaller values
of α in Eq. (2), and hence more efficient recursive
`circuits. The Carry-Save processors are employed in
`the non-recursive section because they occupy less area
`than equivalent SBNR processors.
`Parameter values for the models shown in Figs. 5
`and 6 need to be supplied to calculate specific latency
`values and data time shapes for the filter design.
`In
`this example, data wordlengths of 12 bits/digits and
`pipelining stages placed every 4 rows of MAC cells are
`assumed. Placing these values into the parameterised
`expressions of the models shown in Figs. 5 and 6, and
`assuming no extra latency caused by data truncation,
`gives the particular models displayed in Figs. 7 and 8.
`The SFG of Fig. 1 exhibits two recursive loops,
`highlighted in Fig. 9. Using the latency expressions of
`the models shown in Figs. 7 and 8, evaluating Eq. (2)
`gives the result displayed in Eq. (3).
α1 = ⌈2 / 1⌉ = 2          (loop 1)
α2 = ⌈(2 + 2) / 2⌉ = 2    (loop 2)      (3)
`
`Figure 7. Carry-save MAC with resolved parameters.
`
`Equation (3) shows that both loops in the IIR fil-
`ter imply a pipelining period of 2 cycles, and hence all
`delay elements in the filter SFG must be scaled by a fac-
`tor of 2 before any further retiming takes place. This
`effectively synchronises the circuit to the rate of the
`slowest loop, and determines that the filter circuit will
`be 50% efficient, i.e., new iterations of the algorithm
`can commence every 2nd clock cycle. It is important
`to note that delay scaling is only applied to delay ele-
`ments present in the original SFG, and not to registers
`introduced by the subsequent retiming process.
`Applying delay scaling and retiming to the IIR filter
`problem results in the synthesis of a filter architecture,
`
`Figure 8. SBNR MAC with resolved parameters.
`
`0007
`
`
`
`
`
`P1: KCU
`Journal of VLSI Signal Processing
`
`KL430-05-Trainor
`
`April 9, 1997
`
`12:28
`
`48
`
`Trainor, Woods and McCanny
`
`through the SFG by IRIS to model the various laten-
`cies associated with the particular processor models
`employed.
`An important issue relating to the filter design is
`the appearance of delays on the inputs of the circuit,
`indicating that retiming has been carried out in such a
`way that the external I/O of the original SFG, where
`all inputs arrive simultaneously, has been preserved
`in the synthesized architecture. The manner in which
`these delays are considered depends on the designer
`and the application. If the inputs of the circuit need to
`enter simultaneously, then the delays on the inputs of
`Fig. 10 represent the delay circuitry that is physically
`required.
`If simultaneous input entry is not important, then
`these delay values represent the relative timing required
`between the different inputs for correct operation, and
`it is the responsibility of the designer to ensure that this
`timing of the external signals is met.
`As previously stated, new iterations of the filter algo-
`rithm can be initiated every second clock cycle, since
`the calculated pipelining period is 2. This implies that
`the sample rate will be only 50% of the corresponding
`clock frequency. There are a number of approaches
`that can be adopted to change or take advantage of this
`characteristic. If the sample rate specification has al-
`ready been met, then scheduling techniques [13] can
`be applied by IRIS, to take advantage of the idle cy-
`cles in the circuit, thereby time-multiplexing several
`similar operations onto individual processors. This re-
`sults in a hardware saving with no reduction in overall
`processing speed.
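The hardware saving can be illustrated with a toy round-robin assignment: a pipelining period of 2 leaves every processor idle on alternate cycles, so two similar operations can share one physical unit. The operation names and scheduler below are assumptions for illustration, not part of IRIS.

```python
# With a pipelining period of `period` cycles, each physical unit
# accepts a new operation only once per period, so `period` logical
# operations can be time-multiplexed onto one unit.
def share(ops, period):
    # Map each logical operation to (physical unit, issue phase).
    return {op: (i // period, i % period) for i, op in enumerate(ops)}

# Four logical MACs fold onto two physical units, each issuing on
# alternate phases of the 2-cycle period:
print(share(["mac0", "mac1", "mac2", "mac3"], 2))
```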
`Alternatives to scheduling procedures, which can
`be applied to increase the sample rate of the circuit,
`include architectural transformations, and employing
`lower-latency processors in the recursive section of the
`IIR filter. For example, a modified version of the SBNR
`MAC processor shown in Fig. 4 has been reported in
`[14], which can be applied to recursive filters, and ex-
`hibits a latency of only a single clock cycle.
`Essentially, our synthesis methodology provides
`three mechanisms by which the design space may
`be explored, namely scheduling and allocation tech-
`niques, architectural transformations, and the choice
`of practical processors which have parameterised
`performance characteristics. The first two of these
`mechanisms are commonly employed in architectural
`synthesis tools, however the third concept has not
`been adequately considered by previous synthesis sys-
`tems. The relationship between the different design
`
`Figure 9. Recursive loops in the IIR filter circuit.
`
`based on and optimised to the specific MAC proces-
`sor hardware chosen. The generated architecture is
`shown in Fig. 10. In this circuit, it can be seen that two
`additional blocks have been placed in the circuit, enti-
`tled “tc msd” and “msd tc”, which convert data from
`twos complement representation to SBNR and vice-
`versa. IRIS can detect the need for this extra circuitry
`by specifying the number system used by a particu-
`lar processor in the IRIS processor model, and placing
`numerical conversion circuitry where required.
`In the architecture of Fig. 10, a number of blocks
`of delays have been introduced by the retiming pro-
`cess to account for the internal latencies of the pro-
`cessors and any conversion circuitry required, thereby
`ensuring that the global data timing of the synthesized
`architecture is the same as the original SFG. These
blocks of timing delays are labelled xD, where x is
`the number of delays. Notice the block of 11 de-
`lays between the third twos complement MAC and the
`“tc msd” processor. This is caused by skewing cir-
`cuitry, required to convert the data time shape from
the MAC output to the “tc msd” input. A total of 31
`pipeline stages have been introduced and propagated
`
`Figure 10. Synthesized architecture for the IIR filter example.
`
`0008
`
`
`
`
`
`P1: KCU
`Journal of VLSI Signal Processing
`
`KL430-05-Trainor
`
`April 9, 1997
`
`12:28
`
`exploration methods will be demonstrated in the de-
`sign example discussed in Section 5.
`
`4.
`
`IRIS System Structure and Operation
`
`The functionality of IRIS can be partitioned into sev-
`eral subsystems, each carrying out a subset of the var-
`ious design functions that the tool makes available.
`Figure 11 gives a system viewpoint of the IRIS tool.
`In Fig. 11, the IRIS shell is shown as the central
`communicating process for the system. This shell is a
`windows-based program running environment, which
`processes commands from the user, invokes and moni-
`tors several tools within the IRIS framework, and relays
`the textual and graphical information produced by these
`tools back to the user. Parameter files provide the shell
`with information regarding the display, the location of
`libraries, access to help files and error messages, and
list the tools that can be invoked within the shell.
`From the shell, the user has control over the invo-
`cation and operation of two tools, a schematic editor,
`which also functions as an interface for the various syn-
`thesis functions, and a designer for the parameterised
`processor models required by the MDP. The processor
`designer is currently a simple parameter capture and
`analysis tool, which allows the user to create and eval-
`uate the various parameterised expressions that govern
`the characteristics of each processor model. Hierarchi-
`cal synthesis of systems is also possible by generating
`processor models from circuit architectures previously
`synthesized by IRIS. Within each processor model, in
`addition to the parameterised latency and data format
`
Figure 11. IRIS system functionality diagram.
`
`
`expressions, the designer can associate parameterised
`area and timing expressions for the processor, thereby
`allowing IRIS to make speed and area estimates for syn-
`thesized circuits. These values can be fed back from
`silicon implementation tools, and investigation is con-
`tinuing into methods by which more accurate estimates
`of circuit area and speed can be made at the architec-
`tural level. IRIS can use these values as a guide for
`determining whether or not the architectural designs
it creates meet the sample rate and area specifications
`supplied by the designer. In addition, the nature of IRIS
`is such that algorithms can very quickly be mapped onto
`silicon, and at the silicon implementation level, perfor-
`mance estimates are much more accurate than at the
`architectural level. Therefore, it is possible to iterate
`around the architectural synthesis and implementation
`stages without incurring large penalties in design time.
`Within the schematic editor, SFG schematics can be
`created using instances of various processing blocks,
`connected together by wires (representing the signals in
`the SFG) and terminated with external connectors (rep-
`resenting the external inputs and outputs to the SFG).
`From the schematic editor, a number of design verifi-
`cation and synthesis functions can be invoked, includ-
`ing a symbolic simulator which generates a difference
`equation corresponding to the SFG, and a numerical
`error analysis tool. Th