`download and build JHDL from sources themselves. Documentation on the
`JHDL system is provided as well.
`
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 301
`
`
`
`PART
`
III
`
`MAPPING DESIGNS TO
`RECONFIGURABLE PLATFORMS
`
The chapters that follow cover the key mapping steps unique to field-programmable gate arrays (FPGAs) and reconfigurable targets. These steps include technology mapping to the primitive FPGA programmable gates (Chapter 13), placement of these gates (Chapters 14 through 16), routing of the interconnect between gates (Chapter 17), retiming of registers in the design (Chapter 18), and bitstream generation (Chapter 19). A final chapter summarizes a number of approaches to accelerating various stages of the mapping process (Chapter 20).
Placement is a difficult mapping problem, but is critical to the performance of the resulting reconfigurable design. As a result, it can be very slow, limiting the rate of the edit-compile-debug loop for reconfigurable application development, and the designs it produces may have longer cycle times than we would like. For these reasons, in addition to the general-purpose algorithms for placement covered in Chapter 14, algorithms that are highly optimized to exploit the regularity of datapaths are discussed in Chapter 15, and constructive approaches to layout are treated in Chapter 16. These more specialized approaches can significantly reduce placement runtime and often deliver placements that allow faster design operation.
As Chapters 13 through 20 demonstrate, there is a well-developed set of approaches and tools for programming reconfigurable applications. However, the tools are always slower than we might like them to be, especially as FPGA capacities continue to grow with Moore's Law. Moreover, the designs they produce are often too large or too slow, and the level at which we must program them is often lower than optimal. These deficiencies present ample opportunities for innovation and improvement in software support for reconfigurable systems.
For the designer who works on reconfiguration issues, the following chapters provide a look under the covers at the tools used to map designs and at the problems they must solve. It is important to understand which problems the tools are and are not solving and how well they can be expected to work. An understanding of the mapping flow and algorithms often helps the designer appreciate why tools may not produce the quality of results expected and how the design could be optimized to obtain better results. Similarly, understanding the problems that the tools are solving helps the designer understand the trade-offs associated with higher- or lower-level designs and how to mix and match design levels to obtain the desired quality of results with minimal effort.
For the tool or software developer, this part covers the key steps in a traditional tool flow and summarizes the key algorithms used to map reconfigurable designs. With this knowledge the developer can rapidly assimilate conventional approaches and options and thus prepare to explore opportunities to improve quality of results, reduce tool time, or increase automation and raise the configurable design's level of abstraction.
`
`
`
`
`CHAPTER
`
`13
`
`TECHNOLOGY MAPPING
`
`Jason Cong
`Department of Computer Science
`California NanoSystems Institute
`University of California-Los Angeles
`Peichen Pan
`Magma Design Automation, Inc.
`
Technology mapping is an essential step in a field-programmable gate array (FPGA) design flow. It is the process of converting a network of technology-independent logic gates into a network comprising logic cells on the target FPGA device. Technology mapping has a significant impact on the quality of the final FPGA implementation.
Technology-mapping algorithms have been proposed for optimizing area [29, 36, 58, 65], timing [9, 12, 13, 19, 21, 37, 58], power [2, 8, 34, 45, 52, 71], and routability [3, 67]. Mapping algorithms can be classified into those for general networks [13, 16] and those for special ones such as treelike networks [35, 36]. Algorithms for special networks may be applied to general ones through partitioning, with a possible reduction in solution quality.
Technology-mapping algorithms can be structural or functional. A structural mapping algorithm does not modify the input network other than to duplicate logic [12, 13]. It reduces technology mapping to a covering problem in which the technology-independent logic gates in the input network are covered with logic cones such that each cone can be implemented using one logic cell (for LUT-based FPGAs, a K-input lookup table, or K-LUT). Figure 13.1 is an example of structural mapping. The logic gates in the original network (a) are covered with three logic cones, each with at most three inputs, as indicated in (b). Note that node i is included in two cones and will be duplicated. The corresponding mapping solution (c) comprises three 3-LUTs.
A functional mapping algorithm, on the other hand, treats technology mapping in its general form as a problem of Boolean transformation/decomposition of the input network into a set of interconnected logic cells [15, 48, 58, 60]. It mixes Boolean optimization with covering. Functional mapping algorithms tend to be time consuming, which limits their use to small designs or to small portions of a design.
`
`Note: This work is partially supported by the National Science Foundation under grant number
`CCF 0530261.
`
`
`Recent advances in technology mapping try to combine mapping with other
`steps in the design flow. Such integrated mapping algorithms have the potential
`to explore a larger solution space than is possible with just technology mapping
`and thus have the potential to arrive at mapping solutions with better quality.
`For example, algorithms have been proposed to combine logic synthesis with
`covering to overcome the limitations of pure structural mapping [11, 22, 57].
`
`13.1 STRUCTURAL MAPPING ALGORITHMS
Technology mapping is part of a logic synthesis flow, which typically consists of three steps. First, the initial network is optimized using technology-independent optimization techniques such as node extraction/substitution and don't-care optimization [33]. Second, the optimized network is decomposed into one consisting of 2-input gates plus inverters (that is, the network is 2-bounded) to increase flexibility in mapping [12, 36]. Third, the actual mapping takes place, with the goal of covering the 2-bounded network with K-LUTs while optimizing one or more objectives. In the remaining discussion, we assume that the input network is 2-bounded.
A logic network can be represented as a graph where the nodes represent logic gates, primary inputs (PIs), and primary outputs (POs). The edges represent the interconnects or wires. A cut of a node v is a set of nodes in the input network such that every path from the primary inputs or sequential element outputs to v contains at least one node in the set. A K-cut is a cut with at most K nodes. For example, {a, b, z} is a 3-cut for the node y in the network in Figure 13.1(a). Given a K-cut for v, we can obtain a K-LUT for v by collapsing the gates in the logic cone between the nodes in the cut and v, including v itself. For the 3-cut {a, b, z} for y, the 3-LUT for y in Figure 13.1(c) is derived from the corresponding cone indicated for y in Figure 13.1(b).
`
`
`FIGURE 13.1 ■ Structural technology mapping: (a) original network, (b) covering, and (c) mapping
`solution.
`
`
`Most structural mapping algorithms are based on the dynamic programming
`technique. They typically consist of the following steps:
`
`1. Cut generation/enumeration
`2. Cut ranking
`3. Cut selection
`4. Final mapping solution generation
`
Cut generation obtains one or more cuts that will be used to generate LUTs; it is discussed in the next section. Cut ranking evaluates the cuts obtained in cut generation to see how good they are based on the mapping objectives. It assigns a label or cost to each cut by visiting the nodes in a topological order from PIs to POs. Cut selection picks a cut with the best label for each node and is typically done in reverse topological order from POs to PIs. Cut ranking and selection may be carried out multiple times to refine solution quality.

After the final cut selection, a mapping solution is generated using the selected cuts. In this step, the nodes are visited in the reverse topological order, starting from POs and going back to PIs. At each node, a cut with the best label is selected and the corresponding LUT is added to the solution. Next, the nodes that drive the LUT are visited. This process is repeated until only PIs are left. At that point, a complete mapping solution is obtained.
`
13.1.1 Cut Generation
Early mapping algorithms combine cut generation and selection to determine one or a few "good" cuts for each node. The most successful example is the FlowMap algorithm, which finds a single cut with optimal mapping depth at each node via max-flow computation [16]. It computes the optimal mapping depth of each node in a topological order from PIs to POs, and at each node uses a max-flow formulation to test whether that node can have the same optimal mapping depth as the maximum depth of its input nodes. If not, the depth is set to one greater than the input nodes' maximum depth. It is shown that these are the only two possible mapping depths. The FlowMap algorithm was the first polynomial-time algorithm to find a depth-optimal mapping solution for LUT-based FPGAs.
In practice, K, the number of inputs of the LUTs, is a small constant, typically ranging between 3 and 6. It thus becomes practical to enumerate all K-cuts for each node. With all cuts available, we have additional flexibility in selecting cuts to optimize the mapping solution.
Cuts can be generated by a traversal of the nodes in a combinational network (or the combinational portion of a sequential network) from PIs to POs in a topological order [29, 67]. Let Φ(v) denote the set of all K-cuts for a node v. For a PI, Φ(v) contains only the trivial cut consisting of the node itself, that is, Φ(v) = {{v}}. For a non-PI node v with two fanin nodes, u1 and u2, Φ(v) can be computed by merging the sets of cuts of u1 and u2 as follows:

Φ(v) = {{v}} ∪ { c1 ∪ c2 | c1 ∈ Φ(u1), c2 ∈ Φ(u2), |c1 ∪ c2| ≤ K }

(13.1)
`
`
`In other words, the set of cuts of v is obtained by the pairwise union of the
`cuts of its fanin nodes and then the elimination of those cuts with more than
`K nodes. Note that the trivial cut is added to the set. This is necessary so the
`nodes driven by v can include v in their cuts.
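The recurrence in equation 13.1 maps directly onto a topological traversal. The following Python sketch is illustrative only (the dictionary-based network representation and node names are assumptions, not from the chapter); it enumerates all K-cuts of a bounded-fanin network:

```python
from itertools import product

def enumerate_k_cuts(nodes, fanins, K):
    """Enumerate all K-cuts per node (equation 13.1).

    `nodes` is in topological order from PIs to POs; `fanins[v]` lists the
    fanin nodes of v (empty for a PI). Returns node -> set of frozenset cuts.
    """
    cuts = {}
    for v in nodes:
        if not fanins[v]:
            # A PI has only the trivial cut {v}.
            cuts[v] = {frozenset([v])}
            continue
        # Always include the trivial cut, so fanouts of v may include v in their cuts.
        merged = {frozenset([v])}
        # Pairwise-union the cut sets of the fanins; drop cuts with more than K nodes.
        for combo in product(*(cuts[u] for u in fanins[v])):
            c = frozenset().union(*combo)
            if len(c) <= K:
                merged.add(c)
        cuts[v] = merged
    return cuts
```

For the 2-bounded networks assumed in this section, each non-PI node has at most two fanins, so the product above examines at most |Φ(u1)|·|Φ(u2)| pairs per node.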
`
`13.1.2 Area-oriented Mapping
For LUT mapping, the area of a mapping solution can be measured by the total number of LUTs. It has been shown that finding an area-optimal mapping solution is NP-hard [35]. Therefore, it is unlikely that there is an accurate way to rank cuts for area. The difficulty of precise area estimation is mainly due to the existence of multiple-fanout nodes. In fact, for treelike networks, area-optimal mapping solutions can be determined in polynomial time [35].
Cong et al. [29] proposed the concept of effective area as a way to rank and select cuts for area. A similar concept, area flow, was later proposed by Manohararajah et al. [55]. The intuition behind effective area is to distribute the area for a multi-fanout node to its fanout nodes so that logic sharing and reconvergence can be considered during area cost propagation. Effective areas are computed in a topological order from PIs to POs. The effective area a(v) of a PI node v is set to zero. Equation 13.2 is used to compute the effective area of a cut:
`
a(c) = ( Σ_{u∈c} [a(u)/|output(u)|] ) + Ac

(13.2)

where Ac is the area of the LUT corresponding to the cut c. The area cost of a non-PI node can then be set to the minimum effective area of its cuts: a(v) = min{ a(c) | ∀c ∈ Φ(v) }.
It should be pointed out that effective area may not account for situations where a node is duplicated in the mapping solution. In the example shown in Figure 13.2, with K = 3, the LUT for w is introduced solely for the LUT for v. However, in the effective area computation, only one-half of its area is counted toward v, and as a result the LUT for w is undercounted. In this example, the sum of the effective areas of the POs is 2.5, whereas the mapping solution has three LUTs. In general, effective area is a lower bound on the actual area.
`
`FIGURE 13.2 ■ Inaccuracy in effective area.
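Equation 13.2 can likewise be evaluated in one topological pass over the enumerated cuts of each node. The sketch below is a minimal illustration assuming unit LUT area; the fanout-count dictionary and network representation are hypothetical, not the chapter's code:

```python
def effective_area(nodes, fanins, fanouts, cuts, lut_area=1.0):
    """Rank cuts by effective area (equation 13.2), visiting PIs to POs.

    a(PI) = 0; a(c) = sum_{u in c} a(u)/|output(u)| + A_c;
    a(v) = minimum over the nontrivial cuts of v.
    Returns (area cost per node, best cut per non-PI node).
    """
    a, best_cut = {}, {}
    for v in nodes:
        if not fanins[v]:
            a[v] = 0.0          # a PI contributes no LUT area
            continue
        best = None
        for c in cuts[v]:
            if c == frozenset([v]):
                continue        # the trivial cut does not form an LUT rooted at v
            cost = sum(a[u] / fanouts[u] for u in c) + lut_area
            if best is None or cost < best:
                best, best_cut[v] = cost, c
        a[v] = best
    return a, best_cut
```

Dividing a(u) by the fanout count is exactly the distribution step described in the text: a multi-fanout node's area is shared among its fanouts, so reconvergent paths do not double-count it.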
`
`
`The PRAETOR algorithm [29] is an area-oriented mapping algorithm that
`ranks cuts using effective area. It further improves the basic mapping framework
`with a number of area reduction techniques. One such technique is to encourage
`the use of common subcuts. A cut for a fanout of a node v induces a cut for v
`(perhaps the trivial cut consisting of v itself). If two fanouts of v induce different
`cuts for v, the most likely result will be an area increase due to the need to
`duplicate v and possibly some of its predecessor nodes. To alleviate this problem,
`PRAETOR sorts and selects cuts with the same effective area in a predetermined
`order to avoid arbitrary selection. It assigns an integer ID to each node and then
`sorts all cuts with the same effective area according to the lexicographic order
`based on the IDs of the nodes in the cuts. The first cut with minimum effective
`area for each node is selected.
Another area reduction technique introduced in PRAETOR is to carry out cut selection twice. The nodes with LUTs selected in the first pass are declared nonduplicable and can only be covered by LUTs for themselves in the second pass. This encourages selection of cuts with less duplication. As an example, suppose that in the first pass of cut selection, the mapping solution shown in Figure 13.3(a), with four LUTs, is selected. In the second pass, the LUT containing v and u1 is excluded from consideration for u1. This exclusion will also encourage the selection of the cut that results in the LUT containing a for u1. As a result, the mapping algorithm generates, in the second pass, the mapping solution in Figure 13.3(b), with only three LUTs. Experimental results show that PRAETOR can significantly improve area over previous algorithms.
FIGURE 13.3 ■ Effect of excluding cuts across nonduplicable nodes (K = 3): (a) initial mapping solution, and (b) improved solution with better area.

The IMap algorithm proposed by Manohararajah et al. [55] is another mapping algorithm targeting area optimization. It introduced two enhancements: (1) iterating between cut ranking and cut selection multiple times, and (2) adjusting the area costs between successive iterations using history information. In the effective area formula (equation 13.2), the fanout count of u in the initial network, |output(u)|, is used to estimate the fanout count of the LUT rooted at u in the mapping solution. In the IMap algorithm, between iterations the fanout count estimation is updated by using a weighted combination of the estimated and the real fanout counts in previous iterations. As a result, equation 13.2 becomes a(c) = ( Σ_{u∈c} [a(u)/estimated_fc(u)] ) + Ac, where estimated_fc(u) denotes the estimated fanout count for the current iteration.
Ling et al. [54] proposed a mixed structural and functional area-mapping algorithm that starts with a mapping solution (e.g., one generated by a structural mapping algorithm). The key idea is a Boolean satisfiability (SAT) formulation for the problem of mapping a small circuit with up to ten inputs into the smallest possible number of LUTs. The algorithm iteratively selects a small logic cone to remap into fewer LUTs using a SAT solver. It is shown that for some highly structured (albeit small) designs, area can be improved significantly.
Most area optimization techniques are heuristic. A natural question is how close the mapping solutions obtained using existing mapping algorithms are to optimal. Cong and Minkovich [24] constructed a set of designs with known optimal area-mapping solutions, called LEKO (logic synthesis examples with known optimal bounds) examples, and tried existing academic algorithms and commercial tools on them. The average gap from optimal varied from 5 to 23 percent. From the LEKO examples, they further derived LEKU (logic synthesis examples with known upper bounds) examples that require both logic optimization and mapping. Existing algorithms perform poorly on LEKU examples, with an average optimality gap of more than 70 times. This indicates that more research is needed in area-oriented mapping and optimization.
`
`13.1.3 Performance-driven Mapping
`The FlowMap algorithm and its derivatives can find a mapping solution with
`optimal depth. Recent advances in delay mapping focus on achieving the best
`performance with minimal area.
`Exact layout information is not available during technology mapping in a
`typical FPGA design flow. Mapping algorithms usually ignore routing delays and
`try to optimize the total cell delays on the longest combinational paths in the
`mapping solution.
Most delay optimal mapping algorithms use the labeling scheme introduced in the FlowMap algorithm to rank and select cuts. The label of a PI is set to zero, assuming that the signal arrives at the beginning of the clock edge. After the labels for all the nodes in the fanin cone of a node v are found, the label of a cut c of v is determined using the formula in equation 13.3:

l(c) = max{ l(u) + Dc | ∀u ∈ c }

(13.3)

where Dc is the delay of the LUT corresponding to c. Intuitively, l(c) is the best arrival time at v if it is covered using the LUT generated from c. The label of v is then the smallest label among all of its cuts: l(v) = min{ l(c) | ∀c ∈ Φ(v) }.
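Assuming unit LUT delay and pre-enumerated cuts, the labeling recurrence of equation 13.3 reduces to a few lines. The sketch below is illustrative (the network representation is an assumption, not the chapter's code):

```python
def depth_labels(nodes, fanins, cuts, lut_delay=1):
    """FlowMap-style depth labels (equation 13.3) over enumerated cuts.

    l(PI) = 0; l(c) = max_{u in c} l(u) + D_c; l(v) = min over cuts of v.
    `nodes` must be in topological order from PIs to POs.
    """
    labels = {}
    for v in nodes:
        if not fanins[v]:
            labels[v] = 0                    # signal available at the clock edge
            continue
        labels[v] = min(
            max(labels[u] for u in c) + lut_delay
            for c in cuts[v]
            if c != frozenset([v])           # skip the trivial cut
        )
    return labels
```

Because each label depends only on labels of nodes earlier in the topological order, one forward pass suffices, mirroring the PI-to-PO ranking step described above.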
DAOmap [9] is a mapping algorithm that guarantees optimal delay while at the same time minimizing area. It introduces three key techniques to optimize area without degrading timing. First, it enhances effective area computation to better avoid node duplication. Second, it applies area optimization techniques on noncritical paths. Last, it uses an iterative cut selection procedure to explore and perturb the solution space to improve solution quality.
`DAOmap first picks cuts with the minimum label for each node. From those, it
`then picks one with minimum effective area. Furthermore, when there is positive
`slack, which is the difference between required time and arrival time at a node,
`it picks a cut with as small an area cost as possible under the condition that the
`timing increase does not exceed the slack.
`Recognizing the heuristic nature of effective area computation, DAOmap also
`employs the technique of multiple passes of cut selection. Moreover, it adjusts
`area costs based on input sharing to encourage using nodes that have already
`been contained in selected cuts. This reduces the chance that a newly picked
`cut cuts into the interior of existing LUTs. Between successive iterations of cut
`selection, DAOmap also adjusts area cost to encourage selecting cuts containing
`nodes with a large number of fanouts in previous iterations. There are a few
`other secondary techniques used in DAOmap. The interested reader is referred
`to Chen and Cong [9] for details.
`Based on the results reported, DAOmap can improve the area by about
`13 percent on a large set of academic and industrial designs while maintaining
`optimal depths. It is also many times faster than previous mapping algorithms
`based on max-flow computation, mainly because of efficient implementation of
`cut enumeration.
A recent delay optimal mapping algorithm introduced several techniques to improve area while preserving performance [57]. Like DAOmap, this algorithm goes through several passes of cut selection, with each pass selecting cuts with better areas among the cuts that do not degrade timing. It is also based on the concept of effective area (or area flow). However, it does cut selection from PIs to POs instead of from POs to PIs, as in most other algorithms. With this processing order, the algorithm tries to use timing slacks on nodes close to PIs to reduce area cost. This is based on the observation that logic is typically denser close to PIs, so slack relaxation is more effective for nodes closer to PIs. Experimental data shows 7 percent better area over DAOmap for the same optimal depths.
`
`13.1.4 Power-aware Mapping
Power has become a major concern for FPGAs [51, 68]. Dynamic power dissipation in FPGAs results from charging and discharging capacitances. It is determined by the switching activities and the load capacitance of the LUT outputs and can be captured by equation 13.4:

P = (1/2) · Σ_v (Cv · fv · V²)

(13.4)

where Cv is the output load capacitance of node v, fv is the switching activity of node v, and V is the supply voltage. Given a fixed supply voltage, power consumption in a mapped netlist is determined by the switching activities and load capacitance of the LUT outputs.
`
`
`Because technology mapping for power is NP-hard [34], a number of heuristic
`algorithms have been proposed. Most power-aware mapping algorithms try to
`reduce switching activities by hiding nodes with high switching activities inside
`LUTs, hence leaving LUTs with small output-switching activities in the mapped
`netlist.
Anderson and Najm [2] proposed a mapping algorithm to reduce switching activities and minimize logic duplication. Logic duplication is necessary to optimize timing and area, but can potentially increase power consumption. The algorithm uses the following power-aware cost function to rank cuts: Cost(c) = l(c) + β·P(c) + γ·R(c), where l(c) is the depth label of the cut c as given in equation 13.3 and P(c) and R(c) are the power and replication costs of the cut, respectively. The weighting factors β and γ can be used to bias the three cost terms. Anderson and Najm suggest a very small β to get a depth-optimal mapping solution with minimal power.
Power cost P(c) is defined in such a way that it encourages absorbing high-activity connections inside LUTs. The replication cost tries to discourage logic duplication on timing noncritical paths. Power savings of over 14 percent were reported over timing-oriented mapping algorithms when both targeted optimal depths. When the mapping depth was relaxed by one level over optimal, additional power reduction of about 8 percent for 4-LUTs and 10 percent for 5-LUTs was reported.
`One serious limitation of the power-based ranking in Anderson and Najm
`[2] is that it cannot account for multiple fanouts and reconvergence, which
`are common in most practical designs. Chen et al. [8] proposed a low-power
`technology-mapping algorithm based on an improved power-aware ranking in
`equation 13.5:
`
P(c) = ( Σ_{u∈c} [P(u)/|output(u)|] ) + Uc

(13.5)

where Uc is a cost function that tries to capture the power contributed by the cut c itself. Experimental results show that this algorithm outperforms previous power-aware mapping algorithms. It has also been extended to handle dual supply voltage FPGA architectures.
`
`13.2
`
`INTEGRATED MAPPING ALGORITHMS
`
Technology mapping is a step in the middle of an FPGA design flow. Technology-independent optimization is carried out before mapping; placement is carried out after. Sequential optimization such as retiming can be carried out before or after mapping. A separate approach can miss the best overall solutions even if we can solve each individual step optimally. In the sections that follow we discuss mapping algorithms that combine mapping with other steps in the design flow.
`
`13.2.1 Simultaneous Logic Synthesis, Mapping
Technology-independent Boolean optimizations carried out prior to technology mapping can significantly impact the mapping solution. During technology-independent optimization, we have the freedom to change the network structures, but accurate estimation of their impact on mapping is not available. During technology mapping, we can achieve optimal or close to optimal solutions using the algorithms discussed in Section 13.1. However, we are stuck with a fixed network. It is desirable to capture the interactions between logic optimization and mapping to arrive at a solution with better quality.
Lossless synthesis has been proposed by Mishchenko et al. [57] as a way to consider technology-independent optimization during mapping. It is based on the concept of choice networks, which is similar to the concept of mapping graphs [11, 49]. A choice network contains choice nodes that encode functionally equivalent but structurally different alternatives. The algorithm operates on a simple yet powerful data structure called an AIG (AND-inverter graph), which is a network of AND2 and INV gates. A combination of SAT and simulation techniques is used to detect functionally equivalent points in different networks and compress them to form one choice network.
Figure 13.4 illustrates the construction of a network with choices from two equivalent networks with different structures. The nodes x1 and x2 in the two networks are functionally equivalent. They are combined in an equivalence class in the choice network, and an arbitrary member (x1 in this case) is set as the class representative. Note that p does not lead to a choice because its implementation is structurally the same in both networks. Similarly, o does not lead to a choice node.
Rather than try to come up with one "good" optimized network before mapping, the algorithm proposed by Mishchenko et al. [57] accumulates choices by combining intermediate networks seen during logic synthesis to generate a network with many choices. In a sense, it does not make judgments on the goodness of the intermediate networks but defers that decision to the mapping phase, when the best combination of these choices is selected. In the final mapping solution, different sections may come from different intermediate networks. For example, the timing-critical sections of the final mapping solution may come from networks optimized for timing, while the timing noncritical sections of the final mapping solution may come from networks optimized for area.

FIGURE 13.4 ■ Combining networks to create a choice network.
`For mapping on choice networks, cut generation and cut ranking are extended
`to choice nodes. For example, the set of cuts of a choice node is simply the
`union of the sets of cuts of all of that node's fanin nodes. Similarly, the label of
`a choice node is the smallest one among the labels of its fanin nodes. The rest of
`the approach is similar to a conventional mapping algorithm. Results reported
`by Mishchenko et al. [57] show that both timing and area can be improved by
`over 7 percent on a set of benchmark designs compared to applying mapping
`to just one "optimized" network.
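The extension of cut generation and cut ranking to choice nodes can be sketched in a few lines, assuming an illustrative representation in which each choice node stores the list of its functionally equivalent alternatives (this representation is an assumption for the example, not the paper's data structure):

```python
def choice_cuts(alternatives, cuts):
    """Cuts of a choice node: the union of the cut sets of its
    functionally equivalent alternatives."""
    merged = set()
    for u in alternatives:
        merged |= cuts[u]
    return merged

def choice_label(alternatives, labels):
    """Depth label of a choice node: the smallest label among
    its alternatives."""
    return min(labels[u] for u in alternatives)
```

With these two extensions in place, the rest of the flow (ranking, selection, solution generation) proceeds as in a conventional structural mapper, which is why choice networks slot into existing algorithms so easily.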
`
13.2.2 Integrated Retiming, Mapping
Retiming (discussed in Chapter 18) is an optimization technique that relocates flip-flops (FFs) in a network while preserving the functionality of the network [50]. Retiming can shift FF boundaries and change the timing. If retiming is applied after mapping, mapping may optimize the wrong paths because the critical paths seen during mapping may not be critical after the FFs are repositioned. On the other hand, if retiming is applied before mapping, it will be carried out using less accurate timing information because it is applied to an unmapped network. In either approach, the impact of retiming on cut generation cannot be accounted for.
The network in Figure 13.5(a) is derived from the design in Figure 13.1(a) by retiming the FFs at the outputs of y and i to their inputs. After the retiming, all gates can be covered with one 3-LUT, as indicated in (a). The corresponding mapping solution is shown in (b). This mapping solution is obviously better than the one in Figure 13.1(c) in both area and timing.
FIGURE 13.5 ■ Retiming, mapping: (a) retiming and covering, (b) mapping solution, and (c) retimed solution.

Pan and Liu [63] proposed a polynomial-time mapping algorithm that can find a solution with the best cycle time in the combined solution space of retiming and mapping. In other words, the solution obtained is the best among
all possible ways of retiming and mapping a network. Improved algorithms were later proposed that significantly reduce runtime while preserving the optimality of the final mapping solution [25, 27]. These algorithms, like the FlowMap algorithm, are all based on max-flow computation.
`A cut enumeration-based algorithm for integrated retiming and mapping was
`proposed by Pan and Lin [61]. In it, cut generation is extended to go across FF
`boundaries to generate sequential cuts. In a network with FFs, a gate m