throbber
users can download either compiled JAR files of the JHDL system, or they can
`download and build JHDL from sources themselves. Documentation on the
`JHDL system is provided as well.
`
`12.6 Summary
`
`273
`
`References
`
`[1] P. Bellows, B. L. Hutchings. JHDL-An HDL for reconfigurable systems. Proceed­
`ings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1998.
`[2] P. Bertin, D. Roncin, J. Vuillemin. Programmable active memories: A performance
`assessment. In G. Borriello, C. Ebeling (eds.). Research on Integrated Systems:
`Proceedings of the 1993 Symposium, 1993.
`[3] P. Bertin, H. Touati. PAM programming environments: Practice and experience. In
`D. A. Buell, K. L. Pocek (eds.). Proceedings of IEEE Workshop on FPGAs for Custom
`Computing Machines, April 1994.
`[4] D. Galloway. The transmogrifier C hardware description language and compiler for
`FPGAs. Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines,
`April 1995.
`[SJ P. Graham, B. Nelson, B. Hutchings. Instrumenting bitstreams for debugging
`FPGA circuits. Proceedings of the IEEE Symposium on Field-Programmable Custom
`Computing Machines, April 2001.
`[6] P. S. Graham. Logical Hardware Debuggers for FPGA-Based Systems, Ph.D. thesis,
`Brigham Young University, 2001.
`[7] K. S. Hemmert, J. L. Tripp, B. L. Hutchings, P. A. Jackson. Source level debugger
`for the Sea Cucumber synthesizing compiler. Proceedings of the IEEE Symposium
`on Field-Programmable Custom Computing Machines, April 2003.
`[8] B. L. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, M. Rytting. A
`CAD suite for high-performance FPGA design. Proceedings of IEEE Workshop on
`FPGAs for Custom Computing Machines, April 1999.
`[9] C. Iseli, E. Sanchez. A C++ compiler for FPGA custom execution units synthe­
`sis. Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines,
`April 1995.
`[10] W. J. Landaker, M. J. Wirthlin, B. L. Hutchings. Multitasking hardware on the
`SLAAC 1-V reconfigurable computing system. Proceedings of the 12th Interna­
`tional Workshop on Field-Programmable Logic and Applications, Springer-Verlag,
`September 2002.
`[11] J. L. Tripp, P. A. Jackson, B. L. Hutchings. Sea Cucumber: A synthesizing compiler
`for FPGAs. Proceedings of the 12th International Workshop on Field-Programmable
`Logic and Applications, Springer-Verlag, September 2002.
`[12] T. Wheeler, P. Graham, B. Nelson, B. Hutchings. Using design-level scan to improve
`FPGA design observability and controllability for functional verification. Proceed­
`ings of the 11th International Workshop on Field-Programmable Logic and Applica­
`tions, Springer-Verlag, August/September 2001.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 301
`
`

`

`PART
`
`Ill
`
`MAPPING DESIGNS TO
`RECONFIGURABLE PLATFORMS
`
`The chapters that follow cover the key mapping steps unique to
`field-programmable gate arrays (FPGAs) and reconfigurable targets.
`These steps include technology mapping to the primitive FPGA pro­
`grammable gates (Chapter 13), placement of these gates (Chapters 14
`through 16), routing of the interconnect between gates (Chapter 17),
`retiming of registers in the design (Chapter 18), and bitstream genera­
`tion (Chapter 19). A final chapter summarizes a number of approaches to
`accelerating various stages of the mapping process (Chapter 20).
`Placement is a difficult mapping problem, but is critical to the
`performance of the resulting reconfigurable design. As a result, it can be
`very slow, limiting the rate of the edit-compile-debug loop for reconfig­
`urable application development, and the designs it produces may have
`longer cycle times than we would like. For these reasons, in addition
`to the general-purpose algorithms for placement covered in Chapter 14,
`algorithms that are highly optimized to exploit the regularity of data­
`paths are discussed in Chapter 15, and constructive approaches to lay­
`out are treated in Chapter 16. These more specialized approaches can
`significantly reduce placement runtime and often deliver placements that
`allow faster design operation.
`As Chapters 13 through 20 demonstrate, there is a well-developed set
`of approaches and tools for programming reconfigurable applications.
`However, the tools are always slower than we might like them to be, espe­
`cially as FPGA capacities continue to grow with Moore's Law. Moreover,
`the designs they produce are often too large or too slow, and the level at
`which we must program them is often lower than optimal. These defi­
`ciencies present ample opportunities for innovation and improvement in
`software support for reconfigurable systems.
`For the designer who works on reconfiguration issues, the follow­
`ing chapters provide a look under the covers at the tools used to map
`designs and at the problems they must solve. It is important to under­
`stand which problems the tools are and are not solving and how well
`they can be expected to work. An understanding of the mapping flow and
`algorithms often helps the designer appreciate why tools may not produce
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 302
`
`

`

`276
`
`Part III ■ Mapping Designs to Reconfigurable Platforms
`
`the quality of results expected and how the design could be optimized to
`obtain better results. Similarly, understanding the problems that the tools
`are solving helps the designer understand the trade-offs associated with
`higher- or lower-level designs and how to mix and match design levels to
`obtain the desired quality of results with minimal effort.
`For the tool or software developer, this part covers the key steps
`in a traditional tool flow and summarizes the key algorithms used to
`map reconfigurable designs. With this knowledge the developer can
`rapidly assimilate conventional approaches and options and thus pre­
`pare to explore opportunities to improve quality of results, reduce tool
`time, or increase automation and raise the configurable design's level of
`abstraction.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 303
`
`

`

`CHAPTER
`
`13
`
`TECHNOLOGY MAPPING
`
`Jason Cong
`Department of Computer Science
`California NanoSystems Institute
`University of California-Los Angeles
`Peichen Pan
`Magma Design Automation, Inc.
`
`Technology mapping is an essential step in an field-programmable gate array
`(FPGA) design flow. It is the process of converting a network of technology­
`independent logic gates into a network comprising logic cells on the target FPGA
`device. 'i'echnology mapping has a significant impact on the quality of the final
`FPGA implementation.
`Technology-mapping algorithms have been proposed for optimizing area
`[29, 36, 58, 65], timing [9, 12, 13, 19, 21, 37, 58], power [2, 8, 34, 45, 52, 71],
`and routability [3, 67]. Mapping algorithms can be classified into those for gen­
`eral networks [ 13, 16] and those for special ones such as treelike networks
`[35, 36]. Algorithms for special networks may be applied to general ones through
`partitioning, with a possible reduction in solution quality.
`Technology-mapping algorithms can be structural or functional. A structural
`mapping algorithm does not modify the input network other than to duplicate
`logic [12, 13]. It reduces technology mapping to a covering problem in which
`the technology-independent logic gates in the input network are covered with
`logic cones such that each cone can be implemented using one logic cell-for
`example, a K-input lookup table (K-LUT)-for LUT-based FPGAs. Figure 13.1
`is an example of structural mapping. The logic gates in the original network
`(a) are covered with three logic cones, each with at most three inputs, as indi­
`cated (b). Note that node i is included in two cones and will be duplicated. The
`corresponding mapping solution (c) comprises three 3-LUTs.
`A functional mapping algorithm, on the other hand, treats technology map­
`ping in its general form as a problem of Boolean transformation/decomposition
`of the input network into a set of interconnected logic cells [15, 48, 58, 60].
`It mixes Boolean optimization with covering. Functional mapping algorithms
`tend to be time consuming, which limits their use to small designs or to small
`portions of a design.
`
`Note: This work is partially supported by the National Science Foundation under grant number
`CCF 0530261.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 304
`
`

`

`278
`
`Chapter 13 • Technology Mapping
`
`Recent advances in technology mapping try to combine mapping with other
`steps in the design flow. Such integrated mapping algorithms have the potential
`to explore a larger solution space than is possible with just technology mapping
`and thus have the potential to arrive at mapping solutions with better quality.
`For example, algorithms have been proposed to combine logic synthesis with
`covering to overcome the limitations of pure structural mapping [11, 22, 57].
`
`13.1 STRUCTURAL MAPPING ALGORITHMS
`Technology mapping is part of a logic synthesis flow, which typically consists of
`three steps. First, the initial network is optimized using technology-independent
`optimization techniques such as node extraction/substitution and don't-care
`optimization [33]. Second, the optimized network is decomposed into one con­
`sisting of 2-input gates plus inverters (that is, the network is 2-bounded) to
`increase flexibility in mapping [12, 36]. Third, the actual mapping takes place,
`with the goal of covering the 2-bounded network with K-LUTs while optimizing
`one or more objectives. In the remaining discussion, we assume that the input
`network is 2-bounded.
`A logic network can be represented as a graph where the nodes represent logic
`gates, primary inputs (Pis), and primary outputs (POs). The edges represent the
`interconnects or wires. A cut of a node v is a set of nodes in the input network
`such that every path from the primary inputs or sequential element outputs to
`v contains at least one node in the set. A K-cut is a cut with at most K nodes.
`For example, {a, b, z} is a 3-cut for the node yin the network in Figure 13.l(a).
`Given a K-cut for v, we can obtain a K-LUT for v by collapsing the gates in
`the logic cone between the nodes in the cut, v, including v itself. For the 3-cut
`{a, b, z} fory, the 3-LUT fory in Figure 13.l(c) is derived from the corresponding
`cone indicated for yin Figure 13.l(b).
`
`(a)
`
`(b)
`
`(c)
`
`FIGURE 13.1 ■ Structural technology mapping: (a) original network, (b) covering, and (c) mapping
`solution.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 305
`
`

`

`13.1 Structural Mapping Algorithms
`
`279
`
`Most structural mapping algorithms are based on the dynamic programming
`technique. They typically consist of the following steps:
`
`1. Cut generation/enumeration
`2. Cut ranking
`3. Cut selection
`4. Final mapping solution generation
`
`Cut generation obtains one or more cuts that will be used to generate LUTs;
`it is discussed in the next section. Cut ranking evaluates the cuts obtained in
`cut generation to see how good they are based on the mapping objectives. It
`assigns a label or cost to each cut by visiting the nodes in a topological order
`from Pis to POs. Cut selection picks a cut with the best label for each node and
`is typically done in reverse topological order from POs to Pis. Cut ranking and
`selection may be carried out multiple times to refine solution quality.
`After the final cut selection, a mapping solution is generated using the selected
`cuts. In this step, the nodes are visited in the reverse topological order, starting
`from POs and going back to Pis. At each node, a cut with the best label is
`selected and the corresponding LUT is added to the solution. Next, the nodes
`that drive the LUT are visited. This process is repeated until only Pis are left.
`At that point, a complete mapping solution is obtained.
`
`13. 1. 1 Cut Generation
`Early mapping algorithms combine cut generation and selection to determine
`one or a few "good" cuts for each node. The most successful example is the
`FlowMap algorithm, which finds a single cut with optimal mapping depth at
`each node via max-flow computation [16]. It computes the optimal mapping
`depth of each node in a topological order from Pis to POs, and at each node uses
`a max-flow formulation to test whether that node can have the same optimal
`mapping depth as the maximum depth of its input nodes. If not, the depth is
`set to one greater than the input nodes' maximum depth. It is shown that these
`are the only two possible mapping depths. The FlowMap algorithm was the first
`polynomial time algorithm to find a depth-optimal mapping solution for LUT­
`based FPGAs.
`In practice, K, the number of inputs of the LUTs, is a small constant typically
`ranging between 3 and 6. It becomes practical to enumerate all K-cuts for each
`node. With all cuts available, we have additional flexibility in selecting cuts to
`optimize the mapping solution.
`Cuts can be generated by a traversal of the nodes in a combinational network
`( or the combinational portion of a sequential network) from Pis to POs in a
`topological order [29, 67]. Let cl>(v) denote the set of all K-cuts for a node v.
`For a PI, cl>(v) contains only the trivial cut consisting of the node itself, that is,
`cl>(v) = {{v}}. For a non-PI node v with two fanin nodes, u1 and u2, cl>(v) can be
`computed by merging the sets of cuts of u1 and u2 as follows:
`
`<l>(v) = { {v} U {c1 Uc2lc1 E <l>(u1), c2 E <l>(u2), let Uc2I} <= .K}
`
`(13.1)
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 306
`
`

`

`280
`
`Chapter 13 • Technology Mapping
`
`In other words, the set of cuts of v is obtained by the pairwise union of the
`cuts of its fanin nodes and then the elimination of those cuts with more than
`K nodes. Note that the trivial cut is added to the set. This is necessary so the
`nodes driven by v can include v in their cuts.
`
`13.1.2 Area-oriented Mapping
`For LUT mapping, the area of a mapping solution can be measured by the total
`number of LUTs. It has been shown that finding an area-optimal mapping solu­
`tion is NP-hard [35]. Therefore, it is unlikely that there is an accurate way to
`rank cuts for area. The difficulty of precise area estimation is mainly due to the
`existence of multiple fanout nodes. In fact, for treelike networks, area-optimal
`mapping solutions can be determined in polynomial time [35].
`Cong et al. [29] proposed the concept of effective area as a way to rank
`and select cuts for area. A similar concept, area -flow, was later proposed by
`Manohararajah et al. [55]. Intuition regarding effective area is to distribute the
`area for a multi-fanout node to its fanout nodes so that logic sharing and recon­
`vergence can be considered during area cost propagation. Effective areas are
`computed in a topological order from Pis to POs. The effective area a(v) of a PI
`node v is set to zero. Equation 13.2 is used to compute the effective area of a cut:
`
`a(c) = (LuEc [a(u);[output(u)I]) +Ac
`
`(13.2)
`
`where Ac is the area of the LUT corresponding to the cut c. The area cost of a
`non-PI node can then be set to the minimum effective area of its cuts: a(v) =
`min{a(c)['v'u E <l>(v)}.
`It should be pointed out that effective area may not account for the situation
`where the node may be duplicated in a mapping solution. In the example shown
`in Figure 13.2, with K = 3, the LUT for w is introduced solely for the LUT for v.
`However, in effective area computation, only one-half is counted for v, and as
`a result the LUT for w is undercounted. In this example, the sum of effective
`area of the POs is 2.5 whereas the mapping solution has three LUTs. In general,
`effective area is a lower bound of the actual area.
`
`FIGURE 13.2 ■ Inaccuracy in effective area.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 307
`
`

`

`13.1 Structural Mapping Algorithms
`
`281
`
`The PRAETOR algorithm [29] is an area-oriented mapping algorithm that
`ranks cuts using effective area. It further improves the basic mapping framework
`with a number of area reduction techniques. One such technique is to encourage
`the use of common subcuts. A cut for a fanout of a node v induces a cut for v
`(perhaps the trivial cut consisting of v itself). If two fanouts of v induce different
`cuts for v, the most likely result will be an area increase due to the need to
`duplicate v and possibly some of its predecessor nodes. To alleviate this problem,
`PRAETOR sorts and selects cuts with the same effective area in a predetermined
`order to avoid arbitrary selection. It assigns an integer ID to each node and then
`sorts all cuts with the same effective area according to the lexicographic order
`based on the IDs of the nodes in the cuts. The first cut with minimum effective
`area for each node is selected.
`Another area reduction technique introduced in PRAETOR is to carry out
`cut selection twice. The nodes with LUTs selected in the first pass are declared
`nonduplicable and can only be covered by LUTs for themselves in the second
`pass. This encourages selection of cuts Wiith less duplication. As an example,
`suppose that in the first pass of cut selection, the mapping solution shown in
`Figure 13.3(a), with four LUTs, is selected. In the second pass, the LUT con­
`taining v and u 1 is excluded from consideration for u 1. This exclusion will also
`encourage the selection of the cut that results in the LUT containing a for u1.
`As a result, the mapping algorithm generates, in the second pass, the mapping
`solution in Figure 13.3(b), with only three LUTs. Experimental results show that
`PRAETOR can significantly improve area over previous algorithms.
`The !Map algorithm proposed by Manohararajah et al. [55] is another map­
`ping algorithm targeting area optimization. It introduced two enhancements: (1)
`iteration between cut ranking and cut selection multiple times, and (2) adjust­
`ment of the area costs between successive iterations using history information.
`In the effective area formula (equation 13.2), the fanout count of u in the initial
`network, joutput(u)I, is used to estimate the fanout count of the LUT rooted at u
`in the mapping solution. In the !Map algorithm between iterations, the fanout
`
`K=3
`
`(a)
`
`(b)
`
`FIGURE 13.3 ■ Effect of excluding cuts across nonduplicable nodes: (a) initial mapping solution,
`and (b) improved solution with better area.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 308
`
`

`

`282
`
`Chapter 13 ■ Technology Mapping
`
`count estimation is updated by using a weighted combination of the estimated
`and the real fanout counts in previous iterations. As a result, equation 13.2
`becomes a(c) = (I:u Ec [a(u)Jestimated_fc(u)]) +Ac, where estimated_fc(u) denotes
`the estimated fanout count for the current iteration.
`Ling et al. [54] proposed a mixed structural and functional area-mapping
`algorithm that starts with a mapping solution (e.g., generated by a structural
`mapping algorithm). The key idea is a Boolean satisfiability (SAT) formulation
`for the problem of mapping a small circuit with up to ten inputs into the smallest
`possible number of LUTs. The algorithm iteratively selects a small logic cone to
`remap to fewer LUTs using an SAT solver. It is shown that for some highly
`structured (albeit small) designs, area can be improved significantly.
`Most area optimization techniques are heuristic. A natural question is how
`close the mapping solutions obtained using existing mapping algorithms are
`from optimal. Cong and Minkovich [24] constructed a set of designs with known
`optimal area-mapping solutions, called LEKO (logic synthesis examples with
`known optimal bounds) examples, and tried existing academic algorithms and
`commercial tools on them. The average gap from optimal varied from 5 to
`23 percent. From LEKO examples, they further derived LEKU (logic synthe­
`sis examples with known upper bounds) examples that require both logic opti­
`mization and mapping. Existing algorithms perform poorly on LEKU examples,
`with an average optimality gap of more than 70 times. This indicates that more
`research is needed in area-oriented mapping and optimization.
`
`13.1.3 Performance-driven Mapping
`The FlowMap algorithm and its derivatives can find a mapping solution with
`optimal depth. Recent advances in delay mapping focus on achieving the best
`performance with minimal area.
`Exact layout information is not available during technology mapping in a
`typical FPGA design flow. Mapping algorithms usually ignore routing delays and
`try to optimize the total cell delays on the longest combinational paths in the
`mapping solution.
`Most delay optimal mapping algorithms use the labeling scheme introduced
`in the FlowMap algorithm to rank and select cuts. The label of a PI is set to
`zero, assuming that the signal arrives at the beginning of the clock edge. After
`the labels for all the nodes in the fanin cone of a node v are found, the label of
`a cut c of v is determined using the formula in equation 13.3:
`
`l(c) = max{l(u) + Dc l'<lu E c}
`
`(13.3)
`
`where De is the delay of the LUT corresponding to c. Intuitively, l(c) is the best
`arrival time at v if it is covered using the LUT generated from c. The label of v
`is then the smallest label among all of its cuts: l(v) = min{l(c)IVc E <l>(v)}.
`DAOmap [9] is a mapping algorithm that guarantees optimal delay while at
`the same time minimizing the area. It introduces three key techniques to opti­
`mize area without degrading timing. First, it enhances effective area computation
`to make it better avoid node duplication. Second, it applies area optimization
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 309
`
`

`

`13.1 Structural Mapping Algorithms
`
`283
`
`techniques on noncritical paths. Last, it uses an iterative cut selection procedure
`to explore and perturb the solution space to improve solution quality.
`DAOmap first picks cuts with the minimum label for each node. From those, it
`then picks one with minimum effective area. Furthermore, when there is positive
`slack, which is the difference between required time and arrival time at a node,
`it picks a cut with as small an area cost as possible under the condition that the
`timing increase does not exceed the slack.
`Recognizing the heuristic nature of effective area computation, DAOmap also
`employs the technique of multiple passes of cut selection. Moreover, it adjusts
`area costs based on input sharing to encourage using nodes that have already
`been contained in selected cuts. This reduces the chance that a newly picked
`cut cuts into the interior of existing LUTs. Between successive iterations of cut
`selection, DAOmap also adjusts area cost to encourage selecting cuts containing
`nodes with a large number of fanouts in previous iterations. There are a few
`other secondary techniques used in DAOmap. The interested reader is referred
`to Chen and Cong [9] for details.
`Based on the results reported, DAOmap can improve the area by about
`13 percent on a large set of academic and industrial designs while maintaining
`optimal depths. It is also many times faster than previous mapping algorithms
`based on max-flow computation, mainly because of efficient implementation of
`cut enumeration.
`A recent delay optimal mapping algorithm introduced several techniques to
`improve area while preserving performance [57]. Like DAOmap, this algorithm
`goes through several passes of cut selection, with each pass selecting cuts with
`better areas among the cuts that do not degrade timing. It is also based on
`the concept of effective area (or area flow). However, it does cut selection from
`Pis to POs instead of from POs to Pis, as in most other· algorithms. With this
`processing order, the algorithm tries to use timing slacks on nodes close to
`Pis to reduce area cost. This is based on the observation that logic is typically
`denser when close to Pis, so slack relaxation is more effective for nodes closer to
`Pis. Experimental data shows 7 percent better area over DAOmap for the same
`optimal depths.
`
`13.1.4 Power-aware Mapping
`Power has become a major concern for FPGAs [51, 68]. Dynamic power dissi­
`pation in FPGAs results from charging and discharging capacitances. It is deter­
`mined by the switching activities and the load capacitance of the LUT outputs
`and can be captured by equation 13.4:
`
`where Cv is the output load capacitance of node v, fv is the switching activ­
`ity of node v, and V is the supply voltage. Given a fixed supply voltage, power
`consumption in a mapped netlist is determined by switching activities and load
`capacitance of the LUT outputs.
`
`(13.4)
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 310
`
`

`

`284
`
`Chapter 13 ■ Technology Mapping
`
`Because technology mapping for power is NP-hard [34], a number of heuristic
`algorithms have been proposed. Most power-aware mapping algorithms try to
`reduce switching activities by hiding nodes with high switching activities inside
`LUTs, hence leaving LUTs with small output-switching activities in the mapped
`netlist.
`Anderson and Najm [2] proposed a mapping algorithm to reduce switch­
`ing activities and minimize logic duplication. Logic duplication is necessary
`to optimize timing and area, but can potentially increase power consump­
`tion. The algorithm uses the following power-aware cost function to rank cuts:
`Cost(c) = l(c)+P·P(c) + r·R(c), where l(c) is the depth label of the cut c as given
`in equation 13.3 and P(c) and R(c) are the power and replication costs of the
`cut, respectively. T he weighting factors p and y can be used to bias the three
`cost terms. Anderson and Najm suggest a very small P to get a depth-optimal
`mapping solution with minimal power.
`Power cost P(c) is defined in such a way that it encourages absorbing high­
`activity connections inside LUTs. The replication cost tries to discourage logic
`duplication on timing noncritical paths. Power savings of over 14 percent were
`reported over timing-oriented mapping algorithms when both targeted optimal
`depths. When the mapping depth was relaxed by one level over optimal, addi­
`tional power reduction of about 8 percent for 4-LUTs and 10 percent for 5-LUTs
`was reported.
`One serious limitation of the power-based ranking in Anderson and Najm
`[2] is that it cannot account for multiple fanouts and reconvergence, which
`are common in most practical designs. Chen et al. [8] proposed a low-power
`technology-mapping algorithm based on an improved power-aware ranking in
`equation 13.5:
`
`P(c) = (l:uec [P(u)tloutput(u)I]) + Uc
`
`(13.5)
`
`where Uc is a cost function that tries to capture power contributed by the
`cut c itself. Experimental results show that this algorithm outperforms previ­
`ous power-aware mapping algorithms. It has also been extended to handle dual
`supply voltage FPGA architectures.
`
`13.2
`
`INTEGRATED MAPPING ALGORITHMS
`
`Technology mapping is a step in the middle of an FPGA design flow. Technology­
`independent optimization is carried out before mapping; placement is carried
`out after. Sequential optimization such as retiming can be carried out before or
`after mapping. A separate approach can miss the best overall solutions even if we
`can solve each individual step optimally. In the section that follows we discuss
`mapping algorithms that combine mapping with other steps in the design flow.
`
`13.2.1 Simultaneous Logic Synthesis, Mapping
`Technology-independent Boolean optimizations carried out prior to technol­
`ogy mapping can significantly impact the mapping solution. During technology­
`independent optimization, we have the freedom to change the network structures,
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 311
`
`

`

`13.2 Integrated Mapping Algorithms
`
`285
`
`but accurate estimation of their impact on mapping is not available. During
`technology mapping, we can achieve optimal or close to optimal solutions using
`the algorithms discussed in Section 13.1. However, we are stuck with a fixed net­
`work. It is desirable to capture the interactions between logic optimization and
`mapping to arrive at a solution with better quality.
`Lossless synthesis has been proposed by Mishchenko et al. [57] as a way to con­
`sider technology-independent optimization during mapping. It is based on the
`concept of choice networks, which is similar to the concept of mapping graphs
`[11, 49]. A choice network contains choice nodes that encode functionally equiva­
`lent but structurally different alternatives. The algorithm operates on a simple yet
`powerful data structure called AIG, which is a network of AND2 and INV gates.
`A combination of SAT and simulation techniques is used to detect functionally
`equivalent points in different networks and compress them to form one choice
`network.
`Figure 13.4 illustrates the construction of a network with choices from two
`equivalent networks with different structures. The nodes x1 and x2 in the two
`networks are functionally equivalent. They are combined in an equivalence class
`in the choice network, and an arbitrary member (x1 in this case) is set as the
`class representative. Note that p does not lead to a choice because its implemen­
`tation is structurally the same in both networks. Similarly, o does not lead to a
`choice node.
`Rather than try to come up with one "good" optimized network before map­
`ping, the algorithm proposed by Mishchenko et al. [57] accumulates choices
`by combining intermediate networks seen during logic synthesis to generate
`a network with many choices. In a sense, it does not make judgments on the
`goodness of the intermediate networks but defers that decision to the mapping
`phase, when the best combination of these choices is selected. In the final map­
`ping solution, different sections may come from different intermediate networks.
`For example, the timing-critical sections of the final mapping solution may come
`
`X
`
`a
`
`b C d
`
`e
`
`a
`
`b c d
`
`e
`
`a
`
`b C d
`
`e
`
`FIGURE 13.4 ■ Combining networks to create a choice network.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 312
`
`

`

`286
`
`Chapter 13 ■ Technology Mapping
`
`from networks optimized for timing, while the timing noncritical sections of the
`final mapping solution may come from networks optimized for area.
`For mapping on choice networks, cut generation and cut ranking are extended
`to choice nodes. For example, the set of cuts of a choice node is simply the
`union of the sets of cuts of all of that node's fanin nodes. Similarly, the label of
`a choice node is the smallest one among the labels of its fanin nodes. The rest of
`the approach is similar to a conventional mapping algorithm. Results reported
`by Mishchenko et al. [57] show that both timing and area can be improved by
`over 7 percent on a set of benchmark designs compared to applying mapping
`to just one "optimized" network.
`
`Integrated Retiming, Mapping
`13.2.2
`Retiming (discussed in Chapter 18) is an optimization technique that relocates
`flip-flops (FFs) in a network while preserving functionality of the network [SO].
`Retiming can shift FF boundaries and change the timing. If retiming is applied
`after mapping, mapping may optimize the wrong paths because the critical
`paths seen during mapping may not be critical after the FFs are repositioned.
`On the other hand, if retiming is applied before mapping, it will be carried out
`using less accurate timing information because it is applied to an unmapped
`network. In either approach, the impact of retiming on cut generation cannot
`be accounted for.
`The network in FigUFe 13.S(a) is derived from the design in Figure 13.l(a) by
`retiming the FFs at the outputs of y and i to their inputs. After the retiming,
`all gates can be covered with one 3-LUT, as indicated in (a). The corresponding
`mapping solution is shown in (b). This mapping solution is obviously better
`than the one in Figure 13.l(c) in both area and timing.
`Pan and Liu [63] proposed a polynomial time-mapping algorithm that can
`find a solution with the best cycle time in the combined solution space of
`
`(a)
`
`a
`
`b
`
`a
`
`b
`
`(b)
`
`(c)
`
`FIGURE 13.5 ■ Retiming, mapping: (a) retiming and covering, (b) mapping solution, and
`(cl retimed solution.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2137, p. 313
`
`

`

`13.2 Integrated Mapping Algorithms
`
`287
`
`retiming and mapping. In other words, the solution obtained is the best among
`all possible ways of retiming and mapping a network. Improved algorithms were
`later proposed that significantly reduce runtime while preserving the optimal­
`ity of the final mapping solution [25, 27]. These algorithms, like the FlowMap
`algorithm, are all based on max-flow computation.
`A cut enumeration-based algorithm for integrated retiming and mapping was
`proposed by Pan and Lin [61]. In it, cut generation is extended to go across FF
`boundaries to generate sequential cuts. In a network with FFs, a gate m

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket