`Systems
`
`SCOTT HAUCK, MEMBER, IEEE
`
`Reprogrammable systems based on field programmable gate
`arrays are revolutionizing some forms of computation and
`digital logic. As a logic emulation system, they provide orders
`of magnitude faster computation than software simulation. As a
`custom-computing machine, they achieve the highest performance
`implementation for many types of applications. As a multimode
`system, they yield significant hardware savings and provide truly
`generic hardware.
`In this paper, we discuss the promise and problems of
`reprogrammable systems. This includes an overview of the chip
`and system architectures of reprogrammable systems as well as
`the applications of these systems. We also discuss the challenges
`and opportunities of future reprogrammable systems.
`Keywords— Adaptive computing, custom computing, FPGA,
`logic emulation, multi-FPGA systems, reconfigurable computing.
`
I. INTRODUCTION
`In the mid-1980’s, a new technology for implementing
`digital logic was introduced: the field programmable gate
`array (FPGA). These devices could be viewed as either
small, slow mask-programmable gate arrays (MPGA’s) or large, expensive
`programmable logic devices (PLD’s). FPGA’s were capable
`of implementing significantly more logic than PLD’s, espe-
`cially because they could implement multilevel logic, while
`most PLD’s were optimized for two-level logic. Although
`they did not have the capacity of MPGA’s, they also did
`not have to be custom fabricated, greatly lowering the costs
`for low-volume parts and avoiding long fabrication delays.
`While many of the FPGA’s were configured by static
`random access memory (SRAM) cells in the array, this was
`generally viewed as a liability by potential customers who
`worried over the chip’s volatility. Antifuse-based FPGA’s
`also were developed and for many applications were much
`more attractive, both because they tended to be smaller and
`faster due to less programming overhead and because there
`was no volatility to the configuration.
`
`Manuscript received May 5, 1997; revised January 20, 1998. This work
`was supported in part by the Defense Advanced Research Project Agency
`under Contract DABT63-97-C-0035 and in part by the National Science
`Foundation under Grants CDA-9703228 and MIP-9616572.
`The author is with the Department of Electrical and Computer Engi-
`neering, Northwestern University, Evanston, IL 60208-3118 USA.
`Publisher Item Identifier S 0018-9219(98)02669-3.
`
`In the late 1980’s and early 1990’s, there was a growing
`realization that the volatility of SRAM-based FPGA’s was
`not a liability but was in fact the key to many new types
`of applications. Since the programming of such an FPGA
`could be changed by a completely electrical process, much
`as a standard processor can be configured to run many
`programs, SRAM-based FPGA’s have become the work-
`horse of many new reprogrammable applications. Some
`uses of reprogrammability are simple extensions of the
`standard logic implementation tasks for which the FPGA’s
`were originally designed. An FPGA plus several different
`configurations stored in read-only memory (ROM) could
`be used for multimode hardware, with the functions on
`the chip changed in reaction to the current demands. Also,
`boards constructed purely from FPGA’s, microcontrollers,
`and other reprogrammable parts could be truly generic
`hardware, allowing a single board to be reprogrammed to
`serve many different applications.
`Some of the most exciting new uses of FPGA’s move
`beyond the implementation of digital
`logic and instead
`harness large numbers of FPGA’s as a general-purpose
`computation medium. The circuit mapped onto the FPGA’s
`need not be standard hardware equations but can even
`be operations from algorithms and general computations.
`While these FPGA-based custom-computing machines may
`not challenge the performance of microprocessors for all
`applications, for computations of the right form, an FPGA-
`based machine can offer extremely high performance, sur-
`passing any other programmable solution. Although a cus-
`tom hardware implementation will be able to beat the power
`of any generic programmable system, and thus there must
`always be a faster solution than a multi-FPGA system, the
`fact is that few applications will ever merit the expense
`of creating application-specific solutions. An FPGA-based
`computing machine, which can be reprogrammed like a
`standard workstation, offers the highest realizable perfor-
`mance for many different applications. In a sense, it is
`a hardware supercomputer, surpassing traditional machine
`architectures for certain applications. This potential has
`been realized by many different research machines. The
`Splash system [50] has provided performance on genetic
`string matching that is almost 200 times greater than all
`
`PROCEEDINGS OF THE IEEE, VOL. 86, NO. 4, APRIL 1998
`
0018–9219/98$10.00 © 1998 IEEE
`
`
`Fig. 1. Actel’s programmable low-impedance circuit element (PLICE). As shown in (a), an
`unblown antifuse has an oxide–nitride–oxide (ONO) dielectric preventing current from flowing
`between diffusion and polysilicon. The antifuse can be blown by applying a 16-V pulse across the
`dielectric. This melts the dielectric, allowing a conducting channel to be formed (b). Current is then
`free to flow between the diffusion and the polysilicon [1], [54].
`
`other supercomputer implementations. The DECPeRLe-1
`system [133] has demonstrated world-record performance
`for many other applications, including RSA cryptography.
`One of the applications of multi-FPGA systems with
`the greatest potential is logic emulation. The designers of
`a custom chip need to verify that the circuit they have
`designed actually behaves as desired. Software simulation
`and prototyping have been the traditional solution to this
`problem. As chip designs become more complex, however,
`software simulation is only able to test an ever decreasing
`portion of the chips’ computations, and it is quite expensive
`in time and money to debug by repeated prototype fabrica-
`tions. The solution is logic emulation, the mapping of the
`circuit under test onto a multi-FPGA system. Since the logic
`is implemented in the FPGA’s in the system, the emulation
`can run at near real
`time, yielding test cycles several
`orders of magnitude faster than software simulation, yet
`with none of the time delays and inflexibility of prototype
`fabrications. These benefits have led many of the advanced
`microprocessor manufacturers to include logic emulation
`in their validation process.
`In this paper, we discuss the different applications
`and types of reprogrammable systems. In Section II, we
`present an overview of FPGA architectures as well as field
`programmable interconnect components (FPIC’s). Then,
`Section III details what kinds of opportunities these devices
`provide for new types of systems. We then categorize the
`types of reprogrammable systems in Section IV, including
`coprocessors and multi-FPGA systems. Section V describes
`in depth the different multi-FPGA systems, highlighting
`their important features. Last, Sections VI and VII conclude
`with an overview of the status of reprogrammable systems
`and how they are likely to evolve. Note that this paper is
`not meant to be a catalog of every existing reprogrammable
`architecture and application. We instead focus on some of
`the more important aspects of these systems in order to
`give an overview of the field.
`
`II. FPGA TECHNOLOGY
One of the most common field programmable elements
is the PLD. PLD’s concentrate primarily on two-level,
sum-of-products implementations of logic functions. They have
`simple routing structures with predictable delays. Since
`they are completely prefabricated, they are ready to use in
`seconds, avoiding long delays for chip fabrication. FPGA’s
`
`Fig. 2. Floating gate structure for EPROM/EEPROM. The float-
`ing gate is completely isolated. An unprogrammed transistor, with
`no charge on the floating gate, operates the same as a normal
`n-transistor, with the access gate as the transistor’s gate. To
`program the transistor, a high voltage on the access gate plus a
`lower voltage on the drain accelerates electrons from the source fast
`enough to travel across the gate oxide insulator to the floating gate.
`This negative charge then prevents the access gate from closing
`the source–drain connection during normal operation. To erase,
`EPROM uses ultraviolet light to accelerate electrons off the floating
`gate, while EEPROM removes electrons by a technique similar to
`programming but with the opposite polarity on the access gate
`[134], [148].
`
`are also completely prefabricated, but instead of two-level
`logic, they are optimized for multilevel circuits. This allows
`them to handle much more complex circuits on a single
`chip, but it often sacrifices the predictable delays of PLD’s.
`Note that FPGA’s are sometimes considered another form
`of PLD, often under the heading of complex PLD.
`Just as in PLD’s, FPGA’s are completely prefabricated
`and contain special features for customization. These con-
`figuration points are normally SRAM cells, EPROM, EEP-
`ROM, or antifuses. Antifuses are one-time programmable
`devices (Fig. 1), which when “blown” create a connection.
`When they are “unblown,” no current can flow between
`their terminals (thus, it is an “anti” fuse, since its behavior
`is opposite to a standard fuse). Because the configuration
`of an antifuse is permanent, antifuse-based FPGA’s are
`one-time programmable, while SRAM-based FPGA’s are
`reprogrammable, even in the target system. Since SRAM’s
`are volatile, an SRAM-based FPGA must be reprogrammed
`every time the system is powered up, usually from a ROM
`included in the circuit to hold configuration files. Note
`that FPGA’s often have on-chip control circuitry to load
`this configuration data automatically. EEPROM/EPROM
`(Fig. 2) are devices somewhere between SRAM and an-
`tifuse in their features. The programming of an EEP-
`ROM/EPROM is retained even when the power is turned
`off, avoiding the need to reprogram the chip at power-
`
`
`
`
`Fig. 3. Programming bit for (a) SRAM-based FPGA’s [145] and (b) a three-input lookup table.
`
`
`up, while their configuration can be changed electrically.
`However, the high voltages required to program the device
`often mean that they are not reprogrammed in the target
`system.
SRAM cells are larger than antifuses and EEPROM/EPROM,
meaning that SRAM-based FPGA’s will have fewer
configuration points than FPGA’s using other programming
technologies. However, SRAM-based FPGA’s have numerous
advantages. Since they are easily reprogrammable, their
configurations can be changed for bug fixes or upgrades.
Thus, they provide an ideal prototyping medium. Also, these
devices can be used in situations where they can expect to
have numerous different configurations, such as multimode
systems and reconfigurable computing machines. More
details on such applications are included later in this
paper. Because antifuse-based FPGA’s are only one-time
programmable, they are generally not used in
reprogrammable systems. EEPROM/EPROM devices could
potentially be reprogrammed in-system, although in general
this feature is not widely used. Thus, this paper will
concentrate solely on SRAM-based FPGA’s.
`There are many different types of FPGA’s, with many
`different structures. Instead of discussing all of them here,
`which would be quite involved, this section will present
`two representative FPGA’s. Details on many others can
`be found elsewhere [21], [26], [75], [103], [112], [127].
`Note that reconfigurable systems can often employ non-
`FPGA reconfigurable elements; these will be described in
`Section V.
In SRAM-based FPGA’s, memory cells are scattered
throughout the FPGA. As shown in Fig. 3(a), a pair of
cross-coupled inverters will sustain whatever value is
programmed onto them. A single n-transistor gate is provided
for either writing a value or reading a value back out.
The ratio of sizes between the transistor and the upper
inverter is set to allow values sent through the n-transistor
to overpower the inverter. The read-back feature is used
during debugging to determine the current state of the
system. The actual control of the FPGA is handled by the
Q and Q̄ outputs. One simple application of an SRAM
bit is to have the Q terminal connected to the gate of
an n-transistor. If a “1” is assigned to the programming
bit, the transistor is closed, and values can pass between
the source and drain. If a “0” is assigned, the transistor
is opened, and values cannot pass. Thus, this construct
operates similarly to an antifuse, though it requires much
`
`Fig. 4. The Xilinx 4000 series FPGA structure [145]. Logic
`blocks are surrounded by horizontal and vertical routing channels.
`
more area. One of the most useful SRAM-based structures
is the lookup table (LUT). By connecting 2^N programming
bits to a multiplexer [Fig. 3(b)], any N-input combinational
Boolean function can be implemented. Although it can
require a large number of programming bits for large N,
LUT’s of up to five inputs can provide a flexible, powerful
function implementation medium.
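The LUT idea can be sketched in software. The following is a minimal illustrative model, not a description of any vendor's circuit: the names `make_lut` and `program_lut` are hypothetical, and the choice of the first input as the most significant select bit is an assumption made only for this example. It shows 2^N stored bits being selected by the N inputs, which is why any N-input Boolean function fits.

```python
from itertools import product

def make_lut(truth_bits):
    """Return a function that models an N-input LUT.

    truth_bits holds the 2**N programming bits; the inputs act as the
    multiplexer select lines that pick one stored bit.
    """
    n = len(truth_bits).bit_length() - 1
    if len(truth_bits) != 2 ** n:
        raise ValueError("need exactly 2**N programming bits")

    def lut(*inputs):
        if len(inputs) != n:
            raise ValueError(f"expected {n} inputs")
        index = 0
        for bit in inputs:  # assumption: first input is the most significant select bit
            index = (index << 1) | (bit & 1)
        return truth_bits[index]

    return lut

def program_lut(func, n):
    """Derive the 2**N programming bits for an arbitrary Boolean function."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n)]

# Example: a three-input majority function needs 2**3 = 8 programming bits.
majority = lambda a, b, c: (a & b) | (a & c) | (b & c)
majority_bits = program_lut(majority, 3)
majority_lut = make_lut(majority_bits)
```

Changing the function only means changing the stored bits; the multiplexer structure stays fixed. The bit count doubles with each added input, which is the cost the text notes for large N.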
`One of the best known FPGA’s is the Xilinx logic
`cell array [126], [145]. In this section, we will describe
`their third-generation FPGA, the Xilinx 4000 series. The
`Xilinx array is an “Island-style” FPGA [127] with logic
`cells embedded in a general routing structure that permits
`arbitrary point-to-point communication (Fig. 4). The only
`requirement for good routing in this structure is that the
`source and destinations be relatively close together. Details
`of the routing structure are shown in Fig. 5. Each of the
`inputs of the cell (F1-F4, G1-G4, C1-C4, K) comes from
`one of a set of tracks adjacent to that cell. The outputs
`are similar (X, XQ, Y, YQ), except that they have the
`choice of both horizontal and vertical tracks. The routing
`structure is made up of three lengths of lines. Single-length
`lines travel the height of a single cell, where they then
`enter a switch matrix [Fig. 6(b)]. The switch matrix allows
`this signal to travel out vertically and/or horizontally from
`the switch matrix. Thus, multiple single-length lines can be
`cascaded together to travel longer distances. Double-length
`
`HAUCK: FPGA’S IN REPROGRAMMABLE SYSTEMS
`
`
`
`Fig. 5. Details of the Xilinx 4000 series routing structure [145]. The configurable logic blocks
`(CLB’s) are surrounded by vertical and horizontal routing channels containing single-length lines,
`double-length lines, and long lines. Empty diamonds represent programmable connections between
`perpendicular signal lines (signal lines on opposite sides of the diamonds are always connected).
`
`lines are similar, except that they travel the height of two
`cells before entering a switch matrix (notice that only half
`the double-length lines enter a switch matrix, and there is
`a twist in the middle of the line). Thus, double-length lines
`are useful for longer distance routing, traversing two cell
`heights without the extra delay and the wasted configuration
`sites of an intermediate switch matrix. Last, long lines are
`lines that go half the chip height and do not enter the
`switch matrix. In this way, routes of very long distance can
`be accommodated efficiently. With this rich sea of routing
`resources, the Xilinx 4000 series is able to handle fairly
`arbitrary routing demands, though mappings emphasizing
`local communication will still be handled more efficiently.
`As shown in Fig. 6(a), the Xilinx 4000 series logic cell
`is made up of three LUT’s, two programmable flip-flops,
`and multiple programmable multiplexers. The LUT’s allow
arbitrary combinational functions of their inputs to be created.
`Thus, the structure shown can perform any function of five
`inputs (using all three LUT’s, with the F and G inputs
`identical), any two functions of four inputs (the two four-
`input LUT’s used independently), or some functions of up
`to nine inputs (using all three LUT’s, with the F and G
`inputs different). SRAM-controlled multiplexers then can
`route these signals out the X and Y outputs, as well as to
`the two flip-flops. The inputs at the top (C1-C4) provide
`enable and set or reset signals to the flip-flops, a direct
`connection to the flip-flop inputs, and the third input to
`the three-input LUT. This structure yields a very powerful
`method of implementing arbitrary, complex digital logic.
`
`Note that there are several additional features of the Xilinx
`FPGA not shown in these figures, including support for
`embedded memories and carry chains.
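One way to see why the three LUT’s together cover every five-input function is a Shannon expansion on the fifth input. The sketch below is an assumption about how such a mapping could be done, not a statement of how Xilinx tools perform it. The helper names `lut` and `map_five_input` are hypothetical. The two four-input LUT’s (F and G) share the same four inputs and hold the two cofactors, and the three-input LUT (H) selects between them using the fifth input.

```python
from itertools import product

def lut(bits):
    """Model a LUT as a list of 2**N programming bits indexed by its inputs."""
    def evaluate(*inputs):
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return bits[index]
    return evaluate

def map_five_input(func):
    """Split a 5-input function across two 4-input LUTs and one 3-input LUT.

    F holds the cofactor with e = 0, G the cofactor with e = 1, and H acts
    as a 2:1 multiplexer that uses e to choose between them.
    """
    f_bits = [func(a, b, c, d, 0) & 1 for a, b, c, d in product((0, 1), repeat=4)]
    g_bits = [func(a, b, c, d, 1) & 1 for a, b, c, d in product((0, 1), repeat=4)]
    # H inputs are (f_out, g_out, e): output g_out when e is 1, otherwise f_out.
    h_bits = [g if e else f for f, g, e in product((0, 1), repeat=3)]
    F, G, H = lut(f_bits), lut(g_bits), lut(h_bits)

    def clb(a, b, c, d, e):
        return H(F(a, b, c, d), G(a, b, c, d), e)

    return clb, len(f_bits) + len(g_bits) + len(h_bits)

# Example: five-input odd parity mapped onto the three LUTs.
parity5 = lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e
parity_clb, lut_bits_used = map_five_input(parity5)
```

In this model the three LUT’s hold 16 + 16 + 8 programming bits. When F and G are given different inputs, H can only combine two one-bit summaries of them, which is consistent with the text's statement that only some functions of up to nine inputs are possible.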
`While many SRAM-based FPGA’s are designed like
`the Xilinx architecture, with a routing structure optimized
`for arbitrary, long-distance communications, several other
`FPGA’s concentrate instead on local communication. The
`“cellular”-style FPGA’s [127] feature fast, local commu-
`nication resources at the expense of more global, long-
`distance signals. As shown in Fig. 7, the CLi FPGA [75]
`has an array of cells, with a limited number of routing
`resources running horizontally and vertically between the
`cells. There is one local communication bus on each side
`of the cell. It runs the height of eight cells, at which point
`it enters a repeater. Express buses are similar to local
`buses, except that there are no connections between the
`express buses and the cells. The repeaters allow access to
`the express buses. These repeaters can be programmed to
`connect together any of the two local buses and two express
`buses connected to it. Thus, limited global communication
`can be accomplished on the local and express buses, with
`the local buses allowing shorter distance communications
`and connections to the cells while express buses allow
`longer distance connections between local buses.
While the local and express buses allow some of the
`flexibility of the Xilinx FPGA’s arbitrary routing structure,
`there are significantly fewer buses in the CLi FPGA than
`are present in the Xilinx FPGA. The CLi FPGA instead
`features a large number of local communication resources.
`
`
`
`
`Fig. 6. Details of the Xilinx (a) CLB and [(b), top] switchbox [145]. The multiplexers, LUT’s,
`and latches in the CLB are configured by SRAM bits. Diamonds in the switchbox represent six
`individual connections [(b), bottom], allowing any permutation of connections among the four
`signals incident to the diamond.
`
`
`As shown in Fig. 8, each cell receives two signals from each
`of its four neighbors. It then sends the same two outputs
`(A and B) to all of its neighbors. That is, the cell one to
`the north will send signals AN and BN, and the cell one
`to the south will send AS and BS, while both will receive
`the same signals A and B. The input signals become the
`inputs to the logic cell (Fig. 9).
`Instead of Xilinx’s LUT’s, which require many program-
`ming bits per cell, the CLi logic block is much simpler.
`It has multiplexers controlled by SRAM bits, which select
`one each of the A and B outputs of the neighboring cells.
`These are then fed into AND and XOR gates within the cell,
`as well as into a flip-flop. Although the possible functions
`are complex, notice that there is a path leading to the B
`output that produces the NAND of the selected A and B
`inputs, sending it out the B output. This path is enabled
`by setting the two 2 : 1 multiplexers to their constant input
`and setting B’s output multiplexer to the third input from
`the top. Thus, the cell is functionally complete. Also, with
`the XOR path leading to output A, the cell can efficiently
`implement a half-adder. The cell can perform a pure routing
`function by connecting one of the A inputs to the A output
`and one of the B inputs to the B output, or vice-versa.
`This routing function is created by setting the two 2 : 1
`multiplexers to their constant inputs and setting A’s and B’s
`output multiplexer to either of their top two inputs. There
`are also provisions for bringing in or sending out a signal
`on one or more of the neighboring local buses (NS1, NS2,
`EW1, EW2). Note that since there is only a single wire
`connecting the bus terminals, there can only be a single
`signal sent to or received from the local buses. If more
`than one of the buses is connected to the cell, they will be
`coupled together. Thus, the cell can take a signal running
`
`horizontally on an EW local bus and send it vertically on
`an NS local bus without using up the cell’s logic unit. By
`bringing a signal in from the local buses, however, the cell
`can implement two three-input functions.
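The claims about functional completeness and the half-adder can be checked with a short sketch. This is an illustration of the logic only, not a model of the CLi cell's multiplexer settings. The function names are hypothetical, and using the AND of the same two inputs as the carry is the standard half-adder construction, assumed here rather than taken from the cell diagram.

```python
def nand(a, b):
    """The NAND of the selected A and B inputs, as described for the B output path."""
    return 1 - (a & b)

# Every other gate can be composed from NAND alone (functional completeness).
def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))

def xor_(a, b):
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

def half_adder(a, b):
    """Return (sum, carry): sum is the XOR of the inputs, carry is their AND."""
    return a ^ b, a & b
```

Building XOR from NAND alone takes four cells in this sketch, which suggests why a dedicated XOR path in each cell makes arithmetic circuits cheaper than functional completeness by itself would.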
`The major differences between the Island-style archi-
`tecture of the Xilinx 4000 series and the cellular style
of the CLi FPGA are in their routing structure and cell
`granularity. The Xilinx 4000 series is optimized for com-
`plex, irregular random logic. It features a powerful routing
`structure optimized for arbitrary global routing and large
`logic cells capable of providing arbitrary four- and five-
`input functions. This provides a very flexible architecture,
`though one that requires many programming bits per cell
`(and thus cells that take up a large portion of the chip area).
`In contrast, the CLi architecture is optimized for highly
`local, pipelined circuits such as systolic arrays and bit-serial
`arithmetic. Thus, it emphasizes local communication at the
`expense of global routing and has simple cells. Because
`of the very simple logic cells, there will be many more
`CLi cells on a chip than will be found in the Xilinx FPGA,
`yielding a greater logic capacity for those circuits that match
`the FPGA’s structure. Because of the restricted routing,
`the CLi FPGA is much harder to map to automatically
`than the Xilinx 4000 series, though the simplicity of the
`CLi architecture makes it easier for a human designer to
`hand-map to the CLi’s structure. Thus, in general, cellular
`architectures tend to appeal to designers with appropriate
`circuit structures who are willing to spend the effort to
`hand-map their circuits to the FPGA, while the Xilinx 4000
`series is more appropriate for handling random-logic tasks
`and automatically mapped circuits.
Compared with technologies such as full-custom, standard
cells, and MPGA’s, FPGA’s will in general be slower and
`
`
`
`
Fig. 7. The CLi6000 routing architecture [75]. One 8 × 8 tile, plus a few surrounding rows and
`columns, is shown. The full array has many of these tiles abutted horizontally and vertically.
`
`less dense due to the configuration points, which take
`up significant chip area, and add extra capacitance and
`resistance (and thus delay) to the signal lines. Thus, the
`programming bits add an unavoidable overhead to the
`circuit, which can be reduced by limiting the configurability
`of the FPGA but never totally eliminated. Also, since the
`metal layers in an FPGA are prefabricated, while the other
`technologies custom fabricate the metal layers for a given
`circuit, the FPGA will have less optimized routing. This
`again results in slower and larger circuits. Even given
`these downsides, FPGA’s have the advantage that they are
`completely prefabricated. This means that they are ready
`to use instantly, while mask programmed technologies can
`require weeks to be customized. Also, since there is no
`custom fabrication involved in an FPGA, the fabrication
`costs can be amortized over all the users of the architecture,
`removing the significant nonrecurring engineering costs of
`other technologies. However, per-chip costs will in general
`be higher, making the technology better suited for low-
`volume applications. Also, since SRAM-based FPGA’s
`are reprogrammable, they are ideal for prototyping, since
`the chips are reusable after bug fixes or upgrades, where
`mask programmed and antifuse versions would have to be
`discarded.
`A technology similar to SRAM-based FPGA’s is FPIC’s
`[8] and field programmable interconnect devices (FPID’s)
`
`[72] (we will use FPIC from now on to refer to both FPIC’s
`and FPID’s). Like an SRAM-based FPGA, an FPIC is a
`completely prefabricated device with an SRAM-configured
`routing structure (Fig. 10). Unlike an FPGA, an FPIC has
`no logic capacity. Thus, the only use for an FPIC is as
`a device to interconnect
`its I/O pins arbitrarily. While
`this is not generally useful for production systems, since
`a fixed interconnection pattern can be achieved by the
`printed circuit board that holds a circuit, it can be quite
`useful in prototyping and reconfigurable computing (these
`applications are discussed later in this paper). In each of
`these cases, the connections between the chips in the system
`may need to be reprogrammable, or this connection pattern
`may change over time. In a reconfigurable computer, many
`different mappings will be loaded onto the system, and each
`of them may desire a different interconnection pattern. In
`prototyping, the connections between chips may need to
`be changed over time for bug fixes and logic upgrades.
`In either case, by routing all of the I/O pins of the logic-
`bearing chips to FPIC’s, the interconnection pattern can
`easily be changed over time. Thus, fixed routing patterns
`can be avoided, potentially increasing the performance and
`capacity of the prototyping or reconfigurable computing
`machine.
There is some question about the economic viability
of FPIC’s. The problem is that they must provide some
`
`
`
`
`Fig. 8. Details of the CLi routing architecture [75].
`
`Fig. 9. The CLi logic cell [75].
`
`advantage over an FPGA with the same I/O capacity, since
`in general an FPGA can perform the same role as the FPIC.
`One possibility is providing significantly more I/O pins in
`an FPIC than are available in an FPGA. This can be a
major advantage, since it takes many smaller I/O chips to
match the communication resources of a single high-I/O
chip (i.e., a chip with N I/O’s requires three chips with
two-thirds the I/O’s to match the flexibility). Because the
`packaging technology necessary for such high I/O chips is
`somewhat exotic, however, high-I/O FPIC’s can be expen-
`sive. Another possibility is to provide higher performance
`or smaller chip size with the same I/O capacity. Since there
`is no logic on the chip, the space and capacitance due to the
`logic can be removed. Even with these possible advantages,
`however, FPIC’s face the significant disadvantage that they
`are restricted to a limited application domain. Specifically,
`while FPGA’s can be used for prototyping, reconfigurable
`computing, low-volume products, fast time-to-market sys-
`
`tems, and multimode systems, FPIC’s are restricted to the
`interconnection portion of prototyping and reconfigurable
`computing solutions. Thus, FPIC’s may never become
`commodity parts, greatly increasing their unit cost.
`
`III. REPROGRAMMABLE LOGIC APPLICATIONS
`With the development of FPGA’s, there are now oppor-
`tunities for implementing quite different systems than were
`possible with other technologies. In this section, we will
`discuss many of these new opportunities, especially those
`of multi-FPGA systems.
`When FPGA’s were first introduced they were primarily
`considered to be just another form of gate array. While they
`had lower speed and capacity and a higher unit cost, they
`did not have the large startup costs and lead times necessary
`for MPGA’s. Thus, they could be used for implementing
`random logic and glue logic in low-volume systems with
`nonaggressive speed and capacity demands. If the capacity
`of a single FPGA was not enough to handle the desired
`computation, multiple FPGA’s could be included on the
`board, distributing the computation among these chips.
`FPGA’s are more than just slow, small gate arrays.
`The critical feature of (SRAM-based) FPGA’s is their in-
`circuit reprogrammability. Since their programming can be
`changed quickly, without any rewiring or refabrication,
`they can be used in a much more flexible manner than
standard gate arrays. One example of this is multimode
hardware. Consider a digital tape
recorder with error-correcting codes. One way to implement
`such a system is to have separate code-generation and
`code-checking hardware built into the tape machine. There
`is no reason to have both of these functions available
`simultaneously, however, since when reading from the
`tape there is no need to generate new codes, and when
`writing to the tape the code-checking hardware will be idle.
`
`
`
`
`Fig. 10. The Aptix FPIC architecture [8]. The boxed “P” indicates an I/O pin.
`
`Thus, we can have an FPGA in the system and have two
`different configurations stored in ROM, one for reading and
`one for writing. In this way, a single piece of hardware
`handles multiple computations. There have been several
`multiconfiguration systems built from FPGA’s, including
`the just-mentioned tape machine, generic printer, and CCD
`camera interfaces, pivoting monitors with landscape and
`portrait configurations, and others [43], [96], [120], [144].
`While the previous uses of FPGA’s still treat these chips
`purely as methods for implementing digital logic, there are
`other applications where this is not the case. A system
`of FPGA’s can be seen as a computing substrate with
`somewhat different properties than standard microproces-
`sors. The reprogrammability of the FPGA’s allows one
`to download algorithms onto the FPGA’s and to change
`these algorithms just as general-purpose computers can
`change programs. This computing substrate is different
`from standard processors in that it provides a huge amount
`of fine-grain parallelism, since there are many logic blocks
`on the chips, and the instructions are quite simple, on
`the order of a single five-bit-input, one-bit-output function.
`Also, while the instruction stream of a microprocessor can
`be arbitrarily complex, with the function computed by the
`
`logic changing on a cycle-by-cycle basis, the program-
`ming of an FPGA is in general held constant throughout
`the execution of the mapping (exceptions to this include
`techniques of run-time reconfigurability described below).
Thus, a microprocessor achieves a variety of different
functions in a mapping temporally, with different functions
executed during different cycles, while an
FPGA-based computing machine achieves variety spatially,
having different logic elements compute different functions.
`This means that microprocessors are superior for complex
`control flow and irregular computations, while an FPGA-
`based computing machine can be superior for data-parallel
`applications, where a huge quantity of data must be acted on
`in a very similar manner. Note that there is work being done
`on trying to bridge this gap and develop FPGA-processor
`hybrids that can achieve both spatial and limited temporal
`function variation [18], [34], [35], [88], [95], [97].
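The temporal versus spatial distinction can be illustrated with a toy cycle-count model. Everything in this sketch is an assumption made for illustration: the three operations in `OPS`, the one-operation-per-cycle processor, and the one-stage-per-operation pipeline are hypothetical and ignore real differences in clock rate, memory bandwidth, and configuration time.

```python
# Three simple operations applied in sequence to every data item (hypothetical).
OPS = [lambda x: x + 1, lambda x: x * 2, lambda x: x ^ 0x5]

def temporal(data):
    """Processor style: one functional unit runs a different op each cycle."""
    results, cycles = [], 0
    for x in data:
        for op in OPS:
            x = op(x)
            cycles += 1
        results.append(x)
    return results, cycles

def spatial(data):
    """FPGA style: each op is a fixed pipeline stage; all stages work every cycle."""
    stages = [None] * len(OPS)               # output register of each stage
    results, cycles = [], 0
    stream = list(data) + [None] * len(OPS)  # extra cycles to drain the pipeline
    for x in stream:
        if stages[-1] is not None:           # a finished value leaves the pipeline
            results.append(stages[-1])
        # Update from the last stage backward so values advance one stage per cycle.
        for i in range(len(OPS) - 1, 0, -1):
            prev = stages[i - 1]
            stages[i] = OPS[i](prev) if prev is not None else None
        stages[0] = OPS[0](x) if x is not None else None
        cycles += 1
    return results, cycles
```

Under these assumptions, n data items cost 3n cycles on the single functional unit but only n + 3 cycles through the pipeline, with identical results. The advantage depends on every item following the same fixed sequence of operations, which matches the text's point that data-parallel workloads favor the spatial approach while irregular control flow favors the processor.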
`There have been several computing applications where
`a multi-FPGA system has delivered the highest perfor-
`mance implementation. An early example is genetic string
`matching on the Splash machine [50]. Here, a linear array
`of Xilinx 3000 series FPGA’s was used to implement a
`systolic algorithm to determine the edit distance between
`
`
`
`
`two strings. The edit distance is the minimum number of in-
`sertions and deletions necessary to transform one string into
`another, so the strings “flea” and “fleet” would have an edit
`distance of three (delete “a” and insert “et” to go from “flea”
`to “