`
`Reference 18
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2130, p. 1
`
`
`
`US007856545B2
`
`(12) United States Patent
`Casselman
`
`(10) Patent No.: US 7,856,545 B2
`(45) Date of Patent: Dec. 21, 2010
`
`(54) FPGA CO-PROCESSOR FOR ACCELERATED
`COMPUTATION
`
`2005/0027970 A1 2/2005 Arnold
`2008/0028187 A1 1/2008 Casselman et al.
`
`(75) Inventor: Steven Casselman, Sunnyvale, CA (US)
`
`(73) Assignee: DRC Computer Corporation,
`Sunnyvale, CA (US)
`
`(*) Notice: Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 727 days.
`
`(21) Appl. No.: 11/829,801
`(22) Filed: Jul. 27, 2007
`
`(65) Prior Publication Data
`US 2008/0028186 A1 Jan. 31, 2008
`
`Related U.S. Application Data
`(60) Provisional application No. 60/820,730, filed on Jul.
`28, 2006.
`
`(51) Int. Cl.
`G06F 5/00 (2006.01)
`G06F 5/76 (2006.01)
`(52) U.S. Cl. ........................................................ 712/34
`(58) Field of Classification Search .................... 712/34
`See application file for complete search history.
`(56) References Cited
`
`U.S. PATENT DOCUMENTS
`6,961,841 B2 * 11/2005 Huppenthal et al. ........... 712/34
`7,210,022 B2 * 4/2007 Jungck et al. ................. 712/34
`7,386,704 B2 * 6/2008 Schulz et al. ................. 712/15
`2004/0250046 A1 12/2004 Gonzalez
`
`
`
`FOREIGN PATENT DOCUMENTS
`WO PCT/US2007/074660 7/2008
`WO PCT/US2007/074661 7/2008
`WO WO 2008/014493 A3 10/2008
`
`OTHER PUBLICATIONS
`Blume et al.; Integration of High-Performance ASICs into
`Reconfigurable Systems Providing Additional Multimedia
`Functionality; 2000; IEEE.*
`
`(Continued)
`Primary Examiner: Eddie P Chan
`Assistant Examiner: Corey Faherty
`(74) Attorney, Agent, or Firm: The Webostad Firm
`
`(57) ABSTRACT
`A co-processor module for accelerating computational performance
`includes a Field Programmable Gate Array (“FPGA”) and a
`Programmable Logic Device (“PLD”) coupled to the FPGA and
`configured to control start-up configuration of the FPGA. A
`non-volatile memory is coupled to the PLD and configured to store
`a start-up bitstream for the start-up configuration of the FPGA. A
`mechanical and electrical interface is for being plugged into a
`microprocessor socket of a motherboard for direct communication
`with at least one microprocessor capable of being coupled to the
`motherboard. After completion of a start-up cycle, the FPGA is
`configured for direct communication with the at least one
`microprocessor via a microprocessor bus to which the
`microprocessor socket is coupled.
`
`14 Claims, 6 Drawing Sheets
`
`
`OTHER PUBLICATIONS
`Fawcett: Taking Advantage of Reconfigurable Logic; 1994; IEEE.*
`Blume et al.; Integration of High-Performance ASICs into
`Reconfigurable Systems Providing Additional Multimedia
`Functionality; 2000; IEEE.
`Letter from Jody Bishop to Steve Casselman dated Nov. 26, 2008.
`Letter from Jody Bishop to Steve Casselman dated Dec. 16, 2008.
`Maya Gokhale & Paul S. Graham, Reconfigurable Computing:
`Accelerating Computation with Field-Programmable Gate Arrays,
`2003, pp. 4-5. The Netherlands, Springer Pub.
`
`Jeffrey M. Arnold, Duncan A. Buell, Dzung T. Hoang, Daniel V.
`Pryor, Nabeel Shirazi, Mark R. Thistle, The Splash 2 Processor and
`Applications, 1993, pp. 282-285, IEEE.
`XSA Board V1.1, V1.2 User Manual, Jun. 23, 2005, XESS Corporation.
`XSA-50 Spartan-2 Prototyping Board with 2.5V, 50,000-gate FPGA,
`1998-2008, XESS, from http://www.xess.com/prod027.php3.
`
`* cited by examiner
`
`FIG. 1 (Sheet 1 of 6): co-processor module and motherboard 10 with
`two processor sockets; reference numerals 100 through 105, 298, 299.
`
`
`
`FIG. 2 (Sheet 2 of 6): block diagram of co-processor module 200 with
`blocks labeled SRAM, PLD, FPGA, FLASH, DRAM and Processor;
`reference numerals 10, 101, 102, 104, 200 through 203, 210 through 214.
`
`
`
`FIG. 3 (Sheet 3 of 6): internal layout of FPGA 201 with blocks
`labeled Hypertransport Interface 301, Dynamic RAM Interface 302,
`Static RAM Interface 303, DMA and Arbitration 304, and User Logic
`306; reference numerals 300 and 305.
`
`
`
`FIG. 4 (Sheet 4 of 6): module 200 with daughter card 400; reference
`numerals 401 and 402.
`
`
`
`FIG. 5 (Sheet 5 of 6): flowchart
`501: Main processor writes new bitstream to module
`502: Module saves bitstream in SRAM or DRAM
`503: Main processor gives bitstream address and "reprogram" command to module
`506: Decision between Full Reprogram and Partial Reprogram
`504 (full): PLD reprograms FPGA using configuration pins
`505 (partial): Internal memory interface reads bitstream and passes it to
`internal configuration logic
`
`
`
`FIG. 6 (Sheet 6 of 6): flowchart
`601: Translate C description to HDL
`602: Generate constraints
`603: Synthesize HDL to FPGA netlist
`604: Combine with wrapper logic
`605: Place and route FPGA
`606: Generate FPGA bitstream
`
`
`
`FPGA CO-PROCESSOR FOR ACCELERATED
`COMPUTATION
`
`This application claims benefit to U.S. provisional patent
`application No. 60/820,730, entitled “FPGA Co-Processor for
`Accelerated Computation,” filed Jul. 28, 2006, which is herein
`incorporated by reference in its entirety.
`
`FIELD
`
`One or more embodiments generally relate to accelerators
`and, more particularly, to a co-processor module including a
`Field Programmable Gate Array (“FPGA”).
`
`BACKGROUND
`
`
`Co-processors have often been used to accelerate computational
`performance. For example, early microprocessors were unable to
`include floating-point computation circuitry due to chip area
`limitations. Doing floating-point computations in software is
`extremely slow, so this circuitry was often placed in a second chip
`which was activated whenever a floating-point computation was
`required. As chip technology improved, the microprocessor chip and
`the floating-point co-processor chip were combined together.
`A similar situation occurs today with specialized computational
`algorithms. Standard microprocessors do not include circuitry for
`performing these algorithms because they are often specific to only
`a few users. By using an FPGA (field programmable gate array) as a
`co-processor, an algorithm can be designed and programmed into
`hardware to build a circuit that is unique for each application,
`resulting in a significant acceleration of the desired computation.
`
`
`SUMMARY
`
`
`One or more embodiments generally relate to accelerators
`and, more particularly, to a co-processor module including a
`Field Programmable Gate Array (“FPGA”).
`A co-processor module for accelerating computational performance
`includes a Field Programmable Gate Array (“FPGA”) and a
`Programmable Logic Device (“PLD”) coupled to the FPGA and
`configured to control start-up configuration of the FPGA. A
`non-volatile memory is coupled to the PLD and configured to store
`a start-up bitstream for the start-up configuration of the FPGA. A
`mechanical and electrical interface is for being plugged into a
`microprocessor socket of a motherboard for direct communication
`with at least one microprocessor capable of being coupled to the
`motherboard. After completion of a start-up cycle, the FPGA is
`configured for direct communication with the at least one
`microprocessor via a microprocessor bus to which the
`microprocessor socket is coupled.
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`
`Accompanying drawing(s) show exemplary embodiment(s) in accordance
`with one or more embodiments; however, the accompanying drawing(s)
`should not be taken to limit the invention to the embodiment(s)
`shown, but are for explanation and understanding only.
`FIG. 1 is a diagram of an exemplary co-processor module
`which may be coupled to a motherboard with two processor
`sockets, according to one embodiment.
`FIG. 2 is a block diagram of an exemplary co-processor
`module, including major components and busses, according
`to one embodiment.
`
`FIG. 3 is a block diagram of an exemplary layout of internal
`functions of the co-processor FPGA, according to one
`embodiment.
`FIG. 4 is a diagram of an exemplary expanded co-processor
`module with a daughter card containing additional logic
`functions, according to one embodiment.
`FIG. 5 is a flowchart showing a method for partially or fully
`reprogramming a co-processor module from SRAM, according
`to one embodiment.
`FIG. 6 is a flowchart showing a method for creating a
`co-processor configuration to accelerate a specific algorithm,
`according to one embodiment.
`
`DETAILED DESCRIPTION
`
`In the following description, numerous specific details are
`set forth to provide a more thorough description of the specific
`embodiments of the invention. It should be apparent, however, to
`one skilled in the art, that the invention may be practiced without
`all the specific details given below. In other instances,
`well-known features have not been described in detail so as not to
`obscure the invention. For ease of illustration, the same number
`labels are used in different diagrams to refer to the same items;
`however, in alternative embodiments the items may be different.
`Furthermore, although particular integrated circuit parts are
`described herein for purposes of clarity by way of example, it
`should be understood that the scope of the description is not
`limited to these particular numerical examples as other integrated
`circuit parts may be used.
`A multi-processor system consists of several processing
`chips connected to each other by high-speed busses. By
`replacing one or more of these processor chips by
`application-specific co-processors, it is often possible to obtain
`a significant acceleration in computational speed. Each
`co-processor sits in the motherboard socket designed for a standard
`processor and makes use of motherboard resources.
`According to one embodiment, the co-processor FPGA is
`located on a module which plugs into a standard microprocessor
`socket. Motherboards are commonly available which have multiple
`microprocessor sockets, allowing one or more standard
`microprocessors to co-exist with one or more co-processor modules.
`Thus, no changes to the motherboard or other system hardware are
`required, making it easy to build co-processor systems. The
`co-processor has access to motherboard resources including large
`amounts of memory. These resources need not be duplicated on the
`co-processor module, reducing the cost, size and power requirements
`for the co-processor. The co-processor is connected to the main
`processor by one or more high-speed, low-latency busses. Many
`algorithms require frequent communication between the main
`microprocessor and the co-processor, making this interface a factor
`in achieving high performance.
`According to another embodiment, to accelerate computational
`algorithms, a co-processor module is included which plugs into a
`standard microprocessor socket on a motherboard and communicates
`with the microprocessor by one or more high-speed, low-latency
`busses. The co-processor has access to motherboard resources
`through the microprocessor socket. The co-processor includes an
`FPGA which is reconfigurable and may be loaded with a new
`configuration pattern suitable for a different algorithm under
`control of the microprocessor. The configuration pattern is
`developed using a set of software tools. The co-processor module
`capabilities may be extended by adding additional piggyback cards.
`Another embodiment is an accelerator module, including
`an FPGA and a Programmable Logic Device (“PLD”)
`coupled to the FPGA and configured to control start-up
`configuration of the FPGA. A non-volatile memory is coupled to
`the PLD and configured to store a start-up bitstream for the
`start-up configuration of the FPGA. A mechanical and electrical
`interface is configured for being plugged into a microprocessor
`socket of a motherboard for direct communication with at least one
`microprocessor capable of being coupled to the motherboard. After
`completion of a start-up cycle, the FPGA is configured for direct
`communication with the at least one microprocessor via a
`microprocessor bus to which the microprocessor socket is coupled.
`Another embodiment generally is an accelerator system,
`comprising a first motherboard having accelerator modules
`and a second motherboard having at least one microprocessor.
`Each of the accelerator modules includes an FPGA and a
`Programmable Logic Device (“PLD”) coupled to the FPGA
`and configured to control start-up configuration of the FPGA.
`A non-volatile memory is coupled to the PLD and configured
`to store a start-up bitstream for the start-up configuration of
`the FPGA. A mechanical and electrical interface is configured
`for being plugged into a microprocessor socket of the first
`motherboard for direct communication as between the accelerator
`modules. The microprocessor socket is coupled to a
`microprocessor bus for the direct communication between
`the accelerator modules.
`Yet another embodiment generally is a method for co-processing.
`An accelerator module is coupled to a microprocessor bus, the
`accelerator module including a Field Programmable Gate Array
`(“FPGA”). A microprocessor bus interface bitstream is loaded into
`the FPGA to program programmable logic thereof. Data is transferred
`to first memory of the accelerator module via a microprocessor bus
`using a microprocessor bus interface instantiated in the FPGA
`responsive to the microprocessor bus interface bitstream. A default
`configuration bitstream stored in the first memory is instantiated
`in the FPGA to configure the FPGA to have the microprocessor bus
`interface with sufficient functionality to be recognized by a
`microprocessor coupled to the microprocessor bus.
`Still yet another embodiment generally is another method
`for co-processing. An accelerator module, which includes a
`Field Programmable Gate Array (“FPGA”) and first memory,
`is coupled to a microprocessor bus. The first memory has a
`default configuration bitstream stored therein. The default
`configuration bitstream is loaded into the FPGA to program
`programmable logic thereof. The default configuration bitstream
`includes a microprocessor bus interface. The FPGA is configured
`with the default configuration bitstream with sufficient
`functionality to be recognized by a microprocessor coupled to the
`microprocessor bus.
`Referring to FIG. 1, a multiprocessor motherboard 10 is
`shown containing two processor chips 100 and 101 and
`DRAM modules 104 and 105. In one embodiment, the processor
`chips are Opteron microprocessors available from Advanced Micro
`Devices (AMD), although processors available from other companies
`such as Intel could also be used. A typical motherboard also
`contains many other components which are omitted here for clarity.
`In one embodiment, the K8SRE (S2891) motherboard from Tyan Computer
`Corporation is used, although many other suitable motherboards are
`available from this and other vendors. Motherboards are
`available with various numbers of processor chips 100, 101.
`Typically, a motherboard contains between one and eight
`processor chips. In one embodiment, a motherboard with
`sockets for at least two processor chips is required. One or
`more processor chips 100, 101 are removed and replaced with
`co-processor modules 200. If the motherboard contains more
`than two processor chips, several of them may be replaced
`with co-processor modules 200 providing that at least one
`processor chip remains on the motherboard.
`It is also possible to build high performance computing
`systems with multiple motherboards interconnected by high-speed
`busses. In such a system, some of the motherboards may contain
`only co-processor modules while other motherboards contain only
`processor chips or a mixture of processor chips and co-processor
`modules. In such a multi-board system, there must be at least one
`processor chip in order to communicate with one or more
`co-processor modules.
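The population rule in the two paragraphs above (at least one processor chip must remain somewhere in the system) can be illustrated with a minimal Python sketch. The function name and the "cpu"/"coproc" labels are hypothetical and chosen only for illustration; they do not come from the patent.

```python
from typing import List


def has_host_processor(boards: List[List[str]]) -> bool:
    """Check the rule that a system needs at least one processor chip.

    Each board is modeled as a list of socket contents, where every
    socket holds either "cpu" (a processor chip) or "coproc" (a
    co-processor module). These labels are illustrative assumptions.
    """
    return any(slot == "cpu" for board in boards for slot in board)


# A two-socket board with one processor replaced by a co-processor.
single_board = [["cpu", "coproc"]]
# A multi-board system where one board carries only co-processors.
multi_board = [["coproc", "coproc"], ["cpu", "coproc"]]
# Invalid: no processor chip is left to communicate with the modules.
no_host = [["coproc", "coproc"]]

print(has_host_processor(single_board))
print(has_host_processor(multi_board))
print(has_host_processor(no_host))
```

The check is deliberately system-wide rather than per board, since the text allows a board that holds only co-processor modules as long as a processor chip exists elsewhere.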
`Returning now to FIG. 1, processor chips 100, 101 are
`attached to motherboard 10 using sockets 102, 103 which
`allow them to be easily removed. Co-processor module 200
`has the same mechanical and electrical interface via circuit
`board 299 and pins 298 as processor chips 100, 101, allowing
`easy replacement with minimal or no changes to motherboard
`10. Motherboard 10 also contains memory modules 104
`which are normally coupled for communication with a processor
`chip 100 plugged in socket 102. Memory modules 105
`are similarly coupled for communication with a processor
`chip 101 plugged in socket 103. When processor chip 100 is
`replaced by co-processor 200, co-processor 200 has access to
`memory modules 104.
`Referring now to FIG. 2, a block diagram of co-processor
`module 200 is shown in more detail, along with its connections
`to motherboard 10. Co-processor module 200 contains
`FPGA (field-programmable gate array) 201, SRAM (static
`random access memory) 202, PLD (programmable logic
`device) 203 and flash memory 204, along with other components
`such as resistors, capacitors, buffers and oscillators
`which have been omitted for clarity. In one embodiment,
`FPGA 201 is an XC4VLX60FF668 available from Xilinx
`corporation, although there are numerous FPGAs available
`from Xilinx and other vendors such as Altera which would
`also be suitable. SRAM 202 may be an IDT71T75602S2OBG
`from Integrated Device Technology corporation, PLD 203
`may be an EPM7256BUC169 from Altera corporation and
`flash memory 204 may be a TC58FVM5T2AXB65 from
`Toshiba corporation, according to one embodiment. In each
`case, there are numerous alternative components which could
`be used instead. FPGA 201 is connected through bus 211 and
`socket 102 to the motherboard memory module 104. It is also
`connected through bus 210 and socket 102 to the remaining
`motherboard processor chip 101. In one embodiment, bus
`210 is a hypertransport bus. The hypertransport bus has high
`bandwidth and low latency characteristics, for example with
`respect to availability to processor 101, although other busses
`such as PCI, PCI Express or RapidIO could be used instead
`with the appropriate motherboard components. The hypertransport
`bus, which is a point-to-point bus, also forms a
`direct connection between processor 101 and co-processor
`module 200 without passing through any intermediate chips
`or busses. This direct connection greatly improves throughput
`and latency when transferring data to the co-processor.
`FPGA 201 also connects to SRAM 202 and PLD 203 via
`bus 214. PLD 203 additionally connects to flash memory 204
`via bus 213 and to FPGA 201 via programming signals 212.
`Referring now to FIG. 3, the internal logic of FPGA 201 is
`described. An FPGA is a device which may be programmed to
`perform various logical functions. FPGA 201 is reprogrammable
`so it may perform a first set of logical functions, then,
`after reprogramming, a second set of logical functions. This
`allows different algorithms to be programmed depending on
`the needs of a particular customer or application. The logical
`function of FPGA 201 is divided into two portions.
`Customer-specific algorithms are programmed into the user logic
`section 306 of FPGA 201. In addition to user logic 306, the
`FPGA includes a set of interface or support functions 300. In
`one embodiment, these support functions 300 are: a hypertransport
`interface 301, a DDR (double data-rate) DRAM
`(dynamic random-access memory) interface 302, a static
`RAM (random access memory) interface 303 and a DMA and
`arbitration function 304. These support functions 300 are
`connected to user logic 306 by standard wrapper interface
`305. The wrapper interface 305 is designed to present a
`consistent view of support functions 300 so additional functions
`may be added or functions may be changed internally without
`the need to change user logic 306. The user logic portion of
`FPGA 201 may also be reprogrammed to represent different
`algorithms while the support functions 300 continue to operate.
`This is necessary since many functions such as hypertransport
`interface 301 and DDR memory interface 302 cannot
`be interrupted without a long restart procedure.
`The physical size of module 200 is limited because of the
`need to fit into socket 102 without interfering with other
`components which may exist on motherboard 10. At the same
`time, it is desirable to be able to expand the functionality of
`module 200 to support various applications. Expanded
`functionality may include, for example, additional memory or
`additional hypertransport interfaces. FIG. 4 shows how module
`200 may be expanded by adding a daughter card 400
`which includes additional components. The daughter card
`400 is attached to module 200 by connectors 401, 402.
`Referring now to FIG. 5, the process of configuring FPGA
`201 on module 200 is described with renewed reference to
`FIGS. 1-3. When power is initially supplied or the processor
`reset signal is applied, FPGA 201 is programmed automatically
`from flash memory 204. FPGA 201 may also be reprogrammed
`automatically from flash memory 204 if it ceases to
`operate due to various conditions. Monitor logic is built into
`FPGA 201 and PLD 203 which checks for correct operation
`of FPGA 201 and initiates reprogramming if it senses a fault
`condition. The programming and reprogramming processes
`are controlled by PLD 203. Xilinx and others supply logic
`circuits and detailed instructions for programming an FPGA
`from a flash memory. In order to initially program flash
`memory 204, a configuration pattern is loaded into FPGA 201
`using a JTAG connector on module 200. This configuration
`pattern is sufficient to operate hypertransport interface 301.
`Hypertransport interface 301 is then used to transfer data to
`flash memory 204 under control of PLD 203. Flash memory
`204 normally contains a default FPGA configuration for support
`functions 300 that is sufficient to operate the hypertransport
`interface 301, memory interfaces 302, 303 and DMA and
`arbitration function 304 but does not include configuration
`information for user logic 306. PLD 203 is initially configured
`using a JTAG (Joint Test Action Group standard 1149.1)
`connector on module 200. Alternatively, flash memory 204
`and PLD 203 may be initially loaded with a default configuration
`before being soldered onto module 200. Flash memory
`204 and PLD 203 may be reloaded while FPGA 201 is operating,
`by transferring new data over hypertransport interface
`301. Flash memory 204 is intended to provide semi-permanent
`storage for the default FPGA configuration and is
`changed infrequently. PLD 203 provides basic support functions
`for module 200 and is also changed infrequently.
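As a rough illustration of the start-up and fault-recovery behavior described above, the following Python sketch models the PLD reloading the default configuration from flash on power-up, on reset, and whenever the monitor logic reports a fault. The class, its attribute names and the placeholder bitstream string are hypothetical; this is a behavioral model under those assumptions, not the circuitry of the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModuleModel:
    """Toy model of module 200: flash holds the default configuration."""

    # Placeholder for the default support-function bitstream in flash.
    flash_bitstream: str = "default-support-functions"
    # What the FPGA is currently configured with (None = unconfigured).
    fpga_config: Optional[str] = None
    # How many times the PLD has (re)programmed the FPGA from flash.
    reload_count: int = 0

    def load_from_flash(self) -> None:
        """Model the PLD programming the FPGA from flash memory."""
        self.fpga_config = self.flash_bitstream
        self.reload_count += 1

    def power_on_or_reset(self) -> None:
        """Power-up or processor reset triggers automatic programming."""
        self.load_from_flash()

    def monitor(self, fpga_healthy: bool) -> None:
        """Monitor logic: reprogram from flash only on a fault."""
        if not fpga_healthy:
            self.load_from_flash()


module = ModuleModel()
module.power_on_or_reset()
module.monitor(fpga_healthy=True)   # no action while the FPGA is healthy
module.monitor(fpga_healthy=False)  # fault sensed, PLD reloads from flash
print(module.fpga_config, module.reload_count)
```

The model reloads only the default configuration, matching the statement that flash does not hold configuration information for user logic.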
`Once the default configuration pattern (bitstream) is loaded
`into FPGA 201, module 200 becomes visible over the hypertransport
`bus to a main processor 101 in the system. At 501,
`the main processor transfers a new configuration pattern over
`hypertransport bus 210 for writing to FPGA 201 of module
`200. This new configuration pattern typically contains a user
`logic function 306 and may also contain new definitions for
`support functions 300. At 502, FPGA 201 of module 200
`saves the new configuration pattern into either SRAM or
`DRAM using the memory interfaces 302 or 303. If full
`reconfiguration of FPGA 201 is planned, the configuration pattern
`must be saved into SRAM. DRAM cannot be used for full
`reconfiguration because the configuration data would be lost
`when DRAM interface 302 ceases to operate during the
`configuration process. SRAM may be controlled using PLD 203
`instead of SRAM interface 303 in FPGA 201 so the configuration
`data is retained while FPGA 201 is reprogrammed. The
`processes 501 and 502 may operate concurrently since the
`amount of data required to configure FPGA 201 may be very
`large. At 503, main processor 101 uses the hypertransport bus
`to send FPGA 201 of module 200 the address of the configuration
`pattern in SRAM or DRAM, along with a command to
`reprogram itself. A decision 506 is then made whether to do
`full or partial reconfiguration.
`During partial reconfiguration, support functions 300
`remain active and only enough data must be transferred over
`hypertransport bus 210 to configure user logic 306. This
`allows partial reconfiguration to be much faster than full
`reconfiguration, making partial reconfiguration the preferred
`alternative in most situations. Data for partial reconfiguration
`may be saved in either DRAM or SRAM. When module 200
`is used to accelerate computational algorithms, frequent
`reconfiguration is often necessary and reconfiguration time
`becomes a limiting factor in determining the amount of
`acceleration that may be obtained. Partial reconfiguration at 505
`involves FPGA 201 loading the reconfiguration data, where
`an internal memory interface of FPGA 201 is used to read a
`bitstream and pass it to user logic 306. After loading is
`complete, new logic functions specified by the new configuration
`become active and may be used.
`If full reconfiguration is desired at 504 of FIG. 5, PLD 203
`takes over control of SRAM 202, erases FPGA 201 and
`transfers a complete new configuration pattern to FPGA 201.
`This is similar to initial programming except that the
`configuration data comes from SRAM 202 instead of flash memory
`204.
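The reprogramming flow of FIG. 5, including the constraint that full reconfiguration needs the bitstream in SRAM, can be sketched in Python as below. The enum and function names are hypothetical, and the returned strings only paraphrase the steps in the text; this is an illustrative model, not driver code for the module.

```python
from enum import Enum
from typing import List


class Storage(Enum):
    SRAM = "SRAM"
    DRAM = "DRAM"


class Mode(Enum):
    FULL = "full"
    PARTIAL = "partial"


def plan_reconfiguration(mode: Mode, storage: Storage) -> List[str]:
    """Return the ordered steps for one reconfiguration request.

    Full reconfiguration from DRAM is rejected because the DRAM
    interface inside the FPGA stops operating while the FPGA is
    being reprogrammed, so the configuration data would be lost.
    """
    if mode is Mode.FULL and storage is Storage.DRAM:
        raise ValueError("full reconfiguration requires the bitstream in SRAM")

    steps = [
        "501: main processor writes new bitstream to module",
        f"502: module saves bitstream in {storage.value}",
        "503: main processor sends bitstream address and reprogram command",
    ]
    if mode is Mode.FULL:
        # The PLD drives SRAM itself so the data survives FPGA erasure.
        steps.append("504: PLD takes over SRAM, erases FPGA, loads full bitstream")
    else:
        # Support functions stay active; only user logic is rewritten.
        steps.append("505: FPGA reads bitstream through its internal memory interface")
    return steps


for step in plan_reconfiguration(Mode.PARTIAL, Storage.DRAM):
    print(step)
```

Rejecting the invalid combination up front mirrors the text's requirement that the storage choice be made at 502, before the reprogram command is issued at 503.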
`With additional reference to FIG. 6, the process of generating
`user logic 306 is described. Co-processor module 200
`may accelerate computational algorithms. These algorithms
`are typically described in a computer language such as C.
`Unfortunately, the C language is designed to execute on a
`sequential processor such as the Opteron from AMD or the
`Pentium from Intel. Using an FPGA co-processor directly to
`execute an algorithm described in the C language would offer
`little or no acceleration since it would not utilize the primary
`advantages of the co-processor. The primary advantages of an
`FPGA co-processor compared to a sequential processor are a
`vast amount of parallelism and a potentially much higher
`memory bandwidth. In order to use the FPGA efficiently, the
`initial C description must be translated into a hardware
`description language (“HDL”), such as VHDL or Verilog.
`This is shown in 601 of FIG. 6. Tools are available from
`companies such as Celoxica that do this translation. Additionally,
`there are variations of the C language such as UPC
`(unified parallel C) in which some parallelism is made visible
`to the user. These dialects of C may be translated more
`efficiently into FPGA co-processors.
`At 602, constraints are generated for the user design. These
`include both physical and timing constraints. Physical constraints
`are necessary to ensure that user logic 306 connects
`correctly and does not conflict with support functions 300.
`Timing constraints determine the operating speed of user
`logic 306 and prevent other potential timing problems such as
`race conditions.
`
`At 603, user logic 306 is synthesized. Synthesis converts
`the design from an HDL description to a netlist of FPGA
`primitives. The Xilinx tool XST may be used.
`At 604, the user logic 306 is combined with the pre-designed
`support functions 300. The support functions 300, as
`well as wrapper interface 305 associated therewith, have a
`pre-assigned fixed placement so they may be combined with
`arbitrary user logic without affecting operation of support
`functions 300. Sections of the support functions 300 are very
`sensitive to timing and correct operation could not be guaranteed
`without fixing the placement.
`At 605, the design for instantiation in user logic 306 is
`placed and routed. Placement and routing is performed by the
`appropriate FPGA software tools. These are available from
`the FPGA vendor. Constraints generated at 602 guide the
`place and route 605 as well as synthesis 603 to ensure that the
`desired speed and functionality are achieved.
`At 606 a full or partial configuration pattern (or bitstream)
`for the FPGA is generated. This may be performed by a tool
`supplied by the FPGA vendor. The bitstream is then ready for
`download into co-processor FPGA 201.
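The tool flow of FIG. 6 is a fixed sequence of stages, which the following Python sketch captures as an ordered table and a small runner. The `run_stage` callback is a hypothetical hook standing in for whatever translation, synthesis and place-and-route tools are actually used; no real tool command lines are implied.

```python
from typing import Callable, List, Optional, Tuple

# Stage numbers and descriptions follow steps 601 through 606 of FIG. 6.
BUILD_STAGES: List[Tuple[int, str]] = [
    (601, "translate C description to HDL"),
    (602, "generate constraints"),
    (603, "synthesize HDL to FPGA netlist"),
    (604, "combine with wrapper logic"),
    (605, "place and route FPGA"),
    (606, "generate FPGA bitstream"),
]


def run_build(
    source_name: str,
    run_stage: Optional[Callable[[int, str], str]] = None,
) -> Tuple[str, List[str]]:
    """Run every stage in order and return (final artifact, log).

    run_stage(step, artifact) is a hypothetical callback that would
    invoke the real tool for that step and return the new artifact.
    Without a callback the artifact is passed through unchanged.
    """
    artifact = source_name
    log: List[str] = []
    for step, description in BUILD_STAGES:
        if run_stage is not None:
            artifact = run_stage(step, artifact)
        log.append(f"{step}: {description}")
    return artifact, log


# Dry run that only records which stage touched the artifact.
artifact, log = run_build("filter.c", lambda step, a: f"{a}>{step}")
print(artifact)
print("\n".join(log))
```

Constraint generation (602) is placed ahead of synthesis (603) and place and route (605) because the text states that the constraints guide both of those stages.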
`While the foregoing describes exemplary embodiment(s)
`in accordance with one or more embodiments, other and
`further embodiment(s) in accordance with the one or more
`embodiments may be devised without departing from the
`scope thereof, which is determined by the claim(s) that follow
`and equivalents thereof. Claim(s) listing steps do not imply
`any order of the steps. Trademarks are the property of their
`respective owners.
`What is claimed is:
`1. A method for co-processing, comprising:
`coupling an accelerator module to a microprocessor bus,
`the accelerator module including a Field Programmable
`Gate Array (“FPGA”);
`loading a microprocessor bus interface bitstream into the
`FPGA to program programmable logic thereof;
`transferring data to first memory of the accelerator module
`via a microprocessor bus using a microprocessor bus
`interface instantiated in the FPGA responsive to the
`microprocessor bus interface bitstream;
`instantiating a default configuration bitstream stored in the
`first memory in the FPGA to configure the FPGA to have
`the microprocessor bus interface with sufficient
`functionality to be recognized by a microprocessor coupled
`to the microprocessor bus; and
`communicating under control of the microprocessor a
`configuration pattern to second memory of the accelerator
`module using a first memory interface instantiated in the
`FPGA responsive to the instantiating of the default
`configuration bitstream.
`2. The method according to claim 1, wherein the loading is
`via a JTAG interface of the FPGA.
`3. The method according to claim 2, wherein the loading
`and the transferring are under control of a Programmable
`Logic Device (“PLD”), the PLD being included as part of the
`accelerator module.
`4. The method according to claim 1, further comprising
`sending control information from the microprocessor to the
`FPGA to indicate location of the configuration pattern in the
`second memory for instantiation in user programmable logic
`of the FPGA.
`5. The method according to claim 4, wherein the control
`information is for partial reconfiguration of the user
`programmable logic of the FPGA.
`6. The method according to claim 5, wherein the first
`memory is flash memory; and wherein the second memory is
`either Static Random Access Memory (“SRAM”) or
`Dynamic Random Access Memory (“DRAM”).
`7. The method according to claim 1, wherein the microprocessor
`bus interface bitstream and the default configuration
`bitstream are instantiated in the FPGA with pre-assigned
`fixed placement.
`8. A method for co-processing, comprising:
`coupling an accelerator module to a microprocessor bus,
`the accelerator module including a Field Programmable
`Gate Array (“FPGA”) and first memory, the first
`memory having a default configuration bitstream stored
`therein;
`loading the default configuration bitstream into the FPGA
`to program programmable logic thereof, the default
`configuration bitstream including a microprocessor bus
`interface; and
`configuring the FPGA with the default configuration
`bitstream with sufficient functionality to be recogniz