United States Patent [19]
DeHon et al.

US005956518A

[11] Patent Number: 5,956,518
[45] Date of Patent: Sep. 21, 1999

[54] INTERMEDIATE-GRAIN RECONFIGURABLE PROCESSING DEVICE

[75] Inventors: André DeHon; Ethan Mirsky, both of
Cambridge; Thomas F. Knight, Jr.,
Belmont, all of Mass.

[73] Assignee: Massachusetts Institute of
Technology, Cambridge, Mass.

[21] Appl. No.: 08/632,371

[22] Filed: Apr. 11, 1996

[51] Int. Cl.: G06F 15/80
[52] U.S. Cl.: 395/800.15; 395/653
[58] Field of Search: 395/800.1, 800.15,
395/200.51, 280, 653
`
`Primary Examiner—Eric Coleman
`Attorney, Agent, or Firm—Hamilton, Brook, Smith &
`Reynolds, P.C.
`
`[57]
`
`ABSTRACT
`
A programmable integrated circuit utilizes a large number of
intermediate-grain processing elements which are multibit
processing units arranged in a configurable mesh. The
coarse-grain resources, such as memory and processing, are
deployable in a way that takes advantage of the opportunities
for optimization present in given problems. To accomplish
this, the interconnect supports three different modes of
operation: a static value in which a value set by the
configuration data is provided to a functional unit, static
source in which another functional unit serves as the value
source, and a dynamic source mode in which the source is
determined by the value from another functional unit.
`
`
`31 Claims, 20 Drawing Sheets
`
[Front-page drawing, labeled "8 BIT MICROPROCESSOR"; the
drawing itself did not survive text extraction.]
`Petitioner Microsoft Corporation - Ex. 1019, p. 1
`
`
[Drawing Sheets 1 to 20 of U.S. Patent 5,956,518 (Sep. 21,
1999) appear here; the drawings did not survive text
extraction. Figure labels still legible in the residue:
Sheet 4, FIG. 6; Sheet 5, FIG. 8; Sheet 6, FIG. 9;
Sheet 7, FIG. 11; Sheet 8, FIG. 12; Sheet 9, FIGS. 14-17;
Sheet 12, FIG. 22; Sheet 13, FIG. 23; Sheet 14, FIGS. 25-26;
Sheet 16, FIGS. 29-30; Sheet 17, FIGS. 31-32;
Sheet 18, FIGS. 33 and 36; Sheet 20, FIG. 38.]
`
`
`
`INTERMEDIATE-GRAIN
`RECONFIGURABLE PROCESSING DEVICE
`
`GOVERNMENT FUNDING
`
This invention was made with Government support under
the Advanced Research and Projects Agency (ARPA) of the
Department of Defense under Rome Labs Contract No.
F30602-94-C-0252. The Government has certain rights in
the invention.
`
BACKGROUND OF THE INVENTION
`
`Continuing advances in semiconductor technology have
`greatly increased the amount of processing that can be
`performed by single-chip, general-purpose computing
`devices. The relatively slow increase in inter-chip commu-
`nication bandwidth requires that modern high performance
`devices use as much of the potential on-chip processing
`power as possible. This results in large, dense integrated
`circuit devices and a large design space of processing
`architectures.
`
`One way of viewing this design space is in terms of
`granularity. Designers have the option of building very large
`processing units, or many smaller ones, in the same space.
`Traditional architectures are either very coarse grain, such as
`microprocessors, or very fine grain, such as field program-
`mable gate arrays (FPGAs). Both architectures have advan-
`tages and disadvantages.
Microprocessors incorporate very few large processing
units that operate on wide data-words, and each unit is
hardwired to perform defined instructions on these
data-words. Usually each unit is optimized for a different
set of instructions, such as integer and floating point, and
the units are generally hardwired to operate in parallel. The
hardwired nature of these units allows very rapid
instructions. In fact, a great deal of area on modern
microprocessor chips is dedicated to cache memories in order
to support a very high rate of instruction issue. Thus, the
devices efficiently handle very dynamic instruction streams.
`Very fine grain devices, such as FPGAs, incorporate a
`large number of very small processing elements. These
`elements are arranged in a configurable interconnect net-
`work. The configuration data used to define the functionality
`of the processing units and network can be thought of as a
`very large, semantically powerful, instruction word. Nearly
`any operation can be described and mapped to hardware.
`
SUMMARY OF THE INVENTION
`
`Unfortunately, because microprocessors are highly opti-
`mized for simple, wide-word, dynamic instructions, they are
`relatively inefficient when performing other kinds of opera-
`tions. For example, many cycles are required to build up
`complex operations that are not part of the processor’s
`pre-selected instruction set. Also, when performing short-
`word operations, much of the processing unit is not being
`used, and when the instructions being issued are very
`regular, the large instruction caches are unnecessary. Thus,
`very coarse-grain microprocessors are not equipped to take
`the maximum advantage of these cases.
The size of the “instruction word” creates a number of
problems with fine-grain FPGA devices, however. Reloading
new instructions takes a relatively long time, making
dynamic instruction streams very difficult for these devices.
Moreover, if the operation being performed is, in fact, a
wide-word operation, a great deal of this “instruction word”
must be dedicated to re-describing the operation for each of
the small processing elements. Thus, fine-grain processing
elements are not well equipped to take advantage of a large
number of common computing operations.
The present invention utilizes a large number of
intermediate-grain processing elements which are arranged
in a configurable mesh. Thus, the regularity and rapid
instruction issue features of coarse-grain units are exploited,
but a reconfigurable or programmable interconnect allows
these units to be connected in an application-specific
manner. This means that coarse-grain resources, such as
memory and processing, can be deployed in a way that takes
advantage of the opportunities for optimization present in
any given problem. In addition, configuration memories may
be deployed to take advantage of application-specific
redundancy.
In general, according to one aspect, the invention features
a programmable integrated circuit that comprises logic
units that perform operations on data in response to
instructions and memories that store and retrieve addressed
data. A configurable or programmable interconnect provides a
mode of signal transmission between the logic units and
memories. Configuration control data defines data paths
through the interconnect, which can be address inputs to
memories, data inputs to memories and logic units, and
instruction inputs to logic units. Thus, the interconnect is
configurable to define an interdependent functionality of the
functional units. A programmable configuration storage stores
the configuration control data.
Thus the present invention may be configured to operate
according to a number of traditionally distinct computing
architectures. For example, a centrally located functional
unit may be assigned the role of arithmetic logic unit (ALU)
with memories of surrounding functional units being
configured to act as instruction caches, register files, and
program counters. Wider data paths are accommodated by tying
near-neighbor ALUs to each other. Wider instructions are
achieved by configuring instruction memories of separate
functional units as if they were a single memory. For a
different problem, the same integrated circuit may be
reconfigured to emulate a single instruction multiple data
(SIMD) architecture. The logic units of rows of functional
units are tied together to create wider data paths, and the
rows perform separate serial tasks.
In specific embodiments, functional units may provide at
least part of the instructions to logic units of other
functional units. Also, the configuration storage may hold
multiple contexts of configuration control data for
reconfiguration of the programmable interconnect.
In other embodiments, the interconnect may support three
different modes of operation: a static value mode, in which
a value set by the configuration data is provided to a
functional unit; a static source mode, in which another
functional unit serves as the value source; and a dynamic
source mode, in which the source is determined by the value
from another functional unit.
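
The three source modes can be pictured with a small behavioral
sketch. It is only an illustration under stated assumptions: the
names SourceMode, PortConfig and resolve_port, the layout of the
configuration word, and the modulo used to keep the dynamic
selector in range are all hypothetical and do not come from the
patent.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class SourceMode(Enum):
    STATIC_VALUE = "static_value"      # constant taken from configuration data
    STATIC_SOURCE = "static_source"    # fixed choice of another unit's output
    DYNAMIC_SOURCE = "dynamic_source"  # choice made at run time by another unit


@dataclass
class PortConfig:
    """Hypothetical per-port configuration word (field names are assumptions)."""
    mode: SourceMode
    value: int = 0  # constant, source index, or index of the selecting unit


def resolve_port(cfg: PortConfig, unit_outputs: list[int]) -> int:
    """Return the 8-bit value delivered to a port for one cycle."""
    if cfg.mode is SourceMode.STATIC_VALUE:
        return cfg.value & 0xFF
    if cfg.mode is SourceMode.STATIC_SOURCE:
        return unit_outputs[cfg.value]
    # Dynamic source: the output of unit `cfg.value` names the unit to read.
    # The modulo is an assumption that keeps the selector in range.
    selector = unit_outputs[cfg.value] % len(unit_outputs)
    return unit_outputs[selector]


outputs = [0x2A, 0x02, 0x07, 0x10]  # current outputs of four functional units
constant = resolve_port(PortConfig(SourceMode.STATIC_VALUE, 0x55), outputs)
fixed = resolve_port(PortConfig(SourceMode.STATIC_SOURCE, 3), outputs)
chosen = resolve_port(PortConfig(SourceMode.DYNAMIC_SOURCE, 1), outputs)
```

Here the dynamic port reads unit 1, whose output (2) selects
unit 2 as the actual source for that cycle.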
`
In still other embodiments, each logic unit can also have
programmable logic arrays on data paths between functional
units which perform bit level logic operations. Additionally,
reduction logic can be added that performs logic operations
on the output of the logic units and passes a result to other
functional units as control information. Network drivers are
assigned to each unit to transmit received signals to other
functional units. The sources of the signals received by the
drivers may also be dynamic so that the sources are
programmable by other functional units.
In general, according to another aspect, the invention
features an integrated reconfigurable computing device,
which has functional units of multi-bit arithmetic logic units
and memories. A configurable interconnect that connects the
units includes function ports which determine the source of
the instructions to the logic units. Network ports of the units
are configurable by the functional units and determine the
source of addresses to the memories and the source of data
to the logic units and memories.
In general, according to still another aspect, the invention
can also be characterized in the context of a method for
organizing signal transmission within an array of functional
units. Data read from the memories of functional units may
be transmitted as instructions to the logic units of other
functional units. Also, data read from logic units may be
transmitted as addresses for the memories of other
functional units. Finally, the data read from functional units
can also be used as data inputs for the logic units of other
functional units.
`
In specific embodiments, the paths of the data and
instructions are dynamic in response to control from the
functional units. More specifically, static values, values from
other functional units, and values from sources may be
transmitted between functional units.
`
The above and other features of the invention including
various novel details of construction and combinations of
parts, and other advantages, will now be more particularly
described with reference to the accompanying drawings and
pointed out in the claims. It will be understood that the
particular method and device embodying the invention are
shown by way of illustration and not as a limitation of the
invention. The principles and features of this invention may
be employed in various and numerous embodiments without
departing from the scope of the invention.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`In the accompanying drawings, reference characters refer
`to the same parts throughout the different views. The draw-
`ings are not necessarily to scale; emphasis has instead been
`placed upon illustrating the principles of the invention. Of
`the drawings:
`FIG. 1 shows a programmable integrated processing
`device of the present invention, which has been configured
`as an 8-bit microprocessor;
`FIG. 2 shows a SIMD processor configuration for the
`processing device according to the invention;
`FIG. 3 shows a 32-bit processor configuration for the
`processing device according to the invention;
`FIG. 4 shows a very long instruction word (VLIW)
`processor configuration for the processing device according
`to the invention;
FIG. 5 shows a multiple instruction multiple data (MIMD)
processor configuration for the processing device according
to the invention;
`FIG. 6 is a block diagram showing the architecture of a
`basic functional unit (BFU) core of the present invention;
`FIG. 7 is a block diagram showing the inter-BFU con-
`nectivity provided by the level-1 network connections;
`FIG. 8 is a block diagram showing the BFU interconnec-
`tion provided by the level-2 network connections;
`FIG. 9 is a block diagram showing the network switch
`architecture for a BFU of the present invention;
`FIG. 10 is a block diagram illustrating the function switch
`architecture of the present invention;
FIG. 11 is a block diagram showing the address/data and
network switch architecture of the present invention;
FIG. 12 is a block diagram illustrating the floating port
architecture of the present invention;
FIG. 13 is a block diagram showing the level-1 network
drivers of the present invention;
FIG. 14 shows the level-2 drivers of the present invention;
FIG. 15 shows the level-3 drivers of the present invention;
FIG. 16 shows BFU input registers of the present invention;
FIG. 17 shows the reduction logic in the BFU control
architecture of the present invention;
`FIG. 18 is an example of multi-BFU reduction performed
`by the reduction logic of the present invention;
`FIG. 19 is a block diagram illustrating the operation of the
`distributed programmable logic array (PLA) associated with
`each BFU according to the invention;
`FIG. 20 is a block diagram showing the control logic for
`a single BFU;
FIG. 21 shows an alternative embodiment of the
configuration memory supporting multiple contexts;
FIG. 22 is a block diagram of the configurable logic
device of the present invention in the form of an integrated
chip;
FIG. 23 is a block diagram showing the input/output port
architecture for the chip of the present invention;
FIG. 24 is a block diagram showing the structure of an I/O
register according to the invention;
FIG. 25 is a block diagram of a programmable logic array
for customizing the chip’s interface;
FIG. 26 is a block diagram showing the movement of data
from the BFU core off-chip according to the invention;
FIG. 27 is a block diagram of a selector switch that
chooses the core outputs to be driven on an output wire
according to the invention;
FIG. 28 is a block diagram showing a tri-state buffer used
in the selector switch of FIG. 27;
`FIG. 29 is a block diagram illustrating how data enters the
`BFU core from off-chip;
`FIG. 30 is a block diagram showing the selector switch
`that selects among incoming data bytes from I/O ports and
`PLAs according to the invention;
`FIG. 31 is a block diagram of a C/R input architecture
`according to the invention;
`FIG. 32 is a block diagram showing the construction of
`the controller switches of the level-3 network lines accord-
`ing to the invention;
`FIG. 33 is a block diagram illustrating the dynamic
`control of the controller switches, which is shared between
`pairs of controllers at each column, according to the inven-
`tion;
`FIG. 34 shows the architecture of one of the dynamic
`control switches according to the invention;
`FIG. 35 is a block diagram showing the connectivity of
`BFUs in a systolic-type configuration according to the
`invention;
`FIG. 36 shows the configuration of the BFUs for a
`microcoded-type implementation for the convolution prob-
`lem according to the invention;
FIG. 37 shows the organization of the BFUs for a VLIW,
horizontal microcode-type implementation according to the
invention; and
FIG. 38 shows the organization of the BFUs for a VLIW/
MSIMD-type implementation according to the invention.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
FIG. 1 shows a multi-bit microprocessor configuration of
a reconfigurable processing device, which has been
constructed and programmed according to the principles of the
present invention. A two-dimensional array of basic
functional units (BFUs) 100 is located in a programmable
interconnect 101. Five of the BFUs 100 and the portion of the
reconfigurable interconnect connecting the BFUs have been
configured to operate as a microprocessor 102.
Each of the BFUs 100 preferably has addressable memory
resources and logic resources, such as an 8-bit arithmetic
logic unit (ALU). One of the BFUs 100, denoted ALU,
utilizes its logic resources to perform the logic operations of
the microprocessor 102 and utilizes its memory resources as
a data store and/or extended register file. Another BFU
operates as a function store F that controls the successive
logic operations performed by the logic resources of the
ALU. Two additional BFUs, A and B, operate as further
instruction stores that control the addressing of the memory
resources of the ALU. A final BFU, PC, operates as a
program counter for the various instruction BFUs F, A, B.
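
The division of roles among the five BFUs can be illustrated
with a short behavioral sketch. It is not the patented
implementation: the opcode table OPS, the wrap-around program
counter, and the choice to return the ALU output each cycle
rather than model a write-back are assumptions made only for
illustration.

```python
# Hypothetical opcode encoding; the source does not define one.
OPS = {
    0: lambda a, b: (a + b) & 0xFF,  # 8-bit add
    1: lambda a, b: (a - b) & 0xFF,  # 8-bit subtract
    2: lambda a, b: a & b,           # bitwise AND
    3: lambda a, b: a | b,           # bitwise OR
}


def run_microprocessor(f_store, a_store, b_store, data_memory, cycles):
    """Step a five-BFU configuration like that of FIG. 1.

    f_store, a_store and b_store model the memories of BFUs F, A and B;
    data_memory models the memory of the ALU BFU. Returns the ALU
    output produced in each cycle.
    """
    results = []
    pc = 0  # BFU "PC": one counter addresses all three instruction stores
    for _ in range(cycles):
        opcode = f_store[pc]   # F selects the ALU operation
        addr_a = a_store[pc]   # A addresses the first operand
        addr_b = b_store[pc]   # B addresses the second operand
        operand_a = data_memory[addr_a]
        operand_b = data_memory[addr_b]
        results.append(OPS[opcode](operand_a, operand_b))
        pc = (pc + 1) % len(f_store)  # assumption: simple wrap-around counter
    return results


data = [10, 3, 0xF0, 0x0F]
trace = run_microprocessor(
    f_store=[0, 1, 2, 3],
    a_store=[0, 0, 2, 2],
    b_store=[1, 1, 3, 3],
    data_memory=data,
    cycles=4,
)
```

Each cycle, the same program counter value indexes F, A and B,
so the three stores together behave like one instruction word.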
As shown in FIG. 2, the same reconfigurable processing
array, however, may be reprogrammed to function as a
SIMD system, and as described below, this reconfiguration
can occur on a cycle-by-cycle basis. The functions of the
program counter PC and instruction stores A, B and F have
been again assigned to different BFUs 100, but the ALU
function has been replicated into 12 BFUs. Each of the
ALUs is connected via the reconfigurable interconnect 101
to operate on globally broadcast instructions from the
instruction stores A, B, F. These same operations are
performed by each of these ALUs, or common instructions may
be broadcast on a row-by-row basis.
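
A minimal sketch of the global broadcast case follows, assuming
a hypothetical two-entry opcode table and modeling each ALU
BFU's local memory as a Python list. The function name simd_step
and the operand layout are illustrative only.

```python
# Hypothetical opcode table; the source does not define an encoding.
OPS = {
    0: lambda a, b: (a + b) & 0xFF,  # 8-bit add
    1: lambda a, b: a ^ b,           # bitwise XOR
}


def simd_step(opcode, addr_a, addr_b, alu_memories):
    """Apply one broadcast (F, A, B) instruction in every ALU BFU.

    Each entry of alu_memories models the local memory of one ALU BFU,
    so the same operation runs on different data in each unit.
    """
    operation = OPS[opcode]
    return [operation(mem[addr_a], mem[addr_b]) for mem in alu_memories]


# Twelve ALU BFUs, as in FIG. 2, each holding its own pair of operands.
memories = [[i, 2 * i] for i in range(12)]
sums = simd_step(opcode=0, addr_a=0, addr_b=1, alu_memories=memories)
```

Row-by-row broadcast would call the same step once per row with
a different instruction for each row.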
FIG. 3 shows how wider data paths can be constructed in
the programmable device. This 32-bit microprocessor
configured device has the same instruction stores A, B, F and
program counter as described in connection with FIG. 1.
Four BFUs, however, have been assigned an ALU operation,
and the ALUs are chained together to act as a single 32-bit
wide microprocessor in which the interconnect 101 supports
carry-in and carry-out operations between the ALUs.
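
The carry chaining can be checked with a few lines of
arithmetic. The sketch below models each 8-bit ALU as a function
with carry-in and carry-out and ripples the carry through four
of them, least significant byte first; the function names and
the byte ordering are assumptions for illustration.

```python
def alu8_add(a, b, carry_in):
    """One 8-bit ALU: return (sum byte, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add32(x, y):
    """Add two 32-bit values using four chained 8-bit ALUs."""
    result = 0
    carry = 0
    for i in range(4):  # byte 0 is the least significant ALU in the chain
        a = (x >> (8 * i)) & 0xFF
        b = (y >> (8 * i)) & 0xFF
        byte_sum, carry = alu8_add(a, b, carry)  # carry-out feeds next carry-in
        result |= byte_sum << (8 * i)
    return result, carry


ripple = add32(0x0001FFFF, 1)    # carry ripples through the two low bytes
overflow = add32(0xFFFFFFFF, 1)  # carry leaves the most significant ALU
```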
FIG. 4 shows how the device can be configured to operate
as a very long instruction word (VLIW) system. The various
instruction stores A, B, F are defined to encompass multiple
BFUs 100 to accommodate the desired instruction word
width.
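
One way to picture an instruction store that spans several BFUs
is as a byte-wise concatenation of their memories at the same
address. The sketch below assumes this, along with a
most-significant-first ordering and the name
fetch_wide_instruction; the patent text here does not specify
either.

```python
def fetch_wide_instruction(stores, pc):
    """Concatenate the byte at `pc` from each 8-bit store into one word.

    The first store supplies the most significant byte (an assumption).
    """
    word = 0
    for store in stores:
        word = (word << 8) | (store[pc] & 0xFF)
    return word


# Three BFU memories acting together as one 24-bit-wide instruction store.
stores = [[0x12, 0xAA], [0x34, 0xBB], [0x56, 0xCC]]
first = fetch_wide_instruction(stores, 0)
second = fetch_wide_instruction(stores, 1)
```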
`
FIG. 5 shows the configuration of the present system to
operate as a multiple instruction multiple data (MIMD)
system. The 8-bit microprocessor configuration 102 of FIG.
1 is replicated into an adjacent set of BFUs to accommodate
multiple, independent processing units within the same
device. Of course, wider data paths could also be
accommodated by chaining ALUs of each processor 102 to each
other.
`1. Basic Functional Unit Architecture
`
FIG. 6 shows the moderately coarse grain, preferably
8-bit, BFU core. Primarily, the BFU core has memory block
110, basic ALU core 120, and configuration memory 105.
The main memory block 110 is a 256 word × 8 bit wide
memory, which is arranged to be used in either single or dual
port modes. In dual port mode, the memory size is reduced
to 128 words in order to be able to perform the two
simultaneous read operations without increasing the read
latency of the memory. The memory mode is controlled by
control logic 114 accessed through a Memory/Mux function
port 112, and the write enable can be controlled either
through the memory/mux function port 112 or by the control
logic 134 accessed through ALU function port 132. Control
logic is hardwired and also controls the ALU functions.
`In single port mode, the memory 110 uses the A_ADR
`port for an address and outputs the selected value to both
`A_PORT and B_PORT. In dual port mode, the A_ADR
`port selects a value for A_PORT only, and B_ADR port
`selects a value for the B_PORT.
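
The two read modes can be sketched as a small memory model. The
class name BfuMemory, the address masking, and the assumption
that both ports see the same 128 words in dual port mode are
illustrative choices; the text above states only the sizes and
which address port drives which output port.

```python
class BfuMemory:
    """Behavioral model of memory block 110 (names are hypothetical)."""

    def __init__(self, dual_port=False):
        self.dual_port = dual_port
        self.words = [0] * 256  # 256 words of 8 bits

    def _mask(self):
        # Dual port mode exposes 128 words, single port mode all 256.
        return 0x7F if self.dual_port else 0xFF

    def write(self, adr, value):
        self.words[adr & self._mask()] = value & 0xFF

    def read(self, a_adr, b_adr=0):
        """Return the pair (A_PORT, B_PORT) for one read."""
        mask = self._mask()
        if not self.dual_port:
            # Single port: A_ADR selects one word, seen on both ports.
            value = self.words[a_adr & mask]
            return value, value
        # Dual port: A_ADR feeds A_PORT and B_ADR feeds B_PORT.
        return self.words[a_adr & mask], self.words[b_adr & mask]


single = BfuMemory(dual_port=False)
single.write(200, 0xAB)
single_read = single.read(200)

dual = BfuMemory(dual_port=True)
dual.write(5, 1)
dual.write(6, 2)
dual_read = dual.read(5, 6)
```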
`In either mode the read operation takes place during the
`first