US006339819B1
`
(12) United States Patent
Huppenthal et al.

(10) Patent No.: US 6,339,819 B1
(45) Date of Patent: Jan. 15, 2002

(54) MULTIPROCESSOR WITH EACH PROCESSOR ELEMENT ACCESSING
OPERANDS IN LOADED INPUT BUFFER AND FORWARDING RESULTS TO
FIFO OUTPUT BUFFER

(75) Inventors: Jon M. Huppenthal, Paul A. Leskar, both of
Colorado Springs, CO (US)

(73) Assignee: SRC Computers, Inc., Colorado Springs, CO (US)

( * ) Notice: Subject to any disclaimer, the term of this patent
is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/563,551
(22) Filed: May 3, 2000

Related U.S. Application Data

(63) Continuation-in-part of application No. 09/481,902, filed on
Jan. 12, 2000, now Pat. No. 6,247,110, which is a continuation of
application No. 08/992,763, filed on Dec. 17, 1997, now Pat.
No. 6,076,152.

(51) Int. Cl.7: G06F 15/16
(52) U.S. Cl.: 712/16; 326/39; 326/41; 712/37
(58) Field of Search: 326/39, 41; 712/16, 37
Primary Examiner: Kenneth S. Kim
(74) Attorney, Agent, or Firm: William J. Kubida; Kent A.
Lembke; Hogan & Hartson LLP
(57) ABSTRACT

An enhanced memory algorithmic processor ("MAP") architecture
for multiprocessor computer systems comprises an assembly that
may comprise, for example, field programmable gate arrays
("FPGAs") functioning as the memory algorithmic processors. The
MAP elements may further include an operand storage, intelligent
address generation, on board function libraries, result storage
and multiple input/output ("I/O") ports. The MAP elements are
intended to augment, not necessarily replace, the high
performance microprocessors in the system and, in a particular
embodiment of the present invention, they may be connected
through the memory subsystem of the computer system resulting in
it being very tightly coupled to the system as well as being
globally accessible from any processor in a multiprocessor
computer system.

47 Claims, 11 Drawing Sheets
`
[Front-page drawing omitted. Legible labels: chain port from
previous MAP, memory controller (FPGA), switch processor,
processor assembly, read trunk.]
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2084, p. 1
`
`
`
`
[Drawing sheet 1 of 11, Fig. 1. Legible labels: Memory Subsystem
Bank 0, Bank 1 through Bank M (16); Memory Interconnect Fabric
(14); Processor 0 through Processor N (12); MAP 0, MAP 1 and
following (112).]
`
`
`
[Drawing sheets 2 through 11 of 11 omitted: the drawing text is
not recoverable from the scan. The only legible fragments are on
sheet 9, which is labeled Fig. 9 and shows Input Data and Chain
Input buses.]

MULTIPROCESSOR WITH EACH
PROCESSOR ELEMENT ACCESSING
OPERANDS IN LOADED INPUT BUFFER
AND FORWARDING RESULTS TO FIFO
OUTPUT BUFFER

CROSS REFERENCE TO RELATED PATENT
APPLICATIONS
`
The present invention is a continuation-in-part application of
U.S. patent application Ser. No. 09/481,902 filed Jan. 12, 2000,
now U.S. Pat. No. 6,247,110, which is a continuation of U.S.
patent application Ser. No. 08/992,763 filed Dec. 17, 1997, now
U.S. Pat. No. 6,076,152, for: "Multiprocessor Computer
Architecture Incorporating a Plurality of Memory Algorithm
Processors in the Memory Subsystem", assigned to SRC Computers,
Inc., Colorado Springs, Colo., assignee of the present invention,
the disclosures of which are herein specifically incorporated by
this reference.

BACKGROUND OF THE INVENTION

The present invention relates, in general, to the field of
computer architectures incorporating multiple processing
elements. More particularly, the present invention relates to a
multiprocessor computer architecture incorporating a number of
memory algorithmic processors ("MAP") in the memory subsystem or
closely coupled to the processing elements to significantly
enhance overall system processing speed.

As commodity microprocessors increase in capability there is an
ever increasing push to use them in high performance
multiprocessor systems capable of performing trillions of
calculations per second at significantly lower cost than those
made from custom counterparts. However, many of these processors
lack specific features common to systems in this category that
employ much more expensive custom processors. One such feature is
the ability to perform vector processing.

In this form of processing, a data register or buffer is filled
with operands forming what is called a vector. All of these
operands are then passed one after the other through a functional
unit capable of performing operations such as multiplication.
This functional unit will output one result every clock cycle.
This type of processing does require that the same operation be
performed on all operands in the input vector and it is,
therefore, widely used in that it exhibits much higher processing
rates than the traditional scalar method of computation used in
most microprocessors.

Nevertheless, neither vector nor scalar processors perform very
well when required to perform bit manipulation as is required,
for example, in matrix arithmetic. One such function is a bit
matrix multiply operation in which two matrices of different
sizes are multiplied together to form a third matrix. Another
shortfall of both vector and scalar processing is their inability
to quickly perform pattern searches such as those used in a
variety of pattern recognition programs.

A solution to all of these deficiencies can be found by building
a high performance computer which contains numbers of commodity
microprocessors to reduce the system cost together with MAP
elements developed by SRC Computers, Inc., assignee of the
present invention, to provide the deficient functions at very low
cost. The MAP architecture and specific features thereof is
disclosed in the aforementioned patent applications, the
disclosures of which are herein specifically incorporated by this
reference.

SUMMARY OF THE INVENTION

The enhanced memory algorithmic processor architecture for
multiprocessor computer systems of the present invention is an
assembly that not only contains, for example, field programmable
gate arrays functioning as the memory algorithmic processors, but
also an operand storage, intelligent address generation, on board
function libraries, result storage and multiple I/O ports. Like
the original MAP architecture disclosed in the aforementioned
patent applications, this architecture differs from other so
called "reconfigurable" computers in many ways.

First, its function is intended to be altered every few seconds
distinguishing itself from other systems with very long
reconfiguration times primarily intended for a single function.
Secondly, it contains dedicated hardware to provide for large
data set operand storage (on the order of 16 Mbytes or more)
allowing the MAP element to function autonomously from its host
system once operands are loaded. Thirdly, it contains dedicated
data ports to allow, but not require, multiple MAP elements to be
chained together to perform very large operations. As currently
contemplated, it is intended that typically 32 to 512 or more MAP
sections can be connected in a single system.

Further, the MAP element is intended to augment, not replace, the
high performance microprocessors in the system. As such, in a
particular embodiment of the present invention, it may be
connected through the memory subsystem of the computer system
resulting in it being very tightly coupled to the system as well
as being globally accessible from any processor in the system.
This technique was developed by SRC Computers, Inc. and
distinguishes the MAP architecture from all other so called
"attached array processor" systems that may exist today. While
such "attached array processor" systems may bear some superficial
similarities to MAP based systems, they are entirely separate
units connected to the host computer through relatively slow
interconnects resulting in lower system performance.

The MAP architecture developed by SRC Computers, Inc. as defined
in the aforementioned patent applications overcomes many of the
limitations of such "attached array processor" systems. Because
of the particular limitations in the exemplary architecture
disclosed therein surrounding the attachment of input storage and
chaining capabilities, certain vector processing functions may
not have been optimally implemented unlike relatively smaller
algorithms.

Through the addition of these and other features to the MAP
architecture, a much more powerful multiprocessor computer system
is provided. Moreover, while, as originally disclosed, another
feature of the MAP architecture was its ability to perform direct
memory access ("DMA") into the common memory of the system,
enhancements disclosed herein have expanded the potential
utilization of this feature.

Particularly disclosed herein is a Memory Algorithmic Processor
("MAP") assembly (or element) comprising reconfigurable field
programmable gate array ("FPGA") circuitry, an intelligent
address generator, input data buffers, output first-in, first-out
("FIFO") devices and ports to allow connection to a memory array
and chaining of multiple MAP assemblies for the purpose of
augmenting the capability of a microprocessor in a high
performance computer.

Further disclosed herein is a MAP assembly comprising an
intelligent address generator capable of supporting a data gather
function from its associated input buffer or common memory. The
MAP assembly may also comprise circuitry to allow the
reconfigurable elements to reprogram their on-board configuration
read only memory ("ROM") devices to cause alterations in the
functionality of the reconfigurable circuitry.
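
As an illustration of the data gather function mentioned above, the
following Python sketch models an address generator that produces a
strided address sequence and collects the addressed operands from an
input buffer into a dense list. The function names, the strided
pattern and the buffer contents are illustrative assumptions; the
disclosure does not specify which address patterns the generator
supports.

```python
from typing import List, Sequence


def strided_addresses(base: int, stride: int, count: int) -> List[int]:
    """Hypothetical address pattern: base, base+stride, base+2*stride, ..."""
    return [base + k * stride for k in range(count)]


def gather(buffer: Sequence[int], addresses: Sequence[int]) -> List[int]:
    """Read the operands at the given addresses into a dense list."""
    return [buffer[a] for a in addresses]


# Assumed input buffer holding 16 operands with values 100..115.
input_buffer = list(range(100, 116))
addresses = strided_addresses(base=1, stride=4, count=4)
operands = gather(input_buffer, addresses)
print(addresses)  # [1, 5, 9, 13]
print(operands)   # [101, 105, 109, 113]
```

The point of the sketch is only that the functional unit sees a
contiguous operand stream even though the source locations are
scattered.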
`
Still further disclosed herein is a MAP assembly comprising
dedicated input and output ports for the purpose of allowing an
infinite number of MAP elements to be chained together to
accomplish a single function. The MAP assembly may also
incorporate provisions to create a single MAP chain or multiple
independent MAP chains automatically based on the contents of the
reconfigurable circuitry.
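
The chaining described above can be pictured, under simplifying
assumptions, as a pipeline in which each element consumes operands
from its chain input and forwards results on its chain output. In
this Python sketch `map_element` and `build_chain` are hypothetical
names, and each element's configured function is reduced to a plain
callable; it is not a model of the actual port hardware.

```python
from typing import Callable, Iterable, Iterator, List


def map_element(func: Callable[[int], int],
                chain_in: Iterable[int]) -> Iterator[int]:
    """Model one MAP element: apply its configured function to each
    operand arriving on the chain input and forward the result."""
    for operand in chain_in:
        yield func(operand)


def build_chain(functions: List[Callable[[int], int]],
                source: Iterable[int]) -> Iterator[int]:
    """Connect elements so each chain output feeds the next chain input."""
    stream: Iterator[int] = iter(source)
    for func in functions:
        stream = map_element(func, stream)
    return stream


# Two chained elements: the first adds 1, the second doubles.
results = list(build_chain([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
print(results)  # [4, 6, 8]
```

Calling `build_chain` twice on separate operand sources would
correspond to forming multiple independent chains.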

Further disclosed herein is a MAP assembly comprising output
FIFOs for the purpose of holding output data and allowing the MAP
element to not stall in the event the processor reading these
results is delayed due to outside factors such as workload or
crossbar switch conflicts. The MAP assembly may further comprise
relatively large dedicated input storage buffers to allow for
optimization of operand transfer as well as allow multiple
accesses to an operand without requiring external processor
intervention.
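
To make the role of the output FIFO concrete, the Python sketch
below simulates a producer that emits one result per cycle and a
reader that is unavailable on some cycles. The cycle model, the
function name `simulate` and the chosen depths are assumptions made
for illustration only; they are not taken from the disclosure.

```python
from collections import deque
from typing import List, Sequence, Set, Tuple


def simulate(results: Sequence[int], fifo_depth: int,
             reader_busy: Set[int]) -> Tuple[int, List[int]]:
    """Return (producer stall cycles, values read in order)."""
    if fifo_depth < 1:
        raise ValueError("fifo_depth must be at least 1")
    fifo: deque = deque()
    pending = deque(results)
    read: List[int] = []
    stalls = 0
    cycle = 0
    while len(read) < len(results):
        # Reader side: drain one entry per cycle unless delayed.
        if cycle not in reader_busy and fifo:
            read.append(fifo.popleft())
        # Producer side: push one result per cycle, stalling only
        # when the FIFO is full.
        if pending:
            if len(fifo) < fifo_depth:
                fifo.append(pending.popleft())
            else:
                stalls += 1
        cycle += 1
    return stalls, read


data = list(range(8))
print(simulate(data, fifo_depth=4, reader_busy={1, 2, 3})[0])  # 0
print(simulate(data, fifo_depth=2, reader_busy={1, 2, 3})[0])  # 2
```

With a depth of 4 the three-cycle reader delay is absorbed without
stalling the producer; with a depth of 2 the producer loses two
cycles. The results arrive in order in both cases.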

Still further disclosed herein is a MAP assembly comprising a
dedicated port for connection to an input buffer so that the MAP
element can simultaneously receive operands via the chained input
(chain) port and the input buffer. This allows the MAP element to
perform mathematical processing at the maximum possible rate
while also allowing the MAP element to accept operands via the
chain port while accessing reference data in the input buffer
(such as reciprocal look up tables) to allow the MAP element to
perform operations such as division at the fastest possible rate.
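
The reciprocal look up table idea can be sketched as follows:
division by `d` is replaced by a table lookup of a precomputed
reciprocal followed by a multiply and a shift. The fixed-point
format (`FRAC_BITS = 16`), the table size and the function names are
assumptions for this example; the disclosure does not give the table
format, and the quotient produced this way is approximate.

```python
FRAC_BITS = 16  # assumed fixed-point precision of each table entry


def build_reciprocal_table(max_divisor: int) -> list:
    """Entry d holds round(2**FRAC_BITS / d); entry 0 is an unused
    placeholder."""
    return [0] + [round((1 << FRAC_BITS) / d)
                  for d in range(1, max_divisor + 1)]


def divide_with_table(operand: int, divisor: int, table: list) -> int:
    """Approximate operand // divisor with one lookup, one multiply
    and one shift."""
    return (operand * table[divisor]) >> FRAC_BITS


table = build_reciprocal_table(64)
print(divide_with_table(100, 4, table))   # 25
print(divide_with_table(1000, 8, table))  # 125
```

In this sketch the operand would arrive over the chain port while
the table sits in the input buffer, so both can be read in the same
cycle. For divisors that are not powers of two the rounded table
entry can make the result differ from the exact integer quotient by
one.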

Also further disclosed herein is a MAP assembly which may
comprise connections to the memory subsystem of a high
performance computer for the purpose of providing global access
to it from all processors in a multiprocessor high performance
computer system. The MAP assembly incorporates the capability to
update multiple on board function ROMs under program control
while in the system and may also include connections to the
memory subsystem of a high performance computer utilizing DMA to
accept commands from a microprocessor.

BRIEF DESCRIPTION OF THE DRAWINGS
`
The aforementioned and other features and objects of the present
invention and the manner of attaining them will become more
apparent and the invention itself will be best understood by
reference to the following description of a preferred embodiment
taken in conjunction with the accompanying drawings, wherein:

FIG. 1 is a simplified, high level, functional block diagram of a
multiprocessor computer architecture employing memory algorithmic
processors ("MAP") in accordance with the disclosure of the
aforementioned patent applications in an alternative embodiment
wherein direct memory access ("DMA") techniques may be utilized
to send commands to the MAP elements in addition to data;

FIG. 2 is a simplified logical block diagram of a possible
computer application program decomposition sequence for use in
conjunction with a multiprocessor computer architecture utilizing
a number of MAP elements located, for example, in the computer
system memory space, in accordance with a particular embodiment
of the present invention;

FIG. 3 is a more detailed functional block diagram of an
exemplary individual one of the MAP elements of the preceding
figures and illustrating the bank control logic, memory array and
MAP assembly thereof;

FIG. 4 is a more detailed functional block diagram of the control
block of the MAP assembly of the preceding illustration
illustrating its interconnection to the user FPGA thereof in a
particular embodiment;

FIG. 5 is a functional block diagram of an alternative embodiment
of the present invention wherein individual MAP elements are
closely associated with individual processor boards and each of
the MAP elements comprises independent chain ports for coupling
the MAP elements directly to each other;

FIG. 6 is a functional block diagram of an individual MAP element
wherein each comprises on board memory and a control block
providing common memory DMA capabilities;

FIG. 7 is an additional functional block diagram of an individual
MAP element illustrating the on board memory function as an input
buffer and output FIFO portions thereof;

FIG. 8 is a more detailed functional block diagram of an
individual MAP element as illustrated in FIGS. 6 and 7;

FIG. 9 is a user array interconnect diagram illustrating, for
example, four user FPGAs interconnected through horizontal,
vertical and diagonal buses to allow for expansion in designs
that exceed the capacity of a single FPGA;

FIG. 10 is a functional block diagram of another alternative
embodiment of the present invention wherein individual MAP
elements are closely associated with individual memory arrays and
each of the MAP elements comprises independent chain ports for
coupling the MAP elements directly to each other; and

FIGS. 11A and 11B are timing diagrams respectively illustrating
input and output timing in relationship to the system clock
("Sysclk") signal.

`DESCRIPTION OF A PREFERRED
`EMBODIMENT
`
With reference now to FIG. 1, a multiprocessor computer 10
architecture in accordance with one embodiment of the present
invention is shown. The multiprocessor computer 10 incorporates N
processors 12_0 through 12_N which are bi-directionally coupled
to a memory interconnect fabric 14. The memory interconnect
fabric 14 is then also coupled to M memory banks comprising
memory bank subsystems 16_0 (Bank 0) through 16_M (Bank M). N
number of memory algorithmic processors ("MAP") 112_0 through
112_N are also coupled to the memory interconnect fabric 14 as
will be more fully described hereinafter.

With reference now to FIG. 2, a representative application
program decomposition for a multiprocessor computer architecture
100 incorporating a plurality of memory algorithm processors in
accordance with the present invention is shown. The computer
architecture 100 is operative in response to user instructions
and data which, in a coarse grained portion of the decomposition,
are selectively directed to one of (for purposes of example only)
four parallel regions 102_1 through 102_4 inclusive. The
instructions and data output from each of the parallel regions
102_1 through 102_4 are respectively input to parallel regions
segregated into data areas 104_1 through 104_4 and instruction
areas 106_1 through 106_4. Data maintained in the data areas
104_1 through 104_4 and instructions maintained in the
instruction areas 106_1 through 106_4 are then supplied to, for
example, corresponding pairs of processors 108_1, 108_2 (P1 and
P2); 108_3, 108_4 (P3 and P4); 108_5, 108_6 (P5 and P6); and
108_7, 108_8 (P7 and P8) as shown. At this point, the medium
grained decomposition of the instructions and data has been
accomplished.

A fine grained decomposition, or parallelism, is effectuated by a
further algorithmic decomposition wherein the output of each of
the processors 108_1 through 108_8 is broken up, for example,
into a number of fundamental algorithms 110_1A, 110_1B, 110_2A,
110_2B through 110_8B as shown. Each of the algorithms is then
supplied to a corresponding one of the MAP elements 112_1A,
112_1B, 112_2A, 112_2B through 112_8B, which may be located in
the memory space of the computer architecture 100 for execution
therein as will be more fully described hereinafter.
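
The three decomposition levels of FIG. 2 (four parallel regions,
a pair of processors per region, and two fundamental algorithms per
processor, each assigned to its own MAP element) can be tabulated
with a short Python sketch. The function name `decompose`, the
dictionary layout and the underscore form of the reference numerals
are conventions assumed for this example, and the numerals follow a
best reading of the figure description.

```python
from typing import Dict


def decompose(regions: int = 4, processors_per_region: int = 2,
              algorithm_suffixes: str = "AB") -> Dict[str, dict]:
    """Map each processor to its coarse-grained region and to the MAP
    elements that receive its fine-grained algorithms."""
    plan: Dict[str, dict] = {}
    processor = 0
    for region in range(1, regions + 1):
        for _ in range(processors_per_region):
            processor += 1
            plan[f"P{processor}"] = {
                "region": f"102_{region}",
                "map_elements": [f"112_{processor}{s}"
                                 for s in algorithm_suffixes],
            }
    return plan


plan = decompose()
print(len(plan))   # 8
print(plan["P3"])  # {'region': '102_2', 'map_elements': ['112_3A', '112_3B']}
```

With the example figures this yields eight processors and sixteen
MAP elements.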

With reference additionally now to FIG. 3, an exemplary
implementation of a memory bank 120 in a MAP system computer
architecture 100 of the present invention is shown for a
representative one of the MAP elements 112 illustrated in the
preceding figure. Each memory bank 120 includes a bank control
logic block 122 bi-directionally coupled to the computer system
trunk lines, for example, a 72 line bus 124. The bank control
logic block 122 is coupled to a bi-directional data bus 126 (for
example 256 lines) and supplies addresses on an address bus 128
(for example 17 lines) for accessing data at specified locations
within a memory array 130.
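
As a rough worked figure for the example bus widths above, the
Python sketch below computes how much memory 17 address lines and
256 data lines could reach. It assumes, purely for illustration,
that each address selects one full 256-bit word and that the address
lines are not multiplexed; the disclosure states neither, so the
result is an estimate under those assumptions rather than the size
of memory array 130.

```python
def array_capacity_bytes(address_lines: int, data_lines: int) -> int:
    """Capacity if every address selects one data-bus-wide word
    (assumed; no row/column address multiplexing)."""
    locations = 1 << address_lines        # 2**address_lines words
    bytes_per_location = data_lines // 8  # data lines are bits
    return locations * bytes_per_location


capacity = array_capacity_bytes(address_lines=17, data_lines=256)
print(capacity)                 # 4194304
print(capacity // (1024 ** 2))  # 4
```

Under these assumptions the directly addressable space is 4 Mbytes
per bank.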

The data bus 126 and address bus 128 are also coupled to a MAP
element 112. The MAP element 112 comprises a control block 132
coupled to the address bus 128. The control block 132 is also
bi-directionally coupled to a user field programmable gate array
("FPGA") 134 by means of a number of signal lines 136. The user
FPGA 134 is coupled directly to the data bus 126. In a particular
embodiment, the FPGA 134 may be provided as a Lucent Technologies
OR3T80 device.

The computer architecture 100 comprises a multiprocessor system
employing uniform memory access across common shared memory with
one or more MAP elements 112 which may be located in the memory
subsystem, or memory space. As previously described, each MAP
element 112 contains at least one relatively large FPGA 134 that
is used as a reconfigurable functional unit. In addition, a
control block 132 and a preprogrammed or dynamically programmable
configuration ROM (as will be more fully described hereinafter)
contain the information needed by the reconfigurable MAP element
112 to enable it to perform a specific algorithm. It is also
possible for the user to directly download a new configuration
into the FPGA 134 under program control, although in some
instances this may consume a number of memory accesses and might
result in an overall decrease in system performance if the
algorithm was short-lived.

FPGAs have particular advantages in the application shown for
several reasons. First, commercially available FPGAs now contain
sufficient internal logic cells to perform meaningful
computational functions. Secondly, they can operate at speeds
comparable to microprocessors, which eliminates the need for
speed matching buffers. Still further, the internal programmable
routing resources of FPGAs are now extensive enough that
meaningful algorithms can now be programmed without the need to
reassign the locations of the input/output ("I/O") pins.

By, for example, placing the MAP element 112 in the memory
subsystem or memory space, it can be