(12) Patent Application Publication    (10) Pub. No.: US 2001/0014937 A1
Huppenthal et al.    (43) Pub. Date: Aug. 16, 2001

(54) MULTIPROCESSOR COMPUTER ARCHITECTURE INCORPORATING A PLURALITY OF MEMORY ALGORITHM PROCESSORS IN THE MEMORY SUBSYSTEM

(76) Inventors: Jon M. Huppenthal, Colorado Springs, CO (US); Paul A. Leskar, Colorado Springs, CO (US)

Correspondence Address:
HOGAN & HARTSON LLP
ONE TABOR CENTER, SUITE 1500
1200 SEVENTEENTH ST
DENVER, CO 80202 (US)

(21) Appl. No.: 09/755,744

(22) Filed: Jan. 5, 2001

Related U.S. Application Data

(60) Division of application No. 09/481,902, filed on Jan. 12, 2000, now Pat. No. 6,247,110, which is a continuation of application No. 08/992,763, filed on Dec. 17, 1997, now Pat. No. 6,076,152.

Publication Classification

(51) Int. Cl.7 ................................................ G06F 15/80
(52) U.S. Cl. ........................................................ 712/15

(57) ABSTRACT
A multiprocessor computer architecture incorporating a plurality of programmable hardware memory algorithm processors ("MAP") in the memory subsystem. The MAP may comprise one or more field programmable gate arrays ("FPGAs") which function to perform identified algorithms in conjunction with, and tightly coupled to, a microprocessor, and each MAP is globally accessible by all of the system processors for the purpose of executing user definable algorithms. A circuit within the MAP signals when the last operand has completed its flow, thereby allowing a given process to be interrupted and thereafter restarted. Through the use of read only memory ("ROM") located adjacent the FPGA, a user program may use a single command to select one of several possible pre-loaded algorithms, thereby decreasing system reconfiguration time. A computer system memory structure including the MAP disclosed herein may function in normal or direct memory access ("DMA") modes of operation and, in the latter mode, one device may feed results directly to another, thereby allowing pipelining or parallelizing execution of a user defined algorithm. The system of the present invention also provides a user programmable performance monitoring capability and utilizes parallelizer software to automatically detect parallel regions of user applications containing algorithms that can be executed in the programmable hardware.
`
[Front page figure: decomposition of user instructions and data into parallel regions (coarse grained decomposition), data and instruction areas feeding processors 108 (medium grained decomposition, parallel regions), and fine grained parallelism in which algorithms 110 in the memory space are executed by MAPs 112.]
`
Petitioner Microsoft Corporation - Ex. 1022, Cover
`
`
`
Patent Application Publication    Aug. 16, 2001    Sheet 1 of 4

[FIG. 1: simplified block diagram of a standard multiprocessor computer — processors 0 through N bi-directionally coupled through a memory interconnect fabric to memory subsystem banks 0 through M.]
`
Patent Application Publication    Aug. 16, 2001    Sheet 2 of 4

[FIG. 2: application program decomposition — user instructions and data divided into parallel regions (coarse grained decomposition), data and instruction areas supplied to processor pairs (medium grained decomposition), and fundamental algorithms executed by MAPs in the memory space (fine grained parallelism).]
`
Patent Application Publication    Aug. 16, 2001    Sheet 3 of 4

[FIG. 3: memory bank detail — bank control logic coupled to the system trunk lines, a memory array, and a MAP assembly comprising a control block and user FPGA.]
`
Patent Application Publication    Aug. 16, 2001    Sheet 4 of 4

[FIG. 4: MAP control block detail — command decoder, status registers, pipeline counter, equality comparitor, 256-bit parallel-to-serial converter, configuration ROMs, and a configuration mux supplying the configuration file to the user FPGA.]
`
MULTIPROCESSOR COMPUTER ARCHITECTURE INCORPORATING A PLURALITY OF MEMORY ALGORITHM PROCESSORS IN THE MEMORY SUBSYSTEM

BACKGROUND OF THE INVENTION
`
[0001] The present invention relates, in general, to the field of computer architectures incorporating multiple processing elements. More particularly, the present invention relates to a multiprocessor computer architecture incorporating a number of memory algorithm processors in the memory subsystem to significantly enhance overall system processing speed.

[0002] All general purpose computers are based on circuits that have some form of processing element. These may take the form of microprocessor chips or could be a collection of smaller chips coupled together to form a processor. In any case, these processors are designed to execute programs that are defined by a set of program steps. The fact that these steps, or commands, can be rearranged to create different end results using the same computer hardware is key to the computer's flexibility. Unfortunately, this flexibility dictates that the hardware then be designed to handle a variety of possible functions, which results in generally slower operation than would be the case were it able to be designed to handle only one particular function. On the other hand, a single function computer is inherently not a particularly versatile computer.

[0003] Recently, several groups have begun to experiment with creating a processor out of circuits that are electrically reconfigurable. This would allow the processor to execute a small set of functions more quickly and then be electrically reconfigured to execute a different small set. While this accelerates some program execution speeds, there are many functions that cannot be implemented well in this type of system due to the circuit densities that can be achieved in reconfigurable integrated circuits, such as 64-bit floating point math. In addition, all of these systems are presently intended to contain processors that operate alone. In high performance systems, this is not the case. Hundreds or even tens of thousands of processors are often used to solve a single problem in a timely manner. This introduces numerous issues that such reconfigurable computers cannot handle, such as sharing of a single copy of the operating system. In addition, a large system constructed from this type of custom hardware would naturally be very expensive to produce.
`
SUMMARY OF THE INVENTION

[0005] Particularly disclosed herein is the utilization of one or more FPGAs to perform user defined algorithms in conjunction with, and tightly coupled to, a microprocessor. More particularly, in a multiprocessor computer system, the FPGAs are globally accessible by all of the system processors for the purpose of executing user definable algorithms.

[0006] In a particular implementation of the present invention disclosed herein, a circuit is provided either within, or in conjunction with, the FPGAs which signals, by means of a control bit, when the last operand has completed its flow through the MAP, thereby allowing a given process to be interrupted and thereafter restarted. In a still more specific implementation, one or more read only memory ("ROM") integrated circuit chips may be coupled adjacent the FPGA to allow a user program to use a single command to select one of several possible algorithms preloaded in the ROM, thereby decreasing system reconfiguration time.
[0007] Still further provided is a computer system memory structure which includes one or more FPGAs for the purpose of using normal memory access protocol to access it as well as being capable of direct memory access ("DMA") operation. In a multiprocessor computer system, FPGAs configured with DMA capability enable one device to feed results directly to another, thereby allowing pipelining or parallelizing execution of a user defined algorithm located in the reconfigurable hardware. The system and method of the present invention also provide a user programmable performance monitoring capability and utilize parallelizer software to automatically detect parallel regions of user applications containing algorithms that can be executed in programmable hardware.
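Purely as an illustrative sketch (the function names and structure below are invented for exposition and are not part of the disclosure), chaining one device's results directly into another — the DMA-style pipelining described above — amounts to composing processing stages through a shared buffer:

```python
# Toy sketch of two processing stages chained DMA-style: stage 1
# writes its results into a buffer that stage 2 consumes, so a
# user-defined algorithm is pipelined across two devices.

def stage1(operands):
    # first device: e.g. scale each operand
    return [x * 2 for x in operands]

def stage2(operands):
    # second device: e.g. produce running sums of its inputs
    out, total = [], 0
    for x in operands:
        total += x
        out.append(total)
    return out

shared_buffer = stage1([1, 2, 3, 4])   # results fed "directly" onward
final = stage2(shared_buffer)
```

Because each stage only reads what the previous stage wrote, neither needs the host processor in the middle, which is the point of the DMA mode described above.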
`
[0008] Broadly, what is disclosed herein is a computer including at least one data processor for operating on user data in accordance with program instructions. The computer includes at least one memory array presenting a data and address bus and comprises a memory algorithm processor associated with the memory array and coupled to the data and address buses. The memory algorithm processor is configurable to perform at least one identified algorithm on an operand received from a write operation to the memory array.
`
[0009] Also disclosed herein is a multiprocessor computer including a first plurality of data processors for operating on user data in accordance with program instructions and a second plurality of memory arrays, each presenting a data and address bus. The computer comprises a memory algorithm processor associated with at least one of the second plurality of memory arrays and coupled to the data and address bus thereof. The memory algorithm processor is configurable to perform at least one identified algorithm on an operand received from a write operation to the associated one of the second plurality of memory arrays.
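As a rough software analogy only (the class and method names here are invented, not taken from the disclosure, and the real MAP is hardware in the memory subsystem), the behavior just summarized — a processor that observes writes to its associated memory array and applies a configured algorithm to each written operand — can be modeled as:

```python
class MemoryAlgorithmProcessor:
    """Toy model: applies a configured algorithm to operands
    written into an associated memory array (illustrative only)."""

    def __init__(self, size):
        self.memory = [0] * size   # the associated memory array
        self.results = []          # results produced by the MAP
        self.algorithm = None      # the configurable "identified algorithm"

    def configure(self, algorithm):
        # In hardware this corresponds to loading an FPGA configuration.
        self.algorithm = algorithm

    def write(self, address, operand):
        # A normal memory write; the MAP also sees the operand and,
        # if configured, performs its algorithm on it.
        self.memory[address] = operand
        if self.algorithm is not None:
            self.results.append(self.algorithm(operand))

map0 = MemoryAlgorithmProcessor(size=16)
map0.configure(lambda x: x * x)        # example algorithm: squaring
for addr, value in enumerate([2, 3, 4]):
    map0.write(addr, value)
```

The key property mirrored here is that ordinary memory writes both store the data and deliver operands to the algorithm, so no special processor-side instructions are needed.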
`
[0004] In response to these shortcomings, SRC Computers, Inc., Colorado Springs, Colo., assignee of the present invention, has developed a Memory Algorithm Processor ("MAP") multiprocessor computer architecture that utilizes very high performance microprocessors in conjunction with user reconfigurable hardware elements. These reconfigurable elements, referred to as MAPs, are globally accessible by all processors in the system. In addition, the manufacturing cost and design time of a particular multiprocessor computer system is relatively low inasmuch as it can be built using industry standard, commodity integrated circuits and, in a preferred embodiment, each MAP may comprise a Field Programmable Gate Array ("FPGA") operating as a reconfigurable functional unit.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] The aforementioned and other features and objects of the present invention and the manner of attaining them will become more apparent and the invention itself will be best understood by reference to the following description of a preferred embodiment taken in conjunction with the accompanying drawings, wherein:

[0011] FIG. 1 is a simplified, high level, functional block diagram of a standard multiprocessor computer architecture;
`
[0012] FIG. 2 is a simplified logical block diagram of a possible computer application program decomposition sequence for use in conjunction with a multiprocessor computer architecture utilizing a number of memory algorithm processors ("MAPs") in accordance with the present invention;

[0013] FIG. 3 is a more detailed functional block diagram of an individual one of the MAPs of the preceding figure and illustrating the bank control logic, memory array and MAP assembly thereof; and

[0014] FIG. 4 is a more detailed functional block diagram of the control block of the MAP assembly of the preceding illustration illustrating its interconnection to the user FPGA thereof.
`
DESCRIPTION OF A PREFERRED EMBODIMENT

[0015] With reference now to FIG. 1, a conventional multiprocessor computer 10 architecture is shown. The multiprocessor computer 10 incorporates N processors 12₁ through 12ₙ which are bi-directionally coupled to a memory interconnect fabric 14. The memory interconnect fabric 14 is then also coupled to M memory banks comprising memory bank subsystems 16₁ (Bank 0) through 16ₘ (Bank M).
`
[0016] With reference now to FIG. 2, a representative application program decomposition for a multiprocessor computer architecture 100 incorporating a plurality of memory algorithm processors in accordance with the present invention is shown. The computer architecture 100 is operative in response to user instructions and data which, in a coarse grained portion of the decomposition, are selectively directed to one of (for purposes of example only) four parallel regions 102₁ through 102₄ inclusive. The instructions and data output from each of the parallel regions 102₁ through 102₄ are respectively input to parallel regions segregated into data areas 104₁ through 104₄ and instruction areas 106₁ through 106₄. Data maintained in the data areas 104₁ through 104₄ and instructions maintained in the instruction areas 106₁ through 106₄ are then supplied to, for example, corresponding pairs of processors 108₁, 108₂ (P1 and P2); 108₃, 108₄ (P3 and P4); 108₅, 108₆ (P5 and P6); and 108₇, 108₈ (P7 and P8) as shown. At this point, the medium grained decomposition of the instructions and data has been accomplished.
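A loose software analogy of the coarse and medium grained decomposition just described (the region count and processor pairings follow the example of FIG. 2; the variable names and the interleaved splitting scheme are invented for illustration):

```python
# Coarse grained: user work is split into four parallel regions.
work = list(range(16))
regions = [work[i::4] for i in range(4)]      # regions 102-1 .. 102-4

# Medium grained: each region's data is handed to a pair of
# processors (P1/P2, P3/P4, P5/P6, P7/P8 in the FIG. 2 example).
assignments = {}
for r, region in enumerate(regions):
    p_a, p_b = 2 * r + 1, 2 * r + 2           # processor numbers
    assignments[p_a] = region[0::2]
    assignments[p_b] = region[1::2]
```

Every work item lands with exactly one of the eight processors, which is the invariant any such decomposition must preserve.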
`
[0017] A fine grained decomposition, or parallelism, is effectuated by a further algorithmic decomposition wherein the output of each of the processors 108₁ through 108₈ is broken up, for example, into a number of fundamental algorithms 110 as shown. Each of the algorithms 110 is then supplied to a corresponding one of the MAPs 112 in the memory space of the computer architecture 100 for execution therein as will be more fully described hereinafter.
`
[0018] With reference additionally now to FIG. 3, a preferred implementation of a memory bank 120 in a MAP system computer architecture 100 of the present invention is shown for a representative one of the MAPs 112 illustrated in the preceding figure. Each memory bank 120 includes a bank control logic block 122 bi-directionally coupled to the computer system trunk lines, for example, a 72 line bus 124. The bank control logic block 122 is coupled to a bi-directional data bus 126 (for example 256 lines) and supplies addresses on an address bus 128 (for example 17 lines) for accessing data at specified locations within a memory array 130.
`
[0019] The data bus 126 and address bus 128 are also coupled to a MAP assembly 112. The MAP assembly 112 comprises a control block 132 coupled to the address bus 128. The control block 132 is also bi-directionally coupled to a user field programmable gate array ("FPGA") 134 by means of a number of signal lines 136. The user FPGA 134 is coupled directly to the data bus 126. In a particular embodiment, the FPGA 134 may be provided as a Lucent Technologies OR3T80 device.
`
[0020] The computer architecture 100 comprises a multiprocessor system employing uniform memory access across common shared memory with one or more MAPs 112 located in the memory subsystem, or memory space. As previously described, each MAP 112 contains at least one relatively large FPGA 134 that is used as a reconfigurable functional unit. In addition, a control block 132 and a preprogrammed or dynamically programmable configuration read-only memory ("ROM", as will be more fully described hereinafter) contains the information needed by the reconfigurable MAP assembly 112 to enable it to perform a specific algorithm. It is also possible for the user to directly download a new configuration into the FPGA 134 under program control, although in some instances this may consume a number of memory accesses and might result in an overall decrease in system performance if the algorithm was short-lived.
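As an illustrative cost model only (the word counts, names, and single-command cost below are assumptions for exposition, not figures from the disclosure), the trade-off between selecting a pre-loaded ROM configuration and downloading a fresh one under program control looks like:

```python
# Toy cost model: selecting a pre-loaded configuration takes a single
# command, while downloading a new configuration costs roughly one
# memory access per word of the configuration file.

PRELOADED = {"fft": "cfg-A", "filter": "cfg-B"}   # held in config ROM

def reconfigure(name, download_words=0):
    """Return (active configuration, memory accesses spent)."""
    if name in PRELOADED:
        return PRELOADED[name], 1      # single select command
    # direct download under program control
    return f"user-cfg:{name}", download_words

cfg, cost = reconfigure("fft")                        # ROM select
new_cfg, new_cost = reconfigure("custom",
                                download_words=4096)  # full download
```

Under this model a short-lived algorithm may not repay the cost of a full download, which is why the text favors the single-command ROM selection.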
`
[0021] FPGAs have particular advantages in the application shown for several reasons. First, commercially available, off-the-shelf FPGAs now contain sufficient internal logic cells to perform meaningful computational functions. Secondly, they can operate at speeds comparable to microprocessors, which eliminates the need for speed matching buffers. Still further, the internal programmable routing resources of FPGAs are now extensive enough that meaningful algorithms can now be programmed without the need to reassign the locations of the input/output ("I/O") pins.
`
[0022] By placing the MAP 112 in the memory subsystem or memory space, it can be readily accessed through the use of memory read and write commands, which allows the use of a variety of standard operating systems. In contrast, other conventional implementations propose placement of any reconfigurable logic in or near the processor. This is much less effective in a multiprocessor environment because only one processor has rapid access to it. Consequently, reconfigurable logic must be placed by every processor in a multiprocessor system, which increases the overall system cost. In addition, the MAP 112 can access the memory array 130 itself, referred to as Direct Memory Access ("DMA"), allowing it to execute tasks independently and asynchronously of the processor. In comparison, were it placed near the processor, it would have to compete with the processors for system routing resources in order to access memory, which deleteriously impacts processor performance. Because the MAP 112 has DMA capability (allowing it to write to memory), and because it receives its operands via writes to memory, it is possible to allow a MAP 112 to feed results to another MAP 112. This is a very powerful feature
`
`
that allows for very extensive pipelining and parallelizing of large tasks, which permits them to complete faster.

[0023] Many of the algorithms that may be implemented will receive an operand and require many clock cycles to produce a result. One such example may be a multiplication that takes 64 clock cycles. This same multiplication may also need to be performed on thousands of operands. In this situation, the incoming operands would be presented sequentially so that while the first operand requires 64 clock cycles to produce results at the output, the second operand, arriving one clock cycle later at the input, will show results one clock cycle later at the output. Thus, after an initial delay of 64 clock cycles, new output data will appear on every consecutive clock cycle until the results of the last operand appears. This is called "pipelining".

[0024] In a multiprocessor system, it is quite common for the operating system to stop a processor in the middle of a task, reassign it to a higher priority task, and then return it, or another, to complete the initial task. When this is combined with a pipelined algorithm, a problem arises (if the processor stops issuing operands in the middle of a list and stops accepting results) with respect to operands already issued but not yet through the pipeline. To handle this issue, a solution involving the combination of software and hardware is disclosed herein.

[0025] To make use of any type of conventional reconfigurable hardware, the programmer could embed the necessary commands in his application program code. The drawback to this approach is that a program would then have to be tailored to be specific to the MAP hardware. The system of the present invention eliminates this problem. Multiprocessor computers often use software called parallelizers. The purpose of this software is to analyze the user's application code and determine how best to split it up among the processors. The present invention provides significant advantages over a conventional parallelizer and enables it to recognize portions of the user code that represent algorithms that exist in MAPs 112 for that system and to then treat the MAP 112 as another computing element. The parallelizer then automatically generates the necessary code to utilize the MAP 112. This allows the user to write the algorithm directly in his code, allowing it to be more portable and reducing the knowledge of the system hardware that he has to have to utilize the MAP 112.

[0026] With reference additionally now to FIG. 4, a block diagram of the MAP control block 132 is shown in greater detail. The control block 132 is coupled to receive a number of command bits (for example, 17) from the address bus 128 at a command decoder 150. The command decoder 150 then supplies a number of register control bits to a group of status registers 152 on an eight bit bus 154. The command decoder 150 also supplies a single bit last operand flag on line 156 to a pipeline counter 158. The pipeline counter 158 supplies an eight bit output to an equality comparitor 160 on bus 162. The equality comparitor 160 also receives an eight bit signal from the FPGA 134 on bus 136 indicative of the pipeline depth. When the equality comparitor determines that the pipeline is empty, it provides a single bit pipeline empty flag on line 164 for input to the status registers 152. The status registers are also coupled to receive an eight bit status signal from the FPGA 134 on bus 136 and produce a sixty four bit status word output on bus 166 in response to the signals on buses 136 and 154 and line 164.

[0027] The command decoder 150 also supplies a five bit control signal to a configuration multiplexer ("MUX") 170 as shown. The configuration mux 170 receives a single bit output of a 256 bit parallel-to-serial converter 172 on line 176. The inputs of the 256 bit parallel-to-serial converter 172 are coupled to a 256 bit user configuration pattern bus 174. The configuration mux 170 also receives sixteen single bit inputs from the configuration ROMs (illustrated as ROM 182) on bus 178 and provides a single bit configuration file signal on line 180 to the user FPGA 134 as selected by the control signals from the command decoder 150 on the bus 168.

[0028] In operation, when a processor 108 is halted by the operating system, the operating system will issue a last operand command to the MAP 112 through the use of command bits embedded in the address field on bus 128. This command is recognized by the command decoder 150 of the control block 132 and it initiates a hardware pipeline counter 158. When the algorithm was initially loaded into the FPGA 134, several output bits connected to the control block 132 were configured to display a binary representation of the number of clock cycles required to get through its pipeline (i.e., pipeline "depth") on bus 136 input to the equality comparitor 160. After receiving the last operand command, the pipeline counter 158 in the control block 132 counts clock cycles until its count equals the pipeline depth for that particular algorithm. At that point, the equality comparitor 160 in the control block 132 de-asserts a busy bit on line 164 in an internal group of status registers 152. After issuing the last operand signal, the processor 108 will repeatedly read the status registers 152 and accept any output data on bus 166. When the busy flag is de-asserted, the task can be stopped and the MAP 112 utilized for a different task. It should be noted that it is also possible to leave the MAP 112 configured, transfer the program to a different processor 108 and restart the task where it left off.

[0029] In order to evaluate the effectiveness of the use of the MAP 112 in a given application, some form of feedback to the user is required. Therefore, the MAP 112 may be equipped with internal registers in the control block 132 that allow it to monitor efficiency related factors such as the number of input operands versus output data, the number of idle cycles over time and the number of system monitor interrupts received over time. One of the advantages that the MAP 112 has is that because of its reconfigurable nature, the actual function and type of function that are monitored can also change as the algorithm changes. This provides the user with an almost infinite number of possible monitored factors without having to monitor all factors all of the time.
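The pipelining and last-operand drain behavior described in paragraphs [0023] and [0028] can be sketched as a cycle-level toy model (the class and its fields are illustrative inventions; in the disclosed hardware the equivalent roles are played by the pipeline counter 158, equality comparitor 160 and busy bit on line 164):

```python
from collections import deque

class PipelineModel:
    """Toy model of a fixed-depth pipeline with a last-operand
    drain counter, mirroring the busy-flag protocol described above."""

    def __init__(self, depth):
        self.depth = depth
        self.stages = deque([None] * depth)   # in-flight operands
        self.outputs = []
        self.busy = True
        self.drain_count = None               # the "pipeline counter"

    def clock(self, operand=None):
        done = self.stages.popleft()
        if done is not None:
            self.outputs.append(done * done)  # example algorithm: square
        self.stages.append(operand)
        if self.drain_count is not None:
            self.drain_count += 1
            if self.drain_count == self.depth:
                self.busy = False             # pipeline is now empty

    def last_operand(self):
        # Command issued by the OS when a processor is halted.
        self.drain_count = 0

pipe = PipelineModel(depth=64)
for x in range(1, 11):
    pipe.clock(x)          # issue ten operands, one per clock cycle
pipe.last_operand()
while pipe.busy:
    pipe.clock()           # drain: clock with no new operands
```

After exactly `depth` drain cycles the busy flag clears, and every operand issued before the last-operand command has produced its result in order — the guarantee the hardware protocol provides.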
`
[0030] While there have been described above the principles of the present invention in conjunction with a specific multiprocessor architecture it is to be clearly understood that the foregoing description is made only by way of example and not as a limitation to the scope of the invention. Particularly, it is recognized that the teachings of the foregoing disclosure will suggest other modifications to those persons skilled in the relevant art. Such modifications may involve other features which are already known per se and which may be used instead of, or in addition to, features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure herein also includes any novel feature or any novel combination of features disclosed either explicitly or
`
`
implicitly or any generalization or modification thereof which would be apparent to persons skilled in the relevant art, whether or not such relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as confronted by the present invention. The applicants hereby reserve the right to formulate new claims to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.
`
What is claimed is:

1. A computer including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array presenting a data and address bus, said computer comprising:

a memory algorithm processor associated with said memory array and coupled to said data and address buses, said memory algorithm processor being configurable to perform at least one identified algorithm on an operand received from a write operation to said memory array.

2. The computer of claim 1 wherein said memory algorithm processor comprises a field programmable gate array.

3. The computer of claim 1 wherein said memory algorithm processor is operative to access said memory array independently of said processor.

4. The computer of claim 1 wherein said at least one identified algorithm is preprogrammed into said memory algorithm processor.

5. The computer of claim 4 wherein said at least one identified algorithm is preprogrammed into a memory device associated with said memory algorithm processor.

6. The computer of claim 5 wherein said memory device comprises at least one read only memory device.

7. The computer of claim 1 further comprising a first plurality of said data processors and a second plurality of said memory arrays, each of said memory arrays comprising an associated memory algorithm processor.

8. The computer of claim 7 wherein a memory algorithm processor associated with a first one of said second plurality of said memory arrays is operative to pass a result of a processed operand to another memory algorithm processor associated with a second one of said second plurality of said memory arrays.

9. The computer of claim 1 wherein said memory algorithm processor further comprises:

a control block including a command decoder coupled to said address bus and a counter coupled to said command decoder, said command decoder for providing a last operand flag to said counter in response to a last operand command from an operating system of said at least one processor.

10. The computer of claim 9 wherein said memory algorithm processor further comprises:

an equality comparator coupled to receive a pipeline depth signal and an output of said counter for providing a pipeline empty flag to at least one status register.

11. The computer of claim 10 wherein said status register is coupled to said command decoder to receive a register control signal and a status signal to provide a status word output signal.
`
12. A multiprocessor computer including a first plurality of data processors for operating on user data in accordance with program instructions and a second plurality of memory arrays, each presenting a data and address bus, said computer comprising:

a memory algorithm processor associated with at least one of said second plurality of memory arrays and coupled to said data and address bus thereof, said memory algorithm processor being configurable to perform at least one identified algorithm on an operand received from a write operation to said associated one of said second plurality of memory arrays.

13. The multiprocessor computer of claim 12 wherein said memory algorithm processor associated with one of said second plurality of memory arrays is accessible by more than one of said first plurality of data processors.

14. The multiprocessor computer of claim 13 wherein said memory algorithm processor associated with one of said second plurality of memory arrays is accessible by all of said first plurality of data processors.

15. The multiprocessor computer of claim 12 wherein said memory algorithm processor comprises:

a control block operative to provide a last operand flag in response to a last operand having been processed in said memory algorithm processor.

16. The multiprocessor computer of claim 12 further comprising:

at least one memory device associated with said memory algorithm processor for storing a number of pre-loaded algorithms.

17. The multiprocessor computer of claim 16 wherein said at least one memory device is responsive to a predetermined command to enable a selected one of said number of pre-loaded algorithms to be implemented by said memory algorithm processor.

18. The multiprocessor computer of claim 16 wherein said at least one memory device comprises at least one read only memory device.

19. The multiprocessor computer of claim 12 wherein said memory algorithm processor comprises a field programmable gate array.

20. The multiprocessor computer of claim 12 wherein said memory algorithm processor is accessible through normal memory access protocol.

21. The multiprocessor computer of claim 12 wherein said memory algorithm processor has direct memory access capability to said associated one of said second plurality of memory arrays.

22. The multiprocessor computer of claim 12 wherein a memory algorithm processor associated with a first one of said second plurality of said memory arrays is operative to pass a result of a processed operand to another memory algorithm processor associated with a second one of said second plurality of said memory arrays.

23. The multiprocessor computer of claim 12 wherein said computer is operative to automatically detect parallel regions of application program code that are capable of being executed in said memory algorithm processor.

24. The multiprocessor computer of claim 23 wherein said memory algorithm processor is configurable by said application program code.
`
`
`