(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2001/0014937 A1
Huppenthal et al.                      (43) Pub. Date: Aug. 16, 2001

(54) MULTIPROCESSOR COMPUTER ARCHITECTURE INCORPORATING A PLURALITY OF MEMORY ALGORITHM PROCESSORS IN THE MEMORY SUBSYSTEM

(76) Inventors: Jon M. Huppenthal, Colorado Springs, CO (US); Paul A. Leskar, Colorado Springs, CO (US)

Correspondence Address:
HOGAN & HARTSON LLP
ONE TABOR CENTER, SUITE 1500
1200 SEVENTEENTH ST
DENVER, CO 80202 (US)

(21) Appl. No.: amass-:44

(22) Filed: Jan. 5, 2001

Related U.S. Application Data

(60) Division of application No. 09/481,902, filed on Jan. 13, 2000, now Pat. No. 6,247,110, which is a continuation of application No. 08/992,763, filed on Dec. 17, 1997, now Pat. No. 6,076,152.

Publication Classification

(51) Int. Cl.7 ..................................................... G06F 15/80
(52) U.S. Cl. .................................................... 712/15

(57) ABSTRACT
A multiprocessor computer architecture incorporating a plurality of programmable hardware memory algorithm processors ("MAPs") in the memory subsystem. The MAP may comprise one or more field programmable gate arrays ("FPGAs") which function to perform identified algorithms in conjunction with, and tightly coupled to, a microprocessor, and each MAP is globally accessible by all of the system processors for the purpose of executing user definable algorithms. A circuit within the MAP signals when the last operand has completed its flow, thereby allowing a given process to be interrupted and thereafter restarted. Through the use of read only memory ("ROM") located adjacent the FPGA, a user program may use a single command to select one of several possible pre-loaded algorithms, thereby decreasing system reconfiguration time. A computer system memory structure incorporating the MAP disclosed herein may function in normal or direct memory access ("DMA") modes of operation and, in the latter mode, one device may feed results directly to another, thereby allowing pipelining or parallelizing execution of a user defined algorithm. The system of the present invention also provides a user programmable performance monitoring capability and utilizes parallelizer software to automatically detect parallel regions of user applications containing algorithms that can be executed in the programmable hardware.
`
[Cover figure: decomposition of user instructions and data through coarse grained, medium grained, and fine grained stages into parallel regions, processors, algorithms, and MAPs in the memory space]
`Petitioner Microsoft Corporation - Ex. 1022, Cover
`
`
`
Patent Application Publication Aug. 16, 2001 Sheet 1 of 4
US 2001/0014937 A1

[FIG. 1: processors 0 through N bi-directionally coupled through a memory interconnect fabric to memory subsystem banks 0 through M]
`
`
`
Patent Application Publication Aug. 16, 2001 Sheet 2 of 4
US 2001/0014937 A1

[FIG. 2: coarse grained, medium grained, and fine grained decomposition of user instructions and data into parallel regions, with algorithms supplied to MAPs]
`
`
`
`
Patent Application Publication Aug. 16, 2001 Sheet 3 of 4
US 2001/0014937 A1

[FIG. 3: a memory bank showing its bank control logic, memory array, and MAP assembly, with signal lines 136]
`
Patent Application Publication Aug. 16, 2001 Sheet 4 of 4
US 2001/0014937 A1

[FIG. 4: MAP control block detail, showing the command decoder, status registers, pipeline counter, equality comparator, 256-bit parallel-to-serial converter fed by the user configuration pattern, and configuration mux fed by the configuration ROMs, producing the configuration file output to the user FPGA]
`
`
`
`
MULTIPROCESSOR COMPUTER ARCHITECTURE INCORPORATING A PLURALITY OF MEMORY ALGORITHM PROCESSORS IN THE MEMORY SUBSYSTEM

BACKGROUND OF THE INVENTION
`
[0001] The present invention relates, in general, to the field of computer architectures incorporating multiple processing elements. More particularly, the present invention relates to a multiprocessor computer architecture incorporating a number of memory algorithm processors in the memory subsystem to significantly enhance overall system processing speed.
`
[0002] All general purpose computers are based on circuits that have some form of processing element. These may take the form of microprocessor chips or could be a collection of smaller chips coupled together to form a processor. In any case, these processors are designed to execute programs that are defined by a set of program steps. The fact that these steps, or commands, can be rearranged to create different end results using the same computer hardware is key to the computer's flexibility. Unfortunately, this flexibility dictates that the hardware then be designed to handle a variety of possible functions, which results in generally slower operation than would be the case were it able to be designed to handle only one particular function. On the other hand, a single function computer is inherently not a particularly versatile computer.
`
[0003] Recently, several groups have begun to experiment with creating a processor out of circuits that are electrically reconfigurable. This would allow the processor to execute a small set of functions more quickly and then be electrically reconfigured to execute a different small set. While this accelerates some program execution speeds, there are many functions that cannot be implemented well in this type of system due to the circuit densities that can be achieved in reconfigurable integrated circuits, such as 64-bit floating point math. In addition, all of these systems are presently intended to contain processors that operate alone. In high performance systems, this is not the case. Hundreds or even tens of thousands of processors are often used to solve a single problem in a timely manner. This introduces numerous issues that such reconfigurable computers cannot handle, such as sharing of a single copy of the operating system. In addition, a large system constructed from this type of custom hardware would naturally be very expensive to produce.
`
`SUMMARY OF THE INVENTION
`
`In response to these shortcomings, SRC Comput-
`[0004]
`ers, Inc., Colorado Springs, Colo., assignee of the present
`invention, has developed a Memory Algorithm Processor
`("MAP") multiprocessor computer architecture that utilizes
`very high performance microprocessors in conjunction with
`user reconfigurable hardware elements. These reconfig-
`urable elements, referred to as MAPs, are globally acces—
`sible by all processors in the systems.
`In addition,
`the
`manufacturing cost and design time of a particular multi-
`processor computer system is relatively low inasmuch as it
`can be built using industry standard, commodity integrated
`circuits and,
`in a preferred embodiment. each MAP may
`comprise a Field Programmable Gate Array {‘“FPGA") oper-
`ating as a reconfigurable functional unit.
`
[0005] Particularly disclosed herein is the utilization of one or more FPGAs to perform user defined algorithms in conjunction with, and tightly coupled to, a microprocessor. More particularly, in a multiprocessor computer system, the FPGAs are globally accessible by all of the system processors for the purpose of executing user definable algorithms.
`
[0006] In a particular implementation of the present invention disclosed herein, a circuit is provided either within, or in conjunction with, the FPGAs which signals, by means of a control bit, when the last operand has completed its flow through the MAP, thereby allowing a given process to be interrupted and thereafter restarted. In a still more specific implementation, one or more read only memory ("ROM") integrated circuit chips may be coupled adjacent the FPGA to allow a user program to use a single command to select one of several possible algorithms preloaded in the ROM, thereby decreasing system reconfiguration time.
`
[0007] Still further provided is a computer system memory structure which includes one or more FPGAs for the purpose of using normal memory access protocol to access it as well as being capable of direct memory access ("DMA") operation. In a multiprocessor computer system, FPGAs configured with DMA capability enable one device to feed results directly to another, thereby allowing pipelining or parallelizing execution of a user defined algorithm located in the reconfigurable hardware. The system and method of the present invention also provide a user programmable performance monitoring capability and utilize parallelizer software to automatically detect parallel regions of user applications containing algorithms that can be executed in programmable hardware.
`
[0008] Broadly, what is disclosed herein is a computer including at least one data processor for operating on user data in accordance with program instructions. The computer includes at least one memory array presenting a data and address bus and comprises a memory algorithm processor associated with the memory array and coupled to the data and address buses. The memory algorithm processor is configurable to perform at least one identified algorithm on an operand received from a write operation to the memory array.
`
[0009] Also disclosed herein is a multiprocessor computer including a first plurality of data processors for operating on user data in accordance with program instructions and a second plurality of memory arrays, each presenting a data and address bus. The computer comprises a memory algorithm processor associated with at least one of the second plurality of memory arrays and coupled to the data and address bus thereof. The memory algorithm processor is configurable to perform at least one identified algorithm on an operand received from a write operation to the associated one of the second plurality of memory arrays.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
[0010] The aforementioned and other features and objects of the present invention and the manner of attaining them will become more apparent and the invention itself will be best understood by reference to the following description of a preferred embodiment taken in conjunction with the accompanying drawings, wherein:
`
`[0011] FIG. 1 is a simplified, high level, functional block
`diagram of a standard multiprocessor computer architecture;
`
`
[0012] FIG. 2 is a simplified logical block diagram of a possible computer application program decomposition sequence for use in conjunction with a multiprocessor computer architecture utilizing a number of memory algorithm processors ("MAPs") in accordance with the present invention;
`
[0013] FIG. 3 is a more detailed functional block diagram of an individual one of the MAPs of the preceding figure and illustrating the bank control logic, memory array and MAP assembly thereof; and
`
[0014] FIG. 4 is a more detailed functional block diagram of the control block of the MAP assembly of the preceding illustration illustrating its interconnection to the user FPGA thereof.
`
`DESCRIPTION OF A PREFERRED
`EMBODIMENT
`
[0015] With reference now to FIG. 1, a conventional multiprocessor computer 10 architecture is shown. The multiprocessor computer 10 incorporates N processors 12_0 through 12_N which are bi-directionally coupled to a memory interconnect fabric 14. The memory interconnect fabric 14 is then also coupled to M memory banks comprising memory bank subsystems 16_0 (Bank 0) through 16_M (Bank M).
`
[0016] With reference now to FIG. 2, a representative application program decomposition for a multiprocessor computer architecture 100 incorporating a plurality of memory algorithm processors in accordance with the present invention is shown. The computer architecture 100 is operative in response to user instructions and data which, in a coarse grained portion of the decomposition, are selectively directed to one of (for purposes of example only) four parallel regions 102_1 through 102_4 inclusive. The instructions and data output from each of the parallel regions 102_1 through 102_4 are respectively input to parallel regions segregated into data areas 104_1 through 104_4 and instruction areas 106_1 through 106_4. Data maintained in the data areas 104_1 through 104_4 and instructions maintained in the instruction areas 106_1 through 106_4 are then supplied to, for example, corresponding pairs of processors 108_1, 108_2 (P1 and P2); 108_3, 108_4 (P3 and P4); 108_5, 108_6 (P5 and P6); and 108_7, 108_8 (P7 and P8) as shown. At this point, the medium grained decomposition of the instructions and data has been accomplished.
`
[0017] A fine grained decomposition, or parallelism, is effectuated by a further algorithmic decomposition wherein the output of each of the processors 108_1 through 108_8 is broken up, for example, into a number of fundamental algorithms 110_1A, 110_1B, 110_2A, 110_2B through 110_8B as shown. Each of the algorithms is then supplied to a corresponding one of the MAPs 112_1A, 112_1B, 112_2A, 112_2B through 112_8B in the memory space of the computer architecture 100 for execution therein as will be more fully described hereinafter.
`
[0018] With reference additionally now to FIG. 3, a preferred implementation of a memory bank 120 in a MAP system computer architecture 100 of the present invention is shown for a representative one of the MAPs 112 illustrated in the preceding figure. Each memory bank 120 includes a bank control logic block 122 bi-directionally coupled to the computer system trunk lines, for example, a 72 line bus 124. The bank control logic block 122 is coupled to a bi-directional data bus 126 (for example 256 lines) and supplies addresses on an address bus 128 (for example 17 lines) for accessing data at specified locations within a memory array 130.
`
[0019] The data bus 126 and address bus 128 are also coupled to a MAP assembly 112. The MAP assembly 112 comprises a control block 132 coupled to the address bus 128. The control block 132 is also bi-directionally coupled to a user field programmable gate array ("FPGA") 134 by means of a number of signal lines 136. The user FPGA 134 is coupled directly to the data bus 126. In a particular embodiment, the FPGA 134 may be provided as a Lucent Technologies OR3T80 device.
`
[0020] The computer architecture 100 comprises a multiprocessor system employing uniform memory access across common shared memory with one or more MAPs 112 located in the memory subsystem, or memory space. As previously described, each MAP 112 contains at least one relatively large FPGA 134 that is used as a reconfigurable functional unit. In addition, a control block 132 and a preprogrammed or dynamically programmable configuration read-only memory ("ROM" as will be more fully described hereinafter) contains the information needed by the reconfigurable MAP assembly 112 to enable it to perform a specific algorithm. It is also possible for the user to directly download a new configuration into the FPGA 134 under program control, although in some instances this may consume a number of memory accesses and might result in an overall decrease in system performance if the algorithm was short-lived.
`
[0021] FPGAs have particular advantages in the application shown for several reasons. First, commercially available, off-the-shelf FPGAs now contain sufficient internal logic cells to perform meaningful computational functions. Secondly, they can operate at speeds comparable to microprocessors, which eliminates the need for speed matching buffers. Still further, the internal programmable routing resources of FPGAs are now extensive enough that meaningful algorithms can now be programmed without the need to reassign the locations of the input/output ("I/O") pins.
`
[0022] By placing the MAP 112 in the memory subsystem or memory space, it can be readily accessed through the use of memory read and write commands, which allows the use of a variety of standard operating systems. In contrast, other conventional implementations propose placement of any reconfigurable logic in or near the processor. This is much less effective in a multiprocessor environment because only one processor has rapid access to it. Consequently, reconfigurable logic must be placed by every processor in a multiprocessor system, which increases the overall system cost. In addition, the MAP 112 can access the memory array 130 itself, referred to as Direct Memory Access ("DMA"), allowing it to execute tasks independently and asynchronously of the processor. In comparison, were it placed near the processor, it would have to compete with the processors for system routing resources in order to access memory, which deleteriously impacts processor performance. Because the MAP 112 has DMA capability (allowing it to write to memory), and because it receives its operands via writes to memory, it is possible to allow a MAP 112 to feed results to another MAP 112. This is a very powerful feature
`
`
`that allows for very extensive pipelining and parallelizing of
`large tasks, which permits them to complete faster.
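
The memory-mapped access described in paragraph [0022] can be sketched in software. The toy Python model below (an illustration, not the patented hardware; the class, addresses, and algorithms are all invented for this sketch) treats a MAP as a region of the address space: storing an operand to it triggers the configured algorithm, loading from it returns the result, and an optional downstream link mimics the DMA mode in which one MAP feeds results directly to another.

```python
OPERAND_ADDR = 0x1000  # hypothetical addresses within a memory bank
RESULT_ADDR = 0x1008

class MemoryMappedMAP:
    """Toy model: a MAP driven by plain load/store operations."""

    def __init__(self, algorithm, downstream=None):
        self.algorithm = algorithm    # the function the FPGA is configured with
        self.downstream = downstream  # optional DMA target (another MAP)
        self.result = None

    def write(self, addr, value):
        if addr == OPERAND_ADDR:      # a store of an operand triggers execution
            self.result = self.algorithm(value)
            if self.downstream is not None:
                # DMA mode: feed the result straight into the next MAP
                self.downstream.write(OPERAND_ADDR, self.result)

    def read(self, addr):
        if addr == RESULT_ADDR:       # an ordinary load retrieves the result
            return self.result

# Chain two MAPs: square, then add one.
map2 = MemoryMappedMAP(lambda x: x + 1)
map1 = MemoryMappedMAP(lambda x: x * x, downstream=map2)
map1.write(OPERAND_ADDR, 6)
print(map2.read(RESULT_ADDR))  # 37
```

Because the interface is nothing but reads and writes, an ordinary operating system needs no special driver model to reach the device, which is the point the paragraph makes.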
`
[0023] Many of the algorithms that may be implemented will receive an operand and require many clock cycles to produce a result. One such example may be a multiplication that takes 64 clock cycles. This same multiplication may also need to be performed on thousands of operands. In this situation, the incoming operands would be presented sequentially so that while the first operand requires 64 clock cycles to produce results at the output, the second operand, arriving one clock cycle later at the input, will show results one clock cycle later at the output. Thus, after an initial delay of 64 clock cycles, new output data will appear on every consecutive clock cycle until the results of the last operand appears. This is called "pipelining".
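
The timing behavior just described can be checked with a short cycle-level simulation. The 64-stage depth matches the example in the text; the multiply-by-constant operation and all names are illustrative, not taken from the disclosure.

```python
from collections import deque

DEPTH = 64  # pipeline depth from the worked example above

def run_pipeline(operands, multiplier=3):
    """Clock a DEPTH-stage pipeline, feeding one operand per cycle."""
    operands = list(operands)
    stages = deque([None] * DEPTH)       # one slot per pipeline stage
    outputs = []                         # (cycle, result) pairs
    feed = iter(operands)
    for cycle in range(DEPTH + len(operands)):
        done = stages.pop()              # operand leaving the final stage
        if done is not None:
            outputs.append((cycle, done * multiplier))
        stages.appendleft(next(feed, None))  # next operand enters stage 0
    return outputs

out = run_pipeline(range(5))
print(out[0])  # (64, 0): first result only after the full 64-cycle latency
print(out[1])  # (65, 3): then one new result on every consecutive cycle
```

The simulation reproduces the paragraph's claim: an initial delay equal to the pipeline depth, then one result per clock.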
`
[0024] In a multiprocessor system, it is quite common for the operating system to stop a processor in the middle of a task, reassign it to a higher priority task, and then return it, or another, to complete the initial task. When this is combined with a pipelined algorithm, a problem arises (if the processor stops issuing operands in the middle of a list and stops accepting results) with respect to operands already issued but not yet through the pipeline. To handle this issue, a solution involving the combination of software and hardware is disclosed herein.
`
[0025] To make use of any type of conventional reconfigurable hardware, the programmer could embed the necessary commands in his application program code. The drawback to this approach is that a program would then have to be tailored to be specific to the MAP hardware. The system of the present invention eliminates this problem. Multiprocessor computers often use software called parallelizers. The purpose of this software is to analyze the user's application code and determine how best to split it up among the processors. The present invention provides significant advantages over a conventional parallelizer and enables it to recognize portions of the user code that represent algorithms that exist in MAPs 112 for that system and to then treat the MAP 112 as another computing element. The parallelizer then automatically generates the necessary code to utilize the MAP 112. This allows the user to write the algorithm directly in his code, allowing it to be more portable and reducing the knowledge of the system hardware that he has to have to utilize the MAP 112.
`
[0026] With reference additionally now to FIG. 4, a block diagram of the MAP control block 132 is shown in greater detail. The control block 132 is coupled to receive a number of command bits (for example, 17) from the address bus 128 at a command decoder 150. The command decoder 150 then supplies a number of register control bits to a group of status registers 152 on an eight bit bus 154. The command decoder 150 also supplies a single bit last operand flag on line 156 to a pipeline counter 158. The pipeline counter 158 supplies an eight bit output to an equality comparator 160 on bus 162. The equality comparator 160 also receives an eight bit signal from the FPGA 134 on bus 136 indicative of the pipeline depth. When the equality comparator determines that the pipeline is empty, it provides a single bit pipeline empty flag on line 164 for input to the status registers 152. The status registers are also coupled to receive an eight bit status signal from the FPGA 134 on bus 136 and produce a sixty four bit status word output on bus 166 in response to the signals on buses 136, 154 and line 164.
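
A minimal software model of this counter-and-comparator datapath, assuming a simple synchronous design (the class and method names are invented for illustration): once the last operand flag arrives, the counter advances on each clock until it equals the pipeline depth reported by the FPGA, at which point the comparator raises the pipeline empty flag into the status registers.

```python
class ControlBlock:
    """Sketch of the FIG. 4 datapath: counter 158 vs. comparator 160."""

    def __init__(self, pipeline_depth):
        self.pipeline_depth = pipeline_depth  # 8-bit depth reported on bus 136
        self.counter = 0                      # pipeline counter 158
        self.counting = False
        self.pipeline_empty = False           # flag latched into registers 152

    def last_operand_flag(self):
        """Single-bit flag on line 156 starts the drain count."""
        self.counter = 0
        self.counting = True

    def clock(self):
        if self.counting and not self.pipeline_empty:
            self.counter += 1
            if self.counter == self.pipeline_depth:  # equality comparator
                self.pipeline_empty = True           # flag on line 164

cb = ControlBlock(pipeline_depth=64)
cb.last_operand_flag()
for _ in range(64):
    cb.clock()
print(cb.pipeline_empty)  # True
```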
`
[0027] The command decoder 150 also supplies a five bit control signal to a configuration multiplexer ("MUX") 170 as shown. The configuration mux 170 receives a single bit output of a 256 bit parallel-to-serial converter 172 on line 176. The inputs of the 256 bit parallel-to-serial converter 172 are coupled to a 256 bit user configuration pattern bus 174. The configuration mux 170 also receives sixteen single bit inputs from the configuration ROMs (illustrated as ROM 182) on bus 178 and provides a single bit configuration file signal on line 180 to the user FPGA 134 as selected by the control signals from the command decoder 150 on the bus 168.
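
Functionally, this selection logic behaves like a 17-way one-bit multiplexer. The sketch below (function name, select encoding, and stand-in bitstreams are all invented for illustration) picks either one of the sixteen preloaded ROM streams or the serialized user configuration pattern.

```python
def configuration_mux(select, rom_streams, user_pattern_bits):
    """Sketch of mux 170: select 0-15 picks a ROM stream; any other
    value falls through to the serialized user configuration pattern."""
    if 0 <= select < 16:
        return rom_streams[select]
    return user_pattern_bits

# Sixteen tiny stand-in bitstreams, one per configuration ROM input.
roms = [[i] * 4 for i in range(16)]
print(configuration_mux(2, roms, [1, 0, 1]))   # [2, 2, 2, 2]
print(configuration_mux(16, roms, [1, 0, 1]))  # [1, 0, 1]
```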
`
[0028] In operation, when a processor 108 is halted by the operating system, the operating system will issue a last operand command to the MAP 112 through the use of command bits embedded in the address field on bus 128. This command is recognized by the command decoder 150 of the control block 132 and it initiates a hardware pipeline counter 158. When the algorithm was initially loaded into the FPGA 134, several output bits connected to the control block 132 were configured to display a binary representation of the number of clock cycles required to get through its pipeline (i.e., pipeline "depth") on bus 136 input to the equality comparator 160. After receiving the last operand command, the pipeline counter 158 in the control block 132 counts clock cycles until its count equals the pipeline depth for that particular algorithm. At that point, the equality comparator 160 in the control block 132 de-asserts a busy bit on line 164 in an internal group of status registers 152. After issuing the last operand signal, the processor 108 will repeatedly read the status registers 152 and accept any output data on bus 166. When the busy flag is de-asserted, the task can be stopped and the MAP 112 utilized for a different task. It should be noted that it is also possible to leave the MAP 112 configured, transfer the program to a different processor 108 and restart the task where it left off.
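
On the processor side, this sequence amounts to a polling loop: issue the last operand command, then repeatedly read status and collect any remaining results until the busy bit clears. The sketch below stubs the MAP with a device that stays busy for a fixed number of status reads; the class, method names, and values are hypothetical.

```python
class MapStub:
    """Stand-in for a draining MAP: busy for `depth` status reads."""

    def __init__(self, depth, pending):
        self.remaining = depth
        self.pending = pending        # results still inside the pipeline

    def last_operand_command(self):
        pass  # the real command travels as bits in the address field on bus 128

    def read_status(self):
        self.remaining -= 1
        busy = self.remaining > 0     # busy bit from the status registers
        data = self.pending.pop(0) if self.pending else None
        return busy, data

def drain(map_dev):
    """OS-side drain loop from paragraph [0028]."""
    map_dev.last_operand_command()
    results = []
    while True:
        busy, data = map_dev.read_status()
        if data is not None:
            results.append(data)      # accept any output data on bus 166
        if not busy:
            return results            # busy de-asserted: safe to reassign

print(drain(MapStub(depth=4, pending=[10, 20])))  # [10, 20]
```

Once `drain` returns, no operands remain in flight, so the MAP can be handed to a different task, or the program can be resumed on another processor, as the paragraph notes.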
`
[0029] In order to evaluate the effectiveness of the use of the MAP 112 in a given application, some form of feedback to the user is required. Therefore, the MAP 112 may be equipped with internal registers in the control block 132 that allow it to monitor efficiency related factors such as the number of input operands versus output data, the number of idle cycles over time and the number of system monitor interrupts received over time. One of the advantages that the MAP 112 has is that because of its reconfigurable nature, the actual function and type of function that are monitored can also change as the algorithm changes. This provides the user with an almost infinite number of possible monitored factors without having to monitor all factors all of the time.
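
The monitored quantities named above reduce to simple counters, from which efficiency figures can be derived in software. The two metrics below are assumptions chosen for illustration, not taken from the disclosure.

```python
def efficiency(operands_in, results_out, idle_cycles, total_cycles):
    """Derive illustrative efficiency figures from monitoring counters."""
    return {
        # results produced per operand issued (input operands vs. output data)
        "throughput_ratio": results_out / operands_in if operands_in else 0.0,
        # fraction of cycles the MAP was doing useful work (idle cycles over time)
        "busy_fraction": 1.0 - idle_cycles / total_cycles,
    }

stats = efficiency(operands_in=1000, results_out=1000,
                   idle_cycles=250, total_cycles=1000)
print(stats["busy_fraction"])  # 0.75
```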
`
[0030] While there have been described above the principles of the present invention in conjunction with a specific multiprocessor architecture it is to be clearly understood that the foregoing description is made only by way of example and not as a limitation to the scope of the invention. Particularly, it is recognized that the teachings of the foregoing disclosure will suggest other modifications to those persons skilled in the relevant art. Such modifications may involve other features which are already known per se and which may be used instead of, or in addition to, features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure herein also includes any novel feature or any novel combination of features disclosed either explicitly or
`
`
implicitly or any generalization or modification thereof which would be apparent to persons skilled in the relevant art, whether or not such relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as confronted by the present invention. The applicants hereby reserve the right to formulate new claims to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.
`
What is claimed is:
1. A computer including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array presenting a data and address bus, said computer comprising:

a memory algorithm processor associated with said memory array and coupled to said data and address buses, said memory algorithm processor being configurable to perform at least one identified algorithm on an operand received from a write operation to said memory array.
`2. The computer of claim 1 wherein said memory algo-
`rithm processor comprises a field programmable gate array.
3. The computer of claim 1 wherein said memory algorithm processor is operative to access said memory array independently of said processor.
4. The computer of claim 1 wherein said at least one identified algorithm is preprogrammed into said memory algorithm processor.
5. The computer of claim 4 wherein said at least one identified algorithm is preprogrammed into a memory device associated with said memory algorithm processor.
6. The computer of claim 5 wherein said memory device comprises at least one read only memory device.
7. The computer of claim 1 further comprising a first plurality of said data processors and a second plurality of said memory arrays, each of said memory arrays comprising an associated memory algorithm processor.
`8. The computer of claim 7 wherein a memory algorithm
`processor associated with a first one of said second plurality
`of said memory arrays is operative to pass a result of a
`processed operand to another memory algorithm processor
`associated with a second one of said second plurality of said
`memory arrays.
`9. The computer of claim 1 wherein said memory algo—
`rithm processor further comprises:
`
`a control block including a command decoder coupled to
`said address bus and a counter coupled to said com—
`mand decoder, said command decoder for providing a
`last operand flag to said counter in response to a last
`operand command from an operating system of said at
`least one processor.
`10. The computer of claim 9 wherein said memory
`algorithm processor further comprises:
`
an equality comparator coupled to receive a pipeline depth signal and an output of said counter for providing a pipeline empty flag to at least one status register.
`11. The computer of claim 10 wherein said status register
`is coupled to said command decoder to receive a register
`control signal and a status signal to provide a status word
`output signal.
`
`12. A multiprocessor computer including a first plurality
`of data processors for operating on user data in accordance
`with program instructions and a second plurality of memory
`arrays, each presenting a data and address bus, said com—
`puter comprising:
`
`a memory algorithm processor associated with at least one
`of said second plurality of memory arrays and coupled
`to said data and address bus thereof, said memory
`algorithm processor being configurable to perform at
`least one identified algorithm on an operand received
`from a write operation to said associated one of said
`second plurality of memory arrays.
`13. The multiprocessor computer of claim 12 wherein said
`memory algorithm processor associated with one of said
`second plurality of memory arrays is accessible by more
`than one of said first plurality of data processors.
14. The multiprocessor computer of claim 13 wherein said
`memory algorithm processor associated with one of said
`second plurality of memory arrays is accessible by all of said
`first plurality of data processors.
15. The multiprocessor computer of claim 12 wherein said
`memory algorithm processor comprises:
`
`a control block operative to provide a last operand flag in
`response to a last operand having been processed in
`said memory algorithm processor.
`16. The multiprocessor computer of claim 12 further
`comprising:
`
at least one memory device associated with said memory algorithm processor for storing a number of pre-loaded algorithms.
`17. The multiprocessor computer of claim 16 wherein said
`at least one memory device is responsive to a predetermined
`command to enable a selected one of said number of
`pre-loaded algorithms to be implemented by said memory
`algorithm processor.
`18. The multiprocessor computer of claim 16 wherein said
`at least one memory device comprises at least one read only
`memory device.
`19. The multiprocessor computer of claim 12 wherein said
`memory algorithm processor comprises a field program-
`mable gate array.
`20. The multiprocessor computer of claim 12 wherein said
`memory algorithm processor is accessible through normal
`memory access protocol.
`21. The multiprocessor computer of claim 12 wherein said
`memory algorithm processor has direct memory access
`capability to said associated one of said second plurality of
`memory arrays.
`22. The multiprocessor computer of claim 12 wherein a
`memory algorithm processor associated with a first one of
`said second plurality of said memory arrays is operative to
`pass a result of a processed operand to another memory
algorithm processor associated with a second one of said
`second plurality of said memory arrays.
`23. The multiprocessor computer of claim 12 wherein said
`computer
`is operative to automatically detect parallel
`regions of application program code that are capable of
`being executed in said memory algorithm processor.
`24. The multiprocessor computer of claim 23 wherein said
`memory algorithm processor is configurable by said appli-
`cation program code.
`
`
`