UNITED STATES PATENT AND TRADEMARK OFFICE
____________________

BEFORE THE PATENT TRIAL AND APPEAL BOARD
____________________

INTEL CORPORATION,
Petitioner

v.

FG SRC LLC,
Patent Owner
____________________

CASE NO.: UNASSIGNED
PATENT NO. 7,149,867
____________________

DECLARATION OF STANLEY SHANFIELD, PH.D., IN SUPPORT OF
PETITIONER’S OPPOSITION TO PATENT OWNER’S MOTION TO AMEND

Mail Stop PATENT BOARD
Patent Trial and Appeal Board
U.S. Patent and Trademark Office
P.O. Box 1450
Alexandria, VA 22313-1450

Intel Exhibit 1034 - 1
Declaration in Support of Intel’s Response to Patent Owner’s Motion to Amend

TABLE OF CONTENTS

                                                                        Page
I.    INTRODUCTION ........................................................ 1
      A.  Educational and Work Background ................................. 1
      B.  Materials Considered ............................................ 2
II.   LEVEL OF ORDINARY SKILL IN THE ART .................................. 3
III.  DATA MOVEMENT AMENDMENT ............................................. 3
IV.   DATA COMPUTATION AMENDMENT .......................................... 6
V.    TRIMBERGER ......................................................... 10
IX.   RESERVATION OF RIGHTS .............................................. 17
X.    CONCLUSION ......................................................... 17
I.   INTRODUCTION

1.   My name is Stanley Shanfield, Ph.D., and I am a Technical Director at Draper Laboratory in Cambridge, Massachusetts. I have been retained to prepare this declaration as an expert witness on behalf of Petitioner Intel Corporation (“Intel” or “Petitioner”). In this report, I provide my opinions concerning the scope and patentability of the amended claims submitted in the Patent Owner’s motion to amend the claims of U.S. Patent No. 7,149,867 (“’867 patent”). I also provide herein the technical bases for these opinions, as appropriate. This declaration contains statements of my opinions formed to date, and the bases and rationale for those opinions. I may offer additional opinions based on further review of materials presented throughout the course of this proceeding, including any additional opinions and/or testimony of Patent Owner’s expert witnesses.

2.   For my efforts in connection with the preparation of this declaration, I have been compensated at my usual and customary rate for this type of consulting activity. My compensation is in no way contingent on the substance of my opinions or the results of this or any other proceeding relating to the ’867 patent.

A.   Educational and Work Background

3.   My educational background and qualifications are set forth generally in my prior declaration supporting Intel’s Petition for IPR (see EX1006 ¶¶ 3-16) and in my curriculum vitae, which was submitted as Attachment A thereto.
B.   Materials Considered

4.   I have considered information from various sources in forming my opinions. The following is a listing of the materials that I considered in forming the opinions in this declaration:

  •  The ’867 patent and its prosecution file history (EX1001, EX1002);
  •  Intel’s Petition for IPR (Paper No. 1);
  •  X. Zhang et al., “Architectural Adaptation for Application-Specific Locality Optimizations,” IEEE (1997) (EX1003);
  •  R. Gupta, “Architectural Adaptation in AMRM Machines,” IEEE (2000) (EX1004);
  •  A. Chien and R. Gupta, “MORPH: A System for Robust Higher Performance Using Customization,” IEEE (1996) (EX1005);
  •  My initial declaration submitted with Intel’s Petition (EX1006);
  •  The Board’s Institution Decision in this proceeding (Paper 13);
  •  Patent Owner’s Motion to Amend the Claims (Paper 26);
  •  Declaration of William Mangione-Smith, Ph.D., in Support of Patent Owner’s Motion to Amend the Claims (EX2027);
  •  Patent Owner’s Response (“POR”) in this proceeding (Paper 34);
  •  Declaration of William Mangione-Smith, Ph.D., in Support of the POR (EX2028);
  •  U.S. Patent No. 5,737,631 to Trimberger (“Trimberger”); and
  •  Any other materials referenced in this declaration.

II.  LEVEL OF ORDINARY SKILL IN THE ART

5.   My opinions in this declaration are based on the knowledge of a person of ordinary skill in the art (“POSA”) at the time of the ’867 patent.

6.   My determination of the level of ordinary skill in the art is set forth in my prior declaration supporting Intel’s Petition. See EX1006 ¶¶ 66-67.

III. DATA MOVEMENT AMENDMENT

7.   I understand that the Patent Owner amended claim 1 to replace the word “retrieves” with “transfers” in the amended limitation, “wherein the data prefetch unit [retrieves] transfers only computational data required by the algorithm from a second memory . . . and places the [retrieved] computational data in the first memory.” (MTA 4). In my opinion, that amendment changes the scope of the claim because it no longer requires the data prefetch unit itself to do the prefetching from the second memory. Instead, under the Patent Owner’s amended claim language, another unit altogether could retrieve the computational data from the second memory. In that case, the data prefetch unit merely needs to act as a conduit in transferring that data and placing it in the first memory in order to satisfy the amended claim. Thus, a system where the data prefetch unit is not actually required to retrieve the computational data from memory would fall within the scope of the amended claim but not the original claim, because the original claim requires the data prefetch unit itself to perform the data retrieval.
8.   But that system would be inconsistent with how a POSA would understand the ’867 patent specification, because the specification specifically identifies the data prefetch unit as the component in the system that is responsible for prefetching the computational data and moving it throughout the memory hierarchy. See EX1001 5:40-43. Whether that data movement is a simple copy or a more complex operation such as an indexed strided copy, it is the data prefetch unit that performs the data movement in the memory hierarchy in the ’867 patent. EX1001 7:34-48, Fig. 4:

[Figure 4 of the ’867 patent]

9.   The data prefetch unit must supply the computational data to the logic blocks in the processor in time for execution, so it is the data prefetch unit that orchestrates the data prefetching from the memory hierarchy, not some other unit. In my opinion, this meaning is also consistent with the use of the terminology “data prefetch unit” as understood by a POSA at the time, as the term “fetch” implies a retrieval process.
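The retrieval-and-placement behavior described in paragraphs 7-9 can be illustrated with a short software sketch. This is my own illustrative model, not code from the ’867 patent: a hypothetical prefetch unit that itself retrieves only the computational data the algorithm requires from a second (larger, slower) memory, here using an indexed strided access pattern, and places only that data in a first (smaller, faster) memory.

```python
# Illustrative model only: a hypothetical "data prefetch unit" that itself
# retrieves only the computational data an algorithm requires from a second
# memory and places it in a first memory. The function name and the
# strided-access pattern are my own; this is not the '867 patent's design.

def prefetch(second_memory, first_memory, start, stride, count):
    """Retrieve `count` words from second_memory beginning at `start` with
    the given `stride` (an indexed strided copy), placing only those words
    into first_memory."""
    for i in range(count):
        addr = start + i * stride
        first_memory[i] = second_memory[addr]  # retrieval + placement
    return first_memory

# Example: an algorithm that needs every 4th word of a 16-word region.
second = list(range(100, 116))  # stand-in for the second memory
first = [None] * 4              # stand-in for the first memory
prefetch(second, first, start=0, stride=4, count=4)
print(first)                    # -> [100, 104, 108, 112]
```

In this sketch the prefetch unit both retrieves and places the data; a "conduit" system of the kind the amended language would permit would instead receive the words from some other retrieving unit.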
10.  In addition, the proposed change allows for the retrieval of more than only “computational data required by the algorithm.” The “only” limitation in the amended claim language is open ended and does not restrict what is retrieved from memory. A system with a data prefetch unit that retrieves computational data required by the algorithm and other data from memory would fall within the amended claim’s scope (so long as it places only the computational data in the cache or other memory closer to the processing resources) but not the original claim’s scope, because the original claim requires that the data prefetch unit retrieve “only computational data required by the algorithm from a second memory.”

11.  The express language of original claim 1 requires the data prefetch unit to both retrieve and place the computational data required by the algorithm and nothing else. See EX1001 12:47-48; EX1033-2. The proposed change amends the claim’s scope to encompass a data prefetch unit that retrieves computational data required by the algorithm as well as other data from a second memory but places only the computational data in the first memory.
`
`5
`
`Intel Exhibit 1034 - 7
`
`

`

`Declaration in Support of Intel’s Response to Patent Owner’s Motion to Amend
`
`IV. DATA COMPUTATION AMENDMENT
`
`12.
`
`I also understand that the Patent Owner has amended the independent
`
`claims to add the limitation, “wherein computations performed by the algorithm are
`
`performed by an FPGA.” Performing algorithmic computations in processors —
`
`including sparse matrix computations specifically—using FPGAs was widely
`
`known to persons skilled in the art prior to the ’867 patent’s priority date. In fact, I
`
`personally worked on fast programming with sparse matrix multiplication
`
`algorithms using FPGAs in around the year 2000. I also recall many papers
`
`published by IEEE on that topic at around that time. A quick search through the
`
`IEEE database revealed several papers on performing sparse matrix computations
`
`using FPGAs, enclosed as Appendices 1-4 as follows:
`
`
`
`“Area and Time Efficient Implementations of Matrix
`
`Multiplication of FPGAs” (App. 1);
`
`
`
`“A High Throughput FPGA Implementation of A Bit-Level
`
`Matrix-Matrix Product” (App. 2);
`
`
`
`“On Sparse Matrix-vector Multiplication with FPGA-based
`
`System” (App. 3); and
`
`
`
`“An FPGA Based Parameterisable System for Matrix Product
`
`Implementation” (App. 4)).
`
`Thus, a POSA working in FPGAs at that time would understand that the addition
`
`6
`
`Intel Exhibit 1034 - 8
`
`

`

`Declaration in Support of Intel’s Response to Patent Owner’s Motion to Amend
`
`of performing the claimed algorithms’ computations in an FPGA was not new and
`
`does not add any patentable distinction over the state of the art.
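The sparse matrix computation discussed here, and mapped to FPGA hardware in the appended papers, can be stated compactly in software. The following compressed-sparse-row (CSR) matrix-vector product is a generic textbook formulation offered only to illustrate the computation itself; it is not taken from, and does not represent, any particular paper’s FPGA design.

```python
# Generic compressed-sparse-row (CSR) sparse matrix-vector multiply,
# y = A @ x. Offered only to illustrate the kind of computation discussed
# in the text; the appended papers describe hardware realizations, not
# this software form.

def csr_spmv(values, col_idx, row_ptr, x):
    """values/col_idx hold A's nonzeros row by row; the half-open range
    row_ptr[i]:row_ptr[i+1] delimits row i's nonzeros."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[5, 0, 0],
#      [0, 8, 3],
#      [0, 0, 6]] stored in CSR form:
values = [5.0, 8.0, 3.0, 6.0]
col_idx = [0, 1, 2, 2]
row_ptr = [0, 1, 3, 4]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # -> [5.0, 25.0, 18.0]
```

The inner multiply-accumulate loop is the operation that FPGA implementations of the era parallelized in programmable logic.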
13.  In addition, Zhang discloses performing an algorithm’s computations in FPGA. For instance, Zhang discloses a computer architecture “that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application,” which a POSA would understand includes FPGA programmable logic. EX1003-12, abstract. Zhang teaches integrating “small blocks of programmable logic into key elements of a baseline architecture, including processing elements, components of the memory hierarchy, and the scalable interconnect, to provide architectural adaptation – the customization of architectural mechanisms and policies to match an application.” Id.-13, C2:44-49 & Fig. 2 (italics in original). Zhang therefore teaches using a reconfigurable processor comprising programmable “processing elements” that a POSA would have understood are used to perform computations in the CPU, such as the sparse matrix multiplication operations described in Zhang. Zhang specifically lists these components separately from other aspects of the processor, such as the memory hierarchy and programmable interconnect. See, e.g., id.-12 C2:39-45, -13 C2:44-49, -17 C2:48-51 & Fig. 2.
14.  Zhang discloses optimizing matrix multiplication computations in a reconfigurable processor using the customization provided by the programmable logic. See, e.g., EX1003-12, abstract (“We present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors for sparse matrix computations.”); id. C2:39-45 (“Using sparse matrix computations as examples, our results show that customization for application-specific optimizations can bring significant performance improvement.”). Zhang confirms these calculations are performed using computational elements implemented in programmable logic (which includes FPGA), noting that “by adding a small amount of programmable logic to the memory units, we can yield some benefits of having computational elements within the memory.” Id.-17 C2:48-51. Further, Zhang shows integrating programmable logic within the CPU itself, in addition to the cache, network interface, and memory. EX1003 Fig. 2.
15.  Thus, in my view it would have been obvious to a POSA in view of Zhang’s teaching to use an FPGA to perform the computations required by the claimed algorithm. Additionally, given Zhang’s teaching of optimizing sparse matrix computations and using programmable logic for reconfigurability to adapt to specific applications, a POSA would have looked for ways to implement what is taught in Zhang, including using FPGAs, which were well known for use in performing sparse matrix computations. The IEEE articles cited above provide documentary evidence substantiating that a POSA would have known how to perform sparse matrix computations, such as those taught in Zhang, using FPGAs. Thus, in view of Zhang’s teachings, a POSA would have understood Zhang’s programmable logic and reconfigurability to be implemented with FPGA, such that the “computations performed by the algorithm” are performed by FPGA programmable logic. See Pet. 11-12, 46; EX1006 ¶¶ 72-73, 155-56.
16.  In addition, a POSA would understand Zhang’s use of FPGAs as one of a finite number of options for implementing its computational units in programmable logic, and that choice would have been obvious to a POSA in view of Zhang’s teaching, which itself identifies using FPGAs as one of two options along with LSI logic. See EX1003-17 C1:22-26. Other options would include predecessor programmable logic technologies such as programmable logic devices (PLDs), programmable logic arrays (PLAs), programmable array logic (PAL), programmable read-only memories (PROMs), erasable programmable read-only memories (EPROMs), electrically-erasable programmable read-only memories (EEPROMs), etc. See EX1006, Attachment B at 25-49. A POSA would have been motivated to combine the teachings of Zhang and Gupta because Zhang teaches using programmable processing elements in a reconfigurable processor architecture that performs sparse matrix computations, and Gupta teaches a specific prototype implementation of that architecture and technique, including reconfigurable logic blocks in FPGAs for application-specific cache organization policies, hardware-assisted blocking, prefetching, and dynamic cache structures, see EX1004-9 C1:3-12, and it would have been obvious for a POSA to look to Gupta as one way to implement Zhang’s reconfigurable data prefetch architecture.
17.  I further understand that the Patent Owner asserts that the data computation amendment excludes use of conventional CPUs. See Paper 26 at 5-6. But at the time of the ’867 patent, conventional CPUs were routinely used as embedded components in FPGAs, as well as part of larger multiprocessor systems that compute algorithms using FPGA components. In fact, the ’867 patent specification specifically expresses the intent to be used as part of such an overall system. See EX1001 6:19-25. In any case, excluding a conventional CPU would contravene the ’867 patent itself, which expressly teaches using its reconfigurable processor with conventional computing platforms. Id. Fig. 2 & 6:15-25 (“In a particular implementation, a number of RPs 100 are implemented within a memory subsystem of a conventional computer . . . . In this manner the RPs 100 can be accessed by memory operations and so coexist well with a more conventional hardware platform.”).
V.   TRIMBERGER

18.  The data computation amendment is also obvious in view of the instituted art (Zhang and Gupta and/or Zhang, Gupta, and Chien) in further combination with Trimberger (EX1037), which teaches using FPGAs to perform computations of algorithms. Trimberger discloses a technique to improve processor performance utilizing a microprocessor having conventional processor execution units configured in parallel with reprogrammable execution units. See EX1037 1:7-11 & Fig. 1 (“The present invention relates to techniques to improve the speed of microprocessors using reprogrammable hardware; and more particularly to the use of reprogrammable execution units in parallel with predefined execution units in a data processing system.”):

[Trimberger, Fig. 1]
19.  The purpose of the programmable execution unit(s) 30 is to perform complex or special-purpose functions that may not be available in the instruction set of a general purpose processor or that would otherwise require implementing a special-purpose processor to perform, which has a number of drawbacks. See id. 1:24-35, 47-50, 1:63-2:2, 2:44-51. The reprogrammable execution units are implemented in FPGA programmable logic and form a reprogrammable instruction set accelerator (“RISA”) that is configured to accelerate execution of user-defined, special-purpose instructions in general purpose processors for significant performance improvement across a variety of algorithms of particular user applications. See id. 2:52-59, 5:12-18, 5:65-6:1, 6:36-41. The RISA can be reprogrammed with different instructions at different times when the user changes from one application to another. See id. 3:7-9, 4:42-45. The instructions may also be extracted on the fly and used for dynamically reconfiguring the reprogrammable execution unit(s) on the RISA to perform the special functions. See id. 4:66-5:6, 6:10-27, 10:3-24.
20.  Trimberger teaches a microprocessor for executing computer programs which includes conventional execution units executing conventional processor instructions, configured in parallel with one or more reprogrammable execution units for executing special-purpose instructions for a particular user-defined function. See id. 4:29-31, 8:60-65. Some examples of the special-purpose computations that can be performed by the RISA include complex algorithms for encryption/decryption, polynomial evaluation, and spreadsheet resolution algorithms. Id. 3:10-27. Trimberger teaches that the RISA can be reprogrammed to adapt to each user program. Id. 3:28-33. A POSA would understand these algorithms are instantiated in the RISA whenever the reprogrammable execution units are programmed and/or reprogrammed. Trimberger thus discloses a reconfigurable microprocessor implemented in FPGA that instantiates user algorithms as hardware, and specifically discloses using FPGA programmable logic to perform the computations performed by the algorithms. Id. 9:65-10:5, 10:19-24.
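The parallel arrangement described in paragraphs 18-20, as I read it, can be caricatured in software as a dispatch table: conventional opcodes go to fixed execution units, while a user-defined special-purpose opcode is routed to a reprogrammable slot that can be reloaded when the user changes applications. Everything below (the opcode names and the sample special functions) is my own hypothetical illustration, not Trimberger’s circuitry.

```python
# Hypothetical software caricature of a RISA-style arrangement: fixed
# execution units in parallel with one reprogrammable unit. All names and
# the sample special-purpose functions are my own illustration, not
# Trimberger's disclosure.

FIXED_UNITS = {
    "add": lambda a, b: (a + b) & 0xFFFFFFFF,
    "sub": lambda a, b: (a - b) & 0xFFFFFFFF,
}

class RISA:
    """Reprogrammable slot: loaded with a different special-purpose
    function when the user changes from one application to another."""
    def __init__(self):
        self.fn = None
    def reprogram(self, fn):
        self.fn = fn

risa = RISA()

def execute(opcode, a, b):
    if opcode in FIXED_UNITS:      # conventional execution unit
        return FIXED_UNITS[opcode](a, b)
    return risa.fn(a, b)           # user-defined special-purpose instruction

risa.reprogram(lambda a, b: a * b)              # "program" a made-up function
print(execute("add", 2, 3))      # -> 5, handled by a fixed unit
print(execute("special", 6, 3))  # -> 18, handled by the reprogrammable unit
risa.reprogram(lambda a, b: (a ^ b) & 0xFFFFFFFF)
print(execute("special", 6, 3))  # -> 5 after reprogramming (6 XOR 3)
```

The point of the sketch is only the routing: the same instruction stream reaches both kinds of units, and the reprogrammable one changes behavior without changing the fixed units.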
21.  Trimberger also expressly teaches integrating FPGA reprogrammable execution units configured in parallel with fixed execution units of a conventional microprocessor. See EX1037 1:7-11, 2:66-3:3, 3:38-47, 4:29-45, 6:36-44 & Figs. 1-2. And that is consistent with the ’867 patent, which likewise teaches using the purported invention in combination with conventional computing platforms. EX1001 6:19-25.
22.  A POSA would have been motivated to combine Trimberger with the prior art in the instituted grounds. Each of Zhang, Gupta, Chien, and Trimberger (i) concerns the same field of reconfigurable processor architecture, (ii) incorporates FPGA reconfigurable logic components in a processor, and (iii) is designed to improve overall performance. Id. 2:66-3:3 (“the RISA can be tightly coupled to instruction and data paths in parallel with the predefined execution units in microprocessors … [to] provide[] fast and efficient execution of new instructions and significant improvement in performance.”); EX1003-14 C1:46-52 (“The key elements of the MORPH (MultiprocessOr with Reconfigurable Parallel Hardware) architecture consists of processing and memory elements embedded in a scalable interconnect. With a small amount of programmable logic integrated with key elements of the system, the proposed MORPH architecture aims to exploit architectural customization for a broad range of purposes …”); EX1004-11 C1:28-32 (“The focus of the AMRM project is on architectural adaptations that close the gap between processor and memory speed by intelligent placement of data through the memory hierarchy.”).
23.  In addition, Zhang expressly discusses the desirability of providing architectural adaptations to its processor functional units in future work to improve instruction throughput, and Trimberger specifically discloses using reprogrammable execution units to accelerate instruction processing in a processor architecture. EX1003-18 C1:22-29; EX1037 2:66-3:2 & Fig. 2 (showing a microprocessor with conventional execution unit 100 configured in parallel with reprogrammable execution unit in RISA FPGA 120):

[Trimberger, Fig. 2]
24.  A POSA desiring to implement reconfigurable functional units in Zhang’s processor would look to Trimberger as one example of how to do that in a microprocessor with reprogrammable functional units in parallel with conventional execution units. In particular, a POSA would understand that Trimberger’s reprogrammable execution units are combinable with the execution units that exist within the MORPH processor architecture. A POSA would understand that Trimberger’s reprogrammable execution units, once configured on the processor (i.e., instantiated), behave as any other functional unit in the processor, for use either as a substitute for one of the execution units in the MORPH processor architecture or arranged in parallel with them. Thus, this is a simple substitution of one known element for another, and a POSA would have had a reasonable expectation of success in applying this substitution. In addition, a POSA would recognize such substitution as advantageous because it enables users to implement special-purpose functions in the MORPH processor by way of adding Trimberger’s reprogrammable execution units that are configured for performing those special-purpose functions using the programmed instructions disclosed in Trimberger. EX1037 2:56-59, 4:31-45, 8:60-65. The combination would also have been advantageous because Trimberger’s use of reprogrammable execution units in microprocessors along with programmed instructions improves the speed of processing those instructions in a microprocessor. See EX1037 1:7-11, 2:66-3:2.
25.  Further, Trimberger’s functional units are compatible with the MORPH processor architecture because they are configured for use in a conventional microprocessor that executes instructions and are connected directly to the microprocessor’s data and address buses. See EX1037 7:4-34, 57-63 & Fig. 2 (buses 101 and 102). The MORPH/AMRM processor architecture is also configured to execute CPU instructions and includes data and address buses to which a POSA would understand Trimberger’s execution units can connect. See EX1003-15 Fig. 4; EX1004 Fig. 1.
26.  The reprogrammable execution units in Trimberger could be used in the MORPH/AMRM processor architecture and arranged in parallel with the processor’s existing functional units. A POSA would understand that Trimberger’s teachings of reprogrammable execution units disposed in a conventional computer architecture that executes CPU instructions would be combinable as functional units within Zhang’s reconfigurable processor.

27.  A POSA would therefore have been motivated to combine Trimberger’s teaching of reprogrammable execution units with the execution units existing in Zhang’s CPU, which is adapted to improve over conventional processors.

28.  Thus, the data computation amendment would have been obvious to a POSA at the time of the ’867 patent in view of the instituted prior art in combination with Trimberger.
IX.  RESERVATION OF RIGHTS

29.  My opinions are based upon the information that I have considered to date. I reserve the right, however, to supplement my opinions in the future to respond to any arguments or to consider new information that becomes available to me.

X.   CONCLUSION

30.  For the reasons given above, it is my opinion that the claim amendments in the Patent Owner’s Motion to Amend are not patentable over the instituted MORPH/AMRM processor architecture.
By:

Stanley Shanfield, Ph.D.

8/18/21
Appendix 1
Area and Time Efficient Implementations of
Matrix Multiplication on FPGAs*

Ju-wook Jang¹, Seonil Choi², and Viktor K. Prasanna²

¹Electronic Engineering, Sogang University, Seoul, Korea (jjang@sogang.ac.kr)
²Electrical Engineering-Systems, University of Southern California, Los Angeles, CA, U.S.A. ({seonilch, prasanna}@usc.edu)
Abstract

We develop new algorithms and architectures for matrix multiplication on configurable hardware. These designs significantly reduce the latency as well as the area. Our designs improve the previous designs in [7] and [1] in terms of the area/speed metric, where the speed denotes the maximum achievable running frequency. The area/speed metrics for the designs in [7], [1], and our design are 14.45, 4.93, and 2.35, respectively, for 4 x 4 matrix multiplication. The latency of the design in [7] is 0.57µs, while our design takes 0.15µs using 18% less area. The area of our designs is smaller by 11% - 46% compared with the best known systolic designs based on [9] with the same latency for the matrices of sizes 3 x 3 - 12 x 12. The performance improvements tend to grow with the problem size.

1  Introduction

Matrix multiplication is a frequently used kernel operation in a wide variety of graphic, image, robotics, and signal processing applications. Several signal and image processing operations consist of matrix multiplication. Most previous implementations of matrix multiplication on FPGAs focused on trade-offs between area and maximum running frequency [1][7]. In this paper we develop designs which provide improved trade-offs between area and latency.

We significantly reduce the number of registers involved in the data movement. n² registers of 16-bit words and n² + 6n²/s, (1 ≤ s ≤ n), registers of 8-bit words are involved in the data movement for n x n matrix multiplication in [9]. Only 4n registers of 8-bit words are involved in the data movement in our design (based on Theorem 1). A closed form function representing the area of a design is derived by incorporating architecture/algorithm details and the FPGA vendors’ specifications. The function enables the designer to make trade-offs between area and latency before launching time-consuming low-level simulation.

* This work is supported by the DARPA Power Aware Computing and Communication Program under contract F33615-C-00-1633 monitored by Wright Patterson Air Force Base and in part by the National Science Foundation under award No. 99000ul3. Ju-wook Jang’s work is supported by Ministry of Information and Communication, Korea.
Mencer et al. [7] implemented matrix multiplication on the Xilinx XC4000E FPGA device. Their design employs bit-serial multipliers using Booth encoding. They focused on trade-offs between area and maximum running frequency with parameterized circuit generators. For a specific example of 4 x 4 matrix multiplication, 954 CLBs are used to achieve the maximum running frequency of 33 MHz.

Amira et al. [1] improved the design in [7] using the Xilinx XCV1000E FPGA device. Modified Booth-encoder multiplication was used along with Wallace tree addition. Emphasis was once again on maximizing the running frequency. For a specific example of 4 x 4 matrix multiplication, 296 CLBs are used to achieve the maximum running frequency of 60 MHz. Area/speed (the number of CLBs divided by the maximum running frequency) was used as a performance metric.
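As a reading aid, the area/speed metric just defined can be checked directly from the figures given for the design of [1] (296 CLBs at 60 MHz). The figures for [7] and for the authors’ own design additionally involve the cross-device area translation described in Section 3.4, so only the [1] value is recomputed in this sketch.

```python
# Quick check of the area/speed metric defined in the text: number of
# CLBs divided by maximum running frequency in MHz. Only the design of
# [1] is recomputed here; the other reported values involve a
# cross-device area translation (Section 3.4).

def area_speed(clbs, mhz):
    return clbs / mhz

print(round(area_speed(296, 60), 2))  # -> 4.93, matching the reported value
```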
We improve the previous designs in [7] and [1] in terms of the area/speed metric. The area/speed metrics for the designs in [7], [1], and our designs are 14.45, 4.93, and 2.35, respectively. The design area denotes the number of CLBs, and its translation is performed in terms of the equivalent amount of logic on different FPGA devices. Details can be found in Section 3.4. The energy efficiency of our designs is also evaluated and reported as a separate work [5].

Prasanna and Tsai [9] achieved the theoretical lower bound in latency for matrix multiplication with a design based on a linear array. Their method provides trade-offs between the number of registers and the latency. For performance comparison, we have

0-7803-7574-2/02/$17.00 ©2002 IEEE                                        93
implemented their design on the same target FPGA device used in the implementation of our designs. The areas of our designs for 3 x 3 - 12 x 12 matrix multiplications (based on Theorem 1 in Section 2) are smaller by 11% - 46% compared with the designs based on [9] with the same latency. Our designs (based on Theorem 2 in Section 2) also improve the performance in terms of the area x latency (AT) metric by 53.2% - 69% for matrices of sizes 3 x 3 - 12 x 12. Experiments on larger matrices show that the performance improvements increase with the matrix size.

The rest of the paper is organized as follows. Algorithms and architectures for area and time efficient implementation of matrix multiplication are presented in Section 2. Section 3 describes implementation and compares performance with previous designs. Section 4 concludes the paper.

2  Fast Algorithms for Matrix Multiplication

Compared with the design in [9], our algorithm significantly reduces the number of registers involved in data movement. n² registers of 16-bit words and n² + 6n²/s, (1 ≤ s ≤ n), registers of 8-bit words are involved in the data movement in [9]. Only 4n registers of 8-bit words are involved in the data movement in our design.

We present our algorithms and architectures in two theorems and two corollaries. Pseudo-code for cycle-specific data movement and the detailed architectures are also provided. Theorem 1 improves the best known algorithm for matrix multiplication [9] with respect to the number of registers. This leads to optimal time complexity with a leading coefficient of 1 for matrix multiplication on a linear array. It uses only two I/O ports, which makes our design attractive for hosts with limited I/O capability. Theorem 1 is extended for trade-offs between latency and area using block multiplication (Corollary 1). While the short proof of Theorem 1 appears in [5], the full and extended proof is provided in this paper.

The second algorithm is developed to exploit future increases in