DECLARATION OF GORDON MACPHERSON

I, Gordon MacPherson, am over twenty-one (21) years of age. I have never been convicted of a felony, and I am fully competent to make this declaration. I declare the following to be true to the best of my knowledge, information and belief:

1. I am Director Board Governance & IP Operations of The Institute of Electrical and Electronics Engineers, Incorporated (“IEEE”).

2. IEEE is a neutral third party in this dispute.

3. Neither I nor IEEE itself is being compensated for this declaration.

4. Among my responsibilities as Director Board Governance & IP Operations, I act as a custodian of certain records for IEEE.

5. I make this declaration based on my personal knowledge and information contained in the business records of IEEE.

6. As part of its ordinary course of business, IEEE publishes and makes available technical articles and standards. These publications are made available for public download through the IEEE digital library, IEEE Xplore.

7. It is the regular practice of IEEE to publish articles and other writings, including article abstracts, and make them available to the public through IEEE Xplore. IEEE maintains copies of the publications in the ordinary course of its regularly conducted activities.

8. The articles below have been attached as Exhibits A – C to this declaration:

A. X. Zhang, et al., “Architectural adaptation for application-specific locality optimizations”, Proceedings International Conference on Computer Design VLSI in Computers and Processors, October 12 – 15, 1997.

B. R. Gupta, “Architectural adaptation in AMRM machines”, Proceedings IEEE Computer Society Workshop on VLSI 2000. System Design for a System-on-Chip Era, April 27 – 28, 2000.

445 Hoes Lane, Piscataway, NJ 08854

DocuSign Envelope ID: 5BBAD0DA-1565-45BD-9AB4-39D99A34F3C8

Intel Exhibit 1027 - 1
C. A.A. Chien and R.K. Gupta, “MORPH: a system architecture for robust high performance using customization (an NSF 100 TeraOps point design study)”, Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96), October 27 – 31, 1996.

9. I obtained copies of Exhibits A – C through IEEE Xplore, where they are maintained in the ordinary course of IEEE’s business. Exhibits A – C are true and correct copies of the Exhibits, as they existed on or about March 19, 2021.

10. The articles and abstracts from IEEE Xplore show the date of publication as well as additional publication information. IEEE Xplore populates this information using the metadata associated with the publication, which is created and maintained as part of IEEE’s standard business practices.

11. X. Zhang, et al., “Architectural adaptation for application-specific locality optimizations” was published as part of the Proceedings of the International Conference on Computer Design VLSI in Computers and Processors. The International Conference on Computer Design VLSI in Computers and Processors was held from October 12 – 15, 1997. In accordance with IEEE’s standard practices, copies of the proceedings were made available no later than the last day of the conference. The article is currently available for public download from the IEEE digital library, IEEE Xplore.

12. R. Gupta, “Architectural adaptation in AMRM machines” was published as part of the Proceedings of the IEEE Computer Society Workshop on VLSI 2000. The IEEE Computer Society Workshop on VLSI 2000 was held from April 27 – 28, 2000. In accordance with IEEE’s standard practices, copies of the proceedings were made available no later than the last day of the conference. The article is currently available for public download from the IEEE digital library, IEEE Xplore.

13. A.A. Chien and R.K. Gupta, “MORPH: a system architecture for robust high performance using customization (an NSF 100 TeraOps point design study)” was published as part of the Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96). The 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96) was held from October 27 – 31, 1996. In accordance with IEEE’s standard practices, copies of the proceedings were made available no later than the last day of the conference. The article is currently available for public download from the IEEE digital library, IEEE Xplore.

14. I hereby declare that all statements made herein of my own knowledge are true and that all statements made on information and belief are believed to be true, and further that these statements were made with the knowledge that willful false statements and the like are punishable by fine or imprisonment, or both, under 18 U.S.C. § 1001.

I declare under penalty of perjury that the foregoing statements are true and correct.

Executed on: 3/22/2021

EXHIBIT A

3/19/2021    Architectural adaptation for application-specific locality optimizations | IEEE Conference Publication | IEEE Xplore

Architectural adaptation for application-specific locality optimizations

Publisher: IEEE

Xingbin Zhang; A. Dasdan; M. Schulz; R.K. Gupta; A.A. Chien    All Authors

26 Full Text Views | 1 Patent Citation | 4 Paper Citations

Metadata
Abstract:
We propose a machine architecture that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application. This approach presents an improvement over the traditional approach of exploiting programmable logic as a separate co-processor by preserving machine usability through software, and over traditional computer architecture by providing application-specific hardware. We present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors for sparse matrix computations. We demonstrate that application-specific hardware and policies can provide substantial improvements in performance on a per application basis. Based on these preliminary results, we propose
that an application-driven machine customization provides a promising approach to achieve high performance and combat performance fragility.

https://ieeexplore.ieee.org/document/628862

Published in: Proceedings International Conference on Computer Design VLSI in Computers and Processors

Date of Conference: 12-15 Oct. 1997
Date Added to IEEE Xplore: 06 August 2002
Print ISBN: 0-8186-8206-X
Print ISSN: 1063-6404
INSPEC Accession Number: 5761626
DOI: 10.1109/ICCD.1997.628862
Publisher: IEEE
Conference Location: Austin, TX, USA

Architectural Adaptation for Application-Specific Locality Optimizations

Xingbin Zhang*    Ali Dasdan*    Martin Schulz†    Rajesh K. Gupta‡    Andrew A. Chien*

*Department of Computer Science
University of Illinois at Urbana-Champaign
{zhang, dasdan, achien}@cs.uiuc.edu

†Institut für Informatik
Technische Universität München
schulzm@informatik.tu-muenchen.de

‡Information and Computer Science, University of California at Irvine
rgupta@ics.uci.edu
Abstract

We propose a machine architecture that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application. This approach presents an improvement over the traditional approach of exploiting programmable logic as a separate co-processor by preserving machine usability through software, and over traditional computer architecture by providing application-specific hardware assists. We present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors for sparse matrix computations. We demonstrate that application-specific hardware assists and policies can provide substantial improvements in performance on a per application basis. Based on these preliminary results, we propose that an application-driven machine customization provides a promising approach to achieve high performance and combat performance fragility.

1 Introduction

Technology projections for the coming decade [1] point out that system performance is going to be increasingly dominated by intra-chip interconnect delay. This presents a unique opportunity for programmable logic as the interconnect dominance reduces the contribution of per stage logic complexity on performance and the marginal costs of adding switching logic in the interconnect. However, the traditional co-processing architecture of exploiting programmable logic as a specialized functional unit to deliver a specific application suffers from the problem of machine retargetability. A system generated using this approach typically cannot be retargeted to another application without repartitioning hardware and software functionality and reimplementing the co-processing hardware. This retargetability problem is an obstacle toward exploiting programmable logic for general purpose computing.

We propose a machine architecture that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application. We base our design on the premise that communication is already critical and getting increasingly so [17], and flexible interconnects can be used to replace static wires at competitive performance [6, 9, 20]. Our approach presents an improvement over co-processing by preserving machine usability through software and over traditional computer architecture by providing application-specific hardware assists. The goal of application-specific hardware assists is to overcome the rigid architectural choices in modern computer systems that do not work well across different applications and often cause substantial performance fragility. Because performance fragility is especially apparent on memory performance on systems with deep memory hierarchies, we present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors. Using sparse matrix computations as examples, our results show that customization for application-specific optimizations can bring significant performance improvement (10X reduction in miss rates, 100X reduction in data traffic), and that an application-driven machine customization provides a promising approach to achieve robust, high performance.

The rest of the paper is organized as follows. Section 2 presents our analyses of the technology trends. Section 3 describes our proposed architecture and the project context. We describe our case studies in Section 4 and discuss related work in Section 5. Finally, we conclude with future directions in Section 6.

1063-6404/97 $10.00 © 1997 IEEE

150

Authorized licensed use limited to: IEEE Publications Operations Staff. Downloaded on March 19,2021 at 14:38:13 UTC from IEEE Xplore. Restrictions apply.


2 Background

Technology projections for the coming decade point out a unique opportunity for programmable logic. However, the traditional co-processing approach of exploiting programmable logic suffers from the problem of machine retargetability, which limits its use for general purpose applications.

2.1 Key Technology Trends

The basis for architectural adaptation is in the key trends in semiconductor technology. In particular, the difference in scaling of switching logic speed and interconnect delays points out increasing opportunities for programmable logic circuits in the coming decade. Projections by the Semiconductor Industry Association (SIA) [1] show that on-chip system performance is going to be increasingly dominated by interconnect delays. Due to these interconnect delays, the on-chip clock periods will be limited to about 1 nanosecond, which is well above the projections based on channel length scaling [1]. Meanwhile, the unit gate delay (inverter with fanout of two) scales down to 20 picoseconds. Thus, modern day control logic consisting of 7-8 logic stages per cycle would form less than 20% of the total cycle time. This clearly challenges the fundamental design trade-off today that tries to simplify the amount of logic per stage in the interest of reducing the cycle time [14]. In addition, this points to a sharply reduced marginal cost of per stage logic complexity on the circuit-level performance.
The decreasing delay penalty for (re)programmable logic blocks compared to interconnect delays also makes the incorporation of small programmable logic blocks attractive even in custom data paths. Because the interconnect delays scale down much more slowly than transistor switching delays, in the year 2007, the delay of the average length interconnect (assuming an average interconnect length of 1000X the pitch) would correspond to approximately three gate delays (see [5] for a detailed analysis). This is in contrast to less than half the gate delay of the average interconnect in current process technology. This implies that due to purely electrical reasons, it would be preferred to include at least one interconnect buffer in a cycle time. This buffer gate when combined with a weak-feedback device would form the core of a storage element that presents less than 50% switching delay overhead (from 20ps to 30ps), making it performance competitive to replace static wires with flexible interconnect.
In view of these technology trends and advances in circuit modeling using hardware description languages (HDLs) such as Verilog and VHDL, the process of hardware design is increasingly a language-level activity, supported by compilation and synthesis tools [11, 12]. With these CAD and synthesis capabilities, programmable logic circuit blocks are beginning to be used in improving system performance.

2.2 Co-processing

The most common architecture in embedded computing systems to exploit programmable logic can be characterized as one of co-processing, i.e., a processor working in conjunction with dedicated hardware assists to deliver a specific application. The hardware assists are built using programmable circuit blocks for easy integration with the predesigned CPU. Figure 1 shows the schematic of a co-processing architecture, where the co-processing hardware may be operated under direct control of the processor, which stalls while the dedicated hardware is operational [10], or the co-processing may be done concurrently with software [13]. However, a system generated using this approach typically cannot be retargeted to another application without repartitioning hardware and software functionality and reimplementing the co-processing hardware, even if the macro-level organization of the system components remains unaffected. This presents an obstacle to exploiting programmable logic for general-purpose computing even though technology trends make it possible to do so.

Figure 1. A co-processing Architecture

3 Architectural Adaptation

We propose an architecture that integrates small blocks of programmable logic into key elements of a baseline architecture, including processing elements, components of the memory hierarchy, and the scalable interconnect, to provide architectural adaptation - the customization of architectural mechanisms and policies to match an application. Figure 2 shows our architecture. Architectural adaptation can be used in the bindings, mechanisms, and policies on the interaction of processing, memory, and communication resources while keeping the macro-level organization the same and thus preserving the programming model for developing applications. Depending upon the hardware technology used and the support available from the runtime environment, this adaptation can be done statically or at runtime.

151

Figure 2. An Architecture for Adaptation

Architectural adaptation provides the ability to use application-specific hardware assists to overcome the rigid architectural choices in modern computer systems that do not work well across different applications and often cause substantial performance fragility. In particular, the integration of programmable logic with memory components enables application-specific locality optimizations. These optimizations can be designed to overcome long latency and limited transfer bandwidth in the memory hierarchy. In addition, because the entire application remains in software while the underlying hardware is adapted for system performance, our approach improves over co-processing architectures by preserving machine usability through software. The main disadvantage of our approach is the potential increase in system design and verification time due to the addition of programmable logic. We believe that the advances in design technology will address the increase of logic complexity.

3.1 Project Context

Our study is in the context of the MORPH [5] project. The MORPH architecture consists of processing and memory elements embedded in a scalable interconnect. With a small amount of programmable logic integrated with key elements of the system, the proposed MORPH architecture aims to exploit architectural customization for a broad range of purposes such as:

• control over computing node granularity (processor-memory association)
• interleaving (address-physical memory element mapping)
• cache policies (consistency model, coherence protocol, object method protocols)
• cache organization (block size or objects)
• behavior monitoring and adaptation

As an example of its flexibility, MORPH could be used to implement either a cache coherent machine, a non-cache coherent machine, or even clusters of cache coherent machines connected by put/get or message passing. In this paper, we focus on architectural adaptation of the memory system for locality optimizations.

4 Case Studies

We present two case studies of architectural adaptation for application-specific locality optimizations. On architectures with deep memory hierarchies, the bandwidth and access latency differentials across the memory hierarchies can span several orders of magnitude, making locality optimizations critical for performance. Although compiler optimizations can be effective for regular applications such as dense matrix multiply, irregular applications can greatly benefit from architectural support. However, numerous studies have shown that no fixed architectural policies, e.g., for cache organization, work well for all applications, causing performance fragility across different applications. We present two case studies of architectural adaptation using application-specific knowledge to enhance latency tolerance and efficiently utilize network bisection on multiprocessors.

4.1 Experimental Methodology

As our application examples, we use the sparse matrix library SPARSE developed by Kundert and Sangiovanni-Vincentelli (version 1.3, available at http://www.netlib.org/sparse/), and a traditional sparse matrix multiply routine with an efficient implementation. This library represents sparse matrices using row and column linked lists of matrix elements as shown in Figure 3. Only nonzero elements are represented, and elements in each row and column are connected by singly linked lists via the nextRow and nextCol fields. Space for matrix elements is allocated dynamically in blocks of elements for efficiency. There are also several one-dimensional arrays for storing the root pointers for row and column lists.

152

struct MatrixElement {
    Complex Val;
    int row, col;
    /* other fields */
    struct MatrixElement *nextRow, *nextCol;
};

Figure 3. Sparse Library Data Structures
Figure 4 shows the prefetcher implementation using programmable logic integrated with the L1 cache. The prefetcher requires two pieces of application-specific information: the address ranges and the memory layout of the target data structures. The address range is needed to indicate memory bounds where prefetching is likely to be useful. This is application dependent, which we determined by inspecting the application program, but can easily be supplied by the compiler. The program sets up the required information and can enable or disable prefetching at any point of the program. Once the prefetcher is enabled, however, it determines what and when to prefetch by checking the virtual addresses of cache lookups to check whether a matrix element is being accessed.

[Figure 4: prefetcher logic integrated with the L1 cache, observing virtual addresses and data flowing between the processor, the L1 cache, and the L2 cache.]

Table 1. Simulation Parameters

                     L1 Cache                       L2 Cache
Line Size            32B or 64B                     32B or 64B
Associativity        1                              2
Cache Size           32KB                           512KB
Write Policy         Write back + Write allocate    Write back + Write allocate
Replacement Policy   Random                         Random
Transfer Rate        16B/5 cycles (L1-L2)           8B/15 cycles (L2-Mem)

Figure 4. Organizations of Prefetcher Logic

The first prefetching example targets records spanning multiple cache lines and, for our example, prefetches all fields of a matrix element structure whenever some field of the element is accessed. The pseudocode of this prefetching scheme for the sparse matrix example is shown below, assuming a cache line size of 32 bytes, a matrix element padded to 64 bytes, and a single matrix storage block aligned at a 64-byte boundary. Prefetching is triggered only by read misses. Because each matrix element spans two cache lines, the prefetcher generates an additional L2 cache lookup address from the given physical address (assuming a lock-up free L2 cache) that prefetches the other cache line not yet referenced.

4.2 Architectural Adaptation for Latency Tolerance

Our first case study uses architectural adaptation for prefetching. As the gap between processor and memory speed widens, prefetching is becoming increasingly important to tolerate the memory access latency. However, oblivious prefetching can degrade a program's performance by saturating bandwidth. We show two example prefetching schemes that aggressively exploit application access pattern information.

/* Prefetch only if vAddr refers to the matrix */
GroupPrefetch(vAddr, pAddr, startBlock, endBlock) {
    if (startBlock <= vAddr && vAddr < endBlock) {
        /* Determine the prefetch address */
        if (pAddr & 0x20) ptrLoc = pAddr - 0x20;
        else ptrLoc = pAddr + 0x20;
        <Initiate transfer of line at ptrLoc to L1 cache>
    }
}
The second prefetching example targets pointer fields that are likely to be traversed when their parent structures are accessed. For example, in a sparse matrix-vector multiply, the record pointed to by the nextRow field is accessed close in time with the current matrix element. The pseudocode below shows the prefetcher code for prefetching the row pointer, assuming a cache line size of 64 bytes. Again prefetching is triggered only by read misses, and the prefetcher generates an additional address after the initial cache miss is satisfied using the nextRow pointer value (whose offset is hardwired at setup time) embedded in the data returned by the L2 cache.¹

153

/* Prefetch only if vAddr refers to the matrix */
PointerPrefetch(data, vAddr, startBlock, endBlock) {
    if (startBlock <= vAddr && vAddr < endBlock) {
        /* Get row pointer from returned cache line */
        /* row ptr offset = 24 */
        ptrLoc = data[24];
        <Initiate transfer of elt at ptrLoc to L1 cache>
    }
}
Our prefetching examples are similar to the prefetching schemes proposed in [23], where they are shown to benefit various irregular applications. However, unlike [23], using architectural customization enables more flexible prefetching policies, e.g., multiple level prefetch, according to the application access pattern.

4.3 Architectural Adaptation for Bandwidth Reduction

Our second case study uses a sparse matrix-matrix multiply routine as an example to show architectural adaptation to improve data reuse and reduce data traffic between the memory unit and the processor. The architectural customization aims to send only used fields of matrix elements during a given computation to reduce bandwidth requirements using dynamic scatter and gather. Our scheme contains two units of logic, an address translation logic and a gather logic, shown in Figure 5.

The two main ideas are prefetching of whole rows or columns using pointer chasing in the memory module and packing/gathering of only the used fields of the matrix element structure. When the root pointer of a column or row is accessed, the gather logic in the main memory module chases the row or column pointer to retrieve different matrix elements and forwards them directly to the cache. The cache, in order to avoid conflict misses, is split into two parts: one small part acting as a standard cache for other requests and one part for the prefetched matrix elements only. The latter part has an application-specific management policy, and can be distinguished by mapping it to a reserved
¹As pointed out in [23], the implementation of this prefetching scheme is complicated by the need to translate the virtual pointer address to a physical address. We assume that the prefetcher logic can also access the TLB structure. An alternative implementation is to place the prefetcher logic in memory and forward the data of the next record to the upper memory hierarchy. This requires an additional group translation table [23] for address translation.

[Figure 5: gather logic between the processor/cache and memory; memory holds element records laid out as (Val1, RowPtr1, ColPtr1), (Val2, RowPtr2, ColPtr2), (Val3, RowPtr3, ColPtr3), ...]

Figure 5. Scatter and Gather Logic

address space. The gather logic in pseudocode is shown below.

/* Row gather: pAddr is the start of a row */
Gather(pAddr) {
    chaseAddr = pAddr;
    while (chaseAddr) {
        forward chaseAddr->Val
        forward chaseAddr->row
        chaseAddr = virtual_to_physical(chaseAddr->nextRow)
    }
}
Because the data gathering changes the storage mapping of matrix elements, in order not to change the program code, a translate logic in the cache is required to present "virtual" linked list structures to the processor. When the processor accesses the start of a row or column linked list, a prefetch for the entire row or column is initiated. Because the target location in the cache for the linked list is known, instead of returning the actual pointer to the first element, the translate logic returns an address in the reserved address space corresponding to the location of the first element in the specially managed cache region. In addition, when the processor accesses the next pointer field, the request is intercepted by the translate logic, and an address is synthesized dynamically to access the next element in this cache region. The translate logic in pseudocode is shown below.

Translate(vAddr, pAddr, newPAddr) {
    /* Check if accessing start of a row */
    if (startRowRoots <= vAddr && vAddr <= endRowRoots)
        return row location in reserved space;
    /* Similarly for column roots */
    ...
    /* Accessing packed rows */
    if (startPackedRows <= pAddr && pAddr <= endPackedRows) {
        off = pAddr & 63;          /* get field offset */
        /* row ptr. at offset 24 */
        if (off == 24)
            return pAddr + 64;     /* synthesize next addr. */
        else {
            /* Two fields at off1 and off2 are packed
               at new_off1 and new_off2 of the new element */
            if (off < off2) new_off = off - (off1 - new_off1);
            else new_off = off - (off2 - new_off2);

154


[Figure 7: bar chart comparing the data traffic volume (MB) of four schemes: Naive, SW-Blocking, HW Gather, and HW Gather+Bypass.]

Figure 7. Data traffic volume of different schemes. (Total size of non-zeros, 1.35 MB)

chine performance (frequently less than a tenth [5]). Therefore, we believe that there are significant opportunities for application-specific architectural adaptation. In this paper, we have demonstrated mechanisms for latency hiding and required bandwidth reduction that leverage small hardware support and do not change the programming model. Following the same methodology, we can build such assists for other applications as well. Among other examples that applications can benefit from are mechanisms for recognition of the working set size of a given application, which can be used to alter cache update policies or even use a synthesized small victim cache [16], and mechanisms for monitoring access patterns and conflicts in the caches or memory banks and reconfiguring the assists according to these patterns and conflicts. In
