John L. Hennessy
David A. Patterson

Computer Architecture
A Quantitative Approach

Fifth Edition

John L. Hennessy is the tenth president of Stanford University, where he has been a member of the faculty since 1977 in the departments of electrical engineering and computer science. Hennessy is a Fellow of the IEEE and ACM; a member of the National Academy of Engineering, the National Academy of Science, and the American Philosophical Society; and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received seven honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a leave from the university to cofound MIPS Computer Systems (now MIPS Technologies), which developed one of the first commercial RISC microprocessors. As of 2006, over 2 billion MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches. Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been adopted in modern multiprocessors. In addition to his technical activities and university responsibilities, he has continued to work with numerous start-ups both as an early-stage advisor and an investor.

David A. Patterson has been teaching computer architecture at the University of California, Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair of Computer Science. His teaching has been honored by the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the U.S. President, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to Distinguished Service Awards from ACM and CRA.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer, and the foundation of the commercial SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies and later to cloud computing. These projects earned three dissertation awards from ACM. His current research projects are the Algorithm-Machine-People Laboratory and the Parallel Computing Laboratory, where he is director. The goal of the AMP Lab is to develop scalable machine learning algorithms, warehouse-scale-computer-friendly programming models, and crowd-sourcing tools to gain valuable insights quickly from big data in the cloud. The goal of the Par Lab is to develop technologies to deliver scalable, portable, efficient, and productive software for parallel personal mobile devices.


Computer Architecture
A Quantitative Approach

Fifth Edition

John L. Hennessy
Stanford University

David A. Patterson
University of California, Berkeley

With Contributions by

Krste Asanović
University of California, Berkeley

Jason D. Bakos
University of South Carolina

Robert P. Colwell
R&E Colwell & Assoc. Inc.

Thomas M. Conte
North Carolina State University

José Duato
Universitat Politècnica de València and Simula

Diana Franklin
University of California, Santa Barbara

David Goldberg
The Scripps Research Institute

Norman P. Jouppi
HP Labs

Sheng Li
HP Labs

Naveen Muralimanohar
HP Labs

Gregory D. Peterson
University of Tennessee

Timothy M. Pinkston
University of Southern California

Parthasarathy Ranganathan
HP Labs

David A. Wood
University of Wisconsin–Madison

Amr Zaky
University of Santa Clara

Amsterdam • Boston • Heidelberg • London
New York • Oxford • Paris • San Diego
San Francisco • Singapore • Sydney • Tokyo


Acquiring Editor: Todd Green
Development Editor: Nate McFadden
Project Manager: Paul Gottehrer
Designer: Joanne Blank

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

© 2012 Elsevier, Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
Application submitted

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-12-383872-8

For information on all MK publications visit our website at www.mkp.com

Printed in the United States of America
11 12 13 14 15    10 9 8 7 6 5 4 3 2 1

Typeset by: diacriTech, Chennai, India


Contents

Foreword   ix
Preface   xv
Acknowledgments   xxiii

Chapter 1   Fundamentals of Quantitative Design and Analysis   1
    1.1   Introduction   2
    1.2   Classes of Computers   5
    1.3   Defining Computer Architecture   11
    1.4   Trends in Technology   17
    1.5   Trends in Power and Energy in Integrated Circuits   21
    1.6   Trends in Cost   27
    1.7   Dependability   33
    1.8   Measuring, Reporting, and Summarizing Performance   36
    1.9   Quantitative Principles of Computer Design   44
    1.10  Putting It All Together: Performance, Price, and Power   52
    1.11  Fallacies and Pitfalls   55
    1.12  Concluding Remarks   59
    1.13  Historical Perspectives and References   61
          Case Studies and Exercises by Diana Franklin   61

Chapter 2   Memory Hierarchy Design
    2.1   Introduction   72
    2.2   Ten Advanced Optimizations of Cache Performance   78
    2.3   Memory Technology and Optimizations   96
    2.4   Protection: Virtual Memory and Virtual Machines   105
    2.5   Crosscutting Issues: The Design of Memory Hierarchies   112
    2.6   Putting It All Together: Memory Hierarchies in the ARM Cortex-A8 and Intel Core i7   113
    2.7   Fallacies and Pitfalls   125
    2.8   Concluding Remarks: Looking Ahead   129
    2.9   Historical Perspective and References   131
          Case Studies and Exercises by Norman P. Jouppi, Naveen Muralimanohar, and Sheng Li   131

Chapter 3   Instruction-Level Parallelism and Its Exploitation
    3.1   Instruction-Level Parallelism: Concepts and Challenges   148
    3.2   Basic Compiler Techniques for Exposing ILP   156
    3.3   Reducing Branch Costs with Advanced Branch Prediction   162
    3.4   Overcoming Data Hazards with Dynamic Scheduling   167
    3.5   Dynamic Scheduling: Examples and the Algorithm   176
    3.6   Hardware-Based Speculation   183
    3.7   Exploiting ILP Using Multiple Issue and Static Scheduling   192
    3.8   Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation   197
    3.9   Advanced Techniques for Instruction Delivery and Speculation   202
    3.10  Studies of the Limitations of ILP   213
    3.11  Cross-Cutting Issues: ILP Approaches and the Memory System   221
    3.12  Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput   223
    3.13  Putting It All Together: The Intel Core i7 and ARM Cortex-A8   233
    3.14  Fallacies and Pitfalls   241
    3.15  Concluding Remarks: What’s Ahead?   245
    3.16  Historical Perspective and References   247
          Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell   247

Chapter 4   Data-Level Parallelism in Vector, SIMD, and GPU Architectures
    4.1   Introduction   262
    4.2   Vector Architecture   264
    4.3   SIMD Instruction Set Extensions for Multimedia   282
    4.4   Graphics Processing Units   288
    4.5   Detecting and Enhancing Loop-Level Parallelism   315
    4.6   Crosscutting Issues   322
    4.7   Putting It All Together: Mobile versus Server GPUs and Tesla versus Core i7   323
    4.8   Fallacies and Pitfalls   330
    4.9   Concluding Remarks   332
    4.10  Historical Perspective and References   334
          Case Study and Exercises by Jason D. Bakos   334

Chapter 5   Thread-Level Parallelism
    5.1   Introduction   344
    5.2   Centralized Shared-Memory Architectures   351
    5.3   Performance of Symmetric Shared-Memory Multiprocessors   366
    5.4   Distributed Shared-Memory and Directory-Based Coherence   378
    5.5   Synchronization: The Basics   386
    5.6   Models of Memory Consistency: An Introduction   392
    5.7   Crosscutting Issues   395
    5.8   Putting It All Together: Multicore Processors and Their Performance   400
    5.9   Fallacies and Pitfalls   405
    5.10  Concluding Remarks   409
    5.11  Historical Perspectives and References   412
          Case Studies and Exercises by Amr Zaky and David A. Wood   412

Chapter 6   Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
    6.1   Introduction   432
    6.2   Programming Models and Workloads for Warehouse-Scale Computers   436
    6.3   Computer Architecture of Warehouse-Scale Computers   441
    6.4   Physical Infrastructure and Costs of Warehouse-Scale Computers   446
    6.5   Cloud Computing: The Return of Utility Computing   455
    6.6   Crosscutting Issues   461
    6.7   Putting It All Together: A Google Warehouse-Scale Computer   464
    6.8   Fallacies and Pitfalls   471
    6.9   Concluding Remarks   475
    6.10  Historical Perspectives and References   476
          Case Studies and Exercises by Parthasarathy Ranganathan   476

Appendix A   Instruction Set Principles
    A.1   Introduction   A-2
    A.2   Classifying Instruction Set Architectures   A-3
    A.3   Memory Addressing   A-7
    A.4   Type and Size of Operands   A-13
    A.5   Operations in the Instruction Set   A-14
    A.6   Instructions for Control Flow   A-16
    A.7   Encoding an Instruction Set   A-21
    A.8   Crosscutting Issues: The Role of Compilers   A-24
    A.9   Putting It All Together: The MIPS Architecture   A-32
    A.10  Fallacies and Pitfalls   A-39
    A.11  Concluding Remarks   A-45
    A.12  Historical Perspective and References   A-47
          Exercises by Gregory D. Peterson   A-47

Appendix B   Review of Memory Hierarchy
    B.1   Introduction   B-2
    B.2   Cache Performance   B-16
    B.3   Six Basic Cache Optimizations   B-22
    B.4   Virtual Memory   B-40
    B.5   Protection and Examples of Virtual Memory   B-49
    B.6   Fallacies and Pitfalls   B-57
    B.7   Concluding Remarks   B-59
    B.8   Historical Perspective and References   B-59
          Exercises by Amr Zaky   B-60

Appendix C   Pipelining: Basic and Intermediate Concepts
    C.1   Introduction   C-2
    C.2   The Major Hurdle of Pipelining—Pipeline Hazards   C-11
    C.3   How Is Pipelining Implemented?   C-30
    C.4   What Makes Pipelining Hard to Implement?   C-43
    C.5   Extending the MIPS Pipeline to Handle Multicycle Operations   C-51
    C.6   Putting It All Together: The MIPS R4000 Pipeline   C-61
    C.7   Crosscutting Issues   C-70
    C.8   Fallacies and Pitfalls   C-80
    C.9   Concluding Remarks   C-81
    C.10  Historical Perspective and References   C-81
          Updated Exercises by Diana Franklin   C-82

Online Appendices
    Appendix D   Storage Systems
    Appendix E   Embedded Systems, by Thomas M. Conte
    Appendix F   Interconnection Networks, revised by Timothy M. Pinkston and José Duato
    Appendix G   Vector Processors in More Depth, revised by Krste Asanović
    Appendix H   Hardware and Software for VLIW and EPIC
    Appendix I   Large-Scale Multiprocessors and Scientific Applications
    Appendix J   Computer Arithmetic, by David Goldberg
    Appendix K   Survey of Instruction Set Architectures
    Appendix L   Historical Perspectives and References

References   R-1

Index   I-1

2
Memory Hierarchy Design

    Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available. … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

        A. W. Burks, H. H. Goldstine, and J. von Neumann
        Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)

2.1  Introduction

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. An economical solution to that desire is a memory hierarchy, which takes advantage of locality and trade-offs in the cost-performance of memory technologies. The principle of locality, presented in the first chapter, says that most programs do not access all code or data uniformly. Locality occurs in time (temporal locality) and in space (spatial locality). This principle, plus the guideline that for a given implementation technology and power budget smaller hardware can be made faster, led to hierarchies based on memories of different speeds and sizes. Figure 2.1 shows a multilevel memory hierarchy, including typical sizes and speeds of access.

Since fast memory is expensive, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the next lower level, which is farther from the processor. The goal is to provide a memory system with cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level. In most cases (but not all), the data contained in a lower level are a superset of the next higher level. This property, called the inclusion property, is always required for the lowest level of the hierarchy, which consists of main memory in the case of caches and disk memory in the case of virtual memory.

[Figure 2.1  The levels in a typical memory hierarchy in a server computer shown on top (a) and in a personal mobile device (PMD) on the bottom (b). As we move farther away from the processor, the memory in the level below becomes slower and larger. Note that the time units change by a factor of 10^9—from picoseconds to milliseconds—and that the size units change by a factor of 10^12—from bytes to terabytes. The PMD has a slower clock rate and smaller caches and main memory. A key difference is that servers and desktops use disk storage as the lowest level in the hierarchy while PMDs use Flash, which is built from EEPROM technology. Sizes and speeds shown in the figure: (a) server: CPU registers (about 1000 bytes, 300 ps), L1 cache (64 KB, 1 ns), L2 cache (256 KB, 3–10 ns), L3 cache (2–4 MB, 10–20 ns), memory (4–16 GB, 50–100 ns), flash storage (4–8 GB, 25–50 µs), disk storage (4–16 TB, 5–10 ms); (b) PMD: CPU registers (500 bytes, 500 ps), L1 cache (64 KB, 2 ns), L2 cache (256 KB, 10–20 ns), memory (256–512 MB, 50–100 ns).]

The importance of the memory hierarchy has increased with advances in performance of processors. Figure 2.2 plots single processor performance projections against the historical performance improvement in time to access main memory. The processor line shows the increase in memory requests per second on average (i.e., the inverse of the latency between memory references), while the memory line shows the increase in DRAM accesses per second (i.e., the inverse of the DRAM access latency). The situation in a uniprocessor is actually somewhat worse, since the peak memory access rate is faster than the average rate, which is what is plotted.

More recently, high-end processors have moved to multiple cores, further increasing the bandwidth requirements versus single cores. In fact, the aggregate peak bandwidth essentially grows as the number of cores grows. A modern high-end processor such as the Intel Core i7 can generate two data memory references per core each clock cycle; with four cores and a 3.2 GHz clock rate, the i7 can generate a peak of 25.6 billion 64-bit data memory references per second, in addition to a peak instruction demand of about 12.8 billion 128-bit instruction references; this is a total peak bandwidth of 409.6 GB/sec! This incredible bandwidth is achieved by multiporting and pipelining the caches; by the use of multiple levels of caches, using separate first- and sometimes second-level caches per core; and by using a separate instruction and data cache at the first level. In contrast, the peak bandwidth to DRAM main memory is only 6% of this (25 GB/sec).

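For readers who want to check the arithmetic behind these figures, the calculation works out as follows (all values are the ones quoted above; the instruction fetch rate of one 128-bit reference per core per clock is implied by the 12.8 billion figure):

    Data references/sec        = 2 × 4 cores × 3.2 GHz = 25.6 × 10^9
    Instruction references/sec = 1 × 4 cores × 3.2 GHz = 12.8 × 10^9
    Peak bandwidth             = 25.6 × 10^9 × 8 bytes + 12.8 × 10^9 × 16 bytes
                               = 204.8 GB/sec + 204.8 GB/sec = 409.6 GB/sec

The 25 GB/sec DRAM figure is then 25/409.6, or about 6%, of the cache bandwidth.
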
[Figure 2.2  Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access, is plotted over time. Note that the vertical axis must be on a logarithmic scale to record the size of the processor–DRAM performance gap. The memory baseline is 64 KB DRAM in 1980, with a 1.07 per year performance improvement in latency (see Figure 2.13 on page 99). The processor line assumes a 1.25 improvement per year until 1986, a 1.52 improvement until 2000, a 1.20 improvement between 2000 and 2005, and no change in processor performance (on a per-core basis) between 2005 and 2010; see Figure 1.1 in Chapter 1. (The plot shows relative performance from 1 to 100,000 on a log scale over the years 1980 to 2010, with separate curves for processor and memory.)]

Traditionally, designers of memory hierarchies focused on optimizing average memory access time, which is determined by the cache access time, miss rate, and miss penalty. More recently, however, power has become a major consideration. In high-end microprocessors, there may be 10 MB or more of on-chip cache, and a large second- or third-level cache will consume significant power both as leakage when not operating (called static power) and as active power, as when performing a read or write (called dynamic power), as described in Section 2.3. The problem is even more acute in processors in PMDs where the CPU is less aggressive and the power budget may be 20 to 50 times smaller. In such cases, the caches can account for 25% to 50% of the total power consumption. Thus, more designs must consider both performance and power trade-offs, and we will examine both in this chapter.

Basics of Memory Hierarchies: A Quick Review

The increasing size and thus importance of this gap led to the migration of the basics of memory hierarchy into undergraduate courses in computer architecture, and even to courses in operating systems and compilers. Thus, we’ll start with a quick review of caches and their operation. The bulk of the chapter, however, describes more advanced innovations that attack the processor–memory performance gap.

When a word is not found in the cache, the word must be fetched from a lower level in the hierarchy (which may be another cache or the main memory) and placed in the cache before continuing. Multiple words, called a block (or line), are moved for efficiency reasons, and because they are likely to be needed soon due to spatial locality. Each cache block includes a tag to indicate which memory address it corresponds to.

A key design decision is where blocks (or lines) can be placed in a cache. The most popular scheme is set associative, where a set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. Finding a block consists of first mapping the block address to the set and then searching the set—usually in parallel—to find the block. The set is chosen by the address of the data:

    (Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative. The end points of set associativity have their own names. A direct-mapped cache has just one block per set (so a block is always placed in the same location), and a fully associative cache has just one set (so a block can be placed anywhere).

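To make the mapping concrete, the following C fragment is a minimal sketch of how a block address can be split into a set index and a tag, and how the n ways of a set are searched. The sizes, structure layout, and names are illustrative assumptions, not taken from the text or from any real processor.

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_SIZE    64     /* bytes per block (line)                   */
    #define NUM_SETS      256    /* number of sets in the cache              */
    #define ASSOCIATIVITY 8      /* n blocks per set (8-way set associative) */

    struct cache_block {
        bool     valid;
        uint64_t tag;            /* identifies which memory block is held    */
        uint8_t  data[BLOCK_SIZE];
    };

    static struct cache_block cache[NUM_SETS][ASSOCIATIVITY];

    /* Map an address to its set, then search that set for a matching tag. */
    static bool cache_lookup(uint64_t addr)
    {
        uint64_t block_addr = addr / BLOCK_SIZE;     /* drop the offset within the block     */
        uint64_t set        = block_addr % NUM_SETS; /* (Block address) MOD (Number of sets) */
        uint64_t tag        = block_addr / NUM_SETS; /* the remaining upper address bits     */

        for (int way = 0; way < ASSOCIATIVITY; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return true;                         /* hit  */
        return false;                                /* miss */
    }

The loop is only a software stand-in: real hardware compares the tags of all the ways in a set in parallel, as the text notes.
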
Caching data that is only read is easy, since the copy in the cache and memory will be identical. Caching writes is more difficult; for example, how can the copy in the cache and memory be kept consistent? There are two main strategies. A write-through cache updates the item in the cache and writes through to update main memory. A write-back cache only updates the copy in the cache. When the block is about to be replaced, it is copied back to memory. Both write strategies can use a write buffer to allow the cache to proceed as soon as the data are placed in the buffer rather than wait the full latency to write the data into memory.

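The contrast between the two write strategies can also be sketched in a few lines of C. Everything below (the line structure, the 64-byte block, the write-buffer helper) is an illustrative assumption rather than a description of any real cache or API.

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid, dirty; uint64_t tag; uint8_t data[64]; };

    /* Stand-ins for the path toward memory. A real write buffer queues the
       write and drains it to memory later, so the cache need not wait. */
    static void memory_write(uint64_t addr, uint8_t value)     { (void)addr; (void)value; }
    static void write_buffer_put(uint64_t addr, uint8_t value) { memory_write(addr, value); }

    /* A store that hits in the cache, under each policy. */
    static void store_hit_write_through(struct line *l, uint64_t addr, uint8_t value)
    {
        l->data[addr % 64] = value;     /* update the cached copy              */
        write_buffer_put(addr, value);  /* and send the write on toward memory */
    }

    static void store_hit_write_back(struct line *l, uint64_t addr, uint8_t value)
    {
        l->data[addr % 64] = value;     /* update only the cached copy         */
        l->dirty = true;                /* memory is updated at replacement    */
    }

    /* Only a write-back cache must copy a dirty block out when it is replaced. */
    static void evict(struct line *l, uint64_t block_addr)
    {
        if (l->dirty)
            for (int i = 0; i < 64; i++)
                memory_write(block_addr * 64 + (uint64_t)i, l->data[i]);
        l->valid = false;
        l->dirty = false;
    }
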
One measure of the benefits of different cache organizations is miss rate. Miss rate is simply the fraction of cache accesses that result in a miss—that is, the number of accesses that miss divided by the number of accesses.

To gain insights into the causes of high miss rates, which can inspire better cache designs, the three Cs model sorts all misses into three simple categories:

■ Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that occur even if you had an infinite-sized cache.

■ Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.

■ Conflict—If the block placement strategy is not fully associative, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if multiple blocks map to its set and accesses to the different blocks are intermingled.

Figures B.8 and B.9 on pages B-24 and B-25 show the relative frequency of cache misses broken down by the three Cs. As we will see in Chapters 3 and 5, multithreading and multiple cores add complications for caches, both increasing the potential for capacity misses as well as adding a fourth C, for coherency misses due to cache flushes to keep multiple caches coherent in a multiprocessor; we will consider these issues in Chapter 5.

Alas, miss rate can be a misleading measure for several reasons. Hence, some designers prefer measuring misses per instruction rather than misses per memory reference (miss rate). These two are related:

    Misses per instruction = (Miss rate × Memory accesses) / Instruction count
                           = Miss rate × (Memory accesses / Instruction)

(It is often reported as misses per 1000 instructions to use integers instead of fractions.)

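As a quick illustration with invented numbers (they are not taken from the text): a program with a 2% miss rate and 1.5 memory accesses per instruction would have

    Misses per instruction = 0.02 × 1.5 = 0.03

which would be reported as 30 misses per 1000 instructions.
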
The problem with both measures is that they don’t factor in the cost of a miss. A better measure is the average memory access time:

    Average memory access time = Hit time + Miss rate × Miss penalty

where hit time is the time to hit in the cache and miss penalty is the time to replace the block from memory (that is, the cost of a miss). Average memory access time is still an indirect measure of performance; although it is a better measure than miss rate, it is not a substitute for execution time. In Chapter 3 we will see that speculative processors may execute other instructions during a miss, thereby reducing the effective miss penalty. The use of multithreading (introduced in Chapter 3) also allows a processor to tolerate misses without being forced to idle. As we will examine shortly, to take advantage of such latency-tolerating techniques we need caches that can service requests while handling an outstanding miss.

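For example, again with invented numbers: a cache with a 1 ns hit time, a 2% miss rate, and a 100 ns miss penalty gives

    Average memory access time = 1 + 0.02 × 100 = 3 ns

so even a small miss rate can triple the average access time relative to a hit.
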
If this material is new to you, or if this quick review moves too quickly, see Appendix B. It covers the same introductory material in more depth and includes examples of caches from real computers and quantitative evaluations of their effectiveness.

Section B.3 in Appendix B presents six basic cache optimizations, which we quickly review here. The appendix also gives quantitative examples of the benefits of these optimizations. We also comment briefly on the power implications of these trade-offs.

1. Larger block size to reduce miss rate—The simplest way to reduce the miss rate is to take advantage of spatial locality and increase the block size. Larger blocks reduce compulsory misses, but they also increase the miss penalty. Because larger blocks lower the number of tags, they can slightly reduce static power. Larger block sizes can also increase capacity or conflict misses, especially in smaller caches. Choosing the right block size is a complex trade-off that depends on the size of cache and the miss penalty.

2. Bigger caches to reduce miss rate—The obvious way to reduce capacity misses is to increase cache capacity. Drawbacks include potentially longer hit time of the larger cache memory and higher cost and power. Larger caches increase both static and dynamic power.

3. Higher associativity to reduce miss rate—Obviously, increasing associativity reduces conflict misses. Greater associativity can come at the cost of increased hit time. As we will see shortly, associativity also increases power consumption.

4. Multilevel caches to reduce miss penalty—A difficult decision is whether to make the cache hit time fast, to keep pace with the high clock rate of processors, or to make the cache large to reduce the gap between the processor accesses and main memory accesses. Adding another level of cache between the original cache and memory simplifies the decision (see Figure 2.3). The first-level cache can be small enough to match a fast clock cycle time, yet the second-level (or third-level) cache can be large enough to capture many accesses that would go to main memory. The focus on misses in second-level caches leads to larger blocks, bigger capacity, and higher associativity. Multilevel caches are more power efficient than a single aggregate cache. If L1 and L2 refer, respectively, to first- and second-level caches, we can redefine the average memory access time:

       Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

   (A numeric sketch of this formula appears just after this list.)

5. Giving priority to read misses over writes to reduce miss penalty—A write buffer is a good place to implement this optimization. Write buffers create hazards because they hold the updated value of a location needed on a read miss—that is, a read-after-write hazard through memory. One solution is to check the contents of the write buffer on a read miss. If there are no conflicts, and if the memory system is available, sending the read before the writes reduces the miss penalty. Most processors give reads priority over writes. This choice has little effect on power consumption.

[Figure 2.3  Access times generally increase as cache size and associativity are increased. These data come from the CACTI model 6.5 by Tarjan, Thoziyoor, and Jouppi [2005]. The data assume a 40 nm feature size (which is between the technology used in Intel’s fastest and second fastest versions of the i7 and the same as the technology used in the fastest ARM embedded processors), a single bank, and 64-byte blocks. The assumptions about cache layout and the complex trade-offs between interconnect delays (that depend on the size of a cache block being accessed) and the cost of tag checks and multiplexing lead to results that are occasionally surprising, such as the lower access time for a 64 KB cache with two-way set associativity versus direct mapping. Similarly, the results with eight-way set associativity generate unusual behavior as cache size is increased. Since such observations are highly dependent on technology and detailed design assumptions, tools such as CACTI serve to reduce the search space rather than to provide a precise analysis of the trade-offs. (The plot shows access time versus cache size, from 16 KB to 256 KB, for 1-way, 2-way, 4-way, and 8-way set associativity.)]

6. Avoiding address translation during indexing of the cache to reduce hit time—Caches must cope with the translation of a virtual address from the processor to a physical address to access memory. (Virtual memory is covered in Sections 2.4 and B.4.) A common optimization is to use the page offset—the part that is identical in both virtual and physical addresses—to index the cache, as described in Appendix B, page B-38. This virtual index/physical tag method introduces some system complications and/or limitations on the size and structure of the L1 cache, but the advantages of removing the translation lookaside buffer (TLB) access from the critical path outweigh the disadvantages.

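As promised under optimization 4, here is a minimal C sketch that evaluates the two-level average memory access time formula; the hit times, miss rates, and miss penalty below are invented for illustration and do not come from the text.

    #include <stdio.h>

    /* AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2) */
    int main(void)
    {
        double hit_l1          = 1.0;    /* cycles to hit in L1 (assumed)         */
        double miss_rate_l1    = 0.05;   /* fraction of L1 accesses that miss     */
        double hit_l2          = 10.0;   /* cycles to hit in L2 (assumed)         */
        double miss_rate_l2    = 0.20;   /* fraction of L2 accesses that miss     */
        double miss_penalty_l2 = 100.0;  /* cycles to reach main memory (assumed) */

        double amat = hit_l1 + miss_rate_l1 *
                      (hit_l2 + miss_rate_l2 * miss_penalty_l2);

        printf("Average memory access time = %.2f cycles\n", amat);  /* prints 2.50 */
        return 0;
    }

With these numbers only 5% of accesses leave the small, fast L1, so the average stays close to the L1 hit time even though a trip to main memory costs 100 cycles.
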
Note that each of the six optimizations above has a potential disadvantage that can lead to increased, rather than decreased, average memory access time.

The rest of this chapter assumes familiarity with the material above and the details in Appendix B. In the Putting It All Together section, we examine the memory hierarchy for a microprocessor designed for a high-end server, the Intel Core i7, as well as one designed for use in a PMD, the ARM Cortex-A8, which is the basis for the processor used in the Apple iPad and several high-end smartphones. Within each of these classes, there is a significant diversity in approach due to the intended use of the computer. While the high-end processor used in the server has more cores and bigger caches than the Intel processors designed for desktop uses, the processors have similar architectures. The differences are driven by performance and the nature of the workload; desktop computers are primarily runn
