JOHN L. HENNESSY
DAVID A. PATTERSON

COMPUTER ARCHITECTURE
A Quantitative Approach
`QUALCOMM EXHIBIT 2012
`Intel v. Qualcomm
`IPR2018-01334
`
`
`
`Computer Architecture
`A Quantitative Approach
`
`Fifth Edition
`
`
`
`
`John L. Hennessy is the tenth president of Stanford University, where he has been a member
`of the faculty since 1977 in the departments of electrical engineering and computer science.
`Hennessy is a Fellow of the IEEE and ACM; a member of the National Academy of Engineering,
the National Academy of Sciences, and the American Philosophical Society; and a Fellow of
`the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-
`Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer
`Engineering Award, and the 2000 John von Neumann Award, which he shared with David
`Patterson. He has also received seven honorary doctorates.
`
`In 1981, he started the MIPS project at Stanford with a handful of graduate students. After
`completing the project in 1984, he took a leave from the university to cofound MIPS Computer
`Systems (now MIPS Technologies), which developed one of the first commercial RISC
`microprocessors. As of 2006, over 2 billion MIPS microprocessors have been shipped in devices
`ranging from video games and palmtop computers to laser printers and network switches.
Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which
`prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been
`adopted in modern multiprocessors. In addition to his technical activities and university
`responsibilities, he has continued to work with numerous start-ups both as an early-stage
`advisor and an investor.
`
`David A. Patterson has been teaching computer architecture at the University of California,
`Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair of Computer
`Science. His teaching has been honored by the Distinguished Teaching Award from the
`University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and
`Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement
`Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE
`Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von
`Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a
`Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM,
`and IEEE, and he was elected to the National Academy of Engineering, the National Academy
`of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information
`Technology Advisory Committee to the U.S. President, as chair of the CS division in the Berkeley
`EECS department, as chair of the Computing Research Association, and as President of ACM.
`This record led to Distinguished Service Awards from ACM and CRA.
`
`At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced
`instruction set computer, and the foundation of the commercial SPARC architecture. He was a
`leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable
`storage systems from many companies. He was also involved in the Network of Workstations
`(NOW) project, which led to cluster technology used by Internet companies and later to cloud
computing. These projects earned three dissertation awards from ACM. His current research
projects are the Algorithm-Machine-People (AMP) Laboratory and the Parallel Computing (Par)
Laboratory, where he is director. The goal of the AMP Lab is to develop scalable machine
learning algorithms, warehouse-scale-computer-friendly programming models, and crowd-sourcing
tools to gain valuable insights quickly from big data in the cloud. The goal of the Par Lab is
to develop technologies to deliver scalable, portable, efficient, and productive software for
parallel personal mobile devices.
`
`
`
`
`Computer Architecture
`A Quantitative Approach
`
`Fifth Edition
`
`John L. Hennessy
`Stanford University
`
`David A. Patterson
`University of California, Berkeley
`
`With Contributions by
`
`Krste Asanovic´
`University of California, Berkeley
`Jason D. Bakos
`University of South Carolina
`Robert P. Colwell
`R&E Colwell & Assoc. Inc.
`Thomas M. Conte
`North Carolina State University
`José Duato
`Universitat Politècnica de València and Simula
`Diana Franklin
`University of California, Santa Barbara
`
`David Goldberg
`The Scripps Research Institute
`
`Norman P. Jouppi
`HP Labs
`Sheng Li
`HP Labs
`Naveen Muralimanohar
`HP Labs
`Gregory D. Peterson
`University of Tennessee
`Timothy M. Pinkston
`University of Southern California
`Parthasarathy Ranganathan
`HP Labs
`David A. Wood
`University of Wisconsin–Madison
`Amr Zaky
`University of Santa Clara
`
`Amsterdam • Boston • Heidelberg • London
`New York • Oxford • Paris • San Diego
`San Francisco • Singapore • Sydney • Tokyo
`
`
`
`
`Acquiring Editor: Todd Green
`Development Editor: Nate McFadden
`Project Manager: Paul Gottehrer
`Designer: Joanne Blank
`
`Morgan Kaufmann is an imprint of Elsevier
`225 Wyman Street, Waltham, MA 02451, USA
`
`© 2012 Elsevier, Inc. All rights reserved.
`
`No part of this publication may be reproduced or transmitted in any form or by any means, electronic
`or mechanical, including photocopying, recording, or any information storage and retrieval system,
`without permission in writing from the publisher. Details on how to seek permission, further informa-
`tion about the Publisher’s permissions policies and our arrangements with organizations such as the
`Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website:
`www.elsevier.com/permissions.
`
`This book and the individual contributions contained in it are protected under copyright by the
`Publisher (other than as may be noted herein).
`
`Notices
`Knowledge and best practice in this field are constantly changing. As new research and experience
broaden our understanding, changes in research methods or professional practices may become
`necessary. Practitioners and researchers must always rely on their own experience and knowledge in
`evaluating and using any information or methods described herein. In using such information or
`methods they should be mindful of their own safety and the safety of others, including parties for
`whom they have a professional responsibility.
`
`To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume
`any liability for any injury and/or damage to persons or property as a matter of products liability, neg-
`ligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas
`contained in the material herein.
`
`Library of Congress Cataloging-in-Publication Data
`Application submitted
`
`British Library Cataloguing-in-Publication Data
`A catalogue record for this book is available from the British Library.
`
`ISBN: 978-0-12-383872-8
`
`For information on all MK publications
`visit our website at www.mkp.com
`
`Printed in the United States of America
`11 12 13 14 15 10 9 8 7 6 5 4 3 2 1
`
`Typeset by: diacriTech, Chennai, India
`
`
`
`
Contents

Foreword ix
Preface xv
Acknowledgments xxiii

Chapter 1 Fundamentals of Quantitative Design and Analysis 1
1.1 Introduction 2
1.2 Classes of Computers 5
1.3 Defining Computer Architecture 11
1.4 Trends in Technology 17
1.5 Trends in Power and Energy in Integrated Circuits 21
1.6 Trends in Cost 27
1.7 Dependability 33
1.8 Measuring, Reporting, and Summarizing Performance 36
1.9 Quantitative Principles of Computer Design 44
1.10 Putting It All Together: Performance, Price, and Power 52
1.11 Fallacies and Pitfalls 55
1.12 Concluding Remarks 59
1.13 Historical Perspectives and References 61
Case Studies and Exercises by Diana Franklin 61

Chapter 2 Memory Hierarchy Design
2.1 Introduction 72
2.2 Ten Advanced Optimizations of Cache Performance 78
2.3 Memory Technology and Optimizations 96
2.4 Protection: Virtual Memory and Virtual Machines 105
2.5 Crosscutting Issues: The Design of Memory Hierarchies 112
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A8 and Intel Core i7 113
2.7 Fallacies and Pitfalls 125
2.8 Concluding Remarks: Looking Ahead 129
2.9 Historical Perspective and References 131
Case Studies and Exercises by Norman P. Jouppi, Naveen Muralimanohar, and Sheng Li 131

Chapter 3 Instruction-Level Parallelism and Its Exploitation
3.1 Instruction-Level Parallelism: Concepts and Challenges 148
3.2 Basic Compiler Techniques for Exposing ILP 156
3.3 Reducing Branch Costs with Advanced Branch Prediction 162
3.4 Overcoming Data Hazards with Dynamic Scheduling 167
3.5 Dynamic Scheduling: Examples and the Algorithm 176
3.6 Hardware-Based Speculation 183
3.7 Exploiting ILP Using Multiple Issue and Static Scheduling 192
3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation 197
3.9 Advanced Techniques for Instruction Delivery and Speculation 202
3.10 Studies of the Limitations of ILP 213
3.11 Cross-Cutting Issues: ILP Approaches and the Memory System 221
3.12 Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput 223
3.13 Putting It All Together: The Intel Core i7 and ARM Cortex-A8 233
3.14 Fallacies and Pitfalls 241
3.15 Concluding Remarks: What's Ahead? 245
3.16 Historical Perspective and References 247
Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell 247

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
4.1 Introduction 262
4.2 Vector Architecture 264
4.3 SIMD Instruction Set Extensions for Multimedia 282
4.4 Graphics Processing Units 288
4.5 Detecting and Enhancing Loop-Level Parallelism 315
4.6 Crosscutting Issues 322
4.7 Putting It All Together: Mobile versus Server GPUs and Tesla versus Core i7 323
4.8 Fallacies and Pitfalls 330
4.9 Concluding Remarks 332
4.10 Historical Perspective and References 334
Case Study and Exercises by Jason D. Bakos 334

Chapter 5 Thread-Level Parallelism
5.1 Introduction 344
5.2 Centralized Shared-Memory Architectures 351
5.3 Performance of Symmetric Shared-Memory Multiprocessors 366
5.4 Distributed Shared-Memory and Directory-Based Coherence 378
5.5 Synchronization: The Basics 386
5.6 Models of Memory Consistency: An Introduction 392
5.7 Crosscutting Issues 395
5.8 Putting It All Together: Multicore Processors and Their Performance 400
5.9 Fallacies and Pitfalls 405
5.10 Concluding Remarks 409
5.11 Historical Perspectives and References 412
Case Studies and Exercises by Amr Zaky and David A. Wood 412

Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
6.1 Introduction 432
6.2 Programming Models and Workloads for Warehouse-Scale Computers 436
6.3 Computer Architecture of Warehouse-Scale Computers 441
6.4 Physical Infrastructure and Costs of Warehouse-Scale Computers 446
6.5 Cloud Computing: The Return of Utility Computing 455
6.6 Crosscutting Issues 461
6.7 Putting It All Together: A Google Warehouse-Scale Computer 464
6.8 Fallacies and Pitfalls 471
6.9 Concluding Remarks 475
6.10 Historical Perspectives and References 476
Case Studies and Exercises by Parthasarathy Ranganathan 476

Appendix A Instruction Set Principles
A.1 Introduction A-2
A.2 Classifying Instruction Set Architectures A-3
A.3 Memory Addressing A-7
A.4 Type and Size of Operands A-13
A.5 Operations in the Instruction Set A-14
A.6 Instructions for Control Flow A-16
A.7 Encoding an Instruction Set A-21
A.8 Crosscutting Issues: The Role of Compilers A-24
A.9 Putting It All Together: The MIPS Architecture A-32
A.10 Fallacies and Pitfalls A-39
A.11 Concluding Remarks A-45
A.12 Historical Perspective and References A-47
Exercises by Gregory D. Peterson A-47

Appendix B Review of Memory Hierarchy
B.1 Introduction B-2
B.2 Cache Performance B-16
B.3 Six Basic Cache Optimizations B-22
B.4 Virtual Memory B-40
B.5 Protection and Examples of Virtual Memory B-49
B.6 Fallacies and Pitfalls B-57
B.7 Concluding Remarks B-59
B.8 Historical Perspective and References B-59
Exercises by Amr Zaky B-60

Appendix C Pipelining: Basic and Intermediate Concepts
C.1 Introduction C-2
C.2 The Major Hurdle of Pipelining—Pipeline Hazards C-11
C.3 How Is Pipelining Implemented? C-30
C.4 What Makes Pipelining Hard to Implement? C-43
C.5 Extending the MIPS Pipeline to Handle Multicycle Operations C-51
C.6 Putting It All Together: The MIPS R4000 Pipeline C-61
C.7 Crosscutting Issues C-70
C.8 Fallacies and Pitfalls C-80
C.9 Concluding Remarks C-81
C.10 Historical Perspective and References C-81
Updated Exercises by Diana Franklin C-82

Online Appendices
Appendix D Storage Systems
Appendix E Embedded Systems (by Thomas M. Conte)
Appendix F Interconnection Networks (revised by Timothy M. Pinkston and José Duato)
Appendix G Vector Processors in More Depth (revised by Krste Asanovic)
Appendix H Hardware and Software for VLIW and EPIC
Appendix I Large-Scale Multiprocessors and Scientific Applications
Appendix J Computer Arithmetic (by David Goldberg)
Appendix K Survey of Instruction Set Architectures
Appendix L Historical Perspectives and References

References R-1

Index I-1
`
`
`
2
Memory Hierarchy Design

Ideally one would desire an indefinitely large memory capacity such
that any particular … word would be immediately available. … We
are … forced to recognize the possibility of constructing a hierarchy of
memories, each of which has greater capacity than the preceding but
which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann
Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)

Computer Architecture. DOI: 10.1016/B978-0-12-383872-8.00003-3
© 2012 Elsevier, Inc. All rights reserved.
`
`
`
`
2.1 Introduction
`
`Computer pioneers correctly predicted that programmers would want unlimited
`amounts of fast memory. An economical solution to that desire is a memory hier-
`archy, which takes advantage of locality and trade-offs in the cost-performance
`of memory technologies. The principle of locality, presented in the first chapter,
`says that most programs do not access all code or data uniformly. Locality occurs
`in time (temporal locality) and in space (spatial locality). This principle, plus the
`guideline that for a given implementation technology and power budget smaller
`hardware can be made faster, led to hierarchies based on memories of different
`speeds and sizes. Figure 2.1 shows a multilevel memory hierarchy, including typ-
`ical sizes and speeds of access.
`Since fast memory is expensive, a memory hierarchy is organized into several
`levels—each smaller, faster, and more expensive per byte than the next lower level,
`which is farther from the processor. The goal is to provide a memory system with
`cost per byte almost as low as the cheapest level of memory and speed almost as
`fast as the fastest level. In most cases (but not all), the data contained in a lower
`level are a superset of the next higher level. This property, called the inclusion
`property, is always required for the lowest level of the hierarchy, which consists of
`main memory in the case of caches and disk memory in the case of virtual memory.
`
[Figure 2.1 diagram: the levels in a typical memory hierarchy.
(a) Server: CPU registers (1000 bytes, 300 ps) → L1 cache (64 KB, 1 ns) → L2 cache (256 KB, 3–10 ns) → L3 cache (2–4 MB, 10–20 ns) → memory (4–16 GB, 50–100 ns) over the memory bus, with disk storage (4–16 TB, 5–10 ms) and flash (4–8 GB, 25–50 µs) reached over the I/O bus.
(b) Personal mobile device: CPU registers (500 bytes, 500 ps) → L1 cache (64 KB, 2 ns) → L2 cache (256 KB, 10–20 ns) → memory (256–512 MB, 50–100 ns).]

Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on
top (a) and in a personal mobile device (PMD) on the bottom (b). As we move farther
away from the processor, the memory in the level below becomes slower and larger.
Note that the time units change by a factor of 10^9—from picoseconds to milliseconds—
and that the size units change by a factor of 10^12—from bytes to terabytes. The
PMD has a slower clock rate and smaller caches and main memory. A key difference is
that servers and desktops use disk storage as the lowest level in the hierarchy while
PMDs use Flash, which is built from EEPROM technology.
`
`
`
`
`
`The importance of the memory hierarchy has increased with advances in per-
`formance of processors. Figure 2.2 plots single processor performance projec-
`tions against the historical performance improvement in time to access main
`memory. The processor line shows the increase in memory requests per second
`on average (i.e., the inverse of the latency between memory references), while
`the memory line shows the increase in DRAM accesses per second (i.e., the
`inverse of the DRAM access latency). The situation in a uniprocessor is actually
`somewhat worse, since the peak memory access rate is faster than the average
`rate, which is what is plotted.
`More recently, high-end processors have moved to multiple cores, further
increasing the bandwidth requirements versus single cores. In fact, the aggregate
peak bandwidth essentially grows as the number of cores grows. A modern high-
`end processor such as the Intel Core i7 can generate two data memory references
`per core each clock cycle; with four cores and a 3.2 GHz clock rate, the i7 can
`generate a peak of 25.6 billion 64-bit data memory references per second, in addi-
`tion to a peak instruction demand of about 12.8 billion 128-bit instruction refer-
`ences; this is a total peak bandwidth of 409.6 GB/sec! This incredible bandwidth
`is achieved by multiporting and pipelining the caches; by the use of multiple lev-
`els of caches, using separate first- and sometimes second-level caches per core;
`and by using a separate instruction and data cache at the first level. In contrast, the
`peak bandwidth to DRAM main memory is only 6% of this (25 GB/sec).
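The arithmetic behind these figures can be checked directly; the sketch below simply re-derives the numbers quoted above (the variable names are illustrative, not from the text).

```python
# Re-deriving the Core i7 peak-bandwidth figures from the text:
# 4 cores, 3.2 GHz clock, two 64-bit data references per core per clock,
# plus about 12.8 billion 128-bit instruction references per second.

cores = 4
clock_hz = 3.2e9
data_refs_per_core_per_clock = 2

data_refs_per_sec = cores * clock_hz * data_refs_per_core_per_clock  # 25.6 billion/sec
data_bw = data_refs_per_sec * 8      # 64 bits = 8 bytes -> 204.8 GB/sec
instr_bw = 12.8e9 * 16               # 128 bits = 16 bytes -> 204.8 GB/sec
total_bw_gb = (data_bw + instr_bw) / 1e9
print(total_bw_gb)                   # 409.6 GB/sec, as quoted in the text
```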
`
`Processor
`
`Memory
`
`1995
`Year
`
`2000
`
`2005
`
`2010
`
`100,000
`
`10,000
`
`1000
`
`100
`
`10
`
`Performance
`
`1
`1980
`
`1985
`
`1990
`
`Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance,
`measured as the difference in the time between processor memory requests (for a
`single processor or core) and the latency of a DRAM access, is plotted over time.
`Note that the vertical axis must be on a logarithmic scale to record the size of the
`processor–DRAM performance gap. The memory baseline is 64 KB DRAM in 1980, with
`a 1.07 per year performance improvement in latency (see Figure 2.13 on page 99). The
`processor line assumes a 1.25 improvement per year until 1986, a 1.52 improvement
`until 2000, a 1.20 improvement between 2000 and 2005, and no change in processor
`performance (on a per-core basis) between 2005 and 2010; see Figure 1.1 in Chapter 1.
`
`
`Traditionally, designers of memory hierarchies focused on optimizing aver-
`age memory access time, which is determined by the cache access time, miss
`rate, and miss penalty. More recently, however, power has become a major
`consideration. In high-end microprocessors, there may be 10 MB or more of
`on-chip cache, and a large second- or third-level cache will consume significant
`power both as leakage when not operating (called static power) and as active
`power, as when performing a read or write (called dynamic power), as described
`in Section 2.3. The problem is even more acute in processors in PMDs where the
`CPU is less aggressive and the power budget may be 20 to 50 times smaller. In
`such cases, the caches can account for 25% to 50% of the total power consump-
`tion. Thus, more designs must consider both performance and power trade-offs,
`and we will examine both in this chapter.
`
`Basics of Memory Hierarchies: A Quick Review
`
`The increasing size and thus importance of this gap led to the migration of the
`basics of memory hierarchy into undergraduate courses in computer architecture,
`and even to courses in operating systems and compilers. Thus, we’ll start with a
`quick review of caches and their operation. The bulk of the chapter, however,
`describes more advanced innovations that attack the processor–memory perfor-
`mance gap.
`When a word is not found in the cache, the word must be fetched from a
`lower level in the hierarchy (which may be another cache or the main memory)
`and placed in the cache before continuing. Multiple words, called a block (or
`line), are moved for efficiency reasons, and because they are likely to be needed
`soon due to spatial locality. Each cache block includes a tag to indicate which
`memory address it corresponds to.
`A key design decision is where blocks (or lines) can be placed in a cache. The
`most popular scheme is set associative, where a set is a group of blocks in the
`cache. A block is first mapped onto a set, and then the block can be placed any-
`where within that set. Finding a block consists of first mapping the block address
`to the set and then searching the set—usually in parallel—to find the block. The
`set is chosen by the address of the data:
`
`(Block address) MOD (Number of sets in cache)
`
`If there are n blocks in a set, the cache placement is called n-way set associative.
`The end points of set associativity have their own names. A direct-mapped cache
`has just one block per set (so a block is always placed in the same location), and
`a fully associative cache has just one set (so a block can be placed anywhere).
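The set-mapping rule above can be sketched as follows (a hypothetical helper function, not code from the book):

```python
# Sketch of block placement in a set associative cache:
# set index = (block address) MOD (number of sets in cache).

def cache_set(block_address: int, num_sets: int) -> int:
    """Return the set a block maps to."""
    return block_address % num_sets

# For a cache with 256 blocks:
#   direct-mapped:         256 sets of 1 block each
#   4-way set associative:  64 sets of 4 blocks each
#   fully associative:       1 set of 256 blocks (a block can go anywhere)
print(cache_set(4660, 64))   # block 4660 maps to set 52 of a 64-set cache
print(cache_set(4660, 1))    # fully associative: every block maps to set 0
```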
`Caching data that is only read is easy, since the copy in the cache and mem-
`ory will be identical. Caching writes is more difficult; for example, how can the
`copy in the cache and memory be kept consistent? There are two main strategies.
`A write-through cache updates the item in the cache and writes through to update
`
`
`main memory. A write-back cache only updates the copy in the cache. When the
`block is about to be replaced, it is copied back to memory. Both write strategies
`can use a write buffer to allow the cache to proceed as soon as the data are placed
`in the buffer rather than wait the full latency to write the data into memory.
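A minimal, hypothetical sketch (not from the book) of the two write strategies for a single cached block; real caches track a dirty bit per block in exactly this spirit:

```python
# Contrast of write-through and write-back for one cached block.

class Block:
    def __init__(self, data):
        self.data = data
        self.dirty = False    # meaningful only for write-back

def write(cache, memory, addr, value, policy):
    cache[addr].data = value
    if policy == "write-through":
        memory[addr] = value       # memory is updated on every write
    else:                          # "write-back"
        cache[addr].dirty = True   # memory is updated only at replacement

def evict(cache, memory, addr):
    blk = cache.pop(addr)
    if blk.dirty:                  # write-back: copy the block back now
        memory[addr] = blk.data

memory = {0: 10}
cache = {0: Block(10)}
write(cache, memory, 0, 99, "write-back")
# memory[0] is still 10; only evict(cache, memory, 0) would make it 99
```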
`One measure of the benefits of different cache organizations is miss rate.
`Miss rate is simply the fraction of cache accesses that result in a miss—that is,
`the number of accesses that miss divided by the number of accesses.
`To gain insights into the causes of high miss rates, which can inspire better
`cache designs, the three Cs model sorts all misses into three simple categories:
`
`■ Compulsory—The very first access to a block cannot be in the cache, so the
`block must be brought into the cache. Compulsory misses are those that occur
`even if you had an infinite sized cache.
`■ Capacity—If the cache cannot contain all the blocks needed during execution
`of a program, capacity misses (in addition to compulsory misses) will occur
`because of blocks being discarded and later retrieved.
`■ Conflict—If the block placement strategy is not fully associative, conflict
`misses (in addition to compulsory and capacity misses) will occur because a
`block may be discarded and later retrieved if multiple blocks map to its set
`and accesses to the different blocks are intermingled.
`
`Figures B.8 and B.9 on pages B-24 and B-25 show the relative frequency of
`cache misses broken down by the three Cs. As we will see in Chapters 3 and 5,
`multithreading and multiple cores add complications for caches, both increasing
`the potential for capacity misses as well as adding a fourth C, for coherency
`misses due to cache flushes to keep multiple caches coherent in a multiprocessor;
`we will consider these issues in Chapter 5.
`Alas, miss rate can be a misleading measure for several reasons. Hence, some
`designers prefer measuring misses per instruction rather than misses per memory
`reference (miss rate). These two are related:
`
Misses/Instruction = (Miss rate × Memory accesses) / Instruction count
                   = Miss rate × (Memory accesses/Instruction)
`(It is often reported as misses per 1000 instructions to use integers instead of
`fractions.)
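With illustrative numbers (not drawn from the text), the relationship looks like this:

```python
# Sketch: misses/instruction = miss rate × (memory accesses / instruction).

miss_rate = 0.02               # 2% of memory accesses miss (illustrative)
mem_accesses_per_instr = 1.5   # e.g., 1 instruction fetch + 0.5 data accesses

misses_per_instr = miss_rate * mem_accesses_per_instr
misses_per_1000 = misses_per_instr * 1000
# 0.02 × 1.5 × 1000 = 30 misses per 1000 instructions
```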
`The problem with both measures is that they don’t factor in the cost of a miss.
`A better measure is the average memory access time:
`Average memory access time = Hit time + Miss rate × Miss penalty
`where hit time is the time to hit in the cache and miss penalty is the time to replace
`the block from memory (that is, the cost of a miss). Average memory access time is
`still an indirect measure of performance; although it is a better measure than miss
`rate, it is not a substitute for execution time. In Chapter 3 we will see that specula-
`tive processors may execute other instructions during a miss, thereby reducing the
`
`
effective miss penalty. The use of multithreading (introduced in Chapter 3) also
allows a processor to tolerate misses without being forced to idle. As we will
`examine shortly, to take advantage of such latency tolerating techniques we need
`caches that can service requests while handling an outstanding miss.
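As a quick sketch of the average-memory-access-time formula above, with made-up parameter values:

```python
# Average memory access time = Hit time + Miss rate × Miss penalty.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, all times in the same unit."""
    return hit_time + miss_rate * miss_penalty

# 1 ns hit time, 2% miss rate, 100 ns miss penalty:
# 1 + 0.02 × 100 = 3 ns per access on average
print(amat(1.0, 0.02, 100.0))
```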
`If this material is new to you, or if this quick review moves too quickly, see
`Appendix B. It covers the same introductory material in more depth and includes
`examples of caches from real computers and quantitative evaluations of their
`effectiveness.
`Section B.3 in Appendix B presents six basic cache optimizations, which we
`quickly review here. The appendix also gives quantitative examples of the bene-
`fits of these optimizations. We also comment briefly on the power implications of
`these trade-offs.
`
`1. Larger block size to reduce miss rate—The simplest way to reduce the miss
`rate is to take advantage of spatial locality and increase the block size. Larger
`blocks reduce compulsory misses, but they also increase the miss penalty.
`Because larger blocks lower the number of tags, they can slightly reduce
`static power. Larger block sizes can also increase capacity or conflict misses,
`especially in smaller caches. Choosing the right block size is a complex
`trade-off that depends on the size of cache and the miss penalty.
`2. Bigger caches to reduce miss rate—The obvious way to reduce capacity
`misses is to increase cache capacity. Drawbacks include potentially longer hit
`time of the larger cache memory and higher cost and power. Larger caches
`increase both static and dynamic power.
`3. Higher associativity to reduce miss rate—Obviously, increasing associativity
`reduces conflict misses. Greater associativity can come at the cost of
`increased hit time. As we will see shortly, associativity also increases power
`consumption.
`4. Multilevel caches to reduce miss penalty—A difficult decision is whether to
`make the cache hit time fast, to keep pace with the high clock rate of proces-
`sors, or to make the cache large to reduce the gap between the processor
`accesses and main memory accesses. Adding another level of cache between
`the original cache and memory simplifies the decision (see Figure 2.3). The
`first-level cache can be small enough to match a fast clock cycle time, yet the
`second-level (or third-level) cache can be large enough to capture many
`accesses that would go to main memory. The focus on misses in second-level
`caches leads to larger blocks, bigger capacity, and higher associativity. Multi-
`level caches are more power efficient than a single aggregate cache. If L1 and
`L2 refer, respectively, to first- and second-level caches, we can redefine the
`average memory access time:
Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
`5. Giving priority to read misses over writes to reduce miss penalty—A write
`buffer is a good place to implement this optimization. Write buffers create
`hazards because they hold the updated value of a location needed on a read
`
`
[Figure 2.3 plot: access time (0–900 on the vertical axis) versus cache size (16 KB to 256 KB) for 1-way, 2-way, 4-way, and 8-way set associative caches.]

Figure 2.3 Access times generally increase as cache size and associativity are
increased. These data come from the CACTI model 6.5 by Tarjan, Thoziyoor, and Jouppi
[2005]. The data assume a 40 nm feature size (which is between the technology used in
Intel's fastest and second fastest versions of the i7 and the same as the technology used
in the fastest ARM embedded processors), a single bank, and 64-byte blocks. The
assumptions about cache layout and the complex trade-offs between interconnect
delays (that depend on the size of a cache block being accessed) and the cost of tag
checks and multiplexing lead to results that are occasionally surprising, such as the
lower access time for a 64 KB cache with two-way set associativity versus direct mapping.
Similarly, the results with eight-way set associativity generate unusual behavior as cache
size is increased. Since such observations are highly dependent on technology and
detailed design assumptions, tools such as CACTI serve to reduce the search space
rather than to provide precise analysis of the trade-offs.
`
`miss—that is, a read-after-write hazard through memory. One solution is to
`check the contents of the write buffer on a read miss. If there are no conflicts,
`and if the memory system is available, sending the read before the writes
`reduces the miss penalty. Most processors give reads priority over writes.
`This choice has little effect on power consumption.
`6. Avoiding address translation during indexing of the cache to reduce hit
`time—Caches must cope with the translation of a virtual address from the
`processor to a physical address to access memory. (Virtual memory is cov-
`ered in Sections 2.4 and B.4.) A common optimization is to use the page
`offset—the part that is identical in both virtual and physical addresses—to
`index the cache, as described in Appendix B, page B-38. This virtual index/
physical tag method introduces some system complications and/or
limitations on the size and structure of the L1 cache, but the advantages of
`removing the translation lookaside buffer (TLB) access from the critical
`path outweigh the disadvantages.
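The two-level formula from optimization 4 above can be sketched the same way (parameter values are illustrative, not from the text):

```python
# AMAT = Hit_L1 + MissRate_L1 × (Hit_L2 + MissRate_L2 × MissPenalty_L2)

def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2, penalty_l2):
    """Average memory access time with an L2 behind the L1."""
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2)

# 1 ns L1 hit, 5% L1 miss rate; 4 ns L2 hit, 20% of L1 misses also miss
# in L2, 100 ns to main memory:
# 1 + 0.05 × (4 + 0.2 × 100) = 1 + 0.05 × 24 = 2.2 ns
print(amat_two_level(1.0, 0.05, 4.0, 0.2, 100.0))
```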
`
`Note that each of the six optimizations above has a potential disadvantage
`that can lead to increased, rather than decreased, average memory access time.
`The rest of this chapter assumes familiarity with the material above and the
`details in Appendix B. In the Putting It All Together section, we examine the
memory hierarchy for a microprocessor designed for a high-end server, the Intel
Core i7, as well as one designed for use in a PMD, the ARM Cortex-A8, which is
`the basis for the processor used in the Apple iPad and several high-end
`smartphones. Within each of these classes, there is a significant diversity in
`approach due to the intended use of the computer. While the high-end processor
`used in the server has more cores and bigger caches than the Intel processors
`designed for desktop uses, the processors have similar architectures. The differ-
`ences are driven by performance and the nature of the workload; desktop com-
`puters are primarily runn