# Homayoun Reference 29 ### In Praise of Memory Systems: Cache, DRAM, Disk Memory Systems: Cache, DRAM, Disk is the first book that takes on the whole hierarchy in a way that is consistent, covers the complete memory hierarchy, and treats each aspect in significant detail. This book will serve as a definitive reference manual for the expert designer, yet it is so complete that it can be read by a relative novice to the computer design space. While memory technologies improve in terms of density and performance, and new memory device technologies provide additional properties as design options, the principles and methodology presented in this amazingly complete treatise will remain useful for decades. I only wish that a book like this had been available when I started out more than three decades ago. It truly is a landmark publication. Kudos to the authors. —Al Davis, University of Utah Memory Systems: Cache, DRAM, Disk fills a huge void in the literature about modern computer architecture. The book starts by providing a high level overview and building a solid knowledge basis and then provides the details for a deep understanding of essentially all aspects of modern computer memory systems including architectural considerations that are put in perspective with cost, performance and power considerations. In addition, the historical background and politics leading to one or the other implementation are revealed. Overall, Jacob, Ng, and Wang have created one of the truly great technology books that turns reading about bits and bytes into an exciting journey towards understanding technology. -Michael Schuette, Ph.D., VP of Technology Development at OCZ Technology This book is a critical resource for anyone wanting to know how DRAM, cache, and hard drives really work. It describes the implementation issues, timing constraints, and trade-offs involved in past, present, and future designs. The text is exceedingly well-written, beginning with high-level analysis and proceeding to incredible detail only for those who need it. It includes many graphs that give the reader both explanation and intuition. This will be an invaluable resource for graduate students wanting to study these areas, implementers, designers, and professors. —Diana Franklin, California Polytechnic University, San Luis Obispo Memory Systems: Cache, DRAM, Disk fills an important gap in exploring modern disk technology with accuracy, lucidity, and authority. The details provided would only be known to a researcher who has also contributed in the development phase. I recommend this comprehensive book to engineers, graduate students, and researchers in the storage area, since details provided in computer architecture textbooks are woefully inadequate. —Alexander Thomasian, IEEE Fellow, New Jersey Institute of Technology and Thomasian and Associates Memory Systems: Cache, DRAM, Disk offers a valuable state of the art information in memory systems that can only be gained through years of working in advanced industry and research. It is about time that we have such a good reference in an important field for researchers, educators and engineers. -Nagi Mekhiel, Department of Electrical and Computer Engineering, Ryerson University, Toronto This is the only book covering the important DRAM and disk technologies in detail. Clear, comprehensive, and authoritative, I have been waiting for such a book for long time. -Yiming Hu, University of Cincinnati Memory is often perceived as the performance bottleneck in computing architectures. Memory Systems: Cache, DRAM, Disk, sheds light on the mystical area of memory system design with a no-nonsense approach to what matters and how it affects performance. From historical discussions to modern case study examples this book is certain to become as ubiquitous and used as the other Morgan Kaufmann classic textbooks in computer engineering including Hennessy and Patterson's Computer Architecture: A Quantitative Approach. —R. Jacob Baker, Micron Technology, Inc. and Boise State University. Memory Systems: Cache, DRAM, Disk is a remarkable book that fills a very large void. The book is remarkable in both its scope and depth. It ranges from high performance cache memories to disk systems. It spans circuit design to system architecture in a clear, cohesive manner. It is the memory architecture that defines modern computer systems, after all. Yet, memory systems are often considered as an appendage and are covered in a piecemeal fashion. This book recognizes that memory systems are the heart and soul of modern computer systems and takes a 'holistic' approach to describing and analyzing memory systems. The classic book on memory systems was written by Dick Matick of IBM over thirty years ago. So not only does this book fill a void, it is a long-standing void. It carries on the tradition of Dick Matick's book extremely well, and it will doubtless be the definitive reference for students and designers of memory systems for many years to come. Furthermore, it would be easy to build a top-notch memory systems course around this book. The authors clearly and succinctly describe the important issues in an easy-to-read manner. And the figures and graphs are really great—one of the best parts of the book. When I work at home, I make coffee in a little stove-top espresso maker I got in Spain. It makes good coffee very efficiently, but if you put it on the stove and forget it's there, bad things happen—smoke, melted gasket—'burned coffee meltdown.' This only happens when I'm totally engrossed in a paper or article. Today, for the first time, it happened twice in a row—while I was reading the final version of this book. —Jim Smith, University of Wisconsin—Madison # Memory Systems Cache, DRAM, Disk ## Memory Systems ## Cache, DRAM, Disk Bruce Jacob University of Maryland at College Park Spencer W. Ng Hitachi Global Storage Technologies David T. Wang MetaRAM With Contributions By Samuel Rodriguez Advanced Micro Devices Publisher Denise E.M. Penrose Acquisitions Editor Chuck Glaser Publishing Services Manager George Morrison Senior Production Editor Paul Gottehrer Developmental Editor Nate McFadden Assistant Editor Kimberlee Honjo Cover Design Ioanne Blank Text Design Dennis Schaefer Composition diacriTech Interior printer Maple-Vail Book Manufacturing Group Cover printer Phoenix Color Morgan Kaufmann Publishers is an imprint of Elsevier. 30 Corporate Drive, Suite 400, Burlington, MA 01803, USA This book is printed on acid-free paper. © 2008 by Elsevier Inc. All rights reserved. Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher. Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: permissions@elsevier.com. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting "Support & Contact" then "Copyright and Permission" and then "Obtaining Permissions." #### Library of Congress Cataloging-in-Publication Data Application submitted ISBN: 978-0-12-379751-3 For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com Printed in the United States of America 08 09 10 11 12 5 4 3 2 1 ## Working together to grow libraries in developing countries www.elsevier.com | www.bookaid.org | www.sabre.org ELSEVIER BOOK AID International Sabre Foundation ## **Dedication** Jacob To my parents, Bruce and Ann Jacob, my wife, Dorinda, and my children, Garrett, Carolyn, and Nate Ng Dedicated to the memory of my parents Ching-Sum and Yuk-Ching Ng Wang Dedicated to my parents Tu-Sheng Wang and Hsin-Hsin Wang You can tell whether a person plays or not by the way he carries the instrument, whether it means something to him or not. Then the way they talk and act. If they act too hip, you know they can't play [jack]. -Miles Davis [...] in connection with musical continuity, Cowell remarked at the New School before a concert of works by Christian Wolff, Earle Brown, Morton Feldman, and myself, that here were four composers who were getting rid of glue. That is: Where people had felt the necessity to stick sounds together to make a continuity, we four felt the opposite necessity to get rid of the glue so that sounds would be themselves. Christian Wolff was the first to do this. He wrote some pieces vertically on the page but recommended their being played horizontally left to right, as is conventional. Later he discovered other geometrical means for freeing his music of intentional continuity. Morton Feldman divided pitches into three areas, high, middle, and low, and established a time unit. Writing on graph paper, he simply inscribed numbers of tones to be played at any time within specified periods of time. There are people who say, "If music's that easy to write, I could do it." Of course they could, but they don't. I find Feldman's own statement more affirmative. We were driving back from some place in New England where a concert had been given. He is a large man and falls asleep easily. Out of a sound sleep, he awoke to say, "Now that things are so simple, there's so much to do." And then he went back to sleep. —John Cage, Silence ## Contents = | Preface | "lt's | It's the Memory, Stupid!"xxi | | | | |----------|--------------|------------------------------|---------------------------------------------------------------|----|--| | Overview | On<br>Ov.1 | | ory Systems and Their Design | | | | | 02 | Ov.1.1 | Locality of Reference Breeds the Memory Hierarchy | | | | | | Ov.1.2 | Important Figures of Merit | | | | | | Ov.1.3 | The Goal of a Memory Hierarchy | | | | | Ov.2 | Four A | necdotes on Modular Design | | | | | | | Anecdote I: Systemic Behaviors Exist | | | | | | Ov.2.2 | Anecdote II: The DLL in DDR SDRAM | 17 | | | | | Ov.2.3 | Anecdote III: A Catch-22 in the Search for Bandwidth | 18 | | | | | Ov.2.4 | Anecdote IV: Proposals to Exploit Variability in Cell Leakage | | | | | | | Perspective | | | | | Ov.3 | Cross- | Cutting Issues | 20 | | | | | Ov.3.1 | Cost/Performance Analysis | 20 | | | | | Ov.3.2 | Power and Energy | 26 | | | | | Ov.3.3 | Reliability | 32 | | | | | Ov.3.4 | Virtual Memory | 34 | | | | <b>Ov.</b> 4 | An Exa | ample Holistic Analysis | 41 | | | | | Ov.4.1 | Fully-Buffered DIMM vs. the Disk Cache | 41 | | | | | Ov.4.2 | Fully Buffered DIMM: Basics | 43 | | | | | Ov.4.3 | Disk Caches: Basics | 46 | | | | | Ov.4.4 | Experimental Results | 47 | | | | | Ov.4.5 | Conclusions | 52 | | | | Ov.5 | What | to Expect | 54 | | | Part I | Ca | ache | 55 | | | |-----------|---------------------------------|--------------------------------------------------------|-----|--|--| | Chapter 1 | An Overview of Cache Principles | | | | | | | 1.1 | Caches, 'Caches,' and "Caches" | | | | | | 1.2 | Locality Principles | | | | | | | 1.2.1 Temporal Locality | | | | | | | 1.2.2 Spatial Locality | 63 | | | | | | 1.2.3 Algorithmic Locality | 64 | | | | | | 1.2.4 Geographical Locality? Demographical Locality? | 65 | | | | | 1.3 | What to Cache, Where to Put It, and How to Maintain It | 66 | | | | | | 1.3.1 Logical Organization Basics: Blocks, Tags, Sets | 67 | | | | | | 1.3.2 Content Management: To Cache or Not to Cache | 68 | | | | | | 1.3.3 Consistency Management: Its Responsibilities | 69 | | | | | | 1.3.4 Inclusion and Exclusion | 70 | | | | | 1.4 | Insights and Optimizations | 73 | | | | | | 1.4.1 Perspective | 73 | | | | | | 1.4.2 Important Issues, Future Directions | 77 | | | | Chapter 2 | Log | gical Organization | 79 | | | | | 2.1 | Logical Organization: A Taxonomy | 79 | | | | | 2.2 | Transparently Addressed Caches | 82 | | | | | | 2.2.1 Implicit Management: Transparent Caches | 86 | | | | | | 2.2.2 Explicit Management: Software-Managed Caches | 86 | | | | | 2.3 | Non-Transparently Addressed Caches | 90 | | | | | | 2.3.1 Explicit Management: Scratch-Pad Memories | 91 | | | | | | 2.3.2 Implicit Management: Self-Managed Scratch-Pads | 92 | | | | | 2.4 | Virtual Addressing and Protection | 92 | | | | | | 2.4.1 Virtual Caches | 93 | | | | | | 2.4.2 ASIDs and Protection Bits | 96 | | | | | | 2.4.3 Inherent Problems | 96 | | | | | 2.5 | Distributed and Partitioned Caches | 97 | | | | | | 2.5.1 UMA and NUMA | 98 | | | | | | 2.5.2 COMA | 99 | | | | | | 2.5.3 NUCA and NuRAPID | 99 | | | | | | 2.5.4 Web Caches | 100 | | | | | | 2.5.5 Buffer Caches | 102 | | | | es 103<br>104 | |---------------| | 104 | | | | 106 | | 117 | | 120 | | 120 | | 129 | | 141 | | 151 | | 151 | | 155 | | 161 | | 169 | | 170 | | 174 | | 180 | | 202 | | 202 | | 203 | | 203 | | 204 | | 205 | | 208 | | 212 | | 212 | | 213 | | 213 | | 214 | | 217 | | 218 | | 218 | | 219 | | 220 | | | | | 4.2 | Cons | istency with Self | 220 | | |-----------|--------------------|-------|-------------------------------------|-----|--| | | | 4.2.1 | Virtual Cache Management | 220 | | | | | 4.2.2 | ASID Management | 225 | | | | 4.3 | Cons | istency with Other Clients | 226 | | | | | 4.3.1 | Motivation, Explanation, Intuition | 226 | | | | | 4.3.2 | Coherence vs. Consistency | 231 | | | | | 4.3.3 | Memory-Consistency Models | 233 | | | | | 4.3.4 | Hardware Cache-Coherence Mechanisms | 240 | | | | | 4.3.5 | Software Cache-Coherence Mechanisms | 254 | | | Chapter 5 | lmp | oleme | ntation Issues | 257 | | | | 5.1 | Over | view | 257 | | | | <b>5.2</b> | SRAM | M Implementation | 258 | | | | | 5.2.1 | Basic 1-Bit Memory Cell | 259 | | | | | 5.2.2 | Address Decoding | 262 | | | | | 5.2.3 | Physical Decoder Implementation | 268 | | | | | 5.2.4 | Peripheral Bitline Circuits | 278 | | | | | 5.2.5 | Sense Amplifiers | 281 | | | | | 5.2.6 | Write Amplifier | 283 | | | | | 5.2.7 | SRAM Partitioning | 285 | | | | | 5.2.8 | SRAM Control and Timing | 286 | | | | | 5.2.9 | SRAM Interface | 292 | | | | <b>5.3</b> | Adva | nced SRAM Topics | 293 | | | | | 5.3.1 | Low-Leakage Operation | 293 | | | | <b>5.4</b> | Cach | e Implementation | 297 | | | | | 5.4.1 | Simple Caches | 297 | | | | | 5.4.2 | Processor Interfacing | 298 | | | | | 5.4.3 | Multiporting | 298 | | | Chapter 6 | Cache Case Studies | | | | | | - | 6.1 | Logic | cal Organization | 301 | | | | | 6.1.1 | Motorola MPC7450 | 301 | | | | | 6.1.2 | AMD Opteron | 301 | | | | | 6.1.3 | Intel Itanium-2 | 303 | | | | 6.2 | Pipel | ine Interface | 304 | |-----------|-----|-------|-----------------------------------------------------|-----| | | | 6.2.1 | Motorola MPC7450 | 304 | | | | 6.2.2 | AMD Opteron | 304 | | | | 6.2.3 | Intel Itanium-2 | 304 | | | 6.3 | Case | Studies of Detailed Itanium-2 Circuits | 305 | | | | 6.3.1 | L1 Cache RAM Cell Array | 305 | | | | 6.3.2 | L2 Array Bitline Structure | 305 | | | | 6.3.3 | L3 Subarray Implementation | 308 | | | | 6.3.4 | Itanium-2 TLB and CAM Implementation | 308 | | Part II | D | RA | .M | 313 | | Chapter 7 | Ov | | w of DRAMs | | | | 7.1 | | M Basics: Internals, Operation | | | | 7.2 | Evolu | ıtion of the DRAM Architecture | | | | | 7.2.1 | Structural Modifications Targeting Throughput | 322 | | | | 7.2.2 | Interface Modifications Targeting Throughput | 328 | | | | 7.2.3 | Structural Modifications Targeting Latency | 330 | | | | 7.2.4 | Rough Comparison of Recent DRAMs | 331 | | | 7.3 | Mode | ern-Day DRAM Standards | 332 | | | | 7.3.1 | Salient Features of JEDEC's SDRAM Technology | 332 | | | | 7.3.2 | Other Technologies, Rambus in Particular | 335 | | | | 7.3.3 | Comparison of Technologies in Rambus and JEDEC DRAM | 341 | | | | 7.3.4 | Alternative Technologies | 343 | | | 7.4 | Fully | Buffered DIMM: A Compromise of Sorts | 348 | | | 7.5 | Issue | s in DRAM Systems, Briefly | 350 | | | | 7.5.1 | Architecture and Scaling | 350 | | | | 7.5.2 | Topology and Timing | 350 | | | | 7.5.3 | Pin and Protocol Efficiency | 351 | | | | 7.5.4 | Power and Heat Dissipation | 351 | | | | 7.5.5 | Future Directions | 351 | | | | | | | | Chapter 8 | | RAM Device Organization: Basic Circuits and | | |-----------|------|-----------------------------------------------------------|-----| | | Ar | chitecture | 353 | | | 8.1 | DRAM Device Organization | 353 | | | 8.2 | DRAM Storage Cells | 355 | | | | 8.2.1 Cell Capacitance, Leakage, and Refresh | 356 | | | | 8.2.2 Conflicting Requirements Drive Cell Structure | 356 | | | | 8.2.3 Trench Capacitor Structure | 357 | | | | 8.2.4 Stacked Capacitor Structure | 357 | | | 8.3 | RAM Array Structures | 358 | | | | 8.3.1 Open Bitline Array Structure | 359 | | | | 8.3.2 Folded Bitline Array Structure | 360 | | | 8.4 | Differential Sense Amplifier | 360 | | | | 8.4.1 Functionality of Sense Amplifiers in DRAM Devices | 361 | | | | 8.4.2 Circuit Diagram of a Basic Sense Amplifier | 362 | | | | 8.4.3 Basic Sense Amplifier Operation | | | | | 8.4.4 Voltage Waveform of Basic Sense Amplifier Operation | 363 | | | | 8.4.5 Writing into DRAM Array | 365 | | | 8.5 | Decoders and Redundancy | 366 | | | | 8.5.1 Row Decoder Replacement Example | 368 | | | 8.6 | DRAM Device Control Logic | 368 | | | | 8.6.1 Synchronous vs. Non-Synchronous | 369 | | | | 8.6.2 Mode Register-Based Programmability | 370 | | | 8.7 | DRAM Device Configuration | | | | | 8.7.1 Device Configuration Trade-offs | | | | 8.8 | Data I/O | | | | | 8.8.1 Burst Lengths and Burst Ordering | | | | | 8.8.2 N-Bit Prefetch | | | | 8.9 | DRAM Device Packaging | | | | 8.10 | DRAM Process Technology and Process Scaling | | | | | Considerations | 374 | | | | 8.10.1 Cost Considerations | 375 | | | | 8.10.2 DRAM- vs. Logic-Optimized Process Technology | 375 | | Chapter 9 | DR | RAM System Signaling and Timing | 377 | | | 9.1 | Signaling System | | | Chapter 10 | DRAM Memory System Organization | | | 409 | | |------------|---------------------------------|--------------------|---------------------------------------------|-----|--| | | 10.1 | Conventiona | l Memory System | 409 | | | | 10.2 | Basic Nomen | clature | 409 | | | | | 10.2.1 Chan | ıel | 410 | | | | | 10.2.2 Rank. | | 413 | | | | | 10.2.3 Bank. | | 414 | | | | | 10.2.4 Row | | 415 | | | | | 10.2.5 Colum | ın | 415 | | | | | 10.2.6 Memo | ry System Organization: An Example | 416 | | | | 10.3 | Memory Mod | lules | 417 | | | | | 10.3.1 Single | In-line Memory Module (SIMM) | 418 | | | | | 10.3.2 Dual l | n-line Memory Module (DIMM) | 418 | | | | | 10.3.3 Regist | ered Memory Module (RDIMM) | 418 | | | | | 10.3.4 Small | Outline DIMM (SO-DIMM) | 419 | | | | | 10.3.5 Memo | ry Module Organization | 420 | | | | | 10.3.6 Serial | Presence Detect (SPD) | 421 | | | | 10.4 | <b>Memory Syst</b> | em Topology | 422 | | | | | 10.4.1 Direct | RDRAM System Topology | 422 | | | | 10.5 | Summary | | 423 | | | Chapter 11 | Basic | DRAMA | Memory-Access Protocol | 495 | | | Chapter 11 | 11.1 | | Commands | | | | | | | ic DRAM Command Format | | | | | | | eary of Timing Parameters | | | | | | | ccess Command | | | | | | | n-Read Command | | | | | | 11.1.5 Colun | n-Write Command | 430 | | | | | | ırge Command | | | | | | | h Command | | | | | | ū | ł Cycle | | | | | | | e Cycle | | | | | | | ound Commands | | | | | 11.2 | • | nand Interactions | | | | | | | cutive Reads and Writes to Same Rank | | | | | | 11.2.2 Read t | o Precharge Timing | 438 | | | | | | cutive Reads to Different Rows of Same Bank | | | | | | 11.2.4 | Consecutive Reads to Different Banks: Bank Conflict | 440 | |------------|------|---------|-----------------------------------------------------------------------------------|-----| | | | 11.2.5 | Consecutive Read Requests to Different Ranks | 441 | | | | 11.2.6 | Consecutive Write Requests: Open Banks | 442 | | | | 11.2.7 | Consecutive Write Requests: Bank Conflicts | 444 | | | | 11.2.8 | Write Request Following Read Request: Open Banks | 444 | | | | 11.2.9 | Write Request Following Read Request to Different Banks, | | | | | | Bank Conflict, Best Case, No Reordering | | | | | 11.2.10 | Read Following Write to Same Rank, Open Banks | | | | | 11.2.11 | Write to Precharge Timing | | | | | 11.2.12 | Read Following Write to Different Ranks, Open Banks | 447 | | | | 11.2.13 | Read Following Write to Same Bank, Bank Conflict | 448 | | | | 11.2.14 | Read Following Write to Different Banks of Same Rank, Bank<br>Case, No Reordering | - | | | | 11.2.15 | Column-Read-and-Precharge Command Timing | 449 | | | | 11.2.16 | Column-Write-and-Precharge Timing | 450 | | | 11.3 | Additio | onal Constraints | 450 | | | | 11.3.1 | Device Power Limit | 450 | | | | 11.3.2 | t <sub>RRD</sub> : Row-to-Row (Activation) Delay | 452 | | | | 11.3.3 | t <sub>EAW</sub> : Four-Bank Activation Window | 453 | | | | 11.3.4 | 2T Command Timing in Unbuffered Memory Systems | 454 | | | 11.4 | Comm | and Timing Summary | 454 | | | 11.5 | Summa | ary | 454 | | Chapter 12 | | | y Developments of DRAM Device | 457 | | | 12.1 | DRAM | Device Families | 457 | | | | 12.1.1 | Cost (Capacity), Latency, Bandwidth, and Power | 457 | | | 12.2 | | cal-Commodity DRAM Devices | | | | | 12.2.1 | The Intel 1103 | | | | | 12.2.2 | Asynchronous DRAM Devices | 461 | | | | 12.2.3 | Page Mode and Fast Page Mode DRAM (FPM DRAM) | | | | | 12.2.4 | Extended Data-Out (EDO) and Burst Extended | | | | | | Data-Out (BEDO) Devices | 463 | | | 12.3 | Moder | n-Commodity DRAM Devices | 464 | | | | 12.3.1 | Synchronous DRAM (SDRAM) | 465 | | | | 12.3.2 | Double Data Rate SDRAM (DDR SDRAM) | 471 | | |------------|------------------------|--------|-------------------------------------------------------|-----|--| | | | 12.3.3 | DDR2 SDRAM | 474 | | | | | 12.3.4 | Protocol and Architectural Differences | 475 | | | | | 12.3.5 | DDR3 SDRAM | 476 | | | | | 12.3.6 | Scaling Trends of Modern-Commodity DRAM Devices | 477 | | | | 12.4 | High 1 | Bandwidth Path | 480 | | | | | 12.4.1 | Direct RDRAM | 480 | | | | | 12.4.2 | Technical and Pseudo-Technical Issues of Direct RDRAM | 487 | | | | | 12.4.3 | XDR Memory System | 491 | | | | 12.5 | Low L | atency | 494 | | | | | 12.5.1 | Reduced Latency DRAM (RLDRAM) | 494 | | | | | 12.5.2 | Fast Cycle DRAM (FCRAM) | 495 | | | | 12.6 | Intere | esting Alternatives | 495 | | | | | 12.6.1 | Virtual Channel Memory (VCDRAM) | 495 | | | | | 12.6.2 | Enhanced SDRAM (ESDRAM) | 496 | | | Chapter 13 | DRAM Memory Controller | | | | | | | 13.1 | DRAM | A Controller Architecture | 497 | | | | 13.2 | Row-l | Buffer-Management Policy | 499 | | | | | 13.2.1 | Open-Page Row-Buffer-Management Policy | 499 | | | | | 13.2.2 | Close-Page Row-Buffer-Management Policy | | | | | | 13.2.3 | Hybrid (Dynamic) Row-Buffer-Management Policies | 500 | | | | | 13.2.4 | Performance Impact of Row-Buffer-Management Policies | 500 | | | | | 13.2.5 | Power Impact of Row-Buffer-Management Policies | 501 | | | | 13.3 | | ess Mapping (Translation) | | | | | | 13.3.1 | Available Parallelism in Memory System Organization | | | | | | 13.3.2 | Parameter of Address Mapping Schemes | 504 | | | | | 13.3.3 | Baseline Address Mapping Schemes | | | | | | 13.3.4 | Parallelism vs. Expansion Capability | 506 | | | | | 13.3.5 | Address Mapping in the Intel 82955X MCH | 506 | | | | | 13.3.6 | Bank Address Aliasing (Stride Collision) | 510 | | | | 13.4 | Perfo | rmance Optimization | 511 | | | | | 13.4.1 | Write Caching | 512 | | | | | 13.4.2 | Request Queue Organizations | | | | | | 13.4.3 | Refresh Management | 514 | | | | | | | | | | | | 13.4.4 Agent-Centric Request | Queuing Organization | 516 | |------------|--------------|---------------------------------|-----------------------|-----| | | | • | eduling | | | | 13.5 | | | | | | | y | | | | Chapter 14 | The | fully Buffered DIMM | Memory System | 519 | | - | 14.1 | | | | | | 14.2 | Architecture | | 521 | | | 14.3 | Signaling and Timing | | 524 | | | | 14.3.1 Clock Data Recovery | | 524 | | | | 14.3.2 Unit Interval | | 525 | | | | 14.3.3 Resample and Resync | | 525 | | | 14.4 | Access Protocol | | 526 | | | | 14.4.1 Frame Definitions | | 527 | | | | 14.4.2 Command Definitions | | 528 | | | | 14.4.3 Frame and Command | Scheduling | 528 | | | 14.5 | The Advanced Memory But | ffer | 530 | | | | 14.5.1 SMBus Interface | | 531 | | | | 14.5.2 Built-In Self-Test (BIST | ") | 532 | | | | 14.5.3 Thermal Sensor | | 532 | | | 14.6 | Reliability, Availability, and | l Serviceability | 532 | | | | 14.6.1 Checksum Protection i | n the Transport Layer | 532 | | | | 14.6.2 Bit Lane Steering | | 533 | | | | 14.6.3 Fail-over Modes | | 534 | | | | 14.6.4 Hot Add and Replace | | 534 | | | 14.7 | FBDIMM Performance Cha | aracteristics | 535 | | | | 14.7.1 Fixed vs. Variable Later | ncy Scheduling | 538 | | | 14.8 | Perspective | | 540 | | Chapter 15 | Mer | orv System Design Ar | nalysis | 541 | | | 15.1 | | | | | | 15.2 | | | | | | <del>-</del> | | on | | | | | 9 , , | ng Language Compiler | | | | | | cessing | | | | | 15.2.4 255 vortex: C Object-O | · · | 546 | | | 15.2.5 | 172.mgrid: Fortran 77 Multi-Grid Solver: 3D Potential Field | 547 | |------|---------|-------------------------------------------------------------------------|--------------| | | 15.2.6 | SETI@HOME | 547 | | | 15.2.7 | Quake 3 | 548 | | | 15.2.8 | 178.galgel, 179.art, 183.equake, 188.ammp, JMark 2.0,<br>and 3DWinbench | 548 | | | 15.2.9 | Summary of Workload Characteristics | 550 | | 15.3 | The RA | D Analytical Framework | 551 | | | 15.3.1 | DRAM-Access Protocol | 551 | | | 15.3.2 | Computing DRAM Protocol Overheads | 551 | | | 15.3.3 | Computing Row Cycle Time Constraints | 553 | | | 15.3.4 | Computing Row-to-Row Activation Constraints | 555 | | | 15.3.5 | Request Access Distance Efficiency Computation | 557 | | | 15.3.6 | An Applied Example for a Close-Page System | 558 | | | 15.3.7 | An Applied Example for an Open-Page System | 558 | | | 15.3.8 | System Configuration for RAD-Based Analysis | 559 | | | 15.3.9 | Open-Page Systems: 164.gzip | 561 | | | 15.3.10 | Open-Page Systems: 255.vortex | 562 | | | 15.3.11 | Open-Page Systems: Average of All Workloads | 562 | | | 15.3.12 | Close-Page Systems: 164.gzip | 565 | | | 15.3.13 | Close-Page Systems: SETI@HOME Processor Bus Trace | 566 | | | 15.3.14 | Close-Page Systems: Average of All Workloads | 567 | | | 15.3.15 | t <sub>EAW</sub> Limitations in Open-Page System: All Workloads | 568 | | | 15.3.16 | Bandwidth Improvements: 8-Banks vs. 16-Banks | 568 | | 15.4 | Simula | tion-Based Analysis | 570 | | | 15.4.1 | System Configurations | 570 | | | 15.4.2 | Memory Controller Structure | 571 | | | 15.4.3 | DRAM Command Scheduling Algorithms | 572 | | | 15.4.4 | Workload Characteristics | 575 | | | 15.4.5 | Timing Parameters | 575 | | | 15.4.6 | Protocol Table | 575 | | | 15.4.7 | Queue Depth, Scheduling Algorithms, and Burst Length | 577 | | | 15.4.8 | Effect of Burst Length on Sustainable Bandwidth | 578 | | | 15.4.9 | Burst Chop in DDR3 SDRAM Devices | 5 <i>7</i> 9 | | | 15.4.10 | Revisiting the 8-Bank and 16-Bank Issue with DRAMSim | 584 | | | 15.4.11 | 8 Bank vs. 16 Banks — Relaxed t <sub>FAW</sub> and t <sub>WTR</sub> | | | | 15.4.12 | Effect of Transaction Ordering on Latency Distribution | 587 | | = | |------------| | 590 | | 592 | | 592 | | 593 | | 596 | | 599 | | | | 601 | | 601 | | 603 | | 605 | | 606 | | 607<br>608 | | 609 | | 609<br>609 | | 609<br>609 | | 609<br>609 | | 610 | | 610 | | 612 | | 612 | | 612 | | 615 | | 615 | | 615 | | 616 | | 618 | | 618 | | 620 | | | | | 17.2 | Mech | anical and Magnetic Components | 620 | |------------|------|--------|----------------------------------------------|-----| | | | 17.2.1 | Disks | 621 | | | | 17.2.2 | Spindle Motor | 623 | | | | 17.2.3 | Heads | 625 | | | | 17.2.4 | Slider and Head-Gimbal Assembly | 631 | | | | 17.2.5 | Head-Stack Assembly and Actuator | 633 | | | | 17.2.6 | Multiple Platters | 635 | | | | 17.2.7 | Start/Stop | 636 | | | | 17.2.8 | Magnetic Disk Recording Integration | 637 | | | | 17.2.9 | Head-Disk Assembly | 639 | | | 17.3 | Electr | onics | 640 | | | | 17.3.1 | Controller | 640 | | | | 17.3.2 | Memory | 642 | | | | 17.3.3 | Recording Channel | 642 | | | | 17.3.4 | Motor Controls | 646 | | | | | | | | Chapter 18 | The | Data | Layer | 640 | | Chapter 10 | 18.1 | | Blocks and Sectors | | | | 10.1 | 18.1.1 | Fixed-Size Blocks | | | | | 18.1.2 | Variable Size Blocks | | | | | 18.1.3 | | | | | 18.2 | | s and Cylinders | | | | 18.3 | | ess Mapping | | | | 10.5 | 18.3.1 | | | | | | 18.3.2 | | | | | | 18.3.3 | Logical Address to Physical Location Mapping | | | | 18.4 | | l-Bit Recording | | | | 10.1 | | Handling ZBR | | | | 18.5 | | 110000006 2251 | | | | 10.0 | 18.5.1 | Dedicated Servo | | | | | 18.5.2 | Embedded Servo | | | | | 18.5.3 | Servo ID and Seek | | | | | 18.5.4 | Servo Burst and Track Following | | | | | 18.5.5 | Anatomy of a Servo | | | | | 18.5.6 | ZBR and Embedded Servo | | | | | 10.0.0 | 221. S. S. Lindowwood Colo Commission | 000 | | | 18.6 | Sector ID and No-ID Formatting | 670 | |------------|------|-----------------------------------------------|-----| | | 18.7 | Capacity | 672 | | | 18.8 | Data Rate | 673 | | | 18.9 | Defect Management | 673 | | | | 18.9.1 Relocation Schemes | 674 | | | | 18.9.2 Types of Defects | 675 | | | | 18.9.3 Error Recovery Procedure | 676 | | Chapter 19 | Perf | ormance Issues and Design Trade-Offs | 677 | | | 19.1 | Anatomy of an I/O | 677 | | | | 19.1.1 Adding It All Up | 679 | | | 19.2 | Some Basic Principles | 681 | | | | 19.2.1 Effect of User Track Capacity | 681 | | | | 19.2.2 Effect of Cylinder Capacity | 682 | | | | 19.2.3 Effect of Track Density | 684 | | | | 19.2.4 Effect of Number of Heads | 686 | | | 19.3 | BPI vs. TPI | 688 | | | 19.4 | Effect of Drive Capacity | 689 | | | | 19.4.1 Space Usage Efficiency | 690 | | | | 19.4.2 Performance Implication | 690 | | | 19.5 | Concentric Tracks vs. Spiral Track | 692 | | | | 19.5.1 Optical Disks | 693 | | | 19.6 | Average Seek | 694 | | | | 19.6.1 Disks Without ZBR | 694 | | | | 19.6.2 Disks With ZBR | 695 | | Chapter 20 | Driv | ve Interface | 699 | | • | 20.1 | Overview of Interfaces | | | | | 20.1.1 Components of an Interface | 701 | | | | 20.1.2 Desirable Characteristics of Interface | | | | 20.2 | ATA | | | | 20.3 | Serial ATA | 703 | | | 20.4 | SCSI | 705 | | | 20.5 | Serial SCSI | | | | 20.6 | Fibre Channel | | | | 20.7 | Cost Performance and Reliability | | | Chapter 21 | Operational Performance Improvement | | | | | |------------|-------------------------------------|-------------------------------------------------|-----|--|--| | - | - | Latency Reduction Techniques | | | | | | | 21.1.1 Dual Actuator | 711 | | | | | | 21.1.2 Multiple Copies | 712 | | | | | | 21.1.3 Zero Latency Access | 713 | | | | | 21.2 | Command Queueing and Scheduling | 715 | | | | | | 21.2.1 Seek Time-Based Scheduling | 716 | | | | | | 21.2.2 Total Access Time Based Scheduling | 717 | | | | | | 21.2.3 Sequential Access Scheduling | 723 | | | | | 21.3 | Reorganizing Data on the Disk | 723 | | | | | | 21.3.1 Defragmentation | 724 | | | | | | 21.3.2 Frequently Accessed Files | 724 | | | | | | 21.3.3 Co-Locating Access Clusters | 725 | | | | | | 21.3.4 ALIS | 726 | | | | | 21.4 | Handling Writes | 728 | | | | | | 21.4.1 Log-Structured Write | 728 | | | | | | 21.4.2 Disk Buffering of Writes | 729 | | | | | 21.5 | Data Compression | 729 | | | | | | | | | | | Chapter 99 | The | Cache Layer | 731 | | | | Chapter 22 | 22.1 | Disk Cache | | | | | | <i>22.</i> 1 | | | | | | | | 22.1.1 Why Disk Cache Works | | | | | | | 22.1.2 Cache Automation | | | | | | 00.0 | 22.1.3 Read Cache, Write Cache | | | | | | 22.2 | Cache Organizations | | | | | | | 22.2.1 Desirable Features of Cache Organization | | | | | | | 22.2.2 Fixed Segmentation | | | | | | | 22.2.3 Circular Buffer | | | | | | 00.0 | 22.2.4 Virtual Memory Organization | | | | | | 22.3 | Caching Algorithms | | | | | | | 22.3.1 Perspective of Prefetch | | | | | | | 22.3.2 Lookahead Prefetch | | | | | | | 22.3.3 Look-behind Prefetch | | | | | | | 22.3.4 Zero Latency Prefetch | 743 | | | | | | 22.3.5 | ERP During Prefetch | 743 | | |------------|----------------------|--------|----------------------------------|-----|--| | | | 22.3.6 | Handling of Sequential Access | | | | | | 22.3.7 | Replacement Policies | | | | Chapter 23 | Performance Testing7 | | | | | | | 23.1 | | nd Measurement | | | | | | 23.1.1 | Test Initiator | | | | | | 23.1.2 | Monitoring and Measuring | | | | | | 23.1.3 | The Test Drive | | | | | 23.2 | Basic' | Tests | | | | | | 23.2.1 | Media Data Rate | 751 | | | | | 23.2.2 | Disk Buffer Data Rate | 751 | | | | | 23.2.3 | Sequential Performance | 752 | | | | | 23.2.4 | Random Performance | 752 | | | | | 23.2.5 | Command Reordering Performance | 755 | | | | 23.3 | Bench | nmark Tests | 755 | | | | | 23.3.1 | Guidelines for Benchmarking | 756 | | | | 23.4 | Drive | Parameters Tests | 757 | | | | | 23.4.1 | Geometry and More | 757 | | | | | 23.4.2 | Seek Time | | | | Chapter 24 | Stor | age Si | ubsystems | 763 | | | • | 24.1 | _ | Striping | | | | | 24.2 | Data I | | 765 | | | | | 24.2.1 | Basic Mirroring | 766 | | | | | 24.2.2 | Chained Decluster Mirroring | 766 | | | | | 24.2.3 | Interleaved Decluster Mirroring | 767 | | | | | 24.2.4 | Mirroring Performance Comparison | 767 | | | | | 24.2.5 | Mirroring Reliability Comparison | 769 | | | | 24.3 | RAID | | 770 | | | | | 24.3.1 | RAID Levels | 770 | | | | | 24.3.2 | RAID Performance | 774 | | | | | 24.3.3 | RAID Reliability | 775 | | | | | 24.3.4 | Sparing | 776 | | | | | 24.3.5 RAID Controller | 778 | |------------|-------------|---------------------------------------------|-----| | | | 24.3.6 Advanced RAIDs | 779 | | | 24.4 | SAN | 780 | | | 24.5 | NAS | 782 | | | 24.6 | ISCSI | 783 | | Chapter 25 | Adv | vanced Topics | 785 | | | 25.1 | Perpendicular Recording | 785 | | | | 25.1.1 Write Process, Write Head, and Media | 786 | | | | 25.1.2 Read Process and Read Head | 788 | | | 25.2 | Patterned Media | 788 | | | | 25.2.1 Fully Patterned Media | 789 | | | | 25.2.2 Discrete Track Media | 790 | | | 25.3 | Thermally Assisted Recording | 790 | | | <b>25.4</b> | Dual Stage Actuator | 792 | | | | 25.4.1 Microactuators | 793 | | | 25.5 | Adaptive Formatting | 794 | | | 25.6 | Hybrid Disk Drive | 796 | | | | 25.6.1 Benefits | 796 | | | | 25.6.2 Architecture | 796 | | | | 25.6.3 Proposed Interface | 797 | | | 25.7 | Object-Based Storage | 798 | | | | 25.7.1 Object Storage Main Concept | 798 | | | | 25.7.2 Object Storage Benefits | 800 | | Chapter 26 | Case | e Study | 803 | | - | 26.1 | The Mechanical Components | 803 | | | | 26.1.1 Seek Profile | | | | 26.2 | Electronics | | | | 26.3 | Data Layout | | | | | 26.3.1 Data Rate | | | | 26.4 | Interface | 809 | | | 26.5 | Cache | 810 | | | 26.6 | Performance Testing | 810 | | | 26.6.1Sequential Access26.6.2Random Access | | | | |------------|----------------------------------------------------------------------|-----|--|--| | Part IV | Cross-Cutting Issues | 813 | | | | Chapter 27 | The Case for Holistic Design | 815 | | | | • | 27.1 Anecdotes, Revisited | | | | | | 27.1.1 Anecdote I: Systemic Behaviors Exist | | | | | | 27.1.2 Anecdote II: The DLL in DDR SDRAM | | | | | | 27.1.3 Anecdote III: A Catch-22 in the Search for Bandwidth | | | | | | 27.1.4 Anecdote IV: Proposals to Exploit Variability in Cell Leakage | | | | | | 27.2 Perspective | | | | | Chapter 28 | Analysis of Cost and Performance | 829 | | | | • | 28.1 Combining Cost and Performance | 829 | | | | | 28.2 Pareto Optimality | | | | | | 28.2.1 The Pareto-Optimal Set: An Equivalence Class | | | | | | 28.2.2 Stanley's Observation | | | | | | 28.3 Taking Sampled Averages Correctly | | | | | | 28.3.1 Sampling Over Time | 834 | | | | | 28.3.2 Sampling Over Distance | 835 | | | | | 28.3.3 Sampling Over Fuel Consumption | 836 | | | | | 28.3.4 The Moral of the Story | 837 | | | | | 28.4 Metrics for Computer Performance | 838 | | | | | 28.4.1 Performance and the Use of Means | 838 | | | | | 28.4.2 Problems with Normalization | 839 | | | | | 28.4.3 The Meaning of Performance | 842 | | | | | 28.5 Analytical Modeling and the Miss-Rate Function | 843 | | | | | 28.5.1 Analytical Modeling | 843 | | | | | 28.5.2 The Miss-Rate Function | 844 | | | | Chapter 29 | Power and Leakage | 847 | | | | - | 29.1 Sources of Leakage in CMOS Devices | | | | | | 29.2 A Closer Look at Subthreshold Leakage | 851 | | | | | 29.3 | | TI and Energy/Power Breakdown of Pipelined<br>meter Caches | 855 | |------------|------------------------------------|--------|-------------------------------------------------------------------------|-----| | | | 29.3.1 | Leakage in SRAM Cells | 855 | | | | 29.3.2 | Pipelined Caches | 857 | | | | 29.3.3 | Modeling | 858 | | | | 29.3.4 | Dynamic and Static Power | 859 | | | | 29.3.5 | Detailed Power Breakdown | 860 | | Chapter 30 | Memory Errors and Error Correction | | | | | | 30.1 | Types | and Causes of Failures | 865 | | | | 30.1.1 | Alpha Particles | 866 | | | | 30.1.2 | Primary Cosmic Rays and Terrestrial Neutrons | 866 | | | | 30.1.3 | Soft Error Mechanism | 867 | | | | 30.1.4 | Single-Bit and Multi-Bit Failures | 867 | | | 30.2 | Soft E | rror Rates and Trends | 868 | | | 30.3 | Error | Detection and Correction | 869 | | | | 30.3.1 | Parity | 869 | | | | 30.3.2 | Single-Bit Error Correction (SEC ECC) | 870 | | | | 30.3.3 | Single-Bit Error Correction, Double-Bit Error<br>Detection (SECDED ECC) | 873 | | | | 30.3.4 | Multi-Bit Error Detection and Correction: Bossen's b-Adjacent Algorithm | 874 | | | | 30.3.5 | Bit Steering and Chipkill | 875 | | | | 30.3.6 | Chipkill with ×8 DRAM Devices | 877 | | | | 30.3.7 | Memory Scrubbing | 879 | | | | 30.3.8 | Bullet Proofing the Memory System | 880 | | | 30.4 | Relial | oility of Non-DRAM Systems | 880 | | | | 30.4.1 | SRAM | 880 | | | | 30.4.2 | Flash | 880 | | | | 30.4.3 | MRAM | 880 | | | 30.5 | Space | Shuttle Memory System | 881 | | Chapter 31 | Virtu | al Me | mory | 883 | | | 31.1 | AVirt | ual Memory Primer | 884 | | | | 31.1.1 | Address Spaces and the Main Memory Cache | 885 | | | | 31.1.2 | Address Mapping and the Page Table | 886 | | | | 31.1.3 | Hierarchical Page Tables | 887 | | Index | | | | 055 | |------------|------|--------|--------------------------------------------------------------|-------------| | References | | | | 921 | | | | 31.2.2 | Precise Interrupts in Pipelined Computers | 910 | | | | 31.2.1 | The Basic In-Order Pipe | 908 | | | 31.2 | Imple | menting Virtual Memory | 906 | | | | 31.1.8 | Perspective: A Taxonomy of Address Space Organizations | 901 | | | | 31.1.7 | Perspective: Segmented Addressing Solves the Synonym Problem | 895 | | | | 31.1.6 | Translation Lookaside Buffers, Revisited | <b>89</b> 3 | | | | 31.1.5 | Comparison: Inverted vs. Hierarchical | 892 | | | | 31.1.4 | Inverted Page Tables | 890 | ## Preface ## "It's the Memory, Stupid!" If you develop an ear for sounds that are musical it is like developing an ego. You begin to refuse sounds that are not musical and that way cut yourself off from a good deal of experience. -John Cage In 1996, Richard Sites, one of the fathers of computer architecture and lead designers of the DEC Alpha, had the following to say about the future of computer architecture research: Across the industry, today's chips are largely able to execute code faster than we can feed them with instructions and data. There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit. The real design action is in memory subsystems—caches, buses, bandwidth, and latency. An anecdote: in a recent database benchmark study using TPC-C, both 200-MHz Pentium Pro and 400MHz 21164 Alpha systems were measured at 4.2–4.5 CPU cycles per instruction retired. In other words, three out of every four CPU cycles retired zero instructions: most were spent waiting for memory. Processor speed has seriously outstripped memory speed. Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse. If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams. All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy. It is not clear whether pin bandwidth can keep up—32 bytes every 2ns? I expect that over the coming decade memory subsystems design will be the **only** important design issue for microprocessors. [Sites 1996, emphasis Sites'] The title of Sites' article is "It's the Memory, Stupid!" Sites realized in 1996 what we as a community are only now, more than a decade later, beginning to digest and internalize fully: *uh, guys, it really is the memory system ... little else matters right now, so stop wasting time and resources on other facets of the design.* Most of his colleagues designing next-generation Alpha architectures at Digital Equipment Corp. ignored his advice and instead remained focused on building ever faster microprocessors, rather than shifting their focus to the building of ever faster *systems*. It is perhaps worth noting that Digital Equipment Corp. no longer exists. The increasing gap between processor and memory speeds has rendered the organization, architecture, and design of memory subsystems an increasingly important part of computer-systems design. Today, the divide is so severe we are now in one of those down-cycles where the processor is so good at xxxi number-crunching it has completely sidelined itself; it is too fast for its own good, in a sense. Sites' prediction came true: memory subsystems design is now and has been for several years the *only* important design issue for microprocessors and systems. Memory-hierarchy parameters affect system performance *significantly* more than processor parameters (e.g., they are responsible for 2–10× changes in execution time, as opposed to 2–10%), making it absolutely essential for any designer of computer systems to exhibit an in-depth knowledge of the memory system's organization, its operation, its not-so-obvious behavior, and its range of performance characteristics. This is true now, and it is likely to remain true in the near future. Thus this book, which is intended to provide exactly that type of in-depth coverage over a wide range of topics. ## **Topics Covered** In the following chapters we address the logical design and operation, the physical design and operation, the performance characteristics (i.e., design trade-offs), and, to a limited extent, the energy consumption of modern memory hierarchies. In the cache section, we present topics and perspectives that will be new (or at least interesting) to even veterans in the field. What this implies is that the cache section is not an overview of processorcache organization and its effect on performance instead, we build up the concept of cache from first principles and discuss topics that are incompletely covered in the computer-engineering literature. The section discusses a significant degree of historical development in cache-management techniques, the physical design of modern SRAM structures, the operating system's role in cache coherence, and the continuum of cache architectures from those that are fully transparent (to application software and/or the operating system) to those that are fully visible. DRAM and disk are interesting technologies because, unlike caches, they are not typically integrated onto the microprocessor die. Thus any discussion of these topics necessarily deals with the issue of communication: e.g., channels, signalling, protocols, and request scheduling. DRAM involves one or more chip-to-chip crossings, and so signalling and signal integrity are as fundamental as circuit design to the technology. In the DRAM section, we present an intuitive understanding of exactly what happens inside the DRAM so that the ubiquitous parameters of the interface (e.g., $t_{\rm RCD}$ , $t_{\rm CAS}$ , etc.) will make sense. We survey the various DRAM architectures that have appeared over the years and give an in-depth description of the technologies in the next generation memory-system architecture. We discuss memory-controller issues and investigate performance issues of modern systems. The disk section builds from the bottom up, providing a view of the disk from physical recording principles to the configuration and operation of disks within system settings. We discuss the operation of the disk's read/write heads; the arrangement of recording media within the enclosure; and the organization-level view of blocks, sectors, tracks, and cylinders, as well as various protocols used to encode data. We discuss performance issues and techniques used to improve performance, including caching and buffering, prefetching, request scheduling, and data reorganization. We discuss the various disk interfaces available today (e.g., ATA, serial ATA, SCSI, fibre channel, etc.) as well as system configurations such as RAID, SAN, and NAS. The last section of the book, *Cross-Cutting Issues*, covers topics that apply to all levels of the memory hierarchy, such as the tools of analysis and how to use them correctly, subthreshold leakage power in CMOS devices and circuits, a look at power breakdowns in future SRAMs, codes for error detection and error correction, the design and operation of virtual memory systems, and the hardware mechanisms that are required in microprocessors to support virtual memory. #### **Goals and Audience** The primary goal of this book is to bring the reader to a level of understanding at which the physical design and/or detailed software emulation of the entire hierarchy is possible, from cache to disk. As we argue in the initial chapter, this level of understanding is important now and will become increasingly necessary over time. Another goal of the book is to discuss techniques of analysis, so that the next generation of design engineers is prepared to tackle the nontrivial multidimensional optimization problems that result from considering detailed side-effects that can manifest themselves at any point in the entire hierarchy. Accordingly, our target audience are those planning to build and/or optimize memory systems: i.e., computer-engineering and computer-science faculty and graduate students (and perhaps advanced undergraduates) and developers in the computer design, peripheral design, and embedded systems industries. As an educational textbook, this is targeted at graduate and undergraduate students with a solid background in computer organization and architecture. It could serve to support an advanced senior-level undergraduate course or a second-year graduate course specializing in computer-systems design. There is clearly far too much material here for any single course; the book provides depth on enough topics to support two to three separate courses. For example, at the University of Maryland we use the DRAM section to teach a graduate class called High-Speed Memory Systems, and we supplement both our general and advanced architecture classes with material from the sections on Caches and Cross-Cutting Issues. The Disk section could support a class focused solely on disks, and it is also possible to create for advanced students a survey class that lightly touches on all the topics in the book. As a reference, this book is targeted toward both academics and professionals alike. It provides the breadth necessary to understand the wide scope of behaviors that appear in modern memory systems, and most of the topics are addressed in enough depth that a reader should be able to build (or at least model in significant detail) caches, DRAMs, disks, their controllers, their subsystems ... and understand their interactions. What this means is that the book should not only be useful to developers, but it should also be useful to those responsible for long-range planning and forecasting for future product developments and their issues. #### **Acknowledgments and Thanks** Beyond the kind folks named in the book's Dedication, we have a lot of people to thank. The following people were enormously helpful in creating the contents of this book: - Prof. Rajeev Barua, ECE Maryland, provided text to explain scratch-pad memories and their accompanying partitioning problem (see Chapter 3). - Dr. Brinda Ganesh, a former graduate student in Bruce Jacob's research group, now at Intel DAP, wrote the latency-oriented section of the DRAM-performance chapter (see Section 15.5). - Joseph Gross, a graduate student in Bruce Jacob's research group, updated the numbers in David Wang's Ph.D. proposal to produce the tables comparing the characteristics of modern DRAMs in the book's Overview chapter. - Dr. Ed Growchowski, formerly of Hitachi Global Storage Technologies and currently a consultant for IDEMA, provided some of the disk recording density history charts. - Dr. Martin Hassner, Dr. Bruce Wilson, Dr. Matt White, Chuck Cox (now with IBM), Frank Chu, and Tony Dunn of Hitatchi Global Storage Technologies provided important consultation on various details of disk drive design. - Dr. Windsor Hsu, formerly of IBM Almaden Research Center and now with Data Domain, Inc., provided important consultation on system issues related to disk drives. - Dr. Aamer Jaleel, a former graduate student in Bruce Jacob's research group, now at Intel VSSAD, is responsible for Chapter 4's sections on cache coherence. - Michael Martin, a graduate student in Bruce Jacob's research group, executed simulations for the final table (Ov. 5) and formatted the four large time-sequence graphs in the *Overview* chapter. - Rami Nasr, a graduate student at Maryland who wrote his M.S. thesis on the Fully - Buffered DIMM architecture, provided much of the contents of Chapter 14. - Prof. Marty Peckerar, ECE Maryland, provided brilliant text and illustrations to explain the subthreshold leakage problem (see the book's *Overview* and Section 29.2). - Profs. Yan Solihin, ECE NCSU, and Donald Yeung, ECE Maryland, provided much of the material on hardware and software prefetching (Sections 3.1.2 and 3.2.2); this came from a book chapter on the topic written by the pair for Kaeli and Yew's Speculative Execution in High Performance Computer Architectures (CRC Press, 2005). - Sadagopan Srinivasan, a graduate student in Bruce Jacob's research group, performed the study at the end of Chapter 1 and provided much insight on the memory behavior of both streaming applications and multi-core systems. - Dr. Nuengwong Tuaycharoen, a former graduate student in Bruce Jacob's research group, now at Thailand's Dhurakijpundit University, performed the experiments, and wrote the example holistic analysis at the end of the book's *Overview* chapter, directly relating to the behavior of caches, DRAMs, and disks in a single study. - Patricia Wadkins and others in the Rochester SITLab of Hitachi Global Storage Technologies provided test results and measurement data for the book's Section III on *Disks*. - Mr. Yong Wang, a signal integrity expert, formerly of HP and Intel, now with Meta-RAM, contributed extensively to the Chapter on *DRAM System Signaling and Timing* (Chapter 9). - Michael Xu at Hitachi Global Storage Technologies drew the beautiful, complex illustrations, as well as provided some of the photographs, in the book's Section III on Disks. In addition, several students involved in tool support over the years deserve special recognition: Brinda Ganesh took the reins from David Wang to maintain *DRAMsim*; among other - things, she is largely responsible for the FB-DIMM support in that tool. - Joseph Gross also supports DRAMsim and is leading the development of the second generation version of the software, which is object-oriented and significantly streamlined. - Nuengwong Tuaycharoen integrated the various disparate software modules to produce *SYSim*, a full-system simulator that gave us the wonderful time-sequence graphs at the end of the *Overview*. Numerous reviewers spent considerable time reading early drafts of the manuscript and providing excellent critiques or our direction, approach, and raw content. The following reviewers directly contributed to the book in this way: Ashraf Aboulnaga, *University of Waterloo*; Al Davis, *University of Utah*; Diana Franklin, *California Polytechnic University, San Luis Obispo*; Yiming Hu, *University of Cincinnati*; David Kaeli, *Northeastern University*; Nagi Mekhiel, *Ryerson University, Toronto*; Michael Schuette, Ph.D., VP of Technology Development at *OCZ Technology*, Jim Smith, *University of Wisconsin—Madison*; Yan Solin, *North Carolina State University*, and Several Anonymous reviewers (i.e., we the authors were told not to use their names). The editorial and production staff at Morgan Kaufmann/Elsevier was amazing to work with: Denise Penrose pitched the idea to us in the first place (and enormous thanks go to Jim Smith, who pointed her in our direction) and Nate McFadden and Paul Gottehrer made the process of writing, editing, and proofing go incredibly smoothly. Lastly, Dr. Richard Matick, the author of the original memory-systems book (*Computer Storage Systems and Technology*, John Wiley & Sons, 1976) and currently a leading researcher in embedded DRAM at IBM T. J. Watson, provided enormous help in direction, focus, and approach. Dave, Spencer, Sam, and I are indebted to all of these writers, illustrators, coders, editors, and reviewers alike; they all helped to make this book what it is. To those contributors: thank you. You rock. Bruce Jacob, Summer 2007 College Park, Maryland # On Memory Systems and Their Design Memory is essential to the operation of a computer system, and nothing is more important to the development of the modern memory system than the concept of the memory hierarchy. While a flat memory system built of a single technology is attractive for its simplicity, a well-implemented hierarchy allows a memory system to approach simultaneously the performance of the fastest component, the cost per bit of the cheapest component, and the energy consumption of the most energy-efficient component. For years, the use of a memory hierarchy has been very convenient, in that it has simplified the process of designing memory systems. The use of a hierarchy allowed designers to treat system design as a modularized process—to treat the memory system as an abstraction and to optimize individual subsystems (caches, DRAMs [dynamic RAM], disks) in isolation. However, we are finding that treating the hierarchy in this way—as a set of disparate subsystems that interact only through well-defined functional interfaces and that can be optimized in isolation—no longer suffices for the design of modern memory systems. One trend becoming apparent is that many of the underlying implementation issues are becoming significant. These include the physics of device and interconnect scaling, the choice of signaling protocols and topologies to ensure signal integrity, design parameters such as granularity of access and support for concurrency, and communication-related issues such as scheduling algorithms and queueing. These low-level details have begun to affect the higher level design process quite dramatically, whereas they were considered transparent only a design-generation ago. Cache architectures are appearing that play to the limitations imposed by interconnect physics in deep submicron processes; modern DRAM design is driven by circuit-level limitations that create system-level headaches; and modern disk performance is dominated by the on-board caching and scheduling policies. This is a non-trivial environment in which to attempt optimal design. This trend will undoubtedly become more important as time goes on, and even now it has tremendous impact on design results. As hierarchies and their components grow more complex, systemic behaviors—those arising from the complex interaction of the memory system's parts-have begun to dominate. The real loss of performance is not seen in the CPU or caches or DRAM devices or disk assemblies themselves, but in the subtle interactions between these subsystems and in the manner in which these subsystems are connected. Consequently, it is becoming increasingly foolhardy to attempt systemlevel optimization by designing/optimizing each of the parts in isolation (which, unfortunately, is often the approach taken in modern computer design). No longer can a designer remain oblivious to issues "outside the scope" and focus solely on designing a subsystem. It has now become the case that a memory-systems designer, wishing to build a properly behaved memory hierarchy, must be intimately familiar with issues involved at all levels of an implementation, from cache to DRAM to disk. Thus, we wrote this book. #### **Ov.1** Memory Systems A memory hierarchy is designed to provide multiple functions that are seemingly mutually exclusive. We start at random-access memory (RAM): all microprocessors (and computer systems in general) expect a random-access memory out of which they operate. This is fundamental to the structure of modern software, built upon the von Neumann model in which code and data are essentially the same and reside in the same place (i.e., memory). All requests, whether for instructions or for data, go to this random-access memory. At any given moment, any particular datum in memory may be needed; there is no requirement that data reside next to the code that manipulates it, and there is no requirement that two instructions executed one after the other need to be adjacent in memory. Thus, the memory system must be able to handle randomly addressed<sup>1</sup> requests in a manner that favors no particular request. For instance, using a tape drive for this primary memory is unacceptable for performance reasons, though it might be acceptable in the Turing-machine sense. Where does the mutually exclusive part come in? As we said, all microprocessors are built to expect a random-access memory out of which they can operate. Moreover, this memory must be fast, matching the machine's processing speed; otherwise, the machine will spend most of its time tapping its foot and staring at its watch. In addition, modern software is written to expect gigabytes of storage for data, and the modern consumer expects this storage to be cheap. How many memory technologies provide both tremendous speed and tremendous storage capacity at a low price? Modern processors execute instructions both out of order and speculatively-put simply, they execute instructions that, in some cases, are not meant to get executed—and system software is typically built to expect that certain changes to memory are permanent. How many memory technologies provide non-volatility and an *undo* operation? While it might be elegant to provide all of these competing demands with a single technology (say, for example, a gigantic battery-backed SRAM [static RAM]), and though there is no engineering problem that cannot be solved (if ever in doubt about this, simply query a room full of engineers), the reality is that building a full memory system out of such a technology would be prohibitively expensive today. The good news is that it is not necessary. Specialization and division of labor make possible all of these competing goals simultaneously. Modern memory systems often have a terabyte of storage on the desktop and provide instruction-fetch and data-access bandwidths of 128 GB/s or more. Nearly all of the storage in the system is non-volatile, and speculative execution on the part of the microprocessor is supported. All of this can be found in a memory system that has an average cost of roughly 1/100,000,000 pennies per bit of storage. The reason all of this is possible is because of a phenomenon called locality of reference [Belady 1966, Denning 1970]. This is an observed behavior that computer applications tend to exhibit and that, when exploited properly, allows a small memory to serve in place of a larger one. ### Ov.1.1 Locality of Reference Breeds the **Memory Hierarchy** We think linearly (in steps), and so we program the computer to solve problems by working in steps. The practical implications of this are that a computer's use of the memory system tends to be non-random and highly predictable. Thus is born the concept of locality of reference, so named because memory references tend to be localized in time and space: - If you use something once, you are likely to use it again. - If you use something once, you are likely to use its neighbor. The first of these principles is called *temporal local*ity; the second is called spatial locality. We will discuss them (and another type of locality) in more detail in Part I: Cache of this book, but for now it suffices to <sup>&</sup>lt;sup>1</sup>Though "random" addressing is the commonly used term, authors actually mean *arbitrarily* addressed requests because, in most memory systems, a randomly addressed sequence is one of the most efficiently handled events. $<sup>^2</sup>$ Even Cray machines, which were famous for using SRAM as their main memory, today are built upon DRAM for their main memory. say that one can exploit the locality principle and render a single-level memory system, which we just said was expensive, unnecessary. If a computer's use of the memory system, given a small time window, is both predictable and limited in spatial extent, then it stands to reason that a program does not need all of its data immediately accessible. A program would perform nearly as well if it had, for instance, a twolevel store, in which the first level provides immediate access to a subset of the program's data, the second level holds the remainder of the data but is slower and therefore cheaper, and some appropriate heuristic is used to manage the movement of data back and forth between the levels, thereby ensuring that the mostneeded data is usually in the first-level store. This generalizes to the *memory hierarchy*: multiple levels of storage, each optimized for its assigned task. By choosing these levels wisely a designer can produce a system that has the best of all worlds: performance approaching that of the fastest component, cost per bit approaching that of the cheapest component, and energy consumption per access approaching that of the least power-hungry component. The modern hierarchy is comprised of the following components, each performing a particular function or filling a functional niche within the system: Cache (SRAM): Cache provides access to program instructions and data that has very low latency (e.g., 1/4 nanosecond per access) and very high bandwidth (e.g., a 16-byte instruction block and a 16-byte - data block per cycle => 32 bytes per 1/4 nanosecond, or 128 bytes per nanosecond, or 128 GB/s). It is also important to note that cache, on a per-access basis, also has relatively low energy requirements compared to other technologies. - **DRAM:** DRAM provides a random-access storage that is relatively large, relatively fast, and relatively cheap. It is large and cheap compared to cache, and it is fast compared to disk. Its main strength is that it is just fast enough and just cheap enough to act as an operating store. - Disk: Disk provides permanent storage at an ultra-low cost per bit. As mentioned, nearly all computer systems expect some data to be modifiable yet permanent, so the memory system must have, at some level, a permanent store. Disk's advantage is its very reasonable cost (currently less than 50¢ per gigabyte), which is low enough for users to buy enough of it to store thousands of songs, video clips, photos, and other memory hogs that users are wont to accumulate in their accounts (authors included). Table Ov.1 lists some rough order-of-magnitude comparisons for access time and energy consumption per access. Why is it not feasible to build a flat memory system out of these technologies? Cache is far too expensive to be used as permanent storage, and its cost to store a single album's worth of audio would exceed that of the | TABLE Ov.1 | Cost-performance for | or various memory | technologies | |------------|----------------------|-------------------|--------------| |------------|----------------------|-------------------|--------------| | Technology | Bytes per Access (typ.) | Latency per Access | Cost per Megabyte <sup>a</sup> | Energy per Access | |----------------|---------------------------|-----------------------|--------------------------------|--------------------------| | On-chip Cache | 10 | 100 of picoseconds | \$1–100 | 1 nJ | | Off-chip Cache | 100 | Nanoseconds | \$1–10 | 10–100 nJ | | DRAM | 1000 (internally fetched) | 10–100<br>nanoseconds | \$0.1 | 1–100 nJ (per<br>device) | | Disk | 1000 | Milliseconds | \$0.001 | 100–1000 mJ | <sup>&</sup>lt;sup>a</sup>Cost of semiconductor memory is extremely variable, dependent much more on economic factors and sales volume than on manufacturing issues. In particular, on-chip caches (i.e., those integrated with a microprocessor core) can take up half of the die area, in which case their "cost" would be half of the selling price of that microprocessor. Depending on the market (e.g., embedded versus high end) and sales volume, microprocessor costs cover an enormous range of prices, from pennies per square millimeter to several dollars per square millimeter. original music CD by several orders of magnitude. Disk is far too slow to be used as an operating store, and its average seek time for random accesses is measured in milliseconds. Of the three, DRAM is the closest to providing a flat memory system. DRAM is sufficiently fast enough that, without the support of a cache front-end, it can act as an operating store for many embedded systems, and with battery back-up it can be made to function as a permanent store. However, DRAM alone is not cheap enough to serve the needs of human users, who often want nearly a terabyte of permanent storage, and, even with random access times in the tens of nanoseconds, DRAM is not quite fast enough to serve as the only memory for modern general-purpose microprocessors, which would prefer a new block of instructions every fraction of a nanosecond. So far, no technology has appeared that provides every desired characteristic: low cost, non-volatility, high bandwidth, low latency, etc. So instead we build a system in which each component is designed to offer one or more characteristics, and we manage the operation of the system so that the poorer characteristics of the various technologies are "hidden." For example, if most of the memory references made by the microprocessor are handled by the cache and/or DRAM subsystems, then the disk will be used only rarely, and, therefore, its extremely long latency will contribute very little to the average access time. If most of the data resides in the disk subsystem, and very little of it is needed at any given moment in time, then the cache and DRAM subsystems will not need much storage, and, FIGURE Ov.1: A memory hierarhy. therefore, their higher costs per bit will contribute very little to the average cost of the system. If done right, a memory system has an average cost approaching that of bottom- most layer and an average access time and bandwidth approaching that of topmost layer. The memory hierarchy is usually pictured as a pyramid, as shown in Figure Ov.1. The higher levels in the hierarchy have better performance characteristics than the lower levels in the hierarchy; the higher levels have a higher cost per bit than the lower levels; and the system uses fewer bits of storage in the higher levels than found in the lower levels. Though modern memory systems are comprised of SRAM, DRAM, and disk, these are simply technologies chosen to serve particular needs of the system, namely permanent store, operating store, and a fast store. Any technology set would suffice if it (a) provides permanent and operating stores and (b) satisfies the given computer system's performance, cost, and power requirements. # Permanent Store The system's permanent store is where everything lives ... meaning it is home to data that can be modified (potentially), but whose modifications must be remembered across invocations of the system (powerups and power-downs). In general-purpose systems, this data typically includes the operating system's files, such as boot program, OS (operating system) executable, libraries, utilities, applications, etc., and the users' files, such as graphics, word-processing documents, spreadsheets, digital photographs, digital audio and video, email, etc. In embedded systems, this data typically includes the system's executable image and any installation-specific configuration information that it requires. Some embedded systems also maintain in permanent store the state of any partially completed transactions to withstand worst-case scenarios such as the system going down before the transaction is finished (e.g., financial transactions). These all represent data that should not disappear when the machine shuts down, such as a user's saved email messages, the operating system's code and configuration information, and applications and their saved documents. Thus, the storage must be *non-volatile*, which in this context means not susceptible to power outages. Storage technologies chosen for permanent store include magnetic disk, flash memory, and even EEPROM (electrically erasable programmable read-only memory), of which flash memory is a special type. Other forms of programmable ROM (read-only memory) such as ROM, PROM (programmable ROM), or EPROM (erasable programmable ROM) are suitable for non-writable permanent information such as the executable image of an embedded system or a general-purpose system's boot code and BIOS.<sup>3</sup> Numerous exotic non-volatile technologies are in development, including magnetic RAM (MRAM), FeRAM (ferroelectric RAM), and phase-change RAM (PCRAM). In most systems, the cost per bit of this technology is a very important consideration. In general-purpose systems, this is the case because these systems tend to have an enormous amount of permanent storage. A desktop can easily have more than 500 GB of permanent store, and a departmental server can have one hundred times that amount. The enormous number of bits in these systems translates even modest costper-bit increases into significant dollar amounts. In embedded systems, the cost per bit is important because of the significant number of units shipped. Embedded systems are often consumer devices that are manufactured and sold in vast quantities, e.g., cell phones, digital cameras, MP3 players, programmable thermostats, and disk drives. Each embedded system might not require more than a handful of megabytes of storage, yet a tiny 1¢ increase in the cost per megabyte of memory can translate to a \$100,000 increase in cost per million units manufactured. # Operating (Random-Access) Store As mentioned earlier, a typical microprocessor expects a new instruction or set of instructions on every clock cycle, and it can perform a data-read or data-write every clock cycle. Because the addresses of these instructions and data need not be sequential (or, in fact, related in any detectable way), the memory system must be able to handle random access—it must be able to provide instant access to any datum in the memory system. The machine's operating store is the level of memory that provides random access at the microprocessor's data granularity. It is the storage level out of which the microprocessor could conceivably operate, i.e., it is the storage level that can provide random access to its storage, one data word at a time. This storage level is typically called "main memory." Disks cannot serve as main memory or operating store and cannot provide random access for two reasons: instant access is provided for only the data underneath the disk's head at any given moment, and the granularity of access is not what a typical processor requires. Disks are block-oriented devices, which means they read and write data only in large chunks; the typical granularity is 512 B. Processors, in contrast, typically operate at the granularity of 4 B or 8 B data words. To use a disk, a microprocessor must have additional buffering memory out of which it can read one instruction at a time and read or write one datum at a time. This buffering memory would become the *de facto* operating store of the system. Flash memory and EEPROM (as well as the exotic non-volatile technologies mentioned earlier) are potentially viable as an operating store for systems that have small permanent-storage needs, and the non-volatility of these technologies provides them with a distinct advantage. However, not all are set up as an ideal operating store; for example, flash memory supports wordsized reads but supports only block-sized writes. If this type of issue can be handled in a manner that is transparent to the processor (e.g., in this case through additional data buffering), then the memory technology can still serve as a reasonable hybrid operating store. Though the non-volatile technologies seem positioned perfectly to serve as operating store in all manner of devices and systems, DRAM is the most commonly used technology. Note that the only requirement of a memory system's operating store is that it provide random access with a small access granularity. Nonvolatility is not a requirement, so long as it is provided by another level in the hierarchy. DRAM is a popular choice for operating store for several reasons: DRAM is faster than the various non-volatile technologies (in some cases much faster); DRAM supports an unlimited number of writes, whereas some non-volatile technologies start to fail after being erased and rewritten too many times (in some technologies, as few as 1-10,000 erase/write cycles); and DRAM processes are very similar to those used to build logic devices. <sup>&</sup>lt;sup>3</sup>BIOS = basic input/output system, the code that provides to software low-level access to much of the hardware. DRAM can be fabricated using similar materials and (relatively) similar silicon-based process technologies as most microprocessors, whereas many of the various non-volatile technologies require new materials and (relatively) different process technologies. # Fast (and Relatively Low-Power) Store If these storage technologies provide such reasonable operating store, why, then, do modern systems use cache? Cache is inserted between the processor and the main memory system whenever the access behavior of the main memory is not sufficient for the needs or goals of the system. Typical figures of merit include performance and energy consumption (or power dissipation). If the performance when operating out of main memory is insufficient, cache is interposed between the processor and main memory to decrease the average access time for data. Similarly, if the energy consumed when operating out of main memory is too high, cache is interposed between the processor and main memory to decrease the system's energy consumption. The data in Table Ov.1 should give some intuition about the design choice. If a cache can reduce the number of accesses made to the next level down in the hierarchy, then it potentially reduces both execution time and energy consumption for an application. The gain is only potential because these numbers are valid only for certain technology parameters. For example, many designs use large SRAM caches that consume much more energy than several DRAM chips combined, but because the caches can reduce execution time they are used in systems where performance is critical, even at the expense of energy consumption. It is important to note at this point that, even though the term "cache" is usually interpreted to mean SRAM, a cache is merely a concept and as such imposes no expectations on its implementation. Caches are best thought of as compact databases, as shown in Figure Ov.2. They contain data and, optionally, metadata such as the unique ID (address) of each data block in the array, whether it has been updated recently, etc. Caches can be built from SRAM, DRAM, disk, or virtually any storage technology. They can be managed completely in hardware and thus can be transparent to the running application and even to the memory system itself; and at the other extreme they can be explicitly managed by the running application. For instance, Figure Ov.2 shows that there is an optional block of metadata, which if implemented in hardware would be called the cache's tags. In that instance, a key is passed to the tags array, which produces either the location of the corresponding item in the data array (a cache hit) or an indication that the item is not in the data array (a cache miss). Alternatively, software can be written to index the array explicitly, using direct cache-array addresses, in which case the key lookup (as well as its associated tags array) is unnecessary. The configuration chosen for the cache is called its organization. Cache organizations exist at all spots along the continuum between these two extremes. Clearly, the choice of organization will significantly impact the cache's performance and energy consumption. Predictability of access time is another common figure of merit. It is a special aspect of performance that is very important when building real-time systems or systems with highly orchestrated data movement. DRAM is occasionally in a state where it needs to ignore external requests so that it can guarantee the integrity of its stored data (this is called *refresh* and will be discussed in detail in Part II of the book). Such hiccups in data movement can be disastrous for some applications. For this reason, many microprocessors, such as digital signal processors (DSPs) and processors used in embedded control applications (called *microcontrollers*), often FIGURE Ov.2: An idealized cache lookup. A cache is logically comprised of two elements: the data array and some management information that indicates what is in the data array (labeled "metadata"). Note that the key information may be virtual, i.e., data addresses can be embedded in the software using the cache, in which case there is no explicit key lookup, and only the data array is needed. have special caches that look like small main memories. These are scratch-pad RAMs whose implementation lies toward the end of the spectrum at which the running application manages the cache explicitly. DSPs typically have two of these scratch-pad SRAMs so that they can issue on every cycle a new multiply-accumulate (MAC) operation, an important DSP instruction whose repeated operation on a pair of data arrays produces its dot product. Performing a new MAC operation every cycle requires the memory system to load new elements from two different arrays simultaneously in the same cycle. This is most easily accomplished by having two separate data busses, each with its own independent data memory and each holding the elements of a different array. Perhaps the most familiar example of a softwaremanaged memory is the processor's register file, an array of storage locations that is indexed directly by bits within the instruction and whose contents are dictated entirely by software. Values are brought into the register file explicitly by software instructions, and old values are only overwritten if done so explicitly by software. Moreover, the register file is significantly smaller than most on-chip caches and typically consumes far less energy. Accordingly, software's best bet is often to optimize its use of the register file [Postiff & Mudge 1999]. #### Important Figures of Merit Ov.1.2 The following issues have been touched on during the previous discussion, but at this point it would be valuable to formally present the various figures of merit that are important to a designer of memory systems. Depending on the environment in which the memory system will be used (supercomputer, departmental server, desktop, laptop, signal-processing system, embedded control system, etc.), each metric will carry more or less weight. Though most academic studies tend to focus on one axis at a time (e.g., performance), the design of a memory system is a multi-dimensional optimization problem, with all the adherent complexities of analysis. For instance, to analyze something in this design space or to consider one memory system over another, a designer should be familiar with concepts such as Pareto optimality (described later in this chapter). The various figures of merit, in no particular order other than performance being first due to its popularity, are performance, energy consumption and power dissipation, predictability of behavior (i.e., real time), manufacturing costs, and system reliability. This section describes them briefly, collectively, Later sections will treat them in more detail. #### Performance The term "performance" means many things to many people. The performance of a system is typically measured in the time it takes to execute a task (i.e., task latency), but it can also be measured in the number of tasks that can be handled in a unit time period (i.e., task bandwidth). Popular figures of merit for performance include the following:4 - Cycles per Instruction (CPI) - Total execution cycles = Total user-level instructions committed - Memory-system CPI overhead - = Real CPI CPI assuming perfect memory - Memory Cycles per Instruction (MCPI) - Total cycles spent in memory system Total user-level instructions committed - $Cache\ miss\ rate = \frac{Total\ cache\ misses}{Total\ cache\ accesses}$ - Cache hit rate = 1 Cache miss rate - Average access time - = (hit rate · average to service hit) + (miss rate · average to service miss) - Million Instructions per Second (MIPS) - Instructions executed (seconds) 10<sup>6</sup> · Average required for execution $<sup>^4</sup>$ Note that the MIPS metric is easily abused. For instance, it is inappropriate for comparing different instruction-set architectures, and marketing literature often takes the definition of "instructions executed" to mean any particular given window of time as opposed to the full execution of an application. In such cases, the metric can mean the highest possible issue rate of instructions that the machine can achieve (but not necessarily sustain for any realistic period of time). A cautionary note: using a metric of performance for the memory system that is independent of a processing context can be very deceptive. For instance, the MCPI metric does not take into account how much of the memory system's activity can be overlapped with processor activity, and, as a result, memory system A which has a worse MCPI than memory system B might actually yield a computer system with better total performance. As Figure Ov.5 in a later section shows, there can be significantly different amounts of overlapping activity between the memory system and CPU execution. How to average a set of performance metrics correctly is still a poorly understood topic, and it is very sensitive to the weights chosen (either explicitly or implicitly) for the various benchmarks considered [John 2004]. Comparing performance is always the least ambiguous when it means the amount of time saved by using one design over another. When we ask the question this machine is how much faster than that machine? the implication is that we have been using that machine for some time and wish to know how much time we would save by using this machine instead. The true measure of performance is to compare the total execution time of one machine to another, with each machine running the benchmark programs that represent the user's typical workload as often as a user expects to run them. For instance, if a user compiles a large software application ten times per day and runs a series of regression tests once per day, then the total execution time should count the compiler's execution ten times more than the regression test. # Energy Consumption and Power Dissipation Energy consumption is related to work accomplished (e.g., how much computing can be done with a given battery), whereas power dissipation is the rate of consumption. The instantaneous power dissipation of CMOS (complementary metal-oxide-semiconductor) devices, such as microprocessors, is measured in watts (W) and represents the sum of two components: *active power*, due to switching activity, and *static power*, due primarily to subthreshold leakage. To a first approximation, average power dissipation is equal to the following (we will present a more detailed model later): $$P_{\text{avg}} = (P_{\text{dynamic}} + P_{\text{static}}) \equiv C_{\text{tot}} V_{\text{dd}}^2 f + I_{\text{leak}} V_{\text{dd}}$$ (EQ Ov.1) where $C_{\rm tot}$ is the total capacitance switched, $V_{\rm dd}$ is the power supply, f is the switching frequency, and $I_{\rm leak}$ is the leakage current, which includes such sources as subthreshold and gate leakage. With each generation in process technology, active power is decreasing on a device level and remaining roughly constant on a chip level. Leakage power, which used to be insignificant relative to switching power, increases as devices become smaller and has recently caught up to switching power in magnitude [Grove 2002]. In the future, leakage will be the primary concern. Energy is related to power through time. The energy consumed by a computation that requires T seconds is measured in joules (J) and is equal to the integral of the instantaneous power over time T. If the power dissipation remains constant over T, the resultant energy consumption is simply the product of power and time. $$E = (P_{\text{avg}} \cdot T) \equiv C_{\text{tot}} V^2_{\text{dd}} N + I_{\text{leak}} V_{\text{dd}} T$$ (EQ Ov.2) where N is the number of switching events that occurs during the computation. In general, if one is interested in extending battery life or reducing the electricity costs of an enterprise computing center, then energy is the appropriate metric to use in an analysis comparing approaches. If one is concerned with heat removal from a system or the thermal effects that a functional block can create, then *power* is the appropriate metric. In informal discussions (i.e., in common-parlance prose rather than in equations where units of measurement are inescapable), the two terms "power" and "energy" are frequently used interchangeably, though such use is technically incorrect. Beware, because this can lead to ambiguity and even misconception, which is usually unintentional, but not always so. For instance, microprocessor manufacturers will occasionally claim to have a "low-power" microprocessor that beats its predecessor by a factor of, say, two. This is easily accomplished by running the microprocessor at half the clock rate, which does reduce its power dissipation, but remember that power is the rate at which energy is consumed. However, to a first order, doing so doubles the time over which the processor dissipates that power. The net result is a processor that consumes the same amount of energy as before, though it is branded as having lower power, which is technically not a lie. Popular figures of merit that incorporate both energy/power and performance include the following: - **Energy-Delay Product** - $= \left( \begin{array}{c} \text{Energy required} \\ \text{to perform task} \end{array} \right) \cdot \left( \begin{array}{c} \text{Time required} \\ \text{to perform task} \end{array} \right)$ \to perform task/ - Power-Delay Product - $= \begin{pmatrix} \text{Power required} \\ \text{to perform task} \end{pmatrix}^m \cdot \begin{pmatrix} \text{Time required} \\ \text{to perform task} \end{pmatrix}^r$ - MIPS per watt - Performance of benchmark in MIPS Average power dissipated by benchmark The second equation was offered as a generalized form of the first (note that the two are equivalent when m = 1 and n = 2) so that designers could place more weight on the metric (time or energy/power) that is most important to their design goals [Gonzalez & Horowitz 1996, Brooks et al. 2000a]. # Predictable (Real-Time) Behavior Predictability of behavior is extremely important when analyzing real-time systems, because correctness of operation is often the primary design goal for these systems (consider, for example, medical equipment, navigation systems, anti-lock brakes, flight control systems, etc., in which failure to perform as predicted is not an option). Popular figures of merit for expressing predictability of behavior include the following: - Worst-Case Execution Time (WCET), taken to mean the longest amount of time a function could take to execute - Response time, taken to mean the time between a stimulus to the system and the system's response (e.g., time to respond to an external interrupt) Jitter, the amount of deviation from an average timing value These metrics are typically given as single numbers (average or worst case), but we have found that the probability density function makes a valuable aid in system analysis [Baynes et al. 2001, 2003]. # Design (and Fabrication and Test) Costs Cost is an obvious, but often unstated, design goal. Many consumer devices have cost as their primary consideration: if the cost to design and manufacture an item is not low enough, it is not worth the effort to build and sell it. Cost can be represented in many different ways (note that energy consumption is a measure of cost), but for the purposes of this book, by "cost" we mean the cost of producing an item: to wit, the cost of its design, the cost of testing the item, and/ or the cost of the item's manufacture. Popular figures of merit for cost include the following: - Dollar cost (best, but often hard to even approximate) - Design size, e.g., die area (cost of manufacturing a VLSI (very large scale integration) design is proportional to its area cubed or more) - Packaging costs, e.g., pin count - Design complexity (can be expressed in terms of number of logic gates, number of transistors, lines of code, time to compile or synthesize, time to verify or run DRC (design-rule check), and many others, including a design's impact on clock cycle time [Palacharla et al. 1996]) Cost is often presented in a relative sense, allowing differing technologies or approaches to be placed on equal footing for a comparison. - Cost per storage bit/byte/KB/MB/etc. (allows cost comparison between different storage technologies) - Die area per storage bit (allows sizeefficiency comparison within same process technology) In a similar vein, cost is especially informative when combined with performance metrics. The following are variations on the theme: - Bandwidth per package pin (total sustainable bandwidth to/from part, divided by total number of pins in package) - Execution-time-dollars (total execution time multiplied by total cost; note that cost can be expressed in other units, e.g., pins, die area, etc.) An important note: cost should incorporate all sources of that cost. Focusing on just one source of cost blinds the analysis in two ways: first, the true cost of the system is not considered, and second, solutions can be unintentionally excluded from the analysis. If cost is expressed in pin count, then all pins should be considered by the analysis; the analysis should not focus solely on data pins, for example. Similarly, if cost is expressed in die area, then all sources of die area should be considered by the analysis; the analysis should not focus solely on the number of banks, for example, but should also consider the cost of building control logic (decoders, muxes, bus lines, etc.) to select among the various banks. # Reliability Like the term "performance," the term "reliability" means many things to many different people. In this book, we mean reliability of the data stored within the memory system: how easily is our stored data corrupted or lost, and how can it be protected from corruption or loss? Data integrity is dependent upon physical devices, and physical devices can fail. Approaches to guarantee the integrity of stored data typically operate by storing redundant information in the memory system so that in the case of device failure, some but not all of the data will be lost or corrupted. If enough redundant information is stored, then the missing data can be reconstructed. Popular figures of merit for measuring reliability - Mean Time Between Failures (MTBF): <sup>5</sup> given in time (seconds, hours, etc.) or number of uses - Bit-error tolerance, e.g., how many bit errors in a data word or packet the mechanism can correct, and how many it can detect (but not necessarily correct) - Error-rate tolerance, e.g., how many errors per second in a data stream the mechanism can correct - Application-specific metrics, e.g., how much radiation a design can tolerate before failure, etc. Note that values given for MTBF often seem astronomically high. This is because they are not meant to apply to individual devices, but to system-wide device use, as in a large installation. For instance, if the expected service lifetime of a device is several years, then that device is expected to fail in several years. If an administrator swaps out devices every few years (before the service lifetime is up), then the administrator should expect to see failure frequencies consistent with the MTBF rating. #### Ov.1.3 The Goal of a Memory Hierarchy As already mentioned, a well-implemented hierarchy allows a memory system to approach simultaneously the performance of the fastest component, the cost per bit of the cheapest component, and the energy consumption of the most energy-efficient component. A modern memory system typically has performance close to that of on-chip cache, the fastest component in the system. The rate at which microprocessors fetch and execute their instructions is measured in nanoseconds or fractions of a nanosecond. A modern low-end desktop machine has several hundred gigabytes of storage and sells for under \$500, roughly half of which goes to the on-chip caches, off-chip caches, DRAM, and disk. This represents an average cost of characterize both device fragility and robustness of a proposed solution. They include the following: <sup>&</sup>lt;sup>5</sup>A common variation is "Mean Time To Failure (MTTF)." several dollars per gigabyte—very close to that of disk, the cheapest component. Modern desktop systems have an energy cost that is typically in the low tens of nanojoules per instruction executed—close to that of on-chip SRAM cache, the least energy-costly component in the system (on a per-access basis). The goal for a memory-system designer is to create a system that behaves, on average and from the point of view of the processor, like a big cache that has the price tag of a disk. A successful memory hierarchy is much more than the sum of its parts; moreover, successful memory-system design is non-trivial. How the system is built, how it is used (and what parts of it are used more heavily than others), and on which is sues an engineer should focus most of his effort at design time—all these are highly dependent on the target application of the memory system. Two common categories of target applications are (a) generalpurpose systems, which are characterized by their need for universal applicability for just about any type of computation, and (b) embedded systems, which are characterized by their tight design restrictions along multiple axes (e.g., cost, correctness of design, energy consumption, reliability) and the fact that each executes only a single, dedicated software application its entire lifespan, which opens up possibilities for optimization that are less appropriate for general-purpose systems. # General-Purpose Computer Systems General-purpose systems are what people normally think of as "computers." These are the machines on your desktop, the machines in the refrigerated server room at work, and the laptop on the kitchen table. They are designed to handle any and all tasks thrown at them, and the software they run on a day-to-day basis is radically different from machine to machine. General-purpose systems are typically overbuilt. By definition they are expected by the consumer to run all possible software applications with acceptable speed, and therefore, they are built to handle the average case very well and the worst case at least tolerably well. Were they optimized for any particular task, they could easily become less than optimal for all dissimilar tasks. Therefore, general-purpose systems are optimized for everything, which is another way of saying that they are actually optimized for nothing in particular. However, they make up for this in raw performance, pure number-crunching. The average notebook computer is capable of performing orders of magnitude more operations per second than that required by a word processor or email client, tasks to which the average notebook is frequently relegated, but because the general-purpose system may be expected to handle virtually anything at any time, it must have significant spare numbercrunching ability, just in case. It stands to reason that the memory system of this computer must also be designed in a Swiss-armyknife fashion. Figure Ov.3 shows the organization of a typical personal computer, with the components of the memory system highlighted in grey boxes. The cache levels are found both on-chip (i.e., integrated on the same die as the microprocessor core) and off-chip (i.e., on a separate die). The DRAM system is comprised of a memory controller and a number of DRAM chips organized into DIMMs (dual in-line memory modules, printed circuit boards that contain a handful of DRAMs each). The memory controller can be located on-chip or off-chip, but the DRAMs are always separate from the CPU to allow memory upgrades. The disks in the system are considered peripheral devices, and so their access is made through one or more levels of controllers, each representing a potential chip-to-chip crossing (e.g., here a disk request passes through the system controller to the PCI (peripheral component interconnect) bus controller, to the SCSI (small computer system interface) controller, and finally to the disk itself). The software that runs on a general-purpose system typically executes in the context of a robust operating system, one that provides virtual memory. Virtual memory is a mechanism whereby the operating system can provide to all running user-level software (i.e., email clients, web browsers, spreadsheets, word-processing packages, graphics and video editing software, etc.) the illusion that the user-level software is in direct control of the computer, when in fact its use of the computer's resources is managed by the operating system. This is a very effective way for an operating system to provide simultaneous access by FIGURE Ov.3: Typical PC organization. The memory subsystem is one part of a relatively complex whole. This figure illustrates a two-way multiprocessor, with each processor having its own dedicated off-chip cache. The parts most relevant to this text are shaded in grey; the CPU and its cache system, the system and memory controllers, the DIMMs and their component DRAMs, and the hard drive/s. large numbers of software packages to small numbers of limited-use resources (e.g., physical memory, the hard disk, the network, etc.). The virtual memory system is the primary constituent of the memory system, in that it is the primary determinant of the manner/s in which the memory system's components are used by software running on the computer. Permanent data is stored on the disk, and the operating store, DRAM, is used as a cache for this permanent data. This DRAM-based cache is explicitly managed by the operating system. The operating system decides what data from the disk should be kept, what should be discarded, what should be sent back to the disk, and, for data retained, where it should be placed in the DRAM system. The primary and secondary caches are usually transparent to software, which means that they are managed by hardware, not software (note, however, the use of the word "usually"-later sections will delve into this in more detail). In general, the primary and secondary caches hold demand-fetched data, i.e., running software demands data, the hardware fetches it from memory, and the caches retain as much of it as possible. The DRAM system contains data that the operating system deems worthy of keeping around, and because fetching data from the disk and writing it back to the disk are such time-consuming processes, the operating system can exploit that lag time (during FIGURE Ov.4: DSP-style memory system. Example based on Texas Instruments' TMS320C3x DSP family. which it would otherwise be stalled, doing nothing) to use sophisticated heuristics to decide what data to retain. # **Embedded Computer Systems** Embedded systems differ from general-purpose systems in two main aspects. First and foremost, the two are designed to suit very different purposes. While general-purpose systems run a myriad of unrelated software packages, each having potentially very different performance requirements and dynamic behavior compared to the rest, embedded systems perform a single function their entire lifetime and thus execute the same code day in and day out until the system is discarded or a software upgrade is performed. Second, while performance is the primary (in many instances, the only) figure of merit by which a general-purpose system is judged, optimal embedded-system designs usually represent trade-offs between several goals, including manufacturing cost (e.g., die area), energy consumption, and performance. As a result, we see two very different design strategies in the two camps. As mentioned, generalpurpose systems are typically overbuilt; they are optimized for nothing in particular and must make up for this in raw performance. On the other hand, embedded systems are expected to handle only one task that is known at design time. Thus, it is not only possible, but highly beneficial to optimize an embedded design for its one suited task. If general-purpose systems are overbuilt, the goal for an embedded system is to be appropriately built. In addition, because effort spent at design time is amortized over the life of a product, and because many embedded systems have long lifetimes (tens of years), many embedded design houses will expend significant resources up front to optimize a design, using techniques not generally used in general-purpose systems (for instance, compiler optimizations that require many days or weeks to perform). The memory system of a typical embedded system is less complex than that of a general-purpose system.<sup>6</sup> Figure Ov.4 illustrates an average digital signalprocessing system with dual tagless SRAMs on-chip, <sup>&</sup>lt;sup>6</sup>Note that "less complex" does not necessarily imply "small," e.g., consider a typical iPod (or similar MP3 player), whose primary function is to store gigabytes' worth of a user's music and/or image files. an off-chip programmable ROM (e.g., PROM, EPROM, flash ROM, etc.) that holds the executable image, and an off-chip DRAM that is used for computation and holding variable data. External memory and device controllers can be used, but many embedded microprocessors already have such controllers integrated onto the CPU die. This cuts down on the system's die count and thus cost. Note that it would be possible for the entire hierarchy to lie on the CPU die, yielding a single-chip solution called a system-on-chip. This is relatively common for systems that have limited memory requirements. Many DSPs and microcontrollers have programmable ROM embedded within them. Larger systems that require megabytes of storage (e.g., in Cisco routers, the instruction code alone is more than a 12 MB) will have increasing numbers of memory chips in the system. On the right side of Figure Ov.4 is the software's view of the memory system. The primary distinction is that, unlike general-purpose systems, is that the SRAM caches are visible as separately addressable memories, whereas they are transparent to software in general-purpose systems. Memory, whether SRAM or DRAM, usually represents one of the more costly components in an embedded system, especially if the memory is located on-CPU because once the CPU is fabricated, the memory size cannot be increased. In nearly all system-on-chip designs and many microcontrollers as well, memory accounts for the lion's share of available die area. Moreover, memory is one of the primary consumers of energy in a system, both on-CPU and off-CPU. As an example, it has been shown that, in many digital signal-processing applications, the memory system consumes more of both energy and die area than the processor datapath. Clearly, this is a resource on which significant time and energy is spent performing optimization. #### Four Anecdotes on Modular Design **Ov.2** It is our observation that computer-system design in general, and memory-hierarchy design in particular, has reached a point at which it is no longer sufficient to design and optimize subsystems in isolation. Because memory systems and their subsystems are so complex, it is now the rule, and not the exception, that the subsystems we thought to be independent actually interact in unanticipated ways. Consequently, our traditional design methodologies no longer work because their underlying assumptions no longer hold. Modular design, one of the most widely adopted design methodologies, is an oftpraised engineering design principle in which clean functional interfaces separate subsystems (i.e., modules) so that subsystem design and optimization can be performed independently and in parallel by different designers. Applying the principles of modular design to produce a complex product can reduce the time and thus the cost for system-level design, integration, and test; optimization at the modular level guarantees optimization at the system level, provided that the system-level architecture and resulting module-to-module interfaces are optimal. That last part is the sticking point: the principle of modular design assumes no interaction between module-level implementations and the choice of system-level architecture, but that is exactly the kind of interaction that we have observed in the design of modern, high-performance memory systems. Consequently, though modular design has been a staple of memory-systems design for decades, allowing cache designers to focus solely on caches, DRAM designers to focus solely on DRAMs, and disk designers to focus solely on disks, we find that, going forward, modular design is no longer an appropriate methodology. Earlier we noted that, in the design of memory systems, many of the underlying implementation issues have begun to affect the higher level design process quite significantly: cache design is driven by interconnect physics; DRAM design is driven by circuit-level limitations that have dramatic system-level effects; and modern disk performance is dominated by the on-board caching and scheduling policies. As hierarchies and their components grow more complex, we find that the bulk of performance is lost not in the CPUs or caches or DRAM devices or disk assemblies themselves, but in the subtle interactions between these subsystems and in the manner in which these subsystems are connected. The bulk of lost performance is due to poor configuration of systemlevel parameters such as bus widths, granularity of access, scheduling policies, queue organizations, and so forth. This is extremely important, so it bears repeating: the bulk of lost performance is not due to the number of CPU pipeline stages or functional units or choice of branch prediction algorithm or even CPU clock speed; the bulk of lost performance is due to poor configuration of system-level parameters such as bus widths, granularity of access, scheduling policies, queue organizations, etc. Today's computersystem performance is dominated by the manner in which data is moved between subsystems, i.e., the scheduling of transactions, and so it is not surprising that seemingly insignificant details can cause such a headache, as scheduling is known to be highly sensitive to such details. Consequently, one can no longer attempt systemlevel optimization by designing/optimizing each of the parts in isolation (which, unfortunately, is often the approach taken in modern computer design). In subsystem design, nothing can be considered "outside the scope" and thus ignored. Memory-system design must become the purview of architects, and a subsystem designer must consider the system-level ramifications of even the slightest low-level design decision or modification. In addition, a designer must understand the low-level implications of systemlevel design choices. A simpler form of this maxim is as follows: > A designer must consider the system-level ramifications of circuit- and device-level decisions as well as the circuit- and devicelevel ramifications of system-level decisions. To illustrate what we mean and to motivate our point, we present several anecdotes. Though they focus on the DRAM system, their message is global, and we will show over the course of the book that the relationships they uncover are certainly not restricted to the DRAM system alone. We will return to these anecdotes and discuss them in much more detail in Chapter 27, The Case for Holistic Design, which follows the technical section of the book. ## Ov.2.1 Anecdote I: Systemic Behaviors Exist In 1999-2001, we performed a study of DRAM systems in which we explicitly studied only systemlevel effects-those that had nothing to do with the CPU architecture, DRAM architecture, or even DRAM interface protocol. In this study, we held constant the CPU and DRAM architectures and considered only a handful of parameters that would affect how well the two communicate with each other. Figure Ov.5 shows some of the results [Cuppu & Jacob 1999, 2001, Jacob 2003]. The varied parameters in Figure Ov.5 are all seemingly innocuous parameters, certainly not the type that would account for up to 20% differences in system performance (execution time) if one parameter was increased or decreased by a small amount, which is indeed the case. Moreover, considering the top two graphs, all of the choices represent intuitively "good" configurations. None of the displayed values represent strawmen, machine configurations that one would avoid putting on one's own desktop. Nonetheless, the performance variability is significant. When the analysis considers a wider range of bus speeds and burst lengths, the problematic behavior increases. As shown in the bottom graph, the ratio of best to worst execution times can be a factor of three, and the local optima are both more frequent and more exaggerated. Systems with relatively low bandwidth (e.g., 100, 200, 400 MB/s) and relatively slow bus speeds (e.g., 100, 200 MHz), if configured well, can match or exceed the performance of system configurations with much faster hardware that is poorly configured. Intuitively, one would expect the design space to be relatively smooth: as system bandwidth increases, so should system performance. Yet the design space is far from smooth. Performance variations of 20% or more can be found in design points that are immediately adjacent to one another. The variations from best-performing to worst-performing design exceed a factor of three across the full space studied, and local minima and maxima abound. Moreover, the behaviors are related. Increasing one parameter by a factor of two toward higher expected performance (e.g., increasing the channel width) can move the system off a local optimum, but local optimality can be restored by changing other related parameters to follow suit, FIGURE Ov.5: Execution time as a function of bandwidth, channel organization, and granularity of access. Top two graphs from Cuppu & Jacob [2001] (© 2001 | [EEE]); bottom graph from Jacob [2003] (© 2003 | [EEE]). such as increasing the burst length and cache block size to match the new channel width. This complex interaction between parameters previously thought to be independent arises because of the complexity of the system under study, and so we have named these "systemic" behaviors.<sup>7</sup> This study represents the moment we realized that systemic behaviors exist and that they are significant. Note that the behavior <sup>&</sup>lt;sup>7</sup>There is a distinction between this type of behavior and what in complex system theory is called "emergent system" behaviors or properties. Emergent system behaviors are those of individuals within a complex system, behaviors that an individual may perform in a group setting that the individual would never perform alone. In our environment, the behaviors are observations we have made of the design space, which is derived from the system as a whole. is not restricted to the DRAM system. We have seen it in the disk system as well, where the variations in performance from one configuration to the next are even more pronounced. Recall that this behavior comes from the varying of parameters that are seemingly unimportant in the grand scheme of things-at least they would certainly seem to be far less important than, say, the cache architecture or the number of functional units in the processor core. The bottom line, as we have observed. is that systemic behaviors-unanticipated interactions between seemingly innocuous parameters and mechanisms-cause significant losses in performance, requiring in-depth, detailed design-space exploration to achieve anything close to an optimal design given a set of technologies and limitations. #### Ov.2.2 Anecdote II: The DLL in DDR SDRAM Beginning with their first generation, DDR (double data rate) SDRAM devices have included a circuit-level mechanism that has generated significant controversy within JEDEC (Joint Electron Device Engineering Council), the industry consortium that created the DDR SDRAM standard. The mechanism is a delaylocked loop (DLL), whose purpose is to more precisely align the output of the DDR part with the clock on the system bus. The controversy stems from the cost of the technology versus its benefits. The system's global clock signal, as it enters the chip, is delayed by the DLL so that the chip's internal clock signal, after amplification and distribution across the chip, is exactly in-phase with the original system clock signal. This more precisely aligns the DRAM part's output with the system clock. The trade-off is extra latency in the datapath as well as a higher power and heat dissipation because the DLL, a dynamic control mechanism, is continuously running. By aligning each DRAM part in a DIMM to the system clock, each DRAM part is effectively de-skewed with respect to the other parts, and the DLLs cancel out timing differences due to process variations and thermal gradients. Figure Ov.6 illustrates a small handful of alternative solutions considered by JEDEC, who ultimately chose Figure Ov.6(b) for the standard. The interesting thing is that the data strobe is not used to capture data at the memory controller, bringing into question its purpose if the DLL is being used to help with data transfer to the memory controller. There is significant disagreement over the value of the chosen design; an anonymous JEDEC member, when FIGURE Ov.6: Several alternatives to the per-DRAM DLL. The figure illustrates a half dozen different timing conventions (a dotted line indicates a signal is unused for capturing data): (a) the scheme in single data rate SDRAM; (b) the scheme chosen for DDR SDRAM; (c) moving the DLL onto the module, with a per-DRAM static delay element (Vernier); (d) moving the DLL onto the memory controller, with a per-DRAM static delay; (e) using a separate read clock per DRAM or per DIMM; and (f) using only a static delay element and recalibrating periodically to address dynamic changes. asked "what is the DLL doing on the DDR chip?" answered with a grin, "burning power." In applications that require low latency and low power dissipation, designers turn off the DLL entirely and use only the data strobe for data capture, ignoring the system clock (as in Figure Ov.6(a)) [Kellogg 2002, Lee 2002, Rhoden 2002]. The argument for the DLL is that it de-skews the DRAM devices on a DIMM and provides a path for system design that can use a global clocking scheme, one of the simplest system designs known. The argument against the DLL is that it would be unnecessary if a designer learned to use the data strobe—this would require a more sophisticated system design, but it would achieve better performance at a lower cost. At the very least, it is clear that a DLL is a circuit-oriented solution to the problem of system-level skew, which could explain the controversy. # Ov.2.3 Anecdote III: A Catch-22 in the Search for Bandwidth With every DRAM generation, timing parameters are added. Several have been added to the DDR specification to address the issues of power dissipation and synchronization. - t<sub>FAW</sub> (Four-bank Activation Window) and t<sub>RRD</sub> (Row-to-Row activation Delay) put a ceiling on the maximum current draw of a single DRAM part. These are protocol-level limitations whose values are chosen to prevent a memory controller from exceeding circuit-related thresholds. - t<sub>DQS</sub> is our own name for the DDR systembus turnaround time; one can think of it as the DIMM-to-DIMM switching time that has implications only at the system level (i.e., it has no meaning or effect if considering read requests in a system with but a single DIMM). By obeying t<sub>DQS</sub>, one can ensure that a second DIMM will not drive the data bus at the same time as a first when switching from one DIMM to another for data output. These are per-device timing parameters that were chosen to improve the behavior (current draw, timing uncertainty) of individual devices. However, they do so at the expense of a significant loss in system-level performance. When reading large amounts of data from the DRAM system, an application will have to read, and thus will have to activate, numerous DRAM rows. At this point, the t<sub>FAW</sub> and t<sub>RRD</sub> timing parameters kick in and limit the available read bandwidth. The t<sub>RRD</sub> parameter specifies the minimum time between two successive row activation commands to the same DRAM device (which implies the same DIMM, because all the DRAMs on a DIMM are slaved together<sup>8</sup>). The t<sub>FAW</sub> parameter represents a sliding window of time during which no more than four row activation commands to the same device may appear. The parameters are specified in nanoseconds and not bus cycles, so they become increasingly problematic at higher bus frequencies. Their net effect is to limit the bandwidth available from a DIMM by limiting how quickly one can get the data out of the DRAM's storage array, irrespective of how fast the DRAM's I/O circuitry can ship the data back to the memory controller. At around 1 GBps, sustainable bandwidth hits a ceiling and remains flat no matter how fast the bus runs because the memory controller is limited in how quickly it can activate a new row and start reading data from it. The obvious solution is to interleave data from different DIMMs on the bus. If one DIMM is limited in how quickly it can read data from its arrays, then one should populate the bus with many DIMMs and move through them in a round-robin fashion. This should bring the system bandwidth up to maximum. However, the function of $t_{DQS}$ is to prevent exactly that: $t_{DQS}$ is the bus turnaround time, inserted to account for skew on the bus and to prevent different bus masters from driving the bus at the same time. <sup>&</sup>lt;sup>8</sup>This is a minor oversimplification. We would like to avoid having to explain details of DRAM-system organization, such as the concept of *rank*, at this point. To avoid such collisions, a second DIMM must wait at least t<sub>DOS</sub> after a first DIMM has finished before driving the bus. So we have a catch: - One set of parameters limits device-level bandwidth and expects a designer to go to the system level to reclaim performance. - The other parameter limits system-level bandwidth and expects a designer to go to the device level to reclaim performance. The good news is that the problem is solvable (see Chapter 15, Section 15.4.3, DRAM Command Scheduling Algorithms), but this is nonetheless a very good example of low-level design decisions that create headaches at the system level. # Ov. 2.4 Anecdote IV: Proposals to Exploit Variability in Cell Leakage The last anecdote is an example of a system-level design decision that ignores circuit- and device-level implications. Ever since DRAM was invented, it has been observed that different DRAM cells exhibit different data-retention time characteristics, typically ranging between hundreds of milliseconds to tens of seconds. DRAM manufacturers typically set the refresh requirement conservatively and require that every row in a DRAM device be refreshed at least once every 64 or 32 ms to avoid losing data. Though refresh might not seem to be a significant concern, in mobile devices researchers have observed that refresh can account for one-third of the power in otherwise idle systems, prompting action to address the issue. Several recent papers propose moving the refresh function into the memory controller and refreshing each row only when needed. During an initialization phase, the controller would characterize each row in the memory system, measuring DRAM dataretention time on a row-by-row basis, discarding leaky rows entirely, limiting its DRAM use to only those rows deemed non-leaky, and refreshing once every tens of seconds instead of once every tens of milliseconds. The problem is that these proposals ignore another, less well-known phenomenon of DRAM cell variability, namely that a cell with a long retention time can suddenly (in the time frame of seconds) exhibit a short retention time [Yaney et al. 1987, Restle et al. 1992, Ueno et al. 1998, Kim 2004]. Such an effect would render these power-efficient proposals functionally erroneous. The phenomenon is called variable retention time (VRT), and though its occurrence is infrequent, it is non-zero. The occurrence rate is low enough that a system using one of these reduced-refresh proposals could protect itself against VRT by using error correcting codes (ECC, described in detail in Chapter 30, Memory Errors and Error Correction), but none of the proposals so far discuss VRT or ECC. ### Ov. 2.5 Perspective To summarize so far: Anecdote I: Systemic behaviors exist and are significant (they can be responsible for factors of two to three in execution time). Anecdote II: The DLL in DDR SDRAM is a circuitlevel solution chosen to address system-level skew. **Anecdote III:** t<sub>DOS</sub> represents a circuit-level solution chosen to address system-level skew in DDR SDRAM; t<sub>FAW</sub> and t<sub>RRD</sub> are circuit-level limitations that significantly limit system-level performance. Anecdote IV: Several research groups have recently proposed system-level solutions to the DRAMrefresh problem, but fail to account for circuit-level details that might compromise the correctness of the resulting system. Anecdotes II and III show that a common practice in industry is to focus at the level of devices and circuits, in some cases ignoring their system-level ramifications. Anecdote IV shows that a common practice in research is to design systems that have device- and circuit-level ramifications while abstracting away the details of the devices and circuits involved. Anecdote I illustrates that both approaches are doomed to failure in future memory-systems design. It is clear that in the future we will have to move away from modular design; one can no longer safely abstract away details that were previously considered "out of scope." To produce a credible analysis, a designer must consider many different subsystems of a design and many different levels of abstraction—one must consider the forest when designing trees and consider the trees when designing the forest. # **Ov.3** Cross-Cutting Issues Though their implementation details might apply at a local level, most design decisions must be considered in terms of their system-level effects and side-effects before they become part of the system/ hierarchy. For instance, power is a cross-cutting, system-level phenomenon, even though most power optimizations are specific to certain technologies and are applied locally; reliability is a system-level issue, even though each level of the hierarchy implements its own techniques for improving it; and, as we have shown, performance optimizations such as widening a bus or increasing support for concurrency rarely result in system performance that is globally optimal. Moreover, design decisions that locally optimize along one axis (e.g., power) can have even larger effects on the system level when all axes are considered. Not only can the global power dissipation be thrown off optimality by blindly making a local decision, it is even easier to throw the system off a global optimum when more than one axis is considered (e.g., power/ performance). Designing the best system given a set of constraints requires an approach that considers multiple axes simultaneously and measures the system-level effects of all design choices. Such a holistic approach requires an understanding of many issues, including cost and performance models, power, reliability, and software structure. The following sections provide overviews of these cross-cutting issues, and Part IV of the book will treat these topics in more detail. ## Ov.3.1 Cost/Performance Analysis To perform a cost/performance analysis correctly, the designer must define the problem correctly, use the appropriate tools for analysis, and apply those tools in the manner for which they were designed. This section provides a brief, intuitive look at the problem. Herein, we will use *cost* as an example of problem definition, *Pareto optimality* as an example of an appropriate tool, and *sampled averages* as an example to illustrate correct tool usage. We will discuss these issues in more detail with more examples in Chapter 28, *Analysis of Cost and Performance*. # Problem Definition: Cost A designer must think in an all-inclusive manner when accounting for cost. For example, consider a cost-performance analysis of a DRAM system wherein performance is measured in sustainable bandwidth and cost is measured in pin count. To represent the cost correctly, the analysis should consider all pins, including those for control, power, ground, address, and data. Otherwise, the resulting analysis can incorrectly portray the design space, and workable solutions can get left out of the analysis. For example, a designer can reduce latency in some cases by increasing the number of address and command pins, but if the cost analysis only considers data pins, then these optimizations would be cost-free. Consider DRAM addressing, which is done half of an address at a time. A 32-bit physical address is sent to the DRAM system 16 bits at a time in two different commands; one could potentially decrease DRAM latency by using an SRAM-like wide address bus and sending the entire 32 bits at once. This represents a real cost in design and manufacturing that would be higher, but an analysis that accounts only for data pins would not consider it as such. Power and ground pins must also be counted in a cost analysis for similar reasons. High-speed chip-to-chip interfaces typically require more power and ground pins than slower interfaces. The extra power and ground signals help to isolate the I/O drivers from each other and the signal lines from each other, both improving signal integrity by reducing crosstalk, ground bounce, and related effects. I/O systems with higher switching speeds would have an unfair advantage over those with lower switching speeds (and thus fewer power/ ground pins) in a cost-performance analysis if power and ground pins were to be excluded from the analysis. The inclusion of these pins would provide for an effective and easily quantified trade-off between cost and bandwidth. Failure to include address, control, power, and ground pins in an analysis, meaning failure to be allinclusive at the conceptual stages of design, would tend to blind a designer to possibilities. For example, an architecturally related family of solutions that at first glance gives up total system bandwidth so as to be more cost-effective might be thrown out at the conceptual stages for its intuitively lower performance. However, considering all sources of cost in the analysis would allow a designer to look more closely at this family and possibly to recover lost bandwidth through the addition of pins. Comparing SDRAM and Rambus system architectures provides an excellent example of considering cost as the total number of pins leading to a continuum of designs. The Rambus memory system is a narrow-channel architecture, compared to SDRAM's wide-channel architecture, pictured in Figure Ov.7 Rambus uses fewer address and command pins than SDRAM and thus incurs an additional latency at the command level. Rambus also uses fewer data pins and occurs an additional latency when transmitting data as well. The trade-off is the ability to run the bus at a much higher bus frequency, or *pin-bandwidth* in bits per second per pin, than SDRAM. The longer channel of the DRDRAM (direct Rambus DRAM) memory system contributes directly to longer read-command latencies and longer bus turnaround times. However, the longer channel also allows for more devices to be connected to the memory system and reduces the likelihood that consecutive commands access the same device. The width and depth of the memory channels impact the bandwidth, latency, pin count, and various cost components of the respective memory systems. The effect that these organizational differences have on the DRAM access protocol is shown in Figure Ov.8 which illustrates a row activation and column read FIGURE Ov.7: Difference in topology between SDRAM and Rambus memory systems. FIGURE Ov.8: Memory access latency in SDRAM and DDR SDRAM memory systems (top) and DRDRAM (bottom). command for both DDR SDRAM and Direct Rambus DRAM. Contemporary SDRAM and DDR SDRAM memory chips operating at a frequency of 200 MHz can activate a row in 3 clock cycles. Once the row is activated, memory controllers in SDRAM or DDR SDRAM memory systems can retrieve data using a simple column address strobe command with a latency of 2 or 3 clock cycles. In Figure Ov.8(a), Step 1 shows the assertion of a row activation command, and Step 2 shows the assertion of the column address strobe signal. Step 3 shows the relative timing of a high-performance DDR SDRAM memory module with a CASL (CAS latency) of 2 cycles. For a fair comparison against the DRDRAM memory system, we include the bus cycle that the memory controller uses to assert the load command to the memory chips. With this additional cycle included, a DDR SDRAM memory system has a read latency of 6 clock cycles (to critical data). In a SDRAM or DDR SDRAM memory system that operates at 200 MHz, 6 clock cycles translate to 30 ns of latency for a memory load command with row activation latency inclusive. These latency values are the same for high-performance SDRAM and DDR SDRAM memory systems. The DRDRAM memory system behaves very differently from SDRAM and DDR SDRAM memory systems. Figure Ov.8(b) shows a row activation command in Step 1, followed by a column access command in Step 2. The requested data is then returned by the memory chip to the memory controller in Step 3. The row activation command in Step 1 is transmitted by the memory controller to the memory chip in a packet format that spans 4 clock cycles. The minimum delay between the row activation and column access is 7 clock cycles, and, after an additional (also minimum) CAS (column address strobe) latency of 8 clock cycles, the DRDRAM chip begins to transmit the data to the memory controller. One caveat to the computation of the access latency in the DRDRAM memory system is that CAS delay in the DRDRAM memory system is a function of the number of devices on a single DRDRAM memory channel. On a DRDRAM memory system with a full load of 32 devices | | Operating<br>Frequency<br>(Data) | Data<br>Channel<br>Pin Count | Data<br>Channel<br>Bandwidth | Control<br>Channel<br>Pin Count | Command<br>Channel<br>Bandwidth | Address<br>Channel<br>Pin Count | Address<br>Channel<br>Bandwidth | |----------------------|----------------------------------|------------------------------|------------------------------|---------------------------------|---------------------------------|---------------------------------|---------------------------------| | SDRAM controller | 133 | 64 | 1064 MB/s | 28 | 465 MB/s | 30 | 500 MB/s | | DDR SDRAM controller | 2 * 200 | 64 | 3200 MB/s | 42 | 1050 MB/s | 30 | 750 MB/s | | DRDRAM controller | 2 * 600 | 16 | 2400 MB/s | 9 | 1350 MB/s | 8 | 1200 MB/s | | x16 SDRAM chip | 133 | 16 | 256 MB/s | 9 | 150 MB/s | 15 | 250 MB/s | | x16 DDR SDRAM chip | 2 *200 | 16 | 800 MB/s | 11 | 275 MB/s | 15 | 375 MB/s | Table Ov.2 Peak bandwidth statistics of SDRAM, DDR SDRAM, and DRDRAM memory systems TABLE Ov.3 Cross-comparison of SDRAM, DDR SDRAM, and DRDRAM memory systems | DRAM<br>Technology | Operating<br>Frequency<br>(Data Bus) | Pin Count<br>per Channel | Peak Band-<br>width | Sustained<br>BW on<br>StreamAdd | Bits per<br>Pin per<br>Cycle<br>(Peak) | Bits per Pin<br>per Cycle<br>(Sustained) | |--------------------|--------------------------------------|--------------------------|---------------------|---------------------------------|----------------------------------------|------------------------------------------| | SDRAM | 133 | 152 | 1064 MB/s | 540 MB/s | 0.4211 | 0.2139 | | DDR SDRAM | 2 * 200 | 171 | 3200 MB/s | 1496 MB/s | 0.3743 | 0.1750 | | DRDRAm | 2 * 600 | 117 | 2400 MB/s | 1499 MB/s | 0.1368 | 0.0854 | on the data bus, the CAS-latency delay may be as large as 12 clock cycles. Finally, it takes 4 clock cycles for the DRDRAM memory system to transport the data packet. Note that we add half the transmission time of the data packet in the computation of the latency of a memory request in a DRDRAM memory system due to the fact that the DRDRAM memory system does not support critical word forwarding, and the critically requested data may exist in the latter parts of the data packet; on average, it will be somewhere in the middle. This yields a total latency of 21 cycles, which, in a DRDRAM memory system operating at 600 MHz, translates to a latency of 35 ns. The Rambus memory system trades off a longer latency for fewer pins and higher pin bandwidth (in this example, three times higher bandwidth). How do the systems compare in performance? Peak bandwidth of any interface depends solely on the channel width and the operating frequency of the channel. In Table Ov.2, we summarize the statistics of the interconnects and compute the peak bandwidths of the memory systems at the interface of the memory controller and at the interface of the memory chips as well. Table Ov.3 compares a 133-MHz SDRAM, a 200-MHz DDR SDRAM system, and a 600-MHz DRDRAM system. The 133-MHz SDRAM system, as represented by a PC-133 compliant SDRAM memory system on an AMD Athlon-based computer system, has a theoretical peak bandwidth of 1064 MB/s. The maximum sustained bandwidth for the single channel of SDRAM, as measured by the use of the add kernel in the STREAM benchmark, reaches 540 MB/s. The maximum sustained bandwidth for DDR SDRAM and DRDRAM was also measured on STREAM, yielding 1496 and 1499 MB/s, respectively. The pin cost of each system is factored in, yielding bandwidth per pin on both a per-cycle basis and a per-nanosecond basis. # Appropriate Tools: Pareto Optimality It is convenient to represent the "goodness" of a design solution, a particular system configuration, as a single number so that one can readily compare the number with the goodness ratings of other candidate design solutions and thereby quickly find the "best" system configuration. However, in the design of memory systems, we are inherently dealing with a multi-dimensional design space (e.g., one that encompasses performance, energy consumption, cost, etc.), and so using a single number to represent a solution's worth is not really appropriate, unless we can assign exact weights to the various figures of merit (which is dangerous and will be discussed in more detail later) or we care about one aspect to the exclusion of all others (e.g., performance at any cost). Assuming that we do not have exact weights for the figures of merit and that we do care about more than one aspect of the system, a very powerful tool to aid in system analysis is the concept of *Pareto optimality* or *Pareto efficiency*, named after the Italian economist Vilfredo Pareto, who invented it in the early 1900s. Pareto optimality asserts that one candidate solution to a problem is better than another candidate solution only if the first *dominates* the second, i.e., if the first is better than or equal to the second in *all* figures of merit. If one solution has a better value in one dimension but a worse value in another, then the two candidates are Pareto equivalent. The best solution is actually a set FIGURE Ov.9: Pareto optimality. Members of the Pareto-optimal set are shown in solid black; non-optimal points are grey. of candidate solutions: the set of Pareto-equivalent solutions that is not dominated by any solution. Figure Ov.9(a) shows a set of candidate solutions in a two-dimensional space that represent a cost/ performance metric. The x-axis represents system performance in execution time (smaller numbers are better), and the y-axis represents system cost in dollars (smaller numbers are better). Figure Ov.9(b) shows the Pareto-optimal set in solid black and connected by a line; non-optimal data points are shown in grey. The Pareto-optimal set forms a wavefront that approaches both axes simultaneously. Figures Ov.9(c) and (d) show the effect of adding four new candidate solutions to the space: one lies inside the wavefront, one lies on the wavefront, and two lie outside the wavefront. The first two new additions, A and B, are both dominated by at least one member of the Pareto-optimal set, and so neither is considered Pareto optimal. Even though B lies on the wavefront, it is not considered Pareto optimal. The point to the left of B has better performance than B at equal cost. Thus, it dominates B. Point C is not dominated by any member of the Pareto-optimal set, nor does it dominate any member of the Pareto-optimal set. Thus, candidatesolution C is added to the optimal set, and its addition changes the shape of the wavefront slightly. The last of the additional points, D, is dominated by no members of the optimal set, but it does dominate several members of the optimal set, so D's inclusion in the optimal set excludes those dominated members from the set. As a result, candidate-solution D changes the shape of the wave front more significantly than candidate-solution C. # Tool Use: Taking Sampled Averages Correctly In many fields, including the field of computer engineering, it is quite popular to find a sampled average, i.e., the average of a sampled set of numbers. rather than the average of the entire set. This is useful when the entire set is unavailable, difficult to obtain, or expensive to obtain. For example, one might want to use this technique to keep a running performance average for a real microprocessor, or one might want to sample several windows of execution in a terabytesize trace file. Provided that the sampled subset is representative of the set as a whole, and provided that the technique used to collect the samples is correct, this mechanism provides a low-cost alternative that can be very accurate. The discussion will use as an example a mechanism that samples the miles-per-gallon performance of an automobile under way. The trip we will study is an out and back trip with a brief pit stop, as shown in Figure Ov.10. The automobile will follow a simple course that is easily analyzed: - The auto will travel over even ground for 60 miles at 60 mph, and it will achieve 30 mpg during this window of time. - The auto will travel uphill for 20 miles at 60 mph, and it will achieve 10 mpg during this window of time. FIGURE Ov.10: Course taken by the automobile in the example. FIGURE 0v.11: Sampling miles-per-gallon (mpg) over time. The figure shows the trip in time, with each segment of time labeled with the average miles-per-gallon for the car during that segment of the trip. Thus, whenever the sampling algorithm samples miles-per-gallon during a window of time, it will add that value to the running average. - The auto will travel downhill for 20 miles at 60 mph, and it will achieve 300 mpg during this window of time. - 4. The auto will travel back home over even ground for 60 miles at 60 mph, and it will achieve 30 mpg during this window of time. - 5. In addition, before returning home, the driver will sit at the top of the hill for 10 minutes, enjoying the view, with the auto idling, consuming gasoline at the rate of 1 gallon every 5 hours. This is equivalent to 1/300 gallon per minute or 1/30 of a gallon during the 10-minute respite. Note that the auto will achieve 0 mpg during this window of time. Our car's algorithm samples evenly in time, so for our analysis we need to break down the segments of the trip by the amount of time that they take: Outbound: 60 minutes Uphill: 20 minutes Idling: 10 minutes Downhill: 20 minutes Return: 60 minutes This is displayed graphically in Figure Ov.11, in which the time for each segment is shown to scale. Assume, for the sake of simplicity, that the sampling algorithm samples the car's miles-per-gallon every minute and adds that sampled value to the running average (it could just as easily sample every second or millisecond). Then the algorithm will sample the value 30 mpg 60 times during the first segment of the trip, the value 10 mpg 20 times during the second segment of the trip, the value 0 mpg 10 times during the third segment of the trip, and so on. Over the trip, the car is operating for a total of 170 minutes. Thus, we can derive the sampling algorithm's results as follows: $$\frac{60}{170}30+\frac{20}{170}10+\frac{10}{170}0+\frac{20}{170}300+\frac{60}{170}30=57.5mpg \tag{EQ Ov.3}$$ The sampling algorithm tells us that the auto achieved 57.5 mpg during our trip. However, a quick reality check will demonstrate that this cannot be correct; somewhere in our analysis we have made an invalid assumption. What is the correct answer, the correct approach? In Part IV of the book we will revisit this example and provide a complete picture. In the meantime, the reader is encouraged to figure the answer out for him- or herself. # Ov.3.2 Power and Energy Power has become a "first-class" design goal in recent years within the computer architecture and design community. Previously, low-power circuit, chip, and system design was considered the purview of specialized communities, but this is no longer the case, as even high-performance chip manufacturers can be blindsided by power dissipation problems. # Power Dissipation in Computer Systems Power dissipation in CMOS circuits arises from two different mechanisms: *static power*, which is primarily *leakage power* and is caused by the transistor not completely turning off, and *dynamic power*, which is largely the result of switching capacitive loads between two different voltage states. Dynamic power is dependent on frequency of circuit activity, since no power is dissipated if the node values do not change, while static power is independent of the frequency of activity and exists whenever the chip is powered on. When CMOS circuits were first used, one of their main advantages was the negligible leakage current flowing with the gate at DC or steady state. Practically all of the power consumed by CMOS gates was due to dynamic power consumed during the transition of the gate. But as transistors become increasingly smaller, the CMOS leakage current starts to become significant and is projected to be larger than the dynamic power, as shown in Figure Ov.12. In charging a load capacitor C up $\Delta V$ volts and discharging it to its original voltage, a gate pulls an amount of current equal to $C \cdot \Delta V$ from the $V_{dd}$ supply to charge up the capacitor and then sinks this charge to ground discharging the node. At the end of a charge/discharge cycle, the gate/capacitor combination has moved $C \cdot \Delta V$ of charge from $V_{dd}$ to ground, which uses an amount of energy equal to $C \cdot \Delta V \cdot V_{dd}$ that is independent of the cycle time. The average dynamic power of this node, the average rate of its energy consumption, is given by the following equation [Chandrakasan & Brodersen 1995]: $$P_{dynamic} = C \cdot \Delta V \cdot V_{dd} \cdot \alpha \cdot f$$ (EQ Ov.4) Dividing by the charge/discharge period (i.e., multiplying by the clock frequency f) produces the rate of energy consumption over that period. Multiplying by the expected activity ratio $\alpha$ , the probability that the node will switch (in which case it dissipates dynamic power; otherwise, it does not), yields an average power dissipation over a larger window of time for which the activity ratio holds (e.g., this can yield average power for an entire hour of computation, not just a nanosecond). The dynamic power for the whole chip is the sum of this equation over all nodes in the circuit. It is clear from EQ Ov.4 what can be done to reduce the dynamic power dissipation of a system. We can either reduce the capacitance being switched, the voltage swing, the power supply voltage, the activity ratio, or the operating frequency. Most of these options are available to a designer at the architecture level. Note that, for a specific chip, the voltage swing $\Delta V$ is usually proportional to $V_{\rm dd}$ , so EQ Ov.4 is often simplified to the following: $$P_{dynamic} = C \cdot V_{dd}^2 \cdot \alpha \cdot f$$ (EQ Ov.5) Moreover, the activity ratio α is often approximated as 1/2, giving the following form: $$P_{ ext{dynamic}} = \frac{1}{2} \cdot C \cdot V^2_{ ext{dd}} \cdot f$$ (EQ Ov.6) FIGURE Ov.12: Projections for dynamic and leakage, along with gate length. (Figure taken from Kim et al. [2004a]). Static leakage power is due to our inability to completely turn off the transistor, which leaks current in the subthreshold operating region [Taur & Ning 1998]. The gate couples to the active channel mainly through the gate oxide capacitance, but there are other capacitances in a transistor that couple the gate to a "fixed charge" (charge which cannot move) present in the bulk and not associated with current flow [Peckerar et al. 1979, 1982]. If these extra capacitances are large (note that they increase with each process generation as physical dimensions shrink), then changing the gate bias merely alters the densities of the fixed charge and will not turn the channel off. In this situation, the transistor becomes a leaky faucet; it does not turn off no matter how hard you turn it. Leakage power is proportional to $V_{\rm dd}$ . It is a linear, not a quadratic, relationship. For a particular process technology, the per-device leakage power is given as follows [Butts & Sohi 2000]: $$P_{\text{static}} = I_{\text{leakage}} \cdot V_{\text{dd}}^2$$ (EQ Ov.7) Leakage energy is the product of leakage power times the duration of operation. It is clear from EQ Ov.7 what can be done to reduce the leakage power dissipation of a system: reduce leakage current and/or reduce the power supply voltage. Both options are available to a designer at the architecture level. Heat in VLSI circuits is becoming a significant and related problem. The rate at which physical dimensions such as gate length and gate oxide thickness have been reduced is faster than for other parameters, especially voltage, resulting in higher power densities on the chip surface. To lower leakage power and maintain device operation, voltage levels are set according to the silicon bandgap and intrinsic built-in potentials, in spite of the conventional scaling algorithm. Thus, power densities are increasing exponentially for next-generation chips. For instance, the power density of Intel's Pentium chip line has already surpassed that of a hot plate with the introduction of the Pentium Pro [Gelsinger 2001]. The problem of power and heat dissipation now extends to the DRAM system, which traditionally has represented low power densities and low costs. Today, higher end DRAMs are dynamically throttled when, due to repeated high-speed access to the same devices, their operating temperatures surpass design thresholds. The next-generation memory system embraced by the DRAM community, the Fully Buffered DIMM architecture, specifies a per-module controller that, in many implementations, requires a heatsink. This is a cost previously unthinkable in DRAM-system design. Disks have many components that dissipate power, including the spindle motor driving the platters, the actuator that positions the disk heads, the bus interface circuitry, and the microcontroller/s and memory chips. The spindle motor dissipates the bulk of the power, with the entire disk assembly typically dissipating power in the tens of watts. # Schemes for Reducing Power and Energy There are numerous mechanisms in the literature that attack the power dissipation and/or energy consumption problem. Here, we will briefly describe three: dynamic voltage scaling, the powering down of unused blocks, and circuit-level approaches for reducing leakage power. **Dynamic Voltage Scaling** Recall that total energy is the sum of switching energy and leakage energy, which, to a first approximation, is equal to the following: $$\begin{split} E_{\text{tot}} &= [(C_{\text{tot}} \cdot {V^2}_{\text{dd}} \cdot \alpha \cdot f) \\ &+ (N_{\text{tot}} \cdot I_{\text{leakage}} \cdot V_{\text{dd}})] \cdot T \end{split} \tag{EQ Ov.8}$$ T is the time required for the computation, and $N_{\rm tot}$ is the total number of devices leaking current. Variations in processor utilization affect the amount of switching activity (the activity ratio $\alpha$ ). However, a light workload produces an idle processor that wastes clock cycles and energy because the clock signal continues propagating and the operating voltage remains the same. Gating the clock during idle cycles reduces the switched capacitance $C_{\rm tot}$ during idle cycles. Reducing the frequency f during periods of low workload eliminates most idle cycles altogether. None of the approaches, however, affects $C_{\rm tot}V^2_{\rm dd}$ for the actual computation or substantially reduces the energy lost to leakage current. Instead, reducing the supply voltage $V_{\rm dd}$ in conjunction with the frequency f achieves savings in switching energy and reduces leakage energy. For high-speed digital CMOS, a reduction in supply voltage increases the circuit delay as shown by the following equation [Baker et al. 1998, Baker 2005]: $$T_d = \frac{C_L V_{\rm dd}}{\mu C_{\rm ox}(W/L)(V_{\rm dd} - V_{\rm p})^2} \tag{EQ Ov.9} \label{eq:equation:eq}$$ where - T<sub>d</sub> is the delay or the reciprocal of the frequency f - V<sub>dd</sub> is the supply voltage - $C_L$ is the total node capacitance - μ is the carrier mobility - *C<sub>ox</sub>* is the oxide capacitance - $V_t$ is the threshold voltage - *W/L* is the width-to-length ratio of the transistors in the circuit This can be simplified to the following form, which gives the maximum operating frequency as a function of supply and threshold voltages: $$f_{\text{MAX}} \sim \frac{(V_{\text{dd}} - V_t)^2}{V_{\text{dd}}}$$ (EQ Ov.10) As mentioned earlier, the threshold voltage is closely tied to the problem of leakage power, so it cannot be arbitrarily lowered. Thus, the right-hand side of the relation ends up being a constant proportion of the operating voltage for a given process technology. Microprocessors typically operate at the maximum speed at which their operating voltage level will allow, so there is not much headroom to arbitrarily lower $V_{\rm dd}$ by itself. However, $V_{\rm dd}$ can be lowered if the clock frequency is also lowered in the same proportion. This mechanism is called *dynamic voltage scaling (DVS)* [Pering & Broderson 1998] and is appearing in nearly every modern microprocessor. The technique sets the microprocessor's frequency to the most appropriate level for performing each task at hand, thus avoiding hurry-up-and-wait scenarios that consume more energy than is required for the computation (see Figure Ov.13). As Weiser points out, FIGURE Ov.13: Dynamic voltage scaling. Not every task needs the CPU's full computational power. In many cases, for example, the processing of video and audio streams, the only performance requirement is that the task meet a deadline, see (a). Such cases create opportunities to run the CPU at a lower performance level and achieve the same perceived performance while consuming less energy. As (b) shows, reducing the clock frequency of a processor reduces power dissipation but simply spreads a computation out over time, thereby consuming the same total energy as before. As (c) shows, reducing the voltage level as well as the clock frequency achieves the desired goal of reduced energy consumption and appropriate performance level. Figure and caption from Varma et al. [2003]. idle time represents wasted energy, even if the CPU is stopped [Weiser et al. 1994]. Note that it is not sufficient to merely have a chip that *supports* voltage scaling. There must be a heuristic, either implemented in hardware or software, that decides when to scale the voltage and by how much to scale it. This decision is essentially a prediction of the near-future computational needs of the system and is generally made on the basis of the recent computing requirements of all tasks and threads running at the time. The development of good heuristics is a tricky problem (pointed out by Weiser et al. [1994]). Heuristics that closely track performance requirements save little energy, while those that save the most energy tend to do so at the expense of performance, resulting in poor response time, for example. Most research quantifies the effect that DVS has on reducing dynamic power dissipation because dynamic power follows $V_{\rm dd}$ in a quadratic relationship: reducing $V_{\rm dd}$ can significantly reduce dynamic power. However, lowering $V_{\rm dd}$ also reduces leakage power, which is becoming just as significant as dynamic power. Though the reduction is only linear, it is nonetheless a reduction. Note also that even though DVS is commonly applied to microprocessors, it is perfectly well suited to the memory system as well. As a processor's speed is decreased through application of DVS, it requires less speed out of its associated SRAM caches, whose power supply can be scaled to keep pace. This will reduce both the dynamic and the static power dissipation of the memory circuits. **Powering-Down Unused Blocks** A popular mechanism for reducing power is simply to turn off functional blocks that are not needed. This is done at both the circuit level and the chip or I/O-device level. At the circuit level, the technique is called *clock gating*. The clock signal to a functional block (e.g., an adder, multiplier, or predictor) passes through a gate, and whenever a control circuit determines that the functional block will be unused for several cycles, the gate halts the clock signal and sends a non-oscillating voltage level to the functional block instead. The latches in the functional block retain their information; do not change their outputs; and, because the data is held constant to the combinational logic in the circuit, do not switch. Therefore, it does not draw current or consume energy. Note that, in the naïve implementation, the circuits in this instance are still powered up, so they still dissipate static power; clock gating is a technique that only reduces dynamic power. Other gating techniques can reduce leakage as well. For example, in caches, unused blocks can be powered down using Gated- $V_{\rm dd}$ [Powell et al. 2000] or Gated-ground [Powell et al. 2000] techniques. Gated- $V_{dd}$ puts the power supply of the SRAM in a series with a transistor as shown in Figure Ov.14. With the stacking effect introduced by this transistor, the leakage current is reduced drastically. This technique benefits from having both lowleakage current and a simpler fabrication process requirement since only a single threshold voltage is conceptually required (although, as shown in Figure Ov.14, the gating transistor can also have a high threshold to decrease the leakage even further at the expense of process complexity). At the device level, for instance in DRAM chips or disk assemblies, the mechanism puts the device into a low-activity, low-voltage, and/or low-frequency mode such as sleep or doze or, in the case of disks, *spin-down*. For example, microprocessors can dissipate anywhere from a fraction of a watt to over 100 W of power; when not in use, they can be put into a low-power sleep or doze mode that consumes milli-watts. The processor typically expects an interrupt to cause it to resume normal operation, for instance, a clock interrupt, the interrupt output of a watchdog timer, or an external device interrupt. DRAM chips typically consume on the order of 1 W each; they have a low-power mode that will reduce this by more than an order of magnitude. Disks typically dissipate power in the tens of watts, the bulk of which is in the spindle motor. When the disk is placed in the "spin-down" mode (i.e., it is not rotating, but it is still responding to the disk controller), FIGURE Ov.14: Gated- $V_{\rm dd}$ technique using a high- $V_{\rm t}$ transistor to gate $V_{\rm dd}$ . FIGURE 0v.15: Different multi-V<sub>1</sub> configurations for the 6T memory cell showing which leakage currents are reduced for each configuration. the disk assembly consumes a total of a handful of watts [Gurumurthi et al. 2003]. **Leakage Power in SRAMs** Low-power SRAM techniques provide good examples of approaches for lowering leakage power. SRAM designs targeted for low power have begun to account for the increasingly larger amount of power consumed by leakage currents. One conceptually simple solution is the use of multi-threshold CMOS circuits. This involves using process-level techniques to increase the threshold voltage of transistors to reduce the leakage current. Increasing this threshold serves to reduce the gate overdrive and reduces the gate's drive strength, resulting in increased delay. Because of this, the technique is mostly used on the noncritical paths of the logic, and fast, low- $V_t$ transistors are used for the critical paths. In this way the delay penalty involved in using higher $V_t$ transistors can be hidden in the non-critical paths, while reducing the leakage currents drastically. For example, multi- $V_t$ transistors are selectively used for memory cells since they represent a majority of the circuit, reaping the most benefit in leakage power consumption with a minor penalty in the access time. Different multi- $V_t$ configurations are shown in Figure Ov.15, along with the leakage current path that each configuration is designed to minimize. Another technique that reduces leakage power in SRAMs is the Drowsy technique [Kim et al. 2004a]. This is similar to gated- $V_{dd}$ and gated-ground techniques in that it uses a transistor to conditionally enable the power supply to a given part of the SRAM. The difference is that this technique puts infrequently accessed parts of the SRAM into a FIGURE Ov.16: A drowsy SRAM cell containing the transistors that gate the desired power supply. state-preserving, low-power mode. A second power supply with a lower voltage than the regular supply provides power to memory cells in the "drowsy" mode. Leakage power is effectively reduced because of its dependence on the value of the power supply. An SRAM cell of a drowsy cache is shown in Figure Ov.16. #### Ov.3.3 Reliability Like performance, reliability means many things to many people. For example, embedded systems are computer systems, typically small, that run dedicated software and are embedded within the context of a larger system. They are increasingly appearing in the place of traditional electromechanical systems, whose function they are replacing because one can now find chip-level computer systems which can be programmed to perform virtually any function at a price of pennies per system. The reliability problem stems from the fact that the embedded system is a state machine (piece of software) executing within the context of a relatively complex state machine (real-time operating system) executing within the context of an extremely complex state machine (microprocessor and its memory system). We are replacingsimpleelectromechanicalsystems with ultracomplex systems whose correct function cannot be guaranteed. This presents an enormous problem for the future, in which systems will only get more complex and will be used increasingly in safetycritical situations, where incorrect functioning can cause great harm. This is a very deep problem, and one that is not likely to be solved soon. A smaller problem that we can solve right now—one that engineers currently do—is to increase the reliability of data within the memory system. If a datum is stored in the memory system, whether in a cache, in a DRAM, or on disk, it is reasonable to expect that the next time a processor reads that datum, the processor will get the value that was written. How could the datum's value change? Solid-state memory devices (e.g., SRAMs and DRAMs) are susceptible to both hard failures and soft errors in the same manner that other semiconductor-based electronic devices are susceptible to both hard failures and soft failures. Hard failures can be caused by electromigration, corrosion, thermal cycling, or electrostatic shock. In contrast to hard failures, soft errors are failures where the physical device remains functional, but random and transient electronic noises corrupt the value of the stored information in the memory system. Transient noise and upset comes from a multitude of sources, including circuit noise (e.g., crosstalk, ground bounce, etc.), ambient radiation (e.g., even from sources within the computer chassis), clock jitter, or substrate interactions with high-energy particles. Which of these is the most common is obviously very dependent on the operating environment. Figure Ov.17 illustrates the last of these examples. It pictures the interactions between high-energy alpha particles and neutrons with the silicon lattice. The figure shows that when high-energy alpha particles pass through silicon, the alpha particle leaves an ionized trail, and the length of that ionized trail depends on the energy of the alpha particle. The figure also illustrates that when high-energy neutrons pass through silicon, some neutrons pass through without affecting operations of the semiconductor device, but some neutrons collide with nuclei in the silicon lattice. The atomic collision can result in the creation of multiple ionized trails as the secondary particles generated in the collision scatter in the silicon lattice. In the presence of an electric field, the ionized trails of FIGURE 0v.17: Generation of electron-hole pairs in silicon by alpha particles and high-energy neutrons. TABLE Ov.4 Cross-comparison of failure rates for SRAM, DRAM, and disk | Technology | Failure Rate <sup>a</sup><br>(SRAM & DRAM:<br>at 0.13 µm) | Frequency of Multi-bit<br>Errors<br>(Relative to Single-bit Errors) | Expected Service Life | |------------|-----------------------------------------------------------|---------------------------------------------------------------------|-----------------------| | SRAM | 100 per million device-hours | | Several years | | DRAM | 1 per million device-hours | 10–20% | Several years | | Disk | 1 per million device-hours | | Several years | <sup>a</sup>Note that failure rate, i.e., a variation of mean-time-between-failures, says nothing about the expected performance of a single device. However, taken with the expected service life of a device, it can give a designer or administrator an idea of expected performance. If the service life of a device is 5 years, then the part will last about 5 years. A very large installation of those devices (e.g., in the case of disks or DRAMs, hundreds or more) will collectively see the expected failure rate: i.e., several hundred disks will collectively see several million device hours of operation before a single disk fails. electron-hole pairs behave as temporary surges in current or as charges that can change the data values in storage cells. In addition, charge from the ionized trails of electron-hole pairs can impact the voltage level of bit lines as the value of the stored data is resolved by the sense amplifiers. The result is that the soft error rate (SER) of a memory-storage device depends on a combination of factors including the type, number, and energy distribution of the incident particles as well as the process technology design of the storage cells, design of the bit lines and sense amplifiers, voltage level of the device, as well as the design of the logic circuits that control the movement of data in the DRAM device. Table Ov.4 compares the failure rates for SRAM, DRAM, and disk. SRAM device error rates have historically tracked DRAM devices and did so up until the 180-nm process generation. The combination of reduced supply voltage and reduced critical cell charge means that SRAM SERs have climbed dramatically for the 180-nm and 130-nm process generations. In a recent publication, Monolithic System Technology, Inc. (MoSys) claimed that for the 250-nm process generation, SRAM SERs were reported to be in the range of 100 failures per million device-hours per megabit, while SERs were reported to be in the range of 100,000 failures per megabit for the 130-nm process generation. The generalized trend is expected to continue to increase as the demand for low power dissipation forces a continued reduction in supply voltage and reduced critical charge per cell. Solid-state memory devices (SRAMs and DRAMs) are typically protected by error detection codes and/or ECC. These are mechanisms wherein data redundancy is used to detect and/or recover from single- and even multi-bit errors. For instance, parity is a simple scheme that adds a bit to a protected word, indicating the number of even or odd bits in the word. If the read value of the word does not match the parity value, then the processor knows that the read value does not equal the value that was initially written, and an error has occurred. Error correction is achieved by encoding a word such that a bit error moves the resulting word some distance away from the original word (in the Hamming-distance sense) into an invalid encoding. The encoding space is chosen such that the new, invalid word is closest in the space to the original, valid word. Thus, the original word can always be derived from an invalid codeword, assuming a maximum number of bit errors. Due to SRAM's extreme sensitivity to soft errors, modern processors now ship with parity and singlebit error correction for the SRAM caches. Typically, the tag arrays are protected by parity, whereas the data arrays are protected by single-bit error correction. More sophisticated multi-bit ECC algorithms are typically not deployed for on-chip SRAM caches in modern processors since the addition of sophisticated computation circuitry can add to the die size and cause significant delay relative to the timing demands of the on-chip caches. Moreover, caches store frequently accessed data, and in case an uncorrectable error is detected, a processor simply has to re-fetch the data from memory. In this sense, it can be considered unnecessary to detect and correct multi-bit errors, but sufficient to simply detect multi-bit errors. However, in the physical design of modern SRAMs, often designers will intentionally place capacitors above the SRAM cell to improve SER. Disk reliability is a more-researched area than data reliability in disks, because data stored in magnetic disks tends to be more resistant to transient errors than data stored in solid-state memories. In other words, whereas reliability in solid-state memories is largely concerned with correcting soft errors, reliability in hard disks is concerned with the fact that disks occasionally die, taking most or all of their data with them. Given that the disk drive performs the function of permanent store, its reliability is paramount, and, as Table Ov.4 shows, disks tend to last several years. This data is corroborated by a recent study from researchers at Google [Pinheiro et al. 2007]. The study tracks the behavior and environmental parameters of a fleet of over 100,000 disks for five years. Reliability in the disk system is improved in much the same manner as ECC: data stored in the disk system is done so in a redundant fashion. RAID (redundant array of inexpensive disks) is a technique wherein encoded data is striped across multiple disks, so that even in the case of a disk's total failure the data will always be available. #### Ov.3.4 Virtual Memory Virtual memory is the mechanism by which the operating system provides executing software access to the memory system. In this regard, it is the primary consumer of the memory system: its procedures, data structures, and protocols dictate how the components of the memory system are used by all software that runs on the computer. It therefore behooves the reader to know what the virtual memory system does and how it does it. This section provides a brief overview of the mechanics of virtual memory. More detailed treatments of the topic can also be found on-line in articles by the author [Jacob & Mudge 1998a-c]. In general, programs today are written to run on no particular hardware configuration. They have no knowledge of the underlying memory system. Processes execute in imaginary address spaces that are mapped onto the memory system (including the DRAM system and disk system) by the operating system. Processes generate instruction fetches and loads and stores using imaginary or "virtual" names for their instructions and data. The ultimate home for the process's address space is nonvolatile permanent store, usually a disk drive; this is where the process's instructions and data come from and where all of its permanent changes go to. Every hardware memory structure between the CPU and the permanent store is a cache for the instructions and data in the process's address space. This includes main memory-main memory is really nothing more than a cache for a process's virtual address space. A cache operates on the principle that a small, fast storage device can hold the most important data found on a larger, slower storage device, effectively making the slower device look fast. The large storage area in this case is the process address space, which can range from kilobytes to gigabytes or more in size. Everything in the address space initially comes from the program file stored on disk or is created on demand and defined to be zero. This is illustrated in Figure Ov.18. # Address Translation Translating addresses from virtual space to physical space is depicted in Figure Ov.19. Addresses are mapped at the granularity of pages. Virtual memory is FIGURE Ov.18: Caching the process address space. In the first view, a process is shown referencing locations in its address space. Note that all loads, stores, and fetches use virtual names for objects. The second view illustrates that a process references locations in its address space indirectly through a hierarchy of caches. The third view shows that the address space is not a linear object stored on some device, but is instead scattered across hard drives and dynamically allocated when necessary. FIGURE Ov.19: Mapping virtual pages into physical page frames. essentially a mapping of *virtual page numbers (VPNs)* to *page frame numbers (PFNs)*. The mapping is a function, and any virtual page can have only one location. However, the inverse map is not necessarily a function. It is possible and sometimes advantageous to have several virtual pages mapped to the same page frame (to share memory between processes or threads or to allow different views of data with different protections, for example). This is depicted in Figure Ov.19 by mapping two virtual pages (0x00002 and 0xFFFFC) to PFN 12. If DRAM is a cache, what is its organization? For example, an idealized *fully associative* cache (one in which any datum can reside at any location within the cache's data array) is pictured in Figure Ov.20. A data tag is fed into the cache. The first stage compares the input tag to the tag of every piece of data in the cache. The matching tag points to the data's location in the cache. However, DRAM is not physically built like a cache. For example, it has no inherent concept of a tags array: one merely tells memory what data location one wishes to read or write, and the datum at that location is read out or overwritten. There is no attempt to match the address against a tag to verify the contents of the data location. However, if main memory is to be an effective cache for the virtual address space, the tags mechanism must be implemented somewhere. There is clearly a myriad of possibilities, from special DRAM designs that include a hardware tag feature to software algorithms that make several memory references to look up one datum. Traditional virtual memory has the tags array implemented in software, and this software structure often holds more entries than there are entries in the data array (i.e., pages in main memory). The software structure is called a page table; it is a database of mapping information. The page table performs the function of the tags array depicted in Figure Ov.20. For any given memory reference, it indicates where in main memory (corresponding to "data array" in the figure) that page can be found. There are many different possible organizations for page tables, most of which require only a few memory references to find the appropriate tag entry. However, requiring more than one memory reference for a page table lookup can be very costly, and so access to the page table is sped up by caching its entries in a special cache called the transla- tion lookaside buffer (TLB) [Lee 1960], a hardware structure that typically has far fewer entries than there are pages in main memory. The TLB is a hardware cache which is usually implemented as a content addressable memory (CAM), also called a fully associative cache. The TLB takes as input a VPN, possibly extended by an address-space identifier, and returns the corresponding PFN and protection information. This is illustrated in Figure Ov.21. The address-space identifier, if used, extends the virtual address to distinguish it from similar virtual addresses produced by other processes. For a load or store to complete successfully, the FIGURE Ov.20: An idealized cache lookup. A cache is comprised of two parts: the tag's array and the data array. In the example organization, the tags act as a database. They accept as input a key (an address) and output either the location of the item in the data array or an indication that the item is not in the data array. Virtual Address: # Virtual Page Number (VPN) Page Offset Address Space Identifier (ASID) TLB Page Offset Physical Address: Page Frame Number (PFN) FIGURE Ov.21: Virtual-to-physical address translation using a TLB. TLB must contain the mapping information for that virtual location. If it does not, a *TLB miss* occurs, and the system<sup>9</sup> must search the page table for the appropriate entry and place it into the TLB. If the system fails to find the mapping information in the page table, or if it finds the mapping but it indicates that the desired page is on disk, a *page fault* occurs. A page fault interrupts the OS, which must then retrieve the page from disk and place it into memory, create a new page if the page does not yet exist (as when a process allocates a new stack frame in virgin territory), or send the process an error signal if the access is to illegal space. # Shared Memory Shared memory is a feature supported by virtual memory that causes many problems and gives rise to cache-management issues. It is a mechanism whereby two address spaces that are normally protected from each other are allowed to intersect at points, still retaining protection over the non-intersecting regions. Several processes sharing portions of their address spaces are pictured in Figure Ov.22. The shared memory mechanism only opens up a pre-defined portion of a process's address space; the rest of the address space is still protected, and even the shared portion is only unprotected for those processes sharing the memory. For instance, in Figure Ov.22, the region of A's address space that is shared with process B is unprotected from whatever actions B might want to take, but it is safe from the actions of any other processes. It is therefore useful as a simple, secure means for inter-process communication. Shared memory also reduces requirements for physical memory, as when the text regions of processes are shared whenever multiple instances of a single program are run or when multiple instances of a common library are used in different programs. FIGURE Ov.22: Shared memory. Shared memory allows processes to overlap portions of their address space while retaining protection for the nonintersecting regions. This is a simple and effective method for inter-process communication. Pictured are four process address spaces that have overlapped. The darker regions are shared by more than one process, while the lightest regions are still protected from other processes. <sup>&</sup>lt;sup>9</sup>In the discussions, we will use the generic term "system" when the acting agent is implementation-dependent and can refer to either a hardware state machine or the operating system. For example, in some implementations, the page table search immediately following a TLB miss is performed by the operating system (MIPS, Alpha); in other implementations, it is performed by the hardware (PowerPC, x86). The mechanism works by ensuring that shared pages map to the same physical page. This can be done by simply placing the same PFN in the page tables of two processes sharing a page. An example is shown in Figure Ov.23. Here, two very small address spaces are shown overlapping at several places, and one address space overlaps with itself; two of its virtual pages map to the same physical page. This is not just a contrived example. Many operating systems allow this, and it is useful, for example, in the implementation of user-level threads. # Some Commercial Examples A few examples of what has been done in industry can help to illustrate some of the issues involved. MIPS Page Table Design MIPS [Heinrich 1995, Kane & Heinrich 1992| eliminated the page tablewalking hardware found in traditional memory management units and, in doing so, demonstrated that software can table-walk with reasonable efficiency. It also presented a simple hierarchical page table design. shown in Figure Ov.24. On a TLB miss, the VPN of the FIGURE Ov.23: An example of shared memory. Two process address spaces—one comprised of six virtual pages and the other of seven virtual pages—are shown sharing several pages. Their page tables maintain information on where virtual pages are located in physical memory. The darkened pages are mapped to several locations; note that the darkest page is mapped at two locations in the same address space. FIGURE 0v.24: The MIPS 32-bit hierarchical page table. MIPS hardware provides support for a 2-MB linear virtual page table that maps the 2-GB user address space by constructing a virtual address from a faulting virtual address that indexes the mapping PTE (page-table entry) in the user page table. This 2-MB page table can easily be mapped by a 2-KB user root page table. address that missed the TLB is used as an index into the user page table, which is accessed using a virtual address. The architecture provides hardware support for this activity, storing the virtual address of the base of the user-level page table in a hardware register and forming the concatenation of the base address with the VPN. This is illustrated in Figure Ov.25. On a TLB miss, the hardware creates a virtual address for the mapping PTE in the user page table, which must be aligned on a 2-MB virtual boundary for the hardware's lookup address to work. The base pointer, called *PTEBase*, is stored in a hardware register and is usually changed on context switch. PowerPC Segmented Translation The IBM 801 introduced a segmented design that persisted through the POWER and PowerPC architectures [Chang & Mergen 1988, IBM & Motorola 1993, May et al. 1994, Weiss & Smith 1994]. It is illustrated in Figure Ov.26. Applications generate 32-bit "effective" addresses that are mapped onto a larger "virtual" address space at the granularity of segments, 256-MB virtual regions. Sixteen segments comprise an application's address space. The top four bits of the effective address select a segment identifier from a set of 16 registers. This segment ID is concatenated with the bottom 28 bits of the effective address to form an extended virtual address. This extended address is used in the TLB and page table. The operating system performs data movement and relocation at the granularity of pages, not segments. The architecture does not use explicit addressspace identifiers; the segment registers ensure address space protection. If two processes duplicate an identifier in their segment registers, they share that virtual segment by definition. Similarly, protection is guaranteed if identifiers are not duplicated. If memory is shared through global addresses, the TLB and cache need not be flushed on context switch<sup>10</sup> because the system behaves like a single address space operating system. For more details, see Chapter 31, Section 31.1.7, Perspective: Segmented Addressing Solves the Synonym Problem. FIGURE Ov.25: The use of the MIPS TLB context register. The VPN of the faulting virtual address is placed into the context register, creating the virtual address of the mapping PTE. This PTE goes directly into the TLB. FIGURE Ov.26: PowerPC segmented address translation. Processes generate 32-bit effective addresses that are mapped onto a 52-bit address space via 16 segment registers, using the top 4 bits of the effective address as an index. It is this extended virtual address that is mapped by the TLB and page table. The segments provide address space protection and can be used for shared memory. <sup>&</sup>lt;sup>10</sup>Flushing is avoided until the system runs out of identifiers and must reuse them. For example, the address-space identifiers on the MIPS R3000 and Alpha 21064 are six bits wide, with a maximum of 64 active processes [Digital 1994, Kane & Heinrich 1992]. If more processes are desired, identifiers must be constantly reassigned, requiring TLB and virtual-cache flushes. # Ov.4 An Example Holistic Analysis Disk I/O accounts for a substantial fraction of an application's execution time and power dissipation. A new DRAM technology called *Fully Buffered DIMM* (*FB-DIMM*) has been in development in the industry [Vogt 2004a, b, Haas & Vogt 2005], and, though it provides storage scalability significantly beyond the current DDRx architecture, FB-DIMM has met with some resistance due to its high power dissipation. Our modeling results show that the energy consumed in a moderate-size FB-DIMM system is indeed quite large, and it can easily approach the energy consumed by a disk. This analysis looks at a trade-off between storage in the DRAM system and in the disk system, focusing on the disk-side write buffer; if configured and managed correctly, the write buffer enables a system to approach the performance of a large DRAM installation at half the energy. Disk-side caches and write buffers have been proposed and studied, but their effect upon total system behavior has not been studied. We present the impact on total system execution time, CPI, and memory-system power, including the effects of the operating system. Using a full-system, executionbased simulator that combines Bochs, Wattch, CACTI, DRAMsim, and DiskSim and boots the RedHat Linux 6.0 kernel, we have investigated the memory-system behavior of the SPEC CPU2000 applications. We study the disk-side cache in both single-disk and RAID-5 organizations. Cache parameters include size, organization, whether the cache supports write caching or not, and whether it prefetches read blocks or not. Our results are given in terms of L1/L2 cache accesses, power dissipation, and energy consumption; DRAMsystem accesses, power dissipation, and energy consumption; disk-system accesses, power dissipation, and energy consumption; and execution time of the application plus operating system, in seconds. The results are not from sampling, but rather from a simulator that calculates these values on a cycle-by-cycle basis over the entire execution of the application. # Ov.4.1 Fully-Buffered DIMM vs. the Disk Cache It is common knowledge that disk I/O is expensive in both power dissipated and time spent waiting on it. What is less well known is the system-wide breakdown of disk power versus cache power versus DRAM power, especially in light of the newest DRAM architecture adopted by industry, the FB-DIMM. This new DRAM standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. It has been shown to provide performance similar to that of DDRx systems, and thus, it represents a relatively low-overhead mechanism (in terms of execution time) for scaling DRAM-system capacity. FB-DIMM's latency degradation is not severe. It provides a noticeable bandwidth improvement, and it is relatively insensitive to scheduling policies [Ganesh et al. 2007]. FB-DIMM was designed to solve the problem of storage scalability in the DRAM system, and it provides scalability well beyond the current JEDEC-style DDRx architecture, which supports at most two to four DIMMs in a fully populated dual-channel system (DDR2 supports up to two DIMMs per channel; proposals for DDR3 include limiting a channel to a single DIMM). The daisy-chained architecture of FB-DIMM supports up to eight DIMMs per channel, and its narrow bus requires roughly one-third the pins of a DDRx SDRAM system. Thus, an FB-DIMM system supports an order of magnitude more DIMMs than DDRx. This scalability comes at a cost, however. The DIMM itself dissipates almost an order of magnitude more power than a traditional DDRx DIMM. Couple this with an order-of-magnitude increase in DIMMs per system, and one faces a serious problem. To give an idea of the problem, Figure Ov.27 shows the simulation results of an entire execution of the *gzip* benchmark from SPEC CPU2000 on a complete-system simulator. The memory system is only moderate in size: one channel and four DIMMs, totalling a half-gigabyte. The graphs demonstrate numerous important issues, but in this book we are concerned with two items in particular: Program initialization is lengthy and represents a significant portion of an application's run time. As the CPI graph shows, the first two-thirds of execution time are spent dealing with the disk, and the corresponding CPI (both average and instantaneous) ranges from the 100s to the 1000s. After this initialization phase, the application settles into a