`
Michael J. Flynn

Computer Architecture

PIPELINED AND PARALLEL PROCESSOR DESIGN
`
`
`
`
`
`
`Michael J. Flynn
`
Computer Architecture

PIPELINED AND PARALLEL PROCESSOR DESIGN
`
Intended for practicing computer designers and for students who have computer
design as a professional objective, this text presents basic concepts that underlie
processor designs. It develops a design methodology including:

Advanced concepts, such as wave and optimum pipelining, branch tables
and prediction, superscalar designs, interconnection networks, and disk
cache buffers

Design target data: instruction set data, branch data, cache data

Evaluation tools: queueing models for basic processor-memory, vector
memory, interleaved caches, static and dynamic interconnections, and disk
arrays
`
`
`
`
`
`Michael J. Flynn is Professor of Electrical Engineering at Stanford University where
`he heads the computer architecture and arithmetic research group. He received his
`BSEE from Manhattan College, M.S. from Syracuse University, and Ph.D. from
`Purdue University. A fellow of both the IEEE and the ACM, Dr. Flynn was founding
chairman of both the ACM Special Interest Group on Computer Architecture and the
`IEEE Computer Society’s Technical Committee on Computer Architecture. Dr. Flynn
`was the 1992 recipient of the ACM/IEEE Eckert-Mauchly Award.
`
`
`
`
`
`Memory Systems and Pipelined Processors, Harvey G. Cragon
`
`Algorithms and Data Structures in Computer Engineering, E. Stewart Lee
`
ISBN 0-86720-204-1
`
`9 780867 202045 >
`
`2
`
`
`
`J.C. Browne and John S. Werth, Consulting Editors
`
`An Introduction to Parallel Programming
`K. Mani Chandy and Stephen Taylor
`
`Computer Architecture: Pipelined and Parallel Processor Design
`Michael J. Flynn
`
`Concurrency in Programming and Database Systems
`Arthur J. Bernstein and Philip M. Lewis
`
`3
`
`
`
`Computer Architecture
`
`Pipelined and Parallel
`
`Processor Design
`
`Michael J. Flynn
`
`Stanford University
`
`
`
`
`
`Jones and Bartlett Publishers
`Sudbury, Massachusetts
`Boston
`London
`Singapore
`
`
`
`4
`
`
`
`
`
Editorial, Sales, and Customer Service Offices
`Jones and Bartlett Publishers
`40 Tall Pine Drive
`Sudbury, MA 01776
`508—443-5000
`800—832-0034
`
`Jones and Bartlett International
`Barb House, Barb Mews
`London W6 7PA
`
`Copyright © 1995 by Jones and Bartlett Publishers, Inc.
`
`All rights reserved. No part of the material protected by this copyright notice
`may be reproduced or utilized in any form, electronic or mechanical, including
`photocopying, recording, or by any information storage and retrieval system,
`without written permission from the copyright owner.
`
Library of Congress Cataloging-in-Publication Data

Flynn, Michael J., 1934-
    Computer architecture : pipelined and parallel processor design /
Michael J. Flynn.
        p. cm.
    Includes bibliographical references and index.
    ISBN 0-86720-204-1
    1. Computer architecture. 2. Microprocessors--Design and
construction. I. Title.
QA76.9.A73F58 1995
004.2'2--dc20                                              94-41225
                                                                CIP
`
`Printed in the United States of America
`
`99989796
`
`1098765432
`
`5
`
`
`
Contents

Preface .................................................. xv
Acknowledgments .......................................... xvii

1 Architecture and Machines                                1
  1.1 Some Definitions and Terms ......................... 3
  1.2 Interpretation and Microprogramming ................ 5
  1.3 The Instruction Set ................................ 9
  1.4 Basic Data Types ................................... 13
  1.5 Instructions ....................................... 21
      1.5.1 Classes of Operations ........................ 21
      1.5.2 Instruction Mnemonics ........................ 25
      1.5.3 General Machine Conventions .................. 26
      1.5.4 Branches ..................................... 30
      1.5.5 Register Sets and Addressing Modes ........... 32
      1.5.6 Instruction Code Examples .................... 33
      1.5.7 Other Instruction Set Issues ................. 35
      1.5.8 Program Size ................................. 37
  1.6 Addressing and Memory .............................. 39
      1.6.1 Process Addressing ........................... 39
      1.6.2 System Addresses and Segmentation ............ 41
      1.6.3 Memory Space ................................. 43
  1.7 Virtual to Real Mapping ............................ 45
  1.8 Basic Instruction Timing ........................... 47
      1.8.1 Examples of Well-mapped Machine Instruction Timing  49
      1.8.2 Overlapped and Pipelined Processors .......... 53
  1.9 Conclusions ........................................ 53
  1.10 Historical Development of Computers ............... 54
  1.11 Annotated Bibliography ............................ 56
  1.12 Problem Set ....................................... 58

2 Time, Area, and Instruction Sets                         63
  2.1 Introduction ....................................... 63
  2.2 Time ............................................... 64
      2.2.1 The Nature of a Cycle ........................ 64
      2.2.2 Partitioning Instruction Execution into Cycles  65
      2.2.3 Clocking Overhead and Reliable Clocking ...... 66
      2.2.4 Pipelined Processors ......................... 70
      2.2.5 Optimum Pipelining ........................... 70
      2.2.6 Cycle Quantization ........................... 77
      2.2.7 Wave Pipelining .............................. 79
  2.3 Cost-Area .......................................... 83
      2.3.1 Area ......................................... 84
      2.3.2 Data Storage ................................. 93
  2.4 Technology State of the Art ........................ 99
  2.5 The Economics of a Processor Project: A Study ...... 103
      2.5.1 Phase 1: Development ......................... 106
      2.5.2 Phase 2: Early Manufacturing ................. 106
      2.5.3 Phase 3: Production .......................... 107
      2.5.4 Phase 4: All Good Things Must Come to an End . 108
  2.6 Instruction Sets: Processor Evaluation Metrics ..... 109
      2.6.1 Program Execution ............................ 110
      2.6.2 Instruction Set Comparisons .................. 112
      2.6.3 Invariant Effects ............................ 117
      2.6.4 Code Density ................................. 118
      2.6.5 Role of Registers, Evaluation Stacks, and Data Buffers  124
  2.7 Conclusions ........................................ 132
  2.8 Some Areas for Further Research .................... 133
  2.9 Data Notes ......................................... 134
  2.10 Annotated Bibliography ............................ 134
  2.11 Problem Set ....................................... 136

3 Data: How Programs Behave                                141
  3.1 Introduction ....................................... 141
  3.2 Instruction Usage .................................. 142
      3.2.1 Data Categories .............................. 142
      3.2.2 Format Distribution .......................... 144
      3.2.3 Operation Set Distribution ................... 144
  3.3 Process Management ................................. 150
      3.3.1 Procedure Calls: User State .................. 150
      3.3.2 Calls to the System .......................... 153
  3.4 Breaks in Machine Execution ........................ 156
      3.4.1 Instruction Run Length ....................... 156
      3.4.2 Branches ..................................... 158
      3.4.3 Branch Target Distribution ................... 161
      3.4.4 Condition Code Testing ....................... 163
      3.4.5 Move and Arithmetic Class Operations ......... 164
      3.4.6 Register-Based Addressing .................... 167
      3.4.7 Decimal and Character Operand Length ......... 170
  3.5 Conclusions ........................................ 174
  3.6 Some Areas for Further Research .................... 174
  3.7 Data Notes ......................................... 175
  3.8 Annotated Bibliography ............................. 176
  3.9 Problem Set ........................................ 177

4 Pipelined Processor Design                               181
  4.1 Introduction ....................................... 181
      4.1.1 Evolution of a Computer Processor Family ..... 182
      4.1.2 Processor Design ............................. 185
      4.1.3 Organization of the Chapter .................. 185
  4.2 Approaching Pipelined Processors ................... 186
      4.2.1 Examples of Pipeline Implementations ......... 190
  4.3 Evaluating Pipelined Processor Performance ......... 193
  4.4 Design of a Pipelined Processor .................... 213
      4.4.1 Cache Access Controller ...................... 214
      4.4.2 Accounting for the Effect of Buffers in a Pipelined System  215
      4.4.3 Buffer Design ................................ 216
      4.4.4 Designing a Buffer for a Mean Request Rate ... 216
      4.4.5 I-Buffers Designed for Maximum Request Rates . 219
  4.5 Branches ........................................... 222
      4.5.1 Branch Elimination ........................... 225
      4.5.2 Branch Speedup ............................... 225
      4.5.3 Branch Prediction Strategies ................. 226
      4.5.4 Branch Target Capture: Branch Target Buffers . 236
  4.6 Interlocks ......................................... 240
      4.6.1 Decoder and Interlocks ....................... 240
      4.6.2 Bypassing .................................... 241
      4.6.3 Address Generation Interlocks ................ 242
      4.6.4 Execution Interlocks and Interlock Tables .... 243
  4.7 Run-On Delay ....................................... 250
  4.8 Miscellaneous Effects .............................. 251
      4.8.1 Store in Instruction Stream Delay ............ 251
  4.9 Conclusions ........................................ 259
  4.10 Some Areas for Further Research ................... 260
  4.11 Data Notes ........................................ 260
  4.12 Annotated Bibliography ............................ 261
  4.13 Problem Set ....................................... 262

5 Cache Memory                                             265
  5.1 Introduction ....................................... 265
  5.2 Basic Notions ...................................... 266
  5.3 Cache Organization ................................. 269
  5.4 Cache Data ......................................... 274
  5.5 Adjusting the Data for Cache Organization .......... 277
  5.6 Write Policies ..................................... 280
  5.7 Strategies for Line Replacement at Miss Time ....... 283
      5.7.1 Fetching a Line .............................. 284
      5.7.2 Line Replacement ............................. 286
  5.8 Cache Environment .................................. 287
  5.9 Other Types of Cache ............................... 291
  5.10 Split I- and D-Caches ............................. 294
      5.10.1 I- and D-Caches ............................. 294
      5.10.2 Code Density Effects ........................ 297
  5.11 On-Chip Caches .................................... 299
  5.12 Two-Level Caches .................................. 302
      5.12.1 Logical Inclusion ........................... 308
  5.13 Write Assembly Cache .............................. 309
  5.14 Cache References per Instruction .................. 311
      5.14.1 Instruction Traffic ......................... 311
      5.14.2 Data Traffic ................................ 314
  5.15 Technology-Dependent Cache Considerations ......... 317
  5.16 Virtual-to-Real Translation ....................... 322
      5.16.1 Translation Lookaside Buffer (TLB) .......... 323
  5.17 Overlapping the T Cycle in V → R Translation ...... 325
      5.17.1 Set Associative Caches ...................... 326
      5.17.2 Virtual Caches .............................. 327
      5.17.3 Real Caches Using Colored Pages ............. 328
  5.18 Studies ........................................... 329
      5.18.1 Actual Reference Traffic .................... 331
  5.19 Design Summary .................................... 336
      5.19.1 Cache Evaluation Design Rules ............... 336
      5.19.2 Cache/TLB Excess CPI Design Rules ........... 337
  5.20 Conclusions ....................................... 337
  5.21 Some Areas for Further Research ................... 338
  5.22 Data Notes ........................................ 338
  5.23 Bibliography ...................................... 340
  5.24 Problem Set ....................................... 341

6 Memory System Design                                     345
  6.1 Introduction ....................................... 345
  6.2 The Physical Memory ................................ 349
      6.2.1 The Memory Module ............................ 350
      6.2.2 Error Detection and Correction ............... 354
      6.2.3 Memory Buffers ............................... 358
      6.2.4 Partitioning of the Address Space ............ 358
  6.3 Models of Simple Processor-Memory Interaction ...... 360
      6.3.1 Memory Systems Design ........................ 361
      6.3.2 Multiple Simple Processors ................... 362
      6.3.3 Hellerman's Model ............................ 363
      6.3.4 Strecker's Model ............................. 363
      6.3.5 Rau's Model .................................. 365
  6.4 Processor-Memory Modeling Using Queueing Theory .... 365
      6.4.1 Performance Models of Processor-Memory Interactions  368
      6.4.2 Arrival Distribution ......................... 369
      6.4.3 Service Distribution ......................... 370
      6.4.4 Terminology .................................. 371
      6.4.5 Queue Properties ............................. 372
  6.5 Open-, Closed-, and Mixed-Queue Models ............. 374
      6.5.1 The Open-Queue (Flores) Memory Model ......... 376
      6.5.2 Closed Queues ................................ 378
      6.5.3 Mixed Queues ................................. 382
  6.6 Waiting Time, Performance, and Buffer Size ......... 384
      6.6.1 Pipelined Processors ......................... 385
      6.6.2 Designing an M/M/1 Buffer Given a Mean Queue Size  389
      6.6.3 Comparison of Memory Models .................. 391
  6.7 Review and Selection of Queueing Models ............ 394
  6.8 Processors with Cache .............................. 396
      6.8.1 Fully and Partially Blocking Caches .......... 397
      6.8.2 Accessing a Line (T_line access) ............. 397
      6.8.3 Contention Time (T_busy) and Copyback Caches . 400
      6.8.4 I/O Effects .................................. 402
      6.8.5 Performance Effects .......................... 403
      6.8.6 Copyback Cache Study ......................... 404
      6.8.7 Simple Write-Through Caches .................. 407
      6.8.8 Write-Through Cache Example .................. 410
      6.8.9 Shared Bus ................................... 411
      6.8.10 Nonblocking Caches .......................... 415
      6.8.11 Interleaved Caches .......................... 416
  6.9 Conclusions ........................................ 417
  6.10 Some Areas for Further Research ................... 418
  6.11 Data Notes ........................................ 419
  6.12 Annotated Bibliography ............................ 419
  6.13 Problem Set ....................................... 420

7 Concurrent Processors                                    425
  7.1 Introduction ....................................... 425
  7.2 Vector Processors .................................. 427
      7.2.1 Vector Functional Units ...................... 428
      7.2.2 Vector Instructions/Operations ...............
      7.2.3 Vector Processor Implementation .............. 434
      7.2.4 A Generic Vector Processor ................... 436
  7.3 Vector Memory ...................................... 437
      7.3.1 The Special Case of Vector Memory ............ 437
      7.3.2 Modeling Vector Memory Performance ........... 443
      7.3.3 Gamma (γ)-Binomial Model ..................... 446
      7.3.4 Bypassing between Vector Instructions ........ 449
  7.4 Vector Processor Speedup ........................... 452
      7.4.1 Basic Issues ................................. 452
      7.4.2 Measures ..................................... 455
  7.5 Multiple-Issue Machines ............................ 456
  7.6 Out-of-Order and Multiple-Instruction Execution .... 458
      7.6.1 Data Dependencies ............................ 459
      7.6.2 Representing Data Dependencies ............... 461
      7.6.3 Other Types of Dependencies .................. 463
      7.6.4 When and How to Detect Instruction Concurrency  464
      7.6.5 Two Scheduling Implementations ............... 467
      7.6.6 An Improved Scoreboard ....................... 475
      7.6.7 Dealing with Out-of-Order Execution .......... 486
      7.6.8 Interleaved Caches ........................... 490
      7.6.9 Branches and Speculative Execution ........... 492
      7.6.10 Adaptive Speculation ........................ 493
      7.6.11 Results ..................................... 494
  7.7 Comparing Vector and Multiple-Issue Processors ..... 499
      7.7.1 Cost Comparison .............................. 499
      7.7.2 Performance Comparison ....................... 502
      7.7.3 Alternative Organizations .................... 505
  7.8 Conclusions ........................................ 505
  7.9 Some Areas for Further Research .................... 506
  7.10 Data Notes ........................................ 507
  7.11 Annotated Bibliography ............................ 507
  7.12 Problem Set ....................................... 508

8 Shared Memory Multiprocessors                            511
  8.1 Basic Issues ....................................... 511
  8.2 Partitioning ....................................... 513
  8.3 Scheduling ......................................... 519
      8.3.1 Run-Time Scheduling Techniques ............... 520
  8.4 Synchronization and Coherency ...................... 523
  8.5 The Effects of Partitioning and Scheduling Overhead  526
      8.5.1 Grain Size and Overhead ...................... 531
  8.6 Types of Shared Memory Multiprocessors ............. 532
  8.7 Multithreaded or Shared Resource Multiprocessing ... 533
  8.8 Memory Coherence in Shared Memory Multiprocessors .. 538
  8.9 Shared-Bus Multiprocessors ......................... 541
      8.9.1 Snoopy Protocols ............................. 541
      8.9.2 Bus-Based Models ............................. 551
  8.10 Scalable Multiprocessors .......................... 555
  8.11 Directory-Based Protocols ......................... 556
      8.11.1 Directory Structure ......................... 557
      8.11.2 Invalidate Protocols ........................ 559
      8.11.3 Update Protocols ............................ 562
  8.12 Evaluating Some Systems Alternatives .............. 563
  8.13 Interconnections .................................. 569
  8.14 Static Networks ................................... 571
      8.14.1 Links and Nodes ............................. 572
  8.15 Dynamic Networks .................................. 574
  8.16 Evaluating Interconnect Networks .................. 578
      8.16.1 Direct Static vs. Indirect Dynamic .......... 578
      8.16.2 Network Dimensionality and Link-Limited Networks  585
  8.17 Hotspots and Combining ............................ 588
  8.18 Other Characterizations of Multiprocessors ........ 590
  8.19 Conclusions ....................................... 592
  8.20 Some Areas for Further Research ................... 594
  8.21 Annotated Bibliography ............................ 594
  8.22 Problem Set ....................................... 595

9 I/O and the Storage Hierarchy                            599
  9.1 The Role of I/O .................................... 599
  9.2 Evolution of I/O Systems Organization .............. 601
      9.2.1 I/O Processors/Channels ...................... 603
      9.2.2 I/O System Support for Multiprocessors ....... 604
  9.3 Design of Storage Systems .......................... 605
      9.3.1 Disk Technology .............................. 605
      9.3.2 The Disk Device .............................. 606
  9.4 Simple I/O Transactions ............................ 609
      9.4.1 Multiple Servers ............................. 614
      9.4.2 Single-Server Low Population (n) ............. 615
      9.4.3 Disk Modeling ................................ 618
      9.4.4 Multiprogramming Models and Inverted Servers . 621
      9.4.5 Improving I/O Response and Capacity .......... 630
  9.5 I/O Traffic and Virtual Memory Effects ............. 632
      9.5.1 Basic I/O Request Rate ....................... 633
      9.5.2 Virtual Memory I/O Traffic ................... 635
      9.5.3 Disk Cache Buffers ........................... 636
      9.5.4 Concurrent Disks ............................. 638
      9.5.5 Clusters of Independent Disks ................ 643
      9.5.6 Striping ..................................... 644
      9.5.7 Disk Arrays .................................. 646
      9.5.8 Composite Configurations ..................... 648
  9.6 Some Practical Considerations ...................... 652
  9.7 Redundancy in Disk Arrays .......................... 654
  9.8 Conclusions ........................................ 657
  9.9 Some Areas for Further Research .................... 658
  9.10 Data Notes ........................................
  9.11 Annotated Bibliography ............................ 659
  9.12 Problem Set ....................................... 660

10 Processor Studies                                       663
  10.1 The Baseline Mark II .............................. 663
      10.1.1 Design Assumptions .......................... 664
      10.1.2 Design Alternatives ......................... 665
      10.1.3 Pipeline Timing Analysis .................... 665
      10.1.4 Pipeline Penalty Analysis ................... 671
      10.1.5 Cache and Memory Analysis ................... 676
      10.1.6 Cost-Performance Analysis ................... 677
  10.2 Area Performance Analysis of Processors ........... 683
      10.2.1 The Problem ................................. 683
      10.2.2 Specifications .............................. 685
      10.2.3 Assumptions ................................. 688
      10.2.4 The Design .................................. 688
      10.2.5 Analysis .................................... 704
  10.3 Study Results ..................................... 717
  10.4 Conclusions ....................................... 718

Appendix A  DTMR Cache Miss Rates                          719
  A.1 Basic DTMR ......................................... 719
  A.2 Associativity Adjustments .......................... 719
  A.3 User + System ...................................... 721
  A.4 Transaction-Based Systems .......................... 722
  A.5 Multiprogrammed (Warm Cache) Environment ........... 728

Appendix B  SPECmark vs. DTMR Cache Performance            741

Appendix C  Modeling System Effects in Caches              743
  C.1 Cold Start Cache ................................... 743
  C.2 Cache Misses in Multiprogramming Environment ....... 744

Appendix D  New DRAM Technologies                          747
  D.1 Typical Performance Enhancements ................... 747
  D.2 Enhanced DRAM ...................................... 748
  D.3 Synchronous DRAM ................................... 748
  D.4 Cache DRAM ......................................... 748
  D.5 Rambus DRAM ........................................ 748
  D.6 Ramlink DRAM ....................................... 749
  D.7 Chip Level Summary ................................. 749

Appendix E  M/G/1 Queues                                   751

Appendix F  Some Details on Bus-Based Protocols            755

Bibliography                                               765

Index                                                      782
`
`
`
`
`
`Section 4.4 Design of a Pipelined Processor
`
`
`1. Data access (read).
`
`2. I-buffer (target fetch).
`
`3. I-buffer (in-line fetch).
`
`4. Data write (unless the store buffer is full).
`
`Data reads are given priority, assuming that an ample I-buffer is part of the
`design. (See section 4.4.2.) For data writes, priority is given to stores if the
`store buffer is full; otherwise, priority is given to reads (of all types).
`Contention for cache access is frequently a concern for processor design-
`ers, and split cache designs (I and D, see chapter 5) are a popular method
`to double the available cache accessing bandwidth. For many register set
`architectures with a well-designed I-buffer and 8-byte access paths, cache
`contention at a single integrated cache should not be a serious source of
`contention, so long as the cache can accommodate a request each cycle.
`(See study 5.4.)
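To make the priority rule concrete, here is a minimal sketch of a single-port cache arbiter implementing the policy above. It is our illustration, not a design from the text; the request names and the store-buffer-full override are assumptions.

    # Hypothetical single-port cache arbiter. Priority: data reads, then
    # I-buffer target fetches, then in-line fetches, then writes -- except
    # that a full store buffer forces a store to drain first.
    def select_cache_request(data_read, target_fetch, inline_fetch,
                             store, store_buffer_full):
        """Return the request granted the cache port this cycle (or None)."""
        if store_buffer_full and store is not None:
            return store              # full store buffer overrides reads
        for request in (data_read, target_fetch, inline_fetch, store):
            if request is not None:
                return request        # first request in priority order
        return None                   # port idle this cycle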
`
`4.4.2 Accounting for the Effect of Buffers in a Pipelined System
`
`Buffers, however they are organized, change the way our instruction timing
`templates are used. The buffer decouples the time at which an event occurs
`from the time at which the input data is used.
`
`For example, so far we have shown processors without explicit I-buffers.
`Thus, actions proceed:
[timing template: * proceeds IF, IF, D, ...; * + 1 follows with its own IF, IF, D one cycle behind]
`
Suppose now that * is a branch (BC) and * + 1 has its decode delayed by
three cycles:

[timing template: * proceeds IF, IF, D; * + 1 completes its IF, but its D is held for three cycles until the branch outcome is known]

The IF for * + 1 occurs before * is known to be a branch. Presumably, there
is someplace to keep the result of the IF (a minimal one-entry buffer) so
that the decode of * + 1 can proceed when the branch decision is deter-
mined (assuming the in-line path is selected). Note that in this case the IF
completed three cycles before its decode began.

Larger buffers generalize the preceding situation. A single IF may fetch two
instructions (A and B) into the instruction buffer to be used several cycles
later, and we might have:

[timing template: one IF brings A and B into the I-buffer; A and B are then decoded, one per cycle, in later cycles]
`
`
`
`15
`
`15
`
`
`
`Chapter 4 Pipelined Processor Design
`216
`The instruction buffer ought to be transparent to instruction execution.
If an instruction is in the I-buffer, we assume that it can be accessed and
`decoded in one decode cycle (D). Depending on the type of analysis we are
`doing, we may show a buffered instruction execution sequence as:
`
[timing template: * + 1 and * + 2 shown beginning at D, AG, ... with no IF indicated]

without indicating the IF, or we may show it as:

[timing template: *, * + 1, and * + 2 each shown with the full IF, IF, D, ... sequence]
`
But in the presence of a buffer, the IF may have occurred several cycles
earlier than indicated, prior to the corresponding decode. Also, while it
appears that a single IF occurs per instruction, the IF may actually fetch
several instructions at once. We consider these possibilities later in this
chapter.
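As a toy illustration of this decoupling (our construction; the cycle numbers and the three-cycle hold are taken from the one-entry-buffer example above):

    from collections import deque

    # The IF for * + 1 completes at cycle 2, but its decode is held until
    # the branch resolves at cycle 5, so decode trails the fetch by three
    # cycles. Cycle numbering is illustrative only.
    ibuffer = deque([("*+1", 2)])     # (instruction, cycle its IF completed)
    branch_resolved_cycle = 5
    for cycle in range(2, 7):
        if cycle >= branch_resolved_cycle and ibuffer:
            inst, fetched = ibuffer.popleft()
            print(f"cycle {cycle}: decode {inst}, IF finished "
                  f"{cycle - fetched} cycles earlier")   # -> 3 cycles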
`
`4.4.3 Buffer Design
`
Buffers may be designed for a mean request rate or for a maximum request
rate. In the former case, we determine the expected number of requests and
then trade off buffer size against the probability of an overflow. Overflows
per se (where an action is lost) do not occur with internal CPU buffers, but
an "overflow" condition (full buffer and a new request) will force the pro-
cessor to slow down to bring the buffer entries down below buffer capac-
ity. Thus, each time an "overflow" condition occurs, the processor pipeline
stalls to allow the overflowed buffer to access memory (or other resource).
The store buffer is usually designed for a mean request rate.

Where buffers dominate performance, such as in-line instruction
fetching, we design for the maximum request rate. We ensure that
the buffer size is sufficient to match the processor request rate with the
cache service rate. A properly sized buffer allows the processor to con-
tinue accessing instructions at its maximum rate without the instruction
buffer running out of instructions.
`
`4.4.4 Designing a Buffer for a Mean Request Rate
Suppose we have determined the mean request rate for a particular source.
How large should we make the buffer to hold these requests? Suppose
we know Q, the mean number of requests present. For internal
processor buffers, each source can make a maximum of one request per
cycle, but these requests can cluster and appear to a buffer accessing a
slower memory as multiple concurrent requests. Now we can use the well-
known Chebyshev's Inequality or a variation thereof.
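As a sketch of how such a bound can be applied (our illustration with made-up numbers, not the book's worked example): the one-sided Markov variation needs only the mean Q, while the Chebyshev form tightens the bound if a variance estimate is also available.

    import math

    # Markov variation: Prob(q >= BF) <= Q / BF, so a buffer of
    # BF >= Q / p keeps the chance of an overflow stall below p.
    def buffer_size_markov(mean_q, p_overflow):
        return math.ceil(mean_q / p_overflow)

    # Chebyshev: Prob(|q - Q| >= k) <= var / k^2, so BF = Q + sqrt(var / p)
    # bounds the overflow probability by p.
    def buffer_size_chebyshev(mean_q, var_q, p_overflow):
        return math.ceil(mean_q + math.sqrt(var_q / p_overflow))

    print(buffer_size_markov(1.5, 0.05))          # -> 30 entries (loose)
    print(buffer_size_chebyshev(1.5, 1.5, 0.05))  # -> 7 entries (var = Q)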
`
`
`
`16
`
`16
`
`
`
`r
`
`Section 4.8 Miscellaneous Effects
`
`25 5
`
`
`Software Practices and Benchmarking
`Designers try to understand their environment by sampling bench-
`marks that they believe to be representative of their environment and,
`at the same time, somewhat control the future usage of the machines
`by controlling software practices. Depending upon the breadth of
`benchmarks used in preparing the design, the resultant processor may
`be very sensitive to particular aspects of new benchmarks [185]. A
`large and robust sample set of user programs is invaluable in assess-
`ing many issues in processor design. This suite of test programs forms
`
`
`
`
`
`
`
... higher clock rates. For the moment, we assume simple in-line branch
prediction.
`
`From Chapter 3, the relevant instruction frequency profile derived from
Tables 3.4 and 3.14-3.15 is as shown in Table 4.19.
`The branch profile is: BR = 2.6%, BC = 10.4%, BC that go to target 54%.
`We evaluate the BR penalty as:
[timing template: BR proceeds IF, D/RF, AG/T; the target IF cannot begin until the target address is generated, so the target instruction decodes two cycles late]

or a two-cycle penalty.
The delay due to BC can be determined:

[timing template: BC proceeds IF, D, AG, EX; the condition (CC) outcome is available at the end of EX]

Since the CC is set by the end of EX, there is no penalty when the BC goes in
line. If the BC goes to the target, the situation is the same as the BR case: a
two-cycle penalty.
`
`
`
`17
`
`17
`
`
`
`
`
`256 Chapter 4 Pipelined Processor Design
Table 4.19 Instruction profiles.

    Instruction    Frequency
    BR                 2.6%
    BC                10.4%
    IALU*             25.8%
    IMult              1.2%
    IDiv               0.4%
    FPAdd              6.5%
    FPMult             5.2%
    FPDiv              0.4%
    Load              30.0%
    Store             17.5%
    Total            100.0%

    * This category includes Iadd, shift, compare, etc., and those
      moves that do not reference memory (i.e., register to register).
`
BC penalty         = (0.54 BCs that go to target)(BC%)(BC delay)
                   = 0.54 × 0.104 × 2.0 = 0.11

BR penalty         = (BR%)(BR delay)
                   = 0.026 × 2 = 0.052

Total branch delay = 0.11 + 0.05 = 0.16 CPI delay
`
`We can now compute the data dependency effects (using Tables 3.19 and
`3.20). Because of bypassing, there is no ALU dependency for our baseline
`processor.
[timing template: * and * + 1 each proceed IF, D, EX, PA; with bypassing, * + 1's EX uses *'s result directly, without waiting for *'s PA]
`However, there is a load-ALU dependency required to keep the PA in order:
`
`*
`* + 1
`
`Ir
`
`DF
`D/RF AG/T
`’
`[F
`D/RF
`)——i———i EX
`
`PA
`PA
`EX
`r——+———|
`
There are actually two effects:

1. Since the PA must be done in order (the register store bandwidth is
usually limited to one PA per cycle), the ALU PA is delayed one cycle.

2. When the ALU instruction uses the result of the LD, the EX of the ALU
is delayed one cycle.

For simplicity, we combine these two effects and assume that both (EX and
PA) are delayed one cycle. This is shown above and is also manifested
in the ALU instruction that sets the CC immediately before a BC. If the BC
goes to the target, no additional delay is encountered, but if the BC goes
`
`18
`
`18
`
`
`
`Y]
`
`Section 4.8 Miscellaneous Effects
`
`2 57
`
in-line, the CC is set at the end of the cycle in which * + 1 should have been
decoded. In this simple processor, we assume that decode is not performed
until the CC is set. This results in a delay of one cycle each time a BC fails (goes
in line). This adds a delay of

Prob(BC and in line)(one cycle) = 0.104 × 0.46 × 1 = 0.05 CPI.
`
Thus, address dependencies can be caused by any of the following:

LD-LD,
LD-ST,
LD-BR,
LD-BC.
We can estimate the occurrence of these events by making the assumption
that these instructions occur with independent probability; then:

Prob of dependency event = Occurrence of an LD/ST/B preceded by an LD
                         = (Prob LD)(Prob LD/ST/B)
                         = 0.3 (0.605) = 0.18
`
However, simply the occurrence of a potential address dependency does not
mean that there is an actual dependency. In an LD-LD instruction pair, not
every leading LD references a register used in the subsequent LD address
computation. In fact, we know from Table 3.19 that the actual dependency
occurs with frequency 0.099. So, the net frequency of an address depen-
dency is:

Prob of address dependency = 0.18 × 0.099 ≈ 0.02
`
and

Delay due to address dependency = 0.02 (1.0) cycle delay
                                = 0.02 CPI delay
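Restating the address-dependency arithmetic as a script (probabilities from Table 4.19 and Table 3.19):

    # An LD precedes an LD/ST/BR/BC with probability 0.3 * 0.605; the
    # leading LD actually feeds the address computation with frequency
    # 0.099, and each real dependency costs one cycle.
    P_LD = 0.30
    P_LD_ST_B = 0.30 + 0.175 + 0.026 + 0.104       # = 0.605
    ACTUAL_DEP = 0.099
    DELAY = 1.0                                    # cycles

    p_event = P_LD * P_LD_ST_B                     # ~ 0.18
    cpi = p_event * ACTUAL_DEP * DELAY
    print(round(p_event, 2), round(cpi, 2))        # -> 0.18 0.02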
`
For run-on instructions, we simply sum up the execution times of each of
the run-on instructions and weight that execution time by their frequency
of execution. The run-on instructions that we consider for the baseline
processor are:

    Instruction    Execution time    Run-on time
    MPY.W             3 cycles          2 cycles
    DIV.W            15 cycles         14 cycles
    ADD.F             3 cycles          2 cycles
    MPY.F             3 cycles          2 cycles
    DIV.F            15 cycles         14 cycles

Run-on delay = Σ_i w_i (Run-on delay)_i
`
`
`
`19
`
`19
`
`
`
`I2
`
`
`
`58 Chapter 4 Pipeh‘ned ProcessorDesign
`
where w_i is the probability of a particular run-on instruction, corresponding
to its frequency of occurrence. For our instruction profile:

Run-on delay = 0.37 CPI delay.

Our processor performance is now determined as:

CPI = 1 (decoder) + 0.16 (branch delay) + 0.05 (LD-ALU)
      + 0.02 (addr dep. delay) + 0.37 (run-on delay)
    = 1.60 CPI

[Figure 4.32 Baseline processor performance (without cache): decode 1 CPI; branch 0.16 CPI; LD-ALU 0.05 CPI; address dependency 0.02 CPI; run-on 0.37 CPI; total = 1.60 CPI.]
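A short script reproduces the run-on sum and the resulting total (frequencies w_i from Table 4.19, run-on times from the table above):

    # Run-on delay = sum_i w_i * (run-on delay)_i, then the baseline CPI.
    runon = {              # instruction: (frequency w_i, run-on cycles)
        "MPY.W": (0.012, 2),
        "DIV.W": (0.004, 14),
        "ADD.F": (0.065, 2),
        "MPY.F": (0.052, 2),
        "DIV.F": (0.004, 14),
    }
    runon_delay = sum(w * t for w, t in runon.values())
    print(round(runon_delay, 2))                   # -> 0.37

    cpi = 1.0 + 0.16 + 0.05 + 0.02 + round(runon_delay, 2)
    print(round(cpi, 2))                           # -> 1.6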
`
Several of these delays can be reduced with a minimum of additional area:

1. We can add a branch adder so that the branch target address is available
at decode. If the target lies within the current page, it requires no
translation. This has the effect of changing the branch timing templates
to the following:

[timing template: BR/BC proceeds IF, then a combined D/AG cycle, with the target IF beginning one cycle earlier than before]

This halves the branch penalty from two cycles to one, and reduces
the overall effect of branches from 0.16 excess CPI to about 0.08.

2. We can also use a small instruction buffer. Suppose we have an 8-byte
path from the high cache to the instruction buffer. This does not reduce
the overall IF delays or delays due to branches, but it will later reduce
the possibility of contention at the cache, which at this moment we
are not considering.

While there is not much to be (easily) done about data dependency
delay, we can reduce the run-on delay simply by changing the application
(benchmark) base. Of the run-on delay (0.37 cycles per instruction), 0.29
excess CPI is due
`
`20
`
`20
`
`
`
`Section 4.9 Conclusions
`
`259
`
to floating-point instructions when running non-floating-point programs.
Suppose that in these workstation (or benchmark) applications the instruc-
tion profile remains unchanged, except that floating-point instructions are
absent. This eliminates 12% of the instructions (we assume floating-point
LD and ST are replaced with integer LDs and STs). The remaining branch,
run-on, and data dependencies are now slightly more frequent. We estimate
this effect as amplifying these non-floating delays by 1/(1 - .12) = 1.136,
so that now we have:

Run-on delay    = Integer run-on delay × 1.136
                = (14 × 0.004 + 2 × 0.012) × 1.136
                = 0.09 excess CPI

Data dependency = 0.02 × 1.136 ≈ 0.02

Branch delay    = 0.08 × 1.136 = 0.09

The new performance is 1.25 CPI. This is shown in Figure 4.33.

[Figure 4.33 Improved baseline processor with branch adder running "integer" benchmarks (no cache misses): decode 1 CPI (80%); branch 0.09 CPI; LD-ALU 0.05 CPI; data dependency 0.02 CPI; run-on 0.09 CPI; total = 1.25 CPI.]
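The integer-benchmark adjustment as a script (the 0.05 LD-ALU term is left unamplified, matching Figure 4.33):

    # Removing the 12% floating-point instructions amplifies the
    # remaining delays by 1 / (1 - 0.12).
    amp = 1 / (1 - 0.12)                           # ~ 1.136
    run_on = (14 * 0.004 + 2 * 0.012) * amp        # integer run-on only
    data_dep = 0.02 * amp                          # rounds back to 0.02
    branch = 0.08 * amp
    cpi = 1.0 + round(run_on, 2) + 0.02 + round(branch, 2) + 0.05
    print(round(run_on, 2), round(branch, 2), round(cpi, 2))
    # -> 0.09 0.09 1.25
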
`Of course, all of this ignores delays in the memory hierarchy, which we
`discuss in the next chapter.
`
`
`
`4.9 Conclusions
`
Pipelined processors have become the implementation of choice for almost
all machines, from mainframes to microprocessors.