RELIABLE
COMPUTER
SYSTEMS
`
`DESIGN AND EVALUATION
`
`SECOND EDITION
`
DANIEL P. SIEWIOREK
`ROBERT S. SWARZ
`
`DIGITAL PRESS
`
`IBM-Oracle 1014
`Page 1 of 157
`
`
`
`Copyright © 1992 by Digital Equipment Corporation.
`
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise,
without prior written permission of the publisher.
`
`Printed in the United States of America.
`987654321
`
`Order number EY-Hag0E-DP
`
The Publisher offers discounts on bulk orders of this book. For information, please write:
`
`Special Sales Department
`Digital Press
`One Burlington Woods Drive
Burlington, MA 01803
`
`Design: Outside Designs
`Production: Technical Texts
`Composition: DEKR Corporation
`Printer: Arcata/Halliday
`
Trademark products mentioned in this book are listed on page 890.

Views expressed in this book are those of the authors, not of the publisher. Digital Equipment Corporation is
not responsible for any errors that may appear in this book.
`
`Library of Congress Cataloging-in-Publication Data
`
`Siewiorek, Daniel P.
`Reliable computer systems : design and evaluation / Daniel P.
`Siewiorek, Robert S. Swarz. -- 2nd ed.
p. cm.
Rev. ed. of: The theory and practice of reliable system design.
Bedford, MA : Digital Press, c1982.
Includes bibliographical references and index.
ISBN 1-55558-075-0

1. Electronic digital computers--Reliability. 2. Fault-tolerant
computing. I. Swarz, Robert S. II. Siewiorek, Daniel P. Theory
and practice of reliable system design. III. Title.
QA76.5.S537 1992
004--dc20 92-10671
CIP
`
`CREDITS
`
Figure 1-3: Eugene Foley, "The Effects of Microelectronics Revolution on Systems and Board Test," Computer,
Vol. 12, No. 10 (October 1979). Copyright © 1979 IEEE. Reprinted by permission.

Figure 1-6: S. Russell Craig, "Incoming Inspection and Test Programs," Electronics Test (October 1980). Re-
printed by permission.

Credits are continued on p. 885, which is considered a continuation of the copyright page.
`
`
`
`
`To Karon and Lonnie
`
`A Special Remembrance:
`
During the development of this book, a friend, colleague, and fault-tolerant pioneer
passed away. Dr. Wing N. Toy documented his 37 years of experience in designing
`several generations of fault-tolerant computers for the Bell System electronic switching
`systems described in Chapter 8. We dedicate this book to Dr. Toy in the confidence
`that his writings will continue to influence designs produced by those who learn from
`these pages.
`
`
`
`
`CONTENTS
`
`Preface xv
`
I THE THEORY OF RELIABLE SYSTEM DESIGN 1

1 FUNDAMENTAL CONCEPTS 3

Physical Levels in a Digital System 5
Temporal Stages of a Digital System
Cost of a Digital System 18
Summary 21
References 21

2 FAULTS AND THEIR MANIFESTATIONS 22

System Errors 24
Fault Manifestations 31
Fault Distributions 49
Distribution Models for Permanent Faults: The MIL-HDBK-217 Model 57
Distribution Models for Intermittent and Transient Faults 65
Software Fault Models 73
Summary 76
References 76
Problems 77

3 RELIABILITY TECHNIQUES 79
Steven A. Elkind and Daniel P. Siewiorek

System-Failure Response Stages 80
Hardware Fault-Avoidance Techniques 84
Hardware Fault-Detection Techniques 96
Hardware Masking Redundancy Techniques 138
Hardware Dynamic Redundancy Techniques 169
Software Reliability Techniques 201
Summary 219
References 219
Problems 221

4 MAINTAINABILITY AND TESTING TECHNIQUES 228

Specification-Based Diagnosis 229
Symptom-Based Diagnosis 260
`
`
`Summary 268
`References 268
`Problems 269
`
`5 EVALUATION CRITERIA 271
Stephen McConnel and Daniel P. Siewiorek

Introduction 271
Survey of Evaluation Criteria: Hardware 272
Survey of Evaluation Criteria: Software 279
Reliability Modeling Techniques: Combinatorial Models 285
Examples of Combinatorial Modeling 294
Reliability and Availability Modeling Techniques: Markov Models 305
Examples of Markov Modeling 334
Availability Modeling Techniques 342
Software Assistance for Modeling Techniques 349
Applications of Modeling Techniques to Systems Designs 356
Summary 391
References 391
Problems 392

6 FINANCIAL CONSIDERATIONS 402

Fundamental Concepts 402
Cost Models 408
Summary 419
References 419
Problems 420

II THE PRACTICE OF RELIABLE SYSTEM DESIGN 423

Fundamental Concepts 402
General-Purpose Computing 424
High-Availability Systems 424
Long-Life Systems 425
Critical Computations 425

7 GENERAL-PURPOSE COMPUTING 427

Introduction 427
Generic Computer 427
DEC 430
IBM 431

The DEC Case: RAMP in the VAX Family 433
Daniel P. Siewiorek
`
`
The VAX Architecture
First-Generation VAX Implementations 439
Second-Generation VAX Implementations 455
References 484

The IBM Case Part I: Reliability, Availability, and Serviceability in IBM 308X
and IBM 3090 Processor Complexes 485
Daniel P. Siewiorek

Technology 485
Manufacturing 486
Overview of the 3090 Processor Complex 493
References 507

The IBM Case Part II: Recovery Through Programming: MVS
Recovery Management 508
C.T. Connolly

Introduction 508
RAS Objectives 509
Overview of Recovery Management 509
MVS/XA Hardware Error Recovery 511
MVS/XA Serviceability Facilities 520
Availability 522
Summary 523
Bibliography 523
Reference 523

8 HIGH-AVAILABILITY SYSTEMS 524

Introduction 524
AT&T Switching Systems
Tandem Computers, Inc.
Stratus Computers, Inc.
References 533

The AT&T Case Part I: Fault-Tolerant Design of AT&T Telephone
Switching System Processors 533
W.N. Toy

Introduction 533
Allocation and Causes of System Downtime 534
Duplex Architecture 535
Fault Simulation Techniques 538
First-Generation ESS Processors 540
Second-Generation Processors 544
Third-Generation 3B20D Processor 551
Summary 572
References 573
`
`
The AT&T Case Part II: Large-Scale Real-Time Program Retrofit Methodology in
AT&T 5ESS® Switch 574
L.C. Toy

5ESS Switch Architecture Overview 574
Software Replacement 576
Summary 585
References 586

The Tandem Case: Fault Tolerance in Tandem Computer Systems 586
Joel Bartlett, Wendy Bartlett, Richard Carr, Dave Garcia, Jim Gray, Robert Horst,
Robert Jardine, Doug Jewett, Dan Lenoski, and Dix McGuire

Hardware 588
Processor Module Implementation Details 597
Integrity S2 618
Maintenance Facilities and Practices 622
Software 625
Operations 647
Summary and Conclusions 647
References 648

The Stratus Case: The Stratus Architecture 648
Steven Webber

Stratus Solutions to Downtime 650
Issues of Fault Tolerance 652
System Architecture Overview 653
Recovery Scenarios 664
Architecture Tradeoffs 665
Stratus Software 666
Service Strategies 669
Summary 670

9 LONG-LIFE SYSTEMS 671

Introduction 671
Generic Spacecraft 671
Deep-Space Planetary Probes 676
Other Noteworthy Spacecraft Designs 679
References 679

The Galileo Case: Galileo Orbiter Fault Protection System 679
Robert W. Kocsis

The Galileo Spacecraft 680
Attitude and Articulation Control Subsystem 680
Command and Data Subsystem 683
AACS/CDS Interactions 687
Sequences and Fault Protection 688
`
`
`Fault-Protection Design Problems and Their Resolution 689
`Summary 690
`References 690
`
`10 CRITICAL COMPUTATIONS 691
`
`Introduction 691
`C.vmp 691
`SIFT 693
`
The C.vmp Case: A Voted Multiprocessor 694
`Daniel P. Siewiorek, Vittal Kini, Henry Mashburn, Stephen McConnel, and Michael Tsao
`
`System Architecture 694
`Issues of Processor Synchronization 699
`Performance Measurements 702
`Operational Experiences 707
`References 709
`
`The SIFT Case: Design and Analysis of a Fault-Tolerant Computer for
`Aircraft Control 710
John H. Wensley, Leslie Lamport, Jack Goldberg, Milton W. Green, Karl N. Levitt,
P.M. Melliar-Smith, Robert E. Shostak, and Charles B. Weinstock
`
`Motivation and Background 710
`SIFT Concept of Fault Tolerance 711
`The SIFT Hardware 719
`The Software System 723
`The Proof of Correctness 728
`Summary 733
`Appendix: Sample Special Specification 733
`References 735
`
III A DESIGN METHODOLOGY AND EXAMPLE OF DEPENDABLE SYSTEM
DESIGN 737
`
`11 A DESIGN METHODOLOGY 739
`Daniel P. Siewiorek and David Johnson
`
`Introduction 739
`A Design Methodology for Dependable System Design 739
`
The VAXft 310 Case: A Fault-Tolerant System by Digital Equipment Corporation 745
William Bruckert and Thomas Bissett

Defining Design Goals and Requirements for the VAXft 310 746
VAXft 310 Overview 747
Details of VAXft 310 Operation 756
Summary 766
`
`
APPENDIXES 769
`
`APPENDIX A 771
`
`Error-Correcting Codes for Semiconductor Memory Applications:
`A State-of-the-Art Review 771
`C.L. Chen and M.Y. Hsiao
`
Introduction 771
`Binary Linear Block Codes 773
SEC-DED Codes 775
`SEC-DED-SBD Codes 778
`SBC-DBD Codes 779
`DEC-TED Codes 781
Extended Error Correction 784
`Conclusions 786
`References 786
`
`APPENDIX B 787
`
Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital
System Design 787
Algirdas Avizienis
`
`Methodology of Code Evaluation 787
`Fault Effects in Binary Arithmetic Processors 790
Low-Cost Radix-2 Arithmetic Codes 794
`Multiple Arithmetic Error Codes 799
`References 802
`
`APPENDIX C 803
`
`Design for Testability--A Survey 803
Thomas W. Williams and Kenneth P. Parker
`
`Introduction 803
`Design for Testability 807
`Ad-Hoc Design for Testability 808
`Structured Design for Testability 813
Self-Testing and Built-In Tests 821
`Conclusion 828
`References 829
`
`APPENDIX D 831
`
`Summary of MIL-HDBK-217E Reliability Model 831
`
`Failure Rate Model and Factors 831
`Reference 833
`
`
`APPENDIX E 835
`
`Algebraic Solutions to Markov Models 835
`Jeffrey P. Hansen
`
Solution of MTTF Models 837
Complete Solution for Three- and Four-State Models 838
Solutions to Commonly Encountered Markov Models 839
References 839
`
`GLOSSARY 841
`
`REFERENCES 845
`
CREDITS 885
`
`TRADEMARKS 890
`
`INDEX 891
`
`
`
`
`8
`
`HIGH-AVAILABILITY
`SYSTEMS
`
`INTRODUCTION
`
`Dynamic redundancy is the basic approach used in high-availability systems. These
`systems are typically composed of multiple processors with extensive error-detection
mechanisms. When an error is detected, the computation is resumed on another
processor. The evolution of high-availability systems is traced through the family history
of three commercial vendors: AT&T, Tandem, and Stratus.
`
AT&T pioneered fault-tolerant computing in the telephone switching application.
The two AT&T case studies given in this chapter trace the variations of duplication and
matching devised for the switching systems to detect failures and to automatically
resume computations. The primary form of detection is hardware lock-step duplication
and comparison that requires about 2.5 times the hardware cost of a nonredundant
system. Thousands of switching systems have been installed, and they are currently
commercially available in the form of the 3B20 processor. Table 8-1 summarizes the
evolution of the AT&T switching systems. It includes system characteristics such as the
number of telephone lines accommodated as well as the processor model used to
control the switching gear.
`Telephone switching systems utilize natural redundancy in the network and its
`operation to meet an aggressive availability goal of 2 hours downtime in 40 years (3
`minutes per year). Telephone users will redial if they get a wrong number or are
disconnected. However, there is a user aggravation level that must be avoided: users
will redial as long as errors do not happen too frequently. User aggravation thresholds
are different for failure to establish a call (moderately high) and disconnection of an
established call (very low). Thus, a telephone switching system follows a staged failure
recovery process, as shown in Table 8-2.
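
The downtime goal quoted above reduces to simple arithmetic. The following is a minimal sketch of that conversion; the function names and the 365.25-day average year are assumptions made for illustration, not taken from the text.

```python
# Assumption: an average year of 365.25 days (includes leap days).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(total_downtime_hours: float, period_years: float) -> float:
    """Average downtime per year, in minutes."""
    return total_downtime_hours * 60 / period_years


def availability(downtime_min_per_year: float) -> float:
    """Fraction of the year during which the system is up."""
    return 1 - downtime_min_per_year / MINUTES_PER_YEAR


per_year = downtime_minutes_per_year(2, 40)   # the 2-hours-in-40-years goal
print(per_year)                               # 3.0 minutes per year
print(f"{availability(per_year):.7f}")        # availability implied by that goal
```

Three minutes per year corresponds to an availability slightly above 0.99999, which is why the goal is described as aggressive.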
`Figure 8-1 illustrates that the telephone switching application requires quite a
different organization than that of a general-purpose computer. In particular, a sub-
stantial portion of the telephone switching system complexity is in the peripheral
hardware. As depicted in Figure 8-1, the telephone switching system is composed of
four major components: the transmission interface, the network, signal processors,
and the central controller. Telephone lines carrying analog signals attach to the voice
band interface frame (VIF), which samples and digitally encodes the analog signals.
The output is pulse code modulated (PCM). The echo suppressor terminal (EST) re-
moves echoes that may have been introduced on long distance trunk lines. The PCM
`
`AT&T
`SWITCHING
`SYSTEMS
`
`
`
`
TABLE 8-1
Installed AT&T telephone switching systems

System   Number of Lines   Year Introduced   Number Installed   Processor   Comments
1 ESS    5,000-65,000      1965              1,000              No. 1       First processor with separate control and data memories
2 ESS    1,000-10,000      1969              500                No. 2
1A ESS   100,000           1976              2,000              No. 1A      Four to eight times faster than No. 1
2B ESS   1,000-20,000      1975              >500               No. 3A      Combined control and data store; microcoded; emulates No. 2
3 ESS    500-5,000         1976              >500               No. 3A
5 ESS    1,000-85,000      1982              >1,000             No. 3B      Multipurpose processor
`
TABLE 8-2
Levels of recovery in a telephone switching system

Phase  Recovery Action                                   Effect
1      Initialize specific transient memory.             Temporary storage affected; no
                                                         calls lost
2      Reconfigure peripheral hardware. Initialize       Lose calls being established; calls
       all transient memory.                             in progress not lost
3      Verify memory operation, establish a              Lose calls being established; calls
       workable processor configuration, verify          in progress not affected
       program, configure peripheral hardware,
       initialize all transient memory.
4      Establish a workable processor                    All calls lost
       configuration, configure peripheral
       hardware, initialize all memory.
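
The staged recovery of Table 8-2 can be pictured as an escalation loop that tries the least disruptive phase first. The sketch below is only an illustration: the phase descriptions paraphrase the table, while the `system_is_sane` callback and the rule of escalating whenever a sanity check fails are assumptions, not details given in the text.

```python
from typing import Callable, List, Tuple

# (phase, recovery action, effect on calls), paraphrased from Table 8-2.
PHASES: List[Tuple[int, str, str]] = [
    (1, "initialize specific transient memory", "no calls lost"),
    (2, "reconfigure peripherals; initialize all transient memory",
     "calls being established are lost"),
    (3, "verify memory and program; rebuild processor configuration",
     "calls being established are lost"),
    (4, "rebuild processor configuration; initialize all memory",
     "all calls lost"),
]


def recover(system_is_sane: Callable[[int], bool]) -> int:
    """Apply phases in order; return the first phase after which the system is sane.

    system_is_sane is a hypothetical check supplied by the caller.
    """
    for phase, action, effect in PHASES:
        print(f"phase {phase}: {action} ({effect})")
        if system_is_sane(phase):
            return phase
    raise RuntimeError("all recovery phases exhausted")


# Example: a fault that is only cleared once all transient memory is initialized.
print(recover(lambda phase: phase >= 2))
```

Ordering the phases by disruption means that calls in progress are sacrificed only when every milder action has failed.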
`
`signals are multiplexed onto a time-slotted digital bus. The digital bus enters a time-
space-time network. The time slot interchange (TSI) switches PCM signals to different
time slots on the bus. The output of the TSI goes to the time multiplexed switch (TMS),
which switches the PCM signals in a particular time slot from any bus to any other
bus. The output of the TMS returns to the TSI, where the PCM signals may be
interchanged to another time slot. Signals intended for analog lines are converted
from PCM to analog signals in the VIF. A network clock coordinates the timing for all
`of the switching functions.
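
The time-space-time path just described can be modeled with two small functions, one for the TSI and one for the TMS. This is a toy sketch: the list-based frames, the map conventions, and all names are assumptions chosen for illustration and say nothing about how the hardware is built.

```python
from typing import List, Sequence

Frame = List[str]  # one sample label per time slot on a bus


def time_slot_interchange(frame: Frame, slot_map: Sequence[int]) -> Frame:
    """Time stage: output slot j carries the sample from input slot slot_map[j]."""
    return [frame[src] for src in slot_map]


def time_multiplexed_switch(frames: List[Frame],
                            bus_map: Sequence[Sequence[int]]) -> List[Frame]:
    """Space stage: in slot t, output bus b carries input bus bus_map[t][b]."""
    n_slots = len(frames[0])
    return [[frames[bus_map[t][b]][t] for t in range(n_slots)]
            for b in range(len(frames))]


def tst(frames, tsi_in, bus_map, tsi_out):
    """Time-space-time: TSI on each bus, then the TMS, then TSI again."""
    staged = [time_slot_interchange(f, m) for f, m in zip(frames, tsi_in)]
    switched = time_multiplexed_switch(staged, bus_map)
    return [time_slot_interchange(f, m) for f, m in zip(switched, tsi_out)]


frames = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
out = tst(
    frames,
    tsi_in=[[2, 0, 1], [0, 1, 2]],      # bus 0: move sample a2 into slot 0
    bus_map=[[1, 0], [0, 1], [0, 1]],   # slot 0: exchange the two buses
    tsi_out=[[0, 1, 2], [1, 0, 2]],     # bus 1: exchange slots 0 and 1
)
print(out)  # [['b0', 'a0', 'a1'], ['b1', 'a2', 'b2']]
```

In this example the sample that entered on bus 0 in slot 2 leaves on bus 1 in slot 1, so it has changed both bus and time slot.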
The signal processors provide scanning and signal distribution functions, thus
relieving the central processor of these activities. The common channel interoffice
signaling (CCIS) provides an independent data line between telephone switching sys-
tems. The CCIS terminal is used to send supervisory switching information for the
various trunk lines coming into the office. The entire peripheral hardware is interfaced
to the central control (CC) over AC-coupled buses. A telephone switching processor
is composed of the central control, which manipulates data associated with call pro-
cessing, administrative tasks, and recovery; program store; call store for storing tran-
sient information related to the processing of telephone calls; file store disk system
used to store backup program copies; auxiliary unit magnetic tape storage containing
basic restart programs and new software releases; input/output (I/O) interfaces to
terminal devices; and master control console used as the control and display console
for the system. In general, a telephone switching processor could be used to control
more than one type of telephone switching system.

FIGURE 8-1
Diagram of a typical telephone switching system
[Figure not reproduced. Labeled elements include service circuits, PCM links, wire
facilities, analog and digital carriers, signaling and timing control, the peripheral
unit (PU) bus and bus interface, the interoffice signaling data link, the master control
console (MCC), the program store (PS) bus, the auxiliary unit (AU) bus, the central
control (CC), and the call store (CS).]

The history of AT&T processors is summarized in Table 8-3. Even though all the
processors are based upon full duplication, it is interesting to observe the evolution
from the tightly lock-stepped matching of every machine cycle in the early processors
to a higher dependence on self-checking and matching only on writes to memory.
Furthermore, as the processors evolved from dedicated, real-time controllers to mul-
`
`
`
TABLE 8-3
Summary of AT&T telephone switching processors

Processor/Year Introduced: No. 1, 1965
  Complexity (Gates): 12,000
  Unit of Switching: PS, CS, CC, buses
  Matching: Six internal nodes, 24 bits per node; one node matched each
    machine cycle; node selected to be matched dependent on instruction
    being executed
  Other Error Detection/Correction: Hamming code on PS; parity on CS;
    automatic retry on CS, PS; watch-dog timer; sanity program to determine
    if reorganization led to a valid configuration

Processor/Year Introduced: No. 2, 1969
  Complexity (Gates): 5,000
  Unit of Switching: Entire computer
  Matching: Single match point on call store input
  Other Error Detection/Correction: Diagnostic programs; parity on PS;
    detection of multiword accesses in CS; watch-dog timer

Processor/Year Introduced: No. 1A, 1976
  Complexity (Gates): 50,000
  Unit of Switching: PS, CS, CC, buses
  Matching: 16 internal nodes, 24 bits per node; two nodes matched each
    machine cycle
  Other Error Detection/Correction: Two-parity bits on PS; roving spares
    (i.e., contents of PS not completely duplicated, can be loaded from disk
    upon error detection); two-parity bits on CS; roving spares sufficient
    for complete duplication of transient data; processor configuration
    circuit to search automatically for a valid configuration

Processor/Year Introduced: No. 3A, 1975
  Complexity (Gates): 16,500
  Unit of Switching: Entire computer
  Matching: None
  Other Error Detection/Correction: On-line processor writes into both
    stores; m-of-2m code on microstore plus parity; self-checking decoders;
    two-parity bits on registers; duplication of ALU; watch-dog timer;
    maintenance channel for observability and controllability of the other
    processor; 25% of logic devoted to self-checking logic and 14% to
    maintenance access

Processor/Year Introduced: 3B20D, 1981
  Complexity (Gates): 75,000
  Unit of Switching: Entire computer
  Matching: None
  Other Error Detection/Correction: On-line processor writes into both
    stores; byte parity on data paths; parity checking where parity
    preserved, duplication otherwise; modified Hamming code on main memory;
    maintenance channel for observability and controllability of the other
    processor; 30% of control logic devoted to self-checking;
    error-correction codes on disks; software audits, sanity timer,
    integrity monitor
`
`
`
`
tiple-purpose processors, the operating system and software not only became more
sophisticated but also became a dominant portion of the system design and mainte-
nance effort.
Part I of the AT&T case study in this chapter, by Wing Toy, sketches the
evolution of the telephone switching system processors and focuses on the latest
member of the family, the 3B20D. Part II of the case study, by Liane C. Toy, outlines
the procedure used in the 5ESS for updating hardware and/or software without incur-
ring any downtime.

TANDEM
COMPUTERS,
INC.
`
`Over a decade after the first AT&T computer-controlled switching system was installed,
`Tandem designed a high-availability system targeted for the on-line transaction pro-
`cessing (OLTP) market. Replication of processors, memories, and disks was used not
`only to tolerate failures, but also to provide modular expansion of computing re-
`sources. Tandem was concerned about the propagation of errors, and thus developed
`a loosely coupled multiple computer architecture. While one computer acts as primary,
`the backup computer is active only to receive periodic checkpoint information. Hence,
`1.3 physical computers are required to behave as one logical fault-tolerant computer.
Disks, of course, have to be fully replicated to provide a complete backup copy of the
database. This approach places a heavy burden upon the system and user software
developers to guarantee correct operation no matter when or where a failure occurs.
In particular, the primary memory state of a computation may not be available due to
`the failure of the processors. Some feel, however, that the multiple computer structure
`is superior to a lock-step duplication approach in tolerating design errors.
`The architecture discussed in the Tandem case study, by Bartlett, Bartlett, Garcia,
Gray, Horst, Jardine, Jewett, Lenoski, and McGuire, is the first commercially available,
`modularly expandable system designed specifically for high availability. Design objec-
`tives for the system include the following:
`
• "Nonstop" operation wherein failures are detected, components are reconfigured
  out of service, and repaired components are configured back into the system
  without stopping the other system components
• Fail-fast logic whereby no single hardware failure can compromise the data integ-
  rity of the system
• Modular system expansion through adding more processing power, memory, and
  peripherals without impacting applications software
`
`As in the AT&T switching systems, the Tandem architecture is designed to take advan-
`tage of the OLTP application to simplify error detection and recovery. The Tandem
`architecture is composed of up to 16 computers interconnected by two message-
oriented Dynabuses. The hardware and software modules are designed to be fail-fast;
that is, to rapidly detect errors and subsequently terminate processing. Software modules
employ consistency checks and defensive programming techniques. Techniques em-
ployed in hardware modules include the following:
`
• Checksums on Dynabus messages
• Parity on data paths
• Error-correcting code memory
• Watch-dog timers
`
All I/O device controllers are dual ported for access by an alternate path in case of
`processor or I/O failure. The software builds a process-oriented system with all com-
`munications handled as messages on this hardware structure. This abstraction allows
`the blurring of the physical boundaries between processors and peripherals. Any I/O
`device or resource in the system can be accessed by a process, regardless of where
`the resource and process reside.
`Retry is extensively used to access an I/O device. Initially, hardware/firmware
`retries the access assuming a temporary fault. Next, software retries, followed by
`alternative path retry and finally alternative device retry.
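
The retry sequence above (hardware/firmware retry, software retry, alternative path, alternative device) amounts to trying a list of stages in order. The following is a minimal sketch under stated assumptions: `IOFault`, `access_with_escalation`, the stage tuples, and the retry counts are all hypothetical names and values, not Tandem interfaces.

```python
from typing import Callable, Sequence, Tuple


class IOFault(Exception):
    """Raised by an access attempt that failed (hypothetical error type)."""


def access_with_escalation(
    stages: Sequence[Tuple[str, Callable[[], bytes], int]],
) -> Tuple[str, bytes]:
    """Try each (name, attempt, max_tries) stage in order until one succeeds."""
    for name, attempt, max_tries in stages:
        for _ in range(max_tries):
            try:
                return name, attempt()
            except IOFault:
                continue  # assume a temporary fault and try again
    raise IOFault("all retry stages exhausted")


def failing_path() -> bytes:
    raise IOFault("primary path down")


def alternate_path() -> bytes:
    return b"record"


stages = [
    ("hardware/firmware retry", failing_path, 2),
    ("software retry", failing_path, 2),
    ("alternative path retry", alternate_path, 1),
    ("alternative device retry", alternate_path, 1),
]
print(access_with_escalation(stages))  # ('alternative path retry', b'record')
```

Cheap retries come first because most faults are assumed to be temporary; the more costly alternatives are reached only when those fail.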
`A network systems management program provides a set of operators that helps
`reduce the number of administrative errors typically encountered in complex systems.
`The Tandem Maintenance and Diagnostic System analyzes event logs to successfully
`call out failed field-replaceable units 90 percent of the time. Networking software exists
`that allows interconnection of up to 255 geographically dispersed Tandem systems.
`Tandem applications include order entry, hospital records, bank transactions, and
`library transactions.
Data integrity is maintained through the mechanisms of I/O "process pairs"; one
I/O process is designated as primary and the other is designated as backup. All file
modification messages are delivered to the primary I/O process. The primary sends a
message with checkpoint information to the backup so that it can take over if the
primary’s processor or access path to the I/O device fails. Files can also be duplicated
on physically distinct devices controlled by an I/O process pair on physically distinct
processors. All file modification messages are delivered to both I/O processes. Thus,
in the event of physical failure or isolation of the primary, the backup file is up-to-date
and available.
`User applications can also utilize the process-pair mechanism. As an example of
`how process pairs work, consider the nonstop application, program A, shown in Figure
`8-2. Program A starts a backup process, A1, in another processor. There are also
`duplicate file images, one designated primary and the other backup. Program A peri-
odically (at user-specified points) sends checkpoint information to A1. A1 is the same
program as A, but knows that it is a backup program. A1 reads checkpoint messages
to update its data area, file status, and program counter.
The checkpoint information is inserted in the corresponding memory locations of
`the backup process, as opposed to the more usual approach of updating a disk file.
`This approach permits the backup process to take over immediately in the event of
`failure without having to perform the usual recovery journaling and disk accesses
`before processing resumes.
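
The in-memory checkpointing just described can be sketched with two toy objects. Everything here is an illustrative assumption: the `Process` and `ProcessPair` classes, the dictionary used as process state, and the `checkpoint` flag stand in for the data area, file status, and program counter that A sends to A1.

```python
import copy


class Process:
    """Toy stand-in for program A (primary) or A1 (backup)."""

    def __init__(self, name):
        self.name = name
        self.state = {"data": {}, "file_status": None, "pc": 0}

    def apply_checkpoint(self, snapshot):
        # The backup overwrites its in-memory image; no disk journal is involved.
        self.state = copy.deepcopy(snapshot)


class ProcessPair:
    def __init__(self):
        self.primary, self.backup = Process("A"), Process("A1")

    def step(self, key, value, checkpoint=True):
        """Do one unit of work on the primary, optionally checkpointing it."""
        state = self.primary.state
        state["data"][key] = value
        state["pc"] += 1
        if checkpoint:
            self.backup.apply_checkpoint(state)

    def fail_primary(self):
        """The backup takes over immediately from the last checkpoint."""
        self.primary, self.backup = self.backup, None
        return self.primary


pair = ProcessPair()
pair.step("x", 1)                     # checkpointed
pair.step("y", 2, checkpoint=False)   # work done after the last checkpoint
survivor = pair.fail_primary()
print(survivor.name, survivor.state["pc"], survivor.state["data"])  # A1 1 {'x': 1}
```

The survivor resumes at the last checkpoint, so work done after it (the update to `y` here) must be repeated, which is why the choice of checkpoint points is left to the user.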
Program A1 loads and executes if the system reports that A’s processor is down
(error messages sent from A’s operating system image or A’s processor fails to respond
to a periodic "I’m alive" message). All file activity by A is performed on both the
primary and backup file copies. When A1 starts to execute from the last checkpoint,
it may attempt to repeat I/O operations successfully completed by A. The system file
handler will recognize this and send A1 a successfully completed I/O message.
Program A periodically asks the operating system if a backup process exists. Since one
no longer does, it can request the creation and initialization of a copy of both the
process and file structure.

FIGURE 8-2
Shadow processor in Tandem
[Figure not reproduced. Labeled elements include program A and its data, the backup
process A1, the "Backup exists?" query, the operating system (OS), and I/O.]
`A major issue in the design of loosely coupled duplicated systems is how both
`copies can be kept identical in the face of errors. As an example of how consistency
`is maintained, consider the interaction of an I/O processor pair as depicted in Table
`8-4. Initially, all sequence numbers (SeqNo) are set to zero. The requester sends a
`request to the server. If the sequence number is less than the server’s local copy, a
`failure has occurred and the status of the completed operation is returned. Note that
`the requested operation is done only once. Next, the operation is performed and a
`checkpoint of the request is sent to the server backup. The disk is written, the sequence
`number incremented to one, and the results checkpointed to the server backup, which
`also increments its sequence number. The results are returned from the server to the
requester. Finally the results are checkpointed to the requester backup, which also
increments its sequence number.
Now consider failures. If either backup fails, the operation completes successfully.
If the requester fails after the request has been made, the server will complete the
operation but be unable to return the result. When the requester backup becomes
active, it will repeat the request. Since its sequence number is zero, the server test at
step 2 will return the result without performing the operation again. Finally, if the
server fails, the server backup either does nothing or completes the operation using
`checkpointed information. When the requester resends the request, the new server
`(that is, the old server backup) either performs the operation or returns the saved
`results. More information on the operating system and the programming of nonstop
`applications can be found in Bartlett [1978].
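
The sequence-number test that keeps a repeated request from being performed twice can be sketched as follows. This is a simplified illustration, not the Tandem implementation: the `Server` and `Requester` classes are hypothetical, and the checkpoint messages to the two backups are omitted so that only the duplicate-detection step is shown.

```python
class Server:
    """Toy server that performs each numbered request at most once."""

    def __init__(self):
        self.seq_no = 0
        self.saved_result = None
        self.disk = []      # stand-in for the disk file
        self.writes = 0

    def handle(self, request_seq_no, record):
        if request_seq_no < self.seq_no:
            # Duplicate of a request already completed: return the saved status.
            return self.saved_result
        self.disk.append(record)    # perform the operation and write to disk
        self.writes += 1
        self.seq_no += 1
        self.saved_result = f"wrote {record!r} as #{self.seq_no}"
        return self.saved_result


class Requester:
    def __init__(self, seq_no=0):
        self.seq_no = seq_no

    def write(self, server, record):
        result = server.handle(self.seq_no, record)
        self.seq_no += 1    # advanced only once the result has arrived
        return result


server = Server()
lost = server.handle(0, "r1")     # original requester fails before seeing this
backup = Requester(seq_no=0)      # requester backup still holds SeqNo = 0
print(backup.write(server, "r1") == lost, server.writes)  # True 1
```

The repeated request is answered from the saved result, so the record is written only once even though it was requested twice.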
`
TABLE 8-4
Sample process-pair transactions

Step  Requester         Requester Backup  Server                      Server Backup
      SeqNo = 0         SeqNo = 0         SeqNo = 0                   SeqNo = 0
1     Issue request
      to write record
2                                         If SeqNo < MySeqNo, then
                                          return saved status
3                                         Otherwise, read disk,       Saves request
                                          perform operation,
                                          checkpoint request
4                                         Write to disk;              Saves result;
                                          SeqNo = 1;                  SeqNo = 1
                                          checkpoint result
5                                         Return results
6     Checkpoint        SeqNo = 1
      results

Source: Bartlett, 1981; © 1981 ACM.

STRATUS
COMPUTERS,
INC.
`
`Whereas the Tandem architecture was based upon minicomputer technology, Stratus
`entered the OLTP market five years after Tandem by harnessing microprocessors. By
1980, the performance of microprocessor chips was beginning to rival that of minicom-
puters. Because of the smaller form factor of microprocessor chips, it was possible to
place two microprocessors on a single board and to compare their output pins on
every clock cycle. Thus, the Stratus system appears to users as a conventional system
that does not require special software for error detection and recovery. The case study
by Steven Webber describes the Stratus approach in detail.
The design goal for Stratus systems is continuous processing, which is defined as
uninterrupted operation without loss of data, performance degradation, or special
programming. The Stratus self-checking, duplicate-and-match architecture is shown in
Figure 8-3. A module (or computer) is composed of replicated power and backplane
buses (StrataBus) into which a variety of boards can be inserted. Boards are logically
divided into halves that drive outputs to and receive inputs from both buses. The bus
drivers/receivers are duplicated and controlled independently. The logical halves are
driven in lock-step by the same clock. A comparator is used to detect any disagreements
between the two halves of the board. Multiple failures that affect the two independent
halves of a board could cause the module to hang as it alternated between buses
seeking a fault-free path. Up to 32 modules can be interconnected into a system via a
message-passing Stratus intermodule bus (SIB). Access to the SIB is by dual 14 mega-
byte-per-second links. Systems, in turn, are tied together by an X.25 packet-switched
`network.
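
The self-checking comparison described above can be sketched in a few lines. This is a behavioral toy under stated assumptions: the `SelfCheckingBoard` class, the use of Python functions as the two lock-stepped halves, and the `module_output` helper are hypothetical, and a second board is included only to stand for the spare of the pair-and-spare arrangement named in Figure 8-3.

```python
class SelfCheckingBoard:
    """Two halves compute the same function in lock-step; a comparator checks them."""

    def __init__(self, name, half_a, half_b):
        self.name, self.half_a, self.half_b = name, half_a, half_b
        self.in_service = True

    def cycle(self, x):
        """Return the agreed output, or None if the board is out of service."""
        if not self.in_service:
            return None
        a, b = self.half_a(x), self.half_b(x)
        if a != b:                    # comparator detects a disagreement
            self.in_service = False   # the board removes itself from service
            return None
        return a


def module_output(boards, x):
    """Output of the first board still in service (good boards agree)."""
    outputs = [out for out in (b.cycle(x) for b in boards) if out is not None]
    if not outputs:
        raise RuntimeError("no board in service")
    return outputs[0]


good = lambda x: x + 1
faulty = lambda x: x + 2              # models one half producing wrong results
board0 = SelfCheckingBoard("cpu0", good, faulty)
board1 = SelfCheckingBoard("cpu1", good, good)
print(module_output([board0, board1], 1), board0.in_service)  # 2 False
```

Because the failing board silences itself instead of emitting a wrong value, the remaining board can carry on without any vote or software recovery step.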
`
FIGURE 8-3
The Stratus pair-and-spare architecture
[Figure not reproduced. Labeled elements include Power 0 and Power 1, Bus A and
Bus B, two processor boards each with an A half and a B half, and memory, disk,
communications, and link boards.]
`
Now consider how the system in Figure 8-3 tolerates failure. The two processor
boards (each containing a pair of microprocessors), each a self-checking module, are
used in a pair-and-spare configuration. Each board operates independently. Each half
of each board (for example, side A) receives inputs from a different bus (for example,
bus A) and drives a different bus (for example, bus A). Each bus is the wired-OR of
one-half of each board (for example, bus A is the wired-OR of all A board halves). The
boards constantly compare their two halves, and upon disagreement, the board re-
moves itself from service, a maintenance interrupt is generated, and a red light is
illuminated. The spare pair on the other processor board continues processing and is
now the sole driver of both buses. The operating system executes a diagnostic on the
failed board to determine whether the error was caused by a transient or permanent
fault. In the case of a transient, the board is returned to service. Permanent faults are
reported by phone to the Stratus Customer Assistance Center (CAC). The CAC recon-
firms the problem, selects a replacement board of the same revision, prints installation
instructions, and ships the board by overnight courier. The first time the user realizes
there is a problem is when the board is delivered. The user removes the old board
and inserts the new board without disrupting the system (that is, makes a "hot" swap).
`The new board interrupts the system, and the processor tha