RELIABLE COMPUTER SYSTEMS

DESIGN AND EVALUATION

SECOND EDITION

DANIEL P. SIEWIOREK
ROBERT S. SWARZ

DIGITAL PRESS
Copyright © 1992 by Digital Equipment Corporation.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of the publisher.

Printed in the United States of America.
9 8 7 6 5 4 3 2 1

Order number EY-H880E-DP
The Publisher offers discounts on bulk orders of this book. For information, please write:
Special Sales Department
Digital Press
One Burlington Woods Drive
Burlington, MA 01803
Design: Outside Designs
Production: Technical Texts
Composition: DEKR Corporation
Printer: Arcata/Halliday

Trademark products mentioned in this book are listed on page 890.

Views expressed in this book are those of the authors, not of the publisher. Digital Equipment Corporation is not responsible for any errors that may appear in this book.
Library of Congress Cataloging-in-Publication Data

Siewiorek, Daniel P.
Reliable computer systems : design and evaluation / Daniel P. Siewiorek, Robert S. Swarz. -- 2nd ed.
p. cm.
Rev. ed. of: The theory and practice of reliable system design. Bedford, MA : Digital Press, c1982.
Includes bibliographical references and index.
ISBN 1-55558-075-0
1. Electronic digital computers--Reliability. 2. Fault-tolerant computing. I. Swarz, Robert S. II. Siewiorek, Daniel P. Theory and practice of reliable system design. III. Title.
QA76.5.S5377 1992
004-dc20
92-10671
CIP
CREDITS

Figure 1-3: Eugene Foley, "The Effects of Microelectronics Revolution on Systems and Board Test," Computer, Vol. 12, No. 10 (October 1979). Copyright © 1979 IEEE. Reprinted by permission.

Figure 1-6: S. Russell Craig, "Incoming Inspection and Test Programs," Electronics Test (October 1980). Reprinted by permission.

Credits are continued on p. 885, which is considered a continuation of the copyright page.
To Karon and Lonnie

A Special Remembrance:

During the development of this book, a friend, colleague, and fault-tolerant pioneer passed away. Dr. Wing N. Toy documented his 37 years of experience in designing several generations of fault-tolerant computers for the Bell System electronic switching systems described in Chapter 8. We dedicate this book to Dr. Toy in the confidence that his writings will continue to influence designs produced by those who learn from these pages.
CONTENTS

Preface xv

I THE THEORY OF RELIABLE SYSTEM DESIGN 1

1 FUNDAMENTAL CONCEPTS 3
Physical Levels in a Digital System 5
Temporal Stages of a Digital System 6
Cost of a Digital System 18
Summary 21
References 21

2 FAULTS AND THEIR MANIFESTATIONS 22
System Errors 24
Fault Manifestations 31
Fault Distributions 49
Distribution Models for Permanent Faults: The MIL-HDBK-217 Model 57
Distribution Models for Intermittent and Transient Faults 65
Software Fault Models 73
Summary 76
References 76
Problems 77

3 RELIABILITY TECHNIQUES 79
Steven A. Elkind and Daniel P. Siewiorek
System-Failure Response Stages 80
Hardware Fault-Avoidance Techniques 84
Hardware Fault-Detection Techniques 96
Hardware Masking Redundancy Techniques 138
Hardware Dynamic Redundancy Techniques 169
Software Reliability Techniques 201
Summary 219
References 219
Problems 221

4 MAINTAINABILITY AND TESTING TECHNIQUES 228
Specification-Based Diagnosis 229
Symptom-Based Diagnosis 260
Summary 268
References 268
Problems 269

5 EVALUATION CRITERIA 271
Stephen McConnel and Daniel P. Siewiorek
Introduction 271
Survey of Evaluation Criteria: Hardware 272
Survey of Evaluation Criteria: Software 279
Reliability Modeling Techniques: Combinatorial Models 285
Examples of Combinatorial Modeling 294
Reliability and Availability Modeling Techniques: Markov Models 305
Examples of Markov Modeling 334
Availability Modeling Techniques 342
Software Assistance for Modeling Techniques 349
Applications of Modeling Techniques to Systems Designs 356
Summary 391
References 391
Problems 392

6 FINANCIAL CONSIDERATIONS 402
Fundamental Concepts 402
Cost Models 408
Summary 419
References 419
Problems 420

II THE PRACTICE OF RELIABLE SYSTEM DESIGN 423

Fundamental Concepts 423
General-Purpose Computing 424
High-Availability Systems 424
Long-Life Systems 425
Critical Computations 425

7 GENERAL-PURPOSE COMPUTING 427
Introduction 427
Generic Computer 427
DEC 430
IBM 431

The DEC Case: RAMP in the VAX Family 433
Daniel P. Siewiorek
The VAX Architecture 433
First-Generation VAX Implementations 439
Second-Generation VAX Implementations 455
References 484

The IBM Case Part I: Reliability, Availability, and Serviceability in IBM 303x and IBM 3090 Processor Complexes 485
Daniel P. Siewiorek
Technology 485
Manufacturing 486
Overview of the 3090 Processor Complex 493
References 507

The IBM Case Part II: Recovery Through Programming: MVS Recovery Management 508
C.T. Connolly
Introduction 508
RAS Objectives 509
Overview of Recovery Management 509
MVS/XA Hardware Error Recovery 511
MVS/XA Serviceability Facilities 520
Availability 522
Summary 523
Bibliography 523
Reference 523

8 HIGH-AVAILABILITY SYSTEMS 524
Introduction 524
AT&T Switching Systems 524
Tandem Computers, Inc. 528
Stratus Computers, Inc. 531
References 533

The AT&T Case Part I: Fault-Tolerant Design of AT&T Telephone Switching System Processors 533
W.N. Toy
Introduction 533
Allocation and Causes of System Downtime 534
Duplex Architecture 535
Fault Simulation Techniques 538
First-Generation ESS Processors 540
Second-Generation Processors 544
Third-Generation 3B20D Processor 551
Summary 572
References 573

The AT&T Case Part II: Large-Scale Real-Time Program Retrofit Methodology in AT&T 5ESS® Switch 574
L.C. Toy
5ESS Switch Architecture Overview 574
Software Replacement 576
Summary 585
References 586

The Tandem Case: Fault Tolerance in Tandem Computer Systems 586
Joel Bartlett, Wendy Bartlett, Richard Carr, Dave Garcia, Jim Gray, Robert Horst, Robert Jardine, Doug Jewett, Dan Lenoski, and Dix McGuire
Hardware 588
Processor Module Implementation Details 597
Integrity S2 613
Maintenance Facilities and Practices 622
Software 625
Operations 647
Summary and Conclusions 647
References 648

The Stratus Case: The Stratus Architecture 648
Steven Webber
Stratus Solutions to Downtime 650
Issues of Fault Tolerance 652
System Architecture Overview 653
Recovery Scenarios 664
Architecture Tradeoffs 665
Stratus Software 666
Service Strategies 669
Summary 670

9 LONG-LIFE SYSTEMS 671
Introduction 671
Generic Spacecraft 671
Deep-Space Planetary Probes 676
Other Noteworthy Spacecraft Designs 679
References 679

The Galileo Case: Galileo Orbiter Fault Protection System 679
Robert W. Kocsis
The Galileo Spacecraft 680
Attitude and Articulation Control Subsystem 680
Command and Data Subsystem 683
AACS/CDS Interactions 687
Sequences and Fault Protection 688
Fault-Protection Design Problems and Their Resolution 689
Summary 690
References 690

10 CRITICAL COMPUTATIONS 691
Introduction 691
C.vmp 691
SIFT 693

The C.vmp Case: A Voted Multiprocessor 694
Daniel P. Siewiorek, Vittal Kini, Henry Mashburn, Stephen McConnel, and Michael Tsao
System Architecture 694
Issues of Processor Synchronization 699
Performance Measurements 702
Operational Experiences 707
References 709

The SIFT Case: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control 710
John H. Wensley, Leslie Lamport, Jack Goldberg, Milton W. Green, Karl N. Levitt, P.M. Melliar-Smith, Robert E. Shostak, and Charles B. Weinstock
Motivation and Background 710
SIFT Concept of Fault Tolerance 711
The SIFT Hardware 719
The Software System 723
The Proof of Correctness 728
Summary 733
Appendix: Sample Special Specification 733
References 735

III A DESIGN METHODOLOGY AND EXAMPLE OF DEPENDABLE SYSTEM DESIGN 737

11 A DESIGN METHODOLOGY 739
Daniel P. Siewiorek and David Johnson
Introduction 739
A Design Methodology for Dependable System Design 739

The VAXft 310 Case: A Fault-Tolerant System by Digital Equipment Corporation 745
William Bruckert and Thomas Bissett
Defining Design Goals and Requirements for the VAXft 310 746
VAXft 310 Overview 747
Details of VAXft 310 Operation 756
Summary 766

APPENDIXES 769

APPENDIX A 771
Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review 771
C.L. Chen and M.Y. Hsiao
Introduction 771
Binary Linear Block Codes 773
SEC-DED Codes 775
SEC-DED-SBD Codes 778
SBC-DBD Codes 779
DEC-TED Codes 781
Extended Error Correction 784
Conclusions 786
References 786

APPENDIX B 787
Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design 787
Algirdas Avizienis
Methodology of Code Evaluation 787
Fault Effects in Binary Arithmetic Processors 790
Low-Cost Radix-2 Arithmetic Codes 794
Multiple Arithmetic Error Codes 799
References 802

APPENDIX C 803
Design for Testability: A Survey 803
Thomas W. Williams and Kenneth P. Parker
Introduction 803
Design for Testability 807
Ad-Hoc Design for Testability 808
Structured Design for Testability 813
Self-Testing and Built-In Tests 821
Conclusion 828
References 829

APPENDIX D 831
Summary of MIL-HDBK-217E Reliability Model 831
Failure Rate Model and Factors 831
Reference 833

APPENDIX E 835
Algebraic Solutions to Markov Models 835
Jeffrey P. Hansen
Solution of MTTF Models 837
Complete Solution for Three- and Four-State Models 838
Solutions to Commonly Encountered Markov Models 839
References 839

GLOSSARY 841

REFERENCES 845

CREDITS 885

TRADEMARKS 890

INDEX 891
8 HIGH-AVAILABILITY SYSTEMS

INTRODUCTION

Dynamic redundancy is the basic approach used in high-availability systems. These systems are typically composed of multiple processors with extensive error-detection mechanisms. When an error is detected, the computation is resumed on another processor. The evolution of high-availability systems is traced through the family history of three commercial vendors: AT&T, Tandem, and Stratus.
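The detect-and-resume idea behind dynamic redundancy can be sketched in a few lines. The sketch below is purely illustrative (no vendor's implementation works this way in detail); all names are hypothetical, and the "computation" and "error detection" are stand-ins for real hardware mechanisms:

```python
# Toy model of dynamic redundancy: error detection removes a processor,
# and the computation resumes on another one from its running state.
# Illustrative only; names and structure are not from any real system.

def compute(state, step):
    """One unit of work; 'state' acts as the surviving checkpoint."""
    return state + step

def run(steps, processors):
    """Run 'steps' on the first healthy processor, failing over as needed."""
    state, i = 0, 0
    for cpu in processors:                  # try processors in turn
        while i < len(steps):
            if cpu["failed"]:               # error detected on this unit
                break                       # fail over; 'state' survives
            state = compute(state, steps[i])
            i += 1
        if i == len(steps):
            return state
    raise RuntimeError("no healthy processor remaining")

procs = [{"name": "P0", "failed": True},    # P0 flagged by error detection
         {"name": "P1", "failed": False}]
print(run([1, 2, 3, 4], procs))             # work completes on P1 -> 10
```

The essential point the sketch captures is that detection and state preservation, not the spare hardware itself, are what make resumption possible.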
AT&T SWITCHING SYSTEMS

AT&T pioneered fault-tolerant computing in the telephone switching application. The two AT&T case studies given in this chapter trace the variations of duplication and matching devised for the switching systems to detect failures and to automatically resume computations. The primary form of detection is hardware lock-step duplication and comparison, which requires about 2.5 times the hardware cost of a nonredundant system. Thousands of switching systems have been installed, and they are currently commercially available in the form of the 3B20D processor. Table 8-1 summarizes the evolution of the AT&T switching systems; it includes system characteristics such as the number of telephone lines accommodated as well as the processor model used to control the switching gear.

TABLE 8-1 Summary of installed AT&T telephone switching systems

System   Number of Lines  Year Introduced  Number Installed  Processor  Comments
1 ESS    5,000-65,000     1965             1,000             No. 1      First processor with separate control and data memories
2 ESS    1,000-10,000     1969             500               No. 2
1A ESS   100,000          1976             2,000             No. 1A     Four to eight times faster than No. 1
2B ESS   1,000-20,000     1975             >500              No. 3A     Combined control and data store; microcoded; emulates No. 2
3 ESS    500-5,000        1976             >500              No. 3A
5 ESS    1,000-85,000     1982             >1,000            No. 3B     Multipurpose processor

Telephone switching systems utilize natural redundancy in the network and its operation to meet an aggressive availability goal of 2 hours downtime in 40 years (3 minutes per year). Telephone users will redial if they get a wrong number or are disconnected. However, there is a user aggravation level that must be avoided: users will redial as long as errors do not happen too frequently. User aggravation thresholds are different for failure to establish a call (moderately high) and disconnection of an established call (very low). Thus, a telephone switching system follows a staged failure recovery process, as shown in Table 8-2.

TABLE 8-2 Levels of recovery in a telephone switching system

Phase  Recovery Action                                      Effect
1      Initialize specific transient memory.                Temporary storage affected; no calls lost
2      Reconfigure peripheral hardware. Initialize all      Lose calls being established; calls in
       transient memory.                                    progress not lost
3      Verify memory operation, establish a workable        Lose calls being established; calls in
       processor configuration, verify program, configure   progress not affected
       peripheral hardware, initialize all transient
       memory.
4      Establish a workable processor configuration,        All calls lost
       configure peripheral hardware, initialize all
       memory.

Figure 8-1 illustrates that the telephone switching application requires quite a different organization than that of a general-purpose computer. In particular, a substantial portion of the telephone switching system complexity is in the peripheral hardware. As depicted in Figure 8-1, the telephone switching system is composed of four major components: the transmission interface, the network, signal processors, and the central controller. Telephone lines carrying analog signals attach to the voice band interface frame (VIF), which samples and digitally encodes the analog signals. The output is pulse code modulated (PCM). The echo suppressor terminal (EST) removes echoes that may have been introduced on long-distance trunk lines. The PCM signals are multiplexed onto a time-slotted digital bus. The digital bus enters a time-space-time network. The time slot interchange (TSI) switches PCM signals to different time slots on the bus. The output of the TSI goes to the time multiplexed switch (TMS), which switches the PCM signals in a particular time slot from any bus to any other bus. The output of the TMS returns to the TSI, where the PCM signals may be interchanged to another time slot. Signals intended for analog lines are converted from PCM to analog signals in the VIF. A network clock coordinates the timing for all of the switching functions.

The signal processors provide scanning and signal distribution functions, thus relieving the central processor of these activities. The common channel interoffice signaling (CCIS) provides an independent data link between telephone switching systems. The CCIS terminal is used to send supervisory switching information for the

FIGURE 8-1 Diagram of a typical telephone switching system (transmission interfaces, time slot interchange, time multiplexed switch, signal processors, common channel interoffice signaling, and the central control with its stores and buses)

various trunk lines coming into the office. The entire peripheral hardware is interfaced to the central control (CC) over AC-coupled buses. A telephone switching processor is composed of the central control, which manipulates data associated with call processing, administrative tasks, and recovery; the program store; the call store, for storing transient information related to the processing of telephone calls; the file store, a disk system used to store backup program copies; the auxiliary units, magnetic tape storage containing basic restart programs and new software releases; input/output (I/O) interfaces to terminal devices; and the master control console, used as the control and display console for the system. In general, a telephone switching processor could be used to control more than one type of telephone switching system.

The history of AT&T processors is summarized in Table 8-3. Even though all the processors are based upon full duplication, it is interesting to observe the evolution from the tightly lock-stepped matching of every machine cycle in the early processors to a higher dependence on self-checking and matching only on writes to memory. Furthermore, as the processors evolved from dedicated, real-time controllers to multiple-purpose processors, the operating system and software not only became more sophisticated but also became a dominant portion of the system design and maintenance effort.

TABLE 8-3 Summary of AT&T Telephone Switching Processors

No. 1, 1965. Complexity: 12,000 gates. Unit of switching: PS, CS, CC, buses. Matching: six internal nodes, 24 bits per node; one node matched each machine cycle; node selected to be matched dependent on instruction being executed. Other error detection/correction: Hamming code on PS; parity on CS; automatic retry on CS, PS; watch-dog timer; sanity program to determine if reorganization led to a valid configuration.

No. 2, 1969. Complexity: 5,000 gates. Unit of switching: entire computer. Matching: single match point on call store input. Other error detection/correction: diagnostic programs; parity on PS; detection of multiword accesses in CS; watch-dog timer.

No. 1A, 1976. Complexity: 50,000 gates. Unit of switching: PS, CS, CC, buses. Matching: 16 internal nodes, 24 bits per node; two nodes matched each machine cycle. Other error detection/correction: two parity bits on PS; roving spares (i.e., contents of PS not completely duplicated, can be loaded from disk upon error detection); two parity bits on CS; roving spares sufficient for complete duplication of transient data; processor configuration circuit to search automatically for a valid configuration.

No. 3A, 1975. Complexity: 16,500 gates. Unit of switching: entire computer. Matching: none. Other error detection/correction: on-line processor writes into both stores; m-of-2m code on microstore plus parity; self-checking decoders; two parity bits on registers; duplication of ALU; watch-dog timer; maintenance channel for observability and controllability of the other processor; 25% of logic devoted to self-checking logic and 14% to maintenance access.

3B20D, 1981. Complexity: 75,000 gates. Unit of switching: entire computer. Matching: none. Other error detection/correction: on-line processor writes into both stores; byte parity on data paths; parity checking where parity preserved, duplication otherwise; modified Hamming code on main memory; maintenance channel for observability and controllability of the other processor; 30% of control logic devoted to self-checking; error-correction codes on disks; software audits, sanity timer, integrity monitor.

Part I of the AT&T case study in this chapter, by Wing Toy, sketches the evolution of the telephone switching system processors and focuses on the latest member of the family, the 3B20D. Part II of the case study, by Liane C. Toy, outlines the procedure used in the 5ESS for updating hardware and/or software without incurring any downtime.

TANDEM COMPUTERS, INC.
Over a decade after the first AT&T computer-controlled switching system was installed, Tandem designed a high-availability system targeted for the on-line transaction processing (OLTP) market. Replication of processors, memories, and disks was used not only to tolerate failures, but also to provide modular expansion of computing resources. Tandem was concerned about the propagation of errors, and thus developed a loosely coupled multiple-computer architecture. While one computer acts as primary, the backup computer is active only to receive periodic checkpoint information. Hence, 1.3 physical computers are required to behave as one logical fault-tolerant computer. Disks, of course, have to be fully replicated to provide a complete backup copy of the database. This approach places a heavy burden upon the system and user software developers to guarantee correct operation no matter when or where a failure occurs. In particular, the primary memory state of a computation may not be available due to the failure of the processors. Some feel, however, that the multiple-computer structure is superior to a lock-step duplication approach in tolerating design errors.

The architecture discussed in the Tandem case study, by Bartlett, Bartlett, Garcia, Gray, Horst, Jardine, Jewett, Lenoski, and McGuire, is the first commercially available, modularly expandable system designed specifically for high availability. Design objectives for the system include the following:

- "Nonstop" operation wherein failures are detected, components are reconfigured out of service, and repaired components are configured back into the system without stopping the other system components
- Fail-fast logic whereby no single hardware failure can compromise the data integrity of the system
- Modular system expansion through adding more processing power, memory, and peripherals without impacting applications software

As in the AT&T switching systems, the Tandem architecture is designed to take advantage of the OLTP application to simplify error detection and recovery. The Tandem architecture is composed of up to 16 computers interconnected by two message-oriented Dynabuses. The hardware and software modules are designed to be fail-fast; that is, to rapidly detect errors and subsequently terminate processing. Software modules employ consistency checks and defensive programming techniques. Techniques employed in hardware modules include the following:
`15
`15
`
`
`
`
`
`8. HIGH—AVAILABILITY SYSTEMS
`
`529
`
`Checksums on Dynabus messages
`Parity on data paths
`Errdr-correcting code memory
`Watch-dog timers
`‘
`
`l/O device controllers are dual ported for access by an alternate path in case of
`All
`processor or l/O failure. The software builds a process-oriented system with all come
`munications handled as messages on this hardware structure. This abstraction allows
`the blurring of the physical boundaries between processors and peripherals. Any [/0
`device or resource in the system can be accessed by a process, regardless of where
`the resource and process reside.
`Initially, hardware/firmware
`Retry is extensively used to access an l/O device.
`retries the access assuming a temporary fault. Next, software retries, followed by
`alternative path retry and finally alternative device retry.
`A network systems management program provides a set of operators that helps
`reduce the number of administrative errors typically encountered in complex systems.
`The Tandem Maintenance and Diagnostic System analyzes event logs to successfully
`call out failed field-replaceable units 90 percent of the time. Networking software exists
`that allows interconnection of up to 255 geographically dispersed Tandem systems.
`Tandem applications include order entry, hospital records, bank transactions, and
`library transactions.
`Data integrity is maintained through the mechanisms of U0 ”process pairs”; one
`l/O process is designated as primary and the other is designated as backup. All file
`modification messages are delivered to the primary i/O process. The primary sends a
`message with checkpoint information to the backup so that it can take over if the
`primary’s processor or access path to the IIO device fails. Files can also be duplicated
`on physically distinct devices controlled by an IIO process pair on physically distinct
`processors. All file modification messages are delivered to both l/O processes. Thus,
`in the event of physicaE failure or isolation of the primary, the backup file is up-to—date
`and available.
`'-
`
`User applications can also utilize the process-pair mechanism. As an example of
`how process pairs work, consider the nonstop application, program A, shown in Figure
`3—2. Program A starts a backup process, A1,
`in another processor. There are also
`duplicate file images, one designated primary and the other backup. Program A peri-
`odically (at user—specified points) sends checkpoint information to A1. A1 is the same
`program as A, but knows that it is a backup program. A1 reads checkpoint messages
`to update its data area, file status, and program counter.
`The checkpoint information is inserted in the corresponding memory locations of
`the backup process, as opposed to the more usual approach of updating a disk file.
`This approach permits the backup process to take over immediately in the event of
`failure without having to perform the usual recovery journaling and disk accesses
`before processing resumes.
`Program A1 loads and executes if the system reports that A’s processor is down
`(error messages sent from A’s operating system image or A’s processor fails to respond
`
`16
`16
to a periodic "I'm alive" message). All file activity by A is performed on both the primary and backup file copies. When A1 starts to execute from the last checkpoint, it may attempt to repeat I/O operations successfully completed by A. The system file handler will recognize this and send A1 a successfully completed I/O message. Program A periodically asks the operating system if a backup process exists. Since one no longer does, it can request the creation and initialization of a copy of both the process and file structure.

FIGURE 8-2 Shadow processor in Tandem

A major issue in the design of loosely coupled duplicated systems is how both copies can be kept identical in the face of errors. As an example of how consistency is maintained, consider the interaction of an I/O process pair as depicted in Table 8-4. Initially, all sequence numbers (SeqNo) are set to zero. The requester sends a request to the server. If the sequence number is less than the server's local copy, a failure has occurred and the status of the completed operation is returned. Note that the requested operation is done only once. Next, the operation is performed and a checkpoint of the request is sent to the server backup. The disk is written, the sequence number incremented to one, and the results checkpointed to the server backup, which also increments its sequence number. The results are returned from the server to the requester. Finally, the results are checkpointed to the requester backup, which also increments its sequence number.

Now consider failures. If either backup fails, the operation completes successfully. If the requester fails after the request has been made, the server will complete the operation but be unable to return the result. When the requester backup becomes active, it will repeat the request. Since its sequence number is zero, the server test at step 2 will return the result without performing the operation again. Finally, if the server fails, the server backup either does nothing or completes the operation using checkpointed information. When the requester resends the request, the new server (that is, the old server backup) either performs the operation or returns the saved results. More information on the operating system and the programming of nonstop applications can be found in Bartlett [1978].
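The sequence-number discipline just described is what makes a retried request safe: the server performs each operation at most once, and a repeat after failover gets the saved result. The following sketch is an illustrative model of that idea only, not Tandem's code; the class and field names are hypothetical:

```python
# Toy model of the process-pair sequence-number protocol: a request
# retried by the requester backup returns the saved status instead of
# being executed a second time. Illustrative only.

class Server:
    def __init__(self):
        self.seq = 0            # server's local SeqNo
        self.saved = None       # last result, checkpointed to the backup

    def write(self, req_seq, record):
        if req_seq < self.seq:  # stale request: a failure occurred upstream
            return self.saved   # return saved status; do NOT redo the write
        self.saved = f"wrote {record}"   # perform the operation once
        self.seq += 1           # increment SeqNo (and checkpoint to backup)
        return self.saved

server = Server()
first = server.write(0, "rec-A")   # normal path: operation performed
retry = server.write(0, "rec-A")   # requester backup repeats the request
assert first == retry == "wrote rec-A" and server.seq == 1
```

The comparison in the first line of `write` corresponds to the server test at step 2 of Table 8-4.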
TABLE 8-4 Sample process-pair transactions

Initially, SeqNo = 0 in the requester, the requester backup, the server, and the server backup.

Step 1: Requester issues a request to write a record; the request is sent to the server.
Step 2: Server: if SeqNo < MySeqNo, then return saved status.
Step 3: Server: otherwise, read disk, perform operation, and checkpoint the request; the server backup saves the request.
Step 4: Server writes to disk; SeqNo = 1; checkpoints the result; the server backup saves the result and sets SeqNo = 1.
Step 5: Server returns results to the requester.
Step 6: Requester checkpoints results to the requester backup, which sets SeqNo = 1.

Source: Bartlett, 1981; © 1981 ACM.

STRATUS COMPUTERS, INC.
Whereas the Tandem architecture was based upon minicomputer technology, Stratus entered the OLTP market five years after Tandem by harnessing microprocessors. By 1980, the performance of microprocessor chips was beginning to rival that of minicomputers. Because of the smaller form factor of microprocessor chips, it was possible to place two microprocessors on a single board and to compare their output pins on every clock cycle. Thus, the Stratus system appears to users as a conventional system that does not require special software for error detection and recovery. The case study by Steven Webber describes the Stratus approach in detail.

The design goal for Stratus systems is continuous processing, which is defined as uninterrupted operation without loss of data, performance degradation, or special programming. The Stratus self-checking, duplicate-and-match architecture is shown in Figure 8-3. A module (or computer) is composed of replicated power and backplane buses (StrataBus) into which a variety of boards can be inserted. Boards are logically divided into halves that drive outputs to and receive inputs from both buses. The bus drivers/receivers are duplicated and controlled independently. The logical halves are driven in lock-step by the same clock. A comparator is used to detect any disagreements between the two halves of the board. Multiple failures that affect the two independent halves of a board could cause the module to hang as it alternates between buses seeking a fault-free path. Up to 32 modules can be interconnected into a system via a message-passing Stratus intermodule bus (SIB). Access to the SIB is by dual 14-megabyte-per-second links. Systems, in turn, are tied together by an X.25 packet-switched network.
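The duplicate-and-match idea can be shown schematically. The sketch below is a simplified software model, not the Stratus hardware logic; the function names are hypothetical, and the `broken` flag simply injects a fault into one half:

```python
# Toy model of a self-checking board: two lock-step halves compute the
# same outputs each cycle; any disagreement takes the board out of
# service rather than letting a wrong value propagate. Illustrative only.

def half_a(x):
    return x * 2                          # side A of the board

def half_b(x, broken=False):
    return x * 2 + (1 if broken else 0)   # side B; 'broken' injects a fault

def board_cycle(x, broken=False):
    """Return (output, in_service); a mismatch removes the board."""
    a, b = half_a(x), half_b(x, broken)
    if a != b:
        return None, False    # comparator disagreement: board drops out
    return a, True            # outputs agree: drive the bus

out, ok = board_cycle(21)                        # healthy cycle -> (42, True)
bad, in_service = board_cycle(21, broken=True)   # fault -> (None, False)
```

Note that the comparator detects errors but cannot say which half is wrong; that is why Stratus pairs two such self-checking boards, so the spare pair keeps running while the mismatched board removes itself.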

FIGURE 8-3 The Stratus pair-and-spare architecture

Now consider how the system in Figure 8-3 tolerates failure. The two processor boards (each containing a pair of microprocessors), each a self-checking module, are used in a pair-and-spare configuration. Each board operates independently. Each half of each board (for example, side A) receives inputs from a different bus (for example, bus A) and drives a different bus (for example, bus A). Each bus is the wired-OR of one half of each board (for example, bus A is the wired-OR of all A board halves). The boards constantly compare their two halves, and upon disagreement the board removes itself from service, a maintenance interrupt is generated, and a red light is illuminated. The spare pair on the other processor board continues processing and is now the sole driver of both buses. The operating system executes a diagnostic on the failed board to determine whether the error was caused by a transient or permanent fault. In the case of a transient, the board is returned to service. Permanent faults are reported by phone to the Stratus Customer Assistance Center (CAC). The CAC reconfirms the problem, selects a replacement board of the same revision, prints installation instructions, and ships the board by overnight courier. The first time the user realizes there is a problem is when the board is delivered. The user removes the old board and inserts the new board without disrupting the system (that is, makes a "hot" swap). The new board interrupts the system, and the processor that has been running brings the replacement into full synchronization, at which point the full configuration is available again. Detection and recovery are transparent to the application software.

The detection and recovery procedures for other system components are similar, although the full implementation of pair-and-spare is restricted to only the processor and memory. The disk controllers contain duplicate read/write circuitry. Communications controllers are also self-checking. In addition, the memory controllers monitor the bus for parity errors. The controllers can declare a bus broken and instruct all boards to stop using that bus. Other boards monitor the bus for data directed to them. If the board detects an inconsistency but the memory controllers have not declared the bus broken, the board assumes that its bus receivers have failed and declares itself fail