`
`THIRD EDITION
`
`DANIEL P. SIEWIOREK and ROBERT S. SWARZ
`
`
`RELIABLE
`COMPUTER
`SYSTEMS
`DESIGN AND EVALUATION
`
`THIRD EDITION
`
`Daniel P. Siewiorek
`Carnegie Mellon University
`Pittsburgh, Pennsylvania
`
`Robert S. Swarz
`
`Worcester Polytechnic Institute
`Worcester, Massachusetts
`
A K Peters
`Natick, Massachusetts
`
`-
`
`
`
`-
`
`Editorial, Sales, and Customer Service Office
`
`A K Peters, Ltd.
`63 South Avenue
`Natick, MA 01760
`
`Copyright © 1998 by A K Peters, Ltd.
`
`All rights reserved. No part of the material protected by this copyright notice
`may be reproduced or utilized in any form, electronic or mechanical, including
photocopying, recording, or by any information storage and retrieval system,
`without written permission from the copyright owner.
`
`Trademark products mentioned in the book are listed on page 890.
`
`Library of Congress Cataloging-in-Publication Data
`
`Siewiorek, Daniel P.
`Reliable computer systems: design and evaluation / Daniel P.
`Siewiorek, Robert S. Swarz. - 3rd ed.
`p.cm.
`First ed. published under title: The theory and practice of
`reliable system design.
`Includes bibliographical references and index.
ISBN 1-56881-092-X
1. Electronic digital computers - Reliability. 2. Fault-tolerant
computing. I. Swarz, Robert S. II. Siewiorek, Daniel P. Theory
and practice of reliable system design. III. Title.
QA76.5.S537 1998
004-dc21
98-202237
CIP

Printed in the United States of America
02 01 00 99 98    10 9 8 7 6 5 4 3 2 1
`
`CREDITS
`
`Figure 1-3: Eugene Foley, "The Effects of Microelectronics Revolution on Systems and
`Board Test," Computers, Vol. 12, No. 10 (October 1979). Copyright © 1979 IEEE.
`Reprinted by permission.
`
`Figure 1-6: S. Russell Craig, "Incoming Inspection and Test Programs," Electronics Test
`(October 1980). Reprinted by permission.
`
`Credits are continued on pages 885-890, which are considered a
`continuation of the copyright page.
`
`
`
`
`2
`
`FAULTS AND THEIR
`MANIFESTATIONS
`
`Designing a reliable system requires finding a way to prevent errors caused by the
`logical faults arising from physical failures. Figure 2-1 depicts the possible sources of
`such errors and service failures. Service can be viewed from the hierarchical physical
`levels within the system, such as service delivered by a chip or by the designer, or
`service may be viewed from the system level by the user. In either case, the following
`terms [Laprie, 1985; Avizienis, 1982] are used:
`
`• Failure occurs when the delivered service deviates from the specified service;
`failures are caused by errors.
`• Error is the manifestation of a fault within a program or data structure; errors can
`occur some distance from the fault sites.
• Fault is an incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design.
`• Permanent describes a failure or fault that is continuous and stable; in hardware,
`permanent failures reflect an irreversible physical change. (The word hard is used
`interchangeably with permanent.)
• Intermittent describes a fault that is only occasionally present due to unstable hardware or varying hardware or software states (for example, as a function of load or activity).
`• Transient describes a fault resulting from temporary environmental conditions.
`(The word soft is used interchangeably with transient.)
`
`A permanent fault can be caused by a physical defect or an inadequacy in the
`design of the system. Intermittent faults can be caused by unstable or marginally stable
`hardware or an inadequacy in design. Environmental conditions as well as some design
`errors can lead to transient faults. All these faults can cause errors. Incorrect designs
`and operator mistakes can lead directly to errors.
`The distinction between intermittent and transient faults is not always made in the
`literature [Kamal, 1975; Tasar and Tasar, 1977]. The dividing line is the applicability of
repair [Breuer, 1973; Kamal and Page, 1974; Losq, 1978; Savir, 1978]. Intermittent faults resulting from physical conditions of the hardware, incorrect hardware or software design, or even from unstable but repeated environmental conditions are potentially detectable and repairable by replacement or redesign; faults due to temporary environmental conditions, however, are incapable of repair because the hardware is physically undamaged. It is this attribute of transient faults that magnifies their importance.
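
To make the terminology concrete, the following minimal sketch (in Python, with hypothetical names not taken from the text) tags a fault with its persistence class and captures the repairability distinction drawn above:

    from dataclasses import dataclass
    from enum import Enum, auto

    class Persistence(Enum):
        PERMANENT = auto()     # continuous and stable ("hard")
        INTERMITTENT = auto()  # only occasionally present
        TRANSIENT = auto()     # caused by temporary environmental conditions ("soft")

    @dataclass
    class Fault:
        persistence: Persistence
        description: str

    def repairable(fault: Fault) -> bool:
        # Permanent and intermittent faults can be removed by replacement or
        # redesign; transient faults cannot, because the hardware is undamaged.
        return fault.persistence is not Persistence.TRANSIENT

    upset = Fault(Persistence.TRANSIENT, "alpha-particle upset in a RAM cell")
    print(repairable(upset))  # False
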
`
`22
`
`
`
`2. FAULTS AND THEIR MANIFESTATIONS
`
`23
`
FIGURE 2-1. Sources of errors and service failures
`
`Even in the absence of all physical defects, including those manifested as intermittent
`faults, errors will still occur.
Avizienis [1985] presented classes of faults based on their origin. These fault classes
are similar to the classification of faults presented by Toy [1978] for the Bell Electronic
Switching Systems. Two basic fault classes and their descriptions are as follows:

• Physical faults: These stem from physical phenomena internal to the system, such as threshold changes, shorts, opens, etc., or external changes, such as environmental, electromagnetic, vibration, etc.
• Human faults: These may be either design faults, which are committed during system design, modification, or establishment of operating procedures, or they may be interaction faults, which are violations of operating or maintenance procedures.
`
According to this classification, physical faults can be introduced or occur either
during the manufacturing stage or during operational life. Physical faults during the
useful operational life of the system are caused by physical processes that occur
through normal and abnormal use. Design faults are caused by improper translation
of an idea or concept into an operational realization. Interaction faults are caused by
ambiguous documentation or human inattention to detail.
`What are the sources of errors? What is the relative frequency of errors? How do
`faults manifest themselves as errors? Do the arrival
`times of faults (or errors) fit a
`probability distribution? If so, what are the parameters of that distribution? This chapter
`attempts to answer these questions and introduces a variety of fault models for the
`design and evaluation of fault-tolerant systems. The following section discusses the
`origin and frequency of errors by type of fault causing the errors. The remainder of
the chapter focuses on fault manifestations within systems according to the hierarchical
physical levels given in Chapter 1. It also presents the mathematical distributions that
describe the probability of fault occurrence.
`
`SYSTEM ERRORS
`
`Origin of Errors by Type of Fault
`
Transient and intermittent faults have been seen as a major source of errors in systems
in several studies. For example, an early study for the U.S. Air Force [Roth et al., 1967a]
showed that 80 percent of the electronic failures in computers are due to intermittent
faults. Another study by IBM [Ball and Hardie, 1967] indicated that "intermittents
comprised over 90% of field failures." Transient faults, which have been observed in
microprocessor chips [Brodsky, 1980], will become a more frequent problem in the
future with shrinking device dimensions, lower energy levels for indicating logical
values, and higher-speed operation.*
Table 2-1 gives the ratios of measured mean time between errors (MTBE) (due to
all fault types) to mean time to failure (MTTF) (due to permanent faults) for
several systems. The last row of this table is the estimate of permanent and transient
failure rates for a 1-megaword, 37-bit memory composed of 4K MOS RAMs [Geilhufe,
1979; Ohm, 1979]. In this case, transient errors are caused by α-particles emitted by
the decay of trace radioactive particles in the semiconductor packaging materials. As
they pass through the semiconductor material, α-particles create sufficient hole-electron
pairs to add charge to or remove charge from bit cells.
`
TABLE 2-1. Ratios of all errors to permanent errors

System/Technology     Error Detection    System MTBE for All    System MTTF for Permanent    MTBE/MTTF
                      Mechanism          Fault Types (hrs)      Faults (hrs)
CMUA PDP-10, ECL      Parity             44                     800-1600                     0.03-0.06
CM* LSI-11, NMOS      Diagnostics        128                    4200                         0.03
C.vmp TMR LSI-11      Crash              97-328                 4900                         0.02-0.07
Telettra, TTL         Mismatch           80-170                 1300                         0.06-0.13
SUN-2, TTL, MOS       Crash              689                    6552                         0.11
1M x 37 RAM, MOS      Parity             106                    1450                         0.07

Source: Data from Siewiorek et al., 1978a; Morganti, 1978; McConnel, Siewiorek, and Tsao, 1979; Geilhufe, 1979; Ohm, 1979; Lin and Siewiorek, 1990.
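
As a rough illustration (not part of the original table), the last column can be recomputed directly from the two measured columns; the sketch below uses a few of the rows:

    # Recompute MTBE/MTTF for some rows of Table 2-1 (values copied from the table).
    rows = [
        ("CM* LSI-11, NMOS", 128.0, 4200.0),   # MTBE (hrs), MTTF (hrs)
        ("SUN-2, TTL, MOS", 689.0, 6552.0),
        ("1M x 37 RAM, MOS", 106.0, 1450.0),
    ]
    for name, mtbe, mttf in rows:
        # A small ratio means errors arrive far more often than permanent failures.
        print(f"{name}: MTBE/MTTF = {mtbe / mttf:.2f}")
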
`
* The same semiconductor evolution that has led to increased reliability per gate or bit has also introduced
new failure modes. The smaller dimensions of semiconductor devices have decreased the amount of energy
required to change the state of a memory bit. The loss of memory information caused by the decay of radioactive
trace elements in packaging material has been documented. Studies show that even in sheltered environments
such as well-conditioned computer rooms, transient/intermittent errors are 20 to 50 times more prevalent than
hard failures. Transient/intermittent errors also exhibit clustering (a high probability that, once one error has
occurred, another will occur soon), workload dependence (the heavier the system workload, the more likely
an error), and common failure modes (more than one system, or portion of a system, affected simultaneously).
`
`
`
FIGURE 2-2. Operational life error rate as a function of RAM densities [From Geilhufe, 1979; © 1979 IEEE]. (Error rate plotted against memory size in bits.)
`
By exposing MOS RAMs to artificial α-particle sources, the operational life error rate can be determined as a
function of RAM density (Figure 2-2), voltage, and cycle time.
The Sun-2 data shown in Table 2-1 is derived from a study that observed 13 Sun-2
workstations on the Carnegie Mellon University Andrew network [Lin and Siewiorek,
1990]. Each workstation was composed of a Motorola 68010 processor with a 10-MHz
clock and up to 4 Fujitsu Eagle 470-Mbyte disk drives. These workstations were used
as a distributed file server system. Sampling techniques indicated that their workload
was essentially constant.
Over 21 workstation-years of data from the Sun-2 file server system were accumulated;
the data are summarized as follows:
`
Source of Error        Number of Occurrences    Mean Time to Occurrence (hrs)
Permanent fault        29                       6552
Intermittent fault     610                      58
Transient fault        446                      354
System crash           298                      689
`
`Permanent faults were determined through interviews with field service personnel
`and analysis of system maintenance logs. The most severe manifestation of an error is
`a system crash wherein the system software has to be reloaded and restarted. System
`crashes were determined from an on-line, distributed diagnostics system. All errors in
`the system log prior to a permanent
`fault were examined and allocated to either
`transient or intermittent. * If an error occurred indicating the same physical device
`
* It was assumed that a permanent failure resulted in only one system crash, after which the system was totally
inoperable.
`
`
`
`26
`
`I. THE THEORY OF RELIABLE SYSTEM DESIGN
`
`•
`
`within less than a week of the previous error, the error was deemed to be caused by
`an intermittent fault. Since the mean time to a transient fault was over two weeks, it
was felt that there was little probability of inadvertently mixing intermittent and transient faults.
A number of interesting observations can be made from the Sun-2 data. The
`permanent faults represented approximately 10 percent of the system crashes. Thus,
`if permanent faults were eliminated, the system crash rate would only be improved by
`about 10 percent. The ratio of intermittent faults to permanent faults is about 20. Thus,
`the first intermittent symptom appears over 1200 (20 x 58 hours between intermittent
`occurrences) hours prior to the repair activity. This data represents a large window of
`vulnerability and provides ample opportunity for trend analysis to help isolate the
`source of failure (for example, see Symptom-Directed Diagnosis in Chapter 4). If we
`subtract the number of permanent faults from the system crashes and divide by the
`total number of faults we see that on average only 1 out of every 4 faults causes a
`system crash. *
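
The arithmetic behind these observations can be reproduced from the counts in the Sun-2 table above; the following sketch is only an illustration of that calculation:

    # Sun-2 field data (values copied from the table above).
    permanent, intermittent, transient, crashes = 29, 610, 446, 298
    hours_between_intermittents = 58

    # Eliminating permanent faults removes only their share of the crashes.
    print(f"permanent share of crashes: {permanent / crashes:.0%}")   # about 10%

    # Intermittent-to-permanent ratio and the resulting window of vulnerability.
    ratio = intermittent / permanent
    print(f"ratio: {ratio:.0f}; window: {ratio * hours_between_intermittents:.0f} hours")

    # Crashes not explained by permanent faults, divided by all faults observed.
    all_faults = permanent + intermittent + transient
    print(f"faults causing a crash: 1 in {all_faults / (crashes - permanent):.1f}")
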
In summary, since the only information available on errors is collected by relevant
error-detecting mechanisms, it is very difficult in practice to detect all errors or determine
their exact source. Very few studies have systematically attempted to determine
the cause of errors. The existing evidence is that the vast majority are due to nonpermanent
(i.e., intermittent or transient) faults.
`
Frequency of Errors by Type of Fault
`
The ultimate goal of any fault-tolerant system is not only to produce few errors but
also to suffer as little downtime as possible. Hardware modules are constantly improving
in reliability and soon will exceed 100,000 hours mean time between
failures per module. However, transient and intermittent faults are 20-100 times more
prevalent than hard failures, and these faults can lead to a system outage. Even after
vendor and user quality assurance, thousands of bugs remain in system software. Most
software failures are transient, leading to a dump and restart of the system.
For example, typical sources of failure for a mainframe computing system as well
as the typical mean time between failures (MTBF) and mean time to repair (MTTR) are
given in the following table [Gray, 1990]:
`
Source of Failure      MTBF                          MTTR
Power                  2000 hrs                      1 hr
Phone lines
  Soft                 0.1 hr                        0.1 sec
  Hard                 4000 hrs                      10 hrs
Hardware modules       10,000 hrs                    10 hrs
Software               1 bug/1,000 lines of code
`
* Interestingly enough, 14 of the 29 permanent faults had 3 or fewer error log entries, and of those 14, 8 had
no error log entries prior to repair, all of which indicates that there is substantial room for improvement in
error-detecting mechanisms.
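
Figures like those in the MTBF/MTTR table above translate directly into steady-state availability, A = MTBF/(MTBF + MTTR); the sketch below is an illustrative calculation, not part of Gray's data:

    # Steady-state availability from MTBF and MTTR: A = MTBF / (MTBF + MTTR).
    def availability(mtbf_hrs: float, mttr_hrs: float) -> float:
        return mtbf_hrs / (mtbf_hrs + mttr_hrs)

    # Values taken from the mainframe table above (hours).
    print(f"power:           {availability(2_000.0, 1.0):.5f}")
    print(f"hard phone line: {availability(4_000.0, 10.0):.5f}")
    print(f"hardware module: {availability(10_000.0, 10.0):.5f}")
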
`
`
`
`
Since the outage of telephone lines due to soft failures is frequent and recovery is
very quick, phone line outages tend to be scattered, impacting individual terminals
but not the complete system. However, some form of redundancy, such as providing
alternative paths, is clearly required to reduce the frequency of telephone line failures.
Among the several other studies providing data on how often computers fail is a
survey of Japanese computer users [Watanabe, 1986], which included 1383 institutions
over the period from June 1984 to July 1985. These users reported 7517 outages* with
an average MTBF of 10 weeks and an average outage duration of 90 minutes. The
sources of outage, the probability that the outage was due to the source, and the
MTBF are given in the following table [Gray (Watanabe, translator), 1986]:
`
Source of Outage                               Probability    MTBF (months)
Vendor hardware, software, and maintenance     0.420          5
Application software                           0.250          9
Communication lines                            0.120          18
Environment                                    0.112          24
Operations                                     0.093          24
`
In 1985 and 1987, Gray [1990] surveyed the Early Warning Reports (EWR) from
Tandem's customers' systems. From June 1985 until June 1987, 1,300 customers were
surveyed. Over 6,000 systems composed of 16,000 processors and 97,000 disks were
included. The data represented 12,000 system-years, 32,000 CPU-years, and 97,000 disk-years
of experience. Reports of 490 system failures were provided by 205 customers.
The following table lists the sources of outage, the probability that an outage was due
to the source, and the MTBF for the source [Gray, 1990]:
`
Source of Outage    Probability    MTBF (yrs)
Hardware            0.29           140
Software            0.36           63
Operations          0.13           211
Maintenance         0.11           200
Environment         0.10           235
`
`Resultant mean time between failures for a system was approximately 27 years.
`
Example. It is interesting to compare the theoretically predicted and actually measured
probability of a disk pair failing. As will be shown in Chapter 5, the MTBF
for a duplex system with failure rate λ and repair rate μ is given by Eq. 1:

MTBF = μ/(2λ²)    (1)
`
* The author did not define the meaning of outage; we may consider an outage to be a failure to provide
service.
`
`--
`
`
`
`28
`
`I. THE THEORY OF RELIABLE SYSTEM DESIGN
`
`•
`
TABLE 2-2. Outage data for the Tandem two-year study

Source of Outage      Probability        Source of Outage      Probability
Hardware                                 Operations
  Disks               0.49                 Procedures          0.42
  Communications      0.24                 Configuration       0.39
  Processors          0.18                 Move                0.13
  Wiring              0.09                 Overflow            0.04
  Spares              0.01                 Upgrade             0.01
Maintenance
  Disks               0.67
  Communications      0.20
  Processors          0.13

Source: Gray, 1987.
`
For a 50,000-hour MTBF disk (failure rate = 1/50,000) with a 5-hour MTTR (repair
rate = 1/5), Eq. 1 predicts a 28,000-year MTBF:

MTBF = (1/5) / [2 (1/50,000) (1/50,000)] = 2.5 × 10⁸ hours = 28,539 years

Thirty-five double disk failures were reported in the 48,000 disk-pair years, yielding
an MTBF of 1371 years. The actually measured reliability was a factor of 20 less than
theoretical, but it is a factor of 200 better than nonredundant disks. The MTBF of
a CPU pair was measured at 877 years.
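
The example can be checked numerically; the sketch below evaluates Eq. 1 for the disk parameters above and compares it with the measured pair MTBF (an illustration, not code from the study):

    # Eq. 1: duplex MTBF = mu / (2 * lambda^2), with per-unit failure rate lambda
    # and repair rate mu.
    def duplex_mtbf_hours(unit_mtbf_hrs: float, mttr_hrs: float) -> float:
        lam = 1.0 / unit_mtbf_hrs   # failure rate of one disk
        mu = 1.0 / mttr_hrs         # repair rate
        return mu / (2.0 * lam * lam)

    predicted = duplex_mtbf_hours(50_000.0, 5.0)
    print(f"predicted: {predicted:.2e} hours = {predicted / 8760:.0f} years")  # ~28,539 years

    # Measured: 35 double-disk failures in 48,000 disk-pair years of operation.
    print(f"measured:  {48_000 / 35:.0f} years")                               # ~1371 years
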
`
Table 2-2 further subdivides into constituent parts the sources of outage reported
by Tandem customers. Both in hardware and in maintenance, disks and communications
are the most probable sources of outage. In operations, procedural and configuration
problems represent the major source of outage.
In another study [Gray, 1987], a large Tandem customer was examined in detail.
The customer, a manufacturer, had 13 sites, with 10 sites in the United States, 1 in
Canada, 1 in Mexico, and 1 in England. The customer had 14 nodes composed of 54
processors and 120 disks and made 18 system-years of data available for the study,
including reports of 199 outages, yielding a 4-week MTBF. Of the 199 outages, 52 were
due to power, yielding an MTBF of 4 months for power and an average outage duration
of 1 hour; 22 were due to communication lines, yielding an average MTBF of 10 months
for communications and an average outage duration of 11 hours; 47 were due to
scheduled reorganizations of transfer files; 27 outages were due to installation of
software; and the remaining 51 outages were distributed as indicated in Table 2-3.
Table 2-3 lists the sources of outage and the probability that the outages were due to
the source. The table also gives the sources and probabilities for 99 unscheduled
outages. As we can see, power is the largest single contributor to unscheduled outage.
Figure 2-3 gives a distribution of the duration of power outages as a function of
frequency for this Tandem study. By way of comparison, Figure 2-4 depicts the cumulative
distribution of power outage durations for 350 Bell Canada locations studied
during a period of 8376 months [BNR, 1984].
`
`
TABLE 2-3. Outage data for Tandem manufacturing customer

Total Outages (n = 199)                          Unscheduled Outages (n = 99)
Source                            Probability    Source                   Probability
Power                             0.261          Power                    0.525
Reorganization of transfer files  0.236          Communication lines      0.222
Installation of software          0.136          Application software     0.101
Communication lines               0.111          Max files                0.081
Reconfiguration                   0.091          Hardware                 0.051
Application software              0.050          Corrupted files          0.020
Testing                           0.040
Max files                         0.040
Hardware                          0.025
Corrupted files                   0.010

Source: Gray, 1987.
`
During that time, 1420 outages were
recorded, producing over 3600 hours of down time. The average outage lasted 2.54
hours. The graph in Figure 2-4 represents the percentage of outages that exceeded
the duration marked on the horizontal axis. For example, about 24 percent of the
outages exceeded 3 hours. Thus, if batteries that produce 3 hours of reserve power
had been available at all locations, an average of 24 percent of the locations would have
had insufficient power to maintain operation.
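
The battery-sizing argument amounts to reading the cumulative curve at the reserve time; the sketch below illustrates the idea with made-up outage durations (the BNR data set itself is not reproduced here):

    # Fraction of outages that outlast a given battery reserve (illustrative data only).
    outage_hours = [0.2, 0.5, 0.5, 1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 6.0, 8.0, 12.0]

    def fraction_exceeding(durations, reserve_hrs):
        return sum(d > reserve_hrs for d in durations) / len(durations)

    for reserve in (1.0, 3.0, 6.0):
        frac = fraction_exceeding(outage_hours, reserve)
        print(f"{reserve:.0f}-hour batteries are insufficient for {frac:.0%} of outages")
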
By way of contrast, 50 percent of the Tandem power outages exceeded 3 hours.
Figure 2-5 depicts the number of Tandem customers reporting the specified number
of outages during the two-year study. Over three-fourths of the customers had no
outage in the two-year period. However, some customers had a large number of
problems that seemed to be independent of the size of the system. All large customers
had some outages; hence, a customer with 140 CPUs and 20 systems should expect
approximately one outage per year.
`
`45
`
`40
`
`35
`
`30
`
`j
`25ei
`
`20
`
`1tl
`
`5
`
`bz
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ]0
`Duration or outage
`
`FIGURE 2-3 Duration of power outages for Tandem manufacturing customer [From Gray, 1987J
`
`
`
`30
`
`I. THE THEORY OF RELIABLE SYSTEM DESIGN
`
`70
`
`FIGURE 2-4
`Cumulative distri-
`bution of power
`outage durations
`for Bell Canada
`Study [From BNR,
`1984J
`
`FIGURE 2-5
`Number of Tan-
`dem customers re-
`porting n outages
`in two-year study
`[From Gray, 1987J
`
`ci
`
`"1:l
`III
`
`t
`Q..c
`<I> 60
`c
`..
`.2
`i;j
`= 50
`-="1:le 40
`
`~
`i;j
`
`-= 30
`fn
`J!=Q 20
`...Q
`fo
`J!c 10
`III
`IIa.
`
`~I
`
`Outage duration (hrs)
`
In summary, systems fail for many reasons, including hardware failure, incorrect
design of hardware or software, improper operation or maintenance, and unstable
environments. The probability of error is distributed over this entire spectrum, with no
single cause dominating. Hardware is rarely responsible for failures, and the failure
frequency can often be measured in years rather than weeks or hours.
`
`
`
`
`8
`
`HIGH-AVAILABILITY
`SYSTEMS
`
`INTRODUCTION
`
Dynamic redundancy is the basic approach used in high-availability systems. These
systems are typically composed of multiple processors with extensive error-detection
mechanisms. When an error is detected, the computation is resumed on another
processor. The evolution of high-availability systems is traced through the family history
of three commercial vendors: AT&T, Tandem, and Stratus.
`
AT&T SWITCHING SYSTEMS

AT&T pioneered fault-tolerant computing in the telephone switching application.
The two AT&T case studies given in this chapter trace the variations of duplication and
matching devised for the switching systems to detect failures and to automatically
resume computations. The primary form of detection is hardware lock-step duplication
and comparison, which requires about 2.5 times the hardware cost of a nonredundant
system. Thousands of switching systems have been installed, and they are currently
commercially available in the form of the 3B20 processor. Table 8-1 summarizes the
evolution of the AT&T switching systems. It includes system characteristics such as the
number of telephone lines accommodated as well as the processor model used to
control the switching gear.
Telephone switching systems utilize natural redundancy in the network and its
operation to meet an aggressive availability goal of 2 hours downtime in 40 years (3
minutes per year). Telephone users will redial if they get a wrong number or are
disconnected. However, there is a user aggravation level that must be avoided: users
will redial as long as errors do not happen too frequently. User aggravation thresholds
are different for failure to establish a call (moderately high) and disconnection of an
established call (very low). Thus, a telephone switching system follows a staged failure
recovery process, as shown in Table 8-2.
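
The downtime goal is easy to restate as an availability figure; the following arithmetic sketch (not from the text) shows the conversion:

    # Availability implied by "2 hours of downtime in 40 years."
    downtime_minutes_per_year = 2 * 60 / 40
    print(f"{downtime_minutes_per_year:.0f} minutes of downtime per year")   # 3

    minutes_per_year = 365.25 * 24 * 60
    availability = 1.0 - downtime_minutes_per_year / minutes_per_year
    print(f"availability goal: {availability:.7f}")                          # ~0.9999943
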
Figure 8-1 illustrates that the telephone switching application requires quite a
different organization than that of a general-purpose computer. In particular, a substantial
portion of the telephone switching system complexity is in the peripheral
hardware. As depicted in Figure 8-1, the telephone switching system is composed of
four major components: the transmission interface, the network, the signal processors,
and the central controller. Telephone lines carrying analog signals attach to the voice
band interface frame (VIF), which samples and digitally encodes the analog signals.
The output is pulse code modulated (PCM). The echo suppressor terminal (EST)
removes echoes that may have been introduced on long-distance trunk lines.
`
`
`
`
`
TABLE 8-1. Summary of installed AT&T telephone switching systems

System   Number of Lines   Year Introduced   Number Installed   Processor   Comments
1 ESS    5,000-65,000      1965              1,000              No. 1       First processor with separate control and data memories
2 ESS    1,000-10,000      1969              500                No. 2
1A ESS   100,000           1976              2,000              No. 1A      Four to eight times faster than No. 1
2B ESS   1,000-20,000      1975              >500               No. 3A      Combined control and data store; microcoded; emulates No. 2
3 ESS    500-5,000         1976              >500               No. 3A
5 ESS    1,000-65,000      1982              >1,000             No. 3B      Multipurpose processor
`
TABLE 8-2. Levels of recovery in a telephone switching system

Phase   Recovery Action                                             Effect
1       Initialize specific transient memory.                       Temporary storage affected; no calls lost.
2       Reconfigure peripheral hardware. Initialize all             Lose calls being established; calls in progress not lost.
        transient memory.
3       Verify memory operation, establish a workable processor     Lose calls being established; calls in progress not affected.
        configuration, verify program, configure peripheral
        hardware, initialize all transient memory.
4       Establish a workable processor configuration, configure     All calls lost.
        peripheral hardware, initialize all memory.
`
The PCM signals are multiplexed onto a time-slotted digital bus. The digital bus enters a time-space-time
network. The time slot interchange (TSI) switches PCM signals to different
time slots on the bus. The output of the TSI goes to the time multiplexed switch (TMS),
which switches the PCM signals in a particular time slot from any bus to any other
bus. The output of the TMS returns to the TSI, where the PCM signals may be
interchanged to another time slot. Signals intended for analog lines are converted
from PCM to analog signals in the VIF. A network clock coordinates the timing for all
of the switching functions.

FIGURE 8-1. Diagram of a typical telephone switching system

The signal processors provide scanning and signal distribution functions, thus
relieving the central processor of these activities. The common channel interoffice
signaling (CCIS) provides an independent data link between telephone switching systems.
The CCIS terminal is used to send supervisory switching information for the
various trunk lines coming into the office.
`
`
The entire peripheral hardware is interfaced
to the central control (CC) over AC-coupled buses. A telephone switching processor
is composed of the central control, which manipulates data associated with call processing,
administrative tasks, and recovery; the program store; the call store, for storing transient
information related to the processing of telephone calls; the file store, a disk system
used to store backup program copies; auxiliary units, magnetic tape storage containing
basic restart programs and new software releases; input/output (I/O) interfaces to
terminal devices; and the master control console, used as the control and display console
for the system. In general, a telephone switching processor could be used to control
more than one type of telephone switching system.
The history of AT&T processors is summarized in Table 8-3. Even though all the
processors are based upon full duplication, it is interesting to observe the evolution
from the tightly lock-stepped matching of every machine cycle in the early processors
to a higher dependence on self-checking and matching only on writes to memory.
Furthermore, as the processors evolved from dedicated, real-time controllers to
multiple-purpose processors, the operating system and software not only became more
sophisticated but also became a dominant portion of the system design and maintenance
effort.
`
`
`
TABLE 8-3. Summary of AT&T telephone switching processors

Processor/Year Introduced   Complexity (Gates)   Unit of Switching
No. 1, 1965                 12,000               PS, CS, CC, buses
  Matching: Six internal nodes, 24 bits per node; one node matched each machine cycle; node selected to be matched dependent on instruction being executed.
  Other error detection/correction: Hamming code on PS; parity on CS; automatic retry on CS, PS; watch-dog timer; sanity program to determine if reorganization led to a valid configuration.

No. 2, 1969                 5,000                Entire computer
  Matching: Single match point on call store input.
  Other error detection/correction: Diagnostic programs; parity on PS; detection of multiword accesses in CS; watch-dog timer.

No. 1A, 1976                50,000               PS, CS, CC, buses
  Matching: 16 internal nodes, 24 bits per node; two nodes matched each machine cycle.
  Other error detection/correction: Two-parity bits on PS; roving spares (i.e., contents of PS not completely duplicated, can be loaded from disk upon error detection); two-parity bits on CS; roving spares sufficient for complete duplication of transient data; processor configuration circuit to search automatically for a valid configuration.

No. 3A, 1975                16,500               Entire computer
  Matching: None.
  Other error detection/correction: On-line processor writes into both stores; m-of-2m code on microstore plus parity; self-checking decoders; two-parity bits on registers; duplication of ALU; watch-dog timer; maintenance channel for observability and controllability of the other processor; 25% of logic devoted to self-checking logic and 14% to maintenance access.

3B20D, 1981                 75,000               Entire computer
  Matching: None.
  Other error detection/correction: On-line processor writes into both stores; byte parity on data paths; parity checking where parity preserved, duplication otherwise; modified Hamming code on main memory; maintenance channel for observability and controllability of the other processor; 30% of control logic devoted to self-checking; error-correction codes on disks; software audits, sanity timer, integrity monitor.
`
`
`
`528
`
`II. THE PRACTICE OF RELIABLE SYSTEM DESIGN
`
`
Part I of the AT&T case study in this chapter, by Wing Toy, sketches the
evolution of the telephone switching system processors and focuses on the latest
member of the family, the 3B20D. Part II of the case study, by Liane C. Toy, outlines
the procedure used in the 5ESS for updating hardware and/or software without incurring
any downtime.
`
TANDEM COMPUTERS, INC.

Over a decade after the first AT&T computer-controlled switching system was installed,
Tandem designed a high-availability system targeted for the on-line transaction processing
(OLTP) market. Replication of processors, memories, and disks was used not
only to tolerate failures, but also to provide modular expansion of computing resources.
Tandem was concerned about the propagation of errors, and thus developed
a loosely coupled multiple-computer architecture. While one computer acts as primary,
the backup computer is active only to receive periodic checkpoint information. Hence,
1.3 physical computers are required to behave as one logical fault-tolerant computer.
Disks, of course, have to be fully replicated to provide a complete backup copy of the
database. This approach places a heavy burden upon the system and user software
developers to guarantee correct operation no matter when or where a failure occurs.
In particular, the primary memory state of a computation may not be available due to
the failure of the processors. Some feel, however, that the multiple computer structure
is superior to a lock-step duplication approach in tolerating design errors.
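
A toy sketch of the primary/backup checkpointing idea described above follows; it is hypothetical and greatly simplified (Tandem's actual mechanism is described in the case study), but it shows why state produced since the last checkpoint is at risk when the primary fails:

    # Hypothetical primary/backup checkpointing sketch (not Tandem's implementation).
    class Backup:
        def __init__(self):
            self.checkpoint = None            # last state received from the primary

        def receive_checkpoint(self, state):
            self.checkpoint = dict(state)     # backup is passive: it only records state

    class Primary:
        def __init__(self, backup):
            self.state = {"txn_count": 0}
            self.backup = backup

        def process(self, n_transactions, checkpoint_every=10):
            for i in range(1, n_transactions + 1):
                self.state["txn_count"] += 1
                if i % checkpoint_every == 0:     # periodic, not per-operation
                    self.backup.receive_checkpoint(self.state)

    backup = Backup()
    primary = Primary(backup)
    primary.process(25)
    # If the primary fails here, the backup resumes from the last checkpoint (20 txns),
    # not from the primary's in-memory state (25 txns).
    print(backup.checkpoint)   # {'txn_count': 20}
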
The architecture discussed in the Tandem case study, by Bartlett, Bartlett, Garcia,
Gray, Horst, Jardine, Jewett, Lenoski, and McGuire, is the first commercially available,
modularly expandable system designed specifically for high availability. Design objectives
for the system includ