
RELIABLE COMPUTER SYSTEMS
DESIGN AND EVALUATION

THIRD EDITION

Daniel P. Siewiorek
Carnegie Mellon University
Pittsburgh, Pennsylvania

Robert S. Swarz
Worcester Polytechnic Institute
Worcester, Massachusetts

A K Peters
Natick, Massachusetts
Editorial, Sales, and Customer Service Office

A K Peters, Ltd.
63 South Avenue
Natick, MA 01760

Copyright © 1998 by A K Peters, Ltd.

All rights reserved. No part of the material protected by this copyright notice
may be reproduced or utilized in any form, electronic or mechanical, including
photocopying, recording, or by any information storage and retrieval system,
without written permission from the copyright owner.

Trademark products mentioned in the book are listed on page 890.

Library of Congress Cataloging-in-Publication Data

Siewiorek, Daniel P.
    Reliable computer systems : design and evaluation / Daniel P.
  Siewiorek, Robert S. Swarz. - 3rd ed.
      p. cm.
    First ed. published under title: The theory and practice of
  reliable system design.
    Includes bibliographical references and index.
    ISBN 1-56881-092-X
    1. Electronic digital computers - Reliability. 2. Fault-tolerant
  computing. I. Swarz, Robert S. II. Siewiorek, Daniel P. Theory
  and practice of reliable system design. III. Title.
  QA76.5.S537 1998
  004-dc21                                                        98-202237
                                                                        CIP

Printed in the United States of America
02 01 00 99 98                                        10 9 8 7 6 5 4 3 2 1

CREDITS

Figure 1-3: Eugene Foley, "The Effects of Microelectronics Revolution on Systems and
Board Test," Computers, Vol. 12, No. 10 (October 1979). Copyright © 1979 IEEE.
Reprinted by permission.

Figure 1-6: S. Russell Craig, "Incoming Inspection and Test Programs," Electronics Test
(October 1980). Reprinted by permission.

Credits are continued on pages 885-890, which are considered a
continuation of the copyright page.

2
FAULTS AND THEIR MANIFESTATIONS
Designing a reliable system requires finding a way to prevent errors caused by the
logical faults arising from physical failures. Figure 2-1 depicts the possible sources of
such errors and service failures. Service can be viewed from the hierarchical physical
levels within the system, such as service delivered by a chip or by the designer, or
service may be viewed from the system level by the user. In either case, the following
terms [Laprie, 1985; Avizienis, 1982] are used:

• Failure occurs when the delivered service deviates from the specified service;
  failures are caused by errors.
• Error is the manifestation of a fault within a program or data structure; errors can
  occur some distance from the fault sites.
• Fault is an incorrect state of hardware or software resulting from failures of
  components, physical interference from the environment, operator error, or
  incorrect design.
• Permanent describes a failure or fault that is continuous and stable; in hardware,
  permanent failures reflect an irreversible physical change. (The word hard is used
  interchangeably with permanent.)
• Intermittent describes a fault that is only occasionally present due to unstable
  hardware or varying hardware or software states (for example, as a function of
  load or activity).
• Transient describes a fault resulting from temporary environmental conditions.
  (The word soft is used interchangeably with transient.)
A permanent fault can be caused by a physical defect or an inadequacy in the
design of the system. Intermittent faults can be caused by unstable or marginally stable
hardware or an inadequacy in design. Environmental conditions as well as some design
errors can lead to transient faults. All these faults can cause errors. Incorrect designs
and operator mistakes can lead directly to errors.
The distinction between intermittent and transient faults is not always made in the
literature [Kamal, 1975; Tasar and Tasar, 1977]. The dividing line is the applicability of
repair [Breuer, 1973; Kamal and Page, 1974; Losq, 1978; Savir, 1978]. Intermittent faults
resulting from physical conditions of the hardware, incorrect hardware or software
design, or even from unstable but repeated environmental conditions are potentially
detectable and repairable by replacement or redesign; faults due to temporary
environmental conditions, however, are incapable of repair because the hardware is
physically undamaged. It is this attribute of transient faults that magnifies their importance.
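This three-way taxonomy and the repair distinction it implies can be summarized in a
small data model. The Python sketch below is our own illustration (the names
FaultClass and REPAIRABLE_BY_REPLACEMENT are not from the text), not a prescribed
representation.

    from enum import Enum

    class FaultClass(Enum):
        PERMANENT = "permanent"        # continuous and stable ("hard")
        INTERMITTENT = "intermittent"  # occasionally present; unstable hardware or state
        TRANSIENT = "transient"        # temporary environmental cause ("soft")

    # Repair applicability as discussed above: intermittent (and permanent) faults
    # are potentially repairable by replacement or redesign; transient faults are
    # not, because the hardware is physically undamaged.
    REPAIRABLE_BY_REPLACEMENT = {
        FaultClass.PERMANENT: True,
        FaultClass.INTERMITTENT: True,
        FaultClass.TRANSIENT: False,
    }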
FIGURE 2-1  Sources of errors and service failures
Even in the absence of all physical defects, including those manifested as intermittent
faults, errors will still occur.
Avizienis [1985] presented classes of faults based on their origin. These fault classes
are similar to the classification of faults presented by Toy [1978] for the Bell Electronic
Switching Systems. Two basic fault classes and their descriptions are as follows:

• Physical faults: These stem from physical phenomena internal to the system, such
  as threshold changes, shorts, opens, etc., or external changes, such as
  environmental, electromagnetic, vibration, etc.
• Human faults: These may be either design faults, which are committed during
  system design, modification, or establishment of operating procedures, or they
  may be interaction faults, which are violations of operating or maintenance
  procedures.

According to this classification, physical faults can be introduced or occur either
during the manufacturing stage or during operational life. Physical faults during the
useful operational life of the system are caused by physical processes that occur
through normal and abnormal use. Design faults are caused by improper translation
of an idea or concept into an operational realization. Interaction faults are caused by
ambiguous documentation or human inattention to detail.
What are the sources of errors? What is the relative frequency of errors? How do
faults manifest themselves as errors? Do the arrival times of faults (or errors) fit a
probability distribution? If so, what are the parameters of that distribution? This chapter
attempts to answer these questions and introduces a variety of fault models for the
design and evaluation of fault-tolerant systems. The following section discusses the
origin and frequency of errors by type of fault causing the errors. The remainder of
the chapter focuses on fault manifestations within systems according to the hierarchical
physical levels given in Chapter 1. It also presents the mathematical distributions that
describe the probability of fault occurrence.

SYSTEM ERRORS

Origin of Errors by Type of Fault
Transient and intermittent faults have been seen as a major source of errors in systems
in several studies. For example, an early study for the U.S. Air Force [Roth et al., 1967a]
showed that 80 percent of the electronic failures in computers are due to intermittent
faults. Another study by IBM [Ball and Hardie, 1967] indicated that "intermittents
comprised over 90% of field failures." Transient faults, which have been observed in
microprocessor chips [Brodsky, 1980], will become a more frequent problem in the
future with shrinking device dimensions, lower energy levels for indicating logical
values, and higher-speed operation.*
Table 2-1 gives the ratios of measured mean time between errors (MTBE) (due to
all three fault types) to mean time to failure (MTTF) (due to permanent faults) for
several systems. The last row of this table is the estimate of permanent and transient
failure rates for a 1-megaword, 37-bit memory composed of 4K MOS RAMs [Geilhufe,
1979; Ohm, 1979]. In this case, transient errors are caused by α-particles emitted by
the decay of trace radioactive particles in the semiconductor packaging materials. As
they pass through the semiconductor material, α-particles create sufficient hole-electron
pairs to add charge to or remove charge from bit cells. By exposing MOS RAMs to
artificial α-particle sources, the operational life error rate can be determined as a
function of RAM density (Figure 2-2), voltage, and cycle time.
TABLE 2-1  Ratios of all errors to permanent errors

                                           System MTBE      System MTTF
                       Error Detection     for All Fault    for Permanent
System/Technology      Mechanism           Types (hrs)      Faults (hrs)      MTBE/MTTF

CMUA PDP-10, ECL       Parity              44               800-1600          0.03-0.06
CM* LSI-11, NMOS       Diagnostics         128              4200              0.03
C.vmp TMR LSI-11       Crash               97-328           4900              0.02-0.07
Telettra, TTL          Mismatch            80-170           1300              0.06-0.13
SUN-2, TTL, MOS        Crash               689              6552              0.11
1M x 37 RAM, MOS       Parity              106              1450              0.07

Source: Data from Siewiorek et al., 1978a; Morganti, 1978; McConnel, Siewiorek, and Tsao, 1979;
Geilhufe, 1979; Ohm, 1979; Lin and Siewiorek, 1990.
* The same semiconductor evolution that has led to increased reliability per gate or bit has also introduced
new failure modes. The smaller dimensions of semiconductor devices have decreased the amount of energy
required to change the state of a memory bit. The loss of memory information caused by the decay of radioactive
trace elements in packaging material has been documented. Studies show that even in sheltered environments
such as well-conditioned computer rooms, transient/intermittent errors are 20 to 50 times more prevalent than
hard failures. Transient/intermittent errors also exhibit clustering (a high probability that, once one error has
occurred, another will occur soon), workload dependence (the heavier the system workload, the more likely
an error), and common failure modes (more than one system, or portion of a system, affected simultaneously).
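The MTBE/MTTF ratios in the last column of Table 2-1 are simple quotients of the
measured values. The short Python sketch below is our own illustration (the data are
transcribed from the table; the variable names are ours) and reproduces that column,
pairing range endpoints conservatively.

    # Reproduce the MTBE/MTTF column of Table 2-1 from the measured values.
    # Ranges are represented as (low, high) tuples; point values as single numbers.
    systems = {
        "CMUA PDP-10, ECL":  {"mtbe": 44,        "mttf": (800, 1600)},
        "CM* LSI-11, NMOS":  {"mtbe": 128,       "mttf": 4200},
        "C.vmp TMR LSI-11":  {"mtbe": (97, 328), "mttf": 4900},
        "Telettra, TTL":     {"mtbe": (80, 170), "mttf": 1300},
        "SUN-2, TTL, MOS":   {"mtbe": 689,       "mttf": 6552},
        "1M x 37 RAM, MOS":  {"mtbe": 106,       "mttf": 1450},
    }

    def as_range(x):
        return x if isinstance(x, tuple) else (x, x)

    for name, v in systems.items():
        mtbe_lo, mtbe_hi = as_range(v["mtbe"])
        mttf_lo, mttf_hi = as_range(v["mttf"])
        # The smallest ratio pairs the low MTBE with the high MTTF, and vice versa.
        lo, hi = mtbe_lo / mttf_hi, mtbe_hi / mttf_lo
        print(f"{name:20s} MTBE/MTTF = {lo:.2f}-{hi:.2f}")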
`

`
`2. FAULTS AND THEIR MANIFESTATIONS
`
`25
`
`I
`+
`
`.1
`
`T
`
`10.0
`
`1.0
`
`0.1
`
`0.01
`
`FIGURE 2-2
`Operational life er-
`ror rate as function
`of RAM densities
`[From Geilhufe,
`1979; ©1979 IEEEl
`
`-:i
`
`0
`~
`di
`IJ
`QJ ';:
`_QIru"~~
`:: ....
`0,....
`....
`'"~
`-Q
`
`~'
`
`i;j
`
`Co
`
`0.001 L-
`1
`
`~E..._
`
`___:L_
`4K
`
`=_
`16K
`
`_ .L_
`
`Memory size in bits
`
`to artificial a-particle sources, the operational 'life error rate can be determined as a
`function of RAM density (Figure 2-2), voltage, and cycle time.
The Sun-2 data shown in Table 2-1 is derived from a study that observed 13 SUN-2
workstations on the Carnegie Mellon University Andrew network [Lin and Siewiorek,
1990]. Each workstation was composed of a Motorola 68010 processor with a 10-MHz
clock and up to 4 Fujitsu Eagle 470-Mbyte disk drives. These workstations were used
as a distributed file server system. Sampling techniques indicated that their workload
was essentially constant.
Over 21 workstation-years of data from the Sun-2 file server system was accumulated;
it is summarized as follows:
Source of Error        Number of Occurrences    Mean Time to Occurrence (hrs)

Permanent fault                 29                           6552
Intermittent fault             610                             58
Transient fault                446                            354
System crash                   298                            689
Permanent faults were determined through interviews with field service personnel
and analysis of system maintenance logs. The most severe manifestation of an error is
a system crash wherein the system software has to be reloaded and restarted. System
crashes were determined from an on-line, distributed diagnostics system. All errors in
the system log prior to a permanent fault were examined and allocated to either
transient or intermittent.* If an error occurred indicating the same physical device
within less than a week of the previous error, the error was deemed to be caused by
an intermittent fault. Since the mean time to a transient fault was over two weeks, it
was felt that there was little probability of inadvertently mixing intermittent and
transient faults.

* It was assumed that a permanent failure resulted in only one system crash, after which the system was totally
inoperable.
`

`
`26
`
`I. THE THEORY OF RELIABLE SYSTEM DESIGN
`
`•
`
`within less than a week of the previous error, the error was deemed to be caused by
`an intermittent fault. Since the mean time to a transient fault was over two weeks, it
`was felt that there was little probability of inadvertently mixing intermittent and tran(cid:173)
`sient faults.
A number of interesting observations can be made from the Sun-2 data. The
permanent faults represented approximately 10 percent of the system crashes. Thus,
if permanent faults were eliminated, the system crash rate would only be improved by
about 10 percent. The ratio of intermittent faults to permanent faults is about 20. Thus,
the first intermittent symptom appears over 1200 hours (20 x 58 hours between
intermittent occurrences) prior to the repair activity. This data represents a large window
of vulnerability and provides ample opportunity for trend analysis to help isolate the
source of failure (for example, see Symptom-Directed Diagnosis in Chapter 4). If we
subtract the number of permanent faults from the system crashes and divide by the
total number of faults, we see that on average only 1 out of every 4 faults causes a
system crash.*

* Interestingly enough, 14 of the 29 permanent faults had 3 or fewer error log entries, and of those 14, 8 had
no error log entries prior to repair, all of which indicates that there is substantial room for improvement in
error-detecting mechanisms.
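These observations follow directly from the counts and mean times tabulated above.
The Python sketch below is our own illustration of the arithmetic, using the figures
quoted in the text.

    # Counts and mean times from the Sun-2 study summarized above.
    permanent, intermittent, transient, crashes = 29, 610, 446, 298
    mean_hours_between_intermittents = 58

    # Roughly 10 percent of system crashes are attributable to permanent faults.
    print(f"permanent/crashes = {permanent / crashes:.2f}")          # ~0.10

    # The intermittent-to-permanent ratio is about 20, so the first intermittent
    # symptom precedes the repair action by roughly 20 * 58 hours.
    ratio = intermittent / permanent
    print(f"intermittent/permanent = {ratio:.0f}")                    # ~21, i.e., about 20
    print(f"window before repair   = {ratio * mean_hours_between_intermittents:.0f} hrs")

    # Excluding the crashes tied to permanent faults, about 1 in 4 faults
    # leads to a system crash.
    faults = permanent + intermittent + transient
    print(f"crash-causing fraction = {(crashes - permanent) / faults:.2f}")  # ~0.25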
In summary, since the only information available on errors is collected by relevant
error-detecting mechanisms, it is very difficult in practice to detect all errors or
determine their exact source. Very few studies have systematically attempted to
determine the cause of errors. The existing evidence is that the vast majority are due
to nonpermanent (i.e., intermittent or transient) faults.
Frequency of Errors by Type of Fault

The ultimate goal of any fault-tolerant system is not only to produce few errors but
also to suffer as little downtime as possible. Hardware modules are constantly improving
in reliability and soon will exceed 100,000 hours mean time between failures per
module. However, transient and intermittent faults are 20-100 times more prevalent
than hard failures, and these faults can lead to a system outage. Even after vendor and
user quality assurance, thousands of bugs remain in system software. Most software
failures are transient, leading to a dump and restart of the system.
For example, typical sources of failure for a mainframe computing system, as well
as the typical mean time between failures (MTBF) and mean time to repair (MTTR), are
given in the following table [Gray, 1990]:
Source of Failure        MTBF                         MTTR

Power                    2000 hrs                     1 hr
Phone lines
  Soft                   0.1 hr                       0.1 sec
  Hard                   4000 hrs                     10 hrs
Hardware modules         10,000 hrs                   10 hrs
Software                 1 bug/1,000 lines of code
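For the sources above that have both an MTBF and an MTTR, steady-state availability
and expected downtime per year follow from the standard ratio MTBF/(MTBF + MTTR).
The sketch below is our own illustration using the table's figures (variable names are
ours), not a calculation from the text.

    # Steady-state availability and expected downtime per year for the failure
    # sources above that list both an MTBF and an MTTR (all figures in hours).
    HOURS_PER_YEAR = 8760.0

    sources = {
        "Power":             (2000.0,   1.0),
        "Phone lines, soft": (0.1,      0.1 / 3600.0),   # 0.1 sec expressed in hours
        "Phone lines, hard": (4000.0,   10.0),
        "Hardware modules":  (10000.0,  10.0),
    }

    for name, (mtbf, mttr) in sources.items():
        availability = mtbf / (mtbf + mttr)
        downtime = HOURS_PER_YEAR * mttr / (mtbf + mttr)
        print(f"{name:18s} A = {availability:.6f}  downtime/yr = {downtime:.2f} hrs")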
Since the outage of telephone lines due to soft failures is frequent and recovery is
very quick, phone line outages tend to be scattered, impacting individual terminals
but not the complete system. However, some form of redundancy, such as providing
alternative paths, is clearly required to reduce the frequency of telephone line failures.
Among the several other studies providing data on how often computers fail is a
survey of Japanese computer users [Watanabe, 1986], which included 1383 institutions
over the period from June 1984 to July 1985. These users reported 7517 outages* with
an average MTBF of 10 weeks and an average outage duration of 90 minutes. The
sources of outage, the probability that the outage was due to the source, and the
MTBF are given in the following table [Gray (Watanabe, translator), 1986]:
                                                        MTBF
Source of Outage                       Probability      (months)

Vendor hardware, software,
  and maintenance                         0.420             5
Application software                      0.250             9
Communication lines                       0.120            18
Environment                               0.112            24
Operations                                0.093            24
In 1985 and 1987, Gray [1990] surveyed the Early Warning Reports (EWR) from
Tandem's customers' systems. From June 1985 until June 1987, 1,300 customers were
surveyed. Over 6,000 systems composed of 16,000 processors and 97,000 disks were
included. The data represented 12,000 system-years, 32,000 CPU-years, and 97,000
disk-years of experience. Reports of 490 system failures were provided by 205 customers.
The following table lists the sources of outage, the probability that an outage was due
to the source, and the MTBF for the source [Gray, 1990]:

                                             MTBF
Sources of Outage         Probability        (yrs)

Hardware                     0.29             140
Software                     0.36              63
Operations                   0.13             211
Maintenance                  0.11             200
Environment                  0.10             235

The resultant mean time between failures for a system was approximately 27 years.
Example. It is interesting to compare the theoretically predicted and actually measured
probability of a disk pair failing. As will be shown in Chapter 5, the MTBF for a duplex
system with failure rate λ and repair rate μ is given by Eq. 1:

    MTBF = μ / (2λ²)                                                            (1)

* The author did not define the meaning of outage; we may consider an outage to be a failure to provide
service.
`

`
`28
`
`I. THE THEORY OF RELIABLE SYSTEM DESIGN
`
`•
`
`TABLE 2-2
`Outage data for
`the Tandem two(cid:173)
`year study
`
`Source of Outage
`
`Probability
`
`Source of Outage
`
`Probability
`
`Hardware
`Disks
`Communications
`Processors
`Wiring
`Spares
`Maintenance
`Disks
`Communications
`Processors
`
`Source: Gray, 1987.
`
`0.49
`0.24
`0.18
`0.09
`0.01
`
`0.67
`0.20
`0.13
`
`Operations
`Procedures
`Configuration
`Move
`Overflow
`Upgrade
`
`0.42
`0.39
`0.13
`0.04
`0.01
`
For a 50,000-hour MTBF disk (failure rate = 1/50,000) with a 5-hour MTTR (repair
rate = 1/5), Eq. 1 predicts a 28,000-year MTBF:

    MTBF = (1/5) / [2(1/50,000)(1/50,000)] = 2.5 x 10⁸ hours = 28,539 years

Thirty-five double disk failures were reported in the 48,000 disk-pair years, yielding
an MTBF of 1,371 years. The actually measured reliability was a factor of 20 less than
theoretical, but it is a factor of 200 better than nonredundant disks. The MTBF of
a CPU pair was measured at 877 years.
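A minimal Python sketch of this comparison, applying the duplex expression of Eq. 1
to the figures quoted above (the function and variable names are ours):

    HOURS_PER_YEAR = 8760.0

    def duplex_mtbf(mtbf_single_hrs, mttr_hrs):
        """Eq. 1: MTBF of a duplex pair = mu / (2 * lambda^2), in hours."""
        lam = 1.0 / mtbf_single_hrs   # failure rate of one unit
        mu = 1.0 / mttr_hrs           # repair rate
        return mu / (2.0 * lam ** 2)

    # Theoretical prediction for a 50,000-hour MTBF disk with a 5-hour MTTR.
    predicted_hrs = duplex_mtbf(50_000.0, 5.0)
    print(f"predicted: {predicted_hrs:.3g} hrs = {predicted_hrs / HOURS_PER_YEAR:,.0f} years")

    # Measured: 35 double-disk failures in 48,000 disk-pair years.
    measured_years = 48_000 / 35
    print(f"measured:  {measured_years:,.0f} years")
    # Predicted is roughly a factor of 20 better than measured, as noted above.
    print(f"ratio (predicted/measured): {predicted_hrs / HOURS_PER_YEAR / measured_years:.0f}")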
Table 2-2 further subdivides into constituent parts the sources of outage reported
by Tandem customers. Both in hardware and in maintenance, disks and communications
are the most probable sources of outage. In operations, procedural and configuration
problems represent the major source of outage.
In another study [Gray, 1987], a large Tandem customer was examined in detail.
The customer, a manufacturer, had 13 sites, with 10 sites in the United States, 1 in
Canada, 1 in Mexico, and 1 in England. The customer had 14 nodes composed of 54
processors and 120 disks and made 18 system-years of data available for the study,
including reports of 199 outages, yielding a 4-week MTBF. Of the 199 outages, 52 were
due to power, yielding an MTBF of 4 months for power and an average outage duration
of 1 hour; 22 were due to communication lines, yielding an average MTBF of 10 months
for communications and an average outage duration of 11 hours; 47 were due to
scheduled reorganizations of transfer files; 27 outages were due to installation of
software; and the remaining 51 outages were distributed as indicated in Table 2-3.
Table 2-3 lists the sources of outage and the probability that the outages were due to
the source. The table also gives the sources and probabilities for 99 unscheduled
outages. As we can see, power is the largest single contributor to unscheduled outage.
Figure 2-3 gives a distribution of the duration of power outage as a function of
frequency for this Tandem study. By way of comparison, Figure 2-4 depicts the
cumulative distribution of power outage durations for 350 Bell Canada locations studied
during a period of 8376 months [BNR, 1984].

TABLE 2-3  Outage data for Tandem manufacturing customer

Total Outages (n = 199)

Source                               Probability
Power                                   0.261
Reorganization of transfer files        0.236
Installation of software                0.136
Communication lines                     0.111
Reconfiguration                         0.091
Application software                    0.050
Testing                                 0.040
Max files                               0.040
Hardware                                0.025
Corrupted files                         0.010

Unscheduled Outages (n = 99)

Source                               Probability
Power                                   0.525
Communication lines                     0.222
Application software                    0.101
Max files                               0.081
Hardware                                0.051
Corrupted files                         0.020

Source: Gray, 1987.
During that time, 1420 outages were
recorded, producing over 3600 hours of down time. The average outage lasted 2.54
hours. The graph in Figure 2-4 represents the percentage of outages that exceeded
the duration marked on the horizontal axis. For example, about 24 percent of the
outages exceeded 3 hours. Thus, if batteries that produce 3 hours of reserve power
had been available at all locations, an average of 24 percent of the locations would
have insufficient power to maintain operation.
By way of contrast, 50 percent of the Tandem power outages exceeded 3 hours.
Figure 2-5 depicts the number of Tandem customers reporting the specified number
of outages during the two-year study. Over three-fourths of the customers had no
outage in the two-year period. However, some customers had a large number of
problems that seemed to be independent of the size of the system. All large customers
had some outages; hence, a customer with 140 CPUs and 20 systems should expect
approximately one outage per year.
FIGURE 2-3  Duration of power outages for Tandem manufacturing customer [From Gray, 1987]

FIGURE 2-4  Cumulative distribution of power outage durations for Bell Canada study [From BNR, 1984]

FIGURE 2-5  Number of Tandem customers reporting n outages in two-year study [From Gray, 1987]
In summary, systems fail for many reasons, including hardware failure, incorrect
design of hardware or software, improper operation or maintenance, and unstable
environments. The probability of error is distributed over this entire spectrum with no
single cause dominating. Hardware is rarely responsible for failures, and failure
frequency can often be measured in years rather than weeks or hours.

8
HIGH-AVAILABILITY SYSTEMS

INTRODUCTION
Dynamic redundancy is the basic approach used in high-availability systems. These
systems are typically composed of multiple processors with extensive error-detection
mechanisms. When an error is detected, the computation is resumed on another
processor. The evolution of high-availability systems is traced through the family history
of three commercial vendors: AT&T, Tandem, and Stratus.
AT&T SWITCHING SYSTEMS

AT&T pioneered fault-tolerant computing in the telephone switching application.
The two AT&T case studies given in this chapter trace the variations of duplication and
matching devised for the switching systems to detect failures and to automatically
resume computations. The primary form of detection is hardware lock-step duplication
and comparison, which requires about 2.5 times the hardware cost of a nonredundant
system. Thousands of switching systems have been installed, and they are currently
commercially available in the form of the 3B20 processor. Table 8-1 summarizes the
evolution of the AT&T switching systems. It includes system characteristics such as the
number of telephone lines accommodated as well as the processor model used to
control the switching gear.
Telephone switching systems utilize natural redundancy in the network and its
operation to meet an aggressive availability goal of 2 hours downtime in 40 years (3
minutes per year). Telephone users will redial if they get a wrong number or are
disconnected. However, there is a user aggravation level that must be avoided: users
will redial as long as errors do not happen too frequently. User aggravation thresholds
are different for failure to establish a call (moderately high) and disconnection of an
established call (very low). Thus, a telephone switching system follows a staged failure
recovery process, as shown in Table 8-2.
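The stated goal of 2 hours of downtime in 40 years corresponds to about 3 minutes
per year, or a little more than five nines of availability. The one-line check below is our
own arithmetic, not a computation from the text.

    # Availability implied by the goal of 2 hours downtime in 40 years.
    downtime_hrs, years = 2.0, 40.0
    minutes_per_year = downtime_hrs * 60.0 / years
    availability = 1.0 - downtime_hrs / (years * 8760.0)
    print(f"{minutes_per_year:.0f} min/year, availability = {availability:.7f}")
    # -> 3 min/year, availability = 0.9999943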
Figure 8-1 illustrates that the telephone switching application requires quite a
different organization than that of a general-purpose computer. In particular, a
substantial portion of the telephone switching system complexity is in the peripheral
hardware. As depicted in Figure 8-1, the telephone switching system is composed of
four major components: the transmission interface, the network, signal processors,
and the central controller. Telephone lines carrying analog signals attach to the voice
band interface frame (VIF), which samples and digitally encodes the analog signals.
The output is pulse code modulated (PCM). The echo suppressor terminal (EST)
removes echoes that may have been introduced on long-distance trunk lines.
TABLE 8-1  Summary of installed AT&T telephone switching systems

          Number          Year        Number
System    of Lines        Introduced  Installed  Processor  Comments

1 ESS     5,000-65,000    1965        1,000      No. 1      First processor with separate
                                                            control and data memories
2 ESS     1,000-10,000    1969        500        No. 2
1A ESS    100,000         1976        2,000      No. 1A     Four to eight times faster
                                                            than No. 1
2B ESS    1,000-20,000    1975        >500       No. 3A     Combined control and data
                                                            store; microcoded; emulates
                                                            No. 2
3 ESS     500-5,000       1976        >500       No. 3A
5 ESS     1,000-65,000    1982        >1,000     No. 3B     Multipurpose processor


TABLE 8-2  Levels of recovery in a telephone switching system

Phase  Recovery Action                                  Effect

1      Initialize specific transient memory.            Temporary storage affected; no
                                                        calls lost
2      Reconfigure peripheral hardware. Initialize      Lose calls being established; calls
       all transient memory.                            in progress not lost
3      Verify memory operation, establish a             Lose calls being established; calls
       workable processor configuration, verify         in progress not affected
       program, configure peripheral hardware,
       initialize all transient memory.
4      Establish a workable processor                   All calls lost
       configuration, configure peripheral
       hardware, initialize all memory.
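The staged recovery discipline of Table 8-2 amounts to escalating through progressively
more disruptive actions until the system passes its sanity checks. The Python sketch
below is a schematic illustration of that control flow under our own assumptions
(function names such as system_sane are hypothetical); it is not AT&T code.

    # Schematic escalation through the recovery phases of Table 8-2.
    # Each entry is a progressively more disruptive recovery action; system_sane()
    # stands in for the sanity checks a real switching system would run afterward.
    PHASES = [
        "initialize specific transient memory",
        "reconfigure peripheral hardware; initialize all transient memory",
        "verify memory, establish workable processor configuration, verify program, "
        "configure peripheral hardware, initialize all transient memory",
        "establish workable processor configuration, configure peripheral hardware, "
        "initialize all memory",
    ]

    def recover(system_sane, perform):
        """Escalate one phase at a time until the sanity check passes."""
        for level, action in enumerate(PHASES, start=1):
            perform(action)
            if system_sane():
                return level          # lowest phase that restored service
        raise RuntimeError("recovery failed even after full initialization")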
The PCM signals are multiplexed onto a time-slotted digital bus. The digital bus enters
a time-space-time network. The time slot interchange (TSI) switches PCM signals to
different time slots on the bus. The output of the TSI goes to the time multiplexed
switch (TMS), which switches the PCM signals in a particular time slot from any bus to
any other bus. The output of the TMS returns to the TSI, where the PCM signals may
be interchanged to another time slot. Signals intended for analog lines are converted
from PCM to analog signals in the VIF. A network clock coordinates the timing for all
of the switching functions.
The signal processors provide scanning and signal distribution functions, thus
relieving the central processor of these activities. The common channel interoffice
signaling (CCIS) provides an independent data link between telephone switching
systems. The CCIS terminal is used to send supervisory switching information for the
various trunk lines coming into the office.
FIGURE 8-1  Diagram of a typical telephone switching system
The entire peripheral hardware is interfaced
to the central control (CC) over AC-coupled buses. A telephone switching processor
is composed of the central control, which manipulates data associated with call
processing, administrative tasks, and recovery; program store; call store for storing
transient information related to the processing of telephone calls; file store disk system
used to store backup program copies; auxiliary units magnetic tape storage containing
basic restart programs and new software releases; input/output (I/O) interfaces to
terminal devices; and master control console used as the control and display console
for the system. In general, a telephone switching processor could be used to control
more than one type of telephone switching system.
The history of AT&T processors is summarized in Table 8-3. Even though all the
processors are based upon full duplication, it is interesting to observe the evolution
from the tightly lock-stepped matching of every machine cycle in the early processors
to a higher dependence on self-checking and matching only on writes to memory.
Furthermore, as the processors evolved from dedicated, real-time controllers to
multiple-purpose processors, the operating system and software not only became more
sophisticated but also became a dominant portion of the system design and
maintenance effort.
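Whether matching is done every machine cycle or only on writes to memory,
duplication-and-comparison reduces to running two copies of the same step and
comparing a selected portion of their outputs. The sketch below is our own schematic
illustration of that principle, not the circuitry of any AT&T processor.

    # Schematic duplicate-and-compare: run the same step on two replicas and
    # flag a fault on any mismatch at the chosen match points.
    class MismatchFault(Exception):
        pass

    def run_duplicated(step, state_a, state_b, inputs, match=lambda out: out):
        """Apply one step to both replicas and compare the matched portion."""
        out_a = step(state_a, inputs)
        out_b = step(state_b, inputs)
        if match(out_a) != match(out_b):
            raise MismatchFault("replicas disagree; invoke recovery/reconfiguration")
        return out_a

    # Matching every cycle corresponds to calling run_duplicated on every step;
    # matching only on memory writes corresponds to a match function that ignores
    # everything except the write-back values.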
TABLE 8-3  Summary of AT&T telephone switching processors

No. 1 (1965). Complexity: 12,000 gates. Unit of switching: PS, CS, CC, buses.
Matching: six internal nodes, 24 bits per node; one node matched each machine cycle;
node selected to be matched dependent on instruction being executed.
Other error detection/correction: Hamming code on PS; parity on CS; automatic retry
on CS, PS; watch-dog timer; sanity program to determine if reorganization led to a
valid configuration.

No. 2 (1969). Complexity: 5,000 gates. Unit of switching: entire computer.
Matching: single match point on call store input.
Other error detection/correction: diagnostic programs; parity on PS; detection of
multiword accesses in CS; watch-dog timer.

No. 1A (1976). Complexity: 50,000 gates. Unit of switching: PS, CS, CC, buses.
Matching: 16 internal nodes, 24 bits per node; two nodes matched each machine cycle.
Other error detection/correction: two parity bits on PS; roving spares (i.e., contents of
PS not completely duplicated, can be loaded from disk upon error detection); two
parity bits on CS; roving spares sufficient for complete duplication of transient data;
processor configuration circuit to search automatically for a valid configuration.

No. 3A (1975). Complexity: 16,500 gates. Unit of switching: entire computer.
Matching: none.
Other error detection/correction: on-line processor writes into both stores; m-of-2m
code on microstore plus parity; self-checking decoders; two parity bits on registers;
duplication of ALU; watch-dog timer; maintenance channel for observability and
controllability of the other processor; 25% of logic devoted to self-checking logic and
14% to maintenance access.

3B20D (1981). Complexity: 75,000 gates. Unit of switching: entire computer.
Matching: none.
Other error detection/correction: on-line processor writes into both stores; byte parity
on data paths; parity checking where parity preserved, duplication otherwise; modified
Hamming code on main memory; maintenance channel for observability and
controllability of the other processor; 30% of control logic devoted to self-checking;
error-correction codes on disks; software audits, sanity timer, integrity monitor.

Part I of the AT&T case study in this chapter, by Wing Toy, sketches the
evolution of the telephone switching system processors and focuses on the latest
member of the family, the 3B20D. Part II of the case study, by Liane C. Toy, outlines
the procedure used in the 5ESS for updating hardware and/or software without
incurring any downtime.

TANDEM COMPUTERS, INC.
Over a decade after the first AT&T computer-controlled switching system was installed,
Tandem designed a high-availability system targeted for the on-line transaction
processing (OLTP) market. Replication of processors, memories, and disks was used not
only to tolerate failures, but also to provide modular expansion of computing resources.
Tandem was concerned about the propagation of errors, and thus developed a loosely
coupled multiple computer architecture. While one computer acts as primary, the
backup computer is active only to receive periodic checkpoint information. Hence,
1.3 physical computers are required to behave as one logical fault-tolerant computer.
Disks, of course, have to be fully replicated to provide a complete backup copy of the
database. This approach places a heavy burden upon the system and user software
developers to guarantee correct operation no matter when or where a failure occurs.
In particular, the primary memory state of a computation may not be available due to
the failure of the processors. Some feel, however, that the multiple computer structure
is superior to a lock-step duplication approach in tolerating design errors.
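The process-pair idea described here, in which a passive backup merely absorbs
periodic checkpoints until the primary fails, can be shown schematically. The Python
sketch below is our own simplified illustration (class and method names are ours), not
Tandem's actual mechanism.

    # Schematic primary/backup pair: the primary executes requests and sends
    # periodic checkpoints; the backup stays passive, holding the last checkpoint,
    # and takes over from that state if the primary fails.
    class ProcessPair:
        def __init__(self):
            self.primary_state = {}
            self.backup_state = {}     # lags the primary by at most one checkpoint
            self.primary_alive = True

        def execute(self, key, value):
            if not self.primary_alive:
                self.failover()
            self.primary_state[key] = value

        def checkpoint(self):
            # In a real system this is a message to the backup, not a local copy.
            self.backup_state = dict(self.primary_state)

        def failover(self):
            # Work done since the last checkpoint is lost; the software must be
            # prepared to resume the computation from that earlier state.
            self.primary_state = dict(self.backup_state)
            self.primary_alive = True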
The architecture discussed in the Tandem case study, by Bartlett, Bartlett, Garcia,
Gray, Horst, Jardine, Jewett, Lenoski, and McGuire, is the first commercially available,
modularly expandable system designed specifically for high availability. Design
objectives for the system include
