`
`Lisa Spainhower, Jack Isenberg, Ram Chillarege, Joseph Berding
`International Business Machines Corporation
`South Road, Poughkeepsie, NY 12602
`
`Abstract
`
`The ES/9000 Model 900 is IBM’s high-end fault-
`tolerant commercial processor. Although high-end
`commercial processors were traditionally designed to
`be very reliable, this is the first one that implements
`a fault-tolerant machine. The design combines circuit
`level concurrent error detection, fault identification and
`reconfiguration with system level techniques wherever
`multiple functional resources are available. It provides true
`graceful degradation during Central Processor or Channel
`reconfiguration and repair. This paper:
`
`• Discusses the design point for this processor and
`the trade-offs involved.
`
`• Shows the error detection and on-line repair process
`of a Central Processor with the work recovered
`on an alternate Central Processor, transparent to
`the application.
`
`• Describes Dynamic Path Selection and the
`hot-pluggable channels.
`
`• Illustrates the fault-tolerance techniques used in
`the Level 1 Cache and the Central Store.
`
`1 Introduction
`
`This paper presents the design for fault-tolerance in
`the ES/9000 Model 900 high-end commercial processor.
`The Model 900 is a 6-way tightly coupled multiprocessor
`with a two-level cache, expanded storage and
`fiber optic channels. Compared to its predecessor, the
`3090 Model 600J, it provides about twice the processing
`power and the following reductions in components:
`a 30% reduction in the number of chips, a 40% reduction
`in the number of Thermal Conduction Modules (TCMs
`are the second level multichip packaging vehicle), and a
`
`IBM may have patents or pending patent applications covering
`subject matter described herein. Licenses under IBM’s
`utility patents are available on reasonable and nondiscriminatory
`terms and conditions. Inquiries relative to licensing should
`be directed, in writing, to: IBM Corporation, Director of Contracts
`and Licensing, Armonk, NY 10504. Trademarks: ES/9000,
`ES/3090, ESCON, SYSTEM/390, System/370, 3090, MVS/XA,
`MVS/ESA, S/390.
`
`0731-3071/92 $03.00 © 1992 IEEE
`
`
`40% reduction in the number of signal cables. The machine
`also makes significant advances in processor
`design, implementing out-of-sequence execution, multiple
`execution elements, etc. This paper focuses only
`on the design of the fault-tolerant aspects of the machine;
`details of the processor organization can
`be found in [Liptay92].
`The design of a fault-tolerant machine in this market
`segment involves a complex set of trade-offs. These
`trade-offs involve cost, performance, packaging, maintenance
`strategy, operating system support and customer
`requirements. Thus, the choices of techniques to
`achieve fault-tolerance have to be weighed both from
`a top-down and a bottom-up view without losing sight
`of the final application program or customer
`view. Since the design team does not start with a clean
`slate, there is a significant evolutionary process from the
`earlier generation of machines. The earlier generation
`of machines has proven to be very reliable, using
`extensive error checking and recovery techniques in the
`processor. Furthermore, the packaging, screening and
`burn-in of components provide the additional leverage of
`a very large mean time between failures. Turning a high-end
`machine of this class into a fault-tolerant machine
`requires a very careful selection of the design point.
`Clearly, the challenge in this market segment is in designing
`a fault-tolerant machine that continues to be
`performance driven and cost competitive.
`This paper first discusses the Design Point. The
`standard circuit design process in the high end enforces
`the principles of concurrent error detection and
`fault isolation. To take this design point and make it
`fault-tolerant, one needs to enhance it to provide total
`concurrent error detection and recovery with on-line
`non-disruptive reconfiguration [Pradhan86]. This
`is achieved by integrating the circuit level capabilities
`of error detection with system level redundancy to provide
`complete recovery and reconfiguration. The resulting
`fault-tolerant system provides graceful degradation
`of processing power until repair, with complete transparency
`to the existing application base.
`Following the discussion on the design point, there
`are three sections discussing the details of the major
`subsystems, the Central Processors (CP), the Channels
`and the Storage Subsystem. In each section we have
`illustrated some of the key ideas used in this design
`and how they have been implemented. Clearly, no sin-
`gle paper can exhaustively discuss all the elements that
`
`IBM-Oracle 1012
`Page 1 of 10
`
`
`
`go into building the fault-tolerant system. The power
`subsystem and operating system layer are beyond the
`scope of this paper. This paper should interest system
`architects, integrators, and researchers in fault-tolerant
`computing.
`
`2 The Design Point
`
`A critical part of any such design is a clear recognition
`of customer requirements. In the high-end there
`has always been a very strong emphasis on data integrity.
`Data integrity implies that all computation is
`checked and, in the event of an unrecoverable error, the
`program is aborted or the machine stopped. This is the
`reason why concurrent error detection is so advanced in
`most previous IBM high-end processors. Data integrity,
`however, requires only that an error not be allowed to
`propagate - an unrecoverable error could result in the
`job or program being terminated. This may result in a
`system outage if the job being terminated is a critical
`software subsystem that failed software level recovery.
`In the case of a permanent hardware failure which requires
`repair, the processor would be check stopped. In
`recent years, the requirements have gone beyond data
`integrity alone. The goal is that no computation be
`lost, so that program termination is never required, and that
`repair be on-line and non-disruptive, providing continuous
`availability.
`In most commercial environments a slight loss of performance
`is acceptable during the process of repair.
`This is particularly true considering that a) the processors
`are expensive and an expensive idle processor is
`usually unacceptable; b) although systems are operated
`at almost full utilization (around 90%) they are rarely
`fully utilized. A small shortfall in processing capability
`can be handled by workload management algorithms
`in the dispatcher or by shedding some less important
`workloads. With the repair becoming non-disruptive,
`true graceful degradation becomes the goal in these environments.
`The design of a high-end processor is primarily driven
`by performance and cost trade-offs. For performance
`reasons, the organization of the Central Processor Complex
`(CPC) has several identical elements in parallel,
`such as processors, channels, levels of storage, etc.
`Thus there is a fundamental capability for high availability
`from the redundant resources that are deployed
`for performance. At the same time, each of these elements
`is designed at the circuit level to have concurrent
`fault detection, recovery by retry, and fault identification
`and isolation. The strategy for fault-tolerance is
`to enhance the existing Error Detection and Fault Isolation
`(EDFI) techniques and couple them with system
`level techniques exploiting the identical elements in the
`CPC. The resulting design can identify errors and
`recover from transient and intermittent failures by retry. For
`permanent faults requiring repair, the CPC will isolate
`the failing element, reconfigure the system around it,
`recover by rolling back to an error-free consistent state
`and resume execution. The failed hardware can be replaced
`without affecting the rest of the system.
`In the following sub-sections we describe the error
`handling and maintenance capability and then describe
`the design goals for each of the subsystems.
`
`Error Handling and Maintenance
`
`In order to ensure data integrity and provide on-line
`diagnosis, the circuits in the high-end employ concurrent
`error detection and fault isolation. To achieve the
`fault-tolerance and continuous-availability goals, it was
`critical that this capability be enhanced. Thus the design
`goal was to increase the level of error detection
`from what was typically in the 90% range to total concurrent
`error detection. The design point for error detection
`is to ensure that an error is first identified within
`the same machine cycle. The reporting and recording
`may take additional cycles. Every latch is protected by
`a code and there are no naked latches. All dataflows,
`arrays, and control busses have ECC or checking, and
`state machines have parity predict, etc. Expanding to
`this level of coverage added less than 3% to the number
`of logic chips. One reason for the small increase in
`overhead is that many of the hard-to-check logic chips,
`like sequence logic, are very pin limited, and thus the
`logic could be duplicated on the chip for checking without
`costing more chips. Another reason was that in
`a heavily pipelined machine at the high end, a lot of
`the circuits are in parallel, where the checking levels are
`already very high.
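As a minimal illustration of what protecting a latch with a code means, the sketch below stores a data word together with an even-parity bit and checks it on every read. The names (`ParityLatch`, `ParityError`) and the 8-bit width are assumptions made for this example only; the Model 900 checkers are hardware circuits and, per the text, use a mix of ECC, parity predict and duplicated logic rather than this exact scheme.

```python
class ParityError(Exception):
    """Raised when the stored parity bit no longer matches the data."""


def even_parity(word: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


class ParityLatch:
    """Hypothetical 8-bit latch protected by a single even-parity bit."""

    def __init__(self) -> None:
        self.data = 0
        self.parity = 0

    def write(self, word: int) -> None:
        self.data = word & 0xFF
        self.parity = even_parity(self.data)

    def read(self) -> int:
        # The check runs on every read, so a corrupted value is flagged
        # at the point of use instead of propagating.
        if even_parity(self.data) != self.parity:
            raise ParityError("latch contents corrupted")
        return self.data


latch = ParityLatch()
latch.write(0b1011_0010)
latch.data ^= 0b0000_1000  # simulate a single-bit upset in the latch
try:
    latch.read()
    detected = False
except ParityError:
    detected = True
```

A single parity bit only detects an odd number of flipped bits, which is one reason arrays and dataflows would rely on stronger codes such as ECC.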
`There are several failure modes that this Design
`Point addresses; however, for the purpose of understanding
`the error detection and recovery strategy we discuss
`a few of the dominant circuit fault models. Broadly,
`errors in the logic can occur due to solid hard circuit
`faults, intermittent faults, and transient faults. The
`solid hard circuit faults, commonly called permanent
`faults, occur when a digital circuit no longer yields a
`correct output given a specific set of inputs. Any time
`the specific input is repeated, the incorrect output is
`produced. An intermittent fault is an occurrence where
`a specific event produces an incorrect result, but the
`same inputs at a different point in time may produce
`the correct result. This can occur as a result of a design
`error, marginal circuits or loading conditions. Transients
`occur when environmental conditions,
`noise transients, or cosmic particles cause an incorrect
`result, but the circuit itself functions correctly.
`To provide online error correction and repair, not only
`should the error be identified, but it should also be resolved
`whether it is transient or permanent. For a transient,
`the recovery should be guided to the source of the data
`that had an error, and for a permanent fault, to the exact
`identity of the replaceable part. As an example, the design
`ground rules require that error checkers be placed at the
`driver for any signals that leave the replaceable part,
`and immediately after the receiver of any signals entering
`a replaceable part. This allows most failures to be
`identified to the correct part without further analysis.
`The on-line EDFI would determine the Field Replaceable
`Unit (FRU) [Bossen82] to be replaced. It is necessary
`to be able to determine whether an error is due to
`a permanent physical failure requiring repair, or only
`a transient that can be handled without physical part
`replacement. In an ES/9000 system it is not always
`easy to distinguish which type of fault has occurred,
`but the design handles all of them. The entire logic
`structure called for the capability to back out and retry
`all internal operations with appropriate thresholds for
`determining success. Since the retry of an operation can
`involve different paths through the logic than the original
`error, the system of thresholds is very fine grained
`and rather complex. For example, if a fault occurs on
`a directory entry in a cache, the retry of the operation
`may use a different cache set and, therefore, the
`threshold must be on the use of the directory address
`and set, not on the actual operation being performed.
`This has led to a very extensive threshold system implementation.
`A detailed description of the error detection
`techniques can be found in [Bossen92]. Note that a failure
`due to a hard circuit fault might be recoverable through
`instruction retry since the machine is run unoverlapped
`during the retry, thereby using different circuit combinations.
`Intermittents, on the other hand, could cause
`errors several times and go away later. To deal with
`either type of fault, multiple retries with appropriate
`thresholds are used to determine whether a unit needs
`to be taken out for repair or whether the system can
`continue with the unit containing the failure.
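The point that thresholds are kept per hardware resource rather than per operation can be sketched as follows. This is an illustrative model only: the class name `RetryThresholds`, the tuple used as a resource key, and the limit of 3 are assumptions for the example and are not taken from the Model 900 implementation.

```python
from collections import Counter


class RetryThresholds:
    """Counts errors per hardware resource, not per operation (illustrative)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit          # hypothetical threshold value
        self.counts = Counter()

    def record_error(self, resource: tuple) -> str:
        """Record one error against a resource and decide what to do next."""
        self.counts[resource] += 1
        if self.counts[resource] >= self.limit:
            return "repair"         # treat as a failure requiring repair
        return "retry"              # still plausibly transient or intermittent


thresholds = RetryThresholds(limit=3)

# Key the count on (array, directory address, set), as in the cache example:
# a retry that lands in a different set must not hide the faulty entry.
faulty_entry = ("L1-directory", 0x1A0, 2)
outcomes = [thresholds.record_error(faulty_entry) for _ in range(3)]
other_set = thresholds.record_error(("L1-directory", 0x1A0, 0))
```

Here the third error against the same directory address and set crosses the threshold, while an error in a different set of the same address starts its own count.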
`If a failure is determined to be a physical failure
`requiring repair, the design requires the capability to
`fence off all external interfaces to the element being repaired
`so that reconfiguration and maintenance can be
`performed while the remainder of the system continues
`in operation. Thus all interfaces have the capability to
`"fence" or de-gate the drivers and/or receivers of the interfaces.
`When a failure is determined to require repair,
`a service request is initiated. The Processor Controller
`Element is provided with a Remote Support Facility
`(RSF) which can report failure information automatically.
`An auto-dial for service from the RSF will result
`in the correct routing of the information to the customer
`service engineer, the parts stocking location, and, if
`further analysis of the failure information is required,
`the remote support center.
`
`Design Goals for Subsystems
`
`The design choices for fault-tolerance pay careful attention
`to both the physical layout of the machine and the
`logical boundaries. Figure 1 shows the physical structure
`of the Model 900. Each central processor (CP) is
`contained on one board in 4 TCMs, which include two
`128 KByte store-through Level 1 Caches (L1). Each
`CP contains multiple execution elements (each for different
`types of instructions) and implements out-of-sequence
`instruction execution. There are two System
`Control Element (SCE) boards, each with 6 TCMs, and
`each board has 8 card positions to accept up to 512
`MBytes of central storage. The SCE includes several
`elements: a store-in Level 2 cache (2 MBytes per SCE),
`the directories and store buffers. There are two Intercommunications
`Connection Elements (ICE) per system;
`each is packaged on a board with 5 TCMs. This
`element contains an I/O processor, channel control and
`switching function and the Expanded Storage controls.
`The Expanded Storage is packaged on cards and up
`to 8 GBytes is available. The system can be configured
`with up to a total of 256 channels, 96 of which
`could be Parallel channels and the remainder Enterprise
`Systems Connection Architecture (ESCON) fiber
`optic channels. To avoid confusion between an individual
`CP and the package including the TCM boards,
`storage and channels, the package is called the Central
`Processor Complex (CPC) or simply the processor.
`
`Figure 1: ES9000/900 System Structure
`
`The CPC hardware can be divided into 3 logical elements.
`Figure 2 shows the Central Processors (CPs),
`Storage Subsystem, and the Channel Subsystem (CSS).
`A different approach was taken in each logical section
`depending on element redundancy and the concurrent
`error detection and reconfiguration capability in the element.
`We examine the goals of each of these logical
`elements here and describe the details of the design in
`the following sections.
`
`Figure 2: ES9000/900 Logical Structure
`(Figure labels: CPs: CP TCMs, SCE TCMs. Storage Subsystems:
`SCE TCMs, ICE TCMs, Central & Expanded Storage Cards.
`Channel Subsystem: ICE TCM, Channel Cards. Support
`Subsystems: Power, Cooling, Processor Controller.)
`
`• The Model 900 is a 6-way tightly coupled multiprocessor,
`which is an inherently redundant design
`point. The implementation takes care that a program
`dispatched on any of the CPs will execute
`architecturally correctly. If a CP takes a catastrophic
`failure, the work can be moved to another
`CP and the failing CP fenced out of the processor
`resource pool and repaired. The loss in degrading from
`6 CPs to 5 CPs is less than 1/6th (due to lower
`MP degradation), which is quite tolerable for most
`commercial environments. This graceful degradation
`together with non-disruptive repair provides
`a very cost-effective solution in a high-end CPC.
`Additional steps taken in the internal design of the
`CP have the further benefit of greatly reducing
`the likelihood of a CP taking a catastrophic
`failure. The operating system’s primary function
`is to schedule work to the CPs using algorithms to
`best manage the workload. Achieving good performance
`for the important jobs even in a degraded
`configuration is a natural fallout of the operating
`system scheduler.
`
`• The Channel Subsystem (CSS) is the means
`through which the processor attaches to DASD and
`tape storage devices, display controllers, switches,
`and communication networks. It is made up of
`TCM data path and control logic and up to 256
`card-on-board channels. The CSS has considerable
`inherent redundancy. Redundant paths are
`used in such a way that if one is busy, an alternate
`is used, thus increasing performance rather than
`waiting for the busy one to be freed. Performance
`and availability requirements determine the number
`of redundant paths to any particular I/O device.
`These are generally sufficient to tolerate the
`loss of one path with no noticeable performance
`degradation. If a channel should fail, the failing
`channel can be varied out of the active configuration
`and concurrently repaired. Channel packaging
`of one parallel channel per card or two ESCON
`channels per card facilitates the concurrent
`repair. ESCON [Elliot92] architecture further facilitates
`the ease with which redundant paths can
`be configured so that an alternate path can be used
`to address an I/O device. The Channel Subsystem
`TCM provides configuration boundaries for fault
`tolerance in the data funnel from the individual
`channel to Central Storage.
`
`• The Storage Subsystem is responsible for the handling
`of system data. It is made up of Central
`Storage, Expanded Storage, and the L2 Cache.
`This data is usually accessed synchronously with
`the CP’s operation. The storage subsystem controls
`manage the data through the use of Least
`Recently Used (LRU) algorithms. The storage system
`is not duplicated since this has adverse performance
`effects and the cost implications are prohibitive.
`However, several techniques are applicable
`to make storage fault-tolerant: ECC, scrubbing,
`array reconfiguration and sparing. Array reconfiguration
`is the ability to delete a section of
`storage and move the data for the target address
`to another area in the storage array. For the central
`and expanded storage array, spare memory array
`chips are added for automatic replacement of
`a failing memory chip. Fault avoidance has also
`been used most extensively in this element. An entire
`level of packaging was removed, compared to
`the Model 600J, by integrating the Central Storage
`cards into the System Control Element.
`
`• Support subsystems which are not critical to the
`performance or cost/performance characteristics
`of the Model 900 widely use explicit redundancy.
`Each major physical element (e.g. SCE, CP, ICE)
`has its own power boundary. Within that boundary
`there are power supplies for each required voltage
`level. For each voltage level within the power
`boundary there is one more supply than needed
`to meet the power requirements. All N+1 supplies
`are normally active. When one fails, the remaining
`N automatically re-adjust their output to
`meet the needs of the load. Likewise, subsequent
`to a repair, all N+1 automatically re-adjust their
`output to meet the needs which were being provided
`by N supplies. Details of this can be found
`in [Covi92]. The air cooling subsystem fans and
`blowers are similarly designed. Pumps are redundant
`in the water cooling subsystem and are periodically
`switched to detect any latent faults in the
`inactive pump. The Processor Controller Element
`(PCE), which provides system operations and service
`interfaces as well as participating in recovery and
`reconfiguration, is duplexed. One side is active and
`the backup continuously runs diagnostic programs
`and communicates with the active side via "I am well"
`signals. These explicitly redundant subsystems -
`power supplies, fans, blowers, pumps, PCE - are
`all concurrently maintainable.
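The N+1 power arrangement in the last item can be made concrete with a small load-sharing calculation. This is a sketch under stated assumptions: the function name `share_load`, the value N = 3 and the 120 A load are hypothetical figures chosen for the arithmetic, and equal sharing among active supplies is assumed rather than taken from [Covi92].

```python
def share_load(load_amps: float, working_supplies: int) -> float:
    """Current each active supply must deliver if the load is shared equally."""
    if working_supplies < 1:
        raise ValueError("no working supplies left in this power boundary")
    return load_amps / working_supplies


N = 3               # hypothetical: supplies needed to carry the full load
load = 120.0        # hypothetical load on one voltage level, in amps

per_supply_normal = share_load(load, N + 1)   # all N+1 active: 30.0 A each
per_supply_degraded = share_load(load, N)     # one failed: 40.0 A each

# Each supply must be rated for the degraded case, i.e. load / N.
required_rating = share_load(load, N)
```

In normal operation every supply runs below its rating; after a single failure the remaining N each pick up an equal share and the load is still met, which is what allows the failed supply to be replaced concurrently.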
`
`The following sections describe the design of three
`subsystems that were described in this section, the CPs,
`Channel Subsystem, and Storage Subsystem.
`
`3 The Central Processor
`
`All System/390 Central Processor Complexes with more
`than one Central Processor have the ability to dynamically
`vary any CP on-line and off-line, and, as long as
`one of the N CPs is on-line, continue instruction execution.
`With the exclusion of special operations for
`which hardware may not be provided on all CPs (i.e.
`Vector Facility and Integrated Cryptographic Facility),
`each CP in a multiprocessing environment is capable of
`executing any task dispatched by the operating system
`and it is unpredictable where any particular task will be
`executed. A major goal of the Model 900 design was to
`exploit this capability and continue uninterrupted operation
`in the event that one CP becomes unavailable.
`Accomplishing this goal required the following designs:
`
`1. The processing state in the failed CP must be rolled
`back to a consistent, error free state.
`
`2. The state of the storage system as seen by all other
`CPs must be architecturally accurate and error
`free.
`
`3. The state of the hardware capability of the failed
`CP (transient failure vs permanent failure) must
`be determined.
`
`4. The failed CP must be removed from the configuration
`before it propagates any corrupted data.
`
`5. The work that was running on the failed CP must
`be transferred to an operational processor.
`
`6. The failed CP must be repaired and returned to
`service without a system interruption.
`
`Figure 3: Instruction Retry and Recovery
`
`The implementation of this is best illustrated by an
`example involving instruction retry and recovery. Figure 3
`illustrates the pipeline and four instructions that
`are issued. Note that this machine implements out-of-sequence
`execution, and this example also illustrates the
`leveraging of a performance technique to boost fault-tolerance
`at the circuit level. Out-of-sequence execution
`allows multiple instructions to be issued without
`requiring that they finish in the same order that
`they are issued, so long as they complete in the same
`order. There is a fine distinction between finish and
`complete - finish is when the instruction finishes execution,
`and complete is when the results of the instruction
`are put away into storage. Essentially, once an instruction
`completes it cannot be backed out since its results
`are posted in storage and architecturally the instruction
`has executed. Prior to completing, an instruction can
`be backed out provided all other pending stores from
`instructions issued after it are also cancelled and the
`storage is left in a consistent state.
`Assume a failure occurs at time T2 as shown in Figure 3.
`The clocks are immediately frozen at time T2.
`Note that the last completed instruction was instruction
`number 1, which completed at time T1. Although
`instructions 3 and 4 had both finished (their results
`had been calculated and put into temporary facilities
`awaiting completion), they cannot be flagged as completed
`since instruction number 2 has not completed.
`Since instruction execution results are maintained in
`temporary facilities until the instruction is determined
`to be complete, it is possible to back out of all intermediate
`pipeline operations (including the finished instructions)
`by flushing the pipeline and temporary facilities.
`This leaves the processing state of the program
`in execution at an error free state - that is, at a point
`in time corresponding to the completion of instruction
`number 1. Since there is assumed to be a failure in
`the hardware of the CP, a physically separate Processor
`Controller examines the pipeline contents, architected
`facilities, and temporary facilities, and puts them back
`into a consistent state. The Processor Controller obtains
`the CP state by scanning out the internal facilities
`using totally separate scan clocks so that the logic
`clocks can remain stopped.
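The finish/complete distinction and the rollback in the Figure 3 example can be sketched with a small model. This is a simplified illustration, not the Model 900 mechanism: the `Instr` record and the two helper functions are hypothetical names, and the temporary facilities are reduced to a `finished` flag.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    number: int
    finished: bool = False    # result computed, held in temporary facilities
    completed: bool = False   # result architecturally committed


def complete_in_order(window: list) -> None:
    """Commit finished instructions oldest first, stopping at the first gap."""
    for instr in window:
        if not instr.finished:
            break
        instr.completed = True


def roll_back(window: list) -> tuple:
    """Flush all uncompleted work; return (checkpoint, flushed numbers)."""
    checkpoint = max((i.number for i in window if i.completed), default=None)
    flushed = [i.number for i in window if not i.completed]
    for instr in window:
        if not instr.completed:
            instr.finished = False   # discard temporary results
    return checkpoint, flushed


# State at time T2 in the example: 1, 3 and 4 have finished, 2 has not.
window = [Instr(1, finished=True), Instr(2),
          Instr(3, finished=True), Instr(4, finished=True)]
complete_in_order(window)
checkpoint, flushed = roll_back(window)   # checkpoint 1; 2, 3, 4 flushed
```

Instructions 3 and 4 are flushed even though they finished, because nothing younger than the oldest uncompleted instruction may be committed.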
`Next, the storage system must be ensured to be consistent.
`The Model 900 is particularly well suited to this because
`of its storage hierarchy design. One of the advantages
`of the store-through L1 Data Cache is that it provides
`isolation of internal Central Processor faults. Because
`the shared L2 Cache contains any critical system data
`resulting from CP instruction execution, the L1 Cache
`copy is not critical and can be discarded upon detection
`of an error.
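Why a store-through L1 can simply be discarded is easy to see in a toy model. The sketch below is an assumption-laden simplification: `StoreThroughL1` is a hypothetical class, the shared L2 is a plain dictionary, and line granularity, coherence and the store buffers are ignored.

```python
class StoreThroughL1:
    """Toy store-through cache: every store is also written to the L2."""

    def __init__(self, l2: dict) -> None:
        self.l2 = l2        # stand-in for the shared L2 cache
        self.lines = {}     # private L1 copies

    def store(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        self.l2[addr] = value          # store-through: L2 always has the data

    def load(self, addr: int) -> int:
        if addr not in self.lines:     # miss: refill from the L2
            self.lines[addr] = self.l2[addr]
        return self.lines[addr]

    def discard(self) -> None:
        """Drop every L1 line; nothing is lost because the L2 holds it all."""
        self.lines.clear()


l2 = {}
l1 = StoreThroughL1(l2)
l1.store(0x100, 42)
l1.lines[0x100] = 99      # simulate a corrupted L1 copy
l1.discard()              # recovery action on a detected L1 error
value = l1.load(0x100)    # refilled from the L2
```

With a store-in L1 the only up-to-date copy could be the corrupted one, so the same recovery action would lose data.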
`Determining whether the failure detected was a transient
`error or a true permanent circuit failure is done
`by retry. The Model 900 CP register management [Liptay90],
`the main function of which is to control out-of-sequence
`instruction execution, provides an audit trail
`of changes effected by incomplete instructions which is
`used by instruction retry for checkpoint restart. In order
`to do retry, the Processor Controller sets the internal
`state of the logic in the CP to the state it would be
`in at the completion of instruction number 1. The logic
`clocks are then stepped to see if the following instructions
`execute successfully with no errors. If they do, the
`error was a transient, and normal operation follows. If
`retry is unsuccessful after a thresholded number of retries,
`the failure is determined to be permanent and the
`work must be moved. At this point the failed processor
`is put into the CP check-stopped state.
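The retry decision described above reduces to a bounded loop. The sketch below is illustrative only: `retry_from_checkpoint`, `make_fault` and the limit of 3 retries are hypothetical, and a real retry restores the scanned-in checkpoint state before each attempt rather than calling a Python function.

```python
def retry_from_checkpoint(run_once, max_retries: int) -> str:
    """Re-execute from the checkpoint up to max_retries times.

    run_once() stands for restoring the checkpoint state and stepping the
    clocks; it returns True when the instructions execute without error.
    """
    for _ in range(max_retries):
        if run_once():
            return "transient"      # resume normal operation
    return "checkstop"              # treat as permanent; work must be moved


def make_fault(failures_before_success: int):
    """Build a fake execution that fails a fixed number of times first."""
    remaining = [failures_before_success]

    def run_once() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return run_once


glitch = retry_from_checkpoint(make_fault(1), max_retries=3)
hard_fault = retry_from_checkpoint(lambda: False, max_retries=3)
```

In this model `glitch` is classified as a transient after the second attempt succeeds, while `hard_fault` exhausts the retries and leads to the check-stopped state.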
`CP Checkstop [ESA/390] indicates severe CP damage
`for which processing on that CP must be terminated.
`In a uniprocessor, CP Checkstop is the equivalent
`of System Checkstop and all instruction processing
`ceases. In the past, when there was more than one
`CP and Processor Availability Facility (PAF) was not
`available, the operating system would initiate Alternate
`CP Recovery (ACR). ACR logically removes the failing
`CP from the system and attempts to transfer work that
`was in progress on the failing CP to an operative CP
`[MVS/XA]. The Model 900 introduces Processor Availability
`Facility whereby the failing CP can be physically
`isolated for repair while the work in progress is completely
`recovered and run on the N-1 CPs.
`
`Figure 4: Concurrent CP Repair after Processor Availability
`Facility
`(Figure labels: All CPs online; CP1 check stop; vary CP1
`offline; non-disruptive repair; CP power boundary on.)
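The hand-off of in-progress work to the remaining CPs can be pictured as putting the interrupted task back on a priority-ordered dispatch queue. This is a rough sketch under assumptions: the function names, the `(priority, name)` task tuples and the convention that a lower number means higher priority are all hypothetical, and the real mechanism also passes the error status and full process state to the operating system.

```python
import heapq


def requeue_after_checkstop(dispatch_queue: list, failed_cp: int,
                            running: dict, online: set) -> None:
    """Take the failed CP out of the pool and requeue its task unchanged."""
    online.discard(failed_cp)
    task = running.pop(failed_cp)
    heapq.heappush(dispatch_queue, task)   # normal dispatch priority applies


def dispatch(dispatch_queue: list, cp: int, running: dict) -> None:
    """Give the highest-priority waiting task to an available CP."""
    running[cp] = heapq.heappop(dispatch_queue)


# Tasks are (priority, name); lower number = higher priority (assumption).
queue = [(5, "batch-report")]
online = {1, 2, 3, 4, 5, 6}
running = {1: (1, "database-subsystem")}

requeue_after_checkstop(queue, failed_cp=1, running=running, online=online)
dispatch(queue, cp=2, running=running)
```

The critical task interrupted on CP1 is neither terminated nor demoted; it is simply the next task dispatched on one of the five remaining CPs.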
`
`One of the Model 900 objectives was to provide concurrent
`maintenance for redundant subsystems, whether
`inherent or explicit. Figure 4 shows the procedure for
`concurrent repair of a CP. The failed CP needs to be
`both logically and physically isolated from the operational
`components of the Processor Complex. Physical
`isolation is provided by a power boundary. Each CP
`has its own independent power supplies which are not
`shared with any other hardware and which may be powered
`up and down independently of the power status of
`other power boundaries. Logical isolation is provided
`by fencing, an internal mechanism by which any physically
`connected on-line logical unit disables the receipt
`of signals from the fenced CP. Fencing is invoked automatically
`by the hardware. Also automatically invoked
`through the PCE is an auto-call for repair. The
`auto-call will result in the customer service engineer arranging
`a time convenient to the customer for CP concurrent
`maintenance. At that time, the customer service
`engineer performs on-line maintenance procedures,
`beginning with powering down the CP boundary, followed
`by replacing the failed part, powering the boundary
`back on, and, with operations assistance, varying
`the CP back into the configuration. Again, throughout
`this repair process, N-1 CPs continue uninterrupted instruction
`processing, and upon its completion all CPs
`are online, executing tasks.
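The repair procedure just described is a fixed sequence of steps, each with a responsible actor, during which the remaining CPs stay online. The sketch below encodes that sequence; the step names and the `concurrent_repair` function are labels invented for this example and do not correspond to actual service procedures.

```python
# (state reached, who drives the step), following the order in the text.
STEPS = (
    ("check-stopped", "hardware"),
    ("fenced", "hardware"),
    ("service-called", "PCE"),
    ("powered-off", "service engineer"),
    ("part-replaced", "service engineer"),
    ("powered-on", "service engineer"),
    ("online", "operator"),
)


def concurrent_repair(total_cps: int, failed_cp: int) -> list:
    """Walk one CP through repair; log (state, actor, CPs online) per step."""
    online = set(range(1, total_cps + 1))
    online.discard(failed_cp)          # removed as soon as it check-stops
    log = []
    for state, actor in STEPS:
        if state == "online":
            online.add(failed_cp)      # varied back into the configuration
        log.append((state, actor, len(online)))
    return log


log = concurrent_repair(total_cps=6, failed_cp=1)
```

Every entry in `log` except the last shows five CPs online, and the last shows six, matching the statement that N-1 CPs run uninterrupted throughout the repair.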
`
`Processor Availability Facility
`
`Processor Availability Facility (PAF), designed for the
`Model 900, presents the check-stopped condition along
`with the details of the error status and the complete
`status of the process in execution to the operating system.
`Upon examination of the error status, the operating
`system stores the task in the normal dispatch queue
`to allow resumption of execution on another processor
`according to normal dispatch priorities [Daly91].
`In the past, when retry was not successful, ACR would
`ABnormally END (ABEND) the task and would initiate
`software recovery. Depending on the robustness of
`the recovery and the criticality of the task that was
`ABENDed, the system may or may not recover. In
`general, for control program tasks, where the recovery is
`quite robust, recovery has been successful. If the task
`was an application, termination generally would not
`cause the system to fail. However, for many subsystem
`and other critical component tasks, it was quite
`common for the system to fail. With the successful
`PAF operation, the task is not ABENDed and no other
`software recovery mechanisms need be invoked. PAF
`provides online reconfiguration in addition to the concurrent
`error detection and fault identification, making
`this a fault-tolerant processor. During and subsequent
`to the PAF operation, N-1 CPs continue to execute instructions
`normally, no tasks are lost, and no intervention
`is required. The normal priority of dispatching ensures
`optimum usage of the system while it is running
`with one less CP.
`
`4 The Channel Subsystem
`
`The Model 900 Channel Subsystem consists of
`TCM control, storage path logic and Card-on-Board
`channels. A maximum of 256 channels can be attached.
`All 256 may be ESCON or up to 96 can be parallel. ESCON,
`a new I/O architecture, is standard on the Model
`900. This provides a high-speed, long distance fiber optic
`serial channel. A direct channel to I/O connection
`is point-to-point. ESCON also provides for channels to
`attach to a star-distributed dynamic switch, the ESCON
`Director. The Director allows dynamic switching
`between any combination of up to 60 channels and control
`units. This enhanced connectivity facilitates providing
`redundant channels to I/O device paths.
`There are two ESCON channels per card or one parallel
`channel per card. Depending on the mix of ESCON
`and parallel channels, a Model 900 may have a maximum
`of 128 to 176 cards in the Channel Subsystem.
`Asymmetry is permitted, but if symmetrical, 64 to 88
`cards are in one power boundary and the other 64 to
`88 in a separate power boundary.
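The card counts quoted above follow directly from the packaging rule of one parallel channel or two ESCON channels per card. A short calculation, with a hypothetical helper name `channel_cards`, reproduces the two extremes:

```python
def channel_cards(parallel: int, escon: int) -> int:
    """Cards needed: one parallel channel per card, two ESCON per card."""
    if parallel > 96 or parallel + escon > 256:
        raise ValueError("exceeds the Model 900 channel limits")
    return parallel + (escon + 1) // 2   # round up for an odd ESCON count


all_escon = channel_cards(parallel=0, escon=256)       # 128 cards
max_parallel = channel_cards(parallel=96, escon=160)   # 96 + 80 = 176 cards

# Symmetrical split across the two power boundaries.
per_boundary = (all_escon // 2, max_parallel // 2)     # (64, 88)
```

Any fully populated mix of channels therefore needs between 128 and 176 cards, or 64 to 88 per power boundary when the configuration is symmetrical.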
`
`Dynamic Path Selection
`
`Figure 5: Dynamic Path Selection
`(Figure labels: Channel Subsystem; 3990 Storage Controller;
`Selection of Reconnect Path.)
`
`For performance reasons, 370/XA architecture was developed
`to relieve the operating system from the task
`of managing the I/O activity. Channel redundancy has
`long been a requirement of high-end processor complexes.
`In order to achieve throughput, more than one
`channel must be capable of accessing the same I/O device.
`This is so I/O operations are not delayed when the
`channel is busy with a different I/O operation. When
`a busy condition is encountered, the Channel Subsystem
`can choose another route to the device. The status
`of physical connections available, the choice of physical
`connection to be used, and the routing of operations
`are all controlled by the Channel Subsystem. Figure 5
`shows that an I/O request that may be initiated
`along one physical path can complete along a different
`path determined by the Channel Subsystem or Storage
`Controller. It executes the architecture which defines
`a connection between an I/O device and the operat-
`ing system independent o