`Reliability
`
`Douglas Bossen, Joel Tendler, Kevin Reick
`
`IBM Server Group, Austin, TX
`
`AMD EX1024
`U.S. Patent No. 6,895,519
`
`
`
`
`POWER4 Microprocessor
`• 2-way SMP system on a chip
`  - > 1 GHz processor frequency
`
`• Storage Hierarchy
`  - L1: 32 KB Data, 64 KB Instruction per processor
`  - L2: ~1.5 MB per chip
`  - L3: 32 MB per chip
`
`• Chip interconnect:
`  - New Distributed Switch design
`  - Buses operate at _ processor speed
`
`• Technology:
`  - 0.18 μm lithography
`    - Copper, SOI
`  - 174 million transistors
`
`[Figure: POWER4 chip block diagram showing two > 1 GHz cores, a shared L2, the L3 directory, L3 and memory, the GX bus, and chip-to-chip communication]
`
`
`
`
`System Building Block
`
`[Figure: four POWER4 chips on one module, each with two cores (C C), an L2, an IOCC, and a GX bus; each chip attaches to its own L3 controller/directory (L3 Ctl/Dir), L3, and memory; inter-chip buses run to and from other modules]
`
`• 4-chip, 8-way SMP module
`• New Distributed Switch design
`• Hybrid switch and bus configuration
`• Enables aggressive cache-to-cache transfers
`• Buses extended in multi-module configurations
`
`
`
`
`32-way System Logical Structure
`
`[Figure: 32-way logical structure with sixteen GX bus, L3, and memory sets, drawn in two groups of eight]
`
`
`
`
`32-way Components
`• Hardware
`  - 4 multi-chip modules with 16 POWER4 chips
`  - 16 L3 dual chip modules with 32 16 MB EDRAM chips
`  - Up to 16 memory controllers
`  - Thousands of DRAM chips to support up to 256 GB RAM
`  - GX to Remote I/O Bridge Chips
`  - PCI Host Bridge Chips
`  - PCI-PCI Bridge Chips
`  - Redundant Power Supplies and Air Moving Facilities
`  - I/O devices
`
`• Code
`  - Service Processor Code
`  - System Firmware Code
`  - Operating System Code
`
`The system comprises hardware and software components that interact and affect system RAS.
`
`
`
`
`Chip Design Driven by System Requirements
`
`Fault Avoidance
`• Component minimization
`• Intrinsic SER mitigation
`
`Recovery
`• Local chip level: ECC, refresh
`• Chip interactions: bus/command retry
`• System error handling: UE, PCI bus retry
`
`Diagnosis and Reconfiguration
`• Run time / IPL diagnostics
`• Local chip level: spare bits, lines, elements
`• System level: CPU, cache, memory sparing + degraded modes of operation
`
`Repair Policy
`• Minimize system down time
`• Concurrent and deferred maintenance
`
`
`
`
`Fault Avoidance: SER Mitigation With ECC
`
`Chip Example
`
`Failure Category                                   Failure Rate (Relative)  Source               FITS         MTBF
`Intrinsic                                          1                        Vendor DB            10-100       > 1140 yrs
`Intermittent Marginal Pattern & Stress Sensitive   2-3                      Empirical Field Avg  20-300       > 380 yrs
`Aggregate Chip SER                                 150-2000                 Vendor Appl Notes    1500-140000  0.8-76 yrs
`
`POWER4 Chip Set Design Rule:
`
`Implement ECC or equivalent recovery on all arrays sufficient
`to keep residual SER at or below each chip’s IFR
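The MTBF figures on this slide are consistent with the FITS figures if one assumes the usual definition of a FIT as one failure per 10^9 device-hours and an 8760-hour year. The slide does not state the conversion, so the following is only a sketch under that assumption; the function name `fit_to_mtbf_years` is illustrative.

```python
HOURS_PER_YEAR = 8760      # 365 days * 24 hours (assumed; ignores leap years)
FIT_DEVICE_HOURS = 1e9     # assumed definition: 1 FIT = 1 failure per 10^9 hours


def fit_to_mtbf_years(fits: float) -> float:
    """Convert a failure rate in FITs to a mean time between failures in years."""
    if fits <= 0:
        raise ValueError("FIT rate must be positive")
    return FIT_DEVICE_HOURS / fits / HOURS_PER_YEAR


# Endpoints of the FITS ranges quoted on the slide.
for label, fits in [
    ("Intrinsic, worst case", 100),
    ("Intermittent, worst case", 300),
    ("Aggregate chip SER, best case", 1500),
    ("Aggregate chip SER, worst case", 140000),
]:
    print(f"{label}: {fits} FITs -> {fit_to_mtbf_years(fits):.1f} years")
```

Under these assumptions the four endpoints come out at roughly 1140, 380, 76, and 0.8 years, which matches the MTBF values quoted for each failure category.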
`
`
`
`
`Recovery: ECC, Retry, UE Handling
`
`ECC or hardware refresh: Memory, Caches, ERAT, TLB
`• Hamming SEC-DED augmented for special uncorrectable error (UE) handling
`• Mainstore ECC designed for chip kill + redundant bit steering
`• Masks intermittent errors
`• Supports UE handling
`• Increases unmasked MTBF from 4 months to > 20 years
`
`Retry: Module, GX & PCI buses, and I/O link
`
`UE handling: ECC protected arrays, buses using retry
`• Marking, moving UE data
`• AIX to support process terminate and software partition reboot
`
`
`
`
`Recovery: Main Store ECC and Extensions
`
`Memory scrubbing corrects soft single-bit errors in the background while memory is idle, preventing them from accumulating into multiple-bit errors.
`
`[Figure: row of memory chips, two marked XXXX, with a spare memory chip]
`
`Bit scattering allows normal single-bit ECC error processing to function even with a chip kill failure by scattering memory chip bits across separate ECC words.
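The effect of bit scattering can be illustrated with a toy layout. The chip width (4 bits), the chip count (18), and the function names below are assumptions chosen for illustration, not the POWER4 memory organization. The sketch compares how many bad bits each ECC word sees after one chip fails completely, with and without scattering.

```python
BITS_PER_CHIP = 4    # assumed DRAM chip width (illustrative)
NUM_CHIPS = 18       # assumed number of chips read per access (illustrative)


def scatter(chips):
    """Scattered layout: bit k of every chip goes into ECC word k."""
    return [[chip[k] for chip in chips] for k in range(BITS_PER_CHIP)]


def contiguous(chips):
    """Naive layout: chip outputs are concatenated, then cut into equal words."""
    flat = [bit for chip in chips for bit in chip]
    return [flat[i * NUM_CHIPS:(i + 1) * NUM_CHIPS] for i in range(BITS_PER_CHIP)]


def bad_bits_per_word(layout, good, observed):
    """Count differing bits in each ECC word under the given layout."""
    return [
        sum(g != o for g, o in zip(good_word, seen_word))
        for good_word, seen_word in zip(layout(good), layout(observed))
    ]


good = [[0] * BITS_PER_CHIP for _ in range(NUM_CHIPS)]
failed = [chip[:] for chip in good]
failed[5] = [1] * BITS_PER_CHIP   # chip 5 fails: every bit it supplies is wrong

print(bad_bits_per_word(scatter, good, failed))      # [1, 1, 1, 1]
print(bad_bits_per_word(contiguous, good, failed))   # [0, 4, 0, 0]
```

With scattering, each word carries at most one bad bit, which a single-error-correcting code can repair. In the naive layout, all four bad bits land in one word and exceed what SEC-DED can correct.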
`
`Bit steering dynamically reassigns memory I/O if the error threshold is reached on the same bit.
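A threshold-driven steering policy of this kind might be sketched as follows. The class name `BitSteering`, the threshold of three corrected errors, and the single spare are all hypothetical; the slide gives no actual values or interfaces.

```python
class BitSteering:
    """Toy model: steer a bit lane to a spare after repeated corrected errors."""

    def __init__(self, threshold=3, spares=1):
        self.threshold = threshold    # assumed error count that triggers steering
        self.spares_left = spares     # assumed number of spare lanes available
        self.error_counts = {}        # bit lane -> corrected errors seen so far
        self.steered = {}             # failing bit lane -> spare index now used

    def record_corrected_error(self, lane):
        """Record one corrected error; return True if the lane was just steered."""
        if lane in self.steered:
            return False              # lane already served by a spare
        count = self.error_counts.get(lane, 0) + 1
        self.error_counts[lane] = count
        if count >= self.threshold and self.spares_left > 0:
            self.steered[lane] = len(self.steered)
            self.spares_left -= 1
            return True
        return False


steering = BitSteering()
results = [steering.record_corrected_error(17) for _ in range(3)]
print(results)            # [False, False, True]
print(steering.steered)   # {17: 0}
```

Once the spare is consumed, further lanes that cross the threshold stay in place in this sketch, so the fault would have to be handled by ECC alone or by deferred repair.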
`
`
`
`
`Recovery: Additional Logic to Avoid Checkstops
`
`[Figure: ECC-protected L1 data cache, L2 cache, L3 cache, and main memory, each with its own FIR (DL1 FIR, L2 FIR, L3 FIR, MS FIR); status labels OK, UE, SUE, SUE; outputs to the Service Processor, FRU Callout, and Synch Machine Check]
`
`
`
`
`Recovery: More Logic to Avoid Checkstops
`
`[Figure: I/O path from the AIX device driver and system firmware through the POWER4 chip (cores, L2, IOCC, L3 Ctl/Dir, L3, memory), the GX bus, the Remote I/O (RIO) Bridge Chip, the Remote I/O bus, and the PCI Host Bridge (PHB) Chip to the PCI bus]
`
`On PCI Error Detect:
`
`• Slot freeze on error
`• Return all 1’s to Device Driver
`• Driver calls firmware, reset slot
`• Driver recovery, retry operation
`• Threshold, fault isolation logged
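The sequence above can be sketched as a small simulation. Everything here is hypothetical: the classes `PciSlot` and `Firmware`, the function `driver_read`, the 32-bit read width, and the retry threshold of three are illustration only and do not represent AIX or firmware interfaces.

```python
ALL_ONES = 0xFFFFFFFF   # value a frozen slot returns (32-bit read assumed)


class PciSlot:
    """Toy slot that freezes itself when an injected error occurs."""

    def __init__(self, faults_to_inject):
        self.faults_remaining = faults_to_inject
        self.frozen = False

    def read(self):
        if self.faults_remaining > 0 and not self.frozen:
            self.faults_remaining -= 1
            self.frozen = True          # slot freeze on error
        return ALL_ONES if self.frozen else 0x1234


class Firmware:
    """Toy firmware service that resets slots and keeps an error log."""

    def __init__(self):
        self.log = []

    def reset_slot(self, slot):
        slot.frozen = False
        self.log.append("slot reset")


def driver_read(slot, firmware, retry_threshold=3):
    """Driver-side recovery: detect all 1's, ask firmware to reset, retry."""
    for _ in range(retry_threshold):
        value = slot.read()
        if value != ALL_ONES:
            return value                # operation succeeded
        firmware.reset_slot(slot)       # driver calls firmware, reset slot
    firmware.log.append("threshold reached: fault isolation logged")
    return None                         # give up; slot is treated as failed


fw = Firmware()
print(driver_read(PciSlot(faults_to_inject=1), fw))   # 4660 after one reset
print(driver_read(PciSlot(faults_to_inject=5), fw))   # None after three resets
print(fw.log)
```

A real driver would also need to tell a legitimate all-ones data value apart from a frozen slot, which this sketch does not attempt.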
`
`[Figure, continued: PCI-PCI Bridge Chips, PCI buses, and a PCI Adapter below the PCI Host Bridge]
`
`
`
`
`Diagnosis & Reconfiguration: IPL RAS Design
`
`POWER4 Chip
`  - Built-in Self Test (BIST)
`  - Chips / single core can be deconfigured
`  - Spare bits switched in for single cell failures in L1, L2 caches, and in L2, L3 directories
`
`L3
`  - Line delete capability maps out bad bits
`  - L3 cache bypassed for more serious failures
`
`Main Memory
`  - ECC and redundant bit steering
`  - Entire memory card deconfigured for serious failure
`
`I/O
`  - Redundant Remote I/O link takes over in case of failure
`  - I/O drawer deconfigured during boot if failure detected to allow IPL to proceed
`  - PCI adapter marked unavailable if error detected and can be replaced via Hot Swap later
`
`
`
`
`Diagnosis & Reconfiguration: Run Time RAS Design
`
`POWER4 Chip
`  - Internal checkers, parity, ECC, UE & Special UE handling
`  - FIR error capture, Who’s on First logic
`  - Spare bits / quiesce mode
`
`L3
`  - Internal checkers, parity, ECC, UE & Special UE handling
`  - FIR error capture, Who’s on First logic
`  - Cache line delete
`  - Bypass on next boot following UE
`
`Main Memory
`  - ECC, address checks, UE & Special UE handling
`  - FIR error capture, Who’s on First logic
`  - Memory scrubbing, chip kill, bit steering
`  - Card deconfigure on next boot following UE
`
`I/O
`  - GX bus error checkers, UE & Special UE handling
`  - FIR error capture, Who’s on First logic
`  - Remote I/O link hardware failover
`  - PCI device detect, recover, isolate, deconfigure
`
`
`
`Diagnosis & Reconfiguration: Fault Isolation Registers
`
`• FIR design guidelines:
`  - Capture error at source
`  - Identify component showing error symptoms
`  - Leads to: FIR bits can be OR’d, but carefully
`
`[Figure: components connected by a bus, with a FIR]
`
`
`
`
`Diagnosis & Reconfiguration: Who’s on First Logic
`
`• Requirement:
`  - Separate cause from effect to isolate faulty component from components propagating fault
`
`• Solution:
`  - Each FIR starts a timer when it detects an error condition
`  - FIRs freeze timer when checkstop finally occurs
`  - FIR measuring longest elapsed time identifies causing component
`
`• Used for:
`  - FRU callout
`  - Reconfigure system
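The timer scheme in the Solution bullets can be sketched in a few lines. The class `Fir`, the function `whos_on_first`, and the cycle numbers are hypothetical and only model the idea that the FIR with the longest elapsed time saw the error first.

```python
class Fir:
    """Toy fault isolation register with a Who's on First timer."""

    def __init__(self, name):
        self.name = name
        self.start_cycle = None   # cycle at which this FIR first saw an error
        self.elapsed = None       # frozen timer value, set at checkstop

    def detect_error(self, cycle):
        """Start the timer on the first error only; later errors are ignored."""
        if self.start_cycle is None:
            self.start_cycle = cycle

    def freeze(self, checkstop_cycle):
        """Freeze the timer when the checkstop occurs."""
        if self.start_cycle is not None:
            self.elapsed = checkstop_cycle - self.start_cycle


def whos_on_first(firs, checkstop_cycle):
    """Return the name of the FIR with the longest elapsed time, if any."""
    for fir in firs:
        fir.freeze(checkstop_cycle)
    candidates = [fir for fir in firs if fir.elapsed is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda fir: fir.elapsed).name


# Hypothetical scenario: an L3 fault propagates to the L2, then to a core.
l3, l2, core, io = Fir("L3"), Fir("L2"), Fir("Core0"), Fir("I/O")
l3.detect_error(cycle=100)
l2.detect_error(cycle=104)
core.detect_error(cycle=109)
print(whos_on_first([core, l2, l3, io], checkstop_cycle=120))   # L3
```

The returned name would then feed the FRU callout or the reconfiguration decision. The I/O FIR never saw an error, so it is excluded from the comparison.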
`
`
`
`
`Diagnosis & Reconfiguration: Role of Service Processor
`
`• Controls First Failure Data Capture
`  - FIR and Who’s on First logic enable capability
`
`• Responsible for reconfiguring system
`  - Use spares for dynamic reconfiguration where possible
`  - Deconfigure failing part and bypass otherwise
`  - FRU callout
`
`• Works with the Operating System to de-allocate components before hard failures occur
`  - Processor
`  - Chip
`  - L3 line deletes
`  - L3 and memory (on next IPL)
`  - PCI adapter and I/O devices
`
`
`
`
`Repair Policy: Maximize System Availability
`
`POWER4 Chip
`  - Internal array spare bits eliminate single bit error causes allowing continued operation without repair
`  - Core / chip deconfiguration support deferred repair for other errors
`
`L3
`  - Cache line delete eliminates single bit error causes allowing continued operation without repair
`  - L3 bypass supports deferred repair for other problems
`
`Main Memory
`  - Bit steering eliminates single bit error causes allowing continued operation without repair
`  - Chip kill allows non-degraded operation and deferred repair for other DRAM failures
`  - Card deconfiguration allows deferred repair for card logic failures
`
`I/O
`  - Remote I/O link hardware failover allows continued operation
`  - PCI device hot plug for concurrent repair
`
`Power and Cooling
`  - Redundancy and hot plug to allow for concurrent repair
`
`
`
`
`POWER4 RAS Design
`
`• Focus on maximizing system operation
`  - Avoid faults through masking and recovery
`  - Dynamically bypass faulty componentry
`  - Allow for concurrent repair where possible, deferred repair otherwise
`
`• Requires special handling to uniquely identify failure source
`
`• Total system design encompassing all hardware components, system firmware and operating system code
`
`• Mainframe RAS attributes in a UNIX server
`
`
`