M. Abbott, D. Har, L. Herger, M. Kauffmann, K. Mak, J. Murdock,
C. Schulz, T. B. Smith, B. Tremaine, D. Yeh, L. Wong

IBM T. J. Watson Research Center, 30 Saw Mill River Rd
Hawthorne, NY 10532

Abstract
The DM/6000 prototype is a fault-tolerant, durable-memory RS/6000. The main storage of this system is battery backed so as to maintain memory content across prolonged power interruptions. In addition, there are no single points of failure, and all likely multiple failure scenarios are covered. The prototype is intended to match the data integrity and availability characteristics of RAID5 disks. Redundancy is managed in hardware and is transparent to the software; application programs and the operating system (AIX) can run unmodified. The prototype is based on the IBM PowerPC 601 microprocessor operating at 80 MHz, and is equivalent in performance and software appearance to a conventional 4-way shared bus, cache coherent, symmetric multiprocessor (SMP), with 4 gigabytes of non-volatile main storage.
1.0 Introduction
A Fault Tolerant, Durable Memory RS/6000 technology prototype is under development at IBM's Watson Research Center. The main storage of this system is battery backed and will maintain memory content across prolonged power interruptions. This allows main store to be used as a durable or persistent storage medium in place of disk, when the improved performance of main memory storage of data could prove beneficial. Use of main store as a storage medium obligates the designer to meet traditional data integrity and availability expectations for a storage medium. Our goal was to match the data integrity and availability of the best RAID5 disk subsystems. Hardware redundancy assures that there are no single points of failure, and that all likely multiple failure scenarios are covered. This redundancy is software transparent. Application programs and the operating system (AIX) can run without modification, and are protected from hardware faults. Capitalizing on this hardware platform, certain augmentations have been made to the software environment to facilitate rapid recovery from software failures.

RS/6000, PowerPC, AIX, and Micro Channel are trademarks of International Business Machines Corporation.
Hardware fault tolerance and memory durability are achieved without significant performance penalty for the implementing technology. The prototype design is based on the IBM PowerPC 601 microprocessor operating at 80 MHz. The software appearance is that of a conventional shared bus, cache coherent, symmetric multiprocessor (SMP) of up to four PowerPC 601 processors, and with up to 4 gigabytes of non-volatile main storage.
2.0 System Architecture
A large obstacle in the introduction of a new technology or concept into a mainstream product architecture is operating system and application compatibility with that base of experience and products. Most fault tolerant systems require major modification or total rewrite of existing software. This has limited the application of many fault tolerant systems to those situations where the costs of these software modifications are justified by the need for improved reliability. Even when this software customization is affordable, there is frequently a very serious lag between when new functions are available on mainstream platforms, and when they become available for fault tolerant system equivalents. This forces a choice between availability, and the more open flexibility of mainstream platforms.
The trends in hardware technology are trivializing the costs of producing internally redundant or fault tolerant systems. The hardware costs for the triple redundant core DM/6000 plus 1 gigabyte of durable memory are less than the cost of two conventional systems with the same raw performance. Further, there is no performance penalty for the management of hardware fault tolerance, as this is transparently handled in hardware. Duplication of conventional machines sacrifices some system performance, since a fraction of the system's throughput must be dedicated to the software processes that mirror recovery data to shared disk or via messaging directly to the backup processor.
A design choice was thus made to use straightforward hardware redundancy techniques to handle hardware failures in a fashion that is transparent to the software. The DM/6000 can use existing uniprocessor and symmetric multiprocessor (SMP) operating systems and application software. The DM/6000 runs an unmodified AIX kernel for the 601 PowerPC and standard RS/6000 applications, with hardware availability characteristics equal to or superior to those of existing commercial fault tolerant computing systems [1][9]. Conversely, specific software concepts and features of the DM/6000 that are intended to address software availability and recovery issues can be used in standard RS/6000 system environments to improve software availability in those environments. This is important, for it implies that the effort to improve software availability and recovery characteristics is not specific to the DM/6000 hardware, but is more generally useful across the entire product line.
The physical architecture, illustrated in Figure 2, incorporates redundancy to meet reliability and availability requirements. The processor core, the interfaces to the L3 memory array, and the interfaces to the I/O channel subsystems are triplicated. All of the components of one rail of this Triplicated Processor Core (TPC), including up to 4 processors, the L2 memory/cache, and the I/O channel interfaces, together with all of the high speed, wide data width, shared data buses interconnecting these components, are packaged together on a single card (the PC card). Three PC cards together form the TPC. The PC cards operate in tight synchronism with one another. The data paths on to and off of the PC cards are point to point byte or halfword serial links with clocks accompanying all data lines. The result is that all the performance critical data buses in the system are localized on the PC cards and can operate at full processor clocking rates. The off card interconnect is (pin) manageable and hot-pluggable.
Figure 1: DM/6000 Logical Architecture
Figure 1 illustrates the logical, or software, appearance of the hardware architecture of a DM/6000 system. Hardware redundancy is not visible to the operating system or application software. This logical architecture is a classic high performance computer architecture in the mold of emerging RISC based server class machines. The processor core consists of four 601 PowerPC RISC engines making up a tightly coupled shared bus Symmetric Multiprocessor (SMP). The memory structure is that of a multi level coherent cached memory, ultimately backed by a large main memory array (4 GB). Each processor has a private unified 8-way associative 32 KB Level 1 (L1) cache. There is a shared 64 MB Level 2 (L2) memory, split between 32 MB of directly addressable memory and 32 MB of cache that is backed by the large main store (L3 memory array). The system has from two to four I/O channels, which link the core to industry standards based I/O subsystems. The I/O channel link is flexible enough to accommodate a variety of standards (Micro Channel, ISA, EISA, PCI, IBM ES/9000 channels, etc.) but the prototype I/O subsystems are exclusively Micro Channel based.
Figure 2: Physical Architecture
The memory in the large L3 memory array is partitioned across six Symbol Planes (SP's). This partitioning provides an economical physical size for the memory card while facilitating physical partitioning of the memory into independent fault containment regions. The system is tolerant to the partial or total failure of any one of the Symbol Planes. Each SP is a single pluggable circuit card with on-card power regulation and battery backup.
Hardware Fault Tolerance
There are a wide variety of techniques available to achieve fault tolerance. Each of these techniques has its own strengths and weaknesses. No single technique is optimum in all situations. The DM/6000 employs a variety of fault tolerant techniques to optimize the reliability of the system, while minimizing system costs and impact on system performance. Each major area of the system uses a redundancy approach that most closely matches the performance requirements of that area without sacrificing economy.
• Alternate path redundancy is used to tolerate failures in I/O channel paths. In this approach, each critical peripheral resource is attached to the system through at least two independent paths. For example, a token ring might be accessed by either of two paths when a token ring adapter is installed on each of two independent Micro Channel subsystems. If a fault disables the path through one of these adapters, then an alternate path is available through the other.
• Triplicated redundancy with majority voting is used to mask faults in the TPC (a minimal voter sketch in C follows this list). Voting in each I/O channel path, and in each symbol plane, masks any fault in the TPC. Since the voters are external to the TPC, a voter failure causes, and is equivalent to, a failure of the fault containment region in which it is located (a single I/O channel, or a single symbol plane). Voter failures are tolerated as simply one of the contributing sources that could cause the failure of the region in which they are packaged.
• Error correcting codes are used to mask symbol plane failures. Data is striped across all symbol planes. Single symbol error correction circuitry in each rail is able to reconstruct any lost data, should a symbol plane fail. The advantage of this structure is that it minimizes the cost of protecting memory, while at the same time providing a very high performance memory system. Since all planes operate in parallel, this memory structure is inherently high performance. Each symbol plane also includes internal single bit error correction to suppress soft errors.
• A distributed fault tolerant clock provides all logic timing in the system. Each of the three rails of the TPC has an independent voltage controlled crystal oscillator, and they are mutually phase locked to one another. Other fault containment regions of the system receive triplex FT clock transmissions from the TPC and phase lock to the majority phase to produce a local copy of the FT clock for their own use. This clocking is essentially the same in concept as that of [7][8] and is not discussed further in this paper.
• The primary power input and the cooling systems use dual redundancy with full system capacity available in each of the redundant paths. For this prototype, each of the primary power supplies converts 220 VAC to -48 VDC. This intermediate power is distributed by dual -48 VDC power buses. Normally, the load is shared between these two buses. When a failure occurs in either of the primary power supplies, or they lose their AC feed, the remaining bus rail automatically provides current to support the full load until the failed unit is repaired.
• Final power regulation and the backup batteries are distributed. Each fault containment region, or circuit card, has its own private power regulator that draws input power from either or both of the two intermediate -48 VDC power buses. The failure of a power regulator will cause the loss of the associated fault containment region, but will not affect the operation of any other fault containment region in the system. If a fault containment region requires battery backup (the PC cards and the SP cards), it is fitted with a private battery and a charger that draws power from the intermediate power network.
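As a concrete illustration of the majority voting mentioned above, the following C sketch shows how a voter outside the TPC might combine one byte arriving from each of the three rails. It is a minimal sketch only: the function and type names are ours, and the real voters also handle fault codes from fail-stopped rails, which is not shown here.

    #include <stdint.h>
    #include <stdio.h>

    /* Bitwise 2-of-3 majority vote over one byte received from each rail.
     * Any single faulty rail is masked; *disagree is set when the three
     * inputs were not identical, so the faulty rail can be reported.     */
    static uint8_t vote3(uint8_t a, uint8_t b, uint8_t c, int *disagree)
    {
        uint8_t majority = (a & b) | (a & c) | (b & c);
        *disagree = (a != b) || (b != c);
        return majority;
    }

    int main(void)
    {
        int disagree;
        /* Rail B delivers a corrupted byte; the vote still yields 0x5A. */
        uint8_t out = vote3(0x5A, 0xDA, 0x5A, &disagree);
        printf("voted = 0x%02X, disagreement = %d\n", out, disagree);
        return 0;
    }

Because the vote is computed bit by bit, a single faulty rail is masked regardless of which bits it corrupts.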
Processor (PowerPC 601)
To meet demanding high end server requirements, the DM/6000 design point was chosen as a 4-way SMP with high performance I/O channels.
Each of the four 601 PowerPC microprocessors on a PC card runs in cycle synchronism with the associated PowerPC 601 chip in the adjacent rails. At system startup a maintenance processor pre-loads the 601 PowerPC's memory with its initial program (bootcode) and synchronously releases triads of PowerPC 601's in the rails from reset. Each triad of PowerPC 601's continues to operate in tight synchronism as a single logical processor.
An internal rail fault will cause that rail to diverge from or lose sync with the other two rails. Majority voting in the symbol planes and in the I/O channels will mask the faulty rail. Because of the tight level of integration on the PC card, any fault on the card generally causes the whole card to lose sync and fail.
The 601 PowerPC's operate at a bus clock rate of 40 MHz and an internal clock rate of 80 MHz. The on chip L1 cache of each 601 PowerPC maintains cache coherency using the standard 601 PowerPC bus snooping protocol. Snoops are also presented to the L1 caches during I/O by the memory controller so that I/O activity is cache coherent.
L2 Memory and PC Card Interconnect
The L2 memory is the PC card's primary storage array. The L2 memory is shared by the 601 PowerPC processors, I/O channel interfaces, and the symbol plane (L3) memory interface. The array is 64 Mbytes of high speed DRAM, physically organized into two independent "banks" that are interleaved by modulo 256-byte address blocks. A multi-ported interconnect fabric/controller manages all data exchanges between PC card components. This interconnect provides for simultaneous access to both L2 banks at 320 MB/s for each bank. A 32 MB portion of this L2 memory is logically partitioned from the array and operated by the interconnect as a cache for the SP (L3) memory. Tag arrays for this cache are internal to this interconnect.
The shared L2 cache has the following features:
• 32,768 direct mapped lines.
• 1024 byte lines with four 256 byte blocks per line, each with individual valid and modified status bits (the address decomposition implied by this geometry is sketched after this list).
• Store-in operation with hardware managed automatic write-back of dirty lines on reuse.
• Critical L1 line first, block fill, on read misses.
• Programmable line fill; a miss can fill either the entire line or just the accessed block.
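The stated geometry implies a simple address decomposition for the direct mapped lookup. The C sketch below derives the field boundaries from the figures given above (32,768 lines x 1024 bytes = 32 MB of cache, four 256 byte blocks per line); the structure and field names are our own illustration, and the tag width simply assumes the full 4 GB L3 address range.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical decomposition of a 32-bit L3 (real) address into the
     * fields implied by the stated L2 cache geometry: 32,768 direct mapped
     * 1024-byte lines (32 MB of cache) with four 256-byte blocks per line. */
    struct l2_addr {
        uint32_t offset; /* byte within a 256-byte block   : addr[7:0]   */
        uint32_t block;  /* block within the 1024-byte line: addr[9:8]   */
        uint32_t line;   /* direct mapped line index       : addr[24:10] */
        uint32_t tag;    /* remaining high order bits      : addr[31:25] */
    };

    static struct l2_addr l2_decode(uint32_t addr)
    {
        struct l2_addr a;
        a.offset = addr & 0xFFu;            /* 256 bytes per block         */
        a.block  = (addr >> 8) & 0x3u;      /* 4 blocks per line           */
        a.line   = (addr >> 10) & 0x7FFFu;  /* 32,768 lines                */
        a.tag    = addr >> 25;              /* 4 GB / 32 MB => 7 tag bits  */
        return a;
    }

    int main(void)
    {
        struct l2_addr a = l2_decode(0x12345678u);
        printf("tag=%u line=%u block=%u offset=%u\n",
               a.tag, a.line, a.block, a.offset);
        return 0;
    }

On a hit, the tag array entry for the selected line would additionally supply the per-block valid and modified bits noted in the list above.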
All processor access to L3 memory is through the L2 cache. The remaining portion of the L2 array is directly accessible. This L2 direct memory can be used for kernel code storage, application processor private storage (stacks, automatic variables, ...) and for I/O buffers, to avoid undesirable interactions between code, private memory, or I/O buffers and L3 caching. The actual details of the memory hierarchy are of course not visible to the software (real memory appears as a single flat address space), but the varying access characteristics can be used by the virtual memory manager to tune performance.
Byte parity is maintained in the L2 memory, with parity checks performed by the processors, I/O interfaces, and the SP interface. After a failure in one rail, and until that faulty rail is repaired, the system is susceptible to soft errors in the L2 memories of either of the two operating rails. The parity in the L2 memory provides for the detection of these soft errors before they propagate off rail. A rail fail-stops if it detects a parity error, signaling the surrounding voters of this condition. The voters use this notification to ignore inputs from the stopped rail. The remaining rail is able to continue after a second failure, when the second failure is detected by the on-card parity checks. This provides tolerance of the most common double fault mode, a hard rail failure followed by a soft error in the L2 memory in one of the two remaining rails.
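For concreteness, the C sketch below shows the kind of per-byte parity generation and checking implied here; the helper names are ours, and the actual parity polarity and checking circuitry of the DM/6000 are not specified in the text.

    #include <stdint.h>
    #include <stdio.h>

    /* Returns 1 if the byte contains an odd number of one-bits, i.e. the
     * parity bit that must be stored alongside the byte so that the nine
     * bits together always contain an even number of ones.               */
    static unsigned parity_bit(uint8_t byte)
    {
        byte ^= byte >> 4;
        byte ^= byte >> 2;
        byte ^= byte >> 1;
        return byte & 1u;
    }

    /* A consumer (processor, I/O interface, or SP interface in the text's
     * terms) would fail-stop the rail when this check reports a mismatch. */
    static int parity_ok(uint8_t byte, unsigned stored_parity)
    {
        return parity_bit(byte) == stored_parity;
    }

    int main(void)
    {
        uint8_t b = 0xA7;                     /* five one-bits -> parity 1 */
        unsigned p = parity_bit(b);
        printf("parity(0x%02X)=%u ok=%d\n", b, p, parity_ok(b, p));
        /* A single-bit soft error flips the data but not the stored parity. */
        printf("after bit flip ok=%d\n", parity_ok(b ^ 0x10, p));
        return 0;
    }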
When a faulty rail is repaired, its internal state must be aligned to that of the remaining rails before it can be successfully resynchronized with them. The bulk of this state is contained in the L2 memory. A special hardware assist performs this resynchronization of the L2 memory. The 64 MB L2 (which includes L3 cache content) can be resynchronized at 320 MB/s, or in 0.2 seconds.
The L2 memory is backed up by batteries, protecting against a total primary power outage. The memory automatically switches to battery as the power fails and will hold its state for a minimum of 48 hours. When primary power is restored, the L2 switches back to primary power and the system can be restarted without rebooting. A primary power fail interrupt provides adequate warning so that volatile processor state can be flushed to L2 memory before power is lost.
Symbol Plane (L3) Memory
The SP memory provides up to 4 GB of battery backed memory. At the PC card boundary, L3 data words are 8 bytes (64 bits) wide plus 4 bytes (32 bits) of single symbol error correcting code (SSECC). This code word is striped across the six symbol planes, 4 data SP's and 2 ECC planes. The SSECC can recover lost data even after the failure of an entire symbol plane.
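The striping step itself is easy to picture: each 8-byte data word plus its 4 bytes of SSECC is cut into six 16-bit slices, one per symbol plane. The C sketch below shows only this scatter step under that assumption; it does not implement the S4EC code (the check bytes are left as a placeholder), and the byte-to-plane assignment shown is arbitrary, whereas the hardware actually interleaves 4-bit nibbles across the planes within each EDAC group, as described later in this section.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define NUM_PLANES   6   /* 4 data symbol planes + 2 check planes        */
    #define DATA_BYTES   8   /* one L3 data word at the PC card boundary     */
    #define CHECK_BYTES  4   /* SSECC check symbols accompanying the word    */

    /* Placeholder for the SSECC generator: the real design uses an S4EC
     * Hamming-type code over 4-bit symbols; here we simply zero the field. */
    static void generate_check(const uint8_t data[DATA_BYTES],
                               uint8_t check[CHECK_BYTES])
    {
        (void)data;
        memset(check, 0, CHECK_BYTES);
    }

    /* Stripe a 12-byte code word (8 data + 4 check) across the six planes,
     * two bytes per plane, matching the 16-bit per-plane interface width.  */
    static void stripe(const uint8_t data[DATA_BYTES],
                       uint16_t plane[NUM_PLANES])
    {
        uint8_t cw[DATA_BYTES + CHECK_BYTES];
        memcpy(cw, data, DATA_BYTES);
        generate_check(data, cw + DATA_BYTES);
        for (int p = 0; p < NUM_PLANES; p++)
            plane[p] = (uint16_t)((cw[2 * p] << 8) | cw[2 * p + 1]);
    }

    int main(void)
    {
        uint8_t word[DATA_BYTES] = {0xDE,0xAD,0xBE,0xEF,0x01,0x23,0x45,0x67};
        uint16_t plane[NUM_PLANES];
        stripe(word, plane);
        for (int p = 0; p < NUM_PLANES; p++)
            printf("plane %d carries 0x%04X\n", p, plane[p]);
        return 0;
    }

Because each plane holds only its own slice of the code word, losing one plane removes a single symbol per word, which is exactly the erasure the SSECC is designed to recover.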
The SP memory has two distinct components, the Symbol Plane Interface (SPI) in the TPC and the SP's themselves. Communication between the SP's and the TPC is tightly synchronous, with active Vernier skew compensation being used to remove all clock skew effects [6].
TPC Symbol Plane Interface
The SPI is in the TPC. It attaches to the SP memory port of the PC card interconnect fabric/controller and is responsible for detailed control of the PC interface to the six SP's. Beyond normal sequencing and flow management, the SPI also provides the SSECC circuitry that tolerates the loss of any single SP.
The Symbol Plane Interface consists of two identical SPI ASIC's. During writes they generate the check codes for the data before it is passed out of the TPC. Read data from the SP's is checked and lost symbols are reconstructed by these same ASIC's. The generation and checking of the SSECC is performed entirely within the TPC, with the SP's voting the write data from the three rails. The voting makes the SP's tolerant to the failure of any one of the SPI ASIC's on any single rail, and an SPI ASIC failure only affects the read data bound for that particular rail.
Pinning requirements and circuit density demanded that the SPI be partitioned into two ASIC's. By providing half of the data and half of the check codes to each of the ASIC's, these two circuits were made identical and independent. Each SPI ASIC supports 8 of the 16 data connections to each of the SP's. Thus each SPI ASIC processes 32 data bits and 16 check bits on the out-board side and 32 data bits and 4 parity bits on the in-board side.
Internally, each of the SPI ASIC's implements two error detection and correction (EDAC) logic blocks. Each block handles four 4-bit data nibbles (one nibble from each of the four data SP's) and two 4-bit check nibbles (one nibble from each of the two check SP's). An S4EC Hamming-type code [3] is implemented on each of these nibble groups.
Each EDAC is able to detect and correct up to one nibble failure. 99.96% of all double faults are detected. The 4 EDAC's operating in parallel are able to pipe the 64 bits of data through the SPI chip set at full bus speeds.
The prototype made several expediency compromises which would not necessarily be reflected in a product. A nibble code was chosen for decoder/encoder efficiencies, minimizing ASIC design complexity. An eight bit symbol would probably be chosen for a product as it provides better scalability to larger memories. Two error correcting planes can protect up to 128 data planes [3] in such a configuration. Additionally, a sixteen bit wide data path operating at a relatively modest 40 MHz was used. This avoided the need for special driver and receiver macros for the ASIC's. A product would likely invest in these macros, boosting the serial data rate by a factor of 8 and narrowing the data path to individual symbol planes accordingly. With narrower data paths to each plane, and a scalable error correction code, it is possible to design the ASIC's to handle 4, 8 or 16 data planes without exhausting available pins. A large memory of 16 planes has a more attractive 12.5% ECC overhead, while retaining the option of operating with fewer planes in smaller configurations.
The prototype's 16 bit wide interface to each symbol plane is also used to transmit commands and addresses to the symbol planes. Note that identical commands and addresses must be transmitted to each symbol plane. During address and command transfers the EDAC circuits are bypassed and identical information is sent to each symbol plane. The resultant SPI/SP bandwidth is 320 MB/s for data (useful data, excluding SSECC) and 80 MB/s for addresses and commands. As an example, a READ cache-line command can be transmitted to the Symbol Planes in 3 cycles (3 bytes of identical command and address information to each plane), and the line can be returned in 128 cycles (1024 bytes at 8 bytes/cycle).
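The C snippet below simply reproduces this arithmetic, taking the 40 MHz interface clock mentioned earlier: 8 useful data bytes per cycle gives the 320 MB/s peak figure, and a full line fetch of 3 command cycles plus 128 data cycles yields a derived effective rate (the "effective rate" number is our calculation, not a figure from the text).

    #include <stdio.h>

    /* Worked example of the SPI/SP transfer rates quoted in the text:
     * a READ cache-line command (3 command/address cycles) followed by a
     * 1024-byte line returned at 8 data bytes per 40 MHz cycle.          */
    int main(void)
    {
        const double clock_hz    = 40e6;   /* SPI/SP interface clock      */
        const double data_bytes  = 1024.0; /* one L2 line                 */
        const double cmd_cycles  = 3.0;    /* command + address transfer  */
        const double data_cycles = data_bytes / 8.0;        /* 128 cycles */
        const double total_s     = (cmd_cycles + data_cycles) / clock_hz;

        printf("peak data rate  : %.0f MB/s\n", 8.0 * clock_hz / 1e6);
        printf("line fetch time : %.2f us\n", total_s * 1e6);
        printf("effective rate  : %.1f MB/s\n", data_bytes / total_s / 1e6);
        return 0;
    }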
The SPI also provides inter-rail exchange of L2 data during L2 resynchronization, and maintenance system access to the Symbol Planes.
Symbol Plane
Each symbol plane may be populated with up to 1 GByte (2^30 = 1,073,741,824 bytes) of battery backed DRAM. All logic and data path functions in an SP are implemented in a single ASIC, the SPC-ASIC. Each SPC manages a logical half duplex 2 byte data interface with the SPI's in the TPC. Data, command, and address information is transferred at 2 bytes per clock cycle. The SP has a local clock system that is slaved to the fault tolerant clock system in the TPC. Refresh of the DRAM is managed on the SPC, with provisions for synchronizing the refresh between all of the SP's in the system. When primary power is lost to the SP, the SPC automatically switches to battery power until primary power is restored.
An SPC handles the triplication in the SPI interface. The logical 2 byte interface consists of three replicated 2 byte interfaces, one to each of the three rails of the TPC. Data, commands, and address information arriving from the TPC are majority voted at the receiving SPC interface to mask (and detect) errors caused by failed TPC rails and rail out-of-sync conditions. Data to be transmitted to the TPC is first triplicated, then buffered and sent to each of the three PC cards. In a hard rail fail scenario, the SPC voter circuits additionally function so as to pass good data when one of the two operational rails transmits a fault code instead of a valid data/command symbol. This occurs when a parity error is detected in a PC card.
Data is stored on an SP in 64-bit words. Access to an SP is on a 64-bit word basis only. Since multiple SP's always operate in parallel, this corresponds to a 32 byte block minimum access granularity to the TPC. All processor accesses to SP address space are cached in the TPC. Thus processor loads or stores can only result in cache-line operations at the SP memory. I/O operations can be as small as the 32 byte granularity of the SP memory.
Within a symbol plane, each 64-bit data word is coded into a 72-bit SEC-DED code. This ECC prevents common single bit soft errors in the DRAM from causing symbol errors at the SPI. The DRAM chip organization is not by-one (x1), so this code is not adequate to correct all chip failures, but it does correct many single chip hard failures. The ECC does detect all single chip errors and these are flagged to the SSECC in the TPC. Chip failures which cannot be corrected on the SP are corrected by the SPI's SSECC code.
Contaminated or invalid data in an SP must be regenerated using the SSECC and the data from good SP's. This most commonly occurs after an SP failure has been repaired, or after a routine upgrade to SP memory, when an entire SP is uninitialized. The entire content of an SP can be regenerated by reading and rewriting all of memory and using the SPI SSECC circuitry to regenerate each word as it is read from SP memory. This scrubbing of the symbol plane data is performed under the control of the SPC upon command from the maintenance processor. A scrub is accomplished by sequentially reading the words in the symbol plane memory, transmitting them to the SPI, which passes the data through its SSECC circuitry, regenerating any lost symbols, internally wrapping the data in the SPI, and transmitting it back to the SPC's where it is ultimately rewritten, correcting the content of SP DRAM memory.
Two types of scrubs are supported:
1. High Speed Scrub (160 MB/s peak). Normal system operations can be performed while a high speed scrub is in progress, but the scrub has some impact on system performance.
2. Background Scrub (<1 MB/s). Background scrub is performed at a slow enough rate so that it has no impact on the system throughput.
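A scrub pass can be pictured as the loop in the C sketch below: each word is read from the planes, pushed through the SPI's SSECC correction path, and written back. The helper functions and the tiny in-memory model are hypothetical stand-ins for hardware operations, not an interface of the DM/6000; only the read-correct-rewrite control flow comes from the description above.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of one scrub pass.  The helpers stand in for SPC/SPI
     * hardware operations (names are ours): read a 64-bit word, pass it
     * through the SSECC corrector, and rewrite it.                       */
    #define WORDS 4
    static uint64_t sp_mem[WORDS] = { 0x1111, 0xDEAD, 0x3333, 0x4444 };

    static uint64_t sp_read_word(unsigned i)              { return sp_mem[i]; }
    static void     sp_write_word(unsigned i, uint64_t w) { sp_mem[i] = w;    }

    /* Pretend word 1 was lost and is regenerated from the other planes.  */
    static uint64_t spi_ssecc_correct(unsigned i, uint64_t raw)
    {
        return (i == 1) ? 0x2222 : raw;
    }

    static void scrub(void)
    {
        for (unsigned i = 0; i < WORDS; i++)
            sp_write_word(i, spi_ssecc_correct(i, sp_read_word(i)));
    }

    int main(void)
    {
        scrub();
        for (unsigned i = 0; i < WORDS; i++)
            printf("word %u = 0x%04llX\n", i, (unsigned long long)sp_mem[i]);
        return 0;
    }

In the real machine the same loop runs over the entire SP address range, at either the high speed or the background rate listed above.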
There are three normal access modes that are supported by the SPC. These are:
1. Cache block fetch - reads a 256 byte block from the SP memory for storing in an L2 cache line. The ordering of this read is: the critical double word of the critical L1 line first, with the remainder of the critical L1 line filled in a round robin fashion, and then the remaining L1 lines of the L2 line filled in a round robin fashion with the double words within these L1 lines filled in ascending address order (this ordering is sketched in code after this list). Returning the critical double word first minimizes the latency as seen by a PowerPC 601 for a read access that misses both the L1 and L2 caches. The likelihood of a subsequent PowerPC 601 stall due to the same L2 cache miss is minimized by returning the remainder of the critical L1 line immediately following the critical double word.
2. Cache block write - supports write-back of a 256 byte L2 cache line as it is being cast out. The data ordering for cache writes is strictly ascending address order.
3. Asynchronous read/write operation - is used for in-memory data moves which are asynchronous with respect to processor instruction execution. Asynchronous memory to memory moves are most commonly used by I/O device drivers.
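The cache block fetch ordering of access mode 1 can be generated as in the C sketch below. The 256 byte block and 8 byte double word sizes come from the text; the 32 byte L1 line size is an assumption (it is not stated in this excerpt), and the function names are ours.

    #include <stdio.h>

    /* Sketch of the "critical double word first" return ordering for a
     * cache block fetch.  Assumed geometry: 256-byte block, 8-byte double
     * words, 32-byte L1 lines, i.e. 4 double words per L1 line and 8 L1
     * lines per block.                                                    */
    #define BLOCK_BYTES 256
    #define DW_BYTES      8
    #define L1_BYTES     32
    #define DW_PER_LINE  (L1_BYTES / DW_BYTES)        /* 4 */
    #define LINES        (BLOCK_BYTES / L1_BYTES)     /* 8 */

    /* Fill order[] with double-word indices (0..31) in return order, given
     * the byte offset within the block of the double word that missed.    */
    static void fetch_order(unsigned critical_offset, unsigned order[])
    {
        unsigned crit_dw   = (critical_offset / DW_BYTES) % (BLOCK_BYTES / DW_BYTES);
        unsigned crit_line = crit_dw / DW_PER_LINE;
        unsigned n = 0;

        /* 1. critical double word, then the rest of its L1 line, wrapping */
        for (unsigned i = 0; i < DW_PER_LINE; i++)
            order[n++] = crit_line * DW_PER_LINE + (crit_dw + i) % DW_PER_LINE;

        /* 2. remaining L1 lines round robin from the next line, ascending
         *    double-word order inside each of those lines                 */
        for (unsigned l = 1; l < LINES; l++) {
            unsigned line = (crit_line + l) % LINES;
            for (unsigned i = 0; i < DW_PER_LINE; i++)
                order[n++] = line * DW_PER_LINE + i;
        }
    }

    int main(void)
    {
        unsigned order[BLOCK_BYTES / DW_BYTES];
        fetch_order(0x58, order);               /* miss in double word 11 */
        for (unsigned i = 0; i < BLOCK_BYTES / DW_BYTES; i++)
            printf("%u%s", order[i], i + 1 < BLOCK_BYTES / DW_BYTES ? " " : "\n");
        return 0;
    }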
I/O Subsystem
Logically, I/O interfacing follows the model of a bridge between the processor private memory interconnect (bus) and industry standard I/O buses. The architecture has been tailored and optimized to support the requirements of both the server environment (high bandwidth, numerous diverse device connectivity) and fault tolerance (fault containment, fault masking, and redundant paths).
Typical desk top and desk side machines bridge to the I/O bus in a single VLSI chip. This implementation is physically constrained, as this single chip must be electrically close to two high performance buses -- the processor private interconnect and the I/O bus. This has the practical effect of limiting the number of I/O buses to one or two, which is problematic for servers, and does not provide good granularity with respect to failures.
A fault tolerant machine such as the DM/6000 needs to limit the impact of channel path failures. This demands a minimum of two I/O bus subsystems, preferably more for a large server. The DM/6000 additionally requires voting (for output) and replication (for input) in the bridge. All these requirements must be provided in a fashion that supports concurrent maintenance with hot-pluggability.
The DM/6000 solution to all of these requirements is a "split-chip" bridge design. The bridge chip is split into two pieces, one to interface with the processor/memory bus and one to interface with the I/O bus. This is shown in Figure 3 (triplication of the processor core is not shown). Between the two halves is a (triplicated), full duplex, point-to-point byte serial link. The link signaling technology supports full processor speed communications with an I/O subsystem located up to 10 meters away from the TPC.
The DM/6000 PC card supports up to four Link Interface ASIC's (the LI ASIC's) connected to the PC interconnect's I/O port. On the I/O channel side, the LI connects through the high speed full duplex byte serial link to a remote (off card) bus controller ASIC (the Remote Bus Controller or RBC ASIC). The link implements a transaction protocol with retry and allows two transactions to be outstanding on the link in each direction. The RBC ASIC is I/O bus specific and, in the case of the prototype, bridges to a Micro Channel (RBC-MCA).
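The two-outstanding-transaction limit amounts to a small sliding window per link direction. The C sketch below models that limit with a simple counter; the names and structure are ours, and the link's tagging and retry details are not specified in the excerpt beyond what is stated above.

    #include <stdio.h>

    /* Minimal model of one link direction that permits at most two
     * transactions to be outstanding at once, as stated for the LI-RBC
     * link.  Names and structure are illustrative only.                  */
    #define MAX_OUTSTANDING 2

    struct link_dir {
        int outstanding;      /* transactions sent but not yet completed */
    };

    static int  link_can_send(const struct link_dir *d)
    {
        return d->outstanding < MAX_OUTSTANDING;
    }
    static void link_send(struct link_dir *d)     { d->outstanding++; }
    static void link_complete(struct link_dir *d) { d->outstanding--; }

    int main(void)
    {
        struct link_dir out = { 0 };
        link_send(&out);                      /* transaction 1 in flight  */
        link_send(&out);                      /* transaction 2 in flight  */
        printf("can send a third? %s\n", link_can_send(&out) ? "yes" : "no");
        link_complete(&out);                  /* one completes or retries */
        printf("after completion? %s\n", link_can_send(&out) ? "yes" : "no");
        return 0;
    }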
Figure 3: Logical I/O Structure
The central processor burden in managing the system I/O is of critical importance in a high performance server. The DM/6000 relegates the mundane tasks of controlling the peripheral devices to I/O processors located in the I/O bus subsystems. Communications between the TPC processors and the I/O controllers consist of 'START-IO' commands (messages) from the TPC processors and completion status or exception interrupts from the I/O controllers. This technique has been widely used in high performance computing systems and has proved capable of insulating the processor core from the relatively slow response characteristics of I/O subsystems.
The I/O subsystem is allowed DMA access to the memory of the TPC complex. The TPC processors have memory mapped program (load/store) access to the I/O subsystem address space. A TPC processor may start I/O by a single memory mapped store to an interrupt request register in the RBC, passing a pointer to an argument list (or control block) in memory. This 'START-IO' write executes in a single cycle of the main processor; the processor execution stream is not interlocked with successful completion of the requested write, but is immediately released.
When the I/O subsystem gets the START-IO, it fetches the argument list or control block from storage, performs any requested functions, returning status through the control block, and signals the TPC upon completion. Any data movement to or from main store is the I/O subsystem's responsibility. Hardware exceptions which might occur during the START-IO write are captured by the LI in status registers and presented to the TPC processor as a normal I/O exception interrupt.
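The START-IO handshake can be sketched as below. Everything concrete here is illustrative: the control block layout, register, and field names are invented for the example, since the paper does not give them; only the flow (a single memory-mapped store of a control block pointer, with completion reported later through the control block and an interrupt) follows the description above.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical control block for a START-IO request; the real layout
     * is not given in the paper.                                          */
    struct io_control_block {
        uint32_t command;         /* operation requested of the I/O processor */
        uint32_t buffer_addr;     /* main-store buffer address for device DMA */
        uint32_t length;          /* transfer length in bytes                 */
        volatile uint32_t status; /* filled in by the I/O subsystem when done */
    };

    /* Stand-in for the RBC's memory-mapped interrupt request register; on
     * the real machine this would be a fixed I/O-space address, not a
     * program variable.                                                    */
    static volatile uintptr_t rbc_start_io_reg;

    /* START-IO: one store of the control block pointer; the processor is
     * released immediately and is not interlocked with completion.         */
    static void start_io(const struct io_control_block *cb)
    {
        rbc_start_io_reg = (uintptr_t)cb;
    }

    int main(void)
    {
        struct io_control_block cb = { .command = 1, .buffer_addr = 0x1000,
                                       .length = 512, .status = 0 };
        start_io(&cb);
        /* The I/O processor would later fetch cb via DMA, do the work,
         * set cb.status, and raise a completion interrupt to the TPC.      */
        printf("START-IO posted, control block at %p\n", (void *)&cb);
        return 0;
    }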
The TPC processor must also protect itself against passive failures of an I/O subsystem by appropriate software time-outs and other defensive measures. Protection from errant I/O DMA (unrestricted writes by the I/O subsystem to main memory) is provided by memory protection and mapping functions in the LI ASIC, which can be used to restrict I/O DMA access to a (relocatable) I/O buffer. Because this protection mechanism is within the triplicated portion of the system, normal majority voting suffices to protect the system from LI mapping hardware failures.
The LI-RBC design is a rather straightforward imitation of a plausible but unexceptional I/O architecture for a non-FT machine that addresses server connectivity and performance issues. This unexceptional characteristic is considered a virtue here in that
loss of sync in the TPC. The classic Byzantine fault handling algorithm, used in the system of [7] for example, has the serious drawback that it requires that all incoming data be fully exchanged among TPC rails. With a 9 bit wide data path (control flag and an 8 bit data byte), this exchange requires 36 pins per I/O channel path for inter-rail data exchange. Total pin count for each channel, including the full duplex RBC-LI path, is about 62 pins (including clocking and parity lines), or about 250 pins for all four channels. This was not feasible for this design as there are inadequate pins on the TPC card or on the LI ASIC to dedicate 250 pins to I/O channel paths.
A new Byzantine fault resistant input algorithm was developed to address this problem. This algorithm is illustrated in Figure 5. Incoming data passes through a pipeline.
A byte (9 bits) and a parity bit are replicated by the RBC and transmitted to the LI in each of the three rails of the TPC during each clock. On receipt, each LI computes a single bit parity signature (PS) per cycle using only the 9 bit byte (received parity is ignored in this calculation). This PS is computed ov