United States Patent
Brey et al.

US005274646A
(11) Patent Number: 5,274,646
(45) Date of Patent: Dec. 28, 1993

(54) EXCESSIVE ERROR CORRECTION CONTROL

(75) Inventors: Thomas M. Brey, Hyde Park; Matthew A. Krygowski, Hopewell Junction; Bruce L. McGilvray, Pleasant Valley; Trinh H. Nguyen, Wappingers Falls; William W. Shen, Poughkeepsie; Arthur J. Sutton, Cold Spring, all of N.Y.

(73) Assignee: International Business Machines Corporation, Armonk, N.Y.

(21) Appl. No.: 686,721
(22) Filed: Apr. 17, 1991
(51) Int. Cl.: G06F 11/10
(52) U.S. Cl.: 371/40.1; 371/13; 371/21.6
(58) Field of Search: 371/13, 40.1, 40.2, 21.6; 395/575

(56) References Cited
U.S. PATENT DOCUMENTS
4,175,692   /1979    Watanabe ................ 371/40.2
4,604,751   8/1986   Aichelmann, Jr. et al. .. 371/26
4,661,955   4/1987   Arlington et al. ........ 371/3
4,888,773   12/1989  Arlington et al. ........ 371/40.2
4,964,129   10/1990  Bowden, III et al. ...... 371/40.2
5,058, 15   10/1991  Blake et al. ............ 371/40.1

Primary Examiner: Charles E. Atkinson
Attorney, Agent, or Firm: Bernard M. Goldman

(57) ABSTRACT
A method of automatically invoking a recoverable and fault tolerant implementation of the complemented/recomplemented (C/R) error correction method without the assistance of a service processor when an excessive error is detected in main storage (MS) by ECC logic circuits. An excessive error is not correctable by the ECC. These novel changes to the C/R method increase its effectiveness and protect the C/R hardware against random failure. Further, if an excessive error is corrected in a page in MS, an excessive error reporting process is provided for controlling the reporting using a storage map to determine if a previous correction in that page has been reported. If it has been reported, then no further reporting of soft excessive errors is made for that page. A service processor is signaled in parallel to update its persistent copy of the storage map so that on a next initialization of MS the memory map can be restored in the memory. The memory map is used to assist the repair of failing parts of MS, and is reset after MS is repaired.

24 Claims, 7 Drawing Sheets
`IPR2018-01474
`Apple Inc. EX1010 Page 1
`
`
`
U.S. Patent    Dec. 28, 1993    Sheet 1 of 7    5,274,646

FIG. 1
[Block diagram of the C/R hardware. Legible labels: MAIN STORAGE (MS); DU+ECC; INVERTER; ECC LOGIC with status outputs NO ERROR, 1 BIT CORR ERR, UNCORR, SPECIAL UE, and 2ND FETCH outputs H-H, H-S, S-S; ECC & SP GENERATOR; REGS (SEE FIGS. 6 & 7); PARITY CHECK; STATUS BIT INVERT CONTROL; DATA BUFFER; STATUS BUFFER 1 and STATUS BUFFER 2, each with PARITY CHECK; STATUS BUFFER OUTPTR COMPARATOR & PARITY CHK (OUT OF STEP); TO REQUESTOR; INTERRUPT.]
`
`
`
`
Sheet 2 of 7

FIG. 2
[Drawing of the data buffer with its INPOINTER and OUTPOINTER.]

FIG. 3
[Drawing of corresponding buffer entries: STATUS BUFFER 1 ENTRY (STATUS, PARITY), STATUS BUFFER 2 ENTRY (STATUS, PARITY), and DATA BUFFER ENTRY.]
`
`
`
`
Sheet 3 of 7

FIG. 4
[Memory request register 71. Legible fields: MEMORY REQUEST ADDRESS, RID, MEMORY ADDRESS REGISTER, FETCH.]

FIG. 5
[C/R SEQUENCER. Input: C/R REQUEST. States: 1ST FETCH, 1ST STORE, 2ND FETCH, 2ND STORE.]

FIG. 6
[Register 73, FROM MS CONTROLLER TO REQUESTOR PROCESSOR OR I/O. Fields: DATA (bits 0-63), NO ERROR, 1 BIT CORR. ERROR, EXCESS. ERROR, SPECIAL UE, H-H, H-S, S-S, RID.]

FIG. 7
[Register 74, FROM MS CONTROLLER TO SP. Fields: DATA (bits 0-63), ECC SYNDROME BITS, NO ERROR, 1 BIT CORR. ERROR, 2 BIT UNCORR. ERROR, SPECIAL UE, H-H, H-S, S-S, RID.]
`
`
`
`
Sheet 4 of 7

FIG. 8

1. REQUESTOR REQUESTS DATA LINE
2. FETCH ALL DU+ECC GROUPS IN REQUESTED DATA LINE FROM MS
3. GENERATE STATUS INFORMATION FOR EACH DU+ECC GROUP (INDICATE ANY EXCESSIVE ERROR FOR ANY GROUP)
4. SEND EACH FETCHED DU+ECC GROUP WITH ITS STATUS INFORMATION TO ITS REQUESTOR

END
`
`
`
`
Sheet 5 of 7

FIG. 9
[Flow diagram; some step numbers and branch targets are not legible in the scan.]

10. RE-FETCH REQUEST MADE BY REQUESTOR RECEIVING ERROR STATUS SIGNAL TO INVOKE C/R PROCESS TO ATTEMPT TO CORRECT THE ERRONEOUS DU (DATA UNIT)
11. FETCH DU+ECC FROM MAIN STORAGE (FIRST FETCH & NO INVERSION HERE)
12. GENERATE S = EXCESSIVE ERROR STATE, AND ITS PARITY P

SIMULTANEOUSLY:
14. STORE S, P BITS IN STATUS BUFFERS 1 AND 2 AT INPOINTER 1 AND 2 LOCATION
15. STORE DU+ECC GROUP IN DATA BUFFER AT INPOINTER 3 LOCATION
16. INCREMENT INPOINTER 1, 2 & 3 TOGETHER

18. FETCH TRANSFER COMPLETED WITHOUT CONTROL FAILURE? (NO) / (YES)
19. INCREMENT C/R SEQUENCER TO INITIATE 1ST STORE

SIMULTANEOUSLY:
- OUTPUT & COMPARE CONTENT OF STATUS BUFFERS 1 & 2
- PARITY CHECK CONTENT OF STATUS BUFFER 1 & 2
- COMPARE CONTENT OF OUTPOINTER 1, 2 & 3
- OUTPUT CONTENT OF DATA BUFFER
- INCREMENT OUTPOINTER 1, 2 & 3 TOGETHER
- INVERT DU+ECC IF STATUS BIT EQUALS [value illegible]
- DISABLE SECOND ECC LOGIC AND STORE DU AND ECC TO MAIN STORAGE

STORE COMPLETED WITHOUT CONTROL FAILURE? (NO) / (YES)
29. ACTIVATE SEQUENCE DEPENDENT RESTORATION (SEE TABLE)
31. INCREMENT C/R SEQUENCER TO INITIATE SECOND FETCH

GO TO FIG. 10
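The duplicated status buffers in the FIG. 9 flow (store S and P in two buffers, then compare the buffers, parity check each, and compare the outpointers on readout) can be sketched as follows. This is a minimal illustration, not the patented circuit: the class and exception names are hypothetical, and odd parity over the S and P bits is an assumption, since the text does not state the parity sense.

```python
class ControlFailure(Exception):
    """Hypothetical signal that the C/R control hardware is out of step."""


class RedundantStatus:
    """Two copies of the (S, P) status entries with separate outpointers."""

    def __init__(self):
        self.buf1 = []
        self.buf2 = []
        self.out1 = 0
        self.out2 = 0

    def push(self, s):
        # Assumption: odd parity, so S xor P is always 1.
        p = s ^ 1
        self.buf1.append((s, p))
        self.buf2.append((s, p))

    def pop(self):
        # Compare the outpointers first: they must advance together.
        if self.out1 != self.out2:
            raise ControlFailure("outpointers out of step")
        entry1 = self.buf1[self.out1]
        entry2 = self.buf2[self.out2]
        # Parity check each copy, then compare the two copies.
        for s, p in (entry1, entry2):
            if s ^ p != 1:
                raise ControlFailure("status parity error")
        if entry1 != entry2:
            raise ControlFailure("status buffers disagree")
        self.out1 += 1
        self.out2 += 1
        return entry1[0]
```

In this sketch a `ControlFailure` stands in for the "NO" branch of the flow, where the sequence dependent restoration would be activated so that the C/R operation can be retried.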
`
`
`
`
Sheet 6 of 7

FIG. 10
[Flow diagram continued from FIG. 9; some step numbers and branch targets are not legible in the scan.]

FROM FIG. 9

FETCH DU+ECC FROM MAIN STORAGE (SECOND FETCH)

SIMULTANEOUSLY:
54. OUTPUT & COMPARE CONTENT OF STATUS BUFFERS 1 & 2
55. PARITY CHECK CONTENT OF STATUS BUFFER 1 & 2
56. COMPARE CONTENT OF OUTPOINTERS 1 & 2
57. INVERT DU+ECC IF STATUS BIT EQUALS [value illegible]
58. INCREMENT OUTPOINTERS 1 & 2 FOR NEXT DU+ECC FROM MS

(TO DATA BUFFER) / (TO REQUESTOR)

ECC DETECTS 2-BIT ERROR? (S-S) (YES)
SIGNAL TO REQUESTOR & TO SP:
  S-S, 2-BIT UNCORRECTED ERROR, OR
  H-H, NO ERROR, OR
  H-S, 1-BIT ERROR CORRECTED
INTERRUPT SP
PASS DU+ECC TO REQUESTOR, THEN END

STORE DU+ECC INTO DATA BUFFER AT INPOINTER 3 LOCATION
72. INCREMENT INPOINTER 3
73. SECOND FETCH COMPLETED WITHOUT FAULT? (YES) / (NO)
74. INCREMENT C/R SEQUENCER TO INITIATE 2ND STORE

(2ND STORE)
81. AT OUTPOINTER 3 LOCATION, OUTPUT DATA BUFFER THRU 2ND INVERTER (WITHOUT INVERSION) AND THRU 2ND ECC TO CORRECT ANY 1-BIT ERROR IN THE DU
82. INCREMENT OUTPOINTER 3
84. STORE DU+ECC AT SAME MEMORY ADDRESS
SECOND STORE COMPLETED WITHOUT FAULT? (NO) / (YES)
88. ACTIVATE SEQUENCE DEPENDENT RESTORATION (SEE TABLE)

END
`
`
`
`
Sheet 7 of 7

[Flow diagram of error reporting (FIG. 11 per the Brief Description of the Drawings); the connections between boxes are not recoverable from the scan.]

Entry conditions: S-S DETECTED; H-H OR H-S DETECTED; SPECIAL UE DETECTED

Boxes:
- TEST LTR BIT IN HSA
- LTR BIT SET? (YES) / (NO)
- SET LTR BIT
- MC INTERRUPT & SET BIT 16, RESET BIT 32 IN MCIC
- MC INTERRUPT, SET BIT 17 IN MCIC
- MC INTERRUPT & SET BITS 16 & 32 IN MCIC
- GO TO NEXT INSTRUCTION IN TASK (NO INTERRUPTION)

LEGEND
BIT 16 = UNCORRECTABLE ERROR SENT FROM ERROR-FREE DU HARDWARE AND NOT A SPECIAL UE
BIT 17 = DU HARDWARE ERROR CORRECTED DURING TRANSMISSION
BIT 32 = SPECIAL UE CHARACTER TRANSMITTED FROM ERROR-FREE DU HARDWARE WITHOUT ERROR
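The once-per-page-frame reporting that the specification attributes to the Logical Track Record (LTR) can be sketched as below. This is only an illustration under stated assumptions: the class name, the set-based map, and the boolean return value are hypothetical, and the mapping from error case to machine check bits is deliberately left out because it cannot be recovered from the drawing.

```python
PAGE_FRAME_SIZE = 4096  # 4KB page frames, as in the specification


class LogicalTrackRecord:
    """Hypothetical in-memory map of already-reported excessive error cases."""

    def __init__(self):
        self._reported = set()

    def should_report(self, address, case):
        """Return True the first time `case` is seen in the page frame.

        `case` is one of "H-H", "H-S" or "S-S". Later occurrences of the
        same case in the same frame return False, i.e. no new interruption.
        """
        frame = address // PAGE_FRAME_SIZE
        key = (frame, case)
        if key in self._reported:
            return False
        self._reported.add(key)
        return True
```

A caller would raise the reporting interruption only when `should_report` returns True, and otherwise continue with the next instruction in the task.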
`
`
`
`
EXCESSIVE ERROR CORRECTION CONTROL

INTRODUCTION
The invention pertains to excessive error correction, its control, and its management in an efficient manner. An excessive error is an erroneous bit that cannot be corrected by the ECC (error correction code) provided with the data unit stored in a memory, such as the random access memory in a computer system.
BACKGROUND OF THE INVENTION
The complement/recomplement (C/R) type of error correction was disclosed in U.S. Pat. No. 3,949,208 entitled "Apparatus for Detecting and Correcting Errors in an Encoded Memory Word" to M. C. Carter and assigned to the same assignee as the subject application. The C/R technique has been used to augment the error correction capability of Hamming type ECCs (error correction codes) for data units stored in the memory of a computer system. The C/R technique has been used to correct one or more hard (H) errors in a data unit, leaving the ECC to correct any soft (S) error in the data unit.
A hard error is an error caused by a permanent fault in a circuit, such as a broken wire, and causes a bit position in memory to be stuck permanently in a given state, either a 1 or 0 state. A soft error is usually caused by an alpha particle changing the 0 or 1 state of a circuit; the soft error condition will not exist the next time other data is stored in that circuit. Thus a hard error remains permanently in the hardware while a soft error exists only in the single recording of a data unit. The C/R method corrects only the permanently stuck state of hard errors. The C/R method may be used with computer storage built with semiconductor dynamic random access memory (DRAM) chips.
The C/R method is initiated only after the ECC in a data unit finds an excessive error. Then the C/R process reads the data unit and complements (inverts) each bit value read. Then the C/R process stores the inverted data unit back into the same bit locations in memory. When stored in their original locations, only the erroneous bits in hard error locations revert to their prior stuck states. All non-erroneous bits, and any erroneous bits with soft errors, will be inverted in relation to the stuck bits with hard errors, which will not be inverted because of their stuck condition. A second fetch of the stored inverted data unit again inverts the read bits to correct all hard errors; and the ECC is then used only to correct any soft errors up to the maximum capability of the ECC. After this second inversion and at the end of the C/R process, the data unit is again stored in memory in its original location in its original erroneous form.
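The fetch, invert, store, fetch, invert sequence just described can be traced on a small word. The sketch below is an illustration only: it uses an 8-bit word instead of a 72-bit DU+ECC group, the helper names are hypothetical, and it assumes every stuck cell currently holds the wrong value for the data (a stuck cell that happened to agree with the data would come out inverted and would have to be handled by the ECC).

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def write_cells(value, stuck_mask, stuck_values):
    """Model a store: stuck cells ignore the written bit and keep their state."""
    return (value & ~stuck_mask & MASK) | (stuck_values & stuck_mask)


def complement_recomplement(cells, stuck_mask, stuck_values):
    """Return (word delivered before ECC correction, cells left in memory)."""
    first_fetch = cells
    inverted = ~first_fetch & MASK
    cells = write_cells(inverted, stuck_mask, stuck_values)      # first store
    second_fetch = cells
    recovered = ~second_fetch & MASK                             # second inversion
    cells = write_cells(first_fetch, stuck_mask, stuck_values)   # restore original form
    return recovered, cells


data = 0b10110010
stuck_mask = 0b10000001     # bit 7 stuck at 0, bit 0 stuck at 1: both wrong for `data`
stuck_values = 0b00000001
soft_flip = 0b00010000      # one soft error on bit 4

cells_before = write_cells(data, stuck_mask, stuck_values) ^ soft_flip
recovered, cells_after = complement_recomplement(cells_before, stuck_mask, stuck_values)
```

Here `recovered` differs from `data` only in the soft-error bit, which is left for the ECC to correct, and `cells_after` equals `cells_before`, matching the statement that the data unit ends up stored in its original erroneous form.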
ECCs (error correcting codes) have been commonly used in DRAM (dynamic random access memory) storage by large computer systems, i.e. main storage (MS) and extended storage (ES). The most commonly used ECC has been the SEC/DED (single error correction/double error detection) type, which can detect, but cannot correct, double-bit errors in any data unit (DU) when the DU is stored or transferred. If a second bit error (an excessive error) is detected in a DU when using such SEC/DED type of ECC, the second erroneous bit cannot be corrected by the ECC. However, the second erroneous bit (the excessive error in a system using SEC/DED) often can be corrected by the C/R method for the transmission of the data, in which the C/R method can correct any number of hard errors (H), but only a single soft error (S) per DU can be corrected. Accordingly, the combination of the C/R method and ECC can correct during transmission any number of hard and soft errors in a data unit up to the error detection capability of the ECC.
It is the transient characteristic of soft errors that prevents the C/R method from correcting any soft errors in a data unit. It is the ECC which corrects the soft errors. Hence, the combined C/R and ECC (SEC/DED) methods are limited to correcting one soft error in the transmission of a data unit, and the occurrence of two soft errors (the S-S case) is uncorrectable.
Both the C/R and ECC methods are limited to correcting stored errors only during the transmission of a data unit. The hard or soft errors existing in the data unit in memory remain in the memory from which the transmission occurred. The C/R method can only correct the complemented (inverted) readout of a memory data unit which is stored with hard errors.
After the successful completion of the C/R process, the stored data unit remains with the same erroneous bits in memory, but the requestor receives a corrected data unit if the number of soft errors does not exceed the ECC capability. The C/R method provides complete error correction if there are no soft error bits. And the C/R method enables complete error correction after it corrects all hard errors if the soft error bits can be corrected by the ECC. No error correction is obtained if the number of soft errors exceeds the capability of the ECC. For example, two soft errors in a data unit (the S-S error case) are not correctable by the C/R method if the maximum ECC capability is one erroneous bit per data unit.
The C/R method is much slower than the ECC correction process alone, because the C/R method requires two additional fetches and two additional stores in memory. Accordingly, the C/R method is not invoked unless an excessive error is detected. For example, when using an SEC/DED type of ECC, only two error bits per data unit can be detected. If only one error is detected (no excessive error exists), it is corrected by the ECC without initiating the C/R method.
`Although the C/R method can correct any number
`of permanent (hard) errors in a data unit, it may be
`limited by the maximum error detection capability of
`the ECC, since ECC error detection is used to control
`the initiation of the C/R method.
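The interplay between ECC capability and C/R invocation described in the preceding paragraphs can be summarized in a small decision function. This is an idealized sketch, not part of the patent: the function name and the result strings are hypothetical, and it assumes each hard error is a stuck cell holding the wrong value, so that C/R removes all hard errors.

```python
def outcome(hard, soft, correct_cap=1, detect_cap=2):
    """Classify a data unit with `hard` stuck-bit errors and `soft` errors.

    Defaults model SEC/DED: one correctable error, two detectable errors.
    """
    total = hard + soft
    if total == 0:
        return "no error"
    if total <= correct_cap:
        # No excessive error: the slower C/R method is never invoked.
        return "corrected by ECC alone"
    if total > detect_cap:
        # ECC detection controls C/R initiation, so this case is outside
        # what the ECC can be relied on to flag.
        return "beyond ECC detection capability"
    # Excessive error detected: C/R removes the hard errors, ECC the rest.
    if soft <= correct_cap:
        return "corrected by C/R plus ECC"
    return "uncorrectable"
```

With the SEC/DED defaults, `outcome(2, 0)` and `outcome(1, 1)` are the H-H and H-S cases that C/R can recover, and `outcome(0, 2)` is the uncorrectable S-S case.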
The C/R error correction technique has been effectively used in commercial computer systems having SEC/DED (single error correction/double error detection) ECC stored in memory to correct two errors in a data unit; the ECC alone has a maximum ability to correct a single error in a data unit. If the data unit has a hard error bit and a soft error bit (herein called the H-S case), the ECC correction is applied to the single soft error bit after the C/R operation has corrected the hard error bit.
Currently used large computer systems desiring the best type of maintenance store a record of the occurrence of all excessive errors, whether corrected or not. This is because excessive errors are not corrected in memory, even though excessive errors may be corrected for a requestor by using the C/R method to keep a task executing that would otherwise have to quit. Hence excessive error correction by the C/R method is considered outside the normal error correction ability
`
`
`
`
of the system. A C/R corrected data unit is vulnerable to crashing the system if another soft error occurs in it.

Other pertinent art was published in 1980 by Bossen and Hsiao in the IBM Research Journal, May 1980, page 390 entitled "A System Solution to Memory Soft Errors".

Stringent error reporting and accounting have been used to ensure closely coordinated system maintenance in prior large computer systems. They report excessive errors to a service processor (SP) in the system which maintains records of all significant error conditions occurring in the system to determine, for example, when to switch a CPU offline to perform maintenance.

Previously, an interruption was required to both the requesting processor and to memory operation before the C/R method could be invoked. C/R invocation occurred in response to error detection, which caused both the processor clock and the memory access to be stopped until the processor had recovered, and an interruption signal to be sent to the system service processor (SP). Then the SP interrupted its current program and performed a recovery action on the stopped processor, which was usually to retry the instruction which stopped executing when the processor's clock was stopped. After the recovery action was completed, the SP restarted the processor and the memory resumed access for normal operation. Then the processor issued a re-fetch request that invoked the C/R method for the data unit having the excessive error. If the excessive error still existed after operation of the C/R method, another processor stoppage occurred, etc. The next time the processor was restarted, it could record instruction processor damage to the task.

This prior operation of the C/R method was very slow because of the clock-stopping interruption to the requestor, the stopping of memory access, and the SP intervention before the C/R method could be invoked, which greatly reduced the efficiency of the system, involving milliseconds instead of the normal CPU microsecond speed for each operation of the C/R method. System performance was further severely degraded due to a machine check interruption causing a loss of all data in the CPU's cache, and a loss of all translations in the CPU's TLB (translation lookaside buffer), adding to the reduction in CPU performance due to the need to refetch all data lost in the cache and to retranslate all addresses lost in the TLB. The program task was ABENDed (abnormally ended) by a machine check interruption if the data was not corrected.

C/R error correction does not correct any hard error or any soft error in the memory itself, even though the C/R method may correct the hard errors and the ECC method may correct a soft error in the group only during its transmission to the requestor. However, an ECC corrected soft error could be corrected in MS by storing the corrected DU+ECC group in its original location, which is sometimes called "scrubbing" the data.

SUMMARY OF THE INVENTION

This invention makes the use of the C/R method more reliable and efficient, and less disrupting of system operations compared to prior use of the C/R method. This invention eliminates processor interruption upon the invocation and correct operation of the C/R method to significantly improve the efficiency of system performance when the memory technology has a significant number of excessive error conditions. And this invention augments the C/R method by facilitating its ability to retry its operation and by providing for the monitoring and reporting of the correction of excessive errors to reduce the need for system interruptions. This invention can reduce the amount of communications to a single excessive-error communication per memory page frame (or any other unit of memory reporting) even though excessive errors occur many times in a memory unit. The monitoring by this invention greatly reduces the involvement of both the operating system software and of a system service processor in C/R error corrections.

An object of this invention is to provide means in which the C/R process is done dynamically by a requesting processor and a memory controller so that intervention by a service processor is reduced or eliminated.

Another object of this invention is to allow C/R correction of a data unit without suspension of data processing by the requesting processor.

A further object of the invention is to make the hardware implementing the C/R method tolerant of some of the failures that can enable the method to correct an error when it otherwise may not have been corrected.

The invention significantly increases the value of the C/R method by providing it with retryability, which requires that no matter where failure occurs during execution of the C/R process, the original erroneous DU+ECC value existing prior to the start of the C/R method be restored in its original location, so that the C/R method can be retried. Failure-free recovery is essential to obtaining reliability in the use of the C/R method.

The invention automatically invokes the C/R method when the ECC method detects an excessive error (e.g. two erroneous bits in a data unit when the C/R method is used with SEC/DED type ECC). Then this invention uses the ECC error detection in combination with the C/R method to detect the different cases of double error combinations, which are the H-H, H-S and S-S cases. The detection of the H-H case reveals that two hard errors exist in a data unit in memory, which presents a failing condition for memory. The detection of the H-S case reveals that one hard error exists in a data unit in memory, which also presents a failing condition for memory. But the detection of the S-S case reveals that two soft errors exist in a data unit in memory, which does not present a failing condition for memory but presents an uncorrectable error condition for a data unit stored in memory.

The invention can use the C/R method with other ECC methods instead of SEC/DED, such as with double error correction/triple error detection (DEC/TED), or with triple error correction/quadruple error detection (TEC/QED), etc. The replacement of SEC/DED with any of these other known ECC types correspondingly increases the number of soft correctable errors in a data unit. For example, replacement of the SEC/DED ECC with the DEC/TED type of ECC allows up to two soft errors to be corrected per data unit, which would handle the H-H-H, H-H-S, H-S-S and S-S-S cases. Or replacement of the SEC/DED ECC with the TEC/QED type of ECC allows up to three soft errors to be corrected per data unit, which would handle the H-H-H-H, H-H-H-S, H-H-S-S, H-S-S-S and S-S-S-S cases, and so on for higher order ECC types. The location of the error bits in a data unit is not important in this invention.
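The case names listed for each ECC type follow a simple pattern: for an ECC that corrects n errors, the smallest excessive error has n+1 erroneous bits, each either hard (H) or soft (S), and order does not matter because bit location is not important. A short sketch that enumerates them (the function name is hypothetical):

```python
from itertools import combinations_with_replacement


def excessive_error_cases(correctable_errors):
    """List the hard/soft combinations for the smallest excessive error.

    correctable_errors: 1 for SEC/DED, 2 for DEC/TED, 3 for TEC/QED.
    """
    error_count = correctable_errors + 1
    return ["-".join(combo)
            for combo in combinations_with_replacement("HS", error_count)]
```

For example, `excessive_error_cases(1)` yields the three SEC/DED cases and `excessive_error_cases(2)` the four DEC/TED cases named in the text.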
`
`
`
`
This invention enables different types of remedial actions to be used for these different detectable cases by enabling records to be maintained in the system by signalling the occurrence of the particular error case in an automatic manner that can be recorded by the system for later maintenance use.

The invention may prevent an emergency maintenance situation that may shut down a system by enabling the system reporting of errors to distinguish between excessive hard and soft errors. A unique reporting process enables later non-emergency maintenance to handle conditions causing the detected errors.

This invention increases the fault tolerance of a system in the manner of use of the combined C/R and ECC methods by having redundant status control registers and using comparisons and parity checking for them.

An example of the report signalling by this invention is where the CPU maintains a memory map in main memory of all 4KB page frames in system main storage (MS). The memory map is called a Logical Track Record (LTR). The requesting processor reports to the LTR the H-H, H-S and S-S excessive error cases for the respectively addressed memory unit (4KB page frame) if the excessive error type has not previously been reported for the respectively addressed memory unit, significantly reducing the number of processor interruptions for reporting that delay its processing and reduce system efficiency.

The LTR also is reported to a system service processor (SP) to enable the SP to maintain its own page frame map called Physical Track Record (PTR) in a persistent disk file. The PTR is retained after the LTR is lost by a reset of the volatile CPU memory. Upon the next reinitialization of the CPU, the SP uses the PTR to rebuild the LTR in memory for the CPU software so the CPU need not waste time rebuilding the LTR by repeating error interruptions for bad pages that had previously been reported as error conditions. The PTR has a threshold of an amount of error conditions after which the appropriate part of the system can be fenced off and shut down for convenient maintenance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the interacting collection of hardware used by the preferred embodiment to improve the fault tolerance of the C/R method.

FIG. 2 illustrates the structure of the data buffer with its inpointer and outpointer.

FIG. 3 illustrates respective contents of the three buffers accessed at corresponding locations of their inpointers or outpointers.

FIG. 4 represents a memory request register used for receiving requests from CPUs, I/Os and other requestors requesting a complement/recomplement (C/R) error correction operation.

FIG. 5 represents a C/R sequencer used to control the steps in the C/R operations used by the hardware shown in FIG. 1.

FIG. 6 shows a status request register in an MS controller used to indicate to the operating system software the status of a current fetch request.

FIG. 7 shows a status request register in the MS controller used to indicate to the system service processor (SP) the status of a current fetch request.

FIG. 8 is a flow diagram showing the fetch error detection process.

FIG. 9 and FIG. 10 are flow diagrams showing the C/R process invoked by a refetch request from a requestor responding to a double-bit error status signal sent to the requestor by the MS controller.

FIG. 11 is a flow diagram of a process of signalling to a service processor from the described C/R process, and the system use of the various signals provided by the C/R method.

DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 provides hardware that obtains novel control of the C/R method for correcting erroneous data requested from main storage (MS) 11 by a requestor, which may be a CPU (central processing unit), an I/O processor that is controlling one or more I/O devices, or a service processor (SP). Each request for data sends an address to MS for accessing a line of data in MS DRAM arrays comprising a memory in a computer system. The invention may be applied to any memory, but the preferred embodiment uses the main storage (MS) in a computer system. Each request provides the MS address of a requested data line in MS and the requestor's identifier (RID), which is provided to the memory controller.

Each data line in MS 11 contains one or more data units (DUs), of which each data unit is a group of data bits over which an error correcting code (ECC) is generated to provide a total group of bits indicated by the notation DU+ECC group. The ECC bits in a group are commingled among the DU bits, and the ECC bits enable SEC/DED for the group during its readout and transmission.

In the preferred embodiment each data line is comprised of sixteen DU+ECC groups. The bits in a DU+ECC group are read out and transmitted in parallel on a memory bus in the preferred embodiment, but they may be serially transmitted on a bus for the current data line. Also, the preferred embodiment presumes each DU is a double word of 64 data bits having 8 ECC bits providing SEC/DED. Hence, the 64+8=72 bits are transferred on the bus in parallel. If a serial bus were used, each DU+ECC group would be assembled/disassembled into its parallel bit form to and from memory.

Original Fetch Request (FIG. 8 Process)

In the hardware of FIG. 1, each of the 16 DU+ECC groups in a requested data line is transferred from MS 11 through inverter circuits 12 to an ECC logic circuit 13. During an original fetch, each group passes through inverter 12 in its true (non-inverted) form to ECC logic circuit 13. Circuit 13 error checks each DU+ECC group and passes it to its requestor, whether the ECC logic found it error-free or not. If a DU+ECC group has only one erroneous bit, it is error corrected before being sent to the requestor. Status information is sent with the data to the requestor to inform the requestor of the group's error-free or particular error condition.

If the requestor receives error-free data (whether or not ECC corrected), the C/R function is bypassed (not done for the request). A particular status signal identifies any detected error condition to the requestor for the sent DU+ECC group, so that the requestor can determine what it wants done with the request, including whether the requestor will request the C/R method to be performed on the erroneous DU+ECC group.

The flow diagram of FIG. 8 shows the steps in an original fetch operation for a data line request, in which step 1 represents the current request. Step 2 represents when the fetch request obtains priority from the memory controller, which involves putting the request into memory request register 71 in FIG. 4, from which addresses are generated for each DU+ECC group in MS. Step 3 is the generation of status information which is put into registers 73 and 74 in FIGS. 6 and 7, which are located in the memory controller for MS 11.

Register 73 in FIG. 6 represents the status information sent to the requestor and has the following fields set by ECC circuits 13: a requestor identifier (RID) of the processor which made the request, the 72 bits of the fetched DU+ECC group, and the following single bit indicator fields: no error, 1 bit corrected error, excessive error, and a special UE (uncorrected error) indicator. The special UE is a unique character stored in a memory location to represent that the location had bad unrecoverable data due to an error.

Register 74 in FIG. 7 represents the status information sent to a service processor (SP), which is also set by ECC circuits 13 with the same information as is put into register 73, except that register 74 also has a field set with the ECC syndrome bits for the group. By providing the syndrome bits to the SP, the SP has the option of using the syndrome bits and other status information to verify the ECC operation, if required. For an original fetch, the fields H-H, H-S and S-S in registers 73 and 74 are not generated (remain off in reset state) in the status registers.

In step 4 of FIG. 8, the status information in register 73 is communicated to the requestor indicated in the RID field, which is the fetch request currently represented in register 71 in FIG. 4.

Re-fetch Request (See Process in FIG. 9)

The requestor may be a CPU, an I/O processor, or a service processor. Each requestor has hardware for receiving and sensing the received status information and determining if further action is needed for continuing the storage request with the C/R method if an excessive error is reported in the status register information. In the preferred embodiment, if the requestor receives an excessive error indication in the status register, the requestor automatically makes a re-fetch request to MS, which is a request for invoking the C/R method to attempt a correction of the excessive error in that DU+ECC group.

No interruption is caused to the normal operation of MS 11 or to the requesting processor during the process of issuing a re-fetch request invoking the C/R method.

The successful use of the C/R process without an interruption to the requesting processor provides significant novelty for the subject invention over the prior art. Previously, an interruption to the requesting processor always occurred on sensing an excessive error condition (2 bit error with SEC/DED ECC) during which the SP was invoked to initiate and control the execution of the C/R method. A significant amount of processor time was lost in that prior process compared to the process of the subject invention.

The following TABLE represents the four states of the C/R sequencer shown in FIG. 5 and some of their consequences including the restoration actions needed for restoring an erroneous DU+ECC group to its original state in MS if a failure should occur in the C/R hardware before its completion of the C/R process.

TABLE
DU+ECC Form in Data Buf At End of State    DU+ECC Form in MS At End of State
True                                       True
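The requestor-side handling of the register 73 indicators described in this section can be sketched as a small decision function. The field and function names below are hypothetical; only the rule that an excessive error indication triggers an automatic re-fetch request, and that error-free or ECC-corrected data bypasses C/R, comes from the text. Handling of the special UE indicator is not spelled out in this section, so the sketch returns a placeholder for it.

```python
from dataclasses import dataclass


@dataclass
class FetchStatus:
    """Hypothetical view of the single-bit indicators in register 73."""
    rid: int
    no_error: bool = False
    one_bit_corrected: bool = False
    excessive_error: bool = False
    special_ue: bool = False


def next_action(status):
    """Decide what the requestor does with a received DU+ECC group."""
    if status.excessive_error:
        # Preferred embodiment: automatically re-fetch, invoking C/R.
        return "re-fetch"
    if status.no_error or status.one_bit_corrected:
        # Data is usable as delivered; the C/R function is bypassed.
        return "accept"
    # Assumption: any other indicator is left to handling not shown here.
    return "other"
```

For instance, `next_action(FetchStatus(rid=3, excessive_error=True))` selects the re-fetch path without any interruption of the requestor.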
Status register 74 in FIG. 7 has its contents transmitted to a service processor (SP) such as exists in the processor control element (PCE) of an IBM 3090 system. The SP controls the error recovery operations for the system when required for an MS error condition, such as if an unrecoverable failure occurs in the hardware handling the request. For example if an error condition occurs in handling