`
`Reference 1
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2102, p. 1
`
`
`
`We believe the key to the 10’s longevity is its
`basically simple, clean structure with adequately large
`(one Mbyte) address space that allows users to get
`work done. In this way, it has evolved easily with use
`and with technology. An equally significant factor in
`its success is a single operating system environment
`enabling user program sharing among all machines.
`The machine has thus attracted users who have built
`significant languages and applications in a variety of
`environments. These user-developers are thus the
`dominant system architects-implementors.
`In retrospect, the machine turned out to be larger
`and further from a minicomputer than we expected.
`As such it could easily have died or destroyed the tiny
`DEC organization that started it. We hope that this
`paper has provided insight into the interactions of its
`development.
`Acknowledgments. Dan Siewiorek deserves our
`greatest thanks for helping with a complete editing of
`the text. The referees and editors have been especially
`helpful. The important program contributions by users
`are too numerous for us to give by name but here are
`most of them: apl, Basic, bliss, dot, Lisp, Pascal,
`Simula, sos, teco, and Tenex. Likewise, there have
`been so many contributions to the 10’s architecture
`and implementations within DEC and throughout the
`user community that we dare not give what would be a
`partial list.
`Received April 1977; revised September 1977
`References
`1. Bell, G., Cady, R., McFarland, H., Delagi, B., O’Laughlin, J.,
`and Noonan, R. A new architecture for minicomputers —the DEC
`PDP-11. Proc. AFIPS 1970 SJCC, Vol. 36, AFIPS Press,
`Montvale, N.J., pp. 657-675.
`2. Bell, G., and Freeman, P. C.ai: A computer architecture for
`AI research. AFIPS Conf. Proc. Vol. 38 (Spring, 1971), 779-790.
`3. Bell, G., and Newell, A. Computer Structures: Readings and
`Examples. McGraw-Hill, New York, 1971.
`4. Bobrow, D.G., Burchfiel, J.D., Murphy, D. L., and
`Tomlinson, R.S. TENEX, A Paged Time Sharing System for the
`PDP-10. Comm. ACM 15, 3 (March 1972), 135-143.
`5. Bullman, D.M. Editor, stack computers issue. Computer 10, 5
`(May 1977), 14-52.
`6. Clark, W.A. The Lincoln TX-2 computer. Proc. WJCC 1957,
`Vol. 11, pp. 143-171.
`7. Lunde, A. Empirical evaluation of some features of Instruction
`Set Processor architecture. Comm. ACM 20, 3 (March 1977), 143-
`152.
`8. Mitchell, J.L., and Olsen, K.H. TX-0, a transistor computer.
`Proc. EJCC 1956, Vol. 10, pp. 93-100.
`9. McCarthy, J. Time Sharing Computer Systems, Management
`and the Computer of the Future M. Greenberger, Ed., M.I.T. Press,
`Cambridge, Mass., 1962, pp. 221-236.
`10. Murphy, D.L. Storage organization and management in
`TENEX. Proc. AFIPS 1972 FJCC, Vol. 41, Pt. I, AFIPS Press,
`Montvale, N.J., pp. 23-32.
`11. Olsen, K.H. Transistor circuitry in the Lincoln TX-2. Proc.
`WJCC 1957, Vol. 11, pp. 167-171.
`12. Roberts, L.G. Ed. Section on Resource Sharing Computer
`Networks. AFIPS 1970 SJCC, Vol. 36, AFIPS Press, Montvale,
`N.J., pp. 543-598.
`13. Wulf, W., and Bell, G. C.mmp: A multi-mini-processor. Proc.
`AFIPS 1972 FJCC, Vol. 41, AFIPS Press, Montvale, N.J., pp.
`765-777.
`14. Wulf, W., Russell, D., and Habermann, A.N. BLISS: A
`language for systems programming. Comm. ACM 14, 12 (Dec.
`1971), 780-790.
`
`Computer
`Systems
`
`G. Bell, S. H. Fuller, and
`D. Siewiorek, Editors
`
`The CRAY-1
`Computer System
`Richard M. Russell
`Cray Research, Inc.
`
`This paper describes the CRAY-1, discusses the
`evolution of its architecture, and gives an account of
`some of the problems that were overcome during its
`manufacture.
`The CRAY-1 is the only computer to have been
`built to date that satisfies ERDA’s Class VI
`requirement (a computer capable of processing from
`20 to 60 million floating point operations per second)
`[1].
`The CRAY-1’s Fortran compiler (CFT) is designed
`to give the scientific user immediate access to the
`benefits of the CRAY-1’s vector processing
`architecture. An optimizing compiler, CFT,
`“vectorizes” innermost DO loops. Compatible with
`the ANSI 1966 Fortran Standard and with many
`commonly supported Fortran extensions, CFT does not
`require any source program modifications or the use
`of additional nonstandard Fortran statements to
`achieve vectorization. Thus the user’s investment of
`hundreds of man months of effort to develop Fortran
`programs for other contemporary computers is
`protected.
`Key Words and Phrases: architecture, computer
`systems
`CR Categories: 1.2, 6.2, 6.3
`
`Introduction
`
`Vector processors are not yet commonplace machines
`in the larger-scale computer market. At the
`time of this writing we know of only 12 non-CRAY-1
`vector processor installations worldwide. Of these 12,
`the most powerful processor is the ILLIAC IV (1
`installation), the most populous is the Texas Instruments
`Advanced Scientific Computer (7 installations)
`and the most publicized is Control Data’s STAR 100
`(4 installations). In its report on the CRAY-1, Auerbach
`Computer Technology Reports published a comparison
`of the CRAY-1, the ASC, and the STAR 100
`[2]. The CRAY-1 is shown to be a more powerful
`computer than any of its main competitors and is
`estimated to be the equivalent of five IBM 370/195s.
`Independent benchmark studies have shown the
`CRAY-1 fully capable of supporting computational
`rates of 138 million floating-point operations per second
`(MFLOPS) for sustained periods and even higher
`rates of 250 MFLOPS in short bursts [3, 4]. Such
`comparatively high performance results from the
`CRAY-1 internal architecture, which is designed to
`accommodate the computational needs of carrying out
`many calculations in discrete steps, with each step
`producing interim results used in subsequent steps.
`Through a technique called “chaining,” the CRAY-1
`vector functional units, in combination with scalar and
`vector registers, generate interim results and use them
`again immediately without additional memory references,
`which slow down the computational process in
`other contemporary computer systems.
`Other features enhancing the CRAY-1’s computational
`capabilities are: its small size, which reduces
`distances electrical signals must travel within the
`computer’s framework and allows a 12.5 nanosecond clock
`period (the CRAY-1 is the world’s fastest scalar
`processor); a one million word semiconductor memory
`equipped with error detection and correction logic
`(SECDED); its 64-bit word size; and its optimizing
`Fortran compiler.
`
`Copyright © 1977, Association for Computing Machinery, Inc.
`General permission to republish, but not for profit, all or part of
`this material is granted provided that ACM’s copyright notice is
`given and that reference is made to the publication, to its date of
`issue, and to the fact that reprinting privileges were granted by
`permission of the Association for Computing Machinery.
`Author’s address: Cray Research Inc., Suite 213, 7850 Metro
`Parkway, Minneapolis, MN 55420.
`Communications of the ACM, January 1978, Volume 21, Number 1
`
`Architecture
`The CRAY-1 has been called “the world’s most
`expensive love-seat” [5]. Certainly, most people’s first
`reaction to the CRAY-1 is that it is so small. But in
`computer design it is a truism that smaller means
`faster. The greater the separation of components, the
`longer the time taken for a signal to pass between
`them. A cylindrical shape was chosen for the CRAY-1
`in order to keep wiring distances small.
`Figure 1 shows the physical dimensions of the
`machine. The mainframe is composed of 12 wedge-like
`columns arranged in a 270° arc. This leaves room
`for a reasonably trim individual to gain access to the
`interior of the machine. Note that the love-seat
`disguises the power supplies and some plumbing for the
`Freon cooling system. The photographs (Figures 2 and
`3) show the interior of a working CRAY-1 and an
`exterior view of a column with one module in place.
`Figure 4 is a photograph of the interior of a single
`module.
`
`An Analysis of the Architecture
`Table I details important characteristics of the
`CRAY-1 Computer System. The CRAY-1 is equipped
`with 12 i/o channels, 16 memory banks, 12 functional
`units, and more than 4k bytes of register storage.
`Access to memory is shared by the i/o channels and
`high-speed registers. The most striking features of the
`CRAY-1 are: only four chip types, main memory
`speed, cooling system, and computation section.
`
`Fig. 1. Physical organization of mainframe.
`- Dimensions
`    Base: 103½ inches diameter by 19 inches high
`    Columns: 56½ inches diameter by 77 inches high including
`    height of base
`- 24 chassis
`- 1662 modules; 113 module types
`- Each module contains up to 288 IC packages
`- Power consumption approximately 115 kW input for maximum
`  memory size
`- Freon cooled with Freon/water heat exchange
`- Three memory options
`- Weight 10,500 lbs (maximum memory size)
`- Three basic chip types
`    5/4 NAND gates
`    Memory chips
`    Register chips
`
`Four Chip Types
`Only four chip types are used to build the CRAY-1.
`These are 16 x 4 bit bipolar register chips (6
`nanosecond cycle time), 1024 x 1 bit bipolar memory
`chips (50 nanosecond cycle time), and bipolar logic
`chips with subnanosecond propagation times. The logic
`chips are all simple low- or high-speed gates with both
`a 5 wide and a 4 wide gate (5/4 NAND). Emitter-coupled
`logic circuit (ECL) technology is used throughout
`the CRAY-1.
`The printed circuit board used in the CRAY-1 is a
`5-layer board with the two outer surfaces used for
`signal runs and the three inner layers for -5.2V,
`-2.0V, and ground power supplies. The boards are
`6 inches wide, 8 inches long, and fit into the chassis
`as shown in Figure 3.
`All integrated circuit devices used in the CRAY-1
`are packaged in 16-pin hermetically sealed flat packs
`supplied by both Fairchild and Motorola. This type of
`package was chosen for its reliability and compactness.
`Compactness is of special importance; as many as 288
`packages may be added to a board to fabricate a
`module (there are 113 module types), and as many as
`72 modules may be inserted into a 28-inch-high chassis.
`Such component densities inevitably lead to a mammoth
`cooling problem (to be described).
`
`Fig. 2. The CRAY-1 Computer.
`Fig. 3. CRAY-1 modules in place.
`Fig. 4. A single module.
`
`Main Memory Speed
`CRAY-1 memory is organized in 16 banks, 72
`modules per bank. Of the 72 modules, 64 each contribute
`1 bit to a 64-bit word. The other 8 bits are used to store
`an 8-bit check byte required for single-bit error correction,
`double-bit error detection (SECDED). Data words are
`stored in 1-bank increments throughout memory. This
`organization allows 16-way interleaving of memory
`accesses and prevents bank conflicts except in the case
`of memory accesses that step through memory with
`either an 8 or 16-word increment.
`
`Table I. CRAY-1 CPU characteristics summary
`
`Computation Section
`  Scalar and vector processing modes
`  12.5 nanosecond clock period operation
`  64-bit word size
`  Integer and floating-point arithmetic
`  Twelve fully segmented functional units
`  Eight 24-bit address (A) registers
`  Sixty-four 24-bit intermediate address (B) registers
`  Eight 64-bit scalar (S) registers
`  Sixty-four 64-bit intermediate scalar (T) registers
`  Eight 64-element vector (V) registers (64 bits per element)
`  Vector length and vector mask registers
`  One 64-bit real time clock (RT) register
`  Four instruction buffers of sixty-four 16-bit parcels each
`  128 basic instructions
`  Prioritized interrupt control
`Memory Section
`  1,048,576 64-bit words (plus 8 check bits per word)
`  16 independent banks of 65,536 words each
`  4 clock period bank cycle time
`  1 word per clock period transfer rate for B, T, and V registers
`  1 word per 2 clock periods transfer rate for A and S registers
`  4 words per clock period transfer rate to instruction buffers (up to
`  16 instructions per clock period)
`i/o Section
`  24 i/o channels organized into four 6-channel groups
`  Each channel group contains either 6 input or 6 output channels
`  Each channel group served by memory every 4 clock periods
`  Channel priority within each channel group
`  16 data bits, 3 control bits per channel, and 4 parity bits
`  Maximum channel rate of one 64-bit word every 100 nanoseconds
`  Maximum data streaming rate of 500,000 64-bit words/second
`  Channel error detection
`
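The interleaving rule described under Main Memory Speed can be checked with a small model. This is only a sketch under stated assumptions: consecutive words are assumed to map to consecutive banks by the low-order address bits, one access is issued per clock period, and a bank is busy for the 4 clock periods given in Table I. The function names are hypothetical and not taken from the paper.

```python
from math import gcd

BANKS = 16       # independent memory banks (Table I)
BANK_CYCLE = 4   # clock periods a bank stays busy after an access (Table I)


def bank_of(address: int) -> int:
    """Assumed mapping: consecutive words fall in consecutive banks."""
    return address % BANKS


def revisit_interval(stride: int) -> int:
    """Accesses issued before a strided stream returns to the same bank."""
    return BANKS // gcd(stride, BANKS)


def has_bank_conflict(stride: int) -> bool:
    """At one access per clock, a stream stalls if it revisits a busy bank."""
    return revisit_interval(stride) < BANK_CYCLE


# Strides from 1 to 16 words that would hit a bank before its cycle completes.
conflicting = [s for s in range(1, BANKS + 1) if has_bank_conflict(s)]
print(conflicting)
```

Under these assumptions the only conflicting increments in the range 1 to 16 are 8 and 16, which agrees with the exception named in the text.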
`Cooling System
`The CRAY-1 generates about four times as much
`heat per cubic inch as the CDC 7600. To cool the CRAY-1
`a new cooling technology was developed, also based
`on Freon, but employing available metal conductors in
`a new way. Within each chassis vertical aluminum/
`stainless steel cooling bars line each column wall. The
`Freon refrigerant is passed through a stainless steel
`tube within the aluminum casing. When modules are
`in place, heat is dissipated through the inner copper
`heat transfer plate in the module to the column walls
`and thence into the cooling bars. The modules are
`mated with the cold bar by using stainless steel pins to
`pinch the copper plate against the aluminum outer
`casing of the bar.
`To assure component reliability, the cooling system
`was designed to provide a maximum case temperature
`of 130°F (54°C). To meet this goal, the following
`temperature differentials are observed:
`
`Temperature at center of module    130°F (54°C)
`Temperature at edge of module      118°F (48°C)
`Cold plate temperature at wedge     78°F (25°C)
`Cold bar temperature                70°F (21°C)
`Refrigerant tube temperature        70°F (21°C)
`
`Fig. 5. Block diagram of registers.
`[Block diagram showing: memory; vector registers V0-V7; scalar
`registers S0-S7 and T00 through T77; address registers A0-A7 and
`B00 through B77; the vector, floating-point, scalar, and address
`functional units; vector control (VM, VL); RTC; exchange control
`(XA); i/o control; and the instruction buffers (NIP, LIP, CIP).]
`
`Functional Units
`There are 12 functional units, organized in four
`groups: address, scalar, vector, and floating point.
`Each functional unit is pipelined into single clock
`segments. Functional unit time is shown in Table II.
`Note that all of the functional units can operate
`concurrently, so that in addition to the benefits of pipelining
`(each functional unit can be driven at a result rate of 1
`per clock period) we also have parallelism across the
`units. Note the absence of a divide unit in the
`CRAY-1. In order to have a completely segmented
`divide operation the CRAY-1 performs floating-point
`division by the method of reciprocal approximation.
`This technique has been used before (e.g. IBM System/360
`Model 91).
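Division by reciprocal approximation can be illustrated in software. The sketch below uses Newton's iteration with a linear starting guess; the paper does not describe the hardware's seed, iteration count, or rounding, so every one of those choices here is an assumption, and the function names are hypothetical.

```python
import math


def reciprocal_approximation(b: float, iterations: int = 5) -> float:
    """Approximate 1/b using Newton's iteration x <- x * (2 - b * x)."""
    if b == 0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if b < 0 else 1.0
    # abs(b) = mantissa * 2**exponent with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(abs(b))
    # Linear first guess for 1/mantissa on [0.5, 1) (an assumed seed).
    x = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    for _ in range(iterations):
        # Each step roughly squares the relative error.
        x = x * (2.0 - mantissa * x)
    return sign * math.ldexp(x, -exponent)


def divide(a: float, b: float) -> float:
    """Divide using only multiplications and the reciprocal routine."""
    return a * reciprocal_approximation(b)


print(divide(355.0, 113.0))
```

Because each step uses only multiplies and subtracts, the whole operation can be fed through pipelined multiply hardware, which is the motivation the text gives for omitting a divide unit.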
`Registers
`Figure 5 shows the CRAY-1 registers in relationship
`to the functional units, instruction buffers, i/o
`channel control registers, and memory. The basic set
`of programmable registers is as follows:
`8 24-bit address (A) registers
`64 24-bit address-save (B) registers
`8 64-bit scalar (S) registers
`64 64-bit scalar-save (T) registers
`8 64-word (4096-bit) vector (V) registers
`Expressed in 8-bit bytes rather than 64-bit words,
`that’s a total of 4,888 bytes of high-speed (6ns) register
`storage.
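The 4,888-byte figure follows directly from the register list above. A short check, using a hypothetical table of (register count, bits per register) pairs:

```python
# (number of registers, bits per register) for each programmable register file.
REGISTER_FILES = {
    "A": (8, 24),
    "B": (64, 24),
    "S": (8, 64),
    "T": (64, 64),
    "V": (8 * 64, 64),  # eight registers of 64 elements, 64 bits per element
}

total_bits = sum(count * width for count, width in REGISTER_FILES.values())
total_bytes = total_bits // 8
print(total_bytes)  # 4888
```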
`The functional units take input operands from and
`store result operands only to A, S, and V registers.
`Thus the large amount of register storage is a crucial
`factor in the CRAY-1’s architecture. Chaining could
`not take place if vector register space were not available
`for the storage of final or intermediate results. The
`B and T registers greatly assist scalar performance.
`Temporary scalar values can be stored from and reloaded
`to the A and S registers in two clock periods.
`Figure 5 shows the CRAY-1’s register paths in detail.
`The speed of the CFT Fortran IV compiler would be
`seriously impaired if it were unable to keep the many
`Pass 1 and Pass 2 tables it needs in register space.
`Without the register storage provided by the B, T, and
`V registers, the CRAY-1’s memory bandwidth of only 80
`million words/second would be a serious impediment
`to performance.
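The 80 million words/second figure is what one word per 12.5 nanosecond clock period amounts to. The sketch below reproduces that arithmetic and compares it with an assumed workload, a single vector operation that reads two operands and writes one result every clock period; that workload is an illustration and is not taken from the paper.

```python
CLOCK_PERIOD_PS = 12_500   # 12.5 ns clock period, expressed in picoseconds
WORD_BYTES = 8             # 64-bit words

# Memory delivers at most one word per clock period (Table I).
words_per_second = 10**12 // CLOCK_PERIOD_PS
bytes_per_second = words_per_second * WORD_BYTES

# Assumed workload: two operand words in and one result word out per clock.
words_needed_per_clock = 3
memory_words_per_clock = 1
oversubscription = words_needed_per_clock // memory_words_per_clock

print(words_per_second, bytes_per_second, oversubscription)
```

Under that assumption a single memory-to-memory vector operation would ask for three times what memory can supply, which is why operands are held in registers.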
`
`Instruction Formats
`Instructions are expressed in either one or two 16-bit
`parcels. Two-parcel instructions may overlap
`memory-word boundaries. The general form of a CRAY-1
`instruction is as follows:
`
`Fields           g     h     i     j      k      m
`Bit positions   0-3   4-6   7-9  10-12  13-15  16-31
`Field width     (4)   (3)   (3)   (3)    (3)    (16)
`                |---------- Parcel 1 ----------| Parcel 2
`
`The computation section processes instructions at a
`maximum rate of one parcel per clock period.
`
`Table II. CRAY-1 functional units
`
`                                   Register   Functional unit time
`                                   usage      (clock periods)
`Address functional units
`  address add unit                 A          2
`  address multiply unit            A          6
`Scalar functional units
`  scalar add unit                  S          3
`  scalar shift unit                S          2 or 3 if double-word shift
`  scalar logical unit              S          1
`  population/leading zero count
`  unit                             S          3
`Vector functional units
`  vector add unit                  V          3
`  vector shift unit                V          4
`  vector logical unit              V          2
`Floating-point functional units
`  floating-point add unit          S and V    6
`  floating-point multiply unit     S and V    7
`  reciprocal approximation unit    S and V    14
`
`For arithmetic and logical instructions, a 7-bit
`operation code (gh) is followed by three 3-bit register
`designators. The first field, i, designates the result
`register. The j and k fields designate the two operand
`registers or are combined to designate a B or T
`register.
`The shift and mask instructions consist of a 7-bit
`operation code (gh) followed by a 3-bit i field and a 6-bit
`jk field. The i field designates the operand register.
`The jk combined field specifies a shift or mask count.
`Immediate operand, read and store memory, and
`branch instructions require the two-parcel instruction
`word format. The immediate operand and the read
`and store memory instructions combine the j, k, and
`m fields to define a 22-bit quantity or memory address.
`In addition, the read and store memory instructions
`use the h field to specify an operating register for
`indexing. The branch instructions combine the i, j, k,
`and m fields into a 24-bit memory address field. This
`allows branching to any one of the four parcel positions
`in any 64-bit word, whether in memory or in an
`instruction buffer.
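The field layout described above can be sketched as a parcel decoder. This assumes bit 0 is the leftmost (most significant) bit of the 16-bit parcel, matching the "bit position from left of word" convention used for the exchange package; the class and function names are hypothetical.

```python
from typing import NamedTuple


class Parcel(NamedTuple):
    g: int  # bits 0-3
    h: int  # bits 4-6
    i: int  # bits 7-9
    j: int  # bits 10-12
    k: int  # bits 13-15


def decode_parcel(parcel: int) -> Parcel:
    """Split a 16-bit first parcel into its g, h, i, j, k fields."""
    if not 0 <= parcel <= 0xFFFF:
        raise ValueError("a parcel is 16 bits")
    return Parcel(
        g=(parcel >> 12) & 0xF,
        h=(parcel >> 9) & 0x7,
        i=(parcel >> 6) & 0x7,
        j=(parcel >> 3) & 0x7,
        k=parcel & 0x7,
    )


def opcode(p: Parcel) -> int:
    """7-bit operation code formed from g and h."""
    return (p.g << 3) | p.h


def jk(p: Parcel) -> int:
    """6-bit combined field used for shift/mask counts or B/T designators."""
    return (p.j << 3) | p.k


def jkm(p: Parcel, m: int) -> int:
    """22-bit quantity or address formed from j, k, and the second parcel m."""
    if not 0 <= m <= 0xFFFF:
        raise ValueError("m is a 16-bit parcel")
    return (jk(p) << 16) | m


fields = decode_parcel(0o034567)
print(fields, oct(opcode(fields)), oct(jk(fields)))
```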
`Operating Registers
`Five types of registers —three primary (A, S, and
`V) and two intermediate (B and T) —are provided in
`the CRAY-1.
`A registers: eight 24-bit A registers serve a variety
`of applications. They are primarily used as address
`registers for memory references and as index registers,
`but also are used to provide values for shift counts,
`loop control, and channel i/o operations. In address
`applications, they are used to index the base address
`for scalar memory references and to provide both a
`base address and an index address for vector memory
`references.
`The 24-bit integer functional units modify values
`(such as program addresses) by adding, subtracting,
`and multiplying A register quantities. The results of
`these operations are returned to A registers.
`Data can be transferred directly from memory to A
`registers or can be placed in B registers as an intermediate
`step. This allows buffering of the data between
`A registers and memory. Data can also be transferred
`between A and S registers and from an A register to
`the vector length register. The eight A registers are
`individually designated by the symbols A0, A1, A2,
`A3, A4, A5, A6, and A7.
`B registers: there are sixty-four 24-bit B registers,
`which are used as auxiliary storage for the A registers.
`The transfer of an operand between an A and a B
`register requires only one clock period. Typically, B
`registers contain addresses and counters that are referenced
`over a longer period than would permit their
`being retained in A registers. A block of data in B
`registers may be transferred to or from memory at the
`rate of one clock period per register. Thus, it is feasible
`to store the contents of these registers in memory
`prior to calling a subroutine requiring their use. The
`sixty-four B registers are individually designated by
`the symbols B0, B1, B2, . . . , and B77₈.
`S registers: eight 64-bit S registers are the principal
`data handling registers for scalar operations. The S
`registers serve as both source and destination registers
`for scalar arithmetic and logical instructions. Scalar
`quantities involved in vector operations are held in S
`registers. Logical, shift, fixed-point, and floating-point
`operations may be performed on S register data. The
`eight S registers are individually designated by the
`symbols S0, S1, S2, S3, S4, S5, S6, and S7.
`T registers: sixty-four 64-bit T registers are used as
`auxiliary storage for the S registers. The transfer of an
`operand between S and T registers requires one clock
`period. Typically, T registers contain operands that
`are referenced over a longer period than would permit
`their being retained in S registers. T registers allow
`intermediate results of complex computations to be
`held in intermediate access storage rather than in
`memory. A block of data in T registers may be
`transferred to or from memory at the rate of one word
`per clock period. The sixty-four T registers are individually
`designated by the symbols T0, T1, T2, . . . , and
`T77₈.
`V registers: eight 64-element V registers provide
`operands to and receive results from the functional
`units at a one clock period rate. Each element of a V
`register holds a 64-bit quantity. When associated data
`is grouped into successive elements of a V register, the
`register may be considered to contain a vector. Examples
`of vector quantities are rows and columns of a
`matrix, or similarly related elements of a table.
`Computational efficiency is achieved by processing each
`element of the vector identically. Vector merge and
`test instructions are provided in the CRAY-1 to allow
`operations to be performed on individual elements
`designated by the content of the vector mask (VM)
`register. The number of vector register elements to be
`processed is contained in the vector length (VL) register.
`The eight V registers are individually designated
`by the symbols V0, V1, V2, V3, V4, V5, V6, and V7.
`
`Supporting Registers
`The CPU contains a variety of additional registers
`that support the control of program execution. These
`are the vector length (VL) and vector mask (VM)
`registers, the program counter (P), the base address
`(BA) and limit address (LA) registers, the exchange
`address (XA) register, the flag (F) register, and the
`mode (M) register.
`VM register: the 64-bit vector mask (VM) register
`controls vector element designation in vector merge
`and test instructions. Each bit of the VM register
`corresponds to a vector register element. In the vector
`test instruction, the VM register content is defined by
`testing each element of a V register for a specific
`condition.
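The interplay of the vector test and vector merge instructions with VM and VL can be sketched as follows. Two details are assumptions rather than statements from the paper: that mask bit n, counted from the left, corresponds to element n, and that a set bit selects the element from the first operand. The function names are hypothetical.

```python
from typing import Callable, List

VECTOR_ELEMENTS = 64  # elements per V register, one VM bit per element


def vector_test(v: List[int], vl: int, condition: Callable[[int], bool]) -> int:
    """Return a 64-bit mask with bit n (from the left) set where the test holds."""
    mask = 0
    for n in range(vl):
        if condition(v[n]):
            mask |= 1 << (VECTOR_ELEMENTS - 1 - n)
    return mask


def vector_merge(vj: List[int], vk: List[int], vm: int, vl: int) -> List[int]:
    """Take vj[n] where mask bit n is set, otherwise vk[n], for the first vl elements."""
    return [
        vj[n] if (vm >> (VECTOR_ELEMENTS - 1 - n)) & 1 else vk[n]
        for n in range(vl)
    ]


data = [3, -1, 4, -5]
vm = vector_test(data, vl=4, condition=lambda x: x < 0)
clamped = vector_merge([0, 0, 0, 0], data, vm, vl=4)
print(clamped)  # [3, 0, 4, 0]
```

The usage lines replace every negative element with zero without a per-element branch, which is the kind of conditional operation the mask is meant to support.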
`P register: the 24-bit P register specifies the memory
`register parcel address of the current program
`instruction. The high order 22 bits specify a memory
`address and the low order two bits indicate a parcel
`number. This parcel address is advanced by one as
`each instruction parcel in a nonbranching sequence is
`executed and is replaced whenever program branching
`occurs.
`BA register: the 18-bit base address (BA) register
`contains the upper 18 bits of a 22-bit memory address.
`The lower four bits of this address are considered
`zeros. Just prior to initial or continued execution of a
`program, a process known as the “exchange sequence”
`stores into the BA register the upper 18 bits of the
`lowest memory address to be referenced during program
`execution. As the program executes, the address
`portion of each instruction referencing memory has its
`content added to that of the BA register. The sum
`then serves as the absolute address used for the memory
`reference and ensures that memory addresses lower
`than the contents of the BA register are not accessed.
`Programs must, therefore, have all instructions referencing
`memory do so with their address portions
`containing relative addresses. This process supports
`program loading and memory protection operations
`and does not, in producing an absolute address, affect
`the content of the instruction buffer, BA, or memory.
`LA register: the 18-bit limit address (LA) register
`contains the upper 18 bits of a 22-bit memory address.
`The lower 4 bits of this address are considered zeros.
`Just prior to initial or continued execution of a program,
`the “exchange sequence” process stores into the
`LA register the upper 18 bits of that absolute address
`one greater than allowed to be referenced by the
`program. When program execution begins, each
`instruction referencing a memory location has the absolute
`address for that reference (determined by summing
`its address portion with the BA register contents)
`checked against the LA register content. If the absolute
`address equals or exceeds the LA register content, an
`out-of-range error condition is flagged and program
`execution terminates. This process supports the memory
`protection operation.
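The base and limit checks described for the BA and LA registers can be sketched as a single relocation function. The exception class and function name are hypothetical, and the sketch models only the address arithmetic stated in the text, not the interrupt that follows an error.

```python
class OutOfRangeError(Exception):
    """Raised where the hardware would flag an out-of-range condition."""


def absolute_address(relative: int, ba: int, la: int) -> int:
    """Relocate a program-relative address and check it against the limit."""
    if not (0 <= ba < 2**18 and 0 <= la < 2**18):
        raise ValueError("BA and LA are 18-bit registers")
    base = ba << 4    # the lower four address bits are taken as zeros
    limit = la << 4   # one greater than the highest permitted address
    absolute = base + relative
    if absolute >= limit:
        raise OutOfRangeError(f"address {absolute} is not below limit {limit}")
    return absolute


print(absolute_address(0o100, ba=0o20, la=0o40))  # 320
```

Since the relative address is only ever added to the base, a program cannot reach below BA, and the comparison against LA closes the upper side of its region.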
`XA register: the 8-bit exchange address (XA) register
`contains the upper eight bits of a 12-bit memory
`address. The lower four bits of the address are considered
`zeros. Because only twelve bits are used, with the
`lower four bits always being zeros, exchange addresses
`can reference only every 16th memory address beginning
`with address 0000 and concluding with address
`4080. Each of these addresses designates the first
`word of a 16-word set. Thus, 256 sets (of 16 memory
`words each) can be specified. Prior to initiation or
`continuation of a program’s execution, the XA register
`contains the first memory address of a particular 16-word
`set or exchange package. The exchange package
`contains certain operating and support registers’ contents
`as required for operations following an interrupt.
`The XA register supports the exchange sequence
`operation and the contents of XA are stored in an
`exchange package whenever an exchange sequence
`occurs.
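The address range quoted for exchange packages follows from shifting the 8-bit XA value left by four bits. A brief check, with a hypothetical function name:

```python
def exchange_package_address(xa: int) -> int:
    """First word address of the 16-word exchange package selected by XA."""
    if not 0 <= xa <= 0xFF:
        raise ValueError("XA is an 8-bit register")
    return xa << 4  # the lower four address bits are always zero


addresses = [exchange_package_address(xa) for xa in range(256)]
print(addresses[0], addresses[1], addresses[-1])  # 0 16 4080
```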
`F register: the 9-bit F register contains flags that,
`whenever set, indicate interrupt conditions causing
`initiation of an exchange sequence. The interrupt
`conditions are: normal exit, error exit, i/o interrupt,
`uncorrected memory error, program range error, operand
`range error, floating-point overflow, real-time clock
`interrupt, and console interrupt.
`M register: the M (mode) register is a three-bit
`register that contains part of the exchange package for
`a currently active program. The three bits are selectively
`set during an exchange sequence. Bit 37, the
`floating-point error mode flag, can be set or cleared
`during the execution interval for a program through
`use of the 0021 and 0022 instructions. The other two
`bits (bits 38 and 39) are not altered during the execution
`interval for the exchange package and can only be
`altered when the exchange package is inactive in storage.
`Bits are assigned as follows in word two of the
`exchange package.
`Bit 37: Floating-point error mode flag. When this
`bit is set, interrupts on floating-point errors are
`enabled.
`Bit 38: Uncorrectable memory error mode flag.
`When this bit is set, interrupts on uncorrectable
`memory parity errors are enabled.
`Bit 39: Monitor mode flag. When this bit is set, all
`interrupts other than parity errors are inhibited.
`
`Integer Arithmetic
`All integer arithmetic is performed in 24-bit or 64-
`bit 2’s complement form.
`
`Floating-Point Arithmetic
`Floating-point numbers are represented in signed
`magnitude form. The format is a packed signed binary
`fraction and a biased binary integer exponent. The
`fraction is a 49-bit signed magnitude value. The
`exponent is 15-bit biased. The unbiased exponent range is:
`$2^{-20000_8}$ to $2^{+17777_8}$
`or approximately
`$10^{-2500}$ to $10^{+2500}$
`An exponent equal to or greater than $2^{+20000_8}$ is
`recognized by the floating-point functional units as an
`overflow condition, and causes an interrupt if floating point
`interrupts are enabled.
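The format above can be illustrated with a small decoder. The text gives the field widths (a sign plus 48-bit magnitude fraction, and a 15-bit biased exponent) but not the field order or the bias value; this sketch assumes the sign occupies the leftmost bit, followed by the exponent and then the fraction, and assumes a bias of 40000 octal, which is consistent with the stated unbiased range. All names are hypothetical.

```python
from fractions import Fraction

EXPONENT_BIAS = 0o40000      # assumed bias; not stated in the text
OVERFLOW_EXPONENT = 0o20000  # unbiased exponents at or above this overflow


def pack(sign: int, biased_exponent: int, coefficient: int) -> int:
    """Assemble a 64-bit word from sign, 15-bit exponent, and 48-bit fraction."""
    return (sign << 63) | (biased_exponent << 48) | coefficient


def unpack(word: int):
    """Split a 64-bit word into (sign, biased exponent, 48-bit fraction)."""
    sign = (word >> 63) & 1
    biased_exponent = (word >> 48) & 0x7FFF
    coefficient = word & ((1 << 48) - 1)
    return sign, biased_exponent, coefficient


def value(word: int) -> Fraction:
    """Exact value of a word, raising OverflowError in the overflow range."""
    sign, biased_exponent, coefficient = unpack(word)
    exponent = biased_exponent - EXPONENT_BIAS
    if exponent >= OVERFLOW_EXPONENT:
        raise OverflowError("exponent is in the overflow range")
    magnitude = Fraction(coefficient, 1 << 48) * Fraction(2) ** exponent
    return -magnitude if sign else magnitude


one = pack(0, 0o40001, 1 << 47)   # fraction 0.5 times 2**1
print(value(one))  # 1
```

Exact rational arithmetic is used because the exponent range is far wider than that of a native double, so converting large values to `float` would itself overflow.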
`Chaining
`The chaining technique takes advantage of the
`parallel operation of functional units. Parallel vector
`operations may be processed in two ways: (a) using
`different functional units and V registers, and (b)
`chaining; that is, using the result stream to one vector
`register simultaneously as the operand set for another
`operation in a different functional unit.
`Parallel operations on vectors allow the generation
`of two or more results per clock period. A vector
`operation either uses two vector registers as sources of
`operands or uses one scalar register and one vector
`register as sources of operands. Vectors exceeding 64
`elements are processed in 64-element segments.
`Basically, chaining is a phenomenon that occurs
`when results issuing from one functional unit (at a rate
`of one/clock period) are immediately fed into another
`functional unit and so on. In other words, intermediate
`results do not have to be stored to memory and can be
`used even before the vector operation that created
`them runs to completion.
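The benefit of chaining can be estimated with a simplified timing model. The assumptions are that a functional unit delivers its first result after its functional unit time and one further result each clock period, and that register access and instruction issue overheads are ignored; the paper gives no such formula, and the function names are hypothetical.

```python
from typing import Sequence


def last_result_time(unit_time: int, elements: int) -> int:
    """Clock period in which a pipelined unit delivers its final result."""
    return unit_time + elements - 1


def unchained_time(unit_times: Sequence[int], elements: int) -> int:
    """Each operation waits for the previous one to run to completion."""
    return sum(last_result_time(t, elements) for t in unit_times)


def chained_time(unit_times: Sequence[int], elements: int) -> int:
    """Each result is fed to the next unit in the period it is produced."""
    return sum(unit_times) + elements - 1


# Floating-point multiply (7) feeding floating-point add (6), from Table II.
times = (7, 6)
print(unchained_time(times, 64), chained_time(times, 64))  # 139 76
```

In this model the chained pair finishes a 64-element vector in little more than the time of a single operation, because the pipeline fill time of the second unit is the only added cost.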
`Chaining has been compared to the technique of
`“data forwarding” used in the IBM 360/195. Like
`data forwarding, chaining takes place automatically.
`Data forwarding consists of hardware facilities within
`the 195 floating-point processor communicating
`automatically by transferring “name tags,” or internal codes
`between themselves [6]. Unlike the CRAY-1, the user
`has no access to the 195’s data-forwarding buffers.
`And, of course, the 195 can only forward scalar values,
`not entire vectors.
`Interrupts and Exchange Sequence
`Interrupts are handled cleanly by the CRAY-1
`hardware. Instruction issue is terminated by the hardware
`upon detection of an interrupt condition. All
`memory bank activity is allowed to complete as are
`any vector instructions that are in execution, and then
`an exchange sequence is activated. The Cray Operating
`System (COS) is always one partner of any exchange
`sequence. The cause of an interrupt is analyzed during
`an exchange sequence and all interrupts are processed
`until none remain.
`Only the address and scalar registers are maintained
`in a program’s exchange package (Fig. 6). The user’s
`B, T, and V registers are saved by the operating
`system in the user’s Job Table Area.
`
`Fig. 6. Exchange package.
`[Diagram: a 16-word package, words n through n+15, holding the
`P, BA, LA, M, XA, VL, and F fields, the A registers A0-A7, and
`the S registers S0-S7.]
`
`M - Modes†
`  36 Interrupt on correctable memory error
`  37 Interrupt on floating point
`  38 Interrupt on uncorrectable memory error
`  39 Monitor mode
`F - Flags†
`  31 Console interrupt
`  32 RTC interrupt
`  33 Floating point error
`  34 Operand range
`  35 Program range
`  36 Memory error
`  37 I/O interrupt
`  38 Error exit
`  39 Normal exit
`
`†Bit position from left of word
`
`Registers
`S Syndrome bits
`RAB Read address for error (where B is bank)
`P Program address
`BA Base address
`LA Limit address
`XA Exchange address
`VL Vector 1e