`obst
`
`54
`
`APPARATUS FOR PERFORMING A BIT
`SERAL ORTHOGONAL
`TRANSFORMATION INSTRUCTION
`
`(75)
`
`Inventor: Kenneth obst, Silver Spring, Md.
`
`(73)
`
`Assignee:
`
`The United States of America as
`represented by National Security
`Agency, Washington, D.C.
`
`21
`
`Appl. No.: 533,238
`
`22
`
`Filed:
`
`Jun. 4, 1990
`
`(51)
`52)
`(58
`
`Int. Cl. ................................................ G06F 7/38
`U.S. C. ................................ 364/736; 364/715.0
`Field of Search ........... 364/736, 725, 716, 715.01,
`364/730,715.02
`
`USOOS10137A
`Patent Number:
`11
`45) Date of Patent:
`
`5,101,371
`Mar. 31, 1992
`
`
`
`56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`3,936,806 2/1976 Batcher .
`4,314,349 2/1982 Batcher ............................... 364/716
`4,727,474 2/1988 Batcher ......
`Primary Examiner-Tan V. Mai
`(57)
`ABSTRACT
`Apparatus for performing a bit serial orthogonal trans
`formation instruction (BSOTI) is characterized by a
`unique circuit of bit inputs from two vector registers, a
`plurality of switching devices, and OR gates for per
`forming transformation. The apparatus enables vector
`machines to operate on individual bits of words as
`quickly as it operates on the words stored in memory.
`Vector computers are thus able to perform efficient
`bit-serial arithmetic as well as the more traditional fast
`vector/scalar operations.
`
`11 Claims, 1 Drawing Sheet
`
`Ci,63
`
`Ci,62
`
`Oracle-1029, p. 1
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`U.S. Patent
`
`Mar. 31, 1992
`
`5,101,371
`
`F. G.
`
`
`
`
`
`Ci,62
`
`OUTPUT
`
`7 v/
`
`Oracle-1029, p. 2
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`1
`
`APPARATUS FOR PERFORMING A BIT SERIAL
`ORTHOGONAL TRANSFORMATION
`INSTRUCTION
`
`BACKGROUND OF THE INVENTION
`The present invention relates to vector computers for
`performing bit transformation to afford efficient access
`of bits within a word stored in a memory as well as
`access of the stored words.
`The bit serial orthogonal transformation instruction
`(BSOTI) is a vector instruction which per forms the
`following function:
`ABT-C
`
`O
`
`15
`
`5,101,371
`2
`granularity of the operation performed. Most commer
`cial vector machines, for example the Cray 2, have a
`vector size of 64 words by sixty-four bits per word.
`The advantage of vector machines lies in their tre
`mendous speed in processing information. The disad
`vantage of such machines is that their programs must be
`"vectorized' or structured in a particular manner so as
`to take advantage of its features. Therefore, program
`mers of vector computers must have detailed knowl
`edge of the machine's architecture to effectively utilize
`these machines for fine grain computation. This leads to
`significant increases in time expense and effort. Also,
`vector machines are not efficient in applications where
`fine granularity is required.
`The motivation for the orthogonal transformation
`instruction is to be able to operate on bits as efficiently
`as one currently operates on words of information
`stored in a computer memory. This is a problem be
`cause most computers store multiple bits per word, and
`these bits must be accessed that way for error detec
`tion/correction reasons. The major advantage of SIMD
`computers, in general, is their ability to exploit locality
`of memory reference and efficiently trade space for
`time. This is most effective when the bit serial complex
`ity of a computational kernel is small or the data fields
`involved are small. With the bit serial orthogonal trans
`formation instruction (BSOTI), vector computers will
`be able to perform efficient bit-serial arithmetic as well
`as the more traditional fast vector/scalar operations on
`full word integer and floating point values.
`The present invention was developed in order to
`allow vector computers to operate on bits of informa
`tion as effectively as SIMD computers, whereby the
`granularity problem associated with vector computers
`is overcome. Functionally, the invention works in con
`junction with the hardware gather scatter instruction to
`perform the following bit operations: matrix multiplica
`tion, matrix transposition, matrix row/column permuta
`tions, compression, and expansion.
`SUMMARY OF THE INVENTION
`Accordingly, it is a primary object of the present
`invention to provide a device for performing a bit trans
`formation. The device includes a plurality of input lines
`for supplying a plurality of bits, respectively, from a
`word in a first vector register and a plurality of input
`terminals arranged in a matrix of rows and columns for
`supplying a plurality of bits from a plurality of words,
`respectively, in a second vector register. A plurality of
`AND gates corresponding in number to the number of
`input terminals are provided, with each AND gate
`having one input connected with one of the input termi
`nals for controlling the state thereof and the other input
`connected with one of the bit input lines. All of the
`AND gates in a row have their other input connected
`with the same input line for receiving the bit input
`therefrom. The AND gates of each columm have their
`outputs connected with the input of an OR gate. Thus
`there is one OR gate for each column of AND gates.
`Each OR gate has an output providing a transformation
`function of the bit inputs, whereby efficient access of
`bits within a word stored in memory is afforded.
`According to a more specific object of the invention,
`the bits on the input lines are associated with the indi
`vidual bits fetched from a single first memory and the
`bits on the input terminals are associated with a second
`memory having words comprising a number of bits
`
`35
`
`30
`
`This instruction effectively doubles the instruction set
`of vector computers. More particularly, BSOTI allows
`vector machines to operate on individual bits of a word
`as quickly as they operate on words of information
`20
`stored in its memory. Thus, vector computers will be
`able to perform efficient bit-serial arithmetic as well as
`the more conventional fast vector/scalar operations.
`BRIEF DESCRIPTION OF THE PRIOR ART
`25
`The levels of speed and performance achieved by
`conventional supercomputers are made possible
`through the use of multiple central processing units
`(CPUs). The increase in the number of CPUs allows
`these machines to perform many computations at the
`same time. The simultaneous computing is referred to as
`parallel computing or parallel processing. Parallel pro
`cessing devices are well-known in the patented prior art
`as evidenced by the U.S. patents to Batcher U.S. Pat.
`No. 3,936,806, U.S. Pat. No. 4,314,349, and U.S. Pat.
`No. 4,727,474.
`Supercomputers carry out parallel processing in vari
`ous ways, and nomenclature has been developed to
`categorize these machines. The class of parallel comput
`ers referred to as single instruction multiple data
`(SIMD) computers are identified by a particular archi
`tecture in which a single master processor broadcasts
`the identical instruction to a large number of slave pro
`cessors. Each slave processor then executes this instruc
`tion on data fetched from its own particular local mem
`45
`ory.
`Several SIMD machines (e.g. CM-2, MPP, and
`DAP) have been built where the slave processors oper
`ate on only one bit. The advantage of using machines
`with one-bit slave processors is apparent in the areas of
`SO
`graphics and image processing. Furthermore, the ability
`to operate on an individual bit is necessary for certain
`iterative numerical computing tasks where the amount
`of precision needed increases with the number of itera
`tions (as in the field of cryptology). The major disad
`55
`vantage is that these machines are relatively slow com
`pared to other parallel computers.
`Vector computers are another class of parallel pro
`cessing machines. As used herein, a vector computer or
`vector machine is a relatively fast vector/scalar proces
`sor that treats a vector as a single entity of bits greater
`than a single word of memory. The instruction set is
`such that only one instruction is needed to process the
`entire vector. In many ways, the architecture of a vec
`tor machine resembles that of an SIMD machine in that
`it applies the same instruction to a large number of
`different data elements. The principle difference be
`tween a vector machine and an SIMD machine is the
`
`Oracle-1029, p. 3
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`5
`
`-continued
`
`O
`O
`
`0 0
`0 1 0
`
`represents the reverse
`identity matrix
`
`1 0 0
`
`O
`
`5,101,371
`3
`corresponding with the square of the number of bits in
`the first memory.
`According to a further object of the invention, the
`number of input lines corresponds with the number of
`bits in a word and each bit is supplied to a separate input
`line. The input lines can be pipelined from individual
`bits of words from a vector register.
`BRIEF DESCRIPTION OF THE FIGURES
`Other objects and advantages of the present invention
`will become apparent from a study of the following
`specification when viewed in the light of the accompa
`nying drawing, in which:
`FIG. 1 is a circuit diagram of the bit transformation
`apparatus according to a preferred embodiment of the
`invention; and
`FIG. 2 is a more detailed circuit diagram of a portion
`of the circuit of FIG, 1.
`DETAILED DESCRIPTION
`Mathematically, A, B, and C represent binary matri
`ces and
`ABT
`
`O
`
`5
`
`20
`
`For certain values of A and B, ABT-C can perform
`some very important functions including matrix trans
`position (i.e. "corner turn”), bit reversal, and arbitrary
`bit permutations.
`By way of explanation, matrix transposition is per
`formed as follows. If A represents the identity matrix,
`then ABT-C perform as a vector register transpose
`operation mapping bits of B into words of C and words
`of B into bits of C. This is a very important operation
`because it allows other instructions that normally oper
`ate on words to operate on bits. For example, gather/-
`scatter instructions that allow efficient arbitrary map
`pings between words now also allow arbitrary map
`pings between bits. In general, the normal LLOS in
`struction sequence is mapped into
`
`25
`represents the binary matrix multiplication of matrix A
`and matrix B transpose. This multiplication is per
`formed over a field which mathematicians refer to as
`GF(2).
`On a vector computer, A, B, and C are vector regis
`ters. For example, on the Cray series of computers, A,
`B, and C are 64x64 bit matrices organized as 64 words
`of 64 bits per word. Diagramatically, the mathematical
`function
`
`30
`
`ABT-C
`
`can be represented as
`
`BT
`A
`O O
`O O
`
`C
`O
`O
`
`-Ge.
`
`O O
`
`O
`
`where O
`
`represents one 64 bit word of a vector
`
`O
`O
`
`O
`
`represents one vector
`register of 64
`64-bit words
`
`1 O O
`O
`0
`
`O
`O
`
`O
`
`0 0 1
`
`represents the
`identity matrix
`
`35
`
`50
`
`55
`
`65
`
`LTLTOTS
`
`where
`L is the vector load
`S is the vector store
`O is the vector operation and
`T is the matrix transpose operation
`For bit reversal functions, let B represent the reverse
`identity matrix. Then ABT-C performs a bit reversal
`of the bits of vector register A.
`Finally, for arbitrary bit permutations, let B represent
`an arbitrary permutration matrix (i.e. an arbitrary order
`ing of the rows of the identify matrix). Then ABT-C
`performs an arbitrary permutation of the bits of vector
`register A.
`The circuitry for performing the orthogonal transfor
`mation instruction ABT-C in accordance with the
`invention will now be described with reference to FIG.
`1. The invention will be described with reference to
`vector or word sizes of sixty-four bits, although as is
`apparent to those of ordinary skill in the art, the inven
`tion is not limited to such sizes, so long as the word size
`of the machine equals the vector register size in words.
`There are provided a plurality of input lines a. In the
`drawing, let a refer to the jth bit of word i of vector
`register A. Similarly, let bit refer to the jth bit of word i
`of vector register B. The same designations are used for
`the bits and words of register C (i.e. ci). The words of
`a vector register are ordered from the first (i=0) to the
`last (i=63) word and the bits of a vector register are
`ordered from most significant (j=63) to the least signifi
`cant (ji=0). In addition to the input lines a, there are a
`plurality of input terminals b. Preferably, the number of
`input terminals b is the square of the number of input
`lines a. The input terminals b are arranged in a matrix of
`rows and columns. Each column corresponds with a
`word i in register B while each row corresponds with a
`bit j of the particular word.
`The input lines input a bit from a word in vector
`register A, while each input terminal b inputs a particu
`lar bit from a particular word in register B.
`
`Oracle-1029, p. 4
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`5,101,371
`6
`5
`reduced to 64--A" ticks using both OTI units. There
`A plurality of switching devices 2, preferably AND
`gates, are arranged in a matrix and have inputs con
`fore, it is desirable to have at least two OTI units on a
`nected with the input lines a and terminals b as shown in
`particular machine.
`FIG. 1. Each AND gate 2 has a pair of inputs. A first
`Since both SIMD and vector machines need to pro
`input of each AND gate is connected with one of the
`vide some kind of error detection/correction on very
`input terminals b, respectively, for receiving a bit from
`large memories, accesses to memory must be full word
`accesses. Therefore, in general the expected perfor
`a word of the second register. These bits control the
`state of operation of the gate. A second input for each
`mance for single bit indirect fetches will be the same
`AND gate in a first row of gates is connected with the
`with or without this invention, namely 64 bits for 1 on
`first input line for receiving a bit from the first register.
`Cray machines and 32 bits for 1 on a Thinking Machines
`Similarly, the AND gates of the succeeding rows have
`CM-2.
`second inputs connected with successive input lines,
`An example of one type of software for the orthogo
`respectively.
`nal transformation instruction will now be described. In
`A plurality of OR gates 4 are provided in the cir
`FORTRAN, arrays of 64-bit words are declared as
`cuitry according to the invention. The inputs to each
`15
`OR gate are connected with the outputs of a column of
`X(n1, m2,..., ni)
`AND gates as shown. Thus there is one OR gate for
`where array X has idimensions of
`each column of AND gates. The OR gates have outputs
`c which contain the bit transformation of the bit inputs
`from the vector registers A and B via the input lines a
`20
`and terminals b, respectively.
`The transformation circuit of FIG. 1 enables efficient
`access of the bits within a word stored in a memory as
`well as access of the words themselves.
`In a preferred embodiment, the OR gates comprise
`25
`"exclusive OR' gates for full realization of the orthogo
`nal transformation instruction. With full realization, all
`4K bits of B are first loaded. Then the words of A are
`pipelined through the circuit into C. The circuit re
`quires 4K AND gates, one for each bit of B, and sixty
`30
`four 64-input "exclusive OR' gates.
`Partial realization of the orthogonal transformation
`instruction is realized by substituting "inclusive OR'
`gates for the more costly "exclusive OR' gates. Such a
`substitution still provides matrix transposition, bit rever
`35
`sal, and arbitrary bit permutation functions. However,
`the binary matrix multiplication function would not be
`possible.
`Partial realization of matrix transposition functions
`may also be realized. If B and C are designated as a pair
`of vector registers, then they could be wired together
`such that bit is directly connected with ci. Then matrix
`transposition could be performed directly by loading B
`and accessing C.
`Referring now to FIG. 2, the actual hardware realiza
`45
`tion of the 64 input "exclusive OR' or inclusive OR'
`gates will be described. Let do. . . . d53 be the inputs.
`Then the 64 input gate can be realized as a cascade of 21
`4-input gates as. All outputs are latched for pipelined
`operation. For the case of the 64-input "inclusive OR"
`50
`gate, it may also be possible to construct the gate as one
`very long transistor with 64 attachments to the gate of
`the transistor.
`The latency through the combining 64-input OR gates
`4 can be hidden through pipelining. Thus, one would
`55
`expect the orthogonal transformation instruction to
`execute in 64- A ticks once the B register is fully loaded
`which takes another 64- A ticks where A is a small
`number compared to 64. This should be the same basic
`timing as other vector instructions on typical comput
`es.
`Because the B register in ABT-C must be fully
`loaded before A can be pipelined into C, the inherent
`latency for matrix transposition (i.e. corner turn) is
`128-A ticks. However, if a particular machine has at
`65
`least two OTI functional units and if the total number of
`transposes desired is greater than or equal to two, then
`this latency can be partially hidden and in the limit
`
`If in addition to parentheses ( ) for word dimen
`sions, one were to use brackets
`for bit dimensions,
`then the new declaration would look something like
`X(n, n2, ..., ni) (m1, m2, ..., mil
`where array X has i word dimensions of sizen, n2, ...
`, ni and jbit dimensions of size mi, m2, . . . . mi.
`With BSOTI, it will now be possible to automatically
`vectorize both bit and word operations using standard
`compiler optimization strategies. Granularity issues
`may limit performance but full vector performance
`across large nior midimensions should always be possi
`ble if the corresponding my or ni dimensions, respec
`tively, are held constant. In addition, if the product of
`mi dimensions is 64 or less, then the bit permutation
`operation can be done totally in vector registers, i.e.
`gather/scatter to memory is not necessary to perform
`this kind of bit addressing.
`Some example codes with expected timings are set
`forth below:
`Let
`L be vector load
`S be vector store
`Ll be random vector load (gather)
`S be random vector store (scatter) and
`O be OTI or ABT-C
`and assume that all vector operations (other than load
`ing of B for OTI) can be chained one word per tick.
`Then the following timings for specific codes can be
`expected.
`
`O
`
`1, 2 . .
`
`.
`
`. Ili.
`
`1
`
`FORTRAN Code
`DO 1 I s , 64
`DO 1 J = 1, 64.
`YO)(J) as X(JD(I)
`DO = 1, 64.
`DO 1 J = 1, 64
`Y (I) = X(I)65-J)
`DO 1 I was 1, 64
`dO J s 1, 64
`1 Y (DJ = X(Q())R(JD)
`
`instruction
`ABTC
`Setup LA
`Stream LigOSC
`
`uence Tinin
`(ticks)
`64 - A
`128 - A
`
`Setup LB
`Strean LOSC
`
`64 - A
`64 - A
`
`64 - A
`Setup LB
`Stream LOSC 64 - A
`
`where Q and R are assumed to be permutations.
`While in accordance with the provisions of the patent
`statue the preferred forms and embodiments have been
`
`Oracle-1029, p. 5
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`10
`
`5,101,371
`8
`7
`stored in a memory and access of the words stored
`illustrated and described, it will be apparent to those of
`in memory are provided.
`ordinary skill in the art that various changes and modifi
`2. Apparatus as defined in claim 1, wherein said OR
`cations may be made without deviating from the inven
`gates comprise exclusive OR gates.
`tive concepts set forth above.
`3. Apparatus as define din claim 1, wherein said OR
`What is claimed is:
`gates comprise inclusive OR gates.
`1. Apparatus for performing a bit transformation,
`4. Apparatus as defined in claim 1, wherein said
`comprising
`switching means comprise AND gates.
`(a) a plurality of input lines for supplying a plurality
`5. Apparatus as defined in claim 1, wherein the num
`of bits, respectively, from a word in a first vector
`ber of input terminals is the square of the number of
`register;
`input lines.
`(b) a plurality of input terminals arranged in a matrix
`6. Apparatus as defined in claim 5, wherein the bits on
`of rows and columns for supplying a plurality of
`said input lines are the individual bits fetched from a
`bits from a plurality of words, respectively in a
`single first memory.
`second vector register;
`7. Apparatus as defined in claim 6, wherein the bits on
`(c) a plurality of switching devices arranged in a
`said input terminals are from a second memory having
`matrix of rows and columns corresponding with
`words comprising a number of bits corresponding to the
`said matrix of second input terminals, each of said
`square of the number of bits in said first memory.
`switching devices having two inputs and one out
`8. Apparatus as defined in claim 1, wherein said out
`put, one input of each switching device being con
`puts from said OR gates are connected with a computer
`20
`nected with one of said input terminals, respec
`memory for storing words.
`tively, for controlling the state of the switching
`9. Apparatus as defined in claim 1, wherein the num
`device, the other input of each switching device in
`ber of input lines corresponds with the number of bits in
`a row of switching devices being connected with
`a word, each bit being supplied to a separate input line.
`one of said input lines for receiving the bit input
`10. Apparatus as defined in claim 9, wherein said
`25
`therefrom; and
`input lines are pipelined from individual bits of words
`(d) a plurality of OR gates each of which is connected
`from a vector register.
`with the outputs of one column of said switching
`11. Apparatus as defined in claim 9, wherein said
`devices, each OR gate having an output providing
`outputs from said OR gates are pipelined to individual
`a transformation function of the bit inputs,
`bits of words of a vector register.
`whereby efficient access of bits within the word
`
`5
`
`35
`
`45
`
`50
`
`55
`
`65
`
`Oracle-1029, p. 6
`Oracle v. Teleputers
`IPR2021-00078
`
`