throbber
United States Patent (19)
`obst
`
`54
`
`APPARATUS FOR PERFORMING A BIT
`SERAL ORTHOGONAL
`TRANSFORMATION INSTRUCTION
`
`(75)
`
`Inventor: Kenneth obst, Silver Spring, Md.
`
`(73)
`
`Assignee:
`
`The United States of America as
`represented by National Security
`Agency, Washington, D.C.
`
`21
`
`Appl. No.: 533,238
`
`22
`
`Filed:
`
`Jun. 4, 1990
`
`(51)
`52)
`(58
`
`Int. Cl. ................................................ G06F 7/38
`U.S. C. ................................ 364/736; 364/715.0
`Field of Search ........... 364/736, 725, 716, 715.01,
`364/730,715.02
`
`USOOS10137A
`Patent Number:
`11
`45) Date of Patent:
`
`5,101,371
`Mar. 31, 1992
`
`
`
`56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`3,936,806 2/1976 Batcher .
`4,314,349 2/1982 Batcher ............................... 364/716
`4,727,474 2/1988 Batcher ......
`Primary Examiner-Tan V. Mai
`(57)
`ABSTRACT
`Apparatus for performing a bit serial orthogonal trans
`formation instruction (BSOTI) is characterized by a
`unique circuit of bit inputs from two vector registers, a
`plurality of switching devices, and OR gates for per
`forming transformation. The apparatus enables vector
`machines to operate on individual bits of words as
`quickly as it operates on the words stored in memory.
`Vector computers are thus able to perform efficient
`bit-serial arithmetic as well as the more traditional fast
`vector/scalar operations.
`
`11 Claims, 1 Drawing Sheet
`
`Ci,63
`
`Ci,62
`
`Oracle-1029, p. 1
`Oracle v. Teleputers
`IPR2021-00078
`
`

`

`U.S. Patent
`
`Mar. 31, 1992
`
`5,101,371
`
`F. G.
`
`
`
`
`
`Ci,62
`
`OUTPUT
`
`7 v/
`
`Oracle-1029, p. 2
`Oracle v. Teleputers
`IPR2021-00078
`
`

`

`1
`
`APPARATUS FOR PERFORMING A BIT SERIAL
`ORTHOGONAL TRANSFORMATION
`INSTRUCTION
`
`BACKGROUND OF THE INVENTION
`The present invention relates to vector computers for
`performing bit transformation to afford efficient access
`of bits within a word stored in a memory as well as
`access of the stored words.
`The bit serial orthogonal transformation instruction
`(BSOTI) is a vector instruction which per forms the
`following function:
`ABT-C
`
`O
`
`15
`
`5,101,371
`2
`granularity of the operation performed. Most commer
`cial vector machines, for example the Cray 2, have a
`vector size of 64 words by sixty-four bits per word.
`The advantage of vector machines lies in their tre
`mendous speed in processing information. The disad
`vantage of such machines is that their programs must be
`"vectorized' or structured in a particular manner so as
`to take advantage of its features. Therefore, program
`mers of vector computers must have detailed knowl
`edge of the machine's architecture to effectively utilize
`these machines for fine grain computation. This leads to
`significant increases in time expense and effort. Also,
`vector machines are not efficient in applications where
`fine granularity is required.
`The motivation for the orthogonal transformation
`instruction is to be able to operate on bits as efficiently
`as one currently operates on words of information
`stored in a computer memory. This is a problem be
`cause most computers store multiple bits per word, and
`these bits must be accessed that way for error detec
`tion/correction reasons. The major advantage of SIMD
`computers, in general, is their ability to exploit locality
`of memory reference and efficiently trade space for
`time. This is most effective when the bit serial complex
`ity of a computational kernel is small or the data fields
`involved are small. With the bit serial orthogonal trans
`formation instruction (BSOTI), vector computers will
`be able to perform efficient bit-serial arithmetic as well
`as the more traditional fast vector/scalar operations on
`full word integer and floating point values.
`The present invention was developed in order to
`allow vector computers to operate on bits of informa
`tion as effectively as SIMD computers, whereby the
`granularity problem associated with vector computers
`is overcome. Functionally, the invention works in con
`junction with the hardware gather scatter instruction to
`perform the following bit operations: matrix multiplica
`tion, matrix transposition, matrix row/column permuta
`tions, compression, and expansion.
`SUMMARY OF THE INVENTION
`Accordingly, it is a primary object of the present
`invention to provide a device for performing a bit trans
`formation. The device includes a plurality of input lines
`for supplying a plurality of bits, respectively, from a
`word in a first vector register and a plurality of input
`terminals arranged in a matrix of rows and columns for
`supplying a plurality of bits from a plurality of words,
`respectively, in a second vector register. A plurality of
`AND gates corresponding in number to the number of
`input terminals are provided, with each AND gate
`having one input connected with one of the input termi
`nals for controlling the state thereof and the other input
`connected with one of the bit input lines. All of the
`AND gates in a row have their other input connected
`with the same input line for receiving the bit input
`therefrom. The AND gates of each columm have their
`outputs connected with the input of an OR gate. Thus
`there is one OR gate for each column of AND gates.
`Each OR gate has an output providing a transformation
`function of the bit inputs, whereby efficient access of
`bits within a word stored in memory is afforded.
`According to a more specific object of the invention,
`the bits on the input lines are associated with the indi
`vidual bits fetched from a single first memory and the
`bits on the input terminals are associated with a second
`memory having words comprising a number of bits
`
`35
`
`30
`
`This instruction effectively doubles the instruction set
`of vector computers. More particularly, BSOTI allows
`vector machines to operate on individual bits of a word
`as quickly as they operate on words of information
`20
`stored in its memory. Thus, vector computers will be
`able to perform efficient bit-serial arithmetic as well as
`the more conventional fast vector/scalar operations.
`BRIEF DESCRIPTION OF THE PRIOR ART
`25
`The levels of speed and performance achieved by
`conventional supercomputers are made possible
`through the use of multiple central processing units
`(CPUs). The increase in the number of CPUs allows
`these machines to perform many computations at the
`same time. The simultaneous computing is referred to as
`parallel computing or parallel processing. Parallel pro
`cessing devices are well-known in the patented prior art
`as evidenced by the U.S. patents to Batcher U.S. Pat.
`No. 3,936,806, U.S. Pat. No. 4,314,349, and U.S. Pat.
`No. 4,727,474.
`Supercomputers carry out parallel processing in vari
`ous ways, and nomenclature has been developed to
`categorize these machines. The class of parallel comput
`ers referred to as single instruction multiple data
`(SIMD) computers are identified by a particular archi
`tecture in which a single master processor broadcasts
`the identical instruction to a large number of slave pro
`cessors. Each slave processor then executes this instruc
`tion on data fetched from its own particular local mem
`45
`ory.
`Several SIMD machines (e.g. CM-2, MPP, and
`DAP) have been built where the slave processors oper
`ate on only one bit. The advantage of using machines
`with one-bit slave processors is apparent in the areas of
`SO
`graphics and image processing. Furthermore, the ability
`to operate on an individual bit is necessary for certain
`iterative numerical computing tasks where the amount
`of precision needed increases with the number of itera
`tions (as in the field of cryptology). The major disad
`55
`vantage is that these machines are relatively slow com
`pared to other parallel computers.
`Vector computers are another class of parallel pro
`cessing machines. As used herein, a vector computer or
`vector machine is a relatively fast vector/scalar proces
`sor that treats a vector as a single entity of bits greater
`than a single word of memory. The instruction set is
`such that only one instruction is needed to process the
`entire vector. In many ways, the architecture of a vec
`tor machine resembles that of an SIMD machine in that
`it applies the same instruction to a large number of
`different data elements. The principle difference be
`tween a vector machine and an SIMD machine is the
`
`Oracle-1029, p. 3
`Oracle v. Teleputers
`IPR2021-00078
`
`

`

`5
`
`-continued
`
`O
`O
`
`0 0
`0 1 0
`
`represents the reverse
`identity matrix
`
`1 0 0
`
`O
`
`5,101,371
`3
`corresponding with the square of the number of bits in
`the first memory.
`According to a further object of the invention, the
`number of input lines corresponds with the number of
`bits in a word and each bit is supplied to a separate input
`line. The input lines can be pipelined from individual
`bits of words from a vector register.
`BRIEF DESCRIPTION OF THE FIGURES
`Other objects and advantages of the present invention
`will become apparent from a study of the following
`specification when viewed in the light of the accompa
`nying drawing, in which:
`FIG. 1 is a circuit diagram of the bit transformation
`apparatus according to a preferred embodiment of the
`invention; and
`FIG. 2 is a more detailed circuit diagram of a portion
`of the circuit of FIG, 1.
`DETAILED DESCRIPTION
`Mathematically, A, B, and C represent binary matri
`ces and
`ABT
`
`O
`
`5
`
`20
`
`For certain values of A and B, ABT-C can perform
`some very important functions including matrix trans
`position (i.e. "corner turn”), bit reversal, and arbitrary
`bit permutations.
`By way of explanation, matrix transposition is per
`formed as follows. If A represents the identity matrix,
`then ABT-C perform as a vector register transpose
`operation mapping bits of B into words of C and words
`of B into bits of C. This is a very important operation
`because it allows other instructions that normally oper
`ate on words to operate on bits. For example, gather/-
`scatter instructions that allow efficient arbitrary map
`pings between words now also allow arbitrary map
`pings between bits. In general, the normal LLOS in
`struction sequence is mapped into
`
`25
`represents the binary matrix multiplication of matrix A
`and matrix B transpose. This multiplication is per
`formed over a field which mathematicians refer to as
`GF(2).
`On a vector computer, A, B, and C are vector regis
`ters. For example, on the Cray series of computers, A,
`B, and C are 64x64 bit matrices organized as 64 words
`of 64 bits per word. Diagramatically, the mathematical
`function
`
`30
`
`ABT-C
`
`can be represented as
`
`BT
`A
`O O
`O O
`
`C
`O
`O
`
`-Ge.
`
`O O
`
`O
`
`where O
`
`represents one 64 bit word of a vector
`
`O
`O
`
`O
`
`represents one vector
`register of 64
`64-bit words
`
`1 O O
`O
`0
`
`O
`O
`
`O
`
`0 0 1
`
`represents the
`identity matrix
`
`35
`
`50
`
`55
`
`65
`
`LTLTOTS
`
`where
`L is the vector load
`S is the vector store
`O is the vector operation and
`T is the matrix transpose operation
`For bit reversal functions, let B represent the reverse
`identity matrix. Then ABT-C performs a bit reversal
`of the bits of vector register A.
`Finally, for arbitrary bit permutations, let B represent
`an arbitrary permutration matrix (i.e. an arbitrary order
`ing of the rows of the identify matrix). Then ABT-C
`performs an arbitrary permutation of the bits of vector
`register A.
`The circuitry for performing the orthogonal transfor
`mation instruction ABT-C in accordance with the
`invention will now be described with reference to FIG.
`1. The invention will be described with reference to
`vector or word sizes of sixty-four bits, although as is
`apparent to those of ordinary skill in the art, the inven
`tion is not limited to such sizes, so long as the word size
`of the machine equals the vector register size in words.
`There are provided a plurality of input lines a. In the
`drawing, let a refer to the jth bit of word i of vector
`register A. Similarly, let bit refer to the jth bit of word i
`of vector register B. The same designations are used for
`the bits and words of register C (i.e. ci). The words of
`a vector register are ordered from the first (i=0) to the
`last (i=63) word and the bits of a vector register are
`ordered from most significant (j=63) to the least signifi
`cant (ji=0). In addition to the input lines a, there are a
`plurality of input terminals b. Preferably, the number of
`input terminals b is the square of the number of input
`lines a. The input terminals b are arranged in a matrix of
`rows and columns. Each column corresponds with a
`word i in register B while each row corresponds with a
`bit j of the particular word.
`The input lines input a bit from a word in vector
`register A, while each input terminal b inputs a particu
`lar bit from a particular word in register B.
`
`Oracle-1029, p. 4
`Oracle v. Teleputers
`IPR2021-00078
`
`

`

`5,101,371
`6
`5
`reduced to 64--A" ticks using both OTI units. There
`A plurality of switching devices 2, preferably AND
`gates, are arranged in a matrix and have inputs con
`fore, it is desirable to have at least two OTI units on a
`nected with the input lines a and terminals b as shown in
`particular machine.
`FIG. 1. Each AND gate 2 has a pair of inputs. A first
`Since both SIMD and vector machines need to pro
`input of each AND gate is connected with one of the
`vide some kind of error detection/correction on very
`input terminals b, respectively, for receiving a bit from
`large memories, accesses to memory must be full word
`accesses. Therefore, in general the expected perfor
`a word of the second register. These bits control the
`state of operation of the gate. A second input for each
`mance for single bit indirect fetches will be the same
`AND gate in a first row of gates is connected with the
`with or without this invention, namely 64 bits for 1 on
`first input line for receiving a bit from the first register.
`Cray machines and 32 bits for 1 on a Thinking Machines
`Similarly, the AND gates of the succeeding rows have
`CM-2.
`second inputs connected with successive input lines,
`An example of one type of software for the orthogo
`respectively.
`nal transformation instruction will now be described. In
`A plurality of OR gates 4 are provided in the cir
`FORTRAN, arrays of 64-bit words are declared as
`cuitry according to the invention. The inputs to each
`15
`OR gate are connected with the outputs of a column of
`X(n1, m2,..., ni)
`AND gates as shown. Thus there is one OR gate for
`where array X has idimensions of
`each column of AND gates. The OR gates have outputs
`c which contain the bit transformation of the bit inputs
`from the vector registers A and B via the input lines a
`20
`and terminals b, respectively.
`The transformation circuit of FIG. 1 enables efficient
`access of the bits within a word stored in a memory as
`well as access of the words themselves.
`In a preferred embodiment, the OR gates comprise
`25
`"exclusive OR' gates for full realization of the orthogo
`nal transformation instruction. With full realization, all
`4K bits of B are first loaded. Then the words of A are
`pipelined through the circuit into C. The circuit re
`quires 4K AND gates, one for each bit of B, and sixty
`30
`four 64-input "exclusive OR' gates.
`Partial realization of the orthogonal transformation
`instruction is realized by substituting "inclusive OR'
`gates for the more costly "exclusive OR' gates. Such a
`substitution still provides matrix transposition, bit rever
`35
`sal, and arbitrary bit permutation functions. However,
`the binary matrix multiplication function would not be
`possible.
`Partial realization of matrix transposition functions
`may also be realized. If B and C are designated as a pair
`of vector registers, then they could be wired together
`such that bit is directly connected with ci. Then matrix
`transposition could be performed directly by loading B
`and accessing C.
`Referring now to FIG. 2, the actual hardware realiza
`45
`tion of the 64 input "exclusive OR' or inclusive OR'
`gates will be described. Let do. . . . d53 be the inputs.
`Then the 64 input gate can be realized as a cascade of 21
`4-input gates as. All outputs are latched for pipelined
`operation. For the case of the 64-input "inclusive OR"
`50
`gate, it may also be possible to construct the gate as one
`very long transistor with 64 attachments to the gate of
`the transistor.
`The latency through the combining 64-input OR gates
`4 can be hidden through pipelining. Thus, one would
`55
`expect the orthogonal transformation instruction to
`execute in 64- A ticks once the B register is fully loaded
`which takes another 64- A ticks where A is a small
`number compared to 64. This should be the same basic
`timing as other vector instructions on typical comput
`es.
`Because the B register in ABT-C must be fully
`loaded before A can be pipelined into C, the inherent
`latency for matrix transposition (i.e. corner turn) is
`128-A ticks. However, if a particular machine has at
`65
`least two OTI functional units and if the total number of
`transposes desired is greater than or equal to two, then
`this latency can be partially hidden and in the limit
`
`If in addition to parentheses ( ) for word dimen
`sions, one were to use brackets
`for bit dimensions,
`then the new declaration would look something like
`X(n, n2, ..., ni) (m1, m2, ..., mil
`where array X has i word dimensions of sizen, n2, ...
`, ni and jbit dimensions of size mi, m2, . . . . mi.
`With BSOTI, it will now be possible to automatically
`vectorize both bit and word operations using standard
`compiler optimization strategies. Granularity issues
`may limit performance but full vector performance
`across large nior midimensions should always be possi
`ble if the corresponding my or ni dimensions, respec
`tively, are held constant. In addition, if the product of
`mi dimensions is 64 or less, then the bit permutation
`operation can be done totally in vector registers, i.e.
`gather/scatter to memory is not necessary to perform
`this kind of bit addressing.
`Some example codes with expected timings are set
`forth below:
`Let
`L be vector load
`S be vector store
`Ll be random vector load (gather)
`S be random vector store (scatter) and
`O be OTI or ABT-C
`and assume that all vector operations (other than load
`ing of B for OTI) can be chained one word per tick.
`Then the following timings for specific codes can be
`expected.
`
`O
`
`1, 2 . .
`
`.
`
`. Ili.
`
`1
`
`FORTRAN Code
`DO 1 I s , 64
`DO 1 J = 1, 64.
`YO)(J) as X(JD(I)
`DO = 1, 64.
`DO 1 J = 1, 64
`Y (I) = X(I)65-J)
`DO 1 I was 1, 64
`dO J s 1, 64
`1 Y (DJ = X(Q())R(JD)
`
`instruction
`ABTC
`Setup LA
`Stream LigOSC
`
`uence Tinin
`(ticks)
`64 - A
`128 - A
`
`Setup LB
`Strean LOSC
`
`64 - A
`64 - A
`
`64 - A
`Setup LB
`Stream LOSC 64 - A
`
`where Q and R are assumed to be permutations.
`While in accordance with the provisions of the patent
`statue the preferred forms and embodiments have been
`
`Oracle-1029, p. 5
`Oracle v. Teleputers
`IPR2021-00078
`
`

`

`10
`
`5,101,371
`8
`7
`stored in a memory and access of the words stored
`illustrated and described, it will be apparent to those of
`in memory are provided.
`ordinary skill in the art that various changes and modifi
`2. Apparatus as defined in claim 1, wherein said OR
`cations may be made without deviating from the inven
`gates comprise exclusive OR gates.
`tive concepts set forth above.
`3. Apparatus as define din claim 1, wherein said OR
`What is claimed is:
`gates comprise inclusive OR gates.
`1. Apparatus for performing a bit transformation,
`4. Apparatus as defined in claim 1, wherein said
`comprising
`switching means comprise AND gates.
`(a) a plurality of input lines for supplying a plurality
`5. Apparatus as defined in claim 1, wherein the num
`of bits, respectively, from a word in a first vector
`ber of input terminals is the square of the number of
`register;
`input lines.
`(b) a plurality of input terminals arranged in a matrix
`6. Apparatus as defined in claim 5, wherein the bits on
`of rows and columns for supplying a plurality of
`said input lines are the individual bits fetched from a
`bits from a plurality of words, respectively in a
`single first memory.
`second vector register;
`7. Apparatus as defined in claim 6, wherein the bits on
`(c) a plurality of switching devices arranged in a
`said input terminals are from a second memory having
`matrix of rows and columns corresponding with
`words comprising a number of bits corresponding to the
`said matrix of second input terminals, each of said
`square of the number of bits in said first memory.
`switching devices having two inputs and one out
`8. Apparatus as defined in claim 1, wherein said out
`put, one input of each switching device being con
`puts from said OR gates are connected with a computer
`20
`nected with one of said input terminals, respec
`memory for storing words.
`tively, for controlling the state of the switching
`9. Apparatus as defined in claim 1, wherein the num
`device, the other input of each switching device in
`ber of input lines corresponds with the number of bits in
`a row of switching devices being connected with
`a word, each bit being supplied to a separate input line.
`one of said input lines for receiving the bit input
`10. Apparatus as defined in claim 9, wherein said
`25
`therefrom; and
`input lines are pipelined from individual bits of words
`(d) a plurality of OR gates each of which is connected
`from a vector register.
`with the outputs of one column of said switching
`11. Apparatus as defined in claim 9, wherein said
`devices, each OR gate having an output providing
`outputs from said OR gates are pipelined to individual
`a transformation function of the bit inputs,
`bits of words of a vector register.
`whereby efficient access of bits within the word
`
`5
`
`35
`
`45
`
`50
`
`55
`
`65
`
`Oracle-1029, p. 6
`Oracle v. Teleputers
`IPR2021-00078
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket