`
`Page 1 of 15
`
`FORD 1115
`
`
`
`matrix techniques and sparse matrix techniques. Dense matrix techniques are best suited for relatively small matrices, where the underlying physical model is very closely coupled. Sparse matrix techniques are best suited for larger matrices where the underlying physical model is very loosely coupled. Dense matrix techniques assume that most matrix elements are non-zero, and hence that most operations involving those matrix elements are non-trivial. Sparse matrix techniques try to take advantage of the fact that a large fraction of the matrix elements are zero, and hence operations involving these elements are trivial. However, sparse matrices may still have relatively complex structure, even though a large portion of their elements are zero, so flexible methods must be used to represent and manipulate these sparse matrices. Most sparse matrix techniques, therefore, make use of pointers and linked structures which are much more reminiscent of symbolic than of numeric processing. In symbolic processing, the underlying relationships are almost always sparse, hence the predominance of pointers and linked structures. Nevertheless, there are occasions where the relationships are so unstructured or so dense that pointer techniques are not optimal for symbolic processing. For example, the transitive closure of a binary relation is often dense, even ff the original relation is not. In cases such as these, bit-vectors and bit-arrays can often be useful, if their implementation is efficient enough. ([Ait-Kaci89] performs transitive closures of object-oriented hierarchies using bit-vector techniques.) Below is a straight-forward implementation in Common Lisp of the "or-and" multiplication of a bit-matrix by a bit-vector, which is useful in computing the image set resulting from the application of a binary relation to a domain set. (defun mult-mat-vec (a v) "Post-multiply bit-matrix a by bit-vector v." (declare (type (array bit 2) a) (bit-vector v)) (assert (= (array-dimension a i) (length v)) (let* ((m (array-dimension a 0)) (result (make-sequence 'bit-vector m :initial-element 0))) (dotimes (i m) (when (intersection-test (nmatrix-row a i) v) (setf (elt result i) I))) result) ) (defun nmatrix-row (a i) "Returns a vector displaced to the i'th row of a matrix a." (declare (type (array * 2) a)) (let ((n (array-dimension a I))) (make-array n :displaced-to a :displaced-index-offset (* i n)))) (defun intersection-test (vl v2) "Returns true if bit-vectors intersect one (declare (bit-vector vl v2)) (assert (= (length vl) (length v2)) (some #'logtest vl v2)) another. " © 1989,1990 by Nimble Computer Corporation. III-2.9
`
`Page 2 of 15
`
`FORD 1115
`
`
`
`Below is an implementation of Warshall's famous "in-place" algorithm for computing the transitive closure of a bit-matrix in Common Lisp. (Unfortunately, due to bugs in the implementation of displaced bit-arrays, this elegant program will not run on some Common Lisp implementations.) (defun nwarshall (a) "Implements Warshall's algorithm for the transitive closure." "Transitively closes the square bit matrix a in place." "[Baase, p.223]" (declare (type (array bit 2) a)) (assert (= (array-dimension a 0) (array-dimension a i))) (let ((n (array-dimension a 0))) (dotimes (k n) (let ((ak (nmatrix-row a k))) (dotimes (i n) (let ((ai (nmatrix-row a i))) (unless (zerop (bit a i k)) (bit-ior ai ak t))))))) a) Common Lisp provides a number of representations and operations for bit vectors and arrays. The most obvious is the "BIT-VECTOR" datatype, which is a subtype of "ARRAY" in which the elements are restricted to "BIT"'s, i.e., the integers zero and one, and in which the array rank (number of dimensions) is one. Bit-vectors of this type (which we will call "array" bit-vectors) are also Common Lisp "sequences" because they are one-dimensional arrays, so they inherit all of the operations of the Common Lisp sequence functions. However, there are a number of additional Common Lisp functions which operate on all "bit-arrays"Di.e., arrays of bits. Cun0usly, even though Common Lisp has an abbreviation for "(ARRAY BIT (*))" D "BIT-VECTOR, -- and an abbreviation for "(SIMPLE-ARRAY BIT (*))" ~ "SIMPLE-BIT-VECTOR" ~ it does not provide a standard abbreviation for "(ARRAY BIT *)", for which the obvious abbreviation would be "BIT-ARRAY". Finally, there is an extensive treatment of integers as bit-vectors (which we will call "integer" bit-vectors) through their 2's-complement notation, including "logand", "integer-length", etc. (Logical operations on Maclisp bignums were first implemented to support LINGOL [Pratt73] through the BBOOLE package [Baker75], and eventually found their way into the MIT Lisp Machine and Common Lisp.) Unfortunately for a potential user of bit-vectors in a Common Lisp program, these different representations and function suites have not been rationalized, so it may be difficult to decide a priori which representation to use in an application. To make things worse, some Common Lisp vendors provide such inefficient implementations of some of the bit-vector functions that they are useless for real programming. Before delving into the efficient implementation of the various bit-vector operations, we first give some perspective on the relationships among these different operations through the following chart, which compares the operations available for "integer" bit-vectors with those available for "array" bit-vectors. (See [Eliot89] for a discussion of various representations for sets in Common Lisp; note particularly the chart in his Figure 3.) © 1989,1990 by Nimble Computer Corporation. I I I - 2.1 0
`
`Page 3 of 15
`
`FORD 1115
`
`
`
`Ot~erafion Inteeer Version ~ Seauence Version Copy v unnecessary make-array :initial-contents v copy-seq Make-all-zeros 0 make-array :initial-element 0 make-sequence :initial-element 0 Make-all-ones -1 make-array :initial-element 1 make-sequence :initial-element 1 Test all zeros zerop, not ldb-test N/A not find 1 Test all ones eq1-1 N/A not find 0 Size integer4ength array-dimension(s) length Access i'th bit logbitp bit, sbit elt Boolean functions boole N/A N/A logand bit-and map #'logand logior bit-ior map #'logior logxor bit-xor map #'logxor logeqv bit-eqv map #'logeqv lognand bit-nand map #'lognand lognor bit-nor map #'lognor logandc 1 bit-andc 1 map #'logandc 1 logandc2 bit-andc2 map #'logandc2 logorcl bit-orcl map #'logorcl logorc2 bit-orc2 map #'logorc2 lognot bit-not map #'lognot Intersect test logtest N/A some #'logtest Subset test N/A N/A every #'<= Select portion ldb,ash make-array :displaced subseq Count bits logcount N/A count Leftmost 1 integer-length N/A position Mask a field mask-field N/A :start, :end Reverse a bit-vector N/A N/A reverse, nreverse Concatenate bit-vectors logior+ash N/A concatenate Compare bit-vectors --, equal, equalp equal, equalp mismatch Search for subsequence N/A N/A search Block initialize N/A :initial-element ? fill "BL'F' block transfer dpb :initial-contents ? replace, setf-subseq Chart of analogous operations for integers, bit-arrays and bit sequences. We can see by this chart that there is a great deal of overlap between the types of operations available for the various forms of bit-vectors. However, there are also some obvious holes, such as the lack of a quick intersection test for "array" bit-vectors, the lack of a subset test, and the lack of a reverse operation on integers. (A reverse operation on integers would come in handy for the implementation of FFT's, and would also aid in certain other recursive decomposition tasks.) The largest difference, however, between array-type bit vectors and integer-type bit vectors is in the fact that integer bit-vectors are functional, while array bit-vectors allow for side-effects. Thus, if n different pointers exist to an integer bit-vector, and one applies the "lognot" function to that bit-vector, then this operation only affects the result, and not any of the n existing pointers to the integer. On the other hand, "bit-not" with a second argument of "t" will flip all of the bits in the bit-vector given as its first argumentmnot only for the result bit-vector, but also for every other holder of a pointer to that argument bit-vector. Therefore, integer bit-vectors are better for functional programming, while array bit-vectors are better for state-oriented programming. (Those in the functional programming camp may argue that the functional integer bit-vectors are good enough for all purposes. However, I am not aware of any Lisp compilers which are smart enough to © 1989,1990 by Nimble Computer Corporation. I I I - 2.1 1
`
`Page 4 of 15
`
`FORD 1115
`
`
`
`solve the "aggregate update problem" for integer bit-vectors [Schwartz75][Hudak86], and therefore the manipulation of integer bit-vectors results in quite a bit more cons'ing than might be used when programming in a more imperative style utilizing array bit-vectors. Without a generational garbage collector [Moon84], the excellent efficiency of the integer bit-vector operations is nullified by the large amount of garbage collection that results from their use.) Given that we would like a complete complement of bit-vector operations for either style of programming, there are two ways to fill the holes in our chart--add additional operations for both types of bit-vectors and/or add a conversion function which offers the efficient conversion of one type into the other. Since we want to allow for the widest possible flexibility in the use of the Common Lisp language, we suggest doing both--filling the holes in the chart with new functions and providing for the interconvertability of integers and array bit-vectors (see also [Eliot89] for a similar proposal). (There are often other differences between integer bit-vectors and array bit-vectors due to the vagaries of an implementation. E.g., because Coral Common Lisp on the Macintosh allocates temporary space from the run-time stack for all integer operations, its operations are very fast, but its integers are also limited to 32K bytes, which is quite large enough for most integer calculations, but not large enough for many multi-dimensional bit-array applications.) Some languages other than Common Lisp offer bit-vector facilities for bit-vectors which exceed the word length of the underlying hardware. The original Pascal implementation offered "sets" up to size 60, thanks to Control Data. The Ada language [AdaLRM] offers the equivalent of Common Lisp's bit-not, bit-and, bit-ior, bit-xor, equal, subseq, copy, concatenate,fill, and replace as builtin capabilities for arbitrary-length bit-vectors. Interestingly enough, efficient Ada compilers must deal with the full complexity of packed bit-vectors which can occur with any bit-offset to a word boundary, just as we describe below. Ada also offers one builtin capability not available in Common Lispwa lexicographic comparison, although such a comparison can easily be emulated using mismatch. Of course, the APL language--as the inspiration for many Common Lisp functions offers many interesting and useful functions for manipulating bit-vectors, some of which are missing in Common Lisp--scan and bit-vector-reduce--which can often be quite useful. Efficient Implementation of Sequence and Array Functions on Bit-vectors Because "array" bit-vectors have a wider variety of operations, are more complex to implement, and are usually implemented with less efficiency than "integer" bit-vectors, we will focus on the efficient implementation of the various operations on array-type bit-vectors. The implementation of "integer" bit-vector operations can then utilize the same machinery, except that automatically extending and contracting the length of these bit-vectors requires some care (see [White86] for a discussion of the efficient implementation of the various algebraic ring operations on bignums). By efficient, we mean that we would like to take advantage of the SIMD-type "word parallelism" that exists on all modern serial computers. In other words, if we can operate on 16-bit words at a time (called "Whiz-Along-By-Words" in [White86]) instead of single bits, we would like to achieve a 16-fold speedup over the fully serial operation operating on only a single bit at a time. Clearly, the larger the word size that we can utilize, the faster these operations should go. There are some complications, however. Very few computers offer addressing capabilities down to the individual bit, and demand that words be addressed on "word boundaries", so there are the possibilities that incomplete words need to be processed, and that two bits which are to be combined may reside at different bit locations within a word. Both of these complications will require additional processing time, and add substantial complexity to the task of efficient implementation. Yet the © 1989,1990 by Nimble Computer Corporation. I I I - 2.1 2
`
`Page 5 of 15
`
`FORD 1115
`
`
`
`payoff of 16-64 X speedups is far too important to ignore. Virtually all processing of bit-vectors in Common Lisp is stream-like, involving one or more input streams of bits, and zero or one output streams of bits. (One can easily conceive of interesting operations involving multiple output streams of bits, such as a dual complement/reverse complement operation which treats the bit-vectors as sets, but none of the standard Common Lisp bit-vector operations requires more than one output bit stream.) Depending upon the operation, these bit-streams are derived from the bit-vector by processing from the start of the vector to the end, or from the end of the vector to the start. The bit-stream may also need to be inverted in the process of being read or written. The key to the efficient processing of bit-vectors is in the ability to process more than one bit at a time. Therefore, these bit streams need to be able to read or write more than one bit at a time (White calls these chunks "bigits" [White86]). Ideally, one would like to specify the number of bits to be processed---or "byte size"--perhaps 8 bits at a time for one operation, or 32 bits at a time for another operation. The complications of word boundaries occurs when processing multiple bits--sometimes one must process less than the full word width, and sometimes the word to be processed straddles a word boundary. (By word, we mean the smallest unit which can be efficiently read and written to memory. This does not necessarily mean the smallest addressable unit, as many modem "CISC" chips can address a 32-bit word occurring on any 8-bit boundary, but the maximum efficiency is reached when 32-bit words are read and written on 32-bit boundaries. By byte, we mean some number of bits which has been chosen as convenient for a particular type of processing; our bytes do not necessarily have 8 bits.) Vanilla Common Lisp already has an excellent model for bit streams which could theoretically be used for these purposes--the "binary" file. Binary files in Common Lisp are homogeneous sequences of "bytes", but since the byte size need not remain the same for all uses of the file, the only portable implementation is as a sequence of bits. One could conceive of setting up a binary file associated with each bit-vector which needed to be processed, which would in turn set up a "byte-stream" which could be read or written, queried for end-of-file, etc. Such a stream could easily be set up to read either from the beginning or the end, and to perform on-the-fly complementation. A true Common Lisp "stream" implementation of the bit-vector operations would be very easy to write code for, and the code would be quite readable. However, such an implementation would be even more inefficient than a straight-forward serial implementation of the same functions. This is because most implementations of streams in Common Lisp systems implement stream operations as function calls, and expect to perform quite a bit of processing per byte read or written, so the relative overhead of the stream operations is not normally objectionable. However, in the implementation of bit-vector operations, such overhead would be intolerable, since the cost of a bit-operation is essentially free compared with the cost of reading and writing the data. However, rather than "throw out the baby with the bath water", we can keep the "byte stream" model of processing, but implement it with some extremely efficient macros instead of true Common Lisp streams, and thereby achieve intellectual economy together with high execution speed. (A problem occurs in trying to implement these byte stream macros in portable Common Lisp---there is no efficient way to access the representation of a bit-vector where the actual bits are stored. Coral Common Lisp offers an escape--some open-coded "sub-primitives" which enable us to access 8-bit, 16-bit and 32-bit "bytes" at any offset within an object--irrespective of the type of the object. While using these operations involves delving unnecessarily deeply into the gory details of the © 1989,1990 by Nimble Computer Corporation. I I I - 2.1 3
`
`Page 6 of 15
`
`FORD 1115
`
`
`
`representation of bit-vectors, Common Lisp provides no other intermediate level access to the actual bits. The basic nature of these routines is roughly equivalent to ldb for a particular subset of sizes and positions, except that we operate on bit-vectors instead of integers. Perhaps a future Common Lisp revision can incorporate such an operation.) At the same time that we implement a new set of specialized and efficient byte-stream macros, we can also handle one of the complications of word addressing at the same time. While it is obvious that the last byte of a sequence may sometimes not be a full byte, but only a partial byte, there are times when it is more efficient to also process thefirst byte as only a partial byte. This is for "synchronization" reasons, when the processing of the middle (neither first nor last) bytes is thereby speeded up. Since we expect that bit-vectors will sometimes be quite long, it is important that the middle bytes be processed in the most efficient manner. Therefore, in the setting up of such a byte stream, we must separately specify both the offset within a byte of the beginning of the first byte as well as the length of the first byte. What different capabilities will be required of our "byte-streams"? We will obviously need to read and write them, but less obvious is the need to "update" them in place. Due to the possibility of the "t" argument in the "bit-xxx" bit-array logical operations, it will be necessary to be able to read, then write back into the same location, the result of an operation. We will also have to read backwards (although we will keep the order of the bits within the byte the same asthe forwards order), and we may also want to be able to complementa byte after reading or before writing. Due to the fact that during any builtin Common Lisp bit-vector operation, at most one bit stream will be written (or updated), we can always "synchronize" all of the bit streams being read to the one being written. Once this synchronization has been performed, we can always write words on word boundaries, and thereby avoid a large amount of complexity and inefficiency, because writing across word boundaries entails more inefficiency than reading across word boundaries. Therefore, only the first and last byte on a write (or update) stream will be less than a full byte, and even that byte will never cross a word boundary. Thus, for implementing just the standard Common Lisp bit-vector functions, we need not implement the most general possible write/update stream. (Note that even if a computer offered hardware addressing to the single bit, and allowed the hardware reading and writing of words at any bit boundary, it would still be much more efficient for the software to operate on full-word boundaries. This is because while the hardware reduces the software complexity in terms of the number of instructions executed, it is still slower, due to the larger number of cache misses and memory accesses.) Synchronization of different byte streams is the process of choosing a stream which will be processed (except for its first and last bytes) on a word boundary, and then computing the appropriate offsets for the other streams so that the corresponding bits become aligned for processing. With a single stream, we can always synchronize the stream to itself, while if there are two streams, we can synchronize one to the other. (As we pointed out above, we will always synchronize to the write stream because wnting unaligned streams is more expensive than reading unaligned streams.) With the possibility of up to two streams for reading and one stream for writing required to implement some Common Lisp bit-vector operations, we have the possibility that all three streams will have different alignments, so we force both read streams to synchronize to the write stream. We call the case of similar alignments "aligned" in our benchmarks, and the case of different alignments "unaligned". Many implementors who first approach Common Lisp bit-arrays assume that because the definition of the "bit-xxx" functions (e.g., "bit-and", "bit-ior") require the same rank and the same dimensions, that they will always be "synchronized" in the sense that corresponding bits from the different arrays © 1989,1990 by Nimble Computer Corporation. 111-2.14
`
`Page 7 of 15
`
`FORD 1115
`
`
`
`will occur at the same location within a word. However, this is not the case when one or more of the arrays is "displaced", as the "index-offset" of a displaced array can be any non-negative integer, which can therefore position the first element of the array at any bit within a word. In addition, these functions must also be careful not to disturb bits earlier in the word before the first element or later in the last word, because the underlying simple-array may also be accessible to the user, and bits which are not participating in the "bit-xxx" function should not be affected. The "nwarshalr' algorithm shown above depends cntically upon the correct implementation of "bit-xxx" functions on displaced arrays. (We note that most modern workstations already have extremely efficient implementations of the "bit-xxx-t" functions under the guise of the "bitblt" function used to manipulate the two-dimensional bitmap of the screen. In fact, Symbolics Common Lisp utilizes these functions to implement the "bit-xxx" functions, and they are very fast; curiously, Symbolics sequence functions on bit-vectors are abysmally slow (ca.Rel. 7.0). However, bitblt (at least as originally envisioned by the Xerox Alto) is a two-dimensional operation involving only two arguments, and translating its use to a single dimension and three arguments for use within Common Lisp can require some care. Coral Common Lisp, for example, does not handle the case of unafigned arguments correctly. When special purpose bitblt hardware is available for use within normal memory, its could provide the effect of an array processor for these Common Lisp functions. One must be careful, however, of potentially long startup overheads for these 2-dimensional operations; Common Lisp bit-vectors which are short also want to be efficiently manipulated.) We note that several proposals [Moon] have been made to modify the nature of the standard Common Lisp bit-vector functions. We feel that our techniques will not be affected in any major way by these proposals, so we have retained the original 1984 Common Lisp semantics [Steele]. Efficient irnplementation--general comments We implement the Common Lisp operations on a substrate made up of primitives in which only simple-bit-vectors are supported. We strip out displaced arrays and higher-order dimensions in the upper levels of the implementation. Furthermore, all of the primitives come with explicit "start" and "end" parameters so that arbitrary bit offsets can be represented. While simple-bit-vector operations (without the "start/end" parameters) could be done, and would be slightly simpler, the general case needs to be as efficient as possible, since it occurs more often than one might expect---due to programming styles exhibited by our "nmatrix-row" routine at the start of this paper. We will now go through the Common Lisp bit-vector functions in turn and indicate how they can be efficiently implemented. All times given are for Coral Common Lisp 1.2 running on a Macintosh Plus with a 16MHz Radius 68020 Accelerator card and 4Mbytes of main memory. Find. Position Find returns the item being sought if it occurs in the stream, while position returns the bit-position of the item found. Both functions return nil if the item is not found within the bit-vector. Both functions are essentially the same process, and differ only in the information returned to the caller. The implementation involves one "read" stream, synchronized to itself. The basic operation of "find1" or "positionl" is to find the first non-zero byte within the stream (after appropriate masking on first and last bytes). "Find0" and "position0" utilize a complemented byte stream. Once a non-zero byte is found, position determines the first bit within the byte by using a lookup table (on 8-bit bytes). The "with-end" versions read the stream in reverse. Our implementation of find~position in Coral Common Lisp utilizes a byte-size of 16 bits (hence © 1989,1990 by Nimble Computer Corporation. I I I - 2.15
`
`Page 8 of 15
`
`FORD 1115
`
`
`
`processes the bit-vector 16 bits at a time), and operates in about .5t.tsec/bit processed. Our implementation of "findl" in Coral Common Lisp utilizes a byte-size of 16 bits (hence processes the bit-vector 16 bits at a time), and operates in .46ktsec/bit processed, while "find0" requires .62~sec/bit processed. Eoual. mismatch Equal and mismatch both compare two bit-vectors, except that equal returns t or nil, while mismatch returns the first position of the mismatch, or nil. Since the comparison can proceed either from the start or the end of the bit-vector, we require both normal and "from-end" versions of this process. Equal~mismatch requires two read streams, with the second being synchronized to the first, so that at least one stream is processed on word boundaries. When a non-equality is found, then mismatch calls "positionl" on the logxor of the two words to find the position of mismatch. Our implementation of equal~mismatch requires .82~sec/bit for aligned streams, and 1.41~sec/bit for unaligned streams. Fill. :initial-element Fill fills a subsequence of a given bit-vector with either all O's or all l's. Presumably, make-array and make-sequence call upon fill when the keyword ":initial-element" is present. Fill utilizes a single write-stream which is synchronized to itself, and always operates from the start to the end (since every element must be processed, there is no reason for a "from-end" version). Our implementation offill requires .45ktsec/bit processed. Reolace. :initial-contents Replace replaces a subsequence of one bit-vector with a subsequence of another bit-vector. However, replace is defined such that it "still works" even if the two bit-vectors are the same, and the source and destination subsequences overlap. In this case, replace acts as though the bits were first transferred from the source to a "scratch" temporary bit-vector, and then transferred to the destination. Replace is the closest operation Common Lisp has to a "bitblt" or "bit-block-transfer" operation. Since Common Lisp requires that replace "work correctly" even when source and destination overlap, we will require both a normal and a "from-end" version of replace, since when we need to move a subsequence of a bit-vector up within the same vector, and the destination overlaps the source, we will lose information ff we transfer information from the start rather than from the end. Whether replace-ing from the start or from the end, we always synchronize the source to the destination. Both versions of replace take the same amount of time in our implementation: .781.tsec/bit when source and destination are aligned, and 1.3ktsec/bit when unaligned. Bit-not Bit-not comes in three flavors: bit-not-t inverts a bit-vector in-place; bit-not-2 inverts one bit-vector into another, and bit-not-new is bit-not-2 into a new bit-vector. Bit-not-t utilizes a single update stream synchronized to itself, which can always process upwards because all bits must be processed. © 1989,1990 by Nimble Computer Corporation. 111-2.16
`
`Page 9 of 15
`
`FORD 1115
`
`
`
`Bit-not-2, on the other hand, can have the same kinds of overlap as in replace, and therefore requires both a from-start and a from-end version, with the second being chosen if the vector is being shifted upwards in-place. Bit-not-2 utilizes both a read and a write stream, with the read stream synchronized to the write-stream. On both bit-not-t and bit-not-2, the read stream is complemented. Bit-not-t takes .89~sec/bit, while bit-not-2 takes .991.tsec/bit aligned and 1.571xsec/bit unaligned. Bit-xxx The bit-xxx functions (bit-and, bit-ior, etc.) come in several forms, due to the possible overlaps of sources and destination. The bit-xxx-t functions always put the result back into the first argument, and therefore use both an update stream (for the first argument) and a read stream (for the second argument); the read stream is always synchronized to the update stream. Due to the