`
`Efficient Algorithms for Sorting and
`Synchronization
`
`Andrew Tridgell
`
`A thesis submitted for the degree of
`Doctor of Philosophy at
`The Australian National University
`
`February 1999
`
`Except where otherwise indicated, this thesis is my own original work.
`
`Andrew Tridgell
`February 1999
`
Till min kära fru, Susan
`
`Acknowledgments
`
`The research that has gone into this thesis has been thoroughly enjoyable. That en-
`joyment is largely a result of the interaction that I have had with my supervisors, col-
`leagues and, in the case of rsync, the people who have tested and used the resulting
`software.
`I feel very privileged to have worked with my supervisors, Richard Brent, Paul
`Mackerras and Brendan McKay. To each of them I owe a great debt of gratitude for
`their patience, inspiration and friendship. Richard, Paul and Brendan have taught
`me a great deal about the field of Computer Science by sharing with me the joy of
`discovery and investigation that is the heart of research.
`I would also like to thank Bruce Millar and Iain Macleod, who supervised my
`initial research in automatic speech recognition. Through Bruce, Iain and the Com-
`puter Sciences Laboratory I made the transition from physicist to computer scientist,
`turning my hobby into a career.
`The Department of Computer Science and Computer Sciences Laboratory have
`provided an excellent environment for my research. I spent many enjoyable hours
`with department members and fellow students chatting about my latest crazy ideas
`over a cup of coffee. Without this rich environment I doubt that many of my ideas
`would have come to fruition.
`The Australian Telecommunications and Electronics Research Board, the Com-
`monwealth and the Australian National University were very generous in providing
`me with scholarship assistance.
`Thanks also to my family who have been extremely understanding and support-
`ive of my studies. I feel very lucky to have a family that shares my enthusiasm for
`academic pursuits.
`Finally I’d like to thank my wonderful wife, Susan, who has encouraged me so
`much over the years. Many people told us that having two PhD students under the
`one roof is a recipe for disaster. Instead it has been fantastic.
`
`Abstract
`
`This thesis presents efficient algorithms for internal and external parallel sorting and
`remote data update. The sorting algorithms approach the problem by concentrat-
`ing first on highly efficient but incorrect algorithms followed by a cleanup phase that
`completes the sort. The remote data update algorithm, rsync, operates by exchang-
`ing block signature information followed by a simple hash search algorithm for block
`matching at arbitrary byte boundaries. The last chapter of the thesis examines a num-
`ber of related algorithms for text compression, differencing and incremental backup.
`
Contents

Acknowledgments
Abstract
Introduction

1 Internal Parallel Sorting
    1.1 How fast can it go?
        1.1.1 Divide and conquer
        1.1.2 The parallel merging problem
            1.1.2.1 The two processor problem
            1.1.2.2 Extending to P processors
        1.1.3 Almost sorting
        1.1.4 Putting it all together
    1.2 Algorithm Details
        1.2.1 Nomenclature
        1.2.2 Aims of the Algorithm
        1.2.3 Infinity Padding
        1.2.4 Balancing
        1.2.5 Perfect Balancing
        1.2.6 Serial Sorting
        1.2.7 Primary Merge
        1.2.8 Merge-Exchange Operation
        1.2.9 Find-Exact Algorithm
        1.2.10 Transferring Elements
        1.2.11 Unbalanced Merging
        1.2.12 Block-wise Merging
        1.2.13 Cleanup
    1.3 Performance
        1.3.1 Estimating the Speedup
        1.3.2 Timing Results
        1.3.3 Scalability
        1.3.4 Where Does The Time Go?
        1.3.5 CM5 vs AP1000
        1.3.6 Primary Merge Effectiveness
        1.3.7 Optimizations
    1.4 Comparison with other algorithms
        1.4.1 Thearling and Smith
        1.4.2 Helman, Bader and JaJa
    1.5 Conclusions

2 External Parallel Sorting
    2.1 Parallel Disk Systems
    2.2 Designing an algorithm
        2.2.1 Characterizing parallel sorting
        2.2.2 Lower Limits on I/O
        2.2.3 Overview of the algorithm
        2.2.4 Partitioning
        2.2.5 Column and row sorting
        2.2.6 Completion
        2.2.7 Processor allocation
        2.2.8 Large k
        2.2.9 Other partitionings
    2.3 Performance
        2.3.1 I/O Efficiency
        2.3.2 Worst case
        2.3.3 First pass completion
    2.4 Other Algorithms
    2.5 Conclusions

3 The rsync algorithm
    3.1 Inspiration
    3.2 Designing a remote update algorithm
        3.2.1 First attempt
        3.2.2 A second try
        3.2.3 Two signatures
        3.2.4 Selecting the strong signature
        3.2.5 Selecting the fast signature
        3.2.6 The signature search algorithm
        3.2.7 Reconstructing the file
    3.3 Choosing the block size
        3.3.1 Worst case overhead
    3.4 The probability of failure
        3.4.1 The worst case
        3.4.2 Adding a file signature
    3.5 Practical performance
        3.5.1 Choosing the format
        3.5.2 Speedup
        3.5.3 Signature performance
    3.6 Related work
    3.7 Summary

4 rsync enhancements and optimizations
    4.1 Smaller signatures
    4.2 Better fast signature algorithms
        4.2.1 Run-length encoding of the block match tokens
    4.3 Stream compression
    4.4 Data transformations
        4.4.1 Compressed files
        4.4.2 Compression resync
        4.4.3 Text transformation
    4.5 Multiple files and pipelining
    4.6 Transferring the file list
    4.7 Summary

5 Further applications for rsync
    5.1 The xdelta algorithm
    5.2 The rzip compression algorithm
        5.2.1 Adding an exponent
        5.2.2 Short range redundancies
        5.2.3 Testing rzip
    5.3 Incremental backup systems
    5.4 rsync in HTTP
    5.5 rsync in a network filesystem
    5.6 Conclusion

6 Conclusion

A Source code and data
    A.1 Internal parallel sorting
    A.2 External parallel sorting
    A.3 rsync
    A.4 rzip
    A.5 rsync data sets
    A.6 rzip data sets

B Hardware
    B.1 AP1000
    B.2 CM5
    B.3 RS6000

Bibliography

`Introduction
`
`While researching the materials presented in this thesis I have had the pleasure of
`covering a wider range of the discipline of Computer Science than most graduate stu-
`dents. My research began with the study of automatic speech recognition using Hid-
`den Markov Models, decision trees and large Recurrent Neural Networks[Tridgell
`et al. 1992] but I was soon drawn into studying the details of parallel sorting algo-
`rithms when, while implementing my Neural Network code on a CM5 multicom-
`puter, I discovered that suitable parallel sorting algorithms were unavailable.
`I was surprised to find that practical parallel sorting algorithms offered a lot of
`scope for active research, perhaps more than the rather densely populated field of au-
`tomatic speech recognition. As I have always enjoyed working on new algorithms,
`particularly algorithms where efficiency is of primary concern, I jumped at the chance
`to make a contribution in the field of parallel sorting. This led to the research pre-
`sented in the first chapter of this thesis and away from my original field of study.
`While completing this research I had the opportunity to participate in a coopera-
`tive research center research project called the PIOUS project[Tridgell 1996] in which I
`worked on a parallel filesystem for multicomputers. This research led to the creation
of a parallel filesystem called HiDIOS [Tridgell and Walsh 1996], which led me to in-
vestigate the natural extension of the problem of internal parallel sorting – external
parallel sorting. The result is the research presented in the second chapter of this
`thesis.
`The rest of the thesis is dedicated to the rsync algorithm which provides a novel
`method of efficiently updating files over slow network links. The rsync algorithm
`was a direct result of my work on parallel filesystems and external parallel sorting
`algorithms. The link is a simple text searching algorithm[Tridgell and Hawking 1996]
`that I developed and which is used by an application running on the HiDIOS parallel
`filesystem. It was this text searching algorithm which provided the seed from which
the rsync algorithm was able to grow, although the algorithm presented in this thesis
bears little resemblance to that seed.
`This thesis is a testament to the multi-faceted nature of computer science and the
`many and varied links within this relatively new discipline. I hope you will enjoy
`reading it as much as I enjoyed the research which it describes.
`
Chapter 1

Internal Parallel Sorting
`
`My first introduction to the problem of parallel sorting came from a problem in the
`implementation of an automatic speech recognition training program. A set of speech
`data needed to be normalized in order to be used as the input to a recurrent neural
`network system and I decided that a quick-and-dirty way of doing this would be to
`sort the data, then sample it at regular intervals to generate a histogram.
This is a terribly inefficient way of normalizing data in terms of computational
complexity, but quite a good one in terms of software engineering, because the
availability of easy-to-use sorting routines in subroutine libraries makes the amount
of coding required very small.
`of the speech recognition system so it seemed natural to do the same for the parallel
`version.
`With this in mind I looked in the subroutine library of the parallel machine I was
`using (a Fujitsu AP1000 [Ishihata et al. 1991] running CellOS) and discovered that
`there was a problem with my plan – the library totally lacked a parallel sorting routine.
`I had been expecting that there would be a routine that is the parallel equivalent of
`the ubiquitous qsort() routine found in standard C libraries. The lack of such a routine
`was quite a surprise and prompted me to look into the whole issue of parallel sorting,
`totally ignoring the fact that I only wanted a parallel sorting routine in order to solve
`a problem that didn’t really require sorting at all.
`A survey of the literature on parallel sorting routines was rather disappointing.
`The routines were not at all general purpose and tended to place unreasonable re-
`strictions on the amount of data to be sorted, the type of data and the number of
CPUs in the parallel system. It became clear that the world (or at least my corner of
it) needed a fast, scalable, general-purpose parallel sorting routine.
`The rest of this chapter details the design, implementation and performance of just
`such a routine.
`
`1.1 How fast can it go?
`
`A question that arises when considering a new algorithm is “How fast can it go?”. It
`helps greatly when designing an algorithm to understand the limits on the algorithm’s
`efficiency. In the case of parallel sorting we can perform some simple calculations
`which are very revealing and which provide a great deal of guidance in the algorithm
`design.
It is well known that comparison-based sorting algorithms on a single CPU require
log N! time¹, which is well approximated [Knuth 1981] by N log N. This limitation arises
from the fact that the unsorted data can be in one of N! possible arrangements and
that each individual comparison eliminates at most half of the arrangements from
consideration.
An ideal parallel sorting algorithm which uses P processors would reduce this
time by at most a factor of P, simply because any deterministic parallel algorithm can
be simulated by a single processor with a time cost of P. This leads us to the observation
that an ideal comparison-based parallel sorting algorithm would take time (N/P) log N.
Of course, this totally ignores the constant computational factors, communication
costs and the memory requirements of the algorithm. All those factors are very
important in the design of a parallel sorting algorithm and they will be looked at
carefully in a later part of this chapter.
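The arithmetic behind these two bounds can be made explicit. The following derivation
is a reconstruction rather than text from the thesis; it applies Stirling's approximation
to the serial bound and then divides the work among the P processors:

\[
\log_2 N! = N \log_2 N - N \log_2 e + O(\log N) \approx N \log N,
\qquad
T_P \ge \frac{\log_2 N!}{P} \approx \frac{N}{P} \log N.
\]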
`
`1.1.1 Divide and conquer
`
`We now consider parallel sorting algorithms which are divided into two stages. In
`the first stage each processor sorts the data that happens to start in its local memory
`and in the second stage the processors exchange elements until the final sorted result
`
¹ Throughout this thesis log x will be used to mean ⌈log₂ x⌉.
`
`
is obtained. It seems natural to consider dividing the algorithm in this manner, as
efficient algorithms for the first stage are well known.
How long would we expect each stage to take? The first stage should take
O((N/P) log(N/P)) simply by using an optimally efficient serial sorting algorithm on
each processor². This clearly cannot be improved upon³.
The more interesting problem is how long the second stage should take. We want
the overall parallel sorting algorithm to take O((N/P) log N) time, which means we
would ideally like the second stage to take O((N/P) log P) time. If it turns out that this
is not achievable then we might have to revisit the decision to split the algorithm into
the two stages proposed above.
`
`1.1.2 The parallel merging problem
`
`The second stage of the parallel sorting algorithm that is now beginning to take shape
is to merge P lists of N/P elements, each stored on one of P processors. We would like
this to be done in O((N/P) log P) time.
`
`1.1.2.1 The two processor problem
`
Let us first consider the simplest possible parallel merging problem, where we have
just two processors and each processor starts with N/2 elements. The result we want is
`that the first processor ends up with all the small elements and the second processor
`ends up with all the large elements. Of course, “small” and “large” are defined by
`reference to a supplied comparison function.
`We will clearly need some communication between the two processors in order to
`transmit the elements that will end up in a different processor to the one they start
`in, and we will need some local movement of elements to ensure that each processor
ends up with a locally sorted list of elements⁴.
`
² I shall initially assume that the data is evenly distributed between the processors. The problem of
balancing will be considered in the detailed description of the algorithm.
³ It turns out that the choice of serial sorting algorithm is in fact very important. Although there are
numerous “optimal” serial sorting algorithms their practical performance varies greatly depending on
the data being sorted and the architecture of the machine used.
⁴ As is noted in Section 1.2.12 we don’t strictly need to obtain a sorted list in each cell when this two
processor merge is being used as part of a larger parallel merge but it does simplify the discussion.
`
Figure 1.1: The parallelisation of an eight way hypercube merge
`
Both of these operations will have a linear time cost with respect to the number of
elements being dealt with. Section 1.2.8 gives a detailed description of an algorithm
that performs these operations efficiently. The basis of the algorithm is to first work
out which elements will end up in each processor using an O(log N) bisection search
and then to transfer those elements in a single block. A block-wise two-way merge is
then used to obtain a sorted list within each processor.
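As an illustration of the bisection step, the following serial C sketch finds how many
elements of each processor's list belong in the other. The names and the serial setting
are mine, not the thesis code; the find-exact algorithm of Section 1.2.9 is the distributed
version of this search.

#include <stddef.h>

/* Given sorted arrays a[0..na) and b[0..nb), return k, the number of
   elements of b that belong among the na smallest elements of the union.
   Processor 1 then keeps a[0..na-k) together with b[0..k); the rest moves
   to processor 2.  Costs O(log min(na, nb)) comparisons. */
static size_t split_point(const int *a, size_t na,
                          const int *b, size_t nb)
{
    size_t lo = 0;
    size_t hi = (nb < na) ? nb : na;   /* at most min(na, nb) can come from b */

    while (lo < hi) {
        size_t k = (lo + hi + 1) / 2;
        if (b[k - 1] < a[na - k])      /* b[k-1] precedes the k-th last a */
            lo = k;                    /* so at least k elements of b fit */
        else
            hi = k - 1;
    }
    return lo;
}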
The trickiest part of this algorithm is minimizing the memory requirements. A
simple local merge algorithm typically requires order N additional memory, which
would restrict the overall parallel sorting algorithm to dealing with data sets of less
than half the total memory in the parallel machine. Section 1.2.12 shows how to
achieve the same result with O(√N) additional memory while retaining a high degree
of efficiency. Merging algorithms that require less memory are possible but they are
quite computationally expensive [Ellis and Markov 1998; Huang and Langston 1988;
Kronrod 1969].
`
`1.1.2.2 Extending to P processors
`
Can we now produce a P processor parallel merge using a series of two processor
merges conducted in parallel? In order to achieve the aim of an overall cost of
O((N/P) log P) we would need to use O(log P) parallel two processor merges spread
across the P processors.
`The simplest arrangement of two processor merges that will achieve this time cost
`is a hypercube arrangement as shown for eight processors in Figure 1.1.
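The hypercube pairing can be stated compactly: in merge step s each node is paired
with the node whose index differs in bit s of its binary representation, and the node
whose bit is 0 keeps the smaller half. A minimal C sketch of the rule, assuming P is a
power of two (the helper names are mine):

/* Partner of node `rank' in hypercube merge step `step': flip one bit. */
static int merge_partner(int rank, int step)
{
    return rank ^ (1 << step);
}

/* The node whose step bit is 0 has the lower index of the pair and
   keeps the smaller half of the merged elements. */
static int keeps_small_half(int rank, int step)
{
    return (rank & (1 << step)) == 0;
}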
This seems ideal: we have a parallel merge algorithm that completes in log P
parallel steps, with each step taking O(N/P) time to complete.
There is only one problem: it doesn’t work!
`
`1.1.3 Almost sorting
`
`This brings us to a central idea in the development of this algorithm. We have so
`far developed an algorithm which very naturally arises out of a simple analysis of
`the lower limit of the sort time. The algorithm is simple to implement, will clearly
`be very scalable (due to its hypercube arrangement) and is likely to involve minimal
`load balancing problems. All these features make it very appealing. The fact that the
`final result is not actually sorted is an annoyance that must be overcome.
`Once the algorithm is implemented it is immediately noticeable that although the
`final result is not sorted, it is “almost” sorted. By this I mean that nearly all elements
`are in their correct final processors and that most of the elements are in fact in their
`correct final positions within those processors. This will be looked at in Section 1.3.6
`but for now it is good enough to know that, for large N, the proportion of elements
that are in their correct final position is well approximated by 1 − P/√N.
`This is quite extraordinary. It means that we can use this very efficient algorithm
`to do nearly all the work, leaving only a very small number of elements which are
`not sorted. Then we just need to find another algorithm to complete the job. This
`“cleanup” algorithm can be designed to work for quite small data sets relative to the
`total size of the data being sorted and doesn’t need to be nearly as efficient as the
`initial hypercube based algorithm.
`The cleanup algorithm chosen for this algorithm is Batcher’s merge-exchange al-
`gorithm[Batcher 1968], applied to the processors so that comparison-exchange oper-
`ations are replaced with merge operations between processors. Batcher’s algorithm
`is a sorting network[Knuth 1981], which means that the processors do not need to
`communicate in order to determine the order in which merge operations will be per-
`formed. The sequence of merge operations is predetermined without regard to the
`data being sorted.
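Because the sequence is predetermined, the schedule can be generated with a few
lines of code. The sketch below follows Batcher's merge-exchange sort as given in
[Knuth 1981] (Algorithm 5.2.2M), with the compare-exchange on elements replaced by
a merge-exchange on nodes; merge_exchange is a stand-in for the real inter-processor
operation, not the thesis code.

#include <stdio.h>

/* Stand-in: node a keeps the smaller half of the two nodes' merged
   elements, node b the larger half. */
static void merge_exchange(int a, int b)
{
    printf("merge-exchange nodes %d and %d\n", a, b);
}

/* Emit Batcher's merge-exchange schedule for P nodes. */
static void batcher_cleanup(int P)
{
    int t = 0;

    if (P < 2)
        return;
    while ((1 << t) < P)
        t++;                            /* t = ceil(log2 P) */
    for (int p = 1 << (t - 1); p > 0; p >>= 1) {
        int q = 1 << (t - 1);
        int r = 0;
        int d = p;
        for (;;) {
            for (int i = 0; i < P - d; i++)
                if ((i & p) == r)
                    merge_exchange(i, i + d);
            if (q == p)
                break;
            d = q - p;
            q >>= 1;
            r = p;
        }
    }
}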
`
`
`1.1.4 Putting it all together
`
We are now in a position to describe the algorithm as a whole. The steps in the
algorithm are:
`• distribute the data over the P processors
`• sort the data within each processor using the best available serial sorting algo-
`rithm for the data
`• perform logP merge steps along the edges of a hypercube
`• find which elements are unfinished (this can be done in log(N/P) time)
`• sort these unfinished elements using a convenient algorithm
`
`Note that this algorithm arose naturally out of a simple consideration of a lower
`bound on the sort time. By developing the algorithm in this fashion we have guaran-
`teed that the algorithm is optimal in the average case.
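Expressed as code the overall structure is compact. The skeleton below is purely
illustrative; each function is an empty stub standing in for one of the stages listed
above, not the thesis implementation.

static void distribute_data(void)       {}
static void serial_sort_local(void)     {}
static void hypercube_merge_step(int s) { (void)s; }
static void find_unfinished(void)       {}
static void cleanup_sort(void)          {}

void parallel_sort(int P)
{
    distribute_data();                   /* spread the data over P nodes   */
    serial_sort_local();                 /* best serial sort on each node  */

    int steps = 0;                       /* steps = ceil(log2 P)           */
    while ((1 << steps) < P)
        steps++;
    for (int s = 0; s < steps; s++)
        hypercube_merge_step(s);         /* merge along hypercube edges    */

    find_unfinished();                   /* log(N/P) search for stragglers */
    cleanup_sort();                      /* Batcher merge-exchange cleanup */
}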
`
`1.2 Algorithm Details
`
`The remainder of this chapter covers the implementation details of the algorithm,
`showing how it can be implemented with minimal memory overhead. Each stage of
`the algorithm is analyzed more carefully resulting in a more accurate estimate of the
`expected running time of the algorithm.
`The algorithm was first presented in [Tridgell and Brent 1993]. The algorithm was
`developed by Andrew Tridgell and Richard Brent and was implemented by Andrew
`Tridgell.
`
`1.2.1 Nomenclature
`
`P is the number of nodes (also called cells or processors) available on the parallel
`machine, and N is the total number of elements to be sorted. Np is the number of
elements in a particular node p (0 ≤ p < P). To avoid double subscripts Npj may be
written as Nj where no confusion should arise.
Elements within each node of the machine are referred to as Ep,i, for 0 ≤ i < Np and
0 ≤ p < P. Ej,i may be used instead of Epj,i if no confusion will arise.
`
`
`When giving “big O” time bounds the reader should assume that P is fixed so that
`O(N) and O(N/P) are the same.
`The only operation assumed for elements is binary comparison, written with the
`usual comparison symbols. For example, A < B means that element A precedes ele-
`ment B. The elements are considered sorted when they are in non-decreasing order
`in each node, and non-decreasing order between nodes. More precisely, this means
that Ep,i ≤ Ep,j for all relevant i < j and p, and that Ep,i ≤ Eq,j for 0 ≤ p < q < P and all
relevant i, j.
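Written as a predicate the definition reads directly. In this C sketch E[p][i] stands for
Ep,i with integer elements; the notation is mine:

#include <stdbool.h>
#include <stddef.h>

/* True when elements are non-decreasing within each node and every element
   of node p precedes every element of node q for p < q.  A single scan with
   a running previous element covers both conditions, including empty nodes. */
static bool is_sorted(int *const *E, const size_t *Np, size_t P)
{
    bool have_prev = false;
    int prev = 0;

    for (size_t p = 0; p < P; p++) {
        for (size_t i = 0; i < Np[p]; i++) {
            if (have_prev && prev > E[p][i])
                return false;
            prev = E[p][i];
            have_prev = true;
        }
    }
    return true;
}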
`The speedup offered by a parallel algorithm for sorting N elements is defined as
`the ratio of the time to sort N elements with the fastest known serial algorithm (on
`one node of the parallel machine) to the time taken by the parallel algorithm on the
`parallel machine.
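In symbols, writing T1(N) for the time of the fastest known serial algorithm on one
node and TP(N) for the time of the parallel algorithm on the full machine:

\[
\mathrm{speedup}(N,P) = \frac{T_1(N)}{T_P(N)}.
\]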
`
`1.2.2 Aims of the Algorithm
`
`The design of the algorithm had several aims:
`• Speed.
`• Good memory utilization. The number of elements that can be sorted should
`closely approach the physical limits of the machine.
`• Flexibility, so that no restrictions are placed on N and P. In particular N should
`not need to be a multiple of P or a power of two. These are common restrictions
`in parallel sorting algorithms [Ajtai et al. 1983; Akl 1985].
`
`In order for the algorithm to be truly general purpose the only operator that will
`be assumed is binary comparison. This rules out methods such as radix sort [Blelloch
`et al. 1991; Thearling and Smith 1992].
`It is also assumed that elements are of a fixed size, because of the difficulties of
`pointer representations between nodes in a MIMD machine.
`To obtain good memory utilization when sorting small elements linked lists are
`avoided. Thus, the lists of elements referred to below are implemented using arrays,
`without any storage overhead for pointers.
`
`
`The algorithm starts with a number of elements N assumed to be distributed over
`P processing nodes. No particular distribution of elements is assumed and the only
`restrictions on the size of N and P are the physical constraints of the machine.
`The algorithm presented here is similar in some respects to parallel shellsort [Fox
`et al. 1988], but contains a number of new features. For example, the memory over-
`head of the algorithm is considerably reduced.
`
`1.2.3 Infinity Padding
`
`In order for a parallel sorting algorithm to be useful as a general-purpose routine,
`arbitrary restrictions on the number of elements that can be sorted must be removed.
`It is unreasonable to expect that the number of elements N should be a multiple of the
`number of nodes P.
`The proof given in [Knuth 1981, solution to problem 5.3.4.38] shows that sort-
`ing networks will correctly sort lists of elements provided the number of elements in
`each list is equal, and the comparison-exchange operation is replaced with a merge-
`exchange operation. The restriction to equal-sized lists is necessary, as small examples
show⁵. However, a simple extension of the algorithm, which will be referred to as
infinity padding, can remove this restriction⁶.

⁵ A small example where unequal sized lists fail with Batcher's merge-exchange sorting network is
a 4 way sort with elements ([1] [0] [1] [0 0]) which results in the unsorted data set ([0] [0] [1] [0 1]).
⁶ A method for avoiding infinity padding using balancing is given in Section 1.2.5, so infinity padding
is not strictly needed, but it provides some useful concepts nonetheless.
`First let us define M to be the maximum number of elements in any one node. It
`is clear that it would be possible to pad each node with M − Np dummy elements so
`that the total number of elements would become M × P. After sorting is complete the
`padding elements could be found and removed from the tail of the sorted list.
`Infinity padding is a variation on this theme. We notionally pad each node with
`M − Np “infinity” elements. These elements are assumed to have the property that
`they compare greater than any elements in any possible data set. If we now consider
`one particular step in the sorting algorithm, we see that these infinity elements need
`only be represented implicitly.
`Say nodes p1 and p2 have N1 and N2 elements respectively before being merged in
`
`our algorithm, with node p1 receiving the smaller elements. Then the addition of in-
`finity padding elements will result in M−N1 and M−N2 infinity elements being added
`to nodes p1 and p2 respectively. We know that, after the merge, node p2 must contain
`the largest M elements, so we can be sure that it will contain all of the infinity elements
`up to a maximum of M. From this we can calculate the number of real elements which
`each node must contain after merging. If we designate the number of real elements
`
after merging as N′1 and N′2 then we find that

N′2 = max(0, N1 + N2 − M)

and

N′1 = N1 + N2 − N′2

This means that if at each merge step we give node p1 the first N′1 elements and
node p2 the remaining elements, we have implicitly performed padding of the nodes
with infinity elements, thus guaranteeing the correct behavior of the algorithm.
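In code this bookkeeping is tiny. The following C helper uses illustrative names rather
than anything from the thesis sources:

/* Compute the number of real elements on each node after a merge step
   under infinity padding.  m is M, the maximum count on any one node. */
static void infinity_split(int n1, int n2, int m,
                           int *n1_after, int *n2_after)
{
    /* N'2 = max(0, N1 + N2 - M): node p2 keeps the largest elements,
       so it absorbs the implicit infinity elements first. */
    int k = n1 + n2 - m;
    *n2_after = (k > 0) ? k : 0;

    /* N'1 = N1 + N2 - N'2: the remaining real elements. */
    *n1_after = n1 + n2 - *n2_after;
}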
`
`1.2.4 Balancing
`
`The aim of the balancing phase of the algorithm is to produce a distribution of the
`elements on the nodes that approaches as closely as possible N/P elements per node.
The algorithm chosen for this task is one which reduces to a hypercube for values
of P which are a power of 2. Pseudo-code is shown in Figure 1.2⁷:

procedure hypercube_balance(integer base, integer num)
    if num = 1 return
    for all i in [0..num/2)
        pair_balance(base+i, base+i+(num+1)/2)
    hypercube_balance(base+num/2, (num+1)/2)
    hypercube_balance(base, num - (num+1)/2)
end

Figure 1.2: Pseudo-code for load balancing

⁷ The pseudo-code in this thesis uses the C convention of integer division.

When the algorithm is called, the base is initially set to the index of the smallest
node in the system and num is set to the number of nodes, P. The algorithm operates
`recursively and takes logP steps to complete. When the number of nodes is not a
`power of 2, the effect is to have one of the nodes idle in some phases