UNIFIED PATENTS
EXHIBIT 1016

Efficient Algorithms for Sorting and Synchronization

Andrew Tridgell

A thesis submitted for the degree of
Doctor of Philosophy at
The Australian National University

February 1999

Except where otherwise indicated, this thesis is my own original work.

Andrew Tridgell
February 1999

Till min kära fru, Susan

Acknowledgments

The research that has gone into this thesis has been thoroughly enjoyable. That enjoyment is largely a result of the interaction that I have had with my supervisors, colleagues and, in the case of rsync, the people who have tested and used the resulting software.

I feel very privileged to have worked with my supervisors, Richard Brent, Paul Mackerras and Brendan McKay. To each of them I owe a great debt of gratitude for their patience, inspiration and friendship. Richard, Paul and Brendan have taught me a great deal about the field of Computer Science by sharing with me the joy of discovery and investigation that is the heart of research.

I would also like to thank Bruce Millar and Iain Macleod, who supervised my initial research in automatic speech recognition. Through Bruce, Iain and the Computer Sciences Laboratory I made the transition from physicist to computer scientist, turning my hobby into a career.

The Department of Computer Science and Computer Sciences Laboratory have provided an excellent environment for my research. I spent many enjoyable hours with department members and fellow students chatting about my latest crazy ideas over a cup of coffee. Without this rich environment I doubt that many of my ideas would have come to fruition.

The Australian Telecommunications and Electronics Research Board, the Commonwealth and the Australian National University were very generous in providing me with scholarship assistance.

Thanks also to my family who have been extremely understanding and supportive of my studies. I feel very lucky to have a family that shares my enthusiasm for academic pursuits.

Finally I'd like to thank my wonderful wife, Susan, who has encouraged me so much over the years. Many people told us that having two PhD students under the one roof is a recipe for disaster. Instead it has been fantastic.

Abstract

This thesis presents efficient algorithms for internal and external parallel sorting and remote data update. The sorting algorithms approach the problem by concentrating first on highly efficient but incorrect algorithms, followed by a cleanup phase that completes the sort. The remote data update algorithm, rsync, operates by exchanging block signature information followed by a simple hash search algorithm for block matching at arbitrary byte boundaries. The last chapter of the thesis examines a number of related algorithms for text compression, differencing and incremental backup.

Contents

Acknowledgments
Abstract
Introduction

1 Internal Parallel Sorting
  1.1 How fast can it go?
    1.1.1 Divide and conquer
    1.1.2 The parallel merging problem
      1.1.2.1 The two processor problem
      1.1.2.2 Extending to P processors
    1.1.3 Almost sorting
    1.1.4 Putting it all together
  1.2 Algorithm Details
    1.2.1 Nomenclature
    1.2.2 Aims of the Algorithm
    1.2.3 Infinity Padding
    1.2.4 Balancing
    1.2.5 Perfect Balancing
    1.2.6 Serial Sorting
    1.2.7 Primary Merge
    1.2.8 Merge-Exchange Operation
    1.2.9 Find-Exact Algorithm
    1.2.10 Transferring Elements
    1.2.11 Unbalanced Merging
    1.2.12 Block-wise Merging
    1.2.13 Cleanup
  1.3 Performance
    1.3.1 Estimating the Speedup
    1.3.2 Timing Results
    1.3.3 Scalability
    1.3.4 Where Does The Time Go?
    1.3.5 CM5 vs AP1000
    1.3.6 Primary Merge Effectiveness
    1.3.7 Optimizations
  1.4 Comparison with other algorithms
    1.4.1 Thearling and Smith
    1.4.2 Helman, Bader and JaJa
  1.5 Conclusions

2 External Parallel Sorting
  2.1 Parallel Disk Systems
  2.2 Designing an algorithm
    2.2.1 Characterizing parallel sorting
    2.2.2 Lower Limits on I/O
    2.2.3 Overview of the algorithm
    2.2.4 Partitioning
    2.2.5 Column and row sorting
    2.2.6 Completion
    2.2.7 Processor allocation
    2.2.8 Large k
    2.2.9 Other partitionings
  2.3 Performance
    2.3.1 I/O Efficiency
    2.3.2 Worst case
    2.3.3 First pass completion
  2.4 Other Algorithms
  2.5 Conclusions

3 The rsync algorithm
  3.1 Inspiration
  3.2 Designing a remote update algorithm
    3.2.1 First attempt
    3.2.2 A second try
    3.2.3 Two signatures
    3.2.4 Selecting the strong signature
    3.2.5 Selecting the fast signature
    3.2.6 The signature search algorithm
    3.2.7 Reconstructing the file
  3.3 Choosing the block size
    3.3.1 Worst case overhead
  3.4 The probability of failure
    3.4.1 The worst case
    3.4.2 Adding a file signature
  3.5 Practical performance
    3.5.1 Choosing the format
    3.5.2 Speedup
    3.5.3 Signature performance
  3.6 Related work
  3.7 Summary

4 rsync enhancements and optimizations
  4.1 Smaller signatures
  4.2 Better fast signature algorithms
    4.2.1 Run-length encoding of the block match tokens
  4.3 Stream compression
  4.4 Data transformations
    4.4.1 Compressed files
    4.4.2 Compression resync
    4.4.3 Text transformation
  4.5 Multiple files and pipelining
  4.6 Transferring the file list
  4.7 Summary

5 Further applications for rsync
  5.1 The xdelta algorithm
  5.2 The rzip compression algorithm
    5.2.1 Adding an exponent
    5.2.2 Short range redundancies
    5.2.3 Testing rzip
  5.3 Incremental backup systems
  5.4 rsync in HTTP
  5.5 rsync in a network filesystem
  5.6 Conclusion

6 Conclusion

A Source code and data
  A.1 Internal parallel sorting
  A.2 External parallel sorting
  A.3 rsync
  A.4 rzip
  A.5 rsync data sets
  A.6 rzip data sets

B Hardware
  B.1 AP1000
  B.2 CM5
  B.3 RS6000

Bibliography
Introduction

While researching the materials presented in this thesis I have had the pleasure of covering a wider range of the discipline of Computer Science than most graduate students. My research began with the study of automatic speech recognition using Hidden Markov Models, decision trees and large Recurrent Neural Networks [Tridgell et al. 1992], but I was soon drawn into studying the details of parallel sorting algorithms when, while implementing my Neural Network code on a CM5 multicomputer, I discovered that suitable parallel sorting algorithms were unavailable.

I was surprised to find that practical parallel sorting algorithms offered a lot of scope for active research, perhaps more than the rather densely populated field of automatic speech recognition. As I have always enjoyed working on new algorithms, particularly algorithms where efficiency is of primary concern, I jumped at the chance to make a contribution in the field of parallel sorting. This led to the research presented in the first chapter of this thesis and away from my original field of study.

While completing this research I had the opportunity to participate in a cooperative research center research project called the PIOUS project [Tridgell 1996], in which I worked on a parallel filesystem for multicomputers. This research led to the creation of a parallel filesystem called HiDIOS [Tridgell and Walsh 1996], which led me to investigate the natural extension of the problem of internal parallel sorting: external parallel sorting. The result is the research presented in the second chapter of this thesis.

The rest of the thesis is dedicated to the rsync algorithm, which provides a novel method of efficiently updating files over slow network links. The rsync algorithm was a direct result of my work on parallel filesystems and external parallel sorting algorithms. The link is a simple text searching algorithm [Tridgell and Hawking 1996] that I developed and which is used by an application running on the HiDIOS parallel filesystem. It was this text searching algorithm which provided the seed from which the rsync algorithm was able to grow, although the algorithm presented in this thesis bears little resemblance to that seed.

This thesis is a testament to the multi-faceted nature of computer science and the many and varied links within this relatively new discipline. I hope you will enjoy reading it as much as I enjoyed the research which it describes.
Chapter 1

Internal Parallel Sorting
My first introduction to the problem of parallel sorting came from a problem in the implementation of an automatic speech recognition training program. A set of speech data needed to be normalized in order to be used as the input to a recurrent neural network system, and I decided that a quick-and-dirty way of doing this would be to sort the data, then sample it at regular intervals to generate a histogram.

This is a terribly inefficient way of normalizing data in terms of computational complexity but is a quite good way in terms of software engineering, because the availability of easy-to-use sorting routines in subroutine libraries makes the amount of coding required very small. I had already used this technique in the serial version of the speech recognition system, so it seemed natural to do the same for the parallel version.

With this in mind I looked in the subroutine library of the parallel machine I was using (a Fujitsu AP1000 [Ishihata et al. 1991] running CellOS) and discovered that there was a problem with my plan: the library totally lacked a parallel sorting routine. I had been expecting that there would be a routine that is the parallel equivalent of the ubiquitous qsort() routine found in standard C libraries. The lack of such a routine was quite a surprise and prompted me to look into the whole issue of parallel sorting, totally ignoring the fact that I only wanted a parallel sorting routine in order to solve a problem that didn't really require sorting at all.

A survey of the literature on parallel sorting routines was rather disappointing. The routines were not at all general purpose and tended to place unreasonable restrictions on the amount of data to be sorted, the type of data and the number of CPUs in the parallel system. It became clear that the world (or at least my corner of it) needed a fast, scalable, general-purpose parallel sorting routine.

The rest of this chapter details the design, implementation and performance of just such a routine.
1.1 How fast can it go?

A question that arises when considering a new algorithm is "How fast can it go?". It helps greatly when designing an algorithm to understand the limits on the algorithm's efficiency. In the case of parallel sorting we can perform some simple calculations which are very revealing and which provide a great deal of guidance in the algorithm design.
It is well known that comparison-based sorting algorithms on a single CPU require log N! time[1], which is well approximated [Knuth 1981] by N log N. This limitation arises from the fact that the unsorted data can be in one of N! possible arrangements and that each individual comparison eliminates at most half of the arrangements from consideration.

An ideal parallel sorting algorithm which uses P processors would reduce this time by at most a factor of P, simply because any deterministic parallel algorithm can be simulated by a single processor with a time cost of P. This leads us to the observation that an ideal comparison-based parallel sorting algorithm would take time (N/P) log N.

Of course, this totally ignores the constant computational factors, communication costs and the memory requirements of the algorithm. All those factors are very important in the design of a parallel sorting algorithm and they will be looked at carefully in a later part of this chapter.

[1] Throughout this thesis log x will be used to mean ⌈log2 x⌉.

1.1.1 Divide and conquer

We now consider parallel sorting algorithms which are divided into two stages. In the first stage each processor sorts the data that happens to start in its local memory, and in the second stage the processors exchange elements until the final sorted result is obtained. It seems natural to consider dividing the algorithm in this manner, as efficient algorithms for the first stage are well known.

How long would we expect each stage to take? The first stage should take O((N/P) log(N/P)), simply by using an optimally efficient serial sorting algorithm on each processor[2]. This clearly cannot be improved upon[3].
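The ⌈log2 N!⌉ bound is easy to check numerically. The sketch below (Python; the function name is my own, not the thesis's) computes the bound from the counting argument above and compares it with the N log N approximation.

```python
import math

def comparison_lower_bound(n):
    """Minimum number of binary comparisons any comparison-based sort
    needs in the worst case: each comparison can at best halve the set
    of n! candidate arrangements, so ceil(log2(n!)) comparisons are
    needed to narrow the set down to a single arrangement."""
    return math.ceil(math.log2(math.factorial(n)))

if __name__ == "__main__":
    # For n = 5 there are 120 arrangements, so at least 7 comparisons.
    print(comparison_lower_bound(5))                    # 7
    # For larger n the bound approaches the N log N approximation.
    n = 1024
    print(comparison_lower_bound(n), n * int(math.log2(n)))
```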
The more interesting problem is how long the second stage should take. We want the overall parallel sorting algorithm to take O((N/P) log N) time, which means we would ideally like the second stage to take O((N/P) log P) time. If it turns out that this is not achievable then we might have to revisit the decision to split the algorithm into the two stages proposed above.
1.1.2 The parallel merging problem

The second stage of the parallel sorting algorithm that is now beginning to take shape is to merge P lists of N/P elements each stored on one of P processors. We would like this to be done in O((N/P) log P) time.
1.1.2.1 The two processor problem

Let us first consider the simplest possible parallel merging problem, where we have just two processors and each processor starts with N/2 elements. The result we want is that the first processor ends up with all the small elements and the second processor ends up with all the large elements. Of course, "small" and "large" are defined by reference to a supplied comparison function.

We will clearly need some communication between the two processors in order to transmit the elements that will end up in a different processor to the one they start in, and we will need some local movement of elements to ensure that each processor ends up with a locally sorted list of elements[4].

[2] I shall initially assume that the data is evenly distributed between the processors. The problem of balancing will be considered in the detailed description of the algorithm.
[3] It turns out that the choice of serial sorting algorithm is in fact very important. Although there are numerous "optimal" serial sorting algorithms, their practical performance varies greatly depending on the data being sorted and the architecture of the machine used.
[4] As is noted in Section 1.2.12 we don't strictly need to obtain a sorted list in each cell when this two processor merge is being used as part of a larger parallel merge, but it does simplify the discussion.
[Figure 1.1: The parallelisation of an eight way hypercube merge (cells 1 to 8 merged pairwise over three steps)]
Both of these operations will have a linear time cost with respect to the number of elements being dealt with. Section 1.2.8 gives a detailed description of an algorithm that performs these operations efficiently. The basis of the algorithm is to first work out which elements will end up in each processor using an O(log N) bisection search, and then to transfer those elements in a single block. A block-wise two-way merge is then used to obtain a sorted list within each processor.

The trickiest part of this algorithm is minimizing the memory requirements. A simple local merge algorithm typically requires order N additional memory, which would restrict the overall parallel sorting algorithm to dealing with data sets of less than half the total memory in the parallel machine. Section 1.2.12 shows how to achieve the same result with O(√N) additional memory while retaining a high degree of efficiency. Merging algorithms that require less memory are possible but they are quite computationally expensive [Ellis and Markov 1998; Huang and Langston 1988; Kronrod 1969].
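The split-then-merge idea can be sketched in a few lines (Python; find_split and merge_split are names of my own choosing, not the thesis's). A bisection search finds how many elements each processor keeps, after which only two block transfers and two local merges are needed.

```python
import heapq

def find_split(a, b, n):
    """Bisection search for k such that a[:k] and b[:n-k] together form
    the n smallest elements of the two sorted lists a and b. Takes
    O(log n) comparisons, independent of the amount of data moved."""
    lo, hi = max(0, n - len(b)), min(n, len(a))
    while lo < hi:
        k = (lo + hi) // 2
        if a[k] < b[n - k - 1]:      # a[k] belongs in the small half
            lo = k + 1
        else:
            hi = k
    return lo

def merge_split(a, b):
    """Two-processor merge: afterwards the first processor holds the
    len(a) smallest elements and the second the rest, both sorted."""
    k = find_split(a, b, len(a))
    small = list(heapq.merge(a[:k], b[:len(a) - k]))
    large = list(heapq.merge(a[k:], b[len(a) - k:]))
    return small, large
```

In the real algorithm the two lists live on different processors, so the slices a[k:] and b[:len(a)-k] would be exchanged as single messages rather than sliced in memory.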
1.1.2.2 Extending to P processors

Can we now produce a P processor parallel merge using a series of two processor merges conducted in parallel? In order to achieve the aim of an overall cost of O((N/P) log P) we would need to use O(log P) parallel two processor merges spread across the P processors.

The simplest arrangement of two processor merges that will achieve this time cost is a hypercube arrangement, as shown for eight processors in Figure 1.1.
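The hypercube schedule is data-independent, so it can be generated up front. A sketch (Python, names mine; the highest-dimension-first ordering is my reading of Figure 1.1) of the pairings for each of the log P steps:

```python
def hypercube_schedule(P):
    """Pairings for the log2(P) merge steps of a P-processor hypercube
    merge (P a power of two). In each step every processor is paired
    with the processor whose number differs in exactly one bit."""
    dims = P.bit_length() - 1                  # log2 P
    return [[(i, i | (1 << bit))
             for i in range(P) if not i & (1 << bit)]
            for bit in reversed(range(dims))]

# For eight processors: three steps of four pairwise merges each;
# the first step pairs processors differing in the highest bit.
```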
This seems ideal: we have a parallel merge algorithm that completes in log P parallel steps, with each step taking O(N/P) time to complete.

There is only one problem: it doesn't work!
1.1.3 Almost sorting

This brings us to a central idea in the development of this algorithm. We have so far developed an algorithm which very naturally arises out of a simple analysis of the lower limit of the sort time. The algorithm is simple to implement, will clearly be very scalable (due to its hypercube arrangement) and is likely to involve minimal load balancing problems. All these features make it very appealing. The fact that the final result is not actually sorted is an annoyance that must be overcome.

Once the algorithm is implemented it is immediately noticeable that although the final result is not sorted, it is "almost" sorted. By this I mean that nearly all elements are in their correct final processors and that most of the elements are in fact in their correct final positions within those processors. This will be looked at in Section 1.3.6 but for now it is good enough to know that, for large N, the proportion of elements that are in their correct final position is well approximated by 1 − P/√N.

This is quite extraordinary. It means that we can use this very efficient algorithm to do nearly all the work, leaving only a very small number of elements which are not sorted. Then we just need to find another algorithm to complete the job. This "cleanup" algorithm can be designed to work for quite small data sets relative to the total size of the data being sorted and doesn't need to be nearly as efficient as the initial hypercube based algorithm.

The cleanup algorithm chosen for this algorithm is Batcher's merge-exchange algorithm [Batcher 1968], applied to the processors so that comparison-exchange operations are replaced with merge operations between processors. Batcher's algorithm is a sorting network [Knuth 1981], which means that the processors do not need to communicate in order to determine the order in which merge operations will be performed. The sequence of merge operations is predetermined without regard to the data being sorted.
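Because the network is predetermined, the cleanup can be driven from a precomputed list of processor pairs. The sketch below (Python, names mine) generates the pairs using the construction of Knuth's merge-exchange sort (Algorithm 5.2.2M) and replaces each compare-exchange with a merge-exchange between two processors holding equal-sized sorted lists:

```python
import heapq

def batcher_pairs(n):
    """Compare-exchange pairs of Batcher's merge-exchange sorting
    network for n inputs (after Knuth, Algorithm 5.2.2M)."""
    pairs = []
    if n < 2:
        return pairs
    t = (n - 1).bit_length()            # ceil(log2 n)
    p = 1 << (t - 1)
    while p > 0:
        q, r, d = 1 << (t - 1), 0, p
        while True:
            for i in range(n - d):
                if (i & p) == r:
                    pairs.append((i, i + d))
            if q == p:
                break
            d, q, r = q - p, q >> 1, p
        p >>= 1
    return pairs

def merge_exchange(lists, a, b):
    """The network's compare-exchange replaced by a merge between two
    processors: a keeps the smaller half, b the larger, sizes kept."""
    merged = list(heapq.merge(lists[a], lists[b]))
    cut = len(lists[a])
    lists[a], lists[b] = merged[:cut], merged[cut:]

def cleanup(lists):
    """Sort sorted per-processor lists across the processors."""
    for a, b in batcher_pairs(len(lists)):
        merge_exchange(lists, a, b)
```

Per the result cited from [Knuth 1981], this sorts whenever the per-processor lists are equal-sized; infinity padding (Section 1.2.3) supplies that guarantee in the real algorithm.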
1.1.4 Putting it all together

We are now in a position to describe the algorithm as a whole. The steps in the algorithm are:

• distribute the data over the P processors
• sort the data within each processor using the best available serial sorting algorithm for the data
• perform log P merge steps along the edges of a hypercube
• find which elements are unfinished (this can be done in log(N/P) time)
• sort these unfinished elements using a convenient algorithm

Note that this algorithm arose naturally out of a simple consideration of a lower bound on the sort time. By developing the algorithm in this fashion we have guaranteed that the algorithm is optimal in the average case.
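A small experiment (Python; the function names and the highest-dimension-first step order are my assumptions, not the thesis's) illustrates the "almost sorted" behaviour of the first three steps: after the local sorts and the log P hypercube merges, nearly every element is already on its correct final processor.

```python
import heapq
import random

def merge_split(lists, lo, hi):
    """Two-processor merge: lo keeps the smaller half, hi the larger."""
    merged = list(heapq.merge(lists[lo], lists[hi]))
    cut = len(lists[lo])
    lists[lo], lists[hi] = merged[:cut], merged[cut:]

def almost_sort(lists):
    """Local serial sorts followed by log P hypercube merge steps.
    The result is not fully sorted, only 'almost' sorted."""
    P = len(lists)
    for l in lists:
        l.sort()
    for bit in reversed(range(P.bit_length() - 1)):
        for i in range(P):
            if not i & (1 << bit):
                merge_split(lists, i, i | (1 << bit))

if __name__ == "__main__":
    random.seed(0)
    P, n = 8, 2048                       # N = 16384 elements
    lists = [[random.random() for _ in range(n)] for _ in range(P)]
    ideal = sorted(x for l in lists for x in l)
    almost_sort(lists)
    # Fraction of elements already on their correct final processor.
    hit = sum(len(set(lists[p]) & set(ideal[p*n:(p+1)*n])) for p in range(P))
    print(hit / (P * n))                 # close to 1, but usually not 1
```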
1.2 Algorithm Details

The remainder of this chapter covers the implementation details of the algorithm, showing how it can be implemented with minimal memory overhead. Each stage of the algorithm is analyzed more carefully, resulting in a more accurate estimate of the expected running time of the algorithm.

The algorithm was first presented in [Tridgell and Brent 1993]. The algorithm was developed by Andrew Tridgell and Richard Brent and was implemented by Andrew Tridgell.
1.2.1 Nomenclature

P is the number of nodes (also called cells or processors) available on the parallel machine, and N is the total number of elements to be sorted. N_p is the number of elements in a particular node p (0 ≤ p < P). To avoid double subscripts N_{p_j} may be written as N_j where no confusion should arise.

Elements within each node of the machine are referred to as E_{p,i}, for 0 ≤ i < N_p and 0 ≤ p < P. E_{j,i} may be used instead of E_{p_j,i} if no confusion will arise.
When giving "big O" time bounds the reader should assume that P is fixed, so that O(N) and O(N/P) are the same.

The only operation assumed for elements is binary comparison, written with the usual comparison symbols. For example, A < B means that element A precedes element B. The elements are considered sorted when they are in non-decreasing order in each node, and non-decreasing order between nodes. More precisely, this means that E_{p,i} ≤ E_{p,j} for all relevant i < j and p, and that E_{p,i} ≤ E_{q,j} for 0 ≤ p < q < P and all relevant i, j.

The speedup offered by a parallel algorithm for sorting N elements is defined as the ratio of the time to sort N elements with the fastest known serial algorithm (on one node of the parallel machine) to the time taken by the parallel algorithm on the parallel machine.
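The sortedness condition is easy to state in code. A small helper (Python, name mine) checks both the within-node and the between-node conditions with a single pass in node order:

```python
def is_sorted(lists):
    """True iff elements are non-decreasing within each node and
    non-decreasing between nodes, i.e. E_{p,i} <= E_{q,j} whenever
    p < q, and E_{p,i} <= E_{p,j} whenever i < j."""
    prev = None
    for node in lists:
        for x in node:
            if prev is not None and x < prev:
                return False
            prev = x
    return True

# is_sorted([[1, 2], [2, 3]]) -> True  (equal keys across nodes allowed)
# is_sorted([[1, 3], [2, 4]]) -> False (3 in node 0 exceeds 2 in node 1)
```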
1.2.2 Aims of the Algorithm

The design of the algorithm had several aims:

• Speed.
• Good memory utilization. The number of elements that can be sorted should closely approach the physical limits of the machine.
• Flexibility, so that no restrictions are placed on N and P. In particular N should not need to be a multiple of P or a power of two. These are common restrictions in parallel sorting algorithms [Ajtai et al. 1983; Akl 1985].

In order for the algorithm to be truly general purpose the only operator that will be assumed is binary comparison. This rules out methods such as radix sort [Blelloch et al. 1991; Thearling and Smith 1992].

It is also assumed that elements are of a fixed size, because of the difficulties of pointer representations between nodes in a MIMD machine.

To obtain good memory utilization when sorting small elements, linked lists are avoided. Thus the lists of elements referred to below are implemented using arrays, without any storage overhead for pointers.
The algorithm starts with a number of elements N assumed to be distributed over P processing nodes. No particular distribution of elements is assumed and the only restrictions on the size of N and P are the physical constraints of the machine.

The algorithm presented here is similar in some respects to parallel shellsort [Fox et al. 1988], but contains a number of new features. For example, the memory overhead of the algorithm is considerably reduced.
1.2.3 Infinity Padding

In order for a parallel sorting algorithm to be useful as a general-purpose routine, arbitrary restrictions on the number of elements that can be sorted must be removed. It is unreasonable to expect that the number of elements N should be a multiple of the number of nodes P.

The proof given in [Knuth 1981, solution to problem 5.3.4.38] shows that sorting networks will correctly sort lists of elements provided the number of elements in each list is equal, and the comparison-exchange operation is replaced with a merge-exchange operation. The restriction to equal-sized lists is necessary, as small examples show[5]. However, a simple extension of the algorithm, which will be referred to as infinity padding, can remove this restriction[6].

First let us define M to be the maximum number of elements in any one node. It is clear that it would be possible to pad each node with M − N_p dummy elements so that the total number of elements would become M × P. After sorting is complete the padding elements could be found and removed from the tail of the sorted list.

Infinity padding is a variation on this theme. We notionally pad each node with M − N_p "infinity" elements. These elements are assumed to have the property that they compare greater than any elements in any possible data set. If we now consider one particular step in the sorting algorithm, we see that these infinity elements need only be represented implicitly.
Say nodes p1 and p2 have N1 and N2 elements respectively before being merged in

[5] A small example where unequal-sized lists fail with Batcher's merge-exchange sorting network is
a 4-way sort with the elements ([1] [0] [1] [0 0]), which results in the unsorted data set ([0] [0] [1] [0 1]).
[6] A method for avoiding infinity padding using balancing is given in Section 1.2.5, so infinity padding
is not strictly needed, but it provides some useful concepts nonetheless.
procedure hypercube_balance(integer base, integer num)
    if num = 1 return
    for all i in [0..num/2)
        pair_balance(base+i, base+i+(num+1)/2)
    hypercube_balance(base+num/2, (num+1)/2)
    hypercube_balance(base, num - (num+1)/2)
end

Figure 1.2: Pseudo-code for load balancing

our algorithm, with node p1 receiving the smaller elements. Then the addition of
infinity padding elements will result in M − N1 and M − N2 infinity elements being
added to nodes p1 and p2 respectively. We know that, after the merge, node p2 must
contain the largest M elements, so we can be sure that it will contain all of the infinity
elements up to a maximum of M. From this we can calculate the number of real
elements which each node must contain after merging. If we designate the number of
real elements after merging as N′1 and N′2 then we find that

    N′2 = max(0, N1 + N2 − M)

and

    N′1 = N1 + N2 − N′2

This means that if at each merge step we give node p1 the first N′1 elements and
node p2 the remaining elements, we have implicitly performed padding of the nodes
with infinity elements, thus guaranteeing the correct behavior of the algorithm.
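The split can be sketched as follows (the function names are mine, and `sorted` stands in for the actual distributed merge step):

```python
def split_sizes(n1, n2, m):
    """Number of real elements each node holds after an infinity-padded
    merge, where m is the maximum number of elements on any node."""
    n2_after = max(0, n1 + n2 - m)  # node p2 keeps the largest m elements
    n1_after = n1 + n2 - n2_after   # node p1 receives the rest
    return n1_after, n2_after

def merge_exchange(a, b, m):
    """Merge two sorted lists and split the result as if each list had
    been padded to length m with infinity elements."""
    merged = sorted(a + b)
    n1_after, n2_after = split_sizes(len(a), len(b), m)
    return merged[:n1_after], merged[n1_after:]
```

For instance, merging [1, 3] with [0, 2, 4] under M = 3 gives node p1 the three smallest elements [0, 1, 2] and node p2 the remainder [3, 4], without any infinity elements ever being materialized.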

1.2.4 Balancing

The aim of the balancing phase of the algorithm is to produce a distribution of the
elements on the nodes that approaches as closely as possible N/P elements per node.
The algorithm chosen for this task is one which reduces to a hypercube for values
of P which are a power of 2. Pseudo-code is shown in Figure 1.2 [7].

When the algorithm is called, base is initially set to the index of the smallest
node in the system and num is set to the number of nodes, P. The algorithm operates

[7] The pseudo-code in this thesis uses the C convention of integer division.
recursively and takes log P steps to complete. When the number of nodes is not a
power of 2, the effect is to have one of the nodes idle in some phases.
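Figure 1.2 translates almost directly into Python. In this sketch pair_balance is modelled as simply evening out the two nodes' element counts; the real operation would exchange actual elements between nodes:

```python
def pair_balance(counts, p1, p2):
    """Even out the element counts of nodes p1 and p2 (model of the
    real pair_balance, which moves elements between the two nodes)."""
    total = counts[p1] + counts[p2]
    counts[p1] = (total + 1) // 2
    counts[p2] = total - counts[p1]

def hypercube_balance(counts, base, num):
    """Balance `num` nodes starting at index `base`; reduces to a
    hypercube exchange pattern when num is a power of 2."""
    if num == 1:
        return
    for i in range(num // 2):
        pair_balance(counts, base + i, base + i + (num + 1) // 2)
    hypercube_balance(counts, base + num // 2, (num + 1) // 2)
    hypercube_balance(counts, base, num - (num + 1) // 2)
```

For example, starting from counts of [0, 9, 0, 0] on four nodes, the recursion leaves every node holding either 2 or 3 elements.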
