`
`Reference 17
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2129, p. 1
`
`
`
`Microsoft Technical Report MSR TR-2005-183 (November 2005, revised March 2006)
`
GPUTeraSort: High Performance Graphics Co-processor
Sorting for Large Database Management
`
`Naga K. Govindaraju ∗
`
`Jim Gray †
`
`Ritesh Kumar ∗
`
`Dinesh Manocha ∗
`
`{naga,ritesh,dm}@cs.unc.edu, Jim.Gray@microsoft.com
`http://gamma.cs.unc.edu/GPUTERASORT
`
`ABSTRACT
`We present a new algorithm, GPUTeraSort, to sort billion-
`record wide-key databases using a graphics processing unit
(GPU). Our algorithm uses the data and task parallelism
`on the GPU to perform memory-intensive and compute-
`intensive tasks while the CPU is used to perform I/O and
`resource management. We therefore exploit both the high-
`bandwidth GPU memory interface and the lower-bandwidth
`CPU main memory interface and achieve higher memory
`bandwidth than purely CPU-based algorithms. GPUTera-
`Sort is a two-phase task pipeline: (1) read disk, build keys,
`sort using the GPU, generate runs, write disk, and (2) read,
`merge, write. It also pipelines disk transfers and achieves
`near-peak I/O performance. We have tested the perfor-
`mance of GPUTeraSort on billion-record files using the stan-
`dard Sort benchmark.
In practice, a 3 GHz Pentium IV
PC with a $265 NVIDIA 7800 GT GPU is significantly faster
than optimized CPU-based algorithms on much faster processors,
sorting 60GB for a penny: the best reported PennySort
price-performance. These results suggest that a GPU
`co-processor can significantly improve performance on large
`data processing tasks.
`
1. INTRODUCTION
`Huge sort tasks arise in many different applications in-
`cluding web indexing engines, geographic information sys-
`tems, data mining, and supercomputing. Sorting is also
`a proxy for any sequential I/O intensive database work-
`load. This article considers the problem of sorting very large
`datasets consisting of billions of records with wide keys.
`The problem of external memory sorting has been stud-
`ied for more than five decades, starting with Friend [16].
`The dramatic improvements in the speed of sorting algo-
`rithms are largely due to advances in computer architec-
`ture and software parallelism. Recent algorithms utilize
`simultaneous multi-threading, symmetric multi-processors,
`
`∗University of North Carolina at Chapel Hill
`†Microsoft Research
`
`Permission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that copies are
`not made or distributed for profit or commercial advantage and that copies
`bear this notice and the full citation on the first page. To copy otherwise, to
`republish, to post on servers or to redistribute to lists, requires prior specific
`permission and/or a fee.
Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00.
`
`
`advanced memory units, and multi-processors to improve
sorting performance. The current Indy PennySort record
benchmark¹ sorts a 40 GB database in 1541 seconds on a
$614 Linux/AMD system.
`However, current external memory sort performance is
`limited by the traditional Von Neumann style architecture
`of the CPU. Computer architects use data caches to amelio-
`rate the CPU and the main memory bottleneck; but, CPU-
`based sorting algorithms incur significant cache misses on
`large datasets.
`This article shows how to use a commodity graphics pro-
`cessing unit (GPU) as a co-processor to sort large datasets.
`GPUs are programmable parallel architectures designed for
`real-time rasterization of geometric primitives - but they
`are also highly parallel vector co-processors. Current GPUs
`have 10x higher main memory bandwidth and use data par-
`allelism to achieve 10x more operations per second than
`CPUs. Furthermore, GPU performance has improved faster
`than Moore’s Law over the last decade - so the GPU-CPU
`performance gap is widening. GPUs have recently been used
`for different scientific, geometric and database applications,
`as well as in-memory sorting [20, 22, 35]. However, previ-
`ous GPU-based sorting algorithms were not able to handle
`gigabyte-sized databases with wide keys and could not keep
`up with modern disk IO systems.
`
`Main Results: We present GPUTeraSort that uses a GPU
`as a co-processor to sort databases with billions of records.
`Our algorithm is general and can handle long records with
`wide keys. This hybrid sorting architecture offloads compute-
`intensive and memory-intensive tasks to the GPU to achieve
`higher I/O performance and better main memory perfor-
`mance. We map a bitonic sorting network to GPU rasteriza-
`tion operations and use the GPU’s programmable hardware
`and high bandwidth memory interface. Our novel data rep-
`resentation improves GPU cache efficiency and minimizes
`data transfers between the CPU and the GPU. In practice,
`we achieve nearly 50 giga-byte per second memory band-
`width and 14 giga-operations per second on a current GPU.
`These numbers are 10x what we can achieve on the CPU.
`We implemented GPUTeraSort on an inexpensive 3 GHz
`Pentium IV EE CPU with a $265 NVIDIA 7800 GT GPU.
`GPUTeraSort running the SortBenchmark on this inexpen-
`sive computer has performance comparable to an “expen-
`sive” $2,200 3.6 GHz Dual Xeon server. Our experimental
`results show a 4 times performance improvement over the
`2005 Daytona PennySort benchmark record and 1.4 times
`
¹http://research.microsoft.com/barc/SortBenchmark
`
`
`improvement over the 2003 Indy PennySort benchmark record.
`Some of the novel contributions of our work include:
`
`• An external sorting architecture that distributes the
`work between the CPU and GPU.
`
`• An in-memory GPU-based sorting algorithm which is
`up to 10 times faster than prior CPU-based and GPU-
`based in-memory sorting algorithms.
`
`• Peak I/O performance on an inexpensive PC and near
`peak memory bandwidth on the GPU.
`
`• A scalable approach to sorting massive databases by
`efficiently sorting large data partitions.
`
`In combination, these features allow an inexpensive PC
`with a mid-range GPU to outperform much more expen-
`sive CPU-only PennySort systems. The rest of the paper
`is organized as follows. Section 2 reviews related work on
`sorting, hardware accelerated database queries, and GPU-
`based algorithms. Section 3 highlights some of the limita-
`tions of CPU-based external sorting algorithms and gives an
`overview of GPUTeraSort. Section 4 presents the GPUTera-
`Sort algorithm and Section 5 describes its implementation.
`Section 6 compares its performance with prior CPU-based
`algorithms.
`
`2. RELATED WORK
`This section briefly surveys related work in sorting and the
`use of GPUs to accelerate data management computations.
`
`2.1 Sorting
`Sorting is a key problem in database and scientific ap-
`plications.
`It has also been well studied in the theory of
`algorithms [23]. Many optimized sorting algorithms, such
`as quicksort, are widely available and many variants have
`been described in the database literature [2]. However, the
`CPU performance of sorting algorithms is governed by cache
`misses [17, 24, 32] and instruction dependencies [45]. To
`address these memory and CPU limits, many parallel al-
`gorithms and sorting systems have been proposed in the
`database and high performance computing literature [11,
`14, 25, 38, 44].
The Sort Benchmark, introduced in 1985, was commonly
used to evaluate sorting algorithms [15]. As the original
`benchmark became trivial, it evolved to the MinuteSort [32]
`and the PennySort benchmarks [33]. Nyberg et al. [32] use a
`combination of quicksort and selection-tree mergesort in the
`AlphaSort algorithm. In practice, AlphaSort’s performance
`varied considerably based on the cache sizes. The NOW-
`SORT algorithm [8] used a cluster of workstations to sort
`large databases. Recently, Garcia and Korth [17] used fea-
`tures of SMT (simultaneous multi-threading) to accelerate
`in-memory sort performance.
`
2.2 Optimizing Multi-Level Memory Accesses
`Many algorithms have been proposed to improve the per-
`formance of database operations using multi-level memory
`hierarchies that include disks, main memories, and several
`levels of processor caches. Ailamaki gives a recent survey on
`these techniques [4]. Over the last few years, database archi-
`tectures have used massive main memory to reduce or elim-
`inate I/O; but the resulting applications still have very high
`
`clocks per instruction (CPI). Memory stalls due to cache
`misses can lead to increased query execution times [6, 27].
`There is considerable recent work on redesigning database
`and data mining algorithms to make full use of hardware
`resources and minimize the memory stalls and branch mis-
`predictions. These techniques can also improve the perfor-
`mance of sorting algorithms [5, 12, 26, 28, 36, 37, 39, 45].
`
`2.3 GPUs and Data Parallelism
`Many special processor architectures have been proposed
`that employ data parallelism for data intensive computa-
`tions. Graphics processing units (GPUs) are common ex-
`amples of this, but there are many others. The Clear-
`Speed CSX600 processor [1] is an embedded, low power,
`data parallel co-processor that provides up to 25 GFLOPS
`of floating point performance. The Physics Processing Unit
`(PPU) uses data parallelism and high memory bandwidth
`in order to achieve high throughput for Physical simulation.
`Many other co-processors accelerate performance through
`data parallelism.
`This paper focuses on using a GPU as a co-processor for
`sorting, because GPUs are commodity processors. A high
`performance mid-range GPU costs less than $300. Current
`GPUs have about 10× the memory bandwidth and process-
`ing power of the CPU and this gap is widening. Commodity
`GPUs are increasingly used for different applications includ-
`ing numerical linear algebra, scientific, and geometric com-
`putations [34]. GPUs have also been used as co-processors to
`speedup database queries [9, 18, 19, 40] and data streaming
`[20, 29, 41].
`Sorting on GPUs: Many researchers have proposed GPU-
`based sorting algorithms. Purcell et al.
`[35] describe a
`bitonic sort using a fragment program where each stage of
`the sorting algorithm is performed as one rendering pass.
`Kipfer et al.
`[22] improve bitonic sort by simplifying the
`fragment program; but the algorithm still requires ∼ 10
`fragment instructions. Govindaraju et al.
`[20] present a
`sorting algorithm based on a periodic balanced sorting net-
`work (PBSN) and use texture mapping and blending oper-
`ations. However, prior GPU-based algorithms have certain
`limitations for large databases. These include:
`
`• Database size: Previous algorithms were limited to
`databases that fit in GPU memory (i.e. 512MB on
`current GPUs).
`
`• Limit on key size: The sort keys were limited to 32-bit
`floating point operands.
`
`• Efficiency: Previous algorithms were not fast enough
`to match the disk array IO bandwidth.
`
`Our GPUTeraSort algorithm uses the GPU as a co-processor
`in ways that overcome these limitations.
`
`3. OVERVIEW
`This section reviews external memory sorting algorithms,
`analyzing how these algorithms use processors, caches, mem-
`ory interfaces, and input/output (I/O) devices. Then we
`present our GPUTeraSort algorithm.
`
`3.1 External Memory Sorting
`External memory sorting algorithms are used to reorga-
`nize large datasets. They typically perform two phases. The
`
`
`first phase produces a set of files; the second phase processes
`these files to produce a totally ordered permutation of the
`input data file. External memory sorting algorithms can be
`classified into two broad categories [42]:
`
`• Distribution-Based Sorting: The first phase par-
`titions the input data file using (S-1) partition keys
`and generates S disjoint buckets such that the elements
`in one bucket precede the elements in the remaining
`buckets [23]. In the second phase, each bucket is sorted
`independently. The concatenated sorted buckets are
`the output file.
`
`• Merge-Based Sorting: The first phase partitions
`the input data into data chunks of approximately equal
`size, sorts these data chunks in main memory and
`writes the “runs” to disk. The second phase merges
`the runs in main memory and writes the sorted output
`to the disk.
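As a concrete (toy, single-machine) illustration of the merge-based scheme described above, the sketch below generates sorted runs in Phase 1 and then k-way merges them in Phase 2; the file handling is deliberately simplified and the function name is ours, not the paper's:

```python
import heapq
import os
import tempfile

def external_merge_sort(records, run_size):
    """Two-phase merge-based external sort (toy model).

    Phase 1: cut the input into runs of ~run_size records,
    sort each run in memory, write it to its own "disk" file.
    Phase 2: merge all the runs with a k-way heap merge.
    """
    # Phase 1: generate sorted runs on "disk".
    run_files = []
    for start in range(0, len(records), run_size):
        run = sorted(records[start:start + run_size])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(run))
        f.seek(0)
        run_files.append(f)

    # Phase 2: k-way merge of the runs into the sorted output.
    iters = [(line.rstrip("\n") for line in f) for f in run_files]
    merged = list(heapq.merge(*iters))

    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged
```

`heapq.merge` consumes each run as a stream, which mirrors the bounded per-run read buffers that a real Phase 2 merge would use.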
`
`External memory sorting performance is often limited by
`I/O performance. Disk I/O bandwidth is significantly lower
`than main memory bandwidth. Therefore, it is important to
`minimize the amount of data written to and read from disks.
`Large files will not fit in RAM so we must sort the data in at
`least two passes but two passes are enough to sort huge files.
`Each pass reads and writes to the disk. Hence, the two-pass
sort throughput is at most 1/4 the throughput of the disks.
`For example, a PC with 8 SATA disks each with a peak I/O
`bandwidth of 50 MBps per disk can achieve at most 400
`MBps disk bandwidth. So a p-pass algorithm will have a
throughput of 400/(2p) MBps since each pass must read as well as write
the data. In particular, a two-pass sort achieves at most 100
`MBps throughput on this PC. Single pass algorithms only
`work on databases that fit entirely in main memory.
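The pass-counting arithmetic above can be checked directly; the helper name is ours:

```python
def sort_throughput(disk_bandwidth_mbps, passes):
    """A p-pass external sort reads and writes all the data once
    per pass, so its throughput is bandwidth / (2 * p)."""
    return disk_bandwidth_mbps / (2.0 * passes)

# 8 SATA disks at 50 MBps each give 400 MBps aggregate bandwidth,
# so a two-pass sort tops out at 100 MBps, as in the text.
aggregate_mbps = 8 * 50
two_pass_mbps = sort_throughput(aggregate_mbps, 2)
```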
`External memory sort algorithms can operate in two passes
`if the Phase 1 partitions fit in main memory. The parallel
`disk model (PDM) [43] captures disk system performance
`properties. PDM models the number of I/O operations, disk
usage and CPU time. Vitter [42] analyzed the practical applicability
of the PDM model to common I/O operations such
as scanning the items in a file, sorting a file, etc. In this
model, the average and worst case I/O performance of external
memory sorting algorithms is ≈ (n/D) log_m n, where n is
the input size, m is the internal memory size, D is the number
of disks, and log_m n denotes the number of passes when
the data partition size in Phase 1 is ≈ m [3, 30]. Based
`on the PDM model, an external memory sorting algorithm
`can achieve good I/O performance on large databases when
`the data partition sizes are comparable to the main mem-
`ory size. Salzberg et al.
`[38] present a similar analysis of
`merge based sorting memory requirements. The analysis
`is as follows. If N is the file size, M is the main memory
size and R is the run size in Phase 1, then typically: (1)
R ≈ M/3, because of the memory required to simultaneously
pipeline reading the input, sorting, and writing the output.
The number of runs generated in Phase 1 is runs ≈ N/R. If
T is the I/O read size per run in Phase 2, then since
at least one buffer for each run must fit in memory and a
few more buffers are needed for prefetch and postwrite: (2)
M ≈ T × runs ≈ T × N/R. Combining equations (1) and (2)
gives (3) M² ≈ 3 × T × N or, ignoring the constant term,
(4) M ≈ √(T × N).
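A quick numeric check of this sizing analysis (using M ≈ √(3·T·N); the text's own estimates may use slightly different constants and units, so treat these numbers as order-of-magnitude):

```python
import math

def two_pass_sizing(n_bytes, t_bytes):
    """Salzberg-style sizing: M ~ sqrt(3*T*N), R ~ M/3,
    and the number of Phase 1 runs is N/R."""
    m = math.sqrt(3 * t_bytes * n_bytes)  # main memory needed
    r = m / 3                             # Phase 1 run size
    return m, r, n_bytes / r

# N = 100 GB file, T = 2 MB reads per run in Phase 2.
m, r, runs = two_pass_sizing(100 * 10**9, 2 * 10**6)
# m is roughly 775 MB and r a few hundred MB, consistent with the
# claim that Phase 1 partitions are hundreds of megabytes on PCs.
```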
`Since a two-pass sort’s RAM requirements (M ) increase
`
Figure 1: Performance of an optimized merge-based
external memory sorting algorithm on a dual 3.6 GHz
Xeon processor system (total time, Phase 2 time, and
Phase 2 I/O bandwidth in MB/s, plotted against the
Phase 1 partition size in KB). Observe that the
speed of Phase 2 increases nearly linearly with the
partition size. Because the Phase 1 data partitions
fit well in the L2 cache, the Phase 1 time remains
nearly constant.
`
`as the square root of the input file size, multi-GB RAM ma-
`chines can two-pass sort terabyte files. In particular, if T=2
`MB to reduce disk seek overhead, and N is 100 GB, then
`R ∼ 230 MB. In practice, phase 1 partitions are hundreds
`of megabytes on current PCs. However, current algorithms
`running on commodity CPUs, referred to as CPU-based al-
`gorithms, cannot achieve high sorting performance on such
`large partitions because:
`
`• Cache Misses: CPU-based sorting algorithms incur
`significant cache misses on data sets that do not fit
`in the L1, L2 or L3 data caches [32]. Therefore, it is
`not efficient to sort partitions comparable to the size
`of main memory. This results in a tradeoff between
`disk I/O performance (as described above) and CPU
`computation time spent in sorting the partitions. For
`example, in merge-based external sorting algorithms,
`the time spent in Phase 1 can be reduced by choosing
`run sizes comparable to the CPU cache sizes. However,
`this choice increases the time spent in Phase 2 to merge
`a large number of small runs. Figure 1 illustrates the
`performance of an optimized commercial CPU based
`algorithm [31] on a dual Xeon configuration for varying
`Phase 1 run sizes. Observe that the elapsed time de-
`creases as the run size increases. However, increasing
`the run size beyond the CPU data cache sizes can de-
`grade the sorting performance during Phase 1 [24]. As
`explained in Section 4, GPUs have a high bandwidth
`memory interface that can achieve higher performance
`on larger runs.
`
• I/O Performance: I/O operations have relatively
low CPU overhead. However, CPU-based sorting algorithms
can be compute-intensive [24] and may not
be able to achieve high I/O performance. Figure 13
highlights the I/O performance of Nsort [31] on systems
with a peak I/O throughput of 200 MBps. The
I/O throughput obtained by the CPU-based sorting algorithm
is around 147 MBps for a single processor and
around 200 MBps with a dual processor. This suggests
that the overall I/O performance can be improved by
offloading computation to an additional processor or
co-processor.

• Memory Interfaces: Some recent external sorting
algorithms use simultaneous multi-threading (SMT)
and chip multi-processor (CMP) architectures to improve
performance. However, the interface to main
memory on current SMT and CMP architectures significantly
limits the memory bandwidth available to
each thread when data does not fit in processor caches
[17]. It is possible to achieve higher performance by
running the sorting algorithm on co-processors with
dedicated memory interfaces.

3.2 Sorting with a Graphics Processor
This section gives a brief overview of graphics processors
(GPUs), highlighting features that make them useful for external
memory sorting. GPUs are designed to execute geometric
transformations on a rectangular pixel array. Each
transformation generates a data stream of display pixels.
Each incoming data element has a color and a set of texture
coordinates that reference a 2D texture array. The data
stream is processed by a user-specified program executing
on multiple fragment processors. The output is written to
the memory. GPUs have the following capabilities useful for
data-intensive computations.

Figure 2: This figure highlights the high data parallelism
and memory bandwidth inside a GPU.
GPUTeraSort uses the vector processing functionalities
to implement a highly parallel bitonic sorting
network. It outperforms prior CPU-based and
GPU-based algorithms by 3-10 times.

• Data Parallelism: GPUs are highly data parallel
- both partition parallelism and pipeline parallelism.
They use many fragment processors for partition parallelism.
Each fragment processor is a pipeline-parallel
vector processor that performs four concurrent vector
operations such as multiply-and-add (MAD) instructions
on the texture coordinates or the color components
of the incoming data stream. Current CPUs offer
similar data parallelism using instructions such as
SSE2 on Intel processors or AltiVec operations on PowerPC
processors. However, CPU data parallelism is
relatively modest by comparison. In the case of sorting, a
high-end Pentium IV processor can execute four SSE2
comparisons per clock cycle while an NVIDIA GeForce
7800 GTX GPU-based sorting algorithm can perform
96 comparisons per clock cycle.

• Instruction-level Parallelism: In addition to the
SIMD and vector processing capabilities, each fragment
processor can also exploit instruction-level parallelism,
evaluating multiple instructions simultaneously
using different ALUs. As a result, GPUs can achieve
higher performance than CPUs. For example, the peak
computational performance of a high-end dual core
Pentium IV processor is 25.6 GFLOPS, whereas the
peak performance of an NVIDIA GeForce 7800 GTX is
313 GFLOPS. GPU instruction-level parallelism significantly
improves sort performance, overlapping sort-key
comparison operations with fetching the pointers
associated with the keys to achieve near-peak computational
performance.

• Dedicated Memory Interface: The GPU's memory
controller is designed for high bandwidth data streaming
between main memory and the GPU's onboard
memory. GPUs have a wider memory interface than
the CPU. For example, current high-end PCs have an
8-byte main memory interface with a peak memory
bandwidth of 6.4 GB per second, whereas an NVIDIA
7900 GTX has a 64-byte memory interface to the GPU
video memory and can achieve a peak memory bandwidth
of 56 GB per second.

• Low Memory Latency: GPUs have lower computational
clock rates (∼690 MHz) than memory clock
rates (∼1.8 GHz) but reduce the memory latency
by accessing the data sequentially, thereby allowing
prefetch and pipelining. In contrast, CPUs have higher
computational clock rates (∼4 GHz) than main memory
speeds (∼533 MHz) but suffer from memory stalls
both because the memory bandwidth is inadequate
and because they lack a data-stream approach to data
access.

Many GPU-based sorting algorithms have been designed
to exploit one or more of these capabilities [20, 22, 35].
However, those algorithms do not handle large, wide-key
databases and have other limitations, highlighted in Section
2.

In summary, GPUs offer 10× more memory bandwidth
and processing power than CPUs, and this gap is widening.
GPUs present an opportunity for anyone who can use them
for tasks beyond graphics [34].

3.3 Hybrid Sorting Architecture
This section gives an overview of GPUTeraSort. The next
section describes the use of the GPU in detail. Our goal is to
design a sorting architecture that efficiently utilizes the computational
processors, I/O and memory resources. GPUTeraSort
has five stages that can be executed sequentially; but
some stages can be executed as multi-buffered pipeline-parallel
independent threads:

• Reader: The reader asynchronously reads the input
file into a (approximately 100 MB) main memory buffer
(zero-copy direct IO). Read bandwidth is improved by
`
`
striping the input file across all disks so the data is
transferred from all disks in parallel. The I/O bandwidth
and the CPU usage of the reader depend on
the number of overlapping asynchronous I/O requests,
the stripe size, and the number of disks in the stripe.
The reader thread requires less than 10% of a CPU to
achieve near-peak I/O performance.

• Key-Generator: The Key-Generator computes the
(key, record-pointer) pairs from the input buffer. In
practice, this stage is not computationally intensive
but can be memory-intensive, reading each key from
main memory. It sequentially writes a stream of key-pointer
pairs to main memory.

• Sorter: The Sorter reads and sorts the key-pointer
pairs. This stage is computationally intensive and
memory-intensive on large buffers with wide keys (e.g.
of size 10 bytes or more). For example, the throughput
of an SSE-optimized CPU-based quicksort on a 3.4
GHz Pentium IV sorting 1 million floating point keys
is much less than the throughput of the other external
memory sorting stages and is the bottleneck. This is
shown by Figure 11 and by the quicksort performance
in Figure 8.

• Reorder: The reorder stage rearranges the input buffer
based on the sorted key-pointer pairs to generate a
sorted output buffer (a run). On large databases, reorder
is expensive because it randomly reads and writes
long records from the input buffer and so has many
memory stalls (Figure 11).

• Writer: The writer asynchronously writes the run to
the disk. Striping a run across many disks is not efficient
for Phase 2 reads [42]; therefore GPUTeraSort
cyclically writes the Phase 1 runs to individual disks
in very large transfers. The writer thread requires less
than 10% of the CPU to achieve near-peak I/O performance.

Figure 3 shows GPUTeraSort's pipeline flow. In order to
efficiently pipeline these stages, GPUTeraSort uses a GPU
as a co-processor to perform the key-pointer sorter task. The
new sorting architecture:

• Performs the key-pointer sorting on the GPU and frees
CPU cycles to achieve higher I/O performance and
throughput.

• Reduces the memory contention by using the dedicated
GPU memory for sorting.

Figure 3: Flow Diagram of Phase 1 of the GPUTeraSort
Architecture using GPUs and CPUs (Reader and Writer
on the disks, Key-Pointer Generator and Reorder on the
CPU and main memory, and the Sorter on the GPU over
a high-bandwidth (40 GBps) video memory interface, with
data transferred between them via DMA).

4. LARGE SORTS USING GPUS
This section describes GPUTeraSort's sorting algorithm
to sort wide keys and pointers on GPUs using a novel data
representation. The algorithm improves cache efficiency and
minimizes data transfer overheads between the CPU and
GPU. A theoretical and experimental analysis of GPUTeraSort's
data transfer rate and memory bandwidth requirements
compares the performance with prior algorithms.

4.1 Bitonic Sorting on GPUs
GPU-based algorithms perform computations on 2D arrays
of 32-bit floating point data values known as textures.
Each array element corresponds to a pixel. Pixels are transformed
by programmable fragment processors, each executing
the same fragment program on each pixel. The multiple
GPU fragment processors perform data-parallel computations
on different pixel arrays simultaneously. This simple
data-parallel architecture avoids write-after-read hazards
while performing parallel computations.

At a high level, GPU-based sorting algorithms read values
from an input array or texture, perform data-independent
comparisons using a fragment program, and write the output
to another array. The output array is then swapped
with the input array, and the comparisons are iteratively
performed until the whole array is sorted. These sorting
network algorithms map well to GPUs.

The bitonic sorting network [10] sorts bitonic sequences
in multiple merge steps. A bitonic sequence is a sequence
that first monotonically increases and then monotonically
decreases, or can be circularly shifted to become such a
sequence.

Given an input array a = (a_0, a_1, . . . , a_{n−1}), the bitonic
sorting algorithm proceeds bottom-up, merging bitonic sequences
of equal sizes at each stage. It first constructs
bitonic sequences of size 2 by merging pairs of adjacent
data elements (a_{2i}, a_{2i+1}) where i = 0, 1, . . . , n/2 − 1. Then
bitonic sequences of size 4 are formed in stage 2 by merging
pairs of adjacent size-2 bitonic sequences. The output of
each stage is the input to the next stage. The size of the
bitonic sequence pairs doubles at every stage. The final
stage forms a sorted sequence by merging the bitonic sequences
(a_0, a_1, . . . , a_{n/2−1}) and (a_{n/2}, a_{n/2+1}, . . . , a_{n−1})
(see Figure 4).

Specifically, stage k merges two bitonic sequences, each of
size 2^{k−1}, and generates a new bitonic sequence of length
2^k. The overall algorithm requires log n stages. In stage k,
we perform k steps, in the order k down to 1. In each
step, the input array is conceptually divided into chunks of
equal size (size d = 2^{j−1} for step j) and each element in
one chunk is compared against the corresponding element
in its adjacent chunk, i.e., an element a_i in a chunk is compared
with the element at distance d (a_{i+d} or a_{i−d}). The
minimum is stored in one data chunk and the maximum is
stored in the other data chunk. Figure 4 shows a bitonic
sorting network on 8 data values. Each data chunk in a
step is color coded, and elements in adjacent data chunks
are compared in parallel.
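The stage/step structure just described (compare at distance d = 2^{j−1}, with the compare direction set by the enclosing block) can be written scalar-style as follows. This is a standard CPU formulation of the bitonic network, not the paper's GPU fragment-program version:

```python
def bitonic_sort(a):
    """In-place bitonic sorting network (ascending) for a list
    whose length is a power of two. Stage k merges bitonic
    sequences of size k; step j compares elements at distance
    d = j, keeping min/max according to the direction of the
    enclosing size-k block."""
    n = len(a)
    k = 2
    while k <= n:          # stages: sequence size doubles
        j = k // 2
        while j >= 1:      # steps: compare distance halves
            for i in range(n):
                partner = i ^ j            # element at distance d = j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Every compare-exchange within a step touches a disjoint pair of elements, which is why the network maps so naturally onto the GPU's data-parallel fragment processors.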
`
`
Figure 4: A bitonic sorting network on 8 data values.
`
GPUTeraSort(n)
1  b = number of bytes in key
2  W = width(tex) = 2^⌊(log n)/2⌋, H = height(tex) = 2^⌈(log n)/2⌉
3  sorted = false
4  currBytes = most significant four bytes of keys
5  SortedLists = input array of key-pointers
6  while (!sorted)
7    for each array in SortedLists
8      transfer the pointerTexture and keyTexture of currBytes to the GPU
9      perform HybridBitonicSort(pointerTexture, keyTexture, n)
10     readback the pointers
11     SortedLists = lists of contiguous arrays with equal keys in currBytes
12     if (sizeof(SortedLists) = 0 or currBytes = b−4) sorted = true
13     else currBytes = next four bytes
14   end for

ROUTINE 4.2: GPUTeraSort: This routine is used to sort
an input sequence of length n with b-byte keys. The input
sequence is copied into a 2D texture whose width and height
are set to the powers of 2 closest to √n (line 2). The sorting
algorithm starts with the most significant four bytes of each
key (line 4). Next, it performs at most b/4 scans over (a subset
of) the input sequence, during each scan performing a fast
bitonic radix sort on the GPU (line 9). Then it reads back the
pointers and computes the contiguous portions of the array
that remain to be sorted. These contiguous unsorted portions
consist of keys that are equal up to currBytes and are processed
further based on the remaining bytes (line 13).
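Routine 4.2's control flow (sort on a four-byte key slice, then recurse only into contiguous groups whose slice is equal) can be sketched on the CPU as below, with Python's built-in sort standing in for the GPU bitonic pass; the names are ours, not the paper's:

```python
def hybrid_radix_sort(key_pointer_pairs, key_bytes, chunk=4):
    """Sketch of the hybrid radix-bitonic control flow: order
    (key, pointer) pairs by successive `chunk`-byte slices of
    the key, refining only runs whose current slice is equal.
    On the GPU the inner sort would be the bitonic network."""
    def sort_group(pairs, offset):
        # Stand-in for the GPU bitonic sort on one key slice.
        pairs.sort(key=lambda kp: kp[0][offset:offset + chunk])
        if offset + chunk >= key_bytes:
            return pairs
        out, i = [], 0
        while i < len(pairs):
            j = i
            while (j < len(pairs) and
                   pairs[j][0][offset:offset + chunk] ==
                   pairs[i][0][offset:offset + chunk]):
                j += 1
            group = pairs[i:j]
            if len(group) > 1:   # equal slice: refine on next bytes
                group = sort_group(group, offset + chunk)
            out.extend(group)
            i = j
        return out
    return sort_group(list(key_pointer_pairs), 0)
```

Keys that already differ in their first four bytes are never re-examined, which is the source of the "at most b/4 scans, usually far fewer" behavior.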
`
`4.3 Handling Wide Keys
`GPUTeraSort uses a hybrid radix-bitonic sort algorithm.
`It uses bitonic sort on the first few bytes of the keys as the
`initial radix for sorting. In order to handle string compar-
`isons using 32-bit floating point hardware, it is important
to encode data on the GPU but avoid special IEEE floating
point values such as NaN, ∞, etc. by masking the two
`most significant bits (MSB) to “10”. The key encoding to
`GPU floating point representations handles any data type,
`including ASCII strings with lexicographic comparisons. Af-
`ter the GPU orders the key-pointer array based on the first
`few bytes, GPUTeraSort scans the array to identify con-
`tiguous portions of the array with equal keys. It sorts these
`portions based on the second set of bytes on the GPU. It
`repeats the process till the entire array is sorted or all the
`key bytes have been compared. Routine 4.2 gives a pseudo-
`code for GPUTeraSort’s hybrid radix-bitonic sort using the
`GPU.
`We implemented three prior GPU-based sorting algorithms
`[20, 22, 35] and compared their performance to GPUTera-
`Sort. Figure 6 shows the comparison - GPUTeraSort bitonic
`sort has a 3x to 6x advantage over previous GPU sorting al-
`gorithms.
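One plausible reading of the bit-masking trick, sketched in Python: forcing the top two bits of each 32-bit word to "10" clears the exponent's high bit, so the exponent field can never be all ones and the resulting float32 can never be NaN or infinity. The helper below is illustrative; the paper does not spell out the exact encoding.

```python
import math
import struct

def bytes_to_safe_float(chunk4):
    """Reinterpret 4 key bytes as an IEEE-754 float32 whose two
    most significant bits are forced to '10'. With the exponent's
    top bit cleared, the exponent field cannot be all ones, so
    the value is never NaN or infinity (an assumption about the
    encoding, not the paper's exact scheme)."""
    word = int.from_bytes(chunk4, "big")
    word = (word & 0x3FFFFFFF) | 0x80000000   # force top bits to 10
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]
```

For example, the bit pattern of +infinity (0x7F800000) becomes 0xBF800000, i.e. the perfectly ordinary value -1.0.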
`
4.4 Cache-Efficient Sorting on GPUs
In this section, we investigate the cache performance of
the fragment processors while performing each step of the
bitonic sorting algorithm. Suppose the width of the input
array is W, the height is H, and the cache block size is B ×
B. In each step, each pixel is compared against one other
pixel. Therefore, there are at least n_compulsory = (W × H)/B²
compulsory cache misses. Our goal is to reduce the total number of cache
`
`
Figure 6: Comparison of GPUTeraSort's bitonic
sort with other GPU-based algorithms (time in seconds
versus number of records) indicates a 3-7 fold
performance advantage over prior GPU-based
bitonic sort and PBSN algorithms (Kipfer et al. [22],
Govindaraju et al. [20], and Purcell et al. [35]).
`
BitonicSort(tex, W, H)
1  n = number of values to be sorted = W*H*4 /* single-array representation */
2  for i = 1 to log n /* for each stage */
3    for j = i to 1
4      quad size B = 2^(j−1)
5      draw textured quads of size B
6      copy from frame buffer to tex
7    end for
8  end for

ROUTINE 4.1: Bitonic Sorting Network Algorithm: We use
this routine to sort a floating point input sequence of length n.
We perform log n stages on the input sequence and, during
stage i, perform i steps with quad sizes (width × height)
varying from 2^(i−1) to 1 (line 4). The overall algorithm requires
O((n log² n)/2) comparisons and maps well to GPUs.
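The comparison count in Routine 4.1's caption can be checked directly: the network runs log n stages, stage i contributes i steps, and every step performs n/2 compare-exchanges, for n·log n·(log n + 1)/4 in total, which is O(n log² n):

```python
import math

def bitonic_comparisons(n):
    """Total compare-exchanges in a bitonic network on n = 2^k
    values: n/2 per step, with k*(k+1)/2 steps in all."""
    k = int(math.log2(n))
    steps = k * (k + 1) // 2
    return steps * n // 2
```

For n = 8 this gives 24 comparators, matching the three-stage, six-step network of Figure 4.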
`
representation, as opposed to the four-array representation
proposed by Govindaraju et al. [20], has the following advantages:

• Mapping: The data transfer operation from the CPU
to the GPU directly maps to the single-array representation
while the four-array representation does not.

• Efficient sorting: The single-array representation has
better performance than the four-array representation as it
reduces the memory accesses in early algorithm steps