Microsoft Technical Report MSR TR-2005-183 (November 2005, revised March 2006)

GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management

Naga K. Govindaraju∗    Jim Gray†    Ritesh Kumar∗    Dinesh Manocha∗
∗University of North Carolina at Chapel Hill    †Microsoft Research
{naga,ritesh,dm}@cs.unc.edu, Jim.Gray@microsoft.com
http://gamma.cs.unc.edu/GPUTERASORT

ABSTRACT
We present a new algorithm, GPUTeraSort, to sort billion-record wide-key databases using a graphics processing unit (GPU). Our algorithm uses the data and task parallelism of the GPU to perform memory-intensive and compute-intensive tasks while the CPU is used to perform I/O and resource management. We therefore exploit both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface and achieve higher memory bandwidth than purely CPU-based algorithms. GPUTeraSort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk; and (2) read, merge, write. It also pipelines disk transfers and achieves near-peak I/O performance. We have tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with a $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60GB for a penny; the best reported PennySort price-performance. These results suggest that a GPU co-processor can significantly improve performance on large data processing tasks.

1. INTRODUCTION
Huge sort tasks arise in many different applications including web indexing engines, geographic information systems, data mining, and supercomputing. Sorting is also a proxy for any sequential I/O intensive database workload. This article considers the problem of sorting very large datasets consisting of billions of records with wide keys.

The problem of external memory sorting has been studied for more than five decades, starting with Friend [16]. The dramatic improvements in the speed of sorting algorithms are largely due to advances in computer architecture and software parallelism. Recent algorithms utilize simultaneous multi-threading, symmetric multi-processors, advanced memory units, and multi-processors to improve sorting performance. The current Indy PennySort record benchmark1 sorts a 40 GB database in 1541 seconds on a $614 Linux/AMD system.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00.

However, current external memory sort performance is limited by the traditional Von Neumann style architecture of the CPU. Computer architects use data caches to ameliorate the bottleneck between the CPU and main memory; but CPU-based sorting algorithms incur significant cache misses on large datasets.

This article shows how to use a commodity graphics processing unit (GPU) as a co-processor to sort large datasets. GPUs are programmable parallel architectures designed for real-time rasterization of geometric primitives, but they are also highly parallel vector co-processors. Current GPUs have 10x higher main memory bandwidth and use data parallelism to achieve 10x more operations per second than CPUs. Furthermore, GPU performance has improved faster than Moore's Law over the last decade, so the GPU-CPU performance gap is widening. GPUs have recently been used for different scientific, geometric and database applications, as well as in-memory sorting [20, 22, 35]. However, previous GPU-based sorting algorithms were not able to handle gigabyte-sized databases with wide keys and could not keep up with modern disk IO systems.

Main Results: We present GPUTeraSort, which uses a GPU as a co-processor to sort databases with billions of records. Our algorithm is general and can handle long records with wide keys. This hybrid sorting architecture offloads compute-intensive and memory-intensive tasks to the GPU to achieve higher I/O performance and better main memory performance. We map a bitonic sorting network to GPU rasterization operations and use the GPU's programmable hardware and high bandwidth memory interface. Our novel data representation improves GPU cache efficiency and minimizes data transfers between the CPU and the GPU. In practice, we achieve nearly 50 gigabytes per second memory bandwidth and 14 giga-operations per second on a current GPU. These numbers are 10x what we can achieve on the CPU. We implemented GPUTeraSort on an inexpensive 3 GHz Pentium IV EE CPU with a $265 NVIDIA 7800 GT GPU. GPUTeraSort running the SortBenchmark on this inexpensive computer has performance comparable to an "expensive" $2,200 3.6 GHz Dual Xeon server. Our experimental results show a 4 times performance improvement over the 2005 Daytona PennySort benchmark record and a 1.4 times improvement over the 2003 Indy PennySort benchmark record.

1http://research.microsoft.com/barc/SortBenchmark

Some of the novel contributions of our work include:

• An external sorting architecture that distributes the work between the CPU and GPU.

• An in-memory GPU-based sorting algorithm which is up to 10 times faster than prior CPU-based and GPU-based in-memory sorting algorithms.

• Peak I/O performance on an inexpensive PC and near peak memory bandwidth on the GPU.

• A scalable approach to sorting massive databases by efficiently sorting large data partitions.

In combination, these features allow an inexpensive PC with a mid-range GPU to outperform much more expensive CPU-only PennySort systems. The rest of the paper is organized as follows. Section 2 reviews related work on sorting, hardware accelerated database queries, and GPU-based algorithms. Section 3 highlights some of the limitations of CPU-based external sorting algorithms and gives an overview of GPUTeraSort. Section 4 presents the GPUTeraSort algorithm and Section 5 describes its implementation. Section 6 compares its performance with prior CPU-based algorithms.

2. RELATED WORK
This section briefly surveys related work in sorting and the use of GPUs to accelerate data management computations.

2.1 Sorting
Sorting is a key problem in database and scientific applications. It has also been well studied in the theory of algorithms [23]. Many optimized sorting algorithms, such as quicksort, are widely available and many variants have been described in the database literature [2]. However, the CPU performance of sorting algorithms is governed by cache misses [17, 24, 32] and instruction dependencies [45]. To address these memory and CPU limits, many parallel algorithms and sorting systems have been proposed in the database and high performance computing literature [11, 14, 25, 38, 44].

The Sort Benchmark, introduced in 1985, was commonly used to evaluate sorting algorithms [15]. As the original benchmark became trivial, it evolved into the MinuteSort [32] and PennySort [33] benchmarks. Nyberg et al. [32] use a combination of quicksort and selection-tree mergesort in the AlphaSort algorithm. In practice, AlphaSort's performance varied considerably based on the cache sizes. The NOW-SORT algorithm [8] used a cluster of workstations to sort large databases. Recently, Garcia and Korth [17] used features of SMT (simultaneous multi-threading) to accelerate in-memory sort performance.

2.2 Optimizing Multi-Level Memory Accesses
Many algorithms have been proposed to improve the performance of database operations using multi-level memory hierarchies that include disks, main memories, and several levels of processor caches. Ailamaki gives a recent survey of these techniques [4]. Over the last few years, database architectures have used massive main memory to reduce or eliminate I/O; but the resulting applications still have very high clocks per instruction (CPI). Memory stalls due to cache misses can lead to increased query execution times [6, 27]. There is considerable recent work on redesigning database and data mining algorithms to make full use of hardware resources and minimize memory stalls and branch mispredictions. These techniques can also improve the performance of sorting algorithms [5, 12, 26, 28, 36, 37, 39, 45].

2.3 GPUs and Data Parallelism
Many special processor architectures have been proposed that employ data parallelism for data intensive computations. Graphics processing units (GPUs) are common examples of this, but there are many others. The ClearSpeed CSX600 processor [1] is an embedded, low power, data parallel co-processor that provides up to 25 GFLOPS of floating point performance. The Physics Processing Unit (PPU) uses data parallelism and high memory bandwidth to achieve high throughput for physics simulation. Many other co-processors accelerate performance through data parallelism.

This paper focuses on using a GPU as a co-processor for sorting, because GPUs are commodity processors. A high performance mid-range GPU costs less than $300. Current GPUs have about 10× the memory bandwidth and processing power of the CPU, and this gap is widening. Commodity GPUs are increasingly used for different applications including numerical linear algebra, scientific, and geometric computations [34]. GPUs have also been used as co-processors to speed up database queries [9, 18, 19, 40] and data streaming [20, 29, 41].

Sorting on GPUs: Many researchers have proposed GPU-based sorting algorithms. Purcell et al. [35] describe a bitonic sort using a fragment program where each stage of the sorting algorithm is performed as one rendering pass. Kipfer et al. [22] improve bitonic sort by simplifying the fragment program; but the algorithm still requires ∼10 fragment instructions. Govindaraju et al. [20] present a sorting algorithm based on a periodic balanced sorting network (PBSN) that uses texture mapping and blending operations. However, prior GPU-based algorithms have certain limitations for large databases. These include:

• Database size: Previous algorithms were limited to databases that fit in GPU memory (i.e., 512 MB on current GPUs).

• Limit on key size: The sort keys were limited to 32-bit floating point operands.

• Efficiency: Previous algorithms were not fast enough to match the disk array IO bandwidth.

Our GPUTeraSort algorithm uses the GPU as a co-processor in ways that overcome these limitations.

3. OVERVIEW
This section reviews external memory sorting algorithms, analyzing how these algorithms use processors, caches, memory interfaces, and input/output (I/O) devices. Then we present our GPUTeraSort algorithm.

3.1 External Memory Sorting
External memory sorting algorithms are used to reorganize large datasets. They typically perform two phases.
The first phase produces a set of files; the second phase processes these files to produce a totally ordered permutation of the input data file. External memory sorting algorithms can be classified into two broad categories [42]:

• Distribution-Based Sorting: The first phase partitions the input data file using (S−1) partition keys and generates S disjoint buckets such that the elements in one bucket precede the elements in the remaining buckets [23]. In the second phase, each bucket is sorted independently. The concatenated sorted buckets are the output file.

• Merge-Based Sorting: The first phase partitions the input data into data chunks of approximately equal size, sorts these data chunks in main memory, and writes the "runs" to disk, as sketched below. The second phase merges the runs in main memory and writes the sorted output to disk.

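As a minimal illustration of the merge-based scheme (ours, not from the paper; in-memory lists stand in for disk files and runs):

import heapq

def external_merge_sort(records, run_size):
    # Phase 1: sort fixed-size chunks in memory; each sorted chunk is a "run".
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    # Phase 2: k-way merge of all runs into the totally ordered output.
    return list(heapq.merge(*runs))

print(external_merge_sort([9, 3, 7, 1, 8, 2, 5], run_size=3))  # [1, 2, 3, 5, 7, 8, 9]
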
External memory sorting performance is often limited by I/O performance. Disk I/O bandwidth is significantly lower than main memory bandwidth. Therefore, it is important to minimize the amount of data written to and read from disks. Large files will not fit in RAM, so we must sort the data in at least two passes, but two passes are enough to sort huge files. Each pass reads and writes to the disk. Hence, the two-pass sort throughput is at most 1/4 the throughput of the disks. For example, a PC with 8 SATA disks, each with a peak I/O bandwidth of 50 MBps, can achieve at most 400 MBps disk bandwidth. So a p-pass algorithm will have a throughput of 400/(2p) MBps, since each pass must read as well as write the data. In particular, a two-pass sort achieves at most 100 MBps throughput on this PC. Single pass algorithms only work on databases that fit entirely in main memory.

External memory sort algorithms can operate in two passes if the Phase 1 partitions fit in main memory. The parallel disk model (PDM) [43] captures disk system performance properties. PDM models the number of I/O operations, disk usage, and CPU time. Vitter [42] analyzed the practical applicability of the PDM model to common I/O operations such as scanning the items in a file, sorting a file, etc. In this model, the average and worst case I/O performance of external memory sorting algorithms is ≈ (n/D) log_m n, where n is the input size, m is the internal memory size, D is the number of disks, and log_m n denotes the number of passes when the data partition size in Phase 1 is ≈ m [3, 30]. Based on the PDM model, an external memory sorting algorithm can achieve good I/O performance on large databases when the data partition sizes are comparable to the main memory size. Salzberg et al. [38] present a similar analysis of merge-based sorting memory requirements. The analysis is as follows. If N is the file size, M is the main memory size, and R is the run size in Phase 1, then typically: (1) R ≈ M/3 because of the memory required to simultaneously pipeline reading the input, sorting, and writing the output. The number of runs generated in Phase 1 is runs ≈ N/R. If T is the I/O read size per run in Phase 2, then since at least one buffer for each run must fit in memory and a few more buffers are needed for prefetch and postwrite: (2) M ≈ T × runs ≈ T × N/R. Combining equations (1) and (2) gives (3) M² ≈ 3 T N or, ignoring the constant term, (4) M ≈ √(T N).
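
To make this sizing analysis concrete, here is a back-of-the-envelope sketch (ours, not from the paper) that evaluates the throughput bound and equations (1)-(4) for the example values used in the text:

import math

def p_pass_throughput_mbps(aggregate_disk_mbps, passes):
    # Each pass reads and writes all the data, so a p-pass sort moves the
    # data 2p times over the disk interface.
    return aggregate_disk_mbps / (2 * passes)

def two_pass_ram_bytes(file_bytes, read_size_bytes):
    # From (1) R ~ M/3 and (2) M ~ T * N / R we get M^2 ~ 3 T N, i.e.
    # M ~ sqrt(3 T N); equation (4) drops the constant factor.
    return math.sqrt(3 * read_size_bytes * file_bytes)

print(p_pass_throughput_mbps(8 * 50, passes=2))   # 100.0 MBps, as in the text
M = two_pass_ram_bytes(100 * 2**30, 2 * 2**20)    # N = 100 GB, T = 2 MB
print(M / 2**20, M / 3 / 2**20)  # M ~ 785 MB, R = M/3 ~ 260 MB (paper: ~230 MB)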

Figure 1: Performance of an optimized merge-based external memory sorting algorithm on a dual 3.6 GHz Xeon processor system. [The plot shows total time (sec), Phase 2 time (sec), and Phase 2 I/O bandwidth (MB/s) against Phase 1 partition size (KB).] Observe that the speed of Phase 2 increases nearly linearly with the partition size. As the data partition sizes in Phase 1 fit well in the L2 cache, the Phase 1 time remains nearly constant.

Since a two-pass sort's RAM requirements (M) increase as the square root of the input file size, multi-GB RAM machines can two-pass sort terabyte files. In particular, if T = 2 MB to reduce disk seek overhead and N is 100 GB, then R ∼ 230 MB. In practice, Phase 1 partitions are hundreds of megabytes on current PCs. However, current algorithms running on commodity CPUs, referred to as CPU-based algorithms, cannot achieve high sorting performance on such large partitions because:

• Cache Misses: CPU-based sorting algorithms incur significant cache misses on data sets that do not fit in the L1, L2 or L3 data caches [32]. Therefore, it is not efficient to sort partitions comparable to the size of main memory. This results in a tradeoff between disk I/O performance (as described above) and the CPU computation time spent sorting the partitions. For example, in merge-based external sorting algorithms, the time spent in Phase 1 can be reduced by choosing run sizes comparable to the CPU cache sizes. However, this choice increases the time spent in Phase 2 to merge a large number of small runs. Figure 1 illustrates the performance of an optimized commercial CPU-based algorithm [31] on a dual Xeon configuration for varying Phase 1 run sizes. Observe that the elapsed time decreases as the run size increases. However, increasing the run size beyond the CPU data cache sizes can degrade the sorting performance during Phase 1 [24]. As explained in Section 4, GPUs have a high bandwidth memory interface that can achieve higher performance on larger runs.

• I/O Performance: I/O operations have relatively low CPU overhead. However, CPU-based sorting algorithms can be compute-intensive [24] and may not be able to achieve high I/O performance. Figure 13 highlights the I/O performance of Nsort [31] on systems with a peak I/O throughput of 200 MBps. The I/O throughput obtained by the CPU-based sorting algorithm is around 147 MBps for a single processor and around 200 MBps with a dual processor. This suggests that the overall I/O performance can be improved by offloading computation to an additional processor or co-processor.

• Memory Interfaces: Some recent external sorting algorithms use simultaneous multi-threading (SMT) and chip multi-processor (CMP) architectures to improve performance. However, the interface to main memory on current SMT and CMP architectures significantly limits the memory bandwidth available to each thread when data does not fit in the processor caches [17]. It is possible to achieve higher performance by running the sorting algorithm on co-processors with dedicated memory interfaces.

3.2 Sorting with a Graphics Processor
This section gives a brief overview of graphics processors (GPUs), highlighting the features that make them useful for external memory sorting. GPUs are designed to execute geometric transformations on a rectangular pixel array. Each transformation generates a data stream of display pixels. Each incoming data element has a color and a set of texture coordinates that reference a 2D texture array. The data stream is processed by a user-specified program executing on multiple fragment processors. The output is written to memory. GPUs have the following capabilities useful for data-intensive computations.

Figure 2: This figure highlights the high data parallelism and memory bandwidth inside a GPU. GPUTeraSort uses the vector processing functionalities to implement a highly parallel bitonic sorting network. It outperforms prior CPU-based and GPU-based algorithms by 3-10 times.

• Data Parallelism: GPUs are highly data parallel, exploiting both partition parallelism and pipeline parallelism. They use many fragment processors for partition parallelism. Each fragment processor is a pipeline-parallel vector processor that performs four concurrent vector operations, such as multiply-and-add (MAD) instructions, on the texture coordinates or the color components of the incoming data stream. Current CPUs offer similar data parallelism using instructions such as SSE2 on Intel processors or AltiVec operations on PowerPC processors. However, CPU data parallelism is relatively modest by comparison. In the case of sorting, a high-end Pentium IV processor can execute four SSE2 comparisons per clock cycle, while an NVIDIA GeForce 7800 GTX GPU-based sorting algorithm can perform 96 comparisons per clock cycle.

• Instruction-level Parallelism: In addition to the SIMD and vector processing capabilities, each fragment processor can also exploit instruction-level parallelism, evaluating multiple instructions simultaneously using different ALUs. As a result, GPUs can achieve higher performance than CPUs. For example, the peak computational performance of a high-end dual core Pentium IV processor is 25.6 GFLOPS, whereas the peak performance of an NVIDIA GeForce 7800 GTX is 313 GFLOPS. GPU instruction-level parallelism significantly improves sort performance, overlapping sort-key comparison operations with fetching the pointers associated with the keys to achieve near-peak computational performance.

• Dedicated Memory Interface: The GPU's memory controller is designed for high bandwidth data streaming between main memory and the GPU's onboard memory. GPUs have a wider memory interface than the CPU. For example, current high-end PCs have an 8-byte main memory interface with a peak memory bandwidth of 6.4 GB per second, whereas an NVIDIA 7900 GTX has a 64-byte memory interface to the GPU video memory and can achieve a peak memory bandwidth of 56 GB per second.

• Low Memory Latency: GPUs have lower computational clock rates (∼690 MHz) than memory clock rates (∼1.8 GHz) but reduce memory latency by accessing data sequentially, thereby allowing prefetch and pipelining. In contrast, CPUs have higher computational clock rates (∼4 GHz) than main memory speeds (∼533 MHz) but suffer from memory stalls both because the memory bandwidth is inadequate and because they lack a data-stream approach to data access.

Many GPU-based sorting algorithms have been designed to exploit one or more of these capabilities [20, 22, 35]. However, those algorithms do not handle large, wide-key databases and have other limitations, highlighted in Section 2.

In summary, GPUs offer 10× more memory bandwidth and processing power than CPUs, and this gap is widening. GPUs present an opportunity for anyone who can use them for tasks beyond graphics [34].

3.3 Hybrid Sorting Architecture
This section gives an overview of GPUTeraSort. The next section describes the use of the GPU in detail. Our goal is to design a sorting architecture that efficiently utilizes the computational processors, I/O, and memory resources. GPUTeraSort has five stages that can be executed sequentially, but some stages can be executed using multi-buffered pipeline-parallel independent threads:

• Reader: The reader asynchronously reads the input file into an (approximately 100 MB) main memory buffer (zero-copy direct IO). Read bandwidth is improved by striping the input file across all disks so the data is transferred from all disks in parallel. The I/O bandwidth and the CPU usage of the reader depend on the number of overlapping asynchronous I/O requests, the stripe size, and the number of disks in the stripe. The reader thread requires less than 10% of a CPU to achieve near-peak I/O performance.
• Key-Generator: The Key-Generator computes the (key, record-pointer) pairs from the input buffer. In practice, this stage is not computationally intensive but can be memory-intensive, reading each key from main memory. It sequentially writes a stream of key-pointer pairs to main memory.

• Sorter: The Sorter reads and sorts the key-pointer pairs. This stage is computationally intensive and memory-intensive on large buffers with wide keys (e.g., of size 10 bytes or more). For example, the throughput of an SSE-optimized CPU-based quicksort on a 3.4 GHz Pentium IV sorting 1 million floating point keys is much less than the throughput of the other external memory sorting stages and is the bottleneck. This is shown by Figure 11 and by the quicksort performance in Figure 8.

• Reorder: The reorder stage rearranges the input buffer based on the sorted key-pointer pairs to generate a sorted output buffer (a run). On large databases, reorder is expensive because it randomly reads and writes long records from the input buffer and so has many memory stalls (Figure 11).

• Writer: The writer asynchronously writes the run to the disk. Striping a run across many disks is not efficient for Phase 2 reads [42]; therefore GPUTeraSort cyclically writes the Phase 1 runs to individual disks in very large transfers. The writer thread requires less than 10% of the CPU to achieve near-peak I/O performance.

Figure 3: Flow Diagram of Phase 1 of the GPUTeraSort Architecture using GPUs and CPUs. [The diagram shows the Reader moving data from the disks into RAM, the Key-Pointer Generator on the CPU, the Sorter on the GPU with its high-bandwidth (40 GBps) video RAM reached over DMA, and the Reorder and Writer stages streaming sorted runs back to the disks.]

Figure 3 shows GPUTeraSort's pipeline flow. In order to efficiently pipeline these stages, GPUTeraSort uses a GPU as a co-processor to perform the key-pointer sorter task. The new sorting architecture:

• Performs the key-pointer sorting on the GPU and frees CPU cycles to achieve higher I/O performance and throughput.

• Reduces memory contention by using the dedicated GPU memory for sorting.
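
To make the multi-buffered pipeline concrete, here is a minimal CPU-only sketch (ours, not the paper's implementation; the stage names follow the text, the record and buffer sizes are illustrative, and an in-memory sort stands in for the GPU sorter):

import threading, queue

def stage(fn, inq, outq):
    # Generic pipeline stage: pull a buffer, process it, pass it downstream.
    def run():
        while True:
            item = inq.get()
            if item is None:                  # end-of-stream sentinel
                if outq is not None:
                    outq.put(None)
                return
            result = fn(item)
            if outq is not None:
                outq.put(result)
    t = threading.Thread(target=run)
    t.start()
    return t

def key_gen(buf):
    # Key-Generator: build (key, record-pointer) pairs from the buffer.
    return buf, [(rec[:10], i) for i, rec in enumerate(buf)]

def sort_keys(pair):
    # Sorter: stand-in for the GPU key-pointer sort.
    buf, kp = pair
    return buf, sorted(kp)

def reorder(pair):
    # Reorder: gather full records in sorted key order to form a run.
    buf, kp = pair
    return [buf[i] for _, i in kp]

runs = []
q = [queue.Queue(maxsize=2) for _ in range(4)]    # bounded multi-buffering
threads = [stage(key_gen, q[0], q[1]), stage(sort_keys, q[1], q[2]),
           stage(reorder, q[2], q[3]), stage(runs.append, q[3], None)]
for buf in ([b"banana-rec", b"apple-rec0"], [b"cherry-rec", b"apric-rec0"]):
    q[0].put(buf)                                 # Reader: feed input buffers
q[0].put(None)
for t in threads:
    t.join()
print(runs)  # the Writer would stream each sorted run to disk

The bounded queues model the limited number of in-flight buffers: a slow stage throttles its upstream producers, much as the multi-buffering between GPUTeraSort's threads does.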

4. LARGE SORTS USING GPUS
This section describes GPUTeraSort's sorting algorithm to sort wide keys and pointers on GPUs using a novel data representation. The algorithm improves cache efficiency and minimizes data transfer overheads between the CPU and the GPU. A theoretical and experimental analysis of GPUTeraSort's data transfer rate and memory bandwidth requirements compares the performance with prior algorithms.

4.1 Bitonic Sorting on GPUs
GPU-based algorithms perform computations on 2D arrays of 32-bit floating point data values known as textures. Each array element corresponds to a pixel. Pixels are transformed by programmable fragment processors, each executing the same fragment program on each pixel. The multiple GPU fragment processors perform data parallel computations on different pixel arrays simultaneously. This simple data-parallel architecture avoids write-after-read hazards while performing parallel computations.

At a high level, GPU-based sorting algorithms read values from an input array or texture, perform data-independent comparisons using a fragment program, and write the output to another array. The output array is then swapped with the input array, and the comparisons are iteratively performed until the whole array is sorted. These sorting network algorithms map well to GPUs.

The bitonic sorting network [10] sorts bitonic sequences in multiple merge steps. A bitonic sequence is a monotonic ascending or descending sequence.

Given an input array a = (a_0, a_1, ..., a_n), the bitonic sorting algorithm proceeds bottom-up, merging bitonic sequences of equal sizes at each stage. It first constructs bitonic sequences of size 2 by merging pairs of adjacent data elements (a_{2i}, a_{2i+1}) where i = 0, 1, ..., n/2 − 1. Then bitonic sequences of size 4 are formed in stage 2 by merging pairs of bitonic sequences (a_{2i}, a_{2i+1}) and (a_{2i+2}, a_{2i+3}), i = 0, 1, ..., n/2 − 2. The output of each stage is the input to the next stage. The size of the bitonic sequence pairs doubles at every stage. The final stage forms a sorted sequence by merging the bitonic sequences (a_0, a_1, ..., a_{n/2}) and (a_{n/2+1}, a_{n/2+2}, ..., a_n) (see Figure 4).

Specifically, stage k is used to merge two bitonic sequences, each of size 2^{k−1}, and generates a new bitonic sequence of length 2^k. The overall algorithm requires log n stages. In stage k, we perform k steps in the order k to 1. In each step, the input array is conceptually divided into chunks of equal sizes (size d = 2^{j−1} for step j) and each element in one chunk is compared against the corresponding element in its adjacent chunk, i.e., an element a_i in a chunk is compared with the element at distance d (a_{i+d} or a_{i−d}). The minimum is stored in one data chunk and the maximum is stored in the other data chunk. Figure 4 shows a bitonic sorting network on 8 data values. Each data chunk in a step is color coded, and elements in adjacent data chunks are compared.
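
The stage/step structure above can be captured in a short CPU reference implementation (ours, for illustration; on the GPU each inner step is instead performed as a single rasterization pass over textured quads). The input length must be a power of two:

import math, random

def bitonic_sort(a):
    # Stage k merges bitonic sequences of size 2^(k-1) into sequences of
    # size 2^k; within stage k, steps j = k..1 compare elements at
    # distance d = 2^(j-1), exactly the structure described above.
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    for k in range(1, int(math.log2(n)) + 1):
        for j in range(k, 0, -1):
            d = 1 << (j - 1)
            for i in range(n):
                partner = i ^ d              # element at distance d
                if partner > i:
                    # Blocks of size 2^k alternate sort direction so each
                    # pair of sorted blocks forms a bitonic sequence.
                    ascending = (i & (1 << k)) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
    return a

print(bitonic_sort([random.randint(0, 99) for _ in range(8)]))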

Figure 4: A bitonic sorting network on 8 data values. [Only the comparator grid of the figure survived extraction; it shows the log n = 3 stages, with stage k performing steps k down to 1.]

GPUTeraSort(n)
1  b = number of bytes in key
2  W = width(tex) = 2^⌊(log n)/2⌋, H = height(tex) = 2^⌈(log n)/2⌉
3  sorted = false
4  currBytes = most significant four bytes of keys
5  SortedLists = input array of key-pointers
6  while (!sorted)
7    for each array in SortedLists
8      transfer the pointerTexture and keyTexture of currBytes to the GPU
9      perform HybridBitonicSort(pointerTexture, keyTexture, n)
10     read back the pointers
11     SortedLists = lists of contiguous arrays with equal keys in currBytes
12     if (sizeof(SortedLists) = 0 or currBytes = b−4) sorted = true
13     else currBytes = next four bytes
14   end for

ROUTINE 4.2: GPUTeraSort: This routine is used to sort an input sequence of length n with b-byte keys. The input sequence is copied into a 2D texture whose width and height are set to the powers of 2 closest to √n (line 2). The sorting algorithm starts with the most significant four bytes of the keys (line 4). Next, it performs at most b/4 scans over subsets of the input sequence, during each scan performing a fast bitonic radix sort on the GPU (line 9). It then reads back the pointers and computes the contiguous portions of the array that remain to be sorted. These contiguous unsorted portions consist of keys that are equal up to currBytes and are processed further based on the remaining bytes (line 13).

4.3 Handling Wide Keys
GPUTeraSort uses a hybrid radix-bitonic sort algorithm. It uses bitonic sort on the first few bytes of the keys as the initial radix for sorting. In order to handle string comparisons using 32-bit floating point hardware, it is important to encode the data on the GPU while avoiding special IEEE floating point values such as NaN, ∞, etc., by masking the two most significant bits (MSB) to "10". The key encoding to GPU floating point representations handles any data type, including ASCII strings with lexicographic comparisons. After the GPU orders the key-pointer array based on the first few bytes, GPUTeraSort scans the array to identify contiguous portions of the array with equal keys. It sorts these portions based on the second set of bytes on the GPU. It repeats the process till the entire array is sorted or all the key bytes have been compared. Routine 4.2 gives pseudo-code for GPUTeraSort's hybrid radix-bitonic sort using the GPU.
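
A CPU-side sketch of this scan-and-recurse control loop follows (ours, for illustration: Python's sort stands in for the GPU bitonic pass, and the encoding packs three key bytes per 32-bit word so that forcing the top two bits to "10" loses no key information; the paper's packing of four bytes across texture channels differs in detail):

def encode_chunk(key, offset):
    # Pack three key bytes into a 32-bit word whose two most significant
    # bits are forced to "10", keeping the value away from IEEE NaN/Inf
    # patterns while preserving lexicographic order of the packed bytes.
    chunk = int.from_bytes(key[offset:offset + 3].ljust(3, b"\0"), "big")
    return 0x80000000 | (chunk << 6)

def hybrid_sort(keys, key_bytes):
    order = list(range(len(keys)))
    def sort_range(lo, hi, offset):
        if hi - lo <= 1 or offset >= key_bytes:
            return
        # Stand-in for the GPU pass: order this slice by the encoded chunk.
        order[lo:hi] = sorted(order[lo:hi],
                              key=lambda i: encode_chunk(keys[i], offset))
        # Scan for contiguous runs with equal chunks; recurse on later bytes.
        start = lo
        for i in range(lo + 1, hi + 1):
            if i == hi or (encode_chunk(keys[order[i]], offset) !=
                           encode_chunk(keys[order[start]], offset)):
                sort_range(start, i, offset + 3)
                start = i
    sort_range(0, len(keys), 0)
    return [keys[i] for i in order]

keys = [b"aaaa-beta0", b"aaaa-alpha", b"bbbb-alpha"]
print(hybrid_sort(keys, key_bytes=10))
# [b'aaaa-alpha', b'aaaa-beta0', b'bbbb-alpha']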

We implemented three prior GPU-based sorting algorithms [20, 22, 35] and compared their performance to GPUTeraSort. Figure 6 shows the comparison: GPUTeraSort's bitonic sort has a 3x to 6x advantage over previous GPU sorting algorithms.

4.4 Cache-Efficient Sorting on GPUs
In this section, we investigate the cache performance of the fragment processors while performing each step of the bitonic sorting algorithm. Suppose the width of the input array is W, the height is H, and the cache block size is B × B. In each step, each pixel is compared against one other pixel. Therefore, there are at least n_compulsory = (W × H)/B² compulsory cache misses. Our goal is to reduce the total number of cache misses.
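
As a quick check of this lower bound (our arithmetic, with an illustrative texture and block size, not values from the paper):

def compulsory_misses(width, height, block):
    # Each step touches every pixel of the width x height texture once, so
    # with block x block cache lines at least (W*H)/B^2 fetches are compulsory.
    return (width * height) // (block * block)

print(compulsory_misses(1024, 1024, 8))  # 16384 compulsory block fetches per step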

Figure 6: Comparison of GPUTeraSort's bitonic sort with other GPU-based algorithms indicates a 3-7 fold performance advantage over prior GPU-based bitonic sort and PBSN algorithms (Kipfer et al. [22], Govindaraju et al. [20], and Purcell et al. [35]). [The plot shows time in seconds (0-6) against the number of records (0-5,000,000) for GPUTeraSort, BitonicSort (Purcell et al. 2003), BitonicSort (Kipfer et al. 2005), and PBSN (Govindaraju et al. 2005).]

BitonicSort(tex, W, H)
1  n = number of values to be sorted = W*H*4  /* single-array representation */
2  for i = 1 to log n       /* for each stage */
3    for j = i to 1         /* for each step */
4      quad size B = 2^{j−1}
5      draw textured quads of size B
6      copy from frame buffer to tex
7    end for
8  end for

ROUTINE 4.1: Bitonic Sorting Network Algorithm: We use this routine to sort a floating point input sequence of length n. We perform log n stages on the input sequence and, during each stage i, perform i steps with quad sizes (width × height) varying from 2^{i−1} to 1 (line 4). The overall algorithm requires O((n log² n)/2) comparisons and maps well to GPUs.

The single-array representation, as opposed to the four-array representation proposed by Govindaraju et al. [20], has the following advantages:

• Mapping: The data transfer operation from the CPU to the GPU directly maps to the single-array representation, while the four-array representation does not.

• Efficient sorting: The single-array representation has better performance than the four-array representation because it reduces the memory accesses in the early algorithm steps.