`Larson
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 6,571,244 B1
`May 27, 2003
`
`US006571244B1
`
`(54) RUN FORMATION IN LARGE SCALE
`SORTING USING BATCHED
`REPLACEMENT SELECTION
`
`(75) Inventor: Per-Ake Larson, Redmond, WA (US)
`(73) Assignee: Microsoft Corporation, Redmond, WA
`(US)
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(*) Notice:
`
`(21) Appl. No.: 09/437,714
`(22) Filed:
`Oct. 28, 1999
`(51) Int. Cl." .................................................. G06F 7/00
`(52) U.S. Cl. ....................... 707/7, 707/100; 707/103 R;
`707/103 Y
`(58) Field of Search ........................ 707/7, 100, 103 R,
`707/103 Y
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`5,287,494 A * 2/1994 Garcia et al. .................. 707/7
`5,852,826 A * 12/1998 Graunke et al. ............... 707/7
`OTHER PUBLICATIONS
`Dinsmore, “Longer Strings from Sorting,” Communications
`of the ACM, vol. 8, No. 1, Jan. 1965, p. 48.
`Frazer et al., “Sorting by Natural Selection,” Communica
`tions of the ACM, vol. 15, No. 10, Oct. 1972, pp. 910–913.
`LaMarca et al., “The Influence of Caches on the Perfor
`mance of Sorting,” Proceedings of the Eighth Annual
`ACM-SIAM Symposium on Discrete Algorithms, Jan.
`1997, pp. 370–379.
`
`Larson et al., “Memory Management during Run Generation
`in External Sorting”, ACM, 1998, pp. 472–483.
`Nyberg et al., “AlphaSort: A RISC Machine Sort”, SIGMOD
`1994, pp. 233–242.
`Ting et al., “Multiway replacement selection sort with
`dynamic reservoir,” The Computer Journal, vol. 20, No. 4,
`Aug. 1977, pp. 298–301.
`Knuth, “The Art of Computer Programming: vol. 3, Sorting
`and Searching,” Second Edition, Addison Wesley publish
`ers, 1998, pp. i–xiii and 1–780.
`* cited by examiner
`Primary Examiner—Kim Vu
`Assistant Examiner—Hung Pham
`(74) Attorney, Agent, or Firm—Lee & Hayes, PLLC
`(57)
`ABSTRACT
`A large-scale sorting process utilizes a batched replacement
`selection method to form runs of sorted data records. The
`batched replacement selection method involves reading
`multiple records from a persistent data storage into main
`memory and sorting the multiple records to form a mini-run
`of multiple sorted data records. After formation, the mini
`run is added to a selection tree by inserting a pointer to a first
`record in the mini-run into the array of pointers. The first
`record is linked to remaining records in the mini-run. As
`records are selected for output from the selection tree, the
`methodology replaces the selected record with a next record
`in the associated mini-run (if not empty) or alternatively
`deletes the node if the mini-run is empty. The selected
`records are collected into an output buffer. When the number
`of records reaches a pre-determined number, the selected
`records are written in batch back to the persistent data
`storage.
`
`44 Claims, 12 Drawing Sheets
`
`Batch Input
`
`Mini-Run Creation
`
`Tree Processing
`
`Batch Output
`
`200
`2–
`–4–
`READ RECORD
`FROM DATA
`
`206
`COPY RECORD
`INTO MEMORY AND
`ADD TO MINI-RUN
`
`
`
`SELECT RECORD
`FOR OUTPUT
`
`
`
`224
`?
`ADD SELECTED
`RECORD TO
`OUTPUT BUFFER
`
`
`
`/~ 204
`switch To
`OUTPUT MODE
`
`
`
`|
`|
`
`w
`To Step 216
`
`210
`DETERMINE RUN
`NUMBERS
`
`---
`ADD, MINI-RUN TO
`SELECTION TREE
`
`
`
`
`
`222
`
`DELETE NODE
`
`
`
`
`
`228
`
`WRITE M
`RECORDS TO DATA
`STORAGE
`
`2-232
`Yes
`SWITCH TO
`|NPUT MODE
`
`w
`To Step 200
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 1
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 1 of 12
`
`US 6,571,244 B1
`
`Record
`
`/
`
`20
`
`
`
`22
`
`24
`
`Step 1: Read
`-
`Main M
`an Memory Subset of Records
`
`Data Storage
`
`Step 3: Write Run
`
`—
`l
`Record
`
`
`
`Step 2: Sort
`Records to
`Form Run
`
`
`
`
`
`
`
`
`
`Load-Sort-Store
`
`2ndan 2%t
`
`/T 20
`
`22
`
`Record
`
`24
`
`Main Memory
`|
`
`Record
`
`Step 2: Output
`Record
`
`Data Storage
`
`Step 1:
`Select
`Record
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Step 3: Get New
`Record
`
`
`
`
`
`Replacement
`Selection
`
`2.g. 2
`2ndan 2%t
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 2
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 2 of 12
`
`US 6,571,244 B1
`
`
`
`8 | 2 |
`
`9
`
`1,30
`
`After Replacement
`Selection
`
`2&. sº
`2ndan 2%t
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 3
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 3
`of 12
`
`6,571,244 B1
`US
`
`40
`
`41
`44
`48
`^ is
`-)
`?
`Run No. Home Node
`. . . .
`-
`Field
`Field
`(4 bytes)
`(4 bytes)
`
`N
`Record Key
`
`Other
`Record
`Fields
`
`Selection
`Tree Array
`
`—I-
`
`—A-
`Offset
`—r—
`
`1
`Internal s
`Nodes ºº 2
`Cache
`Li
`ine || 3
`
`
`
`
`
`
`
`—H
`
`External
`Nodes
`
`4
`
`5
`
`6
`7
`
`8
`
`9
`Array Slot
`
`Record Data Structure
`
`2ég. 4
`2ndan 2%t
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 4
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 4 of 12
`
`US 6,571,244 B1
`
`
`
`Add Node (1,14)
`
`30 —,
`
`Step 1
`
`Step 2
`
`2.9, 5
`2ndan 2%t
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 5
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 5 of 12
`
`US 6,571,244 B1
`
`
`
`Delete Node
`(1,15)
`
`Delete Node
`(1,15)
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 6
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 6 of 12
`
`US 6,571,244 B1
`
`Microprocessor 66
`
`Batched RS
`
`
`
`Stable Storage
`
`Records
`
`
`
`Memory Subsystem 54
`
`*— — — — — — — — — — — — — — — — — — — — — — — — — — — —
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 7
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 7 of 12
`
`US 6,571,244 B1
`
`1,15
`
`1,17
`
`1,39
`
`6
`
`1,22
`
`7
`
`1,29
`
`1,35
`
`1,43
`1,40
`º 1.65
`
`92(5)
`92(4)
`
`2.5
`
`2,8
`
`2,11
`
`2,13
`
`
`
`96
`
`Counter
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 8
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 8 of 12
`
`US 6,571,244 B1
`
`Selection
`Tree Array
`
`—A-
`
`—r—
`Offset
`–7–
`1
`
`
`
`
`
`Internal
`
`-
`Same | 2
`Nodes Cache
`Line | 3
`-
`4
`
`—#—
`
`External
`Nodes
`
`5
`
`6
`7
`
`8
`
`—Y—
`
`9
`Array Slot
`
`102
`
`(151) – (147)
`(1,52)
`(2,13)
`
`(1,43)
`
`(1,65) — (1,40) :
`
`100
`
`106 – 110-
`
`Record Key Run
`Number
`
`114
`112 " \
`
`Home
`
`Next
`
`\-104
`Record Data Structure
`
`(1,41) — (1,31) —
`(1,48) — (1,36) –
`(2,11) – (2,8) —
`(1,35) — (1,29) —
`(1,39) — (1,17) —
`
`22, 10
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 9
`
`
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 10
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 10 of 12
`
`US 6,571,244 B1
`
`
`
`3&icksort &  t £332;assic 88 .33atched 83 :
`
`22, 12
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 11
`
`
`
`U.S. Patent
`
`May 27, 2003
`
`Sheet 11 of 12
`
`US 6,571,244 B1
`
`?? disk
`
`3:33
`
`§§§
`
`3
`
`4%
`
`§§§
`
`4
`
`23:383
`
`$33
`
`
`
`
`
`
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 12
`
`
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 13
`
`
`
`US 6,571,244 B1
`
`1
`RUN FORMATION IN LARGE SCALE
`SORTING USING BATCHED
`REPLACEMENT SELECTION
`
`TECHNICAL FIELD
`This invention relates to systems and methods for large
`scale sorting of data records. More particularly, this inven
`tion relates to run formation techniques for producing sorted
`subsets or “runs” of data records that will be subsequently
`merged together to produce the final sorted set of data
`records.
`
`BACKGROUND
`Database systems store data records on disk arrays or
`other forms of non-volatile memory. Operations performed
`on databases can be easily accommodated if all of the data
`records are first loaded into volatile main memory (i.e.,
`RAM) for processing. However, databases often contain
`large numbers of data records, which far surpass the volatile
`memory resources. As a result, subsets of data records are
`processed in the main memory and then written back to disk
`after they are processed.
`Large scale sorting is one type of operation performed by
`a database system on a large set of data records that is too
`large to fit in main memory. “Mergesort” is the standard
`technique used for sorting large sets of records. Mergesort
`has two phases: (1) a run formation phase that creates sorted
`subsets, called “runs”, and (2) a merge phase that repeatedly
`merges runs into larger and larger runs, until a single run has
`been created. This invention pertains to the run formation
`phase of a mergesort operation.
`In general, there are two types of run formation tech
`niques: load-sort-store and replacement selection. Most sort
`implementations use the load-sort-store algorithm for run
`formation. This invention is directed to an improved replace
`ment selection technique. Load-sort-store is briefly
`described below, followed by a more detailed look at
`replacement selection.
`Run Formation Type I: Load-Sort-Store
`FIG. 1 shows the load-sort-store process implemented on
`a database system 20 having main memory 22 and persistent
`data storage 24. As the name implies, the load-sort-store
`process has three main steps: loading, sorting, and storing.
`During a “load” step 1, a subset of records is read from the
`data storage 24 into the main memory 22. Next, during a
`“sort” step 2, the process extracts pointers to all records into
`an array and sorts the entries in the array according to a
`designated sort key to form the run. Any in-memory sorting
`algorithm can be used, with “Quicksort” being the most
`popular choice. Afterwards, during a “store” step 3, the run
`is written back to the data storage 24 and all records in main
`memory 22 are erased.
`These three steps are repeated until all records have been
`processed. All runs created will be of the same length, except
`possibly the last run. After multiple runs have been created
`and stored during the run formation phase, the sorting
`process transitions to the merge phase in which the runs are
`merged into one large run.
`One drawback with the load-sort-store process is that
`CPU processing and I/O (input/output) cannot be over
`lapped. As a result, the computing resources of the database
`system are not fully utilized.
`Run Formation Type II: Replacement Selection
`Replacement selection is an alternative technique that
`produces longer runs and completely overlaps CPU process
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`ing and I/O. It is based on the observation that by tracking
`the highest key output thus far, the process can easily
`determine whether an incoming record can still be made part
`of the current run or should be deferred to the next run. Any
`record tacked on to the current run in this way increases the
`run length. Adding a record to the current run is expensive
`if a (sorted) pointer array is used, so replacement selection
`often uses a (binary) selection tree instead.
`FIG. 2 shows a replacement selection process imple
`mented in the database system 20. The process described
`below is the traditional version of the replacement selection
`algorithm as described, for example, by Donald Knuth in
`“The Art of Computer Programming, Volume. 3: Sorting and
`Searching”, Second Edition 1998. This version of the algo
`rithm assumes that records are of fixed length but, in
`practice, variable-length records are more common.
`Initially, a subset of records is read into the main memory
`22. When records are of fixed length, the main loop of the
`algorithm has three steps: selection, output, and get new
`record. In the “selection” step 1, the process selects from
`among the records in memory, a record with the lowest key
`greater than or equal to the last key output. Then, at the
`“output” step 2, the selected record is written out to the data
`storage 24 and the key value is remembered. Afterwards, at
`the “get” step 3, the process retrieves a new record from the
`data storage 24 and stores it in the slot previously occupied
`by the record just output.
`While load-sort-store is more widely used, replacement
`selection has many benefits. First, replacement selection
`produces runs larger than the memory used, which reduces
`the number of runs and the subsequent merge effort. Second,
`replacement selection exhibits steady I/O behavior during
`run generation rather than I/O in great bursts. This improves
`the utilization of I/O devices. Third, it very effectively
`exploits pre-sorting to produce longer runs, i.e., input
`sequences that are not random but somewhat correlated to
`the desired sorted output sequence. In particular, if the input
`is already sorted, a single run will be generated. Fourth,
`sorting is often done as part of grouping with aggregation or
`for duplicate removal. A technique called early aggregation
`can then be applied, which reduces I/O significantly. Early
`aggregation achieves much higher I/O reduction if runs are
`created by replacement selection.
`The following sections examine the selection step of
`replacement selection using a binary selection tree. The first
`section considers a tree formed with fixed length records,
`and a subsequent section examines the tree with variable
`length records.
`Selection Tree for Fixed Length Records
`The selection step is the most expensive part of the
`replacement selection operation. The selection step can be
`performed efficiently with the help of a selection tree, using
`the run number and the record key as the (compound)
`selection key. A selection tree for N records is a left
`complete binary tree with N external nodes and N-1 internal
`nodes stored in an array without pointers. Logically, each
`external node or tree “leaf” stores a data record with a
`combined sort key consisting of a run number and the
`original sort key of the record. The combined sort key is here
`called a run-key pair and represented by a notation “run
`number, key” such as “1,12”. Each internal node stores the
`lesser of the sort keys of its two children nodes plus a
`reference to the source node of the key. Physically, the
`selection tree and nodes can be implemented in several
`different ways.
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 14
`
`
`
`3
`FIGS. 3a and 3b show a selection tree 30 before and after
`a, selection replacement operation for replacing the lowest
`key on the first run (i.e., key 12 for run 1). FIG. 3a shows
`the initial state of the selection tree 30 with five keys (i.e.,
`5, 12, 15, 22, and 32). The key of the last record output was
`8. External nodes or tree “leafs” are drawn as rectangles and
`internal tree nodes are shown as ellipses. The number to the
`left of a node indicates its position in the array.
`Initially, the record represented by external node (1,12) is
`selected and output. Its key value “12” is now the last key
`value output, and this value is recorded. Outputting the
`record frees up the external node at position 9. A new record
`is input and stored in this node. The key of the new record
`is 30, which is higher than the last key value output (12) so
`the new record “fits” in the current run. Its run number is set
`to 1, resulting in the pair (1,30) shown in position 9. Next,
`a new minimum key is found by traversing the path from the
`external node at position 9 up to the root at position 1. At
`each node, the sort keys of two sibling nodes are compared
`and the node with the lower sort key is promoted to the
`parent node. For example, for external nodes (1,32) and
`(1,30) at positions 8 and 9, the key (1,30) at position 9 is less
`than the key (1,32) and is promoted to the parent node at
`position 4. Again, comparison of the internal nodes at
`positions 4 and 5 reveals that node (1,15) is less than (1,30)
`and is therefore promoted to parent node at position 2. The
`number of comparisons is always either floor(log2(N)) or
`ceil(log2(N)).
`FIG. 4 shows one implementation of the selection tree 30
`stored in an array 40, although the selection tree can be
`implemented in several different ways. A node consists only
`of a pointer 41 to a corresponding record, in a record data
`structure 42 in order to keep the tree itself of minimal size.
`The beginning of the array is offset so that each pair of
`sibling nodes occupies a common cache line. The type of a
`node is determined by its position in the array so the type
`need not be explicitly marked. Here, the internal nodes are
`at the top of the array and the external nodes are at the
`bottom. There are approximately twice as many nodes as
`records (i.e., 2N–1 nodes to N records), so the selection tree
`array adds approximately eight bytes of overhead per record.
`A record slot 43 in the record data structure 42 contains
`a record consisting of a record key 44 and possibly some
`other fields 45. Each record slot also has two additional
`fields needed by the run formation algorithm, namely, a run
`number field 46 (e.g., 4 bytes), and a home node field 48
`recording the position in the selection tree array of the
`external node that owns the record (e.g., 4 bytes). The last
`two fields 46 and 48 add another eight bytes of overhead per
`record, bringing the total overhead to sixteen bytes per
`record. The home node field 48 tracks the external node
`occupied by a record so that, when a record has been output,
`the new record can be placed in that tree node. Knuth's
`version of the algorithm handles this problem by adding a
`level of indirection: an internal node points to the external
`node owning the record, which in turn points to the actual
`record. The indirection increases cache misses during tra
`versal because to locate a record the external node must first
`be accessed.
`When records are of fixed length, the home node field can
`actually be eliminated—a records home node can be
`deduced from which record slot it occupies. However, this is
`not possible when records are of variable length.
`Knuth's version uses two additional refinements: storing
`the loser instead of the winner in each internal node and
`packing one internal node and one external node together
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`US 6,571,244 B1
`
`4
`into a combined physical node. These changes make the
`algorithm harder to understand, but have little effect on its
`performance.
`When records are of fixed length and the amount of
`memory available is known before run formation starts, the
`size of the selection tree can be determined before run
`formation begins. The tree is pre-allocated and filled with a
`fictitious run zero that is never output. Processing of an
`incoming record then consists of the following steps:
`1. Record the key value and run number of the top record.
`2. Copy the top record to an output buffer (provided it is
`not from the fictitious run zero).
`3. If the output buffer is full, write it to the run file.
`4. Copy the incoming record into the vacated slot in
`memory.
`5. Determine its run number by comparing it with the
`recorded key value and set the run number field in the
`record.
`6. Copy a pointer to the new record into the appropriate
`node in the selection tree. Call this node T.
`7. Set the home node field in the record to point to T.
`8. Fix up the path from node T to the root in the selection
`tree.
`When there are no more input records, the memory is
`once again filled with fictitious records that are never output.
`The only operation performed is replacement, consisting
`mainly of copying two records and traversing a path from a
`leaf to the root.
`Selection Tree for Variable Length Records
`The above discussion of tree selection assumes that the
`records are of fixed length. When records are of variable
`length, two complications arise. First, managing the space
`reserved for storing records becomes more complex.
`Second, records are no longer just replaced in the selection
`tree. A record may be output and deleted from the tree
`without being replaced (if there is no free slot large enough
`for the replacement record). Similarly, records may be input
`and added to the tree without outputting any existing
`records. Consequently, the tree is no longer of constant size.
`In the case of variable length records, it is better to view
`replacement selection as consisting of two processes: (1) an
`input process that fills memory with new records and adds
`them to the selection tree and (2) an output process that
`repeatedly deletes the top record from the tree and outputs
`it. The input process drives the processing. Whenever it fails
`to find memory space for an incoming record, it resumes the
`output process, which runs until it has created a free slot at
`least as large as needed by the new record. To purge all
`records when reaching the end of input, the input process
`requests space for an infinitely large record.
`FIG. 5 illustrates the input process in which a new node
`(1,14) is added to the selection tree 30 from FIG. 3b. The
`new node (1,14) is added to the end of the tree after the last
`external node, as represented by insertion of node (1,14) into
`position 10 in the tree 30 (step 1 in FIG. 5). Because the tree
`is complete, its parent is always the first external node,
`which in this case is node (1,15) at position 5 (see FIG. 3b).
`The content of the parent node is copied into the next free
`node to become the right sibling node of the new element,
`as represented by new node (1,15) in position 11 (step 2 in
`FIG. 5).
`A different external node now owns the record, so the
`home node field in the record has to be updated from
`position 5 to position 11. In addition, the parent node
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 15
`
`
`
`5
`changes its role from an external node to an internal node.
`The sort keys of the two new external sibling nodes (1,14)
`and (1,15) are compared and the lower one promoted to the
`parent node (step 3 in FIG. 5). In this case, the new node
`(1,14) contains the lower key and it is promoted to position
`5. The walk up the tree continues until there is no change;
`that is, the record that gets promoted is the same as the one
`already stored in the parent node (step 4 in FIG. 5). In this
`case, the node (1,14) is promoted all the way to the root node
`at position 1. However, in other cases, the traversal may not
`continue all the way to the root.
`FIGS. 6a and 6b show the output process in which a node
`is deleted from the selection tree. The node with the lowest
`key is always the one deleted. Using the tree in FIG. 3b,
`suppose the output process deletes the external node (1,15)
`from the selection tree 30. First, the target node (1,15) is
`exchanged with the last node (1,30) of the tree 30 and the
`home node fields of both records are updated (step 1 in FIG.
`6a).
`The two tree paths affected by the exchange are then fixed
`up (step 2 in FIG. 6a). As a result of the exchange, the last
`node will always contain the lowest key. Accordingly, to fix
`up the path from the last node, every internal node on the
`path to the root is updated to point to this record without the
`need for key comparisons. The other path affected by the
`exchange is updated in the normal way by comparing keys.
`In this case, the internal node (1,15) at position 4 has a lower
`key value than the external node (1,30) at position 5, and
`hence the traversal stops here and does not continue all the
`way to the root. The result is a valid selection tree with the
`lowest run-key pair (1,15) as the last external node of the
`tree.
`Following the path update, the target node (1,15) in the
`last position is deleted from array 40 (step 3 in FIG. 6b). This
`deletion causes external node (1,32) to move to its parent
`node at position 4 (step 4 in FIG. 6b). The path from there
`is then fixed up all the way to the root (step 5 in FIG. 6b),
`causing node (1,22) to be promoted to the top.
`Memory Management
`When records are of variable length, managing the space
`reserved for storing records becomes an issue. In an article
`by Per-?ke Larson and Goetz Graefe, entitled “Memory
`Management during Run Generation in External Sorting”,
`SIGMOD, 1998: 472–483, the authors showed that a version
`of best-fit allocation solves this problem efficiently, resulting
`in a memory utilization of 90% is or better. The basic idea
`of best-fit allocation is to always allocate space from the
`smallest free block large enough for the record being stored.
`If the selected free block is larger than required, the unused
`space is returned to the pool as a smaller free block.
`Immediate coalescing of adjacent free blocks and the use of
`boundary tags is recommended for efficiency. Best-fit allo
`cation depends on being able to locate the smallest block
`larger than a given size efficiently. One technique is to store
`a collection of free blocks in a (balanced or unbalanced)
`binary tree with the block size as the key. If the tree is kept
`balanced (e.g., an AVL-tree), the best-fitting free block can
`always be located in logarithmic time. An alternative tech
`nique is to maintain multiple free lists segregated by block
`size. For example, we may use 256 free list and keep all free
`blocks of size 16 bytes or less on list 1, all free blocks
`between 17 and 32 bytes on list 2, and so on. In practice this
`technique is much faster and, consequently, is the preferred
`implementation.
`Drawbacks of Classical Replacement Selection
`Modern CPUs rely heavily on caches to hide memory
`latency and increase overall performance. As a result, it is
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`US 6,571,244 B1
`
`6
`increasingly important to design algorithms that generate
`few cache misses. Unfortunately, the classical replacement
`..selection algorithm has poor cache behavior when the
`number of records in memory is large. The main loop of the
`algorithm consists of traversing a path of the selection tree
`from a leaf node to the root, while comparing sort keys of
`each sibling pair. Which path is traversed is unrelated to
`previously used paths. The nodes in the top part of the tree,
`and their associated sort keys, are touched frequently and are
`likely to remain in the cache but not the ones lower down in
`the tree.
`Accordingly, there is a need for a cache-conscious version
`of replacement selection, which reduces the number of cache
`misses significantly.
`
`SUMMARY
`This invention concerns large scale sorting technology for
`forming runs of data records using an improved cache
`friendly replacement selection process that reduces cache
`misses.
`In one implementation, a database system has a memory
`subsystem with persistent data storage (e.g., disks, RAID,
`etc.) and a main memory. The system also has at least one
`processing unit with a processor and cache memory (e.g., L1
`and L2 caches). A selection tree is formed in the main
`memory as an array of pointers to records that are being
`sorted. The system also has a mini-run assembly data
`structure stored in the main memory to create mini-runs of
`sorted records prior to adding the records to the selection
`tree.
`The system runs a large-scale sorting program having a
`batched replacement selection module for forming runs of
`sorted data records. The batched replacement selection mod
`ule reads multiple records from the persistent data storage
`into main memory, adds them to the mini-run assembly data
`structure and, whenever the mini-run assembly structure
`becomes full, sorts its records to form a mini-run of multiple
`sorted data records. As an example, a mini-run might consist
`of 500 records sorted according to their key values. After
`formation, the mini-run is added to the selection tree by
`inserting a pointer to a first record in the mini-run into the
`array of pointers. The first record is linked to remaining
`records in the mini-run.
`The batched replacement selection module selects a
`record for output from a node in the selection tree. If the
`mini-run associated with the node is not empty, the module
`replaces the selected record with a next record from the
`mini-run. Once the mini-run becomes empty, the module
`deletes the node. In this manner, the selection tree grows and
`shrinks as mini-runs are added and deleted.
`When main memory becomes full, the batched replace
`ment selection module switches to output mode. It selects
`some number of records for output using the selection tree
`and copies the selected records into an output buffer in main
`memory. Whenever the output buffer becomes full, the
`module writes contents of the buffer back to the persistent
`data storage at the end of the current run. Exactly how long
`the output process runs is a matter of policy. Two possible
`policies are: stop after a predetermined number of records
`have been output or stop after a predetermined number of
`output buffers have been filled.
`The combination of using pre-sorted mini-runs as tree
`nodes and batch processing input/output records signifi
`cantly reduces the number of cache misses (particularly, L2
`cache misses).
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is a diagrammatic illustration of a database system
`implementing a conventional load-sort-store process for run
`formation in large-scale mergesort operations.
`
`Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
`Ex. 1019, p. 16
`
`
`
`US 6,571,244 B1
`
`7
`FIG. 2 is a diagrammatic illustration of a database system
`implementing a conventional replacement selection process
`for run formation in large-scale mergesort operations.
`FIGS. 3a and 3b show a selection tree before and after a
`selection replacement operation is performed on a leaf node.
`FIG. 4 is a diagrammatic illustration of a data array used
`to implement the selection tree of FIG. 3a.
`FIG. 5 is a diagrammatic illustration of a selection tree
`and illustrates the addition of a node to the selection tree.
`FIGS. 6a and 6b are diagrammatic illustrations of a
`selection tree and illustrate the deletion of a node from the
`selection tree.
`FIG. 7 shows a database storage system that implements
`a batched replacement selection process.
`FIG. 8 is a diagrammatic illustration of a selection tree
`with mini-runs attached at the external leaf nodes.
`FIG. 9 is a diagrammatic illustration of a data structure
`used to assemble mini-runs.
`FIG. 10 is a diagrammatic, illustration of a data array used
`to implement the selection tree of FIG. 9.
`FIG. 11 is a flow diagram for a batched replacement
`selection process.
`FIG. 12 shows the overall performance of the four
`methods—Quicksort, Alphasort, classic replacement
`selection, and batched replacement selection—as a function
`of memory size.
`FIG. 13 shows results of experiments on four methods—
`Quicksort, Alphasort, classic replacement selection, and
`batched replacement selection—run with a single disk.
`FIG. 14 shows results of experiments on four methods—
`Quicksort, Alphasort, classic replacement selection, and
`batched replacement selection—run with two disks.
`DETAILED DESCRIPTION
`This invention concerns a cache-conscious replacement
`selection methodology that employs local batch processing
`to improve performance on modern-day CPUs. For discus
`sion purposes, the invention is described in the context of a
`uni-processor system although the invention may be
`employed in a shared-memory multiprocessor system
`(SMP).
`
`System
`FIG. 7 shows a database storage system 50 implemented
`on a computer having a processor 52 coupled to a memory
`subsystem 54. A system bus (or other interconnection
`network) 56 interconnects processor 52 and memory sub
`system 54.
`The processor 52 has a chip 60 that holds a CPU (central
`processing unit) 62 and a first level (or L1) cache 64. The
`processor 52 also has a second level (or L2) cache 68 that
`may be on the same or a different chip (as represented by
`chip 66). The L1 cache 64 is considerably smaller than the
`L2 cache 68. For example, the L1 cache 64 typically ranges
`in size from 16 Kbytes to 256 Kbytes, whereas the L2 cache
`68 typically varies from 128 Kbytes up to 4 Mbytes. Due to
`its proximity to the CPU 62, however, the L1 cache 64 has
`a significantly lower access time in comparison to the L2
`cache 68. Typically, the access time to remote L2 cache 68
`is five times longer than the access time to L1 cache 64.
`The memory subsystem 54 has a persistent storage 70 that
`provides the bulk of the data storage. The persistent storage
`