(12) United States Patent
Larson

(10) Patent No.: US 6,571,244 B1
(45) Date of Patent: May 27, 2003
(54) RUN FORMATION IN LARGE SCALE SORTING USING BATCHED REPLACEMENT SELECTION

(75) Inventor: Per-Ake Larson, Redmond, WA (US)
(73) Assignee: Microsoft Corporation, Redmond, WA (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/437,714
(22) Filed: Oct. 28, 1999
(51) Int. Cl.: G06F 7/00
(52) U.S. Cl.: 707/7; 707/100; 707/103 R; 707/103 Y
(58) Field of Search: 707/7, 100, 103 R, 103 Y
(56) References Cited

U.S. PATENT DOCUMENTS

5,287,494 A * 2/1994 Garcia et al. ... 707/7
5,852,826 A * 12/1998 Graunke et al. ... 707/7

OTHER PUBLICATIONS

Dinsmore, "Longer Strings from Sorting," Communications of the ACM, vol. 8, no. 1, Jan. 1965, p. 48.
Frazer et al., "Sorting by Natural Selection," Communications of the ACM, vol. 15, no. 10, Oct. 1972, pp. 910-913.
LaMarca et al., "The Influence of Caches on the Performance of Sorting," Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 1997, pp. 370-379.
Larson et al., "Memory Management during Run Generation in External Sorting," ACM, 1998, pp. 472-483.
Nyberg et al., "AlphaSort: A RISC Machine Sort," SIGMOD, 1994, pp. 233-242.
Ting et al., "Multiway replacement selection sort with dynamic reservoir," The Computer Journal, vol. 20, no. 4, Aug. 1977, pp. 298-301.
Knuth, "The Art of Computer Programming, Volume 3: Sorting and Searching," Second Edition, Addison-Wesley, 1998, pp. i-xiii and 1-780.

* cited by examiner

Primary Examiner: Kim Vu
Assistant Examiner: Hung Pham
(74) Attorney, Agent, or Firm: Lee & Hayes, PLLC
(57) ABSTRACT

A large-scale sorting process utilizes a batched replacement selection method to form runs of sorted data records. The batched replacement selection method involves reading multiple records from a persistent data storage into main memory and sorting the multiple records to form a mini-run of multiple sorted data records. After formation, the mini-run is added to a selection tree by inserting a pointer to a first record in the mini-run into the array of pointers. The first record is linked to remaining records in the mini-run. As records are selected for output from the selection tree, the methodology replaces the selected record with a next record in the associated mini-run (if not empty) or alternatively deletes the node if the mini-run is empty. The selected records are collected into an output buffer. When the number of records reaches a pre-determined number, the selected records are written in batch back to the persistent data storage.

44 Claims, 12 Drawing Sheets
[Front-page drawing (FIG. 11 flow diagram): Batch Input (read record from data storage, step 200; copy record into memory and add to mini-run, step 206; switch to output mode, step 204), Mini-Run Creation (determine run numbers, step 210; add mini-run to selection tree), Tree Processing (select record for output; delete node, step 222), and Batch Output (add selected record to output buffer, step 224; write M records to data storage, step 228; switch to input mode, step 232).]
Petitioners SK hynix Inc., SK hynix America Inc. and SK hynix memory solutions Inc.
Ex. 1019, p. 1
[Sheet 1 of 12: FIG. 1, the load-sort-store process on a system with main memory 22 and data storage 24 (Step 1: read subset of records; Step 2: sort records to form run; Step 3: write run); FIG. 2, replacement selection (Step 1: select record; Step 2: output record; Step 3: get new record).]
[Sheet 2 of 12: FIGS. 3a and 3b, the selection tree 30 before and after replacement selection, with new pair (1,30).]
[Sheet 3 of 12: FIG. 4, the selection tree array (internal and external nodes, sibling pairs sharing a cache line) and the record data structure with record key, other record fields, run number field (4 bytes), and home node field (4 bytes).]
[Sheet 4 of 12: FIG. 5, adding node (1,14) to the selection tree 30 (steps 1 and 2).]
[Sheet 5 of 12: FIGS. 6a and 6b, deleting node (1,15) from the selection tree.]
[Sheet 6 of 12: FIG. 7, the database storage system with microprocessor, batched replacement selection module, memory subsystem 54, and stable storage holding records.]
[Sheet 7 of 12: FIGS. 8 and 9, a selection tree with mini-runs of run-key pairs attached at the external nodes, and a mini-run assembly structure with a record counter 96.]
[Sheet 8 of 12: FIG. 10, the selection tree array (internal and external nodes, sibling pairs sharing a cache line) and the record data structure 104 with record key, run number, home, and next fields, with mini-runs of run-key pairs linked from the external nodes.]
[Sheet 10 of 12: FIG. 12, overall performance of Quicksort, AlphaSort, classic replacement selection, and batched replacement selection.]
[Sheet 11 of 12: FIG. 13, results of experiments run with a single disk.]
RUN FORMATION IN LARGE SCALE SORTING USING BATCHED REPLACEMENT SELECTION

TECHNICAL FIELD

This invention relates to systems and methods for large-scale sorting of data records. More particularly, this invention relates to run formation techniques for producing sorted subsets or "runs" of data records that will be subsequently merged together to produce the final sorted set of data records.
BACKGROUND

Database systems store data records on disk arrays or other forms of non-volatile memory. Operations performed on databases can be easily accommodated if all of the data records are first loaded into volatile main memory (i.e., RAM) for processing. However, databases often contain large numbers of data records, which far surpass the volatile memory resources. As a result, subsets of data records are processed in the main memory and then written back to disk after they are processed.

Large scale sorting is one type of operation performed by a database system on a large set of data records that is too large to fit in main memory. "Mergesort" is the standard technique used for sorting large sets of records. Mergesort has two phases: (1) a run formation phase that creates sorted subsets, called "runs", and (2) a merge phase that repeatedly merges runs into larger and larger runs, until a single run has been created. This invention pertains to the run formation phase of a mergesort operation.

In general, there are two types of run formation techniques: load-sort-store and replacement selection. Most sort implementations use the load-sort-store algorithm for run formation. This invention is directed to an improved replacement selection technique. Load-sort-store is briefly described below, followed by a more detailed look at replacement selection.

Run Formation Type I: Load-Sort-Store

FIG. 1 shows the load-sort-store process implemented on a database system 20 having main memory 22 and persistent data storage 24. As the name implies, the load-sort-store process has three main steps: loading, sorting, and storing. During a "load" step 1, a subset of records is read from the data storage 24 into the main memory 22. Next, during a "sort" step 2, the process extracts pointers to all records into an array and sorts the entries in the array according to a designated sort key to form the run. Any in-memory sorting algorithm can be used, with "Quicksort" being the most popular choice. Afterwards, during a "store" step 3, the run is written back to the data storage 24 and all records in main memory 22 are erased.

These three steps are repeated until all records have been processed. All runs created will be of the same length, except possibly the last run. After multiple runs have been created and stored during the run formation phase, the sorting process transitions to the merge phase in which the runs are merged into one large run.

One drawback with the load-sort-store process is that CPU processing and I/O (input/output) cannot be overlapped. As a result, the computing resources of the database system are not fully utilized.
Run Formation Type II: Replacement Selection

Replacement selection is an alternative technique that produces longer runs and completely overlaps CPU processing and I/O. It is based on the observation that by tracking the highest key output thus far, the process can easily determine whether an incoming record can still be made part of the current run or should be deferred to the next run. Any record tacked on to the current run in this way increases the run length. Adding a record to the current run is expensive if a (sorted) pointer array is used, so replacement selection often uses a (binary) selection tree instead.

FIG. 2 shows a replacement selection process implemented in the database system 20. The process described below is the traditional version of the replacement selection algorithm as described, for example, by Donald Knuth in "The Art of Computer Programming, Volume 3: Sorting and Searching", Second Edition, 1998. This version of the algorithm assumes that records are of fixed length but, in practice, variable-length records are more common.

Initially, a subset of records is read into the main memory 22. When records are of fixed length, the main loop of the algorithm has three steps: selection, output, and get new record. In the "selection" step 1, the process selects, from among the records in memory, a record with the lowest key greater than or equal to the last key output. Then, at the "output" step 2, the selected record is written out to the data storage 24 and the key value is remembered. Afterwards, at the "get" step 3, the process retrieves a new record from the data storage 24 and stores it in the slot previously occupied by the record just output.

While load-sort-store is more widely used, replacement selection has many benefits. First, replacement selection produces runs larger than the memory used, which reduces the number of runs and the subsequent merge effort. Second, replacement selection exhibits steady I/O behavior during run generation rather than I/O in great bursts. This improves the utilization of I/O devices. Third, it very effectively exploits pre-sorting, i.e., input sequences that are not random but somewhat correlated to the desired sorted output sequence, to produce longer runs. In particular, if the input is already sorted, a single run will be generated. Fourth, sorting is often done as part of grouping with aggregation or for duplicate removal. A technique called early aggregation can then be applied, which reduces I/O significantly. Early aggregation achieves a much higher I/O reduction if runs are created by replacement selection.

The following sections examine the selection step of replacement selection using a binary selection tree. The first section considers a tree formed with fixed-length records, and a subsequent section examines the tree with variable-length records.
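The three-step main loop of FIG. 2 can be sketched directly, using a plain linear scan for the selection step (the sections that follow replace this scan with a selection tree). Everything here is an illustrative assumption: keys stand in for whole records, and `get_new_record` is a hypothetical input callback.

```python
def replacement_selection(records, get_new_record):
    """Classic replacement selection over a fixed pool of memory slots.

    records: keys initially filling memory. get_new_record() returns the
    next input key, or None when input is exhausted. Yields (run, key)
    pairs; within each run the yielded keys are non-decreasing.
    """
    memory = list(records)
    run, last_key = 1, None
    while memory:
        # Step 1: select the lowest key >= the last key output, if any.
        eligible = [k for k in memory if last_key is None or k >= last_key]
        if not eligible:
            run, last_key = run + 1, None   # current run ends; start the next
            continue
        key = min(eligible)
        # Step 2: output the selected record and remember its key value.
        yield (run, key)
        last_key = key
        # Step 3: fetch a new record into the vacated slot.
        memory.remove(key)
        new = get_new_record()
        if new is not None:
            memory.append(new)
```

With three memory slots and the input 5, 3, 8, 1, 9, 2, 7, the first run already holds four records, illustrating how deferring small keys to the next run stretches runs beyond memory size.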
Selection Tree for Fixed Length Records

The selection step is the most expensive part of the replacement selection operation. The selection step can be performed efficiently with the help of a selection tree, using the run number and the record key as the (compound) selection key. A selection tree for N records is a left-complete binary tree with N external nodes and N-1 internal nodes stored in an array without pointers. Logically, each external node or tree "leaf" stores a data record with a combined sort key consisting of a run number and the original sort key of the record. The combined sort key is here called a run-key pair and represented by the notation "run number, key", such as "1,12". Each internal node stores the lesser of the sort keys of its two children nodes plus a reference to the source node of the key. Physically, the selection tree and nodes can be implemented in several different ways.

FIGS. 3a and 3b show a selection tree 30 before and after a selection replacement operation for replacing the lowest key on the first run (i.e., key 12 for run 1). FIG. 3a shows the initial state of the selection tree 30 with five keys (i.e., 5, 12, 15, 22, and 32). The key of the last record output was 8. External nodes or tree "leaves" are drawn as rectangles and internal tree nodes are shown as ellipses. The number to the left of a node indicates its position in the array.

Initially, the record represented by external node (1,12) is selected and output. Its key value "12" is now the last key value output, and this value is recorded. Outputting the record frees up the external node at position 9. A new record is input and stored in this node. The key of the new record is 30, which is higher than the last key value output (12), so the new record "fits" in the current run. Its run number is set to 1, resulting in the pair (1,30) shown in position 9. Next, a new minimum key is found by traversing the path from the external node at position 9 up to the root at position 1. At each node, the sort keys of two sibling nodes are compared and the node with the lower sort key is promoted to the parent node. For example, for external nodes (1,32) and (1,30) at positions 8 and 9, the key (1,30) at position 9 is less than the key (1,32) and is promoted to the parent node at position 4. Again, comparison of the internal nodes at positions 4 and 5 reveals that node (1,15) is less than (1,30) and is therefore promoted to the parent node at position 2. The number of comparisons is always either floor(log2(N)) or ceil(log2(N)).

FIG. 4 shows one implementation of the selection tree 30 stored in an array 40, although the selection tree can be implemented in several different ways. A node consists only of a pointer 41 to a corresponding record in a record data structure 42, in order to keep the tree itself of minimal size. The beginning of the array is offset so that each pair of sibling nodes occupies a common cache line. The type of a node is determined by its position in the array, so the type need not be explicitly marked. Here, the internal nodes are at the top of the array and the external nodes are at the bottom. There are approximately twice as many nodes as records (i.e., 2N-1 nodes to N records), so the selection tree array adds approximately eight bytes of overhead per record.

A record slot 43 in the record data structure 42 contains a record consisting of a record key 44 and possibly some other fields 45. Each record slot also has two additional fields needed by the run formation algorithm, namely, a run number field 46 (e.g., 4 bytes) and a home node field 48 recording the position in the selection tree array of the external node that owns the record (e.g., 4 bytes). The last two fields 46 and 48 add another eight bytes of overhead per record, bringing the total overhead to sixteen bytes per record. The home node field 48 tracks the external node occupied by a record so that, when a record has been output, the new record can be placed in that tree node. Knuth's version of the algorithm handles this problem by adding a level of indirection: an internal node points to the external node owning the record, which in turn points to the actual record. The indirection increases cache misses during traversal because, to locate a record, the external node must first be accessed.

When records are of fixed length, the home node field can actually be eliminated: a record's home node can be deduced from which record slot it occupies. However, this is not possible when records are of variable length.

Knuth's version uses two additional refinements: storing the loser instead of the winner in each internal node and packing one internal node and one external node together into a combined physical node. These changes make the algorithm harder to understand, but have little effect on its performance.
When records are of fixed length and the amount of memory available is known before run formation starts, the size of the selection tree can be determined before run formation begins. The tree is pre-allocated and filled with a fictitious run zero that is never output. Processing of an incoming record then consists of the following steps:

1. Record the key value and run number of the top record.
2. Copy the top record to an output buffer (provided it is not from the fictitious run zero).
3. If the output buffer is full, write it to the run file.
4. Copy the incoming record into the vacated slot in memory.
5. Determine its run number by comparing it with the recorded key value and set the run number field in the record.
6. Copy a pointer to the new record into the appropriate node in the selection tree. Call this node T.
7. Set the home node field in the record to point to T.
8. Fix up the path from node T to the root in the selection tree.

When there are no more input records, the memory is once again filled with fictitious records that are never output. The only operation performed is replacement, consisting mainly of copying two records and traversing a path from a leaf to the root.
Selection Tree for Variable Length Records

The above discussion of tree selection assumes that the records are of fixed length. When records are of variable length, two complications arise. First, managing the space reserved for storing records becomes more complex. Second, records are no longer just replaced in the selection tree. A record may be output and deleted from the tree without being replaced (if there is no free slot large enough for the replacement record). Similarly, records may be input and added to the tree without outputting any existing records. Consequently, the tree is no longer of constant size.

In the case of variable length records, it is better to view replacement selection as consisting of two processes: (1) an input process that fills memory with new records and adds them to the selection tree, and (2) an output process that repeatedly deletes the top record from the tree and outputs it. The input process drives the processing. Whenever it fails to find memory space for an incoming record, it resumes the output process, which runs until it has created a free slot at least as large as needed by the new record. To purge all records when reaching the end of input, the input process requests space for an infinitely large record.

FIG. 5 illustrates the input process in which a new node (1,14) is added to the selection tree 30 from FIG. 3b. The new node (1,14) is added to the end of the tree after the last external node, as represented by insertion of node (1,14) into position 10 in the tree 30 (step 1 in FIG. 5). Because the tree is complete, its parent is always the first external node, which in this case is node (1,15) at position 5 (see FIG. 3b). The content of the parent node is copied into the next free node to become the right sibling node of the new element, as represented by new node (1,15) in position 11 (step 2 in FIG. 5).

A different external node now owns the record, so the home node field in the record has to be updated from position 5 to position 11. In addition, the parent node changes its role from an external node to an internal node. The sort keys of the two new external sibling nodes (1,14) and (1,15) are compared and the lower one is promoted to the parent node (step 3 in FIG. 5). In this case, the new node (1,14) contains the lower key and it is promoted to position 5. The walk up the tree continues until there is no change; that is, until the record that gets promoted is the same as the one already stored in the parent node (step 4 in FIG. 5). In this case, the node (1,14) is promoted all the way to the root node at position 1. However, in other cases, the traversal may not continue all the way to the root.

FIGS. 6a and 6b show the output process in which a node is deleted from the selection tree. The node with the lowest key is always the one deleted. Using the tree in FIG. 3b, suppose the output process deletes the external node (1,15) from the selection tree 30. First, the target node (1,15) is exchanged with the last node (1,30) of the tree 30 and the home node fields of both records are updated (step 1 in FIG. 6a).

The two tree paths affected by the exchange are then fixed up (step 2 in FIG. 6a). As a result of the exchange, the last node will always contain the lowest key. Accordingly, to fix up the path from the last node, every internal node on the path to the root is updated to point to this record without the need for key comparisons. The other path affected by the exchange is updated in the normal way by comparing keys. In this case, the internal node (1,15) at position 4 has a lower key value than the external node (1,30) at position 5, and hence the traversal stops here and does not continue all the way to the root. The result is a valid selection tree with the lowest run-key pair (1,15) as the last external node of the tree.

Following the path update, the target node (1,15) in the last position is deleted from array 40 (step 3 in FIG. 6b). This deletion causes external node (1,32) to move to its parent node at position 4 (step 4 in FIG. 6b). The path from there is then fixed up all the way to the root (step 5 in FIG. 6b), causing node (1,22) to be promoted to the top.
Memory Management

When records are of variable length, managing the space reserved for storing records becomes an issue. In an article by Per-Åke Larson and Goetz Graefe, entitled "Memory Management during Run Generation in External Sorting", SIGMOD, 1998: 472-483, the authors showed that a version of best-fit allocation solves this problem efficiently, resulting in a memory utilization of 90% or better. The basic idea of best-fit allocation is to always allocate space from the smallest free block large enough for the record being stored. If the selected free block is larger than required, the unused space is returned to the pool as a smaller free block. Immediate coalescing of adjacent free blocks and the use of boundary tags are recommended for efficiency. Best-fit allocation depends on being able to locate the smallest block larger than a given size efficiently. One technique is to store the collection of free blocks in a (balanced or unbalanced) binary tree with the block size as the key. If the tree is kept balanced (e.g., an AVL-tree), the best-fitting free block can always be located in logarithmic time. An alternative technique is to maintain multiple free lists segregated by block size. For example, we may use 256 free lists and keep all free blocks of size 16 bytes or less on list 1, all free blocks between 17 and 32 bytes on list 2, and so on. In practice this technique is much faster and, consequently, is the preferred implementation.
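The segregated-free-list idea can be sketched as follows. The class name, the list count, and the 16-byte size classes mirror the example in the text, but everything else (the interface, returning whole blocks rather than splitting them, omitting coalescing and boundary tags) is a simplifying assumption.

```python
class SegregatedFreeLists:
    """Free-space bookkeeping with free lists segregated by block size:
    256 lists, where list 1 holds blocks of 1-16 bytes, list 2 holds
    17-32 bytes, and so on. Lists are 0-indexed internally."""

    GRANULE, NLISTS = 16, 256

    def __init__(self):
        self.lists = [[] for _ in range(self.NLISTS)]

    def _index(self, size):
        # 1-16 bytes -> index 0 (the text's "list 1"), 17-32 -> index 1, ...
        return min((size - 1) // self.GRANULE, self.NLISTS - 1)

    def free(self, addr, size):
        """Return a block to the pool. (A real allocator would first try
        to coalesce it with adjacent free blocks.)"""
        self.lists[self._index(size)].append((addr, size))

    def allocate(self, size):
        """Return (addr, blksize) of a free block holding `size` bytes,
        or None. Scans upward from the first class that might fit; a
        real allocator would split the block and return the remainder."""
        for i in range(self._index(size), self.NLISTS):
            for j, (addr, blksize) in enumerate(self.lists[i]):
                if blksize >= size:
                    del self.lists[i][j]
                    return (addr, blksize)
        return None
```

Each allocation touches at most a few short lists instead of rebalancing a tree, which is why the text reports this scheme as much faster in practice than an AVL-tree of free blocks.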
Drawbacks of Classical Replacement Selection

Modern CPUs rely heavily on caches to hide memory latency and increase overall performance. As a result, it is increasingly important to design algorithms that generate few cache misses. Unfortunately, the classical replacement selection algorithm has poor cache behavior when the number of records in memory is large. The main loop of the algorithm consists of traversing a path of the selection tree from a leaf node to the root, while comparing sort keys of each sibling pair. Which path is traversed is unrelated to previously used paths. The nodes in the top part of the tree, and their associated sort keys, are touched frequently and are likely to remain in the cache, but not the ones lower down in the tree.

Accordingly, there is a need for a cache-conscious version of replacement selection, which reduces the number of cache misses significantly.
SUMMARY

This invention concerns large scale sorting technology for forming runs of data records using an improved cache-friendly replacement selection process that reduces cache misses.

In one implementation, a database system has a memory subsystem with persistent data storage (e.g., disks, RAID, etc.) and a main memory. The system also has at least one processing unit with a processor and cache memory (e.g., L1 and L2 caches). A selection tree is formed in the main memory as an array of pointers to records that are being sorted. The system also has a mini-run assembly data structure stored in the main memory to create mini-runs of sorted records prior to adding the records to the selection tree.

The system runs a large-scale sorting program having a batched replacement selection module for forming runs of sorted data records. The batched replacement selection module reads multiple records from the persistent data storage into main memory, adds them to the mini-run assembly data structure and, whenever the mini-run assembly structure becomes full, sorts its records to form a mini-run of multiple sorted data records. As an example, a mini-run might consist of 500 records sorted according to their key values. After formation, the mini-run is added to the selection tree by inserting a pointer to a first record in the mini-run into the array of pointers. The first record is linked to remaining records in the mini-run.

The batched replacement selection module selects a record for output from a node in the selection tree. If the mini-run associated with the node is not empty, the module replaces the selected record with a next record from the mini-run. Once the mini-run becomes empty, the module deletes the node. In this manner, the selection tree grows and shrinks as mini-runs are added and deleted.

When main memory becomes full, the batched replacement selection module switches to output mode. It selects some number of records for output using the selection tree and copies the selected records into an output buffer in main memory. Whenever the output buffer becomes full, the module writes the contents of the buffer back to the persistent data storage at the end of the current run. Exactly how long the output process runs is a matter of policy. Two possible policies are: stop after a predetermined number of records have been output, or stop after a predetermined number of output buffers have been filled.

The combination of using pre-sorted mini-runs as tree nodes and batch processing of input/output records significantly reduces the number of cache misses (particularly, L2 cache misses).
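The mini-run mechanism described in the summary can be sketched as follows. This is a deliberately simplified illustration: a heap of mini-run heads stands in for the patent's selection tree, run-number handling and the input/output mode switching are omitted, and the function name, the `minirun_size` and `batch_out` parameters, and the in-memory "flush" are all assumptions introduced here.

```python
import heapq

def batched_replacement_selection(input_keys, minirun_size, batch_out):
    """Gather input into sorted mini-runs; each mini-run enters the
    selection structure as one node (its first record linked to the
    rest); selected records are flushed in batches of `batch_out`.
    Returns the list of flushed output batches."""
    it = iter(input_keys)
    heap, output, flushed = [], [], []
    while True:
        batch = [k for _, k in zip(range(minirun_size), it)]
        if not batch:
            break
        batch.sort()                              # form one mini-run
        heapq.heappush(heap, (batch[0], batch))   # node = first record
    while heap:
        key, run = heapq.heappop(heap)            # select record for output
        output.append(key)
        if len(output) >= batch_out:              # write records in batch
            flushed.append(output)
            output = []
        rest = run[1:]
        if rest:                                  # replace selected record
            heapq.heappush(heap, (rest[0], rest)) # with next in the mini-run
        # else: the mini-run is empty, so the node is deleted
    if output:
        flushed.append(output)
    return flushed
```

Because comparisons now happen only between mini-run heads, and records enter and leave in sorted batches, far fewer tree levels are touched per record, which is the source of the cache-miss reduction claimed above.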
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagrammatic illustration of a database system implementing a conventional load-sort-store process for run formation in large-scale mergesort operations.

FIG. 2 is a diagrammatic illustration of a database system implementing a conventional replacement selection process for run formation in large-scale mergesort operations.

FIGS. 3a and 3b show a selection tree before and after a selection replacement operation is performed on a leaf node.

FIG. 4 is a diagrammatic illustration of a data array used to implement the selection tree of FIG. 3a.

FIG. 5 is a diagrammatic illustration of a selection tree and illustrates the addition of a node to the selection tree.

FIGS. 6a and 6b are diagrammatic illustrations of a selection tree and illustrate the deletion of a node from the selection tree.

FIG. 7 shows a database storage system that implements a batched replacement selection process.

FIG. 8 is a diagrammatic illustration of a selection tree with mini-runs attached at the external leaf nodes.

FIG. 9 is a diagrammatic illustration of a data structure used to assemble mini-runs.

FIG. 10 is a diagrammatic illustration of a data array used to implement the selection tree of FIG. 9.

FIG. 11 is a flow diagram for a batched replacement selection process.

FIG. 12 shows the overall performance of the four methods (Quicksort, AlphaSort, classic replacement selection, and batched replacement selection) as a function of memory size.

FIG. 13 shows results of experiments on the four methods run with a single disk.

FIG. 14 shows results of experiments on the four methods run with two disks.

DETAILED DESCRIPTION

This invention concerns a cache-conscious replacement selection methodology that employs local batch processing to improve performance on modern-day CPUs. For discussion purposes, the invention is described in the context of a uni-processor system, although the invention may be employed in a shared-memory multiprocessor (SMP) system.

System

FIG. 7 shows a database storage system 50 implemented on a computer having a processor 52 coupled to a memory subsystem 54. A system bus (or other interconnection network) 56 interconnects processor 52 and memory subsystem 54.

The processor 52 has a chip 60 that holds a CPU (central processing unit) 62 and a first level (or L1) cache 64. The processor 52 also has a second level (or L2) cache 68 that may be on the same or a different chip (as represented by chip 66). The L1 cache 64 is considerably smaller than the L2 cache 68. For example, the L1 cache 64 typically ranges in size from 16 Kbytes to 256 Kbytes, whereas the L2 cache 68 typically varies from 128 Kbytes up to 4 Mbytes. Due to its proximity to the CPU 62, however, the L1 cache 64 has a significantly lower access time in comparison to the L2 cache 68. Typically, the access time to the remote L2 cache 68 is five times longer than the access time to the L1 cache 64.

The memory subsystem 54 has a persistent storage 70 that provides the bulk of the data storage. The persistent storage

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket