US008190610B2
`
(12) United States Patent
Dasdan et al.
(10) Patent No.: US 8,190,610 B2
(45) Date of Patent: May 29, 2012
`
`(54) MAPREDUCE FOR DISTRIBUTED
`DATABASE PROCESSING
`
(75) Inventors: Ali Dasdan, San Jose, CA (US);
Hung-Chih Yang, Sunnyvale, CA (US);
Ruey-Lung Hsiao, Los Angeles, CA (US)
`
`(73) Assignee: Yahoo! Inc., Sunnyvale, CA (US)
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 1105 days.
`
`(21) Appl. No.: 11/539,090
`
(22) Filed: Oct. 5, 2006
(65) Prior Publication Data
US 2008/0086442 A1    Apr. 10, 2008
`
(51) Int. Cl.
G06F 17/30    (2006.01)
(52) U.S. Cl. ........................................ 707/737; 707/968
(58) Field of Classification Search .................. 707/737, 968
See application file for complete search history.
`
(56) References Cited
U.S. PATENT DOCUMENTS
6,158,044 A *  12/2000  Tibbetts ................... 717/100
6,341,289 B1 *  1/2002  Burroughs et al. ........ 707/737
7,620,936 B2 * 11/2009  Ernst et al. ............. 717/108
2004/0225638 A1 * 11/2004  Geiselhart et al. ....... 707/1
2004/0230567 A1 * 11/2004  Wookey .................. 707/3
2006/0117036 A1 *  6/2006  Cruanes et al. .......... 707/100
2007/0038659 A1 *  2/2007  Datar et al. ............ 707/101
2007/0255685 A1 * 11/2007  Boult et al. ............ 707/2
OTHER PUBLICATIONS
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data
Processing on Large Clusters", USENIX Association OSDI '04: 6th
Symposium on Operating Systems Design and Implementation, Dec.
6-8, 2004, pp. 137-149.
* cited by examiner
`Primary Examiner — Khanh Pham
(74) Attorney, Agent, or Firm — Weaver Austin Villeneuve & Sampson LLP
`
(57) ABSTRACT
An input data set is treated as a plurality of grouped sets of
key/value pairs, which enhances the utility of the MapReduce
programming methodology. By utilizing such a grouping,
map processing can be carried out independently on two or
more related but possibly heterogeneous datasets (e.g.,
related by being characterized by a common primary key).
The intermediate results of the map processing (key/value
pairs) for a particular key can be processed together in a single
reduce function by applying a different iterator to intermediate
values for each group. Different iterators can be arranged
inside reduce functions in whatever ways are desired.
`
`46 Claims, 5 Drawing Sheets
`
[Representative drawing: the improved MapReduce architecture of FIG. 4, with map tasks 402 and 404 operating on the Employee table 302 (group E) and the Department table 304 (group D), partitioned intermediate data 406 and 408, and reduce tasks 410 and 412.]
`
[Sheet 1 of 5, FIG. 1: an example of the conventional MapReduce architecture, mapping input data through intermediate data to output data.]
`
[Sheet 2 of 5, FIG. 2: a parallel implementation of the FIG. 1 architecture with map tasks and reduce tasks.]
`
[Sheet 3 of 5, FIG. 3: a join of the Employee table 302 (DeptID, EmpName) with the Department table 304 (DeptID, DeptName) to produce the Employee and Department table 306 (EmpName, DeptName, DeptID).]
`
[Sheet 4 of 5, FIG. 4: the improved MapReduce architecture with input groups, relative to the FIG. 3 example relational tables.]
`
[Sheet 5 of 5, FIG. 5: detailed processing by the improved MapReduce architecture to equijoin the Employee table 302 with the Department table 304, showing the input tables (portion 502), the map functions (504), the intermediate results partitioned into even-keyed and odd-keyed records (506, 508), the reduce functions (510), and the resulting Employee and Department table (512).]
`
`MAPREDUCE FOR DISTRIBUTED
`DATABASE PROCESSING
`
`BACKGROUND
`
MapReduce is a programming methodology to perform
parallel computations over distributed (typically, very large)
data sets. Some theory regarding the MapReduce programming
methodology is described in "MapReduce: Simplified
Data Processing on Large Clusters," by Jeffrey Dean and
Sanjay Ghemawat, appearing in OSDI '04: Sixth Symposium
on Operating System Design and Implementation, San Francisco,
Calif., December, 2004 (hereafter, "Dean and Ghemawat").
A similar, but not identical, presentation is also provided
in HTML form at the following URL: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
(hereafter, "Dean and Ghemawat HTML").
Basically, a "map" function maps key-value pairs to new
(intermediate) key-value pairs. A "reduce" function reduces
all mapped (intermediate) key-value pairs sharing the
same key to a single key-value pair or a list of values. The
"map" and "reduce" functions are typically user-provided.
The map function iterates over a list of independent elements,
performing an operation on each element as specified by the
map function. The map function generates intermediate
results. The reduce operation takes these intermediate results
via a single iterator and combines elements as specified by the
reduce function.
`
`SUMMARY
`
In accordance with an aspect, an input data set is treated as
a plurality of grouped sets of key/value pairs, which enhances
the utility of the MapReduce programming methodology.
Utilizing such grouping, map processing is carried out independently
on two or more related datasets (e.g., related by
each being characterized by a schema with a key in common).
The intermediate results of the map processing (key/value
pairs) for a particular key are processed together in a single
reduce function by applying a different iterator to intermediate
values for each group. Different iterators can be composed
inside reduce functions in whatever ways are desired.
Thus, for example, the enhanced MapReduce programming
methodology may be easily employed to carry out distributed
relational database processing.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`
FIG. 1 illustrates an example of the conventional MapReduce
architecture.
FIG. 2 illustrates a specific parallel implementation of the
FIG. 1 architecture.
FIG. 3 graphically illustrates a join performed over two
relational tables that have different schemas.
FIG. 4 illustrates a particular implementation of an
improved MapReduce architecture that includes consideration
for input groups, relative to the FIG. 3 example relational
tables.
FIG. 5 schematically illustrates, in greater detail, processing
by the improved MapReduce architecture to join the
records (FIG. 3) of the Employee table 302 with the records of
the Department table 304 to generate the Employee and
Department table 306.
`
`50
`
`55
`
`60
`
`DETAILED DESCRIPTION
`
The inventors have realized that, by treating an input data
set as a plurality of grouped sets of key/value pairs, the utility
of the MapReduce programming methodology may be
enhanced. By utilizing such a grouping, map processing can
be carried out independently on two or more related datasets
(e.g., related by being characterized by a common key). The
intermediate results of the map processing (key/value pairs)
for a particular key can be processed together in a single
reduce function by applying a different iterator to intermediate
values for each group.
Before discussing the aspect in which the input data set is
treated as a plurality of grouped sets of key/value pairs, it is
useful to first discuss, as background, the conventional architecture.
The conventional architecture is described, for
example, in Dean and Ghemawat, referred to in the Background.
FIG. 1, illustrating an example of the conventional
architecture, is substantially reproduced from Dean
and Ghemawat HTML (at slide 7) but with reference numerals
added here to aid in the description. In the example, the
keys of the input data key/value pairs comprise keys k1 to k5.
As discussed by Dean and Ghemawat, and as shown in the
FIG. 1 illustration, both the input data set 102 and the output
data set 114 are a set of key/value pairs. The programmer
specifies a map function that processes input key/value pairs
and produces a set of intermediate pairs 106(1) through 106(7).
In the abstract, such a map function is specified as follows:

map(in_key, in_value) -> list(out_key, intermediate_value)
`
In the FIG. 1 illustrative example, the input data set 102 is
partitioned into an arbitrary grouping of seven partitions 102(1)
through 102(7). Each partition is provided to the corresponding
invocation of the map function, which has seven
invocations 104(1) through 104(7).
The group-by-key functionality 108 partitions the intermediate
results by out_key, and the intermediate partitions 110(k1)
to 110(k5) are provided to the corresponding reduce
functions 112(k1) to 112(k5), respectively. Each reduce function
112(k1) to 112(k5) processes intermediate data from one
of the intermediate partitions 110(k1) to 110(k5) to generate
the corresponding output partitions 114(k1) to 114(k5). In the
abstract, such a reduce function is specified as follows:

reduce(out_key, list(intermediate_value)) -> list(out_value)
`
`The reduce function combines all intermediate values for a
`particular key and produces a set of merged output values for
`the key, usually just one.
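The map, group-by-key, and reduce stages described above can be sketched in a few lines of Python. This is our own single-process illustration, not code from any MapReduce implementation; the function names are hypothetical:

```python
from collections import defaultdict

def run_mapreduce(partitions, map_fn, reduce_fn):
    """Single-process sketch of the conventional MapReduce flow:
    map each input partition to intermediate (out_key, value) pairs,
    group the pairs by out_key, then reduce each group."""
    # Map phase: apply the user-provided map function to every input pair.
    intermediate = []
    for partition in partitions:
        for in_key, in_value in partition:
            intermediate.extend(map_fn(in_key, in_value))
    # Group-by-key: collect all intermediate values sharing an out_key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)
    # Reduce phase: one reduce call per distinct intermediate key.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Example: count word occurrences across two input partitions.
partitions = [
    [(1, "a b a")],
    [(2, "b c")],
]
counts = run_mapreduce(
    partitions,
    map_fn=lambda k, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
```

Note that, as in the architecture above, each reduce invocation sees all intermediate values for exactly one key through a single iteration.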
FIG. 2 illustrates a parallel implementation of the FIG. 1
architecture. For simplicity of illustration, only some of the
elements in FIG. 2 are denoted by reference numerals, and it
is clearly evident that others of the elements correspond to
elements in FIG. 1. In the implementation, the map functions
104 are partitioned into a user-configurable number of map
tasks 202a, 202b and 202c (here, three map tasks). Similarly,
the reduce functions are partitioned into a user-configurable
number of reduce tasks 208a and 208b (here, two reduce
tasks). Each map task 202a, 202b, and 202c includes a partitioning
function 206a, 206b and 206c, respectively, that partitions
the intermediate data across the reduce tasks 208a and
208b. Each reduce task 208a and 208b includes a sort and
group-by-key task 210a and 210b, respectively, which can
make the processing of the output data more efficient and/or
convenient.
`
`
`
Although not shown in FIGS. 1 and 2, the partitioned data
can go through a configurable, intermediate step in which a
"combiner" function is called. A combiner function is similar
to a reduce function, with the following difference: a combiner
runs after a mapper and performs a partial merging on each
partition before the partitions are transferred to a reducer. This helps
reduce the network traffic and speeds up the total execution
time.
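The combiner's partial merging can be sketched as follows. This is our own minimal illustration, assuming a sum-style reduce for which partial merging is valid:

```python
from collections import defaultdict

def combine(partition_pairs):
    """Partial merge run after a mapper, on one map task's output,
    before the pairs are transferred to a reducer. For an
    associative and commutative reduce such as a sum, combining
    locally shrinks the data sent over the network without
    changing the final result."""
    merged = defaultdict(int)
    for key, value in partition_pairs:
        merged[key] += value
    return sorted(merged.items())

# One mapper's intermediate output, before the shuffle:
mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(mapper_output)
```

Four pairs shrink to two before transfer, which is exactly the network saving described above.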
Conventional MapReduce implementations enable the use
of massive clusters of commodity computers to provide a
simplified programming and execution model for processing
large sets of data in parallel. In addition, the application is
isolated from messy but important details of running a distributed
program on massive clusters of potentially unreliable
computers. However, the conventional MapReduce implementations
do not have a facility to efficiently process data
from heterogeneous sources.
For example, it is impractical to perform joins over two
relational tables that have different schemas. FIG. 3 graphically
illustrates such a join (without regard for any particular
method or architecture for accomplishing the join).
Referring to FIG. 3, there are two tables, an "Employee" table
302 and a "Department" table 304, on which it may be desirable
to perform a join operation. Each employee record in the
Employee table 302 has key DeptID and value LastName, and
each department record in the Department table 304 has key
DeptID and value DeptName. The result "Employee and Department"
table 306 is a table that shows the department name for
each employee. More specifically, the records of the
Employee table 302 are joined with the records of the Department
table 304 such that each record of the Employee and
Department table 306 includes key DeptID and values LastName
and DeptName.
FIG. 4 illustrates an example of an improved MapReduce
architecture in accordance with an aspect, relative to the
FIG. 3 example relational tables. It is noted that the schema of
each data set, such as the FIG. 3 relational tables, includes a
set of attributes (such as DeptID, LastName, DeptName) and
their properties (such as their data types: integer DeptID,
string LastName, string DeptName). As discussed above with
respect to the conventional MapReduce architecture, input,
intermediate and output data sets may each be characterized
by their own schema, and each schema operates according to
"key/value" pairs. The attributes in the key/value pairs may be
distinct or overlapping; moreover, keys within a data set may
be distinct or identical.
In the improved MapReduce architecture such as discussed
with reference to FIG. 4, the input, intermediate and output
data sets are partitioned into a set of data groups. With the
partitioning into groups, it is likely that map functions corresponding
to each group are different, that data sets within the
same group are characterized by the same schema, and that data
sets within different groups are characterized by different
schemas. It is also possible that the map functions corresponding
to each group are the same and that the data sets within all the
groups are characterized by the same schema.
In general, partitioning the data sets into data groups
enables a mechanism to associate (group) identifiers with
data sets, map functions and iterators (useable within reduce
functions to access intermediate data) and, also, to produce
output data sets with (group) identifiers. It is noted that the
output group identifiers may differ from the input/intermediate
group identifiers.
Referring now to FIG. 4, each of the map tasks 402 and 404
is configured to operate on a separate data group, where each
of the separate data groups is characterized by its own
schema. In FIG. 4, the separate tables are labeled "E" and "D"
to represent the "Employee" table 302 and the "Department"
table 304, respectively.
Referring still to FIG. 4, the records of the intermediate
data E' and D' retain an identification with the groups to which
the original input data, resulting in particular intermediate
data, belong. Thus, the intermediate data E' retains an identification
with group E, and the intermediate data D' retains an
identification with group D. That is, the intermediate data E'
of the map task 402 retains an identification with the
Employee table 302, and the intermediate data D' of the map
task 404 retains an identification with the Department table
304.
`Even the records of the partitioned data 406 and 408, after
`partitioning the records of the intermediate data to the various
`reduce tasks 410 and 412, retain an identification with the
`groups of the original input data (the employee table 302 and
`the department table 304). The reduce tasks 410 and 412 can
`access partitioned intermediate data for both groups 302 and
`304 through one iterator per group, and the reduce tasks 410
`and 412 can be programmed by users to use the iterators
`however desired.
`In accordance with the described aspect, a programmer
`may view data as follows:
`
input data           grouped sets of key/value pairs
intermediate data    grouped sets of key/value pairs
output data          (possibly grouped) sets of key/value pairs or values
`
It is noted that the output data does not have to be grouped or
key/value pairs, but it is useful in some instances, such as
some instances in which MapReduce jobs are to be chained in
a pipeline. It is also noted that there can be only one group.
We now provide some example syntax for map and reduce
functions to employ the improved MapReduce architecture.
As discussed above, a user/programmer typically provides
map and reduce functions. In accordance with the described
aspect, in the abstract, a map function may be specified as
follows:
`
Map(group_id, in_key, in_value) -> (group_id,
    list(out_key, intermediate_value))

Thus, for a particular group identified with a group_id, the
map function processes each input key/value pair and produces
one or more intermediate key/value pairs. In accordance
with the described aspect, in the abstract, a reduce
function may be specified as follows:

Reduce(list(group_id, list(out_key, intermediate_value))) -> list(out_value)

Thus, the reduce function merges all intermediate data over
all groups. Iterators specific to a particular group may be
employed to iterate over that group's intermediate data. One
or more output values are provided.
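A minimal sketch of how the group identifier flows from the generalized map output into per-group iterators for the reduce side is shown below. This is our own illustration, not the patent's implementation; the function name is hypothetical:

```python
from collections import defaultdict

def shuffle_by_key_with_groups(mapped_triples):
    """Group intermediate (group_id, out_key, value) triples first by
    out_key and then by group_id, so that one reduce call per key can
    receive a separate iterator over each group's intermediate values."""
    by_key = defaultdict(lambda: defaultdict(list))
    for group_id, out_key, value in mapped_triples:
        by_key[out_key][group_id].append(value)
    # For each key: a dict mapping group_id -> iterator over that group's values.
    return {key: {g: iter(vals) for g, vals in groups.items()}
            for key, groups in by_key.items()}

mapped = [("emp", 34, "Smith"), ("dept", 34, "Clerical"), ("emp", 34, "Robinson")]
iterators_by_key = shuffle_by_key_with_groups(mapped)
emp_values = list(iterators_by_key[34]["emp"])
dept_values = list(iterators_by_key[34]["dept"])
```

A reduce function invoked for key 34 would then receive one iterator for the "emp" group and another for the "dept" group, as described above.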
In the improved MapReduce, the partitioning operation
after the mapping phase is generalized such that it has a
configurable operation with two modes.
In the partitioning mode 1, the partitioning operation partitions
the intermediate data into R partitions, where R is the
number of reduce tasks. Each key/value pair goes to only one
of the partitions. This is similar to what is done by the conventional
MapReduce.
`
`
`
In the partitioning mode 2, the partitioning operation still partitions
the intermediate data into R partitions, but each key/value pair can go to a
plurality of the partitions. In this mode, it is likely that the
pairs will go to all the partitions.
For example, consider implementing a cross product of the
tables in FIG. 3, in which every record of the Employee table
302 will be paired with every record of the Department table
304. The cross product can be achieved using two nested
loops, one outer loop iterating over the records of the
Employee table and one inner loop iterating over the records
of the Department table. In the improved MapReduce, one
possible implementation can map each employee to only one
reducer but map each department to every reducer. In
another possible implementation, it can map each employee
to only one reducer and not map the departments at all.
Instead, each reducer can read the departments directly from
its local copy, if one exists, or from a copy shared by the other
reducers.
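The first implementation described above can be sketched as follows. This is our own simplified illustration: each employee pair is routed to a single reduce partition (mode 1), while each department pair is replicated to all R partitions (mode 2), which is what makes the nested-loop cross product possible:

```python
def partition(pairs, R):
    """Route (group, key, value) triples to R reduce partitions:
    mode 1 for the 'emp' group (each pair goes to exactly one of the
    R partitions, by key), mode 2 for the 'dept' group (each pair is
    copied to every partition)."""
    partitions = [[] for _ in range(R)]
    for group, key, value in pairs:
        if group == "dept":
            for p in partitions:  # mode 2: broadcast to all reducers
                p.append((group, key, value))
        else:
            partitions[key % R].append((group, key, value))  # mode 1: one reducer
    return partitions

pairs = [("emp", 31, "Rosen"), ("emp", 34, "Smith"), ("dept", 35, "Marketing")]
parts = partition(pairs, R=2)
```

Every reducer then holds all departments alongside its share of the employees, so each reduce task can pair its employees with every department.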
In the improved MapReduce, the operations can collect
information about the data that they are processing. They can
also save the information as metadata of the corresponding
data. For example, the partitioning, grouping, and sorting
operations can keep the smallest and largest keys in each
partition processed. This information can be saved as part of
the metadata of each partition. When an iterator is created to
scan through the records of a partition, this metadata can be
made available to the reduce function through the iterators.
One example benefit is that this metadata can speed up comparisons.
For example, in a processing of the data to find all
records whose keys are smaller than a particular key, an entire
partition can be skipped if its smallest key is larger than the
particular key.
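The key-range metadata idea can be sketched as follows. This is our own illustration; the metadata field names are hypothetical:

```python
def scan_keys_below(partitions, limit):
    """Find all records whose key is smaller than `limit`, skipping any
    partition whose recorded smallest key ('min_key' metadata) is not
    below the limit, so that partition's records are never scanned."""
    hits, scanned_partitions = [], 0
    for meta, records in partitions:
        if meta["min_key"] >= limit:  # whole partition can be skipped
            continue
        scanned_partitions += 1
        hits.extend(r for r in records if r[0] < limit)
    return hits, scanned_partitions

# Each partition carries (metadata, records); keys are sorted per partition.
partitions = [
    ({"min_key": 31, "max_key": 33}, [(31, "Rosen"), (33, "Jones")]),
    ({"min_key": 34, "max_key": 35}, [(34, "Smith"), (35, "Marketing")]),
]
hits, scanned = scan_keys_below(partitions, limit=34)
```

Only the first partition is scanned; the second is skipped on the strength of its metadata alone.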
In some examples, the sorting operation is configurable,
such that it can be turned on or off for some groups. This is in
addition to the configurable sorting function that is also available
in the conventional MapReduce. As such, the overhead to
perform some operations may be reduced where, for
example, sorting is not needed. One example of such a database
operation is determining a cross product.
In the improved MapReduce, the reduce function has a
configurable operation with three modes. They are named in
increasing order of generality as follows: the reduce mode
1, the reduce mode 2, and the reduce mode 3. These modes
improve its capabilities to support the basic relational database
processing operations. In all modes, the data accessible
through the iterators over all groups are made available to
the reduce functions.
In the reduce mode 1, the records input to a reduce function
through the group iterators carry or share the same key. If
there is no record in a group that has the same key, the
corresponding iterator is null. This mode helps implement an
equijoin, a join over equal keys, efficiently. For example, it
can produce the "Employee and Department" table 306 by
equijoining the "Employee" table 302 and the "Department"
table 304.
In the reduce mode 2, the records input to a reduce function
through the group iterators need not carry the same key. The
input records are the ones pointed to by the group iterators
when the reduce function is invoked. The reduce function
controls the manner in which keys, values, and iterators are
used. This mode helps implement a join with an arbitrary
predicate (such as comparisons) because it can implement the
sort-merge join. For example, this mode together with the
partitioning mode 2 can even implement the cross product of
the employees and departments in FIG. 3.
In the reduce mode 3, the records input to a reduce function
through the group iterators need not carry the same key, as in
the reduce mode 2. In addition, those of some groups need not
even come from the intermediate data when the reduce function
is invoked. They can come from data local to the reduce
function or data shared by other reduce functions. For
example, this mode can implement the cross product of the
employees and departments in FIG. 3. To do that, the departments
data can be accessed by every reduce function.
In the improved MapReduce, the map and reduce functions
can be provided by the application developers or users. They
can also be generated by a tool or program. For example,
consider implementing the select operation of relational database
processing to get from a given table those employees
whose salary is larger than a certain limit. Using the schema
of the data, it is possible for a tool or program to automatically
generate a map function to implement this select operation.
The automatic generation allows building higher-level interfaces
on top of the improved MapReduce.
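Such a generated map function might look like the sketch below. This is our own illustration of the idea, not generated output; the schema fields and the salary limit are hypothetical:

```python
def make_select_map(value_field, limit):
    """Generate, from knowledge of the table's schema, a map function
    implementing SELECT ... WHERE <value_field> > limit: records whose
    attribute exceeds the limit are emitted unchanged, others dropped."""
    def select_map(group, key, record):
        if record[value_field] > limit:
            return [(group, key, record)]
        return []
    return select_map

# A tool could emit this call after inspecting the employee schema.
salary_map = make_select_map(value_field="salary", limit=50000)
kept = salary_map("emp", 34, {"name": "Smith", "salary": 60000})
dropped = salary_map("emp", 31, {"name": "Rosen", "salary": 40000})
```

The generator, rather than a human, supplies the attribute name and predicate, which is what enables higher-level interfaces on top of the improved MapReduce.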
We now discuss an example, using pseudocode, of code to
accomplish joining the records of the Employee table 302
with the records of the Department table 304, as discussed
above relative to FIGS. 3 and 4, such that each record of the
Employee and Department table 306 includes values LastName
and DeptName, with the records sorted by the DeptID
key. This example will use the reduce mode 1. The map
functions may be defined as follows:
`
map(string group, string key, string val) {
  // group: "emp" or "dept"
  // key: DeptID
  // val: LastName for "emp", DeptName for "dept"
  DeptID = key;
  emit_to_group(group, DeptID, val);
}
`The reduce functions may be defined as follows:
`
reduce(hashtable iterators) {
  // iterators: one iterator for "emp", another for "dept"
  DeptID = iterators.get_key();
  emp_iter = iterators["emp"].get_iter();
  dept_iter = iterators["dept"].get_iter();
  for each LastName in emp_iter do {
    for each DeptName in dept_iter do {
      emit_to_group("emp_dept", DeptID, (LastName, DeptName));
    }
  }
}
`
If null iterators are possible and the database operation, such
as outer joins, needs their handling, the reduce functions may
also be defined as follows:

reduce(hashtable iterators) {
  // iterators: one iterator for "emp", another for "dept"
  DeptID = iterators.get_key();
  emp_iter = iterators["emp"].get_iter();
  dept_iter = iterators["dept"].get_iter();
  if (emp_iter is not null and dept_iter is not null) then {
    for each LastName in emp_iter do {
      for each DeptName in dept_iter do {
        emit_to_group("emp_dept", DeptID, (LastName, DeptName));
      }
    }
  } else if (emp_iter is not null and dept_iter is null) then {
    // null iterator handling for left and full outer joins
    for each LastName in emp_iter do {
      emit_to_group("emp_dept", DeptID, (LastName, null));
    }
  } else if (emp_iter is null and dept_iter is not null) then {
    // null iterator handling for right and full outer joins
    for each DeptName in dept_iter do {
      emit_to_group("emp_dept", DeptID, (null, DeptName));
    }
  } else {
    // both iterators being null can be a warning or an error
  }
}
As mentioned above, the output group identifiers may be
different from the input/intermediate group identifiers. For
example, the reduce functions above produce an output data set
with the "emp_dept" group identifier from intermediate data
sets with the "emp" and "dept" group identifiers.
It is noted that the use of the iterators is flexible, so that the
programmer can use the iterators as desired to accomplish a
desired function. It is also noted that the data structure to
implement iterators can be any data structure that can efficiently
return the iterator of a group given the group identification.
A hash table is only one example of such a data structure.
It is further noted that the iterators can carry metadata for
the data that they are created for. The metadata may include
various statistics. This capability is in addition to the counter
facility that the conventional MapReduce provides.
Having provided an example of the code for the map and
reduce functions to accomplish joining the records of the
Employee table 302 with the records of the Department table
304, we now discuss an example execution of the map and
reduce functions. For example, the map function may be
called as follows:

map("emp", 34, "Smith")

which results in the data "34 Smith" being emitted into files
with an "emp" extension. The "emp" extension identifies the
data with the "emp" group. The map function may further be
called as follows:

map("dept", 34, "Clerical")

which results in the data "34 Clerical" being emitted into files
with a "dept" extension. There would further be additional
calls to the map function that are not listed here.
The reduce function may be called as follows:

reduce((34, ("emp", ("Smith", "Robinson", "Jasper")), ("dept",
("Clerical"))))
`
A partial example output of the reduce function may then be
as follows:

"34, Smith, Clerical"
"34, Robinson, Clerical"
"34, Jasper, Clerical"

It is noted that the output will be sorted by the DeptID record
as a side-effect of MapReduce. The sorting happens in both
the conventional MapReduce and the improved MapReduce,
although it can also be turned off in the improved MapReduce.
FIG. 5 more fully illustrates, with the full data of the
Employee table 302 and the Department table 304, processing
by an example of the improved MapReduce to equijoin
the records of the Employee table 302 with the records of the
Department table 304 to generate the Employee and Department
table 306, where each record shows the department
name for one employee. Referring specifically to FIG. 5, the
Employee table 302 and the Department table 304 are shown
in the portion 502 of FIG. 5. The map functions are shown
within the portion 504 of FIG. 5.
The portion 506 of FIG. 5 includes the intermediate results
of applying the map functions followed by the partitioning
into even-keyed records and odd-keyed records. This partitioning
function is just one of many possible partitioning
functions. The intermediate results within the portion 506 are
analogous to the boxes labeled E' and D' within map tasks 402
and 404 in FIG. 4. The intermediate results within the portion
508 in FIG. 5 are analogous to the partitioned results 406 and
408 in FIG. 4. In FIG. 5, the partitioning results are sorted into
even-keyed records 509a and odd-keyed records 509b. The
sorted partitioned results within portion 508 are provided to
reduce functions within the portion 510, and the portion 512
includes the result Employee and Department table 306. It is
noted that, in the special case of department ID 35, there are no
employees in the Employee table 302 for department ID 35,
so the record in the Employee and Department table 306 for
department ID 35 was generated based on a null iterator for
the employee group. (In general, a null iterator can occur for
both groups. The null iterators are handled in reduce functions
in the way that the programmer desires. For example, as
discussed before, outer join types utilize the generation of an
output record with some null attribute values.)
As discussed, then, the MapReduce concept may be utilized
to carry out map processing independently on two or
more related datasets (e.g., related by being characterized by
a common key) even when the related data sets are heterogeneous
with respect to each other, such as data tables organized
according to different schemas. The intermediate results of the
map processing (key/value pairs) for a particular key can be
processed together in a single reduce function by applying a
different iterator to intermediate values for each group. In this
way, operations on the two or more related datasets may be
carried out more efficiently or in a way not even possible with
the conventional MapReduce architecture.
What is claimed is:
1. A method of processing data of a data set over a distributed
system, wherein the data set comprises a plurality of data
groups, the method comprising:
partitioning the data of each one of the data groups into a
plurality of data partitions that each have a plurality of
key-value pairs and providing each data partition to a
selected one of a plurality of mapping functions that are
each user-configurable to independently output a plural
`
`
`
9. The method of claim 7, wherein:
at least some of the reducers include a sort, group-by-key
and combine task;
the method further comprises
generating and providing metadata for at least some of
the mapping, partitioning, combining, grouping and
sorting.
10. The method of claim 9, wherein:
the reducing step includes processing the metadata.
11. The method of claim 10, wherein:
processing the intermediate data for each data group in a
manner that is defined to correspond to that data group
includes, for each data group, employing an iterator that
corresponds to that data group, wherein the iterator
includes providing the associated metadata to the processing
of the reducing step.
12. The method of claim 5, wherein:
the intermediate data includes a plurality of grouped sets of
key/value pairs;
the reducing step is carried out by a plurality of reducers;
for at least one of the reducers, the iterator corresponding to
a particular data group, for that reducer, operates according
to a different key of a different schema than the
iterator corresponding to another particular data group,
for that reducer.
13. The method of claim 1, wherein:
the intermediate data processing step of the reducing step
further comprises processing data that is not intermediate
data.
14. The method of claim 13, wherein:
the reducing step is carried out by a plurality of r