Case 4:23-cv-01147-ALM Document 53-1 Filed 10/29/24 Page 1 of 14 PageID #: 1657
Exhibit 1
US 8,190,610 B2

(12) United States Patent
     Dasdan et al.

(10) Patent No.:     US 8,190,610 B2
(45) Date of Patent: May 29, 2012

(54) MAPREDUCE FOR DISTRIBUTED DATABASE PROCESSING

(75) Inventors: Ali Dasdan, San Jose, CA (US);
                Hung-Chih Yang, Sunnyvale, CA (US);
                Ruey-Lung Hsiao, Los Angeles, CA (US)

(73) Assignee: Yahoo! Inc., Sunnyvale, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is
            extended or adjusted under 35 U.S.C. 154(b) by 1105 days.

(21) Appl. No.: 11/539,090

(22) Filed: Oct. 5, 2006

(65) Prior Publication Data
     US 2008/0086442 A1          Apr. 10, 2008

(51) Int. Cl.
     G06F 17/30 (2006.01)
(52) U.S. Cl. ................................ 707/737; 707/968
(58) Field of Classification Search .......... 707/737, 968
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     6,158,044 A *   12/2000  Tibbetts ................ 717/100
     6,341,289 B1 *   1/2002  Burroughs et al. ....... 707/737
     [additional cited patents, including an entry citing
     Ghemawat et al., are illegible in this copy]
     7,620,936 B2 *  11/2009  Ernst et al. ........... 717/108
     2004/0225638 A1 * 11/2004  Geiselhart et al. ..... 707/1
     2004/0230567 A1 * 11/2004  Wookey ................ 707/3
     2006/0117036 A1 *  6/2006  Cruanes et al. ........ 707/100
     2007/0038659 A1 *  2/2007  Datar et al. .......... 707/101
     2007/0255685 A1 * 11/2007  Boult et al. .......... 707/2

     OTHER PUBLICATIONS

     Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data
     Processing on Large Clusters", USENIX Association OSDI '04: 6th
     Symposium on Operating Systems Design and Implementation,
     Dec. 6-8, 2004, pp. 137-149.

     * cited by examiner

     Primary Examiner — Khanh Pham
     (74) Attorney, Agent, or Firm — Weaver Austin Villeneuve &
          Sampson LLP

(57) ABSTRACT

An input data set is treated as a plurality of grouped sets of key/value pairs, which enhances the utility of the MapReduce programming methodology. By utilizing such a grouping, map processing can be carried out independently on two or more related but possibly heterogeneous datasets (e.g., related by being characterized by a common primary key). The intermediate results of the map processing (key/value pairs) for a particular key can be processed together in a single reduce function by applying a different iterator to intermediate values for each group. Different iterators can be arranged inside reduce functions in whatever ways are desired.

46 Claims, 5 Drawing Sheets
[Front-page drawing: excerpt of the FIG. 4 data flow (elements 302, 304, 402, 404, 406, 408, 410, 414)]
[FIG. 1 — Drawing Sheet 1 of 5]
[FIG. 2 — Drawing Sheet 2 of 5]
[FIG. 3 — Drawing Sheet 3 of 5: Employee table 302, Department table 304, and the joined "Employee and Department" table 306]
[FIG. 4 — Drawing Sheet 4 of 5]
[FIG. 5 — Drawing Sheet 5 of 5: full equijoin processing of the Employee and Department tables, portions 502-512, with intermediate data partitioned into even-keyed and odd-keyed records]
MAPREDUCE FOR DISTRIBUTED DATABASE PROCESSING

BACKGROUND

MapReduce is a programming methodology to perform parallel computations over distributed (typically, very large) data sets. Some theory regarding the MapReduce programming methodology is described in "MapReduce: Simplified Data Processing on Large Clusters," by Jeffrey Dean and Sanjay Ghemawat, appearing in OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, Calif., December, 2004 (hereafter, "Dean and Ghemawat"). A similar, but not identical, presentation is also provided in HTML form at the following URL: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html (hereafter, "Dean and Ghemawat HTML").

Basically, a "map" function maps key-value pairs to new (intermediate) key-value pairs. A "reduce" function reduces all mapped (intermediate) key-value pairs sharing the same key to a single key-value pair or a list of values. The "map" and "reduce" functions are typically user-provided. The map function iterates over a list of independent elements, performing an operation on each element as specified by the map function. The map function generates intermediate results. The reduce operation takes these intermediate results via a single iterator and combines elements as specified by the reduce function.
SUMMARY

In accordance with an aspect, an input data set is treated as a plurality of grouped sets of key/value pairs, which enhances the utility of the MapReduce programming methodology. Utilizing such grouping, map processing is carried out independently on two or more related datasets (e.g., related by each being characterized by a schema with a key in common). The intermediate results of the map processing (key/value pairs) for a particular key are processed together in a single reduce function by applying a different iterator to intermediate values for each group. Different iterators can be composed inside reduce functions in whatever ways are desired.

Thus, for example, the enhanced MapReduce programming methodology may be easily employed to carry out distributed relational database processing.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example of the conventional MapReduce architecture.

FIG. 2 illustrates a specific parallel implementation of the FIG. 1 architecture.

FIG. 3 graphically illustrates a join performed over two relational tables that have different schemas.

FIG. 4 illustrates a particular implementation of an improved MapReduce architecture that includes consideration for input groups, relative to the FIG. 3 example relational tables.

FIG. 5 schematically illustrates, in greater detail, processing by the improved MapReduce architecture to join the records (FIG. 3) of the Employee table 302 with the records of the Department table 304 to generate the Employee and Department table 306.
DETAILED DESCRIPTION
The inventors have realized that, by treating an input data set as a plurality of grouped sets of key/value pairs, the utility of the MapReduce programming methodology may be enhanced. By utilizing such a grouping, map processing can be carried out independently on two or more related datasets (e.g., related by being characterized by a common key). The intermediate results of the map processing (key/value pairs) for a particular key can be processed together in a single reduce function by applying a different iterator to intermediate values for each group.

Before discussing the aspect in which the input data set is treated as a plurality of grouped sets of key/value pairs, it is useful to first discuss, as background, the conventional architecture. The conventional architecture is described, for example, in Dean and Ghemawat, referred to in the Background. FIG. 1 illustrates an example of the conventional architecture and is substantially reproduced from Dean and Ghemawat HTML (at slide 7), but with reference numerals added here to aid in the description. In the example, the keys of the input data key/value pairs comprise keys k1 to k5.

As discussed by Dean and Ghemawat, and as shown in the FIG. 1 illustration, both the input data set 102 and the output data set 114 are a set of key/value pairs. The programmer specifies a map function that processes input key/value pairs and produces a set of intermediate pairs 106(1) through 106(7). In the abstract, such a map function is specified as follows:

    map (in_key, in_value) -> list(out_key, intermediate_value)

In the FIG. 1 illustrative example, the input data set 102 is partitioned into an arbitrary grouping of seven partitions 102(1) through 102(7). Each partition is provided to the corresponding invocation of the map function, which has seven invocations 104(1) through 104(7).

The group-by-key functionality 108 partitions the intermediate results by out_key, and the intermediate partitions 110(k1) to 110(k5) are provided to the corresponding reduce functions 112(k1) to 112(k5), respectively. Each reduce function 112(k1) to 112(k5) processes intermediate data from one of the intermediate partitions 110(k1) to 110(k5) to generate the corresponding output partition 114(k1) to 114(k5). In the abstract, such a reduce function is specified as follows:

    reduce (out_key, list(intermediate_value)) -> list(out_value)

The reduce function combines all intermediate values for a particular key and produces a set of merged output values for the key, usually just one.
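The conventional map/group-by-key/reduce flow described above can be sketched as a minimal in-memory Python model (illustrative only; the function `map_reduce` and the word-count example are hypothetical and do not appear in the patent):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each input key/value pair yields intermediate pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))
    # Group-by-key (element 108 in FIG. 1): gather values sharing an out_key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for out_key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        # Reduce phase: a single iterator over all values for the key.
        output.extend(reduce_fn(out_key, values))
    return output

# Example: word count over two small "documents".
docs = [(1, "a b a"), (2, "b c")]
counts = map_reduce(
    docs,
    map_fn=lambda k, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda w, ones: [(w, sum(ones))],
)
```

Note that, as in the conventional architecture, the reduce function here receives exactly one iterator of values per key, with no notion of which source the values came from.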
FIG. 2 illustrates a parallel implementation of the FIG. 1 architecture. For simplicity of illustration, only some of the elements in FIG. 2 are denoted by reference numerals, and it is clearly evident that others of the elements correspond to elements in FIG. 1. In the implementation, the map functions 104 are partitioned into a user-configurable number of map tasks 202a, 202b and 202c (here, three map tasks). Similarly, the reduce functions are partitioned into a user-configurable number of reduce tasks 208a and 208b (here, two reduce tasks). Each map task 202a, 202b, and 202c includes a partitioning function 206a, 206b and 206c, respectively, that partitions the intermediate data across the reduce tasks 208a and 208b. Each reduce task 208a and 208b includes a sort and group-by-key task 210a and 210b, respectively, which can make the processing of the output data more efficient and/or convenient.
Although not shown in FIGS. 1 and 2, the partitioned data can go through a configurable, intermediate step in which a "combiner" function is called. A combiner function is similar to a reduce function, with the following difference: a combiner runs after a mapper and performs a partial merging on each partition before the partitions are transferred to a reducer. This helps reduce the network traffic and speed up the total execution time.
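The combiner step can be sketched as follows (a hypothetical Python illustration; `combine` and the word-count-style example are not from the patent):

```python
from itertools import groupby
from operator import itemgetter

def combine(partition, combine_fn):
    # Partial merge of one map task's local output before it is sent
    # over the network to a reducer; the same shape as a reduce,
    # but applied per map-side partition.
    partition = sorted(partition, key=itemgetter(0))
    merged = []
    for key, group in groupby(partition, key=itemgetter(0)):
        merged.extend(combine_fn(key, [v for _, v in group]))
    return merged

# Three locally emitted pairs shrink to two before transfer.
local_output = [("a", 1), ("a", 1), ("b", 1)]
merged = combine(local_output, lambda k, vs: [(k, sum(vs))])
```

The reducer then receives fewer, pre-aggregated pairs, which is the source of the network-traffic savings described above.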
Conventional MapReduce implementations enable the use of massive clusters of commodity computers to provide a simplified programming and execution model for processing large sets of data in parallel. In addition, the application is isolated from messy but important details of running a distributed program on massive clusters of potentially unreliable computers. However, the conventional MapReduce implementations do not have a facility to efficiently process data from heterogeneous sources.

For example, it is impractical to perform joins over two relational tables that have different schemas. FIG. 3 graphically illustrates such a join (without regard for any particular method or architecture for accomplishing the join). Referring to FIG. 3, there are two tables, an "Employee" table 302 and a "Department" table 304, on which it may be desirable to perform a join operation. Each employee record in the Employee table 302 has key DeptID and value LastName, and each department record in the Department table 304 has key DeptID and value DeptName. The result "Employee and Department" table 306 is a table that shows the department name for each employee. More specifically, the records of the Employee table 302 are joined with the records of the Department table 304 such that each record of the Employee and Department table 306 includes key DeptID and values LastName and DeptName.
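For concreteness, the two FIG. 3 tables can be modeled as flat lists of key/value pairs, with the desired equijoin expressed as nested loops (an illustrative Python sketch; the sample rows are drawn from the examples given later in the text):

```python
# Key is DeptID; the two tables share the key but differ in schema.
employees = [(34, "Smith"), (34, "Robinson"), (31, "Rosen")]    # DeptID -> LastName
departments = [(34, "Clerical"), (31, "Sales")]                 # DeptID -> DeptName

# Desired "Employee and Department" result:
# DeptID -> (LastName, DeptName) wherever the keys match.
employee_and_department = [(d, (last, name))
                           for d, last in employees
                           for d2, name in departments
                           if d == d2]
```

The difficulty for conventional MapReduce is that, once both tables' records are merged into a single value iterator per key, the reduce function can no longer tell a LastName apart from a DeptName.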
FIG. 4 illustrates an example of an improved MapReduce architecture in accordance with an aspect, and relative to the FIG. 3 example relational tables. It is noted that the schema of each data set, such as the FIG. 3 relational tables, includes a set of attributes (such as DeptID, LastName, DeptName) and their properties (such as their data types: integer DeptID, string LastName, string DeptName). As discussed above with respect to the conventional MapReduce architecture, the input, intermediate and output data sets may each be characterized by their own schema, and each schema operates according to "key/value" pairs. The attributes in the key/value pairs may be distinct or overlapping; moreover, keys within a data set may be distinct or identical.

In the improved MapReduce architecture such as discussed with reference to FIG. 4, the input, intermediate and output data sets are partitioned into a set of data groups. With the partitioning into groups, it is likely that the map functions corresponding to each group are different, that data sets within the same group are characterized by the same schema, and that data sets within different groups are characterized by different schemas. It is also possible that the map functions corresponding to each group are the same and that the data sets within all the groups share one schema.

In general, partitioning the data sets into data groups enables a mechanism to associate (group) identifiers with data sets, map functions and iterators (useable within reduce functions to access intermediate data) and, also, to produce output data sets with (group) identifiers. It is noted that the output group identifiers may differ from the input/intermediate group identifiers.

Referring now to FIG. 4, each of the map tasks 402 and 404 is configured to operate on a separate data group, where each of the separate data groups is characterized by its own
schema. In FIG. 4, the separate tables are labeled "E" and "D" to represent the "Employee" table 302 and the "Department" table 304, respectively.

Referring still to FIG. 4, the records of the intermediate data E' and D' retain an identification with the groups to which the original input data, resulting in the particular intermediate data, belong. Thus, the intermediate data E' retains an identification with group E, and the intermediate data D' retains an identification with group D. That is, the intermediate data E' of the map task 402 retains an identification with the Employee table 302, and the intermediate data D' of the map task 404 retains an identification with the Department table 304.

Even the records of the partitioned data 406 and 408, after partitioning the records of the intermediate data to the various reduce tasks 410 and 412, retain an identification with the groups of the original input data (the Employee table 302 and the Department table 304). The reduce tasks 410 and 412 can access partitioned intermediate data for both groups 302 and 304 through one iterator per group, and the reduce tasks 410 and 412 can be programmed by users to use the iterators however desired.

In accordance with the described aspect, a programmer may view data as follows:

    input data         grouped sets of key/value pairs
    intermediate data  grouped sets of key/value pairs
    output data        (possibly grouped) sets of key/value pairs or values

It is noted that the output data does not have to be grouped or key/value pairs, but it is useful in some instances, such as some instances in which MapReduce jobs are to be chained in a pipeline. It is also noted that there can be only one group.
We now provide some example syntax for map and reduce functions to employ the improved MapReduce architecture. As discussed above, a user/programmer typically provides map and reduce functions. In accordance with the described aspect, in the abstract, a map function may be specified as follows:

    map (group_id, in_key, in_value) ->
        (group_id, list(out_key, intermediate_value))

Thus, for a particular group identified with a group_id, the map function processes each input key/value pair and produces one or more intermediate key/value pairs. In accordance with the described aspect, in the abstract, a reduce function may be specified as follows:

    reduce (list(group_id, list(out_key, intermediate_value))) ->
        list(out_value)

Thus, the reduce function merges all intermediate data over all groups. Iterators specific to a particular group may be employed to iterate over that group's intermediate data. One or more output values are provided.
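An executable model of this grouped contract may be sketched as follows (a hypothetical Python illustration; `grouped_map_reduce` and the identity-map equijoin example are assumptions of this sketch, not the patent's implementation):

```python
from collections import defaultdict

def grouped_map_reduce(grouped_input, map_fns, reduce_fn):
    # grouped_input: {group_id: [(in_key, in_value), ...]}
    # map_fns: one user-provided map function per group id.
    by_key = defaultdict(dict)  # out_key -> {group_id: [values, ...]}
    for gid, records in grouped_input.items():
        for in_key, in_value in records:
            for out_key, inter in map_fns[gid](in_key, in_value):
                by_key[out_key].setdefault(gid, []).append(inter)
    output = []
    for out_key in sorted(by_key):  # sorted, as in the conventional flow
        # One iterator per group for this key; a group with no records
        # for the key is simply absent (models a "null iterator").
        iterators = {gid: iter(vals) for gid, vals in by_key[out_key].items()}
        output.extend(reduce_fn(out_key, iterators))
    return output

# Equijoin of FIG. 3-style tables, using an identity map per group.
identity_map = lambda k, v: [(k, v)]

def join_reduce(dept_id, iterators):
    emps = list(iterators["emp"]) if "emp" in iterators else []
    depts = list(iterators["dept"]) if "dept" in iterators else []
    return [(dept_id, (e, d)) for e in emps for d in depts]

result = grouped_map_reduce(
    {"emp": [(34, "Smith"), (31, "Rosen")],
     "dept": [(34, "Clerical"), (31, "Sales")]},
    {"emp": identity_map, "dept": identity_map},
    join_reduce,
)
```

The essential difference from the conventional contract is visible in `join_reduce`: the reducer receives a separate iterator per group rather than one undifferentiated iterator of values.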
In the improved MapReduce, the partitioning operation after the mapping phase is generalized such that it has a configurable operation with two modes.

In the partitioning mode 1, the partitioning operation partitions the intermediate data into R partitions, where R is the number of reduce tasks. Each key/value pair goes to only one of the partitions. This is similar to what is done by the conventional MapReduce.

In the partitioning mode 2, the partitioning operation still partitions the intermediate data into R partitions, but each key/value pair can go to a plurality of the partitions. In this mode, it is likely that the pairs will go to all the partitions.
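The two partitioning modes can be sketched as follows (a hypothetical Python illustration; the function names and the hash-based assignment are assumptions of this sketch):

```python
def partition_mode_1(key, R):
    # Mode 1: each intermediate key/value pair is assigned to exactly
    # one of the R partitions (one per reduce task), as in the
    # conventional MapReduce.
    return [hash(key) % R]

def partition_mode_2(key, R):
    # Mode 2: a pair may be sent to several partitions; replicating it
    # to all R partitions (a broadcast) is the typical case, e.g. for
    # sending a small table to every reducer.
    return list(range(R))
```

Mode 2 is what allows every reducer to see every department record in the cross-product example that follows.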
For example, consider implementing a cross product of the tables in FIG. 3, in which every record of the Employee table 302 will be paired with every record of the Department table 304. The cross product can be achieved using two nested loops, one outer loop iterating over the records of the Employee table and one inner loop iterating over the records of the Department table. In the improved MapReduce, one possible implementation can map each employee to only one reducer while mapping each department to every reducer. In another possible implementation, it can map each employee to only one reducer and not map the departments at all. Instead, each reducer can read the departments directly from its local copy, if one exists, or from a copy shared by the other reducers.
In the improved MapReduce, the operations can collect information about the data that they are processing. They can also save the information as metadata of the corresponding data. For example, the partitioning, grouping, and sorting operations can keep the smallest and largest keys in each partition processed. This information can be saved as part of the metadata of each partition. When an iterator is created to scan through the records of a partition, this metadata can be made available to the reduce function through the iterator. One example benefit is that this metadata can speed up comparisons. For example, in a processing of the data to find all records whose keys are smaller than a particular key, an entire partition can be skipped if its smallest key is larger than the particular key.
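The partition-skipping benefit can be sketched as follows (an illustrative Python fragment; the metadata layout and function name are assumptions of this sketch):

```python
def keys_below(partitions, limit_key):
    # Each partition carries (smallest_key, largest_key) metadata
    # recorded during partitioning/sorting. A whole partition is
    # skipped when its smallest key is not below the limit.
    found = []
    for (smallest, _largest), records in partitions:
        if smallest >= limit_key:
            continue  # entire partition skipped using metadata alone
        found.extend((k, v) for k, v in records if k < limit_key)
    return found

partitions = [((31, 34), [(31, "Rosen"), (34, "Smith")]),
              ((35, 40), [(35, "NULL")])]
small = keys_below(partitions, 34)  # second partition is never scanned
```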
In some examples, the sorting operation is configurable, such that it can be turned on or off for some groups. This is in addition to the configurable sorting function that is also available in the conventional MapReduce. As such, the overhead to perform some operations may be reduced where, for example, sorting is not needed. One example of such a database operation is determining a cross product.

In the improved MapReduce, the reduce function has a configurable operation with three modes. They are named in increasing order of generality as follows: the reduce mode 1, the reduce mode 2, and the reduce mode 3. These modes improve its capabilities to support the basic relational database processing operations. In all modes, the data accessible through the iterators over all groups are made available to the reducer functions.
In the reduce mode 1, the records input to a reduce function through the group iterators carry or share the same key. If there is no record in a group that has the same key, the corresponding iterator is null. This mode helps implement an equijoin, a join over equal keys, efficiently. For example, it can produce the "Employee and Department" table 306 by equijoining the "Employee" table 302 and the "Department" table 304.
In the reduce mode 2, the records input to a reduce function through the group iterators need not carry the same key. The input records are the ones pointed to by the group iterators when the reduce function is invoked. The reduce function controls the manner in which keys, values, and iterators are used. This mode helps implement a join with an arbitrary predicate (such as comparisons) because it can implement the sort-merge join. For example, this mode together with the partitioning mode 2 can even implement the cross product of the employees and departments in FIG. 3.
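A reduce-mode-2 style predicate join can be sketched as follows (an illustrative Python fragment; the records and the "less than" predicate are assumptions of this sketch, chosen only to show that key equality is not required):

```python
def non_equi_join(emp_records, dept_records):
    # Reduce mode 2: the group iterators need not share a key, so the
    # reducer may apply an arbitrary predicate instead of key equality.
    # Here: pair an employee with every department having a larger DeptID.
    return [((e_key, d_key), (e_val, d_val))
            for e_key, e_val in emp_records
            for d_key, d_val in dept_records
            if e_key < d_key]

pairs = non_equi_join([(31, "Rosen")], [(34, "Clerical"), (31, "Sales")])
```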
In the reduce mode 3, the records input to a reduce function through the group iterators need not carry the same key, as in the reduce mode 2. In addition, those of some groups need not even come from the intermediate data when the reduce function is invoked. They can come from data local to the reduce function or data shared by other reduce functions. For example, this mode can implement the cross product of the employees and departments in FIG. 3. To do that, the departments data can be accessed by every reduce function.
In the improved MapReduce, the map and reduce functions can be provided by the application developers or users. They can also be generated by a tool or program. For example, consider implementing the select operation of relational database processing to get from a given table those employees whose salary is larger than a certain limit. Using the schema of the data, it is possible for a tool or program to automatically generate a map function to implement this select operation. The automatic generation allows building higher-level interfaces on top of the improved MapReduce.
We now discuss an example, using pseudocode, of code to accomplish joining the records of the Employee table 302 with the records of the Department table 304, as discussed above relative to FIGS. 3 and 4, such that each record of the Employee and Department table 306 includes values LastName and DeptName, with the records sorted by the DeptID key. This example will use the reduce mode 1. The map functions may be defined as follows:

    map(string group, string key, string val) {
        // group: "emp" or "dept"
        // key: DeptID
        // val: LastName for "emp", DeptName for "dept"
        DeptID = key;
        emit_to_group(group, DeptID, val);
    }
The reduce functions may be defined as follows:

    reduce(hashtable iterators) {
        // iterators: one iterator for "emp", another for "dept"
        DeptID = iterators.get_key();
        emp_iter = iterators["emp"].get_iter();
        dept_iter = iterators["dept"].get_iter();
        for each LastName in emp_iter do {
            for each DeptName in dept_iter do {
                emit_to_group("emp_dept", DeptID, (LastName, DeptName));
            }
        }
    }
If null iterators are possible and the database operation, such as an outer join, needs their handling, the reduce functions may also be defined as follows:

    reduce(hashtable iterators) {
        // iterators: one iterator for "emp", another for "dept"
        DeptID = iterators.get_key();
        emp_iter = iterators["emp"].get_iter();
        dept_iter = iterators["dept"].get_iter();
        if (emp_iter is not null and dept_iter is not null) then {
            for each LastName in emp_iter do {
                for each DeptName in dept_iter do {
                    emit_to_group("emp_dept", DeptID, (LastName, DeptName));
                }
            }
        }
        else if (emp_iter is not null and dept_iter is null) then {
            // null iterator handling for left and full outer joins
            for each LastName in emp_iter do {
                emit_to_group("emp_dept", DeptID, (LastName, null));
            }
        }
        else if (emp_iter is null and dept_iter is not null) then {
            // null iterator handling for right and full outer joins
            for each DeptName in dept_iter do {
                emit_to_group("emp_dept", DeptID, (null, DeptName));
            }
        }
        else {
            // both iterators being null can be a warning or an error
        }
    }
As mentioned above, the output group identifiers may be different from the input/intermediate group identifiers. For example, the reduce functions above produce an output data set with the "emp_dept" group identifier from intermediate data sets with the "emp" and "dept" group identifiers.

It is noted that the use of the iterators is flexible, so that the programmer can use the iterators as desired to accomplish a desired function. It is also noted that the data structure to implement iterators can be any data structure that can efficiently return the iterator of a group given the group identification. A hash table is only one example of such a data structure.

It is further noted that the iterators can carry metadata for the data that they are created for. The metadata may include various statistics. This capability is in addition to the counter facility that the conventional MapReduce provides.

Having provided an example of the code for the map and reduce functions to accomplish joining the records of the Employee table 302 with the records of the Department table 304, we now discuss an example execution of the map and reduce functions. For example, the map function may be called as follows:

    map("emp", 34, "Smith")
which results in the data "34 Smith" being emitted into files with an "emp" extension. The "emp" extension identifies the data with the "emp" group. The map function may further be called as follows:

    map("dept", 34, "Clerical")

which results in the data "34 Clerical" being emitted into files with a "dept" extension. There would further be additional calls to the map function that are not listed here.

The reduce function may be called as follows:

    reduce((34, ("emp", ("Smith", "Robinson", "Jasper")), ("dept", ("Clerical"))))

A partial example output of the reduce function may then be as follows:

    "34, Smith, Clerical"
    "34, Robinson, Clerical"
    "34, Jasper, Clerical"

It is noted that the output will be sorted by the DeptID record as a side-effect of MapReduce. The sorting happens in both the conventional MapReduce and the improved MapReduce, although it can also be turned off in the improved MapReduce.
FIG. 5 more fully illustrates, with the full data of the Employee table 302 and the Department table 304, processing by an example of the improved MapReduce to equijoin the records of the Employee table 302 with the records of the Department table 304 to generate the Employee and Department table 306, where each record shows the department name for one employee. Referring specifically to FIG. 5, the Employee table 302 and the Department table 304 are shown in the portion 502 of FIG. 5. The map functions are shown within the portion 504 of FIG. 5.

The portion 506 of FIG. 5 includes the intermediate results of applying the map functions followed by the partitioning into even-keyed records and odd-keyed records. This partitioning function is just one of many possible partitioning functions. The intermediate results within the portion 506 are analogous to the boxes labeled E' and D' within map tasks 402 and 404 in FIG. 4. The intermediate results within the portion 508 in FIG. 5 are analogous to the partitioned results 406 and 408 in FIG. 4. In FIG. 5, the partitioning results are sorted into even-keyed records 509a and odd-keyed records 509b. The sorted partitioned results within portion 508 are provided to reduce functions within the portion 510, and the portion 512 includes the result Employee and Department table 306. It is noted that, in the special case of department ID 35, there are no employees in the Employee table 302 for department ID 35, so the record in the Employee and Department table 306 for department ID 35 was generated based on a null iterator for the employee group. (In general, a null iterator can occur for both groups. The null iterators are handled in reduce functions in the way that the programmer desires. For example, as discussed before, outer join types utilize the generation of an output record with some null attribute values.)

As discussed, then, the MapReduce concept may be utilized to carry out map processing independently on two or more related datasets (e.g., related by being characterized by a common key) even when the related data sets are heterogeneous with respect to each other, such as data tables organized according to different schemas. The intermediate results of the map processing (key/value pairs) for a particular key can be processed together in a single reduce function by applying a different iterator to intermediate values for each group. In this way, operations on the two or more related datasets may be carried out more efficiently, or in a way not even possible, with the conventional MapReduce architecture.
What is claimed is:

1. A method of processing data of a data set over a distributed system, wherein the data set comprises a plurality of data groups, the method comprising:

    partitioning the data of each one of the data groups into a plurality of data partitions that each have a plurality of key-value pairs and providing each data partition to a selected one of a plurality of mapping functions that are each user-configurable to independently output a plural-

[. . .]
9. The method of claim 7, wherein:

    at least some of the reducers include a sort, group-by-key and combine task;

    the method further comprises generating and providing metadata for at least some of the mapping, partitioning, combining, grouping and sorting.

10. The method of claim 9, wherein:

    the reducing step includes processing the metadata.

11. The method of claim 10, wherein:

    processing the intermediate data for each data group in a manner that is defined to correspond to that data group includes, for each data group, employing an iterator that corresponds to that data group, wherein the iterator includes providing the associated metadata to the processing of the reducing step.

12. The method of claim 5, wherein:

    the intermediate data includes a plurality of grouped sets of key/value pairs;

    the reducing step is carried out by a plurality of reducers;

    for at least one of the reducers, the iterator corresponding to a particular data group, for that reducer, operates according to a different key of a different schema than the iterator corresponding to another particular data group, for that reducer.

13. The method of claim 1, wherein:

    the intermediate data processing step of the reducing step further comprises processing data that is not intermediate data.

14. The method of claim 13, wherein:

    the reducing step is carried out by a plurality of r
