throbber
Spreadsheet As a Relational Database Engine
`
`Jerzy Tyszkiewicz
`Institute Informatics, University of Warsaw
`Banacha 2, 02-097 Warsaw, Poland
`jty@mimuw.edu.pl
`
`ABSTRACT
`Spreadsheets are among the most commonly used applica-
`tions for data management and analysis. Perhaps they are
`even among the most widely used computer applications of
`all kinds. However, the spreadsheet paradigm of computa-
`tion still lacks sufficient analysis.
`In this paper we demonstrate that a spreadsheet can play
`the role of a relational database engine, without any use of
`macros or built-in programming languages, merely by utiliz-
`ing spreadsheet formulas. We achieve that by implementing
`all operators of relational algebra by means of spreadsheet
`functions.
`Given a definition of a database in SQL, it is therefore
`possible to construct a spreadsheet workbook with empty
`worksheets for data tables and worksheets filled with for-
`mulas for queries. From then on, when the user enters, al-
`ters or deletes data in the data worksheets, the formulas in
`query worksheets automatically compute the actual results
`of the queries. Thus, the spreadsheet serves as data storage
`and executes SQL queries, and therefore acts as a relational
`database engine.
`The paper is based on Microsoft Excel (TM), but our
`constructions work in other spreadsheet systems, too. We
`present a number of performance tests conducted in the beta
`version of Excel 2010. Their conclusion is that the perfor-
`mance is sufficient for a desktop database with a couple
`thousand rows.
`
`Categories and Subject Descriptors
`H.2.4 [Database Management]: Systems—relational data-
`bases; H.4.1 [Information Systems Applications]: Office
`Automation—spreadsheets
`
`General Terms
`Theory, Design, Performance
`
`Permission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that copies are
`not made or distributed for profit or commercial advantage and that copies
`bear this notice and the full citation on the first page. To copy otherwise, to
`republish, to post on servers or to redistribute to lists, requires prior specific
`permission and/or a fee.
`SIGMOD’10, June 6–11, 2010, Indianapolis, Indiana, USA.
`Copyright 2010 ACM 978-1-4503-0032-2/10/06 ...$10.00.
`
`Keywords
`spreadsheets, relational databases, relational algebra, SQL,
`performance
`
`1.
`
`INTRODUCTION
`Spreadsheets are the end-user computing counterpart of
`databases and OLAP in enterprise-scale computing. They
`serve basically the same purpose — data management and
`analysis, but at the opposite extremes of the data quantity
`scale.
`At the same time spreadsheets are a great success, and
`are often described as the very first “killer app” for personal
`computers. Today they are used to manage home budgets,
`but also to create, manage and examine extremely sophisti-
`cated models and data arising in business and research. For
`example, Excel files are quite common as a form of support-
`ing online material in Science journal.
`It seems therefore surprising that relatively little research
`has been devoted to them and consequently they are still
`poorly understood. This is a source of many problems, e.g.,
`Science provides an example of a scientific controversy [5,
`3] which finally turned out to be related to the design of a
`spreadsheet used for data analysis. European Spreadsheet
`Risks Interest Group EuSpRIG http://www.eusprig.org
`with its annual conference are devoted to the problems with
`spreadsheets.
`
`2. THE CONTRIBUTION
`In this paper we aim at providing strong evidence, that
`spreadsheets are not only useful and successful, but also very
`interesting, and deserve more research. Specifically, we con-
`sider the relation of spreadsheets to database systems.
`It
`is a natural comparison, because spreadsheets indeed often
`play the role of small databases at the end-user level of com-
`puting.
`We demonstrate that virtually any spreadsheet software
`present on the market is a relational database engine. We
`do so by implementing all operators of relational algebra
`using spreadsheet functions. For each query in SQL, we
`demonstrate how to construct a spreadsheet workbook with
`empty worksheets for data tables and worksheets filled with
`formulas for queries. As the user enters, alters or deletes
`data tuples in the data worksheets, the formulas in query
`worksheets automatically compute the actual results of the
`queries. Thus, the spreadsheet serves as data storage, and
`executes SQL queries. It is therefore a relational database
`engine. Consequently, any specification of a database, writ-
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2238
`Page 1 of 12
`
`

`

`ten in SQL in the form of table and view definitions, can be
`compiled into a spreadsheet workbook which has exactly the
`same functionality as if the database was implemented in a
`classical RDBMS. Crucially, this is achieved without any use
`of macros written in an external programming language, like
`Visual Basic or the like. One might consider our construc-
`tion also as an implementation of a relational database on a
`completely new type of (virtual) hardware.
`As a model of spreadsheet syntax and semantics we take
`Microsoft Excel (TM) (the general reference is [10]), but our
`constructions work in other similar systems, like OpenOffice
`Calc, gnumeric or Google docs, too.
`We present number of performance tests, which indicate
`that our constructions are efficient enough to be practically
`used as a desktop database storing a few thousand rows of
`data.
`
`2.1 Application scenarios
`We believe that there is a market niche for a small SQL
`database implemented in a spreadsheet. It is a low-end solu-
`tion, in terms of performance. However, there is market for
`low-end mobile phones, for low-end computers, and similarly
`there is also market for low-end database solutions.
`E.g., Chrome OS does not permit installing new applica-
`tions. For a user of a netbook with this system, a method
`to have a small SQL database is to use our solution with
`the Google docs spreadsheet. This might be facilitated by
`a public ”query compiler” in WWW, where the user types
`in SQL code and gets a couple of spreadsheet formulas to
`copy and paste into his/her workbook, which implement the
`functionality he/she needs.
`Next, many people and small businesses buy MS Office
`with Access to store just a few thousand tuples. With our
`solution they could buy much cheaper version without Ac-
`cess and have principally the same abilities. Of course, they
`could also use one of the free database solutions, like Post-
`greSQL or MySQL. But these RDBMS’s do not integrate so
`easily with other elements of MS Office, and require techni-
`cal skills necessary to install, configure and maintain them.
`In [9] the authors argue that spreadsheet is actually a very
`good interface for a database, allowing unexperienced users
`to create queries stepwise.
`Moreover, the limit to a few thousands rows of data im-
`posed by the low efficiency is not very severe for small busi-
`nesses, which often process data from short periods only
`(typically 1 month), and after their tax reports are ready,
`the data can safely be archived. In the spreadsheet database
`it amounts to saving the finished one and creating a new
`empty instance for the next period. An alternative option
`is materialization of queries by cut and paste special→values
`which removes all formulas together with their need of re-
`computation, leaving the results.
`
`2.2 Related work
`To the best of our knowledge, the problem of expressing
`relational algebra and SQL in spreadsheets has not been con-
`sidered before in the setting we adopt here. The the most
`similar to our work reported here are the following. The
`paper [9] proposes an extension of the set of spreadsheet
`functions by carefully designed database function, whereby
`the user can specify (and later execute) SQL queries in
`a spreadsheet-like style, one step at a time. These addi-
`tional operators were executed by a classical database en-
`
`gine running in the background. Our contribution means
`that exactly the same functionality can be achieved by the
`spreadsheet itself. Two papers [14, 15] describe a project,
`later named Query by Excel to extend SQL by spreadsheet-
`inspired functionality, allowing the user to treat database
`tables as if they were located in a spreadsheet and define
`calculations over rows and columns by formulas resembling
`those found in spreadsheets. In the final paper [16] a spread-
`sheet interface is offered for specifying these calculations,
`which had to be specified in an SQL-like code in the earlier
`papers. The paper [8] describes a method to allow RDBMs
`to query data stored in spreadsheets.
`There is also a number of papers which discuss various
`methods to support high-level design of spreadsheets, in par-
`ticular [1, 2, 6, 7, 11, 12, 13, 17, 18]. Some of them consider
`spreadsheets from the functional programming perspective.
`
`3. TECHNICALITIES
`We assume the reader to be basically familiar with spread-
`sheets. The paper is written to make the solutions compati-
`ble with Microsoft Excel (TM) 2003 version. The newest (at
`the time of this writing) Excel 2007 and beta 2010 provide a
`number of new functions, which simplify some of the tasks,
`but are not present in other spreadsheet systems. Therefore
`we chose compatibility with the older version, but refer to
`and test the newer systems, as well.
`
`3.1 R1C1 notation
`We use the row-column R1C1-style addressing of cells and
`ranges, supported by Excel. This notation is easier to handle
`in a formal description, although in everyday practice the
`equivalent A1 notation is dominating. The key advantage
`of the R1C1 notation is that the meaning of the formula is
`independent of the cell in which it is located.
`In the R1C1 notation, both rows and columns of work-
`sheets are numbered by integers from 1 onward. For arbi-
`trary nonzero integers i and j and nonzero natural numbers
`m, n the following expressions are cell references in the R1C1
`notation: RmCn, R[i]Cm, RmC[j], R[i]C[j], RCm, RC[i], RmC,
`R[i]C.
`The number after ‘R’ refers to the row number and the
`number after ‘C’ to the column number. If that number is
`missing, it means “same row (column)” as the cell in which
`this expression is used. A number written in square brackets
`is a relative reference and the cell to which this expression
`points should be determined by adding that number to the
`row (column) number of the present cell. Number with-
`out brackets is an absolute reference to a cell whose row
`(column) number is equal to that number. For example,
`R[−1]C7 denotes a cell which is in the row directly above the
`present one in column 7, while RC[3] denotes a cell in the
`same row as the present one and 3 columns to the right. If R
`or C is itself omitted, the expression denotes the whole col-
`umn or row (respectively), e.g., C7 is column number 7. For
`referencing cells in other worksheets, RC may also be used,
`and references the cell whose row and column numbers are
`equal to the address of the cell in which this expression is
`located.
`
`3.2 IF function
`IF is a conditional function in spreadsheets. The syntax
`is IF(condition,true_branch,false_branch). Its evalua-
`tion is lazy, i.e., after the condition is evaluated and yields
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2238
`Page 2 of 12
`
`

`

`Figure 1: The idea of a database implementation in a spreadsheet. (Errors appearing in the worksheet are
`intended.)
`
`either TRUE or FALSE, only one of the branches is evaluated.
`It can be therefore used to protect functions from being ap-
`plied to arguments of wrong types, trap errors, and, last
`but not least, to speed up execution of queries by avoiding
`computing certain branches.
`
`3.3 SUMPRODUCT function
`We will often use a special function called SUMPRODUCT.
`Its uses will be generally modifications of the following two
`examples.
`
`Example 1.
`=SUMPRODUCT((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)) is cal-
`culated as follows:
`(i) each cell in the range R1C1:R5C1 is compared with R1C3,
`and this yields a sequence of five booleans;
`(ii) each cell in the range R1C2:R5C2 is compared with R1C4,
`and this yields another sequence of five booleans;
`(iii) the two sequences from previous items are multiplied
`coordinate-wise, which results in automatic data type con-
`version from booleans to integers, and then normal multi-
`plication;
`(iv) the five numbers are summed to give a single number
`as a result.
`Consequently, the final result is the count of rows, in which
`the columns C1 and C2 contain the same pair of numbers as
`in R1C3:R1C4.
`
`Example 2.
`=SUMPRODUCT((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)*
`R1C5:R5C5) is calculated as follows:
`(i-iii) the first three steps of evaluation are the same as be-
`fore;
`(iv) the sequence of 0s and 1s from the previous item and
`the range R1C5:R5C5 are multiplied again coordinate-wise,
`which results in a sequence of five numbers;
`(v) again the sum of the above five numbers is returned.
`The result is the sum of values in C5, calculated over those
`rows, in which the columns C1 and C2 contain the same pair
`of numbers as in R1C3:R1C4.
`
`These two examples generalize to sum-multiplication of
`more than two or three arrays.
`
`3.4 MATCH and INDEX functions
`We use MATCH with syntax MATCH(range,cell,0). It re-
`turns the relative position of the first value in range which
`is equal to the value in cell, and error if that value does
`not appear there.
`In one case we use MATCH(range,cell,1), in which case
`for range sorted into ascending order MATCH returns the rel-
`ative position of the largest value in range that is less than
`or equal to the value in cell. We do not apply this form to
`unsorted ranges.
`INDEX is used in the following syntax: INDEX(range,cell),
`and returns the value from range whose relative position is
`given by the value from cell.
`
`3.5 Syntax variants
`We will be using the SUMPRODUCT function with whole-
`column ranges, e.g., SUMPRODUCT((C1=R1C3)*(C2=R1C4), which
`is generally equivalent to the formula from Example 1, if no
`other values are present in columns C1 and C2. However,
`this syntax is not valid in Excel versions prior to 2007. In
`early Excel’s one would have to use ranges like R1C5:R1Cmax
`in SUMPRODUCT, where max is the maximal row number of the
`range. OpenOffice Calc generally disallows R1C1 syntax and
`whole columns in formulas. Gnumeric and Google docs in
`turn permit both.
`Our formulas are designed to work in all above systems,
`but in some of them they require syntactical modifications,
`mainly switching to A1 and eliminating whole-column ranges.
`In Excel 2007 and 2010 there are new functions COUNTIFS
`and SUMIFS, which can substitute SUMPRODUCT in our appli-
`cations, and are much more efficient. E.g., the first of our
`example formulas can be replaced by
`=COUNTIFS(R1C1:R5C1,R1C3,R1C2:R5C2,R1C4),
`and the second by
`=SUMIFS(R1C5:R5C5,R1C1:R5C1,R1C3,R1C2:R5C2,R1C4).
`
`4. ARCHITECTURE OF A DATABASE IM-
`PLEMENTED IN A SPREADSHEET
`In this paper, we disregard a number of minor issues aris-
`ing in practical implementation of database operations in
`a spreadsheet. First of all, there is the obvious limitation
`of size on then number and sizes of relations, views and
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2238
`Page 3 of 12
`
`

`

`their intermediate results, imposed by the maximal available
`number of worksheets, columns and rows in the spreadsheet
`system at hand. Next, the size of the data values (integers,
`strings, etc.) is also limited. The variety of data types in
`spreadsheets is also restricted when compared to database
`systems.
`The overall architecture of a relational database imple-
`mented in a spreadsheet is as follows.
`Given specification of the database, an implementation of
`a database is created by an external program (which plays
`the role of query compiler), in the form of an .xls, .xlsx,
`.odc, etc., file.
`The whole resulting database is a workbook, consisting of
`one worksheet per data table and one worksheet per view in
`the database.
`The data table worksheets are where the data is entered,
`updated and deleted. In the case of the (more theoretical
`in flavor) implementation of the relational algebra, the data
`table sheets do not contain any formulas and are simply
`the place to enter tuples into relations. In the case of SQL
`implementation, the cells are equipped with data validation
`formulas, which perform data type verification, enforce PRI-
`MARY KEY, FOREIGN KEY and other integrity constraints in-
`cluded in the CREATE TABLE statements.
`The query (view) worksheets are not supposed to be edited
`by the user. They contain columns filled with formulas,
`which calculate the consecutive values of the result of the
`query. Besides the result columns of the query, the view
`worksheets can also contain a number of hidden columns,
`which calculate and store intermediate results emerging dur-
`ing query evaluation. It is important that the formulas are
`completely uniform in each column of the database work-
`book, and they do not depend on the data which will be
`stored in the application. Initially all formulas compute the
`empty string "" value, representing unused space. When the
`user manually enters data into the tables, the automatic re-
`computation of the spreadsheet causes the results of queries
`to be computed and appear in the view worksheets.
`
`5. THEORETICAL LEVEL: RELATIONAL
`ALGEBRA
`We assume the semantics over a fixed domain of (the
`spreadsheet’s implementation of) integers, so that a rela-
`tion is a set or multiset of tuples over the integers that are
`implemented in the spreadsheet software.
`
`5.1 Compositionality
`We assume the unnamed syntax for the relational algebra:
`relations and queries have columns, which are numbered and
`do not have any names. Sometimes we consider the expres-
`sions C1, C2, etc., as the names of the worksheet columns,
`as well as the names of the columns in relations.
`The representation of a relation r of arity n is a group of n
`consecutive columns in a worksheet, whose rows contain the
`tuples in the relation. The rows in which there are no tuples
`of r are assumed to be filled with the empty string formula
`="", evaluating to the empty string value "", which the user
`can replace by the new tuples of the relation. The empty
`string is never a component of a tuple in a relation or query.
`Therefore either all cells in a row contain the empty string,
`or none does. The rows of tables and queries evaluating to
`empty strings are called null rows henceforth.
`
`The assumption that ="" formulas fill the empty rows of
`data tables is only for uniformity of presentation. A for-
`mula in a cell can not evaluate to ”empty cell” (because the
`formula occupies that cell anyway), only to empty string.
`Therefore, if blank cells were used in empty rows, formulas
`expressing queries would have to be adapted to accept un-
`used space in two different forms: empty cells in data tables,
`and empty strings in results of other queries.
`The representation of a relational algebra query Q of arity
`m is a group of l + m consecutive columns in a worksheet.
`Initial segments of all its rows are filled with formulas (iden-
`tical in all cells of each column), which calculate the tuples
`in Q. We assume that the formulas in the last m columns
`should return either (a component of) a tuple in the result
`of Q, or the empty string value "". The additional l columns
`are also filled with identical formulas, which calculate inter-
`mediate results. A worksheet of this kind can be created
`by entering the formulas in the first row, and then filling
`them downward to span as many rows as necessary. This
`uniformity assumption means in particular, that the formu-
`las are completely independent of the data they will work
`on. However, we would like to stress that there is no reason
`to reject nonuniform implementations, should they prove to
`be more effective or permit expressing queries inexpressible
`in a uniform way.
`In the following we will consider both set and bag (mul-
`tiset) semantics of the relational algebra. In the first case,
`duplicate rows are not permitted in the relations and queries;
`in the latter they are permitted. However, even in the set se-
`mantics a spreadsheet representation of a relation may con-
`tain many null rows.
`Furthermore, the representation may be loose if null rows
`are interspersed with the tuples, or standard if all the tuples
`come first, followed by the null rows.
`Consequently, we have loose-set, loose-bag, standard-set
`and standard-bag semantics. No matter which of the above
`semantics we have in mind, the result of the query appears
`exactly as if it were a table, and can be used as such. Now
`the only thing necessary to compose queries is to locate
`their implementations side by side in a single worksheet and
`change input column numbers in the formulas computing
`the outermost query, to agree with the column numbers of
`the outputs of the argument queries (and then the output
`columns of the argument queries become the intermediate
`results columns of the composition).
`Therefore, queries represented in this way are composi-
`tional.
`Now it suffices to demonstrate that the each of the follow-
`ing relational algebra operators from [4] can be implemented
`in a spreadsheet:
`
`• Two operations specific to spreadsheets, absent in [4]:
`error trapping and standardization (converting to stan-
`dard form).
`
`• Sorting.
`
`• Duplicate removal δr.
`
`• Selection σθr.
`
`• Projection πi,j,...r.
`
`• Union r ∪ s.
`
`• Difference r \ s.
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2238
`Page 4 of 12
`
`

`

`• Cartesian product r × s.
`
`• Grouping with aggregation γLr, where L is one of: SUM,
`COUNT, AVG, MAX and MIN.
`
`5.2 Notation
`We present here implementations based on spreadsheet
`functions which are common to most of (or even all) spread-
`sheet systems. This way we believe to consider the spread-
`sheet paradigm, even if its definition is not yet formulated in
`the literature.
`We use the following convention for presenting implemen-
`tations:
`COLUMNS < =FORMULA
`means that =FORMULA is entered into each cell of the COLUMNS,
`which may be specified either to be a single column (e.g. C5
`) or a range of a few columns (e.g. C5:C7), or a single cell
`(e.g. R1C5), and in each case belongs to the columns with
`intermediate values. The notation
`COLUMNS << =FORMULA
`indicates that formulas located in COLUMNS calculate the out-
`put of the query.
`Sometimes the output columns are not specified, and then
`it is always indicated, that the final output is computed
`by applying another, already defined operation to some of
`the columns with intermediate results. At two occasions we
`even write SQL queries as contents of spreadsheet columns,
`understanding that the spreadsheet formulas implementing
`them are written there.
`Generally, we assume the arguments of the algebra oper-
`ators to be two- or three-ary relations or queries, the gener-
`alization to higher arities is straightforward.
`Except for the standardization and sorting, in all other
`cases we assume the input to be in the standard form, i.e.,
`null rows at the bottom.
`
`5.3 Error trapping and standardization
`In this section we describe two special purpose opera-
`tors, which perform common and useful tasks, specific to
`our spreadsheet environment.
`
`5.3.1 Error trapping
`If we replace a formula =F, which may produce an error,
`by the formula =IF(ISERROR(F),"",F), any error produced
`by =F is replaced by the empty string, and otherwise the
`value is the same as the value of =F.
`
`5.3.2 Standardization
`This operation converts a relation from loose to standard
`form, moving null rows to the bottom. The relative order of
`non-null rows is preserved. We assume that columns C1 and
`C2 contain the source data.
`C3 < =SUMPRODUCT((R1C1:RC1<>"")*1)
`counts the non-null rows above the present row, including
`the present one. This number is the row number to which
`the present row should be relocated. Note that multiplica-
`tion by 1 enforces boolean to integer conversion.
`C4 < =MATCH(ROW(),C3,0)
`returns the position of the first value in C3 equal to the
`present row number. If no match is found (i.e., we are in a
`row whose number is higher than the total number of non-
`null rows), an error is returned.
`C5:C6 << =IF(ISERROR(RC4),"",INDEX(C[-4],RC4))
`
`Errors are trapped, and when there is no error, INDEX
`returns the data from the suitable row of C[-4]. Thus the
`values from C1:C2 get relocated to their positions calculated
`in C3.
`
`5.4 Sorting
`Now we describe an implementation of sorting, which is a
`generalization of standardization. We assume that columns
`C1 and C2 contain the source data and we sort in ascending
`order by the values in C1.
`C3 < =SUMPRODUCT((C1<=RC1)*1)
`
`This puts in RiC3 the number of entries in column C1
`which are smaller than or equal to RiC1. "" compared by
`<= is larger than any number, so null rows do not give any
`errors, and in the following are treated as the largest entries.
`C4 < =RC3-SUMPRODUCT((C1<=RC1)*1)+1
`
`Now in RiC4 is the number of entries in column C1 which
`are either smaller than RiC1 or equal to it and located in the
`same row or above it. This is the number of the row into
`which RiC1 should be relocated during sort.
`C5:C6 << =INDEX(C[-4],MATCH(ROW(),C4,0))
`
`This part is very similar to the standardization solution,
`except that there are no errors to be trapped and we combine
`two formulas into one.
`Sorting in descending order is done by reverting the signs
`of numbers by the formula =IF(RC[-1]="";"";-RC[-1]) and
`sorting into ascending order, and then reverting the signs
`again. This leaves the null rows at the bottom. In partic-
`ular, if sorting is necessary there is no need to standardize
`first.
`An important property of this operation is that rows with
`empty string in the column on which the sort is performed,
`are moved to the bottom. Consequently, sorting brings any
`query or relation to standard form. Moreover, this form of
`sorting is stable.
`
`5.5 Duplicate removal
`Next we describe the implementation of duplicate removal,
`which, among other things, converts its input data from
`bag to set semantics. We assume the table to contain two
`columns C1:C2.
`C3 < =SUMPRODUCT((C1=RC1)*(C2=RC2))
`
`This causes RiC3 to contain the number of tuples from
`C1:C2 which are equal to RiC1:RiC2 and are located at the
`same level or above it. This number is 1 iff the row contains
`the first occurrence of this tuple.
`C4:C5 << =IF(RC3=1,RC[-3],"")
`
`Now the first occurrences of tuples are copied into C4:C5,
`the other are replaced by null rows.
`
`5.6 Selection
`Assume that we are given a relation r located in C1:C2
`and we want to compute σθr, where θ is a boolean combi-
`nation of equalities and inequalities concerning the values
`of columns of r and constants. Then we use a spreadsheet
`formula expressing θ to substitute "" for the rows which do
`not satisfy θ. This is best explained on an example:
`if θ
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2238
`Page 5 of 12
`
`

`

`is (C1 ≤ 100 ∧ C2 > C1) ∨ C2 6= 175, then the selection is
`implemented by
`C3:C4 << =IF(OR(AND(RC1<=100,RC2>RC1),
`RC2<>175),RC[-2],"")
`It leaves the result of the selection in a loose (set or bag,
`inherited from the input) form.
`
`5.7 Projection
`The case of projection is quite easy: it amounts to omit-
`ting some columns from the input relation/query.
`
`5.8 Union
`Assume that we are given two relations located in C1:C2
`and C3:C4, respectively.
`Then use the following formulas to calculate their union
`in standard bag form.
`R1C5 < =COUNT(C1)
`
`This is the number of non-null rows in C1.
`C6:C7 << =IF(ROW()<=R1C5,RC[-5],
`INDEX(C[-3],ROW()-R1C5))
`If the present row number is less than R1C5 then we take
`the same row from C1:C2, otherwise we take rows from C3:C4
`whose numbers are suitably shifted. Note that this assumes
`the inputs to be in standard (set or bag) form.
`
`5.9 Difference
`Assume that we are given two relations located in C1:C2
`and C3:C4, respectively. Then use the following formulas to
`calculate their set difference.
`C5 < =SUMPRODUCT((C3=RC1)*(C4=RC2))
`
`This calculates in RiC5 the number of times a tuple equal
`to RiC1:RiC2 appears in C3:C4.
`C6:C7 << =IF(RC5=0,RC[-5],"")
`
`Now if RiC5 is 0, we copy the row RiC1:RiC2 to the output,
`otherwise we replace it by a null row.
`The set/bag form of the result is inherited from the in-
`puts. However, this construction does not work for the
`bag format, since in this case we should count the copies
`of identical rows in both relations and put in the output a
`suitable number of such rows.
`The more complicated construction which does work is as
`follows:
`C5 < =SUMPRODUCT((C3=RC1)*(C4=RC2))
`
`This, exactly as before, calculates in RiC5 the number of
`times a tuple equal to RiC1:RiC2 appears in C3:C4.
`C6 < =SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2))
`
`Now we calculate in RiC6 the number of times a tuple
`equal to RiC1:RiC2 appears in C1:C2 in row i or above it.
`C7:C8 << =IF(RC5>=RC6;"";RC[-6])
`
`Now we replace by null rows the first RiC5 occurrences of
`tuple RiC1:RiC2, and leave unaffected the remaining ones,
`which gives the desired bag difference. The resulting relation
`is loose.
`
`5.10 Cartesian product
`Assume that two relations are located in C1:C2 and C3:C4,
`respectively.
`
`Then use the following formulas to calculate their Carte-
`sian product. The construction below works only for rela-
`tions in standard form, otherwise standardization is neces-
`sary first.
`R1C5 < =COUNT(C1)
`R2C5 < =COUNT(C3)
`
`We calculate the numbers of non-null rows in C1:C2 and
`C3:C4.
`C6:C7 << =IF(ROW()<=R1C5*R2C5,INDEX(C[-5],
`INT((ROW()-1)/R2C5)+1),"")
`creates R1C5 blocks, the i-th block being R2C5 copies of
`RiC1:RiC2.
`C8:C9 << =IF(ROW()<=R1C5*R2C5,INDEX(C[-5],
`MOD(ROW()-1,R2C5)+1),"")
`repeats in circular fashion the consecutive rows of C3:C4 a
`total of R1C5 rounds.
`Note that in this case, the set or bag form of the initial
`relations is inherited by their product.
`
`5.11 Grouping with aggregation
`In the following, we assume always the relation to be lo-
`cated in C1:C3, grouping done over C1:C2 and aggregation
`over C3.
`
`5.11.1 GROUP BY with SUM,COUNT and AVG
`C4 < =SUMPRODUCT((C1=RC1)*(C2=RC2)*C3)
`
`This array formula computes in RiC4 the sum of all RjC3
`over all j such that RjC1:RjC2 is equal to RiC1:RiC2. Now
`we do duplicate elimination over C1:C2 and C4 and that is
`the desired result. Alternatively, with
`C4 < =SUMPRODUCT((C1=RC1)*(C2=RC2))
`we get the count instead of sum aggregation.
`For average one has to compute GROUP BY with SUM and
`COUNT side-by-side and return the copy of C1:C2 plus the
`sum column divided by the count column.
`
`5.11.2 GROUP BY with MAX and MIN
`Let us consider MAX, the other being handled symmetri-
`cally. First, the whole relation is sorted into descending
`order by C3. On the result, elimination of duplicates is per-
`formed, which considers two rows identical when they agree
`on C1:C2. Our implementation of this operation eliminates
`all occurrences of a tuple except the very first one. The one
`left is therefore accompanied by the maximal value of C3, as
`desired.
`
`5.12 Additional operators
`Although semijoin and join are not on the list of algebra
`operators, they are extremely inefficient if implemented as a
`selection of a Cartesian product. Therefore we present more
`effective implementations of equisemijoin and equijoin.
`
`5.12.1 Semijoin
`Assume that we are given two relations located in C1:C2
`and C3:C4, respectively, and we wish to compute the equi-
`semijoin C1 : C2 ⋉C1=C3 C2 : C3.
`This is achieved in the following way. The formulas
`C5 < =IF(ISERROR(MATCH(RC1,C3,0),"",RC1)
`C6 < IF(RC5="","",RC2)
`empty the rows from C1:C2 which do not belong to the semi-
`join. The resulting relation is loose.
`
`Enfish, LLC; IPR2014-00574
`Exhibit 2238
`Page 6 of 12
`
`

`

`5.12.2 Join
`Let two relations be located in C1:C2 and C3:C4, respec-
`tively, and the equijoin C1 : C2 ⊲⊳C1=C3 C2 : C3 should be com-
`puted.
`We distinguish two cases here. First, that the tables C1:C2
`and C3:C4 represent a many-to-many relationship. The sec-
`ond, that C3 is a foreign key for C1:C2, so the relation is
`many-to-one. The former is much more general, the latter
`is much more frequent in normalized databases.
`
`Many-to-many.
`Our implementation utilizes a kind of index on the data
`tables. It is assumed that the tables are sorted on the C1
`and C3 columns, respectively. Then let us use the following
`formulas:
`C5:C6 < =SELECT C1,COUNT(*) FROM C1:C2 GROUP BY C1
`C7 < =IF(RC5="","",R[-1]C+RC6)
`R1C7 < =IF(RC5="","",1)
`
`The last two lines should be understood that we first fill
`the whole column C7 with formulas, and then change the
`formula in its first cell. The columns C5:C7 provide the
`index, where each key from C1 is associated with its number
`of occurrences and the location of its first occurrence in this
`column. An analogous index for C3:C4 is created in columns
`C8:C10.
`Now we create standardized semijoins
`C11:C13 < C5 : C7 ⋉C5=C8 C8
`C14:C16 < C8 : C10 ⋉C8=C10 C5
`which eliminate the elements from the index which will not
`appear in the join. The next steps are
`C17 < =IF(RC11="";"";RC12*RC15)
`C18 < IF(ISERROR(R[-1

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket