`
`Rakesh Agrawal
`
`Jerry Kiernan
`
`Ramakrishnan Srikant
`
`Yirong Xu
`
`IBM Almaden Research Center
`650 Harry Road, San Jose, CA 95120
`
`ABSTRACT
`Encryption is a well established technology for protecting sensi-
`tive data. However, once encrypted, data can no longer be easily
`queried aside from exact matches. We present an order-preserving
`encryption scheme for numeric data that allows any comparison op-
`eration to be directly applied on encrypted data. Query results pro-
`duced are sound (no false hits) and complete (no false drops). Our
`scheme handles updates gracefully and new values can be added
`without requiring changes in the encryption of other values. It al-
`lows standard database indexes to be built over encrypted tables
`and can easily be integrated with existing database systems. The
`proposed scheme has been designed to be deployed in application
`environments in which the intruder can get access to the encrypted
`database, but does not have prior domain information such as the
`distribution of values and cannot encrypt or decrypt arbitrary val-
`ues of his choice. The encryption is robust against estimation of the
`true value in such environments.
`
1. INTRODUCTION
`Database systems typically offer access control as the means to
`restrict access to sensitive data. This mechanism protects the pri-
`vacy of sensitive information provided data is accessed using the in-
`tended database system interfaces. However, access control, while
`important and necessary, is often insufficient. Attacks upon com-
`puter systems have shown that information can be compromised
`if an unauthorized user simply gains access to the raw database
`files, bypassing the database access control mechanism altogether.
`For instance, a recent article published in the Toronto Star [14] de-
`scribes an incident where a disk containing the records of several
`hundred bank customers was being auctioned on eBay. The bank
`had inadvertently sold the disk to the eBay re-seller as used equip-
`ment without deleting its contents. Drawing upon privacy legisla-
`tions and guidelines worldwide, Hippocratic databases also identify
`the protection of personal data from unauthorized acquisition as a
`vital requirement [1].
`
`
`Encryption is a well established technology for protecting sensi-
`tive data [7] [22] [24]. Unfortunately, the integration of existing
`encryption techniques with database systems causes undesirable
`performance degradation. For example, if a column of a table con-
`taining sensitive information is encrypted, and is used in a query
`predicate with a comparison operator, an entire table scan would
`be needed to evaluate the query. The reason is that the current en-
`cryption techniques do not preserve order and therefore database
`indices such as B-tree can no longer be used. Thus the query exe-
`cution over encrypted databases can become unacceptably slow.
`We present an encryption technique called OPES (Order Preserv-
`ing Encryption Scheme) that allows comparison operations to be
`directly applied on encrypted data, without decrypting the operands.
`Thus, equality and range queries as well as the MAX, MIN, and
`COUNT queries can be directly processed over encrypted data.
`Similarly, GROUP BY and ORDER BY operations can also be ap-
`plied. Only when applying SUM or AVG to a group do the values
`need to be decrypted. OPES is also endowed with the following
`properties:
• The results of query processing over data encrypted using
`OPES are exact. They neither contain any false positives nor
`miss any answer tuple. This feature of OPES sharply differ-
`entiates it from schemes such as [13] that produce a superset
`of answer, necessitating filtering of extraneous tuples in a
`rather expensive and complex post-processing step.
`
• OPES handles updates gracefully. A value in a column can
`be modified or a new value can be inserted in a column with-
`out requiring changes in the encryption of other values.
`
• OPES can easily be integrated with existing database sys-
`tems as it has been designed to work with the existing index-
`ing structures such as B-trees. The fact that the database is
`encrypted can be made transparent to the applications.
`
`Measurements from an implementation of OPES in DB2 show that
`the time and space overhead of OPES are reasonable for it to be
`deployed in real systems.
`1.1 Estimation Exposure
`The security of an encryption scheme is conventionally assessed
`by analyzing whether an adversary can find the key used for en-
`cryption. See [22] [24] for a categorization of different levels of
`attacks against a cryptosystem.
`When dealing with sensitive numeric data, an adversary does
`not have to determine the exact data value p corresponding to an
`
encrypted value c; a breach may occur if the adversary succeeds in obtaining a tight estimate of p. For a numeric domain P, if an adversary can estimate with c% confidence that a data value p lies within the interval $[p_1, p_2]$, then the interval width $(p_2 - p_1)/\text{domain-width}(P)$ defines the amount of estimation exposure at c% confidence level.

Clearly, any order-preserving encryption scheme is vulnerable to tight estimation exposure if the adversary can choose any number of unencrypted (encrypted) values of his liking and encrypt (decrypt) them into their corresponding encrypted (unencrypted) values. Similarly, any order-preserving encryption is not secure against tight estimation exposure if the adversary can guess the domain and knows the distribution of values in that domain.

We consider an application environment where the goal is safety from an adversary who has access to all (but only) encrypted values (the so called ciphertext-only attack [22] [24]), and does not have any special information about the domain. We will particularly focus on robustness against estimation exposure.

[Figure 1: Transparent encryption in the “trusted database software with vulnerable storage” setting.]

1.2 Threat Model
We assume (see Figure 1):

• The storage system used by the database software is vulnerable to compromise. While current database systems typically perform their own storage management, the storage system remains part of the operating system. Attacks against storage could be performed by accessing database files following a path other than through the database software, or in the extreme, by physical removal of the storage media.

• The database software is trusted. We trust the database software to transform query constants into their encrypted values and decrypt the query results. Similarly, we assume that an adversary does not have access to the values in the memory of the database software.

• All disk-resident data is encrypted. In addition to the data values, the database software also encrypts schema information such as table and column names, metadata such as column statistics, as well as values written to recovery logs. Otherwise, an adversary may be able to use this information to guess data distributions.
`
`1.3 Pedagogical Assumptions and Notations
`The focus of this paper is on developing order-preserving en-
`cryption techniques for numeric values and assumes conventional
`encryption [22] [24] for other data types as well as for encrypting
information such as schema names and metadata. We will sometimes refer to unencrypted data values as plaintext. Similarly, encrypted values will also be referred to as ciphertext.

We will assume that the database consists of a single table, which in turn consists of a single column. The domain of the column will be initially assumed to be a subset of integer values $[p_{min}, p_{max}]$. The extension for real values is given later in the paper.

Assume the database $\bar{P}$ consists of a total of $|\bar{P}|$ plaintext values. Out of these, $|P|$ values are unique, which will be represented as $P = \{p_1, p_2, \ldots, p_{|P|}\}$, $p_i < p_{i+1}$. The corresponding encrypted values will be represented as $C = \{c_1, c_2, \ldots, c_{|P|}\}$, $c_i < c_{i+1}$.
`Duplicates can sometimes be used to guess the distribution of a
`domain, particularly if the distribution is highly skewed. A closely
`related problem is that if the number of distinct values is small (e.g.,
`day of the month), it is easy to guess the domain. We will ini-
`tially assume that the domain to be encrypted either does not con-
`tain many duplicates or contains a distribution that can withstand a
`duplicate attack, and discuss the handling of duplicates later in the
`paper.
`1.4 Paper Layout
`The rest of the paper is organized as follows. We first discuss re-
`lated work in Section 2. We give an overview of OPES in Section 3.
`The next three sections give details of the three main phases of
`OPES. We describe extensions to handle real values and duplicates
`in Section 7. In Section 8, we study the quality of the encryption
`produced by OPES and present performance measurements from a
`DB2 implementation. We conclude with a summary and directions
`for future work in Section 9.
`
`2. RELATED WORK
`
Summation of Random Numbers
A simple scheme has been proposed in [3] that computes the encrypted value c of integer p as

$c = \sum_{j=1}^{p} R_j,$

where $R_j$ is the jth value generated by a secure pseudo-random number generator R. Unfortunately, the cost of making p calls to R for encrypting or decrypting c can be prohibitive for large values of p.
`A more serious problem is the vulnerability to estimation ex-
`posure. Since the expected gap between two encrypted values is
`proportional to the gap between the corresponding plaintext val-
`ues, the nature of the plaintext distribution can be inferred from the
`encrypted values. Figure 2 shows the distributions of encrypted val-
`ues obtained using this scheme for data values sampled from two
`different distributions: Uniform and Gaussian. In each case, once
`both the input and encrypted distributions are scaled to be between
`0 and 1, the number of points in each bucket is almost identical for
`the plaintext and encrypted distributions. Thus the percentile of a
`point in the encrypted distribution is also identical to its percentile
`in the plaintext distribution.
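To make the encryption and decryption cost concrete, here is a minimal sketch of the summation scheme, with Python's seeded random.Random standing in for the secure generator R (the function names and the increment range are our illustration, not taken from [3]):

```python
import random

def encrypt_sum(p: int, key: int) -> int:
    """c = R_1 + R_2 + ... + R_p; order-preserving because every R_j > 0."""
    R = random.Random(key)  # stand-in for a secure pseudo-random generator
    return sum(R.randint(1, 100) for _ in range(p))

def decrypt_sum(c: int, key: int) -> int:
    """Replay the generator, counting draws until the running sum reaches c."""
    R = random.Random(key)
    total, p = 0, 0
    while total < c:
        total += R.randint(1, 100)
        p += 1
    return p
```

Both directions make p calls to the generator, which is exactly the prohibitive cost noted above.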
`
[Figure 2: Summation of random numbers: Distribution of encrypted values tracks the input distribution. Panels: (a) Input: Uniform; (b) Input: Gaussian. Axes: Scaled Domain vs. Number of points; each panel shows the Original and Encrypted distributions.]

[Figure 3: Polynomial functions: Encryption of different input distributions look different. Panels: (a) Input: Uniform; (b) Input: Gaussian. Axes: Scaled Domain vs. Number of points; each panel shows the Original and Encrypted distributions.]
`
`Polynomial Functions
`In [12], a sequence of strictly increasing
`polynomial functions is used for encrypting integer values while
`preserving their order. These polynomial functions can simply be
`of the first or second order, with coefficients generated from the en-
`cryption key. An integer value is encrypted by applying the func-
`tions in such a way that the output of a function becomes the input
`of the next function. Correspondingly, an encrypted value is de-
`crypted by solving these functions in reverse order. However, this
`encryption method does not take the input distribution into account.
`Therefore the shape of the distribution of encrypted values depends
`on the shape of the input distribution, as shown in Figure 3 for the
`encryption function given in Example 10 in [12]. This illustration
`suggests that this scheme may reveal information about the input
`distribution, which can be exploited.
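The composition of increasing polynomials can be sketched as follows (a toy illustration of the idea, not the actual functions of [12]; the coefficient pairs stand in for key material):

```python
def poly_encrypt(p: int, key: list[tuple[int, int]]) -> int:
    """Chain strictly increasing polynomials: the output of one becomes
    the input of the next. a*x^2 + b*x is increasing for x >= 0 when
    a >= 0 and b > 0, so order is preserved end to end."""
    x = p
    for a, b in key:
        x = a * x * x + b * x
    return x

# Example: two stages derived from the key; order is preserved.
assert poly_encrypt(3, [(1, 2), (0, 5)]) < poly_encrypt(4, [(1, 2), (0, 5)])
```

Decryption solves the same polynomials in reverse order, as described above.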
`
`Bucketing
`In [13], tuples are encrypted using conventional en-
`cryption, but an additional bucket id is created for each attribute
`value. This bucket id, which represents the partition to which the
`unencrypted value belongs, can be indexed. The constants ap-
`pearing in a query are replaced by their corresponding bucket ids.
`Clearly, the result of a query will contain false hits that must be
`removed in a post-processing step after decrypting the tuples re-
`turned by the query. This filtering can be quite complex since the
`bucket ids may have been used in joins, subqueries, etc. The num-
`ber of false hits depends on the width of the partitions involved. It is
`shown in [13] that the post-processing overhead can become exces-
`sive if a coarse partitioning is used for bucketization. On the other
`hand, a fine partitioning makes the scheme vulnerable to estimation
`exposure, particularly if an equi-width partitioning is used.
`It has been pointed out in [6] that the indexes proposed in [13]
can open the door to inference and linking attacks. Instead, they
`
`build a B-tree over plaintext values, but then encrypt every tuple
`and the B-tree at the node level using conventional encryption. The
`advantage of this approach is that the content of B-tree is not visible
to an untrusted database server. The disadvantage is that the B-tree traversal can now be performed by the front end only by executing a sequence of queries that retrieve tree nodes at progressively deeper levels.
`
`Other Relevant Work Rivest et al. [21] suggest that the limit on
`manipulating encrypted data arises from the choice of encryption
`functions used, and there exist encryption functions that permit en-
`crypted data to be operated on directly for many sets of interesting
`operations. They call these functions “privacy homomorphisms”.
`The focus of [21] and the subsequent follow-up work [2] [8] [9]
`has been on designing privacy homomorphisms to enable arith-
`metic on encrypted data, but the comparison operations were not
`investigated in this line of research.
`In [10], a simple but effective scheme has been proposed to en-
`crypt a look-up directory consisting of (key, value) pairs. The goal
`is to allow the corresponding value to be retrieved if and only if
`a valid key is provided. The essential idea is to encrypt complete
`tuples, but associate with every tuple the one-way hash value of its
`key. Thus, no tuple will be retrieved if an invalid key is presented.
`Answering range queries was not a goal of this system.
`In [23], interesting schemes are proposed to support keyword
`searches over an encrypted text repository. The driving application
`for this work is the efficient retrieval of encrypted email messages.
`Naturally, they do not discuss relational queries and it is not clear
`how their techniques can be adapted for relational databases.
`In [4], a smart card with encryption and query processing ca-
`pabilities is used to ensure the authorized and secure retrieval of
`encrypted data stored on untrusted servers. Encryption keys are
`
maintained on the smart card. The smart card can translate exact match queries into equivalent queries over encrypted data. However, range queries require creating a disjunction for every possible value in the range, which is infeasible for real data values. The smart card implementation could benefit from our encryption scheme in that range queries could be translated into equivalent queries over encrypted data.

In [25], the security and tamper resistance of a database stored on a smart card is explored. They consider snooping attacks for secrecy, and spoofing, splicing, and replay attacks for tamper resistance. Retrieval performance is not the focus of their work and it is not clear how much of their techniques apply to general purpose databases not stored in specialized devices.

Amongst commercial database products, Oracle 8i allows values in any of the columns of a table to be encrypted [18]. However, the encrypted column can no longer participate in indexing as the encryption is not order-preserving.

Related work also includes research on order-preserving hashing [5] [11]. However, protecting the hash values from cryptanalysis is not the concern of this body of work. Similarly, the construction of original values from the hash values is not required.

3. PROPOSED ORDER-PRESERVING ENCRYPTION SCHEME

The basic idea of OPES is to take as input a user-provided target distribution and transform the plaintext values in such a way that the transformation preserves the order while the transformed values follow the target distribution. Figure 4 shows the result of
`running OPES with different input distributions and the same target
`distribution. Notice that the distribution of encrypted values looks
`identical in both 4(a) and 4(b), even though the input distributions
`were very different.
`3.1 Intuition
To understand the intuition behind the OPES algorithm, consider the
`following encryption scheme:
Generate $|P|$ unique values from a user-specified target distribution and sort them into a table T. The encrypted value $c_i$ of $p_i$ is then given by $c_i = T[i]$. That is, the ith plaintext value in the sorted list of $|P|$ plaintext values is encrypted into the ith value in the sorted list of $|P|$ values obtained from the target distribution. The decryption of $c_i$ requires a lookup into a reverse map. Here T is the encryption key that must be kept secret.
Clearly, this scheme does not reveal any information about the original values apart from the order, since the encrypted values were generated solely from the user-specified target distribution, without using any information from the original distribution. Even if an adversary has all of the encrypted values, he cannot infer T from those values. By appropriately choosing the target distribution, the adversary can be forced to make large estimation errors.
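A minimal sketch of this lookup-table scheme, under the assumption that only values present in P are ever encrypted (the helper names are ours):

```python
import bisect

def build_key(plaintexts, target_samples):
    """The secret key: sorted unique plaintexts P and a sorted table T of
    |P| unique values drawn from the target distribution."""
    P = sorted(set(plaintexts))
    T = sorted(set(target_samples))
    assert len(T) >= len(P), "need at least |P| unique target values"
    return P, T[:len(P)]

def encrypt(p, key):
    P, T = key
    return T[bisect.bisect_left(P, p)]  # c_i = T[i]

def decrypt(c, key):
    P, T = key
    return P[bisect.bisect_left(T, c)]  # lookup into the reverse map
```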
`This simple scheme, while instructive, has the following short-
`comings for it to be used for encrypting large databases:
`
• The size of the encryption key is twice as large as the number of unique values in the database.
`
• Updates are problematic. When adding a new value p, where $p_i < p < p_{i+1}$, we will need to re-encrypt all $p_j$, $j > i$.1
`
`OPES has been designed such that the result of encryption is sta-
`tistically indistinguishable from the one obtained using the above
`scheme, thereby providing the same level of security, while remov-
`ing its shortcomings.
`3.2 Overview of OPES
When encrypting a given database P, OPES makes use of all the plaintext values currently present in P,2 and also uses a database
`of sampled values from the target distribution. Only the encrypted
`database C is stored on disk. At the same time, OPES also creates
`some auxiliary information K, which the database system uses to
`decrypt encoded values or encrypt new values. Thus K serves the
`function of the encryption key. This auxiliary information is kept
`encrypted using conventional encryption techniques.
OPES works in three stages:
1. Model: The input and target distributions are modeled as piece-wise linear splines.
2. Flatten: The plaintext database P is transformed into a “flat” database F such that the values in F are uniformly distributed.
3. Transform: The flat database F is transformed into the cipher database C such that the values in C are distributed according to the target distribution.

Note that

$p_i < p_j \Leftrightarrow f_i < f_j \Leftrightarrow c_i < c_j.$

We give details of the three stages in Sections 4, 5 and 6 respectively.

1It is possible to avoid immediate re-encryption by choosing an encrypted value for p in the interval $(c_i, c_{i+1})$, but T would still need updating. Moreover, there might be cases where $c_{i+1} = c_i + 1$ and therefore inserting a new value will require re-encryption of existing values. Note that the encryption scheme $c_i = T[p_i]$ circumvents the update problem. But now the size of the key becomes the size of the domain. It is also vulnerable to percentile exposure, as discussed earlier in Section 2.
2If an installation is creating a new database, the database administrator can provide a sample of expected values.
`
[Figure 4: Illustrating OPES. Panels: (a) Input: Uniform, Target: Zipf; (b) Input: Gaussian, Target: Zipf. Axes: Scaled Domain vs. Number of points; each panel shows the Original, Target, and Encrypted distributions.]
`
`4. MODELING THE DISTRIBUTIONS
`Techniques for modeling data distributions have been studied ex-
`tensively in the database literature in the context of estimating the
`costs of different query execution plans. As stated in [16], there are
`two broad categories of techniques: histogram-based that capture
`statistical information about a distribution by means of counters
`for a specified number of buckets, and parametric that approximate
`a distribution by fitting the parameters of a given type of function.
`We experimented with several histogram-based techniques [15], in-
`cluding equi-depth, equi-width, and wavelet-based methods, but
`found that the flattened values obtained were not uniformly dis-
`tributed unless the number of buckets was selected to be unreason-
`ably large. The main source of the problem was the assumption that
the distribution is uniform within each bucket. Parametric techniques are
`suitable for closed-form distributions but lead to poor estimations
`for irregular distributions [16], which we expect to be the norm in
`our application.
`We, therefore, resorted to a combination of histogram-based and
`parametric techniques. As in [16], we first partition the data values
`into buckets and then model the distribution within each bucket as
a linear spline. The spline for a bucket $[p_l, p_h)$ is simply the line
`connecting the densities at the two end-points of the bucket.3
`We also allow the width of value ranges to vary across buck-
`ets. However, unlike [16], we do not have a given fixed number
`of buckets. Rather, we use the minimum description length (MDL)
`principle [20] to determine the number of buckets.
`
`4.1 Bucket Boundaries
`The bucket boundaries are determined in two phases:4
`
`1. Growth phase. The space is recursively split into finer par-
`titions. Each partitioning of a bucket reduces the maximum
`deviation from the density function within the newly formed
`buckets when compared to their parent bucket.
`
`2. Prune phase. Some buckets are pruned (merged into big-
`ger buckets). The idea is to minimize the number of buckets
`and yet have the values within buckets after mapping be uni-
`formly distributed. We use the MDL principle to obtain this
`balance.
`
`The details of these two phases are discussed next.
`
`3In [16], the splines are not continuous across buckets; they use lin-
`ear regression over data values present in a bucket for determining
`the spline. However, such discontinuities may cause undesirable
`breaks in the uniformity when we flatten plaintext values.
`4This procedure is reminiscent of the procedure for building deci-
`sion tree classifiers, and in particular SLIQ [17], but the details are
`quite different.
`
`4.2 Growth Phase
We are given a bucket $[p_l, p_h)$, with $h - l - 1$ (sorted) points: $\{p_{l+1}, p_{l+2}, \ldots, p_{h-1}\}$. We first find the linear spline for this bucket. Next, for each point $p_s$ in the bucket, we compute its expected value if the points were distributed according to the density distribution modeled by the linear spline (i.e., the expected value of the $(s-l)$th smallest value in a set of $h - l - 1$ random values drawn from the distribution). We then split the bucket at the
`point that has the largest deviation from its expected value (break-
`ing ties arbitrarily). We stop splitting when the number of points in
`a bucket is below some threshold, say, 10.
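A sketch of the growth phase is given below. For brevity it computes expected positions under a uniform density inside the bucket, where the algorithm above uses the fitted linear spline; the recursion and the split-at-largest-deviation rule are as described.

```python
def grow(points, threshold=10):
    """Recursively split a sorted bucket at the interior point with the
    largest deviation from its expected position, stopping when a bucket
    has fewer than `threshold` points."""
    n = len(points)
    if n < threshold:
        return [points]
    lo, hi = points[0], points[-1]
    # expected value of the k-th interior point under a uniform density
    def deviation(k):
        return abs(points[k] - (lo + k * (hi - lo) / (n - 1)))
    s = max(range(1, n - 1), key=deviation)  # split index, ties arbitrary
    return grow(points[:s + 1], threshold) + grow(points[s:], threshold)
```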
`
`4.3 Prune Phase
`The MDL principle [20] states that the best model for encoding
`data is the one that minimizes the sum of the cost of describing the
`model and the cost of describing the data in terms of that model.
For a given bucket $[p_l, p_h)$, the local benefit LB of splitting this bucket at a point $p_s$ is given by

$LB(p_l, p_h) = DataCost(p_l, p_h) - DataCost(p_l, p_s) - DataCost(p_s, p_h) - IncrModelCost$

where $DataCost(p_1, p_2)$ gives the cost of describing the data in the interval $[p_1, p_2)$ and IncrModelCost is the increase in modeling cost due to the partitioning of a bucket into two buckets.
The global benefit GB of splitting this bucket at $p_s$ takes into account the benefit of further recursive splits:

$GB(p_l, p_h) = LB(p_l, p_h) + GB(p_l, p_s) + GB(p_s, p_h).$

If $GB > 0$, the split is retained; otherwise, the split at $p_s$ and all recursive splits within $[p_l, p_h)$ are pruned. Note that we do this computation bottom up, and therefore the cost is linear in the number of splits.5
`We now provide the functions for the computation of DataCost
`and IncrModelCost. Assume momentarily the existence of a map-
`ping M that transforms values sampled from a linear density func-
`tion into a set of uniformly distributed values. We specify M in
`the next section. As we shall see, M will have two parameters: a
`quadratic coefficient and a scale factor.
`
`4.3.1 DataCost
`We want to flatten a given data distribution into a uniform distri-
`bution. So, given a bucket, we first flatten the values present in the
`bucket using the mapping M , and then compute the cost of encod-
`ing the deviations from uniformity for the mapped values.6
Let the set of data values $\{p_l, p_{l+1}, \ldots, p_{h-1}\}$ be mapped into $\{f_l, f_{l+1}, \ldots, f_{h-1}\}$ using M. The encoding of a value $p_i$
in $[p_l, p_h)$ would cost

$Cost(p_i) = \log |f_i - E_i|,$

where $E_i$, the expected value of the ith number assuming uniformity, is given by

$E_i = f_l + \frac{i - l}{h - l} (f_h - f_l).$

The cost of encoding all the values in the interval $[p_l, p_h)$ is then given by

$DataCost(p_l, p_h) = \sum_{i=l+1}^{h-1} Cost(p_i).$

5One might wonder why we did not combine pruning with the growth phase and stop splitting a bucket as soon as the local benefit became zero or negative. The reason is that the benefit of partitioning may start showing only at a finer granularity, and it will often be the case that the local benefit is less than zero even though the global benefit is much greater than zero.
6Note that our implementation encodes only the statistically significant deviations to avoid overfitting, i.e., rather than a single expected value, we consider the range of values that would occur with a uniform distribution, and only encode values that are outside this range. We omit this detail for brevity.

4.3.2 IncrModelCost
If we have m buckets, we need to store m + 1 boundaries, m quadratic coefficients, and m scale factors. Thus the model cost will be 32(3m + 1) bits, assuming 32 bits for each of these values. More importantly, the cost of an additional bucket is

$IncrModelCost = 3 \times 32 = 96.$
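A sketch of these cost computations (the helper names, the base-2 logarithm, and the +1 guard against log 0 are our choices):

```python
import math

INCR_MODEL_COST = 96  # one extra bucket: boundary, coefficient, scale factor

def data_cost(f):
    """Bits to encode deviations of the flattened values f[1..n-2] from
    uniformity between f[0] = f_l and f[-1] = f_h."""
    f_l, f_h, n = f[0], f[-1], len(f)
    cost = 0.0
    for i in range(1, n - 1):
        expected = f_l + i * (f_h - f_l) / (n - 1)   # E_i under uniformity
        cost += math.log2(abs(f[i] - expected) + 1)  # Cost(p_i)
    return cost

def local_benefit(f, s):
    """LB of splitting the flattened bucket f at interior index s."""
    return (data_cost(f) - data_cost(f[:s + 1])
            - data_cost(f[s:]) - INCR_MODEL_COST)
```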
`
`5. FLATTEN
`The overall idea of the flatten stage is to map a plaintext bucket B
into a bucket $B^f$ in the flattened space in such a way that the length of $B^f$ is proportional to the number of values present in B. Thus,
`the dense plaintext buckets will be stretched and the sparse buckets
`will be compressed. The values within a bucket are mapped in
`such a way that the density will be uniform in the flattened bucket.
`Since the densities are uniform both inter-bucket and intra-bucket,
`the values in the flattened database will be uniformly distributed.
`We specify next a mapping function that accomplishes these goals.
`5.1 Mapping Function
`
OBSERVATION 1. If a distribution over $[0, p_h]$ has the density function $qp + r$, where $p \in [0, p_h]$, then for any constant $z > 0$, the mapping function

$M(p) = z\left(\frac{q}{2r} p^2 + p\right)$

will yield a uniformly distributed set of values.

This follows from the fact that the slope of the mapping function at any point p is proportional to the density at p:7

$\frac{dM}{dp} = \frac{z}{r}(qp + r) \propto qp + r.$

We will refer to $s := q/2r$ as the quadratic coefficient. Thus

$M(p) = z(sp^2 + p).$

A different scale factor z is used for different buckets, to make the inter-bucket density uniform as well. We describe next how the scale factors are computed.

7An equivalent way to think about this is that the space around p, say from $p - \delta$ to $p + \delta$, is mapped to a length of

$M(p + \delta) - M(p - \delta) = \frac{2\delta z}{r}(qp + r) \propto qp + r.$
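A direct transcription of the bucket mapping and its inverse (the inverse simply solves the quadratic $z(sp^2 + p) = f$ for the non-negative root; the function names are ours):

```python
import math

def M(p: float, s: float, z: float) -> float:
    """Flatten a within-bucket offset p: M(p) = z*(s*p**2 + p)."""
    return z * (s * p * p + p)

def M_inv(f: float, s: float, z: float) -> float:
    """Unflatten: solve z*(s*p**2 + p) = f for the non-negative root."""
    if s == 0:
        return f / z  # linear case: density is constant in the bucket
    return (math.sqrt(1 + 4 * s * f / z) - 1) / (2 * s)
```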
`
5.2 Scale Factor
We need to find the scale factor z, one for each bucket B, such that:
1. Two distinct values in the plaintext will always map to two distinct values in the flattened space, thereby ensuring incremental updatability.
2. Each bucket is mapped to a space proportional to the number of points n in that bucket, i.e., if w is the width of the bucket and $w^f = M(w)$ is the width after flattening, then $w^f \propto n$.

The first constraint can be written as:

$\forall p \in [0, w): \; M(p + 1) - M(p) \geq 2.$

The 2 in the RHS (instead of 1) ensures two adjacent plaintext values will be at least 2 apart in the flattened space. As we will explain in Section 5.5, this extra separation makes encryption tolerant to rounding errors in floating point calculations. Expanding M, we get

$\forall p \in [0, w): \; z \geq \frac{2}{s(2p + 1) + 1}.$

The largest value of $2/(s(2p + 1) + 1)$ will be at $p = 0$ if $s \geq 0$, and at $p = w - 1$ otherwise. Therefore we get

$\hat{z} = \frac{2}{1 + s}$ if $s \geq 0$; $\quad \hat{z} = \frac{2}{1 + s(2w - 1)}$ otherwise,

where $\hat{z}$ denotes the minimum value of z that will satisfy the first constraint.
To satisfy the second constraint, we want

$w^f = Kn$

for all the buckets. Define

$\hat{w}^f = \hat{z}(sw^2 + w)$

as the minimum width for each bucket, and define

$K = \max_i \lceil \hat{w}^f_i / n_i \rceil, \quad i = 1, \ldots, m.$

Then the scale factors

$z = \frac{Kn}{sw^2 + w}$

will satisfy both the desired constraints, since $z \geq \hat{z}$, and $w^f = z(sw^2 + w) = Kn$.

5.3 Encryption Key
Let us briefly review what we have at this stage. The modeling phase has yielded a set of buckets $\{B_1, \ldots, B_m\}$. For each bucket, we also have a mapping function M, characterized by two parameters: the quadratic coefficient s and the scale factor z. We save the m + 1 bucket boundaries, the m quadratic coefficients, and the m scale factors in a data structure $K_f$. The database system uses $K_f$ to flatten (encrypt) new plaintext values, and also to unflatten (decrypt) a flattened value. Thus $K_f$ serves the function of the encryption key.
Note that $K_f$ is computed once at the time of initial encryption of the database. As the database obtains new values, $K_f$ is used to encrypt them, but it is not updated, which endows OPES with the incremental updatability property.8
8In order to encrypt stray values outside of the current
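Line by line, the scale-factor computation might look as follows (representing each bucket as a dict with width w, quadratic coefficient s, and point count n is our assumption):

```python
import math

def scale_factors(buckets):
    """buckets: list of dicts with width 'w', quadratic coefficient 's',
    and number of points 'n'. Adds the scale factor 'z' to each bucket."""
    for b in buckets:
        s, w = b["s"], b["w"]
        # minimum z ensuring M(p+1) - M(p) >= 2 for all p in [0, w)
        z_hat = 2 / (1 + s) if s >= 0 else 2 / (1 + s * (2 * w - 1))
        b["w_hat"] = z_hat * (s * w * w + w)   # minimum flattened width
    K = max(math.ceil(b["w_hat"] / b["n"]) for b in buckets)
    for b in buckets:
        b["z"] = K * b["n"] / (b["s"] * b["w"] ** 2 + b["w"])  # w^f = K*n
    return buckets
```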
`
`5.4 Mapping a Plaintext Value into a Flat Value
Represent the domains of the input database P and the flat database F as $[p_{min}, p_{max}]$ and $[f_{min}, f_{max}]$ respectively. Note that

$f_{max} = f_{min} + \sum_{i=1}^{m} w^f_i,$

where $w^f_i = M_i(w_i)$. Recall that $w_i$ is the length of plaintext bucket $B_i$, and $w^f_i$ the length of the corresponding flat bucket.
To flatten a plaintext value p, we first determine the bucket $B_i$ into which p falls, using the information about the bucket boundaries saved in $K_f$. Now p is mapped into the flat value f using the equation:

$f = f_{min} + \sum_{j=1}^{i-1} w^f_j + M_i(p - p^l_i),$

where $p^l_i$ denotes the left boundary of bucket $B_i$.

[Figure 5: Scaling the target distribution. The figure depicts three steps: 1. Compute the buckets used to transform the plaintext distribution into the flattened distribution. 2. From the target distribution, compute the buckets for the flattened distribution $\hat{B}^f$. 3. Scale the buckets of the flattened target distribution to equal the range of the flattened distribution computed in Step 1, and scale the target distribution proportionately.]
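Putting the pieces together, a sketch of flattening a single value; the layout of $K_f$ as parallel arrays of bucket boundaries, quadratic coefficients, scale factors, and flat widths is our assumption:

```python
import bisect

def flatten(p, Kf, f_min=0.0):
    """Map plaintext p to its flat value: sum the flat widths of all
    earlier buckets, then apply this bucket's mapping M_i to the offset
    of p from the bucket's left boundary."""
    i = bisect.bisect_right(Kf["boundaries"], p) - 1  # bucket containing p
    offset = p - Kf["boundaries"][i]                  # p minus left boundary
    s, z = Kf["s"][i], Kf["z"][i]
    return f_min + sum(Kf["wf"][:i]) + z * (s * offset * offset + offset)
```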
`
`6. TRANSFORM
`The transform stage is almost a mirror image of the flatten stage.
`Given a uniformly distributed set of flattened values, we want to
`map them into the target distribution. An equivalent way of think-
`ing about the problem is that we want to flatten the target distribu-
`tion into a uniform distribution, while ensuring that the distribution
`so obtained “lines-up” with the uniform distribution yielded by flat-
`tening the plaintext distribution.
`Figure 5 portrays this process. We already have on hand the
`buckets for the plain text distribution (Step 1). We bucketize the
`target distribution, independent of the bucketization of the plain-
`text distribution (Step 2). We then scale the target distribution (and
`the flattened target distribution) in such a way that the width of th