`
(19) World Intellectual Property Organization
International Bureau

(43) International Publication Date: 15 October 2015 (15.10.2015)
`
`
(10) International Publication Number: WO 2015/157020 A1
`
`(51) International Patent Classification:
`G06F 21/62 (2013.01)
`G06Q 10/04 (2012.01)
`
(74) Agents: SHEDD, Robert, D. et al.; 4 Research Way, 3rd Floor, Princeton, New Jersey 08540 (US).
`
(21) International Application Number: PCT/US2015/023336

(22) International Filing Date: 30 March 2015 (30.03.2015)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data: 61/978,260    11 April 2014 (11.04.2014)    US
`
(71) Applicant: THOMSON LICENSING [FR/FR]; 1-5 rue Jeanne d'Arc, F-92130 Issy-les-Moulineaux (FR).

(72) Inventors: KVETON, Branislav; 663 Helweh Ct., San Jose, California 95126 (US). SALAMATIAN, Salman; 42 avenue de Vessy, F-01210 Ferney-Voltaire (FR). FAWAZ, Nadia; 1531 Bellomy Street, Santa Clara, California 95050 (US). TAFT, Nina; 4265 24th Street, San Francisco, California 94114 (US).
`
(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IR, IS, JP, KE, KG, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.
`
(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, ST, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK,

(54) Title: METHOD AND APPARATUS FOR SPARSE PRIVACY PRESERVING MAPPING

[Continued on next page]
`
(57) Abstract: A user may wish to release some public data, which is correlated with his private data, to an analyst in the hope of getting some utility. The public data can be distorted before its release according to a probabilistic privacy preserving mapping mechanism, which limits information leakage under utility constraints. The present principles provide a solution to speed up the computation of privacy preserving mappings. In particular, we recognize that privacy preserving mappings are sparse, that is, public data of a user may only be mapped to a limited selection of data points with non-zero probabilities. Subsequently, we generate sparse privacy preserving mappings by recasting the optimization problem as a sequence of linear programs and solving each of these incrementally using an adaptation of Dantzig-Wolfe decomposition.
`
[FIG. 7 (cover figure, reference numerals 700, 705, 710, 720, 740): flow diagram - Initialization; Collect statistical information about public and private data; Determine a sparse privacy preserving mapping based on the statistical information and a utility constraint; Distort public data to be released to preserve privacy; Release data.]
`
`
`
`
`
SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, KM, ML, MR, NE, SN, TD, TG).

Published:
* with international search report (Art. 21(3))
`
`
`
`
`
`Method and Apparatus for Sparse Privacy Preserving Mapping
`
`
`CROSS-REFERENCE TO RELATED APPLICATIONS
`
[1]    This application is related to (1) U.S. Provisional Patent Application Serial No. 61/691,090, filed on August 20, 2012, and titled "A Framework for Privacy against Statistical Inference" (hereinafter "Fawaz"); (2) U.S. Provisional Patent Application Serial No. [...], titled "... Privacy Preserving Mapping against Inference Attacks" (hereinafter "Fawaz2"); and (3) U.S. Provisional Patent Application Serial No. 61/867,546, filed on August 19, 2013, and titled "Method and Apparatus for Utility-Aware Privacy Preserving Mapping through Additive Noise" (hereinafter "Fawaz3"). The provisional applications are expressly incorporated by reference herein in their entirety.
`
`TECHNICAL FIELD
`
[2]    This invention relates to a method and an apparatus for preserving privacy, and more particularly, to a method and an apparatus for generating a privacy preserving mapping at high speed.
`
`BACKGROUND
`
[3]    Finding the right balance between privacy risks and big data rewards is a big challenge facing society today. Big data creates tremendous opportunity, especially for all the services that offer advice on everything including movies, TV shows, restaurants, music, sleep, exercise, vacation, entertainment, shopping, and even friends. On one hand, people are willing to part with some of their personal data (e.g., movie watching history) for the sake of these services.
`
`
`The service, or other benefit that the user derives from allowing access to the user’s data may
`
`be referred to as utility. On the other hand, many users have some data about themselves
`
they would prefer to keep private (e.g., their political affiliation, salary, pregnancy status,
`
`religion). Most individuals have both public and private data and hence they need to
`
`maintain a boundary between these different elements of their personal information. This is
`
`an enormous challenge because inference analysis on publicly released data can often
`
`uncover private data.
`
SUMMARY
`
[4]    The present principles provide a method for processing user data for a user, comprising: accessing the user data, which includes private data and public data; determining a set of values that the public data of the user can map to, wherein the size of the set of values is small; determining a privacy preserving mapping that maps the public data to released data, wherein the public data of the user only maps to values within the determined set of values; modifying the public data of the user based on the privacy preserving mapping; and releasing the modified data as the released data to at least one of a service provider and a data collecting agency. The present principles also provide an apparatus for performing these steps.
`
[5]    The present principles also provide a method for processing user data for a first user and a second user, comprising: accessing the user data, which includes private data and public data; determining a first set of values that the public data of the first user can map to, wherein the size of the first set of values is on the order of magnitude of ten; determining a second set of values that the public data of the second user can map to, wherein the size of the determined second set of values is on the order of magnitude of ten, and the determined first set of values is different from the determined second set of values; determining a privacy preserving mapping that maps the public data to released data, wherein the public data of the first user only maps to values within the determined first set of values and the public data of the second user only maps to values within the determined second set of values; modifying the public data of the first user and the second user based on the privacy preserving mapping; and releasing the modified data as the released data to at least one of a service provider and a data collecting agency. The present principles also provide an apparatus for performing these steps.
`
[6]    The present principles also provide a computer readable storage medium having stored thereon instructions for processing user data according to the methods described above.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[7]
`
`FIG. 1A and FIG. 1B are pictorial examples illustrating exemplary privacy preserving
`
`mappings under small and large distortion constraints, respectively.
`
`[8]
`
`FIGs. 2A, 2B and 2C are pictorial examples illustrating effect of parameters on
`
`privacy-distortion tradeoff using synthetic data, census data and movie data, respectively.
`
[9]    FIGs. 3A, 3B and 3C are pictorial examples illustrating ROC Curves for a Naive Bayes Classifier with distortion constraint Δ = 0.02, 0.14 and 0.44, respectively, using census data.
`
[10]    FIGs. 4A, 4B and 4C are pictorial examples illustrating ROC Curves for Logistic Regression with distortion constraint Δ = 0.04, 0.13 and 0.22, respectively, using movie data.
`
`[11]
`
`FIG. 5 is a pictorial example illustrating behavior of Sparse Privacy Preserving
`
`Mappings (SPPM) and exponential mechanism (ExpMec) using synthetic data.
`
[12]    FIG. 6 is a pictorial example illustrating time complexity and parameter sensitivity
`
`
`using synthetic data.
`
`[13]
`
`FIG. 7 is a flow diagram depicting an exemplary method for preserving privacy, in
`
`accordance with an embodiment of the present principles.
`
`[14]
`
`FIG. 8 is a block diagram depicting an exemplary privacy agent, in accordance with
`
`an embodiment of the present principles.
`
`[15]
`
`FIG. 9 is a block diagram depicting an exemplary system that has multiple privacy
`
`agents, in accordance with an embodiment of the present principles.
`
`DETAILED DESCRIPTION
`
[16]    A number of research efforts have explored the idea of distorting public data before it is released in order to preserve the user's privacy. In some of these prior efforts, the distortion aims at creating confusion around the user data by making its value hard to distinguish from other possible values; in other efforts, the distortion is designed to counter a particular inference threat (i.e., a specific classifier or analysis). Recently, Fawaz proposed a new framework for data distortion, based on information theory, that captures privacy leakage in terms of mutual information, where mutual information is a measure of the mutual dependence between two random variables. Minimizing the mutual information between a user's private data and released data is attractive because it reduces the correlation between the private data and the publicly released data, and thus any inference analysis that tries to learn the private data from the publicly released data is rendered weak, if not useless. In other words, this approach is agnostic to the type of inference analysis used in any given threat.
`
`
`[17]
`
`In the present application, we refer to the data a user would like to remain private as
`
`“private data,” the data the user is willing to release as “public data,” and the data the user
`
`actually releases as “released data.” For example, a user may want to keep his political
`
`
`opinion private, and is willing to release his TV ratings with modification (for example, the
`
`user’s actual rating of a program is 4, but he releases the rating as 3).
`
`In this case, the user’s
`
`political opinion is considered to be private data for this user, the TV ratings are considered
`
`to be public data, and the released modified TV ratings are considered to be the released data.
`
`Note that another user may be willing to release both political opinion and TV ratings without
`
`modifications, and thus, for this other user, there is no distinction between private data, public
`
`data and released data when only political opinion and TV ratings are considered.
`
`If many
`
people release political opinions and TV ratings, an analyst may be able to derive the correlation between political opinions and TV ratings, and thus, may be able to infer the political opinion of the user who wants to keep it private.
`
`[18]
`
`The term analyst, which for example may be a part of a service provider’s system, as
`
`used in the present application, refers to a receiver of the released data, who ostensibly uses
`
`the data in order to provide utility to the user. Often the analyst is a legitimate receiver of
`
`the released data. However, an analyst could also illegitimately exploit the released data
`
`and infer some information about private data of the user. This creates a tension between
`
`privacy and utility requirements.
`
To reduce the inference threat while maintaining utility, the user may release a "distorted version" of the data, generated according to a conditional probabilistic mapping, called the "privacy preserving mapping" or "privacy mapping," designed under a utility constraint.
`
[19]    Private data refers to data that the user not only indicates should not be publicly released, but also does not want to be inferred from other data that he would release. Public data is data that the user would allow the privacy agent to release, possibly in a distorted way, to prevent the inference of the private data.
`
[20]    In one embodiment, public data is the data that the service provider requests from the user in order to provide him with the service. The user, however, will distort (i.e., modify) it before releasing it to the service provider. In another embodiment, public data is the data that the user indicates as being "public" in the sense that he would not mind releasing it, as long as the release takes a form that protects against inference of the private data.
`
[21]    As discussed above, whether a specific category of data is considered private data or public data depends on the point of view of a specific user. For ease of notation, we refer to a specific category of data as private data or public data from the perspective of the current user. For example, when designing a privacy preserving mapping for a current user who wants to keep his political opinion private, we refer to political opinion as private data both for the current user and for another user who is willing to release his political opinion.
`
`[22]
`
`In the present principles, we use the distortion between the released data and public
`
`data as a measure of utility. When the distortion is larger, the released data is more different
`
`from the public data, and more privacy is preserved, but the utility derived from the distorted
`
`data may be lower for the user. On the other hand, when the distortion is smaller, the
`
`released data is a more accurate representation of the public data and the user may receive
`
`more utility, for example, receive more accurate content recommendations.
`
`[23] Distorting data or modifying data, in the context of recommendation systems, means
`
`altering a user’s profile. The framework of Fawaz casts the privacy problem as a convex
`
`optimization problem with linear constraints, where the number of variables grows
`
quadratically with the size of the underlying alphabet that describes user profiles. When the alphabet size is huge, the enormous number of options for distorting user profiles presents a scalability challenge.
`
[24]    There have been existing works on protecting privacy against statistical inference. In a work by Salman Salamatian et al., titled "How to hide the elephant - or the donkey - in the room: Practical privacy against statistical inference for large data," IEEE GlobalSIP 2013, a method based on quantization was proposed to reduce the number of optimization variables. It was shown that the reduction in complexity does not affect the privacy levels that can be achieved, but comes at the expense of additional distortion. In Fawaz3, privacy mappings in the class of additive noise were considered. The parametric additive noise model makes it possible to reduce the number of optimization variables to the number of noise parameters. However, this suboptimal solution is not suitable for perfect privacy, as it requires a high distortion.
`
[25]    The information theoretic framework of Fawaz relies on a local privacy setting, where users do not trust the analyst collecting data; thus each user holds his data locally and passes it through a privacy preserving mechanism before releasing it to the untrusted analyst. Local privacy dates back to randomized response in surveys, and has been considered in privacy for data mining and statistics. Information theoretic privacy metrics have also been considered. Finally, differential privacy is currently the prevalent notion of privacy in privacy research. In the present application, we compare our privacy mapping against the exponential mechanism (ExpMec).
`
[26]    The general problem of minimizing a convex function under convex constraints has been studied extensively, and is of central importance in many machine learning tasks. The idea of sparse approximate solutions to such problems has also been studied in the literature and is often called Sparse Greedy Approximation. This type of algorithm has been used with success in many applications such as neural networks, matrix factorization, SVMs, boosting, and others. Perhaps the most basic form of Sparse Greedy Approximation arises when using the Frank-Wolfe algorithm for a convex problem over the simplex.
`
[27]    In the present application, we consider the setting described in Fawaz, where a user has two kinds of data: a vector of personal data $A \in \mathcal{A}$ that he would like to remain private, e.g., his income level or his political views, and a vector of data $B \in \mathcal{B}$ that he is willing to release publicly and from which he will derive some utility, e.g., the release of his media preferences to a service provider would allow the user to receive content recommendations. $\mathcal{A}$ and $\mathcal{B}$ are the sets from which $A$ and $B$ can assume values. We assume that the user's private data $A$ is linked to his public data $B$ by the joint probability distribution $p_{A,B}$. Thus, an analyst who observes $B$ could infer some information about $A$. To reduce this inference threat, instead of releasing $B$, the user releases a distorted version of $B$, denoted $\hat{B} \in \hat{\mathcal{B}}$, where $\hat{\mathcal{B}}$ is the set from which $\hat{B}$ can assume values, generated according to a conditional probabilistic mapping $p_{\hat{B}|B}$, called the privacy preserving mapping. Note that the set $\hat{\mathcal{B}}$ may differ from $\mathcal{B}$. This setting is reminiscent of the local privacy setting (e.g., randomized response, input perturbation), where users do not trust the analyst collecting data; thus each user holds his data locally and passes it through a privacy preserving mechanism before releasing it.
`
[28]    The privacy mapping $p_{\hat{B}|B}$ is designed to render any statistical inference of $A$ based on the observation of $\hat{B}$ harder, while preserving some utility of the released data $\hat{B}$, by limiting the distortion caused by the mapping. Following the framework for privacy-utility against statistical inference in Fawaz, the inference threat is modeled by the mutual information $I(A;\hat{B})$ between the private data $A$ and the publicly released data $\hat{B}$, while the utility requirement is modeled by a constraint on the average distortion $\mathbb{E}_{B,\hat{B}}[d(B,\hat{B})] \le \Delta$, for some distortion metric $d: \mathcal{B} \times \hat{\mathcal{B}} \to \mathbb{R}^+$ and $\Delta \ge 0$. Note that this general framework does not assume a particular inference algorithm. In the case of perfect privacy ($I(A;\hat{B}) = 0$), the privacy mapping $p_{\hat{B}|B}$ renders the released data $\hat{B}$ statistically independent of the private data $A$.
`
`
[29]    Both the mutual information $I(A;\hat{B})$ and the average distortion $\mathbb{E}_{B,\hat{B}}[d(B,\hat{B})]$ depend on both the prior distribution $p_{A,B}$ and the privacy mapping $p_{\hat{B}|B}$, since $A \to B \to \hat{B}$ form a Markov chain. To stress that $I(A;\hat{B})$ depends on both $p_{A,B}$ and $p_{\hat{B}|B}$, we will write $I(A;\hat{B})$ as $J(p_{A,B}, p_{\hat{B}|B})$. Consequently, given a prior $p_{A,B}$ linking the private data $A$ and the public data $B$, the privacy mapping $p_{\hat{B}|B}$ minimizing the inference threat subject to a distortion constraint is obtained as the solution to the following convex optimization problem:

$$\underset{p_{\hat{B}|B} \in \text{Simplex}}{\text{minimize}} \; J(p_{A,B}, p_{\hat{B}|B}) \quad \text{s.t.} \quad \mathbb{E}_{p_{\hat{B}|B}\, p_B}[d(B,\hat{B})] \le \Delta, \qquad (1)$$

where Simplex denotes the probability simplex ($\sum_x p(x) = 1$, $p(x) \ge 0\ \forall x$).
`
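For illustration, the two quantities appearing in problem (1) can be evaluated directly from the prior $p_{A,B}$ and a candidate mapping $p_{\hat{B}|B}$. The following is a minimal NumPy sketch (the function name, array layout, and shapes are assumptions for exposition, not part of the disclosed method): it computes the leakage $I(A;\hat{B})$ and the average distortion $\mathbb{E}[d(B,\hat{B})]$.

```python
import numpy as np

def leakage_and_distortion(p_ab, p_bhat_given_b, d):
    """Evaluate the objective and constraint of problem (1) for a candidate mapping.

    p_ab:           |A| x |B| joint distribution p_{A,B}
    p_bhat_given_b: |Bhat| x |B| conditional mapping p_{Bhat|B}; each column sums to 1
    d:              |B| x |Bhat| distortion matrix with d[j, i] = d(b_j, bhat_i)
    Returns (I(A; Bhat) in bits, E[d(B, Bhat)]).
    """
    p_b = p_ab.sum(axis=0)                          # marginal p_B
    p_a_bhat = p_ab @ p_bhat_given_b.T              # joint p_{A,Bhat} via A -> B -> Bhat
    p_a = p_a_bhat.sum(axis=1, keepdims=True)
    p_bhat = p_a_bhat.sum(axis=0, keepdims=True)
    mask = p_a_bhat > 0
    ratio = np.divide(p_a_bhat, p_a * p_bhat, out=np.ones_like(p_a_bhat), where=mask)
    leakage = np.sum(p_a_bhat[mask] * np.log2(ratio[mask]))
    distortion = np.sum(p_b[None, :] * p_bhat_given_b * d.T)   # E_{B,Bhat}[d(B, Bhat)]
    return leakage, distortion
```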
`
[30]    The present principles propose methods to reduce computation complexity when designing privacy preserving mappings. In particular, we propose to exploit the structure of the optimization problem to achieve a computational speedup that allows scalability. To be more precise, we solve the above problem efficiently when the sets $\mathcal{B}$ and $\hat{\mathcal{B}}$ are large. In one embodiment, by studying smaller scale problems, both analytically and empirically, we identify that the mappings used to distort profiles are in fact naturally sparse. We leverage this observation to develop sparse privacy preserving mappings (SPPM) that handle scalability. Although the underlying optimization problem has linear constraints, its objective function is non-linear. We use the Frank-Wolfe algorithm, which approximates the objective via a sequence of linear approximations, and this allows solving the problem as a sequence of linear programs that can be solved quickly. In addition, we limit the number of alternate user profiles to a small number; this can be practical as long as the alternates are selected smartly. To do this, we adapt the Dantzig-Wolfe decomposition to the structure of our problem. Overall, we reduce the number of variables from quadratic to linear in the number of user profiles. To the best of our knowledge, this work is the first to apply large scale linear programming optimization techniques to privacy problems.
`
`[31] We also provide a detailed evaluation on three datasets, in which we compare our
`
solution to an optimal one (when feasible) and to a state-of-the-art solution based on
`
`differential privacy (called the Exponential Mechanism). We find that our solutions are
`
`close to optimal, and consistently outperform the exponential mechanism (ExpMec) approach
`
`in that we achieve more privacy with less distortion. We show that our methods scale well
`
`with respect to the number of user profiles and their underlying alphabet.
`
`[32]
`
`In the following, the sparsity property of the privacy preserving mappings, the
`
`proposed sparse privacy preserving mappings and evaluation results will be discussed in great
`
`detail.
`
`[33]
`
`Sparsity of the Privacy Preserving Mappings
`
[34]    When applying the aforementioned privacy-utility framework to large data, we encounter a challenge of scalability. Designing the privacy mapping requires characterizing the value of $p_{\hat{B}|B}(\hat{b}|b)$ for all possible pairs $(b,\hat{b}) \in \mathcal{B} \times \hat{\mathcal{B}}$, i.e., solving the convex optimization problem over $|\mathcal{B}||\hat{\mathcal{B}}|$ variables. When $\hat{\mathcal{B}} = \mathcal{B}$ and the size of the alphabet $|\mathcal{B}|$ is large, solving the optimization over $|\mathcal{B}|^2$ variables may become intractable.
`
[35]    A natural question is whether $\hat{\mathcal{B}} = \mathcal{B}$ is a necessary assumption to achieve the optimal privacy-utility tradeoff. In other words, does the optimization problem (1) need to be solved over $|\mathcal{B}|^2$ variables to achieve optimality? We use the following theoretical example to motivate a sparser approach to the design of the privacy mapping.
`
[36]    Example 1: Let $A \in \{0,1\}$ and $B \in \{1, 2, \dots, 2^m\}$, and define the joint distribution $p_{A,B}$ such that $p(A=0) = p(A=1) = \frac{1}{2}$, and for $i \in \{1, 2, \dots, 2^m\}$, let $p(B=i|A=0) = \frac{1}{2^{m-1}}$ if $i \le 2^{m-1}$, 0 otherwise; and let $p(B=i|A=1) = \frac{1}{2^{m-1}}$ if $i > 2^{m-1}$, 0 otherwise. For this example, the privacy threat is the worst it could be, as observing $B$ deterministically determines the value of the private random variable $A$ (equivalently, $I(A;B) = H(A) = 1$). In FIG. 1, we consider an $\ell_2$ distortion measure $d(B,\hat{B}) = (B - \hat{B})^2$ and illustrate the optimal mappings solving Problem (1) for different distortion values. Darker colors mean that the probability of mapping the corresponding point to the other one is larger. For small distortions, the blackest diagonal in FIG. 1A shows that most points $B = b$ are only mapped to themselves $\hat{B} = b$ (i.e., with a mapping probability of 100%), and only a small number of points in $\mathcal{B}$ (around $B = 65$) may be mapped to different points in $\hat{\mathcal{B}}$. That is, for a given data point in $\mathcal{B}$, the privacy preserving mapping only chooses from a small set of values from $\hat{\mathcal{B}}$, rather than from the entire set $\hat{\mathcal{B}}$. As we increase the distortion level, more points in $\mathcal{B}$ get mapped to a larger number of points in $\hat{\mathcal{B}}$. In both FIG. 1A and FIG. 1B, we notice that the optimal mappings (i.e., the points shown on the curves) only occupy a very small portion of the 2-D space $\mathcal{B} \times \hat{\mathcal{B}}$. Thus, we consider the optimal privacy preserving mapping to be sparse.
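Example 1 is easy to reproduce numerically. The short sketch below (an illustration only; the value of m is chosen arbitrarily) builds the joint distribution $p_{A,B}$ of Example 1 and confirms that $I(A;B) = H(A) = 1$ bit, i.e., observing B reveals A completely.

```python
import numpy as np

m = 4                                    # illustrative choice; B takes values in {1, ..., 2^m}
n = 2 ** m
p_ab = np.zeros((2, n))                  # rows: A in {0, 1}; columns: B in {1, ..., 2^m}
p_ab[0, : n // 2] = 0.5 / (n // 2)       # p(A=0) = 1/2, B uniform over the first half
p_ab[1, n // 2 :] = 0.5 / (n // 2)       # p(A=1) = 1/2, B uniform over the second half

# I(A;B) = sum p(a,b) log2( p(a,b) / (p(a) p(b)) ); here B determines A, so I(A;B) = H(A) = 1
p_a = p_ab.sum(axis=1, keepdims=True)
p_b = p_ab.sum(axis=0, keepdims=True)
mask = p_ab > 0
mutual_info = np.sum(p_ab[mask] * np.log2((p_ab / (p_a * p_b))[mask]))
print(mutual_info)                       # 1.0
```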
`
[37]    This theoretical example, as well as experiments on other datasets such as the census data, shows that the optimal privacy preserving mapping may be sparse, in the sense that the support (i.e., the set of points $\hat{B} = \hat{b}$ to which $B = b$ can be mapped with a non-zero probability) of $p_{\hat{B}|B}(\hat{b}|b)$ may be of much smaller size than $\hat{\mathcal{B}}$, and may differ for different values of $B = b$. We propose to exploit the sparsity properties of the privacy mapping to speed up the computation, by choosing the support of $p_{\hat{B}|B}(\hat{b}|b)$ in an iterative, greedy way.
`
`[38] Using an example, we explain some of the notations used in the present application.
`
`
We assume a group of ten people who consider age and income as private data, and consider gender and education as public data. We call age, income, gender, or education an attribute of a user, wherein the group of people may have the following values for the attributes:

- age: {20-30, 30-40, 40-50};
- income: {≤50K, >50K};
- gender: {male, female};
- education: {high school, bachelor, postgraduate}.

In this example, private data A is a vector of random variables {age, income}, and public data B is a vector of random variables {gender, education}. For a particular person in this group, who is a 29-year-old woman with a bachelor's degree and makes more than 50K, her private data is a = (age: 20-30, income: >50K), and her public data is b = (gender: female, education: bachelor). Her user profile can be set to (gender: female, education: bachelor). The set $\mathcal{B}$ may comprise {(male, high school), (male, bachelor), (male, postgraduate), (female, high school), (female, bachelor), (female, postgraduate)}. The set $\mathcal{B}$ may also be smaller; for example, $\mathcal{B}$ may comprise {(male, high school), (male, bachelor), (male, postgraduate), (female, bachelor), (female, postgraduate)} if every woman in this group of people has a bachelor's or postgraduate degree. Each element (for example, (male, bachelor)) of $\mathcal{B}$ is also referred to as a data point or a possible value of B. The set $\hat{\mathcal{B}}$ may be identical to $\mathcal{B}$. In order to protect his or her age/income information, a person may modify the gender and/or education information, such that he or she appears as another person.
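In this example, the alphabet $\mathcal{B}$ is simply the Cartesian product of the public attribute values, and the release alphabet may coincide with it or differ from it. A small sketch (the attribute lists come from the example above; the variable names are illustrative):

```python
from itertools import product

gender = ["male", "female"]
education = ["high school", "bachelor", "postgraduate"]

# Public-data alphabet B: every (gender, education) combination
B = list(product(gender, education))                         # 6 data points
# The alphabet may be smaller, e.g. if every woman in the group
# has a bachelor's or postgraduate degree
B_small = [b for b in B if b != ("female", "high school")]   # 5 data points
# The release alphabet B_hat may be identical to B (or differ from it)
B_hat = list(B)

print(len(B), len(B_small), len(B_hat))                      # 6 5 6
```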
`
[39]

Sparse Privacy Preserving Mapping (SPPM)
`
[40]    Before we describe our algorithm, we rewrite the optimization problem (1). Let $X$ be an $n \times n$ matrix of variables to be optimized, whose entries are defined as $x_{i,j} = p_{\hat{B}|B}(\hat{b}_i|b_j)$, and let $x_j$ be the $j$-th column of $X$, where $n$ is the cardinality of $\mathcal{B}$ (i.e., $n = |\mathcal{B}|$, also referred to as the alphabet size). Then the objective function $J(p_{A,B}, p_{\hat{B}|B})$ can be written as $f(X)$, some function of $X$, and the distortion constraint can be written as $\sum_{j=1}^{n} d_j^T x_j \le \Delta$, where each

$$d_j = p_B(b_j)\,[\,d(\hat{b}_1, b_j)\;\; d(\hat{b}_2, b_j)\;\; \cdots\;\; d(\hat{b}_n, b_j)\,]^T$$

is a vector of length $n$ that represents the distortion metric scaled by the probability of the corresponding symbol $b_j$. The marginal of $B$ is computed as $p_B(b_j) = \sum_a p_{A,B}(a, b_j)$. Finally, the simplex constraint can be written as $\mathbf{1}_n^T x_j = 1$ for all $j$, where $\mathbf{1}_n$ is an all-ones vector of length $n$. Given the new notation, optimization problem (1) can be written as:

$$\begin{aligned} \underset{X}{\text{minimize}} \quad & f(X) \\ \text{subject to} \quad & \sum_{j=1}^{n} d_j^T x_j \le \Delta \\ & \mathbf{1}_n^T x_j = 1 \quad \forall j = 1, \dots, n \\ & X \ge 0 \end{aligned} \qquad (2)$$

where $X \ge 0$ is an entry-wise inequality.
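For illustration, the quantities in problem (2) reduce to a few array operations. The sketch below (an assumed layout for exposition, not the disclosed implementation) builds the scaled distortion vectors $d_j$ from $p_{A,B}$ and a distortion matrix, and checks whether a candidate matrix $X$ satisfies the constraints of (2).

```python
import numpy as np

def build_dj(p_ab, d):
    """Stack d_j = p_B(b_j) * [d(bhat_1, b_j), ..., d(bhat_n, b_j)]^T as columns.

    p_ab: |A| x n joint distribution p_{A,B}
    d:    n x n matrix with d[i, j] = d(bhat_i, b_j)
    """
    p_b = p_ab.sum(axis=0)          # marginal p_B(b_j)
    return d * p_b[None, :]         # column j scaled by p_B(b_j)

def is_feasible(X, D, delta, tol=1e-9):
    """Check the constraints of problem (2): distortion budget, simplex columns, X >= 0."""
    distortion_ok = np.sum(D * X) <= delta + tol        # sum_j d_j^T x_j <= Delta
    simplex_ok = np.allclose(X.sum(axis=0), 1.0)        # 1^T x_j = 1 for all j
    nonneg_ok = np.all(X >= -tol)
    return distortion_ok and simplex_ok and nonneg_ok
```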
`
[41]    The optimization problem (1) has linear constraints but its objective function is non-linear because of the way mutual information is computed. In one embodiment, we solve the problem as a sequence of linear programs, also known as the Frank-Wolfe method. Each iteration $\ell$ of the method consists of three major steps. First, we compute the gradient $\nabla_X f(X_{\ell-1})$ at the solution from the previous step $X_{\ell-1}$. The gradient is an $n \times n$ matrix $C$, where $c_{i,j} = \frac{\partial}{\partial x_{i,j}} f(X_{\ell-1})$ is the partial derivative of the objective function with respect to the variable $x_{i,j}$. Second, we find a feasible solution $X'$ in the direction of the gradient. This problem is solved as a linear program with the same constraints as the optimization problem (1):

$$\begin{aligned} \underset{X}{\text{minimize}} \quad & \sum_{j=1}^{n} c_j^T x_j \\ \text{subject to} \quad & \sum_{j=1}^{n} d_j^T x_j \le \Delta \\ & \mathbf{1}_n^T x_j = 1 \quad \forall j = 1, \dots, n \\ & X \ge 0 \end{aligned} \qquad (3)$$

where $c_j$ is the $j$-th column of $C$. Finally, we find the minimum of $f$ between $X_{\ell-1}$ and $X'$, denoted $X_\ell$, and make it the current solution. Since $f$ is convex, this minimum can be found efficiently by ternary search. The minimum is also feasible because the feasible region is convex, and both $X'$ and $X_{\ell-1}$ are feasible.
`
`
`
Algorithm 1 SPPM: Sparse privacy preserving mappings

Input: Initial feasible point $X_0$; number of linearization steps $L$
for all $\ell = 1, 2, \dots, L$ do
    $C \leftarrow \nabla_X f(X_{\ell-1})$
    $\hat{V} \leftarrow \mathrm{DWD}$
    Find a feasible solution $X'$ in the direction of the gradient $C$:
    $$\begin{aligned} \underset{X}{\text{minimize}} \quad & \sum_{j=1}^{n} c_j^T x_j \\ \text{subject to} \quad & \sum_{j=1}^{n} d_j^T x_j \le \Delta \\ & \mathbf{1}_n^T x_j = 1 \quad \forall j \\ & X \ge 0 \\ & x_{i,j} = 0 \quad \forall (i,j) \notin \hat{V} \end{aligned} \qquad (4)$$
    Find the minimum of $f$ between $X_{\ell-1}$ and $X'$:
    $$\gamma^* \leftarrow \underset{\gamma \in [0,1]}{\arg\min}\; f\big((1-\gamma) X_{\ell-1} + \gamma X'\big) \qquad (5)$$
    $X_\ell \leftarrow (1 - \gamma^*) X_{\ell-1} + \gamma^* X'$
end for
Output: Feasible solution $X_L$
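Algorithm 1 can be summarized compactly once the linearized subproblem is treated as a black box. The sketch below is a simplified illustration, not the disclosed implementation: `f`, `grad_f`, and `solve_direction_lp` are assumed to be supplied by the caller (the last one standing in for the DWD step and the linear program (4)), and the line search (5) is done by ternary search, which is valid because f is convex along the segment between the two feasible points.

```python
import numpy as np

def ternary_search_step(f, X_prev, X_new, iters=50):
    """Minimize the convex function g(v) = f((1 - v) * X_prev + v * X_new) over v in [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f((1 - m1) * X_prev + m1 * X_new) < f((1 - m2) * X_prev + m2 * X_new):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def sppm(f, grad_f, solve_direction_lp, X0, num_steps):
    """Frank-Wolfe outer loop of Algorithm 1 (SPPM), with the LP step left abstract."""
    X = X0
    for _ in range(num_steps):
        C = grad_f(X)                      # gradient of the objective at X_{l-1}
        X_dir = solve_direction_lp(C)      # feasible point minimizing <C, X'> over the active set
        v = ternary_search_step(f, X, X_dir)
        X = (1 - v) * X + v * X_dir        # X_l
    return X
```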
`
`
`
`
`
Algorithm 2 DWD: Dantzig-Wolfe decomposition

Initialize the set of active variables: $\hat{V} \leftarrow \{(1,1), (2,2), \dots, (n,n)\}$
while the set $\hat{V}$ grows do
    Solve the master problem for $\lambda^*$ and $\mu^*$:
    $$\begin{aligned} \underset{\lambda, \mu}{\text{maximize}} \quad & \lambda \Delta + \sum_{j=1}^{n} \mu_j \\ \text{subject to} \quad & \lambda \le 0 \\ & \lambda d_{i,j} + \mu_j \le c_{i,j} \quad \forall (i,j) \in \hat{V} \end{aligned} \qquad (6)$$
    for all $j = 1, 2, \dots, n$ do
        Find the most violated constraint in the master problem for fixed $j$:
        $$i^* = \underset{i}{\arg\min}\; c_{i,j} - \lambda^* d_{i,j} - \mu_j^* \qquad (7)$$
        if $(c_{i^*,j} - \lambda^* d_{i^*,j} - \mu_j^* < 0)$ then $\hat{V} \leftarrow \hat{V} \cup (i^*, j)$ end if
    end for
end while
Output: Active variables $\hat{V}$
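The pricing step (7) of Algorithm 2 amounts to an argmin of reduced costs in each column. A vectorized sketch (illustrative only; it assumes the master problem (6) has already been solved elsewhere for the dual values lambda and mu):

```python
import numpy as np

def grow_active_set(C, D, lam, mu, active):
    """One pass of the pricing loop in Algorithm 2.

    C, D:   n x n matrices of costs c_{i,j} and scaled distortions d_{i,j}
    lam:    dual variable for the distortion constraint (lam <= 0)
    mu:     length-n dual vector for the per-column simplex constraints
    active: set of active (i, j) pairs; grown when a violated constraint
            c_{i,j} - lam * d_{i,j} - mu_j < 0 is found
    Returns True if the active set grew.
    """
    reduced = C - lam * D - mu[None, :]      # reduced costs for all (i, j)
    grew = False
    for j in range(C.shape[1]):
        i_star = int(np.argmin(reduced[:, j]))
        if reduced[i_star, j] < 0 and (i_star, j) not in active:
            active.add((i_star, j))
            grew = True
    return grew
```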
`
[42]    The linear program (3) has $n^2$ variables and is therefore hard to solve when $n$ is large. In one embodiment, we propose an incremental solution to this problem, which is defined only on a subset of active variables $\hat{V} \subseteq \{1, 2, \dots, n\} \times \{1, 2, \dots, n\}$. This is why we refer to our approach as sparse privacy mappings. The active variables are the indices of the non-zero variables in the solution to the problem (3). Each active variable includes a pair of indices $(i, j)$, wherein the $j$-th data point in $\mathcal{B}$ is mapped to the $i$-th data point in $\hat{\mathcal{B}}$ with a non-zero probability. Therefore, solving (3) on the active variables $\hat{V}$ is equivalent to restricting all inactive variables to zero:
`
$$\begin{aligned} \underset{X}{\text{minimize}} \quad & \sum_{j=1}^{n} c_j^T x_j \\ \text{subject to} \quad & \sum_{j=1}^{n} d_j^T x_j \le \Delta \\ & \mathbf{1}_n^T x_j = 1 \quad \forall j = 1, \dots, n \\ & X \ge 0 \\ & x_{i,j} = 0 \quad \forall (i,j) \notin \hat{V} \end{aligned} \qquad (8)$$
`
`
The above program has only $|\hat{V}|$ variables. Now the challenge is in finding a good set of active variables $\hat{V}$. This set should be small, and the solutions of (3) and (8) should be close.
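Because all inactive variables are fixed to zero, the restricted program (8) can be passed to an off-the-shelf LP solver with one variable per active pair. The sketch below uses scipy.optimize.linprog as a stand-in solver (an illustration of the restriction, not the disclosed implementation; C, D, the budget, and the active set are assumed given, with at least one active entry per column so that the simplex constraints can be met).

```python
import numpy as np
from scipy.optimize import linprog

def solve_restricted_lp(C, D, delta, active):
    """Solve program (8): minimize sum of c_ij * x_ij over the active variables only.

    C, D:   n x n cost and scaled-distortion matrices
    delta:  distortion budget
    active: list of active (i, j) pairs; all other x_ij are fixed to zero
    Returns the full n x n matrix X with zeros outside the active set.
    """
    n = C.shape[1]
    k = len(active)
    c = np.array([C[i, j] for i, j in active])               # objective coefficients
    A_ub = np.array([[D[i, j] for i, j in active]])          # single distortion row
    b_ub = np.array([delta])
    A_eq = np.zeros((n, k))                                  # one simplex row per column j
    for col, (i, j) in enumerate(active):
        A_eq[j, col] = 1.0
    b_eq = np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k, method="highs")
    X = np.zeros((n, n))
    for col, (i, j) in enumerate(active):
        X[i, j] = res.x[col]
    return X
```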
`
[43]    We grow the set $\hat{V}$ greedily using the dual linear program of (8). In particular, we incrementally solve the dual by adding most violated constraints, which corresponds to adding most beneficial variables in the primal. The dual of (8) is:

$$\begin{aligned} \underset{\lambda, \mu}{\text{maximize}} \quad & \lambda \Delta + \sum_{j=1}^{n} \mu_j \\ \text{subject to} \quad & \lambda \le 0 \\ & \lambda d_{i,j} + \mu_j \le c_{i,j} \quad \forall (i,j) \in \hat{V} \end{aligned} \qquad (9)$$
`
where $\lambda \in \mathbb{R}$ is a variable associated with the distortion constraint