(19) World Intellectual Property Organization
International Bureau

(43) International Publication Date: 02 November 2017 (02.11.2017)

(10) International Publication Number: WO 2017/187207 A1

(51) International Patent Classification:
G06F 21/62 (2013.01)    G06F 17/30 (2006.01)

(21) International Application Number: PCT/GB2017/051227

(22) International Filing Date: 02 May 2017 (02.05.2017)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data:
1607591.3    29 April 2016 (29.04.2016)    GB
1612991.8    27 July 2016 (27.07.2016)    GB
1619733.7    22 November 2016 (22.11.2016)    GB
1702357.3    14 February 2017 (14.02.2017)    GB
(71) Applicant: PRIVITAR LIMITED [GB/GB]; Salisbury House, Station Road, Cambridge CB1 2LA (GB).

(72) Inventors: MCFALL, Jason Derek; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). CABOT, Charles Codman; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). MORAN, Timothy James; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). GUINAMARD, Kieron Francois Pascal; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). EATWELL, Vladimir Michael; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). PICKERING, Benjamin Thomas; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). MELLOR, Paul David; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). STADLER, Theresa; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). PETRE, Andrei; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). SMITH, Christopher Andrew; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). DU PREEZ, Anthony Jason; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). VUJOSEVIC, Igor; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB). DANEZIS, George; c/o Privitar Limited, Salisbury House, Station Road, Cambridge CB1 2LA (GB).

(54) Title: COMPUTER-IMPLEMENTED PRIVACY ENGINEERING SYSTEM AND METHOD
[Front-page figure (Figure 1): overall system architecture. A consuming application issues differentially private aggregate queries through the Lens Query Interface, backed by the Lens Management Application. On a distributed compute cluster, Publisher runs anonymisation jobs that turn sensitive input data (batch files, record streams, minibatches) into anonymised output data, using metadata stores that may be internal or external to the cluster. A management application provides management of privacy controls and privacy policies, account management, access monitoring, risk/utility analysis, and reports and analytics, backed by a configuration database. SecureLink Oblivious Matching joins data with encrypted identifiers to produce matched data.]
(57) Abstract: A system allows the identification and protection of sensitive data in multiple ways, which can be combined for different workflows, data situations or use cases. The system scans datasets to identify sensitive data or identifying datasets, and to enable the anonymisation of sensitive or identifying datasets by processing that data to produce a safe copy. Furthermore, the system prevents access to a raw dataset. The system enables privacy preserving aggregate queries and computations. The system uses differentially private algorithms to reduce or prevent the risk of identification or disclosure of sensitive information. The system scales to big data and is implemented in a way that supports parallel execution on a distributed compute cluster.
(74) Agent: ORIGIN LIMITED; Twisden Works, Twisden Road, London NW5 1DN (GB).

(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DJ, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IR, IS, JP, KE, KG, KH, KN, KP, KR, KW, KZ, LA, LC, LK, LR, LS, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.

(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, ST, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, KM, ML, MR, NE, SN, TD, TG).

Published:
— with international search report (Art. 21(3))

COMPUTER-IMPLEMENTED PRIVACY ENGINEERING SYSTEM AND METHOD

BACKGROUND OF THE INVENTION

1. Field of the Invention
The field of the invention relates to a computer implemented process of managing and controlling the privacy and utility of dataset(s) that contain information of a sensitive or identifying nature. More particularly, but not exclusively, it relates to a computer-implemented process for anonymising sensitive or identifying datasets, a differential privacy system, and a computer-implemented method for securely sharing sensitive datasets.

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
2. Description of the Prior Art

The specification of the present disclosure is broad and deep. We will now describe in turn the prior art in relation to key aspects of the present disclosure.
Differential privacy
Data analysts commonly want to use sensitive or confidential data sources. Existing practices in industry for sensitive data analytics are insufficient because they do not provide adequate privacy while still being useful. For instance, one common solution is to rely on access control and secure enclaves for providing access to sensitive data. This approach does not protect privacy because the people conducting the analytics can still learn sensitive information about the individuals in the dataset. In general, all approaches that rely on security technologies will suffer from this problem: those that need to use the data will be able to breach individuals’ privacy.

The family of approaches called privacy-enhancing technologies offers fundamentally better privacy protection than the security approaches discussed above. Data privacy methods use statistical and cryptographic techniques to enable analysts to extract information about groups without being able to learn significant amounts about individuals. For cases where group analysis is the desired goal (a wide class of cases that includes business intelligence, reporting, independence testing, cohort analyses, and randomized controlled trials), privacy-enhancing technologies allow the person performing the analysis to achieve their goal without being able to learn sensitive information about an individual.
One subset of privacy-enhancing technologies, privacy-preserving query interfaces, relates to systems that respond to aggregate queries and release the requested statistics in a way that preserves individual privacy. These systems are of academic interest due to their potential for strong guarantees of privacy: for instance, they can guarantee differential privacy, a strong privacy guarantee for individuals that has recently been adopted in Google Chrome and Apple’s iOS.
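
By way of illustration, the canonical mechanism behind such guarantees releases a statistic after adding Laplace noise scaled to the query's sensitivity. The following minimal sketch is illustrative only; the epsilon value, toy dataset and helper names are assumptions, not the system described in this document:

```python
import numpy as np

def private_count(records, predicate, epsilon=0.1):
    # A COUNT query has sensitivity 1: adding or removing one individual
    # changes the true answer by at most 1, so Laplace noise with scale
    # 1/epsilon makes the released count epsilon-differentially private.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

patients = [{"age": 64}, {"age": 35}, {"age": 71}]  # toy data
print(private_count(patients, lambda r: r["age"] > 60))  # e.g. 2.37, not exactly 2
```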

One industry attempt worth noting is query interfaces that use a technique called query set size restriction. Query set size restriction is the practice of suppressing an aggregate result if it is over a population of fewer than t people for some threshold t (commonly 5, 10, or 100). Many industry query interfaces advertise that they employ this approach and allege that it preserves privacy. However, this approach is not enough to make a query interface preserve privacy. Query set size restriction is vulnerable to a family of attacks called tracker attacks, which can be used to successfully circumvent nearly any query set size restriction and recover information about individual data subjects. Tracker attacks are combinations of aggregate queries that can be used to determine the information of a single record. An example of a tracker attack can be found in Appendix 1.
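
To illustrate the mechanics (the full worked example is in Appendix 1), the following hypothetical sketch shows a tracker attack in miniature; the dataset, threshold and queries are invented for illustration. Both queries cover far more than t people, so a naive query set size restriction answers both, yet their difference reveals a single person's sensitive value:

```python
THRESHOLD = 5  # illustrative query set size restriction

records = [
    # The target is the only accountant in the data.
    {"job": "accountant", "hiv_positive": True},
] + [{"job": "teacher", "hiv_positive": False} for _ in range(200)]

def restricted_count(predicate):
    # The "protected" interface: refuses to answer small queries.
    n = sum(1 for r in records if predicate(r))
    return n if n >= THRESHOLD else None

# Both queries cover large populations, so both are answered...
a = restricted_count(lambda r: r["hiv_positive"] or r["job"] != "accountant")
b = restricted_count(lambda r: r["job"] != "accountant")

# ...yet their difference isolates the target: it equals 1 exactly when
# the only accountant (the target) is HIV positive.
print(a - b)  # prints 1: the target's sensitive attribute is revealed
```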

Despite the academic popularity of privacy-preserving query interfaces, there is no widely available software system that offers a privacy-preserving query interface with sufficient flexibility and usability for industry usage. There are several challenges in bringing theory into practice that are addressed by the invention, such as, but not limited to: a way to orient data analysts without their eyeballing the data; a system for controlling and reporting on the accuracy of query results; a system for detecting when attack attempts occur and interfacing with the data holder; and a method for extending the core techniques of academia to a wider set of realistic data types.
Managing and sharing sensitive datasets that include an original unique ID

People often use multiple banks for their finances, multiple hospitals and doctors for their medical treatments, multiple phones for their calls, and so on. It can be a challenge to assemble a complete set of information about an individual because it is segmented across these different organisations.

If an organisation desires to piece together this segmented information, they will need to gather the data from all parties and join it on a common unique identifier, such as a national identification number.

However, such unique identifiers make privacy breaches trivial: any target whose national identification number is known can simply be looked up in the data.
One trivial way to share datasets from contributing organisations would be for the contributing organisations to agree on a shared secret key without the knowledge of the central party, and then encrypt all of the ID numbers with this key. However, this is practically difficult because they often do not have the ability to independently organise and safeguard a secret key.
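
As a sketch of that trivial approach (not the SecureLink Oblivious Matching protocol described later), each contributor could derive a keyed pseudonym from the ID, shown here with an HMAC rather than literal encryption; the key and field names are illustrative assumptions. Because the mapping is deterministic under the shared key, equal IDs still match across contributors without the raw identifier reaching the central party:

```python
import hashlib
import hmac

SHARED_KEY = b"agreed-out-of-band"  # the secret all contributors must somehow share

def pseudonymise_id(national_id: str) -> str:
    # Deterministic keyed hash: the same ID yields the same pseudonym at
    # every contributor, so the central party can join on the pseudonym
    # without ever seeing the raw identifier.
    return hmac.new(SHARED_KEY, national_id.encode(), hashlib.sha256).hexdigest()

record = {"national_id": "AB123456C", "balance": 1042}
record["national_id"] = pseudonymise_id(record["national_id"])
```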

Processing sensitive datasets and publishing a derivative safe copy of the datasets

Data analysts and software testers commonly want to use sensitive or confidential data sources. Existing practices in industry for sensitive data use are insufficient because they do not provide adequate privacy while still being useful.

One of the most common workflows for using sensitive data is to create a desensitized or deidentified copy that can be used in place of the original. This workflow involves producing a set of tables that resemble the original tables, but certain fields have been altered or suppressed. The alteration or suppression of fields is intended to prevent people from learning sensitive attributes about individuals by looking at the dataset.

A number of techniques have been used to create a deidentified copy for an original dataset, such as for example: tokenisation and k-anonymisation.
Tokenisation relates to the practice of replacing identifiers (such as ID numbers or full names) with randomly generated values. Tokenisation technologies exist for a variety of applications; however, processing large, distributed datasets, such as those stored in HDFS (Hadoop Distributed File System), with these technologies is difficult.
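
A minimal sketch of consistent tokenisation (illustrative only; the in-memory dict stands in for the token vault described later, and the field names are assumptions):

```python
import secrets

token_vault = {}  # raw identifier -> token; in practice a secured vault, not a dict

def tokenise(identifier: str) -> str:
    # Consistent tokenisation: reuse the vault entry if this identifier
    # has been seen before, otherwise mint a fresh random token.
    if identifier not in token_vault:
        token_vault[identifier] = secrets.token_hex(8)
    return token_vault[identifier]

row = {"name": "Ada Lovelace", "ssn": "123-45-6789", "diagnosis": "flu"}
safe_row = {**row, "name": tokenise(row["name"]), "ssn": tokenise(row["ssn"])}
```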

K-anonymisation is the process of accounting for available background information and ensuring that that background information cannot be used to re-identify masked data. In the k-anonymity model, attributes that can be learned via background information (such as gender, age, or place of residence) are called quasi-identifiers. A dataset is k-anonymous if every record in the dataset shares its combination of quasi-identifier values with k-1 other records. This poses a significant obstacle to an attacker who tries to re-identify the data, because they cannot use the background information to tell which out of k records corresponds to any target individual.
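
A minimal sketch of the k-anonymity condition (illustrative only; the quasi-identifier column names are assumptions): group records by their quasi-identifier tuple and take the size of the smallest group.

```python
from collections import Counter

QUASI_IDENTIFIERS = ["gender", "age", "city"]  # assumed column names

def anonymity_level(records):
    # k is the size of the smallest cluster of records sharing the same
    # quasi-identifier combination (the smallest anonymity set).
    clusters = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
    return min(clusters.values())

rows = [
    {"gender": "F", "age": 34, "city": "Leeds"},
    {"gender": "F", "age": 34, "city": "Leeds"},
    {"gender": "M", "age": 41, "city": "York"},
]
print(anonymity_level(rows))  # 1: the York record is unique, so not 2-anonymous
```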

K-anonymisation is an established technique, but has some aspects with significant room for innovation, such as but not limited to: guiding a non-expert user toward proper configuration of k-anonymisation, and measuring and minimising k-anonymisation’s impact on data utility.

The present invention addresses the above vulnerabilities and also other problems not described above.
Reference may also be made to PCT/GB2016/053776, the contents of which are hereby incorporated by reference.
SUMMARY OF THE INVENTION
One aspect of the invention is a system allowing the identification and protection of sensitive data in multiple ways, which can be combined for different workflows, data situations or use cases.
Another aspect is a method in which a computer-based system scans datasets to identify sensitive data or identifying datasets, and enables the anonymisation of sensitive or identifying datasets by processing that data to produce a safe copy. This discovery and anonymisation may scale to big data and may be implemented in a way that supports parallel execution on a distributed compute cluster. A user may configure and control how data is anonymised, may view what privacy risks exist and how to mitigate them and/or may record and inspect an audit trail of all classification and anonymisation activity. Anonymisation may consist of tokenisation, masking, and/or k-anonymisation to protect against the risks of reidentification through background information or linkage to external datasets. Tokenisation processes may use a token vault, which can reside on the distributed compute cluster or on an external database.

Another aspect is a system for preventing access to a raw dataset. The system may enable privacy preserving aggregate queries and computations. The system may use differentially private algorithms to reduce or prevent the risk of identification or disclosure of sensitive information. Data access may be controlled, and all usage may be logged and analysed for malicious behaviour. The system may be used to query data in a relational database, in flat files, or in a non-relational distributed system such as Hadoop. The system may be used to manage and audit all data access, and to classify what data is sensitive and should be processed via differentially private algorithms.

Another aspect is a computer-implemented method for managing and sharing sensitive data consisting of a combination of sensitive datasets, joined together. Data may be joined against a common identifier (such as a social security number), while protecting that common identifier and ensuring that it is not revealed in the matching process or the resulting joined dataset. The joined dataset may then be anonymised using one or more of the methods and systems defined above.

Other key aspects include any one or more of the features defined above.
BRIEF DESCRIPTION OF THE FIGURES
Aspects of the invention will now be described, by way of example(s), with reference to the following figures, in which:

Figure 1 shows a diagram illustrating the key aspects of the system.
Figure 2 shows a diagram illustrating the key components of Lens.
Figure 3 shows a screenshot with a query interface.
Figure 4 shows a contingency table.
Figure 5 shows a diagram of the query workflow.
Figure 6 shows a diagram of the sample-aggregate mechanism.
Figure 7 shows a screenshot of a user interface displaying to an end-user an amount of budget spent.
Figure 8 shows a line chart representing an individual’s querying of a dataset as a function of time.
Figure 9 shows a screenshot of the information displayed to a data holder.
Figure 10 shows a simple diagram where contributors share data to a recipient.
Figure 11 shows a diagram illustrating the key components of SecureLink.
Figure 12 shows a diagram illustrating the key components of Publisher.
Figure 13 shows an example of the modelling of a ‘Policy’ Schema in Publisher.
Figure 14 shows a diagram illustrating the sharing of rules within Publisher.
Figure 15 shows a diagram illustrating the configuration of a Rule Library within Publisher.
Figure 16 shows the process of integrating with a metadata store.
Figure 17 shows a screenshot of a user interface allowing a user to verify, choose from a set of alternatives, and define new Rules per column.
Figure 18 shows a diagram illustrating the audit of data workflow.
Figure 19A shows a diagram illustrating the tokenisation flow.
Figure 19B shows a diagram illustrating the tokenisation flow.
Figure 20A shows a diagram illustrating the obfuscation flow.
Figure 20B shows a diagram illustrating the obfuscation flow.
Figure 21 shows a diagram illustrating the derived tokenisation flow.
Figure 22 shows a diagram illustrating the process of using the collisions map within the obfuscation phase.
Figure 23A shows a diagram illustrating the token Generation phase adapted to use derived tokenisation.
Figure 23B shows a diagram illustrating the token Generation phase adapted to use derived tokenisation.
Figure 24 shows a diagram with the collisions map workflow of the obfuscation phase.
Figure 25 shows a diagram with an example of food hierarchy.
Figure 26 shows a diagram illustrating the top down generalisation approach.
Figure 27 shows a diagram with an example of ‘animals’ hierarchy.
Figure 28 shows a diagram with another example of ‘animals’ hierarchy.
Figure 29 shows a planar graph representation and a generalised territories map.
Figure 30 shows a table displayed by Publisher, which contains the rule and distortion corresponding to a specific data column.
Figure 31A shows a screenshot of Publisher in which distortion histograms are displayed.
Figure 31B shows a screenshot of Publisher in which distortion histograms are displayed.
Figure 32 shows a screenshot of Publisher in which cluster size distribution is displayed.
Figure 33 shows an example of a cluster size bubble chart displayed to an end-user.
Figure 34 shows an example of a cluster size bubble chart displayed to an end-user.
Figure 35A shows a visualisation depicting the sensitive data discovery.
Figure 35B shows a visualisation depicting the sensitive data discovery.
Figure 35C shows a visualisation depicting the sensitive data discovery.
DETAILED DESCRIPTION
We will now describe an implementation of the invention in the following sections:

Section A: Overview of some key components in the system
Section B: Lens
Section C: SecureLink
Section D: Publisher

Note that each innovation listed above, and the related, optional implementation features for each innovation, can be combined with any other innovation and related optional implementation.
In this document, we shall use the term ‘node’ in the following different contexts:

(1) A node in a computing cluster. In this instance a node means a single computer that is a member of a computing cluster.

(2) A node in a graph structure, which may have edges connecting it to other nodes. We use the term node in this sense when discussing tree structures. The terms root node, leaf node, child node, and parent node relate to this context.

We also shall use the term ‘cluster’ in the following different contexts:

(1) A computing cluster. A computing cluster is a set of computers that work together to store large files and do distributed computing.

(2) A set of rows in a table that have the same quasi-identifying values, also known as an anonymity set. For instance, if there are four and only four rows that have the quasi-identifying values “hair=brown”, “age=92”, “nationality=Canadian”, then these four records are a cluster.
Section A: Overview of some key components in the system
Privitar aims to provide a platform solution to enable organisations to use, share and trade data containing personal or private information.

Figure 1 shows an example of the overall system architecture. The system allows the identification and protection of sensitive data in multiple ways, which can be combined for different workflows, data situations or use cases.
Privitar Publisher scans datasets to identify sensitive data or identifying datasets, and enables the anonymisation of sensitive or identifying datasets by processing that data to produce a safe copy. This discovery and anonymisation scales to big data and is implemented in a way that supports parallel execution on a distributed compute cluster. Tokenisation processes may use a token vault, which can reside on the distributed compute cluster or on an external database.

The Publisher Management Application allows the user to configure and control how data is anonymised, to view what privacy risks exist and how to mitigate them, and to record and inspect an audit trail of all classification and anonymisation activity. Anonymisation can consist of tokenisation and masking, and also of k-anonymisation to protect against the risks of reidentification through background information or linkage to external datasets.

Privitar Lens takes a complementary and alternative approach to privacy protection. Lens prevents access to the raw dataset, but enables privacy preserving aggregate queries and computations, and uses differentially private algorithms to reduce or prevent the risk of identification or disclosure of sensitive information. Data access is controlled, and all usage is logged and analysed for malicious behaviour. Lens may be used to query data in a relational database, in flat files, or in a non-relational distributed system such as Hadoop. The Lens Management Application is used to manage and audit all data access, and to classify what data is sensitive and should be processed via differentially private algorithms.
Sometimes sensitive data consists of a combination of sensitive datasets, joined together. SecureLink Oblivious Matching offers a way for data to be joined against a common identifier (such as a social security number), while protecting that common identifier and ensuring that it is not revealed in the matching process or the resulting joined dataset. The joined dataset may then be anonymised using Privitar Publisher, or made available for privacy preserving analysis using Privitar Lens.
Section B: Lens
Lens relates to a computer-implemented process for running computations and queries over datasets such that privacy is preserved; access control methods, noise addition, generalisation, rate limiting (i.e. throttling), visualization, and monitoring techniques are applied.

Lens is a system for answering queries on datasets while preserving privacy. It is applicable for conducting analytics on any datasets that contain sensitive information about a person, company, or other entity whose privacy must be preserved. For instance, it could be used to conduct analytics on hospital visit data, credit card transaction data, mobile phone location data, or smart meter data. As shown in Figure 2, Lens (11) is typically the only gateway through which a data analyst (14) can retrieve information about a dataset (12). The dataset itself is protected in a secure location (13). The data owner or holder (15) (e.g. the bank or health company) can configure Lens and audit analysts’ activity through Lens. Lens restricts access for configuration of the query system to a single channel, with a restricted set of ways to retrieve information and types of information that may be retrieved.

Lens differs from previous efforts to implement privacy-preserving query interfaces. The two most notable previous attempts are PINQ (McSherry, Frank D. "Privacy integrated queries: an extensible platform for privacy-preserving data analysis." Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 2009) and GUPT (Mohan, Prashanth, et al. "GUPT: privacy preserving data analysis made easy." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012), both of which were academic projects that have been published. One broad difference is that the previous attempts were narrowly scoped software libraries, while Lens is a comprehensive application with a number of novel usability features and privacy optimizations. Some differences are described in more detail here. The first difference is that Lens is a web service, while the previous attempts were software libraries (PINQ in C#, GUPT in Python). The second difference is that Lens, because it is a live service with logins and user authentication, has separate interfaces for the data analyst and the data holder, while the previous attempts do not have such a separation of interfaces. The third difference is that the functionality provided to both data analysts and data holders far outstrips the functionality provided by the previous attempts, particularly in the ability for the data holder to control the service as it is running, and to audit all activity in the system, and additionally in the ability of the data analyst to browse datasets and get a sense of their look and feel. The fourth difference is that Lens has several usability features, such as reporting noise on results and allowing the user to specify privacy parameters in novel, intuitive ways. The fifth difference is that Lens has several privacy optimizations, such as the ability to designate columns as public or private and automatically decide whether to add noise to queries based on whether they concern private columns.

The remainder of this section is structured as follows. Lens is a query interface service, and section 1 defines what we mean by query and describes the scope of queries that are handled. Section 2 defines what types of datasets Lens handles. Section 3 defines the architectural setup of Lens. Section 4 defines the general steps that Lens follows for handling a query. Section 5 describes the features of Lens that preserve the privacy of individuals by making sure that outputs of Lens do not leak information about individuals. Section 6 describes how to configure Lens. Section 7 outlines some examples of use cases.
1. Scope of queries

Lens may answer any query that is aggregate in nature. Aggregate queries are queries that give statistical information about a group of people rather than an individual. Figure 3 shows a screenshot with an example of a query interface. Examples of aggregate queries range from sums, counts, and means to clusterings and linear regressions. The supported types of queries may include, but are not limited to:
1.1 SQL-like aggregate queries

These are queries that are equivalent to SELECT COUNT(*), SELECT SUM(variable), and SELECT AVG(variable) in the SQL language. In these queries, a number of filters are applied to the dataset to get a subset of records, and then either the records are counted or the sum or average is found of a certain column within the subset. Lens expresses these queries as an abstract syntax tree in which there are two parts: an aggregate and a list of filters. The aggregate has two parts: a function (e.g. SUM, MEAN, or COUNT) and a column name (which can be missing if the function does not need a column name, for instance if it is COUNT). The filters each have three parts: a column name, a comparison operator, and a value. The comparison operator may be less than, greater than, less than or equal to, greater than or equal to, equal to, or not equal to. However, if the column designated is a categorical column, the comparison operator is restricted to the smaller list of equal to or not equal to. These queries are passed into Lens through the user interface, which may be a REST API or a web page (see screenshot in Figure 3). The REST API accepts a JSON object which specifies each of the data fields listed above. Lens has a number of connectors which use the abstract syntax tree to construct a query for a certain underlying database. For instance, for a PostgreSQL query, the aggregate function is turned into "SELECT <function>(<column name>)", the filters are constructed as "<column name> <comparison operator> <value>", and the full query is assembled as "SELECT <function>(<column name>) FROM <table name> WHERE <filter1> AND <filter2> AND ... AND <filterN>".
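
A minimal sketch of this abstract syntax tree and its translation to SQL (illustrative assumptions: the class names, the example table, and the naive value interpolation where a real connector would use bound query parameters):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Filter:
    column: str
    op: str          # <, >, <=, >=, =, != (categorical columns: = and != only)
    value: object

@dataclass
class AggregateQuery:
    function: str              # e.g. SUM, MEAN, COUNT
    column: Optional[str]      # None when the function needs no column (COUNT)
    filters: List[Filter]

def to_postgres(q: AggregateQuery, table: str) -> str:
    # "SELECT <function>(<column name>)" plus one "<column> <op> <value>"
    # clause per filter, joined with AND, per the scheme described above.
    select = f"SELECT {q.function}({q.column or '*'}) FROM {table}"
    if not q.filters:
        return select
    where = " AND ".join(f"{f.column} {f.op} {f.value!r}" for f in q.filters)
    return f"{select} WHERE {where}"

q = AggregateQuery("AVG", "salary", [Filter("age", ">=", 40), Filter("region", "=", "North")])
print(to_postgres(q, "employees"))
# SELECT AVG(salary) FROM employees WHERE age >= 40 AND region = 'North'
```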

This family of queries can be extended to contingency tables (which is one aggregate query for each cell in the contingency table). An image of a contingency table is shown in Figure 4; the example is salary broken down by location and employee grade. For contingency tables, the same inputs as a normal aggregate query are passed in, in addition to a list of categorical columns to "group by". Lens first queries the underlying database to determine each value present in each categorical column. It expresses these in lists, henceforth referred to as COL1VALS, COL2VALS, ..., COLNVALS. Lens then iteratively selects each unique combination of (col1val, col2val, ..., colnval) where col1val is chosen from COL1VALS, col2val is chosen from COL2VALS, ..., colnval is chosen from COLNVALS. Lens constructs a query for each resulting tuple (col1val, col2val, ..., colnval) that is the base query with N additional filters where for filter i in {1..N}, the column name is the column name of the ith group-by column, the comparison operator is equal to, and the value is colival. Lens then constructs queries for the underlying database for each of these queries, and then returns the results for each query. Each result is a cell in a logical contingency table. The GUI can represent a contingency table with up to 2 group-by variables as a straightforward two-dimensional table, where the values of col1 are the column headers, the values of col2 are the row headers, and each entry at column i and row j is the query result for the tuple (colheader_i, rowheader_j).
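
Building on the previous sketch (same illustrative types, same caveats), the cell-by-cell expansion can be sketched as a cartesian product over the discovered value lists:

```python
from itertools import product

def contingency_cells(base: AggregateQuery, group_by: dict):
    # group_by maps each group-by column to its distinct values, i.e. the
    # COL1VALS ... COLNVALS discovered by first querying the database.
    columns = list(group_by)
    for combo in product(*(group_by[c] for c in columns)):
        # One aggregate query per cell: the base filters plus an
        # equality filter per group-by column.
        extra = [Filter(c, "=", v) for c, v in zip(columns, combo)]
        yield combo, AggregateQuery(base.function, base.column, base.filters + extra)

base = AggregateQuery("AVG", "salary", [])
for combo, q in contingency_cells(base, {"location": ["UK", "US"], "grade": [1, 2]}):
    print(combo, "->", to_postgres(q, "employees"))  # one query per cell
```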
1.2 Parametrized models

Certain supervised and unsupervised learning models, such as linear regressions or k-means clustering, have well known training algorithms. This query type takes as input the parameters for the training algorithm, trains the model on the dataset, and returns as output the parameters of the trained model. Lens may use a language such as Predictive Model Markup Language (PMML) to specify model type and parameters (https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language). PMML is a well-defined way to describe a predictive model in XML. Parameters will vary based on the algorithm. For instance, for decision trees, the number of levels of the tree is required, as well as the columns to use, and the column to be predicted. Outputs also vary based on model type. For instance, the decision tree algorithm outputs a decision tree: a tree of nodes where each node has a variable name and a threshold.
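
Purely as an illustration of this query type's contract, the sketch below uses scikit-learn as a stand-in for whatever training stack sits behind the interface (an assumption, as is the output format): training happens server-side, and only the fitted parameters, the internal nodes' variable names and thresholds, are returned, never the rows.

```python
from sklearn.tree import DecisionTreeClassifier

def train_model_query(rows, feature_cols, target_col, max_levels):
    # Train server-side on the protected data...
    X = [[r[c] for c in feature_cols] for r in rows]
    y = [r[target_col] for r in rows]
    tree = DecisionTreeClassifier(max_depth=max_levels).fit(X, y).tree_
    # ...and release only the trained model's parameters: each internal
    # node's (variable name, threshold) pair, as described above.
    return [
        {"variable": feature_cols[tree.feature[i]], "threshold": float(tree.threshold[i])}
        for i in range(tree.node_count)
        if tree.children_left[i] != -1  # skip leaf nodes
    ]
```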
