`WO 2017/187207
`1. Field of the Invention
`The field of the invention relates to a computer-implemented process of managing and
`controlling the privacy and utility of dataset(s) that contain information of a sensitive or
`identifying nature. More particularly, but not exclusively, it relates to a
`computer-implemented process for anonymising sensitive or identifying datasets, a
`differential privacy system, and a computer-implemented method for securely sharing
`sensitive datasets.
`A portion of the disclosure of this patent document contains material which is subject to
`copyright protection. The copyright owner has no objection to the
`reproduction by anyone of the patent document or the patent disclosure, as it appears in
`the Patent and Trademark Office patent file or records, but otherwise reserves all
`copyright rights whatsoever.
`2. Description of the Prior Art
`The specification of the present disclosure is broad and deep. We will now describe in
`turn the prior art in relation to key aspects of the present disclosure.
`Differential privacy
`Data analysts commonly want to use sensitive or confidential data sources. Existing
`practices in industry for sensitive data analytics are insufficient because they do not
`provide adequate privacy while still being useful. For instance, one common solution is
`to rely on access control and secure enclaves for providing access to sensitive data. This
`approach does not protect privacy because the people conducting the analytics can still
`learn sensitive information about the individuals in the dataset. In general, all approaches
`that rely on security technologies will suffer from this problem: those that need to use
`the data will be able to breach individuals’ privacy.
`The family of approaches called privacy-enhancing technologies offers fundamentally
`better privacy protection than the security approaches discussed above. Data privacy
`methods use statistical and cryptographic techniques to enable analysts to extract
`information about groups without being able to learn significant amounts about
`individuals. For cases where group analysis is the desired goal (a wide class of cases that
`includes business intelligence, reporting, independence testing, cohort analyses, and
`controlled trials), privacy-enhancing technologies allow the people performing the
`analysis to achieve their goal without being able to learn sensitive information about an
`individual.
`One subset of privacy-enhancing technologies, privacy-preserving query interfaces,
`relates to systems that respond to aggregate queries and release the requested statistics in
`a way that preserves individual privacy. These systems are of academic interest due to
`their potential for strong guarantees of privacy: for instance, they can guarantee
`differential privacy, a strong privacy guarantee for individuals that has recently been
`adopted in Google Chrome and Apple’s iOS.
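To illustrate how a query interface can release a statistic with a differential privacy guarantee, the following is a minimal sketch of the Laplace mechanism applied to a count. The Laplace mechanism is a standard technique from the differential privacy literature and is not taken from this passage; the names `laplace_noise` and `noisy_count`, the parameter `epsilon`, and the sample data are assumptions made for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    # The difference of two independent Exponential(rate) variables is
    # Laplace-distributed with scale 1/rate.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so noise of scale 1/epsilon gives epsilon-differential privacy for this query.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of seven data subjects.
ages = [23, 35, 41, 58, 62, 29, 47]
rng = random.Random(0)  # seeded only so the sketch is repeatable
answer = noisy_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(answer)
```

A smaller `epsilon` gives more noise and stronger privacy; a real deployment would use an unseeded, cryptographically sound source of randomness.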
`One industry attempt worth noting is query interfaces that use a technique called query
`set size restriction. Query set size restriction is the practice of suppressing an aggregate
`result if it is over a population of fewer than t people for some threshold t (commonly 5,
`10, or 100). Many industry query interfaces advertise that they employ this approach and
`allege that it preserves privacy. However, this approach is not enough to make a query
`interface preserve privacy. Query set size restriction is vulnerable to a family of attacks
`called tracker attacks, which can be used to successfully circumvent nearly any query set
`size restriction and recover information about individual data subjects. Tracker attacks
`are combinations of aggregate queries that can be used to determine the information of a
`single record. An example of a tracker attack can be found in Appendix 1.
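The weakness of query set size restriction can be sketched with a simple differencing attack, which is a basic member of the tracker attack family. This is not the example from Appendix 1; the function `restricted_sum`, the threshold of 5, and the employee data are hypothetical and chosen only to show the idea.

```python
THRESHOLD = 5  # hypothetical query set size restriction


def restricted_sum(records, predicate, column, threshold=THRESHOLD):
    """Sum a column over matching records; suppress the result for small query sets."""
    matching = [r for r in records if predicate(r)]
    if len(matching) < threshold:
        return None  # suppressed
    return sum(r[column] for r in matching)


# Hypothetical dataset: the attacker knows the target is the only 58-year-old in Sales.
employees = [
    {"dept": "Sales", "age": 58, "salary": 91000},
    {"dept": "Sales", "age": 31, "salary": 42000},
    {"dept": "Sales", "age": 45, "salary": 55000},
    {"dept": "Sales", "age": 27, "salary": 38000},
    {"dept": "Sales", "age": 39, "salary": 47000},
    {"dept": "Sales", "age": 52, "salary": 60000},
    {"dept": "Sales", "age": 34, "salary": 44000},
]

# Asking about the target directly is blocked: the query set has size 1.
direct = restricted_sum(
    employees, lambda r: r["dept"] == "Sales" and r["age"] == 58, "salary"
)

# Two permitted queries (query sets of size 7 and 6) differ only by the target's record.
everyone = restricted_sum(employees, lambda r: r["dept"] == "Sales", "salary")
all_but_target = restricted_sum(
    employees, lambda r: r["dept"] == "Sales" and r["age"] != 58, "salary"
)
leaked_salary = everyone - all_but_target
```

Both permitted queries pass the size check, yet their difference is exactly the target's salary, which is why size restriction alone does not preserve privacy.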
`Despite the academic popularity of privacy-preserving query interfaces, there is no widely
`available software system that offers a privacy-preserving query interface with sufficient
`flexibility and usability for industry usage. There are several challenges in bringing theory
`into practice, such as but not limited to: a way to orient data analysts without their
`eyeballing the data, a system for controlling and reporting on the accuracy of query
`results, a system for detecting when attack attempts occur and interfacing with the data
`holder, and a method for extending the core techniques of academia to a wider set of
`realistic data types. These challenges are addressed by the invention.
`Managing and sharing sensitive datasets that include an original unique ID
`People often use multiple banks for their finances, multiple hospitals and doctors for
`their medical treatments, multiple phones for their calls, and so on. It can be a challenge
`to assemble a complete set of information about an individual because it is segmented
`across these different organisations.
`If an organisation desires to piece together this segmented information, they will need to
`gather the data from all parties and join it on a common unique identifier, such as a
`national identification number.
`However, such unique identifiers make privacy breaches trivial: any target whose
`national identification number is known can simply be looked up in the data.
`One trivial way to share datasets from contributing organisations would be for the
`contributing organisations to agree on a shared secret key without the knowledge of the
`central party, and then encrypt all of the ID numbers with this key. However, this is
`practically difficult because they often do not have the ability to independently organise
`and safeguard a secret key.
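The shared-key approach described above can be sketched as follows. The text speaks of encrypting the ID numbers; this sketch substitutes a keyed hash (HMAC-SHA256) as a simple deterministic stand-in, which is an assumption and not necessarily what any contributor would use. The function names, the key, and the sample records are hypothetical.

```python
import hashlib
import hmac


def pseudonymise_id(id_number, shared_key):
    """Map an ID number to a keyed digest; the same key and ID always give the same output."""
    return hmac.new(shared_key, id_number.encode("utf-8"), hashlib.sha256).hexdigest()


def protect(records, shared_key):
    """Replace the raw 'id' field of each record with its keyed digest."""
    return [{**r, "id": pseudonymise_id(r["id"], shared_key)} for r in records]


# Hypothetical key agreed between contributors and never shown to the central party.
key = b"agreed-out-of-band-by-contributors"

bank = protect([{"id": "AB123456C", "balance": 1200}], key)
hospital = protect([{"id": "AB123456C", "visits": 3}], key)

# The central party joins on the protected identifier without seeing the raw ID.
by_id = {r["id"]: r for r in bank}
joined = [{**by_id[r["id"]], **r} for r in hospital if r["id"] in by_id]
```

The join works only because every contributor uses the same key, which is exactly the coordination burden the paragraph above identifies.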
`Processing sensitive datasets and publishing a derivative safe copy of the datasets
`Data analysts and software testers commonly want to use sensitive or confidential data
`sources. Existing practices in industry for sensitive data use are insufficient because they
`do not provide adequate privacy while still being useful.
`One of the most common workflows for using sensitive data is to create a desensitized
`or deidentified copy that can be used in place of the original. This workflow involves
`producing a set of tables that resemble the original tables, but certain fields have been
`altered or suppressed. The alteration or suppression of fields is intended to prevent
`people from learning sensitive attributes about individuals by looking at the dataset.
`A number of techniques have been used to create a deidentified copy for an original
`dataset, for example tokenisation and k-anonymisation.
`Tokenisation relates to the practice of replacing identifiers (such as ID numbers or full
`names) with randomly generated values. Tokenisation technologies exist for a variety of
`applications; however, processing large, distributed datasets, such as those stored in
`HDFS (Hadoop Distributed File System), with these technologies is difficult.
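As a small illustration of tokenisation, the sketch below replaces each identifier with a randomly generated token and remembers the mapping so that repeated values receive the same token. The `TokenVault` class, its in-memory dictionary, and the sample rows are hypothetical; a system working on large distributed datasets would need a shared, persistent store rather than a single Python object.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a token vault: remembers which token replaced which value."""

    def __init__(self):
        self._token_for = {}
        self._issued = set()

    def tokenise(self, value):
        """Return the token for a value, generating a fresh random one on first sight."""
        if value not in self._token_for:
            token = secrets.token_hex(8)
            while token in self._issued:  # regenerate in the unlikely event of a clash
                token = secrets.token_hex(8)
            self._issued.add(token)
            self._token_for[value] = token
        return self._token_for[value]


vault = TokenVault()
rows = [
    {"name": "Alice Smith", "amount": 20},
    {"name": "Bob Jones", "amount": 35},
    {"name": "Alice Smith", "amount": 15},
]
safe_rows = [{**row, "name": vault.tokenise(row["name"])} for row in rows]
```

Because the mapping is consistent, records belonging to the same person can still be grouped in the safe copy without exposing the original name.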
`K-anonymisation is the process of accounting for available background information and
`ensuring that that background information cannot be used to re-identify masked data. In
`the k-anonymity model, attributes that can be learned via background information (such
`as gender, age, or place of residence) are called quasi-identifiers. A dataset is
`k-anonymous if every record in the dataset shares its combination of quasi-identifier
`values with at least k-1 other records. This poses a significant obstacle to an attacker who
`tries to re-identify the data, because they cannot use the background information to tell
`which out of k records corresponds to any target individual.
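The k-anonymity condition above can be checked directly by counting how many records share each combination of quasi-identifier values. The following is a minimal sketch; the function names, column names, and sample rows are assumptions for illustration.

```python
from collections import Counter


def cluster_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    return all(size >= k for size in cluster_sizes(rows, quasi_identifiers).values())


# Hypothetical table with two quasi-identifiers and one sensitive column.
rows = [
    {"gender": "F", "age_band": "30-39", "diagnosis": "asthma"},
    {"gender": "F", "age_band": "30-39", "diagnosis": "diabetes"},
    {"gender": "M", "age_band": "40-49", "diagnosis": "asthma"},
    {"gender": "M", "age_band": "40-49", "diagnosis": "flu"},
    {"gender": "M", "age_band": "40-49", "diagnosis": "asthma"},
]
quasi = ["gender", "age_band"]

print(is_k_anonymous(rows, quasi, 2))
print(is_k_anonymous(rows, quasi, 3))
```

Here the smallest group has two rows, so the table satisfies the condition for k = 2 but not for k = 3.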
`K-anonymisation is an established technique, but has some aspects with significant room
`for innovation, such as but not limited to: guiding a non-expert user toward proper
`configuration of k-anonymisation, and measuring and minimising k-anonymisation’s
`impact on data utility.
`The present invention addresses the above vulnerabilities and also other problems not
`described above.
`Reference may also be made to PCT/GB2016/053776, the contents of which are hereby
`incorporated by reference.
`One aspect of the invention is a system allowing the identification and protection of
`sensitive data in multiple ways, which can be combined for different workflows, data
`situations or use cases.
`Another aspect is a method in which a computer-based system scans datasets to identify
`sensitive data or identifying datasets, and enables the anonymisation of sensitive or
`identifying datasets by processing that data to produce a safe copy. This discovery and
`anonymisation may scale to big data and may be implemented in a way that supports
`parallel execution on a distributed compute cluster. A user may configure and control
`how data is anonymised, may view what privacy risks exist and how to mitigate them
`and/or may record and inspect an audit trail of all classification and anonymisation
`activity. Anonymisation may consist of tokenisation, masking, and/or k-anonymisation
`to protect against the risks of reidentification through background information or linkage
`to external datasets. Tokenisation processes may use a token vault, which can reside on
`the distributed compute cluster or on an external database.
`Another aspect is a system for preventing access to a raw dataset. The system may enable
`privacy preserving aggregate queries and computations. The system may use differentially
`private algorithms to reduce or prevent the risk of identification or disclosure of sensitive
`information. Data access may be controlled and all usage may be logged, and analysed
`for malicious behaviour. The system may be used to query data in a relational database,
`in flat files, or in a non-relational distributed system such as Hadoop. The system may be
`used to manage and audit all data access, and to classify what data is sensitive and should
`be processed via differentially private algorithms.
`Another aspect is a computer-implemented method for managing and sharing sensitive
`data consisting of a combination of sensitive datasets, joined together. Data may be
`joined against a common identifier (such as a social security number), while protecting
`that common identifier and ensuring that it is not revealed in the matching process or the
`resulting joined dataset. The joined dataset may then be anonymised using one or more
`of the methods or systems defined above.
`Other key aspects include any one or more of the features defined above.
`Aspects of the invention will now be described, by way of example(s), with reference to
`the following Figures, in which:
`Figure 1
`shows a diagram illustrating the key aspects of the system.
`Figure 2
`shows a diagram illustrating the key components of Lens.
`Figure 3
`shows a screenshot with a query interface.
`Figure 4
`shows a contingency table.
`Figure 5
`shows a diagram of the query workflow.
`Figure 6
`shows a diagram of the sample-aggregate mechanism.
`Figure 7
`shows a screenshot of a user interface displaying to an end-user an amount
`of budget spent.
`Figure 8
`shows a line chart representing an individual’s querying of a dataset as a
`function of time.
`Figure 9
`shows a screenshot of the information displayed to a data holder.
`Figure 10
`shows a simple diagram where contributors share data to a recipient.
`Figure 11
`shows a diagram illustrating the key components of SecureLink.
`Figure 12
`shows a diagram illustrating the key components of Publisher.
`Figure 13
`shows an example of the modelling of a ‘Policy’ Schema in Publisher.
`Figure 14
`shows a diagram illustrating the sharing of rules within Publisher.
`Figure 15
`shows a diagram illustrating the configuration of a Rule Library within
`Figure 16
`shows the process of integrating with a metadata store.
`Figure 17
`shows a screenshot of a user interface allowing a user to verify, choose
`from a set of alternatives, and define new Rules per column.
`Figure 18
`shows a diagram illustrating the audit of data workflow.
`Figure 19A
`shows a diagram illustrating the tokenisation flow.
`Figure 19B
`shows a diagram illustrating the tokenisation flow.
`Figure 20A
`shows a diagram illustrating the obfuscation flow.
`Figure 20B
`shows a diagram illustrating the obfuscation flow.
`Figure 21
`shows a diagram illustrating the derived tokenisation flow.
`Figure 22
`shows a diagram illustrating the process of using the collisions map within
`the obfuscation phase.
`Figure 23A
`shows a diagram illustrating the token generation phase adapted to use
`derived tokenisation.
`Figure 23B
`shows a diagram illustrating the token generation phase adapted to use
`derived tokenisation.
`Figure 24
`shows a diagram with the collisions map workflow of the obfuscation
`Figure 25
`shows a diagram with an example of food hierarchy.
`Figure 26
`shows a diagram illustrating the top down generalisation approach.
`Figure 27
`shows a diagram with an example of ‘animals’ hierarchy.
`Figure 28
`shows a diagram with another example of ‘animals’ hierarchy.
`Figure 29
`shows a planar graph representation and a generalised territories map.
`Figure 30
`shows a table displayed by Publisher, which contains the rule and distortion
`corresponding to a specific data column.
`Figure 31A
`shows a screenshot of Publisher in which distortion histograms are
`Figure 31B
`shows a screenshot of Publisher in which distortion histograms are
`Figure 32
`shows a screenshot of Publisher in which cluster size distribution is
`Figure 33
`shows an example of a cluster size bubble chart displayed to an end-user.
`Figure 34
`shows an example of a cluster size bubble chart displayed to an end-user.
`Figure 35A
`shows a visualisation depicting the Sensitive data discovery.
`Figure 35B
`shows a visualisation depicting the Sensitive data discovery.
`Figure 35C
`shows a visualisation depicting the Sensitive data discovery.
`We will now describe an implementation of the invention in the following sections:
`Section A: Overview of some key components in the system
`Section B: Lens
`Section C: SecureLink
`Section D: Publisher
`Note that each innovation listed above, and the related, optional implementation features
`for each innovation, can be combined with any other innovation and related optional
`In this document, we shall use the term ‘node’ in the following different contexts:
`(1) A node in a computing cluster. In this instance a node means a single computer
`that is a member of a computing cluster.
`(2) A node in a graph structure, which may have edges connecting it to other nodes.
`We use the term node in this sense when discussing tree structures. The terms
`root node, leaf node, child node, and parent node relate to this context.
`We also shall use the term ‘cluster’ in the following different contexts:
`(1) A computing cluster. A computing cluster is a set of computers that work
`together to store large files and do distributed computing.
`(2) A set of rows in a table that have the same quasi-identifying values, also known
`as an anonymity set. For instance, if there are four and only four rows that have
`the quasi-identifying values “hair=brown”, “age=92”, “nationality=Canadian”,
`then these four records are a cluster.
`Section A: Overview of some key components in the system
`Privitar aims to provide a platform solution to enable organisations to use, share and
`trade data containing personal or private information.
`Figure 1 shows an example of the overall system architecture. The system allows the
`identification and protection of sensitive data in multiple ways, which can be combined
`for different workflows, data situations or use cases.
`Privitar Publisher scans datasets to identify sensitive data or identifying datasets, and
`enables the anonymisation of sensitive or identifying datasets by processing that data to
`produce a safe copy. This discovery and anonymisation scales to big data and is
`implemented in a way that supports parallel execution on a distributed compute cluster.
`Tokenisation processes may use a token vault, which can reside on the distributed
`compute cluster or on an external database.
`The Publisher Management Application allows the user to configure and control how
`data is anonymised, to view what privacy risks exist and how to mitigate them, to record
`and inspect an audit trail of all classification and anonymisation activity. Anonymisation
`can consist of tokenisation and masking, and also of k-anonymisation to protect against
`the risks of reidentification through background information or linkage to external
`datasets.
`Privitar Lens takes a complementary and alternative approach to privacy protection.
`Lens prevents access to the raw dataset, but enables privacy preserving aggregate queries
`and computations, and uses differentially private algorithms to reduce or prevent the risk
`of identification or disclosure of sensitive information. Data access is controlled, and all
`usage is logged and analysed for malicious behaviour. Lens may be used to query data in a
`relational database, in flat files, or in a non-relational distributed system such as Hadoop.
`The Lens Management Application is used to manage and audit all data access, and to
`classify what data is sensitive and should be processed via differentially private
`algorithms.
`Sometimes sensitive data consists of a combination of sensitive datasets, joined together.
`SecureLink Oblivious Matching offers a way for data to be joined against a common
`identifier (such as a social security number), while protecting that common identifier and
`ensuring that it is not revealed in the matching process or the resulting joined dataset.
`The joined dataset may then be anonymised using Privitar Publisher, or made available
`for privacy preserving analysis using Privitar Lens.
`Section B: Lens
`Lens relates to a computer-implemented process for running computations and queries
`over datasets such that privacy is preserved; access control methods, noise addition,
`generalisation, rate limiting (i.e. throttling), visualization, and monitoring techniques are
`Lens is a system for answering queries on datasets while preserving privacy. It is
`applicable for conducting analytics on any datasets that contain sensitive information
`about a person, company, or other entity whose privacy must be preserved. For instance,
`it could be used to conduct analytics on hospital visit data, credit card transaction data,
`mobile phone location data, or smart meter data. As shown in Figure 2, Lens (11) is
`typically the only gateway through which a data analyst (14) can retrieve information
`about a dataset (12). The dataset itself is protected in a secure location (13). The data
`owner or holder (15) (e.g. the bank or health company) can configure Lens and audit
`analysts’ activity through Lens. Lens restricts access for configuration of the query
`system to a single channel, with a restricted set of ways to retrieve information and types
`of information that may be retrieved.
`Lens differs from previous efforts to implement privacy-preserving query interfaces. The
`two most notable previous attempts are PINQ (McSherry, Frank D. “Privacy integrated
`queries: an extensible platform for privacy-preserving data analysis.” Proceedings of the 2009
`ACM SIGMOD International Conference on Management of Data. ACM, 2009) and GUPT
`(Mohan, Prashanth, et al. “GUPT: privacy preserving data analysis made easy.” Proceedings
`of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012),
`both of which were academic projects that have been published. One broad difference is
`that the previous attempts were narrowly scoped software libraries, while Lens is a
`comprehensive application with a number of novel usability features and privacy
`optimizations. Some differences are described in more detail here. The first difference is
`that Lens is a web service, while the previous attempts were software libraries (PINQ in
`C#, GUPT in Python). The second difference is that Lens, because it is a live service
`with logins and user authentication, has separate interfaces for the data analyst and the
`data holder, while the previous attempts do not have such a separation of interfaces. The
`third difference is that the functionality provided to both data analysts and data holders far
`outstrips the functionality provided by the previous attempts, particularly in the ability
`for the data holder to control the service as it is running, and to audit all activity in the
`system, and additionally in the ability of the data analyst to browse datasets and get a
`sense of their look and feel. The fourth difference is that Lens has several usability
`features, such as reporting noise on results and allowing the user to specify privacy
`parameters in novel, intuitive ways. The fifth difference is that Lens has several privacy
`optimizations, such as the ability to designate columns as public or private and
`automatically decide whether to add noise to queries based on whether they concern
`private columns.
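The public/private column optimization can be pictured with the sketch below. The text does not state the exact decision rule, so the rule used here is an assumption: a query concerns a private column if its aggregate or any of its filters references one. The query layout (an aggregate plus a list of filters, as plain dictionaries) and all names are likewise hypothetical.

```python
def referenced_columns(query):
    """Collect every column named by the query's aggregate or by its filters."""
    columns = {f["column"] for f in query["filters"]}
    aggregate_column = query["aggregate"].get("column")
    if aggregate_column is not None:
        columns.add(aggregate_column)
    return columns


def needs_noise(query, private_columns):
    """Assumed rule: add noise if the query touches at least one private column."""
    return not referenced_columns(query).isdisjoint(private_columns)


# Hypothetical designation made by the data holder.
private_columns = {"salary", "diagnosis"}

public_query = {
    "aggregate": {"function": "COUNT"},
    "filters": [{"column": "office", "operator": "=", "value": "London"}],
}
private_query = {
    "aggregate": {"function": "MEAN", "column": "salary"},
    "filters": [{"column": "office", "operator": "=", "value": "London"}],
}

print(needs_noise(public_query, private_columns))
print(needs_noise(private_query, private_columns))
```

Under this assumed rule, a count filtered only on public columns is answered exactly, while any query that aggregates or filters on a private column receives noise.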
`The remainder of this section is structured as follows. Lens is a query interface service,
`and section 1 defines what we mean by query and describes the scope of queries that are
`handled. Section 2 defines what types of datasets Lens handles. Section 3 defines the
`architectural setup of Lens. Section 4 defines the general steps that Lens follows for
`handling a query. Section 5 describes the features of Lens that preserve the privacy of
`individuals by making sure that outputs of Lens do not leak information about
`individuals. Section 6 describes how to configure Lens. Section 7 outlines some examples
`of use cases.
`Scope of queries
`Lens may answer any query that is aggregate in nature. Aggregate queries are queries that
`give statistical information about a group of people rather than an individual. Figure 3
`shows a screenshot with an example of a query interface. Examples of aggregate queries
`range from sums, counts, and means to clusterings and linear regressions. The supported
`types of queries may include, but are not limited to:
`SQL-like aggregate queries
`These are queries that are equivalent to SELECT COUNT(*), SELECT SUM(variable),
`and SELECT AVG(variable) in the SQL language. In these queries, a number of filters
`are applied to the dataset to get a subset of records, and then either the records are
`counted or the sum or average is found of a certain column within the subset. Lens
`expresses these queries as an abstract syntax tree in which there are two parts: an
`aggregate and a list of filters. The aggregate has two parts: a function (e.g. SUM, MEAN,
`or COUNT) and a column name (which can be missing if the function does not need a
`column name, for instance if it is COUNT). The filters each have three parts: a column
`name, a comparison operator, and a value. The comparison operator may be less than,
`greater than, less than or equal to, greater than or equal to, equal to, or not equal to.
`However, if the column designated is a categorical column, the comparison operator is
`restricted to the smaller list of equal to or not equal to. These queries are passed into
`Lens through the user interface, which may be a REST API or a web page (see
`screenshot below). The REST API accepts a JSON object which specifies each of the
`data fields listed above. Lens has a number of connectors which use the abstract syntax
`tree to construct a query for a certain underlying database. For instance, for a
`PostgreSQL query, the aggregate is constructed as “<function>(<column name>)”,
`the filters are constructed as “<column name>
`<comparison operator> <value>”, and the full query is assembled as “SELECT
`<function>(<column name>) FROM <table name> WHERE <filter1> AND
`<filter2> AND ... AND <filterN>”.
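The translation from the abstract syntax tree to a SQL string can be sketched as follows. This is an illustrative connector, not the Lens implementation: the dictionary layout, the function `build_query`, the mapping of MEAN to AVG, the identifier quoting, and the use of `%s` placeholders (the style used by common PostgreSQL drivers for Python) in place of inlined values are all assumptions.

```python
FUNCTION_TO_SQL = {"SUM": "SUM", "MEAN": "AVG", "COUNT": "COUNT"}
ALL_OPERATORS = {"<", ">", "<=", ">=", "=", "<>"}
CATEGORICAL_OPERATORS = {"=", "<>"}  # categorical columns allow only (in)equality


def quote_identifier(name):
    """Quote a table or column name as a SQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def build_query(table, aggregate, filters, categorical_columns=frozenset()):
    """Assemble 'SELECT <function>(<column>) FROM <table> WHERE <filters>'.

    Returns the SQL text and the list of filter values to bind to it.
    """
    function = FUNCTION_TO_SQL.get(aggregate["function"])
    if function is None:
        raise ValueError("unsupported aggregate function")
    column = aggregate.get("column")
    if column is None:
        if function != "COUNT":
            raise ValueError("only COUNT may omit the column name")
        select = "COUNT(*)"
    else:
        select = f"{function}({quote_identifier(column)})"

    clauses, params = [], []
    for f in filters:
        allowed = CATEGORICAL_OPERATORS if f["column"] in categorical_columns else ALL_OPERATORS
        if f["operator"] not in allowed:
            raise ValueError("operator not allowed for this column")
        clauses.append(f"{quote_identifier(f['column'])} {f['operator']} %s")
        params.append(f["value"])

    sql = f"SELECT {select} FROM {quote_identifier(table)}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = build_query(
    "employees",
    {"function": "MEAN", "column": "salary"},
    [
        {"column": "age", "operator": ">", "value": 30},
        {"column": "grade", "operator": "=", "value": "A"},
    ],
    categorical_columns={"grade"},
)
print(sql, params)
```

Binding filter values as parameters rather than concatenating them into the string is a design choice made here to avoid SQL injection; the text itself does not say how values are inserted.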
`This family of queries can be extended to contingency tables (which is one aggregate
`query for each cell in the contingency table). An image of a contingency table is shown in
`Figure 4; the example is salary broken down by location and employee grade. For
`contingency tables, the same inputs as a normal aggregate query are passed in, in addition
`to a list of categorical columns to “group by”. Lens first queries the underlying database
`to determine each value present in each categorical column. It expresses these in lists,
`henceforth referred to as COL1VALS, COL2VALS, ..., COLNVALS. Lens then
`iteratively selects each unique combination of (col1val, col2val, ..., colNval) where
`col1val is chosen from COL1VALS, col2val is chosen from COL2VALS, ..., and colNval is
`chosen from COLNVALS. Lens constructs a query for each resulting tuple (col1val,
`col2val, ..., colNval) that is the base query with N additional filters where, for filter i in
`{1..N}, the column name is the name of the i-th group by column, the
`comparison operator is equal to, and the value is the i-th value in the tuple. Lens then
`constructs queries for the underlying database for each of these queries, and then returns
`the results for each query. Each result is a cell in a logical contingency table. The GUI can
`represent a contingency table with up to 2 group by variables as a straightforward
`two-dimensional table, where the values of col1 are the column headers, the values of col2
`are the row headers, and each entry at column i and row j is the query result for the tuple
`(colheader_i, rowheader_j).
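The expansion of a contingency table into one query per cell can be sketched with a Cartesian product over the distinct values of the group by columns. The function name `contingency_queries`, the representation of a filter as a dictionary, and the sample values are assumptions for illustration; in practice the distinct values would first be fetched from the underlying database.

```python
from itertools import product


def contingency_queries(base_filters, group_by_values):
    """Build the filter list for every cell of a contingency table.

    group_by_values maps each group by column to the distinct values found
    in that column. The result maps each tuple of values to the base filters
    plus one equality filter per group by column.
    """
    columns = list(group_by_values)
    cells = {}
    for combination in product(*(group_by_values[c] for c in columns)):
        extra = [
            {"column": column, "operator": "=", "value": value}
            for column, value in zip(columns, combination)
        ]
        cells[combination] = list(base_filters) + extra
    return cells


# Hypothetical base query filter and distinct values per categorical column.
base = [{"column": "year", "operator": "=", "value": 2016}]
values = {"location": ["London", "Leeds"], "grade": ["A", "B", "C"]}

cells = contingency_queries(base, values)
print(len(cells))
```

With two locations and three grades the sketch yields six cells, each of which would be sent to the database as its own aggregate query.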
`Parametrized models
`Certain supervised and unsupervised learning models, such as linear regressions or
`k-means clustering, have well known training algorithms. This query type takes as input the
`parameters for the training algorithm, trains the model on the dataset, and returns as
`output the parameters of the trained model. Lens may use a language such as Predictive
`Model Markup Language (PMML) to specify the model (see
`en.wikipedia.org/wiki/Predictive_Model_Markup_Language). PMML is a well-defined
`way to describe a predictive model in XML. Parameters will vary based on the
`algorithm. For instance, for decision trees, the number of levels of the tree is required, as
`well as the columns to use, and the column to be predicted. Outputs also vary based on
`model type. For instance, the decision tree algorithm outputs a decision tree: a tree of
`nodes where each node has a variable name and a threshol

