Proof of Concept Data Security

data security

The aim of this Proof of Concept (PoC) was to take pensions data on customers of various pension providers from its raw state on the providers' premises to an appropriate solution within the Google Cloud Platform (GCP) under the control of the Pensions Policy Institute (PPI). The main issue was ensuring that all personally identifiable information (PII) was stripped from the customer data: as this information is eventually publicly releasable, no PII can be held. Because analytics are performed on the data, any hashed values still needed to be matchable against each other, even when they came from different pension providers. For example, Joe Bloggs with two pension pots from two different providers would still be aggregatable and appear in the visualisations presented, but not be personally identifiable: his identity would appear as a seemingly random value that is always the same when derived from the same input data.

Take the example below, with some sample data. The records for Joe Bloggs must remain aggregatable even when they come from different providers, while all PII is removed.

PoC Table 1.JPG

As seen below, all PII has been removed yet the records are still matchable. This has to be done on the providers' premises, so GCP tools such as the Data Loss Prevention (DLP) Application Programming Interface (API) cannot be used.

PoC Table 2.JPG

what was done

An application was created to run on-premises with the providers. Its purpose is to hash (a one-way transformation) any PII in a provided .csv file and to perform some other preparation for BigQuery's data formats (such as setting dates of birth to YYYY-MM-DD, etc.).
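
As an illustration of the approach (not the application's actual source), the Go sketch below hashes assumed PII columns of a CSV with SHA-512; the file names and column positions are placeholders.

    package main

    import (
        "crypto/sha512"
        "encoding/csv"
        "encoding/hex"
        "log"
        "os"
    )

    // hashPII returns a deterministic SHA-512 digest of a value, so the same
    // input always produces the same output, whichever provider it came from.
    func hashPII(value string) string {
        sum := sha512.Sum512([]byte(value))
        return hex.EncodeToString(sum[:])
    }

    func main() {
        in, err := os.Open("pensions.csv") // illustrative file name
        if err != nil {
            log.Fatal(err)
        }
        defer in.Close()

        records, err := csv.NewReader(in).ReadAll()
        if err != nil {
            log.Fatal(err)
        }

        // Assume columns 0 and 1 hold the name and NiNo (illustrative layout).
        for i, rec := range records {
            if i == 0 {
                continue // keep the header row readable
            }
            rec[0] = hashPII(rec[0])
            rec[1] = hashPII(rec[1])
        }

        out, err := os.Create("pensions_hashed.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        w := csv.NewWriter(out)
        if err := w.WriteAll(records); err != nil {
            log.Fatal(err)
        }
    }

Because SHA-512 is deterministic, two providers hashing the same NiNo produce the same digest, which is what keeps the records matchable after the PII is removed.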

This file is then uploaded to a storage bucket in GCP, which automatically triggers a background function to move the data into BigQuery (the database). Some error checking is performed at this stage: if there is erroneous data (e.g. an incorrect NiNo), it is placed into a separate “error” area that is not used for the visualisations, so it can be reviewed where required.
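
A minimal sketch of such a background function is below, assuming illustrative project, dataset, and table names; the real function's error checking (NiNo validation, the “error” area, etc.) is only marked by a comment.

    package poc

    import (
        "context"
        "fmt"

        "cloud.google.com/go/bigquery"
    )

    // GCSEvent carries the fields of the Cloud Storage object-finalise
    // event used here.
    type GCSEvent struct {
        Bucket string `json:"bucket"`
        Name   string `json:"name"`
    }

    // LoadToBigQuery runs whenever a hashed CSV lands in the bucket and
    // loads it into BigQuery.
    func LoadToBigQuery(ctx context.Context, e GCSEvent) error {
        client, err := bigquery.NewClient(ctx, "ppi-poc-project") // illustrative project ID
        if err != nil {
            return fmt.Errorf("bigquery.NewClient: %v", err)
        }
        defer client.Close()

        gcsRef := bigquery.NewGCSReference(fmt.Sprintf("gs://%s/%s", e.Bucket, e.Name))
        gcsRef.SourceFormat = bigquery.CSV
        gcsRef.SkipLeadingRows = 1 // skip the header row

        // Dataset and table names are illustrative.
        loader := client.Dataset("pensions").Table("pots").LoaderFrom(gcsRef)
        job, err := loader.Run(ctx)
        if err != nil {
            return err
        }
        status, err := job.Wait(ctx)
        if err != nil {
            return err
        }
        if err := status.Err(); err != nil {
            // Rows that fail validation could be diverted to the "error" area here.
            return err
        }
        return nil
    }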

Once stored in BigQuery, the data is visualised with Looker, which connects to BigQuery and aggregates the data into meaningful charts, diagrams, and tables as required.

PoC.JPG
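
To show why deterministic hashing matters for the aggregation, the hypothetical query below (run through the BigQuery Go client, with assumed table and column names) groups pension pots by the hashed NiNo, so one customer's pots from different providers roll up into a single row:

    package main

    import (
        "context"
        "fmt"
        "log"

        "cloud.google.com/go/bigquery"
        "google.golang.org/api/iterator"
    )

    func main() {
        ctx := context.Background()
        client, err := bigquery.NewClient(ctx, "ppi-poc-project") // illustrative project ID
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Table and column names are illustrative. Because the SHA-512
        // digest is deterministic, pots belonging to the same customer
        // share a nino_hash even when loaded from different providers.
        q := client.Query(`
            SELECT nino_hash, COUNT(*) AS pots, SUM(pot_value) AS total_value
            FROM pensions.pots
            GROUP BY nino_hash`)
        it, err := q.Read(ctx)
        if err != nil {
            log.Fatal(err)
        }
        for {
            var row struct {
                NinoHash   string  `bigquery:"nino_hash"`
                Pots       int64   `bigquery:"pots"`
                TotalValue float64 `bigquery:"total_value"`
            }
            err := it.Next(&row)
            if err == iterator.Done {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s: %d pots, total %.2f\n", row.NinoHash, row.Pots, row.TotalValue)
        }
    }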

what tech was used?

Go (Golang) was used for the on-premises “CSV hasher” program. It is a modern language created at Google with many useful features that older languages lack, which will help any future development of the application. It compiles to a single binary, so compatibility across the providers' systems is straightforward. Hashing uses the SHA-512 algorithm, which is effectively future-proofed for some time: industry standards primarily use SHA-256, so SHA-512 doubles the standard digest length, and there are no known practical attacks on it at this time.

Cloud Functions, a GCP managed service, is configured to perform tasks automatically. Google Cloud Storage holds the uploaded files, which are encrypted in transit and at rest. BigQuery is used as the database and is likewise secure: only those granted access by PPI can view the data or perform any operations on it. Managed GCP services were chosen wherever possible because they are much easier to run than maintaining a virtual machine and normally come at reduced cost, depending on usage of the product. They are well supported by Google and come with many Google-created programming libraries to work with, which suits both ease of use and our technical expertise.

Looker is the visualisation tool; it is used by CTS when Google Data Studio cannot perform some of the tasks required. It is a paid tool that connects to BigQuery and performs the data aggregation. It was chosen because of our own technical expertise with the product; however, many alternative tools are available.