Enabling Data Access through Privacy Preserving Synthetic Data
In this blog, and the accompanying technical report, we describe how the Data Science Campus are developing privacy preserving synthetic data to enable faster access to data, and to support the ongoing delivery of the Integrated Data Service (IDS). This is demonstrated by synthesising the linked 2011 Census and deaths dataset while preserving its confidentiality.
Data drives the ONS’ work to produce statistics that inform the public, and support decision-making that improves lives. Key to delivering this is data access: having the right data available, for the right people, with the increasingly diverse tools they need to process the data. As part of meeting this need, the Integrated Data Service (IDS) is a cross-government initiative providing government analysts, devolved administrations, and external accredited researchers access to linked data.
Owen Daniel, Lead Data Scientist in the Data Science Campus, says: “Synthetic data offers a lot for the public good. Not only can it provide safer, faster access to researchers, but it can enable the high-quality tooling required for effective analysis across government.”
Talking about its application, he adds: “Our hope is that wider access to synthetic data in the IDS will mean that data users can make informed decisions about whether data will meet their purposes, before investing time in applying for full project accreditation.”
What is Privacy Preserving Synthetic Data?
Synthetic data are artificially generated data that are made to resemble real-world, often sensitive, data. We previously published our early approaches to data synthesis using generative adversarial networks (GANs), autoencoders and synthetic minority oversampling. Previous work has also included supporting Census 2021 preparations by creating a synthetic census for testing the data processing pipelines ahead of real data being available.
A substantial challenge in creating high-quality synthetic data is that, in the quest to retain the statistical properties of the original, we risk also reproducing the private or confidential information it contains. In previous work, we explored privacy preserving synthetic data generation with differentially private GANs. However, like many deep learning models, GANs can be very difficult to explain to non-technical audiences, which limits their impact.
Today we are publishing a technical report setting out the approach we have taken to overcome both challenges, applied to synthesising the linked 2011 Census and deaths dataset while preserving its confidentiality, alongside the open-source code base so that others can reproduce our approach.
Our contribution has been to adapt a state-of-the-art synthesis method, which won the NIST 2018 Differential Privacy Synthetic Data Challenge, to work in a practical setting where decision making is shared between data scientists and non-technical stakeholders, including data owners and legal experts. In short:
- Our approach allows us to control exactly which statistics contribute to generating the synthetic data – enabling data owners to decide what information they consider safe to disclose for the given purpose.
- In addition, we’ve followed the recent guidance from the UK Information Commissioner’s Office on the adoption of privacy-enhancing technologies (PETs) and added noise to the data using differential privacy: a formal mathematical approach to disclosure control that allows us to quantify the privacy risk.
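To make the two points above concrete, here is a minimal sketch of the general idea, not the code from the technical report. It assumes the data owner has approved one table of counts (a "marginal") for release, and it perturbs every cell with the textbook Laplace mechanism. The function name `noisy_marginal`, the toy records and the column names are hypothetical, chosen only for illustration; the method described in the report may use a different noise distribution and privacy accounting.

```python
import itertools
import random
from collections import Counter


def noisy_marginal(records, domain, epsilon, rng=None):
    """Count records per cell of the approved marginal and add Laplace noise.

    `domain` maps each approved column to its full list of possible values,
    so that empty cells are noised too (hypothetical interface).
    """
    if rng is None:
        rng = random.Random()
    columns = list(domain)
    counts = Counter(tuple(r[c] for c in columns) for r in records)
    # Adding or removing one record changes exactly one cell by 1,
    # so the L1 sensitivity is 1 and the Laplace scale is 1 / epsilon.
    scale = 1.0 / epsilon
    noisy = {}
    for cell in itertools.product(*(domain[c] for c in columns)):
        # The difference of two exponential draws is Laplace distributed.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        noisy[cell] = counts[cell] + noise
    return noisy


# Toy, made-up records; only the columns listed in `domain` are used.
records = [
    {"age_band": "0-39", "sex": "F"},
    {"age_band": "0-39", "sex": "M"},
    {"age_band": "40+", "sex": "F"},
    {"age_band": "40+", "sex": "F"},
]
domain = {"age_band": ["0-39", "40+"], "sex": ["F", "M"]}
released = noisy_marginal(records, domain, epsilon=1.0, rng=random.Random(0))
```

A smaller `epsilon` means more noise and a stronger privacy guarantee. Each marginal released in this way spends part of the overall privacy budget, which is one reason why choosing which statistics to include is a decision to share with data owners.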
Through our conversations with ONS data protection officers and disclosure control experts, we are confident that our proposed uses of synthetic data meet the stringent requirements set out in the Statistics and Registration Service Act 2007 and in data privacy legislation such as the UK GDPR and the Data Protection Act 2018.
How will our synthetic data be used?
The IDS is currently operating in a beta phase. As the service continues to mature, our synthetic data will play a key role in supporting the onboarding of users and tooling.
The primary use will be to give accredited researchers access to synthetic data in a demonstration area, so that they can become familiar with the confidential data without needing full access to it. Minimising unnecessary data access is a cornerstone of safely managing confidential data. This synthetic data demonstration area will allow approved users to build experience in using the platform.
This will be a critical feedback step for the IDS, and one that would be hard to implement on sensitive data, as each researcher’s project needs individual accreditation. In this way, synthetic data will enable efficient onboarding of users and improvement of the platform.
The Data Science Campus are also supporting the IDS with the delivery of future training sessions. By moving these sessions onto our synthetic data, we open up the opportunity for them to be recorded and reused, ensuring the training reaches a broad audience with proportionate resource.
Finally, the data are currently being used by colleagues in ONS Methodology to test that methods onboarded to the Statistical Methods Library (SML) are accessible to IDS users and function as expected in the IDS environment.
These use cases speak to the immediate need for synthetic data, but we are also working with the IDS to identify the longer-term strategic role that it can play in enabling timely access to data. Our hope is that whilst accreditation is being sought, initial analytical pipelines can be drafted on synthetic data.
In an increasingly data-driven world, not only can the efficiencies enabled by synthetic data save time and money, they also have the potential to save lives and improve decision-making for the public good.