Evaluating synthetic data using SynthGauge
Updated 9 June 2022
A version of this blog post was first published on 10 May 2022. It has been updated to include more detail on the background to this work, which is part of our collaboration with researchers from the Alan Turing Institute.
SynthGauge is a Python library that provides a framework for evaluating the utility and privacy of synthetic datasets, using a range of metrics and visualisations. It's the first output of our collaboration with researchers from the Alan Turing Institute. You can explore the SynthGauge repository on GitHub to learn more.
Data synthesis is the process of replacing a private dataset with one that looks and behaves the same but does not reveal personal information about real individuals.
Synthetic data is increasingly being adopted as a way to improve data access and security. One example is making synthetic versions of data available where the original cannot be shared; for instance, the Clinical Practice Research Datalink (CPRD) releases synthetic healthcare records for training and code-testing purposes. Another example is enhancing the privacy guarantees of published statistics, such as the US Census, where differential privacy has been adopted.
We are currently focusing on applying synthetic data to enable data access. This includes providing synthetic datasets so that researchers can understand our data while awaiting accreditation to access the true data. It may also include testing data pipelines on synthetic data while waiting for real data to become available; this expands on our previous work building a synthetic Census dataset as part of the 2019 Census rehearsal.
Why we created SynthGauge
Up to now, much of our research has focused on the methods for generating synthetic data. This includes our research into the use of Generative Adversarial Networks, and the Office for National Statistics (ONS) Methodology team’s pilot analysis testing Differential Privacy.
Before releasing synthetic data, we need to understand its limitations in terms of statistical accuracy, and be confident that the privacy guarantees it affords limit the risk of statistical disclosure.
There is no one-size-fits-all approach to measuring how useful a synthetic dataset is. In some settings, privacy may be valued over statistical accuracy; in others, the opposite may be true. In general, there is a trade-off: the more privacy a synthetic dataset offers, the less useful it tends to be, and vice versa.
We have created SynthGauge to provide a cohesive framework for evaluating synthetic data in terms of both utility and privacy. It will also enable users to understand the strengths and limitations of their methods, and to make informed decisions before putting synthetic datasets to use.
What SynthGauge does
Through its Evaluator class, SynthGauge provides an intuitive and consistent interface for evaluating synthetic datasets, implementing a range of recognised metrics as well as supporting user-defined custom metrics. This enables users to compare many datasets quickly and consistently, giving vital insight into how well each one preserves the properties of the real data. The sketch below gives a flavour of the workflow.
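Here is a minimal sketch of that workflow. The Evaluator class is part of SynthGauge, but the file paths, column names and custom metric shown here are purely illustrative, and method names may differ slightly from the released API; check the API reference documentation on GitHub for the definitive interface.

```python
import pandas as pd
import synthgauge as sg  # pip install synthgauge

# Load the real data and its synthetic counterpart (paths are illustrative)
real = pd.read_csv("real.csv")
synth = pd.read_csv("synth.csv")

# An Evaluator wraps both datasets along with a collection of metrics
evaluator = sg.Evaluator(real, synth)

# Register a recognised metric by name, passing any metric-specific
# keyword arguments (here, the column to compare)
evaluator.add_metric("wasserstein", feature="age")

# A custom metric is just a function of the real and synthetic data;
# this hypothetical one measures the gap between the mean ages
def mean_age_gap(real, synth):
    return abs(real["age"].mean() - synth["age"].mean())

evaluator.add_custom_metric("mean_age_gap", mean_age_gap)

# Compute every registered metric and collect the results
results = evaluator.evaluate()
print(results)
```

Because every registered metric is computed through the same call, the same Evaluator set-up can be reused to score several candidate synthetic datasets against the same real data.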
SynthGauge will not make any decisions on behalf of the user, nor declare one synthetic dataset better than another. Such decisions are dataset- and purpose-dependent, so they can vary widely from user to user. Instead, SynthGauge is intended to support decision makers.
With engagement from the open-source community, we hope the suite of metrics can be expanded and refined, contributing to the evolution of the package.
To help you get started, the SynthGauge repository and application programming interface (API) reference documentation are available on GitHub.
If you have any questions about SynthGauge, or can help us to improve it, please get in touch with us by email.