By Nick Radcliffe

We are used to synthetic alternatives in many areas of life: synthetic leather, synthetic flight (with flight simulators), synthetic medical implants, and even the promise of synthetic meat. While many synthetic alternatives are widely perceived as inferior to the real thing (synthetic leather), others, at least in certain respects, are plainly superior – perhaps titanium knees or perfect artificial diamonds. The concept of synthetic data may be less familiar.

Where real data describes real events, people, entities and places, synthetic data describes artificial people and events – though the places and other entities may be real – while retaining the same shape and structure as the corresponding real data. The big idea is that if synthetic data matches real data in key respects, we can use it instead of real data – for analysis, machine learning, reporting, or education – with little or no loss of fidelity but vastly increased privacy. With rampant identity theft, the surveillance economy and the advent of GDPR, with its possible fines of up to four per cent of global turnover, there are both ethical and business imperatives for adopting better privacy practices.

Synthetic data is not new: people have long used artificial data for testing and for scenario-modelling simulations (“what if?” analyses). What is new is the possibility of training machines to learn the patterns in real data, and then inverting the resulting AI model (in some sense, turning it inside out) to generate synthetic data.
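To make the idea concrete, here is a minimal sketch in Python. A simple Gaussian mixture stands in for the far richer generative models used in practice, and the data and numbers are purely illustrative assumptions, not any particular product's method:

```python
# A toy version of "learn the patterns, then invert the model".
# The mixture model is a stand-in for richer generators (GANs,
# VAEs, Bayesian networks); the data below is invented.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "real" data: two correlated numeric attributes (say, age and income).
real = rng.multivariate_normal(
    mean=[45, 30_000], cov=[[120, 9_000], [9_000, 4e7]], size=1_000
)

# Learn the joint distribution of the real data...
model = GaussianMixture(n_components=3, random_state=0).fit(real)

# ...then "invert" it: sample brand-new, artificial records that share
# the real data's statistical shape but describe no real individual.
synthetic, _ = model.sample(1_000)

print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
```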

If all goes well, the result will be data that replicates the relevant general patterns in real data without reproducing the specific features of any real individuals or events. Naturally, verifying that a synthetic dataset accurately captures the key patterns in a real dataset is complex, and it can be even harder to prove that it properly protects privacy – that is, to show that no specific information about individuals has leaked through to the synthetic data. But progress is rapid, and there are ever-improving procedures that, applied diligently, can establish both with high confidence.
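The flavour of those two checks can again be sketched in a few lines of Python. This reuses the illustrative arrays from the sketch above, and real validation suites go much further (full distributions, correlations, formal privacy attacks):

```python
# Rough versions of two diligence checks: (1) do key statistics in the
# synthetic data match the real data, and (2) has any real record
# leaked through nearly verbatim? Data as in the previous sketch.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = rng.multivariate_normal(
    mean=[45, 30_000], cov=[[120, 9_000], [9_000, 4e7]], size=1_000
)
synthetic, _ = GaussianMixture(n_components=3, random_state=0).fit(real).sample(1_000)

# Fidelity: per-column means and spreads should agree within tolerance.
scale = real.std(axis=0)
mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)) / scale
std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0)) / scale
print("fidelity ok:", bool((mean_gap < 0.1).all() and (std_gap < 0.1).all()))

# Privacy: no synthetic row should sit suspiciously close to a real row.
nearest = cdist(synthetic / scale, real / scale).min(axis=1)
print("near-copies:", int((nearest < 1e-3).sum()))
```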

As with other synthetic counterparts, there are even cases where synthetic data can be better than the real thing, for example by compensating for biases in real data or encapsulating scenarios that have never been seen or recorded. One obvious example is climate forecasting: we don’t have real data about a (modern) world with 3°C of warming relative to the 18th century, but it’s useful to model what it would be like. Similarly, we don’t have data about how a decarbonised economy functions, or about a credit system free of historical biases and exclusions, but both would be useful to simulate.

Synthetic data is not a silver bullet, but it is a vital tool in the arsenal for organisations looking to use and share insights from sensitive data responsibly and safely. Britain is well placed to be in the vanguard, with leading academic research, such as Synthpop from Edinburgh, and innovative start-ups such as Hazy: this doesn’t need to be yet another technology in which the US dominates.

Nick Radcliffe is the Chief Data Scientist at Smart Data Foundry, a non-profit organisation at the University of Edinburgh that is using data to improve lives.