Synthetic data: a dangerous sense of certainty

Synthetic data & AI

Dani Shanley and Joshi Hogenboom on synthetic data, the pains and gains of interdisciplinarity, and why AI likely won’t release us from having to study the world we live in. 

Synthetic data is information generated by algorithms trained on data sets gathered in the real world. The generated data has similar statistical properties to the original and can be used to augment or diversify it. This is useful for validating mathematical models, technical prototyping, and training machine-learning models. While very promising, synthetic data comes with serious ethical and practical concerns.

“There is a lot of hype around the potential – and with it, many inflated promises,” explains Dani Shanley. “There were isolated critical voices too, but we felt that academics should engage in a serious dialogue around the risks and how we can mitigate them.” Just before her maternity leave, Shanley’s colleague Flora Lysen set up an interdisciplinary collaboration between FASoS and UM researchers at the Clinical Data Science Department at Maastricht UMC+ and Maastro Clinic. Together, they produced a commentary, which was recently published in the prestigious EMBO Reports.

Free, private and diverse as a rainbow

Joshi Hogenboom is an epidemiologist and biomedical researcher who specialises in deriving knowledge from geographically dispersed data while maximising individual privacy. He has experimented with sophisticated modelling and deep learning to synthesise data that mimics real healthcare data, in that it has the same statistical characteristics. Having studied both the benefits and the vulnerabilities, he explains that technological progress has made synthetic data available at scale and at a very low price. “Clinical trials can cost millions; by comparison, synthetic data costs next to nothing since it is generated from existing data.”

That makes synthetic data a viable solution for a host of problems. It supposedly offers privacy: the new data set can be used to test processes without exposing actual patient data. It also makes it possible to deal with a lack of data. “If you only have data for a hundred patients but you need 10,000 for your statistical test, you can generate synthetic data with similar properties. If all 100 patients are from the hospital here, you can use even more advanced techniques to artificially increase the diversity of the data set.”
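As a rough illustration of the augmentation Hogenboom describes, the sketch below fits a very simple parametric model (a multivariate Gaussian) to a small cohort and samples a much larger synthetic one. It is a minimal stand-in for the deep learning approaches mentioned above, not the method from the commentary; the features, values and model choice are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a small real cohort: 100 patients, 3 numeric features
# (hypothetical: age, systolic blood pressure, BMI).
real = rng.normal(loc=[60, 130, 27], scale=[10, 15, 4], size=(100, 3))

# Fit a simple parametric model: the empirical mean and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw 10,000 synthetic "patients" with similar statistical properties.
synthetic = rng.multivariate_normal(mean, cov, size=10_000)

print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))  # roughly agree
```

The summary statistics of the synthetic sample roughly match the originals, which is exactly what makes the approach attractive – and, as the next section argues, deceptive.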

Filling in the gaps – quick and dirty

However, filling in the gaps to make data more representative, like many of the futuristic promises of AI, tends to neglect the problems we have in the present. “Rather than solving a problem, we’re masking it,” Shanley warns. “If we don’t have data from underrepresented groups for sociomaterial and historic reasons, the synthetic data will represent a world that doesn’t exist – but with the implicit promise of representation attached.” She likens it to speaking up for others, which, even when done with the best of intentions, denies them an actual voice.

“We have already seen cases of companies over-relying on synthetic data and developing, for example, patient-care ecosystems that are not tailored to the real world,” says Hogenboom. Regarding the promise of privacy, he contextualises: “We have seen models generate actual patient data if they are not implemented with utmost caution. People would just assume privacy as a given because the data is synthetic.”
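The failure mode Hogenboom describes – a generator re-emitting records it was trained on – can be probed with even very crude checks. The function below is a deliberately naive sketch of such a check; real privacy auditing relies on far stronger attacks, such as membership inference. The function name and rounding heuristic are our own illustration, not anything from the commentary.

```python
import numpy as np

def exact_leakage_rate(real: np.ndarray, synthetic: np.ndarray,
                       decimals: int = 3) -> float:
    """Fraction of synthetic rows that exactly reproduce a real row
    (after rounding). A non-zero rate suggests the generator has
    memorised and re-emitted training records."""
    real_rows = {tuple(row) for row in np.round(real, decimals)}
    hits = sum(tuple(row) in real_rows
               for row in np.round(synthetic, decimals))
    return hits / len(synthetic)
```

A generator that has memorised its training set fails this test immediately, yet the output would still be labelled “synthetic” – which is precisely why assuming privacy as a given is dangerous.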

There is a chance that it will contaminate real-world data.

Joshi Hogenboom

Regress and reality

Another problem is that the more complex deep learning models are, the harder it is to explain how they arrive at their results. Mind you, many synthetically generated data sets are used to train other deep learning models, which might in turn generate data sets used to train other algorithms, until there is a Russian doll of black boxes. “Depending on the level of sophistication,” says Hogenboom, “some AIs can recognise synthetic data as such, e.g. in the case of image generation, but we’re already at the point where that’s no longer a given. There is a chance that it will contaminate real-world data.”
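One standard way of asking whether synthetic data “can be recognised as such” is a classifier two-sample test: train a model to tell real rows from synthetic ones and see how well it does. The sketch below is an illustration of that general idea under our own assumptions, not a technique from the commentary; the function name and parameters are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def detectability(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Classifier two-sample test: accuracy near 0.5 means the synthetic
    data is statistically hard to tell apart from the real data;
    accuracy near 1.0 means it is easy to spot."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
```

Once generators pass such tests reliably, synthetic records that leak into shared repositories become indistinguishable from real ones – the contamination Hogenboom warns about.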

This brings us back to the problem of the data no longer teaching us anything new – or worse. According to Shanley, the inherent biases and flaws in the data only get inflated. “Overreliance on the promises of AI is a broader peril; it’s like a magic bullet that absolves us from the need for detailed qualitative research, which is expensive and time-consuming. But we need that to actually understand phenomena.”

Joshi Hogenboom is a PhD candidate at Clinical Data Science, a joint academic department of Maastricht University, Maastricht UMC+ and Maastro Clinic. He holds Master of Science degrees in Biomedical Sciences and Epidemiology from UM.

Cross-river collaboration

In their commentary, Shanley et al. look at synthetic data through the lens of core concepts in AI ethics: responsibility, non-maleficence, privacy and transparency, as well as justice, fairness and equity. The intention was certainly not to express hostility towards technological progress or to lobby for a moratorium; the potential advantages are obvious. “This almost binary approach of either unfettered enthusiasm or dystopian fear isn’t helpful,” explains Shanley. “We wanted to alert the community that this might grow legs quite quickly and run away with us, so we have to think about mechanisms to ensure responsible use as early as we can. Beyond the technical aspects, we should be clear on what we want to use this for and how.”

Somewhat disappointingly, there’s no toggle one can click to make an algorithm ethical in the design phase. That is why Shanley saw collaboration with technical experts as crucial. “It’s all well and good telling developers they have to make transparent algorithms, but how and to what extent can you actually operationalise these concepts? In the social sciences, we have abstract conversations around these concepts, so it was great for us to get a clearer idea of what it actually looks like to try to implement, say, transparency.”

Developers should understand that they are making ethical decisions every day – whether they do so knowingly or not.

Dani Shanley

Embedded ethics

Hogenboom admits with a laugh that his technical tangents might have been hard to follow at times, but Shanley insists that “whenever you work across disciplines, you have to learn to speak enough of each other’s language to follow arguments and engage in dialogue – which really is a lot of work.” She warns against corporate window dressing and, just as she welcomes more technical training for FASoS students, thinks ethics should be an integral part of engineering curricula rather than the odd compulsory course. “Developers should understand that they are making ethical decisions every day – whether they do so knowingly or not.”

Hogenboom agrees: “The collaboration really helped me appreciate the value of considering ethics not just as an afterthought but as an integral part of every stage of research and development.” Citing the Dutch childcare benefits scandal (kinderopvangtoeslagaffaire) and the British Post Office scandal, he warns that technological possibilities shouldn’t blind us to the consequences, not only of the tools themselves but also of swooning over the promises around them. “Technology generates a sense of certainty that’s quite dangerous.”

Text: Florian Raith
