

March 16, 2023

Synthetic data: handle with care

When using synthetic data to train AI models, first understand the risks amid the many benefits.


In the news

Many business leaders would agree that data is now the world’s most valuable commodity. What if, as this piece posits, it were possible to create infinite amounts of it, “cheaply and quickly”?

Welcome to the world of synthetic data. As the term implies, this data is created digitally rather than gathered from real-world events. Research shows that synthetic data may be superior to actual data as a training tool for artificial intelligence (AI) systems.

While synthetic data contains none of the original data from which it was derived, it retains the statistical properties of that data. In principle, anything you do with it, such as building a predictive model, should therefore produce results comparable to those you would get if the original data were used.
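To make the statistical idea concrete, here is a minimal sketch, assuming a single numeric column that is roughly normally distributed. It fits a mean and standard deviation to a small set of "real" values and samples new values from that fitted distribution. The function names and the sample numbers are hypothetical and chosen only for illustration; production synthesizers use far richer models than this.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def synthesize(values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to `values`.

    Assumption: a normal distribution is an adequate model for this column.
    """
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements (illustrative numbers only).
real_ages = [34, 45, 52, 61, 29, 48, 57, 39, 66, 43]
synthetic_ages = synthesize(real_ages, n=10_000)

real_mu, real_sigma = fit_gaussian(real_ages)
synth_mu, synth_sigma = fit_gaussian(synthetic_ages)
print(f"real:      mean={real_mu:.1f}, stdev={real_sigma:.1f}")
print(f"synthetic: mean={synth_mu:.1f}, stdev={synth_sigma:.1f}")
```

The synthetic values are freshly sampled rather than copied from the real ones, yet their summary statistics should land close to the originals. Note that a model this simple preserves only the mean and spread of one column, not relationships between columns.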

In healthcare, for example, synthetic data wouldn’t contain any actual patient data, so privacy regulations like HIPAA or GDPR may not apply to it.

Synthetic data is already being used heavily in the autonomous vehicle sector, where it’s essentially impossible to gather enough real-world data to cover all potential driving situations. Two additional use cases are reducing bias in biometric algorithm training by increasing the demographic diversity of the data set, and boosting clinical researchers’ efficiency by using non-patient, and thus more shareable, data.

The Cognizant take

Overall, synthetic data “offers a flexible, cost-effective way to generate high-quality training data for machine learning models,” says Aakash Shirodkar, a Senior Director in Cognizant’s AI & Analytics Practice. “By using synthetic data, companies can address privacy concerns, overcome data scarcity and accelerate the development of AI applications across various industries.”

When advising clients on synthetic data, Aakash urges them to consider its risks and constraints, including:

  • The more closely the synthetic data resembles the actual underlying data, the more likely it is that it can be reverse-engineered to uncover actual sensitive data.

  • Outliers can pose a problem to the final model output when synthetic data is scaled.

  • The biases contained in real-world data could also potentially cause issues. After all, Aakash notes, “Most real-world data is biased in some way. You run the danger of replicating and magnifying that, skewing your synthetic data accordingly.” 

  • When a field is new, real data sets may be too small to synthesize effectively. As an example, Aakash points to new and emerging payment methods.
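One simple check related to the reverse-engineering risk above is to measure how close each synthetic record sits to its nearest real record. The sketch below is a rough heuristic, not a formal privacy guarantee. It assumes records are numeric tuples on comparable scales, and the function names, sample rows and threshold are all hypothetical.

```python
import math


def nearest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its closest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def flag_near_copies(synthetic_rows, real_rows, threshold):
    """Return synthetic rows that lie within `threshold` of some real row."""
    return [
        row
        for row in synthetic_rows
        if nearest_real_distance(row, real_rows) < threshold
    ]


# Hypothetical (age, systolic blood pressure) records, for illustration only.
real_rows = [(34.0, 120.0), (52.0, 135.0), (61.0, 142.0)]
synthetic_rows = [(34.1, 120.2), (45.0, 128.0), (70.0, 150.0)]

flagged = flag_near_copies(synthetic_rows, real_rows, threshold=1.0)
print(flagged)
```

Rows that are flagged sit almost on top of a real record and deserve a closer look before the data set is shared. In practice the threshold would need to be chosen per data set, and features would need to be scaled so that no single column dominates the distance.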

“We have to be very responsible, keep an eye on where we are going, and have checks and balances everywhere,” Aakash says. “Of course, this is true of any new technology—but the stakes are very high here.” He recommends starting with one use case, examining the numerous techniques to synthesize data and selecting the one that best matches the business’s needs.



Tech to Watch Blog
Cognizant’s weekly blog

Understand the transformative impact of emerging technologies on the world around us as they address our most significant global challenges.

editorialboard@cognizant.com


