March 16, 2023
When using synthetic data to train AI models, first understand the risks amid the many benefits.
Many business leaders would agree that data is now the world’s most valuable commodity. What if, as this piece posits, it were possible to create infinite amounts of it, “cheaply and quickly”?
Welcome to the world of synthetic data. As the term implies, this data is created digitally rather than gathered from real-world events. Research shows that synthetic data may be superior to actual data as a training tool for artificial intelligence (AI) systems.
While synthetic data contains none of the original data from which it was derived, it retains the statistical qualities of that data. From a statistical standpoint, therefore, anything you do with it, such as building a predictive model, should produce essentially the same results as if the original data were used.
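To make that idea concrete, here is a minimal sketch in Python. It assumes a deliberately simple generator, a two-column Gaussian model fitted to made-up "real" records, which is far simpler than the techniques production tools typically use. The column names (age, blood pressure) and every function name are hypothetical and chosen only for illustration. The sketch summarizes the original records by their means, spreads and correlation, then draws entirely new records from that summary and compares the two sets.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_gaussian(xs, ys):
    """Summarize two columns by their means, standard deviations and correlation."""
    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "sd_x": statistics.stdev(xs),
        "sd_y": statistics.stdev(ys),
        "rho": pearson(xs, ys),
    }


def sample_synthetic(params, n, rng):
    """Draw n new (x, y) records from the fitted two-column Gaussian."""
    rho = params["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["sd_x"] * z1
        # Mix z1 and z2 so that y is correlated with x at level rho.
        y = params["mean_y"] + params["sd_y"] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Stand-in for "real" records: hypothetical (age, systolic blood pressure) pairs.
ages = [rng.gauss(50, 12) for _ in range(5000)]
pressures = [90 + 0.6 * a + rng.gauss(0, 8) for a in ages]

params = fit_gaussian(ages, pressures)
synthetic = sample_synthetic(params, 5000, rng)
syn_ages = [row[0] for row in synthetic]
syn_pressures = [row[1] for row in synthetic]

print("real mean age:     ", round(statistics.fmean(ages), 1))
print("synthetic mean age:", round(statistics.fmean(syn_ages), 1))
print("real correlation:     ", round(pearson(ages, pressures), 2))
print("synthetic correlation:", round(pearson(syn_ages, syn_pressures), 2))
```

In this sketch the synthetic records are sampled from the fitted summary rather than copied from the original rows, yet the printed means and correlations should land close to each other. A model trained on either set would see a very similar relationship between the two columns.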
In healthcare, for example, synthetic data wouldn’t contain any actual patient data, so privacy regulations like HIPAA or GDPR would not apply to it.
Synthetic data is already being used heavily in the autonomous vehicle sector, where it’s essentially impossible to gather enough data to simulate all potential driving situations. Two additional use cases are eliminating bias in biometric algorithm training by increasing the demographic diversity of the data set, and boosting clinical researchers’ efficiency by using non-patient—and thus more shareable—data.
Overall, synthetic data “offers a flexible, cost-effective way to generate high-quality training data for machine learning models,” says Aakash Shirodkar, a Senior Director in Cognizant’s AI & Analytics Practice. “By using synthetic data, companies can address privacy concerns, overcome data scarcity and accelerate the development of AI applications across various industries.”
When advising clients on synthetic data, Aakash urges them to consider the risks and constraints of adopting it, including:
“We have to be very responsible, keep an eye on where we are going, and have checks and balances everywhere,” Aakash says. “Of course, this is true of any new technology—but the stakes are very high here.” He recommends starting with one use case, examining the numerous techniques to synthesize data and selecting the one that best matches the business’s needs.