
Synthetic data can be the answer to training expensive AI & LLM models and addressing privacy concerns


Written by Elena Georgieva, Senior Analyst Relations Consultant

The world has gone crazy about AI!

Since its launch at the end of November 2022, ChatGPT has reached 200 million monthly active users globally and 3.9 million paying subscribers in the US, while OpenAI has annual recurring revenue of $2 billion and is now valued at $86 billion.

Pretty impressive numbers for a fairly small company – OpenAI has around 1,200 employees worldwide. It also became the first company to land the number one spot on the CNBC Disruptor 50 list back to back, topping the chart in 2024 and the previous year; SpaceX is the only other company to have topped the list twice, though not in consecutive years.

However, AI and LLM models are becoming increasingly expensive to train. To give you an idea of the scale: in 2017, Google’s Transformer model cost $930 to train; by 2020, OpenAI’s GPT-3 175B model cost $4.325 million; and in 2023, Google’s Gemini Ultra cost an estimated $191.4 million to train.

It isn’t only about the cost though. Many have concerns about privacy, compliance, and the challenges around anonymising the data.

So is there a more cost-effective and safer way to train AI/LLMs?

Enter synthetic data! This type of data is emerging as a powerful tool not only to train AI models but also to address privacy concerns in AI/ML and computer vision model training.

“Synthetic data can bridge information silos by acting as a substitute for real data and not revealing sensitive information, such as personal details and intellectual property.

“Since synthetic datasets maintain statistical properties that closely resemble the original data, they can produce precise training and testing data that is crucial for model development.

“Training computer vision (CV) models often requires a large and diverse set of labelled data to build highly accurate models. Obtaining and using real data for this purpose can be challenging, especially when it involves personally identifiable information (PII),” says Alys Woodward, Sr Director Analyst at Gartner.

By generating artificial data that mimics real-world information, organisations can develop robust models without compromising individual privacy.

What is synthetic data?

Synthetic data is generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world. (Gartner Glossary)
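As a toy illustration of the sampling approach in that definition, the sketch below (all values hypothetical) fits simple distributional parameters to a small "real" dataset and draws synthetic records from the fitted distribution. The synthetic values track the original statistics without any value being copied from a real record:

```python
import random
import statistics

random.seed(42)

# Hypothetical "real" dataset (e.g. ages); in practice these would be
# sensitive records that we don't want to expose directly.
real = [34, 45, 29, 52, 41, 38, 47, 33, 50, 44]

# Fit simple distributional parameters to the real data.
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# Sample synthetic records from the fitted distribution --
# none of these values is taken from a real record.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

print(round(statistics.mean(synthetic), 1))   # close to mu
print(round(statistics.stdev(synthetic), 1))  # close to sigma
```

Real generators are far more sophisticated (capturing correlations across many columns, not one distribution), but the principle is the same: the synthetic set preserves statistical properties rather than individual records.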

The key benefits

  • Privacy protection: Synthetic data eliminates the need to use sensitive personal information, reducing the risk of data breaches and compliance issues.
  • Scalability: Organisations can generate large volumes of diverse data, overcoming the limitations of scarce or imbalanced real-world datasets.
  • Edge case coverage: Synthetic data can be designed to include rare scenarios, improving model performance in critical situations.
  • Cost-effectiveness: Reduces the expenses associated with collecting, cleaning, and anonymising real-world data.

The advantages for AI/ML and CV training

  • Scalability: Generate vast amounts of diverse training data on demand.
  • Control over data distribution: Create balanced datasets to reduce bias.
  • Rare scenario simulation: Produce data for edge cases difficult to capture in real life.
  • Cost-effectiveness: Potentially cheaper than collecting and anonymising real data.
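One common way to simulate rare scenarios and rebalance a skewed dataset is SMOTE-style interpolation between existing minority-class examples. The article doesn't prescribe a specific technique, so this is just one illustrative sketch with made-up feature vectors:

```python
import random

random.seed(1)

# Hypothetical minority-class feature vectors (e.g. fraud transactions)
# from an imbalanced dataset of, say, 950 "ok" vs 50 "fraud" records.
minority = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.3]]

def synth_sample(points):
    """SMOTE-style sketch: interpolate between two real minority points."""
    a, b = random.sample(points, 2)
    t = random.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

# Generate enough synthetic minority samples to balance the classes.
synthetic_minority = [synth_sample(minority) for _ in range(900)]
print(len(synthetic_minority))  # 900
```

Each synthetic point lies between real minority examples, so the augmented training set sees far more of the rare class without duplicating any single record.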

Challenges

Using synthetic data isn’t without its challenges though. When companies commit to using this type of data, they need to ensure the data accurately represents real-world patterns and verify that the models trained on synthetic data perform well on real data.

According to Alys Woodward, Sr Director Analyst at Gartner, “creating a synthetic tabular dataset involves striking a balance between privacy and utility, ensuring the data remains useful and accurately represents the original dataset.

“If the utility is too high, privacy may be compromised, especially for unique or distinctive records, as the synthetic dataset could be matched with other data sources.

“Conversely, methods to enhance privacy, such as disconnecting certain attributes or introducing ‘noise’ via differential privacy, can inherently diminish the dataset’s utility.”
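The "noise via differential privacy" trade-off Woodward describes can be seen in a minimal Laplace-mechanism sketch (salary figures and epsilon values below are hypothetical): a smaller privacy budget epsilon forces larger noise, making the released statistic safer but less accurate.

```python
import math
import random

random.seed(7)

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_mean(values, epsilon, lo, hi):
    """Differentially private mean: clamp values, add calibrated noise.
    The sensitivity of a mean over n clamped values is (hi - lo) / n."""
    n = len(values)
    clamped = [min(max(v, lo), hi) for v in values]
    true_mean = sum(clamped) / n
    scale = (hi - lo) / (n * epsilon)  # Laplace scale b = sensitivity / epsilon
    return true_mean + laplace_noise(scale)

salaries = [52_000, 48_000, 61_000, 55_000, 70_000] * 200  # 1,000 records
# Stronger privacy (small epsilon) => larger noise, lower utility.
print(round(private_mean(salaries, epsilon=0.1, lo=0, hi=100_000)))
print(round(private_mean(salaries, epsilon=5.0, lo=0, hi=100_000)))
```

At epsilon = 0.1 the noise scale is 50× larger than at epsilon = 5.0, which is exactly the utility cost of the stronger privacy guarantee.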

There is no question that synthetic data is a powerful tool that helps enterprises balance the need for robust AI/ML and CV model training with stringent privacy requirements. As the technology matures, it’s likely to play an increasingly important role in responsible AI development.