Can synthetic data help scale AI’s data wall?

Technovera answers that.

Implementing generative AI large language models (LLMs) is no straightforward feat—it's entirely reasonable to be skeptical of anyone who claims otherwise. Like any groundbreaking technology, it brings its own set of intricate challenges.

Organizations often struggle to identify use cases that deliver the greatest business impact, while also trying to strike a balance between relentless innovation and robust governance frameworks. These hurdles are just a fraction of the GenAI landscape.

Interestingly, even the LLMs themselves seem to be facing a stumbling block. Experts suggest that these models might soon run out of fresh training data, a plight that’s driving the AI sector toward a potential interim solution: synthetic data.

Synthetic data, in this context, is artificially generated based on the statistical characteristics of real-world datasets. Crucially, it avoids using actual details tied to individuals, companies, or other entities, thereby mitigating risks related to privacy or security. This approach allows organizations to simulate outcomes for various scenarios without jeopardizing sensitive information.
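To make the idea concrete, here is a minimal sketch of generating synthetic records from the statistical characteristics of a dataset. It assumes two numeric columns that are roughly normally distributed. The function names (fit_bivariate, sample_synthetic, make_placeholder_real_rows) and the "age" and "spend" columns are hypothetical, and the "real" rows are themselves made-up placeholders so the example can run without any sensitive data. Production tools typically use far richer generative models than this.

```python
import math
import random
from statistics import fmean


def fit_bivariate(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = fmean(xs), fmean(ys)
    n = len(rows)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    sx, sy = math.sqrt(var_x), math.sqrt(var_y)
    return {"mean": (mx, my), "std": (sx, sy), "corr": cov / (sx * sy)}


def sample_synthetic(stats, n_rows, seed=0):
    """Draw new rows from a bivariate normal distribution with the fitted statistics.

    Only the summary statistics are used here; no original row is copied.
    """
    rng = random.Random(seed)
    (mx, my), (sx, sy), rho = stats["mean"], stats["std"], stats["corr"]
    rows = []
    for _ in range(n_rows):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into the second column so the fitted correlation is preserved.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


def make_placeholder_real_rows(n_rows=2000, seed=42):
    """Stand-in for a real dataset: (age, spend) pairs. Purely illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        age = rng.gauss(40, 10)
        spend = 20 + 1.5 * age + rng.gauss(0, 15)
        rows.append((age, spend))
    return rows


if __name__ == "__main__":
    real = make_placeholder_real_rows()
    stats = fit_bivariate(real)
    synthetic = sample_synthetic(stats, n_rows=5000)
    print("real stats:     ", stats)
    print("synthetic stats:", fit_bivariate(synthetic))
```

Running the script prints the fitted statistics for both datasets, which should be close to each other even though none of the synthetic rows is copied from the originals. A simple Gaussian fit like this does not by itself guarantee privacy, so it should be read as an illustration of the concept rather than a privacy-safe pipeline.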

Some analysts propose that synthetic data could help LLMs overcome what’s been termed the "data wall." By supplying entirely new datasets for models to train on, synthetic data holds the potential to keep these models progressing. To appreciate its true value, it’s essential to first understand the growing limitations of relying solely on real-world data in the pursuit of AI advancement.

The data wall

Academics and AI luminaries alike have noted the likelihood that LLMs will hit a limit on the amount of human-generated text available to train them, possibly as soon as 2026.

The data shortfall presents a problem because as the supply of fresh training data dwindles, models can struggle to generalize. This can lead to overfitting, a phenomenon in which a model fits its training data so closely that it performs poorly on new data, resulting in less coherent outputs.
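The gap between performance on training data and on new data can be shown with a toy example. The sketch below is not an LLM. It assumes a deliberately memorizing model (1-nearest-neighbour regression) and made-up noisy samples of a sine curve, and all function names are hypothetical. It compares the error on the training set with the error on unseen data for a small and a larger training set.

```python
import math
import random


def make_samples(n, rng, noise=0.1):
    """Noisy samples of y = sin(2*pi*x) on [0, 1); illustrative stand-in data."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, noise)) for x in xs]


def predict_1nn(train, x):
    """Memorizing model: return the target of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def mse(train, data):
    """Mean squared error of the 1-NN model built on `train`, evaluated on `data`."""
    return sum((predict_1nn(train, x) - y) ** 2 for x, y in data) / len(data)


def train_test_error(n_train, seed=0, n_test=500):
    """Return (training error, held-out error) for a given training-set size."""
    rng = random.Random(seed)
    train = make_samples(n_train, rng)
    test = make_samples(n_test, rng)
    return mse(train, train), mse(train, test)


if __name__ == "__main__":
    for n in (5, 500):
        train_err, test_err = train_test_error(n)
        print(f"n_train={n:4d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Because the model simply memorizes its training points, its training error is zero in both cases, while the error on unseen data is noticeably larger when only a handful of training points are available. That is the pattern the paragraph above describes, shown at a scale small enough to run in a second.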

And while experts began publicizing the problem shortly after OpenAI kicked off the GenAI race by launching ChatGPT two years ago, VCs powerful enough to pull the financial levers of this market have since lent their voices to the issue.

“The big models are trained by scraping the internet and pulling in all human-generated training data, all human-generated text and increasingly video and audio and everything else, and there’s just literally only so much of that,” said Marc Andreessen, co-founder of Andreessen Horowitz.

The problem is serious enough that AI companies have gone analog, hiring human domain experts such as doctors and lawyers to handwrite prompts for LLMs.

Barring any breakthroughs in model techniques or other innovations that help GenAI hurdle the coming data wall, synthetic data may be the best available option.

Big brands swear by synthetic data

Synthetic data is particularly useful for helping organizations simulate real-world scenarios, from predicting what merchandise customers may purchase next to modeling financial services scenarios, without the risk of exposing protected data.

Walmart, for one, synthesizes user behavior sequences for its sports and electronics categories to predict next purchases. Walmart employees vet the data throughout the process to ensure integrity between the user behavior sequence and the prediction.

The human-in-the-loop factor may be key to harnessing synthetic data to improve outcomes. For example, combining proprietary data owned by enterprises with reasoning from human employees can create a new class of data that corporations can use to create value.

This “hybrid human AI data approach” to creating synthetic data is something that organizations such as JPMorgan are exploring, according to Alex Wang, a senior research associate with the financial services company, who noted that JPMorgan has 150 petabytes of data at its disposal, compared with the 1 petabyte OpenAI has indexed for GPT-4.

In fact, OpenAI itself has used its Strawberry reasoning model to create data for its Orion LLM. You read that right – OpenAI is using its AI models to train its AI models.

The bottom line

Synthetic data has its limitations. For example, it often fails to capture the complexity and nuances – think sarcasm or turns of phrase – that make real-world data so rich. This can reduce the relevance of results, thus limiting the value of the scenarios synthetic data is meant to model.

As with real-world data, algorithms used to generate synthetic data can include or amplify existing biases, which can lead to biased outputs. Moreover, ensuring that a model trained on synthetic data performs well may require using supplementary real-world data, which can make fine-tuning challenging. Similarly, inaccuracies and hallucinations remain an issue in synthetic data.

The challenges that come with using synthetic data require the same sound data governance practices organizations already apply to LLMs that train on real-world data. As such, many data engineers view the use of synthetic data to populate models as complementary to real-world data.

Even so, an existential data crisis isn’t required to capitalize on the benefits of using synthetic data. And your organization needn’t operate at Walmart’s or JPMorgan’s scale to take advantage of the opportunities synthetic data has to offer.

Knowing how to use synthetic data effectively may be challenging for organizations that haven’t previously applied such techniques to manage and manipulate their data.

Dell Technologies offers access to professional services, as well as a broad open ecosystem of vendors, that can help you embark on your synthetic data creation journey.

At Technovera Co., we officially partner with well-known vendors in the IT industry to provide solutions tailored to our customers’ needs. Technovera handles the purchase and guarantee of these vendors’ products, as well as the installation and configuration of the specified hardware and software.

We believe in providing technical IT solutions based on experience.
