AI · Jun 9, 2026

Generating Synthetic Data: A Solution for Training Data Scarcity

This article analyzes two methods of generating synthetic data to optimize machine learning model training when real-world data sources are limited.

Sources lilianweng.github.io

As collecting real-world data runs into growing legal barriers and rising costs, synthetic data is becoming an important way for engineers to train models efficiently. According to an analysis by Lilian Weng, there are two main approaches to addressing a shortage of training data: data augmentation and generating entirely new synthetic data with large language models.

Context

The first method is data augmentation. This approach transforms or distorts existing data samples (for example, changing words in a text or altering an image) while preserving their core meaning. It exposes the model to more of the variation it will encounter in practice without requiring any newly labeled data.
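To make the idea concrete, here is a minimal Python sketch of text augmentation using two simple token-level perturbations, random swap and random deletion. These particular perturbations, the function names, and the parameters (`p_delete`, `n_swaps`, `seed`) are illustrative assumptions, not the specific techniques from the source analysis. The sketch also assumes that light perturbation leaves the label unchanged, which should be checked for each task.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(text, label, n_variants=3, p_delete=0.1, n_swaps=1, seed=0):
    """Produce n_variants perturbed copies of text, each reusing the original label."""
    rng = random.Random(seed)
    tokens = text.split()
    if not tokens:
        return []
    variants = []
    for _ in range(n_variants):
        perturbed = random_delete(random_swap(tokens, n_swaps, rng), p_delete, rng)
        # Assumption: a light perturbation does not change the label.
        variants.append((" ".join(perturbed), label))
    return variants
```

Calling `augment("the delivery arrived two days late", "negative")` returns three perturbed sentences, each still paired with the `negative` label, so no extra labeling is needed.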

The second method is generating entirely new data with powerful pretrained models. Large language models (LLMs) have advanced rapidly in recent years, and few-shot prompting has proven effective: given a handful of examples in the prompt, a model can produce new training data without significant additional training resources.
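The few-shot approach can be sketched in Python as follows. The prompt layout, the sentiment-review task, and the function names are illustrative assumptions. In particular, `complete` is a hypothetical callable that takes a prompt string and returns the model's text; in practice it would wrap whichever LLM client is in use, and no specific provider API is implied here.

```python
def build_few_shot_prompt(examples, target_label, task="Write a customer review"):
    """Format labeled seed examples as demonstrations, ending with an open slot."""
    lines = [f"{task} with the given sentiment label.", ""]
    for text, label in examples:
        lines.append(f"Label: {label}")
        lines.append(f"Text: {text}")
        lines.append("")
    # The final entry has a label but no text, so the model fills it in.
    lines.append(f"Label: {target_label}")
    lines.append("Text:")
    return "\n".join(lines)


def generate_synthetic(examples, target_label, complete, n_samples=5):
    """Call complete() n_samples times, dropping blank and duplicate outputs."""
    prompt = build_few_shot_prompt(examples, target_label)
    seen = {text for text, _ in examples}
    samples = []
    for _ in range(n_samples):
        text = complete(prompt).strip()  # complete is a hypothetical LLM wrapper
        if text and text not in seen:
            seen.add(text)
            samples.append((text, target_label))
    return samples
```

Because exact duplicates and copies of the seed examples are discarded, the function may return fewer than `n_samples` items. That is a deliberate choice in this sketch: repeated samples add no new information to a training set.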

Why It Matters

For the AI development community in Vietnam, access to large, standardized Vietnamese-language datasets remains a significant challenge. Synthetic data can help businesses reduce manual labeling costs and shorten both testing time and time-to-market. However, engineers must carefully evaluate the quality of synthetic data to avoid introducing systematic bias into models running in production.