AI May 26, 2026 1 min read

Cognitive Lab launches NayanaOCR open document dataset with 1 million images 📄

The NayanaOCR dataset contains over 1 million document images across 22 languages and is intended to help train and tune multilingual, multitask OCR models.

Sources x.com

Cognitive Lab has launched the NayanaOCR Corpus, an open-source dataset of more than 1 million document images across 22 languages. It is presented as currently the largest multilingual synthetic document dataset for developing optical character recognition (OCR) models.

Background

Training effective OCR models requires a large volume of high-quality image data. Collecting that data manually, however, is often expensive and raises privacy concerns, so synthetic data is becoming an important alternative.

Key Developments

According to Cognitive Lab, the NayanaOCR Corpus contains more than 1 million document images that were generated automatically yet remain highly accurate. The dataset is designed for multitask and multimodal use. Because it is released for free, developers should be able to improve text extraction systems more quickly.

Why It Matters

For the Vietnamese AI community, large-scale open datasets like NayanaOCR offer access to high-quality resources without the cost of building them in-house. Coverage of 22 languages lets engineers test the multilingual capabilities of their models more easily. However, how well synthetic data actually works for Vietnamese still needs to be verified in real-world applications.
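
One common way to run that kind of check is character error rate (CER): the edit distance between an OCR output and a human-verified transcription, divided by the length of the transcription. The sketch below is illustrative only and is not part of the NayanaOCR release; the function names are hypothetical, and it assumes you already have pairs of reference text and model output from real Vietnamese documents. Text is normalized to Unicode NFC first so that composed and decomposed Vietnamese diacritics are not counted as errors.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the OCR output
                curr[j - 1] + 1,     # extra character in the OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFC normalization."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        # Assumption for this sketch: an empty reference scores 0.0 only
        # when the hypothesis is empty too, otherwise 1.0.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: the OCR output dropped two diacritics.
    print(character_error_rate("Tiếng Việt", "Tieng Viet"))  # 0.2
```

Averaging this score over a held-out set of real scanned documents, rather than synthetic ones, gives a rough indication of whether a model trained on synthetic data transfers to real Vietnamese text.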