Cognitive Lab has officially released the NayanaOCR Corpus, an open-source dataset of more than 1 million document images spanning 22 languages. It is described as the largest multilingual synthetic document dataset currently available for developing optical character recognition (OCR) models.
Background
Training effective OCR models requires a large volume of high-quality image data. Collecting that data manually, however, is often held back by cost and privacy concerns. As a result, synthetic data is becoming an important alternative.
Key Developments
According to Cognitive Lab, the NayanaOCR Corpus contains more than 1 million document images that were generated automatically yet remain highly accurate. The dataset is designed to support multi-task and multimodal use cases. Its free release is expected to help developers improve text extraction technologies more quickly.
Why It Matters
For the Vietnamese AI community, large-scale open datasets like NayanaOCR offer access to high-quality resources without the cost of building them in-house. Coverage of 22 languages lets engineers test the multilingual capabilities of their models more easily. However, how effective synthetic data actually is for Vietnamese still needs to be verified in real-world applications.
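One common way to check multilingual performance is to compute the character error rate (CER) separately for each language. The sketch below is a minimal illustration under stated assumptions, not part of the NayanaOCR release: the function names, the language codes, and the sample strings are all hypothetical, and the predictions would come from whatever OCR model is being evaluated. Text is normalized to Unicode NFC first, since Vietnamese diacritics can be encoded in composed or decomposed form.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_by_language(samples):
    """Aggregate CER per language.

    `samples` is an iterable of (language_code, reference_text, predicted_text)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = {}
    for lang, ref, hyp in samples:
        # Normalize so composed and decomposed diacritics compare equal.
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
        errors, chars = totals.get(lang, (0, 0))
        totals[lang] = (errors + edit_distance(ref, hyp), chars + len(ref))
    # Skip languages whose references are empty to avoid division by zero.
    return {lang: e / c for lang, (e, c) in totals.items() if c > 0}


if __name__ == "__main__":
    # Hypothetical reference/prediction pairs, not taken from the dataset.
    demo = [
        ("vi", "Hà Nội", "Ha Nội"),
        ("en", "hello", "hello"),
    ]
    for lang, cer in sorted(cer_by_language(demo).items()):
        print(f"{lang}: CER = {cer:.3f}")
```

Running the same measurement on real scanned Vietnamese documents, in addition to synthetic ones, would show whether a model trained on synthetic data holds up in practice.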