AI 1 min read

Soro: A Lightweight Tajik Large Language Model Based on Gemma 3

Researchers introduce Soro, a specialized Tajik large language model family built on Gemma 3 and optimized for edge deployment.

Tier 2 · sources 90% confidence Reviewed
Sources arxiv.org

The Soro model family is designed for practical deployment under the limited compute and connectivity conditions found in Tajikistan, with the aim of making AI tools more accessible to speakers of a low-resource language.

Development

Starting from Gemma 3 checkpoints, the development team ran continual pre-training on a refined Tajik dataset of 1.9 billion tokens, then instruction-tuned the model on 40,000 Tajik teacher-style examples. For evaluation, the team also released a Tajik benchmark covering general knowledge and linguistic capability, which is openly available on Hugging Face.

Why It Matters

The researchers report that Soro significantly outperforms Gemma 3 models of comparable size on Tajik benchmarks while retaining its English-language performance. Support for FP8 and INT4 quantization reduces memory requirements, which makes the models better suited to edge devices. The project is currently being piloted in education, with plans to scale it to schools across Tajikistan, showing a practical path for minority-language communities to build on openly available foundation models.
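
To give a rough sense of why quantization matters on edge hardware, the sketch below estimates the memory needed to hold model weights at 16-bit, FP8, and INT4 precision. It is a back-of-the-envelope illustration only: the 4-billion parameter count is a hypothetical placeholder (the article does not state Soro's sizes), the function name `weight_memory_gib` is ours, and the estimate ignores activations, the KV cache, and the small overhead of quantization scales.

```python
# Approximate storage cost per weight, in bits, for each numeric format.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int4": 4}


def weight_memory_gib(num_params: int, fmt: str) -> float:
    """Return the approximate memory for the weights alone, in GiB.

    Ignores activations, KV cache, and quantization metadata (scales,
    zero points), so real-world usage will be somewhat higher.
    """
    bits = BITS_PER_WEIGHT[fmt]
    total_bytes = num_params * bits / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count, chosen only for illustration.
    hypothetical_params = 4_000_000_000
    for fmt in BITS_PER_WEIGHT:
        gib = weight_memory_gib(hypothetical_params, fmt)
        print(f"{fmt}: {gib:.2f} GiB")
```

Under these assumptions, FP8 halves the weight footprint relative to 16-bit storage and INT4 cuts it to a quarter, which is the kind of saving that can move a model from server-class to consumer or embedded hardware.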