The Soro model family is designed for practical deployment in Tajikistan, where compute and connectivity are limited, and aims to make AI more accessible to speakers of low-resource languages.
Development
Starting from Gemma 3 checkpoints, the development team ran continual pre-training on a refined Tajik dataset of 1.9 billion tokens, then instruction-tuned the model on 40,000 Tajik teacher-style examples. For evaluation, the team also released a Tajik benchmark covering general knowledge and linguistic capability, which is open-sourced on Hugging Face.
Why It Matters
Soro significantly outperforms Gemma 3 models of comparable size on Tajik benchmarks while retaining their English-language performance. Support for FP8 and INT4 quantization sharply reduces memory requirements, which makes the models well suited to edge devices. The project is currently being piloted in education, with plans to scale it across schools in Tajikistan. It offers a practical path for minority-language communities that want to build on open-source foundation models.
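To give a rough sense of why FP8 and INT4 quantization matter on edge devices, the sketch below estimates the memory needed for model weights alone at different precisions. The parameter counts are hypothetical placeholders, since Soro's model sizes are not stated here, and the BF16 baseline is an assumption. The estimate ignores activations, the KV cache, and per-block quantization scales, so real footprints will be somewhat higher.

```python
# Approximate bytes needed to store one weight at each precision.
# BF16 is an assumed baseline; FP8 and INT4 are the quantized formats mentioned above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Estimate weight-only memory in GiB for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    # Hypothetical parameter counts, for illustration only.
    for num_params in (1e9, 4e9):
        for precision in ("bf16", "fp8", "int4"):
            gib = weight_memory_gib(num_params, precision)
            print(f"{num_params / 1e9:.0f}B params @ {precision}: {gib:.2f} GiB")
```

Under these assumptions, moving from BF16 to FP8 halves the weight footprint, and INT4 cuts it to a quarter.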