We present SozKZ, a family of small language models (50M and 150M parameters) trained from scratch for Kazakh — a low-resource Turkic language with approximately 13 million speakers. Unlike existing approaches that rely on multilingual models or domain adaptation of English-centric models, SozKZ builds dedicated infrastructure from the ground up: custom ByteLevel BPE tokenizers optimized for Kazakh Cyrillic text, multi-stage data cleaning pipelines, and Chinchilla-optimal training schedules. We show that standard multilingual tokenizers exhibit fertility rates of 4.9–6.6 tokens per word on Kazakh text, while our dedicated tokenizers achieve 1.3–1.4, a 3.6–4.9× improvement in encoding efficiency. Our 150M model achieves a perplexity of ~20 on a held-out Kazakh evaluation set and generates coherent, topically relevant Kazakh text across multiple domains. We release all models, tokenizers, datasets, and training code as open-source artifacts on Hugging Face Hub.
Kazakh is the official language of Kazakhstan and is spoken by approximately 13 million people worldwide. Written in Cyrillic script (with an ongoing transition to Latin), Kazakh is an agglutinative Turkic language with rich morphology, vowel harmony, and flexible word order. Despite growing digital presence, Kazakh remains underrepresented in NLP resources.
Existing multilingual models such as BLOOM [1], mGPT [2], and XLM-RoBERTa [3] include Kazakh data but allocate minimal vocabulary entries to Kazakh tokens. This leads to poor tokenization efficiency — common Kazakh words are split into many subword pieces or even individual bytes. We quantify this problem: the GPT-NeoX tokenizer [4] produces 6.58 tokens per Kazakh word on average, compared to 1.34 for our dedicated Kazakh tokenizer — a 4.9× difference.
We hypothesize that for a low-resource language like Kazakh, dedicated small models with language-specific tokenizers can be more practical and efficient than relying on large multilingual models. Our contributions are: (1) two custom ByteLevel BPE tokenizers for Kazakh that reduce fertility from 4.9–6.6 to 1.3–1.4 tokens per word; (2) a multi-stage cleaning pipeline yielding ~22M unique Kazakh documents; (3) two models (50M and 150M parameters) trained from scratch with Chinchilla-optimal schedules; and (4) an open release of all models, tokenizers, datasets, and training code.
Multilingual Models for Kazakh. BLOOM [1] and mGPT [2] are large multilingual models that include Kazakh in their training data but with limited vocabulary coverage. XLM-RoBERTa [3] provides multilingual representations but is not designed for text generation.
Tokenization for Low-Resource Languages. Sennrich et al. [7] introduced BPE for neural machine translation. Rust et al. [8] demonstrated that language-specific tokenizers are crucial for downstream task performance in low-resource settings. Ahia et al. [9] showed that multilingual tokenizers produce highly fragmented representations for underrepresented languages, directly impacting training efficiency and model quality.
Small Language Models. Pythia [4] provides a suite of models from 14M to 12B parameters. TinyLlama [10] demonstrated that 1.1B-parameter models can be competitive when trained on large token counts. Eldan & Li [11] showed that very small models can exhibit emergent abilities with high-quality data.
Chinchilla Scaling. Hoffmann et al. [6] established that compute-optimal training requires ~20 tokens per parameter. We follow this guideline.
Data Cleaning. Penedo et al. [12] introduced comprehensive pipelines for web data cleaning. Lee et al. [13] showed that deduplication significantly improves language model quality. We adapt these approaches with Kazakh-specific filters including fastText [14] language identification.
We aggregate Kazakh text from six major public datasets on Hugging Face:
| Source | Raw Samples | Unique (after dedup) |
|---|---|---|
| CulturaX (kk) | 2,731,934 | 2,705,991 |
| HPLT 2.0 (kaz_Cyrl) | 2,637,330 | 2,246,264 |
| C4 (kk) | 2,371,528 | 2,230,795 |
| MADLAD-400 (kk) | 1,807,996 | 1,807,827 |
| mOSCAR (kaz_Cyrl) | 245,869 | 245,869 |
| Wikipedia (kk) | 238,356 | 238,343 |
| Total new | 10,033,013 | 9,475,089 |
Combined with the kz-transformers multidomain dataset (12.4M unique texts), this yields approximately 21.9M unique Kazakh text documents.
Our 9-stage pipeline, which includes fastText language identification, Kazakh-specific quality filters, boilerplate removal, and cross-source deduplication, processes raw text into training-ready data.
The pipeline reduces 28.4M raw texts to 13.7M clean texts (48.2% pass rate).
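To make the shape of such a pipeline concrete, here is a toy sketch of three representative stages: a length filter, a crude Kazakh-letter heuristic standing in for fastText language identification, and exact-hash deduplication. The thresholds, stage order, and heuristic are illustrative assumptions, not the paper's actual 9-stage implementation.

```python
import hashlib
import re

# Letters used in Kazakh Cyrillic but absent from Russian; their share among
# Cyrillic characters is a crude stand-in for real fastText language ID.
KAZAKH_SPECIFIC = set("әғқңөұүһіӘҒҚҢӨҰҮҺІ")

def kazakh_char_ratio(text: str) -> float:
    """Fraction of Cyrillic letters that are Kazakh-specific."""
    cyrillic = [c for c in text if "\u0400" <= c <= "\u04FF"]
    if not cyrillic:
        return 0.0
    return sum(c in KAZAKH_SPECIFIC for c in cyrillic) / len(cyrillic)

def clean_corpus(docs, min_chars=200, min_kk_ratio=0.05):
    """Toy three-stage pipeline: length filter -> language filter -> exact dedup."""
    seen = set()
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()  # normalize whitespace first
        if len(text) < min_chars:
            continue  # too short to be a useful training document
        if kazakh_char_ratio(text) < min_kk_ratio:
            continue  # likely Russian or mixed-language content
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate (near-dedup would use MinHash or similar)
        seen.add(digest)
        yield text
```

In a production pipeline the language-ID heuristic would be replaced by fastText [14], and exact hashing would be complemented by near-duplicate detection across sources.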
| Tokenizer | Vocab Size | Training Data |
|---|---|---|
| BPE-32K | 32,000 | 23.6M Kazakh texts |
| GPT2-50K | 50,257 | 78K clean v2 texts |
Both tokenizers use ByteLevel pre-tokenization with special tokens <|endoftext|> (BOS/EOS) and <|padding|> (PAD).
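A tokenizer of this kind can be trained with the Hugging Face `tokenizers` library; the following is a minimal sketch, where trainer settings beyond the vocabulary size and the two special tokens are illustrative assumptions rather than the paper's exact configuration.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_kk_tokenizer(corpus_iter, vocab_size=32000):
    """Train a ByteLevel BPE tokenizer in the style described above."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>", "<|padding|>"],
        # seed the vocabulary with all 256 byte symbols
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train_from_iterator(corpus_iter, trainer=trainer)
    return tokenizer
```

A real run would stream the full training corpus and persist the result with `tokenizer.save(...)`; ByteLevel pre-tokenization guarantees lossless round-tripping of any UTF-8 input, including Kazakh Cyrillic.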
We measure fertility (tokens per word) across four tokenizers on 10 diverse Kazakh sentences covering news, science, literature, and conversational domains:
| Tokenizer | Vocab Size | Avg Tokens/Sent | Chars/Token | Fertility |
|---|---|---|---|---|
| GPT-NeoX (Pythia) [4] | 50,304 | 52.6 | 1.22 | 6.58 |
| LLaMA-2 [19] | 32,000 | 39.2 | 1.64 | 4.90 |
| SozKZ BPE-32K (ours) | 32,000 | 11.5 | 5.59 | 1.44 |
| SozKZ GPT2-50K (ours) | 50,257 | 10.7 | 6.01 | 1.34 |
The GPT-NeoX tokenizer effectively decomposes Kazakh text into individual bytes (1.22 chars/token), while our tokenizers capture whole words and common morphemes. This 3.6–4.9× reduction in sequence length directly translates to faster training, lower memory usage, and richer contextual representation within the same context window. A 1,024-token context covers ~760 Kazakh words with our tokenizer versus only ~156 with GPT-NeoX.
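Fertility as used here is straightforward to compute; the sketch below uses placeholder `tokenize` callables in place of real tokenizers, and reproduces the context-coverage arithmetic from the measured fertilities.

```python
def fertility(tokenize, sentences):
    """Average number of tokens per whitespace-delimited word over a sample."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

def context_coverage(context_len, fert):
    """Approximate number of words that fit in a context window."""
    return context_len / fert

# With the measured fertilities:
#   context_coverage(1024, 1.34) ≈ 764 words (SozKZ GPT2-50K)
#   context_coverage(1024, 6.58) ≈ 156 words (GPT-NeoX)
```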
| Parameter | 50M Model | 150M Model |
|---|---|---|
| Parameters | 50.29M | 149.8M |
| Hidden size | 512 | 896 |
| Layers | 8 | 12 |
| Attention heads | 8 | 16 |
| Intermediate (SwiGLU) | 1,344 | 2,560 |
| Context length | 1,024 | 1,024 |
| Vocab size | 50,257 | 32,000 |
| Positional encoding | RoPE | RoPE |
| Parameter | 50M Model | 150M Model |
|---|---|---|
| Training tokens | ~1.04B | ~2.88B |
| Learning rate | 6e-4 | 3e-4 |
| LR schedule | Cosine | Cosine |
| Precision | bfloat16 | bfloat16 |
| Hardware | 1× A100 | 8× H200 |
| Training time | ~6–8 hours | ~36 minutes |
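The cosine schedule can be sketched as follows; the warmup length and minimum learning-rate floor are illustrative assumptions, since the table specifies only the schedule shape and peak learning rates.

```python
import math

def cosine_lr(step, max_steps, peak_lr, warmup=200, min_lr_ratio=0.1):
    """Linear warmup, then cosine decay to min_lr_ratio * peak_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)
```

For example, the 150M run would call this with `peak_lr=3e-4` over its 10,972 optimizer steps.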
| Step | Train Loss | Epoch |
|---|---|---|
| 500 | ~5.5 | 0.046 |
| 2,500 | ~3.5 | 0.228 |
| 5,000 | ~3.2 | 0.456 |
| 7,500 | ~3.0 | 0.683 |
| 10,000 | ~2.9 | 0.911 |
| 10,972 (final) | ~2.9 | 1.000 |
Final eval loss: ~3.0; perplexity: ~20.
| Model | Eval Loss | Perplexity |
|---|---|---|
| 50M (exp004, step 24K) | 3.39 | ~30 |
| 150M (final) | ~3.0 | ~20 |
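The perplexity figures follow directly from the eval losses, since perplexity is the exponential of the mean cross-entropy loss in nats:

```python
import math

def perplexity(loss_nats: float) -> float:
    """Perplexity is exp(mean cross-entropy loss in nats)."""
    return math.exp(loss_nats)

print(round(perplexity(3.39), 1))  # 29.7  (50M model, reported as ~30)
print(round(perplexity(3.0), 1))   # 20.1  (150M model, reported as ~20)
```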
Incremental evaluation during training shows a clear quality progression: training loss falls from ~5.5 at step 500 to ~2.9 by the final step, with eval loss converging to ~3.0 (perplexity ~20).
Our initial experiments (exp001–003) attempted domain-adaptive pre-training on English-centric Pythia [4] models. The GPT-NeoX tokenizer fragments Kazakh text into near-byte-level representations (fertility 6.58), meaning the model must learn Kazakh character composition in addition to language modeling. Training from scratch with a dedicated tokenizer proved far more effective.
Our fertility analysis demonstrates that tokenizer choice has dramatic impact for agglutinative languages. A 1,024-token context window effectively covers ~760 Kazakh words with our tokenizer versus only ~156 words with GPT-NeoX — a 4.9× difference in effective context length, with direct implications for generation coherence and long-range dependencies.
The 48.2% pass rate of our cleaning pipeline reveals that web-crawled “Kazakh” data contains substantial noise: Russian-Kazakh mixed content, boilerplate, and near-duplicate content across sources. The cross-source deduplication step alone removed 524K documents. We recommend aggressive quality filtering for any low-resource language data pipeline.
Following Chinchilla-optimal ratios [6], the 150M model (perplexity ~20) substantially outperforms the 50M model (perplexity ~30), confirming that scaling laws hold for low-resource languages when sufficient data is available. Our 22M unique documents provide enough tokens for models up to ~500M parameters at Chinchilla-optimal ratios.
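The token-budget arithmetic behind this claim is simple; the sketch below applies the ~20 tokens/parameter ratio from Hoffmann et al. [6] to the model sizes discussed.

```python
def chinchilla_tokens(params, tokens_per_param=20):
    """Compute-optimal token budget per Hoffmann et al. [6]."""
    return tokens_per_param * params

# 50M params  -> 1B tokens   (the paper trains on ~1.04B)
# 150M params -> 3B tokens   (the paper trains on ~2.88B)
# 500M params -> 10B tokens  (upper bound supported by the ~22M-document corpus)
```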
We demonstrate that small, dedicated language models trained from scratch can be a practical and effective approach for low-resource languages. The core insight is that tokenizer quality is the single most impactful factor: a 4.9× improvement in encoding efficiency translates directly to better training efficiency and generation quality.
The SozKZ project provides the Kazakh NLP community with two pre-trained language models (50M, 150M), two custom Kazakh tokenizers (32K, 50K), large-scale cleaned Kazakh corpora (~22M texts), and complete, reproducible training pipelines. All artifacts are released under open licenses on Hugging Face Hub.
Future work includes: (1) instruction tuning for dialogue capabilities, (2) extending to the new Latin-script Kazakh orthography, (3) scaling to larger model sizes, and (4) evaluation on emerging Kazakh NLP benchmarks.
| Name | Parameters |
|---|---|
| sozkz-core-llama-50m-kk-base-v2 | 50.29M |
| sozkz-core-llama-150m-kk-base-v1 | 149.8M |
| Name | Vocab |
|---|---|
| sozkz-vocab-bpe-32k-kk-base-v1 | 32K |
| sozkz-core-gpt2-50k-kk-base-v1 | 50K |
| Name | Size |
|---|---|
| sozkz-corpus-dedup-kk-web-v1 | 9.5M texts |
| sozkz-corpus-clean-kk-text-v2 | 78K docs |
| sozkz-corpus-clean-kk-pretrain-v2 | 1.04B tokens |
github.com/sakentukenov/slm — Training pipeline and standalone recipe