SozKZ: Small Language Models for Kazakh Trained from Scratch

Saken Tukenov
Independent Researcher — saken@tukenov.kz
February 2026
Keywords: kazakh, small-language-model, llama, low-resource-nlp, from-scratch, chinchilla-optimal

Abstract

We present SozKZ, a family of small language models (50M and 150M parameters) trained from scratch for Kazakh — a low-resource Turkic language with approximately 13 million speakers. Unlike existing approaches that rely on multilingual models or domain adaptation of English-centric models, SozKZ builds dedicated infrastructure from the ground up: custom ByteLevel BPE tokenizers optimized for Kazakh Cyrillic text, multi-stage data cleaning pipelines, and Chinchilla-optimal training schedules. We show that standard multilingual tokenizers exhibit fertility rates of 4.9–6.6 tokens per word on Kazakh text, while our dedicated tokenizers achieve 1.3–1.4, a 3.6–4.9× improvement in encoding efficiency. Our 150M model achieves a perplexity of ~20 on a held-out Kazakh evaluation set and generates coherent, topically relevant Kazakh text across multiple domains. We release all models, tokenizers, datasets, and training code as open-source artifacts on Hugging Face Hub.

1. Introduction

Kazakh is the official language of Kazakhstan and is spoken by approximately 13 million people worldwide. Written in Cyrillic script (with an ongoing transition to Latin), Kazakh is an agglutinative Turkic language with rich morphology, vowel harmony, and flexible word order. Despite growing digital presence, Kazakh remains underrepresented in NLP resources.

Existing multilingual models such as BLOOM [1], mGPT [2], and XLM-RoBERTa [3] include Kazakh data but allocate minimal vocabulary entries to Kazakh tokens. This leads to poor tokenization efficiency — common Kazakh words are split into many subword pieces or even individual bytes. We quantify this problem: the GPT-NeoX tokenizer [4] produces 6.58 tokens per Kazakh word on average, compared to 1.34 for our dedicated Kazakh tokenizer — a 4.9× difference.

We hypothesize that for a low-resource language like Kazakh, dedicated small models with language-specific tokenizers can be more practical and efficient than relying on large multilingual models. Our contributions are:

  1. Custom Kazakh tokenizers (32K and 50K vocabulary) with 3.6–4.9× better encoding efficiency than multilingual alternatives
  2. Curated training data: a 9-stage cleaning pipeline applied to ~28M raw texts, yielding high-quality Kazakh corpora totaling ~22M unique documents
  3. Two language models trained from scratch: 50M and 150M parameter LlamaForCausalLM [5] architectures with Chinchilla-optimal [6] training
  4. Quantitative tokenizer analysis: fertility measurements comparing dedicated vs. multilingual tokenizers on Kazakh text
  5. Complete reproducibility: all code, configs, data, tokenizers, and models are open-source

2. Related Work

Multilingual Models for Kazakh. BLOOM [1] and mGPT [2] are large multilingual models that include Kazakh in their training data but with limited vocabulary coverage. XLM-RoBERTa [3] provides multilingual representations but is not designed for text generation.

Tokenization for Low-Resource Languages. Sennrich et al. [7] introduced BPE for neural machine translation. Rust et al. [8] demonstrated that language-specific tokenizers are crucial for downstream task performance in low-resource settings. Ahia et al. [9] showed that multilingual tokenizers produce highly fragmented representations for underrepresented languages, directly impacting training efficiency and model quality.

Small Language Models. Pythia [4] provides a suite of models from 14M to 12B parameters. TinyLlama [10] demonstrated that 1.1B-parameter models can be competitive when trained on large token counts. Eldan & Li [11] showed that very small models can exhibit emergent abilities with high-quality data.

Chinchilla Scaling. Hoffmann et al. [6] established that compute-optimal training requires ~20 tokens per parameter. We follow this guideline.

Data Cleaning. Penedo et al. [12] introduced comprehensive pipelines for web data cleaning. Lee et al. [13] showed that deduplication significantly improves language model quality. We adapt these approaches with Kazakh-specific filters including fastText [14] language identification.

3. Data

3.1 Source Corpora

We aggregate Kazakh text from six major public datasets on Hugging Face:

Source | Raw Samples | Unique (after dedup)
CulturaX (kk) [15] | 2,731,934 | 2,705,991
HPLT 2.0 (kaz_Cyrl) [16] | 2,637,330 | 2,246,264
C4 (kk) [17] | 2,371,528 | 2,230,795
MADLAD-400 (kk) [18] | 1,807,996 | 1,807,827
mOSCAR (kaz_Cyrl) | 245,869 | 245,869
Wikipedia (kk) | 238,356 | 238,343
Total (new) | 10,033,013 | 9,475,089

Combined with the kz-transformers multidomain dataset (12.4M unique texts), this yields approximately 21.9M unique Kazakh text documents.

3.2 Cleaning Pipeline

Our 9-stage pipeline processes raw text into training-ready data:

  1. Unicode NFC normalization
  2. Kazakh character filtering
  3. Script profiling — Cyrillic character ratio ≥ 60%
  4. FastText language identification
  5. Junk/boilerplate removal (URL density, HTML tags, control characters)
  6. Repetition filtering
  7. Length filtering — minimum 50 characters
  8. Exact deduplication (MD5 hash-based, cross-source)
  9. Near deduplication (MinHash LSH)
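The stages above can be sketched as a single streaming filter. The following is an illustrative simplification, not the project's actual pipeline: the Cyrillic-ratio and length thresholds come from the stage list, the URL-density cutoff is an assumed placeholder, and the fastText language-ID, repetition-filtering, and MinHash LSH stages are omitted.

```python
import hashlib
import re
import unicodedata

def cyrillic_ratio(text):
    """Fraction of alphabetic characters that are Cyrillic (stage 3)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("CYRILLIC" in unicodedata.name(c, "") for c in letters) / len(letters)

def clean_corpus(texts, min_chars=50, min_cyrillic=0.6):
    """Simplified sketch of the 9-stage pipeline: NFC normalization,
    script profiling, a junk heuristic, length filtering, and exact
    MD5 deduplication. Stages 2, 4, 6, and 9 are omitted here."""
    seen = set()
    for text in texts:
        text = unicodedata.normalize("NFC", text).strip()   # stage 1
        if len(text) < min_chars:
            continue                                        # stage 7: length filter
        if cyrillic_ratio(text) < min_cyrillic:
            continue                                        # stage 3: script profiling
        if len(re.findall(r"https?://", text)) > 3:         # assumed URL-density cutoff
            continue                                        # stage 5: junk removal
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                                        # stage 8: exact dedup
        seen.add(digest)
        yield text
```

In a real run each stage would log its pass rate, which is how the per-stage attrition behind the overall 48.2% figure is tracked.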

The pipeline reduces 28.4M raw texts to 13.7M clean texts (48.2% pass rate).

3.3 Tokenization

Tokenizer | Vocab Size | Training Data
BPE-32K | 32,000 | 23.6M Kazakh texts
GPT2-50K | 50,257 | 78K clean v2 texts

Both tokenizers use ByteLevel pre-tokenization with special tokens <|endoftext|> (BOS/EOS) and <|padding|> (PAD).
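A tokenizer with this configuration can be built with the Hugging Face `tokenizers` library. The sketch below reflects the description above (ByteLevel pre-tokenization, BPE model, the two special tokens); any trainer settings beyond those are assumptions rather than the project's exact recipe.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_kazakh_bpe(corpus_iter, vocab_size=32_000):
    """Train a ByteLevel BPE tokenizer as described in Section 3.3."""
    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tok.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>", "<|padding|>"],
    )
    tok.train_from_iterator(corpus_iter, trainer=trainer)
    return tok
```

Because ByteLevel pre-tokenization operates on UTF-8 bytes, the tokenizer can encode any input losslessly; the learned merges then determine how much of Kazakh Cyrillic text is captured as whole words rather than byte fragments.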

Tokenizer Fertility Analysis

We measure fertility (tokens per word) across four tokenizers on 10 diverse Kazakh sentences covering news, science, literature, and conversational domains:

Tokenizer | Vocab Size | Avg Tokens/Sent | Chars/Token | Fertility
GPT-NeoX (Pythia) [4] | 50,304 | 52.6 | 1.22 | 6.58
LLaMA-2 [19] | 32,000 | 39.2 | 1.64 | 4.90
SozKZ BPE-32K (ours) | 32,000 | 11.5 | 5.59 | 1.44
SozKZ GPT2-50K (ours) | 50,257 | 10.7 | 6.01 | 1.34

The GPT-NeoX tokenizer effectively decomposes Kazakh text into individual bytes (1.22 chars/token), while our tokenizers capture whole words and common morphemes. This 3.6–4.9× reduction in sequence length directly translates to faster training, lower memory usage, and richer contextual representation within the same context window. A 1,024-token context covers ~760 Kazakh words with our tokenizer versus only ~156 with GPT-NeoX.
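The context-coverage figures follow directly from the fertility numbers. A small sketch of the computation; the `fertility` helper is illustrative (tokens per whitespace-separated word, matching the metric used in the table), and `tokenize` stands in for any tokenizer's encode function.

```python
def fertility(tokenize, sentences):
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    return n_tokens / n_words

# Effective context coverage implied by the measured fertilities.
CONTEXT = 1024
FERTILITY = {"GPT-NeoX": 6.58, "LLaMA-2": 4.90,
             "SozKZ BPE-32K": 1.44, "SozKZ GPT2-50K": 1.34}
coverage = {name: CONTEXT / f for name, f in FERTILITY.items()}
# GPT-NeoX: ~156 words; SozKZ GPT2-50K: ~764 words, i.e. ~4.9x more text per window
```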

4. Models

4.1 Architecture

Parameter | 50M Model | 150M Model
Parameters | 50.29M | 149.8M
Hidden size | 512 | 896
Layers | 8 | 12
Attention heads | 8 | 16
Intermediate size (SwiGLU [20]) | 1,344 | 2,560
Context length | 1,024 | 1,024
Vocab size | 50,257 | 32,000
Positional encoding | RoPE [21] | RoPE [21]
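The 150M parameter count can be reproduced from the table. The pure-Python count below assumes the standard Llama layer layout (bias-free q/k/v/o projections, SwiGLU MLP, two RMSNorms per layer) and, crucially, tied input/output embeddings; with tying the total lands at ~149.8M, whereas an untied LM head would give ~178M.

```python
def llama_param_count(vocab, d, layers, d_ff, tied_embeddings=True):
    """Parameter count for a LlamaForCausalLM-style decoder (no biases, no GQA)."""
    embed = vocab * d                       # token embedding matrix
    attn = 4 * d * d                        # q, k, v, o projections
    mlp = 3 * d * d_ff                      # gate, up, down (SwiGLU)
    norms = 2 * d                           # two RMSNorms per layer
    total = embed + layers * (attn + mlp + norms) + d  # + final RMSNorm
    if not tied_embeddings:
        total += vocab * d                  # separate LM head
    return total

print(llama_param_count(32_000, 896, 12, 2_560))  # -> 149,804,928 (~149.8M)
```

The 50M model's stated 50.29M does not match this formula exactly (it gives ~50.6M), so that configuration likely differs in a detail not listed in the table.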

4.2 Training

Parameter | 50M Model | 150M Model
Training tokens | ~1.04B | ~2.88B
Learning rate | 6e-4 | 3e-4
LR schedule | Cosine | Cosine
Precision | bfloat16 | bfloat16
Hardware | 1× A100 | 8× H200
Training time | ~6–8 hours | ~36 minutes
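The token budgets in the table sit close to the Chinchilla guideline of ~20 tokens per parameter [6]; checking the arithmetic:

```python
# Tokens-per-parameter ratio for both models (values from the table above).
budgets = {"SozKZ-50M": (50.29e6, 1.04e9), "SozKZ-150M": (149.8e6, 2.88e9)}
for name, (params, tokens) in budgets.items():
    print(f"{name}: {tokens / params:.1f} tokens/parameter")  # both near 20
```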

4.3 Training Curve (150M)

Step | Train Loss | Epoch
500 | ~5.5 | 0.046
2,500 | ~3.5 | 0.228
5,000 | ~3.2 | 0.456
7,500 | ~3.0 | 0.683
10,000 | ~2.9 | 0.911
10,972 (final) | ~2.9 | 1.000

Final eval loss: ~3.0 | Perplexity: ~20

5. Evaluation

5.1 Perplexity

Model | Eval Loss | Perplexity
50M (exp004, step 24K) | 3.39 | ~30
150M (final) | ~3.0 | ~20
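Perplexity here is the exponential of the mean cross-entropy eval loss, so the reported numbers follow directly from the losses in the table:

```python
import math

# Perplexity = exp(eval cross-entropy loss).
assert round(math.exp(3.39)) == 30   # 50M model:  loss 3.39 -> PPL ~30
assert round(math.exp(3.0)) == 20    # 150M model: loss ~3.0 -> PPL ~20
```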

5.2 Generation Examples (150M Model)

Prompt: «Қазақстан — бұл» ("Kazakhstan is")
Output: Қазақстан — бұл адамның өмір сүруіне қажетті жағдай жасайтын құрал. Біздің қоғамымызда осы мәселені шешуде мемлекеттің рөлі жоғары... ("Kazakhstan is a tool that creates the conditions necessary for a person's life. In our society, the state's role in solving this issue is high...")

Prompt: «Бүгін ауа райы» ("Today the weather")
Output: Бүгін ауа райы болжамы «Қазгидромет» РМК 25 қыркүйекке арналған ауа райы болжамын жариялады... ("Today's weather forecast: RSE Kazhydromet has published the weather forecast for September 25...")

Prompt: «Алыстағы ауылда бір кәрі шал тұратын,» ("In a faraway village there lived an old man,")
Output: Алыстағы ауылда бір кәрі шал тұратын, көпқабатты үйдің баласындай өсіп-өнген... ("In a faraway village there lived an old man, grown up like a child of a multi-story building...")

5.3 Qualitative Analysis

Through incremental evaluation during training, we observe a clear progression in generation quality from early to late checkpoints.

6. Discussion

6.1 Domain-Adaptive Pre-Training vs. From-Scratch

Our initial experiments (exp001–003) attempted domain-adaptive pre-training on English-centric Pythia [4] models. The GPT-NeoX tokenizer fragments Kazakh text into near-byte-level representations (fertility 6.58), meaning the model must learn Kazakh character composition in addition to language modeling. Training from scratch with a dedicated tokenizer proved far more effective.

6.2 Tokenizer Design for Agglutinative Languages

Our fertility analysis demonstrates that tokenizer choice has dramatic impact for agglutinative languages. A 1,024-token context window effectively covers ~760 Kazakh words with our tokenizer versus only ~156 words with GPT-NeoX — a 4.9× difference in effective context length, with direct implications for generation coherence and long-range dependencies.

6.3 Data Quality and Scale

The 48.2% pass rate of our cleaning pipeline reveals that web-crawled “Kazakh” data contains substantial noise: Russian-Kazakh mixed content, boilerplate, and near-duplicate content across sources. The cross-source deduplication step alone removed 524K documents. We recommend aggressive quality filtering for any low-resource language data pipeline.

6.4 Scaling Laws for Low-Resource Languages

Following Chinchilla-optimal ratios [6], the 150M model (perplexity ~20) substantially outperforms the 50M model (perplexity ~30), confirming that scaling laws hold for low-resource languages when sufficient data is available. Our 22M unique documents provide enough tokens for models up to ~500M parameters at Chinchilla-optimal ratios.

7. Limitations

8. Conclusion

We demonstrate that small, dedicated language models trained from scratch can be a practical and effective approach for low-resource languages. The core insight is that tokenizer quality is the single most impactful factor: a 4.9× improvement in encoding efficiency translates directly to better training efficiency and generation quality.

The SozKZ project provides the Kazakh NLP community with two pre-trained language models (50M, 150M), two custom Kazakh tokenizers (32K, 50K), large-scale cleaned Kazakh corpora (~22M texts), and complete, reproducible training pipelines. All artifacts are released under open licenses on Hugging Face Hub.

Future work includes: (1) instruction tuning for dialogue capabilities, (2) extending to the new Latin-script Kazakh orthography, (3) scaling to larger model sizes, and (4) evaluation on emerging Kazakh NLP benchmarks.

Released Artifacts

Models

Name | Parameters
sozkz-core-llama-50m-kk-base-v2 | 50.29M
sozkz-core-llama-150m-kk-base-v1 | 149.8M

Tokenizers

Name | Vocab
sozkz-vocab-bpe-32k-kk-base-v1 | 32K
sozkz-core-gpt2-50k-kk-base-v1 | 50K

Datasets

Name | Size
sozkz-corpus-dedup-kk-web-v1 | 9.5M texts
sozkz-corpus-clean-kk-text-v2 | 78K docs
sozkz-corpus-clean-kk-pretrain-v2 | 1.04B tokens

Code

github.com/sakentukenov/slm — Training pipeline and standalone recipe

Citation

@misc{tukenov2026sozkz,
  title  = {SozKZ: Small Language Models for Kazakh Trained from Scratch},
  author = {Saken Tukenov},
  year   = {2026},
  url    = {https://huggingface.co/collections/saken-tukenov/sozkz}
}

References

  1. Workshop, B. et al. (2023). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv:2211.05100
  2. Shliazhko, O. et al. (2022). mGPT: Few-Shot Learners Go Multilingual. arXiv:2204.07580
  3. Conneau, A. et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. ACL 2020. arXiv:1911.02116
  4. Biderman, S. et al. (2023). Pythia: A Suite for Analyzing Large Language Models. ICML 2023. arXiv:2304.01373
  5. Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971
  6. Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. arXiv:2203.15556
  7. Sennrich, R. et al. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. arXiv:1508.07909
  8. Rust, P. et al. (2021). How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. ACL 2021. arXiv:2012.15613
  9. Ahia, O. et al. (2023). Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. EMNLP 2023. arXiv:2305.13707
  10. Zhang, P. et al. (2024). TinyLlama: An Open-Source Small Language Model. arXiv:2401.02385
  11. Eldan, R. & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759
  12. Penedo, G. et al. (2023). The RefinedWeb Dataset for Falcon LLM. NeurIPS 2023. arXiv:2306.01116
  13. Lee, K. et al. (2022). Deduplicating Training Data Makes Language Models Better. ACL 2022. arXiv:2107.06499
  14. Joulin, A. et al. (2017). Bag of Tricks for Efficient Text Classification. EACL 2017. arXiv:1607.01759
  15. Nguyen, T. et al. (2024). CulturaX: A Cleaned, Enormous, and Multilingual Dataset. ACL 2024. arXiv:2309.09400
  16. de Gibert, O. et al. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies. LREC-COLING 2024. arXiv:2403.14009
  17. Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020. arXiv:1910.10683
  18. Kudugunta, S. et al. (2024). MADLAD-400: A Multilingual and Document-Level Large Audited Dataset. AAAI 2024. arXiv:2309.04662
  19. Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288
  20. Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202
  21. Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
  22. Zhang, B. & Sennrich, R. (2019). Root Mean Square Layer Normalization. NeurIPS 2019. arXiv:1910.07467
  23. Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019. arXiv:1711.05101