Designing a Persian Synthetic Data Pipeline

Training LLMs is no longer about scaling models. It is about scaling data quality.

Most Persian datasets lack structure, so models trained on them often fail to follow instructions. The bottleneck is scarce, well-structured data, not model size.

I built a pipeline to address this. It runs from topic graphs all the way to QLoRA fine-tuning.

The Pipeline Process:

  • Topic Tree Creation
  • LLM Generation
  • Deduplication
  • Quality Scoring
  • Dataset Export
  • QLoRA Fine-Tuning
  • Evaluation

Core Design Rules:

  • 51 domains to ensure balanced coverage.
  • Semantic deduplication to remove repetitive ideas.
  • Multi-model generation with several GPT models, to reduce single-model bias.
  • Qwen2.5 3B Instruct as the base model for the final fine-tuning.
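
The balanced-coverage rule can be pictured with a small sketch. The tree below is a hypothetical three-domain stand-in for the real 51 domains, and `sample_balanced_topics` is an illustrative helper name, not a function from the original pipeline. The idea is simply that every domain contributes the same number of generation prompts, regardless of how many subtopics it has.

```python
import random

# Hypothetical two-level topic tree: domain -> subtopics.
# The pipeline described above uses 51 domains; three are shown here.
TOPIC_TREE = {
    "health": ["nutrition", "sleep", "first aid"],
    "history": ["Achaemenid Empire", "Safavid era"],
    "programming": ["Python basics", "SQL queries", "Git workflow", "debugging"],
}

def sample_balanced_topics(tree, per_domain, seed=0):
    """Return (domain, subtopic) pairs with exactly `per_domain` picks per domain."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    picks = []
    for domain, subtopics in tree.items():
        for _ in range(per_domain):
            # Sampling with replacement: small domains still fill their quota.
            picks.append((domain, rng.choice(subtopics)))
    return picks

topics = sample_balanced_topics(TOPIC_TREE, per_domain=4)
for domain, subtopic in topics[:3]:
    print(domain, "->", subtopic)
```

Each (domain, subtopic) pair would then be turned into a generation prompt, so no single domain dominates the dataset.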

How the Data Engine Works:

I generate with multiple GPT models rather than a single one. Each model contributes its own reasoning style and phrasing, which adds variation. Mixing models this way keeps costs low and diversity high.

I use semantic filtering to clean the data. If two instructions have a semantic similarity score above 0.75, I remove one of them. This reduces the risk of the model overfitting to the same repeated patterns.
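
A minimal sketch of this kind of filter, assuming the similarity score is cosine similarity between embedding vectors (the source does not say which metric or embedding model is used). The vectors below are toy 2-D values standing in for real embeddings, and `semantic_dedup` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_dedup(items, embeddings, threshold=0.75):
    """Greedy dedup: keep an item only if it is not too similar to any kept item."""
    kept = []  # indices of items retained so far
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]

# Toy data: the first two instructions are near-duplicates in embedding space.
instructions = ["explain photosynthesis", "describe photosynthesis", "write a haiku"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

unique = semantic_dedup(instructions, vectors)
print(unique)
```

The greedy pass compares each item against everything kept so far, which is quadratic in the worst case. For large datasets an approximate nearest-neighbor index would be the usual replacement, but that is beyond this sketch.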

I use an LLM as a judge to score quality. It checks for:

  • Fluency
  • Relevance
  • Completeness

Only examples with a score of 3.5 or higher stay in the dataset.
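
The threshold step can be sketched as follows. Two things here are assumptions, not details from the source: that the judge returns one numeric score per criterion, and that the overall score is their plain average. The record layout and the helper names `overall_score` and `filter_by_quality` are hypothetical.

```python
from statistics import mean

CRITERIA = ("fluency", "relevance", "completeness")

def overall_score(scores):
    """Average the per-criterion judge scores (assumed aggregation rule)."""
    return mean(scores[c] for c in CRITERIA)

def filter_by_quality(records, min_score=3.5):
    """Keep only records whose overall judge score meets the threshold."""
    return [r for r in records if overall_score(r["scores"]) >= min_score]

# Toy judge outputs; in a real pipeline these would come from the LLM judge.
records = [
    {"id": "a", "scores": {"fluency": 4, "relevance": 4, "completeness": 3}},
    {"id": "b", "scores": {"fluency": 3, "relevance": 3, "completeness": 4}},
    {"id": "c", "scores": {"fluency": 3.5, "relevance": 3.5, "completeness": 3.5}},
]

kept = filter_by_quality(records)
print([r["id"] for r in kept])
```

A stricter variant would require every criterion to clear the bar individually, so that a fluent but irrelevant answer cannot pass on its average alone.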

Fine-Tuning Results:

I used QLoRA on a Qwen2.5 3B Instruct model via Google Colab. QLoRA keeps the base model frozen in 4-bit quantized form and trains small low-rank adapters instead of the full weights. This saves memory while keeping performance high.
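
To see why adapters save memory, it helps to count parameters. For a weight matrix of shape d_out x d_in, a rank-r LoRA adapter adds two small matrices with r * (d_in + d_out) trainable parameters in total. The dimensions and rank below are illustrative assumptions, not the settings used in this project.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Illustrative values only (assumed, not taken from the source).
d_in = d_out = 2048
rank = 16

full = d_in * d_out                       # parameters in the frozen weight matrix
adapter = lora_param_count(d_in, d_out, rank)

print(f"full matrix: {full:,} parameters")
print(f"adapter:     {adapter:,} parameters")
print(f"fraction trained: {adapter / full:.4f}")
```

Under these assumptions the adapter is under 2% of the size of the matrix it modifies, and only the adapter needs gradients and optimizer state.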

The results show a clear qualitative difference:

  • The base model often switches to Arabic.
  • The fine-tuned model responds in fluent, consistent Persian.
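
A rough way to quantify this kind of language switching is a character-level heuristic. This is not the evaluation method from the source, only a sketch: Persian and Arabic share most of their script, but a few letters are characteristic of each, so counting them gives a crude signal. The marker sets and the helper names `guess_language` and `persian_rate` are assumptions for illustration, and poorly normalized Persian text (for example, text typed with Arabic yeh) can fool it.

```python
# Letters characteristic of Persian: pe, che, zhe, gaf, Persian kaf, Persian yeh.
PERSIAN_MARKERS = set("\u067e\u0686\u0698\u06af\u06a9\u06cc")
# Letters typical of Arabic text: Arabic kaf, Arabic yeh, teh marbuta, alef maksura.
ARABIC_MARKERS = set("\u0643\u064a\u0629\u0649")

def guess_language(text):
    """Return 'persian', 'arabic', or 'unknown' based on marker-letter counts."""
    persian = sum(ch in PERSIAN_MARKERS for ch in text)
    arabic = sum(ch in ARABIC_MARKERS for ch in text)
    if persian > arabic:
        return "persian"
    if arabic > persian:
        return "arabic"
    return "unknown"

def persian_rate(outputs):
    """Fraction of model outputs that the heuristic labels as Persian."""
    if not outputs:
        return 0.0
    return sum(guess_language(o) == "persian" for o in outputs) / len(outputs)
```

Running `persian_rate` over the same prompts for the base and fine-tuned models would give a simple before/after number, which should then be spot-checked by a fluent reader.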

The main lesson is clear: for a low-resource language like Persian, data engineering matters more than model scaling. Data quality is the primary bottleneck.

Key Insights:

  • Dual filtering (semantic deduplication plus LLM-judge scoring) is necessary for clean data.
  • Structured topic graphs work better than free-form prompts.
  • An LLM judge is a vital part of the system.

This system is an end-to-end engine for low-resource LLM alignment.

Source: https://dev.to/mohammadheydari/designing-a-synthetic-data-pipeline-for-persian-llm-fine-tuning-from-topic-graphs-to-qlora-5cg5
