𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗔 𝗥𝗔𝗚 𝗙𝗿𝗼𝗺 𝗦𝗰𝗿𝗮𝘁𝗰𝗵

My first AI version told me I sold a hydraulic excavator. I do not sell excavators. It gave me a fake price and a fake description with total confidence.

That was the moment I stopped trusting prompts alone. I rebuilt the system with one rule: it answers from the catalog, or it does not answer at all.

Here is how I built a reliable RAG (Retrieval-Augmented Generation) system using Postgres and Python.

𝗧𝗵𝗲 𝗗𝗮𝘁𝗮 𝗣𝗿𝗲𝗽

Most tutorials skip the hard part: cleaning data. I split my process into two stages:

  • Stage 1: Download HTML files to disk. I save metadata as a comment at the top of each file. If a file already exists, I skip it, which makes the process idempotent.
  • Stage 2: Parse those files offline. This turns HTML into a clean JSON catalog.
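
A minimal sketch of what Stage 1 could look like. This is not my exact code: the function name `save_page`, the injected `fetch` callable, and the JSON layout inside the HTML comment are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def save_page(url: str, path: Path, fetch: Callable[[str], str]) -> bool:
    """Download one page to disk unless it is already there.

    Returns True if the page was fetched, False if it was skipped.
    `fetch` is any function mapping a URL to HTML text (hypothetical;
    in practice it could wrap an HTTP client).
    """
    if path.exists():
        # Skipping existing files is what makes reruns idempotent.
        return False
    html = fetch(url)
    # Metadata goes into an HTML comment on the first line (assumed layout).
    header = "<!-- " + json.dumps({"source_url": url}) + " -->\n"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(header + html, encoding="utf-8")
    return True
```

Passing `fetch` in as a parameter keeps the download logic testable without network access, and Stage 2 can read the first line of each file to recover where it came from.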

I check field coverage after parsing. If a field like weight or price is empty, I find out immediately. Clean data is where the real work happens.
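
A coverage check can be very small. The sketch below assumes the catalog is a list of dicts loaded from the JSON file; the function name `field_coverage` and the field names in the usage note are illustrative, not taken from my real schema.

```python
from collections import Counter


def field_coverage(catalog: list[dict], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of products with a non-empty value."""
    filled = Counter()
    for product in catalog:
        for field in fields:
            value = product.get(field)
            # Treat None, empty strings, and whitespace-only strings as missing.
            if value is not None and str(value).strip() != "":
                filled[field] += 1
    total = len(catalog) or 1  # avoid division by zero on an empty catalog
    return {field: filled[field] / total for field in fields}
```

Calling `field_coverage(catalog, ["name", "price", "weight"])` right after parsing gives one number per field, so a parser that silently stopped extracting prices shows up as a sudden drop.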

𝗧𝗵𝗲 𝗔𝗜 𝗣𝗮𝗿𝘁

I turn each product into a block of text and convert it into a vector using the bge-m3 model. I store these vectors in Postgres using the pgvector extension.
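
Here is a rough sketch of the "product to text to vector" step. The field names, the label format, and loading bge-m3 through the sentence-transformers package under the name `BAAI/bge-m3` are assumptions for illustration; the post itself only fixes the model and the storage.

```python
def product_to_text(product: dict) -> str:
    """Render one catalog entry as a labeled text block for embedding."""
    lines = []
    # Field names and their order are assumptions; empty fields are left out.
    for key in ("name", "brand", "category", "description", "price"):
        value = product.get(key)
        if value is not None and value != "":
            lines.append(f"{key.capitalize()}: {value}")
    return "\n".join(lines)


def embed_texts(texts: list[str]):
    """Embed texts with bge-m3 (assumes sentence-transformers is installed)."""
    # Imported lazily so the text-building code works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")
    # Normalized vectors make cosine similarity and dot product equivalent.
    return model.encode(texts, normalize_embeddings=True)
```

Keeping the text rendering separate from the embedding call makes it easy to inspect exactly what the model sees for each product.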

I use a hybrid search approach to find products:

  • Semantic Search: Uses vectors to find products that match the meaning of your question.
  • Structured Filters: I use an LLM to turn a query like "Siemens motors under €2000" into JSON. This allows me to run a SQL query with exact filters for brand and price.
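
LLM output should not be trusted as-is, even when you ask for JSON. The sketch below shows one way to validate it before it gets anywhere near SQL. The filter keys `brand` and `max_price` and the function name `parse_filters` are hypothetical; the real schema depends on your catalog.

```python
import json


def parse_filters(llm_output: str) -> dict:
    """Parse the LLM's JSON reply and keep only known, well-typed filters."""
    try:
        raw = json.loads(llm_output)
    except json.JSONDecodeError:
        return {}  # unusable reply: fall back to semantic search only
    if not isinstance(raw, dict):
        return {}
    filters = {}
    brand = raw.get("brand")
    if isinstance(brand, str) and brand.strip():
        filters["brand"] = brand.strip()
    max_price = raw.get("max_price")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(max_price, (int, float)) and not isinstance(max_price, bool)
    if is_number and max_price > 0:
        filters["max_price"] = float(max_price)
    return filters
```

For a reply like `{"brand": "Siemens", "max_price": 2000}` this keeps both filters; for anything malformed it returns an empty dict, so a bad extraction degrades to plain semantic search instead of a broken query.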

One SQL statement handles both the fuzzy search and the hard filters. This keeps everything in sync.
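
A sketch of how that single statement might be assembled. The table name `products`, the columns `embedding`, `brand`, and `price`, and the filter keys are assumptions; `<=>` is pgvector's cosine-distance operator, and the `%(name)s` placeholders follow the psycopg named-parameter style.

```python
def build_search_query(filters: dict, limit: int = 5) -> tuple[str, dict]:
    """Build one parameterized statement: vector ranking plus exact filters.

    The caller adds params["query_vec"], the query embedding in pgvector's
    text form, e.g. "[0.12,0.03,...]", before executing.
    """
    where = []
    params = {"limit": limit}
    if "brand" in filters:
        where.append("brand = %(brand)s")
        params["brand"] = filters["brand"]
    if "max_price" in filters:
        # "under 2000" is read as strictly less than the limit.
        where.append("price < %(max_price)s")
        params["max_price"] = filters["max_price"]
    where_sql = ("WHERE " + " AND ".join(where)) if where else ""
    # Cosine similarity = 1 - cosine distance.
    sql = f"""
        SELECT id, name, price,
               1 - (embedding <=> %(query_vec)s::vector) AS similarity
        FROM products
        {where_sql}
        ORDER BY embedding <=> %(query_vec)s::vector
        LIMIT %(limit)s
    """
    return sql, params
```

Only fixed SQL fragments are concatenated; every user-derived value travels as a bound parameter, so the LLM-extracted filters cannot inject SQL.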

𝗧𝗵𝗲 𝗚𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹𝘀

A good RAG must know when to shut up. I use two layers to prevent hallucinations:

  • Similarity Threshold: Every match gets a score. If the score is below a set limit, I drop that result. If no results pass, the system says "not found" without even calling the LLM. The model cannot hallucinate if it is never called.
  • Strict System Prompt: I tell the model to answer only from the provided products. If the products are irrelevant, it must refuse.

The threshold enforces the refusal in code: when nothing passes, the model is never called, so it cannot invent an answer. The prompt only asks for good behavior. Use both.
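
The two layers fit in a few lines. This is a sketch under assumptions: the threshold value `0.5` is a placeholder you would tune on your own data, `call_llm(system, user)` stands in for whatever client you use, and the prompt wording is illustrative.

```python
from typing import Callable

NOT_FOUND = "not found"

SYSTEM_PROMPT = (
    "Answer only from the products listed below. "
    "If they do not answer the question, say you cannot answer."
)


def answer(
    question: str,
    matches: list[tuple[str, float]],
    call_llm: Callable[[str, str], str],
    threshold: float = 0.5,  # placeholder value, tune on real queries
) -> str:
    """matches holds (product_text, similarity) pairs from the search step."""
    # Layer 1: drop weak matches; with nothing left, refuse without the LLM.
    relevant = [text for text, score in matches if score >= threshold]
    if not relevant:
        return NOT_FOUND
    # Layer 2: the strict system prompt covers the cases that pass the gate.
    context = "\n\n".join(relevant)
    return call_llm(SYSTEM_PROMPT, f"Products:\n{context}\n\nQuestion: {question}")
```

The early `return` is the part that matters: the refusal path contains no model call at all, so it behaves the same no matter how the prompt is worded.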

𝗦𝘂𝗺𝗺𝗮𝗿𝘆

  • Collect carefully.
  • Clean honestly.
  • Embed simply.
  • Refuse by design.

The refusal is what makes the system trustworthy. Trust comes from architecture, not from asking a model to be nice.

Building a RAG From Scratch: Collect, Clean, Embed, and Retrieve

RAG (Retrieval-Augmented Generation) is a technique that improves the answers of LLMs (Large Language Models) by letting them look up information from external sources before answering. Instead of relying only on the model's internal knowledge, RAG lets the model "read" your own documents so that its answers are more accurate and backed by evidence.

In this guide, we walk through the four main steps of building a RAG system:

  1. Collect
  2. Clean
  3. Embed
  4. Retrieve

1. Collect

The first step is to gather the data you want your model to know about. This data can come in many forms:

  • PDF files: Reports, books, or technical documents.
  • Text files: Plain-text notes and descriptions.
  • Websites (web scraping): Text taken from web pages.
  • Databases: Data from SQL or NoSQL databases.

The key is to make sure the data you collect is accurate and relevant to your target topic.

2. Clean

Raw data often contains "noise" that can interfere with the model's ability to understand it. Cleaning involves:

  • Removing HTML tags: Needed if you are using data from websites.
  • Removing unnecessary symbols: Stray characters, odd symbols, or extra whitespace.
  • Fixing the structure: Making sure the text follows a readable flow.

A simple Python example for cleaning text:

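import re

def clean_text(text):
    # Remove HTML tags, even ones that span several lines; replace each
    # with a space so that words on either side of a tag do not get glued together
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove unwanted symbols; \w keeps letters in any language
    # (including accented ones), digits, and underscores
    text = re.sub(r'[^\w\s.,!?]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text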

3. Embed

This is where the scientific magic happens. Computers do not understand words the way humans do; they work with numbers. Embedding is the process of converting text into "vectors" (lists of numbers) that represent the meaning of that text.

Before embedding, you must do chunking. Because models have a limit on how much text they accept at once (the context window), you cannot send a whole book in one go. You need to split it into small pieces (chunks).

Embedding steps:

  1. Chunking: Split the text into pieces (for example, 500 words per chunk).
  2. Embedding Model: Use a model such as OpenAI's text-embedding-3-small, or an open-source model from HuggingFace, to convert each chunk into a vector.
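
The chunking step can be sketched in plain Python. This is a minimal word-based splitter, assuming whitespace-separated words are a good enough unit; the optional `overlap` parameter is an addition for illustration and is not part of the steps above.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    With overlap > 0, each chunk repeats the last `overlap` words of the
    previous one, which can help when a sentence straddles a boundary.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each returned chunk is then passed to the embedding model of your choice. Real systems often split on sentences or tokens instead of words, but the shape of the loop stays the same.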

4. Retrieve

Once you have your embeddings, you need somewhere to store them so that you can search them quickly. This is where a vector database comes in. Examples include ChromaDB, FAISS (a similarity-search library rather than a full database), and Pinecone.

How the retrieval process works:

  1. User Question: The user asks a question (for example: "Does the company offer insurance services?").
  2. Query Embedding: The question is converted into a vector using the same model you used in the embedding step.
  3. Similarity Search: The database looks for the chunks whose vectors are closest to the question's vector (most often using cosine similarity).
  4. Context Injection: The retrieved chunks are used as "context" alongside the user's question and sent to the LLM to produce the final answer.
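
Steps 3 and 4 can be illustrated without any database. The sketch below does a brute-force cosine-similarity search over an in-memory list and then assembles the prompt; the function names and the prompt layout are assumptions, and a real vector database would replace the linear scan with an index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, k=3):
    """index is a list of (chunk_text, vector); returns top-k (text, score)."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, retrieved) -> str:
    """Context injection: put the retrieved chunks in front of the question."""
    context = "\n\n".join(text for text, _score in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The string returned by `build_prompt` is what gets sent to the LLM. The query vector must come from the same embedding model as the stored vectors, otherwise the similarity scores are meaningless.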

Conclusion

Building a RAG is a journey from raw data to a system that can "think" based on that data. By following the steps of Collect, Clean, Embed, and Retrieve, you can build highly capable systems that help with data analysis and customer service.
