Building A RAG From Scratch

My first AI version told me I sold a hydraulic excavator. I do not sell excavators. It gave me a fake price and a fake description with total confidence.

That was the moment I stopped trusting prompts alone. I rebuilt the system with one rule: it answers from the catalog, or it does not answer at all.

Here is how I built a reliable RAG (Retrieval-Augmented Generation) system using Postgres and Python.

The Data Prep Most tutorials skip the hard part: cleaning data. I split my process into two stages:

  • Stage 1: Download HTML files to disk. I save metadata as a comment at the top of each file. This makes the process idempotent. If a file exists, I skip it.
  • Stage 2: Parse those files offline. This turns HTML into a clean JSON catalog.

I check field coverage after parsing. If a field like weight or price is empty, I find out immediately. Clean data is where the real work happens.

The AI Part I turn each product into a block of text and convert it into a vector using the bge-m3 model. I store these vectors in Postgres using the pgvector extension.

I use a hybrid search approach to find products:

  • Semantic Search: Uses vectors to find products that match the meaning of your question.
  • Structured Filters: I use an LLM to turn a query like "Siemens motors under €2000" into JSON. This allows me to run a SQL query with exact filters for brand and price.

One SQL statement handles both the fuzzy search and the hard filters. This keeps everything in sync.

The Guardrails A good RAG must know when to shut up. I use two layers to prevent hallucinations:

  • Similarity Threshold: Every match gets a score. If the score is below a set limit, I drop the results. If no results pass, the system says "not found" without even calling the LLM. You cannot hallucinate if the model never sees the data.
  • Strict System Prompt: I tell the model to answer only from the provided products. If the products are irrelevant, it must refuse.

The threshold makes bad behavior impossible. The prompt just asks for good behavior. Use both.

Throughput Summary

  • Collect carefully.
  • Clean honestly.
  • Embed simply.
  • Refuse by design.

The refusal is what makes the system trustworthy. Trust comes from architecture, not from asking a model to be nice.

Source: https://dev.to/utku_catal/building-a-rag-from-scratch-collect-clean-embed-refuse-20ob

Optional learning community: https://t.me/GyaanSetuAi