𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗔 𝗥𝗔𝗚 𝗙𝗿𝗼𝗺 𝗦𝗰𝗿𝗮𝘁𝗰𝗵
My first AI version told me I sold a hydraulic excavator. I do not sell excavators. It gave me a fake price and a fake description with total confidence.
That was the moment I stopped trusting prompts alone. I rebuilt the system with one rule: it answers from the catalog, or it does not answer at all.
Here is how I built a reliable RAG (Retrieval-Augmented Generation) system using Postgres and Python.
𝗧𝗵𝗲 𝗗𝗮𝘁𝗮 𝗣𝗿𝗲𝗽 Most tutorials skip the hard part: cleaning data. I split my process into two stages:
- Stage 1: Download HTML files to disk. I save metadata as a comment at the top of each file. This makes the process idempotent. If a file exists, I skip it.
- Stage 2: Parse those files offline. This turns HTML into a clean JSON catalog.
I check field coverage after parsing. If a field like weight or price is empty, I find out immediately. Clean data is where the real work happens.
𝗧𝗵𝗲 𝗔𝗜 𝗣𝗮𝗿𝘁 I turn each product into a block of text and convert it into a vector using the bge-m3 model. I store these vectors in Postgres using the pgvector extension.
I use a hybrid search approach to find products:
- Semantic Search: Uses vectors to find products that match the meaning of your question.
- Structured Filters: I use an LLM to turn a query like "Siemens motors under €2000" into JSON. This allows me to run a SQL query with exact filters for brand and price.
One SQL statement handles both the fuzzy search and the hard filters. This keeps everything in sync.
𝗧𝗵𝗲 𝗚𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹𝘀 A good RAG must know when to shut up. I use two layers to prevent hallucinations:
- Similarity Threshold: Every match gets a score. If the score is below a set limit, I drop the results. If no results pass, the system says "not found" without even calling the LLM. You cannot hallucinate if the model never sees the data.
- Strict System Prompt: I tell the model to answer only from the provided products. If the products are irrelevant, it must refuse.
The threshold makes bad behavior impossible. The prompt just asks for good behavior. Use both.
𝗧𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁 𝗦𝘂𝗺𝗺𝗮𝗿𝘆
- Collect carefully.
- Clean honestly.
- Embed simply.
- Refuse by design.
The refusal is what makes the system trustworthy. Trust comes from architecture, not from asking a model to be nice.
Source: https://dev.to/utku_catal/building-a-rag-from-scratch-collect-clean-embed-refuse-20ob
Optional learning community: https://t.me/GyaanSetuAi