Building RAG Systems: Retrieval-Augmented Generation Guide

DodaTech Updated 2026-06-22 7 min read

In this tutorial, you'll learn how to build Retrieval-Augmented Generation (RAG) systems. We cover the key concepts, practical examples, and best practices you need to apply the technique effectively.

Retrieval-Augmented Generation (RAG) combines information retrieval with text generation: before an LLM answers a question, the system retrieves relevant context from external documents and adds it to the prompt.

What You'll Learn

In this tutorial, you'll learn how RAG systems work, how to build one from scratch using Python, how to use embedding models and vector databases for document retrieval, and how to integrate everything with an LLM for grounded question answering.

Why It Matters

LLMs have a knowledge cutoff and cannot access your private documents. RAG solves this by retrieving relevant information from your own knowledge base and providing it as context to the LLM. This enables accurate, up-to-date, and verifiable answers without retraining or fine-tuning. Python and LangChain provide the ecosystem for building RAG pipelines.

Real-World Use

Durga Antivirus Pro uses a RAG system to provide technical support. When a user describes a malware issue, the system retrieves relevant knowledge base articles, patch notes, and known threat analyses, then generates a tailored response. The retrieved documents are cited, allowing users to verify the information.

How RAG Works

A RAG system has three components: indexing (chunking and embedding documents), retrieval (finding relevant chunks), and generation (LLM answering with context).

flowchart TD
  A[Documents] --> B[Chunking]
  B --> C[Embedding Model]
  C --> D[Vector Database]
  E[User Question] --> F[Embed Query]
  F --> G[Similarity Search]
  D --> G
  G --> H[Retrieved Context]
  H --> I[LLM Prompt]
  I --> J[Generated Answer]
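
The three stages in the diagram can be sketched end to end in plain Python. This is a minimal illustration rather than a production design: the embed function is a toy word-count stand-in for a real embedding model, the index list stands in for a vector database, and the last step only assembles the prompt instead of calling an LLM. All names here are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Indexing: embed each chunk and keep (chunk, vector) pairs.
documents = [
    "FAISS performs fast similarity search over vectors",
    "Chunking splits long documents into smaller pieces",
    "The LLM generates an answer from the retrieved context",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieval: embed the question and rank chunks by similarity.
def retrieve(question, k=1):
    query_vec = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# 3. Generation: place the retrieved context in the prompt for the LLM.
question = "how does chunking split documents"
context = retrieve(question, k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

The printed prompt contains the chunking sentence as context, because it is the only document that shares words with the question.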

Chunking Documents

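Split documents into manageable chunks for retrieval. The function below measures chunk_size and overlap in characters, so a chunk can begin or end in the middle of a word.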

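def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that overlap."""
    if not 0 <= overlap < chunk_size:
        # Without this check, overlap >= chunk_size would loop forever.
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid a trailing overlap-only chunk.
        start = end - overlap  # Step back so neighboring chunks share text.
    return chunks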

sample_text = (
    "Machine learning is a subset of artificial intelligence. "
    "It enables systems to learn from data without explicit programming. "
    "Deep learning is a further subset using neural networks. "
    "Transfer learning reuses pretrained models for new tasks. "
    "RAG combines retrieval with generation for grounded answers."
)

chunks = chunk_text(sample_text, chunk_size=100, overlap=20)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:60]}...")

Expected output:

Number of chunks: 4
Chunk 1: Machine learning is a subset of artificial intelligence. It ...
Chunk 2: earn from data without explicit programming. Deep learning i...
Chunk 3: sing neural networks. Transfer learning reuses pretrained mo...
Chunk 4: RAG combines retrieval with generation for grounded answers....
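
Because chunk_text slices by character position, a chunk can start or end in the middle of a word. One alternative, sketched here under the assumption that a naive regex sentence split is good enough for your documents, groups whole sentences up to a character budget. The names chunk_by_sentence and max_chars are hypothetical and not part of any library.

```python
import re

def chunk_by_sentence(text, max_chars=120):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "Machine learning is a subset of artificial intelligence. "
    "It enables systems to learn from data without explicit programming. "
    "Deep learning is a further subset using neural networks."
)
sentence_chunks = chunk_by_sentence(text, max_chars=130)
for chunk in sentence_chunks:
    print(chunk)
```

With a 130-character budget the first two sentences share a chunk and the third gets its own, so every chunk ends on a sentence boundary. This sketch has no overlap between chunks; carrying the last sentence of one chunk into the next would be one way to add it.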

Creating Embeddings

Convert text chunks into vector embeddings for similarity search.

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

chunks = [
    "RAG combines retrieval with generation for accurate answers.",
    "Vector databases store embeddings for efficient similarity search.",
    "LLMs have a knowledge cutoff and cannot access private data.",
    "Embedding models convert text into numerical vectors."
]

embeddings = embedder.encode(chunks)
print(f"Number of chunks: {len(chunks)}")
print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"Embedding dtype: {embeddings.dtype}")

similarity = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(f"Similarity between chunk 0 and 1: {similarity:.3f}")

Expected output (the exact similarity score may differ slightly across model and library versions):

Number of chunks: 4
Embedding dimension: 384
Embedding dtype: float32
Similarity between chunk 0 and 1: 0.234

Building a Vector Store

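Use FAISS for fast similarity search. IndexFlatL2 performs exact (brute-force) search and returns squared L2 distances, so smaller values mean closer matches.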

import faiss

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

query = "How does RAG help with LLM limitations?"
query_embedding = embedder.encode([query])

k = 2
distances, indices = index.search(query_embedding, k)

print(f"Query: {query}")
print(f"\nTop {k} results:")
for i, idx in enumerate(indices[0]):
    print(f"  {i+1}. (distance: {distances[0][i]:.3f}) {chunks[idx]}")

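Expected output (distance values are elided because they depend on the model version; IndexFlatL2 reports squared L2 distances, which stay between 0 and 4 for unit-length embeddings such as those all-MiniLM-L6-v2 produces):

Query: How does RAG help with LLM limitations?

Top 2 results:
  1. (distance: ...) RAG combines retrieval with generation for accurate answers.
  2. (distance: ...) LLMs have a knowledge cutoff and cannot access private data.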
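
To make clear what a flat L2 index computes, here is a pure-Python sketch of exhaustive search. It assumes the semantics of FAISS's IndexFlatL2 (squared Euclidean distance from the query to every stored vector) and uses hypothetical helper names and toy two-dimensional vectors. It also checks that, for unit-length vectors, the squared L2 distance equals 2 - 2 * cosine similarity, which is why both metrics rank normalized embeddings in the same order.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def flat_l2_search(vectors, query, k):
    """Exhaustive search: score every stored vector, return the k nearest
    as (squared_distance, index) pairs, closest first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:k]

vectors = [normalize(v) for v in ([1.0, 0.0], [0.6, 0.8], [0.0, 1.0])]
query = normalize([1.0, 0.2])

results = flat_l2_search(vectors, query, k=2)
for dist, idx in results:
    cos = sum(x * y for x, y in zip(vectors[idx], query))
    # For unit vectors the two columns printed after the index agree.
    print(idx, round(dist, 4), round(2 - 2 * cos, 4))
```

Exhaustive search is exact but its cost grows linearly with the number of stored vectors, which is the trade-off approximate indexes are designed to avoid.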

Complete RAG Pipeline

Combine retrieval and generation into a single answer.

from openai import OpenAI

client = OpenAI()

context_chunks = [
    "RAG stands for Retrieval-Augmented Generation. It retrieves relevant "
    "documents from a knowledge base and provides them as context to an LLM.",

    "Vector databases like Pinecone, Weaviate, and Chroma store embeddings "
    "and enable efficient similarity search across millions of documents.",

    "The retrieved context is inserted into the LLM prompt alongside the "
    "user's question. The LLM generates an answer grounded in the context."
]

user_question = "What is RAG and how does it work?"

context = "\n\n".join(context_chunks)
prompt = (
    f"Answer the question based ONLY on the following context.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {user_question}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Answer based only on the provided context."
        },
        {"role": "user", "content": prompt}
    ],
    temperature=0
)

answer = response.choices[0].message.content
print(f"RAG Answer:\n{answer}")

Expected output:

RAG Answer:
RAG (Retrieval-Augmented Generation) is a technique that retrieves relevant documents from a knowledge base and provides them as context to an LLM. The LLM then generates an answer that is grounded in the retrieved context, making the response more accurate and based on specific external information rather than just the model's training data.
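
The pipeline above joins raw chunk text, so the model has nothing to cite. A possible extension, sketched under the assumption that each retrieved chunk is a dict with text and source keys, numbers the chunks and asks the model to cite them. The build_prompt helper, the dict layout, and the file names are hypothetical.

```python
def build_prompt(question, retrieved):
    """Format retrieved chunks as numbered sources ahead of the question."""
    blocks = [
        f"[{n}] (source: {item['source']})\n{item['text']}"
        for n, item in enumerate(retrieved, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question based ONLY on the following context. "
        "Cite sources by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieval results; in a real system these come from the index.
retrieved = [
    {"text": "RAG retrieves documents and passes them to an LLM.",
     "source": "rag_intro.md"},
    {"text": "Vector databases store embeddings.",
     "source": "vector_db.md"},
]
cited_prompt = build_prompt("What is RAG?", retrieved)
print(cited_prompt)
```

The resulting string can be sent as the user message in place of the plain prompt. The instruction to admit when the context lacks an answer is a common guard against unsupported replies, though the model is not guaranteed to follow it.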

RAG Architecture Options

Component	Options	Trade-offs
Embedding Model	OpenAI, Sentence Transformers, BGE	Quality vs cost vs speed
Vector Database	FAISS, Chroma, Pinecone, Weaviate	Local vs cloud, scale
Chunking Strategy	Fixed size, semantic, recursive	Granularity vs context
Retrieval Method	Similarity, MMR, hybrid	Relevance vs diversity
LLM	GPT-4, Claude, Llama, Mistral	Quality vs cost vs latency
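
The MMR option in the table (maximal marginal relevance) trades relevance against diversity. A minimal sketch follows, assuming unit-length vectors so that a dot product equals cosine similarity. The function name, the lambda_mult parameter, and the toy two-dimensional vectors are illustrative assumptions, not any library's API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick k document indices, balancing similarity to the query against
    similarity to documents that were already picked.

    lambda_mult=1.0 ranks by relevance only; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = dot(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if none).
            redundancy = max(
                (dot(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Documents 0 and 1 point in nearly the same direction; document 2 differs.
doc_vecs = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0]]
query_vec = [0.8, 0.6]

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # relevance and diversity
```

With lambda_mult=1.0 the two near-duplicate documents are returned; with 0.5 the second pick switches to the document that adds new information.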

Common Errors and Mistakes

Mistake	Why It Happens	How to Fix
Chunks too large	Exceed LLM context window	Keep chunks 200-500 tokens
No overlap in chunks	Meaning split across chunks	Use 10-20% overlap
Not filtering irrelevant chunks	Noise reduces answer quality	Set similarity threshold, re-rank
Missing metadata	Cannot cite sources	Store document source with each chunk
Embedding dimension mismatch	Different models produce different dimensions	Use same model for indexing and querying

Practice Questions

What problem does RAG solve that fine-tuning does not?

Answer: RAG provides access to specific documents at inference time without retraining. It supports up-to-date information, private data, and verifiable citations. Fine-tuning teaches the model general patterns but cannot incorporate new documents without retraining.

What is the role of embeddings in a RAG system?

Answer: Embeddings convert text chunks into numerical vectors that capture semantic meaning. The query is also embedded, and similarity search finds the chunks whose vectors are closest to the query vector, retrieving the most relevant context.

How do vector databases enable fast retrieval?

Answer: Vector databases use specialized index structures, such as IVF (which partitions the vector space into clusters) and HNSW (which builds a navigable graph of neighbors), to run approximate nearest neighbor search in milliseconds over millions of vectors instead of comparing the query against every stored vector.

Why is chunking strategy important in RAG?

Answer: The chunk size determines how much context each retrieved item provides. Too small and chunks lack full meaning. Too large and they exceed the LLM's context window or contain irrelevant information. Overlap prevents information from being split across chunks.

How does a RAG prompt differ from a standard LLM prompt?

Answer: A RAG prompt prepends retrieved context before the user question with instructions like "Answer based only on the provided context." This grounds the LLM's response in the retrieved documents rather than relying on its training data.

Challenge

Build a complete RAG system for a collection of Python documentation (or any technical documents). Chunk the documents, create embeddings using Sentence Transformers, store them in a FAISS index, and build a question-answering interface. Implement hybrid search that combines keyword (BM25) and semantic search for better retrieval quality.
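
As a starting point for the hybrid search part, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The challenge does not prescribe a fusion method, so treat this as one option among several. The sketch assumes you already have two ranked lists of document IDs; the IDs are made up, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a BM25 search and an embedding search.
keyword_ranking = ["doc_b", "doc_a", "doc_d"]
semantic_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

RRF uses only rank positions, so it avoids having to put BM25 scores and vector distances on a common scale.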

Real-World Task

Design a RAG system for a customer support knowledge base. The system ingests support articles, product documentation, and FAQ pages. When a customer asks a question, it retrieves the most relevant articles and generates a personalized answer. Include citation metadata so the support agent can verify the source. Deploy as a REST API using FastAPI with endpoints for indexing and querying.

Next Steps

Now that you can build RAG systems, explore LangChain for production pipelines and study LLM prompt engineering to get more out of the generation step. Python and Hugging Face provide the ecosystem for RAG development.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
