
RAG Pipeline Deep Dive — Retrieval-Augmented Generation

DodaTech Updated 2026-06-22 6 min read

This tutorial takes a RAG pipeline apart stage by stage, covering the key concepts, working code examples, and best practices you need to build one yourself.

Retrieval-Augmented Generation (RAG) combines information retrieval with LLMs to produce accurate, grounded answers from your own data — this deep dive walks through every component of a production RAG pipeline.

What You'll Learn

You'll learn to build a complete RAG pipeline: document chunking, embedding generation, vector storage, semantic search, context retrieval, and LLM answer generation with source citations.

Why It Matters

LLMs have a knowledge cutoff and cannot access your private data. RAG bridges this gap by retrieving relevant information from your documents and feeding it into the LLM as context, producing answers that are current, accurate, and traceable.

Real-World Use

Doda Browser uses a RAG pipeline to power its smart search feature. User queries are embedded, matched against the browser's indexed help documentation and release notes, and the LLM generates answers with direct citations to the source documents.

Architecture Overview

flowchart LR
    A[Documents] --> B[Chunking]
    B --> C[Embedding Model]
    C --> D[Vector Database]
    E[User Query] --> F[Embedding Model]
    F --> G[Semantic Search]
    D --> G
    G --> H[Retrieved Context]
    H --> I[LLM]
    I --> J[Grounded Answer]
    J --> K[User]

Step 1: Document Chunking

Document chunking splits large documents into manageable pieces while preserving semantic meaning.

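# Requires the langchain-text-splitters package; older LangChain releases
# exposed the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter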

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""],
    length_function=len,
)

with open("documentation.md", "r") as f:
    document = f.read()

chunks = text_splitter.split_text(document)
print(f"Split {len(document)} characters into {len(chunks)} chunks")

for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(chunk[:150] + "...")

Expected output:

Split 24500 characters into 28 chunks

Chunk 1 (980 chars):
# API Reference
## Authentication
All API requests require an API key passed in the Authorization header...

Chunk 2 (1020 chars):
### Rate Limiting
API calls are limited to 100 requests per minute per key...
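To make the role of `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators and does not cut at fixed offsets like this. The function name `chunk_with_overlap` and the sample string are illustrative assumptions, not part of LangChain.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary intact in at least one chunk.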

Step 2: Generating Embeddings

Embeddings convert text chunks into numerical vectors that capture semantic meaning.

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embeddings(texts, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]

chunk_texts = chunks[:10]  # embed only the first 10 chunks for this demo
embeddings = get_embeddings(chunk_texts)

print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding dimension: {len(embeddings[0])}")

# Check similarity between first two chunks.
# OpenAI embeddings are unit-length, so the dot product equals cosine similarity.
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity between chunk 1 and 2: {similarity:.4f}")

Expected output:

Generated 10 embeddings
Each embedding dimension: 1536
Similarity between chunk 1 and 2: 0.7842
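The dot product above only works as a similarity score when both vectors have length 1. For embeddings that are not normalized, divide by the vector lengths. The sketch below uses only the standard library, and the two-dimensional vectors are made-up toy values chosen so the arithmetic is easy to follow.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b, independent of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


v1 = [3.0, 4.0]
v2 = [6.0, 8.0]   # same direction as v1, twice the length
v3 = [-4.0, 3.0]  # perpendicular to v1

raw_dot = sum(x * y for x, y in zip(v1, v2))
print(raw_dot)                    # 50.0 (grows with vector length)
print(cosine_similarity(v1, v2))  # 1.0  (identical direction)
print(cosine_similarity(v1, v3))  # 0.0  (unrelated direction)
```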

Step 3: Vector Database with Chroma

Chroma is a lightweight vector database that stores embeddings and enables fast similarity search.

import chromadb

# Initialize a persistent Chroma client (Chroma 0.4 or later);
# data is stored on disk in ./chroma_db
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Create or get collection
collection = chroma_client.get_or_create_collection(
    name="documentation",
    metadata={"hnsw:space": "cosine"}
)

# Add chunks with embeddings
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunk_texts))],
    documents=chunk_texts,
    embeddings=embeddings,
    metadatas=[{"source": "documentation.md", "chunk_index": i}
               for i in range(len(chunk_texts))]
)

# Semantic search
query = "What are the rate limits for API calls?"
query_embedding = get_embeddings([query])[0]

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["documents", "distances", "metadatas"]
)

for i, (doc, dist, meta) in enumerate(zip(
    results["documents"][0],
    results["distances"][0],
    results["metadatas"][0]
)):
    print(f"\nResult {i+1} (distance: {dist:.4f})")
    print(f"Source: {meta['source']}, Chunk: {meta['chunk_index']}")
    print(doc[:200] + "...")

Expected output:

Result 1 (distance: 0.1234)
Source: documentation.md, Chunk: 3
API calls are limited to 100 requests per minute per key...

Result 2 (distance: 0.2567)
Source: documentation.md, Chunk: 4
If you exceed the rate limit, the API returns a 429 status...
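Conceptually, a similarity query ranks every stored vector by its distance to the query vector and returns the closest `n_results`. The brute-force sketch below illustrates that idea with cosine distance (1 minus cosine similarity), the metric selected by `"hnsw:space": "cosine"`. It is not how Chroma works internally, since Chroma uses an approximate HNSW index. The `store` dictionary and function names are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def brute_force_query(query_vec, store, n_results=3):
    """Return the n_results closest (id, distance) pairs, nearest first."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Toy 2-D "embeddings" standing in for real 1536-dimensional vectors
store = {
    "chunk_0": [1.0, 0.0],
    "chunk_1": [0.0, 1.0],
    "chunk_2": [1.0, 1.0],
}
hits = brute_force_query([1.0, 0.0], store, n_results=2)
for doc_id, dist in hits:
    print(f"{doc_id}: distance {dist:.4f}")
```

A lower distance means a closer match, which is why the first result in the output above has the smallest value.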

Step 4: Retrieval-Augmented Generation

The final step combines retrieved context with the LLM to generate grounded answers.

def rag_answer(query, collection, llm_client, n_results=3):
    # 1. Embed the query
    query_embedding = get_embeddings([query])[0]

    # 2. Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas"]
    )

    # 3. Build context from retrieved chunks
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context_parts.append(
            f"[Source: {meta['source']}, Section: {meta['chunk_index']}]\n{doc}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # 4. Generate answer with context
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """
You are a helpful documentation assistant. Answer based ONLY on the
provided context. If the context does not contain the answer, say
"I cannot find this information in the documentation."

Always cite the source of your information.
"""},
            {"role": "user", "content": f"""
Context:
{context}

Question: {query}

Answer with citations:
"""}
        ]
    )

    return response.choices[0].message.content

# Test the RAG pipeline
answer = rag_answer(
    "What happens if I exceed the API rate limit?",
    collection,
    client
)
print(answer)

Expected output:

Based on the documentation, if you exceed the API rate limit of 100 requests
per minute, the API returns a 429 status code. The documentation recommends
implementing exponential backoff in your client to handle rate limiting
gracefully.

[Source: documentation.md, Section: 4]
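Retrieved chunks can add up to more text than you want to send to the model. One possible refinement, sketched here as an assumption and not as part of the pipeline above, is to cap the context at a character budget while keeping the same source labels. The helper name `build_context` and the `max_chars` parameter are hypothetical.

```python
SEPARATOR = "\n\n---\n\n"


def build_context(documents, metadatas, max_chars=4000):
    """Join labeled chunks in ranked order until the character budget is used up."""
    parts = []
    used = 0
    for doc, meta in zip(documents, metadatas):
        block = f"[Source: {meta['source']}, Section: {meta['chunk_index']}]\n{doc}"
        cost = len(block) + (len(SEPARATOR) if parts else 0)
        if used + cost > max_chars:
            break  # later chunks are lower-ranked, so stop here
        parts.append(block)
        used += cost
    return SEPARATOR.join(parts)


docs = [
    "API calls are limited to 100 requests per minute per key.",
    "If you exceed the rate limit, the API returns a 429 status.",
]
metas = [
    {"source": "documentation.md", "chunk_index": 3},
    {"source": "documentation.md", "chunk_index": 4},
]

full_context = build_context(docs, metas)                 # both chunks fit
tight_context = build_context(docs, metas, max_chars=120)  # only the first fits
print(tight_context)
```

Because results arrive sorted by distance, truncating from the end drops the least relevant chunks first.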

Hybrid Search

Combine semantic search with keyword search for better retrieval accuracy.

from sklearn.feature_extraction.text import TfidfVectorizer

def hybrid_search(query, collection, alpha=0.5):
    """Combine semantic and keyword search scores."""
    # Semantic search: Chroma returns cosine distances (lower is closer)
    semantic_embedding = get_embeddings([query])[0]
    semantic_results = collection.query(
        query_embeddings=[semantic_embedding],
        n_results=10,
        include=["documents", "distances"]
    )

    # TF-IDF keyword search over every stored document
    stored = collection.get(include=["documents"])
    doc_ids, all_docs = stored["ids"], stored["documents"]
    tfidf_matrix = TfidfVectorizer().fit_transform(all_docs + [query])
    query_vec = tfidf_matrix[-1]
    doc_vecs = tfidf_matrix[:-1]
    keyword_scores = (doc_vecs @ query_vec.T).toarray().flatten()
    row_of = {doc_id: row for row, doc_id in enumerate(doc_ids)}

    # Combine scores, keyed by document ID so callers can look up the chunk
    combined = {}
    for doc_id, dist in zip(semantic_results["ids"][0],
                            semantic_results["distances"][0]):
        sem_score = 1 - dist  # convert cosine distance to similarity
        combined[doc_id] = alpha * sem_score + (1 - alpha) * keyword_scores[row_of[doc_id]]

    # Top 3 (document ID, combined score) pairs, best first
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:3]

Expected behavior: Hybrid search returns documents that are both semantically similar to the query and contain relevant keywords, improving accuracy for domain-specific terminology.
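A weighted sum of semantic and keyword scores assumes both sit on comparable scales, which is not guaranteed for cosine similarity and TF-IDF. One common option, shown here as a sketch and not as the method used above, is to min-max normalize each score set before blending. The function names and the sample scores are illustrative assumptions.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1]; all-equal scores map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(semantic: dict[str, float], keyword: dict[str, float], alpha: float = 0.5):
    """Blend normalized scores; alpha=1.0 is pure semantic, 0.0 is pure keyword."""
    sem, kw = min_max(semantic), min_max(keyword)
    combined = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


# Made-up scores for three chunks
semantic_scores = {"chunk_3": 0.75, "chunk_4": 0.5, "chunk_7": 0.25}
keyword_scores = {"chunk_3": 0.0, "chunk_4": 0.5, "chunk_7": 0.25}

ranked = fuse(semantic_scores, keyword_scores, alpha=0.5)
print(ranked)
# [('chunk_4', 0.75), ('chunk_3', 0.5), ('chunk_7', 0.25)]
```

Here `chunk_3` wins on semantic similarity alone, but `chunk_4` ranks first once its strong keyword match is taken into account.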

Common Errors

| Error | Cause | Fix |
| --- | --- | --- |
| Retrieved chunks are irrelevant | Poor chunking strategy | Adjust chunk size and overlap |
| Answer contradicts retrieved context | LLM ignores context | Strengthen the system prompt instruction |
| High latency on large corpora | Vector search too slow | Use an approximate nearest neighbor (ANN) index |
| Out-of-memory with many documents | Embeddings stored in memory | Use a persistent vector database with disk storage |
| Empty retrieval results | Query embeddings drift from document embeddings | Use the same embedding model for indexing and querying |
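Another guard against irrelevant chunks, offered here as an assumption that complements the fixes in the table, is to drop results whose distance exceeds a cutoff and fall back to the "cannot find" message when nothing survives. The sketch operates on a dictionary shaped like a Chroma query result. The `max_distance` value of 0.5 is an arbitrary placeholder that would need tuning on your own data, and the sample results are invented.

```python
def filter_by_distance(results: dict, max_distance: float = 0.5) -> list[tuple[str, float]]:
    """Keep only (document, distance) pairs at or below the distance cutoff."""
    docs = results["documents"][0]       # results are nested one list per query
    dists = results["distances"][0]
    return [(doc, dist) for doc, dist in zip(docs, dists) if dist <= max_distance]


# Invented results in the same nested-list shape that collection.query returns
sample_results = {
    "documents": [[
        "API calls are limited to 100 requests per minute per key...",
        "Unrelated installation notes...",
    ]],
    "distances": [[0.12, 0.83]],
}

kept = filter_by_distance(sample_results, max_distance=0.5)
if kept:
    for doc, dist in kept:
        print(f"{dist:.2f}  {doc}")
else:
    print("I cannot find this information in the documentation.")
```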

Practice Questions

  1. What problem does RAG solve that fine-tuning does not? RAG provides access to new or private data without retraining, allows source citation, and supports dynamic knowledge updates.

  2. Why is chunk overlap important in document splitting? Overlap ensures that context is not lost at chunk boundaries, so a concept split across two chunks is still captured in at least one complete chunk.

  3. How does semantic search differ from keyword search? Semantic search matches meaning using vector similarity, while keyword search matches exact terms. Semantic search handles synonyms and paraphrases.

  4. What is the role of the embedding model in a RAG pipeline? The embedding model converts text chunks and queries into numerical vectors that capture semantic meaning, enabling similarity comparison.

  5. Challenge: Build a multi-source RAG pipeline that indexes documents from a local folder, a web page (scraped), and a PDF file — all in the same vector collection with source metadata — and answers questions across all three sources.

Mini Project

Build a documentation chatbot for an open-source project. Scrape the project's documentation site, chunk the HTML content, generate embeddings with text-embedding-3-small, store them in Chroma, and build a Gradio or Streamlit interface where users can ask questions about the project. Include source citations in every answer and handle out-of-scope queries gracefully.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
