Build an AI Chatbot with RAG (Step-by-Step Guide)
Build an AI chatbot powered by Retrieval-Augmented Generation (RAG) using LangChain, OpenAI embeddings, and a Chroma vector database to answer questions from your own documents with source citations.
What You'll Build
You'll build a Python chatbot that reads PDF or text documents, chunks them into searchable pieces, stores them as vector embeddings in ChromaDB, and answers questions using OpenAI's GPT model. When a user asks a question, the system retrieves the most relevant document chunks and feeds them as context to the LLM. The result: answers grounded in your data, not generic training knowledge.
Why RAG Matters
A standard LLM knows only what it learned during training. Ask it about a private document, a recent product manual, or an internal company policy, and it either guesses or says it doesn't know. RAG solves this by retrieving relevant passages and adding them to the prompt at query time. It's like letting a student take an open-book exam instead of forcing them to memorize everything. At DodaTech, similar retrieval patterns power smart search across compressed archives in DodaZIP, letting users find files by content rather than filename alone.
Prerequisites
- Python 3.9+ installed
- An OpenAI API key (set as the OPENAI_API_KEY environment variable)
- Basic familiarity with large language models
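OpenAIEmbeddings and ChatOpenAI read the key from the OPENAI_API_KEY environment variable. A minimal fail-fast check you could paste at the top of ingest.py and query.py, so a missing key produces one clear message before any documents are loaded. require_api_key is a hypothetical helper, not part of LangChain:

```python
import os
import sys


def require_api_key(env=os.environ):
    """Return OPENAI_API_KEY from the environment, or exit with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # sys.exit with a string prints it to stderr and exits with status 1
        sys.exit("OPENAI_API_KEY is not set. Export it before running ingest.py or query.py.")
    return key
```

Passing a dictionary instead of os.environ makes the check easy to test without touching your real environment.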
Step 1: Project Setup
mkdir rag-chatbot
cd rag-chatbot
python -m venv venv
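source venv/bin/activate  # on Windows: venv\Scripts\activate
# query.py uses the legacy RetrievalQA chain, which ships with LangChain 0.x
pip install "langchain<1.0" langchain-community langchain-openai langchain-text-splitters chromadb pypdf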
Project structure:
rag-chatbot/
├── ingest.py # Load and chunk documents, store embeddings
├── query.py # Ask questions with RAG retrieval
└── documents/ # Place your PDFs or text files here
Step 2: Ingest Documents
The ingestion pipeline loads files, splits them into overlapping chunks, generates embedding vectors, and stores them in ChromaDB.
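# ingest.py
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load PDFs (PyPDFLoader yields one Document per page) and .txt files from documents/
pdf_loader = DirectoryLoader("documents", glob="**/*.pdf", loader_cls=PyPDFLoader)
txt_loader = DirectoryLoader("documents", glob="**/*.txt", loader_cls=TextLoader)
documents = pdf_loader.load() + txt_loader.load()

# Split into chunks of at most ~500 characters that overlap by 50, trying paragraph,
# line, sentence, and word boundaries first ("" splits anywhere as a last resort)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""],
)
chunks = text_splitter.split_documents(documents)

# Embed every chunk and store the vectors in ./chroma_db; with persist_directory set,
# Chroma 0.4+ writes to disk automatically, so no explicit persist() call is needed
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

print(f"Ingested {len(chunks)} chunks from {len(documents)} documents")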
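Expected output (your counts will differ; because PyPDFLoader creates one document per PDF page, the second number counts pages, not files):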
Ingested 87 chunks from 3 documents
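Why these settings? chunk_size=500 caps each chunk at about 500 characters (the splitter measures length with len() by default), which is roughly a paragraph, or around 125 tokens using the common estimate of 4 characters per English token. Small chunks keep retrieval precise and leave room in the prompt for several of them. The chunk_overlap=50 repeats up to 50 characters between neighboring chunks, so a sentence cut at a boundary still appears with some surrounding context. Overlap reduces the risk of splitting related information across chunks; it does not eliminate it.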
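To see what chunk_size and chunk_overlap do numerically, here is a simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that class first tries to break on the separators you pass and only falls back to smaller ones when a piece is still too long); it only illustrates the overlap arithmetic. chunk_text is a hypothetical helper:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows; neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reaches the end of the text
            break
    return chunks
```

For a 1,200-character string, chunk_text returns three windows of 500, 500, and 300 characters, and the last 50 characters of each window are the first 50 of the next.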
Step 3: Query the RAG System
Now build the retrieval-and-generation pipeline that answers questions using your ingested documents.
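# query.py
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Reopen the vector store written by ingest.py (same path, same embedding model)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top 4 chunks per question

# temperature=0 keeps answers as deterministic as the API allows
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# "stuff" inserts all retrieved chunks into a single prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

while True:
    question = input("\nAsk a question (or 'quit'): ").strip()
    if question.lower() == "quit":
        break
    if not question:
        continue
    result = qa_chain.invoke({"query": question})
    print(f"\nAnswer: {result['result']}")
    # List each source file once, in retrieval order
    sources = dict.fromkeys(doc.metadata.get("source", "unknown") for doc in result["source_documents"])
    print(f"\nSources: {', '.join(sources) or 'none found'}")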
Expected output:
Ask a question (or 'quit'): What is the refund policy?
Answer: According to section 3.1 of the policy document, refunds are processed within 5-7 business days for all products purchased directly through the company website.
Sources: documents/company-policy.pdf
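Conceptually, chain_type="stuff" pastes every retrieved chunk into one prompt next to the question, which is what lets the model answer from your documents. A simplified sketch of that idea; build_stuff_prompt is hypothetical and its wording is illustrative, not the exact prompt RetrievalQA sends:

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(chunks)  # chunks stay in retrieval order
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt("What is the refund policy?", ["Refunds take 5-7 business days."]))
```

Because all k chunks land in a single request, the prompt grows with k × chunk_size; that is the trade-off behind the "stuff" versus map_reduce choice discussed under Common Errors.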
Architecture
flowchart LR
A[PDF Documents] --> B[Document Loader]
B --> C[Text Splitter]
C --> D[Embedding Model]
D --> E[Chroma Vector Store]
F[User Question] --> G[Embedding]
G --> H[Similarity Search]
E --> H
H --> I[Relevant Chunks]
I --> J[LLM + Context]
F --> J
J --> K[Grounded Answer]
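The query half of the diagram (embed the question, run a similarity search, keep the most relevant chunks) can be sketched in plain Python. This is a toy under a loud assumption: word counts stand in for the OpenAI embedding model, so it only matches shared words, whereas real embeddings also match paraphrases such as "reimbursement" and "refund". The function names and sample chunks are hypothetical:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the question, score every chunk, and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Refunds are processed within 5-7 business days.",
    "Our office is closed on public holidays.",
    "Shipping is free for orders over 50 dollars.",
]
print(retrieve("How long do refunds take?", chunks, k=1))
```

A vector database like Chroma does the same ranking, but over precomputed dense vectors and with an index, so it does not have to score every chunk from scratch for each question.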
Common Errors
1. No module named 'langchain_community'
You installed LangChain but not the community extras. Run pip install langchain-community to get the document loaders and vector store integrations.
2. OpenAI rate limit errors
Accounts on lower usage tiers have low requests-per-minute and tokens-per-minute limits. Add a delay between queries, retry with backoff, or move to a higher usage tier. For testing, stick with gpt-4o-mini rather than gpt-4o: it's faster and cheaper.
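ChatOpenAI already retries some failed requests itself (see its max_retries parameter). If you still hit limits in the question loop, a common pattern is exponential backoff with jitter. A minimal sketch; with_backoff is a hypothetical helper, and which exception to retry on is your choice (with the openai 1.x SDK, openai.RateLimitError is the usual candidate):

```python
import random
import time


def with_backoff(call, max_attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call call(); after a retryable error, wait base_delay * 2**attempt plus jitter and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... plus up to base_delay of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

In query.py you might write result = with_backoff(lambda: qa_chain.invoke({"query": question}), retry_on=(openai.RateLimitError,)) after import openai. The sleep parameter exists so tests can record delays instead of actually waiting.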
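3. Empty or irrelevant retrieval results
If the retriever returns nothing at all, the collection is probably empty: confirm that ingest.py reported a non-zero chunk count and that your PDFs have selectable text (not scanned images). For scanned PDFs, add OCR with pytesseract before ingestion. If chunks come back but don't answer the question, your documents may simply not cover that topic.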
4. ChromaDB persistence warning
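If you see a "No such collection" error, or queries return nothing even though ingestion succeeded, the persist_directory used in query.py doesn't match the one used in ingest.py. A relative path like ./chroma_db resolves against the current working directory, so running the two scripts from different folders points them at different databases. Always use the same directory, or run ingestion and querying in one process and reuse the same Chroma object. On Chroma 0.4 and later, data is saved to disk automatically, so a deprecation warning from calling persist() is harmless.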
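One way to make the path independent of where you launch the scripts is to resolve it relative to the script file itself. A minimal sketch, assuming ingest.py and query.py sit in the same rag-chatbot/ folder; chroma_dir is a hypothetical helper:

```python
from pathlib import Path


def chroma_dir(script_file: str, name: str = "chroma_db") -> str:
    """Return an absolute path to the Chroma folder next to the given script file."""
    # resolve() makes the path absolute, so the current working directory no longer matters
    return str(Path(script_file).resolve().parent / name)
```

In both scripts you would then pass persist_directory=chroma_dir(__file__), so they always open the same database.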
5. Token limit exceeded with large chunks
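The "stuff" chain type pastes all k retrieved chunks into one prompt, so large chunks or a high k can exceed the model's context window. Reduce chunk_size (for example to 300), lower k, or switch to the map_reduce chain type, which sends each chunk to the LLM separately and then combines the partial answers.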
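A back-of-the-envelope check helps decide whether overflow is even plausible. This sketch assumes the rough rule of about 4 characters per English token and a guessed 200-token allowance for instructions and the question; both numbers are assumptions, and estimate_stuffed_tokens is a hypothetical helper (use a real tokenizer such as tiktoken for exact counts):

```python
def estimate_stuffed_tokens(k: int, chunk_size_chars: int,
                            chars_per_token: float = 4.0,
                            overhead_tokens: int = 200) -> int:
    """Rough token count of a 'stuff' prompt: k full chunks plus a fixed overhead."""
    context_tokens = k * chunk_size_chars / chars_per_token
    return round(context_tokens) + overhead_tokens


print(estimate_stuffed_tokens(4, 500))    # the tutorial's settings
print(estimate_stuffed_tokens(20, 2000))  # much larger k and chunks
```

With the tutorial's settings (k=4, chunk_size=500) the estimate is about 700 tokens, so overflow is unlikely unless you raise k or chunk_size substantially or switch to a model with a small context window.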
Practice Questions
1. What does RAG stand for and why is it useful? Retrieval-Augmented Generation. It combines document retrieval with text generation so the LLM answers based on your data instead of its training data alone.
2. Why do we split documents into chunks instead of sending the whole document? Documents can exceed the LLM's context window. Chunking also lets the retriever find the most relevant section instead of burying it in a massive document.
3. What is the role of embeddings in this system? Embeddings convert text into numeric vectors. Documents with similar meaning produce similar vectors, so the retriever can find the most relevant chunks by measuring vector distance.
4. Challenge: Add a relevance score
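Modify the query script to print the similarity score for each retrieved chunk. LangChain's Chroma wrapper offers similarity_search_with_score(), which returns (document, score) pairs in which the score is a distance: lower values mean higher relevance. Display the score next to each source.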
5. Challenge: Multi-document summarization
Write a script that retrieves all chunks across all documents on a broad topic (like "pricing") and asks the LLM to generate a summary table comparing pricing across products.
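For challenge 4, it helps to see why a lower distance means a more relevant chunk. Chroma's default distance is squared Euclidean (L2), and OpenAI embeddings are normalized to length 1. For unit vectors a and b, ||a - b||² = ||a||² + ||b||² - 2a·b = 2 - 2·cos(a, b), so sorting by smallest distance gives the same order as sorting by largest cosine similarity. This sketch checks the identity numerically on made-up 3-dimensional vectors, standing in for real embeddings; it does not call Chroma:

```python
import math


def normalize(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a = normalize([0.2, 0.9, 0.4])
b = normalize([0.1, 0.8, 0.6])
# The two numbers agree up to floating-point rounding
print(squared_l2(a, b), 2 - 2 * cosine_similarity(a, b))
```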
Next Steps
- Add a web UI with FastAPI and a chat interface
- Explore vector databases like Pinecone or Qdrant for production scale
- Build the PDF to Audio Converter project for another document processing use case
- Learn about evaluation metrics for RAG systems in the LLM evaluation guide
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro