Build a Mini Search Engine with Python (Step by Step)

DodaTech Updated 2026-06-21 8 min read

This tutorial walks through building a mini search engine in Python step by step, covering the key concepts, practical examples, and common pitfalls along the way.

Build a full-text search engine in Python from scratch using an inverted index with TF-IDF ranking that returns documents ranked by relevance to a natural-language query.

What You'll Build

You'll implement a search engine that indexes a collection of text documents, builds an inverted index mapping every word to the documents containing it, ranks results using TF-IDF (Term Frequency-Inverse Document Frequency), and accepts plain-text queries. By the end, you'll have a command-line search tool you can point at any folder of text files.

Why Build a Search Engine?

Search is everywhere — web search, code search, documentation search, log analysis. Understanding how inverted indexes and TF-IDF work gives you the foundation to build autocomplete systems, spam filters, plagiarism detectors, and recommendation engines. At DodaTech, similar relevance-ranking techniques help Durga Antivirus Pro prioritize suspicious file signatures by how closely they match known malware patterns.

Prerequisites

  • Python 3.10+ installed
  • Basic data structures knowledge (dictionaries, sets)
  • Familiarity with command-line tools

Step 1: Project Setup

mkdir search-engine && cd search-engine
python -m venv venv
source venv/bin/activate

Create this structure:

search-engine/
├── indexer.py     # Inverted index builder
├── ranker.py      # TF-IDF scoring
├── query.py       # Query parser and search
└── documents/     # Sample text files to index

Step 2: The Tokenizer and Text Processor

Before we can index documents, we need to split text into tokens — individual words cleaned of punctuation and converted to lowercase:

# indexer.py
import json
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were",
              "in", "on", "at", "to", "for", "of", "and", "or"}

def tokenize(text):
    text = text.lower()
    tokens = re.findall(r"\b[a-z0-9]+\b", text)
    return [t for t in tokens if t not in STOP_WORDS]

The re.findall(r"\b[a-z0-9]+\b", text) pattern matches whole words — sequences of letters and digits surrounded by word boundaries. We filter out common stop words because they carry little meaning but appear in almost every document.
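To see what the tokenizer keeps and drops, a quick check like the following can help. It repeats the tokenize() definition from above so that it runs on its own; the sample strings are made up for illustration.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were",
              "in", "on", "at", "to", "for", "of", "and", "or"}

def tokenize(text):
    text = text.lower()
    tokens = re.findall(r"\b[a-z0-9]+\b", text)
    return [t for t in tokens if t not in STOP_WORDS]

# Punctuation and hyphens split words; stop words are dropped
print(tokenize("The quick-brown Fox, in 2024!"))
# ['quick', 'brown', 'fox', '2024']

# An underscore counts as a word character, so there is no \b between
# "snake" and "_": identifiers like this produce no tokens at all
print(tokenize("snake_case"))
# []
```

If your documents contain identifiers with underscores, the second case is worth handling with a different pattern.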

Step 3: The Inverted Index

The inverted index is the heart of any search engine. Instead of storing "document -> words," we store "word -> list of documents containing it." Looking up a term then takes a single dictionary access rather than a scan of every document:

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(list)
        self.doc_count = 0
        self.doc_names = {}

    def add_document(self, doc_id, text, name=""):
        self.doc_count += 1
        self.doc_names[doc_id] = name
        tokens = tokenize(text)
        unique_terms = set(tokens)
        for term in unique_terms:
            self.index[term].append(doc_id)

    def search(self, term):
        return self.index.get(term, [])

    def save(self, path):
        with open(path, "w") as f:
            json.dump({
                "index": {k: v for k, v in self.index.items()},
                "doc_count": self.doc_count,
                "doc_names": self.doc_names,
            }, f)

    def load(self, path):
        with open(path) as f:
            data = json.load(f)
        self.index = defaultdict(list, data["index"])
        self.doc_count = data["doc_count"]
        # JSON object keys are always strings; restore the integer doc IDs
        self.doc_names = {int(k): v for k, v in data["doc_names"].items()}

Each document gets a unique integer ID. We add only the unique terms of each document to the index, so a document appears at most once in any term's postings list. How often the term occurs in the document is computed later by the ranker.
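As a minimal sketch of the same idea without the class, the snippet below builds a postings dictionary for two made-up documents. It splits on whitespace instead of calling tokenize(), purely to keep the example short.

```python
from collections import defaultdict

# Two hypothetical documents keyed by integer ID
docs = {
    0: "python makes search easy",
    1: "search engines rank documents",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    # set() ensures one posting per (term, document) pair
    for term in set(text.split()):
        index[term].append(doc_id)

print(index["search"])        # [0, 1]
print(index["python"])        # [0]
print(index.get("rust", []))  # []
```

Using .get() for the lookup, as InvertedIndex.search() does, avoids creating an empty entry for every unknown query term.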

Step 4: TF-IDF Ranking

TF-IDF gives higher scores to words that appear frequently in a document (TF) but rarely across the whole collection (IDF). This prevents common words like "python" from overpowering rare but meaningful terms:

# ranker.py
import math

from indexer import tokenize

class TfIdfRanker:
    def __init__(self, index):
        self.index = index
        self.N = index.doc_count

    def tf(self, term, doc_text):
        tokens = tokenize(doc_text)
        count = tokens.count(term)
        return count / len(tokens) if tokens else 0

    def idf(self, term):
        docs_with_term = len(self.index.search(term))
        if docs_with_term == 0:
            return 0
        return math.log(self.N / docs_with_term)

    def score(self, term, doc_text):
        return self.tf(term, doc_text) * self.idf(term)

The TF component normalizes by document length so a 1000-word document doesn't automatically beat a 100-word one. The IDF component uses a logarithmic scale — a term appearing in 10 out of 100 documents gets an IDF of about 2.3, while one appearing in 90 documents gets only about 0.1.
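The figures above can be reproduced with a few lines of arithmetic. The term counts and document lengths in the second half are invented for illustration.

```python
import math

N = 100  # documents in the collection

idf_rare = math.log(N / 10)    # term found in 10 of 100 documents
idf_common = math.log(N / 90)  # term found in 90 of 100 documents
print(round(idf_rare, 3))    # 2.303
print(round(idf_common, 3))  # 0.105

# TF is the share of a document's tokens taken up by the term
tf_short = 5 / 100    # 5 occurrences in a 100-token document
tf_long = 20 / 1000   # 20 occurrences in a 1000-token document

# The shorter document scores higher despite fewer raw occurrences
print(round(tf_short * idf_rare, 3))  # 0.115
print(round(tf_long * idf_rare, 3))   # 0.046
```

Note that math.log() is the natural logarithm. Using base 10 or base 2 would change the numbers but not the ranking order.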

Step 5: The Query Processor

A query like "python search engine" should match documents containing any of those terms, ranked by their combined TF-IDF score:

# query.py
import os
from collections import defaultdict

from indexer import InvertedIndex, tokenize
from ranker import TfIdfRanker

def search(query_text, index, ranker, doc_store):
    tokens = tokenize(query_text)
    scores = defaultdict(float)

    for token in tokens:
        matching_docs = index.search(token)
        for doc_id in matching_docs:
            doc_text = doc_store[doc_id]
            scores[doc_id] += ranker.score(token, doc_text)

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return ranked

We accumulate scores across all query terms. A document mentioning both "python" and "search" ranks higher than one mentioning only "python."
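The accumulation step can be seen in isolation with hard-coded numbers. The per-term scores below are hypothetical placeholders, not output from the ranker.

```python
from collections import defaultdict

# Hypothetical per-term scores: term -> {doc_id: TF-IDF score}
term_scores = {
    "python": {0: 0.05, 2: 0.04},
    "search": {1: 0.03, 2: 0.04},
}

scores = defaultdict(float)
for term in ["python", "search"]:
    for doc_id, s in term_scores.get(term, {}).items():
        scores[doc_id] += s  # documents matching several terms add up

ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
print([doc_id for doc_id, _ in ranked])  # [2, 0, 1]
```

Document 2 matches both terms, so its two smaller scores sum to more than either single-term match.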

Step 6: Building Index from a Directory

Now we need a function to scan a folder, read each file, and add it to the index:

def build_index_from_directory(directory):
    idx = InvertedIndex()
    doc_store = {}
    doc_id = 0

    for fname in sorted(os.listdir(directory)):
        fpath = os.path.join(directory, fname)
        if not os.path.isfile(fpath):
            continue
        with open(fpath, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        idx.add_document(doc_id, text, name=fname)
        doc_store[doc_id] = text
        doc_id += 1

    return idx, doc_store

if __name__ == "__main__":
    idx, store = build_index_from_directory("documents")
    ranker = TfIdfRanker(idx)
    idx.save("index.json")

    while True:
        q = input("Search: ").strip()
        if not q:
            break
        results = search(q, idx, ranker, store)
        for doc_id, score in results[:5]:
            print(f"  [{score:.3f}] {idx.doc_names[doc_id]}")

Step 7: Creating Sample Documents and Testing

Create documents/doc1.txt:

Python is a versatile programming language used for web development and data science.

Create documents/doc2.txt:

Search engines use inverted indexes to quickly find documents matching a user query.

Create documents/doc3.txt:

Python provides excellent libraries for building search engines and data analysis tools.

Run the engine:

python query.py

Expected output:

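Search: python search engine
  [0.081] doc3.txt
  [0.045] doc1.txt
  [0.037] doc2.txt

The query term "engine" contributes nothing here, because the documents contain "engines" and the tokenizer does no stemming.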
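To check the ranking without creating any files, the following self-contained sketch scores the three sample documents held as in-memory strings. It mirrors the tutorial's tokenizer and TF-IDF formula but precomputes each document's tokens once, which is a simplification of the class-based code above.

```python
import math
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were",
              "in", "on", "at", "to", "for", "of", "and", "or"}

def tokenize(text):
    tokens = re.findall(r"\b[a-z0-9]+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = {
    "doc1.txt": "Python is a versatile programming language used for web development and data science.",
    "doc2.txt": "Search engines use inverted indexes to quickly find documents matching a user query.",
    "doc3.txt": "Python provides excellent libraries for building search engines and data analysis tools.",
}

# Tokenize once per document, then build term -> set of document names
tokens_by_doc = {name: tokenize(text) for name, text in docs.items()}
postings = defaultdict(set)
for name, toks in tokens_by_doc.items():
    for t in toks:
        postings[t].add(name)

def rank(query):
    scores = defaultdict(float)
    for term in tokenize(query):
        matching = postings.get(term, set())
        if not matching:
            continue  # e.g. "engine" is absent; only "engines" was indexed
        idf = math.log(len(docs) / len(matching))
        for name in matching:
            toks = tokens_by_doc[name]
            scores[name] += toks.count(term) / len(toks) * idf
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

ranked = rank("python search engine")
for name, score in ranked:
    print(f"  [{score:.3f}] {name}")
# Order: doc3.txt, doc1.txt, doc2.txt
```

The exact scores depend on the stop-word list and on the natural-log IDF used here, so rerun the check if you change either.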

Architecture

flowchart LR
    A[Text Documents] --> B[Tokenizer]
    B --> C[Inverted Index]
    B --> D[Document Store]
    C --> E[TF-IDF Ranker]
    E --> F[Scored Results]
    G[User Query] --> H[Query Parser]
    H --> E
    H --> C
    F --> I[Ranked Results]

Common Errors

1. Case mismatch between indexed terms and query terms. If you lowercase during indexing but forget to lowercase the query, "Python" won't match "python." Always apply the same tokenize() function to both documents and queries.

2. Zero IDF for common terms. A term appearing in every document has an IDF of log(N/N) = 0, which zeroes out its TF-IDF score entirely. This is correct behavior: if a word is in every document, it is useless for distinguishing relevance.

3. Division by zero in TF calculation. If a document has zero tokens after stop-word removal, len(tokens) is 0. Check for this before dividing. Our code handles it with if tokens else 0.

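4. JSON serialization of the index. json.dump() accepts a defaultdict because it is a dict subclass, so the {k: v for k, v in self.index.items()} copy in save() is a precaution rather than a requirement. The real pitfall is that JSON object keys are always strings: the integer keys of doc_names come back as "0", "1", and so on, while the document IDs inside the postings lists stay integers. Convert the keys back with int() when loading, or idx.doc_names[doc_id] will raise a KeyError.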

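5. Memory usage grows with corpus size. The inverted index and document store both live in RAM. For large collections, store the index in a database like SQLite or use disk-backed structures. Our approach works well for up to tens of thousands of documents, although the ranker re-tokenizes each matching document on every query, so caching token counts is a sensible first optimization.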
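One serialization detail is easy to verify directly: integer dictionary keys do not survive a JSON round trip. The doc_names mapping below is a made-up example in the same shape as the one the index stores.

```python
import json

doc_names = {0: "doc1.txt", 1: "doc2.txt"}

# Serialize and parse again, as save() followed by load() would
restored = json.loads(json.dumps(doc_names))
print(restored)       # {'0': 'doc1.txt', '1': 'doc2.txt'}
print(0 in restored)  # False: the key is now the string '0'

# Converting keys back to int restores the original mapping
fixed = {int(k): v for k, v in restored.items()}
print(fixed == doc_names)  # True
```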

Practice Questions

1. What is an inverted index and why is it faster than scanning every document? An inverted index maps each unique term to a list of document IDs containing it. Instead of scanning N documents for each query term, you perform one dictionary lookup per term (O(1) on average) and then touch only the documents in that term's postings list.

2. How does IDF prevent common words from dominating results? IDF applies a logarithmic penalty to terms that appear in many documents. A word like "the" might appear in 99% of documents, giving it a near-zero IDF, so it contributes almost nothing to the final score.

3. Why do we remove stop words before indexing? Stop words are high-frequency, low-meaning terms that appear in almost every document. Keeping them bloats the index size and adds noise to relevance scoring without improving result quality.

4. Challenge: Implement a boolean query parser. Add support for AND and OR operators. A query like "python AND search" should return only documents containing both terms. Use set intersections for AND and set unions for OR.

5. Challenge: Add positional indexing for phrase queries. Instead of storing just document IDs, store (doc_id, position) pairs for each term occurrence. This allows phrase searches like "hello world" where terms must appear adjacent and in order.
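As a possible starting point for the boolean challenge, here is a sketch that evaluates AND and OR strictly left to right over a plain term-to-IDs dictionary. The function name boolean_search and the toy index are assumptions for illustration; operator precedence, parentheses, and stop-word handling are left out.

```python
def boolean_search(query, index):
    """Evaluate 'term AND term OR term' strictly left to right."""
    parts = query.split()
    if not parts:
        return []
    result = set(index.get(parts[0].lower(), []))
    # The remaining parts come in (operator, term) pairs
    for op, term in zip(parts[1::2], parts[2::2]):
        postings = set(index.get(term.lower(), []))
        if op == "AND":
            result &= postings   # set intersection
        elif op == "OR":
            result |= postings   # set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

# Toy index: term -> list of document IDs
index = {"python": [0, 2], "search": [1, 2], "data": [0, 2]}

print(boolean_search("python AND search", index))  # [2]
print(boolean_search("python OR search", index))   # [0, 1, 2]
print(boolean_search("python AND rust", index))    # []
```

Operators must be written in uppercase here, which keeps them distinct from the ordinary query words "and" and "or".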

FAQ

What is TF-IDF in simple terms?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document within a larger collection. A word gets a high TF-IDF score if it appears many times in a specific document but rarely in other documents.

How does Google improve on basic TF-IDF?

Google uses PageRank (link analysis), click-through data, personalization, and machine learning models like BERT for semantic understanding. Term weighting in the style of TF-IDF is a starting point, but modern search engines layer hundreds of signals on top.

Can I use this engine for searching code files?

Yes. Add a code-aware tokenizer that preserves underscores, dots, and capitalization for identifiers. You might also index symbol names and file contents separately, as IDEs do.
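One way such a tokenizer could look is sketched below. The name tokenize_code and the pattern are assumptions, not part of the tutorial's code: the pattern keeps dotted identifiers together, preserves case and underscores, and ignores bare numbers and operators.

```python
import re

# An identifier, optionally followed by ".identifier" segments
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*")

def tokenize_code(text):
    # No lowercasing: capitalization is kept so class names stay distinct
    return IDENTIFIER.findall(text)

print(tokenize_code("self.doc_count += 1  # InvertedIndex.add_document"))
# ['self.doc_count', 'InvertedIndex.add_document']
```

Whether to also index the individual segments (for example doc_count on its own) depends on how you expect people to search.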

Next Steps

  • Add stemming with NLTK so "running" and "run" match the same index entry
  • Store the index in SQLite for persistence across restarts without full JSON reloads
  • Combine this search engine with the Static Site Generator to add site search
  • Explore Elasticsearch to see how production search engines scale to billions of documents
