Text Embeddings: From Word2Vec to Modern Embedding Models

DodaTech Updated 2026-06-22 7 min read

This tutorial traces text embeddings from Word2Vec to modern embedding models, covering the key concepts, practical examples, and best practices you need to apply them.

Text embeddings convert words, sentences, and documents into numerical vectors that capture semantic meaning, enabling machines to understand similarity, analogy, and context in natural language.

What You'll Learn

You'll learn how text embeddings evolved from Word2Vec and GloVe to modern models like sentence-transformers and OpenAI embeddings, and how to use them for semantic search, text clustering, and similarity analysis with Python.

Why It Matters

Embeddings are the foundation of modern NLP. They transform text into fixed-size vectors that can be compared, searched, and used as features for downstream models. Semantic search, recommendation systems, clustering, and RAG pipelines all depend on high-quality embeddings. Understanding how embeddings work and which model to use is essential for any NLP practitioner.

Real-World Use

Durga Antivirus Pro uses text embeddings to cluster malware descriptions and threat intelligence reports. When a new threat is described, its embedding is compared against known threat embeddings. Similar threats are grouped, enabling analysts to identify patterns and respond faster to emerging attack campaigns.

Word2Vec

Word2Vec learns word embeddings by predicting context words from a target word (Skip-gram) or predicting a target word from context (CBOW). The resulting vectors capture semantic relationships: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). Word2Vec produces one vector per word, so it cannot handle out-of-vocabulary words or polysemy (words with multiple meanings).

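import numpy as np

class SimpleWord2Vec:
    """Toy embedding table. The vectors are random and untrained,
    so the similarities it reports carry no semantic meaning."""

    def __init__(self, vocab_size=10, embedding_dim=5):
        self.embeddings = np.random.randn(vocab_size, embedding_dim)

    def most_similar(self, word_idx, top_k=3):
        query = self.embeddings[word_idx]
        # Cosine similarity between the query and every row of the table.
        similarities = np.dot(self.embeddings, query) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query)
        )
        # Take top_k + 1 candidates because the query word always matches itself.
        best = np.argsort(similarities)[-top_k-1:][::-1]
        return [(idx, similarities[idx]) for idx in best if idx != word_idx]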

vocab = ['king', 'queen', 'man', 'woman', 'prince', 'princess', 'boy', 'girl', 'ruler', 'royal']
w2v = SimpleWord2Vec(len(vocab), 10)

king_idx = vocab.index('king')
similar = w2v.most_similar(king_idx, top_k=3)

print(f"Words similar to 'king':")
for idx, sim in similar:
    print(f"  {vocab[idx]}: {sim:.4f}")

print(f"\nEmbedding dimension: {w2v.embeddings.shape[1]}")
print(f"Vocabulary size: {len(vocab)}")

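Example output (the vectors are random and untrained, so the words and scores will differ on every run; a trained model is needed for meaningful neighbors):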

Words similar to 'king':
  queen: 0.4521
  ruler: 0.3876
  prince: 0.3124

Embedding dimension: 10
Vocabulary size: 10
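
To make the Skip-gram objective described above more concrete, the following sketch generates the (target, context) pairs that a Skip-gram model would be trained on. The function name `skipgram_pairs`, the window size, and the sample sentence are illustrative assumptions. Real Word2Vec training adds steps such as subsampling and negative sampling, which are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the king rules the land".split()
pairs = skipgram_pairs(tokens, window=1)

for target, context in pairs:
    print(f"{target} -> {context}")
```

CBOW uses the same sliding windows but flips the direction: the context words in each window are combined to predict the word in the middle.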

Sentence Transformers

Sentence Transformers (SBERT) produce dense vector embeddings for entire sentences, not individual words. Built on transformer models like BERT and RoBERTa, SBERT uses siamese networks to produce semantically meaningful sentence embeddings. Unlike the raw [CLS] token output of a plain BERT model, SBERT models are specifically trained so that their sentence vectors can be compared directly, typically with contrastive or other pairwise objectives. The best model depends on your task and language.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Vector databases store embeddings for fast similarity search.",
    "Pinecone is a managed vector database service for AI applications.",
    "The weather is sunny and warm today.",
    "Weaviate is an open-source vector database with built-in modules.",
]

embeddings = model.encode(sentences)

similarity_matrix = np.dot(embeddings, embeddings.T) / (
    np.linalg.norm(embeddings, axis=1, keepdims=True) * np.linalg.norm(embeddings, axis=1, keepdims=True).T
)

print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"Number of sentences: {len(sentences)}")
print(f"\nSimilarity between sentence 0 and 1: {similarity_matrix[0][1]:.4f}")
print(f"Similarity between sentence 0 and 2: {similarity_matrix[0][2]:.4f}")
print(f"Similarity between sentence 1 and 3: {similarity_matrix[1][3]:.4f}")

Example output (the similarity values are illustrative; the exact numbers you get will differ):

Embedding dimension: 384
Number of sentences: 4

Similarity between sentence 0 and 1: 0.7345
Similarity between sentence 0 and 2: 0.1234
Similarity between sentence 1 and 3: 0.6789
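
The same similarity computation is the core of semantic search: embed the query, score it against every document vector, and keep the best matches. The sketch below shows only that ranking step. It uses hand-written 3-dimensional vectors in place of real model output so it runs without downloading a model; `top_k_similar` and the toy vectors are hypothetical names and values, and in practice the vectors would come from `model.encode(...)`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, doc_vecs, k=2):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy stand-ins for real sentence embeddings (assumed values, not model output).
doc_vecs = [
    [0.9, 0.1, 0.0],  # a "vector database" style document
    [0.8, 0.3, 0.1],  # another database document
    [0.0, 0.2, 0.9],  # an unrelated "weather" document
]
query_vec = [1.0, 0.2, 0.0]

results = top_k_similar(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(f"doc {idx}: {score:.4f}")
```

Scoring every document like this is fine for small collections. For larger ones, a vector database or approximate nearest-neighbor index replaces the linear scan.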

OpenAI Embeddings

OpenAI's embedding API provides high-quality embeddings through a simple API call. By default, the text-embedding-3-small model produces 1536-dimensional embeddings at low cost, while text-embedding-3-large produces 3072 dimensions for higher accuracy. Both accept a dimensions parameter that shortens the returned vectors, trading a small amount of quality for lower storage and faster search. They support many languages and require no local model storage.

from openai import OpenAI
import numpy as np

# OpenAI() reads the key from the OPENAI_API_KEY environment variable,
# which keeps the secret out of source code.
client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

texts = [
    "Semantic search finds results by meaning",
    "Keyword search matches exact terms",
    "Neural networks learn from data",
]

embeddings = [get_embedding(t) for t in texts]
embedding_matrix = np.array(embeddings)

normed = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
sims = np.dot(normed, normed.T)

print(f"Embedding dimension: {embedding_matrix.shape[1]}")
print(f"Number of texts: {len(texts)}")
print(f"\nSimilarity matrix:")
for i in range(len(texts)):
    for j in range(len(texts)):
        if i < j:
            print(f"  [{i}]-[{j}]: {sims[i][j]:.4f}")

Example output (the similarity values are illustrative; the exact numbers you get will differ):

Embedding dimension: 1536
Number of texts: 3

Similarity matrix:
  [0]-[1]: 0.3456
  [0]-[2]: 0.4567
  [1]-[2]: 0.2345
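
The dimensions parameter mentioned in this section is passed to `client.embeddings.create(...)` alongside `input` and `model`. When vectors have already been stored at full size, OpenAI's documentation describes an offline alternative for these models: keep the leading values and re-normalize to unit length. The sketch below shows only that offline step, with a short made-up vector standing in for a real embedding. `shorten_embedding` is a hypothetical helper name, and the quality impact of any target size should be measured on your own data.

```python
import math

def shorten_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in cut]

# Made-up 6-dimensional vector standing in for a real 1536-dimensional embedding.
full = [0.3, -0.4, 0.1, 0.2, -0.1, 0.5]
short = shorten_embedding(full, 2)

print(short)
print(math.sqrt(sum(x * x for x in short)))  # length of the shortened vector
```

Re-normalizing matters because truncation changes the vector's length, and a plain dot product only matches cosine similarity for unit-length vectors.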

Embedding Visualization

Embedding dimensions capture semantic features. PCA or t-SNE can project high-dimensional embeddings into 2D for visualization, revealing clusters of semantically related texts. This is useful for exploratory analysis, quality checking, and debugging embedding quality.

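from sklearn.decomposition import PCA

# Re-encode the Sentence Transformers example: the OpenAI section above
# rebound `embeddings` to a plain Python list for a different set of texts.
embeddings = model.encode(sentences)

pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Labels for coloring a scatter plot of `reduced` (not used in the printout below).
categories = ["vector_db", "vector_db", "unrelated", "vector_db"]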
print(f"Original dimension: {embeddings.shape[1]}")
print(f"Reduced dimension: {reduced.shape[1]}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_[:2].round(4)}")
print(f"\n2D projection (first 3):")
for i in range(3):
    print(f"  {sentences[i][:30]:30s} -> ({reduced[i][0]:.3f}, {reduced[i][1]:.3f})")

Example output (the variance ratios and coordinates are illustrative; the exact numbers you get will differ):

Original dimension: 384
Reduced dimension: 2
Explained variance ratio: [0.4567 0.2345]

2D projection (first 3):
  Vector databases store embeddi -> (0.345, -0.123)
  Pinecone is a managed vector d -> (0.567, -0.234)
  The weather is sunny and warm  -> (-0.789, 0.456)

Embedding Evolution

flowchart LR
  A[Word2Vec 2013] --> B[GloVe 2014]
  B --> C[FastText 2016]
  C --> D[BERT 2018]
  D --> E[Sentence Transformers 2019]
  E --> F[OpenAI Embeddings 2022]
  F --> G[Modern Embedding Models 2024+]
  A --> H[Static word vectors]
  D --> I[Contextual word vectors]
  E --> J[Dense sentence vectors]
  F --> K[API-based embeddings]

Common Errors and Mistakes

Mistake	Why It Happens	How to Fix
Comparing unnormalized embeddings with a dot product	A dot product equals cosine similarity only for unit-length vectors	Normalize embeddings first, or use a cosine similarity that divides by the norms
Mixing embedding models	Different models produce incompatible vector spaces	Always use the same model for indexing and querying
Ignoring input length	Models truncate input past their maximum length, and long texts blur into one averaged-out vector	Chunk long documents before embedding
Wrong preprocessing	Special characters and leftover markup affect quality	Clean text consistently, applying the same steps to documents and queries
Not batching API calls	One request per text makes embedding generation slow	Send texts in batches (for example 100-1000 per request, within the provider's limits)
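
Two of the fixes in the table, chunking and batching, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `chunk_words`, `embed_in_batches`, and `fake_embed` are hypothetical names, chunks are cut on whitespace-separated words instead of model tokens, and `fake_embed` stands in for a real model or API call that accepts a list of texts.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into word windows of at most max_words, overlapping by `overlap`."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn once per batch and return one vector per input text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors

# Hypothetical stand-in for a real embedding call: one 1-D "vector" per text.
calls = []
def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(t.split()))] for t in batch]

document = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(document, max_words=200, overlap=20)
vectors = embed_in_batches(chunks, fake_embed, batch_size=2)

print(len(chunks), calls)
```

Both libraries used in this tutorial already accept lists: `model.encode(...)` takes a list of sentences, and the OpenAI `input` argument takes a list of strings, so `embed_fn` can be a thin wrapper around either.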

Practice Questions

What is the difference between Word2Vec and Sentence Transformer embeddings?

Answer: Word2Vec produces one vector per word, static regardless of context. Sentence Transformers produce one vector per sentence, capturing full sentence meaning. Word2Vec cannot handle polysemy; SBERT handles context-dependent meaning.

How is cosine similarity used with embeddings?

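Answer: Cosine similarity measures the angle between two embedding vectors. Values range from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction), with 0 meaning the vectors are orthogonal. Text embeddings typically produce positive similarity values. Scores of 0.7 or higher often indicate high semantic similarity, although useful thresholds differ from model to model.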
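
A small worked example may help here. The sketch below computes cosine similarity by hand for 2-dimensional toy vectors chosen so the arithmetic is easy to follow; the vectors are made up and are not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [3.0, 4.0]
same_direction = [6.0, 8.0]   # a scaled by 2: length differs, angle does not
orthogonal = [-4.0, 3.0]      # at a right angle to a
opposite = [-3.0, -4.0]       # a reversed

print(cosine_similarity(a, same_direction))  # 1.0
print(cosine_similarity(a, orthogonal))      # 0.0
print(cosine_similarity(a, opposite))        # -1.0
```

The first result shows why cosine similarity ignores vector length: doubling a vector leaves the score at 1.0.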

Why do modern embedding models outperform Word2Vec?

Answer: Modern models use transformer architectures with attention mechanisms that capture context-dependent meaning. Word2Vec produces static embeddings where "bank" has the same vector for river bank and financial bank. Transformers produce different vectors based on surrounding words.

What is the purpose of the embedding dimension (384, 768, 1536)?

Answer: The dimension determines how much information each embedding can store. Higher dimensions capture more nuanced meaning but require more storage and compute. Models like all-MiniLM-L6-v2 use 384 dimensions for efficiency; OpenAI's large model uses 3072 for maximum accuracy.
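
The storage side of this trade-off is simple arithmetic. The sketch below estimates the raw size of one million vectors at each dimension, assuming 4-byte float32 values and ignoring any index overhead; `index_size_gb` is a hypothetical helper name.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors in decimal gigabytes (no index overhead)."""
    return num_vectors * dim * bytes_per_value / 1e9

for dim in (384, 768, 1536, 3072):
    size = index_size_gb(1_000_000, dim)
    print(f"{dim:>4} dims: {size:.3f} GB per million vectors")
```

Under these assumptions, moving from 384 to 3072 dimensions multiplies storage by eight, from roughly 1.5 GB to roughly 12.3 GB per million vectors.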

How would you choose between Sentence Transformers and OpenAI embeddings?

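Answer: Sentence Transformers run locally (no per-call cost, data stays private, no network latency), but you supply the compute. OpenAI embeddings are accessed through API calls (usage cost, network latency) and need no local model. Relative quality depends on the task, so compare both on your own data. Sentence Transformers suit development and high-volume offline processing; OpenAI suits production use when its quality is measurably better for your task and budget allows.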

Challenge

Build a semantic search system over a collection of 100 technical documents. Use Sentence Transformers to embed all documents. Implement a search function that takes a query, embeds it, finds the top-5 most similar documents using cosine similarity, and returns them with similarity scores. Compare results using all-MiniLM-L6-v2, all-mpnet-base-v2, and OpenAI embeddings. Measure search quality by having 3 people rate the relevance of results.

Real-World Task

Design a document deduplication system for a knowledge base of 1 million articles. Embed all articles using a sentence transformer. For each new article, compute its embedding and find the most similar existing articles above a similarity threshold. Flag duplicates for manual review. Implement incremental indexing so new articles are embedded and compared in real time.
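
As a starting point for this task, the sketch below shows the compare-then-store loop on toy data. It is a brute-force illustration, not a design for one million articles: `add_article`, the 0.9 threshold, and the 3-dimensional vectors are assumptions, and a real system would use a sentence transformer for the vectors and an approximate nearest-neighbor index in place of the linear scan.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def add_article(doc_id, vec, index, threshold=0.9):
    """Compare a new article against the index, then store it.

    Returns the (doc_id, score) pairs at or above the threshold, best first.
    """
    unit = normalize(vec)
    matches = []
    for other_id, other in index.items():
        score = sum(a * b for a, b in zip(unit, other))
        if score >= threshold:
            matches.append((other_id, score))
    index[doc_id] = unit  # incremental indexing: searchable for the next article
    return sorted(matches, key=lambda m: m[1], reverse=True)

index = {}  # doc_id -> unit vector
# Toy vectors standing in for real article embeddings.
first = add_article("a1", [1.0, 0.0, 0.0], index)
second = add_article("a2", [0.99, 0.10, 0.0], index)  # near-duplicate of a1
third = add_article("a3", [0.0, 1.0, 0.0], index)     # unrelated article

print(first)
print(second)
print(third)
```

Articles returned in `matches` would be the ones flagged for manual review. The threshold needs tuning against labeled duplicate pairs, since a suitable value depends on the model.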

Next Steps

Use embeddings in a RAG system for grounded LLM responses. Store embeddings in Pinecone or Chroma for scalable similarity search. Combine with LangChain for complete NLP pipelines.

What is the difference between sparse and dense embeddings?

Sparse representations (like TF-IDF and BM25 term weights) are high-dimensional vectors with mostly zeros and one dimension per vocabulary term, so they reflect exact term matches. Dense embeddings (like those from Sentence Transformers) are lower-dimensional vectors whose values are mostly non-zero and represent semantic meaning. Dense embeddings capture synonyms and concepts; sparse representations capture exact keyword matches.
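
The keyword-matching behavior of sparse vectors can be seen with plain term counts. The sketch below is a simplified bag-of-words illustration without TF-IDF or BM25 weighting; `sparse_vector` and `sparse_overlap` are hypothetical helper names.

```python
from collections import Counter

def sparse_vector(text):
    """Bag-of-words term counts; every word not in the text is an implicit zero."""
    return Counter(text.lower().split())

def sparse_overlap(a, b):
    """Dot product of two sparse count vectors (Counter returns 0 for missing terms)."""
    return sum(count * b[term] for term, count in a.items())

query = sparse_vector("cheap car")
exact_match = sparse_vector("cheap car for sale")
synonyms_only = sparse_vector("inexpensive automobile for sale")

print(sparse_overlap(query, exact_match))    # 2
print(sparse_overlap(query, synonyms_only))  # 0
```

The second document scores zero because it shares no literal terms with the query, even though it means nearly the same thing. That gap is what dense embeddings are meant to close, and it is also why many search systems combine both kinds of score.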

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
