Text Embeddings: From Word2Vec to Modern Embedding Models
This tutorial covers the key concepts, practical examples, and best practices you need to understand text embeddings and apply them effectively.
Text embeddings convert words, sentences, and documents into numerical vectors that capture semantic meaning, enabling machines to understand similarity, analogy, and context in natural language.
What You'll Learn
In this tutorial, you'll learn how text embeddings evolved from Word2Vec and GloVe to modern models like sentence-transformers and OpenAI embeddings, and how to use them for semantic search, text clustering, and similarity analysis with Python.
Why It Matters
Embeddings are the foundation of modern NLP. They transform text into fixed-size vectors that can be compared, searched, and used as features for downstream models. Semantic search, recommendation systems, clustering, and RAG pipelines all depend on high-quality embeddings. Understanding how embeddings work and which model to use is essential for any NLP practitioner.
Real-World Use
Durga Antivirus Pro uses text embeddings to cluster malware descriptions and threat intelligence reports. When a new threat is described, its embedding is compared against known threat embeddings. Similar threats are grouped, enabling analysts to identify patterns and respond faster to emerging attack campaigns.
Word2Vec
Word2Vec learns word embeddings by predicting context words from a target word (Skip-gram) or predicting a target word from context (CBOW). The resulting vectors capture semantic relationships: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). Word2Vec produces one vector per word, so it cannot handle out-of-vocabulary words or polysemy (words with multiple meanings).
import numpy as np

class SimpleWord2Vec:
    """Toy lookup table of random, untrained vectors (not a trained Word2Vec model)."""

    def __init__(self, vocab_size=10, embedding_dim=5):
        # Random initialization only; a real model would learn these values
        self.embeddings = np.random.randn(vocab_size, embedding_dim)

    def most_similar(self, word_idx, top_k=3):
        query = self.embeddings[word_idx]
        # Cosine similarity between the query vector and every row
        similarities = np.dot(self.embeddings, query) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query)
        )
        # Take top_k + 1 candidates because the query word itself ranks first
        best = np.argsort(similarities)[-top_k-1:][::-1]
        return [(idx, similarities[idx]) for idx in best if idx != word_idx]

vocab = ['king', 'queen', 'man', 'woman', 'prince', 'princess', 'boy', 'girl', 'ruler', 'royal']
w2v = SimpleWord2Vec(len(vocab), 10)
king_idx = vocab.index('king')
similar = w2v.most_similar(king_idx, top_k=3)
print("Words similar to 'king':")
for idx, sim in similar:
    print(f"  {vocab[idx]}: {sim:.4f}")
print(f"\nEmbedding dimension: {w2v.embeddings.shape[1]}")
print(f"Vocabulary size: {len(vocab)}")
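Example output (the vectors are random and untrained, so the neighbors and scores you see will differ from run to run and carry no semantic meaning; a trained model is needed for results like these):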
Words similar to 'king':
queen: 0.4521
ruler: 0.3876
prince: 0.3124
Embedding dimension: 10
Vocabulary size: 10
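The class above only looks up vectors; it does not train them. As a rough illustration of the Skip-gram setup described in this section, the sketch below generates the (target, context) pairs that such a model would be trained to predict. The function name `skipgram_pairs` and the `window` parameter are illustrative choices, not part of any library, and the actual training step (the neural network and its loss) is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the king rules the land".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(f"{target} -> {context}")
```

With `window=1`, each word is paired only with its immediate neighbors. CBOW uses the same windows in the opposite direction: the context words are the input and the target word is the prediction.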
Sentence Transformers
Sentence Transformers (SBERT) produce dense vector embeddings for entire sentences, not individual words. Built on transformer models like BERT and RoBERTa, SBERT uses siamese networks to produce semantically meaningful sentence embeddings. The raw [CLS] token output of a plain BERT model is not trained for sentence comparison, whereas SBERT models are fine-tuned, typically with contrastive objectives, so that similar sentences end up with nearby vectors. The best model depends on your task and language.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Vector databases store embeddings for fast similarity search.",
    "Pinecone is a managed vector database service for AI applications.",
    "The weather is sunny and warm today.",
    "Weaviate is an open-source vector database with built-in modules.",
]
embeddings = model.encode(sentences)  # array of shape (4, 384)

# Pairwise cosine similarity: dot products divided by the product of the norms
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = np.dot(embeddings, embeddings.T) / (norms * norms.T)

print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"Number of sentences: {len(sentences)}")
print(f"\nSimilarity between sentence 0 and 1: {similarity_matrix[0][1]:.4f}")
print(f"Similarity between sentence 0 and 2: {similarity_matrix[0][2]:.4f}")
print(f"Similarity between sentence 1 and 3: {similarity_matrix[1][3]:.4f}")
Example output (the similarity scores are illustrative; exact values depend on the model version):
Embedding dimension: 384
Number of sentences: 4
Similarity between sentence 0 and 1: 0.7345
Similarity between sentence 0 and 2: 0.1234
Similarity between sentence 1 and 3: 0.6789
OpenAI Embeddings
OpenAI's embedding API provides high-quality embeddings through a simple API call. The text-embedding-3-small model produces 1536-dimensional embeddings at low cost, while text-embedding-3-large produces 3072 dimensions for higher accuracy. Both models accept a dimensions parameter that shortens the vectors while retaining most of their quality. The models are multilingual and require no local model storage, but every request sends your text to a remote service and incurs cost and network latency.
from openai import OpenAI
import numpy as np

# Reads the API key from the OPENAI_API_KEY environment variable;
# avoid hard-coding keys in source files
client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

texts = [
    "Semantic search finds results by meaning",
    "Keyword search matches exact terms",
    "Neural networks learn from data",
]
embeddings = [get_embedding(t) for t in texts]
embedding_matrix = np.array(embeddings)

# Normalize rows to unit length so the dot product equals cosine similarity
normed = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
sims = np.dot(normed, normed.T)

print(f"Embedding dimension: {embedding_matrix.shape[1]}")
print(f"Number of texts: {len(texts)}")
print("\nSimilarity matrix:")
for i in range(len(texts)):
    for j in range(len(texts)):
        if i < j:  # print each unordered pair once
            print(f"  [{i}]-[{j}]: {sims[i][j]:.4f}")
Example output (the similarity scores are illustrative; actual values depend on the model version):
Embedding dimension: 1536
Number of texts: 3
Similarity matrix:
[0]-[1]: 0.3456
[0]-[2]: 0.4567
[1]-[2]: 0.2345
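The example above makes one API request per text. Since the embeddings endpoint takes a list as `input`, several texts can share a request. The sketch below shows one way to group texts into batches. The names `batched`, `embed_in_batches`, and `fake_embed_batch` are hypothetical helpers, and the fake embedder stands in for the real API call so the snippet runs without a key or network access.

```python
from collections.abc import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Call `embed_batch` once per batch and concatenate results in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Hypothetical stand-in for a real API call. With the OpenAI client, this
# function would instead call client.embeddings.create(input=list(batch), model=...)
# and return [item.embedding for item in response.data].
def fake_embed_batch(batch: Sequence[str]) -> list[list[float]]:
    return [[float(len(text)), float(text.count(" "))] for text in batch]


call_sizes = []  # records how many texts each "request" carried


def counting_embed(batch: Sequence[str]) -> list[list[float]]:
    call_sizes.append(len(batch))
    return fake_embed_batch(batch)


texts = [f"document number {i}" for i in range(250)]
vectors = embed_in_batches(texts, counting_embed, batch_size=100)
print(len(vectors), call_sizes)  # 250 [100, 100, 50]
```

Passing the embedding function as an argument keeps the batching logic independent of any one provider, which also makes it easy to test offline.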
Embedding Visualization
Embedding vectors encode semantic features across many dimensions, most of which are not individually interpretable. PCA or t-SNE can project high-dimensional embeddings into 2D for visualization, revealing clusters of semantically related texts. This is useful for exploratory analysis and for debugging embedding quality.
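from sklearn.decomposition import PCA

# Reuse `model` and `sentences` from the Sentence Transformers example.
# Re-encode here because the OpenAI example rebinds `embeddings` to a plain list.
embeddings = model.encode(sentences)

pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Labels you could use to color the points in a scatter plot
categories = ["vector_db", "vector_db", "unrelated", "vector_db"]

print(f"Original dimension: {embeddings.shape[1]}")
print(f"Reduced dimension: {reduced.shape[1]}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_[:2].round(4)}")
print("\n2D projection (first 3):")
for i in range(3):
    print(f"  {sentences[i][:30]:30s} -> ({reduced[i][0]:.3f}, {reduced[i][1]:.3f})")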
Example output (the variance ratios and coordinates are illustrative; actual values depend on the model version):
Original dimension: 384
Reduced dimension: 2
Explained variance ratio: [0.4567 0.2345]
2D projection (first 3):
Vector databases store embeddi -> (0.345, -0.123)
Pinecone is a managed vector -> (0.567, -0.234)
The weather is sunny and warm -> (-0.789, 0.456)
Embedding Evolution
flowchart LR
    A[Word2Vec 2013] --> B[GloVe 2014]
    B --> C[FastText 2016]
    C --> D[BERT 2018]
    D --> E[Sentence Transformers 2019]
    E --> F[OpenAI Embeddings 2022]
    F --> G[Modern Embedding Models 2024+]
    A --> H[Static word vectors]
    D --> I[Contextual word vectors]
    E --> J[Dense sentence vectors]
    F --> K[API-based embeddings]
Common Errors and Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Not normalizing embeddings | A raw dot product equals cosine similarity only for unit-length vectors | Normalize embeddings before using dot products, or compute cosine similarity explicitly |
| Mixing embedding models | Different models produce incompatible vectors | Always use the same model for indexing and querying |
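| Ignoring input length | Models truncate input beyond their maximum token length, and a single vector for a long text blurs its distinct topics | Chunk long documents before embedding |
| Wrong preprocessing | Markup debris and inconsistent cleaning degrade embedding quality | Clean text the same way at index and query time (strip markup and noise); transformer models rarely need lowercasing |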
| Not batching API calls | Slow embedding generation | Batch texts in groups of 100-1000 |
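
The "chunk long documents" fix in the table can be sketched as a sliding window over words. This is a minimal illustration under simplifying assumptions: the function name `chunk_words` and its default sizes are made up for this example, and it counts whitespace-separated words, whereas real models measure their limits in tokens. Treat the sizes as rough proxies and leave headroom.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears intact in at least one chunk.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk is then embedded on its own, and search results point back to the chunk instead of the whole document.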
Practice Questions
- What is the difference between Word2Vec and Sentence Transformer embeddings?
Answer: Word2Vec produces one vector per word, static regardless of context. Sentence Transformers produce one vector per sentence, capturing full sentence meaning. Word2Vec cannot handle polysemy; SBERT handles context-dependent meaning.
- How is cosine similarity used with embeddings?
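Answer: Cosine similarity measures the angle between two embedding vectors and ignores their lengths. Values range from -1 (opposite direction) to 1 (same direction). Text embeddings typically have positive similarity values. What counts as "high" depends on the model, so treat a cutoff such as 0.7 as a starting point to calibrate on your own data, not as a universal threshold.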
- Why do modern embedding models outperform Word2Vec?
Answer: Modern models use transformer architectures with attention mechanisms that capture context-dependent meaning. Word2Vec produces static embeddings where "bank" has the same vector for river bank and financial bank. Transformers produce different vectors based on surrounding words.
- What is the purpose of the embedding dimension (384, 768, 1536)?
Answer: The dimension determines how much information each embedding can store. Higher dimensions capture more nuanced meaning but require more storage and compute. Models like all-MiniLM-L6-v2 use 384 dimensions for efficiency; OpenAI's large model uses 3072 for maximum accuracy.
- How would you choose between Sentence Transformers and OpenAI embeddings?
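Answer: Sentence Transformers run locally (no per-call cost, data stays on your machine, no network latency), but you need hardware to run them. OpenAI embeddings need no local model and often perform strongly, at the price of API cost, network latency, and sending text to a third party. Use Sentence Transformers for development and high-volume offline processing; consider OpenAI for production when results on your own data justify the budget.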
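Several answers above rely on cosine similarity. The sketch below works through it on small hand-picked vectors, using only the standard library, to show how it differs from a raw dot product and why normalizing first makes the two agree. The helper names are illustrative, and the 2D vectors are toy values, not real embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    # Undefined for zero vectors: norm() would be 0 and the division would fail
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [6.0, 8.0]    # same direction as a, twice the length
c = [-3.0, -4.0]  # opposite direction to a

print(dot(a, b))                # 50.0 (grows with vector length)
print(cosine_similarity(a, b))  # 1.0 (length is ignored)
print(cosine_similarity(a, c))  # -1.0

# After normalization, the plain dot product matches cosine similarity
unit_dot = dot(normalize(a), normalize(b))
print(math.isclose(unit_dot, cosine_similarity(a, b)))  # True
```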
Challenge
Build a semantic search system over a collection of 100 technical documents. Use Sentence Transformers to embed all documents. Implement a search function that takes a query, embeds it, finds the top-5 most similar documents using cosine similarity, and returns them with similarity scores. Compare results using all-MiniLM-L6-v2, all-mpnet-base-v2, and OpenAI embeddings. Measure search quality by having 3 people rate the relevance of results.
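As a starting point for the challenge, here is a minimal sketch of the search loop: embed the query, score every document by cosine similarity, and return the top results. The `toy_embed` function is a hypothetical bag-of-words stand-in so the snippet runs without downloading a model. It only matches exact words, so swap in a real encoder (for example, a sentence-transformers model's `encode` method) for actual semantic search.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical stand-in encoder: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def search(query, documents, doc_vectors, embed, top_k=5):
    """Return up to `top_k` (score, document) pairs, best match first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, vec), doc)
              for doc, vec in zip(documents, doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


documents = [
    "vector databases store embeddings",
    "keyword search matches exact terms",
    "the weather is sunny today",
]
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})


def embed(text):
    return toy_embed(text, vocabulary)


# Embed the collection once up front; only the query is embedded per search
doc_vectors = [embed(doc) for doc in documents]

results = search("how do vector databases store data",
                 documents, doc_vectors, embed, top_k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Because `search` takes the embedding function as a parameter, comparing the three models in the challenge only requires passing a different `embed` and recomputing `doc_vectors`.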
Real-World Task
Design a document deduplication system for a knowledge base of 1 million articles. Embed all articles using a sentence transformer. For each new article, compute its embedding and find the most similar existing articles above a similarity threshold. Flag duplicates for manual review. Implement incremental indexing so new articles are embedded and compared in real time.
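One possible shape for the incremental part of this task is sketched below: each new vector is compared against everything already indexed, matches above a threshold are returned for review, and the vector is then added. The class name `DuplicateIndex`, its methods, and the 0.95 threshold are assumptions made for illustration, and the 3D vectors are toy values in place of real article embeddings.

```python
import math


class DuplicateIndex:
    """In-memory index that flags near-duplicates by cosine similarity."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.ids = []
        self.vectors = []  # stored as unit vectors

    @staticmethod
    def _normalize(vector):
        length = math.sqrt(sum(x * x for x in vector))
        if length == 0:
            raise ValueError("cannot index a zero vector")
        return [x / length for x in vector]

    def find_duplicates(self, vector):
        """Return (doc_id, score) pairs at or above the threshold, best first."""
        query = self._normalize(vector)
        hits = []
        for doc_id, stored in zip(self.ids, self.vectors):
            # Both vectors are unit length, so the dot product is the cosine
            score = sum(a * b for a, b in zip(query, stored))
            if score >= self.threshold:
                hits.append((doc_id, score))
        return sorted(hits, key=lambda hit: hit[1], reverse=True)

    def add(self, doc_id, vector):
        """Check against existing entries, then index the new vector."""
        hits = self.find_duplicates(vector)
        self.ids.append(doc_id)
        self.vectors.append(self._normalize(vector))
        return hits


index = DuplicateIndex(threshold=0.95)
print(index.add("a1", [1.0, 0.0, 0.0]))  # [] (nothing indexed yet)
print(index.add("a2", [0.0, 1.0, 0.0]))  # [] (orthogonal to a1)
hits = index.add("a3", [0.99, 0.05, 0.0])
print([doc_id for doc_id, _ in hits])    # ['a1']
```

The linear scan here costs one comparison per stored article, which is too slow at a million articles. At that scale the same interface would sit on top of a vector database or an approximate nearest-neighbor index.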
Next Steps
Use embeddings in a RAG system for grounded LLM responses. Store embeddings in Pinecone or Chroma for scalable similarity search. Combine with LangChain for complete NLP pipelines.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.