NLP Basics: Tokenization, Embeddings & Transformer Architecture
In this tutorial, you'll learn about NLP Basics: Tokenization, Embeddings & Transformer Architecture. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Natural Language Processing is a branch of AI that enables computers to understand and generate human language by converting text into numerical representations. Language is inherently discrete, ambiguous, and context-dependent — making it one of the hardest problems in AI. NLP bridges the gap between human communication and machine computation by transforming words, sentences, and documents into structured numerical forms that algorithms can analyze mathematically.
What You'll Learn
In this tutorial, you'll learn the fundamentals of NLP including tokenization strategies, word embeddings (Word2Vec, GloVe), TF-IDF for information retrieval, and the transformer architecture, the foundation of modern NLP models built with Python.
Why It Matters
NLP powers search engines, machine translation, chatbots, sentiment analysis, and large language models. Every time you use Google Search, ChatGPT, or a voice assistant, NLP is working behind the scenes. Understanding NLP fundamentals is essential for working with the most impactful AI technology today. Python and Hugging Face are the primary tools for modern NLP development.
Real-World Use
Durga Antivirus Pro uses NLP to analyze threat intelligence reports. The system tokenizes thousands of security advisories, converts them into embeddings, and matches them against known attack patterns to identify emerging threats before they become widespread.
Tokenization
Tokenization splits text into smaller units (tokens) that models can process. Word tokenization splits on whitespace and punctuation. Subword tokenization (used by BERT and GPT) breaks rare words into common subword units, handling unknown words gracefully. Character tokenization treats each character as a token, capturing every possible string but losing word-level meaning. The choice of tokenization strategy affects vocabulary size, out-of-vocabulary handling, and sequence length. Most modern NLP models use subword tokenization because it balances vocabulary size with coverage.
# Legacy Keras text tokenizer. It is deprecated in recent TensorFlow releases
# in favor of tf.keras.layers.TextVectorization, but it is still handy for small demos.
from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and dog are friends",
]

# num_words caps the vocabulary by frequency; unseen words map to the <OOV> token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)  # builds the word index, most frequent words first
sequences = tokenizer.texts_to_sequences(texts)

print(f"Word index: {dict(list(tokenizer.word_index.items())[:6])}")
print(f"Sequence 1: {sequences[0]}")
print(f"Sequence 2: {sequences[1]}")
print(f"Sequence 3: {sequences[2]}")
Expected output:
Word index: {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'dog': 6}
Sequence 1: [2, 3, 4, 5, 2, 7]
Sequence 2: [2, 6, 4, 5, 2, 8]
Sequence 3: [2, 3, 9, 6, 10, 11]
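The Keras example above only covers word-level tokenization. To see how subword tokenization copes with words outside the vocabulary, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The tiny `VOCAB` set and the `wordpiece_tokenize` helper are invented for illustration. Real tokenizers learn their vocabulary from a large corpus and handle many more edge cases.

```python
# Toy vocabulary, invented for this example. Pieces that continue a word
# carry a "##" prefix, following the WordPiece convention.
VOCAB = {"token", "##ization", "un", "##believ", "##able", "cat", "##s"}


def wordpiece_tokenize(word, vocab=VOCAB, unk="[UNK]"):
    """Greedily split one lowercase word into the longest matching subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_tokenize("tokenization"))  # ['token', '##ization']
print(wordpiece_tokenize("unbelievable"))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("cats"))          # ['cat', '##s']
print(wordpiece_tokenize("dog"))           # ['[UNK]']
print(list("cat"))                         # character tokenization: ['c', 'a', 't']
```

With a realistic vocabulary that includes single characters as pieces, the unknown-token fallback is rarely needed, which is why subword models handle rare words so well.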
Bag of Words and TF-IDF
Bag of Words counts word occurrences in each document, creating a sparse matrix where rows are documents and columns are vocabulary terms. The main limitation is that common words like "the" and "is" dominate the counts despite carrying little meaning. TF-IDF addresses this by downweighting words that appear in many documents. The TF-IDF score is the product of term frequency (how often a word appears in the document) and inverse document frequency (the log of the total number of documents divided by the number of documents containing the word). This highlights distinctive words while suppressing common ones.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning is fascinating",
    "deep learning is a subset of machine learning",
    "natural language processing is a branch of AI",
]

# Learn the vocabulary and count each term per document.
# The default tokenizer lowercases text and ignores one-letter tokens such as "a".
count_vectorizer = CountVectorizer()
bow = count_vectorizer.fit_transform(docs)

print("Bag of Words (dense):")
print(pd.DataFrame(
    bow.toarray(),
    columns=count_vectorizer.get_feature_names_out()
).to_string(index=False))
Expected output:
Bag of Words (dense):
ai branch deep fascinating is language learning machine natural of processing subset
0 0 0 1 1 0 1 1 0 0 0 0
0 0 1 0 1 0 2 1 0 1 0 1
1 1 0 0 1 1 0 0 1 1 1 0
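The TF-IDF formula described in this section can be worked through by hand. The sketch below uses only the standard library and makes two assumptions: term frequency is the word count divided by the document length, and inverse document frequency is the plain `log(N / df)` form. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match these.

```python
import math
from collections import Counter

docs = [
    "machine learning is fascinating",
    "deep learning is a subset of machine learning",
    "natural language processing is a branch of AI",
]

# Simple whitespace tokenization; unlike CountVectorizer, this keeps "a".
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each term at least once.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Return {term: tf * idf} for one document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


scores = [tf_idf(tokens) for tokens in tokenized]

# Show the second document's terms, highest score first.
for term, score in sorted(scores[1].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.3f}")
```

The word "is" appears in all three documents, so its IDF is log(1) = 0 and it drops out entirely. Words that occur in only one document, such as "deep" and "subset", rise to the top.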
Word Embeddings
Embeddings are dense vector representations where similar words have similar vectors. Unlike one-hot encoding where each word is a sparse vector of vocabulary size, embeddings are dense (typically 50-300 dimensions) and learned from data. The key property is that vector arithmetic captures semantic relationships: vec("king") - vec("man") + vec("woman") approximates vec("queen"). This geometric structure is why embeddings are so powerful — they encode meaning as spatial relationships. Modern NLP models learn embeddings as part of the training process, adapting them to the specific task.
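import numpy as np
from tensorflow.keras.layers import Embedding

# Map each of 100 possible token ids to a trainable 8-dimensional vector.
# (The input_length argument is deprecated in recent Keras releases and is not needed here.)
embedding_layer = Embedding(input_dim=100, output_dim=8)

# One batch containing a single sequence of five token ids
sample_sequence = np.array([[2, 3, 4, 5, 6]])
embedded = embedding_layer(sample_sequence)

print(f"Input shape: {sample_sequence.shape}")
print(f"Output shape: {embedded.shape}")
print(f"Embedding for word 2:\n{embedded[0, 0].numpy().round(3)}")
Expected output (the layer is randomly initialized, so the embedding values will differ on each run):
Input shape: (1, 5)
Output shape: (1, 5, 8)
Embedding for word 2:
[ 0.042 -0.031 0.015 -0.022 0.038 -0.045 0.012 0.028]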
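To make the "king - man + woman" idea concrete, here is a small sketch that uses cosine similarity on hand-made vectors. The three-dimensional vectors are invented for illustration and do not come from any trained model. Real embeddings have far more dimensions, and the analogy holds only approximately.

```python
import math

# Made-up vectors; the dimensions loosely stand for (royalty, male, female).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed element by element (roughly [0.9, 0.0, 0.9])
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Rank the words that were not part of the query by similarity to the target.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(best)  # queen
```

Excluding the query words from the candidates is the usual convention for analogy tests, because the nearest vector to the result is often one of the inputs.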
Training a Word2Vec-Style Embedding
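The snippet below sets up a small Keras model with a trainable embedding layer. It defines and compiles the model but does not fit it, and the embeddings would be learned through a supervised task rather than through Word2Vec's context-prediction objective.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "the fox is quick and clever",
    "the dog is lazy but friendly",
]

tokenizer = Tokenizer(num_words=50, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# +1 because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 16

# The sentences have different lengths, so pad them to a fixed length
# before fitting; Flatten needs a fixed sequence length.
model = Sequential([
    Embedding(vocab_size, embedding_dim),
    Flatten(),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')

print(f"Vocabulary size: {vocab_size}")
print(f"Embedding dimension: {embedding_dim}")
Expected output:
Vocabulary size: 15
Embedding dimension: 16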
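Word2Vec itself learns embeddings by predicting which words appear near each other. As a rough sketch of its skip-gram variant, the helper below generates the (target, context) training pairs from a sentence. The function name `skipgram_pairs` and the window size are illustrative choices, and the training step itself (including negative sampling) is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(f"{target} -> {context}")
# the -> quick
# quick -> the
# quick -> brown
# brown -> quick
# brown -> fox
# fox -> brown
```

A skip-gram model is trained to score these observed pairs higher than randomly sampled ones, and the weights it learns along the way become the word vectors.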
The Transformer Architecture
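Transformers have largely replaced RNNs for most NLP tasks. Instead of reading tokens one at a time, they process the whole sequence in parallel and use self-attention to let each token weigh every other token.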
flowchart TD
    A[Input Text] --> B[Tokenization]
    B --> C[Token Embeddings]
    C --> D[Positional Encoding]
    D --> E[Multi-Head Self-Attention]
    E --> F[Feed-Forward Network]
    F --> G[Layer Norm & Residual]
    G --> H[Stack N Layers]
    H --> I[Output]
    E -.-> J[Attention Weights]
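The self-attention step at the heart of this architecture can be sketched in a few lines. The example below implements single-head scaled dot-product attention in plain Python so that every step is visible. The input vectors are made up, and the identity projection matrices are an assumption that keeps the numbers easy to follow. A real transformer learns those matrices, runs several heads in parallel, and would use a tensor library rather than nested lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; returns (outputs, weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    # Score every query against every key, scaled by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Each row of weights says how much one token attends to every token.
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Three tokens with 2-dimensional embeddings (made-up numbers)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections

outputs, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` sums to 1, and all rows are computed independently of each other, which is what allows the whole sequence to be processed in parallel.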
Traditional vs Modern NLP
| Aspect | Traditional NLP | Transformer NLP |
|---|---|---|
| Representation | Sparse (BoW, TF-IDF) | Dense (embeddings) |
| Context | Fixed window | Full sequence (self-attention) |
| Parallelization | Sequential (RNNs) | Fully parallel |
| Out-of-vocabulary | Unknown token | Subword tokenization (BPE) |
| Scale | Small models | Billions of parameters |
Common Errors and Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Not lowercasing text | "The" and "the" become different tokens | Use tokenizer with lower=True |
| Removing stopwords too aggressively | Lost meaning in phrases like "not good" | Keep stopwords for sentiment analysis |
| Fixed vocabulary too small | Many OOV tokens | Use subword tokenization or larger vocab |
| Wrong padding direction | With post-padding, an RNN's final state is computed after a run of padding tokens, which can wash out the signal | Use pre-padding (or masking) for RNNs, post-padding for transformers |
| Not handling punctuation | "hello!" and "hello" become different | Use regex to normalize punctuation |
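Several of the fixes in the table can be sketched with the standard library alone. The `normalize` and `pad` helpers below are illustrative names, not part of any library. Keras provides `pad_sequences` with a `padding` argument for the same purpose.

```python
import re


def normalize(text):
    """Lowercase, strip punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # also removes apostrophes: "don't" -> "dont"
    return re.sub(r"\s+", " ", text).strip()


def pad(sequence, length, value=0, direction="pre"):
    """Pad a list of token ids with `value` up to `length`; longer inputs are returned unchanged."""
    padding = [value] * max(0, length - len(sequence))
    if direction == "pre":
        return padding + list(sequence)
    if direction == "post":
        return list(sequence) + padding
    raise ValueError("direction must be 'pre' or 'post'")


print(normalize("Hello!  The movie was NOT good..."))  # hello the movie was not good
print(pad([2, 3, 4], 5, direction="pre"))              # [0, 0, 2, 3, 4]
print(pad([2, 3, 4], 5, direction="post"))             # [2, 3, 4, 0, 0]
```

Note that the normalized sentence still contains "not". Dropping it as a stopword would flip the meaning for sentiment analysis.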
Practice Questions
- What is tokenization in NLP?
Answer: Tokenization splits text into smaller units called tokens — words, subwords, or characters. These tokens are the basic input units that NLP models process.
- How does TF-IDF differ from Bag of Words?
Answer: BoW counts word frequencies in each document. TF-IDF multiplies term frequency by inverse document frequency, downweighting common words that appear across many documents and highlighting words unique to specific documents.
- What is the purpose of word embeddings?
Answer: Word embeddings map words to dense vectors where semantically similar words have similar vector representations. They capture meaning and analogy relationships (king - man + woman ≈ queen), and they use far fewer dimensions than one-hot encoding.
- Why is self-attention the key innovation in transformers?
Answer: Self-attention allows each token to attend directly to every other token in the sequence, so long-range dependencies are captured in a single step. Unlike an RNN's sequential recurrence, this computation parallelizes across tokens, which makes training on massive datasets practical.
- What is the difference between subword and word tokenization?
Answer: Word tokenization splits on whitespace/punctuation and requires a fixed vocabulary. Subword tokenization (BPE, WordPiece) breaks rare words into smaller meaningful units, handling unknown words and morphologically rich languages better.
Challenge
Build a text classification pipeline for sentiment analysis on the IMDB movie review dataset. Implement three approaches: Bag of Words + logistic regression, Word2Vec embeddings + LSTM, and a small transformer model using TensorFlow. Compare accuracy, training time, and inference speed for each approach.
Real-World Task
Design an NLP system that monitors customer support tickets for urgent security issues. Use TF-IDF to classify tickets into categories (security, billing, technical), and train a word embedding model to find semantically similar historical tickets. The system should flag security-related tickets for immediate human review based on keyword patterns and embedding similarity to known security incidents.
Next Steps
Now that you understand NLP fundamentals, try Hugging Face Transformers for state-of-the-art pretrained models, then move on to LLM Prompt Engineering. Python, together with TensorFlow or PyTorch, is the primary toolchain for NLP.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.