NLP Basics: Tokenization, Embeddings & Transformer Architecture

DodaTech Updated 2026-06-22 7 min read

This tutorial covers the basics of NLP: tokenization, embeddings, and the transformer architecture. It walks through the key concepts with practical examples and best practices so you can apply them in your own work.

Natural Language Processing is a branch of AI that enables computers to understand and generate human language by converting text into numerical representations. Language is inherently discrete, ambiguous, and context-dependent — making it one of the hardest problems in AI. NLP bridges the gap between human communication and machine computation by transforming words, sentences, and documents into structured numerical forms that algorithms can analyze mathematically.

What You'll Learn

In this tutorial, you'll learn the fundamentals of NLP including tokenization strategies, word embeddings (Word2Vec, GloVe), TF-IDF for information retrieval, and the transformer architecture, the foundation of modern NLP models built with Python.

Why It Matters

NLP powers search engines, machine translation, chatbots, sentiment analysis, and large language models. Every time you use Google Search, ChatGPT, or a voice assistant, NLP is working behind the scenes. Understanding NLP fundamentals is essential for working with the most impactful AI technology today. Python and Hugging Face are the primary tools for modern NLP development.

Real-World Use

Durga Antivirus Pro uses NLP to analyze threat intelligence reports. The system tokenizes thousands of security advisories, converts them into embeddings, and matches them against known attack patterns to identify emerging threats before they become widespread.

Tokenization

Tokenization splits text into smaller units (tokens) that models can process. Word tokenization splits on whitespace and punctuation. Subword tokenization (used by BERT and GPT) breaks rare words into common subword units, handling unknown words gracefully. Character tokenization treats each character as a token, capturing every possible string but losing word-level meaning. The choice of tokenization strategy affects vocabulary size, out-of-vocabulary handling, and model capacity. Most modern NLP models use subword tokenization because it balances vocabulary size with coverage.

from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and dog are friends"
]

# Limit the vocabulary to the most frequent words; unseen words map to '<OOV>'
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)  # builds the word index, most frequent words first

# Replace each word with its integer index
sequences = tokenizer.texts_to_sequences(texts)

print(f"Word index: {dict(list(tokenizer.word_index.items())[:6])}")
print(f"Sequence 1: {sequences[0]}")
print(f"Sequence 2: {sequences[1]}")
print(f"Sequence 3: {sequences[2]}")

Expected output:

Word index: {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'dog': 6}
Sequence 1: [2, 3, 4, 5, 2, 7]
Sequence 2: [2, 6, 4, 5, 2, 8]
Sequence 3: [2, 3, 9, 6, 10, 11]
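To make the three granularities concrete, here is a minimal sketch in plain Python. The `subword_tokenize` helper uses a greedy longest-match split in the spirit of WordPiece, and `toy_vocab` is a hypothetical hand-picked vocabulary; real subword tokenizers learn their vocabulary from a corpus, so treat this as an illustration rather than how BERT or GPT actually tokenize.

```python
import re

def word_tokenize(text):
    # Lowercase, then keep runs of letters/digits; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

def char_tokenize(text):
    # Every character, including spaces, becomes its own token.
    return list(text.lower())

def subword_tokenize(word, vocab):
    # Greedy longest-match-first split. Pieces that continue a word
    # carry a "##" prefix, following the WordPiece convention.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["<UNK>"]  # no known piece covers this position
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen by hand for this example.
toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##known"}

print(word_tokenize("Tokenization handles unknown words!"))
# ['tokenization', 'handles', 'unknown', 'words']
print(char_tokenize("cat"))
# ['c', 'a', 't']
print(subword_tokenize("tokenization", toy_vocab))
# ['token', '##ization']
print(subword_tokenize("unknown", toy_vocab))
# ['un', '##known']
```

Note how the subword splitter covers "tokenization" with two known pieces even though the full word is not in the vocabulary, which is the out-of-vocabulary behavior described above.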

Bag of Words and TF-IDF

Bag of Words counts word occurrences in each document, creating a sparse matrix where rows are documents and columns are vocabulary terms. The main limitation is that common words like "the" and "is" dominate the counts despite carrying little meaning. TF-IDF addresses this by downweighting words that appear in many documents. The TF-IDF score is the product of term frequency (how often a word appears in the document) and inverse document frequency (log of total documents divided by number of documents containing the word). This highlights distinctive words while suppressing common ones.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "machine learning is fascinating",
    "deep learning is a subset of machine learning",
    "natural language processing is a branch of AI"
]

# Build the vocabulary and count occurrences per document.
# The default tokenizer lowercases and drops one-character tokens such as "a".
count_vectorizer = CountVectorizer()
bow = count_vectorizer.fit_transform(docs)

# Convert the sparse matrix to a dense table for display
print("Bag of Words (dense):")
print(pd.DataFrame(
    bow.toarray(),
    columns=count_vectorizer.get_feature_names_out()
).to_string(index=False))

Expected output:

Bag of Words (dense):
 ai  branch  deep  fascinating  is  language  learning  machine  natural  of  processing  subset
  0       0     0            1   1         0         1        1        0    0           0       0
  0       0     1            0   1         0         2        1        0    1           0       1
  1       1     0            0   1         1         0        0        1    1           1       0
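The TF-IDF formula described above can be worked through by hand. The sketch below is a plain-Python illustration under two assumptions: term frequency is normalized by document length, and the IDF is the unsmoothed log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math

docs = [
    "machine learning is fascinating",
    "deep learning is a subset of machine learning",
    "natural language processing is a branch of AI",
]
tokenized = [doc.lower().split() for doc in docs]

def tf(term, tokens):
    # Term frequency: share of the document's tokens equal to `term`.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: log(N / number of docs containing term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Score three terms in the second document
for term in ["is", "learning", "deep"]:
    score = tfidf(term, tokenized[1], tokenized)
    print(f"{term:10s} {score:.3f}")
# is         0.000
# learning   0.101
# deep       0.137
```

"is" scores zero because it appears in every document, while "deep" outranks "learning" even though "learning" occurs twice, because "deep" appears in only one document.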

Word Embeddings

Embeddings are dense vector representations where similar words have similar vectors. Unlike one-hot encoding where each word is a sparse vector of vocabulary size, embeddings are dense (typically 50-300 dimensions) and learned from data. The key property is that vector arithmetic captures semantic relationships: vec("king") - vec("man") + vec("woman") approximates vec("queen"). This geometric structure is why embeddings are so powerful — they encode meaning as spatial relationships. Modern NLP models learn embeddings as part of the training process, adapting them to the specific task.

import numpy as np
from tensorflow.keras.layers import Embedding

# Lookup table: 100 possible token ids, each mapped to an 8-dimensional vector.
# (The input_length argument is deprecated in recent Keras versions, so it is omitted.)
embedding_layer = Embedding(input_dim=100, output_dim=8)

# One sequence of five token ids
sample_sequence = np.array([[2, 3, 4, 5, 6]])
embedded = embedding_layer(sample_sequence)

print(f"Input shape: {sample_sequence.shape}")
print(f"Output shape: {embedded.shape}")
print(f"Embedding for word 2:\n{embedded[0, 0].numpy().round(3)}")

Expected output (the embedding values are randomly initialized, so yours will differ):

Input shape: (1, 5)
Output shape: (1, 5, 8)
Embedding for word 2:
[ 0.042 -0.031  0.015 -0.022  0.038 -0.045  0.012  0.028]
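The analogy property can be illustrated with a few hand-picked vectors. The values below are hypothetical and not learned from data; they are chosen so that the arithmetic works out cleanly. Learned embeddings only approximate this behavior.

```python
import math

# Hand-picked 3-D toy vectors (hypothetical, not learned from data).
# Axes loosely stand for: royalty, gender, "is a fruit".
vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    # Solve "a is to b as c is to ?" via vec(b) - vec(a) + vec(c),
    # then return the closest word that is not one of the inputs.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "king", "woman"))  # queen
```

Excluding the input words from the candidates is a common convention in analogy evaluation; without it, the nearest neighbor of the target is often one of the inputs.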

Training a Word2Vec-Style Embedding

from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "the fox is quick and clever",
    "the dog is lazy but friendly"
]

tokenizer = Tokenizer(num_words=50, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# +1 because index 0 is reserved (conventionally for padding)
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 16

# The embedding is learned jointly with a small classifier here.
# The model expects sequences padded or truncated to 5 tokens.
model = Sequential([
    Input(shape=(5,)),
    Embedding(vocab_size, embedding_dim),
    Flatten(),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy')
print(f"Vocabulary size: {vocab_size}")
print(f"Embedding dimension: {embedding_dim}")

Expected output:

Vocabulary size: 15
Embedding dimension: 16
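The model above learns its embedding as a side effect of a classification task. Word2Vec's skip-gram variant instead trains on (center word, context word) pairs drawn from a sliding window. As a rough sketch of that data preparation step only (the function name and `window` parameter are illustrative, not from any library):

```python
def skipgram_pairs(tokens, window=2):
    # Pair each center word with every word within `window` positions of it.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
# the -> quick
# quick -> the
# quick -> brown
# brown -> quick
# brown -> fox
# fox -> brown
```

A skip-gram model would then be trained to predict the context word from the center word, so words that share contexts end up with similar vectors.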

The Transformer Architecture

Transformers largely replaced RNNs in NLP by processing all tokens in parallel with self-attention, rather than reading the sequence one step at a time.
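A minimal sketch of scaled dot-product self-attention in plain Python may help here. It makes one simplifying assumption: queries, keys, and values are all the input vectors themselves, whereas a real transformer layer first applies learned projection matrices and uses several attention heads.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # x: list of token vectors, each of length d.
    # Simplification: queries, keys, and values are all x itself.
    d = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in x]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        mixed = [sum(w * value[i] for w, value in zip(weights, x))
                 for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token attends to every other token. Because each query is handled independently of the others, the loop over queries can run in parallel, which is the property that distinguishes this from an RNN.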

flowchart TD
  A[Input Text] --> B[Tokenization]
  B --> C[Token Embeddings]
  C --> D[Positional Encoding]
  D --> E[Multi-Head Self-Attention]
  E --> F[Feed-Forward Network]
  F --> G[Layer Norm & Residual]
  G --> H[Stack N Layers]
  H --> I[Output]
  E -.-> J[Attention Weights]
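Because self-attention on its own ignores token order, the positional encoding step in the diagram injects position information. The sketch below uses the sinusoidal scheme from the original transformer paper; this is one option among several, and many models use learned position embeddings instead.

```python
import math

def positional_encoding(num_positions, d_model):
    # Sinusoidal scheme: even dimensions use sine, odd dimensions use cosine,
    # with wavelengths that grow geometrically across dimensions.
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each row is added element-wise to the token embedding at that position, so two identical tokens at different positions reach the attention layer with different vectors.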

Traditional vs Modern NLP

| Aspect | Traditional NLP | Transformer NLP |
|---|---|---|
| Representation | Sparse (BoW, TF-IDF) | Dense (embeddings) |
| Context | Fixed window | Full sequence (self-attention) |
| Parallelization | Sequential (RNNs) | Fully parallel |
| Out-of-vocabulary | Unknown token | Subword tokenization (BPE) |
| Scale | Small models | Billions of parameters |

Common Errors and Mistakes

| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Not lowercasing text | "The" and "the" become different tokens | Use a tokenizer with lower=True |
| Removing stopwords too aggressively | Meaning is lost in phrases like "not good" | Keep stopwords for sentiment analysis |
| Fixed vocabulary too small | Many words map to the OOV token | Use subword tokenization or a larger vocabulary |
| Wrong padding direction | With post-padding, an RNN's final state is computed after a run of padding tokens, which dilutes the signal | Use pre-padding for RNNs, post-padding for transformers |
| Not handling punctuation | "hello!" and "hello" become different tokens | Use regex to normalize punctuation |
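Several of these fixes fit in a few lines of preprocessing. The sketch below assumes English text and uses two illustrative helpers, `normalize` and `pad`, that are not part of any library; Keras offers `pad_sequences` for the same purpose.

```python
import re

def normalize(text):
    # Lowercase, then replace anything except letters, digits, and
    # whitespace with a space so "hello!" and "hello" match.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

def pad(sequence, length, direction="pre", pad_id=0):
    # Truncate or pad a list of token ids to exactly `length` items.
    # Assumes length >= 1.
    sequence = sequence[-length:] if direction == "pre" else sequence[:length]
    padding = [pad_id] * (length - len(sequence))
    return padding + sequence if direction == "pre" else sequence + padding

print(normalize("Hello! The movie was NOT good."))
# ['hello', 'the', 'movie', 'was', 'not', 'good']
print(pad([2, 3, 4], 5, direction="pre"))   # [0, 0, 2, 3, 4]
print(pad([2, 3, 4], 5, direction="post"))  # [2, 3, 4, 0, 0]
```

The word "not" survives normalization because no stopword list is applied, which matters for the sentiment example in the table.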

Practice Questions

  1. What is tokenization in NLP?

Answer: Tokenization splits text into smaller units called tokens — words, subwords, or characters. These tokens are the basic input units that NLP models process.

  2. How does TF-IDF differ from Bag of Words?

Answer: BoW counts word frequencies in each document. TF-IDF multiplies term frequency by inverse document frequency, downweighting common words that appear across many documents and highlighting words unique to specific documents.

  3. What is the purpose of word embeddings?

Answer: Word embeddings map words to dense vectors where semantically similar words have similar vector representations. They capture meaning and analogy relationships (king - man + woman ≈ queen), and they reduce dimensionality compared to one-hot encoding.

  4. Why is self-attention the key innovation in transformers?

Answer: Self-attention allows each token to attend to every other token in the sequence, capturing long-range dependencies efficiently. Unlike RNNs, this is fully parallelizable, enabling training on massive datasets.

  5. What is the difference between subword and word tokenization?

Answer: Word tokenization splits on whitespace/punctuation and requires a fixed vocabulary. Subword tokenization (BPE, WordPiece) breaks rare words into smaller meaningful units, handling unknown words and morphologically rich languages better.

Challenge

Build a text classification pipeline for sentiment analysis on the IMDB movie review dataset. Implement three approaches: Bag of Words + logistic regression, Word2Vec embeddings + LSTM, and a small transformer model using TensorFlow. Compare accuracy, training time, and inference speed for each approach.

Real-World Task

Design an NLP system that monitors customer support tickets for urgent security issues. Use TF-IDF to classify tickets into categories (security, billing, technical), and train a word embedding model to find semantically similar historical tickets. The system should flag security-related tickets for immediate human review based on keyword patterns and embedding similarity to known security incidents.
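As a possible starting point, the flagging logic could be sketched as below. Everything here is hypothetical: the keyword patterns, the example incidents, and the 0.4 threshold are placeholders, and plain word counts stand in for the TF-IDF features and learned embeddings the task asks for.

```python
import math
import re
from collections import Counter

# Hypothetical keyword patterns and past incidents, for illustration only.
SECURITY_PATTERNS = re.compile(
    r"\b(breach|phishing|malware|unauthorized)\b", re.IGNORECASE
)
KNOWN_INCIDENTS = [
    "unauthorized login attempts detected on admin account",
    "customer received phishing email asking for password",
]

def bag(text):
    # Word counts; a stand-in for TF-IDF or embedding features.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def should_flag(ticket, threshold=0.4):
    # Flag on a keyword hit, or on similarity to any known incident.
    if SECURITY_PATTERNS.search(ticket):
        return True
    best = max(cosine(bag(ticket), bag(incident))
               for incident in KNOWN_INCIDENTS)
    return best >= threshold

print(should_flag("Possible malware on my laptop"))                    # True
print(should_flag("Someone tried many login attempts on my account"))  # True
print(should_flag("Please update my billing address"))                 # False
```

The second ticket contains no keyword but is flagged through its overlap with a known incident, which is the case the similarity check is meant to catch. Swapping `bag` for TF-IDF vectors or sentence embeddings would keep the rest of the structure unchanged.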

Next Steps

Now that you understand NLP fundamentals, try Hugging Face Transformers for state-of-the-art models, then move on to LLM Prompt Engineering. Python is the primary language for NLP, with TensorFlow and PyTorch as the main frameworks.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.