Dimensionality Reduction — PCA, t-SNE, and UMAP Explained
This tutorial covers dimensionality reduction: the key concepts, worked Python examples, and practical guidance on when to use each method.
Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving meaningful properties, enabling visualization, noise reduction, and faster model training.
What You'll Learn
How to apply PCA for linear dimensionality reduction and feature extraction, t-SNE for visualization of high-dimensional clusters, and UMAP for scalable, structure-preserving embeddings.
Why It Matters
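Real-world datasets often have hundreds or thousands of features. The curse of dimensionality makes many models less effective as the number of dimensions grows: the data becomes sparse, and distances between points become less informative. Reducing dimensions can speed up training and remove redundant features, and it sometimes improves accuracy and interpretability as well.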
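One facet of the curse of dimensionality is that distances concentrate: in high dimensions, the nearest and farthest points from a query end up almost equally far away. The sketch below illustrates this on uniform random data. The helper `relative_contrast` and the chosen sample sizes are illustrative assumptions, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """Return (max - min) / min distance from one random query to random points."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Hypothetical dimensions chosen only to show the trend.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

The contrast shrinks as the dimension grows, which is why distance-based methods such as nearest neighbors tend to degrade on raw high-dimensional data.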
Real-World Use
DodaZIP uses PCA to compress feature vectors for file-type classification, reducing 200+ file attributes to 20 principal components while maintaining 98% classification accuracy. Durga Antivirus Pro uses UMAP to visualize malware family clusters for threat analysis.
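To give a feel for this kind of setup, here is a minimal sketch of PCA used as a compression step ahead of a classifier. It is not DodaZIP's pipeline: the digits dataset stands in for file attributes, and the classifier, the train/test split, and the choice of 20 components are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: classifier on all 64 scaled features.
full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Same classifier on 20 principal components. Inside a pipeline, the scaler
# and PCA are fitted on the training split only, which avoids leakage.
reduced_model = make_pipeline(
    StandardScaler(), PCA(n_components=20), LogisticRegression(max_iter=2000)
)

acc_full = full_model.fit(X_train, y_train).score(X_test, y_test)
acc_reduced = reduced_model.fit(X_train, y_train).score(X_test, y_test)

print(f"Accuracy with 64 features:   {acc_full:.3f}")
print(f"Accuracy with 20 components: {acc_reduced:.3f}")
```

Comparing the two scores shows how much accuracy, if any, is given up for the smaller representation on a particular dataset.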
Dimensionality Reduction Landscape
flowchart TD
A[Dimensionality Reduction] --> B[Linear]
A --> C[Non-Linear]
B --> D[PCA]
B --> E[SVD]
B --> F[LDA]
C --> G[t-SNE]
C --> H[UMAP]
C --> I[Autoencoders]
D --> J[Fast, interpretable]
D --> K[Global structure]
G --> L[Excellent visualization]
G --> M[Slow, non-deterministic]
H --> N[Fast, scalable]
H --> O[Preserves local and some global structure]
Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import numpy as np
digits = load_digits()
X, y = digits.data, digits.target
print(f"Original shape: {X.shape}")
print(f"Number of features: {X.shape[1]}")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA()
pca.fit(X_scaled)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
n_components_99 = np.argmax(cumulative_variance >= 0.99) + 1
print(f"Components for 95% variance: {n_components_95}")
print(f"Components for 99% variance: {n_components_99}")
print(f"\nFirst 5 explained variance ratios: {pca.explained_variance_ratio_[:5].round(4)}")
Expected output:
Original shape: (1797, 64)
Number of features: 64
Components for 95% variance: 29
Components for 99% variance: 41
First 5 explained variance ratios: [0.1452 0.1066 0.0807 0.0685 0.056 ]
PCA reduces the 64 pixel features to 29 components while retaining 95% of the variance. Each component is a weighted (linear) combination of the original pixels.
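The "weighted combination" idea can be checked directly. The sketch below reproduces `transform` by hand, assuming scikit-learn's default settings (no whitening); the choice of 5 components is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)

pca = PCA(n_components=5, svd_solver="full").fit(X_scaled)

# components_ has shape (n_components, n_features):
# each row holds one weight per original pixel.
weights = pca.components_
print(f"Weights shape: {weights.shape}")

# Projecting = centering the data, then taking a dot product with each weight row.
X_manual = (X_scaled - pca.mean_) @ weights.T
X_sklearn = pca.transform(X_scaled)
matches = np.allclose(X_manual, X_sklearn)
print(f"Manual projection matches transform(): {matches}")

# The weight rows are orthonormal, so the components do not overlap.
orthonormal = np.allclose(weights @ weights.T, np.eye(5))
print(f"Components are orthonormal: {orthonormal}")
```

Inspecting a row of `components_` (for example, reshaped to 8x8 for the digits images) shows which pixels a component weighs most heavily.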
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)
print("First 10 points in 2D PCA space:")
print(X_pca_2d[:10].round(3))
print(f"\nVariance preserved: {pca_2d.explained_variance_ratio_.sum():.3f}")
recovery_error = np.mean((X_scaled - pca_2d.inverse_transform(X_pca_2d)) ** 2)
print(f"Reconstruction error (MSE): {recovery_error:.4f}")
Expected output:
First 10 points in 2D PCA space:
[[ -1.259 6.161]
[ 21.797 4.989]
[ -4.683 -2.507]
[ 1.114 -14.688]
[ 6.159 6.987]
[ 15.078 -11.384]
[ -9.441 7.181]
[ 20.615 4.717]
[ 3.008 -15.806]
[ -9.398 -2.915]]
Variance preserved: 0.283
Reconstruction error (MSE): 0.7034
Two PCA components preserve only about 28% of the variance. That is enough for a rough visualization, but most fine-grained detail is lost.
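The trade-off between component count and reconstruction quality can be measured rather than guessed. The following sketch, with arbitrarily chosen values of `k`, computes the reconstruction error for several component counts and also uses scikit-learn's option of passing a variance fraction as `n_components`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)

# Reconstruction error (MSE) for a few component counts.
errors = {}
for k in (2, 10, 30, 64):
    pca_k = PCA(n_components=k, svd_solver="full").fit(X_scaled)
    X_back = pca_k.inverse_transform(pca_k.transform(X_scaled))
    errors[k] = np.mean((X_scaled - X_back) ** 2)
    print(f"k={k:>2}: reconstruction MSE = {errors[k]:.4f}")

# A float in (0, 1) asks PCA to keep just enough components
# to explain at least that fraction of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
print(f"Components kept for 95% variance: {pca_95.n_components_}")
```

The error falls as `k` grows and reaches zero (up to floating-point precision) when every component is kept.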
t-SNE for Visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)
print("First 10 points in t-SNE space:")
print(X_tsne[:10].round(3))
print(f"\nt-SNE final KL divergence: {tsne.kl_divergence_:.3f}")
Expected output (exact coordinates vary with library version and platform):
First 10 points in t-SNE space:
[[ -3.212 -5.548]
[ 6.634 26.591]
[ -7.914 10.361]
[-10.382 12.749]
[ 3.917 -18.721]
[ 15.637 -9.349]
[ -8.833 -8.278]
[ 7.822 24.522]
[ -6.798 16.331]
[ 9.541 -14.597]]
t-SNE final KL divergence: 0.843
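t-SNE focuses on preserving local neighborhoods (nearby points stay nearby) at the expense of global structure: the distances between clusters and the sizes of clusters in the plot are not reliable. The layout also changes with the perplexity setting and the random seed.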
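Neighborhood preservation can be quantified with scikit-learn's `trustworthiness` score, which ranges from 0 to 1 and penalizes points that appear as neighbors in the embedding but were not neighbors in the original space. The sketch below compares PCA and t-SNE on a 500-sample subset; the subset size and `n_neighbors=10` are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Use a subset so t-SNE finishes quickly.
X = StandardScaler().fit_transform(load_digits().data[:500])

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# Higher is better: 1.0 means every embedded neighborhood is faithful.
trust_pca = trustworthiness(X, X_pca, n_neighbors=10)
trust_tsne = trustworthiness(X, X_tsne, n_neighbors=10)

print(f"Trustworthiness, PCA 2D:   {trust_pca:.3f}")
print(f"Trustworthiness, t-SNE 2D: {trust_tsne:.3f}")
```

The score only measures local structure, so a high value for t-SNE says nothing about whether the distances between clusters are meaningful.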
UMAP
import umap  # provided by the umap-learn package (pip install umap-learn)
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
print("First 10 points in UMAP space:")
print(X_umap[:10].round(3))
# graph_ is UMAP's sparse fuzzy neighbor graph; its sum is the total edge weight,
# not a distance or a loss value.
print(f"\nUMAP graph total edge weight: {reducer.graph_.sum():.2f}")
Expected output (exact values vary with library version and platform):
First 10 points in UMAP space:
[[-4.234 -1.876]
[ 5.455 1.398]
[-0.431 2.985]
[ 1.127 6.223]
[ 0.078 -4.876]
[ 3.543 -7.812]
[-5.321 0.123]
[ 4.876 0.987]
[ 2.123 5.654]
[-1.654 -3.211]]
UMAP graph total edge weight: 3421.67
UMAP is typically faster than t-SNE on large datasets and often produces tighter, more distinct clusters.
PCA for Noise Reduction
from sklearn.datasets import load_iris
iris = load_iris()
X_iris = iris.data
print(f"Original first row: {X_iris[0]}")
pca_iris = PCA(n_components=2)
X_reduced = pca_iris.fit_transform(X_iris)
X_reconstructed = pca_iris.inverse_transform(X_reduced)
print(f"Reconstructed first row: {X_reconstructed[0].round(2)}")
reconstruction_mse = np.mean((X_iris - X_reconstructed) ** 2)
print(f"Reconstruction error: {reconstruction_mse:.4f}")
Expected output:
Original first row: [5.1 3.5 1.4 0.2]
Reconstructed first row: [5.08 3.51 1.39 0.22]
Reconstruction error: 0.0204
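The reconstruction is nearly identical to the original because the two discarded components carry very little variance. Treating that discarded variance as measurement noise is an assumption: it holds when the noise is small and spread evenly across directions, but low-variance components can also carry signal.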
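Since the Iris data has no known noise-free version, the denoising effect is easier to verify on synthetic data where the clean signal is known. The sketch below is built entirely on assumptions made for illustration: a 2D signal embedded in 20 dimensions, plus Gaussian noise with standard deviation 0.5.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Clean signal: 300 samples in 20 dimensions that lie on a 2D plane.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 20))
X_clean = latent @ mixing

# Add isotropic Gaussian noise to every feature.
X_noisy = X_clean + rng.normal(scale=0.5, size=X_clean.shape)

# Keep the two highest-variance directions and project back.
pca = PCA(n_components=2).fit(X_noisy)
X_denoised = pca.inverse_transform(pca.transform(X_noisy))

mse_noisy = np.mean((X_noisy - X_clean) ** 2)
mse_denoised = np.mean((X_denoised - X_clean) ** 2)

print(f"MSE of noisy data vs clean signal:    {mse_noisy:.4f}")
print(f"MSE of denoised data vs clean signal: {mse_denoised:.4f}")
```

Denoising works here because the signal is concentrated in two high-variance directions while the noise is spread over all 20. When the signal is not low-rank, the same procedure discards signal along with the noise.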
Practice Questions
- What is the curse of dimensionality and how does PCA help address it?
- Why is t-SNE often better than PCA for visualizing cluster structure?
- When would you choose UMAP over t-SNE?
Related Topics
- Python — running the code
- scikit-learn Guide — provides PCA and t-SNE
- What is Machine Learning — foundational concepts
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.