Transfer Learning with Pretrained Models: Practical Guide
This guide explains the key concepts behind transfer learning with pretrained models and walks through practical examples and best practices for applying it.
Transfer learning is a machine learning technique where a model developed for one task is reused as the starting point for a model on a second, related task. Instead of starting from random weights, you begin with weights that have already learned useful features from a large dataset such as ImageNet or Wikipedia. This prior knowledge gives you a substantial head start: the model already knows how to detect edges, textures, and shapes (for vision) or grammar, syntax, and semantics (for NLP). You only need to adapt these general features to your specific task.
What You'll Learn
You'll learn two transfer learning strategies, feature extraction and fine-tuning, and how to apply them to vision models (such as VGG and ResNet) and language models (such as BERT and GPT) using TensorFlow and Hugging Face Transformers.
Why It Matters
Training deep neural networks from scratch requires massive datasets and days of GPU compute. Transfer learning lets you leverage models trained on millions of examples (such as ImageNet or Wikipedia) and adapt them to your task with far less labeled data, sometimes as few as 100 examples. It is the standard approach for most real-world deep learning applications.
Real-World Use
A medical imaging startup building a skin cancer classifier uses transfer learning from ResNet50 pretrained on ImageNet. Instead of collecting a million dermatology images (practically impossible), they fine-tune the pretrained model on 5,000 labeled skin lesion images and achieve 95% accuracy in a fraction of the training time. Python and TensorFlow provide the tools for implementing transfer learning at scale.
Feature Extraction
Feature extraction uses the pretrained model as a fixed feature extractor. You remove the original classification head and add a new one for your task. All pretrained layers are frozen, so their weights do not update during training. Only the new head learns.
This is the fastest transfer learning approach, and it often needs only a few minutes of training on modest hardware. It works best when your dataset is small and similar to the pretraining data. It can also work for less similar domains: ImageNet-pretrained weights are often used to classify medical images, because low-level features (edges, textures) transfer across domains.
In short: freeze the pretrained base and train only the new classifier head.
flowchart TD
    A[Pretrained Model] --> B[Freeze All Layers]
    B --> C[Remove Original Head]
    C --> D[Add New Classifier]
    D --> E[Train Only New Head]
    E --> F[Ready for Deployment]
import math

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import VGG16

# Load the VGG16 convolutional base with ImageNet weights, without the original classifier.
base_model = VGG16(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)
# Freeze every pretrained layer so that only the new head is trained.
base_model.trainable = False

# New classification head for a 5-class task.
model = keras.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(5, activation='softmax')
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Count scalar parameters by multiplying out each weight tensor's shape.
trainable = sum(math.prod(w.shape) for w in model.trainable_weights)
nontrainable = sum(math.prod(w.shape) for w in model.non_trainable_weights)
print(f"Trainable params: {trainable:,}")
print(f"Frozen (pretrained) params: {nontrainable:,}")
print(f"Total params: {trainable + nontrainable:,}")
Expected output:
Trainable params: 132,613
Frozen (pretrained) params: 14,714,688
Total params: 14,847,301
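The parameter counts for this architecture can be checked by hand. The sketch below assumes the standard VGG16 layout of thirteen 3x3 convolutions and the 256-unit head defined above. A convolution has `k*k*c_in*c_out + c_out` parameters and a dense layer has `n_in*n_out + n_out`; pooling and dropout layers add none.

```python
# (in_channels, out_channels) for each 3x3 convolution in the VGG16 base.
VGG16_CONVS = [
    (3, 64), (64, 64),                       # block 1
    (64, 128), (128, 128),                   # block 2
    (128, 256), (256, 256), (256, 256),      # block 3
    (256, 512), (512, 512), (512, 512),      # block 4
    (512, 512), (512, 512), (512, 512),      # block 5
]


def conv_params(c_in, c_out, kernel=3):
    """Kernel weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out


base_params = sum(conv_params(c_in, c_out) for c_in, c_out in VGG16_CONVS)
# Global average pooling turns the 512 feature maps into a 512-vector.
head_params = dense_params(512, 256) + dense_params(256, 5)

print(f"Base (frozen): {base_params:,}")                 # 14,714,688
print(f"Head (trainable): {head_params:,}")              # 132,613
print(f"Total: {base_params + head_params:,}")           # 14,847,301
```

With the base frozen, fewer than 1% of the model's parameters are updated, which is why feature extraction trains so quickly.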
Fine-Tuning
Fine-tuning unfreezes some or all of the pretrained layers and trains them alongside the new head. This allows the pretrained features to adapt to the target domain. The key is to use a very low learning rate (1e-5 or lower) so the pretrained weights are adjusted gradually rather than destroyed.
A common strategy is to train the new head first with the base frozen, then unfreeze the top layers and continue training with a lower learning rate. This two-stage approach prevents the randomly initialized head from sending large error signals back through the pretrained base.
In short: unfreeze part of the pretrained base and train everything together with a low learning rate.
import math

# Unfreeze the base, then re-freeze the first 15 layers.
# In VGG16 this leaves only the last convolutional block (block 5) trainable.
base_model.trainable = True
for layer in base_model.layers[:15]:
    layer.trainable = False

# Recompile so the change takes effect, using a much lower learning rate.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
trainable_fine = sum(math.prod(w.shape) for w in model.trainable_weights)
nontrainable_fine = sum(math.prod(w.shape) for w in model.non_trainable_weights)
print("After unfreezing:")
print(f" Trainable params: {trainable_fine:,}")
print(f" Frozen params: {nontrainable_fine:,}")
Expected output:
After unfreezing:
 Trainable params: 7,212,037
 Frozen params: 7,635,264
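The two-stage schedule described in this section can be sketched end to end as follows. This is a minimal illustration, not a training recipe. A tiny randomly initialized CNN stands in for the pretrained base so the example runs quickly without downloading weights, and the layer sizes, random data, and epoch counts are all assumptions made for the demonstration.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

np.random.seed(0)
tf.random.set_seed(0)

# Stand-in for a pretrained base (assumption: a real base would be e.g. VGG16).
base = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(8, 3, activation="relu"),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
])
model = keras.Sequential([base, keras.layers.Dense(5, activation="softmax")])

# Hypothetical toy data: 16 random images, 5 classes.
x = np.random.rand(16, 32, 32, 3).astype("float32")
y = np.random.randint(0, 5, size=16)

# Stage 1: freeze the base and train only the head at a normal learning rate.
base.trainable = False
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy")
before_stage1 = [w.copy() for w in base.get_weights()]
model.fit(x, y, epochs=2, batch_size=8, verbose=0)
stage1_base_unchanged = all(
    np.array_equal(a, b) for a, b in zip(before_stage1, base.get_weights())
)

# Stage 2: unfreeze the base, recompile, and continue at a much lower rate.
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy")
before_stage2 = [w.copy() for w in base.get_weights()]
model.fit(x, y, epochs=2, batch_size=8, verbose=0)
stage2_base_changed = any(
    not np.array_equal(a, b) for a, b in zip(before_stage2, base.get_weights())
)

print("Base untouched in stage 1:", stage1_base_unchanged)
print("Base updated in stage 2:", stage2_base_changed)
```

Recompiling after changing `trainable` matters: the set of trainable weights is fixed when the model is compiled.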
Transfer Learning with BERT
For NLP tasks, BERT is a widely used starting point for transfer learning. BERT was pretrained on English Wikipedia and BookCorpus, primarily with masked language modeling: predicting randomly masked words from the surrounding context. Because the model sees the context on both sides of each masked word, it learns bidirectional representations of language. When you fine-tune BERT on your task, it adapts this general language understanding to your specific domain. The tokenizer converts text into subword tokens that BERT understands, handling out-of-vocabulary words gracefully by breaking them into known subword units.
from transformers import TFBertModel, BertTokenizer

bert_model = TFBertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
texts = [
    "This movie was amazing and fantastic",
    "The product was terrible and disappointing"
]
# Pad every sequence to max_length so the batch has a fixed shape of (2, 128).
# padding=True would pad only to the longest sequence in the batch.
inputs = tokenizer(
    texts,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors='tf'
)
outputs = bert_model(inputs)
print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Attention mask shape: {inputs['attention_mask'].shape}")
print(f"Last hidden state shape: {outputs.last_hidden_state.shape}")
print(f"Pooler output shape: {outputs.pooler_output.shape}")
Expected output:
Input IDs shape: (2, 128)
Attention mask shape: (2, 128)
Last hidden state shape: (2, 128, 768)
Pooler output shape: (2, 768)
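To make the subword splitting concrete, here is a simplified sketch of WordPiece-style greedy longest-match tokenization, followed by the construction of input IDs and an attention mask. The toy vocabulary and the `wordpiece` and `encode` helpers are hypothetical and exist only for illustration; the real tokenizer has a vocabulary of roughly 30,000 entries and handles punctuation and normalization that this sketch ignores.

```python
# Hypothetical toy vocabulary; continuation pieces are prefixed with "##".
TOY_VOCAB = {
    token: index
    for index, token in enumerate([
        "[PAD]", "[UNK]", "[CLS]", "[SEP]",
        "the", "model", "was", "un", "##believ", "##able",
    ])
}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word greedily, always taking the longest piece in the vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length):
    """Add special tokens, truncate, pad, and build the attention mask."""
    pieces = []
    for word in text.lower().split():
        pieces.extend(wordpiece(word, vocab))
    tokens = ["[CLS]"] + pieces[: max_length - 2] + ["[SEP]"]
    attention_mask = [1] * len(tokens) + [0] * (max_length - len(tokens))
    tokens = tokens + ["[PAD]"] * (max_length - len(tokens))
    input_ids = [vocab[token] for token in tokens]
    return tokens, input_ids, attention_mask


tokens, input_ids, attention_mask = encode(
    "The model was unbelievable", TOY_VOCAB, max_length=10
)
print(tokens)
print(input_ids)       # [2, 4, 5, 6, 7, 8, 9, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The word "unbelievable" is not in the toy vocabulary, yet it is still represented through the known pieces `un`, `##believ`, and `##able`. The attention mask tells the model to ignore the padding positions.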
Fine-Tuning BERT for Classification
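from transformers import TFBertForSequenceClassification

# Load BERT with a new, randomly initialized 2-class classification head.
classifier = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)
labels = tf.constant([1, 0])  # 1 = positive, 0 = negative
# The tape records the forward pass so gradients can later be taken with tape.gradient(...).
with tf.GradientTape() as tape:
    outputs = classifier(inputs, labels=labels)
    # Reduce to a scalar in case the loss is returned per example.
    loss = tf.reduce_mean(outputs.loss)
    logits = outputs.logits
predictions = tf.nn.softmax(logits, axis=-1)
print(f"Loss: {loss.numpy():.4f}")
print(f"Logits: {logits.numpy()}")
print(f"Predictions: {predictions.numpy()}")
Example output (exact values vary from run to run because the classification head is randomly initialized):
Loss: 0.6653
Logits: [[-0.021 0.015]
 [ 0.032 -0.045]]
Predictions: [[0.491 0.509]
 [0.519 0.481]]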
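The relationship between logits, predictions, and loss can be reproduced in plain Python. The sketch below takes the illustrative logits shown above as given and applies softmax and cross-entropy by hand. It assumes the loss is the mean cross-entropy over the batch.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


logits = [[-0.021, 0.015], [0.032, -0.045]]
labels = [1, 0]

probs = [softmax(row) for row in logits]
losses = [cross_entropy(row, label) for row, label in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)

print([[round(p, 3) for p in row] for row in probs])  # [[0.491, 0.509], [0.519, 0.481]]
print(round(mean_loss, 4))                            # 0.6653
print(round(math.log(2), 4))                          # 0.6931
```

Before any fine-tuning, the head has almost no preference between the two classes, so the loss sits close to ln 2 ≈ 0.6931, the value for a 50/50 guess. A loss that drops well below this during training indicates the classifier is learning.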
Transfer Learning Strategies
| Strategy | When to Use | Steps |
|---|---|---|
| Feature Extraction | Small dataset, similar domain | Freeze base, train new head |
| Fine-Tuning | Medium dataset, different domain | Train head first, then unfreeze and fine-tune |
| Full Training | Large dataset, unique domain | Use pretrained weights as initialization |
| Adapter Tuning | Limited compute, multiple tasks | Insert small adapter modules, keep base frozen |
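As a rough illustration of the adapter row, the NumPy sketch below implements one common form of adapter: a small bottleneck layer with a residual connection, inserted next to a frozen layer. The hidden size of 768 matches the BERT output shown earlier; the bottleneck size of 64, the initialization, and the `adapter` function itself are assumptions chosen for the example.

```python
import numpy as np


def adapter(h, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: project down, apply ReLU, project up, add the input back."""
    z = np.maximum(0.0, h @ w_down + b_down)
    return h + z @ w_up + b_up


hidden, bottleneck = 768, 64  # bottleneck size is an assumed value
rng = np.random.default_rng(0)

w_down = rng.normal(0.0, 0.02, size=(hidden, bottleneck))
b_down = np.zeros(bottleneck)
w_up = np.zeros((bottleneck, hidden))  # zero init: the adapter starts as an identity map
b_up = np.zeros(hidden)

h = rng.normal(size=(2, hidden))  # stand-in for a frozen layer's output
out = adapter(h, w_down, b_down, w_up, b_up)

adapter_params = w_down.size + b_down.size + w_up.size + b_up.size
dense_params = hidden * hidden + hidden  # one full 768x768 dense layer, for comparison

print(np.allclose(out, h))     # True: the pretrained behavior is preserved at the start
print(f"{adapter_params:,}")   # 99,136
print(f"{dense_params:,}")     # 590,592
```

Only the adapter weights are trained, so each new task adds a small set of parameters while the shared base stays untouched.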
Common Errors and Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Learning rate too high | Destroys pretrained features during fine-tuning | Use 1e-5 or lower for fine-tuning |
| Unfreezing too much too fast | Model forgets pretrained knowledge | Unfreeze gradually, starting from the top |
| Wrong input preprocessing | Pretrained models expect specific normalization | Use the model's preprocess_input function |
| Not freezing BatchNorm | BatchNorm statistics get corrupted during fine-tuning | Set BatchNorm layers to trainable=False |
| Training head and fine-tuning together | Random head destroys pretrained features | Train head to convergence first, then fine-tune |
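The preprocessing mistake is easy to make because each model family expects a different input format. As a small sketch, the snippet below runs the Keras VGG16 `preprocess_input` helper on a hypothetical 2x2 image of pure red pixels. For VGG16 the helper is documented to convert RGB to BGR and subtract the per-channel ImageNet means, without scaling values to [0, 1].

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import preprocess_input

# Hypothetical input: one 2x2 RGB image, every pixel pure red, values in [0, 255].
images = np.zeros((1, 2, 2, 3), dtype="float32")
images[..., 0] = 255.0  # red channel

# Pass a copy, because the helper may modify a NumPy array in place.
prepared = preprocess_input(images.copy())

print(images[0, 0, 0])    # original pixel in RGB order: [255, 0, 0]
print(prepared[0, 0, 0])  # BGR order with means subtracted, about [-103.939, -116.779, 131.32]
```

Feeding raw [0, 255] pixels or [0, 1] scaled pixels to a model that expects this format usually does not raise an error. It just quietly degrades accuracy, so it is worth checking a single pixel like this.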
Practice Questions
- What is the key difference between feature extraction and fine-tuning?
Answer: Feature extraction freezes the pretrained base and only trains the new classifier head. Fine-tuning unfreezes some or all of the base and trains everything together with a low learning rate.
- Why use a low learning rate when fine-tuning?
Answer: A low learning rate prevents large weight updates that would destroy the useful features learned during pretraining. The pretrained features need small adjustments, not complete reorganization.
- When should you use feature extraction instead of fine-tuning?
Answer: When you have a small dataset or the target domain is very similar to the pretraining domain. Feature extraction is faster and less prone to overfitting than fine-tuning.
- How does BERT encode input text for transfer learning?
Answer: BERT uses a tokenizer that splits text into WordPiece subword tokens and converts them into input IDs, an attention mask, and token type IDs. The special tokens [CLS] and [SEP] mark the start of the input and the end of each sequence.
- What is the role of the pooling layer when using a CNN for feature extraction?
Answer: The pooling layer (GlobalAveragePooling2D) reduces the spatial dimensions of the feature maps to a fixed-size vector per image, which can then be fed into a dense classifier. It removes spatial location information while preserving feature presence.
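The pooling step from the last answer can be shown with a few lines of NumPy. The feature-map shape of 7x7x512 corresponds to what the VGG16 base produces for a 224x224 input; the random values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder feature maps: batch of 2 images, 7x7 spatial grid, 512 channels.
feature_maps = rng.random((2, 7, 7, 512)).astype("float32")

# Global average pooling: average over the two spatial axes.
pooled = feature_maps.mean(axis=(1, 2))
print(feature_maps.shape, "->", pooled.shape)  # (2, 7, 7, 512) -> (2, 512)

# Shuffling the 49 spatial positions leaves the pooled vector unchanged,
# which shows that location information is discarded.
order = rng.permutation(49)
shuffled = feature_maps.reshape(2, 49, 512)[:, order, :].reshape(2, 7, 7, 512)
same_after_shuffle = np.allclose(pooled, shuffled.mean(axis=(1, 2)), atol=1e-5)
print(same_after_shuffle)  # True
```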
Challenge
Implement transfer learning on the Oxford Flowers 102 dataset. Compare three approaches: feature extraction with ResNet50, fine-tuning ResNet50 (unfreeze top 20 layers), and training from scratch. For each approach, report accuracy, training time, and the number of trainable parameters. Use data augmentation for all three.
Real-World Task
Design a transfer learning system for detecting phishing emails. For the text component, fine-tune BERT to classify email bodies as "phishing" or "legitimate." For the visual component, use a pretrained ResNet to classify email screenshots. Combine both models in an ensemble and deploy as a browser extension that warns users about suspicious emails.
Next Steps
Next, go deeper into Hugging Face Transformers for advanced NLP, and review NLP basics if you need a refresher. Python and TensorFlow are essential for implementing transfer learning in production.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.