Transfer Learning with Pretrained Models: Practical Guide

DodaTech Updated 2026-06-22 7 min read

This guide covers the key concepts behind transfer learning with pretrained models, with practical examples and best practices to help you apply it effectively.

Transfer learning is a Machine Learning technique where a model developed for one task is reused as the starting point for a model on a second, related task. Instead of starting from random weights, you begin with weights that have already learned useful features from a large dataset like ImageNet or Wikipedia. This prior knowledge gives you a massive head start — the model already knows how to detect edges, textures, and shapes (for vision) or grammar, syntax, and semantics (for NLP). You only need to adapt these general features to your specific task.

What You'll Learn

In this tutorial, you'll learn two transfer learning strategies, feature extraction and fine-tuning, and how to apply them to vision models (ResNet, VGG) and language models (BERT, GPT) using TensorFlow and Hugging Face Transformers.

Why It Matters

Training deep neural networks from scratch requires massive datasets and days of GPU compute. Transfer learning lets you leverage models trained on millions of examples (such as ImageNet or Wikipedia) and adapt them to your task, sometimes with as few as 100 labeled examples. It is the standard approach for most real-world deep learning applications.

Real-World Use

Consider a medical imaging startup building a skin cancer classifier with transfer learning from ResNet50 pretrained on ImageNet. Instead of collecting a million dermatology images, which is rarely feasible, the team fine-tunes the pretrained model on 5,000 labeled skin lesion images and reaches 95% accuracy in a fraction of the training time. Python and TensorFlow provide the tools for implementing transfer learning at scale.

Feature Extraction

Feature extraction uses the pretrained model as a fixed feature extractor. You remove the original classification head and add a new one for your task. All pretrained layers are frozen, so their weights do not update during training; only the new head learns. This is the fastest transfer learning approach and often needs only a few minutes of training on modest hardware. It works best when your dataset is small and similar to the pretraining data. It can still help in more distant domains: ImageNet-pretrained weights are often a useful starting point for medical images, because low-level features (edges, textures) transfer across domains even when higher-level features do not.

flowchart TD
  A[Pretrained Model] --> B[Freeze All Layers]
  B --> C[Remove Original Head]
  C --> D[Add New Classifier]
  D --> E[Train Only New Head]
  E --> F[Ready for Deployment]

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import VGG16

base_model = VGG16(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)
base_model.trainable = False

model = keras.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(5, activation='softmax')
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

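# Count parameters by summing the number of elements in each variable.
# (numel() is a PyTorch method; tf.size works for TensorFlow/Keras variables.)
trainable = sum(int(tf.size(v)) for v in model.trainable_variables)
nontrainable = sum(int(tf.size(v)) for v in model.non_trainable_variables)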
print(f"Trainable params: {trainable:,}")
print(f"Frozen (pretrained) params: {nontrainable:,}")
print(f"Total params: {trainable + nontrainable:,}")

Expected output:

Trainable params: 132,613
Frozen (pretrained) params: 14,714,688
Total params: 14,847,301
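
The trainable count can be checked by hand, because only the two Dense layers in the new head carry weights. The following is a minimal sketch in plain Python, assuming the head defined above: VGG16's last convolutional block has 512 channels, global average pooling reduces each channel to one value, and Dropout adds no parameters. The helper name `dense_params` is introduced here for illustration only.

```python
def dense_params(inputs, units):
    """Weights (inputs x units) plus one bias per unit."""
    return inputs * units + units

gap_features = 512                  # one value per VGG16 output channel after pooling
hidden = dense_params(gap_features, 256)
output = dense_params(256, 5)       # 5 target classes, as in the model above
head_total = hidden + output

frozen_base = 14_714_688            # VGG16 convolutional base, taken from the output above

print(f"Dense(256): {hidden:,}")
print(f"Dense(5):   {output:,}")
print(f"Head total: {head_total:,}")
print(f"All params: {head_total + frozen_base:,}")
```

The head is under 1% of the total, which is why feature extraction trains quickly.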

Fine-Tuning

Fine-tuning unfreezes some or all of the pretrained layers and trains them alongside the new head. This allows the pretrained features to adapt to the target domain. The key is to use a very low learning rate (1e-5 or lower) so the pretrained weights are adjusted gradually rather than destroyed. A common strategy is to train the new head first with the base frozen, then unfreeze the top layers and continue training with a lower learning rate. This two-stage approach prevents the randomly initialized head from sending large error signals back through the pretrained base.
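
The two-stage schedule can be illustrated on a toy problem without any deep learning framework. The sketch below is an assumption-laden NumPy stand-in, not the Keras workflow: a random matrix plays the role of the pretrained base, the data and learning rates are made up for the demonstration, and `loss_and_grads` is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and a stand-in "pretrained" base (all values illustrative).
X = rng.normal(size=(64, 8))
W_pretrained = rng.normal(scale=0.5, size=(8, 16))
W_target = W_pretrained + rng.normal(scale=0.1, size=(8, 16))  # a related task
y = np.tanh(X @ W_target) @ rng.normal(size=16)

W_base = W_pretrained.copy()
w_head = rng.normal(scale=0.1, size=16)  # new, randomly initialized head

def loss_and_grads(W_base, w_head):
    h = np.tanh(X @ W_base)                          # features from the base
    err = h @ w_head - y                             # prediction error
    loss = np.mean(err ** 2)
    g_head = 2 * h.T @ err / len(y)
    g_hidden = np.outer(err, w_head) * (1 - h ** 2)  # backprop through tanh
    g_base = 2 * X.T @ g_hidden / len(y)
    return loss, g_base, g_head

loss_start, _, _ = loss_and_grads(W_base, w_head)

# Stage 1: base frozen, only the head is updated.
for _ in range(200):
    _, _, g_head = loss_and_grads(W_base, w_head)
    w_head -= 0.02 * g_head
base_unchanged = np.array_equal(W_base, W_pretrained)
loss_stage1, _, _ = loss_and_grads(W_base, w_head)

# Stage 2: unfreeze the base and update everything with a lower learning rate.
for _ in range(200):
    _, g_base, g_head = loss_and_grads(W_base, w_head)
    W_base -= 1e-3 * g_base
    w_head -= 1e-3 * g_head
loss_stage2, _, _ = loss_and_grads(W_base, w_head)
drift = np.linalg.norm(W_base - W_pretrained) / np.linalg.norm(W_pretrained)

print(f"Loss: start {loss_start:.4f}, after stage 1 {loss_stage1:.4f}, after stage 2 {loss_stage2:.4f}")
print(f"Base unchanged after stage 1: {base_unchanged}")
print(f"Relative change in base weights after stage 2: {drift:.4f}")
```

Stage 1 leaves the base untouched, and stage 2 moves it only slightly because of the smaller step size.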

base_model.trainable = True

for layer in base_model.layers[:15]:
    layer.trainable = False

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Recount after unfreezing (tf.size replaces the PyTorch-only numel()).
trainable_fine = sum(int(tf.size(v)) for v in model.trainable_variables)
nontrainable_fine = sum(int(tf.size(v)) for v in model.non_trainable_variables)
print("After unfreezing:")
print(f"  Trainable params: {trainable_fine:,}")
print(f"  Frozen params: {nontrainable_fine:,}")

Expected output:

After unfreezing:
  Trainable params: 7,212,037
  Frozen params: 7,635,264
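
To see which weights `base_model.layers[:15]` freezes, the VGG16 layer list can be tallied by hand. This is a sketch in plain Python, assuming the usual 19-layer `include_top=False` layout (one input layer, 13 convolutions with 3x3 kernels, and 5 pooling layers). The helper `conv_params` and the layer table are written out here for illustration, and the input layer's actual name varies between Keras versions.

```python
def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of one Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

# (layer name, parameter count); input and pooling layers have no weights.
vgg16_layers = [
    ("input", 0),
    ("block1_conv1", conv_params(3, 64)),
    ("block1_conv2", conv_params(64, 64)),
    ("block1_pool", 0),
    ("block2_conv1", conv_params(64, 128)),
    ("block2_conv2", conv_params(128, 128)),
    ("block2_pool", 0),
    ("block3_conv1", conv_params(128, 256)),
    ("block3_conv2", conv_params(256, 256)),
    ("block3_conv3", conv_params(256, 256)),
    ("block3_pool", 0),
    ("block4_conv1", conv_params(256, 512)),
    ("block4_conv2", conv_params(512, 512)),
    ("block4_conv3", conv_params(512, 512)),
    ("block4_pool", 0),
    ("block5_conv1", conv_params(512, 512)),
    ("block5_conv2", conv_params(512, 512)),
    ("block5_conv3", conv_params(512, 512)),
    ("block5_pool", 0),
]

frozen = sum(count for _, count in vgg16_layers[:15])    # input through block4_pool
unfrozen = sum(count for _, count in vgg16_layers[15:])  # block 5 only
head = (512 * 256 + 256) + (256 * 5 + 5)                 # Dense(256) + Dense(5)

print(f"Last frozen layer: {vgg16_layers[14][0]}")
print(f"Frozen base params: {frozen:,}")
print(f"Trainable params (block 5 + head): {unfrozen + head:,}")
```

Slicing at index 15 freezes everything through `block4_pool`, so only the three block 5 convolutions and the new head are trained.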

Transfer Learning with BERT

For NLP tasks, BERT is a widely used starting point for transfer learning. BERT was pretrained on English Wikipedia and BookCorpus using masked language modeling (predicting randomly masked words from context) together with next-sentence prediction. This bidirectional training gives BERT a deep understanding of language context. When you fine-tune BERT on your task, it adapts this general language understanding to your specific domain. The tokenizer converts text into subword tokens that BERT understands, handling out-of-vocabulary words gracefully by breaking them into known subword units.

# The TF* classes are only available in transformers releases that include TensorFlow support.
from transformers import TFBertModel, BertTokenizer

bert_model = TFBertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

texts = [
    "This movie was amazing and fantastic",
    "The product was terrible and disappointing"
]

inputs = tokenizer(
    texts,
    padding='max_length',  # pad every sequence to max_length; padding=True would pad only to the longest text
    truncation=True,
    max_length=128,
    return_tensors='tf'
)

outputs = bert_model(inputs)
print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Attention mask shape: {inputs['attention_mask'].shape}")
print(f"Last hidden state shape: {outputs.last_hidden_state.shape}")
print(f"Pooler output shape: {outputs.pooler_output.shape}")

Expected output:

Input IDs shape: (2, 128)
Attention mask shape: (2, 128)
Last hidden state shape: (2, 128, 768)
Pooler output shape: (2, 768)
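
The subword splitting mentioned above can be sketched with a toy WordPiece tokenizer. This is a simplified, hypothetical illustration in plain Python of greedy longest-match-first splitting. The vocabulary is invented for the example and is far smaller than the real `bert-base-uncased` vocabulary, and the real tokenizer also handles punctuation and text normalization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no known piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def encode(text, vocab):
    """Wrap the word pieces with the [CLS] and [SEP] boundary tokens."""
    pieces = ["[CLS]"]
    for word in text.lower().split():
        pieces.extend(wordpiece_tokenize(word, vocab))
    pieces.append("[SEP]")
    return pieces

# Toy vocabulary, invented for this example.
toy_vocab = {"un", "##believ", "##able", "movie", "##s", "fantastic"}

print(encode("Unbelievable movies", toy_vocab))
# ['[CLS]', 'un', '##believ', '##able', 'movie', '##s', '[SEP]']
```

A word never seen as a whole ("unbelievable") is still represented by known pieces, which is what lets the pretrained embeddings carry over to new text.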

Fine-Tuning BERT for Classification

from transformers import TFBertForSequenceClassification

classifier = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

labels = tf.constant([1, 0])

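# Run the forward pass under a GradientTape so gradients can be taken for a training step.
with tf.GradientTape() as tape:
    outputs = classifier(inputs, labels=labels)
    # The returned loss may be per-example, so reduce it to a scalar.
    loss = tf.reduce_mean(outputs.loss)
    logits = outputs.logits

predictions = tf.nn.softmax(logits, axis=-1)
print(f"Loss: {float(loss):.4f}")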
print(f"Logits: {logits.numpy()}")
print(f"Predictions: {predictions.numpy()}")

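Example output (exact values vary between runs because the classification head is randomly initialized; the loss should be close to ln 2 ≈ 0.693):

Loss: 0.6653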
Logits: [[-0.021  0.015]
         [ 0.032 -0.045]]
Predictions: [[0.491 0.509]
              [0.519 0.481]]
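
An untrained two-class head produces logits near zero, so its cross-entropy loss starts near ln 2 ≈ 0.693. The sketch below, in plain Python, takes the illustrative logits printed above as assumed inputs and recomputes the softmax probabilities and the mean loss by hand.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

logits = [[-0.021, 0.015], [0.032, -0.045]]  # illustrative values from the output above
labels = [1, 0]

probs = [softmax(row) for row in logits]
losses = [cross_entropy(row, label) for row, label in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)

print("Probabilities:", [[round(p, 3) for p in row] for row in probs])
print(f"Mean loss: {mean_loss:.4f}  (ln 2 = {math.log(2):.4f})")
```

A starting loss far above ln 2 on a balanced two-class problem usually points to a label or preprocessing problem rather than a model problem.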

Transfer Learning Strategies

Strategy	When to Use	Steps
Feature Extraction	Small dataset, similar domain	Freeze base, train new head
Fine-Tuning	Medium dataset, different domain	Train head first, then unfreeze and fine-tune
Full Training	Large dataset, unique domain	Use pretrained weights as initialization, then train all layers
Adapter Tuning	Limited compute, multiple tasks	Insert small adapter modules, keep base frozen
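
Adapter tuning is the one strategy in the table not shown in code elsewhere in this guide. The sketch below illustrates one common adapter design in NumPy: a small bottleneck with a residual connection placed after a frozen layer. The sizes (768 and 64), the zero initialization of the up-projection, and all names are assumptions chosen for illustration, not a specific library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 768, 64  # assumed sizes, chosen for illustration

# Frozen "pretrained" layer: stands in for one block of the base model.
W_frozen = rng.normal(scale=0.02, size=(d_model, d_model))

# Trainable adapter: down-project, nonlinearity, up-project, residual add.
W_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
b_down = np.zeros(bottleneck)
W_up = np.zeros((bottleneck, d_model))  # zero init: the adapter starts as the identity
b_up = np.zeros(d_model)

def frozen_layer(x):
    return np.tanh(x @ W_frozen)

def adapter(h):
    z = np.maximum(h @ W_down + b_down, 0.0)  # ReLU bottleneck
    return h + z @ W_up + b_up                # residual connection

x = rng.normal(size=(4, d_model))
h = frozen_layer(x)
out = adapter(h)

adapter_params = W_down.size + b_down.size + W_up.size + b_up.size
print(f"Adapter (trainable) params: {adapter_params:,}")
print(f"Frozen layer params: {W_frozen.size:,}")
print("Adapter is identity at initialization:", np.allclose(out, h))
```

Only the adapter weights would be updated during training, so each new task adds a small set of parameters on top of a shared frozen base.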

Common Errors and Mistakes

Mistake	Why It Happens	How to Fix
Learning rate too high	Destroys pretrained features during fine-tuning	Use 1e-5 or lower for fine-tuning
Unfreezing too much too fast	Model forgets pretrained knowledge	Unfreeze gradually, starting from the top
Wrong input preprocessing	Pretrained models expect specific normalization	Use the model's preprocess_input function
Not freezing BatchNorm	BatchNorm running statistics get corrupted during fine-tuning	Keep BatchNorm layers at trainable=False and call the base model with training=False
Training head and fine-tuning together	Random head destroys pretrained features	Train head to convergence first, then fine-tune
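
The preprocessing mistake in the table is easy to make silently, because the model still runs on wrongly scaled inputs. As a rough illustration, the sketch below re-implements in NumPy what Keras documents for VGG16's `preprocess_input`: convert RGB to BGR, then subtract the ImageNet channel means, with no scaling. The function names are hypothetical and written for this example; in real code, call `keras.applications.vgg16.preprocess_input` instead.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by "caffe"-style preprocessing.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg_style_preprocess(images_rgb):
    """images_rgb: array of shape (..., 3) with RGB values in [0, 255]."""
    images = np.asarray(images_rgb, dtype=np.float32)
    bgr = images[..., ::-1]          # RGB -> BGR
    return bgr - IMAGENET_MEAN_BGR   # zero-center each channel, no scaling

def scale_to_unit(images_rgb):
    """A common mistake for this model family: scaling to [0, 1] instead."""
    return np.asarray(images_rgb, dtype=np.float32) / 255.0

gray = np.full((1, 224, 224, 3), 128.0, dtype=np.float32)  # a mid-gray test image
right = vgg_style_preprocess(gray)
wrong = scale_to_unit(gray)

print("VGG-style value range:", float(right.min()), float(right.max()))
print("Unit-scaled value range:", float(wrong.min()), float(wrong.max()))
```

The two pipelines put the same image in very different numeric ranges, and the frozen base only saw one of them during pretraining.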

Practice Questions

What is the key difference between feature extraction and fine-tuning?

Answer: Feature extraction freezes the pretrained base and only trains the new classifier head. Fine-tuning unfreezes some or all of the base and trains everything together with a low learning rate.

Why use a low learning rate when fine-tuning?

Answer: A low learning rate prevents large weight updates that would destroy the useful features learned during pretraining. The pretrained features need small adjustments, not complete reorganization.

When should you use feature extraction instead of fine-tuning?

Answer: When you have a small dataset or the target domain is very similar to the pretraining domain. Feature extraction is faster and less prone to overfitting than fine-tuning.

How does BERT encode input text for transfer learning?

Answer: BERT uses a tokenizer to convert text into input IDs, attention masks, and token type IDs. The text is split into WordPiece subword tokens, each mapped to an input ID, and the special tokens [CLS] and [SEP] mark sequence boundaries.

What is the role of the pooling layer when using a CNN for feature extraction?

Answer: The pooling layer (GlobalAveragePooling2D) reduces the spatial dimensions of the feature maps to a fixed-size vector per image, which can then be fed into a dense classifier. It removes spatial location information while preserving feature presence.

Challenge

Implement transfer learning on the Oxford Flowers 102 dataset. Compare three approaches: feature extraction with ResNet50, fine-tuning ResNet50 (unfreeze top 20 layers), and training from scratch. For each approach, report accuracy, training time, and the number of trainable parameters. Use data augmentation for all three.

Real-World Task

Design a transfer learning system for detecting phishing emails. For the text component, fine-tune BERT to classify email bodies as "phishing" or "legitimate." For the visual component, use a pretrained ResNet to classify email screenshots. Combine both models in an ensemble and deploy as a browser extension that warns users about suspicious emails.

Next Steps

Next, explore Hugging Face Transformers for advanced NLP and review NLP Basics. Python and TensorFlow are essential for implementing transfer learning in production.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
