CNNs for Image Classification: Convolutional Neural Networks Guide

DodaTech Updated 2026-06-22 7 min read

This tutorial covers CNNs for image classification: the key concepts, practical examples, and best practices you need to understand and apply them effectively.

Convolutional Neural Networks (CNNs) are Deep Learning architectures that process images using convolutional filters to learn hierarchical spatial features automatically. They are the foundation of modern Computer Vision systems. Unlike fully connected networks, which treat every pixel as an independent input and ignore where it sits, CNNs exploit the spatial structure of images: nearby pixels are more related than distant ones, and patterns can appear anywhere in the image. Because the same filter is applied at every position, a pattern is detected wherever it occurs. This translation equivariance, which pooling turns into approximate translation invariance, is what makes CNNs so powerful for visual tasks.

What You'll Learn

In this tutorial, you'll learn how CNNs work, the role of convolution and pooling layers, how to build CNN architectures with TensorFlow/Keras, how data augmentation improves generalization, and how transfer learning with pretrained models accelerates training.

Why It Matters

CNNs revolutionized Computer Vision, matching or approaching human-level accuracy on some image classification benchmarks and driving large gains in object detection and segmentation. They are used in medical imaging for disease diagnosis, autonomous vehicles for object detection, and security systems for face recognition. Python and TensorFlow provide the tools for building and deploying CNNs efficiently.

Real-World Use

Durga Antivirus Pro uses a CNN-based vision system to analyze screenshots of suspicious applications. The CNN classifies UI layouts as "malicious" or "benign" based on visual patterns, detecting fake login screens and phishing pages that mimic legitimate apps.

How Convolution Works

A convolution slides a small filter (kernel) across the image, computing a dot product at each position to produce a feature map that shows the filter's response at every location. A convolution layer learns multiple filters, each detecting a specific pattern. Early layers detect simple patterns like edges and corners; deeper layers combine these into complex patterns like eyes, wheels, or text. The network learns which filters are useful through backpropagation, just like any other neural network layer.
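The sliding-window computation can be sketched in plain NumPy. This is a minimal illustration, not how Keras implements `Conv2D`: the helper name `conv2d_valid`, the 5x5 test image, and the hand-written edge kernel are all assumptions made for the example, and a real layer would learn its kernels and add a bias and activation. As in most deep learning libraries, the kernel is not flipped, so the operation is strictly a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Dot product between the kernel and the patch under it
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# 5x5 image: dark on the left (0), bright on the right (1), i.e. a vertical edge
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float64)

# Hand-written vertical-edge filter (hypothetical; a trained CNN learns its own)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float64)

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # every row is [3. 3. 0.]
```

The response is strong (3) wherever the window straddles the dark-to-bright edge and zero where the window covers a uniform region, which is exactly what a feature map is meant to show.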

flowchart LR
  A[Input Image] --> B[Convolution Layer]
  B --> C[Activation ReLU]
  C --> D[Pooling Layer]
  D --> E[More Conv+Pool]
  E --> F[Flatten]
  F --> G[Dense Layers]
  G --> H[Class Probabilities]

Building a Simple CNN

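import tensorflow as tf
from tensorflow import keras

# Three convolution layers extract features; the dense layers classify them.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),  # 32x32 RGB images (CIFAR-10 size)
    keras.layers.Conv2D(32, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')  # one probability per class
])

model.compile(
    optimizer='adam',
    # Labels are integer class IDs, so use the sparse variant of cross-entropy
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()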

Expected output:

Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                   ┃ Output Shape       ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ conv2d (Conv2D)                │ (None, 30, 30, 32) │        896 │
│ max_pooling2d (MaxPooling2D)   │ (None, 15, 15, 32) │          0 │
│ conv2d_1 (Conv2D)              │ (None, 13, 13, 64) │     18,496 │
│ max_pooling2d_1 (MaxPooling2D) │ (None, 6, 6, 64)   │          0 │
│ conv2d_2 (Conv2D)              │ (None, 4, 4, 64)   │     36,928 │
│ flatten (Flatten)              │ (None, 1024)       │          0 │
│ dense (Dense)                  │ (None, 64)         │     65,600 │
│ dense_1 (Dense)                │ (None, 10)         │        650 │
└────────────────────────────────┴────────────────────┴────────────┘
 Total params: 122,570 (478.79 KB)
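The shapes and parameter counts in the summary can be checked by hand. The sketch below uses plain Python and three hypothetical helper functions (they are not part of Keras) that apply the standard formulas: output size = (input + 2 x padding - kernel) // stride + 1, and parameters = weights plus one bias per filter or unit.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units

size = 32
size = conv_output_size(size, 3)            # conv2d          -> 30
size = conv_output_size(size, 2, stride=2)  # max_pooling2d   -> 15
size = conv_output_size(size, 3)            # conv2d_1        -> 13
size = conv_output_size(size, 2, stride=2)  # max_pooling2d_1 -> 6
size = conv_output_size(size, 3)            # conv2d_2        -> 4
flat = size * size * 64                     # flatten         -> 1024

total = (conv_params(3, 3, 32)      # 896
         + conv_params(3, 32, 64)   # 18,496
         + conv_params(3, 64, 64)   # 36,928
         + dense_params(flat, 64)   # 65,600
         + dense_params(64, 10))    # 650
print(size, flat, total)  # 4 1024 122570
```

Note that more than half of the parameters sit in the first dense layer, which is why the size of the final feature map matters so much for model size.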

Understanding CNN Architecture Design

When designing a CNN, consider the input size, the number of filters per layer, and the receptive field. A common pattern is to double the number of filters after each pooling layer while halving spatial dimensions. This keeps the computational cost per layer roughly constant: halving height and width cuts the number of positions by four, while doubling both input and output channels multiplies the work per position by four. Start with 32 filters, then 64, then 128. The final feature maps are typically 4x4 or 2x2 in spatial size before flattening. Dropout between dense layers reduces co-adaptation of features, and batch normalization stabilizes training by normalizing layer inputs.
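The "roughly constant cost" rule of thumb can be verified with a little arithmetic. The sketch below counts multiply-accumulate operations (MACs) under stated assumptions: 3x3 kernels, 'same' padding, stride 1, and layers whose input and output channel counts are equal. The helper `conv_macs` and the three example stages are illustrative, not taken from a specific model.

```python
def conv_macs(height, width, kernel, in_channels, filters):
    """Multiply-accumulates for one 'same'-padded, stride-1 convolution layer."""
    return height * width * kernel * kernel * in_channels * filters

# (spatial size, filters): halve the size and double the filters at each stage
stages = [(32, 32), (16, 64), (8, 128)]
costs = [conv_macs(s, s, 3, f, f) for s, f in stages]

for (s, f), cost in zip(stages, costs):
    print(f"{s}x{s} feature map, {f} filters: {cost:,} MACs")
# Each stage costs 9,437,184 MACs
```

The first convolution after each pooling step, where the channel count changes from f to 2f, costs about half as much, which is why the rule holds only approximately.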

Training a CNN on CIFAR-10

from tensorflow.keras.datasets import cifar10

(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=10,
    batch_size=64,
    verbose=1
)

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.3f}")
print(f"Training accuracy: {history.history['accuracy'][-1]:.3f}")

Example output (exact values vary from run to run):

Test accuracy: 0.712
Training accuracy: 0.894

Data Augmentation

Data augmentation creates varied training examples by randomly transforming images.

data_augmentation = keras.Sequential([
    keras.layers.RandomFlip('horizontal'),
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomZoom(0.1),
    keras.layers.RandomContrast(0.1),
])

augmented_model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),  # declare the input once, at the top of the model
    data_augmentation,               # random transforms are applied only during training
    keras.layers.Conv2D(32, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

augmented_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

sample = X_train[:1]
augmented = data_augmentation(sample, training=True)
print(f"Original shape: {sample.shape}")
print(f"Augmented shape: {augmented.shape}")
print(f"Original pixel range: [{sample.min():.3f}, {sample.max():.3f}]")

Expected output:

Original shape: (1, 32, 32, 3)
Augmented shape: (1, 32, 32, 3)
Original pixel range: [0.094, 0.859]
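To make the idea concrete, here is a minimal NumPy sketch of what a layer like `RandomFlip('horizontal')` does conceptually. It is a stand-in for illustration, not the Keras implementation; the function name `random_horizontal_flip`, the 0.5 flip probability, and the random test batch are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_horizontal_flip(images, rng, probability=0.5):
    """Mirror each image in a (batch, height, width, channels) array left-to-right
    with the given probability. Shapes and pixel values are unchanged."""
    flipped = images.copy()
    flip_mask = rng.random(len(images)) < probability  # one coin toss per image
    flipped[flip_mask] = flipped[flip_mask][:, :, ::-1, :]  # reverse the width axis
    return flipped

# Random stand-in for a batch of four 32x32 RGB images scaled to [0, 1)
batch = rng.random((4, 32, 32, 3), dtype=np.float32)
augmented_batch = random_horizontal_flip(batch, rng)

print(augmented_batch.shape)
```

Every output image is either the original or its mirror image, so the label stays valid while the network sees a different arrangement of pixels. That is the property an augmentation needs: it changes the input without changing the correct answer.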

Transfer Learning with Pretrained CNN

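Using a pretrained model like ResNet50 typically improves accuracy and needs far less training data than training from scratch, because the convolutional base has already learned general-purpose visual features on ImageNet.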

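from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Note: the ImageNet weights expect inputs passed through preprocess_input on raw
# 0-255 pixels, e.g. preprocess_input(X * 255.0) for data already scaled to [0, 1].
base_model = ResNet50(
    weights='imagenet',
    include_top=False,        # drop the 1000-class ImageNet classifier
    input_shape=(32, 32, 3)   # 32 is the smallest size ResNet50 accepts
)
base_model.trainable = False  # freeze the pretrained convolutional base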

transfer_model = keras.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation='softmax')
])

transfer_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

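import math

# TensorFlow variables have no PyTorch-style numel(); count elements from each weight's shape
def count_params(weights):
    return sum(math.prod(w.shape) for w in weights)

print(f"Trainable params: {count_params(transfer_model.trainable_weights):,}")
print(f"Non-trainable params: {count_params(transfer_model.non_trainable_weights):,}")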

Expected output:

Trainable params: 263,562
Non-trainable params: 23,587,712
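The trainable count of the new head can be worked out by hand. The sketch below assumes the 2048-channel feature map that ResNet50 produces at its final stage, so `GlobalAveragePooling2D` hands a 2048-element vector to the first dense layer. The pooling and dropout layers contribute no weights. The variable names are illustrative.

```python
feature_channels = 2048  # channels in ResNet50's final feature map (assumed input to the head)
hidden_units = 128       # Dense(128)
num_classes = 10         # Dense(10)

# Each dense layer has a weight matrix plus one bias per unit
hidden_layer_params = feature_channels * hidden_units + hidden_units  # 262,272
output_layer_params = hidden_units * num_classes + num_classes        # 1,290

head_params = hidden_layer_params + output_layer_params
print(f"{head_params:,}")  # 263,562
```

With the base frozen, only these head weights are updated during training, which is why each epoch is cheap compared with training the full network.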

Pooling Operations Explained

Pooling reduces spatial dimensions while preserving the most important information. Max pooling takes the maximum value in each window, preserving the strongest activation. Average pooling takes the mean, preserving overall signal strength. Global average pooling computes the average of each entire feature map, producing one value per filter. This is often used before the final classification layer because it needs far fewer parameters than flattening; the trade-off is that it discards the spatial layout and keeps only a per-channel summary.
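The three pooling variants can be compared on a small example. This is a NumPy sketch for a single-channel feature map with non-overlapping windows; the helper `pool2d` and the 4x4 input are assumptions for illustration, and Keras layers additionally handle batches, channels, strides, and padding.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over a (height, width) map.
    Trailing rows or columns that do not fill a whole window are dropped."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    # Split the map into (rows of windows, window height, cols of windows, window width)
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [5, 7, 1, 1],
                        [0, 2, 8, 4],
                        [2, 0, 6, 2]], dtype=np.float64)

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
global_avg = feature_map.mean()  # global average pooling: one value for the whole map

print(max_pooled)  # [[7. 2.] [2. 8.]]
print(avg_pooled)  # [[4. 1.] [1. 5.]]
print(global_avg)  # 2.75
```

Max and average pooling shrink the 4x4 map to 2x2 but still keep a coarse sense of where activations occurred. Global average pooling collapses it to a single number.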

CNN Architecture Comparison

Architecture | Parameters | Top-5 Accuracy (ImageNet) | Year
LeNet-5 | 60K | n/a (designed for MNIST digits, not ImageNet) | 1998
AlexNet | 60M | ~80% | 2012
VGG16 | 138M | ~90% | 2014
ResNet50 | 25M | ~92% | 2015
EfficientNet-B0 | 5.3M | ~94% | 2019

Common Errors and Mistakes

Mistake | Why It Happens | How to Fix
Wrong input shape | Images not resized to expected dimensions | Always check model's expected input_shape
Not normalizing pixels | Large pixel values (0-255) slow convergence | Divide by 255.0 or use preprocess_input
Too many parameters | Model overfits on small datasets | Add dropout, reduce layers, use transfer learning
No data augmentation | Model memorizes training images | Add random flips, rotations, and zooms
Unfreezing too early | Destroying pretrained features | Train head first, then fine-tune with low LR

Practice Questions

  1. What does a convolutional layer do?

Answer: It slides learnable filters across the input image, computing dot products at each position to create feature maps that detect patterns like edges, textures, and shapes.

  2. Why use pooling layers in a CNN?

Answer: Pooling reduces spatial dimensions, which cuts computation and the number of parameters in later layers. It also provides a degree of translation invariance: small shifts in the input often produce the same pooled output.

  3. What is data augmentation and why is it important?

Answer: Data augmentation applies random transformations (flips, rotations, zooms) to training images. It increases effective dataset size, reduces overfitting, and improves generalization to real-world variations.

  4. How does transfer learning work with CNNs?

Answer: A pretrained model (trained on ImageNet) is used as a feature extractor. Its convolutional base is frozen, and a new classifier head is trained on the target dataset. Later, parts of the base can be fine-tuned with a low learning rate.

  5. What is the purpose of the Flatten layer between convolutional and dense layers?

Answer: Convolutional layers output 3D feature maps (height, width, channels). Dense layers expect 1D vectors. Flatten reshapes the 3D output into a 1D vector without changing the data.

Challenge

Build a CNN to classify the 102 Flower Categories dataset. Use transfer learning with EfficientNetB0, add a custom classification head with dropout, use data augmentation, and use learning rate scheduling with ReduceLROnPlateau. Aim for 90%+ top-3 accuracy. Compare training from scratch vs transfer learning.

Real-World Task

Design a CNN-based system for detecting phishing websites by analyzing screenshots. The system should classify a screenshot as "legitimate" or "phishing." Use transfer learning with a pretrained ResNet50, implement data augmentation tailored to screen captures (crops, brightness variations), and deploy the model using TensorFlow Serving with a REST API endpoint.

Next Steps

Master TensorFlow for Transfer Learning with pretrained models, and explore RNNs & LSTMs for sequential data. These Python libraries work together in the TensorFlow ecosystem for production Computer Vision pipelines.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
