Build an Image Classifier with TensorFlow & Python

DodaTech 11 min read

Build an image classifier with TensorFlow and Python using a convolutional neural network trained on the CIFAR-10 dataset. You will learn how to load and preprocess image data, build a CNN with Conv2D and MaxPooling layers, train the model, evaluate performance with accuracy and loss plots, make predictions on new images, and deploy the classifier with Gradio.

What You'll Build

A complete image classification pipeline that takes raw pixel values, normalizes and augments them, feeds them through a convolutional neural network, and outputs class probabilities. The final model is saved to disk and wrapped in a Gradio web interface anyone can use from a browser.

Why Build an Image Classifier?

Image classification is a core Machine Learning task and the foundation of Computer Vision. The same techniques you will learn here -- convolutional filters, pooling, data augmentation, and softmax classification -- power medical diagnosis systems, autonomous vehicle perception, facial recognition, and content moderation tools. Durga Antivirus Pro uses similar CNN architectures for visual malware analysis and phishing site detection.

Real-World Use

E-commerce platforms classify product photos automatically. Social media networks flag inappropriate content with image classifiers. Security cameras distinguish people, vehicles, and animals in real time. After this tutorial, you will have the core building blocks that systems like these are based on.

Pipeline Overview

graph LR
    A["Dataset (CIFAR-10)"] --> B["Preprocessing (Normalization + Augmentation)"]
    B --> C["CNN Model (Conv2D + MaxPooling + Dense)"]
    C --> D["Training (Adam + Cross-Entropy)"]
    D --> E["Evaluation (Accuracy + Loss + Confusion Matrix)"]
    E --> F["Deployment (Gradio Web Interface)"]

Step 1: Project Setup

Create a new directory and install the required libraries. TensorFlow provides the Deep Learning framework, Keras is its high-level API for building models, Matplotlib handles visualizations, NumPy manages array operations, scikit-learn computes the confusion matrix, and Gradio serves the web interface.

mkdir image-classifier
cd image-classifier
python -m venv venv
source venv/bin/activate
pip install tensorflow matplotlib numpy scikit-learn gradio

Verify the installation by importing TensorFlow and checking the version.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
import gradio as gr

print("TensorFlow version:", tf.__version__)

Expected output:

TensorFlow version: 2.18.0

Step 2: Load the CIFAR-10 Dataset

CIFAR-10 contains 60,000 color images (32x32 pixels) across 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset is split into 50,000 training samples and 10,000 test samples. Keras ships a loader for it that downloads the data on the first call and caches it locally, so no manual download is needed.

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

print(f"Training samples: {x_train.shape[0]}")
print(f"Test samples: {x_test.shape[0]}")
print(f"Image shape: {x_train.shape[1:]}")
print(f"Number of classes: {len(class_names)}")

Expected output:

Training samples: 50000
Test samples: 10000
Image shape: (32, 32, 3)
Number of classes: 10

Step 3: Data Preprocessing

Raw pixel values range from 0 to 255. Neural networks converge faster when inputs are scaled to a smaller range. Normalization divides every pixel by 255.0 to scale values into the [0, 1] interval.

Data augmentation creates modified versions of training images -- flipped, rotated, zoomed -- which reduces overfitting and improves generalization. Keras provides built-in augmentation layers that run during training.

x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

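Only the training images are augmented. Keras augmentation layers are active in training mode and pass images through unchanged during evaluation and inference, so the test set stays untouched to measure real-world performance. Note that these layers only take effect once they are part of the model or the input pipeline: the model in Step 4 does not include data_augmentation as written, so add it right after the Input layer if you want augmentation applied.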
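To make the flip concrete, here is a minimal sketch in plain Python (no Keras) of what a horizontal flip does to a tiny image. The helpers flip_horizontal and random_flip and the 2x3 toy image are illustrative assumptions, not part of the tutorial's pipeline. RandomFlip("horizontal") applies the same kind of mirroring at random during training.

```python
import random


def flip_horizontal(image):
    """Mirror an image left to right. `image` is a list of pixel rows."""
    return [row[::-1] for row in image]


def random_flip(image, rng):
    """Flip with probability 0.5; otherwise return the image unchanged."""
    return flip_horizontal(image) if rng.random() < 0.5 else image


# Hypothetical 2x3 single-channel image.
toy_image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = flip_horizontal(toy_image)
maybe_flipped = random_flip(toy_image, random.Random(0))
print(flipped)
```

Flipping twice returns the original image, which is why a flip changes the pixels the network sees without changing the label.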

Step 4: Build the CNN Model

A convolutional neural network uses three key layer types. Conv2D layers learn spatial features (edges, textures, shapes) by sliding filters across the image. MaxPooling2D layers reduce the spatial size, keeping only the most important features and making the model computationally efficient. Dense layers at the end perform the final classification.
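The sliding-filter and pooling ideas can be sketched in plain Python on a tiny single-channel image. This is an illustration only: the function names, the 5x5 image, and the hand-picked edge kernel are assumptions, and the sketch uses no padding, whereas the tutorial's model uses padding="same" and learns its kernels during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), summing products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# Dark left side (0) and bright right side (1): a vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds where brightness increases left to right.
edge_kernel = [[-1, 1],
               [-1, 1]]

features = conv2d_valid(image, edge_kernel)  # 4x4 feature map
pooled = max_pool_2x2(features)              # 2x2 after pooling
print(features[0])  # [0, 2, 0, 0]
print(pooled)       # [[2, 0], [2, 0]]
```

The filter fires only at the edge, and pooling keeps that response while halving the height and width.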

The model takes a 32x32x3 input, applies three convolutional blocks with increasing filter depth (32, 64, 128), flattens the feature maps, and passes them through a fully connected layer with dropout for regularization. The final layer uses softmax to output a probability distribution over the 10 classes.

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])

model.summary()

Expected output (truncated):

Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D)                 │ (None, 32, ... │
│ max_pooling2d (MaxPooling2D)    │ (None, 16, ... │
│ conv2d_1 (Conv2D)               │ (None, 16, ... │
│ max_pooling2d_1 (MaxPooling2D)  │ (None, 8, 8... │
│ conv2d_2 (Conv2D)               │ (None, 8, 8... │
│ max_pooling2d_2 (MaxPooling2D)  │ (None, 4, 4... │
│ flatten (Flatten)               │ (None, 2048)   │
│ dense (Dense)                   │ (None, 256)    │
│ dropout (Dropout)               │ (None, 256)    │
│ dense_1 (Dense)                 │ (None, 10)     │
└─────────────────────────────────┴────────────────┘
 Total params: 620,362 (2.37 MB)
 Trainable params: 620,362 (2.37 MB)
 Non-trainable params: 0 (0.00 B)
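The flattened size and the parameter count follow directly from the layer definitions in the model code, so they can be checked by hand. A minimal sketch of that arithmetic, where the helper names conv_params and dense_params are illustrative and not Keras functions:

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


size, channels = 32, 3  # CIFAR-10 input: 32x32 pixels, 3 color channels
total = 0
for filters in (32, 64, 128):
    total += conv_params(3, 3, channels, filters)
    channels = filters
    size //= 2  # padding="same" keeps the size; each 2x2 max pool halves it

flat = size * size * channels     # length of the Flatten output
total += dense_params(flat, 256)  # hidden Dense layer
total += dense_params(256, 10)    # softmax output layer

print(flat)   # 2048
print(total)  # 620362
```

Most of the weights sit in the first Dense layer (2048 x 256), which is why pooling before flattening matters so much for model size.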

Step 5: Train the Model

The Adam optimizer adapts the learning rate during training. Sparse categorical crossentropy measures the difference between predicted probabilities and true class labels. Accuracy tracks what fraction of predictions match the ground truth.
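For a single sample, sparse categorical crossentropy reduces to the negative log of the probability assigned to the true class. The sketch below shows that in plain Python. The function name and the three-class probability vector are made up for illustration, and Keras additionally clips probabilities and averages over the batch.

```python
import math


def sparse_cce_single(label, probabilities):
    """Loss for one sample: -log(probability of the true class)."""
    return -math.log(probabilities[label])


# Hypothetical softmax output over three classes.
probs = [0.1, 0.7, 0.2]

loss_when_right = sparse_cce_single(1, probs)  # true class got 0.7
loss_when_wrong = sparse_cce_single(0, probs)  # true class got 0.1

print(round(loss_when_right, 4))  # 0.3567
print(round(loss_when_wrong, 4))  # 2.3026
```

The loss grows sharply as the probability of the true class approaches zero, so confident mistakes are penalized far more than hesitant ones.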

The validation_split parameter holds out the last 20% of the training data (taken before any shuffling) and evaluates the model on it at the end of each epoch. This lets you monitor whether the model is learning patterns (validation accuracy improves) or just memorizing the training set (validation accuracy plateaus or drops).

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=20,
    validation_split=0.2,
)

Expected output (last epoch):

Epoch 20/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 12s 19ms/step - accuracy: 0.8795 - loss: 0.3410 - val_accuracy: 0.7218 - val_loss: 0.8618

Training accuracy reaches approximately 88% while validation accuracy settles around 72%. The gap indicates some overfitting, which is normal for a model of this size trained from scratch.
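The 625 steps per epoch in the log follow from the split and the batch size. A quick sketch of the arithmetic, using the values from this tutorial (the variable names are only for illustration):

```python
import math

num_train = 50_000       # CIFAR-10 training samples
validation_percent = 20  # validation_split=0.2
batch_size = 64

# Keras takes the validation samples from the end of the training arrays.
num_val = num_train * validation_percent // 100
num_fit = num_train - num_val
steps_per_epoch = math.ceil(num_fit / batch_size)

print(num_fit, num_val)  # 40000 10000
print(steps_per_epoch)   # 625
```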

Step 6: Evaluation

Plot the training and validation metrics to understand the model's behavior. Then evaluate on the held-out test set.

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history["accuracy"], label="Training Accuracy")
plt.plot(history.history["val_accuracy"], label="Validation Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Accuracy over Epochs")

plt.subplot(1, 2, 2)
plt.plot(history.history["loss"], label="Training Loss")
plt.plot(history.history["val_loss"], label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Loss over Epochs")

plt.tight_layout()
plt.show()

test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
print(f"Test loss: {test_loss:.4f}")

Expected output:

Test accuracy: 0.7243
Test loss: 0.8157

Generate a confusion matrix to see which classes the model confuses most often.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = model.predict(x_test, verbose=0)
y_pred_classes = np.argmax(y_pred, axis=1)

cm = confusion_matrix(y_test, y_pred_classes)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(xticks_rotation=45)
plt.tight_layout()
plt.show()

The confusion matrix reveals that cats and dogs are frequently mistaken for each other, while ships and trucks are classified more reliably. This makes sense -- cats and dogs share similar shapes and fur textures, while man-made vehicles have more distinctive geometric features.
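To show what the matrix contains, here is a minimal plain-Python sketch that builds one by counting (true, predicted) pairs and derives per-class recall. The three-class labels and predictions are made-up data, and the helper names are illustrative. scikit-learn's confusion_matrix uses the same layout, with rows for true classes and columns for predicted classes.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """matrix[t][p] = number of samples with true class t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


labels = ["cat", "dog", "ship"]        # hypothetical subset of classes
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 0, 0, 1, 1, 2, 2]      # one cat->dog and one dog->cat mistake

matrix = confusion_counts(y_true, y_pred, len(labels))
recall = per_class_recall(matrix)

for name, row in zip(labels, matrix):
    print(name, row)
```

Off-diagonal entries are the mistakes: here the cat and dog rows each leak one sample into the other's column, while the ship row is clean.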

Step 7: Make Predictions on New Images

Use the trained model to classify individual images. The function loads an image, resizes it to the required 32x32 input shape, normalizes pixel values, and runs inference.

def predict_image(image_path):
    img = tf.keras.utils.load_img(image_path, target_size=(32, 32))
    img_array = tf.keras.utils.img_to_array(img) / 255.0
    img_array = tf.expand_dims(img_array, 0)

    predictions = model.predict(img_array, verbose=0)[0]
    predicted_class = np.argmax(predictions)
    confidence = predictions[predicted_class]

    print(f"Predicted: {class_names[predicted_class]} ({confidence:.2%})")
    for name, prob in zip(class_names, predictions):
        print(f"  {name}: {prob:.2%}")

predict_image("test_car.jpg")

Expected output:

Predicted: automobile (98.34%)
  airplane: 0.02%
  automobile: 98.34%
  bird: 0.01%
  cat: 0.13%
  deer: 0.01%
  dog: 0.04%
  frog: 0.00%
  horse: 0.00%
  ship: 0.14%
  truck: 1.31%

Step 8: Save and Load the Model

Saving the model preserves the architecture, weights, and training configuration in a single file. Loading restores it for inference without retraining.

model.save("cifar10_classifier.keras")
print("Model saved to cifar10_classifier.keras")

loaded_model = keras.models.load_model("cifar10_classifier.keras")
loaded_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

loss, acc = loaded_model.evaluate(x_test, y_test, verbose=0)
print(f"Loaded model accuracy: {acc:.4f}")

Expected output:

Model saved to cifar10_classifier.keras
Loaded model accuracy: 0.7243

The accuracy matches the original model exactly, confirming the save-load round trip is lossless.

Step 9: Deploy with Gradio

Gradio creates a web interface for the classifier in a few lines. Users can upload any image, and the app resizes, normalizes, and classifies it in real time.

def classify_image(img):
    img_resized = tf.image.resize(img, (32, 32))
    img_array = tf.keras.utils.img_to_array(img_resized) / 255.0
    img_array = tf.expand_dims(img_array, 0)

    pred = loaded_model.predict(img_array, verbose=0)[0]
    return {class_names[i]: float(pred[i]) for i in range(10)}

gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=3),
    title="CIFAR-10 Image Classifier",
    description="Upload any image and see what class it belongs to.",
).launch()

Expected output:

Running on local URL:  http://127.0.0.1:7860

Open the local URL in a browser. Upload any photo -- the model resizes it to 32x32 internally and returns the top three predicted classes with confidence scores.
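The function returns a dictionary mapping every class name to its probability, and the Label component then shows only the highest-scoring entries. The sketch below imitates that top-three selection in plain Python. It is not Gradio's implementation, and the top_k helper and the scores are hypothetical.

```python
def top_k(label_scores, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(label_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical classifier output, in the same shape classify_image returns.
scores = {"airplane": 0.02, "automobile": 0.90, "ship": 0.01, "truck": 0.07}

best_three = top_k(scores)
for label, score in best_three:
    print(f"{label}: {score:.0%}")
```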

Common Errors

1. Model achieves high training accuracy but low test accuracy
This is overfitting. The model memorizes training patterns instead of learning general features. Reduce overfitting with more aggressive dropout, more data augmentation, or fewer epochs. Adding a pretrained base model (transfer learning) also helps.

2. ValueError: Input 0 of layer "sequential" is incompatible
The input shape in the model definition must match the actual image dimensions. CIFAR-10 images are 32x32x3. If you use a custom dataset with different dimensions, update the Input(shape=...) parameter.

3. Predictions are all the same class regardless of input
The learning rate may be too high, causing the optimizer to overshoot the minimum. Lower the learning rate with optimizer=keras.optimizers.Adam(learning_rate=0.0001). Alternatively, the labels might be misaligned with the images.

4. Gradio shows "File not found" error
The uploaded file path is temporary and Gradio handles deletion automatically. If you modify the image inside the function before classification, make sure you work with a copy. Use img.copy() before any in-place operations.

5. Confusion matrix shows zero for some classes
This happens when the model never predicts certain classes, usually because of class imbalance or a collapsed model. Check the class distribution in your dataset. Ensure the final layer has the correct number of neurons matching the class count.
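For errors 3 and 5, a quick diagnostic is to count how often each class is predicted. A minimal sketch, assuming the predictions are already integer class indices (like y_pred_classes in Step 6). The helper name and the sample data are hypothetical.

```python
from collections import Counter


def never_predicted(predicted_classes, num_classes):
    """Return the class indices that do not appear in the predictions."""
    counts = Counter(predicted_classes)
    return [c for c in range(num_classes) if counts[c] == 0]


# Hypothetical predictions for a 4-class problem; class 2 never shows up.
predicted = [0, 0, 1, 0, 3, 1, 0]

counts = Counter(predicted)
missing = never_predicted(predicted, num_classes=4)

print(dict(counts))
print(missing)  # [2]
```

If almost all counts fall on one class, the model has likely collapsed. If only a few classes are missing, check how well those classes are represented in the training labels.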

Practice Questions

1. Why do we normalize pixel values to [0, 1] instead of keeping 0-255? Neural networks train best when inputs are small and on a consistent scale. With standard weight initialization and learning rates, inputs in the 0-255 range produce large activations and gradients early on, which makes training unstable and slow. Scaling to [0, 1] keeps the first layer's outputs in a range where the optimizer's default settings work well.

2. What does each Conv2D layer learn as depth increases? Early layers (32 filters) detect low-level features like edges, corners, and colors. Middle layers (64 filters) combine edges into textures and patterns. Later layers (128 filters) assemble patterns into object parts like wheels, eyes, or wings.

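3. Why use MaxPooling2D instead of just more Conv2D layers? MaxPooling has no trainable weights of its own, but it reduces the spatial dimensions, which cuts computation in later layers and shrinks the flattened vector that feeds the Dense layer, so the model ends up with fewer parameters. It also provides a degree of translation invariance -- small shifts in the input often produce the same pooled output, making the model more robust.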

4. What does the Dropout layer do and why set it to 0.5? Dropout randomly turns off 50% of neurons during training, forcing the network to learn redundant representations. This prevents any single neuron from becoming a bottleneck and reduces overfitting. At test time all neurons are active.

5. Why is softmax used in the final layer? Softmax converts raw scores (logits) into a probability distribution that sums to 1.0. Each output value represents the confidence that the input belongs to that class, making results interpretable as percentages.
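The softmax in question 5 can be written in a few lines of plain Python. This is a sketch for illustration with made-up logits. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest logit gets the largest probability, and the ordering of the classes is preserved.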

Challenge: Add Early Stopping and Model Checkpointing

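Modify the training loop to use Keras callbacks. EarlyStopping halts training when the monitored metric (validation loss by default) has not improved for 5 consecutive epochs, saving time. With restore_best_weights=True, it also rolls the model back to the weights from its best epoch. ModelCheckpoint with save_best_only=True saves the best model seen so far to disk automatically.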

callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]

history = model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=50,
    validation_split=0.2,
    callbacks=callbacks,
)

Training stops automatically when the model stops improving, preventing wasted compute.
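The patience logic can be simulated without training anything. The sketch below is a simplified model of the rule, not Keras's implementation. It assumes the monitored value is validation loss with no minimum improvement threshold, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` epochs in a row fail to beat the best loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch  # never triggered: ran all epochs


# Hypothetical validation losses: best at epoch 3, then no improvement.
losses = [1.20, 0.95, 0.88, 0.90, 0.89, 0.91, 0.93, 0.92, 0.94, 0.96]

stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch, best_epoch)  # 8 3
```

Training runs five epochs past the best one before stopping, which is why restoring the best weights matters: the final epoch's weights are not the best ones.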

Real-World Task: Classify Cats vs Dogs

Download the Kaggle Dogs vs Cats dataset (25,000 labeled images). Adapt this tutorial to binary classification (2 classes instead of 10). The same CNN architecture works -- change the final Dense layer to 1 neuron with sigmoid activation, use binary_crossentropy loss, and resize the images to one fixed size that matches Input(shape=...). Report your test accuracy and discuss how the dataset's size and higher image resolution affect overfitting.
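With a single sigmoid output, the model produces one probability per image, and binary crossentropy scores it against a 0/1 label. A minimal plain-Python sketch of both pieces follows. The convention that label 1 means dog and the logit value are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(label, probability):
    """Loss for one sample; `label` is 0 or 1."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))


# Hypothetical raw output of the final 1-neuron layer for one image.
p_dog = sigmoid(2.0)

loss_if_dog = binary_crossentropy(1, p_dog)  # prediction agrees with label
loss_if_cat = binary_crossentropy(0, p_dog)  # prediction disagrees

print(round(p_dog, 4))        # 0.8808
print(round(loss_if_dog, 4))  # 0.1269
print(round(loss_if_cat, 4))  # 2.1269
```

At prediction time, thresholding the probability at 0.5 turns it into a class decision.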

FAQ

What is CIFAR-10?

CIFAR-10 is a benchmark dataset of 60,000 labeled color images across 10 classes. Each image is 32x32 pixels. It is widely used for teaching Computer Vision because it is small enough to train on a laptop but complex enough to demonstrate real classification challenges.

Why does data augmentation improve accuracy?

Augmentation creates new training samples by applying realistic transformations -- flips, rotations, shifts, zooms, and brightness changes. This exposes the model to more variations of each class, teaching it to focus on the object itself rather than its orientation or position. Models trained with augmentation generalize better to unseen images.

How can I improve accuracy beyond 72%?

Use transfer learning with a pretrained model like ResNet50 or MobileNetV2. These models are already trained on millions of images and can be fine-tuned on CIFAR-10 in a few epochs. Alternatively, train longer with learning rate scheduling, add batch normalization layers, or use more aggressive data augmentation.

Can I run this on a CPU or do I need a GPU?

CIFAR-10 training with this architecture takes about 15-20 minutes on a modern CPU and 2-3 minutes on a GPU. For learning purposes a CPU is fine. If you lack a local GPU, run this notebook on Google Colab (free GPU) or Kaggle (free GPU).

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
