Computer Vision: OpenCV, YOLO and Image Segmentation

DodaTech Updated 2026-06-22 7 min read

Computer Vision enables machines to interpret and understand visual information from images and videos, powering applications from facial recognition to autonomous vehicles and medical imaging.

What You'll Learn

In this tutorial, you'll learn Computer Vision fundamentals including OpenCV for image processing, YOLO for real-time object detection, and U-Net for image segmentation, with practical Python examples for real-world vision applications.

Why It Matters

Computer Vision is one of the most impactful AI fields. It is used in self-driving cars (detecting pedestrians, signs, and lanes), medical imaging (tumor detection, cell segmentation), manufacturing (defect inspection), security (facial recognition), and agriculture (crop health monitoring). Combining traditional CV with Deep Learning has made these applications accurate enough for production.

Real-World Use

Durga Antivirus Pro uses Computer Vision to analyze suspicious documents and images for malware. YOLO models detect embedded QR codes and logos that redirect to phishing sites, OpenCV preprocesses images to normalize lighting and perspective, and segmentation models isolate regions of interest for detailed analysis.

OpenCV Basics

OpenCV is the most widely used open-source library for image processing. It provides more than 2,500 optimized algorithms for filtering, feature detection, color space conversion, geometric transformations, and more. Images are loaded as NumPy arrays with shape (height, width, channels), so pixels are indexed as image[row, column]. The default channel order is BGR, not RGB. OpenCV can read from files, cameras, and video streams, making it suitable for real-time applications.

import cv2
import numpy as np

image = np.zeros((300, 400, 3), dtype=np.uint8)
image[:] = [255, 255, 255]

cv2.rectangle(image, (50, 80), (150, 200), (255, 0, 0), 2)
cv2.circle(image, (250, 140), 50, (0, 255, 0), -1)
cv2.putText(image, "OpenCV Demo", (50, 250),
            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)

gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)

print(f"Image shape: {image.shape}")
print(f"Image dtype: {image.dtype}")
print(f"Gray shape: {gray.shape}")
print(f"Edges found: {np.sum(edges > 0)}")
print(f"Color of center pixel (BGR): {image[150, 200]}")

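Expected output (the exact edge count can vary with the OpenCV version and its font rendering):

Image shape: (300, 400, 3)
Image dtype: uint8
Gray shape: (300, 400)
Edges found: 452
Color of center pixel (BGR): [255 255 255]

The center pixel at row 150, column 200 is white because it lies just outside the filled circle: it is about 51 pixels from the circle's center at (250, 140), and the radius is 50.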
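The BGR channel order is easy to get wrong when passing images to libraries that expect RGB. The following is a minimal sketch, using only NumPy, of what the conversion does: reversing the last axis of a 3-channel image swaps the blue and red channels, which for such images has the same effect as cv2.cvtColor with cv2.COLOR_BGR2RGB. The names bgr_image and rgb_image are illustrative.

```python
import numpy as np

# A 2x2 image filled with pure blue in OpenCV's BGR order.
bgr_image = np.zeros((2, 2, 3), dtype=np.uint8)
bgr_image[:] = (255, 0, 0)  # B=255, G=0, R=0

# Reversing the channel axis yields RGB order.
rgb_image = bgr_image[..., ::-1]

print("BGR pixel:", bgr_image[0, 0].tolist())  # [255, 0, 0]
print("RGB pixel:", rgb_image[0, 0].tolist())  # [0, 0, 255]
```

The same tuple (255, 0, 0) therefore means blue to OpenCV and red to an RGB-based library such as matplotlib.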

Image Preprocessing

Preprocessing prepares images for Computer Vision models. Common steps include resizing (models expect fixed input sizes), normalization (scaling pixel values to [0,1] or standardizing), color space conversion (BGR to RGB for display, RGB to HSV for color-based segmentation), and data augmentation (rotation, flipping, brightness changes for training robustness). Proper preprocessing significantly impacts model accuracy.

image_large = np.random.randint(0, 256, (800, 600, 3), dtype=np.uint8)

resized = cv2.resize(image_large, (224, 224), interpolation=cv2.INTER_AREA)
normalized = resized.astype(np.float32) / 255.0
hsv = cv2.cvtColor(resized, cv2.COLOR_BGR2HSV)

M = cv2.getRotationMatrix2D((112, 112), 45, 1.0)
rotated = cv2.warpAffine(resized, M, (224, 224))

print(f"Original: {image_large.shape}")
print(f"Resized: {resized.shape}")
print(f"Normalized range: [{normalized.min():.3f}, {normalized.max():.3f}]")
print(f"HSV shape: {hsv.shape}")
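print(f"Rotation matrix:\n{np.round(M, 3)}")

Expected output (the normalized range changes on every run because the input is random, and INTER_AREA averaging usually keeps it well inside [0, 1]):

Original: (800, 600, 3)
Resized: (224, 224, 3)
Normalized range: [min, max], with 0 <= min < max <= 1
HSV shape: (224, 224, 3)
Rotation matrix:
[[  0.707   0.707 -46.392]
 [ -0.707   0.707 112.   ]]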
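The 2x3 matrix returned by cv2.getRotationMatrix2D follows a closed form given in the OpenCV documentation: with alpha = scale * cos(angle) and beta = scale * sin(angle), the rows are [alpha, beta, (1 - alpha) * cx - beta * cy] and [-beta, alpha, beta * cx + (1 - alpha) * cy]. The sketch below reproduces that formula in plain Python as a way to check the values by hand; rotation_matrix_2d is a hypothetical helper, not an OpenCV function.

```python
import math

def rotation_matrix_2d(center, angle_deg, scale):
    """Closed form documented for cv2.getRotationMatrix2D (2x3 affine matrix)."""
    cx, cy = center
    alpha = scale * math.cos(math.radians(angle_deg))
    beta = scale * math.sin(math.radians(angle_deg))
    return [
        [alpha, beta, (1 - alpha) * cx - beta * cy],
        [-beta, alpha, beta * cx + (1 - alpha) * cy],
    ]

M_manual = rotation_matrix_2d((112, 112), 45, 1.0)
for row in M_manual:
    print([round(v, 3) for v in row])
# [0.707, 0.707, -46.392]
# [-0.707, 0.707, 112.0]
```

The translation column is what keeps the rotation center fixed: applying the matrix to the point (112, 112) returns (112, 112).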

YOLO Object Detection

YOLO (You Only Look Once) is a real-time object detection system that treats detection as a regression problem. In the original design, the image is divided into a grid, and each grid cell predicts bounding boxes, confidence scores, and class probabilities in a single forward pass. YOLOv8 (Ultralytics) keeps the single-pass approach and provides a simple Python API for training and inference. It supports detection, segmentation, classification, and pose estimation in a single framework.

from ultralytics import YOLO
import numpy as np

model = YOLO('yolov8n.pt')

dummy_image = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)

results = model(dummy_image, verbose=False)

if results and len(results) > 0:
    boxes = results[0].boxes
    names = results[0].names

    if boxes is not None and len(boxes) > 0:
        print(f"Detected {len(boxes)} objects")
        for i in range(min(3, len(boxes))):
            cls_id = int(boxes.cls[i])
            conf = float(boxes.conf[i])
            bbox = boxes.xyxy[i].tolist()
            print(f"  {names[cls_id]}: {conf:.3f} at {[int(x) for x in bbox]}")
    else:
        print("No objects detected in this image")
else:
    print("No results returned")

print(f"Model task: {model.task}")
print(f"Model classes: {len(model.names)}")

Expected output:

No objects detected in this image
Model task: detect
Model classes: 80
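To make the grid idea concrete, here is a small sketch in plain Python. It converts a pixel box in (x1, y1, x2, y2) form into the normalized (center x, center y, width, height) form used by YOLO label files, then finds the grid cell that contains the box center. The helper names and the example box are hypothetical, and the 7x7 grid is the value from the original YOLO paper, used here only for illustration; YOLOv8's detection head works differently internally.

```python
def xyxy_to_yolo(box, image_w, image_h):
    """Convert pixel (x1, y1, x2, y2) to normalized (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / image_w
    cy = (y1 + y2) / 2 / image_h
    return cx, cy, (x2 - x1) / image_w, (y2 - y1) / image_h

def responsible_cell(cx, cy, grid_size):
    """Grid cell (row, col) that contains a normalized box center."""
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col

cx, cy, w, h = xyxy_to_yolo((160, 320, 480, 640), image_w=640, image_h=640)
print(cx, cy, w, h)                           # 0.5 0.75 0.5 0.5
print(responsible_cell(cx, cy, grid_size=7))  # (5, 3)
```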

Image Segmentation with U-Net

Image segmentation assigns a class label to each pixel in an image. U-Net is a convolutional architecture designed for biomedical image segmentation but widely used for general segmentation tasks. It has a contracting path (encoder) that captures context and an expanding path (decoder) that enables precise localization. Skip connections pass spatial information from encoder to decoder layers.

import numpy as np
import tensorflow as tf
from tensorflow import keras

def unet_model(input_shape=(128, 128, 3), num_classes=2):
    inputs = keras.Input(shape=input_shape)

    conv1 = keras.layers.Conv2D(64, 3, activation='relu', padding='same')(inputs)
    conv1 = keras.layers.Conv2D(64, 3, activation='relu', padding='same')(conv1)
    pool1 = keras.layers.MaxPooling2D(pool_size=(2, 2))(conv1)

    conv2 = keras.layers.Conv2D(128, 3, activation='relu', padding='same')(pool1)
    conv2 = keras.layers.Conv2D(128, 3, activation='relu', padding='same')(conv2)
    pool2 = keras.layers.MaxPooling2D(pool_size=(2, 2))(conv2)

    conv3 = keras.layers.Conv2D(256, 3, activation='relu', padding='same')(pool2)
    conv3 = keras.layers.Conv2D(256, 3, activation='relu', padding='same')(conv3)

    up1 = keras.layers.UpSampling2D(size=(2, 2))(conv3)
    concat1 = keras.layers.Concatenate()([up1, conv2])
    conv4 = keras.layers.Conv2D(128, 3, activation='relu', padding='same')(concat1)

    up2 = keras.layers.UpSampling2D(size=(2, 2))(conv4)
    concat2 = keras.layers.Concatenate()([up2, conv1])
    conv5 = keras.layers.Conv2D(64, 3, activation='relu', padding='same')(concat2)

    outputs = keras.layers.Conv2D(num_classes, 1, activation='softmax')(conv5)

    return keras.Model(inputs=inputs, outputs=outputs)

model_unet = unet_model((128, 128, 3), 2)
model_unet.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

dummy_x = np.random.rand(4, 128, 128, 3).astype(np.float32)
dummy_y = np.random.randint(0, 2, (4, 128, 128, 1)).astype(np.uint8)

model_unet.fit(dummy_x, dummy_y, epochs=2, batch_size=2, verbose=0)

pred = model_unet.predict(dummy_x[:1], verbose=0)
print(f"U-Net input shape: {model_unet.input_shape}")
print(f"U-Net output shape: {model_unet.output_shape}")
print(f"Total params: {model_unet.count_params():,}")
print(f"Prediction shape: {pred.shape}")
print(f"Prediction range: [{pred.min():.3f}, {pred.max():.3f}]")

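Expected output (the prediction range depends on random weight initialization and random training data, so the last line shows example values only):

U-Net input shape: (None, 128, 128, 3)
U-Net output shape: (None, 128, 128, 2)
Total params: 1,698,690
Prediction shape: (1, 128, 128, 2)
Prediction range: [0.234, 0.876]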
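The parameter total of the unet_model defined above can be cross-checked by hand, which also shows how skip connections widen the decoder inputs. A Conv2D layer with a k x k kernel has k * k * in_channels * out_channels weights plus one bias per output channel. The sketch below sums this over the nine convolutions; conv2d_params is a hypothetical helper, and the channel counts are read off the model definition.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights (k * k * in * out) plus one bias per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

# (kernel, in_channels, out_channels) for each Conv2D in unet_model.
layers = [
    (3, 3, 64), (3, 64, 64),        # conv1 (two convolutions)
    (3, 64, 128), (3, 128, 128),    # conv2
    (3, 128, 256), (3, 256, 256),   # conv3 (bottleneck)
    (3, 256 + 128, 128),            # conv4: up1 concatenated with conv2
    (3, 128 + 64, 64),              # conv5: up2 concatenated with conv1
    (1, 64, 2),                     # 1x1 output layer, num_classes=2
]

total_params = sum(conv2d_params(*layer) for layer in layers)
print(f"{total_params:,}")  # 1,698,690
```

Pooling, upsampling, and concatenation layers have no trainable parameters, so the convolutions account for the whole total.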

Computer Vision Workflow

flowchart TD
  A[Input Image] --> B[OpenCV Preprocessing]
  B --> C[Resize & Normalize]
  C --> D{Task Type}
  D --> E[Object Detection]
  D --> F[Segmentation]
  D --> G[Classification]
  E --> H[YOLO / R-CNN]
  F --> I[U-Net / Mask R-CNN]
  G --> J[CNN / ViT]
  H --> K[Bounding Boxes]
  I --> L[Pixel Masks]
  J --> M[Class Label]
  K --> N[Post-processing]
  L --> N
  M --> N
  N --> O[NMS / Threshold]
  O --> P[Final Output]

Common Errors and Mistakes

Mistake	Why It Happens	How to Fix
Wrong color channel	OpenCV uses BGR, matplotlib uses RGB	Convert with cv2.COLOR_BGR2RGB before display
Not normalizing inputs	Model expects [0,1] or [-1,1] range	Always normalize based on model training stats
Ignoring aspect ratio	Resizing distorts objects	Use letterbox padding or aspect-ratio-preserving resize
Low confidence threshold	Too many false positives	Tune confidence threshold on validation set
No data augmentation	Model overfits to training views	Add rotation, flip, brightness augmentation
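The letterbox fix in the table comes down to a little arithmetic: scale the image by the smaller of the two width and height ratios, then pad the remainder. The sketch below computes only the geometry in plain Python; letterbox_geometry is a hypothetical helper, and the choice to put any odd leftover pixel on the right or bottom is an assumption.

```python
def letterbox_geometry(src_w, src_h, dst_w, dst_h):
    """Return the resized size and padding that fit src into dst undistorted."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = dst_w - new_w, dst_h - new_h
    # Split padding between the two sides; an odd pixel goes right/bottom.
    left, top = pad_x // 2, pad_y // 2
    return (new_w, new_h), (left, top, pad_x - left, pad_y - top)

# A 600x800 (width x height) image placed into a 640x640 model input.
size, padding = letterbox_geometry(600, 800, 640, 640)
print(size)     # (480, 640)
print(padding)  # (80, 0, 80, 0) as (left, top, right, bottom)
```

With OpenCV, these values would typically feed cv2.resize for the new size and cv2.copyMakeBorder for the padding. Predicted boxes then need the padding subtracted and the scale undone to map back to the original image.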

Practice Questions

What is the difference between object detection and image segmentation?

Answer: Object detection draws a bounding rectangle around each object. Segmentation assigns a class label to every pixel, producing a mask. Segmentation provides finer detail: you know exactly which pixels belong to each object, not just the bounding rectangle.
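The relationship between the two outputs can be sketched with NumPy: a bounding box can always be derived from a mask, but not the other way around. The toy mask below is made up for illustration.

```python
import numpy as np

mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1   # a 3x4 blob of object pixels
mask[2, 3] = 0       # remove one corner so the shape is not a rectangle

# The tightest box around the mask comes from the extreme pixel coordinates.
rows, cols = np.nonzero(mask)
x1, y1, x2, y2 = int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())
box_area = (x2 - x1 + 1) * (y2 - y1 + 1)

print("box (x1, y1, x2, y2):", (x1, y1, x2, y2))  # (3, 2, 6, 4)
print("mask pixels:", int(mask.sum()))            # 11
print("box pixels:", box_area)                    # 12
```

The box covers 12 pixels while the object occupies only 11, which is the detail a detector alone cannot report.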

How does YOLO achieve real-time object detection?

Answer: YOLO treats detection as a single regression problem, predicting bounding boxes and class probabilities in one forward pass through the network. Unlike two-stage detectors (Faster R-CNN), YOLO does not use region proposals, making it significantly faster.

What is the role of skip connections in U-Net?

Answer: Skip connections pass spatial information from the encoder (contracting path) to the decoder (expanding path). This preserves fine-grained spatial details that would otherwise be lost during downsampling, enabling precise pixel-level segmentation.

Why does OpenCV use BGR instead of RGB?

Answer: OpenCV historically used BGR because early camera drivers and video codecs produced frames in BGR order. OpenCV maintains this convention for backward compatibility. Most display libraries (matplotlib) expect RGB.

What is non-maximum suppression in object detection?

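Answer: NMS removes duplicate bounding boxes for the same object. When multiple boxes overlap on the same object, NMS keeps the one with the highest confidence and suppresses the others whose IoU (intersection over union) with it exceeds a threshold. A value of 0.5 is common, but the best setting depends on how crowded the scenes are.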
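A minimal sketch of greedy NMS in plain Python, assuming boxes in (x1, y1, x2, y2) form and a single class. The function names and sample boxes are illustrative; production code would normally rely on the vectorized NMS built into the detection framework.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The first two boxes have an IoU of about 0.68, so the lower-scoring one is suppressed, while the third box does not overlap anything and is kept.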

Challenge

Build a complete Computer Vision pipeline for traffic sign detection. Use OpenCV to preprocess images (resize, normalize, augment), train a YOLOv8 model on a traffic sign detection dataset such as GTSDB (the related GTSRB dataset contains cropped sign images intended for classification), evaluate using mean average precision (mAP), and apply non-maximum suppression. Implement real-time detection on video frames with visualization of bounding boxes and confidence scores.

Real-World Task

Design an automated inspection system for a manufacturing assembly line. Cameras capture images of each product as it passes. A YOLO model detects defects (scratches, dents, missing components) with bounding boxes. A segmentation model identifies the exact defect region for measurement. When defects exceed a threshold, the system triggers rejection and logs the defect type and image for quality analysis.

Next Steps

Deploy CV models with TensorFlow Serving or ONNX Runtime. Use Docker for deployment. Process video streams with Python and OpenCV. Track experiments with MLflow.

What is the difference between computer vision and image processing?

Image processing transforms images into other images (filtering, enhancement, compression). Computer Vision extracts semantic information from images (object detection, segmentation, recognition). Image processing is often a preprocessing step for Computer Vision. OpenCV provides both capabilities.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
