Computer Vision: OpenCV, YOLO and Image Segmentation
Computer Vision enables machines to interpret and understand visual information from images and videos, powering applications from facial recognition to autonomous vehicles and medical imaging.
What You'll Learn
In this tutorial, you'll learn Computer Vision fundamentals including OpenCV for image processing, YOLO for real-time object detection, and U-Net for image segmentation, with practical Python examples for real-world vision applications.
Why It Matters
Computer Vision is one of the most impactful AI fields. It is used in self-driving cars (detecting pedestrians, signs, and lanes), medical imaging (tumor detection, cell segmentation), manufacturing (defect inspection), security (facial recognition), and agriculture (crop health monitoring). Combining traditional CV techniques with Deep Learning has made these applications accurate enough for production.
Real-World Use
Durga Antivirus Pro uses Computer Vision to analyze suspicious documents and images for malware. YOLO models detect embedded QR codes and logos that redirect to phishing sites, OpenCV preprocesses images to normalize lighting and perspective, and segmentation models isolate regions of interest for detailed analysis.
OpenCV Basics
OpenCV is the de facto standard open-source library for image processing. It provides more than 2,500 optimized algorithms for filtering, feature detection, color space conversion, geometric transformations, and more. Images are loaded as NumPy arrays of shape (height, width, channels), and the default channel order is BGR, not RGB. OpenCV can read from files, cameras, and video streams, which makes it suitable for real-time applications.
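import cv2
import numpy as np
# Create a white 300x400 canvas; the array shape is (height, width, channels)
image = np.zeros((300, 400, 3), dtype=np.uint8)
image[:] = [255, 255, 255]
# Drawing functions take points as (x, y) and colors as (B, G, R)
cv2.rectangle(image, (50, 80), (150, 200), (255, 0, 0), 2)  # blue outline
cv2.circle(image, (250, 140), 50, (0, 255, 0), -1)  # filled green circle
cv2.putText(image, "OpenCV Demo", (50, 250),
            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
# Convert to grayscale, then run Canny edge detection
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
print(f"Image shape: {image.shape}")
print(f"Image dtype: {image.dtype}")
print(f"Gray shape: {gray.shape}")
print(f"Edges found: {np.sum(edges > 0)}")
# NumPy indexing is [row, column], so this reads the pixel at y=150, x=200
print(f"Color of center pixel (BGR): {image[150, 200]}")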
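Expected output (the exact edge count can vary with the OpenCV version):
Image shape: (300, 400, 3)
Image dtype: uint8
Gray shape: (300, 400)
Edges found: 452
Color of center pixel (BGR): [255 255 255]
The center pixel at row 150, column 200 lies just outside the filled circle (about 51 pixels from its center, with a radius of 50), so it keeps the white background color.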
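Because OpenCV stores channels in BGR order, an image passed unchanged to an RGB-based library shows red and blue swapped. The following is a minimal NumPy-only sketch of the channel swap on a tiny hand-made array. For 3-channel images it yields the same values as cv2.cvtColor with cv2.COLOR_BGR2RGB, and the two-pixel array is purely illustrative.

```python
import numpy as np

# A 1x2 "image" in BGR order: one pure-blue pixel and one pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels (BGR -> RGB).
# ascontiguousarray makes a real copy instead of a negatively strided view.
rgb = np.ascontiguousarray(bgr[..., ::-1])

print(bgr[0, 0].tolist(), "->", rgb[0, 0].tolist())  # [255, 0, 0] -> [0, 0, 255]
print(bgr[0, 1].tolist(), "->", rgb[0, 1].tolist())  # [0, 0, 255] -> [255, 0, 0]
```

The blue pixel is still blue after the swap. Only the position of its 255 value within the channel triple changes.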
Image Preprocessing
Preprocessing prepares images for Computer Vision models. Common steps include resizing (models expect fixed input sizes), normalization (scaling pixel values to [0,1] or standardizing), color space conversion (BGR to RGB for display, RGB to HSV for color-based segmentation), and data augmentation (rotation, flipping, brightness changes for training robustness). Proper preprocessing significantly impacts model accuracy.
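# Random test image with 800 rows (height) and 600 columns (width)
image_large = np.random.randint(0, 256, (800, 600, 3), dtype=np.uint8)
# cv2.resize takes the target size as (width, height)
resized = cv2.resize(image_large, (224, 224), interpolation=cv2.INTER_AREA)
# Scale pixel values from [0, 255] to [0, 1]
normalized = resized.astype(np.float32) / 255.0
hsv = cv2.cvtColor(resized, cv2.COLOR_BGR2HSV)
# Rotate 45 degrees counterclockwise about the image center, without scaling
M = cv2.getRotationMatrix2D((112, 112), 45, 1.0)
rotated = cv2.warpAffine(resized, M, (224, 224))
print(f"Original: {image_large.shape}")
print(f"Resized: {resized.shape}")
print(f"Normalized range: [{normalized.min():.3f}, {normalized.max():.3f}]")
print(f"HSV shape: {hsv.shape}")
print(f"Rotation matrix:\n{np.round(M, 3)}")
Expected output (the normalized minimum and maximum vary per run because the input is random; INTER_AREA averages neighboring pixels, which typically pulls them away from 0 and 1):
Original: (800, 600, 3)
Resized: (224, 224, 3)
Normalized range: [<min>, <max>]
HSV shape: (224, 224, 3)
Rotation matrix:
[[  0.707   0.707 -46.392]
 [ -0.707   0.707 112.   ]]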
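Besides scaling to [0, 1], the other normalization mentioned above is standardizing each channel with a mean and standard deviation. The sketch below assumes made-up statistics (MEAN and STD are hypothetical placeholders); in practice these come from the training data of the model you are using.

```python
import numpy as np

# Hypothetical per-channel statistics, for illustration only.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.25, 0.25, 0.25], dtype=np.float32)

def standardize(image_uint8, mean=MEAN, std=STD):
    """Scale a uint8 HxWx3 image to [0, 1], then standardize each channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    # mean and std have shape (3,), so they broadcast over the channel axis.
    return (scaled - mean) / std

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
out = standardize(image)
print(out.shape, out.dtype)  # (224, 224, 3) float32
```

With these placeholder statistics the output lies in [-2, 2]. Using the wrong statistics at inference time is a common cause of silently degraded accuracy.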
YOLO Object Detection
YOLO (You Only Look Once) is a real-time object detection system that treats detection as a regression problem. The original YOLO divides the image into a grid, and each grid cell predicts bounding boxes, confidence scores, and class probabilities; later versions refine this design but keep the single forward pass. YOLOv8 (Ultralytics) provides a simple Python API for training and inference. It supports detection, segmentation, classification, and pose estimation in a single framework.
from ultralytics import YOLO
import numpy as np
# Load the pretrained nano model (the weights are downloaded on first use)
model = YOLO('yolov8n.pt')
# Random noise stands in for a real image, so detections are unlikely
dummy_image = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
results = model(dummy_image, verbose=False)
if results and len(results) > 0:
    boxes = results[0].boxes
    names = results[0].names
    if boxes is not None and len(boxes) > 0:
        print(f"Detected {len(boxes)} objects")
        # Show at most the first three detections
        for i in range(min(3, len(boxes))):
            cls_id = int(boxes.cls[i])
            conf = float(boxes.conf[i])
            bbox = boxes.xyxy[i].tolist()  # [x1, y1, x2, y2] in pixels
            print(f" {names[cls_id]}: {conf:.3f} at {[int(x) for x in bbox]}")
    else:
        print("No objects detected in this image")
else:
    print("No results returned")
print(f"Model task: {model.task}")
print(f"Model classes: {len(model.names)}")
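Expected output (random noise usually produces no detections, although an occasional low-confidence box is possible):
No objects detected in this image
Model task: detect
Model classes: 80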
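The grid idea described above can be made concrete with a small sketch. In the original YOLO formulation, the grid cell that contains an object's center is responsible for predicting it. The helper below is hypothetical (it is not part of the Ultralytics API), and the 7x7 grid follows the original YOLO paper; newer versions predict at several scales instead.

```python
def responsible_cell(box_xyxy, image_size, grid_size=7):
    """Return (row, col) of the grid cell that contains the box center.

    box_xyxy: (x1, y1, x2, y2) in pixels; image_size: (width, height).
    """
    x1, y1, x2, y2 = box_xyxy
    width, height = image_size
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Clamp so a center on the right or bottom border stays inside the grid.
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col

# Box centered at (200, 300) in a 640x640 image.
print(responsible_cell((100, 200, 300, 400), (640, 640)))  # (3, 2)
```

Because every cell makes its predictions in the same forward pass, no separate region-proposal stage is needed.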
Image Segmentation with U-Net
Image segmentation assigns a class label to each pixel in an image. U-Net is a convolutional architecture designed for biomedical image segmentation but widely used for general segmentation tasks. It has a contracting path (encoder) that captures context and an expanding path (decoder) that enables precise localization. Skip connections pass spatial information from encoder to decoder layers.
import numpy as np
import tensorflow as tf
from tensorflow import keras
def unet_model(input_shape=(128, 128, 3), num_classes=2):
    inputs = keras.Input(shape=input_shape)
    # Encoder (contracting path): convolutions followed by downsampling
    conv1 = keras.layers.Conv2D(64, 3, activation='relu', padding='same')(inputs)
    conv1 = keras.layers.Conv2D(64, 3, activation='relu', padding='same')(conv1)
    pool1 = keras.layers.MaxPooling2D(pool_size=(2, 2))(conv1)
    conv2 = keras.layers.Conv2D(128, 3, activation='relu', padding='same')(pool1)
    conv2 = keras.layers.Conv2D(128, 3, activation='relu', padding='same')(conv2)
    pool2 = keras.layers.MaxPooling2D(pool_size=(2, 2))(conv2)
    # Bottleneck
    conv3 = keras.layers.Conv2D(256, 3, activation='relu', padding='same')(pool2)
    conv3 = keras.layers.Conv2D(256, 3, activation='relu', padding='same')(conv3)
    # Decoder (expanding path): upsample, then concatenate the skip connection
    up1 = keras.layers.UpSampling2D(size=(2, 2))(conv3)
    concat1 = keras.layers.Concatenate()([up1, conv2])
    conv4 = keras.layers.Conv2D(128, 3, activation='relu', padding='same')(concat1)
    up2 = keras.layers.UpSampling2D(size=(2, 2))(conv4)
    concat2 = keras.layers.Concatenate()([up2, conv1])
    conv5 = keras.layers.Conv2D(64, 3, activation='relu', padding='same')(concat2)
    # A 1x1 convolution produces per-pixel class probabilities
    outputs = keras.layers.Conv2D(num_classes, 1, activation='softmax')(conv5)
    return keras.Model(inputs=inputs, outputs=outputs)
model_unet = unet_model((128, 128, 3), 2)
model_unet.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Random images and integer masks, used only to exercise the training loop
dummy_x = np.random.rand(4, 128, 128, 3).astype(np.float32)
dummy_y = np.random.randint(0, 2, (4, 128, 128, 1)).astype(np.uint8)
model_unet.fit(dummy_x, dummy_y, epochs=2, batch_size=2, verbose=0)
pred = model_unet.predict(dummy_x[:1], verbose=0)
print(f"U-Net input shape: {model_unet.input_shape}")
print(f"U-Net output shape: {model_unet.output_shape}")
print(f"Total params: {model_unet.count_params():,}")
print(f"Prediction shape: {pred.shape}")
print(f"Prediction range: [{pred.min():.3f}, {pred.max():.3f}]")
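Expected output (the prediction range shown is illustrative; it varies per run because the weights and training data are random):
U-Net input shape: (None, 128, 128, 3)
U-Net output shape: (None, 128, 128, 2)
Total params: 1,698,690
Prediction shape: (1, 128, 128, 2)
Prediction range: [0.234, 0.876]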
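The parameter count of this small U-Net can be cross-checked by hand, which also shows how the skip connections widen the decoder inputs. The sketch below uses plain Python and assumes the layer list mirrors the model defined above. Each Conv2D has kernel x kernel x input channels x output channels weights plus one bias per output channel, while the pooling, upsampling, and concatenation layers have no parameters.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

# (kernel size, input channels, output channels) for each Conv2D, in order.
layers = [
    (3, 3, 64), (3, 64, 64),        # encoder block 1
    (3, 64, 128), (3, 128, 128),    # encoder block 2
    (3, 128, 256), (3, 256, 256),   # bottleneck
    (3, 256 + 128, 128),            # decoder: upsampled 256 + skip 128
    (3, 128 + 64, 64),              # decoder: upsampled 128 + skip 64
    (1, 64, 2),                     # 1x1 classifier with num_classes=2
]
total = sum(conv2d_params(*layer) for layer in layers)
print(f"{total:,}")  # 1,698,690
```

The two decoder convolutions take 384 and 192 input channels rather than 256 and 128, because concatenation stacks the encoder feature maps onto the upsampled ones.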
Computer Vision Workflow
flowchart TD
A[Input Image] --> B[OpenCV Preprocessing]
B --> C[Resize & Normalize]
C --> D{Task Type}
D --> E[Object Detection]
D --> F[Segmentation]
D --> G[Classification]
E --> H[YOLO / R-CNN]
F --> I[U-Net / Mask R-CNN]
G --> J[CNN / ViT]
H --> K[Bounding Boxes]
I --> L[Pixel Masks]
J --> M[Class Label]
K --> N[Post-processing]
L --> N
M --> N
N --> O[NMS / Threshold]
O --> P[Final Output]
Common Errors and Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Wrong color channel | OpenCV uses BGR, matplotlib uses RGB | Convert with cv2.COLOR_BGR2RGB before display |
| Not normalizing inputs | Model expects [0,1] or [-1,1] range | Always normalize based on model training stats |
| Ignoring aspect ratio | Resizing distorts objects | Use letterbox padding or aspect-ratio-preserving resize |
| Low confidence threshold | Too many false positives | Tune confidence threshold on validation set |
| No data augmentation | Model overfits to training views | Add rotation, flip, brightness augmentation |
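The letterbox fix from the table comes down to a small calculation: scale the image by the smaller of the two width and height ratios, then pad the remainder evenly. The following is a minimal sketch with a hypothetical helper name; it computes only the geometry and leaves the actual resizing and padding to the image library.

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving resize."""
    # The smaller ratio guarantees the resized image fits in both dimensions.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # Split the leftover space so the image is centered on the canvas.
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top

# A 1280x720 frame placed on a 640x640 model input.
print(letterbox_params(1280, 720, 640, 640))  # (640, 360, 0, 140)
```

With OpenCV, one would typically resize to (new_w, new_h) with cv2.resize and then add the borders with cv2.copyMakeBorder. Predicted boxes must be shifted back by the padding and divided by the scale to map them onto the original image.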
Practice Questions
- What is the difference between object detection and image segmentation?
Answer: Object detection draws a rectangular bounding box around each object. Segmentation assigns a class label to every pixel, producing a mask. Segmentation provides finer detail: you know exactly which pixels belong to each object, not just the enclosing rectangle.
- How does YOLO achieve real-time object detection?
Answer: YOLO treats detection as a single regression problem, predicting bounding boxes and class probabilities in one forward pass through the network. Unlike two-stage detectors (Faster R-CNN), YOLO does not use region proposals, making it significantly faster.
- What is the role of skip connections in U-Net?
Answer: Skip connections pass spatial information from the encoder (contracting path) to the decoder (expanding path). This preserves fine-grained spatial details that would otherwise be lost during downsampling, enabling precise pixel-level segmentation.
- Why does OpenCV use BGR instead of RGB?
Answer: OpenCV historically used BGR because early camera drivers and video codecs produced frames in BGR order. OpenCV maintains this convention for backward compatibility. Most display libraries (matplotlib) expect RGB.
- What is non-maximum suppression in object detection?
Answer: NMS removes duplicate bounding boxes for the same object. When multiple boxes overlap on the same object, NMS keeps the one with highest confidence and suppresses others with IoU above a threshold (typically 0.5).
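The NMS procedure described in the last answer can be sketched in plain Python. This is a simplified, class-agnostic greedy version for illustration. Detection libraries ship optimized implementations and usually apply NMS per class, and the function names here are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The first two boxes overlap with an IoU of about 0.68, so the lower-scoring one is suppressed; the third box overlaps nothing and is kept.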
Challenge
Build a complete Computer Vision pipeline for traffic sign detection. Use OpenCV to preprocess images (resize, normalize, augment), train a YOLOv8 model on a traffic sign detection dataset such as GTSDB (the German Traffic Sign Detection Benchmark; the related GTSRB dataset contains cropped signs intended for classification), apply non-maximum suppression, and evaluate using mean average precision (mAP). Implement real-time detection on video frames with visualization of bounding boxes and confidence scores.
Real-World Task
Design an automated inspection system for a manufacturing assembly line. Cameras capture images of each product as it passes. A YOLO model detects defects (scratches, dents, missing components) with bounding boxes. A segmentation model identifies the exact defect region for measurement. When defects exceed a threshold, the system triggers rejection and logs the defect type and image for quality analysis.
Next Steps
Deploy CV models with TensorFlow Serving or ONNX Runtime. Use Docker for deployment. Process video streams with Python and OpenCV. Track experiments with MLflow.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.