Multimodal AI — Working with Text, Images and Audio in Unified Models

DodaTech Updated 2026-06-22 7 min read

Multimodal AI models process text, images, and audio within a single unified architecture. This guide covers building applications that understand and generate content across multiple modalities using state-of-the-art models.

What You'll Learn

You'll learn to work with multimodal models, including GPT-4o for vision and text, CLIP for image-text matching, and Whisper for audio transcription, and to build applications that combine multiple modalities.

Why It Matters

Real-world AI applications rarely involve a single modality. A customer support system needs to read screenshots, transcribe voice messages, and generate text responses. Multimodal models handle all three without separate pipelines.

Real-World Use

Durga Antivirus Pro uses multimodal AI to analyze suspicious email attachments — extracting text from PDF images, analyzing embedded images for malicious QR codes, and transcribing audio voicemail attachments for threat detection.

Multimodal Architecture

flowchart TD
    A[User Input] --> B{Modality}
    B -->|Text| C[Text Encoder]
    B -->|Image| D[Vision Encoder]
    B -->|Audio| E[Audio Encoder]
    C --> F[Fusion Layer]
    D --> F
    E --> F
    F --> G[Unified Decoder]
    G --> H[Text Output]
    G --> I[Image Output]
    G --> J[Audio Output]
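
The routing and fusion steps in the diagram can be sketched in a few lines of plain Python. This is a toy illustration only, not how any production model is implemented: the encoder functions, the 4-dimensional vectors, and the use of concatenation as the fusion step are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]

# Hypothetical stand-in encoders: each maps raw input to a fixed-size vector.
def encode_text(data: str) -> Vector:
    return [float(len(data)), 0.0, 0.0, 0.0]

def encode_image(data: bytes) -> Vector:
    return [0.0, float(len(data)), 0.0, 0.0]

def encode_audio(data: bytes) -> Vector:
    return [0.0, 0.0, float(len(data)), 0.0]

# Modality name -> encoder, mirroring the branch in the flowchart
ENCODERS: Dict[str, Callable] = {
    "text": encode_text,
    "image": encode_image,
    "audio": encode_audio,
}

def fuse(inputs: List[Tuple[str, object]]) -> Vector:
    """Route each (modality, payload) pair to its encoder, then concatenate."""
    fused: Vector = []
    for modality, payload in inputs:
        if modality not in ENCODERS:
            raise ValueError(f"Unsupported modality: {modality}")
        fused.extend(ENCODERS[modality](payload))
    return fused

fused = fuse([("text", "hello"), ("image", b"\x00\x01\x02")])
print(len(fused))  # 8: two 4-dimensional embeddings joined end to end
```

In a real model the encoders are learned networks and the fused representation feeds a decoder; the sketch only shows the dispatch-then-combine shape of the pipeline.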

Vision with GPT-4o

Send images to GPT-4o for analysis, description, and question answering.

from openai import OpenAI
import base64

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def analyze_image(
    image_path: str, prompt: str = "Describe this image in detail."
) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

# Simulate with a placeholder
def mock_analyze_image(path: str, prompt: str) -> str:
    return (
        f"[Analysis of {path}]\n"
        f"The image shows a screenshot of a code editor with Python "
        f"syntax highlighting. There is a function definition for "
        f"'calculate_total' that takes a list of prices and returns "
        f"the sum with tax. The indentation is consistent and follows "
        f"PEP 8 conventions."
    )

result = mock_analyze_image("screenshot.png", "What code is shown?")
print(result)

Expected output:

[Analysis of screenshot.png]
The image shows a screenshot of a code editor with Python syntax highlighting. There is a function definition for 'calculate_total' that takes a list of prices and returns the sum with tax. The indentation is consistent and follows PEP 8 conventions.
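
Note that `analyze_image` labels every file as `image/jpeg`, even though the example passes a `.png` path. One possible way to handle this is to derive the MIME type from the file extension. The sketch below uses the standard library's `mimetypes` module; `build_data_url` is a hypothetical helper name, not part of the OpenAI SDK.

```python
import base64
import mimetypes

def build_data_url(filename: str, image_bytes: bytes) -> str:
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {filename}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

url = build_data_url("screenshot.png", b"fake-bytes")
print(url.split(",")[0])  # data:image/png;base64
```

The returned string could be passed as the `url` value of the `image_url` content part in place of the hardcoded JPEG prefix.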

Audio Transcription with Whisper

Transcribe and analyze audio using OpenAI's Whisper model.

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def transcribe_audio(audio_path: str, language: str = "en") -> dict:
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language=language,
            response_format="verbose_json"
        )

    return {
        "text": transcription.text,
        "duration": transcription.duration,
        "segments": [
            {
                "start": seg.start,
                "end": seg.end,
                "text": seg.text
            }
            for seg in transcription.segments
        ]
    }

def analyze_sentiment_from_audio(transcription: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Analyze sentiment from transcribed speech."},
            {"role": "user", "content": f"Transcript: {transcription['text']}"}
        ]
    )
    return response.choices[0].message.content

# Simulate
def mock_transcribe(path: str, language: str = "en") -> dict:
    return {
        "text": "I am very happy with the product. It works perfectly and exceeded my expectations.",
        "duration": 3.5,
        "segments": [
            {"start": 0.0, "end": 3.5, "text": "I am very happy with the product. It works perfectly and exceeded my expectations."}
        ]
    }

transcript = mock_transcribe("feedback.mp3")
print(f"Transcribed: {transcript['text']}")
print(f"Duration: {transcript['duration']}s")
print(f"Segments: {len(transcript['segments'])}")

Expected output:

Transcribed: I am very happy with the product. It works perfectly and exceeded my expectations.
Duration: 3.5s
Segments: 1
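
The `segments` list is useful for more than counting. As a minimal sketch, assuming segment dictionaries shaped like the ones returned by `transcribe_audio` and `mock_transcribe` above, the hypothetical helpers below turn them into timestamped lines suitable for a review log or simple captions.

```python
from typing import Dict, List

def format_timestamp(seconds: float) -> str:
    """Convert seconds to an MM:SS.mmm string."""
    total_ms = int(round(seconds * 1000))
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"

def format_segments(segments: List[Dict]) -> List[str]:
    """Render each segment as '[start -> end] text'."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]

sample = [{"start": 0.0, "end": 3.5, "text": " I am very happy with the product."}]
for line in format_segments(sample):
    print(line)  # [00:00.000 -> 00:03.500] I am very happy with the product.
```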

Image-Text Matching with CLIP

Use CLIP to find images that match a text description.

import torch
from PIL import Image
from typing import List

# Requires: pip install transformers torch pillow

def clip_zero_shot_classification(
    image_paths: List[str],
    candidate_labels: List[str]
) -> List[dict]:
    # Load model and processor
    from transformers import CLIPProcessor, CLIPModel

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained(
        "openai/clip-vit-base-patch32"
    )

    images = [Image.open(p) for p in image_paths]

    inputs = processor(
        text=candidate_labels,
        images=images,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        outputs = model(**inputs)
        logits_per_image = outputs.logits_per_image
        probs = logits_per_image.softmax(dim=1)

    results = []
    for i, path in enumerate(image_paths):
        scores = {
            label: round(probs[i][j].item(), 3)
            for j, label in enumerate(candidate_labels)
        }
        predicted = max(scores, key=scores.get)
        results.append({
            "image": path,
            "predicted": predicted,
            "confidence": scores[predicted],
            "all_scores": scores
        })

    return results

# Simulate
def mock_clip_classify(paths: List[str], labels: List[str]) -> List[dict]:
    import random
    results = []
    for path in paths:
        scores = {label: round(random.random(), 3) for label in labels}
        norm = sum(scores.values())
        scores = {k: round(v/norm, 3) for k, v in scores.items()}
        predicted = max(scores, key=scores.get)
        results.append({
            "image": path,
            "predicted": predicted,
            "confidence": scores[predicted],
            "all_scores": scores
        })
    return results

labels = ["cat", "dog", "car", "building"]
paths = ["image1.jpg", "image2.jpg"]
results = mock_clip_classify(paths, labels)
for r in results:
    print(f"Image: {r['image']}")
    print(f"Predicted: {r['predicted']} (conf: {r['confidence']})")

Example output (the mock draws random scores, so the predicted labels and confidences vary between runs):

Image: image1.jpg
Predicted: cat (conf: 0.642)
Image: image2.jpg
Predicted: building (conf: 0.513)
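
In the real CLIP function, `logits_per_image.softmax(dim=1)` converts each image's similarity scores into probabilities that sum to 1 across the candidate labels. A plain-Python equivalent for a single image might look like the sketch below; the logit values are made up for illustration and do not come from an actual model run.

```python
import math
from typing import Dict, List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_probabilities(labels: List[str], logits: List[float]) -> Dict[str, float]:
    """Pair each label with its softmax probability."""
    return dict(zip(labels, softmax(logits)))

# Made-up similarity logits for one image
probs = label_probabilities(["cat", "dog", "car"], [24.0, 21.0, 18.0])
best = max(probs, key=probs.get)
print(best)  # cat
```

Because softmax always sums to 1, a high probability only means a label beats the other candidates, not that it describes the image well. This is why the label set matters so much in zero-shot classification.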

Multimodal Chat with Image Context

Build a chatbot that understands both text and image context.

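from typing import List, Dict
import base64
from openai import OpenAI

class MultimodalChatbot:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        # Full message history, resent to the API on every turn
        self.conversation_history: List[Dict] = []
        # Needs OPENAI_API_KEY in the environment, even for the mock run below
        self.client = OpenAI()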

    def add_image_message(
        self, role: str, text: str, image_bytes: bytes
    ):
        encoded = base64.b64encode(image_bytes).decode("utf-8")
        content = [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encoded}",
                    "detail": "auto"
                }
            }
        ]
        self.conversation_history.append({
            "role": role, "content": content
        })

    def add_text_message(self, role: str, text: str):
        self.conversation_history.append({
            "role": role,
            "content": [{"type": "text", "text": text}]
        })

    def get_response(self) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.conversation_history,
            max_tokens=500
        )
        reply = response.choices[0].message.content
        self.conversation_history.append({
            "role": "assistant",
            "content": [{"type": "text", "text": reply}]
        })
        return reply

# Simulate a multimodal conversation
def mock_chat() -> str:
    return "This UI screenshot shows a login form with email and password fields. The 'Sign In' button is disabled, suggesting validation is incomplete. I can see a 'Forgot Password?' link below the form."

chatbot = MultimodalChatbot()
chatbot.add_text_message("user", "Analyze this screenshot and tell me what you see:")
print("User: Analyze this screenshot and tell me what you see:")
response = mock_chat()
print(f"Assistant: {response}")

chatbot.add_text_message("user", "What improvements would you suggest?")
print("\nUser: What improvements would you suggest?")
print("Assistant: Add password visibility toggle, show inline validation errors, and include social login options.")

Expected output:

User: Analyze this screenshot and tell me what you see:
Assistant: This UI screenshot shows a login form with email and password fields. The 'Sign In' button is disabled, suggesting validation is incomplete. I can see a 'Forgot Password?' link below the form.

User: What improvements would you suggest?
Assistant: Add password visibility toggle, show inline validation errors, and include social login options.

Common Errors

Error	Cause	Fix
Image too large for API	Base64 encoding exceeds size limit	Resize image to max 2048px on longest side before encoding
Audio transcription has long gaps	Silence in audio not handled	Use `prompt` parameter with topic context to improve continuity
CLIP gives low confidence on all labels	None of the labels match image content	Expand candidate labels or use open-vocabulary approaches
Multimodal context window exceeded	Too many images in conversation	Summarize previous image analysis as text and remove raw images
Whisper misrecognizes technical terms	Acoustic model bias toward common words	Use `prompt` parameter with domain-specific vocabulary
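
The fix for an exceeded context window can be sketched as a small history-trimming function. This is one possible approach under stated assumptions: it expects messages shaped like those built by `MultimodalChatbot`, `strip_old_images` is a hypothetical helper, and it swaps each old image for a fixed placeholder rather than generating a summary, relying on the assistant's earlier text replies to carry the analysis.

```python
from typing import Dict, List

def strip_old_images(
    history: List[Dict],
    keep_last: int = 1,
    placeholder: str = "[image removed; see earlier analysis]",
) -> List[Dict]:
    """Return a copy of history in which only the newest `keep_last`
    image-bearing messages still contain their image parts."""
    # Indices of messages that contain at least one image part
    image_idx = [
        i for i, msg in enumerate(history)
        if isinstance(msg["content"], list)
        and any(part.get("type") == "image_url" for part in msg["content"])
    ]
    to_strip = set(image_idx[:-keep_last]) if keep_last > 0 else set(image_idx)

    trimmed = []
    for i, msg in enumerate(history):
        if i in to_strip:
            # Keep the text parts, drop the image, and leave a marker behind
            parts = [p for p in msg["content"] if p.get("type") != "image_url"]
            parts.append({"type": "text", "text": placeholder})
            trimmed.append({"role": msg["role"], "content": parts})
        else:
            trimmed.append(msg)
    return trimmed

def image_part(url: str) -> Dict:
    return {"type": "image_url", "image_url": {"url": url}}

history = [
    {"role": "user", "content": [{"type": "text", "text": "First screenshot"},
                                 image_part("data:image/png;base64,AAA")]},
    {"role": "assistant", "content": [{"type": "text", "text": "A login form."}]},
    {"role": "user", "content": [{"type": "text", "text": "Second screenshot"},
                                 image_part("data:image/png;base64,BBB")]},
]
trimmed = strip_old_images(history)
print([p["type"] for p in trimmed[0]["content"]])  # ['text', 'text']
```

The original `history` list is left untouched, so the full conversation can still be stored elsewhere while the trimmed copy is sent to the API.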

Practice Questions

How does GPT-4o process images differently from text? GPT-4o tokenizes images into visual tokens that are processed alongside text tokens in the same transformer architecture.
What is CLIP's contrastive learning objective? CLIP learns to maximize cosine similarity between matching image-text pairs and minimize it for non-matching pairs in a batch.
Why is Whisper robust to background noise? Whisper was trained on 680,000 hours of multilingual audio with diverse noise conditions, making it generalize well to real-world audio.
What are the trade-offs of sending high-resolution images to GPT-4o? High detail costs more tokens (from a few hundred to over a thousand per image, depending on its dimensions) but provides better accuracy for tasks requiring fine visual detail.
Challenge: Build a multimodal search engine that accepts text, image, or audio queries — uses CLIP for image-to-image search, Whisper for voice-to-text, and GPT-4o for mixed-modality queries, returning results ranked by relevance across all modalities.
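
For the detail trade-off raised in the practice questions, a rough token estimate helps when budgeting requests. The sketch below follows the tile-based accounting OpenAI has documented for GPT-4o: a flat cost for low detail, and a base cost plus a per-tile cost for high detail after resizing. The constants (85 base tokens, 170 per 512-pixel tile), the rounding to whole pixels, and the downscale-only behavior are assumptions that should be checked against current pricing documentation; `estimate_image_tokens` is a hypothetical helper.

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o input tokens for one image (assumed pricing rules)."""
    if detail == "low":
        return 85  # assumed flat cost, regardless of image size

    # Step 1: shrink to fit inside a 2048 x 2048 square (never upscale)
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: shrink so the shortest side is at most 768 px (never upscale)
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512 px tiles and apply the assumed per-tile cost
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(estimate_image_tokens(1024, 1024))         # 765
print(estimate_image_tokens(2048, 4096))         # 1105
print(estimate_image_tokens(4096, 8192, "low"))  # 85
```

Under these assumptions, resizing a screenshot before upload changes the tile count and therefore the cost, which is one reason to downscale images that do not need fine detail.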

Mini Project

Build a visual QA system for product documentation. Take screenshots of software UI pages, use GPT-4o to analyze each screenshot and generate a structured description of UI elements, store descriptions with embeddings in a vector database, and build a chatbot that answers user questions like "Where is the settings button?" by finding the relevant screenshot and generating a text response with spatial guidance.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
