Multimodal AI — Working with Text, Images and Audio in Unified Models
Multimodal AI models process text, images, and audio within a single unified architecture. This guide covers building applications that understand and generate content across multiple modalities using state-of-the-art models.
What You'll Learn
You'll learn to work with multimodal models, including GPT-4o for vision and text, CLIP for image-text matching, and Whisper for audio transcription, and to build applications that combine multiple modalities.
Why It Matters
Real-world AI applications rarely involve a single modality. A customer support system needs to read screenshots, transcribe voice messages, and generate text responses. Multimodal models handle all three without separate pipelines.
Real-World Use
Durga Antivirus Pro uses multimodal AI to analyze suspicious email attachments — extracting text from PDF images, analyzing embedded images for malicious QR codes, and transcribing audio voicemail attachments for threat detection.
Multimodal Architecture
flowchart TD
A[User Input] --> B{Modality}
B -->|Text| C[Text Encoder]
B -->|Image| D[Vision Encoder]
B -->|Audio| E[Audio Encoder]
C --> F[Fusion Layer]
D --> F
E --> F
F --> G[Unified Decoder]
G --> H[Text Output]
G --> I[Image Output]
G --> J[Audio Output]
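The flow in the diagram can be sketched in plain Python. This is a toy illustration only: the encoder functions, the `EMBED_DIM` constant, and the byte-sum "embeddings" are hypothetical stand-ins, not how any production model computes features. It shows just the shape of the pipeline: each input is routed to its modality's encoder, every encoder emits vectors of one shared width, and the fusion step concatenates them into a single tagged sequence that one decoder could consume.

```python
from typing import Callable, Dict, List, Tuple

EMBED_DIM = 4  # hypothetical shared embedding width

Vector = List[float]


def _toy_embed(values: bytes) -> Vector:
    """Map raw bytes to a fixed-width vector (placeholder for a real encoder)."""
    vec = [0.0] * EMBED_DIM
    for i, b in enumerate(values):
        vec[i % EMBED_DIM] += b / 255.0
    return vec


def encode_text(text: str) -> List[Vector]:
    # One vector per whitespace-separated word.
    return [_toy_embed(word.encode("utf-8")) for word in text.split()]


def encode_image(pixels: bytes, patch_size: int = 8) -> List[Vector]:
    # One vector per fixed-size chunk of bytes, standing in for image patches.
    return [_toy_embed(pixels[i:i + patch_size]) for i in range(0, len(pixels), patch_size)]


def encode_audio(samples: bytes, frame_size: int = 16) -> List[Vector]:
    # One vector per fixed-size frame of samples.
    return [_toy_embed(samples[i:i + frame_size]) for i in range(0, len(samples), frame_size)]


ENCODERS: Dict[str, Callable] = {
    "text": encode_text,
    "image": encode_image,
    "audio": encode_audio,
}


def fuse(inputs: List[Tuple[str, object]]) -> List[Tuple[str, Vector]]:
    """Route each input to its encoder and concatenate into one tagged sequence."""
    sequence: List[Tuple[str, Vector]] = []
    for modality, payload in inputs:
        if modality not in ENCODERS:
            raise ValueError(f"Unsupported modality: {modality}")
        sequence.extend((modality, vec) for vec in ENCODERS[modality](payload))
    return sequence


fused = fuse([("text", "is this safe"), ("image", bytes(range(16))), ("audio", bytes(32))])
print(len(fused), [tag for tag, _ in fused])
```

Real models learn the encoders and the fusion jointly; the sketch only makes the routing and the shared vector width concrete.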
Vision with GPT-4o
Send images to GPT-4o for analysis, description, and question answering.
from openai import OpenAI
import base64

# OpenAI() requires the OPENAI_API_KEY environment variable (or an explicit api_key argument).
client = OpenAI()


def analyze_image(
    image_path: str, prompt: str = "Describe this image in detail."
) -> str:
    # Read the image and encode it as base64 for an inline data URL.
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            # The MIME type here assumes a JPEG file.
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high",
                        },
                    },
                ],
            }
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content
# Simulate with a placeholder
def mock_analyze_image(path: str, prompt: str) -> str:
    return (
        f"[Analysis of {path}]\n"
        "The image shows a screenshot of a code editor with Python "
        "syntax highlighting. There is a function definition for "
        "'calculate_total' that takes a list of prices and returns "
        "the sum with tax. The indentation is consistent and follows "
        "PEP 8 conventions."
    )


result = mock_analyze_image("screenshot.png", "What code is shown?")
print(result)
Expected output:
[Analysis of screenshot.png]
The image shows a screenshot of a code editor with Python syntax highlighting. There is a function definition for 'calculate_total' that takes a list of prices and returns the sum with tax. The indentation is consistent and follows PEP 8 conventions.
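`analyze_image` labels every file as `image/jpeg`, although the demo passes `screenshot.png`. One way to keep the data URL consistent with the file is to infer the MIME type from the filename. The helper below is a minimal sketch using only the standard library; `build_image_data_url` is a hypothetical name, not part of the OpenAI SDK.

```python
import base64
import mimetypes


def build_image_data_url(image_bytes: bytes, filename: str) -> str:
    """Return a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Only the prefix is printed; the payload is the base64 text after the comma.
url = build_image_data_url(b"\x89PNG", "screenshot.png")
print(url.split(",")[0])
```

The returned string can be used as the `url` value of an `image_url` content part in place of the hard-coded f-string.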
Audio Transcription with Whisper
Transcribe and analyze audio using OpenAI's Whisper model.
from openai import OpenAI

# OpenAI() requires the OPENAI_API_KEY environment variable (or an explicit api_key argument).
client = OpenAI()


def transcribe_audio(audio_path: str, language: str = "en") -> dict:
    with open(audio_path, "rb") as audio_file:
        # verbose_json returns the duration and per-segment timestamps.
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language=language,
            response_format="verbose_json"
        )
    return {
        "text": transcription.text,
        "duration": transcription.duration,
        "segments": [
            {
                "start": seg.start,
                "end": seg.end,
                "text": seg.text
            }
            # Fall back to an empty list if no segments are returned.
            for seg in (transcription.segments or [])
        ]
    }


def analyze_sentiment_from_audio(transcription: dict) -> str:
    # Pass the transcript text to a chat model for sentiment analysis.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Analyze sentiment from transcribed speech."},
            {"role": "user", "content": f"Transcript: {transcription['text']}"}
        ]
    )
    return response.choices[0].message.content
# Simulate
def mock_transcribe(path: str, language: str = "en") -> dict:
    return {
        "text": "I am very happy with the product. It works perfectly and exceeded my expectations.",
        "duration": 3.5,
        "segments": [
            {"start": 0.0, "end": 3.5, "text": "I am very happy with the product. It works perfectly and exceeded my expectations."}
        ]
    }


transcript = mock_transcribe("feedback.mp3")
print(f"Transcribed: {transcript['text']}")
print(f"Duration: {transcript['duration']}s")
print(f"Segments: {len(transcript['segments'])}")
Expected output:
Transcribed: I am very happy with the product. It works perfectly and exceeded my expectations.
Duration: 3.5s
Segments: 1
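The segment timestamps returned above can also be used to locate long pauses in a recording. The following is a small sketch that works on the dictionary shape produced by `transcribe_audio`; the function name `find_silence_gaps`, the 2-second threshold, and the sample segments are illustrative assumptions.

```python
from typing import Dict, List


def find_silence_gaps(segments: List[Dict], min_gap: float = 2.0) -> List[Dict]:
    """Return pauses between consecutive segments lasting at least min_gap seconds."""
    gaps = []
    ordered = sorted(segments, key=lambda seg: seg["start"])
    for prev, nxt in zip(ordered, ordered[1:]):
        pause = nxt["start"] - prev["end"]
        if pause >= min_gap:
            gaps.append({
                "start": prev["end"],
                "end": nxt["start"],
                "duration": round(pause, 2),
            })
    return gaps


# Hypothetical segments with one long pause between the first and second.
segments = [
    {"start": 0.0, "end": 3.5, "text": "I am very happy with the product."},
    {"start": 9.0, "end": 12.0, "text": "It works perfectly."},
    {"start": 12.4, "end": 15.0, "text": "It exceeded my expectations."},
]
print(find_silence_gaps(segments))
```

Only the 5.5-second pause between the first two segments is reported; the 0.4-second pause falls below the threshold.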
Image-Text Matching with CLIP
Use CLIP to score how well each image matches a set of candidate text labels, then pick the best label for each image (zero-shot classification).
import torch
from PIL import Image
from typing import List


# Requires: pip install transformers torch pillow
def clip_zero_shot_classification(
    image_paths: List[str],
    candidate_labels: List[str]
) -> List[dict]:
    # Load model and processor
    from transformers import CLIPProcessor, CLIPModel
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained(
        "openai/clip-vit-base-patch32"
    )
    # Convert to RGB so grayscale and RGBA files are handled consistently
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(
        text=candidate_labels,
        images=images,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_labels)
    logits_per_image = outputs.logits_per_image
    # Softmax over the label dimension gives per-image label probabilities
    probs = logits_per_image.softmax(dim=1)
    results = []
    for i, path in enumerate(image_paths):
        scores = {
            label: round(probs[i][j].item(), 3)
            for j, label in enumerate(candidate_labels)
        }
        predicted = max(scores, key=scores.get)
        results.append({
            "image": path,
            "predicted": predicted,
            "confidence": scores[predicted],
            "all_scores": scores
        })
    return results
# Simulate
def mock_clip_classify(paths: List[str], labels: List[str]) -> List[dict]:
    import random
    results = []
    for path in paths:
        # Random scores, normalized to sum to roughly 1 like softmax output
        scores = {label: round(random.random(), 3) for label in labels}
        norm = sum(scores.values())
        scores = {k: round(v/norm, 3) for k, v in scores.items()}
        predicted = max(scores, key=scores.get)
        results.append({
            "image": path,
            "predicted": predicted,
            "confidence": scores[predicted],
            "all_scores": scores
        })
    return results


labels = ["cat", "dog", "car", "building"]
paths = ["image1.jpg", "image2.jpg"]
results = mock_clip_classify(paths, labels)
for r in results:
    print(f"Image: {r['image']}")
    print(f"Predicted: {r['predicted']} (conf: {r['confidence']})")
Example output (the mock uses random scores, so labels and values vary between runs):
Image: image1.jpg
Predicted: cat (conf: 0.642)
Image: image2.jpg
Predicted: building (conf: 0.513)
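The probabilities CLIP reports come from cosine similarities between an image embedding and each label embedding, multiplied by a learned scale and passed through a softmax. The sketch below reproduces that arithmetic in plain Python on made-up 3-dimensional vectors. The vectors and the `logit_scale` default of 100.0 are assumptions for illustration; real embeddings have hundreds of dimensions and the scale is a trained parameter of each checkpoint.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(
    image_vec: List[float],
    label_vecs: List[List[float]],
    logit_scale: float = 100.0,  # assumed value; learned in the real model
) -> List[float]:
    """Softmax over scaled cosine similarities between one image and each label."""
    logits = [logit_scale * cosine_similarity(image_vec, lv) for lv in label_vecs]
    return softmax(logits)


# Made-up embeddings: the image vector points mostly along the "cat" axis.
image_vec = [0.9, 0.1, 0.0]
label_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # cat, dog, car
probs = label_probabilities(image_vec, label_vecs)
print([round(p, 3) for p in probs])
```

Because the softmax runs over the candidate labels only, the probabilities always sum to 1 even when no label fits the image well.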
Multimodal Chat with Image Context
Build a chatbot that understands both text and image context.
from typing import Dict, List
import base64

from openai import OpenAI


class MultimodalChatbot:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.conversation_history: List[Dict] = []
        # Created on first use so the class can be instantiated without an API key.
        self.client = None

    def add_image_message(
        self, role: str, text: str, image_bytes: bytes
    ):
        encoded = base64.b64encode(image_bytes).decode("utf-8")
        content = [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {
                    # The MIME type here assumes JPEG bytes.
                    "url": f"data:image/jpeg;base64,{encoded}",
                    "detail": "auto",
                },
            },
        ]
        self.conversation_history.append({
            "role": role, "content": content
        })

    def add_text_message(self, role: str, text: str):
        self.conversation_history.append({
            "role": role,
            "content": [{"type": "text", "text": text}]
        })

    def get_response(self) -> str:
        if self.client is None:
            # OpenAI() requires the OPENAI_API_KEY environment variable.
            self.client = OpenAI()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.conversation_history,
            max_tokens=500
        )
        reply = response.choices[0].message.content
        # Store the reply so later turns see the full conversation.
        self.conversation_history.append({
            "role": "assistant",
            "content": [{"type": "text", "text": reply}]
        })
        return reply
# Simulate a multimodal conversation
def mock_chat() -> str:
    return "This UI screenshot shows a login form with email and password fields. The 'Sign In' button is disabled, suggesting validation is incomplete. I can see a 'Forgot Password?' link below the form."


chatbot = MultimodalChatbot()
chatbot.add_text_message("user", "Analyze this screenshot and tell me what you see:")
print("User: Analyze this screenshot and tell me what you see:")
response = mock_chat()
print(f"Assistant: {response}")
chatbot.add_text_message("user", "What improvements would you suggest?")
print("\nUser: What improvements would you suggest?")
print("Assistant: Add password visibility toggle, show inline validation errors, and include social login options.")
Expected output:
User: Analyze this screenshot and tell me what you see:
Assistant: This UI screenshot shows a login form with email and password fields. The 'Sign In' button is disabled, suggesting validation is incomplete. I can see a 'Forgot Password?' link below the form.
User: What improvements would you suggest?
Assistant: Add password visibility toggle, show inline validation errors, and include social login options.
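The chatbot above resends every stored image on each turn, so `conversation_history` grows quickly in long conversations. One possible mitigation is to keep only the most recent image and replace earlier ones with a short text part. The sketch below operates on the same message shape the class uses; `strip_old_images`, its parameters, and the placeholder text are hypothetical, and a fuller version might substitute the assistant's earlier description of each image.

```python
from typing import Dict, List


def strip_old_images(
    history: List[Dict],
    keep_last: int = 1,
    placeholder: str = "[image removed; see earlier analysis]",
) -> List[Dict]:
    """Return a copy of history that keeps only the newest keep_last images."""
    remaining = keep_last
    pruned = []
    # Walk backwards so the most recent images are the ones preserved.
    for message in reversed(history):
        content = message["content"]
        if isinstance(content, str):
            pruned.append(dict(message))
            continue
        new_parts = []
        for part in reversed(content):
            if part.get("type") == "image_url" and remaining > 0:
                remaining -= 1
                new_parts.append(part)
            elif part.get("type") == "image_url":
                new_parts.append({"type": "text", "text": placeholder})
            else:
                new_parts.append(part)
        pruned.append({**message, "content": list(reversed(new_parts))})
    return list(reversed(pruned))


history = [
    {"role": "user", "content": [
        {"type": "text", "text": "Analyze this screenshot."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "It shows a login form."}]},
    {"role": "user", "content": [
        {"type": "text", "text": "And this one?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,BBBB"}},
    ]},
]
pruned = strip_old_images(history)
print([part["type"] for message in pruned for part in message["content"]])
```

The function returns new message dictionaries and leaves the original history untouched, so the full record can still be kept elsewhere.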
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Image too large for API | Base64 encoding exceeds size limit | Resize image to max 2048px on longest side before encoding |
| Audio transcription has long gaps | Silence in audio not handled | Use prompt parameter with topic context to improve continuity |
| CLIP gives low confidence on all labels | None of the labels match image content | Expand candidate labels or use open-vocabulary approaches |
| Multimodal context window exceeded | Too many images in conversation | Summarize previous image analysis as text and remove raw images |
| Whisper misrecognizes technical terms | Acoustic model bias toward common words | Use prompt parameter with domain-specific vocabulary |
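The first row of the table recommends resizing to at most 2048px on the longest side before encoding. The arithmetic behind that advice can be sketched without any imaging library: compute the target dimensions that preserve the aspect ratio, and estimate how large the base64 payload will be for a given file size. `fit_within` and `base64_length` are hypothetical helper names.

```python
import math
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Scale dimensions down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def base64_length(num_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of num_bytes bytes."""
    return 4 * math.ceil(num_bytes / 3)


print(fit_within(4096, 3072))
print(base64_length(3_000_000))
```

With Pillow, the tuple from `fit_within` can be passed to `Image.resize` before the bytes are encoded. Base64 inflates the payload by about one third, which is worth accounting for when checking a request against a size limit.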
Practice Questions
How does GPT-4o process images differently from text? GPT-4o tokenizes images into visual tokens that are processed alongside text tokens in the same transformer architecture.
What is CLIP's contrastive learning objective? CLIP learns to maximize cosine similarity between matching image-text pairs and minimize it for non-matching pairs in a batch.
Why is Whisper robust to background noise? Whisper was trained on 680,000 hours of multilingual audio with diverse noise conditions, making it generalize well to real-world audio.
What are the trade-offs of sending high-resolution images to GPT-4o? High detail costs more tokens (a fixed base cost plus a cost per 512px tile, which can exceed 1,000 tokens for a large image) but provides better accuracy for tasks requiring fine visual detail.
Challenge: Build a multimodal search engine that accepts text, image, or audio queries — uses CLIP for image-to-image search, Whisper for voice-to-text, and GPT-4o for mixed-modality queries, returning results ranked by relevance across all modalities.
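For the question on high-resolution trade-offs, the cost can be estimated ahead of time. The sketch below follows the tile-based accounting OpenAI has documented for GPT-4o image inputs: the image is scaled to fit within 2048 x 2048, then scaled down so the shortest side is 768px, and charged a base amount plus a per-tile amount for each 512px tile. The constants (85 base tokens, 170 per tile) and the choice to only ever scale down are assumptions taken from that documentation at the time of writing and may differ for other models or change later.

```python
import math


def estimate_image_tokens(
    width: int,
    height: int,
    detail: str = "high",
    base_tokens: int = 85,   # assumed base cost per image
    tile_tokens: int = 170,  # assumed cost per 512px tile
) -> int:
    """Estimate the token cost of one image input."""
    if detail == "low":
        return base_tokens  # low detail is a flat cost regardless of size

    # Step 1: scale down to fit within a 2048 x 2048 square.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)

    # Step 2: scale down so the shortest side is at most 768px.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)

    # Step 3: count the 512px tiles needed to cover the image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base_tokens + tile_tokens * tiles


print(estimate_image_tokens(1024, 1024))          # 765
print(estimate_image_tokens(2048, 4096))          # 1105
print(estimate_image_tokens(2048, 4096, "low"))   # 85
```

Comparing the high and low estimates for a given image gives a quick sense of whether the extra detail is worth the cost for a particular task.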
Mini Project
Build a visual QA system for product documentation. Take screenshots of software UI pages, use GPT-4o to analyze each screenshot and generate a structured description of UI elements, store descriptions with embeddings in a vector database, and build a chatbot that answers user questions like "Where is the settings button?" by finding the relevant screenshot and generating a text response with spatial guidance.
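The retrieval step of this project can be prototyped before choosing an embedding model or a vector database. The sketch below is a stand-in under loud assumptions: `embed` uses word counts instead of a learned embedding, `ScreenshotIndex` is an in-memory list instead of a vector database, and the screenshot descriptions are invented examples of what GPT-4o might return.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ScreenshotIndex:
    """In-memory stand-in for a vector database of screenshot descriptions."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str, Counter]] = []

    def add(self, screenshot: str, description: str) -> None:
        self._entries.append((screenshot, description, embed(description)))

    def search(self, question: str, top_k: int = 1) -> List[Dict]:
        query = embed(question)
        scored = [
            {"screenshot": name, "description": desc, "score": cosine(query, vec)}
            for name, desc, vec in self._entries
        ]
        return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


index = ScreenshotIndex()
index.add(
    "home.png",
    "Home page with a search bar at the top and a settings button in the top right corner.",
)
index.add(
    "login.png",
    "Login form with email and password fields and a Sign In button.",
)
best = index.search("Where is the settings button?")[0]
print(best["screenshot"], "->", best["description"])
```

The retrieved description, together with the user's question, would then be passed to the chat model to produce the final answer with spatial guidance.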
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.