Workers AI -- Run AI Models at the Edge
This tutorial introduces Workers AI through its architecture, three working examples (text generation, image classification, and speech-to-text), and a table of common errors.
Cloudflare Workers AI is a serverless inference platform that runs pretrained machine learning models on Cloudflare's global edge network, enabling low-latency AI capabilities like text generation, image classification, and speech recognition without managing GPUs or servers.
Why Workers AI Matters
Running AI inference typically requires expensive GPU infrastructure, complex model deployment pipelines, and careful scaling to handle traffic spikes. Workers AI removes most of that work by providing a simple API over a curated catalog of open-source models deployed across Cloudflare's edge. Each request is routed to a GPU-equipped edge node near the user, which can cut round-trip latency compared with sending every request to a single cloud region. Unlike Cloudflare's general-purpose compute offerings, Workers AI is specialized for neural network inference: it handles batching, hardware scheduling, and model loading behind the scenes. This approach suits applications like content moderation, real-time translation, and intelligent assistants built with JavaScript.
Real-world use: A social media platform scans every uploaded image for inappropriate content using Workers AI's image classification model. The request never leaves the edge, and results return in under 200 milliseconds -- fast enough to block uploads before users see them.
Workers AI Architecture
flowchart LR
  W[Worker Code] --> AI[Workers AI API]
  AI --> R[Router]
  R --> M1[Model: Llama 3.2]
  R --> M2[Model: Whisper]
  R --> M3[Model: ResNet]
  M1 --> G1[GPU Edge US]
  M2 --> G2[GPU Edge EU]
  M3 --> G3[GPU Edge Asia]
  G1 --> Res[Inference Result]
  G2 --> Res
  G3 --> Res
  Res --> W
  style AI fill:#f90,color:#fff
  style R fill:#3498db,color:#fff
  style Res fill:#2ecc71,color:#fff
The Worker calls env.AI.run(modelId, inputs), and Workers AI routes the request to the nearest GPU-equipped edge node that has the requested model loaded. The result comes back as JSON, or as a stream for text models when streaming is enabled.
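To make the call shape concrete, here is a minimal sketch of a wrapper around env.AI.run(modelId, inputs) together with a stand-in binding for local experiments. Both runModel and createStubAI are hypothetical names introduced for illustration; they are not part of the Workers AI API, and the stub only mimics the two-argument call, not real inference.

```javascript
// Hypothetical helper (not part of the Workers AI API): checks the two
// arguments that env.AI.run(modelId, inputs) takes, then forwards the call.
async function runModel(env, modelId, inputs) {
  if (typeof modelId !== 'string' || modelId.length === 0) {
    throw new TypeError('modelId must be a non-empty string');
  }
  if (inputs === null || typeof inputs !== 'object') {
    throw new TypeError('inputs must be an object');
  }
  return env.AI.run(modelId, inputs);
}

// Hypothetical stand-in for the AI binding, for local experiments only.
// It records each call and returns a canned result instead of running a model.
function createStubAI(cannedResult) {
  const calls = [];
  return {
    calls,
    async run(modelId, inputs) {
      calls.push({ modelId, inputs });
      return cannedResult;
    }
  };
}
```

Passing { AI: createStubAI({ response: 'ok' }) } as env lets you exercise handler logic without a deployed Worker; in production the binding is supplied by the runtime.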
Text Generation with Llama
export default {
  async fetch(request, env) {
    // First argument: the model ID. Second argument: the model inputs.
    const response = await env.AI.run(
      '@cf/meta/llama-3.2-3b-instruct',
      {
        prompt: 'Explain serverless computing in one sentence.',
        stream: false
      }
    );
    // With stream: false, the generated text is on the `response` property.
    return new Response(response.response, {
      headers: { 'Content-Type': 'text/plain' }
    });
  }
};
// Expected output (approximate):
// "Serverless computing is a cloud execution model where the cloud provider
// dynamically manages the allocation of machine resources, charging only
// for the actual compute time consumed."
The @cf/meta/llama-3.2-3b-instruct model is a small but capable instruction-tuned language model. Setting stream: false returns the complete response at once. For longer outputs, set stream: true to receive chunks via a ReadableStream.
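The streaming variant can be sketched as follows. This is a minimal example under two assumptions: that the stream carries server-sent events (hence the text/event-stream content type), and that the prompt arrives as a `prompt` query parameter, which is a choice made here for illustration. The names streamCompletion and worker are hypothetical.

```javascript
const MODEL_ID = '@cf/meta/llama-3.2-3b-instruct';

// Hypothetical helper: requests a streamed completion and hands the
// ReadableStream straight to the client without buffering it.
async function streamCompletion(env, prompt) {
  const stream = await env.AI.run(MODEL_ID, { prompt, stream: true });
  return new Response(stream, {
    // Assumption: the stream is formatted as server-sent events.
    headers: { 'Content-Type': 'text/event-stream' }
  });
}

// In a Worker module this object would be exported with `export default worker`.
const worker = {
  async fetch(request, env) {
    // Assumption for this sketch: the prompt is passed as ?prompt=...
    const url = new URL(request.url);
    const prompt = url.searchParams.get('prompt')
      ?? 'Explain serverless computing in one sentence.';
    return streamCompletion(env, prompt);
  }
};
```

Passing the stream directly into the Response means the Worker never holds the full output in memory, so the client can start rendering text as soon as the first chunk arrives.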
Image Classification
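export default {
  async fetch(request, env) {
    // Expect a multipart form upload with the file in the "image" field.
    const formData = await request.formData();
    const imageFile = formData.get('image');
    if (!imageFile || typeof imageFile === 'string') {
      return new Response('Missing "image" file field', { status: 400 });
    }
    const arrayBuffer = await imageFile.arrayBuffer();
    const result = await env.AI.run(
      '@cf/microsoft/resnet-50',
      {
        // The model takes the raw image bytes as a plain array of integers.
        image: [...new Uint8Array(arrayBuffer)]
      }
    );
    // Predictions may arrive as a bare array or under a `results` key;
    // accept both so a missing property does not throw.
    const predictions = Array.isArray(result) ? result : (result.results ?? []);
    const topPredictions = [...predictions]
      .sort((a, b) => b.score - a.score)
      .slice(0, 3)
      .map(r => `${r.label}: ${(r.score * 100).toFixed(1)}%`);
    return new Response(JSON.stringify({ predictions: topPredictions }), {
      headers: { 'Content-Type': 'application/json' }
    });
  }
};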
// Expected output (for a photo of a dog):
// {"predictions":["golden retriever: 92.3%","Labrador retriever: 4.1%","curly-coated retriever: 1.2%"]}
The ResNet-50 model classifies images into 1000 categories. The image must be sent as a byte array. Workers AI handles decoding and preprocessing internally. The response contains class labels with confidence scores.
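For a use case like content moderation, you usually want only the predictions above some confidence level. The sketch below pulls the ranking step into a standalone function with an optional score threshold. formatTopPredictions and its minScore parameter are hypothetical, and the function assumes each prediction is an object with `label` and `score` fields, as in the example above.

```javascript
// Hypothetical helper: returns the k highest-scoring predictions as
// "label: NN.N%" strings, skipping any whose score is below minScore.
// filter() returns a new array, so the caller's array is not reordered.
function formatTopPredictions(predictions, k = 3, minScore = 0) {
  return predictions
    .filter(p => p.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(p => `${p.label}: ${(p.score * 100).toFixed(1)}%`);
}
```

Keeping this step free of any Worker-specific objects makes it easy to unit test with hand-written prediction arrays.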
Speech-to-Text with Whisper
export default {
  async fetch(request, env) {
    // The request body is the raw audio file.
    const audio = await request.arrayBuffer();
    const audioBytes = [...new Uint8Array(audio)];
    const result = await env.AI.run(
      '@cf/openai/whisper-tiny-en',
      {
        audio: audioBytes,
        response_format: 'json'
      }
    );
    return new Response(JSON.stringify({
      transcription: result.text,
      // Timing fields may be absent for some Whisper variants; fall back to
      // null instead of throwing on a missing property.
      duration_seconds: result.duration ?? null,
      segments: result.segments?.length ?? null
    }), {
      headers: { 'Content-Type': 'application/json' }
    });
  }
};
// Expected output (for a 5-second audio clip saying "Hello world"):
// {"transcription": "Hello world","duration_seconds": 5.12,"segments": 1}
Whisper processes raw audio data and returns a text transcription along with timing metadata. The tiny-en variant is optimized for English and runs efficiently on edge GPUs. Larger Whisper variants are available for multilingual transcription.
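Because every call consumes inference quota, it can be worth rejecting unusable payloads before converting them to a byte array. The following is a minimal sketch; audioBytesFromBuffer is a hypothetical helper, and maxBytes is a limit chosen by the caller, not a documented Workers AI value.

```javascript
// Hypothetical helper: turns an ArrayBuffer into the plain byte array passed
// as `audio`, rejecting empty or oversized payloads first.
// maxBytes is a caller-chosen limit, not a documented Workers AI value.
function audioBytesFromBuffer(buffer, maxBytes) {
  if (buffer.byteLength === 0) {
    throw new RangeError('audio payload is empty');
  }
  if (buffer.byteLength > maxBytes) {
    throw new RangeError(`audio payload exceeds ${maxBytes} bytes`);
  }
  return Array.from(new Uint8Array(buffer));
}
```

In the handler above, a caught RangeError could be turned into a 400 response instead of reaching the model.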
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Model not found | Model ID does not exist in the Workers AI catalog | Verify the model ID against the official Workers AI catalog documentation |
| Input exceeds maximum length | Prompt or input too large for the model's context window | Truncate input to the model's maximum token limit (typically 2048-8192 tokens) |
| Insufficient quota | Free plan inference limit reached | Upgrade to a paid plan or wait for the free allocation to reset |
| Unsupported content type | Input format not recognized by the model | Check the model's expected input format (text, image bytes, audio bytes) |
| GPU temporarily unavailable | All GPUs at the nearest edge node are busy | Requests are automatically retried at other edge nodes; increase your timeout settings |
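Two of the fixes in the table can be sketched in code: trimming oversized input and retrying after a transient failure. Both helpers below are hypothetical. truncatePrompt cuts by characters, which is only a rough stand-in for a token limit, and runWithRetry retries on any thrown error because this tutorial does not specify what the error objects look like.

```javascript
// Hypothetical helper: rough character-based truncation. Tokens are not
// characters, so maxChars is only a conservative stand-in for a token limit.
function truncatePrompt(prompt, maxChars) {
  return prompt.length <= maxChars ? prompt : prompt.slice(0, maxChars);
}

// Hypothetical helper: retries a failed env.AI.run call with exponential
// backoff (baseDelayMs, then 2x, 4x, ...) and rethrows the last error
// once all attempts are used up.
async function runWithRetry(env, modelId, inputs, { retries = 2, baseDelayMs = 200 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await env.AI.run(modelId, inputs);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Retrying only helps with transient conditions such as a busy GPU. A wrong model ID or an exhausted quota will fail the same way on every attempt, so real code would inspect the error and retry selectively.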
Practice Questions
- What two parameters are required to run a model with Workers AI?
- How does Workers AI route inference requests to the appropriate hardware?
- What is the difference between streaming and non-streaming inference responses?
Summary
Workers AI brings machine learning inference to Cloudflare's edge network, offering text generation, image classification, speech recognition, and embedding models through a simple API. Each request is processed on GPU hardware at a nearby edge location, so you do not manage servers or GPUs yourself. Use Workers AI for content moderation, real-time translation, intelligent chatbots, and image analysis. The platform integrates with Workers, KV, D1, and R2, enabling full-stack AI applications at the edge. DodaTech uses Workers AI for automated content filtering in its publishing platform.
This guide is brought to you by the developers of Cloudflare, REST APIs, and Durga Antivirus Pro at DodaTech.