Workers AI -- Run AI Models at the Edge

DodaTech 5 min read

This tutorial introduces Workers AI: its core concepts, three worked examples (text generation, image classification, and speech-to-text), common errors, and pricing basics.

Cloudflare Workers AI is a serverless inference platform that runs pretrained machine learning models on Cloudflare's global edge network, enabling low-latency AI capabilities like text generation, image classification, and speech recognition without managing GPUs or servers.

Why Workers AI Matters

Running AI inference typically requires expensive GPU infrastructure, complex model deployment pipelines, and careful scaling to handle traffic spikes. Workers AI removes most of that work by exposing a simple API over a curated catalog of open-source models deployed across Cloudflare's edge. Requests are routed to a GPU-equipped location close to the user, which keeps network latency lower than a round trip to a single centralized cloud region. Unlike Cloudflare's general-purpose compute offerings, which execute arbitrary code, Workers AI is specialized for neural network inference: it handles batching, hardware scheduling, and model loading behind the scenes. This makes it a good fit for applications like content moderation, real-time translation, and intelligent assistants built with JavaScript.

Real-world use: A social media platform scans every uploaded image for inappropriate content using a Workers AI image classification model. Inference runs on Cloudflare's network rather than on the platform's own servers, and results can return in under 200 milliseconds -- fast enough to block uploads before other users see them.

Workers AI Architecture

flowchart LR
    W[Worker Code] --> AI[Workers AI API]
    AI --> R[Router]
    R --> M1[Model: Llama 3.2]
    R --> M2[Model: Whisper]
    R --> M3[Model: ResNet]
    M1 --> G1[GPU Edge US]
    M2 --> G2[GPU Edge EU]
    M3 --> G3[GPU Edge Asia]
    G1 --> Res[Inference Result]
    G2 --> Res
    G3 --> Res
    Res --> W

    style AI fill:#f90,color:#fff
    style R fill:#3498db,color:#fff
    style Res fill:#2ecc71,color:#fff

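The Worker calls env.AI.run(modelId, inputs), where env.AI is the AI binding declared in the Worker's Wrangler configuration. Workers AI routes the request to a nearby GPU-equipped location that can serve the requested model. The result comes back as a JavaScript object for most models, or as a ReadableStream when streaming is enabled.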
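To make the call shape concrete, here is a minimal sketch that can run outside Cloudflare. The mockEnv object and the runModel wrapper are hypothetical stand-ins introduced for illustration (neither is part of Workers AI); in a deployed Worker, the runtime supplies env.AI.

```javascript
// Hypothetical stand-in for the AI binding so the sketch runs locally.
// In a deployed Worker, the runtime supplies env.AI.
const mockEnv = {
  AI: {
    async run(modelId, inputs) {
      return { response: `[${modelId}] received: ${inputs.prompt}` };
    },
  },
};

// Hypothetical wrapper: fail early with a readable message when the
// binding is missing, then forward the two arguments unchanged.
async function runModel(env, modelId, inputs) {
  if (!env || !env.AI || typeof env.AI.run !== 'function') {
    throw new Error('AI binding is not available on env');
  }
  return env.AI.run(modelId, inputs);
}

runModel(mockEnv, '@cf/meta/llama-3.2-3b-instruct', { prompt: 'ping' })
  .then((result) => console.log(result.response));
```

The same two arguments, a model ID and an inputs object, are all that every example in this tutorial passes to the binding.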

Text Generation with Llama

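export default {
  async fetch(request, env) {
    // env.AI is the Workers AI binding; run() takes a model ID and its inputs.
    const response = await env.AI.run(
      '@cf/meta/llama-3.2-3b-instruct',
      {
        prompt: 'Explain serverless computing in one sentence.',
        stream: false
      }
    );

    // With stream: false, the generated text is in the `response` field.
    return new Response(response.response, {
      headers: { 'Content-Type': 'text/plain' }
    });
  }
};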

// Expected output (approximate):
// "Serverless computing is a cloud execution model where the cloud provider
// dynamically manages the allocation of machine resources, charging only
// for the actual compute time consumed."

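The @cf/meta/llama-3.2-3b-instruct model is a small but capable instruction-tuned language model. Setting stream: false returns the complete response at once, as an object whose response field holds the generated text. For longer outputs, set stream: true; env.AI.run then returns a ReadableStream of server-sent events that the Worker can pass straight through to the client.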
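The streaming path can be sketched as follows. handleStream shows one way a Worker might forward the stream, and extractTokens shows how a consumer might pull text out of a decoded chunk. Both function names are hypothetical, and the event format (lines of the form `data: {"response":"..."}` ending with `data: [DONE]`) is an assumption to verify against the model's documentation.

```javascript
// Hypothetical handler: forward the model's stream to the client unchanged.
async function handleStream(env) {
  const stream = await env.AI.run('@cf/meta/llama-3.2-3b-instruct', {
    prompt: 'Explain serverless computing in one paragraph.',
    stream: true,
  });
  return new Response(stream, {
    headers: { 'Content-Type': 'text/event-stream' },
  });
}

// Parse one decoded chunk of the event stream into text tokens.
// Assumes `data: {"response":"..."}` events and a final `data: [DONE]`.
function extractTokens(chunkText) {
  const tokens = [];
  let done = false;
  for (const line of chunkText.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue; // skip blank separator lines
    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') {
      done = true;
      break;
    }
    try {
      const event = JSON.parse(payload);
      if (typeof event.response === 'string') tokens.push(event.response);
    } catch {
      // Incomplete JSON (an event split across chunks) is skipped here.
    }
  }
  return { tokens, done };
}

const sample =
  'data: {"response":"Server"}\n\ndata: {"response":"less"}\n\ndata: [DONE]\n\n';
console.log(extractTokens(sample)); // { tokens: [ 'Server', 'less' ], done: true }
```

This sketch drops any event that is split across two chunks; a production consumer would buffer the unfinished tail of each chunk and prepend it to the next one.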

Image Classification

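export default {
  async fetch(request, env) {
    // Expect a multipart form upload with the file in the "image" field.
    const formData = await request.formData();
    const imageFile = formData.get('image');
    if (!imageFile || typeof imageFile === 'string') {
      return new Response('Missing "image" file field', { status: 400 });
    }
    const arrayBuffer = await imageFile.arrayBuffer();

    // The model takes the image bytes as a plain array of integers.
    const result = await env.AI.run(
      '@cf/microsoft/resnet-50',
      {
        image: [...new Uint8Array(arrayBuffer)]
      }
    );

    // The model's output is expected to be an array of { label, score }
    // objects; fall back to result.results in case the array is wrapped.
    const predictions = Array.isArray(result) ? result : (result.results ?? []);
    const topPredictions = [...predictions]
      .sort((a, b) => b.score - a.score)
      .slice(0, 3)
      .map(r => `${r.label}: ${(r.score * 100).toFixed(1)}%`);

    return new Response(JSON.stringify({ predictions: topPredictions }), {
      headers: { 'Content-Type': 'application/json' }
    });
  }
};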

// Example output (for a photo of a dog; labels and scores are illustrative):
// {"predictions":["golden retriever: 92.3%","Labrador retriever: 4.1%","curly-coated retriever: 1.2%"]}

The ResNet-50 model classifies images into 1000 categories. The image must be sent as a byte array. Workers AI handles decoding and preprocessing internally. The response contains class labels with confidence scores.
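Once labels and confidence scores are available, acting on them is plain array work. The sketch below assumes predictions shaped as { label, score } objects; topK and shouldBlock are hypothetical helpers, and the watched labels and the threshold are illustrative values rather than recommendations.

```javascript
// Return the k highest-scoring predictions. Copy before sorting so the
// original array from the model is not mutated.
function topK(predictions, k = 3) {
  return [...predictions].sort((a, b) => b.score - a.score).slice(0, k);
}

// Hypothetical policy: block when any watched label scores at or above
// the threshold. Label comparison is case-insensitive.
function shouldBlock(predictions, watchedLabels, threshold = 0.8) {
  const watched = new Set(watchedLabels.map((label) => label.toLowerCase()));
  return predictions.some(
    (p) => watched.has(p.label.toLowerCase()) && p.score >= threshold
  );
}

const sample = [
  { label: 'Labrador retriever', score: 0.041 },
  { label: 'golden retriever', score: 0.923 },
  { label: 'curly-coated retriever', score: 0.012 },
];

console.log(topK(sample, 2).map((p) => p.label));
// [ 'golden retriever', 'Labrador retriever' ]
console.log(shouldBlock(sample, ['Golden Retriever'], 0.9)); // true
console.log(shouldBlock(sample, ['Labrador retriever'], 0.9)); // false
```

Keeping the decision in a pure function makes the threshold easy to tune and test without calling the model.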

Speech-to-Text with Whisper

export default {
  async fetch(request, env) {
    // The request body is the audio file itself (for example a WAV or MP3 upload).
    const audio = await request.arrayBuffer();
    const audioBytes = [...new Uint8Array(audio)];

    const result = await env.AI.run(
      '@cf/openai/whisper-tiny-en',
      {
        audio: audioBytes
      }
    );

    // `text` holds the transcription. Fields such as word_count and words
    // may be absent depending on the model, so read them defensively.
    return new Response(JSON.stringify({
      transcription: result.text,
      word_count: result.word_count ?? null,
      timed_words: result.words?.length ?? 0
    }), {
      headers: { 'Content-Type': 'application/json' }
    });
  }
};

// Example output (for a short audio clip saying "Hello world"; illustrative):
// {"transcription":"Hello world","word_count":2,"timed_words":2}

Whisper takes the bytes of an audio file and returns a text transcription, plus word-level timing metadata where the model provides it. The tiny-en variant is optimized for English and is small enough to run efficiently on edge GPUs. Larger Whisper variants are available for multilingual transcription.
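Timing metadata can be reduced to a short summary. The sketch below assumes the result includes a words array of { word, start, end } entries with times in seconds; that shape is an assumption to check against the model's documented output, and summarizeWords is a hypothetical helper.

```javascript
// Summarize word-level timings: how many words were recognized and how
// many seconds elapsed between the first word starting and the last ending.
function summarizeWords(words) {
  if (!Array.isArray(words) || words.length === 0) {
    return { wordCount: 0, spokenSeconds: 0 };
  }
  const firstStart = words[0].start;
  const lastEnd = words[words.length - 1].end;
  return {
    wordCount: words.length,
    spokenSeconds: Number((lastEnd - firstStart).toFixed(2)),
  };
}

const sampleWords = [
  { word: 'Hello', start: 0.5, end: 0.9 },
  { word: 'world', start: 1.0, end: 1.5 },
];

console.log(summarizeWords(sampleWords)); // { wordCount: 2, spokenSeconds: 1 }
console.log(summarizeWords(undefined)); // { wordCount: 0, spokenSeconds: 0 }
```

Returning zeros for a missing array keeps the caller working with models that return only the transcription text.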

Common Errors

Error	Cause	Fix
`Model not found`	Model ID does not exist in Workers AI catalog	Verify the model ID against the official Workers AI catalog documentation
`Input exceeds maximum length`	Prompt or input too large for model context window	Truncate or chunk the input to fit the context window listed on the model's catalog page (limits vary by model)
`Insufficient quota`	Free plan inference limit reached	Upgrade to a paid plan or wait for the free allocation to reset (daily, per Cloudflare's pricing documentation)
`Unsupported content type`	Input format not recognized by the model	Check the model's expected input format (text, image bytes, audio bytes)
`GPU temporarily unavailable`	Inference capacity for the model is temporarily exhausted	Retry the request with exponential backoff and allow a longer timeout
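Transient failures such as temporary capacity limits are usually handled by retrying with a growing delay. The following is a minimal sketch, not Cloudflare's own retry logic: backoffDelays, runWithRetry, and the option names are hypothetical, and the default shouldRetry retries every error, which a real Worker would narrow to errors it knows are transient.

```javascript
// Delay schedule in milliseconds: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(retries, baseMs = 200, maxMs = 5000) {
  return Array.from({ length: retries }, (_, i) =>
    Math.min(baseMs * 2 ** i, maxMs)
  );
}

// Call env.AI.run, retrying failed attempts up to `retries` times.
async function runWithRetry(env, modelId, inputs, options = {}) {
  const {
    retries = 3,
    baseMs = 200,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    shouldRetry = () => true, // assumption: treat every error as transient
  } = options;
  const delays = backoffDelays(retries, baseMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await env.AI.run(modelId, inputs);
    } catch (err) {
      // Give up when retries are exhausted or the error is not retryable.
      if (attempt >= retries || !shouldRetry(err)) throw err;
      await sleep(delays[attempt]);
    }
  }
}

console.log(backoffDelays(4)); // [ 200, 400, 800, 1600 ]
```

Errors like an unknown model ID will never succeed on retry, so passing a shouldRetry predicate that returns false for them avoids wasted calls.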

Practice Questions

What two parameters are required to run a model with Workers AI?
How does Workers AI route inference requests to the appropriate hardware?
What is the difference between streaming and non-streaming inference responses?

FAQ

What models are available on Workers AI?

Workers AI offers a curated catalog of open-source models including Llama 3.2, Mistral, Whisper, ResNet, Stable Diffusion, and BGE embeddings. Models are chosen for their performance on edge GPUs. Cloudflare regularly adds new models based on community demand and partnerships.

How is Workers AI priced?

Pricing is usage-based and metered in Neurons, Cloudflare's unit for the GPU compute an inference consumes. Each model has its own Neuron cost, reflecting how computationally expensive it is. The free tier includes a daily allocation of Neurons. There are no charges for idle time or model storage -- you only pay for actual inference.
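The "free allocation, then pay for usage" model reduces to simple arithmetic. The sketch below is unit-agnostic, and estimateCost is a hypothetical helper; the allowance and rate are placeholder values chosen for illustration, not Cloudflare's actual prices, so substitute the figures from the current pricing page.

```javascript
// Estimate the charge for one billing period.
// unitsUsed, freeUnits: usage and free allowance in the platform's billing unit.
// usdPer1000Units: price per 1,000 billable units.
function estimateCost(unitsUsed, freeUnits, usdPer1000Units) {
  const billableUnits = Math.max(0, unitsUsed - freeUnits);
  return {
    billableUnits,
    usd: (billableUnits / 1000) * usdPer1000Units,
  };
}

// Placeholder values for illustration only.
const PLACEHOLDER_FREE_UNITS = 10000;
const PLACEHOLDER_USD_PER_1000 = 0.01;

// Usage inside the free allowance: nothing is billable.
console.log(estimateCost(4000, PLACEHOLDER_FREE_UNITS, PLACEHOLDER_USD_PER_1000));

// Usage above the allowance: only the excess is charged.
const overage = estimateCost(25000, PLACEHOLDER_FREE_UNITS, PLACEHOLDER_USD_PER_1000);
console.log(overage.billableUnits); // 15000
console.log(overage.usd.toFixed(2));
```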

Can I bring my own fine-tuned model?

Support for custom models is limited. You can use LoRA adapters with compatible base models, but full custom model deployment is not yet available. The Workers AI documentation lists which base models and customization features are currently supported.

Summary

Workers AI brings machine learning inference to Cloudflare's edge network, offering text generation, image classification, speech recognition, and embedding models through a simple API. Requests are processed on GPU hardware at a nearby edge location, eliminating the need to manage servers or GPUs. Use Workers AI for content moderation, real-time translation, intelligent chatbots, and image analysis. Because it is exposed as a Worker binding, it can be combined with Workers, KV, D1, and R2 to build full-stack AI applications at the edge. DodaTech uses Workers AI for automated content filtering in its publishing platform.

This guide is brought to you by the developers of Cloudflare, REST APIs, and Durga Antivirus Pro at DodaTech.
