Cloudflare Workers AI: Run Edge Inference Without a GPU Server
Running AI inference used to mean provisioning a GPU server, managing a model registry, and paying for idle compute. Workers AI collapses this to a single API call from the edge.
What Workers AI Actually Is
Cloudflare Workers AI gives you:
- 50+ models (LLMs, embeddings, vision, speech-to-text) available as bindings
- Inference runs on Cloudflare's GPU network — no cold starts, no server management
- Billed per unit of inference (neurons), not per hour of GPU time
- Response times from the PoP closest to your user
This is the missing piece for AI-native apps that need consistently low latency worldwide without paying for global infrastructure.
Setup
# wrangler.toml
name = "my-ai-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"
[ai]
binding = "AI"
Basic LLM Inference
interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: prompt },
      ],
      max_tokens: 512,
    });

    return Response.json(response);
  },
};
This Worker runs at the edge, closest to your user. No API key rotation. No rate limit management. No GPU provisioning.
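In production you still want to reject malformed requests before they reach the model. A hardened sketch of the same handler, using a minimal structural stand-in for the AI binding type so the example stands alone outside the Workers runtime (the `AiBinding` type and `handlePrompt` name are mine, not Cloudflare's):

```typescript
// Structural stand-in for Cloudflare's Ai binding, so this sketch
// type-checks and runs outside the Workers runtime.
type AiBinding = { run(model: string, input: unknown): Promise<unknown> };

export async function handlePrompt(
  request: Request,
  env: { AI: AiBinding },
): Promise<Response> {
  if (request.method !== 'POST') {
    return new Response('Method Not Allowed', { status: 405 });
  }

  let body: { prompt?: unknown };
  try {
    body = await request.json();
  } catch {
    return new Response('Invalid JSON body', { status: 400 });
  }
  if (typeof body.prompt !== 'string' || body.prompt.length === 0) {
    return new Response('Missing "prompt" string', { status: 400 });
  }

  const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [{ role: 'user', content: body.prompt }],
    max_tokens: 512,
  });
  return Response.json(result);
}
```

Because the binding is injected, the handler is trivially testable with a stubbed `run`.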
Streaming Responses
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    });

    // Returns a ReadableStream — pipe directly to client
    return new Response(stream, {
      headers: {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
      },
    });
  },
};
The stream is SSE-compatible — your frontend can consume it with EventSource or the Vercel AI SDK's useChat.
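If you consume the stream by hand instead, each event line carries a small JSON payload. A parser for individual event lines, assuming the `data: {"response":"<token>"}` / `data: [DONE]` shape that Workers AI text models emit (verify against the model's actual output):

```typescript
// Parse one SSE line from a Workers AI text stream into a token, or null
// for terminators, comments, and malformed events.
export function parseSSELine(line: string): string | null {
  if (!line.startsWith('data: ')) return null; // comment or blank line
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null; // end-of-stream sentinel
  try {
    const parsed = JSON.parse(payload) as { response?: string };
    return parsed.response ?? null;
  } catch {
    return null; // ignore malformed events rather than crash the stream
  }
}
```

Feed this from a `TextDecoderStream` split on newlines and append the non-null results to your UI.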
Embeddings for Semantic Search
// Generate embeddings at the edge — no external embedding API needed
async function embedText(text: string, env: Env): Promise<number[]> {
  const result = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
    text: [text],
  });
  return result.data[0]; // 768-dimensional vector
}

// Combine with Vectorize (Cloudflare's vector DB) for full semantic search:
async function semanticSearch(query: string, env: Env & { VECTORIZE: VectorizeIndex }) {
  const queryVector = await embedText(query, env);
  const results = await env.VECTORIZE.query(queryVector, {
    topK: 5,
    returnMetadata: true,
  });
  return results.matches;
}
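Querying is only half the story: you also need to populate the index. A sketch of shaping documents into Vectorize upsert records (the `buildVectors` helper and `docs` shape are illustrative; `upsert` is the real Vectorize binding method):

```typescript
// Shape Vectorize expects for upserts: id, vector values, optional metadata.
type VectorRecord = { id: string; values: number[]; metadata?: Record<string, string> };

// Pure helper: pair each document with its embedding, keeping the source
// text as metadata so search results are self-describing.
export function buildVectors(
  docs: { id: string; text: string }[],
  embeddings: number[][],
): VectorRecord[] {
  return docs.map((doc, i) => ({
    id: doc.id,
    values: embeddings[i],
    metadata: { text: doc.text },
  }));
}

// In the Worker, after embedding each doc with embedText:
// await env.VECTORIZE.upsert(buildVectors(docs, embeddings));
```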
Image Classification
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Accept image upload
    const formData = await request.formData();
    const image = formData.get('image');
    if (!(image instanceof File)) {
      return new Response('Expected an "image" file field', { status: 400 });
    }
    const imageBytes = await image.arrayBuffer();

    const result = await env.AI.run('@cf/microsoft/resnet-50', {
      image: [...new Uint8Array(imageBytes)],
    });

    // Returns top classifications with confidence scores
    return Response.json(result);
  },
};
Speech-to-Text
const transcription = await env.AI.run('@cf/openai/whisper', {
audio: [...new Uint8Array(audioBuffer)],
});
console.log(transcription.text); // full transcript
console.log(transcription.vtt); // WebVTT subtitles with timestamps
Whisper at the edge, billed per audio second. No Replicate, no API keys to manage.
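The `vtt` field is plain WebVTT text. If you want timestamps as data rather than subtitles, a minimal parser is enough; this sketch assumes the simple `start --> end` cue form and ignores optional cue identifiers and settings:

```typescript
type Cue = { start: string; end: string; text: string };

// Minimal WebVTT cue parser: match "start --> end" lines and collect the
// following non-empty lines as that cue's text.
export function parseVtt(vtt: string): Cue[] {
  const cues: Cue[] = [];
  const lines = vtt.split('\n');
  for (let i = 0; i < lines.length; i++) {
    const match = lines[i].match(/^(\S+)\s+-->\s+(\S+)/);
    if (!match) continue;
    const text: string[] = [];
    for (let j = i + 1; j < lines.length && lines[j].trim() !== ''; j++) {
      text.push(lines[j].trim());
    }
    cues.push({ start: match[1], end: match[2], text: text.join(' ') });
  }
  return cues;
}
```

For production subtitles, a full WebVTT library is safer; this covers the common case of slicing a transcript by time.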
Cost Model
Workers AI uses "neurons" — Cloudflare's unit of GPU compute:
- Llama 3.1 8B: ~300 neurons per 1M input tokens
- BGE embeddings: ~0.01 neurons per document
- Whisper: ~10 neurons per audio minute
Free tier: 10,000 neurons/day (Workers Free plan)
Paid: $0.011 per 1,000 neurons (Workers Paid, $5/month)
For a typical RAG application (1000 queries/day with embeddings + LLM), you're looking at ~$3-8/month vs $50-200/month for a dedicated GPU instance.
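The arithmetic behind that estimate is simple enough to sanity-check for your own traffic. This sketch hard-codes the rate quoted above; rates change, so treat the constant as a placeholder and verify against Cloudflare's current pricing page:

```typescript
// Workers Paid rate quoted in this section (placeholder; verify current pricing).
const USD_PER_1K_NEURONS = 0.011;

// Estimate monthly spend from average daily neuron consumption.
export function monthlyCostUSD(neuronsPerDay: number, days = 30): number {
  return (neuronsPerDay / 1000) * USD_PER_1K_NEURONS * days;
}
```

For example, a workload burning 10,000 neurons a day lands around $3.30/month at this rate, consistent with the low end of the RAG estimate above.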
When to Use Workers AI vs External APIs
Use Workers AI when:
- You need globally low latency (edge PoPs = 30-50ms vs 100-300ms for centralized inference)
- Your use case fits the available models (Llama, Mistral, Whisper, BGE)
- You want simplified billing (one Cloudflare invoice)
- You're already on Workers infrastructure
Use Claude/GPT-4 directly when:
- You need best-in-class reasoning (Llama 8B ≠ Claude Opus)
- Your task requires tool use or complex multi-step reasoning
- Response quality is more important than latency
The Architecture I Use
User request
↓
Workers AI (edge) → embeddings + lightweight classification
↓
If complex reasoning needed:
Forward to Claude API (centralized) with pre-processed context
↓
Return to user
Workers AI handles 80% of inference at the edge (fast + cheap). Claude handles the 20% that needs real reasoning.
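The routing decision itself can start as something deliberately naive. A sketch with an illustrative heuristic (the thresholds, cue list, and `needsDeepReasoning` name are mine, not any Cloudflare or Anthropic API):

```typescript
// Naive complexity heuristic: long prompts or multi-step cues go to the
// centralized model; everything else stays on the edge. Thresholds are
// illustrative, not tuned.
export function needsDeepReasoning(prompt: string): boolean {
  const multiStepCues = ['step by step', 'compare', 'plan', 'analyze'];
  if (prompt.length > 500) return true;
  const lower = prompt.toLowerCase();
  return multiStepCues.some((cue) => lower.includes(cue));
}

// In the Worker, route accordingly:
// return needsDeepReasoning(prompt)
//   ? forwardToClaude(prompt, context)  // hypothetical helper calling the Claude API
//   : env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages });
```

In practice you would replace the keyword heuristic with the edge classifier mentioned above, but the routing shape stays the same.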
Build MCP Integrations for Your Workers AI Apps
The MCP Security Scanner and Workflow Automator MCP are built on this exact Workers AI + Claude hybrid architecture. See how production MCP tools are structured.
What are you building with Workers AI? I'm curious about production use cases — share in the comments.