Cloudflare Workers AI: Run Edge Inference Without a GPU Server
Running AI inference used to mean provisioning a GPU server, managing a model registry, and paying for idle compute. Workers AI collapses this to a single API call from the edge.
What Workers AI Actually Is
Cloudflare Workers AI gives you:
- 50+ models (LLMs, embeddings, vision, speech-to-text) available as bindings
- Inference runs on Cloudflare's GPU network — no cold starts, no server management
- Billed per unit of inference (neurons), not per hour of GPU time
- Response times from the PoP closest to your user
This is the missing piece for AI-native apps that need consistently low latency worldwide without paying for global infrastructure.
Setup
# wrangler.toml
name = "my-ai-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"
[ai]
binding = "AI"
Basic LLM Inference
interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: prompt },
      ],
      max_tokens: 512,
    });

    return Response.json(response);
  },
};
This Worker runs at the edge, closest to your user. No API key rotation. No rate limit management. No GPU provisioning.
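In production you still want to reject malformed requests before they reach the model. A hardened sketch of the same handler, using a minimal structural stand-in for the AI binding type so the example stands alone outside the Workers runtime (the `AiBinding` type and `handlePrompt` name are mine, not Cloudflare's):

```typescript
// Structural stand-in for Cloudflare's Ai binding, so this sketch
// type-checks and runs outside the Workers runtime.
type AiBinding = { run(model: string, input: unknown): Promise<unknown> };

export async function handlePrompt(
  request: Request,
  env: { AI: AiBinding },
): Promise<Response> {
  if (request.method !== 'POST') {
    return new Response('Method Not Allowed', { status: 405 });
  }

  let body: { prompt?: unknown };
  try {
    body = await request.json();
  } catch {
    return new Response('Invalid JSON body', { status: 400 });
  }
  if (typeof body.prompt !== 'string' || body.prompt.length === 0) {
    return new Response('Missing "prompt" string', { status: 400 });
  }

  const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [{ role: 'user', content: body.prompt }],
    max_tokens: 512,
  });
  return Response.json(result);
}
```

Because the binding is injected, the handler is trivially testable with a stubbed `run`.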
Streaming Responses
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    });

    // Returns a ReadableStream — pipe directly to client
    return new Response(stream, {
      headers: {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
      },
    });
  },
};
The stream is SSE-compatible — your frontend can consume it with EventSource or the Vercel AI SDK's useChat.
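If you consume the stream by hand instead, each event line carries a small JSON payload. A parser for individual event lines, assuming the `data: {"response":"<token>"}` / `data: [DONE]` shape that Workers AI text models emit (verify against the model's actual output):

```typescript
// Parse one SSE line from a Workers AI text stream into a token, or null
// for terminators, comments, and malformed events.
export function parseSSELine(line: string): string | null {
  if (!line.startsWith('data: ')) return null; // comment or blank line
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null; // end-of-stream sentinel
  try {
    const parsed = JSON.parse(payload) as { response?: string };
    return parsed.response ?? null;
  } catch {
    return null; // ignore malformed events rather than crash the stream
  }
}
```

Feed this from a `TextDecoderStream` split on newlines and append the non-null results to your UI.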
Embeddings for Semantic Search
// Generate embeddings at the edge — no external embedding API needed
async function embedText(text: string, env: Env): Promise<number[]> {
  const result = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
    text: [text],
  });
  return result.data[0]; // 768-dimensional vector
}

// Combine with Vectorize (Cloudflare's vector DB) for full semantic search:
async function semanticSearch(query: string, env: Env & { VECTORIZE: VectorizeIndex }) {
  const queryVector = await embedText(query, env);
  const results = await env.VECTORIZE.query(queryVector, {
    topK: 5,
    returnMetadata: true,
  });
  return results.matches;
}
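Querying is only half the story: you also need to populate the index. A sketch of shaping documents into Vectorize upsert records (the `buildVectors` helper and `docs` shape are illustrative; `upsert` is the real Vectorize binding method):

```typescript
// Shape Vectorize expects for upserts: id, vector values, optional metadata.
type VectorRecord = { id: string; values: number[]; metadata?: Record<string, string> };

// Pure helper: pair each document with its embedding, keeping the source
// text as metadata so search results are self-describing.
export function buildVectors(
  docs: { id: string; text: string }[],
  embeddings: number[][],
): VectorRecord[] {
  return docs.map((doc, i) => ({
    id: doc.id,
    values: embeddings[i],
    metadata: { text: doc.text },
  }));
}

// In the Worker, after embedding each doc with embedText:
// await env.VECTORIZE.upsert(buildVectors(docs, embeddings));
```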
Image Classification
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Accept image upload
    const formData = await request.formData();
    const image = formData.get('image');
    if (!(image instanceof File)) {
      return new Response('Expected an "image" file field', { status: 400 });
    }
    const imageBytes = await image.arrayBuffer();

    const result = await env.AI.run('@cf/microsoft/resnet-50', {
      image: [...new Uint8Array(imageBytes)],
    });

    // Returns top classifications with confidence scores
    return Response.json(result);
  },
};
Speech-to-Text
const transcription = await env.AI.run('@cf/openai/whisper', {
audio: [...new Uint8Array(audioBuffer)],
});
console.log(transcription.text); // full transcript
console.log(transcription.vtt); // WebVTT subtitles with timestamps
Whisper at the edge, billed per audio second. No Replicate, no API keys to manage.
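The `vtt` field is plain WebVTT text. If you want timestamps as data rather than subtitles, a minimal parser is enough; this sketch assumes the simple `start --> end` cue form and ignores optional cue identifiers and settings:

```typescript
type Cue = { start: string; end: string; text: string };

// Minimal WebVTT cue parser: match "start --> end" lines and collect the
// following non-empty lines as that cue's text.
export function parseVtt(vtt: string): Cue[] {
  const cues: Cue[] = [];
  const lines = vtt.split('\n');
  for (let i = 0; i < lines.length; i++) {
    const match = lines[i].match(/^(\S+)\s+-->\s+(\S+)/);
    if (!match) continue;
    const text: string[] = [];
    for (let j = i + 1; j < lines.length && lines[j].trim() !== ''; j++) {
      text.push(lines[j].trim());
    }
    cues.push({ start: match[1], end: match[2], text: text.join(' ') });
  }
  return cues;
}
```

For production subtitles, a full WebVTT library is safer; this covers the common case of slicing a transcript by time.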
Cost Model
Workers AI uses "neurons" — Cloudflare's unit of GPU compute:
- Llama 3.1 8B: ~300 neurons per 1M input tokens
- BGE embeddings: ~0.01 neurons per document
- Whisper: ~10 neurons per audio minute
Free tier: 10,000 neurons/day (Workers Free plan)
Paid: $0.011 per 1,000 neurons (Workers Paid, $5/month)
For a typical RAG application (1000 queries/day with embeddings + LLM), you're looking at ~$3-8/month vs $50-200/month for a dedicated GPU instance.
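The arithmetic behind that estimate is simple enough to sanity-check for your own traffic. This sketch hard-codes the rate quoted above; rates change, so treat the constant as a placeholder and verify against Cloudflare's current pricing page:

```typescript
// Workers Paid rate quoted in this section (placeholder; verify current pricing).
const USD_PER_1K_NEURONS = 0.011;

// Estimate monthly spend from average daily neuron consumption.
export function monthlyCostUSD(neuronsPerDay: number, days = 30): number {
  return (neuronsPerDay / 1000) * USD_PER_1K_NEURONS * days;
}
```

For example, a workload burning 10,000 neurons a day lands around $3.30/month at this rate, consistent with the low end of the RAG estimate above.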
When to Use Workers AI vs External APIs
Use Workers AI when:
- You need globally low latency (edge PoPs = 30-50ms vs 100-300ms for centralized inference)
- Your use case fits the available models (Llama, Mistral, Whisper, BGE)
- You want simplified billing (one Cloudflare invoice)
- You're already on Workers infrastructure
Use Claude/GPT-4 directly when:
- You need best-in-class reasoning (Llama 8B ≠ Claude Opus)
- Your task requires tool use or complex multi-step reasoning
- Response quality is more important than latency
The Architecture I Use
User request
↓
Workers AI (edge) → embeddings + lightweight classification
↓
If complex reasoning needed:
Forward to Claude API (centralized) with pre-processed context
↓
Return to user
Workers AI handles 80% of inference at the edge (fast + cheap). Claude handles the 20% that needs real reasoning.
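The routing decision itself can start as something deliberately naive. A sketch with an illustrative heuristic (the thresholds, cue list, and `needsDeepReasoning` name are mine, not any Cloudflare or Anthropic API):

```typescript
// Naive complexity heuristic: long prompts or multi-step cues go to the
// centralized model; everything else stays on the edge. Thresholds are
// illustrative, not tuned.
export function needsDeepReasoning(prompt: string): boolean {
  const multiStepCues = ['step by step', 'compare', 'plan', 'analyze'];
  if (prompt.length > 500) return true;
  const lower = prompt.toLowerCase();
  return multiStepCues.some((cue) => lower.includes(cue));
}

// In the Worker, route accordingly:
// return needsDeepReasoning(prompt)
//   ? forwardToClaude(prompt, context)  // hypothetical helper calling the Claude API
//   : env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages });
```

In practice you would replace the keyword heuristic with the edge classifier mentioned above, but the routing shape stays the same.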
Build MCP Integrations for Your Workers AI Apps
The MCP Security Scanner and Workflow Automator MCP are built on this exact Workers AI + Claude hybrid architecture. See how production MCP tools are structured.
What are you building with Workers AI? I'm curious about production use cases — share in the comments.