Master Cloudflare Workers AI with our step-by-step tutorial, detailed feature walkthrough, and expert tips.
1. Sign up for a Cloudflare account and navigate to the Workers AI dashboard.
2. Browse the model catalog to identify models suitable for your use case.
3. Test inference using the REST API or the Workers playground with sample prompts.
4. Integrate using Workers bindings for server-side applications, or the REST API for external access.
5. Monitor usage and costs through the Cloudflare dashboard analytics.
6. Scale to production with AI Gateway integration for observability and control.
💡 Quick Start: Follow these 6 steps in order to get up and running with Cloudflare Workers AI quickly.
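The REST testing step can be sketched as below. This is a minimal illustration, not an official client: the helper name `buildRunRequest`, the `ACCOUNT_ID` and `API_TOKEN` placeholders, and the chosen model ID are assumptions for the example. The URL follows Cloudflare's `/accounts/{account_id}/ai/run/{model}` REST pattern. The helper only builds the request, so you can inspect it before sending.

```typescript
// Sketch: describe a Workers AI REST call without sending it.
// ACCOUNT_ID, API_TOKEN and the model ID are placeholders (assumptions).
interface RunRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildRunRequest(
  accountId: string,
  apiToken: string,
  model: string,
  input: Record<string, unknown>,
): RunRequest {
  return {
    // The model ID (for example "@cf/meta/...") is appended to the run path.
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(input),
  };
}

const req = buildRunRequest(
  "ACCOUNT_ID",
  "API_TOKEN",
  "@cf/meta/llama-3.1-8b-instruct",
  { prompt: "Explain edge inference in one sentence." },
);
```

To send it, pass the pieces to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })` from any environment that can reach the Cloudflare API.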
Explore the key features that make Cloudflare Workers AI powerful for AI model API workflows.
Deploy AI models across 300+ edge locations worldwide, leveraging Cloudflare's anycast network to route requests to the nearest available GPU for optimized performance. Latency varies by model size and GPU availability at each location — smaller models like Mistral 7B and Gemma 2B typically achieve median latencies well under 100ms from nearby locations, while larger models may route to a more limited set of GPU-equipped data centers. The system automatically balances proximity, capacity, and load to deliver the best available response time for each request.
Building AI-powered applications serving global audiences where response time directly impacts user experience, such as real-time chat assistants or interactive content generation.
Access 50+ curated open-source models including Meta's Llama 3.3 and Llama 4 Scout, Mistral 7B for efficient text generation, Google's Gemma for lightweight inference, and Stable Diffusion XL for image generation. The catalog spans text generation, embeddings (BGE), speech-to-text (Whisper), translation, and image models — all optimized for edge deployment and accessible through a unified API.
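To illustrate what a unified API means in practice, the sketch below maps a few task types to an example model ID and the input shape that model family typically expects. The model IDs and field names are assumptions based on commonly documented examples and may change, so check the catalog entry for each model. The `Task` type and `buildInput` helper are hypothetical names.

```typescript
// Sketch: one call style, different input shapes per task.
// Model IDs and field names are assumptions; verify them in the catalog.
type Task = "text-generation" | "embeddings" | "translation" | "image";

const exampleModels: Record<Task, string> = {
  "text-generation": "@cf/meta/llama-3.1-8b-instruct",
  embeddings: "@cf/baai/bge-base-en-v1.5",
  translation: "@cf/meta/m2m100-1.2b",
  image: "@cf/stabilityai/stable-diffusion-xl-base-1.0",
};

function buildInput(task: Task, text: string): Record<string, unknown> {
  switch (task) {
    case "text-generation":
      // Chat-style input: a list of role-tagged messages.
      return { messages: [{ role: "user", content: text }] };
    case "embeddings":
      // Embedding models accept a batch of strings.
      return { text: [text] };
    case "translation":
      return { text, source_lang: "english", target_lang: "french" };
    case "image":
      // Image models take a prompt and return image bytes.
      return { prompt: text };
  }
}
```

In each case the call has the same form, a model ID plus an input object, which is what lets one application mix modalities without switching providers.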
Multi-modal AI applications requiring text generation, image creation, speech processing, and embedding generation without managing multiple AI service providers.
Pay-per-use pricing at $0.011 per 1,000 neurons with 10,000 neurons free daily. Neurons represent normalized compute units across different model types, providing predictable billing without idle costs or minimum commitments. Each model in the catalog publishes its neuron cost per request, enabling developers to estimate expenses before deploying. For example, a typical Llama 3.1 8B text generation request costs approximately 50 neurons (~$0.00055), while image generation models consume more neurons per request due to higher compute requirements.
Startups and variable-workload applications where traditional GPU instance pricing creates financial uncertainty or forces over-provisioning for peak capacity.
Zero infrastructure management with automatic scaling, batching optimization, and resource allocation. Models warm automatically based on usage patterns to minimize cold start latency. The platform handles GPU provisioning, model loading, request queuing, and scaling entirely behind the scenes, allowing developers to treat AI inference as a simple API call without any operational burden.
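As a rough sketch of inference as a simple API call, the helper below makes one call through a Workers AI binding. The `AiBinding` and `Env` interfaces are declared locally and simplified for the example (real projects would normally take these types from Cloudflare's type packages). The `summarize` helper and the model ID are illustrative assumptions.

```typescript
// Simplified shape of the Workers AI binding (an assumption for this sketch).
interface AiBinding {
  run(model: string, input: { prompt: string }): Promise<{ response?: string }>;
}

interface Env {
  AI: AiBinding;
}

// Hypothetical helper: one inference call, with no GPU provisioning,
// model loading, or scaling logic in application code.
async function summarize(env: Env, text: string): Promise<string> {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    prompt: `Summarize in one sentence: ${text}`,
  });
  // Fall back to an empty string if the model returned no text.
  return result.response ?? "";
}
```

Inside a Worker, `env` is the second argument to the `fetch` handler, and the `AI` binding is declared in the project's Wrangler configuration.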
Production applications requiring elastic scaling during traffic spikes without pre-provisioning capacity or managing GPU clusters and model deployment pipelines.
Native integration with AI Gateway for observability and control, Vectorize for vector storage, Workers for edge computing, and R2/D1 for data storage, creating complete AI application stacks. AI Gateway provides unified caching, rate limiting, retry logic, fallback routing, and real-time analytics across both Workers AI and external providers. Vectorize enables semantic search and RAG pipelines, while D1 and R2 handle structured and object storage respectively.
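A RAG pipeline built from these pieces might look roughly like the sketch below: embed the question, query Vectorize for similar chunks, then pass the retrieved text to a text generation model. The binding interfaces are simplified local declarations. The function name `answerWithContext`, the model IDs, and the assumption that each vector stores its chunk under `metadata.text` are illustrative rather than taken from the source.

```typescript
// Simplified binding shapes, declared locally (assumptions for this sketch).
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<any>;
}
interface VectorMatch {
  id: string;
  score: number;
  metadata?: { text?: string };
}
interface VectorIndex {
  query(
    vector: number[],
    options: { topK: number; returnMetadata?: boolean | string },
  ): Promise<{ matches: VectorMatch[] }>;
}

async function answerWithContext(
  ai: AiBinding,
  index: VectorIndex,
  question: string,
): Promise<string> {
  // 1. Embed the question (embedding models return one vector per input).
  const embedding = await ai.run("@cf/baai/bge-base-en-v1.5", { text: [question] });

  // 2. Retrieve the closest stored chunks from the vector index.
  const { matches } = await index.query(embedding.data[0], {
    topK: 3,
    returnMetadata: "all",
  });
  const context = matches
    .map((m) => m.metadata?.text ?? "")
    .filter((t) => t.length > 0)
    .join("\n");

  // 3. Ask the text model to answer using only the retrieved context.
  const result = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return result.response;
}
```

Passing the bindings in as parameters keeps the function easy to test with stand-ins. In a Worker they would come from `env`.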
Building end-to-end AI applications with semantic search, RAG capabilities, and edge processing without assembling multiple disparate cloud services.
Support for function calling, structured JSON outputs, reasoning tasks, vision processing, and multi-turn conversations with extended context windows for document processing applications. LoRA adapter loading enables fine-tuned model variants without redeploying base models, and batch processing handles high-volume offline workloads efficiently. The Agents SDK and Workflows product enable stateful, multi-step agent pipelines combining inference with durable execution.
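Function calling usually involves two parts on the application side: describing tools to the model, and executing whatever tool calls the model returns. The sketch below shows both under stated assumptions. The tool definition format (name, description, JSON Schema parameters) and the `{ name, arguments }` tool call shape reflect one commonly documented form and should be checked against the model's documentation. `getWeather` and its stub result are hypothetical.

```typescript
// Assumed shape of a tool call returned by the model.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}
type ToolHandler = (args: Record<string, unknown>) => string;

// Tool description sent alongside the messages (hypothetical tool).
const toolDefinitions = [
  {
    name: "getWeather",
    description: "Return the current weather for a city",
    parameters: {
      type: "object",
      properties: { city: { type: "string", description: "City name" } },
      required: ["city"],
    },
  },
];

// Local implementations keyed by tool name; this one is a stub.
const toolHandlers: Record<string, ToolHandler> = {
  getWeather: (args) => `Sunny in ${String(args["city"])}`,
};

// Execute each requested tool and produce messages to append to the
// conversation before asking the model for its final answer.
function runToolCalls(
  calls: ToolCall[],
): { role: "tool"; name: string; content: string }[] {
  return calls.map((call) => {
    const handler = toolHandlers[call.name];
    const content = handler
      ? handler(call.arguments)
      : `Unknown tool: ${call.name}`;
    return { role: "tool" as const, name: call.name, content };
  });
}
```

A multi-turn loop would send `toolDefinitions` with the messages, run `runToolCalls` on any returned calls, append the results, and call the model again.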
Agentic AI workflows requiring tool usage, complex reasoning, document analysis, and multi-modal understanding for sophisticated automation and decision-making systems.
The catalog includes 50+ open-source models, including Meta Llama 3.1/3.2/3.3 and Llama 4 Scout, Mistral 7B, Google Gemma, Qwen, DeepSeek, BGE embeddings for semantic search, OpenAI Whisper for speech-to-text, Stable Diffusion XL and Flux for image generation, plus models for translation, classification, summarization, and sentiment analysis. The catalog is curated and optimized by Cloudflare for edge deployment, and new models are added regularly as they become available and pass Cloudflare's optimization pipeline. Each model in the catalog includes published neuron costs, supported features (streaming, function calling, etc.), and maximum context window specifications.
Pricing is based on neurons, Cloudflare's normalized unit of AI compute. The free tier includes 10,000 neurons per day at no cost, and the Workers Paid plan ($5/month) includes 10,000 neurons/day plus pay-as-you-go pricing at $0.011 per 1,000 neurons beyond the free allotment. Each model has a published neuron cost per request in the model catalog, so developers can estimate expenses before deploying. For example, a typical Llama 3.1 8B inference request costs approximately 50 neurons (~$0.00055). Enterprise customers can negotiate volume discounts and committed-use contracts. Neuron costs vary by model size and modality — text generation models consume fewer neurons per request than image generation models.
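The pricing arithmetic above can be checked with a short sketch. It uses only the figures quoted in this tutorial ($0.011 per 1,000 neurons, 10,000 free neurons per day, roughly 50 neurons per Llama 3.1 8B request). The function names are illustrative, and real neuron costs vary by model and request size.

```typescript
// Figures quoted in this tutorial; actual rates may change.
const PRICE_PER_1000_NEURONS_USD = 0.011;
const FREE_NEURONS_PER_DAY = 10000;
const NEURONS_PER_REQUEST = 50; // approximate, for a small text request

// Convert a neuron count to US dollars at the pay-as-you-go rate.
function neuronsToUsd(neurons: number): number {
  return (neurons / 1000) * PRICE_PER_1000_NEURONS_USD;
}

// Cost for one day after subtracting the free daily allotment.
function dailyOverageUsd(neuronsUsedToday: number): number {
  const billable = Math.max(0, neuronsUsedToday - FREE_NEURONS_PER_DAY);
  return neuronsToUsd(billable);
}

const perRequestUsd = neuronsToUsd(NEURONS_PER_REQUEST); // about 0.00055
const freeRequestsPerDay = Math.floor(FREE_NEURONS_PER_DAY / NEURONS_PER_REQUEST); // 200
const costFor1000Requests = dailyOverageUsd(1000 * NEURONS_PER_REQUEST); // about 0.44
```

Under these assumptions, about 200 such requests per day fit in the free allotment, and a day with 1,000 requests would bill 40,000 neurons, or roughly $0.44.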
Workers AI supports custom and fine-tuned models within limits. LoRA adapters can be loaded on selected base models, so fine-tuned weights are applied at inference time without redeploying the base model. You can also bring your own fine-tuned weights for supported architectures through the BYOM program, and Cloudflare integrates with Hugging Face for some model import workflows. Fully custom architectures that fall outside the supported model formats (such as novel attention mechanisms or proprietary model structures) still require dedicated infrastructure and cannot be deployed to Workers AI. Cloudflare continues to expand the range of supported base models and adapter formats, so check the current documentation for the latest compatibility list.
OpenAI offers higher-quality proprietary models like GPT-4o and o-series reasoners, the most mature developer ecosystem, and broader feature coverage (advanced function calling, Assistants API, fine-tuning). Workers AI offers global edge inference with lower latency for geographically distributed users, open-weight models that provide transparency and no vendor lock-in, lower price points for many workloads (especially at scale with smaller models), and tight integration with Cloudflare's storage, networking, and security stack. The choice depends on whether you prioritize frontier model quality (OpenAI) or edge distribution, cost efficiency, and platform integration (Workers AI). Many teams use both — Workers AI for latency-sensitive open-model tasks and OpenAI via AI Gateway for frontier-quality reasoning.
Requests are routed to the nearest Cloudflare data center equipped with GPUs capable of serving the requested model. Cloudflare's anycast network spans over 300 cities globally, and a request entering at any of them is forwarded to a GPU-equipped location, so latency from end user to inference is typically low for popular models that are widely distributed. However, not every model is available at every location: larger models may only be served from a subset of GPU-equipped data centers, which can increase latency for those specific models. Cloudflare's routing layer automatically selects a location by balancing proximity, GPU availability, and current load. The network continues to expand GPU coverage, with the goal of making all catalog models available at every major point of presence.
Now that you know how to use Cloudflare Workers AI, it's time to put this knowledge into practice.
Tutorial updated March 2026