High-throughput, memory-efficient open-source inference and serving engine for LLMs, used as the default backend at many AI companies.
vLLM is the de facto open-source serving engine for large language models. It originated at UC Berkeley's Sky Computing Lab and is now developed by an open community of contributors from Anyscale, Meta, NVIDIA, Databricks, AMD, and many others. Its core innovation is PagedAttention, a virtual-memory-style allocator that stores the KV cache in fixed-size blocks rather than one contiguous region per request. This sharply reduces fragmentation and lets a single GPU host serve far more concurrent requests than a naive transformer stack. On top of PagedAttention, the project layers continuous batching, speculative decoding, prefix caching, tensor and pipeline parallelism, quantization (AWQ, GPTQ, FP8, INT4), and an OpenAI-compatible HTTP server.

vLLM supports nearly every popular architecture, including Llama, Qwen, DeepSeek, Mistral, Phi, and Gemma, multimodal models such as LLaVA and Qwen-VL, and embedding and reranker models, across NVIDIA, AMD, Intel, AWS Inferentia, and Apple Silicon hardware. Because it is open source under the Apache 2.0 license, there is no subscription cost; teams pay only for the GPUs they run it on. vLLM ships as a Python package, a Docker image, and a Kubernetes operator, and it is the default backend behind many managed inference clouds (Together, Fireworks, Lepton, RunPod, parts of AWS Bedrock). Production engineering teams use vLLM when they need self-hosted control over latency, cost, privacy, and routing for their LLM workloads.
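To make the PagedAttention idea described above more concrete, here is a minimal toy sketch of a paged KV-cache allocator in plain Python. It is not vLLM's implementation or API: the class name `PagedKVAllocator`, its methods, and the block size of 16 tokens are all hypothetical, chosen only to illustrate how a per-request block table lets sequences grow one fixed-size block at a time instead of reserving a large contiguous region up front.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, an assumption)


class PagedKVAllocator:
    """Toy block-table allocator (hypothetical, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve room for one more token; return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when every allocated block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Two requests share a pool of four blocks.
alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):          # 17 tokens need two 16-token blocks
    alloc.append_token("req-a")
for _ in range(5):           # 5 tokens fit in a single block
    alloc.append_token("req-b")
print(alloc.block_tables)
```

In this sketch the waste per request is bounded by one partially filled block, and blocks released by `free` are immediately reusable by any other request, which is the property that reduces fragmentation.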