The Serving Engine Landscape (2025)
Three engines dominate production LLM serving: vLLM (ease of use + high concurrency), TensorRT-LLM (maximum hardware efficiency), and SGLang (structured outputs + caching). Your choice determines throughput, latency, and operational overhead for the next 3 years.
vLLM V1: High-Concurrency Champion
Release: January 2025 (V1 alpha), default engine since v0.8.0
Key Features (2025):
- 1.7× throughput gain over V0 without multi-step scheduling
- FlashAttention 3 integration for state-of-the-art attention performance
- Zero-overhead prefix caching: hash-based reuse of KV-cache blocks for shared prompt prefixes, on by default in V1 (see the sketch after this list)
- Chunked prefill: breaks long prefills into chunks to reduce TTFT spikes
- 24% throughput improvement for generation-heavy workloads (v0.8.1 vs v0.7.3)
- Multi-modal support: VLMs like LLaVA, Qwen2-VL with improved latency
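Both caching features can be exercised from vLLM's offline Python API as well as from the server. A minimal sketch, assuming vLLM >= 0.8 and local access to the weights; the model name and prompts are placeholders:
# Minimal sketch: offline inference with prefix caching and chunked prefill
# (assumes vLLM >= 0.8; model name and prompts are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,   # reuse KV-cache blocks for shared prefixes
    enable_chunked_prefill=True,  # split long prefills to keep TTFT low
    max_model_len=8192,
)

# Two prompts sharing a long system prefix; the second reuses the cached blocks.
system = "You are a concise assistant. Answer in one sentence.\n\n"
prompts = [system + "What is paged attention?",
           system + "What is chunked prefill?"]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0.2))
for out in outputs:
    print(out.outputs[0].text.strip())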
Best For:
- Interactive applications requiring fast TTFT (time to first token)
- High-concurrency deployments (50-100+ concurrent requests)
- Rapid prototyping and experimentation (Python-native, easy setup)
- Multi-modal workloads (vision-language models)
vLLM V1 Architecture Upgrade
V1 is a complete rewrite of the core scheduler, with CPU-overhead reductions across the stack. The FlashAttention 3 integration closes the remaining feature gaps while keeping attention performance state of the art, which is critical for production deployments that need both speed and flexibility.
TensorRT-LLM: Hardware Efficiency Leader
Vendor: NVIDIA (optimized for H100, H200, B200 Blackwell)
Key Features (2025):
- Up to 2.72× lower TPOT (time per output token) than vLLM on long-context workloads
- Highest single-request throughput on H100 for Llama 3 8B
- FP8/FP4 quantization with minimal accuracy loss (Blackwell-optimized)
- In-flight batching: continuous batching without padding overhead
- KV cache optimization: paged attention with aggressive memory reuse
- Pipeline parallelism: efficient multi-GPU execution for large models
Best For:
- Low-concurrency deployments (1-10 concurrent requests)
- Maximum hardware utilization (squeezing every TFLOP from H100/B200)
- Long-context workloads (32K+ tokens input, 2K+ output)
- Batch processing pipelines (offline inference, evaluation)
TensorRT-LLM Trade-Off
TensorRT-LLM requires model compilation (engine build step) which can take 15-45 minutes depending on model size. This makes iteration slower than vLLM's dynamic approach. However, the runtime performance gains justify the upfront cost for stable production deployments.
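For quick experiments before committing to the full build pipeline, recent TensorRT-LLM releases expose a high-level Python LLM API that runs the checkpoint conversion and engine build on first use. A minimal sketch, assuming a release that ships tensorrt_llm.LLM and local access to the checkpoint; expect the multi-minute build described above on the first run:
# Sketch: TensorRT-LLM's high-level LLM API (assumes a release that ships it).
# The first call triggers checkpoint conversion and the engine build.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # engine build happens here

params = SamplingParams(max_tokens=64, temperature=0.2)
for output in llm.generate(["Summarize in-flight batching in one sentence."], params):
    print(output.outputs[0].text)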
SGLang: Structured Output Specialist
Origin: open-source project from LMSYS Org; joined the PyTorch Ecosystem in March 2025
Key Features (2025):
- RadixAttention: prefix caching with shared prompt reuse across requests
- Zero-overhead batch scheduler: overlaps CPU scheduling with GPU compute
- Cache-aware load balancing: routes requests to workers with highest cache hit probability
- Structured outputs: JSON schema enforcement, regex constraints (see the sketch after this list)
- Multi-LoRA batching: serve multiple fine-tuned adapters simultaneously
- DeepSeek V3/R1 day-one support with model-specific optimizations
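The structured-output constraints can also be hit over SGLang's native HTTP endpoint, in addition to the frontend DSL shown in the deployment section below. A rough sketch, assuming a server launched as in the quick start (default port 30000); the payload fields (text, sampling_params, regex) follow SGLang's native generate API and should be checked against your installed version:
# Sketch: regex-constrained generation via SGLang's native /generate endpoint.
# Payload fields (text, sampling_params, regex) are based on SGLang's native
# API and may differ across versions; verify against your server's docs.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Return a JSON object describing a user:",
        "sampling_params": {
            "max_new_tokens": 128,
            "temperature": 0.0,
            "regex": r'\{"name": ".+", "age": \d+\}',  # constrain the output shape
        },
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text"])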
Best For:
- Applications requiring structured JSON outputs (agents, APIs)
- Multi-tenant deployments serving multiple fine-tuned variants
- Workloads with high prompt prefix overlap (chatbots, RAG)
- DeepSeek models (V3, R1) with native optimizations
Production Scale: Deployed at large scale generating trillions of tokens daily (as of 2025).
Performance Benchmarks: H100 GPUs (2025)
All benchmarks conducted on 2× NVIDIA H100 80GB GPUs using production-realistic workloads. Models tested: Llama 3.1 8B, Llama 3.3 70B, and GPT-OSS-120B.
Throughput Comparison
Dataset: ShareGPT (mixed short/long prompts, realistic distribution)
| Engine | Llama 3.1 8B | Llama 3.3 70B | GPT-OSS-120B (100 concurrent requests) |
|---|---|---|---|
| TensorRT-LLM | Highest | Competitive | Lower |
| vLLM V1 | Second highest | Highest | 4,741 tokens/sec (highest) |
| SGLang | Competitive | Moderate | Moderate |
Key Insights:
- vLLM dominates high-concurrency: At 100 concurrent requests, vLLM achieves 4,741 tokens/sec on GPT-OSS-120B (highest)
- TensorRT-LLM wins on small models: Llama 3.1 8B sees its highest throughput with TensorRT-LLM on short sequences
- Model size matters: vLLM's scheduler scales better for 70B+ models under load
Latency Analysis
Metrics: TTFT (Time to First Token) and TPOT (Time Per Output Token)
Real-World Latency Impact
Interactive chatbot (300-500 token outputs):
- vLLM V1: ~400ms TTFT, ~2.5s total → the first token appears almost immediately, so the response feels instant
- TensorRT-LLM: ~800ms TTFT, ~2s total → finishes slightly sooner overall, but the longer wait before the first token reads as lag
Document summarization (2K+ token outputs):
- TensorRT-LLM: 2.72× faster generation → completes in 8s vs 22s
- Users don't notice the TTFT difference when generation takes 8+ seconds anyway (see the back-of-the-envelope sketch below)
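These totals follow from simple arithmetic: total latency ≈ TTFT + output_tokens × TPOT. A back-of-the-envelope check with illustrative TTFT/TPOT values (chosen to land near the figures above, not measured benchmark numbers):
# Back-of-the-envelope latency model: total ≈ TTFT + output_tokens * TPOT.
# TTFT/TPOT values are illustrative, not measured benchmark numbers.
def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    return ttft_s + tpot_s * output_tokens

# Interactive chat (~400 output tokens): TTFT dominates the perceived wait.
print(f"chat, vLLM V1:       {total_latency(0.4, 0.0050, 400):.1f} s")   # ~2.4 s
print(f"chat, TensorRT-LLM:  {total_latency(0.8, 0.0030, 400):.1f} s")   # ~2.0 s

# Summarization (~2,000 output tokens): TPOT dominates, TTFT barely matters.
print(f"summary, vLLM V1:      {total_latency(0.4, 0.0108, 2000):.1f} s")  # ~22 s
print(f"summary, TensorRT-LLM: {total_latency(0.8, 0.0036, 2000):.1f} s")  # ~8 s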
Concurrency Scaling
Test: GPT-OSS-120B on 2× H100, varying concurrent requests from 1 to 100
Low Concurrency (1-10 requests)
Winner: TensorRT-LLM
- Highest per-request throughput (minimal batching overhead)
- Best for single-user applications, evaluation pipelines
- TPOT advantage shines on long-context tasks
Medium Concurrency (10-50 requests)
Winner: SGLang
- Moderate throughput with consistent performance
- Cache-aware load balancing reduces latency variance
- Structured output features add value without overhead
High Concurrency (50-100+ requests)
Winner: vLLM V1
- 4,741 tokens/sec at 100 concurrent requests (highest measured)
- Fast TTFT maintains responsiveness under load
- Excellent scaling characteristics for production APIs
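A sweep like this can be reproduced against any of the three engines, since each can be fronted by an OpenAI-compatible endpoint. A minimal sketch with the openai Python client and asyncio; the base URL, model name, and prompt are placeholders for your own deployment:
# Sketch: aggregate output tokens/sec at several concurrency levels against an
# OpenAI-compatible server. Base URL, model, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    model = "meta-llama/Llama-3.1-70B-Instruct"

    async def one_request() -> int:
        resp = await client.completions.create(
            model=model,
            prompt="Write a short paragraph about GPUs.",
            max_tokens=256,
        )
        return resp.usage.completion_tokens

    for level in (1, 10, 50, 100):
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request() for _ in range(level)))
        rate = sum(tokens) / (time.perf_counter() - start)
        print(f"{level:>3} concurrent: {rate:,.0f} output tokens/sec")

asyncio.run(main())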
Decision Matrix: Choosing Your Engine
Choose vLLM V1 if:
- Interactive user-facing applications (chatbots, assistants, APIs)
- Expected concurrency >20 simultaneous requests
- Rapid iteration and experimentation (no compilation step)
- Multi-modal models (vision-language, audio)
- Team lacks deep CUDA/TensorRT expertise
- Need fastest time-to-first-token (TTFT) for responsiveness
Choose TensorRT-LLM if:
- Low concurrency (1-10 requests) with maximum hardware efficiency required
- Long-context workloads (32K+ input, 2K+ output tokens)
- Batch processing pipelines (offline inference, evaluation)
- Stable model + deployment (compilation time acceptable)
- NVIDIA H100/H200/B200 hardware (vendor-optimized)
- Need FP8/FP4 quantization with minimal accuracy loss
Choose SGLang if:
- Applications require structured JSON outputs (agents, tool calling)
- High prompt prefix overlap (chatbots with system prompts, RAG)
- Multi-tenant deployment serving multiple LoRA adapters
- Using DeepSeek V3 or DeepSeek R1 models
- Need consistent mid-tier performance without tuning
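The same logic, condensed into a toy selector for illustration (not an official tool from any of these projects):
# Toy selector encoding the decision matrix above (illustrative only).
def pick_engine(concurrency: int,
                needs_structured_json: bool = False,
                multi_lora: bool = False,
                long_context: bool = False,
                stable_deployment: bool = False) -> str:
    if needs_structured_json or multi_lora:
        return "SGLang"
    if concurrency <= 10 and (long_context or stable_deployment):
        return "TensorRT-LLM"
    return "vLLM V1"  # default for interactive, high-concurrency serving

print(pick_engine(concurrency=100))                              # vLLM V1
print(pick_engine(concurrency=4, long_context=True))             # TensorRT-LLM
print(pick_engine(concurrency=30, needs_structured_json=True))   # SGLang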
Deployment Quick Start
vLLM V1 Installation
# Install vLLM with V1 engine (default since 0.8.0)
pip install "vllm>=0.8.1"
# Serve Llama 3.1 70B on 2× H100
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--enable-prefix-caching
# Test endpoint
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"prompt": "Explain quantum computing in 3 sentences.",
"max_tokens": 100
}'
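The same request from Python, using the official openai client pointed at the local server (any placeholder API key works unless the server was started with --api-key):
# Query the local vLLM server with the openai client instead of curl.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    prompt="Explain quantum computing in 3 sentences.",
    max_tokens=100,
)
print(resp.choices[0].text)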
TensorRT-LLM Build & Serve
# Clone TensorRT-LLM
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
# Build engine (one-time, ~30 min for 70B)
python examples/llama/convert_checkpoint.py \
--model_dir ./llama-3.1-70b-instruct \
--output_dir ./llama-3.1-70b-trt-ckpt \
--dtype float16 \
--tp_size 2
trtllm-build \
--checkpoint_dir ./llama-3.1-70b-trt-ckpt \
--output_dir ./llama-3.1-70b-trt-engine \
--gemm_plugin float16 \
--max_batch_size 256
# Serve with Triton (launch_triton_server.py ships with NVIDIA's tensorrtllm_backend repo
# and expects a prepared Triton model repository)
python3 scripts/launch_triton_server.py \
--model_repo=./llama-3.1-70b-trt-engine \
--tensorrt_llm_model_name=llama-3.1-70b
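Once Triton is up, a quick smoke test can go through its HTTP generate endpoint. The model name (ensemble) and request fields (text_input, max_tokens, text_output) below follow the tensorrtllm_backend examples and may differ depending on how your model repository is laid out:
# Smoke test against Triton's HTTP generate endpoint (default port 8000).
# The "ensemble" model name and text_input/max_tokens/text_output fields follow
# tensorrtllm_backend conventions; adjust them to match your model repository.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "Explain quantum computing in 3 sentences.",
          "max_tokens": 100},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text_output"])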
SGLang Deployment
# Install SGLang
pip install "sglang[all]"
# Serve with RadixAttention and structured outputs
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 2 \
--mem-fraction-static 0.85 \
--enable-torch-compile
# Structured output example (JSON schema enforcement)
from sglang import function, gen, set_default_backend, RuntimeEndpoint
@function
def generate_user(s):
    s += "Generate a user profile:\n"
    s += gen("profile", max_tokens=200,
             regex=r'\{"name": ".+", "age": \d+, "email": ".+@.+"\}')
set_default_backend(RuntimeEndpoint("http://localhost:30000"))
state = generate_user.run()
print(state["profile"])
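SGLang also serves an OpenAI-compatible API on the same port, so the client code from the vLLM quick start works by swapping the base URL; the model field should match the --model-path used at launch. Reusing one long system prompt across requests is exactly the pattern where RadixAttention's prefix cache pays off:
# Same OpenAI-client pattern as the vLLM test, pointed at SGLang (port 30000).
# Requests sharing the system prompt reuse its KV cache via RadixAttention.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
system = "You are a support agent for ExampleCorp. Answer briefly and politely."

for question in ["How do I reset my password?", "How do I close my account?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
        max_tokens=100,
    )
    print(resp.choices[0].message.content)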