The Serving Engine Landscape (2025)
Three engines dominate production LLM serving: vLLM (ease of use + high concurrency), TensorRT-LLM (maximum hardware efficiency), and SGLang (structured outputs + caching). Your choice determines throughput, latency, and operational overhead for the next 3 years.
vLLM V1: High-Concurrency Champion
Release: January 2025 (V1 alpha), default engine since v0.8.0
Key Features (2025):
- 1.7× throughput gain over V0 without multi-step scheduling
- FlashAttention 3 integration for state-of-the-art attention performance
- Zero-overhead prefix caching: hash-based reuse of KV-cache blocks for shared prompt prefixes, on by default in V1 (see the sketch after this list)
- Chunked prefill: breaks long prefills into chunks to reduce TTFT spikes
- 24% throughput improvement for generation-heavy workloads (v0.8.1 vs v0.7.3)
- Multi-modal support: VLMs like LLaVA, Qwen2-VL with improved latency
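Both caching features can be exercised from vLLM's offline Python API as well as from the server. A minimal sketch, assuming vLLM >= 0.8 and local access to the weights; the model name and prompts are placeholders:
# Minimal sketch: offline inference with prefix caching and chunked prefill
# (assumes vLLM >= 0.8; model name and prompts are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,   # reuse KV-cache blocks for shared prefixes
    enable_chunked_prefill=True,  # split long prefills to keep TTFT low
    max_model_len=8192,
)

# Two prompts sharing a long system prefix; the second reuses the cached blocks.
system = "You are a concise assistant. Answer in one sentence.\n\n"
prompts = [system + "What is paged attention?",
           system + "What is chunked prefill?"]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0.2))
for out in outputs:
    print(out.outputs[0].text.strip())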
Best For:
- Interactive applications requiring fast TTFT (time to first token)
- High-concurrency deployments (50-100+ concurrent requests)
- Rapid prototyping and experimentation (Python-native, easy setup)
- Multi-modal workloads (vision-language models)
vLLM V1 Architecture Upgrade
V1 is a complete rewrite of the core scheduler, with CPU-overhead reductions across the stack. The FlashAttention 3 integration closes the remaining feature gaps while keeping attention performance state of the art, which is critical for production deployments that need both speed and flexibility.
TensorRT-LLM: Hardware Efficiency Leader
Vendor: NVIDIA (optimized for H100, H200, B200 Blackwell)
Key Features (2025):
- Up to 2.72× lower TPOT (time per output token) than vLLM on long-context workloads
- Highest single-request throughput on H100 for Llama 3 8B
- FP8/FP4 quantization with minimal accuracy loss (Blackwell-optimized)
- In-flight batching: continuous batching without padding overhead
- KV cache optimization: paged attention with aggressive memory reuse
- Pipeline parallelism: efficient multi-GPU execution for large models
Best For:
- Low-concurrency deployments (1-10 concurrent requests)
- Maximum hardware utilization (squeezing every TFLOP from H100/B200)
- Long-context workloads (32K+ tokens input, 2K+ output)
- Batch processing pipelines (offline inference, evaluation)
TensorRT-LLM Trade-Off
TensorRT-LLM requires model compilation (engine build step) which can take 15-45 minutes depending on model size. This makes iteration slower than vLLM's dynamic approach. However, the runtime performance gains justify the upfront cost for stable production deployments.
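For quick experiments before committing to the full build pipeline, recent TensorRT-LLM releases expose a high-level Python LLM API that runs the checkpoint conversion and engine build on first use. A minimal sketch, assuming a release that ships tensorrt_llm.LLM and local access to the checkpoint; expect the multi-minute build described above on the first run:
# Sketch: TensorRT-LLM's high-level LLM API (assumes a release that ships it).
# The first call triggers checkpoint conversion and the engine build.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # engine build happens here

params = SamplingParams(max_tokens=64, temperature=0.2)
for output in llm.generate(["Summarize in-flight batching in one sentence."], params):
    print(output.outputs[0].text)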
SGLang: Structured Output Specialist
Origin: open-source project from LMSYS Org; joined the PyTorch Ecosystem in March 2025
Key Features (2025):
- RadixAttention: prefix caching with shared prompt reuse across requests
- Zero-overhead batch scheduler: overlaps CPU scheduling with GPU compute
- Cache-aware load balancing: routes requests to workers with highest cache hit probability
- Structured outputs: JSON schema enforcement, regex constraints (see the sketch after this list)
- Multi-LoRA batching: serve multiple fine-tuned adapters simultaneously
- DeepSeek V3/R1 day-one support with model-specific optimizations
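The structured-output constraints can also be hit over SGLang's native HTTP endpoint, in addition to the frontend DSL shown in the deployment section below. A rough sketch, assuming a server launched as in the quick start (default port 30000); the payload fields (text, sampling_params, regex) follow SGLang's native generate API and should be checked against your installed version:
# Sketch: regex-constrained generation via SGLang's native /generate endpoint.
# Payload fields (text, sampling_params, regex) are based on SGLang's native
# API and may differ across versions; verify against your server's docs.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Return a JSON object describing a user:",
        "sampling_params": {
            "max_new_tokens": 128,
            "temperature": 0.0,
            "regex": r'\{"name": ".+", "age": \d+\}',  # constrain the output shape
        },
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text"])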
Best For:
- Applications requiring structured JSON outputs (agents, APIs)
- Multi-tenant deployments serving multiple fine-tuned variants
- Workloads with high prompt prefix overlap (chatbots, RAG)
- DeepSeek models (V3, R1) with native optimizations
Production Scale: Deployed at large scale generating trillions of tokens daily (as of 2025).
Performance Benchmarks: H100 GPUs (2025)
All benchmarks conducted on 2× NVIDIA H100 80GB GPUs using production-realistic workloads. Models tested: Llama 3.1 8B, Llama 3.3 70B, and GPT-OSS-120B.
Throughput Comparison
Dataset: ShareGPT (mixed short/long prompts, realistic distribution)
| Engine | Llama 3.1 8B | Llama 3.3 70B | GPT-OSS-120B (100 concurrent requests) |
|---|---|---|---|
| TensorRT-LLM | Highest | Competitive | Lower |
| vLLM V1 | Second highest | Highest | 4,741 tokens/sec (highest) |
| SGLang | Competitive | Moderate | Moderate |
Key Insights:
- vLLM dominates high-concurrency: At 100 concurrent requests, vLLM achieves 4,741 tokens/sec on GPT-OSS-120B (highest)
- TensorRT-LLM wins on small models: Llama 3.1 8B sees its highest throughput with TensorRT-LLM on short sequences
- Model size matters: vLLM's scheduler scales better for 70B+ models under load
Latency Analysis
Metrics: TTFT (Time to First Token) and TPOT (Time Per Output Token)
Real-World Latency Impact
Interactive chatbot (300-500 token outputs):
- vLLM V1: ~400ms TTFT, ~2.5s total → the first token appears almost immediately, so the response feels instant
- TensorRT-LLM: ~800ms TTFT, ~2s total → finishes slightly sooner overall, but the longer wait before the first token reads as lag
Document summarization (2K+ token outputs):
- TensorRT-LLM: 2.72× faster generation → completes in 8s vs 22s
- Users don't notice the TTFT difference when generation takes 8+ seconds anyway (see the back-of-the-envelope sketch below)
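These totals follow from simple arithmetic: total latency ≈ TTFT + output_tokens × TPOT. A back-of-the-envelope check with illustrative TTFT/TPOT values (chosen to land near the figures above, not measured benchmark numbers):
# Back-of-the-envelope latency model: total ≈ TTFT + output_tokens * TPOT.
# TTFT/TPOT values are illustrative, not measured benchmark numbers.
def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    return ttft_s + tpot_s * output_tokens

# Interactive chat (~400 output tokens): TTFT dominates the perceived wait.
print(f"chat, vLLM V1:       {total_latency(0.4, 0.0050, 400):.1f} s")   # ~2.4 s
print(f"chat, TensorRT-LLM:  {total_latency(0.8, 0.0030, 400):.1f} s")   # ~2.0 s

# Summarization (~2,000 output tokens): TPOT dominates, TTFT barely matters.
print(f"summary, vLLM V1:      {total_latency(0.4, 0.0108, 2000):.1f} s")  # ~22 s
print(f"summary, TensorRT-LLM: {total_latency(0.8, 0.0036, 2000):.1f} s")  # ~8 s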
Concurrency Scaling
Test: GPT-OSS-120B on 2× H100, varying concurrent requests from 1 to 100
Low Concurrency (1-10 requests)
Winner: TensorRT-LLM
- Highest per-request throughput (minimal batching overhead)
- Best for single-user applications, evaluation pipelines
- TPOT advantage shines on long-context tasks
Medium Concurrency (10-50 requests)
Winner: SGLang
- Moderate throughput with consistent performance
- Cache-aware load balancing reduces latency variance
- Structured output features add value without overhead
High Concurrency (50-100+ requests)
Winner: vLLM V1
- 4,741 tokens/sec at 100 concurrent requests (highest measured)
- Fast TTFT maintains responsiveness under load
- Excellent scaling characteristics for production APIs
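A sweep like this can be reproduced against any of the three engines, since each can be fronted by an OpenAI-compatible endpoint. A minimal sketch with the openai Python client and asyncio; the base URL, model name, and prompt are placeholders for your own deployment:
# Sketch: aggregate output tokens/sec at several concurrency levels against an
# OpenAI-compatible server. Base URL, model, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    model = "meta-llama/Llama-3.1-70B-Instruct"

    async def one_request() -> int:
        resp = await client.completions.create(
            model=model,
            prompt="Write a short paragraph about GPUs.",
            max_tokens=256,
        )
        return resp.usage.completion_tokens

    for level in (1, 10, 50, 100):
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request() for _ in range(level)))
        rate = sum(tokens) / (time.perf_counter() - start)
        print(f"{level:>3} concurrent: {rate:,.0f} output tokens/sec")

asyncio.run(main())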
Decision Matrix: Choosing Your Engine
Choose vLLM V1 if:
- Interactive user-facing applications (chatbots, assistants, APIs)
- Expected concurrency >20 simultaneous requests
- Rapid iteration and experimentation (no compilation step)
- Multi-modal models (vision-language, audio)
- Team lacks deep CUDA/TensorRT expertise
- Need fastest time-to-first-token (TTFT) for responsiveness
Choose TensorRT-LLM if:
- Low concurrency (1-10 requests) with maximum hardware efficiency required
- Long-context workloads (32K+ input, 2K+ output tokens)
- Batch processing pipelines (offline inference, evaluation)
- Stable model + deployment (compilation time acceptable)
- NVIDIA H100/H200/B200 hardware (vendor-optimized)
- Need FP8/FP4 quantization with minimal accuracy loss
Choose SGLang if:
- Applications require structured JSON outputs (agents, tool calling)
- High prompt prefix overlap (chatbots with system prompts, RAG)
- Multi-tenant deployment serving multiple LoRA adapters
- Using DeepSeek V3 or DeepSeek R1 models
- Need consistent mid-tier performance without tuning
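The same logic, condensed into a toy selector for illustration (not an official tool from any of these projects):
# Toy selector encoding the decision matrix above (illustrative only).
def pick_engine(concurrency: int,
                needs_structured_json: bool = False,
                multi_lora: bool = False,
                long_context: bool = False,
                stable_deployment: bool = False) -> str:
    if needs_structured_json or multi_lora:
        return "SGLang"
    if concurrency <= 10 and (long_context or stable_deployment):
        return "TensorRT-LLM"
    return "vLLM V1"  # default for interactive, high-concurrency serving

print(pick_engine(concurrency=100))                              # vLLM V1
print(pick_engine(concurrency=4, long_context=True))             # TensorRT-LLM
print(pick_engine(concurrency=30, needs_structured_json=True))   # SGLang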
Deployment Quick Start
vLLM V1 Installation
# Install vLLM with V1 engine (default since 0.8.0)
pip install "vllm>=0.8.1"
# Serve Llama 3.1 70B on 2× H100
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--enable-prefix-caching
# Test endpoint
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"prompt": "Explain quantum computing in 3 sentences.",
"max_tokens": 100
}'
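The same request from Python, using the official openai client pointed at the local server (any placeholder API key works unless the server was started with --api-key):
# Query the local vLLM server with the openai client instead of curl.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    prompt="Explain quantum computing in 3 sentences.",
    max_tokens=100,
)
print(resp.choices[0].text)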
TensorRT-LLM Build & Serve
# Clone TensorRT-LLM
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
# Build engine (one-time, ~30 min for 70B)
python examples/llama/convert_checkpoint.py \
--model_dir ./llama-3.1-70b-instruct \
--output_dir ./llama-3.1-70b-trt-ckpt \
--dtype float16 \
--tp_size 2
trtllm-build \
--checkpoint_dir ./llama-3.1-70b-trt-ckpt \
--output_dir ./llama-3.1-70b-trt-engine \
--gemm_plugin float16 \
--max_batch_size 256
# Serve with Triton (launch_triton_server.py ships with NVIDIA's tensorrtllm_backend repo
# and expects a prepared Triton model repository)
python3 scripts/launch_triton_server.py \
--model_repo=./llama-3.1-70b-trt-engine \
--tensorrt_llm_model_name=llama-3.1-70b
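Once Triton is up, a quick smoke test can go through its HTTP generate endpoint. The model name (ensemble) and request fields (text_input, max_tokens, text_output) below follow the tensorrtllm_backend examples and may differ depending on how your model repository is laid out:
# Smoke test against Triton's HTTP generate endpoint (default port 8000).
# The "ensemble" model name and text_input/max_tokens/text_output fields follow
# tensorrtllm_backend conventions; adjust them to match your model repository.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "Explain quantum computing in 3 sentences.",
          "max_tokens": 100},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text_output"])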
SGLang Deployment
# Install SGLang
pip install "sglang[all]"
# Serve with RadixAttention and structured outputs
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 2 \
--mem-fraction-static 0.85 \
--enable-torch-compile
# Structured output example (JSON schema enforcement)
from sglang import function, gen, set_default_backend, RuntimeEndpoint
@function
def generate_user(s):
    s += "Generate a user profile:\n"
    s += gen("profile", max_tokens=200,
             regex=r'\{"name": ".+", "age": \d+, "email": ".+@.+"\}')
set_default_backend(RuntimeEndpoint("http://localhost:30000"))
state = generate_user.run()
print(state["profile"])
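SGLang also serves an OpenAI-compatible API on the same port, so the client code from the vLLM quick start works by swapping the base URL; the model field should match the --model-path used at launch. Reusing one long system prompt across requests is exactly the pattern where RadixAttention's prefix cache pays off:
# Same OpenAI-client pattern as the vLLM test, pointed at SGLang (port 30000).
# Requests sharing the system prompt reuse its KV cache via RadixAttention.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
system = "You are a support agent for ExampleCorp. Answer briefly and politely."

for question in ["How do I reset my password?", "How do I close my account?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
        max_tokens=100,
    )
    print(resp.choices[0].message.content)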