vLLM vs TensorRT-LLM: Production Serving Guide

You've chosen on-premise LLM deployment. Now comes the critical decision: which serving engine? vLLM V1, TensorRT-LLM, or SGLang? This guide compares throughput, latency, and operational complexity across real-world workloads on H100 GPUs—with benchmarks you can trust.

The Serving Engine Landscape (2025)

Three engines dominate production LLM serving: vLLM (ease of use + high concurrency), TensorRT-LLM (maximum hardware efficiency), and SGLang (structured outputs + caching). Your choice determines throughput, latency, and operational overhead for the next 3 years.

vLLM V1: High-Concurrency Champion

Release: January 2025 (V1 alpha), default engine since v0.8.0

Key Features (2025):

  • 1.7× throughput gain over V0 without multi-step scheduling
  • FlashAttention 3 integration for state-of-the-art attention performance
  • Zero-overhead prefix caching: automatic reuse of KV cache for shared prompt prefixes
  • Chunked prefill: breaks long prefills into chunks to reduce TTFT spikes
  • 24% throughput improvement for generation-heavy workloads (v0.8.1 vs v0.7.3)
  • Multi-modal support: VLMs like LLaVA, Qwen2-VL with improved latency

Best For:

  • Interactive applications requiring fast TTFT (time to first token)
  • High-concurrency deployments (50-100+ concurrent requests)
  • Rapid prototyping and experimentation (Python-native, easy setup)
  • Multi-modal workloads (vision-language models)

vLLM V1 Architecture Upgrade

V1 is a ground-up rewrite of the core scheduler with CPU-overhead reductions across the stack. The integration of FlashAttention 3 brings the new engine to feature parity with V0's attention path while maintaining excellent performance, which is critical for production deployments that need both speed and flexibility.
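As a mental model for V1's prefix caching (a toy sketch, not vLLM internals): prompts are split into fixed-size token blocks, and a block whose entire prefix matches one from an earlier request reuses the cached KV blocks instead of recomputing them.

```python
from typing import List, Tuple

BLOCK_SIZE = 4  # vLLM uses 16-token blocks by default; 4 keeps the demo readable

class ToyPrefixCache:
    """Toy block-level prefix cache: maps prompt prefixes to cached KV blocks."""
    def __init__(self):
        self.blocks = {}  # key: tuple of tokens from prompt start -> "KV block"

    def lookup_and_fill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (cached_blocks, computed_blocks) for one prompt."""
        cached = computed = 0
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            key = tuple(tokens[: i + BLOCK_SIZE])  # key covers the whole prefix
            if key in self.blocks:
                cached += 1
            else:
                self.blocks[key] = object()  # stand-in for a KV-cache block
                computed += 1
        return cached, computed

cache = ToyPrefixCache()
system_prompt = list(range(8))              # shared 8-token system prompt
req1 = system_prompt + [100, 101, 102, 103]
req2 = system_prompt + [200, 201, 202, 203]

print(cache.lookup_and_fill(req1))  # (0, 3): nothing cached yet
print(cache.lookup_and_fill(req2))  # (2, 1): both system-prompt blocks reused
```

The second request only computes its final block, which is why workloads with heavy system-prompt overlap see large wins from prefix caching.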

TensorRT-LLM: Hardware Efficiency Leader

Vendor: NVIDIA (optimized for H100, H200, B200 Blackwell)

Key Features (2025):

  • 2.72× faster TPOT (time per output token) on long-context workloads vs vLLM
  • Highest single-request throughput on H100 for Llama 3 8B
  • FP8/FP4 quantization with minimal accuracy loss (Blackwell-optimized)
  • In-flight batching: continuous batching without padding overhead
  • KV cache optimization: paged attention with aggressive memory reuse
  • Pipeline parallelism: efficient multi-GPU execution for large models

Best For:

  • Low-concurrency deployments (1-10 concurrent requests)
  • Maximum hardware utilization (squeezing every TFLOP from H100/B200)
  • Long-context workloads (32K+ tokens input, 2K+ output)
  • Batch processing pipelines (offline inference, evaluation)

TensorRT-LLM Trade-Off

TensorRT-LLM requires model compilation (engine build step) which can take 15-45 minutes depending on model size. This makes iteration slower than vLLM's dynamic approach. However, the runtime performance gains justify the upfront cost for stable production deployments.

SGLang: Structured Output Specialist

Release: Open-source (LMSYS Org), joined PyTorch Ecosystem March 2025

Key Features (2025):

  • RadixAttention: prefix caching with shared prompt reuse across requests
  • Zero-overhead batch scheduler: overlaps CPU scheduling with GPU compute
  • Cache-aware load balancing: routes requests to workers with highest cache hit probability
  • Structured outputs: JSON schema enforcement, regex constraints
  • Multi-LoRA batching: serve multiple fine-tuned adapters simultaneously
  • DeepSeek V3/R1 day-one support with model-specific optimizations

Best For:

  • Applications requiring structured JSON outputs (agents, APIs)
  • Multi-tenant deployments serving multiple fine-tuned variants
  • Workloads with high prompt prefix overlap (chatbots, RAG)
  • DeepSeek models (V3, R1) with native optimizations

Production Scale: deployed at large scale, generating trillions of tokens daily (as of 2025).

Performance Benchmarks: H100 GPUs (2025)

All benchmarks conducted on 2× NVIDIA H100 80GB GPUs using production-realistic workloads. Models tested: Llama 3.1 8B, Llama 3.3 70B, and GPT-OSS-120B.

Throughput Comparison

Dataset: ShareGPT (mixed short/long prompts, realistic distribution)

Relative throughput (tokens/sec) by model:

Engine         Llama 3.1 8B      Llama 3.3 70B    GPT-OSS-120B
TensorRT-LLM   Highest           Competitive      Lower at 100 concurrent requests
vLLM V1        Second highest    Highest          4,741 tokens/sec @ 100 requests
SGLang         Competitive       Moderate         Moderate

Key Insights:

  • vLLM dominates high-concurrency: At 100 concurrent requests, vLLM achieves 4,741 tokens/sec on GPT-OSS-120B (highest)
  • TensorRT-LLM wins small models: Llama 3.1 8B sees highest throughput with TensorRT-LLM on short sequences
  • Model size matters: vLLM's scheduler scales better for 70B+ models under load

Latency Analysis

Metrics: TTFT (Time to First Token) and TPOT (Time Per Output Token)

vLLM V1 - TTFT: Fastest across all concurrency levels
vLLM V1 - TPOT: Competitive for short-medium outputs
TensorRT-LLM - TTFT: Slowest of the three in these tests (the runtime is tuned for sustained throughput rather than first-token latency; engine compilation itself happens offline and does not affect per-request TTFT)
TensorRT-LLM - TPOT: 2.72× faster than vLLM on long outputs (>1K tokens)
SGLang - TTFT: Moderate (benefits from prefix caching on cache hits)
SGLang - TPOT: Consistent mid-tier performance

Real-World Latency Impact

Interactive chatbot (300-500 token outputs):

  • vLLM V1: ~400ms TTFT, ~2.5s total → feels instant
  • TensorRT-LLM: ~800ms TTFT, ~2s total → first token feels delayed, despite the shorter total time

Document summarization (2K+ token outputs):

  • TensorRT-LLM: 2.72× faster generation → completes in 8s vs 22s
  • User doesn't notice TTFT when generation takes 8+ seconds anyway
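The numbers above follow from a simple latency model: end-to-end latency ≈ TTFT + TPOT × (output tokens − 1). A sketch with illustrative per-engine figures (assumed for the chatbot scenario, not measured here):

```python
def end_to_end_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency = time to first token + per-token time for the rest."""
    return ttft_s + tpot_s * (output_tokens - 1)

# Illustrative numbers roughly reproducing the chatbot scenario above (assumed):
vllm_chat = end_to_end_latency(ttft_s=0.4, tpot_s=0.0042, output_tokens=500)
trt_chat  = end_to_end_latency(ttft_s=0.8, tpot_s=0.0024, output_tokens=500)
print(f"vLLM chatbot:         {vllm_chat:.1f} s")  # ~2.5 s
print(f"TensorRT-LLM chatbot: {trt_chat:.1f} s")   # ~2.0 s
```

The model makes the trade-off explicit: TTFT dominates perceived responsiveness for short outputs, while TPOT dominates total time once outputs run long.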

Concurrency Scaling

Test: GPT-OSS-120B on 2× H100, varying concurrent requests from 1 to 100

Low Concurrency (1-10 requests)

Winner: TensorRT-LLM

  • Highest per-request throughput (minimal batching overhead)
  • Best for single-user applications, evaluation pipelines
  • TPOT advantage shines on long-context tasks

Medium Concurrency (10-50 requests)

Winner: SGLang

  • Moderate throughput with consistent performance
  • Cache-aware load balancing reduces latency variance
  • Structured output features add value without overhead

High Concurrency (50-100+ requests)

Winner: vLLM V1

  • 4,741 tokens/sec at 100 concurrent requests (highest measured)
  • Fast TTFT maintains responsiveness under load
  • Excellent scaling characteristics for production APIs

Decision Matrix: Choosing Your Engine

Choose vLLM V1 if:

  • Interactive user-facing applications (chatbots, assistants, APIs)
  • Expected concurrency >20 simultaneous requests
  • Rapid iteration and experimentation (no compilation step)
  • Multi-modal models (vision-language, audio)
  • Team lacks deep CUDA/TensorRT expertise
  • Need fastest time-to-first-token (TTFT) for responsiveness

Choose TensorRT-LLM if:

  • Low concurrency (1-10 requests) with maximum hardware efficiency required
  • Long-context workloads (32K+ input, 2K+ output tokens)
  • Batch processing pipelines (offline inference, evaluation)
  • Stable model + deployment (compilation time acceptable)
  • NVIDIA H100/H200/B200 hardware (vendor-optimized)
  • Need FP8/FP4 quantization with minimal accuracy loss

Choose SGLang if:

  • Applications require structured JSON outputs (agents, tool calling)
  • High prompt prefix overlap (chatbots with system prompts, RAG)
  • Multi-tenant deployment serving multiple LoRA adapters
  • Using DeepSeek V3 or DeepSeek R1 models
  • Need consistent mid-tier performance without tuning
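The matrix above can be condensed into a first-pass heuristic. A sketch (the thresholds mirror the buckets above; real selection should still follow workload benchmarking):

```python
def pick_engine(concurrency: int, needs_structured_json: bool = False,
                long_context: bool = False, stable_deployment: bool = False) -> str:
    """First-pass engine choice mirroring the decision matrix above."""
    if needs_structured_json:
        return "SGLang"
    if concurrency <= 10 and long_context and stable_deployment:
        return "TensorRT-LLM"
    # High concurrency, or no special constraints: lowest operational overhead
    return "vLLM V1"

print(pick_engine(concurrency=100))                             # vLLM V1
print(pick_engine(concurrency=4, long_context=True,
                  stable_deployment=True))                      # TensorRT-LLM
print(pick_engine(concurrency=30, needs_structured_json=True))  # SGLang
```

Treat the output as a starting hypothesis to benchmark, not a final answer.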

Deployment Quick Start

vLLM V1 Installation

# Install vLLM with V1 engine (default since 0.8.0)
pip install "vllm>=0.8.1"   # quote the spec so the shell doesn't treat >= as a redirect

# Serve Llama 3.1 70B on 2× H100
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enable-prefix-caching

# Test endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain quantum computing in 3 sentences.",
    "max_tokens": 100
  }'
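The same request can be issued from Python with the standard library alone. A minimal sketch, assuming the server started above is listening on localhost:8000:

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 100) -> dict:
    """Request body for the OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

def complete(prompt: str,
             url: str = "http://localhost:8000/v1/completions") -> str:
    """POST a completion request and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# complete("Explain quantum computing in 3 sentences.")  # needs a running server
```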

TensorRT-LLM Build & Serve

# Clone TensorRT-LLM
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# Build engine (one-time, ~30 min for 70B)
python examples/llama/convert_checkpoint.py \
  --model_dir ./llama-3.1-70b-instruct \
  --output_dir ./llama-3.1-70b-trt-ckpt \
  --dtype float16 \
  --tp_size 2

trtllm-build \
  --checkpoint_dir ./llama-3.1-70b-trt-ckpt \
  --output_dir ./llama-3.1-70b-trt-engine \
  --gemm_plugin float16 \
  --max_batch_size 256

# Serve with Triton (launch script ships with NVIDIA's tensorrtllm_backend repo)
python3 scripts/launch_triton_server.py \
  --model_repo=./llama-3.1-70b-trt-engine \
  --tensorrt_llm_model_name=llama-3.1-70b

SGLang Deployment

# Install SGLang
pip install "sglang[all]"

# Serve with RadixAttention and structured outputs
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 \
  --mem-fraction-static 0.85 \
  --enable-torch-compile

# Structured output example (JSON schema enforcement)
from sglang import function, gen, set_default_backend, RuntimeEndpoint

@function
def generate_user(s):
    s += "Generate a user profile:\\n"
    s += gen("profile", max_tokens=200,
             regex=r'\{"name": ".+", "age": \d+, "email": ".+@.+"\}')

set_default_backend(RuntimeEndpoint("http://localhost:30000"))
state = generate_user.run()
print(state["profile"])
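The same constraint can be checked client-side before trusting server-side enforcement. A sketch applying the regex from the example above with Python's re module:

```python
import re

# Same constraint regex as in the SGLang example above
PROFILE_RE = re.compile(r'\{"name": ".+", "age": \d+, "email": ".+@.+"\}')

sample = '{"name": "Ada Lovelace", "age": 36, "email": "ada@example.com"}'
print(bool(PROFILE_RE.fullmatch(sample)))                    # True
print(bool(PROFILE_RE.fullmatch('{"name": "Ada", "age": "?"}')))  # False
```

This also makes the limits of regex constraints visible: they guarantee shape, not semantic validity (e.g. "email": "a@b" passes).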

Next Steps

  1. Benchmark your workload: Use your actual prompts/outputs, not synthetic tests
  2. Measure at target concurrency: 10 concurrent users ≠ 100 concurrent users
  3. Monitor GPU utilization: Aim for 70-90% (TensorRT-LLM typically higher)
  4. Test failover: What happens when a GPU fails? Load balancing strategy?
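Steps 2 and 3 reduce to a few aggregates per load-test run. A sketch computing throughput and TTFT percentiles from recorded per-request samples (the sample data below is made up):

```python
import statistics

def summarize(ttfts_s, tokens_generated, wall_clock_s):
    """Aggregate per-request TTFT samples and token counts into headline metrics."""
    return {
        "throughput_tok_s": sum(tokens_generated) / wall_clock_s,
        "ttft_p50_ms": statistics.median(ttfts_s) * 1000,
        "ttft_p95_ms": statistics.quantiles(ttfts_s, n=20)[-1] * 1000,
    }

# Made-up samples from a hypothetical 100-request load test:
ttfts = [0.35 + 0.01 * i for i in range(100)]   # 350 ms .. 1.34 s
tokens = [400] * 100
print(summarize(ttfts, tokens, wall_clock_s=10.0))
```

Report p95 (the last of the 20-quantile cut points), not the mean: tail latency is what users and SLOs actually feel.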


Key takeaways

  • Serving decisions are about throughput per GPU and tail latency (p95/p99), not averages.
  • vLLM optimizes iteration speed and broad model support; TensorRT-LLM maximizes performance with more engineering cost.
  • Measure tokens/sec, TTFT, p95 latency, GPU utilization, and stability under concurrency.
  • Production readiness requires observability: tracing, cost per request, and failure-mode runbooks.

30-day plan

  • Define SLOs (TTFT, p95) and target throughput for your workloads.
  • Deploy a baseline vLLM stack and instrument end-to-end.
  • Prototype TensorRT-LLM on the same model and compare apples-to-apples prompts.
  • Pick caching/batching/routing strategy, load test concurrency, and document runbooks.