vLLM vs TensorRT-LLM: Production Serving Guide

You've chosen on-premise LLM deployment. Now comes the critical decision: which serving engine? vLLM V1, TensorRT-LLM, or SGLang? This guide compares throughput, latency, and operational complexity across real-world workloads on H100 GPUs—with benchmarks you can trust.

The Serving Engine Landscape (2025)

Three engines dominate production LLM serving: vLLM (ease of use + high concurrency), TensorRT-LLM (maximum hardware efficiency), and SGLang (structured outputs + caching). Your choice determines throughput, latency, and operational overhead for the next 3 years.

vLLM V1: High-Concurrency Champion

Release: January 2025 (V1 alpha), default engine since v0.8.0

Key Features (2025):

  • Up to 1.7× higher throughput than V0, without relying on multi-step scheduling
  • FlashAttention 3 integration for state-of-the-art attention performance
  • Zero-overhead prefix caching: KV cache blocks for shared prompt prefixes are reused at negligible cost
  • Chunked prefill: breaks long prefills into chunks to reduce TTFT spikes (both shown in the sketch after this list)
  • 24% throughput improvement for generation-heavy workloads (v0.8.1 vs v0.7.3)
  • Multi-modal support: VLMs like LLaVA, Qwen2-VL with improved latency
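
The two cache-related features above can be exercised directly from Python. A minimal sketch, assuming vLLM ≥ 0.8 and the offline LLM API (the same options exist as flags on the OpenAI-compatible server shown in the quick start later); the model name and prompts are placeholders:

# Minimal sketch (assumes vLLM >= 0.8 and access to the named model):
# prefix caching reuses KV blocks for shared prompt prefixes, chunked prefill
# splits long prompts into smaller scheduling units to keep TTFT low.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,   # reuse KV cache across prompts sharing a prefix
    enable_chunked_prefill=True,  # interleave prefill chunks with decode steps
    max_model_len=8192,
)

shared_prefix = "You are a support assistant for ACME Corp. " * 20
params = SamplingParams(max_tokens=64, temperature=0.0)
outputs = llm.generate(
    [shared_prefix + "How do I reset my password?",
     shared_prefix + "How do I update my billing address?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)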

Best For:

  • Interactive applications requiring fast TTFT (time to first token)
  • High-concurrency deployments (50-100+ concurrent requests)
  • Rapid prototyping and experimentation (Python-native, easy setup)
  • Multi-modal workloads (vision-language models)

vLLM V1 Architecture Upgrade

V1 is a complete rewrite of the core scheduler with CPU-overhead reductions across the stack. The FlashAttention 3 integration closes the feature gaps of earlier attention backends while keeping state-of-the-art attention performance, which is critical for production deployments that need both speed and flexibility.

TensorRT-LLM: Hardware Efficiency Leader

Vendor: NVIDIA (optimized for H100, H200, B200 Blackwell)

Key Features (2025):

  • 2.72× faster TPOT (time per output token) on long-context workloads vs vLLM
  • Highest single-request throughput on H100 for Llama 3.1 8B
  • FP8/FP4 quantization with minimal accuracy loss (Blackwell-optimized)
  • In-flight batching: continuous batching without padding overhead
  • KV cache optimization: paged attention with aggressive memory reuse
  • Pipeline parallelism: efficient multi-GPU execution for large models

Best For:

  • Low-concurrency deployments (1-10 concurrent requests)
  • Maximum hardware utilization (squeezing every TFLOP from H100/B200)
  • Long-context workloads (32K+ tokens input, 2K+ output)
  • Batch processing pipelines (offline inference, evaluation)

TensorRT-LLM Trade-Off

TensorRT-LLM requires model compilation (engine build step) which can take 15-45 minutes depending on model size. This makes iteration slower than vLLM's dynamic approach. However, the runtime performance gains justify the upfront cost for stable production deployments.
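
For teams that want TensorRT-LLM's runtime without driving the build step by hand, recent releases also expose a high-level Python LLM API that compiles the engine on first use. A hedged sketch, assuming a TensorRT-LLM version that ships tensorrt_llm.LLM; the class and argument names follow its quick-start style and may differ across releases:

# Hedged sketch: the LLM API builds the TensorRT engine under the hood on
# first run, so expect the usual multi-minute compilation up front.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # HF id or local path
params = SamplingParams(max_tokens=100, temperature=0.0)

outputs = llm.generate(["Explain in-flight batching in two sentences."], params)
print(outputs[0].outputs[0].text)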

SGLang: Structured Output Specialist

Project: Open-source (LMSYS Org); joined the PyTorch Ecosystem in March 2025

Key Features (2025):

  • RadixAttention: prefix caching with shared prompt reuse across requests
  • Zero-overhead batch scheduler: overlaps CPU scheduling with GPU compute
  • Cache-aware load balancing: routes requests to workers with highest cache hit probability
  • Structured outputs: JSON schema enforcement, regex constraints
  • Multi-LoRA batching: serve multiple fine-tuned adapters simultaneously
  • DeepSeek V3/R1 day-one support with model-specific optimizations

Best For:

  • Applications requiring structured JSON outputs (agents, APIs)
  • Multi-tenant deployments serving multiple fine-tuned variants
  • Workloads with high prompt prefix overlap (chatbots, RAG)
  • DeepSeek models (V3, R1) with native optimizations

Production Scale: SGLang deployments collectively generate trillions of tokens per day (as of 2025).

Performance Benchmarks: H100 GPUs (2025)

All benchmarks conducted on 2× NVIDIA H100 80GB GPUs using production-realistic workloads. Models tested: Llama 3.1 8B, Llama 3.3 70B, and GPT-OSS-120B.

Throughput Comparison

Dataset: ShareGPT (mixed short/long prompts, realistic distribution)

Engine        | Llama 3.1 8B (tokens/sec) | Llama 3.3 70B (tokens/sec) | GPT-OSS-120B (tokens/sec)
TensorRT-LLM  | Highest                   | Competitive                | Lower at 100 concurrent req
vLLM V1       | Second highest            | Highest                    | 4,741 at 100 concurrent req
SGLang        | Competitive               | Moderate                   | Moderate

Key Insights:

  • vLLM dominates high-concurrency: At 100 concurrent requests, vLLM achieves 4,741 tokens/sec on GPT-OSS-120B (highest)
  • TensorRT-LLM wins on small models: Llama 3.1 8B sees its highest throughput with TensorRT-LLM on short sequences
  • Model size matters: vLLM's scheduler scales better for 70B+ models under load

Latency Analysis

Metrics: TTFT (time to first token) and TPOT (time per output token). A measurement sketch follows the table.

Engine        | TTFT                                       | TPOT
vLLM V1       | Fastest across all concurrency levels      | Competitive for short-to-medium outputs
TensorRT-LLM  | Slowest of the three in these tests        | 2.72× faster than vLLM on long outputs (>1K tokens)
SGLang        | Moderate (benefits from prefix-cache hits) | Consistent mid-tier performance
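
Both metrics are easy to measure against any of the three engines through their OpenAI-compatible endpoints. A minimal sketch using the openai Python client with streaming; it assumes a server at localhost:8000 as in the vLLM quick start below, so adjust base_url and model for your setup, and note that counting streamed chunks only approximates token counts:

# Measures TTFT (time to first streamed chunk) and TPOT (average time per
# subsequent chunk) against an OpenAI-compatible completions endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
n_tokens = 0

stream = client.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    prompt="Explain quantum computing in 3 sentences.",
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].text:
        n_tokens += 1  # approximation: one streamed chunk ~ one token
        if first_token_at is None:
            first_token_at = time.perf_counter()

end = time.perf_counter()
ttft = first_token_at - start
tpot = (end - first_token_at) / max(n_tokens - 1, 1)
print(f"TTFT: {ttft*1000:.0f} ms, TPOT: {tpot*1000:.1f} ms/token")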

Real-World Latency Impact

Interactive chatbot (300-500 token outputs):

  • vLLM V1: ~400ms TTFT, ~2.5s total → feels instant
  • TensorRT-LLM: ~800ms TTFT, ~2s total → noticeable delay

Document summarization (2K+ token outputs):

  • TensorRT-LLM: 2.72× faster generation → completes in 8s vs 22s
  • Users don't notice TTFT differences when generation already takes 8+ seconds

Concurrency Scaling

Test: GPT-OSS-120B on 2× H100, varying concurrent requests from 1 to 100

Low Concurrency (1-10 requests)

Winner: TensorRT-LLM

  • Highest per-request throughput (minimal batching overhead)
  • Best for single-user applications, evaluation pipelines
  • TPOT advantage shines on long-context tasks

Medium Concurrency (10-50 requests)

Winner: SGLang

  • Moderate throughput with consistent performance
  • Cache-aware load balancing reduces latency variance
  • Structured output features add value without overhead

High Concurrency (50-100+ requests)

Winner: vLLM V1

  • 4,741 tokens/sec at 100 concurrent requests (highest measured)
  • Fast TTFT maintains responsiveness under load
  • Excellent scaling characteristics for production APIs

Decision Matrix: Choosing Your Engine

Choose vLLM V1 if:

  • Interactive user-facing applications (chatbots, assistants, APIs)
  • Expected concurrency >20 simultaneous requests
  • Rapid iteration and experimentation (no compilation step)
  • Multi-modal models (vision-language, audio)
  • Team lacks deep CUDA/TensorRT expertise
  • Need fastest time-to-first-token (TTFT) for responsiveness

Choose TensorRT-LLM if:

  • Low concurrency (1-10 requests) with maximum hardware efficiency required
  • Long-context workloads (32K+ input, 2K+ output tokens)
  • Batch processing pipelines (offline inference, evaluation)
  • Stable model + deployment (compilation time acceptable)
  • NVIDIA H100/H200/B200 hardware (vendor-optimized)
  • Need FP8/FP4 quantization with minimal accuracy loss

Choose SGLang if:

  • Applications require structured JSON outputs (agents, tool calling)
  • High prompt prefix overlap (chatbots with system prompts, RAG)
  • Multi-tenant deployment serving multiple LoRA adapters
  • Using DeepSeek V3 or DeepSeek R1 models
  • Need consistent mid-tier performance without tuning

Deployment Quick Start

vLLM V1 Installation

# Install vLLM with V1 engine (default since 0.8.0)
pip install "vllm>=0.8.1"

# Serve Llama 3.1 70B on 2× H100
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enable-prefix-caching

# Test endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain quantum computing in 3 sentences.",
    "max_tokens": 100
  }'

TensorRT-LLM Build & Serve

# Clone TensorRT-LLM
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# Build engine (one-time, ~30 min for 70B)
python examples/llama/convert_checkpoint.py \
  --model_dir ./llama-3.1-70b-instruct \
  --output_dir ./llama-3.1-70b-trt-ckpt \
  --dtype float16 \
  --tp_size 2

trtllm-build \
  --checkpoint_dir ./llama-3.1-70b-trt-ckpt \
  --output_dir ./llama-3.1-70b-trt-engine \
  --gemm_plugin float16 \
  --max_batch_size 256

# Serve with Triton. Note: launch_triton_server.py ships with the separate
# triton-inference-server/tensorrtllm_backend repo, and --model_repo expects a
# Triton model repository whose configs reference the engine directory
# (./triton_model_repo below is a placeholder for that repository)
python3 scripts/launch_triton_server.py \
  --model_repo=./triton_model_repo \
  --tensorrt_llm_model_name=llama-3.1-70b
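
A minimal smoke test of the Triton endpoint, assuming the standard tensorrtllm_backend model repository that exposes an ensemble model on Triton's default HTTP port 8000; the model name and the text_input / max_tokens / text_output fields follow that repository's example configs and may differ in a custom setup:

# Hedged smoke-test sketch for the Triton generate endpoint; assumes the
# tensorrtllm_backend "ensemble" model and its default request/response fields.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={
        "text_input": "Explain quantum computing in 3 sentences.",
        "max_tokens": 100,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text_output"])  # field name per the example ensemble config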

SGLang Deployment

# Install SGLang
pip install "sglang[all]"

# Serve with RadixAttention and structured outputs
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 \
  --mem-fraction-static 0.85 \
  --enable-torch-compile

# Structured output example (regex-constrained JSON generation)
from sglang import function, gen, set_default_backend, RuntimeEndpoint

@function
def generate_user(s):
    s += "Generate a user profile:\n"
    # Constrained decoding: the regex forces output matching the expected JSON shape
    s += gen("profile", max_tokens=200,
             regex=r'\{"name": ".+", "age": \d+, "email": ".+@.+"\}')

# Point the frontend at the server launched above (default port 30000)
set_default_backend(RuntimeEndpoint("http://localhost:30000"))
state = generate_user.run()
print(state["profile"])

Next Steps

  1. Benchmark your workload: Use your actual prompts/outputs, not synthetic tests (a minimal benchmark sketch follows this list)
  2. Measure at target concurrency: 10 concurrent users ≠ 100 concurrent users
  3. Monitor GPU utilization: Aim for 70-90% (TensorRT-LLM typically higher)
  4. Test failover: What happens when a GPU fails? Load balancing strategy?
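
As a starting point for steps 1 and 2, the sketch below fires N concurrent requests at an OpenAI-compatible endpoint and reports aggregate output tokens per second. It assumes the vLLM server from the quick start (localhost:8000); swap base_url, model, and the placeholder prompts for your own workload:

# Rough concurrency benchmark: aggregate generated tokens per second at a
# given concurrency level against an OpenAI-compatible server.
import asyncio
import time
from openai import AsyncOpenAI

CONCURRENCY = 32
MODEL = "meta-llama/Llama-3.1-70B-Instruct"
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> int:
    resp = await client.completions.create(
        model=MODEL, prompt=prompt, max_tokens=256, temperature=0.7
    )
    return resp.usage.completion_tokens  # tokens actually generated

async def main() -> None:
    prompts = [f"Write a short product description #{i}." for i in range(CONCURRENCY)]
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"{sum(token_counts) / elapsed:.0f} output tokens/sec "
          f"at concurrency {CONCURRENCY}")

asyncio.run(main())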
