Executive Summary

  • Migrated from the OpenAI Embeddings API to self-hosted Llama 3.3 70B for a 100M-product catalog
  • CTR improved from 12% (rule-based) to 35% with semantic embeddings + hybrid search
  • p95 latency: 88ms (embeddings retrieval 15ms + ranking 50ms + LLM rerank 23ms)
  • Cost per 1K recommendations: from €0.15 (OpenAI) to €0.012 (self-hosted) = 92% reduction
  • Production deployment on Kubernetes with vLLM + Milvus + Redis cache

Before / After

| Metric                           | Before | After  | Improvement |
|----------------------------------|--------|--------|-------------|
| CTR (Click-Through Rate)         | 12%    | 35%    | +192%       |
| p95 latency                      | 850ms  | 88ms   | -90%        |
| Cost per 1K recommendations      | €0.15  | €0.012 | -92%        |
| Recommendation quality (NDCG@10) | 0.62   | 0.84   | +35%        |
| Average order value (AOV)        | €45    | €61    | +36%        |

Measured over 3 months post-deployment with A/B testing.

Timeline

W1-2: Assessment & Benchmarking
  Baseline evaluation, embedding model selection, infrastructure sizing.
  Deliverable: Architecture plan, GPU requirements (2× H100), baseline NDCG 0.62.

W3-5: Pilot Deployment
  Milvus vector DB setup, vLLM deployment, embedding 100M products, Redis cache layer.
  Deliverable: Staging environment, 5% traffic shadow test, latency <100ms verified.

W6-8: Production Rollout
  A/B testing, gradual traffic ramp (10% → 50% → 100%), monitoring & alerting.
  Deliverable: Full production deployment, CTR +35% confirmed, runbooks.

Decisions & Trade-offs

Embedding Model

Choice: Self-hosted Llama 3.3 70B fine-tuned on product catalog
Alternatives: OpenAI text-embedding-3-large, SentenceTransformers
Why: Domain-specific fine-tuning improved NDCG by 18%, zero API costs
Risks: Model drift, retraining overhead
KPI Impact: +18% NDCG, -92% cost

Vector Database

Choice: Milvus with GPU indexing (IVF_FLAT)
Alternatives: Pinecone (managed), Qdrant, Weaviate
Why: 100M vectors, GPU acceleration, <20ms p95 retrieval
Risks: Index rebuild time (2hrs), memory pressure
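The production system relies on Milvus's GPU index, but the trade-off behind IVF_FLAT and the nprobe=128 setting can be illustrated with a plain-Python toy: vectors are bucketed under the nearest centroid at build time, and a query only scans the nprobe closest buckets exhaustively (all names and the tiny 2-D data here are illustrative, not from the deployment):

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (the inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=2, top_k=3):
    """IVF_FLAT-style search: exhaustively scan only the nprobe closest lists."""
    probed = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:top_k]
```

Raising nprobe widens the scanned fraction of the index (better recall, higher latency); nprobe equal to the number of lists degenerates to a brute-force flat scan.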

Serving Infrastructure

Choice: vLLM with continuous batching on 2× H100 80GB
Why: 480 tokens/sec throughput, PagedAttention for 5M requests/day
Risks: OOM at peak traffic (mitigated with autoscaling + Redis cache)
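A back-of-envelope check shows why the cache makes the 5M requests/day load tractable on two GPUs; the request volume and 60% hit rate are from this report, while the peak-to-average factor is an assumed illustrative value:

```python
REQUESTS_PER_DAY = 5_000_000   # from the deployment figures above
CACHE_HIT_RATE = 0.60          # Redis layer absorbs these requests entirely
PEAK_FACTOR = 3                # assumed peak-to-average ratio (not measured here)

avg_rps = REQUESTS_PER_DAY / 86_400            # average requests per second
gpu_rps = avg_rps * (1 - CACHE_HIT_RATE)       # only cache misses reach vLLM
peak_gpu_rps = gpu_rps * PEAK_FACTOR           # sizing target for autoscaling
```

Roughly 58 rps on average drops to about 23 rps of GPU work after caching, which is the load the HPA range (2-8 replicas) has to absorb at peak.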

Stack & Architecture

Models

  • Llama 3.3 70B FP8 (fine-tuned embeddings)
  • sentence-transformers/all-MiniLM-L6-v2 (fallback)


Serving

  • vLLM 0.6.3 with continuous batching
  • 2× NVIDIA H100 80GB (tensor parallelism)
  • Kubernetes with HPA (2-8 replicas)

Vector Database

  • Milvus 2.4 with GPU indexing
  • 100M vectors, 768D embeddings
  • IVF_FLAT index, nprobe=128

Caching & Storage

  • Redis 7.x (60% cache hit rate)
  • PostgreSQL (product metadata)
  • S3 (model checkpoints, backups)
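The Redis layer sits in front of the embedding service as a cache-aside lookup. A minimal sketch, with an in-memory dict standing in for Redis and a caller-supplied compute function standing in for the vLLM embedding call (both are illustrative stand-ins, not the production API):

```python
import hashlib

class EmbeddingCache:
    """Cache-aside: check the cache first, compute and backfill on a miss."""

    def __init__(self, compute_fn, store=None, ttl_s=3600):
        self.compute_fn = compute_fn   # stand-in for the vLLM embedding endpoint
        self.store = store if store is not None else {}
        self.ttl_s = ttl_s             # unused by the dict stand-in; Redis would use SETEX
        self.hits = 0
        self.misses = 0

    def _key(self, text):
        # Hash the text so arbitrary product descriptions make safe keys.
        return "emb:" + hashlib.sha256(text.encode()).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self.store:          # Redis equivalent: store.get(key)
            self.hits += 1
            return self.store[key]
        self.misses += 1
        vec = self.compute_fn(text)
        self.store[key] = vec          # Redis equivalent: store.setex(key, ttl_s, payload)
        return vec
```

At the reported 60% hit rate, only 40% of lookups fall through to `compute_fn`, which is the 2.5× load reduction noted in the lessons learned.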

Monitoring

  • Prometheus + Grafana
  • OpenTelemetry distributed tracing
  • Custom CTR/NDCG dashboards
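The NDCG@10 figure tracked on those dashboards is a standard ranking metric; a minimal self-contained implementation (graded relevance labels per ranked position, ideal ordering as the normalizer) looks like this:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the top-k ranking divided by DCG of the ideal ordering."""
    top = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered result list scores 1.0; burying relevant items below irrelevant ones pulls the score toward 0, which is how the 0.62 → 0.84 improvement is measured.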

SLO & KPI

p95 Latency < 100ms

✓ Achieved: 88ms

Recommendation Quality (NDCG@10) >= 0.80

✓ Achieved: 0.84

CTR Improvement >= 25%

✓ Achieved: +192% (35% vs 12% baseline)

System Availability >= 99.9%

✓ Achieved: 99.95% (3 months)

ROI & Unit Economics

TCO Breakdown (3 years):
  • Capex: 2× H100 80GB at €64K each = €128K (amortized over 36 months ≈ €3.5K/month)
  • Opex: Power (€450/month) + Staff (0.1 FTE = €800/month) + Infrastructure (€300/month) ≈ €1.5K/month
  • Total Monthly Cost: ≈ €5K/month = €180K/3yr
  • Cloud API Alternative: 5M requests/day × 30 days × €0.15/1K = €22.5K/month = €810K/3yr
  • Savings: €810K - €180K = €630K over 3 years
  • Breakeven: 8 months
Business Impact:
  • CTR Increase: 12% → 35% = +23 percentage points
  • Revenue Impact: +36% AOV (€45 → €61) combined with the higher CTR (35% of 5M requests/day) at ~3% conversion = +€3.2M annual revenue
  • ROI: (€3.2M annual revenue + €630K 3-year savings) / €180K 3-year cost ≈ 21× return
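The TCO figures can be reproduced directly from the inputs stated above; the exact values round to the €5K/month, €180K, €630K, and 8-month numbers quoted:

```python
# All inputs are the figures stated in the TCO breakdown above.
capex = 2 * 64_000                             # two H100 80GB at €64K each
capex_monthly = capex / 36                     # straight-line over 36 months
opex_monthly = 450 + 800 + 300                 # power + 0.1 FTE staff + infrastructure
total_monthly = capex_monthly + opex_monthly   # ≈ €5.1K, quoted as €5K/month

self_hosted_3yr = total_monthly * 36           # ≈ €184K, quoted as €180K
api_monthly = 5_000_000 * 30 / 1_000 * 0.15    # 150M requests/month at €0.15 per 1K
api_3yr = api_monthly * 36                     # €810K

savings_3yr = api_3yr - self_hosted_3yr        # ≈ €626K, quoted as €630K
breakeven_months = capex / (api_monthly - total_monthly)  # ≈ 7.4, quoted as 8 months
```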

Risks & Mitigations

Risk: Model drift (product catalog changes) → Mitigation: Weekly incremental retraining, monitoring NDCG degradation
Risk: Peak traffic OOM (Black Friday) → Mitigation: Kubernetes HPA (2-8 replicas), Redis cache (60% hit rate)
Risk: Cold start latency (model loading 45s) → Mitigation: Always-on warm pool, graceful shutdown
Risk: Index rebuild downtime → Mitigation: Blue-green deployment, shadow traffic validation

Lessons learned

  • Redis cache is critical: 60% hit rate reduced vLLM load by 2.5×, preventing OOM at peak hours
  • Fine-tuning pays off: Domain-specific embeddings improved NDCG by 18% vs generic SentenceTransformers
  • A/B testing revealed surprises: Hybrid search (keyword + vector) outperformed pure semantic search by 12% for branded queries
  • Monitor business metrics: NDCG is academic—CTR and AOV are what matter. Track both technical and business KPIs
  • Gradual rollout avoided disaster: 10% traffic shadow test revealed memory leak that would have crashed production
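The hybrid-search lesson above comes down to score fusion: blending a lexical match score with vector similarity so exact brand terms are not lost to purely semantic neighbors. A toy sketch, where the term-overlap scorer stands in for BM25 and the 0.7 weight is an assumed value, not the tuned production setting:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    """Fraction of query terms present in the doc (a stand-in for BM25)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.7):
    """Blend scores: alpha weights the semantic score, 1 - alpha the lexical score."""
    scored = []
    for text, vec in docs:
        s = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        scored.append((s, text))
    return [text for _, text in sorted(scored, reverse=True)]
```

With alpha = 1.0 this degenerates to pure semantic search; lowering alpha lets an exact brand match outrank a semantically closer but lexically unrelated product, which is the behavior seen on branded queries.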

Testimonials

"We went from spending €270K/year on OpenAI APIs to €60K for self-hosted infrastructure. The recommendations are better, faster, and we own the entire stack."

— VP of Engineering, E-Commerce Platform

"The 35% CTR increase translated to €3.2M in additional annual revenue. This project paid for itself in 2 months."

— Head of Product

Transform Your Recommendation Engine

Self-hosted LLMs can reduce costs by 90% while improving quality. Let's build your solution.