Executive Summary
- Migrated from OpenAI Embeddings API to self-hosted Llama 3.3 70B for 100M product catalog
- CTR improved from 12% (rule-based) to 35% with semantic embeddings + hybrid search
- p95 latency: 88ms (embeddings retrieval 15ms + ranking 50ms + LLM rerank 23ms)
- Cost per 1K recommendations: from €0.15 (OpenAI) to €0.012 (self-hosted) = 92% reduction
- Production deployment on Kubernetes with vLLM + Milvus + Redis cache
Before / After
Measured over 3 months post-deployment with A/B testing
Timeline
1. Assessment & Benchmarking: baseline evaluation, embedding model selection, infrastructure sizing
2. Pilot Deployment: Milvus vector DB setup, vLLM deployment, embedding 100M products, Redis cache layer
3. Production Rollout: A/B testing, gradual traffic ramp (10% → 50% → 100%), monitoring & alerting
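The gradual ramp is easiest to keep consistent with deterministic hash bucketing, so each user stays in the same arm as the percentage grows. A minimal sketch, assuming this approach (function and salt names are illustrative, not from the production system):

```python
import hashlib

def rollout_bucket(user_id: str, ramp_pct: int, salt: str = "rec-v2") -> bool:
    """Return True if this user should be routed to the new stack.

    Hashing (salt, user_id) gives a stable bucket in [0, 100), so a user
    admitted at the 10% stage remains admitted at 50% and 100% -- the
    cohorts are nested and never reshuffled during the ramp.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < ramp_pct

# Ramp stages from the rollout: 10% -> 50% -> 100%.
for stage in (10, 50, 100):
    routed = sum(rollout_bucket(f"user-{i}", stage) for i in range(10_000))
    # routed lands close to stage% of the 10,000 simulated users
```

Because assignment is a pure function of the user ID, no state store is needed to remember who saw which arm, and shadow-test cohorts can be replayed exactly.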
Decisions & Trade-offs
- Embedding Model: fine-tuned Llama 3.3 70B FP8, with all-MiniLM-L6-v2 as a fallback
- Vector Database: Milvus 2.4 with GPU indexing (IVF_FLAT over 100M vectors)
- Serving Infrastructure: vLLM on 2× H100 80GB, autoscaled via Kubernetes HPA
Stack & Architecture
Models
- Llama 3.3 70B FP8 (fine-tuned embeddings)
- SentenceTransformers/all-MiniLM-L6-v2 (fallback)
Serving
- vLLM 0.6.3 with continuous batching
- 2× NVIDIA H100 80GB (tensor parallelism)
- Kubernetes with HPA (2-8 replicas)
Vector Database
- Milvus 2.4 with GPU indexing
- 100M vectors, 768D embeddings
- IVF_FLAT index, nprobe=128
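In pymilvus terms, the index and search configuration above corresponds roughly to the parameter dicts below. `nlist` and the metric type are assumptions; the case study only states the index type and nprobe=128:

```python
# Index built once over the collection (IVF_FLAT computes exact distances
# within each probed cluster; nlist here is an assumed value).
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",          # inner product on normalized embeddings
    "params": {"nlist": 4096},
}

# Per-query search params: probe 128 of the nlist clusters. Higher nprobe
# raises recall at the cost of latency, so it is the main tuning knob
# against the 15ms retrieval budget.
search_params = {
    "metric_type": "IP",
    "params": {"nprobe": 128},
}
```

These dicts are passed to `Collection.create_index(...)` and `Collection.search(...)` respectively in pymilvus 2.x.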
Caching & Storage
- Redis 7.x (60% cache hit rate)
- PostgreSQL (product metadata)
- S3 (model checkpoints, backups)
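The caching layer follows the standard cache-aside pattern. A toy sketch with a dict standing in for Redis (in production the same logic would run against redis-py GET/SETEX); it also shows why a 60% hit rate cuts embedding-service load by 2.5×, since only the remaining 40% of requests reach vLLM:

```python
import hashlib

class EmbeddingCache:
    """Cache-aside layer in front of an expensive embedding call."""

    def __init__(self, compute_fn):
        self._store = {}            # stand-in for Redis
        self._compute = compute_fn  # expensive call (vLLM in production)
        self.hits = 0
        self.misses = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vec = self._compute(text)
        self._store[key] = vec      # production: SETEX with a TTL
        return vec

cache = EmbeddingCache(lambda t: [float(len(t))])  # toy embedding fn
cache.get("red shoes"); cache.get("red shoes"); cache.get("blue bag")
# hits=1, misses=2 after the three calls above
```

At a 60% hit rate, backend load is 1 / (1 − 0.6) = 2.5× lower than without the cache, which is the headroom that prevented the peak-hour OOMs noted in the lessons learned.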
Monitoring
- Prometheus + Grafana
- OpenTelemetry distributed tracing
- Custom CTR/NDCG dashboards
SLO & KPI
- p95 Latency < 100 ms
- Recommendation Quality (NDCG@10) ≥ 0.80
- CTR Improvement ≥ 25%
- System Availability ≥ 99.9%
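The NDCG@10 target is cheap to track per request. A standard implementation sketch (log2 position discount over graded relevance labels, normalized by the ideal ordering):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list of graded relevance labels.

    `relevances` is in the model's ranking order; the ideal ranking is
    the same labels sorted descending. Returns a value in [0, 1].
    """
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0]))   # 1.0 (perfect ranking)
print(ndcg_at_k([0, 1, 2, 3]))   # < 1.0 (relevant items ranked last)
```

Aggregating this per-query score feeds the custom NDCG dashboards listed under Monitoring.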
ROI & Unit Economics
- Capex: 2× H100 80GB @ €64K = €128K (amortized over 36 months = €3.5K/month)
- Opex: Power (€450/month) + Staff (0.1 FTE = €800/month) + Infrastructure (€300/month) = €1.55K/month
- Total Monthly Cost: ≈€5K/month ≈ €180K/3yr
- Cloud API Alternative: 5M requests/day × 30 × €0.15/1K = €22.5K/month = €810K/3yr
- Savings: €810K - €180K = €630K over 3 years
- Breakeven: 8 months
- CTR Increase: 12% → 35% = +23 percentage points (≈2.9× the click volume)
- Revenue Impact: +36% AOV (€45 → €61) × 5M recommendations/day × 0.35 CTR × 3% conversion ≈ +€3.2M annual revenue
- ROI: (€3.2M revenue + €630K savings) / €180K investment = 21× return
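The cost side of the unit economics can be reproduced directly from the bullets above. Note the unrounded figures land slightly under the quoted numbers (≈€626K savings over 3 years and a ≈7.4-month breakeven, quoted as €630K and 8 months):

```python
# Reproducing the unit economics above (all figures in EUR).
capex = 2 * 64_000                          # 2x H100 80GB
amort_monthly = capex / 36                  # ~3.56K/month over 36 months
opex_monthly = 450 + 800 + 300              # power + 0.1 FTE + infra
self_hosted_monthly = amort_monthly + opex_monthly

api_monthly = 5_000_000 / 1_000 * 0.15 * 30  # 5M req/day at EUR 0.15/1K
savings_3yr = (api_monthly - self_hosted_monthly) * 36
breakeven_months = capex / (api_monthly - self_hosted_monthly)

print(round(self_hosted_monthly))   # 5106  (quoted as ~5K/month)
print(round(savings_3yr))           # 626200 (quoted as ~630K)
print(round(breakeven_months, 1))   # 7.4   (quoted as 8 months)
```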
Risks & Mitigations
Lessons learned
- Redis cache is critical: 60% hit rate reduced vLLM load by 2.5×, preventing OOM at peak hours
- Fine-tuning pays off: Domain-specific embeddings improved NDCG by 18% vs generic SentenceTransformers
- A/B testing revealed surprises: Hybrid search (keyword + vector) outperformed pure semantic search by 12% for branded queries
- Monitor business metrics: NDCG is academic—CTR and AOV are what matter. Track both technical and business KPIs
- Gradual rollout avoided disaster: 10% traffic shadow test revealed memory leak that would have crashed production
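The hybrid-search lesson above can be sketched as a weighted fusion of a keyword score and a vector score. The keyword function below is a crude stand-in for BM25, and `alpha` is an assumed blend weight, not the production value:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, title: str) -> float:
    """Fraction of query tokens present in the title; a BM25 stand-in."""
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_score(query, title, q_vec, d_vec, alpha=0.7):
    """Blend semantic and keyword relevance.

    The keyword term is what rescues branded queries: an exact brand
    token match still scores even when embeddings place items far apart.
    """
    return alpha * cosine(q_vec, d_vec) + (1 - alpha) * keyword_score(query, title)
```

In production the keyword side would come from the existing search engine (e.g. BM25 scores) and the vector side from Milvus distances, each normalized to a common scale before blending.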
Testimonials
"We went from spending €270K/year on OpenAI APIs to €60K for self-hosted infrastructure. The recommendations are better, faster, and we own the entire stack."
— VP of Engineering, E-Commerce Platform
"The 35% CTR increase translated to €3.2M in additional annual revenue. This project paid for itself in 2 months."
— Head of Product
Transform Your Recommendation Engine
Self-hosted LLMs can reduce costs by 90% while improving quality. Let's build your solution.