Executive Summary

  • Migrated from the OpenAI Embeddings API to self-hosted Llama 3.3 70B for a 100M-product catalog
  • CTR improved from 12% (rule-based) to 35% with semantic embeddings + hybrid search
  • p95 latency: 88ms (embeddings retrieval 15ms + ranking 50ms + LLM rerank 23ms)
  • Cost per 1K recommendations: from €0.15 (OpenAI) to €0.012 (self-hosted) = 92% reduction
  • Production deployment on Kubernetes with vLLM + Milvus + Redis cache

Before / After

| Metric                           | Before | After  | Improvement |
|----------------------------------|--------|--------|-------------|
| CTR (Click-Through Rate)         | 12%    | 35%    | +192%       |
| p95 latency                      | 850ms  | 88ms   | -90%        |
| Cost per 1K recommendations      | €0.15  | €0.012 | -92%        |
| Recommendation quality (NDCG@10) | 0.62   | 0.84   | +35%        |
| Average order value (AOV)        | €45    | €61    | +36%        |

Measured over 3 months post-deployment with A/B testing.

Timeline

W1-2: Assessment & Benchmarking
  Baseline evaluation, embedding model selection, infrastructure sizing.
  Deliverable: Architecture plan, GPU requirements (2× H100), baseline NDCG 0.62.

W3-5: Pilot Deployment
  Milvus vector DB setup, vLLM deployment, embedding 100M products, Redis cache layer.
  Deliverable: Staging environment, 5% traffic shadow test, latency <100ms verified.

W6-8: Production Rollout
  A/B testing, gradual traffic ramp (10% → 50% → 100%), monitoring & alerting.
  Deliverable: Full production deployment, CTR +35% confirmed, runbooks.

Decisions & Trade-offs

Embedding Model

Choice: Self-hosted Llama 3.3 70B fine-tuned on product catalog
Alternatives: OpenAI text-embedding-3-large, SentenceTransformers
Why: Domain-specific fine-tuning improved NDCG by 18%, zero API costs
Risks: Model drift, retraining overhead
KPI Impact: +18% NDCG, -92% cost

Vector Database

Choice: Milvus with GPU indexing (IVF_FLAT)
Alternatives: Pinecone (managed), Qdrant, Weaviate
Why: 100M vectors, GPU acceleration, <20ms p95 retrieval
Risks: Index rebuild time (2hrs), memory pressure
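The production system relies on Milvus's GPU index, but the trade-off behind IVF_FLAT and the nprobe=128 setting can be illustrated with a plain-Python toy: vectors are bucketed under the nearest centroid at build time, and a query only scans the nprobe closest buckets exhaustively (all names and the tiny 2-D data here are illustrative, not from the deployment):

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (the inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=2, top_k=3):
    """IVF_FLAT-style search: exhaustively scan only the nprobe closest lists."""
    probed = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:top_k]
```

Raising nprobe widens the scanned fraction of the index (better recall, higher latency); nprobe equal to the number of lists degenerates to a brute-force flat scan.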

Serving Infrastructure

Choice: vLLM with continuous batching on 2× H100 80GB
Why: 480 tokens/sec throughput, PagedAttention for 5M requests/day
Risks: OOM at peak traffic (mitigated with autoscaling + Redis cache)
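A back-of-envelope check shows why the cache makes the 5M requests/day load tractable on two GPUs; the request volume and 60% hit rate are from this report, while the peak-to-average factor is an assumed illustrative value:

```python
REQUESTS_PER_DAY = 5_000_000   # from the deployment figures above
CACHE_HIT_RATE = 0.60          # Redis layer absorbs these requests entirely
PEAK_FACTOR = 3                # assumed peak-to-average ratio (not measured here)

avg_rps = REQUESTS_PER_DAY / 86_400            # average requests per second
gpu_rps = avg_rps * (1 - CACHE_HIT_RATE)       # only cache misses reach vLLM
peak_gpu_rps = gpu_rps * PEAK_FACTOR           # sizing target for autoscaling
```

Roughly 58 rps on average drops to about 23 rps of GPU work after caching, which is the load the HPA range (2-8 replicas) has to absorb at peak.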

Stack & Architecture

Models

  • Llama 3.3 70B FP8 (fine-tuned embeddings)
  • sentence-transformers/all-MiniLM-L6-v2 (fallback)


Serving

  • vLLM 0.6.3 with continuous batching
  • 2× NVIDIA H100 80GB (tensor parallelism)
  • Kubernetes with HPA (2-8 replicas)

Vector Database

  • Milvus 2.4 with GPU indexing
  • 100M vectors, 768D embeddings
  • IVF_FLAT index, nprobe=128

Caching & Storage

  • Redis 7.x (60% cache hit rate)
  • PostgreSQL (product metadata)
  • S3 (model checkpoints, backups)
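The Redis layer sits in front of the embedding service as a cache-aside lookup. A minimal sketch, with an in-memory dict standing in for Redis and a caller-supplied compute function standing in for the vLLM embedding call (both are illustrative stand-ins, not the production API):

```python
import hashlib

class EmbeddingCache:
    """Cache-aside: check the cache first, compute and backfill on a miss."""

    def __init__(self, compute_fn, store=None, ttl_s=3600):
        self.compute_fn = compute_fn   # stand-in for the vLLM embedding endpoint
        self.store = store if store is not None else {}
        self.ttl_s = ttl_s             # unused by the dict stand-in; Redis would use SETEX
        self.hits = 0
        self.misses = 0

    def _key(self, text):
        # Hash the text so arbitrary product descriptions make safe keys.
        return "emb:" + hashlib.sha256(text.encode()).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self.store:          # Redis equivalent: store.get(key)
            self.hits += 1
            return self.store[key]
        self.misses += 1
        vec = self.compute_fn(text)
        self.store[key] = vec          # Redis equivalent: store.setex(key, ttl_s, payload)
        return vec
```

At the reported 60% hit rate, only 40% of lookups fall through to `compute_fn`, which is the 2.5× load reduction noted in the lessons learned.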

Monitoring

  • Prometheus + Grafana
  • OpenTelemetry distributed tracing
  • Custom CTR/NDCG dashboards
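The NDCG@10 figure tracked on those dashboards is a standard ranking metric; a minimal self-contained implementation (graded relevance labels per ranked position, ideal ordering as the normalizer) looks like this:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the top-k ranking divided by DCG of the ideal ordering."""
    top = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered result list scores 1.0; burying relevant items below irrelevant ones pulls the score toward 0, which is how the 0.62 → 0.84 improvement is measured.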

SLO & KPI

p95 Latency < 100ms

✓ Achieved: 88ms

Recommendation Quality (NDCG@10) >= 0.80

✓ Achieved: 0.84

CTR Improvement >= 25%

✓ Achieved: +192% (35% vs 12% baseline)

System Availability >= 99.9%

✓ Achieved: 99.95% (3 months)

ROI & Unit Economics

TCO Breakdown (3 years):
  • Capex: 2× H100 80GB at €64K each = €128K (amortized over 36 months ≈ €3.5K/month)
  • Opex: Power (€450/month) + Staff (0.1 FTE = €800/month) + Infrastructure (€300/month) ≈ €1.5K/month
  • Total Monthly Cost: ≈ €5K/month = €180K/3yr
  • Cloud API Alternative: 5M requests/day × 30 days × €0.15/1K = €22.5K/month = €810K/3yr
  • Savings: €810K - €180K = €630K over 3 years
  • Breakeven: 8 months
Business Impact:
  • CTR Increase: 12% → 35% = +23 percentage points
  • Revenue Impact: +36% AOV (€45 → €61) combined with the higher CTR (35% of 5M requests/day) at ~3% conversion = +€3.2M annual revenue
  • ROI: (€3.2M annual revenue + €630K 3-year savings) / €180K 3-year cost ≈ 21× return
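The TCO figures can be reproduced directly from the inputs stated above; the exact values round to the €5K/month, €180K, €630K, and 8-month numbers quoted:

```python
# All inputs are the figures stated in the TCO breakdown above.
capex = 2 * 64_000                             # two H100 80GB at €64K each
capex_monthly = capex / 36                     # straight-line over 36 months
opex_monthly = 450 + 800 + 300                 # power + 0.1 FTE staff + infrastructure
total_monthly = capex_monthly + opex_monthly   # ≈ €5.1K, quoted as €5K/month

self_hosted_3yr = total_monthly * 36           # ≈ €184K, quoted as €180K
api_monthly = 5_000_000 * 30 / 1_000 * 0.15    # 150M requests/month at €0.15 per 1K
api_3yr = api_monthly * 36                     # €810K

savings_3yr = api_3yr - self_hosted_3yr        # ≈ €626K, quoted as €630K
breakeven_months = capex / (api_monthly - total_monthly)  # ≈ 7.4, quoted as 8 months
```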

Risks & Mitigations

Risk: Model drift (product catalog changes) → Mitigation: Weekly incremental retraining, monitoring NDCG degradation
Risk: Peak traffic OOM (Black Friday) → Mitigation: Kubernetes HPA (2-8 replicas), Redis cache (60% hit rate)
Risk: Cold start latency (model loading 45s) → Mitigation: Always-on warm pool, graceful shutdown
Risk: Index rebuild downtime → Mitigation: Blue-green deployment, shadow traffic validation

Lessons learned

  • Redis cache is critical: 60% hit rate reduced vLLM load by 2.5×, preventing OOM at peak hours
  • Fine-tuning pays off: Domain-specific embeddings improved NDCG by 18% vs generic SentenceTransformers
  • A/B testing revealed surprises: Hybrid search (keyword + vector) outperformed pure semantic search by 12% for branded queries
  • Monitor business metrics: NDCG is academic—CTR and AOV are what matter. Track both technical and business KPIs
  • Gradual rollout avoided disaster: 10% traffic shadow test revealed memory leak that would have crashed production
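The hybrid-search lesson above comes down to score fusion: blending a lexical match score with vector similarity so exact brand terms are not lost to purely semantic neighbors. A toy sketch, where the term-overlap scorer stands in for BM25 and the 0.7 weight is an assumed value, not the tuned production setting:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    """Fraction of query terms present in the doc (a stand-in for BM25)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.7):
    """Blend scores: alpha weights the semantic score, 1 - alpha the lexical score."""
    scored = []
    for text, vec in docs:
        s = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        scored.append((s, text))
    return [text for _, text in sorted(scored, reverse=True)]
```

With alpha = 1.0 this degenerates to pure semantic search; lowering alpha lets an exact brand match outrank a semantically closer but lexically unrelated product, which is the behavior seen on branded queries.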

Testimonials

"We went from spending €270K/year on OpenAI APIs to €60K for self-hosted infrastructure. The recommendations are better, faster, and we own the entire stack."

— VP of Engineering, E-Commerce Platform

"The 35% CTR increase translated to €3.2M in additional annual revenue. This project paid for itself in 2 months."

— Head of Product

Transform Your Recommendation Engine

Self-hosted LLMs can reduce costs by 90% while improving quality. Let's build your solution.