Observability Stack for LLMs: What to Track and Why

LLM systems fail in new ways: retrieval drift, prompt injection, silent quality regressions, and runaway cost. Observability is how you turn “AI magic” into an engineered system with SLOs and auditability.

The four signals: metrics, logs, traces, eval telemetry

  • Metrics: fast aggregates for latency, throughput, errors, cost.
  • Logs: structured events for audit trails and incident response.
  • Traces: distributed view across tools, retrieval, model, and external calls.
  • Eval telemetry: quality signals (groundedness proxies, user feedback, regressions).
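
The four signals become useful when they share a correlation key. As a minimal sketch (field names are illustrative, not a standard schema), one structured event can carry the trace ID that ties metrics, logs, traces, and eval telemetry together:

```python
import json
import time
import uuid

def make_llm_event(model, prompt_template_id, latency_ms,
                   tokens_in, tokens_out, user_rating=None):
    """Build one structured event linking all four signals
    through a shared trace_id. Field names are illustrative."""
    return {
        "trace_id": uuid.uuid4().hex,           # joins logs and traces
        "ts": time.time(),
        "model": model,                          # metric dimension
        "prompt_template_id": prompt_template_id,
        "latency_ms": latency_ms,                # metrics signal
        "tokens": {"in": tokens_in, "out": tokens_out},
        "user_rating": user_rating,              # eval telemetry signal
    }

event = make_llm_event("model-v1", "tpl-support-v3", 812.5, 1400, 220, user_rating=4)
print(json.dumps(event, indent=2))
```

Because every downstream record carries the same `trace_id`, a latency spike on a dashboard can be joined back to the exact prompts, retrieved documents, and user feedback involved.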

What to measure (minimum viable set)

Domain       Metrics                                         Why it matters
Latency      TTFT, TBT, p95 end-to-end                       UX and SLO compliance.
Reliability  error rate, timeouts, retries                   Silent failures look like "bad answers".
Retrieval    recall@k, empty hits, filter rejects            RAG quality is retrieval quality.
Quality      hallucination rate (sampled), user ratings      Detect regressions before stakeholders do.
Cost         €/call, tokens/call, cache hit rate             Unit economics and predictability.
Safety       policy blocks, PII detections, injection flags  Risk management and compliance evidence.
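
A few of these metrics can be derived from per-call records alone. The sketch below aggregates in process purely for illustration; a production system would export these as counters and histograms to a metrics backend rather than hold them in memory:

```python
import statistics

class LLMMetrics:
    """In-memory aggregator for part of the minimum viable metric set.
    A sketch only, not a production exporter."""

    def __init__(self):
        self.latencies_ms = []
        self.token_counts = []
        self.cache_hits = 0
        self.calls = 0

    def record_call(self, latency_ms, tokens, cache_hit):
        self.calls += 1
        self.latencies_ms.append(latency_ms)
        self.token_counts.append(tokens)
        if cache_hit:
            self.cache_hits += 1

    def p95_latency_ms(self):
        # quantiles with n=20 yields 19 cut points; the last one is p95
        return statistics.quantiles(self.latencies_ms, n=20)[-1]

    def tokens_per_call(self):
        return sum(self.token_counts) / self.calls

    def cache_hit_rate(self):
        return self.cache_hits / self.calls
```

Tokens per call and cache hit rate together give an early read on unit economics: a rising tokens/call with a falling hit rate is a cost regression even when latency looks fine.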

Tracing with OpenTelemetry (LLM-specific)

In enterprise workflows, a single “chat request” triggers multiple spans: retrieval, reranking, tool calls, model inference, and post-processing.

  • Propagate trace IDs across all internal services and tool calls.
  • Add LLM attributes: model version, prompt template ID, retrieval IDs, and safety policy results.
  • Sample smartly: keep full traces for errors and a small rate for successes.
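
The three bullets above can be sketched with the standard library alone (a real deployment would use the OpenTelemetry SDK; the attribute names and sampling rate here are assumptions, not a fixed convention):

```python
import contextvars
import random
import uuid

# The trace ID travels with the request: via an HTTP header between
# services, via a context variable within one process.
current_trace_id = contextvars.ContextVar("trace_id", default=None)

def start_trace():
    """Mint a new trace ID at the edge and bind it to the request context."""
    tid = uuid.uuid4().hex
    current_trace_id.set(tid)
    return tid

def llm_span_attributes(model_version, prompt_template_id,
                        retrieval_ids, policy_result):
    """LLM-specific attributes to attach to every span (names illustrative)."""
    return {
        "trace_id": current_trace_id.get(),
        "llm.model_version": model_version,
        "llm.prompt_template_id": prompt_template_id,
        "llm.retrieval_ids": retrieval_ids,
        "llm.safety_policy_result": policy_result,
    }

def should_sample(is_error, success_rate=0.01):
    """Tail-sampling rule: keep every error trace, a small share of successes."""
    return True if is_error else random.random() < success_rate
```

Sampling on outcome rather than uniformly means the expensive full traces are concentrated where they matter: failed or degraded requests.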

Audit logging (privacy-safe)

Audit trails should connect inputs, retrieved sources, and outputs without leaking PII or secrets.

  • Log identifiers: doc IDs, chunk IDs, policy versions, model versions.
  • Redact payloads: store hashes or sampled, masked text.
  • Retention: align to GDPR data minimization and incident response needs.
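
A minimal sketch of such an audit record, assuming SHA-256 payload hashing as the redaction strategy (field names are illustrative): identifiers and versions are stored in the clear, while the prompt itself is reduced to a hash that can still prove what was sent without revealing it.

```python
import hashlib

def audit_record(user_prompt, doc_ids, chunk_ids,
                 model_version, policy_version):
    """Privacy-safe audit entry: identifiers plus a payload hash,
    never the raw prompt text."""
    return {
        "prompt_sha256": hashlib.sha256(user_prompt.encode()).hexdigest(),
        "doc_ids": doc_ids,          # which documents were retrieved
        "chunk_ids": chunk_ids,      # which chunks reached the context
        "model_version": model_version,
        "policy_version": policy_version,
    }
```

Hashing alone does not anonymize short or guessable inputs; for stronger guarantees, combine it with sampled masked text and strict retention limits, as the bullets above suggest.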

Alert rules that don’t spam

  • Latency: p95 breach for N minutes (and separate TTFT alerts).
  • Quality regression: golden set score drops after deploy.
  • Cost anomaly: tokens/call spikes or cache hit rate drops.
  • Safety: policy blocks spike (possible injection campaign).
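
The first rule, "p95 breach for N minutes", is what keeps latency alerts quiet. A minimal sketch (window size and streak logic are assumptions): the alert fires only after the breach is sustained across consecutive evaluation windows, not on a single spike.

```python
import statistics

class SustainedBreachAlert:
    """Fire only when p95 latency breaches the SLO for N consecutive
    evaluation windows, avoiding alerts on one-off spikes."""

    def __init__(self, slo_ms, windows_required):
        self.slo_ms = slo_ms
        self.windows_required = windows_required
        self.breach_streak = 0

    def evaluate_window(self, latencies_ms):
        """Call once per window (e.g. per minute); returns True when firing."""
        p95 = statistics.quantiles(latencies_ms, n=20)[-1]
        self.breach_streak = self.breach_streak + 1 if p95 > self.slo_ms else 0
        return self.breach_streak >= self.windows_required
```

The same streak pattern applies to the cost and safety rules: a sustained tokens/call rise or a sustained spike in policy blocks is signal; one noisy window is not.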

Key takeaways

  • Without observability, you can’t manage quality, latency, safety, or cost in production.
  • Track TTFT, p95 latency, tokens/request, cache hit rate, retrieval metrics, and policy blocks.
  • Logs and traces must be audit-friendly and privacy-aware (redaction, retention, access).
  • Tie metrics to SLOs and runbooks; do weekly governance reviews on real data.

30-day plan

  • Define KPIs and SLOs, and instrument the request lifecycle end-to-end.
  • Add RAG metrics (recall proxies, freshness) and caching metrics.
  • Set alert thresholds and write runbooks for latency/quality/cost regressions.
  • Use dashboards in a weekly executive cadence and iterate on blind spots.