RAG Architecture: 7 Patterns for Quality Retrieval

RAG failures are rarely “model problems”. They are retrieval problems: wrong chunks, wrong filters, or missing evaluation. This guide captures seven production patterns that consistently raise groundedness and reduce hallucinations—without turning your stack into a science project.

Executive summary

  • Default architecture: hybrid retrieval (keyword + vector) → reranker → context budget + citations → answer + self-check.
  • Most common failure: no eval harness, so “quality” is subjective and regressions slip into production.
  • Most common scaling issue: metadata is missing (tenant, doc type, ACL), so filtering is inaccurate and retrieval becomes unsafe.

Pattern 1 — Hybrid retrieval (BM25 + vectors)

Vectors are great for semantic similarity, but they can miss exact strings, codes, version numbers, and product identifiers. Hybrid retrieval combines strengths:

  • Keyword / BM25: exact matches (SKUs, error codes, product names, policy IDs).
  • Vector search: semantic similarity for “how do I…?” questions and paraphrases.
  • Filters first: tenant, ACL, doc type, region, and “effective date” should narrow the search space before scoring.

Pattern 2 — Chunking that matches how humans read

Chunking is not a technical detail: it is your retrieval granularity. The default that works across enterprise docs:

  • Semantic chunks: split by headings/sections and keep paragraphs together.
  • Small overlap: avoid losing definitions spanning two paragraphs.
  • Structure-aware: tables, SOPs, and runbooks benefit from specialized chunking.
Doc type Chunk strategy Notes
Policies / Legal Section-based Preserve clause boundaries; citations matter.
Runbooks / Ops Step-based Prefer “procedure blocks” over paragraphs.
Tickets / KB Thread-based Keep resolution + context together.

Pattern 3 — Query rewriting and expansion

Users rarely write the best retrieval query. A lightweight “query rewrite” step improves recall without changing the UI:

  • Normalize: expand abbreviations, map synonyms, keep the original.
  • Extract entities: product names, regions, dates, ticket IDs.
  • Generate 2–3 variants: one keyword-heavy, one semantic, one “problem→solution”.

Pattern 4 — Reranking (cheap, high impact)

Retrievers optimize speed; rerankers optimize relevance. In most stacks, reranking is the single best quality lever after filters.

  • Use reranking when: you have long documents, similar sections, or high “near-duplicate” content.
  • Keep it bounded: rerank top 20–50 results, not thousands.
  • Measure: run eval before/after; keep a rollback path.

Pattern 5 — Context budgeting + citations

Long context windows don’t solve retrieval. They hide errors. Budget context explicitly:

  • Top-k with diversity: avoid 5 chunks from the same section when 3 topics are needed.
  • Citations: tie claims to sources; in regulated environments, this is non-negotiable.
  • Refuse gracefully: if evidence is missing, respond with “I can’t find it” and ask clarifying questions.

Pattern 6 — Eval harness (offline + online)

Without evaluation, “quality” is a feeling. With evaluation, it becomes an SLO.

  • Offline: golden set questions + expected citations; regression tests on every change.
  • Online: sample production traffic; track groundedness proxies and user feedback.
  • Failure taxonomy: retrieval miss, stale doc, wrong filter, hallucination, tool failure.

Pattern 7 — Guardrails for RAG (prompt injection & data safety)

RAG increases attack surface: documents can contain malicious instructions. Treat retrieval as untrusted input.

  • Instruction hierarchy: system > developer > user; retrieved text is evidence, not instructions.
  • Policy filters: block unsafe tools/actions and sensitive data exfiltration.
  • Audit trail: log query, retrieved doc IDs, and citations (privacy-safe).

Decision matrix

Problem Most likely fix What to measure
Wrong answers with “confident” tone Reranking + citations + refusal policy Hallucination rate, groundedness
Answers ignore the newest policy Metadata (effective date) + filtering Staleness, doc coverage
Misses exact identifiers / codes Hybrid retrieval Recall on code-heavy queries
Too slow at scale Index sizing, batching, caching p95 latency, throughput

Related Articles

RAG Architecture: 7 Patterns for Quality Retrieval

7 practical RAG patterns to improve retrieval quality: chunking, metadata filtering, hybrid search, reranking, caching, and evaluation loops.

Want the full technical deep dive?

This page includes an executive brief in your language. Switch to English to read the full technical version with implementation details.

Key takeaways

  • Retrieval quality is a pipeline design problem: chunking, metadata, filters, reranking.
  • Measure with a benchmark set; optimize p95 latency and answer quality, not averages.
  • Add guardrails: citations, confidence thresholds, and escalation paths.
  • Keep KB fresh with versioned ingestion and dedup to avoid “drift by backlog”.

30-day plan

  • Build a query benchmark and define expected sources + quality KPIs.
  • Fix chunking and metadata schema; add multi-tenant access control.
  • Add hybrid search + reranker and compare quality/latency.
  • Instrument drift and latency; set alerts and run red team injection tests.