A product manager’s case study in evaluation-driven RAG tuning · Brad Hinkel · April 2026 · regs.bradhinkel.com

Summary
This project built and deployed a production retrieval-augmented generation (RAG) system over the U.S. Code of Federal Regulations — specifically Titles 7 (Agriculture), 21 (Food and Drugs), and 42 (Public Health), totaling 85,351 regulatory sections. Each query returns three outputs: a plain-English explanation, a formal legal-language synthesis with verbatim quotes from the source text, and a structured list of CFR citations (Title / Part / Section). The live system runs at regs.bradhinkel.com.
Unlike the fantasy-lore domain of the D&D project in this portfolio, regulations are a domain where hallucination is not merely aesthetic — it’s a liability issue. The entire product thesis rests on whether the system can reliably ground its claims in the retrieved text. That reframed the work: the question wasn’t “does this feel like a good answer,” it was “can we prove, at inference time, that this answer is grounded in what the retriever surfaced?”
Eleven evaluation experiments later, three findings carried most of the weight:
1. Retrieval quality was already strong before the harness showed it. A metric audit revealed that three subtle bugs had caused MRR to be under-reported by a factor of roughly 2.5, a plot twist that reframed every subsequent tuning decision.
2. Section-level chunking, guided by the CFR’s own structural hierarchy (DIV8 sections), beat every embedding and retrieval sophistication stacked against it — including a law-domain-tuned embedding model, hybrid BM25 retrieval, hierarchical reconstruction, and HyDE query expansion.
3. An inference-time confidence signal — a linear combination of retrieval similarity and citation-coverage — provides a reliable “I don’t know” flag without requiring a judge-LLM call. That’s the PM-distinctive contribution of this project. It’s also the feature most worth extending.
The final production configuration reaches MRR 0.858, NDCG 0.880, faithfulness 0.729, legal accuracy 0.734, and citation accuracy 0.602 at a median end-to-end latency around 8 seconds per query on Claude Haiku 4.5. The confidence signal correctly flags 100% of the “not found” cases in the eval set with zero false positives.
The Problem
Federal regulations are the archetype of a domain where an LLM needs help. The CFR runs about 200,000 pages. Natural-language queries like “what are the labeling requirements for organic produce” don’t map cleanly to a regulatory text that was written by lawyers for other lawyers, in a controlled vocabulary that rarely aligns with how a non-lawyer would phrase a question.
A user — in practice, a small business owner, a compliance analyst, a journalist, or a curious citizen — has two choices today: wade through multi-level hierarchies of Title → Part → Subpart → Section on ecfr.gov, or ask a general-purpose chatbot and hope it isn’t confabulating. Both are bad options. The first is a time sink. The second is a liability: regulatory text is authoritative, and an LLM that paraphrases “may” as “must” has materially changed the meaning of the law.
What the user actually needs is the regulatory text itself, accurately retrieved and verbatim-quoted, with citations precise enough to verify. The plain-English summary is a courtesy. The legal-language output and the structured citations are the product.

The System in Brief
The architecture is, deliberately, the same small-to-mid-scale RAG pattern I used for the D&D project. The point of reusing the shape is to let the tuning be the interesting part.
Corpus ingest — The eCFR XML API is fetched per title, parsed into a hierarchy (DIV1 through DIV8), and normalized. Each DIV8 (SECTION) becomes one chunk, with metadata carrying title, part, subpart, section number, section heading, agency, and CFR reference. 85,351 sections total.
Embedding — OpenAI text-embedding-3-small (1536 dimensions). Section-level chunks.
Vector store — PostgreSQL 16 + pgvector on a 4GB DigitalOcean droplet.
Retrieval — Pure vector cosine similarity, top-k = 10. All queries filter status = 'active', enforced once in the query layer rather than at each call site (a minimal sketch of this layer follows the component list).
Generation — Two-call sequential strategy on Claude Haiku 4.5. Call 1 writes the plain-English answer. Call 2 writes the legal-language answer with verbatim quotes, conditioned on both the retrieved context and the Call-1 summary.
API — FastAPI backend streams status events over Server-Sent Events: retrieving → generating → result. Confidence data is computed inline and returned in the result event.
Frontend — Next.js 14 / Tailwind, a single query form with tabbed outputs for Plain English, Legal Language, and CFR Citations, plus a confidence tier badge.
Production — nginx reverse proxy, Let’s Encrypt TLS, HSTS header, systemd units on the same DigitalOcean droplet that hosts the D&D project.
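For concreteness, here is roughly what that query layer looks like: a minimal sketch, assuming a `cfr_chunks` table with a pgvector `embedding` column and the metadata fields listed under Corpus ingest. Table, column, and connection names are illustrative, not the production code.

```python
# Illustrative retrieval layer: embed the query with the same model used at
# ingest, then run one pgvector cosine-similarity search. The status filter
# lives here, in the single query-layer function, not at each call site.
from openai import OpenAI
import psycopg

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def retrieve(query: str, top_k: int = 10) -> list[dict]:
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    sql = """
        SELECT title, part, section_number, section_heading, chunk_text,
               1 - (embedding <=> %s::vector) AS similarity
        FROM cfr_chunks
        WHERE status = 'active'
        ORDER BY embedding <=> %s::vector  -- cosine distance, ascending
        LIMIT %s
    """
    vec = str(emb)  # pgvector accepts the '[0.1, 0.2, ...]' text format
    with psycopg.connect("dbname=regs") as conn:
        rows = conn.execute(sql, (vec, vec, top_k)).fetchall()
    keys = ("title", "part", "section", "heading", "text", "similarity")
    return [dict(zip(keys, row)) for row in rows]
```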
The interesting decisions were not “which components” but “which configuration of those components actually works for regulations.” That question needed a harness.
The Evaluation Framework
I built a Python evaluation harness modeled on the one I used for the D&D project, with significant revisions specific to this domain. It runs a dataset of 60 regulatory questions (each tagged with a ground-truth CFR section reference and a human-written answer) against any given configuration and reports three families of metrics: retrieval (Precision@k, Recall@k, MRR, NDCG@k); generation (faithfulness, answer relevancy, legal accuracy, citation accuracy, and answer completeness, scored by a Haiku-based LLM judge with a fixed rubric); and operational (embed, retrieve, generation, and end-to-end latency, plus input/output tokens).
Each experiment isolates a single variable so the impact can be attributed cleanly. Each configuration is described in a YAML file; results persist to JSON so phase-to-phase comparison is a trivial diff. The discipline of changing one thing at a time is more important than any single result. It’s what lets a PM — or a future self, or a reviewer — answer “why did you choose top_k=10?” with a number instead of a shrug.
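For reference, the two headline retrieval metrics are computed the standard way. A sketch consistent with how the harness is described (binary relevance against the ground-truth citation), not the harness code itself:

```python
import math

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant citation in the ranked list."""
    for rank, citation in enumerate(retrieved, start=1):
        if citation in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, citation in enumerate(retrieved[:k], start=1)
        if citation in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```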
What Moved the Needle — and What Didn’t
Wins
Sequential generation (Experiment 1). The first decision was the generation strategy: one LLM call that produces all three outputs as structured JSON, or two calls — plain-English first, then legal-language conditioned on the plain-English summary? Sequential won on every quality metric: faithfulness +4.5 points, legal accuracy +6.7, citation accuracy +12.9. It cost roughly 2× latency (2.5 s → 4.9 s), which for a product that runs at conversational pace was acceptable.
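A minimal sketch of the sequential strategy, assuming the anthropic Python SDK; the model ID and prompt wording are illustrative stand-ins, not the production prompts:

```python
# Two-call sequential generation: plain English first, then legal language
# conditioned on both the retrieved context and the Call-1 summary.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def generate(question: str, context: str) -> tuple[str, str]:
    plain = client.messages.create(
        model="claude-haiku-4-5",  # illustrative model ID
        max_tokens=1024,
        messages=[{"role": "user", "content":
            f"Retrieved regulations:\n{context}\n\n"
            f"Answer in plain English: {question}"}],
    ).content[0].text

    legal = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content":
            f"Retrieved regulations:\n{context}\n\n"
            f"Plain-English summary:\n{plain}\n\n"
            f"Write a formal legal-language answer to '{question}', "
            "quoting the regulations verbatim with CFR citations."}],
    ).content[0].text
    return plain, legal
```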
Top-k = 10 (Experiment 2). The single highest-impact change in the evaluation. Moving from top_k=6 to top_k=10 improved recall by 22% (0.683 → 0.833), NDCG by 10%, and every generation metric, at roughly 13% more tokens per query and no MRR cost. Regulatory questions frequently span multiple sections (a "what are the labeling requirements for organic products" question touches § 205.300, § 205.301, and § 205.303 simultaneously), and under-sampling the retrieval set starves the generator. It's the kind of change a "vibe check" loop would never catch: the top-6 answers looked fine, but were missing meaningful grounding.
Neutral — Not Worth the Cost
Sonnet vs. Haiku (Experiment 3). Sonnet produced +0.9 points of faithfulness at 3.6× the latency (5.6 s → 20.1 s) and ~15% more tokens. Within measurement noise. Haiku stayed.
Hybrid retrieval (Experiment 4). BM25 + vector via reciprocal rank fusion gave +6% MRR with identical recall and precision; it neither helped nor hurt meaningfully. Vector-only retrieval was retained for simplicity.
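Reciprocal rank fusion itself is only a few lines. A generic sketch of the fusion step tested here, with k=60 as the conventional constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. BM25 and vector) into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```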
Negative Results — Directly Harmful
Query rewriting (Experiment 5). I expected this to help. The intuition is that a plain-English question like “can I sell raw milk” probably doesn’t hit the vocabulary the regulatory text uses (“unpasteurized fluid milk products”), and an LLM rewrite would bridge that gap. It didn’t. MRR dropped 25% on Haiku, 21% on Sonnet. The honest lesson: regulatory text uses a precise, controlled vocabulary. An expanded query drifts toward approximate synonyms that produce false-positive matches in the vector space. In a domain where the source vocabulary is already carefully chosen, query rewriting is not expansion — it’s drift. Disabled.
HyDE query expansion. A variant of the above — generating a hypothetical regulatory passage before embedding. With section-level chunks already bridging the vocabulary gap, HyDE degraded MRR marginally (0.858 → 0.842). Interesting in other domains; unnecessary here.
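For reference, HyDE reduces to "embed a hypothetical answer instead of the question." A sketch of the generic technique, with an illustrative prompt and model ID:

```python
# HyDE: generate a hypothetical regulatory passage, then embed THAT instead
# of the raw question. Prompt wording and model ID are assumptions.
import anthropic
from openai import OpenAI

llm = anthropic.Anthropic()
embedder = OpenAI()

def hyde_embed(question: str) -> list[float]:
    hypothetical = llm.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{"role": "user", "content":
            "Write a short passage in the style of the Code of Federal "
            f"Regulations that would answer: {question}"}],
    ).content[0].text
    return embedder.embeddings.create(
        model="text-embedding-3-small", input=hypothetical
    ).data[0].embedding
```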
The Big Bet That Didn’t Pay Off — voyage-law-2 + Paragraph Chunks
This was the experiment I most expected to move the final needle, and it didn’t.
voyage-law-2 is a 1024-dimensional embedding model from Voyage AI, fine-tuned on legal text. On paper, this is exactly the domain alignment a federal-regulation RAG should want. Combined with paragraph-level chunks — finer-grained retrieval targets that could surface the specific subparagraph (a)(1)(i) that answers a question — the hypothesis was that domain-specialized embeddings would outperform a general-purpose embedding model, and finer chunks would give the retriever more specific hit targets.
The corpus exploded from 83,448 section-level chunks to 223,918 paragraph chunks. Re-embedding that at Voyage’s pricing cost a few dollars in API spend.
The result was genuinely instructive. Recall went up (0.850). MRR and NDCG went down. At top_k=10, the retriever was finding the right sections — but surfacing three or four paragraph chunks from the same section ahead of the chunks that answered the specific question. This is exactly the failure mode paragraph chunking was supposed to cure, and instead paragraph chunking created it.
Raising top_k to 25 pushed retrieval metrics to their best numbers of the study so far (NDCG 0.653, +25.6% over baseline), while generation metrics fell. The retriever was excellent; the generator couldn't use what it was given. This is the "lost in the middle" problem from the long-context literature, amplified here by a specific regulatory pattern: paragraph (a) defines a rule, paragraph (b) lists its exceptions, and paragraph (c) defines its terms. The three are semantically interdependent, and separating them into independent retrieval targets fragments the meaning.
The lesson: a domain-specialized embedding model didn’t compensate for a chunking strategy that fought the document’s natural structure. Voyage-law-2 is a fine model; this wasn’t a fair test of it. The failure was the paragraph chunking, which the voyage choice had implicitly locked in.
The Plot Twist: Auditing the Metrics
Before spending more API budget on experiments, I stopped and audited the metric code itself. Three bugs surfaced:
Sub-paragraph notation false negatives. Ground-truth references were written as § 205.301(a)(1); chunk metadata stored only the base section § 205.301. Exact string matching therefore scored these as misses. On inspection, 23 of the 43 apparent retrieval "misses" were perfect hits in disguise.
Chunk-text false positives. The relevance check was also matching against the full chunk text. Regulatory text is dense with cross-references like “see § 135.110 for definitions,” so matching against chunk text credited the retriever for every chunk that mentioned the target section. Corrected to match only against citation metadata.
Duplicate-section inflation. In the paragraph corpus, multiple paragraphs from the same ground-truth section all counted as separate hits, pushing NDCG above 1.0, which is mathematically impossible. Corrected by counting each section once.
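The first and third fixes reduce to a normalization step and a dedup step. A sketch of the corrected relevance check; the helper names are hypothetical, and per the second fix it matches only against citation metadata:

```python
import re

def base_section(ref: str) -> str:
    """Normalize '§ 205.301(a)(1)' -> '205.301'."""
    ref = ref.replace("§", "").strip()
    return re.sub(r"\(.*$", "", ref).strip()  # drop sub-paragraph designators

def relevant_ranks(retrieved_citations: list[str], ground_truth: str) -> list[int]:
    """1-based ranks at which the ground-truth section appears.

    Matches citation metadata only (never chunk text), and counts each
    section once even if several chunks from it were retrieved.
    """
    target = base_section(ground_truth)
    seen: set[str] = set()
    ranks = []
    for rank, citation in enumerate(retrieved_citations, start=1):
        section = base_section(citation)
        if section == target and section not in seen:
            seen.add(section)
            ranks.append(rank)
    return ranks
```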
After the audit, MRR on the baseline configuration jumped from 0.275 to 0.710. NDCG from 0.520 to 0.720. Retrieval quality was already roughly 2.5× better than the pre-fix numbers had suggested.
This is an uncomfortable, honest result, and it's also the most PM-useful single finding in the project. For several weeks I'd been tuning for what I thought was a retrieval problem. It had actually been a measurement problem. The fancy embeddings, the hybrid retrieval, the hierarchical rescue attempts: all of them were solutions to a retrieval ceiling that the buggy metrics had invented. Every evaluation harness needs an adversarial sanity pass before tuning decisions get stacked on top of it. I didn't do one early enough, and that mistake cost the project several experiments' worth of API budget.
The Re-Ingestion That Actually Fixed Things
With metrics trustworthy, the pattern clarified: the generator wasn't starved for retrieval; it was starved for coherent context. Paragraph chunks had been fragmenting the regulatory meaning that lives at the section level. The natural unit of regulatory meaning is the DIV8 SECTION; the CFR's own drafters had already decided where the meaning boundaries are.
I re-ingested Titles 7, 21, and 42 with text-embedding-3-small and section-level chunks. 85,351 sections. Top_k stayed at 10. Sequential generation stayed. Everything else stayed.
| Metric | Paragraph + voyage-law-2 | Section + text-3-small |
|---|---|---|
| MRR | 0.710 | 0.858 |
| NDCG@k | 0.720 | 0.880 |
| Faithfulness | 0.533 | 0.729 |
| Legal accuracy | 0.569 | 0.734 |
| Citation accuracy | 0.354 | 0.602 |
Citation accuracy — the metric that most directly measures “is the answer grounded in real regulations” — moved from 0.354 to 0.602. That is the needle that mattered.
The architectural lesson: when the source corpus has been carefully structured by its authors, the chunking strategy should honor that structure. A general-purpose embedding model aligned to the document’s natural boundaries outperformed a legal-domain-specialized embedding model operating on contrived paragraph boundaries — by 20 points of faithfulness and 25 points of citation accuracy.
Adding an Inference-Time Confidence Signal
With the retrieval and generation stacks tuned, one problem remained: the system occasionally produced plausible answers that weren’t actually grounded in the retrieved context. This is the hallucination-risk tail — low-frequency, high-consequence. The LLM judge in the evaluation framework catches it at test time, but at inference time, the user sees no signal distinguishing a reliable answer from a shaky one.
LLM-as-judge at inference time is expensive (another LLM call per user query) and slow. The product needed something cheaper.
What I Built
A two-component confidence score, computed without any additional LLM call:
Retrieval score = average cosine similarity of the top-3 retrieved chunks. Proxy for “did the retriever find semantically relevant content.”
Citation coverage = fraction of CFR § references mentioned in the generated text that actually appear in the retrieved chunk set. Proxy for “did the generator ground its claims in what the retriever gave it.”
Composite: 0.35 × retrieval_score + 0.65 × citation_coverage. Four tiers: high (≥0.75), medium (0.50–0.74), low (<0.50), and not_found (no relevant content retrieved).
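The whole signal is a few lines on top of data the pipeline already computes. A sketch using the production weights and thresholds; the citation regex and the empty-retrieval test for not_found are simplifying assumptions:

```python
import re

CITATION_RE = re.compile(r"§+\s*(\d+\.\d+)")  # '§ 205.301' -> '205.301'

def confidence(similarities: list[float], answer: str,
               retrieved_sections: set[str]) -> tuple[float, str]:
    if not similarities:  # no relevant content retrieved
        return 0.0, "not_found"

    # Component 1: mean cosine similarity of the top-3 retrieved chunks.
    top3 = similarities[:3]
    retrieval_score = sum(top3) / len(top3)

    # Component 2: fraction of cited sections present in the retrieved set.
    cited = set(CITATION_RE.findall(answer))
    citation_coverage = (len(cited & retrieved_sections) / len(cited)
                         if cited else 0.0)

    score = 0.35 * retrieval_score + 0.65 * citation_coverage
    tier = "high" if score >= 0.75 else "medium" if score >= 0.50 else "low"
    return score, tier
```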
What the Signal Reliably Does Today
not_found detection. 100% of not_found flags in the eval set corresponded to answers with zero faithfulness. Users see a clean “no relevant regulations were found for this query” message instead of a confabulated answer.
Aggregate monitoring. Average confidence across rolling windows of production traffic is a reasonable proxy for system health. A drop in average confidence is a signal worth paging on.
What it doesn’t reliably do yet: per-answer user-facing probability labels. The tier ordering between high and medium doesn’t hold up in the current 60-question sample. Showing a user “this answer scored 72%” would be a false guarantee until the test set is expanded to 200+ questions and the weights are learned rather than guessed.
The broader product point: inference-time quality signals are a real and underused product feature in LLM applications. Users care about “is this answer good,” not “is the average answer good.” A cheap, honest per-query quality signal — even one that only reliably surfaces the not_found case — is more valuable than a better aggregate eval score.
Results Snapshot
Production configuration: CFR Titles 7, 21, 42 — 85,351 sections · OpenAI text-embedding-3-small · PostgreSQL + pgvector · vector retrieval top_k=10 · Claude Haiku 4.5 sequential two-call generation · retrieval + citation-coverage confidence scoring · Next.js + FastAPI + nginx on a 2 vCPU / 4GB DigitalOcean droplet.
Final metrics: MRR 0.858 · NDCG@10 0.880 · Faithfulness 0.729 · Legal accuracy 0.734 · Citation accuracy 0.602 · Average confidence 0.763 (71.7% high, 16.7% medium, 5% low, 6.7% not_found) · Median end-to-end latency ~8 s · Cost fractions of a cent per query.
Lessons Learned
Audit the harness before tuning on it.
The metric audit was the most consequential finding in the project, and the one I should have done earliest. It retroactively invalidated several tuning decisions and a non-trivial amount of experiment budget. Every evaluation framework needs an adversarial sanity pass — construct cases where the expected answer is obvious and confirm the metrics reward them — before real tuning starts.
Honor the document’s natural structure.
Before reaching for a domain-specialized embedding model, check whether the document’s authors have already solved the chunking problem. A general-purpose embedding model operating on the CFR’s natural SECTION boundaries beat a legal-domain-specialized model operating on engineered paragraph chunks — by every metric that mattered. A domain-specialized component can’t compensate for a structural mismatch upstream of it.
Inference-time quality signals are a PM product feature, not just an eval concern.
The conventional LLM-PM instinct is to push quality into test-time: bigger eval sets, better judges, tighter prompts. Those matter. But a cheap, per-query, honest quality signal — one the user can see, even if it’s just a not_found flag — is differentiated product value. It turns a question of system reliability into a transparent product affordance.
Token budget and precision are the same variable, not opposing ones.
The top_k=10 experiment improved every retrieval and generation metric for a 13% token cost, and the re-ingestion to section-level chunks did the same at higher magnitude. The question is always “is this context actually load-bearing for the answer,” not “are we using too many tokens.”
What’s Next
200+ question evaluation set — the single highest-leverage next step. The 60-question set produces confidence intervals too wide to resolve sub-10-point differences between configurations, and doesn’t give the confidence signal the calibration data it needs.
Semantic citation coverage. Replace the regex-based citation_coverage check with a lightweight semantic-grounding prompt. Each factual claim gets evaluated against the retrieved chunks for semantic support. More expensive than the regex, far cheaper than a full LLM-judge call.
Differential corpus refresh. The CFR updates daily via the eCFR API. The schema was designed for atomic version-swap from the start: every chunk has a status ENUM and a version_id, and retrieval filters status = 'active' at the query layer. A weekly differential refresh that ingests changed sections, runs sanity checks, then atomically swaps staged → active is a natural follow-up (sketched after this list).
Corpus expansion. Currently Titles 7, 21, 42. The CFR has 50 titles. Expanding to full coverage is a data-engineering effort that will cross into territory where the existing 4GB droplet starts to feel tight.
Monitoring and CI/CD. Sentry, a latency dashboard, and GitHub Actions push-to-deploy. Plumbing, but the difference between “deployed once” and “operated continuously.”
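The atomic swap at the end of the differential refresh might look like the following. A sketch under the stated schema (status ENUM, version_id); the 'superseded' status value and the column names are assumptions:

```python
# Promote staged rows and retire superseded ones in a single transaction,
# so queries filtering status = 'active' never see a half-updated corpus.
import psycopg

def atomic_swap(version_id: str) -> None:
    with psycopg.connect("dbname=regs") as conn:
        with conn.transaction():
            # Retire the active rows for every section this refresh touched.
            conn.execute("""
                UPDATE cfr_chunks SET status = 'superseded'
                WHERE status = 'active' AND cfr_ref IN (
                    SELECT cfr_ref FROM cfr_chunks
                    WHERE status = 'staged' AND version_id = %s
                )
            """, (version_id,))
            # Promote the staged rows.
            conn.execute("""
                UPDATE cfr_chunks SET status = 'active'
                WHERE status = 'staged' AND version_id = %s
            """, (version_id,))
```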
Closing Thought
The most useful thing this project taught me wasn’t about RAG at all — it was about the gap between “we built an evaluation harness” and “we trusted our evaluation harness.” I had the former from day one. The latter took a painful audit that invalidated weeks of tuning work and reframed the entire project. That’s a PM story worth telling: the discipline of treating your own tools as suspect until proven otherwise.
The other durable lesson is about inference-time signals. Every RAG system I’ve seen ships with an eval harness (good), and most ship with zero per-query quality indicators (less good). The confidence signal in this project is imperfect and the calibration story is in progress — but it exists, it’s computed for free on top of the existing retrieval and generation pipeline, and in its current form it reliably catches the case where the system doesn’t know. That’s a product primitive, not an engineering concern.
The live system runs at regs.bradhinkel.com. It is, at the time of writing, the most accurate and most honest version of this product I know how to ship with today’s tools. That’s the standard the next version will have to beat.
Legal disclaimer. This tool provides information about federal regulations for educational purposes. It is not legal advice. Consult a qualified attorney for legal guidance.