D&D RAG Content Generator — Case Study

A product manager’s case study in evaluation-driven LLM development · Brad Hinkel · April 2026 · dnd.bradhinkel.com

D&D Content Generator home page

Summary

This project built and deployed a production retrieval-augmented generation (RAG) system that creates original Dungeons & Dragons content — weapons, NPCs, artifacts, locations, and monsters — grounded in a curated 1,006-document corpus of Forgotten Realms lore. The live system runs at dnd.bradhinkel.com.

The product goal was straightforward, but the more interesting story is how the system got tuned. Rather than relying on the “vibe check” iteration loop that dominates a lot of LLM development, I built a six-phase evaluation framework that tested one variable at a time across 87 ground-truth questions. The result was a configuration that improved retrieval precision by 85% over the baseline while cutting per-query token cost by 65% — improvements that would not have been visible by eyeballing outputs.

For PMs working on LLM-powered products, the takeaway is simple: rigorous offline evaluation is one of the highest-leverage things a product team can invest in, and most of the work is unglamorous bookkeeping that pays for itself many times over.

The Problem

Dungeon masters constantly need fresh content that feels canonical — a longsword that fits the politics of Waterdeep, an NPC who could plausibly have crossed paths with Drizzt. Existing AI tools either hallucinate non-canon material or push the DM into extensive manual research to ground the output. RAG is a natural fit: retrieve relevant lore from a trusted corpus, then condition the LLM on it. But “natural fit” doesn’t mean “easy to do well.” The quality of a RAG system is determined by dozens of small decisions — chunk size, embedding model, retrieval depth, filtering, reranking — and each one interacts with the others.

The System in Brief

The application takes a category and a few parameters from the user (rarity, theme, location, etc.), retrieves the top relevant chunks from a ChromaDB vector store, and passes them as grounding context to Claude Haiku, which generates a structured JSON object validated against a category-specific Pydantic schema. DALL-E 3 then generates a matching image. Items are persisted to PostgreSQL and shown in a gallery. A FastAPI backend streams generation progress to a Next.js frontend over Server-Sent Events. The whole thing runs on a single DigitalOcean droplet behind Nginx.

Example generated artifact: Sunburst of the Flaming Fist
Example generated artifact — the system produces structured item data and a matching DALL-E 3 illustration, grounded in retrieved Forgotten Realms lore.
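Stripped of the streaming, image generation, and persistence layers, the generation path reduces to retrieve, prompt, validate. A minimal sketch follows; the collection name, schema fields, and prompt wording are illustrative rather than lifted from the codebase.

```python
import anthropic
import chromadb
from pydantic import BaseModel, ValidationError


class WeaponItem(BaseModel):
    """Illustrative category schema; the real system has one schema per category."""
    name: str
    rarity: str
    description: str
    lore_hooks: list[str]


def generate_weapon(query: str, rarity: str) -> WeaponItem:
    # 1. Retrieve grounding lore from the vector store.
    client = chromadb.PersistentClient(path="./chroma")
    collection = client.get_collection("forgotten_realms_lore")  # name is illustrative
    hits = collection.query(query_texts=[query], n_results=5)
    context = "\n\n".join(hits["documents"][0])

    # 2. Ask Claude Haiku for structured JSON conditioned on the retrieved lore.
    llm = anthropic.Anthropic()
    msg = llm.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        system="You write Forgotten Realms item lore. Reply with JSON only.",
        messages=[{
            "role": "user",
            "content": (
                f"Lore context:\n{context}\n\n"
                f"Create a {rarity} weapon matching: {query}. "
                f"Return JSON with keys: name, rarity, description, lore_hooks."
            ),
        }],
    )

    # 3. Validate the model output against the category schema before persisting it.
    try:
        return WeaponItem.model_validate_json(msg.content[0].text)
    except ValidationError as exc:
        raise RuntimeError("LLM output did not match the weapon schema") from exc
```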

The architecture itself is unremarkable — it’s the standard small-to-mid-scale RAG pattern. The interesting decisions weren’t which components to use but how to configure them. That’s where the evaluation framework earned its keep.

The Evaluation Framework

I built a custom Python evaluation harness that runs 87 ground-truth questions (across five categories and six question types) against a given configuration and reports nine metrics: retrieval precision, recall, MRR, and NDCG; faithfulness, answer relevancy, and context relevancy from an LLM judge; and end-to-end latency and total token consumption from the operational side. Each question is tied to a specific source document, so retrieval correctness can be scored objectively rather than guessed at.
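The retrieval-side scoring is the part worth sketching, because tying each question to a known source document is what makes it objective. Below is a minimal sketch of how precision@k, recall@k, and MRR can be computed against per-question ground-truth document IDs; the data structures and function names are illustrative, not the harness's actual API.

```python
from dataclasses import dataclass


@dataclass
class GroundTruthQuestion:
    question: str
    relevant_doc_ids: set[str]   # the documents this question should retrieve


def score_retrieval(retrieved_ids: list[str], relevant: set[str]) -> dict:
    """Retrieval metrics for a single question: precision@k, recall@k, MRR."""
    k = len(retrieved_ids)
    hits = [doc_id in relevant for doc_id in retrieved_ids]

    precision = sum(hits) / k if k else 0.0
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # Reciprocal rank of the first relevant document; 0 if none was retrieved.
    mrr = next((1.0 / (rank + 1) for rank, hit in enumerate(hits) if hit), 0.0)
    return {"precision@k": precision, "recall@k": recall, "mrr": mrr}


def evaluate(questions, retriever, k: int = 5) -> dict:
    """Average retrieval metrics over the ground-truth set for one configuration."""
    totals = {"precision@k": 0.0, "recall@k": 0.0, "mrr": 0.0}
    for q in questions:
        retrieved_ids = retriever(q.question, k)   # returns ranked document ids
        for name, value in score_retrieval(retrieved_ids, q.relevant_doc_ids).items():
            totals[name] += value
    return {name: value / len(questions) for name, value in totals.items()}
```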

I ran six phases, each isolating a single design variable so the impact could be attributed cleanly:

| Phase | Variable Tested | Options | Winner |
|---|---|---|---|
| 0 | Baseline (Flowise prototype) | | |
| 1 | Metadata filtering | ON vs. OFF | Filter ON |
| 2 | Chunk size | 256 / 512 / 1024 tokens | 256 tokens |
| 3 | Embedding model | ada-002 / 3-small / BGE | 3-small |
| 4 | Search method | vector / BM25 / hybrid | Vector only |
| 5 | Top-k | 3 / 5 / 10 | Top-5 |
| 6 | Reranking | none / cross-encoder | No reranking |

The discipline of changing one thing at a time turned out to be more important than any single result. It made the system legible. When stakeholders — or hiring managers, or my future self — ask “why is the chunk size 256?” there’s an answer with numbers attached, not “it felt better.”
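Concretely, a one-variable-at-a-time sweep can be organized as a config grid in which each phase varies a single field and carries the winner forward into the next phase. The sketch below assumes the `evaluate` helper from the previous snippet; the field names, exact model identifiers, and the single selection metric are illustrative simplifications.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RagConfig:
    metadata_filter: bool = True
    chunk_size: int = 512
    embedding_model: str = "text-embedding-3-small"
    search: str = "vector"
    top_k: int = 5
    rerank: bool = False


# Each phase varies exactly one field; everything else stays fixed.
PHASES = [
    ("metadata_filter", [True, False]),
    ("chunk_size", [256, 512, 1024]),
    ("embedding_model", ["text-embedding-ada-002", "text-embedding-3-small", "BAAI/bge-small-en"]),
    ("search", ["vector", "bm25", "hybrid"]),
    ("top_k", [3, 5, 10]),
    ("rerank", [False, True]),
]


def run_sweep(best: RagConfig, questions, build_retriever) -> RagConfig:
    for field, options in PHASES:
        results = {}
        for option in options:
            candidate = replace(best, **{field: option})
            retriever = build_retriever(candidate)        # rebuilds index + retriever
            results[option] = evaluate(questions, retriever, k=candidate.top_k)
        # Lock in the winner before moving to the next phase.
        winner = max(results, key=lambda opt: results[opt]["precision@k"])
        best = replace(best, **{field: winner})
    return best
```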

Results

The optimized configuration produced large wins on the metrics that mattered and roughly neutral results on the metrics where the baseline was already strong:

| Metric | Baseline | Optimized | Change |
|---|---|---|---|
| Precision@k | 0.339 | 0.627 | +85% |
| Recall@k | 0.852 | 0.840 | −1.5% |
| NDCG@k | 1.281 | 1.784 | +39% |
| Faithfulness (LLM judge) | 0.940 | 0.935 | −0.5% |
| Answer Relevancy (LLM judge) | 0.887 | 0.877 | −1.1% |
| End-to-end latency | 2,678 ms | 2,474 ms | −7% |
| Tokens per query | ~3,833 | ~1,350 | −65% |

Three things stand out. First, precision nearly doubled while latency improved — unusual, since quality and speed normally trade off. Second, the cost per query fell by 65%, almost entirely from sending less context to the LLM rather than from a cheaper model. Third, the LLM-judge scores barely moved. The retrieval metrics were far more discriminative for tuning decisions, which is itself a useful finding: judge scores are better suited to regression detection than to optimization.

Lessons Learned

Rigorous evaluation beats vibe checks — by a lot.

The headline finding of the entire project is that systematic, ground-truth-based evaluation surfaced an 85% precision improvement that no amount of eyeballing would have caught. Chunk size — the variable that produced the biggest gain — is invisible to a human reviewer scrolling through generated outputs. Without a numerical framework, that improvement would have stayed on the table. For PMs working with LLMs, this is the lesson worth internalizing: build the eval harness first, even when it feels like overhead, because the tuning decisions you make without it are mostly guesses.

Token count is a precision proxy, not just a budget line.

The 256-token chunks weren’t just cheaper; they were also more precise. Smaller chunks force the retriever to return more focused content, which means less noise in the context window. Cost and quality moved in the same direction, not opposite directions. PMs who frame token usage purely as a budget concern are missing half the story.
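A rough, purely illustrative back-of-the-envelope makes the point: at a fixed top-k, grounding-context tokens scale linearly with chunk size, so chunk size is the lever that moves per-query cost. The overhead figure below is a placeholder, not a measurement from the system.

```python
# Illustrative arithmetic only: context tokens scale linearly with chunk size
# at a fixed top-k, so chunk size dominates per-query prompt cost.
TOP_K = 5
PROMPT_OVERHEAD = 300   # instructions + user parameters (placeholder value)

for chunk_size in (256, 512, 1024):
    context_tokens = TOP_K * chunk_size
    print(f"chunk_size={chunk_size}: ~{PROMPT_OVERHEAD + context_tokens} prompt tokens/query")
```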

Domain mismatch breaks pre-trained components.

I tested a cross-encoder reranker (ms-marco-MiniLM-L-6-v2) in Phase 6 expecting a precision boost. Instead it added 79% latency overhead with no measurable quality gain. The reranker was trained on web-search queries and didn’t generalize to fantasy-lore retrieval. Pre-trained components carry assumptions about their training distribution, and those assumptions don’t always hold.
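For context, wiring in a cross-encoder reranker is only a few lines with the sentence-transformers API; the extra forward pass over every candidate is exactly where the latency overhead comes from. The retrieval plumbing around it is omitted here and the function shape is illustrative.

```python
from sentence_transformers import CrossEncoder

# Phase 6 reranker: scores (query, passage) pairs and re-sorts the candidates.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```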

The hybrid search result is narrower than it sounds.

Phase 4 found that BM25 hybrid search did not outperform pure vector search. The honest interpretation isn’t “hybrid search doesn’t help” — it’s “BM25-style hybrid didn’t help here, given that queries are conceptual, the corpus is small and homogeneous, and metadata filtering already narrows the search space.” A more interesting hybrid I didn’t test would use the strongly-typed attributes of D&D items (rarity, alignment, weapon type, challenge rating) as structured pre-filters before the vector search runs. That would likely improve precision further, reduce latency, and let the prompt drop instructions like “must be legendary rarity” because retrieval would guarantee it. This kind of structured-plus-semantic hybrid is, I suspect, the right next step — and it’s a good reminder that an evaluation result tells you exactly what you tested, not the more general thing it sounds like.
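As a sketch of what that structured-plus-semantic hybrid could look like, ChromaDB's metadata filters can serve as the structured pre-filter, assuming attributes like rarity and category were added to each chunk's metadata at ingest time (they are not part of the current pipeline):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_collection("forgotten_realms_lore")  # name is illustrative

# Structured attributes narrow the candidate set before the semantic search runs,
# so "must be legendary rarity" is guaranteed by retrieval instead of the prompt.
hits = collection.query(
    query_texts=["a flaming longsword tied to Waterdeep politics"],
    n_results=5,
    where={"$and": [{"rarity": {"$eq": "legendary"}}, {"category": {"$eq": "weapon"}}]},
)
```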

Lock the configuration grid before sweeping.

Phases 4–6 were initialized with placeholder configs before the Phase 2 winner was confirmed, which left a small dependency-chain caveat in the final results. Trivial to avoid in retrospect; worth a callout for any PM planning a similar sweep.

What’s Next

The most interesting follow-up is the structured-field hybrid retrieval described above. Beyond that: a dedicated end-to-end “full optimal” evaluation run with all winning configs locked in, a domain-adapted reranker trained on D&D corpora, a thumbs up/down feedback loop in the production UI to drive future retraining, and open-sourcing the evaluation framework as a reusable RAG eval toolkit. The framework is the artifact with the most general value — most teams haven’t built one, and most could use one.

Closing Thought

A lot of LLM product development today happens by feel. Outputs look good, ship it; outputs look bad, change something and try again. That works at small scale and breaks down quickly past it. The discipline this project taught me — formalize ground truth, isolate one variable at a time, instrument the operational metrics alongside the quality metrics — is the difference between a demo and a product. It’s not glamorous work, and it’s not where most of the LinkedIn excitement lives, but it’s where the actual leverage is.

