LoopRails

Advanced and Agentic RAG: A Recipe Book

Basic RAG has a simple shape. A question comes in, you search a vector store for the chunks that look most similar, you stuff those chunks into the prompt, and the model answers. That pipeline is covered in the companion chapter on RAG retrieval patterns, and for a lot of questions it is all you need. It works right up until the question is vague, or the first search grabs the wrong passages, or the real answer is spread across ten documents instead of sitting in one, or you need a way to know when the system is confidently wrong.

The recipes in this chapter are the upgrades that handle those cases. Some of them add a feedback step so the system can notice a bad retrieval and try again. Some of them hand the model the steering wheel, letting it decide what to search for and whether one search was enough. Some change what gets retrieved in the first place, so the chunks carry more meaning or connect across the whole corpus. Each one buys you accuracy or autonomy, and each one adds moving parts.

Here is the thread that runs through all of them, and it is the LoopRails point. Every upgrade in this chapter turns a single retrieve-then-answer step into a loop. The agent searches, reads, decides, and maybe searches again. A loop that can run more than once needs a rule for when to stop and a check on whether it did the right thing. That is why the chapter ends on evaluation. Evaluation is the verifier for RAG, the thing that tells you whether any of these clever additions actually helped or just made the system slower and harder to debug.

The recipes

Contextual retrieval

What it is: Before you embed a chunk, you prepend a short note describing where the chunk came from and what surrounds it, so the chunk still makes sense when it gets pulled out of its document and dropped into a prompt on its own. A chunk that reads "the rate is 4 percent" becomes something like "From the 2024 refund policy, section on late returns: the rate is 4 percent."

When to reach for it: Reach for this when your chunks lose their meaning once they are separated from the document around them. This happens constantly with policy docs, contracts, technical manuals, and anything where a sentence relies on a heading three paragraphs up. If you have ever retrieved a chunk that was technically relevant but said "this applies in all cases above" with no "above" attached, contextual retrieval is the fix.

How it fails: The plain failure is that a raw chunk is meaningless alone. "The rate is 4 percent" matches a query about rates, gets retrieved, and the model has no idea which rate, from which policy, in which year. It answers confidently with the wrong number. The other cost is the preprocessing itself. Generating a context blurb for every chunk usually means an extra model call per chunk at indexing time, which adds money and time to your ingestion pipeline, and you pay it again every time you re-index.

How to fix it: Generate the context once at ingestion time and store it with the chunk, so you are not doing it on every query. Keep the prepended note short and factual, just enough to anchor the chunk: the document title, the section, the date. You can build the blurb cheaply by pulling structural metadata you already have (filename, headers, page) instead of asking a model to summarize from scratch. When the extra preprocessing cost worries you, run it only on the corpora where chunks actually lose meaning, not on everything by reflex.
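One way to sketch the cheap version of this, assuming your ingestion step already knows each chunk's document title, section, and date. The `Chunk` fields, the `contextualize` helper, and the record layout are illustrative names, not part of any particular library, and the embedding call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_title: str  # hypothetical metadata fields, assumed available at ingestion
    section: str
    date: str


def contextualize(chunk: Chunk) -> str:
    """Prepend a short, factual anchor built from metadata we already have."""
    parts = [p for p in (chunk.doc_title, chunk.section, chunk.date) if p]
    if not parts:
        return chunk.text  # nothing to anchor with, so embed the raw chunk
    return f"From {', '.join(parts)}: {chunk.text}"


def build_index_records(chunks: list[Chunk]) -> list[dict]:
    """Run once at ingestion and store both forms, so queries never rebuild the blurb."""
    return [{"embed_text": contextualize(c), "raw_text": c.text} for c in chunks]


chunk = Chunk("the rate is 4 percent", "2024 refund policy", "late returns", "")
records = build_index_records([chunk])
print(records[0]["embed_text"])
```

You would embed `embed_text` and keep `raw_text` around for display. No model call is involved here, which is the point: structural metadata gets you most of the anchor for free, and you can reserve a model-written blurb for the corpora where that is not enough.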

Agentic RAG

What it is: Instead of always retrieving exactly once, you let an agent decide whether to retrieve at all, what query to send, and whether to go back and retrieve again after reading the first results. The agent runs a small loop: search, read what came back, decide if that was enough, then either answer or search again with a better query.

When to reach for it: Reach for this when your questions vary a lot in difficulty. Some users ask "what is the refund window," which one search answers. Others ask "compare the refund windows across all three of our regional policies and tell me which is strictest," which needs several searches and some reasoning in between. A fixed single retrieval is wrong for both ends of that range. It wastes a search on questions that did not need one and falls short on questions that needed five.

How it fails: The classic failure is a loop with no brakes. The agent decides the results are never quite good enough and keeps retrieving, burning tokens and wall-clock time on every turn, sometimes until it hits a hard limit and dies mid-task. The mirror image is just as bad: the agent decides it does not need to retrieve, answers from memory, and hallucinates. In between you get agents that retrieve a reasonable number of times but have no real stopping rule, so the behavior is unpredictable and the cost per question swings wildly.

How to fix it: This is a loop, so treat it like one. Put a hard cap on retrieval rounds, three or four for most systems, so a confused agent cannot spin forever. Give it an explicit done-condition instead of leaving "enough" to vibes: a check that asks whether the retrieved context actually covers the question before the agent is allowed to answer, and forces an answer once the cap is hit. The loop engineering chapter is the long version of this idea, and the multi-agent loops chapter covers what happens when you have several of these agents retrieving at once. The short version: a retrieval loop needs a done-condition and a cap, exactly like every other loop you build.
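Here is a minimal sketch of a capped retrieval loop with an explicit done-condition. The `search`, `covers`, `rewrite`, and `answer` callables are placeholders you would back with your own retriever and model calls, and the stub functions at the bottom exist only to make the example runnable. The sketch always retrieves at least once, so it does not model the agent's choice to skip retrieval.

```python
from typing import Callable

MAX_ROUNDS = 3  # hard cap; three or four is the range suggested above


def agentic_answer(
    question: str,
    search: Callable[[str], list[str]],
    covers: Callable[[str, list[str]], bool],
    rewrite: Callable[[str, list[str]], str],
    answer: Callable[[str, list[str]], str],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[str, int]:
    """Search, check coverage, and retry with a new query until done or capped."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    context: list[str] = []
    query = question
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        context.extend(search(query))
        if covers(question, context):  # the explicit done-condition
            break
        query = rewrite(question, context)  # try a better query next round
    # Reaching the cap forces an answer from whatever context was gathered.
    return answer(question, context), rounds


# Stubs for illustration only.
def fake_search(query: str) -> list[str]:
    return [f"passage for: {query}"]


def covered_after_two(question: str, context: list[str]) -> bool:
    return len(context) >= 2


def never_covered(question: str, context: list[str]) -> bool:
    return False


def fake_rewrite(question: str, context: list[str]) -> str:
    return f"{question} (attempt {len(context) + 1})"


def fake_answer(question: str, context: list[str]) -> str:
    return f"answered from {len(context)} passages"


print(agentic_answer("q", fake_search, covered_after_two, fake_rewrite, fake_answer))
print(agentic_answer("q", fake_search, never_covered, fake_rewrite, fake_answer))
```

The second call is the confused agent: the coverage check never passes, and the loop still ends after three rounds. Returning the round count alongside the answer makes cost per question something you can log and watch.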

Corrective RAG (CRAG)

What it is: After you retrieve, you grade the documents you got back before you trust them. If they look irrelevant or thin, you fall back to something else, like a broader query or a live web search, instead of answering on weak evidence. Good documents go straight to the answer step; weak ones trigger the fallback.

When to reach for it: Reach for this when retrieval quality is uneven and you cannot count on the first search being good. Maybe your corpus has gaps, maybe some questions land outside what you have indexed, maybe the embeddings are noisy for certain topics. CRAG gives the system a way to notice "these results are bad" and do something about it rather than dressing up whatever it found as a confident answer.

How it fails: The grader is the whole game, and a bad grader sinks the pattern. If it waves through irrelevant documents, you are back to plain RAG with extra steps. If it rejects good documents, you trigger expensive fallbacks you did not need. The second failure is having no fallback worth falling back to: the grader correctly flags the results as weak, and the system shrugs and answers anyway because there is nowhere else to go. The third is answering on weak evidence in the first place, which is the failure CRAG exists to prevent, so if your grader is not actually wired to block the answer step, you have built the scaffolding and skipped the point.

How to fix it: Make the grader cheap and judge it against real examples, not your hopes for it. Build a small set of query-document pairs you have labeled relevant or not, and check that the grader agrees with your labels before you ship it. Wire up a fallback that is genuinely different from the first attempt, so retrying actually has a chance of producing better evidence: a web search, a keyword search to complement vector search, a broader query. And make the grade gate the answer for real. If the documents fail the grade and the fallback also comes back empty, the honest move is to say the system does not have a good answer, not to generate one anyway.
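A rough sketch of both halves of that advice: checking a grader against labeled pairs, and making the grade gate the answer. The keyword-overlap grader is a deliberately crude stand-in for whatever grader you actually use, and `primary`, `fallback`, and `generate` are hypothetical callables for your own retrievers and model.

```python
from typing import Callable, Optional

Grader = Callable[[str, str], bool]
Retriever = Callable[[str], list[str]]


def grader_agreement(grade: Grader, labeled: list[tuple[str, str, bool]]) -> float:
    """Fraction of labeled (query, doc, is_relevant) pairs the grader gets right."""
    if not labeled:
        raise ValueError("need at least one labeled pair")
    correct = sum(grade(q, d) == label for q, d, label in labeled)
    return correct / len(labeled)


def corrective_retrieve(
    question: str, primary: Retriever, fallback: Retriever, grade: Grader
) -> Optional[list[str]]:
    """Return graded-good documents, trying the fallback once, else None."""
    for retrieve in (primary, fallback):
        good = [doc for doc in retrieve(question) if grade(question, doc)]
        if good:
            return good
    return None


def respond(question, primary, fallback, grade, generate) -> str:
    docs = corrective_retrieve(question, primary, fallback, grade)
    if docs is None:
        # The grade gates the answer: no good evidence, no generated answer.
        return "I do not have a good answer for that."
    return generate(question, docs)


# Toy grader for illustration: relevant if query and doc share two or more words.
def keyword_grade(query: str, doc: str) -> bool:
    return len(set(query.lower().split()) & set(doc.lower().split())) >= 2


labeled_pairs = [
    ("refund window length", "the refund window is 30 days", True),
    ("refund window length", "shipping takes 5 days", False),
]
print(grader_agreement(keyword_grade, labeled_pairs))

weak = lambda q: ["shipping takes 5 days"]
better = lambda q: ["the refund window is 30 days"]
empty = lambda q: []
print(corrective_retrieve("refund window length", weak, better, keyword_grade))
print(respond("refund window length", weak, empty, keyword_grade, lambda q, d: d[0]))
```

The last call is the case where both attempts come back weak, and the system says so instead of generating. How high the agreement score needs to be before you ship is a judgment call for your own data, not something this sketch decides for you.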

Self-RAG

What it is: The model decides for itself when to retrieve, and after it drafts an answer it critiques whether that answer is actually supported by the retrieved text, retrieving more if the support is missing. It is a maker-checker setup where the same model plays both roles, generating the answer and then checking its own grounding.

When to reach for it: Reach for this when you want the system to police its own grounding rather than blindly trusting that retrieved text equals a correct answer. It pairs well with tasks where the cost of an ungrounded claim is high and you want a built-in step that asks "did I actually get this from the documents, or am I filling in a gap?"

How it fails: A model grading its own answer has the same blind spots that produced the answer. If it was confident enough to write an unsupported sentence, it is often confident enough to approve that same sentence on review. Self-critique that runs on the model's general sense of "this seems right" tends to rubber-stamp the model's own work, because confidence and correctness are not the same thing and the model cannot tell them apart from the inside.

How to fix it: Anchor the self-check to the retrieved text, not to the model's confidence. The useful version of the critique asks a concrete, checkable question for each claim: which retrieved passage supports this sentence, and quote it. A claim that cannot be tied to a specific span in the retrieved context fails the check and either gets cut or triggers another retrieval. This is the same maker-checker rule that shows up everywhere in this material: a checker is only worth having if it is grounded in something outside the maker's own judgment. Evaluation-driven development makes the case at length. Grounding the check against the retrieved spans is what turns Self-RAG from a confidence echo into a real check.
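As a sketch of what "which passage supports this, and quote it" can look like in code, assume the model returns each claim together with the span it says supports it. The `Claim` structure and `check_claims` helper are illustrative. Note the limit of this check: it confirms the quote really appears in the retrieved text, not that the quote actually entails the sentence, so it is a floor rather than a full verifier.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    sentence: str
    quote: str  # span the model says supports the sentence; "" if it gave none


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not break matching."""
    return " ".join(text.lower().split())


def check_claims(
    claims: list[Claim], passages: list[str]
) -> tuple[list[Claim], list[Claim]]:
    """Split claims into those whose quote appears in a retrieved passage and the rest."""
    haystacks = [normalize(p) for p in passages]
    supported: list[Claim] = []
    unsupported: list[Claim] = []
    for claim in claims:
        quote = normalize(claim.quote)
        if quote and any(quote in h for h in haystacks):
            supported.append(claim)
        else:
            unsupported.append(claim)  # cut it, or trigger another retrieval
    return supported, unsupported


passages = ["Late returns are charged a restocking rate of 4 percent."]
claims = [
    Claim("The late return rate is 4 percent.", "restocking rate of 4 percent"),
    Claim("Refunds take 3 days.", ""),
    Claim("The rate rises to 5 percent.", "rate of 5 percent"),
]
supported, unsupported = check_claims(claims, passages)
print([c.sentence for c in supported])
print([c.sentence for c in unsupported])
```

The third claim is the interesting one: the model supplied a quote, but the quote is not in the retrieved text, so it fails. The check runs on string matching against the passages, which is outside the model's own judgment.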

GraphRAG

What it is: You build a knowledge graph out of your documents first, pulling out the entities (people, products, policies, dates) and the relationships between them, and then you retrieve over that graph instead of, or alongside, the raw chunks. Because the graph captures connections, you can answer questions that need facts stitched together from many sources, like "how are these two things related" or "what connects all the customers who churned in March."

When to reach for it: Reach for this when the answer is not sitting in any single passage and instead has to be synthesized across the corpus. Plain vector search retrieves the chunks most similar to the question, which is great for "what does the policy say about X" and useless for "trace the chain of approvals that led to this decision." If your hard questions are about relationships and connections spanning many documents, the graph is what makes them answerable.

How it fails: Building and maintaining the graph is expensive. Extracting entities and relationships from a large corpus usually means a lot of model calls up front, and the graph goes stale the moment your documents change, so you are signing up for ongoing extraction and updates, not a one-time build. The other failure is reaching for it when you did not need it. For straightforward lookups, a knowledge graph is heavy machinery that adds cost, latency, and a whole new thing that can drift out of date, and it buys you nothing a plain similarity search would not have delivered faster.

How to fix it: Match the tool to the question shape. Use GraphRAG for the connection-and-synthesis questions and keep plain retrieval for the lookups, often in the same system, routing to the graph only when the question needs it. Budget for keeping the graph fresh from the start, with an update path that runs when documents change rather than a rebuild you keep putting off, because a stale graph quietly returns relationships that no longer hold. If your questions are mostly simple lookups, skip the graph and save yourself the maintenance.
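To make the connection-question case concrete, here is a toy sketch of retrieval over a graph. It assumes the entity and relationship extraction has already happened and produced (subject, relation, object, source document) triples. The triples, entity names, and file names below are made up for illustration, and a real system would likely use a graph store rather than a dictionary.

```python
from collections import deque
from typing import Optional

Triple = tuple[str, str, str, str]  # (subject, relation, object, source_doc)


def build_graph(triples: list[Triple]) -> dict[str, list[tuple[str, str, str]]]:
    """Adjacency list keyed by subject; every entity gets a node, even with no out-edges."""
    graph: dict[str, list[tuple[str, str, str]]] = {}
    for subj, rel, obj, src in triples:
        graph.setdefault(subj, []).append((rel, obj, src))
        graph.setdefault(obj, [])
    return graph


def find_chain(graph, start: str, goal: str) -> Optional[list[Triple]]:
    """Breadth-first search along directed edges; returns the hops linking start to goal."""
    if start not in graph:
        return None
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, obj, src in graph[node]:
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(node, rel, obj, src)]))
    return None


triples = [
    ("budget request", "approved_by", "team lead", "memo-1.txt"),
    ("team lead", "escalated_to", "finance director", "email-7.txt"),
    ("finance director", "signed", "final decision", "minutes-3.txt"),
]
graph = build_graph(triples)
chain = find_chain(graph, "budget request", "final decision")
for hop in chain:
    print(hop)
```

Each hop carries its source document, so the answer to "trace the chain of approvals" can cite three different files that no single similarity search would have returned together. It also shows why staleness matters: if `email-7.txt` changes and the triple is not updated, the chain still comes back, looking just as confident.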

Evaluating RAG

What it is: You measure whether the RAG system is working instead of reading a few answers and deciding it looks fine. Eyeballing a handful of outputs feels like checking, but it cannot tell you whether a change helped, hurt, or did nothing, because you only saw the cases you happened to look at. Evaluation replaces that with a score you can track.

When to reach for it: Always, and especially the moment you start changing anything. Every recipe above is a modification: a new grader, a retrieval loop, a graph, a context-prepending step. You cannot know whether any of them improved the system without a measurement to compare before and after. The first time you tune a chunk size or swap an embedding model and have no way to tell if the answers got better, you have found the reason this section exists.

How it fails: The failure here is not having any measurement, which leaves you guessing. You make a change, the system feels a bit better, you ship it, and you have no idea whether you fixed the thing or broke something else you did not look at. The related failure is measuring the wrong thing: scoring how fluent the answers read while ignoring whether they are grounded in the retrieved text, so a smooth-sounding hallucination passes and a clunky correct answer fails.

How to fix it: Build a small test set of real questions where you already know the good sources and the right answer, then score the system against it on three plain questions. First, was the right context retrieved (retrieval quality): did the chunks the system pulled actually contain the answer. Second, does the answer stick to the retrieved context (faithfulness, or grounding): is every claim traceable to something that was retrieved, or did the model invent the rest. Third, does the answer actually address the question (answer relevance): a grounded answer to a different question is still a miss. Score every change against that test set so you can see whether it moved the numbers. This is the verifier for RAG. Without it, every recipe in this chapter is a guess about whether you improved anything, and evaluation-driven development is the discipline of letting the verifier, not your gut, decide.
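The three questions can be turned into three numbers with very little code. This is a sketch under assumptions: `run_rag` stands in for your pipeline, and the faithfulness and relevance judges are passed in as callables because in practice they are usually a model-based or human check. The substring judges below are toy stand-ins that only make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    good_source_ids: set[str]  # sources you already know contain the answer
    reference_answer: str


@dataclass
class RagOutput:
    retrieved_ids: list[str]
    retrieved_texts: list[str]
    answer: str


def evaluate(
    cases: list[EvalCase],
    run_rag: Callable[[str], RagOutput],
    is_faithful: Callable[[str, list[str]], bool],
    is_relevant: Callable[[EvalCase, str], bool],
) -> dict[str, float]:
    """Score retrieval quality, faithfulness, and answer relevance over a test set."""
    if not cases:
        raise ValueError("test set is empty")
    hits = faithful = relevant = 0
    for case in cases:
        out = run_rag(case.question)
        hits += bool(case.good_source_ids & set(out.retrieved_ids))
        faithful += bool(is_faithful(out.answer, out.retrieved_texts))
        relevant += bool(is_relevant(case, out.answer))
    n = len(cases)
    return {
        "retrieval_hit_rate": hits / n,
        "faithfulness": faithful / n,
        "answer_relevance": relevant / n,
    }


# Toy pipeline and judges for illustration only.
canned = {
    "refund window?": RagOutput(["policy-2024"], ["The refund window is 30 days."], "30 days"),
    "late fee?": RagOutput(["shipping-faq"], ["Shipping takes 5 days."], "4 percent"),
}
cases = [
    EvalCase("refund window?", {"policy-2024"}, "30 days"),
    EvalCase("late fee?", {"fees-2024"}, "4 percent"),
]
scores = evaluate(
    cases,
    run_rag=lambda q: canned[q],
    is_faithful=lambda answer, texts: any(answer in t for t in texts),
    is_relevant=lambda case, answer: case.reference_answer in answer,
)
print(scores)
```

The second case is why you track the scores separately. The answer matches the reference, so relevance passes, but the right source was never retrieved and nothing in the context supports the claim. A single "looks correct" score would have hidden that the system got there by guessing.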

Pulling it together

The pattern across all six recipes is the same. Once retrieval stops being a single flat retrieve-then-answer step, it becomes a loop, and a loop is only as good as its check and its stopping rule. Agentic RAG needs a cap on retrieval rounds and a done-condition. CRAG and Self-RAG are maker-checker for retrieval, and the checker earns its keep only when it is grounded in the retrieved text rather than the model's confidence. Contextual retrieval and GraphRAG change what gets retrieved, which raises the ceiling on what the system can answer but does nothing to tell you whether you hit it.

That last part is what evaluation handles. It is the verifier that turns all of this from clever ideas into a system you can trust and improve on purpose. The LoopRails framework makes the general version of the argument: any loop you let run needs a way to grade itself, guard against the bad cases, show you what it did, and prove it works. RAG is just one more loop, and these recipes are how you keep it on the rails.
