Agent Workflow Patterns: A Recipe Book

By Brenn Hill · June 23, 2026

There is one fork that decides almost everything about how an LLM system behaves, and it is worth being clear about before any code gets written. A workflow follows a fixed path that you wrote. The steps are in a known order, the model fills in the parts you asked it to fill in, and the control flow belongs to you. An agent is the other thing: the model decides what to do next, picks its own tools, and chooses when it is done. Both can be useful. They are not the same animal, and confusing them is how you end up with a system you cannot debug.

This chapter is about the workflow side. These are the patterns where you hold the steering wheel. You decide the path, you decide the checks, and the model does the work inside the lines you drew. The autonomous patterns, the ones where the model drives, live in a separate chapter (autonomous agent patterns), because they come with a different set of failure modes and a different set of guardrails.

Why start here? Because workflows are predictable in a way agents are not. You can look at a workflow and trace exactly what will happen. You can put a check between any two steps. You can swap one model for a cheaper one and measure the difference. The cost is bounded, the latency is roughly knowable, and when something breaks you can point at the step that broke it. Agents trade all of that away for flexibility. Sometimes the trade is worth it. Often it is not, and a fixed path would have done the job for a tenth of the trouble.

The LoopRails stance is simple: start with the simplest workflow that does the job, and add model-driven autonomy only when the task genuinely needs it. Most things people reach for an agent to do are actually a short chain of steps in a known order. If you can write the steps down, write them down. Reach for an agent when you cannot, and not before. Each recipe below follows the same shape: what it is, when to reach for it, how it fails, and how to fix the failures.

The recipes

Prompt chaining

What it is: You break a task into an ordered set of steps, and each step's output becomes the next step's input. Draft, then critique, then rewrite. Or extract, then transform, then format. The path is fixed and you wrote it.

When to reach for it: Use a chain when the task splits cleanly into stages that have to happen in order, and where doing it in one giant prompt would ask the model to juggle too much at once. Chaining trades a little latency for a lot of accuracy on each step, because each prompt is smaller and more focused. It also gives you a natural place to drop a check between steps, which is the real reason to bother.

How it fails: Errors compound. A small mistake in step one becomes the trusted input to step two, and step two builds on it as if it were correct, and by step four you have a confident answer assembled on a bad foundation. The model does not know that the thing it received was wrong, so it does not flag it. The classic version of this is a missing or weak check between steps: you wired the output of one prompt straight into the next with nothing in between to ask "is this even sane?" The chain runs to completion and produces garbage that looks finished.

How to fix it: Put a check between steps, and make it a real one. After the step that extracts data, validate that the data has the fields you expected and the values are in range. After the step that drafts, run a gate that asks whether the draft actually addresses the request before you spend a model call rewriting it. The check does not have to be another LLM call; often a plain assertion or a schema validation is stronger and cheaper. When a check fails, stop the chain rather than feeding bad input forward. A loud failure at step two is much cheaper than a confident wrong answer at step five. For the larger discipline of putting the check first, see evaluation-driven development.
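
The advice above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Step`, `run_chain`, `ChainStopped`, and the invoice functions are hypothetical names invented for the example, and the two "model calls" are stubs that return canned values. The point it shows is the shape: every step carries its own plain, non-LLM check, and a failed check stops the chain instead of feeding bad output forward.

```python
from dataclasses import dataclass
from typing import Callable


class ChainStopped(Exception):
    """Raised when a check between steps fails, so bad output never feeds forward."""


@dataclass
class Step:
    name: str
    run: Callable[[object], object]   # the model call (stubbed in this sketch)
    check: Callable[[object], bool]   # a plain assertion, no LLM involved


def run_chain(steps: list[Step], payload: object) -> object:
    """Run steps in order; stop loudly at the first failed check."""
    for step in steps:
        payload = step.run(payload)
        if not step.check(payload):
            raise ChainStopped(f"check failed after step '{step.name}'")
    return payload


# Hypothetical stand-ins for model calls.
def extract_invoice(text: str) -> dict:
    return {"customer": "Acme", "amount": -40.0}  # deliberately out of range


def format_invoice(data: dict) -> str:
    return f"{data['customer']} owes {data['amount']:.2f}"


def valid_invoice(data: object) -> bool:
    """Schema-style check: expected fields are present and the value is in range."""
    return (
        isinstance(data, dict)
        and data.keys() >= {"customer", "amount"}
        and 0 <= data["amount"] <= 1_000_000
    )


steps = [
    Step("extract", extract_invoice, valid_invoice),
    Step("format", format_invoice, lambda s: isinstance(s, str) and len(s) > 0),
]

try:
    run_chain(steps, "raw invoice text")
except ChainStopped as err:
    print(err)  # the chain stops at the extract step; format never runs
```

The design choice worth noticing is that the check lives next to the step it guards, so adding a step without a check is something you have to do on purpose.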

Routing

What it is: You classify the input first, then send it down the path that fits. Easy questions go to a cheap, fast model; hard ones go to the strong model. Refund requests go to the refund flow; billing questions go to the billing flow. The classifier decides the route, and each route is tuned for its kind of work.

When to reach for it: Routing pays off when your inputs fall into clearly different buckets that want different handling, and lumping them together hurts every bucket. A single prompt that tries to be good at both trivial and hard cases tends to be mediocre at both, and expensive on the trivial ones. Splitting the traffic lets you spend big where it matters and stay cheap where it does not.

How it fails: Misrouting. The classifier sends a hard question to the cheap model, which answers confidently and wrong, and nobody downstream knows the route was bad. The cost of a wrong route is asymmetric: routing a hard case to the cheap path produces a bad answer, while routing an easy case to the expensive path just wastes money. The first one hurts more and is harder to notice. Routers also rot. The categories you defined six months ago stop matching the traffic you actually get, and a growing slice of inputs falls into a bucket that was never meant to hold them.

How to fix it: Pick a default route that is safe when the classifier is unsure, and bias toward it. If confidence is low, send the input to the stronger path rather than guessing cheap. Log the route alongside the outcome so you can measure how often each route produced a good result, and watch for the bucket that is quietly filling up with mismatched inputs. Add an explicit "none of these" route instead of forcing every input into an existing category, because a forced fit is a silent misroute. When a route leads to a consequential action, grade that action and keep a human at the ones that matter, rather than trusting the classifier to have gotten it right.
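
Here is one way the safe default and the route log might look in Python. Everything named here is an assumption made for illustration: the route labels, the `strong_model` default, and the 0.75 threshold are placeholders, and the classifier itself is left out (the sketch starts from its label and confidence). It shows three things from the paragraph above: low confidence falls back to the stronger path, "none of these" is a real route, and every decision is logged next to its outcome.

```python
from collections import defaultdict
from dataclasses import dataclass

KNOWN_ROUTES = {"refund", "billing", "none_of_these"}
SAFE_DEFAULT = "strong_model"   # hypothetical name for the stronger path
MIN_CONFIDENCE = 0.75           # assumed threshold; tune it against logged outcomes


@dataclass
class Classification:
    label: str
    confidence: float


def choose_route(c: Classification) -> str:
    """Bias toward the safe default whenever the classifier is unsure."""
    if c.label not in KNOWN_ROUTES or c.confidence < MIN_CONFIDENCE:
        return SAFE_DEFAULT
    return c.label


route_log: list[dict] = []


def record(route: str, outcome_ok: bool) -> None:
    """Log the route alongside whether the result turned out to be good."""
    route_log.append({"route": route, "ok": outcome_ok})


def success_rate_by_route(log: list[dict]) -> dict[str, float]:
    """Fraction of good outcomes per route, computed from the log."""
    totals, wins = defaultdict(int), defaultdict(int)
    for entry in log:
        totals[entry["route"]] += 1
        wins[entry["route"]] += entry["ok"]
    return {route: wins[route] / totals[route] for route in totals}


# Canned classifier outputs standing in for real traffic.
samples = [("billing", 0.93, True), ("refund", 0.41, True), ("none_of_these", 0.88, False)]
for label, confidence, ok in samples:
    record(choose_route(Classification(label, confidence)), ok)

print(success_rate_by_route(route_log))
```

Watching the per-route success rate over time is also how you would notice a router rotting: a route whose rate drifts down, or a "none of these" bucket that keeps growing, is the signal to revisit the categories.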

Parallelization

What it is: You run several model calls at the same time and combine the results. It comes in two flavors. Sectioning splits a task into independent parts that run in parallel, like checking a document for tone, for factual claims, and for policy violations all at once. Voting runs the same task several times and aggregates the answers, like asking three times whether a piece of content is safe and taking the majority.

When to reach for it: Use sectioning when the task has genuinely independent parts that do not need each other's output. Running them together in one prompt makes the model split its attention and do each part worse, while running them separately lets each call focus. Use voting when a single run is unreliable but the task is one where more samples raise your confidence, like a judgment call where you would rather have three opinions than one.

How it fails: Fan-out costs you latency and money. Voting five times is five times the spend, and the slowest of your parallel calls sets the wall-clock time, so one straggler holds up the whole batch. Sectioning fails when the parts were not actually independent: you split a task into pieces that needed to share context, and now each piece is missing something the others had, so the combined result has gaps or contradictions. Voting fails when the runs are correlated. If all of them share the same blind spot, they agree confidently and wrongly, and the majority vote launders a single error into an apparent consensus.

How to fix it: Only section along real seams. If two parts keep needing each other's output, they were one part, so keep them together. For voting, make the aggregation step deliberate rather than a blind majority: when the votes disagree, that disagreement is signal, and a split decision is a good trigger to escalate to a human or to the stronger path. Cap the fan-out at the smallest number that gives you the confidence you need, since the gains taper off fast. Set a timeout so one slow call does not stall the batch, and decide in advance what to do when a call times out: drop it, retry it, or fail the whole thing.
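
A voting step with a timeout and a deliberate aggregation rule could look roughly like this. It is a sketch under stated assumptions: `vote`, `ESCALATE`, and the 0.6 agreement threshold are hypothetical, the model call is a stub, and the timeout policy chosen here is "drop the straggler" (retrying or failing the whole batch are equally valid choices). Agreement is measured against the number of calls requested, so a dropped call counts against consensus rather than quietly shrinking the denominator.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable

ESCALATE = "escalate"  # hypothetical marker: hand off to a human or the stronger path


async def vote(
    call: Callable[[int], Awaitable[str]],
    n: int = 3,
    timeout_s: float = 5.0,
    min_agreement: float = 0.6,
) -> tuple[str, Counter]:
    """Run `call` n times in parallel and return (verdict, tally of answers)."""
    tasks = [asyncio.create_task(call(i)) for i in range(n)]
    done, pending = await asyncio.wait(tasks, timeout=timeout_s)

    # Timeout policy for this sketch: drop stragglers rather than wait for them.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    # Calls that raised are ignored; only completed answers are counted.
    tally = Counter(t.result() for t in done if t.exception() is None)
    if not tally:
        return ESCALATE, tally

    answer, count = tally.most_common(1)[0]
    # A split or thin majority is treated as signal, not as a result.
    if count / n < min_agreement:
        return ESCALATE, tally
    return answer, tally


async def demo() -> tuple[str, Counter]:
    canned = ["safe", "unsafe", "safe"]  # stand-in for three model runs

    async def safety_check(i: int) -> str:
        await asyncio.sleep(0.01)
        return canned[i]

    return await vote(safety_check, n=3, timeout_s=1.0)


print(asyncio.run(demo()))
```

Note what this does not fix: if all the runs share the same blind spot, they will still agree and the vote will still pass. The threshold only catches disagreement, which is why correlated runs need a different remedy, such as varying the prompt or the model.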

Orchestrator-workers

What it is: A lead model looks at the task, breaks it into subtasks at runtime, hands each subtask to a worker model, and then synthesizes the workers' results into a final answer. The key difference from the patterns above is that the subtasks are not fixed in advance. The orchestrator decides what they are based on the input it sees.

When to reach for it: Reach for this when you cannot write the subtasks down ahead of time because they depend on the input. A research task over a question you have not seen, or a code change that touches an unknown set of files, is a poor fit for a fixed chain, because you do not know the steps until you look at the specifics. The orchestrator's job is to figure out the breakdown that a static workflow could not encode.

How it fails: The orchestrator fragments the task badly. It splits the work along the wrong lines, hands workers subtasks that overlap or leave gaps, and then has to stitch together pieces that do not fit. Workers given a slice without enough context produce something locally reasonable and globally wrong, and the synthesis step inherits all of it. Because the orchestrator is itself a model making decisions, this is the recipe where you have quietly crossed from workflow into agent territory, and the costs climb with it: more calls, more latency, and a control flow that is harder to predict because the model invented part of it.

How to fix it: Give the orchestrator a clear contract for what a good subtask looks like, and constrain the breakdown rather than leaving it wide open. Have workers return enough about their work that the synthesis step can spot a gap or an overlap, instead of blindly concatenating. Put a check on the synthesized result against the original request, because a clean-looking assembly of flawed parts is the failure to watch for. And ask the honest question first: do you actually need a model to decide the breakdown, or do your inputs fall into a few shapes you could route and chain instead? If a fixed path covers most of your traffic, use it for that traffic and save the orchestrator for the genuinely open-ended cases.
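
To make "a clear contract" and "spot a gap or an overlap" concrete, here is a small sketch built on the code-change example from earlier in this recipe. All of it is assumed for illustration: the `Subtask` and `WorkerReport` shapes, the cap of five subtasks, and the idea of describing scope as a set of file names. No model is called; the sketch only shows the checks that would sit around the orchestrator's plan and the workers' reports.

```python
from collections import Counter
from dataclasses import dataclass

MAX_SUBTASKS = 5  # assumed cap that keeps the breakdown constrained


@dataclass
class Subtask:
    goal: str
    files: frozenset[str]


@dataclass
class WorkerReport:
    goal: str
    files_touched: frozenset[str]
    summary: str


def validate_plan(plan: list[Subtask]) -> list[str]:
    """Check the orchestrator's breakdown against the contract before any worker runs."""
    problems = []
    if not 1 <= len(plan) <= MAX_SUBTASKS:
        problems.append(f"expected 1 to {MAX_SUBTASKS} subtasks, got {len(plan)}")
    for subtask in plan:
        if not subtask.goal.strip():
            problems.append("subtask with an empty goal")
        if not subtask.files:
            problems.append(f"subtask '{subtask.goal}' names no files")
    return problems


def find_gaps_and_overlaps(
    scope: set[str], reports: list[WorkerReport]
) -> tuple[set[str], set[str]]:
    """Compare what workers report touching with what the request required."""
    counts = Counter(f for report in reports for f in report.files_touched)
    gaps = scope - set(counts)                         # in scope, touched by nobody
    overlaps = {f for f, c in counts.items() if c > 1}  # touched by more than one worker
    return gaps, overlaps


scope = {"api.py", "models.py", "tests.py"}
reports = [
    WorkerReport("rename field", frozenset({"models.py", "api.py"}), "renamed user_id"),
    WorkerReport("update callers", frozenset({"api.py"}), "updated two handlers"),
]
gaps, overlaps = find_gaps_and_overlaps(scope, reports)
print("gaps:", gaps, "overlaps:", overlaps)
```

A non-empty gap or overlap set is a reason to stop before synthesis, not a detail to smooth over in the final answer.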

Evaluator-optimizer

What it is: One model generates a candidate, a second model evaluates it against a set of criteria and sends back specific feedback, and the first model revises. You loop until the candidate passes. This is a verifier loop: the generator proposes, the evaluator decides whether it is good enough, and the cycle repeats until the standard is met or you hit a cap.

When to reach for it: Use it when you have clear criteria for a good answer, the first try usually misses some of them, and feedback genuinely helps the next try. Polishing a translation against a style guide, tightening a piece of writing against a rubric, or refining code against a set of requirements all fit, because in each case there is a checkable standard and an iteration loop that closes the gap. If you cannot say what "good" means in a way the evaluator can check, this pattern has nothing to grade against and will not help.

How it fails: The evaluator rubber-stamps. A weak evaluator, or one that shares the generator's blind spots, approves work that should have failed, and the loop terminates pleased with itself on a bad result. This gets worse when the evaluator is not independent: if the same model both writes and grades, it tends to like its own output, and you have built a loop that congratulates itself. The other failure is a gameable standard. If the criteria are vague or the generator can see exactly how the evaluator scores, the optimizer learns to satisfy the letter of the check while missing the point, the same way a student writes to the rubric instead of to the question.

How to fix it: Make the evaluator independent of the generator. Different model, different prompt, and a standard the generator was not handed verbatim, so it has to actually be good rather than just match a string. Give the evaluator a clear, checkable standard rather than a vibe: concrete criteria it can score, ideally backed by checks that are not themselves an LLM, like tests or validators that cannot be sweet-talked. Cap the iterations, because a loop chasing an evaluator it can never fully satisfy will burn budget forever, and a stalled loop should hand off to a human rather than declaring victory. The strongest version of this is the LoopRails idea of an independent verifier: the thing that grades the work sits outside the thing that does it, and the loop cannot reach in and edit its own grader. When the evaluator's verdict gates a consequential action, grade that action and keep a human at the high grades; a passing score from an automated evaluator is evidence, not a release form. The loop patterns cookbook has worked examples of verifier loops you can copy.
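
A capped loop with an evaluator that lives outside the generator might be sketched like this. The names (`refine`, `Verdict`, `LoopResult`) and the two rules are hypothetical, the generator is a stub standing in for a model call, and the evaluator here is deliberately plain code rather than an LLM, which is the "cannot be sweet-talked" kind of check described above. The generator receives only the feedback, not the evaluator's source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: list[str]


@dataclass
class LoopResult:
    candidate: str
    status: str   # "passed" or "needs_human"
    rounds: int


def refine(
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str], Verdict],
    task: str,
    max_rounds: int = 3,
) -> LoopResult:
    """Generate, evaluate, revise; stop on a pass or when the cap is hit."""
    feedback: list[str] = []
    candidate = ""
    for round_no in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        verdict = evaluate(candidate)
        if verdict.passed:
            return LoopResult(candidate, "passed", round_no)
        feedback = verdict.feedback
    # A stalled loop hands off to a human instead of declaring victory.
    return LoopResult(candidate, "needs_human", max_rounds)


FILLER = {"really", "very", "extremely"}


def rule_evaluator(candidate: str) -> Verdict:
    """Deterministic checks; assumed rules for this example only."""
    feedback = []
    words = {w.strip(".,()").lower() for w in candidate.split()}
    if words & FILLER:
        feedback.append("remove filler words")
    if "v2" not in candidate:
        feedback.append("mention the version")
    return Verdict(not feedback, feedback)


def stub_generator(task: str, feedback: list[str]) -> str:
    """Stand-in for a model call that revises based on feedback."""
    draft = "Our new release is really very extremely fast."
    if "remove filler words" in feedback:
        draft = "Our new release is fast."
    if "mention the version" in feedback:
        draft = draft[:-1] + " (v2)."
    return draft


result = refine(stub_generator, rule_evaluator, "announce the release")
print(result.status, result.rounds, result.candidate)
```

Because `refine` takes the evaluator as an argument and never modifies it, the loop has no way to edit its own grader; swapping in a stricter evaluator is a one-line change at the call site.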

Choosing among them

These five are not a ladder you climb. They are tools, and the right one is the simplest that covers your case. Prompt chaining and routing are the workhorses, and most real systems are some combination of the two: route the input, then run it down a chain tuned for that route. Parallelization is an optimization you add when you have independent work or need more samples, not a starting point. Orchestrator-workers and evaluator-optimizer are where the model starts making decisions, which is where you cross from a workflow you fully control into something closer to an agent, with the extra cost and unpredictability that come with it.

A few failure modes show up across all of them, so they are worth holding in your head as you design. Errors compound whenever output feeds forward without a check, so put checks between steps and at the seams. A wrong route or a bad task breakdown is expensive precisely because nothing downstream knows it happened, so log decisions and watch the outcomes. Fan-out buys you accuracy or speed, but you pay for it in spend and in waiting on the slowest call, so cap it at what you actually need. And the evaluator that is too soft, or that the generator can game, is worse than no evaluator at all, because it tells you the work passed when it did not.

The one mistake that costs the most is over-engineering. Reaching for an orchestrator or a multi-model voting scheme when a single well-written prompt would have done the job adds latency, spend, and surface area for bugs, and buys you nothing. Before you build any of these, try the boring version: one prompt, one model, one check on the output. If that works, ship it. Add a chain when one prompt is doing too much, add a router when your inputs split, and add the model-driven patterns only when you genuinely cannot write the path down yourself. Start simple, add autonomy when the task earns it, and grade the actions any of these patterns can take (G0-G3) so a human stays on the consequential ones. For the full method behind the grading and the rails, see the framework.