LoopRails — Calibrated Oversight for Agentic AI
Keep your agents on the rails · the framework (reference edition) · method: Grade · Guard · Show · Prove
The full method and reasoning. For the quick do-this version see playbook.md; for the evidence behind every claim see codex.md. Bracketed tags like [A-17], [F-7], [E-13] point to the annotated source in the codex. Where the codex asks what is known, this document asks what to do about it.
The four moves map onto the sections below: Grade the action (§2) · Guard it (§3 mode + §4 control surfaces) · Show the human (§5 loop episode) · Prove it works (§6 defend + §7 validate). The reframe (§0) and core model (§1) come first.
Audience: anyone designing, building, or reviewing an agentic LLM system (coding agents, copilots, computer-use agents, autonomous assistants) who has to decide where, how, and how much a human should be kept in control.
0. The reframe
Most HITL design starts from the wrong question: "Should there be a human in the loop?" The codex shows that question is nearly meaningless, because two facts make a naive "yes" worthless:
- Humans are unreliable detectors of automation error. Automation bias and complacency are structural, attentional, and training-resistant; they afflict experts; and they get worse as the agent gets more reliable [F-7, F-4, E-7, D-18]. Empirically, adding approval gates reduces bad actions but barely improves the human's ability to catch one — the exposure-vs-correction gap [A-17]. A human told to "review the output" will rubber-stamp it [A-18, A-21, D-4].
- Nominal oversight is not meaningful control. A human placed in the loop without authority, awareness, ability, and time becomes a moral crumple zone [F-19, D-16] — a liability sponge that legitimizes a flawed system without improving it [D-14, D-15].
So the framework replaces the binary question with five better ones, asked per action, not per system:
- For which actions does a human need to be involved? (§2 — grade by consequence)
- In what role and at what level of control? (§3 — match an oversight mode)
- With what machinery to make that control real and to blunt the predictable failure modes? (§4 control surfaces, §6 defenses)
- When we bring the human in, what must that oversight moment contain to actually work? (§5 — design the loop episode)
- How do we know the oversight actually works rather than merely existing? (§7 — validate)
The goal, stated precisely
Meaningful human control over agent behavior, with reliance calibrated to the agent's true reliability, at a cost (friction, attention, latency) proportionate to the consequences.
"Calibrated" [F-1] and "proportionate" are the load-bearing words. Maximal oversight is as wrong as none: it manufactures alert fatigue [H-10], deskills the overseer [E-22], and trains rubber-stamping [A-17]. The art is allocation of a scarce, leaky resource.
The four A's of meaningful control
Every oversight point should be tested against four conditions (synthesized from tracking+tracing [C-20, D-13], Billings' "informed, involved, in command" [E-21], and WP29's "authority and competence to change the decision" [D-4]). If any is absent, the oversight is theater:
| The four A's | What it requires | Failure if absent |
|---|---|---|
| Authority | The human can actually stop, change, or reverse the action. | Crumple zone [D-16] |
| Awareness | The human comprehends what the agent is doing and why — situation awareness, not a log dump [E-12, G-17]. | Automation surprise [H-5] |
| Ability | The human has the competence and the time to evaluate before it's too late [H-19, E-22]. | Out-of-the-loop failure [E-13] |
| Accountability | Responsibility is traceable to an informed human, not diffused [C-20]. Felt accountability also reduces bias [F-11]. | Responsibility gap [D-15] |
1. The core model: Consequence × Controllability
Two independent axes decide the oversight strategy for any agent action.
- Consequence — how much oversight is warranted. A function of reversibility × blast radius × stakes (§2).
- Controllability — how much oversight can actually be effective. Can a human detect the problem from what's surfaced, and correct it in time? Driven by detectability, the time budget, human competence (deskilling [E-22]), model steerability [A-19], and reviewer span [H-22].
The crucial insight — and the thing most real systems get wrong — lives in the interaction of the two axes:
| | Low controllability | High controllability |
|---|---|---|
| High consequence | REVIEW IS A TRAP (high stakes, but humans can't catch errors here) → PREVENT, don't review: constrain scope, sandbox, make reversible, pre-commit, or forbid. Reviewing is rubber-stamping. [A-17, F-7] | GENUINE OVERSIGHT (high stakes, and humans can catch/correct in time) → invest in real review: plan-approve, co-execute, forcing functions, strong awareness displays. [F-21, A-20, E-13] |
| Low consequence | LET IT RUN (low stakes, low control) → autonomous + logged; don't spend attention here. [budget interruptions, H-21] | LIGHT TOUCH (low stakes, easy to catch) → autonomous + notify/undo; cheap reversibility is enough. [G-14] |
The top-left quadrant is the one that kills you. When consequences are high but the human cannot reliably detect or correct the error in time, adding a confirmation gate does not produce safety — it produces a rubber stamp and a scapegoat. The recognition bottleneck [A-17] and automation bias [F-9, D-18] guarantee it. The only valid moves there are prevention strategies that don't depend on the human being a good detector:
- shrink consequence → make the action reversible/undoable [G-14, A-4], or narrow its blast radius;
- raise controllability → improve detectability (verification-oriented evidence, not persuasion [F-20]), buy time (stage the action), or force engagement (decide-before-seeing [F-21]);
- remove the action → constrain scope, sandbox so the worst case is contained [A-23], or forbid it and escalate.
Everything in §2–§7 is machinery for moving actions out of the top-left quadrant.
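The quadrant logic can be written down as a small lookup. The following is a minimal sketch, assuming each axis has already been reduced to a boolean; the function name `oversight_strategy` and the returned labels are illustrative and do not come from the codex.

```python
def oversight_strategy(high_consequence: bool, high_controllability: bool) -> str:
    """Return the quadrant strategy from the Consequence x Controllability model."""
    if high_consequence and not high_controllability:
        # Top-left quadrant: review is a trap, so do not depend on human detection.
        return "prevent"
    if high_consequence and high_controllability:
        # Top-right quadrant: real review is worth the investment.
        return "genuine-oversight"
    if high_controllability:
        # Bottom-right quadrant: notify and rely on cheap undo.
        return "light-touch"
    # Bottom-left quadrant: autonomous and logged.
    return "let-it-run"
```

In practice both axes are graded rather than binary; the sketch only makes the point that the strategy depends on the pair, not on consequence alone.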
2. Step one — GRADE the action by consequence
You cannot allocate oversight until you know what each agent action is worth. Grade every capability the agent has (every tool, every action class) on three dimensions:
| Dimension | Low (0) | Medium (1) | High (2) |
|---|---|---|---|
| Reversibility | One-click undo / checkpointed [A-4] | Recoverable with effort | Irreversible (sent, paid, deleted, published, executed) |
| Blast radius | Self / local workspace | Shared / team state | External / third parties / public |
| Stakes | Trivial | Meaningful (money, time, minor harm) | Severe (safety, legal, security, finance, reputation) |
The consequence grade is driven by the highest-scoring dimension (a single irreversible, external action is critical no matter how the others score):
| Grade | Definition | Examples |
|---|---|---|
| G0 — Trivial | all dimensions low | read a file, run a read-only query, draft text locally |
| G1 — Low | at most one medium, none high | edit a local file, install a dev dependency, run tests |
| G2 — High | any one dimension high, or ≥two medium | git push, modify shared infra, send an internal message, spend within a budget |
| G3 — Critical | irreversible and (external or severe stakes) | delete prod data, deploy to prod, send external email, execute a payment, run rm -rf, post publicly, take a legally significant decision [D-2] |
The shell boundary trap [A-4]. Checkpoint/undo typically covers an agent's structured edits but not shell side effects (`rm`, `mv`, `cp`, network calls). Grade by actual reversibility, not by whether your framework has a "rewind" button: a Bash command that deletes files is G2–G3 even if your editor has undo.
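The grading rules in the two tables above can be sketched as a function. This is one possible reading, assuming the 0/1/2 scores from the dimension table; the names `Score`, `ActionProfile`, and `consequence_grade` are hypothetical, and the scores assigned to any real action remain a judgment call.

```python
from dataclasses import dataclass
from enum import IntEnum


class Score(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass(frozen=True)
class ActionProfile:
    reversibility: Score  # HIGH means irreversible
    blast_radius: Score   # HIGH means external / third parties / public
    stakes: Score         # HIGH means severe


def consequence_grade(p: ActionProfile) -> str:
    """Map the three dimension scores to G0-G3 using the grade definitions."""
    scores = (p.reversibility, p.blast_radius, p.stakes)
    highs = sum(s == Score.HIGH for s in scores)
    mediums = sum(s == Score.MEDIUM for s in scores)
    irreversible = p.reversibility == Score.HIGH
    # G3: irreversible and (external or severe stakes).
    if irreversible and (p.blast_radius == Score.HIGH or p.stakes == Score.HIGH):
        return "G3"
    # G2: any one dimension high, or two or more medium.
    if highs >= 1 or mediums >= 2:
        return "G2"
    # G1: exactly one medium, none high.
    if mediums == 1:
        return "G1"
    # G0: all dimensions low.
    return "G0"
```

For example, a profile scored medium on all three dimensions (an assumed scoring for something like a push to a shared branch) grades as G2, while an irreversible action with an external blast radius grades as G3 whatever its stakes score.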
3. Step two — MATCH an oversight mode
Oversight has two dials, not one (from Sheridan's levels [E-4] and Parasuraman's stages [E-6], made concrete with modern permission practice [A-2, A-5]):
The stage dial — which part of the agent's processing is overseen. Automate the early stages freely; gate the late ones, where over-trust and lost authority bite hardest [E-6]:
acquire information → analyze → recommend/decide → act
The level dial — how much the human is in control at the gated stage:
| Level | Mode | What happens | Human role |
|---|---|---|---|
| L0 | Autonomous, silent | Agent acts; nothing surfaced | none |
| L1 | Autonomous, logged | Agent acts; recorded for audit | post-hoc audit |
| L2 | Notify-after | Agent acts; actively surfaces it; cheap undo | react / undo [G-14] |
| L3 | Confirm-before | Agent proposes one action; blocks for approve / edit / reject / respond [A-8] | gatekeeper |
| L4 | Plan-approve | Agent proposes a multi-step plan; human approves before execution; checkpoints between steps [A-20] | planner / approver |
| L5 | Co-execute (forcing) | Human pre-commits or decides key steps before seeing the agent's answer [F-21] | co-driver |
| L6 | Escalate / forbid | Agent must hand to a human or must not act [A-15, D-12] | decision owner |
Default mapping (grade → mode)
| Grade | Default level for the act stage | Rationale |
|---|---|---|
| G0 | L0–L1 | Spending attention here is waste; it manufactures alert fatigue [H-21, F-17] |
| G1 | L1–L2 | Reversibility + notification is sufficient; undo beats confirmation [G-14] |
| G2 | L3–L4 | Block for approval; show a preview/diff; plan-approve multi-step work |
| G3 | L4–L5 plus prevention — and L6 if controllability is low | Review alone is a trap at this grade [A-17]; see §1 top-left quadrant |
This is the default. The controllability overlay then adjusts it:
- Low detectability (the human can't tell right from wrong from what's shown) → don't rely on L3/L4 review; move to prevention (§1) or invest in verification-oriented evidence [F-20].
- Low time budget (no time to intervene before harm) → review modes are useless [H-19]; make the action reversible or stage it so there is time.
- Low model steerability (the agent can't actually be interrupted/corrected mid-task [A-19]) → prefer plan-approve before execution (L4) over mid-flight steering you can't trust.
- High reviewer load / fan-out (one human, many agents [H-22]) → you cannot put every action at L3; reserve human attention for G2–G3 and pool oversight.
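The default mapping and the first two overlay rules could be encoded roughly as follows. This is a sketch under stated assumptions: `ModeDecision` and `match_mode` are hypothetical names, controllability is reduced to two booleans (detectability and time budget), and the steerability and reviewer-load adjustments are left out.

```python
from typing import NamedTuple


class ModeDecision(NamedTuple):
    levels: tuple[str, ...]   # acceptable oversight levels for the act stage
    needs_prevention: bool    # True when review alone must not be the safeguard


# Default grade -> level range, copied from the mapping table.
DEFAULT_LEVELS = {
    "G0": ("L0", "L1"),
    "G1": ("L1", "L2"),
    "G2": ("L3", "L4"),
    "G3": ("L4", "L5"),
}


def match_mode(grade: str, detectable: bool = True,
               time_to_intervene: bool = True) -> ModeDecision:
    """Apply the default mapping, then the controllability overlay."""
    controllable = detectable and time_to_intervene
    if grade == "G3":
        # G3 always carries prevention; with low controllability it goes to L6.
        levels = DEFAULT_LEVELS["G3"] if controllable else ("L6",)
        return ModeDecision(levels, needs_prevention=True)
    if grade == "G2" and not controllable:
        # Review modes cannot be relied on here; flag for prevention measures.
        return ModeDecision(DEFAULT_LEVELS["G2"], needs_prevention=True)
    return ModeDecision(DEFAULT_LEVELS[grade], needs_prevention=False)
```

Returning a range rather than a single level leaves the final choice to the designer, which matches the table's "L3–L4" style of default.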
Make it dynamic, not static
A fixed autonomy setting is wrong; the right level changes with context [E-14, A-2]. Escalate the level automatically on:
- uncertainty — the agent's own confidence is low, or it's outside its tested domain [A-6, F-1];
- novelty / drift — an action unlike what was approved before, or scope creep [A-18];
- trifecta exposure — the action newly combines private data + untrusted input + external comms [A-23] (see §6);
- accumulated blast radius — many small G1 actions composing into a G2 effect.
And always make stepping down to a lower level easy and visible — the "children of the magenta line" failure is riding maximum autonomy into a corner with no graceful way back to manual [H-6].
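One way to sketch the escalation triggers is a function that raises the level to a floor whenever any trigger fires. The trigger names, the `escalate` function, and the choice of L3 as the floor are all assumptions made for illustration; the framework only says the level should rise, not by how much.

```python
LEVELS = ["L0", "L1", "L2", "L3", "L4", "L5", "L6"]

# Hypothetical trigger labels for the four conditions listed above.
KNOWN_TRIGGERS = {"uncertainty", "novelty", "trifecta", "accumulated_blast_radius"}


def escalate(base_level: str, triggers: set[str], floor: str = "L3") -> str:
    """Return the oversight level after applying dynamic escalation triggers."""
    unknown = triggers - KNOWN_TRIGGERS
    if unknown:
        raise ValueError(f"unknown triggers: {sorted(unknown)}")
    if not triggers:
        return base_level
    # Never lower the level: take whichever of base and floor is stricter.
    return max(base_level, floor, key=LEVELS.index)
```

Stepping back down is deliberately not automated in this sketch; that transition should stay an explicit, visible human choice.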
Brief by intent, not by steps
The military/management answer to delegation transfers directly. Use commander's intent + mission orders — state the purpose, desired end-state, and hard constraints; specify results to attain, not how [K-7] — and run management by exception: handle the routine autonomously, escalate only deviations [K-1], bounded by a small set of pre-authorized "rules of engagement" [K-12]. This is the principal–agent bargain made concrete [J-2]: oversight is costly monitoring, so monitor the exceptions, not everything — and remember that some duties are simply non-delegable and must keep a human decision-maker as a matter of law [J-15].
4. Step three — INSTRUMENT the control surfaces
Seven concrete mechanisms implement the modes above. Each has a right time and a documented pitfall.
1. Permissioning / gating: allow/deny/ask rules, risk-tiered tools [A-2, A-6]. Use for: enforcing the grade→mode mapping. Pitfall: precedence and inheritance are load-bearing and subtle (allow-lists don't constrain bypass mode; subagents inherit autonomy) [A-2]; string-matching denylists and client-side approvals are not security boundaries, because they are bypassable/forgeable [A-14, A-12]. Gate for UX; sandbox for security.
2. Preview / diff / dry-run: show the concrete effect before committing [A-20, A-22]. Use for: G2 actions where the human can read the diff. Pitfall: a preview the human can't actually evaluate is decorative; pair with verification evidence, not just a rationale [F-20].
3. Interrupt / steer / cancel: pause, redirect, retract mid-run [A-8, A-19]. Use for: long-horizon tasks. Pitfall: models are genuinely weak at honoring mid-task changes [A-19]; treat steerability as a capability you must test, not assume, and prefer plan-approval when it's poor.
4. Checkpoint / undo / rollback: snapshot state; one-click revert [A-4]. Use for: making actions reversible so they drop a grade and you can prefer undo over confirmation [G-14]. Pitfall: the shell boundary [A-4]; know exactly what is not covered.
5. Sandbox / environment control: no-network containers, scoped/ephemeral credentials, budget caps, disposable machines [A-23, A-1]. Use for: moving the safety boundary off the per-action prompt so high-autonomy work is contained rather than reviewed. Pitfall: none major; this is the highest-leverage control, and it is underused.
6. Escalation / handoff: transfer control to a human with a generated context summary [A-15, A-6, A-10]. Use for: G3 actions, low-confidence cases, and async approval that may take hours [A-10]. Pitfall: the handoff itself is a failure point, so make it anticipatory and context-rich, not a panic dump [H-7]; budget for re-engagement latency [H-19].
7. Monitoring (incl. AI-supervising-AI): parallel supervisors and always-on monitors that can observe→intercept→escalate [A-15], plus AI-control protocols that route scarce human review to flagged actions [C-22]. Use for: scaling oversight past human span [H-22]. Pitfall: it relocates trust into the supervising AI's blind spots [B-20, C-8]; classifiers have real false-negative rates [A-3]. Defense in depth, not a silver bullet.
Layer them. No surface is sufficient alone. The mature stack is: sandbox the environment (5) → gate by grade (1) → preview/plan high grades (2,4) → keep interrupt + undo available (3,4) → monitor for drift (7) → escalate the critical (6).
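As an illustration of the gating surface, here is a minimal allow/deny/ask evaluator. It is a sketch, not a security boundary: the rule syntax, the `tool:command` action strings, and the deny > ask > allow precedence are assumptions, and pattern matching of this kind is exactly what the pitfall above says can be bypassed. Real containment belongs in the sandbox.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (verdict, glob pattern over "tool:command" strings).
RULES = [
    ("deny", "shell:rm *"),
    ("ask", "git:push*"),
    ("allow", "fs:read*"),
]

# Assumed precedence: the strictest matching verdict wins.
PRECEDENCE = ("deny", "ask", "allow")


def gate(action: str, rules=RULES, default: str = "ask") -> str:
    """Return 'deny', 'ask', or 'allow' for an action string."""
    matched = {verdict for verdict, pattern in rules if fnmatchcase(action, pattern)}
    for verdict in PRECEDENCE:
        if verdict in matched:
            return verdict
    # Nothing matched: fall back to asking rather than silently allowing.
    return default
```

Defaulting unmatched actions to "ask" keeps new capabilities from slipping to silent autonomy before they have been graded.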
These surfaces are not new — other industries hardened them first (Part IV of the codex). Jidoka's stop-on-anomaly and the andon cord [N-1, N-4] are escalation + a universal, blameless interrupt; finance's circuit breakers and kill switches [O-6, O-7] are automatic-halt + emergency-stop; four-eyes / maker-checker and the two-person rule [O-1, N-15] are two-party approval for high grades (the proposer must not be the approver); poka-yoke, lockout-tagout, and least-privilege / capabilities [N-6, N-17, O-10, O-11] are capability-removal — make bad actions impossible, not merely discouraged (a denylist the agent can evade is policy, not a boundary [A-14]); permit-to-work [N-13] is scoped, time-boxed, named privilege elevation; and runtime shields / the Simplex architecture [O-17, O-18] are the verified monitor that vetoes the agent even under prompt injection (the confused-deputy problem [O-11, O-12]). Borrow their hard-won detail; don't reinvent it.
5. Step four — DESIGN THE LOOP EPISODE (the anatomy of a single oversight moment)
Steps §2–§4 decide which actions get a human and at what mode. This step designs the moment itself — and it is where most real oversight quietly fails. The recognition bottleneck [A-17] means putting a human in the loop does not make them a good detector; the episode has to be engineered to give them clarity, context, detection affordances, and a respected attention budget, or it collapses into a rubber stamp.
This is exactly the move in the one prior HITL framework, Cranor (2008) [O-9]: keep humans out of the loop wherever you can safely automate; where you must keep them, engineer the interaction so they can actually succeed. Her six failure points for a human in a security loop — delivery, attention, comprehension, application, motivation, capability — are the same ones that sink an agent approval prompt: the prompt is buried, attention has glazed over, the human can't tell what the agent will actually do, they're asked to vet something they can't verify, or approval fatigue has set in. A modern loop episode is Cranor's framework extended from a one-shot warning to an ongoing supervisory relationship.
The anatomy of a well-designed loop episode
Ordered. Each element supplies something that is usually missing from an oversight moment — the four most-missed are flagged.
Decide whether to interrupt at all (attention budget). ⚠️anti-fatigue — An interruption is never free: it raises stress and error even when output survives [Q-23, Q-24], and habituation sets in by the second identical prompt [O-16]. Defer non-urgent asks to a task breakpoint [Q-28], batch them, and cap the interruptive rate; every blocking prompt must demand a specific response or it's noise [H-21]. Over-prompting is the dominant real-world way oversight dies — clinicians override 49–96% of alerts [H-10].
State the request clearly. ⚠️clarity-of-request — Exactly what is being asked: scoped, unambiguous, in the user's terms, with what the agent can and can't do made explicit [Q-1, G-7]. Narrow the Gulf of Execution [G-11].
Show the consequences before the human acts (feedforward). ⚠️clarity-of-consequence (the top missing piece): preview the concrete effect of each option, including what approving does, downstream side effects, and crucially whether it is reversible [Q-3, Q-4]. Agents demonstrably propose irreversible, harmful actions ([Q-5]: even the safest agent fails ~24% on a high-stakes tool benchmark), so consequence + reversibility preview is necessary, not decorative.
Surface calibrated uncertainty. Show the agent's confidence — calibrated and carefully framed, since raw numbers are misread [Q-20, Q-21]; low confidence is itself an escalation trigger.
Give the provenance: "how did this even get to me?" ⚠️context (the second missing piece): the upstream trail that restores the overseer's situation awareness, covering what the agent perceived, what it considered and rejected, why it escalated, and the steps/tool-calls that led here [Q-7, Q-9, Q-12]. Structure it (data lineage vs. interaction history vs. rationale), ideally as a queryable provenance record [Q-11, Q-13]. Without it the decision-point human is out-of-the-loop and cannot competently intervene [Q-10, E-13].
Build in error-detection affordances — support checking, not trusting. ⚠️detection — Contrastive "why / why-not" explanations [Q-16] and diffs framed for verification; surface disagreement, counter-evidence, and the agent's own doubts. Beware the trap: explanations framed to persuade increase acceptance regardless of correctness [F-20, Q-20] — design them to help the human find the error, not to sell the answer. Use progressive disclosure (summary first, full plan one click away) [Q-22].
Add proportionate friction — a microboundary, not sludge. A beat of forced reflection (decide-before-seeing / cognitive forcing) only before consequential, irreversible actions [Q-19, Q-35]; audit that the friction is warranted by the stakes [Q-34]. Indiscriminate friction is sludge and breeds approval fatigue [Q-33].
Use bias-safe choice architecture — and no dark patterns. An approval prompt is choice architecture; there is no neutral UI [Q-31]. Make the safe/reversible option the default, and never auto-approve on timeout [Q-32]. Explicitly exclude the dark-pattern repertoire: no confirmshaming the "reject," no false urgency, no obstructing the path to deny/undo [Q-36, Q-39].
Preserve accountability after the act. Keep the human the accountable initiator (internal locus of control [Q-4]) and log the decision into the provenance record for audit [Q-11, C-20].
The compression. If you do only four things, do the four usually missing: (1) show consequences + reversibility before acting; (2) show the provenance that answers "how did this get to me"; (3) give detection affordances (why/why-not + uncertainty, framed to check not to sell); (4) respect the attention budget — interrupt rarely, at breakpoints, with high precision. The first two are clarity, the third is detection, the fourth is anti-fatigue.
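The anatomy above can be treated as a schema that an approval prompt must satisfy before it is shown. The sketch below is one hypothetical encoding; `LoopEpisode`, its field names, and `missing_elements` are illustrative, and the check only tests that each element is present, not that it is any good.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopEpisode:
    request: str = ""                                              # what is being asked
    consequences: dict[str, str] = field(default_factory=dict)     # option -> concrete effect
    reversible: Optional[bool] = None                              # None means "not stated"
    provenance: list[str] = field(default_factory=list)            # steps that led here
    evidence: list[str] = field(default_factory=list)              # verification-oriented evidence
    confidence: Optional[float] = None                             # calibrated, if available
    default_option: str = "reject"                                 # safe default
    auto_approve_on_timeout: bool = False


def missing_elements(ep: LoopEpisode) -> list[str]:
    """List the anatomy elements this episode fails to supply."""
    problems = []
    if not ep.request.strip():
        problems.append("request")
    if not ep.consequences:
        problems.append("consequences")
    if ep.reversible is None:
        problems.append("reversibility")
    if not ep.provenance:
        problems.append("provenance")
    if not ep.evidence:
        problems.append("detection evidence")
    if ep.auto_approve_on_timeout:
        problems.append("auto-approve on timeout")
    return problems
```

A prompt that carries only the ask, such as `LoopEpisode(request="Approve push?")`, comes back with consequences, reversibility, provenance, and detection evidence all missing, which is the rubber-stamp shape this section warns against.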
6. Step five — DEFEND against the predictable failure modes
These failures are predictable, so design against each one explicitly rather than hoping.
| Failure mode | Codex source | Concrete defenses |
|---|---|---|
| Automation bias / rubber-stamping — humans accept agent output uncritically | [F-9, A-17, A-21, D-18] | Don't rely on passive review for high grades; use cognitive forcing functions (decide before seeing the answer) [F-21]; show verifiable evidence, not persuasive rationale [F-20]; assign felt accountability [F-11]; let users experience the agent being wrong during onboarding [F-12] |
| Alert fatigue — too many low-value prompts → users dismiss all, including valid ones | [H-10, F-17, H-21] | Budget interruptions: every blocking prompt must demand a specific response [H-21]; cap the interruptive rate; most signals passive/logged; minimize false alarms (they corrode trust globally [F-17]); never fix fatigue by muting the noisiest category [H-11] |
| Complacency from reliability — better agent → less monitoring | [F-4, E-18, E-7] | Treat capability and complacency as rising together; keep humans actively engaged at decision points, not passive monitors [E-13]; vary/expose the agent's failure cases |
| Deskilling — overseer loses the expertise to oversee | [E-22, H-9] | Build in deliberate human-only practice; rotate humans through manual work; recognize a reviewer who never does the task can't review the agent |
| Out-of-the-loop / bad handoff — disengaged human can't take over in time | [E-13, H-7, H-19] | Anticipatory, gradual, context-rich handoffs; keep the human "warm" where takeover must be fast; don't architect safety around a fallback that can't re-engage in time |
| Moral crumple zone — human positioned to absorb blame without control | [F-19, D-16, D-15] | Apply the four A's (§0); if you can't give real authority+awareness+ability, don't claim oversight — change the design |
| Lethal trifecta — private data + untrusted content + external comms = exfiltration | [A-23] | Detect the combination and escalate/forbid; break a leg (read-only, no network, no untrusted input) before running unsupervised |
| Mode confusion — human's model of agent state diverges from reality | [H-5, E-10, G-17] | Make current mode/authority and intent continuously visible; no silent mode transitions (e.g., suggest→act); intelligibility as a safety property |
| False-security gates — approval/denylist mistaken for a security control | [A-14, A-12] | Bind approvals server-side; enforce real boundaries in the sandbox, not in pattern matching |
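The lethal-trifecta defense lends itself to a mechanical check. A minimal sketch, assuming the three legs can each be determined for a session; the `Capabilities` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    reads_private_data: bool
    ingests_untrusted_content: bool
    can_communicate_externally: bool


def trifecta_exposed(c: Capabilities) -> bool:
    """True only when all three legs of the trifecta are present together."""
    return (c.reads_private_data
            and c.ingests_untrusted_content
            and c.can_communicate_externally)


def legs_present(c: Capabilities) -> list[str]:
    """Name the legs that are active, to show which one could be removed."""
    names = {
        "private data": c.reads_private_data,
        "untrusted content": c.ingests_untrusted_content,
        "external comms": c.can_communicate_externally,
    }
    return [name for name, active in names.items() if active]
```

Breaking any one leg, for instance by running with no network, makes `trifecta_exposed` false; detecting the legs reliably in a live agent is the hard part and is not addressed here.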
7. Step six — VALIDATE that oversight is effective
The codex's sharpest governance finding: oversight is routinely required and routinely ineffective, and almost no one measures the difference [D-14, D-17]. Green's reform is the right default posture: treat "there is a human in the loop" as a claim to be demonstrated with evidence, not a checkbox [D-14].
Measure oversight as a system you can be wrong about:
- Intervention success rate — when the agent is wrong, how often does the human actually catch and correct it? (Not approval rate — correctness of approvals.) This is the metric [A-17] shows is usually terrible; if you measure nothing else, measure this.
- Override rate + override correctness — are humans overriding, and are their overrides right? Uniform high approval = rubber-stamping; biased overrides = a new bias source [D-24].
- Time-to-detect vs. time-to-harm — is there actually time to intervene before consequence lands [H-19]?
- False-alarm / interruption rate — is the signal economy healthy, or are you breeding fatigue [F-17, H-10]?
- Calibration — does human reliance track the agent's actual reliability across contexts [F-1]?
Red-team the oversight, not just the agent. Plant errors and adversarial actions and see if the human (or the monitoring AI) catches them [A-17, C-22]. An oversight design that has never been tested against a wrong agent is unvalidated.
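The difference between approval rate and intervention success rate is easy to show on a review log. The sketch below assumes each record carries ground truth about whether the agent was wrong (for example from planted errors) and treats a rejection as a catch; the `Review` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Review:
    agent_was_wrong: bool  # ground truth, e.g. a planted error
    human_approved: bool


def approval_rate(reviews: list[Review]) -> Optional[float]:
    """Share of all reviewed actions that the human approved."""
    if not reviews:
        return None
    return sum(r.human_approved for r in reviews) / len(reviews)


def intervention_success_rate(reviews: list[Review]) -> Optional[float]:
    """Share of wrong agent actions that the human did not approve."""
    wrong = [r for r in reviews if r.agent_was_wrong]
    if not wrong:
        return None  # no wrong actions observed, so the rate is undefined
    caught = sum(not r.human_approved for r in wrong)
    return caught / len(wrong)
```

With six correct actions approved and four wrong actions of which three were approved, the approval rate is 0.9 while the intervention success rate is only 0.25: the log looks healthy by the first measure and fails by the second.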
8. The design procedure (putting it together)
For each agent, run the loop. It is iterative — new capabilities re-enter at step 1.
1. MAP: List every action the agent can take (every tool, side effect, shell command).
2. GRADE: Score each by reversibility × blast radius × stakes → G0–G3. (§2)
3. MATCH: Assign a default oversight mode per grade; apply the controllability overlay; define dynamic escalation triggers. (§3)
4. INSTRUMENT: Choose control surfaces to implement the modes; layer them. (§4)
5. EPISODE: For each human touchpoint, design the oversight moment: clarity, consequences + reversibility, provenance, detection affordances, attention budget. (§5)
6. DEFEND: For each relevant failure mode, add its specific defense. (§6)
7. VALIDATE: Instrument the effectiveness metrics; red-team the oversight; iterate. (§7)
Designer checklist
- Every action is graded; nothing defaults to L0 by accident.
- No G3 action depends on passive human review as its only safeguard. (top-left quadrant, §1)
- Reversibility is the default; blocking confirmation is reserved for the genuinely irreversible. [G-14]
- You know precisely what your undo does not cover. [A-4]
- High-autonomy work runs in a sandbox with scoped credentials and budget caps. [A-23]
- The lethal trifecta is detected and gated. [A-23]
- Interruptions are budgeted; every blocking prompt demands a specific action. [H-21]
- Each oversight point passes the four A's (Authority, Awareness, Ability, Accountability). [§0]
- Awareness is comprehension (situation-awareness Levels 2–3, not the oversight levels L2/L3 of §3), not a log dump. [E-12]
- Handoffs are anticipatory and context-rich; re-engagement latency is budgeted. [H-19]
- Autonomy is dynamic (escalates on uncertainty/novelty/drift) and stepping down is easy. [E-14, H-6]
- Intervention-success rate is measured and red-teamed — not just "a human approves." [A-17, D-14]
- Every oversight moment shows the request and its consequences + reversibility before the human acts. [Q-3, Q-4]
- Every escalation answers "how did this get to me?" with provenance, not just the final ask. [Q-7, Q-11, E-12]
- Detection affordances are verification-oriented (why/why-not, diffs, surfaced uncertainty), not persuasive rationale. [Q-16, Q-20, F-20]
- The approval UI is bias-safe: safe default, no auto-approve-on-timeout, no dark patterns. [Q-31, Q-32, Q-36]
- Interrupts are deferred to breakpoints and batched; high precision so they aren't tuned out. [Q-28, O-16]
Maturity model
| Level | State |
|---|---|
| 0 — Binary | One global autonomy switch (ask-everything or YOLO). Alert fatigue or unmonitored risk. |
| 1 — Graded | Actions risk-tiered; gates match consequence. [A-6] |
| 2 — Reversible & sandboxed | Undo-by-default + environment containment; gates reserved for the irreversible. [A-4, A-23] |
| 3 — Calibrated & dynamic | Oversight level adapts to uncertainty/novelty; loop episodes are designed (clarity, consequences, provenance, detection); verification-oriented awareness; failure-mode defenses in place. |
| 4 — Validated | Oversight effectiveness is measured, red-teamed, and demonstrated — meaningful control proven, not assumed. [D-14] |
9. Worked examples
9a. Coding agent (Claude Code–style)
- Map/Grade: read file (G0) · edit local file (G1, reversible via checkpoint) · run tests (G1) · `git push` (G2: shared, hard to reverse) · `rm -rf` / DB migration on prod (G3: irreversible, severe).
- Match: reads/edits autonomous+logged with undo (L1–L2); `git push` confirm-before with diff (L3); prod-touching commands plan-approve + escalate (L4/L6). Auto-escalate when an edit touches files outside the approved scope [A-18].
- Instrument: run in a sandbox/branch with no prod credentials [A-23]; checkpoint every edit but flag that Bash deletions aren't covered [A-4]; preview diffs for G2.
- Episode: the `git push` / prod prompt shows the diff and what it affects, whether it's reversible, why the agent wants to push now (provenance), surfaces test results as detection evidence, defaults to the safe option, and is batched so the dev isn't prompted per-file [Q-2, Q-3, Q-4, Q-7, H-21].
- Defend: combat rubber-stamping of large diffs with focused, reviewable chunks + tests as verification evidence [F-20]; budget confirmations so the human doesn't reflexively approve [H-21].
- Validate: track how often human review catches a real bug vs. merge rate [A-16]; red-team with planted regressions.
9b. Computer-use / web agent
- Map/Grade: read page (G0) · fill form (G1) · purchase / send / delete (G3) · enter credentials (G3 + privacy).
- Match: browse autonomously (L1); confirm-before on every side-effecting action (Purchase / Send / Delete) [A-5]; credentials via human takeover with the model's vision blanked [A-5].
- Instrument: the lethal trifecta is the live risk (private session + untrusted web content + ability to transact) [A-23] — detect and gate. Escalate, don't review, when controllability is low (the agent can rationalize a wrong action and the user won't catch it [A-17]).
- Validate: measure intervention success against embedded dark-pattern / injection attacks [A-17].
9c. Customer-support agent (high fan-out)
- Map/Grade: answer question (G1) · issue refund within policy (G2) · refund outside policy / account closure (G3).
- Match: answer autonomously with deterministic guardrails it "cannot cross" (e.g., a 30-day window) [A-15]; in-policy refunds L1–L2; out-of-policy escalate to a human with an AI-generated summary [A-15, A-6].
- Instrument: human is the escalation tier, not the per-turn approver (span limits [H-22]); AI supervisors observe→intercept in parallel; monitors score grounding/sentiment at scale [A-15].
- Defend: the supervising AI has blind spots [B-20] — keep human review on the escalated tail and audit a sample of the autonomous body.
10. Limits, tensions, and honest caveats
This framework manages the hard tensions named in the codex synthesis; it does not dissolve them.
- Autonomy vs. oversight capability — every increment of useful autonomy degrades the human's takeover ability [E-13]. The framework's answer is per-action allocation and dynamic escalation, not a global setting — but the trade-off is real and permanent [E-15].
- Friction vs. usability — the interventions that most reduce over-reliance are the ones users most dislike [F-21, A-17]. Spend friction where stakes justify it; expect resistance.
- Explanation vs. over-trust — intelligibility is necessary for oversight yet explanations increase uncritical acceptance [F-20, G-17]. Prefer verification-oriented evidence over persuasion-oriented rationale; this is easier to state than to design.
- Scaling oversight vs. preserving the human — AI-assisted oversight is the only way to scale, but each step removes a little of the human it's meant to empower [B-20, C-19]. Track where genuine human judgment still lives.
- Compliance vs. safety — designing for the regulatory checkbox can actively produce a crumple zone [D-14, D-16]. Validate effectiveness (§7); don't confuse "compliant" with "controlled."
- The model is part of the design — interruptibility and mid-task correction are unsolved model capabilities [A-19]. Some patterns in this framework degrade when steerability is poor; design within the model's actual limits, and re-grade as models improve.
One-line summary. Don't ask whether a human is in the loop. Grade each action by consequence, match a proportionate oversight mode, prevent (don't merely review) where humans can't catch errors, design the oversight moment so the human has clarity, context, and detection affordances, defend against the predictable failure modes, and prove the oversight works.
Foundations
This framework rests on the 17 research clusters in codex.md. Beyond the original
human-factors / HCI / AI-safety base (Parts I–II), it now draws on:
- The economics & law of delegation (Part III.J/K) — principal–agent theory and agency costs reframe oversight as costly monitoring [J-2]; residual control rights [J-8] ground "the human keeps the override"; non-delegable duties [J-15] set hard limits on automation; commander's intent and management by exception [K-7, K-1] supply the brief-by-intent posture in §3.
- Resilience engineering & HRO (III.L) — tight coupling / latent failure / drift / normalization of deviance [L-10, L-12, L-15, L-16] and deference to expertise [L-7] underwrite the decouple-and-monitor stance and the warning that oversight standards erode silently.
- Joint cognitive systems / human-autonomy teaming (III.M) — Observability, Predictability, Directability [M-1, M-5] and common ground [M-14] are the teaming requirements behind §4/§5.
- Cross-industry mechanisms (Part IV.N/O) — Jidoka/andon, poka-yoke, checklists, permit-to-work, the two-person rule, four-eyes, circuit breakers, kill switches, least-privilege, runtime shields — and Cranor's prior HITL framework [O-9], the direct precedent generalized by §5.
- The cognition of advice & the loop episode (Part V.P/Q/R) — advice-taking and anchoring [P-2, P-9], decision fatigue with its replication caveats [P-13, P-16], interruption science, provenance, and choice architecture — the evidence base for §5.
Related: the BRACE Framework
LoopRails governs human oversight — which actions to gate, which to prevent, and how to design the review moment. It pairs with the BRACE Framework, a security framework for autonomous AI agents that hardens the agent's configuration and infrastructure (environment/network isolation, capability-scoped tokens, signed/minimal containers, and observability — nine controls plus three observability requirements). LoopRails sits on top (when and how a human is in the loop); BRACE is the security baseline underneath. Use them together.
Companion to codex.md (the research foundation). Assembled 2026-06-22; expanded with
Parts III–V and the loop-episode section after the initial draft.