LoopRails

GRADE · GUARD · SHOW · PROVE

1 · Grade each action by its worst case

Grade	When	What you do
G0 trivial	reversible, local, low stakes	Act on your own. Log it. Don't ask.
G1 low	exactly one axis moderate, none severe	Act, then report. Keep an undo.
G2 high	any axis severe, or two or more moderate	Show a real diff/preview; approve before acting.
G3 critical	irreversible AND (wide blast radius OR severe stakes)	Don't proceed alone. Stop & ask; at worst, refuse + escalate.

Three axes: reversibility (undo easily / with effort / never) × blast radius (me / team / public) × stakes (trivial / meaningful / severe). Unsure? Round up.
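A minimal sketch of the grading rule, assuming each axis maps onto the same three-step scale in the order listed (undo easily / me / trivial = LOW, with effort / team / meaningful = MODERATE, never / public / severe = SEVERE). The names Level and grade are hypothetical, and reading "wide blast" as "public" is an assumption; the card does not say whether "team" counts as wide.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 0       # undo easily / me / trivial
    MODERATE = 1  # with effort / team / meaningful
    SEVERE = 2    # never / public / severe


def grade(reversibility: Level, blast: Level, stakes: Level) -> str:
    """Return G0..G3 for one action, checking the strictest rule first."""
    axes = (reversibility, blast, stakes)
    severe = sum(a == Level.SEVERE for a in axes)
    moderate = sum(a == Level.MODERATE for a in axes)
    irreversible = reversibility == Level.SEVERE

    # G3: irreversible AND (wide blast or severe stakes).
    # Assumption: "wide blast" means the public level.
    if irreversible and (blast == Level.SEVERE or stakes == Level.SEVERE):
        return "G3"
    # G2: any axis severe, or two (or more) moderate.
    if severe >= 1 or moderate >= 2:
        return "G2"
    # G1: one axis moderate.
    if moderate == 1:
        return "G1"
    # G0: reversible, local, low stakes.
    return "G0"
```

To apply "Unsure? Round up" with this sketch, pass the higher of the two levels you are choosing between.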

2 · Guard — match controls to grade

G0 run automatically + log
G1 act, notify, one-step undo
G2 preview + approve before acting
G3 prevent by design; stop & ask; refuse + escalate

The trap: high stakes + a human can't catch it in time → prevent the outcome by design instead of adding a review step. A confirmation prompt there is just a rubber stamp.
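The grade-to-control mapping and the trap can be sketched as a lookup with one override. This is an illustration only: CONTROLS, choose_control, and the human_can_catch_in_time flag are hypothetical names, and treating G2 and G3 as "high stakes" is an assumption.

```python
CONTROLS = {
    "G0": "run automatically + log",
    "G1": "act, notify, one-step undo",
    "G2": "preview + approve before acting",
    "G3": "prevent by design; stop & ask; refuse + escalate",
}


def choose_control(grade: str, human_can_catch_in_time: bool) -> str:
    """Pick the control for a grade, applying the trap override."""
    # Assumption: "high stakes" covers the G2 and G3 grades.
    high_stakes = grade in ("G2", "G3")
    if high_stakes and not human_can_catch_in_time:
        # A review the human cannot act on in time is a rubber stamp,
        # so the outcome has to be prevented rather than reviewed.
        return "prevent by design"
    return CONTROLS[grade]
```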

3 · Show — design the review moment

Show the real action + consequences (a diff), not a tidy summary.
Show where the request came from (provenance).
Spend attention sparingly — few, precise interrupts.
Don't ask for a decision a person can't actually make.
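One way to show the real action plus provenance is a review card built from an actual diff. The sketch below uses Python's standard difflib; review_card, requested_by, and source are hypothetical names chosen for illustration.

```python
import difflib


def review_card(before: str, after: str, requested_by: str, source: str) -> str:
    """Build a plain-text review card: provenance first, then the real diff."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="current",
        tofile="proposed",
        lineterm="",
    )
    header = [f"Requested by: {requested_by}", f"Source: {source}"]
    return "\n".join(header + list(diff))


card = review_card(
    before="timeout = 30\nretries = 3",
    after="timeout = 300\nretries = 3",
    requested_by="agent",          # hypothetical value
    source="ticket from on-call",  # hypothetical value
)
print(card)
```

The reviewer sees the exact lines that change (-timeout = 30, +timeout = 300) rather than a summary such as "adjusted timeout".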

4 · Prove — validate the oversight

Test that humans catch seeded bad actions — not just that a review step exists.
Attack your own oversight the way you'd attack the agent.
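Seeded testing can be sketched as a small harness: mix known-bad actions into a benign stream and measure how many the reviewer rejects. The function oversight_rates and its arguments are hypothetical; the reviewer is any callable that returns True to approve an action.

```python
import random
from typing import Callable, List, Tuple


def oversight_rates(
    reviewer: Callable[[str], bool],
    benign: List[str],
    seeded_bad: List[str],
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (catch rate on seeded bad actions, false-alarm rate on benign ones)."""
    items = [(a, False) for a in benign] + [(a, True) for a in seeded_bad]
    # Shuffle so the reviewer cannot rely on position to spot seeded items.
    random.Random(seed).shuffle(items)

    caught = 0
    false_alarms = 0
    for action, is_bad in items:
        approved = reviewer(action)
        if is_bad and not approved:
            caught += 1
        elif not is_bad and not approved:
            false_alarms += 1

    catch_rate = caught / len(seeded_bad) if seeded_bad else 0.0
    false_alarm_rate = false_alarms / len(benign) if benign else 0.0
    return catch_rate, false_alarm_rate


# Stub reviewer for illustration: rejects only actions that mention DROP.
rates = oversight_rates(
    reviewer=lambda a: "DROP" not in a,
    benign=["SELECT 1", "add index"],
    seeded_bad=["DROP TABLE users", "delete all backups"],
)
print(rates)  # (0.5, 0.0): one of two seeded bad actions slipped through
```

A review step that exists but scores a low catch rate here is not providing oversight, which is the point of testing the humans and not just the workflow.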

Keep every governed action on the RAIL

R Reversible
undoable, or blast contained

A Authorized
least privilege; high-stakes actions need two-party approval

I Interruptible
one blame-free kill switch

L Logged
reconstruct what happened & who did it
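The RAIL checklist can be sketched as a pre-flight check that returns the letters an action fails. Every field name here is hypothetical, and reading "two-party" as "at least two approvers" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GovernedAction:
    name: str
    has_undo: bool
    blast_contained: bool
    least_privilege: bool
    high_stakes: bool
    approvers: int
    kill_switch: bool
    logged: bool


def rail_gaps(a: GovernedAction) -> List[str]:
    """Return the RAIL letters this action does not satisfy."""
    gaps = []
    # R: undoable, or blast contained.
    if not (a.has_undo or a.blast_contained):
        gaps.append("R")
    # A: least privilege; high-stakes actions need two parties (assumed: >= 2 approvers).
    if not a.least_privilege or (a.high_stakes and a.approvers < 2):
        gaps.append("A")
    # I: a kill switch exists.
    if not a.kill_switch:
        gaps.append("I")
    # L: the action is logged.
    if not a.logged:
        gaps.append("L")
    return gaps


deploy = GovernedAction(
    name="prod deploy", has_undo=True, blast_contained=False,
    least_privilege=True, high_stakes=True, approvers=1,
    kill_switch=True, logged=True,
)
print(rail_gaps(deploy))  # ['A']: high stakes with only one approver
```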