LoopRails

Codex: Human-in-the-Loop Design for Agentic AI

A living research reference assembled to ground the design of human-in-the-loop (HITL) behavior for agentic LLM systems (coding agents, copilots, autonomous assistants). It captures (1) what is known to work, (2) what people tried that failed, and (3) the deeper ergonomics / human-factors / HCI traditions we can draw on.

Scope decision (2026-06-22): center of gravity is agentic LLM tools — approvals, interrupts, plan/preview, oversight of autonomous action — with classical automation, human-factors, and HCI literature treated as transferable foundations. The codex is meant to serve both as a cited research synthesis and as the substrate for an actionable design framework (proposed separately).


How to read this codex

  • Part I covers HITL specifically for AI/ML and agentic systems, plus governance mandates.
  • Part II covers the foundational ergonomics, human-factors, automation, and HCI literature.
  • Part III covers the theory of delegation, organization, and resilience (the principal–agent problem, management/military delegation doctrine, resilience engineering/HRO, and human–agent teaming).
  • Part IV mines other industries for concrete, battle-tested oversight mechanisms (manufacturing & process safety; finance & security).
  • Part V turns to the human side and the design of a single loop episode (the cognition of advice & decision fatigue; the anatomy of an oversight moment; sociotechnical context & workflow precedents).
  • Each thematic section opens with Highlights (the load-bearing takeaways) followed by an Annotated bibliography.

Citation conventions

References are recorded in an academic, lightly-numbered style, grouped by theme:

[§-n] Author(s) (Year). Title. Venue / Publisher. URL or DOI. — Annotation: core contribution; what worked / what failed; relevance to agentic-LLM HITL.

Empirical findings are distinguished from opinion/position pieces. Documented failures and anti-patterns are recorded alongside successes by design — knowing what didn't work is half the point of this codex.


Part I — Human-in-the-Loop for AI

(Annotated bibliography + highlights assembled below.)

I.A — Agentic LLM HITL patterns (industry & framework practice)

How current agentic systems actually implement oversight. Tags: [VENDOR] primary docs · [PRACTITIONER] essay/postmortem · [EMPIRICAL] study.

Highlights

  • Default to "ask," make "act" an explicit, graduated opt-in. Every mature system (Claude Code, Copilot, Cursor, Operator) gates writes/side-effects by default and exposes a deliberate ladder to autonomy (acceptEdits → bypassPermissions; /yolo; auto-approve toggles). Trust is loosened over time / within a session, not granted up front [A-2, A-5, A-13].
  • The converged HITL primitive is interrupt → {approve / edit / reject / respond} → resume, built on durable state. LangGraph, OpenAI Agents SDK, Google ADK, CrewAI, and Vercel all implement this; the non-negotiable substrate is a checkpointer/persistence layer — HITL and pause/resume are the same feature [A-8, A-7, A-10, A-11, A-12].
  • Approval gates reduce bad actions but do NOT make humans good at catching them. Strongest empirical result here: plan approval cut problematic-action occurrence to 60–74% (from 88–90%), yet intervention success once a problem surfaced stayed at 9–26% across all strategies. The failure is rationalization/normalization (a "recognition bottleneck"), not inattention [A-17].
  • ANTI-PATTERN — rubber-stamping / automation bias. Even motivated expert developers default to heuristic, shallow review under productivity pressure; "safety contingent on developer oversight" does not scale and externalizes correction cost onto the human [A-18, A-21].
  • ANTI-PATTERN — string-matching command denylists / client-side approvals as a security boundary. Cursor's denylist was bypassed ≥4 ways (base64, subshells, generated scripts, quoting); Vercel documents that needsApproval is a UX affordance, forgeable via replayed history unless cryptographically bound. Pattern-matching guardrails are not sandboxes [A-14, A-12].
  • Sandbox the environment instead of gating every action. Practitioner consensus for high-autonomy work: no-network containers, disposable cloud machines, scoped test/staging credentials, budget caps — move the safety boundary off the per-action prompt [A-23, A-3, A-1].
  • The "lethal trifecta" decides when an agent must NOT run unsupervised [A-23]: private-data access + untrusted-content exposure + external-communication ≈ guaranteed exfiltration via prompt injection. Removing any one leg (read-only, no network) is the cheapest guardrail.
  • Undo/rollback has a hard boundary at the shell [A-4]. Claude Code checkpoints structured edits but cannot undo Bash side-effects (rm/mv/cp) — the most destructive ops sit outside the recovery net.
  • Models are genuinely weak at mid-task steering, cancellation, and re-planning [A-19, A-16]. "Steerable"/"interruptible" are model-capability claims, not just UI claims.
  • Risk-tier tools to decide what to gate [A-6, A-12, A-5]. Rate each tool on read-vs-write, reversibility, required permissions, and financial impact; map the result to a low/medium/high tier; then auto-run the trivial and gate the consequential (e.g., value-conditional approval above $1000).
  • Escalation-to-human as a first-class control, plus AI-supervising-AI for scale [A-15, A-6, A-10]. Treat the human as the escalation tier with a generated handoff summary — "human-on-the-loop," not per-turn "in-the-loop."
  • Plan-then-confirm and disambiguate-before-acting is what expert users actually want [A-20, A-22]. Preview/diff-before-apply (plan mode) and clarifying questions before execution recur as the highest-trust pattern.
  • Permission-rule precedence and inheritance are subtle and load-bearing [A-2]. Order matters (hooks → deny → ask → mode → allow → callback); allow-lists don't constrain bypass mode; subagents silently inherit a parent's bypass/acceptEdits — misconfiguration quietly grants more autonomy than intended.
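
The precedence order in the last highlight can be made concrete with a small sketch. It assumes a first-match-wins evaluation over the six stages named in [A-2]; the function name `decide`, its parameters, and the final "ask" fallback are hypothetical and do not reproduce the Claude Agent SDK's actual API. Only the bypass mode is modeled.

```python
from typing import Callable, Iterable, Optional

Verdict = str  # "allow", "ask", or "deny"


def decide(
    tool: str,
    *,
    hooks: Iterable[Callable[[str], Optional[Verdict]]] = (),
    deny: frozenset = frozenset(),
    ask: frozenset = frozenset(),
    mode: str = "default",
    allow: frozenset = frozenset(),
    callback: Optional[Callable[[str], Verdict]] = None,
) -> Verdict:
    """First matching stage wins: hooks, deny, ask, mode, allow, callback."""
    for hook in hooks:
        verdict = hook(tool)
        if verdict is not None:  # a hook may abstain by returning None
            return verdict
    if tool in deny:
        return "deny"  # deny rules are checked before the mode
    if tool in ask:
        return "ask"
    if mode == "bypassPermissions":
        return "allow"  # reached before the allow-list is ever consulted
    if tool in allow:
        return "allow"
    if callback is not None:
        return callback(tool)
    return "ask"  # assumed conservative fallback, not taken from the docs


# A deny rule still wins under bypass, but an allow-list does not narrow it.
print(decide("Bash", deny={"Bash"}, mode="bypassPermissions"))
print(decide("Write", allow={"Read"}, mode="bypassPermissions"))
```

Because the mode stage sits ahead of the allow stage, a tool missing from the allow-list is still permitted under bypass. That is the misconfiguration risk the highlight describes.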

Annotated bibliography

[A-1] Anthropic (2024). [PRACTITIONER] Building Effective Agents. Anthropic Engineering. https://www.anthropic.com/engineering/building-effective-agents — Distinguishes workflows from agents; minimal agent = "model + tools + system prompt in a loop." Frames HITL: pause at checkpoints/blockers, human-set stopping conditions (max-iteration caps). Worked: agent-computer-interface (ACI) design, "poka-yoke" tools, sandboxed testing, read-only as a cheap guardrail. Limit: autonomy → higher cost + compounding errors; HITL/read-only also cap scaling. Relevance: canonical statement of the autonomy-vs-control tradeoff and "pause at checkpoints."

[A-2] Anthropic (2026). [VENDOR] Configure Permissions (Claude Agent SDK). https://code.claude.com/docs/en/agent-sdk/permissions — Most precise published permission spec: six-step pipeline Hooks → Deny → Ask → Mode → Allow → canUseTool, modes default/acceptEdits/dontAsk/bypassPermissions/plan/auto. Worked: deny/ask override even bypassPermissions; dynamic mode escalation mid-session. Pitfalls (documented): allowed_tools does NOT constrain bypassPermissions; subagents inherit bypass/acceptEdits and can't be re-tightened. Relevance: reference implementation of allow/deny/ask precedence — ordering and bare-name vs scoped rules change the safety semantics.

[A-3] Anthropic (2026). [PRACTITIONER] How We Built Claude Code Auto Mode: A Safer Way to Skip Permissions. https://www.anthropic.com/engineering/claude-code-auto-mode — Two-layer classifier defense (input prompt-injection probe + output transcript classifier with fast single-token filter escalating to chain-of-thought). Worked: blocks "overeager behavior," "credential exploration," "scope escalation." Limit: 17% false-negative on real overeager actions; classifier deliberately blinded to assistant reasoning/tool outputs (prevents "talking it into" approval but sacrifices provenance). Relevance: rare quantified data on automated gates substituting for the human — and an admission they don't fully replace one.

[A-4] Anthropic (2025). [VENDOR] Checkpointing (Claude Code) — /rewind. https://code.claude.com/docs/en/checkpointing — Automatic per-edit snapshots; /rewind (Esc-Esc) restores code+conversation / conversation-only / code-only. Worked: transparent undo enables ambitious tasks; decoupling code-state from chat-state. Critical limit: does NOT track files changed by Bash (rm,mv,cp) — the most destructive ops are outside the undo boundary. Relevance: rollback pattern + its sharp boundary (undo covers structured edits, not shell side-effects).

[A-5] OpenAI (2025). [VENDOR + safety] Introducing Operator & Operator System Card. https://openai.com/index/introducing-operator/ · https://openai.com/index/operator-system-card/ — Computer-use agent, three-layer oversight: autonomous by default; user-confirmation gates before side-effecting actions (Purchase/Send/Delete); Takeover Mode blacks out model vision on password/CC fields; refuses high-stakes categories. Limit: oversight concentrated at agent-chosen moments — missed risk depends on the agent's own risk detection. Relevance: reference design for "act autonomously, ask only at consequential moments" + sensitive-field takeover.

[A-6] OpenAI (2025). [PRACTITIONER] A Practical Guide to Building Agents. (PDF) https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf — Risk-rate every tool (read-vs-write, reversibility, permissions, financial impact) to drive pause/escalation; "human intervention" as graceful control-transfer; matters most early to surface edge cases; layered guardrails (input → tool → HITL). Limit: prescriptive, not evaluated. Relevance: the canonical "risk-tier your tools, gate the high-risk" rubric.
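
As an illustration of the rubric in [A-6], the sketch below rates a tool on the four listed dimensions and maps the result to a tier. The `ToolProfile` fields, the tier rules, and the $1000 threshold are assumptions chosen for the example, not values taken from the guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical risk attributes for one tool."""
    writes: bool                 # read-vs-write
    reversible: bool             # can the effect be undone?
    elevated_permissions: bool   # needs more than the default scope
    max_financial_impact: float  # worst-case cost in dollars


def risk_tier(profile: ToolProfile, high_value: float = 1000.0) -> str:
    """Map a tool profile to "low", "medium", or "high"."""
    if not profile.writes:
        return "low"
    if (
        not profile.reversible
        or profile.elevated_permissions
        or profile.max_financial_impact > high_value
    ):
        return "high"
    return "medium"


# Assumed policy: auto-run the trivial, gate the consequential.
POLICY = {"low": "auto-run", "medium": "ask", "high": "ask and escalate"}

search = ToolProfile(writes=False, reversible=True,
                     elevated_permissions=False, max_financial_impact=0.0)
refund = ToolProfile(writes=True, reversible=True,
                     elevated_permissions=False, max_financial_impact=1500.0)
print(POLICY[risk_tier(search)], "|", POLICY[risk_tier(refund)])
```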

[A-7] OpenAI Agents SDK (2026). [VENDOR] Guardrails and Human Review. https://developers.openai.com/api/docs/guides/agents/guardrails-approvals — Tools set needsApproval; runs pause and surface interruptions; host calls approve()/reject() and resumes from saved state. Input/output/tool guardrails with a defined firing order (tool-input guardrails run after approval by default; a flag inverts this). Relevance: the interrupt-inspect-resolve-resume loop as a first-class SDK primitive.

[A-8] LangChain/LangGraph (2026). [VENDOR] Human-in-the-Loop. https://docs.langchain.com/oss/python/langchain/human-in-the-loop — Most-copied HITL abstraction: interrupt() pauses a graph node and returns a payload; four decision types approve / edit / reject / respond; resume via Command(resume=...). Requires a checkpointer (HITL built on durable persistence). Worked: when predicates gate only risky calls; structured rejection feedback. Limit: heavy arg edits can trigger re-evaluation/re-runs; decision ordering must match action ordering. Relevance: the de-facto approve/edit/reject/respond vocabulary others mirror.
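
The interrupt, decide, resume cycle can be sketched without any framework. The example below is a toy under stated assumptions: `PendingAction`, `Checkpointer`, `interrupt`, and `resume` are hypothetical names, the store is an in-memory dict standing in for durable persistence, and none of it reproduces LangGraph's API.

```python
import json
from dataclasses import asdict, dataclass

DECISIONS = {"approve", "edit", "reject", "respond"}


@dataclass
class PendingAction:
    run_id: str
    tool: str
    args: dict


class Checkpointer:
    """Toy store; state is serialized so a resume could happen in another process."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, action: PendingAction) -> None:
        self._rows[action.run_id] = json.dumps(asdict(action))

    def pop(self, run_id: str) -> PendingAction:
        return PendingAction(**json.loads(self._rows.pop(run_id)))


def interrupt(store: Checkpointer, action: PendingAction) -> dict:
    """Persist the pending call and return the payload shown to the reviewer."""
    store.save(action)
    return asdict(action)


def resume(store: Checkpointer, run_id: str, decision: str, payload=None):
    """Apply the reviewer's decision to the checkpointed action."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    action = store.pop(run_id)
    if decision == "approve":
        return ("execute", action.tool, action.args)
    if decision == "edit":
        # Reviewer-supplied fields override the proposed arguments.
        return ("execute", action.tool, {**action.args, **(payload or {})})
    if decision == "reject":
        return ("skip", action.tool, {"feedback": payload})
    return ("reply", action.tool, {"message": payload})


store = Checkpointer()
interrupt(store, PendingAction("r1", "send_email", {"to": "a@example.com"}))
print(resume(store, "r1", "edit", {"to": "b@example.com"}))
```

The pause is nothing more than saved state plus a later lookup, which is why the persistence layer is the part that cannot be skipped.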

[A-9] Microsoft AutoGen (2025). [VENDOR] Human-in-the-Loop / UserProxyAgent. https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/tutorial/human-in-the-loop.html — Models the human as an agent (UserProxyAgent) with a pluggable input_func (console/websocket/Slack); legacy human_input_mode: NEVER / TERMINATE / ALWAYS. Relevance: HITL as a participant in a multi-agent conversation rather than an interrupt on one loop; NEVER/TERMINATE/ALWAYS is an early autonomy slider.

[A-10] Google ADK (2026). [VENDOR] Action Confirmations & Long-Running Function Tools. https://google.github.io/adk-docs/tools-custom/confirmation/ · https://developers.googleblog.com/build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk/ — Tool Confirmation (≥1.14): any tool pauses for yes/no or structured confirmation via a FunctionResponse; LongRunningFunctionTool returns pending and resumes on external human/webhook/queue response — pause/resume surviving across processes and time. Relevance: best-articulated pattern for durable, asynchronous approval (human may respond hours later).

[A-11] CrewAI (2026). [VENDOR] Human Feedback in Flows / human_input. https://docs.crewai.com/en/learn/human-feedback-in-flows — Two tiers: task-level human_input=True; @human_feedback decorator (≥1.8) pauses a Flow, presents output, routes by outcome, loops until approved/rejected (typically via webhooks). Relevance: HITL as a review/approve loop at task/workflow boundaries with branching on the verdict.

[A-12] Vercel AI SDK (2026). [VENDOR] Tool Execution Approval — needsApproval (AI SDK 6). https://vercel.com/blog/ai-sdk-6 · https://ai-sdk.dev/cookbook/next/human-in-the-loop — Single-flag HITL: needsApproval: true inserts a gate before execute; function form enables conditional gating (e.g., payments >$1000); durable workflows can suspend for days. Critical security caveat (documented): by default the approval is a UX affordance, NOT a server-side boundary — forgeable via replayed message history; use experimental_toolApprovalSecret to bind it. Relevance: cleanest "conditional approval by input value" + explicit warning that a gate isn't a security control unless bound server-side.
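
To show what "cryptographically bound" can mean in general, here is a minimal sketch that signs an approval over the exact call it covers using an HMAC from the Python standard library. The helper names and the message layout are assumptions for illustration; this is not a description of how experimental_toolApprovalSecret works internally.

```python
import hashlib
import hmac
import json


def sign_approval(secret: bytes, call_id: str, tool: str, args: dict) -> str:
    """Return a token that is valid only for this call id, tool, and arguments."""
    message = json.dumps(
        {"id": call_id, "tool": tool, "args": args}, sort_keys=True
    ).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_approval(secret: bytes, call_id: str, tool: str, args: dict, token: str) -> bool:
    """Server-side check; compare_digest avoids timing side channels."""
    expected = sign_approval(secret, call_id, tool, args)
    return hmac.compare_digest(expected, token)


secret = b"server-side-secret"  # hypothetical; never sent to the client
token = sign_approval(secret, "call-1", "pay", {"amount": 10})

# The approved call verifies; a replay with altered arguments does not.
print(verify_approval(secret, "call-1", "pay", {"amount": 10}, token))
print(verify_approval(secret, "call-1", "pay", {"amount": 5000}, token))
```

A production design would also need to reject reuse of an already-consumed call id, which this sketch leaves out.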

[A-13] GitHub / VS Code (2025–2026). [VENDOR] Copilot Agent Mode & Copilot CLI Autopilot ("/yolo"). https://code.visualstudio.com/blogs/2025/02/24/introducing-copilot-agent-mode · https://docs.github.com/en/copilot/concepts/agents/copilot-cli/autopilot — Terminal commands require approval by default; autonomy opt-in via chat.tools.autoApprove / terminal.autoApprove, JetBrains "global auto approve," CLI /allow-all (alias /yolo). Worked: safe defaults + graduated path. Risk: JetBrains global auto-approve overrides per-category settings incl. destructive ops. Relevance: a mainstream IDE's "ask-by-default, opt into act" surface and its blunt-override risk.

[A-14] Cursor docs + Claburn, T. / The Register (2025). [VENDOR + EMPIRICAL security] Cursor Agent YOLO Mode, Allowlist/Denylist — and its Bypass. https://forum.cursor.com/t/how-to-enable-actual-yolo-auto-run-mode/67491 · https://www.theregister.com/2025/07/21/cursor_ai_safeguards_easily_bypassed/ — YOLO mode auto-runs with command_allowlist/command_denylist + file-deletion protection. Backslash Security found ≥4 denylist bypasses (base64-pipe, bash -c subshell, generated-script exec, quote variations), with malice arriving via poisoned rules.mdc / README / fetched content; denylist reportedly deprecated. Failed: pattern-matching denylists trivially evaded. Relevance: the cautionary case that string-matching command guardrails are not a sandbox.
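
The weakness is easy to reproduce in miniature. The sketch below builds a naive substring denylist (a hypothetical stand-in, not Cursor's implementation) and shows that two of the reported bypass classes, base64 piping and quote variation, pass the check. The commands are only inspected as strings; nothing is executed.

```python
import base64

DENYLIST = ("rm -rf", "curl ")  # hypothetical blocked patterns


def naive_is_blocked(command: str) -> bool:
    """String-matching guardrail: blocks only literal pattern occurrences."""
    return any(pattern in command for pattern in DENYLIST)


direct = "rm -rf build"

# Bypass 1: encode the command and let the shell decode it at run time.
encoded = base64.b64encode(direct.encode()).decode()
wrapped = f"echo {encoded} | base64 -d | sh"

# Bypass 2: the shell treats r"m" as rm, but the substring no longer matches.
quoted = 'r"m" -rf build'

for candidate in (direct, wrapped, quoted):
    print(naive_is_blocked(candidate), candidate)
```

Only the literal form is caught, while the shell would run all three as the same command.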

[A-15] Sierra (2025). [PRACTITIONER] Confidence in Every Conversation & Enterprise-Grade Agents. https://sierra.ai/blog/confidence-in-every-conversation · https://sierra.ai/blog/enterprise-grade-agents — Production CX stack: per-agent Supervisors ("a Jiminy Cricket") observe in parallel and shift observe→intercept (correct/redirect/escalate in real time); always-on Monitors score coherence/grounding/sentiment at scale; deterministic guardrails the agent "cannot cross"; escalation produces an AI summary for handoff. Relevance: a "human-on-the-loop + AI-supervising-AI" stack where the human is the escalation tier, not the per-turn approver.

[A-16] Cognition (2025). [PRACTITIONER] Devin's 2025 Performance Review: Learnings From 18 Months of Agents at Work. https://cognition.com/blog/devin-annual-performance-review-2025 — Merge rate rose to 67% (from 34%); excels at clear-spec, verifiable, 4–8h tasks parallelizable as a fleet; async "first-pass" with humans reviewing only the final result. Failed/anti-patterns: iterative problem-solving + mid-task scope changes degrade performance; ambiguous/visual work needs explicit specs; bottleneck shifts from writing to reviewing. Relevance: evidence that async/end-of-task review beats turn-by-turn for some workloads, but interruptibility/steering remains weak.

[A-17] Chen et al. (2026). [EMPIRICAL, within-subjects n=48] Comparing Human Oversight Strategies for Computer-Use Agents. arXiv:2604.04918. — Four strategies (Risk-Gated, per-step Action Confirmation, Supervisory Co-Execution/plan-approval, Structurally Enriched) vs. embedded privacy-leak, prompt-injection, dark-pattern attacks. Worked: plan-based cut problematic-action occurrence (60–74% vs 88–90%); Structurally Enriched best on trust/usability in high-consequence tasks. Key failure: the exposure-vs-correction gap — no strategy improved intervention success once a problem surfaced (9–26%); failure was rationalization/normalization (a "recognition bottleneck"); task consequence mattered more than strategy. Relevance: hard evidence that approval gates reduce bad actions but don't make humans good at catching them.

[A-18] Tang, N., et al. (2026). [EMPIRICAL, 20,574 sessions / 1,639 repos] How Coding Agents Fail Their Users: …Developer-Agent Misalignment in 20,574 Real-World Sessions. arXiv:2605.29442. — Seven symptoms led by Developer-Constraint-Violation (38%), Misread-Intent (27%), Inaccurate-Self-Reporting (23%), Self-Initiated-Overreach (10%). Key HITL finding: 91% of visible resolutions required explicit developer pushback — "safety contingent on developer oversight," which doesn't scale. Anti-patterns: interaction-level symptoms persist even as code-level errors decline; CLI agents damage external state more than IDE; misalignment recurs across adjacent sessions (+54%). Relevance: largest real-world evidence that current agents externalize correction cost onto the human.

[A-19] Zou, H. P., et al. (2026). [EMPIRICAL, benchmark] When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation. arXiv:2604.00892. — InterruptBench (from WebArena-Lite) tests Addition / Revision / Retraction interruptions across six LLMs, measuring adaptation + recovery efficiency. Finding: even frontier LLMs struggle with mid-task steering, cancellation, and re-planning — an unsolved capability. Relevance: interrupt/pause/steer is not just a UI affordance — the models themselves are weak at honoring mid-flight intent changes.

[A-20] Huang, R., et al. (2025). [EMPIRICAL, qualitative] Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025. arXiv:2512.14012. — Professionals maintain agency via upfront planning, staged approval checkpoints before integration, and mid-execution interruption — contra "vibe coding." Implications: prioritize control over autonomy — explicit approval workflows, interruptibility at logical checkpoints, transparent reasoning, developer-directed planning. Relevance: empirically grounds "plan-then-confirm + interruptible checkpoints" as what expert users want.

[A-21] Dhanorkar, S., Passi, S., & Vorvoreanu, M. (2026). [EMPIRICAL, interviews n=17] Human Oversight of Agentic Systems in Practice… arXiv:2606.05391. — Documents the gap between intended and actual oversight: developers rely on informal heuristics rather than systematic verification; automation bias + productivity pressure produce rubber-stamping. Frames the "agentic oversight problem" — governance intent vs. real practice. Relevance: complements A-17 — even motivated experts default to shallow review, so HITL must counter automation bias, not assume diligence.

[A-22] D'Oro, P., et al. / Meta FAIR (2025). [position/framework] ADEPTS: A Capability Framework for Human-Centered Agent Design. arXiv:2507.15885. — Six user-facing capabilities — Actuation, Disambiguation, Evaluation, Personalization, Transparency, Safety — as the minimal vocabulary for understandable, controllable, trustworthy agents. Disambiguation (clarify before acting) + Transparency/Evaluation map onto plan-confirm and preview-before-apply. Relevance: a unifying taxonomy situating approval gates, steering, and escalation in one human-centered model.

[A-23] Willison, S. (2025). [PRACTITIONER] Designing Agentic Loops & The Lethal Trifecta for AI Agents. https://simonwillison.net/2025/Sep/30/designing-agentic-loops/ · https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/ — Argues YOLO/auto-approve drives the best results but only behind environmental controls: Docker w/ no network, disposable cloud machines, scoped test/staging credentials, hard budget caps — move the boundary from per-action approval to the sandbox. The "lethal trifecta": private-data access + untrusted-content exposure + external communication = near-guaranteed exfiltration via injection. Relevance: the most-cited articulation of "sandbox so you can skip the human gate," plus a heuristic for when an agent must NOT run unsupervised.
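
The trifecta heuristic reduces to a three-way conjunction, sketched below. The `Capabilities` type and the function name are hypothetical. Passing the check is a necessary condition under this heuristic, not a guarantee of safety.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    private_data: bool       # can read private or sensitive data
    untrusted_content: bool  # is exposed to attacker-controllable input
    external_comms: bool     # can send data out (network, email, etc.)


def has_lethal_trifecta(caps: Capabilities) -> bool:
    """True only when all three legs are present at once."""
    return caps.private_data and caps.untrusted_content and caps.external_comms


def may_run_unsupervised(caps: Capabilities) -> bool:
    return not has_lethal_trifecta(caps)


browsing_agent = Capabilities(private_data=True, untrusted_content=True, external_comms=True)
no_network = Capabilities(private_data=True, untrusted_content=True, external_comms=False)
print(may_run_unsupervised(browsing_agent), may_run_unsupervised(no_network))
```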

I.B — Human-in-the-loop machine learning (active/interactive learning, annotation, RLHF)

The older ML tradition that "human-in-the-loop" originally named. Tags: [E] empirical · [O] opinion/survey · [E/O] evidence-backed survey.

Highlights

  • Confidence-gated human queries (uncertainty sampling, [B-1, B-2]) can cut labeling 10–500× — but the gains are fragile. In real projects, utility-per-cost querying often fails to beat random sampling, and the "most informative" items are also the hardest and most disagreement-prone for humans [B-3]. Lesson: confidence-gated escalation is powerful but must account for human cost and difficulty, not just model uncertainty.
  • Passive "label what the model asks" loops miss rare-but-critical cases [B-4]. A model "cannot be uncertain about a class it has never seen" — argues for human initiative (search/seeding), not only human response.
  • Rich human feedback is double-edged. Users gladly critique model reasoning [B-5, B-6], and explanation-mediated correction can roughly double fix efficiency [B-7], yet naive incorporation of human corrections can degrade accuracy [B-6].
  • Aggregation beats individual labels, up to a point [B-9, B-10]. A few cheap labels can match experts, but redundancy is wasted when labelers are already good, and aggregation assumes a single correct answer exists.
  • Disagreement is often signal, not noise [B-13]. For subjective tasks there may be no single ground truth; majority-vote gold distorts data and eval. Implication: RLHF preference disagreement is partly irreducible.
  • Agreement metrics measure consistency, not correctness [B-11]. High κ/α can encode shared bias.
  • Crowdsourcing pathologies recur: spam, gaming, low effort, fatigue, demographic/task-design bias [B-12].
  • Programmatic / expert-in-the-loop supervision scales human effort [B-14, B-15]. Snorkel: ~2.8× faster, ~45% better than hand labeling — the conceptual bridge to LLM-assisted labeling.
  • Preference-based RLHF is the scalable HITL primitive [B-16]: humans compare, a reward model learns, the policy optimizes — feedback on <1% of interactions. Made oversight of deep RL and then LLMs affordable [B-17, B-18, B-19].
  • Reward hacking / over-optimization is the signature failure mode [B-17, B-22]. Optimizing too hard against an imperfect learned reward makes true human-judged quality go down (Goodhart); mitigations: KL penalties, early stopping, reward-model ensembles.
  • The human-feedback pool is a narrow, contestable foundation [B-18, B-21]. InstructGPT's "good" was set by a small labeler group (~72–77% agreement); annotator bias/error/non-representativeness are partly fundamental limits.
  • RLAIF / Constitutional AI [B-20] shifts humans from labelers to rule-authors — scaling oversight but inheriting the base model's blind spots. Trajectory: human-in-the-loop → human-defining-the-rules, AI-in-the-loop.
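
The confidence-gated query idea from the first highlight can be sketched as least-confidence sampling: send the human the items whose top predicted class probability is lowest. The function name and the toy probabilities are assumptions. The selection looks only at model uncertainty, which is the limitation [B-3] warns about.

```python
def least_confident(probabilities: list[list[float]], k: int) -> list[int]:
    """Return indices of the k items whose top class probability is lowest."""
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:k]


# Hypothetical class-probability outputs for three unlabeled items.
pool = [
    [0.90, 0.10],  # confident
    [0.55, 0.45],  # near the decision boundary
    [0.70, 0.30],
]
print(least_confident(pool, k=2))
```

A class the model has never seen produces no low-confidence signal here, which matches the rare-case blind spot noted in [B-4].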

Annotated bibliography

[B-1] Settles, B. (2009). [E/O] Active Learning Literature Survey. CS Tech Report 1648, Univ. Wisconsin–Madison. https://burrsettles.com/pub/settles.activelearning.pdf — Canonical AL survey: three query scenarios (membership query synthesis, stream-based selective sampling, pool-based sampling) and strategy families (uncertainty sampling, query-by-committee, expected model change/error reduction, density weighting). Worked: up to ~90% fewer labels in favorable apps. Limits: gains are model-/task-dependent and don't transfer cleanly. Relevance: root of "ask the human only where the model is uncertain" — informs confidence-gated handoff.

[B-2] Lewis, D. D., & Gale, W. A. (1994). [E] A Sequential Algorithm for Training Text Classifiers. SIGIR '94, 3–12. arXiv:cmp-lg/9407020. — Coined uncertainty sampling. Worked: up to 500× reduction in labeled data on newswire categorization. Limits: single-classifier uncertainty is myopic, biased toward the current (possibly wrong) boundary; can underperform random sampling under misspecification. Relevance: the original confidence-triggered human-query primitive.

[B-3] Settles, B. (2011). [E/O] From Theories to Queries: Active Learning in Practice. JMLR W&CP 16, 1–18. https://proceedings.mlr.press/v16/settles11a.html — Reality check on theory-vs-deployment. Failed/sobering: real annotation cost is per-instance variable, so utility-maximizing queries did NOT beat random sampling on several NLP tasks even after cost modeling; "informative" items are hardest for humans too, hurting label quality/morale. Relevance: the items a model most wants adjudicated are often the items humans find hardest — raises cost and disagreement.

[B-4] Attenberg, J., & Provost, F. (2010). [E] Why Label When You Can Search? KDD '10. (pagination UNVERIFIED; authorship/argument confirmed) — Under extreme class imbalance, AL can exhaust budget without surfacing a rare positive ("cannot be uncertain about a class it has never seen"). Worked: proposes guided learning — humans search for rare positives. Relevance: passive label-what-the-model-asks loops miss rare-but-critical cases; argues for human initiative.

[B-5] Amershi, S., Cakmak, M., Knox, W. B., & Kulesza, T. (2014). [O] Power to the People: The Role of Humans in Interactive Machine Learning. AI Magazine 35(4), 105–120. DOI: 10.1609/aimag.v35i4.2513. — Defining articulation of IML as rapid, focused, incremental human-model cycles, with a case for studying them through user-centered methods. Limit (authors'): users behave unexpectedly, give inconsistent/strategic feedback, hold divergent mental models. Relevance: HITL is an interaction-design problem, not only an ML problem.

[B-6] Stumpf, S., et al. (2009). [E] Interacting Meaningfully with Machine Learning Systems: Three Experiments. IJHCS 67(8), 639–662. DOI: 10.1016/j.ijhcs.2009.03.004. — Whether end users can give richer-than-binary feedback. Worked: users meaningfully critique reasoning. Failed/mixed: some user-suggested changes degraded accuracy; naive incorporation is risky. Relevance: NL correction of agents is double-edged — well-meant edits can harm.

[B-7] Kulesza, T., Burnett, M., Wong, W.-K., & Stumpf, S. (2015). [E] Principles of Explanatory Debugging to Personalize Interactive Machine Learning. IUI '15. DOI: 10.1145/2678025.2701399. — Closed explanation loop: system explains, user corrects via the explanation. Worked: +~52% understanding; up to ~2× more efficient fixes. Limit: depends on faithful, intelligible explanations and a steerable model. Relevance: template for explanation-mediated correction of agents.

[B-8] Monarch, R. (Munro) (2021). [O] Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-Centered AI. Manning. ISBN 9781617296741. — Standard practitioner text unifying AL, annotation, and interface design: sampling for review, annotation QC, interface design, combining transfer/self-supervision with review. Limit: prescriptive/experiential, not controlled study. Relevance: the closest "playbook" for the whole human-feedback pipeline an agent needs.

[B-9] Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). [E] Cheap and Fast — But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks. EMNLP 2008, 254–263. https://aclanthology.org/D08-1027/ — MTurk for NLP across five tasks. Worked: aggregating ~4 cheap labels matched expert gold; quantified and corrected per-annotator bias. Limit: holds for objective tasks; relies on redundancy + a gold standard. Relevance: ancestor of multi-annotator preference collection in RLHF.

[B-10] Sheng, V. S., Provost, F., & Ipeirotis, P. G. (2008). [E] Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. KDD '08, 614–622. DOI: 10.1145/1401890.1401965. — Repeated labeling + quality-aware aggregation improves data and model quality, even when labels aren't cheap. Nuance: "not always" — wasteful when labelers are good/noise is low. Relevance: the cost/quality calculus for how many humans per agent decision/comparison.

[B-11] Artstein, R., & Poesio, M. (2008). [E/O] Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4), 555–596. https://aclanthology.org/J08-4004/ — Reference survey of agreement metrics (κ, π, Fleiss, Krippendorff's α). Limit: agreement is a proxy — high agreement can encode shared bias; low can reflect genuine ambiguity. Relevance: the measurement toolkit for diagnosing disagreement in any feedback dataset.
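
A small worked example shows why agreement is only a proxy. The sketch computes Cohen's κ for two annotators who agree perfectly with each other yet share the same errors against a reference labeling. The label vectors are invented for illustration.

```python
from collections import Counter


def cohen_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators (undefined if chance agreement is 1)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


def accuracy(labels: list, reference: list) -> float:
    return sum(x == y for x, y in zip(labels, reference)) / len(reference)


reference = [1, 1, 0, 0, 1, 0]    # hypothetical "true" labels
annotator_1 = [1, 0, 0, 0, 1, 1]  # shares two errors with annotator_2
annotator_2 = [1, 0, 0, 0, 1, 1]

print(cohen_kappa(annotator_1, annotator_2))  # perfect agreement
print(accuracy(annotator_1, reference))       # but only 4 of 6 correct
```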

[B-12] Vaughan, J. W. (2018). [O] Making Better Use of the Crowd. JMLR 18(193), 1–46. https://jmlr.org/papers/v18/17-234.html — Survey of crowdsourcing for ML (data generation, eval/debugging, hybrid systems, behavioral experiments). Catalogs failure modes: spam/gaming, low effort, fatigue, demographic/task-design bias. Relevance: the pathologies that re-emerge when humans supply agent feedback at scale.

[B-13] Plank, B. (2022). [O] The "Problem" of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. EMNLP 2022, 10671–10682. https://aclanthology.org/2022.emnlp-main.731/ — Argues annotator disagreement is often signal (genuine subjectivity/ambiguity); aggregating to one ground truth discards information. Relevance: explains why RLHF preference disagreement is partly irreducible — no single correct answer to optimize toward.

[B-14] Ratner, A., De Sa, C., Wu, S., Selsam, D., & Ré, C. (2016). [E] Data Programming: Creating Large Training Sets, Quickly. NeurIPS 2016. arXiv:1605.07723. — Users write noisy labeling functions whose accuracies/correlations are estimated without ground truth to produce probabilistic labels. Limit: bounded by coverage; correlated/systematically-wrong functions mislead. Relevance: theory behind programmatic/AI-assisted supervision.

[B-15] Ratner, A., Bach, S. H., et al. (2017/2020). [E] Snorkel: Rapid Training Data Creation with Weak Supervision. PVLDB 11(3), 269–282 (ext. VLDB Journal 2020, DOI 10.1007/s00778-019-00552-1). https://www.vldb.org/pvldb/vol11/p269-ratner.pdf — Operationalized data programming. Worked: experts built models 2.8× faster, ~45.5% better than 7h hand labeling; industrial scale. Limit: needs expert-authored functions; weak labels remain noisier. Relevance: expert-in-the-loop programmatic supervision can substitute for much manual annotation.

[B-16] Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). [E] Deep Reinforcement Learning from Human Preferences. NeurIPS 2017. arXiv:1706.03741. — Learn a reward model from pairwise preferences over short trajectory segments, optimize policy against it. Worked: solved Atari/locomotion with feedback on <1% of interactions. Limit: learned reward is a gameable proxy. Relevance: the direct technical ancestor of LLM RLHF.
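
The reward-model step can be sketched with the standard pairwise (Bradley-Terry style) formulation: the probability that one segment is preferred is a logistic function of the difference in predicted reward, and the model is fit by cross-entropy against human choices. The function names are hypothetical, and the scalar inputs stand in for predicted rewards summed over a trajectory segment.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """P(a preferred over b) = exp(r_a) / (exp(r_a) + exp(r_b))."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Cross-entropy loss for one human comparison."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


# Loss is small when the model already ranks the human-preferred segment higher.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

Only the reward difference matters in this formulation, so the learned reward is identified up to an additive constant. That is one reason it remains a proxy rather than a ground-truth score.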

[B-17] Stiennon, N., Ouyang, L., et al. (2020). [E] Learning to Summarize from Human Feedback. NeurIPS 2020. arXiv:2009.01325. — Preference RLHF for summarization at LM scale. Worked: preferred over SFT and even human references; generalized. Failed/observed: explicit reward-model over-optimization — high proxy score, worse to humans. Relevance: first LLM-scale demo of both the power and the reward-hacking failure mode.

[B-18] Ouyang, L., Wu, J., et al. (2022). [E] Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022. arXiv:2203.02155. — The three-stage RLHF recipe (SFT → reward model → PPO). Worked: 1.3B InstructGPT preferred over 175B GPT-3; more truthful/less toxic. Failed/acknowledged: "alignment tax," residual harms, and a small, demographically narrow labeler pool (≈72–77% agreement) defining "good." Relevance: production template for LLM HITL + its core labeler-pool caveat.

[B-19] Bai, Y., Jones, A., et al. (2022). [E] Training a Helpful and Harmless Assistant with RLHF. arXiv:2204.05862. — Large-scale HH-RLHF with iterated weekly preference collection. Worked: an "alignment bonus" across evals; helpful-but-non-evasive via data mixing. Observed: near-linear reward vs √(KL) drift; purely-helpful models far easier to red-team (helpful/harmless tension). Relevance: iterated feedback at scale and the trade-off agents must navigate.

[B-20] Bai, Y., Kadavath, S., et al. (2022). [E] Constitutional AI: Harmlessness from AI Feedback (RLAIF). arXiv:2212.08073. — Replaces most human harmlessness labels with AI feedback guided by a written constitution (self-critique + RLAIF). Worked: harmless-but-non-evasive with far fewer human labels; auditable principles. Limits: only as good as the constitution + model judgment; inherits base-model blind spots; shifts (not removes) oversight burden. Relevance: the pivot to human-defining-the-rules, AI-in-the-loop.

[B-21] Casper, S., Davies, X., et al. (2023). [O] Open Problems and Fundamental Limitations of RLHF. TMLR. arXiv:2307.15217. — Definitive taxonomy by stage: (1) feedback — biased/erring/adversarial/lazy annotators, foolable, disagreeing, narrow population; (2) reward model — misspecification, hacking, scalar can't capture diverse values; (3) optimization — over-optimization, distribution shift, mode collapse. Separates tractable from fundamental limits. Relevance: the single best checklist of HITL failure modes.

[B-22] Gao, L., Schulman, J., & Hilton, J. (2023). [E] Scaling Laws for Reward Model Overoptimization. ICML 2023. arXiv:2210.10760. — Quantifies Goodhart: as policy optimizes a proxy reward, gold reward rises then falls, in clean functional forms scaling with RM size/data. Relevance: hard grounding for reward hacking — more optimization eventually hurts; bound it (KL penalties, early stopping).
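
The rise-then-fall shape can be made concrete with the functional forms reported in [B-22], where d is the square root of the KL divergence between the optimized and the initial policy: d(α − βd) for best-of-n sampling and d(α − β log d) for RL. The sketch below evaluates both and locates the peak analytically. The coefficient values in the usage lines are made up for illustration and are not fitted values from the paper.

```python
import math


def gold_reward_bon(d: float, alpha: float, beta: float) -> float:
    """Gold reward under best-of-n: R(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)


def gold_reward_rl(d: float, alpha: float, beta: float) -> float:
    """Gold reward under RL: R(d) = d * (alpha - beta * log d), for d > 0."""
    if d <= 0:
        return 0.0
    return d * (alpha - beta * math.log(d))


def peak_distance_bon(alpha: float, beta: float) -> float:
    """Solve dR/dd = alpha - 2 * beta * d = 0."""
    return alpha / (2.0 * beta)


def peak_distance_rl(alpha: float, beta: float) -> float:
    """Solve dR/dd = alpha - beta * (log d + 1) = 0."""
    return math.exp(alpha / beta - 1.0)


# Illustrative coefficients only (assumed, not from the paper).
d_star = peak_distance_bon(alpha=1.0, beta=0.1)
before = gold_reward_bon(d_star - 1.0, 1.0, 0.1)
at_peak = gold_reward_bon(d_star, 1.0, 0.1)
after = gold_reward_bon(d_star + 1.0, 1.0, 0.1)
```

Past the peak distance, further optimization against the proxy lowers the gold reward, which is the quantitative case for bounding optimization with KL penalties or early stopping.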

I.C — Scalable oversight & AI-safety HITL

Keeping humans meaningfully in control of increasingly capable AI.

Highlights

  • Scalable oversight is the load-bearing assumption of HITL. As agents exceed humans on the relevant skills, direct human evaluation breaks down [C-1, C-3, C-9]; every technique here amplifies a limited human rather than relying on raw human judgment.
  • "AI helps the human evaluate" is the central design pattern. Recursive reward modeling, iterated amplification, decomposition, and debate all bootstrap oversight via (more-trusted) AI assisting the overseer [C-3, C-5, C-6, C-7]. Book summarization [C-7] is the clearest proof it works on a real task.
  • Debate works, but less than the hype, and only in the right conditions. Persuasive-debater results are encouraging [C-10]; the broadest benchmark finds gains modest, task-dependent, and shrinking as the judge's own information grows [C-11]. Debate assumes a competent, unbiased judge, which is its key vulnerability.
  • Constitutional AI/RLAIF reduce the oversight burden partly by removing the human [C-8]. Substituting an AI judge scales but relocates risk into the constitution's quality and the AI evaluator's blind spots.
  • Weak-to-strong generalization makes superhuman oversight studiable today — and shows naive supervision is insufficient [C-12]. Strong students beat weak teachers but recover only part of the gap; combining with debate helps [C-13].
  • FAILURE MODE — over-reliance / rubber-stamping is the default. Automation bias yields omission + commission errors [C-14]; complacency afflicts experts and resists practice [C-15]; people over-trust advice because it's labeled algorithmic [C-16].
  • FAILURE MODE — explanations cause over-trust, not better oversight [C-17, C-18]. Transparency aimed at empowering the overseer can instead disarm them.
  • What actually curbs rubber-stamping: friction [C-18]. Cognitive forcing functions (commit-before-seeing-AI, mandatory deliberation) measurably reduce over-reliance — but they're slower, unpopular, and help analytical users most.
  • FAILURE MODE — adding a human can make the team worse [C-19]. Meta-analytically, human–AI combinations often underperform the better of human-or-AI alone, especially in decision tasks and when the AI is stronger.
  • FAILURE MODE — illusory oversight / accountability laundering [C-21]. Mandating an overseer who cannot in practice oversee legitimizes flawed systems and creates a moral crumple zone.
  • Meaningful control demands tracking + tracing, not a button [C-2, C-20]. The system must respond to relevant human reasons and outcomes must trace to an informed, responsible human — a far higher bar than a veto checkbox.
  • Assume the model may be adversarial: monitor, don't just align [C-22]. AI-control protocols stay safe against intentional subversion by routing scarce human/trusted-model review to flagged actions — the most operationally concrete HITL strategy here.
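
One of the forcing functions named in the highlights above, commit-before-seeing-AI, can be sketched as a small gate that refuses to reveal the model's recommendation until the reviewer has recorded an independent judgment. This is a minimal sketch under assumptions: the class and method names are hypothetical, and the studies behind [C-18] evaluated interface designs, not this code.

```python
from dataclasses import dataclass
from typing import Optional


class CommitRequiredError(RuntimeError):
    """Raised when the AI recommendation is requested before a commitment."""


@dataclass
class ForcedDeliberationReview:
    ai_recommendation: str
    human_commitment: Optional[str] = None

    def commit(self, judgment: str) -> None:
        """Record the reviewer's own answer; allowed exactly once."""
        if not judgment.strip():
            raise ValueError("an empty judgment is not a commitment")
        if self.human_commitment is not None:
            raise RuntimeError("judgment already committed")
        self.human_commitment = judgment.strip()

    def reveal(self) -> str:
        """Return the AI recommendation only after the human has committed."""
        if self.human_commitment is None:
            raise CommitRequiredError("commit your own judgment first")
        return self.ai_recommendation

    def disagrees(self) -> bool:
        """True when the committed judgment differs from the AI's."""
        return self.reveal() != self.human_commitment
```

The gate adds a step to every review, which is the cost the highlight describes: the friction is the mechanism, so removing it to improve satisfaction also removes the benefit.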

Annotated bibliography

[C-1] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). [position] Concrete Problems in AI Safety. arXiv:1606.06565. — Names "scalable oversight" as one of five concrete problems: training/evaluating when the true objective is too expensive/rare to evaluate, so cheap proxies invite reward hacking. Worked: a shared, tractable vocabulary. Limit: an agenda, not a solution. Relevance: canonical origin of the framing all agentic-HITL design inherits.

[C-2] Russell, S. (2019). [position, book] Human Compatible: Artificial Intelligence and the Problem of Control. Viking/Penguin. — Argues the "standard model" (optimize a fixed objective) is unsafe; proposes assistance games / CIRL where the AI is uncertain about preferences and treats human behavior (incl. shutdown) as evidence → corrigibility. Limit: largely theoretical, hard to scale. Relevance: meaningful control requires the agent to want to defer, not just be monitored.

[C-3] Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). [position] Scalable Agent Alignment via Reward Modeling: A Research Direction. arXiv:1811.07871. — Recursive reward modeling: learn reward from feedback, then use aligned agents to help users evaluate harder tasks. Worked: articulated the recursive-oversight scaffold. Limit: identifies but doesn't solve reward gaming/feedback quality. Relevance: blueprint for HITL where AI assists the evaluator.

[C-4] / see B-16. Christiano et al. (2017). Deep RL from Human Preferences. (cross-listed in I.B) — empirical root of preference-based HITL.

[C-5] Irving, G., Christiano, P., & Amodei, D. (2018). [position + toy exp.] AI Safety via Debate. arXiv:1805.00899. — Two agents argue; a human judges who was more truthful. Bet: refuting a lie is easier than constructing one, so a limited human can supervise superhuman play (PSPACE vs NP analogy). Worked: concept + MNIST demo. Limit: assumes a competent, unbiased judge; vulnerable to persuasive-but-wrong and "obfuscated" arguments. Relevance: core AI-vs-AI mechanism to amplify a human overseer.

[C-6] Christiano, P., Shlegeris, B., & Amodei, D. (2018). [E, algorithmic] Supervising Strong Learners by Amplifying Weak Experts (Iterated Amplification). arXiv:1810.08575. — Build a training signal for hard problems by decomposing into subproblems a weak agent + human can solve, then distill. Worked: learned complex behaviors via decomposition. Limit: clean algorithmic domains; relies on clean decomposition. Relevance: the decomposition primitive behind "AI helps the human evaluate."

[C-7] Wu, J., Ouyang, L., et al. (2021). [E] Recursively Summarizing Books with Human Feedback. arXiv:2109.10862. — First real-world recursive decomposition + RLHF: labelers evaluate a whole book via summaries-of-summaries. Worked: humans supervised a task larger than any could check, with traceability to source. Limit: below human quality; decomposition loses cross-section coherence. Relevance: proof decomposition makes superhuman-scale tasks human-supervisable.

[C-8] / see B-20. Bai et al. (2022). Constitutional AI. (cross-listed in I.B) — scales oversight by substituting AI feedback for human feedback; relocates risk to the constitution and the AI judge's blind spots.

[C-9] Bowman, S. R., Hyun, J., Perez, E., Chen, E., et al. (2022). [E + method] Measuring Progress on Scalable Oversight for Large Language Models. arXiv:2211.03540. — Proposes the "sandwiching" paradigm — tasks where specialists succeed but non-experts and current models fail — to test whether non-expert + model reaches expert level. Worked: early proof-of-concept beat both human-alone and model-alone on QA. Limit: narrow QA; no adversarial/deceptive models yet. Relevance: the empirical program for measuring whether HITL closes the human–AI gap.

[C-10] Khan, A., Hughes, J., et al. (2024). [E] Debating with More Persuasive LLMs Leads to More Truthful Answers. ICML 2024 (Best Paper). arXiv:2402.06782. — On info-asymmetric reading comprehension, non-expert judges chose more truthfully under debate (76%/88% LLM/human) than naive baselines (48%/60%); optimizing debaters for persuasiveness increased judge accuracy. Limit: one task type; persuasiveness/truth may diverge elsewhere. Relevance: leading empirical support that debate can amplify a weaker overseer.

[C-11] Kenton, Z., Siegel, N. Y., et al. (2024). [E] On Scalable Oversight with Weak LLMs Judging Strong LLMs. NeurIPS 2024. arXiv:2407.04622. — Compares debate, "consultancy," and direct QA across many task types. Debate beats consultancy; stronger debaters help — but modestly, and the advantage shrinks when the judge has its own information. Failure: debate gains smaller than hyped and concentrated in extractive info-asymmetry. Relevance: tempers expectations — protocol choice matters; benefits are task-contingent.

[C-12] Burns, C., Izmailov, P., Kirchner, J. H., et al. (2023). [E] Weak-to-Strong Generalization. arXiv:2312.09390. — Fine-tune strong models on weak-model labels; strong students exceed weak supervisors, and an auxiliary-confidence loss recovers much of the NLP-task gap. Worked: studiable today; strong models don't merely imitate weak errors. Failed/limit: recovers only a fraction, uneven across tasks (reward modeling/chess); RLHF likely scales poorly without more work. Relevance: reframes oversight as a generalization problem; naive label-transfer insufficient.
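
The "fraction of the gap" language corresponds to the performance gap recovered (PGR) metric used in [C-12]: the share of the distance between the weak supervisor and the strong model's ceiling that the weakly supervised strong student closes. A minimal sketch follows; the function name and the accuracy numbers in the usage line are made up for illustration.

```python
def performance_gap_recovered(weak: float,
                              weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    1.0 means the student matched the strong ceiling despite weak labels;
    0.0 means it did no better than its weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Illustrative accuracies only (assumed, not results from the paper).
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.70, strong_ceiling=0.80)
```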

[C-13] Lang, H., Huang, F., & Li, Y. (2025). [E] Debate Helps Weak-to-Strong Generalization. AAAI 2025. arXiv:2501.13124. — Uses debate to help a weak model extract trustworthy info from an untrustworthy strong model, then as supervision. Worked: combining oversight + generalization beats either alone. Limit: dataset-specific; long contexts disrupt debate. Relevance: the two strategies are complementary.

[C-14] / see F-9. Skitka, Mosier & Burdick (1999). Does Automation Bias Decision-Making? (cross-listed in II.F) — original evidence base for rubber-stamping (omission/commission errors).

[C-15] / see F-7. Parasuraman & Manzey (2010). Complacency and Bias… (cross-listed in II.F) — expertise and "just train people" are weak defenses against rubber-stamping.

[C-16] Logg, J. M., Minson, J. A., & Moore, D. A. (2019). [E] Algorithm Appreciation: People Prefer Algorithmic to Human Judgment. OBHDP 151, 90–103. DOI: 10.1016/j.obhdp.2018.12.005. — Counter to "algorithm aversion": laypeople weight advice more when told it's algorithmic — but appreciation wanes vs. one's own judgment and among experts. Relevance: the disposition to over-trust labeled-AI advice is a baseline risk agentic HITL must counteract.

[C-17] / see F-20. Bansal et al. (2021). Does the Whole Exceed Its Parts? (cross-listed in II.F) — explanations increase acceptance regardless of correctness.

[C-18] / see F-21. Buçinca, Malaya & Gajos (2021). To Trust or to Think. (cross-listed in II.F) — cognitive forcing functions reduce over-reliance; friction is unpopular but works.

[C-19] Vaccaro, M., Almaatouq, A., & Malone, T. (2024). [E, meta-analysis] When Combinations of Humans and AI Are Useful. Nature Human Behaviour 8, 2293–2303. DOI: 10.1038/s41562-024-02024-1. — 106 studies / 370 effect sizes: on average human–AI teams performed worse than the better of human-alone or AI-alone. Gains in content creation; losses in decision tasks; combination helped when the human was the stronger party, hurt when the AI was. Relevance: strongest caution that HITL is not automatically beneficial.

[C-20] Santoni de Sio, F., & van den Hoven, J. (2018). [position, philosophy] Meaningful Human Control over Autonomous Systems. Frontiers in Robotics and AI 5:15. DOI: 10.3389/frobt.2018.00015. — Two conditions: tracking (system responds to relevant human moral reasons + environment) and tracing (outcomes trace to a human who understood the system and their role). Addresses the responsibility gap. Relevance: control means more than pressing a button — reason-responsiveness + traceable accountability.

[C-21] Green, B. (2022). [position + evidence] The Flaws of Policies Requiring Human Oversight of Government Algorithms. Computer Law & Security Review 45, 105681. arXiv:2109.05067. — Oversight mandates assume people can oversee algorithms; evidence shows they often can't (automation bias), so policies legitimize flawed systems while creating a façade of accountability (rubber-stamp / moral crumple zone). Relevance: keystone "oversight is often illusory" argument — nominal HITL can launder responsibility without delivering control.

[C-22] Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023/2024). [E + method] AI Control: Improving Safety Despite Intentional Subversion. ICML 2024 (oral). arXiv:2312.06942. — Design + red-team protocols that stay safe even if a powerful untrusted model actively subverts oversight, using a weaker trusted model + scarce trusted human labor. On backdoored-code, "trusted editing" and "untrusted monitoring" sharply improve the safety–usefulness frontier. Limit: constrained coding setting; assumes a genuinely trusted weaker model + effective red-teaming. Relevance: shifts oversight from "make it aligned" to "stay safe assuming it isn't" — allocate scarce human review where monitoring flags risk.
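
The allocation idea, spending scarce human review on the actions a trusted monitor finds most suspicious, can be sketched as below. This is a simplified illustration and not the protocol from [C-22]: the `Action` type, the suspicion scores, and the fixed audit budget are all assumptions, and the trusted monitor is represented only by a precomputed score.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    action_id: str
    suspicion: float  # assumed score from a trusted monitor; higher is more suspicious


def route_for_review(actions: List[Action],
                     audit_budget: int) -> Tuple[List[Action], List[Action]]:
    """Split actions into (sent to human audit, handled without human audit).

    The most suspicious actions are audited first, up to the budget.
    """
    if audit_budget < 0:
        raise ValueError("audit budget cannot be negative")
    ranked = sorted(actions, key=lambda a: a.suspicion, reverse=True)
    return ranked[:audit_budget], ranked[audit_budget:]


proposed = [Action("a", 0.1), Action("b", 0.9), Action("c", 0.5)]
audited, unaudited = route_for_review(proposed, audit_budget=1)
```

In a fuller protocol the unaudited remainder would not simply run unchecked; the paper's "trusted editing" variant, for example, has the trusted model rewrite moderately suspicious outputs. That step is omitted here.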

I.D — Governance, standards & "human oversight" mandates

What regulators/standards require, and scholarship on whether mandated oversight actually works. Tags: [LEGAL/STANDARD] · [SCHOLARLY] · [CRITIQUE] · [EMPIRICAL].

Highlights

  • The law now demands oversight; the evidence says humans can't reliably deliver it. EU AI Act Art. 14 [D-1] and GDPR Art. 22 [D-2] mandate human oversight/intervention, while 25+ years of empirics [D-18, D-19, D-20] show people systematically over-rely on automation and miss its errors — especially under load and high autonomy (the agentic regime).
  • Regulators named the failure mode — then arguably legislated into it. WP29/EDPB [D-4] explicitly bars rubber-stamping and requires reviewers with "authority and competence to change the decision," yet Art. 14's de-biasing strategy is largely to require awareness of automation bias — which [D-17] argues won't actually de-bias anyone.
  • "Add a human" is a legitimacy device, not a control [D-14]. Oversight requirements often launder faulty algorithms with false confidence while diffusing accountability; agencies are rarely required to prove oversight works. Green's fix — shift the burden to demonstrate effectiveness — is the most actionable governance reform here.
  • The human is frequently positioned to absorb blame, not exercise control. Elish's "moral crumple zone" [D-16] and the "MABA-MABA trap" [D-15] warn that inserting a human creates new failure modes (scapegoating, deskilling, diffused responsibility) rather than curing algorithmic ones.
  • Override authority can introduce bias, not remove it [D-24]. Judges' discretionary overrides of a risk tool produced racial disparities — human discretion is not a neutral corrective.
  • Opacity structurally defeats oversight [D-25, D-22]. Black-box scores can't be meaningfully scrutinized by nominally-responsible humans — directly implicating agents whose reasoning is hard to inspect in real time.
  • Accountability sinks are not hypothetical. The Dutch benefits scandal [D-21] (toppled a government) and Australian Robodebt [D-23] (unlawful, ~470k wrongful debts) are large-scale demonstrations of automated decisions running with ineffective/complicit oversight — caught by courts/commissions, not the systems' own HITL.
  • Standards give vocabulary and scaffolding but stop short of teeth. NIST AI RMF + GenAI Profile [D-5, D-6] name overreliance/automation bias and tie oversight intensity to autonomy; ISO/IEC 42001 & 23894 [D-7, D-8] give certifiable governance — but all are voluntary, outcome-based, and leave "does this human actually control anything?" to the deployer.
  • Oversight language is engineered to be slippery. DoD 3000.09 [D-11] chose "appropriate levels of human judgment" over "meaningful human control"; OECD/UNESCO/ICRC [D-9, D-10, D-12] endorse retained control but are non-binding (ICRC's predictability + timely intervention/deactivation is the most operational).
  • There is a usable design target [D-13]: tracking + tracing reframes control as reason-responsiveness plus traceable human accountability — a better spec than "a person can press stop."
  • Net takeaway: treat human oversight as a claim to be empirically validated, not a checkbox. Effective oversight needs (a) a genuinely empowered reviewer with authority and competence to override, (b) inspectable agent reasoning, (c) intensity scaled to autonomy/stakes, (d) defenses against automation bias/deskilling, and (e) accountability that doesn't collapse onto a lone crumple-zone human.

Annotated bibliography — Part A: what regulators & standards require

[D-1] European Parliament & Council (2024). [LEGAL/STANDARD] Regulation (EU) 2024/1689 (AI Act), Article 14 "Human oversight." OJ L, 12 Jul 2024. https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng — High-risk AI must be "effectively overseen by natural persons." Art. 14(4): overseers must understand capacities/limits, "remain aware of … over-relying on the output (automation bias)," correctly interpret, "disregard, override or reverse," and interrupt via a "stop button"; biometric ID needs two-person verification. Worked: most concrete oversight mandate in law; first to name automation bias. Gap: high-risk only; relies on deployer-assigned competence; untested (most provisions apply from 2 Aug 2026). Relevance: the closest statutory spec for agentic HITL — intervention authority, override, kill-switch as design requirements.

[D-2] European Parliament & Council (2016). [LEGAL/STANDARD] Regulation (EU) 2016/679 (GDPR), Article 22. OJ L 119. https://gdpr-info.eu/art-22-gdpr/ — Right not to be subject to decisions "based solely on automated processing" with "legal … or similarly significant" effects; where permitted, "at least the right to obtain human intervention," to express a view, and to contest. Gap: triggers only on "solely" automated + "significant"; "human intervention" undefined — the loophole nominal review exploits. Relevance: the original legal demand for a human checkpoint over automated decisions.

[D-3] CJEU (2023). [LEGAL/STANDARD] OQ v Land Hessen ("SCHUFA Scoring"), Case C-634/21. CELEX 62021CJ0634. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A62021CJ0634 (verbatim text UNVERIFIED by direct fetch; holding corroborated) — Held an automated probability score is itself an "automated individual decision" when a third party "draws strongly" on it. Worked: closes the loophole where upstream profiling escaped oversight as "preparatory." Relevance: in agent pipelines an upstream model's score/recommendation can be the legally decisive act — oversight must attach there.

[D-4] Article 29 WP / EDPB (2017, rev. 2018). [LEGAL/STANDARD] Guidelines on Automated individual decision-making and Profiling (WP251rev.01). https://ec.europa.eu/newsroom/article29/items/612053 — Human involvement must be meaningful, by "someone who has the authority and competence to change the decision"; expressly warns rubber-stamping does not exempt the controller. Worked: source of the "meaningful human review" standard. Gap: soft law. Relevance: the canonical statement that token HITL is non-compliant.

[D-5] NIST (2023). [LEGAL/STANDARD] AI Risk Management Framework (AI RMF 1.0), NIST AI 100-1. DOI: 10.6028/NIST.AI.100-1. — GOVERN/MAP/MEASURE/MANAGE; oversight lives in the "Human-AI Configuration" category naming overreliance, automation bias, algorithmic aversion, anthropomorphizing. Worked: granular, widely adopted; Playbook scales oversight to risk. Gap: voluntary, outcome-based. Relevance: design vocabulary (human-AI teaming, overreliance) + a lifecycle hook for oversight controls.

[D-6] NIST (2024). [LEGAL/STANDARD] AI RMF: Generative AI Profile, NIST AI 600-1. DOI: 10.6028/NIST.AI.600-1. — Extends the RMF to GenAI across twelve risk categories incl. Human-AI Configuration; oversight calibrated to risk, and notes autonomous real-world action needs mechanisms beyond static inference. Relevance: the most agent-relevant U.S. guidance — explicitly ties oversight intensity to autonomy.

[D-7] ISO/IEC (2023). [LEGAL/STANDARD] ISO/IEC 42001:2023 — AI Management System. https://www.iso.org/standard/42001 (iso.org returned HTTP 403 to the fetcher; content cross-confirmed) — First certifiable AI management-system standard (PDCA): policies, accountability, roles, risk assessment, controls. Gap: governs process, not technical oversight controls. Relevance: the org-level accountability layer for deploying agents responsibly.

[D-8] ISO/IEC (2023). [LEGAL/STANDARD] ISO/IEC 23894:2023 — AI Risk Management Guidance. https://www.iso.org/standard/77304.html (iso.org returned HTTP 403 to the fetcher; content cross-confirmed) — AI-specific risk-management guidance aligned to ISO 31000, complementary to 42001. Gap: guidance only; oversight is one risk-treatment option. Relevance: the risk-treatment basis under which oversight controls get identified. (Adjacent: ISO/IEC TR 24028:2020 trustworthiness/controllability.)

[D-9] OECD (2019, rev. 2024). [LEGAL/STANDARD] Recommendation of the Council on AI (OECD/LEGAL/0449). https://oecd.ai/en/ai-principles — First intergovernmental AI standard; values incl. human rights/autonomy/agency and (strengthened 2024) capacity for human determination/oversight appropriate to context. Gap: soft law. Relevance: the principle-level anchor for "human agency" over autonomous systems.

[D-10] UNESCO (2021). [LEGAL/STANDARD] Recommendation on the Ethics of AI. https://www.unesco.org/en/artificial-intelligence/recommendation-ethics — First global AI-ethics instrument (all 193 members); dedicated "Human oversight and determination" principle — AI must not displace ultimate human responsibility; life-and-death decisions not ceded to AI. Gap: non-binding. Relevance: the global norm that ultimate accountability stays human.

[D-11] U.S. DoD (2023). [LEGAL/STANDARD] DoD Directive 3000.09, "Autonomy in Weapon Systems." https://www.esd.whs.mil/portals/54/documents/dd/issuances/dodd/300009p.pdf (HTTP 403 on direct fetch; content cross-verified) — Requires "appropriate levels of human judgment over the use of force" — deliberately not "meaningful human control," i.e., context-dependent. Critique: flexibility leaves the required level ambiguous (HRW). Relevance: the policy slipperiness of oversight language — "appropriate" can mean anything.

[D-12] ICRC (2021). [LEGAL/STANDARD] ICRC Position on Autonomous Weapon Systems. https://www.icrc.org/en/document/icrc-position-autonomous-weapon-systems — Calls for binding rules: prohibit unpredictable systems + those targeting humans; regulate the rest via limits on target type, duration, geography, and human-machine interaction ensuring effective supervision and timely intervention/deactivation. Relevance: maps to agent guardrails — bound the action space, ensure timely interruption.

[D-13] Santoni de Sio, F. & van den Hoven, J. (2018). [SCHOLARLY] Meaningful Human Control over Autonomous Systems. Frontiers in Robotics and AI 5:15. DOI: 10.3389/frobt.2018.00015. (also at [C-20]) — Tracking (responds to relevant human moral reasons/circumstances) + tracing (behavior traceable to a human who understands its capabilities/stakes); closes the responsibility gap. Relevance: a rigorous design target — control is reason-responsiveness + traceable accountability, not a button.

Annotated bibliography — Part B: critiques (oversight as rubber stamp / accountability sink)

[D-14] Green, B. (2022). [CRITIQUE] The Flaws of Policies Requiring Human Oversight of Government Algorithms. Computer Law & Security Review 45:105681. DOI: 10.1016/j.clsr.2022.105681 (OA arXiv:2109.05067). (also at [C-21]) — Surveys 41 policies; two flawed assumptions (people can oversee; requirements don't themselves harm). Failed: oversight legitimizes faulty algorithms, gives false security, diffuses accountability. Fix: flip the burden — agencies must demonstrate with evidence that oversight works before adoption. Relevance: "add a human" is not a control unless proven effective.

[D-15] Crootof, R., Kaminski, M. E. & Price, W. N. II (2023). [CRITIQUE] Humans in the Loop. Vanderbilt Law Review 76(2):429. https://scholarship.law.vanderbilt.edu/vlr/vol76/iss2/2/ — HITL is invoked as a regulatory reflex. Names the "MABA-MABA trap": inserting a human to patch algorithmic flaws ignores that hybrid systems generate new problems (automation bias, deskilling, diffused responsibility). Fix: a taxonomy of distinct human roles requiring tailored interventions. Relevance: a framework for what kind of human role an agent actually needs vs. a decorative one.

[D-16] Elish, M. C. (2019). [CRITIQUE] Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction. ESTS 5:40–60. DOI: 10.17351/ests2019.260. (also at [F-19]) — The "moral crumple zone": a human in a highly automated system becomes the liability sponge, absorbing blame for failures they couldn't prevent — protecting the system and its makers. Relevance: an agent "approver" may be set up to take the fall, not to control outcomes.

[D-17] Laux, J. & Ruschemeier, H. (2025). [CRITIQUE] Automation Bias in the AI Act: On the Legal Implications of Attempting to De-Bias Human Oversight of AI. European Journal of Risk Regulation 16:1519–1534. arXiv:2502.10036. — Ties the automation-bias literature to Art. 14, which obliges enabling overseers' awareness of bias; argues awareness alone is unlikely to de-bias, given how robust the bias is. Relevance: the bridge between the law [D-1] and the empirics [D-18/19] — the AI Act may have encoded an unproven assumption.

Annotated bibliography — Part C: empirical evidence that oversight fails in practice

[D-18] Skitka, L. J., Mosier, K. L. & Burdick, M. (1999). [EMPIRICAL] Does automation bias decision-making? IJHCS 51(5):991–1006. DOI: 10.1006/ijhc.1999.0252. (also at [F-9]) — Participants with a reliable-but-imperfect aid performed worse than those without — omission + commission errors. Relevance: the empirical floor under every "a human will catch the model's mistakes" claim.

[D-19] Skitka, L. J., Mosier, K. L. & Burdick, M. (2000). [EMPIRICAL] Accountability and automation bias. IJHCS 52(4):701–717. DOI: 10.1006/ijhc.1999.0349. (also at [F-11]) — Making people accountable for accuracy reduced but did not eliminate automation bias. Relevance: even accountability pressure (the AI Act/GDPR strategy) leaves a substantial residue.

[D-20] Parasuraman, R. & Manzey, D. H. (2010). [review of EMPIRICAL] Complacency and Bias in Human Use of Automation. Human Factors 52(3):381–410. DOI: 10.1177/0018720810376055. (also at [F-7]) — Complacency/bias both stem from attentional allocation; under load, operators under-monitor and miss failures. Relevance: explains why oversight degrades precisely when most needed — high workload, high autonomy.

[D-21] Amnesty International (2021). [EMPIRICAL] Xenophobic Machines: …the Dutch Childcare Benefits Scandal (EUR 35/4686/2021). https://www.amnesty.org/en/documents/eur35/4686/2021/en/ — A self-learning risk model flagged tens of thousands of mostly low-income/immigrant families as fraudsters (nationality as a risk factor); §7.5 documents failed human/institutional oversight; scandal toppled the cabinet (Jan 2021). Relevance: canonical case of automated decisions with nominal-but-ineffective human review.

[D-22] District Court of The Hague (2020). [EMPIRICAL/legal] NJCM c.s. v. The State of the Netherlands ("SyRI"), C-09-550982, 5 Feb 2020. (UN: https://www.ohchr.org/en/press-releases/2020/02/landmark-ruling-dutch-court-stops-government-attempts-spy-poor-un-expert) — Struck down SyRI welfare-fraud risk-profiling as violating ECHR Art. 8 — no transparency, verifiability, or safeguards. Relevance: oversight gaps in automated scoring caught by courts, not the system's own HITL.

[D-23] Royal Commission into the Robodebt Scheme (2023). [EMPIRICAL] Report of the Royal Commission into the Robodebt Scheme. Commonwealth of Australia. https://robodebt.royalcommission.gov.au/publications/report (gov.au timed out; findings corroborated) — Automated income-averaging unlawfully generated ~470,000 welfare debts, reversing the burden of proof; "a crude and cruel mechanism, neither fair nor legal"; ~A$1.2bn settlement; 57 recommendations. Relevance: what happens when an automated pipeline runs with oversight that is captured/complicit rather than corrective.

[D-24] Stevenson, M. T. (2018). [EMPIRICAL] Assessing Risk Assessment in Action. Minnesota Law Review 103:303–384. https://scholarship.law.umn.edu/mlr/58/ — Kentucky pretrial risk assessment (>1M cases): judges overrode recommendations unevenly (adherence higher in predominantly white rural counties), producing racial disparities. Relevance: human override is not automatically a safeguard — it can introduce its own bias.

[D-25] Angwin, J., Larson, J., Mattu, S. & Kirchner, L. (2016). [EMPIRICAL] Machine Bias. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing — ~7,000 COMPAS scores: Black defendants ~2× as likely to be falsely flagged high-risk; ~61% accurate; proprietary black-box scores not effectively scrutinized by relying judges. Relevance: opacity defeats oversight — agents whose reasoning humans can't inspect can't be meaningfully overseen. (Northpointe disputed the methodology — the calibration-vs-error-rate fairness debate.)


Part II — Ergonomics, Usability & Human-Factors Foundations

II.E — Automation theory & human-factors foundations

The classical foundations agentic AI should inherit.

Highlights

  • The fundamental irony (Bainbridge, [E-7]) applies verbatim. Automating the routine work an agent does well increases the difficulty and stakes of the human's leftover job — monitoring for rare failures and recovering from them. Designing for the happy path while ignoring the human's degraded oversight role is the original sin.
  • Reliability is double-edged [E-17, E-18]. The more dependable an agent, the less the human monitors it, so the rare confident error slips through. Better models make complacency easier; pair reliability with deliberate engagement.
  • The out-of-the-loop problem is the central HITL hazard [E-13]. Full autonomy maximizes the takeover penalty; intermediate levels that keep the human in the active decision loop preserve takeover ability. Prefer "human in the loop" to "human on the loop" for high-consequence actions.
  • Autonomy is a dial, not a switch — and there are several dials [E-4, E-6]. Decide which stage (acquire → analyze → decide → act) and what level (suggest → act-with-approval → act-then-notify → act-silently) independently. Automating the decision/action stage is where over-trust and lost authority bite hardest; automate perception/analysis more freely than action.
  • Measure the failure case, not the happy path — the "lumberjack effect" [E-15]. Higher autonomy buys small routine gains but, past a threshold, catastrophic recovery failures. Evaluate agents on what happens when they're wrong.
  • Calibrated trust is the goal — both over- and under-trust fail [E-16, E-19]. Misuse (blind acceptance), disuse (alarm fatigue), abuse (automating because you can). Communicate capability, process, purpose, and uncertainty so reliance tracks competence.
  • Automation bias replaces judgment, not just effort [E-20]. Confident recommendations induce omission + commission errors. Preserve independent verification, especially for irreversible/commission-type actions.
  • Opacity breeds dangerous mismatches [E-10]. Mode confusion / automation surprise come from divergence between the human's model and reality. Make the agent predictable, its state visible, its actions explainable.
  • Situation awareness, not activity logs, is what overseers need [E-12]. Effective intervention requires comprehension (L2) and projection (L3), not just perception (L1). Feed understanding and what-happens-next, not a stream of tool calls.
  • Skill atrophy is selective and quiet [E-22]. Routine skills survive disuse; the higher-order judgment skills needed for non-routine takeover decay unless the human stays engaged. Keep humans reasoning, not rubber-stamping.
  • Beware "clumsy" agents [E-11]. Agents that demand fiddly setup or grab attention at the busiest moments redistribute workload badly. Design the interaction and its timing, not just the capability.
  • Aim for complementarity; keep the human informed and in command [E-3, E-21]. Reject leftover-allocation thinking for human-centered, congruent design. After ~45 years [E-7→E-9] the ironies remain unsolved — engineer around them; don't expect capability gains to dissolve them.

Annotated bibliography

[E-1] Fitts, P. M. (Ed.) (1951). Human Engineering for an Effective Air-Navigation and Traffic-Control System. National Research Council. (Origin of the "Fitts list" / HABA-MABA.) https://apps.dtic.mil/sti/citations/AD0759066 — Foundational function-allocation document: 11 statements of where "men are better at" (judgment, improvisation, pattern-in-noise) vs. "machines are better at" (speed, power, routine repetition, computation). Trap: framed allocation as static, comparative, leftover-to-the-human. Relevance: the original "agent does X, human does Y" template — and a caution that naive HABA-MABA produces the dangerous supervisory residue.

[E-2] de Winter, J. C. F., & Dodou, D. (2014). Why the Fitts list has persisted throughout the history of function allocation. Cognition, Technology & Work 16(1), 1–11. DOI: 10.1007/s10111-011-0188-1. — Explains the list's 60-year survival (plausibility, simplicity, generalizability) and notes Fitts already foreshadowed the ironies of automation. Relevance: crude "let the LLM handle X" heuristics will persist; use them as a first cut, not a finished allocation.

[E-3] Hollnagel, E., & Bye, A. (2000). Principles for modelling function allocation. IJHCS 52(2), 253–265. DOI: 10.1006/ijhc.1999.0288. — Critiques the Fitts paradigm as "elementaristic"; argues the goal is function congruence/complementarity over time, not parceling tasks to whoever is "better." Relevance: design human + agent as a coupled, complementary system, not a set of handed-off tasks.

[E-4] Sheridan, T. B., & Verplank, W. L. (1978). Human and Computer Control of Undersea Teleoperators. MIT Man-Machine Systems Lab. https://apps.dtic.mil/sti/citations/ADA057655 — Introduced the original 10-point Levels of Automation scale (from "no assistance" → "executes a suggestion if the human approves" → "acts autonomously and ignores the human"). Relevance: the direct ancestor of "autonomy sliders" / permission levels.

[E-5] Sheridan, T. B. (1992). Telerobotics, Automation, and Human Supervisory Control. MIT Press. ISBN 0-262-19316-7. — Book-length statement of supervisory control: the human as higher-level planner/monitor/intervener, not moment-to-moment controller. Lesson: automating the inner loop promotes the human to a different and harder job (the seed of the out-of-the-loop problem). Relevance: agentic use is supervisory control by definition.

[E-6] Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). A Model for Types and Levels of Human Interaction with Automation. IEEE SMC-A 30(3), 286–297. DOI: 10.1109/3468.844354. — Most-cited modern framework: four stages (information acquisition → analysis → decision/action selection → action implementation), each automatable to a different level, with evaluative criteria. Relevance: ready-made taxonomy for agent pipelines (retrieve → analyze → propose → execute); automating the decision/action stage is riskiest.
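
One way to make the stage-by-level idea concrete for an agent pipeline is a per-stage autonomy setting. The sketch below is illustrative only: `Stage`, `Level`, `DEFAULT_POLICY`, and both helper functions are hypothetical names, and the four-step level ladder is the simplified one from the Highlights above rather than the paper's own scale.

```python
from enum import IntEnum


class Stage(IntEnum):
    ACQUIRE = 1  # information acquisition (retrieve)
    ANALYZE = 2  # information analysis
    DECIDE = 3   # decision / action selection (propose)
    ACT = 4      # action implementation (execute)


class Level(IntEnum):
    SUGGEST = 1
    ACT_WITH_APPROVAL = 2
    ACT_THEN_NOTIFY = 3
    ACT_SILENTLY = 4


# Assumed default: automate perception/analysis more freely than decision/action.
DEFAULT_POLICY = {
    Stage.ACQUIRE: Level.ACT_SILENTLY,
    Stage.ANALYZE: Level.ACT_THEN_NOTIFY,
    Stage.DECIDE: Level.SUGGEST,
    Stage.ACT: Level.ACT_WITH_APPROVAL,
}


def requires_human_approval(stage, policy=DEFAULT_POLICY):
    """True when the human must agree before this stage proceeds."""
    return policy[stage] <= Level.ACT_WITH_APPROVAL


def risky_settings(policy):
    """Names of decision/action stages allowed to run without prior approval."""
    return [s.name for s in (Stage.DECIDE, Stage.ACT)
            if policy[s] > Level.ACT_WITH_APPROVAL]
```

Keeping one level per stage, rather than a single global autonomy setting, lets a review flag exactly the configurations the framework treats as riskiest.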

[E-7] Bainbridge, L. (1983). Ironies of Automation. Automatica 19(6), 775–779. DOI: 10.1016/0005-1098(83)90046-8. — THE canonical paper. Automating "because the human is unreliable" leaves the human the un-automatable tasks plus monitoring + emergency takeover — for which disuse has destroyed skill and SA, and emergencies are the hardest cases. Relevance: the single most important paper for agentic HITL — routine delegation silently degrades the human's ability to catch the agent's rare, high-consequence failures.

[E-8] Baxter, G., Rooksby, J., Wang, Y., & Khajeh-Hosseini, A. (2012). The ironies of automation … still going strong at 30? ECCE '12, 65–71. DOI: 10.1145/2448136.2448149. — 30-year retrospective across aviation, trading, cloud: original ironies intact, plus new ironies (cheap compute encouraging less-dependable, procedure-bypassing systems). Relevance: making agents more capable/reliable doesn't dissolve the oversight problem — it sharpens it.

[E-9] Strauch, B. (2018). Ironies of Automation: Still Unresolved After All These Years. IEEE THMS 48(5), 419–433. DOI: 10.1109/THMS.2017.2732506. — Concludes the ironies remain valid and unresolved (mode confusion, skill atrophy, accidents) because designs keep automating without addressing the human's changed role. Relevance: agentic AI inherits an unsolved problem — engineer around the ironies; don't expect better models to outgrow them.

[E-10] Baxter, G., Besnard, D., & Riley, D. (2007). Cognitive mismatches in the cockpit: Will they ever be a thing of the past? Applied Ergonomics 38(4), 417–423. DOI: 10.1016/j.apergo.2007.01.005. — "Cognitive mismatch": operator's model of the automation diverges from reality (mode confusion, automation surprise) — recurring, structural error in opaque systems. Relevance: if the human's model of the agent's goals/state diverges, oversight fails; argues for transparency, mode visibility, intent communication.

[E-11] Wiener, E. L., & Curry, R. E. (1980). Flight-deck automation: promises and problems. Ergonomics 23(10), 995–1011. DOI: 10.1080/00140138008924809. — Founding aviation-automation paper: automation improves performance yet adds new failure modes (setup errors, ignored alarms, proficiency loss). Source of "clumsy automation" — helps in low-workload phases, adds workload when already busy. Relevance: foreshadows clumsy agent UX; design the interaction and timing.

[E-12] Endsley, M. R. (1995). Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors 37(1), 32–64. DOI: 10.1518/001872095779049543. — Canonical 3-level SA model: L1 perception, L2 comprehension, L3 projection; attention/working memory are limiting. Relevance: an overseer needs comprehension + projection of the agent's situation, not just a log of actions (which gives L1 only).
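
As a rough illustration of the difference between an action log and an SA-supporting update, the sketch below structures an agent status report around the three levels. The `SituationReport` class, its field names, and the `supports_intervention` check are assumptions for this example, not something the paper prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class SituationReport:
    """Hypothetical status update an agent hands to its overseer."""
    perception: list[str] = field(default_factory=list)  # L1: what happened (raw events)
    comprehension: str = ""  # L2: what those events mean for the goal
    projection: str = ""     # L3: what the agent expects to happen next

    def supports_intervention(self) -> bool:
        # A bare event log (L1 only) is treated as insufficient for oversight.
        return bool(self.comprehension.strip()) and bool(self.projection.strip())
```

A report that carries only tool-call events fails the check; one that also states what the events mean and what comes next passes.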

[E-13] Endsley, M. R., & Kiris, E. O. (1995). The Out-of-the-Loop Performance Problem and Level of Control in Automation. Human Factors 37(2), 381–394. DOI: 10.1518/001872095779064555. — Empirically establishes OOTL: operators resume manual control slower/worse after failure; the decrement was significantly worse under full than intermediate automation. Relevance: the empirical case for "human in the loop" over "on the loop" for high-stakes actions.

[E-14] Kaber, D. B., & Endsley, M. R. (2004). The effects of level of automation and adaptive automation on human performance, SA and workload… Theoretical Issues in Ergonomics Science 5(2), 113–153. DOI: 10.1080/1463922021000054335. — Intermediate LOAs aid SA; adaptive automation aids workload; effects differ and aren't additive. Relevance: supports adjustable/adaptive autonomy — let agent authority rise/fall with context, criticality, and human workload.

[E-15] Onnasch, L., Wickens, C. D., Li, H., & Manzey, D. (2014). Human Performance Consequences of Stages and Levels of Automation: An Integrated Meta-Analysis. Human Factors 56(3), 476–488. DOI: 10.1177/0018720813501549. — Meta-analysis: higher degree of automation improves routine performance/lowers workload while it works but sharply worsens performance + SA when it fails — the "lumberjack effect" (beyond a critical DOA, failure-mode performance collapses). Relevance: quantitative argument against maximal autonomy in high-consequence domains.
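
A minimal sketch of what "evaluate on the failure case" could look like in practice: report the human-plus-agent outcome separately for episodes where the agent was right and where it was wrong, instead of one blended score. The `Episode` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    agent_correct: bool  # did the agent's own output turn out to be right?
    outcome_ok: bool     # was the final human+agent outcome acceptable?


def split_success_rates(episodes):
    """Return (routine_rate, failure_rate); None for an empty condition."""
    def rate(subset):
        return sum(e.outcome_ok for e in subset) / len(subset) if subset else None

    routine = [e for e in episodes if e.agent_correct]
    failure = [e for e in episodes if not e.agent_correct]
    return rate(routine), rate(failure)
```

With made-up numbers, 95 routine episodes that all end well plus 5 agent failures of which only 1 is caught give a blended success rate of 0.96, yet the failure-condition rate is 0.2. The blended figure hides exactly the case the meta-analysis warns about.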

[E-16] / see F-2. Parasuraman & Riley (1997). Use, Misuse, Disuse, Abuse. (cross-listed in II.F) — the core framework for calibrated reliance.

[E-17] / see F-7. Parasuraman & Manzey (2010). Complacency and Bias… (cross-listed in II.F) — over-reliance is attentional, not a discipline problem.

[E-18] / see F-4. Parasuraman, Molloy & Singh (1993). Performance Consequences of Automation-Induced "Complacency." (cross-listed in II.F) — the more reliable the automation, the worse humans catch its lapses.

[E-19] / see F-1. Lee & See (2004). Trust in Automation. (cross-listed in II.F) — blueprint for agent transparency and confidence communication.

[E-20] / see F-10. Mosier, Skitka, Heers & Burdick (1998). Automation Bias in High-Tech Cockpits. (cross-listed in II.F) — automated advice can replace independent judgment, including when wrong.

[E-21] Billings, C. E. (1997). Aviation Automation: The Search for a Human-Centered Approach. Lawrence Erlbaum. ISBN 0-8058-2127-9. — Defining articulation of human-centered automation: keep the human informed, involved, and in command; automation must be predictable, comprehensible, accountable. Relevance: the philosophical north star for agentic HITL — agents keep the human in command and never quietly assume authority.

[E-22] Casner, S. M., Geven, R. W., Recker, M. P., & Schooler, J. W. (2014). The Retention of Manual Flying Skills in the Automated Cockpit. Human Factors 56(8), 1506–1516. DOI: 10.1177/0018720814535628. — Manual control/scanning skills were retained; the cognitive skills of manual flying (planning, judgment) degraded unless pilots stayed actively engaged. Relevance: humans who let agents do the cognitive work lose the expertise to evaluate them — keep humans reasoning, not rubber-stamping.

II.F — Trust, complacency, automation bias & failure modes

Markings: [E] empirical · [T] theoretical/review · [E/T] mixed.

Highlights

  • Appropriate reliance, not maximum trust, is the goal. [F-1] Target calibrated trust matching real capability; both over-trust → misuse and under-trust → disuse are failures. Make the agent's competence, process, and purpose legible so reliance tracks reliability.
  • Three pathologies as a diagnostic checklist. [F-2] Alongside appropriate use, misuse (over-reliance), disuse (neglect, usually from false-alarm fatigue), and abuse (deploying autonomy without regard for human consequences) map cleanly onto agentic-LLM risks; abuse in particular locates much of the fault in design/deployment, not the operator.
  • Automation bias = heuristic deference. [F-9, F-10] Users substitute "the system says so" for vigilant processing, producing omission errors (missing what the agent didn't flag) and commission errors (following the agent against contradictory evidence). It afflicts experts and crews alike, and includes a "phantom memory" effect: recalling that the automation behaved correctly when it had failed.
  • Complacency is structural, attention-driven, and training-resistant. [F-4, F-5, F-7] Reliable automation under multi-task load erodes monitoring (decrement in ~20 min); appears in experts; and instruction/training does not reliably fix it. "Tell users to stay vigilant" is not a mitigation.
  • The automation conundrum / out-of-the-loop problem. [F-8] The more autonomous and reliable the agent, the lower the human's situation awareness and the weaker their ability to take over when it finally fails — exactly when takeover matters most.
  • Mode confusion & automation surprise. [F-14, F-15] "What is it doing? Why? What next?" arises from poor observability + mental-model gaps; "strong, silent, clumsy" automation yields slow "going sour" accidents. Agents that silently switch tools/strategies/modes are surprise generators.
  • There is a reliability floor (~70%). [F-16] Below ~70% diagnostic reliability an advisory agent can be net-negative vs. no automation. Know the task's floor before deploying autonomy.
  • False alarms are uniquely corrosive. [F-17, F-18] False-alarm-prone systems damage both compliance and reliance and drive alert fatigue ("cry wolf") — desensitization that leads users to override even critical alerts. Alert precision and volume are safety parameters.
  • Trust has three layers — design for each. [F-3] Dispositional (who the user is), situational (context/workload/task), learned (onboarding + dynamic in-session updating).
  • Explanations can backfire. [F-20] Rationales (and by extension chain-of-thought) tend to increase acceptance regardless of correctness, deepening over-reliance. Explanations must support verification, not just persuade.
  • Structural friction beats exhortation. [F-12, F-21] Cognitive forcing functions (decide before seeing the AI's answer; deliberate friction) and experiencing the agent fail during onboarding measurably reduce over-reliance — but users dislike the friction that helps most.
  • Accountability is an evidence-based debiasing lever — but token HITL is a trap. [F-11] Felt accountability for decision accuracy reduces automation-bias errors; yet [F-19] "moral crumple zones" warn that a human nominally in the loop without real authority/awareness is just a scapegoat.

Annotated bibliography

[F-1] Lee, J. D., & See, K. A. (2004). [E/T] Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1), 50–80. DOI: 10.1518/hfes.46.1.50_30392. — Field-defining synthesis. Because people respond to technology socially, trust mediates reliance; the goal is appropriate trust calibrated to actual capability, not maximum trust. Names over-trust → misuse and distrust → disuse, and the trust dimensions performance / process / purpose (does it work / how / why built). Relevance: master template for agentic HITL: make competence, reasoning, and intent legible so reliance tracks reliability.

[F-2] Parasuraman, R., & Riley, V. (1997). [T] Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors, 39(2), 230–253. DOI: 10.1518/001872097778543886. — Taxonomy paper. Use / misuse (over-reliance → monitoring failures) / disuse (neglect, often from false-alarm-prone alerts) / abuse (deploying automation without regard for human consequences). Shifts blame upstream: many "operator errors" are seeded by abuse at design/management level. Relevance: diagnostic vocabulary — over-acceptance (misuse), alarm-fatigue dismissal (disuse), reckless autonomy (abuse) are all live in LLM agents.

[F-3] Hoff, K. A., & Bashir, M. (2015). [E/T] Trust in Automation: Integrating Empirical Evidence on Factors That Influence Trust. Human Factors, 57(3), 407–434. DOI: 10.1177/0018720814547570. — Systematic review consolidating trust antecedents into a three-layer model: dispositional (traits, culture, age), situational (context, workload, task, system), learned (initial + dynamic, updating with experience). Relevance: trust is not one knob — onboarding shapes initial learned trust; in-session behavior drives dynamic recalibration.

[F-4] Parasuraman, R., Molloy, R., & Singh, I. L. (1993). [E] Performance Consequences of Automation-Induced "Complacency". Int. J. Aviation Psychology, 3(1), 1–23. DOI: 10.1207/s15327108ijap0301_1. — First empirical demonstration of automation complacency. Failure detection was far worse under constant-reliability than variable-reliability automation, emerging after ~20 min — steady, reliable automation breeds the worst monitoring. Relevance: the more dependable the agent seems, the more the human stops checking.

[F-5] Molloy, R., & Parasuraman, R. (1996). [E] Monitoring an Automated System for a Single Failure: Vigilance and Task Complexity Effects. Human Factors, 38(2), 311–322. DOI: 10.1177/001872089606380211. — Vigilance decrement in monitoring highly-but-imperfectly-reliable automation: detection of a single critical failure degrades over time and under load. Relevance: agents right 99% of the time make the human worst-positioned to catch the 1% — argues for forcing functions, not "keep an eye on it."

[F-6] Singh, I. L., Molloy, R., & Parasuraman, R. (1993). [E] Automation-Induced "Complacency": Development of the Complacency-Potential Rating Scale. Int. J. Aviation Psychology, 3(2), 111–122. DOI: 10.1207/s15327108ijap0302_2. — A 20-item instrument; establishes complacency potential as a measurable individual difference, not just a situational state. Relevance: some users are far more disposed to over-rely — suggests per-user calibration / adaptive friction.

[F-7] Parasuraman, R., & Manzey, D. H. (2010). [E/T] Complacency and Bias in Human Use of Automation: An Attentional Integration. Human Factors, 52(3), 381–410. DOI: 10.1177/0018720810376055. — Landmark review: complacency and automation bias overlap, rooted in attention allocation. Complacency emerges chiefly under multiple-task load; both appear in experts; neither is reliably eliminated by training. Relevance: strongest warning that "we'll train users to stay alert" is insufficient — structural design changes required.

[F-8] Endsley, M. R. (2017). [T] From Here to Autonomy: Lessons Learned From Human–Automation Research. Human Factors, 59(1), 5–27. DOI: 10.1177/0018720816681350. — Articulates the "automation conundrum": more autonomy/reliability → lower operator SA → less able to intervene at failure (the out-of-the-loop problem). Proposes a Human-Autonomy System Oversight model; argues for SA-supportive design and granularity of control. Relevance: pushing LLM autonomy up degrades the takeover capability you depend on for safety.

[F-9] Skitka, L. J., Mosier, K. L., & Burdick, M. (1999). [E] Does Automation Bias Decision-Making? Int. J. Human-Computer Studies, 51(5), 991–1006. DOI: 10.1006/ijhc.1999.0252. — Defining automation-bias experiment. Distinguishes omission (missing an event the aid didn't flag) from commission (following an automated directive despite contradictory reliable info). Bias = using the aid as a heuristic shortcut for vigilant processing. Relevance: canonical frame for LLM over-acceptance.

[F-10] Mosier, K. L., Skitka, L. J., Heers, S., & Burdick, M. (1998). [E] Automation Bias: Decision Making and Performance in High-Tech Cockpits. Int. J. Aviation Psychology, 8(1), 47–63. DOI: 10.1207/s15327108ijap0801_3. — Automation bias in professional glass-cockpit pilots, incl. the "phantom memory" effect (recalling automation behaved correctly when it failed). Display enhancements/verification training reduced but didn't eliminate it. Relevance: expertise and team review do not inoculate against deferring to a confident machine.

[F-11] Skitka, L. J., Mosier, K., & Burdick, M. D. (2000). [E] Accountability and Automation Bias. Int. J. Human-Computer Studies, 52(4), 701–717. DOI: 10.1006/ijhc.1999.0349. — Making operators accountable for decision accuracy (vs. mere performance) significantly reduced both omission and commission errors. Relevance: assigning the human felt responsibility for an agent's outputs — not nominal "in-the-loop" status — is an evidence-based debiasing lever.

[F-12] Bahner, J. E., Hüper, A.-D., & Manzey, D. (2008). [E] Misuse of Automated Decision Aids: Complacency, Automation Bias and the Impact of Training Experience. Int. J. Human-Computer Studies, 66(9), 688–699. DOI: 10.1016/j.ijhcs.2008.06.001. — Operationalizes complacency as insufficient verification behavior. Found under-verification; commission errors linked to high complacency; and exposing operators to rare automation failures during training reduced complacency. Relevance: letting users experience the agent being wrong recalibrates verification — an onboarding pattern.

[F-13] Goddard, K., Roudsari, A., & Wyatt, J. C. (2012). [E/T] Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators. JAMIA, 19(1), 121–127. DOI: 10.1136/amiajnl-2011-000089. — Healthcare review confirming automation bias is common and consequential in clinical decision support; identifies mediators (task difficulty, trust, confidence, workload) and candidate mitigators. Relevance: closest analog to LLM decision copilots — good aids still induce harmful deference; mitigation must be engineered.

[F-14] Sarter, N. B., Woods, D. D., & Billings, C. E. (1997). [T] Automation Surprises. In Salvendy (Ed.), Handbook of Human Factors and Ergonomics (2nd ed., pp. 1926–1943). Wiley. (book chapter, no DOI) — Coins "automation surprise" ("What is it doing? Why? What next?"), rooted in mode confusion / loss of mode awareness + low system observability, esp. in non-normal, time-critical situations. Relevance: an agent silently switching strategies/tools/modes is a textbook surprise generator — expose state and intent.

[F-15] Sarter, N. B., & Woods, D. D. (1995). [E/T] How in the World Did We Ever Get into That Mode? Mode Error and Awareness in Supervisory Control. Human Factors, 37(1), 5–19. DOI: 10.1518/001872095779049516. (related: "Learning from Automation Surprises and 'Going Sour' Accidents," 1997) — Characterizes "strong, silent, clumsy" automation — highly capable but poor at communicating — producing "going sour" accidents where a small undetected mismatch compounds. Frames observability and directability as the missing properties. Relevance: warns against powerful-but-opaque executor agents on long-horizon tasks.

[F-16] Wickens, C. D., & Dixon, S. R. (2007). [E/T] The Benefits of Imperfect Diagnostic Automation: A Synthesis of the Literature. Theoretical Issues in Ergonomics Science, 8(3), 201–212. DOI: 10.1080/14639220500370105. — Meta-synthesis establishing the ~0.70 reliability crossover: below ~70% reliability, diagnostic automation tends to be worse than none. Relevance: a quantitative gut-check — a moderately-accurate autonomous advisor can be net-negative.
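
A hedged sketch of a pre-deployment gut-check against the ~0.70 crossover. The synthesis reports the crossover itself; comparing a Wilson score lower bound (rather than the raw accuracy) against it is a conservative choice made for this example, and the function names and 95% default are assumptions.

```python
import math

RELIABILITY_FLOOR = 0.70  # approximate crossover reported in [F-16]


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an observed accuracy."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - margin


def clears_floor(correct: int, total: int, floor: float = RELIABILITY_FLOOR) -> bool:
    """True only if the lower bound, not just the point estimate, beats the floor."""
    return wilson_lower_bound(correct, total) > floor
```

On 100 evaluation cases, 80 correct clears the floor under this rule while 75 correct does not, even though both point estimates sit above 0.70.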

[F-17] Dixon, S. R., Wickens, C. D., & McCarley, J. S. (2007). [E] On the Independence of Compliance and Reliance: Are Automation False Alarms Worse Than Misses? Human Factors, 49(4), 564–572. DOI: 10.1518/001872007X215656. — Dissociates compliance (acting on alarms) from reliance (trusting silence). False-alarm-prone automation harms both; miss-prone harms only reliance — false alarms are especially corrosive. Relevance: tune agentic alert/intervention thresholds toward avoiding false alarms.

[F-18] Cvach, M. (2012). [E/T] Monitor Alarm Fatigue: An Integrative Review. Biomedical Instrumentation & Technology, 46(4), 268–277. DOI: 10.2345/0899-8205-46.4.268. (see also AHRQ PSNet Alert Fatigue primer; Ancker et al. 2017, BMC Med Inform Decis Mak 17:36, DOI 10.1186/s12911-017-0430-8) — Defines alert fatigue: under a barrage of mostly-false alerts, operators desensitize and override/ignore them, incl. critical ones — the "cry wolf" effect (the clinical manifestation of disuse). Relevance: over-alerting trains users to dismiss the agent's warnings wholesale; alert volume/precision are safety parameters.
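
If alert volume and precision are treated as safety parameters, they need to be measured. The sketch below computes both from a log of agent alerts, plus the share of genuine alerts that users overrode. The `Alert` record, its fields, and the metric names are hypothetical; how "was a real problem" gets judged after the fact is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    was_real_problem: bool  # judged after the fact
    overridden: bool        # user dismissed or ignored the alert


def alert_health(alerts, hours_observed):
    """Summarize alert volume, precision, and overrides of genuine alerts."""
    if hours_observed <= 0:
        raise ValueError("hours_observed must be positive")
    total = len(alerts)
    real = [a for a in alerts if a.was_real_problem]
    return {
        "alerts_per_hour": total / hours_observed,
        "precision": len(real) / total if total else None,
        "real_alerts_overridden": (
            sum(a.overridden for a in real) / len(real) if real else None
        ),
    }
```

Low precision combined with a rising share of overridden genuine alerts is the "cry wolf" pattern described above.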

[F-19] Elish, M. C. (2019). [T] Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction. Engaging Science, Technology, and Society (ESTS), 5, 40–60. DOI: 10.17351/ests2019.260. — Introduces the "moral crumple zone": responsibility for failures collapses onto the nearest human operator, who has limited actual control — protecting the system at the human's expense. Relevance: structural critique of token HITL — a human nominally in the loop without real authority/awareness is a scapegoat, not a safeguard.

[F-20] Bansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., Ribeiro, M. T., & Weld, D. S. (2021). [E] Does the Whole Exceed Its Parts? The Effect of AI Explanations on Complementary Team Performance. CHI 2021. DOI: 10.1145/3411764.3445717 (arXiv:2006.14779). — Explanations increase acceptance of the AI's recommendation regardless of correctness — boosting agreement but not complementary accuracy. Relevance: adding rationales/chain-of-thought may increase automation bias; explanations must enable verification, not persuasion.

[F-21] Buçinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). [E] To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-Assisted Decision-Making. Proc. ACM HCI, 5(CSCW1), Art. 188. DOI: 10.1145/3449287 (arXiv:2102.09692). — People rarely engage System-2 per recommendation, so explanations alone don't fix over-reliance. Cognitive forcing functions (decide before seeing the AI's answer; add friction) measurably reduced over-reliance — though users disliked the interventions that helped most. Relevance: the most actionable modern HITL pattern — structural friction beats exhortation, with a usability trade-off to manage.
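
The "decide before seeing the AI's answer" forcing function can be sketched as a small gate around the recommendation. This is an illustrative pattern under stated assumptions, not the study's apparatus: `ForcedDecisionReview`, `DecisionRequiredError`, and the method names are hypothetical.

```python
class DecisionRequiredError(RuntimeError):
    """Raised when the AI answer is requested before the human has committed."""


class ForcedDecisionReview:
    def __init__(self, ai_recommendation: str):
        self._ai_recommendation = ai_recommendation
        self._human_decision = None

    def record_human_decision(self, decision: str) -> None:
        # The human commits to an independent answer first.
        if not decision or not decision.strip():
            raise ValueError("an explicit decision is required")
        self._human_decision = decision.strip()

    def reveal_ai_recommendation(self) -> str:
        # The recommendation stays hidden until a human decision exists.
        if self._human_decision is None:
            raise DecisionRequiredError("record your own decision first")
        return self._ai_recommendation

    def disagrees(self) -> bool:
        # Surfacing disagreement gives the human a concrete prompt to re-check.
        return self.reveal_ai_recommendation() != self._human_decision
```

Because the gate adds a step to every review, it carries the usability cost the study reports; one option is to apply it only to higher-stakes decisions.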

II.G — HCI foundations: mixed-initiative, human-centered AI, interaction design

Highlights

  • Initiative is negotiated, not fixed [G-1, G-2]. Decide per action whether the agent proposes or acts, driven by uncertainty about the user's goal × expected cost/benefit of acting. Act autonomously only when expected utility is clearly positive; otherwise ask, suggest, or defer.
  • Pursue high automation AND high human control simultaneously [G-4]. Reject the false tradeoff (Shneiderman's 2-D HCAI). Target the Reliable, Safe & Trustworthy (RST) quadrant: an agent that does a lot while the human stays in command. Watch both failure modes: over-automation (lost oversight) and over-control (no leverage).
  • Continuous intelligibility prevents handoff disasters [G-12, G-17]. The danger isn't "too much automation," it's silent automation with poor feedback. Keep the human informed throughout — what the agent knows, how it knows it, what it's doing — so takeover at the competence boundary is safe.
  • Close both gulfs [G-11]. Make available actions discoverable (Gulf of Execution) and state/reasoning/effects legible (Gulf of Evaluation). Supervision is impossible without both.
  • Prefer reversibility over confirmation [G-13, G-14, G-15]. Undo beats warnings — confirmations habituate users into clicking through. Make agent actions reversible by default; reserve confirmation gates for genuinely irreversible, high-stakes operations. Always provide a cheap "emergency exit."
  • Design the four phases explicitly [G-7, G-8]. Initially: disclose what the agent can do and how well. During: time interventions to context, show relevant info, respect norms. When wrong: efficient invocation, dismissal, correction; scope/hedge when in doubt; explain why it did that. Over time: learn cautiously, provide global controls + change notifications.
  • Calibrate trust to true reliability [G-9, G-10]. Set accurate mental models; never overpromise. Label when AI is acting; make suggestions trivially easy to review, override, or modify.
  • Design for graceful failure [G-9, G-12, G-13]. Assume the agent will be wrong. Build plain-language recovery, constructive remedies, safe fallbacks — not dead ends or silent failure.
  • Augment, don't replace [G-3, G-5, G-10]. Frame the agent as a supertool / tele-operated device / active appliance that amplifies a human — not a human-substitute. Keeps responsibility and mastery with the person.
  • Progressive disclosure reconciles simplicity with power [G-16]. Default to a simple, trustworthy interaction; tuck autonomy levels, tool permissions, and reasoning traces one layer down for power users.
  • Protect the human's work and acknowledge latency [G-15]. For long-running tasks: immediate acknowledgment, autosave/preserve in-progress state, keep operations explorable/reversible.
  • Enforce accountability for consequential acts [G-6, G-17]. Surface inferences and keep an identifiable human answerable — backed by layered guardrails (reliable engineering practice → org safety culture → external oversight).
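
The "prefer reversibility over confirmation" highlight can be read as a small gating rule. The sketch below is one possible reading, not a prescribed design: `Gate`, `ProposedAction`, and `choose_gate` are hypothetical names, and the handling of irreversible but low-stakes actions (run, then notify) is an assumption the highlight does not cover.

```python
from dataclasses import dataclass
from enum import Enum


class Gate(Enum):
    RUN_WITH_UNDO = "run_with_undo"    # execute now, keep a cheap way back
    RUN_AND_NOTIFY = "run_and_notify"  # assumed middle tier
    CONFIRM_FIRST = "confirm_first"    # block until the human approves


@dataclass(frozen=True)
class ProposedAction:
    name: str
    reversible: bool
    high_stakes: bool


def choose_gate(action: ProposedAction) -> Gate:
    if action.reversible:
        # Undo beats warnings: do not spend a confirmation on a reversible action.
        return Gate.RUN_WITH_UNDO
    if action.high_stakes:
        # Confirmation is reserved for irreversible, high-stakes operations.
        return Gate.CONFIRM_FIRST
    return Gate.RUN_AND_NOTIFY
```

Keeping confirmations rare is the point: a gate that fires on every action trains users to click through it.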

Annotated bibliography

[G-1] Horvitz, E. (1999). Principles of Mixed-Initiative User Interfaces. CHI '99, 159–166. DOI: 10.1145/302979.303030. — Canonical: couple automation and direct manipulation so human and machine negotiate who acts when. ~12 factors incl. uncertainty about the user's goal, expected cost/benefit of autonomous action (act only when expected utility is positive), dialog to resolve uncertainty, efficient invocation/termination, graceful degradation/direct invocation, scoping precision to uncertainty, and memory of recent interactions. Demonstrated in LookOut. Counters: the all-or-nothing automation trap. Relevance: the rulebook for initiative handoff — when an agent should propose vs. act, computed from confidence × stakes.
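
The propose-versus-act rule can be sketched as an expected-utility comparison over the agent's belief that the user actually has the goal. This is written in the spirit of the paper, not copied from it: the function names, the `margin` parameter (standing in for "clearly positive"), and all utility values are assumptions.

```python
def expected_utilities(p_goal, u_act_goal, u_act_no_goal, u_wait_goal, u_wait_no_goal):
    """Expected utility of acting vs. not acting, given P(user has the goal)."""
    eu_act = p_goal * u_act_goal + (1 - p_goal) * u_act_no_goal
    eu_wait = p_goal * u_wait_goal + (1 - p_goal) * u_wait_no_goal
    return eu_act, eu_wait


def action_threshold(u_act_goal, u_act_no_goal, u_wait_goal, u_wait_no_goal):
    """Belief p* at which acting and not acting have equal expected utility.

    Assumes acting is better when the goal is present and worse when it is not.
    """
    gain_if_goal = u_act_goal - u_wait_goal
    loss_if_no_goal = u_wait_no_goal - u_act_no_goal
    if gain_if_goal <= 0 or loss_if_no_goal <= 0:
        raise ValueError("utilities do not define a crossover")
    return loss_if_no_goal / (gain_if_goal + loss_if_no_goal)


def decide(p_goal, utilities, margin=0.1):
    """Act only when the expected gain is clearly positive; otherwise ask or defer."""
    eu_act, eu_wait = expected_utilities(p_goal, *utilities)
    return "act" if eu_act - eu_wait > margin else "ask_or_defer"
```

With illustrative utilities `(1.0, -4.0, 0.0, 0.0)`, a wrong autonomous action costs four times what a right one gains, so the crossover sits at a belief of 0.8. Raising the cost of a wrong action pushes the threshold up, which is how stakes enter the rule.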

[G-2] Allen, J., Guinn, C. I., & Horvitz, E. (1999). Mixed-Initiative Interaction. IEEE Intelligent Systems 14(5), 14–23 (Trends & Controversies). https://erichorvitz.com/mixedinit.htm (exact pagination UNVERIFIED) — Frames mixed-initiative as flexible, shifting negotiation of control. Counters: rigid task allocation breaks when competence varies by situation. Relevance: motivates dynamic, per-turn control allocation rather than a static autonomy setting.

[G-3] Shneiderman, B., & Maes, P. (1997). Direct Manipulation vs. Interface Agents. Interactions 4(6), 42–61. https://www.cs.umd.edu/users/ben/papers/Shn-Maes-v4n6-1997.pdf — The landmark debate. Shneiderman: direct manipulation — visible objects, reversible actions, predictability, user control; warns agents erode mastery/responsibility. Maes: complex task spaces need agents to delegate to, which earn trust gradually. Both surface: over-reaching agents destroy predictability/accountability; pure manual control doesn't scale. Relevance: the original framing of the delegation-vs-control tension agentic HITL must resolve.

[G-4] Shneiderman, B. (2020). Human-Centered Artificial Intelligence: Reliable, Safe & Trustworthy. IJHCI 36(6), 495–504. DOI: 10.1080/10447318.2020.1741118 (arXiv:2002.04087). — The 2-D HCAI framework: independent axes of human control and computer automation; target the upper-right (high automation AND high control) → Reliable, Safe & Trustworthy. Names failure quadrants (excessive automation → lost oversight; excessive control → tedium). Relevance: the master mental model — keep the human "in command, not in the loop of everything."

[G-5] Shneiderman, B. (2020). Human-Centered Artificial Intelligence: Three Fresh Ideas. AIS THCI 12(3), 109–124. https://aisel.aisnet.org/thci/vol12/iss3/1/ — Adds design metaphors — supertools, tele-operated devices, active appliances — and a governance roadmap. Flags: treating autonomy as the goal rather than amplified, controllable human performance. Relevance: concrete UX archetypes for how much an agent acts vs. amplifies.

[G-6] Shneiderman, B. (2022). Human-Centered AI. Oxford University Press. ISBN 9780192845290. — Book-length synthesis; adds a multi-layer governance model: reliable (team/SE practice), safe (org safety culture), trustworthy (independent oversight + regulation). Relevance: maps the layered guardrails production agentic deployments need around the HITL UX.

[G-7] Amershi, S., Weld, D., Vorvoreanu, M., et al. (2019). Guidelines for Human-AI Interaction. CHI '19, Paper 3. DOI: 10.1145/3290605.3300233. — 18 validated guidelines by phase. Initially: (1) make clear what the system can do; (2) how well. During: (3) time services to context; (4) show contextual info; (5) match social norms; (6) mitigate bias. When wrong: (7) efficient invocation; (8) efficient dismissal; (9) efficient correction; (10) scope when in doubt; (11) make clear why. Over time: (12) remember recent interactions; (13) learn from behavior; (14) update cautiously; (15) encourage granular feedback; (16) convey consequences; (17) global controls; (18) notify about changes. Relevance: the single most directly applicable guideline set for agentic HITL.

[G-8] Microsoft Research (2021–2023). HAX Toolkit (Guidelines, Design Library, Patterns, Playbook, Workbook). https://www.microsoft.com/en-us/haxtoolkit/ — Operationalizes the 18 guidelines into a searchable pattern library, failure-scenario playbook, and prioritization workbook. Relevance: a ready-made pattern catalog for building HITL affordances (surface confidence, support correction).

[G-9] Google PAIR (2019, updated for generative AI). People + AI Guidebook. https://pair.withgoogle.com/guidebook-v2/ — Six chapters / patterns: User Needs + Success, Data + Evaluation, Mental Models, Explainability + Trust, Feedback + Control, Errors + Graceful Failure. Core: calibrate trust to reliability; set accurate mental models; design for graceful failure. Relevance: supplies HITL design moves — feedback+control loops, explainability, graceful failure for generative agents.

[G-10] Apple (2024–2025). Human Interface Guidelines: Machine Learning & Generative AI. https://developer.apple.com/design/human-interface-guidelines/machine-learning · /generative-ai — AI should augment, not replace; indicate when AI is involved; make suggestions easy to review, override, modify; preserve agency; responsible-AI guidance (bias, safety, provenance). Relevance: consumer-grade norms for labeling AI actions and keeping overrides cheap — table stakes for agentic UX.

[G-11] Norman, D. A. (2013, rev.; orig. 1988). The Design of Everyday Things. Basic Books. — Establishes affordances/signifiers, mapping, constraints, conceptual models, feedback; and the Gulf of Execution (intention → available action) and Gulf of Evaluation (system state → user understanding). Relevance: an agent must close both gulfs — discoverable actions + legible state/reasoning/effects.

[G-12] Norman, D. A. (1990). The 'Problem' with Automation: Inappropriate Feedback and Interaction, Not 'Over-Automation'. Phil. Trans. R. Soc. Lond. B 327(1241), 585–593. DOI: 10.1098/rstb.1990.0101. — Automation failures stem from inadequate feedback/interaction, not too much automation; "intermediate intelligence" automation drops the human into a crisis with no SA (out-of-the-loop). Relevance: the foundational argument for continuous intelligibility — keep the human informed so boundary handoff is safe.

[G-13] Nielsen, J. (1994/2020). 10 Usability Heuristics for User Interface Design. NN/g. https://www.nngroup.com/articles/ten-usability-heuristics/ — Four are critical to HITL: #1 Visibility of system status; #3 User control and freedom (emergency exits, undo/redo); #5 Error prevention (confirm before commitment); #9 Recognize, diagnose, recover from errors. Relevance: a compact checklist mapping one-to-one onto supervising an agent.

[G-14] Raskin, A. (2007). Never Use a Warning When You Mean Undo. A List Apart. https://alistapart.com/article/neveruseawarning/ — Confirmation warnings train reflexive click-through and don't prevent errors; reversibility (undo) is the superior pattern (e.g., Gmail "Undo Send"). Relevance: prefer reversible, undoable agent operations over confirmation gates; reserve confirmation for genuinely irreversible/high-stakes steps.
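The undo-first pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from Raskin's article: `Action`, `AgentSession`, and the `confirm` hook are hypothetical names. Reversible actions run immediately and land on an undo stack; only actions with no `undo` go through a confirmation gate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """One agent operation. `undo` is None when the step cannot be reversed."""
    name: str
    do: Callable[[], None]
    undo: Optional[Callable[[], None]] = None


@dataclass
class AgentSession:
    # Hypothetical confirmation hook; a real UI would prompt the user here.
    confirm: Callable[[Action], bool]
    history: List[Action] = field(default_factory=list)

    def run(self, action: Action) -> bool:
        """Run reversible actions immediately; gate only irreversible ones."""
        if action.undo is None and not self.confirm(action):
            return False  # the user declined the irreversible step
        action.do()
        if action.undo is not None:
            self.history.append(action)
        return True

    def undo_last(self) -> Optional[str]:
        """Revert the most recent reversible action, if any."""
        if not self.history:
            return None
        action = self.history.pop()
        action.undo()
        return action.name


# Usage: the edit runs without a prompt; the purge is gated and declined.
files = {"notes.txt": "draft"}
session = AgentSession(confirm=lambda a: False)  # decline every prompt
session.run(Action("edit",
                   do=lambda: files.update({"notes.txt": "v2"}),
                   undo=lambda: files.update({"notes.txt": "draft"})))
ran = session.run(Action("purge", do=files.clear))
```

The design choice is that the confirmation path is the exception: the more operations an agent can express with a working `undo`, the fewer prompts the human has to click through.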

[G-15] Tognazzini, B. (2014, rev.). First Principles of Interaction Design. askTog. https://asktog.com/atc/principles-of-interaction-design/ — Protect Users' Work (never lose work; autosave, easy undo) and Latency Reduction (acknowledge within ~50 ms; hide latency); explorable interfaces (reversible, no dead ends). Relevance: for long-running agent tasks — immediate acknowledgment, protect in-progress work, keep operations explorable.

[G-16] Nielsen, J. (2006). Progressive Disclosure. NN/g. https://www.nngroup.com/articles/progressive-disclosure/ — Defer advanced/rare features to secondary views; show the most important first → better learnability, efficiency, error rate. Relevance: surface a simple default interaction while keeping deep controls (autonomy levels, permissions, reasoning traces) one layer down — "high control AND high automation" without clutter.

[G-17] Bellotti, V., & Edwards, K. (2001). Intelligibility and Accountability: Human Considerations in Context-Aware Systems. Human–Computer Interaction 16(2–4), 193–212. DOI: 10.1207/S15327051HCI16234_05. — Systems can't simply act on users' behalf because social context is unsensable. Two imperatives: Intelligibility (represent what it knows, how, and what it's doing) and Accountability (enforce user accountability rather than acting opaquely). Relevance: the conceptual root of "explainable, controllable agents" — expose inferences; keep a human accountable for consequential acts.

II.H — Adjacent safety-critical domains (aviation, medical, AVs, control rooms)

Tags: [INCIDENT] accident investigation · [STUDY] empirical/review · [STANDARD] standard/guidance.

Highlights

  • The handoff problem is the central HITL failure mode. AF447 [H-7], Asiana 214 [H-8], Uber Tempe [H-16], the Tesla crashes [H-17, H-18], and the L3 takeover meta-analysis [H-19] converge: abruptly handing control to a disengaged human at the worst moment fails. Re-engagement takes 1.5–3.5+ s (regaining competence takes longer), grows with disengagement, and often arrives too late. Design handoffs to be anticipatory, gradual, and context-rich — and accept that for time-critical failures the human fallback may not be viable.
  • The "last line of defense" / human-safety-driver fallacy. A human kept nominally responsible but operationally idle is the least reliable safeguard because reliable automation induces disengagement [H-16, H-17, H-18]. Oversight requires active, verified engagement; gameable engagement checks provide the illusion, not the substance.
  • Alert fatigue is quantified and severe. Clinicians override 49–96% of drug-safety alerts [H-10]; Three Mile Island dumped 100+ unprioritized alarms on its operators within minutes [H-20]. Indiscriminate flagging trains humans to dismiss everything, including the rare valid signal. Reserve interruptive/blocking interventions for high-severity cases; make the rest passive. EEMUA: every alarm must demand a defined response; cap the interruptive rate [H-21].
  • You can't safely fix alert fatigue by muting the noisiest signals [H-11]. Rare-but-critical cases hide in high-override categories. Reduce volume by per-case risk stratification/tiering, not blanket suppression.
  • Skill decay is real and measured [H-9, H-1]. Heavy automation erodes the underlying skill (manual-handling errors in ~60% of the cases reviewed in [H-9]). A human who never performs the task can't meaningfully review an agent that does — build in deliberate manual practice.
  • Out-of-the-loop = lost comprehension, not just lost data [H-3]. Showing logs/raw output isn't enough; the human needs to retain comprehension of what the agent is doing and why. Intermediate automation preserves takeover ability better than full autonomy with a passive monitor.
  • Over-trust and under-trust are one calibration problem [H-2]. Even experts are biased by confident-but-wrong AI recommendations [H-13]. Calibrate trust to actual reliability; surface confidence/uncertainty honestly.
  • Mode confusion is first-class for agents with multiple states/permissions [H-5]. Users must always know what the agent is doing and under what authority right now. Silent transitions (suggest→act; a guardrail silently disabled) are the cockpit mode errors behind Asiana 214 [H-8].
  • Select the appropriate level of automation — and make stepping down easy [H-6]. "Children of the magenta line": the failure is uncritically riding maximum automation into task saturation.
  • There is a hard ceiling on supervisory span [H-22]. One operator caps at a handful of actively-supervised agents; performance collapses past ~70% utilization. Scaling agent count without scaling/pooling oversight guarantees unmonitored failures.
  • Crisis is exactly when the firehose hurts most [H-20, H-7]. At overload, prioritize, aggregate, and interpret for the human — don't dump everything.
  • Validate the human–system combination, with realistic misuse [H-14, H-16]. Safety is a property of the whole socio-technical system; test against real, over-trusting users — and don't let an org offload safety responsibility onto a single monitor.

Annotated bibliography

[H-1] Bainbridge, L. (1983). [STUDY] Ironies of Automation. Automatica 19(6), 775–779. (full entry at [E-7]) — The automation paradox: the more you automate, the more critical/demanding the residual human role. Transferable lesson: leaving the human "only the exceptions" is self-defeating — the exceptions are the hard cases; keep the human meaningfully in the loop during normal operation.

[H-2] Parasuraman, R., & Riley, V. (1997). [STUDY] Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors 39(2), 230–253. (full entry at [F-2]) — Transferable lesson: over- and under-trust are two failure modes of the same calibration problem; too many low-value flags train the human to ignore valid ones. Calibration is a design responsibility, not the user's fault.

[H-3] Endsley, M. R., & Kiris, E. O. (1995). [STUDY] The Out-of-the-Loop Performance Problem and Level of Control in Automation. Human Factors 37(2), 381–394. DOI: 10.1518/001872095779064555. — Higher automation degrades comprehension (L2 SA) even when data monitoring (L1) is intact, slowing recovery; the OOTL decrement is worse under full than under intermediate automation. Transferable lesson: keep the human in the decision loop; displaying raw data is not enough — humans need comprehension, not just awareness.

[H-4] Endsley, M. R. (2017). [STUDY] From Here to Autonomy: Lessons Learned From Human–Automation Research. Human Factors 59(1), 5–27. DOI: 10.1177/0018720816681350. — The "automation conundrum": rising autonomy/reliability → falling SA → less able to take over at the rare failure. Transferable lesson: reliability is not a substitute for transparency; an opaque-but-reliable agent makes rare failures more likely to slip past the human.

[H-5] Sarter, N. B., & Woods, D. D. (1995). [STUDY] "How in the World Did We Ever Get into That Mode?" Mode Error and Awareness in Supervisory Control. Human Factors 37(1), 5–19. DOI: 10.1518/001872095779049516. — Glass-cockpit mode confusion → "automation surprises." Transferable lesson: the canonical risk for any agent with multiple states/permissions/tool-policies — users must always be able to answer "what is the agent doing and under what authority?" Invisible mode changes (suggest→act) are direct analogues.

[H-6] Vanderburgh, W. (1997). [STUDY / training] Children of the Magenta Line. American Airlines Flight Academy lecture. Analysis: https://airfactsjournal.com/2020/09/stepping-down-in-automation-the-real-lesson-for-children-of-the-magenta-line/ · 99% Invisible: https://99percentinvisible.org/episode/children-of-the-magenta-automation-paradox-pt-1/ — ~68% of AA incidents involved automation mismanagement; pilots failed to step down to lower automation when needed, losing SA and saturating. Transferable lesson: make it easy and natural to drop to lower-autonomy/manual mode under uncertainty, rather than defaulting users into maximum automation.

[H-7] BEA (2012). [INCIDENT] Final Report — Air France Flight 447 (A330-203, F-GZCP), 1 June 2009. Bureau d'Enquêtes et d'Analyses. https://bea.aero/en/investigation-reports/notified-events/detail/ (AF447) · HF analysis: https://humanfactors101.com/incidents/air-france-flight-447/ — Iced pitot tubes → airspeed loss → autopilot disconnect → startled, out-of-the-loop crew with atrophied high-altitude handling held nose-up and stalled into the ocean; 228 died. Transferable lesson: the textbook startle + sudden handoff to an unprepared human. Handoffs must be anticipatory, gradual, and context-rich — not a panic dump.

[H-8] NTSB (2014). [INCIDENT] Asiana Airlines Flight 214, B777-200ER, San Francisco, 6 July 2013. NTSB/AAR-14/01. https://www.ntsb.gov/investigations/AccidentReports/Reports/AAR1401.pdf — Over-reliance on automation not fully understood; an inappropriate autopilot mode (FLCH) left the autothrottle no longer protecting airspeed; faulty mental model → landed short into the seawall. Transferable lesson: never let the human assume a guardrail is on when a mode choice silently disabled it; make active safety protections and their gaps explicit.

[H-9] PARC/CAST Flight Deck Automation WG (2013). [STANDARD] Operational Use of Flight Path Management Systems. FAA. https://www.faa.gov/sites/faa.gov/files/aircraft/air_cert/design_approvals/human_factors/OUFPMS_Report.pdf (host returned HTTP 403 to automated fetch; corroborated via SKYbrary) — 29 findings/18 recommendations: documented manual-flying skill decay, automation dependency, knowledge gaps, complacency; manual handling errors in ~60% of cases → FAA SAFO 13002/17007 urging hand-flying. Transferable lesson: heavy automation erodes skill; oversight needs deliberate "manual practice" — a human who never does the task can't review the agent.

[H-10] van der Sijs, H., Aarts, J., Vulto, A., & Berg, M. (2006). [STUDY] Overriding of Drug Safety Alerts in Computerized Physician Order Entry. JAMIA 13(2), 138–147. https://pmc.ncbi.nlm.nih.gov/articles/PMC1447540/ — Clinicians override drug-safety alerts in 49–96% of cases, driven by poor signal-to-noise. Transferable lesson: the empirical heart of alert fatigue — indiscriminate flagging makes humans override ~everything, including the rare valid alert. Specificity/prioritization are safety-critical.

[H-11] van der Sijs, H., et al. [STUDY] Turning off frequently overridden drug alerts: limited opportunities for doing it safely. JAMIA. https://pmc.ncbi.nlm.nih.gov/articles/PMC2585537/ (exact PMC ID UNVERIFIED; widely cited) — Simply suppressing the most-overridden alerts is unsafe — a minority remain clinically critical. Transferable lesson: don't reduce agent alert volume by blanket-muting noisy categories; rare-but-fatal cases hide in the noise — use per-case risk stratification.

[H-12] Hussain, M. I., Reynolds, T. L., & Zheng, K. (2019). [STUDY] Medication safety alert fatigue may be reduced via interaction design and clinical role tailoring: a systematic review. JAMIA 26(10), 1141–1149. https://academic.oup.com/jamia/article/26/10/1141/5519579 — Alert fatigue reduced by tiering interruptive vs. passive alerts and tailoring to role/context. Transferable lesson: the fix is design, not more alerts — reserve interruptive/blocking for high-severity; make most signals passive; route the right alert to the right human (maps to notify vs. require-approval vs. block).
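The notify vs. require-approval vs. block mapping can be made concrete with a small tiering function. This is a sketch under stated assumptions: the `Tier` names, the 0.0–1.0 `severity` score, and the 0.4 / 0.7 cut-offs are all hypothetical placeholders, not values reported in the review.

```python
from enum import Enum


class Tier(Enum):
    PASSIVE_LOG = "passive_log"            # recorded, never interrupts
    NOTIFY = "notify"                      # non-blocking notice
    REQUIRE_APPROVAL = "require_approval"  # blocks until a human decides
    BLOCK = "block"                        # refused outright


def tier_for(severity: float, reversible: bool, prohibited: bool = False) -> Tier:
    """Map one agent event to an oversight tier.

    `severity` is an assumed 0.0-1.0 risk score; the cut-offs below are
    placeholders to be tuned, not values taken from the literature.
    """
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must be in [0, 1]")
    if prohibited:
        return Tier.BLOCK
    if severity >= 0.7 and not reversible:
        return Tier.REQUIRE_APPROVAL
    if severity >= 0.4:
        return Tier.NOTIFY
    return Tier.PASSIVE_LOG
```

Note that a high-severity but reversible event only notifies here: reversibility is treated as a reason to avoid an interruptive gate, which is one possible policy among several.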

[H-13] Dratsch, T., et al. (2023). [STUDY] Automation Bias in Mammography: The Impact of AI BI-RADS Suggestions on Reader Performance. Radiology 307(4). https://pubs.rsna.org/doi/full/10.1148/radiol.222176 — Radiologists at all experience levels were swayed by AI suggestions; incorrect AI advice degraded accuracy even among experts. Transferable lesson: expertise doesn't immunize against automation bias — "a human reviews it" is not a reliable safeguard when the human anchors on the agent's output.

[H-14] U.S. FDA (2016). [STANDARD] Applying Human Factors and Usability Engineering to Medical Devices. FDA. https://www.fda.gov/media/80481/download — Manufacturers must validate safety in the hands of real, fallible users under realistic conditions, incl. use errors. Transferable lesson: demonstrate safety for the human–system combination incl. misuse and over-trust — not just nominal performance.

[H-15] SAE International (2021). [STANDARD] J3016: Taxonomy and Definitions for Driving Automation Systems (rev. 202104). https://www.sae.org/standards/content/j3016_202104/ — Levels 0–5; the critical L2→L3 boundary: at L3 the system does the entire dynamic driving task but the human must be a fallback when prompted. Transferable lesson: the most dangerous region is the ambiguous middle — "the agent mostly does it, but you're responsible if it fails." Clear, communicated authority (who's responsible right now) matters more than raw capability.

[H-16] NTSB (2019). [INCIDENT] Collision Between Developmental ADS Vehicle and Pedestrian, Tempe, AZ, 18 Mar 2018. NTSB/HAR-19/03. https://www.ntsb.gov/investigations/AccidentReports/Reports/HAR1903.pdf — First pedestrian fatality by a self-driving car: misclassification + "action suppression" delayed braking; the safety driver was on her phone; NTSB cited deficient safety culture and unmanaged automation complacency. Transferable lesson: the "human safety driver as last line of defense" is a fallacy when the system is reliable enough to induce disengagement but not enough to trust.

[H-17] NTSB (2017). [INCIDENT] Tesla Autopilot, Williston, FL, 7 May 2016. NTSB/HAR-17/02. https://www.ntsb.gov/investigations/AccidentReports/Reports/HAR1702.pdf — Probable cause included driver inattention from over-reliance, enabled by a design that permitted prolonged disengagement and use outside intended conditions. Transferable lesson: if an agent allows the human to check out, they will — over-reliance is a predictable product of permissive design; constrain use to the operational design domain.

[H-18] NTSB (2020). [INCIDENT] Tesla Autopilot, Mountain View, CA, 23 Mar 2018. NTSB/HAR-20/01. https://www.ntsb.gov/investigations/AccidentReports/Reports/HAR2001.pdf — Autopilot steered into a gore-point barrier; driver playing a game; NTSB found no effective driver-engagement monitoring and inadequate alert timing. Transferable lesson: token engagement checks (trivially gamed hands-on-wheel nags) aren't oversight — verify meaningful engagement.

[H-19] Zhang, B., de Winter, J., Varotto, S., Happee, R., & Martens, M. (2019). [STUDY, meta-analysis] Determinants of take-over time from automated driving: A meta-analysis of 129 studies. Transportation Research Part F 64, 285–307. https://www.sciencedirect.com/science/article/pii/S1369847818307249 — Takeover time typically ~1.5–3.5 s, worsened by non-driving tasks, shortened by urgent/multimodal warnings; quality degrades with secondary-task load. Transferable lesson: handoff isn't instantaneous — budget for re-engagement latency; for time-critical failures the human-fallback model may simply not work.

[H-20] President's Commission (Kemeny) (1979). [INCIDENT] Report on the Accident at Three Mile Island. U.S. GPO. https://www.threemileisland.org/downloads/188.pdf — 100+ alarms with no prioritization/suppression, poorly arranged indicators, contradictory feedback → wrong operator actions; absent the HF failures it would have been minor. Transferable lesson: the original alarm-flood disaster — dumping everything at once disables oversight. Prioritize, aggregate, and interpret for the human at the moment of overload.

[H-21] EEMUA (2013). [STANDARD] Publication 191: Alarm Systems — A Guide to Design, Management and Procurement (3rd ed.). https://www.eemua.org/Products/Publications/Print/EEMUA-Publication-191.aspx (paywalled; existence verified) — Defines an alarm as something requiring a specific operator response; target ≤~10 alarms / 10 min / operator; alarm rationalization to delete/downgrade nuisance alarms. Transferable lesson: a quantified discipline for keeping oversight signals within human capacity — every agent "flag" should demand a defined action; cap the interruptive rate; everything else is a passive log.
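One way to cap the interruptive rate is a sliding-window budget per operator. The sketch below is illustrative only: `InterruptBudget` is a hypothetical name, and the defaults simply echo the ~10 alarms per 10 minutes figure quoted above. Alerts beyond the budget are demoted to a passive log on a first-come basis; a real system would likely rank by severity before demoting anything.

```python
from collections import deque
from typing import Deque


class InterruptBudget:
    """Sliding-window cap on interruptive alerts for one operator."""

    def __init__(self, max_alerts: int = 10, window_s: float = 600.0) -> None:
        self.max_alerts = max_alerts
        self.window_s = window_s
        self._sent: Deque[float] = deque()  # timestamps of recent interrupts

    def route(self, now_s: float) -> str:
        """Return "interrupt" if budget remains, else "passive_log"."""
        # Drop timestamps that have aged out of the window.
        while self._sent and now_s - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_alerts:
            self._sent.append(now_s)
            return "interrupt"
        return "passive_log"


# Usage with a tiny budget so the demotion is visible.
budget = InterruptBudget(max_alerts=2, window_s=600.0)
routes = [budget.route(t) for t in (0.0, 10.0, 20.0, 700.0)]
```

With a budget of two, the third alert is demoted and the fourth interrupts again once the earlier ones have aged out of the window.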

[H-22] Cummings, M. L., Bruni, S., Mercier, S., & Mitchell, P. J. (2007). [STUDY] Automation Architecture for Single Operator, Multiple UAV Command and Control. The International C2 Journal 1(2), 1–24. http://www.dodccrp.org/files/IC2J_v1n2_01_Cummings.pdf — Span of control rises with autonomy (~4–5 active, up to ~12 supervisory), but operator utilization above ~70% triggers sharp performance decay; high autonomy/rapid re-planning erodes SA and breeds complacency. Transferable lesson: a hard ceiling on how many agents one person can meaningfully supervise — scaling agent count without scaling/pooling oversight is a predictable path to unmonitored failures.
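The utilization ceiling lends itself to a back-of-envelope check. The sketch below assumes a deliberately simple model (utilization as busy minutes per hour, with every agent generating the same review load); the function names and example numbers are hypothetical and are not taken from Cummings et al.

```python
def operator_utilization(agents: int, reviews_per_agent_per_hour: float,
                         minutes_per_review: float) -> float:
    """Fraction of each hour the operator spends servicing agents."""
    busy_minutes = agents * reviews_per_agent_per_hour * minutes_per_review
    return busy_minutes / 60.0


def max_agents(reviews_per_agent_per_hour: float, minutes_per_review: float,
               ceiling: float = 0.70) -> int:
    """Largest agent count that keeps utilization at or below `ceiling`."""
    per_agent = reviews_per_agent_per_hour * minutes_per_review / 60.0
    if per_agent <= 0:
        raise ValueError("per-agent load must be positive")
    return int(ceiling / per_agent)


# Example: each agent needs 4 reviews per hour at 2 minutes each.
span = max_agents(reviews_per_agent_per_hour=4, minutes_per_review=2)
at_span = operator_utilization(span, 4, 2)
over_span = operator_utilization(span + 1, 4, 2)
```

Under these assumed numbers the span is five agents; adding a sixth pushes the operator past the 70% mark, which is the point at which oversight should be pooled or the review load per agent reduced.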


Part III — Delegation, Organization & Resilience

The theory of delegating to, organizing around, and staying safe with an autonomous agent — the formal backbone the first two parts lacked.

III.J — Delegation: economics & law

A human delegating to an LLM agent is a principal–agent relationship; a century of economics and agency law already maps its failure modes and the price of oversight. Tags: theoretical / legal-doctrine / AI-application.

Highlights

  • A human delegating to an LLM agent is, formally, a principal–agent relationship [J-1, J-2]. The failure taxonomy — divergent objectives, hidden action, hidden type, costly observation — is already mapped; HITL is the monitoring term in a century-old equation.
  • Oversight is costly monitoring, and total agency cost can never hit zero [J-2]. Jensen–Meckling's monitoring + bonding + residual loss reframes HITL as minimizing total cost, not maximizing control: human review = monitoring; guardrails/evals = bonding; tolerated errors = residual loss.
  • "Management by exception" is the optimal response to monitoring cost [J-7]. Because informative signals have decreasing marginal value, review only the diagnostic exceptions — the economic justification for selective, risk-triggered HITL over uniform review.
  • Observability of the agent's process is provably valuable — log the trace [J-7]. Holmström's informativeness principle: any signal informative about the agent's action strictly lowers agency cost. Traces/tool-call logs/chain-of-thought are exactly those signals.
  • Measuring one thing distorts the rest — Goodhart has an economic theorem [J-10]. The multitask result: strong incentives on a measurable sub-goal degrade unmeasured ones → low-powered incentives + task/job-design limits. Over-tuning to one eval is structurally wrong; reward-hacking is predicted.
  • You can't specify everything, so the real choice is who holds residual control [J-8, J-9]. Incomplete-contracts theory: power lives in the override / final-decision right for unforeseen states. HITL is the assignment of residual control rights; selective intervention (halt, revoke access, substitute) is the operative form of control.
  • Behavior-based vs. outcome-based oversight is a contractible choice [J-5]. When behavior is observable/cheap to monitor, supervise the process; else contract on outcomes and bear the risk. Tells you whether HITL inspects steps or only final outputs.
  • Trust starts before runtime: agent/model selection is an adverse-selection problem [J-6]. Evals/reputation/screening are the "lemons" remedy; monitoring after delegation does not substitute for screening what you delegate to.
  • Authority must be scoped — and an agent can bind you beyond its scope [J-13]. Actual vs. apparent authority; ratification. Permission scoping is the technical twin of scope-of-authority; accepting an out-of-scope action ratifies it.
  • The principal is generally liable for the agent's in-scope acts [J-14, J-19]. Respondeat superior routes responsibility to the operator; narrowing scope is both a safety control and the primary liability lever. Since the agent has no intent, the law puts the human principal on the hook.
  • Some duties are non-delegable — the law itself can mandate a human in the loop [J-15, J-16]. Outsourcing a task doesn't outsource accountability; fiduciary loyalty + care supply the normative target for "alignment."
  • The AI-specific literature already translates all of this [J-17, J-18, J-19].

Annotated bibliography

[J-1] Ross, S. A. (1973). [theoretical] The Economic Theory of Agency: The Principal's Problem. American Economic Review 63(2), 134–139. JSTOR 1817064. — Founding statement of the economic agency problem: the principal chooses an incentive schedule when the agent's interests/information/risk-preferences diverge. Even a perfect contract requires paying to realign the agent. Relevance: the LLM operator is Ross's principal; "prompt + reward" is the fee schedule; misalignment is structural.

[J-2] Jensen, M. C., & Meckling, W. H. (1976). [theoretical] Theory of the Firm: Managerial Behavior, Agency Costs and Ownership Structure. J. Financial Economics 3(4), 305–360. DOI: 10.1016/0304-405X(76)90026-X. — Introduces agency cost = monitoring + bonding + residual loss; total cost is irreducible to zero. Relevance: the single most transferable frame — human review = monitoring, guardrails/evals = bonding, accepted errors = residual loss.
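Written out, the decomposition and its HITL reading look as follows. The symbols are our own shorthand rather than Jensen and Meckling's notation, and treating review intensity $m$ as the single design variable is a simplifying assumption.

```latex
% Agency cost as the sum of its three components (shorthand, not the paper's notation)
C_{\text{agency}}
  = \underbrace{M}_{\text{monitoring: human review}}
  + \underbrace{B}_{\text{bonding: guardrails, evals}}
  + \underbrace{R}_{\text{residual loss: tolerated errors}},
\qquad C_{\text{agency}} > 0 .

% HITL design as cost minimization over review intensity m (assumed model):
% monitoring cost rises with m, residual loss falls with m.
m^{*} = \arg\min_{m \ge 0} \; \bigl[ M(m) + B + R(m) \bigr],
\qquad M'(m) > 0, \quad R'(m) < 0 .
```

Read this way, the target is the review intensity at which one more unit of monitoring costs as much as the residual loss it removes, not the maximum review intensity available.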

[J-3] Stiglitz, J. E. (1974). [theoretical] Incentives and Risk Sharing in Sharecropping. Review of Economic Studies 41(2), 219–255. DOI: 10.2307/2296714. — Models the incentive-vs-risk-sharing trade-off; the optimum depends on output-signal noise. Relevance: high-stakes/noisy agent outputs warrant more direct control; low-stakes/clean ones warrant autonomy.

[J-4] Grossman, S. J., & Hart, O. D. (1983). [theoretical] An Analysis of the Principal-Agent Problem. Econometrica 51(1), 7–45. DOI: 10.2307/1912246. — Rigorous formalization of hidden action; the cost of inducing a desired action rises as the effort→output link weakens. Relevance: hard-to-verify agent tasks (research, open-ended reasoning) are the expensive ones to oversee.

[J-5] Eisenhardt, K. M. (1989). [review] Agency Theory: An Assessment and Review. Academy of Management Review 14(1), 57–74. DOI: 10.5465/amr.1989.4279003. — Distills the design choice between behavior-based (monitor the process) and outcome-based (pay for results) contracts, governed by uncertainty, goal conflict, programmability, measurability. Relevance: a ready-made rule for whether HITL inspects the trace or only the output.

[J-6] Akerlof, G. A. (1970). [theoretical] The Market for "Lemons": Quality Uncertainty and the Market Mechanism. QJE 84(3), 488–500. DOI: 10.2307/1879431. — Seminal adverse selection (hidden type): unobservable quality drives good types out. Relevance: model/agent selection is a pre-delegation problem distinct from runtime monitoring; screening/evals/reputation are the remedy.

[J-7] Holmström, B. (1979). [theoretical] Moral Hazard and Observability. Bell J. Economics 10(1), 74–91. DOI: 10.2307/3003320. — The informativeness principle: any signal informative about the agent's action should enter the contract and strictly reduces agency cost. Relevance: the theoretical license for trace/tool-call/CoT logging — process observability is the cheapest lever for trust.

[J-8] Grossman, S. J., & Hart, O. D. (1986). [theoretical] The Costs and Benefits of Ownership: A Theory of Vertical and Lateral Integration. JPE 94(4), 691–719. DOI: 10.1086/261404. — Founds incomplete-contracts / residual control rights: what matters is who decides in unforeseen contingencies. Relevance: the formal basis for "human retains the override" — HITL assigns residual control rights.

[J-9] Hart, O., & Moore, J. (1990). [theoretical] Property Rights and the Nature of the Firm. JPE 98(6), 1119–1158. DOI: 10.1086/261729. — Control is exercised through selective intervention (the ability to halt/withdraw access), not an exhaustive rulebook. Relevance: grounds kill-switches, scoped credentials, and selective task-revocation as the real instruments of control.

[J-10] Holmström, B., & Milgrom, P. (1991). [theoretical] Multitask Principal–Agent Analyses. J. Law, Economics & Organization 7 (special issue), 24–52. DOI: 10.1093/jleo/7.special_issue.24. — The multitask result: strong incentives on the measurable task distort effort away from unmeasured ones → low-powered incentives + job-design limits. A formal cousin of Goodhart. Relevance: explains reward-hacking/spec-gaming as structural; bound the task, don't just sharpen the metric.

[J-11] Royal Swedish Academy of Sciences (2016). [review] Oliver Hart and Bengt Holmström: Contract Theory (2016 Prize background). NobelPrize.org. — Authoritative synthesis tying moral hazard, informativeness, performance-measure distortion, and residual control into one toolkit for "designing policies and institutions." Relevance: a vetted one-stop overview of the delegation-contracting toolkit for agent governance.

[J-12] Williamson, O. E. (1985). [theoretical] The Economic Institutions of Capitalism. Free Press. ISBN 9780029348215. — Transaction-cost economics: under bounded rationality + opportunism + asset specificity, craft governance structures matched to the hazard profile. Relevance: vocabulary for matching oversight intensity to risk.

[J-13] American Law Institute (2006). [legal-doctrine] Restatement (Third) of Agency. ALI. — Distinguishes actual vs. apparent authority and ratification: a principal can be bound by acts it never authorized (if it created the appearance) and can ratify after the fact. Relevance: an agent acting beyond scope may still bind its operator; accepting an out-of-scope action ratifies it. Scope limits = permission scoping.

[J-14] Cornell LII, Wex. [legal-doctrine] Respondeat Superior / Vicarious Liability. law.cornell.edu/wex/respondeat_superior. — Principal is liable for an agent's wrongful acts within the scope of agency. Relevance: liability flows to the operator for in-scope acts; narrowing scope is the principal's main liability lever.

[J-15] Cornell LII, Wex (+ Restatement (Third) §7.06). [legal-doctrine] Non-Delegable Duty. law.cornell.edu/wex/vicarious_liability. (treated within vicarious-liability material) — Some duties cannot be offloaded; the principal stays liable regardless of who performs the act. Relevance: a hard legal limit on automation — certain decisions are non-delegable to an AI, mandating a human in the loop.

[J-16] Cornell LII, Wex (+ Restatement (Third) §§8.01–8.08). [legal-doctrine] Fiduciary Relationship. law.cornell.edu/wex/fiduciary_relationship. — Agency is inherently fiduciary: duties of loyalty (act solely in the principal's interest) and care (competence/diligence). Relevance: the normative target for "alignment" in law — an aligned agent discharges loyalty + care.

[J-17] Kolt, N. (2025). [legal/AI-application] Governing AI Agents. Notre Dame Law Review (forthcoming). arXiv:2501.07913. — Imports both economic principal-agent theory and common-law agency to AI; names information asymmetry, discretionary authority, and loyalty as the recurring agency problems, proposing inclusivity/visibility/liability infrastructure. Relevance: a roadmap translating [J-1]–[J-16] into agent governance (visibility = monitoring; liability = who answers).

[J-18] Gabison, G. A., & Xian, R. P. (2025). [theoretical/AI-application] Inherent and Emergent Liability Issues in LLM-Based Agentic Systems: A Principal-Agent Perspective. arXiv:2504.03255. — Applies principal-agent theory to liability; interpretability/eval/detection are the technical substitutes for the monitoring the theory demands. Relevance: operationalizes the informativeness principle as concrete agent-observability mechanisms.

[J-19] Ayres, I., & Balkin, J. M. (2024). [legal/AI-application] The Law of AI is the Law of Risky Agents Without Intentions. University of Chicago Law Review Online. — Since AI agents lack intent, ascribe intent (foreseeable consequences) and impose objective reasonableness on the humans/orgs who deploy them. Relevance: resolves the "agent can't form intent" objection by routing accountability to the human principal — a legal warrant for oversight obligations.

III.K — Delegation: doctrine & management

How organizations and militaries delegate by intent and boundaries rather than step-by-step control.

Highlights

  • Specify intent + boundaries, not steps — the most cross-validated lesson [K-7, K-8, K-6, K-19, K-20]: brief with purpose + desired end-state + constraints, then delegate method. Mission orders state "results to attain, not how."
  • Management by exception is the default oversight posture [K-1]: handle the routine autonomously, escalate only deviations. Tune active (tight tripwires) vs. passive (intervene on failure) by risk tier [K-2], with measured cost trade-offs [K-3]; avoid drifting into laissez-faire.
  • Pre-authorize action limits in advance [K-12, K-13]. Rules of engagement are decisions made before the moment; encode an agent's always-allowed set and prohibitions as a structured, condition-keyed, machine-checkable policy.
  • Bound to enable, not just to restrict [K-14]. Over-tight/ambiguous boundaries degrade initiative and tempo; clear boundaries are a force enabler.
  • Span-of-control ceilings cap multi-agent oversight [K-4, K-5]: coordination load grows geometrically with the number of interdependent subordinates, which puts the practical ceiling at ~5–6. Cap concurrent agents per supervisor, or decouple them (loose coupling + homogeneity) to widen the safe span.
  • Autonomy is a dial, not a switch [K-15] — assign each action class a Level of Automation; adjustable autonomy [K-17, K-18] makes the dial runtime-tunable via policy.
  • Make agents observable, predictable, directable (OPD) [K-16]: team effectiveness comes from supporting interdependence, not maximizing independence.
  • Escalation should be a cost-aware policy [K-18]: "act vs. hand back to the human" weighs expected cost of a wrong autonomous act against the cost/delay of interrupting — the formal trigger logic behind management-by-exception.
  • Pair autonomy with shared context [K-10]: "empowered execution without shared consciousness is dangerous." Grant autonomy only alongside rich briefing.
  • Orientation beats raw speed [K-11]: Boyd's OODA — invest in the agent's context/tools/memory (Orient) before letting it run faster loops; humans add most value in framing.
  • Govern with policy, not micromanagement [K-17]: runtime-adjustable authorizations + obligations as the control surface — "highest useful autonomy at acceptable trust."
  • Honor intent over literal instruction [K-9]: under the "duty to deviate," an agent should optimize for stated intent when literal instructions conflict with it, but only when observability/directability let the human catch misjudgments.
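
The cost-aware escalation rule in the highlights above can be sketched as a one-step expected-cost comparison. This is a deliberate simplification: [K-18] formalizes transfer-of-control strategies, whereas the sketch below only compares a single autonomous act against a single hand-back. The `ActionEstimate` fields, the `should_escalate` name, and every number are illustrative assumptions, not values from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionEstimate:
    """Hypothetical per-action estimates; all fields are assumptions."""
    p_error: float               # estimated probability the autonomous act is wrong
    cost_of_error: float         # cost incurred if a wrong act goes through
    cost_of_interrupt: float     # cost/delay of handing the decision to the human
    p_human_error: float = 0.0   # residual chance the human also gets it wrong


def should_escalate(est: ActionEstimate) -> bool:
    """Return True when handing back to the human has the lower expected cost."""
    expected_cost_act = est.p_error * est.cost_of_error
    expected_cost_escalate = (
        est.cost_of_interrupt + est.p_human_error * est.cost_of_error
    )
    return expected_cost_escalate < expected_cost_act


# Same error probability, very different stakes.
routine_edit = ActionEstimate(p_error=0.02, cost_of_error=10.0, cost_of_interrupt=5.0)
prod_delete = ActionEstimate(p_error=0.02, cost_of_error=10_000.0, cost_of_interrupt=5.0)

print(should_escalate(routine_edit))  # False: 0.2 expected loss < 5.0 interrupt cost
print(should_escalate(prod_delete))   # True: 200.0 expected loss > 5.0 interrupt cost
```

The point of the sketch is that the trigger depends on stakes and interruption cost together, not on model confidence alone: the two actions share the same error probability but only one warrants an interrupt.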

Annotated bibliography

[K-1] Taylor, F. W. (1903/1911). [conceptual] Shop Management; The Principles of Scientific Management. Harper. — Credited with the first articulation of management by exception: delegate the routine, reserve attention for deviations. Relevance: the foundational oversight posture — "handle the routine; surface only the exceptions"; define the expected envelope so the agent can self-identify exceptions.

[K-2] (Management-literature synthesis). [conceptual] Management by Exception. en.wikipedia.org/wiki/Management_by_exception. — Active MBE (monitor + intervene pre-crisis) vs passive MBE (intervene after breach). Relevance: choose the MBE mode by risk tier — active for high-stakes/irreversible agent actions, passive for low-stakes/reversible.

[K-3] Bass, B. M. & Avolio, B. J. (1990s–2000s); Judge & Piccolo (2004). [empirical] Full Range Leadership Model / MLQ. (meta-analysis J. Applied Psychology). — Formalizes MBE-active, MBE-passive, and laissez-faire (over-delegation) as measurable, outcome-linked dimensions. Relevance: "how actively do we watch the agent" is a tunable parameter on a real cost gradient; don't drift into laissez-faire.

[K-4] Graicunas, V. A. (1933). [conceptual] Relationship in Organization (in Gulick & Urwick, eds.). — Subordinate relationships grow geometrically as subordinates grow arithmetically → a reasonable span of ~5–6. Relevance: interdependence (not headcount) caps how many agents one supervisor can oversee.
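
Graicunas's growth claim is usually stated as a closed-form count of the direct, cross, and group relationships a supervisor with n subordinates has to keep in mind: n(2^(n-1) + n - 1). A minimal sketch that tabulates it (the function name is ours, not from the source) shows how quickly the count climbs around five or six subordinates:

```python
def graicunas_relationships(n: int) -> int:
    """Total supervisory relationships for n subordinates: n * (2**(n-1) + n - 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * (2 ** (n - 1) + n - 1)


for n in range(4, 8):
    print(n, graicunas_relationships(n))
# 4 44
# 5 100
# 6 222
# 7 490
```

For agent oversight the count only applies to agents whose work actually interlocks; independent agents do not generate the cross and group terms, which is why decoupling widens the safe span.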

[K-5] Urwick, L. F. (1956); Davis, R. C. (1951). [conceptual] The Manager's Span of Control (HBR); span ranges in classical management. — Span for interlocking work ~5–6, but widens greatly for routine/non-interdependent work. Relevance: widen agent-oversight spans by reducing interdependence + judgment load; reserve narrow spans for novel, interdependent work.

[K-6] Drucker, P. F. (1954/1973). [conceptual] The Practice of Management; Management: Tasks, Responsibilities, Practices. Harper & Row. — Originated Management by Objectives (joint goal-setting, then self-direction) + decentralization. Relevance: brief an agent on measurable end-states, then let it choose method — "specify what, not how."

[K-7] U.S. Army (2019). [doctrine] ADP 6-0: Mission Command. HQ Dept. of the Army. — Mission command = empower subordinate decision-making + decentralized execution; principles incl. commander's intent (purpose + desired end-state, act "without further orders") and mission orders (results, not how). Relevance: the gold-standard template for briefing an agent — purpose, end-state, constraints, then delegate; expects the plan to break and the agent to improvise toward intent.

[K-8] Widder, W. (2002). [doctrine/conceptual] Auftragstaktik and Innere Führung: Trademarks of German Leadership. Military Review. (PDF 403; lineage corroborated; exact page quotes UNVERIFIED) — Traces Auftragstaktik to Prussian reforms after 1806 (Scharnhorst, Moltke; formalized in the 1888 manual). Relevance: historical evidence that intent-driven decentralized execution beats tight central control under uncertainty and fast tempo — given trained, trusted subordinates + shared doctrine.

[K-9] (Corroborating sources). [conceptual/doctrine] The Origins of Auftragstaktik. Australian Army Research Centre; USNI Proceedings (2025). — Commanders state what and why; subordinates decide how, with a duty to deviate from orders when the situation demands it to achieve intent. Relevance: an agent should optimize for stated intent over literal instruction when they conflict.

[K-10] McChrystal, S. et al. (2015). [practitioner] Team of Teams. Portfolio/Penguin. — Shared consciousness (radical transparency) + empowered execution (authority to the edge); "empowered execution without shared consciousness is dangerous." Relevance: safe agent autonomy requires rich context, not just permissions; informs briefing/context design.

[K-11] Boyd, J. R. (1976–1996). [conceptual] A Discourse on Winning and Losing (OODA). — Observe–Orient–Decide–Act; Orientation is pivotal (a well-oriented slow loop beats a fast poorly-oriented one); enables late commitment via feedback. Relevance: agent value = loop speed × orientation quality; fund orientation (context/tools/memory) before speed; humans add most value in Orient.

[K-12] CJCS (2005). [doctrine] CJCSI 3121.01B: Standing Rules of Engagement / Standing Rules for the Use of Force. — Standing (pre-issued) authorization + inherent right of self-defense — certain actions pre-authorized so the unit acts immediately without asking up the chain. Relevance: the canonical pre-authorization model — define in advance the agent's always-allowed set and prohibitions.

[K-13] NATO. [doctrine] NATO Rules of Engagement (MC 362/1); AJP-3. — ROE define "the circumstances, conditions, degree, and manner" force may be used — structured authorizations + prohibitions tied to conditions. Relevance: express agent boundaries as a machine-checkable "authorize/prohibit by condition" policy.
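
A machine-checkable "authorize/prohibit by condition" policy could look like the sketch below. It is an assumption-laden illustration, not a format taken from any ROE document or agent framework: the `Rule` shape, the action names, the context keys, and the choice that prohibitions win over allowances while unmatched actions escalate to a human are all ours.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

Context = Mapping[str, object]


@dataclass(frozen=True)
class Rule:
    effect: str                          # "allow" or "prohibit"
    action: str                          # hypothetical action-class name
    condition: Callable[[Context], bool]
    reason: str


# Hypothetical standing policy, decided before the moment of action.
POLICY: Sequence[Rule] = (
    Rule("prohibit", "shell.run",
         lambda c: c.get("env") == "production", "no shell in production"),
    Rule("allow", "file.write",
         lambda c: str(c.get("path", "")).startswith("workspace/"),
         "writes stay inside the workspace"),
    Rule("allow", "file.read", lambda c: True, "reads are always allowed"),
)


def decide(action: str, context: Context, policy: Sequence[Rule] = POLICY) -> str:
    """Return "deny", "allow", or "escalate" for an action in a given context."""
    matches = [r for r in policy if r.action == action and r.condition(context)]
    if any(r.effect == "prohibit" for r in matches):
        return "deny"       # prohibitions take precedence over allowances
    if any(r.effect == "allow" for r in matches):
        return "allow"      # pre-authorized: act without asking
    return "escalate"       # not covered by standing policy: ask the human


print(decide("file.write", {"path": "workspace/notes.md"}))  # allow
print(decide("file.write", {"path": "/etc/hosts"}))          # escalate
print(decide("shell.run", {"env": "production"}))            # deny
```

Defaulting unmatched actions to "escalate" rather than "deny" keeps the boundary enabling in the sense of [K-14]: the agent is never stuck, it just has to ask.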

[K-14] Lieber Institute, West Point (2021). [conceptual/legal] Rules of Engagement in Large-Scale Combat Operations: Force Enabler or Much Ado About Nothing? — Well-crafted ROE are a force enabler; overly restrictive/ambiguous ROE degrade tempo and initiative. Relevance: over-constraining agents hurts as much as under-constraining; design boundaries that enable confident action.

[K-15] Sheridan, T. B. & Verplank, W. (1978); Sheridan (2000s). [conceptual/empirical] Levels of Automation; Human Supervisory Control. (cross-ref [E-4]) — The 10-level LOA scale; supervisory control = set goals/constraints, intervene by exception. Relevance: the canonical taxonomy for where on the autonomy spectrum each agent action should sit.

[K-16] Johnson, M., Bradshaw, J. M., Feltovich, P. J., et al. (2014). [conceptual/empirical] Coactive Design: Designing Support for Interdependence in Joint Activity. J. Human-Robot Interaction 3(1). (cross-ref [M] cluster) — Effectiveness comes from supporting interdependence via Observability, Predictability, Directability (OPD). Relevance: a supervisable agent exposes state, is predictable, and can be redirected mid-task — "directable autonomy" as the goal.

[K-17] Bradshaw, J. M. et al. [conceptual] Toward Trustworthy Adjustable Autonomy in KAoS. IHMC. — Policy-governed autonomy: machine-readable authorizations + obligations, imposed/modified at runtime — "highest useful autonomy at acceptable trust." Relevance: the AI analogue of ROE — runtime-adjustable policy as the control surface.

[K-18] Scerri, P., Pynadath, D. V. & Tambe, M. (2002). [empirical/conceptual] Towards Adjustable Autonomy for the Real World. JAIR 17. — Formalizes transfer-of-control strategies to minimize miscoordination cost; "when to hand control back to a human" is the central problem. Relevance: a principled, cost-aware basis for the escalation trigger behind management-by-exception.

[K-19] Sull, D. & Eisenhardt, K. M. (2015/2001). [conceptual/empirical] Simple Rules; Strategy as Simple Rules (HBR). — Firms in fast/complex markets govern with a handful of simple rules (boundary/how-to/priority) rather than detailed plans. Relevance: bound an agent with a small set of clear rules, not an exhaustive procedure — scales to novel situations, stays fast and aligned.

[K-20] (Practitioner adaptations). [practitioner] Commander's Intent in management practice. Thinkers50; DBM. — Documents the civilian crossover: a concise purpose + end-state lets decentralized teams improvise while aligned. Relevance: intent-based bounding is a general, business-tested technique — lowers the bar to adopt it for agent briefing.

III.L — Resilience engineering, HRO & accident theory

Agentic systems are tightly-coupled, complex socio-technical systems; this is the discipline that studies how such systems fail and stay safe.

Highlights

  • Tight coupling + interactive complexity = the danger zone (Perrow, [L-10]). Multi-agent pipelines with chained tool calls and little slack sit in the "normal accident" regime. The structural fix is to decouple: human checkpoints, reversibility, rate limits, circuit breakers, so failures can't propagate faster than humans can intervene.
  • Latent failures live at the blunt end (Reason, [L-12, L-13]; Leveson, [L-14]). An agent's bad output is the active failure; the leverage is in the latent conditions: prompt/tool design, eval coverage, permission scopes, gate placement. HITL is one layer of defense-in-depth, not a mechanism for catching every error.
  • Drift and migration are silent and continuous (Rasmussen, [L-15]; Dekker, [L-17]). Efficiency/effort pressure pushes systems toward the safety boundary via locally reasonable decisions. "Reduce human friction / the agent is good enough now" is textbook migration — counter with explicit, monitored counter-gradients.
  • Normalization of deviance erodes HITL standards one success at a time (Vaughan, [L-16]). Each un-reviewed action that "gets away with it" raises tolerance for un-reviewed actions. Track the trend in oversight bypasses, not just incidents.
  • Deference to expertise is the keystone HITL principle (Weick & Sutcliffe, [L-7]; Rochlin et al., [L-9]). Authority should migrate in real time to whoever has the relevant situational knowledge — human or specialized checker — independent of the default automation hierarchy.
  • Design for things going right — Safety-II (Hollnagel, [L-1, L-2]). Instrument the system's successful and recovered trajectories and near-misses, not only post-incident forensics. The human is a source of resilience, not an error term to design out.
  • The four cornerstones map onto an oversight architecture (Hollnagel et al., [L-5]): respond + monitor = runtime guardrails + anomaly detection; anticipate = red-teaming/evals; learn = incident review feeding updates. All four are necessary.
  • Robustness ≠ graceful extensibility (Woods, [L-6]). Handling anticipated faults ≠ handling surprise; LLM agents fail suddenly past their competence envelope. Humans supply graceful extensibility for novel situations.
  • Redundancy can backfire (Sagan, [L-11]). Stacking automated checkers adds interactions, diffuses responsibility, and manufactures false confidence; an incident-free record may be luck.
  • Optimizations are the seed of failure — ETTO (Hollnagel, [L-3]). Every speed/cost optimization (skipped checks, cached judgments, fewer gates) is an efficiency-for-thoroughness trade — make it explicit and monitored.
  • Treat safety as control; keep the controller's model accurate (Leveson STAMP, [L-14]). The human is a controller whose authority and situational awareness must stay adequate; STPA can derive where a human gate is needed.
  • "Human error" is a symptom, not a verdict (Dekker, [L-18]); beware the ironies of automation (Hollnagel & Woods, [L-19]). Don't design HITL as a blame-sink; if the agent escalates only the hardest cases, the human gets the toughest decisions with the least context.

Annotated bibliography

[L-1] Hollnagel, E., Woods, D. D., & Leveson, N. (eds.) (2006). [book/theory] Resilience Engineering: Concepts and Precepts. Ashgate. ISBN 9780754646419. — Founding text: safety is the presence of adaptive capacity, and failure is the flip side of the adaptations that normally produce success. Relevance: design agents for adaptive capacity; oversight should preserve and exercise human adaptive capacity, not assume a static failure list.

[L-2] Hollnagel, E. (2013). [white paper/theory] From Safety-I to Safety-II: A White Paper. EUROCONTROL. (PDF wouldn't render; authorship/publication corroborated) — Safety-I (prevent things going wrong) vs Safety-II (study why things usually go right; amplify the variability that absorbs disturbances). Relevance: instrument routine/near-miss/recovered runs; the human is a source of resilience.

[L-3] Hollnagel, E. (2009). [book/theory] The ETTO Principle: Efficiency–Thoroughness Trade-Off. Ashgate. ISBN 9780754676782. — People trade thoroughness for efficiency to fit time/resources; the same shortcuts occasionally produce failure. Relevance: every "speed up the loop / reduce friction" decision is an ETTO move — make it explicit and monitored.

[L-4] Hollnagel, E. (2012). [book/method] FRAM: The Functional Resonance Analysis Method. Ashgate. ISBN 9781409445517. — Models how normal variability in coupled functions can "resonate" into disproportionate outcomes. Relevance: failures in coupled agent steps are emergent; map where one agent's "good enough" output becomes another's unreliable input and place humans at high-resonance couplings.

[L-5] Hollnagel, E., Pariès, J., Woods, D. D., & Wreathall, J. (eds.) (2011). [book/theory+method] Resilience Engineering in Practice. Ashgate. — The four cornerstones: respond, monitor, anticipate, learn — all necessary. Relevance: maps directly to oversight architecture (guardrails = respond/monitor; red-teaming = anticipate; incident review = learn).

[L-6] Woods, D. D. (2015). [peer-reviewed/theory] Four Concepts for Resilience… Reliability Engineering & System Safety 141, 5–9. DOI: 10.1016/j.ress.2015.03.018. — Distinguishes rebound, robustness, graceful extensibility (stretching beyond design boundaries before becoming brittle), sustained adaptability. Relevance: LLM agents are brittle at distribution edges; humans supply graceful extensibility for surprise.

[L-7] Weick, K. E., & Sutcliffe, K. M. (2015; orig. 2001). [book/theory+empirical] Managing the Unexpected. Jossey-Bass. ISBN 9781118862414. — Five principles of mindful organizing: preoccupation with failure, reluctance to simplify, sensitivity to operations, commitment to resilience, deference to expertise. Relevance: the cluster's keystone — control should pass to whoever has the local knowledge; don't trust a single aggregate confidence score.

[L-8] La Porte, T. R., & Consolini, P. M. (1991). [peer-reviewed/theory+empirical] Working in Practice But Not in Theory: …High-Reliability Organizations. J. Public Admin. Research & Theory 1(1), 19–48. JSTOR 1181764. — HROs achieve near-failure-free operation via redundancy, decentralized real-time authority, safety culture, continuous learning. Relevance: reliable high-stakes agent deployment needs deliberate organizational scaffolding, not just a better model.

[L-9] Rochlin, G. I., La Porte, T. R., & Roberts, K. H. (1987). [peer-reviewed/empirical] The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea. Naval War College Review 40(4), 76–92. — Reliability under tight coupling comes from dense communication + authority that flexes to whoever sees the hazard first. Relevance: a model for human–agent teaming where intervention authority moves fluidly to the best-informed actor.

[L-10] Perrow, C. (1984/1999). [book/theory] Normal Accidents: Living with High-Risk Technologies. Basic Books / Princeton UP. — Interactive complexity + tight coupling make serious accidents inevitable ("normal"); adding safety features can add interactions and worsen things. Relevance: multi-agent pipelines are exactly this regime — deliberately insert slack (checkpoints, rate limits, reversibility) to decouple.
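
One way to insert the slack that [L-10] motivates is a circuit breaker around each agent step: after a run of consecutive failures it halts the pipeline and stays halted until a human resets it, so a fault cannot keep propagating through chained tool calls. This is a minimal sketch; the class and method names, the threshold, and the human-reset convention are illustrative assumptions.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and a human has not reset it."""


class CircuitBreaker:
    def __init__(self, max_consecutive_failures: int = 3) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.is_open = False

    def call(self, step: Callable[[], T]) -> T:
        """Run one agent step, or refuse if the breaker has tripped."""
        if self.is_open:
            raise CircuitOpenError("pipeline halted; human reset required")
        try:
            result = step()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                self.is_open = True  # trip: stop propagating the fault
            raise
        self.consecutive_failures = 0  # any success clears the streak
        return result

    def human_reset(self) -> None:
        """Only a human decision re-closes the breaker."""
        self.consecutive_failures = 0
        self.is_open = False


def flaky_tool_call() -> str:
    raise RuntimeError("tool returned malformed output")


breaker = CircuitBreaker(max_consecutive_failures=2)
for _ in range(2):
    try:
        breaker.call(flaky_tool_call)
    except RuntimeError:
        pass

print(breaker.is_open)  # True: further steps are refused until human_reset()
```

Per the warning in [L-10] and [L-11], a breaker is itself an added safety feature with its own interactions, so it is worth keeping it this simple and observable.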

[L-11] Sagan, S. D. (1993). [book/empirical+theory] The Limits of Safety. Princeton UP. — The definitive NAT-vs-HRO test (nuclear C2 near-misses): redundancy added complexity and new failure modes; "good luck" mattered more than "good design." Relevance: more automated checkers can create new interaction failures and false confidence; an incident-free record ≠ a safe system.

[L-12] Reason, J. (1990). [book/theory] Human Error. Cambridge UP. ISBN 9780521314190. — Active vs latent failures; the Swiss-cheese model (defenses with holes that occasionally align). Relevance: the leverage is in latent conditions (prompt design, eval gaps, permissions, gate placement), not the proximate error.

[L-13] Reason, J. (2000). [peer-reviewed/theory] Human Error: Models and Management. BMJ 320(7237), 768–770. DOI: 10.1136/bmj.320.7237.768. — The person vs system approach: build defenses, accept fallibility, target conditions. Relevance: the accessible single-paper cite for "don't just blame the human reviewer or the model — fix the system of defenses."

[L-14] Leveson, N. G. (2011). [book/theory+method] Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press (open access). — STAMP: accidents = inadequate enforcement of safety constraints by the control structure; controllers need accurate process models. Includes STPA/CAST. Relevance: arguably the best-fit model for HITL — the human is a controller whose authority + SA must stay adequate; STPA derives where/why a gate is needed.

[L-15] Rasmussen, J. (1997). [peer-reviewed/theory] Risk Management in a Dynamic Society: A Modelling Problem. Safety Science 27(2–3), 183–213. DOI: 10.1016/S0925-7535(97)00052-0. — Work practices migrate toward the boundary of acceptable performance under efficiency/effort gradients until they cross it. Relevance: pressure to reduce oversight is textbook migration; build explicit counter-gradients (mandatory gates, drift monitors).

[L-16] Vaughan, D. (1996). [book/empirical] The Challenger Launch Decision. University of Chicago Press. — Coins normalization of deviance: locally reasonable decisions incrementally redefine unsafe conditions as acceptable. Relevance: each successful un-reviewed action ratchets up tolerance for un-reviewed actions — guard against slow erosion of HITL standards.
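
Guarding against slow erosion means watching the trend in oversight bypasses rather than waiting for an incident. A minimal sketch, assuming the team already logs weekly counts of actions that skipped review and of all actions: it computes the bypass rate per week and the least-squares slope of that rate. The function names and the sample counts are hypothetical.

```python
from typing import List, Sequence


def bypass_rates(bypassed: Sequence[int], total: Sequence[int]) -> List[float]:
    """Share of actions per period that skipped human review."""
    if len(bypassed) != len(total):
        raise ValueError("series must have the same length")
    if any(t <= 0 for t in total):
        raise ValueError("every period needs at least one action")
    return [b / t for b, t in zip(bypassed, total)]


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their period index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical four weeks of logs: no incidents, but bypasses creep upward.
rates = bypass_rates(bypassed=[2, 4, 6, 9], total=[100, 100, 100, 100])
slope = trend_slope(rates)
print(slope > 0)  # True: tolerance for un-reviewed actions is rising
```

A sustained positive slope with zero incidents is exactly the pattern the normalization-of-deviance account describes, so it is a reasonable signal to review gate placement before anything has gone wrong.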

[L-17] Dekker, S. (2011). [book/theory] Drift into Failure. Ashgate/Routledge. ISBN 9781409422211. — Failure is emergent — accumulated locally rational adaptations ("drift"), not a broken part. Relevance: agent-system failures are often emergent drift across prompts/tools/policies — monitor the slope of the whole system, not just discrete defects.

[L-18] Dekker, S. (2014; orig. 2002). [book/practitioner theory] The Field Guide to Understanding 'Human Error'. Ashgate/Routledge. ISBN 9781472439055. — Old View (bad apples) vs New View (error is a symptom of deeper systemic trouble; people are a resource). Relevance: don't design HITL as a blame-sink; understand and support the human's local rationality — same New View for agent failures.

[L-19] Hollnagel, E., & Woods, D. D. (2005). [book/theory] Joint Cognitive Systems: Foundations of Cognitive Systems Engineering. CRC/Taylor & Francis. (cross-ref [M] cluster) — Analyze human + automation as a single joint cognitive system; warns of the ironies of automation (automation handles easy cases, dumps degraded hard cases on the de-skilled human). Relevance: if the agent escalates only hard/ambiguous cases, the human faces the hardest with the least context — design the joint system so the human stays competent.

III.M — Joint cognitive systems & human-autonomy teaming

The unit of design is the joint human–machine system, not the agent in isolation. An effective agent is a team player: observable, predictable, directable, sharing common ground.

Highlights

  • Observability, Predictability, Directability (OPD) are the three core design requirements [M-1, M-5] — and they are bidirectional (the human must also be observable/predictable/directable to the agent).
  • Coordination is never free and never goes to zero [M-7, M-15]. The "fully autonomous" agent is a myth; automating a task relocates coordination work (often dumping it on the human at the worst moment). Design the coordination; don't pretend it away.
  • "Teammate" beats both "tool" and "fully autonomous" [M-1, M-7, M-9]. A tool loads all coordination on the human; "full autonomy" fails at the boundaries. The teammate stance is the sweet spot.
  • Interdependence, not autonomy, is the right organizing principle [M-5, M-6, M-8]. Map where human and agent depend on each other (Interdependence Analysis), then design to support it. "More autonomy" is the wrong goal.
  • Plans are resources for action, not controllers of it (Suchman, [M-4]). A rigid plan-executor breaks in open environments; design for situated re-grounding and graceful breakdown-and-repair.
  • Common ground is a maintained state, not a one-time briefing [M-2, M-14]. It decays; breakdowns are normal; the design question is whether repair is cheap (least collaborative effort). Explains why LLM context/memory drift causes coordination failures.
  • Team cognition lives in the interaction [M-11, M-12]: shared understanding and team SA are built by anticipatory information-pushing and closed-loop comms. Agents that only answer when queried produce weak team SA.
  • Individual task competence ≠ teaming competence [M-11]. Synthetic teammates can do the job yet fail as team members for lack of proactive, bidirectional coordination.
  • Design for calibrated trust, not maximal trust [M-16].
  • Continual feedback preserves the human's situation awareness [M-15, M-17]. Silent automation pulls the human out of the loop, destroying oversight.
  • The authority gradient must run both ways (CRM, [M-19]). Humans need a low-friction path to challenge/override; a good agent should flag when it thinks the human is wrong. Closed-loop confirmation is a portable pattern.
  • Evaluate the joint system, not the agent in isolation [M-3]. The metric is combined human+agent performance and the quality of the coupling — not the agent's solo benchmark.
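
The closed-loop confirmation pattern from the CRM highlight above can be sketched as read-back, confirm, act, report. The agent states what it understood, acts only on explicit confirmation, and closes the loop by reporting the outcome. All names here (`ProposedAction`, `closed_loop_execute`, the callbacks) are hypothetical, and a real system would also need the reverse channel in which the agent flags suspected human error.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProposedAction:
    verb: str
    target: str


def closed_loop_execute(
    proposal: ProposedAction,
    human_confirms: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], str],
) -> List[str]:
    """Read back, wait for confirmation, act, then report the outcome."""
    log = [f"agent read-back: {proposal.verb} {proposal.target}"]
    if not human_confirms(proposal):
        log.append("human: not confirmed; nothing executed")
        return log
    log.append("human: confirmed")
    outcome = execute(proposal)
    log.append(f"agent report: {outcome}")  # acknowledgement closes the loop
    return log


executed: List[ProposedAction] = []


def run(action: ProposedAction) -> str:
    executed.append(action)
    return f"done: {action.verb} {action.target}"


proposal = ProposedAction("drop", "table staging_users")
for line in closed_loop_execute(proposal, lambda p: p.target.startswith("table staging"), run):
    print(line)
```

The read-back is what makes the confirmation meaningful: the human approves the agent's stated understanding of the action, not a bare yes/no prompt.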

Annotated bibliography

[M-1] Klein, G., Woods, D. D., Bradshaw, J. M., Hoffman, R. R., & Feltovich, P. J. (2004). [seminal] Ten Challenges for Making Automation a "Team Player" in Joint Human-Agent Activity. IEEE Intelligent Systems 19(6), 91–95. DOI: 10.1109/MIS.2004.74. — The canonical paper: a team-player agent must enter a Basic Compact (commit to coordinate and repair), maintain common ground, be interpredictable, directable, and make status/intentions observable. Relevance: the most directly portable design checklist for an LLM agent inside a human's workflow.

[M-2] Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). [seminal] Common Ground and Coordination in Joint Activity. In Rouse & Boff (eds.), Organizational Simulation, 139–184. Wiley. — Chapter-length parent of [M-1]: joint activity rests on interpredictability → common ground that must be continually established, monitored, and repaired. Relevance: frames LLM context/memory drift as a grounding problem; the remedy is continuous grounding, not one-shot prompting.

[M-3] Hollnagel, E., & Woods, D. D. (2005). [seminal] Joint Cognitive Systems: Foundations of Cognitive Systems Engineering. CRC/Taylor & Francis. ISBN 0-8493-2821-7. (cross-ref [L-19]) — Reframes the object of study from "human + machine as two boxes" to a single joint cognitive system whose competence emerges from the coupling. Relevance: evaluate combined human+agent performance, not the agent's solo benchmark.

[M-4] Suchman, L. A. (1987). [seminal] Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge UP. — The foundational critique of plan-based AI: plans are resources for action, not controllers of it; action is situated and improvised. Relevance: treat generated plans as revisable resources; keep a human for situated judgment; expect and handle breakdowns.

[M-5] Johnson, M., Bradshaw, J. M., Feltovich, P. J., et al. (2014). [seminal+empirical] Coactive Design: Designing Support for Interdependence in Joint Activity. J. Human-Robot Interaction 3(1), 43–69. DOI: 10.5898/JHRI.3.1.Johnson. (cross-ref [K-16]) — Interdependence must shape autonomy: identify task interdependencies (Interdependence Analysis), then support them via OPD for both parties; critiques "levels of autonomy" as ignoring the contingent nature of activity. Relevance: the most actionable design method here — map dependencies, then build the observability/redirection affordances they require.

[M-6] Johnson, M., et al. (2011). [seminal] The Fundamental Principle of Coactive Design: Interdependence Must Shape Autonomy. COIN VI (LNCS 6541), 172–191. DOI: 10.1007/978-3-642-21268-0_10. — Earlier statement of the coactive principle: autonomy should serve interdependence, not be an end in itself. Relevance: reframes the race as one toward more capable joint systems, not more independent agents.

[M-7] Bradshaw, J. M., Hoffman, R. R., Johnson, M., & Woods, D. D. (2013). [argumentative] The Seven Deadly Myths of "Autonomous Systems." IEEE Intelligent Systems 28(3), 54–61. — Demolishes the myths that autonomy is one-dimensional, an independent property of the machine, or ever "full." Core: autonomy is relational; coordination work never disappears, it relocates. Relevance: a direct rebuttal to "fully autonomous agent" hype — HITL is structural, not a temporary crutch.

[M-8] Johnson, M., Bradshaw, J. M., & Feltovich, P. J. (2018). [methodological] Tomorrow's Human–Machine Design Tools: From Levels of Automation to Interdependencies. J. Cognitive Engineering and Decision Making 12(1), 77–82. — Argues for retiring "levels of automation" in favor of analyzing/designing for interdependencies. Relevance: points to the tooling agent teams actually need (interdependence maps), beyond an autonomy dial.

[M-9] Lyons, J. B., Sycara, K., Lewis, M., & Capiola, A. (2021). [review] Human–Autonomy Teaming: Definitions, Debates, and Directions. Frontiers in Psychology 12:589585. DOI: 10.3389/fpsyg.2021.589585. — What makes a machine a teammate: agency, communication, intent sharing, shared mental models, task interdependence; teaming is partly perceptual/social. Relevance: clarifies which design moves (intent signaling, shared models) drive perceived teammate-ness.

[M-10] O'Neill, T., McNeese, N., Barron, A., & Schelble, B. (2022). [empirical review] Human–Autonomy Teaming: A Review and Analysis of the Empirical Literature. Human Factors 64(5), 904–938. DOI: 10.1177/0018720820960865. — Aggregates HAT evidence on composition, training, trust, communication, SA, performance. Relevance: outcomes hinge on designable/measurable factors (comms patterns, calibrated trust, shared SA), not raw capability.

[M-11] McNeese, N. J., Demir, M., Cooke, N. J., & Myers, C. (2018). [empirical] Teaming With a Synthetic Teammate. Human Factors 60(2), 262–273. DOI: 10.1177/0018720817743223. — Teams with a synthetic teammate struggled most with team-level coordination/communication even when individually competent. Relevance: concrete warning — LLM agents may execute well yet fail at the proactive, bidirectional comms teaming requires.

[M-12] Demir, M., McNeese, N. J., & Cooke, N. J. (2017). [empirical] Team Situation Awareness Within the Context of Human-Autonomy Teaming. Cognitive Systems Research 46, 3–12. DOI: 10.1016/j.cogsys.2016.11.003. — Team SA is built by anticipatory information-pushing; HAT teams push/pull less than all-human teams. Relevance: agents should anticipate and volunteer relevant context, not merely answer when queried.

[M-13] Cooke, N. J., et al. (likely 2013). [theory — UNVERIFIED citation] Interactive Team Cognition. (likely Cognitive Science; exact volume/pages not directly fetched.) — Team cognition is an activity observable in interaction, not a static shared property in members' heads. Relevance: instrument the human-agent interaction stream as the locus of "shared understanding." (Concept attested via [M-11], [M-12].)

[M-14] Clark, H. H., & Brennan, S. E. (1991). [seminal] Grounding in Communication. In Perspectives on Socially Shared Cognition, 127–149. APA. — Defines common ground and grounding; participants minimize combined effort (least collaborative effort), and the medium's grounding cost shapes how they do it. Relevance: informs how agents should confirm understanding, ask clarifying questions, and repair efficiently.

[M-15] Norman, D. A. (1990). [seminal] The 'Problem' with Automation… Phil. Trans. R. Soc. B 327(1241), 585–593. (cross-ref [G-12]) — Automation's failure mode is inadequate feedback/interaction, not over-automation. Relevance: continual feedback (observability) is the cure for opaque agents that act without surfacing what they do.

[M-16] Lee, J. D., & See, K. A. (2004). [seminal review] Trust in Automation: Designing for Appropriate Reliance. Human Factors 46(1), 50–80. (cross-ref [F-1]) — Calibrated trust = appropriate reliance, avoiding disuse and misuse. Relevance: design for trust calibration, not maximization.

[M-17] Endsley, M. R. (1995). [seminal] Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors 37(1), 32–64. (cross-ref [E-12]) — The 3-level SA model; automation can pull humans out of the loop. Relevance: observability must preserve the human's ability to project what the agent will do next.

[M-18] Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). [seminal] A Model for Types and Levels of Human Interaction with Automation. IEEE SMC-A 30(3), 286–297. (cross-ref [E-6]) — The four-stage function-allocation view — useful but, per [M-7]/[M-8], insufficient alone for genuine joint activity. Relevance: a still-useful map for partitioning an agent pipeline (retrieve/analyze/decide/act) and placing HITL checkpoints — provided you also design the coordination.

[M-19] Helmreich, R. L., Merritt, A. C., & Wilhelm, J. A. (1999). [seminal/historical] The Evolution of Crew Resource Management Training in Commercial Aviation. Int. J. Aviation Psychology 9(1), 19–32. — CRM (post-Tenerife) institutionalizes closed-loop communication, assertiveness/"speaking up," cross-checking, and flattened authority gradients. Relevance: the human-agent authority gradient runs both ways — humans must be able to challenge/override; agents should flag suspected human error; closed-loop confirmation maps to action acknowledgement.

[M-20] (2025). [bridging — verify before formal use] Joint Activity Design Heuristics for Enhancing Human-AI Teaming. arXiv:2512.08036. — Translates Klein/Woods joint-activity theory + coactive design into concrete heuristics for AI (joint sensemaking, predictable behavior, clear mode/transition signaling, local+global explanations). Relevance: evidence the tradition is being actively ported to modern AI.


Part IV — Cross-Industry Oversight Mechanisms

Concrete, battle-tested mechanisms for overseeing autonomous action, mined from industries that have done it for decades.

IV.N — Manufacturing & process-safety mechanisms

Concrete, deployed, regulated controls with decades of operational evidence. Tags: [STANDARD] / [STUDY] / [PRACTITIONER].

Highlights

  • Stop-and-escalate on anomaly is the master primitive (Jidoka) [N-1, N-2, N-3]. Run autonomously on the happy path; the instant the agent detects it is outside its competence/confidence envelope, it should halt and surface the problem to a human rather than ship a defect downstream. Detect-and-stop is enough; don't trust the agent to autonomously self-recover.
  • Build quality in; don't inspect it in (Jidoka + Deming) [N-2, N-9]. You cannot review-your-way to a safe agent; oversight works best when execution trips the alarm at the moment of anomaly, not via end-of-line audit.
  • Make bad actions impossible, not merely discouraged (Poka-yoke) [N-6, N-7]. Prefer capability removal (least-privilege, sandbox, read-only, schema/type constraints) over denylists. A guardrail the agent can circumvent is policy; a poka-yoke removes the capability.
  • Use the mistake-proofing hierarchy to rank guardrails [N-7]: prevention (impossible) > detection/warning (flagged instantly) > human vigilance (weakest). Never rely on "the human will watch carefully" as the primary defense.
  • Universal, cheap, blameless interrupt authority (Andon) [N-4, N-5]. Every party — supervising human, monitor agent, guardrail, end user — must have a low-friction, always-available halt. If interrupting is costly or punished, it won't be used.
  • Separate signal-and-summon from hard-stop (Andon) [N-5]: an ambient "andon board" of agent status for awareness, plus a true blocking interrupt for hazards.
  • Intervene on calibrated signal, not every wobble (SPC) [N-8]. Trigger human entry when a monitored metric breaches calibrated control limits, set to balance over-reacting to noise against missing a real fault.
  • Over-intervention is its own failure mode — tampering (Deming) [N-9]. Correcting on every fluctuation degrades the system and burns the human's attention.
  • Gate the killer items at phase boundaries; keep gates short (Checklists) [N-10, N-11]. Structured pre-action verification at high-consequence transitions reduces harm — but only if brief (≈5–9 items). Bloated gates induce rubber-stamping.
  • A gate is only as good as the attention it commands [N-12]. The Ontario study: a mandated checklist with weak engagement delivered no significant benefit. This is the manufacturing/clinical proof of the codex's central finding.
  • Scoped, expiring, named authorization for privileged actions (Permit-to-work) [N-13, N-14]. Replace standing "the agent can do dangerous thing X" with narrow, time-boxed grants tied to a named approver, plus an explicit "hand-back" to a safe state.
  • Two truly-independent authorizers for irreversible actions [N-15, N-16]. Separation of duties; the proposer must not be the approver. Independence is everything — an agent re-reading its own work just confirms its bias; a double check is a backstop, never the primary control.
  • Positively disable, don't politely pause (Lockout-tagout) [N-17]. Revoke tokens / kill the session / drop to read-only as a structural lock held by the person doing the work — capability revocation (lockout) beats a warning (tagout), codified into U.S. law.
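
The stop-and-escalate primitive from the highlights above can be sketched as a small run loop. This is a minimal illustration, not a reference implementation: `StepResult`, `RunReport`, `run_with_jidoka`, the self-reported `confidence` score, and the `escalate` callback are all hypothetical names introduced here, and a real agent would need a better anomaly signal than a single threshold.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class StepResult:
    output: str
    confidence: float  # hypothetical self-reported score in [0, 1]


@dataclass
class RunReport:
    completed: List[str]
    halted_at: Optional[str] = None  # name of the step that tripped the stop
    reason: Optional[str] = None


def run_with_jidoka(
    steps: Iterable[Tuple[str, Callable[[], StepResult]]],
    min_confidence: float,
    escalate: Callable[[str, str], None],
) -> RunReport:
    """Run steps in order; on the first anomaly, stop and summon a human."""
    report = RunReport(completed=[])
    for name, step in steps:
        try:
            result = step()
        except Exception as exc:  # any failure counts as an anomaly
            report.halted_at, report.reason = name, f"error: {exc}"
            break
        if result.confidence < min_confidence:
            report.halted_at = name
            report.reason = (
                f"confidence {result.confidence:.2f} below {min_confidence:.2f}"
            )
            break
        report.completed.append(name)
    if report.halted_at is not None:
        # Detect and stop only: no retry and no self-repair.
        escalate(report.halted_at, report.reason or "")
    return report
```

Later steps never run once one step trips the stop, which is the point: the defect is not passed downstream, and recovery is left to the human who receives the escalation.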

Annotated bibliography

[N-1] Ohno, T. (1988). [practitioner] Toyota Production System: Beyond Large-Scale Production. Productivity Press (orig. Japanese 1978). — Defines jidoka (autonomation): a machine detects an abnormality, stops itself, and signals a human, so a defect is never passed downstream and one operator can supervise many machines. Traces to Sakichi Toyoda's 1924 auto-stopping loom. Relevance: the stop-and-escalate-on-anomaly primitive — full speed on the happy path, hard stop on anomaly.

[N-2] Toyota Motor Corporation. [practitioner/primary] Toyota Production System — Jidoka. global.toyota. (403 to fetcher; corroborated via [N-1], [N-3] — treated VERIFIED) — "Automation with a human touch": machines stop on a problem, making it immediately visible and building quality in rather than inspecting it in. Relevance: oversight is most effective when the agent's own execution trips the alarm, not when a human audits a finished batch.

[N-3] Autonomation. (encyclopedic synthesis of TPS literature). [reference] en.wikipedia.org/wiki/Autonomation. — The four-step jidoka cycle: detect → stop → fix the immediate condition → investigate root cause + install a countermeasure; Shingo's "detect-and-stop is ~90% of the benefit; self-repair is over-engineering." Relevance: a ready-made incident protocol for agents; argues against trusting an agent to autonomously recover from its own anomalies.

[N-4] Liker, J. K. (2004). [practitioner] The Toyota Way. McGraw-Hill (Principle 5: stop to fix problems). — The andon system: any worker has the authority and obligation to pull a cord that signals a problem and, if unresolved, stops the line — and is rewarded, not punished, for it. Relevance: a universal interrupt — every stakeholder needs a low-friction, blameless authority to halt the agent mid-run.

[N-5] Andon (manufacturing). (TPS/lean synthesis). [reference] en.wikipedia.org/wiki/Andon_(manufacturing). — Andon is the signaling element of jidoka (visual/audible alert, manual OR automatic), distinguishing fixed-position stop (window to fix) from immediate stop. Relevance: provide both signal-and-summon (ambient status board) and hard interrupt; humans AND automated monitors can raise the alarm.

[N-6] Shingo, S. (1986). [practitioner] Zero Quality Control: Source Inspection and the Poka-Yoke System. Productivity Press (orig. 1985). — Poka-yoke (mistake-proofing): source inspection + devices that physically prevent a mistake or make it instantly obvious; 100%-prevention-at-source supplants statistical sampling. Relevance: make bad agent actions structurally impossible (scoped credentials, schema/constrained decoding, dry-run preview) — poka-yoke > policy.

[N-7] Project Production Systems Lab (P2SL), UC Berkeley. [reference] Mistakeproofing. p2sl.berkeley.edu/mistakeproofing/. — The hierarchy: elimination/prevention > detection/warning > reliance on human vigilance (weakest). Relevance: a direct ranking for agent guardrails — prefer capability removal over runtime detection over "ask the human to watch carefully."

[N-8] Shewhart, W. A. (1931). [practitioner/foundational] Economic Control of Quality of Manufactured Product. Van Nostrand. — Invents the control chart + chance (common) vs assignable (special) cause; ±3σ limits balance over-reacting to noise vs. missing a real signal. Relevance: the foundational answer to when should a human intervene? — on a calibrated control-limit breach, not on every output and not never.
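
As a rough illustration of intervening only on a control-limit breach, the sketch below derives limits from a baseline of a monitored agent metric and flags values outside them. The function names and the example metric are assumptions made here; the sketch also uses the plain sample standard deviation, whereas production control charts often estimate spread differently (for example from moving ranges).

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def control_limits(baseline: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) limits at mean +/- k sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline observations")
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre + k * spread


def needs_human(value: float, limits: Tuple[float, float]) -> bool:
    """True only when the metric falls outside the calibrated limits."""
    lower, upper = limits
    return value < lower or value > upper


# Hypothetical baseline: per-run tool-call error rate during a stable period.
baseline_error_rates = [0.020, 0.022, 0.018, 0.021, 0.019]
limits = control_limits(baseline_error_rates)
```

With this baseline, a run at 0.023 stays inside the limits and is left alone, while a run at 0.031 breaches the upper limit and summons a reviewer.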

[N-9] Deming, W. E. (1986). [practitioner/foundational] Out of the Crisis. MIT Press. — Warns of tampering (over-reacting to common-cause noise makes a stable system worse) and "cease dependence on inspection." Relevance: over-intervention is a failure mode; you cannot inspect quality into an agent — engineer it in. Support for calibrated, not maximal, oversight.

[N-10] Haynes, A. B., et al. (2009). [study] A Surgical Safety Checklist to Reduce Morbidity and Mortality in a Global Population. NEJM 360(5), 491–499. PMID 19144931. — WHO 19-item checklist at three junctures (Sign In / Time Out / Sign Out) across 8 hospitals: death 1.5%→0.8%, complications 11.0%→7.0%. Relevance: the empirical case for structured pre-action verification at high-consequence transitions (a "time out" before deploy/delete/send/pay) — placed at phase boundaries, not everywhere.

[N-11] Gawande, A. (2009). [practitioner] The Checklist Manifesto. Metropolitan Books. — READ-DO vs DO-CONFIRM checklists; states the limits — short (≈5–9 items), one page, killer items only; bloated checklists fail. Relevance: concise pre-flight gates for agents, with the discipline of brevity (a gate crammed with low-value confirmations induces rubber-stamping).

[N-12] Urbach, D. R., et al. (2014). [study — counter-evidence] Introduction of Surgical Safety Checklists in Ontario, Canada. NEJM 370, 1029–1038. — Across 101 hospitals, a mandated rollout found no significant reduction — executed as box-ticking with thin engagement. Relevance: the most important caution here — a gate is only as good as the attention it commands; mandated, ritualized confirmation = theater. Manufacturing/clinical proof of the codex's central HITL finding.

[N-13] HSE, UK (2005). [standard] Guidance on Permit-to-Work Systems (HSG250). HSE Books. — A permit-to-work is a formal, documented, time-boxed authorization before high-hazard non-routine work, tied to a named authorizer with a formal hand-back. Relevance: the template for scoped, expiring, explicitly-authorized privilege elevation — safer than blanket bypassPermissions; hand-back = confirm the agent returned to a safe state.
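
A permit-to-work grant for an agent might look like the following sketch: scoped to named actions, time-boxed, tied to a named approver, and closed by an explicit hand-back. `Permit`, `issue_permit`, and the action strings are hypothetical names chosen for illustration; nothing here is taken from HSG250 itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Iterable


@dataclass
class Permit:
    approver: str             # named human who authorised the work
    actions: FrozenSet[str]   # the only actions this permit covers
    expires_at: datetime
    handed_back: bool = False

    def allows(self, action: str, now: datetime) -> bool:
        """Valid only for listed actions, before expiry, and before hand-back."""
        return (
            not self.handed_back
            and now < self.expires_at
            and action in self.actions
        )

    def hand_back(self, safe_state_confirmed: bool) -> None:
        """Close the permit; refuse to close if the safe state is unconfirmed."""
        if not safe_state_confirmed:
            raise RuntimeError("hand-back refused: safe state not confirmed")
        self.handed_back = True


def issue_permit(
    approver: str, actions: Iterable[str], minutes: int, now: datetime
) -> Permit:
    """Grant a narrow, expiring permit instead of a standing privilege."""
    return Permit(approver, frozenset(actions), now + timedelta(minutes=minutes))


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
permit = issue_permit("alice", {"db.migrate"}, minutes=30, now=now)
```

The permit covers `db.migrate` for thirty minutes and nothing else; after expiry or hand-back every check fails, so the privilege cannot quietly become permanent.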

[N-14] HSE. [standard] Permit to Work Systems (COMAH technical measures). hse.gov.uk/comah/sragtech/techmeaspermit.htm. — Reinforces PTW essentials and notes maintenance-related work (non-routine) is a leading source of major accidents. Relevance: an agent's off-script, "maintenance-mode" excursions deserve the most stringent authorization, not the least.

[N-15] U.S. Air Force (2020). [standard] AFI 91-104, Nuclear Surety — the Two-Person Concept. (AFI is primary; edition lightly UNVERIFIED) — Two-person rule / two-person integrity (TPI): critical actions require two independently authorized people; no-lone zones, split keys. Relevance: for the highest-consequence irreversible agent actions, require two independent authorizers (human + independent checker); the proposer should not be the approver (separation of duties).

[N-16] ISMP. [standard/practitioner] Independent Double Checks: Worth the Effort if Used Judiciously and Properly. ISMP/ECRI. — An independent double check catches ~33% more errors if genuinely independent — but ISMP warns against overuse and against it as a sole control (false assurance). Relevance: double-checking agent output works only when reserved for high-consequence actions and the second check is truly independent (different model/prompt/context, not the agent re-reading itself); it's a backstop, not the primary control.

[N-17] OSHA (1989). [standard] The Control of Hazardous Energy (Lockout/Tagout), 29 CFR 1910.147. — Before servicing, every hazardous energy source must be positively isolated and locked de-energized, keyed to the worker, so it cannot be re-energized while they're exposed; lockout (physical) preferred over tagout (a warning tag). ~120 fatalities / 50,000 injuries prevented yearly. Relevance: before humans/processes touch a system an agent controls (or before maintaining the agent), positively disable its ability to act (kill session, revoke tokens, read-only) as a structural lock — capability revocation (lockout) > warning (tagout), codified in law.

IV.O — Finance & security mechanisms (incl. runtime assurance)

Two domains with mature controls over fast autonomous action: markets (where algorithms already outpace humans) and security (which formalized scoped authority and the human's role). Includes Cranor (2008), a direct prior HITL framework.

Highlights

  • Pre-action admission control beats post-hoc review [O-2, O-3]. Gate every action against scoped limits before execution at machine speed — humans and after-the-fact logs can't keep up with an autonomous actor.
  • Two-party approval for high-grade actions (maker-checker) [O-1, O-10]. The proposer must not be the approver; split planning/approval/execution across principals.
  • Circuit breakers (auto-halt on threshold) + kill switches (decisive manual stop) [O-6, O-7]. Auto-pause and require re-authorization when guard bands trip; provide a one-command panic stop usable without diagnosing root cause.
  • Bound blast radius ("fat-finger" limits) and action rate (throttles) [O-3]. Cap per-action magnitude (max spend/deletions/recipients) and per-window action rate to break runaway loops.
  • Knight Capital is the canonical agentic cautionary tale [O-4]: a tiny config/dead-code error in a fast autonomous actor cost ~$440M in 45 minutes with no kill switch. Verify feature flags/dead code before deploy; stage rollouts; never deploy without an emergency stop.
  • Least privilege + capabilities + zero-trust = scoped, per-request agent permissions [O-10, O-11, O-13]. Grant only what the task needs, deny by default, mediate every action, re-authorize each step in context.
  • The confused deputy IS prompt injection [O-11, O-12]. An agent with ambient authority is weaponized by untrusted content flowing through the same channel as legitimate instructions. Defense = scoped request-bound capabilities + runtime mediation for irreversible actions regardless of instruction source + sub-agents not inheriting orchestrator authority.
  • No standing privilege; elevate just-in-time, break glass loudly [O-14]. Sensitive scopes granted time-boxed behind approval and revoked; emergency elevation is auditable, alarmed, post-hoc reviewed.
  • Runtime shields/Simplex = a verified monitor that vetoes the agent [O-17, O-18, O-19]. You can't formally verify an LLM, but you can verify a small guard; let the agent act inside a verified envelope and override on breach — a guarantee that holds even under injection, complementing the soft human gate.
  • Human gates are a design variable, and they fatigue [O-15, O-16]. Users click through badly-designed warnings at rates of up to ~70%; habituation is measurable in the brain by the second exposure. Minimize interrupts (auto-handle low-risk), and make the rest salient/distinct (polymorphic) to fight rubber-stamping.
  • Beware emergent multi-agent failure and mis-specified objectives [O-5]. The Flash Crash: individually "correct" fast algorithms (one optimizing volume only) produced systemic collapse. Constrain objectives explicitly; add system-level circuit breakers, not just per-agent ones.
  • Cranor (2008) is the direct precedent for a modern agentic HITL framework [O-9]: keep the human out of the loop where you can safely automate; where a human must stay, engineer the interaction (delivery, comprehension, capability, motivation) so they can actually succeed.
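
Several of the highlights above (pre-action admission control, fat-finger limits, throttles, and a manual kill switch) compose naturally into one gate that every proposed action must pass before it runs. The sketch below is one possible shape under stated assumptions: `AdmissionGate`, its fields, and the notion of a numeric "magnitude" per action are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class AdmissionGate:
    max_magnitude: float     # "fat-finger" cap per action (spend, deletions, ...)
    max_actions: int         # throttle: actions allowed per window
    window_seconds: float
    killed: bool = False     # manual kill switch
    _recent: Deque[float] = field(default_factory=deque)

    def kill(self) -> None:
        """Panic stop: reject everything, no diagnosis required."""
        self.killed = True

    def admit(self, magnitude: float, now: float) -> Tuple[bool, str]:
        """Decide before execution whether a proposed action may run."""
        if self.killed:
            return False, "kill switch engaged"
        if magnitude > self.max_magnitude:
            return False, "magnitude exceeds per-action limit"
        # Drop timestamps that have aged out of the rate window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions:
            return False, "rate limit exceeded"
        self._recent.append(now)
        return True, "admitted"


gate = AdmissionGate(max_magnitude=100.0, max_actions=2, window_seconds=60.0)
```

Rejected actions do not consume throttle budget in this sketch; whether they should is a design choice, since counting them would also slow down an agent that is probing the limits.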

Annotated bibliography

[O-1] COSO / internal-control literature. [standard] Segregation of Duties & the Four-Eyes (Maker-Checker) Principle. — No single party initiates, approves, records, and verifies the same activity; a second authorized party approves high-risk actions. Weakness: rubber-stamping and collusion. Relevance: two-party approval gating for high-grade/irreversible agent actions; an agent that drafts an action shouldn't also approve/execute it.

[O-2] U.S. SEC (2010). [standard/rule] Rule 15c3-5 — Market Access Rule. 17 CFR 240.15c3-5. — Requires automated pre-trade controls that prevent entry of orders exceeding thresholds or appearing erroneous, order-by-order, at machine speed. Relevance: pre-action admission control — screen every proposed agent action against scoped limits before execution; whoever grants tool access owns the guardrails.

[O-3] Futures Industry Association (2024). [standard] Best Practices for Automated Trading Risk Controls and System Safeguards. — Catalogs max order-size ("fat-finger") limits, price collars, message/execution throttles, cancel-on-disconnect — defense in depth at participant/broker/exchange levels. Relevance: template for bounding per-action magnitude, throttling action rate to stop runaway loops, sanity-checking outputs, and a one-command cancel-all.

[O-4] U.S. SEC (2013). [incident+enforcement] In re Knight Capital Americas LLC. (Aug 1, 2012 incident; $12M settlement.) — Dormant legacy code left active on one server sent ~4M unintended orders in ~45 min, ~$440M loss, no kill switch. Relevance: the textbook case for kill switches, blast-radius limits, staged rollout, and pre-deploy verification of dead code/feature flags on fast autonomous actors.

[O-5] U.S. SEC/CFTC (2010). [incident/investigation] Findings Regarding the Market Events of May 6, 2010 (Flash Crash). — A large automated sell program targeting volume only (not price or time) triggered a ~$1T transient collapse amplified by interacting HFTs. Relevance: warning about multi-agent emergent failure and mis-specified objectives; constrain objective functions and add system-level circuit breakers.

[O-6] U.S. SEC / NYSE. [standard/rule] NYSE Rule 80B (market-wide circuit breakers) & Limit Up–Limit Down. — Threshold-triggered automatic trading halts (7%/13%/20%) that inject a cool-down so humans can reassess. Relevance: auto-halt on anomaly/threshold (error rate, cost burn, output volume) + require human re-authorization to resume — a deliberate HITL re-entry point.
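
Translated to an agent, a circuit breaker of this kind could trip on an error-rate threshold and stay tripped until a named human re-authorizes it. The sketch below is illustrative only: `CircuitBreaker`, its thresholds, and the choice of error rate as the tripping metric are assumptions made here, not part of any market rule.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    max_error_rate: float   # trip threshold, e.g. 0.2 for 20% failures
    min_samples: int        # do not judge the rate on too few actions
    errors: int = 0
    total: int = 0
    tripped: bool = False

    def record(self, ok: bool) -> None:
        """Record one action outcome and trip if the error rate breaches."""
        self.total += 1
        if not ok:
            self.errors += 1
        if (
            self.total >= self.min_samples
            and self.errors / self.total > self.max_error_rate
        ):
            self.tripped = True

    def may_proceed(self) -> bool:
        """The agent checks this before every action."""
        return not self.tripped

    def reauthorize(self, approver: str) -> None:
        """Resume only on an explicit, named human decision; counters reset."""
        if not approver:
            raise ValueError("re-authorization requires a named approver")
        self.errors = 0
        self.total = 0
        self.tripped = False
```

The breaker never resets itself: once tripped, the only path back to `may_proceed()` returning true is `reauthorize`, which is the deliberate human re-entry point.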

[O-7] CME Group / SEC concept release. [standard/tool] Kill Switches for Algorithmic Trading. — A risk admin can immediately block all new orders and cancel working orders (CME blocks in <1s); regulators push for automatic limit-breach shut-offs. Relevance: the emergency stop/panic button for an agent fleet — both manual and automatic (limit-breach) variants needed; stop first, investigate later.

[O-8] U.S. SEC (2014). [standard/rule] Regulation Systems Compliance and Integrity (Reg SCI). 17 CFR Part 242. — Treats the automated system itself as a regulated risk surface: capacity/stress testing, dev-and-test methodology, resilience, and mandatory detection/corrective-action/notification for "SCI events." Relevance: governance template for agent reliability + incident response — HITL is institutional (process + reporting), not only the in-the-moment click.

[O-9] Cranor, L. F. (2008). [study/framework — KEY PRECEDENT] A Framework for Reasoning about the Human in the Loop. USENIX UPSEC '08. — A structured HITL framework: try to keep humans out of the loop where feasible; where necessary, engineer the system to support them. Adapts the C-HIP model into communication / impediments / human receiver / behavior, with six receiver attributes (delivery, processing/attention, comprehension, application, intentions/motivation, capabilities). Relevance: the closest pre-LLM analog — a modern agentic HITL framework is Cranor's framework generalized from a one-shot warning to continuous oversight of a fast, capable, potentially-hijackable autonomous actor; its six attributes map onto why agent approval gates fail.

[O-10] Saltzer, J. H. & Schroeder, M. D. (1975). [study/foundational] The Protection of Information in Computer Systems. Proc. IEEE 63(9), 1278–1308. — Least privilege, fail-safe defaults (default deny), complete mediation, economy of mechanism, separation of privilege. Relevance: the charter for scoped agent permissions — minimal scopes, deny-by-default, mediate every action, keep the guard simple, require two conditions for high-risk actions.

[O-11] Hardy, N. (1988). [study/foundational] The Confused Deputy. ACM SIGOPS OSR 22(4), 36–38. — A privileged program is tricked by a less-privileged caller into misusing its ambient authority; the fix is capability-based security (authority travels bound to the request). Relevance: the structural template for the prompt-injection problem — pass scoped, request-bound capabilities so injected instructions can't reach beyond the legitimate task.
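
The capability fix can be sketched in a few lines: the tool runner holds no ambient authority, and each call must present a capability bound to the legitimate request. `Capability`, `Deputy`, and the tool and resource names are hypothetical, and a real system would also need unforgeable tokens, which plain Python objects do not provide.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Capability:
    """Authority bound to one request: which tool, on which resources."""
    tool: str
    resources: FrozenSet[str]


class Deputy:
    """A tool runner that holds no ambient authority of its own."""

    def invoke(self, cap: Capability, tool: str, resource: str) -> str:
        # The check depends only on the capability, never on who asked
        # or on what the instruction text claims.
        if tool != cap.tool or resource not in cap.resources:
            raise PermissionError(
                f"{tool} on {resource} is outside the granted capability"
            )
        return f"{tool}:{resource}"


# Capability minted for the user's actual task: read one document.
cap = Capability(tool="read_file", resources=frozenset({"docs/report.md"}))
deputy = Deputy()
```

If text injected into `docs/report.md` tells the agent to email the file elsewhere, the resulting `send_email` call presents the same read-only capability and is refused, whatever the instruction says.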

[O-12] Cloud Security Alliance (2024/2025). [study/analysis] Confused Deputy Attacks on Autonomous AI Agents (Prompt Injection). — Makes the Hardy→LLM mapping explicit: because operator instructions and untrusted retrieved content share one inference pathway, an attacker who can write to an inbox/issue executes with the operator's authority. Mitigations: least-privilege scoped credentials, runtime admission control for irreversible actions regardless of source, sub-agents not auto-inheriting authority. Relevance: the direct modern statement of this cluster's thesis.

[O-13] NIST (2020). [standard] Zero Trust Architecture. SP 800-207. — No implicit trust by location/prior auth; per-request, per-session decisions on identity/posture/context, least privilege granted dynamically, continuously re-verified. Relevance: treat each agent action as an untrusted request authorized in context — a policy engine evaluates every tool call; a hijacked/drifting agent is re-checked at every step.

[O-14] CyberArk/BeyondTrust et al. [standard/practice] Break-Glass, Just-in-Time Access, Zero Standing Privilege. — JIT grants elevated privilege only at need, time-boxed, behind approval (zero standing privilege); break-glass is auditable, alarmed, post-hoc reviewed emergency elevation. Relevance: agents run low-privilege by default; sensitive scopes granted JIT and revoked; break-glass for genuine emergencies, loudly logged.

[O-15] Akhawe, D. & Felt, A. P. (2013). [study] Alice in Warningland: A Large-Scale Field Study of Browser Security Warning Effectiveness. USENIX Security '13. — Across >25M impressions, clickthrough (ignore) rates varied widely (up to ~70% for Chrome SSL) — warnings aren't uniformly futile; design materially changes compliance. Relevance: the human gate's effectiveness is a design variable — gating everything trains bypass; reserve and design interrupts for what matters.

[O-16] Anderson, B. B., Vance, A., Kirwan, C. B., et al. (2015/2018). [study] How Polymorphic Warnings Reduce Habituation in the Brain (CHI '15); Tuning Out Security Warnings (MIS Quarterly 2018). — fMRI/eye-tracking/field evidence of a sharp drop in visual processing by the second exposure to a repeated warning; polymorphic (varying) warnings resist habituation. Relevance: over-prompting the supervisor is neurologically tuned out — minimize gates, and make high-stakes prompts salient/distinct.

[O-17] Sha, L. (2001). [study/architecture] Using Simplicity to Control Complexity (the Simplex Architecture). IEEE Software 18(4), 20–28. — Pairs a complex unverifiable controller with a simple, formally verified safe controller + monitor that switches control when the state nears the safety boundary. Relevance: the reference pattern for runtime assurance — let the LLM act freely inside a verified envelope; a simple trusted monitor takes over on breach.

[O-18] Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., & Topcu, U. (2018). [study] Safe Reinforcement Learning via Shielding. AAAI-18. — A correct-by-construction shield from a formal safety spec intercepts each action and overrides only those that would violate safety (preemptive or post-posed). Relevance: a verified shield that vetoes the agent — encode hard constraints (never delete prod, never send funds >X) enforced regardless of what the LLM "decides," including under injection.

[O-19] Mehmood, U., et al. (2021). [study — extends O-17; author list lightly UNVERIFIED] The Black-Box Simplex Architecture for Runtime Assurance of Autonomous CPS. NASA Formal Methods / arXiv:2102.12981. — Extends Simplex to opaque/neural controllers and even black-box safe controllers via a switching monitor. Relevance: confirms the runtime-assurance pattern survives when both actor and parts of the guard are opaque — wrap an LLM agent in a switching monitor with a "safe mode" (halt + escalate) on predicted breach.
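
The shield and Simplex entries above share one runtime shape: a small trusted check sits between the agent's proposed action and execution, and substitutes a safe action on breach. The sketch below shows only that shape. It is a hand-written runtime check, not a formally verified or synthesized shield as in [O-17] and [O-18], and `Action`, `shield`, `SAFE_ACTION`, and both constraints are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    tool: str
    target: str
    amount: float = 0.0


# The fallback the monitor switches to: halt and hand control to a human.
SAFE_ACTION = Action(tool="halt_and_escalate", target="human")

Constraint = Callable[[Action], bool]  # returns True when the action is safe


def shield(proposed: Action, constraints: List[Constraint]) -> Action:
    """Pass the proposed action through unless any constraint is violated."""
    if all(check(proposed) for check in constraints):
        return proposed
    return SAFE_ACTION


def never_delete_prod(action: Action) -> bool:
    """Hard constraint: deletions may not target anything under prod."""
    return not (action.tool == "delete" and action.target.startswith("prod"))


def transfer_cap(limit: float) -> Constraint:
    """Hard constraint factory: transfers may not exceed the given limit."""
    return lambda action: not (action.tool == "transfer" and action.amount > limit)


constraints = [never_delete_prod, transfer_cap(1000.0)]
```

Because the constraints inspect only the proposed action, they hold regardless of how the agent was persuaded to propose it, including by injected content.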


Part V — The Human and the Loop Episode

The human side, and the design of a single oversight moment.

V.P — Cognition of advice, judgment & decision fatigue

The human-cognition side of HITL: how people receive advice, decide under pressure, and whether decisions degrade with fatigue. Contested results flagged explicitly.

Highlights

  • Two opposite failure modes coexist. Classic advice research shows egocentric discounting — humans under-weight good advice [P-2, P-3] — while the AI-advice literature shows both algorithm appreciation (over-crediting an "AI" label, [P-20]) and algorithm aversion (over-abandoning after one visible error, [P-21]). HITL must calibrate against both under- and over-reliance.
  • Anchoring on the agent's suggestion is the central, most robust risk [P-9]. A suggested answer pulls the human's judgment toward it with insufficient adjustment, even when wrong; confirmation bias [P-11] then steers the evidence search toward confirming it.
  • "Decide first, then reveal" is evidence-backed [P-4]. Independent pre-advice judgment yields better, less-deferential decisions than seeing the suggestion first.
  • The most corrective advice is the most likely to be rejected (distance effect) [P-3]. Humans discount advice that diverges sharply from their prior — exactly the contrarian agent suggestions that may be most valuable; experts discount most.
  • Trust is fragile and asymmetric [P-2, P-21]. Reputation forms fast and is lost more easily than gained; people drop algorithms faster than humans after identical errors. Error transparency, recovery, and letting users adjust outputs keep reliance from collapsing.
  • Reliance is contaminated by irrelevant factors — incidental mood [P-5] and the cost/effort of obtaining advice [P-6] move advice-weighting independently of quality.
  • Expertise changes everything; both human and AI intuition have boundary conditions [P-7, P-8]. Experts pattern-match (RPD); intuition is trustworthy only in regular environments with good feedback — a two-condition test for whom to weight.
  • Review quality is bounded by working memory [P-17]. Dense, poorly-presented agent output produces poor oversight regardless of intent — favor concise rationales, diffs, confidence cues.
  • Decision fatigue should constrain HITL only weakly and provisionally. The mechanism (ego depletion) failed a 23-lab preregistered replication (d≈0.04, [P-13], undercutting [P-12]); the iconic parole "hungry judge" study [P-14] is contested by an ordering confound [P-15] and a time-allocation artifact [P-16]. Treat "limit decisions per human" as a testable heuristic, not a law.
  • Choice overload is similarly over-claimed — the jam study [P-18] is famous but a 50-study meta-analysis found a mean effect near zero [P-19]. Don't assume "fewer AI options = better."
  • The dual-process frame is sound but cite per-study carefully [P-9, P-10] — System 1/2 and core biases hold, but some studies popularized in Thinking, Fast and Slow (priming) didn't replicate.
  • Net stance: build for calibrated reliance — force a provisional human judgment before revealing AI output (anti-anchoring), surface reasoning (counter discounting + confirmation bias), make errors/confidence transparent (manage fragile trust), minimize reviewer load, and treat fatigue/overload limits as hypotheses, not settled science.
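
The "decide first, then reveal" workflow in the net stance can be enforced structurally rather than by instruction. The sketch below is one possible forcing function, with hypothetical names (`ReviewEpisode`, `record_provisional`, `reveal`, `disagrees`); it withholds the agent's suggestion until the reviewer has committed a provisional judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewEpisode:
    agent_suggestion: str
    human_provisional: Optional[str] = None

    def record_provisional(self, judgment: str) -> None:
        """Commit the reviewer's own answer before any AI output is shown."""
        if not judgment.strip():
            raise ValueError("provisional judgment must not be empty")
        if self.human_provisional is not None:
            raise RuntimeError("provisional judgment already recorded")
        self.human_provisional = judgment

    def reveal(self) -> str:
        """Release the agent's suggestion only after the human has decided."""
        if self.human_provisional is None:
            raise PermissionError("record a provisional judgment before revealing")
        return self.agent_suggestion

    def disagrees(self) -> bool:
        """Flag divergence so the reviewer examines it rather than defers."""
        suggestion = self.reveal().strip().lower()
        return suggestion != (self.human_provisional or "").strip().lower()
```

Making the provisional judgment write-once matters: if the reviewer could revise it after the reveal, the recorded "independent" answer would itself drift toward the anchor.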

Annotated bibliography

[P-1] Bonaccio, S., & Dalal, R. S. (2006). [review] Advice taking and decision-making: An integrative literature review… OBHDP 101(2), 127–151. DOI: 10.1016/j.obhdp.2006.07.001. — The canonical synthesis around the Judge–Advisor System (JAS): advice generally improves accuracy but is systematically under-weighted; design factors (pre-advice decision, #advisors, cost, solicited?) change utilization. Relevance: an agent is structurally an "advisor" in a JAS — the master frame for predicting (under-)weighting of its output.

[P-2] Yaniv, I., & Kleinberger, E. (2000). [empirical] Advice taking in decision making: Egocentric discounting and reputation formation. OBHDP 83(2), 260–281. DOI: 10.1006/obhd.2000.2909. — Egocentric discounting (~70/30 toward self) because judges have privileged access to their own reasons; reputation forms fast and is lost more easily than gained. Relevance: default human behavior under-uses good agent advice; a single visible error collapses trust disproportionately.
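
To make the ~70/30 figure concrete, the advice-taking literature commonly summarizes utilization as a weight of advice: the fraction of the distance from the judge's initial estimate to the advice that the final estimate actually travels. The sketch below assumes that common definition rather than the exact measure used in [P-2]; the function name and the numbers are illustrative.

```python
def weight_of_advice(initial: float, advice: float, final: float) -> float:
    """Fraction of the initial-to-advice gap covered by the final estimate.

    0.0 means the advice was ignored, 1.0 means it was fully adopted,
    and 0.5 means the judge averaged the two.
    """
    if advice == initial:
        raise ValueError("undefined when advice equals the initial estimate")
    return (final - initial) / (advice - initial)


# Hypothetical episode: the human first estimates 100, the agent suggests 200,
# and the human settles on 130, moving 30% of the way toward the advice.
woa = weight_of_advice(initial=100.0, advice=200.0, final=130.0)
```

Logged per review episode, this gives a direct reliance metric: values persistently near 0 suggest discounting of the agent, and values persistently near 1 suggest deference or anchoring.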

[P-3] Yaniv, I. (2004). [empirical] Receiving other people's advice: Influence and benefit. OBHDP 93(1), 1–13. DOI: 10.1016/j.obhdp.2003.08.002. — The distance effect: advice is discounted more the further from the judge's prior; knowledgeable judges discount more. Relevance: contrarian agent suggestions are ignored precisely when most corrective; experts discount agent advice most.

[P-4] Sniezek, J. A., & Buckley, T. (1995). [empirical] Cueing and cognitive conflict in Judge-Advisor decision making. OBHDP 62(2), 159–174. DOI: 10.1006/obhd.1995.1040. — Judges who formed an independent tentative choice before advice performed best; "dependent" judges worst. Relevance: "decide first, then reveal the AI suggestion" is an evidence-backed anti-anchoring workflow.

[P-5] Gino, F., & Schweitzer, M. E. (2008). [empirical] Blinded by anger or feeling the love: How emotions influence advice taking. J. Applied Psychology 93(5), 1165–1173. DOI: 10.1037/0021-9010.93.5.1165. — Incidental gratitude raises advice uptake; anger lowers it. Relevance: reliance on an agent is partly a function of mood at the moment, not just agent quality.

[P-6] Gino, F. (2008). [empirical] Do we listen to advice just because we paid for it? OBHDP 107(2), 234–245. DOI: 10.1016/j.obhdp.2008.03.001. — People weight advice more when they paid for it, quality held constant (sunk-cost flavored). Relevance: the cost/effort of obtaining an agent's output can inflate reliance independent of accuracy.

[P-7] Klein, G. (1998). [theory+field] Sources of Power: How People Make Decisions. MIT Press. ISBN 9780262611466. — The Recognition-Primed Decision model: experts under time pressure recognize a situation and retrieve a workable action, mentally simulating it, rather than comparing options. Relevance: an AI suggestion that interrupts a fast expert recognition may be resisted; the same suggestion may anchor a pattern-less novice. Design for who's in the loop.

[P-8] Kahneman, D., & Klein, G. (2009). [theory; adversarial collaboration] Conditions for intuitive expertise: A failure to disagree. American Psychologist 64(6), 515–526. DOI: 10.1037/a0016755. — Intuitive expertise is trustworthy only when (a) the environment has valid cues and (b) there was practice with rapid, unambiguous feedback. Relevance: a calibration test for both human and model — neither should be trusted where cue validity is low or feedback poor.

[P-9] Tversky, A., & Kahneman, D. (1974). [empirical/theory] Judgment under uncertainty: Heuristics and biases. Science 185(4157), 1124–1131. DOI: 10.1126/science.185.4157.1124. — Representativeness, availability, and anchoring-and-adjustment (estimates pulled toward an initial anchor, even arbitrary). Relevance: an agent's suggested answer is an anchor; humans under-adjust away from it even when wrong — the mechanistic core of "anchoring on an AI suggestion."

[P-10] Kahneman, D. (2011). [theory/synthesis] Thinking, Fast and Slow. FSG. ISBN 9780374275631. — System 1 (fast, intuitive, bias-prone) vs System 2 (slow, effortful). Reviewing an agent's output is a System-2 task humans skip, defaulting to System-1 acceptance. Caveat: some studies in the book (priming) did not replicate — use as framework, verify per claim. Relevance: without a forcing function, humans rubber-stamp plausible-looking output.

[P-11] Nickerson, R. S. (1998). [review] Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology 2(2), 175–220. DOI: 10.1037/1089-2680.2.2.175. — People seek/interpret/remember evidence favoring a belief in hand. Relevance: once an agent proposes an answer, the human looks for confirming rather than disconfirming evidence — degrading the very "review" HITL relies on.

[P-12] Baumeister, R. F., Bratslavsky, E., Muraven, M., & Tice, D. M. (1998). [empirical — CONTESTED] Ego depletion: Is the active self a limited resource? JPSP 74(5), 1252–1265. DOI: 10.1037/0022-3514.74.5.1252. — Proposed self-control draws on a single depletable resource (radish-vs-chocolate). Robustness: seriously contested — see [P-13]. Relevance: do not justify "limit decisions per human" purely on ego depletion; the foundation is weak.

[P-13] Hagger, M. S., Chatzisarantis, N. L. D., et al. (2016). [registered replication — NULL] A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science 11(4), 546–573. DOI: 10.1177/1745691616652873. — 23 labs, N=2,141; pooled effect essentially null (d≈0.04). Relevance: the high-quality robustness evidence, and it's negative — be skeptical of designs premised on a simple depletion mechanism.

[P-14] Danziger, S., Levav, J., & Avnaim-Pesso, L. (2011). [empirical — CONTESTED] Extraneous factors in judicial decisions. PNAS 108(17), 6889–6892. DOI: 10.1073/pnas.1018033108. — 1,112 parole rulings: within each session the share of favorable rulings fell from ~65% to near 0%, then reset to ~65% after a food break — the "hungry judge" result. Robustness: contested ([P-15], [P-16]). Relevance: the strongest narrative for limiting decision load — cite with its critiques.

[P-15] Weinshall-Margel, K., & Shapard, J. (2011). [critique] Overlooked factors in the analysis of parole decisions. PNAS 108(42), E833. DOI: 10.1073/pnas.1110910108. — Case ordering is not random (unrepresented prisoners, who win less, tend to be last) — composition, not fatigue, could drive the decline. Relevance: rule out ordering/selection confounds before attributing degraded review to fatigue.

[P-16] Glöckner, A. (2016). [critique/simulation] The irrational hungry judge effect revisited… Judgment and Decision Making 11(6), 601–610. DOI: 10.1017/S1930297500004873. — Simulations show the decline can be largely a statistical artifact of favorable rulings taking longer + avoiding long cases before breaks. Relevance: "decision fatigue" effects in field data may be inflated by data-censoring artifacts — a plausible-but-unproven design heuristic.

[P-17] Sweller, J. (1988). [theory+empirical] Cognitive load during problem solving: Effects on learning. Cognitive Science 12(2), 257–285. DOI: 10.1207/s15516709cog1202_4. — Working memory is capacity-limited; high extraneous load degrades performance. Relevance: a human reviewing dense, poorly-presented agent output under load reviews it badly — surface concise rationales, diffs, confidence cues.

[P-18] Iyengar, S. S., & Lepper, M. R. (2000). [empirical — CONTESTED] When choice is demotivating… JPSP 79(6), 995–1006. DOI: 10.1037/0022-3514.79.6.995. — The "jam study": 24 options → far less purchase than 6. Robustness: contested ([P-19]). Relevance: offering too many agent-generated options might reduce decision quality — but not reliably established.

[P-19] Scheibehenne, B., Greifeneder, R., & Todd, P. M. (2010). [meta-analysis] Can there ever be too many options? J. Consumer Research 37(3), 409–425. DOI: 10.1086/651235. — 50 experiments, N≈5,036: mean choice-overload effect ≈ zero, no reliable moderators. Relevance: "fewer options is better" is not a dependable law — tune empirically per task.

[P-20] Logg, J. M., Minson, J. A., & Moore, D. A. (2019). [empirical] Algorithm appreciation: People prefer algorithmic to human judgment. OBHDP 151, 90–103. DOI: 10.1016/j.obhdp.2018.12.005. (cross-ref [C-16]) — People weight identical advice more when told it's algorithmic — but appreciation shrinks vs. their own judgment and among experts. Relevance: the "AI" label itself shifts reliance upward (partly flipping egocentric discounting); pairs with automation-bias concerns.

[P-21] Dietvorst, B. J., Simmons, J. P., & Massey, C. (2015). [empirical] Algorithm aversion: People erroneously avoid algorithms after seeing them err. JEP: General 144(1), 114–126. DOI: 10.1037/xge0000033. — After seeing an algorithm err, people abandon it faster than a human making the same error, even when it's superior; letting people slightly adjust it restores use (2018 follow-up). Relevance: trust in an agent is fragile/asymmetric — error transparency and user adjustment control keep reliance calibrated rather than collapsing.

V.Q — The anatomy of the loop episode + choice architecture

The design of a single oversight moment — what it must contain to be effective. Organized around five necessary aspects + the choice-architecture layer that frames them. This cluster directly answers "what are the necessary aspects of the loop itself?"

Highlights

  • Interrupting the human is never free [Q-23, Q-24]. Even when output quality holds, interruptions raise stress, error, and anxiety; timing (defer to a task breakpoint) matters as much as content [Q-28]. Treat each oversight prompt as a priced action against an attention budget [Q-25, Q-26, Q-29].
  • Low-precision alerting destroys oversight [Q-30]. Humans reflexively dismiss all alerts — including critical ones — when precision is low, which makes this the dominant real-world failure mode of "human in the loop." HITL prompts must therefore be high-precision and role-targeted.
  • Humans systematically under-scrutinize confident automation [Q-17, Q-18]. A reviewer in the loop does not reliably catch errors by default — detection must be engineered.
  • Detection-useful explanations are contrastive and demand-driven [Q-15, Q-16]. "Why / why-not" help most, and demand spikes when the system looks wrong; generic "show your work" framed to persuade can increase over-reliance.
  • Uncertainty is a stronger detection signal than explanation [Q-20, Q-21] — but only if calibrated and carefully framed; raw confidence is easily misread. Low confidence is a natural escalation trigger.
  • Friction can backfire (sludge) or help (microboundaries / cognitive forcing) [Q-19, Q-33, Q-35]. Indiscriminate friction causes approval fatigue; targeted friction before irreversible actions reduces rubber-stamping. Audit friction against the stakes it gates [Q-34].
  • The default dominates the decision [Q-31, Q-32]. The safe/reversible option — never auto-approve-on-timeout — must be the default in any approval gate.
  • Consequence preview is feedforward [Q-2, Q-3]. "What happens if I approve" is a named HCI principle (feedforward across the Gulf of Execution); reversibility/undo is a first-class safety affordance [Q-4]. Agents demonstrably propose irreversible harmful actions [Q-5], so pre-execution preview is necessary.
  • The decision-point human is "out of the loop" [Q-10]. Being pulled in only at the moment of decision degrades situation awareness; the provenance/context trail restores SA [Q-9] and lets them reconstruct the situation (sensemaking) [Q-7, Q-8].
  • Provenance should be structured, not narrative [Q-12]. Separate data lineage from interaction history and decision rationale; PROV gives a ready schema (entities/activities/agents) for an auditable trail [Q-11, Q-13].
  • An approval prompt IS choice architecture — there is no neutral UI [Q-31]. The same levers (defaults, friction, framing) are weaponized as dark patterns; the dark-patterns corpus is a precise anti-pattern checklist (no confirmshaming "reject," false urgency, obstruction of deny/undo) [Q-36, Q-37, Q-38, Q-39].
  • Communicate stakes trustworthily, not persuasively [Q-6] — the oversight moment's goal is genuine understanding, the inverse of manipulation.
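
The attention-budget idea in the first highlight can be read as an expected-cost comparison: interrupt now only if waiting for the next task breakpoint would cost more than the interruption itself. The sketch below is a minimal illustration under stated assumptions, not a method taken from the cited papers. The function name `should_interrupt_now`, the cost units, and the prompt budget are all hypothetical.

```python
def should_interrupt_now(cost_of_delay_per_min: float,
                         minutes_to_breakpoint: float,
                         interruption_cost: float,
                         prompts_left_in_budget: int) -> bool:
    """Decide whether to raise an oversight prompt now or defer it.

    All inputs are assumed estimates in the same arbitrary cost unit:
    - cost_of_delay_per_min: cost of keeping the agent blocked for a minute
    - minutes_to_breakpoint: predicted time until the human's next task boundary
    - interruption_cost: estimated cost of breaking the human's focus now
    - prompts_left_in_budget: remaining prompts in the current attention budget
    """
    if prompts_left_in_budget <= 0:
        # Budget spent: hold the ask and batch it at the next breakpoint.
        return False
    cost_of_waiting = cost_of_delay_per_min * minutes_to_breakpoint
    return cost_of_waiting > interruption_cost
```

A real system would have to estimate these quantities (for example from sensed activity, as in [Q-25, Q-26, Q-27]) and would likely need an override path for urgent safety-critical asks. Neither is modeled here.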

THE ANATOMY OF A WELL-DESIGNED LOOP EPISODE

An ordered checklist of the elements a single oversight moment should contain (each tagged with supporting sources). This is the actionable core of the cluster.

  1. Decide whether to interrupt at all (attention gate). Estimate the expected cost of interrupting this human now; defer non-urgent asks to a task breakpoint; batch; budget alert volume; keep precision high. → [Q-23, Q-24, Q-25, Q-26, Q-27, Q-28, Q-29, Q-30]
  2. State the request clearly (decision legibility). What exactly is being asked — scoped, unambiguous, in the human's frame. → [Q-1, Q-2, Q-14]
  3. Show the consequences before acting (feedforward). Preview the concrete effects of each option, including downstream side effects. → [Q-2, Q-3, Q-6]
  4. Make reversibility explicit. State whether it's undoable; provide the undo; flag clearly when it's not (irreversible → stronger gate). → [Q-4, Q-5]
  5. Surface uncertainty / confidence — calibrated and well-framed. Low confidence as an escalation/scrutiny trigger; frame numbers so they aren't misread. → [Q-6, Q-20, Q-21]
  6. Provide the provenance trail — "how did this get to me?" The upstream context that restores SA: what the agent perceived, inferred, considered, why it escalated, the steps that led here — structured (lineage vs. interaction history vs. rationale), ideally a queryable provenance graph. Counters the out-of-the-loop problem. → [Q-7, Q-8, Q-9, Q-10, Q-11, Q-12, Q-13]
  7. Build in error-detection affordances (support checking, not trusting). Contrastive "why/why-not" + diffs framed for verification; surface disagreement, counter-evidence, the agent's own doubts; progressive disclosure (summary first, full plan one click away). Counteract automation bias deliberately. → [Q-14, Q-15, Q-16, Q-17, Q-18, Q-20, Q-22]
  8. Add proportionate friction — a microboundary, not sludge. A beat of reflection (decide-first / cognitive forcing) only before consequential, irreversible actions; audit that the friction is warranted by the stakes. → [Q-19, Q-33, Q-34, Q-35]
  9. Use bias-safe choice architecture — no dark patterns. Safe/reversible option as default (never auto-approve-on-timeout); no manipulative framing — no confirmshaming the reject, false urgency/scarcity, obstruction of deny/undo, or preselected high-consequence approvals. → [Q-31, Q-32, Q-36, Q-37, Q-38, Q-39]
  10. Preserve accountability after the act. Keep the human the accountable initiator (internal locus of control); log the decision into the provenance record. → [Q-4, Q-11, Q-13, Q-14]
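
Several checklist items map naturally onto a small data structure. The sketch below is one possible encoding, assuming Python 3.9+; the class `LoopEpisode`, its field names, and the 0.6 confidence threshold are hypothetical choices for illustration, not part of any cited framework. It covers items 2 to 6 as fields, item 8 as a friction check, and item 9 as a timeout-safe default.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"


@dataclass
class LoopEpisode:
    request: str               # item 2: what exactly is being asked
    consequences: list[str]    # item 3: previewed effects of approving
    reversible: bool           # item 4: can the action be undone?
    confidence: float          # item 5: assumed calibrated, in [0, 1]
    provenance: list[str] = field(default_factory=list)  # item 6: steps that led here

    def missing_elements(self) -> list[str]:
        """List absent checklist elements so an incomplete prompt is not shown."""
        missing = []
        if not self.request.strip():
            missing.append("request")
        if not self.consequences:
            missing.append("consequences")
        if not self.provenance:
            missing.append("provenance")
        return missing

    def needs_microboundary(self) -> bool:
        """Item 8: reserve extra friction for irreversible actions."""
        return not self.reversible

    def escalate_for_scrutiny(self, threshold: float = 0.6) -> bool:
        """Item 5: treat low confidence as a scrutiny trigger (threshold is illustrative)."""
        return self.confidence < threshold


def resolve(human_answer: Optional[Decision]) -> Decision:
    """Item 9: silence or timeout maps to the safe option, never to approval."""
    return human_answer if human_answer is not None else Decision.REJECT
```

Treating a missing answer as `REJECT` assumes that rejection is the safe, reversible outcome for the action being gated; where that does not hold, the default would need to be chosen per action.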

Annotated bibliography

Tags: [empirical] / [design-guidance] / [theory]. Aspect tags: request / consequence / provenance / detection / attention / choice-arch.

[Q-1] Amershi, S., Weld, D., Vorvoreanu, M., et al. (2019). [design-guidance, validated] Guidelines for Human-AI Interaction. CHI '19. DOI: 10.1145/3290605.3300233. (cross-ref [G-7]) — 18 validated guidelines incl. the request-clarity primitives G1 (make clear what the system can do), G10 (scope services when in doubt), and G11 (make clear why the system did what it did). Aspect: request, detection.

[Q-2] Norman, D. A. (2013). [theory] The Design of Everyday Things (rev. ed.). Basic Books. (cross-ref [G-11]) — Gulf of Execution (intention → action) and Gulf of Evaluation (state → understanding). Aspect: request, consequence — legibility + visible effects narrow both gulfs around the approve/reject moment.

[Q-3] Vermeulen, J., Luyten, K., van den Hoven, E., & Coninx, K. (2013). [theory+design-guidance] Crossing the Bridge over Norman's Gulf of Execution: Revealing Feedforward's True Identity. CHI '13, 1931–1940. DOI: 10.1145/2470654.2466255. — Establishes feedforward: information shown before an action telling the user its result. Aspect: consequence — a consequence preview in an approval gate is feedforward.

[Q-4] Shneiderman, B., Plaisant, C., et al. (2016). [design-guidance] Designing the User Interface (6th ed.) — Eight Golden Rules. Pearson. — "Permit easy reversal of actions" (undo as core safety affordance) and "support internal locus of control." Aspect: consequence, choice-arch — reversibility + keeping the human the initiator.

[Q-5] Ruan, Y., Dong, H., Wang, A., et al. (2024). [empirical] Identifying the Risks of LM Agents with an LM-Emulated Sandbox (ToolEmu). ICLR 2024; arXiv:2309.15817. — An emulated sandbox stress-tests agents across 36 high-stakes toolkits; even the safest agent fails ~23.9% of the time, often with irreversible actions. Aspect: consequence, detection — empirical evidence that pre-execution preview and gates are needed.

[Q-6] Spiegelhalter, D. (2017). [design-guidance] Risk and Uncertainty Communication. Annual Review of Statistics and Its Application 4, 31–60. DOI: 10.1146/annurev-statistics-010814-020148. — Evidence-based principles for communicating risk/uncertainty honestly (frequencies, graphics, second-order uncertainty) — trustworthy over persuasive. Aspect: consequence, choice-arch.

[Q-7] Pirolli, P., & Card, S. (2005). [theory] The Sensemaking Process and Leverage Points for Analyst Technology… Proc. Int'l Conf. on Intelligence Analysis. (no DOI; pagination approx.) — The canonical sensemaking loop (foraging + sensemaking; shoebox → evidence → schema → hypothesis). Aspect: provenance — a blueprint for the "how did this get to me" trail.

[Q-8] Klein, G., Moon, B., & Hoffman, R. R. (2006). [theory] Making Sense of Sensemaking 2: A Macrocognitive Model. IEEE Intelligent Systems 21(5), 88–92. DOI: 10.1109/MIS.2006.100. — The data-frame theory: data fit a frame while the frame determines relevant data, in a reciprocal cycle. Aspect: provenance, detection — an escalation is a "broken frame" moment; surface data and frame so the human can re-anchor.

[Q-9] Endsley, M. R. (1995). [theory] Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors 37(1), 32–64. (cross-ref [E-12]) — SA in three levels: perception (L1), comprehension (L2), projection (L3). Aspect: provenance — the handoff display must restore the overseer's SA at all three levels before they decide.

[Q-10] Endsley, M. R., & Kiris, E. O. (1995). [empirical] The Out-of-the-Loop Performance Problem and Level of Control in Automation. Human Factors 37(2), 381–394. (cross-ref [E-13], [H-3]) — Full automation leaves operators out-of-the-loop with degraded SA/slower takeover. Aspect: provenance, attention — the escalation trail is the mechanism to re-establish SA enough to intervene.

[Q-11] Moreau, L., Groth, P., Miles, S., et al. (2008). [theory+design-guidance] The Provenance of Electronic Data. CACM 51(4), 52–58. DOI: 10.1145/1330311.1330323. — Documented data history confers authority/interpretability; agenda for provenance-aware applications. Aspect: provenance — the conceptual foundation for an interrogable audit trail.

[Q-12] Ragan, E. D., Endert, A., Sanyal, J., & Chen, J. (2016). [design-guidance] Characterizing Provenance in Visualization and Data Analysis… IEEE TVCG 22(1), 31–40. DOI: 10.1109/TVCG.2015.2467551. — Organizes provenance by type (data, visualization, interaction, insight, rationale) and purpose. Aspect: provenance, detection — a design checklist distinguishing data lineage from the agent's interaction history and rationale.

[Q-13] Moreau, L., & Groth, P. (2013). [design-guidance+theory] Provenance: An Introduction to PROV. Morgan & Claypool. (Anchor: W3C PROV-DM Recommendation.) — The W3C PROV standard: Entities/Activities/Agents + relations forming queryable provenance graphs. Aspect: provenance — a ready schema to model an agent's decision trail.

[Q-14] Bellotti, V., & Edwards, W. K. (2001). [theory] Intelligibility and Accountability: Human Considerations in Context-Aware Systems. HCI 16(2–4), 193–212. (cross-ref [G-17]) — Systems acting on inferred context must be intelligible + keep users accountable. Aspect: detection, request — expose reasoning + proposed action so a human can detect and contest before commit.

[Q-15] Lim, B. Y., & Dey, A. K. (2009). [empirical] Assessing Demand for Intelligibility in Context-Aware Applications. UbiComp '09, 195–204. DOI: 10.1145/1620545.1620576. — Demand for explanations (esp. why / why-not) spikes when the system behaves unexpectedly or fails. Aspect: detection — which explanation types matter at error moments.

[Q-16] Lim, B. Y., Dey, A. K., & Avrahami, D. (2009). [empirical] Why and Why Not Explanations Improve the Intelligibility of Context-Aware Intelligent Systems. CHI '09, 2119–2128. DOI: 10.1145/1518701.1519023. — Why and Why-Not (contrastive) explanations significantly improved understanding and appropriate trust; "what-if/how-to" less. Aspect: detection — build "why did the agent do X / why not Y" affordances for verification.

[Q-17] Parasuraman, R., & Riley, V. (1997). [theory] Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors 39(2), 230–253. (cross-ref [F-2]) — Misuse/disuse/abuse + complacency with imperfectly-reliable automation. Aspect: detection, attention — why humans miss agent errors; design for calibrated reliance.

[Q-18] Skitka, L. J., Mosier, K. L., & Burdick, M. (1999). [empirical] Does Automation Bias Decision-Making? IJHCS 51(5), 991–1006. (cross-ref [F-9]) — Omission + commission automation-bias errors with a reliable-but-imperfect aid. Aspect: detection — the central risk detection affordances must counteract.

[Q-19] Buçinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). [empirical+design-guidance] To Trust or to Think: Cognitive Forcing Functions… PACMHCI 5(CSCW1), Art. 188. (cross-ref [F-21], [C-18]) — Cognitive forcing (delay the answer; decide first) reduced over-reliance vs. plain XAI, at some satisfaction cost. Aspect: detection, choice-arch — concrete friction patterns that force verification.

[Q-20] Zhang, Y., Liao, Q. V., & Bellamy, R. K. E. (2020). [empirical] Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making. FAT* '20, 295–305. DOI: 10.1145/3351095.3372852. — Confidence scores helped calibrate trust; local explanations did not reliably help and sometimes increased over-reliance. Aspect: detection, consequence — uncertainty is a stronger detection signal than explanation.

[Q-21] Prabhudesai, S., Yang, L., Asthana, S., et al. (2023). [empirical] Understanding Uncertainty: How Lay Decision-makers Perceive and Interpret Uncertainty in Human-AI Decision Making. IUI '23, 379–396. DOI: 10.1145/3581641.3584033. — Communicated uncertainty slows users to reason analytically, but is easily misperceived without careful framing. Aspect: detection, consequence — confidence must be legible and calibrated; supports low-confidence escalation.

[Q-22] Nielsen, J. (2006). [design-guidance — practitioner, not peer-reviewed] Progressive Disclosure. NN/g. (cross-ref [G-16]) — Essential info first, advanced detail deferred but recoverable. Aspect: detection, attention — layered preview/diff "show details" affordances.

[Q-23] Mark, G., Gudith, D., & Klocke, U. (2008). [empirical] The Cost of Interrupted Work: More Speed and Stress. CHI '08, 107–110. DOI: 10.1145/1357054.1357072. — Interrupted tasks finish faster but at higher stress/frustration/effort; relatedness didn't reduce disruption. Aspect: attention — interrupting the overseer is costly by default.

[Q-24] Bailey, B. P., & Konstan, J. A. (2006). [empirical] On the Need for Attention-Aware Systems… Computers in Human Behavior 22(4), 685–708. DOI: 10.1016/j.chb.2005.12.009. — Interrupting during a task (vs. at a boundary) cost 3–27% more time, ~2× errors, more annoyance; deferring a few seconds to a natural pause mitigated it. Aspect: attention — timing matters as much as the request.

[Q-25] Horvitz, E., & Apacible, J. (2003). [theory+empirical] Learning and Reasoning about Interruption. ICMI '03, 20–27. DOI: 10.1145/958432.958440. — Models to infer the expected cost of interrupting from sensed streams. Aspect: attention — interruption as an expected-cost quantity an agent can reason over.

[Q-26] Horvitz, E., Koch, P., & Apacible, J. (2004). [empirical+design-guidance] BusyBody: Creating and Fielding Personalized Models of the Cost of Interruption. CSCW '04, 507–510. DOI: 10.1145/1031607.1031690. — Learns personalized decision-theoretic interruptibility models and mediates notifications at run-time. Aspect: attention — an agent can learn each overseer's interruption tolerance.

[Q-27] Fogarty, J., Hudson, S. E., Atkeson, C. G., et al. (2005). [empirical] Predicting Human Interruptibility with Sensors. ACM TOCHI 12(1), 119–146. DOI: 10.1145/1057237.1057243. — Cheap sensors estimate interruptibility as accurately as human observers. Aspect: attention — lightweight signals suffice for good timing.

[Q-28] Iqbal, S. T., & Bailey, B. P. (2008). [design-guidance+empirical] Effects of Intelligent Notification Management on Users and Their Tasks. CHI '08, 93–102. DOI: 10.1145/1357054.1357070. — Deferring notifications to task breakpoints reduced frustration and reaction time vs. immediate delivery. Aspect: attention — bundle/defer non-urgent agent notifications to natural boundaries.

[Q-29] Pielot, M., Church, K., & de Oliveira, R. (2014). [empirical] An In-Situ Study of Mobile Phone Notifications. MobileHCI '14, 233–242. DOI: 10.1145/2628363.2628364. — Participants handled 63.5 notifications/day on average; higher volume was associated with more negative emotion. Aspect: attention — over-notifying degrades well-being; budget and aggregate prompts.

[Q-30] Hussain, M. I., Reynolds, T. L., & Zheng, K. (2019). [empirical+design-guidance] Medication Safety Alert Fatigue May Be Reduced via Interaction Design and Clinical Role Tailoring: A Systematic Review. JAMIA 26(10), 1141–1149. DOI: 10.1093/jamia/ocz095. (cross-ref [H-12]) — Clinicians override most alerts (incl. critical) due to low specificity; design + role-tailoring mitigate. Aspect: attention, detection — the canonical over-burdening failure; HITL alerts must be high-precision and role-targeted.

[Q-31] Thaler, R. H., & Sunstein, C. R. (2008). [theory] Nudge. Yale University Press. ISBN 978-0300122237. — Choice architecture: no choice is presented neutrally. Aspect: choice-arch — an approval prompt is choice architecture; framing/ordering/default must be designed intentionally.

[Q-32] Johnson, E. J., & Goldstein, D. G. (2003). [empirical] Do Defaults Save Lives? Science 302(5649), 1338–1339. DOI: 10.1126/science.1091721. — The default effect: effective organ-donation consent rates of ~86–100% in opt-out countries vs ~4–28% in opt-in countries. Aspect: choice-arch — the default dominates; the safe/reversible action must be default, never auto-approve-on-timeout.

[Q-33] Sunstein, C. R. (2019). [theory] Sludge and Ordeals. Duke Law Journal 68(8), 1843–1883. — Sludge (harmful friction) vs justified "ordeals." Aspect: choice-arch, attention — oversight friction can force beneficial reflection or become harmful approval-fatigue sludge.

[Q-34] Sunstein, C. R. (2020). [design-guidance] Sludge Audits. Behavioural Public Policy 6(4), 654–673. DOI: 10.1017/bpp.2019.32. — A methodology to measure/justify frictional burdens. Aspect: choice-arch, attention — audit how much friction each approval step adds vs. the stakes it gates.

[Q-35] Cox, A. L., Gould, S. J. J., Cecchinato, M. E., et al. (2016). [design-guidance] Design Frictions for Mindful Interactions: The Case for Microboundaries. CHI EA '16, 1389–1397. DOI: 10.1145/2851581.2892410. — Deliberate microboundaries prompt reflection before consequential actions without removing autonomy. Aspect: choice-arch, detection — friction done right: a beat of reflection before irreversible agent actions.

[Q-36] Mathur, A., Acar, G., Friedman, M. J., et al. (2019). [empirical] Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites. PACMHCI 3(CSCW), Art. 81. DOI: 10.1145/3359183. — 1,818 dark-pattern instances across 15 types/7 categories (sneaking, urgency, misdirection, social proof, scarcity, obstruction, forced action). Aspect: choice-arch — a catalogue of tactics an approval prompt must NOT use.

[Q-37] Gray, C. M., Kou, Y., Battles, B., et al. (2018). [design-guidance] The Dark (Patterns) Side of UX Design. CHI '18, Paper 534. DOI: 10.1145/3173574.3174108. — Five strategies: nagging, obstruction, sneaking, interface interference, forced action; designers drift into manipulation serving the deploying party. Aspect: choice-arch — an anti-pattern checklist for oversight UX.

[Q-38] Gray, C. M., Santos, C., & Bielova, N. (2023). [theory] Towards a Preliminary Ontology of Dark Patterns Knowledge. CHI EA '23, Art. 286. DOI: 10.1145/3544549.3585676. — Harmonizes a three-level ontology of dark patterns for enforcement (EU DSA, FTC). Aspect: choice-arch — a vocabulary for naming and prohibiting specific manipulative moves in approval UIs.

[Q-39] Brignull, H. (2023). [design-guidance] Deceptive Patterns: Exposing the Tricks Tech Companies Use to Control You. Testimonium Ltd. ISBN 978-1739454401. — The canonical practitioner book by the coiner of "dark patterns." Aspect: choice-arch — the authoritative reference for ensuring agent consent/approval prompts inform rather than steer.

V.R — Sociotechnical context & HITL workflow precedents

How automation reshapes skill, work, and agency — and mature human+AI workflow precedents from other fields. Tags: [book] / [empirical] / [standard] / [theory] / [review].

Highlights

  • Informate, don't just automate (Zuboff, [R-1]). Agents should surface reasoning, evidence, and intermediate state to grow human understanding — not silently execute and hide the process. The root choice that determines whether HITL builds or erodes competence.
  • Deskilling is the central failure mode HITL must prevent (Braverman [R-3], Bainbridge [R-6]). Separating "conception" (judgment) from "execution" turns humans into rubber-stamps; "the human will catch it" fails once the human is deskilled into a passive monitor.
  • Scaffolding + fading is the deskilling antidote and the template for progressive autonomy [R-8, R-9, R-10]. Calibrate support to the learner's zone, then fade it as competence grows — "without creating dependence." Cognitive apprenticeship's modeling → coaching → scaffolding → articulation → reflection → fading → exploration is a staircase for graduating autonomy and tapering oversight.
  • Augment, don't replace (Autor [R-4], Brynjolfsson [R-5]). Allocate by comparative advantage at the task level; the "Turing Trap" warns mimic-and-replace AI is the costly default to resist.
  • Quality-estimation gating (machine-translation quality estimation, MT QE [R-13]; ISO 18587 [R-11]): the agent emits a calibrated confidence/quality score; auto-proceed on high confidence, escalate on low. A mature, operational confidence-routing mechanism.
  • Risk-tiered review intensity (post-editing (PE) levels [R-12]): match human effort to content risk/value — full review for high-stakes, light/sampled for low-stakes.
  • Escalation tiers + queue routing (Gillespie [R-15]): automated first-pass → front-line human → escalate hard/ambiguous cases up. Pair with transparency.
  • Audit sampling of autonomous output (Grossman & Cormack [R-16]): statistically sample and validate agent decisions against an expert/control set to certify quality; combine with active-learning loops.
  • Second-reader / redundancy pattern (radiology CAD [R-17, R-18]): run the agent as an independent parallel checker flagging candidates for adjudication — but tune flag volume, or it adds false positives and burden.
  • Reviewer wellbeing is a first-class constraint (Roberts [R-14]). High-volume HITL review harms reviewers; pacing/rotation/support must be designed in. The "human trains and backstops AI" loop is real and persistent.
  • Joint optimization, not technical optimization (Trist & Bamforth [R-7]). HITL is a sociotechnical system design problem; a technically superior agent that wrecks human roles is a net failure.
  • Plans are situated — build for in-flight repair (Suchman [R-2]). Favor continuous, cheap human repair over one-shot up-front approval; agency is reconfigured, not cleanly handed off.
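
Three of the precedents above (quality-estimation gating, risk-tiered review, and audit sampling) compose into a simple routing rule. The following is a rough sketch assuming Python 3.9+, with hypothetical names (`route`, `audit_sample`), an illustrative 0.9 threshold, and made-up tier labels; it presumes the confidence score is already calibrated, which the cited work treats as a hard problem in its own right.

```python
import random


def route(confidence: float, risk_tier: str, auto_threshold: float = 0.9) -> str:
    """Return 'auto', 'light_review', or 'full_review' for one agent output.

    risk_tier is assumed to be 'low', 'medium', or 'high' (illustrative labels).
    """
    if risk_tier == "high":
        return "full_review"      # high stakes: always full human review
    if confidence >= auto_threshold:
        return "auto"             # confident and lower-stakes: proceed
    # Low confidence: escalate, with review intensity matched to risk.
    return "light_review" if risk_tier == "low" else "full_review"


def audit_sample(auto_items: list, rate: float, seed: int = 0) -> list:
    """Draw a reproducible random sample of auto-approved items for expert audit."""
    if not auto_items:
        return []
    k = min(len(auto_items), max(1, round(len(auto_items) * rate)))
    return random.Random(seed).sample(auto_items, k)
```

For example, `route(0.95, "low")` yields `"auto"`, while the same confidence on a high-risk item still yields `"full_review"`. The sample drawn by `audit_sample` is what an expert would then check to estimate the error rate of the auto-approved stream; how large that sample must be for a given certification target is a statistical question this sketch does not address.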

Annotated bibliography

[R-1] Zuboff, S. (1988). [book] In the Age of the Smart Machine: The Future of Work and Power. Basic Books. — Coins informate vs. automate: IT both substitutes for labor and renders work processes visible/knowable; firms reflexively chose control over learning. Relevance: the foundational design principle — agents should informate (surface reasoning/evidence/state) so human competence grows rather than atrophies.

[R-2] Suchman, L. A. (1987/2007). [book] Plans and Situated Actions / Human-Machine Reconfigurations. Cambridge UP. (cross-ref [M-4]) — Human action is situated (improvised against contingencies); plans are resources, not determinants; agency is reconfigured, not transferred. Relevance: favor cheap, continuous in-flight repair over one-shot approval; the labor/agency angle on automation.

[R-3] Braverman, H. (1974). [book] Labor and Monopoly Capital: The Degradation of Work in the Twentieth Century. Monthly Review Press. — Origin of the deskilling thesis: Taylorism separates conception from execution, concentrating knowledge in management. Relevance: names the risk that "human-in-the-loop" becomes "human-as-rubber-stamp" deskilling; keep conception with the human or jointly held.

[R-4] Autor, D. H. (2015). [empirical/review] Why Are There Still So Many Jobs? The History and Future of Workplace Automation. J. Economic Perspectives 29(3), 3–30. DOI: 10.1257/jep.29.3.3. — Automation substitutes for routine/codifiable tasks but complements human judgment (the Polanyi paradox bounds what's automatable). Relevance: decompose the workflow into tasks and route by comparative advantage; the loop is a task-allocation boundary.

[R-5] Brynjolfsson, E. (2022). [theory/essay] The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence. Daedalus 151(2), 272–287. arXiv:2201.04200. — Chasing human-like (substitutive) AI over-incentivizes replacement; augmentation generates more value and keeps humans leveraged. Relevance: build to augment; HITL is the mechanism by which augmentation (and accountability) is operationalized.

[R-6] Bainbridge, L. (1983). [theory/essay] Ironies of Automation. Automatica 19(6), 775–779. (cross-ref [E-7], [H-1]) — Automating the routine leaves humans the residual hardest tasks while their skills decay from disuse; monitoring is a task humans do poorly. Relevance: the deskilling-by-automation trap; HITL must keep humans actively practicing, not passively monitoring.

[R-7] Trist, E. L., & Bamforth, K. W. (1951). [empirical] Some Social and Psychological Consequences of the Longwall Method of Coal-Getting. Human Relations 4(1), 3–38. DOI: 10.1177/001872675100400101. — Founding sociotechnical systems study: a technically "efficient" mechanization wrecked the social work system; performance requires joint optimization of social + technical. Relevance: HITL is a system design problem; don't optimize the agent in isolation.

[R-8] Vygotsky, L. S. (1978). [book] Mind in Society. Harvard UP. — The Zone of Proximal Development: the gap between unaided and guided performance, where learning happens via assisted performance. Relevance: a model for calibrating how much an agent should help — target the human's zone; as competence grows, assistance recedes.

[R-9] Wood, D., Bruner, J. S., & Ross, G. (1976). [empirical] The Role of Tutoring in Problem Solving. J. Child Psychology and Psychiatry 17(2), 89–100. DOI: 10.1111/j.1469-7610.1976.tb00381.x. — Coins scaffolding (six tutoring functions); effective scaffolding is contingent and meant to be faded, not to create dependence. Relevance: a blueprint for progressive autonomy — the agent scaffolds the human (and vice versa), and the scaffolding fades; fading is the explicit deskilling antidote.

[R-10] Collins, A., Brown, J. S., & Newman, S. E. (1989). [book chapter/theory] Cognitive Apprenticeship: Teaching the Crafts of Reading, Writing, and Mathematics. In Resnick (ed.), Knowing, Learning, and Instruction, 453–494. Erlbaum. — The staircase: modeling → coaching → scaffolding → articulation → reflection → fading → exploration; experts make tacit reasoning visible, then transfer responsibility. Relevance: the canonical model for graduating an agent's autonomy and tapering oversight while preserving competence.

[R-11] ISO 18587:2017. [standard] Translation services — Post-editing of machine translation output — Requirements. ISO. — Defines the post-editing process and full vs. light PE (tiered quality targets). Relevance: a standardized template for the human-corrects-machine-output loop with quality targets tied to use case — review need not be uniform.

[R-12] Nunziatini, M., & Marg, L. (2020). [empirical/industry] Machine Translation Post-Editing Levels: Breaking Away from the Tradition… EAMT 2020, 309–318. — Proposes graduated PE levels tailored by content type and end use (effort matched to value/risk). Relevance: tier human review intensity by content risk — direct precedent for risk-tiered HITL.

[R-13] Specia, L., Scarton, C., & Paetzold, G. H. (2018). [book/survey] Quality Estimation for Machine Translation. Morgan & Claypool. DOI: 10.2200/S00854ED1V01Y201805HLT039 (verify at press). — MT Quality Estimation predicts quality without a reference to gate/route segments (auto-publish high-confidence; send low-confidence to humans). Relevance: the cleanest precedent for confidence-based gating — agents should produce calibrated confidence and route by it.

[R-14] Roberts, S. T. (2019). [book/ethnography] Behind the Screen: Content Moderation in the Shadows of Social Media. Yale UP. — Ethnography of commercial content moderation: a hidden workforce that enforces policy, trains AI, and screens what machines can't — at severe psychological cost. Relevance: reviewer wellbeing is a first-class design constraint; the human-trains-and-backstops-AI loop is real and persistent.

[R-15] Gillespie, T. (2018). [book] Custodians of the Internet… Yale UP. — Moderation is constitutional to platforms: rule-making → automated detection → human review queues, escalation tiers → deliberate opacity. Relevance: the canonical scaled architecture — tier the loop by difficulty/risk; make the process transparent and accountable, not masked.

[R-16] Grossman, M. R., & Cormack, G. V. (2011). [empirical] Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review. Richmond J. Law & Technology 17(3), Art. 5. — TAR/predictive coding (classifier trained on expert coding) can outperform exhaustive manual review on both effectiveness and effort; continuous active learning (CAL) keeps a human in a tight train-review-retrain loop with statistical sampling/validation. Relevance: the best precedent for audit sampling of autonomous output + active-learning loops.

[R-17] Gilbert, F. J., et al. (2008). [empirical] Single Reading with Computer-Aided Detection for Screening Mammography. NEJM 359(16), 1675–1684. DOI: 10.1056/NEJMoa0803545. — Single radiologist + CAD ("second reader") approximated double reading. Relevance: the second-reader / redundancy pattern — the agent runs as an independent parallel checker flagging candidates for human adjudication.

[R-18] Azavedo, E., et al. (2012). [review] Is single reading with CAD as good as double reading in mammography screening? A systematic review. BMC Medical Imaging 12:22. DOI: 10.1186/1471-2342-12-22. (lead-author attribution to confirm) — Single+CAD broadly comparable on detection, but CAD increases reading time and recall (false positives); benefit depends on prompt integration. Relevance: the cost side of the second-reader pattern — tune the agent's flag threshold to the human's attention budget, or it degrades the workflow.


Cross-cutting synthesis

The first eight clusters (Parts I–II) were assembled independently, yet they converge on a small set of findings that recur in 1970s aviation, 2000s clinical informatics, 2020s welfare-algorithm scandals, and 2026 coding-agent studies alike. (Parts III–V, added in a second round, overwhelmingly reinforce these from economics, law, safety science, manufacturing, finance, and education — see "What Parts III–V add" below.) That convergence is the strongest signal in the codex: these are not domain quirks but structural properties of putting a human in charge of an autonomous system. The recurring themes below are the load-bearing inputs to the framework.

The ten recurring findings (each triangulated across ≥3 clusters)

  1. The reliability paradox — competence breeds complacency. The more reliable the automation, the less the human monitors it, so the rare error is the one most likely to slip through. Stated by Bainbridge [E-7], measured by Parasuraman/Molloy/Singh [F-4, E-18], reaffirmed across autonomy levels [E-15, H-4], visible in AV crashes [H-16, H-17, H-18], and reproduced in coding agents where "safety contingent on developer oversight" decays under productivity pressure [A-18, A-21]. Design consequence: never treat a more capable agent as a reason for less oversight machinery — capability and complacency rise together.

  2. The exposure-vs-correction gap (the recognition bottleneck). Oversight mechanisms reliably reduce the occurrence of bad actions but barely improve the human's ability to catch one once it surfaces. The cleanest evidence is [A-17] (plan-approval cut bad-action occurrence to 60–74% but intervention success stayed 9–26%); the mechanism is automation bias / commission errors [F-9, F-10, D-18] and the failure is rationalization, not inattention. Design consequence: gating what the agent does is easier and more effective than improving what the human notices — invest accordingly, and don't assume a gate makes the human a good detector.

  3. Nominal oversight ≠ meaningful control. A human placed in the loop without authority, competence, awareness, or time becomes a moral crumple zone [F-19, D-16] / rubber stamp [D-4, D-14] / MABA-MABA trap [D-15] — absorbing blame while protecting the system. The constructive target is meaningful human control = tracking + tracing [C-20, D-13] and Billings' "informed, involved, in command" [E-21]. Design consequence: every oversight point must give the human real authority and the context to use it, or it is theater.

  4. Reversibility beats confirmation. Confirmation dialogs habituate users into reflexive click-through and don't prevent errors; undo is the superior safety net [G-14, G-13, G-15]. Modern agents implement this as checkpointing/rewind [A-4] — but it has a hard boundary at the shell (Bash side-effects aren't tracked [A-4]) and false-alarm-prone confirmations actively corrode trust [F-17]. Design consequence: make agent actions reversible by default; reserve blocking confirmation for the genuinely irreversible — and know precisely what you cannot undo.

  5. The out-of-the-loop / handoff problem. Pulling the human out of the loop degrades the comprehension (not just the data) needed to take over [E-12, E-13, H-3], and abrupt handoff to a disengaged human at the worst moment fails — re-engagement alone takes 1.5–3.5+ s, and regaining competence takes longer [H-19], as AF447 [H-7] and AV takeover studies show. Today's models are themselves weak at mid-task steering/cancellation [A-19, A-16]. Design consequence: handoffs must be anticipatory, gradual, and context-rich; for time-critical failures the human fallback may not be viable, so don't architect safety around it.

  6. Autonomy is a multi-dimensional dial, not a switch. Decide which stage (acquire → analyze → decide → act) and what level (suggest → act-with-approval → act-then-notify → act-silently) independently [E-4, E-6]; automate perception/analysis more freely than action, where over-trust and lost authority bite hardest. Mature agents implement this as permission modes + risk-tiering [A-2, A-5, A-6, A-13], and governance says oversight intensity should scale to autonomy/stakes [D-6]. Design consequence: gate by reversibility × blast radius × stakes, not uniformly; make stepping down to lower autonomy easy [H-6].

  7. Trust must be calibrated — and explanations can decalibrate it. Both over-trust (misuse) and under-trust (disuse) are failures [F-1, F-2, E-16]; the disposition to over-trust labeled-AI advice is a baseline risk [C-16]. Critically, explanations/rationales tend to increase acceptance regardless of correctness [F-20, C-17] — transparency aimed at empowering the overseer can instead disarm them. Design consequence: surface capability, uncertainty, and verifiable evidence (not just persuasive rationale); design explanations to support checking, not agreeing.

  8. The signal economy: every interruption is a tax, and false alarms are the worst tax. False-alarm-prone systems corrode both compliance and reliance [F-17] and drive alert fatigue — clinicians override 49–96% of alerts [H-10], and Three Mile Island (TMI) flooded operators with 100+ alarms [H-20]. The portable discipline: an alarm must demand a specific response, and the interruptive rate must be capped [H-21]; you cannot fix fatigue by muting the noisiest category [H-11]. Design consequence: treat interruptions as a scarce budget; most signals should be passive/logged, reserving blocking interventions for high-severity, high-precision cases.

  9. Deskilling erodes the very expertise oversight depends on. Heavy automation atrophies skill — especially higher-order judgment [E-22, H-9, E-7] — and the human-feedback literature shows you need competent humans to evaluate at all [B-1, B-8]. A reviewer who never does the task cannot meaningfully review the agent. Design consequence: build in deliberate human-only practice / engagement, or accept that the oversight claim hollows out over time.

  10. Structural friction beats exhortation; sandboxes beat per-action gates. Telling people to "stay vigilant" does not work [F-7, E-17]; cognitive forcing functions (decide before seeing the AI's answer; mandatory deliberation) measurably reduce over-reliance [F-21, C-18] though users dislike them. At the system level, the practitioner consensus is to move the safety boundary into the environment — sandboxes, no-network containers, scoped credentials, budget caps [A-23, A-1] — and to recognize the lethal trifecta (private data + untrusted content + external comms) as the condition under which an agent must not run unsupervised [A-23]. Design consequence: prefer architectural and forcing-function controls over asking the human to be more careful.
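Finding 8's "scarce budget" can be made concrete with a small router. The sketch below is illustrative only: the class name `InterruptBudget`, the severity and precision scales, and every threshold are assumptions introduced here, not values drawn from the cited sources. It shows one way to let an alert interrupt only when it is severe, historically precise, and tied to a specific required response, while capping how many such interruptions fit in one window.

```python
from dataclasses import dataclass, field
from enum import Enum


class Channel(Enum):
    LOG = "log"      # passive: recorded for later review, never interrupts
    BLOCK = "block"  # interruptive: the human must respond before work continues


@dataclass
class InterruptBudget:
    """Caps blocking alerts per window and logs everything else.

    Every threshold here is an illustrative assumption, not a codex value.
    """
    max_blocks_per_window: int = 3
    min_severity: float = 0.8   # 0..1: how bad the outcome is if the signal is real
    min_precision: float = 0.9  # 0..1: historical hit rate of this signal type
    used: int = 0
    overflow: int = 0           # qualifying alerts demoted because the cap was hit
    history: list = field(default_factory=list)

    def route(self, name: str, severity: float, precision: float,
              required_action: str = "") -> Channel:
        # An alert may interrupt only if it is severe, precise, and
        # demands a specific response from the human.
        qualifies = (
            severity >= self.min_severity
            and precision >= self.min_precision
            and bool(required_action)
        )
        if qualifies and self.used < self.max_blocks_per_window:
            self.used += 1
            channel = Channel.BLOCK
        else:
            if qualifies:
                self.overflow += 1  # counted so the cap itself can be audited
            channel = Channel.LOG
        self.history.append((name, channel))
        return channel

    def reset_window(self) -> None:
        """Start a new budget window (for example, per session or per hour)."""
        self.used = 0


budget = InterruptBudget(max_blocks_per_window=1)
for alert in [("style nit", 0.2, 0.95, "reformat"),
              ("force push to main", 0.95, 0.95, "approve or deny")]:
    print(alert[0], "->", budget.route(*alert).value)
```

The sketch demotes qualifying alerts to the log once the cap is reached and counts them in `overflow`. Whether a real system should instead queue them, escalate them, or halt the agent is a design decision this example does not settle.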

The hard tensions the framework must resolve (not eliminate)

  • Autonomy vs. oversight capability. Every increment of autonomy that makes the agent more useful also degrades the human's situation awareness and takeover ability [E-13, H-4]. There is no setting that maximizes both; the framework must choose per action class, not globally — Shneiderman's 2-D HCAI insists the two are independent axes and the goal is high-high [G-4], but the empirics [E-15] show that is hard-won, not free.
  • Friction vs. usability. The interventions that most reduce over-reliance are the ones users most dislike [F-21, C-18, A-17]; the interventions users love (one-click approve, YOLO mode) are the ones that produce rubber-stamping. The framework must spend friction deliberately where stakes justify it, not uniformly.
  • Explanation vs. over-trust. Users need intelligibility to oversee [G-17, G-11], yet explanations increase uncritical acceptance [F-20]. The resolution is verification-oriented transparency (evidence the human can independently check) over persuasion-oriented rationale.
  • Scaling oversight vs. preserving the human. AI-assisted oversight (debate, amplification, RLAIF, AI supervisors) is the only way to oversee superhuman-scale work [C-3→C-13, A-15, B-20], but each step that lets AI help the human evaluate also removes the human a little, relocating risk into the assisting AI's blind spots [B-20, C-8]. Net team performance can even go down [C-19]. The framework must track where genuine human judgment still lives.
  • Mandated vs. effective oversight. Law and standards increasingly require oversight [D-1, D-2, D-5] while the evidence says humans routinely can't deliver it [D-14, D-17→D-25]. Compliance and safety are not the same goal; designing for the checkbox can actively produce a crumple zone. The framework should treat oversight as a claim to be validated, per Green [D-14].

Seed design principles for the framework (provisional)

These fall out of the findings above and will be developed into the actionable framework:

  1. Gate by consequence, not uniformly — reversibility × blast-radius × stakes determines whether an action is auto-run, notify-after, approve-before, or forbidden [A-6, E-6, D-6].
  2. Default to reversible; confirm only the irreversible — checkpoint/undo as the primary net; blocking confirmation as the rare exception [G-14, A-4].
  3. Make state, intent, and evidence legible — for verification, not persuasion [G-17, F-20].
  4. Budget interruptions; make every alert demand a defined action [H-21, F-17].
  5. Engineer the environment, not just the dialog — sandbox, scope credentials, break the lethal trifecta; forcing functions over exhortation [A-23, F-21].
  6. Design the handoff, not just the autonomy — anticipatory, gradual, context-rich; assume re-engagement latency; keep the human warm where takeover must be fast [H-19, E-13].
  7. Keep the human competent — deliberate engagement/practice to resist deskilling and complacency [E-22, F-7].
  8. Give oversight real teeth — authority + competence + time + traceability, or don't claim it [D-4, C-20]; and validate that it actually works [D-14, A-17].
  9. Scale span deliberately — there is a hard ceiling on how many agents one human can oversee [H-22]; plan oversight capacity as a finite, budgeted resource.
  10. Treat the model's steerability as a constraint, not a given — interruptibility and mid-task correction are unsolved model capabilities [A-19], so design within their limits.
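Principle 1 (gate by consequence) can be sketched as a small policy function. This is a minimal illustration under stated assumptions: the three-point scales for blast radius and stakes, the multiplicative risk score, the numeric thresholds, and the names `Action`, `Gate`, `gate_for`, and `step_down` are all hypothetical and would need tuning against real action classes. The four gate levels are the ones named in the principle.

```python
from dataclasses import dataclass
from enum import IntEnum


class Gate(IntEnum):
    AUTO_RUN = 0
    NOTIFY_AFTER = 1
    APPROVE_BEFORE = 2
    FORBIDDEN = 3


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool   # can a checkpoint or undo fully restore the prior state?
    blast_radius: int  # 1 = local, 2 = project or team, 3 = external or production
    stakes: int        # 1 = low, 2 = moderate, 3 = high


def gate_for(action: Action) -> Gate:
    """Map an action to a gate. Thresholds are placeholders to be tuned."""
    risk = action.blast_radius * action.stakes  # possible values: 1, 2, 3, 4, 6, 9
    if not action.reversible:
        # Irreversible actions never run without a human decision.
        return Gate.FORBIDDEN if risk >= 9 else Gate.APPROVE_BEFORE
    if risk >= 6:
        return Gate.APPROVE_BEFORE
    if risk >= 3:
        return Gate.NOTIFY_AFTER
    return Gate.AUTO_RUN


def step_down(gate: Gate) -> Gate:
    """Move one notch toward less autonomy; FORBIDDEN stays FORBIDDEN."""
    return Gate(min(gate + 1, Gate.FORBIDDEN))


for a in [Action("edit local file", True, 1, 1),
          Action("refactor shared module", True, 2, 2),
          Action("send customer email", False, 3, 1),
          Action("wire payment", False, 3, 3)]:
    print(f"{a.name}: {gate_for(a).name}")
```

Treating reversibility as a hard branch rather than one more multiplier reflects principle 2: an irreversible action is never auto-run, however small its estimated risk. `step_down` is included because the findings ask that moving to lower autonomy be easy.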

What Parts III–V add (delegation, mechanisms, the episode)

The nine second-round clusters largely reinforce the ten findings above — itself strong corroboration, since they come from economics, law, safety science, manufacturing, finance, and education rather than AI. The genuinely new contributions:

  1. Oversight is costly monitoring — so monitor the exceptions (principal–agent theory). A human delegating to an agent is a textbook principal–agent relationship [J-1, J-2]; agency cost (monitoring + bonding + residual loss) never reaches zero, so the goal is minimizing total cost, not maximizing control. "Management by exception" [K-1] and the informativeness principle [J-7] agree: log the trace, review the deviations, accept a residual loss.

  2. The real lever is residual control rights, not an exhaustive rulebook (incomplete contracts). You can't specify correct behavior for every state [J-8, J-9]; control is holding the override / selective-intervention right (halt, revoke, substitute) for the unforeseen — and some duties are non-delegable by law [J-15], a hard floor under "a human must decide."

  3. Brief by intent, not by steps (mission command). Commander's intent + mission orders ("results, not how") [K-7] and a small set of pre-authorized rules of engagement [K-12] outperform micromanagement under uncertainty — and autonomy must be paired with shared context [K-10].

  4. Decouple tightly-coupled agent pipelines (normal accident theory). Interactive complexity + tight coupling make accidents "normal" [L-10]; the structural fix is to insert slack — human checkpoints, reversibility, rate limits, circuit breakers [O-6, O-7]. Redundancy can backfire [L-11]; more checkers ≠ more safety.

  5. Watch for drift and normalization of deviance. Oversight standards erode silently as "we got away with it" repeats [L-15, L-16]; track the trend in oversight bypasses, not just incidents. Deference to expertise [L-7] says authority should migrate to whoever has the local knowledge.

  6. Capability-removal beats policy — cross-domain consensus. Poka-yoke's hierarchy (prevention > detection > vigilance) [N-7], lockout > tagout [N-17], and least-privilege / capabilities [O-10, O-11] all say what the agentic-security literature does: make bad actions impossible, don't merely discourage them; a bypassable denylist is not a boundary [A-14]. Prompt injection is the confused-deputy problem in agentic form: an agent holding legitimate authority is tricked into exercising it on an attacker's behalf [O-11, O-12].

  7. Two-party approval and verified shields are mature, portable gates. Four-eyes / maker-checker and the two-person rule [O-1, N-15] (proposer ≠ approver), and runtime shields / Simplex [O-17, O-18] (a verified monitor that vetoes the agent, even under injection), are battle-tested machinery — and Cranor (2008) [O-9] is a direct prior HITL framework this codex generalizes.

  8. The loop episode is a designable unit (and usually under-designed). A single oversight moment must supply request clarity, consequence + reversibility preview (feedforward [Q-3]), provenance ("how did this get to me" [Q-7, Q-11]), detection affordances (why/why-not + calibrated uncertainty, framed to check not sell [Q-16, Q-20, F-20]), and a respected attention budget — with bias-safe choice architecture and no dark patterns [Q-31, Q-36]. Cross-domain evidence converges that gates only work when meaningfully attended (Ontario checklist [N-12], ISMP double-checks [N-16], alert fatigue [H-10, Q-30]). See the full anatomy in §V.Q.

  9. Under-reliance is a failure mode too — and decision-fatigue evidence is shaky. Classic advice research shows humans under-weight good advice (egocentric discounting [P-2]) even as automation-bias research shows over-reliance, so calibration must guard both directions [P-20, P-21]. And "limit decisions per human" rests on contested ground: ego depletion failed a 23-lab replication [P-13] and the "hungry judge" effect is disputed [P-15, P-16]. Treat it as a testable heuristic, not a law.

  10. Informate, don't just automate; fade scaffolding to fight deskilling. Agents should surface reasoning/evidence to grow human understanding [R-1], and scaffolding-with-fading [R-9, R-10] is a concrete model for progressive autonomy that preserves the competence oversight depends on. Mature workflow precedents — quality-estimation gating [R-13], escalation tiers [R-15], audit sampling [R-16] — are ready patterns to borrow.
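Two of the workflow precedents in item 10, quality-estimation gating [R-13] and audit sampling [R-16], combine naturally. The sketch below is one possible shape, not a method taken from either source: it assumes the agent emits a calibrated confidence per output (itself a strong assumption), and the threshold, the audit rate, and the names `Output`, `route`, and `observed_error_rate` are hypothetical. Low-confidence outputs go to a human, high-confidence outputs are accepted automatically, and a random sample of the accepted ones is still checked so the autonomous path stays measured.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    item_id: str
    confidence: float  # assumed to be a calibrated probability of acceptability


def route(outputs, threshold=0.9, audit_rate=0.1, seed=0):
    """Split outputs into (human_review, auto_accept, audit_sample)."""
    human_review = [o for o in outputs if o.confidence < threshold]
    auto_accept = [o for o in outputs if o.confidence >= threshold]
    # Sample at least one auto-accepted item whenever any exist, so the
    # autonomous path never goes completely unchecked.
    k = min(len(auto_accept), max(1, round(audit_rate * len(auto_accept))))
    audit_sample = random.Random(seed).sample(auto_accept, k)
    return human_review, auto_accept, audit_sample


def observed_error_rate(audit_verdicts):
    """Fraction of audited items a human judged unacceptable (False)."""
    if not audit_verdicts:
        return None
    return sum(1 for ok in audit_verdicts if not ok) / len(audit_verdicts)


outputs = [Output(f"seg-{i}", c)
           for i, c in enumerate([0.99, 0.97, 0.95, 0.93, 0.91, 0.6, 0.4])]
review, accepted, audit = route(outputs, audit_rate=0.4)
print(len(review), "to human review;", len(accepted), "auto-accepted;",
      len(audit), "of those sampled for audit")
```

If the audited error rate drifts above what the confidence scores imply, that is evidence the calibration assumption has failed and the threshold should tighten. The sample size needed for a statistically meaningful audit is outside the scope of this sketch.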

Open questions & gaps

  • Closing the recognition bottleneck. [A-17] shows gates don't help humans catch errors. What actually improves detection (vs. prevention)? Forcing functions [F-21] help reliance but detection-specific interventions for agentic settings are largely unstudied.
  • Calibrated transparency. We know explanations can increase over-trust [F-20]; we lack tested patterns for verification-oriented disclosure that raises detection without raising blind acceptance — especially for long chain-of-thought / multi-step agent traces.
  • Oversight of agent fleets / multi-agent systems. Supervisory-span limits are documented for UAVs [H-22]; the analog for one human overseeing many concurrent LLM agents (and agents overseeing agents [A-15, C-22]) is barely measured.
  • Validated oversight metrics. Green [D-14] demands evidence that oversight works, but there is no standard instrument for measuring effective (vs. nominal) human control of an agent. What would a "meaningful control" conformance test look like?
  • Reversibility at the shell / real world. Checkpointing covers structured edits but not Bash side-effects [A-4], and many agent actions (sends, payments, deletes) are irreversible by nature. How much of an agent's action space can be made genuinely undoable, and how should the irreducibly-irreversible remainder be governed?
  • Deskilling under pervasive agent use. The aviation evidence [E-22, H-9] predicts judgment decay; the longitudinal effect on developers/knowledge workers using agents daily is unmeasured.
  • Model steerability as the bottleneck. [A-19] shows interruption/cancellation/re-planning are weak model capabilities. How much HITL design is currently limited by the model rather than the interface, and which patterns degrade gracefully when steerability is poor?
  • Trust dynamics over long horizons. Most trust studies are short-session [F-3]; how reliance calibrates (or miscalibrates) over months of living with an agent — incl. recovery after a visible failure — is open.
  • Cross-cultural / individual variation. Complacency potential is an individual difference [F-6] and trust has dispositional/cultural layers [F-3]; per-user adaptive friction is proposed but untested in agentic tools.
  • Measuring meaningful control. [C-20, D-13] define control as tracking + tracing, but no operational test exists for whether an agent's behavior actually tracks the human's reasons and traces to an informed human — the conformance test governance [D-14] implicitly demands.
  • The loop-episode evidence gap. The anatomy in §V.Q is assembled from adjacent fields (security warnings, clinical alerts, sensemaking, provenance, choice architecture); few components are tested in agentic-LLM settings — especially provenance displays and consequence-feedforward for multi-step agent plans, and which detection affordances actually raise catch-rate [A-17].
  • Capability-removal vs. usefulness. Poka-yoke / least-privilege say "make it impossible" [N-7, O-10], but over-constraining throttles useful autonomy [K-14]; the trade-off curve for agents is uncharted.
  • Does AI-assisted oversight net help? Human–AI teams often underperform the better of either alone [C-19]; whether AI supervisors / debate / runtime shields actually raise effective human oversight of agents (vs. relocating risk into the assisting system [B-20, C-8]) is largely untested in real agentic settings.

Status & provenance

  • Assembled: 2026-06-22, in two rounds of parallel research agents — 8 clusters for Parts I–II, then 9 for Parts III–V (17 clusters, ~366 reference entries). Each agent was instructed to retrieve real sources, distinguish empirical from opinion, record successes and documented failures, and flag any unverified citation.
  • Verification posture: the large majority of entries were confirmed against publisher pages, DOIs, arXiv IDs, or authoritative indexes. Items the agents could not machine-verify (paywalls/403s, JS-rendered legal pages) or flagged for a pinpoint check are marked UNVERIFIED or "cross-confirmed via secondary sources" inline — incl. [A-2]/[C-2] notes, [B-4], [D-3], [D-7], [D-8], [D-11], [D-23], [G-2], [H-9], [H-11], [H-15], [H-21], [K-8], [M-13], [M-20], [O-19], [Q-7], [Q-13], [R-13], [R-18]. Spot-check before reusing any citation in a published artifact.
  • Known cross-listings (same source, multiple sections): Lee & See [F-1≈E-19]; Parasuraman & Riley [F-2≈E-16≈H-2]; Parasuraman & Manzey [F-7≈E-17≈C-15≈D-20]; Parasuraman/Molloy/Singh [F-4≈E-18]; Mosier et al. [F-10≈E-20]; Skitka et al. 1999 [F-9≈C-14≈D-18] & 2000 [F-11≈D-19]; Bansal et al. [F-20≈C-17]; Buçinca et al. [F-21≈C-18]; Endsley & Kiris [E-13≈H-3]; Endsley 2017 [E-?≈H-4]; Sarter & Woods [E-?≈H-5]; Bainbridge [E-7≈H-1]; Christiano et al. 2017 [B-16≈C-4]; Constitutional AI [B-20≈C-8]; Santoni de Sio & van den Hoven [C-20≈D-13]; Green [C-21≈D-14]; Elish [F-19≈D-16]. Second-round additions: Coactive Design [K-16≈M-5]; Joint Cognitive Systems [L-19≈M-3]; Norman 1990 [G-12≈M-15]; Endsley SA [E-12≈M-17≈Q-9]; Endsley & Kiris [E-13≈H-3≈Q-10]; Parasuraman/Sheridan/Wickens [E-6≈M-18]; Logg et al. [C-16≈P-20]; Bellotti & Edwards [G-17≈Q-14]; Nielsen progressive disclosure [G-16≈Q-22]; alert fatigue [H-12≈Q-30]; and several trust/bias results re-cited in V.Q (Parasuraman & Riley [F-2≈Q-17], Skitka 1999 [F-9≈Q-18], Bansal [F-20≈Q-20 context], Buçinca [F-21≈Q-19]). These overlaps are themselves a finding: the same handful of results anchor trust, automation theory, scalable oversight, governance, teaming, and the loop episode — and the new delegation/mechanism clusters (J/K/N/O) corroborate them from outside AI.