Human-in-the-loop is not a setting

The default behavior of most agentic frameworks goes like this: the agent plans, the agent acts, the agent reports back. If you want a human to approve actions before they happen, you wire in a callback. The callback fires for every tool call, the user clicks yes or no, the agent continues.

That works for demos. It collapses the moment the agent is allowed to do more than three things in sequence, because the user is now adjudicating context-free questions like “approve PUT to /api/now/table/sys_dictionary?” and the answer is “I have no idea, what’s the change going to do?”

Serac is built on the opposite premise. The reviewer step is not a callback you opt into. It’s a node in the orchestration graph, and every write tool flows through it by construction. The agent can’t bypass it because there’s no path around it in the graph. Autonomy is the configurable layer on top, not the floor underneath.

The graph, in shape

The Serac orchestrator is a small set of node types arranged into a directed graph for each agent run:

plan — the LLM produces a plan as structured intent (tools to call, in what order, with what arguments)
decompose — the planner output is unpacked into individual tool calls
classify — each call is labeled read/write, scope, side-effect class, and which platform domain it touches
review — every call labeled “write” pauses here; the user sees the intent, the arguments, the predicted scope, and approves or rejects
execute — only reachable from review.approved or from classify if the call is read-only
summarize — the response is summarized back into the planner state for the next iteration
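The reachability rule can be sketched as a small adjacency map. This is an illustration under stated assumptions, not Serac's actual code: the node names come from the list above, while EDGES, Call, and next_node are hypothetical, and sending a rejected write straight to summarize is a guess about what happens after a rejection.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical edge table. Node names follow the list above; the structure is an assumption.
EDGES = {
    "plan": {"decompose"},
    "decompose": {"classify"},
    "classify": {"review", "execute"},   # execute is chosen only for read-only calls
    "review": {"execute", "summarize"},  # assumption: a rejected write skips execute
    "execute": {"summarize"},
    "summarize": {"plan"},
}

@dataclass(frozen=True)
class Call:
    tool: str
    is_write: bool

def next_node(node: str, call: Call, approved: Optional[bool] = None) -> str:
    """Return the successor node; a write can reach execute only via an approved review."""
    if node == "classify":
        target = "review" if call.is_write else "execute"
    elif node == "review":
        if approved is None:
            raise ValueError("review requires an explicit decision")
        target = "execute" if approved else "summarize"
    else:
        (target,) = EDGES[node]  # every other node has exactly one successor
    assert target in EDGES[node]  # the router cannot invent an edge
    return target
```

The point of the sketch is that there is no branch in next_node that sends a write from classify to execute, so skipping review is not something a caller can configure.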

The review node is not optional. It’s not a hook you register. The classifier cannot label a write call as read-only — the classifier has a static list of writeable surfaces (table.create, table.update, table.delete, update_set.apply, script.run, anything that hits a PUT/POST/PATCH/DELETE endpoint, anything that fires a background script). If the LLM tries to route a write through a tool labeled read, that gets rejected at validation.
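A minimal sketch of a static write-surface check, assuming the tool names and HTTP methods listed above. The function names classify and validate_label are hypothetical, and background-script detection is left out because the text does not say how it is done.

```python
from typing import Optional

# Write surfaces named in the text above; a real list would be longer.
WRITE_TOOLS = frozenset({
    "table.create", "table.update", "table.delete",
    "update_set.apply", "script.run",
})
WRITE_METHODS = frozenset({"PUT", "POST", "PATCH", "DELETE"})

def classify(tool: str, http_method: Optional[str] = None) -> str:
    """Label a call from static rules only; the LLM's own label is never consulted."""
    if tool in WRITE_TOOLS:
        return "write"
    if http_method is not None and http_method.upper() in WRITE_METHODS:
        return "write"
    return "read"

def validate_label(tool: str, http_method: Optional[str], declared: str) -> str:
    """Reject a call whose declared label understates what the static rules say."""
    actual = classify(tool, http_method)
    if actual == "write" and declared != "write":
        raise ValueError(f"{tool} is a write surface but was declared {declared!r}")
    return actual
```

Because the label is derived from the tool name and method rather than from the plan, a mislabeled call fails before it reaches the graph.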

This is the part where “autonomous mode” becomes a configuration question rather than an architectural one. Autonomous mode in Serac means the user pre-approves a batch of writes that share an intent. The reviewer step still runs. The user clicks “approve all five of these update-set adds” once instead of five times. The graph doesn’t change.
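One way to picture batch approval is grouping pending writes by their stated intent and asking once per group. This is a sketch only: batch_review, the (intent, call) tuple shape, and the ask callback are all hypothetical names, not Serac's API.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def batch_review(
    writes: List[Tuple[str, dict]],
    ask: Callable[[str, List[dict]], bool],
) -> List[dict]:
    """Group writes by shared intent and ask the reviewer once per group.

    Every write still passes through a review decision; only the number
    of clicks changes.
    """
    groups: Dict[str, List[dict]] = defaultdict(list)
    for intent, call in writes:
        groups[intent].append(call)

    approved: List[dict] = []
    for intent, calls in groups.items():
        if ask(intent, calls):  # one decision covers all calls with this intent
            approved.extend(calls)
    return approved
```

A write with a different intent lands in its own group and gets its own prompt, so a batch approval never silently covers unrelated calls.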

Why this matters in practice

Consider a generic agent framework where approvals are a middleware hook. The agent has a tool called update_record. The hook fires before update_record is called. The user gets a prompt: “approve update_record?”

Now the agent decides — because the LLM is unconstrained — to instead call execute_script with a body that does the same update. The hook may or may not fire, depending on whether the framework’s authors thought to label execute_script as a write surface. Often they didn’t. The script runs.

In Serac, execute_script is labeled write by classifier rules, the arguments are inspected (the script body is one of them), and the reviewer step sees the literal script the agent intends to run. Not a description. The body. If the script body contains a current.update() call, the reviewer knows it’s a write. If the agent tries to obfuscate by base64-encoding the script body, the classifier rejects the call at validation because base64 input is a red flag for a script tool. None of this is the LLM’s choice.
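A sketch of what inspecting a script argument might look like. The base64 heuristic and the pattern list are assumptions made for illustration; the text only says that base64 input is rejected and that a current.update() call is recognized as a write. inspect_script and looks_base64 are hypothetical names.

```python
import base64
import re

# Illustrative pattern for record-mutating calls in a script body (assumed, not exhaustive).
RECORD_WRITE = re.compile(r"\.(update|insert|deleteRecord)\s*\(")

def looks_base64(body: str) -> bool:
    """Heuristic: a long string that decodes cleanly as base64 is treated as encoded."""
    compact = "".join(body.split())
    if len(compact) < 16 or len(compact) % 4 != 0:
        return False
    try:
        base64.b64decode(compact, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        return False
    return True

def inspect_script(body: str) -> dict:
    """Label a script call and return the literal body for the reviewer."""
    if looks_base64(body):
        raise ValueError("script body looks base64-encoded; rejected at validation")
    return {
        "label": "write",  # script tools are always write surfaces
        "contains_record_write": bool(RECORD_WRITE.search(body)),
        "detail": body,    # the reviewer sees the body itself, not a summary
    }
```

Ordinary script source contains characters such as parentheses and semicolons that are outside the base64 alphabet, which is why a clean decode is a usable signal here.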

The pattern: the classifier is platform-aware, and the reviewer step shows the user real intent, not a generic “approve action?” prompt.

The approval contract

Every reviewer prompt in Serac has the same four-field shape:

INTENT      What the agent says it's trying to do
SURFACE     What it will touch (table, scope, record count)
PREDICTED   What the agent expects to happen after
DETAIL      The literal tool call arguments

So a real approval looks like:

INTENT      Move ticket INC0012345 to "In Progress" and assign to me
SURFACE     incident table · global scope · 1 record · assigned_to + state
PREDICTED   Notification fires to assignee · SLA timer restarts on state change
DETAIL      table=incident · sys_id=a8c2... · payload={state: 2, assigned_to: ...}

The user reads INTENT to decide if it’s the right thing to do. They read SURFACE to decide if they have authority to do it. They read PREDICTED to anticipate side effects. They read DETAIL only when one of the first three feels off.

The reviewer step is in the graph, but the contract is the four fields. If a tool author skips PREDICTED, the tool fails validation. We don’t ship tools with empty fields.
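The four-field contract maps naturally onto a small record type that refuses empty fields. A minimal sketch, assuming a hypothetical ReviewPrompt class; the field names follow the contract above, and the 12-column layout mirrors the examples.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ReviewPrompt:
    intent: str     # what the agent says it is trying to do
    surface: str    # what it will touch
    predicted: str  # what the agent expects to happen after
    detail: str     # the literal tool call arguments

    def __post_init__(self) -> None:
        # A prompt with any blank field fails validation, PREDICTED included.
        empty = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if empty:
            raise ValueError(f"empty contract fields: {', '.join(empty)}")

    def render(self) -> str:
        """Render the prompt as label/value lines in a fixed-width layout."""
        return "\n".join(
            f"{f.name.upper():<12}{getattr(self, f.name)}" for f in fields(self)
        )
```

Putting the check in the constructor means an incomplete prompt cannot exist at all, which is a stronger guarantee than validating at display time.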

What this costs

The reviewer step adds latency. We measured it on the 20 most common ServiceNow workflows: median 6 seconds of human deliberation per approval, p95 around 35 seconds for anything touching multiple records.

That’s real. If you’re the kind of buyer who measures agent throughput in calls-per-minute, Serac will lose that benchmark to a system that just executes whatever the LLM emits.

We made it acceptable by surfacing intent inline. The reviewer prompt is not a modal that pulls the user into a separate UI. It’s a card in the same conversation stream where the agent is reasoning. The user is already reading the agent’s plan; the approval prompt extends the same thread. There’s no context switch, no “open the dashboard,” no review queue.

We also made the first approval in a session richer than the rest. The first time the agent says “I want to update incident records,” the reviewer shows a 30-second explainer: what fields are touched, what business rules will fire, what the rollback path looks like. The next nine approvals of the same shape are one-line confirmations. The cost-per-decision drops as the user builds context.

The thing we won’t do: collapse approvals into a single “approve the whole run” prompt. That puts us back where the generic frameworks live. The whole point is that each write is a discrete decision, even if the decision is made fast.

Contrast: what frameworks-with-callbacks miss

A few patterns we kept seeing in 2024-2025 agent frameworks:

Callbacks fire before argument resolution. The user approves “call update_incident” but the LLM resolved an argument from assigned_to: "the user" to assigned_to: "sys_id of the admin" via a different lookup. The user approved a name. The system executed a sys_id. The reviewer in Serac sees the resolved arguments.
Approvals are async-only. The agent emits a “pending approval” event, the UI shows a notification, the user clicks back to the agent at some point. In the meantime, the agent waits, the LLM context cools, and reasoning quality drops because the planner state has gone stale. Serac’s reviewer step is synchronous to the run — if the user is AFK, the run pauses, but the planner state is preserved verbatim.
No classification step at all. Tools are tools; the framework doesn’t know which ones write. The hook author has to remember to wrap every new tool. The first one missed is the production incident.
Approvals on the wrong abstraction. Some frameworks ask the user to approve at the prompt level (“approve agent run”) or at the token level (“review every LLM completion”). The first is too coarse, the second is unusable. The right level is the tool call, with the resolved arguments — and that’s the only level Serac asks the user to look at.
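The ordering behind the first point, resolve first, then review, then execute exactly what was reviewed, can be sketched in a few lines. The names run_write, resolve, review, and execute are hypothetical callbacks, not a real framework API.

```python
import copy
from typing import Any, Callable, Optional

def run_write(
    tool: str,
    raw_args: dict,
    resolve: Callable[[dict], dict],
    review: Callable[[str, dict], bool],
    execute: Callable[[str, dict], Any],
) -> Optional[Any]:
    """Resolve arguments before review, then execute only the reviewed values."""
    # Snapshot the resolved arguments so nothing can change them after review.
    resolved = copy.deepcopy(resolve(raw_args))

    # The reviewer gets its own copy; mutating it cannot alter what runs.
    if not review(tool, copy.deepcopy(resolved)):
        return None  # rejected writes never reach execute

    return execute(tool, resolved)
```

The reviewer never sees raw_args, so a name that resolves to an unexpected sys_id shows up in the prompt rather than after the fact.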

What we still get wrong

Reviewer fatigue is real. If a run produces 50 approvals, the user clicks through the last 30 without reading. We’ve seen it in our own usage logs. We’re not going to pretend we’ve solved it.

The current mitigation is batch approval for runs that the user has explicitly marked as repetitive (“nightly cleanup of expired sessions” gets approved once per scheduled run, not per cleanup row). The longer-term mitigation is to let the user define approval policies declaratively — “auto-approve any read on cmdb_ci_*, hold for review on any write to sys_user, escalate to a second reviewer on any change to sys_dictionary.” The policy engine is in our v1.2 milestone. It’s not in v1.0.
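Since the policy engine is not shipped yet, the following is purely a sketch of what declarative rules like the three quoted above could look like. The tuple format, the action names, and the decide function are all assumptions; table patterns are matched with shell-style wildcards from Python's standard fnmatch module.

```python
from fnmatch import fnmatchcase

# Hypothetical rule format: (call kind, table pattern, action). First match wins.
POLICIES = [
    ("read", "cmdb_ci_*", "auto_approve"),
    ("write", "sys_user", "review"),
    ("write", "sys_dictionary", "escalate"),
]

def decide(kind: str, table: str) -> str:
    """Return the action for a call, defaulting to review for any unmatched write."""
    for rule_kind, pattern, action in POLICIES:
        if rule_kind == kind and fnmatchcase(table, pattern):
            return action
    # Fallbacks mirror the graph: writes are reviewed, reads are not.
    return "review" if kind == "write" else "auto_approve"
```

The fallback matters more than the rules: a write to a table nobody wrote a policy for still lands in front of a reviewer.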

There’s also a class of approvals the reviewer step doesn’t handle well: changes that look small in isolation but have cascading effects. Disabling a single business rule looks like a one-field update. The effect is that half the platform behaves differently. The PREDICTED field tries to surface this, but it’s only as good as the platform map we ship, and the platform map doesn’t know about your custom rules.

We’re working on the cascading-effect prediction. It’s a hard problem and we won’t claim to have it solved until it actually works.

The pattern

Approvals work when they’re part of the graph, not part of the configuration. When they survive every code path the agent can take, including the ones the agent invents. When the contract surfaced to the human is rich enough to make the approval meaningful, and consistent enough that the human builds a fast mental model.

You can’t bolt this on. Or you can, but the result is “approval theater” — the prompts fire, the clicks happen, the agent does something else underneath. We picked the harder version because the easier one doesn’t actually protect anything.