Runtime Reaper and Doctor

NoETL uses two related pieces for self-healing stuck distributed executions:

  • the in-process command reaper in the NoETL server, which performs automatic recovery;
  • doctor (noetl-doctor), the out-of-process runtime reaper surface that monitoring systems and MCP clients can call.

In this documentation, doctor means the runtime reaper / self-healing surface. It is not a generic diagnostic toolkit.

Responsibilities

The command reaper is the durable correctness mechanism. It runs inside deploy/noetl-server, holds a RuntimeLease, scans noetl.command for recoverable non-terminal commands, and republishes command notifications through NATS. Workers still claim commands through the normal NoETL command claim API, so claim_policy remains the authority for whether a republished notification becomes executable work.

Doctor is the monitoring-callable wrapper around this system. It is a small Rust binary that shells out to the noetl Rust CLI and runs bundled NoETL playbooks in local-runtime mode. Doctor can detect stale rows, inspect one execution, run reachability probes, expose an MCP HTTP surface, and ask the server to run a command-reaper sweep when that admin endpoint exists.

Doctor must not write to noetl.command, fabricate completion events, force loop.done, or duplicate claim-policy logic. New correctness rules belong in repos/noetl; new monitoring diagnostics belong in repos/doctor.

Architecture

flowchart LR
Monitor["Monitoring system or MCP client"] --> Doctor["noetl-doctor"]
Doctor --> CLI["noetl run --runtime local"]
CLI --> Playbooks["Bundled doctor playbooks"]
Playbooks --> API["NoETL HTTP API"]
Playbooks --> PG["Postgres noetl database"]
API --> Server["NoETL server"]
Server --> Reaper["in-process command reaper"]
Reaper --> NATS["NATS command notifications"]
NATS --> Workers["NoETL workers"]
Workers --> Claim["/api/commands/{id}/claim"]
Claim --> Policy["claim_policy"]
Policy --> PG

Command Reaper Behavior

The server-side command reaper scans noetl.command, not the event log, for commands attached to executions that have not reached a terminal event.

It republishes two classes of rows:

  • CLAIMED or RUNNING commands whose worker is missing, non-ready, or stale, or whose healthy-worker claim has exceeded the configured hard timeout;
  • old PENDING commands that may have missed their NATS notification.

The reaper does not mark a command complete and does not skip worker claim arbitration. It only republishes the original command notification; the regular claim endpoint and claim policy decide the next state.
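The selection rule above can be sketched as a pure predicate. A minimal Python sketch, assuming a hypothetical row shape — the field names here are illustrative, not the real noetl.command columns; only the two republish classes and the default thresholds come from this page:

```python
from dataclasses import dataclass

# Hypothetical row shape; the real noetl.command schema differs.
@dataclass
class CommandRow:
    status: str                  # PENDING, CLAIMED, RUNNING, or a terminal state
    age_seconds: float           # time since the row entered its current status
    worker_ready: bool           # claiming worker exists and reports ready
    worker_heartbeat_age: float  # seconds since the claiming worker's heartbeat

def should_republish(row: CommandRow,
                     worker_stale: float = 60,
                     healthy_hard_timeout: float = 1800,
                     pending_retry: float = 60) -> bool:
    """Mirror the two republish classes: bad/timed-out claims and old PENDING rows."""
    if row.status in ("CLAIMED", "RUNNING"):
        worker_bad = (not row.worker_ready) or row.worker_heartbeat_age > worker_stale
        return worker_bad or row.age_seconds > healthy_hard_timeout
    if row.status == "PENDING":
        return row.age_seconds > pending_retry
    return False  # terminal and unknown states are left alone
```

Note that a True result only means the original notification is republished; claim arbitration still happens at the claim endpoint.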

Useful environment knobs:

Variable | Default | Purpose
NOETL_COMMAND_REAPER_ENABLED | true | Enable the server reaper.
NOETL_COMMAND_REAPER_INTERVAL_SECONDS | 60 | Sweep cadence.
NOETL_COMMAND_REAPER_WORKER_STALE_SECONDS | 60 | Worker heartbeat stale threshold.
NOETL_COMMAND_REAPER_HEALTHY_HARD_TIMEOUT_SECONDS | 1800 | Hard timeout even when a worker still appears healthy.
NOETL_COMMAND_REAPER_PENDING_RETRY_SECONDS | 60 | Age before republishing stranded PENDING commands.
NOETL_COMMAND_REAPER_MAX_PER_RUN | 100 | Maximum rows recovered in one sweep.

Some deployments also set NOETL_COMMAND_CLAIM_HEALTHY_WORKER_HARD_TIMEOUT_SECONDS so the claim endpoint and the reaper use the same hard-timeout posture for apparently healthy workers.
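Resolving these knobs might look like the following sketch. The variable names and defaults come from the table above; the coercion rules (boolean parsing, integer parsing) are assumptions about how the server reads them:

```python
import os

def reaper_config(env=os.environ):
    """Resolve reaper settings from env vars, falling back to documented defaults."""
    def get_int(name, default):
        return int(env.get(name, default))
    return {
        "enabled": str(env.get("NOETL_COMMAND_REAPER_ENABLED", "true")).lower() == "true",
        "interval": get_int("NOETL_COMMAND_REAPER_INTERVAL_SECONDS", 60),
        "worker_stale": get_int("NOETL_COMMAND_REAPER_WORKER_STALE_SECONDS", 60),
        "healthy_hard_timeout": get_int("NOETL_COMMAND_REAPER_HEALTHY_HARD_TIMEOUT_SECONDS", 1800),
        "pending_retry": get_int("NOETL_COMMAND_REAPER_PENDING_RETRY_SECONDS", 60),
        "max_per_run": get_int("NOETL_COMMAND_REAPER_MAX_PER_RUN", 100),
    }
```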

Doctor CLI

Doctor is implemented in noetl/doctor as a Rust crate named noetl-doctor. It bundles YAML playbooks under playbooks/ and runs them through:

noetl run <playbook> --runtime local --set key=value

Common commands:

noetl-doctor detect \
--noetl-url http://localhost:8082 \
--pg-dsn postgresql://noetl@localhost:54321/noetl \
--stale-seconds 300

noetl-doctor reachability \
--noetl-url http://localhost:8082 \
--pg-dsn postgresql://noetl@localhost:54321/noetl

noetl-doctor repair trigger-reaper \
--noetl-url http://localhost:8082

noetl-doctor repair run-playbook ./playbooks/inspect_stale_commands.yaml \
--set execution_id=627209422065893596 \
--set pg_dsn=postgresql://noetl@localhost:54321/noetl

Exit codes are stable for monitoring:

Exit code | Meaning
0 | OK or repaired/no-op repair.
2 | Anomaly detected.
3 | Doctor failed to run.

The CLI always emits JSON shaped like:

{
  "action": "detect",
  "severity": "ok",
  "generated_at": "2026-05-15T13:00:00Z",
  "data": {}
}

The exact data payload is playbook-specific.
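A monitoring wrapper can map the Outcome JSON back onto the documented exit codes. A hedged sketch: only "ok" appears in the example above, so the other severity strings here ("repaired", "anomaly", "error") are assumed vocabulary, not part of the confirmed contract:

```python
import json

# Assumed severity vocabulary mapped to the documented exit codes:
# 0 = OK or repaired, 2 = anomaly detected, 3 = doctor failed to run.
SEVERITY_EXIT = {"ok": 0, "repaired": 0, "anomaly": 2, "error": 3}

def classify(outcome_json: str) -> int:
    """Translate a doctor Outcome payload into a stable monitoring exit code."""
    try:
        outcome = json.loads(outcome_json)
    except json.JSONDecodeError:
        return 3  # unreadable output is treated as a doctor failure
    return SEVERITY_EXIT.get(outcome.get("severity"), 3)
```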

Bundled Playbooks

Doctor playbooks intentionally use the Rust CLI local-runtime subset. They are authored as kind: shell steps that call curl, psql, and jq. This keeps doctor independent of server-only tool kinds such as postgres, python, or noop.

Playbook | Action | Purpose
detect_stuck_executions.yaml | detect | Read-only scan for stuck executions and stale command rows.
inspect_stale_commands.yaml | inspect | Read-only inspection for one execution.
reachability_smoke.yaml | probe | HTTP and Postgres reachability plus worker-pool snapshot.
trigger_command_reaper.yaml | trigger | Best-effort server reaper sweep nudge; 404 is treated as no-op on older servers.
provision_doctor_mcp.yaml | deploy, redeploy, status, destroy, logs | Provision the doctor MCP server into Kubernetes.

Every playbook has a workload.action dispatch field and a safe help path for unknown lifecycle actions.
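The dispatch-with-safe-help pattern can be sketched outside YAML. Handler names and return strings here are hypothetical; only the behavior mirrors the playbooks — known actions run, unknown lifecycle actions fall through to a help path instead of failing:

```python
def dispatch(action, handlers):
    """Run the handler for a known action; unknown actions get a safe help response."""
    def show_help():
        return "help: supported actions are " + ", ".join(sorted(handlers))
    return handlers.get(action, show_help)()

# Hypothetical handlers standing in for playbook steps.
handlers = {
    "deploy": lambda: "deploying doctor MCP",
    "status": lambda: "status: running",
}
```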

MCP Surface

When run as an HTTP service:

noetl-doctor mcp serve --host 0.0.0.0 --port 8765

Doctor exposes:

GET /healthz
GET /tools
POST /tools/detect/invoke
POST /tools/reachability/invoke
POST /tools/repair_trigger_reaper/invoke

Each tool returns the same Outcome JSON contract as the CLI.
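A client might shape invoke calls as in this sketch. The tool paths come from the list above; the JSON request-body schema is an assumption, since this page does not specify it:

```python
import json

BASE = "http://localhost:8765"  # doctor MCP address from the serve example above

def invoke_request(tool, arguments=None):
    """Build a POST description for /tools/<tool>/invoke (body shape is assumed)."""
    return {
        "method": "POST",
        "url": "{}/tools/{}/invoke".format(BASE, tool),
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(arguments or {}),
    }
```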

Verification

Check that the server-side reaper is alive:

kubectl -n noetl logs deploy/noetl-server --tail=200 \
| grep -E "COMMAND-REAPER|RuntimeLease"

Expected startup evidence:

[COMMAND-REAPER] Started (interval=..., worker_stale=..., hard_timeout=...)

Check the command table after a recovery-sensitive run:

SELECT status, count(*)
FROM noetl.command
WHERE execution_id = '<execution_id>'
GROUP BY status
ORDER BY status;

For a successful run, the final state should not leave PENDING, CLAIMED, or RUNNING rows.
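Interpreting the query result programmatically can be sketched as below; the non-terminal status vocabulary is taken from this page, and any other status is assumed terminal:

```python
# Statuses this page treats as unfinished work.
NON_TERMINAL = {"PENDING", "CLAIMED", "RUNNING"}

def stranded_rows(status_counts):
    """Return only the non-terminal buckets; an empty result means a clean final state."""
    return {s: n for s, n in status_counts.items() if s in NON_TERMINAL and n > 0}
```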

The PFT v2 validation run on 2026-05-15 proved the recovery path: the server command reaper twice found 20 orphaned fetch_mds_details:task_sequence commands, republished all 20 each time, and the execution completed with all 10 facilities validated.