Build a self-troubleshooting playbook
This tutorial walks you through composing the full self-troubleshooting pattern from scratch: a parent playbook that invokes a failing sub-playbook, receives a structured diagnosis attached to the error envelope, and routes its downstream behavior based on the diagnosis category.
By the end you will have built a working version of the pattern the spike e2e demonstrates, understand which architectural contract each step exercises, and know how to switch the triage backend per deployment mode.
Estimated time: 30 minutes. Prereqs: completed Quickstart with the local cluster up and gemma3:4b loaded.
What you'll build
Two playbooks:
- failing_canary: a deliberately-failing sub-playbook. It always raises a clear error so the diagnose agent has something to work with.
- canary_with_diagnosis: the parent playbook. It invokes failing_canary via tool: agent framework=noetl with on_failure.troubleshoot: true, then in its next step extracts a useful summary from the attached diagnosis.
When you're done, invoking the parent will:
- Trigger the auto-troubleshoot hook on the failing sub-playbook's error envelope.
- Dispatch the diagnose_execution agent automatically.
- Wait for the diagnosis sub-execution to reach terminal status.
- Fetch the persisted diagnosis from events.
- Attach the {category, confidence, root_cause, suggested_action, source} dict to error.diagnosis.
- Surface the diagnosis through your extract_envelope step into the parent's result.
You'll see the same architectural surface the spike e2e exercises, but with playbooks you wrote yourself.
Step 1 — The failing sub-playbook
Create tutorials/playbooks/failing_canary.yaml:
apiVersion: noetl.io/v2
kind: Playbook
metadata:
name: failing_canary
path: tutorials/failing_canary
description: |
A deliberately-failing canary used by the self-troubleshooting tutorial.
Always raises a clear, structured error.
executor:
profile: distributed
version: noetl-runtime/1
workload: {}
workflow:
- step: simulate_upstream_5xx
tool: http
method: GET
url: "https://this-host-does-not-exist.example.invalid"
timeout_seconds: 5
The tool: http call to a non-resolvable host produces a clean
DNS-failure error envelope without depending on any flaky external
service. If you want a different failure mode, swap to a
tool: python step that raises a typed exception (raise RuntimeError("upstream returned 502")) — the diagnose agent handles
both shapes.
The shape mirrors the test fixture at
repos/e2e/fixtures/playbooks/spike/spike_failing_subflow.yaml
— see that file if you need a richer reference example.
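Before wiring the parent, it helps to picture the envelope this failure will eventually produce. The sketch below is illustrative only: the top-level keys follow the envelope contract described later in this tutorial ({status, framework, entrypoint, error, execution_id}) and the nested keys follow the diagnosis contract ({category, confidence, root_cause, suggested_action, source}), but the specific values (error kind, code, messages) are invented for orientation.

```python
# Illustrative shape only: keys come from the envelope and diagnosis
# contracts in this tutorial; the concrete values are made up.
example_envelope = {
    "status": "error",
    "framework": "noetl",
    "entrypoint": "tutorials/failing_canary",
    "execution_id": "exec-0000",          # placeholder id
    "error": {
        "kind": "http_error",             # hypothetical value
        "code": "dns_resolution_failed",  # hypothetical value
        "diagnosis": {                    # attached by the auto-troubleshoot hook
            "category": "infra",
            "confidence": 0.7,
            "root_cause": "hostname does not resolve",
            "suggested_action": "check DNS / upstream availability",
            "source": "ollama",
        },
    },
}

REQUIRED_DIAGNOSIS_KEYS = {"category", "confidence", "root_cause",
                           "suggested_action", "source"}
assert REQUIRED_DIAGNOSIS_KEYS <= example_envelope["error"]["diagnosis"].keys()
```

Step 6 below shows why this nesting matters: branching logic reads error.diagnosis.category straight out of this structure.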
Step 2 — The parent playbook
Create tutorials/playbooks/canary_with_diagnosis.yaml:
apiVersion: noetl.io/v2
kind: Playbook
metadata:
name: canary_with_diagnosis
path: tutorials/canary_with_diagnosis
description: |
Parent playbook for the self-troubleshooting tutorial. Invokes
failing_canary via the agent runtime with auto-troubleshoot
enabled, then surfaces the diagnosis through extract_envelope.
executor:
profile: distributed
version: noetl-runtime/1
workload:
triage_mcp_server: "mcp/ollama"
triage_model: "gemma3:4b"
confidence_threshold: 0.0
escalate_to: "none"
workflow:
- step: trigger_failure
tool: agent
framework: noetl
entrypoint: tutorials/failing_canary
on_failure:
troubleshoot: true
triage_mcp_server: "{{ workload.triage_mcp_server }}"
triage_model: "{{ workload.triage_model }}"
confidence_threshold: "{{ workload.confidence_threshold }}"
escalate_to: "{{ workload.escalate_to }}"
next:
arcs:
- step: extract_envelope
- step: extract_envelope
tool: python
code: |
envelope = trigger_failure if isinstance(trigger_failure, dict) else {}
error_block = envelope.get("error") if isinstance(envelope.get("error"), dict) else {}
diagnosis = error_block.get("diagnosis") if isinstance(error_block.get("diagnosis"), dict) else {}
result = {
"status": "ok",
"smoke_status": "ok",
"agent_envelope": envelope,
"diagnosis": diagnosis,
"summary": (
"Canary failed; diagnosed as "
f"{diagnosis.get('category', '<no category>')} "
f"(confidence {diagnosis.get('confidence', 0.0):.2f}); "
f"suggested: {diagnosis.get('suggested_action', '<no suggestion>')}"
),
}
args:
trigger_failure: "{{ trigger_failure }}"
next:
arcs:
- step: end
- step: end
type: end
A few points worth noting:
- triage_* keys are the canonical names since v2.36.0 (the worker forwards them generically). The deprecated ollama_mcp_server/ollama_model aliases were removed in ops#42. See Triage Model Selection for the full naming history.
- confidence_threshold: 0.0 plus escalate_to: "none" means the Ollama-only triage owns the diagnosis. Set escalate_to: "openai" or "claude" and raise the threshold if you want escalation when the cheap-first model isn't sure.
- extract_envelope reads error.diagnosis directly. The auto-troubleshoot hook runs inside the worker; by the time extract_envelope executes, the diagnosis is already attached to the envelope your trigger_failure step received.
Step 3 — Register both playbooks
noetl catalog register tutorials/playbooks/failing_canary.yaml
noetl catalog register tutorials/playbooks/canary_with_diagnosis.yaml
Catalog registration is idempotent — re-running bumps the version without breaking anything. Confirm both are visible:
noetl catalog list | grep tutorials
Step 4 — Invoke the parent
EXEC_ID=$(noetl exec tutorials/canary_with_diagnosis \
--runtime distributed \
--payload '{}' \
--json | jq -r '.execution_id')
echo "execution: $EXEC_ID"
# Wait for terminal — typically 5–10 seconds with warm gemma3:4b.
sleep 15
noetl status "$EXEC_ID" --json > /tmp/canary.json
Always go through noetl status rather than raw curl against
/api/executions/<id>. The CLI handles host/port resolution, gateway
auth, and JSON shape conventions consistently across local kind, GKE, and
any future deployment topology — your tutorials and runbooks stay portable
without hardcoded URLs.
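The fixed sleep 15 works for a warm model but is brittle on cold starts. A more robust approach is to poll noetl status until the execution reaches a terminal status. The sketch below shells out to the CLI exactly as above; the terminal-status names are assumptions for illustration, so adjust them to whatever your deployment actually reports.

```python
import json
import subprocess
import time

# Terminal status names are an assumption; check what your deployment's
# `noetl status --json` actually reports and adjust.
TERMINAL = {"COMPLETED", "FAILED", "ERROR", "CANCELLED"}

def fetch_status(exec_id: str) -> dict:
    """Shell out to the CLI so host/port/auth resolution stays portable."""
    out = subprocess.run(
        ["noetl", "status", exec_id, "--json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def wait_for_terminal(exec_id: str, fetch=fetch_status,
                      timeout_s: float = 120.0, poll_s: float = 2.0) -> dict:
    """Poll until the execution reaches a terminal status or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        doc = fetch(exec_id)
        if doc.get("status") in TERMINAL:
            return doc
        time.sleep(poll_s)
    raise TimeoutError(f"execution {exec_id} not terminal after {timeout_s}s")
```

Passing fetch as a parameter keeps the loop testable without a cluster; in a runbook you would just call wait_for_terminal(exec_id).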
Pull the parent's terminal result:
python3 -c "
import json
with open('/tmp/canary.json') as f:
doc = json.load(f)
print(doc.get('result', {}).get('summary'))
"
You should see something like:
Canary failed; diagnosed as transient_5xx (confidence 0.71); suggested: Retry with exponential backoff; check upstream availability if persistent
The exact category and wording will vary — gemma3:4b is doing real
inference on the failure context. Common categories for the
non-resolvable-host failure: transient_5xx, infra, or unknown.
Step 5 — Walk the events
The chronological event flow shows each architectural contract activating in order:
python3 - <<'PY'
import json
with open('/tmp/canary.json') as f:
doc = json.load(f)
for evt in doc.get('events', []):
print(f"{evt.get('event_type','?'):25} {evt.get('node_name','?'):25} {evt.get('status','?')}")
PY
The flow you should see:
command.issued trigger_failure PENDING
command.completed trigger_failure COMPLETED <-- agent envelope ready
command.issued extract_envelope PENDING
call.done extract_envelope COMPLETED
step.exit extract_envelope COMPLETED
command.completed extract_envelope COMPLETED
... end COMPLETED
The interesting event is trigger_failure → command.completed. Inspect
its result.context.error:
python3 - <<'PY'
import json
with open('/tmp/canary.json') as f:
doc = json.load(f)
for evt in doc.get('events', []):
if evt.get('node_name') == 'trigger_failure' and evt.get('event_type') == 'command.completed':
error = evt.get('result', {}).get('context', {}).get('error', {})
diag = error.get('diagnosis', {})
print(f"error.kind = {error.get('kind')}")
print(f"error.code = {error.get('code')}")
print(f"diagnosis.category = {diag.get('category')}")
print(f"diagnosis.source = {diag.get('source')}")
print(f"_meta.diagnosis_fetch = {diag.get('_meta', {}).get('diagnosis_fetch')}")
break
PY
You'll see all five required diagnosis keys plus the _meta.diagnosis_fetch
telemetry block — proof that the projection chokepoint preserved nested
content end-to-end.
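If you want to assert this programmatically (say, in a CI smoke check), a small validator over the saved /tmp/canary.json is enough. This is a sketch written against the event shape printed above; the five required keys come from the diagnosis contract.

```python
REQUIRED_DIAGNOSIS_KEYS = {"category", "confidence", "root_cause",
                           "suggested_action", "source"}

def check_diagnosis(doc: dict) -> dict:
    """Find trigger_failure's command.completed event and validate the
    attached diagnosis. Raises AssertionError with a readable message."""
    for evt in doc.get("events", []):
        if (evt.get("node_name") == "trigger_failure"
                and evt.get("event_type") == "command.completed"):
            diag = (evt.get("result", {}).get("context", {})
                       .get("error", {}).get("diagnosis", {}))
            missing = REQUIRED_DIAGNOSIS_KEYS - diag.keys()
            assert not missing, f"diagnosis missing keys: {sorted(missing)}"
            assert "diagnosis_fetch" in diag.get("_meta", {}), \
                "telemetry block _meta.diagnosis_fetch not preserved"
            return diag
    raise AssertionError("no command.completed event for trigger_failure")
```

Feed it the parsed status document (json.load(open('/tmp/canary.json'))) and it returns the diagnosis dict on success.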
What just happened, architecturally
Each contract you exercised:
Agent envelope contract. tool: agent framework=noetl waited for
failing_canary to reach terminal status, built a full envelope with
{status: "error", framework: "noetl", entrypoint, error, execution_id},
and surfaced that envelope as trigger_failure's result. See
Agent Failure Diagnostics → Gap 1.
kind: agent tool_error carve-out. Even though
envelope.status == "error", the worker did NOT translate that into a
step-level failure for trigger_failure. The envelope IS the contract;
downstream steps inspect it. Without this carve-out, trigger_failure
would have failed and extract_envelope would never have run. See
Agent Orchestration.
Auto-troubleshoot hook (Gap 4.1). Because on_failure.troubleshoot
was true and the envelope's status was error, the dispatcher invoked
automation/agents/troubleshoot/diagnose_execution, waited for that
sub-execution to reach terminal status, fetched the persisted diagnosis
from the persist_diagnosis step's events, and attached the result to
error.diagnosis before returning the envelope to trigger_failure.
Event projection preservation. The worker's _extract_control_context
helper preserved the entire nested error.diagnosis dict, including
the _meta.diagnosis_fetch telemetry, through the strict event-projection
layer. See
Vertex AI Triage Backend → Cloud latency
for the telemetry schema.
Step 6 — Variation: route on diagnosis category
Diagnoses have a category field that maps to a small fixed set:
transient_5xx, auth, rate_limit, bad_request, tool_error,
infra, unknown. Route on it to make the parent playbook
self-correcting:
- step: trigger_failure
tool: agent
framework: noetl
entrypoint: tutorials/failing_canary
on_failure:
troubleshoot: true
triage_mcp_server: "{{ workload.triage_mcp_server }}"
triage_model: "{{ workload.triage_model }}"
case:
- when: "{{ trigger_failure.error.diagnosis.category == 'transient_5xx' }}"
next: retry_after_backoff
- when: "{{ trigger_failure.error.diagnosis.category == 'auth' }}"
next: fail_loud
- when: "{{ trigger_failure.error.diagnosis.category == 'rate_limit' }}"
next: schedule_retry
- default:
next: extract_envelope
- step: retry_after_backoff
tool: python
code: |
import time
time.sleep(5)
result = {"action": "retried after 5s backoff"}
next: extract_envelope
- step: fail_loud
tool: python
code: |
raise RuntimeError(
f"Auth failure diagnosed: {diagnosis.get('root_cause')}"
)
args:
diagnosis: "{{ trigger_failure.error.diagnosis }}"
- step: schedule_retry
tool: python
code: |
result = {"action": "deferred to retry queue"}
next: extract_envelope
The case block evaluates against call.done (since the agent envelope
is the contract; the step itself doesn't fail even though the
sub-execution did). Each branch handles a different diagnosis category
appropriately. This is the building block for self-correcting workflows
— the diagnosis is structured enough that branching logic can be
written against it.
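The same routing logic, expressed outside YAML for clarity. This Python mirror of the case block is purely illustrative (the worker evaluates the YAML, not this code), but it shows why the fixed category set makes branching safe: anything unrecognized falls through to the default arc.

```python
# Mirror of the case block above: diagnosis category -> next step.
# Illustrative only; step names are the ones defined in the YAML.
ROUTES = {
    "transient_5xx": "retry_after_backoff",
    "auth": "fail_loud",
    "rate_limit": "schedule_retry",
}

def next_step(envelope: dict) -> str:
    category = (envelope.get("error", {})
                        .get("diagnosis", {})
                        .get("category"))
    return ROUTES.get(category, "extract_envelope")  # default arc
```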
Step 7 — Switch backends
The same playbook works on a GKE cluster with Vertex AI as the triage
backend. The catalog defaults are deployment-mode-aware: local kind
keeps mcp/ollama, while GKE catalog operators pass overrides per payload.
See Vertex AI Triage Backend
for the full architecture.
To run your tutorial against a Vertex backend:
noetl --server https://gateway.your-gke.example/api/noetl \
exec tutorials/canary_with_diagnosis \
--runtime distributed \
--payload '{
"triage_mcp_server": "mcp/vertex-ai",
"triage_model": "gemini-2.5-flash",
"escalate_to": "none"
}'
The diagnosis source field will read vertex-ai instead of ollama,
and _meta.diagnosis_fetch.elapsed_seconds will land in the 1–3 second
range typical for cloud inference (versus the sub-100ms range for warm
local Ollama). Both are within the adaptive retry budget; both produce
a usable diagnosis dict.
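If you track this telemetry across runs, a rough classifier helps spot regressions. The thresholds below are taken from the ranges quoted above (sub-100 ms warm local, 1-3 s cloud) and are illustrative, not contractual.

```python
def classify_fetch_latency(diagnosis: dict) -> str:
    """Rough latency bucket from _meta.diagnosis_fetch.elapsed_seconds.
    Thresholds are illustrative, taken from the ranges quoted above."""
    elapsed = (diagnosis.get("_meta", {})
                        .get("diagnosis_fetch", {})
                        .get("elapsed_seconds"))
    if elapsed is None:
        return "no-telemetry"
    if elapsed < 0.1:
        return "warm-local"     # typical for warm local Ollama
    if elapsed <= 3.0:
        return "cloud-typical"  # typical for Vertex AI inference
    return "slow"               # worth investigating; may be a cold start
```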
What you just learned
You built a complete self-troubleshooting playbook by composing four
contracts: the agent envelope, the kind: agent carve-out, the
auto-troubleshoot hook, and event projection preservation. The diagnosis
dict is structured well enough to drive case branching, which means
your playbooks can become self-correcting without you writing any
backend-specific failure-handling code.
Next steps
- Add a new MCP backend — the advanced version, where you build a new triage backend behind the same JSON-RPC contract (e.g. AWS Bedrock).
- Triage Model Selection — full reference for model tier choices and the deployment-mode-aware backend pattern.
- Self-Troubleshoot Agent — the canonical reference for how
diagnose_executionworks internally.
Troubleshooting
Diagnosis arrives empty (category, confidence, etc. all missing).
Most often escalate_to is set to a backend that isn't configured
(e.g. "openai" without an API key). The Ollama path silently
short-circuits and returns no diagnosis. Check the diagnose
sub-execution's events for the actual failure.
attempts > 5 on the spike fixture poll counter. Cold-start —
Ollama hadn't loaded gemma3:4b into resident memory before the
diagnose call ran. Subsequent runs reuse warm state. The
cloud-latency profile
documents the expected ranges per backend.
category consistently comes back unknown. The triage model
isn't getting enough signal from the failure context. Two options: raise
confidence_threshold to force escalation, or add more structured
context to your sub-playbook's error path so the model has more to work
with (specifically: a typed exception with a useful message beats a
generic RuntimeError).
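For instance, in your sub-playbook's tool: python step, a typed exception with a descriptive message gives the triage model far more signal than a bare RuntimeError. The exception class and message below are hypothetical, chosen for illustration, not part of any runtime API.

```python
# Hypothetical domain exception: the class name and message are yours to
# choose; what matters is a specific type plus a specific message.
class UpstreamAuthError(RuntimeError):
    pass

def call_upstream():
    # Instead of:  raise RuntimeError("failed")
    raise UpstreamAuthError(
        "POST /v1/ingest returned 401: bearer token rejected by upstream"
    )
```

The exception type and the status code in the message both end up in the failure context the diagnose agent sees, which makes an auth categorization far more likely than unknown.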
extract_envelope sees trigger_failure.error as None. The
agent envelope wasn't attached. Most likely cause: the executor's
_dispatch_troubleshoot_diagnosis didn't run because troubleshoot
wasn't truthy in on_failure. Confirm the on_failure.troubleshoot: true line is present and the kind: agent step's failure is reaching
the auto-troubleshoot path (you'll see a diagnose sub-execution in
/api/executions if it did).