
Build a self-troubleshooting playbook

This tutorial walks you through composing the full self-troubleshooting pattern from scratch: a parent playbook that invokes a failing sub-playbook, receives a structured diagnosis attached to the error envelope, and branches its downstream behavior on the diagnosis category.

By the end you will have built a working version of the pattern the spike e2e demonstrates, understand which architectural contract each step exercises, and know how to switch the triage backend per deployment mode.

Estimated time: 30 minutes. Prereqs: completed Quickstart with the local cluster up and gemma3:4b loaded.

What you'll build

Two playbooks:

  1. failing_canary — a deliberately-failing sub-playbook. Always raises a clear error so the diagnose agent has something to work with.
  2. canary_with_diagnosis — the parent playbook. Invokes failing_canary via tool: agent framework=noetl with on_failure.troubleshoot: true, then in its next step extracts a useful summary from the attached diagnosis.

When you're done, invoking the parent will:

  • Trigger the auto-troubleshoot hook on the failing sub-playbook's error envelope.
  • Dispatch the diagnose_execution agent automatically.
  • Wait for the diagnosis sub-execution to reach terminal status.
  • Fetch the persisted diagnosis from events.
  • Attach the {category, confidence, root_cause, suggested_action, source} dict to error.diagnosis.
  • Surface the diagnosis through your extract_envelope step into the parent's result.

You'll see the same architectural surface the spike e2e exercises, but with playbooks you wrote yourself.

Step 1 — The failing sub-playbook

Create tutorials/playbooks/failing_canary.yaml:

apiVersion: noetl.io/v2
kind: Playbook
metadata:
  name: failing_canary
  path: tutorials/failing_canary
  description: |
    A deliberately-failing canary used by the self-troubleshooting tutorial.
    Always raises a clear, structured error.

executor:
  profile: distributed
  version: noetl-runtime/1

workload: {}

workflow:
  - step: simulate_upstream_5xx
    tool: http
    method: GET
    url: "https://this-host-does-not-exist.example.invalid"
    timeout_seconds: 5

The tool: http call to a non-resolvable host produces a clean DNS-failure error envelope without depending on any flaky external service. If you want a different failure mode, swap to a tool: python step that raises a typed exception (raise RuntimeError("upstream returned 502")) — the diagnose agent handles both shapes.
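
If you go the python route instead, the failing step would look roughly like this (a sketch only; the step name is illustrative and the rest of the playbook stays the same):

workflow:
  - step: simulate_typed_failure
    tool: python
    code: |
      # Deliberately raise a typed exception with a clear message so the
      # diagnose agent has concrete signal to classify.
      raise RuntimeError("upstream returned 502")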

The shape mirrors the test fixture at repos/e2e/fixtures/playbooks/spike/spike_failing_subflow.yaml — see that file if you need a richer reference example.

Step 2 — The parent playbook

Create tutorials/playbooks/canary_with_diagnosis.yaml:

apiVersion: noetl.io/v2
kind: Playbook
metadata:
  name: canary_with_diagnosis
  path: tutorials/canary_with_diagnosis
  description: |
    Parent playbook for the self-troubleshooting tutorial. Invokes
    failing_canary via the agent runtime with auto-troubleshoot
    enabled, then surfaces the diagnosis through extract_envelope.

executor:
  profile: distributed
  version: noetl-runtime/1

workload:
  triage_mcp_server: "mcp/ollama"
  triage_model: "gemma3:4b"
  confidence_threshold: 0.0
  escalate_to: "none"

workflow:
  - step: trigger_failure
    tool: agent
    framework: noetl
    entrypoint: tutorials/failing_canary
    on_failure:
      troubleshoot: true
      triage_mcp_server: "{{ workload.triage_mcp_server }}"
      triage_model: "{{ workload.triage_model }}"
      confidence_threshold: "{{ workload.confidence_threshold }}"
      escalate_to: "{{ workload.escalate_to }}"
    next:
      arcs:
        - step: extract_envelope

  - step: extract_envelope
    tool: python
    code: |
      envelope = trigger_failure if isinstance(trigger_failure, dict) else {}
      error_block = envelope.get("error") if isinstance(envelope.get("error"), dict) else {}
      diagnosis = error_block.get("diagnosis") if isinstance(error_block.get("diagnosis"), dict) else {}

      result = {
          "status": "ok",
          "smoke_status": "ok",
          "agent_envelope": envelope,
          "diagnosis": diagnosis,
          "summary": (
              "Canary failed; diagnosed as "
              f"{diagnosis.get('category', '<no category>')} "
              f"(confidence {diagnosis.get('confidence', 0.0):.2f}); "
              f"suggested: {diagnosis.get('suggested_action', '<no suggestion>')}"
          ),
      }
    args:
      trigger_failure: "{{ trigger_failure }}"
    next:
      arcs:
        - step: end

  - step: end
    type: end

A few points worth noting:

  • triage_* keys are the canonical names since v2.36.0 (the worker forwards them generically). The deprecated ollama_mcp_server / ollama_model aliases were removed in ops#42. See Triage Model Selection for the full naming history.
  • confidence_threshold: 0.0 + escalate_to: "none" means the Ollama-only triage owns the diagnosis. Set escalate_to: "openai" or "claude" and raise the threshold if you want escalation when the cheap-first model isn't sure; a sketch follows this list.
  • extract_envelope reads error.diagnosis directly. The auto-troubleshoot hook runs inside the worker — by the time extract_envelope executes, the diagnosis is already attached to the envelope your trigger_failure step received.
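
A rough sketch of that escalation variant (the threshold value is arbitrary; the backend name must be one your deployment actually has credentials for):

workload:
  triage_mcp_server: "mcp/ollama"
  triage_model: "gemma3:4b"
  confidence_threshold: 0.6     # escalate whenever local triage confidence falls below 0.6
  escalate_to: "openai"         # requires an OpenAI API key to be configured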

Step 3 — Register both playbooks

noetl catalog register tutorials/playbooks/failing_canary.yaml
noetl catalog register tutorials/playbooks/canary_with_diagnosis.yaml

Catalog registration is idempotent — re-running bumps the version without breaking anything. Confirm both are visible:

noetl catalog list | grep tutorials

Step 4 — Invoke the parent

EXEC_ID=$(noetl exec tutorials/canary_with_diagnosis \
  --runtime distributed \
  --payload '{}' \
  --json | jq -r '.execution_id')
echo "execution: $EXEC_ID"

# Wait for terminal — typically 5–10 seconds with warm gemma3:4b.
sleep 15
noetl status "$EXEC_ID" --json > /tmp/canary.json

Always go through noetl status rather than raw curl against /api/executions/<id>. The CLI handles host/port resolution, gateway auth, and JSON shape conventions consistently across local kind, GKE, and any future deployment topology — your tutorials and runbooks stay portable without hardcoded URLs.

Pull the parent's terminal result:

python3 -c "
import json
with open('/tmp/canary.json') as f:
    doc = json.load(f)
print(doc.get('result', {}).get('summary'))
"

You should see something like:

Canary failed; diagnosed as transient_5xx (confidence 0.71); suggested: Retry with exponential backoff; check upstream availability if persistent

The exact category and wording will vary — gemma3:4b is doing real inference on the failure context. Common categories for the non-resolvable-host failure: transient_5xx, infra, or unknown.
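
To see the full structured diagnosis rather than just the summary line, dump the diagnosis key that extract_envelope copied into the parent's result (same status JSON as above):

python3 -c "
import json
with open('/tmp/canary.json') as f:
    doc = json.load(f)
print(json.dumps(doc.get('result', {}).get('diagnosis', {}), indent=2))
"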

Step 5 — Walk the events

The chronological event flow shows each architectural contract activating in order:

python3 - <<'PY'
import json
with open('/tmp/canary.json') as f:
    doc = json.load(f)

for evt in doc.get('events', []):
    print(f"{evt.get('event_type','?'):25} {evt.get('node_name','?'):25} {evt.get('status','?')}")
PY

The flow you should see:

command.issued            trigger_failure           PENDING
command.completed         trigger_failure           COMPLETED   <-- agent envelope ready
command.issued            extract_envelope          PENDING
call.done                 extract_envelope          COMPLETED
step.exit                 extract_envelope          COMPLETED
command.completed         extract_envelope          COMPLETED
...                       end                       COMPLETED

The interesting event is trigger_failure → command.completed. Inspect its result.context.error:

python3 - <<'PY'
import json
with open('/tmp/canary.json') as f:
    doc = json.load(f)

for evt in doc.get('events', []):
    if evt.get('node_name') == 'trigger_failure' and evt.get('event_type') == 'command.completed':
        error = evt.get('result', {}).get('context', {}).get('error', {})
        diag = error.get('diagnosis', {})
        print(f"error.kind = {error.get('kind')}")
        print(f"error.code = {error.get('code')}")
        print(f"diagnosis.category = {diag.get('category')}")
        print(f"diagnosis.source = {diag.get('source')}")
        print(f"_meta.diagnosis_fetch = {diag.get('_meta', {}).get('diagnosis_fetch')}")
        break
PY

You'll see all five required diagnosis keys plus the _meta.diagnosis_fetch telemetry block — proof that the projection chokepoint preserved nested content end-to-end.
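
For reference, a representative error.diagnosis payload from a warm local run looks roughly like this (values are illustrative; the root_cause wording is whatever the model produced, and the telemetry block may carry more fields than the attempts and elapsed_seconds shown here):

{
  "category": "transient_5xx",
  "confidence": 0.71,
  "root_cause": "DNS resolution failed for this-host-does-not-exist.example.invalid",
  "suggested_action": "Retry with exponential backoff; check upstream availability if persistent",
  "source": "ollama",
  "_meta": {
    "diagnosis_fetch": {
      "attempts": 1,
      "elapsed_seconds": 0.08
    }
  }
}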

What just happened, architecturally

Each contract you exercised:

Agent envelope contract. tool: agent framework=noetl waited for failing_canary to reach terminal status, built a full envelope with {status: "error", framework: "noetl", entrypoint, error, execution_id}, and surfaced that envelope as trigger_failure's result. See Agent Failure Diagnostics → Gap 1.

kind: agent tool_error carve-out. Even though envelope.status == "error", the worker did NOT translate that into a step-level failure for trigger_failure. The envelope IS the contract; downstream steps inspect it. Without this carve-out, trigger_failure would have failed and extract_envelope would never have run. See Agent Orchestration.

Auto-troubleshoot hook (Gap 4.1). Because on_failure.troubleshoot was true and the envelope's status was error, the dispatcher invoked automation/agents/troubleshoot/diagnose_execution, waited for that sub-execution to reach terminal status, fetched the persisted diagnosis from the persist_diagnosis step's events, and attached the result to error.diagnosis before returning the envelope to trigger_failure.
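
In rough pseudocode, the sequence that paragraph describes is the following (illustrative only; the helper names are invented for clarity, not the worker's actual API):

# Sketch of the auto-troubleshoot sequence; not the real implementation.
def auto_troubleshoot(envelope, on_failure):
    if not on_failure.get("troubleshoot") or envelope.get("status") != "error":
        return envelope
    sub_id = dispatch("automation/agents/troubleshoot/diagnose_execution",
                      context=envelope, triage_config=on_failure)
    wait_until_terminal(sub_id)                    # block on the diagnose sub-execution
    diagnosis = fetch_persisted_diagnosis(sub_id)  # read the persist_diagnosis step's events
    envelope["error"]["diagnosis"] = diagnosis     # attach before returning to trigger_failure
    return envelope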

Event projection preservation. The worker's _extract_control_context helper preserved the entire nested error.diagnosis dict, including the _meta.diagnosis_fetch telemetry, through the strict event-projection layer. See Vertex AI Triage Backend → Cloud latency for the telemetry schema.

Step 6 — Variation: route on diagnosis category

Diagnoses have a category field that maps to a small fixed set: transient_5xx, auth, rate_limit, bad_request, tool_error, infra, unknown. Route on it to make the parent playbook self-correcting:

  - step: trigger_failure
    tool: agent
    framework: noetl
    entrypoint: tutorials/failing_canary
    on_failure:
      troubleshoot: true
      triage_mcp_server: "{{ workload.triage_mcp_server }}"
      triage_model: "{{ workload.triage_model }}"
    case:
      - when: "{{ trigger_failure.error.diagnosis.category == 'transient_5xx' }}"
        next: retry_after_backoff
      - when: "{{ trigger_failure.error.diagnosis.category == 'auth' }}"
        next: fail_loud
      - when: "{{ trigger_failure.error.diagnosis.category == 'rate_limit' }}"
        next: schedule_retry
      - default:
        next: extract_envelope

  - step: retry_after_backoff
    tool: python
    code: |
      import time
      time.sleep(5)
      result = {"action": "retried after 5s backoff"}
    next: extract_envelope

  - step: fail_loud
    tool: python
    code: |
      raise RuntimeError(
          f"Auth failure diagnosed: {diagnosis.get('root_cause')}"
      )
    args:
      diagnosis: "{{ trigger_failure.error.diagnosis }}"

  - step: schedule_retry
    tool: python
    code: |
      result = {"action": "deferred to retry queue"}
    next: extract_envelope

The case block evaluates against trigger_failure's call.done result: because the agent envelope is the contract, the step completes even though the sub-execution failed, so the branch conditions have a result to inspect. Each branch handles a different diagnosis category appropriately. This is the building block for self-correcting workflows: the diagnosis is structured enough that branching logic can be written directly against it.

Step 7 — Switch backends

The same playbook works on a GKE cluster with Vertex AI as the triage backend. The catalog defaults are deployment-mode-aware: local kind keeps mcp/ollama, GKE catalog operators pass overrides per payload. See Vertex AI Triage Backend for the full architecture.

To run your tutorial against a Vertex backend:

noetl --server https://gateway.your-gke.example/api/noetl \
  exec tutorials/canary_with_diagnosis \
  --runtime distributed \
  --payload '{
    "triage_mcp_server": "mcp/vertex-ai",
    "triage_model": "gemini-2.5-flash",
    "escalate_to": "none"
  }'

The diagnosis source field will read vertex-ai instead of ollama, and _meta.diagnosis_fetch.elapsed_seconds will land in the 1–3 second range typical for cloud inference (versus the sub-100ms range for warm local Ollama). Both are within the adaptive retry budget; both produce a usable diagnosis dict.
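
Illustratively, the telemetry fragment from a Vertex-backed run might read (representative values only):

"source": "vertex-ai",
"_meta": {
  "diagnosis_fetch": {
    "attempts": 1,
    "elapsed_seconds": 1.8
  }
}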

What you just learned

You built a complete self-troubleshooting playbook by composing four contracts: the agent envelope, the kind: agent carve-out, the auto-troubleshoot hook, and event projection preservation. The diagnosis dict is structured well enough to drive case branching, which means your playbooks can become self-correcting without you writing any backend-specific failure-handling code.

Next steps

  • Add a new MCP backend — the advanced version, where you build a new triage backend behind the same JSON-RPC contract (e.g. AWS Bedrock).
  • Triage Model Selection — full reference for model tier choices and the deployment-mode-aware backend pattern.
  • Self-Troubleshoot Agent — the canonical reference for how diagnose_execution works internally.

Troubleshooting

Diagnosis arrives empty (category, confidence, etc. all missing). Most often escalate_to is set to a backend that isn't configured (e.g. "openai" without an API key). The Ollama path silently short-circuits and returns no diagnosis. Check the diagnose sub-execution's events for the actual failure.
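
A quick first check before digging into the sub-execution: confirm whether the envelope carried any diagnosis at all, reusing the status JSON captured in Step 4:

python3 -c "
import json
with open('/tmp/canary.json') as f:
    doc = json.load(f)
diag = doc.get('result', {}).get('diagnosis') or {}
print('diagnosis keys:', sorted(diag.keys()) or 'EMPTY')
"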

attempts > 5 on the spike fixture poll counter. Cold-start — Ollama hadn't loaded gemma3:4b into resident memory before the diagnose call ran. Subsequent runs reuse warm state. The cloud-latency profile documents the expected ranges per backend.
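
If you want to pre-warm the model before re-running (assuming the Ollama CLI is reachable from your shell and the model name matches your Quickstart setup):

# Loads gemma3:4b into resident memory so the first diagnose call isn't a cold start.
ollama run gemma3:4b "ok" > /dev/null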

category consistently comes back unknown. The triage model isn't getting enough signal from the failure context. Two options: raise confidence_threshold to force escalation, or add more structured context to your sub-playbook's error path so the model has more to work with (specifically: a typed exception with a useful message beats a generic RuntimeError).
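
For example, in the sub-playbook's failing step (illustrative; pick the exception class that best describes the real failure):

code: |
  # A typed exception with specifics gives the triage model far more signal
  # than a bare RuntimeError.
  raise ConnectionError("GET /health on upstream returned 502 after 3 retries")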

extract_envelope sees trigger_failure.error as None. The agent envelope wasn't attached. Most likely cause: the executor's _dispatch_troubleshoot_diagnosis didn't run because troubleshoot wasn't truthy in on_failure. Confirm the on_failure.troubleshoot: true line is present and the kind:agent step's failure is reaching the auto-troubleshoot path (you'll see a diagnose sub-execution in /api/executions if it did).