# Self-Troubleshoot Agent
A NoETL playbook that diagnoses other failed NoETL executions: it fetches the event log, classifies the failure with a local Ollama model (cheap-first), escalates to OpenAI only when the local model's confidence is low, and returns a structured diagnosis that both the GUI and programmatic callers can consume.
The agent is the worked example for the NoETL-as-AI-OS architecture: it composes `tool: agent framework=noetl`, playbook-as-MCP-server, typed metadata, and the Ollama bridge into a single self-contained agent.
## Why cheap-first
Calling OpenAI / Claude for every failure in a fleet is slow and wasteful when most failures look identical to ones a 2B-parameter local model can classify in 200ms.
| Approach | Latency | Cost per failure |
|---|---|---|
| OpenAI / Claude for every failure | ~3s | ~$0.01 |
| Ollama-only | ~200ms | $0.00 |
| Cheap-first with escalation | ~200ms* | ~$0.001* |
\* Most failures (HTTP 5xx, transient timeouts, known patterns) stay on the Ollama path. Escalation runs only on novel or interesting failures, typically under 10% in steady state.
At fleet error volumes (O(thousands) per day) this is the difference between roughly $0/day (Ollama-only) and ~$30/day (a hosted model for every failure) in inference spend, with cheap-first landing near $3/day.
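The table's per-failure figures can be sanity-checked with a little arithmetic. The volume and escalation rate below are illustrative assumptions, not measured values:

```python
# Back-of-envelope daily inference spend for each approach.
# All inputs are illustrative assumptions, not measured values.
FAILURES_PER_DAY = 3000    # "O(thousands)" fleet error volume
OPENAI_COST = 0.01         # ~$ per failure via a hosted model
OLLAMA_COST = 0.0          # local inference is effectively free
ESCALATION_RATE = 0.10     # <10% of failures escalate in steady state

def daily_spend(per_failure_cost: float, volume: int = FAILURES_PER_DAY) -> float:
    return per_failure_cost * volume

openai_only = daily_spend(OPENAI_COST)                    # every failure goes upstream
ollama_only = daily_spend(OLLAMA_COST)                    # everything stays local
cheap_first = daily_spend(OPENAI_COST * ESCALATION_RATE)  # only escalations cost money

print(f"openai-only: ${openai_only:.2f}/day")
print(f"ollama-only: ${ollama_only:.2f}/day")
print(f"cheap-first: ${cheap_first:.2f}/day")
```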
## Flow

```mermaid
flowchart LR
    fetch["fetch_events<br/>(GET /api/executions/{id})"]
    extract["extract_failure_signal<br/>(Python: walk events,<br/>identify proximate cause)"]
    triage["ollama_triage<br/>(MCP → mcp/ollama)"]
    parse["parse_ollama_response<br/>(strip fences, parse JSON,<br/>clamp confidence)"]
    hi["confidence ≥ threshold"]
    lo["confidence < threshold<br/>or escalate_to=openai"]
    esc["escalate_openai<br/>(HTTP POST → OpenAI)"]
    parse2["parse_openai_response"]
    persist["persist_diagnosis<br/>(pick higher-confidence,<br/>emit envelope)"]
    fetch --> extract --> triage --> parse
    parse --> hi --> persist
    parse --> lo --> esc --> parse2 --> persist
```
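The cheap-first branch at the heart of this flow can be sketched in plain Python. Function names mirror the flow's step names; the two model calls are stubbed assumptions, not the playbook's actual step code:

```python
# Sketch of the cheap-first branch. The model calls are stubs;
# in the playbook they are an MCP call and an HTTP POST.
CONFIDENCE_THRESHOLD = 0.7  # workload knob: confidence_threshold

def ollama_triage(signal: dict) -> dict:
    # Stub: in the playbook this is an MCP call to mcp/ollama.
    return {"category": "transient_5xx", "confidence": 0.82, "source": "ollama"}

def escalate_openai(signal: dict) -> dict:
    # Stub: in the playbook this is an HTTP POST to OpenAI.
    return {"category": "transient_5xx", "confidence": 0.95, "source": "openai"}

def diagnose(signal: dict, escalate_to: str = "openai") -> dict:
    local = ollama_triage(signal)
    # Escalate only on low local confidence, and only when an
    # escalation target is configured (escalate_to != "none").
    if local["confidence"] < CONFIDENCE_THRESHOLD and escalate_to != "none":
        remote = escalate_openai(signal)
        # persist_diagnosis keeps whichever answer is more confident.
        best = max(local, remote, key=lambda d: d["confidence"])
        return {**best, "escalated": True}
    return {**local, "escalated": False}

print(diagnose({"error": "HTTP 500 from upstream"}))
```

With the stubbed local confidence of 0.82 the diagnosis stays on the Ollama path and never touches a hosted model.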
## Agent envelope

```json
{
  "status": "ok",
  "diagnosis": {
    "execution_id": "619156384600293663",
    "category": "transient_5xx",
    "confidence": 0.82,
    "root_cause": "Amadeus sandbox returned HTTP 500 on a well-formed query",
    "suggested_action": "Retry; if persistent, check api.amadeus.com status page",
    "source": "ollama",
    "escalated": false
  },
  "summary": "Amadeus sandbox is temporarily unavailable; retry shortly.",
  "text": "Amadeus sandbox is temporarily unavailable; retry shortly.",
  "user_message": "Amadeus sandbox is temporarily unavailable; retry shortly."
}
```
The `summary` / `text` / `user_message` fields all carry the same friendly one-liner, mirroring the GUI's `extractAgentText` heuristic so the run dialog renders it inline. `data.diagnosis` carries the structured form for programmatic callers (other playbooks, auto-dispatch hooks, CI integrations).
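A programmatic caller can pick whichever form it needs from the envelope. The snippet below parses a trimmed copy of the example envelope above; how the envelope is obtained (HTTP call, peer-playbook result) is out of scope here:

```python
import json

# Trimmed copy of the example envelope above.
envelope = json.loads("""
{
  "status": "ok",
  "diagnosis": {
    "execution_id": "619156384600293663",
    "category": "transient_5xx",
    "confidence": 0.82,
    "source": "ollama",
    "escalated": false
  },
  "summary": "Amadeus sandbox is temporarily unavailable; retry shortly."
}
""")

# GUI-style consumers show the friendly one-liner...
print(envelope["summary"])

# ...while programmatic consumers branch on the structured form.
diag = envelope["diagnosis"]
if diag["category"] == "transient_5xx" and not diag["escalated"]:
    print(f"retryable failure on execution {diag['execution_id']}")
```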
## Workload knobs

| Key | Default | What it controls |
|---|---|---|
| `execution_id` | `""` (required) | which failed execution to diagnose |
| `noetl_url` | `http://noetl-server.noetl.svc.cluster.local:8080` | NoETL API base |
| `ollama_model` | `gemma2:2b` | local model for the first pass |
| `ollama_mcp_server` | `mcp/ollama` | catalog path of the Ollama bridge |
| `confidence_threshold` | `0.7` | escalate when local confidence < this |
| `escalate_to` | `openai` | `openai` / `claude` / `none` |
| `openai_credential` | `openai_token` | keychain entry for the OpenAI API key |
| `openai_model` | `gpt-4o-mini` | OpenAI model for escalation |
| `anthropic_credential` | `anthropic_token` | keychain entry for the Anthropic API key |
| `anthropic_model` | `claude-haiku-4-5` | Anthropic model for escalation |
## How callers reach it

Three surfaces:

- From a peer playbook, via `tool: agent framework=noetl`:

  ```yaml
  - step: diagnose_recent_failure
    tool:
      kind: agent
      framework: noetl
      entrypoint: automation/agents/troubleshoot/diagnose_execution
      payload:
        execution_id: "{{ failed_id }}"
        confidence_threshold: 0.85
  ```

- Automatically on failure, via the agent executor's auto-dispatch hook (see Agent Orchestration → Auto-troubleshoot on failure):

  ```yaml
  - step: ask_amadeus
    tool:
      kind: agent
      framework: noetl
      entrypoint: api_integration/amadeus_ai_api
    on_failure:
      troubleshoot: true  # diagnosis attaches to error.diagnosis on failure
  ```

- From an external MCP client (Cursor, Claude Desktop), via the playbook-as-MCP-server endpoint:

  ```
  POST /api/mcp/playbook/automation/agents/troubleshoot/diagnose_execution/jsonrpc
  ```
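An external client POSTs JSON-RPC to that endpoint. The body below is a sketch that assumes the server speaks MCP's standard `tools/call` method; the tool name and argument shape are assumptions, not confirmed by this page:

```python
import json

# Hypothetical MCP JSON-RPC request body. The method follows the
# MCP spec's tools/call; the tool name is an assumption.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "diagnose_execution",
        "arguments": {
            "execution_id": "619156384600293663",
            "confidence_threshold": 0.85,
        },
    },
}

print(json.dumps(request, indent=2))
```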
## Failure modes the agent classifies

The system prompt asks the model for one of these categories:

| Category | What it means |
|---|---|
| `transient_5xx` | upstream returned 5xx; usually retryable |
| `auth` | 401/403; refresh credentials |
| `rate_limit` | 429; back off and retry |
| `bad_request` | 4xx that's not auth/rate-limit; the query needs to change |
| `tool_error` | the NoETL tool itself failed (DSL error, missing credential, etc.) |
| `infra` | worker / database / NATS issue |
| `unknown` | the model couldn't classify; signal to escalate |
The model is explicitly instructed to be conservative: set confidence below 0.5 and `category=unknown` rather than guess. On parse failure (small models occasionally return malformed JSON), confidence is forced to 0, which guarantees escalation when the operator opted into it.
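The parse step can be sketched as follows. This is a minimal illustration of the strip-fences / parse / clamp / force-zero behavior described above, not the playbook's actual `parse_ollama_response` code:

```python
import json
import re

def parse_model_response(raw: str) -> dict:
    """Strip code fences, parse JSON, clamp confidence to [0, 1].

    On malformed JSON, force confidence to 0 so the escalation
    branch fires whenever the operator opted into it.
    """
    # Strip markdown code fences like ```json ... ```
    text = re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", raw.strip())
    try:
        parsed = json.loads(text)
    except (json.JSONDecodeError, ValueError):
        parsed = None
    if not isinstance(parsed, dict):
        return {"category": "unknown", "confidence": 0.0}
    conf = parsed.get("confidence", 0.0)
    parsed["confidence"] = max(0.0, min(1.0, float(conf)))
    return parsed

# Well-formed fenced output parses, with confidence clamped into range.
print(parse_model_response('```json\n{"category": "auth", "confidence": 1.3}\n```'))
# Malformed output degrades to confidence 0, guaranteeing escalation.
print(parse_model_response("sorry, I can't answer that"))
```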
## Escalation targets

Two upstream models are wired:

| `escalate_to` value | Endpoint | Default model | Auth header |
|---|---|---|---|
| `openai` | `https://api.openai.com/v1/chat/completions` | `gpt-4o-mini` | `Authorization: Bearer <token>` |
| `claude` | `https://api.anthropic.com/v1/messages` | `claude-haiku-4-5` | `x-api-key: <token>` + `anthropic-version: 2023-06-01` |
| `none` | (no escalation; local-only mode) | | |
The two paths are mutually exclusive: at most one runs per diagnosis. Pick `claude` when you'd rather spend the marginal cost on a larger context window or stronger multi-step reasoning; pick `openai` for the cheaper, faster path. Either way the parsed output shape is identical, so downstream consumers don't need to branch.
When neither escalation provider is reachable, the diagnosis still completes: it falls back to the local Ollama result and sets `diagnosis.escalated = false`. The agent never blocks on upstream availability.
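That fallback behavior can be sketched as a try/except around the escalation call. This is a minimal sketch; the names are assumptions, not the playbook's actual step code:

```python
def diagnose_with_fallback(local: dict, escalate) -> dict:
    """Escalate, but never block on upstream availability.

    `local` is the parsed Ollama result; `escalate` is a callable
    that raises when the provider is unreachable.
    """
    try:
        remote = escalate()
    except Exception:
        # Provider unreachable: keep the local result as-is.
        return {**local, "escalated": False}
    # Both answers available: keep the more confident one.
    best = max(local, remote, key=lambda d: d["confidence"])
    return {**best, "escalated": True}

def unreachable_provider() -> dict:
    raise ConnectionError("escalation endpoint unreachable")

local = {"category": "transient_5xx", "confidence": 0.55, "source": "ollama"}
print(diagnose_with_fallback(local, unreachable_provider))
```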
## What's not in scope

- Multi-execution batch diagnosis ("what broke yesterday?"). That is a separate workflow shape (loop over executions, cluster by category) and out of scope here.
- The Ollama deployment itself. This page covers the agent; see Ollama Bridge for how to stand up the cheap-first inference layer the agent depends on.
## See also

- Agent Orchestration — the `tool: agent framework=noetl` contract this agent participates in, and the auto-dispatch hook that calls it on failures.
- Ollama Bridge — deployment guide for the cheap-first inference layer.
- Playbook-as-MCP-Server — how external MCP clients reach this agent over the wire.