Self-Troubleshoot Agent

A NoETL playbook that diagnoses other failed NoETL executions: it fetches the event log, classifies the failure with a local Ollama model (cheap-first), escalates to OpenAI only when the local model's confidence is low, and returns a structured diagnosis that both the GUI and programmatic callers can consume.

The agent is the worked example for the NoETL-as-AI-OS architecture: it composes the `tool: agent framework=noetl` contract, the playbook-as-MCP-server endpoint, the typed metadata, and the Ollama bridge into a single self-contained agent.

Why cheap-first

Calling OpenAI / Claude for every failure in a fleet is slow and wasteful when most failures look identical to ones a 2B-parameter local model can classify in 200ms.

| Approach | Latency | Cost per failure |
|---|---|---|
| OpenAI / Claude for every failure | ~3s | ~$0.01 |
| Ollama-only | ~200ms | $0.00 |
| Cheap-first with escalation | ~200ms* | ~$0.001* |

* Most failures (HTTP 5xx, transient timeouts, known patterns) stay on the Ollama path. Escalation runs only on novel / interesting failures — typically <10% in steady state.

At fleet error volumes (O(thousands) of failures per day) this is the difference between ~$30/day in inference spend for the call-OpenAI-every-time approach and a few dollars at most on the cheap-first path.
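The arithmetic behind that claim, as a minimal sketch (the ~10% escalation rate and the per-call costs are taken from the table above):

```python
# Blended per-failure cost of the cheap-first strategy, using the
# figures from the table above: ~$0.01 per escalated OpenAI call,
# $0 marginal cost per local Ollama call, ~10% escalation rate.
OPENAI_COST_PER_CALL = 0.01
OLLAMA_COST_PER_CALL = 0.0
ESCALATION_RATE = 0.10

def daily_spend(failures_per_day: int) -> float:
    """Expected daily inference spend under cheap-first routing."""
    per_failure = ((1 - ESCALATION_RATE) * OLLAMA_COST_PER_CALL
                   + ESCALATION_RATE * OPENAI_COST_PER_CALL)
    return failures_per_day * per_failure

# At 3,000 failures/day: ~$30/day if every failure hits OpenAI,
# but only a few dollars with cheap-first routing.
```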

Flow

```mermaid
flowchart LR
    fetch["fetch_events<br/>(GET /api/executions/{id})"]
    extract["extract_failure_signal<br/>(Python: walk events,<br/>identify proximate cause)"]
    triage["ollama_triage<br/>(MCP → mcp/ollama)"]
    parse["parse_ollama_response<br/>(strip fences, parse JSON,<br/>clamp confidence)"]
    hi["confidence ≥ threshold"]
    lo["confidence &lt; threshold<br/>or escalate_to=openai"]
    esc["escalate_openai<br/>(HTTP POST → OpenAI)"]
    parse2["parse_openai_response"]
    persist["persist_diagnosis<br/>(pick higher-confidence,<br/>emit envelope)"]

    fetch --> extract --> triage --> parse
    parse --> hi --> persist
    parse --> lo --> esc --> parse2 --> persist
```
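The parse step can be sketched roughly as follows (a hypothetical helper, not the playbook's actual code; the fence-stripping regex is an assumption about how small models wrap their output):

```python
import json
import re

def parse_model_response(raw: str) -> dict:
    """Strip markdown fences, parse JSON, clamp confidence to [0, 1].

    On parse failure, return an `unknown` diagnosis with confidence 0,
    which guarantees the escalation check fires when it is enabled.
    """
    # Small models often wrap their JSON in ```json ... ``` fences.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        diagnosis = json.loads(text)
    except (json.JSONDecodeError, ValueError):
        return {"category": "unknown", "confidence": 0.0}
    conf = diagnosis.get("confidence", 0.0)
    diagnosis["confidence"] = min(1.0, max(0.0, float(conf)))
    return diagnosis
```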

Agent envelope

```json
{
  "status": "ok",
  "diagnosis": {
    "execution_id": "619156384600293663",
    "category": "transient_5xx",
    "confidence": 0.82,
    "root_cause": "Amadeus sandbox returned HTTP 500 on a well-formed query",
    "suggested_action": "Retry; if persistent, check api.amadeus.com status page",
    "source": "ollama",
    "escalated": false
  },
  "summary": "Amadeus sandbox is temporarily unavailable; retry shortly.",
  "text": "Amadeus sandbox is temporarily unavailable; retry shortly.",
  "user_message": "Amadeus sandbox is temporarily unavailable; retry shortly."
}
```

The `summary` / `text` / `user_message` fields all carry the same friendly one-liner, mirroring the GUI's `extractAgentText` heuristic so the run dialog renders it inline. `data.diagnosis` carries the structured form for programmatic callers (other playbooks, auto-dispatch hooks, CI integrations).
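A minimal sketch of how a programmatic caller might consume the envelope (hypothetical helper names; the fallback order mirrors the heuristic described above):

```python
def agent_text(envelope: dict) -> str:
    """First friendly, non-empty text field, in the GUI's fallback order."""
    for key in ("summary", "text", "user_message"):
        value = envelope.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return ""

def is_retryable(envelope: dict) -> bool:
    """Structured consumers branch on diagnosis.category instead of prose."""
    diagnosis = envelope.get("diagnosis", {})
    return diagnosis.get("category") in ("transient_5xx", "rate_limit")
```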

Workload knobs

| Key | Default | What it controls |
|---|---|---|
| `execution_id` | `""` (required) | which failed execution to diagnose |
| `noetl_url` | `http://noetl-server.noetl.svc.cluster.local:8080` | NoETL API base |
| `ollama_model` | `gemma2:2b` | local model for the first pass |
| `ollama_mcp_server` | `mcp/ollama` | catalog path of the Ollama bridge |
| `confidence_threshold` | `0.7` | escalate when local confidence < this |
| `escalate_to` | `openai` | `openai` / `claude` / `none` |
| `openai_credential` | `openai_token` | keychain entry for the OpenAI API key |
| `openai_model` | `gpt-4o-mini` | OpenAI model for escalation |
| `anthropic_credential` | `anthropic_token` | keychain entry for the Anthropic API key |
| `anthropic_model` | `claude-haiku-4-5` | Anthropic model for escalation |

How callers reach it

Three surfaces:

1. From a peer playbook, via `tool: agent framework=noetl`:

   ```yaml
   - step: diagnose_recent_failure
     tool:
       kind: agent
       framework: noetl
       entrypoint: automation/agents/troubleshoot/diagnose_execution
       payload:
         execution_id: "{{ failed_id }}"
         confidence_threshold: 0.85
   ```

2. Automatically on failure, via the agent executor's auto-dispatch hook (see Agent Orchestration → Auto-troubleshoot on failure):

   ```yaml
   - step: ask_amadeus
     tool:
       kind: agent
       framework: noetl
       entrypoint: api_integration/amadeus_ai_api
     on_failure:
       troubleshoot: true  # diagnosis attaches to error.diagnosis on failure
   ```

3. From an external MCP client (Cursor, Claude Desktop), via the playbook-as-MCP-server endpoint:

   ```
   POST /api/mcp/playbook/automation/agents/troubleshoot/diagnose_execution/jsonrpc
   ```
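For the third surface, a client call might look like this. This is a sketch only: the JSON-RPC `method` name (`tools/call`) and the `params` shape are assumptions in the MCP style, not the documented contract — see Playbook-as-MCP-Server for the real wire format.

```python
import json
import urllib.request

# Endpoint path from the docs; host assumes the in-cluster service name.
ENDPOINT = ("http://noetl-server.noetl.svc.cluster.local:8080"
            "/api/mcp/playbook/automation/agents/troubleshoot/"
            "diagnose_execution/jsonrpc")

def build_request(execution_id: str, request_id: int = 1) -> bytes:
    """Standard JSON-RPC 2.0 envelope; method/params shape is assumed."""
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # assumption: MCP-style method name
        "params": {"arguments": {"execution_id": execution_id}},
    }
    return json.dumps(body).encode()

def diagnose(execution_id: str) -> dict:
    req = urllib.request.Request(
        ENDPOINT,
        data=build_request(execution_id),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```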

Failure modes the agent classifies

The system prompt asks the model for one of these categories:

| Category | What it means |
|---|---|
| `transient_5xx` | Upstream returned 5xx; usually retryable |
| `auth` | 401/403; refresh credentials |
| `rate_limit` | 429; back off + retry |
| `bad_request` | 4xx that's not auth/rate-limit; query needs to change |
| `tool_error` | NoETL tool itself failed (DSL error, missing credential, etc.) |
| `infra` | Worker / database / NATS issue |
| `unknown` | Model couldn't classify; signal to escalate |

The model is explicitly instructed to be conservative — set confidence below 0.5 and `category=unknown` rather than guess. On parse failure (small models occasionally return malformed JSON) confidence is forced to 0, which guarantees escalation when the operator has opted into it.

Escalation targets

Two upstream models are wired:

| `escalate_to` value | Endpoint | Default model | Auth header |
|---|---|---|---|
| `openai` | `https://api.openai.com/v1/chat/completions` | `gpt-4o-mini` | `Authorization: Bearer <token>` |
| `claude` | `https://api.anthropic.com/v1/messages` | `claude-haiku-4-5` | `x-api-key: <token>` + `anthropic-version: 2023-06-01` |
| `none` | (no escalation; local-only mode) | | |

The two paths are mutually exclusive — at most one runs per diagnosis. Pick claude when you'd rather spend the marginal cost on a larger context window or stronger multi-step reasoning; pick openai for the cheaper / faster path. Either way the parsed output shape is identical so downstream consumers don't need to branch.

When neither escalation provider is reachable, the diagnostic still completes — it just falls back to the local Ollama result and sets diagnosis.escalated = false. The agent never blocks on upstream availability.
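The fallback and pick-higher-confidence behavior can be sketched as follows (a hypothetical helper; the `escalated` flag semantics for the local-model-wins case are an assumption):

```python
from typing import Optional

def pick_diagnosis(local: dict, escalated: Optional[dict]) -> dict:
    """Keep whichever diagnosis is more confident; record provenance."""
    if escalated is None:
        # Escalation disabled or provider unreachable: local-only result.
        return {**local, "escalated": False}
    if escalated.get("confidence", 0.0) > local.get("confidence", 0.0):
        return {**escalated, "escalated": True}
    # Escalation ran but did not beat the local model (assumed semantics:
    # the flag still records that an escalation call was made).
    return {**local, "escalated": True}
```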

What's not in scope

  • Multi-execution batch diagnosis ("what broke yesterday?"): that's a separate workflow shape (loop over executions, cluster by category).
  • The Ollama deployment itself. This page covers the agent. See Ollama Bridge for how to stand up the cheap-first inference layer the agent depends on.

See also

  • Agent Orchestration — the tool: agent framework=noetl contract this agent participates in, and the auto-dispatch hook that calls it on failures.
  • Ollama Bridge — deployment guide for the cheap-first inference layer.
  • Playbook-as-MCP-Server — how external MCP clients reach this agent over the wire.