# Triage Model Selection
NoETL's self-troubleshoot path uses a local Ollama model first and escalates only when confidence is too low. Keep those two choices separate (a combined example follows this list):

- the default triage model is the model used for the first local diagnosis attempt;
- the escalation tier is invoked only when the first-pass diagnosis is not confident enough and the workload allows escalation.
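As a sketch of how the two knobs travel together, the payload below overrides the first-pass model and explicitly allows a local second pass. The workflow path and payload keys are the ones used elsewhere on this page; the specific combination is illustrative, not a recommended setting.

```bash
# Both axes in one payload: gemma3:4b tries first, and a low-confidence
# answer may escalate to the local heavyweight tier (escalate_to: "ollama").
noetl exec tests/spike/spike_e2e_test \
  --runtime distributed \
  --payload '{"ollama_model":"gemma3:4b","escalate_to":"ollama"}'
```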
## Default: gemma3:4b
gemma3:4b is the default first-pass model for the troubleshoot agent.
It is the right default for local kind clusters, developer laptops, and
memory-constrained worker nodes because it fits in roughly 4 GiB total
memory while still producing reliable structured JSON for known failure
patterns.
```bash
ollama pull gemma3:4b
```
Use the Gemma 3 Ollama library page
and select the 4b tag. This is the model validated across the
NoETL-as-AI-OS evidence trail from v2.35.3 through v2.35.9. The spike
e2e workflow, auto-troubleshoot smoke, optional-AI smoke, and
live-vs-persisted parity smoke all assume this remains safe for
commodity local clusters.
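To sanity-check the roughly 4 GiB figure on a given machine, load the model once and inspect its resident size. `ollama run` with a single prompt and `ollama ps` are standard Ollama CLI commands; the throwaway prompt is only there to force a load.

```bash
# Pull, force a one-shot load, then list loaded models with their
# resident memory footprint.
ollama pull gemma3:4b
ollama run gemma3:4b "ok" >/dev/null
ollama ps
```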
Do not replace the catalog default just because a larger model is available. The default needs to run where NoETL itself is being developed and debugged.
## Higher-quality opt-in: gemma4:e4b
gemma4:e4b is a higher-quality local model for production
deployments with dedicated memory for Ollama. It is an opt-in workload
override, not the catalog default.
```bash
ollama pull gemma4:e4b
```
See the Gemma 4 e4b Ollama library page.
Memory profile measured on 2026-05-06:
| Measurement | Value |
|---|---|
| Resident model after pull/load | ~9.4 GiB |
| Additional inference working set | ~9.8 GiB |
| Practical cgroup requirement | ~20 GiB |
| Local kind-on-Podman VM floor | ~24 GiB |
The validation attempt that produced this measurement ran on a 16 GiB Podman VM with the Ollama pod temporarily raised to a 12 GiB cgroup limit. The model loaded, but inference failed:
```text
model requires more system memory (9.8 GiB) than is available (3.3 GiB)
```
That message is easy to misread. Ollama is reporting the additional memory the inference pass needs on top of the already-loaded model, not the total memory needed for the model. If you see "requires more system memory (9.8 GiB)" while the model is already resident in a 12 GiB cgroup, nudging the limit past 12 GiB will not help; you need roughly 20 GiB of room for the resident model plus the inference working set.
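The ~20 GiB cgroup floor in the table is just the sum of the two measured components, rounded up. A quick check using only the numbers above (the split of headroom between the cgroup floor and the VM floor is inferred from the table, not separately measured):

```bash
# Resident model + inference working set, from the measurements above.
awk 'BEGIN { printf "%.1f GiB\n", 9.4 + 9.8 }'   # 19.2 GiB -> ~20 GiB cgroup floor
# The ~24 GiB VM floor leaves room for the kind control plane and the
# rest of the local stack running alongside Ollama.
```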
Use gemma4:e4b for production clusters with a dedicated Ollama node
or pod budget. Good fits include GKE node pools with at least 24 GiB
allocated to Ollama, EKS m6i.2xlarge-class nodes or larger, or
self-managed clusters with dedicated AI nodes.
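As a hypothetical sizing sketch, assuming an `ollama` deployment in an `ollama` namespace (neither name comes from NoETL's manifests), a strategic merge patch can pin the container to the ~20 GiB cgroup floor measured above:

```bash
# Strategic merge patch: the container is matched by name, so the rest
# of the containers list is left untouched.
kubectl -n ollama patch deployment ollama -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "ollama",
    "resources": {"requests": {"memory": "20Gi"},
                  "limits":   {"memory": "20Gi"}}
  }]}}}'
```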
Workloads opt in by passing `ollama_model: "gemma4:e4b"`:

```bash
noetl exec tests/spike/spike_e2e_test \
  --runtime distributed \
  --payload '{"escalate_to":"none","ollama_model":"gemma4:e4b"}'
```
Leave the catalog default as `gemma3:4b`; larger-model use should be
intentional per workload or per environment.
## Escalation tier: qwen3:32b
qwen3:32b is the local heavyweight tier for cases where the first
model returns low confidence and the workload allows escalation. It is
roughly a 19 GiB model, so it belongs on production-grade Ollama nodes,
not small local clusters.
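Pull it the same way as the other tiers, but only on a node with the memory to hold it:

```bash
ollama pull qwen3:32b
```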
This is a different axis from choosing the default model:
- default-model choice decides which model tries first;
- escalation decides what happens when that first answer is not confident enough.
The troubleshoot agent escalates when:

- the parsed local diagnosis has `confidence < confidence_threshold`;
- `escalate_to` allows a second pass, such as `ollama`, `openai`, or `claude`;
- the required model or credential is available.
If escalation is disabled with `escalate_to: "none"`, the agent returns
the local diagnosis even when confidence is low.
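A minimal shell sketch of that gate, using illustrative variable names (`CONFIDENCE`, `CONFIDENCE_THRESHOLD`, `ESCALATE_TO`, `ESCALATION_MODEL`) rather than NoETL's actual internals:

```bash
# Escalate only when all three conditions above hold; otherwise return
# the local diagnosis as-is. The `ollama list` check covers the local
# tier; cloud tiers would check credentials instead.
if [ "$ESCALATE_TO" != "none" ] \
   && awk -v c="$CONFIDENCE" -v t="$CONFIDENCE_THRESHOLD" \
        'BEGIN { exit !(c < t) }' \
   && ollama list | grep -q "$ESCALATION_MODEL"; then
  echo "escalating to $ESCALATE_TO"
else
  echo "returning local diagnosis"
fi
```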
## How the choice flows
For GKE deployments that use a cloud-managed backend instead of
in-cluster Ollama, see
Vertex AI Triage Backend. That page
documents the MCP pointer-swap contract and the `triage_mcp_server` /
`triage_model` workload names.
```mermaid
flowchart LR
    catalog["Catalog default<br/>diagnose_execution.yaml<br/>ollama_model: gemma3:4b"]
    workload["Workload override<br/>payload.ollama_model"]
    local["Local Ollama triage"]
    confidence{"confidence<br/>< threshold?"}
    disabled["Return local diagnosis"]
    escalation["Escalation tier<br/>qwen3:32b / OpenAI / Claude"]
    result["Structured diagnosis"]
    catalog --> workload --> local --> confidence
    confidence -- "no" --> result
    confidence -- "yes, escalation disabled" --> disabled --> result
    confidence -- "yes, escalation allowed" --> escalation --> result
```
Worked example:

```json
{
  "escalate_to": "none",
  "ollama_model": "gemma4:e4b"
}
```
This payload asks the spike e2e workflow to use `gemma4:e4b` for the
local diagnosis only. Because `escalate_to` is `none`, a low-confidence
answer still returns from the local model and does not call the
escalation tier.
## Operational notes
- The default Ollama pod budget was raised after ops#38 because 4 GiB was too small for reliable `gemma3:4b` inference. The local default is now a 3 GiB request and 5 GiB limit.
- Local kind development uses Podman only. The Podman machine must mount the shared checkout path, including `/Volumes:/Volumes` on macOS, so kind can read manifests and local automation paths (see the sketch after this list).
- Run Podman and kind commands with `XDG_DATA_HOME` unset when the machine was created that way; otherwise the CLI can look at the wrong machine metadata.
- Do not use Colima for the NoETL local kind workflow. Keep the Podman-backed `kind-noetl` path deterministic.
- The 2026-05-06 `gemma4:e4b` validation result is recorded in `bridge/outbox/20260506-025752-option-b-gemma4-evidence-doc-coverage.result.json` in the `ai-meta` repository. Treat its cgroup math as the current sizing floor until a larger local VM or production node proves a new lower bound.
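A sketch of the Podman notes above; the machine name is illustrative, and `--volume` on `podman machine init` plus `env -u` are standard Podman and coreutils options, not NoETL-specific tooling:

```bash
# Create the Podman machine with the shared checkout mount (macOS),
# then run Podman and kind with XDG_DATA_HOME unset so the CLI resolves
# the same machine metadata the machine was created with.
podman machine init --volume /Volumes:/Volumes noetl-machine
env -u XDG_DATA_HOME podman machine start noetl-machine
env -u XDG_DATA_HOME kind get clusters   # the NoETL kind cluster should be listed
```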