Quickstart: from clone to a green spike e2e

This tutorial gets a complete NoETL stack running on your laptop and walks you through one full self-troubleshooting playbook execution. By the end you will have:

  • A Podman-backed kind cluster with NoETL server, worker, gateway, GUI, Postgres, NATS, and Ollama all running.
  • The gemma3:4b model loaded for the auto-troubleshoot triage path.
  • One spike e2e execution that reaches GREEN with structured diagnostic telemetry visible in persisted events.
  • All six prepared regression smokes passing.

Estimated time: 15–20 minutes if dependencies are already installed.

Prerequisites

  • macOS or Linux. Windows works through WSL2 but is not the primary path.
  • Podman 5.x. Do not use Docker Desktop or Colima for the canonical local path — see Local Podman Kind Cluster for the rationale.
  • kind v0.20+.
  • kubectl 1.28+.
  • Python 3.10+ for the smoke scripts.
  • The NoETL CLI — install via brew tap noetl/tap && brew install noetl or follow Quick Start.
  • About 16 GiB of memory for the Podman VM.
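If you want to sanity-check the CLIs before starting, a small preflight script can compare each tool against the minimums above. This is a convenience sketch, not part of NoETL — the preflight, parse_version, and meets_min names are invented here, and the version parsing is a loose heuristic:

```python
import re
import shutil
import subprocess

# Minimum (major, minor) versions from the prerequisites list above.
MINIMUMS = {"podman": (5, 0), "kind": (0, 20), "kubectl": (1, 28)}

def parse_version(text):
    """Extract the first dotted version, e.g. 'kind v0.20.0' -> (0, 20)."""
    m = re.search(r"v?(\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def meets_min(found, required):
    """True when a parsed version is at least the required (major, minor)."""
    return found is not None and found >= required

def preflight(minimums=MINIMUMS):
    """Report each prerequisite CLI as 'missing', 'too old', or 'ok'."""
    report = {}
    for tool, required in minimums.items():
        if shutil.which(tool) is None:
            report[tool] = "missing"
            continue
        args = [tool, "version", "--client"] if tool == "kubectl" else [tool, "--version"]
        out = subprocess.run(args, capture_output=True, text=True).stdout
        report[tool] = "ok" if meets_min(parse_version(out), required) else "too old"
    return report
```

Call preflight() from a REPL; anything other than ok for all three tools means fix that prerequisite first.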

Step 1 — Bootstrap the local cluster

The local automation assumes a Podman VM with /Volumes mounted and a clean XDG_DATA_HOME setting. The full rationale is in Local Podman Kind Cluster; the short version:

# Initialize a Podman machine sized for the full NoETL stack.
podman machine init \
  --memory=16384 \
  --cpus=4 \
  --disk-size=200 \
  --volume=/Volumes:/Volumes \
  noetl-dev

podman machine start noetl-dev

# Confirm the machine is healthy and the volume mount took.
podman machine list
podman machine ssh noetl-dev -- ls /Volumes

If your shell exports XDG_DATA_HOME to a custom location, unset it before running Podman/kind commands. Mixed locations split machine state and break podman machine list.

unset XDG_DATA_HOME

Now create the kind cluster:

kind create cluster --name=noetl-cluster
kubectl cluster-info --context=kind-noetl-cluster

kubectl get nodes should show one Ready node within ~30 seconds.

Step 2 — Deploy the NoETL stack

NoETL deploys go through the bump_image lifecycle agent rather than raw kubectl apply. This gives you the GHCR availability probe (so a release race fails fast instead of timing out a kubectl rollout) and the idempotent unchanged path for free. See Bump Image Lifecycle for the full operational contract.

For a fresh local cluster, the simplest path is the development bootstrap playbook from repos/ops:

# From the ai-meta workspace root.
noetl exec automation/development/noetl --runtime local --payload '{
  "namespace": "noetl",
  "noetl_image": "ghcr.io/noetl/noetl:v2.37.1",
  "gateway_image": "ghcr.io/noetl/gateway:v2.10.0",
  "gui_image": "ghcr.io/noetl/gui:v1.7.0"
}'

Wait until rollouts converge — 1–3 minutes on a warm host:

kubectl -n noetl rollout status deploy/noetl-server --timeout=180s
kubectl -n noetl rollout status deploy/noetl-worker --timeout=180s
kubectl -n noetl rollout status deploy/ollama-bridge --timeout=180s
kubectl -n noetl rollout status deploy/gateway --timeout=180s
kubectl -n noetl rollout status deploy/gui --timeout=180s

Quick health check:

kubectl -n noetl get pods
curl -sf http://localhost:8082/api/health

/api/health should return {"status":"ok"} once the noetl-server pod is Ready.
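If you script this step, a polling loop against the same endpoint avoids retrying curl by hand. This is an illustrative sketch, not a NoETL utility — health_ok and wait_for_health are names invented here, and the localhost:8082 URL assumes the port-forward shown above:

```python
import json
import time
import urllib.error
import urllib.request

def health_ok(body: bytes) -> bool:
    """True when the health payload reports status ok."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def wait_for_health(url="http://localhost:8082/api/health", timeout=180, interval=5):
    """Poll the health endpoint until it reports ok or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if health_ok(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False
```

wait_for_health() returns True as soon as the server answers {"status":"ok"}, so it doubles as a gate in bring-up scripts.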

Step 3 — Pull the triage model

The auto-troubleshoot path uses Ollama-hosted gemma3:4b as the default triage model. See Triage Model Selection for the rationale and tier comparison.

kubectl -n noetl exec deploy/ollama -- ollama pull gemma3:4b
kubectl -n noetl exec deploy/ollama -- ollama list

ollama list should show gemma3:4b at roughly 3.0 GB.
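If a script needs to gate on the model being present, you can parse the ollama list output. A sketch, assuming the usual tabular format with the model name in the first column (the exact layout can vary across Ollama versions):

```python
def model_present(listing: str, name: str) -> bool:
    """True when `name` appears in the first column of `ollama list` output."""
    for line in listing.splitlines()[1:]:  # first row is the NAME/ID/SIZE header
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False
```

Feed it the captured stdout of the kubectl exec ... ollama list command above and check model_present(out, "gemma3:4b") before moving on.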

Step 4 — Run the spike

The spike e2e (tests/spike/spike_e2e_test, which lives in repos/e2e/fixtures/playbooks/spike/spike_e2e_test.yaml) deliberately invokes a failing sub-playbook so the auto-troubleshoot path is exercised end-to-end:

EXEC_ID=$(noetl exec tests/spike/spike_e2e_test \
  --runtime distributed \
  --payload '{"escalate_to":"none"}' \
  --json | jq -r '.execution_id')
echo "execution: $EXEC_ID"

# Wait for terminal status — typically 8–15 seconds on local kind.
sleep 30
noetl status "$EXEC_ID" --json > /tmp/spike.json

noetl status is the canonical way to fetch execution state. It hits the same /api/executions/<id> endpoint the gateway exposes, but goes through the CLI's host/port resolution and credential plumbing, which keeps your scripts portable across local kind, GKE, and any future deployment topology without hardcoded URLs.
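The fixed sleep 30 above is the simplest option; if you want a proper wait, one approach is to poll noetl status until the execution is terminal. A sketch, with the caveat that the terminal status set below (completed, failed, error) is an assumption to verify against your NoETL version:

```python
import json
import subprocess
import time

# Assumed terminal states; check your deployment's actual status vocabulary.
TERMINAL = {"completed", "failed", "error"}

def is_terminal(status_doc: dict) -> bool:
    """True when the execution document reports a terminal status."""
    return str(status_doc.get("status", "")).lower() in TERMINAL

def wait_for_execution(exec_id: str, timeout=120, interval=2):
    """Poll `noetl status` until the execution reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(["noetl", "status", exec_id, "--json"],
                             capture_output=True, text=True)
        if out.returncode == 0:
            doc = json.loads(out.stdout)
            if is_terminal(doc):
                return doc
        time.sleep(interval)
    raise TimeoutError(f"execution {exec_id} not terminal after {timeout}s")
```

wait_for_execution(EXEC_ID) returns the same JSON document you would otherwise dump to /tmp/spike.json.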

Run the assertion script to confirm GREEN:

python3 scripts/spike_e2e_assert.py /tmp/spike.json

You should see "All checks passed. NoETL-as-AI-OS spike e2e smoke is GREEN." followed by "diagnosis source: ollama" and "diagnosis category: ...".

Step 5 — Read the telemetry

The spike's extract_envelope step pulls a useful summary into the parent result, but the canonical diagnostic data lives deeper in the persisted event stream at events[N].result.context.error.diagnosis._meta.diagnosis_fetch:

python3 - <<'PY'
import json

with open('/tmp/spike.json') as f:
    doc = json.load(f)

for evt in doc.get('events', []):
    diag_fetch = (
        evt.get('result', {})
        .get('context', {})
        .get('error', {})
        .get('diagnosis', {})
        .get('_meta', {})
        .get('diagnosis_fetch')
    )
    if diag_fetch:
        print(f"event {evt.get('event_id')} ({evt.get('node_name')}):")
        for k, v in diag_fetch.items():
            print(f"  {k} = {v}")
        break
PY

Expected output:

event 62123... (trigger_failure):
  poll_count = 1
  elapsed_seconds = 0.064
  deadline_seconds = 60.0
  hit_deadline = False

A poll_count of 1 with elapsed_seconds ≈ 0.06 is the warm-path signature: the diagnose sub-execution finished and its result was persisted before the noetl-side fetch loop reached its first sleep. See Vertex AI Triage Backend for the full cloud-vs-local profile and what cold-start numbers look like.
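To make the warm-vs-cold read mechanical, you could classify the diagnosis_fetch dict with a small helper. The thresholds here are illustrative only, not part of the NoETL telemetry contract:

```python
def classify_fetch(diag_fetch: dict) -> str:
    """Rough warm/cold classification of a diagnosis_fetch telemetry dict."""
    polls = diag_fetch.get("poll_count", 0)
    elapsed = diag_fetch.get("elapsed_seconds", 0.0)
    if diag_fetch.get("hit_deadline"):
        return "deadline"  # fetch loop gave up; diagnosis never persisted in time
    if polls <= 1 and elapsed < 1.0:
        return "warm"      # result persisted before the first sleep
    return "cold"          # model load or sub-execution latency dominated
```

On the run above, classify_fetch would return "warm"; a first run with a cold Ollama typically lands in "cold".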

Step 6 — Run the regression smokes

NoETL ships six prepared smoke regression detectors. Run them all to confirm the architectural surface is clean:

cd /Volumes/X10/projects/noetl/ai-meta

python3 scripts/agent_envelope_carveout_smoke.py # 8/8
python3 scripts/gap41_diagnosis_wait_smoke.py # 7/7
python3 scripts/auto_troubleshoot_smoke.py # 9/9
python3 scripts/optional_ai_smoke.py # 6/6
python3 scripts/live_vs_persisted_parity_smoke.py # 3/3 static
python3 scripts/worker_workload_forwarding_smoke.py # passing

Each smoke validates a specific architectural contract — see Agent Failure Diagnostics for what each one protects.
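The six invocations above can also be wrapped in a fail-fast gate. A minimal sketch, assuming you run it from the same ai-meta root so the scripts/ paths resolve:

```python
import subprocess

# The six prepared smokes, in the order listed above.
SMOKES = [
    ["python3", "scripts/agent_envelope_carveout_smoke.py"],
    ["python3", "scripts/gap41_diagnosis_wait_smoke.py"],
    ["python3", "scripts/auto_troubleshoot_smoke.py"],
    ["python3", "scripts/optional_ai_smoke.py"],
    ["python3", "scripts/live_vs_persisted_parity_smoke.py"],
    ["python3", "scripts/worker_workload_forwarding_smoke.py"],
]

def run_smokes(commands):
    """Run each smoke in order; return the first failing command, or None if all pass."""
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return cmd
    return None
```

run_smokes(SMOKES) returning None means the architectural surface is clean; a non-None return names the first contract that regressed.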

What you just exercised

The 12 events your spike produced demonstrate every architectural contract NoETL relies on for self-troubleshooting:

  • Agent envelope contract — the tool: agent framework=noetl step returned a full envelope with status, framework, entrypoint, error, and execution_id fields. Documented in Agent Failure Diagnostics.
  • Sub-execution wait-for-terminal — the dispatcher waited for the failing sub-playbook to reach terminal state before reading its result.
  • kind: agent tool_error carve-out — envelope.status: "error" is the contract describing the sub-execution's outcome, not a step-level failure of the dispatcher. See Agent Orchestration.
  • Auto-troubleshoot hook — on failure the executor dispatched the diagnose agent, waited for its terminal status, and fetched the persisted diagnosis from the persist_diagnosis step's events.
  • Event projection preservation — the worker preserved the full nested error.diagnosis._meta.diagnosis_fetch dict end-to-end through _extract_control_context.
  • Live-vs-persisted parity — the assertion script and the parity smoke both confirm the persisted shape matches the live response.

Troubleshooting

podman machine start hangs or fails. Re-init with a smaller memory allocation (8 GiB) to confirm the host has the resources, then bring it back to 16 GiB. If the machine state is split across ~/.local/share and ~/Library, unset XDG_DATA_HOME and run podman machine list to see which location is authoritative.

kind create cluster succeeds but kubectl connects to nothing. kind sometimes leaves stale kubeconfig entries. Run kind delete cluster --name=noetl-cluster and recreate. Confirm the kind-noetl-cluster context is current with kubectl config current-context.

Spike returns attempts > 5 on the first run. This is the cold-start signature — Ollama hadn't loaded gemma3:4b into resident memory before the diagnose call ran. Subsequent runs reuse warm state and drop to attempts ≤ 1. The cloud-latency profile is documented in Vertex AI Triage Backend.

/api/health returns 502 or never responds. The noetl-server pod crashed or never became Ready. kubectl -n noetl describe pod -l app=noetl-server will show the pull or startup failure. If GHCR is rate-limiting, wait a minute and retry — the bump_image GHCR probe typically catches this before the rollout times out.

ollama list is empty after pull completes. The Ollama pod restarted mid-pull and the partial download was discarded. kubectl -n noetl rollout restart deploy/ollama and re-pull. Models are stored on the pod's emptyDir; persisting them across pod restarts is part of Local Podman Kind Cluster.