Quickstart: from clone to a green spike e2e

This tutorial gets a complete NoETL stack running on your laptop and walks you through one full self-troubleshooting playbook execution. By the end you will have:

  • A Podman-backed kind cluster with NoETL server, worker, gateway, GUI, Postgres, NATS, and Ollama all running.
  • The gemma3:4b model loaded for the auto-troubleshoot triage path.
  • One spike e2e execution that reaches GREEN with structured diagnostic telemetry visible in persisted events.
  • All six prepared regression smokes passing.

Estimated time: 15–20 minutes if dependencies are already installed.

Prerequisites

  • macOS or Linux. Windows works through WSL2 but is not the primary path.
  • Podman 5.x. Do not use Docker Desktop or Colima for the canonical local path — see Local Podman Kind Cluster for the rationale.
  • kind v0.20+.
  • kubectl 1.28+.
  • Python 3.10+ for the smoke scripts.
  • The NoETL CLI — install via brew tap noetl/tap && brew install noetl or follow Quick Start.
  • About 16 GiB of memory for the Podman VM.
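If you want to sanity-check the CLIs before starting, a small preflight script can compare each tool against the minimums above. This is a convenience sketch, not part of NoETL — the preflight, parse_version, and meets_min names are invented here, and the version parsing is a loose heuristic:

```python
import re
import shutil
import subprocess

# Minimum (major, minor) versions from the prerequisites list above.
MINIMUMS = {"podman": (5, 0), "kind": (0, 20), "kubectl": (1, 28)}

def parse_version(text):
    """Extract the first dotted version, e.g. 'kind v0.20.0' -> (0, 20)."""
    m = re.search(r"v?(\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def meets_min(found, required):
    """True when a parsed version is at least the required (major, minor)."""
    return found is not None and found >= required

def preflight(minimums=MINIMUMS):
    """Report each prerequisite CLI as 'missing', 'too old', or 'ok'."""
    report = {}
    for tool, required in minimums.items():
        if shutil.which(tool) is None:
            report[tool] = "missing"
            continue
        args = [tool, "version", "--client"] if tool == "kubectl" else [tool, "--version"]
        out = subprocess.run(args, capture_output=True, text=True).stdout
        report[tool] = "ok" if meets_min(parse_version(out), required) else "too old"
    return report
```

Call preflight() from a REPL; anything other than ok for all three tools means fix that prerequisite first.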

Step 1 — Bootstrap the local cluster

The local automation assumes a Podman VM with /Volumes mounted and a clean XDG_DATA_HOME setting. The full rationale is in Local Podman Kind Cluster; the short version:

# Initialize a Podman machine sized for the full NoETL stack.
podman machine init \
  --memory=16384 \
  --cpus=4 \
  --disk-size=200 \
  --volume=/Volumes:/Volumes \
  noetl-dev

podman machine start noetl-dev

# Confirm the machine is healthy and the volume mount took.
podman machine list
podman machine ssh noetl-dev -- ls /Volumes

If your shell exports XDG_DATA_HOME to a custom location, unset it before running Podman/kind commands. Mixed locations split machine state and break podman machine list.

unset XDG_DATA_HOME

Now create the kind cluster:

kind create cluster --name=noetl-cluster
kubectl cluster-info --context=kind-noetl-cluster

kubectl get nodes should show one Ready node within ~30 seconds.

Step 2 — Deploy the NoETL stack

NoETL deploys go through the bump_image lifecycle agent rather than raw kubectl apply. This gives you the GHCR availability probe (so a release race fails fast instead of timing out a kubectl rollout) and the idempotent unchanged path for free. See Bump Image Lifecycle for the full operational contract.

For a fresh local cluster, the simplest path is the development bootstrap playbook from repos/ops:

# From the ai-meta workspace root.
noetl exec automation/development/noetl --runtime local --payload '{
  "namespace": "noetl",
  "noetl_image": "ghcr.io/noetl/noetl:v2.37.1",
  "gateway_image": "ghcr.io/noetl/gateway:v2.10.0",
  "gui_image": "ghcr.io/noetl/gui:v1.7.0"
}'

Wait until rollouts converge — 1–3 minutes on a warm host:

kubectl -n noetl rollout status deploy/noetl-server --timeout=180s
kubectl -n noetl rollout status deploy/noetl-worker --timeout=180s
kubectl -n noetl rollout status deploy/ollama-bridge --timeout=180s
kubectl -n noetl rollout status deploy/gateway --timeout=180s
kubectl -n noetl rollout status deploy/gui --timeout=180s

Quick health check:

kubectl -n noetl get pods
curl -sf http://localhost:8082/api/health

/api/health should return {"status":"ok"} once the noetl-server pod is Ready.
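If you script this step, a polling loop against the same endpoint avoids retrying curl by hand. This is an illustrative sketch, not a NoETL utility — health_ok and wait_for_health are names invented here, and the localhost:8082 URL assumes the port-forward shown above:

```python
import json
import time
import urllib.error
import urllib.request

def health_ok(body: bytes) -> bool:
    """True when the health payload reports status ok."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def wait_for_health(url="http://localhost:8082/api/health", timeout=180, interval=5):
    """Poll the health endpoint until it reports ok or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if health_ok(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False
```

wait_for_health() returns True as soon as the server answers {"status":"ok"}, so it doubles as a gate in bring-up scripts.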

Step 3 — Pull the triage model

The auto-troubleshoot path uses Ollama-hosted gemma3:4b as the default triage model. See Triage Model Selection for the rationale and tier comparison.

kubectl -n noetl exec deploy/ollama -- ollama pull gemma3:4b
kubectl -n noetl exec deploy/ollama -- ollama list

ollama list should show gemma3:4b at roughly 3.0 GB.
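If a script needs to gate on the model being present, you can parse the ollama list output. A sketch, assuming the usual tabular format with the model name in the first column (the exact layout can vary across Ollama versions):

```python
def model_present(listing: str, name: str) -> bool:
    """True when `name` appears in the first column of `ollama list` output."""
    for line in listing.splitlines()[1:]:  # first row is the NAME/ID/SIZE header
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False
```

Feed it the captured stdout of the kubectl exec ... ollama list command above and check model_present(out, "gemma3:4b") before moving on.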

Step 4 — Run the spike

The spike e2e (tests/spike/spike_e2e_test, which lives in repos/e2e/fixtures/playbooks/spike/spike_e2e_test.yaml) deliberately invokes a failing sub-playbook so the auto-troubleshoot path is exercised end-to-end:

EXEC_ID=$(noetl exec tests/spike/spike_e2e_test \
  --runtime distributed \
  --payload '{"escalate_to":"none"}' \
  --json | jq -r '.execution_id')
echo "execution: $EXEC_ID"

# Wait for terminal status — typically 8–15 seconds on local kind.
sleep 30
noetl status "$EXEC_ID" --json > /tmp/spike.json

noetl status is the canonical way to fetch execution state. It hits the same /api/executions/<id> endpoint the gateway exposes, but goes through the CLI's host/port resolution and credential plumbing, which keeps your scripts portable across local kind, GKE, and any future deployment topology without hardcoded URLs.
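The fixed sleep 30 above is the simplest option; if you want a proper wait, one approach is to poll noetl status until the execution is terminal. A sketch, with the caveat that the terminal status set below (completed, failed, error) is an assumption to verify against your NoETL version:

```python
import json
import subprocess
import time

# Assumed terminal states; check your deployment's actual status vocabulary.
TERMINAL = {"completed", "failed", "error"}

def is_terminal(status_doc: dict) -> bool:
    """True when the execution document reports a terminal status."""
    return str(status_doc.get("status", "")).lower() in TERMINAL

def wait_for_execution(exec_id: str, timeout=120, interval=2):
    """Poll `noetl status` until the execution reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(["noetl", "status", exec_id, "--json"],
                             capture_output=True, text=True)
        if out.returncode == 0:
            doc = json.loads(out.stdout)
            if is_terminal(doc):
                return doc
        time.sleep(interval)
    raise TimeoutError(f"execution {exec_id} not terminal after {timeout}s")
```

wait_for_execution(EXEC_ID) returns the same JSON document you would otherwise dump to /tmp/spike.json.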

Run the assertion script to confirm GREEN:

python3 scripts/spike_e2e_assert.py /tmp/spike.json

You should see "All checks passed. NoETL-as-AI-OS spike e2e smoke is GREEN." followed by "diagnosis source: ollama" and "diagnosis category: ...".

Step 5 — Read the telemetry

The spike's extract_envelope step pulls a useful summary into the parent result, but the canonical diagnostic data lives deeper in the persisted event stream at events[N].result.context.error.diagnosis._meta.diagnosis_fetch:

python3 - <<'PY'
import json

with open('/tmp/spike.json') as f:
    doc = json.load(f)

for evt in doc.get('events', []):
    diag_fetch = (
        evt.get('result', {})
        .get('context', {})
        .get('error', {})
        .get('diagnosis', {})
        .get('_meta', {})
        .get('diagnosis_fetch')
    )
    if diag_fetch:
        print(f"event {evt.get('event_id')} ({evt.get('node_name')}):")
        for k, v in diag_fetch.items():
            print(f"  {k} = {v}")
        break
PY

Expected output:

event 62123... (trigger_failure):
  poll_count = 1
  elapsed_seconds = 0.064
  deadline_seconds = 60.0
  hit_deadline = False

A poll_count of 1 with elapsed_seconds ≈ 0.06 is the warm-path signature: the diagnose sub-execution finished and its result was persisted before the noetl-side fetch loop reached its first sleep. See Vertex AI Triage Backend for the full cloud-vs-local profile and what cold-start numbers look like.
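To make the warm-vs-cold read mechanical, you could classify the diagnosis_fetch dict with a small helper. The thresholds here are illustrative only, not part of the NoETL telemetry contract:

```python
def classify_fetch(diag_fetch: dict) -> str:
    """Rough warm/cold classification of a diagnosis_fetch telemetry dict."""
    polls = diag_fetch.get("poll_count", 0)
    elapsed = diag_fetch.get("elapsed_seconds", 0.0)
    if diag_fetch.get("hit_deadline"):
        return "deadline"  # fetch loop gave up; diagnosis never persisted in time
    if polls <= 1 and elapsed < 1.0:
        return "warm"      # result persisted before the first sleep
    return "cold"          # model load or sub-execution latency dominated
```

On the run above, classify_fetch would return "warm"; a first run with a cold Ollama typically lands in "cold".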

Step 6 — Run the regression smokes

NoETL ships six prepared smoke regression detectors. Run them all to confirm the architectural surface is clean:

cd /Volumes/X10/projects/noetl/ai-meta

python3 scripts/agent_envelope_carveout_smoke.py # 8/8
python3 scripts/gap41_diagnosis_wait_smoke.py # 7/7
python3 scripts/auto_troubleshoot_smoke.py # 9/9
python3 scripts/optional_ai_smoke.py # 6/6
python3 scripts/live_vs_persisted_parity_smoke.py # 3/3 static
python3 scripts/worker_workload_forwarding_smoke.py # passing

Each smoke validates a specific architectural contract — see Agent Failure Diagnostics for what each one protects.
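The six invocations above can also be wrapped in a fail-fast gate. A minimal sketch, assuming you run it from the same ai-meta root so the scripts/ paths resolve:

```python
import subprocess

# The six prepared smokes, in the order listed above.
SMOKES = [
    ["python3", "scripts/agent_envelope_carveout_smoke.py"],
    ["python3", "scripts/gap41_diagnosis_wait_smoke.py"],
    ["python3", "scripts/auto_troubleshoot_smoke.py"],
    ["python3", "scripts/optional_ai_smoke.py"],
    ["python3", "scripts/live_vs_persisted_parity_smoke.py"],
    ["python3", "scripts/worker_workload_forwarding_smoke.py"],
]

def run_smokes(commands):
    """Run each smoke in order; return the first failing command, or None if all pass."""
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return cmd
    return None
```

run_smokes(SMOKES) returning None means the architectural surface is clean; a non-None return names the first contract that regressed.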

What you just exercised

The 12 events your spike produced demonstrate every architectural contract NoETL relies on for self-troubleshooting:

  • Agent envelope contract — the tool: agent framework=noetl step returned a full envelope with status, framework, entrypoint, error, and execution_id fields. Documented in Agent Failure Diagnostics.
  • Sub-execution wait-for-terminal — the dispatcher waited for the failing sub-playbook to reach terminal state before reading its result.
  • kind: agent tool_error carve-out — envelope.status: "error" is the contract describing the sub-execution's outcome, not a step-level failure of the dispatcher. See Agent Orchestration.
  • Auto-troubleshoot hook — on failure the executor dispatched the diagnose agent, waited for its terminal status, and fetched the persisted diagnosis from the persist_diagnosis step's events.
  • Event projection preservation — the worker preserved the full nested error.diagnosis._meta.diagnosis_fetch dict end-to-end through _extract_control_context.
  • Live-vs-persisted parity — the assertion script and the parity smoke both confirm the persisted shape matches the live response.

Troubleshooting

podman machine start hangs or fails. Re-init with a smaller memory allocation (8 GiB) to confirm the host has the resources, then bring it back to 16 GiB. If the machine state is split across ~/.local/share and ~/Library, unset XDG_DATA_HOME and run podman machine list to see which location is authoritative.

kind create cluster succeeds but kubectl connects to nothing. kind sometimes leaves stale kubeconfig entries. Run kind delete cluster --name=noetl-cluster and recreate. Confirm the kind-noetl-cluster context is current with kubectl config current-context.

Spike returns attempts > 5 on the first run. This is the cold-start signature — Ollama hadn't loaded gemma3:4b into resident memory before the diagnose call ran. Subsequent runs reuse warm state and drop to attempts ≤ 1. The cloud-latency profile is documented in Vertex AI Triage Backend.

/api/health returns 502 or never responds. The noetl-server pod crashed or never became Ready. kubectl -n noetl describe pod -l app=noetl-server will show the pull or startup failure. If GHCR is rate-limiting, wait a minute and retry — the bump_image GHCR probe typically catches this before the rollout times out.

ollama list is empty after pull completes. The Ollama pod restarted mid-pull and the partial download was discarded. kubectl -n noetl rollout restart deploy/ollama and re-pull. Models are stored on the pod's emptyDir; persisting them across pod restarts is part of Local Podman Kind Cluster.