Status Transition Matrix

This document defines the intended state ownership and transition model for NoETL status handling.

Use it for:

  • runtime lifecycle development
  • execution status API design
  • regression testing around premature completion
  • code review when status-related changes touch engine, worker, or API layers

Core Rule

Overall execution status must come only from the playbook execution lifecycle layer.

Lower-level events such as command.completed, call.done, step.exit, and batch.completed are observability or transport signals. They are not valid evidence that a playbook execution is terminal.


State Layers

| Layer | Scope | Owner | Authoritative storage | Terminal states | Must not decide |
|---|---|---|---|---|---|
| Tool outcome | Single tool invocation | Worker or tool executor | call.done or call.error payload | OK, ERROR, BREAK, NOOP | Playbook execution status |
| Command state | Single distributed command | Server and worker | command.* events | COMPLETED, FAILED, CANCELLED | Playbook execution status |
| Step state | Single playbook step | Engine | ExecutionState.completed_steps, step_results, current_step, step.exit | COMPLETED, FAILED, CASE_HANDLED | Playbook execution status |
| Workflow state | One workflow | Engine | ExecutionState.completed, ExecutionState.failed, workflow.* lifecycle events | COMPLETED, FAILED | Command completion semantics |
| Playbook execution state | Overall execution | Engine | ExecutionState.completed, ExecutionState.failed, playbook.* lifecycle events, execution.cancelled | COMPLETED, FAILED, CANCELLED | Lower-layer event heuristics |

Authoritative Execution Status Model

The execution status API should expose a single normalized lifecycle state:

  • PENDING
  • RUNNING
  • COMPLETED
  • FAILED
  • CANCELLED

Ownership

  • If live engine state exists, execution status comes from execution-level state.
  • If live engine state does not exist, execution status is reconstructed only from persisted execution lifecycle events:
    • playbook.completed
    • playbook.failed
    • execution.cancelled
  • No fallback may infer terminal execution status from:
    • batch.completed
    • command.completed
    • call.done
    • step.exit
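The fallback rule above can be sketched as a small function. This is only an illustration under stated assumptions, not the NoETL implementation: the function name reconstruct_status, the event_type field, and the "first terminal event wins" policy are hypothetical, and playbook.initialized is used here only to tell PENDING from RUNNING.

```python
from typing import Iterable, Mapping

# Persisted lifecycle events that are allowed to close an execution.
TERMINAL_EVENTS = {
    "playbook.completed": "COMPLETED",
    "playbook.failed": "FAILED",
    "execution.cancelled": "CANCELLED",
}


def reconstruct_status(events: Iterable[Mapping[str, str]]) -> str:
    """Rebuild execution status from persisted events when no live state exists."""
    status = "PENDING"
    for event in events:
        event_type = event.get("event_type", "")  # hypothetical field name
        if event_type in TERMINAL_EVENTS:
            # Assumption: the first terminal lifecycle event wins.
            return TERMINAL_EVENTS[event_type]
        if event_type == "playbook.initialized":
            status = "RUNNING"
        # batch.completed, command.completed, call.done, and step.exit
        # are deliberately ignored: they never change execution status.
    return status
```

With this shape, a log that contains only lower-layer events after playbook.initialized still reports RUNNING, which is the behaviour the ownership rules ask for.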

Transition Matrix

Playbook Execution

| From | Trigger | To | Owner | Persisted signal |
|---|---|---|---|---|
| PENDING | Playbook initialized | RUNNING | Engine | playbook.initialized |
| RUNNING | Engine emits terminal success | COMPLETED | Engine | playbook.completed |
| RUNNING | Engine emits terminal failure | FAILED | Engine | playbook.failed |
| RUNNING | Cancel API or runtime cancel path | CANCELLED | Server or engine | execution.cancelled |

Rules:

  • Terminal playbook states are exactly COMPLETED, FAILED, and CANCELLED.
  • Only the execution lifecycle layer may close an execution.
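One way to make the playbook execution table executable is a lookup keyed by current state and persisted signal. The sketch below is an assumption-laden illustration: the names PLAYBOOK_TRANSITIONS and next_state are hypothetical, and treating unknown signals as no-ops and terminal states as absorbing are choices made here, not behaviour confirmed by the engine.

```python
# (current state, persisted signal) -> next state, copied from the table above.
PLAYBOOK_TRANSITIONS = {
    ("PENDING", "playbook.initialized"): "RUNNING",
    ("RUNNING", "playbook.completed"): "COMPLETED",
    ("RUNNING", "playbook.failed"): "FAILED",
    ("RUNNING", "execution.cancelled"): "CANCELLED",
}

TERMINAL_STATES = frozenset({"COMPLETED", "FAILED", "CANCELLED"})


def next_state(current: str, signal: str) -> str:
    """Return the state after applying one persisted signal."""
    if current in TERMINAL_STATES:
        # Assumption: once terminal, later signals cannot reopen the execution.
        return current
    # Assumption: signals outside the table (for example command.completed)
    # leave the execution state unchanged.
    return PLAYBOOK_TRANSITIONS.get((current, signal), current)
```

The workflow, step, and command tables could be encoded the same way, each with its own transition map so that one layer's signals never appear as keys in another layer's table.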

Workflow

| From | Trigger | To | Owner | Persisted signal |
|---|---|---|---|---|
| PENDING | Workflow starts | RUNNING | Engine | workflow.initialized |
| RUNNING | Engine determines all required work is complete | COMPLETED | Engine | workflow.completed |
| RUNNING | Engine determines unrecoverable failure | FAILED | Engine | workflow.failed |

Rules:

  • Workflow lifecycle may inform playbook lifecycle.
  • Workflow lifecycle must not be reconstructed from command flushes or worker ack patterns.

Step

| From | Trigger | To | Owner | Persisted signal |
|---|---|---|---|---|
| PENDING | Step scheduled or entered | RUNNING | Engine or worker | step.enter or command issuance |
| RUNNING | Tool succeeds | COMPLETED | Engine | call.done then step.exit |
| RUNNING | Tool fails without handled recovery | FAILED | Engine | call.error then step.exit |
| RUNNING | Case policy consumes result | CASE_HANDLED | Engine or worker | step.exit with CASE_HANDLED |

Rules:

  • Step completion is structural and local.
  • Loop and task-sequence steps require special handling so per-iteration events do not close the full execution.

Command

| From | Trigger | To | Owner | Persisted signal |
|---|---|---|---|---|
| PENDING | Command created | ISSUED | Server | command.issued |
| ISSUED | Worker claims | CLAIMED | Worker and server | command.claimed |
| CLAIMED | Worker begins work | RUNNING | Worker | command.started |
| RUNNING | Worker completes | COMPLETED | Worker | command.completed |
| RUNNING | Worker hard fails | FAILED | Worker | command.failed |
| ISSUED or CLAIMED | Execution cancelled | CANCELLED | Server or reaper | command.cancelled |

Rules:

  • Command state answers only whether the command finished.
  • A fully completed command set does not imply playbook completion unless the engine emits execution lifecycle completion.

Tool or Task Outcome

| From | Trigger | To | Owner | Signal |
|---|---|---|---|---|
| PENDING | Tool invoked | RUNNING | Worker | In-flight only |
| RUNNING | Tool returns ok result | OK | Worker | call.done |
| RUNNING | Tool returns error | ERROR | Worker | call.error |
| RUNNING | Task-sequence breaks or paginates | BREAK or NOOP | Worker | Payload only |

Rules:

  • Tool outcome feeds routing and step materialization.
  • Tool outcome must never be exposed as overall execution status.

Validation Against Current Code

Canonical execution vocabulary is split

Current code in noetl/core/status.py defines canonical statuses including STARTED, RUNNING, PAUSED, PENDING, FAILED, and COMPLETED, but execution APIs also use CANCELLED.

Implication:

  • CANCELLED is already a real terminal execution state.
  • It should also be a first-class member of the canonical status utility.

Step status strings are not normalized

Current code in noetl/worker/v2_worker_nats.py emits mixed step.exit.status values including completed, COMPLETED, failed, FAILED, and CASE_HANDLED.

Implication:

  • Step observability status is inconsistent.
  • That inconsistency is survivable only if step state remains non-authoritative for execution status.
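If the mixed-case values are ever normalized at the point where step.exit is read, a helper along these lines would be enough. This is a sketch, not existing NoETL code: the name normalize_step_status is hypothetical, and rejecting unknown values with ValueError is an assumption about the desired strictness.

```python
# Step-level terminal statuses from the State Layers table.
STEP_STATUSES = frozenset({"COMPLETED", "FAILED", "CASE_HANDLED"})


def normalize_step_status(raw: str) -> str:
    """Map mixed-case step.exit status strings onto one canonical spelling."""
    normalized = raw.strip().upper()
    if normalized not in STEP_STATUSES:
        # Assumption: unknown values should fail loudly rather than pass through.
        raise ValueError(f"unknown step status: {raw!r}")
    return normalized
```

Normalizing here only tidies observability output. The result is still a step-level value and must not be promoted to execution status.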

Engine owns execution completion, but APIs still infer it from lower layers

Current code:

  • noetl/core/dsl/v2/engine.py
    • sets state.completed = True when structural completion is reached
    • emits workflow.completed and playbook.completed
    • sets state.failed = True and emits terminal failure lifecycle events
  • noetl/server/api/execution/endpoint.py
    • _infer_execution_completion_from_events() marks execution COMPLETED from lower-layer events
  • noetl/server/api/v2.py
    • get_execution_status() repeats similar inference in both fallback and live-state mode

Implication:

  • Execution status is currently being reconstructed from command and step signals.
  • That violates the intended ownership model.

batch.completed is being treated as execution-terminal evidence

Current code in noetl/server/api/execution/endpoint.py and noetl/server/api/v2.py uses batch.completed as part of completion inference.

Implication:

  • batch.completed is a transport or control-plane signal.
  • It does not prove that the playbook lifecycle is terminal.

command.completed, call.done, and step.exit are elevated via end heuristics

Current code in noetl/server/api/execution/endpoint.py and noetl/server/api/v2.py treats several lower-layer events as terminal evidence when attached to node_name == "end".

Implication:

  • These events can occur before more work is issued.
  • They must not close the execution.

Status endpoints return execution variables

Current code in noetl/server/api/v2.py returns execution variables in both full and compact status responses.

Implication:

  • Execution status should report lifecycle state, not execution payload or context.
  • Any variable or debug view should be separate from the status API.

GET /executions/{id}/status should report only execution lifecycle state and a minimal operational view.

Minimal Response Model

| Field | Meaning | Source |
|---|---|---|
| execution_id | Execution identifier | Execution record |
| state | PENDING, RUNNING, COMPLETED, FAILED, or CANCELLED | Execution owner |
| current_step | Current structural step | ExecutionState.current_step |
| started_at | Execution start time | Persisted lifecycle |
| ended_at | Terminal time if terminal | Persisted lifecycle |
| terminal_event | Exact lifecycle event if terminal | Persisted lifecycle |
| completion_inferred | Fallback-only signal | API fallback path |
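The response model above might be expressed as a frozen dataclass. The field names come from the table; everything else is an assumption made for illustration: the class name ExecutionStatusResponse, timestamps as ISO 8601 strings, completion_inferred as a boolean, and the validation that terminal_event is set only for terminal states.

```python
from dataclasses import asdict, dataclass
from typing import Optional

VALID_STATES = frozenset({"PENDING", "RUNNING", "COMPLETED", "FAILED", "CANCELLED"})
TERMINAL_STATES = frozenset({"COMPLETED", "FAILED", "CANCELLED"})


@dataclass(frozen=True)
class ExecutionStatusResponse:
    """Hypothetical minimal status payload; note there is no variables field."""

    execution_id: str
    state: str
    current_step: Optional[str] = None
    started_at: Optional[str] = None      # assumed ISO 8601 string
    ended_at: Optional[str] = None        # assumed ISO 8601 string
    terminal_event: Optional[str] = None  # e.g. "playbook.completed"
    completion_inferred: bool = False     # assumed boolean flag

    def __post_init__(self) -> None:
        if self.state not in VALID_STATES:
            raise ValueError(f"invalid execution state: {self.state!r}")
        if self.terminal_event is not None and self.state not in TERMINAL_STATES:
            # Assumption: a terminal event only makes sense on a terminal state.
            raise ValueError("terminal_event set on a non-terminal state")
```

For example, asdict(ExecutionStatusResponse(execution_id="exec-1", state="RUNNING", current_step="load")) yields exactly the seven fields from the table and nothing from the execution context.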

Strict Rules

  • Do not inspect state.variables to answer status.
  • Do not derive execution state from command.completed.
  • Do not derive execution state from step.exit.
  • Do not derive execution state from call.done.
  • Do not derive execution state from batch.completed.
  • Only use lifecycle events as persisted truth when reconstructing status without live state.

Development Checklist

When changing status logic:

  • confirm the change preserves layer ownership
  • confirm no lower-layer event can close an execution
  • confirm CANCELLED is handled as a first-class execution terminal state
  • confirm status endpoints do not expose execution variables
  • confirm engine completion is emitted exactly once through lifecycle events

Testing Checklist

Positive Cases

  • execution remains RUNNING while commands continue to be issued
  • execution becomes COMPLETED only after playbook.completed
  • execution becomes FAILED only after playbook.failed
  • execution becomes CANCELLED only after execution.cancelled

Negative Cases

  • command.completed on the final visible step does not force execution COMPLETED
  • step.exit on a loop or task-sequence iteration does not force execution COMPLETED
  • batch.completed with pending_count == 0 does not force execution COMPLETED
  • mixed-case step status values do not change execution status semantics
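Several of the positive and negative cases can be written as a table-driven regression check. The sketch below is illustrative only: status_from_events is a hypothetical stand-in for whatever status path is under test, and events are reduced to plain event-type strings for brevity.

```python
TERMINAL_EVENTS = {
    "playbook.completed": "COMPLETED",
    "playbook.failed": "FAILED",
    "execution.cancelled": "CANCELLED",
}


def status_from_events(event_types):
    """Hypothetical stand-in for the status path under test."""
    status = "PENDING"
    for event_type in event_types:
        if event_type in TERMINAL_EVENTS:
            return TERMINAL_EVENTS[event_type]
        if event_type == "playbook.initialized":
            status = "RUNNING"
    return status


START = ["playbook.initialized"]

# (case name, persisted event types, expected execution status)
CASES = [
    ("command.completed on final visible step", START + ["command.completed"], "RUNNING"),
    ("step.exit on loop iterations", START + ["step.exit", "step.exit"], "RUNNING"),
    ("batch.completed with nothing pending", START + ["batch.completed"], "RUNNING"),
    ("playbook.completed closes execution", START + ["command.completed", "playbook.completed"], "COMPLETED"),
    ("playbook.failed closes execution", START + ["call.done", "playbook.failed"], "FAILED"),
    ("execution.cancelled closes execution", START + ["execution.cancelled"], "CANCELLED"),
]


def failing_cases():
    """Return the names of cases whose actual status differs from the expected one."""
    return [name for name, events, expected in CASES
            if status_from_events(events) != expected]
```

An empty list from failing_cases() means every case holds. In a real suite the stand-in would be replaced by a call into the status API, and the same case table could drive a parametrized test.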

Maintenance Rule

When status semantics change in noetl, update this document in the docs repo and add or refresh an ai-meta memory entry that points future development and test work back here.