PRD: Reference-Only Event Results and Worker-Owned Data Plane
1. Context
Tracking issue: AHM-4124
Trigger: Final Jira comment (focusedCommentId=48137) and production runtime pressure caused by oversized event payloads.
Current pain:
- Large tool outputs are still flowing through worker -> server API -> `event.result`.
- This bloats the event table, increases serialization/deserialization overhead, and drives memory pressure in server/worker paths.
- Status queries should evaluate execution state, not transport full payload bodies.
2. Problem Statement
NoETL control-plane events are mixing two concerns:
- Control-plane state (success/failure/routing metadata).
- Data-plane payloads (large result bodies).
This coupling causes:
- DB bloat in `event.result`.
- Slow execution/status reads.
- Increased OOM risk under heavy playbooks.
- Tight coupling between worker progress and server payload handling.
3. Product Goal
Make NoETL runtime strictly reference-only for the worker/server contract:
- `event.result` stores control-plane information only:
  - status / error envelope
  - context fields required for conditional logic
  - reference pointers to the full payload
- Full payload is persisted directly by worker to result storage.
- Server API handles state transitions and references, not bulk data transport.
4. Non-Goals
- No redesign of playbook DSL semantics.
- No change to credential/keychain storage model.
- No change to user-facing success/failure semantics.
5. Success Criteria
Functional:
- No new large payload blobs are written into `event.result`.
- Conditional expressions continue working using context fields and references.
- Worker can retrieve prior step outputs by resolving references (not via embedded payload).
Performance/SRE:
- Reduce average `event.result` byte size by >= 90% for heavy executions.
- Reduce status API payload and query cost for high-volume executions.
- Eliminate payload-induced server OOM scenarios attributable to event-result body inflation.
6. User Stories
- As an operator, I can query execution status quickly without loading large tool bodies.
- As a playbook author, I can use `when` expressions with context fields exactly as today.
- As a runtime worker, I can persist/reload full result data via references across retries, loops, and pagination.
- As a security reviewer, I can confirm references contain no raw credentials or secrets.
7. Requirements
7.1 Control-plane event contract
`event.result` is an object that carries reference and context attributes for runtime control-plane behavior.
`event.result` must contain:
- `status`: `ok|error|skipped|break|retry|...`
- `error`: normalized error envelope (if present)
- `reference` (optional): a reference object or array of references (manifest), present only when data is persisted by a storage tool
- `context` (optional): size-limited scalar/object fields used by routing and templates
- `meta`: bytes/hash/content-type/store/scope
`event.result` must not contain output data at all, regardless of size. It is reference-only.
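The 7.1 contract above can be illustrated with a minimal sketch. All concrete values below are hypothetical examples, not production field values:

```python
# Illustrative v2 event.result envelope (reference-only contract).
# Field names follow section 7.1; the values shown are made-up examples.
event_result = {
    "result_schema_version": 2,
    "status": "ok",                      # ok|error|skipped|break|retry|...
    "error": None,                       # normalized error envelope when status == "error"
    "reference": {                       # present only when a storage tool persisted output
        "type": "relational",
        "db_url": "postgresql://db.internal:5432/noetl",
        "schema": "results",
        "table": "step_output",
        "record_id": "b7f3c2e0",
        "auth_reference": "keychain/pg-results-ro",  # credential record ID, never a secret
    },
    "context": {"row_count": 1250, "last_page": False},  # size-limited routing fields
    "meta": {
        "bytes": 1048576,
        "sha256": "<sha256-of-payload>",  # integrity hash of the stored payload
        "content_type": "application/json",
        "store": "postgres",
        "scope": "execution",
    },
}
```

Note that the envelope carries the payload's location and integrity metadata, but no payload body under any key.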
7.2 Worker-owned persistence
Worker must:
- Pass tool output between tool items inside the same step pipeline.
- If a downstream storage tool is defined (for example Postgres), persist output there and build `reference` from that storage location.
- If no storage tool is defined, omit output data entirely and emit execution status only.
- Build/update `context` from actual result data on the worker side before event emission.
- Emit only control-plane metadata in completion events: status + optional `reference` + optional `context`.
- Resolve references explicitly when a downstream task needs the full body.
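The emission rule above can be sketched as follows. The names `build_completion_event` and `StorageRef` are hypothetical illustrations, not existing NoETL APIs:

```python
# Hedged sketch of the worker-side emission rule in 7.2.
from dataclasses import dataclass, asdict
from typing import Any, Optional

@dataclass
class StorageRef:
    type: str       # relational|nats|object_store
    locator: dict   # store-specific fields (db/schema/table/record_id, subject/key, URL)

def build_completion_event(status: str,
                           output: Any,
                           storage_ref: Optional[StorageRef],
                           context_fields: dict) -> dict:
    """Emit control-plane metadata only; never embed the output body."""
    event_result: dict = {"status": status}
    if storage_ref is not None:
        # A storage tool persisted the output; ship its location, not the data.
        event_result["reference"] = asdict(storage_ref)
    if context_fields:
        # context is derived from the actual result on the worker side, size-limited.
        event_result["context"] = context_fields
    # `output` is intentionally dropped: with no storage tool, only status survives.
    return event_result

# Usage: HTTP tool output persisted by a downstream Postgres storage tool.
ref = StorageRef("relational",
                 {"schema": "results", "table": "step_output", "record_id": "42"})
evt = build_completion_event("ok", {"rows": [1, 2, 3]}, ref, {"row_count": 3})
```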
7.3 Server responsibilities
Server must:
- Accept and persist only reference-only events; reject event payloads that include output data.
- Use context fields for state transitions and `when` evaluation context.
- Return only state + optional reference/context in execution/status APIs; never hydrate output payload bodies on status paths.
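A minimal sketch of the server-side ingest guard, assuming a strict top-level key whitelist; `ALLOWED_KEYS` and `IngestError` are illustrative names, not existing server code:

```python
# Hedged sketch of the 7.3 ingest guard: reject any event.result that
# carries output data alongside the control-plane fields.
ALLOWED_KEYS = {"result_schema_version", "status", "error",
                "reference", "context", "meta"}

class IngestError(ValueError):
    """Raised when an event violates the reference-only contract."""

def validate_reference_only(event_result: dict) -> None:
    unknown = set(event_result) - ALLOWED_KEYS
    if unknown:
        # Any extra top-level field is treated as an inline payload attempt.
        raise IngestError(f"reference-only contract violation: {sorted(unknown)}")
    if "status" not in event_result:
        raise IngestError("missing control-plane status")
```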
7.4 Reference guarantees
Every stored reference must include:
- `type` (required): `relational|nats|object_store`
- `auth_reference` (optional): reference to the auth/keychain record ID used to access the data location
- relational reference fields when `type=relational`: `db_url`, `schema`, `table`, `record_id` (or equivalent key fields)
- NATS reference fields when `type=nats`: subject/bucket + key/stream locator; payload size must be < 1MB
- object store reference fields when `type=object_store`: direct object URL
- integrity metadata (`bytes`, `sha256`, `content_type`, `compression`)
- ttl/scope policy: for `nats` and `object_store`, TTL should be explicitly configurable; if omitted, use storage default/unlimited behavior. No TTL policy is required for relational references.
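The per-type field requirements above can be sketched as a small validator. The field sets follow section 7.4; the validator itself and the NATS locator field names (`subject`, `key`) are illustrative assumptions:

```python
# Hedged sketch of per-type reference validation per section 7.4.
REQUIRED_FIELDS = {
    "relational": {"db_url", "schema", "table", "record_id"},
    "nats": {"subject", "key"},       # subject/bucket + key/stream locator (assumed names)
    "object_store": {"url"},          # direct object URL
}
NATS_MAX_BYTES = 1 << 20              # NATS reference payload must be < 1MB

def validate_reference(ref: dict) -> None:
    rtype = ref.get("type")
    if rtype not in REQUIRED_FIELDS:
        raise ValueError(f"unknown reference type: {rtype}")
    missing = REQUIRED_FIELDS[rtype] - set(ref)
    if missing:
        raise ValueError(f"missing fields for {rtype}: {sorted(missing)}")
    if rtype == "nats" and ref.get("bytes", 0) >= NATS_MAX_BYTES:
        raise ValueError("NATS reference payload must be < 1MB")
```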
7.5 Security and compliance
- No credentials/tokens in context fields.
- `auth_reference` may point to credential/keychain records by ID only (never inline secrets).
- Preserve existing auth controls for resolved data access.
8. Target Architecture
8.1 Data-plane vs control-plane split
- Control-plane: server, execution state machine, event ingestion, routing metadata.
- Data-plane: worker result storage and retrieval via reference resolver.
8.2 Write path (target)
- Task runs in worker.
- If the step pipeline includes a storage tool (for example Postgres), the worker stores output there and constructs `reference` (db/schema/table/record details).
- If no storage tool exists in the pipeline, the worker omits the output payload and keeps only execution status (and optional context).
- Worker emits an event with status + optional `reference` + optional `context`.
- Server persists the compact event and updates projections.
8.3 Read path (target)
- Status APIs return compact state + reference/context values.
- Full payload retrieval requires an explicit resolve call/tool using `reference`.
9. Schema and API Changes
9.1 Event payload schema (versioned)
Introduce `result_schema_version: 2` with strict shape validation.
Cutover policy:
- Worker/server communication is v2-only after rollout gate.
- No dual-read compatibility mode for legacy large-payload event contracts.
- Existing legacy heavy rows are removed from the event table during migration.
9.2 Reference indexing policy
- No dedicated reference projection table is required.
- Event table is the source for reference lookup and resolver/index operations.
- Add/confirm event-table indexes on execution id + step/task keys + event timestamp/type as needed.
9.3 API behavior
- Status/execution endpoints must not hydrate full data bodies.
- Add/standardize the resolver endpoint/tool contract for `reference` retrieval.
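One way to picture the resolver contract is a dispatch over reference types. This is a sketch only: the `fetch_*` functions are hypothetical placeholders marking the seams where store-specific clients would plug in:

```python
# Hedged sketch of resolver dispatch for the 9.3 resolver contract.
# The fetch_* stubs stand in for real store-specific clients.
def fetch_relational(ref: dict) -> bytes:
    return b"<row body>"        # would SELECT by record_id

def fetch_nats(ref: dict) -> bytes:
    return b"<kv body>"         # would get by subject + key

def fetch_object(ref: dict) -> bytes:
    return b"<object body>"     # would GET the direct object URL

DISPATCH = {
    "relational": fetch_relational,
    "nats": fetch_nats,
    "object_store": fetch_object,
}

def resolve(reference: dict) -> bytes:
    """Fetch the full payload body that a reference points to."""
    rtype = reference.get("type")
    fetch = DISPATCH.get(rtype)
    if fetch is None:
        raise ValueError(f"unresolvable reference type: {rtype}")
    return fetch(reference)
```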
10. Migration Strategy
Phase 0: Prep
- Add telemetry for `event.result` byte size distributions and payload-in-event violations.
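The Phase 0 telemetry probe could look roughly like this; the threshold value is an assumed example, not a decided limit:

```python
# Hedged sketch of a Phase 0 probe: measure serialized event.result size
# and flag payload-in-event violations. The threshold is an assumption.
import json

VIOLATION_THRESHOLD_BYTES = 16 * 1024   # example threshold, not a decided limit

def measure_event_result(event_result: dict) -> tuple[int, bool]:
    """Return (serialized byte size, violation flag) for one event.result."""
    size = len(json.dumps(event_result).encode("utf-8"))
    return size, size > VIOLATION_THRESHOLD_BYTES
```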
Phase 1: Hard cutover to reference-only v2
- Deploy worker and server together with v2-only contract.
- Server status endpoints ignore/avoid full payload fields.
- Expressions consume context/reference-aware state only.
Phase 2: Legacy data purge
- Delete legacy heavy rows from the `event` table that contain inline payload bodies not needed for control-plane state.
- Keep only compact event rows required for current execution/state tracking and audit policy.
Phase 3: Enforcement
- Hard-reject oversized/noncompliant inline payload events at ingest.
- Enable alerting on any contract violations.
11. Acceptance Criteria
- A heavy playbook run completes with no large payloads in `event.result`.
- Loop/pagination/retry flows remain correct when only references are propagated.
- Execution status latency and memory profile improve versus baseline.
- No credential leaks detected in event rows or references.
12. Test Plan
Unit:
- Reference envelope schema validation.
- Context-field construction and size limits.
Integration:
- End-to-end heavy playbook with large pages.
- Retry + pagination + loop with reference-only propagation.
- Resolver correctness and auth checks.
Load/soak:
- Concurrent executions with large payload generation.
- Measure DB growth, API latency, memory usage, and OOM behavior.
13. Risks and Mitigations
- Risk: Broken expressions if context fields are incomplete.
- Mitigation: context contract tests + migration guardrails.
- Risk: Reference resolution latency.
- Mitigation: store selection tuning, caching, manifest strategy.
- Risk: Rollout mismatch between worker and server versions.
- Mitigation: coordinated cutover gate and version precheck before enabling traffic.
14. Rollout and Observability
Metrics:
- `event_result_bytes` (p50/p95/p99)
- count of inline payload events over threshold
- resolver latency/error rate
- status API latency + memory
- worker/server OOM and restart rates
Dashboards/alerts:
- SLO alert on payload-in-event violations.
- SLO alert on status latency regression.
15. Dependencies
- Runtime event model and result storage standards:
  - docs/reference/dsl/runtime_results.md
  - docs/reference/result_storage.md
  - docs/reference/tempref_storage.md
16. Decisions (Resolved)
- Step-pipeline storage behavior:
- In a `tool` list, output from earlier tool items is passed to later tool items.
- If a storage tool exists (for example Postgres as the second item after HTTP), output is stored there and storage location details are sent as `reference` in `event.result`.
- If no storage tool exists, output data is omitted and only execution status is written to `event.result`.
- Reference lookup model:
- No dedicated reference projection table; event table is sufficient.
- TTL policy:
- No TTL policy required for relational references.
- For NATS/object-store references, TTL should be parameterized; if not set, use storage-system default/unlimited behavior.
- Ingest enforcement:
- Server hard-rejects noncompliant inline payload events (no auto-externalize fallback in server ingest path).
17. Implementation Sequencing (Post-PRD)
- Define and freeze v2 result envelope contract.
- Implement worker data-plane writer + reference emitter.
- Implement server compact ingest + status path hardening.
- Implement resolver contract.
- Execute legacy event-row purge plan and retention guardrails.
- Run heavy-playbook validation and ship progressively.