Performance Analysis

Guide for diagnosing and resolving playbook execution performance issues.

Quick Start

Immediate Diagnosis (PostgreSQL Only)

No ClickHouse required - analyze directly against PostgreSQL:

# Find bottlenecks (time gaps between events)
noetl run automation/observability/event-sync.yaml --set action=analyze-gaps --set since_hours=1

# Find slowest executions
noetl run automation/observability/event-sync.yaml --set action=analyze-slow --set since_hours=1

Full Analysis (With ClickHouse)

# 1. Deploy ClickHouse (if not running)
noetl run automation/infrastructure/clickhouse.yaml --set action=deploy

# 2. Sync events to ClickHouse
noetl run automation/observability/event-sync.yaml --set action=sync --set since_hours=24

# 3. Run analysis queries
kubectl exec -i -n clickhouse <pod> -- clickhouse-client < ci/manifests/clickhouse/performance-queries.sql

Event Sync Playbook

The event sync playbook (automation/observability/event-sync.yaml) provides tools for performance analysis.
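
The playbook documents itself; the help action prints usage information without needing ClickHouse:

noetl run automation/observability/event-sync.yaml --set action=help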

Available Actions

Action          Description                       Requires ClickHouse
help            Show usage information            No
sync            Sync events to ClickHouse         Yes
sync-recent     Alias for sync                    Yes
count           Count events in both databases    Yes
verify          Verify sync status                Yes
analyze-gaps    Find time gaps (bottlenecks)      No
analyze-slow    Find slowest executions           No

Options

Option         Default     Description
since_hours    24          Time range to analyze
batch_size     1000        Max events per sync batch
pg_auth        pg_local    PostgreSQL auth profile

Examples

# Analyze last hour
noetl run automation/observability/event-sync.yaml \
--set action=analyze-gaps \
--set since_hours=1

# Sync with larger batch
noetl run automation/observability/event-sync.yaml \
--set action=sync \
--set since_hours=48 \
--set batch_size=5000

# Use k8s postgres credentials
noetl run automation/observability/event-sync.yaml \
--set action=analyze-slow \
--set pg_auth=pg_k8s

Performance Logs

The orchestrator emits structured [PERF] logs for real-time monitoring.

Viewing Logs

# View all performance logs
kubectl logs -n noetl deployment/noetl-server | grep '\[PERF\]'

# Watch live
kubectl logs -n noetl deployment/noetl-server -f | grep '\[PERF\]'

# Filter by phase
kubectl logs -n noetl deployment/noetl-server | grep 'evaluate_execution'

Metrics Logged

Metric                          Description                    Target
get_execution_state_batch       State query time               <5ms
get_transition_context_batch    Transition query time          <10ms
catalog_fetch                   Playbook fetch (cache miss)    <100ms
evaluate_execution              Total orchestration time       <50ms

Example Output

[PERF] get_execution_state_batch: 2.3ms
[PERF] get_transition_context_batch: 3.1ms
[PERF] catalog_fetch: 1.2ms (catalog_id=12345)
[PERF] evaluate_execution total: 8.5ms

Warning Thresholds

The server logs a warning when any of these thresholds is exceeded:

  • evaluate_execution > 500ms
  • catalog_fetch > 100ms
  • State reconstruction with >50 events
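
To surface only the slow entries, the timing can be parsed out of the [PERF] lines. A minimal sketch, assuming the log format shown in the example output above:

# Print evaluate_execution timings above the 500ms threshold
kubectl logs -n noetl deployment/noetl-server \
  | grep '\[PERF\] evaluate_execution total' \
  | awk -F': ' '$NF + 0 > 500'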

Direct SQL Analysis

PostgreSQL Queries

Connect to PostgreSQL:

kubectl exec -it -n postgres <pod> -- psql -U noetl -d noetl

Slowest Executions

SELECT
    e.execution_id,
    c.path as playbook_path,
    COUNT(*) as event_count,
    MIN(e.created_at) as started_at,
    ROUND(EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at)))::numeric, 2) as duration_seconds,
    COUNT(DISTINCT e.node_name) as unique_steps
FROM noetl.event e
LEFT JOIN noetl.catalog c ON e.catalog_id = c.catalog_id
WHERE e.created_at > NOW() - INTERVAL '1 hour'
GROUP BY e.execution_id, c.path
ORDER BY duration_seconds DESC
LIMIT 20;

Event Transition Delays

WITH transitions AS (
    SELECT
        execution_id,
        event_type,
        created_at,
        LAG(event_type) OVER (PARTITION BY execution_id ORDER BY created_at) as prev_event,
        LAG(created_at) OVER (PARTITION BY execution_id ORDER BY created_at) as prev_time
    FROM noetl.event
    WHERE created_at > NOW() - INTERVAL '1 hour'
)
SELECT
    prev_event || ' -> ' || event_type as transition,
    COUNT(*) as occurrences,
    ROUND(AVG(EXTRACT(EPOCH FROM (created_at - prev_time)))::numeric, 3) as avg_sec,
    ROUND(percentile_cont(0.50) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (created_at - prev_time)))::numeric, 3) as p50_sec,
    ROUND(percentile_cont(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (created_at - prev_time)))::numeric, 3) as p95_sec
FROM transitions
WHERE prev_event IS NOT NULL
GROUP BY prev_event, event_type
HAVING COUNT(*) > 5
ORDER BY avg_sec DESC
LIMIT 15;

Bottleneck Detection (>2 second gaps)

WITH event_gaps AS (
    SELECT
        execution_id,
        event_type,
        node_name,
        created_at,
        LAG(created_at) OVER (PARTITION BY execution_id ORDER BY created_at) as prev_time,
        LAG(event_type) OVER (PARTITION BY execution_id ORDER BY created_at) as prev_event
    FROM noetl.event
    WHERE created_at > NOW() - INTERVAL '1 hour'
)
SELECT
    execution_id,
    prev_event as from_event,
    event_type as to_event,
    node_name,
    ROUND(EXTRACT(EPOCH FROM (created_at - prev_time))::numeric, 2) as gap_seconds
FROM event_gaps
WHERE created_at - prev_time > INTERVAL '2 seconds'
ORDER BY gap_seconds DESC
LIMIT 30;

Event Count per Execution

SELECT
    execution_id,
    COUNT(*) as total_events,
    MAX(created_at) - MIN(created_at) as duration
FROM noetl.event
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY execution_id
ORDER BY total_events DESC
LIMIT 20;

ClickHouse Queries

Connect to ClickHouse:

kubectl exec -it -n clickhouse <pod> -- clickhouse-client

Full query library: ci/manifests/clickhouse/performance-queries.sql

Key Queries

Execution Duration Distribution:

SELECT
    EventType,
    count() AS event_count,
    round(avg(Duration), 2) AS avg_ms,
    round(quantile(0.50)(Duration), 2) AS p50_ms,
    round(quantile(0.95)(Duration), 2) AS p95_ms,
    max(Duration) AS max_ms
FROM observability.noetl_events
WHERE Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY EventType
ORDER BY avg_ms DESC;

Step Performance:

SELECT
    StepName,
    count() AS occurrences,
    round(avg(Duration), 2) AS avg_ms,
    round(quantile(0.95)(Duration), 2) AS p95_ms
FROM observability.noetl_events
WHERE Timestamp >= now() - INTERVAL 24 HOUR
    AND StepName != ''
    AND Duration > 0
GROUP BY StepName
ORDER BY avg_ms DESC
LIMIT 20;

Key Performance Indicators

Healthy System

Metric                                 Value
command.completed → command.claimed    <100ms
evaluate_execution total               <50ms
Playbook execution time                ~2-5 seconds for simple playbooks
Event count per execution              <100 for typical workflows

Warning Signs

Symptom                                     Possible Cause           Solution
command.completed → command.claimed > 1s    Orchestrator overload    Check PERF logs, scale workers
High event count per execution              Complex workflows        Consider splitting playbooks
catalog_fetch > 100ms                       Cache miss               Normal on first access
evaluate_execution > 500ms                  State reconstruction     Archive old executions
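
To spot-check the first KPI directly, the transition-delay query above can be narrowed to that event pair. A sketch, assuming the event_type values are stored literally as command.completed and command.claimed:

kubectl exec -i -n postgres <pod> -- psql -U noetl -d noetl <<'SQL'
-- Average command.completed -> command.claimed latency over the last hour
-- (event_type values are assumed; adjust if they differ in your install)
WITH t AS (
    SELECT
        event_type,
        created_at,
        LAG(event_type) OVER w AS prev_event,
        LAG(created_at) OVER w AS prev_time
    FROM noetl.event
    WHERE created_at > NOW() - INTERVAL '1 hour'
    WINDOW w AS (PARTITION BY execution_id ORDER BY created_at)
)
SELECT
    COUNT(*) AS occurrences,
    ROUND(AVG(EXTRACT(EPOCH FROM (created_at - prev_time)))::numeric * 1000, 1) AS avg_ms
FROM t
WHERE prev_event = 'command.completed'
  AND event_type = 'command.claimed';
SQL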

Troubleshooting

Slow Playbook Execution

  1. Check orchestrator performance:

    kubectl logs -n noetl deployment/noetl-server | grep '\[PERF\]' | tail -50
  2. Identify slow transitions:

    noetl run automation/observability/event-sync.yaml --set action=analyze-gaps
  3. Check for high event counts:

    SELECT execution_id, COUNT(*) as events
    FROM noetl.event
    WHERE created_at > NOW() - INTERVAL '1 hour'
    GROUP BY execution_id
    HAVING COUNT(*) > 100;

Worker Not Claiming Commands

  1. Check worker logs:

    kubectl logs -n noetl deployment/noetl-worker | tail -100
  2. Verify NATS connection:

    kubectl logs -n noetl deployment/noetl-worker | grep -i nats
  3. Check worker registration:

    SELECT * FROM noetl.runtime WHERE component_type = 'worker_pool';

High Memory Usage

  1. Check catalog cache stats (in server logs)

  2. Monitor event table size:

    SELECT pg_size_pretty(pg_total_relation_size('noetl.event'));
  3. Consider archiving old events:

    DELETE FROM noetl.event WHERE created_at < NOW() - INTERVAL '7 days';
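
On a large event table, a single unbounded DELETE can hold locks and generate heavy WAL; deleting in bounded batches is gentler. A minimal sketch (the batch size is arbitrary; rerun until DELETE 0):

-- Remove one batch of old events; repeat until zero rows are deleted
DELETE FROM noetl.event
WHERE ctid IN (
    SELECT ctid FROM noetl.event
    WHERE created_at < NOW() - INTERVAL '7 days'
    LIMIT 10000
);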

Architecture Notes

Performance Optimizations

  1. Batch Queries: The orchestrator uses get_transition_context_batch() to fetch all transition data in a single query instead of 6+ sequential queries.

  2. Catalog Caching: LRU cache (100 entries, 5-minute TTL) for playbook content avoids repeated database lookups (a quick check is sketched after this list).

  3. State Reconstruction: Events are processed in order to reconstruct workflow state. High event counts impact performance.
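
The cache in item 2 can be sanity-checked from the [PERF] logs: with the format shown earlier, cached catalog_fetch calls should cluster near ~1ms while misses approach ~50-100ms. A rough bucketing sketch (the 10ms cutoff is an assumption):

# Count likely cache hits vs misses from catalog_fetch timings
kubectl logs -n noetl deployment/noetl-server \
  | grep '\[PERF\] catalog_fetch' \
  | awk -F': ' '{ if ($NF + 0 < 10) hits++; else misses++ }
      END { printf "likely cache hits: %d, likely misses: %d\n", hits, misses }'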

Critical Path

Worker completes command
    ↓
Event emitted to server (HTTP)
    ↓
Event persisted to PostgreSQL
    ↓
Orchestrator triggered
├── get_execution_state_batch() [~2ms]
├── get_transition_context_batch() [~3ms]
├── catalog_fetch() [~1ms cached, ~50ms uncached]
└── Publish next command to NATS
    ↓
Worker claims command via NATS
    ↓
Worker executes...

Total orchestration overhead: ~10-50ms (optimized)
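
To check this figure on a live system, the per-evaluation totals can be averaged straight from the logs (same assumed log format as in the Performance Logs section):

# Average evaluate_execution time over the last 1000 log lines
kubectl logs -n noetl deployment/noetl-server --tail=1000 \
  | grep '\[PERF\] evaluate_execution total' \
  | awk -F': ' '{ sum += $NF + 0; n++ } END { if (n) printf "avg: %.1fms over %d samples\n", sum / n, n }'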

References

  • Observability Services
  • ClickHouse queries: ci/manifests/clickhouse/performance-queries.sql
  • Event sync playbook: automation/observability/event-sync.yaml