Retry Mechanism

NoETL provides a retry mechanism for all task types, allowing tasks to be automatically retried based on configurable conditions.

Overview

The retry logic is implemented at the execution orchestration level (noetl/tools/tool/execution.py), making it available to every action type without per-plugin implementations. This design keeps the retry logic generic and reusable rather than duplicated in each plugin.

Architecture

Implementation Location

  • Retry Module: noetl/tools/tool/retry.py

    • Contains RetryPolicy class for configuration and evaluation
    • Contains execute_with_retry() wrapper function
  • Execution Module: noetl/tools/tool/execution.py

    • Integrates retry wrapper around all plugin executors
    • No changes needed in individual action type plugins

Design Principles

  1. Task-Type Agnostic: Retry logic works for all action types (http, python, postgres, duckdb, etc.)
  2. Expression-Based: Retry conditions use Jinja2 expressions for flexibility
  3. Separation of Concerns: Individual plugins remain unaware of retry logic
  4. Configurable Backoff: Supports exponential backoff with jitter to prevent thundering herd

Configuration

Simple Boolean

retry: true  # Use default retry policy (3 attempts, 1s initial delay)

Integer (Max Attempts Only)

retry: 5  # Retry up to 5 times with default settings

Full Configuration

retry:
  max_attempts: 3           # Maximum number of execution attempts (default: 3)
  initial_delay: 1.0        # Initial delay in seconds (default: 1.0)
  max_delay: 60.0           # Maximum delay between retries (default: 60.0)
  backoff_multiplier: 2.0   # Exponential backoff multiplier (default: 2.0)
  jitter: true              # Add random jitter to delays (default: true)
  retry_when: "{{ expr }}"  # Jinja2 expression to determine if a retry is needed
  stop_when: "{{ expr }}"   # Jinja2 expression to stop retrying (overrides retry_when)
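
To make the three accepted shapes concrete, here is a minimal, hypothetical sketch of how they could be normalized into one policy object. The field names mirror the documented options, but the actual RetryPolicy class in noetl/tools/tool/retry.py may be structured differently.

# Hypothetical sketch: normalize the bool / int / mapping forms into one policy.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class RetryPolicy:
    max_attempts: int = 3
    initial_delay: float = 1.0
    max_delay: float = 60.0
    backoff_multiplier: float = 2.0
    jitter: bool = True
    retry_when: Optional[str] = None
    stop_when: Optional[str] = None

def parse_retry_config(raw: Union[bool, int, dict, None]) -> Optional[RetryPolicy]:
    if raw is None or raw is False:
        return None                       # retry disabled
    if raw is True:
        return RetryPolicy()              # defaults: 3 attempts, 1s initial delay
    if isinstance(raw, int):
        return RetryPolicy(max_attempts=raw)
    return RetryPolicy(**raw)             # full mapping form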

Retry Conditions

Available Variables

Retry condition expressions have access to:

  • result: Complete task execution result dictionary
  • status_code: HTTP status code (for HTTP tasks)
  • error: Error message string if task failed
  • success: Boolean indicating task success
  • data: Task result data
  • attempt: Current attempt number (1-indexed)

Expression Evaluation

Conditions are Jinja2 templates that must evaluate to a boolean-like value:

  • "true", "1", "yes" → retry
  • Any other value → don't retry
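
As a hedged illustration of this rule, a condition could be rendered with plain Jinja2 and checked against the accepted truthy strings. The helper below is hypothetical and only mirrors the documented behavior; the real evaluation lives in noetl/tools/tool/retry.py.

from jinja2 import Environment

def should_retry(condition: str, context: dict) -> bool:
    # Render the Jinja2 expression against the variables listed above,
    # then apply the truthiness rule: only "true", "1", "yes" trigger a retry.
    rendered = Environment().from_string(condition).render(**context)
    return rendered.strip().lower() in ("true", "1", "yes")

# Example: second attempt after a 503 response.
ctx = {"result": {}, "status_code": 503, "error": "service unavailable",
       "success": False, "data": None, "attempt": 2}
print(should_retry("{{ status_code >= 500 and status_code < 600 }}", ctx))  # True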

Example Conditions

HTTP Status Codes

# Retry on server errors (5xx)
retry_when: "{{ status_code >= 500 and status_code < 600 }}"

# Retry on specific status codes
retry_when: "{{ status_code in [500, 502, 503, 504] }}"

# Retry on any non-success status
retry_when: "{{ status_code != 200 }}"

Error Messages

# Retry on any error
retry_when: "{{ error != None }}"

# Retry on specific error types
retry_when: "{{ 'timeout' in (error|lower) }}"

# Retry on database deadlock
retry_when: "{{ 'deadlock' in (error|lower) }}"

Success Flag

# Retry on failure
retry_when: "{{ success == False }}"

# Stop on success
stop_when: "{{ success == True }}"

Attempt-Based Logic

# Limit retries for specific conditions
retry_when: "{{ attempt <= 2 and 'rate limit' in (error|lower) }}"

# Different conditions based on attempt
retry_when: "{{ (attempt == 1 and status_code >= 500) or (attempt > 1 and status_code == 503) }}"

Backoff Strategy

Exponential Backoff

Delay calculation after the Nth failed attempt: delay = initial_delay * (backoff_multiplier ^ (N - 1))

Capped at: min(calculated_delay, max_delay)

Example with initial_delay=1.0, backoff_multiplier=2.0:

  • Attempt 1: 0s (no delay before first attempt)
  • Attempt 2: 1s
  • Attempt 3: 2s
  • Attempt 4: 4s
  • Attempt 5: 8s

Jitter

When jitter: true, adds randomization to prevent synchronized retries:

actual_delay = calculated_delay * (0.5 + random())

This creates a delay between 50% and 150% of the calculated delay.
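
For illustration, the two formulas above can be combined into a small helper. This is a sketch, not the verified implementation: the function name backoff_delay and the choice to apply jitter after capping are assumptions.

import random

def backoff_delay(failed_attempt: int, initial_delay: float = 1.0,
                  backoff_multiplier: float = 2.0, max_delay: float = 60.0,
                  jitter: bool = True) -> float:
    # Exponential backoff after the Nth failed attempt, capped at max_delay.
    delay = min(initial_delay * backoff_multiplier ** (failed_attempt - 1), max_delay)
    if jitter:
        delay *= 0.5 + random.random()  # 50% to 150% of the calculated delay
    return delay

# Reproduces the schedule above (jitter disabled for determinism):
for n in range(1, 5):
    print(f"delay after failed attempt {n}: {backoff_delay(n, jitter=False):.0f}s")
# -> 1s, 2s, 4s, 8s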

Usage Examples

HTTP Retry on Server Errors

- step: fetch_data
  tool: http
  method: GET
  url: "{{ api_url }}"
  retry:
    max_attempts: 5
    initial_delay: 2.0
    backoff_multiplier: 2.0
    retry_when: "{{ status_code >= 500 }}"
    stop_when: "{{ status_code == 200 }}"

Python Task with Exception Handling

- step: process_data
  tool: python
  code: |
    def main(input_data):
        # Processing logic that might fail
        return result
  retry:
    max_attempts: 3
    initial_delay: 0.5
    retry_when: "{{ error != None }}"

Database Query with Connection Retry

- step: query_database
  tool: postgres
  auth:
    type: postgres
    credential: prod_db
  query: "{{ sql_query }}"
  retry:
    max_attempts: 5
    initial_delay: 1.0
    backoff_multiplier: 1.5
    retry_when: "{{ error != None or success == False }}"

DuckDB with Conditional Retry

- step: analyze_data
  tool: duckdb
  query: "{{ analysis_query }}"
  retry:
    max_attempts: 3
    initial_delay: 0.5
    retry_when: "{{ 'out of memory' in (error|lower) }}"

Testing

Test Playbooks

Test playbooks are available in tests/fixtures/playbooks/retry_test/:

  • http_retry_status_code.yaml - HTTP status code retry
  • http_retry_with_stop.yaml - HTTP with stop condition
  • python_retry_exception.yaml - Python exception handling
  • postgres_retry_connection.yaml - Database connection retry
  • duckdb_retry_query.yaml - DuckDB query retry
  • retry_simple_config.yaml - All configuration formats

Running Tests

# Register all retry test playbooks
task playbook:local:register-retry-tests

# Run all retry tests
task test:local:retry-all

# Run specific retry test
task playbook:local:execute:retry-http-status
task playbook:local:execute:retry-python-exception

Best Practices

1. Set Reasonable Max Attempts

Start with 3-5 attempts. Too many attempts can cause long delays.

retry:
  max_attempts: 3  # Good starting point

2. Use Specific Retry Conditions

Target specific errors instead of retrying everything:

# Good: Specific condition
retry_when: "{{ status_code in [429, 500, 502, 503] }}"

# Avoid: Too broad
retry_when: "{{ True }}"

3. Configure Appropriate Delays

Balance quick retries against overloading the server:

retry:
  initial_delay: 1.0       # Start with 1 second
  max_delay: 30.0          # Cap at 30 seconds
  backoff_multiplier: 2.0  # Double each time

4. Enable Jitter for Distributed Systems

Prevent synchronized retries across multiple workers:

retry:
  jitter: true  # Add randomization

5. Use Stop Conditions for Early Exit

Define success criteria to avoid unnecessary retries:

retry:
  retry_when: "{{ status_code != 200 }}"
  stop_when: "{{ status_code == 200 }}"  # Exit early on success

6. Monitor and Tune

Check logs to understand retry patterns and adjust configuration:

Task 'fetch_data' will retry after 2.34s (attempt 2/5)
Task 'fetch_data' succeeded on attempt 3

Implementation Notes

Plugin Integration

No changes are required in individual action plugins; the retry wrapper in execution.py handles all retry logic:

def execute_task(...):
    return execute_with_retry(
        _execute_task_without_retry,
        task_config,
        task_name,
        context,
        jinja_env,
        task_with,
        log_event_callback
    )

Error Handling

  • Exceptions are caught and evaluated against the retry conditions
  • If the retry condition is not met, the exception is re-raised
  • After all attempts are exhausted, the last exception is raised
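
Taken together with the earlier sketches, this behavior implies a loop of roughly the following shape. It is hypothetical, reuses RetryPolicy, should_retry, and backoff_delay from the sketches above, and assumes task is a zero-argument callable; the real execute_with_retry() may differ.

import time

def run_with_retry(task, policy: RetryPolicy, context: dict):
    last_exc = None
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_exc = exc
            ctx = dict(context, error=str(exc), success=False, attempt=attempt)
            # Exception caught: if the retry condition is not met, re-raise now.
            if policy.retry_when and not should_retry(policy.retry_when, ctx):
                raise
            if attempt < policy.max_attempts:
                time.sleep(backoff_delay(attempt, policy.initial_delay,
                                         policy.backoff_multiplier,
                                         policy.max_delay, policy.jitter))
    raise last_exc  # all attempts exhausted: raise the last exception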

Logging

Retry attempts are logged with details:

  • Attempt number and total attempts
  • Delay before next retry
  • Success/failure status
  • Retry condition evaluation results

Limitations

  1. Async Tasks: Workbook tasks use async execution, which may need special handling
  2. State Management: Retry state is not persisted across worker restarts
  3. Resource Cleanup: Tasks should handle their own resource cleanup on failure
  4. Idempotency: Tasks should be idempotent if retried, especially tasks that modify data

Future Enhancements

Potential improvements:

  1. Circuit Breaker: Stop retrying after consecutive failures
  2. Retry Metrics: Track retry rates and success patterns
  3. Persistent Retry State: Store retry attempts in database
  4. Retry Budget: Limit total retry time across all attempts
  5. Conditional Backoff: Different backoff strategies based on error type