Retry Mechanism
NoETL provides a retry mechanism for all task types, allowing tasks to be automatically retried based on configurable conditions.
Overview
The retry logic is implemented at the execution orchestration level (noetl/tools/tool/execution.py), making it available to all action types without requiring per-plugin implementations. This keeps the retry logic generic and reusable across task types.
Architecture
Implementation Location
- Retry Module: /noetl/tools/tool/retry.py
  - Contains the RetryPolicy class for configuration and evaluation
  - Contains the execute_with_retry() wrapper function
- Execution Module: /noetl/tools/tool/execution.py
  - Integrates the retry wrapper around all plugin executors
  - No changes needed in individual action type plugins
Design Principles
- Task-Type Agnostic: Retry logic works for all action types (http, python, postgres, duckdb, etc.)
- Expression-Based: Retry conditions use Jinja2 expressions for flexibility
- Separation of Concerns: Individual plugins remain unaware of retry logic
- Configurable Backoff: Supports exponential backoff with jitter to prevent thundering herd
Configuration
Simple Boolean
retry: true # Use default retry policy (3 attempts, 1s initial delay)
Integer (Max Attempts Only)
retry: 5 # Retry up to 5 times with default settings
Full Configuration
retry:
  max_attempts: 3          # Maximum number of execution attempts (default: 3)
  initial_delay: 1.0       # Initial delay in seconds (default: 1.0)
  max_delay: 60.0          # Maximum delay between retries in seconds (default: 60.0)
  backoff_multiplier: 2.0  # Exponential backoff multiplier (default: 2.0)
  jitter: true             # Add random jitter to delays (default: true)
  retry_when: "{{ expr }}" # Jinja2 expression to determine if a retry is needed
  stop_when: "{{ expr }}"  # Jinja2 expression to stop retrying (overrides retry_when)
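How these three formats collapse into a single policy can be illustrated with the sketch below. It is a simplified stand-in for the RetryPolicy class in retry.py, not its actual implementation; the field names simply mirror the configuration keys above, and the handling of unknown keys is an assumption.

from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class RetryPolicySketch:
    max_attempts: int = 3
    initial_delay: float = 1.0
    max_delay: float = 60.0
    backoff_multiplier: float = 2.0
    jitter: bool = True
    retry_when: Optional[str] = None
    stop_when: Optional[str] = None

def normalize_retry(value):
    # retry: true -> default policy; retry: false / absent -> no retries
    if value is True:
        return RetryPolicySketch()
    if value is False or value is None:
        return None
    # retry: 5 -> max attempts only, other settings keep their defaults
    if isinstance(value, int):
        return RetryPolicySketch(max_attempts=value)
    # retry: {...} -> full configuration; unknown keys are ignored here
    if isinstance(value, dict):
        known = {f.name for f in fields(RetryPolicySketch)}
        return RetryPolicySketch(**{k: v for k, v in value.items() if k in known})
    raise ValueError(f"Unsupported retry configuration: {value!r}")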
Retry Conditions
Available Variables
Retry condition expressions have access to:
- result: Complete task execution result dictionary
- status_code: HTTP status code (for HTTP tasks)
- error: Error message string if the task failed
- success: Boolean indicating task success
- data: Task result data
- attempt: Current attempt number (1-indexed)
Expression Evaluation
Conditions are Jinja2 templates that must evaluate to a boolean-like value:
- "true", "1", "yes" → retry
- Any other value → don't retry
Example Conditions
HTTP Status Codes
# Retry on server errors (5xx)
retry_when: "{{ status_code >= 500 and status_code < 600 }}"
# Retry on specific status codes
retry_when: "{{ status_code in [500, 502, 503, 504] }}"
# Retry on any non-success status
retry_when: "{{ status_code != 200 }}"
Error Messages
# Retry on any error
retry_when: "{{ error != None }}"
# Retry on specific error types
retry_when: "{{ 'timeout' in (error|lower) }}"
# Retry on database deadlock
retry_when: "{{ 'deadlock' in (error|lower) }}"
Success Flag
# Retry on failure
retry_when: "{{ success == False }}"
# Stop on success
stop_when: "{{ success == True }}"
Attempt-Based Logic
# Limit retries for specific conditions
retry_when: "{{ attempt <= 2 and 'rate limit' in (error|lower) }}"
# Different conditions based on attempt
retry_when: "{{ (attempt == 1 and status_code >= 500) or (attempt > 1 and status_code == 503) }}"
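The conditions above are evaluated as ordinary Jinja2 templates rendered against the retry context, using the truthiness rule described under Expression Evaluation. The sketch below shows one way such an evaluation can work, assuming a plain jinja2 Environment; the actual logic lives in retry.py and may differ.

import jinja2

def should_retry(expression, context):
    # Render the Jinja2 expression against the retry context
    # (result, status_code, error, success, data, attempt).
    rendered = jinja2.Environment().from_string(expression).render(**context)
    # Only explicitly truthy strings trigger a retry.
    return rendered.strip().lower() in ("true", "1", "yes")

# Example: a 503 response should be retried.
should_retry("{{ status_code >= 500 }}", {"status_code": 503, "attempt": 1})  # True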
Backoff Strategy
Exponential Backoff
Delay after attempt n fails: delay = initial_delay * (backoff_multiplier ^ (n - 1))
Capped at: min(calculated_delay, max_delay)
Example with initial_delay=1.0, backoff_multiplier=2.0:
- Attempt 1: 0s (no delay before first attempt)
- Attempt 2: 1s
- Attempt 3: 2s
- Attempt 4: 4s
- Attempt 5: 8s
Jitter
When jitter: true, adds randomization to prevent synchronized retries:
actual_delay = calculated_delay * (0.5 + random())
This creates a delay between 50% and 150% of the calculated delay.
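The schedule above can be sketched in a few lines, assuming exactly the formula and jitter range described here; the helper name retry_delay is illustrative rather than NoETL's actual API.

import random

def retry_delay(failed_attempts, initial_delay=1.0, backoff_multiplier=2.0,
                max_delay=60.0, jitter=True):
    # Exponential backoff keyed on how many attempts have failed so far.
    delay = min(initial_delay * (backoff_multiplier ** (failed_attempts - 1)), max_delay)
    if jitter:
        # Spread the delay between 50% and 150% of the calculated value.
        delay *= 0.5 + random.random()
    return delay

# With the defaults: ~1s after the 1st failure, ~2s after the 2nd, ~4s after the 3rd, ...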
Usage Examples
HTTP Retry on Server Errors
- step: fetch_data
  tool: http
  method: GET
  url: "{{ api_url }}"
  retry:
    max_attempts: 5
    initial_delay: 2.0
    backoff_multiplier: 2.0
    retry_when: "{{ status_code >= 500 }}"
    stop_when: "{{ status_code == 200 }}"
Python Task with Exception Handling
- step: process_data
  tool: python
  code: |
    def main(input_data):
        # Processing logic that might fail
        return result
  retry:
    max_attempts: 3
    initial_delay: 0.5
    retry_when: "{{ error != None }}"
Database Query with Connection Retry
- step: query_database
  tool: postgres
  auth:
    type: postgres
    credential: prod_db
  query: "{{ sql_query }}"
  retry:
    max_attempts: 5
    initial_delay: 1.0
    backoff_multiplier: 1.5
    retry_when: "{{ error != None or success == False }}"
DuckDB with Conditional Retry
- step: analyze_data
  tool: duckdb
  query: "{{ analysis_query }}"
  retry:
    max_attempts: 3
    initial_delay: 0.5
    retry_when: "{{ 'out of memory' in (error|lower) }}"
Testing
Test Playbooks
Test playbooks are available in tests/fixtures/playbooks/retry_test/:
- http_retry_status_code.yaml - HTTP status code retry
- http_retry_with_stop.yaml - HTTP with stop condition
- python_retry_exception.yaml - Python exception handling
- postgres_retry_connection.yaml - Database connection retry
- duckdb_retry_query.yaml - DuckDB query retry
- retry_simple_config.yaml - All configuration formats
Running Tests
# Register all retry test playbooks
task playbook:local:register-retry-tests
# Run all retry tests
task test:local:retry-all
# Run specific retry test
task playbook:local:execute:retry-http-status
task playbook:local:execute:retry-python-exception
Best Practices
1. Set Reasonable Max Attempts
Start with 3-5 attempts. Too many attempts can cause long delays.
retry:
  max_attempts: 3 # Good starting point
2. Use Specific Retry Conditions
Target specific errors instead of retrying everything:
# Good: Specific condition
retry_when: "{{ status_code in [429, 500, 502, 503] }}"
# Avoid: Too broad
retry_when: "{{ True }}"
3. Configure Appropriate Delays
Balance quick retries against the risk of overloading the server:
retry:
  initial_delay: 1.0      # Start with 1 second
  max_delay: 30.0         # Cap at 30 seconds
  backoff_multiplier: 2.0 # Double each time
4. Enable Jitter for Distributed Systems
Prevent synchronized retries across multiple workers:
retry:
  jitter: true # Add randomization
5. Use Stop Conditions for Early Exit
Define success criteria to avoid unnecessary retries:
retry:
  retry_when: "{{ status_code != 200 }}"
  stop_when: "{{ status_code == 200 }}" # Exit early on success
6. Monitor and Tune
Check logs to understand retry patterns and adjust configuration:
Task 'fetch_data' will retry after 2.34s (attempt 2/5)
Task 'fetch_data' succeeded on attempt 3
Implementation Notes
Plugin Integration
No changes required in individual action plugins. The retry wrapper in execution.py handles all retry logic:
def execute_task(...):
    return execute_with_retry(
        _execute_task_without_retry,
        task_config,
        task_name,
        context,
        jinja_env,
        task_with,
        log_event_callback
    )
Error Handling
- Exceptions are caught and evaluated against the retry conditions
- If the retry condition is not met, the exception is re-raised
- After all attempts are exhausted, the last exception is raised
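Putting these rules together, the control flow looks roughly like the sketch below. It is not the actual execute_with_retry implementation: it reuses the illustrative should_retry and retry_delay helpers from the earlier sketches, assumes task results are dictionaries, and assumes exceptions are retried by default when no retry_when is given.

import time

def execute_with_retry_sketch(executor, policy, context):
    for attempt in range(1, policy.max_attempts + 1):
        try:
            result = executor(context)
            ctx = {**result, "result": result, "attempt": attempt}
            if policy.stop_when and should_retry(policy.stop_when, ctx):
                return result  # stop condition met: return immediately
            if not (policy.retry_when and should_retry(policy.retry_when, ctx)):
                return result  # no retry requested: result is final
            if attempt == policy.max_attempts:
                return result  # attempts exhausted: return the last result
        except Exception as exc:
            ctx = {"error": str(exc), "success": False, "attempt": attempt}
            if policy.retry_when and not should_retry(policy.retry_when, ctx):
                raise  # retry condition not met: re-raise the exception
            if attempt == policy.max_attempts:
                raise  # all attempts exhausted: raise the last exception
        # Wait with exponential backoff (and jitter) before the next attempt.
        time.sleep(retry_delay(attempt, policy.initial_delay,
                               policy.backoff_multiplier, policy.max_delay, policy.jitter))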
Logging
Retry attempts are logged with details:
- Attempt number and total attempts
- Delay before next retry
- Success/failure status
- Retry condition evaluation results
Limitations
- Async Tasks: Workbook tasks use async execution which may need special handling
- State Management: Retry logic doesn't persist state across worker restarts
- Resource Cleanup: Tasks should handle their own resource cleanup on failure
- Idempotency: Tasks should be idempotent if retried (especially for data modification)
Future Enhancements
Potential improvements:
- Circuit Breaker: Stop retrying after consecutive failures
- Retry Metrics: Track retry rates and success patterns
- Persistent Retry State: Store retry attempts in database
- Retry Budget: Limit total retry time across all attempts
- Conditional Backoff: Different backoff strategies based on error type