2026-01-14

Experiment 14: Handling Step Retry Exhaustion with DBOSMaxStepRetriesExceeded

Purpose

This experiment demonstrates how to gracefully handle step failures when all retry attempts have been exhausted. Instead of letting the workflow fail, it shows how to catch the DBOSMaxStepRetriesExceeded exception and implement custom error handling or fallback logic.

Problem Being Solved

In production systems, external dependencies (APIs, databases, services) can fail persistently. When a DBOS step exhausts all its retry attempts:

Default behavior: The exception propagates up and fails the workflow
Desired behavior: Catch the exception and handle it gracefully (log, alert, use fallback, etc.)

This pattern is essential for building resilient applications that can degrade gracefully rather than fail completely.

Key Features

Exception handling: Demonstrates catching DBOSMaxStepRetriesExceeded
Retry exhaustion: Step fails intentionally to exhaust retries
Graceful degradation: Workflow continues despite step failure
Logging: Tracks retry attempts and failure handling
Production pattern: Shows real-world error handling approach

Code Structure

`provision_step()`

A step that always fails to simulate a persistent external service failure:

from time import sleep

from dbos import DBOS

@DBOS.step(retries_allowed=True)
def provision_step():
    # current_attempt is zero-based, so add 1 for a human-readable counter
    DBOS.logger.info(
        f"Step: Starting provision_step {DBOS.step_status.current_attempt + 1} of {DBOS.step_status.max_attempts}"
    )
    sleep(1)  # Simulate work
    raise ValueError("Simulated failure in provision_step")

Behavior:

Logs each attempt (1 of 3, 2 of 3, 3 of 3)
Sleeps 1 second to simulate external API call
Always raises ValueError to trigger retry
After 3 attempts (default), DBOS raises DBOSMaxStepRetriesExceeded

`provision_workflow()`

A workflow that catches the exception and continues execution:

@DBOS.workflow()
def provision_workflow() -> bool:
    DBOS.logger.info("Workflow: Starting")
    
    try:
        provision_step()
    except DBOSMaxStepRetriesExceeded:
        DBOS.logger.error("Workflow: Caught max retries exceeded exception")
        # Could implement fallback logic here:
        # - Use cached data
        # - Return partial results
        # - Send alert to monitoring system
        # - Set workflow status flag
    
    DBOS.logger.info("Workflow: Finishing")
    return True

Behavior:

Attempts to call the failing step
Catches DBOSMaxStepRetriesExceeded when retries are exhausted
Logs the error and continues execution
Returns True indicating the workflow completed (despite the step failure)
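
The retry-then-catch flow described above can be approximated in plain Python, without DBOS or a database. This is only a sketch of the control flow, not how DBOS implements retries: MaxRetriesExceeded and run_with_retries are hypothetical stand-ins for DBOSMaxStepRetriesExceeded and the @DBOS.step decorator, and none of the durability guarantees are modeled.

```python
import time


class MaxRetriesExceeded(Exception):
    """Hypothetical stand-in for DBOSMaxStepRetriesExceeded."""


def run_with_retries(func, max_attempts=3, interval_seconds=0.0, backoff_rate=2.0):
    """Call func until it succeeds or max_attempts calls have failed."""
    delay = interval_seconds
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            print(f"attempt {attempt} of {max_attempts} failed: {exc}")
            if attempt < max_attempts:
                time.sleep(delay)      # wait before the next attempt
                delay *= backoff_rate  # grow the delay exponentially
    # Every attempt failed: signal exhaustion with a dedicated exception type.
    raise MaxRetriesExceeded(f"gave up after {max_attempts} attempts") from last_error


def always_fails():
    raise ValueError("Simulated failure")


def workflow() -> bool:
    try:
        run_with_retries(always_fails)
    except MaxRetriesExceeded:
        # Fallback logic would go here; the workflow keeps running.
        print("caught retry exhaustion, continuing")
    return True


print(workflow())
```

The point of the dedicated exception type is that the caller can tell "the step was retried and still failed" apart from any other error, and handle only that case.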

Expected Output

DBOS system database URL: postgresql://trustle:***@localhost:5432/test_dbos_sys?sslmode=disable
DBOS application database URL: postgresql://trustle:***@localhost:5432/test?sslmode=disable
Database engine parameters: {'pool_timeout': 30, 'max_overflow': 0, 'pool_size': 20, 'pool_pre_ping': True, 'connect_args': {'connect_timeout': 10}}
[    INFO] (dbos:_dbos.py:370) Initializing DBOS (v2.1.0)
[    INFO] (dbos:_dbos.py:445) Executor ID: local_executer
[    INFO] (dbos:_dbos.py:446) Application version: local_v0
[ WARNING] (dbos:_dbos.py:486) Failed to start admin server: [Errno 98] Address already in use
[    INFO] (dbos:_dbos.py:496) No workflows to recover from application version local_v0
[    INFO] (dbos:_dbos.py:548) DBOS launched!
To view and manage workflows, connect to DBOS Conductor at:https://console.dbos.dev/self-host?appname=dbos-starter
[    INFO] (dbos:ex1.py:23) Workflow: Starting
[   DEBUG] (dbos:_core.py:1161) Running step, id: 1, name: provision_step
[    INFO] (dbos:ex1.py:13) Step: Starting provision_step 1 of 3
[ WARNING] (dbos:_core.py:1099) Step being automatically retried (attempt 1 of 3)
Traceback (most recent call last):
  File ".../dbos/_outcome.py", line 137, in _retry
    return func()
           ^^^^^^
  File ".../exp14/ex1.py", line 18, in provision_step
    raise ValueError("Simulated failure in provision_step")
ValueError: Simulated failure in provision_step
[    INFO] (dbos:ex1.py:13) Step: Starting provision_step 2 of 3
[ WARNING] (dbos:_core.py:1099) Step being automatically retried (attempt 2 of 3)
Traceback (most recent call last):
  File ".../dbos/_outcome.py", line 137, in _retry
    return func()
           ^^^^^^
  File ".../exp14/ex1.py", line 18, in provision_step
    raise ValueError("Simulated failure in provision_step")
ValueError: Simulated failure in provision_step
[    INFO] (dbos:ex1.py:13) Step: Starting provision_step 3 of 3
[ WARNING] (dbos:_core.py:1099) Step being automatically retried (attempt 3 of 3)
Traceback (most recent call last):
  File ".../dbos/_outcome.py", line 137, in _retry
    return func()
           ^^^^^^
  File ".../exp14/ex1.py", line 18, in provision_step
    raise ValueError("Simulated failure in provision_step")
ValueError: Simulated failure in provision_step
[   ERROR] (dbos:ex1.py:29) Workflow: Caught max retries exceeded exception
[    INFO] (dbos:ex1.py:31) Workflow: Finishing
[    INFO] (dbos:ex1.py:51) Main: Workflow output: True

Key observations from the output:

Three attempts: The step executes exactly 3 times (the default max_attempts)
Retry warnings: DBOS logs “Step being automatically retried” after each of the three failed attempts, including the last
Full tracebacks: Each failure shows the complete stack trace with the ValueError
Exception caught: After attempt 3 fails, the workflow catches DBOSMaxStepRetriesExceeded
Workflow continues: Despite the step failure, the workflow logs “Finishing” and returns True
Timing: Each attempt takes about 1 second (the sleep(1) in the step), and DBOS adds an exponentially increasing delay between attempts
Success indicator: Main: Workflow output: True shows the workflow completed successfully

Flow Diagram

provision_workflow()
    |
    ├─> provision_step()
    |       ├─> Attempt 1: FAIL (sleep 1s)
    |       ├─> Attempt 2: FAIL (sleep 1s)
    |       ├─> Attempt 3: FAIL (sleep 1s)
    |       └─> Raise DBOSMaxStepRetriesExceeded
    |
    ├─> Catch DBOSMaxStepRetriesExceeded
    |       └─> Log error message
    |
    └─> Continue workflow execution
            └─> Return True

Use Cases

1. Provisioning Resources

try:
    provision_cloud_resource()
except DBOSMaxStepRetriesExceeded:
    DBOS.logger.error("Cloud provisioning failed after retries")
    send_alert_to_ops_team()
    use_fallback_resource()

2. External API Calls

try:
    fetch_user_data_from_api()
except DBOSMaxStepRetriesExceeded:
    DBOS.logger.warning("API unavailable, using cached data")
    return get_cached_user_data()

3. Payment Processing

try:
    process_payment()
except DBOSMaxStepRetriesExceeded:
    DBOS.logger.error("Payment processing failed")
    mark_order_as_pending()
    notify_customer_service()

4. Data Validation

try:
    validate_external_data_source()
except DBOSMaxStepRetriesExceeded:
    DBOS.logger.warning("Validation service down, skipping checks")
    proceed_with_unvalidated_data()
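
The four snippets above share one shape: try the primary call, and on retry exhaustion record the failure and switch to a fallback. A small helper can capture that shape. The sketch below is an illustration under assumptions: call_with_fallback and RetriesExhausted are hypothetical names, and in a DBOS application the exception type passed in would be DBOSMaxStepRetriesExceeded.

```python
from typing import Callable, Type, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Hypothetical stand-in for DBOSMaxStepRetriesExceeded."""


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    exhausted: Type[Exception],
    on_exhausted: Callable[[Exception], None] = lambda exc: None,
) -> T:
    """Return primary(); if it raises `exhausted`, report it and return fallback()."""
    try:
        return primary()
    except exhausted as exc:
        on_exhausted(exc)  # e.g. log the error or send an alert
        return fallback()


# Usage with stand-in functions named after use case 2 above.
def fetch_user_data_from_api():
    raise RetriesExhausted("API unavailable")


def get_cached_user_data():
    return {"name": "cached-user"}


alerts = []
result = call_with_fallback(
    fetch_user_data_from_api, get_cached_user_data, RetriesExhausted, alerts.append
)
print(result)
```

Only the named exception type triggers the fallback; any other exception still propagates, so unexpected bugs are not masked as outages.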

Configuration Options

You can customize retry behavior in the step decorator:

@DBOS.step(
    retries_allowed=True,
    max_attempts=5,              # Maximum number of attempts, including the first (default: 3)
    interval_seconds=2.0,        # Initial delay between retries (default: 1.0)
    backoff_rate=2.0             # Exponential backoff multiplier (default: 2.0)
)
def provision_step():
    # Your step logic
    pass

Retry schedule for the configuration above (each delay is measured from the previous failure):

Attempt 1: immediate
Attempt 2: after 2 seconds
Attempt 3: after 4 seconds (2 × 2.0)
Attempt 4: after 8 seconds (4 × 2.0)
Attempt 5: after 16 seconds (8 × 2.0)
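
The schedule above is plain arithmetic: the delay before each retry is interval_seconds multiplied by backoff_rate once per earlier retry. The helper below reproduces it. retry_delays is a hypothetical name, not part of DBOS, and it assumes no upper cap on the delay, which may not hold for very long schedules.

```python
def retry_delays(max_attempts: int, interval_seconds: float, backoff_rate: float) -> list[float]:
    """Delays (in seconds) before attempts 2..max_attempts, assuming no cap."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts - 1)]


delays = retry_delays(max_attempts=5, interval_seconds=2.0, backoff_rate=2.0)
print(delays)       # [2.0, 4.0, 8.0, 16.0]
print(sum(delays))  # 30.0 seconds of waiting, excluding the step's own run time
```

Summing the delays gives a quick lower bound on how long a caller may wait before retry exhaustion is reported, which helps when choosing max_attempts for a latency-sensitive operation.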

Usage

# Run the experiment
python exp14/ex1.py

Prerequisites

PostgreSQL database running on localhost:5432
Database: test with user trustle:trustle
Python dependencies: pip install dbos

Environment Variables

export DBOS_DATABASE_URL="postgresql://trustle:trustle@localhost:5432/test?sslmode=disable"

Learning Points

Exception handling: How to catch DBOSMaxStepRetriesExceeded in workflows
Graceful degradation: Workflows can continue despite step failures
Retry awareness: Use DBOS.step_status to track retry attempts
Production patterns: Real-world error handling strategies
Resilience design: Building fault-tolerant applications
Monitoring integration: Where to add alerting and observability
Fallback strategies: Implementing alternative paths when services fail

Best Practices

✅ DO:

Catch DBOSMaxStepRetriesExceeded for expected failures
Log detailed error information for debugging
Implement meaningful fallback logic
Set up alerts for retry exhaustion
Track metrics on retry failures
Use appropriate retry configuration for your use case

❌ DON’T:

Silently swallow exceptions without logging
Use empty except blocks
Retry indefinitely without limits
Ignore persistent failures
Skip monitoring/alerting setup
Use the same retry config for all operations

Comparison with Other Approaches

Approach	Pros	Cons	Use Case
Catch exception (this experiment)	Graceful degradation, custom logic	Requires explicit handling	Expected failures
Let workflow fail	Simple, clear failure signal	No graceful degradation	Critical operations
Increase retries	More chances to succeed	Longer wait times	Transient failures
Circuit breaker	Prevents cascade failures	More complex implementation	Distributed systems

Related Patterns

Circuit Breaker: Temporarily stop calling failing services
Fallback: Use alternative data sources or services
Timeout: Limit how long to wait for responses
Bulkhead: Isolate failures to prevent system-wide impact
Retry with Jitter: Add randomness to retry delays
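
As an illustration of the first pattern in this list, here is a minimal in-memory circuit breaker. It is a sketch under assumptions, not something this experiment implements: CircuitBreaker and CircuitOpenError are hypothetical names, the state lives in a single process and is not durable, and the thresholds are arbitrary.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock      # injectable so tests can control time
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpenError("circuit open, skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success fully closes the circuit
        return result
```

A breaker like this complements retries rather than replacing them: retries handle a single call that might recover, while the breaker stops new calls from piling onto a dependency that has already failed repeatedly.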

Future Enhancements

Potential improvements to explore:

Implement circuit breaker pattern
Add metrics collection for retry patterns
Create reusable error handling utilities
Test with real external services
Implement progressive backoff strategies
Add dead letter queue for failed operations

Related Experiments

exp13: Workflow recovery and step retries
exp15: Performance analysis of step operations

Recent changes

2025-10-18 66833e7 added readmes
2025-10-08 bb13b72 fixed dbos experiments for dbos 2.1
2025-09-19 2b6dd00 tested retries exceeded error

Categories: experiments, Python

Tags: dbos-experiments
