GitHub: dbos_experiments/exp10
Experiment 10
Summary
This experiment demonstrates advanced DBOS error handling, workflow recovery mechanisms, and process isolation patterns. It explores how DBOS handles different types of failures (exceptions vs OOM errors) and showcases multiprocessing integration within DBOS workflows.
Files Description
Core Application Files
exp1.py - Comprehensive error handling and recovery demonstration:
- Error Type Analysis: Documents how DBOS handles different error types:
  - Step Exceptions: Retried without replaying
  - Step OOM: Retried with replaying
  - Workflow Exceptions: Workflow finishes early (like success)
  - Workflow OOM: Retried without replaying completed steps
- Multiprocessing Integration: Uses Python’s multiprocessing.Process within DBOS steps
- Process Isolation: Fibonacci calculation executed in separate process for memory safety
- Step Retry Configuration: Both steps configured with retries_allowed=True
- Workflow Recovery: Configured with max_recovery_attempts=3
- Queue Management: Handles pending workflows and workflow continuation
- Fixed Executor ID: Uses constant executor and app version for consistent recovery
Key Features Demonstrated
Error Handling Patterns
- Exception vs OOM Handling: Different recovery behaviors for different error types
- Step-level Retries: Individual step retry configuration and status tracking
- Workflow-level Recovery: Workflow restart capabilities with attempt tracking
- Error Simulation: Commented code for simulating OOM and exception errors
Process Isolation
- Multiprocessing Integration: Using multiprocessing.Process within DBOS steps
- Memory Safety: Isolating memory-intensive operations in separate processes
- Inter-process Communication: Using multiprocessing.Queue for result passing
- Process Lifecycle: Proper process creation, execution, and cleanup
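The process isolation pattern above can be sketched with the standard library alone. This is a minimal illustration, not the code from exp1.py: the names fib, _worker and fib_in_subprocess are hypothetical, the "fork" start method is an assumption (Unix only), and the DBOS step decorator is left out so the sketch runs without a database.

```python
import multiprocessing as mp


def fib(n: int) -> int:
    """Iterative Fibonacci; stands in for the memory-heavy computation."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def _worker(n: int, results) -> None:
    # Runs in the child process; the result travels back through the queue.
    results.put(fib(n))


def fib_in_subprocess(n: int, timeout: float = 30.0) -> int:
    # Assumption: the "fork" start method (Unix only), so the child does not
    # need to re-import the enclosing module.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(n, results))
    proc.start()
    try:
        # Read the result before join(): a child that is still flushing a
        # large result into the queue's pipe cannot exit until it is read.
        value = results.get(timeout=timeout)
    finally:
        # Cleanup: wait for the child, and terminate it if it is stuck.
        proc.join(timeout=timeout)
        if proc.is_alive():
            proc.terminate()
            proc.join()
    return value
```

In the experiment, a function like fib_in_subprocess would be the body of a DBOS step, so the parent process only ever holds the final integer.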
Workflow Continuity
- Pending Workflow Detection: Checking for existing workflows in queue
- Workflow Retrieval: Recovering and waiting for existing workflows
- Queue State Management: Handling workflow state across application restarts
- Graceful Continuation: Seamless continuation of interrupted workflows
Advanced Configuration
- Fixed Executor ID: Ensuring consistent workflow recovery across restarts
- App Version Control: Explicit version management for workflow compatibility
- Retry Configuration: Fine-tuned retry settings for different failure scenarios
- Recovery Attempts: Configurable maximum recovery attempts
Error Handling Documentation
The experiment includes detailed comments explaining DBOS error handling behavior:
Step-level Errors
- Exception: Step retried without replaying previous operations
- OOM (Out of Memory): Step retried with full replay of operations
Workflow-level Errors
- Exception: Workflow terminates early and is not retried (handled like a completed workflow)
- OOM: Workflow retried without replaying successfully completed steps
Multi-step Recovery
- Sequential Execution: If step 2 fails with OOM, step 1 has already completed, so its recorded result is reused
- State Preservation: Completed steps don’t need re-execution during recovery
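The recovery behavior described above can be illustrated with a toy model. This is not how DBOS is implemented: the checkpoints dictionary stands in for the database, SimulatedCrash stands in for a process killed by OOM, and run_workflow and demo are hypothetical names. The sketch only shows the two ideas from the lists: an ordinary exception retries the failing step in place, and a restart skips steps whose results were already recorded.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], Any]]


class SimulatedCrash(BaseException):
    """Models process death (for example an OOM kill); step retries never see it."""


def run_workflow(steps: List[Step], checkpoints: Dict[str, Any],
                 max_attempts: int = 3) -> List[Any]:
    """Run steps in order, skipping any step that already has a checkpoint."""
    outputs = []
    for name, func in steps:
        if name not in checkpoints:  # completed steps are not re-executed
            for attempt in range(1, max_attempts + 1):
                try:
                    checkpoints[name] = func()
                    break
                except Exception:
                    # Ordinary exception: retry this step only.
                    if attempt == max_attempts:
                        raise
        outputs.append(checkpoints[name])
    return outputs


def demo():
    calls = {"step1": 0, "step2": 0}

    def step1():
        calls["step1"] += 1
        return "a"

    def step2():
        calls["step2"] += 1
        if calls["step2"] == 1:
            raise SimulatedCrash()  # the "process" dies during step 2
        return "b"

    steps = [("step1", step1), ("step2", step2)]
    checkpoints: Dict[str, Any] = {}
    try:
        run_workflow(steps, checkpoints)
    except SimulatedCrash:
        pass  # first execution is lost; checkpoints survive
    outputs = run_workflow(steps, checkpoints)  # recovery run
    return outputs, calls
```

Calling demo() runs step 1 once and step 2 twice: the recovery run reuses the stored result of step 1 and re-executes only the step that was interrupted.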
Process Architecture
Multiprocessing Pattern
- Parent Process: DBOS workflow execution
- Child Process: Isolated Fibonacci calculation
- Communication: Result passing through multiprocessing queues
- Resource Management: Automatic process cleanup and resource release
Memory Management
- Isolation Benefits: Memory leaks in child processes don’t affect parent
- OOM Protection: Child process OOM doesn’t crash main application
- Resource Limits: Child processes can have independent resource limits
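The OOM protection point can be sketched as follows. The child kills itself with SIGKILL as a stand-in for the kernel OOM killer, and the parent detects the abnormal exit instead of crashing. The names ChildCrashed, _worker and run_isolated are hypothetical, and the "fork" start method is an assumption (Unix only); this is not taken from exp1.py.

```python
import multiprocessing as mp
import os
import signal


class ChildCrashed(RuntimeError):
    """Raised in the parent when the child exits without a clean status."""


def _worker(simulate_oom: bool, results) -> None:
    if simulate_oom:
        # Stand-in for the kernel OOM killer, which delivers SIGKILL.
        os.kill(os.getpid(), signal.SIGKILL)
    results.put("ok")


def run_isolated(simulate_oom: bool, timeout: float = 10.0) -> str:
    ctx = mp.get_context("fork")  # assumption: Unix-only start method
    results = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(simulate_oom, results))
    proc.start()
    # Joining before reading is safe here only because the payload is tiny
    # and fits in the queue's pipe buffer.
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
    if proc.exitcode != 0:
        # A negative exit code is the number of the signal that killed the child.
        raise ChildCrashed(f"child exit code {proc.exitcode}")
    return results.get(timeout=1.0)
```

Inside a DBOS step, raising an exception such as ChildCrashed turns a child-process crash into an ordinary step failure, so it would follow the step exception path described earlier rather than taking down the parent.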
Production Considerations
This experiment demonstrates patterns useful for:
- Memory-intensive Operations: Isolating memory-heavy computations
- Fault Tolerance: Robust error handling and recovery
- Long-running Workflows: Workflows that need to survive application restarts
- Resource Management: Preventing memory leaks and resource exhaustion
Combining DBOS workflow management with multiprocessing lets a workflow survive application restarts while its memory-heavy work runs in a separate process, which suits production systems that need both reliability and performance.
Recent changes
- 2025-08-10 bdd71b7 added AI READMEs
- 2025-07-24 25f7b07 backup
Categories: experiments, Python
Tags: dbos-experiments