Pipeline Hang Prevention & Timeout Mechanisms

Overview

This document describes the timeout and hang prevention mechanisms implemented to keep the pipeline from stalling indefinitely during long-running operations.

Problem Statement

The pipeline was experiencing hanging issues:

- TotalSegmentator operations would hang indefinitely on GPU/CPU issues
- dcm2niix conversions would stall on corrupted DICOM files
- Radiomics parallel processing would block if any worker hung
- No progress indicators, making it impossible to distinguish slow progress from an actual hang
- No watchdog mechanisms to detect and handle stuck processes

Solutions Implemented

1. Subprocess Timeout Protection

TotalSegmentator Timeout

File: rtpipeline/segmentation.py:32-68

All TotalSegmentator operations now have configurable timeouts:

# Default: 3600 seconds (1 hour) per segmentation task
timeout = int(os.environ.get('TOTALSEG_TIMEOUT', '3600'))

CLI Option:

--totalseg-timeout 3600  # Default: 1 hour

Behavior:

- Automatically detects TotalSegmentator commands and applies appropriate timeout
- Logs detailed timeout errors when operations exceed the limit
- Continues with next task instead of hanging indefinitely
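
The _run() helper itself isn't reproduced in this document; the following is a minimal sketch of how a subprocess wrapper can select and enforce these timeouts for both TotalSegmentator and dcm2niix calls. The function name _run_with_timeout and its exact signature are illustrative, not the actual rtpipeline code.

import logging
import os
import subprocess

log = logging.getLogger(__name__)

def _run_with_timeout(cmd: list) -> bool:
    # Pick a timeout based on the command type, mirroring the behavior above.
    cmd_str = " ".join(cmd)
    if "TotalSegmentator" in cmd_str:
        timeout = int(os.environ.get("TOTALSEG_TIMEOUT", "3600"))
    elif "dcm2niix" in cmd_str:
        timeout = int(os.environ.get("DCM2NIIX_TIMEOUT", "300"))
    else:
        timeout = None  # no limit for other commands
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        log.error("Command timed out after %ss: %s", timeout, cmd_str)
        log.error("This usually indicates a hung process or insufficient resources.")
        return False
    except subprocess.CalledProcessError as exc:
        log.error("Command failed (exit code %d): %s", exc.returncode, cmd_str)
        return False

On timeout, subprocess.run() kills the child process before raising, so the caller can log the failure and move on to the next task.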

dcm2niix Timeout

File: rtpipeline/segmentation.py:32-68

DICOM to NIfTI conversions have shorter timeouts:

# Default: 300 seconds (5 minutes) per conversion
timeout = int(os.environ.get('DCM2NIIX_TIMEOUT', '300'))

CLI Option:

--dcm2niix-timeout 300  # Default: 5 minutes

Behavior:

- Protects against corrupted DICOM files that cause dcm2niix to hang
- Logs timeout and continues with next operation
- Pipeline falls back to DICOM-mode segmentation if NIfTI conversion fails

2. Task-Level Timeout Protection

Adaptive Workers Timeout

File: rtpipeline/utils.py:323-333, 410-433

All pipeline stages using run_tasks_with_adaptive_workers() now support per-task timeouts:

def run_tasks_with_adaptive_workers(
    ...,
    task_timeout: Optional[int] = None,
)

CLI Option:

--task-timeout 7200  # 2 hours per course (segmentation/DVH/etc.)

Behavior:

- Each course/task has its own timeout
- Times out and logs error if task exceeds limit
- Continues processing other courses instead of hanging
- Applies to: Segmentation, Custom Models, CT Cropping, DVH, Visualization

Example timeout error:

ERROR: Segmentation: task #5 (Patient123/Course1) timed out after 7200s
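
The scheduler in utils.py isn't shown in full here; below is a minimal sketch of one way such per-task timeouts can be enforced with a process pool. The function run_with_task_timeout and its signature are hypothetical, not the actual implementation.

import logging
from concurrent.futures import ProcessPoolExecutor, TimeoutError as FuturesTimeout
from typing import Callable, Optional, Sequence

log = logging.getLogger(__name__)

def run_with_task_timeout(label: str, items: Sequence, func: Callable,
                          max_workers: int = 2,
                          task_timeout: Optional[int] = None) -> list:
    # Submit every task, then collect each result with an optional per-task
    # timeout; a timed-out task is logged and recorded as None so the
    # remaining tasks keep running.
    results = [None] * len(items)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func, item): idx for idx, item in enumerate(items)}
        for future, idx in futures.items():
            try:
                results[idx] = future.result(timeout=task_timeout)
            except FuturesTimeout:
                log.error("%s: task #%d timed out after %ss", label, idx + 1, task_timeout)
            except Exception:
                log.exception("%s: task #%d failed", label, idx + 1)
    return results

Note that waiting on a future with a timeout abandons the result but does not kill the worker process; the subprocess-level timeouts above are still what reclaims a genuinely stuck worker.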

3. Radiomics Parallel Processing

Progress Monitoring

File: rtpipeline/radiomics_parallel.py:86

The radiomics processing includes configurable timeout and progress monitoring:

_TASK_TIMEOUT = int(os.environ.get('RTPIPELINE_RADIOMICS_TASK_TIMEOUT', '600'))

CLI Option:

--radiomics-task-timeout 600  # Default: 10 minutes per ROI

Behavior:

- Uses imap_unordered() for progress monitoring and efficient task distribution
- Progress is logged periodically as tasks complete
- Individual task retries are handled within worker processes
- Note: per-task timeout enforcement is handled at the subprocess level via the _TASK_TIMEOUT configuration within worker functions, not by the imap_unordered() iterator itself

Progress Logging:

INFO: Radiomics progress: 50/100 (50.0%), ETA: 234.5s
INFO: Radiomics progress: 100/100 (100.0%), ETA: 0.0s
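
The radiomics worker code isn't reproduced in this document; the following is a minimal sketch of the imap_unordered() progress pattern described above. The function names run_radiomics and extract_roi_features are placeholders, not the actual rtpipeline API.

import logging
import time
from multiprocessing import Pool

log = logging.getLogger(__name__)

def extract_roi_features(task):
    # Placeholder for the real per-ROI radiomics extraction.
    return task

def run_radiomics(tasks, workers=4):
    # imap_unordered yields results as soon as any worker finishes, so
    # progress and a rough ETA can be logged while other tasks still run.
    results = []
    start = time.monotonic()
    with Pool(processes=workers) as pool:
        for done, result in enumerate(pool.imap_unordered(extract_roi_features, tasks), start=1):
            results.append(result)
            elapsed = time.monotonic() - start
            eta = elapsed / done * (len(tasks) - done)
            if done % 10 == 0 or done == len(tasks):
                log.info("Radiomics progress: %d/%d (%.1f%%), ETA: %.1fs",
                         done, len(tasks), 100 * done / len(tasks), eta)
    return results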

4. Heartbeat & Progress Monitoring

Periodic Heartbeat Logging

File: rtpipeline/utils.py:403-408

The pipeline now logs a heartbeat message every 60 seconds:

if now - last_heartbeat > 60:  # Log every 60 seconds
    log.info("%s: Still processing... %d/%d completed (%.1f%%)",
            label, completed, total, 100 * completed / total)

Benefits:

- Distinguishes between slow progress and actual hangs
- Shows pipeline is still alive and making progress
- Helps identify which stage is slow

Slow Task Detection

File: rtpipeline/utils.py:418-422

Tasks taking longer than 5 minutes trigger warnings:

if task_duration > 300:  # Warn if task took more than 5 minutes
    log.warning("%s: task #%d (%s) took %.1fs (slow)",
               label, idx + 1, item_desc, task_duration)

Benefits:

- Identifies slow courses/tasks for investigation
- Helps tune timeout values
- Provides performance insights

5. Enhanced Error Handling

Timeout Error Handling

File: rtpipeline/utils.py:424-433

Explicit handling for timeout errors:

except TimeoutError:
    log.error("%s: task #%d (%s) timed out after %ds",
             label, idx + 1, item_desc, task_timeout)
    completed += 1
    results[idx] = None

Behavior:

- Logs detailed timeout information
- Marks task as failed (None result)
- Continues processing remaining tasks
- Doesn't crash entire pipeline

Memory Error Handling

Already present, now combined with timeout handling:

- Detects OOM errors
- Automatically reduces worker count
- Retries with lower parallelism
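
The OOM handling code itself isn't reproduced in this document; the following is a minimal sketch of the reduce-and-retry pattern it describes. The function name is illustrative, and real out-of-memory failures from subprocesses often surface as non-zero exit codes rather than a Python MemoryError.

import logging

log = logging.getLogger(__name__)

def run_stage_with_oom_fallback(run_stage, max_workers):
    # If a run fails with an out-of-memory error, halve the worker count
    # and retry until the stage succeeds or a single worker also fails.
    workers = max_workers
    while True:
        try:
            return run_stage(workers)
        except MemoryError:
            if workers <= 1:
                raise
            workers = max(1, workers // 2)
            log.warning("Out of memory; retrying with %d workers", workers)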

Configuration Matrix

Default Timeouts

| Operation | Default Timeout | Environment Variable | CLI Option |
|---|---|---|---|
| TotalSegmentator | 3600s (1 hour) | TOTALSEG_TIMEOUT | --totalseg-timeout |
| dcm2niix | 300s (5 min) | DCM2NIIX_TIMEOUT | --dcm2niix-timeout |
| General tasks | None (disabled) | N/A | --task-timeout |
| Radiomics ROI | 600s (10 min) | RTPIPELINE_RADIOMICS_TASK_TIMEOUT | --radiomics-task-timeout |

For Fast Systems (32+ cores, GPU)

rtpipeline \
  --totalseg-timeout 1800 \
  --dcm2niix-timeout 300 \
  --task-timeout 3600 \
  --radiomics-task-timeout 300

For Slow Systems or Large Datasets

rtpipeline \
  --totalseg-timeout 7200 \
  --dcm2niix-timeout 600 \
  --task-timeout 14400 \
  --radiomics-task-timeout 1200

For Debugging (catch hangs quickly)

rtpipeline \
  --totalseg-timeout 600 \
  --dcm2niix-timeout 120 \
  --task-timeout 1800 \
  --radiomics-task-timeout 180

For Production (balanced)

rtpipeline \
  --totalseg-timeout 3600 \
  --dcm2niix-timeout 300 \
  --task-timeout 7200 \
  --radiomics-task-timeout 600

Monitoring for Hangs

Log Patterns to Watch For

Normal Progress

INFO: Segmentation stage: using 1 parallel workers
INFO: Segmentation: Still processing... 5/10 completed (50.0%)
INFO: DVH stage: using 15 parallel workers
INFO: DVH: Still processing... 100/200 completed (50.0%)

Timeout Detected (Good - prevented hang!)

ERROR: Command timed out after 3600s: TotalSegmentator...
ERROR: This usually indicates a hung process or insufficient resources.
ERROR: Segmentation: task #5 (Patient123/Course1) timed out after 7200s

Slow Task Warning

WARNING: Segmentation: task #3 (Patient456/Course2) took 1234.5s (slow)

Actual Hang (Bad - needs investigation)

INFO: Segmentation stage: using 1 parallel workers
... no further logs for several minutes, even though heartbeat messages should appear every 60 seconds

Diagnostic Commands

Check if pipeline is stuck

# Watch log file for heartbeat messages
tail -f ./Logs/rtpipeline.log | grep -E "(Still processing|completed|timeout)"

# Check system resources
htop  # CPU usage
nvidia-smi -l 1  # GPU usage (should show activity)

# Check for zombie processes
ps aux | grep -E "(TotalSegmentator|dcm2niix|python)"

Kill hung pipeline safely

# Find pipeline process
ps aux | grep rtpipeline

# Send SIGTERM (graceful)
kill -TERM <pid>

# If it doesn't respond after 30s, force kill
kill -KILL <pid>

# Clean up any orphaned TotalSegmentator processes
pkill -f TotalSegmentator

Troubleshooting

Issue: Timeouts occurring frequently

Diagnosis:

- Check log for which stage is timing out
- Look at task durations in logs
- Check system resources (CPU, GPU, memory, disk I/O)

Solutions:

1. Increase the timeout:

   --totalseg-timeout 7200  # Double the timeout

2. Reduce parallelism (GPU memory):

   --seg-workers 1 --totalseg-force-split

3. Check for corrupted data:
   - Examine courses that consistently time out
   - Validate DICOM files
   - Check for extremely large volumes

Issue: Pipeline still hangs despite timeouts

Diagnosis:

- Check if heartbeat messages stopped
- Look for patterns in which stage hangs
- Check system logs for hardware issues

Possible causes:

1. Kernel/driver hang (GPU):
   - Check dmesg for GPU errors
   - May need to reboot the system
   - Update NVIDIA drivers

2. Network filesystem hang:
   - Check if the output directory is on an NFS/network share
   - May need to remount the filesystem
   - Consider using local storage

3. Docker/container issues:
   - Check Docker logs: docker logs <container>
   - May hit container resource limits
   - Try running outside Docker

Issue: False timeouts (tasks failing that should succeed)

Diagnosis:

- Check if tasks are genuinely slow or timing out too early
- Look at successful task durations
- Compare with similar datasets

Solutions:

1. Increase timeouts appropriately:

   # Large dataset example
   --totalseg-timeout 10800  # 3 hours
   --radiomics-task-timeout 1800  # 30 minutes

2. Reduce data size (if appropriate):
   - Use --totalseg-roi-subset to limit ROIs
   - Use --radiomics-skip-roi to skip large structures

3. Increase resources:
   - Reduce parallelism: --max-workers 4
   - Give tasks more memory
   - Use faster storage

Implementation Details

Files Modified

1. rtpipeline/segmentation.py
   - Added timeout parameter to _run() function
   - Automatic timeout detection based on command type
   - Timeout error handling with detailed logging

2. rtpipeline/utils.py
   - Added task_timeout parameter to run_tasks_with_adaptive_workers()
   - Implemented heartbeat logging (60s intervals)
   - Slow task detection (>5 min warning)
   - Per-task timeout enforcement with TimeoutError handling

3. rtpipeline/radiomics_parallel.py
   - Changed from pool.map() to pool.imap_unordered() for progress monitoring
   - Added per-ROI timeout constant _TASK_TIMEOUT
   - Implemented progress logging (every 10 tasks)
   - Better error handling for hung workers

4. rtpipeline/cli.py
   - Added CLI arguments: --totalseg-timeout, --dcm2niix-timeout, --task-timeout, --radiomics-task-timeout
   - Set environment variables from CLI args
   - Pass task_timeout to all adaptive worker calls
   - Enhanced parallelization logging

Backward Compatibility

All timeout features are backward compatible:

- Default timeouts are generous (won't break existing workflows)
- Timeouts can be disabled: pass --task-timeout 0 or omit the option
- Environment variables remain available for automation
- CLI options set the corresponding environment variables, so explicit flags take precedence over values already in the environment
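
For illustration only, this is one way a CLI layer can export its flags as the environment variables that the lower-level code reads; it is a sketch of the precedence rule above, not the actual rtpipeline/cli.py.

import argparse
import os

# Sketch: explicit CLI values are exported as environment variables, so a
# flag given on the command line takes effect even when the variable was
# already set in the environment.
parser = argparse.ArgumentParser()
parser.add_argument("--totalseg-timeout", type=int, default=None)
parser.add_argument("--dcm2niix-timeout", type=int, default=None)
args = parser.parse_args()

if args.totalseg_timeout is not None:
    os.environ["TOTALSEG_TIMEOUT"] = str(args.totalseg_timeout)
if args.dcm2niix_timeout is not None:
    os.environ["DCM2NIIX_TIMEOUT"] = str(args.dcm2niix_timeout)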

Testing Recommendations

Unit Testing

import time

from rtpipeline.segmentation import _run
from rtpipeline.utils import run_tasks_with_adaptive_workers

# Test timeout enforcement
def test_subprocess_timeout():
    result = _run("sleep 10", timeout=1)
    assert result is False  # Should time out

# Test task timeout
def test_task_timeout():
    def slow_func(x):
        time.sleep(10)
        return x
    results = run_tasks_with_adaptive_workers(
        "Test", [1], slow_func, max_workers=1, task_timeout=1
    )
    assert results[0] is None  # Should time out

Integration Testing

# Test with short timeouts to catch hangs quickly
rtpipeline \
  --dicom-root ./test_data \
  --totalseg-timeout 60 \
  --task-timeout 300 \
  --max-workers 2

# Monitor for timeouts in logs
grep -i "timeout" ./Logs/rtpipeline.log

Future Enhancements

Potential improvements for hang prevention:

1. Global pipeline timeout - the entire pipeline must complete within X hours
2. Watchdog thread - a separate thread monitors main pipeline health (see the sketch below)
3. Auto-recovery - automatically retry timed-out tasks with different settings
4. Adaptive timeouts - learn optimal timeouts from successful runs
5. Telemetry - track timeout frequencies and patterns
6. Resource monitoring - detect system resource exhaustion before timeout
7. Process tree monitoring - detect and kill orphaned subprocesses
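
As an example of the watchdog idea (item 2), a minimal sketch of what such a thread could look like; this is not existing rtpipeline code, and the class name and thresholds are hypothetical.

import logging
import threading
import time

log = logging.getLogger(__name__)

class Watchdog:
    # The pipeline calls beat() whenever it makes progress, and a daemon
    # thread logs an error if no beat arrives within the stall limit.
    def __init__(self, stall_limit_s=900.0, check_every_s=60.0):
        self._last_beat = time.monotonic()
        self._stall_limit = stall_limit_s
        self._check_every = check_every_s
        threading.Thread(target=self._watch, daemon=True).start()

    def beat(self):
        self._last_beat = time.monotonic()

    def _watch(self):
        while True:
            time.sleep(self._check_every)
            silent = time.monotonic() - self._last_beat
            if silent > self._stall_limit:
                log.error("Watchdog: no progress for %.0fs; pipeline may be hung", silent)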

Summary

The pipeline now has comprehensive hang prevention:

- ✅ Subprocess timeouts - TotalSegmentator and dcm2niix calls are protected
- ✅ Task-level timeouts - each course/ROI can time out individually
- ✅ Progress monitoring - heartbeat logging every 60s
- ✅ Slow task detection - warnings for tasks >5 minutes
- ✅ Graceful degradation - continues with other tasks after a timeout
- ✅ Configurable - all timeouts adjustable via CLI
- ✅ Well-logged - detailed timeout information for debugging

Result: the pipeline should never hang indefinitely. Timeouts catch stuck operations and allow processing to continue with the remaining data.