Radiomics Robustness Module

A built-in feature stability assessment module for rtpipeline that evaluates radiomics features under systematic perturbations following the NTCV methodology of Zwanenburg et al. (2019).¹

Overview

The radiomics robustness module helps you identify stable, reproducible radiomics features for modeling by:

Collecting segmentations from multiple sources:
TotalSegmentator (RS_auto.dcm)
Custom structures (RS_custom.dcm)
Custom models (Segmentation_{model_name}/rtstruct.dcm)
Generating systematic perturbations via the validated NTCV chain (Noise + Translation + Contour + Volume):¹
N: Image noise injection (Gaussian noise in HU)
T: Rigid translations (±3-5 mm geometric shifts)
C: Contour randomization (morphological boundary randomization via erosion/dilation smoothing)
V: Volume adaptation (erosion/dilation ±15-30% volume change)
Re-extracting radiomics features for each perturbation using PyRadiomics
Computing robustness metrics: ICC(3,1) with analytical 95% CIs (via Pingouin⁴), CoV, QCD, and cohort-wide pass fractions.³
Classifying features as "robust", "acceptable", or "poor" based on configurable thresholds informed by radiomics reproducibility literature.³

Scientific Background

This implementation follows the NTCV perturbation methodology from Zwanenburg et al. (2019)¹ and related radiomics reproducibility research:

Literature-validated perturbation chains: In the original Zwanenburg et al. (2019) test–retest validation, NTCV and RCV combinations detected ~98–99% of unstable features with <2% false positives in their specific datasets.¹ Important: These operating characteristics were established on specific test–retest datasets with particular parameter configurations. rtpipeline currently implements an NTCV-like perturbation chain only (RCV is not implemented) and has not independently validated these exact figures—users should treat them as literature benchmarks, not performance guarantees.
Volume adaptation: Iterative erosion/dilation (±15% volume change) provides a clinically plausible approximation of contour perturbations, as commonly used in CT radiomics reproducibility studies.
Reliability statistics: ICC(3,1) with analytical confidence intervals and complementary CoV thresholds support conservative clinical adoption.³
Standardization: The IBSI initiative² provides standardized definitions for radiomics features and recommends reporting perturbation details and preprocessing provenance.

Robustness Thresholds (Configurable Defaults)

The default thresholds in rtpipeline are informed by published recommendations, particularly Koo & Li (2016)³ for ICC interpretation and common practices in CT radiomics reproducibility studies:

ICC (Intraclass Correlation Coefficient): Following the qualitative descriptors of Koo & Li (2016):
- ICC ≥ 0.90: Excellent; used as the "robust" threshold in rtpipeline (conservative choice)
- 0.75 ≤ ICC < 0.90: Good; used as the "acceptable" threshold (standard)
- 0.50 ≤ ICC < 0.75: Moderate; not recommended for clinical modeling
- ICC < 0.50: Poor

CoV (Coefficient of Variation): CoV is computed as 100 × (standard deviation / mean) of perturbation-level feature values. rtpipeline defaults to CoV ≤ 10% as "robust", based on thresholds frequently reported in CT radiomics reproducibility studies:
- CoV ≤ 10%: Robust (conservative default threshold)
- 10% < CoV ≤ 20%: Acceptable
- CoV > 20%: Poor

Note: For features with means close to zero, CoV can become numerically unstable; users should manually review features with extreme CoV values.
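The CoV definition and the near-zero caveat can be illustrated with a minimal sketch. The function name `cov_pct`, the `min_abs_mean` guard, and the sample values are assumptions made for illustration and are not part of rtpipeline. The sketch uses the sample standard deviation and the absolute value of the mean, either of which may differ from rtpipeline's own estimator.

```python
import math
import statistics


def cov_pct(values, min_abs_mean=1e-6):
    """Return 100 * (sample SD / |mean|), or NaN when the mean is near zero.

    `min_abs_mean` is a hypothetical guard, not an rtpipeline setting.
    """
    mean = statistics.fmean(values)
    if abs(mean) < min_abs_mean:
        return math.nan  # CoV is numerically unstable here; review manually
    return 100.0 * statistics.stdev(values) / abs(mean)


# Illustrative perturbation-level values for one feature of one ROI.
stable = [0.456, 0.461, 0.448, 0.452, 0.459]
near_zero = [-0.002, 0.001, 0.002, -0.001, 0.0]

print(f"stable feature CoV: {cov_pct(stable):.2f}%")
print(f"near-zero-mean feature CoV: {cov_pct(near_zero)}")
```

Returning NaN rather than a very large number makes the unstable cases easy to filter out for manual review.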

Important: These are configurable defaults inspired by published conventions, not consensus standards. Users should adapt thresholds to their specific clinical and statistical context.

For highly conservative applications (delta-radiomics, adaptive RT), users may configure a stricter CoV ≤5% threshold via config.yaml:

radiomics_robustness:
  thresholds:
    cov:
      robust_pct: 5.0  # Override default 10%

A feature is classified as "robust" only if it meets both the ICC and the CoV threshold; a feature that fails either one is labeled "acceptable" or "poor" instead.
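The two-threshold rule can be written as a short sketch. The function `classify_feature` and its argument names are hypothetical; the default values mirror the documented thresholds, and the sketch assumes that "acceptable" also requires both its ICC and CoV thresholds to be met.

```python
def classify_feature(icc, cov_pct,
                     icc_robust=0.90, icc_acceptable=0.75,
                     cov_robust_pct=10.0, cov_acceptable_pct=20.0):
    """Label a feature from its ICC and CoV (%) using both thresholds."""
    if icc >= icc_robust and cov_pct <= cov_robust_pct:
        return "robust"
    if icc >= icc_acceptable and cov_pct <= cov_acceptable_pct:
        return "acceptable"
    return "poor"


print(classify_feature(0.95, 4.2))    # both robust thresholds met
print(classify_feature(0.95, 14.0))   # ICC passes, CoV exceeds 10%
print(classify_feature(0.60, 3.0))    # ICC below 0.75
print(classify_feature(0.95, 8.0, cov_robust_pct=5.0))  # stricter 5% CoV
```

Handling of missing or NaN metrics is deliberately left out of this sketch.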

Perturbation Intensity Levels

The module supports three intensity levels to balance computational cost and thoroughness. Typical use cases are given in parentheses:

mild: ~10-15 perturbations (QA spot checks, contouring pilot studies)
standard: 15-30 perturbations (recommended for most pelvic CT applications)
aggressive: 30-60 perturbations (research-grade, adaptive RT / multi-centre validation)

Perturbation Parameter Defaults

The following table summarizes the default perturbation parameters used by rtpipeline:

| Parameter       | Default Value         | Description                                                               |
|-----------------|-----------------------|---------------------------------------------------------------------------|
| N (Noise)       | `[0.0]` HU            | Gaussian noise σ; set `noise_levels: [0.0, 10.0, 20.0]` to enable         |
| T (Translation) | `0.0` mm              | Max shift; set `max_translation_mm: 3.0` for ±3 mm                        |
| C (Contour)     | `0` realizations      | Boundary randomizations; set `n_random_contour_realizations: 3` to enable |
| V (Volume)      | `[-0.15, 0.0, 0.15]`  | Volume change ratios (±15% erosion/dilation)                              |
| Intensity       | `"standard"`          | Controls perturbation count: mild/standard/aggressive                     |

Algorithm details:

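Volume adaptation: Iterative morphological erosion (τ < 0) or dilation (τ > 0) using a ball structuring element, where τ is the target fractional volume change (e.g., τ = 0.15 for +15%); iteration stops when the target volume change is achieved or after at most 20 iterations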
Contour randomization: Random selection of erosion→dilation or dilation→erosion smoothing sequence with radius proportional to max_translation_mm / 2
Translation: Rigid shifts applied via SimpleITK ResampleImageFilter with nearest-neighbor interpolation
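The iterative volume adaptation step can be sketched in a few lines of NumPy. This is a simplified illustration, not rtpipeline's code: the helper names `_step` and `adapt_volume` are hypothetical, and a one-voxel face-neighbour (6-connected) element is swapped in for the ball structuring element, so the achieved change overshoots the target more coarsely than a tuned implementation would.

```python
import numpy as np


def _step(mask, grow):
    """Dilate (grow=True) or erode (grow=False) by one voxel, face neighbours only."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = padded.copy()
    for axis in range(mask.ndim):
        for shift in (1, -1):
            neighbour = np.roll(padded, shift, axis=axis)
            out = (out | neighbour) if grow else (out & neighbour)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    return out[core]  # drop the padding again


def adapt_volume(mask, tau, max_iter=20):
    """Erode (tau < 0) or dilate (tau > 0) until the volume change reaches tau."""
    current = np.asarray(mask, dtype=bool)
    target = current.sum() * (1.0 + tau)
    for _ in range(max_iter):
        volume = current.sum()
        if tau == 0 or (tau > 0 and volume >= target) or (tau < 0 and volume <= target):
            break
        candidate = _step(current, grow=tau > 0)
        if not candidate.any():
            break  # ROI too small to erode further; a caller may skip it
        current = candidate
    return current


# Toy ROI: a 30 x 30 x 30 voxel cube inside a 40^3 volume.
roi = np.zeros((40, 40, 40), dtype=bool)
roi[5:35, 5:35, 5:35] = True

for tau in (-0.15, 0.0, 0.15):
    adapted = adapt_volume(roi, tau)
    change = adapted.sum() / roi.sum() - 1.0
    print(f"tau={tau:+.2f} -> achieved volume change {change:+.1%}")
```

Because each step moves the boundary by a whole voxel, small ROIs can jump well past the target or vanish entirely, which is consistent with small structures being skipped.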

Reproducibility: Random Seed

For deterministic, reproducible perturbations, rtpipeline reseeds NumPy's global random number generator from a fixed base value for each perturbation:

np.random.seed(42 + perturbation_count)

Because the seed depends only on the perturbation index, the same configuration produces identical perturbation sequences across runs, while each perturbation still draws different random values. The seed is currently not user-configurable; to use a different base value, modify the source code directly.
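The effect of the per-perturbation seed can be checked with a small sketch. The function `noise_for_perturbation`, the array shape, and the noise level are hypothetical and serve only to illustrate the seeding pattern shown above.

```python
import numpy as np


def noise_for_perturbation(perturbation_count, shape=(2, 3), sigma_hu=10.0):
    """Draw Gaussian noise (in HU) after seeding with 42 + perturbation_count."""
    np.random.seed(42 + perturbation_count)
    return np.random.normal(loc=0.0, scale=sigma_hu, size=shape)


first_run = [noise_for_perturbation(i) for i in range(3)]
second_run = [noise_for_perturbation(i) for i in range(3)]

# The same index reproduces the same noise; different indices differ.
same_across_runs = all(np.array_equal(a, b) for a, b in zip(first_run, second_run))
differs_between_indices = not np.array_equal(first_run[0], first_run[1])
print(same_across_runs, differs_between_indices)
```

Note that `np.random.seed` resets NumPy's global state, so other code drawing from the global generator in the same process is affected as well; `np.random.default_rng(seed)` is the local-generator alternative.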

Quick Start

1. Enable in config.yaml

Basic configuration (volume-only perturbations):

radiomics_robustness:
  enabled: true
  modes:
    - segmentation_perturbation

  segmentation_perturbation:
    apply_to_structures:
      - "GTV*"
      - "CTV*"
      - "BLADDER"
      - "RECTUM"
    small_volume_changes: [-0.15, 0.0, 0.15]  # ±15% volume change
    intensity: "standard"  # Options: mild, standard, aggressive

Advanced NTCV configuration (comprehensive testing):

radiomics_robustness:
  enabled: true
  modes:
    - segmentation_perturbation

  segmentation_perturbation:
    apply_to_structures:
      - "GTV*"
      - "CTV*"
      - "BLADDER"
      - "RECTUM"

    # Volume perturbations (V in NTCV)
    small_volume_changes: [-0.15, -0.10, -0.05, 0.0, 0.05, 0.10, 0.15]
    large_volume_changes: [-0.30, -0.20, -0.10, 0.0, 0.10, 0.20, 0.30]

    # Translation perturbations (T in NTCV)
    max_translation_mm: 3.0  # ±3mm rigid shifts

    # Contour randomization (C in NTCV)
    n_random_contour_realizations: 3  # 3 boundary randomization variants

    # Image noise injection (N in NTCV)
    noise_levels: [0.0, 10.0, 20.0]  # Gaussian noise std dev in HU

    # Perturbation intensity
    intensity: "aggressive"  # Options: mild (10-15), standard (15-30), aggressive (30-60)

2. Install dependencies

pip install pingouin  # Required for ICC computation
# or install with radiomics extras:
pip install -e ".[radiomics]"

3. Run with Snakemake

snakemake -c 8

The robustness analysis runs after standard radiomics extraction and produces:
- Per-course results: Data_Snakemake/{patient}/{course}/radiomics_robustness_ct.parquet containing all perturbation-level feature values
- Aggregated summary: Data_Snakemake/_RESULTS/radiomics_robustness_summary.xlsx

4. Use robust features for modeling

The output Excel file contains multiple sheets:

global_summary: Features averaged across all structures/courses
robust_features: Only features classified as "robust" (ICC ≥ 0.90, CoV ≤ 10%)
acceptable_features: Features meeting "acceptable" thresholds (ICC ≥ 0.75, CoV ≤ 20%)
per_source_summary (if available): Cohort metrics grouped by segmentation source
per_structure_source: Detailed per-structure breakdown (preserving segmentation source)
raw_values: All perturbation-level feature values used for the cohort statistics
robust_features_per_source (if available): Robust features for each segmentation source

Example workflow:

import pandas as pd

# Load results
summary = pd.read_excel("Data_Snakemake/_RESULTS/radiomics_robustness_summary.xlsx",
                        sheet_name="robust_features")

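# Filter radiomics data to robust features only.
# Note: columns.intersection keeps only columns named in the robust list,
# so any identifier columns in radiomics_ct.xlsx are dropped here.
robust_feature_names = summary["feature_name"].tolist()
radiomics = pd.read_excel("Data_Snakemake/_RESULTS/radiomics_ct.xlsx")
radiomics_robust = radiomics[radiomics.columns.intersection(robust_feature_names)]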

# Use radiomics_robust for modeling

CLI Commands

Per-course analysis

rtpipeline radiomics-robustness \
  --course-dir Data_Snakemake/Patient001/Course001 \
  --config config.yaml \
  --output radiomics_robustness_ct.parquet

Aggregate results

rtpipeline radiomics-robustness-aggregate \
  --inputs Data_Snakemake/*/*/radiomics_robustness_ct.parquet \
  --output radiomics_robustness_summary.xlsx \
  --config config.yaml

Configuration Reference

Full config.yaml example

radiomics_robustness:
  enabled: true

  modes:
    - segmentation_perturbation  # Currently supported
    # - segmentation_method      # Future: compare manual vs auto segmentation
    # - scan_rescan              # Future: test-retest reliability

  segmentation_perturbation:
    # Structures to analyze (wildcards supported)
    apply_to_structures:
      - "GTV*"      # Matches GTV, GTV_primary, GTV_node, etc.
      - "CTV*"
      - "PTV*"
      - "BLADDER"
      - "RECTUM"
      - "PROSTATE"

    # V: Volume changes (τ parameter from Lo Iacono 2024)
    small_volume_changes: [-0.15, 0.0, 0.15]   # ±15% volume change (standard)
    large_volume_changes: [-0.30, 0.0, 0.30]   # ±30% for stress testing

    # T: Translation perturbations (mm)
    # Rigid shifts in x, y, z directions to simulate positioning uncertainty
    max_translation_mm: 3.0  # Set to 0.0 to disable

    # C: Contour randomization
    # Simulates inter-observer variability in delineation
    n_random_contour_realizations: 3  # Set to 0 to disable

    # N: Noise injection (HU)
    # Gaussian noise to simulate scanner variability
    noise_levels: [0.0, 10.0, 20.0]  # Standard deviations in HU

    # Perturbation intensity (controls total perturbation count)
    # - "mild": ~10-15 perturbations per ROI (quick testing)
    # - "standard": 15-30 perturbations per ROI (recommended)
    # - "aggressive": 30-60 perturbations per ROI (research-grade)
    intensity: "standard"

  metrics:
    icc:
      implementation: "pingouin"  # ICC computation library
      icc_type: "ICC3"            # ICC(3,1): two-way mixed, consistency for fixed perturbations
      ci: true                    # Compute 95% confidence intervals
    cov:
      enabled: true
    qcd:
      enabled: true    # QCD = (Q3 - Q1) / (Q3 + Q1), a robust dispersion measure

  thresholds:
    icc:
      robust: 0.90       # Conservative default ("excellent" per Koo & Li 2016)
      acceptable: 0.75   # Standard default ("good" per Koo & Li 2016)
    cov:
      robust_pct: 10.0      # CoV ≤ 10%: "robust" (standard)
      acceptable_pct: 20.0  # CoV ≤ 20%: "acceptable"

ICC Implementation Details

rtpipeline computes ICC using Pingouin's intraclass_corr function⁴ with the following configuration:

ICC Type: ICC(3,1), i.e., two-way mixed-effects model, single measurement, consistency definition
Subject encoding: Each unique combination of patient_id + course_id + structure + segmentation_source is treated as one "subject" (so repeated courses for the same patient are distinct subjects)
Rater encoding: Each perturbation_id (e.g., ntcv_n10_t1_0_0_c0_v-0.15) is treated as a "rater"
Confidence intervals: Pingouin's analytical 95% CIs (not bootstrap-based)

This design treats perturbations as fixed "raters" of the same underlying subject and uses ICC(3,1) (two-way mixed, single, consistency), following the fixed-rater rationale described by Koo & Li (2016).³ This is an adaptation of ICC to perturbation-based robustness analysis rather than a standard inter-rater scenario. The ICC(3,1) model was chosen because:

Fixed perturbations: The perturbation set is chosen by the researcher, not randomly sampled from a population of possible raters
Single measurements: Each perturbation produces one feature value per ROI
Consistency: We measure relative agreement, not absolute agreement (small systematic offsets are acceptable)

Methodological caveat: This interpretation is one reasonable choice for perturbation-based robustness analysis, but it is not universally standardized. Other ICC formulations (e.g., absolute-agreement models or ICC(2,1)) could also be justified depending on study design. Users should verify that ICC(3,1) aligns with their specific reliability framework and may adjust the icc_type configuration if needed.

Conservative thresholding: When 95% confidence intervals are available, rtpipeline uses the lower CI bound for robustness classification rather than the point estimate. This means a feature is classified as "robust" only if ICC_CI95_lower ≥ 0.90. This conservative approach reduces false positives in robustness labeling but may exclude borderline features. If the CI is unavailable, the point estimate is used directly.
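The lower-bound rule described above can be sketched as follows. The function names are hypothetical, the sketch covers only the ICC part of the classification, and it assumes that a NaN lower bound counts as "unavailable" and that the lower bound is also used for the "acceptable" threshold.

```python
import math


def icc_for_classification(icc, ci95_lower=None):
    """Return the ICC value that is compared against the thresholds."""
    if ci95_lower is not None and not math.isnan(ci95_lower):
        return ci95_lower  # conservative: lower bound of the 95% CI
    return icc  # fallback: point estimate when no CI is available


def icc_label(icc, ci95_lower=None, robust=0.90, acceptable=0.75):
    """ICC-only label; CoV thresholds would be applied in addition."""
    value = icc_for_classification(icc, ci95_lower)
    if value >= robust:
        return "robust"
    if value >= acceptable:
        return "acceptable"
    return "poor"


print(icc_label(0.93, ci95_lower=0.88))  # point estimate passes, lower bound does not
print(icc_label(0.93))                   # no CI: point estimate is used
```

A borderline feature with ICC 0.93 and a lower bound of 0.88 is therefore not labeled robust, which is the intended conservative behaviour.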

Sample size note: ICC estimates derived from small numbers of perturbations (e.g., <10) or small cohorts (<20 subjects) may be unstable with wide confidence intervals. Users should ensure sufficient perturbations (typically ≥10–15 per ROI) for reliable ICC estimation.

Example data structure for ICC computation:

| subject                          | rater                    | feature_value |
|----------------------------------|--------------------------|---------------|
| patient001_course1_bladder_auto  | ntcv_n0_t0_0_0_c0_v0.0   | 0.456         |
| patient001_course1_bladder_auto  | ntcv_n10_t0_0_0_c0_v0.0  | 0.461         |
| patient001_course1_bladder_auto  | ntcv_n0_t1_0_0_c0_v-0.15 | 0.448         |
| ...                              | ...                      | ...           |
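To make the subject/rater layout concrete, the following sketch computes ICC(3,1) directly from long-format rows using the two-way ANOVA mean squares, ICC(3,1) = (MS_subjects − MS_error) / (MS_subjects + (k − 1) · MS_error). It is a transparent stand-in rather than rtpipeline's code path, which calls Pingouin's intraclass_corr with the subject, rater, and value columns as targets, raters, and ratings. The function name `icc3_1` and all values except the first subject's are illustrative assumptions, and a balanced design (every subject has every perturbation) is assumed.

```python
from collections import defaultdict


def icc3_1(rows):
    """ICC(3,1) from (subject, rater, value) rows of a balanced design."""
    table = defaultdict(dict)
    for subject, rater, value in rows:
        table[subject][rater] = float(value)
    subjects = sorted(table)
    raters = sorted(table[subjects[0]])
    if any(sorted(table[s]) != raters for s in subjects):
        raise ValueError("every subject needs a value for every rater")
    n, k = len(subjects), len(raters)
    if n < 2 or k < 2:
        raise ValueError("need at least 2 subjects and 2 raters")

    grand = sum(table[s][r] for s in subjects for r in raters) / (n * k)
    row_means = [sum(table[s].values()) / k for s in subjects]
    col_means = [sum(table[s][r] for s in subjects) / n for r in raters]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((table[s][r] - grand) ** 2 for s in subjects for r in raters)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


perturbations = ["ntcv_n0_t0_0_0_c0_v0.0", "ntcv_n10_t0_0_0_c0_v0.0",
                 "ntcv_n0_t1_0_0_c0_v-0.15"]
values = {
    "patient001_course1_bladder_auto": [0.456, 0.461, 0.448],
    "patient002_course1_bladder_auto": [0.612, 0.618, 0.601],  # illustrative
    "patient003_course1_bladder_auto": [0.380, 0.389, 0.371],  # illustrative
}
rows = [(s, p, v) for s, vals in values.items() for p, v in zip(perturbations, vals)]
print(f"ICC(3,1) = {icc3_1(rows):.3f}")
```

Because ICC(3,1) uses the consistency definition, a constant offset added by one perturbation to every subject does not lower the result.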

NTCV Perturbation Chain Explanation

The NTCV chain follows the systematic perturbation methodology from Zwanenburg et al. (2019).¹ The order (Noise → Translation → Contour → Volume) ensures proper propagation of uncertainty sources:

N (Noise): Adds Gaussian noise to the CT image
Simulates scanner variability and acquisition differences
Applied to the image, not the mask
Typical values: 0, 10, 20 HU std dev
T (Translation): Applies rigid geometric shifts
Simulates positioning uncertainty and registration errors
Applied before contour randomization
Typical values: ±3-5 mm in x, y, z directions
C (Contour): Randomizes ROI boundaries
Simulates inter-observer delineation variability
Applied to the already-shifted ROI
Uses boundary noise simulation
V (Volume): Systematic erosion/dilation
Final morphological adjustment
Tests feature stability across ROI size variations
Typical values: ±15% (standard) or ±30% (stress testing)

Total perturbations = N_noise × N_translation × N_contour × N_volume

Example:
- 3 noise levels × 3 translations × 2 contours × 5 volumes = 90 perturbations, which is more than even the aggressive cap of 60
- With intensity: "aggressive", the module selects a subset of these combinations to reach 30-60 perturbations
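The combinatorial growth can be reproduced with a short sketch. The parameter values and the helper `cap_perturbations` are illustrative assumptions; in particular, the uniform random subsampling used here is only a stand-in, since the rule rtpipeline uses to pick the subset is not described in this document.

```python
import itertools
import random

noise_levels = [0.0, 10.0, 20.0]                      # N: 3 levels (HU)
translations_mm = [(0, 0, 0), (3, 0, 0), (0, 3, 0)]   # T: 3 shifts
contour_realizations = [0, 1]                         # C: 2 realizations
volume_changes = [-0.15, -0.05, 0.0, 0.05, 0.15]      # V: 5 ratios

# Full Cartesian product: 3 * 3 * 2 * 5 combinations.
grid = list(itertools.product(noise_levels, translations_mm,
                              contour_realizations, volume_changes))


def cap_perturbations(grid, max_count, seed=42):
    """Return at most max_count combinations (hypothetical selection rule)."""
    if len(grid) <= max_count:
        return list(grid)
    return random.Random(seed).sample(grid, max_count)


subset = cap_perturbations(grid, max_count=60)
print(len(grid), len(subset))
```

Seeding the sampler keeps the chosen subset identical across runs, in the same spirit as the fixed perturbation seed.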

Output Format

Per-course parquet file

Columns:
- structure: ROI name (e.g., "BLADDER", "GTV_primary")
- segmentation_source: Source of segmentation (e.g., "AutoRTS_total", "Custom", "CustomModel:cardiac_STOPSTORM")
- feature_name: PyRadiomics feature (e.g., "original_glcm_Correlation")
- n_perturbations: Number of perturbations tested
- icc: ICC point estimate
- icc_ci95_low, icc_ci95_high: 95% confidence interval
- cov_pct: Coefficient of Variation (%)
- qcd: Quartile Coefficient of Dispersion, defined as (Q3 − Q1) / (Q3 + Q1) from perturbation-level values; a robust alternative to CoV for skewed distributions
- robustness_label: "robust", "acceptable", or "poor"
- pass_seg_perturb: Boolean (True if robust or acceptable)
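The QCD definition used for the qcd column can be illustrated with the standard library. The function name `qcd` and the sample values are assumptions for illustration, and the "inclusive" quartile convention chosen here may differ from the one rtpipeline applies.

```python
import statistics


def qcd(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    denominator = q3 + q1
    if denominator == 0:
        return float("nan")  # undefined when Q1 and Q3 cancel out
    return (q3 - q1) / denominator


regular = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 50.0]  # one extreme perturbation value

print(qcd(regular), qcd(with_outlier))
```

Both lists give the same QCD because the quartiles ignore the single extreme value, whereas the CoV of the second list would be far larger. This is the sense in which QCD is a robust alternative for skewed distributions.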

Aggregated Excel file

Multiple sheets for easy filtering:
1. global_summary: Features averaged across all structures/courses/sources
2. per_source_summary: Features averaged by segmentation source (TotalSegmentator vs Custom vs Custom Models)
3. per_structure_source: Detailed per-structure-source breakdown
4. robust_features: Only ICC ≥ 0.90 and CoV ≤ 10% (global)
5. acceptable_features: ICC ≥ 0.75 and CoV ≤ 20% (global)
6. robust_features_per_source: Robust features broken down by segmentation source

Best Practices

Feature Selection Strategy (Based on 2023-2025 Research)

Primary recommendation: Use only "robust" features (ICC ≥ 0.90, CoV ≤ 10%) for predictive models
Modern clinical applications increasingly demand ICC ≥0.90 (conservative threshold)
A substantial proportion of features can meet this threshold, though the exact fraction is highly dataset- and protocol-dependent
Multi-center studies: Consider "acceptable" features (ICC ≥ 0.75, CoV ≤ 20%) if data scarcity requires it
ICC ≥ 0.75 corresponds to the "good" band of Koo & Li (2016)
Suitable for exploratory analysis or hypothesis generation
Cost-effective alternative: Perturbation-based stability analysis provides a practical alternative to expensive test-retest imaging
In the original Zwanenburg et al. (2019) study, NTCV-like chains detected ~98–99% of unstable features with <2% false positives on their specific test-retest datasets
These are literature benchmarks; users should not assume identical performance on their data
Perturbation intensity selection:
mild: Quick screening, pilot studies
standard: Production use, clinical applications (recommended)
aggressive: Research-grade, publication-quality analysis, multi-center studies

Recommended NTCV configuration for clinical RT:

small_volume_changes: [-0.15, -0.10, -0.05, 0.0, 0.05, 0.10, 0.15]
max_translation_mm: 3.0
n_random_contour_realizations: 2
noise_levels: [0.0, 10.0]
intensity: "standard"

This generates ~25-30 perturbations per ROI, which is above the ≥10-15 per ROI suggested for reliable ICC estimation.

Structure and Modality Considerations

Structure selection: Focus on clinically relevant structures (GTV, CTV, organs at risk)
Volume thresholds: Features from very small ROIs are often unstable regardless of type
Multi-source analysis: Compare robustness across segmentation sources (TotalSegmentator vs Custom structures)
CT-specific: Current implementation optimized for CT-based radiomics in bladder, prostate, and rectal cancer
Validation: If possible, validate feature stability on independent test-retest data

Advanced Considerations

Harmonization: For multi-scanner studies, consider CovBat harmonization (reported to outperform traditional ComBat in some studies)
Note: CovBat implementation not included in this module - apply as preprocessing
Feature selection methods: Consider stability-aware approaches like Graph-FS that maintain performance across institutions
Combines stability metrics with predictive performance
Discretization: Ensure consistent bin width (HU) or bin count across all perturbations
PyRadiomics settings controlled via radiomics_params.yaml
Preprocessing chains: Robustness to resampling and discretization variations important for multi-center studies
Test with different voxel sizes if data will come from multiple scanners

Example Methods Paragraph for Publications

When describing rtpipeline's robustness analysis in a manuscript, consider using language similar to:

Radiomics Feature Stability Assessment

Radiomics feature stability was assessed using rtpipeline's perturbation-based robustness module, which implements a perturbation chain methodology inspired by Zwanenburg et al.[1] For each ROI, [N] systematic perturbations were generated combining Gaussian noise injection (σ = 0, 10, 20 HU), rigid translations (±3 mm), contour randomization, and volume adaptation (±15% erosion/dilation). Features were re-extracted for each perturbation using PyRadiomics [version].

Feature stability was quantified using the intraclass correlation coefficient ICC(3,1) computed via Pingouin[2], with each perturbation treated as a fixed rater measuring the same underlying subject (patient-course-structure combination). This interpretation follows the fixed-rater rationale described by Koo & Li[3] but represents an adaptation of ICC to perturbation-based analysis rather than a standard inter-rater scenario. Features with ICC ≥ 0.90 and coefficient of variation (CoV) ≤ 10% were classified as "robust" following commonly used thresholds in the radiomics literature.[3] Only robust features were retained for subsequent modeling.

[1] Zwanenburg A, et al. Assessing robustness of radiomic features by image perturbation. Sci Rep. 2019;9:614. DOI: 10.1038/s41598-018-36938-4
[2] Vallat R. Pingouin: statistics in Python. JOSS. 2018;3(31):1026. DOI: 10.21105/joss.01026
[3] Koo TK, Li MY. A guideline of selecting and reporting ICC for reliability research. J Chiropr Med. 2016;15(2):155-163. DOI: 10.1016/j.jcm.2016.02.012

Important: Adjust the specific parameter values (noise levels, translation distances, volume changes, thresholds) to match your actual configuration. If using the CTV1 D95 heuristic for Rx estimation, explicitly note this limitation. If using Fast Mode for segmentation, note that lower-resolution segmentations were used.

Typical Feature Families by Robustness

Based on radiomics reproducibility literature (feature-type patterns are generally consistent across studies, though specific results vary):

Generally Robust:
- Shape features (Volume, SurfaceArea, Sphericity)
- First-order statistics (Mean, Median, Energy)
- Some GLDM features

Moderately Robust:
- GLCM features (with appropriate normalization)
- GLRLM features

Often Fragile:
- High-order texture features without careful preprocessing
- Features from very small ROIs
- Features sensitive to discretization

Troubleshooting

"Radiomics robustness is disabled"

Enable it in config.yaml: radiomics_robustness.enabled: true

"Pingouin not available"

Install: pip install "pingouin>=0.5.3" (the quotes stop the shell from treating >= as an output redirection)

"No structures matched robustness patterns"

Check apply_to_structures patterns in config. Verify structure names match your patterns in:
- RS_auto.dcm (TotalSegmentator)
- RS_custom.dcm (Custom structures)
- Segmentation_{model_name}/rtstruct.dcm (Custom models)
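A quick way to see which structure names a pattern list would pick up is to test them offline. This sketch assumes case-sensitive, shell-style wildcard matching via Python's fnmatch module; rtpipeline's actual matching rules (for example, case handling) may differ, and the structure names below are made up for illustration.

```python
from fnmatch import fnmatchcase

patterns = ["GTV*", "CTV*", "BLADDER", "RECTUM"]
structure_names = ["GTV_primary", "gtv_node", "Bladder", "BLADDER", "Rectum_wall"]


def split_by_match(names, patterns):
    """Return (matched, unmatched) names under case-sensitive glob matching."""
    matched = [n for n in names if any(fnmatchcase(n, p) for p in patterns)]
    unmatched = [n for n in names if n not in matched]
    return matched, unmatched


matched, unmatched = split_by_match(structure_names, patterns)
print("matched:  ", matched)
print("unmatched:", unmatched)
```

Names that differ only in case or carry a suffix (such as "Rectum_wall" against the pattern "RECTUM") are typical reasons for an empty match under this assumption.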

"Insufficient perturbations for X; skipping"

Some small structures may fail erosion/dilation. This is expected; they'll be skipped automatically.

Key References

Zwanenburg A, et al. (2019). "Assessing robustness of radiomic features by image perturbation." Scientific Reports 9, 614. DOI: 10.1038/s41598-018-36938-4
Zwanenburg A, et al. (2020). "The Image Biomarker Standardization Initiative (IBSI)." Radiology 295(2):328-338. DOI: 10.1148/radiol.2020191145
Koo TK, Li MY. (2016). "A guideline of selecting and reporting intraclass correlation coefficients for reliability research." Journal of Chiropractic Medicine 15(2):155-163. DOI: 10.1016/j.jcm.2016.02.012
Vallat R. (2018). "Pingouin: statistics in Python." Journal of Open Source Software 3(31):1026. DOI: 10.21105/joss.01026

Note: Additional references cited in the literature review sections represent reported findings from the radiomics reproducibility literature. Users should verify specific citations for their own publications.

What's New (2025)

NTCV Perturbation Chain Implementation

Based on 2023-2025 radiomics stability research:

✅ Implemented:
- NTCV (Noise + Translation + Contour + Volume) perturbation chains
- Image noise injection (Gaussian noise in HU)
- Rigid translation perturbations (±3-5 mm shifts)
- Contour randomization (boundary noise simulation)
- Configurable perturbation intensity (mild/standard/aggressive)
- Research-grade testing: 30-60 perturbations per ROI
- Conservative clinical thresholds: ICC ≥ 0.90 and CoV ≤ 10%

Key improvements over basic volume-only perturbations:
- Comprehensive stability testing following the Zwanenburg 2019 NTCV methodology
- Literature-reported performance: Zwanenburg et al. achieved ~98–99% sensitivity with <2% false positives on their specific test-retest datasets; these serve as benchmarks, not guarantees
- Cost-effective alternative to expensive test-retest imaging
- Multi-axis perturbations capture different sources of variability

Research Basis

The implementation is informed by radiomics reproducibility literature findings:
1. ICC ≥0.75 and CoV ≤10% are commonly used thresholds (ICC ≥0.90 for conservative clinical use)³
2. Systematic perturbation chains combining geometric and image-based variations improve robustness assessment
3. 30-60 perturbed versions per ROI are often recommended for comprehensive assessment
4. A substantial proportion of features (varies by study and structure) may meet stability thresholds
5. Perturbation-based methods provide a practical alternative when test-retest data is unavailable

Future Enhancements

The following methods are referenced in the literature sections above but are not yet implemented in rtpipeline. They are planned for future releases:

CovBat harmonization: Advanced harmonization method (reported to outperform traditional ComBat); until it is implemented, it must be applied as external preprocessing
Graph-FS feature selection: Stability-aware feature selection maintaining multi-institution performance
Segmentation-method robustness: Compare Manual vs TotalSegmentator vs custom models
Scan-rescan ICC: Test-retest reliability for longitudinal studies
Panel-averaged features: Zwanenburg-style feature averaging across perturbations
Preprocessing variations: Resampling and discretization robustness testing

Footnote References

Note: Some citations in earlier sections reference findings from the broader radiomics reproducibility literature. Users should independently verify specific references for publication purposes.

1. A. Zwanenburg et al., "Assessing robustness of radiomic features by image perturbation," Scientific Reports 9, 614 (2019). DOI: 10.1038/s41598-018-36938-4
2. A. Zwanenburg et al., "The Image Biomarker Standardization Initiative," Radiology 295(2):328–338 (2020). DOI: 10.1148/radiol.2020191145
3. T. K. Koo and M. Y. Li, "A guideline of selecting and reporting intraclass correlation coefficients for reliability research," Journal of Chiropractic Medicine 15(2):155–163 (2016). DOI: 10.1016/j.jcm.2016.02.012
4. R. Vallat, "Pingouin: statistics in Python," Journal of Open Source Software 3(31):1026 (2018). DOI: 10.21105/joss.01026