Stratification Pipeline
The Stratification pipeline extends hierarchical clustering with response-aware feature selection and a LOOCV SVM classifier. It takes a cohort with clinical response labels and produces a trained classification model, ROC/PR curves, and feature importance outputs.
Web app slug: stratification-pipeline
When to use
Use this pipeline when your cohort TSV includes a response column (a clinical outcome label, e.g., responder/non-responder) and you want to:
- Select immune repertoire features that distinguish response groups
- Build and evaluate a Leave-One-Out Cross-Validation (LOOCV) SVM classifier
- Generate interpretable heatmaps and volcano plots
Submitting a job
The stratification pipeline uses a 6-step stepper form in the web app.
Navigate to Dashboard → Stratification Pipeline → New Job.
Step 1: Cohort sheet
Upload a cohort TSV. Required columns:
| Column | Description |
|---|---|
| subject_id | Unique subject identifier |
| airr_file_path | Path to per-subject AIRR files |
| response | Clinical response label (e.g., responder, non-responder, or 0/1) |
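
Before uploading, it can help to check the cohort sheet locally. The following is a minimal sketch, not the web app's own validation: the function name `validate_cohort` is hypothetical, and the rule that exactly two distinct response labels are expected is an assumption based on the binary responder/non-responder examples above.

```python
import csv
import io

REQUIRED_COLUMNS = {"subject_id", "airr_file_path", "response"}


def validate_cohort(tsv_text):
    """Return a list of problems found in a cohort TSV (empty list = OK)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = set(reader.fieldnames or [])
    missing = sorted(REQUIRED_COLUMNS - header)
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    labels = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sid = (row["subject_id"] or "").strip()
        label = (row["response"] or "").strip()
        if not sid:
            problems.append(f"line {line_no}: empty subject_id")
        elif sid in seen_ids:
            problems.append(f"line {line_no}: duplicate subject_id {sid!r}")
        seen_ids.add(sid)
        if not label:
            problems.append(f"line {line_no}: empty response")
        else:
            labels.add(label)
    # Assumption: a binary classifier needs exactly two distinct labels,
    # so mixed encodings such as 0/1 plus "responder" are flagged here.
    if len(labels) != 2:
        problems.append(f"expected 2 response labels, found {sorted(labels)}")
    return problems
```

To check a file on disk, read it with `open(path, encoding="utf-8").read()` and pass the text to `validate_cohort`.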
Step 2: Clustering parameters
These are the same as in the Clustering pipeline: sid, cov, auto-tune, and metric.
Step 3: Feature selection
| Field | Default | Description |
|---|---|---|
| FDR threshold | 0.05 | Benjamini-Hochberg FDR cutoff for feature selection |
| Min cluster size | 2 | Minimum number of subjects a cluster must appear in to be considered |
| Log transform | true | Apply log1p transformation to frequency features |
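
To make the three fields concrete, here is a minimal sketch of what each one controls. It is illustrative only: the function names are hypothetical, the per-feature p-values are assumed to come from some upstream statistical test that this page does not specify, and "appears in" is assumed to mean a frequency greater than zero.

```python
import math


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(p_values, fdr_threshold=0.05):
    """Indices of features whose adjusted p-value is at or below the cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, value in enumerate(adjusted) if value <= fdr_threshold]


def appears_in_enough_subjects(column, min_subjects=2):
    """Assumption: a cluster 'appears' in a subject when its frequency is > 0."""
    return sum(1 for value in column if value > 0) >= min_subjects


def log1p_transform(frequencies):
    """log(1 + x), which maps a frequency of 0 to exactly 0."""
    return [math.log1p(x) for x in frequencies]
```

For example, with p-values `[0.001, 0.008, 0.039, 0.041, 0.6]` and the default threshold, only the first two features are kept: the third and fourth have raw p-values under 0.05 but adjusted values just above it.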
Step 4: Classification
| Field | Default | Description |
|---|---|---|
| SVM kernel | rbf | SVM kernel (rbf, linear, poly) |
| Class weight | balanced | Handle class imbalance |
| CV folds | LOOCV | Leave-one-out cross-validation (fixed) |
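
The leave-one-out scheme itself is easy to sketch. In the example below, each subject is held out once, a model is trained on the remaining subjects, and the held-out subject is predicted. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the SVM; it is not what the pipeline uses, and all function names here are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def loocv_predict(X, y, fit=nearest_centroid_fit, predict=nearest_centroid_predict):
    """Hold out each subject once, train on the rest, predict the held-out one."""
    predictions = []
    for held_out in range(len(X)):
        train_X = X[:held_out] + X[held_out + 1:]
        train_y = y[:held_out] + y[held_out + 1:]
        model = fit(train_X, train_y)
        predictions.append(predict(model, X[held_out]))
    return predictions


# Toy, made-up feature matrix: 6 subjects x 2 features.
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.9, 0.8], [0.8, 0.9], [0.85, 0.85]]
y = ["non-responder"] * 3 + ["responder"] * 3
print(loocv_predict(X, y))
```

With scikit-learn available, the same loop would typically pair `LeaveOneOut` with `SVC(kernel="rbf", class_weight="balanced")`, which matches the defaults in the table above. Whether the pipeline is built on scikit-learn is not stated here.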
Step 5: Visualization
| Field | Default | Description |
|---|---|---|
| Generate heatmap | true | CLR-normalized heatmap of selected features |
| Generate volcano plot | true | Volcano plot of feature significance vs. effect size |
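
For readers unfamiliar with the two plots, the sketch below shows the arithmetic that usually sits behind them. The centered log-ratio (CLR) formula is standard; the pseudocount used to avoid log(0), the choice of log2 fold change as the effect size, and the function names are all assumptions, since this page does not say how the pipeline computes them.

```python
import math


def clr(row, pseudocount=1e-6):
    """Centered log-ratio: log of each value minus the mean log of the row."""
    logs = [math.log(value + pseudocount) for value in row]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def volcano_point(mean_a, mean_b, p_value, pseudocount=1e-6):
    """Return (log2 fold change, -log10 p) for one feature.

    Assumption: effect size is the log2 ratio of group means.
    """
    log2_fc = math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    return log2_fc, -math.log10(p_value)
```

CLR values for one subject always sum to approximately zero, so the heatmap shows each cluster relative to that subject's typical cluster frequency rather than its raw frequency.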
Step 6: Review & submit
Review all parameters, then submit the job.
Output files
| File | Description |
|---|---|
| feature_matrix.tsv | Subject × cluster frequency matrix |
| selected_features.tsv | Features passing FDR threshold with test statistics |
| clr_heatmap.html | Interactive CLR heatmap |
| clr_heatmap.png | Static heatmap image |
| volcano_plot.html | Interactive volcano plot |
| volcano_plot.png | Static volcano plot |
| loocv_predictions.tsv | Per-subject predicted labels and probabilities |
| classifier_metrics.json | AUC, accuracy, sensitivity, specificity |
| roc_curve.png | ROC curve plot |
| pr_curve.png | Precision-Recall curve plot |
| stratification_results.zip | Archive of all outputs |
Interpreting results
classifier_metrics.json:

    {
      "auc_roc": 0.84,
      "auc_pr": 0.79,
      "accuracy": 0.76,
      "sensitivity": 0.80,
      "specificity": 0.73,
      "n_features_selected": 12,
      "n_subjects": 45
    }

loocv_predictions.tsv, one row per subject:
| subject_id | true_label | predicted_label | probability |
|---|---|---|---|
| S001 | responder | responder | 0.82 |
| S002 | non-responder | non-responder | 0.11 |
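
The headline metrics can be cross-checked against the per-subject predictions. The following sketch assumes that the probability column is the predicted probability of the responder class, which the sample rows suggest but this page does not state; the function names are hypothetical, and the third and fourth subjects in the example data are made up for illustration.

```python
def confusion_metrics(true_labels, predicted_labels, positive="responder"):
    """Accuracy, sensitivity and specificity from hard predictions.

    Requires at least one positive and one negative subject.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc_roc(true_labels, probabilities, positive="responder"):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [p for t, p in zip(true_labels, probabilities) if t == positive]
    neg = [p for t, p in zip(true_labels, probabilities) if t != positive]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Rows 1-2 mirror the sample table; rows 3-4 are made-up illustration data.
true = ["responder", "non-responder", "responder", "non-responder"]
pred = ["responder", "non-responder", "non-responder", "non-responder"]
prob = [0.82, 0.11, 0.40, 0.45]
print(confusion_metrics(true, pred))
print(auc_roc(true, prob))
```

Sensitivity is the fraction of true responders predicted as responders, and specificity is the fraction of true non-responders predicted as non-responders.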
Troubleshooting
T0 fails: missing response column
Add the response column to your cohort TSV. Values must be consistent (e.g., only 0/1, or only responder/non-responder).
T5 fails: R packages not found
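The conda worker must have vegan, ggplot2, and limma installed. Check the Dockerfile.conda.prd build, or run conda install -c conda-forge -c bioconda r-vegan r-ggplot2 bioconductor-limma on the worker (the r-vegan and r-ggplot2 packages are published on conda-forge, bioconductor-limma on bioconda).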
T6 gives AUC close to 0.5
An AUC near 0.5 means the classifier performs no better than chance. This usually indicates that the selected features don’t separate the response groups, or that the cohort is too small. Try lowering the FDR threshold (a stricter cutoff) or increasing Min cluster size (min_cluster_size) to reduce noise features.