Stratification Pipeline

The Stratification pipeline extends hierarchical clustering with response-aware feature selection and an SVM classifier evaluated by leave-one-out cross-validation (LOOCV). It takes a cohort with clinical response labels and produces a trained classification model, ROC/PR curves, and feature importance outputs.

Web app slug: stratification-pipeline

When to use

Use this pipeline when your cohort TSV includes a response column (clinical outcome label, e.g., responder/non-responder) and you want to:

Select immune repertoire features that distinguish response groups
Build and evaluate a Leave-One-Out Cross-Validation (LOOCV) SVM classifier
Generate interpretable heatmaps and volcano plots

Submitting a job

The stratification pipeline uses a 6-step stepper form in the web app.

Navigate to Dashboard → Stratification Pipeline → New Job.

Step 1: Cohort sheet

Upload a cohort TSV. Required columns:

Column	Description
`subject_id`	Unique subject identifier
`airr_file_path`	Path to per-subject AIRR files
`response`	Clinical response label (e.g., `responder`, `non-responder`, or `0`/`1`)
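
Checking the sheet locally before upload can catch a missing or inconsistent response column early. The following is a minimal sketch, not the pipeline's own validator: the function name `validate_cohort` is hypothetical, and the rule that exactly two distinct response labels are expected is an assumption based on the binary examples above.

```python
import csv
import io

REQUIRED_COLUMNS = {"subject_id", "airr_file_path", "response"}


def validate_cohort(tsv_text):
    """Return a list of human-readable problems found in a cohort TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {', '.join(sorted(missing))}"]

    problems = []
    seen_ids = set()
    labels = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        subject = (row["subject_id"] or "").strip()
        if not subject:
            problems.append(f"line {line_no}: empty subject_id")
        elif subject in seen_ids:
            problems.append(f"line {line_no}: duplicate subject_id {subject}")
        seen_ids.add(subject)

        label = (row["response"] or "").strip()
        if label:
            labels.add(label)
        else:
            problems.append(f"line {line_no}: empty response")

    # Assumption: the classifier is binary, so mixed encodings such as
    # responder/non-responder plus 0/1 show up as more than two labels.
    if len(labels) != 2:
        problems.append(f"expected exactly 2 response labels, found {sorted(labels)}")
    return problems


if __name__ == "__main__":
    sheet = (
        "subject_id\tairr_file_path\tresponse\n"
        "S001\tdata/S001.tsv\tresponder\n"
        "S002\tdata/S002.tsv\tnon-responder\n"
    )
    print(validate_cohort(sheet) or "cohort sheet looks consistent")
```

The file paths in the example are placeholders; the check does not verify that the AIRR files exist.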

Step 2: Clustering parameters

These fields are the same as in the Clustering pipeline: sid, cov, auto-tune, and metric.

Step 3: Feature selection

Field	Default	Description
FDR threshold	0.05	Benjamini-Hochberg FDR cutoff for feature selection
Min cluster size	2	Minimum number of subjects a cluster must appear in to be considered
Log transform	true	Apply log1p transformation to frequency features
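
To make the FDR threshold and log transform concrete, the sketch below applies the Benjamini-Hochberg adjustment to a set of per-cluster p-values and keeps those at or below the cutoff. It is an illustration under assumptions: the p-values are invented, the statistical test that produces them in the pipeline is not shown here, and the helper names (`benjamini_hochberg`, `select_features`, `log1p_transform`) are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each q-value
    # is no larger than the q-values of the ranks above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(p_by_feature, fdr_threshold=0.05):
    """Keep features whose BH-adjusted p-value is at or below the cutoff."""
    names = list(p_by_feature)
    q_values = benjamini_hochberg([p_by_feature[n] for n in names])
    return [n for n, q in zip(names, q_values) if q <= fdr_threshold]


def log1p_transform(frequencies):
    """log(1 + x) keeps zeros at zero while compressing large frequencies."""
    return [math.log1p(x) for x in frequencies]


if __name__ == "__main__":
    # Hypothetical raw p-values for four clusters.
    p_by_feature = {"c1": 0.001, "c2": 0.02, "c3": 0.03, "c4": 0.5}
    print(select_features(p_by_feature, fdr_threshold=0.05))
    print(log1p_transform([0.0, 0.01, 0.2]))
```

Lowering `fdr_threshold` makes selection stricter, so fewer features pass.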

Step 4: Classification

Field	Default	Description
SVM kernel	`rbf`	SVM kernel (`rbf`, `linear`, `poly`)
Class weight	`balanced`	Handle class imbalance
CV folds	LOOCV	Leave-one-out cross-validation (fixed)
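
Leave-one-out cross-validation holds out each subject once, fits the model on the remaining subjects, and predicts the held-out subject. The sketch below shows only that loop. To stay dependency-free it uses a nearest-centroid classifier as a stand-in, not the pipeline's SVM; the function names and the toy data are hypothetical.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, held_out_index), holding out each sample once."""
    for held_out in range(n):
        yield [i for i in range(n) if i != held_out], held_out


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier: return the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], x))


def loocv_predict(X, y, predict=nearest_centroid_predict):
    """Return one out-of-sample prediction per subject."""
    predictions = [None] * len(X)
    for train, held_out in leave_one_out_splits(len(X)):
        predictions[held_out] = predict(
            [X[i] for i in train], [y[i] for i in train], X[held_out]
        )
    return predictions


if __name__ == "__main__":
    # Toy feature matrix: six subjects, two selected features each.
    X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
    y = ["non-responder"] * 3 + ["responder"] * 3
    print(loocv_predict(X, y))
```

Because every prediction comes from a model that never saw that subject, the per-subject predictions can be scored together as a single held-out set.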

Step 5: Visualization

Field	Default	Description
Generate heatmap	true	CLR-normalized heatmap of selected features
Generate volcano plot	true	Volcano plot of feature significance vs. effect size
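
For readers unfamiliar with CLR (centered log-ratio) normalization, the sketch below shows the standard transform: take the log of each value and subtract the mean of the logs. Two details are assumptions rather than documented pipeline behavior: the transform is applied per subject across clusters, and a small pseudocount is added so that zero frequencies do not produce log(0).

```python
import math


def clr(values, pseudocount=1e-6):
    """Centered log-ratio: log of each value minus the mean of the logs."""
    logs = [math.log(v + pseudocount) for v in values]
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]


if __name__ == "__main__":
    # Hypothetical cluster frequencies for one subject.
    frequencies = [0.5, 0.3, 0.2, 0.0]
    transformed = clr(frequencies)
    print([round(v, 3) for v in transformed])
    # CLR values always sum to (approximately) zero within a subject.
    print(abs(sum(transformed)) < 1e-9)
```

Positive CLR values mark clusters that are above that subject's geometric-mean frequency, negative values mark clusters below it.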

Step 6: Review & submit

Review all parameters and submit.

Output files

File	Description
`feature_matrix.tsv`	Subject × cluster frequency matrix
`selected_features.tsv`	Features passing FDR threshold with test statistics
`clr_heatmap.html`	Interactive CLR heatmap
`clr_heatmap.png`	Static heatmap image
`volcano_plot.html`	Interactive volcano plot
`volcano_plot.png`	Static volcano plot
`loocv_predictions.tsv`	Per-subject predicted labels and probabilities
`classifier_metrics.json`	ROC AUC, PR AUC, accuracy, sensitivity, specificity, and the number of selected features and subjects
`roc_curve.png`	ROC curve plot
`pr_curve.png`	Precision-Recall curve plot
`stratification_results.zip`	Archive of all outputs

Interpreting results

classifier_metrics.json

{
  "auc_roc": 0.84,
  "auc_pr": 0.79,
  "accuracy": 0.76,
  "sensitivity": 0.80,
  "specificity": 0.73,
  "n_features_selected": 12,
  "n_subjects": 45
}

loocv_predictions.tsv — one row per subject:

subject_id	true_label	predicted_label	probability
S001	responder	responder	0.82
S002	non-responder	non-responder	0.11
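
The accuracy, sensitivity, and specificity values can be recomputed from the per-subject predictions, which is a useful sanity check. The sketch below is a minimal example assuming the column names shown above and assuming that `responder` is treated as the positive class; the pipeline's actual positive-class convention is not stated here, so adjust the `positive` argument as needed.

```python
import csv
import io


def confusion_metrics(tsv_text, positive="responder"):
    """Compute accuracy, sensitivity, and specificity from prediction rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    tp = fp = tn = fn = 0
    for row in reader:
        actual = row["true_label"] == positive
        predicted = row["predicted_label"] == positive
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        # Sensitivity: fraction of true positives that were predicted positive.
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        # Specificity: fraction of true negatives that were predicted negative.
        "specificity": tn / (tn + fp) if tn + fp else None,
    }


if __name__ == "__main__":
    # Hypothetical predictions in the loocv_predictions.tsv layout.
    predictions = (
        "subject_id\ttrue_label\tpredicted_label\tprobability\n"
        "S001\tresponder\tresponder\t0.82\n"
        "S002\tnon-responder\tnon-responder\t0.11\n"
        "S003\tresponder\tnon-responder\t0.40\n"
        "S004\tnon-responder\tnon-responder\t0.25\n"
    )
    print(confusion_metrics(predictions))
```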

Troubleshooting

T0 fails: missing response column
Add the response column to your cohort TSV. Values must be consistent (e.g., only 0/1, or only responder/non-responder).

T5 fails: R packages not found
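The conda worker must have the R packages vegan, ggplot2, and limma installed. Check the Dockerfile.conda.prd build, or run `conda install -c conda-forge -c bioconda r-vegan r-ggplot2 bioconductor-limma` on the worker. The conda-forge channel is included because r-vegan and r-ggplot2 are distributed there, while bioconductor-limma comes from bioconda.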

T6 gives AUC close to 0.5
An AUC near 0.5 means the classifier performs no better than chance. This usually indicates that the selected features don’t separate the response groups, or that the cohort is too small. Try lowering the FDR threshold or increasing min_cluster_size (the Min cluster size field in Step 3) to reduce noise features.