
Stratification Pipeline

The Stratification pipeline extends hierarchical clustering with response-aware feature selection and a LOOCV SVM classifier. It takes a cohort with clinical response labels and produces a trained classification model, ROC/PR curves, and feature importance outputs.

Web app slug: `stratification-pipeline`


Use this pipeline when your cohort TSV includes a response column (clinical outcome label, e.g., responder/non-responder) and you want to:

  • Select immune repertoire features that distinguish response groups
  • Build and evaluate a Leave-One-Out Cross-Validation (LOOCV) SVM classifier
  • Generate interpretable heatmaps and volcano plots

The stratification pipeline uses a 6-step stepper form in the web app.

Navigate to Dashboard → Stratification Pipeline → New Job.

Upload a cohort TSV. Required columns:

| Column | Description |
| --- | --- |
| `subject_id` | Unique subject identifier |
| `airr_file_path` | Path to per-subject AIRR files |
| `response` | Clinical response label (e.g., `responder`/`non-responder`, or `0`/`1`) |
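Before uploading, it can save a failed run to confirm the header has all three required columns. A minimal sketch in Python (`check_cohort` is an illustrative helper, not part of the pipeline; the column names match the table above):

```python
import csv
import io

REQUIRED = {"subject_id", "airr_file_path", "response"}

def check_cohort(tsv_text):
    # Parse only the header row and report which required columns are missing.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return REQUIRED - set(reader.fieldnames or [])
```

An empty result means the header is complete; a non-empty set names the missing columns (e.g., a cohort file without `response` fails task T0, as noted in Troubleshooting below).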

Core parameters are the same as the Clustering pipeline: `sid`, `cov`, `auto-tune`, `metric`.

| Field | Default | Description |
| --- | --- | --- |
| FDR threshold | 0.05 | Benjamini-Hochberg FDR cutoff for feature selection |
| Min cluster size | 2 | Minimum number of subjects a cluster must appear in to be considered |
| Log transform | true | Apply log1p transformation to frequency features |
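The Benjamini-Hochberg procedure ranks the per-feature p-values and keeps every feature up to the largest rank k with p(k) ≤ (k/m)·FDR. A plain-Python sketch of that selection rule (`bh_select` is an illustrative helper, not the pipeline's actual function):

```python
def bh_select(pvals, fdr=0.05):
    # Benjamini-Hochberg step-up: sort p-values ascending, find the largest
    # rank k with p_(k) <= (k / m) * fdr, and keep the k smallest p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * fdr:
            k_max = rank
    # Indices of the selected features (those with rank <= k_max).
    return {i for rank, i in enumerate(order, start=1) if rank <= k_max}
```

Note the step-up behavior: a p-value above its own threshold can still be selected if a larger p-value passes at a higher rank.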
| Field | Default | Description |
| --- | --- | --- |
| SVM kernel | rbf | SVM kernel (`rbf`, `linear`, `poly`) |
| Class weight | balanced | Handle class imbalance |
| CV folds | LOOCV | Leave-one-out cross-validation (fixed) |
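LOOCV trains on n−1 subjects and predicts the single held-out subject, repeating once per subject, so every subject gets an out-of-sample prediction. The sketch below illustrates only the LOOCV loop, using a simple nearest-centroid classifier as a stand-in for the pipeline's SVM (function and variable names are illustrative):

```python
def loocv_predict(X, y):
    # Leave-one-out loop: for each subject i, train on all other subjects
    # and predict subject i's label. Nearest-centroid stands in for the SVM.
    preds = []
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        # Per-class centroid of the training features.
        centroids = {}
        for lab in set(y):
            pts = [x for x, l in train if l == lab]
            centroids[lab] = [sum(col) / len(pts) for col in zip(*pts)]
        def sqdist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        preds.append(min(centroids, key=lambda lab: sqdist(X[i], centroids[lab])))
    return preds
```

Because each fold holds out exactly one subject, LOOCV needs no fold-count parameter, which is why the field is fixed in the form.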
| Field | Default | Description |
| --- | --- | --- |
| Generate heatmap | true | CLR-normalized heatmap of selected features |
| Generate volcano plot | true | Volcano plot of feature significance vs. effect size |
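CLR (centered log-ratio) normalization, used for the heatmap, replaces each frequency with its log relative to the geometric mean of the subject's row, so each row sums to zero. A sketch in plain Python (the pseudocount is an assumption here, added to guard against zero frequencies):

```python
import math

def clr(row, pseudo=1e-6):
    # Centered log-ratio: log of each value minus the mean log of the row
    # (equivalently, log of value / geometric mean). A small pseudocount
    # keeps log() defined for zero frequencies.
    logs = [math.log(x + pseudo) for x in row]
    mu = sum(logs) / len(logs)
    return [v - mu for v in logs]
```

CLR values are comparable across subjects regardless of total repertoire size, which is what makes the heatmap rows directly comparable.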

Review all parameters and submit.


The pipeline produces the following output files:

| File | Description |
| --- | --- |
| `feature_matrix.tsv` | Subject × cluster frequency matrix |
| `selected_features.tsv` | Features passing the FDR threshold, with test statistics |
| `clr_heatmap.html` | Interactive CLR heatmap |
| `clr_heatmap.png` | Static heatmap image |
| `volcano_plot.html` | Interactive volcano plot |
| `volcano_plot.png` | Static volcano plot |
| `loocv_predictions.tsv` | Per-subject predicted labels and probabilities |
| `classifier_metrics.json` | AUC, accuracy, sensitivity, specificity |
| `roc_curve.png` | ROC curve plot |
| `pr_curve.png` | Precision-Recall curve plot |
| `stratification_results.zip` | Archive of all outputs |

Example `classifier_metrics.json`:

```json
{
  "auc_roc": 0.84,
  "auc_pr": 0.79,
  "accuracy": 0.76,
  "sensitivity": 0.80,
  "specificity": 0.73,
  "n_features_selected": 12,
  "n_subjects": 45
}
```
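Accuracy, sensitivity, and specificity all derive from the LOOCV confusion counts: sensitivity is TP/(TP+FN) over true responders, specificity is TN/(TN+FP) over true non-responders. A sketch of that computation (`binary_metrics` is an illustrative helper, not the pipeline's function):

```python
def binary_metrics(true, pred, positive="responder"):
    # Confusion counts with "responder" treated as the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    tn = sum(t != positive and p != positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    return {
        "accuracy": (tp + tn) / len(true),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }
```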

loocv_predictions.tsv — one row per subject:

| subject_id | true_label | predicted_label | probability |
| --- | --- | --- | --- |
| S001 | responder | responder | 0.82 |
| S002 | non-responder | non-responder | 0.11 |
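The ROC AUC reported in `classifier_metrics.json` can be read directly off these two columns: it equals the probability that a randomly chosen responder receives a higher predicted probability than a randomly chosen non-responder (the rank/Mann-Whitney formulation). A sketch (illustrative function name):

```python
def auc_roc(labels, scores, positive="responder"):
    # Rank-based AUC: fraction of (positive, negative) pairs where the
    # positive subject scores higher; ties count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```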

**T0 fails: missing `response` column**
Add the `response` column to your cohort TSV. Values must be consistent (e.g., only `0`/`1`, or only `responder`/`non-responder`).

**T5 fails: R packages not found**
The conda worker must have `vegan`, `ggplot2`, and `limma` installed. Check the `Dockerfile.conda.prd` build or run `conda install -c bioconda r-vegan r-ggplot2 bioconductor-limma` on the worker.

**T6 gives AUC close to 0.5**
An AUC near 0.5 means the classifier performs no better than chance. This usually means the selected features don't separate the response groups, or the cohort is too small. Try lowering the FDR threshold or increasing `min_cluster_size` to reduce noise features.