Clustering Pipeline

The Clustering pipeline builds per-subject subgraph representations from AIRR data, then runs hierarchical clustering across all subjects. It produces cluster lists, hierarchical trees, and an optional HTML visualization report.

Web app slug: clustering-pipeline


Use this pipeline when you have a multi-subject cohort in AIRR format and want to:

  • Identify shared sequence clusters across subjects
  • Visualize repertoire overlap as a hierarchical tree
  • Prepare cluster features for downstream stratification

Navigate to Dashboard → Clustering Pipeline → New Job.

| Field | Description |
| --- | --- |
| Job name | A label for this run |
| Cohort TSV | A cohort sheet TSV with one row per subject. Required columns: subject_id, airr_file_path |
| AIRR files zip | Zip archive containing per-subject AIRR TSV files |

| Field | Default | Description |
| --- | --- | --- |
| Sequence identity (sid) | 0.85 | Minimum sequence identity threshold for subgraph construction |
| Coverage (cov) | 0.85 | Minimum coverage threshold |
| Auto-tune thresholds | false | Automatically optimize sid and cov based on the dataset |
| Clustering metric | blosum | Metric for hierarchical clustering: blosum or blosum_RPD |
| RPD method | | Only applicable when metric is blosum_RPD |
| Generate HTML report | true | Produce an interactive HTML cluster visualization |
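
Before uploading, it can help to check that the cohort sheet and the zip archive agree with each other. The following is a minimal sketch, not part of the pipeline itself: the function name `check_cohort` is hypothetical, and it assumes that each airr_file_path value is a path relative to the root of the zip archive, which this page does not state.

```python
import csv
import io
import zipfile

# Column names taken from the Cohort TSV field description.
REQUIRED_COLUMNS = {"subject_id", "airr_file_path"}


def check_cohort(cohort_tsv_text, zip_bytes):
    """Return a list of problems; an empty list means the inputs look consistent."""
    reader = csv.DictReader(io.StringIO(cohort_tsv_text), delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"cohort TSV is missing columns: {sorted(missing)}"]

    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        members = set(archive.namelist())

    problems = []
    seen_subjects = set()
    for line_number, row in enumerate(reader, start=2):  # the header is line 1
        subject = (row["subject_id"] or "").strip()
        path = (row["airr_file_path"] or "").strip()
        if not subject:
            problems.append(f"line {line_number}: empty subject_id")
        elif subject in seen_subjects:
            # The sheet is described as one row per subject.
            problems.append(f"line {line_number}: duplicate subject_id {subject!r}")
        seen_subjects.add(subject)
        # Assumption: airr_file_path is relative to the zip root.
        if path not in members:
            problems.append(f"line {line_number}: {path!r} not found in zip")
    return problems
```

Calling `check_cohort(open("cohort.tsv").read(), open("airr_files.zip", "rb").read())` on local copies of the two uploads (file names here are placeholders) returns a list of messages to fix before creating the job.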

| File | Description |
| --- | --- |
| cluster_list.tsv | All clusters with member sequences and subject IDs |
| hierarchical_tree.nwk | Newick-format hierarchical tree |
| subject_cluster_map.tsv | Subject-to-cluster membership table |
| cluster_report.html | Interactive visualization (if enabled) |
| clustering_results.zip | Archive of all outputs |
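
After downloading clustering_results.zip, a quick check of its contents can confirm that the run produced the files listed above. This is a sketch under assumptions: the helper names `summarize_results` and `looks_like_newick` are hypothetical, the internal folder layout of the archive is not documented here (so files are matched by base name), and the Newick check is a rough sanity test that ignores quoted labels.

```python
import io
import zipfile

# File names as listed in the outputs table; the HTML report is optional.
REQUIRED_OUTPUTS = ["cluster_list.tsv", "hierarchical_tree.nwk", "subject_cluster_map.tsv"]
OPTIONAL_OUTPUTS = ["cluster_report.html"]


def summarize_results(zip_source):
    """Map each expected output name to True or False depending on whether it is in the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: compare base names, in case files sit inside a folder.
        present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return {name: name in present for name in REQUIRED_OUTPUTS + OPTIONAL_OUTPUTS}


def looks_like_newick(text):
    """Rough check: the string ends with ';' and its parentheses are balanced."""
    stripped = text.strip()
    depth = 0
    for ch in stripped:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return stripped.endswith(";") and depth == 0
```

`summarize_results` accepts either a file path or an open binary file object, since both are valid inputs to `zipfile.ZipFile`.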

Stage 1 slow for large cohorts
Subgraph construction is O(n²) in the number of sequences per subject, so runtime grows quadratically with repertoire size. For subjects with more than 100,000 sequences, consider pre-filtering to CDR3-only sequences to reduce the computational load.
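
One possible reading of this pre-filtering step is to collapse rows that share the same CDR3 so that each distinct CDR3 is compared only once. The sketch below illustrates that reading and is an assumption, not the pipeline's documented behavior. The function names are hypothetical, and the default column `junction_aa` is the AIRR Rearrangement field that includes the conserved flanking residues; pass `cdr3_aa` instead if your files carry that field.

```python
import csv
import io


def dedupe_by_cdr3(airr_tsv_text, cdr3_column="junction_aa"):
    """Keep the first row for each distinct non-empty CDR3 value.

    Returns (kept_rows, fieldnames).
    """
    reader = csv.DictReader(io.StringIO(airr_tsv_text), delimiter="\t")
    if cdr3_column not in (reader.fieldnames or []):
        raise KeyError(f"column {cdr3_column!r} not found")
    seen, kept = set(), []
    for row in reader:
        cdr3 = (row[cdr3_column] or "").strip()
        # Skip rows without a CDR3 and rows whose CDR3 was already kept.
        if cdr3 and cdr3 not in seen:
            seen.add(cdr3)
            kept.append(row)
    return kept, reader.fieldnames


def pair_count(n):
    """Number of unordered sequence pairs, the quantity that grows as O(n^2)."""
    return n * (n - 1) // 2
```

For scale, `pair_count(100_000)` is 4,999,950,000 pairs, while `pair_count(50_000)` is 1,249,975,000, so halving the number of sequences cuts the pairwise work by roughly a factor of four.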

Empty cluster list
The sid and cov thresholds may be too strict. Lower them (e.g., to 0.75) or enable auto-tuning.