Clustering Pipeline
The Clustering pipeline builds per-subject subgraph representations from AIRR data, then runs hierarchical clustering across all subjects. It produces cluster lists, hierarchical trees, and an optional HTML visualization report.
Web app slug: clustering-pipeline
When to use
Use this pipeline when you have a multi-subject cohort in AIRR format and want to:
- Identify shared sequence clusters across subjects
- Visualize repertoire overlap as a hierarchical tree
- Prepare cluster features for downstream stratification
Submitting a job
Navigate to Dashboard → Clustering Pipeline → New Job.
Required inputs
| Field | Description |
|---|---|
| Job name | A label for this run |
| Cohort TSV | A cohort sheet TSV with one row per subject. Required columns: subject_id, airr_file_path |
| AIRR files zip | Zip archive containing per-subject AIRR TSV files |
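
Before uploading, it can help to check the cohort sheet locally. The following is a minimal sketch, not part of the pipeline: the function names `check_cohort_sheet` and `missing_from_zip` are hypothetical, and it assumes that `airr_file_path` values are matched to zip entries by base file name, which this page does not specify.

```python
import csv
import io
from pathlib import Path

# Required columns, as listed in the Required inputs table.
REQUIRED_COLUMNS = {"subject_id", "airr_file_path"}


def check_cohort_sheet(tsv_text: str) -> list[dict[str, str]]:
    """Parse a cohort sheet TSV and verify the required columns exist."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"cohort sheet is missing columns: {sorted(missing)}")
    rows = list(reader)
    # The sheet has one row per subject, so a repeated subject_id is an error.
    ids = [row["subject_id"] for row in rows]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate subject_id values: {duplicates}")
    return rows


def missing_from_zip(rows: list[dict[str, str]], zip_names: list[str]) -> list[str]:
    """Return airr_file_path values that have no entry in the zip listing.

    Assumption: paths are compared by base file name only.
    """
    available = {Path(name).name for name in zip_names}
    return [
        row["airr_file_path"]
        for row in rows
        if Path(row["airr_file_path"]).name not in available
    ]
```

The `zip_names` argument can come from `zipfile.ZipFile(path).namelist()`; an empty result from `missing_from_zip` means every subject row has a matching file in the archive.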
Optional inputs
| Field | Default | Description |
|---|---|---|
| Sequence identity (sid) | 0.85 | Minimum sequence identity threshold for subgraph construction |
| Coverage (cov) | 0.85 | Minimum coverage threshold |
| Auto-tune thresholds | false | Automatically optimize sid and cov based on the dataset |
| Clustering metric | blosum | Metric for hierarchical clustering: blosum or blosum_RPD |
| RPD method | — | Only applicable when metric is blosum_RPD |
| Generate HTML report | true | Produce an interactive HTML cluster visualization |
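
The defaults and constraints in the table above can be written down as a small settings object. This is an illustrative sketch only: the class `ClusteringOptions` and its field names are hypothetical and do not reflect the web app's internal representation. It also assumes that sid and cov are fractions in the range (0, 1], which the defaults suggest but this page does not state.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_METRICS = ("blosum", "blosum_RPD")


@dataclass
class ClusteringOptions:
    """Hypothetical container mirroring the optional inputs and their defaults."""

    sid: float = 0.85
    cov: float = 0.85
    auto_tune: bool = False
    metric: str = "blosum"
    rpd_method: Optional[str] = None
    html_report: bool = True

    def validate(self) -> None:
        for name in ("sid", "cov"):
            value = getattr(self, name)
            # Assumption: thresholds are fractions in (0, 1].
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be in (0, 1], got {value}")
        if self.metric not in ALLOWED_METRICS:
            raise ValueError(f"metric must be one of {ALLOWED_METRICS}")
        # The RPD method applies only when the metric is blosum_RPD.
        if self.rpd_method is not None and self.metric != "blosum_RPD":
            raise ValueError("rpd_method is only applicable with metric blosum_RPD")
```

Calling `ClusteringOptions().validate()` passes with the documented defaults, while setting an RPD method together with the `blosum` metric raises an error.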
Output files
| File | Description |
|---|---|
| cluster_list.tsv | All clusters with member sequences and subject IDs |
| hierarchical_tree.nwk | Newick-format hierarchical tree |
| subject_cluster_map.tsv | Subject-to-cluster membership table |
| cluster_report.html | Interactive visualization (if enabled) |
| clustering_results.zip | Archive of all outputs |
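
After downloading clustering_results.zip, you can confirm that it contains the files listed above. The helper below is a sketch using Python's standard `zipfile` module; the function name `missing_outputs` is hypothetical, and it assumes entries may sit inside a subfolder of the archive, so it compares base file names only.

```python
import zipfile

# File names taken from the Output files table.
EXPECTED = ["cluster_list.tsv", "hierarchical_tree.nwk", "subject_cluster_map.tsv"]
OPTIONAL_REPORT = "cluster_report.html"


def missing_outputs(archive, html_report: bool = True) -> list[str]:
    """Return the expected output files that are absent from the archive.

    `archive` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        # Assumption: ignore any folder prefix inside the archive.
        names = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    wanted = EXPECTED + ([OPTIONAL_REPORT] if html_report else [])
    return [name for name in wanted if name not in names]
```

Pass `html_report=False` if the HTML report was disabled for the job, so that its absence is not flagged.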
Troubleshooting
Stage 1 slow for large cohorts
Stage 1 (subgraph construction) is O(n²) per subject, where n is the number of sequences for that subject. For subjects with more than 100,000 sequences, consider pre-filtering to CDR3-only sequences to reduce the computational load.
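
To get a feel for the quadratic cost, and for one way to pre-filter, consider the sketch below. It is an interpretation rather than the pipeline's own method: it assumes the cost is driven by pairwise sequence comparisons, and it reads "CDR3-only" as keeping one row per distinct, non-empty value of the AIRR `cdr3_aa` column. The function names are hypothetical; if your files populate `junction_aa` instead, pass that column name.

```python
import csv
import io


def pairwise_comparisons(n: int) -> int:
    """Number of unordered sequence pairs among n sequences."""
    return n * (n - 1) // 2


def unique_cdr3_rows(tsv_text: str, column: str = "cdr3_aa") -> list[dict[str, str]]:
    """Keep the first row for each distinct, non-empty CDR3 value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen: set[str] = set()
    kept = []
    for row in reader:
        cdr3 = (row.get(column) or "").strip()
        if not cdr3 or cdr3 in seen:
            continue  # drop rows with no CDR3 and repeated CDR3s
        seen.add(cdr3)
        kept.append(row)
    return kept


# 100,000 sequences already imply about five billion pairs.
print(pairwise_comparisons(100_000))  # 4999950000
```

Because the pair count grows with the square of n, halving the number of sequences cuts the pair count to roughly a quarter.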
Empty cluster list
The sid and cov thresholds may be too strict. Lower them (e.g., to 0.75) or enable auto-tuning.