Clustering Pipeline

Groups immune receptor sequences into clusters across a whole cohort. It runs in two stages: first it pools every subject’s sequences and collapses near-identical ones into sub-graphs, then it builds a hierarchical tree over those sub-graphs to merge related ones into final clusters — so you can see which sequence groups are shared across people.

Web app slug: clustering-pipeline

When to use

Use this pipeline when you have a multi-subject cohort in AIRR format and want to find sequence clusters that recur across subjects.

How it works

Clustering happens at two levels of granularity, both over the pooled sequences of the entire cohort:

Stage 1  — tight grouping              Stage 2  — loose grouping
──────────────────────────────        ──────────────────────────────
ALL subjects' sequences pooled    →    distances between sub-graphs
into one similarity graph; near-       computed with the chosen metric,
identical sequences (sid + cov)        agglomerated into a tree, then
become connected "sub-graphs"          cut at height_cutoff → clusters
(≈ clonal families)                    (≈ shared specificity groups)

Validate & ingest — the cohort sheet and every referenced AIRR file are validated up front (before any heavy compute).
Stage 1 — Sub-graph (subG) clustering: all subjects’ paratope sequences are pooled into a single similarity graph. Two sequences are linked when they clear both the sequence-identity (sid) and coverage (cov) thresholds, computed position-by-position over the alignment. The connected components of that graph are the sub-graphs. Because the default thresholds are high (sid 0.9 / cov 0.95), a sub-graph is essentially a clonal family. The Sub-graph Method selects how sub-graphs are formed: gsc (connected components of the similarity graph, the default), clonotype (exact-clonotype grouping, no similarity graph), or louvain (community detection).
Stage 2 — Hierarchical clustering of sub-graphs: a distance is computed between sub-graphs using the chosen metric, the sub-graphs are agglomerated with the chosen linkage method, and the resulting dendrogram is cut at height_cutoff to produce the final flat clusters. Because everything was pooled in Stage 1, a single cluster can contain sequences contributed by many subjects; that is what makes a cluster “public.”
Stage 2b — Map clusters back: cluster membership (stored as global sequence indices) is joined back to subjects and sequences, yielding the key cohort-level table of which subjects populate which cluster.
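
As a rough illustration of Stage 1 and the map-back step, the sketch below pools a few sequences, links pairs that clear both thresholds, and reads off the connected components with a union-find. It is a simplified stand-in, not the pipeline's code: the function names are hypothetical, and sid/cov are computed over a gap-free, left-aligned comparison rather than a real alignment.

```python
from itertools import combinations


def sid_cov(a, b):
    """Toy sid/cov for two ungapped sequences (assumption: left-aligned, no gaps)."""
    aligned = min(len(a), len(b))
    if aligned == 0:
        return 0.0, 0.0
    matches = sum(x == y for x, y in zip(a, b))
    sid = matches / aligned              # identical positions / aligned positions
    cov = aligned / max(len(a), len(b))  # aligned positions / longer sequence
    return sid, cov


def sub_graphs(seqs, sid_min=0.9, cov_min=0.95):
    """Connected components of the thresholded similarity graph."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        sid, cov = sid_cov(seqs[i], seqs[j])
        if sid >= sid_min and cov >= cov_min:
            parent[find(i)] = find(j)      # link the two components

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Pooled cohort: (subject, sequence) pairs, indexed globally.
pooled = [
    ("S1", "CASSLGQAYEQYF"),
    ("S2", "CASSLGQGYEQYF"),   # one mismatch against index 0
    ("S2", "CASSPDRGNTEAFF"),  # different length, fails the coverage cutoff
    ("S3", "CASSLGQAYEQYF"),
]
subjects = [s for s, _ in pooled]
groups = sub_graphs([seq for _, seq in pooled])

# Map global sequence indices back to the subjects that contributed them.
shared = [sorted({subjects[i] for i in g}) for g in groups]
print(groups)  # [[0, 1, 3], [2]]
print(shared)  # [['S1', 'S2', 'S3'], ['S2']]
```

The first sub-graph draws on three subjects, which is the kind of group the subject-to-cluster table is meant to surface.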

Choosing a Stage 2 metric

The distance metric controls what “similar” means when building the cohort tree:

Identity-based — sid, sid_cov, hamming, hamming_norm: fast, count identical/mismatched aligned positions (the _norm variants divide by alignment length).
Substitution-aware — blosum_ave, blosum_norm, blosum_ave_norm, blosum_dist, blosum_dist_centrality, blosum_RPD: score amino-acid similarity with BLOSUM62 so conservative substitutions cost less than disruptive ones — better for grouping functionally related sequences rather than just literal-identical ones.
TCR-specialized — tcrdist: a position-weighted, capped BLOSUM62-based distance (the TCRdist metric).
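
To make the difference between the identity-based and substitution-aware families concrete, here is a small sketch. The Hamming functions follow the description above; the substitution-aware function uses a tiny table of made-up scores purely for illustration. It is not the real BLOSUM62 matrix and not the formula behind any of the blosum_* metrics.

```python
def hamming(a, b):
    """Count mismatched positions (assumes pre-aligned, equal-length input)."""
    if not a or len(a) != len(b):
        raise ValueError("sketch expects non-empty, equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def hamming_norm(a, b):
    """Hamming distance divided by alignment length."""
    return hamming(a, b) / len(a)


# Illustrative scores only (NOT BLOSUM62): conservative swaps score higher
# than disruptive ones, and every identical pair scores TOY_MATCH.
TOY_MATCH = 4
TOY_DEFAULT = -1
TOY_PAIRS = {frozenset("IL"): 2, frozenset("DE"): 2, frozenset("LD"): -4}


def toy_substitution_distance(a, b):
    """Scaled shortfall from a perfect self-match; 0.0 means identical."""
    if not a or len(a) != len(b):
        raise ValueError("sketch expects non-empty, equal-length sequences")
    pair = sum(
        TOY_MATCH if x == y else TOY_PAIRS.get(frozenset((x, y)), TOY_DEFAULT)
        for x, y in zip(a, b)
    )
    best = TOY_MATCH * len(a)
    return (best - pair) / best


conservative = ("CASSL", "CASSI")  # L -> I
disruptive = ("CASSL", "CASSD")    # L -> D

print(hamming_norm(*conservative), hamming_norm(*disruptive))
print(toy_substitution_distance(*conservative), toy_substitution_distance(*disruptive))
```

Both pairs differ at one position, so the identity-based distance cannot tell them apart, while the substitution-aware one places the conservative swap closer.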

blosum_RPD is a reciprocal BLOSUM62 distance and additionally requires the RPD_method parameter (an integer 1–4, selecting among alternative distance-decay formulas); setting RPD_method with any other metric is rejected at validation time.
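
The pairing rule can be checked locally before submitting. The function below is a hypothetical pre-flight check that mirrors the rule as stated here; it is not the pipeline's validator, and its name and error messages are made up.

```python
def validate_rpd_method(metric, rpd_method=None):
    """Raise ValueError if metric and RPD_method are combined incorrectly."""
    if metric == "blosum_RPD":
        # Must be a plain integer from 1 to 4 (bool is rejected on purpose).
        if type(rpd_method) is not int or not 1 <= rpd_method <= 4:
            raise ValueError("blosum_RPD requires RPD_method in 1..4")
    elif rpd_method is not None:
        raise ValueError(f"RPD_method must be unset for metric {metric!r}")


validate_rpd_method("blosum_RPD", 2)  # accepted
validate_rpd_method("sid_cov")        # accepted, RPD_method left unset
```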

Optional side branches

These run alongside the main path and upload their own outputs — if one fails it surfaces as a DAG failure (so you’re alerted) but it does not block the main results.zip:

Threshold diagnostics (run_threshold_diagnostics) — plots the pairwise sid/cov distributions from Stage 1 so you can pick sensible thresholds.
HTML report (generate_html_report) — a self-contained interactive report (cluster network, CLR/norm-transformed abundance charts, optional Wilcoxon filtering between two arms). Uploaded separately as report.html, not included in results.zip.

Submitting a job

Navigate to Dashboard → Clustering Pipeline → New Job.

Required inputs

Field	Description
Job name	A label for this run
Input bundle (zip)	A zip containing the cohort sheet plus one AIRR TSV per subject

The cohort sheet (default filename cohort.tsv, set via Cohort Sheet Filename) needs at minimum a subject column and an airr_file column; any extra metadata columns pass through. If no cohort sheet is found in the bundle, a minimal one is auto-generated from the discovered *.tsv files (one subject per file). Each referenced AIRR TSV must contain sequence_id, v_call, j_call, and sequence_aa (or the alternative set cell_id, v_gene, j_gene, sequence_aa).
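
If you want to catch bundle problems before uploading, a local check along these lines may help. It is a sketch of the rules described on this page, not the pipeline's own validator: the function names are hypothetical, and naming auto-generated subjects after the file stem is an assumption.

```python
import csv
from pathlib import Path

# Either column set is accepted for an AIRR TSV.
AIRR_COLUMN_SETS = (
    {"sequence_id", "v_call", "j_call", "sequence_aa"},
    {"cell_id", "v_gene", "j_gene", "sequence_aa"},
)


def read_header(path):
    """Return the column names on the first line of a TSV file."""
    with open(path, newline="") as fh:
        return set(next(csv.reader(fh, delimiter="\t"), []))


def check_bundle(bundle_dir, sheet_name="cohort.tsv"):
    """Validate an unzipped bundle directory and return the cohort rows."""
    bundle = Path(bundle_dir)
    sheet = bundle / sheet_name
    if sheet.is_file():
        with open(sheet, newline="") as fh:
            reader = csv.DictReader(fh, delimiter="\t")
            missing = {"subject", "airr_file"} - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"cohort sheet lacks columns: {sorted(missing)}")
            rows = list(reader)  # extra metadata columns pass through
    else:
        # No sheet: one subject per discovered TSV (stem as name is an assumption).
        rows = [{"subject": p.stem, "airr_file": p.name}
                for p in sorted(bundle.glob("*.tsv"))]

    subjects = [row["subject"] for row in rows]
    if any(not s for s in subjects) or len(set(subjects)) != len(subjects):
        raise ValueError("subjects must be unique and non-empty")

    for row in rows:
        airr = bundle / (row["airr_file"] or "")
        if not airr.is_file():
            raise ValueError(f"unreadable AIRR file: {airr}")
        header = read_header(airr)
        if not any(cols <= header for cols in AIRR_COLUMN_SETS):
            raise ValueError(f"{airr.name}: required AIRR columns missing")
    return rows
```

Calling `check_bundle("path/to/unzipped_bundle")` returns the cohort rows on success and raises `ValueError` naming the first problem otherwise.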

Optional inputs

Stage 1 — sub-graph clustering

Field	Default	Description
Receptor chain	`TRB`	One of `IGH`, `IGL`, `IGK`, `TRB`, `TRA`
Species	`human`	`human`, `mouse`, or `none` (affects germline V/J calls)
Sequence identity (`sid`)	`0.9`	Identity cutoff for pairwise similarity, `0 < sid ≤ 1`. Ignored when auto-tune is on
Coverage (`cov`)	`0.95`	Alignment coverage cutoff, `0 < cov ≤ 1`. Ignored when auto-tune is on
Clustering method	`gsc`	Sub-graph algorithm: `gsc`, `louvain`, or `clonotype`
Minimum clones	`1`	Skip AIRR files with fewer clones than this
Keep singletons	`false`	Retain sequences that fall in no sub-graph
Auto-tune `sid`/`cov`	`false`	Detect thresholds from the data (valley search over a random pair sample) instead of using the fixed values
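
For intuition about what a valley search could look like, the sketch below histograms a sample of pairwise identities, takes the two tallest local peaks, and returns the centre of the emptiest bin between them. This is one plausible reading of "valley search", offered as an assumption; the pipeline's actual auto-tune procedure is not documented here and may differ. The function name and the sample values are made up.

```python
def valley_threshold(values, bins=10):
    """Return the centre of the lowest bin between the two tallest peaks,
    or None when the histogram has fewer than two peaks."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1  # values assumed in [0, 1]

    # A peak is a non-empty bin not lower than its left neighbour and
    # strictly higher than its right neighbour.
    peaks = [
        i for i in range(bins)
        if counts[i] > 0
        and (i == 0 or counts[i] >= counts[i - 1])
        and (i == bins - 1 or counts[i] > counts[i + 1])
    ]
    if len(peaks) < 2:
        return None

    lo, hi = sorted(sorted(peaks, key=lambda i: counts[i], reverse=True)[:2])
    valley = min(range(lo + 1, hi), key=lambda i: counts[i])
    return (valley + 0.5) / bins


# Made-up identity sample: unrelated pairs near 0.35, related pairs near 0.95.
sample = (
    [0.25] * 3 + [0.35] * 6 + [0.45] * 4 + [0.55] * 2
    + [0.65] * 1 + [0.75] * 2 + [0.85] * 5 + [0.95] * 7
)
print(valley_threshold(sample))
```

On a bimodal sample like this one, the returned value falls in the sparse region separating unrelated pairs from near-identical ones, which is where a data-driven cutoff would sit.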

Stage 2 — hierarchical clustering

Field	Default	Description
Linkage method	`average`	`single`, `complete`, `average`, `weighted`, `centroid`, `median`, or `ward`
Distance metric	`sid_cov`	See Choosing a Stage 2 metric for the full list
Height cutoff	`0.1`	Tree-cut height (`> 0`); units depend on the chosen metric
RPD method	—	Required only when metric is `blosum_RPD` (integer 1–4); must be unset otherwise
`fastcluster` backend	`true`	Faster linkage computation
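
To see how Linkage method and Height cutoff interact, here is a deliberately naive pure-Python sketch of average linkage that stops merging once the next merge would exceed the cutoff, which for average linkage gives the same flat clusters as cutting the finished tree at that height. It is illustrative only: the function name and the distance matrix are made up, and the pipeline itself does not necessarily compute linkage this way.

```python
def cut_average_linkage(dist, height_cutoff):
    """Flat clusters from average-linkage merging, stopped at height_cutoff.

    dist is a symmetric matrix (list of lists) of sub-graph distances.
    """
    clusters = [[i] for i in range(len(dist))]

    def average(a, b):
        # Mean of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        height, x, y = min(
            (average(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if height > height_cutoff:
            break  # the next merge lies above the cut
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Four sub-graphs: 0 and 1 are close, 2 and 3 are close, the pairs are far apart.
d = [
    [0.00, 0.05, 0.40, 0.50],
    [0.05, 0.00, 0.45, 0.50],
    [0.40, 0.45, 0.00, 0.08],
    [0.50, 0.50, 0.08, 0.00],
]
print(cut_average_linkage(d, 0.01))  # [[0], [1], [2], [3]]
print(cut_average_linkage(d, 0.10))  # [[0, 1], [2, 3]]
print(cut_average_linkage(d, 0.50))  # [[0, 1, 2, 3]]
```

A cutoff below every merge leaves each sub-graph on its own, and one above every merge collapses everything into a single cluster, so the useful range depends on the scale of the chosen metric.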

Reports & lifecycle

Field	Default	Description
Run threshold diagnostics	`false`	Plot `sid`/`cov` distributions (side branch)
Generate HTML report	`false`	Emit the interactive `report.html` (side branch, supports two-arm contrasts)
Cleanup temporary files	`true`	Remove the run’s temp directory after upload

Output files

Packaged into results.zip under the run’s output prefix:

File	Description
`cluster_list.tsv`	All clusters with their member sequences
`lngs_trees.tsv`	The hierarchical tree(s) produced by Stage 2
`subject_cluster_map.tsv`	Subject → cluster membership table
`config.json`	Stage-1 sentinel with the resolved run config (including the real `sid`/`cov` when auto-tuned)
`info_hcluster.txt`	Run metadata (chain, species, resolved parameters, subject counts)
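
Once results.zip is unpacked, the subject-to-cluster table can be summarised with a few lines of standard-library Python. The column names `subject` and `cluster_id` below are assumptions made for the example; check the header of your own `subject_cluster_map.tsv` and adjust them. The inline sample data is made up.

```python
import csv
import io
from collections import defaultdict


def public_clusters(handle, min_subjects=2,
                    subject_col="subject", cluster_col="cluster_id"):
    """Return {cluster: sorted subjects} for clusters seen in >= min_subjects."""
    subjects_by_cluster = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        subjects_by_cluster[row[cluster_col]].add(row[subject_col])
    return {
        cluster: sorted(members)
        for cluster, members in subjects_by_cluster.items()
        if len(members) >= min_subjects
    }


# Stand-in for open("subject_cluster_map.tsv", newline="").
example = io.StringIO(
    "subject\tcluster_id\n"
    "S1\tc1\n"
    "S2\tc1\n"
    "S2\tc2\n"
    "S3\tc1\n"
)
print(public_clusters(example))  # {'c1': ['S1', 'S2', 'S3']}
```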

Uploaded separately to the same run prefix (not inside results.zip):

File	Description
`report.html`	Interactive cluster report — only when Generate HTML report is enabled
`threshold_plots/`	`sid`/`cov` diagnostic plots — only when Run threshold diagnostics is enabled

Troubleshooting

Empty or tiny cluster list: Stage 1 sid/cov may be too strict, so few sequences connect into sub-graphs. Lower them (e.g. sid toward 0.8) or enable Auto-tune. If Stage 2 produces a tree but no flat clusters, your height_cutoff may be cutting above every merge; lower it.

metrics='blosum_RPD' but the run fails validation: blosum_RPD requires RPD_method set to one of 1–4. Conversely, leaving RPD_method set while choosing any other metric is also rejected; clear it.

Cohort sheet rejected: the sheet must have subject and airr_file columns, unique non-empty subjects, and every airr_file must resolve on disk with the required AIRR columns present. The validation error names the exact missing column or unreadable file.

Stage 1 slow for large cohorts: pairwise similarity scales steeply with the number of sequences per subject. Use Minimum clones to drop tiny repertoires, set CPU cores to 0 (all cores), or pre-filter your AIRR files before upload.

The HTML report / diagnostics failed but I still got results.zip: that’s by design. Those side branches upload independently and only fail the DAG to alert you; they never block the main clustering output.