Data Formats

AIRR format

The AIRR Community TSV format is the primary interchange format across MyImmune pipelines.

Required columns

Column	Type	Description
`sequence_id`	string	Unique identifier for this sequence
`sequence`	string	Full nucleotide sequence
`productive`	bool	Whether the sequence is productive (in-frame, no stop codon)
`v_call`	string	V gene assignment (e.g., `TRBV7-2`)
`d_call`	string	D gene assignment
`j_call`	string	J gene assignment
`junction`	string	Junction nucleotide sequence
`junction_aa`	string	Junction amino acid sequence (CDR3)
`duplicate_count`	int	Clone count / UMI count
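
As an illustration of the required-column list above, the following is a minimal sketch of a header check for an AIRR TSV file. The function name `missing_airr_columns` is hypothetical and is not part of any MyImmune pipeline; the sketch only inspects the header row and does not validate column types.

```python
import csv
import io

# Required columns, copied from the table above.
REQUIRED_AIRR_COLUMNS = [
    "sequence_id", "sequence", "productive", "v_call", "d_call",
    "j_call", "junction", "junction_aa", "duplicate_count",
]


def missing_airr_columns(tsv_text):
    """Return the required columns that are absent from the TSV header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return [col for col in REQUIRED_AIRR_COLUMNS if col not in header]


# A deliberately incomplete file: only four of the nine columns are present.
example = "sequence_id\tsequence\tproductive\tv_call\ns1\tACGT\tT\tTRBV7-2\n"
print(missing_airr_columns(example))
```

An empty list means every required column is present.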

Common optional columns

Column	Description
`c_call`	C gene assignment
`cdr3`	CDR3 nucleotide sequence
`pseudo_sequence`	Pseudo-sequence encoding (added by alignment pipeline)
`subject_id`	Subject identifier (added during cohort processing)

Adaptive Biotech Pipeline

Raw export from the Adaptive Biotech immunoSEQ Analyzer; the v2 export format is supported. The Adaptive pipeline converts this export to AIRR format.

Required columns

These columns must be present; the pipeline will reject the file if any are missing.

Adaptive column	Maps to AIRR	Notes
`nucleotide`	`sequence`	Full nucleotide sequence; `rev_comp` is always set to `F`
`aminoAcid`	`junction_aa`	CDR3 amino acid sequence; must start with `C` and end with `F`/`W` (chain-dependent)
`count (templates/reads)`	`duplicate_count`	Clone abundance
`vMaxResolved`	`v_call`	Gene names are converted from Adaptive to IMGT nomenclature
`dMaxResolved`	`d_call`	Empty for TRA chains
`jMaxResolved`	`j_call`	Gene names are converted from Adaptive to IMGT nomenclature

Additional columns read by the pipeline

These columns are consumed during processing but are not validated at file-ingestion time.

Adaptive column	Purpose
`sequenceStatus`	Mapped to `productive`; value `In` → `T`, anything else → `F`
`vGeneNameTies`	Resolves ambiguous `v_call` assignments
`dGeneNameTies`	Resolves ambiguous `d_call` assignments
`jGeneNameTies`	Resolves ambiguous `j_call` assignments
`vGeneAlleleTies`	Resolves ambiguous V allele calls
`dGeneAlleleTies`	Resolves ambiguous D allele calls
`jGeneAlleleTies`	Resolves ambiguous J allele calls
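
The column mapping in the two tables above can be sketched as follows. This is an illustrative approximation, not the pipeline's own code: the names `ADAPTIVE_TO_AIRR`, `check_required`, and `convert_row` are hypothetical, and the Adaptive-to-IMGT gene-name conversion and tie resolution are deliberately left out because their rules are not described here.

```python
# Required Adaptive columns and the AIRR columns they map to (from the table above).
ADAPTIVE_TO_AIRR = {
    "nucleotide": "sequence",
    "aminoAcid": "junction_aa",
    "count (templates/reads)": "duplicate_count",
    "vMaxResolved": "v_call",
    "dMaxResolved": "d_call",
    "jMaxResolved": "j_call",
}


def check_required(header):
    """Raise ValueError if any required Adaptive column is missing."""
    missing = [col for col in ADAPTIVE_TO_AIRR if col not in header]
    if missing:
        raise ValueError(f"missing required Adaptive columns: {missing}")


def convert_row(row):
    """Rename mapped columns and derive rev_comp and productive.

    Gene names are copied through unchanged; nomenclature conversion
    is outside the scope of this sketch.
    """
    out = {airr: row[adaptive] for adaptive, airr in ADAPTIVE_TO_AIRR.items()}
    out["rev_comp"] = "F"  # always F, per the required-columns table
    # sequenceStatus "In" -> T, anything else (including absent) -> F
    out["productive"] = "T" if row.get("sequenceStatus") == "In" else "F"
    return out


example_row = {
    "nucleotide": "ACGT",
    "aminoAcid": "CASSF",
    "count (templates/reads)": "3",
    "vMaxResolved": "TCRBV07-02",
    "dMaxResolved": "",
    "jMaxResolved": "TCRBJ01-01",
    "sequenceStatus": "In",
}
check_required(example_row.keys())
print(convert_row(example_row))
```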

Derived output columns

AIRR column	How it is produced
`sequence_id`	`{sample_name}|{zero-padded line number}`
`junction`	Nucleotide subsequence matching `junction_aa`, filled after conversion
`sequence_aa`	Assembled from V segment (IMGT positions 1–104) + `junction_aa` + J segment from conserved motif onward; rows where assembly fails are dropped
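
For illustration, a derived `sequence_id` could be built as shown below. The padding width is not stated in this document, so the default of 6 digits is an assumption, as is the function name `make_sequence_id`.

```python
def make_sequence_id(sample_name, line_number, width=6):
    """Build '{sample_name}|{zero-padded line number}'.

    width=6 is an assumed padding width, chosen only for this example.
    """
    return f"{sample_name}|{line_number:0{width}d}"


print(make_sequence_id("S001", 42))
```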

Cohort TSV for Clustering and Stratification

Used by the Clustering and Stratification pipelines to describe a multi-subject cohort.

Required columns

Column	Description
`subject_id`	Unique subject identifier
`airr_file_path`	Path to the per-subject AIRR files

Additional column for Stratification

Column	Description
`response`	Clinical response label. Values must be binary, with the same two labels used throughout the file (e.g., all `0`/`1` or all `responder`/`non-responder`)

Example

subject_id  airr_file_path  response
S001  s3://myimmune/S001/output/airr.zip  responder
S002  s3://myimmune/S002/output/airr.zip  non-responder
S003  s3://myimmune/S003/output/airr.zip  responder
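
A cohort file can be checked before upload with a sketch along these lines. The function name `validate_cohort` is hypothetical, and two interpretations are assumed rather than confirmed: that "unique" means no repeated `subject_id`, and that "binary" means exactly two distinct `response` labels across the file.

```python
import csv
import io


def validate_cohort(tsv_text, require_response=False):
    """Return a list of problems found in a cohort TSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    fields = reader.fieldnames or []
    errors = []

    for col in ("subject_id", "airr_file_path"):
        if col not in fields:
            errors.append(f"missing column: {col}")

    if "subject_id" in fields:
        ids = [row["subject_id"] for row in rows]
        if len(ids) != len(set(ids)):
            errors.append("duplicate subject_id values")

    if require_response:  # only needed for the Stratification pipeline
        if "response" not in fields:
            errors.append("missing column: response")
        else:
            labels = {row["response"] for row in rows}
            if len(labels) != 2:  # assumed meaning of "binary"
                errors.append(f"response must have two labels, found {len(labels)}")
    return errors


cohort = (
    "subject_id\tairr_file_path\tresponse\n"
    "S001\ts3://myimmune/S001/output/airr.zip\tresponder\n"
    "S002\ts3://myimmune/S002/output/airr.zip\tnon-responder\n"
)
print(validate_cohort(cohort, require_response=True))
```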

FASTQ Pipeline

Standard FASTQ format. Accepted as .fastq or .fastq.gz.

Paired-end reads: upload both R1 and R2 files
Single-end reads: upload one file

The output file follows the AIRR Community TSV standard with MyImmune-specific additions.

Core

Column	Description
`sequence_id`	Unique identifier for this sequence
`locus`	Immune receptor locus (`TRB`, `IGH`, `IGL`, etc.)
`sequence`	Full nucleotide input sequence
`sequence_aa`	Full amino acid translation
`rev_comp`	Whether the sequence was reverse complemented
`productive`	Whether the rearrangement is in-frame with no stop codons
`complete_vdj`	Whether V, D, and J genes are all assigned
`duplicate_count`	Clone / UMI count
`consensus_count`	Number of reads contributing to the consensus

V(D)J gene assignments

Column	Description
`v_call`	V gene assignment (e.g., `TRBV7-2*01`)
`d_call`	D gene assignment
`j_call`	J gene assignment
`c_call`	C gene assignment
`v_score` / `v_cigar`	V alignment score and CIGAR string
`d_score` / `d_cigar`	D alignment score and CIGAR string
`j_score` / `j_cigar`	J alignment score and CIGAR string
`c_score` / `c_cigar`	C alignment score and CIGAR string

Alignments

Column	Description
`sequence_alignment`	V(D)J nucleotide sequence aligned to germline
`germline_alignment`	Germline reference aligned to the query

Junction

Column	Description
`junction`	Junction nucleotide sequence (CDR3 + conserved anchors)
`junction_aa`	Junction amino acid sequence
`junction_length`	Junction length in nucleotides
`np1` / `np1_length`	N/P nucleotides between V and D, and their length
`np2` / `np2_length`	N/P nucleotides between D and J, and their length

CDR and framework regions

Column	Description
`cdr1` / `cdr1_aa` / `cdr1_len`	CDR1 nucleotide, amino acid, and length
`cdr2` / `cdr2_aa` / `cdr2_len`	CDR2 nucleotide, amino acid, and length
`cdr3` / `cdr3_aa` / `cdr3_len`	CDR3 nucleotide, amino acid, and length
`fwr1` / `fwr1_aa`	FR1 nucleotide and amino acid
`fwr2` / `fwr2_aa`	FR2 nucleotide and amino acid
`fwr3` / `fwr3_aa`	FR3 nucleotide and amino acid
`fwr4` / `fwr4_aa`	FR4 nucleotide and amino acid

Position coordinates (one set per gene segment: v_, d_, j_, c_)

Column pattern	Description
`_germline_start` / `_germline_end`	Start/end position in the germline reference
`_sequence_start` / `_sequence_end`	Start/end position in the input sequence
`_alignment_start` / `_alignment_end`	Start/end position in the alignment
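
The column patterns above expand to one start/end pair per segment prefix. A small sketch of that expansion, assuming the prefix is joined directly to the pattern (for example `v` + `_germline_start`):

```python
SEGMENTS = ("v", "d", "j", "c")
KINDS = ("germline", "sequence", "alignment")


def coordinate_columns():
    """List every position-coordinate column name, grouped by segment."""
    return [
        f"{segment}_{kind}_{edge}"
        for segment in SEGMENTS
        for kind in KINDS
        for edge in ("start", "end")
    ]


for name in coordinate_columns()[:6]:
    print(name)  # the six v_ columns
```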

Pseudo-sequence

Column	Description
`pseudo_seq`	Paratope sequence consisting of CDR1, CDR2, and CDR3
`aligned_pseudo_seq`	Paratope sequence with gaps, aligned to IMGT numbering

Sample metadata

Column	Description
`srr_id`	SRA run identifier (populated by SRA Data pipeline; empty for direct FASTQ uploads)
`patient_id`	Patient identifier
`patient_label`	Patient group / response label

Clustering Output

Output of the Clustering pipeline. Two files are produced.

`cluster_list.tsv` — one row per cluster

Column	Description
`cluster_id`	Cluster identifier
`subG_id`	Sub-group ID within the hierarchical clustering
`size`	Total number of sequences in this cluster
`subjects`	Number of distinct subjects contributing sequences
`diversity`	Within-cluster diversity score
`ave_blosum`	Mean pairwise BLOSUM62 score across cluster members
`ave_hamming`	Mean pairwise Hamming distance across cluster members
`condition:<label>.counts`	Number of sequences with this condition label (one column per condition value)
`condition:<label>.subjects`	Number of subjects with this condition label
`condition:<label>.subjects_ratio`	Fraction of subjects with this condition label
`Rep`	`global_idx` of the representative sequence for this cluster
`Mems`	Comma-separated `global_idx` of all member sequences

`subject_cluster_map.tsv` — one row per sequence

Column	Description
`cluster_id`	Cluster this sequence belongs to
`subG_id`	Sub-group ID within the hierarchical clustering
`global_idx`	Global sequence index across the full cohort
`sequence_id`	Per-subject sequence identifier
`v_call`	V gene assignment
`j_call`	J gene assignment
`sequence_aa`	Full amino acid sequence
`db_id`	Database-internal sequence ID
`subject`	Subject identifier
`subject_idx`	Numeric subject index
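
The two files are linked through `global_idx`: `Rep` and `Mems` in `cluster_list.tsv` refer to rows of `subject_cluster_map.tsv`. The sketch below shows one way to resolve that link; the helper names are hypothetical, and `global_idx` values are kept as strings because their exact format is not specified here.

```python
import csv
import io


def members_by_cluster(cluster_list_tsv):
    """Map each cluster_id to the list of global_idx values in its Mems column."""
    reader = csv.DictReader(io.StringIO(cluster_list_tsv), delimiter="\t")
    return {
        row["cluster_id"]: [m.strip() for m in row["Mems"].split(",") if m.strip()]
        for row in reader
    }


def index_sequences(subject_cluster_map_tsv):
    """Index subject_cluster_map.tsv rows by global_idx."""
    reader = csv.DictReader(io.StringIO(subject_cluster_map_tsv), delimiter="\t")
    return {row["global_idx"]: row for row in reader}


# Tiny made-up inputs containing only the columns this sketch needs.
cluster_list = "cluster_id\tRep\tMems\nc1\t10\t10,11\n"
cluster_map = (
    "cluster_id\tglobal_idx\tsequence_id\tsubject\n"
    "c1\t10\tS001|000001\tS001\n"
    "c1\t11\tS002|000004\tS002\n"
)

members = members_by_cluster(cluster_list)
sequences = index_sequences(cluster_map)
subjects_in_c1 = sorted({sequences[idx]["subject"] for idx in members["c1"]})
print(subjects_in_c1)
```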

Bulk Epitope Prediction

Input CSV for the Bulk Epitope Prediction pipeline.

Must contain a column with the full amino acid sequence of the heavy chain (e.g., `sequence_aa_heavy`).
Can optionally contain a column with the full amino acid sequence of the light chain (e.g., `sequence_aa_light`).

Bulk Sequence Search

Input CSV for the Bulk Sequence Search pipeline.

Must contain a column with the full amino acid sequence (e.g., `sequence_aa`).

TCR to Amino Acid Sequence

Assembles full germline amino acid sequences from V/J gene calls and CDR3 sequences. Input is a ZIP of TSV files; output is the same TSV with assembled sequence columns appended.

Input columns

Column names are configurable via DAG parameters (`cdr3_column`, `v_column`, `j_column`). The defaults are shown below.

Column	Default name	Required	Description
CDR3 sequence	`cdr3`	Yes	CDR3 amino acid sequence
V gene	`v_call`	Yes	V gene name (IMGT format; `TCR` prefix is normalised to `TR` automatically)
J gene	`j_call`	No	J gene name; pipeline proceeds with a warning if absent

Output columns

All original columns are retained (append mode). The following columns are added:

Column	Description
`vname`	Normalised V gene name after dictionary lookup
`jname`	Normalised J gene name after dictionary lookup
`cdr3`	CDR3 sequence (trimmed of conserved anchors if `trim_cdr3` is enabled)
`v_fwr1_aa`	FWR1 amino acids (from V germline)
`v_cdr1_aa`	CDR1 amino acids (from V germline)
`v_fwr2_aa`	FWR2 amino acids (from V germline)
`v_cdr2_aa`	CDR2 amino acids (from V germline)
`v_fwr3_aa`	FWR3 amino acids (from V germline)
`j_fwr4_aa`	FWR4 amino acids (from J germline)
`sequence_aa`	Full assembled amino acid sequence: FWR1 + CDR1 + FWR2 + CDR2 + FWR3 + CDR3 + FWR4

Row filtering

Rows are dropped if any of `cdr3`, `v_cdr1_aa`, or `v_cdr2_aa` are empty after lookup. The output file must contain at least one row with a non-empty `sequence_aa`.
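
The assembly order and the row filter described above can be sketched as follows. This is a simplified illustration rather than the pipeline's implementation: the germline dictionary lookup is not shown, the function names are hypothetical, and the example region strings are placeholders rather than real germline sequences.

```python
# Concatenation order for sequence_aa, from the output-column table.
REGION_ORDER = [
    "v_fwr1_aa", "v_cdr1_aa", "v_fwr2_aa", "v_cdr2_aa",
    "v_fwr3_aa", "cdr3", "j_fwr4_aa",
]
# A row is dropped if any of these is empty after lookup.
FILTER_COLUMNS = ("cdr3", "v_cdr1_aa", "v_cdr2_aa")


def normalise_gene(name):
    """Rewrite a leading 'TCR' prefix to 'TR'; other names pass through."""
    return "TR" + name[3:] if name.startswith("TCR") else name


def assemble(row):
    """Return the assembled sequence_aa, or None if the row would be dropped."""
    if any(not row.get(col) for col in FILTER_COLUMNS):
        return None
    return "".join(row.get(col) or "" for col in REGION_ORDER)


row = {
    "v_fwr1_aa": "AAA", "v_cdr1_aa": "BB", "v_fwr2_aa": "CCC",
    "v_cdr2_aa": "DD", "v_fwr3_aa": "EEE", "cdr3": "CASSF", "j_fwr4_aa": "GGG",
}
print(normalise_gene("TCRBV7-2"), assemble(row))
```

A missing `j_fwr4_aa` does not cause the row to be dropped in this sketch, which matches the table above where the J gene is optional.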

Pseudo-Sequence Alignment

Aligns full amino acid sequences using ANARCI under the IMGT numbering scheme and extracts CDR1, CDR2, and CDR3 positions into a fixed-length pseudo-sequence string. Input is a ZIP of TSV files; the output column is appended to each row.

Input columns

Column	Required	Description
`sequence_aa`	Yes (configurable)	Full amino acid sequence to align; column name set via `input_column_name` parameter
`sequence_id`	No	Unique row identifier; auto-assigned as string row index if absent

Output column

Column	Description
`aligned_pseudo_seq`	Fixed-length IMGT-aligned pseudo-sequence concatenated from CDR1, CDR2, and CDR3 positions; column name set via `output_column_name` parameter

The pseudo-sequence is built from 71 IMGT positions across three CDR regions:

Region	Positions
CDR1	27–32, insertions 32A–32J, insertions 33J–33A, 33–38
CDR2	56–60, insertions 60A–60J, insertions 61J–61A, 61–65
CDR3	105–111, insertions 111A–111J, insertions 112J–112A, 112–117

Positions with no alignment hit are filled with `-`. Sequences that ANARCI cannot align at all produce an all-dash string and are logged as failures but are not dropped.

Rows with an empty `sequence_aa` are skipped during ANARCI alignment and merged back into the output with a null value in the output column.

If the output column already exists in the input file it is removed before re-alignment runs.
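
When reading the output, the three cases described above (aligned, failed, skipped) can be told apart from the value in the output column. A minimal sketch, assuming a null value appears as `None` or an empty string once the TSV is loaded; the function name `alignment_status` is hypothetical.

```python
def alignment_status(value):
    """Classify one value from the pseudo-sequence output column."""
    if value is None or value == "":
        return "skipped"   # input sequence_aa was empty
    if set(value) == {"-"}:
        return "failed"    # ANARCI could not align the sequence
    return "aligned"


for value in ("SGH--TA", "-------", "", None):
    print(repr(value), alignment_status(value))
```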
