Data Formats
AIRR format
The AIRR Community TSV format is the primary interchange format across MyImmune pipelines.
Required columns
| Column | Type | Description |
|---|---|---|
sequence_id | string | Unique identifier for this sequence |
sequence | string | Full nucleotide sequence |
productive | bool | Whether the sequence is productive (in-frame, no stop codon) |
v_call | string | V gene assignment (e.g., TRBV7-2) |
d_call | string | D gene assignment |
j_call | string | J gene assignment |
junction | string | Junction nucleotide sequence |
junction_aa | string | Junction amino acid sequence (CDR3) |
duplicate_count | int | Clone count / UMI count |
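A quick pre-flight check for the required columns might look like the sketch below. It uses only the Python standard library; the function name `missing_airr_columns` is hypothetical and is not part of any MyImmune API.

```python
import csv
import io

# Required columns, copied from the table above.
REQUIRED_AIRR_COLUMNS = [
    "sequence_id", "sequence", "productive", "v_call", "d_call",
    "j_call", "junction", "junction_aa", "duplicate_count",
]


def missing_airr_columns(tsv_text: str) -> list[str]:
    """Return the required columns that are absent from the TSV header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return [col for col in REQUIRED_AIRR_COLUMNS if col not in header]


# Toy input with only four of the nine required columns.
example = "sequence_id\tsequence\tv_call\tj_call\ns1\tACGT\tTRBV7-2\tTRBJ2-1\n"
print(missing_airr_columns(example))
```

An empty list means every required column is present in the header; the check does not validate the values themselves.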
Common optional columns
| Column | Description |
|---|---|
c_call | C gene assignment |
cdr3 | CDR3 nucleotide sequence |
pseudo_sequence | Pseudo-sequence encoding (added by alignment pipeline) |
subject_id | Subject identifier (added during cohort processing) |
Adaptive Biotech Pipeline
Raw export from Adaptive Biotech immunoSEQ Analyzer. Supports v2 format. The Adaptive pipeline converts this to AIRR format.
Required columns
These columns must be present; the pipeline will reject the file if any are missing.
| Adaptive column | Maps to AIRR | Notes |
|---|---|---|
nucleotide | sequence | Full nucleotide sequence; rev_comp is always set to F |
aminoAcid | junction_aa | CDR3 amino acid sequence; must start with C and end with F/W (chain-dependent) |
count (templates/reads) | duplicate_count | Clone abundance |
vMaxResolved | v_call | Gene names are converted from Adaptive to IMGT nomenclature |
dMaxResolved | d_call | Empty for TRA chains |
jMaxResolved | j_call | Gene names are converted from Adaptive to IMGT nomenclature |
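The column mapping in the table can be sketched as a plain rename step. This is an illustration only, not the pipeline's actual code: the function names are hypothetical, the example row is made up, and the Adaptive-to-IMGT gene name conversion is deliberately left out.

```python
# Adaptive column -> AIRR column, copied from the table above.
ADAPTIVE_TO_AIRR = {
    "nucleotide": "sequence",
    "aminoAcid": "junction_aa",
    "count (templates/reads)": "duplicate_count",
    "vMaxResolved": "v_call",
    "dMaxResolved": "d_call",
    "jMaxResolved": "j_call",
}


def convert_row(row: dict[str, str]) -> dict[str, str]:
    """Rename the required Adaptive columns; fail if any are missing."""
    missing = [col for col in ADAPTIVE_TO_AIRR if col not in row]
    if missing:
        raise ValueError(f"missing required Adaptive columns: {missing}")
    airr = {target: row[source] for source, target in ADAPTIVE_TO_AIRR.items()}
    airr["rev_comp"] = "F"  # per the table, rev_comp is always set to F
    # Assumption: gene names are left as-is here; the real pipeline
    # converts them to IMGT nomenclature.
    return airr


def cdr3_has_expected_anchors(junction_aa: str) -> bool:
    """Check the rule from the table: starts with C, ends with F or W."""
    return junction_aa.startswith("C") and junction_aa.endswith(("F", "W"))


# Made-up example row for illustration.
example_row = {
    "nucleotide": "ACGT",
    "aminoAcid": "CASSLGQGYEQYF",
    "count (templates/reads)": "12",
    "vMaxResolved": "TCRBV07-02*01",
    "dMaxResolved": "",
    "jMaxResolved": "TCRBJ02-07*01",
}
converted = convert_row(example_row)
print(converted["junction_aa"], cdr3_has_expected_anchors(converted["junction_aa"]))
```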
Additional columns read by the pipeline
These columns are consumed during processing but are not validated at file-ingestion time.
| Adaptive column | Purpose |
|---|---|
sequenceStatus | Mapped to productive; value In → T, anything else → F |
vGeneNameTies | Resolves ambiguous v_call assignments |
dGeneNameTies | Resolves ambiguous d_call assignments |
jGeneNameTies | Resolves ambiguous j_call assignments |
vGeneAlleleTies | Resolves ambiguous V allele calls |
dGeneAlleleTies | Resolves ambiguous D allele calls |
jGeneAlleleTies | Resolves ambiguous J allele calls |
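The `sequenceStatus` rule in the table reduces to a one-line mapping. The sketch below assumes an exact, case-sensitive match on `In`; whether the pipeline trims whitespace or ignores case is not stated here, and the function name is hypothetical.

```python
def productive_flag(sequence_status: str) -> str:
    """Map Adaptive sequenceStatus to the AIRR productive flag (T or F)."""
    # Assumption: only the exact string "In" counts as productive.
    return "T" if sequence_status == "In" else "F"


for status in ("In", "Out", "Stop", ""):
    print(repr(status), "->", productive_flag(status))
```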
Derived output columns
| AIRR column | How it is produced |
|---|---|
sequence_id | {sample_name}\|{zero-padded line number} |
junction | Nucleotide subsequence matching junction_aa, filled after conversion |
sequence_aa | Assembled from V segment (IMGT positions 1–104) + junction_aa + J segment from conserved motif onward; rows where assembly fails are dropped |
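The `sequence_id` pattern can be illustrated as follows. The padding width is not specified in this document, so the `width` parameter and its default of 6 are assumptions, as is the function name.

```python
def make_sequence_id(sample_name: str, line_number: int, width: int = 6) -> str:
    """Build '{sample_name}|{zero-padded line number}'.

    Assumption: the padding width (6 here) is illustrative only.
    """
    return f"{sample_name}|{line_number:0{width}d}"


print(make_sequence_id("S001", 42))
```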
Cohort TSV for Clustering and Stratification
Used by the Clustering and Stratification pipelines to describe a multi-subject cohort.
Required columns
| Column | Description |
|---|---|
subject_id | Unique subject identifier |
airr_file_path | Path to the per-subject AIRR files |
Additional column for Stratification
| Column | Description |
|---|---|
response | Clinical response label. Values must be binary across the cohort (e.g., all 0/1 or all responder/non-responder) |
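A cohort file can be checked against these rules before submission. The sketch below is a minimal, standard-library illustration: `validate_cohort` is a hypothetical helper, and it interprets "binary" as exactly two distinct labels across the cohort, which is an assumption.

```python
import csv
import io


def validate_cohort(tsv_text: str, require_response: bool = False) -> list[dict]:
    """Parse a cohort TSV and check required columns, unique IDs, binary labels."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    header = reader.fieldnames or []

    required = ["subject_id", "airr_file_path"]
    if require_response:
        required.append("response")
    missing = [col for col in required if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    ids = [row["subject_id"] for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("subject_id values must be unique")

    if require_response:
        labels = {row["response"] for row in rows}
        # Assumption: "binary" means exactly two distinct labels.
        if len(labels) != 2:
            raise ValueError(f"response must have two distinct values, got {sorted(labels)}")
    return rows


cohort = "\n".join([
    "\t".join(["subject_id", "airr_file_path", "response"]),
    "\t".join(["S001", "s3://myimmune/S001/output/airr.zip", "responder"]),
    "\t".join(["S002", "s3://myimmune/S002/output/airr.zip", "non-responder"]),
    "\t".join(["S003", "s3://myimmune/S003/output/airr.zip", "responder"]),
])
print(len(validate_cohort(cohort, require_response=True)))
```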
Example
subject_id	airr_file_path	response
S001	s3://myimmune/S001/output/airr.zip	responder
S002	s3://myimmune/S002/output/airr.zip	non-responder
S003	s3://myimmune/S003/output/airr.zip	responder
FASTQ Pipeline
Standard FASTQ format. Accepted as .fastq or .fastq.gz.
- Paired-end reads: upload both R1 and R2 files
- Single-end reads: upload one file
The output file follows the AIRR Community TSV standard with MyImmune-specific additions.
Core
| Column | Description |
|---|---|
sequence_id | Unique identifier for this sequence |
locus | Immune receptor locus (TRB, IGH, IGL, etc.) |
sequence | Full nucleotide input sequence |
sequence_aa | Full amino acid translation |
rev_comp | Whether the sequence was reverse complemented |
productive | Whether the rearrangement is in-frame with no stop codons |
complete_vdj | Whether V, D, and J genes are all assigned |
duplicate_count | Clone / UMI count |
consensus_count | Number of reads contributing to the consensus |
V(D)J gene assignments
| Column | Description |
|---|---|
v_call | V gene assignment (e.g., TRBV7-2*01) |
d_call | D gene assignment |
j_call | J gene assignment |
c_call | C gene assignment |
v_score / v_cigar | V alignment score and CIGAR string |
d_score / d_cigar | D alignment score and CIGAR string |
j_score / j_cigar | J alignment score and CIGAR string |
c_score / c_cigar | C alignment score and CIGAR string |
Alignments
| Column | Description |
|---|---|
sequence_alignment | V(D)J nucleotide sequence aligned to germline |
germline_alignment | Germline reference aligned to the query |
Junction
| Column | Description |
|---|---|
junction | Junction nucleotide sequence (CDR3 + conserved anchors) |
junction_aa | Junction amino acid sequence |
junction_length | Junction length in nucleotides |
np1 / np1_length | N/P nucleotides between V and D, and their length |
np2 / np2_length | N/P nucleotides between D and J, and their length |
CDR and framework regions
| Column | Description |
|---|---|
cdr1 / cdr1_aa / cdr1_len | CDR1 nucleotide, amino acid, and length |
cdr2 / cdr2_aa / cdr2_len | CDR2 nucleotide, amino acid, and length |
cdr3 / cdr3_aa / cdr3_len | CDR3 nucleotide, amino acid, and length |
fwr1 / fwr1_aa | FR1 nucleotide and amino acid |
fwr2 / fwr2_aa | FR2 nucleotide and amino acid |
fwr3 / fwr3_aa | FR3 nucleotide and amino acid |
fwr4 / fwr4_aa | FR4 nucleotide and amino acid |
Position coordinates (one set per gene segment: v_, d_, j_, c_)
| Column pattern | Description |
|---|---|
*_germline_start / *_germline_end | Start/end position in the germline reference |
*_sequence_start / *_sequence_end | Start/end position in the input sequence |
*_alignment_start / *_alignment_end | Start/end position in the alignment |
Pseudo-sequence
| Column | Description |
|---|---|
pseudo_seq | Paratope sequence consisting of CDR1, CDR2, and CDR3 |
aligned_pseudo_seq | IMGT-aligned paratope sequence, including gaps |
Sample metadata
| Column | Description |
|---|---|
srr_id | SRA run identifier (populated by SRA Data pipeline; empty for direct FASTQ uploads) |
patient_id | Patient identifier |
patient_label | Patient group / response label |
Clustering Output
Output of the Clustering pipeline. Two files are produced.
cluster_list.tsv — one row per cluster
| Column | Description |
|---|---|
cluster_id | Cluster identifier |
subG_id | Sub-group ID within the hierarchical clustering |
size | Total number of sequences in this cluster |
subjects | Number of distinct subjects contributing sequences |
diversity | Within-cluster diversity score |
ave_blosum | Mean pairwise BLOSUM62 score across cluster members |
ave_hamming | Mean pairwise Hamming distance across cluster members |
condition:<label>.counts | Number of sequences with this condition label (one column per condition value) |
condition:<label>.subjects | Number of subjects with this condition label |
condition:<label>.subjects_ratio | Fraction of subjects with this condition label |
Rep | global_idx of the representative sequence for this cluster |
Mems | Comma-separated global_idx of all member sequences |
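Downstream code usually needs the `Mems` field as a list rather than a string. The sketch below assumes `global_idx` values are integers, which this document does not state explicitly; the function name and the example value are made up.

```python
def parse_members(mems: str) -> list[int]:
    """Split the comma-separated Mems field into a list of global_idx values.

    Assumption: global_idx values are integers.
    """
    return [int(token) for token in mems.split(",") if token.strip()]


members = parse_members("12,57,103")
print(members, len(members))
```

These indices can then be matched against the `global_idx` column of subject_cluster_map.tsv to recover the per-sequence rows of a cluster.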
subject_cluster_map.tsv — one row per sequence
| Column | Description |
|---|---|
cluster_id | Cluster this sequence belongs to |
subG_id | Sub-group ID within the hierarchical clustering |
global_idx | Global sequence index across the full cohort |
sequence_id | Per-subject sequence identifier |
v_call | V gene assignment |
j_call | J gene assignment |
sequence_aa | Full amino acid sequence |
db_id | Database-internal sequence ID |
subject | Subject identifier |
subject_idx | Numeric subject index |
Bulk Epitope Prediction
Input CSV for the Bulk Epitope Prediction pipeline.
- Must contain a column with the full amino acid sequence of the heavy chain (e.g., sequence_aa_heavy)
- Can optionally have a column with the full amino acid sequence of the light chain (e.g., sequence_aa_light)
Bulk Sequence Search
Input CSV for Bulk Sequence Search.
- Must contain a column with the full amino acid sequence (e.g., sequence_aa)
TCR to Amino Acid Sequence
Assembles full germline amino acid sequences from V/J gene calls and CDR3 sequences. Input is a ZIP of TSV files; output is the same TSV with assembled sequence columns appended.
Input columns
Column names are configurable via DAG parameters (cdr3_column, v_column, j_column). The defaults are shown below.
| Column | Default name | Required | Description |
|---|---|---|---|
| CDR3 sequence | cdr3 | Yes | CDR3 amino acid sequence |
| V gene | v_call | Yes | V gene name (IMGT format; TCR prefix is normalised to TR automatically) |
| J gene | j_call | No | J gene name; pipeline proceeds with a warning if absent |
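The sketch below illustrates how the configurable column names and the TCR-to-TR prefix normalisation could fit together. It is an assumption-laden illustration: `read_inputs` and `normalise_tcr_prefix` are hypothetical names, and the real pipeline's dictionary lookup may do more than swap the prefix.

```python
import warnings

# Default column names, taken from the table above.
DEFAULT_COLUMNS = {"cdr3_column": "cdr3", "v_column": "v_call", "j_column": "j_call"}


def normalise_tcr_prefix(gene: str) -> str:
    """Rewrite a leading 'TCR' as 'TR' (e.g. TCRBV7-2 -> TRBV7-2)."""
    return "TR" + gene[3:] if gene.startswith("TCR") else gene


def read_inputs(row: dict[str, str], **params: str) -> dict:
    """Pull CDR3, V and J from a row using configurable column names."""
    cfg = {**DEFAULT_COLUMNS, **params}
    for key in ("cdr3_column", "v_column"):
        if cfg[key] not in row:
            raise KeyError(f"required column {cfg[key]!r} is missing")
    j_gene = row.get(cfg["j_column"])
    if j_gene is None:
        # J is optional: proceed, but warn.
        warnings.warn(f"column {cfg['j_column']!r} is absent; continuing without J gene")
    return {
        "cdr3": row[cfg["cdr3_column"]],
        "v_gene": normalise_tcr_prefix(row[cfg["v_column"]]),
        "j_gene": normalise_tcr_prefix(j_gene) if j_gene else None,
    }


# Made-up row using a non-default CDR3 column name.
row = {"cdr3_aa": "CASSLGQGYEQYF", "v_call": "TCRBV7-2", "j_call": "TRBJ2-7"}
print(read_inputs(row, cdr3_column="cdr3_aa"))
```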
Output columns
All original columns are retained (append mode). The following columns are added:
| Column | Description |
|---|---|
vname | Normalised V gene name after dictionary lookup |
jname | Normalised J gene name after dictionary lookup |
cdr3 | CDR3 sequence (trimmed of conserved anchors if trim_cdr3 is enabled) |
v_fwr1_aa | FWR1 amino acids (from V germline) |
v_cdr1_aa | CDR1 amino acids (from V germline) |
v_fwr2_aa | FWR2 amino acids (from V germline) |
v_cdr2_aa | CDR2 amino acids (from V germline) |
v_fwr3_aa | FWR3 amino acids (from V germline) |
j_fwr4_aa | FWR4 amino acids (from J germline) |
sequence_aa | Full assembled amino acid sequence: FWR1 + CDR1 + FWR2 + CDR2 + FWR3 + CDR3 + FWR4 |
Row filtering
Rows are dropped if any of cdr3, v_cdr1_aa, or v_cdr2_aa are empty after lookup. The output file must contain at least one row with a non-empty sequence_aa.
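The assembly order and the row filter described above can be sketched together. This is a simplified illustration rather than the pipeline's implementation: the function names are hypothetical, and the region strings in the example are short placeholders, not real germline sequences.

```python
# Concatenation order for sequence_aa, from the output column table.
REGION_ORDER = [
    "v_fwr1_aa", "v_cdr1_aa", "v_fwr2_aa", "v_cdr2_aa",
    "v_fwr3_aa", "cdr3", "j_fwr4_aa",
]
# Rows are dropped when any of these is empty after lookup.
MUST_BE_NON_EMPTY = ("cdr3", "v_cdr1_aa", "v_cdr2_aa")


def assemble_sequence_aa(row: dict) -> str:
    """Concatenate FWR1 + CDR1 + FWR2 + CDR2 + FWR3 + CDR3 + FWR4."""
    return "".join(row.get(col) or "" for col in REGION_ORDER)


def filter_and_assemble(rows: list[dict]) -> list[dict]:
    """Drop incomplete rows, append sequence_aa, and require one usable row."""
    kept = [
        {**row, "sequence_aa": assemble_sequence_aa(row)}
        for row in rows
        if all(row.get(col) for col in MUST_BE_NON_EMPTY)
    ]
    if not any(row["sequence_aa"] for row in kept):
        raise ValueError("output has no row with a non-empty sequence_aa")
    return kept


# Placeholder region strings, for illustration only.
rows = [
    {"v_fwr1_aa": "AAA", "v_cdr1_aa": "BB", "v_fwr2_aa": "CCC", "v_cdr2_aa": "DD",
     "v_fwr3_aa": "EEE", "cdr3": "CASSF", "j_fwr4_aa": "GGG"},
    {"v_fwr1_aa": "AAA", "v_cdr1_aa": "", "v_fwr2_aa": "CCC", "v_cdr2_aa": "DD",
     "v_fwr3_aa": "EEE", "cdr3": "CASSF", "j_fwr4_aa": "GGG"},  # dropped: empty CDR1
]
result = filter_and_assemble(rows)
print(len(result), result[0]["sequence_aa"])
```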
Pseudo-Sequence Alignment
Aligns full amino acid sequences using ANARCI under the IMGT numbering scheme and extracts CDR1, CDR2, and CDR3 positions into a fixed-length pseudo-sequence string. Input is a ZIP of TSV files; the output column is appended to each row.
Input columns
| Column | Required | Description |
|---|---|---|
sequence_aa | Yes (configurable) | Full amino acid sequence to align; column name set via input_column_name parameter |
sequence_id | No | Unique row identifier; auto-assigned as string row index if absent |
Output column
| Column | Description |
|---|---|
aligned_pseudo_seq | Fixed-length IMGT-aligned pseudo-sequence concatenated from CDR1, CDR2, and CDR3 positions; column name set via output_column_name parameter |
The pseudo-sequence is built from 71 IMGT positions across three CDR regions:
| Region | Positions |
|---|---|
| CDR1 | 27–32, insertions 32A–32J, insertions 33J–33A, 33–38 |
| CDR2 | 56–60, insertions 60A–60J, insertions 61J–61A, 61–65 |
| CDR3 | 105–111, insertions 111A–111J, insertions 112J–112A, 112–117 |
Positions with no alignment hit are filled with -. Sequences that ANARCI cannot align at all produce an all-dash string and are logged as failures but are not dropped.
Rows with an empty sequence_aa are skipped during ANARCI alignment and merged back into the output with a null value in the output column.
If the output column already exists in the input file it is removed before re-alignment runs.
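The gap-filling and failure rules can be summarised in a small sketch. It does not call ANARCI: the `align` argument is a hypothetical stand-in that returns a mapping from IMGT position label to residue, or `None` when alignment fails. The position labels used in the example are a toy subset, not the full position list, and representing them as strings such as "32A" is an assumption.

```python
from typing import Callable, Optional


def pseudo_seq_for_row(
    sequence_aa: str,
    positions: list[str],
    align: Callable[[str], Optional[dict[str, str]]],
) -> Optional[str]:
    """Build a fixed-length pseudo-sequence for one row."""
    if not sequence_aa:
        return None  # empty input: skipped, null in the output column
    hits = align(sequence_aa)
    if hits is None:
        return "-" * len(positions)  # alignment failure: all dashes, row kept
    # Positions without an alignment hit are filled with "-".
    return "".join(hits.get(pos, "-") for pos in positions)


# Toy subset of position labels and toy aligners, for illustration only.
toy_positions = ["27", "28", "32A", "105"]


def toy_align(seq: str) -> Optional[dict[str, str]]:
    return {"27": "S", "105": "A"}


def failing_align(seq: str) -> Optional[dict[str, str]]:
    return None


print(pseudo_seq_for_row("SEQ", toy_positions, toy_align))
print(pseudo_seq_for_row("SEQ", toy_positions, failing_align))
print(pseudo_seq_for_row("", toy_positions, toy_align))
```

Because every output string has the same length as the position list, pseudo-sequences from different rows can be compared column by column.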