Data Formats

The AIRR Community TSV format is the primary interchange format across MyImmune pipelines.

| Column | Type | Description |
| --- | --- | --- |
| sequence_id | string | Unique identifier for this sequence |
| sequence | string | Full nucleotide sequence |
| productive | bool | Whether the sequence is productive (in-frame, no stop codon) |
| v_call | string | V gene assignment (e.g., TRBV7-2) |
| d_call | string | D gene assignment |
| j_call | string | J gene assignment |
| junction | string | Junction nucleotide sequence |
| junction_aa | string | Junction amino acid sequence (CDR3) |
| duplicate_count | int | Clone count / UMI count |

| Column | Description |
| --- | --- |
| c_call | C gene assignment |
| cdr3 | CDR3 nucleotide sequence |
| pseudo_sequence | Pseudo-sequence encoding (added by alignment pipeline) |
| subject_id | Subject identifier (added during cohort processing) |

Raw export from the Adaptive Biotechnologies immunoSEQ Analyzer. The v2 export format is supported. The Adaptive pipeline converts this file to AIRR format.

These columns must be present; the pipeline will reject the file if any are missing.

| Adaptive column | Maps to AIRR | Notes |
| --- | --- | --- |
| nucleotide | sequence | Full nucleotide sequence; rev_comp is always set to F |
| aminoAcid | junction_aa | CDR3 amino acid sequence; must start with C and end with F/W (chain-dependent) |
| count (templates/reads) | duplicate_count | Clone abundance |
| vMaxResolved | v_call | Gene names are converted from Adaptive to IMGT nomenclature |
| dMaxResolved | d_call | Empty for TRA chains |
| jMaxResolved | j_call | Gene names are converted from Adaptive to IMGT nomenclature |
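
The required-column rule and the junction_aa constraint above can be sketched as a pre-flight check. This is a minimal illustration and not the pipeline's own validator: the function names are hypothetical, and it assumes the export is tab-separated with the column names exactly as listed in the table.

```python
# Required Adaptive columns, taken from the table above.
REQUIRED_ADAPTIVE_COLUMNS = [
    "nucleotide",
    "aminoAcid",
    "count (templates/reads)",
    "vMaxResolved",
    "dMaxResolved",
    "jMaxResolved",
]


def missing_required_columns(header_line: str) -> list:
    """Return the required columns absent from a tab-separated header line.

    Assumption: the export is tab-separated (hypothetical helper).
    """
    present = set(header_line.rstrip("\r\n").split("\t"))
    return [col for col in REQUIRED_ADAPTIVE_COLUMNS if col not in present]


def junction_aa_looks_valid(junction_aa: str) -> bool:
    """Check that a CDR3 starts with C and ends with F or W.

    Which terminal residue applies is chain-dependent; this sketch
    accepts either one rather than guessing the per-chain rule.
    """
    return (
        len(junction_aa) >= 2
        and junction_aa.startswith("C")
        and junction_aa[-1] in "FW"
    )
```

An empty list from missing_required_columns means the header passes the check; any returned names are the columns that would cause the file to be rejected.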

These columns are consumed during processing but are not validated at file-ingestion time.

| Adaptive column | Purpose |
| --- | --- |
| sequenceStatus | Mapped to productive: In → T, anything else → F |
| vGeneNameTies | Resolves ambiguous v_call assignments |
| dGeneNameTies | Resolves ambiguous d_call assignments |
| jGeneNameTies | Resolves ambiguous j_call assignments |
| vGeneAlleleTies | Resolves ambiguous V allele calls |
| dGeneAlleleTies | Resolves ambiguous D allele calls |
| jGeneAlleleTies | Resolves ambiguous J allele calls |

| AIRR column | How it is produced |
| --- | --- |
| sequence_id | {sample_name}\|{zero-padded line number} |
| junction | Nucleotide subsequence matching junction_aa, filled after conversion |
| sequence_aa | Assembled from V segment (IMGT positions 1–104) + junction_aa + J segment from conserved motif onward; rows where assembly fails are dropped |
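
Two of the conversions above are simple enough to sketch directly: the sequenceStatus to productive mapping and the sequence_id pattern. The function names are hypothetical, and the padding width of 6 is an assumption, since the documentation only says the line number is zero-padded.

```python
def productive_flag(sequence_status: str) -> str:
    """Map Adaptive sequenceStatus to the AIRR productive flag.

    Only the value "In" yields T; every other value yields F.
    """
    return "T" if sequence_status == "In" else "F"


def make_sequence_id(sample_name: str, line_number: int, width: int = 6) -> str:
    """Build a sequence_id of the form {sample_name}|{zero-padded line number}.

    Assumption: the padding width (default 6) is illustrative only.
    """
    return f"{sample_name}|{line_number:0{width}d}"
```

With these defaults, make_sequence_id("S001", 42) gives S001|000042.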

Cohort TSV for Clustering and Stratification

Used by the Clustering and Stratification pipelines to describe a multi-subject cohort.

| Column | Description |
| --- | --- |
| subject_id | Unique subject identifier |
| airr_file_path | Path to the per-subject AIRR files |

| Column | Description |
| --- | --- |
| response | Clinical response label. Values must be binary, i.e. exactly two distinct labels used consistently (e.g., all 0/1 or all responder/non-responder) |

Example (columns are tab-separated in the actual file):

subject_id airr_file_path response
S001 s3://myimmune/S001/output/airr.zip responder
S002 s3://myimmune/S002/output/airr.zip non-responder
S003 s3://myimmune/S003/output/airr.zip responder
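
The constraints on the cohort table (unique subject_id, a path per subject, binary response labels) can be checked before submitting a run. The following is a sketch under stated assumptions rather than the pipeline's own validation: validate_cohort_tsv is a hypothetical helper, and it treats response as a column that is only checked when present.

```python
import csv
import io


def validate_cohort_tsv(text: str) -> list:
    """Return a list of problems; an empty list means the table looks valid."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    fields = reader.fieldnames or []

    problems = [
        f"missing column: {col}"
        for col in ("subject_id", "airr_file_path")
        if col not in fields
    ]
    if problems:
        return problems

    rows = list(reader)

    # subject_id must be unique across the cohort.
    ids = [row["subject_id"] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("subject_id values are not unique")

    # Assumption: response is checked only when the column is present.
    if "response" in fields:
        labels = {row["response"] for row in rows}
        if len(labels) > 2:
            problems.append(f"response is not binary: {sorted(labels)}")

    return problems
```

Returning a list of messages rather than raising on the first failure lets all problems in a cohort file be reported at once.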

Standard FASTQ format. Accepted as .fastq or .fastq.gz.

  • Paired-end reads: upload both R1 and R2 files
  • Single-end reads: upload one file

The output file follows the AIRR Community TSV standard with MyImmune-specific additions.

Core

| Column | Description |
| --- | --- |
| sequence_id | Unique identifier for this sequence |
| locus | Immune receptor locus (TRB, IGH, IGL, etc.) |
| sequence | Full nucleotide input sequence |
| sequence_aa | Full amino acid translation |
| rev_comp | Whether the sequence was reverse complemented |
| productive | Whether the rearrangement is in-frame with no stop codons |
| complete_vdj | Whether V, D, and J genes are all assigned |
| duplicate_count | Clone / UMI count |
| consensus_count | Number of reads contributing to the consensus |

V(D)J gene assignments

| Column | Description |
| --- | --- |
| v_call | V gene assignment (e.g., TRBV7-2*01) |
| d_call | D gene assignment |
| j_call | J gene assignment |
| c_call | C gene assignment |
| v_score / v_cigar | V alignment score and CIGAR string |
| d_score / d_cigar | D alignment score and CIGAR string |
| j_score / j_cigar | J alignment score and CIGAR string |
| c_score / c_cigar | C alignment score and CIGAR string |

Alignments

| Column | Description |
| --- | --- |
| sequence_alignment | V(D)J nucleotide sequence aligned to germline |
| germline_alignment | Germline reference aligned to the query |

Junction

| Column | Description |
| --- | --- |
| junction | Junction nucleotide sequence (CDR3 + conserved anchors) |
| junction_aa | Junction amino acid sequence |
| junction_length | Junction length in nucleotides |
| np1 / np1_length | N/P nucleotides between V and D, and their length |
| np2 / np2_length | N/P nucleotides between D and J, and their length |

CDR and framework regions

| Column | Description |
| --- | --- |
| cdr1 / cdr1_aa / cdr1_len | CDR1 nucleotide, amino acid, and length |
| cdr2 / cdr2_aa / cdr2_len | CDR2 nucleotide, amino acid, and length |
| cdr3 / cdr3_aa / cdr3_len | CDR3 nucleotide, amino acid, and length |
| fwr1 / fwr1_aa | FR1 nucleotide and amino acid |
| fwr2 / fwr2_aa | FR2 nucleotide and amino acid |
| fwr3 / fwr3_aa | FR3 nucleotide and amino acid |
| fwr4 / fwr4_aa | FR4 nucleotide and amino acid |

Position coordinates (one set per gene segment: v_, d_, j_, c_)

| Column pattern | Description |
| --- | --- |
| *_germline_start / *_germline_end | Start/end position in the germline reference |
| *_sequence_start / *_sequence_end | Start/end position in the input sequence |
| *_alignment_start / *_alignment_end | Start/end position in the alignment |
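
Because the pattern applies once per gene segment, it expands to 24 concrete column names (4 segments × 3 coordinate systems × start/end). A short sketch of that expansion, useful for selecting these columns programmatically; coordinate_columns is a hypothetical helper name.

```python
from itertools import product

SEGMENTS = ("v", "d", "j", "c")                   # one set per gene segment
KINDS = ("germline", "sequence", "alignment")     # coordinate systems
BOUNDS = ("start", "end")


def coordinate_columns() -> list:
    """Expand the *_<kind>_<bound> pattern into concrete column names."""
    return [
        f"{seg}_{kind}_{bound}"
        for seg, kind, bound in product(SEGMENTS, KINDS, BOUNDS)
    ]
```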

Pseudo-sequence

| Column | Description |
| --- | --- |
| pseudo_seq | Paratope sequence consisting of CDR1, CDR2, and CDR3 |
| aligned_pseudo_seq | Paratope sequence with gaps, aligned to IMGT numbering |

Sample metadata

| Column | Description |
| --- | --- |
| srr_id | SRA run identifier (populated by SRA Data pipeline; empty for direct FASTQ uploads) |
| patient_id | Patient identifier |
| patient_label | Patient group / response label |

Output of the Clustering pipeline. Two files are produced.

| Column | Description |
| --- | --- |
| cluster_id | Cluster identifier |
| subG_id | Sub-group ID within the hierarchical clustering |
| size | Total number of sequences in this cluster |
| subjects | Number of distinct subjects contributing sequences |
| diversity | Within-cluster diversity score |
| ave_blosum | Mean pairwise BLOSUM62 score across cluster members |
| ave_hamming | Mean pairwise Hamming distance across cluster members |
| condition:<label>.counts | Number of sequences with this condition label (one column per condition value) |
| condition:<label>.subjects | Number of subjects with this condition label |
| condition:<label>.subjects_ratio | Fraction of subjects with this condition label |
| Rep | global_idx of the representative sequence for this cluster |
| Mems | Comma-separated global_idx of all member sequences |
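
When reading the cluster summary, the Mems field has to be split before it can be joined against per-sequence data. The sketch below parses it and runs two sanity checks. It rests on assumptions the documentation does not state outright: that global_idx values are integers, that Rep is itself one of the members, and that size equals the number of entries in Mems. Both function names are hypothetical.

```python
def parse_members(mems: str) -> list:
    """Split the comma-separated Mems field into integer global_idx values.

    Assumption: global_idx values are integers.
    """
    return [int(tok) for tok in mems.split(",") if tok.strip()]


def check_cluster_row(row: dict) -> list:
    """Return consistency problems for one cluster summary row."""
    problems = []
    members = parse_members(row["Mems"])
    # Assumption: the representative sequence is also listed as a member.
    if int(row["Rep"]) not in members:
        problems.append("Rep is not listed in Mems")
    # Assumption: size counts exactly the entries listed in Mems.
    if len(members) != int(row["size"]):
        problems.append("size does not match the number of entries in Mems")
    return problems
```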

subject_cluster_map.tsv — one row per sequence

| Column | Description |
| --- | --- |
| cluster_id | Cluster this sequence belongs to |
| subG_id | Sub-group ID within the hierarchical clustering |
| global_idx | Global sequence index across the full cohort |
| sequence_id | Per-subject sequence identifier |
| v_call | V gene assignment |
| j_call | J gene assignment |
| sequence_aa | Full amino acid sequence |
| db_id | Database-internal sequence ID |
| subject | Subject identifier |
| subject_idx | Numeric subject index |
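
Since this file has one row per sequence, per-cluster counts can be recomputed from it. The sketch below groups rows by cluster_id and derives the number of sequences and distinct subjects. It assumes rows are already parsed into dictionaries keyed by the column names above; summarise_clusters is a hypothetical helper, and whether its counts match the size and subjects columns of the summary file exactly is not confirmed by the documentation.

```python
from collections import defaultdict


def summarise_clusters(rows) -> dict:
    """Count sequences and distinct subjects per cluster.

    rows: iterable of dicts with at least 'cluster_id' and 'subject' keys.
    """
    sizes = defaultdict(int)
    subjects = defaultdict(set)
    for row in rows:
        cid = row["cluster_id"]
        sizes[cid] += 1                      # one row per sequence
        subjects[cid].add(row["subject"])    # distinct subjects only
    return {
        cid: {"size": sizes[cid], "subjects": len(subjects[cid])}
        for cid in sizes
    }
```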

Input CSV for Bulk Epitope prediction pipeline

  • Must contain a column with the full amino acid sequence of the heavy chain (e.g., sequence_aa_heavy)
  • May optionally contain a column with the full amino acid sequence of the light chain (e.g., sequence_aa_light)

Input CSV for Bulk Sequence Search

  • Must contain a column with the full amino acid sequence (e.g., sequence_aa)

Assembles full germline amino acid sequences from V/J gene calls and CDR3 sequences. Input is a ZIP of TSV files; output is the same TSV with assembled sequence columns appended.

Column names are configurable via DAG parameters (cdr3_column, v_column, j_column). The defaults are shown below.

| Column | Default name | Required | Description |
| --- | --- | --- | --- |
| CDR3 sequence | cdr3 | Yes | CDR3 amino acid sequence |
| V gene | v_call | Yes | V gene name (IMGT format; TCR prefix is normalised to TR automatically) |
| J gene | j_call | No | J gene name; pipeline proceeds with a warning if absent |
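
The TCR-to-TR prefix normalisation mentioned for the V gene column amounts to a prefix rewrite. A minimal sketch, assuming the match is case-sensitive and applies only at the start of the name; normalise_tcr_prefix is a hypothetical name, and the pipeline's subsequent dictionary lookup is not reproduced here.

```python
def normalise_tcr_prefix(gene: str) -> str:
    """Rewrite a leading 'TCR' prefix to 'TR'; leave other names unchanged.

    Assumption: the match is case-sensitive and anchored at the start.
    """
    if gene.startswith("TCR"):
        return "TR" + gene[3:]
    return gene
```

For example, normalise_tcr_prefix("TCRBV7-2") returns TRBV7-2, while a name already in IMGT form passes through untouched.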

All original columns are retained (append mode). The following columns are added:

| Column | Description |
| --- | --- |
| vname | Normalised V gene name after dictionary lookup |
| jname | Normalised J gene name after dictionary lookup |
| cdr3 | CDR3 sequence (trimmed of conserved anchors if trim_cdr3 is enabled) |
| v_fwr1_aa | FWR1 amino acids (from V germline) |
| v_cdr1_aa | CDR1 amino acids (from V germline) |
| v_fwr2_aa | FWR2 amino acids (from V germline) |
| v_cdr2_aa | CDR2 amino acids (from V germline) |
| v_fwr3_aa | FWR3 amino acids (from V germline) |
| j_fwr4_aa | FWR4 amino acids (from J germline) |
| sequence_aa | Full assembled amino acid sequence: FWR1 + CDR1 + FWR2 + CDR2 + FWR3 + CDR3 + FWR4 |

Rows are dropped if any of cdr3, v_cdr1_aa, or v_cdr2_aa are empty after lookup. The output file must contain at least one row with a non-empty sequence_aa.
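
The assembly order and the drop rule can be expressed in a few lines. This sketch operates on a single row represented as a dictionary with the output column names listed above. assemble_sequence_aa is a hypothetical helper, and treating a missing j_fwr4_aa as an empty string (so that assembly still proceeds without a J gene) is an assumption.

```python
from typing import Optional

# Concatenation order: FWR1 + CDR1 + FWR2 + CDR2 + FWR3 + CDR3 + FWR4.
REGION_ORDER = [
    "v_fwr1_aa", "v_cdr1_aa", "v_fwr2_aa", "v_cdr2_aa",
    "v_fwr3_aa", "cdr3", "j_fwr4_aa",
]
# A row is dropped if any of these is empty after lookup.
DROP_IF_EMPTY = ("cdr3", "v_cdr1_aa", "v_cdr2_aa")


def assemble_sequence_aa(row: dict) -> Optional[str]:
    """Return the assembled sequence, or None if the row should be dropped."""
    if any(not row.get(col) for col in DROP_IF_EMPTY):
        return None
    # Assumption: other missing regions (e.g. FWR4 without a J gene)
    # contribute an empty string instead of causing a drop.
    return "".join(row.get(col) or "" for col in REGION_ORDER)
```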

Aligns full amino acid sequences using ANARCI under the IMGT numbering scheme and extracts CDR1, CDR2, and CDR3 positions into a fixed-length pseudo-sequence string. Input is a ZIP of TSV files; the output column is appended to each row.

| Column | Required | Description |
| --- | --- | --- |
| sequence_aa | Yes (configurable) | Full amino acid sequence to align; column name set via input_column_name parameter |
| sequence_id | No | Unique row identifier; auto-assigned as string row index if absent |

| Column | Description |
| --- | --- |
| aligned_pseudo_seq | Fixed-length IMGT-aligned pseudo-sequence concatenated from CDR1, CDR2, and CDR3 positions; column name set via output_column_name parameter |

The pseudo-sequence is built from 71 IMGT positions across three CDR regions:

| Region | Positions |
| --- | --- |
| CDR1 | 27–32, insertions 32A–32J, insertions 33J–33A, 33–38 |
| CDR2 | 56–60, insertions 60A–60J, insertions 61J–61A, 61–65 |
| CDR3 | 105–111, insertions 111A–111J, insertions 112J–112A, 112–117 |

Positions with no alignment hit are filled with -. Sequences that ANARCI cannot align at all produce an all-dash string and are logged as failures but are not dropped.
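
The gap-filling behaviour can be illustrated without running ANARCI. The sketch below assumes the alignment has already been reduced to a mapping from IMGT position label to residue; that representation, the function name, and the three-position list in the usage note are all hypothetical, and the short list stands in for the full set of CDR positions in the table above.

```python
def build_pseudo_seq(residues: dict, positions: list) -> str:
    """Concatenate residues in position order, using '-' where none aligned.

    residues:  mapping of IMGT position label (e.g. "27", "111A") to residue.
    positions: ordered position labels that define the fixed-length string.
    """
    return "".join(residues.get(pos, "-") for pos in positions)


def is_alignment_failure(pseudo_seq: str) -> bool:
    """An all-dash string marks a sequence that could not be aligned at all."""
    return bool(pseudo_seq) and set(pseudo_seq) == {"-"}
```

With positions ["27", "28", "29"] and residues {"27": "S", "29": "H"}, the result is S-H. An empty mapping gives --- and is flagged by is_alignment_failure, matching the rule that such rows are logged but kept.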

Rows with an empty sequence_aa are skipped during ANARCI alignment and merged back into the output with a null value in the output column.

If the output column already exists in the input file it is removed before re-alignment runs.
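
The row-handling rules in this section (skip empty inputs, fill their output with null, replace a pre-existing output column, auto-assign sequence_id) can be combined in one sketch. It works on plain dictionaries and takes the aligner as a callable, so nothing here depends on ANARCI. align_rows is a hypothetical helper, and using a zero-based row index for the auto-assigned sequence_id is an assumption.

```python
def align_rows(rows, align,
               input_col="sequence_aa",
               output_col="aligned_pseudo_seq"):
    """Apply `align` to each row's input sequence and append the result.

    rows:  list of dicts, one per TSV row.
    align: callable taking a sequence string and returning a pseudo-sequence.
    """
    out = []
    for index, row in enumerate(rows):
        # Remove any stale output column before re-alignment.
        new_row = {k: v for k, v in row.items() if k != output_col}
        # Assumption: the auto-assigned id is the zero-based row index.
        new_row.setdefault("sequence_id", str(index))
        seq = new_row.get(input_col)
        # Empty inputs are skipped and merged back with a null value.
        new_row[output_col] = align(seq) if seq else None
        out.append(new_row)
    return out
```

The default column names mirror the input_column_name and output_column_name parameters described above.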