Sequence Search

The Bulk Search pipeline lets users submit multiple sequences for searching against one or more repertoire databases.

When to use

Use this pipeline when you have a set of query sequences (full amino acid sequences) and want to find the most similar sequences in a database. Typical use cases:

Finding public database matches for novel sequences
Comparing patient sequences to a reference cohort index
Cross-cohort sequence similarity analysis

Submitting a job

Navigate to Dashboard → New Job. For one-off queries, use Experimental → Single Sequence Search instead.

Required inputs

Field	Description
Job name	A label for this run
Input CSV files	One or more CSV/TSV files with a sequence column, uploaded as a zip archive
Sequence Column Name	Column name containing the query sequences
Database	One or more database names to search against
Top-K	Number of nearest neighbors to return per query
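
As an illustration of the expected upload shape, the sketch below packs one or more single-column CSV files into a zip archive. The helper name `build_input_archive`, the file names, and the column name are hypothetical; only the "CSV files with a sequence column, uploaded as a zip archive" requirement comes from the table above.

```python
import csv
import io
import zipfile


def build_input_archive(files, sequence_column="sequence"):
    """Pack {filename: [sequence, ...]} into an in-memory zip of CSV files.

    Hypothetical helper: the column name must match whatever you enter
    in the "Sequence Column Name" field when submitting the job.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename, sequences in files.items():
            text = io.StringIO()
            writer = csv.writer(text, lineterminator="\n")
            writer.writerow([sequence_column])  # header row
            for seq in sequences:
                writer.writerow([seq])
            archive.writestr(filename, text.getvalue())
    return buffer.getvalue()


if __name__ == "__main__":
    payload = build_input_archive(
        {"queries.csv": ["CASSLGQAYEQYF", "CASSPGTGGYEQYF"]},
        sequence_column="sequence",
    )
    with open("queries.zip", "wb") as handle:
        handle.write(payload)
```

Building the archive in memory keeps the helper easy to test; writing the bytes to disk is left to the caller.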

Optional inputs

Field	Default	Description
Score threshold	none	Minimum similarity score to include in results
Search type	`full_sequence`	`full_sequence` or `cdr3_only`

Output files

`results.csv`

The input CSV with additional columns per neighbor:

Column	Description
`match_1_sequence`	Top match sequence
`match_1_score`	Similarity score (0–1)
`match_1_subject`	Subject ID from the database
`match_2_sequence`	Second match sequence
…	… up to top-K
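
Because matches are spread across numbered columns, downstream analysis is often easier on one row per match. The sketch below reshapes the wide layout into a long one. It assumes that every rank follows the `match_N_sequence` / `match_N_score` / `match_N_subject` pattern shown above and that ranks with no match are left as empty cells; neither detail is confirmed here, so adjust it to the files you actually receive.

```python
import csv
import io
import re

# Assumed naming pattern, extrapolated from the columns listed above.
MATCH_COLUMN = re.compile(r"^match_(\d+)_(sequence|score|subject)$")


def matches_to_long(rows):
    """Yield one dict per (input row, rank) from wide results rows.

    Rows are identified by their position in the file, because the name
    of the query column is user-defined.
    """
    for row_index, row in enumerate(rows):
        by_rank = {}
        for column, value in row.items():
            if not isinstance(column, str):
                continue  # skip overflow cells collected by DictReader
            found = MATCH_COLUMN.match(column)
            if found and value not in ("", None):
                rank = int(found.group(1))
                by_rank.setdefault(rank, {})[found.group(2)] = value
        for rank in sorted(by_rank):
            yield {"row": row_index, "rank": rank, **by_rank[rank]}


if __name__ == "__main__":
    with open("results.csv", newline="") as handle:
        for match in matches_to_long(csv.DictReader(handle)):
            print(match)
```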

`failed.csv`

Sequences that had no results or encountered search errors:

Column	Description
`sequence`	Original query sequence
`error`	Error description
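
To see at a glance why sequences failed, the `error` column can be tallied. This is a minimal sketch that relies only on the two columns listed above; the function name `summarize_failures` is hypothetical.

```python
import csv
from collections import Counter


def summarize_failures(handle):
    """Count how often each error description appears in failed.csv."""
    reader = csv.DictReader(handle)
    return Counter(row["error"] for row in reader)


if __name__ == "__main__":
    with open("failed.csv", newline="") as handle:
        for error, count in summarize_failures(handle).most_common():
            print(f"{count}\t{error}")
```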

Troubleshooting

Job stuck in polling state
The FIRE API may be overloaded. The DAG keeps retrying with backoff, so check FIRE API health at `GET /health`. If FIRE stays down, the job will eventually time out.
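
The health check can also be scripted. The sketch below probes `GET /health` a bounded number of times with exponential backoff. The base URL is a placeholder, and treating HTTP 200 as "healthy" is an assumption, since the response format of the endpoint is not documented here.

```python
import time
import urllib.request


def probe_health(base_url, timeout=5.0):
    """Return True if GET {base_url}/health answers with HTTP 200.

    Assumption: a 200 status means the FIRE API is healthy.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False


def wait_until_healthy(probe, attempts=5, first_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call probe() up to `attempts` times, backing off between tries."""
    delay = first_delay
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay *= factor
    return False


if __name__ == "__main__":
    # Placeholder address: replace with your deployment's FIRE API URL.
    base = "http://localhost:8000"
    print("healthy" if wait_until_healthy(lambda: probe_health(base)) else "down")
```

Passing the probe and the sleep function as arguments keeps the retry loop independent of the network, which makes it straightforward to test.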

High rate of sequences in `failed.csv`
Sequences that are too short (< 5 aa) or contain non-amino-acid characters are rejected by FIRE. Pre-filter your input sequences.
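
A pre-filter along these lines can be run before building the upload. It is a sketch under two assumptions: that "amino-acid characters" means the 20 standard one-letter codes, and that surrounding whitespace and lowercase input can safely be normalized. FIRE's exact accepted alphabet is not specified here, so widen `VALID_AMINO_ACIDS` if your deployment accepts ambiguity codes.

```python
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
VALID_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 5  # sequences shorter than 5 aa are rejected


def split_valid(sequences, min_length=MIN_LENGTH):
    """Split sequences into (valid, rejected) lists.

    Valid sequences are returned stripped and uppercased; rejected ones
    are returned unchanged so they can be reported back to the user.
    """
    valid, rejected = [], []
    for seq in sequences:
        cleaned = seq.strip().upper()
        if len(cleaned) >= min_length and set(cleaned) <= VALID_AMINO_ACIDS:
            valid.append(cleaned)
        else:
            rejected.append(seq)
    return valid, rejected


if __name__ == "__main__":
    ok, bad = split_valid(["CASSLGQAYEQYF", "CASS", "CASSL1", " cassl "])
    print("kept:", ok)
    print("dropped:", bad)
```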

Score threshold too strict
If `results.csv` is mostly empty, lower `score_threshold` or remove it entirely.