Output files

When used at the command line, AIRRSHIP produces four output files per run:

1. Sequence FASTA

outname.fasta

A FASTA file containing the final simulated sequences. When SHM is simulated, these are the mutated sequences. The FASTA headers correspond to the sequence_id column in outname.tsv.
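Because the FASTA headers and the sequence_id column refer to the same records, the two files can be cross-checked after a run. The following is a minimal sketch using only the Python standard library. It assumes that each FASTA header line holds nothing but the identifier; the helper names and the in-memory demo data are invented for illustration and are not part of AIRRSHIP.

```python
import csv
import io


def fasta_ids(handle):
    """Return the header of every FASTA record, without the leading '>'."""
    # Assumption: the whole header line is the sequence identifier.
    return [line[1:].strip() for line in handle if line.startswith(">")]


def tsv_ids(handle):
    """Return the sequence_id column of a tab-delimited file with a header row."""
    return [row["sequence_id"] for row in csv.DictReader(handle, delimiter="\t")]


# Tiny in-memory stand-ins for outname.fasta and outname.tsv (invented data).
demo_fasta = io.StringIO(">seq1\nACGT\n>seq2\nGGCC\n")
demo_tsv = io.StringIO("sequence_id\tsequence\nseq1\tACGT\nseq2\tGGCC\n")

# Compare the two sets of identifiers.
ids_match = set(fasta_ids(demo_fasta)) == set(tsv_ids(demo_tsv))
print(ids_match)
```

For real output, the StringIO objects would be replaced with open("outname.fasta") and open("outname.tsv", newline="").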

2. Sequence information TSV

outname.tsv

A tab-delimited file containing information about each sequence and its formation. The file format follows the AIRR-C Rearrangement Schema where possible.

The following columns are present regardless of simulation criteria:

Name	Description
sequence_id	Unique sequence identifier.
sequence	Final simulated nucleotide sequence.
productive	True if the sequence is predicted to be productive.
stop_codon	True if the sequence contains a stop codon.
vj_in_frame	True if the V and J segments are in frame.
v_call	V gene with allele.
d_call	D gene with allele.
j_call	J gene with allele.
junction	Junction region nucleotide sequence (CDR3 plus the two flanking conserved codons).
junction_aa	Junction region amino acid translation.
junction_length	Length of the junction region.
np1_length	Length of the combined N/P region between the V and D genes.
np1	Nucleotide sequence of the combined N/P region between the V and D genes.
np2_length	Length of the combined N/P region between the D and J genes.
np2	Nucleotide sequence of the combined N/P region between the D and J genes.
v_3_trim	Number of nucleotides trimmed from the 3' end of the V gene.
d_5_trim	Number of nucleotides trimmed from the 5' end of the D gene.
d_3_trim	Number of nucleotides trimmed from the 3' end of the D gene.
j_5_trim	Number of nucleotides trimmed from the 5' end of the J gene.
v_sequence	Part of the sequence originating from the V gene.
d_sequence	Part of the sequence originating from the D gene.
j_sequence	Part of the sequence originating from the J gene.
v_sequence_start	Start position of the V gene in the sequence (1-based closed interval).
v_sequence_end	End position of the V gene in the sequence (1-based closed interval).
d_sequence_start	Start position of the D gene in the sequence (1-based closed interval).
d_sequence_end	End position of the D gene in the sequence (1-based closed interval).
j_sequence_start	Start position of the J gene in the sequence (1-based closed interval).
j_sequence_end	End position of the J gene in the sequence (1-based closed interval).
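The start and end columns use 1-based closed intervals, whereas Python slices are 0-based and half-open, so the start must be shifted by one when extracting a segment. The sketch below shows the conversion on a single row. The row values are invented for illustration, and the helper name segment is hypothetical rather than part of AIRRSHIP.

```python
def segment(row, gene):
    """Slice one gene segment out of row['sequence'].

    The TSV stores 1-based closed coordinates, so [start, end] maps to the
    Python slice [start - 1:end].
    """
    start = int(row[f"{gene}_sequence_start"])  # TSV values are read as strings
    end = int(row[f"{gene}_sequence_end"])
    return row["sequence"][start - 1:end]


# Invented example row, shaped like one record read with csv.DictReader.
row = {
    "sequence": "AAACCGGGTTTT",
    "v_sequence_start": "1", "v_sequence_end": "5",
    "d_sequence_start": "6", "d_sequence_end": "8",
    "j_sequence_start": "9", "j_sequence_end": "12",
}

v_part = segment(row, "v")
d_part = segment(row, "d")
j_part = segment(row, "j")
```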

Some columns are present only when SHM is not simulated:

Name	Description
gapped_sequence	Simulated nucleotide sequence with gaps inserted according to IMGT schema.

Some columns are present only when SHM is simulated:

Name	Description
shm_events	Comma-delimited list of mutation events, each in the format position:base>mutated_base.
shm_count	Number of mutations in the sequence.
shm_freq	Mutation frequency (number of mutations divided by the length of the sequence).
unmutated_sequence	Unmutated simulated nucleotide sequence.
gapped_unmutated_sequence	Unmutated simulated nucleotide sequence with gaps inserted according to IMGT schema.
gapped_mutated_sequence	Mutated simulated nucleotide sequence with gaps inserted according to IMGT schema.
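The shm_events format described above can be split into its parts with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the example string is invented, and the position is kept as a plain integer because the indexing convention (0-based or 1-based) is not stated here.

```python
def parse_shm_events(field):
    """Split an shm_events string into (position, base, mutated_base) tuples."""
    events = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty field (assumed to mean no mutations)
        position, change = item.split(":")
        base, mutated_base = change.split(">")
        events.append((int(position), base, mutated_base))
    return events


def mutation_frequency(shm_count, sequence_length):
    """Number of mutations divided by sequence length, as described for shm_freq."""
    return shm_count / sequence_length


# Invented example value for the shm_events column.
events = parse_shm_events("12:A>G,45:C>T")
freq = mutation_frequency(len(events), 8)
```

Each tuple can then be compared against unmutated_sequence and sequence to confirm which base changed at which position, once the indexing convention has been checked against real output.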

3. Locus file

outname_locus.csv

A two-column CSV file containing the alleles chosen for each simulated "chromosome". It can be used in subsequent runs to simulate sequences from the same genetic background.

4. Summary file

outname_summary.txt

A text file listing the arguments provided to the AIRRSHIP call.