Output files
When used at the command line, AIRRSHIP produces four output files per run:
1. Sequence FASTA
outname.fasta
A FASTA file containing the final simulated sequences. When SHM is simulated, these are the mutated sequences. The FASTA headers correspond to the sequence_id column in outname.tsv.
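As a quick consistency check, the FASTA headers can be compared against the sequence_id column of the TSV. The sketch below is illustrative only: the helper names and the toy data are hypothetical, and it assumes that each header holds just the identifier (any text after the first whitespace is ignored). In practice you would pass open file handles for outname.fasta and outname.tsv instead of the in-memory strings.

```python
import csv
import io


def read_fasta_ids(handle):
    """Collect the identifier from every FASTA header line (text after '>')."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the identifier is the first whitespace-delimited token.
            ids.append(header.split()[0] if header else "")
    return ids


def read_tsv_ids(handle):
    """Collect the sequence_id column from a tab-delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["sequence_id"] for row in reader]


# Toy stand-ins for outname.fasta and outname.tsv (hypothetical content).
fasta_text = ">seq1\nACGT\n>seq2\nGGCC\n"
tsv_text = "sequence_id\tsequence\nseq1\tACGT\nseq2\tGGCC\n"

fasta_ids = read_fasta_ids(io.StringIO(fasta_text))
tsv_ids = read_tsv_ids(io.StringIO(tsv_text))

# Identifiers present in one file but not the other.
missing_from_tsv = set(fasta_ids) - set(tsv_ids)
missing_from_fasta = set(tsv_ids) - set(fasta_ids)
print(missing_from_tsv, missing_from_fasta)
```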
2. Sequence information TSV
outname.tsv
A tab-delimited file containing information about each sequence and its formation. The file format follows the AIRR-C Rearrangement Schema where possible.
The following columns are present regardless of simulation criteria:
| Name | Description |
|---|---|
| sequence_id | Unique sequence identifier. |
| sequence | Final simulated nucleotide sequence. |
| productive | True if the sequence is predicted to be productive. |
| stop_codon | True if the sequence contains a stop codon. |
| vj_in_frame | True if the V and J segments are in frame. |
| v_call | V gene with allele. |
| d_call | D gene with allele. |
| j_call | J gene with allele. |
| junction | Junction region nucleotide sequence: the CDR3 plus the two flanking conserved codons. |
| junction_aa | Junction region amino acid translation. |
| junction_length | Length of the junction region. |
| np1_length | Length of the combined N/P region between the V and D gene. |
| np1 | Nucleotide sequence of the combined N/P region between the V and D gene. |
| np2_length | Length of the combined N/P region between the D and J gene. |
| np2 | Nucleotide sequence of the combined N/P region between the D and J gene. |
| v_3_trim | Number of nucleotides trimmed from the 3' end of the V gene. |
| d_5_trim | Number of nucleotides trimmed from the 5' end of the D gene. |
| d_3_trim | Number of nucleotides trimmed from the 3' end of the D gene. |
| j_5_trim | Number of nucleotides trimmed from the 5' end of the J gene. |
| v_sequence | Part of the sequence originating from the V gene. |
| d_sequence | Part of the sequence originating from the D gene. |
| j_sequence | Part of the sequence originating from the J gene. |
| v_sequence_start | Start position of the V gene in the sequence (1-based closed interval). |
| v_sequence_end | End position of the V gene in the sequence (1-based closed interval). |
| d_sequence_start | Start position of the D gene in the sequence (1-based closed interval). |
| d_sequence_end | End position of the D gene in the sequence (1-based closed interval). |
| j_sequence_start | Start position of the J gene in the sequence (1-based closed interval). |
| j_sequence_end | End position of the J gene in the sequence (1-based closed interval). |
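The start and end columns use 1-based closed intervals, whereas Python slices are 0-based and half-open. A minimal sketch of the conversion follows; the function names and the toy sequence are hypothetical and are not part of AIRRSHIP.

```python
def extract_segment(sequence, start, end):
    """Return the 1-based closed interval [start, end] of a sequence.

    A 1-based closed interval maps to the Python slice [start - 1:end].
    """
    return sequence[start - 1:end]


def interval_length(start, end):
    """Length of a 1-based closed interval (both endpoints included)."""
    return end - start + 1


# Toy example: values read from the TSV arrive as strings, so convert to int.
toy_sequence = "AAACCCGGGTTT"
d_start, d_end = int("4"), int("6")

d_part = extract_segment(toy_sequence, d_start, d_end)
print(d_part, interval_length(d_start, d_end))
```

The same conversion applies to the V and J coordinate pairs.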
Some columns are present only when SHM is not simulated:
| Name | Description |
|---|---|
| gapped_sequence | Simulated nucleotide sequence with gaps inserted according to IMGT schema. |
Some columns are present only when SHM is simulated:
| Name | Description |
|---|---|
| shm_events | Comma-delimited list of mutation events, each in the format position:base>mutated_base. |
| shm_count | Number of mutations in the sequence. |
| shm_freq | Mutation frequency (number of mutations divided by the length of the sequence). |
| unmutated_sequence | Unmutated simulated nucleotide sequence. |
| gapped_unmutated_sequence | Unmutated simulated nucleotide sequence with gaps inserted according to IMGT schema. |
| gapped_mutated_sequence | Mutated simulated nucleotide sequence with gaps inserted according to IMGT schema. |
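The shm_events field can be split into individual events for downstream analysis. The sketch below assumes only the position:base>mutated_base format described above; the function names are hypothetical, and because the table does not state whether positions are 0-based or 1-based, the code returns them unchanged.

```python
def parse_shm_events(field):
    """Split a comma-delimited shm_events string into (position, base, mutated_base) tuples."""
    events = []
    if not field:
        return events  # no mutations recorded
    for item in field.split(","):
        position, change = item.strip().split(":")
        base, mutated_base = change.split(">")
        events.append((int(position), base, mutated_base))
    return events


def mutation_frequency(mutation_count, sequence):
    """Number of mutations divided by the length of the sequence."""
    return mutation_count / len(sequence)


# Toy values (hypothetical, not real AIRRSHIP output).
events = parse_shm_events("5:A>G,12:C>T")
freq = mutation_frequency(len(events), "ACGTACGTACGTACGTACGT")
print(events, freq)
```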
3. Locus file
outname_locus.csv
A two-column CSV file containing the alleles chosen for each simulated "chromosome". It can be used in subsequent runs to simulate sequences from the same genetic background.
4. Summary file
outname_summary.txt
A text file listing the arguments provided to the AIRRSHIP call.