Quickstart
Download
The latest version of AIRRSHIP can be downloaded from PyPI or GitHub.
Installation
The easiest way to install is using pip, either directly:
pip install airrship
Or after downloading the latest release:
pip install airrship-x.y.z.tar.gz
Requirements
AIRRSHIP intentionally uses only Python standard libraries and requires only the installation of base Python (version 3.7 or above).
Examples
A very small example repertoire is held at the AIRRSHIP GitHub repository to provide an example of the expected output. Larger example repertoire files are available at Zenodo.
Running from the command line
The most basic call to AIRRSHIP requires only an output name.
airrship -o my_repertoire
This will create a repertoire of 1000 unmutated human heavy chain BCR sequences with metrics derived from experimental distributions.
Four output files will be generated:
- my_repertoire.fasta - final sequences in FASTA format
- my_repertoire.tsv - information regarding sequence generation
- my_repertoire_locus.csv - the simulated locus
- my_repertoire_summary.txt - summary of input commands
Please see Output Files for further details on output file format.
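Once a run has finished, the FASTA output can be inspected using only the standard library. The following is a minimal sketch and not part of AIRRSHIP: read_fasta is a hypothetical helper, and it assumes only that my_repertoire.fasta follows the usual FASTA layout (a header line starting with ">", followed by one or more sequence lines).

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file.

    Hypothetical helper for illustration; not part of AIRRSHIP.
    """
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Close the final record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    fasta_path = Path("my_repertoire.fasta")
    if fasta_path.exists():
        records = read_fasta(fasta_path)
        print(f"{len(records)} sequences read")
```

With the default settings described above, the reported count would be expected to match the number of sequences requested.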
Customising repertoire generation
By default, AIRRSHIP attempts to replicate real experimental repertoires as closely as possible. However, there are a large number of command line options that can be used to produce repertoires with specific desired features.
For example, we could create a repertoire with:
- 16,000 sequences
- a locus where as many genes as possible are heterozygous (this cannot be achieved for every position, as not all genes have two alleles)
- balanced usage of the gene families
- no trimming of the 5' end of the D gene
- no insertion of nucleotides between the D and J gene (no NP2 regions)
- non-productive sequences included in the output, limited to 10% of the repertoire
airrship -o complex_repertoire \
-n 16000 \
--het 1 1 1 \
--flat_vdj family \
--no_trim_d5 \
--no_np2 \
--non_productive \
--prop_non_productive 0.1
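The same command can be assembled programmatically, which is convenient when scripting several runs. The sketch below is illustrative only: build_args is a hypothetical helper, every flag in it is taken from the command above, and the resulting list has the form passed to create_repertoire.main() in the section "Using the package in Python".

```python
import shlex


def build_args(output_name, n_sequences):
    """Assemble the argument list for the customised repertoire above.

    Hypothetical helper; only flags shown in this quickstart are used.
    """
    return [
        "-o", output_name,
        "-n", str(n_sequences),
        "--het", "1", "1", "1",
        "--flat_vdj", "family",
        "--no_trim_d5",
        "--no_np2",
        "--non_productive",
        "--prop_non_productive", "0.1",
    ]


args = build_args("complex_repertoire", 16000)
# Print the equivalent shell command for checking or logging
print("airrship " + " ".join(shlex.quote(a) for a in args))
```

Keeping the arguments in a list avoids quoting problems if an output name ever contains spaces.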
Full details can be found in Command line Usage.
Note
Occasionally AIRRSHIP may fail to generate a productive sequence from a specific combination of alleles and will print a warning. This should not be of concern unless it happens frequently, in which case you may need to check your chosen parameters or input data.
Adding somatic hypermutation
The --shm flag will generate SHM according to observed distributions (see Simulation Model for more information).
airrship -o shm_repertoire --shm
Mutation rates can be controlled by passing a multiplication factor with --shm_multiplier. For example, the command below will create a repertoire in which sequences are mutated at half the rates specified in the reference files.
airrship -o shm_repertoire --shm --shm_multiplier 0.5
To request a constant mutation frequency across all sequences, the --shm_flat option can be used. The desired mutation rate or number can be specified with either --mut_rate or --mut_num.
The command below will create 1000 sequences, each with a mutation rate of 0.08 (i.e. number of mutations in sequence / length of sequence = 0.08). The distribution will be as close to flat as possible but may fluctuate slightly.
airrship -o shm_flat_repertoire --shm_flat --mutation_rate 0.08
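The definition of mutation rate used here (number of mutations divided by sequence length) can be made concrete with a short calculation. This is a sketch for illustration, not AIRRSHIP's own code: mutation_rate is a hypothetical helper, the two sequences are invented, and it assumes the unmutated and mutated sequences have equal length (substitutions only).

```python
def mutation_rate(unmutated, mutated):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(unmutated) != len(mutated):
        raise ValueError("sequences must have the same length")
    if not unmutated:
        raise ValueError("sequences must not be empty")
    mismatches = sum(a != b for a, b in zip(unmutated, mutated))
    return mismatches / len(unmutated)


# Invented 25-base example with two substitutions (0-based positions 3 and 10)
germline = "CAGGTGCAGCTGGTGCAGTCTGGGG"
mutated = "CAGTTGCAGCAGGTGCAGTCTGGGG"
print(mutation_rate(germline, mutated))  # 2 / 25 = 0.08
```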
The default SHM algorithm treats each base in the sequence differently, depending on the 5mer context of the base and the region of the sequence in which it is found. To make per-base mutation independent of sequence context, --shm_random can be used.
airrship -o shm_random_repertoire --shm_random
The per-sequence mutation rate will still follow the observed experimental distribution unless --shm_flat is also specified.
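To illustrate what "5mer context" means, the sketch below extracts the five-base window centred on a given position. It is not AIRRSHIP's implementation: five_mer_context is a hypothetical helper, and padding the sequence ends with N is an assumption made here so that every position has a full window.

```python
def five_mer_context(sequence, index, pad="N"):
    """Return the 5-base window centred on sequence[index].

    Positions within two bases of either end are padded with `pad`
    (an assumption for this sketch, not AIRRSHIP's behaviour).
    """
    if not 0 <= index < len(sequence):
        raise IndexError("index out of range")
    padded = pad * 2 + sequence + pad * 2
    # After padding, the original position sits at index + 2,
    # so the window runs from index up to (not including) index + 5
    return padded[index:index + 5]


seq = "ACGTTGCA"
print(five_mer_context(seq, 3))  # CGTTG, centred on the first T
print(five_mer_context(seq, 0))  # NNACG, padded at the 5' end
```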
Note
Setting mutation rates higher than 0.2 will result in a warning and, depending on the other options specified, may result in very slow performance or a failure to generate sufficient sequences. Other distributions may also be skewed. Mutation rates above 0.5 are not supported.
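The thresholds in this note can be checked before launching a long run. The following pre-flight check is a sketch only: check_mutation_rate is a hypothetical helper that mirrors the limits stated above (a warning above 0.2, unsupported above 0.5) and is not part of AIRRSHIP; rejecting negative rates is an additional assumption.

```python
import warnings


def check_mutation_rate(rate):
    """Validate a requested mutation rate against the documented limits."""
    if rate < 0:
        raise ValueError("mutation rate must not be negative")
    if rate > 0.5:
        raise ValueError("mutation rates above 0.5 are not supported")
    if rate > 0.2:
        warnings.warn(
            "mutation rates above 0.2 may be slow or yield too few sequences"
        )
    return rate


check_mutation_rate(0.08)  # within the usual range, no warning
```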
Using the package in Python
If desired, instead of running from the command line, the package can be imported and a call to main() made within Python, specifying the same parameters as discussed above.
from airrship import create_repertoire
# Pass the arguments as a list of strings
create_repertoire.main(['-o', 'my_repertoire', '--outdir', 'output'])
# Or, equivalently, split a command-line style string
create_repertoire.main('-o my_repertoire --outdir output'.split())
It is also possible to use individual functions from the package. A simple three-step workflow to generate sequences is described below.
Read in data to be used
from airrship import create_repertoire
data_dict = create_repertoire.load_data()
Create a locus from which to generate sequences
locus = create_repertoire.get_genotype()
Generate sequences
sequence = create_repertoire.generate_sequence(locus, data_dict, mutate=True)
generate_sequence returns an individual Sequence class object with all information about its generation stored as attributes. For example, the final mutated sequence can be accessed using sequence.mutated_seq. For full details, see Python.