Input files
AIRRSHIP uses various input data in order to accurately replicate real repertoires. If desired, an alternative data directory can be specified. Files in this directory must follow the format of the included data, including exactly matching headers and column order, and be named in the same manner.
Scripts for producing input files
A selection of scripts are in included in the scripts folder on Github and can be used to process AIRR TSV files into the required formats for AIRRSHIP input.
Each reference file requires that certain AIRR-C format columns are present - to produce all reference files, the following columns must be included in your input: sequence,sequence_alignment, germline_alignment, fwr1, fwr2, fwr3, cdr1, cdr2, cdr3, v_call, d_call, j_call, np1_length, np2_length, v_sequence_end, d_sequence_start, j_sequence_start. You must also provide paths to IMGT V and D allele FASTA files when calculating trimming metrics. Help for each script can be accessed using script_name.py -h.
These scripts apply very limited pre-processing to the provided sequences to allow for flexibility and will only drop sequences that contain missing values for required columns. The user may wish to filter to only productive sequences in advance. Some scripts allow the specification of a group column. In these cases this column will be used to group sequences by each unique level of this column and the resulting metrics will be averaged. Otherwise metrics are calculated from the provided file as a whole, treating every sequence equally.
Details of the input files and the scripts that can be used to process them are included below. These scripts do require the installation of pandas (tested with v1.3.5)
IMGT alleles
Germline VDJ alleles in FASTA format with IMGT headers. Only alleles that are marked as F (functional) or ORF (open reading frame) will be used for sequence generation:
- IMGT_human_IGHV.fasta
- IMGT_human_IGHD.fasta
- IMGT_human_IGHJ.fasta
AIRRSHIP includes the IMGT dataset as of 18.02.2022. If an updated dataset is used, or the user wishes to use an alternate database, all other files must contain information to support all the genes present in the new dataset (i.e. if IMGT adds new genes or gene families, and you try to substitute in these files, AIRRSHIP may fail as the currently included distributions do not account for them). AIRRSHIP currently only supports alleles with names in the IMGT format,
See Non-human Species for allele files for information on simulating sequences from species other than humans.
Note
To avoid confusion when benchmarking alignments, AIRRSHIP ignores the IMGT allele IGHV3-30-3*03 as it has the same sequence as allele IGHV3-30*04. AIRRSHIP will also not utilise V or J alleles that do not have the expected junctional anchors.
VDJ usage
VDJ gene and family usage, giving the proportion of sequences to include each gene or family:
- IGHV_usage.csv
- IGHV_usage_gene.csv
- IGHD_usage.csv
- IGHD_usage_gene.csv
- IGHJ_usage.csv
- IGHJ_usage_gene.csv
Example:
| v_family | prop |
|---|---|
| IGHV1 | 0.210282513 |
| IGHV1/OR15 | 0.000783292 |
| ... | ... |
Scripts:
create_vdj_files.py [-h] -i INPUT_FILE [--group GROUP]
-i INPUT_FILE, --input_file INPUT_FILE
--group GROUP Name of column containing dataset metadata to group by.
Required columns: v_call, d_call, j_call
Trimming
Distributions of number of nucleotides to be removed from gene ends at the junctions, per gene family:
- V_family_trimming_proportions.csv
- D_5_family_trimming_proportions.csv
- D_3_family_trimming_proportions.csv
- J_family_trimming_proportions.csv
Example:
| v_family | v_3p_del | proportions |
|---|---|---|
| IGHV1 | 0 | 0.406392257 |
| IGHV1 | 1 | 0.290851348 |
| ... | ... | ... |
Scripts:
create_trimming_files.py [-h] -i INPUT_FILE [--imgt_v IMGT_V] [--imgt_d IMGT_D]
-i INPUT_FILE, --input_file INPUT_FILE
--imgt_v IMGT_V Path to IMGT V allele file.
--imgt_d IMGT_D Path to IMGT D allele file.
Required columns: v_call, d_call, v_germline_end, j_germline_start, d_germline_start, d_sequence_end and d_sequence_start.
NP additions
Proportion of NP regions starting with each nucleotide base:
- np1_first_base_probs.csv
- np2_first_base_probs.csv
Example:
| proportion | |
|---|---|
| A | 0.111702543 |
| C | 0.246122379 |
| G | 0.284580985 |
| T | 0.357594093 |
Proportion of NP regions of each length:
- np1_lengths_proportions.csv
- np2_lengths_proportions.csv
Example:
| np1_length | prop |
|---|---|
| 0 | 0.052651438 |
| 1 | 0.040751106 |
| ... | ... |
Transition probabilities for each position in NP region (i.e. the likelihood of the next base being A,C,G or T when the current base is A,C,G or T). For both mutated and unmutated sequences:
- np1_transition_probs_per_position_igdm.csv
- np1_transition_probs_per_position_igag.csv
- np2_transition_probs_per_position_igdm.csv
- np2_transition_probs_per_position_igag.csv
Example:
| Length | Base | A | C | G | T |
|---|---|---|---|---|---|
| 0 | T | 0.169400291 | 0.390251379 | 0.232798504 | 0.207549826 |
| 0 | A | 0.24858057 | 0.238914924 | 0.34526599 | 0.167238516 |
| ... | ... | ... | ... | ... | ... |
Scripts:
create_np_lengths_files.py [-h] -i INPUT_FILE [--group GROUP]
-i INPUT_FILE, --input_file INPUT_FILE
--group GROUP Name of column containing dataset metadata to group by.
Required columns: np1_length, np2_length
create_np_first_bases.py [-h] -i INPUT_FILE
-i INPUT_FILE, --input_file INPUT_FILE
Required columns: np1_length,np2_length,sequence,v_sequence_end,d_sequence_start and j_sequence_start
create_np_transitions_files.py [-h] -i INPUT_FILE
-i INPUT_FILE, --input_file INPUT_FILE
Required columns: np1_length,np2_length,sequence,v_sequence_end,d_sequence_start and j_sequence_start
When producing NP transition matrices, only a single file for each NP region is produced. AIRRSHIP requires two, np[1|2]_transition_probs_per_position_igdm.csv and np[1|2]_transition_probs_per_position_igag.csv, to account for differences in inserted nucleotides following SHM. Therefore this script must be run once on unmutated sequences and once on mutated sequences and the files manually renamed. If required, the same file with different names can be used although this may affect insertion accuracy.
Somatic Hypermutation
Rates of mutation frequency per sequence, per V family:
- mut_freq_per_seq_per_family.csv
Example:
| v_family | mut_freq | proportion |
|---|---|---|
| IGHV1 | 0 | 0.005959742 |
| IGHV1 | 0.1 | 0.003065621 |
| IGHV1 | 0.111111111 | 0.002811285 |
| ... | ... | ... |
Per base mutation rates for the centre base of each unique 5mer (the likelihood of the centre base being one of A,C,G,T for that kmer when pulled from mutated sequences). For each IMGT region and for the CDR and FWR regions in aggregate:
- cdr_kmer_base_usage.csv
- cdr1_kmer_base_usage.csv
- cdr2_kmer_base_usage.csv
- cdr3_kmer_base_usage.csv
- fwr_kmer_base_usage.csv
- fwr1_kmer_base_usage.csv
- fwr2_kmer_base_usage.csv
- fwr3_kmer_base_usage.csv
- fwr4_kmer_base_usage.csv
Example:
| kmer | A | C | G | T |
|---|---|---|---|---|
| GTGCA | 0.054353444 | 0.001915083 | 0.927897253 | 0.015834221 |
| TGCAC | 0.001906821 | 0.901529425 | 0.017524232 | 0.079039522 |
| GCACA | 0.847932735 | 0.043780581 | 0.074638153 | 0.033648531 |
| ... | ... | ... | ... | ... |
Scripts:
create_shm_files.py [-h] -i INPUT_FILE
-i INPUT_FILE, --input_file INPUT_FILE
Required columns: v_call, germline_alignment, sequence_alignment, fwr1, fwr2, fwr3, cdr1, cdr2, cdr3
Experimental data used
The input files included with AIRRSHIP were generated using repertoire data from the following public datasets:
- Cowan, G et al., unpublished data.
- Ghraichy,M. et al. (2020) Maturation of the Human Immunoglobulin Heavy Chain Repertoire With Age. Front. Immunol., 11, 1734.
- Preprocessed data downloaded from https://zenodo.org/record/3585046#.Y2upy4LP2JF.
- Only individuals aged > 9 years included.
- Gidoni,M. et al. (2019) Mosaic deletion patterns of the human antibody heavy chain gene locus shown by Bayesian haplotyping. Nat. Commun., 10, 628.
- Raw sequence data downloaded from the ENA, accession PRJEB26509.
- Healthy controls only included.
- Waltari,E. et al. (2018) 5′ Rapid Amplification of cDNA Ends and Illumina MiSeq Reveals B Cell Receptor Features in Healthy Adults, Adults With Chronic HIV-1 Infection, Cord Blood, and Humanized Mice. Front. Immunol., 9, 628.
- Raw sequence data downloaded from the SRA, accession PRJNA393446.
- Healthy adults only included.
- Yang,X. et al. (2021) Large-scale analysis of 2,152 Ig-seq datasets reveals key features of B cell biology and the antibody repertoire. Cell Rep., 35, 109110.
- Raw sequence data downloaded from the SRA, accession PRJNA564936.
Sequences were processed using pRESTO [1] and Change-O [2] from the Immcantation pipeline, with VDJ assignment carried out using IgBLAST [3] v1.18.0.
[1] Vander Heiden, J. A. et al. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30, 1930–1932 (2014).
[2] Gupta, N. T. et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31, 3356–3358 (2015).
[3] Ye, J., Ma, N., Madden, T. L. & Ostell, J. M. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 41, W34–W40 (2013).