Python Usage

Overview

For a basic example of importing AIRRSHIP as a package, see here.

Detailed Function and Class Documentation

create_repertoire.load_data

Loads and processes required data files from data folder.

Parameters:

Name Type Description Default
data_folder path

Path to data folder with required data. If not specified then uses inbuilt package data. Defaults to None.

None
mutate bool

Whether to read in data for mutated sequences or not. Defaults to False.

False

Returns:

Name Type Description
data_dict dict

Dictionary containing all required data for generating sequences. Includes family_use_dict, gene_use_dict, trim_dicts, NP_transitions, NP_first_bases, NP_lengths, mut_rate_per_seq and kmer_dicts.

create_repertoire.get_genotype

Wrapper that generates a locus for use in sequence generation.

Parameters:

Name Type Description Default
data_folder path

Path to data folder with required data. When not specified will use package data. Defaults to None.

None
het_list list

Proportion of genes [V, D, J] to be heterozygous. Defaults to [1, 1, 1].

[1, 1, 1]
haplotype bool

True when only two alleles per gene are to be used. Defaults to True.

True
locus path

Path to file with predefined locus. Defaults to None.

None

Returns:

Name Type Description
locus list

List of two dictionaries. Each is a dictionary containing the gene segment as keys and the chosen alleles as values. Format is {Segment : [Allele, Allele ...], ...}
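The locus format described above can be pictured with a toy example. This is only a sketch: the segment keys "V", "D" and "J" and the use of plain allele-name strings are assumptions made for illustration (the real dictionaries may hold Allele objects), and heterozygous_genes is a hypothetical helper, not part of the package.

```python
# Toy locus in the documented shape: a list of two dictionaries,
# each mapping a segment to a list of chosen alleles.
# Plain strings stand in for alleles here (an assumption).
toy_locus = [
    {"V": ["IGHV1-2*02", "IGHV3-23*01"], "D": ["IGHD3-10*01"], "J": ["IGHJ4*02"]},
    {"V": ["IGHV1-2*04", "IGHV3-23*01"], "D": ["IGHD3-10*01"], "J": ["IGHJ4*02"]},
]


def heterozygous_genes(locus, segment):
    """Return genes whose allele differs between the two dictionaries.

    Hypothetical helper: the gene name is taken as the part of the
    allele name before the '*'.
    """
    first, second = (
        {allele.split("*")[0]: allele for allele in chromosome[segment]}
        for chromosome in locus
    )
    shared = first.keys() & second.keys()
    return sorted(gene for gene in shared if first[gene] != second[gene])


het_v = heterozygous_genes(toy_locus, "V")
het_j = heterozygous_genes(toy_locus, "J")
print(het_v, het_j)
```

In this toy locus only one of the two V genes is heterozygous, which corresponds to a V proportion of 0.5 in the sense of het_list.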

create_repertoire.generate_sequence

Wrapper to bring together entire sequence generation process.

Recombines, trims and mutates. Optionally produces functional sequences (sequences with an in-frame V and J gene, no stop codons and the expected junction anchor residues) or non-functional sequences.

Parameters:

Name Type Description Default
locus list

List of two dictionaries. Each is a dictionary containing the gene segment as keys and the chosen alleles as values. Format is {Segment : [Allele, Allele ...], ...}

required
data_dict dict

Output of load_data(). Includes family_use_dict, gene_use_dict, trim_dicts, NP_transitions, NP_first_bases, NP_lengths, mut_rate_per_seq and kmer_dicts.

required
mutate bool

True if SHM to be introduced. Defaults to False.

False
flat_usage optional

One of gene, family or False. gene or family specifies that sequences should use all genes or all gene families evenly. If False, usage follows experimental distributions. Defaults to False.

False
no_trim_list tuple

Tuple of 5 Booleans, specifying whether to not trim [all_ends, v_3_end, d_5_end, d_3_end, j_5_end]. Defaults to (False, False, False, False, False).

(False, False, False, False, False)
no_np_list tuple

Tuple of 3 Booleans, specifying whether to not add [both_np, np1, np2]. Defaults to (False, False, False).

(False, False, False)
shm_flat bool

True if SHM is to be even across all sequences. Defaults to False.

False
shm_random bool

True if per base mutation is to be random. Defaults to False.

False
mutation_rate float

Mutation rate to be used rather than choosing from distribution. Defaults to None.

None
mutation_number int

Number of mutations to be added rather than choosing from distribution. Defaults to None.

None
mut_multiplier float

Multiplier to be used on mutation rates pulled from distribution. Defaults to 1.

1
non_functional bool

Return non-functional sequences. Defaults to False.

False

Returns:

Name Type Description
sequence Sequence

Final recombined sequence, with trimming, NP region addition and SHM if requested.
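The three functions documented so far are meant to be chained: load the data, build a locus, then generate sequences. The sketch below shows that flow using only the parameter names listed above. The helper simulate_sequences is hypothetical, and the module is passed in as an argument so the sketch can be run against a stand-in object; the stand-in only records the keyword arguments it receives and does not imitate real output.

```python
from types import SimpleNamespace


def simulate_sequences(cr, n_seqs, mutate=False, **options):
    """Generate n_seqs sequences with the documented functions.

    cr is the create_repertoire module (or an object with the same
    functions). Extra keyword options are forwarded unchanged to
    generate_sequence.
    """
    data_dict = cr.load_data(mutate=mutate)  # inbuilt package data
    locus = cr.get_genotype()                # documented defaults
    return [
        cr.generate_sequence(locus, data_dict, mutate=mutate, **options)
        for _ in range(n_seqs)
    ]


# Positional flags, in the documented order.
# [all_ends, v_3_end, d_5_end, d_3_end, j_5_end]
no_trim_list = (False, True, False, False, False)  # do not trim the V 3' end
# [both_np, np1, np2]
no_np_list = (False, False, True)                  # do not add NP2

# Stand-in with the same call shapes, used only to exercise the sketch.
stub = SimpleNamespace(
    load_data=lambda data_folder=None, mutate=False: {"mutate": mutate},
    get_genotype=lambda **kwargs: [{}, {}],
    generate_sequence=lambda locus, data_dict, **kwargs: kwargs,
)

calls = simulate_sequences(
    stub, 2, mutate=True, no_trim_list=no_trim_list, no_np_list=no_np_list
)
print(calls[0])
```

With AIRRSHIP installed, the real module would be passed instead of the stand-in, for example after `from airrship import create_repertoire` (the import path is an assumption based on the module name used on this page).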

create_repertoire.Sequence

Represents a recombined Ig sequence consisting of V, D and J segments.

Attributes:

Name Type Description
v_allele Allele

IMGT V gene allele.

d_allele Allele

IMGT D gene allele.

j_allele Allele

IMGT J gene allele.

alleles list

List of IMGT alleles.

NP1_region str

NP1 region - between V and D gene.

NP1_length int

Length of NP1 region.

NP2_region str

NP2 region - between D and J gene.

NP2_length int

Length of NP2 region.

ungapped_seq str

Ungapped nucleotide sequence.

gapped_seq str

Gapped nucleotide sequence.

mutated_seq str

Ungapped mutated nucleotide sequence.

gapped_mutated_seq str

Gapped mutated nucleotide sequence.

junction str

Nucleotide sequence of junction region.

v_seq str

Nucleotide sequence of V region.

d_seq str

Nucleotide sequence of D region.

j_seq str

Nucleotide sequence of J region.

v_seq_start int

Start position of V region.

d_seq_start int

Start position of D region.

j_seq_start int

Start position of J region.

v_seq_end int

End position of V region.

d_seq_end int

End position of D region.

j_seq_end int

End position of J region.

mutations str

Mutation events.

mut_count int

Mutation count.

mut_freq float

Mutation frequency.

functional bool

Sequence is functional.

stop bool

Presence/absence of stop codon.

anchors bool

Presence/absence of correct junction anchors.

inframe bool

VJ is in-frame.
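The functional, stop, anchors and inframe attributes fit together as in the definition of a functional sequence given for generate_sequence: in-frame, no stop codons, expected anchors. The following is a minimal sketch of that relationship, not the package's implementation. The helper names are hypothetical, and the reading frame is assumed to start at the first nucleotide of the string.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(nuc_seq):
    """Return True if any complete codon, read from position 0, is a stop."""
    usable = len(nuc_seq) - len(nuc_seq) % 3  # ignore a trailing partial codon
    codons = (nuc_seq[i:i + 3].upper() for i in range(0, usable, 3))
    return any(codon in STOP_CODONS for codon in codons)


def is_functional(inframe, stop, anchors):
    """Combine the three flags as described for functional sequences."""
    return inframe and not stop and anchors


stop = has_stop_codon("ATGGCCTAAGGT")  # third codon is TAA
print(stop, is_functional(inframe=True, stop=stop, anchors=True))
```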

__init__(v_allele, d_allele, j_allele)

Initialises a Sequence class instance.

Parameters:

Name Type Description Default
v_allele Allele

IMGT V gene allele, required.

required
d_allele Allele

IMGT D gene allele, required.

required
j_allele Allele

IMGT J gene allele, required.

required

get_junction_length()

Calculates the junction length of the sequence (CDR3 region plus both anchor residues).

Returns:

Name Type Description
junction_length int

Number of nucleotides in junction (CDR3 + anchors)
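Because the junction is the CDR3 plus one anchor codon on each side, the two lengths differ by six nucleotides. A small sketch of that arithmetic, using a made-up junction string and hypothetical helper names:

```python
ANCHOR_NT = 3  # each anchor residue is one codon


def junction_length(junction):
    """Number of nucleotides in the junction (CDR3 + both anchors)."""
    return len(junction)


def cdr3_length(junction):
    """Number of nucleotides in the CDR3 alone (anchors excluded)."""
    return len(junction) - 2 * ANCHOR_NT


# Made-up example: starts with a Cys codon (TGT), ends with a Trp codon (TGG).
toy_junction = "TGTGCGAGAGATTACTGG"
print(junction_length(toy_junction), cdr3_length(toy_junction))
```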

get_nuc_seq(no_trim_list, trim_dicts, no_np_list, NP_lengths, NP_transitions, NP_first_bases, gapped=False)

Creates the recombined nucleotide sequence with trimming and NP addition.

Parameters:

Name Type Description Default
no_trim_list list

List of 5 Booleans, specifying whether to not trim [all_ends, v_3_end, d_5_end, d_3_end, j_5_end].

required
trim_dicts dict

A dictionary of dictionaries of trimming length proportions by gene family for each segment (V, D or J).

required
no_np_list list

List of 3 Booleans, specifying whether to not add [both_np, np1, np2].

required
NP_lengths dict

Dictionary of possible NP region lengths and the proportion of sequences to use them. In the format {NP region length: proportion}.

required
NP_transitions dict

Nested dictionary containing transition matrix of probabilities of moving from one nucleotide (A, C, G, T) to any other for each position in the NP region.

required
NP_first_bases dict

Nested dictionary of the proportion of NP sequences starting with each base for NP1 and NP2.

required
gapped bool

Specify whether to return sequence with IMGT gaps or not. Defaults to False.

False

Returns:

Name Type Description
nuc_seq str

The recombined nucleotide sequence.
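The NP_lengths, NP_first_bases and NP_transitions parameters together describe a position-dependent Markov chain over nucleotides. The sketch below illustrates how such inputs could be combined to draw one NP region. It is not the package's code: the exact nesting of the dictionaries (position, then previous base, then next base) is an assumption, the toy probabilities are invented, and sample_weighted and sketch_np_region are hypothetical names.

```python
import random

BASES = "ACGT"


def sample_weighted(rng, weights):
    """Pick one key from a {value: proportion} dictionary."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def sketch_np_region(rng, np_lengths, first_bases, transitions):
    """Draw a length, a first base, then walk the transition tables."""
    length = sample_weighted(rng, np_lengths)
    if length == 0:
        return ""
    bases = [sample_weighted(rng, first_bases)]
    for position in range(1, length):
        previous = bases[-1]
        bases.append(sample_weighted(rng, transitions[position][previous]))
    return "".join(bases)


# Invented toy inputs in the assumed shapes.
toy_lengths = {0: 0.1, 3: 0.5, 6: 0.4}
toy_first_bases = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
toy_transitions = {
    position: {prev: {nxt: 0.25 for nxt in BASES} for prev in BASES}
    for position in range(1, 6)
}

rng = random.Random(0)  # seeded so repeated runs draw the same region
np_region = sketch_np_region(rng, toy_lengths, toy_first_bases, toy_transitions)
print(repr(np_region))
```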

create_repertoire.Allele

Class that represents a V, D or J allele.

Attributes:

Name Type Description
name str

The IMGT name of the allele

gapped_seq str

The IMGT gapped germline nucleotide sequence

length str

IMGT defined length of the allele

ungapped_seq str

Ungapped germline nucleotide sequence

trim_5 int

Number of nucleotides to be trimmed from 5' end

trim_3 int

Number of nucleotides to be trimmed from 3' end

__init__(name, gapped_seq, length)

Initialises an Allele class instance.

Parameters:

Name Type Description Default
name str

The IMGT name of the allele

required
gapped_seq str

The IMGT gapped nucleotide sequence

required
length str

IMGT defined length of the allele

required

get_trim_length(no_trim_list, trim_dicts)

Chooses trimming lengths for allele.

Adds two attributes to the instance: trim_3, the 3' trimming value, and trim_5, the 5' trimming value.

Parameters:

Name Type Description Default
no_trim_list list

List of 5 Booleans, specifying whether to not trim [all_ends, v_3_end, d_5_end, d_3_end, j_5_end].

required
trim_dicts dict

A dictionary of dictionaries of trimming length proportions by gene family for each segment (V, D or J).

required
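Choosing a trimming length from a table of proportions, and skipping the draw when the matching no_trim_list flag is set, can be sketched as follows. This is an illustration under assumptions rather than the method's implementation: the proportions are invented, the sequence is made up, and choose_trim and apply_trim are hypothetical helpers.

```python
import random


def choose_trim(rng, proportions, no_trim=False):
    """Draw a trimming length from {length: proportion}, or 0 if disabled."""
    if no_trim:
        return 0
    lengths = list(proportions)
    weights = [proportions[n] for n in lengths]
    return rng.choices(lengths, weights=weights, k=1)[0]


def apply_trim(seq, trim_5, trim_3):
    """Remove trim_5 nucleotides from the 5' end and trim_3 from the 3' end."""
    return seq[trim_5:len(seq) - trim_3]


# Invented proportions for a single gene family and end.
toy_proportions = {0: 0.5, 1: 0.3, 2: 0.2}

rng = random.Random(0)
trim_5 = choose_trim(rng, toy_proportions)
trim_3 = choose_trim(rng, toy_proportions, no_trim=True)  # flag set: no trimming
print(trim_5, trim_3, apply_trim("ACGTACGTAC", trim_5, trim_3))
```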