As someone who works with T cell antigen receptor (TCR) and peptide-major histocompatibility complex (pMHC) data, I have found several Python packages very useful for eliminating tedious steps in the data cleaning and feature engineering stages.
tidytcells
The first package I want to highlight is tidytcells (https://github.com/yutanagano/tidytcells). It cleans TCR and MHC gene information, which makes comparing and collating different datasets much easier. For example, it converts gene labels such as “A02:01” to the IMGT standard “HLA-A*02:01”. I run it on all TCR and MHC genes, as well as CDR3s, to create a standardised dataset.
Stitchr
The next package is Stitchr (https://github.com/JamieHeather/stitchr). It takes the V and J gene names and the CDR3 sequence and stitches them into the full-length sequence of a TCR alpha or beta chain, which I then translate to amino acids. I use it in the following way:
import warnings

from Stitchr import stitchr as st
from Stitchr import stitchrfunctions as fxn


def stitch_sequence(v_gene: str, j_gene: str, cdr3: str, species: str) -> str | None:
    """Create full length TCR sequence from V gene, J gene, CDR3 and species information."""

    def create_input_args(args: dict, gene_types: list) -> tuple:
        # Gather the reference data that st.stitch needs for this chain and species
        input_args, chain = fxn.sort_input(args)
        codons = fxn.get_optimal_codons(input_args['codon_usage_path'], input_args['species'])
        j_res, low_conf_js = fxn.get_j_motifs(input_args['species'])
        c_res = fxn.get_c_motifs(input_args['species'])
        tcr_dat, tcr_functionality, partial = fxn.get_ref_data(chain, gene_types, input_args['species'])
        if input_args['extra_genes']:
            tcr_dat, tcr_functionality = fxn.get_additional_genes(tcr_dat, tcr_functionality)
            input_args['skip_c_checks'] = True
        if input_args['preferred_alleles_path']:
            preferred_alleles = fxn.get_preferred_alleles(
                input_args['preferred_alleles_path'],
                gene_types,
                tcr_dat,
                partial,
                chain,
            )
        else:
            preferred_alleles = {}
        return input_args, tcr_dat, tcr_functionality, partial, codons, preferred_alleles, c_res, j_res, low_conf_js

    gene_types = list(fxn.regions.values())

    # If the CDR3 does not start with the conserved C, add both anchors
    # (the same check on the leading C decides whether the trailing F is added)
    start = 'C' if not cdr3.startswith('C') else ''
    end = 'F' if not cdr3.startswith('C') else ''
    cdr3 = start + cdr3 + end

    # Mimic the command line arguments of Stitchr
    args = {
        'v': v_gene,
        'j': j_gene,
        'cdr3': cdr3,
        'species': species.upper(),
        'c': '',
        'l': '',
        'aa': '',
        'name': '',
        'seamless': False,
        '5_prime_seq': '',
        '3_prime_seq': '',
        'extra_genes': False,
        'mode': 'AA',
        'preferred_alleles_path': '',
        'codon_usage_path': '',
        'j_warning_threshold': 3,
        'skip_c_checks': False,
        'skip_n_checks': False,
        'suppress_warnings': False,
        'no_leader': True,
    }
    (
        input_args,
        tcr_dat,
        tcr_functionality,
        partial,
        codons,
        preferred_alleles,
        c_res,
        j_res,
        low_conf_js,
    ) = create_input_args(args, gene_types)

    try:
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            stitched = st.stitch(
                input_args,
                tcr_dat,
                tcr_functionality,
                partial,
                codons,
                input_args['j_warning_threshold'],
                preferred_alleles,
                c_res,
                j_res,
                low_conf_js,
            )
    except ValueError:
        # If it didn't work with a phenylalanine, try with a tryptophan (and vice versa)
        if cdr3[-1] == 'F':
            cdr3 = cdr3[:-1] + 'W'
        elif cdr3[-1] == 'W':
            cdr3 = cdr3[:-1] + 'F'
        args['cdr3'] = cdr3
        (
            input_args,
            tcr_dat,
            tcr_functionality,
            partial,
            codons,
            preferred_alleles,
            c_res,
            j_res,
            low_conf_js,
        ) = create_input_args(args, gene_types)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            stitched = st.stitch(
                input_args,
                tcr_dat,
                tcr_functionality,
                partial,
                codons,
                input_args['j_warning_threshold'],
                preferred_alleles,
                c_res,
                j_res,
                low_conf_js,
            )

    # Pad with N so translation starts in frame, then translate to amino acids
    return fxn.translate_nt('N' * stitched['translation_offset'] + stitched['stitched_nt'])
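The anchor-residue handling in the function above can be looked at in isolation. The sketch below does not call Stitchr. It only lists which CDR3 strings would be passed to it: the first attempt, then the fallback with the final F and W swapped. The helper name anchor_variants is hypothetical, and unlike the function above it returns a single entry when the CDR3 ends in neither F nor W instead of retrying the same string.

```python
def anchor_variants(cdr3: str) -> list[str]:
    """Return the CDR3 strings to try, in order: first attempt, then the F/W swap."""
    # Same rule as in stitch_sequence: a CDR3 without the leading C gets
    # both the leading C and a trailing F added.
    if not cdr3.startswith("C"):
        cdr3 = "C" + cdr3 + "F"
    variants = [cdr3]

    # Fallback used when stitching raises ValueError: swap the final F and W.
    swap = {"F": "W", "W": "F"}
    if cdr3[-1] in swap:
        variants.append(cdr3[:-1] + swap[cdr3[-1]])
    return variants


print(anchor_variants("ASSLGQAYEQY"))
print(anchor_variants("CASSLGW"))
```

Writing the candidates out like this makes it easy to check how a dataset's CDR3 convention (with or without the conserved anchors) interacts with the retry before running the whole pipeline.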
STCRpy
Next, I want to highlight an OPIG tool, STCRpy (https://github.com/oxpig/STCRpy). It makes messy structure data much easier to work with by automatically pairing TCR chains, peptides, and MHC molecules. It also provides functions to compute geometric properties of the interaction, which can in turn be used to engineer features for machine learning applications.
Honourable mention: ANARCI/II
Finally, I want to give an honourable mention to another set of OPIG tools, ANARCI. These tools number TCR, MHC, and antibody sequences. The original ANARCI (https://github.com/oxpig/ANARCI) uses Hidden Markov Models to provide the numbering, and the newer ANARCII (https://github.com/oxpig/ANARCII) uses protein language models. ANARCI is built into STCRpy to renumber structures automatically, but I also use it to extract the CDR regions of TCR sequences by numbering the sequences following IMGT standards.
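Once a sequence is IMGT-numbered, pulling out the CDRs is a matter of filtering by position. The sketch below assumes the numbering arrives as a list of ((position, insertion_code), residue) tuples in sequence order with "-" marking gaps, which is my understanding of how ANARCI reports it. It does not call ANARCI, and the toy numbering is hand-written for illustration. The IMGT ranges used are CDR1 27-38, CDR2 56-65, and CDR3 105-117.

```python
# IMGT CDR boundaries (inclusive). CDR3 here excludes the conserved C104 and F/W118.
IMGT_CDR_RANGES = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def extract_cdrs(numbering: list[tuple[tuple[int, str], str]]) -> dict[str, str]:
    """Collect the residues that fall in each IMGT CDR range, skipping gaps."""
    cdrs = {name: "" for name in IMGT_CDR_RANGES}
    for (position, _insertion_code), residue in numbering:
        if residue == "-":
            continue  # gap position: no residue in this sequence
        for name, (start, end) in IMGT_CDR_RANGES.items():
            if start <= position <= end:
                cdrs[name] += residue
    return cdrs


# Hand-written toy fragment, not real ANARCI output
toy_numbering = [
    ((26, " "), "S"), ((27, " "), "M"), ((28, " "), "N"), ((29, " "), "H"),
    ((30, " "), "-"), ((38, " "), "Y"), ((39, " "), "M"),
    ((104, " "), "C"), ((105, " "), "A"), ((106, " "), "S"),
    ((111, " "), "G"), ((111, "A"), "T"), ((112, " "), "-"),
    ((117, " "), "F"), ((118, " "), "F"),
]
print(extract_cdrs(toy_numbering))
```

Note that the IMGT CDR3 range leaves out the conserved C at position 104 and the F or W at position 118, whereas the stitching function above expects a CDR3 that includes both, so it is worth checking which convention a dataset follows before combining the two steps.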
