As someone who works with T cell antigen receptor (TCR) and peptide-major histocompatibility complex (pMHC) data, I have found several Python packages very useful for eliminating tedious steps in the data cleaning and feature engineering stages.

tidytcells

The first package I want to highlight is tidytcells (https://github.com/yutanagano/tidytcells). It cleans TCR and MHC gene information, which makes comparing and collating different datasets much easier. For example, it converts gene labels such as “A02:01” to the IMGT standard “HLA-A*02:01”. I run it on all TCR and MHC genes, as well as CDR3s, to create a standardised dataset.
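To make the kind of clean-up concrete, here is a minimal pure-Python sketch of the “A02:01” to “HLA-A*02:01” conversion. This is not how tidytcells works internally: the helper name `normalise_hla_label` and the regular expression are my own illustrative assumptions, and the real package checks labels against IMGT reference data and covers far more cases.

```python
import re
from typing import Optional

# Hypothetical helper, for illustration only. It handles simple labels
# with two-digit (or three-digit) colon-separated allele fields.
_HLA_PATTERN = re.compile(
    r"^(?:HLA-)?"                          # optional "HLA-" prefix
    r"(?P<gene>[A-Z]+[0-9]*)"              # gene name, e.g. A, B, DRB1
    r"\*?"                                 # optional "*" separator
    r"(?P<fields>\d{2,3}(?::\d{2,3})*)$"   # allele fields, e.g. 02:01
)


def normalise_hla_label(label: str) -> Optional[str]:
    """Return the label in 'HLA-GENE*FIELDS' form, or None if it is not recognised."""
    cleaned = label.strip().upper().replace(" ", "")
    match = _HLA_PATTERN.match(cleaned)
    if match is None:
        return None
    return f"HLA-{match['gene']}*{match['fields']}"


print(normalise_hla_label("A02:01"))       # HLA-A*02:01
print(normalise_hla_label("hla-b*07:02"))  # HLA-B*07:02
```

Returning None for unrecognised labels, rather than guessing, makes it easy to filter or flag rows that need manual attention before datasets are merged.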

Stitchr

The next package is Stitchr (https://github.com/JamieHeather/stitchr). It takes the V and J gene names and the CDR3 region and builds the full amino acid sequences of the TCR alpha and beta chains. I use it in the following way:

import warnings

from Stitchr import stitchr as st
from Stitchr import stitchrfunctions as fxn

def stitch_sequence(v_gene: str, j_gene: str, cdr3: str, species: str) -> str | None:
    """Create full-length TCR sequence from V gene, J gene, CDR3 and species information."""

    def create_input_args(args: dict, gene_types: list) -> tuple:
        input_args, chain = fxn.sort_input(args)
        codons = fxn.get_optimal_codons(input_args['codon_usage_path'], input_args['species'])
        j_res, low_conf_js = fxn.get_j_motifs(input_args['species'])
        c_res = fxn.get_c_motifs(input_args['species'])
        tcr_dat, tcr_functionality, partial = fxn.get_ref_data(chain, gene_types, input_args['species'])

        if input_args['extra_genes']:
            tcr_dat, tcr_functionality = fxn.get_additional_genes(tcr_dat, tcr_functionality)
            input_args['skip_c_checks'] = True

        if input_args['preferred_alleles_path']:
            preferred_alleles = fxn.get_preferred_alleles(
                input_args['preferred_alleles_path'],
                gene_types,
                tcr_dat,
                partial,
                chain,
            )

        else:
            preferred_alleles = {}

        return input_args, tcr_dat, tcr_functionality, partial, codons, preferred_alleles, c_res, j_res, low_conf_js

    gene_types = list(fxn.regions.values())

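    # Pad the CDR3 to a full junction: the conserved cysteine at the start and
    # the conserved phenylalanine (or tryptophan) at the end, if they are missing.
    start = 'C' if not cdr3.startswith('C') else ''
    end = 'F' if not cdr3.endswith(('F', 'W')) else ''

    cdr3 = start + cdr3 + end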

    args = {
        'v': v_gene,
        'j': j_gene,
        'cdr3': cdr3,
        'species': species.upper(),
        'c': '',
        'l': '',
        'aa': '',
        'name': '',
        'seamless': False,
        '5_prime_seq': '',
        '3_prime_seq': '',
        'extra_genes': False,
        'mode': 'AA',
        'preferred_alleles_path': '',
        'codon_usage_path': '',
        'j_warning_threshold': 3,
        'skip_c_checks': False,
        'skip_n_checks': False,
        'suppress_warnings': False,
        'no_leader': True,
    }

    (
        input_args,
        tcr_dat,
        tcr_functionality,
        partial,
        codons,
        preferred_alleles,
        c_res,
        j_res,
        low_conf_js,
    ) = create_input_args(args, gene_types)

    try:
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            stitched = st.stitch(
                input_args,
                tcr_dat,
                tcr_functionality,
                partial,
                codons,
                input_args['j_warning_threshold'],
                preferred_alleles,
                c_res,
                j_res,
                low_conf_js,
            )

    except ValueError:
        # If it didn't work with a phenylalanine, try with a tryptophan
        if cdr3[-1] == 'F':
            cdr3 = cdr3[:-1] + 'W'

        elif cdr3[-1] == 'W':
            cdr3 = cdr3[:-1] + 'F'

        args['cdr3'] = cdr3

        (
            input_args,
            tcr_dat,
            tcr_functionality,
            partial,
            codons,
            preferred_alleles,
            c_res,
            j_res,
            low_conf_js,
        ) = create_input_args(args, gene_types)

        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            stitched = st.stitch(
                input_args,
                tcr_dat,
                tcr_functionality,
                partial,
                codons,
                input_args['j_warning_threshold'],
                preferred_alleles,
                c_res,
                j_res,
                low_conf_js,
            )

    return fxn.translate_nt('N' * stitched['translation_offset'] + stitched['stitched_nt'])
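The anchor handling in the function above (add a leading C, add a trailing F, then fall back to W if stitching fails) can also be pulled out into a small helper that is testable without Stitchr installed. The sketch below is one possible way to do that; `candidate_junctions` is a hypothetical name of mine, not part of Stitchr, and it assumes that a CDR3 already ending in F or W has its terminal anchor in place.

```python
from typing import Iterator


def candidate_junctions(cdr3: str) -> Iterator[str]:
    """Yield junction strings to try, most likely first.

    Heuristic assumption: a sequence that already ends in F or W is taken
    to include its C-terminal anchor; otherwise F is appended first.
    """
    junction = cdr3.strip().upper()
    if not junction.startswith("C"):
        junction = "C" + junction
    if not junction.endswith(("F", "W")):
        junction += "F"
    yield junction
    # Second attempt: swap the terminal anchor (F <-> W).
    alternative = "W" if junction.endswith("F") else "F"
    yield junction[:-1] + alternative


for junction in candidate_junctions("ASSLGQAYEQY"):
    print(junction)
```

A caller could loop over these candidates and return the first one that stitches without raising ValueError, which would avoid repeating the stitching call in the except branch.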

STCRpy

Next, I want to highlight an OPIG tool, STCRpy (https://github.com/oxpig/STCRpy). It makes messy structure data much easier to work with by automatically pairing TCR chains, peptides, and MHC molecules. It also provides functions to compute geometric properties of the interaction, which can in turn be used to engineer features for machine learning applications.
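To give a flavour of what a geometric feature looks like, here is a small standard-library sketch that computes the angle between two axes, each defined by a pair of 3D points (for example, centroids of selected atoms). This is not STCRpy's API or its implementation; the function names and the toy coordinates are assumptions made purely for illustration, and in practice I would rely on the functions STCRpy provides.

```python
import math
from typing import Sequence, Tuple

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> Tuple[float, float, float]:
    """Mean x, y, z of a non-empty collection of 3D points."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def angle_between_axes(a_start: Point, a_end: Point, b_start: Point, b_end: Point) -> float:
    """Angle in degrees between axis a (a_start -> a_end) and axis b (b_start -> b_end)."""
    a = [e - s for s, e in zip(a_start, a_end)]
    b = [e - s for s, e in zip(b_start, b_end)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("Both axes must have non-zero length")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Toy coordinates (made up): one axis along x, one along the x-y diagonal.
print(round(angle_between_axes((0, 0, 0), (1, 0, 0), (0, 0, 0), (1, 1, 0)), 1))
```

A single scalar like this is straightforward to add as a column in a feature table alongside sequence-derived features.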

Honourable mention: ANARCI/II

Finally, I want to give an honourable mention to another set of OPIG tools, ANARCI and ANARCII. These tools provide numbering for TCR, MHC, and antibody sequences. The original ANARCI (https://github.com/oxpig/ANARCI) uses Hidden Markov Models to assign the numbering, and the newer ANARCII (https://github.com/oxpig/ANARCII) uses protein language models. ANARCI is built into STCRpy to renumber structures automatically, but I also use it to extract the CDR regions of TCR sequences by numbering the sequences according to the IMGT scheme.
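Once a sequence has IMGT numbering, pulling out the CDRs is a matter of selecting position ranges: CDR1 is positions 27 to 38, CDR2 is 56 to 65, and CDR3 is 105 to 117. The sketch below assumes the numbering arrives as a list of ((position, insertion_code), residue) pairs with '-' marking gaps, which mirrors my understanding of ANARCI's output; the function name and the toy input are hypothetical.

```python
from typing import Dict, List, Tuple

# IMGT CDR definitions (inclusive position ranges).
IMGT_CDR_RANGES = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}

# Assumed input shape: ((position, insertion_code), residue), in sequence order.
NumberedResidue = Tuple[Tuple[int, str], str]


def extract_cdrs(numbering: List[NumberedResidue]) -> Dict[str, str]:
    """Collect the residues that fall inside each IMGT CDR range, skipping gaps."""
    cdrs = {name: "" for name in IMGT_CDR_RANGES}
    for (position, _insertion), residue in numbering:
        if residue == "-":
            continue
        for name, (start, end) in IMGT_CDR_RANGES.items():
            if start <= position <= end:
                cdrs[name] += residue
    return cdrs


# Toy numbering, not a real TCR: only a few positions are listed.
toy_numbering = [
    ((26, " "), "S"), ((27, " "), "M"), ((28, " "), "N"), ((29, " "), "-"),
    ((38, " "), "Y"), ((39, " "), "M"), ((56, " "), "S"), ((65, " "), "T"),
    ((104, " "), "C"), ((105, " "), "A"), ((106, " "), "S"), ((111, " "), "G"),
    ((111, "A"), "Q"), ((112, " "), "Y"), ((117, " "), "Y"), ((118, " "), "F"),
]
print(extract_cdrs(toy_numbering))
```

Note that under this definition the CDR3 excludes the conserved C at position 104 and the F/W at position 118, so it is shorter than the junction that Stitchr works with.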

Author

Benjamin McMaster

Oxford Protein Informatics Group

or "OPIG" to friends

Nice TCR processing libraries
