{"id":13984,"date":"2026-02-11T17:07:42","date_gmt":"2026-02-11T17:07:42","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=13984"},"modified":"2026-02-25T10:18:17","modified_gmt":"2026-02-25T10:18:17","slug":"nice-tcr-processing-libraries","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2026\/02\/nice-tcr-processing-libraries\/","title":{"rendered":"Nice TCR processing libraries"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">As someone who works with T cell antigen receptor (TCR) and peptide-major histocompatibility complex (pMHC) data, I have found several Python packages to be very useful for eliminating tedious steps in data cleaning and feature engineering stages.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h1 class=\"wp-block-heading\">tidytcells<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">The first package I want to highlight is tidytcells (<a href=\"https:\/\/github.com\/yutanagano\/tidytcells\">https:\/\/github.com\/yutanagano\/tidytcells<\/a>). This package standardises TCR and MHC gene information, which makes comparing and collating different datasets much easier. For example, it converts a gene label such as \u201cA02:01\u201d to the IMGT standard \u201cHLA-A*02:01\u201d. I run it on all TCR and MHC genes as well as CDR3s to create a standardised dataset.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Stitchr<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">The next package is Stitchr (<a href=\"https:\/\/github.com\/JamieHeather\/stitchr\">https:\/\/github.com\/JamieHeather\/stitchr<\/a>). This package combines the V and J gene names with the CDR3 region to reconstruct full-length TCR alpha and beta chain sequences. 
I use it in the following way:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import warnings\n\nfrom Stitchr import stitchr as st\nfrom Stitchr import stitchrfunctions as fxn\n\n\ndef stitch_sequence(v_gene: str, j_gene: str, cdr3: str, species: str) -&gt; str | None:\n    \"\"\"Create full length TCR sequence from V gene, J gene, CDR3 and species information.\"\"\"\n\n    def create_input_args(args: dict, gene_types: list) -&gt; tuple:\n        input_args, chain = fxn.sort_input(args)\n        codons = fxn.get_optimal_codons(input_args&#091;'codon_usage_path'], input_args&#091;'species'])\n        j_res, low_conf_js = fxn.get_j_motifs(input_args&#091;'species'])\n        c_res = fxn.get_c_motifs(input_args&#091;'species'])\n        tcr_dat, tcr_functionality, partial = fxn.get_ref_data(chain, gene_types, input_args&#091;'species'])\n\n        if input_args&#091;'extra_genes']:\n            tcr_dat, tcr_functionality = fxn.get_additional_genes(tcr_dat, tcr_functionality)\n            input_args&#091;'skip_c_checks'] = True\n\n        if input_args&#091;'preferred_alleles_path']:\n            preferred_alleles = fxn.get_preferred_alleles(\n                input_args&#091;'preferred_alleles_path'],\n                gene_types,\n                tcr_dat,\n                partial,\n                chain,\n            )\n\n        else:\n            preferred_alleles = {}\n\n        return input_args, tcr_dat, tcr_functionality, partial, codons, preferred_alleles, c_res, j_res, low_conf_js\n\n    gene_types = list(fxn.regions.values())\n\n    # If the CDR3 lacks the leading conserved cysteine, treat it as trimmed\n    # and add both anchor residues (C at the start, F at the end)\n    start = 'C' if not cdr3.startswith('C') else ''\n    end = 'F' if not cdr3.startswith('C') else ''\n\n    cdr3 = start + cdr3 + end\n\n    args = {\n        'v': v_gene,\n        'j': j_gene,\n        'cdr3': cdr3,\n        'species': species.upper(),\n        'c': '',\n        'l': '',\n        'aa': '',\n        'name': '',\n        'seamless': False,\n        '5_prime_seq': '',\n        '3_prime_seq': '',\n        'extra_genes': False,\n        
'mode': 'AA',\n        'preferred_alleles_path': '',\n        'codon_usage_path': '',\n        'j_warning_threshold': 3,\n        'skip_c_checks': False,\n        'skip_n_checks': False,\n        'suppress_warnings': False,\n        'no_leader': True,\n    }\n\n    (\n        input_args,\n        tcr_dat,\n        tcr_functionality,\n        partial,\n        codons,\n        preferred_alleles,\n        c_res,\n        j_res,\n        low_conf_js,\n    ) = create_input_args(args, gene_types)\n\n    try:\n        with warnings.catch_warnings():\n            warnings.simplefilter('ignore')\n            stitched = st.stitch(\n                input_args,\n                tcr_dat,\n                tcr_functionality,\n                partial,\n                codons,\n                input_args&#091;'j_warning_threshold'],\n                preferred_alleles,\n                c_res,\n                j_res,\n                low_conf_js,\n            )\n\n    except ValueError:\n        # If it didn't work with a phenylalanine, try with a tryptophan\n        if cdr3&#091;-1] == 'F':\n            cdr3 = cdr3&#091;:-1] + 'W'\n\n        elif cdr3&#091;-1] == 'W':\n            cdr3 = cdr3&#091;:-1] + 'F'\n\n        args&#091;'cdr3'] = cdr3\n\n        (\n            input_args,\n            tcr_dat,\n            tcr_functionality,\n            partial,\n            codons,\n            preferred_alleles,\n            c_res,\n            j_res,\n            low_conf_js,\n        ) = create_input_args(args, gene_types)\n\n        with warnings.catch_warnings():\n            warnings.simplefilter('ignore')\n            stitched = st.stitch(\n                input_args,\n                tcr_dat,\n                tcr_functionality,\n                partial,\n                codons,\n                input_args&#091;'j_warning_threshold'],\n                preferred_alleles,\n                c_res,\n                j_res,\n                low_conf_js,\n            )\n\n    return 
fxn.translate_nt('N' * stitched&#091;'translation_offset'] + stitched&#091;'stitched_nt'])<\/code><\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">STCRpy<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Next, I want to highlight an OPIG tool, STCRpy (<a href=\"https:\/\/github.com\/oxpig\/STCRpy\">https:\/\/github.com\/oxpig\/STCRpy<\/a>). This tool makes messy structure data much easier to work with: it automatically pairs TCR chains, peptides, and MHC molecules, and it provides functions to compute geometric properties of the interaction, which can in turn be used to engineer features for machine learning applications.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Honourable mention: ANARCI\/II<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, I want to give an honourable mention to another set of OPIG tools, ANARCI and ANARCII. These tools number TCR, MHC, and antibody sequences. The original ANARCI (<a href=\"https:\/\/github.com\/oxpig\/ANARCI\">https:\/\/github.com\/oxpig\/ANARCI<\/a>) uses hidden Markov models to provide the numbering, and the newer ANARCII (<a href=\"https:\/\/github.com\/oxpig\/ANARCII\">https:\/\/github.com\/oxpig\/ANARCII<\/a>) uses protein language models. 
ANARCI is built into STCRpy to renumber structures automatically, but I also use it to extract the CDR regions of TCR sequences by numbering the sequences following IMGT standards.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As someone who works with T cell antigen receptor (TCR) and peptide-major histocompatibility complex (pMHC) data, I have found several Python packages to be very useful for eliminating tedious steps in data cleaning and feature engineering stages.<\/p>\n","protected":false},"author":109,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[361,186,221,227,906],"tags":[24,152,140],"ppma_author":[714],"class_list":["post-13984","post","type-post","status-publish","format-standard","hentry","category-data-science","category-immunoinformatics","category-python","category-python-code","category-tcrs","tag-bioinformatics","tag-python","tag-tcrs"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":714,"user_id":109,"is_guest":0,"slug":"benjamin","display_name":"Benjamin 
McMaster","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/d08cff80235bc80063c59072381da602325bce09b5774c2bc0db69545176f359?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"McMaster","first_name":"Benjamin","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13984","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/109"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=13984"}],"version-history":[{"count":1,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13984\/revisions"}],"predecessor-version":[{"id":13985,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13984\/revisions\/13985"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=13984"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=13984"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=13984"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=13984"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}