{"id":9680,"date":"2023-04-25T15:46:41","date_gmt":"2023-04-25T14:46:41","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=9680"},"modified":"2023-04-25T16:21:14","modified_gmt":"2023-04-25T15:21:14","slug":"train-your-own-protein-language-model-in-just-a-few-lines-of-code","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2023\/04\/train-your-own-protein-language-model-in-just-a-few-lines-of-code\/","title":{"rendered":"Train Your Own Protein Language Model In Just a Few Lines of Code"},"content":{"rendered":"\n<p>Language models have taken the world by storm recently and, given the already explored analogies between protein primary sequence and text, there&#8217;s been a lot of interest in applying these models to protein sequences. Interest is coming not only from academia and the pharmaceutical industry, but also from some very unlikely suspects such as ByteDance &#8211; yes, the same ByteDance of TikTok fame. So if you also fancy trying your hand at building a protein language model, read on &#8211; it&#8217;s surprisingly easy. <\/p>\n\n\n\n<p>Training your own protein language model from scratch is made remarkably easy by the HuggingFace Transformers library, which allows you to specify a model architecture, tokenise your training data, and train a model in only a few lines of code. Under the hood, the Transformers library uses PyTorch (or, optionally, TensorFlow) models, allowing you to dig deeper into customising training or model architecture, or simply leave it to the highly abstracted Transformers library to handle it all for you.<\/p>\n\n\n\n<p>For this article, I&#8217;ll assume you already understand how language models work and are now looking to implement one yourself, trained from scratch. <\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">1. Install Required Libraries<\/h2>\n\n\n\n<p>We&#8217;ll be using both the Transformers and Datasets libraries from HuggingFace, as well as PyTorch. 
Installing transformers from conda can sometimes be problematic, so I&#8217;d recommend using pip, which can still be used inside a conda virtual environment. <\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install transformers\npip install datasets<\/pre>\n\n\n\n<p>You&#8217;ll also need to install PyTorch; you can find system-dependent installation guides <a href=\"https:\/\/pytorch.org\/get-started\/locally\/\">here<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Prepare Your Dataset<\/h2>\n\n\n\n<p>With your dataset of protein sequences separated into train, validation, and test sets, we&#8217;ll use the Datasets library from HuggingFace for easy and efficient tokenisation.<\/p>\n\n\n\n<p>Protein language models typically tokenise at the character level (i.e., one token per residue), so you won&#8217;t need to train a custom tokeniser as you would for most text-based language models. Download this example tokeniser, which includes a token for each of the 20 amino acids, as well as padding, N-terminus, and C-terminus tokens (the latter two represented by &#8216;1&#8217; and &#8216;2&#8217;, respectively). 
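<\/p>\n\n\n\n<p>To make residue-level tokenisation concrete, here&#8217;s a minimal sketch of what such a tokeniser does under the hood. The vocabulary and the <code>tokenise<\/code> helper below are hypothetical &#8211; the real token-to-id mapping comes from the downloaded tokeniser files:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># sketch of residue-level tokenisation (hypothetical vocabulary;
# the real token-to-id mapping comes from the tokeniser files)
amino_acids = 'ACDEFGHIKLMNPQRSTVWY'
vocab = {'[PAD]': 0, '1': 1, '2': 2}  # padding, N-term, and C-term tokens
vocab.update({aa: i + 3 for i, aa in enumerate(amino_acids)})

def tokenise(sequence):
    # map each residue to its token id, flanked by the N-term and C-term tokens
    return [vocab['1']] + [vocab[aa] for aa in sequence] + [vocab['2']]<\/pre>\n\n\n\n<p>Each residue becomes one token, so a sequence of length L yields L + 2 token ids once the terminal tokens are added. 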
<\/p>\n\n\n\n<div class=\"wp-block-file\"><a id=\"wp-block-file--media-ed618516-ff31-44fa-87d8-8c04d7388c2b\" href=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2023\/04\/protein_tokeniser.tar\">protein_tokeniser<\/a><a href=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2023\/04\/protein_tokeniser.tar\" class=\"wp-block-file__button wp-element-button\" download aria-describedby=\"wp-block-file--media-ed618516-ff31-44fa-87d8-8c04d7388c2b\">Download<\/a><\/div>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from transformers import AutoTokenizer\n\ntokeniser = AutoTokenizer.from_pretrained('path\/to\/tokeniser', use_fast=True)<\/pre>\n\n\n\n<p>Next, we&#8217;ll load and tokenise the train, validation, and test sets. The Datasets library has some handy features, such as caching any data preparation steps (e.g. tokenisation), meaning these steps won&#8217;t be repeated between training runs. This saves a lot of compute, especially with the large datasets typically used in language models.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from datasets import load_dataset\n\ndata_set_paths = {\"train\": \"\/path\/to\/train.csv\", \"test\": \"\/path\/to\/test.csv\", \"validation\": \"\/path\/to\/val.csv\"}\ndatasets = load_dataset('csv', data_files=data_set_paths, cache_dir='\/where\/to\/store\/the\/cache')\n\n# tokenise the sequences (assuming each CSV has a 'sequence' column)\ntokenized_datasets = datasets.map(lambda batch: tokeniser(batch['sequence']), batched=True)<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>3. Define the model architecture<\/strong><\/h2>\n\n\n\n<p>Choose a model architecture from the Transformers library (check their documentation for a full list). Here, we&#8217;ll use GPT2. 
Customise any model architecture hyperparameters as needed, since we won&#8217;t keep any pretrained model weights:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from transformers import GPT2Config, GPT2LMHeadModel\n\n# n_embd must be divisible by n_head, so use 8 heads with a 512-dimensional embedding\ntransformer_config = GPT2Config(vocab_size=tokeniser.vocab_size, n_layer=12, n_embd=512, n_head=8, n_inner=2048)\n\nmodel = GPT2LMHeadModel(config=transformer_config)<\/pre>\n\n\n\n<p>Remember to move the model to the GPU using <code>model.to('cuda')<\/code> if using CUDA.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>4. Train your language model<\/strong><\/h2>\n\n\n\n<p>Now, it&#8217;s time to train our model! The Trainer class in HuggingFace makes this easy, taking its training parameters from a TrainingArguments instance. Check the TrainingArguments documentation <a href=\"https:\/\/huggingface.co\/docs\/transformers\/v4.28.1\/en\/main_classes\/trainer#transformers.TrainingArguments\">here<\/a> for customisation options, or <a href=\"https:\/\/huggingface.co\/course\/chapter3\/4#the-training-loop\" data-type=\"URL\" data-id=\"https:\/\/huggingface.co\/course\/chapter3\/4#the-training-loop\">write your own training loop using PyTorch<\/a>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments\n\ntraining_args = TrainingArguments(\n    output_dir=\".\/output\",\n    overwrite_output_dir=True,\n    num_train_epochs=2,\n    per_device_train_batch_size=16,\n    per_device_eval_batch_size=16,\n    save_steps=10_000,\n    save_total_limit=2)\n\n# the collator batches sequences and sets up labels for causal language modelling\ndata_collator = DataCollatorForLanguageModeling(tokenizer=tokeniser, mlm=False)\n\ntrainer = Trainer(\n    model=model,\n    data_collator=data_collator,\n    args=training_args,\n    
train_dataset=tokenized_datasets[\"train\"],\n    eval_dataset=tokenized_datasets[\"validation\"],\n)\ntrainer.train()\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>5. Generate samples<\/strong><\/h2>\n\n\n\n<p>After training your language model, you&#8217;ll want to generate new samples from it. There are several ways to sample from a language model, with further details on the methods and their implementation in HuggingFace available <a href=\"https:\/\/huggingface.co\/blog\/how-to-generate\">here<\/a>. Below is an example of generating sequences using top-p sampling.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import math\n\nfrom transformers import GenerationConfig\n\n# sampling hyperparameters (example values)\nnum_return_sequences = 100\nbatch_size = 16\nmax_new_tokens = 300\ntop_p = 0.9\ntemperature = 1.0\n\npad_token_id = tokeniser.pad_token_id\n\n# get the token id for the N-term token ('1')\nbos_token_id = tokeniser.encode('1')[0]\n\nn_batches = math.ceil(num_return_sequences \/ batch_size)\n\n# make sure we're using the GPU\nmodel.to('cuda')\n\n# top-p sampling draws from the token probability distribution, so 'do_sample' is True\ngeneration_config = GenerationConfig(\n    max_new_tokens=max_new_tokens,\n    do_sample=True,\n    top_p=top_p,\n    temperature=temperature,\n    pad_token_id=pad_token_id,\n    bos_token_id=bos_token_id,\n    num_return_sequences=batch_size,\n)\n\ngenerated_token_ids = []\nfor _ in range(n_batches):\n    batch_token_ids = model.generate(generation_config=generation_config)\n    generated_token_ids.append(batch_token_ids)\n\n# flatten the batches of generated token ids\ngenerated_token_ids = [ids for batch in generated_token_ids for ids in batch]\n\n# decode from token ids back to residue letters\ndecoded_sequences = tokeniser.batch_decode(generated_token_ids, skip_special_tokens=True)<\/pre>\n\n\n\n<p>And that&#8217;s it! 
You&#8217;ve now successfully trained a protein language model from scratch and generated new samples from it. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Language models have taken the world by storm recently and, given the already explored analogies between protein primary sequence and text, there&#8217;s been a lot of interest in applying these models to protein sequences. Interest is not only coming from academia and the pharmaceutical industry, but also some very unlikely suspects such as ByteDance &#8211; [&hellip;]<\/p>\n","protected":false},"author":82,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,296,14,189,228],"tags":[],"ppma_author":[667],"class_list":["post-9680","post","type-post","status-publish","format-standard","hentry","category-ai","category-hints-and-tips","category-howto","category-machine-learning","category-protein-structure"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":667,"user_id":82,"is_guest":0,"slug":"olivert","display_name":"Oliver 
Turnbull","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/0b20bd3396ebbc89e8634a762722722d9168249244342ce2a0f4f6d05ed796eb?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/9680","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/82"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=9680"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/9680\/revisions"}],"predecessor-version":[{"id":9686,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/9680\/revisions\/9686"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=9680"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=9680"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=9680"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=9680"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}