OPIG Retreat, 2019

For the third year running, the Oxford Protein Informatics Group of Professors Deane and Morris traveled to a bucolic, remote location for a series of talks (long and lightning), journal clubs, and hands-on practicals—not to mention evenings of quizzes, board games, and an afternoon of exploration of local attractions.

Kington, Herefordshire

Thanks to the organization of OPIG members Mark Chonofsky and Javier Pardo Diaz, five hire cars, and one motorbike, some two dozen of us traveled from Oxford to the rolling hills and orchard country of Herefordshire, and Kington, near the border with Wales. We had the whole YHA Kington to ourselves from Wednesday until Friday, September 18-20, 2019. Our schedule was packed with great talks, and a few opportunities to press, watch people press, or tell people to press, <shift><enter>.


OPIG Retreat 2019: Programme

Wednesday

Session 1

4 pm: Chair: Dan

Eve Richardson: Antibody Journal Club: “Learning Context-aware Structural Representations to Predict Antigen and Antibody Binding Interfaces”

Eve presented a journal club on a new method for antibody epitope and paratope prediction by Srivamshi Pittala and Chris Bailey-Kellogg (available on bioRxiv at https://www.biorxiv.org/content/10.1101/658054v1), which reported best-in-class performance in both tasks. This is the second application of deep learning to paratope prediction and the first for epitope prediction.

The method used a “unified framework” that captured features in the antibody and antigen, with each node corresponding to an amino acid in each sequence. Each amino acid was represented by a feature vector combining a one-hot encoding of the residue in the BLAST query, a one-hot encoded amino acid conservation profile, and properties like solvent accessibility.
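As a rough illustration of the per-residue encoding described above, here is a minimal sketch in plain Python. It is not the authors' code: the alphabet ordering, the function names, and the single solvent-accessibility value appended to each vector are assumptions made for this example, and the conservation profile is left out.

```python
# Illustrative only: the 20 standard amino acids in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Return a length-20 vector with a single 1 at the residue's index."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AA_INDEX[residue]] = 1
    return vec


def encode_residue(residue, solvent_accessibility):
    """Concatenate the one-hot vector with one extra per-residue property."""
    return one_hot(residue) + [solvent_accessibility]


def encode_sequence(sequence, accessibilities):
    """Encode each residue of a sequence; one feature vector per graph node."""
    return [encode_residue(r, a) for r, a in zip(sequence, accessibilities)]


# Hypothetical three-residue fragment with made-up accessibility values.
features = encode_sequence("GYT", [0.8, 0.1, 0.5])
```

Each node ends up with a 21-element vector here; a real model would append further columns for the other features.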

Constantin Schneider: “Antibody Journal Club”

Constantin reviewed a preprint from the Greiff lab on structural vocabulary to describe antigen-antibody binding. He summarized their main claims and discussed their validity, as well as the implications for the antibody field in general and particularly for the machine learning based predictability of the antibody-antigen interaction. See also: https://www.biorxiv.org/content/biorxiv/early/2019/06/03/658054.full.pdf#cite.Daberdaku18

Short talks: Lucian (A), Mark (B), Lyuba (C), Conor (D)


Session 2

5:15 pm: Chair: An

Jack Scantlebury: Deep Learning Journal Club: “In Need of Bias Control: Evaluating Chemical Data for Machine Learning in Structure-Based Virtual Screening”

Jack presented two papers on machine learning for small molecule-protein interaction prediction: one on the limitations of using the DUD-E dataset for training GNINA CNNs [1], and a second comparing the ease with which actives can be distinguished from decoys using biases present in the DUD, DUD-E and MUV datasets [2]. Jack also gave a three-minute flash talk on software he wrote to aid in the creation, submission, and book-keeping of GNINA experiments to SLURM queues.

  1. https://chemrxiv.org/articles/Hidden_Bias_in_the_DUD-E_Dataset_Leads_to_Misleading_Performance_of_Deep_Learning_in_Structure-Based_Virtual_Screening/7886165
  2. https://pubs.acs.org/doi/full/10.1021/acs.jcim.8b00712?src=recsys

James Wilsenach: “3D Image Processing: Unpacking a Convoluted Set of Options”

For those interested in 3D image processing, whether for de-noising or post-processing of light or electron microscopy images, it is often difficult to know what methods and packages are available. James talked about some of the packages available for large 3D image processing, including ImageJ, Vaa3D, Ilastik etc., their limitations, and the key methods available in each.


Short talks: Flo (A), Dominik (B), Elliot (C), Garrett (D)

Session 3

8 pm: Chair: Flo

Aleksandr Kovaltsuk: “Digitalization of adaptive immune repertoires”

Ten years since its inception, next-generation sequencing is now routinely applied to study dynamics of the human immune system. Aleks went through wet lab procedures that must be followed to turn lymphocyte collections into digital information.

Prof. Charlotte M. Deane: “Getting jobs”

Charlotte presented career advice for doctoral students looking for postdoctoral jobs.


The evening of Thursday saw OPIG have a general knowledge quiz. Figuring out which team each person was on, by lining up based on the color of their clothes, proved to be challenging for some…


Thursday

Session 1

9 am: Chair: Carlos

Short talks: An (A), Jack (B), Constantin (C), Clare (D)

Fergus Boyles: “Principles of Software Engineering for Bioinformatics”

We all write code, but how well engineered is it? Fergus discussed some important principles and best practices for research software engineering, with an emphasis on developing robust, sustainable Python projects, as well as topics such as standards, licensing, versioning, and distribution.

Claire Marks: “Making Websites: Things You May Find Useful”

Claire gave an overview of languages, tools, and other things that may be useful when developing websites.


Session 2

10:15 am: Chair: Jack

Prof. Garrett M. Morris: “The Next Ten Years”

Garrett talked about how the field of molecular modelling has evolved from its origins, some recent developments, and covered some of the potential developments in the next ten years. After pointing out the changes in science and technology in previous decades, and early applications of machine learning, he talked about the disruptive technologies and expected improvements in quantity, quality, and metadata about data that we can expect in the next decade. For instance, if current trends continue, there could be nearly half a million structures in the PDB by 2029. With projected improvements in supercomputing, we could move from “computational molecular biology if you will to computational cellular biology”, according to Andy McCammon. Our understanding of the mesoscale should improve, through better confocal microscopy and cryoEM detectors. Improvements in applied machine learning could dramatically affect the way we carry out calculations, including materials design, biomolecular simulations, conformational sampling, retrosynthetic planning, docking, and drug design. The increased availability and ease-of-use of many of these techniques will expand their user base, and help create new opportunities for start-ups, thanks to better interfaces, workflow platforms, cloud computing, and new devices.

But as Abraham Lincoln once said, “The best way to predict your future is to create it.”

Javier Pardo: “Networks journal club”

Javi discussed a Nature Communications paper titled “Network-based prediction of protein interactions”. He described the idea of predicting interactions between proteins that are connected by paths of three edges, which performed better than methods that link protein pairs connected by paths of two edges.
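To make the two-edge versus three-edge distinction concrete, the sketch below counts paths of a given length between non-interacting protein pairs in a toy network. The network, the node names, and the helper functions are hypothetical, and the scores are raw path counts; any normalization used in the paper is deliberately left out.

```python
from itertools import combinations

# Hypothetical toy network: X and Y share no neighbours,
# but are linked by two different three-edge paths.
EDGES = [("X", "U1"), ("X", "U2"), ("U1", "V1"), ("U2", "V1"), ("V1", "Y")]


def build_graph(edges):
    """Store an undirected network as a dict mapping node -> set of neighbours."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def count_paths(graph, source, target, length):
    """Count walks of exactly `length` edges from source to target.

    For non-adjacent pairs and length 2 or 3, every such walk is a simple path.
    """
    if length == 0:
        return 1 if source == target else 0
    return sum(count_paths(graph, nbr, target, length - 1) for nbr in graph[source])


def rank_candidates(graph, length):
    """Score every currently unconnected pair by its number of paths."""
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v not in graph[u]:
            scores[(u, v)] = count_paths(graph, u, v, length)
    return scores


graph = build_graph(EDGES)
two_edge_scores = rank_candidates(graph, 2)
three_edge_scores = rank_candidates(graph, 3)
```

In this toy network the pair (X, Y) scores zero under the two-edge rule, because the two proteins have no common neighbour, but scores two under the three-edge rule.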

Session 3

11:25 am: Chair: Garrett

Elliot Nelson: “Plumbing Pipelines with Luigi”

Luigi is a Python pipelining tool that allows one to run batches of dependent atomic tasks. When does it make sense to use Luigi? Elliot presented some tricks for use alongside cluster scheduling, in particular how to use Luigi as a task scheduler. Elliot’s tutorial is available at https://github.com/nelse003/luigi_tutorial.

Luigi is easy to install: just run pip install luigi to get the latest stable version from PyPI. Documentation for the latest release is hosted on readthedocs.
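The idea of running dependent atomic tasks can be sketched without any third-party package. The toy below is not Luigi's API and is not taken from Elliot's tutorial: the class ToyTask, the function build, and the task names are all made up for illustration. It only mimics the general pattern in which a task counts as done once its output file exists, and dependencies run before the tasks that need them.

```python
import os
import tempfile


class ToyTask:
    """A toy task that is done when its output file exists. Not Luigi's real API."""

    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.output = os.path.join(workdir, name + ".txt")
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.output)

    def run(self):
        # Read upstream outputs and write one line recording what was consumed.
        inputs = []
        for dep in self.requires:
            with open(dep.output) as handle:
                inputs.append(handle.read().strip())
        with open(self.output, "w") as handle:
            handle.write(self.name + "(" + ",".join(inputs) + ")\n")


def build(task, log):
    """Run dependencies first, then the task, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, log)
    task.run()
    log.append(task.name)


workdir = tempfile.mkdtemp()
fetch = ToyTask("fetch", workdir)
clean = ToyTask("clean", workdir, requires=[fetch])
report = ToyTask("report", workdir, requires=[fetch, clean])

first_run = []
build(report, first_run)   # runs fetch, clean, report in dependency order
second_run = []
build(report, second_run)  # nothing to do: every output already exists
```

Because completed outputs are skipped, re-running the pipeline after a failure only repeats the unfinished steps, which is much of the appeal of this style of tool on a cluster.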

Short talks: Eve (A), Fergus Imrie (B), Dan (C), James (D)


Session 4

12:15 pm: Chair: Eve

Marc Mößer: “Machine Learning for Molecule Generation”

Unfortunately, An Goto was ill and unable to join Marc for their presentation on de novo molecule generation using machine learning.

Marc mentioned some of the research An did using generative neural networks based on MolDQN. An’s variant, called RxnDQN, uses a modified reaction space built from available reagents and desired reactions, and was able to propose a molecule very similar to the known drug prilocaine.

Marc gave a short introduction to the field of de novo compound generation, and went on to describe the two most popular generation methods, Recurrent Neural Networks and Variational Autoencoders, in more detail. Using a short Jupyter notebook, he then gave everyone the chance to generate their own molecules with an RNN. The practical required the installation of RDKit and PyTorch.


Free Afternoon

Some of us walked to the Small Breeds Farm Park & Owl Centre; others went walking on Offa’s Dyke Path; some of us drove to forage through the plentiful bookstores in Hay-on-Wye; and some visited the Penrhos Distillery.


Session 5

8 pm: Chair: Claire

Dr Florian Klimm & Dominik Schwarz: “Shiny for dummies”

Shiny is an R library that allows the construction of interactive online applications. Flo and Dominik ran an interactive tutorial that taught how to construct a simple but pretty application that illustrates a co-authorship network and allows the user to investigate its degree distribution (see https://floklimm.shinyapps.io/shinyOPIG/).

The tutorial is now available on GitHub (https://github.com/floklimm/shinyTutorial) and should take between 30 min and an hour, depending on your proficiency with R.


Board games


Friday

Session 1

9 am: Chair: Marc

Short talks: Fergus Boyles (A), Alex (B), Claire (C), Carlos (D)

Mark Chonofsky: Protein Folding Journal Club: “Protein interaction networks revealed by proteome coevolution”

Mark talked about the recent paper that sought to identify all interactions in the E. coli proteome from coevolution analyses. Mark said it is “a really nice paper from the Baker group (https://science.sciencemag.org/content/365/6449/185)”, and that it “dovetails nicely with our work about the fact that amino acid contacts are not the only thing that protein coevolution reveals.”

Dr Dan Nissley: “What MD can do for you?”

Molecular dynamics (MD) is a highly flexible way to learn about molecular systems, but different methods are suited for different problems. I will be providing an overview of simulation methods including all-atom MD, coarse-grain MD, steered MD, and dual-resolution techniques to provide an introduction to the types of problems each can address, the time and length regimes they can access, common practical problems, and how you can select the method suited to your research goals. 

Dan surveyed molecular dynamics methods to update the group on current techniques and their possible applications to the group’s various research goals. The importance of selecting a model based on the length and timescales of the process under study was emphasized.


Session 2

10:15 am: Chair: Dominik

Catherine Wong: “Sending millions of queries and breaking someone else’s server”

Catherine (Wing Ki) Wong talked about the basics of programmatically accessing the web. She introduced two Python packages for interacting with RESTful APIs and client-side JavaScript, and ways to extract data from the HTML source code of the results page. She gave a few bonus tips on how to prevent your own website from being scraped and how to ethically access others’ websites. This tutorial introduced the basic pipeline of querying (scraping) a web server and parsing the results in Python. Modules: requests, Selenium and BeautifulSoup.
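The parsing half of that pipeline can be sketched with the standard library alone. This example swaps in Python's built-in html.parser in place of BeautifulSoup, and parses a hard-coded string instead of a downloaded page, so it makes no network requests. The page content, the class name, and the identifiers in the table are hypothetical.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a cell that appears outside any row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical results page; a real script would first download this text.
PAGE = """
<html><body><table>
<tr><td>P12345</td><td>kinase</td></tr>
<tr><td>Q67890</td><td>protease</td></tr>
</table></body></html>
"""

parser = TableCellParser()
parser.feed(PAGE)
rows = parser.rows
```

When querying a live server, pausing between requests and respecting the site's terms of use are in the spirit of the advice on accessing others’ websites ethically.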

Conor Wild: “Handling crystallographic data with Conor”

Crystallographic data can be notoriously hard to process. Conor explored some of the ways such data can be loaded and manipulated in Python and C++, including using GridDataFormats, Clipper and CCTBX.


Session 3

11:25 am: Chair: Constantin

Carlos Outeiral: “Let quantum computing solve your real-life problems”

Carlos gave a tutorial on adiabatic quantum optimization, outlining how to use it to solve a simple combinatorial optimization problem, and how to minimize the number of qubits required by exploiting symmetry and other heuristics. Unfortunately, there was not enough time to submit a calculation to a real D-Wave quantum computer.

Lucian Chan: “Recent Advances In Bayesian Optimization and Application in Chemistry”

Lucian discussed recent advances in Bayesian optimization and potential applications in chemistry, including conformer generation. Lucian also presented ways to incorporate prior knowledge into the Bayesian optimization framework, in order to speed up the search.


Session 4

12:15 pm: Chair: Clare

Lyuba Bozhilova: “Adventures in data wrangling”

Lyuba talked about cross-referencing a couple of databases and combining the information with an ontology or two, and about the pitfalls of data parsing and data cleaning. She gave an overview of what protein interaction data is like and the different ways of representing it, and briefly outlined what ontology data is like. Overall, Lyuba emphasized the importance of good coding practice when it comes to data parsing and data cleaning.

Lyuba compared the pros and cons of data frames, adjacency matrices, and graph objects, when representing protein-protein interaction networks, and summarized the sanity checks and tests to run when cleaning data from protein-protein interaction networks. One sage piece of advice from Lyuba: “talk to someone who knows more about the data than you do”.
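A small sketch of the kind of cleaning and sanity checks involved is given below. It is an illustration rather than Lyuba's own code: the protein identifiers, the function names, and the particular checks (dropping self-interactions, merging reversed duplicates, testing the matrix for symmetry) are assumptions chosen for the example.

```python
# Hypothetical raw interaction list with two typical problems:
# a duplicate reported in reverse order and a self-interaction.
RAW_EDGES = [("P1", "P2"), ("P2", "P1"), ("P2", "P3"), ("P3", "P3")]


def clean_edges(edges):
    """Drop self-loops and treat (a, b) and (b, a) as the same interaction."""
    cleaned = set()
    for a, b in edges:
        if a != b:
            cleaned.add(tuple(sorted((a, b))))
    return sorted(cleaned)


def to_adjacency_matrix(edges):
    """Return the sorted node list and a symmetric 0/1 matrix as nested lists."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return nodes, matrix


def passes_sanity_checks(matrix):
    """An undirected network without self-loops is symmetric with a zero diagonal."""
    n = len(matrix)
    symmetric = all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))
    no_self_loops = all(matrix[i][i] == 0 for i in range(n))
    return symmetric and no_self_loops


edges = clean_edges(RAW_EDGES)
nodes, matrix = to_adjacency_matrix(edges)
ok = passes_sanity_checks(matrix)
```

The edge list is compact and easy to join against other tables, while the matrix makes checks such as symmetry a one-liner; which form is preferable depends on the size of the network and what is being computed.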

Fergus Imrie: “Small molecule journal club”: “Deep learning enables rapid identification of potent DDR1 kinase inhibitors”

Insilico Medicine recently reported in Nature Biotechnology the identification of nanomolar inhibitors of DDR1 kinase 46 days after target identification with only one design-make-test cycle using a deep learning-based de novo generative model.  

Fergus’s analysis reflected the criticism the paper received from many quarters: many perceived it as a nice virtual screening paper on a well-characterized target family. He noted, however, the relatively extensive range of experimental validation conducted. Perhaps their most interesting contribution was the use of self-organizing maps (SOMs) as reward functions and tensor-train decomposition to parameterise the structure of the learned manifold.


Session 5

2 pm: Chair: Conor

Short talks: Javi (A), Catherine (B), Marc (C), Charlotte (D)

Clare West: “Research and Policy”

Clare gave a fascinating overview of her Fellowship at the Houses of Parliament and the Parliamentary Office of Science and Technology (POST). She described how research is used in Parliament, and offered advice on how researchers can contribute and have policy impact. (See also Clare’s great Blopig post on her experiences: https://www.blopig.com/blog/2019/03/not-proteins-in-parliament/).


Most of us left around 3 pm, although a few stayed overnight and returned on Saturday.


Recipe for Käsespätzler

Dominik and Flo prepared a delicious German delicacy called “Käsespätzler”, and some have since asked for the recipe—so here it is:

Ingredients

  • 2 eggs per person
  • 100 g flour per person
  • 100-200 g cheese per person (mature cheddar works best, mozzarella, gouda as well, and blue cheese in smaller amounts if you’d like)
  • ~1/2 onion per person
  • garlic
  • salt
  • pepper
  • nutmeg
  • oil

Cut onions into half rings, add a little bit of garlic if you wish, and fry in a pan with a bit of oil; the longer, the better. Start to boil a large pot of water filled half to two thirds, and salt generously. Meanwhile, mix eggs, flour and salt (not too little but still reasonable) into a dough of dripping consistency. See video at 1:45 https://www.youtube.com/watch?v=Lwn97mpBCu8 where it is nicely visible, but please ignore the rest of the video. You can adjust the consistency with milk, water, or flour. Pre-heat the oven to 180-200°C.

When dough is done and salted water is boiling, start making the Spätzle with your favourite type of equipment; there are many options but definitely the easiest and most reliable would be: https://www.amazon.co.uk/Fackelmann-45443-Spaetzle-Grater-Scraper/dp/B003OBY45E/ref=sr_1_5?keywords=sp%C3%A4tzle&qid=1569336920&sr=8-5

Usually, the water stops boiling when the Spätzle drop into the water, so wait until it boils again (~30-45 s) then take them out into a different pot or sieve. When all the dough is used, layer Spätzle, fried onions and cheese (and a little bit of salt, pepper, and nutmeg) in a deep baking tray or something similar. Finish with a huge layer of cheese. Bake it to your liking (15-30 minutes); cheese crust and appetite are usually the competing interests.

—Dominik Schwarz
