OPIG Retreat 2022 | Oxford Protein Informatics Group

Finally, after two years of social distancing, we were able to continue the tradition of OPIGtreat – a 2-3 day escape to the countryside for a packed schedule of talks and fun.

This year, the lovely YHA Wilderhope Manor in Shropshire was chosen by Lewis, our trip organizer. With a hostel in the middle of nowhere, with no phone signal, this trip promised to be an exciting get-away from our plugged-in lives at the university.

Tobias, Carlos and Guy (Leo and Martin not in the picture) tossing a frisbee around in front of the manor house/hostel

To balance the fun of exploring the surrounding fields, board games and quizzes, we had everyone give a 20-minute presentation on a research-related topic of choice.

Highlights

Excellent scientific talks on a wide range of topics (read more below)
Beautiful walks in the countryside, made even better by the sheep+lambs and cows+calves (cue Olly trying, unsuccessfully, to manifest all the lambs coming up to him for cuddles)
Phenomenal food organised by Fergus – a multi-dish, Michelin-star menu cooked for 24 people each night! The whole group chipped in cooking & cleaning up

TUESDAY (Day 1)

Session 1

Matt: Pan SARS-like coronavirus vaccines: impossible or just a matter of time?

There is great interest in the development of a pan-neutralising vaccine, which induces the production of antibodies that can neutralise many coronaviruses.
Pan-neutralisers should avoid immunodominant epitopes in favour of subdominant (‘cryptic’) epitopes to decrease the likelihood of escape mutations.
However, there are potential limitations to targeting cryptic epitopes, including limited accessibility (and therefore lack of avidity) and lower evolutionary conservation.
Future directions include increased consideration of T cell responses (as viruses seem less likely to mutate to evade this) and affinity maturation (which has been shown to increase breadth of neutralising activity).

Carlos: How to write fast code for fun and profit

The golden rule is that ‘premature optimization is the root of all evil’ (Donald E Knuth).
Key takeaways from this talk, abiding by this rule, are: 1) measure before you optimise, 2) someone has already thought of a better solution, 3) plan “how do I make it easy to optimise” rather than “how do I optimise it”.
Carlos shared lots of helpful tips to make code faster (once you’ve reached a point where your code needs to be optimised and figured out which parts of your code are the slowest) including parallelising, vectorising or compiling your code.
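As a small illustration of the "measure before you optimise" advice, here is a minimal sketch that times two ways of building the same list with the standard-library timeit module. The function names and the workload are made up for this example and are not taken from Carlos's talk.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


def time_call(func, n, repeats=5, number=20):
    """Return the best total time (seconds) for `number` calls of func(n)."""
    timer = timeit.Timer(lambda: func(n))
    return min(timer.repeat(repeat=repeats, number=number))


if __name__ == "__main__":
    n = 10_000
    # Check that both versions agree before comparing their speed.
    assert squares_loop(n) == squares_comprehension(n)
    for func in (squares_loop, squares_comprehension):
        print(f"{func.__name__}: {time_call(func, n):.4f} s")
```

Timings will differ between machines, so the useful habit is the comparison itself: only rewrite the slow part once a measurement like this says it matters.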

Guy: Sonification in Bioinformatics and Cheminformatics

Sonification is the representation of inaudible data using non-speech audio (ie music).
Recently, sonification has been applied to chemical molecules by a program called SAMPLES: the key is derived from a sum of the molecule’s molecular properties and the notes are derived from tokens in the SELFIES representation of the molecule.
Separately, sonification is being extended for proteins (encoding primary, secondary and tertiary structure).
Overall, this can be a fun and exciting way to represent molecules but its importance or added benefit over other representations remains somewhat unclear.

Olly: Statistical Paradoxes and Gotchas

Our resident statistician – one of two in the Oxford Protein Informatics Group, despite the group residing in the Department of Statistics – fulfilled his role by sharing some of the paradoxes, gotchas and pitfalls in statistics.
These included, but were by no means limited to, multiple testing (you have to correct for this!), problematic permutation tests, and Simpson’s paradox (one option can have the higher success rate in every subpopulation yet the lower success rate in the combined population, so be wary when data is presented split when it doesn’t need to be).
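Simpson's paradox is easiest to believe with numbers in front of you. The sketch below uses illustrative counts (in the style of the often-quoted kidney stone treatment example, not data from Olly's talk) where treatment A wins in both subgroups but loses once the subgroups are pooled. The names counts, rate and pooled_rate are just for this example.

```python
# (successes, trials) per treatment and subgroup; illustrative counts only.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate for one treatment within one subgroup."""
    return successes / trials


def pooled_rate(groups):
    """Success rate after merging all subgroups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return successes / trials


if __name__ == "__main__":
    for subgroup in ("small", "large"):
        a = rate(*counts["A"][subgroup])
        b = rate(*counts["B"][subgroup])
        print(f"{subgroup}: A={a:.3f}  B={b:.3f}")
    print(f"pooled: A={pooled_rate(counts['A']):.3f}  B={pooled_rate(counts['B']):.3f}")
```

The reversal comes from unequal group sizes: A is mostly tried on the harder "large" cases, which drags its pooled rate down.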

Werewolf

On our first night together, the 24 of us played two rounds of Werewolf. During the first round of playing, the werewolves triumphantly took over the village – only a single werewolf was identified. In the second round, the werewolves successfully sowed dissent but ultimately they were found out and the villagers saved themselves.

WEDNESDAY (Day 2)

Session 2

Tom [& Lucy in spirit]: RDKit fun – practical tips to work with RDKit

Much great advice was shared including the following nuggets of wisdom:
Drawing the chemical structure is easy with the rdkit.Chem.Draw module. If things look odd, use the flatten function because the molecule probably includes 3D coordinates.
Molplotly can create scatter plots where hovering over a data point shows the molecular structure – fancy!
When working with a Jupyter notebook, use nglview to visualize protein-ligand interactions in style.

Jesse: Paper presentation “BACPI: a bi-directional attention neural network for compound-protein interaction and binding affinity prediction”

Structure-free approaches for protein-compound binding do not use any 3D coordinates.
BACPI represents proteins as three-letter words (n-grams with n = 3) and compounds using graph neural networks.
A bi-directional attention mechanism lets the network update compound and protein features simultaneously.
Link to paper: https://pubmed.ncbi.nlm.nih.gov/35043942/
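To make the "three-letter words" idea concrete, here is a minimal sketch of splitting a protein sequence into 3-grams and giving each one an integer index. It assumes overlapping windows with a step of one residue, which is an assumption for illustration; the function names and the toy sequences are hypothetical and this is not the BACPI code.

```python
def protein_ngrams(sequence, n=3):
    """Split a protein sequence into overlapping n-letter 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_vocabulary(sequences, n=3):
    """Map every n-gram seen in the sequences to an integer index."""
    vocab = {}
    for seq in sequences:
        for word in protein_ngrams(seq, n):
            # First time a word is seen, it gets the next free index.
            vocab.setdefault(word, len(vocab))
    return vocab


if __name__ == "__main__":
    toy_sequences = ["MKTAY", "KTAYI"]  # made-up sequences
    vocab = build_vocabulary(toy_sequences)
    encoded = [vocab[word] for word in protein_ngrams(toy_sequences[0])]
    print(vocab)
    print(encoded)
```

The integer indices are what an embedding layer would consume, so the network never sees 3D coordinates, only which words occur and in what order.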

Fabian: Paper presentation: “Iterative Refinement Graph Neural Network For Antibody Sequence-Structure Co-Design“

This paper proposes an architecture that co-designs sequence and structure to generate complementarity-determining region (CDR) sequences with desired properties.
Further research should focus on the relation between sequence and structure, the CDR interactions with the framework, and on optimizing for antibody properties.
Link to paper: https://arxiv.org/pdf/2110.04624.pdf

Fergus: How to share data

Better than sharing data on our websites is using a data archive like the university-endorsed figshare.com. This will allow others to easily find and cite your data.
Licenses let others know how you allow them to use your data. There are very easy-to-use template licenses (https://creativecommons.org/about/cclicenses/). Even if you do not care how others use your data, just add a permissive license to obviate questions and save yourself the trouble of answering them.
The FAIR data principles offer guidance for managing scientific data. The acronym says that scientific data should be findable, accessible, interoperable, and reusable.

Session 3

Steph & Gemma: A guide to fragment-based drug discovery

In comparison to high-throughput screening, fragment-based drug discovery uses a smaller number (less than 500) of lighter compounds (less than 300 daltons).
These smaller fragments need to be elaborated into full-sized ligands, which can be done via growing, linking, or merging.
A conformation of a molecule is one placement of its atoms in 3D space. ETKDG is a commonly used conformer generation algorithm.

Patrick: CRISPR Degron Tags for Target Validation Screening

To understand drugs, we need to understand what a drug’s target protein does in a cell and what the phenotype is with or without this protein.
To test this, many different gene knockout strategies have been devised, including RNAi, CRISPR knockout, Cas13 RNA cleavage, PROTACs and degron tags.
Degron tags basically add an “off switch” to a protein. They are great because they are small molecules with drug-like properties and they are unlikely to interfere with the endogenous protein activity. However, off-target activity and the inability to use fluorescent HaloTag ligands might be disadvantages.

Afternoon
Part of the group ventured out to enjoy the scenery and, more importantly, find the nearest pub. After 30 minutes of walking, the reward was slim: we learned the pub stays closed on Tuesdays! However, the walk was not a complete disappointment, as there were lots of adorable newborn lambs to be seen.

(Tom disagrees, saying “this was the worst part of the whole trip” for him).

Dinner
On our second night, we had a series of home-cooked delicious curries: saag tofu, lentil daal and a chickpea curry. Who knew Fergus was such a fine chef!

However, we did have somewhat of a mishap! We found out – in this time of great need, with >20 cans needing opening – that the hostel’s can opener was dulled and completely unusable. Olly performed some magic with a knife and is fortunate to have escaped with all 10 fingers.

Session 4

Broncio: 5 Things That You Can Do Right Now to Enhance the Reproducibility of Your Computational Research

Use version control now: think “git”
Manage your environment: “conda” and “mamba” being two popular options
Test on another machine: try a container if you do not have a second computer at hand
Document, log, and monitor: you are doing it for yourself too!
Test your code: test code in smaller chunks or “units”
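As an example of the last point, here is a minimal sketch of a unit test using Python's built-in unittest module. The function under test, hamming_distance, is a made-up example and not something from Broncio's talk.

```python
import unittest


def hamming_distance(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


class TestHammingDistance(unittest.TestCase):
    def test_identical_sequences(self):
        self.assertEqual(hamming_distance("ACDE", "ACDE"), 0)

    def test_counts_mismatches(self):
        self.assertEqual(hamming_distance("ACDE", "ACDF"), 1)

    def test_rejects_unequal_lengths(self):
        # Invalid input should fail loudly rather than return a wrong number.
        with self.assertRaises(ValueError):
            hamming_distance("ACD", "ACDE")
```

Saved to a file, the tests can be run with `python -m unittest <filename>`. Each test checks one small behaviour, so a failure points at the unit that broke.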

Anna & Eve: Women in STEM & the Authority Gap

There is a gender imbalance in computer science – too few women enter the field and stay in it
Gender-science stereotypes place men in natural sciences and women in liberal arts – try the Harvard Implicit Association Test if you think you are resistant to such biases
What can we all do? Actively listen to women – show that you are listening and let her finish. Apply this to all channels of communication – in person, reading research publications, and on social media.
The book “The Authority Gap” by Mary Ann Sieghart is a great read

We spent the first two days viewing presentations projected onto a wall with poor image contrast & a small area using a projector we had brought with us – only to find the hidden basement with a fancy projector! This was the first session that benefited from the new setup.

The first presentation setup with a projector balancing on a hassock on a chair

We later moved to the downstairs movie theater – better projector but now the laptops had to be balanced on two stacked highchairs (and sometimes a few books as well)

We are such dedicated scientists, with talks running until 10pm, so the quiz was rescheduled to the next night!

THURSDAY (Day 3)

Session 5

Leo & Martin: Bayesian Neural Networks

To get Bayesian Neural Networks from Neural Networks, one first assumes a prior distribution over the model’s parameter space. To make predictions, one uses the posterior to weight the predictions of all these possible models.
The math of calculating the posterior is hard for complicated models – more often than not, there is no closed-form solution.
Deterministic and sampling-based approximations are commonly used – deep ensembles are seen by some as an alternative as well.
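The "weight every model by its posterior" idea can be sketched on a toy problem where the posterior is cheap to compute exactly. The example below is deliberately not a neural network: it assumes a one-parameter model y = w * x with Gaussian noise, a uniform prior over a grid of candidate weights, and made-up observations. All names are hypothetical.

```python
import math


def gaussian_log_likelihood(w, data, noise_std):
    """Log-likelihood of the data under y = w * x + Gaussian noise."""
    total = 0.0
    for x, y in data:
        residual = y - w * x
        total += -0.5 * (residual / noise_std) ** 2
        total -= math.log(noise_std * math.sqrt(2 * math.pi))
    return total


def grid_posterior(weights, prior, data, noise_std=1.0):
    """Posterior probability of each candidate weight (Bayes' rule on a grid)."""
    log_post = [math.log(p) + gaussian_log_likelihood(w, data, noise_std)
                for w, p in zip(weights, prior)]
    max_log = max(log_post)  # subtract the max for numerical stability
    unnormalised = [math.exp(lp - max_log) for lp in log_post]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def posterior_predictive_mean(x_new, weights, posterior):
    """Average the predictions of all candidate models, weighted by the posterior."""
    return sum(p * w * x_new for w, p in zip(weights, posterior))


if __name__ == "__main__":
    weights = [i / 10 for i in range(0, 41)]      # candidate slopes 0.0 to 4.0
    prior = [1 / len(weights)] * len(weights)     # uniform prior (assumption)
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # made-up observations
    posterior = grid_posterior(weights, prior, data)
    print(posterior_predictive_mean(4.0, weights, posterior))
```

With millions of weights a grid like this is impossible, which is exactly why the approximations mentioned above are needed.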

Ruben: An Introduction to Cryo-EM

Ruben gave us insight into how data is analysed and resolution determined for single-particle cryo-EM.
Data analysis involves the central slice theorem and Fourier Transform.
Limitations include protein size (needs to be large), low signal-to-noise ratio (SNR), non-uniform orientation coverage and heterogeneity (eg different conformations and flexibility).
Resolution is often calculated as spectral self-consistency between two half-maps (a spectral cross-correlation, commonly reported as the Fourier shell correlation), but note that these maps are not entirely independent!
Resolution is not the same across the whole map (dependent on flexibility, heterogeneity, radiation damage and angular assignment errors). In particular, it is always better at the core because angular errors propagate with distance from the core – therefore resolution should be considered locally!

Conor: Making readable console output in Python with rich

Conor discussed rich, a package that can help make any console output from a Python script readable and visually appealing.
Using rich, you can ensure visual hierarchy, consistency and whitespace, as well as include a progress bar.
Additionally, rich formats tracebacks nicely, which is useful for developers.

Session 6

Bora: Unpacking Packing: Side Chain Modelling for Downstream Design Work

Side chain packing is an important first step in computational protein design (as the saying goes: Garbage In, Garbage Out).
Traditional approaches utilised a rotamer library and a hand-designed scoring function.
Recent developments in the field are making use of deep learning, including image transformation using a convolutional neural network (DLPacker); an end-to-end, SE(3)-equivariant deep graph transformer (AttnPacker); and self-supervised learning of side chain orientations by a graph neural network (GeoPPI).
Open questions remain, including about the conformations of surface residues (do they exist in multiple conformations? do we have enough information to predict conformations?) and what this means for designing protein binders.

Alissa: Designing a Protein Binder In silico: Target-Agnostic vs. Target-Aware?

Methods for designing a protein binder can be broadly classified into two groups: target-agnostic (which do not explicitly take the structure of the target into account in the design process) and target-aware (which do).
A target-agnostic design could include generating a large number of candidates (eg from a database of natural sequences, such as OAS) and subsequently evaluating them based on binding and other properties.
For target-aware design, a number of methods have been developed with many relying on incorporating a known interaction motif (eg hotspot grafting, Rosetta design or network hallucination around a motif), although there have been recent advances in design based only on the target structure.
These approaches involve a tradeoff between specificity for binding to a particular site of biological or therapeutic interest (target-aware) and increased design space – and the potential to access a higher-affinity space (target-agnostic).

Lewis: Making and hosting a personal website using GitHub Pages

At a time when having a professional online presence is more important than ever, Lewis gave a helpful talk demonstrating how to make and host a personal website using GitHub Pages.
Helpfully, there are templates (eg HTML5 UP) available, which make creating a website easy and fast – just download, add to your GitHub repo and personalise!
Creating a website is only step 1 though and more work is required to make it visible.
Make sure to request that your site is indexed in the Google Search Console (so that it shows up in Google searches), add a sitemap file, add “meta tags” and link your website on other platforms (eg LinkedIn, GitHub and Twitter – and vice versa).
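For the sitemap step, a sitemap is just a small XML file listing the pages you want indexed. Here is a minimal sketch that builds one with Python's standard library. The example.github.io addresses are placeholders and build_sitemap is a hypothetical helper, not something shown in Lewis's talk.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder addresses; replace them with the pages of your own site.
    pages = [
        "https://example.github.io/",
        "https://example.github.io/publications.html",
    ]
    print(build_sitemap(pages))
```

Saving the output as sitemap.xml in the root of the repository is one way to make it available to submit in the search console.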

Maranga: Developments in the use of Quantum Computing (magic) in drug discovery

Quantum computing promises an enormous speed-up for certain calculations (including molecular simulations), but is still limited by challenges in decoherence and noise.
While a full-scale quantum computer is not yet feasible, Noisy Intermediate-Scale Quantum Computers (NISQs) are being developed.
These, combined with algorithmic advances, have enabled minimum energy conformations to be calculated for larger (albeit still small) molecules with reduced computational cost and fewer qubits.
There are still many limitations and unaddressed questions, but this is certainly an exciting space to watch!
Additionally, Maranga masterfully peppered the entire talk with out-of-context but hilariously relevant Teams messages posted by Carlos (our other quantum computing expert).

Afternoon
Those of us who weren’t napping either explored the countryside (although the explorers received a warning of an impending cow stampede and quickly moved on!) or played board games (7 Wonders and Trivial Pursuit). Unfortunately the Trivial Pursuit was from 1991 – from a time when Pluto was still counted as a planet (before being subsequently demoted) – making for some very tricky questions.

Peaceful looking cows ready to stampede any moment

Dinner
Another great dinner was cooked under the direction of head chef Fergus. The night’s menu was Italian Pasta, featuring spaghetti with either a basil avocado pesto or a tomato and vegetable sauce!

After dinner, we celebrated Lewis’ birthday with cake. It turns out Bora can create a masterpiece with strawberries (inspired by Lewis’ request to have his likeness depicted on the cake)!

Session 7

Brennan & Tobias: Getting comfortable with PyTorch

Brennan and Tobias gave a great overview of ways to use PyTorch to build your own deep learning model implementations, as well as use existing tools.
They demonstrated building a simple example E(n)-equivariant GNN for antibody CDR-H3 loop structure prediction (nearly) from scratch with PyTorch.
Additionally, they covered useful tools for preparing your data – including sklearn train_test_split, PyTorch DataLoader and PyTorch Data – and removing the need to write boilerplate code (and therefore bugs) by using PyTorch Lightning.
A final, valuable tip is to use a logger (eg Neptune.ai) for keeping track of your model training!
You can view the Google Colab notebook here.
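For readers who have not met train/test splitting before, here is a minimal standard-library sketch of the idea behind a random split. It is a stand-in for illustration only: simple_train_test_split is a hypothetical function, not sklearn's implementation and not code from the notebook.

```python
import random


def simple_train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = simple_train_test_split(range(10), test_fraction=0.2)
    print("train:", train)
    print("test:", test)
```

Fixing the seed means the same examples end up in the test set every time, which keeps model comparisons fair.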

Charlotte: Is BLAST dead?

In her talk, Charlotte explored some key topics in the future of bioinformatics
1) Over the past decades (and significantly, in the past years), bioinformatics has been moving from (less conserved and information poor) DNA sequences → protein sequences → protein structures → function (more conserved and information rich). In particular, the last step towards function represents the next big hurdle.
2) Machine learning (ML) has been applied extensively, and in many cases to great success, to a wide range of biological questions. However, we need to move beyond simply applying ML to a point where we know when the model is wrong, understand its limitations and can anticipate how it will fail.
3) A priority will become generating experimental data specifically for computational methods. This will likely be aided by robotics.
4) We need to continue moving beyond static structures to protein motion & flexibility. This may be accelerated by AlphaFold2 as researchers refocus their efforts to areas less well addressed by structure prediction.

Quiz
On our last night together, we tested ourselves on the questions that quiz master Bora had selected. Would you have known the answers to the following questions: Where does the word cappuccino come from? What are the seven wonders of the ancient world and where were they? What are the first names of the band members of the Jackson Five? We all failed miserably on the last question, but the groups whose members had played the 7 Wonders board game earlier in the day certainly had an advantage on the seven wonders question!

The group is puzzling over Bora’s quiz questions on our last night together

Departure

After three fun days in Shropshire, we returned to the university city. The retreat provided a great mix of excellent science, delicious food and fantastic company – a wonderful trip! In addition to establishing stronger bonds between members of the group, everyone expanded their horizons and learned about a new area of research.

Additionally – unsurprising for those of you familiar with OPIG – many jokes were made and laughs had! We are also very pleased to say that no one caught COVID on the trip.

Thanks to Broncio (and a few other group members) who captured all these memories in such lovely photos!
