CAML: Courses in Applied Machine Learning

*Shameless self-promotion klaxon!! Have a look at my new website!*

I’m excited to share a project I’ve been working on for the past few months! One of the biggest challenges of working on an interdisciplinary research project is getting to grips with the core principles of the disciplines which you don’t have much formal training in. For me, that means learning the basics of Medicinal Chemistry and Structural Biology so that when someone mentions pi-stacking I don’t think they’re talking about the logistics of managing a bakery; for people coming from Bio/Chem backgrounds it can mean understanding the Maths and Statistics necessary to make sense of the different algorithms which are central to their work.

In my experience, beyond wading through pages and pages of algebra, the best way of understanding a particular algorithm is to implement it yourself programmatically; if you can implement it, then you must understand how it works on some level.

I’ve created a website, catchily named opencaml.github.io, which helps people implement popular machine learning algorithms themselves and develop their understanding of how they work. The ‘CAML’ stands for ‘Courses in Applied Machine Learning’, whilst the ‘open’ denotes that all of the resources on the website are freely available for anyone to use (and also that caml.github.io wasn’t available).

Which Algorithms Are Supported?

Currently, there are 12 modules, which cover the implementation of:

  • Linear Regression
  • Logistic Regression
  • K-Means
  • Principal Components Analysis
  • Linear Discriminant Analysis
  • K-Nearest Neighbours
  • Decision Trees
  • Bagging
  • Random Forests
  • Neural Networks
  • Kernel Regression
  • Bayesian Linear Regression

How Do The Modules Work?

The core of each module is a Jupyter Notebook. In the notebook, we generate and visualise a dataset, then implement the algorithm to run on that dataset before checking that the algorithm is working as we’d expect it to. Your task is to fill in the pieces of code that we’ve redacted so that the algorithm can be used on the dataset. There are two levels of difficulty: one where we’ve taken out almost all of the code, and one where we’ve left some of the trickier parts in and provide hints and tips to help you. A completed notebook is provided in each module to be used as a solution.

*Part of the ‘Empty’ notebook provided for the Principal Components Analysis module.*
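To give a flavour of the format, here's a rough sketch (illustrative only, not taken from the actual notebook) of the kind of function the Principal Components Analysis module builds towards; in the 'Empty' version, most of the body would be redacted for you to fill in:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal components.

    Illustrative sketch only -- the real notebooks redact parts of a
    similar function and ask you to fill them in.
    """
    # Centre the data so each feature has zero mean
    X_centred = X - X.mean(axis=0)

    # Eigendecomposition of the covariance matrix
    cov = np.cov(X_centred, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # eigh returns eigenvalues in ascending order; keep the largest ones
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]

    # Project the centred data onto the chosen components
    return X_centred @ components

# Quick sanity check on random data
X = np.random.randn(200, 5)
X_reduced = pca(X, n_components=2)
print(X_reduced.shape)  # (200, 2)
```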

Each notebook can either be downloaded as a .ipynb file and run locally in Python, or completed in your browser without needing to download anything. Click here to access the online notebook for K-Means (it might take a minute to load).
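For a sense of what the K-Means notebook works towards, here's a deliberately bare-bones sketch of Lloyd's algorithm. This is just an illustration of the idea, not the notebook's own code:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """A bare-bones version of Lloyd's algorithm for K-Means clustering."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct points from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Three well-separated blobs to check the clustering behaves sensibly
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])
centroids, labels = k_means(X, k=3)
print(centroids)
```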

Building Your Intuition

As well as getting you to write some code, we provide resources to help you develop your understanding of what exactly each algorithm is doing: each module page links to freely available online material, and in the notebooks we explore some of each algorithm’s key concepts in more detail.

*Examining the effect of changing the model bandwidth on predictive performance in Kernel Regression.*
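If you'd like to play with that idea outside the notebook, here's a small self-contained sketch of a Nadaraya-Watson kernel regressor with a Gaussian kernel (again illustrative, not the module's exact code); trying a few bandwidths on noisy data shows the familiar trade-off between overfitting the noise and over-smoothing the signal:

```python
import numpy as np

def gaussian_kernel(distances, bandwidth):
    # Unnormalised Gaussian weights; the normalising constant cancels below
    return np.exp(-0.5 * (distances / bandwidth) ** 2)

def kernel_regression(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson estimator: a weighted average of the training targets,
    with weights that decay with distance from the query point."""
    predictions = []
    for x in x_query:
        weights = gaussian_kernel(np.abs(x_train - x), bandwidth)
        predictions.append(np.sum(weights * y_train) / np.sum(weights))
    return np.array(predictions)

# Noisy sine data: a small bandwidth chases the noise,
# a large bandwidth over-smooths the underlying curve
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 100))
y_train = np.sin(x_train) + 0.2 * rng.standard_normal(100)
x_query = np.linspace(0, 2 * np.pi, 200)

for bandwidth in (0.05, 0.5, 3.0):
    y_pred = kernel_regression(x_train, y_train, x_query, bandwidth)
    mse = np.mean((y_pred - np.sin(x_query)) ** 2)
    print(f"bandwidth={bandwidth}: MSE against the true curve = {mse:.4f}")
```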

Let Me Know What You Think!

If you’re looking to develop your understanding of what goes on under the hood of machine learning, please do try out some of the modules; I’d love to hear any feedback on what you think. I’m currently working on adding modules on Gaussian Mixture Models and Gaussian Processes, but if there are any additional modules you’d like to see included, let me know and I’ll see what can be done!
