Useful metrics and their meanings

Short and selfish blog here. Probably been done before, but I shall carry on regardless. I am going to review some metrics relevant to our area of Immunoinformatics. In other words, I will try to dissect things such as perplexity, logits, pTM, pLDDT and the ABodyBuilder2 confidence score. These numbers can help inform us about the likelihood that a prediction is correct, and whether we should have confidence in it.

Logits: A raw logit is just an unnormalised score a model assigns to an event. So, if we are predicting the next amino acid in a sequence, we will have a value associated with each of the 20 potential amino acids. We can softmax this distribution of logits so that the values become probabilities summing to 1, giving us something we can compare with the distributions for other positions in the sequence. Or we can take the argmax (the index of the highest of the 20 logits) to find the amino acid the model thinks is most likely at the position of interest. The key point is that logits must be transformed to get something useful.
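Both transformations are a couple of lines of Python. A minimal sketch, using made-up logit values for illustration:

```python
import math

# A toy vector of 20 logits, one per amino acid (hypothetical values).
logits = [2.1, -0.5, 0.3, 1.7] + [0.0] * 16

def softmax(xs):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
best_index = max(range(len(logits)), key=lambda i: logits[i])  # argmax

print(round(sum(probs), 6))  # 1.0 — probabilities sum to 1
print(best_index)            # 0 — the amino acid with the largest logit (2.1)
```

Subtracting the maximum before exponentiating does not change the result but prevents overflow when logits are large.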

Perplexity: If we have a sequence scored by a language model, it is useful to know whether the probabilities of the predicted tokens are indicative of something the model has seen before (i.e. the probabilities of the tokens are high), or whether they suggest the model had difficulty deciding which token belonged at each position, i.e. it was ‘surprised’ or ‘perplexed’ by the task. For a given sequence we can compute a single value that describes this. Higher values suggest that the token probabilities were lower and the model was more perplexed by the sequence. Lower values suggest that the model was not perplexed, and that the patterns of the sequence were likely present in the training data.

It is calculated as the exponentiated average negative log-likelihood of the tokens in a sequence. This is best explained with simple examples. Below we have the probabilities of the predicted tokens in two sequences, each four amino acids in length. Each value is the probability of the top predicted token at that position (it comes from the softmax distribution of the logits assigned to each of the 20 amino acids at that position, described above).

Moderate perplexity sequence token probabilities: 0.3, 0.1, 0.2, 0.2

Low perplexity sequence token probabilities: 0.8, 0.9, 0.5, 0.7

Example 1:

Negative Log-Likelihoods:

-log(0.3) ≈ 1.20

-log(0.1) ≈ 2.30

-log(0.2) ≈ 1.61

-log(0.2) ≈ 1.61

Average: (1.20 + 2.30 + 1.61 + 1.61) / 4 ≈ 1.68

Perplexity: e^1.68 ≈ 5.37

Example 2:

Negative Log-Likelihoods:

-log(0.8) ≈ 0.22

-log(0.9) ≈ 0.11

-log(0.5) ≈ 0.69

-log(0.7) ≈ 0.36

Average: (0.22 + 0.11 + 0.69 + 0.36) / 4 ≈ 0.35

Perplexity: e^0.35 ≈ 1.42

This shows how the sequence with higher token probabilities gets a lower perplexity, i.e. the model was less surprised by what it was predicting.
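The worked examples above can be reproduced in a few lines of Python. Note that the low-perplexity sequence comes out at ≈1.41 when using unrounded log values, versus the 1.42 obtained from the two-decimal intermediates above:

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the token probabilities."""
    nlls = [-math.log(p) for p in token_probs]
    return math.exp(sum(nlls) / len(nlls))

moderate = [0.3, 0.1, 0.2, 0.2]
low = [0.8, 0.9, 0.5, 0.7]

print(round(perplexity(moderate), 2))  # 5.37
print(round(perplexity(low), 2))       # 1.41
```

Equivalently, perplexity is the reciprocal of the geometric mean of the token probabilities, which is why a sequence of uniformly high probabilities drives the value towards 1.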

Hopefully that has related logits to perplexity and should make sense for protein language models. Onto predictions of 3D coordinates next…

Relevant sequence to structure metrics

pLDDT: This stands for ‘predicted Local Distance Difference Test’ and was developed for use with AlphaFold2. The score indicates how reliable the predicted backbone coordinates are; it is not an explicit measure of side-chain accuracy, which is just assumed to correlate. The score ranges from 0–100, with anything above 70 being quite good, while less than 50 indicates disorder or an unreliable prediction.

The key word here is ‘predicted’, as the score is generated by a model trained on an existing metric called lDDT-Cα (Local Distance Difference Test, based on C-alpha atoms). During AlphaFold2 training, the lDDT-Cα of the modelled structure versus the ground truth is calculated, and a prediction head is trained to estimate it. This pLDDT predictor has essentially learned the relationship between patterns in the AF2 predictions and the likely lDDT-Cα scores.

pTM: The Predicted Template Modeling score (pTM) is often mentioned in papers on protein structure predictors. Like pLDDT, the key word is ‘predicted’: it builds on the older TM-score developed for tasks like template search and homology modelling. The score is not given per residue but is a global metric describing how close the prediction is to the ground truth, or to the template used. It sums a Levitt–Gerstein weight over all aligned residue pairs, and the normalisation makes the final value independent of protein size. Values range from 0 to 1, with scores above roughly 0.5 generally indicating the two structures share the same fold.
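To make the underlying TM-score concrete, here is a minimal sketch (of the plain TM-score, not pTM itself, which is predicted by the network). It assumes the two structures are already superimposed and the residues are paired one-to-one, with distances in Ångströms:

```python
def tm_score(distances, length):
    """TM-score: mean Levitt-Gerstein weight over aligned residue pairs.

    distances: distance d_i between each aligned residue pair after superposition.
    length: length of the target protein; dividing by it (together with the
    size-dependent d0) is what removes the dependence on protein size.
    """
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8  # size-dependent distance scale
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length

# Toy example: a 100-residue protein where every aligned pair sits 2 A apart.
print(round(tm_score([2.0] * 100, 100), 2))  # 0.77
```

Because each pair contributes at most 1 and the sum is divided by the length, a perfect superposition gives exactly 1.0, and large deviations are bounded rather than dominating the score as they would in an RMSD.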

ABodyBuilder2 (ABB2) confidence score: If you have used ABB2 you will have noticed it also outputs a confidence score, written in the PDB B-factor column for each residue. Unlike pLDDT this is not derived from a trained model; instead it is a measure of agreement (deviation among model predictions) between the outputs of the four models in the ensemble. If the coordinates output by all four models are close together, then the RMSD between the aligned structures will be low. However, if the models predict highly divergent structures, the RMSDs between them will be higher. The calculation used in the ABB2 confidence score is not actually an RMSD but the deviation from the mean of the four predictions; the principle, though, is similar.
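The idea can be sketched as follows. This is a hedged illustration of an ABB2-style per-residue agreement score, not the exact calculation inside ABodyBuilder2: for each residue we measure how far each ensemble member's CA coordinate lies from the mean CA position across the ensemble.

```python
import math

def per_residue_deviation(ensemble):
    """ensemble: list of structures, each a list of (x, y, z) CA coordinates.

    Returns, per residue, the average distance of each model's CA atom from
    the mean CA position across the ensemble (higher = less agreement).
    """
    n_models = len(ensemble)
    n_res = len(ensemble[0])
    scores = []
    for i in range(n_res):
        coords = [model[i] for model in ensemble]
        mean = tuple(sum(c[k] for c in coords) / n_models for k in range(3))
        dev = sum(math.dist(c, mean) for c in coords) / n_models
        scores.append(dev)
    return scores

# Toy ensemble of 4 "models", each with 2 residues: the models agree closely
# on residue 1 but diverge on residue 2, so residue 2 scores worse.
ensemble = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (8.0, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (5.0, 3.0, 0.0)],
    [(0.0, 0.0, 0.1), (5.0, 0.0, 3.0)],
]
devs = per_residue_deviation(ensemble)
print(devs[0] < devs[1])  # True: the divergent residue gets the higher deviation
```

No superposition step is shown here; in practice the four structures would be aligned before comparing coordinates, just as for an RMSD calculation.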

Hope this was useful!
