Optimising for PR AUC vs ROC AUC – an intuitive understanding

When training a machine learning (ML) model, our main aim is usually to get the ‘best’ model out the other end in an unbiased manner. Of course, there are other considerations such as quick training and inference, but mostly we want to be good at predicting the right answer.

A number of factors will affect the quality of our final model, including the chosen architecture, optimiser, and – importantly – the metric we are optimising for. So, how should we pick this metric?

A recent preprint from McDermott et al.[1] highlights that optimising an ML model for the areas under the precision-recall curve (PR AUC) or the receiver operating characteristic curve (ROC AUC) can have quite different effects. Using maths, the authors illustrate that while optimising for ROC AUC is unbiased, optimising for PR AUC is biased towards improving certain predictions before others.

I encourage you to check out their preprint, but I also wanted to present an intuitive (no maths) explanation of why optimising for PR AUC and optimising for ROC AUC differ.

  • Suppose we have 100 data points – 90 negative (label ‘0’) and 10 positive (label ‘1’)
  • We train a simple ML classifier on this data and obtain some decent predictions (better than random)
  • Then, we order our data labels, ‘l’, according to their predictions, ‘p’, from low to high (a code sketch of this setup follows the arrays below)

p: [0.01, 0.03, 0.04, ..., 0.90, 0.93, 0.98]
l: [0, 0, 0, ..., 1, 0, 0, 0, 1, 0, 0, 1, 0]
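To make this setup concrete, here is a minimal Python sketch that mirrors the example above. The class balance matches the bullets, but the predictions are simulated – the beta distributions and random seed are my own illustrative choices, not anything from the preprint.

```python
import numpy as np

rng = np.random.default_rng(0)

# 90 negatives and 10 positives, as in the example above
labels = np.array([0] * 90 + [1] * 10)

# Simulated 'decent' predictions: positives tend to score higher than negatives
preds = np.where(labels == 1,
                 rng.beta(5, 2, size=labels.size),   # positives skew towards 1
                 rng.beta(2, 5, size=labels.size))   # negatives skew towards 0

# Order the labels 'l' by their predictions 'p', low to high
order = np.argsort(preds)
p, l = preds[order], labels[order]
```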

We can now correct each ‘atomic mistake’, as McDermott et al. describe. An atomic mistake in our example is where we have a label of ‘1’ immediately followed by a label of ‘0’, i.e. a negative data point with a higher prediction than an adjacent positive one. Three of these atomic mistakes occur in the subset of our example above (the three ‘1, 0’ pairs).
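In code, spotting these atomic mistakes is just a scan over the sorted labels for a ‘1’ immediately followed by a ‘0’. The helper below is my own (the preprint defines the concept, not this function), applied to the illustrative tail of the label array shown above.

```python
# Tail of the labels sorted by prediction (low to high), as shown above
l_subset = [1, 0, 0, 0, 1, 0, 0, 1, 0]

def atomic_mistakes(sorted_labels):
    """Indices i where a positive ('1') is immediately followed by a negative ('0')."""
    return [i for i in range(len(sorted_labels) - 1)
            if sorted_labels[i] == 1 and sorted_labels[i + 1] == 0]

print(atomic_mistakes(l_subset))  # -> [0, 4, 7], i.e. three atomic mistakes
```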

Now suppose we correct just one of these atomic mistakes. First, we might correct the atomic mistake with the highest predictions (furthest right). This simply involves swapping the ‘…, 1, 0]‘ to ‘…, 0, 1]‘ and leaving everything else unchanged.

However, instead of correcting the first atomic mistake, we could choose to correct the third one (furthest left). This would be done similarly, again swapping the ‘…, 1, 0, …‘ to ‘…, 0, 1, …‘ and leaving everything else unchanged.

The impact of this choice – fixing the first or the third atomic mistake – may not be immediately clear, but the figure below shows the difference. While the ROC AUC improves by the same amount whichever atomic mistake we fix, the PR AUC improves far more when we correct the first one (compare the area between the green and black curves with that between the red and black curves; in both plots the dashed grey lines indicate the performance of random guessing).
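You can also see this numerically. The figure is the authors’; the rough sketch below is my own simulation, reusing the toy data from the earlier sketch and using scikit-learn’s average_precision_score as the PR AUC estimate. It fixes a single atomic mistake at either end of the ranking and compares the change in each metric.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Re-create the toy data from the earlier sketch
rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)
preds = np.where(labels == 1,
                 rng.beta(5, 2, size=labels.size),
                 rng.beta(2, 5, size=labels.size))

order = np.argsort(preds)
p, l = preds[order], labels[order]

# All atomic mistakes: a '1' immediately followed by a '0' in the sorted labels
mistakes = [i for i in range(len(l) - 1) if l[i] == 1 and l[i + 1] == 0]

def fix_one(sorted_labels, i):
    """Correct one atomic mistake by swapping the adjacent '1, 0' pair at position i."""
    fixed = sorted_labels.copy()
    fixed[i], fixed[i + 1] = fixed[i + 1], fixed[i]
    return fixed

base_roc = roc_auc_score(l, p)
base_pr = average_precision_score(l, p)

for name, i in [("highest-prediction mistake", mistakes[-1]),
                ("lowest-prediction mistake", mistakes[0])]:
    fixed = fix_one(l, i)
    print(f"Fixing the {name}:"
          f" ΔROC AUC = {roc_auc_score(fixed, p) - base_roc:.4f},"
          f" ΔPR AUC ≈ {average_precision_score(fixed, p) - base_pr:.4f}")
```

The ΔROC AUC is identical in both cases (one swap out of 90 × 10 possible positive–negative pairs), while the ΔPR AUC is noticeably larger when the highest-prediction mistake is fixed.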

Because the PR AUC metric is biased towards fixing ‘high prediction’ atomic mistakes first, optimising for it could favour fixing higher-prevalence, more easily identifiable subpopulations of your data (subpopulations you may or may not know exist). In some instances this might not be a concern, but in others, as McDermott et al. state, it very much should be. (Note that ROC AUC, or accuracy, which behaves similarly to ROC AUC, is often the default optimisation metric in many ML frameworks anyway.)

Despite PR AUC’s bias as an optimisation metric, I do still believe PR AUC can be a good evaluation metric and that it should certainly be reported alongside ROC AUC, F1-score, MCC, Balanced Accuracy, and any of your other favourite metrics, at least in your SIs! I am particularly keen for additional metrics to be reported alongside ROC AUC as I have previously found ROC AUC to be more easily ‘hacked’ by baseline studies[2]. Additionally, high ROC AUCs can mislead those unfamiliar with statistics or a given field of research into believing a problem is solved when in fact it is far from it.
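If it helps, a small reporting helper along these lines (my own sketch, with a 0.5 decision threshold assumed for the thresholded metrics – that threshold is a choice, not a given) makes it easy to log all of these side by side with scikit-learn:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score, f1_score,
                             matthews_corrcoef, balanced_accuracy_score)

def report_metrics(y_true, y_score, threshold=0.5):
    """Report ranking metrics and thresholded metrics together."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)  # thresholded predictions
    return {
        "ROC AUC": roc_auc_score(y_true, y_score),
        "PR AUC (average precision)": average_precision_score(y_true, y_score),
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Balanced accuracy": balanced_accuracy_score(y_true, y_pred),
    }

# Usage: report_metrics(test_labels, model_scores) with your own data
```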

Finally, I would also encourage everyone to examine the shapes of their PR and ROC curves, not just the AUCs. Depending on the application, you may only require high precision up to a moderate recall and not care if the precision drops sharply beyond a certain cutoff.
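As a final sketch (again with simulated data standing in for a real model’s labels and scores), plotting the full curves alongside the AUCs only takes a few lines:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_curve

# Simulated stand-ins for real labels and model scores
rng = np.random.default_rng(0)
y_true = np.array([0] * 90 + [1] * 10)
y_score = np.where(y_true == 1,
                   rng.beta(5, 2, size=y_true.size),
                   rng.beta(2, 5, size=y_true.size))

precision, recall, _ = precision_recall_curve(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)

fig, (ax_pr, ax_roc) = plt.subplots(1, 2, figsize=(9, 4))
ax_pr.plot(recall, precision)
ax_pr.set(xlabel="Recall", ylabel="Precision", title="PR curve")
ax_roc.plot(fpr, tpr)
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="grey")  # random-guess baseline
ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curve")
plt.tight_layout()
plt.show()
```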

[1] https://arxiv.org/abs/2401.06091
[2] https://academic.oup.com/bioinformatics/article/39/1/btac732/6825310
