{"id":8734,"date":"2022-11-01T14:38:22","date_gmt":"2022-11-01T14:38:22","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=8734"},"modified":"2023-03-03T14:40:53","modified_gmt":"2023-03-03T14:40:53","slug":"am-i-better-performance-metrics-unravelled","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2022\/11\/am-i-better-performance-metrics-unravelled\/","title":{"rendered":"Am I better? Performance metrics unravelled"},"content":{"rendered":"\n<p>What&#8217;s the deal with all these numbers? Accuracy, Precision, Recall, Sensitivity, AUC and ROCs. <\/p>\n\n\n\n<p><strong>The basic stuff:<\/strong><\/p>\n\n\n\n<p>Given a method that produces an outcome, either categorical (classification) or continuous (regression), we want to know how well our method did. Let&#8217;s start simple:<\/p>\n\n\n\n<p><strong>True positives<\/strong> <strong>(TP)<\/strong>: You said something was a cow and it was in fact a cow &#8211; duh.<\/p>\n\n\n\n<p><strong>False positives (FP)<\/strong>: You said it was a cow and it wasn&#8217;t &#8211; sad.<\/p>\n\n\n\n<p><strong>True negatives (TN):<\/strong> You said it was not a cow and it was not &#8211; good job.<\/p>\n\n\n\n<p><strong>False negatives (FN)<\/strong>: You said it was not a cow but it was a cow &#8211; do better.<\/p>\n\n\n\n<p>I can optimise these metrics artificially. Just call everything a cow and I have a 100% true positive rate. We are usually interested in a trade-off, weighing the relative value of these metrics. 
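<\/p>\n\n\n\n<p>As a minimal sketch (labels made up for illustration, with 1 = cow and 0 = not cow), the four counts can be tallied directly:<\/p>

```python
# Toy data, invented for illustration: 1 = "cow", 0 = "not cow"
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # said cow, was a cow
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # said cow, was not
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # said not cow, was not
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # said not cow, was a cow

print(tp, fp, tn, fn)  # 4 1 2 1
```

<p>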
This gives us:<\/p>\n\n\n\n<!--more-->\n\n\n\n<p><strong>Accuracy<\/strong>: Accuracy is about the truth:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7BTP+%2B+TN%7D%7BTP+%2B+TN+%2B+FN+%2B+FP%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{TP + TN}{TP + TN + FN + FP}\" class=\"latex\" \/><\/code><\/pre>\n\n\n\n<p>However, we should be careful: I can get a really high accuracy if my data contains an imbalance of positives and negatives. In fact, if I just call everything a cow, my baseline accuracy is equal to the proportion of positives in the dataset!<\/p>\n\n\n\n<p><strong>Precision<\/strong>: How many of the things I call cows are actually cows? Or, more formally:<\/p>\n\n\n\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7BTP%7D%7BTP+%2B+FP%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{TP}{TP + FP}\" class=\"latex\" \/>\n\n\n\n<p><strong>Recall<\/strong>: How many of the actual cows did I call cows? <\/p>\n\n\n\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7BTP%7D%7BTP+%2B+FN%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{TP}{TP + FN}\" class=\"latex\" \/>\n\n\n\n<p>You can combine precision and recall into a single score, the F1-score, using their harmonic mean (remember, harmonic means are appropriate for rates).<\/p>\n\n\n\n<p>You may have heard of sensitivity, which is another word for recall. Specificity is the true negative rate. The big thesaurus is here: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\">https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall<\/a><\/p>\n\n\n\n<p><strong>The less basic stuff<\/strong><\/p>\n\n\n\n<p>We can report ROC (receiver operating characteristic) curves. These plot the TPR against the FPR &#8211; the extra R is for <em>rate<\/em>. A curve above the diagonal indicates better-than-random performance. 
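<\/p>\n\n\n\n<p>As a rough sketch (scores and labels made up for illustration), you can trace the ROC points yourself by sweeping a decision threshold over the predicted scores:<\/p>

```python
# Invented scores and labels: sweep a threshold to trace the ROC curve
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

P = sum(y_true)      # number of positives
N = len(y_true) - P  # number of negatives

roc = []
for thr in sorted(set(scores), reverse=True):
    tp = sum(1 for s, t in zip(scores, y_true) if s >= thr and t == 1)
    fp = sum(1 for s, t in zip(scores, y_true) if s >= thr and t == 0)
    roc.append((fp / N, tp / P))  # one (FPR, TPR) point per threshold

print(roc[-1])  # (1.0, 1.0) -- everything called a cow
```

<p>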
The higher the curve, the better the method. Often it is a waste of ink though, with all the key information bunched up and lots of unused white space. Instead we could report the area under this curve (AUC). This has a nice interpretation:<em> the AUC is the probability that the classifier will rank a randomly selected positive example higher than a randomly selected negative example.<\/em> It may be clear from this interpretation that if there are lots of positives or negatives the AUC can be inflated.<\/p>\n\n\n\n<p>The precision-recall curve plots the precision against the recall (obviously). The higher the curve, the better. Again, it suffers the same visual issues as a ROC curve. Again, the area under it (the AUPRC) can be interpreted as the average precision (over all values of recall). However, it has similar issues to the ROC curve because it ignores true negatives. <\/p>\n\n\n\n<p>The problem with these approaches is that they only care about <strong>average<\/strong> performance and do not take into account the practical value of incorrect\/correct answers. The way around this is to report these metrics over a range of practical values of concern. For example, what is the AUC up to a 1% false positive rate? What is the precision at a recall of 0.1? A practical value is application specific (sorry, you have to think!).<\/p>\n\n\n\n<p><strong>The hard stuff (technical)<\/strong><\/p>\n\n\n\n<p>When we are making predictions, we should think of assigning a score, S, to those predictions. A score tells us how good those predictions are; examples of scores are the log-score (related to entropy\/likelihood), the Brier score (related to quadratic loss and mean squared error), and the spherical score. The score depends on what you are interested in: if you want to err on the side of caution, go for the log-score; if you can be a bit more liberal, choose the spherical score. 
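<\/p>\n\n\n\n<p>As a rough sketch (forecasts made up for illustration), here is what the log-score and Brier score look like for a handful of probabilistic predictions:<\/p>

```python
import math

# Invented probabilistic forecasts p for the event y = 1
y = [1, 0, 1, 1, 0]
p = [0.9, 0.2, 0.6, 0.8, 0.1]

# Log-score (higher is better): mean log-probability assigned to the outcome
log_score = sum(math.log(pi if yi == 1 else 1 - pi)
                for yi, pi in zip(y, p)) / len(y)

# Brier score (lower is better): mean squared error of the probabilities
brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

print(round(log_score, 3), round(brier, 3))
```

<p>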
There&#8217;s no perfect metric &#8211; you have to think about what&#8217;s valuable to you (or your field) to pick the correct score. I&#8217;ll leave the technicalities to Wikipedia (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Scoring_rule\">https:\/\/en.wikipedia.org\/wiki\/Scoring_rule<\/a>).<\/p>\n\n\n\n<p><strong>A metric checklist<\/strong><\/p>\n\n\n\n<p>Here&#8217;s a checklist for reporting metrics to determine whether your method is better: <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Report multiple performance metrics and be clear about imbalance in your data.<\/li>\n\n\n\n<li>Report metrics that are appropriate for your problem.<\/li>\n\n\n\n<li>Report ranges for performance metrics.\n<ol class=\"wp-block-list\">\n<li>You can subsample\/perturb your test data or training data to get an idea of uncertainty.<\/li>\n\n\n\n<li>Explain what your range means &#8211; how did you make those error bars?<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li>Avoid dichotomising your predictions &#8211; evaluate on the real scale.<\/li>\n\n\n\n<li>Compare to a baseline.\n<ol class=\"wp-block-list\">\n<li>Good baselines are hard to choose, but random predictions and weighted averages are the place to start.<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li>Avoid metrics that only capture average performance &#8211; quote values of practical interest.<\/li>\n\n\n\n<li>Interpret your performance gains in words.<\/li>\n\n\n\n<li>If generalisation is important, evaluate on carefully constructed test sets.<\/li>\n\n\n\n<li>Check the calibration of your forecasts if they come with scores.\n<ol class=\"wp-block-list\">\n<li>Be careful with approaches such as expected calibration error that only report an average result.<\/li>\n\n\n\n<li>Report calibration at practical levels of interest.<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li>Be careful with metrics that are difficult to interpret and may not evaluate what you want.\n<ol class=\"wp-block-list\">\n<li>Looking at you, AIC, BIC, DIC, MDL, learning curves, elbow method, 
silhouette, Dunn Index, R<sup>2<\/sup><\/li>\n<\/ol>\n<\/li>\n\n\n\n<li>There is no harm in evaluating on simulated data:\n<ol class=\"wp-block-list\">\n<li>Pathologies are clearest when you know everything.<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>What&#8217;s the deal with all these numbers? Accuracy, Precision, Recall, Sensitivity, AUC and ROCs. The basic stuff: Given a method that produces an outcome, either categorical (classification) or continuous (regression), we want to know how well our method did. Let&#8217;s start simple: True positives (TP): You said something was a cow and it was [&hellip;]<\/p>\n","protected":false},"author":77,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[278],"tags":[],"ppma_author":[499],"class_list":["post-8734","post","type-post","status-publish","format-standard","hentry","category-statistics"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":499,"user_id":77,"is_guest":0,"slug":"oliver","display_name":"Oliver 
Crook","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/6a2e38da4fe0d5fdced1939b5da8306ab0749dc303f68da68002e54f31f42a68?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8734","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/77"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=8734"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8734\/revisions"}],"predecessor-version":[{"id":8963,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8734\/revisions\/8963"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=8734"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=8734"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=8734"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=8734"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}