{"id":5984,"date":"2020-08-04T16:17:35","date_gmt":"2020-08-04T15:17:35","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=5984"},"modified":"2020-08-04T16:33:58","modified_gmt":"2020-08-04T15:33:58","slug":"learning-from-biased-datasets","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2020\/08\/learning-from-biased-datasets\/","title":{"rendered":"Learning from Biased Datasets"},"content":{"rendered":"\n<p>Both the beauty and the downfall of learning-based methods lie in the fact that the data used for training will largely determine the quality of any model or system. <\/p>\n\n\n\n<p>While there have been numerous algorithmic advances in recent years, the most successful applications of machine learning have been in areas where either (i) you can generate your own data in a fully understood environment (e.g. <a href=\"https:\/\/www.nature.com\/articles\/nature24270\">AlphaGo<\/a>\/<a href=\"https:\/\/science.sciencemag.org\/content\/362\/6419\/1140\">AlphaZero<\/a>), or (ii) data is so abundant that you&#8217;re essentially training on &#8220;everything&#8221; (e.g. <a href=\"https:\/\/openai.com\/blog\/better-language-models\/\">GPT2\/3<\/a>, CNNs trained on ImageNet).<\/p>\n\n\n\n<p>This covers only a narrow range of applications, and most data falls into neither of these two categories. Unfortunately, when this is true (and even sometimes when you <strong><span style=\"text-decoration: underline\">are<\/span><\/strong> in one of those rare cases), your data is almost certainly biased &#8211; you just may or may not know it.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>This can have drastic consequences for any model you train using such data. 
In the world of structure-based scoring functions, this has recently been reported in three separate publications (<a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/acs.jcim.8b00712\">here<\/a>, <a href=\"https:\/\/journals.plos.org\/plosone\/article?id=10.1371\/journal.pone.0220113\">here<\/a>, and <a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fphar.2020.00069\/full\">here<\/a>).<\/p>\n\n\n\n<p>There are two clear strategies for overcoming such issues: (i) fix or remove such biases from the data, or (ii) develop algorithms that can learn despite the presence of such biases.<\/p>\n\n\n\n<p>My interest in this topic continues to grow, and in OPIG we are actively working on both approaches. We are currently preparing a manuscript on work presented at ISMB 2020 that adopts strategy (i), while a <a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/acs.jcim.0c00263\">recent publication from the group<\/a> is an example of strategy (ii) that employs data augmentation.<\/p>\n\n\n\n<p>One of my favourite talks from the recent <a href=\"https:\/\/www.iscb.org\/ismb2020\">ISMB 2020 (virtual) conference<\/a> was a presentation from Ayse Dincer of the University of Washington (<a href=\"https:\/\/www.biorxiv.org\/content\/10.1101\/2020.04.28.065052v1.full\">bioRxiv link<\/a>):<\/p>\n\n\n\n<p>Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings<br>Ayse B. Dincer, Joseph D. Janizek, Su-In Lee<br>bioRxiv 2020.04.28.065052; doi: https:\/\/doi.org\/10.1101\/2020.04.28.065052<\/p>\n\n\n\n<p>In their work, they presented an autoencoder that learns representations of gene expression data and is explicitly designed not to capture &#8220;confounders&#8221; or biases in those representations (Fig. 1). These confounders can range from technical artifacts (e.g. batch effects) to uninteresting biological variables (e.g. age) or just random noise. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"416\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image.png?resize=625%2C416&#038;ssl=1\" alt=\"\" class=\"wp-image-5985\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image.png?w=747&amp;ssl=1 747w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image.png?resize=300%2C200&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image.png?resize=624%2C415&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><figcaption>Fig. 1. From Dincer et al. Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings<\/figcaption><\/figure>\n\n\n\n<p>This is achieved through an auxiliary neural network that is trained to predict the value of the &#8220;confounding&#8221; variable from the autoencoder&#8217;s latent representation (Fig. 2). The autoencoder is trained to produce a latent representation that can be used to reconstruct the input expression data but cannot be used by the auxiliary network to predict the confounding variable. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"431\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image-1.png?resize=625%2C431&#038;ssl=1\" alt=\"\" class=\"wp-image-5986\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image-1.png?w=765&amp;ssl=1 765w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image-1.png?resize=300%2C207&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image-1.png?resize=624%2C430&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><figcaption><em>Fig. 2. 
From Dincer et al. Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings<\/em><\/figcaption><\/figure>\n\n\n\n<p>This is an interesting approach with seemingly broad applicability, as long as the confounder or bias is known and quantifiable (either as a class label or a specific value). <\/p>\n\n\n\n<p>A similar approach is explored by Kim and colleagues in the realm of computer vision (<a href=\"https:\/\/openaccess.thecvf.com\/content_CVPR_2019\/html\/Kim_Learning_Not_to_Learn_Training_Deep_Neural_Networks_With_Biased_CVPR_2019_paper.html\">link<\/a>):<\/p>\n\n\n\n<p>Learning Not to Learn: Training Deep Neural Networks with Biased Data<br>Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, Junmo Kim<br>Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9012-9020<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"587\" height=\"378\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image-2.png?resize=587%2C378&#038;ssl=1\" alt=\"\" class=\"wp-image-5987\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image-2.png?w=587&amp;ssl=1 587w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2020\/08\/image-2.png?resize=300%2C193&amp;ssl=1 300w\" sizes=\"auto, (max-width: 587px) 100vw, 587px\" \/><figcaption>Fig. 3. From Kim et al. Learning not to Learn<\/figcaption><\/figure>\n\n\n\n<p>Their model is trained such that the features produced by their convolutional neural network (labelled <em>f<\/em> in Fig. 3) cannot be used to predict the known bias (network <em>h<\/em> in Fig. 3), but can be used to label the image (network <em>g<\/em>). <\/p>\n\n\n\n<p>Both approaches are promising and much-needed advances. 
While there is clearly much more work to be done (for example, these methods require the bias to be known <em>a priori<\/em>, which often isn&#8217;t the case), we will be much better off by both acknowledging our data is biased and trying to do something about it! <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Both the beauty and the downfall of learning-based methods is that the data used for training will largely determine the quality of any model or system. While there have been numerous algorithmic advances in recent years, the most successful applications of machine learning have been in areas where either (i) you can generate your own [&hellip;]<\/p>\n","protected":false},"author":50,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[52,138,189,9,15],"tags":[248,13],"ppma_author":[535],"class_list":["post-5984","post","type-post","status-publish","format-standard","hentry","category-conferences","category-journal-club","category-machine-learning","category-talks","category-technical","tag-conference","tag-journal-club"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":535,"user_id":50,"is_guest":0,"slug":"fergus2","display_name":"Fergus 
Imrie","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/19c18fa7f4d0a2aecc5f69760c6a9f2fc9b493dfe45b1fd333ccb447db9d6a90?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/5984","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/50"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=5984"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/5984\/revisions"}],"predecessor-version":[{"id":5993,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/5984\/revisions\/5993"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=5984"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=5984"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=5984"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=5984"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}