{"id":4251,"date":"2018-07-23T22:24:32","date_gmt":"2018-07-23T21:24:32","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=4251"},"modified":"2018-07-23T22:28:58","modified_gmt":"2018-07-23T21:28:58","slug":"covariate-shift","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2018\/07\/covariate-shift\/","title":{"rendered":"Covariate Shift in Virtual Screening"},"content":{"rendered":"<p>In supervised learning, we assume that the training data and testing are drawn from the same distribution, i.e <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P_%7Btrain%7D%28x%2Cy%29+%3D+P_%7Btest%7D%28x%2Cy%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P_{train}(x,y) = P_{test}(x,y)\" class=\"latex\" \/>. However this assumption is often violated in virtual screening. For example, a chemist initially focuses on a series of compounds and the information from this series is used to train a model. For some reasons,\u00a0 the chemist changes their focus on a new, structurally distinct series later on and we would not expect the model to accurately predict the labels in the testing sets.\u00a0 Here, we introduce some methods to address this problem.<\/p>\n<p>Methods such as Kernel Mean Matching (KMM) or Kullback-Leibler Importance Estimation Procedure (KLIEP) have been proposed.\u00a0 These methods typically assume the concept remain unchanged and only the distribution changes, i.e.\u00a0<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P_%7Btrain%7D%28y%7Cx%29+%3DP_%7Btest%7D%28y%7Cx%29+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P_{train}(y|x) =P_{test}(y|x) \" class=\"latex\" \/> and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=P_%7Btrain%7D%28x%29+%5Cneq+P_%7Btest%7D%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"P_{train}(x) &#92;neq P_{test}(x)\" class=\"latex\" \/>.\u00a0 In general, these methods\u00a0 reweight instances in the training data so that the distribution of training instances is more closely aligned with the distribution of instances in the testing set. The appropriate importance weighting factor <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w(x)\" class=\"latex\" \/> for each instance x in the training set is:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%28x%29+%3D+%5Cfrac%7Bp_%7Btest%7D%28x%29%7D%7Bp_%7Btrain%7D%28x%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w(x) = &#92;frac{p_{test}(x)}{p_{train}(x)}\" class=\"latex\" \/><\/p>\n<p id=\"d2538e259\" class=\"\">where <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=p_%7Btrain%7D%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"p_{train}(x)\" class=\"latex\" \/> is the training set density and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=p_%7Btest%7D%C2%A0%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"p_{test}\u00a0(x)\" class=\"latex\" \/> is the testing set density. Note that only the feature vector values (not their labels) are used in reweighting. The major difference between KMM and KLIEP is the objective function: KLIEP is based on the minimisation of the Kullback-Leibler divergence while KMM is based on the minimisation of Maximum Mean Discrepancy (MMD).\u00a0 For more detail, please see reference.<\/p>\n<p>Reference:<\/p>\n<ol>\n<li><i>Masashi Sugiyama ,Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von B\u00fcnau, Motoaki Kawanabe.<\/i>: Direct importance estimation for Covariate Shift Adaptation.\u00a0Ann Inst Stat Math<i>.<\/i>\u00a02008<\/li>\n<li><em>Jiayuan Huang,\u00a0<em>\u00a0<\/em><em>Alex Smola,\u00a0<\/em><em><em>Arthur Gretton,\u00a0<\/em><\/em>Karsten Borgwardt, Bernhard Scholkopf<\/em>.:Correcting Sample Selection Bias by Unlabeled Data.\u00a0NIPS 06.<\/li>\n<li>\n<p class=\"EXLResultAuthor\"><em>Mcgaughey, Georgia ; Walters, W Patrick ; Goldman, Brian.:\u00a0<\/em>Understanding\u00a0<span class=\"searchword\">covariate<\/span>\u00a0<span class=\"searchword\">shift<\/span>\u00a0in model performance.\u00a0F1000Research, 2016,<\/p>\n<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In supervised learning, we assume that the training data and testing are drawn from the same distribution, i.e . However this assumption is often violated in virtual screening. For example, a chemist initially focuses on a series of compounds and the information from this series is used to train a model. For some reasons,\u00a0 the [&hellip;]<\/p>\n","protected":false},"author":53,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[187,189],"tags":[],"ppma_author":[537],"class_list":["post-4251","post","type-post","status-publish","format-standard","hentry","category-cheminformatics","category-machine-learning"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":537,"user_id":53,"is_guest":0,"slug":"lucian","display_name":"Lucian Chan","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/18a18dc1ef93082baba140e09637bc5a3ca6146479993acf9ea2f6053225f91f?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4251","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/53"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=4251"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4251\/revisions"}],"predecessor-version":[{"id":4258,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4251\/revisions\/4258"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=4251"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=4251"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=4251"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=4251"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}