Covariate Shift in Virtual Screening | Oxford Protein Informatics Group

In supervised learning, we assume that the training and test data are drawn from the same distribution, i.e. $P_{train}(x,y) = P_{test}(x,y)$. However, this assumption is often violated in virtual screening. For example, a chemist initially focuses on one series of compounds, and the data from this series are used to train a model. If the chemist later shifts focus to a new, structurally distinct series, we would not expect the model to predict the labels of the new compounds accurately. Here, we introduce some methods that address this problem.

Methods such as Kernel Mean Matching (KMM) and the Kullback-Leibler Importance Estimation Procedure (KLIEP) have been proposed to correct for this shift. These methods typically assume that the concept remains unchanged and only the input distribution changes, i.e. $P_{train}(y|x) = P_{test}(y|x)$ and $P_{train}(x) \neq P_{test}(x)$. In general, these methods reweight instances in the training data so that their distribution is more closely aligned with the distribution of instances in the test set. The appropriate importance weight $w(x)$ for each instance $x$ in the training set is:

$w(x) = \frac{p_{test}(x)}{p_{train}(x)}$

where $p_{train}(x)$ is the training set density and $p_{test}(x)$ is the test set density. Note that only the feature vectors (not the labels) are used in reweighting. The major difference between KMM and KLIEP is the objective function: KLIEP minimises the Kullback-Leibler divergence between the test density and the reweighted training density, while KMM minimises the Maximum Mean Discrepancy (MMD) between the reweighted training sample and the test sample. For more detail, please see the references below.
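To make the reweighting idea concrete, here is a minimal NumPy sketch of a KLIEP-style estimator. It models $w(x)$ as a non-negative combination of Gaussian kernels centred on test points and maximises the mean log-weight on the test set, subject to the weights averaging to one on the training set. The function names (`gaussian_kernel`, `kliep_weights`), the fixed kernel width `sigma`, the step size and the one-dimensional toy data are all assumptions made for illustration. The original procedure selects the kernel width by cross-validation, which is omitted here, and real descriptors or fingerprints would replace the toy Gaussians.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def kliep_weights(x_train, x_test, sigma=0.5, n_centres=50,
                  lr=1e-2, n_iter=1000, seed=0):
    """Estimate w(x) = p_test(x) / p_train(x) at the training points.

    Illustrative sketch only: sigma is fixed rather than tuned.
    """
    rng = np.random.default_rng(seed)
    # Kernel centres are a random subset of the test points.
    n_centres = min(n_centres, len(x_test))
    idx = rng.choice(len(x_test), size=n_centres, replace=False)
    centres = x_test[idx]

    k_test = gaussian_kernel(x_test, centres, sigma)    # (n_test, n_centres)
    k_train = gaussian_kernel(x_train, centres, sigma)  # (n_train, n_centres)
    b_vec = k_train.mean(axis=0)  # constraint: alpha @ b_vec == 1

    alpha = np.ones(n_centres)
    alpha /= alpha @ b_vec
    for _ in range(n_iter):
        # Gradient ascent on the mean log-weight over the test set.
        grad = (k_test / (k_test @ alpha)[:, None]).mean(axis=0)
        alpha = alpha + lr * grad
        # Push alpha back towards the feasible set: mean training
        # weight equal to one, and no negative coefficients.
        alpha = alpha + (1.0 - alpha @ b_vec) * b_vec / (b_vec @ b_vec)
        alpha = np.maximum(alpha, 0.0)
        alpha = alpha / (alpha @ b_vec)
    return k_train @ alpha


# Toy covariate shift: training and test inputs follow different Gaussians.
rng = np.random.default_rng(42)
x_train = rng.normal(0.0, 1.0, size=(300, 1))
x_test = rng.normal(1.0, 0.5, size=(300, 1))

w = kliep_weights(x_train, x_test)
print("unweighted training mean:", x_train.mean())
print("weighted training mean:  ", np.average(x_train[:, 0], weights=w))
print("test mean:               ", x_test.mean())
```

The labels never enter the calculation, which matches the note above. The resulting weights would typically be passed to a learner that supports per-instance weights, so that training instances resembling the test set count for more in the loss.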

References:

Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, Motoaki Kawanabe: Direct importance estimation for covariate shift adaptation. Ann Inst Stat Math, 2008.
Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf: Correcting Sample Selection Bias by Unlabeled Data. NIPS, 2006.
Georgia McGaughey, W. Patrick Walters, Brian Goldman: Understanding covariate shift in model performance. F1000Research, 2016.

Author

Lucian Chan
