BAYESIAN OPTIMISATION WITH GPyOpt

Here we demonstrate a couple of examples of how Bayesian Optimisation can be used to quickly find the global minimum of a multi-dimensional function. Methods such as gradient descent, grid search and random search can all be used to find extrema, but gradient descent is susceptible to getting stuck in local extrema, while the search methods typically require substantially more function evaluations than Bayesian Optimisation to cover the search space effectively.

Bayesian Optimisation works by incorporating the information learned from previous function evaluations to choose the next set of coordinates to evaluate. It does this by calculating the posterior predictive distribution of the function's value at each candidate point and scoring candidates with an acquisition function. One particularly popular application of Bayesian Optimisation is hyper-parameter tuning: deep learning models often have many hyper-parameters, and each configuration takes a substantial amount of time to train and evaluate.
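
To make that loop concrete, here is a minimal, self-contained sketch of a single step of it, written against GPy (the Gaussian-process library that GPyOpt builds on) with a hand-rolled Expected Improvement acquisition. The toy objective, the candidate grid and the kernel choice are illustrative assumptions for this sketch, not a description of what GPyOpt does internally.

import numpy as np
import GPy
from scipy.stats import norm

def toy_objective(x):
    #Toy 1-dimensional objective, purely for illustration
    return np.sin(3 * x) + 0.1 * x**2

#Evaluations gathered so far
X = np.array([[-2.0], [0.0], [1.5]])
Y = toy_objective(X)

#1. Fit a GP surrogate to the observations
model = GPy.models.GPRegression(X, Y, GPy.kern.RBF(input_dim=1))
model.optimize(messages=False)

#2. Use the posterior predictive distribution to score candidates with Expected Improvement
candidates = np.linspace(-3, 3, 200).reshape(-1, 1)
mu, var = model.predict(candidates)
sigma = np.sqrt(var)
best = Y.min()
z = (best - mu) / sigma
ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

#3. The candidate with the largest EI is evaluated next, and the loop repeats
x_next = candidates[np.argmax(ei)]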

We use the GPyOpt module, developed by the Machine Learning group at the University of Sheffield.

In [1]:
#Import Modules

#GPyOpt - Cases are important, for some reason
import GPyOpt
from GPyOpt.methods import BayesianOptimization

#numpy
import numpy as np
from numpy.random import multivariate_normal #For later example

import pandas as pd

#Plotting tools
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

We'll start by finding the minimum of the polynomial $f(x) = x^4 + 2x^3 - 12x^2 - 2x + 6$ in the range $[-5, 4]$.

In [2]:
#Define the objective function
def obj_func(x):
    out = x**4 + 2*x**3 -12*x**2 - 2*x + 6
    return(out)

First, let's visualise $f(x)$ and note that it contains three stationary points, two of which are minima. It's important that we find the global minimum and don't get stuck at the local minimum around $x = 2$.
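
As a quick check (not in the original workflow), we can locate the stationary points by finding the roots of $f'(x) = 4x^3 + 6x^2 - 24x - 2$ with numpy:

#Stationary points: roots of f'(x) = 4x^3 + 6x^2 - 24x - 2
crit_points = np.roots([4, 6, -24, -2])
print(np.sort(crit_points.real))   #Approximately [-3.28, -0.08, 1.86]: minima at the outer two, a local maximum in between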

In [3]:
#Plot the function 
x = pd.Series(np.linspace(-5, 4, 1000))
f_x = x.apply(obj_func)

plt.plot(x, f_x, 'b-')
plt.show()

Now we'll use Bayesian Optimisation to find the minimum. To create the Bayesian Optimisation object we pass in the objective function and the domain to search over (which can be continuous or discrete). We then specify the number of function evaluations we want via max_iter (by default the model also performs five random initial evaluations on top of max_iter; we can change this default, as sketched after the cell below).

For 1- or 2-dimensional problems, the plot_acquisition() method shows where the function has been evaluated, as well as the posterior mean and variance at each point. As one might expect, there is greater uncertainty in the areas which have not been thoroughly explored by the model.

The red curve at the bottom of the plot indicates the Expected Improvement (EI) at each point - the point with the largest EI will be the next to be evaluated, and we can already see that after only 10 evaluations we are getting close to the true minimum.

In [4]:
domain = [{'name': 'var_1', 'type': 'continuous', 'domain': (-5,4)}]


myBopt_1d = BayesianOptimization(f=obj_func, domain=domain)
myBopt_1d.run_optimization(max_iter=5)
myBopt_1d.plot_acquisition()
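
As noted above, five random initial evaluations are performed before the max_iter acquisition-driven ones. If we want to control that ourselves, the constructor accepts an initial_design_numdata argument (in recent GPyOpt releases); a minimal sketch, reusing obj_func and domain from the cell above:

#Hypothetical variant: start from 3 random initial points instead of the default 5
myBopt_alt = BayesianOptimization(f=obj_func, domain=domain, initial_design_numdata=3)
myBopt_alt.run_optimization(max_iter=5)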

We can print a list of the places where the function was evaluated and the corresponding values, as well as the optimal value.

In [5]:
ins = myBopt_1d.get_evaluations()[0].flatten()   #Evaluated x values
outs = myBopt_1d.get_evaluations()[1].flatten()  #Corresponding function values
evals = pd.DataFrame(outs, index=ins)
print(evals.sort_index())
                   0
-5.000000  91.000000
-3.814909 -60.248104
-3.453162 -70.349283
-3.078797 -70.106810
-2.589734 -59.058342
-1.462760 -18.431958
 0.775723  -1.476724
 0.966524  -4.464611
 2.634348   2.178148
 2.814442  12.648554
In [6]:
print("The minumum value obtained by the function was %.4f (x = %.4f)" % (myBopt_1d.fx_opt, myBopt_1d.x_opt))
The minumum value obtained by the function was -70.3493 (x = -3.4532)
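
As a quick sanity check (not part of the original workflow), we can compare this against a brute-force minimum over the 1000-point grid we already evaluated for the plot:

#Brute-force minimum over the plotting grid
grid_min_idx = int(np.argmin(f_x.values))
print("Grid minimum: f(%.4f) = %.4f" % (x[grid_min_idx], f_x[grid_min_idx]))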

Now let's consider a 2-dimensional example, where it would be even more time consuming to use grid or random search. We define our objective function as $f(x,y) = (x^2 + y^2)(\sin^2(x) - \cos(y))$ on the domain $[0, 10] \times [0, 10]$. Here's what our function looks like:

In [7]:
#2-Dimensional Example

def obj_func_2d(x,y):
    return((x**2 + y**2)*(np.sin(x)**2 - np.cos(y)))


fig = plt.figure(figsize=plt.figaspect(0.3))
fig.suptitle('Plots of our objective function')

# Make data.
X = np.arange(0, 10, 0.25)
Y = np.arange(0, 10, 0.25)
X, Y = np.meshgrid(X, Y)
Z = obj_func_2d(X,Y)



# First subplot
ax = fig.add_subplot(1, 2, 1)
ax.contour(X,Y,Z)

# Second subplot
ax = fig.add_subplot(1, 2, 2, projection='3d')

# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)

# Customize the z axis.
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)

plt.show()

Clearly there are a large number of local extrema.

When setting up the Bayesian Optimisation, we reparameterise the function so that it takes a single vector argument. The only other major difference from the earlier set-up is that we specify two variables in bounds2d.

In [8]:
def objfunc2d(x):
    """
    x is a two-column array; each row is an (x1, x2) pair.
    """ 
    x1 = x[:, 0]
    x2 = x[:, 1]
    return((x1**2 + x2**2)*(np.sin(x1)**2 - np.cos(x2)))

bounds2d = [{'name': 'var_1', 'type': 'continuous', 'domain': (0,10)},
            {'name': 'var_2', 'type': 'continuous', 'domain': (0,10)}]
maxiter = 50

myBopt_2d = GPyOpt.methods.BayesianOptimization(objfunc2d, domain=bounds2d)
myBopt_2d.run_optimization(max_iter = maxiter)
print("="*20)
print("Value of (x,y) that minimises the objective:"+str(myBopt_2d.x_opt))    
print("Minimum value of the objective: "+str(myBopt_2d.fx_opt))     
print("="*20)
myBopt_2d.plot_acquisition()
====================
Value of (x,y) that minimises the objective:[9.53004346 6.39171534]
Minimum value of the objective: -129.44733039516598
====================

We can see that the posterior mean corresponds roughly to the contour map of the true 2-dimensional function, whereas the greatest posterior variance is in the (relatively unexplored) area centred around $[8, 3]$. The very small expected improvement almost everywhere suggests that we have likely found the true minimum in this region.

We can visualise how the algorithm explored the space by looking at the distance between consecutive evaluations. Most of the time there is a sizeable distance between evaluations, but on occasion we see consecutive evaluations that are very close to each other - these evaluations typically correspond to a reduction in the value of the best selected sample.

In [9]:
#Plot some more characteristics:

myBopt_2d.plot_convergence() #We can clearly see that the optimiser spends quite some time
#exploring the small region which it believes contains the minimum
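
If we want the numbers behind that plot rather than the picture, the consecutive-evaluation distances can be computed directly from the inputs the optimiser stores in its X attribute (a small sketch):

#Euclidean distance between each pair of consecutive evaluation points
step_distances = np.linalg.norm(np.diff(myBopt_2d.X, axis=0), axis=1)
print(step_distances)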

In the 2-dimensional example we just saw, we were free to take any combination of $x$ and $y$ values, with no constraints other than the rectangular domain that we specified. Below we show an example of constrained optimisation.

We consider the same 2-dimensional objective function as before, with an extra step. Rather than specifying an $(x, y)$ pair and receiving a real-valued output, we must parameterise a 2-dimensional Gaussian distribution, draw a sample from it, and pass that sample to the objective function. In this setting we are constrained by the values which make up the covariance matrix, which must be symmetric and positive semi-definite to be a valid covariance matrix.

In [10]:
def stochastic_obj_func_2d_const(x):
    #Construct our normal parameters:
    
    mu1,mu2,Sig_diag1,Sig_diag2,Sig_Cross = x[:,0],x[:,1],x[:,2],x[:,3],x[:,4]
    
    mu = np.array([mu1[0], mu2[0]]).flatten()
    Sigma = np.array([[Sig_diag1[0], Sig_Cross[0]], [Sig_Cross[0], Sig_diag2[0]]]) #Covariance matrix must be symmetric, so both off-diagonal entries are Sig_Cross
    
    sample = multivariate_normal(mu, Sigma, check_valid = 'raise') #Raises an error if Sigma is not a valid covariance matrix
    
    #Now run the deterministic objective function
    return(obj_func_2d(sample[0], sample[1]))

We set up the Bayesian Optimisation in a similar way to before, but this time we also pass an extra list which details our constraints.

In [11]:
#Now lets set up the Bayesian Optimisation:
#Guide on how to do it with constraints: https://nbviewer.jupyter.org/github/SheffieldML/GPyOpt/blob/devel/manual/GPyOpt_constrained_optimization.ipynb

bounds = [{'name': 'mu_1', 'type': 'continuous', 'domain': (0,10)},
          {'name': 'mu_2', 'type': 'continuous', 'domain': (0,10)},
          {'name': 'Sig_diag_1', 'type': 'continuous', 'domain': (0,5)},
          {'name': 'Sig_diag_2', 'type': 'continuous', 'domain': (0,5)},
          {'name': 'Sig_Cross', 'type': 'continuous', 'domain': (-5,5)}]

constraints = [{'name': 'constr_1', 'constraint': '-x[:,2]'},
               {'name': 'constr_2', 'constraint': '-x[:,3]'},
               {'name': 'constr_3', 'constraint': '-x[:,2]*x[:,3] + x[:,4]**2'}]

#The above constraints ensure our 2x2 covariance matrix will be symmetric and positive (semi-)definite
#Each 'constraint' expression must be <= 0 inside the feasible region - e.g. '-x[:,2]' <= 0 enforces x[:,2] >= 0,
#and the third expression enforces a non-negative determinant
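
To see why these expressions give a valid covariance matrix, recall that a symmetric 2x2 matrix is positive semi-definite exactly when its diagonal entries and its determinant are non-negative. A quick numerical check on one hypothetical candidate point (the numbers below are made up purely for illustration):

#Hypothetical candidate: [mu_1, mu_2, Sig_diag_1, Sig_diag_2, Sig_Cross]
cand = np.array([[5.0, 5.0, 2.0, 3.0, 1.5]])
print(-cand[:,2])                            # <= 0  iff  Sig_diag_1 >= 0
print(-cand[:,3])                            # <= 0  iff  Sig_diag_2 >= 0
print(-cand[:,2]*cand[:,3] + cand[:,4]**2)   # <= 0  iff  the determinant is >= 0

Sigma = np.array([[2.0, 1.5], [1.5, 3.0]])
print(np.linalg.eigvalsh(Sigma))             #Both eigenvalues are indeed non-negative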
In [12]:
#Determine the subset where we are allowed to sample
feasible_region = GPyOpt.Design_space(space = bounds, constraints = constraints) 
initial_design = GPyOpt.experiment_design.initial_design('random', feasible_region, 10)

#CHOOSE the objective
objective = GPyOpt.core.task.SingleObjective(stochastic_obj_func_2d_const)

# CHOOSE the model type
model = GPyOpt.models.GPModel(exact_feval=True,optimize_restarts=10,verbose=False)

#CHOOSE the acquisition optimizer
acquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(feasible_region)

#CHOOSE the type of acquisition
acquisition = GPyOpt.acquisitions.AcquisitionEI(model, feasible_region, optimizer=acquisition_optimizer)

#CHOOSE a collection method
evaluator = GPyOpt.core.evaluators.Sequential(acquisition)
In [13]:
# Now create BO object
bo = GPyOpt.methods.ModularBayesianOptimization(model, feasible_region, objective, acquisition, evaluator, initial_design)
In [14]:
# --- Stop conditions
max_time  = None 
max_iter  = 100
tolerance = 1e-8     # distance between two consecutive observations 
                     # if we're sampling a region in such fine detail then it is likely that we've found the true min.

# Run the optimization                                                  
bo.run_optimization(max_iter = max_iter, max_time = max_time, eps = tolerance, verbosity=False) 
bo.plot_convergence()
In [15]:
print(bo.x_opt)
print(bo.fx_opt)
[8.26911308 6.15984192 2.33539153 3.34381918 0.53045561]
-181.81886066767237

The mean of the two-dimensional normal which gave the lowest objective value was (8.27, 6.16), which is relatively close to the minimiser we found when we used the standard 2-dimensional objective function.

We can access all of the function evaluations the Bayesian Optimisation has performed, which might be useful when the coordinates we are optimising over have some interpretable meaning and we want to chart their progress.

In [18]:
print(bo.X[0:5]) #The first five evaluations
[[ 7.79048061  1.86358427  1.79678336  1.48669037  0.39341564]
 [ 0.87614258  5.31407509  3.62447788  3.37028292  1.58800987]
 [ 4.75244145  2.39498522  0.79238314  0.52835265  0.27772583]
 [ 8.01733392  2.17956434  3.60892325  1.08167549 -0.52500222]
 [ 2.79333858  7.91035496  2.07018607  1.41000484  1.64668716]]
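
In the same spirit, and assuming the outputs are exposed as bo.Y in the same way as bo.X, we can also chart the best value found so far after each evaluation (a small additional sketch):

#Running minimum of the objective over the evaluation history
running_best = np.minimum.accumulate(bo.Y.flatten())
print(running_best[-5:])   #The final entry matches bo.fx_opt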

And that's it! The cases above demonstrate how to implement Bayesian Optimisation in a number of simple settings. The same principles apply in more complex, higher-dimensional problems, although standard BO is generally reported to perform poorly once the dimensionality exceeds roughly 20.

Happy optimising!