Gaussian Process Regression

Zhenwen Dai (2018-11-2)

Introduction

A Gaussian process (GP) is a Bayesian non-parametric model used for various machine learning problems such as regression and classification. This notebook shows how to use a Gaussian process regression model in MXFusion.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import os
# Select MXNet's naive engine, which executes operations synchronously (handy for debugging).
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

Toy data

We generate some synthetic data for our regression example. The data set is generated from a sine function with some additive Gaussian noise.

In [2]:
import numpy as np
%matplotlib inline
from pylab import *

np.random.seed(0)
X = np.random.uniform(-3.,3.,(20,1))
Y = np.sin(X) + np.random.randn(20,1)*0.05

The generated data are visualized as follows:

In [3]:
plot(X, Y, 'rx', label='data points')
_=legend()
[Figure: the 20 generated data points plotted as red crosses]

Gaussian process regression with Gaussian likelihood

Denote a set of input points \(X \in \mathbb{R}^{N \times Q}\). A Gaussian process is often formulated as a multi-variate normal distribution conditioned on the inputs:

\[ p(F|X) = \mathcal{N}(F; 0, K), \]

where \(F \in \mathbb{R}^{N \times 1}\) is the corresponding output points of the Gaussian process and \(K\) is the covariance matrix computed on the set of inputs according to a chosen kernel function \(k(\cdot, \cdot)\).

For a regression problem, \(F\) is often referred to as the noise-free output, and we usually assume an additional probability distribution for the observation noise. In this case, we assume the noise distribution to be Gaussian:

\[ p(Y|F) = \mathcal{N}(Y; F, \sigma^2 I), \]

where \(Y \in \mathbb{R}^{N \times 1}\) is the observed output and \(\sigma^2\) is the variance of the Gaussian distribution.

The following code defines the above GP regression in MXFusion. First, we change the default data dtype to double precision to avoid any potential numerical issues.

In [4]:
from mxfusion.common import config
config.DEFAULT_DTYPE = 'float64'

In the code below, the variable Y is defined following the probabilistic module GPRegression. A probabilistic module in MXFusion is a pre-built probabilistic model with dedicated inference algorithms for computing the log-pdf and drawing samples. In this case, GPRegression defines the above GP regression model with a Gaussian likelihood. It understands that the log-likelihood after marginalizing \(F\) has a closed form and exploits this property when computing the log-pdf.
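
For reference, marginalizing out \(F\) yields the standard closed-form marginal likelihood (the textbook result, stated here for context rather than as MXFusion-specific notation):

\[ p(Y|X) = \int p(Y|F)\, p(F|X)\, \mathrm{d}F = \mathcal{N}(Y; 0, K + \sigma^2 I), \]

so that \(\log p(Y|X) = -\tfrac{1}{2} Y^\top (K + \sigma^2 I)^{-1} Y - \tfrac{1}{2} \log |K + \sigma^2 I| - \tfrac{N}{2} \log 2\pi\).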

The model is defined with an input variable X of shape (m.N, 1), where the value of m.N is discovered when data is given during inference. A positive noise variance variable m.noise_var is defined with an initial value of 0.01. For the GP, we define an RBF kernel with an input dimensionality of one and initial values of one for both the variance and the lengthscale. We define the variable m.Y as following the GP regression distribution with the specified kernel, input variable and noise variance.

In [5]:
from mxfusion import Model, Variable
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.distributions.gp.kernels import RBF
from mxfusion.modules.gp_modules import GPRegression

m = Model()
m.N = Variable()
m.X = Variable(shape=(m.N, 1))
m.noise_var = Variable(shape=(1,), transformation=PositiveTransformation(), initial_value=0.01)
m.kernel = RBF(input_dim=1, variance=1, lengthscale=1)
m.Y = GPRegression.define_variable(X=m.X, kernel=m.kernel, noise_var=m.noise_var, shape=(m.N, 1))

In the above model, we have not defined any prior distributions over the hyper-parameters. To use the model for regression, we typically do a maximum likelihood estimate of all the hyper-parameters conditioned on the input and output variables. In MXFusion, this is done by first creating an inference algorithm, which is MAP in this case, by specifying the observed variables. Then, we create an inference body for gradient-based optimization methods, which is called GradBasedInference. Inference is triggered by calling the run method, in which all the observed data are given as keyword arguments and any necessary configuration parameters are specified.

In [6]:
import mxnet as mx
from mxfusion.inference import GradBasedInference, MAP

infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.X, m.Y]))
infr.run(X=mx.nd.array(X, dtype='float64'), Y=mx.nd.array(Y, dtype='float64'),
         max_iter=100, learning_rate=0.05, verbose=True)
Iteration 11 loss: -13.523289192527265
Iteration 21 loss: -16.077990179961076
Iteration 31 loss: -16.784414553096843
Iteration 41 loss: -16.820970924702017
Iteration 51 loss: -16.859865329532193
Iteration 61 loss: -16.895666914166453
Iteration 71 loss: -16.899409131167452
Iteration 81 loss: -16.901728290347176
Iteration 91 loss: -16.903122097339737
Iteration 100 loss: -16.903135093930537

All the inference outcomes are stored in the attribute params of the inference body. The inferred value of a parameter can be accessed by passing a reference of the queried parameter to the params attribute. For example, to get the value of m.noise_var, we can call infr.params[m.noise_var]. The estimated parameters from the above experiment are as follows:

In [7]:
print('The estimated variance of the RBF kernel is %f.' % infr.params[m.kernel.variance].asscalar())
print('The estimated length scale of the RBF kernel is %f.' % infr.params[m.kernel.lengthscale].asscalar())
print('The estimated variance of the Gaussian likelihood is %f.' % infr.params[m.noise_var].asscalar())
The estimated variance of the RBF kernel is 0.616992.
The estimated length scale of the RBF kernel is 1.649073.
The estimated variance of the Gaussian likelihood is 0.002251.

We can compare the estimated values with those obtained from the same model implemented in GPy. The estimated values from GPy are very close to the ones from MXFusion.

In [8]:
import GPy

m_gpy = GPy.models.GPRegression(X, Y, kernel=GPy.kern.RBF(1))
m_gpy.optimize()
print(m_gpy)

Name : GP regression
Objective : -16.903456670910902
Number of Parameters : 3
Number of Optimization Parameters : 3
Updates : True
Parameters:
  GP_regression.           |                 value  |  constraints  |  priors
  rbf.variance             |    0.6148038604494702  |      +ve      |
  rbf.lengthscale          |    1.6500299722611123  |      +ve      |
  Gaussian_noise.variance  |  0.002270049772204339  |      +ve      |

Prediction

The above section shows how to estimate the model hyper-parameters of a GP regression model. This is often referred to as training. After training, we are often interested in using the inferred model to make predictions on unseen inputs. The GP module offers two types of prediction: estimating the mean and variance of the output variable, or drawing samples from the predictive posterior distribution.
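
Both types of prediction are based on the standard GP predictive posterior at a set of test inputs \(X_*\); the expression below is the textbook result and is stated only for reference:

\[ p(F_*|Y, X, X_*) = \mathcal{N}\!\left(F_*;\; K_*^\top (K + \sigma^2 I)^{-1} Y,\; K_{**} - K_*^\top (K + \sigma^2 I)^{-1} K_*\right), \]

where \(K_*\) denotes the cross-covariance between the training and test inputs and \(K_{**}\) the covariance among the test inputs.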

Mean and variance of the posterior distribution

To estimate the mean and variance of the predictive posterior distribution, we use the inference algorithm ModulePredictionAlgorithm, which takes the model, the observed variables and the target variables of the prediction as input arguments. We use TransferInference as the inference body, which allows us to reuse the inference outcome from the previous inference. This is done by passing the inference parameters infr.params into the infr_params argument.

In [9]:
from mxfusion.inference import TransferInference, ModulePredictionAlgorithm
infr_pred = TransferInference(ModulePredictionAlgorithm(model=m, observed=[m.X], target_variables=[m.Y]),
                              infr_params=infr.params)

To visualize the fitted model, we make predictions on 100 points evenly spaced between -5 and 5. We estimate the mean and variance of the noise-free output \(F\).

In [10]:
xt = np.linspace(-5,5,100)[:, None]
res = infr_pred.run(X=mx.nd.array(xt, dtype='float64'))[0]
f_mean, f_var = res[0].asnumpy()[0], res[1].asnumpy()[0]

The resulting figure is shown as follows:

In [11]:
plot(xt, f_mean[:,0], 'b-', label='mean')
plot(xt, f_mean[:,0]-2*np.sqrt(f_var), 'b--', label='2 x std')
plot(xt, f_mean[:,0]+2*np.sqrt(f_var), 'b--')
plot(X, Y, 'rx', label='data points')
ylabel('F')
xlabel('X')
_=legend()
[Figure: GP posterior mean (solid) with a two-standard-deviation band (dashed) and the data points (red crosses)]

Posterior samples of Gaussian process

Apart from getting the mean and variance at every location, we may want to draw samples from the posterior GP. As the output variables at different locations are correlated with each other, each sample gives us an idea of a potential function drawn from the posterior GP distribution.
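
For intuition, a joint sample from a multivariate normal posterior is typically drawn via a Cholesky factorization of the covariance matrix; the construction below is the standard one, stated for reference rather than as the module's exact implementation:

\[ F_* = \mu_* + L z, \qquad z \sim \mathcal{N}(0, I), \qquad L L^\top = \Sigma_* + \epsilon I, \]

where \(\mu_*\) and \(\Sigma_*\) are the predictive mean and full covariance and \(\epsilon\) is a small jitter added to the diagonal for numerical stability.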

To draw samples from the posterior distribution, we need to change the prediction inference algorithm attached to the GP module. The default prediction algorithm estimates the mean and variance of the output variable, as shown above. We can attach another inference algorithm as the prediction algorithm. In the following code, we attach the GPRegressionSamplingPrediction algorithm as the prediction algorithm. The targets and conditionals arguments specify the target variables and the conditional variables of the algorithm. After specifying a name in the alg_name argument, such as gp_predict, we can access this inference algorithm by that name, e.g., gp.gp_predict. In the following code, we set the diagonal_variance attribute to False in order to draw samples from the full covariance matrix. To avoid numerical issues, we set a small jitter to help the matrix inversion. Then, we create the inference body in the same way as in the above example.

In [12]:
from mxfusion.inference import TransferInference, ModulePredictionAlgorithm
from mxfusion.modules.gp_modules.gp_regression import GPRegressionSamplingPrediction

gp = m.Y.factor
gp.attach_prediction_algorithms(targets=gp.output_names, conditionals=gp.input_names,
            algorithm=GPRegressionSamplingPrediction(
                gp._module_graph, gp._extra_graphs[0], [gp._module_graph.X]),
            alg_name='gp_predict')
gp.gp_predict.diagonal_variance = False
gp.gp_predict.jitter = 1e-8
infr_pred = TransferInference(ModulePredictionAlgorithm(model=m, observed=[m.X], target_variables=[m.Y], num_samples=5),
                              infr_params=infr.params)

We draw five samples at the 100 evenly spaced input locations.

In [13]:
xt = np.linspace(-5,5,100)[:, None]
y_samples = infr_pred.run(X=mx.nd.array(xt, dtype='float64'))[0].asnumpy()

We visualize the individual samples, each with a different color.

In [14]:
for i in range(y_samples.shape[0]):
    plot(xt, y_samples[i,:,0])
[Figure: five functions sampled from the posterior GP, each plotted in a different color]

Gaussian process with a mean function

TBA

Variational sparse Gaussian process regression

TBA