
/*! \page bopttheory Bayesian optimization

\section introbopt Introduction to Bayesian Optimization

Many problems in engineering, computer science, economics, etc.,
require finding the extremum of a real-valued function. These
functions are typically continuous and sometimes smooth (e.g.,
Lipschitz continuous). However, they usually lack a closed-form
expression, or they might be multimodal, where some of the local
extrema have a much worse outcome than the global extremum. Moreover,
the evaluation of those functions might be costly.

Global optimization is a special case of non-convex optimization where
we want to find the global extremum of a real-valued function, that
is, the target function. The search is carried out using only
pointwise evaluations of the target function.

The objective of a global optimization algorithm is to find the
sequence of points
\f[ x_n \in \mathcal{A} \subset \mathbb{R}^m, \;\;\; n = 1,2,\ldots \f]
which converges to the point \f$x^*\f$, that is, the extremum of the
target function, when \f$n\f$ is large. The algorithm should be able to
find that sequence at least for all functions from a given family.

As explained in \cite Mockus94, this search procedure is a sequential
decision making problem where the point at step \f$n+1\f$ is based on a
decision \f$d_n\f$ which considers all previous data:
\f[ x_{n+1} = d_n(x_{1:n},y_{1:n}) \f]
where \f$y_i = f(x_i) + \epsilon_i\f$. For simplicity, many works assume
\f$\epsilon_i = 0\f$, that is, function evaluations are
deterministic. However, we can easily extend the description to
include stochastic functions (e.g., homoscedastic noise \f$\epsilon_i
\sim \mathcal{N}(0,\sigma)\f$).
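As a minimal sketch of this sequential loop, the snippet below (Python
with NumPy) evaluates a hypothetical target with homoscedastic Gaussian
noise; the decision rule \f$d_n\f$ is deliberately a uniform random
placeholder, since the remainder of this page is devoted to
constructing a better one.

\code{.py}
import numpy as np

def f(x):
    # Hypothetical target function, unknown to the optimizer.
    return np.sin(3.0 * x) + 0.5 * x

rng = np.random.default_rng(0)
sigma = 0.1    # homoscedastic noise level
X, Y = [], []  # all previous data {x_{1:n}, y_{1:n}}

for n in range(20):
    # d_n maps (x_{1:n}, y_{1:n}) to x_{n+1}. Here it is a trivial
    # placeholder (uniform random search); Bayesian optimization
    # replaces it with a model-based decision.
    x_next = rng.uniform(0.0, 2.0)
    y_next = f(x_next) + rng.normal(0.0, sigma)  # y_i = f(x_i) + eps_i
    X.append(x_next)
    Y.append(y_next)

print("best observed value:", min(Y))
\endcode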

The search method is the sequence of decisions \f$d = \{d_0,\ldots,
d_{n-1}\}\f$, which leads to the final decision \f$x_{n} = x_{n}(d)\f$. In
most applications, the objective is to optimize the response of the
final decision. Then, the criterion relies on the \a optimality
\a error or \a optimality \a gap, which can be expressed as:
\f[ \delta_n(f,d) = f\left(x_n\right) - f(x^*) \f]
In other applications, the objective may require convergence to \f$x^*\f$
in the input space. Then, we can use, for example, the <em>Euclidean
distance error</em>:
\f[ \delta_n(f,d) = \|x_n - x^*\|_2 \f]
The previous equations can also be interpreted as variants of the
\a loss function for the decision at each step. Thus, the optimal
decision is defined as the function that minimizes the loss function:
\f[ d_n = \arg \min_d \delta_n(f,d) \f]
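
Both loss variants are trivial to compute once \f$x^*\f$ is known. The
following sketch (using an assumed toy quadratic whose minimum is at
\f$x^* = 1\f$) simply makes the two definitions concrete:

\code{.py}
import numpy as np

def optimality_gap(f, x_n, x_star):
    # delta_n = f(x_n) - f(x*), assuming the extremum is a minimum.
    return f(x_n) - f(x_star)

def distance_error(x_n, x_star):
    # delta_n = ||x_n - x*||_2: convergence in the input space.
    diff = np.atleast_1d(np.asarray(x_n) - np.asarray(x_star))
    return np.linalg.norm(diff)

f = lambda x: (x - 1.0) ** 2        # toy target with x* = 1
print(optimality_gap(f, 0.8, 1.0))  # ~0.04
print(distance_error(0.8, 1.0))     # 0.2
\endcode
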
This requires full knowledge of the function \f$f\f$, which is
unavailable. Instead, let us assume that the target function \f$f = f(x)\f$
belongs to a family of functions \f$f \in F\f$, e.g., continuous functions
in \f$\mathbb{R}^m\f$. Let us also assume that the function can be
represented as a sample from a probability distribution over functions,
\f$f \sim P(f)\f$. Then, the best response case analysis for the search
process is defined as the decision that optimizes the expectation of
the loss function:
\f[ d^{BR}_n = \arg \min_d \mathbb{E}_{P(f)} \left[
\delta_n(f,d)\right] = \arg \min_d \int_F \delta_n(f,d) \; dP(f) \f]
where \f$P\f$ is a prior distribution over functions.
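This expectation can be approximated by Monte Carlo over functions
sampled from the prior. The sketch below assumes a zero-mean Gaussian
process prior with a squared-exponential kernel (both illustrative
choices) and scores every candidate point by its average optimality gap
across the prior samples:

\code{.py}
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 50)

def kernel(a, b, ell=0.2):
    # Squared-exponential covariance (an assumed prior choice).
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

K = kernel(grid, grid) + 1e-8 * np.eye(grid.size)
L = np.linalg.cholesky(K)
samples = L @ rng.standard_normal((grid.size, 500))  # 500 draws f ~ P(f)

# E_{P(f)}[f(x_i) - f(x*)], with x* the per-sample minimizer on the grid.
gaps = samples - samples.min(axis=0)
expected_loss = gaps.mean(axis=1)
print("best-response point:", grid[np.argmin(expected_loss)])
\endcode

Note that with no observations, a stationary zero-mean prior scores all
points almost equally (up to Monte Carlo noise and boundary effects),
which is precisely why conditioning on the observed data, as done next,
is the key ingredient.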

However, we can improve on that equation by considering that, at decision
\f$d_n\f$, we have already \a observed the actual response of the
function at \f$n-1\f$ points, \f$\{x_{1:n-1},y_{1:n-1}\}\f$. Thus, the prior
information about the function can be updated with the observations and
Bayes' rule:
\f[ P(f|x_{1:n-1},y_{1:n-1}) = \frac{P(x_{1:n-1},y_{1:n-1}|f) P(f)}{P(x_{1:n-1},y_{1:n-1})} \f]
In fact, we can rewrite this equation to represent the updates sequentially:
\f[ P(f|x_{1:i},y_{1:i}) = \frac{P(x_{i},y_{i}|f) P(f|x_{1:i-1},y_{1:i-1})}{P(x_{i},y_{i})}, \qquad \forall \; i=1 \ldots n-1 \f]
Thus, the best response decision can be rewritten as:
\f[ d^{BO}_n = \arg \min_d \mathbb{E}_{P(f|x_{1:n-1},y_{1:n-1})} \left[ \delta_n(f,d)\right] = \arg \min_d \int_F \delta_n(f,d) \; dP(f|x_{1:n-1},y_{1:n-1}) \f]
This equation is the root of <em>Bayesian optimization</em>, where the
Bayesian part comes from the fact that we are computing the
expectation with respect to the posterior distribution, also called
\a belief, over functions. Therefore, Bayesian optimization is a
memory-based optimization algorithm.
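A minimal end-to-end sketch of this memory-based loop follows, again
assuming a squared-exponential Gaussian process prior and using a lower
confidence bound rule as a stand-in for the decision \f$d_n\f$:

\code{.py}
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(3.0 * x) + 0.5 * x  # hypothetical target
grid = np.linspace(0.0, 2.0, 200)

def kernel(a, b, ell=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

X = list(rng.uniform(0.0, 2.0, 3))  # small initial design
Y = [f(x) for x in X]

for n in range(10):
    Xa = np.array(X)
    K = kernel(Xa, Xa) + 1e-6 * np.eye(len(X))  # nugget for stability
    Ks = kernel(grid, Xa)
    mu = Ks @ np.linalg.solve(K, np.array(Y))   # posterior mean
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    # Decision d_n: minimize a lower confidence bound of the posterior.
    lcb = mu - 2.0 * np.sqrt(np.maximum(var, 0.0))
    x_next = grid[np.argmin(lcb)]
    X.append(x_next)
    Y.append(f(x_next))

print("best observed:", min(Y), "at x =", X[int(np.argmin(Y))])
\endcode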

Although most of the theory of Bayesian optimization is related to
deterministic functions, we also consider stochastic functions, that
is, we assume there might be a random error in the function output. In
fact, evaluations can produce different outputs if repeated. In that
case, the target function is the expected output. Furthermore,
\cite Gramacy2012 has shown that, even for deterministic functions, it
is better to assume a certain error in the observations. The main
reason is that, in practice, there might be some mismodelling errors
which can lead to instability of the recursion if neglected.

\section modbopt Bayesian optimization general model

In order to simplify the description, we are going to use a special
case of the Bayesian optimization model defined previously, which
corresponds to the most common application. In subsequent sections we
will introduce some generalizations for different applications.

Without loss of generality, consider the problem of finding the
minimum of an unknown real-valued function \f$f:\mathbb{X} \rightarrow
\mathbb{R}\f$, where \f$\mathbb{X}\f$ is a compact space, \f$\mathbb{X}
\subset \mathbb{R}^d, d \geq 1\f$. Let \f$P(f)\f$ be a prior distribution
over functions represented as a stochastic process, for example, a
Gaussian process \f$\xi(\cdot)\f$, with inputs \f$x \in \mathbb{X}\f$ and an
associated kernel or covariance function \f$k(\cdot,\cdot)\f$. Let us also
assume that the target function is a sample of the stochastic process,
\f$f \sim \xi(\cdot)\f$.

In order to find the minimum, the algorithm has a maximum budget of
\f$N\f$ evaluations of the target function \f$f\f$. The purpose of the
algorithm is to find the sequence of decisions that provides the best
performance at the end of that budget.

One advantage of using Gaussian processes as prior distributions
over functions is that new observations of the target function
\f$(x_i,y_i)\f$ can easily be used to update the distribution over
functions. Furthermore, the posterior distribution is also a Gaussian
process \f$\xi_i = \left[ \xi(\cdot) | x_{1:i},y_{1:i}
\right]\f$. Therefore, the posterior can be used as an informative prior
for the next iteration in a recursive algorithm.
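The conditioning step itself is a closed-form linear algebra update. A
sketch of the posterior mean and covariance of a zero-mean Gaussian
process, with the kernel passed in as a callable:

\code{.py}
import numpy as np

def gp_posterior(x_train, y_train, x_test, kernel, noise=1e-6):
    # Posterior of a zero-mean GP: xi_i = [ xi(.) | x_{1:i}, y_{1:i} ].
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = kernel(x_test, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = kernel(x_test, x_test) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

# Usage with an assumed squared-exponential kernel:
se = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / 0.2) ** 2)
mu, cov = gp_posterior(np.array([0.1, 0.5]), np.array([1.0, -0.3]),
                       np.linspace(0.0, 1.0, 5), se)
print(mu)
\endcode

Because the result is itself a Gaussian process, the same routine can
be applied recursively, feeding the updated model back in as the prior
for the next observation.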

In a more general setting, many authors have suggested modifying the
standard zero-mean Gaussian process with different variations that
include semi-parametric models \cite Huang06 \cite Handcock1993 \cite Jones:1998 \cite OHagan1992, hyperpriors on the parameters
\cite MartinezCantin09AR \cite Brochu:2010c \cite Hoffman2011, Student-t
processes \cite Gramacy_Polson_2009 \cite Sacks89SS \cite Williams_Santner_Notz_2000, etc.

We use a generalized linear model of the form:
\f[ f(\mathbf{x}) = \phi(\mathbf{x})^T \mathbf{w} + \epsilon(\mathbf{x}), \qquad \epsilon(\mathbf{x}) \sim \mathcal{NP} \left( 0, \sigma^2_s (\mathbf{K}(\theta) + \sigma^2_n \mathbf{I}) \right) \f]
The term \f$\mathcal{NP}\f$ denotes a non-parametric process, which can
refer to a Gaussian process \f$\mathcal{GP}\f$ or a Student's
t process \f$\mathcal{TP}\f$. In both cases, \f$\sigma^2_n\f$ is the
observation noise variance, sometimes called the nugget, and it is problem
specific. Many authors decide to fix this value to \f$\sigma^2_n = 0\f$
when the function \f$f(\mathbf{x})\f$ is deterministic, for example, a computer
simulation. However, as cleverly pointed out in \cite Gramacy2012,
there might be more reasons to include this term apart from the
observation noise, for example, to account for model inaccuracies.
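To make the notation concrete, the sketch below draws one sample from
the Gaussian process case of this model, using assumed linear features
\f$\phi(\mathbf{x}) = (1, x)^T\f$ and illustrative values for
\f$\sigma^2_s\f$, \f$\sigma^2_n\f$ and \f$\theta\f$:

\code{.py}
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 100)

# Parametric mean phi(x)^T w with assumed features phi(x) = [1, x].
phi = np.stack([np.ones_like(x), x], axis=1)
w = np.array([0.5, -1.0])

# Nonparametric residual: NP(0, sigma_s^2 (K(theta) + sigma_n^2 I)).
sigma_s2, sigma_n2, theta = 1.0, 1e-4, 0.15
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / theta) ** 2)
cov = sigma_s2 * (K + sigma_n2 * np.eye(x.size))

L = np.linalg.cholesky(cov)
f_sample = phi @ w + L @ rng.standard_normal(x.size)
print(f_sample[:5])
\endcode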

This model has been presented in different ways depending on the field
where it was used:
\li As a generalized linear model \f$\phi(\mathbf{x})^T\mathbf{w}\f$ with heteroscedastic
perturbation \f$\epsilon(\mathbf{x})\f$.
\li As a nonparametric process of the form \f$\mathcal{NP} \left(\phi(\mathbf{x})^T\mathbf{w},
\sigma^2_s (\mathbf{K}(\theta) + \sigma^2_n \mathbf{I}) \right)\f$.
\li As a semiparametric model \f$f(\mathbf{x}) = f_{par}(\mathbf{x}) + f_{nonpar}(\mathbf{x}) =
\phi(\mathbf{x})^T\mathbf{w} + \mathcal{NP}(\cdot)\f$.