# Bayesian-Optimization / doxygen / introduction.dox

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 /*! \page bopttheory Bayesian optimization \tableofcontents \section introbopt Introduction to Bayesian Optimization Many problems in engineering, computer science, economics, etc., require to find the extremum of a real valued function. These functions are typically continuous and sometimes smooth (e.g.: Lipschitz continuous). However, those functions do not have a closed-form expression or might be multimodal, where some of the local extrema might have a bad outcome compared to the global extremum. The evaluation of those functions might be costly. Global optimization is a special case of non-convex optimization where we want to find the global extremum of a real valued function, that is, the target function. The search is done by some pointwise evaluation of the target function. The objective of a global optimization algorithm is to find the sequence of points \f[ x_n \in \mathcal{A} \subset \mathbb{R}^m , \;\;\; n = 1,2,\ldots \f] which converges to the point \f$x^*\f$, that is, the extremum of the target function, when \f$n\f$ is large. The algorithm should be able to find that sequence at least for all functions from a given family. As explained in \cite Mockus94, this search procedure is a sequential decision making problem where point at step \f$n+1\f$ is based on decision \f$d_n\f$ which considers all previous data: \f[ x_{n+1} = d_n(x_{1:n},y_{1:n}) \f] where \f$y_i = f(x_i) + \epsilon_i\f$. For simplicity, many works assume \f$\epsilon_i = 0\f$, that is, function evaluations are deterministic. However, we can easily extend the description to include stochastic functions (e.g.: homoscedastic noise \f$\epsilon_i \sim \mathcal{N}(0,\sigma)\f$). The search method is the sequence of decisions \f$d = {d_0,\ldots, d_{n-1}}\f$, which leads to the final decision \f$x_{n} = x_{n}(d)\f$. In most applications, the objective is to optimize the response of the final decisions. Then, the criteria relies on the \a optimality \a error or \a optimality \a gap, which can be expressed as: \f[ \delta_n(f,d) = f\left(x_n\right) - f(x^*) \f] In other applications, the objective may require to converge to \f$x^*\f$ in the input space. Then, we can use for example the Euclidean distance error: \f[ \delta_n(f,d) = \|x_n - x^*\|_2 \label{eq:dist-error} \f] The previous equations can also be interpreted as variants of the \a loss function for the decision at each step. Thus, the optimal decision is defined as the function that minimizes the loss function: \f[ d_n = \arg \min_d \delta_n(f,d) \f] This requires full knowledge of function \f$f\f$, which is unavailable. Instead, let assume that the target function \f$f = f(x)\f$ belongs to a family of functions \f$f \in F\f$, e.g.: continuous functions in \f$\mathbb{R}^m\f$. Let also assume that the function can be represented as sample from a probability distribution over functions \f$f \sim P(f)\f$. Then, the best response case analysis for the search process is defined as the decision that optimizes the expectation of the loss function: \f[ d^{BR}_n = \arg \min_d \mathbb{E}_{P(f)} \left[ \delta_n(f,d)\right]= \arg \min_d \int_F \delta_n(f,d) \; dP(f) \f] where \f$P\f$ is a prior distribution over functions. However, we can improve the equation considering that, at decision \f$d_n\f$ we have already \a observed the actual response of the function at \f$n-1\f$ points, \f$\{x_{1:n-1},y_{1:n-1}\}\f$. Thus, the prior information of the function can be updated with the observations and the Bayes rule: \f[ P(f|x_{1:n-1},y_{1:n-1}) = \frac{P(x_{1:n-1},y_{1:n-1}|f) P(f)}{P(x_{1:n-1},y_{1:n-1})} \f] In fact, we can actually rewrite the equation to represent the updates sequentially: \f[ P(f|x_{1:i},y_{1:i}) = \frac{P(x_{i},y_{i}|f) P(f|x_{1:i-1},y_{1:i-1})}{P(x_{i},y_{i})}, \qquad \forall \; i=1 \ldots n-1 \f] Thus, the previous equation can be rewritten as: \f[ d^{BO}_n = \arg \min_d \mathbb{E}_{P(f|x_{1:n-1},y_{1:n-1})} \left[ \delta_n(f,d)\right] = \arg \min_d \int_F \delta_n(f,d) \; dP(f|x_{1:n-1},y_{1:n-1}) \f] This equation is the root of Bayesian optimization, where the Bayesian part comes from the fact that we are computing the expectation with respect to the posterior distribution, also called \a belief, over functions. Therefore, Bayesian optimization is a memory-based optimization algorithm. As commented before, most of the theory of Bayesian optimization is related to deterministic functions, we consider also stochastic functions, that is, we assume there might be a random error in the function output. In fact, evaluations can produce different outputs if repeated. In that case, the target function is the expected output. Furthermore, in a recent paper by \cite Gramacy2012 it has been shown that, even for deterministic functions, it is better to assume certain error in the observation. The main reason being that, in practice, there might be some mismodelling errors which can lead to instability of the recursion if neglected. \section modbopt Bayesian optimization general model In order to simplify the description, we are going to use a special case of Bayesian optimization model defined previously which corresponds to the most common application. In subsequent Sections we will introduce some generalizations for different applications. Without loss of generality, consider the problem of finding the minimum of an unknown real valued function \f$f:\mathbb{X} \rightarrow \mathbb{R}\f$, where \f$\mathbb{X}\f$ is a compact space, \f$\mathbb{X} \subset \mathbb{R}^d, d \geq 1\f$. Let \f$P(f)\f$ be a prior distribution over functions represented as a stochastic process, for example, a Gaussian process \f$\mathbf{x}i(\cdot)\f$, with inputs \f$x \in \mathbb{X}\f$ and an associate kernel or covariance function \f$k(\cdot,\cdot)\f$. Let also assume that the target function is a sample of the stochastic process \f$f \sim \mathbf{x}i(\cdot)\f$. In order to find the minimum, the algorithm has a maximum budget of \f$N\f$ evaluations of the target function \f$f\f$. The purpose of the algorithm is to find optimal decisions that provide a better performance at the end. One advantage of using Gaussian processes as a prior distributions over functions is that new observations of the target function \f$(x_i,y_i)\f$ can be easily used to update the distribution over functions. Furthermore, the posterior distribution is also a Gaussian process \f$\mathbf{x}i_i = \left[ \mathbf{x}i(\cdot) | x_{1:i},y_{1:i} \right]\f$. Therefore, the posterior can be used as an informative prior for the next iteration in a recursive algorithm. In a more general setting, many authors have suggested to modify the standard zero-mean Gaussian process for different variations that include semi-parametric models \cite Huang06 \cite Handcock1993 \cite Jones:1998 \cite OHagan1992, use of hyperpriors on the parameters \cite MartinezCantin09AR \cite Brochu:2010c \cite Hoffman2011, Student t processes \cite Gramacy_Polson_2009 \cite Sacks89SS \cite Williams_Santner_Notz_2000, etc. We use a generalized linear model of the form: \f[ f(x) = \phi(\mathbf{x})^T \mathbf{w} + \epsilon(\mathbf{x}) \f] where \f[ \epsilon(\mathbf{x}) \sim \mathcal{NP} \left( 0, \sigma^2_s (\mathbf{K}(\theta) + \sigma^2_n \mathbf{I}) \right) \f] The term \f$\mathcal{NP}\f$ means a non-parametric process, which can make reference to a Gaussian process \f$\mathcal{GP}\f$ or a Student's t process \f$\mathcal{TP}\f$. In both cases, \f$\sigma^2_n\f$ is the observation noise variance, sometimes called nugget, and it is problem specific. Many authors decide to fix this value \f$\sigma^2_n = 0\f$ when the function \f$f(x)\f$ is deterministic, for example, a computer simulation. However, as cleverly pointed out in \cite Gramacy2012, there might be more reasons to include this term appart from being the observation noise, for example, to consider model inaccuracies. This model has been presented in different ways depending on the field where it was used: \li As a generalized linear model \f$\phi(\mathbf{x})^T\mathbf{w}\f$ with heteroscedastic perturbation \f$\epsilon(\mathbf{x})\f$. \li As a nonparametric process of the form \f$\mathcal{NP} \left(\phi(\mathbf{x})^T\mathbf{w}, \sigma^2_s (\mathbf{K}(\theta) + \sigma^2_n \mathbf{I}) \right)\f$. \li As a semiparametric model \f$f(\mathbf{x}) = f_{par}(\mathbf{x}) + f_{nonpar}(\mathbf{x}) = \phi(\mathbf{x})^T\mathbf{w} + \mathcal{NP}(\cdot)\f$ */