Commits

Ruben Martinez-Cantin committed b65843d

Improving docs


Files changed (3)

 
 \section learnmod Methods for learning the kernel parameters  
 
-As commented before, we consider that the prior of the kernel
-hyperparameters \f$\theta\f$ --if available-- is independent of other
-variables. Thus, either if we are going to use maximum likelihood,
-maximum a posteriori or a fully Bayesian approach, we need to find the
-likelihood function of the observed data for the parameters. Depending
-on the model, this function will be a multivariate Gaussian
-distribution or multivariate t distribution. In general, we present
-the likelihood function up to a constant factor, that is, we remove
+The posterior distribution of the model, which is necessary to compute
+the criterion function, cannot be computed in closed form if the
+kernel hyperparameters are unknown. Thus, we need a way to
+approximate this posterior distribution from its closed form
+conditional on the kernel hyperparameters.
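+
+For example, writing \f$\mathcal{D}\f$ for the observed data and
+\f$y\f$ for the prediction at a query point, the quantity we would
+ideally use is the marginal posterior
+\f[
+  p(y|\mathcal{D}) = \int p(y|\mathcal{D},\theta)\,
+                          p(\theta|\mathcal{D})\, d\theta,
+\f]
+where only the conditional term \f$p(y|\mathcal{D},\theta)\f$ is
+available in closed form; the integral over \f$\theta\f$ has to be
+approximated.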
+
+First, we need to consider whether we are going to use a full Bayesian
+approach or an empirical Bayesian approach. The first one computes the
+full posterior distribution by propagating the uncertainty of each
+element and hyperparameter to the posterior. In this case, it can be
+done by discretization of the hyperparameter space or by using MCMC
+(not yet implemented). In theory, it is more precise but the
+computational burden can be orders of magnitude higher. The empirical
+approach, on the other hand, computes a point estimate of the
+hyperparameters based on some score function and uses it as the "true"
+value. Although the uncertainty estimate in this case might not be as
+accurate as in the full Bayesian approach, the computational overhead
+is minimal.
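+
+As a rough sketch of the difference (the exact weighting depends on how
+the hyperparameter samples \f$\theta_1,\ldots,\theta_m\f$ are
+generated), the full Bayesian approach averages the conditional
+posteriors,
+\f[
+  p(y|\mathcal{D}) \approx \sum_{i=1}^m w_i\, p(y|\mathcal{D},\theta_i),
+  \qquad w_i \propto p(\mathcal{D}|\theta_i)\, p(\theta_i),
+\f]
+while the empirical approach keeps a single point estimate
+\f$\hat{\theta} = \arg\max_\theta s(\theta)\f$ of some score function
+\f$s\f$ and plugs \f$p(y|\mathcal{D},\hat{\theta})\f$ in directly.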
+
+For the score function, we need to find the likelihood function of the
+observed data as a function of the parameters. Depending on the model,
+this function will be a multivariate Gaussian distribution or a
+multivariate t distribution. In general, we present the likelihood as a
+log-likelihood function up to a constant factor, that is, we remove
 the terms independent of \f$\theta\f$ from the log likelihood. In
-practice, wether we use ML or MAP point estimates or full Bayes MCMC
-posterior, the constant factor is not needed.
+practice, whether we use a point estimate (maximum score) or a full
+Bayes MCMC/discretized posterior, the constant factor is not needed.
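+
+For instance, in the simplest case of a zero-mean Gaussian model with
+kernel matrix \f$\mathbf{K}_\theta\f$ over the observed inputs and
+observations \f$\mathbf{y}\f$, the log-likelihood up to such a constant
+is
+\f[
+  l(\theta) = -\frac{1}{2}\mathbf{y}^T\mathbf{K}_\theta^{-1}\mathbf{y}
+              -\frac{1}{2}\log|\mathbf{K}_\theta|,
+\f]
+where the \f$-\frac{n}{2}\log 2\pi\f$ term has been dropped because it
+does not depend on \f$\theta\f$. The models in the library also include
+the mean function and signal variance parameters, but the structure of
+the score is analogous.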
 
-We are going to consider the following algorithms to learn the kernel
+We are going to consider the following score functions to learn the kernel
 hyperparameters:
 
-\li Cross-validation (L_LOO): In this case, we try to maximize the
+\li Leave-one-out cross-validation (SC_LOOCV): In this case, we try to maximize the
 average predicted log probability by the <em>leave one out</em> (LOO)
-strategy. This is sometimes called a pseudo-likelihood.
+cross-validation strategy. This is sometimes called a pseudo-likelihood.
 
-\li Maximum Likelihood (L_ML) For any of the models presented, one
+\li Maximum Total Likelihood (SC_MTL): For any of the models presented, one
 approach to learn the hyperparameters is to maximize the likelihood of
 all the parameters \f$\mathbf{w}\f$, \f$\sigma_s^2\f$ and
 \f$\theta\f$. Then, the likelihood function is a multivariate Gaussian
 distribution. Marginalizing out the mean function parameters
 \f$\mathbf{w}\f$ yields a likelihood with fewer degrees of freedom;
 this is called <em>restricted maximum likelihood</em>. The library
 automatically selects the restricted version if it is suitable.
-\li Posterior maximum likelihood (L_MAP): In this case, the likelihood
+
+\li Posterior maximum likelihood (SC_ML): In this case, the likelihood
 function is modified to consider the posterior estimate of
 \f$(\mathbf{w},\sigma_s^2)\f$ based on the different cases defined in
 Section \ref surrmods. The resulting function will be a
 multivariate Gaussian or t distribution, depending on the kind of
 prior used for \f$\sigma_s^2\f$.
 
-\li Maximum a posteriori (L_ML or L_MAP): We can modify any of the
-previous algorithms by adding a prior distribution \f$p(\theta)\f$. By
-default, we add a normal prior on the kernel hyperparameters. However,
-if the variance of the prior \a hp_std is invalid (<=0), then we
-assume no prior. Since we assume that the hyperparameters are independent,
-we can apply priors selectively only to a small set.
+\li Maximum a posteriori (SC_MAP): We can modify any of the previous
+score functions by adding a prior distribution \f$p(\theta)\f$. By
+default, we add a joint normal prior on all the kernel
+hyperparameters. However, if the standard deviation of the prior
+\a hp_std is invalid (<=0), then we assume a flat prior on that
+hyperparameter. Since we assume that the hyperparameters are
+independent, we can apply priors selectively to only a subset of them
+(see the sketch after this list).
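+
+As a sketch of how SC_MAP modifies any of the previous scores: if
+\f$l(\theta)\f$ denotes one of the log-scores above and each
+hyperparameter has an independent prior \f$p(\theta_j)\f$ (normal by
+default, flat where disabled), the function to maximize becomes, up to
+constants,
+\f[
+  s_{MAP}(\theta) = l(\theta) + \sum_j \log p(\theta_j).
+\f]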
 
 \section initdes Initial design methods
 

doxygen/reference.dox

 \subsection initpar Initialization parameters
 
 \li \b init_method: (unsigned integer value) For continuous
-optimization, we can choose among diferent strategies for the initial
+optimization, we can choose among different strategies for the initial
 design (1-Latin Hypercube Sampling (LHS), 2-Sobol sequences (if available,
 see \ref mininst), Other-Uniform Sampling) [Default 1, LHS].
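+
+For illustration, a minimal sketch of selecting Sobol sequences for the
+initial design through the C interface (this assumes the usual
+bopt_params structure and the initialize_parameters_to_default() helper
+declared in parameters.h):
+
+\code
+#include "parameters.h"  /* bopt_params, initialize_parameters_to_default() */
+
+int main(void)
+{
+  bopt_params par = initialize_parameters_to_default();
+  par.init_method = 2;     /* 2 = Sobol sequences, if available */
+  /* ... set the remaining options and call the optimizer here ... */
+  return 0;
+}
+\endcode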
 
 
 \subsection hyperlearn Hyperparameter learning
 
-Although BayesOpt tries to build a full analytical Bayesian model for
+Although BayesOpt tries to build a full analytic Bayesian model for
 the surrogate function, some hyperparameters cannot be estimated in
 closed form. Currently, the only parameters of BayesOpt models that
 require special treatment are the kernel hyperparameters. See Section
 functions like "cHedge(cEI,cLCB,cPOI,cThompsonSampling)". See section
 critmod for the different possibilities. [Default: "cEI"]
 \li \b crit_params, \b n_crit_params: Array with the set of parameters
-for the selected criteria. If there are more than one criterium, the
+for the selected criteria. If there is more than one criterion, the
 parameters are split among them according to the number of parameters
 required for each criterion. If n_crit_params is 0, then the default
 parameter is selected for each criterion. [Default: n_crit_params = 0]
 combination of functions like "kSum(kSEARD,kMaternARD3)". See section
 kermod for the different possibilities. [Default: "kMaternISO3"]
 \li \b kernel.hp_mean, \b kernel.hp_std, \b kernel.n_hp: Kernel
-hyperparameters prior. Any "ilegal" standard deviation (std<=0)
-results in a maximum likelihood estimate. Depends on the kernel
-selected. If there are more than one, the parameters are split among
-them according to the number of parameters required for each
-criterion. [Default: "1.0, 10.0, 1" ]
+hyperparameters prior in log space. That is, if the hyperparameters
+are \f$\theta\f$, this prior is \f$p(\log(\theta))\f$. Any
+"illegal" standard deviation (std<=0) results in a maximum likelihood
+estimate. The number of hyperparameters depends on the kernel
+selected. If there is more than one kernel, the parameters are split
+among them according to the number of hyperparameters required by each
+kernel. [Default: "1.0, 10.0, 1" ]
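+
+For illustration, a sketch of setting this prior from the C interface
+for a kernel with a single hyperparameter, such as the default
+kMaternISO3 (this assumes the kernel member of bopt_params follows the
+kernel_parameters struct shown below):
+
+\code
+  bopt_params par = initialize_parameters_to_default();
+  /* Normal prior on the log-hyperparameter: mean 1.0, std 10.0.
+     A std <= 0 would instead trigger a maximum likelihood estimate. */
+  par.kernel.hp_mean[0] = 1.0;
+  par.kernel.hp_std[0]  = 10.0;
+  par.kernel.n_hp       = 1;
+\endcode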
 
 \subsection meanpar Mean function parameters
 
 \li \b mean.name: Name of the mean function. Could be a combination of
-functions like "mSum(mOne, mLinear)". See section parmod for the different
+functions like "mSum(mOne, mLinear)". See Section parmod for the different
 possibilities. [Default: "mOne"]
 \li \b mean.coef_mean, \b mean.coef_std, \b mean.n_coef: Mean
 function coefficients. The standard deviation is only used if the
   /** Kernel configuration parameters */
   typedef struct {
     char*  name;                 /**< Name of the kernel function */
-    double hp_mean[128];         /**< Kernel hyperparameters prior (mean) */
-    double hp_std[128];          /**< Kernel hyperparameters prior (st dev) */
+    double hp_mean[128];         /**< Kernel hyperparameters prior (mean, log space) */
+    double hp_std[128];          /**< Kernel hyperparameters prior (st dev, log space) */
     size_t n_hp;                 /**< Number of kernel hyperparameters */
   } kernel_parameters;