--- title: Which optimization algorithm to choose? author: Marie Laure Delignette Muller, Christophe Dutang date: '`r Sys.Date()`' output: bookdown::html_document2: base_format: rmarkdown::html_vignette fig_caption: yes toc: true number_sections: yes link-citations: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Which optimization algorithm to choose?} %!\VignetteEncoding{UTF-8} \usepackage[utf8]{inputenc} pkgdown: as_is: true --- ```{r setup, echo=FALSE, message=FALSE, warning=FALSE} require(fitdistrplus) require(knitr) #for kable() function set.seed(12345) options(digits = 3) ``` # Quick overview of main optimization methods We present very quickly the main optimization methods. Please refer to **Numerical Optimization (Nocedal \& Wright, 2006)** or **Numerical Optimization: theoretical and practical aspects (Bonnans, Gilbert, Lemarechal \& Sagastizabal, 2006)** for a good introduction. We consider the following problem $\min_x f(x)$ for $x\in\mathbb{R}^n$. ## Derivative-free optimization methods The Nelder-Mead method is one of the most well known derivative-free methods that use only values of $f$ to search for the minimum. It consists in building a simplex of $n+1$ points and moving/shrinking this simplex into the good direction. 1. set initial points $x_1, \dots, x_{n+1}$. 2. order points such that $f(x_1)\leq f(x_2)\leq\dots\leq f(x_{n+1})$. 3. compute $x_o$ as the centroid of $x_1, \dots, x_{n}$. 4. Reflection: + compute the reflected point $x_r = x_o + \alpha(x_o-x_{n+1})$. + **if** $f(x_1)\leq f(x_r)1$, once initiated by $d_1 = -g(x_1)$. $\beta_k$ are updated according a scheme: * $\beta_k = \frac{ g_k^T g_k}{g_{k-1}^T g_{k-1} }$: Fletcher-Reeves update, * $\beta_k = \frac{ g_k^T (g_k-g_{k-1} )}{g_{k-1}^T g_{k-1}}$: Polak-Ribiere update. There exists also three-term formula for computing direction $d_k = -g(x_k) + \beta_k d_{k-1}+\gamma_{k} d_t$ for $tt+1$ otherwise $\gamma_k=0$ if $k=t$. See Yuan (2006) for other well-known schemes such as Hestenses-Stiefel, Dixon or Conjugate-Descent. The three updates (Fletcher-Reeves, Polak-Ribiere, Beale-Sorenson) of the (non-linear) conjugate gradient are available in `optim`. ### Computing the stepsize $t_k$ Let $\phi_k(t) = f(x_k + t d_k)$ for a given direction/iterate $(d_k, x_k)$. We need to find conditions to find a satisfactory stepsize $t_k$. In literature, we consider the descent condition: $\phi_k'(0) < 0$ and the Armijo condition: $\phi_k(t) \leq \phi_k(0) + t c_1 \phi_k'(0)$ ensures a decrease of $f$. Nocedal \& Wright (2006) presents a backtracking (or geometric) approach satisfying the Armijo condition and minimal condition, i.e. Goldstein and Price condition. * set $t_{k,0}$ e.g. 1, $0 < \alpha < 1$, * **Repeat** until Armijo satisfied, + $t_{k,i+1} = \alpha \times t_{k,i}$. * **end Repeat** This backtracking linesearch is available in `optim`. ## Benchmark To simplify the benchmark of optimization methods, we create a `fitbench` function that computes the desired estimation method for all optimization methods. This function is currently not exported in the package. ```{r, echo=TRUE, eval=FALSE} fitbench <- function(data, distr, method, grad = NULL, control = list(trace = 0, REPORT = 1, maxit = 1000), lower = -Inf, upper = +Inf, ...) ``` ```{r, echo=FALSE} fitbench <- fitdistrplus:::fitbench ``` # Numerical illustration with the beta distribution ## Log-likelihood function and its gradient for beta distribution ### Theoretical value The density of the beta distribution is given by $$ f(x; \delta_1,\delta_2) = \frac{x^{\delta_1-1}(1-x)^{\delta_2-1}}{\beta(\delta_1,\delta_2)}, $$ where $\beta$ denotes the beta function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. We recall that $\beta(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$. There the log-likelihood for a set of observations $(x_1,\dots,x_n)$ is $$ \log L(\delta_1,\delta_2) = (\delta_1-1)\sum_{i=1}^n\log(x_i)+ (\delta_2-1)\sum_{i=1}^n\log(1-x_i)+ n \log(\beta(\delta_1,\delta_2)) $$ The gradient with respect to $a$ and $b$ is $$ \nabla \log L(\delta_1,\delta_2) = \left(\begin{matrix} \sum\limits_{i=1}^n\ln(x_i) - n\psi(\delta_1)+n\psi( \delta_1+\delta_2) \\ \sum\limits_{i=1}^n\ln(1-x_i)- n\psi(\delta_2)+n\psi( \delta_1+\delta_2) \end{matrix}\right), $$ where $\psi(x)=\Gamma'(x)/\Gamma(x)$ is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. ### `R` implementation As in the `fitdistrplus` package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in `grlnL`. Both the log-likelihood and its gradient are not exported. ```{r} lnL <- function(par, fix.arg, obs, ddistnam) fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam) grlnlbeta <- fitdistrplus:::grlnlbeta ``` ## Random generation of a sample ```{r, fig.height=4, fig.width=4} #(1) beta distribution n <- 200 x <- rbeta(n, 3, 3/4) grlnlbeta(c(3, 4), x) #test hist(x, prob=TRUE, xlim=0:1) lines(density(x), col="red") curve(dbeta(x, 3, 3/4), col="green", add=TRUE) legend("topleft", lty=1, col=c("red","green"), legend=c("empirical", "theoretical"), bty="n") ``` ## Fit Beta distribution Define control parameters. ```{r} ctr <- list(trace=0, REPORT=1, maxit=1000) ``` Call `mledist` with the default optimization function (`optim` implemented in `stats` package) with and without the gradient for the different optimization methods. ```{r} unconstropt <- fitbench(x, "beta", "mle", grad=grlnlbeta, lower=0) ``` In the case of constrained optimization, `mledist` permits the direct use of `constrOptim` function (still implemented in `stats` package) that allow linear inequality constraints by using a logarithmic barrier. Use a exp/log transformation of the shape parameters $\delta_1$ and $\delta_2$ to ensure that the shape parameters are strictly positive. ```{r} dbeta2 <- function(x, shape1, shape2, log) dbeta(x, exp(shape1), exp(shape2), log=log) #take the log of the starting values startarg <- lapply(fitdistrplus:::startargdefault(x, "beta"), log) #redefine the gradient for the new parametrization grbetaexp <- function(par, obs, ...) grlnlbeta(exp(par), obs) * exp(par) expopt <- fitbench(x, distr="beta2", method="mle", grad=grbetaexp, start=startarg) #get back to original parametrization expopt[c("fitted shape1", "fitted shape2"), ] <- exp(expopt[c("fitted shape1", "fitted shape2"), ]) ``` Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood and the number of counts to the function to minimize and its gradient (whether it is the theoretical gradient or the numerically approximated one). ## Results of the numerical investigation Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (`-B` stands for bounded version), (2) the original parametrization with the (true) gradient (`-B` stands for bounded version and `-G` for gradient), (3) the log-transformed parametrization without specifying the gradient, (4) the log-transformed parametrization with the (true) gradient (`-G` stands for gradient). ```{r, results='asis', echo=FALSE} kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3, caption="Unconstrained optimization with approximated gradient") ``` ```{r, results='asis', echo=FALSE} kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3, caption="Unconstrained optimization with true gradient") ``` ```{r, results='asis', echo=FALSE} kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3, caption="Exponential trick optimization with approximated gradient") ``` ```{r, results='asis', echo=FALSE} kable(expopt[, grep("G-", colnames(expopt))], digits=3, caption="Exponential trick optimization with true gradient") ``` Using `llsurface`, we plot the log-likehood surface around the true value (green) and the fitted parameters (red). ```{r, fig.width=4, fig.height=4} llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), xlim=c(.1,7), plot.arg=c("shape1", "shape2"), nlev=25, lseq=50, data=x, distr="beta", back.col = FALSE) points(unconstropt[1,"BFGS"], unconstropt[2,"BFGS"], pch="+", col="red") points(3, 3/4, pch="x", col="green") ``` We can simulate bootstrap replicates using the `bootdist` function. ```{r, fig.width=4, fig.height=4} b1 <- bootdist(fitdist(x, "beta", method = "mle", optim.method = "BFGS"), niter = 100, parallel = "snow", ncpus = 2) summary(b1) plot(b1, trueval = c(3, 3/4)) ``` # Numerical illustration with the negative binomial distribution ## Log-likelihood function and its gradient for negative binomial distribution ### Theoretical value The p.m.f. of the Negative binomial distribution is given by $$ f(x; m,p) = \frac{\Gamma(x+m)}{\Gamma(m)x!} p^m (1-p)^x, $$ where $\Gamma$ denotes the beta function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. There exists an alternative representation where $\mu=m (1-p)/p$ or equivalently $p=m/(m+\mu)$. Thus, the log-likelihood for a set of observations $(x_1,\dots,x_n)$ is $$ \log L(m,p) = \sum_{i=1}^{n} \log\Gamma(x_i+m) -n\log\Gamma(m) -\sum_{i=1}^{n} \log(x_i!) + mn\log(p) +\sum_{i=1}^{n} {x_i}\log(1-p) $$ The gradient with respect to $m$ and $p$ is $$ \nabla \log L(m,p) = \left(\begin{matrix} \sum_{i=1}^{n} \psi(x_i+m) -n \psi(m) + n\log(p) \\ mn/p -\sum_{i=1}^{n} {x_i}/(1-p) \end{matrix}\right), $$ where $\psi(x)=\Gamma'(x)/\Gamma(x)$ is the digamma function, see the NIST Handbook of mathematical functions https://dlmf.nist.gov/. ### `R` implementation As in the `fitdistrplus` package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in `grlnL`. ```{r} grlnlNB <- function(x, obs, ...) { m <- x[1] p <- x[2] n <- length(obs) c(sum(psigamma(obs+m)) - n*psigamma(m) + n*log(p), m*n/p - sum(obs)/(1-p)) } ``` ## Random generation of a sample ```{r, fig.height=4, fig.width=4} #(2) negative binomial distribution n <- 200 trueval <- c("size"=10, "prob"=3/4, "mu"=10/3) x <- rnbinom(n, trueval["size"], trueval["prob"]) hist(x, prob=TRUE, ylim=c(0, .3), xlim=c(0, 10)) lines(density(x), col="red") points(min(x):max(x), dnbinom(min(x):max(x), trueval["size"], trueval["prob"]), col = "green") legend("topright", lty = 1, col = c("red", "green"), legend = c("empirical", "theoretical"), bty="n") ``` ## Fit a negative binomial distribution Define control parameters and make the benchmark. ```{r} ctr <- list(trace = 0, REPORT = 1, maxit = 1000) unconstropt <- fitbench(x, "nbinom", "mle", grad = grlnlNB, lower = 0) unconstropt <- rbind(unconstropt, "fitted prob" = unconstropt["fitted mu", ] / (1 + unconstropt["fitted mu", ])) ``` In the case of constrained optimization, `mledist` permits the direct use of `constrOptim` function (still implemented in `stats` package) that allow linear inequality constraints by using a logarithmic barrier. Use a exp/log transformation of the shape parameters $\delta_1$ and $\delta_2$ to ensure that the shape parameters are strictly positive. ```{r} dnbinom2 <- function(x, size, prob, log) dnbinom(x, exp(size), 1 / (1 + exp(-prob)), log = log) # transform starting values startarg <- fitdistrplus:::startargdefault(x, "nbinom") startarg$mu <- startarg$size / (startarg$size + startarg$mu) startarg <- list(size = log(startarg[[1]]), prob = log(startarg[[2]] / (1 - startarg[[2]]))) # redefine the gradient for the new parametrization Trans <- function(x) c(exp(x[1]), plogis(x[2])) grNBexp <- function(par, obs, ...) grlnlNB(Trans(par), obs) * c(exp(par[1]), plogis(x[2])*(1-plogis(x[2]))) expopt <- fitbench(x, distr="nbinom2", method="mle", grad=grNBexp, start=startarg) # get back to original parametrization expopt[c("fitted size", "fitted prob"), ] <- apply(expopt[c("fitted size", "fitted prob"), ], 2, Trans) ``` Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood and the number of counts to the function to minimize and its gradient (whether it is the theoretical gradient or the numerically approximated one). ## Results of the numerical investigation Results are displayed in the following tables: (1) the original parametrization without specifying the gradient (`-B` stands for bounded version), (2) the original parametrization with the (true) gradient (`-B` stands for bounded version and `-G` for gradient), (3) the log-transformed parametrization without specifying the gradient, (4) the log-transformed parametrization with the (true) gradient (`-G` stands for gradient). ```{r, echo=FALSE} kable(unconstropt[, grep("G-", colnames(unconstropt), invert=TRUE)], digits=3, caption="Unconstrained optimization with approximated gradient") ``` ```{r, results='asis', echo=FALSE} kable(unconstropt[, grep("G-", colnames(unconstropt))], digits=3, caption="Unconstrained optimization with true gradient") ``` ```{r, results='asis', echo=FALSE} kable(expopt[, grep("G-", colnames(expopt), invert=TRUE)], digits=3, caption="Exponential trick optimization with approximated gradient") ``` ```{r, results='asis', echo=FALSE} kable(expopt[, grep("G-", colnames(expopt))], digits=3, caption="Exponential trick optimization with true gradient") ``` Using `llsurface`, we plot the log-likehood surface around the true value (green) and the fitted parameters (red). ```{r, fig.width=4, fig.height=4} llsurface(min.arg = c(5, 0.3), max.arg = c(15, 1), xlim=c(5, 15), plot.arg = c("size", "prob"), nlev = 25, lseq = 50, data = x, distr = "nbinom", back.col = FALSE) points(unconstropt["fitted size", "BFGS"], unconstropt["fitted prob", "BFGS"], pch = "+", col = "red") points(trueval["size"], trueval["prob"], pch = "x", col = "green") ``` We can simulate bootstrap replicates using the `bootdist` function. ```{r, fig.width=4, fig.height=4} b1 <- bootdist(fitdist(x, "nbinom", method = "mle", optim.method = "BFGS"), niter = 100, parallel = "snow", ncpus = 2) summary(b1) plot(b1, trueval=trueval[c("size", "mu")]) ``` # Conclusion Based on the two previous examples, we observe that all methods converge to the same point. This is reassuring. However, the number of function evaluations (and the gradient evaluations) is very different from a method to another. Furthermore, specifying the true gradient of the log-likelihood does not help at all the fitting procedure and generally slows down the convergence. Generally, the best method is the standard BFGS method or the BFGS method with the exponential transformation of the parameters. Since the exponential function is differentiable, the asymptotic properties are still preserved (by the Delta method) but for finite-sample this may produce a small bias.