Keywords: confidence regions, residuals, diagnostic tools, resampling techniques, starting values, 6-minute walk test, nonlinear regression, diagnostic tools,
Nonlinear regression is used routinely in a wide range of biological disciplines including pharmacology, toxicology, biochemistry, ecology, microbiology and medicine . However, statistical software programs do not always include user-friendly routines or modules for fitting nonlinear regression models; this means that researchers often choose to use inappropriate approaches such as polynomial regression or data segmentation (with arbitrary trimming of data), i.e., approaches easily carried out in any statistical software by means of linear regression. On the other hand, specialized commercial programs are available, but they are not always sufficiently flexible or intuitive .
In addition to limitations in software availability, several other difficulties arise when using nonlinear regression. Like in linear regression, nonlinear regression provides parameter estimates based on the least-squares criterion. However, unlike linear regression, no explicit mathematical solution is available and specific algorithms are needed to solve the minimization problem, involving iterative numerical approximations. Unfortunately, minimization, or optimization in general, is not always a straightforward exercise for such models due to the nonlinear relationships between parameters . Consequently, obtaining useful nonlinear regression model fits may often require careful specification of model details, critical appraisal of the resulting output, and perhaps also use of summary statistics that do not rely too heavily on the model assumptions.
Therefore nonlinear regression may appear to be more daunting than linear regression. It requires a higher degree of user interaction, both for initializing the estimation procedure and for interpreting the results. The package offers tools for addressing these steps when fitting nonlinear regression models using (function implemented in the package ). is available on the Comprehensive Archive Network at .
Specifically, there are three key issues that are often causing problems when using nonlinear regression in practice:
The iterative estimation procedure requires initial values of the model parameters. These so-called starting values need to be relatively close to the unknown parameter estimates in order to avoid convergence problems where the iterative procedure fails to approach the optimal parameter values. Consequently, a clear understanding of the model features and, in particular, the meaning or interpretation of its parameters would be desirable for ensuring uncomplicated model fitting. In practice researchers may find it difficult to convert such knowledge into an operational format suitable for statistical software programs. Within the statistical environment , a number of extension packages provide ways to get around having to come up with starting values. The package provides a number of ways to do grid search among candidate starting values and the resulting object may be fed directly into . Although the use of a grid search gives the user some flexibility in the definition of the starting values, a range for each model parameter still has to be provided and this may still be a challenge when balancing against the computational burden of an exhaustive search. Specifically for dose-response and growth curve modeling, the packages , , and among others offer infrastructure for automatically providing data-driven, informed starting values so that the user need not think about providing suitable starting values. The idea behind such self starter routines is explained by . To our knowledge this idea has not been implemented in any other statistical software. However, in many cases no self starter routines are available. This means users may often need to adopt a manual trial-and-error approach in order to ensure an optimal model fit. provides functionality to assist in fitting models.
The validity of nonlinear regression model fits must be carefully evaluated by means of appropriate diagnostic and graphical tools. One of the reasons is that sometimes the algorithms used for parameter estimation return sub-optimal estimates, simply because the iterative procedure was not successful in converging to the optimal estimates (often caused by poor starting values or too complex model equations for the data at hand). However, these after-fitting validation steps, which cannot be easily automated, are often neglected because of the lack of dedicated functionality. The package (the project: ) provides some model checking functionality for specific nonlinear regression models (e.g., function ). provides a range of model diagnostics that will work with any model fit.
Moreover, the standard confidence intervals for model parameters in nonlinear regression models are derived assuming local linearity and normally distributed parameter estimates, e.g., . In practice, these assumptions may not always be satisfied; this may in particular be the case for small data sets. For deriving confidence intervals, the method in the package provides likelihood profiling that does not rely on the linearization step . The use of non-parametric resampling techniques for assessing the uncertainty of parameter estimates will even rely less on asymptotic distributions . provides such a non-parametric alternative.
In a nonlinear regression context there are several other ways to put less emphasis on the distributional assumptions. One is the use of sandwich estimators for the standard errors (in case of suspicion of misspecification of the distribution in general) . Another is to use a robust estimation procedure to avoid that singleton data points get to much influence on the model fit, e.g., using the function in the package . There are also several ways to accommodate non-standard distributional assumptions. The packages and (the model fitting functions have the same names) allow flexible fitting of various extensions of the nonlinear regression model in terms of the distributions considered for the response as well as the correlation structures needed to describe dependencies between response values, respectively . However, the challenges in particular related to choosing starting values but also partly concerning model checking (points 2) and 3) above) remain. So offers supplementary functionality that is generally applicable for nonlinear regression analysis.
Section briefly outlines the background for nonlinear regression. In Section we give a detailed introduction to the salient features of using an example from pulmonary medicine. In Section we provide some concluding remarks.
We consider standard nonlinear regression models of the following form:
with y being the response (the dependent variable), x the (possibly multivariate) independent variable, which is often controlled by the experimenter, θ the vector of model parameters characterizing the relationship between x and y through the function f, and ϵ the residual error term that is assumed to be normally distributed, centered around 0 and with unknown variance (σ2). Furthermore, we assume that the residual error terms are mutually independent as is usually assumed for standard nonlinear regression analysis . In , this nonlinear regression model may be fitted using in the standard installation (the package ). Parameter estimation is based on an iterative procedure that involves a linearization approximation leading to a least-squares problem at each step.
Note that functions and in allow fitting of nonlinear regression models for several curves corresponding to different covariate configurations (such as different treatments) and thus necessitating the use of correlation structures (e.g., random effects) . However, for building these more complex models (i.e., obtaining model fits that converge), is often used initially to produce fits of individual curves, which may then subsequently be combined and supplied to enable fitting more complex nonlinear regression models (e.g., through the use of the wrapper ).
The package provides a number of tools to facilitate fitting standard nonlinear regression models (Equation ()) and is specifically designed to work directly with . The package contains functions and graphical tools that will help users to create objects and carry out various diagnostic tests. More specifically, the toolbox will assist users in:We will elaborate on these features in the next section, using a concrete data example from pulmonary medicine.
In order to illustrate the features of the package , a worked example is taken from pulmonary medicine . Exercise testing is carried out on patients with a broad range of pulmonary diseases . The clinical relevance of exercise testing is well established. For instance, the 6-minute walk test (6MWT) is performed, allowing to monitor the change in oxygen uptake over time. It is well known that exercise capacity as assessed by 6MWT correlates with impairment of daily life activities and patients prognosis in several pulmonary conditions . Peak oxygen uptake has predictive value in patients with pulmonary hypertension and is an indicator of operability in patients with pulmonary diseases.
Data from a typical oxygen uptake kinetics profile are shown in Figure . The change of oxygen uptake (VO2) during exercise testing is classically monitored in 3 distinct phases including a resting phase, the 6-minute exercise testing period, and a recovery period. VO2 kinetics are classically characterized by a series of parameters including the oxygen uptake in the resting phase (VO2rest), the maximum oxygen uptake during exercise (VO2peak), the rate of oxygen increase between VO2rest and VO2peak. Subsequent parameters of clinical importance are derived from these initial parameters. Oxygen deficit (O2def) is defined as the area between an instantaneous increase of oxygen to the maximum upper limit and the observed asymptotic rise of oxygen (Figure ). Mean response time (MRT) is the time constant of an exponential function describing the rate of oxygen increase. It corresponds to the time needed to attain approximately 63% of the asymptotic value VO2peak , and is defined as follows: MRT = O2def/ΔVO2, with ΔVO2 = VO2peak − VO2rest.
The length of the resting phase (λ) is controlled by the experimenter and does not need to be estimated. Considering λ constant, the following 3-parameter asymptotic regression model with a lag period (Equation ) is suitable for describing the first 2 phases of the VO2 kinetics:
with VO2rest the oxygen level during the resting phase, VO2peak, the maximum oxygen uptake during exercise testing, μ > 0 the rate of change characterizing the steepness of the increase as time (t) elapses (the larger the more steep is the curve), and λ the duration of the resting time controlled by the experimenter. Other examples of segmented regression models are the hockey stick model and the no-effect-concentration model occasionally used in ecotoxicology . For the latter, a self-starter routine is available in package but for the former the user will need to provide starting values by himself as explained by . For that specific purpose, the functionality provided by may prove particularly useful.
As mentioned in the introduction, fitting nonlinear regression models requires the provision of starting values for model parameters. A poor choice of starting values may cause non-convergence or convergence to an unwanted local (rather than global) minimum when trying to minimize the least-squares criterion. Biologically interpretable parameter often allows the user to guess adequate starting values by assessing (often graphically) a set of plausible candidate model parameter values. For this purpose, provides the graphical function , which can be used to assess the suitability of the chosen starting values, prior to fitting the model. This graphical approach for assessing candidate starting values is also used by , but it was not wrapped up in a single function. Below is an example of usage. First, you should specify the model equation to be used in the nonlinear regression as a formula in . Use this formula as first argument of the function , then supply the name of your dataset as second argument, and finally provide a list of values for the model parameters as third argument. An additional argument can be used to specify which independent variable is plotted against the dependent variable (column index of the original dataset; default is 1) when more than one independent variable is modeled.
##
## 'nlstools' has been loaded.
## IMPORTANT NOTICE: Most nonlinear regression models and data set examples
## related to predictive microbiolgy have been moved to the package 'nlsMicrobio'
formulaExp <- as.formula(VO2 ~ (t <= 5.883) * VO2rest + (t > 5.883) *
(VO2rest + (VO2peak - VO2rest) *
(1 - exp(-(t - 5.883) / mu))))
preview(formulaExp, data = O2K,
start = list(VO2rest = 400, VO2peak = 1600, mu = 1))
##
## RSS: 149000
Both the oxygen levels during the resting phase (the parameter VO2rest) and the maximum oxygen level reached during the 6MWT (the parameter VO2peak) can be retrieved directly in an approximate fashion from Figure , whereas μ is simply initially set to 1. Note that the length of the resting phase (λ = 5.883 min) was hardcoded into the above formula. Figure shows good agreement between the data and the theoretical model based on the provided set of starting values. Judged by the figure, the chosen starting values seem to be suitable for initializing . Note that next to the plot, the residual sum of squares measuring the discrepancy between the model (based on the chosen starting values) and the observed data is provided. This value gives an idea of the magnitude of the residual sum of squares to expect from the model fit based on .
Once suitable starting values are obtained, the model may be fitted using and then the function in may be used for providing a single display with all relevant pieces of information about the model fit.
Specifically, returns output containing:##
## ------
## Formula: VO2 ~ (t <= 5.883) * VO2rest + (t > 5.883) * (VO2rest + (VO2peak -
## VO2rest) * (1 - exp(-(t - 5.883)/mu)))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## VO2rest 3.568e+02 1.141e+01 31.26 <2e-16 ***
## VO2peak 1.631e+03 2.149e+01 75.88 <2e-16 ***
## mu 1.186e+00 7.661e-02 15.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.59 on 33 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 7.6e-06
##
## ------
## Residual sum of squares: 81200
##
## ------
## t-based confidence interval:
## 2.5% 97.5%
## VO2rest 333.537401 379.980302
## VO2peak 1587.155299 1674.611700
## mu 1.030255 1.342002
##
## ------
## Correlation matrix:
## VO2rest VO2peak mu
## VO2rest 1.00000000 0.07907045 0.1995377
## VO2peak 0.07907045 1.00000000 0.7554924
## mu 0.19953772 0.75549241 1.0000000
In order to facilitate the visualization of the model fit together with the data, provides the function , which offers functionality similar to with a simple linear regression model fit as argument. Thus avoids manual definition of a grid of values for the independent variable, subsequent prediction, and use of .
The function superimposes the fitted curve on top of the plot of the data (Figure ).
Notice that the argument provides a smoothed representation of the fitted regression curve. This option is only available when one single (one-dimensional) independent variable is involved. For plots of a model fit involving more than one independent variable (e.g., see worked example in ), it is necessary to specify the argument in the function to indicate which variable is used for the x axis. In such a case, no smoothing is possible as it would also depend on the other independent variables, i.e., .
An examination of the quality of the obtained nonlinear regression model fit may be based on the residuals calculated from the fit as follows:
ϵ̂ = y − f(θ̂, x)
Standardized residuals are obtained by dividing the centered residuals by the residual standard error. provides the function , which extracts the residuals from an object.
The corresponding method allows a convenient display of the diagnostic plots outlined by . Specifically, produces by default a four-panel display:
The argument may be used to choose which diagnostic plots should be shown (there are 6 options in total as explained in the help page). For the model fit the resulting plots are shown in Figure .
Figure shows no indication of problems with the model assumptions as the residuals seem to be approximately normally distributed (a clear alignment along the diagonal in the QQ plot) and without evidence of autocorrelation or heteroscedastic variance.
In addition to the visual assessment of the model assumptions, the normality of residuals may be evaluated using the Shapiro-Wilk test . This test is one of the most powerful tests of normality (the function is part of the package in the standard installation). Similarly, the lack of autocorrelation in residuals may be assessed by means of the runs test , using the function in the package . However, note that this is not a very strong test as it essentially only utilizes the signs of the residuals but not their actual values . We consider these tests as supplements that are occasionally useful next to the routine visual assessment of the model assumptions. Both tests are available through the function .
##
## ------
## Shapiro-Wilk normality test
##
## data: stdres
## W = 0.95205, p-value = 0.1214
##
##
## ------
##
## Runs Test
##
## data: as.factor(run)
## Standard Normal = 0.76123, p-value = 0.4465
## alternative hypothesis: two.sided
In our example, the null hypothesis of normal distribution could not be rejected (Shapiro-Wilk test: p = 0.12) and there was also no indication of autocorrelation (runs test: p = 0.45).
We define the 1 − α joint confidence region for the model parameters by means of the following inequality. A given set of parameters θ is included in the confidence region if the corresponding residual sum of squares (RSS) is lying within the margin defined in the following Equation (often referred to as Beale’s criterion).
with RSSmin the minimum residual sum of squares obtained from the least-squares estimation (previously defined for the function ), F1 − α the appropriate quantile of the F-distribution with (p, n − p) degrees of freedom, where n is the number of observations and p the number of model parameters in f . Two functions are implemented in for visualizing the joint confidence region defined in Equation (), one for showing contours and another one for showing projections.
For each pair of parameters the function provides two-dimensional contours of the p-dimensional joint confidence region using a grid while keeping the remaining p − 2 parameters fixed at their least-squares estimates. The number of contour levels is defined by the user using the argument . The RSS contours can be used both to assess the structural correlation among model parameters (a graphical analog to the correlation matrix) and to detect the presence of unexpected multiple minima, which could indicate that sub-optimal parameter estimates were obtained and perhaps the model should be fitted again with different starting values.
For the model fit , Figure (left panel) shows the RSS contours for the three pairs of two model parameters. The resulting two-dimensional 95% confidence regions, which correspond to specific contours are also shown (in dotted red lines). These contours, which are expected to be close to elliptical curves as long as the error model is valid, allow an evaluation of the two-dimensional confidence regions as compared to the one-dimensional confidence intervals that are routinely used. In particular, we get an insight on the extent of overlap between one-dimensional intervals and two-dimensional regions. For instance, for the pair μ and VO2peak, the projections of the elliptical confidence region onto the axes result in marginal confidence intervals that are wider as compared to the standard one-dimensional confidence intervals shown in the output from on page . This means that standard one-dimensional confidence intervals seem to be too narrow (insufficient coverage).
O2K.cont1 <- nlsContourRSS(O2K.nls1)
plot(O2K.cont1, col = FALSE, nlev = 5)
O2K.conf1 <- nlsConfRegions(O2K.nls1, exp = 2, length = 2000)
plot(O2K.conf1, bounds = TRUE)
The function allows users to plot another representation of the Beale’s confidence region, also known as joint parameter likelihood region . The method consists in randomly sampling parameter values in a hypercube centered around the least-squares estimates. A set of parameters is acceptable if the resulting residual sum of squares satisfies Beale’s criterion (Equation ). As soon as the specified number of points to be in the confidence region is reached (argument in ), the iterative sampling is stopped. The confidence region is then plotted by projection of the sampled points in planes defined by each pair of model parameters (Figure , right panel). It is often necessary to zoom in or out the sampling region in order to get a better view of the overall projected region. This is done by changing argument of function . The sampling region can be visualized by setting argument in the generic plotting function .
It is worth noticing that the representation of the confidence region by contours does not provide exactly the same information as the representation by projections when the number of parameters is greater than two. Representations of confidence regions by contours provide smaller confidence regions than confidence regions based on projections, because the former does not incorporate the uncertainty in the p − 2 parameter estimates left out. Therefore, representations by contours tend to slightly underestimate the size of the confidence region.
In our example, contours are perfectly elliptical with a global minimum at the center, which is an indication of a good nonlinear regression model fit. The narrower elliptic shape of Beale’s confidence region between VO2peak and μ reflects the relatively high correlation between these two parameters (correlation: 0.76).
This section was revised by Florent Baty and Marie Laure Delignette-Muller since the publication of this vignette in the Journal of Statistical Software in order to include the new function nlsBootPredict().
Both jackknife and bootstrap procedures applied to nonlinear regression are implemented in . Jackknife is implemented via the function . By using a leave-one-out procedure, it produces jackknife parameter estimates together with confidence intervals . It can also be used to assess the influence of each observation on each parameter estimates.
##
## ------
## Jackknife statistics
## Estimates Bias
## VO2rest 356.531664 0.227186743
## VO2peak 1628.737024 2.146475697
## mu 1.184519 0.001609593
##
## ------
## Jackknife confidence intervals
## Low Up
## VO2rest 342.2882858 370.775043
## VO2peak 1560.1030746 1697.370973
## mu 0.9923545 1.376683
##
## ------
## Influential values
## * Observation 21 is influential on VO2peak
## * Observation 34 is influential on VO2peak
## * Observation 35 is influential on VO2peak
## * Observation 20 is influential on mu
## * Observation 21 is influential on mu
## * Observation 35 is influential on mu
The generic function , applied to an object produced by the function , returns the jackknife parameter estimates together with the associated bias and confidence intervals and, if applicable, a list of influential observations that are identified based on DFBETAs defined as follows:
with θ̂j the estimate of the j parameter based on the original dataset, θ̂j−i the estimate of the j parameter based on the dataset without the i observation, and se(θ̂j) the standard error of the j parameter estimate based on the original dataset.
An observation is empirically denoted as influential for one parameter if the absolute difference between parameter estimates with and without this observation exceeds twice the standard error of the estimate divided by the square root of the number of observations . Applied to our dataset, three observations appear to have a substantial influence on VO2peak estimate and, similarly, three other observations are influencing μ estimate (Figure , left panel).
The function uses non-parametric bootstrap of mean centered residuals to obtain a given number (argument ) of bootstrap estimates . Bootstrap estimates, standard errors together with median and percentile confidence intervals (2.5% and 97.5% percentiles of bootstrapped estimates) are displayed by the generic function . The associated plotting function can be used both for a pairwise representation of the bootstrap estimates or, as shown in Figure (right panel), for a boxplot representation of the distribution of each bootstrapped parameter.
##
## ------
## Bootstrap statistics
## Estimate Std. error
## VO2rest 356.628263 10.95685166
## VO2peak 1631.613794 21.14983126
## mu 1.187981 0.07457396
##
## ------
## Median of bootstrap estimates and percentile confidence intervals
## Median 2.5% 97.5%
## VO2rest 357.022091 334.237834 376.808684
## VO2peak 1632.973391 1589.279130 1671.353468
## mu 1.184218 1.055769 1.346726
In some cases, problems of convergence may arise during the bootstrap resampling procedure, when the model cannot be fitted to the resampled data. If the fitting procedure fails less than 50% of cases, the bootstrap statistic is provided with a warning indicating the percentage of times the procedure successfully converged; otherwise the procedure is interrupted with an error message and no result is given.
The comparison of confidence intervals based on the t-based approximation, previously obtained using , and based on resampling procedures (jackknife or bootstrap) is illustrated in Figure . In that case, it shows that bootstrap confidence intervals are comparable to the t-based ones, providing slightly narrower confidence intervals for all three parameters. On the other hand, jackknife confidence intervals noticeably differ from the other two intervals. This was to be expected considering that jackknife is using only limited information about the statistic and is therefore less efficient than bootstrap . In addition, jackknife is resampling sequentially individuals whereas bootstrap resamples residuals with replacement. Therefore, we tentatively recommend to use the bootstrap procedure for the assessment of confidence intervals of parameter estimates, whereas the jackknife procedure should be specifically used for the detection of influential observations. It is worth noting that function from the default package can also be used to build confidence intervals by profiling the residual sum of squares. However, this method often fails to give any result due the lack of convergence of the optimization algorithm for at least one explored value of one of the parameters. Unlike the function , provides confidence intervals even if the optimization algorithm fails to converge for some of the bootstrapped samples.
From the results of it is now possible to compute confidence or prediction bootstrap intervals as in Figure representing confidence intervals on the fitted curve (see for details}). A great number of bootstrap iterations (greater than the default value) should generally be used for the computation of prediction intervals.
newdata <- data.frame(t = seq(0, 12, length.out = 50))
pred.clim <- nlsBootPredict(O2K.boot1, newdata = newdata, interval = "confidence")
plotfit(O2K.nls1, smooth = TRUE)
lines(newdata$t, pred.clim[, 2], col = "red", lty = 2)
lines(newdata$t, pred.clim[, 3], col = "red", lty = 2)
We have shown the usefulness of for the important steps in a nonlinear regression analysis. The proposed functionality is generally applicable as seen from the variety of different areas where it has already been used successfully: biological chemistry , environmental toxicology , forest ecology , marine science , microbiology , limnology , and plant biology .
Further developments of are envisaged, including support for self-starting nonlinear regression models. It would also be useful to provide additional sensitivity procedures in order to allow users to further minimize the risk of undesirable convergence results towards local minima during the optimization procedure (e.g., an annotated and detailed report of the entire convergence process). We would also like to extend the diagnostic tools to situations where high-throughput nonlinear regression analyses are carried out, , through the implementation of some automatic diagnostic checks.