Mixed models with R
Marc Pépino
December 2022
- Introduction
- Packages
- Simple linear regression
- Mixed-models: varying-intercepts
- Mixed-models: varying-slopes
- LMM: brook charr allometry
- GLMM: brook charr abondance
- NLME: walleye growth curve
- References
Introduction
Linear mixed models (LMM) are natural extension of the simple linear regression and are now among the most standard and powerful techniques use to model ecological data. We first show this extension using simulations as I’m convinced that we well understand a model if we are able to simulate it. Simulations are also a precious help for teaching statistics. Using simulations, we wish to understand the output of mixed models that gives more information than the simple p-values, especially the variance structure of the data. We finally show how to apply mixed models to three cases studies, exploring linear mixed models and their extensions to generalized linear mixed models (GLMM) and non linear mixed models (NLME). At each step (i.e., data download, model fitting, model validation), we will use graphics to visualize what we have in hand. Without graphics, we walk blind and risk straying far from what we think we are modeling. We will explore LMM and their extensions using Pinheiro and Bates (2000) as the main reference book. Other complementary books include, but not limited to, Faraway (2006), Gelman and Hill (2006) for a soft transition to Bayesian modelling, and the Zuur et al. (2009) for practical applications of mixed models to ecological data and, probably, the most popular book consulted by students. An introduction to mixed models could also be found in articles like Wagner et al. (2006), Bolker (2009) or Harrison et al. (2018).
Packages
Wee need the following packages for this workshop. I’m usually as minimalist as possible and upload only the packages we really need. Be particularly careful to the order you upload the packages since some functions could be masked from package to package.
#Packages ####
library(ggplot2) #For graphic visualization
library(readxl) #For downloading Excel data
library(lme4) #For mixed models (most popular)
library(glmmTMB) #For speed up mixed models
library(nlme) #For mixed models (my favorite)
Simple linear regression
The simple linear regression is the starting point to understand mixed models. The regression line is defined by two parameters: the intercept (i.e., the y-value when x is equal to zero) and the slope (i.e., the increase of y when x increases from one unit). To simulate a simple linear regression, we also need to add residuals to y-values. This last step is the most important because it comes from the main assumption: residuals are independent and normally distributed with mean zero and variance ($\sigma^2$) to be estimated from the data. The equation of the simple linear regression could be written as follow:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \epsilon_i \sim N(0,\sigma^2)\]Where $y_i$ and $x_i$ are variables, $\beta_0$ and $\beta_1$ are the parameters (i.e., the intercept and slope, respectively), and $\epsilon_i$ are the residuals.The index $i$ refers to observations.
We first define the parameters of the equation, then the predictor (i.e., x-values) and finally the response variable (i.e., y-values), adding the residuals with an additional parameter: the standard deviation (i.e., $\sigma$). In experimental studies, we can choose to have x-values coming from the uniform distribution. In observational studies, however, x-values generally come from the normal distribution. In this example, we will choose to have x-values coming from the normal distribution. Note that we also need to define how many observations we have in hand (i.e., the sample size: n).
Simulation
Note that in this simulation, we say that $y$ comes from the normal distribution with mean given by the linear equation,$\beta_0 + \beta_1 x$, and standard deviation coming from the residuals, which is equivalent to first add the linear equation and then the residuals.
# Define the parameters of the equation
b0 = 5 # Intercept
b1 = 3 # Slope
sigma = 30 #standard deviation of the residuals
# Define the data
n = 50 #Sample size
x = rnorm(n,mean=125,sd=10)
y = rnorm(n,mean=b0+b1*x,sd=sigma)
dat = data.frame(x,y)
Graphic
Let’s take a look at our data
# Visualization
ggplot(data=dat,aes(x=x,y=y))+
geom_point()+
theme_bw()
Analyses
The code for model fitting is generally quite easy and the shortest part of the exercise.
mod = lm(y~x,data=dat)
Results
The summary function is usually used to explore the output of the model.
summary(mod)
Call:
lm(formula = y ~ x, data = dat)
Residuals:
Min 1Q Median 3Q Max
-51.45 -23.78 -3.68 19.78 65.68
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -68.6240 47.3891 -1.448 0.154
x 3.5884 0.3705 9.685 7.15e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 28.3 on 48 degrees of freedom
Multiple R-squared: 0.6615, Adjusted R-squared: 0.6544
F-statistic: 93.8 on 1 and 48 DF, p-value: 7.147e-13
In this example, we see the intercept, the slope (x), but also the estimated standard deviation of the residuals (Residual standard error) and how the variance is explained by the predictor (Multiple R-squared, i.e., the coefficient of determination, $R^2$). We can see that the parameter values are closed of what we have simulated but not exactly the same, especially for the intercept. This is what we expect since we add residuals to y-values. Increasing residuals (i.e., increasing sigma in simulating data) leads to lower fit.
Let’s take a look at our data, adding simulated (solid) and estimated (broken) lines.
# Visualization
ggplot(data=dat,aes(x=x,y=y))+
geom_abline(intercept=b0,slope=b1)+
geom_abline(intercept=coef(mod)[1],slope=coef(mod)[2],lty=2)+
geom_point()+
theme_bw()
Assumptions
The first plot is to verify the independence of the data and the homogeneity of the residuals. The second plot is to verify the normality of the residuals.
# Model checking
par(mfrow=c(1,2))
plot(mod,which=c(1,2))
Mixed-models: varying-intercepts
The simplest LMM is the varying-intercept model. Because of the presence of grouped data, the main assumption of the independence of the data is violated. Then, LMM comes from a natural extension of the linear regression, assuming than the intercept could vary among groups. This variation is assumed to come from a normal distribution, with mean zero and variance (i.e., $\sigma_{\beta0}^2$) to be estimated from the data. In this way, we obtain a global relationship at the population level (i.e., fixed effects) and deviation at the group level (i.e. random effects).
The equation of LMM with varying-intercept could be written in two steps:
\[y_{ij} = \beta_{0j} + \beta_1 x_{ij} + \epsilon_{ij}, \epsilon_{ij} \sim N(0,\sigma^2)\] \[\beta_{0j} = \beta_0 + b_{0j}, b_{0j} \sim N(0,\sigma_{\beta0}^2)\]Where $y_{ij}$ and $x_{ij}$ are variables, $\beta_0$ and $\beta_1$ are the parameters (the intercept and slope, respectively) at the population level, $\beta_{0j}$ are the intercepts at the group level, $b_{0j}$ are the group level residuals of the intercept, and $\epsilon_{ij}$ are the level-one residuals.The index $j$ refers to group level. The index $i$ refers to observations.
Simulation
To simulate LMM, we need an additional parameter, the standard deviation of the intercept at the group level (i.e., $\sigma_{\beta0}$). Since we need to simulate the data at each group level,we will define the x-values for each group and then loop the simulation for all groups.
# Define the parameters of the equation
b1 = 3 #slope
b0 = 5 #intercept
sigma = 10 #sd for residuals
sigmab0 = 20 #sd for intercepts
b0j = rnorm(n=1000,mean=0,sd=sigmab0)
# Define the data
ng = 10 #number of group
nj = sample(20:40,ng,replace=TRUE) #sample size in each group
xmean = runif(ng,-25,25) #x mean in each group
xsd = runif(ng,5,10) #x sd in each group
# Loop for all groups
dat = data.frame()
for(j in 1:ng){
x = rnorm(nj[j],mean=xmean[j],sd=xsd[j])
y = rnorm(nj[j],mean=b0+b0j[j]+b1*x,sd=sigma)
g = rep(j,nj[j])
dat = rbind(dat,data.frame(x,y,g))
}
dat$g = as.factor(dat$g)
Graphic
Let’s take a look at our data
#Visualization
ggplot(data=dat,aes(x=x,y=y,col=g))+
geom_point()+
theme_bw()
Analyses
Traditionally we could analyse this type of data using ANCOVA, using the grouping factor as an additional predictor. However, a more powerful technique is to use LMM with varying-intercept. Using the nlme package, the random argument specifies the intercept (i.e., 1) and the grouping factor (i.e., g in this example) as follow:
#Analyses: ANCOVA
mod = lm(y~x+g,data=dat)
summary(mod)
Call:
lm(formula = y ~ x + g, data = dat)
Residuals:
Min 1Q Median 3Q Max
-23.7049 -6.1446 0.3523 6.7870 26.3402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.58808 2.32945 -1.970 0.049769 *
x 3.04646 0.07239 42.084 < 2e-16 ***
g2 -20.59491 3.71571 -5.543 6.35e-08 ***
g3 15.36780 2.61833 5.869 1.12e-08 ***
g4 -9.37524 2.80239 -3.345 0.000922 ***
g5 15.12837 2.84610 5.315 2.03e-07 ***
g6 29.96195 3.27074 9.161 < 2e-16 ***
g7 2.48753 2.54925 0.976 0.329926
g8 28.64816 2.59083 11.058 < 2e-16 ***
g9 2.40887 2.76159 0.872 0.383729
g10 -1.18766 2.62787 -0.452 0.651621
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.788 on 312 degrees of freedom
Multiple R-squared: 0.96, Adjusted R-squared: 0.9587
F-statistic: 749.1 on 10 and 312 DF, p-value: < 2.2e-16
#Analyses: Mixed models with nlme package
mod = lme(y~x,random=~1|g,data=dat)
summary(mod)
Linear mixed-effects model fit by REML
Data: dat
AIC BIC logLik
2439.092 2454.178 -1215.546
Random effects:
Formula: ~1 | g
(Intercept) Residual
StdDev: 15.93838 9.787311
Fixed effects: y ~ x
Value Std.Error DF t-value p-value
(Intercept) 1.601332 5.110372 312 0.31335 0.7542
x 3.035387 0.071049 312 42.72226 0.0000
Correlation:
(Intr)
x 0.125
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.41041756 -0.62668163 0.03215744 0.70114206 2.67355343
Number of Observations: 323
Number of Groups: 10
The interpretation of the ANCOVA could be tedious, especially as the number of groups increases. The interpretation of the mixed model is more straightforward, at least for the fixed effects…
Results
We can easily retrieve the summary output using different functions and compare the results to simulated parameters. This exercise is particularly helpful to see how well the model fits to the data. This could be helpful, for example, to found the best sampling design before collecting ecological data (e.g., Pépino et al. 2016). Especially, the intervals functions gives the confidence intervals of parameter estimates (the default is 95%), which can be used to report the final result of the model.
fixef(mod) #The coefficient of the fixed effects (estimates)
(Intercept) x
1.601332 3.035387
ranef(mod) #b0j estimates
(Intercept)
1 -6.274000
2 -26.251615
3 8.845424
4 -15.420527
5 8.704950
6 23.639338
7 -3.881986
8 22.007652
9 -3.825922
10 -7.543314
VarCorr(mod) #the variance covariance structure (sigmab0 and sigma estimates)
g = pdLogChol(1)
Variance StdDev
(Intercept) 254.03190 15.938378
Residual 95.79145 9.787311
intervals(mod) #95 confidence intervals of parameter estimates
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) -8.453819 1.601332 11.656483
x 2.895591 3.035387 3.175184
Random Effects:
Level: g
lower est. upper
sd((Intercept)) 9.976875 15.93838 25.46207
Within-group standard error:
lower est. upper
9.048780 9.787311 10.586118
We can also add the fitted and residuals values to the original data frame using the fitted and residuals function, respectively. This is particularly helpful for checking model assumptions or illustrating model fitting. Here, black and color lines refer to regression lines at the population and group levels, respectively.
dat$fit = fitted(mod) #fitted values
dat$res = residuals(mod) #residuals values
# Visualization
ggplot(data=dat,aes(x=x,y=y,col=g))+
geom_abline(intercept=b0,slope=b1)+
geom_abline(intercept=fixef(mod)[1],slope=fixef(mod)[2],lty=2)+
geom_point(alpha=0.5)+
geom_line(aes(x=x,y=fit,col=g),linewidth=1)+
#facet_wrap(~g)+ #Optional
theme_bw()+
theme(legend.position="none")
Assumptions
There are tow main assumptions.
Assumption 1: within-group errors
The within-group errors are independent and identically normally distributed, with mean zero and variance to be estimated, and they are independent of the random effects.
You can used default functions to verify this first assumption. Note that this assumption should be verify at the group level. You can also have an rough idea of the goodness-of-fit by plotting fitted versus observed values.
#Homogeneity of residuals
plot(mod)
plot(mod,g~resid(.,type="p"),abline=0)
plot(mod,resid(.,type="p")~fitted(.)|g,abline=0,lty=2)
#Normality of residuals
qqnorm(mod,~resid(.))
#Some idea of goodness-of-fit
plot(mod,y~fitted(.),id=0.05,adj=-0.3)
After adding residuals and fitted values to the data frame, you can also reproduce similar graphs using ggplot.
#Homogeneity of residuals
ggplot(data=dat,aes(x=fit,y=res,col=g))+
geom_hline(yintercept=0,lty=2)+
geom_point(alpha=0.5)+
facet_wrap(~g)+
theme_bw()+
theme(legend.position="none")
#Boxplot could also be used to see variation of residuals among groups
#You could also explore relationship between residuals and predictors
ggplot(data=dat,aes(x=g,y=res))+
geom_boxplot()+
geom_hline(yintercept=0,lty=2)+
coord_flip()+
theme_bw()
#Normality of residuals
ggplot(dat,aes(sample = res))+
stat_qq()+
stat_qq_line()+
facet_wrap(~g)+#optional
theme_bw()
#Prediction: a rough idea of goodness-of-fit
ggplot(data=dat,aes(x=fit,y=y,col=g))+
geom_abline(intercept=0,slope=1,lty=2)+
geom_point(alpha=0.5)+
facet_wrap(~g,scale="free")+
theme_bw()+
theme(legend.position="none")
Assumption 2: random effects
The random effects are normally distributed, with mean zero and covariance matrix (not depending on the group) and are independent for different groups.
As before, you can use default functions or customize graphic output using ggplot. Note that with 10 groups, it is more difficult to evaluate carefully this second assumption.
#First option: default functions
#Normality of random effects
qqnorm(mod,~ranef(.))
#Second option: ggplot
ran = ranef(mod)
ggplot(ran,aes(sample = ran[,1]))+
stat_qq()+
stat_qq_line()+
theme_bw()
Violation of assumptions could include dependencies or heteroscedasticity among the within-group errors that can be modeled with correlation structure or the specification of the Variance-Covariance matrices for the random effects, respectively. Even if I encourage to adequately specify how to model the fixed and random effects, mixed models are generally robust to violations of these assumptions (Schielzeth et al. 2020).
Intraclass correlation
The intraclass correlation (ICC) is the proportion of the total variation that is among groups (Faraway 2006, Gelman and Hill 2007). The ICC could be calculated in varying-intercept models as follow:
\[\frac{\sigma_{\beta0}^2} {\sigma^2 + \sigma_{\beta0}^2}\]Where $\sigma^2 + \sigma_{\beta0}^2$ is the total variation and $\sigma_{\beta0}^2$, the variation among groups.
The ICC is thus the variation among groups divided by the total variation. The ICC ranges from 0 (all variation withing groups) to 1 (all variation among groups). You can read Reyjol et al. (2008) for an ecological application. Extension of ICC could also be found in animal behavior studies for the estimation of repeatability (e.g., Dingemanse and Dotcherman 2013, Allegue et al. 2017).
variation = as.numeric(VarCorr(mod)[,1])
ICC = variation[1]/sum(variation)
ICC
[1] 0.7261719
sigmab0^2/(sigmab0^2+sigma^2) #Predicted value of ICC based on simulated values
[1] 0.8
Mixed-models: varying-slopes
Now we can assume that not only the intercept could vary among groups, but also the slope. We could also assume that only the slope varies among groups and the intercept is fixed but, as many ecological data could vary both in intercept and slope, we will not explore this option here. This variation in slope is assumed to come from a normal distribution, with mean zero and variance ($\sigma_{\beta1}^2$) to be estimated from the data.
The equation of LMM with varying intercept and slope could be written as follow:
\[y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + \epsilon_{ij}, \epsilon_{ij} \sim N(0,\sigma^2)\] \[\beta_{0j} = \beta_0 + b_{0j}, b_{0j} \sim N(0,\sigma_{\beta0}^2)\] \[\beta_{1j} = \beta_1 + b_{1j}, b_{1j} \sim N(0,\sigma_{\beta1}^2)\]Where $y_{ij}$ and $x_{ij}$ are variables, $\beta_0$ and $\beta_1$ are the parameters (the intercept and slope, respectively) at the population level, $\beta_{0j}$ are the intercepts at the group level, $\beta_{1j}$ are the slopes at the group level, $b_{0j}$ are the group level residuals of the intercept, $b_{1j}$ are the group level residuals of the slope, and $\epsilon_{ij}$ are the level-one residuals.The index $j$ refers to group level. The index $i$ refers to observations. Important: random intercept and random slope are not assumed to be independent from each other, which is taking into account by their covariance matrix. For simplicity, we will assume that they are independent in the following simulations.
Simulations
The simulation look like the varying-intercept mixed model. The only difference is that we add an additional parameter, the standard deviation ($\sigma_{\beta1}$) to take into account that the slope could vary among groups.
dat = data.frame()
# Define the parameters of the equation
b1 = 3 #slope
b0 = 5 #intercept
sigma = 10 #sd for residuals
sigmab0 = 20 #sd for intercepts (try 20)
sigmab1 = 1 #sd for slopes (try 1 or 0.2 or 0)
b0j = rnorm(n=1000,mean=0,sd=sigmab0)
b1j = rnorm(n=1000,mean=0,sd=sigmab1)
# Define the data
ng = 10 #number of group
nj = sample(20:40,ng,replace=TRUE) #sample size in each group
xmean = runif(ng,-25,25) #x mean in each group
xsd = runif(ng,5,10) #x sd in each group try: (5,10) or (1) or 30
# Loop for all groups
dat = data.frame()
for(j in 1:ng){
x = rnorm(nj[j],mean=xmean[j],sd=xsd[j])
y = rnorm(nj[j],mean=b0+b0j[j]+(b1+b1j[j])*x,sd=sigma)
g = rep(j,nj[j])
dat = rbind(dat,data.frame(x,y,g))
}
dat$g = as.factor(dat$g)
Graphic
Let’s take a look at our data
#Visualization
ggplot(data=dat,aes(x=x,y=y,col=g))+
geom_point()+
theme_bw()
Analyses
As for LMM with varying-intercept, we start with a model fitting using ANCOVA, but with an interaction term. The varying-slope model is specified by adding the predictor x in the random argument. Note that the ANCOVA output could be particularly difficult to interpret, especially as the number of groups increases. LMM output is more straightforward, with the same number of parameters whatever the number of groups.
# Analyses: ANCOVA
mod = lm(y~x*g,data=dat)
summary(mod)
Call:
lm(formula = y ~ x * g, data = dat)
Residuals:
Min 1Q Median 3Q Max
-27.410 -6.494 -0.313 6.256 28.988
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6850 2.4732 1.490 0.13742
x 0.4027 0.3120 1.291 0.19793
g2 14.7585 4.9330 2.992 0.00304 **
g3 -11.0255 6.7928 -1.623 0.10576
g4 -2.3594 4.5790 -0.515 0.60679
g5 -1.6848 6.5532 -0.257 0.79731
g6 -13.1434 6.5245 -2.014 0.04498 *
g7 -5.7461 4.4803 -1.283 0.20079
g8 39.8874 5.0984 7.824 1.25e-13 ***
g9 -11.9828 7.2724 -1.648 0.10061
g10 -10.5213 6.5113 -1.616 0.10732
x:g2 0.1196 0.5088 0.235 0.81440
x:g3 3.1410 0.4117 7.629 4.34e-13 ***
x:g4 0.1057 0.4168 0.253 0.80010
x:g5 4.9061 0.4472 10.971 < 2e-16 ***
x:g6 2.0745 0.3937 5.269 2.86e-07 ***
x:g7 1.7577 0.4126 4.260 2.85e-05 ***
x:g8 3.0249 0.4116 7.350 2.51e-12 ***
x:g9 3.1552 0.4059 7.773 1.73e-13 ***
x:g10 2.4086 0.4009 6.009 6.21e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.17 on 263 degrees of freedom
Multiple R-squared: 0.9803, Adjusted R-squared: 0.9789
F-statistic: 688 on 19 and 263 DF, p-value: < 2.2e-16
# Analyses: Mixed models with nlme package - random intercept
mod0 = lme(y~x,random=~1|g,data=dat)
summary(mod0)
Linear mixed-effects model fit by REML
Data: dat
AIC BIC logLik
2343.084 2357.637 -1167.542
Random effects:
Formula: ~1 | g
(Intercept) Residual
StdDev: 29.89671 13.88851
Fixed effects: y ~ x
Value Std.Error DF t-value p-value
(Intercept) -4.228197 9.504821 272 -0.444848 0.6568
x 2.619327 0.117580 272 22.276956 0.0000
Correlation:
(Intr)
x 0.053
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.791434759 -0.590994740 -0.006061655 0.640791495 2.420196655
Number of Observations: 283
Number of Groups: 10
# Analyses: Mixed models with nlme package - random intercept and slope
mod1 = lme(y~x,random=~x|g,data=dat)
summary(mod1)
Linear mixed-effects model fit by REML
Data: dat
AIC BIC logLik
2197.994 2219.824 -1092.997
Random effects:
Formula: ~x | g
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 15.847055 (Intr)
x 1.590983 -0.05
Residual 10.166360
Fixed effects: y ~ x
Value Std.Error DF t-value p-value
(Intercept) 3.512384 5.260836 272 0.667647 0.5049
x 2.513745 0.510795 272 4.921242 0.0000
Correlation:
(Intr)
x -0.036
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.76861646 -0.63737211 -0.02064573 0.62375233 2.86482177
Number of Observations: 283
Number of Groups: 10
Results
As for the varying-intercept mixed model, we can compare model estimates with simulated values using the same functions!
fixef(mod1) #The coefficient of the fixed effects (estimates)
(Intercept) x
3.512384 2.513745
ranef(mod1) #b0j estimates
(Intercept) x
1 0.5730601 -2.03206100
2 13.0055977 -1.80691304
3 -9.9867624 1.06178060
4 -1.3780864 -1.93847817
5 -3.1706913 2.69934276
6 -11.3913918 -0.09622784
7 -5.4006808 -0.36205703
8 37.5826831 1.04759948
9 -10.6252539 1.08523496
10 -9.2084743 0.34177928
VarCorr(mod1) #the variance covariance structure
g = pdLogChol(x)
Variance StdDev Corr
(Intercept) 251.129159 15.847055 (Intr)
x 2.531227 1.590983 -0.05
Residual 103.354876 10.166360
intervals(mod1) #95 confidence intervals of parameter estimates
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) -6.844750 3.512384 13.86952
x 1.508131 2.513745 3.51936
Random Effects:
Level: g
lower est. upper
sd((Intercept)) 9.5512907 15.84705522 26.2926935
sd(x) 0.9850315 1.59098298 2.5696913
cor((Intercept),x) -0.6125874 -0.04963095 0.5467327
Within-group standard error:
lower est. upper
9.334299 10.166360 11.072591
And the same code for plotting!
dat$fit = fitted(mod1) #fitted values
dat$res = residuals(mod1) #residuals values
# Visualization
ggplot(data=dat,aes(x=x,y=y,col=g))+
geom_abline(intercept=b0,slope=b1)+
geom_abline(intercept=fixef(mod1)[1],slope=fixef(mod1)[2],lty=2)+
geom_point(alpha=0.5)+
geom_line(aes(x=x,y=fit,col=g),linewidth=1)+
#facet_wrap(~g)+ #Optional
theme_bw()+
theme(legend.position="none")
Testing random component
How to choose between random intercept or random slope models ? An easy way is to compare the two models using log-likelihood ratio test or comparing information criteria like Aikaike Information Criteria (AIC). Note that you have to use REML method to compare random effect and ML method to compare fixed effects (Zuur et al. 2009). You can also repeat the simulation using $\sigma_{\beta1} = 0.1$ and see if you obtain the same conclusion!
mod0 = lme(y~x,random=~1|g,data=dat,method="REML")
mod1 = lme(y~x,random=~x|g,data=dat,method="REML")
anova(mod0,mod1)
Model df AIC BIC logLik Test L.Ratio p-value
mod0 1 4 2343.084 2357.637 -1167.542
mod1 2 6 2197.994 2219.824 -1092.997 1 vs 2 149.0896 <.0001
A short note on grouped data. The groupedData function in the nlme package can be used to defined how your data are grouped and speed up model fitting and visualization. Grouped data are also useful to explore different structure of the variance-covariance matrix. In this simulation example, we did not assume any correlation between the slope and intercept of the random component, with can be specify using the pdDiag function. We repeat the preceding models, starting with the random intercept model and then updating this model using the update function. We then compare the models as before using the anova function.
#Grouped data and update function of the nlme package
datG = groupedData(y~x|g,data=dat)
mod0 = lme(y~x,random=~1|g,data=datG,method="REML")
mod1 = update(mod0,random=~x)
mod1diag = update(mod0,random=pdDiag(~x))
anova(mod0,mod1diag,mod1)
Model df AIC BIC logLik Test L.Ratio p-value
mod0 1 4 2343.084 2357.637 -1167.542
mod1diag 2 5 2196.014 2214.206 -1093.007 1 vs 2 149.06951 <.0001
mod1 3 6 2197.994 2219.824 -1092.997 2 vs 3 0.02011 0.8872
To be consistent with the simulation, you should select the model with the random slope and a diagonal variance-covariance matrix (i.e., mod1diag), but conclusions could differ depending of your simulated data…
Assumptions
As for the varying-intercept model, the same assumptions applied and could be verify in the same way. You could thus verify these assumptions using the same code. I just show here the scatter plot of the estimated random effects. This plot is particularly useful to visualize the correlation between the intercept and slope of the random component, suggesting which variance-covariance matrix should be used. I also reproduce the normal Q-Q plots of the random effects to emphasize that the normality should be checked for both intercept and slope, but again, with only 10 groups, this assumption is difficult to verify.
#Correlation between random effects
pairs(mod1diag,~ranef(.))
#Normality of random effects
qqnorm(mod1diag,~ranef(.))
Further readings
How total variation in LMM is explained is less straightforward than in linear regressions. For an overview of the indices and available tools, you can consult Nakagawa and Schielzeth (2013), Johnson (2014),or Nakagawa et al. (2017).
How to well include the hierarchical structure of your data (e.g. nested or crossed design), you can consult Schielzeth and Nakagawa (2013) or Harrison et al. (2018).
Finally, before to explore how to fit mixed models to case studies, it is important to read Zuur et al. (2016), a practical guide for conducting and presenting results of regression-types analyses (e.g., see Fig. 1 in the original article). We outline here that model fitting arrives only at the step 6 (among the 10 steps recommended). Thinking (and stating) a relevant ecological question (step 1) and ensuring we have the right data in hand to answer it (step 2) are the primary initial steps that any statistical models can compensate…
LMM: brook charr allometry
The database could be found here: https://datadryad.org/stash/dataset/doi:10.5061/dryad.p2k02
In this example, we will explore the relationship between total length and mass in brook charr. Details of the sampling design could be found in the original article (Pépino et al. 2018). In brief, height families were raised in laboratory and were placed in lake enclosures for growth in summer in two distinct habitats (i.e., littoral or pelagic habitats).
Download and explore data
dat = read_excel("Data/Pepino_et_al_Original_Data.xlsx")
#Transform the total length and mass variables
dat$log10TL = log10(dat$Length_Final)
dat$log10M = log10(dat$Mass_Final)
datG = groupedData(log10M~log10TL|Family,data=dat)
#Visualization
ggplot(data=datG,aes(x=log10TL,y=log10M,col=Family))+
geom_point()+
facet_wrap(~Family)+
theme_bw()
Blind linear mixed model
Many students try the simplest linear mixed model, taking into account the non-independence of the data. We will see how to go further in modelling the variability of ecological data.
#Analyses
mod0 = lme(log10M~log10TL,random=~1|Family,data=datG)
#Results
summary(mod0)
Linear mixed-effects model fit by REML
Data: datG
AIC BIC logLik
-1778.77 -1762.487 893.385
Random effects:
Formula: ~1 | Family
(Intercept) Residual
StdDev: 0.01990289 0.0297343
Fixed effects: log10M ~ log10TL
Value Std.Error DF t-value p-value
(Intercept) -4.862396 0.05393961 426 -90.14519 0
log10TL 2.881259 0.02784003 426 103.49337 0
Correlation:
(Intr)
log10TL -0.991
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-5.66116458 -0.54323068 -0.03043674 0.53004392 5.58488220
Number of Observations: 435
Number of Groups: 8
Let’s take a look at our data, adding the population (broken) and family-specific (solid) regression lines.
#Visualization
datG$fit = fitted(mod0)
datG$res = residuals(mod0)
ggplot(data=datG,aes(x=log10TL,y=log10M,col=Family))+
geom_abline(intercept=fixef(mod0)[1],slope=fixef(mod0)[2],lty=2)+
geom_point(alpha=0.5)+
geom_line(aes(y=fit),linewidth=1)+
facet_wrap(~Family)+ #Optional
theme_bw()+
theme(legend.position="none")
Do we stop the modelling here ?
Assumptions
We can’t stop without checking at the assumptions of the model. We first check for the homogeneity of residuals.
#Homogeneity of residuals
plot(mod0,resid(.)~fitted(.)|Family,abline=0,lty=2)
Do we stop the modelling here ?
Random effects
The last plots show patterns in residuals. For example, residuals seem to increase with fitted values for families 14P and 5L, and decrease for family 4L, suggesting that including random slope in model could be useful to improve model fit. This graphical approach could be confirmed by testing the random component according to the approach suggested by Zuur et al. (2009). Following the Zuur’s approach, we start with the full model (i.e., including all variables with their interaction) and test to the random component (i.e., intercept and slope) with the REML method. Technically, model comparison or graphical inspection of the residuals should converge to the same specification of the final model.
modfullI = lme(log10M~log10TL*Zone,random=~1|Family,data=datG)
modfullS = lme(log10M~log10TL*Zone,random=~log10TL|Family,data=datG)
anova(modfullI,modfullS)
Model df AIC BIC logLik Test L.Ratio p-value
modfullI 1 6 -1842.539 -1818.142 927.2695
modfullS 2 8 -1869.018 -1836.490 942.5092 1 vs 2 30.47943 <.0001
Both AIC and log-likelihood ratio test show that we should include random slope.
Do we stop the modelling here ?
Fixed effects
Following the Zuur’s approach, we continue with comparing competing models differing in their fixed component using the ML method.
mod0 = lme(log10M~log10TL,random=~log10TL|Family,data=datG,method="ML")
mod1 = lme(log10M~log10TL+Zone,random=~log10TL|Family,data=datG,method="ML")
mod2 = lme(log10M~log10TL*Zone,random=~log10TL|Family,data=datG,method="ML")
anova(mod0,mod1,mod2)
Model df AIC BIC logLik Test L.Ratio p-value
mod0 1 6 -1825.685 -1801.233 918.8425
mod1 2 7 -1887.127 -1858.600 950.5635 1 vs 2 63.44192 <.0001
mod2 3 8 -1893.485 -1860.883 954.7427 2 vs 3 8.35833 0.0038
Both AIC and log-likelihood ratio test show that Model 2, including the two variables and their interaction, is the best model.
Do we stop the modelling here ?
Modelling heteroscedasticity
Let’s take a look at the homogeneity of the residuals.
mod2 = lme(log10M~log10TL*Zone,random=~log10TL|Family,data=datG,method="REML")
plot(mod2,resid(.,type="p")~fitted(.),abline=0,lty=2)
This plot shows that the variability of the residuals is higher for low fitted values, meaning heteroscedasticity of the residuals. This heteroscedasticity seems to come from different variation of the residuals according to the Zone variable, as shown in the plot below:
plot(mod2,resid(.,type="p")~fitted(.)|Zone,abline=0,lty=2)
We can model this heteroscedasticity using the weights argument and the varIdent function as follow:
mod2H = lme(log10M~log10TL*Zone,random=~log10TL|Family,data=datG,method="REML",
weights = varIdent(form = ~ 1|Zone))
anova(mod2,mod2H)
Model df AIC BIC logLik Test L.Ratio p-value
mod2 1 8 -1869.018 -1836.490 942.5092
mod2H 2 9 -1915.376 -1878.781 966.6881 1 vs 2 48.3578 <.0001
plot(mod2H,resid(.,type="p")~fitted(.)|Zone,abline=0,lty=2)
Model comparison based on AIC and log-likelihood ratio test show that modelling heteroscedasticity improves model fit. Residual variability is also more homogeneous in the two habitats. After choosing the better random and fixed components and including the Zone variable to model heteroscedasticity, we now have a model that better captures the data variability. We can stop the modelling here.
Results
We can then report the results of the best model, first by reporting the parameter estimates and their confidence intervals and then illustrating this result on a plot.
intervals(mod2H)
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) -5.6004388 -5.3142985 -5.0281581
log10TL 2.9775185 3.1252230 3.2729275
ZoneP 0.1992625 0.4801723 0.7610822
log10TL:ZoneP -0.4086951 -0.2628585 -0.1170220
Random Effects:
Level: Family
lower est. upper
sd((Intercept)) 0.16999148 0.3152096 0.5844827
sd(log10TL) 0.08573837 0.1596196 0.2971647
cor((Intercept),log10TL) -0.99974099 -0.9986156 -0.9926182
Variance function:
lower est. upper
P 0.5294444 0.607688 0.6974946
Within-group standard error:
lower est. upper
0.02826388 0.03110539 0.03423257
This case study shows us how to model and interpret more than the fixed effects, especially how the variation could be explained. For example, the variance function tells us that the variation in the pelagic habitat is 0.6 times the variation in the littoral habitat. Modeling the variation of ecological data gives us a more complete understanding of the ecological processes at work.
# Visualization
datG$fit = predict(mod2H)
datG$res = residuals(mod2H)
ggplot(data=datG,aes(x=log10TL,y=log10M,col=Zone))+
geom_point(alpha=0.5)+
geom_line(aes(y=fit),linewidth=1)+
facet_wrap(~Family)+ #Optional
theme_bw()
This best model shows that we have a random variation of the length-mass relationship according to the family and that this relationship is different in the two types of habitat, the slope being slower in the pelagic habitat. We finish, however, with a general though: is the length-mass relationship is really different according to the habitat type or, alternatively, the relationship shift according to the size (total length) of the individual? Since the overlap of total length according to the habitat is low, we can completely distinguish these two possible explanations. This outlines the importance to have the right data in hand to answer ecological question.
GLMM: brook charr abondance
The database could be found here: https://datadryad.org/stash/landing/show?id=doi%3A10.5061%2Fdryad.34tmpg4p8
In this case study, we will explore the relative abundance of brook charr in littoral habitat of Canadian Shield lakes. Details of the sampling design could be found in the original article (Rainville et al. 2022). In brief, brook charr were captured in 24 lakes and three consecutive years. This sampling design is thus typical of crossed design, year and lake including as random effects. We will focus here on the distribution of the response variable, extending LMM to GLMM.
Download and explore data
dat = read_excel("Data/Data_Rainville_et_al_2022_Evolutionary_Ecology.xlsx",
sheet = "BDlittoral")
dat$Temp = dat$T #Temperature variable
#Visualization: distribution of captures
ggplot(data=dat,aes(x=BC,fill=as.factor(Y)))+
geom_histogram(binwidth=10)+
facet_wrap(~LC)+
theme_bw()
Negative binomial distribution
We will use the lme4 package. We don’t explore the influence of predictors in this example, but we will show how to deal with crossed design that is quite straightforward using lme4 package. Extending LMM to GLMM is also straightforward using the family argument to specify the distribution of the response variable. Since the response variable are the number of captures, appropriate distributions could be the Poisson or the negative binomial distributions. The negative binomial distribution can generally handle the overdispersion of ecological data and is often more appropriate than the Poisson distribution. As before, the two models can be compare using the anova function. For simplicity, we will not explore the potential relationship with the predictors (e.g., the temperature). Note that simulated values from LMM (i.e., normal distribution) are obtained with the simulate function and could be negative, which is not possible (number of captures cannot be negative). The negative distribution simulates more high abundances than the Poisson distribution.
#Normal distribution
mod.N = lmer(BC~1+(1|LC)+(1|Y),data=dat)
simulate(mod.N,nsim=5)[1:10,1:5]
sim_1 sim_2 sim_3 sim_4 sim_5
1 22.177275 -49.4177181 56.271410 34.687981 72.073170
2 17.024076 -0.1053373 5.234417 31.821681 33.430622
3 51.075015 -24.4760264 -5.222976 -10.643936 29.809958
4 -1.090146 15.3458591 16.145508 -10.633513 39.518018
5 38.434107 35.7595436 32.733063 8.619748 26.227723
6 -2.585815 -42.0272885 32.300223 -31.156558 43.610788
7 53.544889 11.1338845 13.311571 -4.676094 2.509264
8 29.041403 -5.6599222 -10.201517 -2.191722 78.587746
9 8.483316 -2.4866802 48.440083 50.946557 46.433534
10 -7.191438 25.7922704 3.240183 -13.671578 38.992284
#Poisson distribution
mod.P = glmer(BC~1+(1|LC)+(1|Y),data=dat,family="poisson")
simulate(mod.P,nsim=5)[1:10,1:5]
sim_1 sim_2 sim_3 sim_4 sim_5
1 18 112 4 14 10
2 18 97 3 15 8
3 14 90 3 18 6
4 22 93 2 16 12
5 16 82 2 15 8
6 24 80 2 20 6
7 21 105 5 12 10
8 29 87 2 17 4
9 13 88 5 14 8
10 14 82 1 18 5
#Negative binomial distribution
mod.NB = glmer.nb(BC~1+(1|LC)+(1|Y),data=dat)
simulate(mod.NB,nsim=5)[1:10,1:5]
sim_1 sim_2 sim_3 sim_4 sim_5
1 3 3 2 12 113
2 3 38 40 8 577
3 1 6 5 1 166
4 1 3 1 2 202
5 4 7 3 16 43
6 0 8 3 0 228
7 2 19 34 10 125
8 0 3 17 5 124
9 0 2 41 4 83
10 4 8 49 15 117
anova(mod.P,mod.NB)
Data: dat
Models:
mod.P: BC ~ 1 + (1 | LC) + (1 | Y)
mod.NB: BC ~ 1 + (1 | LC) + (1 | Y)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
mod.P 3 4335.2 4345.8 -2164.62 4329.2
mod.NB 4 1789.7 1803.8 -890.86 1781.7 2547.5 1 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The negative binomial model outperforms the Poisson model. Plots of fitted values versus Pearson residuals show how residuals are reduced and better distributed for the negative binomial distribution.
You can use the summary function to look at the results of the model.
summary(mod.NB)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: Negative Binomial(1.0623) ( log )
Formula: BC ~ 1 + (1 | LC) + (1 | Y)
Data: dat
AIC BIC logLik deviance df.resid
1789.7 1803.8 -890.9 1781.7 247
Scaled residuals:
Min 1Q Median 3Q Max
-1.0084 -0.6210 -0.2681 0.3030 5.1776
Random effects:
Groups Name Variance Std.Dev.
LC (Intercept) 1.1406 1.0680
Y (Intercept) 0.1758 0.4193
Number of obs: 251, groups: LC, 24; Y, 3
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3887 0.3419 6.987 2.82e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at p-vales of the fixed effect is not very meaningful (i.e., abundances is different from 0), however random effect show that variability is higher among lakes than among years.
The predict function is useful to obtain predicted values from the fitting model. The type argument is used to specify that we want predicted values on the original scale (i.e., the number of capture) whereas the re.form argument specifies how to deal with the random component: NA if we want to omit random effect and predict on fixed effects only, NULL if we want to predict considering all random effects, or a specific formula to indicate which random effect to consider in prediction. Note how the expand.grid function can be used to create a new data frame by considering the combination of variables included in the model.
newDat <- with(dat, expand.grid(LC=unique(LC), Y=unique(Y)))
newDat$predFix = predict(mod.NB,type="response",newdata=newDat,re.form=NA) #No random component
newDat$predRan = predict(mod.NB,type="response",newdata=newDat,re.form=NULL) #Full random components
newDat$predLC = predict(mod.NB,type="response",newdata=newDat,re.form=~(1|LC)) #Random component: lakes
newDat$predY = predict(mod.NB,type="response",newdata=newDat,re.form=~(1|Y)) #Random component: years
newDat[1:48,]
LC Y predFix predRan predLC predY
1 A 2012 10.89913 19.6258884 21.9914088 9.726754
2 AA 2012 10.89913 3.6094518 4.0445013 9.726754
3 B 2012 10.89913 14.0246398 15.7150383 9.726754
4 BB 2012 10.89913 14.0658020 15.7611617 9.726754
5 C 2012 10.89913 2.7984814 3.1357841 9.726754
6 D 2012 10.89913 5.5530209 6.2223299 9.726754
7 G 2012 10.89913 45.0371770 50.4655356 9.726754
8 H 2012 10.89913 14.9966975 16.8042587 9.726754
9 I 2012 10.89913 11.6287999 13.0304263 9.726754
10 J 2012 10.89913 22.6479304 25.3776994 9.726754
11 L 2012 10.89913 3.1772991 3.5602609 9.726754
12 M 2012 10.89913 0.4347612 0.4871632 9.726754
13 N 2012 10.89913 14.6509346 16.4168208 9.726754
14 O 2012 10.89913 4.3643540 4.8903923 9.726754
15 P 2012 10.89913 7.8086182 8.7497958 9.726754
16 Q 2012 10.89913 26.4542222 29.6427659 9.726754
17 R 2012 10.89913 11.9237583 13.3609362 9.726754
18 S 2012 10.89913 11.3996388 12.7736442 9.726754
19 T 2012 10.89913 17.8290762 19.9780257 9.726754
20 U 2012 10.89913 14.9658550 16.7696988 9.726754
21 W 2012 10.89913 14.0658020 15.7611617 9.726754
22 X 2012 10.89913 10.7415264 12.0362091 9.726754
23 Y 2012 10.89913 33.0505303 37.0341310 9.726754
24 Z 2012 10.89913 4.4000789 4.9304231 9.726754
25 A 2014 10.89913 33.2131390 21.9914088 16.460708
26 AA 2014 10.89913 6.1083209 4.0445013 16.460708
27 B 2014 10.89913 23.7340752 15.7150383 16.460708
28 BB 2014 10.89913 23.8037344 15.7611617 16.460708
29 C 2014 10.89913 4.7359054 3.1357841 16.460708
30 D 2014 10.89913 9.3974474 6.2223299 16.460708
31 G 2014 10.89913 76.2169837 50.4655356 16.460708
32 H 2014 10.89913 25.3791007 16.8042587 16.460708
33 I 2014 10.89913 19.6795650 13.0304263 16.460708
34 J 2014 10.89913 38.3273788 25.3776994 16.460708
35 L 2014 10.89913 5.3769835 3.5602609 16.460708
36 M 2014 10.89913 0.7357518 0.4871632 16.460708
37 N 2014 10.89913 24.7939618 16.4168208 16.460708
38 O 2014 10.89913 7.3858515 4.8903923 16.460708
39 P 2014 10.89913 13.2146233 8.7497958 16.460708
40 Q 2014 10.89913 44.7688146 29.6427659 16.460708
41 R 2014 10.89913 20.1787268 13.3609362 16.460708
42 S 2014 10.89913 19.2917527 12.7736442 16.460708
43 T 2014 10.89913 30.1723709 19.9780257 16.460708
44 U 2014 10.89913 25.3269056 16.7696988 16.460708
45 W 2014 10.89913 23.8037344 15.7611617 16.460708
46 X 2014 10.89913 18.1780208 12.0362091 16.460708
47 Y 2014 10.89913 55.9318301 37.0341310 16.460708
48 Z 2014 10.89913 7.4463091 4.9304231 16.460708
glmmTMB is another powerful package for GLMMs. The formula is quite similar to the lme4 package. You can consult the vignettes associated to this package for more details.
mod.P = glmmTMB(BC~1+(1|LC)+(1|Y),data=dat,family=poisson)
Binomial distribution
When the response variable is binary (e.g., presence/absence), the binomial distribution should be used. Here, we first transform the number of captures to a dummy variable (0 = no capture; 1: at least one capture). We then try to predict the probability to capture a brook charr in littoral habitat according to the temperature of the epilimnion (Temp variable).
dat$pres = ifelse(dat$BC==0,0,1)
#Analyses
mod.B = glmer(pres~Temp+(1|LC)+(1|Y),data=dat,family="binomial")
#Results
summary(mod.B)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: pres ~ Temp + (1 | LC) + (1 | Y)
Data: dat
AIC BIC logLik deviance df.resid
149.7 163.8 -70.9 141.7 247
Scaled residuals:
Min 1Q Median 3Q Max
-4.3185 0.0232 0.1560 0.3582 1.2745
Random effects:
Groups Name Variance Std.Dev.
LC (Intercept) 1.5492 1.2447
Y (Intercept) 0.5855 0.7652
Number of obs: 251, groups: LC, 24; Y, 3
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 25.1542 6.2234 4.042 5.3e-05 ***
Temp -1.0531 0.2784 -3.782 0.000155 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
Temp -0.995
Temperature is highly significant, the coefficient being negative means that the probability of capturing a brook charr decreases as temperature increases. We can see this relationship on a plot.
#Visualization
newDat <- data.frame(Temp=seq(10,27,0.1))
newDat$predFix = predict(mod.B,type="response",newdata=newDat,re.form=NA)
newDat$x = 1 + newDat$predFix #just to add line on ggplot
ggplot(dat,aes(x=as.factor(pres),y=Temp))+
geom_jitter(col="blue",position=position_jitter(0.05),alpha=0.5)+
geom_boxplot(width=0.1,alpha=0.5)+
geom_line(data=newDat,aes(x=x,y=Temp),col="red")+
coord_flip()+
theme_bw()
NLME: walleye growth curve
The database could be found here: https://datadryad.org/stash/dataset/doi:10.5061/dryad.vb957
In this third case study, we will extend the conceptual framework of fixed and random effects in the context of non linear mixed models using the nlme package. This is particularly helpful to fit theoretical models to ecological data. We illustrate the use of NLME by fitting von Bertalanffy growth curve (VBGF) on walleye data. Extensive analyses of the walleye growth curve could be found in the original article (Honsey et al. 2017).
Download and explore data
dat = read_excel("Data/MN_Walleye_data_Honsey_et_al_Ecol_Apps.xlsx",sheet="Data")
dat$TL = dat$TotalLength
dat = dat[,-c(6,7)] #Remove Sex and Maturity that contain missing values
datG = groupedData(TL~Age|LakeName,data=dat)
#Visualization
ggplot(data=dat,aes(x=Age,y=TL,col=LakeName))+
geom_point(alpha=0.2)+
facet_wrap(~LakeName)+
theme_bw()
VBGF
The VBGF is defined by three parameters: Linf, the asymptotic length, k, the growth coefficient, and t0, the theoretical age when size (total length) equals zero. The best way to understand how these parameters influence the VBGF is to first define the VBGF function and then visualize the growth curve for different values of the parameters. On the plot below, Linf is fixed to 800, k, ranges from 0.1 to 1 and t0, from -2 to 0.
#Define the von Bertalanffy function
VBGF = function(x,Linf,k,t0) Linf*(1-exp(-k*(x-t0)))
#Visualization
age = seq(0,20,0.1)
k = c(0.1,0.25,0.5,1)
t0 = c(-2,-1,0)
newDat = expand.grid(age=age,k=k,t0=t0)
newDat$TL = VBGF(x=newDat$age,Linf=800,k=newDat$k,t0=newDat$t0)
ggplot(data=newDat,aes(x=age,y=TL))+
geom_line(linewidth=1)+
facet_grid(t0~k)+
theme_bw()
## Warning: Ignoring unknown parameters: linewidth
nls and nlsList
The nls function can fit non linear model. Plotting the residuals according to the grouping variable informs us how it is important to incorporate the random effects to the model. Here, we see that residuals are not centered to zero for many lakes, indicating biases in model fit. Random effects have to be included in the model to improve model fit.
#nls
mod.nls = nls(TL~VBGF(Age,Linf,k,t0),data=dat,
start=c(Linf=800,k=.3,t0=-2))
summary(mod.nls)
Formula: TL ~ VBGF(Age, Linf, k, t0)
Parameters:
Estimate Std. Error t value Pr(>|t|)
Linf 686.20487 3.84292 178.56 <2e-16 ***
k 0.18346 0.00264 69.50 <2e-16 ***
t0 -1.31013 0.02532 -51.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 42.74 on 6232 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 8.025e-06
plot(mod.nls,LakeName~resid(.),abline=0)
The nlsList function from the nlme package fits the non linear model by group and can be used to feed the NLME model. Plotting the estimates and confidence intervals of the parameters informs us which of the parameters should be incorporated to random effect into the model.
#nlsList
mod.lis = nlsList(TL~VBGF(Age,Linf,k,t0)|LakeName,data=dat,
start=c(Linf=800,k=.3,t0=-2))
mod.lis
Call:
Model: TL ~ VBGF(Age, Linf, k, t0) | LakeName
Data: dat
Coefficients:
Linf k t0
Big Stone 615.2847 0.47640877 -0.4496596
Birch 599.2751 0.40639674 -0.4326197
Cass 730.2746 0.16582262 -1.6728568
Cut Foot Sioux 721.9514 0.16013094 -1.6658202
Kabetogama 619.5719 0.19482829 -1.6663531
Lake_of_the_Woods 763.7969 0.13920308 -1.4058217
Leech 574.3497 0.34585097 -0.7668472
Mille Lacs 685.0770 0.20748989 -1.0721370
Otter Tail 768.9445 0.12378024 -1.9103712
Rainy 784.8787 0.10368433 -1.7118710
Red (Upper Red) 1246.5037 0.06944248 -1.9706741
Sand Point 628.2803 0.17967672 -1.1223282
Vermilion 752.8391 0.13023880 -1.7898292
Winnibigoshish 733.2848 0.15629007 -1.5480317
Degrees of freedom: 6235 total; 6193 residual
Residual standard error: 33.61449
plot(intervals(mod.lis),layout=c(3,1))
nlme
Since all the three parameters seem to vary by lake, we can fit the NLME with random effects for the three parameters. We can directly use the nlme function with the model coming from the nlsList object or writing the equation, specifying how to fit fixed and random components and initiate the starting parameters from the nlsList estimates.
mod.nlme = nlme(mod.lis)
mod.nlme = nlme(TL~VBGF(Age,Linf,k,t0),data=dat,
fixed = Linf + k + t0 ~1,
random = Linf + k + t0 ~1|LakeName,
start=fixef(mod.lis))
summary(mod.nlme)
Nonlinear mixed-effects model fit by maximum likelihood
Model: TL ~ VBGF(Age, Linf, k, t0)
Data: dat
AIC BIC logLik
61718.82 61786.2 -30849.41
Random effects:
Formula: list(Linf ~ 1, k ~ 1, t0 ~ 1)
Level: LakeName
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
Linf 73.50797185 Linf k
k 0.09675492 -0.750
t0 0.41455909 -0.628 0.921
Residual 33.63418459
Fixed effects: Linf + k + t0 ~ 1
Value Std.Error DF t-value p-value
Linf 701.3126 20.897026 6219 33.56040 0
k 0.2019 0.026457 6219 7.63013 0
t0 -1.3274 0.116774 6219 -11.36769 0
Correlation:
Linf k
k -0.744
t0 -0.634 0.900
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-7.29496083 -0.58028176 0.01433836 0.60245254 5.87134241
Number of Observations: 6235
Number of Groups: 14
intervals(mod.nlme)
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
Linf 660.3570955 701.3126299 742.2681643
k 0.1500165 0.2018684 0.2537203
t0 -1.5563106 -1.3274487 -1.0985867
Random Effects:
Level: LakeName
lower est. upper
sd(Linf) 45.70317142 73.50797185 118.2285990
sd(k) 0.06348487 0.09675492 0.1474606
sd(t0) 0.27409574 0.41455909 0.6270044
cor(Linf,k) -0.91421134 -0.75048764 -0.3761385
cor(Linf,t0) -0.87001051 -0.62825819 -0.1429577
cor(k,t0) 0.75534473 0.92084581 0.9759264
Within-group standard error:
lower est. upper
33.04661 33.63418 34.23221
The summary function is used to show the model output and the intervals function, to extract the confidence intervals of the model parameters. We can visualize the results by calculating the predicted values at the population (black lines) or lake (color lines) levels as follow:
#Extract residuals and predicted values
dat$res = residuals(mod.nlme)
dat$predFull = predict(mod.nlme)
dat$predFixe = VBGF(dat$Age,fixef(mod.nlme)[1],fixef(mod.nlme)[2],fixef(mod.nlme)[3])
#Visualization
ggplot(data=dat,aes(x=Age,y=TL,col=LakeName))+
geom_point(alpha=0.2)+
geom_line(aes(y=predFixe),linewidth=1,col='black')+
geom_line(aes(y=predFull),linewidth=1,alpha=0.5)+
facet_wrap(~LakeName)+
theme_bw()
## Warning: Ignoring unknown parameters: linewidth
## Ignoring unknown parameters: linewidth
Note that we could smooth the lines by calculating predicted values on a new data frame by specifying regular intervals in age using the expand.grid function as illustrated in the GLMM case study. The parameters estimates at the lake level could be obtained with the coef function:
coef(mod.nlme)
Linf k t0
Big Stone 636.0669 0.4288756 -0.4916049
Birch 625.1030 0.3453633 -0.6424293
Cass 725.1959 0.1693830 -1.6279210
Cut Foot Sioux 704.4610 0.1723542 -1.4999589
Kabetogama 619.7784 0.1979797 -1.5987604
Lake_of_the_Woods 764.5359 0.1386047 -1.4208832
Leech 575.1076 0.3445203 -0.7696058
Mille Lacs 685.2610 0.2071741 -1.0767931
Otter Tail 738.3058 0.1363032 -1.7370665
Rainy 776.8366 0.1059113 -1.6781612
Red (Upper Red) 840.2491 0.1294834 -1.4276757
Sand Point 660.5877 0.1545496 -1.4147556
Vermilion 735.4408 0.1382678 -1.6648340
Winnibigoshish 731.4472 0.1573876 -1.5338316
Finally, different structures of the model could be compared by updating an existing model. Here, an example if we try to simplify the random component of the model:
#comparing model for the random component
mod.LKT = mod.nlme
mod.LK = update(mod.LKT,random = Linf + k ~1|LakeName)
anova(mod.LK,mod.LKT)
Model df AIC BIC logLik Test L.Ratio p-value
mod.LK 1 7 62056.94 62104.11 -31021.47
mod.LKT 2 10 61718.82 61786.20 -30849.41 1 vs 2 344.1205 <.0001
We confirm that we need a random effect for all parameters. We could also try to include predictors in the fixed effect. Be careful, however, to include the right starting parameter values. See the book of Pinheiro and Bates (2000) for detailed examples.
References
Allegue, H., Y. G. Araya-Ajoy, N. J. Dingemanse, N. A. Dochtermann, L. Z. Garamszegi, S. Nakagawa, D. Réale, H. Schielzeth, and D. F. Westneat. 2017. Statistical Quantification of Individual Differences (SQuID): an educational and statistical tool for understanding multilevel phenotypic data in linear mixed models. Methods in Ecology and Evolution 8:257-267.
Bolker, B. 2009. Learning hierarchical models: advice for the rest of us. Ecological Applications 19:588-592.
Dingemanse, N. J., and N. A. Dochtermann. 2013. Quantifying individual variation in behaviour: mixed-effect modelling approaches. Journal of Animal Ecology 82:39-54.
Faraway, J. 2006. Extending the linear models with R. Chapman and Hall, Boca Raton, Floride.
Gelman, A., and J. Hill. 2006. Data analysis using regression and multilevel / hierarchical models. Cambridge University Press, Cambridge, New York.
Harrison, X. A., L. Donaldson, M. E. Correa-Cano, J. Evans, D. N. Fisher, C. E. D. Goodwin, B. S. Robinson, D. J. Hodgson, and R. Inger. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. Peerj 6:e4794.
Honsey, A. E., D. F. Staples, and P. A. Venturelli. 2017. Accurate estimates of age at maturity from the growth trajectories of fishes and other ectotherms. Ecological Applications 27:182-192.
Johnson, P. C. D. 2014. Extension of Nakagawa & Schielzeth’s R2GLMM to random slopes models. Methods in Ecology and Evolution 5:944-946.
Nakagawa, S., P. C. D. Johnson, and H. Schielzeth. 2017. The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded. Journal of the Royal Society Interface 14:20170213.
Nakagawa, S., and H. Schielzeth. 2013. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution 4:133-142.
Pépino, M., P. Magnan, and R. Proulx. 2018. Field evidence for a rapid adaptive plastic response in morphology and growth of littoral and pelagic brook charr: A reciprocal transplant experiment. Functional Ecology 32:161-170.
Pépino, M., M. A. Rodríguez, and P. Magnan. 2016. Assessing the detectability of road crossing effects in streams: mark–recapture sampling designs under complex fish movement behaviours. Journal of Applied Ecology 53:1831-1841.
Pinheiro, J. C., and D. M. Bates. 2000. Mixed-effects models in S and S-plus. Springer, New York.
Rainville, V., A. Dupuch, M. Pépino, and P. Magnan. 2022. Intraspecific competition and temperature drive habitat-based resource polymorphism in brook charr, Salvelinus fontinalis. Evolutionary Ecology 36:967-986.
Reyjol, Y., M. A. Rodriguez, N. Dubuc, P. Magnan, and R. Fortin. 2008. Among- and within-tributary responses of riverine fish assemblages to habitat features. Canadian Journal of Fisheries and Aquatic Sciences 65:1379-1392.
Schielzeth, H., N. J. Dingemanse, S. Nakagawa, D. F. Westneat, H. Allegue, C. Teplitsky, D. Réale, N. A. Dochtermann, L. Z. Garamszegi, and Y. G. Araya-Ajoy. 2020. Robustness of linear mixed-effects models to violations of distributional assumptions. Methods in Ecology and Evolution 11: 1141– 1152.
Schielzeth, H., and S. Nakagawa. 2013. Nested by design: model fitting and interpretation in a mixed model era. Methods in Ecology and Evolution 4:14-24.
Wagner, T., D. B. Hayes, and M. T. Bremigan. 2006. Accounting for multilevel data structures in fisheries data using mixed models. Fisheries 31:180-187.
Zuur, A. F., and E. N. Ieno. 2016. A protocol for conducting and presenting results of regression-type analyses. Methods in Ecology and Evolution 7:636-645.
Zuur, A. F., E. N. Ieno, N. J. Walker, A. A. Saveliev, and G. M. Smith. 2009. Mixed effects models and extensions in ecology with R. Springer, New York, New York, USA.