First steps

Nature is a mess, but we search for order

Most of the time, as scientists, we try to find laws that explain the patterns we observe in the real world (the one outside an ecologist's desk)

Most of the time, we have a hard time finding deterministic processes: Nature is full of stochasticity (genetic and environmental variation, movements…)

How could we possibly find laws if what we are looking for is full of noise?

Help me, Obi-Wan Statistic. You're my only hope

Data generating process

Because stochasticity is an inherent part of our world, we have to model it! Thus…

A good model should be able to reproduce the noise…

  • It might be described by an equation
## Noise described by an equation: normal and lognormal density curves
curve(dnorm(x, 50, 4), from = 30, to = 70)
curve(dlnorm(x, log(50), log(4)), from = 0, to = 120)

Data generating process

Because stochasticity is an inherent part of our world, we have to model it! Thus…

A good model should be able to reproduce the noise…

  • It might be described by an equation

  • Or it might emerge from stochastic simulations

N = 10000                  # number of walkers
pos <- numeric(N)          # all walkers start at position 0
hist(pos)                  # initial positions
## 30 steps of a random walk: each walker moves -1 or +1 at every step
for(i in 1:30) pos <- pos + sample(c(-1, 1), N, replace = T)
## The final positions are well approximated by a normal distribution
hist(pos, freq = F); curve(dnorm(x, 0, sd(pos)),
                           from = -30, to = 30, add = T)

Data generating process

Because stochasticity is an inherent part of our world, we have to model it! Thus…

A good model should be able to reproduce the noise…

  • It might be described by an equation

  • Or it might emerge from stochastic simulations

But it should be informative enough to predict essential features of the data, such as summary statistics

Nemobius populations

Our inspiration of the day

Helen Sifera, an amateur entomologist, passionate about Orthopterans

And statistics!

Has accumulated a wonderful collection of records

Picasso, 1903, Celestina

Population estimate

Aim: estimate Nemobius sylvestris density in the family forest

50 random quadrats and systematic counts

Population estimate

Aim: estimate Nemobius sylvestris density in the family forest

50 random quadrats and systematic counts

## Import data
Nemobius <- gen.nemobius.pop()
## Plot the density of data
Nemobius %>% ggplot(aes(x = abundance)) +
  stat_density(geom = "line", position = "identity") +
  theme_minimal()
N = nrow(Nemobius) ## number of quadrats

Population estimate

Which distribution best describes the data?

  • Low counts with small variation: let's try a Poisson distribution

  • Poisson distribution is parametrized by a single parameter, \(\lambda > 0\).

  • This allows the formulation of the likelihood:

\[ y_i \sim Poisson(\lambda)\\ P(y_i | \lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}\\ \]
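As a quick sanity check, this formula matches R's built-in dpois() (the count and rate below are arbitrary illustrative values):

## Check the Poisson pmf formula against dpois()
y <- 4; lambda <- 3
lambda^y * exp(-lambda) / factorial(y)
dpois(y, lambda)  # same value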

Population estimate

Which parameter values are the most probable given my data?

  • Posterior distribution!

\[ P(\lambda | \boldsymbol{y}) \] - With a mode and uncertainty…

How to estimate this distribution?

Population estimate

Brute force approach: let's define a vector of potential parameter values

## Define an equally spaced vector
lambda_prior <- seq(from = 0.1, to = 12, by = 0.2)

Population estimate

Brute force approach: let's generate data for every parameter value

## Create a data.frame to store results
Simu <- data.frame(run = rep(1:length(lambda_prior), each = N),
                   lambda_prior = rep(lambda_prior, each = N),
                   abund_sim = NA)
## Simulate data for each value of lambda_prior
for(r in unique(Simu$run)){
  lambda_r <- unique(Simu$lambda_prior[Simu$run == r])
  Simu$abund_sim[Simu$run == r] <- rpois(N, lambda_r)
}

Population estimate

Comparing observed and predicted: visual inspection

  • The probability that our simulations produce exactly the observed data is extremely low!
Nemobius %>% ggplot(aes(x = abundance)) +
  stat_density(data = filter(Simu, run %in% c(3:75)),
               aes(x = abund_sim, group = run), geom = "line",
               position = "identity",col = "blue", alpha = 0.1) +
  stat_density(geom = "line", lwd = 1, position = "stack") +
  theme_minimal()

Population estimate

Comparing observed and predicted: summary statistics

  • Our objective is to summarize the observed values by a sufficient set of statistics

Poisson distribution: \[ Mean = \lambda\\ Median \approx\lfloor\lambda+1/3-0.02/\lambda\rfloor\\ Variance = \lambda\\ Skewness = \lambda^{-1/2} \]
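These relations are easy to check on a large simulated sample (lambda = 4 is an arbitrary illustrative value):

## Check the theoretical moments of the Poisson distribution by simulation
lambda <- 4
y_sim <- rpois(1e5, lambda)
mean(y_sim); var(y_sim)                       # both close to lambda
mean((y_sim - mean(y_sim))^3) / sd(y_sim)^3   # skewness, close to lambda^(-1/2)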

Population estimate

Comparing observed and predicted: summary statistics

## Compute observed mean
mean_obs <- mean(Nemobius$abundance)
## Compute simulated means
Mean_sim <- Simu %>% group_by(run, lambda_prior) %>%
  summarize(mean_sim = mean(abund_sim)) %>%
  mutate(mean_dist = mean_obs - mean_sim)
## Plot
Mean_sim %>% ggplot(aes(x= mean_sim, y = lambda_prior)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  geom_vline(aes(xintercept = mean_obs)) +
  xlab("Simulated mean") +
  ylab(expression(paste("Prior ", lambda))) +
  theme_minimal()

Population estimate

Computing the posterior distribution

  • Adjusting each prior draw according to its distance to the observed summary statistic

\[ \lambda_r^* = \lambda_r - b(S(y_{r}) - S(y_{obs})) \]

Population estimate

Computing the posterior distribution

  • Adjusting each prior draw according to its distance to the observed summary statistic!
## Extract the regression coefficients (intercept and slope)
b <- Mean_sim %>% lm(data = ., mean_sim ~ lambda_prior) %>% coef()
## Adjust each simulated parameter value by its distance to the observed mean
Mean_sim <- Mean_sim %>%
  mutate(lambda_post = lambda_prior - b[2] * (mean_sim - mean_obs))
## Plot the posterior distribution
Mean_sim %>% ggplot(aes(lambda_post)) +
  stat_density(geom = "line", position = "identity") +
  geom_vline(aes(xintercept = mean_obs)) +
  xlab("Value of lambda") +
  ylab(expression(paste("P(",lambda,"| Abund)"))) +
  theme_minimal()

Questions and exercises

Did we put prior information into the model?

What happens if we try a sequence of lambda between -10 and 10? Between 0.01 and 1000?

What happens if we use a normal distribution instead of a Poisson? Set \(\sigma = \sqrt{\mu}\)

Environmental constraints

Environmental constraints

While exploring the family forest, Helen noticed strong variations in the abundance of wood crickets.

She associated these variations with what she felt under her feet…

And decided to investigate!

Environmental constraints

How could we relate observed cricket abundances to litter thickness?

## Import data and parameter values
Nemolist <- gen.nemobius.reg()
Nemobius <- Nemolist$Nemobius
Truea <- Nemolist$Truea
Trueb <- Nemolist$Trueb
## Plot observed pattern
Nemobius %>% ggplot(aes(litter_thickness, abundance)) +
  geom_point() +
  theme_minimal() +
  xlab("Litter thickness (cm)") +
  ylab("Cricket abundance")

Environmental constraints

How could we relate observed cricket abundances to litter thickness? We have:

  • a likelihood function

  • an equation describing the phenomenon of interest

  • and a trick to ensure we will predict positive counts!

Environmental constraints

How could we relate observed cricket abundances to litter thickness? We have:

  • a likelihood function \[y_i \sim Poisson(\lambda_i)\]
  • an equation describing the phenomenon of interest \[\lambda_i = \alpha + \beta x_i\]
  • and a trick to ensure we will predict positive counts! \[\log(\lambda_i) = \alpha + \beta x_i \iff \lambda_i = e^{\alpha + \beta x_i}\]
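A minimal illustration of the log-link trick, with hypothetical values of \(\alpha\) and \(\beta\): whatever the parameter values, the expected count stays positive.

## The log link guarantees positive expected counts for any alpha and beta
alpha <- -2; beta <- 0.8                  # hypothetical values
x <- seq(0, 10, by = 0.1)                 # hypothetical litter thickness (cm)
plot(x, exp(alpha + beta * x), type = "l",
     xlab = "Litter thickness (cm)", ylab = expression(lambda))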

Environmental constraints

Now we are interested in the joint distribution of the parameters. We need to define the probability of combinations of parameters!

We have two solutions to solve this problem:

  • use brute force… or use Bayes' theorem!

\[ P(\alpha, \beta | y_i) \propto P(y_i | \alpha, \beta)P(\alpha, \beta)\\ P(\alpha, \beta | \boldsymbol{y}) \propto \left[\prod_{i=1}^{n}P(y_i | \alpha, \beta)\right]P(\alpha, \beta)\\ \log P(\alpha, \beta | \boldsymbol{y}) = \sum_{i=1}^{n}\log P(y_i | \alpha, \beta) + \log P(\alpha, \beta) + \text{constant}\\ \]
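Why bother with the log version? With many observations, the product of likelihoods underflows to zero in floating point, while the sum of log-likelihoods stays finite. A small illustration on simulated data (not the Nemobius counts):

## Product of likelihoods underflows, the sum of log-likelihoods does not
set.seed(1)
y_big <- rpois(1000, 5)
prod(dpois(y_big, 5))             # 0: numerical underflow
sum(dpois(y_big, 5, log = TRUE))  # finite log-likelihood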

Environmental constraints

Posterior distributions are often analytically intractable!

But Markov chain Monte Carlo (MCMC) algorithms allow us to sample cleverly from the joint posterior distribution.

This simultaneously gives us the marginal distributions of the parameters.

Environmental constraints

Before making any move, we need a complete data generating process!

\[ y_i \sim Poisson(\lambda_i)\\ \lambda_i = e^{\alpha + \beta x_i} \] And a complete data generating process needs priors. What would good priors be?

Environmental constraints

Prior choice

  • We need to use what we know about the data and their link to the parameters to choose priors that favor efficient posterior estimation and sound inference

  • What do we know about parameters, \(\alpha\) and \(\beta\)?

Environmental constraints

Prior choice

  • We need to use what we know about the data and their link to the parameters to choose priors that favor efficient posterior estimation and sound inference.

  • What do we know about parameters, \(\alpha\) and \(\beta\)?

  • Might be positive OR negative -> defined on \(R\)
  • Are on the log scale -> should not be too large

Environmental constraints

Prior choice

We need to use what we know about the data and their link to the parameters to choose priors that favor efficient posterior estimation and sound inference.

What do we know about parameters?

  • Might be positive OR negative -> symmetric and defined on \(R\)
  • Are on the log scale -> should not be too large
## dstudent_t() is provided by the brms package
ggplot(data.frame(x = c(-20, 20)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 5)) +
  stat_function(fun = dstudent_t, args = list(df = 3, mu = 0, sigma = 5), lty = 2) +
  theme_minimal() +
  ylab("Density")
## Or use curve()!

Environmental constraints

Before making any move, we need a complete data generating process!

\[ y_i \sim Poisson(\lambda_i)\\ \lambda_i = e^{\alpha + \beta x_i} \] And a complete data generating process needs priors. What would good priors be?

\[ \alpha \sim normal(0,5)\\ \beta \sim normal(0,5)\\ \]
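Before sampling, a quick prior predictive check can tell us whether these priors imply plausible abundances. This is a minimal sketch: the mid-range litter thickness of 5 cm is an assumption for illustration, not a value taken from the data.

## Prior predictive check (sketch): what do normal(0, 5) priors imply?
set.seed(1)
alpha_sim <- rnorm(1000, 0, 5)
beta_sim  <- rnorm(1000, 0, 5)
x_mid <- 5                                # assumed mid-range litter thickness (cm)
lambda_sim <- exp(alpha_sim + beta_sim * x_mid)
quantile(lambda_sim, c(0.25, 0.5, 0.75))  # median near 1, but a very heavy right tail

These priors are thus only weakly informative: on the count scale they still allow absurdly large abundances, which is worth keeping in mind.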

Environmental constraints

So let's sample from the posterior!

## Define the likelihood
likelihood <- function(param, x, y){
  alpha = param[1]
  beta = param[2]

  pred = exp(alpha + beta*x)
  singlelikelihoods = dpois(y, lambda = pred, log = T)
  sumll = sum(singlelikelihoods)
  return(sumll)
}

Environmental constraints

So let's sample from the posterior!

## Define priors
prior <- function(param){
  alpha = param[1]
  beta = param[2]

  aprior = dnorm(alpha, mean = 0, sd = 5, log = T)
  bprior = dnorm(beta, mean = 0, sd = 5, log = T)

  return(aprior+bprior)
}

Environmental constraints

So let's sample from the posterior!

## Compute the (unnormalized) log-posterior
posterior <- function(param, x, y){
  return (likelihood(param, x, y) + prior(param))
}

Environmental constraints

So let's sample from the posterior!

## Function to get a new proposal (random walk centred on the current values)
proposalfunction <- function(param){
  return(rnorm(2, mean = param, sd = c(0.1, 0.5)))
}

Environmental constraints

So let's sample from the posterior!

## And the Metropolis-Hastings algorithm...
run_metropolis_MCMC <- function(startvalue, iterations, x, y){
  chain = array(dim = c(iterations+1,2)) # object to store results
  chain[1,] = startvalue # fix starting values
  for (i in 1:iterations){
    # propose new parameter values based on the current ones
    proposal = proposalfunction(chain[i,])
    # acceptance ratio: posterior of the proposal relative to the current values
    probab = exp(posterior(proposal, x, y) - posterior(chain[i,], x, y))
    if (runif(1) < probab){
      chain[i+1,] = proposal
    }else{
      chain[i+1,] = chain[i,]
    }
  }
  return(chain)
}

Environmental constraints

So let's sample from the posterior!

startvalue <- c(4,0)
chain <- run_metropolis_MCMC(startvalue, 20000,
                             x = Nemobius$litter_thickness,
                             y = Nemobius$abundance)

Environmental constraints

And look at the results: marginal distributions

burnIn <- 1000
par(mfrow = c(2,2))
hist(chain[-(1:burnIn),1], nclass = 30, main = "Posterior of alpha",
     xlab = "Red line represents the true value"); abline(v = Truea, col = "red")
plot(chain[-(1:burnIn),1], type = "l", main = "Chain values of alpha",
     xlab = "Red line represents the true value"); abline(h = Truea, col = "red")
hist(chain[-(1:burnIn),2], nclass = 30, main = "Posterior of beta",
     xlab = "Red line represents the true value")
abline(v = Trueb, col = "red")
plot(chain[-(1:burnIn),2], type = "l", main = "Chain values of beta",
     xlab = "Red line represents the true value")
abline(h = Trueb, col = "red")
par(mfrow = c(1,1))
summary(chain)

Environmental constraints

And look at the results: joint distribution

as.data.frame(chain) %>% ggplot(aes(x = V1, y = V2)) +
  stat_density2d(aes(alpha = ..level..),geom = "polygon", fill = "blue") +
  geom_point(data = data.frame(Truea = Truea, Trueb = Trueb),
             aes(Truea, Trueb)) +
  theme_minimal()

Environmental constraints

However, we cannot be convinced by one chain: the posterior surface might be complicated, and the chain might be stuck in a local maximum or might not explore the surface sufficiently.

To make robust inferences, we use multiple chains.

## Fit model with 4 chains
fitlist <- list()
for(f in 1:4) fitlist[[f]] <- MCMCpoisson(data = Nemobius,
                                          abundance ~ litter_thickness,
                                          mubeta = 0, Vbeta = 5,
                                          seed = runif(1,0,100))
## Transform into an mcmc.list object for the coda package
fit <- as.mcmc.list(fitlist)
summary(fit)
plot(fit)
## Values close to 1 indicate good chain mixing!
gelman.diag(fit)

Environmental constraints

Let's have a look at the predictions!

## Define continuous new predictors
xpred = data.frame(int = 1, x = seq(min(Nemobius$litter_thickness),
            max(Nemobius$litter_thickness), length.out = 100))
## Compute the expected abundance for each posterior draw
pred <- exp(chain[-c(1:burnIn),] %*% t(as.matrix(xpred)))
## Simulate predicted data
ypred <- pred
for(j in 1:ncol(pred)) ypred[,j] <- rpois(nrow(pred), pred[,j])
## Summarize predicted data with quantiles
ypred_quant <- apply(ypred, 2, quantile, c(0.05, 0.5, 0.95), na.rm = T)
ypred_quant <- cbind(as.data.frame(t(ypred_quant)), xpred)
summary(ypred_quant)

Environmental constraints

Let's have a look at the predictions!

## Plot observed and predicted patterns
ggplot(data = ypred_quant) +
  geom_ribbon(aes(x, ymin = `5%`, ymax = `95%`), alpha = 0.2) +
  geom_line(aes(x, y = `50%`)) +
  geom_point(data = Nemobius, aes(litter_thickness, abundance)) +
  theme_minimal() +
  xlab("Litter thickness (cm)") +
  ylab("Cricket abundance")

Congratulations!! You just fitted a Generalized Linear Model by hand!

Some packages to do that

  • brms and rstanarm, built on stan with a syntax close to lme4; brms is highly flexible. Both also implement facilities for state-of-the-art model selection (a brms sketch follows this list)
  • MCMCpack, which we just used, with a syntax close to nlme and a lot of specialized functions and algorithms. Unfortunately, it lacks predict and other convenient (if not mandatory) methods
  • MCMCglmm, with a syntax close to the classical glm and to nlme
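As an illustration, here is a minimal sketch of how the same Poisson regression could be fitted with brms. The call mirrors the hand-coded example above (same formula and normal(0, 5) priors); settings such as the number of chains are illustrative choices, not part of the original material.

library(brms)
fit_brms <- brm(abundance ~ litter_thickness, data = Nemobius,
                family = poisson(),
                prior = c(prior(normal(0, 5), class = Intercept),
                          prior(normal(0, 5), class = b)),
                chains = 4, iter = 2000)
summary(fit_brms)   # marginal posteriors and convergence diagnostics
pp_check(fit_brms)  # posterior predictive check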

Process models

Population dynamics

In 1922, Helen is hired by the Department of Entomology at the University of Maryland.

During the 1930s, she met Raymond Pearl at Johns Hopkins University, who presented some of Alfred Lotka's work…

Which reminded her of the traps she had been monitoring for 20 years in the family forest!

She weighed the biomass of Nemobius and of every other order she captured, including arachnids…

Population dynamics

## Import data
Nemolist <- gen.nemobius.pred(P = 1, t = 35, logit = F,
                              mu_rg = 0.55, mu_ri = 0.1,
                              mu_rm = 0.5, mu_ra = 0.2)
Nemobius <- Nemolist$Nemobius # data.frame
pars <- Nemolist$pars # parameter values
yini <- Nemolist$yini # initial values of populations
summary(Nemobius)
## Plot observed patterns
Nemobius %>% ggplot(aes(x = times, y = Prey)) +
  geom_point(col = "blue") +
  geom_line(col = "blue") +
  geom_point(aes(y = Pred), col = "red") +
  geom_line(aes(y = Pred), col = "red") +
  ylab("Biomass (Nemobius and predators)") +
  xlab("Years after the beginning") +
  theme_minimal()

Population dynamics

Lotka-Volterra model of predation, with exponential growth of prey population

\[ \frac{dPrey}{dt} = r_{g}Prey - r_{i}\,Prey\,Pred\\ \frac{dPred}{dt} = -r_{m}Pred + r_{a}\,Prey\,Pred\\ \] \(r_g\) = maximal growth rate

\(r_i\) = ingestion rate

\(r_m\) = mortality rate

\(r_a\) = assimilation efficiency

Population dynamics

Or generalized to accommodate logistic growth of the prey population!

\[ \frac{dPrey}{dt} = r_{g}Prey(1-\frac{Prey}{K}) - r_{i}\,Prey\,Pred\\ \frac{dPred}{dt} = -r_{m}Pred + r_{a}\,Prey\,Pred\\ \] \(K\) = carrying capacity

Population dynamics

But these equations are deterministic… And they are only part of the data generating process!

How is noise produced?

Population dynamics

But these equations are deterministic… And they are only part of the data generating process!

How is noise produced? Lognormal distribution as likelihood

  • Biomasses are strictly positive
  • Multiplicative errors => + or - a percentage of the total!

\[ \log(y) \sim normal(\mu, \sigma)\\ y \sim lognormal(\mu, \sigma)\\ e^\mu = \text{median} = \text{geometric mean}\\ e^{\sigma} = \text{geometric SD} \]

curve(dlnorm(x, log(10), 1), from = 0, to = 50)

Population dynamics

But these equations are deterministic… And they are only part of the data generating process!

What should priors be?

  • Rates are strictly positive
  • \(r_g\) and \(r_m\) multiply the prey and predator populations, respectively
  • \(r_i\) and \(r_a\) multiply the product of the prey and predator populations
  • Initial population values have to be estimated, and we know they are roughly between 0 and 100…
  • Multiplicative errors are…
sd(Nemobius$Prey); sd(log(Nemobius$Prey))
sd(Nemobius$Pred); sd(log(Nemobius$Pred))

Population dynamics

But these equations are deterministic… And they are only part of the data generating process!

What should priors be?

\[ \begin{aligned} r_g \sim lognormal(log(1), 1)\\ r_i \sim lognormal(log(0.05), 1)\\ r_m \sim lognormal(log(1), 1)\\ r_a \sim lognormal(log(0.1), 1) \end{aligned} \begin{aligned} \boldsymbol{z_{init}} \sim lognormal(log(50), 1)\\ \boldsymbol{\sigma} \sim lognormal(0, 1) \end{aligned} \]

Population dynamics

If we are able to simulate population trajectories, we could

  • estimate parameters of the phenomenological model
  • compare the two formulations based on their out-of-sample predictive ability (information criteria): exponential or logistic growth of the prey.
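To build intuition for these trajectories before fitting anything, here is a minimal simulation sketch with deSolve, using the exponential-growth formulation. The parameter and initial values are purely illustrative (loosely based on the prior means used above), not the values behind the Nemobius data.

library(deSolve)
## Lotka-Volterra with exponential growth of the prey
lv_exp <- function(t, state, pars){
  with(as.list(c(state, pars)), {
    dPrey <- rg * Prey - ri * Prey * Pred
    dPred <- -rm * Pred + ra * Prey * Pred
    list(c(dPrey, dPred))
  })
}
pars_sketch  <- c(rg = 0.55, ri = 0.1, rm = 0.5, ra = 0.2)  # illustrative values
state_sketch <- c(Prey = 4, Pred = 4)                       # illustrative initial biomasses
out <- ode(y = state_sketch, times = 0:35, func = lv_exp, parms = pars_sketch)
matplot(out[, "time"], out[, c("Prey", "Pred")], type = "l", lty = 1,
        col = c("blue", "red"), xlab = "Years", ylab = "Biomass")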

Population dynamics

Let's try to do this with stan!

## Put data in a format stan can handle
N = nrow(Nemobius) - 1; ts = 1:N
y = filter(Nemobius, times != 1) %>%
                   dplyr::select(Prey, Pred) %>% as.matrix
y_init = filter(Nemobius, times == 1) %>%
                   dplyr::select(Prey, Pred) %>% as.numeric
Nemostan <- list(y = y, y_init = y_init, ts = ts, N = N, mu_zinit = 50,
                 mu_sigma = 0, mu_rg = 1, mu_ri = 0.05,
                 mu_rm = 0.5, mu_ra = 0.1)
## First compile the model
mod_exp <- stan_model("LVmod_exp.stan")
## Sample from the posterior
fit_exp <- sampling(mod_exp, data = Nemostan,
                    iter = 2000, chains = 3, cores = 3, init_r = 1)
## Or load a pre-computed fit
fit_exp <- readRDS("fit_exp.RData")

Population dynamics

And the results…

print(fit_exp, pars = c("theta", "z_init"))
pars
yini
rstan::traceplot(fit_exp, pars = c("theta", "z_init"))
## Extract and summarize predicted values
y_rep <- extract(fit_exp, pars = "y_rep")$y_rep
z_init <- extract(fit_exp, pars = "z_init")$z_init
prey_rep <- y_rep[,,1]; pred_rep <- y_rep[,,2]
prey_init <- z_init[,1]; pred_init <- z_init[,2];
prey_quant <- as.data.frame(t(apply(cbind(prey_init, prey_rep), 2,
                                          quantile, c(0.1, 0.5, 0.90) )))
pred_quant <- as.data.frame(t(apply(cbind(pred_init, pred_rep), 2,
                                          quantile, c(0.1, 0.5, 0.90) )))

Population dynamics

How can we ensure our model is not that bad? Posterior predictive checks allow us to verify whether it captured essential features of the data.

## Observed vs. predicted distribution
bayesplot::ppc_dens_overlay(Nemostan$y[,1], prey_rep)
bayesplot::ppc_dens_overlay(Nemostan$y[,2], pred_rep)
## Observed vs. predicted patterns
ggplot() + geom_ribbon(data = prey_quant, alpha = 0.3,
            aes(x = 1:nrow(prey_quant), ymin = `10%`, ymax = `90%`)) +
  geom_ribbon(data = pred_quant, alpha = 0.3,
              aes(x = 1:nrow(pred_quant), ymin = `10%`, ymax = `90%`)) +
  geom_point(data = Nemobius, aes(x = times, y = Prey), col = "blue") +
  geom_point(data = Nemobius, aes(x = times, y = Pred), col = "red") +
  geom_line(data = Nemobius, aes(x = times, y = Prey), col = "blue") +
  geom_line(data = Nemobius, aes(x = times, y = Pred), col = "red")

Population dynamics

Questions and exercises

  • Modify the stan code to fit the logistic form of the Lotka-Volterra model
  • Define a good prior for K
  • Compare the two formulations by using the Leave-One-Out Information Criterion (LOO-IC); a loo sketch follows this list
  • Which formulation provides the best out-of-sample predictive accuracy?
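For the LOO-IC comparison, the loo package can be used. This is only a sketch: it assumes the stan programs store the pointwise log-likelihoods in a generated quantity called log_lik (not shown in LVmod_exp.stan), and fit_log denotes the hypothetical fit of the logistic formulation.

library(loo)
## Pointwise log-likelihoods are assumed to be computed in the stan models
loglik_exp <- extract_log_lik(fit_exp, parameter_name = "log_lik")
loglik_log <- extract_log_lik(fit_log, parameter_name = "log_lik")  # hypothetical logistic fit
loo_exp <- loo(loglik_exp)
loo_log <- loo(loglik_log)
loo_compare(loo_exp, loo_log)  # higher elpd = better out-of-sample predictive accuracy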

Some tools to do that

  • rstan, because of the accessible C++ backend and the integrated ODE solvers
  • BayesianTools, a package written by a theoretical ecologist (author of the blog theoreticalecology.wordpress.com)

But what happens when we cannot define a likelihood function?

  • abc, which stands for Approximate Bayesian Computation. We might talk about it another time!

Congratulations!!

Very interesting sources