Hierarchical models
Lucas Deschamps
5 April 2019
- Packages
- Introduction
- Number of consumers: a simple mean model
- Varying intercepts
- Goal
- Results
- Results
- A simple intercept model
- Simulating from the Data Generating Process
- Simulating from the Data Generating Process
- Fit the non-hierarchical intercepts model with brms
- Check the sampling
- Posterior predictive checks
- We should do better!
- Visualizing the uncertainty
- We should do better: hierarchical models!
- Varying intercept model
- Fit the varying intercept model with brms
- Check the sampling
- Posterior predictive checks
- Posterior predictive checks
- Model comparison: shrinkage of parameters
- Model comparison: shrinkage of predictors and predictions
- Model comparison: predictive density
- Effective number of parameters
- Sampling new levels
- Conclusion
- Varying slope model
- Goal
- Results
- Results
- Results
- Varying slope model
- Varying slope model: Hierarchical priors
- Varying slope model: Priors and hyperpriors
- Simulate from the data generating process
- Simulate from the data generating process
- Simulate from the data generating process
- Fit the varying slope model with brms
- Check the sampling
- Posterior predictive checks
- Posterior predictive checks
- Fit a non-hierarchical slope model with brms
- Check the sampling
- Parameter comparison
- Model comparison: predictive density
- Hierarchical model comparison: No AIC!
- Prior predictive checks
- More and more…
- Exercises
Packages
rstan: Hamiltonian Monte-Carlo. The installation guide is here.
rstanarm:
- a very efficient, optimized and innovative modular package
- tries to mimic the behavior of classical frequentist packages
- fast for simple problems: pre-compiled models
brms:
- a very flexible and powerful package (vast set of implemented models, possibility to add hand-made code, to implement custom distributions…)
- slower for simple problems: needs compilation
Introduction
Hierarchical models
The world is structured (populations within species and regions; cells within organs and individuals)…
… And we measure different drivers at different levels of organization (biotic interactions regulate population sizes within lakes, but the presence or absence of a species also depends on its ability to get there!)…
… But information flows through the different levels, and we want to exploit it to model the world.
So, let's enter the wonderful world of hierarchical models!
Background
Mr Maxwell, a descendant of the famous physicist, has sold his coffee brand to K***t.
He moved to Trois-Rivières with the strong wish to open a new kind of café.
But the competition is fierce, and he needs to conduct a market investigation…
Number of consumers: a simple mean model
Goal
What is the mean number of consumers in Trois-Rivières' cafés at rush hour?
Mr Maxwell hired professionals to quantify the number of consumers in each café at 8h.
Results
Choosing the likelihood
What do we know about the number of consumers?
- Integers
- Strictly positive
- Variance \(\propto\) mean: if a shop averaging three consumers starts to receive more than eight consumers on some days, both the mean and the variance will increase!
=> Poisson distribution!
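As a quick sanity check of this mean-variance relationship (a small sketch, not part of the survey data), we can simulate a large Poisson sample and verify that its mean and variance are both close to the same value:
## Mean and variance of a Poisson sample are both ~ lambda
x <- rpois(1e5, lambda = 3)
mean(x) # ~ 3
var(x)  # ~ 3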
Poisson distribution
The Poisson distribution is parametrized by a single parameter representing both the mean and the variance, \(\lambda > 0\).
This allows the formulation of the likelihood function:
\[ y_i \sim Poisson(\lambda)\\ P(y_i | \lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}\\ \]
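In R, this likelihood is available as dpois(). As a small check (with made-up numbers), the closed-form expression above matches the built-in function:
## Likelihood of a single observation y_i = 4 given lambda = 3
y_i <- 4; lambda <- 3
lambda^y_i * exp(-lambda) / factorial(y_i) # closed form: ~0.168
dpois(y_i, lambda)                         # identical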
Data Generating Process
- A likelihood
\[ y_i \sim Poisson(\lambda) \]
- Priors incorporating our knowledge about the parameters
\[ \lambda \sim logNormal(\mu = log(??), \sigma = log(??)) \]
Because \(\lambda\) is strictly positive, we have to choose a prior defined only over \((0, +\infty)\). The log-normal distribution is strictly positive, and its \(\mu\) and \(\sigma\) correspond to the mean and standard deviation on the log scale. If we think there are on average 10 consumers, we should set a prior with \(\mu = log(10)\).
Looking at the priors
curve(dlnorm(x, log(10), log(2)), from = 0, to = 20)
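We can also ask what range of consumer numbers this prior considers plausible; a quick sketch of its quantiles, assuming the same \(\mu = log(10)\) and \(\sigma = log(2)\):
## 95% prior interval for lambda on the outcome scale
qlnorm(c(0.025, 0.5, 0.975), meanlog = log(10), sdlog = log(2))
## roughly 2.6, 10 and 39 consumers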
Simulating from the Data Generating Process
## Load library: tidy data manipulation and plotting
library(tidyverse)
## Create the list of coffee shops
shops <- c("Morgane", "Starbucks", "LTDP", "Frida", "Presse", "TimHortons", "Gourmet",
"Depot", "Brulerie", "Choco-Brico", "Sacristain", "VanHoutte", "Crémière" ,
"Cafeier", "Griffin", "Panetier", "Bucafin", "Urbanithé", "MarcheNotreDame",
"Brulerie", "MoulinsLaFayette", "HistoireSansFin", "Pompon", "Boulangerie",
"Thimothys")
## Set the seed of the random number generator
set.seed(999)
## Sample parameters of the model
N <- length(shops) # Number of samples
mu_lambda <- 10 # Expected mean of lambda
sigma_lambda <- 2 # Expected spread of lambda (multiplicative)
lambda <- rlnorm(1, log(mu_lambda), log(sigma_lambda)) # Draw lambda from its prior
y <- rpois(N, lambda) # Sample the observed number of consumers
# Assemble in a data.frame
D <- data.frame(town = "Trois-Rivières", consumers = y, shops = shops)
# Boxplot of the result
D %>% ggplot(aes(x = town, y = consumers)) +
geom_boxplot() +
geom_jitter(aes(color = shops)) +
ylab("Consumers (n)") +
xlab("Town") +
theme_minimal()
Varying intercepts
Goal
Which cafés are potentially the most important competitors?
Mr Maxwell hired professionals to quantify the number of consumers in each café at 8h30, repeatedly.
Results
Results
A simple intercept model
- A likelihood
\[ y_i \sim Poisson(\lambda_i) \]
- An equation and a link function
\[ log(\lambda_i) = \beta_0 + \beta_s \Leftrightarrow \lambda_i = e^{\beta_0 + \beta_s} \]
- Priors
\[ \beta_0 \sim Normal(log(??), log(??))\\ \beta_s \sim Normal(0, log(??)) \]
The link function ensures that \(\lambda\) stays positive. Because the parameters are now on the log scale, we no longer need the log-normal distribution, but we still have to express what we believe about the parameters on the log scale. The global mean, \(\beta_0\), should be around 10 consumers, and individual shops might vary by plus or minus 5 consumers. However, note that because of the exponent, the shop deviation (\(\beta_s\)) from the population mean (\(\beta_0\)) will be multiplicative instead of additive!
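A small numerical sketch (with made-up values) makes this multiplicative behavior explicit:
## A shop deviation of log(2) doubles the expected count
b_0 <- log(10) # population mean: 10 consumers
b_s <- log(2)  # shop deviation on the log scale
exp(b_0 + b_s)      # 20 = 10 * 2, multiplicative...
exp(b_0) * exp(b_s) # ...not 10 + 2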
Simulating from the Data Generating Process
## Create a data frame
X_long <- data.frame(shops = rep(shops, each = 4))
## Create a design matrix
library(useful)
X_large <- build.x(~ shops, X_long, contrasts = F)[,-1]
## Sample from the prior
set.seed(999)
b_shops <- rnorm(ncol(X_large), 0, log(5)) # Shop deviations from the population mean
b_0 <- rnorm(1, log(10), log(2)) # Population intercept
## Compute linear predictor
mu <- exp(b_0 + X_large %*% b_shops)
## Sample observations
y <- rpois(nrow(X_large), mu)
D <- data.frame(shops = X_long$shops, consumers = y, town = "Trois-Rivières",
type = "Observed")
D %>% ggplot(aes(x = town, y = consumers)) +
geom_boxplot() +
geom_point(aes(col = shops), position = "jitter") +
ylab("Consumers (n)") +
xlab("Town") +
theme_minimal()
Simulating from the Data Generating Process
D %>% mutate(shops = reorder(.$shops, .$consumers, median, na.rm = T)) %>%
ggplot(aes(x = shops, y = consumers, col = shops)) +
geom_boxplot() +
guides(color = F) +
ylab("Consumers (n)") +
xlab("Shops") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45))
Fit the non-hierarchical intercepts model with brms
## Load library: flexible Bayesian model fitting
library(brms)
## Formula
form_int <- bf(consumers ~ shops, family = poisson)
## Get priors
get_prior(form_int, data = D)
## Define priors
priors_int <- c(prior(normal(0,5), class = Intercept),
prior(normal(0,5), class = b))
## Sample from the posterior
fit_int <- brm(formula = form_int, prior = priors_int, data = D,
chains = 4, iter = 2000,
cores = parallel::detectCores())
fit_int
Check the sampling
fit_int
Family: poisson
Links: mu = log
Formula: consumers ~ shops
Data: D (Number of observations: 100)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 0.88 0.31 0.26 1.45 414 1.01
shopsBrulerie -2.46 0.81 -4.25 -1.03 1559 1.00
shopsBucafin 1.83 0.33 1.21 2.50 463 1.01
shopsCafeier 1.11 0.36 0.41 1.83 526 1.01
shopsChocoMBrico 0.40 0.40 -0.39 1.20 664 1.01
shopsCrémière -0.76 0.55 -1.92 0.30 1205 1.00
shopsDepot -5.65 2.62 -12.09 -1.92 1618 1.00
shopsFrida -1.01 0.63 -2.38 0.10 1126 1.00
shopsGourmet -1.33 0.68 -2.79 -0.09 1175 1.00
shopsGriffin -0.56 0.52 -1.63 0.42 993 1.00
shopsHistoireSansFin 2.86 0.31 2.28 3.50 432 1.01
shopsLTDP 1.18 0.35 0.51 1.89 529 1.01
shopsMarcheNotreDame 2.10 0.33 1.48 2.78 445 1.01
shopsMorgane 1.01 0.36 0.32 1.73 526 1.01
shopsMoulinsLaFayette 1.97 0.33 1.36 2.64 470 1.01
shopsPanetier -1.79 0.82 -3.61 -0.40 1792 1.00
shopsPompon 0.93 0.37 0.23 1.65 561 1.01
shopsPresse 0.93 0.36 0.23 1.64 564 1.01
shopsSacristain 1.94 0.33 1.33 2.61 468 1.01
shopsStarbucks -5.74 2.79 -12.47 -1.79 1828 1.00
shopsThimothys -1.33 0.71 -2.89 -0.11 1279 1.00
shopsTimHortons 1.74 0.34 1.12 2.43 471 1.01
shopsUrbanithé -0.02 0.44 -0.90 0.85 793 1.00
shopsVanHoutte 1.05 0.36 0.36 1.76 551 1.01
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can look at the chains and posterior distributions with the simple following command
plot(fit_int)
No divergent transitions nor BFMI warnings: the built-in diagnostics of the sampler did not reveal problematic behavior (essentially, problems with the curvature of the posterior surface). See here.
Rhat: \(\hat R\) indicates the mixing/convergence of the different chains of the MCMC sampler. It should be equal to 1, or at least < 1.02. A mixing problem might arise from a poorly identified model; in that case, we have to rethink the model and/or the priors.
Eff.Sample: the effective Monte-Carlo sample size indicates the efficiency of posterior sampling. It should be > 400 to trust the mean/median estimates, and possibly higher to have reliable estimates of the 0.025 and 0.975 quantiles (which are rarely useful anyway).
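These diagnostics can also be extracted programmatically; a sketch using the rhat() and neff_ratio() methods provided by brms, and a plotting helper from bayesplot:
## Convergence and efficiency diagnostics for all parameters
rhat(fit_int)       # potential scale reduction factors
neff_ratio(fit_int) # ratio of effective to total sample size
bayesplot::mcmc_rhat(rhat(fit_int)) # visual summary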
Posterior predictive checks
PPC are a good way to verify that our model captures the essential features of the data and is not off (over/underdispersed, unable to capture observed patterns…).
pp_check(fit_int, type = "dens_overlay") +
xlab("Consumers (n)")
We should do better!
Actually, we are repeatedly sampling in different cafés from the same population.
The previous model:
- considers each café as independent
- tries to estimate each café's parameter from only a few data points: enormous standard errors for shops with little available information!
- does not care about what we could learn from the population
It is as if each visit were the first one => no pooling
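A quick sketch with dplyr (already loaded with the tidyverse) shows how little information each shop provides on its own:
## Only a handful of observations per shop
D %>%
  group_by(shops) %>%
  summarise(n = n(), mean_consumers = mean(consumers))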
Visualizing the uncertainty
From the following plot, we can clearly see that at least two cafés have unreliable estimates, and that, globally, the estimates for more popular cafés are more uncertain.
Remember that the variance is proportional to the mean! This implies that a more popular café is also more variable…
stanplot(fit_int)
We should do better: hierarchical models!
Hierarchical models allow us to learn about the population and about each café's estimate simultaneously: each visit to a café informs the model about the characteristics of the others.
Each estimate from the population helps to improve the other estimates through an adaptive prior!
Varying intercept model
The principle of a varying intercept model is to learn information about the population of cafés at the same time as the information about individual estimates. In the following model, this translates into estimating the dispersion (= standard deviation) of the population of cafés, \(\sigma_{\beta_s}\), instead of providing it as in the previous model.
Because this dispersion is now estimated, \(\sigma_{\beta_s}\) is a parameter and needs a prior.
- A likelihood
\[ y_i \sim Poisson(\lambda_i)\\ \lambda_i = e^{\beta_0 + \beta_s} \]
- Priors
\[ \beta_0 \sim Normal(log(??), log(??))\\ \beta_s \sim Normal(0, \sigma_{\beta_s}) \]
- Hyperpriors
\[ \sigma_{\beta_s} \sim Normal^+(log(??), log(??)) \]
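Before fitting, we could simulate from this hierarchical data generating process, just as we did for the non-hierarchical model; a sketch with illustrative hyperprior values (reusing the shops vector defined earlier):
## Simulate from the varying intercept DGP
set.seed(999)
sigma_b_s <- abs(rnorm(1, 0, 0.5)) # draw the population dispersion (half-normal)
b_s <- rnorm(length(unique(shops)), 0, sigma_b_s) # shop deviations share sigma_b_s
b_0 <- rnorm(1, log(10), log(2)) # population mean on the log scale
y_sim <- rpois(length(b_s), exp(b_0 + b_s)) # one observation per shop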
Fit the varying intercept model with brms
The brms (and rstanarm) syntax for hierarchical models is similar to the one in lme4.
## Formula
form_varint <- bf(consumers ~ (1|shops), family = poisson)
## Get priors
get_prior(form_varint, data = D)
## Define priors
priors_varint <- c(prior(normal(0,5), class = Intercept),
prior(normal(0,0.5), class = sd))
## Sample from the posterior
fit_varint <- brm(formula = form_varint, prior = priors_varint, data = D,
chains = 4, iter = 2000,
cores = parallel::detectCores())
fit_varint
Check the sampling
fit_varint
Family: poisson
Links: mu = log
Formula: consumers ~ (1 | shops)
Data: D (Number of observations: 100)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~shops (Number of levels: 24)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 1.38 0.19 1.05 1.78 644 1.01
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 1.20 0.29 0.64 1.75 461 1.01
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_varint)
Posterior predictive checks
pp_check(fit_varint, type = "dens_overlay") +
xlab("Consumers (n)")
Posterior predictive checks
pp_check(fit_varint, type = "violin_grouped", group = "shops") +
theme(axis.text.x = element_text(angle=45))
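To identify the strongest competitors, we can extract the shop-level intercepts, which combine the population intercept and each shop's deviation; a sketch using the coef() method for brmsfit objects:
## Shop-level intercepts (posterior mean and 95% CI, log scale)
shop_effects <- coef(fit_varint)$shops[, , "Intercept"]
head(shop_effects[order(shop_effects[, "Estimate"], decreasing = TRUE), ])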