Hierarchical models
Lucas Deschamps
5 April 2019
- Packages
- Introduction
- Number of consumers : a simple mean model
- Varying intercepts
- Goal
- Results
- Results
- A simple intercept model
- Simulating from Data Generating Process
- Simulating from Data Generating Process
- Fit the non-hierarchical intercepts model with brms
- Check the sampling
- Posterior predictive checks
- We should do better!
- Visualizing the uncertainty
- We should do better : hierarchical models!
- Varying intercept model
- Fit the varying intercept model with brms
- Check the sampling
- Posterior predictive checks
- Posterior predictive checks
- Model comparison : shrinkage of parameters
- Model comparison: shrinkage of predictors and predictions
- Model comparison : predictive density
- Effective number of parameters
- Sampling new levels
- Conclusion
- Varying slope model
- Goal
- Results
- Results
- Results
- Varying slope model
- Varying slope model: Hierarchical priors
- Varying slope model: Priors and hyperpriors
- Simulate from the data generating process
- Simulate from the data generating process
- Simulate from the data generating process
- Fit the varying slope model with brms
- Check the sampling
- Posterior predictive checks
- Posterior predictive checks
- Fit a non-hierarchical slope model with brms
- Check the sampling
- Parameter comparison
- Model comparison : predictive density
- Hierarchical model comparison : No AIC!
- Prior predictive checks
- More and more…
- Exercises
Packages
rstan : Hamiltonian Monte-Carlo. Installation guide is here.
rstanarm :
- a very efficient, optimized and innovative modular package.
- Tries to mimic the behavior of classical frequentist packages
- Fast for simple problems : pre-compiled models
brms :
- a very flexible and powerful package (vast set of models implemented, possibility to add custom code, to implement new distributions…)
- slower for simple problems : needs compilation
Introduction
Hierarchical models
The world is structured (populations within species, within regions; cells within organs, within individuals)…
… And we measure different drivers at different levels of organization (biotic interactions regulate population sizes within lakes, but the presence or absence of a species also depends on its ability to get there!)…
… But information flows through the different levels. And we want to exploit it to model the world.
So, let's enter the wonderful world of hierarchical models!
Background
Mr Maxwell, a descendant of the famous physicist, has sold his coffee brand to K***t
He moved to Trois-Rivières, with the strong wish to open a new kind of café
But the competition is tough, and he needs to do a market study…
Number of consumers : a simple mean model
Goal
What is the mean number of consumers in Trois-Rivières' cafés at rush hour?
Mr Maxwell hired professionals to quantify the number of consumers in each café at 8h.
Results
Choosing the likelihood
What do we know about the number of consumers?
- Integers
- Strictly positive
- Variance \(\propto\) mean : if a shop with an average of three consumers starts to receive more than eight consumers some days, both the mean and the variance will increase!
=> Poisson distribution!
Poisson distribution
Poisson distribution is parametrized by a single parameter (representing mean and variance), \(\lambda > 0\).
This allows the formulation of the likelihood function:
\[ y_i \sim Poisson(\lambda)\\ P(y_i | \lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}\\ \]
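As a quick check of this formula, we can compare it with R's built-in dpois() (a minimal sketch with arbitrary values):
## Poisson likelihood of observing y = 4 consumers when lambda = 3
3^4 * exp(-3) / factorial(4) # from the formula above, ~0.168
dpois(4, lambda = 3)         # identical value from the built-in density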
Data Generating Process
- A likelihood
\[ y_i \sim Poisson(\lambda) \]
- Priors incorporating our knowledge about the parameters
\[ \lambda \sim logNormal(\mu = log(??), \sigma = log(??)) \]
Because \(\lambda\) is strictly positive, we have to choose a prior defined only over \((0, +\infty)\). The log-normal distribution is strictly positive. Its \(\mu\) and \(\sigma\) correspond to the mean and standard deviation on the log scale. If we think there are on average 10 consumers, we should use a prior with \(\mu = log(10)\).
Looking at the priors
curve(dlnorm(x, log(10), log(2)), from = 0, to = 20)
Simulating from Data Generating Process
## Load library : tidy df manipulation and plottings
library(tidyverse)
## Create the list of coffee shops
shops <- c("Morgane", "Starbucks", "LTDP", "Frida", "Presse", "TimHortons", "Gourmet",
"Depot", "Brulerie", "Choco-Brico", "Sacristain", "VanHoutte", "Crémière" ,
"Cafeier", "Griffin", "Panetier", "Bucafin", "Urbanithé", "MarcheNotreDame",
"Brulerie", "MoulinsLaFayette", "HistoireSansFin", "Pompon", "Boulangerie",
"Thimothys")
## Set the seed of the random number generator
set.seed(999)
## Sample parameters of the model
N = length(shops) # Number of samples
mu_lambda <- 10 # Expected mean of lambda
sigma_lambda <- 2 # Expected spread of lambda (on the log scale)
lambda <- rlnorm(1, log(mu_lambda), log(sigma_lambda)) # Sample lambda from its prior
y <- rpois(N, lambda) # Sample the observed number of consumers
# Assemble in a data.frame
D <- data.frame(town = "Trois-Rivières", consumers = y, shops = shops)
# Boxplot of the result
D %>% ggplot(aes(x = town, y = consumers)) +
geom_boxplot() +
geom_jitter(aes(color = shops)) +
ylab("Consumers (n)") +
xlab("Town") +
theme_minimal()
Varying intercepts
Goal
Which cafés are potentially the most important competitors?
Mr Maxwell hired professionals to quantify the number of consumers in each café at 8h30, repeatedly.
Results
Results
A simple intercept model
- A likelihood
\[ y_i \sim Poisson(\lambda_i)\\ \]
- An equation and a link function
\[ log(\lambda_i) = \beta_0 + \beta_s \Leftrightarrow \lambda_i = e^{\beta_0 + \beta_s} \]
- Priors
\[ \beta_0 \sim Normal(log(??), log(??))\\ \beta_s \sim Normal(0, log(??)) \]
The link function ensures that \(\lambda\) stays positive. Because the parameters are on the log scale, we no longer need the log-normal distribution, but we still have to express our beliefs about the parameters on the log scale. The global mean, \(\beta_0\), should be around 10 consumers, and individual shops might deviate by roughly 5 consumers. Note, however, that because of the exponential, the shop deviation (\(\beta_s\)) from the population mean (\(\beta_0\)) acts multiplicatively rather than additively!
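A small numeric illustration of this multiplicative behavior (a sketch with made-up values, not part of the model fit):
## On the log scale, a shop deviation of log(2) doubles the expected count
b_0 <- log(10)      # population mean of 10 consumers
b_s <- log(2)       # shop deviation on the log scale
exp(b_0 + b_s)      # 20 consumers : the deviation multiplies the mean by 2
exp(b_0) * exp(b_s) # same result, written as a product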
Simulating from Data Generating Process
## Create a data frame
X_long <- data.frame(shops = rep(shops, each = 4))
## Create a design matrix
library(useful)
X_large <- build.x(~ shops, X_long, contrasts = F)[,-1]
## Sample from the prior
set.seed(999)
b_shops <- rnorm(ncol(X_large), 0, log(5)) # Hierarchical parameters
b_0 <- rnorm(1, log(10), log(2))
## Compute linear predictor
mu <- exp(b_0 + X_large %*% b_shops)
## Sample observations
y <- rpois(nrow(X_large), mu)
D <- data.frame(shops = X_long$shops, consumers = y, town = "Trois-Rivières",
type = "Observed")
D %>% ggplot(aes(x = town, y = consumers)) +
geom_boxplot() +
geom_point(aes(col = shops), position = "jitter") +
ylab("Consumers (n)") +
xlab("Town") +
theme_minimal()
Simulating from Data Generating Process
D %>% mutate(shops = reorder(.$shops, .$consumers, median, na.rm = T)) %>%
ggplot(aes(x = shops, y = consumers, col = shops)) +
geom_boxplot() +
guides(color = F) +
ylab("Consumers (n)") +
xlab("Shops") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45))
Fit the non-hierarchical intercepts model with brms
## Load library : flexible bayesian model fitting
library(brms)
## Formula
form_int <- bf(consumers ~ shops, family = poisson)
## Get priors
get_prior(form_int, data = D)
## Define priors
priors_int <- c(prior(normal(0,5), class = Intercept),
prior(normal(0,5), class = b))
## Sample from the posterior
fit_int <- brm(formula = form_int, prior = priors_int, data = D,
chains = 4, iter = 2000,
cores = parallel::detectCores())
fit_int
Check the sampling
fit_int
Family: poisson
Links: mu = log
Formula: consumers ~ shops
Data: D (Number of observations: 100)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 0.88 0.31 0.26 1.45 414 1.01
shopsBrulerie -2.46 0.81 -4.25 -1.03 1559 1.00
shopsBucafin 1.83 0.33 1.21 2.50 463 1.01
shopsCafeier 1.11 0.36 0.41 1.83 526 1.01
shopsChocoMBrico 0.40 0.40 -0.39 1.20 664 1.01
shopsCrémière -0.76 0.55 -1.92 0.30 1205 1.00
shopsDepot -5.65 2.62 -12.09 -1.92 1618 1.00
shopsFrida -1.01 0.63 -2.38 0.10 1126 1.00
shopsGourmet -1.33 0.68 -2.79 -0.09 1175 1.00
shopsGriffin -0.56 0.52 -1.63 0.42 993 1.00
shopsHistoireSansFin 2.86 0.31 2.28 3.50 432 1.01
shopsLTDP 1.18 0.35 0.51 1.89 529 1.01
shopsMarcheNotreDame 2.10 0.33 1.48 2.78 445 1.01
shopsMorgane 1.01 0.36 0.32 1.73 526 1.01
shopsMoulinsLaFayette 1.97 0.33 1.36 2.64 470 1.01
shopsPanetier -1.79 0.82 -3.61 -0.40 1792 1.00
shopsPompon 0.93 0.37 0.23 1.65 561 1.01
shopsPresse 0.93 0.36 0.23 1.64 564 1.01
shopsSacristain 1.94 0.33 1.33 2.61 468 1.01
shopsStarbucks -5.74 2.79 -12.47 -1.79 1828 1.00
shopsThimothys -1.33 0.71 -2.89 -0.11 1279 1.00
shopsTimHortons 1.74 0.34 1.12 2.43 471 1.01
shopsUrbanithé -0.02 0.44 -0.90 0.85 793 1.00
shopsVanHoutte 1.05 0.36 0.36 1.76 551 1.01
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We can look at the chains and posterior distributions with the simple following command
plot(fit_int)
No divergent transitions nor BFMI warnings : the built-in diagnostics of the sampler did not reveal any problematic behavior (essentially problems with the curvature of the posterior surface). See here.
Rhat : \(\hat R\) indicates the mixing/convergence of the different chains of the MCMC sampler. It should be equal to 1, or at least < 1.02. A mixing problem might arise from a poorly identified model; in that case, we have to rethink the model and/or the priors.
Eff.Sample : the effective Monte-Carlo sample size indicates the efficiency of the posterior sampling. It should be > 400 to trust the mean/median estimates, and possibly higher to get reliable estimates of the 0.025 and 0.975 quantiles (which are not that useful anyway).
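If we want these diagnostics as numbers rather than reading them off the summary, brms exposes them through the bayesplot generics (a sketch; availability may depend on package versions):
## Extract convergence diagnostics from the fitted model
rhat(fit_int)        # Rhat for every parameter, should all be close to 1
neff_ratio(fit_int)  # ratio of effective to total sample size, per parameter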
Posterior predictive checks
PPC are a good way to verify that our model captures the essential features of the data and is not off the mark (over/under-dispersed, unable to capture observed patterns…).
pp_check(fit_int, type = "dens_overlay") +
xlab("Consumers (n)")
We should do better!
Actually, we are repeatedly sampling different cafés from the same population.
The previous model:
- considers each café as independent
- tries to estimate the parameter of each café from only a few data points : enormous standard errors for shops with little available information!
- does not use anything we could learn from the population
It is as if each visit were the first one => no pooling
Visualizing the uncertainty
From the following plot, we can clearly see that at least two cafés have unreliable estimates and that, overall, estimates are quite uncertain for the more popular cafés.
Remember that mean \(\propto\) variance, which implies that a more popular café is also more variable…
stanplot(fit_int)
We should do better : hierarchical models!
Hierarchical models allow us to learn about the population and about each café's estimate simultaneously : each visit to a café informs the model about the characteristics of the others.
Each estimate from the population helps to improve the other estimates through an adaptive prior!
Varying intercept model
The principle of a varying intercept model is to learn about the population of cafés at the same time as we learn about the individual estimates. In the following model, this is done by estimating the dispersion (= standard deviation) of the population of cafés, \(\sigma_{\beta_s}\), instead of providing it as in the previous model.
Because this dispersion is now estimated, \(\sigma_{\beta_s}\) is a parameter and needs a prior.
- A likelihood \[ y_i \sim Poisson(\lambda_i)\\ \lambda_i = e^{\beta_0 + \beta_s}\\ \]
- Priors \[ \beta_0 \sim Normal(log(??), log(??))\\ \beta_s \sim Normal(0, \sigma_{\beta_s}) \]
- Hyperpriors \[ \sigma_{\beta_s} \sim N^+(log(??), log(??)) \]
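Before fitting, we could simulate from this data generating process as well (a minimal sketch, reusing the shops vector and the prior guesses above):
## Simulate one dataset from the varying-intercept data generating process
set.seed(999)
sigma_shops <- abs(rnorm(1, 0, 0.5))                            # draw from a half-normal hyperprior
b_shops_sim <- rnorm(length(unique(shops)), 0, sigma_shops)     # shop deviations from the population
b_0_sim <- rnorm(1, log(10), log(2))                            # global intercept
y_sim <- rpois(length(b_shops_sim), exp(b_0_sim + b_shops_sim)) # one observation per shop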
Fit the varying intercept model with brms
The brms (and rstanarm) syntax for hierarchical models is similar to the one in lme4.
## Formula
form_varint <- bf(consumers ~ (1|shops), family = poisson)
## Get priors
get_prior(form_varint, data = D)
## Define priors
priors_varint <- c(prior(normal(0,5), class = Intercept),
prior(normal(0,0.5), class = sd))
## Sample from the posterior
fit_varint <- brm(formula = form_varint, prior = priors_varint, data = D,
chains = 4, iter = 2000,
cores = parallel::detectCores())
fit_varint
Check the sampling
fit_varint
Family: poisson
Links: mu = log
Formula: consumers ~ (1 | shops)
Data: D (Number of observations: 100)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~shops (Number of levels: 24)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 1.38 0.19 1.05 1.78 644 1.01
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 1.20 0.29 0.64 1.75 461 1.01
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_varint)
Posterior predictive checks
pp_check(fit_varint, type = "dens_overlay") +
xlab("Consumers (n)")
Posterior predictive checks
pp_check(fit_varint, type = "violin_grouped", group = "shops") +
theme(axis.text.x = element_text(angle=45))
Model comparison : shrinkage of parameters
## Extract hierarchical estimates
Ints_ranef <- as.data.frame(ranef(fit_varint, probs = c(0.05, 0.95))$shops) %>%
rename(Estimate = Estimate.Intercept,
cinq = `Q5.Intercept`, qvquinze = `Q95.Intercept`) %>%
mutate(type = "Varying int.\n(hierarchical)", shops = rownames(.))
## Extract classical intercepts
Ints_fix <- as.data.frame(fixef(fit_int, probs = c(0.05, 0.95))) %>%
rename(cinq = `Q5`, qvquinze = `Q95`) %>%
mutate(type = "Fixed int.\n(simple)", shops = Ints_ranef$shops)
bind_rows(Ints_ranef, Ints_fix) %>%
ggplot(aes(x = Estimate, y = shops, col = type, group = type)) +
geom_point() +
geom_errorbarh(aes(xmin = cinq, xmax = qvquinze), height =0, alpha = 0.5) +
ylab("Shops") +
theme_minimal()
We can observe two phenomena in the previous plot :
- There is a global shrinkage. The estimates of every café tend to be closer to zero than in the simple intercept model. The varying intercept model treats each estimate as a particular realization of an underlying process (some cafés are more popular than others). However, what we want is to characterize that underlying process, not to describe exactly the contingencies which led to our particular data set. Therefore, in the absence of strong information supporting a particular estimate, varying intercept models will consider a café as being closer to the population mean than its observed mean (see also the next plot).
- The influence of the population is more important for cafés for which there is less information (Depot and Starbucks). Both the mean estimates and the uncertainty are greatly reduced for these two shops.
In summary : the less information there is about a particular entity, the more we have to rely on information at the population level to estimate it accurately. At the limit, with an infinite number of data points for each café, we would know its mean number of consumers with such certainty that the population estimates would not carry any additional information.
Exercise : run the complete process again, but with a greater number of observations per café (a starting sketch follows below)!
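A minimal sketch of the exercise's starting point (assuming 20 visits per café instead of 4; the rest of the simulation and fitting code stays the same):
## Larger design : more repeated observations per café
X_long_20 <- data.frame(shops = rep(shops, each = 20))
X_large_20 <- build.x(~ shops, X_long_20, contrasts = F)[,-1]
## ...then rerun the prior sampling, the rpois() simulation and both brm() fits on this design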
Model comparison: shrinkage of predictors and predictions
The following plot displays the value of the predictor (the mean of the Poisson distribution, \(\lambda\)) for each café. We can see that the further a café's mean is from the population mean, the more it is shrunk toward it.
## Create new data to predict from
Newdata <- data.frame(shops = rep(shops, each = 1),
Town = "Trois-Rivières")
## Compute predictions of the simple intercepts model
pred_int <- fitted(fit_int, probs = c(0.05, 0.95), newdata = Newdata, robust = T)
Pred_int <- data.frame(type = "Fixed int.\n(simple)",
consumers = pred_int[,1],
cinq = pred_int[,3],
qvquinze = pred_int[,4])
## Assemble prediction with the predictors
D_int <- Newdata %>% bind_cols(Pred_int)
## Compute predictions of the varying intercepts model
pred_varint <- fitted(fit_varint, probs = c(0.05, 0.95), newdata = Newdata, robust = T)
Pred_varint <- data.frame(type = "Varying int.\n(hierarchical)",
consumers = pred_varint[,1],
cinq = pred_varint[,3],
qvquinze = pred_varint[,4])
## Assemble prediction with the predictors
D_varint <- Newdata %>% bind_cols(Pred_varint)
## Extract population mean
Est_b0 <- fixef(fit_varint)
bind_rows(D_int, D_varint) %>%
mutate(shops = reorder(.$shops, .$consumers, median, na.rm = T)) %>%
ggplot(aes(x = shops, y = consumers, col = type)) +
geom_boxplot(alpha = 0.5) +
geom_hline(yintercept = exp(Est_b0[,1]), linetype = "dotted") +
ylab(expression(paste(lambda))) +
xlab("Shops") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45))
Model comparison : predictive density
The loo-IC is a sophisticated information criterion which estimates the effective number of parameters (p_loo), provides built-in diagnostics about the reliability of the inference based on it (and about possible model misspecification), and carries the uncertainty of the fitting process (each loo-IC is approximately normally distributed, with a mean and a standard error).
A lower loo-IC means a better ability of the model to predict future data from the same data generating process. It is computed from a measure of the fit of the model to the observed data (elpd_loo), penalized by the effective number of parameters the model had to estimate (p_loo).
(loo_int <- loo(fit_int))
Computed from 4000 by 100 log-likelihood matrix
Estimate SE
elpd_loo -200.7 11.5
p_loo 21.7 3.0
looic 401.5 23.1
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 91 91.0% 362
(0.5, 0.7] (ok) 8 8.0% 400
(0.7, 1] (bad) 1 1.0% 155
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
(loo_varint <- loo(fit_varint))
Computed from 4000 by 100 log-likelihood matrix
Estimate SE
elpd_loo -201.4 10.7
p_loo 19.9 2.5
looic 402.9 21.3
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 92 92.0% 998
(0.5, 0.7] (ok) 6 6.0% 624
(0.7, 1] (bad) 2 2.0% 151
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
The following command allows us to compare the two models
print(loo_compare(loo_int, loo_varint), simplify = F)
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic
fit_int 0.0 0.0 -200.7 11.5 21.7 3.0 401.5
fit_varint -0.7 2.0 -201.4 10.7 19.9 2.5 402.9
se_looic
fit_int 23.1
fit_varint 21.3
- The two models, the simple and the hierarchical one, have essentially the same predictive ability.
- The fit of the simple model to this particular dataset is slightly better.
- The effective number of parameters of the hierarchical model is lower than that of the simple model (by about 2). How is that possible, when we estimated a population standard deviation in addition to the same number of intercepts?
Remember that for the two café estimates with the least information, the population had a great influence…
Effective number of parameters
In the world of adaptive priors, or more generally of regularizing or informative priors, the number of parameters to estimate is not known in advance.
The relative contribution of the information contained in the prior vs. the information contained in the data influences the effective number of parameters. If a parameter is uniquely defined by its prior, there is nothing left to estimate!
A simple example : even with the following very simple model, the effective number of parameters changes as a function of the “true” mean.
If the mean is close to zero, the prior has a great influence, informing the model that the mean cannot be negative. If the mean is far from zero, the influence of the prior is negligible, given that every value from zero to infinity is equally probable under it.
\[ y_i \sim Normal(\mu, \sigma)\\ \mu \sim U(0, +\infty) \]
## Formula
form <- bf(y ~ 1)
## Low mean data
D_low <- data.frame(y = rnorm(100, mean = 0.1, sd = 1))
## High mean data
D_high <- data.frame(y = rnorm(100, mean = 50, sd = 5))
## Explore priors
get_prior(form, data = D_low)
## Define priors (replacing infinity by an enormous number)
priors <- prior(uniform(0, 1000000000000), class = Intercept)
## Fit the low mean data
fit_low <- brm(formula = form, prior = priors, data = D_low)
## Fit the high mean data
fit_high <- brm(formula = form, prior = priors, data = D_high)
p_loo
fit_low 1.180004
fit_high 2.469511
The effective number of parameters is higher for the model with a high mean than for the model with a low mean!
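For reference, the p_loo values shown above can be extracted from the loo objects directly (a sketch, assuming fit_low and fit_high from the chunk above):
## Compare the effective number of parameters of the two fits
loo_low <- loo(fit_low)
loo_high <- loo(fit_high)
data.frame(p_loo = c(fit_low = loo_low$estimates["p_loo", "Estimate"],
                     fit_high = loo_high$estimates["p_loo", "Estimate"]))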
Sampling new levels
Because the model has learned the distribution of the number of consumers in Trois-Rivières, we can predict potential distributions for newly opened cafés.
## Set the random seed
set.seed(999)
## Create new data for new cafe
Newdata <- data.frame(shops = rep(c("Cora", "Eggsqui",
"AuCafeBienheureux", "IloveCoffee"), each = 50))
## Compute predictions
Pred_new <- predict(fit_varint, newdata = Newdata, allow_new_levels = T,
summary = F, nsamples = 50)
## Plots the density which represents the population and the uncertainty about it
plot(density(Pred_new[,1]), xlim= c(0,60), bty = "n",
col = scales::alpha("black", 0.2))
for(j in 2:ncol(Pred_new)) lines(density(Pred_new[,j]),
col = scales::alpha("black", 0.2))
Conclusion
By shrinking toward the mean, varying intercepts avoid overfitting : estimates are regularized to be closer to the true data generating process, instead of capturing the contingencies of a particular dataset.
Because each estimate informs the other estimates from the same population, the fitting process is more efficient (lower effective number of parameters).
What we learn about the distribution of a population can be reused, and the prediction for a new dataset should incorporate the population variance.
Varying slope model
Goal
Mr Maxwell now wants to know the sales dynamics throughout the day.
To do this, he hires professionals and asks them to visit each café every hour between 8h and 20h, and to note the number of consumers.
Results
Results
Results
Varying slope model
\[ y_i \sim Poisson(\lambda_i)\\ \lambda_i = e^{\beta_0 + \beta_1hour + \beta_{s} + \beta_{1s}hour}\\ \]
Varying slope model: Hierarchical priors
In a varying slope model, we assume that the slopes are potentially correlated with the intercepts. That makes sense : a café that is almost empty on average might show less variation during the day than a more popular café.
However, it might not be the case : a shop might be completely full at some times and completely empty at others, leading to a mean equal to that of the population.
We will let the model determine the correlation between intercepts and slopes by defining the hierarchical prior of these parameters as a multivariate normal with mean 0 and an estimated covariance matrix.
The covariance matrix can be decomposed into a diagonal matrix containing the standard deviation of each parameter and a correlation matrix.
\[ y_i \sim Poisson(\lambda_i)\\ \lambda_i = e^{\beta_0 + \beta_1hour + \beta_{s} + \beta_{1s}hour}\\ \begin{bmatrix} \beta_s\\ \beta_{1s}\\ \end{bmatrix} \sim MVNormal(\begin{bmatrix}0\\0\\\end{bmatrix}, S)\\ \begin{array}{c} S = \begin{bmatrix} \sigma_{\beta_s} & 0 \\ 0 & \sigma_{\beta_{1s}}\\ \end{bmatrix} \rho \begin{bmatrix} \sigma_{\beta_s} & 0\\ 0 & \sigma_{\beta_{1s}}\\ \end{bmatrix} \end{array}\\ \]
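In R, this decomposition is just a matrix product (a small numeric sketch with arbitrary values):
## Rebuild the covariance matrix S from standard deviations and a correlation matrix
sigmas <- c(1, 0.1)                      # sd of the intercepts and of the slopes (illustrative)
Rho <- matrix(c(1, -0.8, -0.8, 1), 2, 2) # correlation matrix (illustrative)
diag(sigmas) %*% Rho %*% diag(sigmas)    # the covariance matrix used in the MVNormal prior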
Varying slope model: Priors and hyperpriors
This decomposition of the covariance matrix allows us to define different priors for the standard deviations and for the correlation matrix.
I find that half-normal distributions, that is, normal distributions truncated at zero, are the best priors for standard deviations.
The prior for the correlation matrix is an LKJ distribution, which can be seen as an extension of the beta distribution over the interval (-1, 1).
\[ y_i \sim Poisson(\lambda_i)\\ \lambda_i = e^{\beta_0 + \beta_1hour + \beta_{s} + \beta_{1s}hour}\\ \begin{bmatrix} \beta_s\\ \beta_{1s}\\ \end{bmatrix} \sim MVNormal(\begin{bmatrix}0\\0\\\end{bmatrix}, S)\\ \begin{array}{c} S = \begin{bmatrix} \sigma_{\beta_s} & 0 \\ 0 & \sigma_{\beta_{1s}}\\ \end{bmatrix} \rho \begin{bmatrix} \sigma_{\beta_s} & 0\\ 0 & \sigma_{\beta_{1s}}\\ \end{bmatrix} \end{array}\\ \rho \sim LKJ(1)\\ \begin{aligned} \sigma_s \sim N^+(0,1)\\ \sigma_{1s} \sim N^+(0,1) \end{aligned} \begin{aligned} \beta_0 \sim N(0,5)\\ \beta_1 \sim N(0,5) \end{aligned} \]
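To get a feeling for the LKJ(1) prior, we can draw correlation matrices from it and look at the implied distribution of the correlation coefficient (a sketch using rethinking::rlkjcorr, which is also used in the simulation below):
## Distribution of the correlation implied by an LKJ(1) prior on a 2x2 matrix
library(rethinking)
rho_draws <- rlkjcorr(n = 1e4, K = 2, eta = 1) # 10000 draws of 2x2 correlation matrices
dens(rho_draws[, 1, 2])                        # uniform over (-1, 1) when eta = 1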
Simulate from the data generating process
## Create a data frame
X_long <- data.frame(shops = rep(shops, each = length(8:20)),
hour = scale(rep(8:20, length(shops))),
hour_unscale = rep(8:20, length(shops)))
## Create a design matrix for intercepts
X_int <- build.x(~ shops, X_long, contrasts = F)[,-1]
## Create a design matrix for slopes
X_slope <- build.x(~ shops:hour, X_long, contrasts = F)[,-1]
## Load libraries : truncated normal and multivariate normal distributions
library(rethinking)
library(truncnorm)
## Set the random seed
set.seed(999)
## Sample from the priors
hyper_sigma <- rtruncnorm(2, a = 0, 1, 1) # Hyperprior for hierarchical standard-deviations (intercepts and slopes)
hyper_rho <- rlkjcorr(n = 1, K = 2, eta = 0.1) # Hyperprior for the correlation matrix
b_0 <- rnorm(1, log(10), 1) # Prior for the global intercept
b_1 <- rnorm(1, 0, 1) # Prior for the global slope
## Hierarchical prior for the shops' intercept and slope
bs_shops <- rmvnorm2(n = ncol(X_int),Mu = c(0,0), Rho = hyper_rho, sigma = hyper_sigma)
b_shops <- bs_shops[,1]
b_1shops <- bs_shops[,2]
## Compute linear predictor
mu <- exp(b_0 + b_1 * X_long$hour + X_int %*% b_shops + X_slope %*% b_1shops)
## Sample observations
y <- rpois(nrow(X_int), mu)
D <- data.frame(shops = X_long$shops, hour = X_long$hour,
hour_unscale = X_long$hour_unscale,
consumers = y,
town = "Trois-Rivières",
type = "Observed")
D %>% ggplot(aes(x = town, y = consumers)) +
geom_boxplot() +
geom_point(aes(col = shops), position = "jitter") +
xlab("Town") +
ylab("Consumers (n)") +
theme_minimal()
Simulate from the data generating process
D %>% mutate(shops = reorder(.$shops, .$consumers, median, na.rm = T)) %>%
ggplot(aes(x = shops, y = consumers, col = shops)) +
geom_boxplot() +
guides(color = F) +
theme_minimal() +
xlab("Shops") +
ylab("Consumers (n)") +
theme(axis.text.x = element_text(angle=45))
Simulate from the data generating process
D %>%
ggplot(aes(x = hour_unscale, y = consumers, col = shops)) +
geom_smooth(se = F, method = "gam") +
xlab("Hour of the day") +
ylab("Consumers (n)") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45))
Fit the varying slope model with brms
## Formula
form_varslope <- bf(consumers ~ hour + (1+hour|shops), family = poisson)
## Get priors
get_prior(form_varslope, data = D)
## Define priors
priors_varslope <- c(prior(normal(0,5), class = Intercept),
prior(normal(0,2), class = sd),
prior(normal(0,5), class = b))
## Sample from the posterior
fit_varslope <- brm(formula = form_varslope, prior = priors_varslope,
data = D, iter = 2000,
chains = 4, cores = 4)
fit_varslope
Check the sampling
fit_varslope
Family: poisson
Links: mu = log
Formula: consumers ~ hour + (1 + hour | shops)
Data: D (Number of observations: 325)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~shops (Number of levels: 24)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 0.37 0.06 0.27 0.50 707 1.01
sd(hour) 0.09 0.02 0.05 0.14 1470 1.00
cor(Intercept,hour) -0.81 0.16 -0.99 -0.41 1991 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 2.45 0.08 2.30 2.60 503 1.01
hour -0.25 0.02 -0.30 -0.20 971 1.01
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_varslope)
Posterior predictive checks
pp_check(fit_varslope, type = "dens_overlay") +
xlab("Consumers(n)")
Posterior predictive checks
pp_check(fit_varslope, type = "scatter_avg_grouped", group = "shops") +
geom_abline(intercept = 0, slope = 1)
Fit a non-hierarchical slope model with brms
## Formula
form_slope <- bf(consumers ~ shops + hour + shops:hour, family = poisson)
## Get priors
get_prior(form_slope, data = D)
## Define priors
priors_slope <- c(prior(normal(0,5), class = Intercept),
prior(normal(0,5), class = b))
## Sample from the posterior
fit_slope <- brm(formula = form_slope, prior = priors_slope,
data = D, iter = 2000,
chains = 4, cores = 4)
fit_slope
Check the sampling
fit_slope
Family: poisson
Links: mu = log
Formula: consumers ~ shops + hour + shops:hour
Data: D (Number of observations: 325)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample
Intercept 2.39 0.09 2.23 2.55 231
shopsBrulerie -0.29 0.11 -0.50 -0.08 348
shopsBucafin -0.28 0.13 -0.54 -0.03 603
shopsCafeier 0.24 0.11 0.02 0.46 442
shopsChocoMBrico 0.28 0.11 0.06 0.51 386
shopsCrémière -0.36 0.13 -0.63 -0.10 545
shopsDepot 0.30 0.11 0.08 0.52 403
shopsFrida -0.68 0.15 -0.97 -0.40 679
shopsGourmet 0.44 0.11 0.22 0.66 375
shopsGriffin 0.28 0.11 0.06 0.50 456
shopsHistoireSansFin 0.52 0.11 0.31 0.73 350
shopsLTDP -0.29 0.13 -0.56 -0.03 462
shopsMarcheNotreDame 0.13 0.11 -0.09 0.36 446
shopsMorgane 0.27 0.11 0.05 0.50 506
shopsMoulinsLaFayette 0.64 0.11 0.43 0.84 397
shopsPanetier -0.55 0.14 -0.83 -0.28 677
shopsPompon 0.27 0.11 0.04 0.48 406
shopsPresse -0.49 0.14 -0.76 -0.22 640
shopsSacristain 0.20 0.12 -0.02 0.44 437
shopsStarbucks -0.19 0.13 -0.44 0.06 531
shopsThimothys 0.37 0.11 0.15 0.58 373
shopsTimHortons 0.22 0.11 -0.01 0.44 432
shopsUrbanithé 0.29 0.11 0.07 0.50 419
shopsVanHoutte -0.15 0.12 -0.39 0.09 553
hour -0.12 0.08 -0.28 0.04 228
shopsBrulerie:hour 0.01 0.11 -0.20 0.21 370
shopsBucafin:hour -0.04 0.13 -0.30 0.21 576
shopsCafeier:hour -0.16 0.11 -0.38 0.06 427
shopsChocoMBrico:hour -0.21 0.11 -0.43 0.00 349
shopsCrémière:hour -0.06 0.13 -0.32 0.20 581
shopsDepot:hour -0.13 0.11 -0.35 0.08 378
shopsFrida:hour -0.05 0.14 -0.33 0.24 665
shopsGourmet:hour -0.32 0.11 -0.54 -0.11 393
shopsGriffin:hour -0.18 0.11 -0.39 0.04 439
shopsHistoireSansFin:hour -0.10 0.11 -0.31 0.11 339
shopsLTDP:hour -0.00 0.13 -0.26 0.25 474
shopsMarcheNotreDame:hour -0.03 0.12 -0.26 0.19 402
shopsMorgane:hour -0.29 0.11 -0.50 -0.07 450
shopsMoulinsLaFayette:hour -0.27 0.10 -0.48 -0.08 373
shopsPanetier:hour 0.10 0.14 -0.17 0.38 587
shopsPompon:hour -0.25 0.11 -0.47 -0.04 405
shopsPresse:hour -0.15 0.13 -0.42 0.11 631
shopsSacristain:hour -0.18 0.11 -0.40 0.04 391
shopsStarbucks:hour -0.09 0.12 -0.33 0.16 511
shopsThimothys:hour -0.04 0.11 -0.26 0.16 358
shopsTimHortons:hour -0.25 0.11 -0.47 -0.03 404
shopsUrbanithé:hour -0.19 0.11 -0.41 0.02 355
shopsVanHoutte:hour -0.17 0.12 -0.41 0.07 416
Rhat
Intercept 1.03
shopsBrulerie 1.02
shopsBucafin 1.01
shopsCafeier 1.02
shopsChocoMBrico 1.02
shopsCrémière 1.02
shopsDepot 1.01
shopsFrida 1.01
shopsGourmet 1.02
shopsGriffin 1.02
shopsHistoireSansFin 1.02
shopsLTDP 1.01
shopsMarcheNotreDame 1.02
shopsMorgane 1.01
shopsMoulinsLaFayette 1.02
shopsPanetier 1.01
shopsPompon 1.02
shopsPresse 1.01
shopsSacristain 1.01
shopsStarbucks 1.01
shopsThimothys 1.02
shopsTimHortons 1.02
shopsUrbanithé 1.02
shopsVanHoutte 1.01
hour 1.01
shopsBrulerie:hour 1.01
shopsBucafin:hour 1.01
shopsCafeier:hour 1.01
shopsChocoMBrico:hour 1.01
shopsCrémière:hour 1.00
shopsDepot:hour 1.01
shopsFrida:hour 1.00
shopsGourmet:hour 1.01
shopsGriffin:hour 1.01
shopsHistoireSansFin:hour 1.01
shopsLTDP:hour 1.01
shopsMarcheNotreDame:hour 1.01
shopsMorgane:hour 1.01
shopsMoulinsLaFayette:hour 1.01
shopsPanetier:hour 1.01
shopsPompon:hour 1.01
shopsPresse:hour 1.01
shopsSacristain:hour 1.01
shopsStarbucks:hour 1.01
shopsThimothys:hour 1.01
shopsTimHortons:hour 1.01
shopsUrbanithé:hour 1.01
shopsVanHoutte:hour 1.01
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_slope)
Parameter comparison
# Extract classical model coefficients
## Extract mean intercepts
b_shops_fix <- fixef(fit_slope, probs = c(0.05, 0.95))[1:24,1]
## Extract mean slopes
b_1shops_fix <- fixef(fit_slope, probs = c(0.05, 0.95))[25:48,1]
## Assemble in a data frame
b_fix <- data.frame(type = "Fixed slopes\n(simple)", b_shops = b_shops_fix,
b_1shops = b_1shops_fix)
# Extract hierarchical model coefficient
## Extract mean intercepts
b_shops_var <- ranef(fit_varslope, probs = c(0.05, 0.95))$shops[,,1][,1]
## Extract mean slopes
b_1shops_var <- ranef(fit_varslope, probs = c(0.05, 0.95))$shops[,,2][,1]
## Assemble in a data frame
b_var <- data.frame(type = "Var. slope (hierarchical)", b_shops = b_shops_var,
b_1shops = b_1shops_var)
bind_rows(b_fix, b_var) %>%
ggplot(aes(x = b_shops, y = b_1shops)) +
geom_point(aes(col = type), alpha = 0.5) +
theme_minimal()
Model comparison : predictive density
(loo_slope <- loo(fit_slope, cores = 2))
Computed from 4000 by 325 log-likelihood matrix
Estimate SE
elpd_loo -864.1 11.2
p_loo 37.9 2.7
looic 1728.2 22.4
------
Monte Carlo SE of elpd_loo is 0.1.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 319 98.2% 993
(0.5, 0.7] (ok) 6 1.8% 793
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
(loo_varslope <- loo(fit_varslope, cores = 2))
Computed from 4000 by 325 log-likelihood matrix
Estimate SE
elpd_loo -854.3 11.2
p_loo 25.2 2.0
looic 1708.6 22.5
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
print(loo_compare(loo_slope, loo_varslope), simplify = F)
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic
fit_varslope 0.0 0.0 -854.3 11.2 25.2 2.0 1708.6
fit_slope -9.8 3.4 -864.1 11.2 37.9 2.7 1728.2
se_looic
fit_varslope 22.5
fit_slope 22.4
The varying slope model has a possibly better predictive ability, even if that difference is uncertain (elpd_diff), but it also has a far lower effective number of parameters (p_loo)!
Hierarchical model comparison : No AIC!
\[ AIC = 2k - 2ln(\hat L) \]
In the AIC formula, k represents the number of parameters and is fixed (unlike the effective number of parameters)! \(\hat L\) is the maximized likelihood of the model.
Let's observe what happens if we compare a simple slopes model to a varying slopes model fitted by maximum likelihood and compared with AIC (the assumed fitting calls are sketched just below)…
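The maximum likelihood fits used below are not shown in this document; they are assumed to come from calls of roughly this form (a sketch using glm and lme4::glmer):
## Assumed maximum likelihood fits of the two competing models (not shown above)
library(lme4)
glm_slope <- glm(consumers ~ shops + hour + shops:hour, family = poisson, data = D)
glmer_varslope <- glmer(consumers ~ hour + (1 + hour | shops), family = poisson, data = D)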
## Measure of model fit
logLik(glm_slope)
'log Lik.' -818.1366 (df=48)
logLik(glmer_varslope)
'log Lik.' -878.723 (df=5)
## AIC
AIC(glm_slope, glmer_varslope)
df AIC
glm_slope 48 1732.273
glmer_varslope 5 1767.446
Using AIC, we would conclude that the fixed slope model is better, while the loo-IC indicates that the varying slope model might be the better one!
Prior predictive checks
Prior predictive checks
It is possible to verify that the priors are appropriate by computing predictions with parameters sampled from the priors only. We will see that our previous priors were terrible, producing absolutely ridiculous values. We should think more about them!
## Sample from the priors
Prior_fit <- update(object = fit_varslope, recompile = F,
iter = 500, cores = 1, chains = 1,
sample_prior = "only")
SAMPLING FOR MODEL '4728eadc95784e0bef60aa65615bd0ee' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 6.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.63 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 500 [ 0%] (Warmup)
Chain 1: Iteration: 50 / 500 [ 10%] (Warmup)
Chain 1: Iteration: 100 / 500 [ 20%] (Warmup)
Chain 1: Iteration: 150 / 500 [ 30%] (Warmup)
Chain 1: Iteration: 200 / 500 [ 40%] (Warmup)
Chain 1: Iteration: 250 / 500 [ 50%] (Warmup)
Chain 1: Iteration: 251 / 500 [ 50%] (Sampling)
Chain 1: Iteration: 300 / 500 [ 60%] (Sampling)
Chain 1: Iteration: 350 / 500 [ 70%] (Sampling)
Chain 1: Iteration: 400 / 500 [ 80%] (Sampling)
Chain 1: Iteration: 450 / 500 [ 90%] (Sampling)
Chain 1: Iteration: 500 / 500 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.263493 seconds (Warm-up)
Chain 1: 0.168723 seconds (Sampling)
Chain 1: 0.432216 seconds (Total)
Chain 1:
## Compute predictions
Prior_pred <- predict(Prior_fit, summary = F)
## Plot the predictions
plot(density(Prior_pred[complete.cases(Prior_pred[,1]),1]))
for(j in 2:50) lines(density(Prior_pred[complete.cases(Prior_pred[,j]),j]),
col = scales::alpha(1,0.2))
More and more…
More and more…
We have just entered the world of hierarchical models… but the concept of an adaptive prior can be extended indefinitely. Among other examples:
- We can make the slope of a category depend on other predictors
- “Continuous categories” and Gaussian processes (phylogeny, space, time, smoothing)
- Shrinkage priors (horseshoe, Finnish horseshoe…)
Exercises
Exercises
- Could you add a level to the varying intercept model describing the sampling in different towns? (a hint follows below)
- Could you make the slopes between the hour of the day and the number of consumers dependent on the shop area?
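As a hint for the first exercise, adding a town level mostly means extending the grouping structure of the formula (a sketch, assuming a data frame with a town column containing several towns):
## Varying intercepts for towns, and for shops nested within towns (hypothetical multi-town data)
form_towns <- bf(consumers ~ (1 | town) + (1 | town:shops), family = poisson)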