Dirichlet regression and the 2018 Quebec Elections (with brms)
Lucas Deschamps
28 septembre 2019
Disclaimer : I do not mean to do political analysis, neither to give any opinion. I just made this analysis after the October 2018 provincial election in Quebec, and I decide now to publish it since the R package brms
implements Dirichlet regression, hoping that my work would encourage ecologists to dive into multivariate models. Enjoy :)
Note : I discovered after having writen this post that an article went out in Methods in Ecology and Evolution (Douma and Weedon 2019) about regression over proportions. Read this as a tutorial!
# Empty R session
rm(list = ls())
# Set a seed
set.seed(999)
Context
Electors from the Quebec province (Canada) voted in 2018 October 1^st for the members of the national assembly of Quebec. The Coalition Avenir Quebec won the election with 74 seats and 37.42% of votes. They were followed by three principal political party: the Parti Libéral du Québec (31 seats and 24.82% of votes), the Parti Québécois (10 seats, 17.06% of votes) and Quebec Solidaire (10 seats, 16,10%).
For those elections, the Quebec province was divided into 125 electoral divisions (maps). In each of these divisions, voters had to choose a representant belonging to a political party. The representant with the highest number of votes was elected as a member of the national assembly.
In this post, we will explore the relationship between some socioeconomic characteristics of each electoral division and the results (in %) of the four principal political party cited above.
To replicate the analysis (and do more if you want!), you simply have to download the .zip file in the following repository.
Data
Sources
I compiled data from various sources about the election and the socioeconomical characteristics of each electoral division.
Results of the election can be found on the élections Quebec website. I gathered a compilation in a tidy format done by the modeler of the great electoral blog Qc125. The election results are stored in the file resultats.json
in the zip folder, with the correspondence between geographical and electoral units stored in the MRC.csv
and RA.csv
files.
The explanatory variables (socioeconomic data) have been gathered from the 2016 censoring, downloadable at this link on the élections Quebec website. The data are far from being tidy at all when downloaded at this address. The file recensement2016CEP.xls
in the zip folder contains the original data with a sheet made by copypasting data in a tidy format.
The coordinates of each electoral division, defined as the coordinates of the center of the corresponding Municipalité Régional de Comté, can be downloaded on the Ministère de l’énergie et des ressources naturelles website. The shapefiles are stored in the MRC
subfolder.
Preparation
I have prepared functions to be able to put order and assemble data from these different sources.
The first function, import.result
aims to extract the result of each parti in each region, stored in a Json format, and to join them with geographical informations, such as lattitude and longitude of the electoral division.
import.result < function(){
## Import election results
D < rjson::fromJSON(file = "resultats.json", simplify = T)
## Extract the results per electoral division
Circon < D$circonscriptions
## Prepare a data.frame to store results
result < data.frame(circon = rep(NA, length(Circon)), num_circon = NA,
CAQ = NA, PQ = NA, PLQ = NA, QS = NA)
colnames(result)
for(i in seq(length(Circon))){
## Store the electoral division name
result$circon[i] < Circon[[i]]$nomCirconscription
## Store the electoral division number
result$num_circon < Circon[[i]]$numeroCirconscription
## Extract results for each candidats
candidats < Circon[[i]]$candidats
for(j in seq(length(candidats))){
if(candidats[[j]]$numeroPartiPolitique == 27) result$CAQ[i] < candidats[[j]]$tauxVote
if(candidats[[j]]$numeroPartiPolitique == 8) result$PQ[i] < candidats[[j]]$tauxVote
if(candidats[[j]]$numeroPartiPolitique == 6) result$PLQ[i] < candidats[[j]]$tauxVote
if(candidats[[j]]$numeroPartiPolitique == 40) result$QS[i] < candidats[[j]]$tauxVote
}
}
## Match electoral division and administrative regions
regions < read.csv("RA.csv")
result < left_join(result, regions)
## Match electoral division and MRC divisions
MRC < read.csv("MRC.csv")
colnames(MRC)[2] < "circon"
result < left_join(result, select(MRC, circon, MRC) )
## Get Coordinates of each MRC
shape < readOGR("MRC/", "mrc_point")
D_shape < cbind(shape@data, shape@coords)
colnames(D_shape)[8] < "MRC"
result < left_join(result, select(D_shape, MRC, coords.x1, coords.x2))
return(result)
}
The second function import explanatory variables. Because most of these variables are proportions, defined on (0,1), I decided to scale every variable of other kinds under the same range. To constrain a vector \(x\) between \(a\) and \(b\), we need the following function :
\[ f(x) = \frac{(ba) * (x  min(x))}{max(x)  min(x)} \]
## Function to normalize data between a and b
normalize < function(x, a = 0, b = 1){
top < (ba) * (x  min(x, na.rm = T))
bot < max(x, na.rm = T)  min(x, na.rm = T)
y < top/bot + a
}
## And function to denormalize
denormalize < function(x,minval,maxval) {
x*(maxvalminval) + minval
}
## Import explanatory data
import.X < function(){
X1 < read.csv("explain.csv",sep = ",")
X < X1 %>% mutate(taux_actif = taux_actif/100,
taux_emploi = taux_emploi/100,
n_valeur_logement_mediane = normalize(valeur_logement_mediane),
n_cout_hab_loc_pourc = normalize(cout_hab_loc_pourc))
return(X)
}
n_valeur_logement_mediane
is the median of the estate value in the division and cout_hab_loc_pourc
is the median value of the domiciliary rent, both defined on \((0,+\infty)\).
Model
Beta distibution
Results are proportions, defined on (0,1). A nice way to model proportion is to use a \(Beta\) distribution for likelihood. One can define the Beta distribution in term of mean \(\mu\) (defined in (0,1)) and a positive precision \(\phi\). The following function compute the likelihood of a proportion, \(y\), given a mean and a precision (found in Paul Buerkner’s vignette
about parametrization of distributions, see also Ferrari and CribariNeto (2004)) :
\[ P(y\mu, \phi) = \frac{y^{\mu \phi  1} (1y)^{(1\mu) \phi1}}{B(\mu \phi, (1\mu) \phi)} \] \(B()\) being the Beta function. An intuition emerges if we focus on the numerator. We can see that the likelihood of a probability \(y\) is described as the observation risen at \(\mu\phi1\) multiplied by the complement, \((1y)\) raised to \((1\mu)\phi1\). The first part is a simple exponential function, defined for any value of \(y\) over \((Inf, +Inf)\).
first < function(x, mu, phi) x^(mu*phi1)
mu = 0.7; phi = 10
curve(first(x, mu, phi), from = 0, to = 1, bty = "n")
This allows the distribution to move around the (0,1) interval by skewing more and more while the mean approaches the boundaries.
numerator < function(x, mu, phi) x^(mu*phi1) * (1x)^((1mu)*phi1)
mu = 0.7; phi = 10
curve(numerator(x, mu, phi), from = 0, to = 1, col = "red", bty = "n")
abline(v = mu, lty = 2)
Then, the denominator scales the function to ensure that the area under the curve sum to 1, such as any probability density function shall.
final < function(x, mu, phi){
numerator < numerator(x, mu, phi)
denominator < beta(mu*phi, (1mu)*phi)
result < numerator / denominator
}
mu = 0.7; phi = 10
curve(final(x, mu, phi), col = "blue", from = 0, to = 1, bty = "n")
abline(v = 0.7, lty = 2)
integrate(numerator, lower = 0, upper = 1, mu = 0.7, phi = 10)
## 0.003968254 with absolute error < 4.4e17
integrate(final, lower = 0, upper = 1, mu = 0.7, phi = 10)
## 1 with absolute error < 1.1e14
We can make sense of the mean and dispersion parameters we the following plots. In R
, the beta distribution is not parameterized by \(\mu\) and \(\phi\), but by two shape parameter we will recover easily with the function such as \(a = \mu * \phi\) and \(b = (1\mu)*\phi\) :
par(mfrow = c(1,2), bty = "n")
mu = 0.5; phi = 10; a = mu*phi; b = (1mu)*phi
curve(dbeta(x, a, b), ylim = c(0,4))
mu = 0.25; phi = 10; a = mu*phi; b = (1mu)*phi
curve(dbeta(x, a, b), add = T, col = "blue")
mu = 0.75; phi = 10; a = mu*phi; b = (1mu)*phi
curve(dbeta(x, a, b), add = T, col = "red")
mu = 0.25; phi = 10; a = mu*phi; b = (1mu)*phi
curve(dbeta(x, a, b), ylim = c(0,4))
mu = 0.25; phi = 5; a = mu*phi; b = (1mu)*phi
curve(dbeta(x, a, b), add = T, col = "blue")
mu = 0.25; phi = 2; a = mu*phi; b = (1mu)*phi
curve(dbeta(x, a, b), add = T, col = "red")
par(mfrow = c(1,1))
Dirichlet distribution
But election results are not only proportions, they are compositional. As such, each party result is a proportion of the total number of votes in an electoral division : the sum of the proportions have to sum to 1. This is an interesting mathematical property we should exploit while modeling such data. In that case, the proportion of votes a party received in a division is not only constrained by the predictors, the socioenconomic data, but also by the sum of the proportions of the other parties. This aspect is implemented in the dirichlet distribution, defined as following in a regression framework (see the vignette
or (Maier 2014)) :
\[ P(y_{c}\mu_{c}, \phi) = \frac{1}{B((\mu_{1}, \ldots, \mu_{C}) \phi)} \prod_{d=1}^C y_{}^{\mu_{d} \phi  1} \] Where \(c\) is the component (column of the simplex matrix) and \(B\) the multivariate beta function.
Again, we can make sense of parameters by observing the distribution. In R
, the distribution is parametrized by a vector \(\boldsymbol{\alpha}\), such as \(\boldsymbol{\alpha} = \boldsymbol{\mu}\phi\). Where \(\boldsymbol{\mu}\) is a simplex reparting the strictly positive value \(\phi\), which represents the sum of \(\boldsymbol{\alpha}\) such as \(\phi = \sum_{c = 1}^{C} \alpha_{c}\).
Dirichlet regression
We can now formulate a dirichlet regression with \(\mu\) dependant of a set of covariates :
\[ y_{i,c} \sim Dirichlet(\mu_{i,c}, \phi)\\ \mu_{c,i} = \frac{\exp(\eta_{c,i})}{\sum_{d= 1}^{C} \exp(\eta_{i,d})}\\ \eta_{c,i} = \boldsymbol{\beta_{c}} X_{i} \]
Where the link between \(\mu\) and \(\eta\) is the multivariate binomial function. \(\boldsymbol{\beta_{c}}\) is a vector of slopes multiplying the predictor matrix \(X\). To ensure identifiability, \(\boldsymbol{\eta_{1}}\) is set to 0.
Dirichlet regression with brms
Data preparation
Let’s first import the election results, with geographical information on each electoral division. Because results are expressed as a percentage, we will divide them by 100.
library(tidyverse)
library(rgdal)
library(knitr)
library(kableExtra)
R < import.result() %>%
mutate_at(c("CAQ", "PQ", "PLQ", "QS"), function(x) x/100) %>%
rename(division = circon)
## OGR data source with driver: ESRI Shapefile
## Source: "/home/lucasd/Gdrive/Projects/RIVENumerilab.github.io/projects/LD_Elections/MRC", layer: "mrc_point"
## with 137 features
## It has 15 fields
## Integer64 fields read as strings: MRC_S_ MRC_S_ID F_POLYGONI
R %>% head() %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "300px")
division  num_circon  CAQ  PQ  PLQ  QS  CODE_CIRC  CODE_RA  RA  TRI_CIRC  TRI_RA  MRC  coords.x1  coords.x2 

AbitibiEst  332  0.4272  0.1948  0.1875  0.1566  648  8  AbitibiTémiscamingue  ABITIBIEST  ABITIBITEMISCAMINGUE  La Valléedel’Or  76.75412  47.91909 
AbitibiOuest  332  0.3412  0.3326  0.1131  0.1659  642  8  AbitibiTémiscamingue  ABITIBIOUEST  ABITIBITEMISCAMINGUE  Abitibi  78.00822  48.61578 
Acadie  332  0.1651  0.0900  0.5380  0.1375  338  6  Montréal  ACADIE  MONTREAL  Montréal  73.69991  45.49017 
AnjouLouisRiel  332  0.2891  0.1470  0.3906  0.1453  366  6  Montréal  ANJOULOUISRIEL  MONTREAL  Montréal  73.69991  45.49017 
Argenteuil  332  0.3888  0.2114  0.1741  0.1217  520  15  Laurentides  ARGENTEUIL  LAURENTIDES  Argenteuil  74.45213  45.66901 
Arthabaska  332  0.6184  0.0940  0.1135  0.1258  144  17  CentreduQuébec  ARTHABASKA  CENTREDUQUEBEC  Arthabaska  71.88812  46.01932 
Because we only selected the four major political parties, all the rows but two does not sum to one, as simplex should.
(RSums < R %>% select(CAQ:QS) %>% rowSums())
## [1] 0.9661 0.9528 0.9306 0.9720 0.8960 0.9517 0.9421 0.9341 0.9632 0.9577
## [11] 0.9711 0.9648 0.9601 0.9499 0.9631 0.9426 0.9690 0.9658 0.9627 0.9556
## [21] 0.9790 0.9581 0.9821 0.9580 0.8908 0.9684 0.9374 0.9424 0.9659 0.9053
## [31] 0.9447 0.9687 0.9584 0.9832 0.9435 1.0000 0.9417 0.9538 0.9679 0.9390
## [41] 0.9687 0.9230 0.9667 0.9479 1.0000 0.8829 0.9429 0.9620 0.9479 0.9504
## [51] 0.9788 0.9794 0.9739 0.9837 0.9837 0.9791 0.9174 0.9482 0.9449 0.9583
## [61] 0.9645 0.9358 0.9463 0.9698 0.9679 0.9374 0.9321 0.9470 0.9608 0.9609
## [71] 0.9058 0.9523 0.9709 0.9787 0.9626 0.9588 0.9238 0.9533 0.9662 0.9752
## [81] 0.9322 0.9205 0.9240 0.9591 0.8829 0.9525 0.9564 0.9858 0.9055 0.9387
## [91] 0.9831 0.9893 0.9707 0.9750 0.9619 0.9858 0.9892 0.9278 0.9634 0.9601
## [101] 0.9781 0.9711 0.9690 0.9204 0.9881 0.9657 0.9598 0.9222 0.9541 0.9534
## [111] 0.9701 0.9692 0.9477 0.9626 0.9680 0.9744 0.9652 0.9395 0.9618 0.9282
## [121] 0.9265 0.9674 0.9246 0.9500 0.9547 0.9171
To solve this, we will create a new column representing the cumulative proportion of votes obtained by all the other parties. Note that this is a personal choice and not the only nor the best solution to analyse the dataset (but definitly one of the simplest).
## Create a new column for others votes
R < R %>% mutate(Others = 1  RSums)
R %>% select(CAQ:QS, Others) %>% rowSums()
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
However, two lines were summing to one indicate divisions in which there was no other party than the four major ones or at least no vote for eventual ones. They will cause trouble later in our analysis, because dirichlet regression as defined in brms
in not define for a proportion of zero. Some solutions exist to accomodate the problem :

Using a Zero Inflated Dirichlet distribution. The function
ZIGDM
in themiLineage
package can perform such an analysis;  Using a zero adjusted dirichlet regression (Tsagris and Stewart 2018)
 Remove the lines.
We will choose the third one for sake of simplicity, reducing the number of line from 126 to 124.
## Remove divisions in which there was no vote for other parties
R_rm < R %>% filter(Others != 0)
Then we will now import the socioeconomic variables, and select some :
X_raw < import.X()
X < X_raw %>% select(division,
anglais_pourc, bilingue_pourc, aucune_pourc,
commun_pourc, bicycle_pourc,
migrants_5ans_pourc,
log_av60_pourc,
agri_pourc,
moins_40mille_pourc, X40_100mille_pour,
loc_pourc,
n_valeur_logement_mediane, valeur_logement_mediane,
n_cout_hab_loc_pourc, cout_hab_loc_pourc)

anglais_pourc
,bilingue_pourc
andaucune_pourc
are the proportion of person in a division speaking only English, English and French or none of these languages, respectively; 
commun_pourc
andbicycle_pourc
are the proportion of people going to work by public transportation or bicycle, respectively; 
migrants_5ans_pourc
is the proportion of people who have migrated from another country for more than five years; 
log_av60_pourc
is the proportion of habitation built before 1960; 
agri_pourc
is the percentage of people working in the agricultural sector; 
moins_40mille_pourc
andX40_100mille_pourc
are the proportion of people earning less than 40 000 CAD or between 40 000 and 100 000 CAD, respectively; 
loc_pourc
is the proportion of people renting their housing; 
n_valeur_logement_mediane
andn_cout_hab_loc_pourc
are the median cost of estate and of the housing rent, respectively.
We can now join the explanatory data with the election results. The joining variable will be circon
, the name of the electoral division.
Tot < left_join(R_rm, X)
Tot %>% select(CODE_CIRC, CODE_RA, TRI_CIRC, TRI_RA, MRC) %>%
summary() %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "300px")
division  num_circon 




 coords.x1  coords.x2 
 anglais_pourc  bilingue_pourc  aucune_pourc  commun_pourc  bicycle_pourc  migrants_5ans_pourc  log_av60_pourc  agri_pourc  moins_40mille_pourc  X40_100mille_pour  loc_pourc  n_valeur_logement_mediane  valeur_logement_mediane  n_cout_hab_loc_pourc  cout_hab_loc_pourc  

Length:124  Min. :332  Min. :0.0640  Min. :0.0256  Min. :0.0522  Min. :0.0434  Montréal :27  Min. :79.05  Min. :45.14  Min. :0.01070  Min. :0.001759  Min. :0.00000  Min. :0.000000  Min. :0.0000  Min. :0.00000  Min. :0.00000  Min. :0.0400  Min. :0.00000  Min. :0.1300  Min. :0.2700  Min. :0.1400  Min. :0.0000  Min. :129719  Min. :0.0000  Min. : 499.0  
Class :character  1st Qu.:332  1st Qu.:0.2643  1st Qu.:0.1003  1st Qu.:0.1283  1st Qu.:0.1151  Montérégie :22  1st Qu.:73.70  1st Qu.:45.49  1st Qu.:0.03130  1st Qu.:0.012114  1st Qu.:0.00000  1st Qu.:0.000000  1st Qu.:0.0100  1st Qu.:0.00000  1st Qu.:0.01000  1st Qu.:0.1375  1st Qu.:0.00000  1st Qu.:0.2375  1st Qu.:0.4300  1st Qu.:0.2200  1st Qu.:0.1233  1st Qu.:199814  1st Qu.:0.1985  1st Qu.: 609.8  
Mode :character  Median :332  Median :0.3921  Median :0.1454  Median :0.2061  Median :0.1424  CapitaleNationale :11  Median :73.51  Median :45.62  Median :0.04140  Median :0.045082  Median :0.01000  Median :0.000000  Median :0.0700  Median :0.01000  Median :0.02000  Median :0.2400  Median :0.01000  Median :0.3200  Median :0.4500  Median :0.2850  Median :0.2174  Median :253241  Median :0.4104  Median : 728.0  
NA  Mean :332  Mean :0.3635  Mean :0.1721  Mean :0.2560  Mean :0.1620  Lanaudière : 9  Mean :72.95  Mean :46.35  Mean :0.04644  Mean :0.111924  Mean :0.03113  Mean :0.007984  Mean :0.1271  Mean :0.01387  Mean :0.03565  Mean :0.2435  Mean :0.02145  Mean :0.3063  Mean :0.4454  Mean :0.3658  Mean :0.2443  Mean :268551  Mean :0.3756  Mean : 708.6  
NA  3rd Qu.:332  3rd Qu.:0.4716  3rd Qu.:0.2152  3rd Qu.:0.3351  3rd Qu.:0.1696  Laurentides : 8  3rd Qu.:71.82  3rd Qu.:46.72  3rd Qu.:0.05767  3rd Qu.:0.163318  3rd Qu.:0.05000  3rd Qu.:0.010000  3rd Qu.:0.2100  3rd Qu.:0.01000  3rd Qu.:0.05000  3rd Qu.:0.3125  3rd Qu.:0.03000  3rd Qu.:0.3700  3rd Qu.:0.4700  3rd Qu.:0.4625  3rd Qu.:0.3032  3rd Qu.:302040  3rd Qu.:0.5206  3rd Qu.: 789.5  
NA  Max. :332  Max. :0.6637  Max. :0.6946  Max. :0.7432  Max. :0.5914  ChaudièreAppalaches: 7  Max. :66.71  Max. :58.53  Max. :0.11710  Max. :0.664920  Max. :0.16000  Max. :0.060000  Max. :0.4700  Max. :0.15000  Max. :0.31000  Max. :0.6300  Max. :0.11000  Max. :0.4900  Max. :0.5100  Max. :0.7500  Max. :1.0000  Max. :698017  Max. :1.0000  Max. :1057.0  
NA  NA  NA  NA  NA  NA  (Other) :40  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
To fit a Dirichlet regression with brms
, one has to format response data in a quite unusual way: the function asks for the response variables to be stores as a data frame inside a data frame!
Tot$Y < cbind(Others = Tot$Others, CAQ = Tot$CAQ,
PQ = Tot$PQ, PLQ = Tot$PLQ, QS = Tot$QS)
str(Tot)
## 'data.frame': 124 obs. of 31 variables:
## $ division : chr "AbitibiEst" "AbitibiOuest" "Acadie" "AnjouLouisRiel" ...
## $ num_circon : num 332 332 332 332 332 332 332 332 332 332 ...
## $ CAQ : num 0.427 0.341 0.165 0.289 0.389 ...
## $ PQ : num 0.195 0.333 0.09 0.147 0.211 ...
## $ PLQ : num 0.188 0.113 0.538 0.391 0.174 ...
## $ QS : num 0.157 0.166 0.138 0.145 0.122 ...
## $ CODE_CIRC : int 648 642 338 366 520 144 806 802 218 822 ...
## $ CODE_RA : int 8 8 6 6 15 17 12 12 16 12 ...
## $ RA : Factor w/ 17 levels "AbitibiTémiscamingue",..: 1 1 14 14 10 4 5 5 13 5 ...
## $ TRI_CIRC : Factor w/ 125 levels "ABITIBIEST","ABITIBIOUEST",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ TRI_RA : Factor w/ 17 levels "ABITIBITEMISCAMINGUE",..: 1 1 14 14 10 4 5 5 13 5 ...
## $ MRC : chr "La Valléedel'Or" "Abitibi" "Montréal" "Montréal" ...
## $ coords.x1 : num 76.8 78 73.7 73.7 74.5 ...
## $ coords.x2 : num 47.9 48.6 45.5 45.5 45.7 ...
## $ Others : num 0.0339 0.0472 0.0694 0.028 0.104 ...
## $ anglais_pourc : num 0.0304 0.0111 0.2226 0.0958 0.1335 ...
## $ bilingue_pourc : num 0 0 0.16 0.07 0.01 0 0 0 0 0 ...
## $ aucune_pourc : num 0 0 0.05 0.02 0 0 0 0 0 0 ...
## $ commun_pourc : num 0.01 0.03 0.35 0.34 0.01 0 0 0 0.02 0.01 ...
## $ bicycle_pourc : num 0.01 0.01 0.01 0.01 0 0.02 0 0 0.01 0 ...
## $ migrants_5ans_pourc : num 0.01 0.01 0.14 0.07 0.02 0.01 0.01 0.01 0.01 0.01 ...
## $ log_av60_pourc : num 0.26 0.29 0.31 0.18 0.24 0.24 0.25 0.24 0.32 0.31 ...
## $ agri_pourc : num 0.02 0.06 0 0 0.02 0.06 0.08 0.05 0.02 0.08 ...
## $ moins_40mille_pourc : num 0.32 0.32 0.39 0.34 0.33 0.37 0.27 0.33 0.36 0.29 ...
## $ X40_100mille_pour : num 0.39 0.43 0.44 0.47 0.45 0.47 0.5 0.49 0.45 0.48 ...
## $ loc_pourc : num 0.35 0.26 0.6 0.58 0.23 0.32 0.23 0.28 0.39 0.21 ...
## $ n_valeur_logement_mediane: num 0.124 0.053 0.477 0.475 0.167 ...
## $ valeur_logement_mediane : int 200330 159847 400631 399577 224493 150408 179680 149878 200389 185144 ...
## $ n_cout_hab_loc_pourc : num 0.197 0.122 0.543 0.491 0.297 ...
## $ cout_hab_loc_pourc : int 609 567 802 773 665 565 588 541 649 596 ...
## $ Y : num [1:124, 1:5] 0.0339 0.0472 0.0694 0.028 0.104 ...
## .. attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "Others" "CAQ" "PQ" "PLQ" ...
Model formulation
We can now formulate the posterior stan will explore! We will first begin by the formula of the model.
library(brms)
form < bf(Y ~ anglais_pourc + bilingue_pourc + aucune_pourc +
commun_pourc + bicycle_pourc +
migrants_5ans_pourc +
log_av60_pourc +
agri_pourc +
moins_40mille_pourc + X40_100mille_pour +
loc_pourc + n_valeur_logement_mediane + n_cout_hab_loc_pourc,
family = dirichlet)
Priors
Then, we will explore the parameters implied by our formula, and thus the priors we have to define. We can observe that the Others
column is absent, its parameters being only computed by contrasts from the others.
get_prior(form, data = Tot) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "300px")
prior  class  coef  group  resp  dpar  nlpar  bound 

b  
Intercept  
gamma(0.01, 0.01)  phi  
b  muCAQ  
b  agri_pourc  muCAQ  
b  anglais_pourc  muCAQ  
b  aucune_pourc  muCAQ  
b  bicycle_pourc  muCAQ  
b  bilingue_pourc  muCAQ  
b  commun_pourc  muCAQ  
b  loc_pourc  muCAQ  
b  log_av60_pourc  muCAQ  
b  migrants_5ans_pourc  muCAQ  
b  moins_40mille_pourc  muCAQ  
b  n_cout_hab_loc_pourc  muCAQ  
b  n_valeur_logement_mediane  muCAQ  
b  X40_100mille_pour  muCAQ  
Intercept  muCAQ  
b  muPLQ  
b  agri_pourc  muPLQ  
b  anglais_pourc  muPLQ  
b  aucune_pourc  muPLQ  
b  bicycle_pourc  muPLQ  
b  bilingue_pourc  muPLQ  
b  commun_pourc  muPLQ  
b  loc_pourc  muPLQ  
b  log_av60_pourc  muPLQ  
b  migrants_5ans_pourc  muPLQ  
b  moins_40mille_pourc  muPLQ  
b  n_cout_hab_loc_pourc  muPLQ  
b  n_valeur_logement_mediane  muPLQ  
b  X40_100mille_pour  muPLQ  
Intercept  muPLQ  
b  muPQ  
b  agri_pourc  muPQ  
b  anglais_pourc  muPQ  
b  aucune_pourc  muPQ  
b  bicycle_pourc  muPQ  
b  bilingue_pourc  muPQ  
b  commun_pourc  muPQ  
b  loc_pourc  muPQ  
b  log_av60_pourc  muPQ  
b  migrants_5ans_pourc  muPQ  
b  moins_40mille_pourc  muPQ  
b  n_cout_hab_loc_pourc  muPQ  
b  n_valeur_logement_mediane  muPQ  
b  X40_100mille_pour  muPQ  
Intercept  muPQ  
b  muQS  
b  agri_pourc  muQS  
b  anglais_pourc  muQS  
b  aucune_pourc  muQS  
b  bicycle_pourc  muQS  
b  bilingue_pourc  muQS  
b  commun_pourc  muQS  
b  loc_pourc  muQS  
b  log_av60_pourc  muQS  
b  migrants_5ans_pourc  muQS  
b  moins_40mille_pourc  muQS  
b  n_cout_hab_loc_pourc  muQS  
b  n_valeur_logement_mediane  muQS  
b  X40_100mille_pour  muQS  
Intercept  muQS 
We can see that we have to define priors for Intercepts, slopes, and precision. Generally, we define the priors while thinking about the scale of variation in data. Because both the predictors and the responses are constrained between 0 and 1, we know that an increase of 1 of a predictor value might at the maximum increase the response to 1. This is, of course, a theoretical consideration, because we know that, in Canada, the probability that a party obtain 100% of votes is very low. A normal prior centered on 0 and correctly scaled is reasonable, assuming no particular tendency but a carefully chosen standarddeviation to represent the possible amplitude of variation in the data set.
However, the multivariate logit link function gives us a lot of trouble. I find defining priors for parameters on the logit scale painful and not very intuitive, but in case of a multivariate logit link function, I am completely lost!
The best way to choose the scale of a prior is to perform prior predictive checks, that is, to sample from the prior and generate simulated data to understand the behavior of the model under this provided information.
Prior predictive checks are really easy to do with brms. We will perform it with 4 different scales : 1, 2 and 10.
curve(dnorm(x, 0, 1), from = 10, to = 10)
curve(dnorm(x, 0, 2), from = 10, to = 10, add = T, col = "blue")
curve(dnorm(x, 0, 10), from = 10, to = 10, add = T, col = "red")
We have little information about the dispersion parameter, so we will define a more diffuse prior than brms.
## Prior chosen by brms
curve(dgamma(x, 0.01, rate = 0.01), to = 30, ylim = c(0, 0.040))
## Prior we will define
curve(dstudent_t(x, 3, 0, 10), to = 30, add = T, col = "red")
Prior predictive checks
## Set prior with sd = 1
priors_1 < c(prior(normal(0,1), class = b),
prior(normal(0,1), class = Intercept),
prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_1 < brm(formula = form, prior = priors_1,
data = Tot, sample_prior = "only",
chains = 1, cores = 1)
## Set prior with sd = 2
priors_2 < c(prior(normal(0,2), class = b),
prior(normal(0,1), class = Intercept),
prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_2 < brm(formula = form, prior = priors_2,
data = Tot, sample_prior = "only",
chains = 1, cores = 1)
## Set prior with sd = 2
priors_10 < c(prior(normal(0,10), class = b),
prior(normal(0,1), class = Intercept),
prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_10 < brm(formula = form, prior = priors_10,
data = Tot, sample_prior = "only",
chains = 1, cores = 1)
Now that we have sampled the parameters, we can extract predicted values. Predicted values, in this case, are stored in a three dimensional array defined by [samples, observations, response variables]
pred_1 < predict(ppp_1, summary = F, nsamples = 100)
pred_2 < predict(ppp_2, summary = F, nsamples = 100)
pred_10 < predict(ppp_10, summary = F, nsamples = 100)
dim(pred_1)
## [1] 100 126 5
We will then plot the predicted densities given the three different scales.
plot(density(pred_1[1,,]), col = scales::alpha("blue", 0.2), main = "b ~ Normal(0,1)",
xlim = c(0.1, 1.1))
for(i in 2:100) lines(density(pred_1[i,,]), col = scales::alpha("blue", 0.2))
plot(density(pred_2[1,,]), col = scales::alpha("blue", 0.2), main = "b ~ Normal(0,2)",
xlim = c(0.1, 1.1))
for(i in 2:100) lines(density(pred_2[i,,]), col = scales::alpha("blue", 0.2))