Disclaimer: I do not mean to do a political analysis, nor to give any opinion. I simply made this analysis after the October 2018 provincial election in Quebec, and I decided to publish it now that the R package brms implements Dirichlet regression, hoping that my work will encourage ecologists to dive into multivariate models. Enjoy :)

Note: I discovered after having written this post that an article came out in Methods in Ecology and Evolution (Douma and Weedon 2019) about regression on proportions. Read it as a tutorial!

# Empty R session
rm(list = ls())
# Set a seed
set.seed(999)

Context

Electors from the province of Quebec (Canada) voted on October 1st, 2018 for the members of the National Assembly of Quebec. The Coalition Avenir Québec won the election with 74 seats and 37.42% of the votes. It was followed by three principal political parties: the Parti Libéral du Québec (31 seats and 24.82% of the votes), the Parti Québécois (10 seats, 17.06% of the votes) and Québec Solidaire (10 seats, 16.10%).

For this election, the Quebec province was divided into 125 electoral divisions (maps). In each of these divisions, voters had to choose a representative belonging to a political party. The representative with the highest number of votes was elected as a member of the National Assembly.

In this post, we will explore the relationship between some socio-economic characteristics of each electoral division and the results (in %) of the four principal political parties cited above.

To replicate the analysis (and do more if you want!), you simply have to download the .zip file in the following repository.

Data

Sources

I compiled data from various sources about the election and the socio-economic characteristics of each electoral division.

Results of the election can be found on the Élections Québec website. I retrieved a tidy-format compilation made by the modeler of the great electoral blog Qc125. The election results are stored in the file resultats.json in the zip folder, with the correspondence between geographical and electoral units stored in the MRC.csv and RA.csv files.

The explanatory variables (socio-economic data) were gathered from the 2016 census, downloadable at this link on the Élections Québec website. The data are far from tidy when downloaded from this address. The file recensement-2016-CEP.xls in the zip folder contains the original data, with an additional sheet made by copy-pasting the data into a tidy format.

The coordinates of each electoral division, defined as the coordinates of the centre of the corresponding Municipalité régionale de comté (MRC), can be downloaded from the Ministère de l'Énergie et des Ressources naturelles website. The shapefiles are stored in the MRC subfolder.

Preparation

I prepared functions to tidy and assemble the data from these different sources.

The first function, import.result, extracts the results of each party in each division, stored in JSON format, and joins them with geographical information such as the latitude and longitude of the electoral division.

import.result <- function(){
  ## Import election results
  D <- rjson::fromJSON(file = "resultats.json",  simplify = T)
  
  ## Extract the results per electoral division
  Circon <- D$circonscriptions
  
  ## Prepare a data.frame to store results
  result <- data.frame(circon = rep(NA, length(Circon)), num_circon = NA,
                       CAQ = NA, PQ = NA, PLQ = NA, QS = NA)
  
  for(i in seq(length(Circon))){
    ## Store the electoral division name
    result$circon[i] <- Circon[[i]]$nomCirconscription
    ## Store the electoral division number
    result$num_circon[i] <- Circon[[i]]$numeroCirconscription
    ## Extract results for each candidats
    candidats <- Circon[[i]]$candidats
    for(j in seq(length(candidats))){
      if(candidats[[j]]$numeroPartiPolitique == 27) result$CAQ[i] <- candidats[[j]]$tauxVote
      if(candidats[[j]]$numeroPartiPolitique == 8) result$PQ[i] <- candidats[[j]]$tauxVote
      if(candidats[[j]]$numeroPartiPolitique == 6) result$PLQ[i] <- candidats[[j]]$tauxVote
      if(candidats[[j]]$numeroPartiPolitique == 40) result$QS[i] <- candidats[[j]]$tauxVote
    }
  }
  
  ## Match electoral division and administrative regions
  regions <- read.csv("RA.csv")
  result <- left_join(result, regions)
  
  ## Match electoral division and MRC divisions
  MRC <- read.csv("MRC.csv")
  colnames(MRC)[2] <- "circon"
  result <- left_join(result, select(MRC, circon, MRC) )
  
  ## Get Coordinates of each MRC
  shape <- readOGR("MRC/", "mrc_point")
  D_shape <- cbind(shape@data, shape@coords)
  colnames(D_shape)[8] <- "MRC"
  
  result <- left_join(result, select(D_shape, MRC, coords.x1, coords.x2))

  return(result)
}

The second function imports the explanatory variables. Because most of these variables are proportions, defined on (0,1), I decided to rescale the variables of other kinds to the same range. To constrain a vector \(x\) between \(a\) and \(b\), we need the following function:

\[ f(x) = a + \frac{(b-a) (x - \min(x))}{\max(x) - \min(x)} \]

## Function to normalize data between a and b
normalize <- function(x, a = 0, b = 1){
  top <- (b-a) * (x - min(x, na.rm = T))
  bot <- max(x, na.rm = T) - min(x, na.rm = T)
  top/bot + a
}

## And function to denormalize
denormalize <- function(x,minval,maxval) {
    x*(maxval-minval) + minval
}

## Import explanatory data
import.X <- function(){
  X1 <- read.csv("explain.csv",sep = ",")

  X <- X1 %>% mutate(taux_actif = taux_actif/100,
                     taux_emploi = taux_emploi/100,
                     n_valeur_logement_mediane = normalize(valeur_logement_mediane),
                     n_cout_hab_loc_pourc = normalize(cout_hab_loc_pourc))
  return(X)
}

valeur_logement_mediane is the median property value in the division and cout_hab_loc_pourc is the median rent; both are defined on \((0,+\infty)\), hence the normalization into n_valeur_logement_mediane and n_cout_hab_loc_pourc.
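As a quick sanity check of the normalize() and denormalize() pair (a small example I add here; the vector v below is made up for illustration), rescaling a vector to (0,1) and then denormalizing it with its original minimum and maximum should give back the original values:

## Hypothetical vector, not part of the election data
v <- c(500, 610, 728, 1057)

## Rescale to (0,1), then map back to the original scale
v_scaled <- normalize(v, a = 0, b = 1)
v_back <- denormalize(v_scaled, minval = min(v), maxval = max(v))

round(v_scaled, 3)
all.equal(v, v_back) ## TRUE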

Model

Beta distribution

Results are proportions, defined on (0,1). A nice way to model proportions is to use a \(Beta\) distribution for the likelihood. One can define the Beta distribution in terms of a mean \(\mu\) (defined on (0,1)) and a positive precision \(\phi\). The following function computes the likelihood of a proportion \(y\), given a mean and a precision (found in Paul Buerkner's vignette about the parametrization of distributions, see also Ferrari and Cribari-Neto (2004)):

\[ P(y|\mu, \phi) = \frac{y^{\mu \phi - 1} (1-y)^{(1-\mu) \phi-1}}{B(\mu \phi, (1-\mu) \phi)} \] \(B()\) being the Beta function. An intuition emerges if we focus on the numerator. The likelihood of a proportion \(y\) is the observation raised to the power \(\mu\phi-1\), multiplied by its complement \((1-y)\) raised to the power \((1-\mu)\phi-1\). The first part, on its own, is a simple power function of \(y\):

first <- function(x, mu, phi) x^(mu*phi-1)
mu = 0.7; phi = 10
curve(first(x, mu, phi), from = 0, to = 1, bty = "n")

Multiplying it by the second term allows the distribution to move around the (0,1) interval, skewing more and more as the mean approaches the boundaries.

numerator <- function(x, mu, phi) x^(mu*phi-1) * (1-x)^((1-mu)*phi-1)
mu = 0.7; phi = 10
curve(numerator(x, mu, phi), from = 0, to = 1, col = "red", bty = "n")
abline(v = mu, lty = 2)

Then, the denominator scales the function so that the area under the curve sums to 1, as any probability density function must.

final <- function(x, mu, phi){
  numerator <- numerator(x, mu, phi)
  denominator <- beta(mu*phi, (1-mu)*phi)
  numerator / denominator
}
mu = 0.7; phi = 10
curve(final(x, mu, phi), col = "blue", from = 0, to = 1, bty = "n")
abline(v = 0.7, lty = 2)

integrate(numerator, lower = 0, upper = 1, mu = 0.7, phi = 10)
## 0.003968254 with absolute error < 4.4e-17
integrate(final, lower = 0, upper = 1, mu = 0.7, phi = 10)
## 1 with absolute error < 1.1e-14

We can make sense of the mean and precision parameters with the following plots. In R, the Beta distribution is not parameterized by \(\mu\) and \(\phi\), but by two shape parameters that we can easily recover as \(a = \mu \phi\) and \(b = (1-\mu)\phi\):

par(mfrow = c(1,2), bty = "n")
mu = 0.5; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), ylim = c(0,4))
mu = 0.25; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "blue")
mu = 0.75; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "red")

mu = 0.25; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), ylim = c(0,4))
mu = 0.25; phi = 5; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "blue")
mu = 0.25; phi = 2; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "red")

par(mfrow = c(1,1))

Dirichlet distribution

But election results are not only proportions, they are compositional. Each party's result is a proportion of the total number of votes in an electoral division: the proportions have to sum to 1. This is an interesting mathematical property we should exploit while modeling such data. In that case, the proportion of votes a party received in a division is not only constrained by the predictors, the socio-economic data, but also by the sum of the proportions of the other parties. This aspect is implemented in the Dirichlet distribution, defined as follows in a regression framework (see the vignette or Maier (2014)):

\[ P(\boldsymbol{y}|\boldsymbol{\mu}, \phi) = \frac{1}{B((\mu_{1}, \ldots, \mu_{C}) \phi)} \prod_{d=1}^C y_{d}^{\mu_{d} \phi - 1} \] Where \(d\) indexes the components (the columns of the simplex matrix), \(\boldsymbol{\mu} = (\mu_{1}, \ldots, \mu_{C})\) and \(B\) is the multivariate Beta function.
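As a quick check (a small helper I add here; ddirichlet_mu_phi is hypothetical and not part of the analysis), we can code this density in the \((\boldsymbol{\mu}, \phi)\) parametrization and verify that, with only two components, it reduces to the Beta density used above:

## Hypothetical helper: Dirichlet density in the (mu, phi) parametrization
ddirichlet_mu_phi <- function(y, mu, phi){
  alpha <- mu * phi
  ## multivariate Beta function: prod(gamma(alpha)) / gamma(sum(alpha))
  logB <- sum(lgamma(alpha)) - lgamma(sum(alpha))
  exp(sum((alpha - 1) * log(y)) - logB)
}

## With C = 2 components, we recover the Beta density from the previous section
mu <- 0.7; phi <- 10; y <- 0.6
ddirichlet_mu_phi(c(y, 1 - y), c(mu, 1 - mu), phi)
dbeta(y, mu * phi, (1 - mu) * phi) ## same value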

Again, we can make sense of the parameters by observing the distribution. In R, the distribution is parametrized by a vector \(\boldsymbol{\alpha}\), such that \(\boldsymbol{\alpha} = \boldsymbol{\mu}\phi\), where \(\boldsymbol{\mu}\) is a simplex distributing the strictly positive value \(\phi\) among the components, and \(\phi\) represents the sum of \(\boldsymbol{\alpha}\), i.e. \(\phi = \sum_{c = 1}^{C} \alpha_{c}\).
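To visualize the role of \(\phi\) (again a small illustrative sketch I add here; the sampler below uses the standard Gamma construction and is not part of the original analysis), we can draw Dirichlet samples for a fixed mean simplex \(\boldsymbol{\mu}\): the larger \(\phi\), the more the samples concentrate around \(\boldsymbol{\mu}\).

## Draw n samples from Dirichlet(mu * phi) via independent Gamma draws
rdirichlet_mu_phi <- function(n, mu, phi){
  g <- sapply(mu * phi, function(a) rgamma(n, shape = a, rate = 1))
  g / rowSums(g)
}

mu <- c(0.5, 0.3, 0.2) ## a simplex: sums to 1

samp_small_phi <- rdirichlet_mu_phi(1000, mu, phi = 5)
samp_large_phi <- rdirichlet_mu_phi(1000, mu, phi = 100)

colMeans(samp_small_phi)     ## both close to mu...
colMeans(samp_large_phi)
apply(samp_small_phi, 2, sd) ## ...but far less spread with a large phi
apply(samp_large_phi, 2, sd)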

Dirichlet regression

We can now formulate a Dirichlet regression with \(\boldsymbol{\mu}\) depending on a set of covariates:

\[ \boldsymbol{y}_{i} \sim Dirichlet(\boldsymbol{\mu}_{i}, \phi)\\ \mu_{i,c} = \frac{\exp(\eta_{i,c})}{\sum_{d= 1}^{C} \exp(\eta_{i,d})}\\ \eta_{i,c} = \boldsymbol{\beta_{c}} X_{i} \]

Where the link between \(\boldsymbol{\mu}\) and \(\boldsymbol{\eta}\) is the multivariate (multinomial) logit, i.e. \(\boldsymbol{\mu}\) is the softmax of \(\boldsymbol{\eta}\). \(\boldsymbol{\beta_{c}}\) is a vector of slopes multiplying the predictor matrix \(X\). To ensure identifiability, \(\boldsymbol{\eta_{1}}\) is set to 0.
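As a small numeric sketch (the \(\eta\) values below are invented for illustration), the inverse of this link is just a softmax: exponentiate each linear predictor, including the reference category fixed at 0, and divide by their sum to obtain the simplex \(\boldsymbol{\mu}\).

## Hypothetical linear predictors for one division and C = 5 categories;
## the first (reference) category is fixed at 0 for identifiability
eta <- c(Others = 0, CAQ = 1.8, PQ = 1.0, PLQ = 1.3, QS = 0.9)

## Softmax: inverse of the multivariate logit link
mu <- exp(eta) / sum(exp(eta))
round(mu, 3)
sum(mu) ## 1, as required for a simplex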

Dirichlet regression with brms

Data preparation

Let’s first import the election results, with geographical information on each electoral division. Because results are expressed as a percentage, we will divide them by 100.

library(tidyverse)
library(rgdal)
library(knitr)
library(kableExtra)
R <- import.result() %>% 
  mutate_at(c("CAQ", "PQ", "PLQ", "QS"), function(x) x/100) %>% 
  rename(division = circon)
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/lucasd/Gdrive/Projects/RIVE-Numeri-lab.github.io/projects/LD_Elections/MRC", layer: "mrc_point"
## with 137 features
## It has 15 fields
## Integer64 fields read as strings:  MRC_S_ MRC_S_ID F_POLYGONI
R %>% head() %>%  
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")
division num_circon CAQ PQ PLQ QS CODE_CIRC CODE_RA RA TRI_CIRC TRI_RA MRC coords.x1 coords.x2
Abitibi-Est 332 0.4272 0.1948 0.1875 0.1566 648 8 Abitibi-Témiscamingue ABITIBIEST ABITIBITEMISCAMINGUE La Vallée-de-l’Or -76.75412 47.91909
Abitibi-Ouest 332 0.3412 0.3326 0.1131 0.1659 642 8 Abitibi-Témiscamingue ABITIBIOUEST ABITIBITEMISCAMINGUE Abitibi -78.00822 48.61578
Acadie 332 0.1651 0.0900 0.5380 0.1375 338 6 Montréal ACADIE MONTREAL Montréal -73.69991 45.49017
Anjou-Louis-Riel 332 0.2891 0.1470 0.3906 0.1453 366 6 Montréal ANJOULOUISRIEL MONTREAL Montréal -73.69991 45.49017
Argenteuil 332 0.3888 0.2114 0.1741 0.1217 520 15 Laurentides ARGENTEUIL LAURENTIDES Argenteuil -74.45213 45.66901
Arthabaska 332 0.6184 0.0940 0.1135 0.1258 144 17 Centre-du-Québec ARTHABASKA CENTREDUQUEBEC Arthabaska -71.88812 46.01932

Because we only selected the four major political parties, all rows but two do not sum to one, as a simplex should.

(RSums <- R %>% select(CAQ:QS) %>% rowSums())
##   [1] 0.9661 0.9528 0.9306 0.9720 0.8960 0.9517 0.9421 0.9341 0.9632 0.9577
##  [11] 0.9711 0.9648 0.9601 0.9499 0.9631 0.9426 0.9690 0.9658 0.9627 0.9556
##  [21] 0.9790 0.9581 0.9821 0.9580 0.8908 0.9684 0.9374 0.9424 0.9659 0.9053
##  [31] 0.9447 0.9687 0.9584 0.9832 0.9435 1.0000 0.9417 0.9538 0.9679 0.9390
##  [41] 0.9687 0.9230 0.9667 0.9479 1.0000 0.8829 0.9429 0.9620 0.9479 0.9504
##  [51] 0.9788 0.9794 0.9739 0.9837 0.9837 0.9791 0.9174 0.9482 0.9449 0.9583
##  [61] 0.9645 0.9358 0.9463 0.9698 0.9679 0.9374 0.9321 0.9470 0.9608 0.9609
##  [71] 0.9058 0.9523 0.9709 0.9787 0.9626 0.9588 0.9238 0.9533 0.9662 0.9752
##  [81] 0.9322 0.9205 0.9240 0.9591 0.8829 0.9525 0.9564 0.9858 0.9055 0.9387
##  [91] 0.9831 0.9893 0.9707 0.9750 0.9619 0.9858 0.9892 0.9278 0.9634 0.9601
## [101] 0.9781 0.9711 0.9690 0.9204 0.9881 0.9657 0.9598 0.9222 0.9541 0.9534
## [111] 0.9701 0.9692 0.9477 0.9626 0.9680 0.9744 0.9652 0.9395 0.9618 0.9282
## [121] 0.9265 0.9674 0.9246 0.9500 0.9547 0.9171

To solve this, we will create a new column representing the cumulative proportion of votes obtained by all the other parties. Note that this is a personal choice, neither the only nor the best solution to analyse this dataset (but definitely one of the simplest).

## Create a new column for others votes
R <- R %>% mutate(Others = 1 - RSums)
R %>% select(CAQ:QS, Others) %>% rowSums()
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

However, the two rows that already summed to one indicate divisions in which no party other than the four major ones ran, or at least received no vote. They will cause trouble later in our analysis, because the Dirichlet regression as defined in brms is not defined for a proportion of zero. Several solutions exist to accommodate the problem:

  • Using a Zero Inflated Dirichlet distribution. The function ZIGDM in the miLineage package can perform such an analysis;
  • Using a zero-adjusted Dirichlet regression (Tsagris and Stewart 2018);
  • Removing the rows.

We will choose the third option for the sake of simplicity, reducing the number of rows from 126 to 124.

## Remove divisions in which there was no vote for other parties
R_rm <- R %>% filter(Others != 0)

We will now import the socio-economic variables and select some of them:

X_raw <- import.X()

X <- X_raw %>% select(division,
                      anglais_pourc, bilingue_pourc, aucune_pourc,
                      commun_pourc, bicycle_pourc,
                      migrants_5ans_pourc,
                      log_av60_pourc,
                      agri_pourc,
                      moins_40mille_pourc, X40_100mille_pour,
                      loc_pourc, 
                      n_valeur_logement_mediane, valeur_logement_mediane,
                      n_cout_hab_loc_pourc, cout_hab_loc_pourc)
  • anglais_pourc, bilingue_pourc and aucune_pourc are the proportions of people in a division speaking only English, both English and French, or neither of these languages, respectively;
  • commun_pourc and bicycle_pourc are the proportions of people commuting to work by public transportation or by bicycle, respectively;
  • migrants_5ans_pourc is the proportion of people who migrated from another country more than five years ago;
  • log_av60_pourc is the proportion of dwellings built before 1960;
  • agri_pourc is the proportion of people working in the agricultural sector;
  • moins_40mille_pourc and X40_100mille_pour are the proportions of people earning less than 40,000 CAD or between 40,000 and 100,000 CAD, respectively;
  • loc_pourc is the proportion of people renting their housing;
  • n_valeur_logement_mediane and n_cout_hab_loc_pourc are the normalized median property value and median rent, respectively.

We can now join the explanatory data with the election results. The joining variable will be division, the name of the electoral division.

Tot <- left_join(R_rm, X)

Tot %>% select(-CODE_CIRC, -CODE_RA, -TRI_CIRC, -TRI_RA, -MRC) %>% 
  summary() %>% 
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")
(Output: the summary() table of the joined data set, one column per variable (election results, division coordinates and socio-economic predictors), showing the minimum, quartiles, mean and maximum of each.)

To fit a Dirichlet regression with brms, one has to format the response data in a quite unusual way: the function asks for the response variables to be stored as a matrix column inside the data frame!

Tot$Y <- cbind(Others = Tot$Others, CAQ = Tot$CAQ, 
               PQ = Tot$PQ, PLQ = Tot$PLQ, QS = Tot$QS)
str(Tot)
## 'data.frame':    124 obs. of  31 variables:
##  $ division                 : chr  "Abitibi-Est" "Abitibi-Ouest" "Acadie" "Anjou-Louis-Riel" ...
##  $ num_circon               : num  332 332 332 332 332 332 332 332 332 332 ...
##  $ CAQ                      : num  0.427 0.341 0.165 0.289 0.389 ...
##  $ PQ                       : num  0.195 0.333 0.09 0.147 0.211 ...
##  $ PLQ                      : num  0.188 0.113 0.538 0.391 0.174 ...
##  $ QS                       : num  0.157 0.166 0.138 0.145 0.122 ...
##  $ CODE_CIRC                : int  648 642 338 366 520 144 806 802 218 822 ...
##  $ CODE_RA                  : int  8 8 6 6 15 17 12 12 16 12 ...
##  $ RA                       : Factor w/ 17 levels "Abitibi-Témiscamingue",..: 1 1 14 14 10 4 5 5 13 5 ...
##  $ TRI_CIRC                 : Factor w/ 125 levels "ABITIBIEST","ABITIBIOUEST",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ TRI_RA                   : Factor w/ 17 levels "ABITIBITEMISCAMINGUE",..: 1 1 14 14 10 4 5 5 13 5 ...
##  $ MRC                      : chr  "La Vallée-de-l'Or" "Abitibi" "Montréal" "Montréal" ...
##  $ coords.x1                : num  -76.8 -78 -73.7 -73.7 -74.5 ...
##  $ coords.x2                : num  47.9 48.6 45.5 45.5 45.7 ...
##  $ Others                   : num  0.0339 0.0472 0.0694 0.028 0.104 ...
##  $ anglais_pourc            : num  0.0304 0.0111 0.2226 0.0958 0.1335 ...
##  $ bilingue_pourc           : num  0 0 0.16 0.07 0.01 0 0 0 0 0 ...
##  $ aucune_pourc             : num  0 0 0.05 0.02 0 0 0 0 0 0 ...
##  $ commun_pourc             : num  0.01 0.03 0.35 0.34 0.01 0 0 0 0.02 0.01 ...
##  $ bicycle_pourc            : num  0.01 0.01 0.01 0.01 0 0.02 0 0 0.01 0 ...
##  $ migrants_5ans_pourc      : num  0.01 0.01 0.14 0.07 0.02 0.01 0.01 0.01 0.01 0.01 ...
##  $ log_av60_pourc           : num  0.26 0.29 0.31 0.18 0.24 0.24 0.25 0.24 0.32 0.31 ...
##  $ agri_pourc               : num  0.02 0.06 0 0 0.02 0.06 0.08 0.05 0.02 0.08 ...
##  $ moins_40mille_pourc      : num  0.32 0.32 0.39 0.34 0.33 0.37 0.27 0.33 0.36 0.29 ...
##  $ X40_100mille_pour        : num  0.39 0.43 0.44 0.47 0.45 0.47 0.5 0.49 0.45 0.48 ...
##  $ loc_pourc                : num  0.35 0.26 0.6 0.58 0.23 0.32 0.23 0.28 0.39 0.21 ...
##  $ n_valeur_logement_mediane: num  0.124 0.053 0.477 0.475 0.167 ...
##  $ valeur_logement_mediane  : int  200330 159847 400631 399577 224493 150408 179680 149878 200389 185144 ...
##  $ n_cout_hab_loc_pourc     : num  0.197 0.122 0.543 0.491 0.297 ...
##  $ cout_hab_loc_pourc       : int  609 567 802 773 665 565 588 541 649 596 ...
##  $ Y                        : num [1:124, 1:5] 0.0339 0.0472 0.0694 0.028 0.104 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr  "Others" "CAQ" "PQ" "PLQ" ...

Model formulation

We can now formulate the posterior that Stan will explore! We will begin with the formula of the model.

library(brms)
form <- bf(Y ~ anglais_pourc + bilingue_pourc + aucune_pourc +
                      commun_pourc + bicycle_pourc +
                      migrants_5ans_pourc +
                      log_av60_pourc +
                      agri_pourc +
                      moins_40mille_pourc + X40_100mille_pour +
                      loc_pourc + n_valeur_logement_mediane + n_cout_hab_loc_pourc,
           family = dirichlet)

Priors

Then, we will explore the parameters implied by our formula, and thus the priors we have to define. We can observe that the Others column is absent, its expected proportions being obtained only by contrast with the other categories.

get_prior(form, data = Tot) %>% 
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")
prior class coef group resp dpar nlpar bound
b
Intercept
gamma(0.01, 0.01) phi
b muCAQ
b agri_pourc muCAQ
b anglais_pourc muCAQ
b aucune_pourc muCAQ
b bicycle_pourc muCAQ
b bilingue_pourc muCAQ
b commun_pourc muCAQ
b loc_pourc muCAQ
b log_av60_pourc muCAQ
b migrants_5ans_pourc muCAQ
b moins_40mille_pourc muCAQ
b n_cout_hab_loc_pourc muCAQ
b n_valeur_logement_mediane muCAQ
b X40_100mille_pour muCAQ
Intercept muCAQ
b muPLQ
b agri_pourc muPLQ
b anglais_pourc muPLQ
b aucune_pourc muPLQ
b bicycle_pourc muPLQ
b bilingue_pourc muPLQ
b commun_pourc muPLQ
b loc_pourc muPLQ
b log_av60_pourc muPLQ
b migrants_5ans_pourc muPLQ
b moins_40mille_pourc muPLQ
b n_cout_hab_loc_pourc muPLQ
b n_valeur_logement_mediane muPLQ
b X40_100mille_pour muPLQ
Intercept muPLQ
b muPQ
b agri_pourc muPQ
b anglais_pourc muPQ
b aucune_pourc muPQ
b bicycle_pourc muPQ
b bilingue_pourc muPQ
b commun_pourc muPQ
b loc_pourc muPQ
b log_av60_pourc muPQ
b migrants_5ans_pourc muPQ
b moins_40mille_pourc muPQ
b n_cout_hab_loc_pourc muPQ
b n_valeur_logement_mediane muPQ
b X40_100mille_pour muPQ
Intercept muPQ
b muQS
b agri_pourc muQS
b anglais_pourc muQS
b aucune_pourc muQS
b bicycle_pourc muQS
b bilingue_pourc muQS
b commun_pourc muQS
b loc_pourc muQS
b log_av60_pourc muQS
b migrants_5ans_pourc muQS
b moins_40mille_pourc muQS
b n_cout_hab_loc_pourc muQS
b n_valeur_logement_mediane muQS
b X40_100mille_pour muQS
Intercept muQS

We can see that we have to define priors for the intercepts, the slopes and the precision. Generally, we define the priors while thinking about the scale of variation in the data. Because both the predictors and the responses are constrained between 0 and 1, we know that an increase of 1 in a predictor could at most bring the response to 1. This is, of course, a theoretical consideration, because we know that, in Canada, the probability that a party obtains 100% of the votes is very low. A normal prior centered on 0 and correctly scaled is reasonable, assuming no particular tendency but a carefully chosen standard deviation to represent the possible amplitude of variation in the data set.
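To get a rough univariate intuition first (a small aside I add here, using the ordinary logistic link rather than the multivariate one of the model), we can push draws from a very diffuse slope prior through plogis() and see that most of the implied proportions pile up near 0 and 1, which is rarely what we believe a priori:

## Implied proportion when a predictor goes from 0 to 1 (intercept at 0),
## for slopes drawn from a diffuse Normal(0, 10) prior
b_draws <- rnorm(1000, 0, 10)
p_draws <- plogis(b_draws)
hist(p_draws, breaks = 30, main = "Implied proportions under b ~ Normal(0, 10)")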

However, the multivariate logit link function gives us a lot of trouble. I find defining priors for parameters on the logit scale painful and not very intuitive, but in the case of a multivariate logit link function, I am completely lost!

The best way to choose the scale of a prior is to perform prior predictive checks, that is, to sample from the prior and generate simulated data to understand the behavior of the model under this provided information.

Prior predictive checks are really easy to do with brms. We will perform them with three different scales: 1, 2 and 10.

curve(dnorm(x, 0, 1), from = -10, to = 10)
curve(dnorm(x, 0, 2), from = -10, to = 10, add = T, col = "blue")
curve(dnorm(x, 0, 10), from = -10, to = 10, add = T, col = "red")

We have little information about the dispersion parameter, so we will define a prior more diffuse than the brms default.

## Prior chosen by brms
curve(dgamma(x, 0.01, rate = 0.01), to = 30, ylim = c(0, 0.040))
## Prior we will define
curve(dstudent_t(x, 3, 0, 10), to = 30, add = T, col = "red")

Prior predictive checks

## Set prior with sd = 1
priors_1 <- c(prior(normal(0,1), class = b),
            prior(normal(0,1), class = Intercept),
            prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_1 <- brm(formula = form, prior = priors_1, 
           data = Tot, sample_prior = "only",
           chains = 1, cores = 1)
## Set prior with sd = 2
priors_2 <- c(prior(normal(0,2), class = b),
            prior(normal(0,1), class = Intercept),
            prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_2 <- brm(formula = form, prior = priors_2, 
           data = Tot, sample_prior = "only",
           chains = 1, cores = 1)
## Set prior with sd = 10
priors_10 <- c(prior(normal(0,10), class = b),
            prior(normal(0,1), class = Intercept),
            prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_10 <- brm(formula = form, prior = priors_10, 
           data = Tot, sample_prior = "only",
           chains = 1, cores = 1)

Now that we have sampled the parameters, we can extract predicted values. Predicted values, in this case, are stored in a three-dimensional array defined by [samples, observations, response variables].

pred_1 <- predict(ppp_1, summary = F, nsamples = 100)
pred_2 <- predict(ppp_2, summary = F, nsamples = 100)
pred_10 <- predict(ppp_10, summary = F, nsamples = 100)
dim(pred_1)
## [1] 100 126   5

We will then plot the predicted densities given the three different scales.

plot(density(pred_1[1,,]), col = scales::alpha("blue", 0.2), main = "b ~ Normal(0,1)",
     xlim = c(-0.1, 1.1))
for(i in 2:100) lines(density(pred_1[i,,]), col = scales::alpha("blue", 0.2))

plot(density(pred_2[1,,]), col = scales::alpha("blue", 0.2), main = "b ~ Normal(0,2)",
     xlim = c(-0.1, 1.1))
for(i in 2:100) lines(density(pred_2[i,,]), col = scales::alpha("blue", 0.2))