Disclaimer : I do not mean to do political analysis, neither to give any opinion. I just made this analysis after the October 2018 provincial election in Quebec, and I decide now to publish it since the R package brms implements Dirichlet regression, hoping that my work would encourage ecologists to dive into multivariate models. Enjoy :)

Note : I discovered after having writen this post that an article went out in Methods in Ecology and Evolution (Douma and Weedon 2019) about regression over proportions. Read this as a tutorial!

# Empty R session
rm(list = ls())
# Set a seed
set.seed(999)

Context

Electors from the Quebec province (Canada) voted in 2018 October 1^st for the members of the national assembly of Quebec. The Coalition Avenir Quebec won the election with 74 seats and 37.42% of votes. They were followed by three principal political party: the Parti Libéral du Québec (31 seats and 24.82% of votes), the Parti Québécois (10 seats, 17.06% of votes) and Quebec Solidaire (10 seats, 16,10%).

For those elections, the Quebec province was divided into 125 electoral divisions (maps). In each of these divisions, voters had to choose a representant belonging to a political party. The representant with the highest number of votes was elected as a member of the national assembly.

In this post, we will explore the relationship between some socio-economic characteristics of each electoral division and the results (in %) of the four principal political party cited above.

To replicate the analysis (and do more if you want!), you simply have to download the .zip file in the following repository.

Data

Sources

I compiled data from various sources about the election and the socio-economical characteristics of each electoral division.

Results of the election can be found on the élections Quebec website. I gathered a compilation in a tidy format done by the modeler of the great electoral blog Qc125. The election results are stored in the file resultats.json in the zip folder, with the correspondence between geographical and electoral units stored in the MRC.csv and RA.csv files.

The explanatory variables (socio-economic data) have been gathered from the 2016 censoring, downloadable at this link on the élections Quebec website. The data are far from being tidy at all when downloaded at this address. The file recensement-2016-CEP.xls in the zip folder contains the original data with a sheet made by copy-pasting data in a tidy format.

The coordinates of each electoral division, defined as the coordinates of the center of the corresponding Municipalité Régional de Comté, can be downloaded on the Ministère de l’énergie et des ressources naturelles website. The shapefiles are stored in the MRC subfolder.

Preparation

I have prepared functions to be able to put order and assemble data from these different sources.

The first function, import.result aims to extract the result of each parti in each region, stored in a Json format, and to join them with geographical informations, such as lattitude and longitude of the electoral division.

import.result <- function(){
  ## Import election results
  D <- rjson::fromJSON(file = "resultats.json",  simplify = T)
  
  ## Extract the results per electoral division
  Circon <- D$circonscriptions
  
  ## Prepare a data.frame to store results
  result <- data.frame(circon = rep(NA, length(Circon)), num_circon = NA,
                       CAQ = NA, PQ = NA, PLQ = NA, QS = NA)
  colnames(result)
  
  for(i in seq(length(Circon))){
    ## Store the electoral division name
    result$circon[i] <- Circon[[i]]$nomCirconscription
    ## Store the electoral division number
    result$num_circon <- Circon[[i]]$numeroCirconscription
    ## Extract results for each candidats
    candidats <- Circon[[i]]$candidats
    for(j in seq(length(candidats))){
      if(candidats[[j]]$numeroPartiPolitique == 27) result$CAQ[i] <- candidats[[j]]$tauxVote
      if(candidats[[j]]$numeroPartiPolitique == 8) result$PQ[i] <- candidats[[j]]$tauxVote
      if(candidats[[j]]$numeroPartiPolitique == 6) result$PLQ[i] <- candidats[[j]]$tauxVote
      if(candidats[[j]]$numeroPartiPolitique == 40) result$QS[i] <- candidats[[j]]$tauxVote
    }
  }
  
  ## Match electoral division and administrative regions
  regions <- read.csv("RA.csv")
  result <- left_join(result, regions)
  
  ## Match electoral division and MRC divisions
  MRC <- read.csv("MRC.csv")
  colnames(MRC)[2] <- "circon"
  result <- left_join(result, select(MRC, circon, MRC) )
  
  ## Get Coordinates of each MRC
  shape <- readOGR("MRC/", "mrc_point")
  D_shape <- cbind(shape@data, shape@coords)
  colnames(D_shape)[8] <- "MRC"
  
  result <- left_join(result, select(D_shape, MRC, coords.x1, coords.x2))

  return(result)
}

The second function import explanatory variables. Because most of these variables are proportions, defined on (0,1), I decided to scale every variable of other kinds under the same range. To constrain a vector \(x\) between \(a\) and \(b\), we need the following function :

\[ f(x) = \frac{(b-a) * (x - min(x))}{max(x) - min(x)} \]

## Function to normalize data between a and b
normalize <- function(x, a = 0, b = 1){
  top <- (b-a) * (x - min(x, na.rm = T))
  bot <- max(x, na.rm = T) - min(x, na.rm = T)
  y <- top/bot + a
}

## And function to denormalize
denormalize <- function(x,minval,maxval) {
    x*(maxval-minval) + minval
}

## Import explanatory data
import.X <- function(){
  X1 <- read.csv("explain.csv",sep = ",")

  X <- X1 %>% mutate(taux_actif = taux_actif/100,
                     taux_emploi = taux_emploi/100,
                     n_valeur_logement_mediane = normalize(valeur_logement_mediane),
                     n_cout_hab_loc_pourc = normalize(cout_hab_loc_pourc))
  return(X)
}

n_valeur_logement_mediane is the median of the estate value in the division and cout_hab_loc_pourc is the median value of the domiciliary rent, both defined on \((0,+\infty)\).

Model

Beta distibution

Results are proportions, defined on (0,1). A nice way to model proportion is to use a \(Beta\) distribution for likelihood. One can define the Beta distribution in term of mean \(\mu\) (defined in (0,1)) and a positive precision \(\phi\). The following function compute the likelihood of a proportion, \(y\), given a mean and a precision (found in Paul Buerkner’s vignette about parametrization of distributions, see also Ferrari and Cribari-Neto (2004)) :

\[ P(y|\mu, \phi) = \frac{y^{\mu \phi - 1} (1-y)^{(1-\mu) \phi-1}}{B(\mu \phi, (1-\mu) \phi)} \] \(B()\) being the Beta function. An intuition emerges if we focus on the numerator. We can see that the likelihood of a probability \(y\) is described as the observation risen at \(\mu\phi-1\) multiplied by the complement, \((1-y)\) raised to \((1-\mu)\phi-1\). The first part is a simple exponential function, defined for any value of \(y\) over \((-Inf, +Inf)\).

first <- function(x, mu, phi) x^(mu*phi-1)
mu = 0.7; phi = 10
curve(first(x, mu, phi), from = 0, to = 1, bty = "n")

This allows the distribution to move around the (0,1) interval by skewing more and more while the mean approaches the boundaries.

numerator <- function(x, mu, phi) x^(mu*phi-1) * (1-x)^((1-mu)*phi-1)
mu = 0.7; phi = 10
curve(numerator(x, mu, phi), from = 0, to = 1, col = "red", bty = "n")
abline(v = mu, lty = 2)

Then, the denominator scales the function to ensure that the area under the curve sum to 1, such as any probability density function shall.

final <- function(x, mu, phi){
  numerator <- numerator(x, mu, phi)
  denominator <- beta(mu*phi, (1-mu)*phi)
  result <- numerator / denominator
}
mu = 0.7; phi = 10
curve(final(x, mu, phi), col = "blue", from = 0, to = 1, bty = "n")
abline(v = 0.7, lty = 2)

integrate(numerator, lower = 0, upper = 1, mu = 0.7, phi = 10)

## 0.003968254 with absolute error < 4.4e-17

integrate(final, lower = 0, upper = 1, mu = 0.7, phi = 10)

## 1 with absolute error < 1.1e-14

We can make sense of the mean and dispersion parameters we the following plots. In R, the beta distribution is not parameterized by \(\mu\) and \(\phi\), but by two shape parameter we will recover easily with the function such as \(a = \mu * \phi\) and \(b = (1-\mu)*\phi\) :

par(mfrow = c(1,2), bty = "n")
mu = 0.5; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), ylim = c(0,4))
mu = 0.25; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "blue")
mu = 0.75; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "red")

mu = 0.25; phi = 10; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), ylim = c(0,4))
mu = 0.25; phi = 5; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "blue")
mu = 0.25; phi = 2; a = mu*phi; b = (1-mu)*phi
curve(dbeta(x, a, b), add = T, col = "red")

par(mfrow = c(1,1))

Dirichlet distribution

But election results are not only proportions, they are compositional. As such, each party result is a proportion of the total number of votes in an electoral division : the sum of the proportions have to sum to 1. This is an interesting mathematical property we should exploit while modeling such data. In that case, the proportion of votes a party received in a division is not only constrained by the predictors, the socio-enconomic data, but also by the sum of the proportions of the other parties. This aspect is implemented in the dirichlet distribution, defined as following in a regression framework (see the vignette or (Maier 2014)) :

\[ P(y_{c}|\mu_{c}, \phi) = \frac{1}{B((\mu_{1}, \ldots, \mu_{C}) \phi)} \prod_{d=1}^C y_{}^{\mu_{d} \phi - 1} \] Where \(c\) is the component (column of the simplex matrix) and \(B\) the multivariate beta function.

Again, we can make sense of parameters by observing the distribution. In R, the distribution is parametrized by a vector \(\boldsymbol{\alpha}\), such as \(\boldsymbol{\alpha} = \boldsymbol{\mu}\phi\). Where \(\boldsymbol{\mu}\) is a simplex reparting the strictly positive value \(\phi\), which represents the sum of \(\boldsymbol{\alpha}\) such as \(\phi = \sum_{c = 1}^{C} \alpha_{c}\).

Dirichlet regression

We can now formulate a dirichlet regression with \(\mu\) dependant of a set of covariates :

\[ y_{i,c} \sim Dirichlet(\mu_{i,c}, \phi)\\ \mu_{c,i} = \frac{\exp(\eta_{c,i})}{\sum_{d= 1}^{C} \exp(\eta_{i,d})}\\ \eta_{c,i} = \boldsymbol{\beta_{c}} X_{i} \]

Where the link between \(\mu\) and \(\eta\) is the multivariate binomial function. \(\boldsymbol{\beta_{c}}\) is a vector of slopes multiplying the predictor matrix \(X\). To ensure identifiability, \(\boldsymbol{\eta_{1}}\) is set to 0.

Dirichlet regression with `brms`

Data preparation

Let’s first import the election results, with geographical information on each electoral division. Because results are expressed as a percentage, we will divide them by 100.

library(tidyverse)
library(rgdal)
library(knitr)
library(kableExtra)
R <- import.result() %>% 
  mutate_at(c("CAQ", "PQ", "PLQ", "QS"), function(x) x/100) %>% 
  rename(division = circon)

## OGR data source with driver: ESRI Shapefile 
## Source: "/home/lucasd/Gdrive/Projects/RIVE-Numeri-lab.github.io/projects/LD_Elections/MRC", layer: "mrc_point"
## with 137 features
## It has 15 fields
## Integer64 fields read as strings:  MRC_S_ MRC_S_ID F_POLYGONI

R %>% head() %>%  
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")

division	num_circon	CAQ	PQ	PLQ	QS	CODE_CIRC	CODE_RA	RA	TRI_CIRC	TRI_RA	MRC	coords.x1	coords.x2
Abitibi-Est	332	0.4272	0.1948	0.1875	0.1566	648	8	Abitibi-Témiscamingue	ABITIBIEST	ABITIBITEMISCAMINGUE	La Vallée-de-l’Or	-76.75412	47.91909
Abitibi-Ouest	332	0.3412	0.3326	0.1131	0.1659	642	8	Abitibi-Témiscamingue	ABITIBIOUEST	ABITIBITEMISCAMINGUE	Abitibi	-78.00822	48.61578
Acadie	332	0.1651	0.0900	0.5380	0.1375	338	6	Montréal	ACADIE	MONTREAL	Montréal	-73.69991	45.49017
Anjou-Louis-Riel	332	0.2891	0.1470	0.3906	0.1453	366	6	Montréal	ANJOULOUISRIEL	MONTREAL	Montréal	-73.69991	45.49017
Argenteuil	332	0.3888	0.2114	0.1741	0.1217	520	15	Laurentides	ARGENTEUIL	LAURENTIDES	Argenteuil	-74.45213	45.66901
Arthabaska	332	0.6184	0.0940	0.1135	0.1258	144	17	Centre-du-Québec	ARTHABASKA	CENTREDUQUEBEC	Arthabaska	-71.88812	46.01932

Because we only selected the four major political parties, all the rows but two does not sum to one, as simplex should.

(RSums <- R %>% select(CAQ:QS) %>% rowSums())

##   [1] 0.9661 0.9528 0.9306 0.9720 0.8960 0.9517 0.9421 0.9341 0.9632 0.9577
##  [11] 0.9711 0.9648 0.9601 0.9499 0.9631 0.9426 0.9690 0.9658 0.9627 0.9556
##  [21] 0.9790 0.9581 0.9821 0.9580 0.8908 0.9684 0.9374 0.9424 0.9659 0.9053
##  [31] 0.9447 0.9687 0.9584 0.9832 0.9435 1.0000 0.9417 0.9538 0.9679 0.9390
##  [41] 0.9687 0.9230 0.9667 0.9479 1.0000 0.8829 0.9429 0.9620 0.9479 0.9504
##  [51] 0.9788 0.9794 0.9739 0.9837 0.9837 0.9791 0.9174 0.9482 0.9449 0.9583
##  [61] 0.9645 0.9358 0.9463 0.9698 0.9679 0.9374 0.9321 0.9470 0.9608 0.9609
##  [71] 0.9058 0.9523 0.9709 0.9787 0.9626 0.9588 0.9238 0.9533 0.9662 0.9752
##  [81] 0.9322 0.9205 0.9240 0.9591 0.8829 0.9525 0.9564 0.9858 0.9055 0.9387
##  [91] 0.9831 0.9893 0.9707 0.9750 0.9619 0.9858 0.9892 0.9278 0.9634 0.9601
## [101] 0.9781 0.9711 0.9690 0.9204 0.9881 0.9657 0.9598 0.9222 0.9541 0.9534
## [111] 0.9701 0.9692 0.9477 0.9626 0.9680 0.9744 0.9652 0.9395 0.9618 0.9282
## [121] 0.9265 0.9674 0.9246 0.9500 0.9547 0.9171

To solve this, we will create a new column representing the cumulative proportion of votes obtained by all the other parties. Note that this is a personal choice and not the only nor the best solution to analyse the dataset (but definitly one of the simplest).

## Create a new column for others votes
R <- R %>% mutate(Others = 1 - RSums)
R %>% select(CAQ:QS, Others) %>% rowSums()

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

However, two lines were summing to one indicate divisions in which there was no other party than the four major ones or at least no vote for eventual ones. They will cause trouble later in our analysis, because dirichlet regression as defined in brms in not define for a proportion of zero. Some solutions exist to accomodate the problem :

Using a Zero Inflated Dirichlet distribution. The function ZIGDM in the miLineage package can perform such an analysis;
Using a zero adjusted dirichlet regression (Tsagris and Stewart 2018)
Remove the lines.

We will choose the third one for sake of simplicity, reducing the number of line from 126 to 124.

## Remove divisions in which there was no vote for other parties
R_rm <- R %>% filter(Others != 0)

Then we will now import the socio-economic variables, and select some :

X_raw <- import.X()

X <- X_raw %>% select(division,
                      anglais_pourc, bilingue_pourc, aucune_pourc,
                      commun_pourc, bicycle_pourc,
                      migrants_5ans_pourc,
                      log_av60_pourc,
                      agri_pourc,
                      moins_40mille_pourc, X40_100mille_pour,
                      loc_pourc, 
                      n_valeur_logement_mediane, valeur_logement_mediane,
                      n_cout_hab_loc_pourc, cout_hab_loc_pourc)

anglais_pourc, bilingue_pourc and aucune_pourc are the proportion of person in a division speaking only English, English and French or none of these languages, respectively;
commun_pourc and bicycle_pourc are the proportion of people going to work by public transportation or bicycle, respectively;
migrants_5ans_pourc is the proportion of people who have migrated from another country for more than five years;
log_av60_pourc is the proportion of habitation built before 1960;
agri_pourc is the percentage of people working in the agricultural sector;
moins_40mille_pourc and X40_100mille_pourc are the proportion of people earning less than 40 000 CAD or between 40 000 and 100 000 CAD, respectively;
loc_pourc is the proportion of people renting their housing;
n_valeur_logement_mediane and n_cout_hab_loc_pourc are the median cost of estate and of the housing rent, respectively.

We can now join the explanatory data with the election results. The joining variable will be circon, the name of the electoral division.

Tot <- left_join(R_rm, X)

Tot %>% select(-CODE_CIRC, -CODE_RA, -TRI_CIRC, -TRI_RA, -MRC) %>% 
  summary() %>% 
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")

division	num_circon	`CAQ </th>`	`PQ </th>`	`PLQ </th>`	`QS </th>`	`RA </th>`	coords.x1	coords.x2	`Others </th>`	anglais_pourc	bilingue_pourc	aucune_pourc	commun_pourc	bicycle_pourc	migrants_5ans_pourc	log_av60_pourc	agri_pourc	moins_40mille_pourc	X40_100mille_pour	loc_pourc	n_valeur_logement_mediane	valeur_logement_mediane	n_cout_hab_loc_pourc	cout_hab_loc_pourc
Length:124	Min. :332	Min. :0.0640	Min. :0.0256	Min. :0.0522	Min. :0.0434	Montréal :27	Min. :-79.05	Min. :45.14	Min. :0.01070	Min. :0.001759	Min. :0.00000	Min. :0.000000	Min. :0.0000	Min. :0.00000	Min. :0.00000	Min. :0.0400	Min. :0.00000	Min. :0.1300	Min. :0.2700	Min. :0.1400	Min. :0.0000	Min. :129719	Min. :0.0000	Min. : 499.0
Class :character	1st Qu.:332	1st Qu.:0.2643	1st Qu.:0.1003	1st Qu.:0.1283	1st Qu.:0.1151	Montérégie :22	1st Qu.:-73.70	1st Qu.:45.49	1st Qu.:0.03130	1st Qu.:0.012114	1st Qu.:0.00000	1st Qu.:0.000000	1st Qu.:0.0100	1st Qu.:0.00000	1st Qu.:0.01000	1st Qu.:0.1375	1st Qu.:0.00000	1st Qu.:0.2375	1st Qu.:0.4300	1st Qu.:0.2200	1st Qu.:0.1233	1st Qu.:199814	1st Qu.:0.1985	1st Qu.: 609.8
Mode :character	Median :332	Median :0.3921	Median :0.1454	Median :0.2061	Median :0.1424	Capitale-Nationale :11	Median :-73.51	Median :45.62	Median :0.04140	Median :0.045082	Median :0.01000	Median :0.000000	Median :0.0700	Median :0.01000	Median :0.02000	Median :0.2400	Median :0.01000	Median :0.3200	Median :0.4500	Median :0.2850	Median :0.2174	Median :253241	Median :0.4104	Median : 728.0
NA	Mean :332	Mean :0.3635	Mean :0.1721	Mean :0.2560	Mean :0.1620	Lanaudière : 9	Mean :-72.95	Mean :46.35	Mean :0.04644	Mean :0.111924	Mean :0.03113	Mean :0.007984	Mean :0.1271	Mean :0.01387	Mean :0.03565	Mean :0.2435	Mean :0.02145	Mean :0.3063	Mean :0.4454	Mean :0.3658	Mean :0.2443	Mean :268551	Mean :0.3756	Mean : 708.6
NA	3rd Qu.:332	3rd Qu.:0.4716	3rd Qu.:0.2152	3rd Qu.:0.3351	3rd Qu.:0.1696	Laurentides : 8	3rd Qu.:-71.82	3rd Qu.:46.72	3rd Qu.:0.05767	3rd Qu.:0.163318	3rd Qu.:0.05000	3rd Qu.:0.010000	3rd Qu.:0.2100	3rd Qu.:0.01000	3rd Qu.:0.05000	3rd Qu.:0.3125	3rd Qu.:0.03000	3rd Qu.:0.3700	3rd Qu.:0.4700	3rd Qu.:0.4625	3rd Qu.:0.3032	3rd Qu.:302040	3rd Qu.:0.5206	3rd Qu.: 789.5
NA	Max. :332	Max. :0.6637	Max. :0.6946	Max. :0.7432	Max. :0.5914	Chaudière-Appalaches: 7	Max. :-66.71	Max. :58.53	Max. :0.11710	Max. :0.664920	Max. :0.16000	Max. :0.060000	Max. :0.4700	Max. :0.15000	Max. :0.31000	Max. :0.6300	Max. :0.11000	Max. :0.4900	Max. :0.5100	Max. :0.7500	Max. :1.0000	Max. :698017	Max. :1.0000	Max. :1057.0
NA	NA	NA	NA	NA	NA	(Other) :40	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

To fit a Dirichlet regression with brms, one has to format response data in a quite unusual way: the function asks for the response variables to be stores as a data frame inside a data frame!

Tot$Y <- cbind(Others = Tot$Others, CAQ = Tot$CAQ, 
               PQ = Tot$PQ, PLQ = Tot$PLQ, QS = Tot$QS)
str(Tot)

## 'data.frame':    124 obs. of  31 variables:
##  $ division                 : chr  "Abitibi-Est" "Abitibi-Ouest" "Acadie" "Anjou-Louis-Riel" ...
##  $ num_circon               : num  332 332 332 332 332 332 332 332 332 332 ...
##  $ CAQ                      : num  0.427 0.341 0.165 0.289 0.389 ...
##  $ PQ                       : num  0.195 0.333 0.09 0.147 0.211 ...
##  $ PLQ                      : num  0.188 0.113 0.538 0.391 0.174 ...
##  $ QS                       : num  0.157 0.166 0.138 0.145 0.122 ...
##  $ CODE_CIRC                : int  648 642 338 366 520 144 806 802 218 822 ...
##  $ CODE_RA                  : int  8 8 6 6 15 17 12 12 16 12 ...
##  $ RA                       : Factor w/ 17 levels "Abitibi-Témiscamingue",..: 1 1 14 14 10 4 5 5 13 5 ...
##  $ TRI_CIRC                 : Factor w/ 125 levels "ABITIBIEST","ABITIBIOUEST",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ TRI_RA                   : Factor w/ 17 levels "ABITIBITEMISCAMINGUE",..: 1 1 14 14 10 4 5 5 13 5 ...
##  $ MRC                      : chr  "La Vallée-de-l'Or" "Abitibi" "Montréal" "Montréal" ...
##  $ coords.x1                : num  -76.8 -78 -73.7 -73.7 -74.5 ...
##  $ coords.x2                : num  47.9 48.6 45.5 45.5 45.7 ...
##  $ Others                   : num  0.0339 0.0472 0.0694 0.028 0.104 ...
##  $ anglais_pourc            : num  0.0304 0.0111 0.2226 0.0958 0.1335 ...
##  $ bilingue_pourc           : num  0 0 0.16 0.07 0.01 0 0 0 0 0 ...
##  $ aucune_pourc             : num  0 0 0.05 0.02 0 0 0 0 0 0 ...
##  $ commun_pourc             : num  0.01 0.03 0.35 0.34 0.01 0 0 0 0.02 0.01 ...
##  $ bicycle_pourc            : num  0.01 0.01 0.01 0.01 0 0.02 0 0 0.01 0 ...
##  $ migrants_5ans_pourc      : num  0.01 0.01 0.14 0.07 0.02 0.01 0.01 0.01 0.01 0.01 ...
##  $ log_av60_pourc           : num  0.26 0.29 0.31 0.18 0.24 0.24 0.25 0.24 0.32 0.31 ...
##  $ agri_pourc               : num  0.02 0.06 0 0 0.02 0.06 0.08 0.05 0.02 0.08 ...
##  $ moins_40mille_pourc      : num  0.32 0.32 0.39 0.34 0.33 0.37 0.27 0.33 0.36 0.29 ...
##  $ X40_100mille_pour        : num  0.39 0.43 0.44 0.47 0.45 0.47 0.5 0.49 0.45 0.48 ...
##  $ loc_pourc                : num  0.35 0.26 0.6 0.58 0.23 0.32 0.23 0.28 0.39 0.21 ...
##  $ n_valeur_logement_mediane: num  0.124 0.053 0.477 0.475 0.167 ...
##  $ valeur_logement_mediane  : int  200330 159847 400631 399577 224493 150408 179680 149878 200389 185144 ...
##  $ n_cout_hab_loc_pourc     : num  0.197 0.122 0.543 0.491 0.297 ...
##  $ cout_hab_loc_pourc       : int  609 567 802 773 665 565 588 541 649 596 ...
##  $ Y                        : num [1:124, 1:5] 0.0339 0.0472 0.0694 0.028 0.104 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr  "Others" "CAQ" "PQ" "PLQ" ...

Model formulation

We can now formulate the posterior stan will explore! We will first begin by the formula of the model.

library(brms)
form <- bf(Y ~ anglais_pourc + bilingue_pourc + aucune_pourc +
                      commun_pourc + bicycle_pourc +
                      migrants_5ans_pourc +
                      log_av60_pourc +
                      agri_pourc +
                      moins_40mille_pourc + X40_100mille_pour +
                      loc_pourc + n_valeur_logement_mediane + n_cout_hab_loc_pourc,
           family = dirichlet)

Priors

Then, we will explore the parameters implied by our formula, and thus the priors we have to define. We can observe that the Others column is absent, its parameters being only computed by contrasts from the others.

get_prior(form, data = Tot) %>% 
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")

prior	class	coef	dpar
	b
	Intercept
gamma(0.01, 0.01)	phi
	b		muCAQ
	b	agri_pourc	muCAQ
	b	anglais_pourc	muCAQ
	b	aucune_pourc	muCAQ
	b	bicycle_pourc	muCAQ
	b	bilingue_pourc	muCAQ
	b	commun_pourc	muCAQ
	b	loc_pourc	muCAQ
	b	log_av60_pourc	muCAQ
	b	migrants_5ans_pourc	muCAQ
	b	moins_40mille_pourc	muCAQ
	b	n_cout_hab_loc_pourc	muCAQ
	b	n_valeur_logement_mediane	muCAQ
	b	X40_100mille_pour	muCAQ
	Intercept		muCAQ
	b		muPLQ
	b	agri_pourc	muPLQ
	b	anglais_pourc	muPLQ
	b	aucune_pourc	muPLQ
	b	bicycle_pourc	muPLQ
	b	bilingue_pourc	muPLQ
	b	commun_pourc	muPLQ
	b	loc_pourc	muPLQ
	b	log_av60_pourc	muPLQ
	b	migrants_5ans_pourc	muPLQ
	b	moins_40mille_pourc	muPLQ
	b	n_cout_hab_loc_pourc	muPLQ
	b	n_valeur_logement_mediane	muPLQ
	b	X40_100mille_pour	muPLQ
	Intercept		muPLQ
	b		muPQ
	b	agri_pourc	muPQ
	b	anglais_pourc	muPQ
	b	aucune_pourc	muPQ
	b	bicycle_pourc	muPQ
	b	bilingue_pourc	muPQ
	b	commun_pourc	muPQ
	b	loc_pourc	muPQ
	b	log_av60_pourc	muPQ
	b	migrants_5ans_pourc	muPQ
	b	moins_40mille_pourc	muPQ
	b	n_cout_hab_loc_pourc	muPQ
	b	n_valeur_logement_mediane	muPQ
	b	X40_100mille_pour	muPQ
	Intercept		muPQ
	b		muQS
	b	agri_pourc	muQS
	b	anglais_pourc	muQS
	b	aucune_pourc	muQS
	b	bicycle_pourc	muQS
	b	bilingue_pourc	muQS
	b	commun_pourc	muQS
	b	loc_pourc	muQS
	b	log_av60_pourc	muQS
	b	migrants_5ans_pourc	muQS
	b	moins_40mille_pourc	muQS
	b	n_cout_hab_loc_pourc	muQS
	b	n_valeur_logement_mediane	muQS
	b	X40_100mille_pour	muQS
	Intercept		muQS

We can see that we have to define priors for Intercepts, slopes, and precision. Generally, we define the priors while thinking about the scale of variation in data. Because both the predictors and the responses are constrained between 0 and 1, we know that an increase of 1 of a predictor value might at the maximum increase the response to 1. This is, of course, a theoretical consideration, because we know that, in Canada, the probability that a party obtain 100% of votes is very low. A normal prior centered on 0 and correctly scaled is reasonable, assuming no particular tendency but a carefully chosen standard-deviation to represent the possible amplitude of variation in the data set.

However, the multivariate logit link function gives us a lot of trouble. I find defining priors for parameters on the logit scale painful and not very intuitive, but in case of a multivariate logit link function, I am completely lost!

The best way to choose the scale of a prior is to perform prior predictive checks, that is, to sample from the prior and generate simulated data to understand the behavior of the model under this provided information.

Prior predictive checks are really easy to do with brms. We will perform it with 4 different scales : 1, 2 and 10.

curve(dnorm(x, 0, 1), from = -10, to = 10)
curve(dnorm(x, 0, 2), from = -10, to = 10, add = T, col = "blue")
curve(dnorm(x, 0, 10), from = -10, to = 10, add = T, col = "red")

We have little information about the dispersion parameter, so we will define a more diffuse prior than brms.

## Prior chosen by brms
curve(dgamma(x, 0.01, rate = 0.01), to = 30, ylim = c(0, 0.040))
## Prior we will define
curve(dstudent_t(x, 3, 0, 10), to = 30, add = T, col = "red")

Prior predictive checks

## Set prior with sd = 1
priors_1 <- c(prior(normal(0,1), class = b),
            prior(normal(0,1), class = Intercept),
            prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_1 <- brm(formula = form, prior = priors_1, 
           data = Tot, sample_prior = "only",
           chains = 1, cores = 1)
## Set prior with sd = 2
priors_2 <- c(prior(normal(0,2), class = b),
            prior(normal(0,1), class = Intercept),
            prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_2 <- brm(formula = form, prior = priors_2, 
           data = Tot, sample_prior = "only",
           chains = 1, cores = 1)
## Set prior with sd = 2
priors_10 <- c(prior(normal(0,10), class = b),
            prior(normal(0,1), class = Intercept),
            prior(student_t(3, 0, 10), class = phi))
## sample from the priors
ppp_10 <- brm(formula = form, prior = priors_10, 
           data = Tot, sample_prior = "only",
           chains = 1, cores = 1)

Now that we have sampled the parameters, we can extract predicted values. Predicted values, in this case, are stored in a three dimensional array defined by [samples, observations, response variables]

pred_1 <- predict(ppp_1, summary = F, nsamples = 100)
pred_2 <- predict(ppp_2, summary = F, nsamples = 100)
pred_10 <- predict(ppp_10, summary = F, nsamples = 100)
dim(pred_1)

## [1] 100 126   5

We will then plot the predicted densities given the three different scales.

plot(density(pred_1[1,,]), col = scales::alpha("blue", 0.2), main = "b ~ Normal(0,1)",
     xlim = c(-0.1, 1.1))
for(i in 2:100) lines(density(pred_1[i,,]), col = scales::alpha("blue", 0.2))

plot(density(pred_2[1,,]), col = scales::alpha("blue", 0.2), main = "b ~ Normal(0,2)",
     xlim = c(-0.1, 1.1))
for(i in 2:100) lines(density(pred_2[i,,]), col = scales::alpha("blue", 0.2))

plot(density(pred_10[1,,]), col = scales::alpha("blue", 0.2), main = "b ~ Normal(0,10)",
     xlim = c(-0.1, 1.1))
for(i in 2:100) lines(density(pred_10[i,,]), col = scales::alpha("blue", 0.2))

We can see that the higher the scale, the higher the probability that a party wins with approximately 100% of votes, which is almost impossible (in Canada…). Ironically, in that case, a greater scale of the prior leads to more specific results than a lower scale! That should make us think about diffuse priors, and why they are often inappropriate, and sometimes lead to highly unwanted results… I will let you discover the work of Jim Savage, who proposed in this blog post, to choose prior scale based on prior predictive checks and an explicit loss function.

We will then define the priors as follow :

priors <- c(prior(normal(0,1), class = b),
            prior(normal(0,1), class = Intercept),
            prior(student_t(3, 0, 10), class = phi))

Model fit

Before fitting the model, we will define the number of chains we want to run in parallel. I suggest to run four chains and to exploit one less core than the number of cores available in the CPU (to allow you to continue listening to music).

nchains = 4
ncores = parallel::detectCores()-1

We can now ask brms to create a stan program and to call the No-U-Turn Sampler algorithm to sample the posterior of parameters.

fit <- brm(formula = form, prior = priors, 
           data = Tot,
           chains = nchains, cores = ncores)

We can observe diagnostic values in the summary :

summary(fit)

##  Family: dirichlet 
##   Links: muCAQ = logit; muPQ = logit; muPLQ = logit; muQS = logit; phi = identity 
## Formula: Y ~ anglais_pourc + bilingue_pourc + aucune_pourc + commun_pourc + bicycle_pourc + migrants_5ans_pourc + log_av60_pourc + agri_pourc + moins_40mille_pourc + X40_100mille_pour + loc_pourc + n_valeur_logement_mediane + n_cout_hab_loc_pourc 
##    Data: Tot (Number of observations: 124) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                                 Estimate Est.Error l-95% CI u-95% CI Rhat
## muCAQ_Intercept                     1.91      0.46     1.02     2.79 1.00
## muPQ_Intercept                      2.27      0.50     1.28     3.22 1.00
## muPLQ_Intercept                     1.16      0.44     0.30     2.03 1.00
## muQS_Intercept                      1.25      0.51     0.21     2.23 1.00
## muCAQ_anglais_pourc                -1.90      0.44    -2.75    -1.03 1.00
## muCAQ_bilingue_pourc               -0.47      0.88    -2.27     1.28 1.00
## muCAQ_aucune_pourc                 -0.19      0.95    -2.06     1.68 1.00
## muCAQ_commun_pourc                 -1.40      0.58    -2.54    -0.28 1.00
## muCAQ_bicycle_pourc                -0.80      0.91    -2.60     0.97 1.00
## muCAQ_migrants_5ans_pourc          -0.40      0.91    -2.21     1.39 1.00
## muCAQ_log_av60_pourc               -0.79      0.48    -1.76     0.16 1.00
## muCAQ_agri_pourc                    0.38      0.90    -1.37     2.14 1.00
## muCAQ_moins_40mille_pourc          -0.73      0.70    -2.10     0.68 1.00
## muCAQ_X40_100mille_pour             1.93      0.79     0.37     3.49 1.00
## muCAQ_loc_pourc                     0.10      0.47    -0.78     1.03 1.00
## muCAQ_n_valeur_logement_mediane    -0.33      0.49    -1.29     0.62 1.00
## muCAQ_n_cout_hab_loc_pourc         -0.15      0.38    -0.91     0.59 1.00
## muPQ_anglais_pourc                 -2.04      0.49    -3.02    -1.14 1.00
## muPQ_bilingue_pourc                -0.49      0.90    -2.23     1.25 1.00
## muPQ_aucune_pourc                  -0.25      0.98    -2.22     1.69 1.00
## muPQ_commun_pourc                   0.07      0.62    -1.16     1.27 1.00
## muPQ_bicycle_pourc                  0.14      0.92    -1.67     1.92 1.00
## muPQ_migrants_5ans_pourc           -0.79      0.94    -2.64     1.06 1.00
## muPQ_log_av60_pourc                -0.47      0.54    -1.49     0.59 1.00
## muPQ_agri_pourc                    -0.84      0.92    -2.63     0.97 1.00
## muPQ_moins_40mille_pourc            0.43      0.81    -1.14     1.99 1.00
## muPQ_X40_100mille_pour             -1.32      0.84    -2.99     0.34 1.00
## muPQ_loc_pourc                     -0.00      0.53    -1.07     1.04 1.00
## muPQ_n_valeur_logement_mediane     -0.77      0.56    -1.87     0.31 1.00
## muPQ_n_cout_hab_loc_pourc          -0.42      0.43    -1.25     0.44 1.00
## muPLQ_anglais_pourc                 1.51      0.41     0.72     2.35 1.00
## muPLQ_bilingue_pourc                1.54      0.87    -0.17     3.23 1.00
## muPLQ_aucune_pourc                  0.48      0.98    -1.42     2.37 1.00
## muPLQ_commun_pourc                  0.81      0.57    -0.28     1.92 1.00
## muPLQ_bicycle_pourc                -1.30      0.92    -3.14     0.53 1.00
## muPLQ_migrants_5ans_pourc           0.51      0.84    -1.13     2.18 1.00
## muPLQ_log_av60_pourc               -0.76      0.47    -1.68     0.15 1.00
## muPLQ_agri_pourc                    0.51      0.91    -1.32     2.33 1.00
## muPLQ_moins_40mille_pourc           0.89      0.69    -0.45     2.24 1.00
## muPLQ_X40_100mille_pour             0.62      0.77    -0.89     2.11 1.00
## muPLQ_loc_pourc                    -0.77      0.49    -1.73     0.18 1.00
## muPLQ_n_valeur_logement_mediane     0.66      0.46    -0.26     1.55 1.00
## muPLQ_n_cout_hab_loc_pourc         -0.80      0.39    -1.58    -0.05 1.00
## muQS_anglais_pourc                 -1.75      0.47    -2.72    -0.85 1.00
## muQS_bilingue_pourc                -1.07      0.89    -2.84     0.64 1.00
## muQS_aucune_pourc                  -0.14      0.98    -2.05     1.76 1.00
## muQS_commun_pourc                   0.61      0.59    -0.56     1.78 1.00
## muQS_bicycle_pourc                  1.89      0.92     0.08     3.72 1.00
## muQS_migrants_5ans_pourc            0.03      0.89    -1.67     1.76 1.00
## muQS_log_av60_pourc                 0.79      0.51    -0.19     1.80 1.00
## muQS_agri_pourc                    -0.13      0.93    -1.92     1.73 1.00
## muQS_moins_40mille_pourc           -0.20      0.78    -1.75     1.32 1.00
## muQS_X40_100mille_pour             -0.94      0.85    -2.60     0.70 1.00
## muQS_loc_pourc                      0.94      0.52    -0.05     1.95 1.00
## muQS_n_valeur_logement_mediane     -0.86      0.50    -1.83     0.13 1.00
## muQS_n_cout_hab_loc_pourc           0.26      0.43    -0.57     1.09 1.00
##                                 Bulk_ESS Tail_ESS
## muCAQ_Intercept                     7130     3027
## muPQ_Intercept                      7893     2663
## muPLQ_Intercept                     6441     3262
## muQS_Intercept                      6558     2446
## muCAQ_anglais_pourc                 4157     3322
## muCAQ_bilingue_pourc                7507     2698
## muCAQ_aucune_pourc                  9202     2894
## muCAQ_commun_pourc                  5254     3346
## muCAQ_bicycle_pourc                 8766     2456
## muCAQ_migrants_5ans_pourc           9978     2891
## muCAQ_log_av60_pourc                4938     3120
## muCAQ_agri_pourc                    9526     2762
## muCAQ_moins_40mille_pourc           6987     3472
## muCAQ_X40_100mille_pour             8190     2984
## muCAQ_loc_pourc                     4793     3717
## muCAQ_n_valeur_logement_mediane     4249     3386
## muCAQ_n_cout_hab_loc_pourc          3597     2968
## muPQ_anglais_pourc                  5099     3415
## muPQ_bilingue_pourc                 9672     3076
## muPQ_aucune_pourc                   9716     2762
## muPQ_commun_pourc                   5842     3554
## muPQ_bicycle_pourc                  8972     3097
## muPQ_migrants_5ans_pourc            9195     2615
## muPQ_log_av60_pourc                 5476     3418
## muPQ_agri_pourc                    10348     2858
## muPQ_moins_40mille_pourc            6971     3260
## muPQ_X40_100mille_pour              9039     2789
## muPQ_loc_pourc                      4891     2929
## muPQ_n_valeur_logement_mediane      4704     3047
## muPQ_n_cout_hab_loc_pourc           4198     3239
## muPLQ_anglais_pourc                 4457     3124
## muPLQ_bilingue_pourc                8116     3074
## muPLQ_aucune_pourc                  9361     2719
## muPLQ_commun_pourc                  4890     3027
## muPLQ_bicycle_pourc                 9271     2676
## muPLQ_migrants_5ans_pourc           9426     3161
## muPLQ_log_av60_pourc                5462     3624
## muPLQ_agri_pourc                    9473     2676
## muPLQ_moins_40mille_pourc           6342     3422
## muPLQ_X40_100mille_pour             7654     3514
## muPLQ_loc_pourc                     5358     3125
## muPLQ_n_valeur_logement_mediane     4110     3295
## muPLQ_n_cout_hab_loc_pourc          3372     3148
## muQS_anglais_pourc                  5144     3613
## muQS_bilingue_pourc                 8524     3273
## muQS_aucune_pourc                   9187     2976
## muQS_commun_pourc                   5872     3222
## muQS_bicycle_pourc                  8427     3018
## muQS_migrants_5ans_pourc            9973     3068
## muQS_log_av60_pourc                 5437     3272
## muQS_agri_pourc                     8554     2943
## muQS_moins_40mille_pourc            6370     2914
## muQS_X40_100mille_pour              7858     3099
## muQS_loc_pourc                      4886     3277
## muQS_n_valeur_logement_mediane      4370     2698
## muQS_n_cout_hab_loc_pourc           3854     3074
## 
## Family Specific Parameters: 
##     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi    39.21      2.58    34.20    44.44 1.00     6754     3356
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Every \(\hat{R}\) is egual to 1, meaning chains should have correctly converged. Bulk and Tail Effective Sample Size are superior to 100 per chains, so the estimates should be reliable. One can observe the chains with the following command :

plot(fit, pars = NA)

Predictions

The best way to understand the conclusion of a model is to visualize the counter-factual predictions, that is to plot the effect of a variable while the others are kept constant. Fortunately, brms allow doing this easily.

English speakers

p <- plot(marginal_effects(fit, categorical = T, effects = "anglais_pourc"),
                      plot = F)

p$`anglais_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() +
  xlab("Proportion of people speaking english only (0-1)") + 
  ylab("Proportion of vote (0-1)") + 
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

Divisions with a higher proportion of people speaking English gave an overwhelming support to the Parti Libéral du Québec, with the proportion of voters being on average 3 time higher when more than half of the population speaks only English, vs. division with no English-only speakers. Every other party had lower results in English-speaking divisions, except the aggregation of every other smaller parties, which tended to increase slightly. There is the same tendency, less exacerbated, when considering the proportion of bilingual in the population.

Public transportation

p <- plot(marginal_effects(fit, categorical = T, effects = "commun_pourc"),
                      plot = F)

p$`commun_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Proportion of people using public transportation (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

Coalition Avenir Québec received a far proportion of votes (up to two times less) in division where people go to work by public transportation. Both PLQ and Quebec Solidaire seemed to be favored in this electoral divisions.

Bicycle

p <- plot(marginal_effects(fit, categorical = T, effects = "bicycle_pourc"),
                      plot = F)

p$`bicycle_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Proportion of people using bicycle (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

The more people use a bicycle to go to work, the more they tended to vote for QS, a left-oriented party. Both CAQ and PLQ results tended to decrease in that case.

Recently immigrated people

p <- plot(marginal_effects(fit, categorical = T, effects = "migrants_5ans_pourc"),
                      plot = F)

p$`migrants_5ans_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Proportion of migrants since 5 years (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

We cannot see any clear relationship between the proportion of people having immigrated for more than 5 years and the results, even if there is some slight and highly uncertain tendencies.

Old housing

p <- plot(marginal_effects(fit, categorical = T, effects = "log_av60_pourc"),
                      plot = F)

p$`log_av60_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Proportion of housing built before 1960 (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

We can observe a clear and strong relationship: QS was favored in divisions in which the proportion of old houses and buildings is higher, with an approximately two-time increase among division recently built and the one with a majority of older buildings.

Agriculture

p <- plot(marginal_effects(fit, categorical = T, effects = "agri_pourc"),
                      plot = F)

p$`agri_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Proportion of people working in agriculture (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

We cannot see any tendency related to the proportion of people working in the agricultural sector.

Income

There are two contrasted responses when we look at the effect of incomes on election results.

p <- plot(marginal_effects(fit, categorical = T, effects = "moins_40mille_pourc"),
                      plot = F)

p$`moins_40mille_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Proportion of people earning between 40K$ and 100K$ (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

First of all, the more people was earning less than 40 000 CAD a year, the higher was the result of the PLQ, and the lower the result of the CAQ.

p <- plot(marginal_effects(fit, categorical = T, effects = "X40_100mille_pour"),
                      plot = F)

p$`X40_100mille_pour:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Proportion of people earning between 40K$ and 100K$ (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

Conversely, the more people is earning between 40 000 and 100 000 CAD a year, the higher was the results of the CAQ, comforting it as a party trying to represent the middle class. Both the Parti Québécois and QS where unfavored in the division with a lower proportion of people earning this income.

Renting

p <- plot(marginal_effects(fit, categorical = T, effects = "loc_pourc"),
                      plot = F)

p$`loc_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() +  
  xlab("Proportion of people renting their housing (0-1)") + 
  ylab("Proportion of vote (0-1)") +
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

QS tended to be favored in the division with more people renting their housing, while PLQ was less popular in such divisions.

p <- plot(marginal_effects(fit, categorical = T, effects = "n_cout_hab_loc_pourc"),
                      plot = F)

## Denormalize the x-axis
library(scales)
minval <- min(X$cout_hab_loc_pourc)
maxval <- max(X$cout_hab_loc_pourc)

norm_trans <- function(minval, maxval){
  require(scales)
  normalize <- function(x){
    top <- (b-a) * (x - min(x, na.rm = T))
    bot <- max(x, na.rm = T) - min(x, na.rm = T)
    y <- top/bot
    }
  denormalize <- function(xt) {xt*(maxval-minval) + minval}
  trans_new("normalize", transform = normalize, inverse = denormalize)
}

p$`n_cout_hab_loc_pourc:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Median housing rent ($)") + 
  ylab("Proportion of vote (0-1)") +
  scale_x_continuous(trans = norm_trans(minval, maxval), labels = comma) + 
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

Similarly, the higher the rent was, the more people voted for QS, and the less they voted for PLQ.

Estate value

## Denormalize the x-axis
minval <- min(X$valeur_logement_mediane)
maxval <- max(X$valeur_logement_mediane)
p <- plot(marginal_effects(fit, categorical = T, effects = "n_valeur_logement_mediane"),
                      plot = F)

p$`n_valeur_logement_mediane:cats__`$data %>% 
  ggplot(aes(x = effect1__)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = effect2__), alpha = 0.2) +
  geom_line(aes(y = estimate__, color = effect2__)) + 
  theme_minimal() + 
  xlab("Median estate value ($)") + 
  ylab("Proportion of vote (0-1)") +
  scale_x_continuous(trans = norm_trans(minval, maxval), labels = comma) + 
  scale_color_manual(values = c("black", "light blue", "blue", "red", "dark orange")) + 
  scale_fill_manual(values = c("black", "light blue", "blue", "red", "dark orange"))

The higher the value of estates, the higher was the score of the PLQ, and the lower were the ones of QS and PLQ!

Conclusion

We first show that dirichlet regression is an efficient and relatively simple solution to analyse compositional data. Concrete and simple functions are implemented to estimate the parameters in a bayesian framework in brms.

Looking at the results, this analysis showed :

Coalition Avenir Québec get its supporters in division with high concentration of the 40-100K CAD revenues, showing definitly its middle-class audience;
Parti Libéral du Québec touched especially English speakers and divisions with people owning expensive estates;
Quebec Solidaire is favored in divisions with a high proportion of renters, especially building are old and the rent cost is high. They are also favored in divisions with high usage of bicycle to go to work, denoting maybe environmental-friendly lower-to-middle class supporters.

To go further

We did not dealt with spatial autocorrelation. Try the following solution :

Define hierarchical parameters per region, to capture the particularity of the vote in each of the administrative divisions;
Try to define a CAR or ICAR components to the model. One can define an adjacency matrix between electoral divisions using the shapefile used to get geographical informations. brms allows the definition of autocorrelation structure by the argument autocor of the bf function. I do not know if it works with Dirichlet regression.

References

Douma, Jacob C., and James T. Weedon. 2019. “Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression.” Methods in Ecology and Evolution 10 (9): 1412–30. doi:10.1111/2041-210x.13234.

Ferrari, Silvia L.P., and Francisco Cribari-Neto. 2004. “Beta regression for modelling rates and proportions.” Journal of Applied Statistics 31 (7): 799–815. doi:10.1080/0266476042000214501.

Maier, Marco J. 2014. “DirichletReg: Dirichlet regression for compositional data in R.” Research Report Series, no. 125: 1–25. http://epub.wu.ac.at/4077/.

Tsagris, Michail, and Connie Stewart. 2018. “A Dirichlet Regression Model for Compositional Data with Zeros.” Lobachevskii Journal of Mathematics 39 (3): 398–412. doi:10.1134/S1995080218030198.

Dirichlet regression and the 2018 Quebec Elections (with brms)

Lucas Deschamps

28 septembre 2019

Context

Data

Sources

Preparation

Model

Beta distibution

Dirichlet distribution

Dirichlet regression

Dirichlet regression with `brms`

Data preparation

Model formulation

Priors

Prior predictive checks

Model fit

Predictions

English speakers

Public transportation

Bicycle

Recently immigrated people

Old housing

Agriculture

Income

Renting

Estate value

Conclusion

To go further

References

Dirichlet regression and the 2018 Quebec Elections (with brms)

Lucas Deschamps

28 septembre 2019

Context

Data

Sources

Preparation

Model

Beta distibution

Dirichlet distribution

Dirichlet regression

Dirichlet regression with brms

Data preparation

Model formulation

Priors

Prior predictive checks

Model fit

Predictions

English speakers

Public transportation

Bicycle

Recently immigrated people

Old housing

Agriculture

Income

Renting

Estate value

Conclusion

To go further

References

Dirichlet regression with `brms`