Fast data manipulation with dplyr and its allies

Charles Martin

October 2018

Our dataset
The 5 base operations on a dataset
This workshop was prepared with…

Our dataset

As in the previous workshop, we’ll use the mammal sleep dataset from ggplot2 to run our examples :

library(ggplot2)
data(msleep)
msleep

# A tibble: 83 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA    
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA    
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA    
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA    
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA    
# ... with 73 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

The 5 base operations on a dataset

Just as in the previous workshop, all operations shown here could be done in base R. But, as you’ll see, the dplyr-way is much more integrated and easier to read (more on this later)

Filtering a dataset

For example, let’s say that you want to keep only mammals of at least 200 g

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

filter(msleep, bodywt > 0.2)

# A tibble: 61 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA    
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA    
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA    
 4 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667
 5 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767
 6 North… Call… carni Carn… vu                   8.7       1.4       0.383
 7 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333
 8 Roe d… Capr… herbi Arti… lc                   3        NA        NA    
 9 Goat   Capri herbi Arti… lc                   5.3       0.6      NA    
10 Guine… Cavis herbi Rode… domesticated         9.4       0.8       0.217
# ... with 51 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

Please note that as in all object manipulations in R, the original object is not afffected and the result is “loss” unless it is assigned to a new object…

big_mammals <- filter(msleep, bodywt > 0.2)

In RStudio, you can also inspect the result of such an operation, with or without saving it first through the View function :

View(filter(msleep, bodywt > 0.2))
View(grands_mams)

Four important caveats when filtering datasets

#1 Floating point numbers are not indefinitely precise

1/49*49 == 1

[1] FALSE

This happens because when doing calculations, R can only keep a certain number of decimal places. Which means that in some cases, rounding errors can complicate comparisons.

To work around this problem, one can use the near function

near(1/49*49, 1)

[1] TRUE

#2 `=` doesn’t mean equal

In R, as in most programming languages, the = is used to assign the result of an operation to an object (in R, it is a synonym of <-). To check if two objects are equal, one must use the == operator.

filter(msleep, vore = "omni")

Error: `vore` (`vore = "omni"`) must not be named, do you need `==`?

3 Missing data have some unintuitive behaviors

You cannot check for missing values with the == operator

NA == NA

[1] NA

The underlying logic here being that, if I do not know Paul’s age, and I do not now Jack’s age, the answer to the question : Do Paul and Jack have the same age isn’t TRUE, it is I don’t know (NA)

To check for missing values, one must use the ia.na function :

filter(msleep, is.na(conservation))

# A tibble: 29 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA    
 2 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767
 3 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA    
 4 Afric… Cric… omni  Rode… <NA>                 8.3       2        NA    
 5 Weste… Euta… herbi Rode… <NA>                14.9      NA        NA    
 6 Galago Gala… omni  Prim… <NA>                 9.8       1.1       0.55
 7 Human  Homo  omni  Prim… <NA>                 8         1.9       1.5  
 8 Macaq… Maca… omni  Prim… <NA>                10.1       1.2       0.75
 9 "Vole… Micr… herbi Rode… <NA>                12.8      NA        NA    
10 Littl… Myot… inse… Chir… <NA>                19.9       2         0.2  
# ... with 19 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

#4 To combine comparisons, you need to think more like a machine

R includes operators that allow you to combine conditions either with OR (|) or AND (&). Their usage is a bit different than your normal flow of thought.

For example, if you want all mammals that are either omnivorous or carnivorous, you’d be tempted to write :

filter(msleep, vore == "omni" | "carni")

Error in filter_impl(.data, quo): Evaluation error: operations are possible only for numeric, logical or complex types.

But R is a bit dumber than that. You need to speficify every condition in details, e.g. :

filter(msleep, vore == "omni" | vore == "carni")

# A tibble: 39 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA    
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA    
 3 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133
 4 North… Call… carni Carn… vu                   8.7       1.4       0.383
 5 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333
 6 Grivet Cerc… omni  Prim… lc                  10         0.7      NA    
 7 Star-… Cond… omni  Sori… lc                  10.3       2.2      NA    
 8 Afric… Cric… omni  Rode… <NA>                 8.3       2        NA    
 9 Lesse… Cryp… omni  Sori… lc                   9.1       1.4       0.15
10 Long-… Dasy… carni Cing… lc                  17.4       3.1       0.383
# ... with 29 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

Such a redundant syntax can easily be shortened with the %in% operator.

filter(msleep, vore %in% c("omni", "carni"))

# A tibble: 39 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA    
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA    
 3 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133
 4 North… Call… carni Carn… vu                   8.7       1.4       0.383
 5 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333
 6 Grivet Cerc… omni  Prim… lc                  10         0.7      NA    
 7 Star-… Cond… omni  Sori… lc                  10.3       2.2      NA    
 8 Afric… Cric… omni  Rode… <NA>                 8.3       2        NA    
 9 Lesse… Cryp… omni  Sori… lc                   9.1       1.4       0.15
10 Long-… Dasy… carni Cing… lc                  17.4       3.1       0.383
# ... with 29 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

The %in% can also be used to list you prepare before your statement…

to_keep <- c("omni", "carni", "herbi")
filter(msleep, vore %in% to_keep)

# A tibble: 71 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA    
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA    
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA    
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383
 8 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333
 9 Roe d… Capr… herbi Arti… lc                   3        NA        NA    
10 Goat   Capri herbi Arti… lc                   5.3       0.6      NA    
# ... with 61 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

Conditions can also be reversed

The exclamations point(!) can be used to inverse the result of a condition. For example, to extract all mammals except for omnivorous ones :

filter(msleep, !(vore == "omni"))

# A tibble: 56 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA    
 2 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA    
 3 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667
 4 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767
 5 North… Call… carni Carn… vu                   8.7       1.4       0.383
 6 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333
 7 Roe d… Capr… herbi Arti… lc                   3        NA        NA    
 8 Goat   Capri herbi Arti… lc                   5.3       0.6      NA    
 9 Guine… Cavis herbi Rode… domesticated         9.4       0.8       0.217
10 Chinc… Chin… herbi Rode… domesticated        12.5       1.5       0.117
# ... with 46 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

Sorting

By default, sorting in R happens in an ascending way

arrange(msleep, bodywt)

# A tibble: 83 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Lesse… Cryp… omni  Sori… lc                   9.1       1.4       0.15
 2 Littl… Myot… inse… Chir… <NA>                19.9       2         0.2  
 3 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133
 4 Deer … Pero… <NA>  Rode… <NA>                11.5      NA        NA    
 5 House… Mus   herbi Rode… nt                  12.5       1.4       0.183
 6 Big b… Epte… inse… Chir… lc                  19.7       3.9       0.117
 7 North… Onyc… carni Rode… lc                  14.5      NA        NA    
 8 "Vole… Micr… herbi Rode… <NA>                12.8      NA        NA    
 9 Afric… Rhab… omni  Rode… <NA>                 8.7      NA        NA    
10 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA    
# ... with 73 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

You need to add a special modifier to reverse that order

arrange(msleep, desc(bodywt))

# A tibble: 83 x 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
 1 Afric… Loxo… herbi Prob… vu                   3.3      NA        NA    
 2 Asian… Elep… herbi Prob… en                   3.9      NA        NA    
 3 Giraf… Gira… herbi Arti… cd                   1.9       0.4      NA    
 4 Pilot… Glob… carni Ceta… cd                   2.7       0.1      NA    
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667
 6 Horse  Equus herbi Peri… domesticated         2.9       0.6       1    
 7 Brazi… Tapi… herbi Peri… vu                   4.4       1         0.9  
 8 Donkey Equus herbi Peri… domesticated         3.1       0.4      NA    
 9 Bottl… Turs… carni Ceta… <NA>                 5.2      NA        NA    
10 Tiger  Pant… carni Carn… en                  15.8      NA        NA    
# ... with 73 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
#   bodywt <dbl>

Exercise one

Find, in ascending order of body weight, the list of all non-domestic herbivorous animals

Select some columns

select(msleep, vore, brainwt, bodywt)

# A tibble: 83 x 3
   vore   brainwt  bodywt
   <chr>    <dbl>   <dbl>
carni NA        50    
omni   0.0155    0.48
herbi NA         1.35
omni   0.00029   0.019
herbi  0.423   600    
herbi NA         3.85
carni NA        20.5  
<NA>  NA         0.045
carni  0.07     14    
herbi  0.0982   14.8  
# ... with 73 more rows

You can also specify a series of columns, as long as they are adjacent in the dataset

select(msleep, sleep_total:awake)

# A tibble: 83 x 4
   sleep_total sleep_rem sleep_cycle awake
         <dbl>     <dbl>       <dbl> <dbl>
      12.1      NA        NA      11.9
      17         1.8      NA       7  
      14.4       2.4      NA       9.6
      14.9       2.3       0.133   9.1
       4         0.7       0.667  20  
      14.4       2.2       0.767   9.6
       8.7       1.4       0.383  15.3
       7        NA        NA      17  
      10.1       2.9       0.333  13.9
       3        NA        NA      21  
# ... with 73 more rows

Please note that, as before, as long as you don’t override the object, the original dataset is not affected by column selection.

Adding columns

Because columns are added (by default) to the rightmost of the dataset, we create ourselves a simplified dataset just to see more easily what we are doing.

weights <- select(msleep,ends_with("wt"))
weights

# A tibble: 83 x 2
    brainwt  bodywt
      <dbl>   <dbl>
NA        50    
0.0155    0.48
NA         1.35
0.00029   0.019
0.423   600    
NA         3.85
NA        20.5  
NA         0.045
0.07     14    
0.0982   14.8  
# ... with 73 more rows

To add a column containing brain size in grams instead of kilograms

mutate(weights, cerveau_g = brainwt*1000)

# A tibble: 83 x 3
    brainwt  bodywt cerveau_g
      <dbl>   <dbl>     <dbl>
NA        50        NA    
0.0155    0.48     15.5  
NA         1.35     NA    
0.00029   0.019     0.290
0.423   600       423    
NA         3.85     NA    
NA        20.5      NA    
NA         0.045    NA    
0.07     14        70    
0.0982   14.8      98.2  
# ... with 73 more rows

Note that many columns can be used in the same calculation, e.g. to calculate the relative size of the brain :

mutate(weights, rel_brain = brainwt / bodywt)

# A tibble: 83 x 3
    brainwt  bodywt rel_brain
      <dbl>   <dbl>     <dbl>
NA        50     NA       
0.0155    0.48   0.0323  
NA         1.35  NA       
0.00029   0.019  0.0153  
0.423   600      0.000705
NA         3.85  NA       
NA        20.5   NA       
NA         0.045 NA       
0.07     14      0.005   
0.0982   14.8    0.00664
# ... with 73 more rows

Combining many operations in a chain

This is where, in my opinion, that the dplyr-way really shines.

With the few functions we’ve seen so far, we can already do a lot of things. For example, if we wished to find the name and the weight of the 10 smallest mammals in the dataset :

x <- arrange(msleep,bodywt) # sort from smallest to largest
y <- mutate(x,rang = row_number())# add a rank column
z <- filter(y, rang <= 10)# keep only the 10 smallest ones
select(z,name,bodywt)# keep only name and weight columns

# A tibble: 10 x 2
   name                       bodywt
   <chr>                       <dbl>
Lesser short-tailed shrew   0.005
Little brown bat            0.01
Greater short-tailed shrew  0.019
Deer mouse                  0.021
House mouse                 0.022
Big brown bat               0.023
Northern grasshopper mouse  0.028
"Vole "                     0.035
African striped mouse       0.044
Vesper mouse                0.045

Notice that on the way, we create many useless intermediate objects (x,y and z), which we could easily eliminate.

select(filter(mutate(arrange(msleep,bodywt),rang = row_number()), rang <= 10),name,bodywt)

But doing so, we loose a lot in readability. We could split that code again into separate lines and use identation to clarify things :

select(
  filter(
    mutate(
      arrange(msleep,bodywt),
      rang = row_number()
    ),
    rang <= 10
  ),
  name,
  bodywt
)

But the resulting code is not necessarily easy to read, and at first sight, it’s hard to know which dataset is affected. It is also annoying that you need to read from center to the outside, which is not natural for most readers.

This is where the pipe operator (%>%) from the magrittr library comes in really handy :

msleep %>%
  arrange(bodywt) %>%
  mutate(rang = row_number()) %>%
  filter(rang <= 10) %>%
  select(name, bodywt)

NB there is no need to manually load the pipe operator as dplyr does this for us everytime we load it.

%>% transforms our code in a way that is easier for us, while keeping it interpretable by R…

Code becomes easier to read
We get back our natural left-to-right and top-to-bottom reading sequence
No more intermediate objects littering our code
The dataset is the first thing in the chain, clearly indicating on what the code applies
Every line of code begins with an action verb

I know, %>% is clunky to type, but those of you with RStudio can use the : Ctrl+Shift+M to type it rapidly.

All librairies from Hadley Wickam’s tidyverse are required to support the pipe operator… except one : ggplot2.

ggplot2 doesn’t support the pipe operator, and probably never will. You can read the whole story here : https://community.rstudio.com/t/why-cant-ggplot2-use/4372/7.

ggplot2 can nevertheless be inserted at the end of a chain :

msleep %>%
  filter(bodywt > 0.200) %>%
  mutate(
    l_corps = log(bodywt),
    l_cerveau = log(brainwt)
  ) %>%
  ggplot(aes(x = l_corps,y = l_cerveau, col = vore)) +
  geom_point()

Warning: Removed 19 rows containing missing values (geom_point).

Mixing ggplot2 with dplyr forces you to be very alert, because you now have database manipulation statements chained together with %>% a graphic layers connected with +

Second exercise

Just try again the first exercise (finding, in ascending order of body weight, the list of all non-domestic herbivorous animals) but this time, taking advantage of the pipe operator to simplify your code.

Summarizing a dataset

With the summarize function, you can summarize many functions or variables at once :

msleep %>%
  summarize(
    mean_weight = mean(bodywt),
    sd_weight = sd(bodywt)
  )

# A tibble: 1 x 2
  mean_weight sd_weight
        <dbl>            <dbl>
1        166.             787.

Such an operation becomes much more powerful if we use a grouping clause :

msleep %>%
  group_by(vore) %>%
  summarize(
    mean_weight = mean(bodywt),
    sd_weight = sd(bodywt)
  )

# A tibble: 5 x 3
  vore    mean_weidth        sd_weight
  <chr>         <dbl>            <dbl>
1 carni        90.8             182.  
2 herbi       367.             1244.  
3 insecti      12.9              26.4
4 omni         12.7              24.7
5 <NA>          0.858             1.34

This workshop was prepared with…

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_1.1.0   tidyr_0.8.1    bindrcpp_0.2.2 dplyr_0.7.7   
[5] ggplot2_3.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17     pillar_1.2.3     compiler_3.5.1   cellranger_1.1.0
 [5] plyr_1.8.4       bindr_0.1.1      tools_3.5.1      digest_0.6.15   
 [9] evaluate_0.10.1  tibble_1.4.2     gtable_0.2.0     pkgconfig_2.0.1
[13] rlang_0.2.1      cli_1.0.0        yaml_2.1.19      withr_2.1.2     
[17] stringr_1.3.1    knitr_1.20       rprojroot_1.3-2  grid_3.5.1      
[21] tidyselect_0.2.4 glue_1.2.0       R6_2.2.2         rmarkdown_1.10  
[25] purrr_0.2.5      magrittr_1.5     backports_1.1.2  scales_0.5.0    
[29] htmltools_0.3.6  assertthat_0.2.0 colorspace_1.3-2 labeling_0.3    
[33] utf8_1.1.4       stringi_1.2.3    lazyeval_0.2.1   munsell_0.5.0   
[37] crayon_1.3.4