Big Data Manipulation

Roxanne Giguère-Tremblay & Arthur de Grandpré

May 2021

Context : Big Data
The data.table package
- Introduction
- Installation
Packages
Data
The vroom package
References

Context : Big Data

What is “Big Data”? Databases that are considered big data are so large that normal database management tools have great difficulty in performing the manipulations required and their handling time becomes excessively long and non-productive.

Where do we find big data in Environmental Sciences? - GIS - Global monitoring - climate change - Climate modelling - Microbial ecology with genomics - Citizen participation via applications such as Ebird (for example) - etc…

What are the benefits of Big Data? First, at the statistical level, having a lot of data helps to compensate for the possibility that some data greatly influences the results via the sample size effect. That is, when we have a little set of data, an outlier point will probably influence the outcome of our analysis. However, if this same outlier point ends up in a very big data set, it will have less impact on the final result.

Second, at a more global level, Big Data allows for better environmental monitoring and fairer decisions about environmental laws as analyses and models are fairer and more reliable than before.

In this workshop… For these reasons and because many of us are probably already using Big Data in our research, it seems relevant to offer this workshop. The latter will allow us to be more efficient and above all faster in handling our data.

Although the word “Big Data” can be scary, it is possible to manipulate them without too many complexities and that is what we will see today. We’ll see the basic outlines of what can be done with R data.table and Vroom packages. Several references are also available if you want to learn more about the packages or if you have questions that are not presented in the workshop.

The data.table package

Introduction

Let’s start with the data.table package. This R package is used for data manipulation in table form, as is the default function of data.frame. However, it is much faster and for some people it also has a more intuitive syntax than the default function. Another advantage is that a database in the form of a data.table can be used in the same way as a data.frame in different packages such as ggplot2, for example.

An important piece of information to know about this package is that it has won each of the many comparison exercises that have been performed against dplyr and even with panda in Python. Thus, it is one of the important packages to know in Data Science.

Noted that the tutorial I am presenting today is greatly inspired from this one: https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/.

Installation

Packages

The data.table package can be installed using either CRAN for the latest stable build, or Github for the newest features.

# installing data.table
# from CRAN
install.packages('data.table')

# from Github
install.packages("data.table",
                 repos="https://Rdatatable.gitlab.io/data.table")

# check version and update
data.table::update.dev.pkg() # Workshop built on version 1.14.1.

# load the library
library("data.table")

Data

Set the directory

#setwd("C:/Users/Arthur/Desktop/bigdata_web") # Change for your own directory!

In this course we will use the data set abalone in the AppliedPredictiveModeling package.

# Installation du package AppliedPredictiveModeling pour télécharger le dataset qui sera utilisé pour l'atelier
install.packages("AppliedPredictiveModeling")

You can download it here and save it on your R project directly on your computer.

library("AppliedPredictiveModeling")

 data("abalone")
 dset <- abalone
 head(dset)

##   Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight
## 1    M        0.455    0.365  0.095      0.5140        0.2245        0.1010
## 2    M        0.350    0.265  0.090      0.2255        0.0995        0.0485
## 3    F        0.530    0.420  0.135      0.6770        0.2565        0.1415
## 4    M        0.440    0.365  0.125      0.5160        0.2155        0.1140
## 5    I        0.330    0.255  0.080      0.2050        0.0895        0.0395
## 6    I        0.425    0.300  0.095      0.3515        0.1410        0.0775
##   ShellWeight Rings
## 1       0.150    15
## 2       0.070     7
## 3       0.210     9
## 4       0.155    10
## 5       0.055     7
## 6       0.120     8

#
#.csv creation
 write.csv(dset, "abalone.csv")

Reading data

Even though data.frame and data.table objects are similar, they are not used in the same way; data.table uses a more intuitive syntax.

The fread function (fast read) is equivalent to read.csv. It can import local or remote csv files approximately 20 times faster than read.csv. It also generates a data.table format object.

Since objects of class data.table are made from a data.frame, all functions accepting a data.frame should work with data.table

dset <- fread("abalone.csv")  
head(dset) # To visualy confirm data structure

##    V1 Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight
## 1:  1    M        0.455    0.365  0.095      0.5140        0.2245        0.1010
## 2:  2    M        0.350    0.265  0.090      0.2255        0.0995        0.0485
## 3:  3    F        0.530    0.420  0.135      0.6770        0.2565        0.1415
## 4:  4    M        0.440    0.365  0.125      0.5160        0.2155        0.1140
## 5:  5    I        0.330    0.255  0.080      0.2050        0.0895        0.0395
## 6:  6    I        0.425    0.300  0.095      0.3515        0.1410        0.0775
##    ShellWeight Rings
## 1:       0.150    15
## 2:       0.070     7
## 3:       0.210     9
## 4:       0.155    10
## 5:       0.055     7
## 6:       0.120     8

class(dset) # To confirm object class

## [1] "data.table" "data.frame"

Even though fread reads data faster than read.csv, it doesn’t really show when reading small datasets. The difference becomes larger when using very large or complexe datasets (Big Data).

We will create a bigger data set (1M lines) that will allow us to visualize the read speed difference between read.csv and fread.

# Create a large .csv file
set.seed(100) # set random seed so all runs are the same
m <- data.frame(matrix(runif(1000000), nrow=1000000))
write.csv(m, "m2.csv", row.names = F)

# Time taken by read.csv to import
system.time({m_df <- read.csv("m2.csv")})

##    user  system elapsed
##   2.135   0.066   2.219

# Time taken by fread to import
system.time({m_dt <- fread("m2.csv")})

##    user  system elapsed
##   0.045   0.008   0.053

Converting a data.frame into a data.table

Conversion from data.frame to data.table can be done using two different functions :

data.table() or as.data.table()
setDT()

The main difference between both functions is that the first creates a copy while the second directly modifies the original object.

The first function as.data.table() does not include row names, so they must be reassigned if necessary.

1st function :

dset_dt <- as.data.table(dset) # Conversion vers DT
class(dset_dt) # Le data.table est fait à partir d'un data.frame

## [1] "data.table" "data.frame"

rownames(dset_dt) = rownames(dset)
dset_dt

##         V1 Type LongestShell Diameter Height WholeWeight ShuckedWeight
##    1:    1    M        0.455    0.365  0.095      0.5140        0.2245
##    2:    2    M        0.350    0.265  0.090      0.2255        0.0995
##    3:    3    F        0.530    0.420  0.135      0.6770        0.2565
##    4:    4    M        0.440    0.365  0.125      0.5160        0.2155
##    5:    5    I        0.330    0.255  0.080      0.2050        0.0895
##   ---                                                                 
## 4173: 4173    F        0.565    0.450  0.165      0.8870        0.3700
## 4174: 4174    M        0.590    0.440  0.135      0.9660        0.4390
## 4175: 4175    M        0.600    0.475  0.205      1.1760        0.5255
## 4176: 4176    F        0.625    0.485  0.150      1.0945        0.5310
## 4177: 4177    M        0.710    0.555  0.195      1.9485        0.9455
##       VisceraWeight ShellWeight Rings
##    1:        0.1010      0.1500    15
##    2:        0.0485      0.0700     7
##    3:        0.1415      0.2100     9
##    4:        0.1140      0.1550    10
##    5:        0.0395      0.0550     7
##   ---                                
## 4173:        0.2390      0.2490    11
## 4174:        0.2145      0.2605    10
## 4175:        0.2875      0.3080     9
## 4176:        0.2610      0.2960    10
## 4177:        0.3765      0.4950    12

2nd function

dset_copy <- copy(dset)
setDT(dset_copy)
class(dset_copy)

## [1] "data.table" "data.frame"

Converting a data.table to data.frame

If you wish to convert a data.table into a data.frame, it is just as simple using the setDF() or as.data.frame() functions

setDF(dset_copy)
class(dset_copy)

## [1] "data.frame"

Data manipulations

Conditionnal filtering

One of the main differences between both types of datasets is that data.table knows the names of its columns. This makes code writing more intuitive.

This is and example of the syntax used to filter a data.frame :

head(dset_copy[dset_copy$Type == "M" & dset_copy$Rings == 10, ]) # lines where type is M with 10 rings

##      V1 Type LongestShell Diameter Height WholeWeight ShuckedWeight
## 4     4    M        0.440    0.365  0.125      0.5160        0.2155
## 12   12    M        0.430    0.350  0.110      0.4060        0.1675
## 31   31    M        0.580    0.470  0.165      0.9975        0.3935
## 53   53    M        0.485    0.360  0.130      0.5415        0.2595
## 88   88    M        0.560    0.440  0.160      0.8645        0.3305
## 104 104    M        0.530    0.415  0.140      0.7240        0.3105
##     VisceraWeight ShellWeight Rings
## 4          0.1140       0.155    10
## 12         0.0810       0.135    10
## 31         0.2420       0.330    10
## 53         0.0960       0.160    10
## 88         0.2075       0.260    10
## 104        0.1675       0.205    10

While the syntax for a data.table goes like this :

DT[i, j, by] where i = line filter, j = column selection, by = grouping

head(dset_dt[Type== “M” & Rings== 10, ]) # lines where type is M with 10 rings and no grouping

## V1 Type LongestShell Diameter Height WholeWeight ShuckedWeight ## 1: 4 M 0.440 0.365 0.125 0.5160 0.2155 ## 2: 12 M 0.430 0.350 0.110 0.4060 0.1675 ## 3: 31 M 0.580 0.470 0.165 0.9975 0.3935 ## 4: 53 M 0.485 0.360 0.130 0.5415 0.2595 ## 5: 88 M 0.560 0.440 0.160 0.8645 0.3305 ## 6: 104 M 0.530 0.415 0.140 0.7240 0.3105 ## VisceraWeight ShellWeight Rings ## 1: 0.1140 0.155 10 ## 2: 0.0810 0.135 10 ## 3: 0.2420 0.330 10 ## 4: 0.0960 0.160 10 ## 5: 0.2075 0.260 10 ## 6: 0.1675 0.205 10

Column selection

Selecting columns in one of the most frequent database manipulation. Unlike a data.frame, a data.table can be subseted using its column names instead of their indexes, without quotation marks to obtain a vector, and with quotation marks to obtain a table.

# for a vector
head(dset_dt[, Diameter])

## [1] 0.365 0.265 0.420 0.365 0.255 0.300

# for a table
head(dset_dt[, "Diameter"])

##    Diameter
## 1:    0.365
## 2:    0.265
## 3:    0.420
## 4:    0.365
## 5:    0.255
## 6:    0.300

# equivalent to
# dset_dt[, .(Diameter)]
# equivalent to
# dset_dt[, 3]

Selecting multiple columns

To select multiple columns, use a vector of the column names you want to select.

head(dset_dt[,.(Type, Diameter, Rings)])

##    Type Diameter Rings
## 1:    M    0.365    15
## 2:    M    0.265     7
## 3:    F    0.420     9
## 4:    M    0.365    10
## 5:    I    0.255     7
## 6:    I    0.300     8

# equivalent to
# dset_dt[,c("Type","Diameter","Rings")]
# equivalent to
# dset_dt[,c(1,3,9)]

Dropping columns

To drop columns from the dataset, use the ! negation operator before the vector of columns to exclude.

drop_cols <- c("Height", "ShuckedWeight")
head(dset_dt[, !drop_cols, with = FALSE]) # Setting with = FALSE disables the ability to refer to columns as if they are variables, thereby restoring the “data.frame mode”

##    V1 Type LongestShell Diameter WholeWeight VisceraWeight ShellWeight Rings
## 1:  1    M        0.455    0.365      0.5140        0.1010       0.150    15
## 2:  2    M        0.350    0.265      0.2255        0.0485       0.070     7
## 3:  3    F        0.530    0.420      0.6770        0.1415       0.210     9
## 4:  4    M        0.440    0.365      0.5160        0.1140       0.155    10
## 5:  5    I        0.330    0.255      0.2050        0.0395       0.055     7
## 6:  6    I        0.425    0.300      0.3515        0.0775       0.120     8

# equivalent to
# dset_dt[, !c("Height", "ShuckedWeight")]
# equivalent to
# dset_dt[, !c(4,10)]

Renaming columns

The setnames function allows to rename a column by specifying the actual names and new names.

setnames(dset_dt, "Diameter", "Dia", skip_absent = T) # the skip_absent argument allows to not rename all columns
colnames(dset_dt)

##  [1] "V1"            "Type"          "LongestShell"  "Dia"          
##  [5] "Height"        "WholeWeight"   "ShuckedWeight" "VisceraWeight"
##  [9] "ShellWeight"   "Rings"

Creating a new column from existing columns

It is sometimes necessary to create new columns from existing ones, like by simple operations such as sums, products or mean to create new variables.

1 at a time

# data.frame syntax (works on data.table)
  # dset_dt$Masse_Tot <- dset_dt$ShuckedWeight +   dset_dt$VisceraWeight + dset_dt$ShellWeight

# data.table syntax
dset_dt[, Masse_Tot2 := ShuckedWeight + VisceraWeight + ShellWeight]

## Warning in `[.data.table`(dset_dt, , `:=`(Masse_Tot2, ShuckedWeight +
## VisceraWeight + : Invalid .internal.selfref detected and fixed by taking
## a (shallow) copy of the data.table so that := can add this new column by
## reference. At an earlier point, this data.table has been copied by R (or was
## created manually using structure() or similar). Avoid names<- and attr<- which
## in R currently (and oddly) may copy the whole data.table. Use set* syntax
## instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't
## help, please report your use case to the data.table issue tracker so the root
## cause can be fixed or this message improved.

head(dset_dt)

##    V1 Type LongestShell   Dia Height WholeWeight ShuckedWeight VisceraWeight
## 1:  1    M        0.455 0.365  0.095      0.5140        0.2245        0.1010
## 2:  2    M        0.350 0.265  0.090      0.2255        0.0995        0.0485
## 3:  3    F        0.530 0.420  0.135      0.6770        0.2565        0.1415
## 4:  4    M        0.440 0.365  0.125      0.5160        0.2155        0.1140
## 5:  5    I        0.330 0.255  0.080      0.2050        0.0895        0.0395
## 6:  6    I        0.425 0.300  0.095      0.3515        0.1410        0.0775
##    ShellWeight Rings Masse_Tot2
## 1:       0.150    15     0.4755
## 2:       0.070     7     0.2180
## 3:       0.210     9     0.6080
## 4:       0.155    10     0.4845
## 5:       0.055     7     0.1840
## 6:       0.120     8     0.3385

Multiple at a time

Suffices to use the := symbol as a function.

dset_dt[,  `:=`(Masse_Tot3 = ShuckedWeight * VisceraWeight * ShellWeight,
                  Masse_Tot4 = ShuckedWeight - VisceraWeight - ShellWeight)]
head(dset_dt)

##    V1 Type LongestShell   Dia Height WholeWeight ShuckedWeight VisceraWeight
## 1:  1    M        0.455 0.365  0.095      0.5140        0.2245        0.1010
## 2:  2    M        0.350 0.265  0.090      0.2255        0.0995        0.0485
## 3:  3    F        0.530 0.420  0.135      0.6770        0.2565        0.1415
## 4:  4    M        0.440 0.365  0.125      0.5160        0.2155        0.1140
## 5:  5    I        0.330 0.255  0.080      0.2050        0.0895        0.0395
## 6:  6    I        0.425 0.300  0.095      0.3515        0.1410        0.0775
##    ShellWeight Rings Masse_Tot2   Masse_Tot3 Masse_Tot4
## 1:       0.150    15     0.4755 0.0034011750    -0.0265
## 2:       0.070     7     0.2180 0.0003378025    -0.0190
## 3:       0.210     9     0.6080 0.0076218975    -0.0950
## 4:       0.155    10     0.4845 0.0038078850    -0.0535
## 5:       0.055     7     0.1840 0.0001944387    -0.0050
## 6:       0.120     8     0.3385 0.0013113000    -0.0565

Grouping

The ease of grouping columns using data.table makes it the 2nd main reason to use it.

It is possible to group columns using the argument “by”, which replaces the more complex aggregate() function from base R.

Let’s use it to obtain mean diameter by abalone type.

dset_dt[, .(mean_dia=mean(Dia)), by=Type]

##    Type  mean_dia
## 1:    M 0.4392866
## 2:    F 0.4547322
## 3:    I 0.3264940

It is just as easy to do so for multiple factors.

head(dset_dt[, .(mean_dia=mean(Dia)), by=.(Type, Rings)])

##    Type Rings  mean_dia
## 1:    M    15 0.4577885
## 2:    M     7 0.3445625
## 3:    F     9 0.4478992
## 4:    M    10 0.4550000
## 5:    I     7 0.3076592
## 6:    I     8 0.3556569

This results in the mean diameter of abalones for each type and number of rings.

Joining multiple datasets

data.table allows the use of the merge() function like in base R but faster.

dset_dt$ID = row.names(dset_dt) # create an ID column
dt1 <- dset_dt[1:500,.(ID, Type, Dia)]
dt2 <- dset_dt[250:600,.(ID, WholeWeight)]

# Inner Join
merge(dt1, dt2, by='ID') # Only joins when data is matching on both sides.

##       ID Type   Dia WholeWeight
##   1: 250    I 0.270      0.2135
##   2: 251    I 0.250      0.1715
##   3: 252    M 0.470      1.1235
##   4: 253    F 0.455      1.0605
##   5: 254    F 0.460      1.0940
##  ---                           
## 247: 496    F 0.500      0.9530
## 248: 497    F 0.520      1.2480
## 249: 498    F 0.485      1.0105
## 250: 499    F 0.525      1.0385
## 251: 500    M 0.450      0.8740

#> <returns 251 rows>

# Left Join
merge(dt1, dt2, by='ID', all.x = T) # Returns all lines from left table with matching data from right table

##       ID Type   Dia WholeWeight
##   1:   1    M 0.365          NA
##   2:  10    F 0.440          NA
##   3: 100    F 0.375          NA
##   4: 101    I 0.265          NA
##   5: 102    M 0.435          NA
##  ---                           
## 496:  95    M 0.560          NA
## 497:  96    M 0.535          NA
## 498:  97    M 0.435          NA
## 499:  98    M 0.375          NA
## 500:  99    M 0.370          NA

#> <returns 500 rows>

# Outer Join
merge(dt1, dt2, by='ID', all = T)  # Returns all lines from left and right tables, filling missing matches with NA

##       ID Type   Dia WholeWeight
##   1:   1    M 0.365          NA
##   2:  10    F 0.440          NA
##   3: 100    F 0.375          NA
##   4: 101    I 0.265          NA
##   5: 102    M 0.435          NA
##  ---                           
## 596:  95    M 0.560          NA
## 597:  96    M 0.535          NA
## 598:  97    M 0.435          NA
## 599:  98    M 0.375          NA
## 600:  99    M 0.370          NA

#> <returns 600 rows>

data.table specifics

What are .N and .I

.N returns the number of lines for a specified call. For example, if we want to know the number of unique value per type :

dset_dt[, .N, by=Type]

##    Type    N
## 1:    M 1528
## 2:    F 1307
## 3:    I 1342

.I returns the line numbers. This argument is equivalent to using the which() function for a data.frame.

head(dset_dt[, .I])

## [1] 1 2 3 4 5 6

This returns the line numbers of the first lines of the dataset, which would reach 4177 without the head() function.

Thus, .I is used to obtain the line numbers that fills certain conditions. For example, which are the line numbers containing type “M” individuals ?

head(dset_dt[, .I[Type=="M"]])

## [1]  1  2  4  9 12 13

# or the lines where the number of rings is equal to 15
dset_dt[, .I[Rings == 15]]

##   [1]    1   29   32   76   91   95  102  103  151  199  230  254  255  256  259
##  [16]  274  281  293  340  379  381  411  416  457  483  488  494  496  503  506
##  [31]  508  541  543  615  625  668  686  723  724  730  733  750  758  760  761
##  [46]  777  779  780  786  796  808  884 1395 1748 1934 2108 2179 2192 2237 2268
##  [61] 2273 2275 2320 2329 2330 2332 2365 2368 2405 2406 2409 2422 2490 2497 2499
##  [76] 2539 2956 3163 3168 3169 3178 3205 3208 3240 3241 3243 3248 3278 3289 3290
##  [91] 3303 3323 3338 3353 3383 3868 3871 3878 3880 3883 3901 3913 3942

Chaining results

What are chains ?

Chains allow to apply multiple manipulations to data without stocking intermediate results in meomry. This can be critical when working with very heavy data.

For example, instead of writing two lines of code to make two manipulations, you can attach those two manipulations with brackets [].

# Long format : 2 commands for 2 manipulations, with multiple objects saved in memory
dt1 <- dset_dt[, .(mean_dia=mean(Dia),
                     mean_rings=mean(Rings),
                     mean_masse=mean(WholeWeight))
                    ,by=Type]
output <- dt1[order(Type), ]
output

##    Type  mean_dia mean_rings mean_masse
## 1:    F 0.4547322  11.129304  1.0465321
## 2:    I 0.3264940   7.890462  0.4313625
## 3:    M 0.4392866  10.705497  0.9914594

# Short format : 1 command for 2 manipulations, single object in memory
output <- dset_dt[, .(mean_dia   = mean(Dia),
                      mean_rings = mean(Rings),
                      mean_masse = mean(WholeWeight)),
                  by=Type][
                    order(Type), ]
output

##    Type  mean_dia mean_rings mean_masse
## 1:    F 0.4547322  11.129304  1.0465321
## 2:    I 0.3264940   7.890462  0.4313625
## 3:    M 0.4392866  10.705497  0.9914594

Using chains is also faster than not using them.

system.time({dt1 <- dset_dt[, .(mean_dia=mean(Dia),
                     mean_rings=mean(Rings),
                     mean_masse=mean(WholeWeight))
                    ,by=Type]
output <- dt1[order(Type), ]
output})

##    user  system elapsed
##   0.001   0.000   0.002

system.time({output <- dset_dt[, .(mean_dia=mean(Dia),
                     mean_rings=mean(Rings),
                     mean_masse=mean(WholeWeight)), by=Type][order(Type), ]
output})

##    user  system elapsed
##   0.002   0.000   0.002

The .SD object and lapply()

The object .SD is another data.table containing the subsets defined by the by argument as a list. With this object and the lapply() function, it is possible to apply functions to every columns in a series of subsets defined by by in a single call.

Let’s see what it looks like :

dset_dt[,print(.SD), by=Type]

##         V1 LongestShell   Dia Height WholeWeight ShuckedWeight VisceraWeight
##    1:    1        0.455 0.365  0.095      0.5140        0.2245        0.1010
##    2:    2        0.350 0.265  0.090      0.2255        0.0995        0.0485
##    3:    4        0.440 0.365  0.125      0.5160        0.2155        0.1140
##    4:    9        0.475 0.370  0.125      0.5095        0.2165        0.1125
##    5:   12        0.430 0.350  0.110      0.4060        0.1675        0.0810
##   ---                                                                       
## 1524: 4171        0.550 0.430  0.130      0.8395        0.3155        0.1955
## 1525: 4172        0.560 0.430  0.155      0.8675        0.4000        0.1720
## 1526: 4174        0.590 0.440  0.135      0.9660        0.4390        0.2145
## 1527: 4175        0.600 0.475  0.205      1.1760        0.5255        0.2875
## 1528: 4177        0.710 0.555  0.195      1.9485        0.9455        0.3765
##       ShellWeight Rings Masse_Tot2   Masse_Tot3 Masse_Tot4   ID
##    1:      0.1500    15     0.4755 0.0034011750    -0.0265    1
##    2:      0.0700     7     0.2180 0.0003378025    -0.0190    2
##    3:      0.1550    10     0.4845 0.0038078850    -0.0535    4
##    4:      0.1650     9     0.4940 0.0040187812    -0.0610    9
##    5:      0.1350    10     0.3835 0.0018316125    -0.0485   12
##   ---                                                          
## 1524:      0.2405    10     0.7515 0.0148341001    -0.1205 4171
## 1525:      0.2290     8     0.8010 0.0157552000    -0.0010 4172
## 1526:      0.2605    10     0.9140 0.0245301127    -0.0360 4174
## 1527:      0.3080     9     1.1210 0.0465330250    -0.0700 4175
## 1528:      0.4950    12     1.8170 0.1762104713     0.0740 4177
##         V1 LongestShell   Dia Height WholeWeight ShuckedWeight VisceraWeight
##    1:    3        0.530 0.420  0.135      0.6770        0.2565        0.1415
##    2:    7        0.530 0.415  0.150      0.7775        0.2370        0.1415
##    3:    8        0.545 0.425  0.125      0.7680        0.2940        0.1495
##    4:   10        0.550 0.440  0.150      0.8945        0.3145        0.1510
##    5:   11        0.525 0.380  0.140      0.6065        0.1940        0.1475
##   ---                                                                       
## 1303: 4161        0.585 0.475  0.165      1.0530        0.4580        0.2170
## 1304: 4162        0.585 0.455  0.170      0.9945        0.4255        0.2630
## 1305: 4169        0.515 0.400  0.125      0.6150        0.2865        0.1230
## 1306: 4173        0.565 0.450  0.165      0.8870        0.3700        0.2390
## 1307: 4176        0.625 0.485  0.150      1.0945        0.5310        0.2610
##       ShellWeight Rings Masse_Tot2  Masse_Tot3 Masse_Tot4   ID
##    1:      0.2100     9     0.6080 0.007621897    -0.0950    3
##    2:      0.3300    20     0.7085 0.011066715    -0.2345    7
##    3:      0.2600    16     0.7035 0.011427780    -0.1155    8
##    4:      0.3200    19     0.7855 0.015196640    -0.1565   10
##    5:      0.2100    14     0.5515 0.006009150    -0.1635   11
##   ---                                                         
## 1303:      0.3000    11     0.9750 0.029815800    -0.0590 4161
## 1304:      0.2845    11     0.9730 0.031837399    -0.1220 4162
## 1305:      0.1765     8     0.5860 0.006219772    -0.0130 4169
## 1306:      0.2490    11     0.8580 0.022019070    -0.1180 4173
## 1307:      0.2960    10     1.0880 0.041022936    -0.0260 4176
##         V1 LongestShell   Dia Height WholeWeight ShuckedWeight VisceraWeight
##    1:    5        0.330 0.255  0.080      0.2050        0.0895        0.0395
##    2:    6        0.425 0.300  0.095      0.3515        0.1410        0.0775
##    3:   17        0.355 0.280  0.085      0.2905        0.0950        0.0395
##    4:   22        0.380 0.275  0.100      0.2255        0.0800        0.0490
##    5:   43        0.240 0.175  0.045      0.0700        0.0315        0.0235
##   ---                                                                       
## 1338: 4159        0.480 0.355  0.110      0.4495        0.2010        0.0890
## 1339: 4164        0.390 0.310  0.085      0.3440        0.1810        0.0695
## 1340: 4165        0.390 0.290  0.100      0.2845        0.1255        0.0635
## 1341: 4166        0.405 0.300  0.085      0.3035        0.1500        0.0505
## 1342: 4167        0.475 0.365  0.115      0.4990        0.2320        0.0885
##       ShellWeight Rings Masse_Tot2   Masse_Tot3 Masse_Tot4   ID
##    1:       0.055     7     0.1840 0.0001944387    -0.0050    5
##    2:       0.120     8     0.3385 0.0013113000    -0.0565    6
##    3:       0.115     7     0.2495 0.0004315375    -0.0595   17
##    4:       0.085    10     0.2140 0.0003332000    -0.0540   22
##    5:       0.020     5     0.0750 0.0000148050    -0.0120   43
##   ---                                                          
## 1338:       0.140     8     0.4300 0.0025044600    -0.0280 4159
## 1339:       0.079     7     0.3295 0.0009937805     0.0325 4164
## 1340:       0.081     7     0.2700 0.0006455093    -0.0190 4165
## 1341:       0.088     7     0.2885 0.0006666000     0.0115 4166
## 1342:       0.156    10     0.4765 0.0032029920    -0.0125 4167

## Empty data.table (0 rows and 1 cols): Type

We obtain a list of data.table for every columns of dset_dt classified by the “Type” variable (3 tables, 3 levels).

With the lapply() function, it is possible to, for example, obtain the mean of every variables by “Type”. We have seen how to do it for one or a few columns, but lapply makes it simple and fast to do it for every variables.

dset_dt = dset_dt[,!c("ID")]
dset_dt[, lapply(.SD, mean), by=Type]

##    Type       V1 LongestShell       Dia    Height WholeWeight ShuckedWeight
## 1:    M 2062.942    0.5613907 0.4392866 0.1513809   0.9914594     0.4329460
## 2:    F 2043.846    0.5790933 0.4547322 0.1580107   1.0465321     0.4461878
## 3:    I 2162.646    0.4277459 0.3264940 0.1079955   0.4313625     0.1910350
##    VisceraWeight ShellWeight     Rings Masse_Tot2  Masse_Tot3  Masse_Tot4
## 1:    0.21554450   0.2819692 10.705497  0.9304598 0.043389969 -0.06456774
## 2:    0.23068860   0.3020099 11.129304  0.9788864 0.045927292 -0.08651071
## 3:    0.09201006   0.1281822  7.890462  0.4112273 0.005608583 -0.02915723

We obtain the mean of every variables for each 3 types of individuals.

Using keys

Keys are one of the core concepts of data.table. They index rows based on one or many reference columns, acting as a super row name that can be duplicated to point towards multiple lines. Using keys allow faster binary search (instead of linear) for ordering and subsetting data.

This reference can give more information on the topic :
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html

setkey(dset_dt, Rings) # set the key(s)
# for multiple reference columns, use setkey(dt, v1, v2)
key(dset_dt) # determine which key is in action

## [1] "Rings"

head(dset_dt) # visualize ordering by key

##     V1 Type LongestShell   Dia Height WholeWeight ShuckedWeight VisceraWeight
## 1: 237    I        0.075 0.055  0.010      0.0020        0.0010        0.0005
## 2: 720    I        0.150 0.100  0.025      0.0150        0.0045        0.0040
## 3: 238    I        0.130 0.100  0.030      0.0130        0.0045        0.0030
## 4: 239    I        0.110 0.090  0.030      0.0080        0.0025        0.0020
## 5: 307    I        0.165 0.120  0.030      0.0215        0.0070        0.0050
## 6: 521    M        0.210 0.150  0.050      0.0385        0.0155        0.0085
##    ShellWeight Rings Masse_Tot2 Masse_Tot3 Masse_Tot4
## 1:      0.0015     1     0.0030 7.5000e-10    -0.0010
## 2:      0.0050     2     0.0135 9.0000e-08    -0.0045
## 3:      0.0040     3     0.0115 5.4000e-08    -0.0025
## 4:      0.0030     3     0.0075 1.5000e-08    -0.0025
## 5:      0.0050     3     0.0170 1.7500e-07    -0.0030
## 6:      0.0100     3     0.0340 1.3175e-06    -0.0030

head(dset_dt[.(3)]) # subset where the key's value == 3

##     V1 Type LongestShell  Dia Height WholeWeight ShuckedWeight VisceraWeight
## 1: 238    I        0.130 0.10  0.030      0.0130        0.0045        0.0030
## 2: 239    I        0.110 0.09  0.030      0.0080        0.0025        0.0020
## 3: 307    I        0.165 0.12  0.030      0.0215        0.0070        0.0050
## 4: 521    M        0.210 0.15  0.050      0.0385        0.0155        0.0085
## 5: 527    M        0.155 0.11  0.040      0.0155        0.0065        0.0030
## 6: 721    I        0.160 0.11  0.025      0.0180        0.0065        0.0055
##    ShellWeight Rings Masse_Tot2 Masse_Tot3 Masse_Tot4
## 1:       0.004     3     0.0115 5.4000e-08    -0.0025
## 2:       0.003     3     0.0075 1.5000e-08    -0.0025
## 3:       0.005     3     0.0170 1.7500e-07    -0.0030
## 4:       0.010     3     0.0340 1.3175e-06    -0.0030
## 5:       0.005     3     0.0145 9.7500e-08    -0.0015
## 6:       0.005     3     0.0170 1.7875e-07    -0.0040

Our data.table is now ordered by the “Rings” variable.

Key can also be used to join two datasets rapidly using an identifier. In this case, the “Rings” variable is not a good key to join two datasets since it possess too many repetitions at multiple levels. The name of every observation, as stored in an “ID” variable becomes a better key for joining datasets.

dset_dt$ID = row.names(dset_dt)
dt1 <- dset_dt[,.(ID, Type, Dia)]
dt2 <- dset_dt[1:10,.(ID, Height, WholeWeight)]

setkey(dt1, ID)

dt1[dt2] # join dt1 and dt2 by key

##     ID Type   Dia Height WholeWeight
##  1:  1    I 0.055  0.010      0.0020
##  2:  2    I 0.100  0.025      0.0150
##  3:  3    I 0.100  0.030      0.0130
##  4:  4    I 0.090  0.030      0.0080
##  5:  5    I 0.120  0.030      0.0215
##  6:  6    M 0.150  0.050      0.0385
##  7:  7    M 0.110  0.040      0.0155
##  8:  8    I 0.110  0.025      0.0180
##  9:  9    I 0.175  0.065      0.0665
## 10: 10    I 0.150  0.045      0.0375

We obtain the lines from dt2 with the data from dt1 based on the used key. Merge functions will be introduced later.
If we want to stop using a key, set the key as NULL.

setkey(dset_dt, NULL)
key(dset_dt)

## NULL

Grouping and applying keys

With the keyby function, it is possible to make groups as seen previously seen and apply a key for indexing rows at the same time.

# Previous example
dset_dt[, .(mean_dia=mean(Dia),
            mean_rings=mean(Rings),
            mean_masse=mean(WholeWeight)),
        by=Type][
          order(Type), ]

##    Type  mean_dia mean_rings mean_masse
## 1:    F 0.4547322  11.129304  1.0465321
## 2:    I 0.3264940   7.890462  0.4313625
## 3:    M 0.4392866  10.705497  0.9914594

# Using keyby
dset_dt[, .(mean_dia=mean(Dia),
            mean_rings=mean(Rings),
            mean_masse=mean(WholeWeight)),
        keyby=Type]

##    Type  mean_dia mean_rings mean_masse
## 1:    F 0.4547322  11.129304  1.0465321
## 2:    I 0.3264940   7.890462  0.4313625
## 3:    M 0.4392866  10.705497  0.9914594

Keys are complex tools and can be used in many other ways to speed up and improve data manipulations.

set() : Assigning values REALLY fast

data.table also offers the set() function that allow fast value assignations in for loops. While for loops are generally considered very slow, it is mostly due to their necessity of dealing with datasets overheads. set() allows to circumvent this limit and assign values up to multiple thousand times faster.

m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)    

system.time(for (i in 1:100000) DF[i,1] <- i)

##    user  system elapsed
##  29.844  28.071  66.122

system.time(for (i in 1:100000) DT[i,V1:=i])

##    user  system elapsed
##  42.324   0.606  47.545

system.time(for (i in 1:100000) set(DT,i,1L,i))

##    user  system elapsed
##   0.261   0.000   0.262

Using data.table with other packages

The data.table objects are compatible with packages often used to work with data.frames such as ggplot2 ou other packages from the tidyverse (dplyr, magrittr, etc.)

ggplot2

library(ggplot2)
# Total weigth in function of number of rings.
class(dset_dt)

## [1] "data.table" "data.frame"

ggplot(dset_dt, aes(x = Rings, y = WholeWeight, color = Type)) +
  geom_point()

### with the tidyverse

library("tidyverse")

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ✓ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()

class(dset_dt)

## [1] "data.table" "data.frame"

dset_dt %>%
  filter(Type == "M") # But this doesn't run faster than a data frame.

##         V1 Type LongestShell   Dia Height WholeWeight ShuckedWeight
##    1:  521    M        0.210 0.150  0.050      0.0385        0.0155
##    2:  527    M        0.155 0.110  0.040      0.0155        0.0065
##    3: 2372    M        0.180 0.125  0.050      0.0230        0.0085
##    4:  519    M        0.325 0.230  0.090      0.1470        0.0600
##    5:  525    M        0.235 0.160  0.060      0.0545        0.0265
##   ---                                                              
## 1524: 2336    M        0.610 0.490  0.150      1.1030        0.4250
## 1525: 2437    M        0.515 0.400  0.160      0.8175        0.2515
## 1526: 3281    M        0.690 0.540  0.185      1.6195        0.5330
## 1527:  295    M        0.600 0.495  0.195      1.0575        0.3840
## 1528: 2109    M        0.665 0.535  0.225      2.1835        0.7535
##       VisceraWeight ShellWeight Rings Masse_Tot2   Masse_Tot3 Masse_Tot4   ID
##    1:        0.0085       0.010     3     0.0340 1.317500e-06    -0.0030    6
##    2:        0.0030       0.005     3     0.0145 9.750000e-08    -0.0015    7
##    3:        0.0055       0.010     3     0.0240 4.675000e-07    -0.0070   13
##    4:        0.0340       0.045     4     0.1390 9.180000e-05    -0.0190   26
##    5:        0.0095       0.015     4     0.0510 3.776250e-06     0.0020   27
##   ---                                                                        
## 1524:        0.2025       0.360    23     0.9875 3.098250e-02    -0.1375 4169
## 1525:        0.1560       0.300    23     0.7075 1.177020e-02    -0.2045 4170
## 1526:        0.3530       0.555    24     1.4410 1.044227e-01    -0.3750 4172
## 1527:        0.1900       0.375    26     0.9490 2.736000e-02    -0.1810 4174
## 1528:        0.3910       0.885    27     2.0295 2.607374e-01    -0.5225 4175

The vroom package

Introduction

Now that we know how to rapidly manipulate heavy data, let’s see how we can manipulate heavier data even faster. This is the idea behind the vroom package who’s objective is to maximize read and write spead by using an efficient parsing system that doesn’t load data in memory, but instead focuses on their structure.

What vroom does :
The main vroom function is vroom() and it is used to read databases very fast, whatever their type, even if it’s compressed. It also recognizes the data separation used in the files.

It contains almost all parsing functions contained in readr, but using multiple cores makes it much faster.

Differences with data.table :
The main difference between those two packages is the read speed, but this is depends on the type of data. vroom will read character data much faster than data.table but will be slower for pure numerical datasets. Why? Because it indexes data at importation, and it is much harder to index numerical values than characters, which are more limited.

vroom also uses a different syntax to data.table, more fitting the the “tidy” writing of dplyr.

How does it work ? Concept behind vroom’s speed :
Vroom uses the Altrep (alternative representation) framework of R (only for R 3.5+). This context allows a better management of memory during heavy tasks. When importing files, the data is not stored in memory. Vroom instead created a path (index) to find the data easily. Then, when we apply a command, we only require the necessary data without reading the whole dataset (on demand parsing). By mapping data and the memory, vroom can perform multi-threaded analysis, giving big performance gains even to laptops. The capacity of reading and writing strings in a multi-threaded environment generates major speed improvement compared to classic methods (see Jim Hester’s figure in the youtube reference below @ 4:39). When multiple character vectors have the same name, they are not stored multiple times, they are instead indexed in a way that allow multiple path to lead to the same object. This greatly diminishes memory costs.

With both packages, we can then explore all types of data at high speed.

Installation

Vroom can be installed directly from CRAN or Github to obtain latest updates in the package.

# CRAN
install.packages("vroom")

# GitHub
devtools::install_dev("vroom")

library(vroom) # load the package

Reading data

Single file

Let’s import the abalone dataset again, containing mostly characters and numerical variables.

vab <- vroom("abalone.csv")

## New names:
## * `` -> ...1

## Rows: 4,177
## Columns: 10
## Delimiter: ","
## chr [1]: Type
## dbl [9]: ...1, LongestShell, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

The output is a 4 177 x 9 tibble delimited by “;” separator.

While the function can guess the type of separator, it can make mistakes. It must then be specified using the delim = [either(“,” “” “ “|” “:” “;”)] argument. It is a better practice to specify it manually, resulting in additionnal speed gains.

Vroom also guesses column type, so it might be necessary to specify column format.

vab <- vroom("abalone.csv", delim = ",")

## New names:
## * `` -> ...1

## Rows: 4,177
## Columns: 10
## Delimiter: ","
## chr [1]: Type
## dbl [9]: ...1, LongestShell, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

Let’s see the time difference between both methods.

system.time({t_ab <- vroom("abalone.csv")})

## New names:
## * `` -> ...1

## Rows: 4,177
## Columns: 10
## Delimiter: ","
## chr [1]: Type
## dbl [9]: ...1, LongestShell, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

##    user  system elapsed
##   0.022   0.002   0.023

system.time({t_abdelim <- vroom("abalone.csv", delim = ",")})

## New names:
## * `` -> ...1
## Rows: 4,177
## Columns: 10
## Delimiter: ","
## chr [1]: Type
## dbl [9]: ...1, LongestShell, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

##    user  system elapsed
##   0.018   0.002   0.017

And now the read time difference between data.table and vroom for the database created previously, “m2.csv” .

system.time({dt_m2 <- fread("m2.csv")})

##    user  system elapsed
##   0.050   0.024   0.174

system.time({v_m2 <- vroom("m2.csv", delim = ",")})

## Rows: 1,000,000
## Columns: 1
## Delimiter: ","
## dbl [1]: matrix.runif.1e.06...nrow...1e.06.
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

##    user  system elapsed
##   0.064   0.009   0.026

If we use “;” the columns will be “chr” and “,” for “dbl”

Multiple files

Vroom can also read multiple dataset with the same column names and combine them into a single database. Let’s use a copy of abalone.csv for example.

file.copy("abalone.csv","abalone_copy.csv") # copy abalone.csv to abalone_copy.csv

## [1] TRUE

library("fs")

files <- dir_ls(glob = "abalone*csv") # Seeks all files containing abalone and csv in the working directory

files

## abalone.csv      abalone_copy.csv

vroom(files, delim = ",") # Imports both files as a single dataset

## New names:
## * `` -> ...1

## Rows: 8,354
## Columns: 10
## Delimiter: ","
## chr [1]: Type
## dbl [9]: ...1, LongestShell, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

## # A tibble: 8,354 x 10
##     ...1 Type  LongestShell Diameter Height WholeWeight ShuckedWeight
##    <dbl> <chr>        <dbl>    <dbl>  <dbl>       <dbl>         <dbl>
##  1     1 M            0.455    0.365  0.095       0.514        0.224
##  2     2 M            0.35     0.265  0.09        0.226        0.0995
##  3     3 F            0.53     0.42   0.135       0.677        0.256
##  4     4 M            0.44     0.365  0.125       0.516        0.216
##  5     5 I            0.33     0.255  0.08        0.205        0.0895
##  6     6 I            0.425    0.3    0.095       0.352        0.141
##  7     7 F            0.53     0.415  0.15        0.778        0.237
##  8     8 F            0.545    0.425  0.125       0.768        0.294
##  9     9 M            0.475    0.37   0.125       0.509        0.216
## 10    10 F            0.55     0.44   0.15        0.894        0.314
## # … with 8,344 more rows, and 3 more variables: VisceraWeight <dbl>,
## #   ShellWeight <dbl>, Rings <dbl>

To differenciate between different data sources it is possible to add the “id=” argument to add a column refering to the original file.

m_vab <- vroom(files, delim = ",", id = "source")

## New names:
## * `` -> ...1

## New names:
## * ...1 -> ...2

## Rows: 8,354
## Columns: 11
## Delimiter: ","
## chr [1]: Type
## dbl [9]: ...1, LongestShell, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

head(m_vab)

## # A tibble: 6 x 11
##   source  ...2 Type  LongestShell Diameter Height WholeWeight ShuckedWeight
##   <chr>  <dbl> <chr>        <dbl>    <dbl>  <dbl>       <dbl>         <dbl>
## 1 abalo…     1 M            0.455    0.365  0.095       0.514        0.224
## 2 abalo…     2 M            0.35     0.265  0.09        0.226        0.0995
## 3 abalo…     3 F            0.53     0.42   0.135       0.677        0.256
## 4 abalo…     4 M            0.44     0.365  0.125       0.516        0.216
## 5 abalo…     5 I            0.33     0.255  0.08        0.205        0.0895
## 6 abalo…     6 I            0.425    0.3    0.095       0.352        0.141
## # … with 3 more variables: VisceraWeight <dbl>, ShellWeight <dbl>, Rings <dbl>

Compressed files

To read compressed files, the same method is used, but adding the appropriate extension at the end of the file names. i.e.: vroom(“abalone.csv.gz”)

Other filetypes

Vroom can also read multiple compressed files (wrapper function), files from online sources (put the URL in vroom) and compressed files from online sources. We invite you to refer to this tutorial which this workshop is based upon for more information : (https://vroom.r-lib.org/articles/vroom.html)

Manipulating data

Selecting columns

Column selection is made in a similar fashion to “dplyr” using the col_select=c() argument. It can be donne by column type, or by string match (starts with “T”, ends with “e”, etc.)

head(vroom("abalone.csv", col_select = c(Type, Rings, WholeWeight)))

## New names:
## * `` -> ...1

## # A tibble: 6 x 3
##   Type  Rings WholeWeight
##   <chr> <dbl>       <dbl>
## 1 M        15       0.514
## 2 M         7       0.226
## 3 F         9       0.677
## 4 M        10       0.516
## 5 I         7       0.205
## 6 I         8       0.352

head(vroom("abalone.csv", col_select = c(1, 5, 9)))

## New names:
## * `` -> ...1

## # A tibble: 6 x 3
##    ...1 Height ShellWeight
##   <dbl>  <dbl>       <dbl>
## 1     1  0.095       0.15
## 2     2  0.09        0.07
## 3     3  0.135       0.21
## 4     4  0.125       0.155
## 5     5  0.08        0.055
## 6     6  0.095       0.12

head(vroom("abalone.csv", col_select = starts_with("T")))

## New names:
## * `` -> ...1

## # A tibble: 6 x 1
##   Type
##   <chr>
## 1 M    
## 2 M    
## 3 F    
## 4 M    
## 5 I    
## 6 I

head(vroom("abalone.csv", col_select = ends_with("ght")))

## New names:
## * `` -> ...1

## # A tibble: 6 x 5
##   Height WholeWeight ShuckedWeight VisceraWeight ShellWeight
##    <dbl>       <dbl>         <dbl>         <dbl>       <dbl>
## 1  0.095       0.514        0.224         0.101        0.15
## 2  0.09        0.226        0.0995        0.0485       0.07
## 3  0.135       0.677        0.256         0.142        0.21
## 4  0.125       0.516        0.216         0.114        0.155
## 5  0.08        0.205        0.0895        0.0395       0.055
## 6  0.095       0.352        0.141         0.0775       0.12

Renaming columns

Changing column names is less intuitive than with data.table.

vroom("abalone.csv", col_select = list(Sexe = Type, Dia = Diameter, everything())) # "everything()" selects all other variables (info @ ?everything()).

## New names:
## * `` -> ...1

## # A tibble: 4,177 x 10
##    Sexe    Dia  ...1 LongestShell Height WholeWeight ShuckedWeight VisceraWeight
##    <chr> <dbl> <dbl>        <dbl>  <dbl>       <dbl>         <dbl>         <dbl>
##  1 M     0.365     1        0.455  0.095       0.514        0.224         0.101
##  2 M     0.265     2        0.35   0.09        0.226        0.0995        0.0485
##  3 F     0.42      3        0.53   0.135       0.677        0.256         0.142
##  4 M     0.365     4        0.44   0.125       0.516        0.216         0.114
##  5 I     0.255     5        0.33   0.08        0.205        0.0895        0.0395
##  6 I     0.3       6        0.425  0.095       0.352        0.141         0.0775
##  7 F     0.415     7        0.53   0.15        0.778        0.237         0.142
##  8 F     0.425     8        0.545  0.125       0.768        0.294         0.150
##  9 M     0.37      9        0.475  0.125       0.509        0.216         0.112
## 10 F     0.44     10        0.55   0.15        0.894        0.314         0.151
## # … with 4,167 more rows, and 2 more variables: ShellWeight <dbl>, Rings <dbl>

If we want to modify all column names so they have a similar format without doing so manually (notably in excel, which is a big no with Big Data), you can use the “.name_repair=” argument from the janitor package

install.packages("janitor")

library("janitor")

##
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
##
##     chisq.test, fisher.test

head(vroom("abalone.csv", .name_repair = ~ make_clean_names(., case = "all_caps")))

## Rows: 4,177
## Columns: 10
## Delimiter: ","
## chr [1]: TYPE
## dbl [9]: X, LONGEST_SHELL, DIAMETER, HEIGHT, WHOLE_WEIGHT, SHUCKED_WEIGHT, VISCERA_WEIGH...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

## # A tibble: 6 x 10
##       X TYPE  LONGEST_SHELL DIAMETER HEIGHT WHOLE_WEIGHT SHUCKED_WEIGHT
##   <dbl> <chr>         <dbl>    <dbl>  <dbl>        <dbl>          <dbl>
## 1     1 M             0.455    0.365  0.095        0.514         0.224
## 2     2 M             0.35     0.265  0.09         0.226         0.0995
## 3     3 F             0.53     0.42   0.135        0.677         0.256
## 4     4 M             0.44     0.365  0.125        0.516         0.216
## 5     5 I             0.33     0.255  0.08         0.205         0.0895
## 6     6 I             0.425    0.3    0.095        0.352         0.141
## # … with 3 more variables: VISCERA_WEIGHT <dbl>, SHELL_WEIGHT <dbl>,
## #   RINGS <dbl>

Adding new columns

For other types of data manipulations, it is possible to use dplyr normally, since the object format generated by vroom are compatible with dplyr. As opposed to manipulations done with vroom where we need to include the original file path, dplyr must refer to the imported object in the R environment.

library("dplyr")
test = mutate(vab, new = WholeWeight/Diameter)
head(test)

## # A tibble: 6 x 11
##    ...1 Type  LongestShell Diameter Height WholeWeight ShuckedWeight
##   <dbl> <chr>        <dbl>    <dbl>  <dbl>       <dbl>         <dbl>
## 1     1 M            0.455    0.365  0.095       0.514        0.224
## 2     2 M            0.35     0.265  0.09        0.226        0.0995
## 3     3 F            0.53     0.42   0.135       0.677        0.256
## 4     4 M            0.44     0.365  0.125       0.516        0.216
## 5     5 I            0.33     0.255  0.08        0.205        0.0895
## 6     6 I            0.425    0.3    0.095       0.352        0.141
## # … with 4 more variables: VisceraWeight <dbl>, ShellWeight <dbl>, Rings <dbl>,
## #   new <dbl>

Writing data

Writing files is done in the same way as the reading, specifying the delimiters and the desired extention into the vroom_write() function.

vroom_write(vab, "vroom_abalone.csv", delim = ";")

To make a compressed file, simply add a second extension:

vroom_write(vab, "vroom_abalone.csv.gz")

References

About Big Data: https://www.environmentalscience.org/data-science-big-data https://click.endnote.com/viewer?doi=10.1890%2F120103&token=WzI0OTgxMjksIjEwLjE4OTAvMTIwMTAzIl0.94jMXqxm3lJEzbuAhWEzvzal-xI

Data.table: https://www.listendata.com/2016/10/r-data-table.html
https://www.datacamp.com/community/tutorials/top-ten-most-important-packages-in-r-for-data-science
https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/.
Keys: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html

Vroom: https://vroom.r-lib.org/articles/vroom.html
Jim Hester: https://www.youtube.com/watch?v=RA9AjqZXxMU&t=10s
https://vroom.r-lib.org/articles/benchmarks.html
Other filetypes: https://vroom.r-lib.org/articles/vroom.html