Introduction to relational databases and SQL
April 2023
Charles Martin
- What is a database?
- Data structure and normalization
- Overview of the available software packages
- DuckDB workspace preparation
- Introduction to the SQL language
What is a database?
In its simplest form, a database is a collection of data frames, which in database jargon are called tables. Each of these tables includes several columns (or variables), which are called fields in database vocabulary.
Unlike data frames that exist in the computer’s memory, tables are stored on the hard drive. They can therefore be much larger in size.
Also, unlike plain data frames, database tables are almost always indexed. That is, some work is done upstream, as data is added, modified or deleted, to speed up filtering and sorting when they are needed. Note that this feature is not unique to databases: the data.table library, for example, can also index its data.
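A minimal sketch with data.table (the toy columns here are made up for illustration):

library(data.table)

dt <- data.table(site = c("X", "X", "Y"), richesse = c(3, 2, 1))

# Build a secondary index on site: subsequent filters such as
# dt[site == "Y"] can use a fast binary search instead of a full scan.
setindex(dt, site)
dt[site == "Y"]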
Data structure and normalization
Unlike data frames, which are a fixed product generated at the end of data collection, databases are intended to be dynamic. They include a whole infrastructure to insert new data, update it, delete it, etc., in a safe way.
For these changes to go well, how we structure our data is of particular importance, even more critical than in a data frame. A whole field of study with its own terminology is dedicated to this subject: database normalization.
Obviously, we can only skim these techniques here, but it is nevertheless important to understand the main principle: never duplicate information.
For example, this data frame is perfect for R:
parcelle | site | richesse_parcelle | ph_site |
---|---|---|---|
A | X | 3 | 7.5 |
B | X | 2 | 7.5 |
C | X | 5 | 7.5 |
D | Y | 1 | 6.9 |
E | Y | 0 | 6.9 |
F | Y | 3 | 6.9 |
But it would not be appropriate in a relational database environment, where it should be replaced by the following tables:
sites
id | nom | ph |
---|---|---|
1 | X | 7.5 |
2 | Y | 6.9 |
parcelles
id | site_id | nom | richesse |
---|---|---|---|
1 | 1 | A | 3 |
2 | 1 | B | 2 |
3 | 1 | C | 5 |
4 | 2 | D | 1 |
5 | 2 | E | 0 |
6 | 2 | F | 3 |
Notice in each table the presence of an id column, which becomes the unique identifier for each observation. This identifier is called the primary key of the table.
The link between the two tables is specified by the site_id column, which in this jargon is called the foreign key.
The relationship here is thus a 1:N, or one-to-many, relationship. A site can have multiple plots, but each plot belongs to only a single site.
There are several other types of relationships, for example N:N (many-to-many) relationships. This type of relationship would be appropriate to store the list of species present in each plot:
especes
id | nom |
---|---|
1 | perchaude |
2 | meunier |
presences
parcelle_id | espece_id |
---|---|
1 | 1 |
1 | 3 |
Note that to define an N:N relationship, you must create an intermediate table containing only the identifiers associating the two tables. That is the role the presences table plays here, connecting the especes table and the parcelles table in an N:N relationship.
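To make this structure concrete, here is a sketch of how such a schema could be declared in SQL; the column types are assumptions for illustration:

CREATE TABLE sites (
    id INTEGER PRIMARY KEY,   -- primary key
    nom VARCHAR,
    ph DOUBLE
);

CREATE TABLE parcelles (
    id INTEGER PRIMARY KEY,
    site_id INTEGER REFERENCES sites (id),  -- foreign key to sites
    nom VARCHAR,
    richesse INTEGER
);

CREATE TABLE especes (
    id INTEGER PRIMARY KEY,
    nom VARCHAR
);

-- intermediate table defining the N:N relationship
CREATE TABLE presences (
    parcelle_id INTEGER REFERENCES parcelles (id),
    espece_id INTEGER REFERENCES especes (id)
);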
Overview of the available software packages
There are dozens, if not hundreds, of relational database software packages. They can generally be divided into three categories of use:
- Huge databases in the cloud, like Google BigQuery or Amazon Redshift. These solutions are extremely powerful because they can tap into a huge pool of resources when executing queries. They are also very expensive to use.
- Database servers, like MySQL, PostgreSQL, SQL Server, etc. These solutions allow multiple users to query the same data simultaneously. They are an excellent client-server solution when several users must work with the same data at the same time.
- Embedded databases, such as SQLite and DuckDB. These solutions run directly on your computer, greatly increasing your ability to crunch data while avoiding the configuration of a server infrastructure.
For this workshop, we will be using an embedded database, DuckDB. Its advantage for learning situations is that it is erased and recreated each time R starts. It is also extremely fast, often handling huge queries faster than its competitor SQLite.
However, almost everything we will see could be transferred to another software without problem.
DuckDB workspace preparation
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0 ✔ purrr 1.0.1
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(DBI)
library(duckdb)
All our operations on databases will go through the DBI (DataBase Interface) library, which takes care of transferring our requests to the database. However, as each database has its own particularities, we must specify at connection time which database we are connecting to. In our case, it will be DuckDB, so:
connexion <- DBI::dbConnect(duckdb::duckdb())
If our database were a PostgreSQL server, we could have built our connection like this:
connexion <- DBI::dbConnect(
  RPostgres::Postgres(),
  host = "adresse.com",  # a real connection would typically also need dbname, user and password
  port = 1234
)
Finally, in a real analysis project, it would probably be appropriate to give DuckDB a location on disk where it can store the database, rather than recreating it each time R starts:
connexion <- DBI::dbConnect(duckdb::duckdb(), dbdir = "donnees_projet")
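And when a session is over, the connection can be closed; a small sketch (the shutdown argument is specific to the duckdb driver and also stops the embedded engine):

# Close the connection at the end of a session
dbDisconnect(connexion, shutdown = TRUE)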
We will now create some tables and populate them to be able to start working. You can also download the dataset associated with this workshop here.
Note that the data are already well organized for import: each observation is labeled with a unique identifier and the database structure has already been normalized.
First, here’s how to create a DuckDB table from a CSV file:
duckdb_read_csv(connexion, "especes", "FakeData/especes.csv")
Since we have several tables to create, we can automate this process with the map function:
liste_tables <- c("especes","sites","parcelles","presences")
map(liste_tables, function(x){
duckdb_read_csv(connexion, x, str_c("FakeData/",x,".csv"))
})
[[1]]
[1] 166
[[2]]
[1] 26
[[3]]
[1] 1300000
[[4]]
[1] 1e+07
We can then verify that all our tables have been successfully created.
dbListTables(connexion)
[1] "especes" "parcelles" "presences" "sites"
The database we are going to work with contains 26 sites, in which a total of 1,300,000 plots have been inventoried. In each plot, the list of all species present (83 in total) was noted, for a total of 10,000,000 presences. These data are made up for the workshop, but they will let us test the speed and power of the database approach. Don’t worry about the execution speed: on my personal laptop, the whole workshop runs in less than 30 seconds…
Here, for information, is the list of all the fields in each of the tables that we created:
dbListFields(connexion,"especes")
[1] "id" "nom"
dbListFields(connexion,"sites")
[1] "id" "nom" "ph"
dbListFields(connexion,"parcelles")
[1] "id" "site_id" "nom" "couvert_pct"
dbListFields(connexion,"presences")
[1] "parcelle_id" "espece_id"
Note that each of the plots is associated with a site by the site_id column, which makes it a 1:N relationship (a site contains several plots, but each plot only belongs to one site).
The presences table only contains identifiers, one row for each species-plot combination that has been observed, describing an N:N relationship.
Introduction to the SQL language
SQL (Structured Query Language) is a (relatively) standardized computer language for querying databases. I say “relatively” because, while most of the language is used the same way in every database software, each of them also adds its own particularities.
SQL defines three main families of commands, those for:
- Extracting data (making queries)
- Defining data (inserting, modifying, deleting, etc.)
- Controlling operations (granting access, validating transactions, etc.)
In the vast majority of cases, it is possible to connect directly to the database from a terminal to send queries. However, with the exception of maintenance operations, this way of working is rarely used. Queries are usually sent programmatically, in the language of our choice, here R.
Queries
To send queries that extract data, we will use the dbGetQuery function from the DBI library. This function expects two things: a connection to a database and the SQL command to execute.
Select what, from which table?
The basis of SQL queries is the SELECT command, which chooses the fields we want to extract, followed by FROM, which specifies the table on which to perform the extraction. For example, to get the names of all sites:
dbGetQuery(connexion, "SELECT nom FROM sites")
nom
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J
11 K
12 L
13 M
14 N
15 O
16 P
17 Q
18 R
19 S
20 T
21 U
22 V
23 W
24 X
25 Y
26 Z
The dbGetQuery function always returns a data.frame object, which can be retrieved and used as a normal R data.frame:
x <- dbGetQuery(connexion, "SELECT nom FROM sites")
x %>%
slice(1:10) %>%
arrange(desc(nom))
nom
1 J
2 I
3 H
4 G
5 F
6 E
7 D
8 C
9 B
10 A
Note that the SELECT and FROM keywords have been written here in uppercase. Most database software doesn’t care whether commands are sent in lowercase or uppercase: SeLeCt, SELECT and select would give exactly the same result. By convention, however, it is customary to write the keywords (SELECT, FROM, GROUP BY, etc.) in uppercase and the field names in lowercase.
Rather than specifying each of the fields one by one, you can also use an *, which acts as a wildcard indicating that we want all the fields of this table:
dbGetQuery(connexion, "SELECT * FROM sites")
id nom ph
1 1 A 7.169245
2 2 B 6.954659
3 3 C 6.798699
4 4 D 6.746626
5 5 E 7.241875
6 6 F 6.970178
7 7 G 6.828074
8 8 H 6.692542
9 9 I 7.213349
10 10 J 7.310457
11 11 K 6.945442
12 12 L 7.693323
13 13 M 7.358573
14 14 N 7.196335
15 15 O 6.865845
16 16 P 7.118721
17 17 Q 6.759258
18 18 R 7.125312
19 19 S 7.078200
20 20 T 6.708648
21 21 U 6.577113
22 22 V 7.569592
23 23 W 6.879513
24 24 X 6.655040
25 25 Y 6.991663
26 26 Z 6.876635
Applying filters
Obviously, we don’t always want to retrieve everything in a table; sometimes we want only certain information. For example, only the sites with a pH of 7 or higher:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 12 L 7.693323
6 13 M 7.358573
7 14 N 7.196335
8 16 P 7.118721
9 18 R 7.125312
10 19 S 7.078200
11 22 V 7.569592
You can also combine multiple conditions with AND and OR:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7 AND ph < 7.4")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 13 M 7.358573
6 14 N 7.196335
7 16 P 7.118721
8 18 R 7.125312
9 19 S 7.078200
Sorting
You can change the order in which the results will appear by adding an ORDER BY clause to our query, like this:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7 ORDER BY ph")
id nom ph
1 19 S 7.078200
2 16 P 7.118721
3 18 R 7.125312
4 1 A 7.169245
5 14 N 7.196335
6 9 I 7.213349
7 5 E 7.241875
8 10 J 7.310457
9 13 M 7.358573
10 22 V 7.569592
11 12 L 7.693323
You can sort in descending rather than ascending order by adding the DESC modifier:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7 ORDER BY ph DESC")
id nom ph
1 12 L 7.693323
2 22 V 7.569592
3 13 M 7.358573
4 10 J 7.310457
5 5 E 7.241875
6 9 I 7.213349
7 14 N 7.196335
8 1 A 7.169245
9 18 R 7.125312
10 16 P 7.118721
11 19 S 7.078200
Limiting the amount of information received
If you build dashboards, reports, etc., you may not need the whole set of rows returned by a query, only the first ones. The LIMIT clause truncates the results, for example keeping only the first 5 rows:
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5")
id nom ph
1 21 U 6.577113
2 24 X 6.655040
3 8 H 6.692542
4 20 T 6.708648
5 4 D 6.746626
The LIMIT clause will become particularly important in queries to come, which could return millions of rows…
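A related sketch: LIMIT can be paired with OFFSET to page through results, for example to fetch the next 5 sites after the first 5:

# OFFSET 5 skips the first 5 rows; LIMIT 5 returns the following 5
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5 OFFSET 5")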
Connecting multiple tables together
Since our data is well organized, with a normalized structure, we can easily exploit the relations between the different tables.
As a first scenario, let’s imagine that we would like to obtain all the plots located in sites with a pH of 7 or more.
First step, prepare a query to find these sites:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 12 L 7.693323
6 13 M 7.358573
7 14 N 7.196335
8 16 P 7.118721
9 18 R 7.125312
10 19 S 7.078200
11 22 V 7.569592
To make our life easier, we are going to introduce the concept of SQL aliases, which allow us to rename a table within a query to make it easier to refer to. Note that an alias only exists for the duration of that particular query; in the next query, we will have to define it again if we want to use it once more.
We will also take this opportunity to clarify that the pH and the * that interest us are those of the sites table:
dbGetQuery(connexion, "SELECT s.* FROM sites AS s WHERE s.ph >= 7")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 12 L 7.693323
6 13 M 7.358573
7 14 N 7.196335
8 16 P 7.118721
9 18 R 7.125312
10 19 S 7.078200
11 22 V 7.569592
Now, we can connect this query with the parcelles table:
dbGetQuery(connexion, "
SELECT s.*, p.*
FROM sites AS s
LEFT JOIN parcelles AS p
ON p.site_id = s.id
WHERE s.ph >= 7
LIMIT 10
")
id nom ph id site_id nom couvert_pct
1 5 E 7.241875 122881 5 megvgtlavc 60.040527
2 9 I 7.213349 122885 9 vnamlyqykq 10.955212
3 10 J 7.310457 122886 10 voynebxemt 87.516474
4 12 L 7.693323 122888 12 dcwdppalwt 91.731660
5 13 M 7.358573 122889 13 wdphqdvvnt 23.368555
6 14 N 7.196335 122890 14 vcvtjgthjp 3.024725
7 16 P 7.118721 122892 16 pjskhveidz 71.662687
8 18 R 7.125312 122894 18 lexdywvieo 91.496724
9 19 S 7.078200 122895 19 jlvqsnbyvy 61.266997
10 22 V 7.569592 122898 22 kfamkmimao 74.941283
Notice that we need two clauses to connect the two tables together.
The first is the LEFT JOIN. There are different types of “joins” in SQL, “left” being the most common. In a LEFT JOIN, all rows from the left table (here the sites) will be present, and only the matching rows from the right table will be attached.
As SQL does not read minds, we must also add an ON clause to specify on which fields to connect the two tables. Here, we specify that the site_id column of the parcelles table must match the id column of the sites table.
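For comparison, here is a sketch of the same query written with an INNER JOIN, which would keep only the sites having at least one matching plot (with our data, every site has plots, so the result would be identical):

dbGetQuery(connexion, "
  SELECT s.*, p.*
  FROM sites AS s
  INNER JOIN parcelles AS p
  ON p.site_id = s.id
  WHERE s.ph >= 7
  LIMIT 10
")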
I used the LIMIT 10 clause here to avoid hundreds of pages of results; in a real-life scenario, it would not necessarily be needed.
One particularly counter-intuitive thing here is that we can use information coming from the parcelles table in the SELECT clause, even though at that point we have not yet mentioned that we want to join with it. SQL code is not interpreted sequentially: the query is deconstructed and interpreted in an optimal way, as a whole.
Notice also that although we asked for all the info from both tables (s.*, p.*), this is not mandatory. We could also have been more selective:
dbGetQuery(connexion, "
SELECT s.nom AS nom_site, s.ph, p.nom AS nom_parcelle, p.couvert_pct
FROM sites AS s
LEFT JOIN parcelles AS p
ON p.site_id = s.id
WHERE s.ph >= 7
LIMIT 10
")
nom_site ph nom_parcelle couvert_pct
1 A 7.169245 yjsvyluwyw 13.35865
2 E 7.241875 dvbqleheht 42.34773
3 I 7.213349 dspjaogwaf 53.68863
4 J 7.310457 wenvtalrhm 33.90705
5 L 7.693323 yiufghvqze 54.29557
6 M 7.358573 segjczoozw 46.62623
7 N 7.196335 bkbedboyii 97.87763
8 P 7.118721 esqsvztcos 13.13683
9 R 7.125312 vzozqzgsnw 19.40764
10 S 7.078200 xnvlnblopy 42.59576
Note here that we can also use aliases in the SELECT clause of our query.
Performing grouped operations
SQL also contains a whole series of keywords allowing us to carry out operations by group.
Knowing that COUNT returns the number of rows matched by a query, e.g.:
dbGetQuery(connexion,"SELECT COUNT(id) FROM sites")
count(id)
1 26
We can ask, for example, how many times each species was observed:
dbGetQuery(connexion,"
SELECT COUNT(p.espece_id)
FROM presences AS p
GROUP BY p.espece_id
LIMIT 10
")
count(p.espece_id)
1 75840
2 22797
3 169707
4 160021
5 198948
6 177248
7 118475
8 87996
9 175962
10 27476
Clearly, this output is not very useful on its own, but it becomes much more informative once connected to the especes table:
dbGetQuery(connexion,"
SELECT COUNT(p.espece_id) AS nb_observations, ANY_VALUE(e.nom) AS nom
FROM presences AS p
LEFT JOIN especes AS e ON e.id = p.espece_id
GROUP BY e.id
ORDER BY nb_observations DESC
LIMIT 10
")
nb_observations nom
1 439378 African giant pouched rat
2 439314 Thirteen-lined ground squirrel
3 436888 Little brown bat
4 434872 Musk shrew
5 413442 Giraffe
6 412510 African striped mouse
7 409380 Round-tailed muskrat
8 408834 African elephant
9 402402 Arctic ground squirrel
10 397896 Cow
Notice that we had to wrap e.nom in the ANY_VALUE function. When we use grouping operators, every value in the SELECT clause must either undergo an aggregation operation (e.g. COUNT, MIN, MAX, AVG, FIRST, LAST, ANY_VALUE, etc.) or be part of the GROUP BY clause.
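Alternatively, grouping directly by e.nom makes the name part of the GROUP BY clause, so ANY_VALUE is no longer needed; a sketch (assuming species names are unique):

dbGetQuery(connexion,"
  SELECT e.nom, COUNT(*) AS nb_observations
  FROM presences AS p
  LEFT JOIN especes AS e ON e.id = p.espece_id
  GROUP BY e.nom
  ORDER BY nb_observations DESC
  LIMIT 10
")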
Views and subqueries
Finally, as a last example of aggregation, we will calculate the average species richness per plot for each site. This will be the most complex query in this workshop!
First, let’s build the query that calculates the number of species per plot:
dbGetQuery(connexion,"
SELECT
ANY_VALUE(pa.nom) AS nom,
ANY_VALUE(pa.id) AS id,
ANY_VALUE(pa.site_id) AS site_id,
COUNT(DISTINCT(pr.espece_id)) AS richesse
FROM parcelles AS pa
LEFT JOIN presences as pr ON pr.parcelle_id = pa.id
GROUP BY pa.id
LIMIT 10
")
nom id site_id richesse
1 nldjabkezk 1264584 22 8
2 wanjivedhu 1272225 19 14
3 ggswzbuius 72442 6 21
4 xbjkkbamow 114094 6 6
5 gxdsdbdicm 1287497 3 10
6 dokcidjdqi 814664 6 4
7 zskfrdbata 145840 6 16
8 fccafsyimm 1274512 18 11
9 owgxsxcczv 467450 22 10
10 ymmbyupkzm 186856 20 14
At this point, it is extremely convenient to know that relational databases also contain objects called views. These are a way to save a query and associate a name with it, so it can be reused later.
You can create a view from the above query using the CREATE VIEW command:
dbExecute(connexion,"
CREATE VIEW v_richesses AS
SELECT
ANY_VALUE(pa.nom) AS nom,
ANY_VALUE(pa.id) AS id,
ANY_VALUE(pa.site_id) AS site_id,
COUNT(DISTINCT(pr.espece_id)) AS richesse
FROM parcelles AS pa
LEFT JOIN presences as pr ON pr.parcelle_id = pa.id
GROUP BY pa.id")
[1] 0
This view can then be used exactly as if it were a table :
dbGetQuery(connexion, "SELECT * FROM v_richesses LIMIT 5")
nom id site_id richesse
1 nolxmnqmdx 1058277 25 11
2 cwbbnvcniq 1285844 14 4
3 fzlzionnwq 846829 9 12
4 xiewxpzcrz 635508 16 11
5 jorditqaqa 1042151 19 12
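Since a view behaves like a table, all the clauses seen earlier (WHERE, ORDER BY, LIMIT, etc.) apply to it as well; a quick sketch with an arbitrary threshold:

# Views accept the same clauses as tables
dbGetQuery(connexion, "SELECT * FROM v_richesses WHERE richesse >= 15 LIMIT 5")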
To calculate the mean species richness per site, we then simply connect the two tables with a join clause:
dbGetQuery(connexion, "
SELECT sites.id, ANY_VALUE(sites.nom) AS nom, ANY_VALUE(sites.ph) AS ph, MEAN(richesse) AS richesse_moyenne
FROM sites
LEFT JOIN v_richesses
ON v_richesses.site_id = sites.id
GROUP BY sites.id
LIMIT 5
")
id nom ph richesse_moyenne
1 1 A 7.169245 7.12628
2 2 B 6.954659 7.13368
3 3 C 6.798699 7.09738
4 4 D 6.746626 7.13422
5 5 E 7.241875 7.11826
Note that, as an alternative strategy, one could have used a subquery to reach the same result, but it makes the query much harder to read:
dbGetQuery(connexion,"
SELECT ANY_VALUE(s.nom) AS nom, ANY_VALUE(s.ph) AS ph, AVG(p.richesse) AS richesse_moyenne
FROM sites AS s
LEFT JOIN (
SELECT
ANY_VALUE(pa.nom) AS nom,
ANY_VALUE(pa.id) AS id,
ANY_VALUE(pa.site_id) AS site_id,
COUNT(DISTINCT(pr.espece_id)) AS richesse
FROM parcelles AS pa
LEFT JOIN presences as pr ON pr.parcelle_id = pa.id
GROUP BY pa.id
) AS p ON p.site_id = s.id
GROUP BY s.id
ORDER BY richesse_moyenne DESC
")
nom ph richesse_moyenne
1 L 7.693323 7.15808
2 X 6.655040 7.15736
3 K 6.945442 7.14678
4 Q 6.759258 7.13844
5 R 7.125312 7.13674
6 D 6.746626 7.13422
7 B 6.954659 7.13368
8 M 7.358573 7.13298
9 V 7.569592 7.13278
10 I 7.213349 7.12652
11 A 7.169245 7.12628
12 F 6.970178 7.12502
13 Z 6.876635 7.12202
14 E 7.241875 7.11826
15 U 6.577113 7.11738
16 O 6.865845 7.11498
17 N 7.196335 7.10808
18 G 6.828074 7.10546
19 W 6.879513 7.09800
20 C 6.798699 7.09738
21 S 7.078200 7.09358
22 H 6.692542 7.08934
23 T 6.708648 7.08824
24 Y 6.991663 7.07886
25 J 7.310457 7.07566
26 P 7.118721 7.06196
Inserting, modifying and deleting data
As part of your work as a biologist, you will probably use a relatively static database, created as we have shown above, directly from CSV files.
Nevertheless, it is interesting to know the full capabilities of a relational database system. In particular, how to add, modify and delete data.
Inserting data
As we discussed earlier, it is very important in a relational database that each row in a table can be uniquely identified.
Some databases, such as MySQL, allow you to specify, when creating a table, that a unique identifier should be generated automatically each time a row is added.
With DuckDB, it will take a little more work to achieve the same result. Before performing the first insert, we will need to create a sequence type object, which will allow us to obtain the next id value to use when we insert.
For this section, we will send our commands not with the dbGetQuery function but with the dbExecute function, since these SQL commands are not queries designed to return data.
So, let’s first see how to create a sequence for the sites table, by specifying that the first value of the sequence should be the maximum site id + 1.
Note that unlike self-incrementing ids, this way of doing things is rather risky if several people work on the same database simultaneously: after we retrieve the maximum id, a new insert could occur and change that maximum. Moreover, someone could insert a row without using the sequence, etc.
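As an aside, one partial safeguard is to group related statements in a transaction, which DBI exposes directly; a minimal sketch:

# Start a transaction: the statements inside succeed or fail as a unit
dbBegin(connexion)
# ... INSERT / UPDATE / DELETE statements would go here ...
dbCommit(connexion)   # or dbRollback(connexion) to cancel everything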
So first, what is the largest id in our sites table?
dbGetQuery(connexion,"SELECT MAX(id) FROM sites")
max(id)
1 26
Step two: create the sequence, reusing that number:
dbExecute(connexion, "CREATE SEQUENCE sequence_sites START 27")
[1] 0
The sequence can now be used in any insert operation:
dbExecute(connexion, "INSERT INTO sites VALUES (nextval('sequence_sites'), 'site ajouté', 7.0)")
[1] 1
dbExecute(connexion, "INSERT INTO sites VALUES (nextval('sequence_sites'), 'deuxième ajout', 6.5)")
[1] 1
We can now peek into the sites table to see that the two new sites have indeed been added, each with a unique id:
dbGetQuery(connexion,"SELECT * FROM sites ORDER BY id DESC LIMIT 10")
id nom ph
1 28 deuxième ajout 6.500000
2 27 site ajouté 7.000000
3 26 Z 6.876635
4 25 Y 6.991663
5 24 X 6.655040
6 23 W 6.879513
7 22 V 7.569592
8 21 U 6.577113
9 20 T 6.708648
10 19 S 7.078200
Knowing this, one could be tempted to insert data using existing R objects, a bit like this:
nom <- "Site de Charles"
pH <- 7
dbExecute(connexion, paste0(
"INSERT INTO sites VALUES (nextval('sequence_sites'), '",
nom,
"', ",
pH,
")"
))
[1] 1
Note first that this kind of operation requires a lot of concentration, so as not to mix up the double quotes that delimit our strings in R with the single quotes that delimit strings in SQL.
Also note that if our site name happens to contain an apostrophe, we will get an SQL error:
nom <- "L'autre site"
pH <- 7
dbExecute(connexion, paste0(
"INSERT INTO sites VALUES (nextval('sequence_sites'), '",
nom,
"', ",
pH,
")"
))
Error: Parser Error: syntax error at or near "autre"
LINE 1: ... VALUES (nextval('sequence_sites'), 'L'autre site', 7)
^
This kind of issue has given rise to many cases of website hacking since the early days of the web, under the name SQL injection. For example, imagine a form in which a malicious visitor would enter A'; DELETE FROM users; -- or A'; INSERT INTO admin VALUES ('pirate', 'piratepassword'); -- as their username or password!
For these two reasons, it is now recommended to always use the protection functions provided with the library that manages your database, in our case the DBI + duckdb combination. The idea behind these protection functions is to escape all the potentially dangerous elements of a query, to avoid problems, whether accidental or malicious.
For this, we work in two stages: first we prepare the skeleton of our query, and then, in a second step, we ask DBI to make the replacements.
In the query skeleton, every element that will come from outside is preceded by a ?:
valeur_tannante <- "L'apostrophe de malheur"
squelette <- "INSERT INTO sites VALUES (nextval('sequence_sites'),?nom,?ph)"
Then, the sqlInterpolate function builds the actual query, filling in the value of each placeholder:
requete <- sqlInterpolate(
connexion,
squelette,
nom="L'autre site",
ph=6.5
)
print(requete)
<SQL> INSERT INTO sites VALUES (nextval('sequence_sites'),'L''autre site',6.5)
dbExecute(connexion,requete)
[1] 1
This procedure should be followed every time some part of the query comes from an R object, even for simple SELECT statements:
squelette <- "SELECT * FROM sites WHERE nom = ?nom"
requete <- sqlInterpolate(
connexion,
squelette,
nom="L'autre site"
)
print(requete)
<SQL> SELECT * FROM sites WHERE nom = 'L''autre site'
dbGetQuery(connexion, requete)
id nom ph
1 30 L'autre site 6.5
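As an alternative to sqlInterpolate, DBI can also bind values through the params argument, using positional ? placeholders; a sketch, which finally puts to use the valeur_tannante string defined above:

# The values travel separately from the SQL text,
# so the apostrophe cannot break the query
dbExecute(
  connexion,
  "INSERT INTO sites VALUES (nextval('sequence_sites'), ?, ?)",
  params = list(valeur_tannante, 7.2)
)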
Deleting data
Deleting data in SQL is done with the… DELETE command!
First, let’s make sure we have some sites where pH is below 6.6:
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5")
id nom ph
1 28 deuxième ajout 6.500000
2 30 L'autre site 6.500000
3 21 U 6.577113
4 24 X 6.655040
5 8 H 6.692542
To delete every site where pH < 6.6, you could run the following query:
dbExecute(connexion,"DELETE FROM sites WHERE ph < 6.6")
[1] 3
Let’s now verify that the sites were indeed deleted:
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5")
id nom ph
1 24 X 6.655040
2 8 H 6.692542
3 20 T 6.708648
4 4 D 6.746626
5 17 Q 6.759258
In databases like MySQL, PostgreSQL, etc., such an operation could also have cascaded, deleting all the plots associated with the site, all the presences associated with those plots, and so on, if the foreign keys had been declared with that behavior. Since our DuckDB tables were created from plain CSV imports, without explicit foreign-key constraints, this cascade did not occur:
dbGetQuery(connexion, "SELECT * FROM parcelles WHERE site_id = 21 LIMIT 5")
id site_id nom couvert_pct
1 21 21 otuzagswlo 55.11549
2 47 21 orqxestuqr 58.59047
3 73 21 vytwvjjvcr 47.13821
4 99 21 odicuudsgl 42.04843
5 125 21 izpwayhfaq 78.15716
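If we wanted the cascade, we would have to perform it ourselves, deleting from the “many” side first. A sketch for the orphaned plots of site 21:

# First remove the presences attached to the plots of site 21...
dbExecute(connexion, "
  DELETE FROM presences
  WHERE parcelle_id IN (SELECT id FROM parcelles WHERE site_id = 21)
")
# ...then remove the plots themselves
dbExecute(connexion, "DELETE FROM parcelles WHERE site_id = 21")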
Updating data
Finally, data can be updated with the UPDATE command.
For example, if we made a mistake and all observations for plot 3 were in fact supposed to be for plot 4, one could write:
dbExecute(connexion,"UPDATE presences SET parcelle_id = 4 WHERE parcelle_id = 3")
[1] 7
And then verify that it actually worked:
dbGetQuery(connexion, "
SELECT parcelle_id, COUNT(*)
FROM presences
WHERE parcelle_id = 3 OR parcelle_id = 4
GROUP BY parcelle_id
")
parcelle_id count_star()
1 4 10
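Note that SET is not limited to constants: the new value can be computed from the current one. A sketch (this +0.1 recalibration is made up for illustration):

# The right-hand side of SET can reference the existing column values
dbExecute(connexion, "UPDATE sites SET ph = ph + 0.1 WHERE ph < 6.8")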
Conclusion and References
As with the majority of Numérilab workshops, we have only scratched the surface of what is possible with SQL. I hope you enjoyed the fact that our database handled queries over millions of rows without even blinking. Relational databases and SQL have their place in systems of every scale, from a few MB of data for a master’s project to several PB of content for a world-class transactional website.
If you want to learn more about SQL, the resource I’d recommend (and the one that got me going again after a few years away from SQL) is the W3Schools tutorial: https://www.w3schools.com/sql/.
I also strongly encourage you to explore SQLite, another lightweight local database like DuckDB, which is used as storage in many applications.