Introduction to relational databases and SQL
April 2023
Charles Martin
- What is a database?
- Data structure and normalization
- Overview of the available software packages
- DuckDB workspace preparation
- Introduction to the SQL language
What is a database?
In its simplest form, a database is a collection of data frames, which in database jargon are called tables. Each of these tables includes several columns (or variables), which are called fields in database vocabulary.
Unlike data frames that exist in the computer’s memory, tables are stored on the hard drive. They can therefore be much larger in size.
Also, unlike plain data frames, database tables are almost always indexed. That is, some work is done upstream, as data is added, modified or deleted, to speed up filtering and sorting when they are needed. Note that this feature is not unique to databases: the data.table library, for example, can also index its data.
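A minimal sketch with data.table (the toy columns here are made up for illustration):

library(data.table)

dt <- data.table(site = c("X", "X", "Y"), richesse = c(3, 2, 1))

# Build a secondary index on site: subsequent filters such as
# dt[site == "Y"] can use a fast binary search instead of a full scan.
setindex(dt, site)
dt[site == "Y"]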
Data structure and normalization
Unlike data frames, which are a fixed product generated at the end of data collection, databases are intended to be dynamic. They include a whole infrastructure to insert new data, update it, delete it, etc., in a safe way.
For these changes to go well, how we structure our data is of particular importance, even more critical than in a data frame. A whole field of study with its own terminology is dedicated to this subject: database normalization.
Obviously, we can only skim these techniques here, but it is nevertheless important to understand the main principle: never duplicate information.
For example, this data frame is perfect for R:
parcelle | site | richesse_parcelle | ph_site |
---|---|---|---|
A | X | 3 | 7.5 |
B | X | 2 | 7.5 |
C | X | 5 | 7.5 |
D | Y | 1 | 6.9 |
E | Y | 0 | 6.9 |
F | Y | 3 | 6.9 |
But it would not be appropriate in a relational database environment, where it should be replaced by the following tables:
sites
id | nom | ph |
---|---|---|
1 | X | 7.5 |
2 | Y | 6.9 |
parcelles
id | site_id | nom | richesse |
---|---|---|---|
1 | 1 | A | 3 |
2 | 1 | B | 2 |
3 | 1 | C | 5 |
4 | 2 | D | 1 |
5 | 2 | E | 0 |
6 | 2 | F | 3 |
Notice in each table the presence of an id column, which becomes the unique identifier for each observation. This identifier is called the primary key of the table.
The link between the two tables is specified by the site_id column, which in this jargon is called the foreign key.
The relationship here is thus a 1:N, or one-to-many, relationship. A site can have multiple plots, but each plot belongs to only a single site.
There are several other types of relationships, for example N:N (many-to-many) relationships. This type of relationship would be appropriate to store the list of species present in each plot:
especes
id | nom |
---|---|
1 | perchaude |
2 | meunier |
presences
parcelle_id | espece_id |
---|---|
1 | 1 |
1 | 3 |
Note that to define an N:N relationship, you must create an intermediate table containing only the identifiers associating the two tables. That is the role the presences table plays here, connecting the especes table and the parcelles table in an N:N relationship.
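To make this structure concrete, here is a sketch of how such a schema could be declared in SQL; the column types are assumptions for illustration:

CREATE TABLE sites (
    id INTEGER PRIMARY KEY,   -- primary key
    nom VARCHAR,
    ph DOUBLE
);

CREATE TABLE parcelles (
    id INTEGER PRIMARY KEY,
    site_id INTEGER REFERENCES sites (id),  -- foreign key to sites
    nom VARCHAR,
    richesse INTEGER
);

CREATE TABLE especes (
    id INTEGER PRIMARY KEY,
    nom VARCHAR
);

-- intermediate table defining the N:N relationship
CREATE TABLE presences (
    parcelle_id INTEGER REFERENCES parcelles (id),
    espece_id INTEGER REFERENCES especes (id)
);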
Overview of the available software packages
There are dozens, if not hundreds, of relational database software packages. They can generally be divided into three categories of use:
- Huge databases in the cloud, like Google BigQuery or Amazon Redshift. These solutions are extremely powerful because they can tap into a huge pool of resources when executing queries. They are also very expensive to use.
- Database servers, like MySQL, PostgreSQL, SQL Server, etc. These solutions allow multiple users to query the same data simultaneously. They are an excellent client-server solution when several users must work with the same data at the same time.
- Embedded databases, such as SQLite and DuckDB. These solutions run directly on your computer, greatly increasing your ability to crunch data while avoiding the configuration of a server infrastructure.
For this workshop, we will be using an embedded database, DuckDB. Its advantage for learning situations is that it is erased and recreated each time R starts. It is also extremely fast, often handling huge queries faster than its competitor SQLite.
However, almost everything we will see could be transferred to another software without problem.
DuckDB workspace preparation
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0 ✔ purrr 1.0.1
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(DBI)
library(duckdb)
All our operations on databases will go through the DBI (DataBase Interface) library, which takes care of transferring our requests to the database. However, as each database has its own particularities, we must specify at connection time which database we are connecting to. In our case, it will be DuckDB, so:
connexion <- DBI::dbConnect(duckdb::duckdb())
If our database were a PostgreSQL server, we could have built our connection like this:
connexion <- DBI::dbConnect(
  RPostgres::Postgres(),
  host = "adresse.com",  # a real connection would typically also need dbname, user and password
  port = 1234
)
Finally, in a real analysis project, it would probably be appropriate to give DuckDB a location on disk where it can store the database, rather than recreating it each time R starts:
connexion <- DBI::dbConnect(duckdb::duckdb(), dbdir = "donnees_projet")
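And when a session is over, the connection can be closed; a small sketch (the shutdown argument is specific to the duckdb driver and also stops the embedded engine):

# Close the connection at the end of a session
dbDisconnect(connexion, shutdown = TRUE)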
We will now create some tables and populate them to be able to start working. You can also download the dataset associated with this workshop here.
Note that the data are already well organized for import: each observation is labeled with a unique identifier and the database structure has already been normalized.
First, here’s how to create a DuckDB table from a CSV file:
duckdb_read_csv(connexion, "especes", "FakeData/especes.csv")
Since we have several tables to create, we can automate this process with the map function:
liste_tables <- c("especes","sites","parcelles","presences")
map(liste_tables, function(x){
duckdb_read_csv(connexion, x, str_c("FakeData/",x,".csv"))
})
[[1]]
[1] 166
[[2]]
[1] 26
[[3]]
[1] 1300000
[[4]]
[1] 1e+07
We can then verify that all our tables have been successfully created.
dbListTables(connexion)
[1] "especes" "parcelles" "presences" "sites"
The database we are going to work with contains 26 sites, in which a total of 1,300,000 plots have been inventoried. In each plot, the list of all species present (83 in total) was noted, for a total of 10,000,000 presences. These data are made up for the workshop, but they will let us test the speed and power of the database approach. Don’t worry about the execution speed: on my personal laptop, the whole workshop runs in less than 30 seconds…
Here, for information, is the list of all the fields in each of the tables that we created:
dbListFields(connexion,"especes")
[1] "id" "nom"
dbListFields(connexion,"sites")
[1] "id" "nom" "ph"
dbListFields(connexion,"parcelles")
[1] "id" "site_id" "nom" "couvert_pct"
dbListFields(connexion,"presences")
[1] "parcelle_id" "espece_id"
Note that each of the plots is associated with a site by the site_id column, which makes it a 1:N relationship (a site contains several plots, but each plot only belongs to one site).
The presences table only contains identifiers, one row for each species-plot combination that has been observed, describing an N:N relationship.
Introduction to the SQL language
SQL (Structured Query Language) is a (relatively) standardized computer language for querying databases. I say “relatively” because, while most of the language is used the same way in every database software, each of them also adds its own particularities.
SQL defines three main families of commands, those for:
- Extracting data (making queries)
- Defining data (inserting, modifying, deleting, etc.)
- Controlling operations (granting access, validating transactions, etc.)
In the vast majority of cases, it is possible to connect directly to the database from a terminal to send queries. However, with the exception of maintenance operations, this way of working is rarely used. Queries are usually sent programmatically, in the language of our choice, here R.
Queries
To send queries that extract data, we will use the dbGetQuery function from the DBI library. This function expects two things: a connection to a database and the SQL command to execute.
Select what, from which table?
The basis of SQL queries is the SELECT command, which chooses the fields we want to extract, followed by FROM, which specifies the table on which to perform the extraction. For example, to get the names of all sites:
dbGetQuery(connexion, "SELECT nom FROM sites")
nom
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J
11 K
12 L
13 M
14 N
15 O
16 P
17 Q
18 R
19 S
20 T
21 U
22 V
23 W
24 X
25 Y
26 Z
The dbGetQuery function always returns a data.frame object, which can be retrieved and used as a normal R data.frame:
x <- dbGetQuery(connexion, "SELECT nom FROM sites")
x %>%
slice(1:10) %>%
arrange(desc(nom))
nom
1 J
2 I
3 H
4 G
5 F
6 E
7 D
8 C
9 B
10 A
Note that the SELECT and FROM keywords have been written here in uppercase. Most database software doesn’t care whether commands are sent in lowercase or uppercase: SeLeCt, SELECT and select would give exactly the same result. By convention, however, it is customary to write the keywords (SELECT, FROM, GROUP BY, etc.) in uppercase and the field names in lowercase.
Rather than specifying each of the fields one by one, you can also use an *, which acts as a wildcard indicating that we want all the fields of this table:
dbGetQuery(connexion, "SELECT * FROM sites")
id nom ph
1 1 A 7.169245
2 2 B 6.954659
3 3 C 6.798699
4 4 D 6.746626
5 5 E 7.241875
6 6 F 6.970178
7 7 G 6.828074
8 8 H 6.692542
9 9 I 7.213349
10 10 J 7.310457
11 11 K 6.945442
12 12 L 7.693323
13 13 M 7.358573
14 14 N 7.196335
15 15 O 6.865845
16 16 P 7.118721
17 17 Q 6.759258
18 18 R 7.125312
19 19 S 7.078200
20 20 T 6.708648
21 21 U 6.577113
22 22 V 7.569592
23 23 W 6.879513
24 24 X 6.655040
25 25 Y 6.991663
26 26 Z 6.876635
Applying filters
Obviously, we don’t always want to retrieve everything in a table; sometimes we want only certain information. For example, only the sites with a pH of 7 or higher:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 12 L 7.693323
6 13 M 7.358573
7 14 N 7.196335
8 16 P 7.118721
9 18 R 7.125312
10 19 S 7.078200
11 22 V 7.569592
You can also combine multiple conditions with AND and OR:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7 AND ph < 7.4")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 13 M 7.358573
6 14 N 7.196335
7 16 P 7.118721
8 18 R 7.125312
9 19 S 7.078200
Sorting
You can change the order in which the results will appear by adding an ORDER BY clause to our query, like this:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7 ORDER BY ph")
id nom ph
1 19 S 7.078200
2 16 P 7.118721
3 18 R 7.125312
4 1 A 7.169245
5 14 N 7.196335
6 9 I 7.213349
7 5 E 7.241875
8 10 J 7.310457
9 13 M 7.358573
10 22 V 7.569592
11 12 L 7.693323
You can sort in descending rather than ascending order by adding the DESC modifier:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7 ORDER BY ph DESC")
id nom ph
1 12 L 7.693323
2 22 V 7.569592
3 13 M 7.358573
4 10 J 7.310457
5 5 E 7.241875
6 9 I 7.213349
7 14 N 7.196335
8 1 A 7.169245
9 18 R 7.125312
10 16 P 7.118721
11 19 S 7.078200
Limiting the amount of information received
If you build dashboards, reports, etc., you may not need the whole set of rows returned by a query, only the first ones. The LIMIT clause truncates the results, for example keeping only the first 5 rows:
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5")
id nom ph
1 21 U 6.577113
2 24 X 6.655040
3 8 H 6.692542
4 20 T 6.708648
5 4 D 6.746626
The LIMIT clause will become particularly important in queries to come, which could return millions of rows…
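A related sketch: LIMIT can be paired with OFFSET to page through results, for example to fetch the next 5 sites after the first 5:

# OFFSET 5 skips the first 5 rows; LIMIT 5 returns the following 5
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5 OFFSET 5")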
Connecting multiple tables together
Since our data is well organized, with a normalized structure, we can easily exploit the relations between the different tables.
As a first scenario, let’s imagine that we would like to obtain all the plots located in sites with a pH of 7 or more.
First step, prepare a query to find these sites:
dbGetQuery(connexion, "SELECT * FROM sites WHERE ph >= 7")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 12 L 7.693323
6 13 M 7.358573
7 14 N 7.196335
8 16 P 7.118721
9 18 R 7.125312
10 19 S 7.078200
11 22 V 7.569592
To make our life easier, we are going to introduce the concept of SQL aliases, which allow us to rename a table within a query to make it easier to refer to. Note that an alias only exists for the duration of that particular query; in the next query, we will have to define it again if we want to use it once more.
We will also take this opportunity to clarify that the pH and the * that interest us are those of the sites table:
dbGetQuery(connexion, "SELECT s.* FROM sites AS s WHERE s.ph >= 7")
id nom ph
1 1 A 7.169245
2 5 E 7.241875
3 9 I 7.213349
4 10 J 7.310457
5 12 L 7.693323
6 13 M 7.358573
7 14 N 7.196335
8 16 P 7.118721
9 18 R 7.125312
10 19 S 7.078200
11 22 V 7.569592
Now, we can connect this query with the parcelles table:
dbGetQuery(connexion, "
SELECT s.*, p.*
FROM sites AS s
LEFT JOIN parcelles AS p
ON p.site_id = s.id
WHERE s.ph >= 7
LIMIT 10
")
id nom ph id site_id nom couvert_pct
1 5 E 7.241875 122881 5 megvgtlavc 60.040527
2 9 I 7.213349 122885 9 vnamlyqykq 10.955212
3 10 J 7.310457 122886 10 voynebxemt 87.516474
4 12 L 7.693323 122888 12 dcwdppalwt 91.731660
5 13 M 7.358573 122889 13 wdphqdvvnt 23.368555
6 14 N 7.196335 122890 14 vcvtjgthjp 3.024725
7 16 P 7.118721 122892 16 pjskhveidz 71.662687
8 18 R 7.125312 122894 18 lexdywvieo 91.496724
9 19 S 7.078200 122895 19 jlvqsnbyvy 61.266997
10 22 V 7.569592 122898 22 kfamkmimao 74.941283
Notice that we need two clauses to connect the two tables together.
The first is the LEFT JOIN. There are different types of “joins” in SQL, “left” being the most common. In a LEFT JOIN, all rows from the left table (here the sites) will be present, and only the matching rows from the right table will be attached.
As SQL does not read minds, we must also add an ON clause to specify on which fields to connect the two tables. Here, we specify that the site_id column of the parcelles table must match the id column of the sites table.
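For comparison, here is a sketch of the same query written with an INNER JOIN, which would keep only the sites having at least one matching plot (with our data, every site has plots, so the result would be identical):

dbGetQuery(connexion, "
  SELECT s.*, p.*
  FROM sites AS s
  INNER JOIN parcelles AS p
  ON p.site_id = s.id
  WHERE s.ph >= 7
  LIMIT 10
")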
I used the LIMIT 10 clause here to avoid hundreds of pages of results; in a real-life scenario, it would not necessarily be needed.
One particularly counter-intuitive thing here is that we can use information coming from the parcelles table in the SELECT clause, even though at that point we have not yet mentioned that we want to join with it. SQL code is not interpreted sequentially: the query is deconstructed and interpreted in an optimal way, as a whole.
Notice also that although we asked for all the info from both tables (s.*, p.*), this is not mandatory. We could also have been more selective:
dbGetQuery(connexion, "
SELECT s.nom AS nom_site, s.ph, p.nom AS nom_parcelle, p.couvert_pct
FROM sites AS s
LEFT JOIN parcelles AS p
ON p.site_id = s.id
WHERE s.ph >= 7
LIMIT 10
")
nom_site ph nom_parcelle couvert_pct
1 A 7.169245 yjsvyluwyw 13.35865
2 E 7.241875 dvbqleheht 42.34773
3 I 7.213349 dspjaogwaf 53.68863
4 J 7.310457 wenvtalrhm 33.90705
5 L 7.693323 yiufghvqze 54.29557
6 M 7.358573 segjczoozw 46.62623
7 N 7.196335 bkbedboyii 97.87763
8 P 7.118721 esqsvztcos 13.13683
9 R 7.125312 vzozqzgsnw 19.40764
10 S 7.078200 xnvlnblopy 42.59576
Note here that we can also use aliases in the SELECT clause of our query.
Performing grouped operations
SQL also contains a whole series of keywords allowing us to carry out operations by group.
Knowing that COUNT returns the number of rows matched by a query, e.g.:
dbGetQuery(connexion,"SELECT COUNT(id) FROM sites")
count(id)
1 26
We can ask, for example, how many times each species was observed:
dbGetQuery(connexion,"
SELECT COUNT(p.espece_id)
FROM presences AS p
GROUP BY p.espece_id
LIMIT 10
")
count(p.espece_id)
1 75840
2 22797
3 169707
4 160021
5 198948
6 177248
7 118475
8 87996
9 175962
10 27476
Clearly, this output is not very useful on its own, but it becomes much more informative once connected to the especes table:
dbGetQuery(connexion,"
SELECT COUNT(p.espece_id) AS nb_observations, ANY_VALUE(e.nom) AS nom
FROM presences AS p
LEFT JOIN especes AS e ON e.id = p.espece_id
GROUP BY e.id
ORDER BY nb_observations DESC
LIMIT 10
")
nb_observations nom
1 439378 African giant pouched rat
2 439314 Thirteen-lined ground squirrel
3 436888 Little brown bat
4 434872 Musk shrew
5 413442 Giraffe
6 412510 African striped mouse
7 409380 Round-tailed muskrat
8 408834 African elephant
9 402402 Arctic ground squirrel
10 397896 Cow
Notice that we had to wrap e.nom in the ANY_VALUE function. When we use grouping operators, every value in the SELECT clause must either undergo an aggregation operation (e.g. COUNT, MIN, MAX, AVG, FIRST, LAST, ANY_VALUE, etc.) or be part of the GROUP BY clause.
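Alternatively, grouping directly by e.nom makes the name part of the GROUP BY clause, so ANY_VALUE is no longer needed; a sketch (assuming species names are unique):

dbGetQuery(connexion,"
  SELECT e.nom, COUNT(*) AS nb_observations
  FROM presences AS p
  LEFT JOIN especes AS e ON e.id = p.espece_id
  GROUP BY e.nom
  ORDER BY nb_observations DESC
  LIMIT 10
")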
Views and subqueries
Finally, as a last example of aggregation, we will calculate the average species richness per plot for each site. This will be the most complex query in this workshop!
First, let’s build the query that calculates the number of species per plot:
dbGetQuery(connexion,"
SELECT
ANY_VALUE(pa.nom) AS nom,
ANY_VALUE(pa.id) AS id,
ANY_VALUE(pa.site_id) AS site_id,
COUNT(DISTINCT(pr.espece_id)) AS richesse
FROM parcelles AS pa
LEFT JOIN presences as pr ON pr.parcelle_id = pa.id
GROUP BY pa.id
LIMIT 10
")
nom id site_id richesse
1 nldjabkezk 1264584 22 8
2 wanjivedhu 1272225 19 14
3 ggswzbuius 72442 6 21
4 xbjkkbamow 114094 6 6
5 gxdsdbdicm 1287497 3 10
6 dokcidjdqi 814664 6 4
7 zskfrdbata 145840 6 16
8 fccafsyimm 1274512 18 11
9 owgxsxcczv 467450 22 10
10 ymmbyupkzm 186856 20 14
At this point, it is extremely convenient to know that relational databases also contain objects called views. These are a way to save a query and associate a name with it, so it can be reused later.
You can create a view from the above query using the CREATE VIEW command:
dbExecute(connexion,"
CREATE VIEW v_richesses AS
SELECT
ANY_VALUE(pa.nom) AS nom,
ANY_VALUE(pa.id) AS id,
ANY_VALUE(pa.site_id) AS site_id,
COUNT(DISTINCT(pr.espece_id)) AS richesse
FROM parcelles AS pa
LEFT JOIN presences as pr ON pr.parcelle_id = pa.id
GROUP BY pa.id")
[1] 0
This view can then be used exactly as if it were a table :
dbGetQuery(connexion, "SELECT * FROM v_richesses LIMIT 5")
nom id site_id richesse
1 nolxmnqmdx 1058277 25 11
2 cwbbnvcniq 1285844 14 4
3 fzlzionnwq 846829 9 12
4 xiewxpzcrz 635508 16 11
5 jorditqaqa 1042151 19 12
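Since a view behaves like a table, all the clauses seen earlier (WHERE, ORDER BY, LIMIT, etc.) apply to it as well; a quick sketch with an arbitrary threshold:

# Views accept the same clauses as tables
dbGetQuery(connexion, "SELECT * FROM v_richesses WHERE richesse >= 15 LIMIT 5")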
To calculate the mean species richness per site, we then simply connect the two tables with a join clause:
dbGetQuery(connexion, "
SELECT sites.id, ANY_VALUE(sites.nom) AS nom, ANY_VALUE(sites.ph) AS ph, MEAN(richesse) AS richesse_moyenne
FROM sites
LEFT JOIN v_richesses
ON v_richesses.site_id = sites.id
GROUP BY sites.id
LIMIT 5
")
id nom ph richesse_moyenne
1 1 A 7.169245 7.12628
2 2 B 6.954659 7.13368
3 3 C 6.798699 7.09738
4 4 D 6.746626 7.13422
5 5 E 7.241875 7.11826
Note that, as an alternative strategy, one could have used a subquery to reach the same result, but it makes the query much harder to read:
dbGetQuery(connexion,"
SELECT ANY_VALUE(s.nom) AS nom, ANY_VALUE(s.ph) AS ph, AVG(p.richesse) AS richesse_moyenne
FROM sites AS s
LEFT JOIN (
SELECT
ANY_VALUE(pa.nom) AS nom,
ANY_VALUE(pa.id) AS id,
ANY_VALUE(pa.site_id) AS site_id,
COUNT(DISTINCT(pr.espece_id)) AS richesse
FROM parcelles AS pa
LEFT JOIN presences as pr ON pr.parcelle_id = pa.id
GROUP BY pa.id
) AS p ON p.site_id = s.id
GROUP BY s.id
ORDER BY richesse_moyenne DESC
")
nom ph richesse_moyenne
1 L 7.693323 7.15808
2 X 6.655040 7.15736
3 K 6.945442 7.14678
4 Q 6.759258 7.13844
5 R 7.125312 7.13674
6 D 6.746626 7.13422
7 B 6.954659 7.13368
8 M 7.358573 7.13298
9 V 7.569592 7.13278
10 I 7.213349 7.12652
11 A 7.169245 7.12628
12 F 6.970178 7.12502
13 Z 6.876635 7.12202
14 E 7.241875 7.11826
15 U 6.577113 7.11738
16 O 6.865845 7.11498
17 N 7.196335 7.10808
18 G 6.828074 7.10546
19 W 6.879513 7.09800
20 C 6.798699 7.09738
21 S 7.078200 7.09358
22 H 6.692542 7.08934
23 T 6.708648 7.08824
24 Y 6.991663 7.07886
25 J 7.310457 7.07566
26 P 7.118721 7.06196
Inserting, modifying and deleting data
As part of your work as a biologist, you will probably use a relatively static database, created as we have shown above, directly from CSV files.
Nevertheless, it is interesting to know the full capabilities of a relational database system. In particular, how to add, modify and delete data.
Inserting data
As we discussed earlier, it is very important in a relational database that each row in a table can be uniquely identified.
Some databases, such as MySQL, allow you to specify, when creating a table, that a unique identifier should be generated automatically each time a row is added.
With DuckDB, it will take a little more work to achieve the same result. Before performing the first insert, we will need to create a sequence type object, which will allow us to obtain the next id value to use when we insert.
For this section, we will send our commands not with the dbGetQuery function but with the dbExecute function, since these SQL commands are not queries designed to return data.
So, let’s first see how to create a sequence for the sites table, by specifying that the first value of the sequence should be the maximum site id + 1.
Note that unlike self-incrementing ids, this way of doing things is rather risky if several people work on the same database simultaneously: after we retrieve the maximum id, a new insert could occur and change that maximum. Moreover, someone could insert a row without using the sequence, etc.
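As an aside, one partial safeguard is to group related statements in a transaction, which DBI exposes directly; a minimal sketch:

# Start a transaction: the statements inside succeed or fail as a unit
dbBegin(connexion)
# ... INSERT / UPDATE / DELETE statements would go here ...
dbCommit(connexion)   # or dbRollback(connexion) to cancel everything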
So first, what is the largest id in our sites table?
dbGetQuery(connexion,"SELECT MAX(id) FROM sites")
max(id)
1 26
Step two: create the sequence, reusing that number:
dbExecute(connexion, "CREATE SEQUENCE sequence_sites START 27")
[1] 0
The sequence can now be used in any insert operation:
dbExecute(connexion, "INSERT INTO sites VALUES (nextval('sequence_sites'), 'site ajouté', 7.0)")
[1] 1
dbExecute(connexion, "INSERT INTO sites VALUES (nextval('sequence_sites'), 'deuxième ajout', 6.5)")
[1] 1
We can now peek into the sites table to see that the two new sites have indeed been added, each with a unique id:
dbGetQuery(connexion,"SELECT * FROM sites ORDER BY id DESC LIMIT 10")
id nom ph
1 28 deuxième ajout 6.500000
2 27 site ajouté 7.000000
3 26 Z 6.876635
4 25 Y 6.991663
5 24 X 6.655040
6 23 W 6.879513
7 22 V 7.569592
8 21 U 6.577113
9 20 T 6.708648
10 19 S 7.078200
Knowing this, one could be tempted to insert data using existing R objects, a bit like this:
nom <- "Site de Charles"
pH <- 7
dbExecute(connexion, paste0(
"INSERT INTO sites VALUES (nextval('sequence_sites'), '",
nom,
"', ",
pH,
")"
))
[1] 1
Note first that this kind of operation requires a lot of concentration, so as not to mix up the double quotes that delimit our strings in R with the single quotes that delimit strings in SQL.
Also note that if our site name happens to contain an apostrophe, we will get an SQL error:
nom <- "L'autre site"
pH <- 7
dbExecute(connexion, paste0(
"INSERT INTO sites VALUES (nextval('sequence_sites'), '",
nom,
"', ",
pH,
")"
))
Error: Parser Error: syntax error at or near "autre"
LINE 1: ... VALUES (nextval('sequence_sites'), 'L'autre site', 7)
^
This kind of issue has given rise to many cases of website hacking since the early days of the web, under the name SQL injection. For example, imagine a form in which a malicious visitor would enter A'; DELETE FROM users; -- or A'; INSERT INTO admin VALUES ('pirate', 'piratepassword'); -- as their username or password!
For these two reasons, it is now recommended to always use the protection functions provided with the library that manages your database, in our case the DBI + duckdb combination. The idea behind these protection functions is to escape all the potentially dangerous elements of a query, to avoid problems, whether accidental or malicious.
For this, we work in two stages: first we prepare the skeleton of our query, and then, in a second step, we ask DBI to make the replacements.
In the query skeleton, every element that will come from outside is preceded by a ?:
valeur_tannante <- "L'apostrophe de malheur"
squelette <- "INSERT INTO sites VALUES (nextval('sequence_sites'),?nom,?ph)"
Then, the sqlInterpolate function builds the actual query, filling in the value of each placeholder:
requete <- sqlInterpolate(
connexion,
squelette,
nom="L'autre site",
ph=6.5
)
print(requete)
<SQL> INSERT INTO sites VALUES (nextval('sequence_sites'),'L''autre site',6.5)
dbExecute(connexion,requete)
[1] 1
This procedure should be followed every time some part of the query comes from an R object, even for simple SELECT statements:
squelette <- "SELECT * FROM sites WHERE nom = ?nom"
requete <- sqlInterpolate(
connexion,
squelette,
nom="L'autre site"
)
print(requete)
<SQL> SELECT * FROM sites WHERE nom = 'L''autre site'
dbGetQuery(connexion, requete)
id nom ph
1 30 L'autre site 6.5
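As an alternative to sqlInterpolate, DBI can also bind values through the params argument, using positional ? placeholders; a sketch, which finally puts to use the valeur_tannante string defined above:

# The values travel separately from the SQL text,
# so the apostrophe cannot break the query
dbExecute(
  connexion,
  "INSERT INTO sites VALUES (nextval('sequence_sites'), ?, ?)",
  params = list(valeur_tannante, 7.2)
)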
Deleting data
Deleting data in SQL is done with the… DELETE command!
First, let’s make sure we have some sites where pH is below 6.6:
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5")
id nom ph
1 28 deuxième ajout 6.500000
2 30 L'autre site 6.500000
3 21 U 6.577113
4 24 X 6.655040
5 8 H 6.692542
To delete every site where pH < 6.6, you could run the following query:
dbExecute(connexion,"DELETE FROM sites WHERE ph < 6.6")
[1] 3
Let’s now verify that the sites were indeed deleted:
dbGetQuery(connexion, "SELECT * FROM sites ORDER BY ph LIMIT 5")
id nom ph
1 24 X 6.655040
2 8 H 6.692542
3 20 T 6.708648
4 4 D 6.746626
5 17 Q 6.759258
In databases like MySQL, PostgreSQL, etc., such an operation could also have cascaded, deleting all the plots associated with the site, all the presences associated with those plots, and so on, if the foreign keys had been declared with that behavior. Since our DuckDB tables were created from plain CSV imports, without explicit foreign-key constraints, this cascade did not occur:
dbGetQuery(connexion, "SELECT * FROM parcelles WHERE site_id = 21 LIMIT 5")
id site_id nom couvert_pct
1 21 21 otuzagswlo 55.11549
2 47 21 orqxestuqr 58.59047
3 73 21 vytwvjjvcr 47.13821
4 99 21 odicuudsgl 42.04843
5 125 21 izpwayhfaq 78.15716
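If we wanted the cascade, we would have to perform it ourselves, deleting from the “many” side first. A sketch for the orphaned plots of site 21:

# First remove the presences attached to the plots of site 21...
dbExecute(connexion, "
  DELETE FROM presences
  WHERE parcelle_id IN (SELECT id FROM parcelles WHERE site_id = 21)
")
# ...then remove the plots themselves
dbExecute(connexion, "DELETE FROM parcelles WHERE site_id = 21")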
Updating data
Finally, data can be updated with the UPDATE command.
For example, if we made a mistake and all observations for plot 3 were in fact supposed to be for plot 4, one could write:
dbExecute(connexion,"UPDATE presences SET parcelle_id = 4 WHERE parcelle_id = 3")
[1] 7
And then verify that it actually worked:
dbGetQuery(connexion, "
SELECT parcelle_id, COUNT(*)
FROM presences
WHERE parcelle_id = 3 OR parcelle_id = 4
GROUP BY parcelle_id
")
parcelle_id count_star()
1 4 10
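Note that SET is not limited to constants: the new value can be computed from the current one. A sketch (this +0.1 recalibration is made up for illustration):

# The right-hand side of SET can reference the existing column values
dbExecute(connexion, "UPDATE sites SET ph = ph + 0.1 WHERE ph < 6.8")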
Conclusion and References
As with the majority of Numérilab workshops, we have only scratched the surface of what is possible with SQL. I hope you enjoyed the fact that our database handled queries over millions of rows without even blinking. Relational databases and SQL have their place in systems of every scale, from a few MB of data for a master’s project to several PB of content for a world-class transactional website.
If you want to learn more about SQL, the resource I’d recommend (and the one that got me going again after a few years away from SQL) is the W3Schools tutorial: https://www.w3schools.com/sql/.
I also strongly encourage you to explore SQLite, another lightweight local database like DuckDB, which is used as storage in many applications.