String manipulation in R

Charles Martin

March 2023

Required Libraries

This workshop requires recent versions of the readr (minimum version 2.1), stringr (minimum version 1.5), tidyr (minimum version 1.3) and dplyr packages. You can either load them individually, or load the tidyverse meta-package:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

How is text constructed in R

Before embarking on the manipulation of text per se, it is important to understand the nature of text in R and how to construct it.

How text is encoded in the computer

In R’s memory, in CSV files, etc., text is stored as a series of bytes. We can see the corresponding codes, displayed as hexadecimal, with the charToRaw function:

charToRaw("Charles")
[1] 43 68 61 72 6c 65 73

The “C” is encoded with 43, the “h” with 68, etc.

If we do the math, we quickly realize that this system can only represent 16 × 16 = 256 symbols. This worked well at a time when Americans dominated computing, but it is hopeless for encoding text from every language in the world.

In the 1980s-1990s, a series of standards were developed to allow the encoding of characters other than English. Among other things, ISO-8859-1 (Latin1) allowed the encoding of most characters used in Western Europe, Latin2 in Eastern Europe, etc.

Today, there is an international standard, called UTF-8, which makes it possible to encode all imaginable characters, even Emojis!
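We can check with charToRaw that UTF-8 uses more than one byte per character when needed (the exact bytes shown assume your R session itself runs in UTF-8):

```r
# In a UTF-8 session, accented letters take two bytes
charToRaw("é")   # c3 a9

# and an emoji takes four
charToRaw("💩")  # f0 9f 92 a9
```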

Most modern software uses this encoding and the use of accents in files is no longer a problem. However, some older applications and some files produced before that era do not necessarily adhere to this convention.

It is for this reason that sometimes, when loading a CSV file, you will see a series of weird characters in your data:

read_csv("Latin1.csv")
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Col3
dbl (2): Col1, Col2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
   Col1  Col2 Col3            
  <dbl> <dbl> <chr>           
1     1     2 "All\xf4"       
2     3     4 "\xc0 la place?"

With trial and error, we can try to guess the correct encoding and specify it at load time:

read_csv("Latin1.csv",locale = locale(encoding = "Windows-1252"))
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Col3
dbl (2): Col1, Col2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
   Col1  Col2 Col3       
  <dbl> <dbl> <chr>      
1     1     2 Allô       
2     3     4 À la place?
read_csv("Latin1.csv",locale = locale(encoding = "Latin1"))
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Col3
dbl (2): Col1, Col2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
   Col1  Col2 Col3       
  <dbl> <dbl> <chr>      
1     1     2 Allô       
2     3     4 À la place?

Usually, if you try Latin1, UTF-8 or Windows-1252, you are almost certain to stumble on the right one.

Recent versions of the readr library now come with a feature that lets the computer do the dirty work for you:

guess_encoding("Latin1.csv")
# A tibble: 2 × 2
  encoding   confidence
  <chr>           <dbl>
1 ISO-8859-1       0.62
2 ISO-8859-2       0.41

Manual text creation

Now, how does one create text in R?

The easiest way is to create a string object, like this:

chaine1 <- "Il faut l'essayer"
chaine2 <- 'Voici un autre "essai"'

Note that we can use either single or double quotes to start and end our string.

In general, it is recommended to use double quotes, unless your string itself contains double quotes.

You can also include a double quote in a string delimited by double quotes, by using an escape character, i.e. the backslash (\):

chaine3 <- "Je contient un \" et ça fonctionne tout de même"

Since the backslash itself is the escape character, if you want to produce an actual backslash in a string, you must precede it with another backslash:

backslash <- "\\"

Note that if you send one of these strings to the console, you will see the escape character:

chaine3
[1] "Je contient un \" et ça fonctionne tout de même"
backslash
[1] "\\"

This happens because, by default, R’s print function (called implicitly) gives us not what it sees as text, but what we should type to reconstruct it.

If we want to see the true representation of the text (as we will see in graphics, etc.), we can use the str_view function. Compare these two outputs:

print(c(chaine1, chaine2, chaine3))
[1] "Il faut l'essayer"                              
[2] "Voici un autre \"essai\""                       
[3] "Je contient un \" et ça fonctionne tout de même"
str_view(c(chaine1, chaine2, chaine3))
[1] │ Il faut l'essayer
[2] │ Voici un autre "essai"
[3] │ Je contient un " et ça fonctionne tout de même

It is also possible to create so-called “raw” character strings, for which, when they are created, R does not try to manage escape characters. To do this, we must start our string with r"( and end it with )"

complexe <- r"(L'apostrophe, le \ et même les "guillemets" ne posent plus de problèmes)"
str_view(complexe)
[1] │ L'apostrophe, le \ et même les "guillemets" ne posent plus de problèmes

As needed r"()" can be replaced with r"[]", r"{}", etc.
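For example, if the text itself happens to contain the sequence )", switching to square-bracket delimiters avoids any ambiguity. A small sketch:

```r
library(stringr)

# The string contains )", so r"(...)" would end too early;
# r"[...]" only ends at ]"
delicat <- r"[Cette chaîne contient )" sans problème]"
str_view(delicat)
```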

Escape characters

Besides \" \' and \\, there are a series of other special characters when constructing text in R.

Among others:

str_view(c(
  "Saut\nde\nligne",
  "avec\tindentation",
  "\u00b5 mu",
  "\U0001f4a9 (sans commentaires)"
))
[1] │ Saut
    │ de
    │ ligne
[2] │ avec{\t}indentation
[3] │ µ mu
[4] │ 💩 (sans commentaires)

You can type ?'"' at the R console for more details on the possibilities of escape characters.

Creating text by programming

To create and combine text programmatically, there are three functions in the tidyverse: str_c and str_glue, which work element-wise on objects (in mutate, etc.), and str_flatten, for cases where we want to summarize a vector of text into a single string.

The str_c function works like the c function, but it concatenates chunks of text into strings, rather than combining elements into a vector.

str_c("a","b","c")
[1] "abc"
str_c("Salut ",c("Charles","Vincent"))
[1] "Salut Charles" "Salut Vincent"

Its operation is very similar to the paste0 function, but its handling of missing values and of vectors of different lengths (see the previous example) is more consistent with the rest of the tidyverse.
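To illustrate the difference with missing values (a quick comparison; paste0 comes from base R):

```r
library(stringr)

# paste0 silently converts NA into the text "NA"
paste0("Salut ", NA)   # "Salut NA"

# str_c propagates the missing value, like other tidyverse functions
str_c("Salut ", NA)    # NA
```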

Each element passed to str_c can of course be an object containing text, rather than the text itself:

nom <- "Charles"
moment <- "aujourd'hui"
str_c("Bonjour ", nom, "! Comment allez-vous ", moment, "?")
[1] "Bonjour Charles! Comment allez-vous aujourd'hui?"

Although practical, this approach can become tedious when you have several pieces of text to put together: each time, you have to close the quotation marks, add a comma, reopen the quotation marks, and so on, without forgetting anything.

This is where the str_glue function comes in:

str_glue("Bonjour {nom}, comment allez-vous {moment}?")
Bonjour Charles, comment allez-vous aujourd'hui?

R will automatically replace each word enclosed in braces with the contents of the variable of the same name.

noms <- c("Pierre","Paul","Jacques")
str_glue("Bonjour {noms}, comment allez-vous {moment}?")
Bonjour Pierre, comment allez-vous aujourd'hui?
Bonjour Paul, comment allez-vous aujourd'hui?
Bonjour Jacques, comment allez-vous aujourd'hui?

As you can see, str_c and str_glue work great in contexts like a mutate, where the function needs to produce a series of strings in response to a vector.
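For instance, inside a mutate (here on a small made-up tibble, not data from the workshop):

```r
library(tidyverse)

# Hypothetical data: build one label per row with str_glue()
visites <- tibble(
  nom = c("Pierre", "Paul"),
  sommeil = c(8.5, 7.2)
)

visites %>%
  mutate(etiquette = str_glue("{nom} dort {sommeil} heures"))
```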

If our operation must imperatively return a single string, then we must use the str_flatten function:

str_flatten(noms, ", ")
[1] "Pierre, Paul, Jacques"

You can also control the last separator to make nice enumerations:

str_flatten(noms,", ",last = " et ")
[1] "Pierre, Paul et Jacques"
habitats <- tribble(
  ~Espece,~Habitat,
  "A","désert",
  "A","forêt",
  "B", "forêt",
  "A", "tundra"
)
habitats %>% 
  group_by(Espece) %>% 
  summarize(
    Nombre = n(),
    Liste = str_flatten(Habitat,", ")
  )
# A tibble: 2 × 3
  Espece Nombre Liste                
  <chr>   <int> <chr>                
1 A           3 désert, forêt, tundra
2 B           1 forêt                

Text extraction

Another extremely common task in our R analyses is having to extract textual information from existing variables.

The tidyr library provides a series of functions designed expressly for this type of situation: the separate_ family, which includes 4 functions, namely separate_wider_delim, separate_wider_position, separate_longer_delim and separate_longer_position.

In all cases, the functions will analyze the text contained in a variable, and separate it into pieces. Either by creating one observation per piece (_longer), or by creating one column per piece (_wider).

The splitting of the pieces can be based either on a separator (_delim) or on the position of the text (_position).

These functions are very useful, for example, when you have encoded information in the site names of your experiment:

experience_a <- tibble(
  nom_site = c("CT01","CT02","CT03","TR01","TR02","TR03")
)
experience_a
# A tibble: 6 × 1
  nom_site
  <chr>   
1 CT01    
2 CT02    
3 CT03    
4 TR01    
5 TR02    
6 TR03    

All you have to do is provide the function with the number of characters in each of the pieces, along with the name each column should take in the resulting data frame.

experience_a %>% 
  separate_wider_position(nom_site,c(traitement = 2, no_replicat = 2) )
# A tibble: 6 × 2
  traitement no_replicat
  <chr>      <chr>      
1 CT         01         
2 CT         02         
3 CT         03         
4 TR         01         
5 TR         02         
6 TR         03         

If our data had instead been in this format:

experience_b <- tibble(
  nom_site = c("CT-1","CT-10","R-1","R-100")
)
experience_b
# A tibble: 4 × 1
  nom_site
  <chr>   
1 CT-1    
2 CT-10   
3 R-1     
4 R-100   

we could have extracted them based on the presence of the separator, like this:

experience_b %>% 
  separate_wider_delim(nom_site,delim = "-",names = c("traitement","no_replicat"))
# A tibble: 4 × 2
  traitement no_replicat
  <chr>      <chr>      
1 CT         1          
2 CT         10         
3 R          1          
4 R          100        

Finally, it could happen that the information of several observations is encoded in the same cell:

experience_c <- tibble(
  site = c("A","B"),
  resultats_visites = c("0,0,1","1,0,1"),
  latitude = c(46,47),
  longitude = c(-72,-72.5)
)
experience_c
# A tibble: 2 × 4
  site  resultats_visites latitude longitude
  <chr> <chr>                <dbl>     <dbl>
1 A     0,0,1                   46     -72  
2 B     1,0,1                   47     -72.5

We can then use a _longer function to recreate each observation:

experience_c %>% 
  separate_longer_delim(resultats_visites, delim = ",")
# A tibble: 6 × 4
  site  resultats_visites latitude longitude
  <chr> <chr>                <dbl>     <dbl>
1 A     0                       46     -72  
2 A     0                       46     -72  
3 A     1                       46     -72  
4 B     1                       47     -72.5
5 B     0                       47     -72.5
6 B     1                       47     -72.5

For simpler operations, you can also fetch snippets of text directly with the str_sub function.

It allows you to use positive position numbers to extract from the beginning of the string, and negative ones to extract from the end:

str_sub("LongTexte",5,9)
[1] "Texte"
str_sub("LongTexte",-5, -1)
[1] "Texte"

Regular expressions (regexes)

Now that we have seen most of the functions for processing text programmatically, we are going to tackle a second way: regular expressions, commonly called by their acronym: regex (singular) or regexes (plural)!

Regexes are sequences of characters that allow you to define a search pattern, in a very efficient way.

The price to pay for this efficiency is that regexes are sometimes difficult to read and debug. Some people call them black magic because of their somewhat dark and unpredictable side. But once mastered, they are extremely powerful.

Exploring with str_view

To facilitate our learning of regular expressions, we will use a function that allows us to test them visually, the str_view function.

To illustrate our examples, we will use the name column of the msleep data set, which ships with the ggplot2 library.

head(msleep$name)
[1] "Cheetah"                    "Owl monkey"                
[3] "Mountain beaver"            "Greater short-tailed shrew"
[5] "Cow"                        "Three-toed sloth"          

To make our examples simpler, however, we are going to extract all the names into a new vector, which we will also convert to lowercase:

noms <- msleep$name %>% str_to_lower()
head(noms)
[1] "cheetah"                    "owl monkey"                
[3] "mountain beaver"            "greater short-tailed shrew"
[5] "cow"                        "three-toed sloth"          

Basically, a regex is a sequence of characters to search for:

str_view(noms,"shrew")
 [4] │ greater short-tailed <shrew>
[17] │ lesser short-tailed <shrew>
[73] │ musk <shrew>
[79] │ tree <shrew>

The str_view function shows us all the elements of the name vector that contained the searched sequence (“shrew”). Notice that in the output, the part that matched our regex has been surrounded by < >, and if you look at the output in the console, it will also be in a different color.

Note that regexes are case sensitive. For example, the “SHREW” pattern does not return any results:

str_view(noms, "SHREW")
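If a case-insensitive search is really what you want, stringr lets you wrap the pattern in regex() with its ignore_case argument. A quick sketch on a small vector:

```r
library(stringr)

# ignore_case = TRUE makes "SHREW" match "shrew"
str_view(c("musk shrew", "tree shrew", "red fox"),
         regex("SHREW", ignore_case = TRUE))
```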

The construction of search sequences

In addition to the sequences of characters we are looking for, a regex can also contain meta-characters allowing for more precise control of the search.

The first one we will see is the ., which acts as a wildcard and can match any single character.

It can be used for example to find all names containing the word gray in English, regardless of whether it is written gray or grey:

str_view(noms,"gr.y")
[32] │ <gray> seal
[33] │ <gray> hyrax

There are then three meta-characters for choosing how many times each searched pattern should occur: + (one or more times), ? (zero or one time) and * (zero or more times).

For example, all names containing an a followed by at least one s

str_view(noms, "as+")
[21] │ <as>ian elephant
[26] │ pat<as> monkey
[47] │ northern gr<ass>hopper mouse
[59] │ c<as>pian seal
[67] │ e<as>tern american mole
[76] │ e<as>tern american chipmunk

Notice in this case that our match grabbed both s’s of “grasshopper”.

With the ?, the presence of a character becomes optional. We can catch, for example, all the names containing ham or am, like this:

str_view(noms,"h?am")
[20] │ north <am>erican opossum
[27] │ western <am>erican chipmunk
[40] │ golden <ham>ster
[67] │ eastern <am>erican mole
[76] │ eastern <am>erican chipmunk

Finally, with the asterisk, a pattern becomes optional even if it repeats. For example, we can search for all the names containing two o’s separated by any number of p’s (including none), like this:

str_view(noms,"op*o")
[20] │ north american <opo>ssum
[35] │ mong<oo>se lemur
[37] │ thick-tailed <oppo>sum
[54] │ bab<oo>n
[61] │ potor<oo>

Square brackets, on the other hand, allow you to define a set of alternative characters to search for. For example, we can find all the rats, cats and bats like this:

str_view(noms,"[cbr]at")
[16] │ african giant pouched <rat>
[22] │ big brown <bat>
[28] │ domestic <cat>
[43] │ little brown <bat>
[44] │ round-tailed musk<rat>
[64] │ labo<rat>ory <rat>
[68] │ cotton <rat>
[69] │ mole <rat>

The caret (^) inverts the condition of the square brackets. We could, for example, search for all the times “at” appears without a c or an r in front of it:

str_view(noms,"[^cr]at")
 [4] │ gr<eat>er short-tailed shrew
[11] │ g<oat>
[22] │ big brown <bat>
[26] │ <pat>as monkey
[43] │ little brown <bat>

One can also combine square brackets with the meta-characters defining the number of occurrences. For example, we could search for all the words containing two vowels followed by one or more consonants, like this:

str_view(noms,"[aeiou][aeiou][^aeiou]+")
 [1] │ ch<eet>ah
 [3] │ m<ount><ain b><eav>er
 [4] │ gr<eat>er short-t<ail>ed shrew
 [6] │ thr<ee-t><oed sl>oth
 [7] │ northern fur s<eal>
 [8] │ vesper m<ous>e
[10] │ r<oe d><eer>
[11] │ g<oat>
[12] │ g<uin><ea p>ig
[16] │ african g<iant p><ouch>ed rat
[17] │ lesser short-t<ail>ed shrew
[19] │ tr<ee hyr>ax
[21] │ as<ian >elephant
[25] │ <eur>op<ean h>edgehog
[32] │ gray s<eal>
[35] │ mong<oos>e lemur
[37] │ thick-t<ail>ed opposum
[39] │ mongol<ian g>erbil
[42] │ h<ous>e m<ous>e
[44] │ r<ound-t><ail>ed muskrat
... and 18 more

Notice that in several instances, our pattern was found multiple times. Also notice that our definition of “consonant” is not quite right. As spaces are not vowels, they are also captured by our search pattern.

Another particularly useful special character is the vertical bar (|). The latter makes it possible to define alternative sequences (an OR). For example, to find all squirrels and chipmunks, one could write:

str_view(noms,"squirrel|chipmunk")
[27] │ western american <chipmunk>
[66] │ <squirrel> monkey
[70] │ arctic ground <squirrel>
[71] │ thirteen-lined ground <squirrel>
[72] │ golden-mantled ground <squirrel>
[76] │ eastern american <chipmunk>

Note that our pattern also matched the squirrel monkey. We will see later how to remedy this problem.

Regex-based functions

Now that we have seen the basic mechanics, let’s take a brief tour of the functions that make use of regexes.

To properly illustrate these functions, we are going to create a small data frame containing the animal names in lowercase (as we used in the previous section) combined with another column of information to illustrate the application on data frames.

tableau <- tibble(
  noms = noms,
  sommeil = msleep$sleep_total
)

The most practical application will undoubtedly be combining dplyr’s filter with the str_detect function.

tableau %>% 
  filter(str_detect(noms, "squirrel|chipmunk"))
# A tibble: 6 × 2
  noms                           sommeil
  <chr>                            <dbl>
1 western american chipmunk         14.9
2 squirrel monkey                    9.6
3 arctic ground squirrel            16.6
4 thirteen-lined ground squirrel    13.8
5 golden-mantled ground squirrel    15.9
6 eastern american chipmunk         15.8

Another thing you will often want to do is replace the pattern found with something else. For example, if the experts finally decided that all chipmunks and squirrels were now called squimunks, we could do this:

tableau %>% 
  mutate(
    noms = str_replace(noms,"squirrel|chipmunk","squimunk")
  ) %>% slice(70:80)
# A tibble: 11 × 2
   noms                           sommeil
   <chr>                            <dbl>
 1 arctic ground squimunk            16.6
 2 thirteen-lined ground squimunk    13.8
 3 golden-mantled ground squimunk    15.9
 4 musk shrew                        12.8
 5 pig                                9.1
 6 short-nosed echidna                8.6
 7 eastern american squimunk         15.8
 8 brazilian tapir                    4.4
 9 tenrec                            15.6
10 tree shrew                         8.9
11 bottle-nosed dolphin               5.2

Notice that I’m using the slice function to show you the rows where our replacement was made.

At the extreme, we could decide to replace all the “e” with another letter:

tableau %>% 
  mutate(
    noms = str_replace(noms, "e","X")
  )
# A tibble: 83 × 2
   noms                       sommeil
   <chr>                        <dbl>
 1 chXetah                       12.1
 2 owl monkXy                    17  
 3 mountain bXaver               14.4
 4 grXater short-tailed shrew    14.9
 5 cow                            4  
 6 thrXe-toed sloth              14.4
 7 northXrn fur seal              8.7
 8 vXsper mouse                   7  
 9 dog                           10.1
10 roX deer                       3  
# … with 73 more rows

As you can see, str_replace only changes the first instance found. To replace all instances, use the str_replace_all function:

tableau %>% 
  mutate(
    noms = str_replace_all(noms, "e","X")
  )
# A tibble: 83 × 2
   noms                       sommeil
   <chr>                        <dbl>
 1 chXXtah                       12.1
 2 owl monkXy                    17  
 3 mountain bXavXr               14.4
 4 grXatXr short-tailXd shrXw    14.9
 5 cow                            4  
 6 thrXX-toXd sloth              14.4
 7 northXrn fur sXal              8.7
 8 vXspXr mousX                   7  
 9 dog                           10.1
10 roX dXXr                       3  
# … with 73 more rows

If we pass an empty string ("") as the replacement in str_replace, we are essentially asking R to remove the matched sequence. In these cases, it can be more elegant and readable to simply use the str_remove (or str_remove_all) function. We could eliminate all the e’s from our data like this:

tableau %>% 
  mutate(
    noms = str_remove_all(noms, "e")
  )
# A tibble: 83 × 2
   noms                   sommeil
   <chr>                    <dbl>
 1 chtah                     12.1
 2 owl monky                 17  
 3 mountain bavr             14.4
 4 gratr short-taild shrw    14.9
 5 cow                        4  
 6 thr-tod sloth             14.4
 7 northrn fur sal            8.7
 8 vspr mous                  7  
 9 dog                       10.1
10 ro dr                      3  
# … with 73 more rows

There are dozens of other functions for working with regexes, including str_subset, str_which, and str_count, which I strongly encourage you to explore on your own.
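Here is a quick taste of these three functions, on a small vector (rather than the full noms vector):

```r
library(stringr)

animaux <- c("musk shrew", "tree shrew", "red fox")

str_subset(animaux, "shrew")  # keeps the elements that match
str_which(animaux, "shrew")   # gives their positions instead
str_count(animaux, "e")       # counts the matches in each element
```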

Escape characters (again!)

Okay, so if I had to ask you to write a regex to detect a period (.), what would you do?

texte <- "Première phrase. Deuxième phrase."
str_view(texte,".")
[1] │ <P><r><e><m><i><è><r><e>< ><p><h><r><a><s><e><.>< ><D><e><u><x><i><è><m><e>< ><p><h><r><a><s><e><.>

Obviously, that does not work: the period is a wildcard, which matches any character.

Maybe with an escape character?

str_view(texte,"\.")
Error: '\.' is an unrecognized escape in character string starting ""\."

Ah no, that doesn’t work either, because when parsing the string, R tries to interpret every \-something as an escape sequence, and \. is not a valid one.

So, yes, you have to use \\. to detect a literal period!

str_view(texte,"\\.")
[1] │ Première phrase<.> Deuxième phrase<.>

One backslash is consumed when R parses the string, and the other when the regex itself is interpreted.
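If you would rather sidestep regex escaping entirely, stringr also provides fixed(), which treats the pattern as literal text:

```r
library(stringr)

texte <- "Première phrase. Deuxième phrase."

# fixed() disables regex interpretation: the period is now literal
str_view(texte, fixed("."))
```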

Now, if we prepare a character string with a backslash in it:

bs <- "a\\b"
str_view(bs)
[1] │ a\b

How do we find it with a regex?

Yes, it will take us 4 backslashes to find just one.

str_view(bs, "\\\\")
[1] │ a<\>b

When R parses the string, the four backslashes become two; the regex engine then interprets those two as a single literal backslash!

If we want to simplify our life a bit, we can remember that R allows us to create raw character strings, which skip the escape-processing stage:

bs2 <- r"(a\n)"
str_view(bs2, r"(\\)")
[1] │ a<\>n

It’s not ideal, but it eliminates at least one level of abstraction.

Special characters

Just as regexes have their own syntax, they also include a series of special characters unique to them.

First, let’s look at anchor characters. Regexes include two of them, ^ and $, designating respectively the beginning and the end of a character string.

For example, we can use an anchor to find all the names of mammals that end in squirrel, and thus eliminate the squirrel monkey from our results:

str_view(noms, "squirrel")
[66] │ <squirrel> monkey
[70] │ arctic ground <squirrel>
[71] │ thirteen-lined ground <squirrel>
[72] │ golden-mantled ground <squirrel>

vs.

str_view(noms, "squirrel$")
[70] │ arctic ground <squirrel>
[71] │ thirteen-lined ground <squirrel>
[72] │ golden-mantled ground <squirrel>

If we combine the two anchor characters, we ensure that the string contains only the requested text. Nothing more, nothing less.

str_view(noms,"pig")
[12] │ guinea <pig>
[74] │ <pig>

vs.

str_view(noms, "^pig$")
[74] │ <pig>

Regexes also include special characters that save us from writing long lists of characters. For example, you can use \w to find any letter, digit or underscore (i.e. “word” characters).

So, to find all the species names composed of two words, we could do this:

str_view(noms,"^\\w+ \\w+$")
 [2] │ <owl monkey>
 [3] │ <mountain beaver>
 [8] │ <vesper mouse>
[10] │ <roe deer>
[12] │ <guinea pig>
[19] │ <tree hyrax>
[21] │ <asian elephant>
[25] │ <european hedgehog>
[26] │ <patas monkey>
[28] │ <domestic cat>
[31] │ <pilot whale>
[32] │ <gray seal>
[33] │ <gray hyrax>
[35] │ <mongoose lemur>
[36] │ <african elephant>
[39] │ <mongolian gerbil>
[40] │ <golden hamster>
[42] │ <house mouse>
[45] │ <slow loris>
[55] │ <desert hedgehog>
... and 14 more

There is also a series of other similar characters, of which here is an overview:

\d : any digit
\D : anything that is not a digit
\s : any whitespace character (space, tab, newline)
\S : anything that is not whitespace
\W : anything that is not a “word” character

Beyond the characters +, ? and *, regexes also allow us finer control over the number of repetitions of a pattern, using braces. To find all names starting with a 3-letter word followed by a space, we could do this:

str_view(noms,"^\\w{3}\\s")
 [2] │ <owl >monkey
[10] │ <roe >deer
[22] │ <big >brown bat
[83] │ <red >fox

Note that there is also a special character to detect borders around a word: \b.

If we want to find all the names containing a word of exactly 3 letters, we could do this:

str_view(noms, "\\b\\w{3}\\b")
 [2] │ <owl> monkey
 [5] │ <cow>
 [7] │ northern <fur> seal
 [9] │ <dog>
[10] │ <roe> deer
[12] │ guinea <pig>
[16] │ african giant pouched <rat>
[22] │ <big> brown <bat>
[28] │ domestic <cat>
[43] │ little brown <bat>
[64] │ laboratory <rat>
[68] │ cotton <rat>
[69] │ mole <rat>
[74] │ <pig>
[82] │ arctic <fox>
[83] │ <red> <fox>

If only one number is given, the braces match that exact number of repetitions, but you can also specify two values, which give the lower and upper bounds on the acceptable number of repetitions.

For example, to find all animal names containing a 3 to 5 letter word, we would write this:

str_view(noms, "\\b\\w{3,5}\\b")
 [2] │ <owl> monkey
 [4] │ greater <short>-tailed <shrew>
 [5] │ <cow>
 [6] │ <three>-<toed> <sloth>
 [7] │ northern <fur> <seal>
 [8] │ vesper <mouse>
 [9] │ <dog>
[10] │ <roe> <deer>
[11] │ <goat>
[12] │ guinea <pig>
[15] │ <star>-<nosed> <mole>
[16] │ african <giant> pouched <rat>
[17] │ lesser <short>-tailed <shrew>
[18] │ <long>-<nosed> armadillo
[19] │ <tree> <hyrax>
[20] │ <north> american opossum
[21] │ <asian> elephant
[22] │ <big> <brown> <bat>
[23] │ <horse>
[26] │ <patas> monkey
... and 37 more

Operator precedence

As in mathematics, the evaluation of a regex does not necessarily occur from left to right. There is an order of priority between the operations.

As a general rule, characters controlling the number of repetitions take precedence over those defining alternatives.

For example ab+ will be interpreted as a(b+). ^a|b$ will be interpreted as (^a)|(b$), etc.

Unlike in algebra, these priorities are very difficult to remember, and the easiest approach is probably to use as many parentheses as necessary to make the desired pattern unambiguous.
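A small illustration of these priorities, and of how parentheses change them:

```r
library(stringr)

# ab+ is read as a(b+): one a followed by one or more b's
str_view("ababab", "ab+")

# (ab)+ instead repeats the whole "ab" group
str_view("ababab", "(ab)+")
```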

Groupings and reuse

Besides clarifying priorities, parentheses have another special use in regexes: they create groups that can be reused. Each of the groups defined by the parentheses is numbered automatically by R. These numbers can then be used elsewhere in the regex. The first group will be named \1, the second \2, etc.

For example, we can find all the animals whose name contains a double letter, like this:

str_view(noms,"(\\w)\\1")
 [1] │ ch<ee>tah
 [6] │ thr<ee>-toed sloth
[10] │ roe d<ee>r
[14] │ chinchi<ll>a
[17] │ le<ss>er short-tailed shrew
[18] │ long-nosed armadi<ll>o
[19] │ tr<ee> hyrax
[20] │ north american opo<ss>um
[30] │ gira<ff>e
[35] │ mong<oo>se lemur
[37] │ thick-tailed o<pp>osum
[43] │ li<tt>le brown bat
[47] │ northern gra<ss>ho<pp>er mouse
[48] │ ra<bb>it
[49] │ sh<ee>p
[50] │ chimpanz<ee>
[54] │ bab<oo>n
[56] │ po<tt>o
[57] │ d<ee>r mouse
[60] │ co<mm>on porpoise
... and 9 more

The \w grabs a letter, the parentheses make it group number 1, and the \1 then reuses that group, requiring the same letter to appear again immediately after.

By the same principle, we can also find all the names that start and end with the same letter:

str_view(noms,"^(\\w).*\\1$")
[10] │ <roe deer>
[12] │ <guinea pig>
[45] │ <slow loris>
[67] │ <eastern american mole>

Pattern reuse is also applicable with the str_replace function.

It can, for example, reverse the first two words of each name:

str_replace(noms,"^(\\w+)\\s(\\w+)(.*)", "\\2 \\1\\3")
 [1] "cheetah"                        "monkey owl"                    
 [3] "beaver mountain"                "short greater-tailed shrew"    
 [5] "cow"                            "three-toed sloth"              
 [7] "fur northern seal"              "mouse vesper"                  
 [9] "dog"                            "deer roe"                      
[11] "goat"                           "pig guinea"                    
[13] "grivet"                         "chinchilla"                    
[15] "star-nosed mole"                "giant african pouched rat"     
[17] "short lesser-tailed shrew"      "long-nosed armadillo"          
[19] "hyrax tree"                     "american north opossum"        
[21] "elephant asian"                 "brown big bat"                 
[23] "horse"                          "donkey"                        
[25] "hedgehog european"              "monkey patas"                  
[27] "american western chipmunk"      "cat domestic"                  
[29] "galago"                         "giraffe"                       
[31] "whale pilot"                    "seal gray"                     
[33] "hyrax gray"                     "human"                         
[35] "lemur mongoose"                 "elephant african"              
[37] "thick-tailed opposum"           "macaque"                       
[39] "gerbil mongolian"               "hamster golden"                
[41] "vole "                          "mouse house"                   
[43] "brown little bat"               "round-tailed muskrat"          
[45] "loris slow"                     "degu"                          
[47] "grasshopper northern mouse"     "rabbit"                        
[49] "sheep"                          "chimpanzee"                    
[51] "tiger"                          "jaguar"                        
[53] "lion"                           "baboon"                        
[55] "hedgehog desert"                "potto"                         
[57] "mouse deer"                     "phalanger"                     
[59] "seal caspian"                   "porpoise common"               
[61] "potoroo"                        "armadillo giant"               
[63] "hyrax rock"                     "rat laboratory"                
[65] "striped african mouse"          "monkey squirrel"               
[67] "american eastern mole"          "rat cotton"                    
[69] "rat mole"                       "ground arctic squirrel"        
[71] "thirteen-lined ground squirrel" "golden-mantled ground squirrel"
[73] "shrew musk"                     "pig"                           
[75] "short-nosed echidna"            "american eastern chipmunk"     
[77] "tapir brazilian"                "tenrec"                        
[79] "shrew tree"                     "bottle-nosed dolphin"          
[81] "genet"                          "fox arctic"                    
[83] "fox red"                       

Here we have three pairs of parentheses, so three capture groups. The first captures the first word (\w+), the second captures the second word (again \w+), and the last captures everything else (.*). The replacement then rebuilds the character string, but with the second group (\2) placed before the first (\1).
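To make the mechanics concrete, here is a minimal sketch of this kind of swap on a single invented string, using stringr's str_replace:

```r
library(stringr)

# Capture the first two words and the rest, then emit \2 before \1
str_replace("brown big bat", "(\\w+) (\\w+)(.*)", "\\2 \\1\\3")
#> [1] "big brown bat"
```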

Caution

As Spider-Man says so well: with great power comes great responsibility.

This is especially true of regexes. Use them in moderation. It is often more readable for your future self to break the task into a few simple, readable operations rather than trying to do everything in a single regex.
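To illustrate that advice, here is a minimal sketch comparing the two approaches on a hypothetical code format (two capital letters, four digits, one lowercase letter, separated by dashes); the sample string and format are invented for this example:

```r
library(stringr)

code <- "AB-1234-x"

# One dense regex, hard to scan at a glance:
str_detect(code, "^[A-Z]{2}-\\d{4}-[a-z]$")
#> [1] TRUE

# The same check, split into simple, readable operations
# (str_split_1 requires stringr >= 1.5):
parts <- str_split_1(code, "-")
length(parts) == 3 &&
  str_detect(parts[1], "^[A-Z]{2}$") &&
  str_detect(parts[2], "^\\d{4}$") &&
  str_detect(parts[3], "^[a-z]$")
#> [1] TRUE
```

Both versions test the same thing; the second is longer, but each step can be read, reordered, or debugged on its own.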

As a cautionary example, here is a classic regex used in a library of the Perl programming language to check whether an email address is valid (https://metacpan.org/release/RJBS/Email-Valid-1.200/source/lib/Email/Valid.pm):

[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)

References

Material in this workshop draws heavily on the in-preparation chapters of the next edition of Hadley Wickham's R for Data Science, available online here:

https://r4ds.hadley.nz/strings.html

https://r4ds.hadley.nz/regexps.html