R for Data Science Exercises: Strings
R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)
Strings
Run the code in your script for the answers! I'm just exploring as I go.
Packages to load
library(tidyverse)
library(babynames)
library(gt)
library(gtExtras)
Introduction
Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. The stringr package provides a set of internally consistent tools for working with character strings in R, i.e. sequences of characters surrounded by "quotation marks".
- Strings = text, e.g. the stuff you're reading here
- {stringr} = {tidyverse} string package (stringr cheatsheet)
- {stringr} functions start with str_
Important uses of quoting in R
Code | Purpose |
---|---|
\n | newline (aka 'line feed') |
\t | tab |
\b | backspace |
\v | vertical tab |
\\ | backslash '\' |
\' | ASCII apostrophe ''' |
\" | ASCII quotation mark '"' |
\nnn | character with given octal code (1, 2 or 3 digits) |
\xnn | character with given hex code (1 or 2 hex digits) |
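To make the table concrete, here is a minimal sketch in base R. The sample strings are made up for illustration. nchar() counts the characters actually stored, and writeLines() prints the contents without the escaping that print() adds.

```r
# Each escape sequence is stored as a single character.
x <- c(tab = "a\tb", newline = "a\nb", backslash = "a\\b", quote = "a\"b")
nchar(x)   # every element has 3 characters

# Octal and hex escapes name a character by its code: both are "A" (code 65).
identical("\101", "\x41")
identical("\x41", "A")

# print() shows the escaped form; writeLines() shows the stored contents.
print(x[["backslash"]])
writeLines(x[["backslash"]])
```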
Creating a String
Questions
- Create strings that contain the following values:
  - He said "That's amazing!"
  - \a\b\c\d
  - \\\\\\
- Create the string in your R session and print it. What happens to the special "\u00a0"? How does str_view() display it? Can you do a little googling to figure out what this special character is?
x <- "This\u00a0is\u00a0tricky"
Answers
Solution 1:
- He said "That's amazing!"
x = "He said \"That's amazing!\""
str_view(x)
- \a\b\c\d
x = "\\a\\b\\c\\d"
str_view(x)
- \\\\\\
x = "\\\\\\\\\\\\"
str_view(x)
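As an alternative to doubling every backslash, the same three strings can be written as raw strings. This is a sketch that assumes R 4.0.0 or later, where the r"(...)" syntax is available; the variable names are only for illustration.

```r
# Raw strings, r"(...)", take backslashes and quotes literally (R >= 4.0.0).
s1 <- r"(He said "That's amazing!")"
s2 <- r"(\a\b\c\d)"
s3 <- r"(\\\\\\)"

# Each raw string is identical to its escaped counterpart.
identical(s1, "He said \"That's amazing!\"")
identical(s2, "\\a\\b\\c\\d")
identical(s3, "\\\\\\\\\\\\")

# The third string holds six backslashes, not twelve.
nchar(s3)

# writeLines() prints the stored contents without escapes.
writeLines(c(s1, s2, s3))
```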
Solution 2:
The "\u00a0" escape looks like an ordinary white space when printed. A quick search shows that it is the No-Break Space (NBSP). str_view(), however, makes it visible by displaying it in a greenish-blue font as {\u00a0}.
"\u00a0" # This represents a white space
str_view("\u00a0")
x <- "This\u00a0is\u00a0tricky"
print(x)
str_view(x)
The "\u00a0" escape represents the non-breaking space character in Unicode. Unicode is a standardized character encoding system that assigns a unique numerical code to almost every character from every writing system in the world, including various symbols, letters, and special characters.
In an R string, "\u" indicates that the hexadecimal digits that follow (up to four) give a Unicode code point. In this case, "\u00a0" is the code point for the non-breaking space character (hex A0, decimal 160).
A non-breaking space is a type of space character that is used in typography and word processing to prevent a line break or word wrap from occurring at that particular space.
It is similar to a regular space character (ASCII code 32), but it has the special property of keeping adjacent words or characters together on the same line when text is justified or formatted.
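The difference between the two kinds of space matters in practice, because code that looks for " " will not find "\u00a0". The following base R sketch illustrates this with the string from the exercise; the variable y is introduced here only for illustration.

```r
x <- "This\u00a0is\u00a0tricky"

# The no-break space is code point 160 (hex A0); a regular space is 32.
utf8ToInt("\u00a0")
utf8ToInt(" ")

# The two are different characters, even though they print alike.
"\u00a0" == " "

# Splitting on a regular space finds nothing to split on.
length(strsplit(x, " ", fixed = TRUE)[[1]])

# Replacing the no-break spaces gives an ordinary, splittable string.
y <- gsub("\u00a0", " ", x, fixed = TRUE)
strsplit(y, " ", fixed = TRUE)[[1]]
```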
Creating Many Strings from Data
Questions
- Compare and contrast the results of paste0() with str_c() for the following inputs:
str_c("hi ", NA)
str_c(letters[1:2], letters[1:3])
- What's the difference between paste() and paste0()? How can you recreate the equivalent of paste() with str_c()?
- Convert the following expressions from str_c() to str_glue() or vice versa:
  a. str_c("The price of ", food, " is ", price)
  b. str_glue("I'm {age} years old and live in {country}")
  c. str_c("\\section{", title, "}")
Answers
Solution 1:
As we can see below, paste0() converts NA into the string "NA" and simply joins it with the other string. However, str_c() behaves more sensibly: it returns NA if any of the strings being joined is NA.
str_c("hi ", NA)
paste0("hi ", NA)
Further, when we join two string vectors of unequal length, i.e., letters[1:2] ("a" "b") and letters[1:3] ("a" "b" "c"), str_c() and paste0() behave differently.
str_c() throws an error and informs us that the string vectors being joined are of unequal length. paste0() simply recycles the shorter string vector silently.
# str_c(letters[1:2], letters[1:3])
paste0(letters[1:2], letters[1:3])
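The two differences can be checked side by side. This sketch assumes stringr 1.5.0 or later, where str_c() follows the tidyverse recycling rules; older versions may warn instead of raising an error. The variable res and the "error" placeholder string are introduced here only for illustration.

```r
library(stringr)

# A missing value propagates through str_c() but not through paste0().
str_c("hi ", NA)
paste0("hi ", NA)

# To get the paste0() behaviour from str_c(), convert NA to "NA" first.
str_c("hi ", str_replace_na(NA))

# Length-1 vectors are still recycled by str_c() ...
str_c("x", letters[1:3])

# ... but lengths 2 and 3 are rejected; tryCatch() captures the error
# and returns a placeholder string instead of stopping the script.
res <- tryCatch(
  str_c(letters[1:2], letters[1:3]),
  error = function(e) "error"
)
res
```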
Alternative Solution:
- str_c("hi ", NA)
  - str_c("hi ", NA) with str_c():
    - Result: NA
    - Explanation: str_c() propagates missing values, so the result is NA whenever any of the inputs is NA.
  - paste0("hi ", NA) with paste0():
    - Result: "hi NA"
    - Explanation: paste0() converts the NA value to the character string "NA" and concatenates it with the string "hi ".
  - The results of str_c() and paste0() differ for this input: str_c() returns a missing value, while paste0() returns a string that contains the text "NA".
- str_c(letters[1:2], letters[1:3])
  - str_c(letters[1:2], letters[1:3]) with str_c():
    - Result: an error
    - Explanation: str_c() follows the tidyverse recycling rules, which recycle only vectors of length 1. Vectors of length 2 and 3 cannot be combined, so str_c() stops with an error.
  - paste0(letters[1:2], letters[1:3]) with paste0():
    - Result: "aa" "bb" "ac"
    - Explanation: paste0() concatenates the vectors element by element and silently recycles the shorter one, so "a" is reused for the third element.
  - The results of str_c() and paste0() differ for this input as well: str_c() refuses to combine the vectors, while paste0() recycles without any warning.
In summary, str_c() and paste0() both concatenate strings without a separator by default, but str_c() is stricter: it propagates NA and only recycles vectors of length 1. Like paste(), str_c() also accepts sep and collapse arguments. paste0() is a shorthand for paste() with sep = "".
Solution 2:
In R, both the paste() and paste0() functions are used to concatenate strings together. However, they differ in how they separate the concatenated elements.
paste() concatenates its arguments with a space character as the default separator. We can specify a different separator using the sep argument.
paste0() is similar to paste(), but it does not add any separator between the concatenated elements. It simply combines them as-is.
Example:
vec1 <- c("Hello", "Hi")
vec2 <- c("Amy", "Tom", "Neal")
paste(vec1, vec2)
paste(vec1, vec2, sep = ", ")
paste0(vec1, vec2)
We can recreate the equivalent of paste() using the str_c() function from the stringr package. To do this, we specify the separator using the sep argument in str_c() as follows:
vec1 <- c(vec1, "Hello")
paste(vec1, vec2)
str_c(vec1, vec2, sep = " ")
Note: We had to add a string to vec1 so that both vec1 and vec2 are of length 3. Otherwise, str_c() will throw an error.
Alternative Solution:
The paste() and paste0() functions in R are used for concatenating strings. The main difference between the two is that paste() inserts a separator between the concatenated elements (a space by default, configurable through sep), whereas paste0() concatenates the elements without any separator.
To recreate the equivalent of paste() using str_c() from the stringr package, set sep = " " in str_c(). The collapse argument, which joins the elements of a vector into a single string, works the same way in both functions:
library(stringr)
vec <- c("a", "b", "c")
result_str_c <- str_c(vec, collapse = "-")
In the above code, str_c(vec, collapse = "-") is equivalent to paste(vec, collapse = "-"). It joins the elements of vec with the hyphen (-) given in the collapse argument, producing the same output as paste().
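Because sep and collapse are easy to mix up, here is a small sketch of how the two arguments interact. The vectors first and second are made up for illustration.

```r
library(stringr)

first <- c("a", "b", "c")
second <- c("x", "y", "z")

# sep goes between the arguments, element by element.
str_c(first, second, sep = " ")
paste(first, second)

# collapse then joins the resulting elements into one string.
str_c(first, second, sep = " ", collapse = ", ")
paste(first, second, collapse = ", ")
```

With vectors of equal length and no missing values, the str_c() and paste() calls in each pair return the same result.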
Solution 3:
a. str_c("The price of ", food, " is ", price)
- `str_glue("The price of {food} is {price}")`
b. str_glue("I'm {age} years old and live in {country}")
- `str_c("I'm ", age, " years old and live in ", country)`
c. str_c("\\section{", title, "}")
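- `str_glue("\\section{{{title}}}")`
  Here the doubled braces produce literal braces around the value of title, and the backslash is written exactly as in the str_c() version, because str_glue() leaves backslashes untouched.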
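One way to check conversions like these is to run both versions and compare the output. The sketch below uses made-up values for food, price, age, country and title. For (c), it writes the str_glue() call with a single escaped backslash and doubled braces, and compares it with the str_c() result.

```r
library(stringr)

# Made-up values, purely for illustration.
food <- "pizza"
price <- 10
age <- 30
country <- "India"
title <- "Strings"

# (a) and (b): both versions should build the same sentence.
a_same <- str_c("The price of ", food, " is ", price) ==
  as.character(str_glue("The price of {food} is {price}"))
b_same <- str_c("I'm ", age, " years old and live in ", country) ==
  as.character(str_glue("I'm {age} years old and live in {country}"))

# (c): doubled braces give literal braces; the backslash is kept as written.
c_str_c <- str_c("\\section{", title, "}")
c_glue <- as.character(str_glue("\\section{{{title}}}"))
c_same <- c_str_c == c_glue

c(a = a_same, b = b_same, c = c_same)
writeLines(c_glue)
```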
Additional Information:
data("babynames")
babynames |>
mutate(name_lgth = str_length(name)) |>
count(name_lgth, wt = n)
babynames |>
filter(str_length(name) == 15) |>
count(name, wt = n, sort = TRUE) |>
slice_head(n = 5) |>
select(name) |>
as_vector() |>
unname() |>
str_sub(start = -3, end = -1)
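The last step of the pipeline above relies on negative positions in str_sub(), which count from the end of the string. A minimal sketch with made-up names:

```r
library(stringr)

# Made-up names, purely for illustration.
x <- c("Christopher", "Maryelizabeth")

# Negative positions count from the end: -1 is the last character.
str_sub(x, start = -3, end = -1)   # last three letters
str_sub(x, start = 1, end = 1)     # first letter
str_sub(x, start = -1, end = -1)   # last letter

# A string shorter than the requested range is returned as far as it goes.
str_sub("Al", start = -3, end = -1)
```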
Letters
Questions
- When computing the distribution of the length of babynames, why did we use wt = n?
- Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
- Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
Answers
Solution 1:
The babynames data-set (@tbl-baby-names) has a column n that records the frequency, i.e., the number of babies given that name in that year. Thus, when we compute the distribution of the length of baby names (@tbl-baby-names-length), we need to weight the observations by n. Otherwise each row is counted as 1 (@tbl-baby-names-length column 3) instead of the actual number recorded in n, leading to erroneous results.
#| tbl-cap: "The babynames data-set"
#| label: tbl-baby-names
#| code-fold: true
babynames |>
slice_head(n = 5) |>
gt() |>
fmt_number(prop, decimals = 4)
#| tbl-cap: "The distribution of the length of babynames"
#| label: tbl-baby-names-length
#| code-fold: true
df1 = babynames |>
mutate(name_length = str_length(name)) |>
count(name_length, wt = n) |>
rename(correct_frequency = n)
df2 = babynames |>
mutate(name_length = str_length(name)) |>
count(name_length) |>
rename(wrong_frequency_without_weights = n)
inner_join(df1, df2, by = "name_length") |>
gt() |>
fmt_number(-name_length , decimals = 0) |>
cols_label_with(
fn = ~ janitor::make_clean_names(., case = "title")
) |>
gt_theme_538()
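The effect of wt = n is easier to see on a tiny table. The sketch below uses a made-up tibble, toy, in the same shape as babynames (one row per name, with n babies each); the names and numbers are hypothetical.

```r
library(dplyr)

# A made-up table: one row per name, n = number of babies with that name.
toy <- tibble(name = c("Al", "Bo", "Cleo"), n = c(10, 5, 1)) |>
  mutate(name_length = nchar(name))

# Without weights, count() counts rows: length 2 appears in 2 rows.
unweighted <- toy |> count(name_length)

# With wt = n, count() sums n within each group: length 2 covers 15 babies.
weighted <- toy |> count(name_length, wt = n)

unweighted
weighted
```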
Solution 2:
The code displayed below extracts the middle letter from each baby name, and the results for first 10 names are displayed in @tbl-middle-letters. If the string has an even number of characters, we can pick the middle two characters.
#| label: tbl-middle-letters
#| tbl-cap: "Middle letters of names"
df3 = babynames |>
mutate(
name_length = str_length(name),
middle_letter_start = if_else(name_length %% 2 == 0,
name_length/2,
(name_length/2) + 0.5),
middle_letter_end = if_else(name_length %% 2 == 0,
(name_length/2) + 1,
(name_length/2) + 0.5),
middle_letter = str_sub(name,
start = middle_letter_start,
end = middle_letter_end)
) |>
select(-c(year, sex, n, prop)) |>
slice_head(n = 10)
df3 |>
gt() |>
cols_label_with(fn = ~ janitor::make_clean_names(., case = "title")) |>
cols_align(align = "center",
columns = -name) |>
gt_theme_538()
Alternative Solution
# Extract middle letter(s) from each baby name
middle_letters <- sapply(babynames$name, function(name) {
name_length <- str_length(name)
middle_index <- ceiling(name_length / 2)
if (name_length %% 2 == 0) {
str_sub(name, middle_index, middle_index + 1)
} else {
str_sub(name, middle_index, middle_index)
}
})
# Display the middle letter(s)
head(middle_letters)
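Since str_length() and str_sub() are vectorised, the same result can be obtained without a loop or if_else(), using integer division to find the start and end positions. This is a sketch; the helper name middle_letters_of is hypothetical.

```r
library(stringr)

# Hypothetical helper: returns the middle letter, or the middle two letters
# when the name has an even number of characters.
middle_letters_of <- function(x) {
  n <- str_length(x)
  # Odd n:  start = end = (n + 1) / 2.
  # Even n: start = n / 2, end = n / 2 + 1.
  str_sub(x, start = (n + 1) %/% 2, end = n %/% 2 + 1)
}

middle_letters_of(c("Mary", "Anna", "Alice", "Al", "A"))
```

Applied to a whole column, for example inside mutate(), this avoids calling a function once per name.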
Solution 3:
@fig-length-baby-names, @fig-trends-baby-names-start and @fig-trends-baby-names-end show the trends over time.
#| label: fig-length-baby-names
#| fig-cap: "Length of babynames over time"
#| code-fold: true
df4 = babynames |>
mutate(
name_length = str_length(name),
name_start = str_sub(name, 1, 1),
name_end = str_sub(name, -1, -1)
)
y_coord = c(5.4, 6.3)
df4 |>
group_by(year) |>
count(name_length, wt = n) |>
summarise(mean_length = weighted.mean(name_length, w = n)) |>
ggplot(aes(x = year, y = mean_length)) +
theme_classic() +
labs(y = "Average name length (for each year)",
x = "Year",
title = "Baby names have become longer over the past 12 decades",
subtitle = "Between 1890-1920, and 1960-1990 baby names became longer\nBut, since 1990 the names are becoming shorter again") +
scale_x_continuous(breaks = seq(1880, 2000, 20)) +
geom_rect(mapping = aes(xmin = 1890, xmax = 1920,
ymin = y_coord[1], ymax = y_coord[2]),
alpha = 0.01, fill = "grey") +
geom_rect(mapping = aes(xmin = 1960, xmax = 1990,
ymin = y_coord[1], ymax = y_coord[2]),
alpha = 0.01, fill = "grey") +
geom_line(lwd = 1) +
coord_cartesian(ylim = y_coord) +
theme(plot.title.position = "plot")
#| label: fig-trends-baby-names-start
#| fig-cap: "Trends on the starting letter of babynames over time"
#| code-fold: true
ns_vec = df4 |>
count(name_start, wt = n, sort = TRUE) |>
slice_head(n = 5) |>
select(name_start) |>
as_vector() |>
unname()
df4 |>
filter(name_start %in% ns_vec) |>
group_by(year) |>
count(name_start, wt = n) |>
mutate(prop = 100*n/sum(n)) |>
mutate(lbl = if_else(year == 2017,
name_start,
NA)) |>
ggplot(aes(x = year, y = prop,
col = name_start, label = lbl)) +
geom_line(lwd = 1) +
ggrepel::geom_label_repel(nudge_x = 1) +
labs(x = "Year",
y = "Percentage of names starting with character",
title = "People's preferences for baby names' starting letter change over time",
subtitle = "Names starting with A are most popular now\nNames starting with J were popular in the 1940s\nIn 1950s, names starting with D became popular, while those starting with A lost popularity") +
theme_classic() +
theme(legend.position = "none",
plot.title.position = "plot") +
scale_x_continuous(breaks = seq(1880, 2020, 20))
#| label: fig-trends-baby-names-end
#| fig-cap: "Trends on the ending letter of babynames over time"
#| code-fold: true
ns_vec = df4 |>
count(name_end, wt = n, sort = TRUE) |>
slice_head(n = 5) |>
select(name_end) |>
as_vector() |>
unname()
df4 |>
filter(name_end %in% ns_vec) |>
group_by(year) |>
count(name_end, wt = n) |>
mutate(prop = 100*n/sum(n)) |>
mutate(lbl = if_else(year == 2017,
name_end,
NA)) |>
ggplot(aes(x = year, y = prop,
col = name_end, label = lbl)) +
geom_line(lwd = 1) +
ggrepel::geom_label_repel(nudge_x = 1) +
labs(x = "Year",
y = "Percentage of names ending with character",
title = "People's preferences for baby names' ending letter change over time",
subtitle = "Names ending in N have risen in popularity over the decades.\nNames ending with E have become less popular over time") +
theme_classic() +
theme(legend.position = "none",
plot.title.position = "plot") +
scale_x_continuous(breaks = seq(1880, 2020, 20))
Alternative Solution
# lengths of names over time
babynames |>
group_by(year) |>
mutate(length = str_length(name)) |>
summarize(average_length = weighted.mean(length, n)) |>
ggplot(aes(x = year, y = average_length)) +
geom_line() +
scale_x_continuous(breaks = seq(1880, 2020, 10))
# first letter
babynames |>
mutate(first_letter = str_sub(name, start = 1, end = 1)) |>
group_by(year, first_letter) |>
summarize(total_prop = sum(prop), .groups = "drop") |>
ggplot(aes(x = year, y = total_prop)) +
geom_line() +
facet_wrap(~first_letter)
# last letter
babynames |>
mutate(last_letter = str_sub(name, start = -1, end = -1)) |>
group_by(year, last_letter) |>
summarize(total_prop = sum(prop), .groups = "drop") |>
ggplot(aes(x = year, y = total_prop)) +
geom_line() +
facet_wrap(~last_letter)
Reference
Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for Data Science. 2nd ed. Sebastopol, CA: O’Reilly Media.