R for Data Science Exercises: Strings


R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)

Strings

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

library(tidyverse)
library(babynames)
library(gt)
library(gtExtras)

Introduction

Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. The stringr package provides a set of internally consistent tools for working with character strings in R, i.e. sequences of characters surrounded by "quotation marks".

  • Strings = text, e.g. the stuff you're reading here
  • {stringr} = {tidyverse} string package (stringr cheatsheet)
  • {stringr} functions start with str_

stringr Autocomplete

Important uses of quoting in R

Code    Purpose
\n      newline (aka 'line feed')
\t      tab
\b      backspace
\v      vertical tab
\\      backslash '\'
\'      ASCII apostrophe '''
\"      ASCII quotation mark '"'
\nnn    character with given octal code (1, 2 or 3 digits)
\xnn    character with given hex code (1 or 2 hex digits)
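To see how these escapes behave, here is a minimal base R sketch. The variable names are only illustrative. It shows that an escape is typed with a backslash but stored as a single character, and that print() shows the escaped form while writeLines() shows the raw contents.

```r
# Each escape sequence is stored as a single character
tab_string <- "a\tb"
nchar(tab_string)          # 3 characters: a, tab, b

backslash <- "\\"
nchar(backslash)           # 1 character: a single backslash

quote_string <- "He said \"hi\""

# print() shows the escaped form; writeLines() shows the raw contents
print(quote_string)
writeLines(quote_string)

# Octal and hex escapes name a character by its code
identical("\101", "A")     # octal 101 is the code for "A"
identical("\x41", "A")     # hex 41 is the code for "A"
```

str_view() from stringr plays the same role as writeLines() here: it displays the string as it is stored, not as it was typed.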

Creating a String

Questions

  1. Create strings that contain the following values:

    1. He said "That's amazing!"

    2. \a\b\c\d

    3. \\\\\\

  2. Create the string in your R session and print it. What happens to the special "\u00a0"? How does str_view() display it? Can you do a little googling to figure out what this special character is?

    x <- "This\u00a0is\u00a0tricky"
    

Answers

Solution 1:

  1. He said "That's amazing!"

    x = "He said \"That's amazing!\""
    str_view(x)
    
  2. \a\b\c\d

    x = "\\a\\b\\c\\d"
    str_view(x)
    
  3. \\\\\\

    x = "\\\\\\\\\\\\"
    str_view(x)
    

Solution 2:

The "\u00a0" escape represents a white-space character. A quick search shows that it is the No-Break Space (NBSP). When the string is printed, it looks like an ordinary space, but str_view() makes it visible by displaying it as {\u00a0} in a greenish-blue font.

    "\u00a0" # This represents a white space
    str_view("\u00a0")
    
    x <- "This\u00a0is\u00a0tricky"
    print(x)
    str_view(x)

The "\u00a0" represents a non-breaking space character in Unicode encoding. Unicode is a standardized character encoding system that assigns a unique numerical code to almost every character from every writing system in the world, including various symbols, letters, and special characters.

In an R string, "\u" indicates that the hexadecimal digits that follow (up to four) give a Unicode code point. In this case, "\u00a0" is the code point U+00A0, the non-breaking space character.

A non-breaking space is a type of space character that is used in typography and word processing to prevent a line break or word wrap from occurring at that particular space.

It is similar to a regular space character (ASCII code 32), but it has the special property of keeping adjacent words or characters together on the same line when text is justified or formatted.
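The difference from a regular space matters in data cleaning, because the two characters look alike but do not match each other. The sketch below is one possible way to check for and normalise no-break spaces. The name x_clean is only illustrative.

```r
library(stringr)

x <- "This\u00a0is\u00a0tricky"

# The no-break space has code point 160 (hex A0); a regular space is 32
utf8ToInt("\u00a0")
utf8ToInt(" ")

# Because the characters differ, a search for a regular space finds nothing
str_detect(x, fixed(" "))

# Replace every no-break space with a regular space
x_clean <- str_replace_all(x, fixed("\u00a0"), " ")
x_clean == "This is tricky"
```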

Creating Many Strings from Data

Questions

  1. Compare and contrast the results of paste0() with str_c() for the following inputs:

    • str_c("hi ", NA)
    • str_c(letters[1:2], letters[1:3])

  2. What's the difference between paste() and paste0()? How can you recreate the equivalent of paste() with str_c()?

  3. Convert the following expressions from str_c() to str_glue() or vice versa:

    a. str_c("The price of ", food, " is ", price)

    b. str_glue("I'm {age} years old and live in {country}")

    c. str_c("\\section{", title, "}")

Answers

Solution 1:

As we can see below, paste0() converts NA into the string "NA" and simply joins it with the other string. However, str_c() behaves more sensibly: it returns NA if any of the strings being joined is NA.

str_c("hi ", NA)
paste0("hi ", NA)

Further, when we join two string vectors of unequal length, i.e., letters[1:2] is "a" "b" and letters[1:3] is "a" "b" "c", str_c() and paste0() again behave differently.

  • str_c() throws an error and informs us that the string vectors being joined are of unequal length.
  • paste0() simply recycles the shorter string vector silently.

# str_c(letters[1:2], letters[1:3])   # left commented out because it throws an error
paste0(letters[1:2], letters[1:3])

Alternative Solution:

  1. str_c("hi ", NA)

    • str_c("hi ", NA) with str_c():

      • Result: NA

      • Explanation: str_c() propagates missing values. If any input is NA, the corresponding output is NA as well.

    • paste0("hi ", NA) with paste0():

      • Result: "hi NA"

      • Explanation: paste0() converts the NA value to the character string "NA" and concatenates it with the string "hi ".

    The results differ for this input: paste0() hides the missing value inside the text, while str_c() keeps it missing.

  2. str_c(letters[1:2], letters[1:3])

    • str_c(letters[1:2], letters[1:3]) with str_c():

      • Result: an error

      • Explanation: str_c() follows the tidyverse recycling rules, under which only vectors of length 1 are recycled. A vector of length 2 cannot be recycled to length 3, so the call fails.

    • paste0(letters[1:2], letters[1:3]) with paste0():

      • Result: "aa" "bb" "ac"

      • Explanation: paste0() silently recycles the shorter vector, so the third element pairs "a" (recycled from the first vector) with "c".

    The results differ for this input as well: str_c() refuses to guess how the vectors line up, while paste0() recycles without warning.

In summary, str_c() and paste0() both concatenate strings without a separator by default, but they differ in how they treat missing values and vectors of unequal length. str_c() propagates NA and only recycles vectors of length 1, whereas paste0() turns NA into "NA" and recycles silently. paste0() itself is a shorthand for paste() with sep = "".
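The two differences can be checked directly. The following is a minimal sketch, assuming stringr 1.5.0 or later (the version in which str_c() adopted tidyverse recycling rules). The vector greeted is a made-up example.

```r
library(stringr)

# Missing values: paste0() turns NA into text, str_c() propagates it
paste0("hi ", NA)
str_c("hi ", NA)

# To get the paste0() behaviour from str_c(), make the NA explicit first
greeted <- c("Ann", NA)
str_c("hi ", str_replace_na(greeted))

# Unequal lengths: paste0() recycles silently, str_c() signals an error
# (assumption: stringr >= 1.5.0; older versions may only warn)
paste0(letters[1:2], letters[1:3])
result <- tryCatch(
  str_c(letters[1:2], letters[1:3]),
  error = function(e) "str_c() signalled an error"
)
result
```

Wrapping the call in tryCatch() is only there so the script keeps running; in interactive use you would simply see the error message.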

Solution 2:

In R, both paste() and paste0() functions are used to concatenate strings together. However, they differ in how they handle separating the concatenated elements.

paste() concatenates its arguments with a space character as the default separator. We can specify a different separator using the sep argument.

paste0() is similar to paste(), but it does not add any separator between the concatenated elements. It simply combines them as-is.

Example:

vec1 <- c("Hello", "Hi")
vec2 <- c("Amy", "Tom", "Neal")
paste(vec1, vec2)
paste(vec1, vec2, sep = ", ")
paste0(vec1, vec2)

We can recreate the equivalent of paste() using the str_c() function from the stringr package in R. To do this, we can specify the separator using the sep argument in str_c() as follows:

vec1 <- c(vec1, "Hello")
paste(vec1, vec2)
str_c(vec1, vec2, sep = " ")

Note: We had to add a string to vec1 so that both vec1 and vec2 are of length 3. Otherwise, str_c() would throw an error.

Alternative Solution:

The paste() and paste0() functions in R are used for concatenating strings. The main difference between the two is that paste() inserts a separator between the concatenated elements (a space by default, configurable with the sep argument), whereas paste0() concatenates the elements without any separator.

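To recreate the equivalent of paste() using str_c() from the stringr package, set sep = " ", which is the default separator of paste(). The collapse argument, which joins the elements of a single vector into one string, works the same way in both functions: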

library(stringr)

vec <- c("a", "b", "c")

result_str_c <- str_c(vec, collapse = "-")

In the above code, str_c(vec, collapse = "-") is equivalent to paste(vec, collapse = "-"). Both join the elements of vec with a hyphen (-) and return the single string "a-b-c".
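Because sep and collapse are easy to mix up, here is a small sketch that uses both. The vectors first and second are made-up examples of equal length, so that str_c() does not need to recycle.

```r
library(stringr)

first  <- c("Hello", "Hi", "Hey")
second <- c("Amy", "Tom", "Neal")

# sep is placed between the arguments, element by element
pasted <- paste(first, second)
joined <- str_c(first, second, sep = " ")
all(pasted == joined)

# collapse then joins the resulting elements into one single string
str_c(first, second, sep = " ", collapse = "; ")
paste(first, second, collapse = "; ")
```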

Solution 3:

a. str_c("The price of ", food, " is ", price)

-   `str_glue("The price of {food} is {price}")`

b. str_glue("I'm {age} years old and live in {country}")

-   `str_c("I'm ", age, " years old and live in ", country)`

c. str_c("\\section{", title, "}")

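-   `str_glue("\\section{{{title}}}")` (glue leaves backslashes outside braces untouched, so the backslash is escaped once for R, exactly as in str_c(); the doubled braces produce the literal { and })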
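A quick way to check conversions like these is to run both versions on sample values and compare the results. The values below are invented for illustration. The sketch compares with writeLines(), because print() shows a character vector with its backslashes escaped, which makes the str_c() result look as if it had two backslashes.

```r
library(stringr)

# Made-up values, only for checking the conversions
food <- "bread"
price <- 2.5
age <- 30
country <- "Kenya"
title <- "Strings"

a_c    <- str_c("The price of ", food, " is ", price)
a_glue <- str_glue("The price of {food} is {price}")

b_c    <- str_c("I'm ", age, " years old and live in ", country)
b_glue <- str_glue("I'm {age} years old and live in {country}")

# Doubled braces give literal braces; the backslash is escaped once for R
c_c    <- str_c("\\section{", title, "}")
c_glue <- str_glue("\\section{{{title}}}")

# writeLines() shows the raw contents of both versions
writeLines(c_c)
writeLines(c_glue)

c(a_c == a_glue, b_c == b_glue, c_c == c_glue)
```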

Additional Information:

data("babynames")
babynames |>
  mutate(name_lgth = str_length(name)) |>
  count(name_lgth, wt = n)

babynames |>
  filter(str_length(name) == 15) |>
  count(name, wt = n, sort = TRUE) |>
  slice_head(n = 5) |>
  select(name) |>
  as_vector() |>
  unname() |>
  str_sub(start = -3, end = -1)

Letters

Questions

  1. When computing the distribution of the length of babynames, why did we use wt = n?
  2. Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
  3. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?

Answers

Solution 1:

In the babynames dataset (@tbl-baby-names), the column n holds the frequency, i.e., the number of babies given that name in that year. Thus, when we compute the distribution of the length of baby names (@tbl-baby-names-length), we need to weight the observations by n. Otherwise each row is counted as 1 (@tbl-baby-names-length, column 3) instead of the actual number in n, which leads to erroneous results.
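The effect of wt is easiest to see on a tiny table. The sketch below uses a made-up tibble called toy with the same kind of columns as babynames; the numbers are invented.

```r
library(dplyr)
library(stringr)

# A toy table shaped like babynames (values are made up)
toy <- tibble(
  name = c("Al", "Bo", "Cleo"),
  n    = c(100, 50, 10)
)

# Without wt, count() counts rows: two names of length 2, one of length 4
unweighted <- toy |> count(name_length = str_length(name))
unweighted

# With wt = n, count() sums n within each group: 150 babies and 10 babies
weighted <- toy |> count(name_length = str_length(name), wt = n)
weighted
```

The unweighted version answers "how many distinct rows have this length", while the weighted version answers "how many babies have a name of this length", which is the question being asked.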

#| tbl-cap: "The babynames data-set"
#| label: tbl-baby-names
#| code-fold: true

babynames |>
  slice_head(n = 5) |>
  gt() |>
  fmt_number(prop, decimals = 4)

#| tbl-cap: "The distribution of the length of babynames"
#| label: tbl-baby-names-length
#| code-fold: true

df1 = babynames |>
  mutate(name_length = str_length(name)) |>
  count(name_length, wt = n) |>
  rename(correct_frequency = n)

df2 = babynames |>
  mutate(name_length = str_length(name)) |>
  count(name_length) |>
  rename(wrong_frequency_without_weights = n)

inner_join(df1, df2, by = "name_length") |>
  gt() |>
  fmt_number(-name_length , decimals = 0) |>
  cols_label_with(
    fn = ~ janitor::make_clean_names(., case = "title")
    ) |>
  gt_theme_538()

Solution 2:

The code displayed below extracts the middle letter from each baby name, and the results for the first 10 names are displayed in @tbl-middle-letters. If the string has an even number of characters, we pick the two middle characters.

#| label: tbl-middle-letters
#| tbl-cap: "Middle letters of names"

df3 = babynames |>
  mutate(
    name_length = str_length(name),
    middle_letter_start = if_else(name_length %% 2 == 0,
                                  name_length/2,
                                  (name_length/2) + 0.5),
    middle_letter_end = if_else(name_length %% 2 == 0,
                                (name_length/2) + 1,
                                (name_length/2) + 0.5),
    middle_letter = str_sub(name,
                            start = middle_letter_start,
                            end = middle_letter_end)
    ) |>
  select(-c(year, sex, n, prop)) |>
  slice_head(n = 10)
    
df3 |>
  gt() |>
  cols_label_with(fn = ~ janitor::make_clean_names(., case = "title")) |>
  cols_align(align = "center",
             columns = -name) |>
  gt_theme_538()

Alternative Solution

# Extract middle letter(s) from each baby name
middle_letters <- sapply(babynames$name, function(name) {
  name_length <- str_length(name)
  middle_index <- ceiling(name_length / 2)
  
  if (name_length %% 2 == 0) {
    str_sub(name, middle_index, middle_index + 1)
  } else {
    str_sub(name, middle_index, middle_index)
  }
})

# Display the middle letter(s)
head(middle_letters)
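Since str_length() and str_sub() are both vectorised, the same result can also be computed without a loop or if_else(). This is a sketch of one possible approach; middle_letter() is a hypothetical helper, not a stringr function.

```r
library(stringr)

# Hypothetical helper: returns the middle letter (odd length)
# or the two middle letters (even length) of each name
middle_letter <- function(name) {
  len <- str_length(name)
  # Odd length:  ceiling(len / 2) equals floor(len / 2) + 1, so one letter
  # Even length: the two positions differ by one, so two letters
  str_sub(name, start = ceiling(len / 2), end = floor(len / 2) + 1)
}

middle_letter(c("Mary", "Anna", "Emma", "James", "Bob"))
# "ar" "nn" "mm" "m" "o"
```

On the full babynames table this runs as a single vectorised call, for example inside mutate(), and avoids calling a function once per row as sapply() does.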

Solution 3:

@fig-length-baby-names, @fig-trends-baby-names-start and @fig-trends-baby-names-end show the trends over time.

#| label: fig-length-baby-names
#| fig-cap: "Length of babynames over time"
#| code-fold: true

df4 = babynames |>
  mutate(
    name_length = str_length(name),
    name_start = str_sub(name, 1, 1),
    name_end = str_sub(name, -1, -1)
  )
y_coord = c(5.4, 6.3)

df4 |>
  group_by(year) |>
  count(name_length, wt = n) |>
  summarise(mean_length = weighted.mean(name_length, w = n)) |>
  ggplot(aes(x = year, y = mean_length)) +
  theme_classic() +
  labs(y = "Average name length (for each year)",
       x = "Year", 
       title = "Baby names have become longer over the past 12 decades",
       subtitle = "Between 1890-1920, and 1960-1990 baby names became longer\nBut, since 1990 the names are becoming shorter again") +
  scale_x_continuous(breaks = seq(1880, 2000, 20)) +
  # annotate() draws each band once; geom_rect() with aes() redraws it for every row
  annotate("rect", xmin = 1890, xmax = 1920,
           ymin = y_coord[1], ymax = y_coord[2],
           alpha = 0.5, fill = "grey") +
  annotate("rect", xmin = 1960, xmax = 1990,
           ymin = y_coord[1], ymax = y_coord[2],
           alpha = 0.5, fill = "grey") +
  geom_line(lwd = 1) +
  coord_cartesian(ylim = y_coord) +
  theme(plot.title.position = "plot")

#| label: fig-trends-baby-names-start
#| fig-cap: "Trends on the starting letter of babynames over time"
#| code-fold: true


ns_vec = df4 |>
  count(name_start, wt = n, sort = TRUE) |>
  slice_head(n = 5) |>
  select(name_start) |>
  as_vector() |>
  unname()

df4 |>
  filter(name_start %in% ns_vec) |>
  group_by(year) |>
  count(name_start, wt = n) |>
  mutate(prop = 100*n/sum(n)) |>
  mutate(lbl = if_else(year == 2017, 
                       name_start, 
                       NA)) |>
  ggplot(aes(x = year, y = prop, 
             col = name_start, label = lbl)) +
  geom_line(lwd = 1) +
  ggrepel::geom_label_repel(nudge_x = 1) +
  labs(x = "Year",
       y = "Percentage of names starting with character",
       title = "People's preferences for baby names' starting letter change over time",
       subtitle = "Names starting with A are most popular now\nNames starting with J were popular in the 1940s\nIn 1950s, names starting with D became popular, while those starting with A lost popularity") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title.position = "plot") +
  scale_x_continuous(breaks = seq(1880, 2020, 20))

#| label: fig-trends-baby-names-end
#| fig-cap: "Trends on the ending letter of babynames over time"
#| code-fold: true

ns_vec = df4 |>
  count(name_end, wt = n, sort = TRUE) |>
  slice_head(n = 5) |>
  select(name_end) |>
  as_vector() |>
  unname()

df4 |>
  filter(name_end %in% ns_vec) |>
  group_by(year) |>
  count(name_end, wt = n) |>
  mutate(prop = 100*n/sum(n)) |>
  mutate(lbl = if_else(year == 2017, 
                       name_end, 
                       NA)) |>
  ggplot(aes(x = year, y = prop, 
             col = name_end, label = lbl)) +
  geom_line(lwd = 1) +
  ggrepel::geom_label_repel(nudge_x = 1) +
  labs(x = "Year",
       y = "Percentage of names ending with character",
       title = "People's preferences for baby names' ending letter change over time",
       subtitle = "Names ending in N have risen in popularity over the decades.\nNames ending with E have become less popular over time") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title.position = "plot") +
  scale_x_continuous(breaks = seq(1880, 2020, 20))

Alternative Solution

# lengths of names over time

babynames |>
  group_by(year) |> 
  mutate(length = str_length(name)) |>
  summarize(average_length = weighted.mean(length, n)) |>
  ggplot(aes(x = year, y = average_length)) +
  geom_line() +
  scale_x_continuous(breaks = seq(1880, 2020, 10))

# first letter

babynames |>
  mutate(first_letter = str_sub(name, start = 1, end = 1)) |>
  group_by(year, first_letter) |>
  summarize(total_prop = sum(prop), .groups = "drop") |>
  ggplot(aes(x = year, y = total_prop)) +
  geom_line() +
  facet_wrap(~first_letter)

# last letter

babynames |>
  mutate(last_letter = str_sub(name, start = -1, end = -1)) |>
  group_by(year, last_letter) |>
  summarize(total_prop = sum(prop), .groups = "drop") |>
  ggplot(aes(x = year, y = total_prop)) +
  geom_line() +
  facet_wrap(~last_letter)

Reference

Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for Data Science. 2nd ed. Sebastopol, CA: O’Reilly Media.