R for Data Science Exercises: Data Transformation

These exercises are an introduction to the data transformation basics of R, including sorting, filtering, grouping, modifying, summarising, relocating and selecting data.

R for Data Science Exercises: Data Transformation

R for Data Science 2nd Edition Exercises (Wickham, Mine Çetinkaya-Rundel and Grolemund, 2023)

Data Transformation

Run the code in your script for the answers! I'm just exploring as I go.

Data Transformation Exercises

Packages to load

library(nycflights13)
library(tidyverse)

Exercises for Rows

  1. In a single pipeline for each condition, find all flights that meet the condition:

    a. Had an arrival delay of two or more hours (>=120 mins).

    flights |>
      filter(arr_delay >= 120) |>
      arrange(desc(arr_delay))
    

    b. Flew to Houston (IAH or HOU).

    flights |>
      filter(dest %in% c("IAH", "HOU"))
    

    c. Were operated by United (UA), American (AA), or Delta (DL).

    flights |>
      filter(carrier %in% c("UA", "DL", "AA"))
    

    d. Departed in summer (July, August, and September) (Hint: Number out of 12 corresponds to the number of the month in the year).

    flights |>
      filter(month %in% c(7, 8, 9))
    

    e. Arrived more than two hours late (>=120 mins), but didn’t leave late.

    flights |> 
      filter(arr_delay >= 120 & dep_delay <= 0) |> view()
    

    f. Were delayed by at least an hour (>=60 mins), but made up over 30 minutes in flight.

    flights |> 
      filter(dep_delay >= 60 & dep_delay - arr_delay > 30)
    
  2. Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

flights |> 
  arrange(desc(dep_delay)) |> 
  arrange(sched_dep_time) |>
  relocate(dep_delay, sched_dep_time)
  1. Sort flights to find the fastest flights. (Hint: Try including a maths calculation inside of your function).
flights |> 
  mutate(speed = distance / (air_time / 60)) |>
  arrange(desc(speed)) |>
  relocate(speed)
  1. Was there a flight on every day of 2013?

    a. Using distinct()

    flights |> 
      distinct(year, month, day) |>
      nrow()
    

    b. Using count()

    flights |>
      count(year, month, day, sort = TRUE) |>
      arrange(n)
    

Yes, there was a flight every day as the resulting tibble has 365 rows (corresponds to days). Furthermore, using count() also creates a sum column of the total flights for the day.

  1. Which flights travelled the farthest distance? Which travelled the least distance?

    a. Farthest distance

    flights |> 
      arrange(desc(distance)) |>
      relocate(distance)
    

    b. Shortest distance

    flights |> 
      arrange(distance) |>
      relocate(distance)
    
  2. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

When arrange() is used before filter(), the sorting will be applied to all the rows in the dataset, including those filtered out later. When filter() is used before arrange(), the sorting will only be applied to the remaining rows after filtering, which can be more efficient and faster.

Exercises for Columns

  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

I expect the dep_time to be sched_dep_time + dep_delay.

flights |> 
  relocate(dep_time, sched_dep_time, dep_delay)
  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

    a. Using the variable names.

    flights |> 
      select(dep_time, dep_delay, arr_time, arr_delay)
    

    b. Using the starts_with() call.

    flights |> 
      select(starts_with("dep"), starts_with("arr"))
    

    c. Using select through using : where you select the columns between the two specified and - to remove a stated column.

    flights |>
      select(dep_time:arr_delay, -contains("sched"))
    
  2. What happens if you specify the name of the same variable multiple times in a select() call?

flights |> 
  select(dep_time, dep_time)

It will only show the column at the first position it was specified.

  1. What does the any_of() function do? Why might it be helpful in conjunction with this vector?
variables <- c("year", "month", "day", "dep_delay", "arr_delay")

flights |> 
  select(any_of(variables))

any_of() doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed. The order of selected columns is determined by the order in the vector.

  1. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
flights |> 
  select(contains("TIME"))

Yes, it does surprise me since the variable names are lowercase but the string in contains() is uppercase. By default, contains() ignores case. To change this default behaviour, set ignore.case = FALSE.

flights |> 
  select(contains("TIME", ignore.case = FALSE))
  1. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
    • rename() is used to rename the variable.
    • relocate() brings the selected variable to the beginning of the data frame.
flights |>
  rename(air_time_min = air_time) |>
  relocate(air_time_min)
  1. Why doesn’t the following work, and what does the error mean?

Error Code:

  • flights |> select(tailnum) |> arrange(arr_delay)

The code doesn’t work as the result of the select() step is a tibble with only the tailnum variable, so it’s not possible to arrange it by another variable, arr_delay.

Exercises for Groups

  1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights \|\> group_by(carrier, dest) \|\> summarize(n()))
flights |>
  group_by(carrier) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(desc(avg_dep_delay))

The carrier with worst average delays is F9.

To try to disentangle the effects of bad airports vs. bad carriers and using the hint to group and summarize:

# calculate average delay and sort

flights |>
  group_by(carrier, dest) |>
  summarize(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(avg_delay))
# add total_delay and sort

flights |>
  group_by(carrier, dest) |>
  summarize(
    delay = sum(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  mutate(total_delay = delay * n) |> 
  arrange(desc(total_delay))

In the 1st calculation, United Airlines to Seattle has a very high average delay, but it only happens twice. In the 2nd calculation, Delta to Atlanta has the higher total delay.

  1. Find the flights that are most delayed upon departure from each destination.
flights |> 
  group_by(dest) |> 
  arrange(dest, desc(dep_delay)) |>
  slice_head(n = 5) |>
  relocate(dest, dep_delay)

The above code shows the top 5 most departure delayed flights from each destination.

  1. How do delays vary over the course of the day. Illustrate your answer with a plot.
flights |>
  group_by(hour) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  ggplot(aes(x = hour, y = avg_dep_delay)) + 
  geom_smooth()

Over the course of the day, hourly average departure delay increases until about 7pm, and then declines again, however doesn’t go as low as the beginning of the day.

  1. What happens if you supply a negative n to slice_min() and friends?

Using slice_min() with a negative value

flights |> 
  slice_min(dep_delay, n = -5) |>
  relocate(dep_delay)

Using slice_min() with a positive value

flights |> 
  slice_min(dep_delay, n = 5) |>
  relocate(dep_delay)

Using slice_max() with a negative value

flights |> 
  slice_max(dep_delay, n = -5) |>
  relocate(dep_delay)

Using slice_max() with a positive value

flights |> 
  slice_max(dep_delay, n = 5) |>
  relocate(dep_delay)

Supplying a negative value arranges the data frame in either ascending (with slice_min()) or descending (with slice_max()) order, but it doesn’t actually slice the data frame for the lowest/highest values of the given variable.

  1. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

count() is used to count the number of rows in a data frame, grouped by one or more variables. The sort argument specifies if how the results are sorted (TRUE ascending) is the default.

  1. Suppose we have the following tiny data frame:
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

a. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

df |>
  group_by(y)

Groups df by y.

b. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a).

df |>
  arrange(y)

Arranges df in ascending order of the value of y.

c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

df |>
  group_by(y) |>
  summarize(mean_x = mean(x))

Groups df by y and then calculates the average value of x for each group.

d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))

Groups df by y and z, and then calculates the average value of x for each group combination. The resulting data frame is grouped by y.

e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

Groups df by y and z, and then calculates the average value of x for each group combination. The resulting data frame is not grouped.

f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))

df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x))

Groups df by y and z, and then calculates the average value of x for each group combination. With summarize() the resulting data frame has one row per group combination while with mutate() the resulting data frame has the same number of rows as the original data frame.

Reference

Wickham, H., Mine Çetinkaya-Rundel and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.