R for Data Science Exercises: Introduction

The beginning of a series in which I'm working through the 'R for Data Science 2nd Edition' Exercises to learn R. Follow along if you're interested!

R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)

Introduction

Run the code in your script for the answers! I'm just exploring as I go.

Packages needed for book examples

When required, use install.packages("package_name") to install these packages:

  • [arrow], [babynames], [curl], [duckdb], [gapminder], [ggrepel], [ggridges], [ggthemes], [hexbin], [janitor], [Lahman], [leaflet], [maps], [nycflights13], [openxlsx], [palmerpenguins], [repurrrsive], [tidymodels], [writexl], [tidyverse]
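
To see which of these packages still need installing, something like the sketch below can help. The names book_packages and missing_packages are my own, and the install call is left commented out so nothing is installed by accident.

```r
# Packages used by the book examples (names copied from the list above)
book_packages <- c(
  "arrow", "babynames", "curl", "duckdb", "gapminder", "ggrepel",
  "ggridges", "ggthemes", "hexbin", "janitor", "Lahman", "leaflet",
  "maps", "nycflights13", "openxlsx", "palmerpenguins", "repurrrsive",
  "tidymodels", "writexl", "tidyverse"
)

# Keep only the packages that are not installed yet
missing_packages <- setdiff(book_packages, rownames(installed.packages()))
missing_packages

# Uncomment to install whatever is missing
# install.packages(missing_packages)
```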

Defining a typical data science project

  1. Data wrangling = Importing and tidying data.
  2. Data understanding = Transform data to your needs, visualise to comprehend, and model to inform.
  3. Communicate = Help others visually understand what you learned from the data.

Typical process for data analysis
[Figure: data-science.png]

Data Visualisation Exercises

Packages to load

library(tidyverse)
library(palmerpenguins)
library(ggplot2)
library(ggthemes)

  1. How many rows are in penguins? How many columns?
penguins

344 rows and 8 columns.
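
Instead of reading the dimensions off the printed tibble, they can also be checked directly. A minimal sketch using base R functions:

```r
library(palmerpenguins)

# Number of rows and number of columns, each on its own
nrow(penguins)
ncol(penguins)

# dim() returns both at once: rows first, then columns
dim(penguins)
```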

  2. What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.
?penguins

bill_depth_mm is a number giving the depth of the penguin's bill in millimeters (mm).

  3. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
ggplot(
  data = penguins,
  aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()

Across all penguins the association is weak and slightly negative: the points form a few separate clusters rather than one linear trend. Within each cluster, bill depth tends to increase with bill length.
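
A visual impression of a scatterplot can be checked numerically. The sketch below is my own addition, not part of the exercise: it computes the Pearson correlation between the two bill measurements, first for all penguins and then within each species. The names overall_cor and species_cor are made up for illustration.

```r
library(palmerpenguins)

# Correlation across all penguins, skipping rows with missing values
overall_cor <- cor(
  penguins$bill_length_mm,
  penguins$bill_depth_mm,
  use = "complete.obs"
)
overall_cor

# The same correlation computed separately for each species
species_cor <- sapply(
  split(penguins, penguins$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
)
species_cor
```

Comparing the sign of the overall value with the per-species values shows whether the pattern in the whole data set matches the pattern inside each group.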

  4. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?
ggplot(
  data = penguins, 
  aes(x = bill_depth_mm, y = species)
) + 
  geom_point()

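Species is a categorical variable, so the points line up in three horizontal bands, one per species, and overlap heavily. That makes it hard to compare the distributions. A boxplot (geom_boxplot()) is a better choice for showing a numerical variable across categories.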
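
A boxplot version could look like the sketch below. Putting species on the x-axis and setting na.rm = TRUE are my own choices, and species_boxplot is just a name I picked.

```r
library(ggplot2)
library(palmerpenguins)

# One box per species summarises the distribution of bill depth
species_boxplot <- ggplot(
  data = penguins,
  aes(x = species, y = bill_depth_mm)
) +
  geom_boxplot(na.rm = TRUE)

species_boxplot
```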

  5. Why does the following give an error and how would you fix it?

Code:

  • ggplot(data = penguins) + geom_point()

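The code has no aesthetic (aes) mappings. geom_point() requires the x and y aesthetics, and neither is supplied. To fix it, map a variable to each of x and y with aes(), either in ggplot() or in geom_point().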
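
One possible fix is sketched below. The exercise does not say which variables to plot, so bill_length_mm and bill_depth_mm are an assumption on my part, as is the name fixed_plot.

```r
library(ggplot2)
library(palmerpenguins)

# Supplying x and y gives geom_point() the positions it needs
fixed_plot <- ggplot(
  data = penguins,
  aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()

fixed_plot
```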

  6. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
ggplot(
  data = penguins, 
  aes(x = bill_depth_mm, y = bill_length_mm)
) + 
  geom_point(na.rm = TRUE)

Setting na.rm = TRUE removes rows with missing values silently. The default is FALSE, in which case the missing values are still removed, but with a warning.
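
To see how many points are affected, the rows with a missing value in either plotted variable can be counted. This is a small sketch of my own, and rows_with_na is a name I made up.

```r
library(palmerpenguins)

# TRUE for rows where either plotted variable is missing
rows_with_na <- is.na(penguins$bill_depth_mm) | is.na(penguins$bill_length_mm)

# Number of points geom_point() has to drop
sum(rows_with_na)
```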

  7. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
ggplot(
  data = penguins,
  aes(x = bill_depth_mm, y = bill_length_mm)
) +
  geom_point(na.rm = TRUE) +
  labs(caption = "Data come from the palmerpenguins package.")

  8. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?
ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = bill_depth_mm)) + 
  geom_smooth()

bill_depth_mm should be mapped to the color aesthetic, at the geom level inside geom_point(). It is only needed for the points: they are colored by bill depth, while the smooth line is a single color.

  9. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

The code should display a scatterplot of body mass vs. flipper length, with the points colored by island and a separate smooth line (without a confidence band) for each island.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)

  10. Will these two graphs look different? Why/why not?

The plots will look the same. In the first plot, the data and aesthetic mappings are defined at the global level and passed down to both geoms. In the second plot, each geom is given the same data and aesthetic mappings at the local level.

Plot 1

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

Plot 2

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
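
Beyond comparing the two plots by eye, the data ggplot2 computes for a layer can be compared with layer_data(). The sketch below is my own check, not part of the exercise: it rebuilds both plots under the names plot_global and plot_local (my names) and compares the point layer. Building the plots may print the usual warnings about removed missing values.

```r
library(ggplot2)
library(palmerpenguins)

# Plot 1: data and mappings defined once, at the global level
plot_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

# Plot 2: the same data and mappings repeated in each geom
plot_local <- ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )

# layer_data() returns the computed data for one layer (1 = the points)
points_match <- all.equal(
  layer_data(plot_global, 1),
  layer_data(plot_local, 1)
)
points_match
```

The smooth layer can be compared the same way by passing 2 instead of 1.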

Reference

Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for Data Science. 2nd ed. Sebastopol, CA: O’Reilly Media.