R for Data Science Exercises: ggplot2

These exercises are an introduction to the ggplot2 package used to visualise data in R.

R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)

ggplot2 Calls

Run the code in your script for the answers! I'm just exploring as I go.

ggplot2 Visualising Distribution Exercises

Packages to load

library(tidyverse)
library(palmerpenguins)
library(ggplot2)
library(ggthemes)

  1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
ggplot(penguins, aes(y = species)) + 
  geom_bar()

Mapping species to y makes the bars horizontal instead of vertical: the species are listed on the y-axis and the counts run along the x-axis.

  1. How are the following two plots different? Which aesthetic, colour or fill, is more useful for changing the colour of bars?

Plot 1

ggplot(penguins, aes(x = species)) +
  geom_bar(colour = "red")

Plot 2

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")

In the first plot only the borders of the bars are coloured red; in the second plot the bars are filled with red. For geom_bar(), colour controls the outline and fill controls the interior, so fill is the more useful of the two for changing the colour of the bars.
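One way to confirm which property each argument sets is to inspect the data ggplot2 builds for the layer. The following is a minimal sketch using layer_data(); the object names p_bars and bar_data are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)

# Set an outline colour and a fill colour as constants (outside aes())
p_bars <- ggplot(penguins, aes(x = species)) +
  geom_bar(colour = "red", fill = "grey80")

# layer_data() returns the values ggplot2 will draw the bars with
bar_data <- layer_data(p_bars)
unique(bar_data$colour)  # outline colour of the bars
unique(bar_data$fill)    # interior colour of the bars
```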

  1. What does the bins argument in geom_histogram() do?
?ggplot2::geom_histogram()

The bins argument sets the number of bins (bars) in a histogram. It is helpful when you don’t have a particular bin width in mind, but you do want a particular number of bins. It defaults to 30 and is overridden by binwidth if both are supplied.
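As a rough check of what bins does, the sketch below asks for 10 bins on the diamonds carat variable and counts the rows of the built layer data, where each row corresponds to one bin. The choice of 10 and the names p_bins and n_bars are assumptions made for illustration.

```r
library(ggplot2)

# Ask for a fixed number of bins instead of a fixed width
p_bins <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 10)

# Each row of the built layer data corresponds to one bin
n_bars <- nrow(layer_data(p_bins))
n_bars
```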

  1. Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths (0.01, 0.10, 1). What binwidth reveals the most interesting patterns?

Plot 1 (binwidth = 0.01)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

Plot 2 (binwidth = 0.10)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.10)

Plot 3 (binwidth = 1)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 1)

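For this example, a binwidth of 0.10 seems to be a good compromise between showing granularity in the data and not overwhelming ourselves with too many bars. A binwidth of 0.01 produces hundreds of thin bars, while a binwidth of 1 collapses the data into a handful of bars.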
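To put a number on the trade-off, the sketch below counts how many bins each binwidth produces (including any empty ones). It is only an illustration; binwidths and bars_per_width are names introduced here.

```r
library(ggplot2)

binwidths <- c(0.01, 0.10, 1)

# Count the bins each binwidth produces for carat
bars_per_width <- sapply(binwidths, function(bw) {
  p <- ggplot(diamonds, aes(x = carat)) +
    geom_histogram(binwidth = bw)
  nrow(layer_data(p))
})

setNames(bars_per_width, binwidths)
```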

ggplot2 Visualising Relationships Exercises

  1. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
glimpse(mpg)

Using glimpse(mpg) or ?mpg will show the variables, and printing mpg itself shows each column's type abbreviation beneath its name. We can treat all of the character variables ("chr") as categorical: manufacturer, model, trans, drv, fl, and class. The numerical variables are displ (a double, "dbl") and year, cyl, cty, and hwy (integers, "int").
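The same split can be done programmatically. This is a small sketch in base R, assuming that character columns are the categorical ones; categorical_vars and numerical_vars are illustrative names.

```r
library(ggplot2)

# Split column names by storage type
categorical_vars <- names(mpg)[sapply(mpg, is.character)]
numerical_vars   <- names(mpg)[sapply(mpg, is.numeric)]

categorical_vars
numerical_vars
```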

  1. Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to colour, then size, then both colour and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

Plot 1

ggplot(
  mpg, 
  aes(x = hwy, y = displ, colour = cty)
  ) + 
  geom_point()

Plot 2

ggplot(
  mpg, 
  aes(x = hwy, y = displ, size = cty)
  ) + 
  geom_point()

Plot 3

ggplot(
  mpg, 
  aes(x = hwy, y = displ, size = cty, colour = cty)
  ) + 
  geom_point()

Plot 4

ggplot(
  mpg, 
  aes(x = hwy, y = displ, size = cty, colour = cty, shape = drv)
  ) + 
  geom_point()

The shape aesthetic doesn’t accept numerical variables (ggplot2 raises an error when a continuous variable is mapped to shape), but it does work with categorical variables. When a categorical variable is mapped to size, a warning advises against it. Mapping to colour gives a gradient palette for numerical variables, while categorical variables are given distinct colours.
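To see the shape behaviour without stopping a script, the sketch below maps the numerical variable cty to shape and captures whatever error is raised when the plot is built. The names p_shape and shape_result are illustrative, and the exact error wording may differ between ggplot2 versions.

```r
library(ggplot2)

# Map a numerical variable (cty) to shape
p_shape <- ggplot(mpg, aes(x = hwy, y = displ, shape = cty)) +
  geom_point()

# Creating the plot object succeeds; the error appears when it is built
shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```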

  1. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?
ggplot(
  mpg,
  aes(x = hwy, y = displ, linewidth = cty)
  ) + 
  geom_point()

The linewidth mapping is ignored because geom_point() draws no lines whose width could be altered. The code runs as if the linewidth aesthetic had not been mapped.
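One way to check which geoms understand linewidth is to list the aesthetics each geom object reports. This sketch assumes ggplot2 3.4.0 or later (where linewidth was introduced) and uses the ggproto objects GeomPoint and GeomLine directly.

```r
library(ggplot2)

# Aesthetics each geom understands
point_aes <- GeomPoint$aesthetics()
line_aes  <- GeomLine$aesthetics()

"linewidth" %in% point_aes  # does geom_point() use linewidth?
"linewidth" %in% line_aes   # does geom_line() use linewidth?
```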

  1. What happens if you map the same variable to multiple aesthetics?
ggplot(
  mpg,
  aes(x = hwy, y = hwy, colour = hwy)
  ) + 
  geom_point()

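ggplot2 will allow you to map the same variable to multiple aesthetics, but the result is redundant: here every point falls on the diagonal and the colour gradient repeats the same hwy values, so the plot shows no additional information.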

  1. Make a scatterplot of bill_depth_mm vs. bill_length_mm and colour the points by species. What does adding colouring by species reveal about the relationship between these two variables? What about faceting by species?

Plot 1

ggplot(
  penguins,
  aes(x = bill_depth_mm, y = bill_length_mm, colour = species)
  ) + 
  geom_point()

Plot 2

ggplot(
  penguins,
  aes(x = bill_depth_mm, y = bill_length_mm, colour = species)
  ) + 
  geom_point() + facet_wrap(~species)

Colouring by species shows that the species form distinct clusters of bill size. Faceting by species shows each species' relationship more clearly by removing the “noise” of the other species from each panel. Adelie penguins tend to have shorter, deeper bills, Gentoo have longer, shallower bills, and Chinstrap have bills that are both deep and long. Within each species, longer bills also tend to be deeper.
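Comparing the correlation across all penguins with the correlation inside each species is one way to check whether the pooled pattern matches the per-species pattern seen in the plots. This is a rough sketch in base R; overall_cor and species_cor are illustrative names, and rows with missing measurements are dropped.

```r
library(palmerpenguins)

# Correlation across all penguins, ignoring species
overall_cor <- cor(
  penguins$bill_length_mm, penguins$bill_depth_mm,
  use = "complete.obs"
)

# Correlation within each species
species_cor <- sapply(split(penguins, penguins$species), function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
})

overall_cor
species_cor
```

If the overall value and the per-species values have opposite signs, the grouping variable is driving the pooled pattern, which is what colouring or faceting by species makes visible.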

  1. Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    colour = species, shape = species
  )
) +
  geom_point() +
  labs(colour = "Species")

This code yields two separate legends because the legend for colour is renamed to "Species" but the legend for shape is not, so it keeps the default name "species". Legends are only merged when their titles match, so giving both colour and shape the same title in labs() solves the issue.

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    colour = species, shape = species
  )
) +
  geom_point() +
  labs(
    colour = "Species",
    shape = "Species"
  )

  1. Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

Plot 1

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

Plot 2

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

The first plot shows the proportion of each species within each island’s penguin population. The second plot shows how each species’ population is split among the islands.
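The numbers behind the two plots can be sketched with a contingency table, normalised by row for the first plot and by column for the second. The object names are illustrative.

```r
library(palmerpenguins)

# Counts of each species on each island
counts <- table(island = penguins$island, species = penguins$species)

# Plot 1: share of each species within an island (rows sum to 1)
species_within_island <- prop.table(counts, margin = 1)

# Plot 2: share of each island within a species (columns sum to 1)
island_within_species <- prop.table(counts, margin = 2)

round(species_within_island, 2)
round(island_within_species, 2)
```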

  1. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

Plot 1

ggplot(mpg, aes(x = class)) +
  geom_bar()

Plot 2

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
ggsave("mpg-plot.png")

Only the second plot is saved, because by default ggsave() saves the last plot displayed.
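If the first plot is the one you want, a possible approach is to store the plots in objects and pass one to ggsave() through its plot argument. In this sketch the object names, the file name mpg-bar.png, the temporary directory and the dimensions are all assumptions chosen for illustration.

```r
library(ggplot2)

# Store both plots so either can be saved later
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()
p_scatter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Passing plot = ... saves that plot regardless of which was made last
bar_file <- file.path(tempdir(), "mpg-bar.png")
ggsave(bar_file, plot = p_bar, width = 6, height = 4)
```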

  1. What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?

Changing the file extension in ggsave() from .png to .pdf changes the saved file type. The documentation for the device argument in ?ggsave lists the supported types: “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” and “wmf” (Windows only).

Save to PNG

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
ggsave("mpg-plot.png")

Save to PDF

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
ggsave("mpg-plot.pdf")
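Because ggsave() picks the graphics device from the file extension, the same plot object can be written to several formats in a loop. This is a sketch only; the temporary directory, the dimensions and the names p_scatter and out_files are assumptions.

```r
library(ggplot2)

p_scatter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Build one output path per file type
out_files <- file.path(tempdir(), paste0("mpg-plot.", c("png", "pdf")))

# ggsave() chooses the device from each file's extension
for (f in out_files) {
  ggsave(f, plot = p_scatter, width = 6, height = 4)
}

file.exists(out_files)
```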

Reference

Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.