R for Data Science Exercises: ggplot2
These exercises are an introduction to the ggplot2 package used to visualise data in R.
R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)
ggplot2 Calls
Run the code in your script for the answers! I'm just exploring as I go.
ggplot2 Visualising Distribution Exercises
Packages to load
library(tidyverse)
library(palmerpenguins)
library(ggplot2)
library(ggthemes)
- Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
ggplot(penguins, aes(y = species)) +
geom_bar()
Assigning species to the y aesthetic makes the bars horizontal instead of vertical.
- How are the following two plots different? Which aesthetic, colour or fill, is more useful for changing the colour of bars?
Plot 1
ggplot(penguins, aes(x = species)) +
geom_bar(colour = "red")
Plot 2
ggplot(penguins, aes(x = species)) +
geom_bar(fill = "red")
In the first plot only the borders of the bars are red; in the second plot the bars themselves are filled in red. For geom_bar(), colour affects the outline and fill affects the interior, so fill is the more useful of the two for changing the colour of the bars.
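Both plots above set colour or fill to a fixed value outside aes(). As a minimal sketch, the two can also be combined in one call, and layer_data() can be used to check what each bar received. The object names p and bars are only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Set (not map) both aesthetics: black outline, red interior.
p <- ggplot(penguins, aes(x = species)) +
  geom_bar(colour = "black", fill = "red")

# layer_data() returns the values ggplot2 computed for each bar.
bars <- layer_data(p)
bars[, c("x", "count", "colour", "fill")]
```

Every row should carry the same colour and fill, because both were set as constants rather than mapped to a variable.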
- What does the bins argument in geom_histogram() do?
?ggplot2::geom_histogram()
The bins argument sets the number of bins (bars) in a histogram. It is helpful when you don’t have a particular bin width in mind, but you do want to narrow things down to a particular number of bins. If binwidth is also supplied it overrides bins, and if neither is given ggplot2 defaults to 30 bins.
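A quick way to see the effect of bins is to build the same histogram with two different values and count the bars. This is a rough sketch; p_few and p_many are illustrative names, and the exact row counts may depend on the ggplot2 version.

```r
library(ggplot2)

# Same data, two different bin counts (diamonds ships with ggplot2).
p_few  <- ggplot(diamonds, aes(x = carat)) + geom_histogram(bins = 10)
p_many <- ggplot(diamonds, aes(x = carat)) + geom_histogram(bins = 50)

# layer_data() returns one row per bin in the histogram.
nrow(layer_data(p_few))
nrow(layer_data(p_many))
```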
- Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths (0.01, 0.10, 1). What binwidth reveals the most interesting patterns?
Plot 1 (binwidth = 0.01)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.01)
Plot 2 (binwidth = 0.10)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.10)
Plot 3 (binwidth = 1)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 1)
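For this example, a binwidth of 0.1 seems to be a good compromise between showing granularity in the data and not overwhelming ourselves with too many bars. A binwidth of 0.01 produces a very large number of narrow bars, while a binwidth of 1 collapses the data into only a handful of bars.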
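To see which carat values drive the tallest bars at the finest binwidth, the values can also be tabulated directly. This is only a sketch using base R; carat_counts is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data frame

# Count how often each distinct carat value occurs, most frequent first.
carat_counts <- sort(table(diamonds$carat), decreasing = TRUE)

# The ten most common carat values and their counts.
head(carat_counts, 10)
```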
ggplot2 Visualising Relationships Exercises
- The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
glimpse(mpg)
Using glimpse(mpg) or ?mpg will show the variables and their types. Running mpg on its own also prints each column’s type abbreviation under its name. We can assume all of the character-based variables ("chr") are categorical, which includes manufacturer, model, trans, drv, fl, and class. The numerical variables are displ ("dbl") and year, cyl, cty, and hwy ("int").
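The same split can be checked programmatically rather than read off the printed output. A minimal sketch in base R follows; col_types, numeric_vars and categorical_vars are illustrative names.

```r
library(ggplot2)  # provides the mpg data frame

# Class of every column, as a named character vector.
col_types <- sapply(mpg, class)
col_types

# Split the column names by storage type.
numeric_vars     <- names(mpg)[sapply(mpg, is.numeric)]
categorical_vars <- names(mpg)[sapply(mpg, is.character)]

numeric_vars
categorical_vars
```

Note that storage type is only a proxy: a variable such as cyl or year is stored as an integer but could reasonably be treated as categorical.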
- Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to colour, then size, then both colour and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
Plot 1
ggplot(
mpg,
aes(x = hwy, y = displ, colour = cty)
) +
geom_point()
Plot 2
ggplot(
mpg,
aes(x = hwy, y = displ, size = cty)
) +
geom_point()
Plot 3
ggplot(
mpg,
aes(x = hwy, y = displ, size = cty, colour = cty)
) +
geom_point()
Plot 4
ggplot(
mpg,
aes(x = hwy, y = displ, size = cty, colour = cty, shape = drv)
) +
geom_point()
The shape aesthetic doesn’t accept numerical variables (mapping one produces an error), but it does work with categorical variables, which is why Plot 4 uses drv. When a categorical variable is mapped to size, ggplot2 warns that using size for a discrete variable is not advised. Mapping a numerical variable to colour gives a gradient palette, while categorical variables are given distinct colours.
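If a numerical variable with only a few distinct values needs to be shown with shape, one option is to convert it to a factor first. The sketch below does this for cyl; the choice of cyl and the legend title are assumptions for illustration, not part of the exercise.

```r
library(ggplot2)

# cyl is stored as an integer; factor() turns it into a categorical
# variable so that it can be mapped to shape.
p <- ggplot(mpg, aes(x = hwy, y = displ, shape = factor(cyl))) +
  geom_point() +
  labs(shape = "cyl")
p
```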
- In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?
ggplot(
mpg,
aes(x = hwy, y = displ, linewidth = cty)
) +
geom_point()
Linewidth is ignored because a scatterplot has no lines whose width could be altered. The code runs as if the linewidth aesthetic had not been mapped.
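For contrast, linewidth does have a visible effect on a geom that draws lines. The sketch below adds a linear trend line with geom_smooth(); the choice of method = "lm" and a width of 2 is arbitrary, and the linewidth argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

# geom_point() has no lines, but geom_smooth() does,
# so linewidth changes how thick the fitted line is drawn.
p <- ggplot(mpg, aes(x = hwy, y = displ)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 2)
p
```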
- What happens if you map the same variable to multiple aesthetics?
ggplot(
mpg,
aes(x = hwy, y = hwy, colour = hwy)
) +
geom_point()
ggplot2 will allow you to map the same variable to multiple aesthetics, but the resulting plot is redundant: every aesthetic repeats the same values, so it shows no additional information.
- Make a scatterplot of bill_depth_mm vs. bill_length_mm and colour the points by species. What does adding colouring by species reveal about the relationship between these two variables? What about faceting by species?
Plot 1
ggplot(
penguins,
aes(x = bill_depth_mm, y = bill_length_mm, colour = species)
) +
geom_point()
Plot 2
ggplot(
penguins,
aes(x = bill_depth_mm, y = bill_length_mm, colour = species)
) +
geom_point() + facet_wrap(~species)
Colouring by species makes it easier to spot the relationship between bill depth and bill length within each species. Faceting by species shows each relationship more clearly by removing the “noise” of the other species from the panel. Adelie penguins tend to have deep but short bills, Gentoo have long but shallow bills, and Chinstrap have bills that are both deep and long.
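One way to check what the colouring and faceting suggest is to compare the correlation across all penguins with the correlation inside each species. This is a rough sketch in base R; bills, overall and by_species are illustrative names. The overall correlation is expected to be negative while each within-species correlation is positive.

```r
library(palmerpenguins)

# Keep only the columns of interest and drop rows with missing values.
bills <- na.omit(penguins[, c("species", "bill_length_mm", "bill_depth_mm")])

# Correlation ignoring species.
overall <- cor(bills$bill_length_mm, bills$bill_depth_mm)

# Correlation computed separately for each species.
by_species <- sapply(
  split(bills, bills$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

overall
by_species
```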
- Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
colour = species, shape = species
)
) +
geom_point() +
labs(colour = "Species")
This code yields two separate legends because the legend for colour is renamed to "Species" while the legend for shape keeps its default name, "species". Giving both colour and shape the same label in labs() merges the two legends.
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
colour = species, shape = species
)
) +
geom_point() +
labs(
colour = "Species",
shape = "Species"
)
- Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?
Plot 1
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")
Plot 2
ggplot(penguins, aes(x = species, fill = island)) +
geom_bar(position = "fill")
The first plot shows what proportion of each island’s penguin population belongs to each species. The second plot shows how each species’ overall population is split among the islands.
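The proportions that position = "fill" draws can be reproduced as numbers with a two-way table. A minimal sketch in base R, with counts, by_island and by_species as illustrative names:

```r
library(palmerpenguins)

# Number of penguins for every island/species combination.
counts <- table(island = penguins$island, species = penguins$species)

# Plot 1: within each island, the share of each species (rows sum to 1).
by_island <- prop.table(counts, margin = 1)

# Plot 2: within each species, the share on each island (columns sum to 1).
by_species <- prop.table(counts, margin = 2)

round(by_island, 2)
round(by_species, 2)
```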
- Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?
Plot 1
ggplot(mpg, aes(x = class)) +
geom_bar()
Plot 2
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave("mpg-plot.png")
Only the second plot is saved, because by default ggsave() saves the last plot that was displayed.
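To save a plot other than the most recent one, the plot can be stored in a variable and passed to the plot argument of ggsave(). The sketch below saves the bar plot; the object names, the file name and the use of tempdir() are assumptions made for illustration.

```r
library(ggplot2)

# Store each plot in a variable instead of relying on the last plot displayed.
bar_plot     <- ggplot(mpg, aes(x = class)) + geom_bar()
scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

# Hypothetical output path in the session's temporary directory.
out_file <- file.path(tempdir(), "mpg-bar-plot.png")

# The plot argument selects the plot explicitly;
# width and height are in inches by default.
ggsave(out_file, plot = bar_plot, width = 6, height = 4)
```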
- What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?
Changing the file extension in the filename passed to ggsave() from .png to .pdf changes the saved file type. The documentation (?ggsave) lists the supported formats: files can be saved as “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” and “wmf” (Windows only).
Save to PNG
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave("mpg-plot.png")
Save to PDF
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave("mpg-plot.pdf")
Reference
Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.