R for Data Science Exercises: Visualise Layers

This section covers the layered grammar of graphics: aesthetic mappings, geometric objects, facets, statistical transformations, position adjustments, and coordinate systems. Together they give you a fundamental understanding of plotting your data.

R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)

Visualise Layers

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

library(tidyverse)

A gentle introduction

Element      Description
-----------  --------------------------------------------------
Data         The dataset being visualised.
Aesthetics   The scales onto which we map our data.
Geometries   The visual elements used for our data.
Facets       Plotting small multiples.
Statistics   Representations of our data to aid understanding.
Coordinates  The space on which the data will be visualised.
Themes       All non-data elements.

Table: Components of Visualising Data

Aesthetic mappings

Questions

  1. Create a scatterplot of hwy vs. displ where the points are pink filled in triangles.

  2. Why did the following code not result in a plot with blue points?

    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy, colour = "blue"))
    
  3. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

  4. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you'll also need to specify x and y.

Answers

  1. Below is a scatterplot of hwy vs. displ where the points are pink filled in triangles.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(colour = "pink", shape = "triangle")
    
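  2. Inside aes(), "blue" is treated as a data value rather than a colour, so ggplot2 maps that single value to the first colour of its default scale and adds a legend. To make the points blue, colour should be set outside of the aesthetic mapping.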

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(colour = "blue")
    
  3. Stroke controls the width of the border of the points. It is most useful with shapes 21-25 (the filled circle, square, diamond, and two triangles), where the border colour and the fill are set separately.

  4. It creates a logical variable with the values TRUE and FALSE. Cars with displ < 5 are denoted as TRUE and those with displ >= 5 as FALSE. In general, mapping an aesthetic to something other than a variable first evaluates that expression, then maps the aesthetic to the outcome.

    ggplot(mpg, aes(x = hwy, y = displ, colour = displ < 5)) + 
      geom_point()
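
The difference between mapping and setting a colour (answer 2) can be checked directly. This is only a minimal sketch, assuming ggplot2 is installed; the object names p_mapped and p_set are made up for illustration. It uses layer_data() to look at the colour values each layer ends up drawing with.

```r
library(ggplot2)

# Mapped: "blue" sits inside aes(), so it is treated as a one-level variable
# and the colour scale picks the actual colour.
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = "blue"))

# Set: "blue" is an argument of the geom, so it is used as a literal colour.
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), colour = "blue")

# layer_data() returns the computed aesthetics for a layer.
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)

print(mapped_colours) # a single colour chosen by the default discrete scale
print(set_colours)    # the literal value given to the geom
```

Comparing the two printed values shows why the first plot also gets a legend: the scale is doing the colouring, not the string "blue".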
    

Geometric objects

Questions

  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

  2. Earlier in this chapter we used show.legend without explaining it:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_smooth(aes(colour = drv), show.legend = FALSE)
    

    What does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?

  3. What does the se argument to geom_smooth() do?

  4. Recreate the R code necessary to generate the following described graphs. Note that wherever a categorical variable is used in the plot, it's drv.

    • All plots highway fuel efficiency of cars are on the y-axis (hwy) and engine displacement in litres is on the x-axis (displ).
    • 1st plot shows all points in black with a smooth curve overlaid on them.
    • 2nd plot points are also all black, with separate smooth curves overlaid for each level of drive train.
    • 3rd plot, points and the smooth curves are represented in different colours for each level of drive train.
    • 4th plot the points are represented in different colours for each level of drive train but there is only a single smooth line fitted to the whole data.
    • 5th plot, points are represented in different colours for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train.
    • 6th plot points are represented in different colours for each level of drive train and they have a thick white border.

Answers

  1. For a line chart you can use geom_path() or geom_line(). For a boxplot you can use geom_boxplot(). For a histogram, geom_histogram(). For an area chart, geom_area().

  2. It removes the legend for the geom it's specified in; in this case, it removes the legend for the smooth lines that are coloured based on drv. If you remove show.legend = FALSE, the legend for drv is drawn next to the plot.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_smooth(aes(colour = drv), show.legend = FALSE)
    
  3. It displays the confidence interval around the smooth line. You can remove this with se = FALSE.

  4. The code for each of the plots is given below.

    # Starting point. In all plots, highway fuel efficiency (hwy) is on the
    # y-axis and engine displacement in litres (displ) is on the x-axis.
    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_point()
    
    # 1st plot: all points in black with a single smooth curve overlaid.
    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_point() + 
      geom_smooth(se = FALSE) # adds a smoothed line
    
    # 2nd plot: black points with a separate smooth curve for each level of drive train.
    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_smooth(aes(group = drv), se = FALSE) + # adds a smoothed line for each drv group
      geom_point()
    
    # 3rd plot: points and smooth curves in different colours for each level of drive train.
    ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + # colours by drv group
      geom_point() + 
      geom_smooth(se = FALSE) # adds coloured smoothed lines for each group
    
    # 4th plot: points coloured by drive train, with a single smooth line fitted to the whole data.
    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_point(aes(colour = drv)) + # colours by drv group for points only
      geom_smooth(se = FALSE) # adds a single smoothed line
    
    # 5th plot: points coloured by drive train, with a separate smooth curve
    # in a different line type for each level of drive train.
    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_point(aes(colour = drv)) + # colours by drv group for points only
      geom_smooth(aes(linetype = drv), se = FALSE) # adds a different line type for each drv group
    
    # 6th plot: points coloured by drive train with a thick white border.
    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_point(size = 4, colour = "white") + # larger white points act as the border
      geom_point(aes(colour = drv)) # colours by drv group
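
The 6th plot is also a chance to see the stroke aesthetic from the previous section in action. What follows is an alternative sketch, not the book's solution: it draws each point once with shape 21, fills it by drv, and uses a white border whose width is set with stroke. The sizes and the name p_border are my own choices.

```r
library(ggplot2)

# Shape 21 has a separate interior (fill) and border (colour);
# stroke sets the width of that border.
p_border <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(
    aes(fill = drv),
    shape = 21,
    colour = "white",
    size = 3,
    stroke = 2
  )

p_border
```

The two-layer version above and this one look similar; the difference is that here the legend is for fill rather than colour.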
    

Facets

Questions

  1. What happens if you facet on a continuous variable?

  2. What do the empty cells in the plot above with facet_grid(drv ~ cyl) mean? Run the following code. How do they relate to the resulting plot?

    ggplot(mpg) + 
      geom_point(aes(x = drv, y = cyl))
    
  3. What plots does the following code make? What does . do?

    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) +
      facet_grid(drv ~ .)
    
    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) +
      facet_grid(. ~ cyl)
    
  4. Take the first faceted plot in this section:

    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) + 
      facet_wrap(~ cyl, nrow = 2)
    

    What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

  5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn't facet_grid() have nrow and ncol arguments?

  6. Which of the following plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?

    ggplot(mpg, aes(x = displ)) + 
      geom_histogram() + 
      facet_grid(drv ~ .)
    
    ggplot(mpg, aes(x = displ)) + 
      geom_histogram() +
      facet_grid(. ~ drv)
    
  7. Recreate the following plot using facet_wrap() instead of facet_grid(). How do the positions of the facet labels change?

    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) +
      facet_grid(drv ~ .)
    

Answers

  1. Faceting by a continuous variable results in one facet for each unique value of the continuous variable. We can see this in the scatterplot below of displ vs. fl, faceted by cyl.

    ggplot(mpg, aes(x = fl, y = displ)) + 
      geom_point() +
      facet_wrap(~cyl)
    
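  2. There are no cars with rear-wheel drive and 4 cylinders, for example. Therefore the facet corresponding to that combination is empty. In general, empty facets mean no observations fall in that category. These are the same combinations of drv and cyl that have no point in the scatterplot of cyl against drv.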

    ggplot(mpg) + 
      geom_point(aes(x = drv, y = cyl)) +
      facet_grid(drv ~ cyl)
    
  3. In general, the period means "keep everything together". In the first plot, with facet_grid(drv ~ .), the period means "don't facet across columns". In the second plot, with facet_grid(. ~ cyl), the period means "don't facet across rows".

    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) +
      facet_grid(drv ~ .)
    
    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) +
      facet_grid(. ~ cyl)
    
  4. The advantage of faceting is that each class of car is seen separately, without any over-plotting. The disadvantage is that the classes are harder to compare with each other when they sit in separate panels. Colour makes it easy to tell classes apart within a single plot, but the points can overlap. Using both can be helpful, but it does not make comparison across classes any easier. With a larger dataset, over-plotting in a single coloured plot gets worse, so the balance tips towards faceting. If we were interested in a specific class, e.g. compact cars, it would be useful to highlight that group only with an additional layer, as shown in the last plot below.

    # facet
    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) + 
      facet_wrap(~ class, nrow = 2)
    
    # colour
    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy, colour = class))
    
    # both
    ggplot(mpg) + 
      geom_point(
        aes(x = displ, y = hwy, colour = class), 
        show.legend = FALSE) + 
      facet_wrap(~ class, nrow = 2)
    
    # highlighting
    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_point(colour = "gray") +
      geom_point(
        data = mpg |> filter(class == "compact"),
        colour = "pink"
      )
    
  5. nrow controls the number of rows and ncol controls the number of columns the panels should be arranged in. dir controls whether the panels are arranged horizontally or vertically. facet_grid() does not have nrow and ncol arguments because the number of rows and columns is determined by the number of levels of the two categorical variables it plots.

  6. The first plot makes it easier to compare engine size (displ) across cars with different drive trains because the axis that plots displ is shared across the panels. What this says is that if the goal is to make comparisons based on a given variable, that variable should be placed on the shared axis.

    ggplot(mpg, aes(x = displ)) + 
      geom_histogram() + 
      facet_grid(drv ~ .)
    
    ggplot(mpg, aes(x = displ)) + 
      geom_histogram() +
      facet_grid(. ~ drv)
    
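  7. facet_grid(drv ~ .) arranges the panels in rows and puts the facet labels on the right-hand side of each panel. facet_wrap() puts the labels on top of each panel, and the rows have to be requested with nrow = 3.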

    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) +
      facet_grid(drv ~ .)
    
    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) +
      facet_wrap(~drv, nrow = 3)
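
To see which panels of facet_grid(drv ~ cyl) will be empty without drawing anything (answer 2), you can count the combinations yourself. This is a small sketch using base R's table(); the names combos, combo_df and empty_cells are just for illustration.

```r
library(ggplot2) # only needed here for the mpg dataset

# Cross-tabulate drive train against number of cylinders.
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
print(combos)

# Combinations with a count of zero are the empty panels.
combo_df <- as.data.frame(combos)
empty_cells <- combo_df[combo_df$Freq == 0, c("drv", "cyl")]
print(empty_cells)
```

Each row of empty_cells corresponds to one blank cell in the faceted plot and one missing point in the drv vs. cyl scatterplot.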
    

Statistical Transformations

Questions

  1. What is the default geom associated with stat_summary()?
    How could you rewrite the previous plot to use that geom function instead of the stat function?

  2. What does geom_col() do?
    How is it different from geom_bar()?

  3. Most geoms and stats come in pairs that are almost always used in concert.
    Make a list of all the pairs.
    What do they have in common?

  4. What variables does stat_smooth() compute?
    What arguments control its behavior?

  5. In our proportion bar chart, we needed to set group = 1.
    Why?
    In other words, what is the problem with these two graphs?

    ggplot(diamonds, aes(x = cut, y = after_stat(prop))) + 
      geom_bar()
    ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + 
      geom_bar()
    

Answers

  1. The default geom of stat_summary() is geom_pointrange().
    One way to recreate the plot from the book is to compute the summaries first and pass them to geom_pointrange().

    diamonds |>
      group_by(cut) |>
      summarize(
        lower = min(depth),
        upper = max(depth),
        midpoint = median(depth)
      ) |>
      ggplot(aes(x = cut, y = midpoint)) +
      geom_pointrange(aes(ymin = lower, ymax = upper))
    
  2. geom_col() plots the heights of the bars to represent values in the data, while geom_bar() first calculates the heights from data and then plots them.
    geom_col() can be used to make a bar plot from a data frame that represents a frequency table, while geom_bar() can be used to make a bar plot from a data frame where each row is an observation.

  3. Geoms and stats that are almost always used together are listed below:

    geom                    stat
    ----------------------  ----------------------
    geom_bar()              stat_count()
    geom_bin2d()            stat_bin_2d()
    geom_boxplot()          stat_boxplot()
    geom_contour_filled()   stat_contour_filled()
    geom_contour()          stat_contour()
    geom_count()            stat_sum()
    geom_density_2d()       stat_density_2d()
    geom_density()          stat_density()
    geom_dotplot()          stat_bindot()
    geom_function()         stat_function()
    geom_sf()               stat_sf()
    geom_smooth()           stat_smooth()
    geom_violin()           stat_ydensity()
    geom_hex()              stat_bin_hex()
    geom_qq_line()          stat_qq_line()
    geom_qq()               stat_qq()
    geom_quantile()         stat_quantile()

  4. stat_smooth() computes the following variables:

    • y or x: Predicted value
    • ymin or xmin: Lower pointwise confidence interval around the mean
    • ymax or xmax: Upper pointwise confidence interval around the mean
    • se: Standard error

    Its behaviour is controlled by arguments such as method, formula, se, level, span, and n (see ?stat_smooth).

  5. Without a group aesthetic, geom_bar() computes prop separately within each level of cut, so every bar has a proportion of 1.
    In the first pair of plots, we see that setting group = 1 treats all observations as one group, so the marginal proportions of cuts are plotted.
    In the second pair of plots, setting group = color results in the proportions of cuts within each colour being plotted. Note that the variable in diamonds is spelled color.

    # one variable
    ggplot(diamonds, aes(x = cut, y = after_stat(prop))) + 
      geom_bar()
    ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) + 
      geom_bar()
    
    # two variables
    ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + 
      geom_bar()
    ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop), group = color)) + 
      geom_bar()
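
Coming back to question 1, which asks for the geom function instead of the stat function: geom_pointrange() can do the summarising itself if it is given stat = "summary". This is a sketch of that approach, assuming a ggplot2 version that has the fun, fun.min and fun.max arguments; p_summary is just a name I picked.

```r
library(ggplot2)

# geom_pointrange() with stat = "summary" summarises depth within each cut;
# fun, fun.min and fun.max are passed on to the summary stat.
p_summary <- ggplot(diamonds, aes(x = cut, y = depth)) +
  geom_pointrange(
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

p_summary
```

Compared with summarising by hand first, this keeps the raw data in the plot object and lets ggplot2 compute the median, minimum and maximum per cut.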
    
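
The effect of group = 1 in question 5 can also be read off the computed values rather than the bars. A minimal sketch, again leaning on layer_data(); the variable names are illustrative only.

```r
library(ggplot2)

# Without a group aesthetic, each cut is its own group.
p_default <- ggplot(diamonds, aes(x = cut, y = after_stat(prop))) +
  geom_bar()

# With group = 1, all rows belong to one group.
p_grouped <- ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

prop_default <- layer_data(p_default)$prop
prop_grouped <- layer_data(p_grouped)$prop

print(prop_default)      # one proportion per cut, each computed within its own group
print(prop_grouped)      # proportions computed across the whole dataset
print(sum(prop_grouped)) # the grouped proportions add up to the whole
```

Looking at prop_default makes the problem obvious: every cut is 100% of its own group, which is why all the bars come out the same height.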

Position Adjustments

Questions

  1. What is the problem with the following plot?
    How could you improve it?

    ggplot(mpg, aes(x = cty, y = hwy)) + 
      geom_point()
    
  2. What, if anything, is the difference between the two plots?
    Why?

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point()
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(position = "identity")
    
  3. What parameters to geom_jitter() control the amount of jittering?

  4. Compare and contrast geom_jitter() with geom_count().

  5. What's the default position adjustment for geom_boxplot()?
    Create a visualization of the mpg dataset that demonstrates it.

Answers

  1. The mpg dataset has 234 observations; however, the plot shows fewer points than that.
    This is due to overplotting; many cars have the same city and highway mileage.
    This can be addressed by jittering the points.

    ggplot(mpg, aes(x = cty, y = hwy)) + 
      geom_point()
    ggplot(mpg, aes(x = cty, y = hwy)) + 
      geom_jitter()
    
  2. The two plots are identical because "identity" is the default position adjustment for geom_point(). position = "identity" means “just plot the values as given”.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point()
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(position = "identity")
    
  3. The width and height parameters control the amount of horizontal and vertical displacement, respectively.
    Higher values mean more displacement; the noise is added in both directions, so the total spread is twice the value given.
    In the plots below you can see the non-jittered points in gray and the jittered points in black.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(colour = "gray") +
      geom_jitter(height = 1, width = 1)
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(colour = "gray") +
      geom_jitter(height = 1, width = 5)
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(colour = "gray") +
      geom_jitter(height = 5, width = 1)
    
  4. geom_jitter() adds random noise to the location of the points to avoid overplotting.
    geom_count() sizes the points based on the number of observations at a given location.

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_jitter()
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_count()
    
  5. The default position for geom_boxplot() is "dodge2".

    ggplot(mpg, aes(x = cty, y = displ)) +
      geom_boxplot()
    ggplot(mpg, aes(x = cty, y = displ)) +
      geom_boxplot(position = "dodge2")
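
The dodging in answer 5 is easier to see when several boxes share the same x value. The sketch below is my own variation, not the book's: it first checks the position stored in a default geom_boxplot() layer, then uses a categorical x (drv) with fill = class so that the boxes have something to dodge.

```r
library(ggplot2)

# Check the position adjustment stored in a default geom_boxplot() layer.
uses_dodge2 <- inherits(geom_boxplot()$position, "PositionDodge2")
print(uses_dodge2)

# With a second grouping variable (fill = class), the boxes that share an
# x value are placed side by side rather than drawn on top of each other.
p_dodged <- ggplot(mpg, aes(x = drv, y = hwy, fill = class)) +
  geom_boxplot()

p_dodged
```

Adding position = "identity" to the last plot would draw the boxes within each drv on top of each other, which is a quick way to see what the default is doing.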
    

Coordinate Systems

Questions

  1. Turn a stacked bar chart into a pie chart using coord_polar().

  2. What's the difference between coord_quickmap() and coord_map()?

  3. What does the following plot tell you about the relationship between city and highway mpg?
    Why is coord_fixed() important?
    What does geom_abline() do?

    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point() + 
      geom_abline() +
      coord_fixed()
    

Answers

  1. We can turn a stacked bar chart into a pie chart by adding a coord_polar() layer.

    ggplot(diamonds, aes(x = "", fill = cut)) +
      geom_bar()
    
    ggplot(diamonds, aes(x = "", fill = cut)) +
      geom_bar() + 
      coord_polar(theta = "y")
    
  2. coord_map() projects the portion of the earth you're plotting onto a flat 2D plane using a given projection.
    coord_quickmap() is a quicker approximation of this projection that keeps straight lines straight; it works best for small areas close to the equator.

  3. geom_abline() adds a straight line; with its default slope of 1 and intercept of 0 this is the line y = x, where highway mileage equals city mileage.
    coord_fixed() uses a fixed-scale coordinate system in which one unit on the x-axis is the same length as one unit on the y-axis, so the line is drawn at 45 degrees.
    Since all the points are above the line, the highway mileage is always greater than the city mileage for these cars.

    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point() + 
      geom_abline() +
      coord_fixed()
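
The reading of the plot in answer 3 can be double-checked numerically. This is a small sketch that writes out the defaults of geom_abline() explicitly and then compares the two columns directly; p_line and all_above are names I made up.

```r
library(ggplot2)

# geom_abline() defaults to intercept = 0 and slope = 1, i.e. the line y = x.
p_line <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed()

# Numeric check of what the plot shows: is hwy above cty for every car?
all_above <- all(mpg$hwy > mpg$cty)
print(all_above)

p_line
```

Because many cars share the same cty and hwy values, the numeric check is a useful companion to a plot that suffers from overplotting.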
    

Reference

Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.