
R for Data Science Exercises: Data Transformation Case Study

In this case study, we will be comparing how often a player successfully hits the ball (H) to the total number of attempts they made to hit the ball (AB). Including a count ensures our analysis is based on a reasonable amount of data and not just a few instances.

Lorcán Mason

25 Jun 2024 • 2 min read

R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)

Data Transformation Case Study

Run the code in your script for the answers! I'm just exploring as I go.

Aggregates and Sample Size with the `Lahman` package

Packages to load

library(Lahman)
library(tidyverse)

Whenever you summarise data it is always a good idea to include a count (n()) of the number of observations. This helps you to ensure that you’re not drawing conclusions based on very small amounts of data.
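As a minimal sketch of the idea, the toy example below summarises a small made-up data frame and reports n() next to each mean. The `scores` data, its column names and its values are all hypothetical, chosen only to show how a count exposes a group that rests on a single observation.

```r
library(dplyr)

# Hypothetical toy data: group "a" has four observations, group "b" only one
scores <- tibble(
  group = c("a", "a", "a", "a", "b"),
  value = c(10, 12, 11, 13, 30)
)

group_summary <- scores |>
  group_by(group) |>
  summarize(
    mean_value = mean(value),
    n = n()  # number of rows behind each mean
  )

group_summary
```

Group "b" has the highest mean, but the n column shows that it is based on one row, so it deserves far less trust than the mean for group "a".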

In this case study, we will demonstrate this with some baseball data from the Lahman package. Specifically, we compare how often a player successfully hits the ball (H) to the total number of attempts they made to hit the ball (AB). Including a count ensures our analysis is based on a reasonable amount of data and not just a few instances.

batters <- Lahman::Batting |> 
  group_by(playerID) |> 
  summarize(
    performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    n = sum(AB, na.rm = TRUE)
  )

When we plot the skill of the batter (measured by the batting average, performance) against the number of opportunities to hit the ball (measured by at-bats, n), we notice two things:

The variation in performance is larger among players with fewer at-bats. This is a common pattern: when you compare averages for different groups, you’ll often see less variation as the group size gets larger.
Skill (performance) and opportunities to hit the ball (n) are positively correlated, because teams prefer to give their best batters the most opportunities to bat.

batters |> 
  filter(n > 100) |> 
  ggplot(aes(x = n, y = performance)) +
  geom_point(alpha = 1 / 10) + 
  geom_smooth(se = FALSE)
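The first of these patterns, less variation as the number of at-bats grows, can be reproduced with a small simulation. This is only a sketch under stated assumptions: every simulated player has the same assumed true average (`true_average`, a made-up value), and the names `n_players` and `simulate_averages()` are hypothetical helpers, not part of the Lahman package.

```r
set.seed(123)  # make the simulation reproducible

true_average <- 0.26  # assumed true skill, identical for every simulated player
n_players <- 1000     # number of simulated players per scenario

# Simulate observed batting averages for players who all get `at_bats` attempts
simulate_averages <- function(at_bats) {
  hits <- rbinom(n_players, size = at_bats, prob = true_average)
  hits / at_bats
}

sd_small <- sd(simulate_averages(10))    # few attempts per player
sd_large <- sd(simulate_averages(1000))  # many attempts per player

c(few_at_bats = sd_small, many_at_bats = sd_large)
```

Even though skill is identical by construction, the observed averages spread out much more when each player has only a few attempts, which is the same effect seen on the left-hand side of the plot.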

If you simply rank players by desc(performance), i.e. batting average, the players at the top of the list will be those who had very few at-bats and happened to get a hit. These players are not necessarily the most skilled.

batters |> 
  arrange(desc(performance))

You can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.
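One common way to soften this problem is to shrink each average towards a typical value by adding pseudo-counts before dividing. The sketch below illustrates that general idea only; it is not the exact procedure from either linked article. The players in `toy_batters` and the values of `prior_hits` and `prior_misses` are hypothetical and chosen for illustration, whereas a full empirical Bayes analysis would estimate the prior from the data.

```r
library(dplyr)

# Hypothetical players: hits (H) and at-bats (AB) are made-up values
toy_batters <- tibble(
  playerID = c("lucky", "regular", "star"),
  H  = c(1, 250, 3000),
  AB = c(1, 1000, 10000)
)

# Assumed pseudo-counts (illustrative, not estimated from data);
# together they imply a typical average of 26 / (26 + 74) = 0.26
prior_hits <- 26
prior_misses <- 74

ranked <- toy_batters |>
  mutate(
    performance = H / AB,  # raw batting average
    shrunk = (H + prior_hits) / (AB + prior_hits + prior_misses)
  ) |>
  arrange(desc(shrunk))

ranked
```

With the raw average, the player with a single hit in a single at-bat ranks first. After shrinkage, that player's estimate stays close to the assumed typical value, and the player with thousands of at-bats moves to the top.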

Reference

Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.
