Learning R Featured

R for Data Science Exercises: Factors

Lorcán Mason

06 Jul 2024 • 5 min read

R for Data Science 2nd Edition Exercises (Wickham, Mine Çetinkaya-Rundel and Grolemund, 2023)

Factors

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

library(tidyverse)
library(ggthemes)
library(gt)
library(gtExtras)

Introduction

Factors are a type of variable in R that are used to store categorical data. Factors can be ordered or unordered. Factors are stored as integers, and have labels associated with these unique integers. These labels are used when displaying the factor. Factors are very useful in statistical modelling and graphics. They are also useful when you want to display character vectors in a non-alphabetical order.

Questions

Explore the distribution of rincome (reported income).
What makes the default bar chart hard to understand?
How could you improve the plot?
What is the most common relig in this survey?
What's the most common partyid?
Which relig does denom (denomination) apply to?
How can you find out with a table?
How can you find out with a visualization?

Answers

Solution 1:

The default bar chart is hard to understand because:

It is vertical, and the names of categories overlap on x-axis.
The "Not applicable" category is before the lowest income group. Thus, the pattern is disturbed.

We could improve the plot, as shown in @fig-q1-ex3, by:

Making it into a horizontal bar chart to allow space and easy reading of categories of income levels.
Move the "Not Applicable" level after the highest income level, along-side "Refused", "Dont' know" and "No answer".
Further, we could remove non-data ink, as per principles of Mr. Tufte to make our pattern stand out. Also, we could create a separate colouring scheme for data outside the income levels.

#| label: fig-q1-ex3
#| fig-cap: "Improved bar chart"

no_levels = levels(gss_cat$rincome)[c(1:3, 16)]

gss_cat |>
  mutate(col_level = rincome %in% no_levels) |>
  ggplot(aes(y = fct_relevel(rincome, 
                             "Not applicable",
                             after = 3),
             fill = col_level)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Number of respondents", y = NULL,
       title = "Income Levels of respondents in General Social Survey") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none") +
  scale_fill_manual(values = c("#3d3b3b", "#999494"))

Solution 2:

The most common relig is "Protestant". And, the most common partyid is "Independent".

gss_cat |>
  count(relig, sort = TRUE)

gss_cat |>
  count(partyid, sort = TRUE)

Solution 3:

We can see from the code below that more than one factor values in denom (denomination) occur only in "Protestant", "Christian" and "Other" religions. To explore further, we can cross-tabulate religion and denomination, as shown in @tbl-q3-ex3, and realize that the only religion to which denomination really applies to is "Protestant".

We could also do a visualization as in @fig-q3a-ex3.

gss_cat |>
  group_by(relig) |>
  summarise(n = n_distinct(denom)) |>
  arrange(desc(n)) |>
  filter(n > 1)

#| label: tbl-q3-ex3
#| tbl-cap: "Cross-Table of the deominations within the three religions"

gss_cat |>
  filter(relig %in% c("Protestant", "Christian", "Other")) |>
  group_by(relig, denom) |>
  tally() |>
  spread(relig, n) |>
  arrange(desc(Christian)) |>
  gt() |>
  sub_missing(missing_text = "") |>
  gt_theme_538()

#| label: fig-q3a-ex3
#| tbl-cap: "Visualization of the number of deominations within religions"

gss_cat |>
  group_by(relig) |>
  summarise(n = n_distinct(denom)) |>
  arrange(desc(n)) |>
  ggplot(aes(y = reorder(relig, n), x = n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Number of denominations", y = NULL,
       title = "Only Protestant religion has demoninations within it")

Modifying Factor Order

Questions

There are some suspiciously high numbers in tvhours.
Is the mean a good summary?
For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?

Answers

Solution 1:

No, mean is not a good summary as the distribution of tvhours is right skewed. Instead, we should use median as a summary measure.

gss_cat |>
  drop_na() |>
  mutate(tvhours = as_factor(tvhours)) |>
  ggplot(aes(x = tvhours)) +
  geom_bar(col = "black", fill = "white") +
  theme_clean() +
  labs(x = "Hours per day spent watching TV",
       y = "Numbers", title = "Distribution of TV Hours is right skewed")

Solution 2:

The variables in the gss_cat data-set which are factors are: ---

Factor Variable	Levels	Order is arbitrary or principled
`marital`	No answer, Never married, Separated, Divorced, Widowed and, Married	Arbitrary, since they are not in a specific order
`race`	Other, Black, White and, Not applicable	Arbitrary, since they are not in a specific order
`rincome`	No answer, Don't know, Refused, $25000 or more, $20000 - 24999, $15000 - 19999, $10000 - 14999, $8000 to 9999, $7000 to 7999, $6000 to 6999, $5000 to 5999, $4000 to 4999, $3000 to 3999, $1000 to 2999, Lt $1000 and, Not applicable	Principled, since the income levels are in a specified increasing or decreasing order, with few levels arbitrary
`partyid`	No answer, Don't know, Other party, Strong republican, Not str republican, Ind,near rep, Independent, Ind,near dem, Not str democrat and, Strong democrat	Partly Principled, as there are two extremes, and then levels in the middle.
`relig`	No answer, Don't know, Inter-nondenominational, Native american, Christian, Orthodox-christian, Moslem/islam, Other eastern, Hinduism, Buddhism, Other, None, Jewish, Catholic, Protestant and, Not applicable	Arbitrary, as the religions are not in a specific order.
`denom`	No answer, Don't know, No denomination, Other, Episcopal, Presbyterian-dk wh, Presbyterian, merged, Other presbyterian, United pres ch in us, Presbyterian c in us, Lutheran-dk which, Evangelical luth, Other lutheran, Wi evan luth synod, Lutheran-mo synod, Luth ch in america, Am lutheran, Methodist-dk which, Other methodist, United methodist, Afr meth ep zion, Afr meth episcopal, Baptist-dk which, Other baptists, Southern baptist, Nat bapt conv usa, Nat bapt conv of am, Am bapt ch in usa, Am baptist asso and, Not applicable	Arbitrary, as the denominations are not in a specific order.

Solution 3:

Moving "Not applicable" to the front of the levels move it to the bottom of the plot, because ggplot2 plots the levels in increasing order, starting bottom's upwards. Thus, the first level is plotted at the bottom, and the last level at the top.

Modifying Factor Levels

Questions

How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
How could you collapse rincome into a small set of categories?
Notice there are 9 groups (excluding other) in the fct_lump example above.
Why not 10?
(Hint: type ?fct_lump, and find the default for the argument other_level is "Other".)

Answers

Solution 1:

As reflected in the @fig-q1-ex5, the proportions of people identifying as Democrat has slightly increased, Republican has slightly decreased, and Independent has increased, over the period of 15 years reflected in the data-set.

#| label: fig-q1-ex5
#| fig-cap: "Stacked bar chart of the partyid in the data-set"


gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "Republican"  = c("Strong republican",  "Not str republican"),
      "Democrat"    = c("Strong democrat", "Not str democrat"),
      "Independent" = c("Independent", "Ind,near dem", "Ind,near rep"),
      "Others"      = c("No answer", "Don't know", "Other party")
    )
  ) |>
  group_by(year, partyid) |>
  count() |>
  ggplot(aes(x = year, y = n, fill = partyid)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = c("lightgrey", "red", "grey", "blue")) +
  theme_classic() +
  theme(legend.position = "bottom") +
  labs(x = "Year", y = "Proportion of respondents", fill = "Party",
       subtitle = "Proportion of republicans has decreased, while that of independents has increased over the years",
       title = "In 15 years, share of parties' supporters has changed")

Solution 2:

We could collapse the rincome into a small set of categories using the following functions: --

fct_lump_n()
fct_lump_lowfreq()
fct_lump_min()
fct_lump_prop()
fct_lump()
fct_collapse()

gss_cat|>
  mutate(
    rincome = fct_lump_n(rincome, n = 6)
  ) |>
  group_by(rincome) |>
  count() |>
  arrange(desc(n)) |>
  ungroup() |>
  gt() |>
  cols_label(rincome = "Annual Income",
             n = "Numbers")

Solution 3:

There are 9 groups (excluding other) in the fct_lump example above, as shown below also in @fig-q3-ex5. This is because n = 10 argument limits the total groups to 10, and the function needs one group for "Other", i.e. all other groups whose count is lesser than top 9 groups. Thus, the groups shown are 9, with 1 as "Other" (at the end).

#| label: fig-q3-ex5
#| fig-cap: "Table of number of respondents from each of top 10 religions, including Other"

gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 10)) |>
  count(relig) |>
  gt()

Reference

Wickham, H., Mine Çetinkaya-Rundel and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.

R for Data Science 2nd Edition Exercises (Wickham, Mine Çetinkaya-Rundel and Grolemund, 2023)

Factors

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

Introduction

General Social Survey

Questions

Answers

Modifying Factor Order

Questions

Answers

Modifying Factor Levels

Questions

Answers

Reference

Sign up for FREE to access the downloadable map!