R for Data Science Exercises: Factors

Factors are a type of variable in R that are used to store categorical data. Factors can be ordered or unordered. Factors are stored as integers, and have labels associated with these unique integers.

R for Data Science Exercises: Factors

R for Data Science 2nd Edition Exercises (Wickham, Mine Çetinkaya-Rundel and Grolemund, 2023)

Factors

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

library(tidyverse)
library(ggthemes)
library(gt)
library(gtExtras)

Introduction

Factors are a type of variable in R that are used to store categorical data. Factors can be ordered or unordered. Factors are stored as integers, and have labels associated with these unique integers. These labels are used when displaying the factor. Factors are very useful in statistical modelling and graphics. They are also useful when you want to display character vectors in a non-alphabetical order.

General Social Survey

Questions

  1. Explore the distribution of rincome (reported income).
    What makes the default bar chart hard to understand?
    How could you improve the plot?

  2. What is the most common relig in this survey?
    What's the most common partyid?

  3. Which relig does denom (denomination) apply to?
    How can you find out with a table?
    How can you find out with a visualization?

Answers

Solution 1:

The default bar chart is hard to understand because:

  1. It is vertical, and the names of categories overlap on x-axis.

  2. The "Not applicable" category is before the lowest income group. Thus, the pattern is disturbed.

We could improve the plot, as shown in @fig-q1-ex3, by:

  1. Making it into a horizontal bar chart to allow space and easy reading of categories of income levels.

  2. Move the "Not Applicable" level after the highest income level, along-side "Refused", "Dont' know" and "No answer".

  3. Further, we could remove non-data ink, as per principles of Mr. Tufte to make our pattern stand out. Also, we could create a separate colouring scheme for data outside the income levels.

#| label: fig-q1-ex3
#| fig-cap: "Improved bar chart"

no_levels = levels(gss_cat$rincome)[c(1:3, 16)]

gss_cat |>
  mutate(col_level = rincome %in% no_levels) |>
  ggplot(aes(y = fct_relevel(rincome, 
                             "Not applicable",
                             after = 3),
             fill = col_level)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Number of respondents", y = NULL,
       title = "Income Levels of respondents in General Social Survey") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none") +
  scale_fill_manual(values = c("#3d3b3b", "#999494"))

Solution 2:

The most common relig is "Protestant". And, the most common partyid is "Independent".

gss_cat |>
  count(relig, sort = TRUE)

gss_cat |>
  count(partyid, sort = TRUE)

Solution 3:

We can see from the code below that more than one factor values in denom (denomination) occur only in "Protestant", "Christian" and "Other" religions. To explore further, we can cross-tabulate religion and denomination, as shown in @tbl-q3-ex3, and realize that the only religion to which denomination really applies to is "Protestant".

We could also do a visualization as in @fig-q3a-ex3.

gss_cat |>
  group_by(relig) |>
  summarise(n = n_distinct(denom)) |>
  arrange(desc(n)) |>
  filter(n > 1)
#| label: tbl-q3-ex3
#| tbl-cap: "Cross-Table of the deominations within the three religions"

gss_cat |>
  filter(relig %in% c("Protestant", "Christian", "Other")) |>
  group_by(relig, denom) |>
  tally() |>
  spread(relig, n) |>
  arrange(desc(Christian)) |>
  gt() |>
  sub_missing(missing_text = "") |>
  gt_theme_538()
#| label: fig-q3a-ex3
#| tbl-cap: "Visualization of the number of deominations within religions"

gss_cat |>
  group_by(relig) |>
  summarise(n = n_distinct(denom)) |>
  arrange(desc(n)) |>
  ggplot(aes(y = reorder(relig, n), x = n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Number of denominations", y = NULL,
       title = "Only Protestant religion has demoninations within it")

Modifying Factor Order

Questions

  1. There are some suspiciously high numbers in tvhours.
    Is the mean a good summary?

  2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

  3. Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?

Answers

Solution 1:

No, mean is not a good summary as the distribution of tvhours is right skewed. Instead, we should use median as a summary measure.

gss_cat |>
  drop_na() |>
  mutate(tvhours = as_factor(tvhours)) |>
  ggplot(aes(x = tvhours)) +
  geom_bar(col = "black", fill = "white") +
  theme_clean() +
  labs(x = "Hours per day spent watching TV",
       y = "Numbers", title = "Distribution of TV Hours is right skewed")

Solution 2:

The variables in the gss_cat data-set which are factors are: ---

Factor Variable Levels Order is arbitrary or principled
marital No answer, Never married, Separated, Divorced, Widowed and, Married Arbitrary, since they are not in a specific order
race Other, Black, White and, Not applicable Arbitrary, since they are not in a specific order
rincome No answer, Don't know, Refused, $25000 or more, $20000 - 24999, $15000 - 19999, $10000 - 14999, $8000 to 9999, $7000 to 7999, $6000 to 6999, $5000 to 5999, $4000 to 4999, $3000 to 3999, $1000 to 2999, Lt $1000 and, Not applicable Principled, since the income levels are in a specified increasing or decreasing order, with few levels arbitrary
partyid No answer, Don't know, Other party, Strong republican, Not str republican, Ind,near rep, Independent, Ind,near dem, Not str democrat and, Strong democrat Partly Principled, as there are two extremes, and then levels in the middle.
relig No answer, Don't know, Inter-nondenominational, Native american, Christian, Orthodox-christian, Moslem/islam, Other eastern, Hinduism, Buddhism, Other, None, Jewish, Catholic, Protestant and, Not applicable Arbitrary, as the religions are not in a specific order.
denom No answer, Don't know, No denomination, Other, Episcopal, Presbyterian-dk wh, Presbyterian, merged, Other presbyterian, United pres ch in us, Presbyterian c in us, Lutheran-dk which, Evangelical luth, Other lutheran, Wi evan luth synod, Lutheran-mo synod, Luth ch in america, Am lutheran, Methodist-dk which, Other methodist, United methodist, Afr meth ep zion, Afr meth episcopal, Baptist-dk which, Other baptists, Southern baptist, Nat bapt conv usa, Nat bapt conv of am, Am bapt ch in usa, Am baptist asso and, Not applicable Arbitrary, as the denominations are not in a specific order.

Solution 3:

Moving "Not applicable" to the front of the levels move it to the bottom of the plot, because ggplot2 plots the levels in increasing order, starting bottom's upwards. Thus, the first level is plotted at the bottom, and the last level at the top.

Modifying Factor Levels

Questions

  1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

  2. How could you collapse rincome into a small set of categories?

  3. Notice there are 9 groups (excluding other) in the fct_lump example above.
    Why not 10?
    (Hint: type ?fct_lump, and find the default for the argument other_level is "Other".)

Answers

Solution 1:

As reflected in the @fig-q1-ex5, the proportions of people identifying as Democrat has slightly increased, Republican has slightly decreased, and Independent has increased, over the period of 15 years reflected in the data-set.

#| label: fig-q1-ex5
#| fig-cap: "Stacked bar chart of the partyid in the data-set"


gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "Republican"  = c("Strong republican",  "Not str republican"),
      "Democrat"    = c("Strong democrat", "Not str democrat"),
      "Independent" = c("Independent", "Ind,near dem", "Ind,near rep"),
      "Others"      = c("No answer", "Don't know", "Other party")
    )
  ) |>
  group_by(year, partyid) |>
  count() |>
  ggplot(aes(x = year, y = n, fill = partyid)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = c("lightgrey", "red", "grey", "blue")) +
  theme_classic() +
  theme(legend.position = "bottom") +
  labs(x = "Year", y = "Proportion of respondents", fill = "Party",
       subtitle = "Proportion of republicans has decreased, while that of independents has increased over the years",
       title = "In 15 years, share of parties' supporters has changed")

Solution 2:

We could collapse the rincome into a small set of categories using the following functions: --

  • fct_lump_n()

  • fct_lump_lowfreq()

  • fct_lump_min()

  • fct_lump_prop()

  • fct_lump()

  • fct_collapse()

gss_cat|>
  mutate(
    rincome = fct_lump_n(rincome, n = 6)
  ) |>
  group_by(rincome) |>
  count() |>
  arrange(desc(n)) |>
  ungroup() |>
  gt() |>
  cols_label(rincome = "Annual Income",
             n = "Numbers")

Solution 3:

There are 9 groups (excluding other) in the fct_lump example above, as shown below also in @fig-q3-ex5. This is because n = 10 argument limits the total groups to 10, and the function needs one group for "Other", i.e. all other groups whose count is lesser than top 9 groups. Thus, the groups shown are 9, with 1 as "Other" (at the end).

#| label: fig-q3-ex5
#| fig-cap: "Table of number of respondents from each of top 10 religions, including Other"

gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 10)) |>
  count(relig) |>
  gt()

Reference

Wickham, H., Mine Çetinkaya-Rundel and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.