R for Data Science Exercises: Factors
R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel and Grolemund, 2023)
Run the code in your script for the answers! I'm just exploring as I go.
Packages to load
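The setup chunk itself is not shown here, so the following is only an assumed sketch, inferred from the functions the solutions call (theme_clean() comes from ggthemes; gt(), sub_missing() and cols_label() come from gt). The original post may have loaded a different set of packages.

```r
# Assumed setup: inferred from the functions used in the solutions,
# not taken from the original post.
library(tidyverse)  # dplyr, ggplot2, tidyr, forcats (forcats ships gss_cat)
library(gt)         # gt(), sub_missing(), cols_label()
library(ggthemes)   # theme_clean()
```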
Factors are a type of variable in R that are used to store categorical data. Factors can be ordered or unordered. Factors are stored as integers, and have labels associated with these unique integers. These labels are used when displaying the factor. Factors are very useful in statistical modelling and graphics. They are also useful when you want to display character vectors in a non-alphabetical order.
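The storage model described above can be seen in a small base R sketch. The month vector is a made-up example, not part of the exercises; it shows the integer codes, the labels, and how a factor sorts by level order rather than alphabetically.

```r
# Hypothetical example vector; the month names are illustrative only.
months_chr <- c("Mar", "Jan", "Feb", "Jan")
months_fct <- factor(months_chr, levels = c("Jan", "Feb", "Mar"))

as.integer(months_fct)          # underlying integer codes: 3 1 2 1
levels(months_fct)              # labels attached to codes 1, 2, 3
sort(months_chr)                # alphabetical: "Feb" "Jan" "Jan" "Mar"
as.character(sort(months_fct))  # level order:  "Jan" "Jan" "Feb" "Mar"
```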
General Social Survey
Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
What is the most common relig in this survey? What's the most common partyid?
Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?
Solution 1:
The default bar chart is hard to understand because:
The bars are vertical, so the category names overlap on the x-axis.
The "Not applicable" category sits next to the lowest income group, apart from the other non-income categories, which disturbs the ordered pattern of the income levels.
We could improve the plot, as shown in @fig-q1-ex3, by:
Making it a horizontal bar chart, which leaves room to read the income categories easily.
Moving the "Not applicable" level after the highest income level, alongside "Refused", "Don't know" and "No answer".
Further, we could remove non-data ink, following Tufte's principles, so that the pattern stands out. We could also use a separate fill colour for the levels that are not income bands.
#| label: fig-q1-ex3
#| fig-cap: "Improved bar chart"
# Levels that are not income bands: No answer, Don't know, Refused, Not applicable
no_levels <- levels(gss_cat$rincome)[c(1:3, 16)]
gss_cat |>
  # Flag the non-income levels so they can be filled in a different colour
  mutate(col_level = rincome %in% no_levels) |>
  # Put "Not applicable" right after the first three non-income levels
  ggplot(aes(y = fct_relevel(rincome, "Not applicable", after = 3),
             fill = col_level)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Number of respondents", y = NULL,
       title = "Income Levels of respondents in General Social Survey") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none") +
  scale_fill_manual(values = c("#3d3b3b", "#999494"))
Solution 2:
The most common relig is "Protestant", and the most common partyid is "Independent".
gss_cat |>
  count(relig, sort = TRUE)
gss_cat |>
  count(partyid, sort = TRUE)
Solution 3:
The code below shows that only three religions have more than one distinct value of denom (denomination): "Protestant", "Christian" and "Other". Cross-tabulating religion and denomination for these three, as shown in @tbl-q3-ex3, makes it clear that denomination really applies only to "Protestant".
We could also show this with a visualization, as in @fig-q3a-ex3.
gss_cat |>
  group_by(relig) |>
  summarise(n = n_distinct(denom)) |>
  arrange(desc(n)) |>
  filter(n > 1)
#| label: tbl-q3-ex3
#| tbl-cap: "Cross-table of the denominations within the three religions"
gss_cat |>
  filter(relig %in% c("Protestant", "Christian", "Other")) |>
  group_by(relig, denom) |>
  tally() |>
  # One column per religion, one row per denomination
  spread(relig, n) |>
  arrange(desc(Christian)) |>
  gt() |>
  # Show empty cells instead of NA
  sub_missing(missing_text = "")
#| label: fig-q3a-ex3
#| fig-cap: "Visualization of the number of denominations within religions"
gss_cat |>
  group_by(relig) |>
  summarise(n = n_distinct(denom)) |>
  arrange(desc(n)) |>
  ggplot(aes(y = reorder(relig, n), x = n)) +
  # geom_col() is the shorthand for geom_bar(stat = "identity")
  geom_col() +
  theme_minimal() +
  labs(x = "Number of denominations", y = NULL,
       title = "Only the Protestant religion has denominations within it")
Modifying Factor Order
There are some suspiciously high numbers in tvhours. Is the mean a good summary?
For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?
Solution 1:
No, the mean is not a good summary because the distribution of tvhours is right-skewed. The median is a better summary measure here.
gss_cat |>
  # Only drop rows where tvhours itself is missing
  drop_na(tvhours) |>
  mutate(tvhours = as_factor(tvhours)) |>
  ggplot(aes(x = tvhours)) +
  geom_bar(col = "black", fill = "white") +
  theme_clean() +
  labs(x = "Hours per day spent watching TV",
       y = "Numbers", title = "Distribution of TV Hours is right skewed")
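As a rough numeric check of the skew, the mean and median of tvhours can be compared directly. This is a minimal sketch that assumes gss_cat as shipped with forcats; tv_summary is a name introduced here for illustration.

```r
library(dplyr)
library(forcats)  # provides the gss_cat data set

# tv_summary is a hypothetical helper name, not from the original post
tv_summary <- gss_cat |>
  summarise(
    mean_tv   = mean(tvhours, na.rm = TRUE),
    median_tv = median(tvhours, na.rm = TRUE),
    max_tv    = max(tvhours, na.rm = TRUE)
  )
tv_summary
```

In a right-skewed distribution the few very large values pull the mean above the median, which is why the median is the safer summary here.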
Solution 2:
The factor variables in the gss_cat dataset are:
| Factor variable | Levels | Is the order arbitrary or principled? |
|---|---|---|
| marital | No answer, Never married, Separated, Divorced, Widowed, and Married | Arbitrary, since the levels are not in a specific order. |
| race | Other, Black, White, and Not applicable | Arbitrary, since the levels are not in a specific order. |
| rincome | No answer, Don't know, Refused, $25000 or more, $20000 - 24999, $15000 - 19999, $10000 - 14999, $8000 to 9999, $7000 to 7999, $6000 to 6999, $5000 to 5999, $4000 to 4999, $3000 to 3999, $1000 to 2999, Lt $1000, and Not applicable | Principled, since the income levels run in a consistent decreasing order; only the few non-income levels are placed arbitrarily. |
| partyid | No answer, Don't know, Other party, Strong republican, Not str republican, Ind,near rep, Independent, Ind,near dem, Not str democrat, and Strong democrat | Partly principled, as the two extremes sit at either end with the intermediate levels between them. |
| relig | No answer, Don't know, Inter-nondenominational, Native american, Christian, Orthodox-christian, Moslem/islam, Other eastern, Hinduism, Buddhism, Other, None, Jewish, Catholic, Protestant, and Not applicable | Arbitrary, as the religions are not in a specific order. |
| denom | No answer, Don't know, No denomination, Other, Episcopal, Presbyterian-dk wh, Presbyterian, merged, Other presbyterian, United pres ch in us, Presbyterian c in us, Lutheran-dk which, Evangelical luth, Other lutheran, Wi evan luth synod, Lutheran-mo synod, Luth ch in america, Am lutheran, Methodist-dk which, Other methodist, United methodist, Afr meth ep zion, Afr meth episcopal, Baptist-dk which, Other baptists, Southern baptist, Nat bapt conv usa, Nat bapt conv of am, Am bapt ch in usa, Am baptist asso, and Not applicable | Arbitrary, as the denominations are not in a specific order. |
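The levels listed in the table can be pulled out programmatically. The sketch below is one possible way to do it, assuming dplyr and purrr are available; factor_levels is a name introduced here for illustration.

```r
library(dplyr)
library(purrr)
library(forcats)  # provides gss_cat

# Keep only the factor columns, then extract each column's levels
factor_levels <- gss_cat |>
  select(where(is.factor)) |>
  map(levels)

names(factor_levels)    # the six factor columns
factor_levels$marital   # levels in their stored order
```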
Solution 3:
Moving "Not applicable" to the front of the levels moves it to the bottom of the plot because ggplot2 draws the levels of a discrete axis in order, starting at the origin. On the y-axis that means from the bottom upwards, so the first level is plotted at the bottom and the last level at the top.
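The effect on the level order can be checked without plotting. This is a minimal sketch on a made-up toy factor (not the survey data): fct_relevel() with no after argument moves the named level to position 1, which is the position a plot axis draws nearest the origin.

```r
library(forcats)

# Hypothetical toy factor, not the survey data
income <- factor(
  c("Low", "High", "Not applicable", "Low"),
  levels = c("Low", "High", "Not applicable")
)
releveled <- fct_relevel(income, "Not applicable")

levels(income)         # "Low" "High" "Not applicable"
levels(releveled)      # "Not applicable" "Low" "High"
as.integer(releveled)  # "Not applicable" now has code 1
```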
Modifying Factor Levels
How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
How could you collapse rincome into a small set of categories?
Notice there are 9 groups (excluding other) in the fct_lump example above. Why not 10? (Hint: type ?fct_lump, and find the default for the argument other_level is "Other".)
Solution 1:
As shown in @fig-q1-ex5, over the 15 years covered by the dataset the proportion of people identifying as Democrat has increased slightly, the proportion identifying as Republican has decreased slightly, and the proportion identifying as Independent has increased.
#| label: fig-q1-ex5
#| fig-cap: "Stacked bar chart of the partyid in the data-set"
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "Republican" = c("Strong republican", "Not str republican"),
      "Democrat" = c("Strong democrat", "Not str democrat"),
      "Independent" = c("Independent", "Ind,near dem", "Ind,near rep"),
      "Others" = c("No answer", "Don't know", "Other party")
    )
  ) |>
  group_by(year, partyid) |>
  count() |>
  ggplot(aes(x = year, y = n, fill = partyid)) +
  geom_col(position = "fill") +
  # Colours follow the collapsed level order: Others, Republican, Independent, Democrat
  scale_fill_manual(values = c("lightgrey", "red", "grey", "blue")) +
  theme_classic() +
  theme(legend.position = "bottom") +
  labs(x = "Year", y = "Proportion of respondents", fill = "Party",
       subtitle = "Proportion of republicans has decreased, while that of independents has increased over the years",
       title = "In 15 years, share of parties' supporters has changed")
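The same shares can also be computed as numbers rather than read off the chart. The following is a sketch under the same collapsing scheme as the plot; party_props and prop are names introduced here for illustration.

```r
library(dplyr)
library(forcats)

# party_props is a hypothetical helper name, not from the original post
party_props <- gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "Republican"  = c("Strong republican", "Not str republican"),
      "Democrat"    = c("Strong democrat", "Not str democrat"),
      "Independent" = c("Independent", "Ind,near dem", "Ind,near rep"),
      "Others"      = c("No answer", "Don't know", "Other party")
    )
  ) |>
  count(year, partyid) |>
  group_by(year) |>
  mutate(prop = n / sum(n)) |>  # share of each party within a survey year
  ungroup()

# Compare the first and last survey years side by side
party_props |>
  filter(year %in% range(year)) |>
  select(year, partyid, prop)
```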
Solution 2:
We could collapse rincome into a small set of categories with fct_lump_n(), which keeps the most frequent levels and lumps the rest into "Other":
gss_cat |>
  mutate(
    rincome = fct_lump_n(rincome, n = 6)
  ) |>
  group_by(rincome) |>
  count() |>
  arrange(desc(n)) |>
  ungroup() |>
  gt() |>
  cols_label(rincome = "Annual Income",
             n = "Numbers")
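Lumping by frequency ignores the income ordering, so an alternative worth considering is to collapse the levels into hand-picked bands with fct_collapse(). The bands and their names below ("Unknown", "Lt $5000", "$5000 to 9999") are assumptions chosen for illustration, not the author's method.

```r
library(dplyr)
library(forcats)

# income_bands and the band names are hypothetical choices for this sketch
income_bands <- gss_cat |>
  mutate(
    rincome = fct_collapse(rincome,
      "Unknown"       = c("No answer", "Don't know", "Refused", "Not applicable"),
      "Lt $5000"      = c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "$4000 to 4999"),
      "$5000 to 9999" = c("$5000 to 5999", "$6000 to 6999", "$7000 to 7999", "$8000 to 9999")
    )
  ) |>
  count(rincome)

income_bands
```

The four levels from $10000 upwards are left untouched, so the collapsed factor keeps the original ordering among the remaining bands.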
Solution 3:
There are 9 groups (excluding "Other") in the fct_lump example above, as also shown in @fig-q3-ex5. fct_lump_n() with n = 10 keeps the 10 most frequent levels and lumps the remaining ones into a level named by other_level, which defaults to "Other". In gss_cat, relig already has a level called "Other", and it is among the 10 most frequent. The lumped levels are merged into that existing level, so the result shows 9 named religions plus a single "Other" (at the end), rather than 10 plus "Other".
#| label: fig-q3-ex5
#| fig-cap: "Table of number of respondents from each of top 10 religions, including Other"
gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 10)) |>
  count(relig)
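The role of other_level is easier to see on a small made-up factor in which "Other" is already a real level and is among the most frequent ones. This is a toy sketch, not the survey data: the lumped levels are folded into the existing "Other" instead of creating an extra level.

```r
library(forcats)

# Hypothetical toy data in which "Other" is already a real level
f <- factor(
  c(rep("a", 4), rep("b", 3), rep("Other", 2), "c", "d"),
  levels = c("a", "b", "c", "d", "Other")
)

lumped <- fct_lump_n(f, n = 3)  # keeps the 3 most frequent: a, b, Other

levels(lumped)  # three levels in total, not 3 + 1
table(lumped)   # "Other" now also holds the lumped c and d
```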
Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G. (2023) R for Data Science. 2nd ed. Sebastopol, CA: O'Reilly Media.