Learning R Featured

R for Data Science Exercises: Workflow Scripts and Projects and Data Import

The final installment of exercises from the "Whole game" section of the R4DS 2nd edition. These exercises focus on becoming more familiar and efficient in R studio.

Lorcán Mason

25 Jun 2024 • 4 min read

R for Data Science 2nd Edition Exercises (Wickham, Mine Çetinkaya-Rundel and Grolemund, 2023)

Workflow Scripts and Projects

Run the code in your script for the answers! I'm just exploring as I go.

Workflow Scripts and Projects Tips

"Option + Shift + K" or "Alt + Shift + K" shows the keyboard shortcuts available. You can also go to the menu bar: Tools > Keyboard shortcuts help.

Tools > Global Options > Appearance allows you to change the theme of your RStudio.

RStudio will warn with a caution icon (orange triangle with an exclamation point in it) if variable used in your code is not defined in your scope (i.e., your environment).

Opening the script editor adds a new pane at the top-left of the IDE

Use the script editor instead of the console -> complex pipelines and graphics.

File -> New file -> R script, or "Cmd/Ctrl + Shift + N" for the best place for experimenting with your R code.

Edit the script and re-run the code. Save it as a R code file: yourfilename.R.

You can move panes around too.

Move focus to script "CTRL/Cmd + 1".
Move focus to console "CTRL/Cmd + 2".
Move focus to Terminal "ALT + Shift + M".

To run the current R expression in the console "Ctrl/Cmd + Enter".

The cursor will then move to the next code block.

To run the entire script "Ctrl/Cmd + Shift + S".

getwd() identifies the current working directory.

You can always find the working directory by looking at top of console too:

Data Import

Packages to load

library(tidyverse)

What function would you use to read a file where fields were separated with "|"?

For reading a file delimited with |, use read_delim() with argument delim = "|".

Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

All other arguments are common among the two functions: col_names, col_types, col_select, id, locale, na, quoted_na, quote, comment, trim_ws, n_max, guess_max, progress, name_repair, num_threads, show_col_types, skip_empty_rows, lazy.

What are the most important arguments to read_fwf()?

col_positions is an important argument since it defines the beginning and end of columns. file, fwf_cols, fwf_positions, fwf_widths, and fwf_empty are also important arguments.

Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?

A. Example Code:
```
a.  `"x,y\n1,'a,b'"`

*We need to specify the `quote` argument. `quote = "''"` to specify that we are using `''` as quotes in a string.*

```{r}
read_csv("x,y\n1,'a,b'", quote = "\'")
```
```
Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

a. read_csv("a,b\n1,2,3\n4,5,6")

b. read_csv("a,b,c\n1,2\n1,2,3,4")

c. read_csv("a,b\n"1")

d. read_csv("a,b\n1,2\na,b")

e. read_csv("a;b\n1;3")

a. There are only two column headers but three values in each row, so the last two get merged:

read_csv("a,b\n1,2,3\n4,5,6")

b. There are only three column headers, first row is missing a value in the last column so gets an NA there, the second row has four values so the last two get merged:

read_csv("a,b,c\n1,2\n1,2,3,4")

c. No rows are read in:

read_csv("a,b\n\"1")

d. Each column has a numerical and a character value, so the column type is coerced to character:

read_csv("a,b\n1,2\na,b")

e. The delimiter is ; but it's not specified, therefore this is read in as a single-column data frame with a single observation:

read_csv("a;b\n1;3")

Practice referring to non-syntactic names in the following data frame by:

a. Extracting the variable called 1.

b. Plotting a scatterplot of 1 vs. 2.

c. Creating a new column called 3, which is 2 divided by 1.

d. Renaming the columns to one, two, and three.

Example Data frame

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

a. Extracting the variable called 1:

```{r}
annoying |>
  select(`1`)
```

b. Plotting a scatterplot of 1 vs. 2:

```{r}
ggplot(annoying, aes(x = `2`, y = `1`)) +
  geom_point()
```

c. Creating a new column called 3, which is 2 divided by 1:

```{r}
annoying |>
  mutate(`3` = `2` / `1`)
```

d. Renaming the columns to one, two, and three:

```{r}
annoying |>
  mutate(`3` = `2` / `1`) |>
  rename(
    "one" = `1`,
    "two" = `2`,
    "three" = `3`
  )
```

Reference

Wickham, H., Mine Çetinkaya-Rundel and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.

R4DS Workflow Scripts and Projects and Data Import

Quarto Document

R4DS Workflow Scripts and Projects and Data Import.qmd

5 KB