R for Data Science Exercises: Web Scraping
Web scraping involves programmatically extracting data from websites.
R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel, and Grolemund, 2023)
Web Scraping
Run the code in your script for the answers! I'm just exploring as I go.
Packages to load
library(tidyverse)
library(rvest)
library(gt)
library(gtExtras)
library(scales)
library(janitor)
library(prismatic)
library(ggrepel)
Introduction
Web scraping involves programmatically extracting data from websites. This covers:
- Legal and Ethical Considerations: Understanding the legality and ethical implications of scraping data.
- Tools and Packages: Introduction to httr, rvest, selectr, and xml2 for web scraping.
- Techniques: Methods for scraping both static and dynamic websites.
Legal and Ethical Considerations
- Ethical Considerations:
  - robots.txt: Check the website's robots.txt file, which indicates the areas of the site that can or cannot be accessed by web crawlers (see the sketch after this list).
  - Terms of Service: Read and respect the terms of service of the website.
  - Server Load: Be considerate of the impact your scraping activities may have on the server's performance.
- Legal Considerations:
  - Copyright Laws: Be aware of the intellectual property rights related to the content you are scraping.
  - Terms of Service Violations: Scraping data in violation of a website's terms of service can lead to legal issues.
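A minimal sketch of the robots.txt check, assuming the robotstxt package (not loaded at the top of this post) is installed; the fbref URL is the one scraped in the examples further down.
library(robotstxt)
# Ask whether the default crawler ("*") is allowed to fetch this page
paths_allowed("https://fbref.com/en/comps/9/Premier-League-Stats")
# Or inspect the raw robots.txt file directly with base R
readLines("https://fbref.com/robots.txt")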
Tools for Web Scraping
- httr: A package for performing HTTP requests.
- rvest: A package designed to simplify web scraping.
- selectr: For parsing CSS selectors.
- xml2: For parsing XML and HTML documents.
HTTP Requests with httr
HTTP requests are used to interact with web servers to retrieve or send data.
Basic Usage of httr
- GET Request:
library(httr)
response <- GET("https://example.com")
content(response, "text")
Use GET() to fetch data from a URL; content(response, "text") retrieves the response content as text.
- POST Request:
response <- POST("https://example.com/post", body = list(key1 = "value1"))
content(response, "text")
Use POST() to send data to a server.
- Handling Errors:
response <- GET("https://example.com")
stop_for_status(response)
stop_for_status() checks for HTTP errors and stops execution if one is encountered.
- Query Parameters:
response <- GET("https://example.com", query = list(param1 = "value1", param2 = "value2"))
content(response, "text")
Add query parameters using the query argument. A consolidated sketch follows this list.
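These options combine naturally; below is a hedged sketch (hypothetical URL) that identifies the scraper with a custom user agent, passes query parameters, and checks for errors before reading the body.
library(httr)
response <- GET(
  "https://example.com/search",                        # hypothetical endpoint
  user_agent("my-scraper (contact: me@example.com)"),  # identify your scraper politely
  query = list(q = "rvest", page = 1)
)
stop_for_status(response)   # stop on 4xx/5xx responses
content(response, "text")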
Scraping HTML with rvest
rvest simplifies the process of scraping web data using CSS or XPath selectors.
Basic Usage of rvest
- Reading HTML:
library(rvest)
page <- read_html("https://example.com")
read_html() loads the HTML content of a webpage.
- CSS Selectors:
title <- page %>% html_node("title") %>% html_text()
links <- page %>% html_nodes("a") %>% html_attr("href")
html_node() selects a single HTML element, html_nodes() selects multiple elements, html_text() extracts the text from an element, and html_attr() retrieves the value of an attribute.
- XPath Selectors:
title <- page %>% html_node(xpath = "//title") %>% html_text()
Use the xpath argument to select elements with XPath syntax.
- Extracting Tables:
tables <- page %>% html_table()
html_table() extracts tables from HTML as data frames. A small offline example follows this list.
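A quick way to try these selector functions without hitting a live site is rvest::minimal_html(), which parses an HTML fragment held in memory; the fragment below is made up purely for illustration.
library(rvest)
# Build a tiny in-memory page to practise selectors on
page <- minimal_html('
  <h1>Fixtures</h1>
  <a href="/match/1">Match one</a>
  <a href="/match/2">Match two</a>
')
page %>% html_nodes("a") %>% html_text()         # "Match one" "Match two"
page %>% html_nodes("a") %>% html_attr("href")   # "/match/1" "/match/2"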
Advanced Scraping Techniques
For more complex scraping tasks, such as dealing with JavaScript-rendered content or automating interactions, additional tools are necessary.
Dealing with JavaScript
- RSelenium: A package that provides an R interface to Selenium WebDriver, enabling control of a web browser for scraping dynamic content.
- chromote: A package that drives headless Chrome through the Chrome DevTools Protocol (a short sketch follows this list).
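A minimal sketch, not from the original notes, of rendering a JavaScript-heavy page with chromote and handing the rendered HTML to rvest; recent versions of rvest also provide read_html_live(), which wraps chromote for the same purpose.
library(chromote)
library(rvest)
b <- ChromoteSession$new()
b$Page$navigate("https://example.com")
Sys.sleep(2)   # crude wait for scripts to finish; a robust script would wait on Page$loadEventFired()
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
page <- read_html(html)   # from here, use html_nodes()/html_table() as usual
b$close()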
Automating Browser Actions with RSelenium
- Starting a Selenium Server and Browser:
library(RSelenium)
rD <- rsDriver(browser = "chrome")
remDr <- rD$client
rsDriver() starts a Selenium server and browser instance.
- Navigating to a Webpage:
remDr$navigate("https://example.com")
navigate() loads a specified URL in the browser.
- Extracting Content:
webElem <- remDr$findElement(using = "css selector", "p")
webElem$getElementText()
findElement() locates an element using a CSS selector, and getElementText() retrieves the text of the element.
- Closing the Browser:
remDr$close()
A short end-to-end sketch follows this list.
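A hedged end-to-end sketch (hypothetical URL) chaining the steps above: getPageSource() returns the browser's rendered HTML, which can then be parsed with rvest.
library(RSelenium)
library(rvest)
rD <- rsDriver(browser = "chrome")
remDr <- rD$client
remDr$navigate("https://example.com")
page <- read_html(remDr$getPageSource()[[1]])   # rendered HTML as an rvest document
page %>% html_nodes("p") %>% html_text()
remDr$close()
rD$server$stop()   # also shut down the Selenium server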
Working with APIs
APIs offer a more structured and reliable way to get data compared to scraping web pages directly.
Interacting with APIs using httr
- GET Request to an API:
response <- GET("https://api.example.com/data")
data <- content(response, "parsed")
content(response, "parsed") parses the response content, for example JSON into an R list.
- POST Request to an API:
response <- POST("https://api.example.com/submit", body = list(name = "John"))
data <- content(response, "parsed")
Send data to an API endpoint using POST(). A sketch of turning a parsed response into a tibble follows this list.
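A hedged sketch, assuming a hypothetical endpoint that returns a JSON array of records with consistent fields; each parsed record becomes one row of a tibble.
library(httr)
library(dplyr)
response <- GET("https://api.example.com/data", query = list(page = 1))
stop_for_status(response)
records <- content(response, "parsed")   # JSON parsed into a list of named lists
data <- bind_rows(records)               # stack the records into a tibble, one row per record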
Parsing and Cleaning Data
After extracting data, it's often necessary to clean and preprocess it for analysis.
Handling HTML with xml2
- Parsing HTML:
library(xml2)
doc <- read_html("https://example.com")
nodes <- xml_find_all(doc, "//p")
texts <- xml_text(nodes)
read_html() loads HTML content, xml_find_all() selects elements using XPath, and xml_text() retrieves the text content of elements.
Cleaning Data
Use data manipulation packages like dplyr and tidyr to clean and preprocess the scraped data.
library(dplyr)
library(tidyr)
# Example data cleaning steps
cleaned_data <- raw_data %>%
  filter(!is.na(value)) %>%                     # drop rows with missing values
  mutate(new_column = as.numeric(old_column))   # convert a text column to numeric
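Since janitor is loaded at the top of this post, scraped tables can also have their headers standardised; a small sketch assuming a scraped data frame raw_data with untidy column names.
cleaned_data <- raw_data %>%
  janitor::clean_names() %>%   # e.g. "Squad Name" becomes "squad_name", "xG" becomes "x_g"
  distinct()                   # drop exact duplicate rows that scraped tables sometimes contain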
Premier League Example
Scraping Data
# url storage
url <- "https://fbref.com/en/comps/9/Premier-League-Stats"
# read_html to scrape the items on url page
full_table <- read_html(url)
# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
prem <- full_table %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill=T)
Visualising Data
pl <- prem %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
geom_smooth(method = "lm", color = "green", fill = "green") +
geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))),
shape = 21,
alpha = .75,
size = 3) +
geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
theme(
legend.position = "none",
plot.background = element_rect(fill = "purple", colour = "purple"),
panel.background = element_rect(fill = "purple", colour = "purple"),
panel.grid.major = element_line(colour = "purple"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "white"),
axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
labs(title = "xG For vs xG Against of PL Teams",
subtitle = "2023-2024 Season") +
scale_y_reverse()
pl
# Saving Plot
# ggsave("premier_league.png", pl, height = 6, width = 6, dpi = 300)
Bundesliga Example
Scraping Data
# url storage
url <- "https://fbref.com/en/comps/20/Bundesliga-Stats"
# read_html to scrape the items on url page
full_table <- read_html(url)
# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
bund <- full_table %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill=T)
Visualising Data
bl <- bund %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
geom_smooth(method = "lm", color = "green", fill = "green") +
geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))),
shape = 21,
alpha = .75,
size = 3) +
geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
theme(
legend.position = "none",
plot.background = element_rect(fill = "purple", colour = "purple"),
panel.background = element_rect(fill = "purple", colour = "purple"),
panel.grid.major = element_line(colour = "purple"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "white"),
axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
labs(title = "xG For vs xG Against of BL Teams",
subtitle = "2023-2024 Season") +
scale_y_reverse()
bl
# Saving Plot
# ggsave("bundesliga.png", pl, height = 6, width = 6, dpi = 300)
Big 5 Leagues Example
Scraping Data
# url storage
url <- "https://fbref.com/en/comps/Big5/Big-5-European-Leagues-Stats"
# read_html to scrape the items on url page
full_table <- read_html(url)
# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
big5 <- full_table %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill=T)
Visualising Data
b5 <- big5 %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
geom_smooth(method = "lm", color = "green", fill = "green") +
geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))),
shape = 21,
alpha = .75,
size = 3) +
geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
theme(
legend.position = "none",
plot.background = element_rect(fill = "purple", colour = "purple"),
panel.background = element_rect(fill = "purple", colour = "purple"),
panel.grid.major = element_line(colour = "purple"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "white"),
axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
labs(title = "xG For vs xG Against of Big 5 League Teams",
subtitle = "2023-2024 Season") +
scale_y_reverse()
b5
# Saving Plot
# ggsave("big5.png", pl, height = 6, width = 6, dpi = 300)
References
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media, Inc.