R for Data Science Exercises: Web Scraping
Web scraping involves programmatically extracting data from websites.
R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel, and Grolemund, 2023)
Web Scraping
Run the code in your script for the answers! I'm just exploring as I go.
Packages to load
library(tidyverse)
library(rvest)
library(gt)
library(gtExtras)
library(scales)
library(janitor)
library(prismatic)
library(ggrepel)
Introduction
Web scraping involves programmatically extracting data from websites. This covers:
- Legal and Ethical Considerations: Understanding the legality and ethical implications of scraping data.
- Tools and Packages: Introduction to httr, rvest, selectr, and xml2 for web scraping.
- Techniques: Methods for scraping both static and dynamic websites.
Legal and Ethical Considerations
- Ethical Considerations:
  - robots.txt: Check the website's robots.txt file, which indicates the areas of the site that can or cannot be accessed by web crawlers (see the sketch after this list).
  - Terms of Service: Read and respect the terms of service of the website.
  - Server Load: Be considerate of the impact your scraping activities may have on the server's performance.
- Legal Considerations:
  - Copyright Laws: Be aware of the intellectual property rights related to the content you are scraping.
  - Terms of Service Violations: Scraping data in violation of a website's terms of service can lead to legal issues.
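A minimal sketch of the robots.txt check, assuming the robotstxt package (not loaded at the top of this post) is installed; the fbref URL is the one scraped in the examples further down.
library(robotstxt)
# Ask whether the default crawler ("*") is allowed to fetch this page
paths_allowed("https://fbref.com/en/comps/9/Premier-League-Stats")
# Or inspect the raw robots.txt file directly with base R
readLines("https://fbref.com/robots.txt")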
Tools for Web Scraping
- httr: A package for performing HTTP requests.
- rvest: A package designed to simplify web scraping.
- selectr: For parsing CSS selectors.
- xml2: For parsing XML and HTML documents.
HTTP Requests with httr
HTTP requests are used to interact with web servers to retrieve or send data.
Basic Usage of httr
- GET Request:
library(httr)
response <- GET("https://example.com")
content(response, "text")
Use GET() to fetch data from a URL; content(response, "text") retrieves the response content as text.
- POST Request:
response <- POST("https://example.com/post", body = list(key1 = "value1"))
content(response, "text")
Use POST() to send data to a server.
- Handling Errors:
response <- GET("https://example.com")
stop_for_status(response)
stop_for_status() checks for HTTP errors and stops execution if one is encountered.
- Query Parameters:
response <- GET("https://example.com", query = list(param1 = "value1", param2 = "value2"))
content(response, "text")
Add query parameters using the query argument. A consolidated sketch follows this list.
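These options combine naturally; below is a hedged sketch (hypothetical URL) that identifies the scraper with a custom user agent, passes query parameters, and checks for errors before reading the body.
library(httr)
response <- GET(
  "https://example.com/search",                        # hypothetical endpoint
  user_agent("my-scraper (contact: me@example.com)"),  # identify your scraper politely
  query = list(q = "rvest", page = 1)
)
stop_for_status(response)   # stop on 4xx/5xx responses
content(response, "text")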
Scraping HTML with rvest
rvest simplifies the process of scraping web data using CSS or XPath selectors.
Basic Usage of rvest
- Reading HTML:
library(rvest)
page <- read_html("https://example.com")
read_html() loads the HTML content of a webpage.
- CSS Selectors:
title <- page %>% html_node("title") %>% html_text()
links <- page %>% html_nodes("a") %>% html_attr("href")
html_node() selects a single HTML element, html_nodes() selects multiple elements, html_text() extracts the text from an element, and html_attr() retrieves the value of an attribute.
- XPath Selectors:
title <- page %>% html_node(xpath = "//title") %>% html_text()
Use the xpath argument to select elements with XPath syntax.
- Extracting Tables:
tables <- page %>% html_table()
html_table() extracts tables from HTML as data frames. A small offline example follows this list.
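A quick way to try these selector functions without hitting a live site is rvest::minimal_html(), which parses an HTML fragment held in memory; the fragment below is made up purely for illustration.
library(rvest)
# Build a tiny in-memory page to practise selectors on
page <- minimal_html('
  <h1>Fixtures</h1>
  <a href="/match/1">Match one</a>
  <a href="/match/2">Match two</a>
')
page %>% html_nodes("a") %>% html_text()         # "Match one" "Match two"
page %>% html_nodes("a") %>% html_attr("href")   # "/match/1" "/match/2"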
Advanced Scraping Techniques
For more complex scraping tasks, such as dealing with JavaScript-rendered content or automating interactions, additional tools are necessary.
Dealing with JavaScript
- RSelenium: A package that provides an R interface to Selenium WebDriver, enabling control of a web browser for scraping dynamic content.
- chromote: A package that drives headless Chrome through the Chrome DevTools Protocol (a short sketch follows this list).
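A minimal sketch, not from the original notes, of rendering a JavaScript-heavy page with chromote and handing the rendered HTML to rvest; recent versions of rvest also provide read_html_live(), which wraps chromote for the same purpose.
library(chromote)
library(rvest)
b <- ChromoteSession$new()
b$Page$navigate("https://example.com")
Sys.sleep(2)   # crude wait for scripts to finish; a robust script would wait on Page$loadEventFired()
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
page <- read_html(html)   # from here, use html_nodes()/html_table() as usual
b$close()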
Automating Browser Actions with RSelenium
- Starting a Selenium Server and Browser:
library(RSelenium)
rD <- rsDriver(browser = "chrome")
remDr <- rD$client
rsDriver() starts a Selenium server and browser instance.
- Navigating to a Webpage:
remDr$navigate("https://example.com")
navigate() loads a specified URL in the browser.
- Extracting Content:
webElem <- remDr$findElement(using = "css selector", "p")
webElem$getElementText()
findElement() locates an element using a CSS selector, and getElementText() retrieves the text of the element.
- Closing the Browser:
remDr$close()
A short end-to-end sketch follows this list.
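A hedged end-to-end sketch (hypothetical URL) chaining the steps above: getPageSource() returns the browser's rendered HTML, which can then be parsed with rvest.
library(RSelenium)
library(rvest)
rD <- rsDriver(browser = "chrome")
remDr <- rD$client
remDr$navigate("https://example.com")
page <- read_html(remDr$getPageSource()[[1]])   # rendered HTML as an rvest document
page %>% html_nodes("p") %>% html_text()
remDr$close()
rD$server$stop()   # also shut down the Selenium server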
Working with APIs
APIs offer a more structured and reliable way to get data compared to scraping web pages directly.
Interacting with APIs using httr
- GET Request to an API:
response <- GET("https://api.example.com/data")
data <- content(response, "parsed")
content(response, "parsed") parses the response content, for example JSON into an R list.
- POST Request to an API:
response <- POST("https://api.example.com/submit", body = list(name = "John"))
data <- content(response, "parsed")
Send data to an API endpoint using POST(). A sketch of turning a parsed response into a tibble follows this list.
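A hedged sketch, assuming a hypothetical endpoint that returns a JSON array of records with consistent fields; each parsed record becomes one row of a tibble.
library(httr)
library(dplyr)
response <- GET("https://api.example.com/data", query = list(page = 1))
stop_for_status(response)
records <- content(response, "parsed")   # JSON parsed into a list of named lists
data <- bind_rows(records)               # stack the records into a tibble, one row per record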
Parsing and Cleaning Data
After extracting data, it's often necessary to clean and preprocess it for analysis.
Handling HTML with xml2
- Parsing HTML:
library(xml2)
doc <- read_html("https://example.com")
nodes <- xml_find_all(doc, "//p")
texts <- xml_text(nodes)
read_html() loads HTML content, xml_find_all() selects elements using XPath, and xml_text() retrieves the text content of elements.
Cleaning Data
Use data manipulation packages like dplyr and tidyr to clean and preprocess the scraped data.
library(dplyr)
library(tidyr)
# Example data cleaning steps
cleaned_data <- raw_data %>%
  filter(!is.na(value)) %>%                     # drop rows with missing values
  mutate(new_column = as.numeric(old_column))   # convert a text column to numeric
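Since janitor is loaded at the top of this post, scraped tables can also have their headers standardised; a small sketch assuming a scraped data frame raw_data with untidy column names.
cleaned_data <- raw_data %>%
  janitor::clean_names() %>%   # e.g. "Squad Name" becomes "squad_name", "xG" becomes "x_g"
  distinct()                   # drop exact duplicate rows that scraped tables sometimes contain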
Premier League Example
Scraping Data
# url storage
url <- "https://fbref.com/en/comps/9/Premier-League-Stats"
# read_html to scrape the items on url page
full_table <- read_html(url)
# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
prem <- full_table %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill=T)
Visualising Data
pl <- prem %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
geom_smooth(method = "lm", color = "green", fill = "green") +
geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))),
shape = 21,
alpha = .75,
size = 3) +
geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
theme(
legend.position = "none",
plot.background = element_rect(fill = "purple", colour = "purple"),
panel.background = element_rect(fill = "purple", colour = "purple"),
panel.grid.major = element_line(colour = "purple"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "white"),
axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
labs(title = "xG For vs xG Against of PL Teams",
subtitle = "2023-2024 Season") +
scale_y_reverse()
pl
# Saving Plot
# ggsave("premier_league.png", pl, height = 6, width = 6, dpi = 300)
Bundesliga Example
Scraping Data
# url storage
url <- "https://fbref.com/en/comps/20/Bundesliga-Stats"
# read_html to scrape the items on url page
full_table <- read_html(url)
# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
bund <- full_table %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill=T)
Visualising Data
bl <- bund %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
geom_smooth(method = "lm", color = "green", fill = "green") +
geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))),
shape = 21,
alpha = .75,
size = 3) +
geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
theme(
legend.position = "none",
plot.background = element_rect(fill = "purple", colour = "purple"),
panel.background = element_rect(fill = "purple", colour = "purple"),
panel.grid.major = element_line(colour = "purple"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "white"),
axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
labs(title = "xG For vs xG Against of BL Teams",
subtitle = "2023-2024 Season") +
scale_y_reverse()
bl
# Saving Plot
# ggsave("bundesliga.png", pl, height = 6, width = 6, dpi = 300)
Big 5 Leagues Example
Scraping Data
# url storage
url <- "https://fbref.com/en/comps/Big5/Big-5-European-Leagues-Stats"
# read_html to scrape the items on url page
full_table <- read_html(url)
# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
big5 <- full_table %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill=T)
Visualising Data
b5 <- big5 %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
geom_smooth(method = "lm", color = "green", fill = "green") +
geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))),
shape = 21,
alpha = .75,
size = 3) +
geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
theme(
legend.position = "none",
plot.background = element_rect(fill = "purple", colour = "purple"),
panel.background = element_rect(fill = "purple", colour = "purple"),
panel.grid.major = element_line(colour = "purple"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "white"),
axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
labs(title = "xG For vs xG Against of Big 5 League Teams",
subtitle = "2023-2024 Season") +
scale_y_reverse()
b5
# Saving Plot
# ggsave("big5.png", pl, height = 6, width = 6, dpi = 300)
References
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media, Inc.