Basic Data Scraping and Visualisation Tutorial
A tutorial on data scraping and visualisation through my love for Shamrock Rovers F.C. and their journey so far in the 2024-2025 UEFA Conference League.
Introduction
This tutorial provides an overview of the data scraping and visualisation process using R and is for illustrative purposes only. There are people much smarter than me who can do much better work; I just like to learn and share new skills!
Shamrock Rovers F.C. are an Irish football team based in Dublin. They are the most successful team in the Republic of Ireland, having won the League of Ireland title a record 21 times. They have also won the FAI Cup a record 25 times. Shamrock Rovers are the first Irish team to have qualified for the group stages of a major European competition, having reached the group stages of the 2011–12 UEFA Europa League. Shamrock Rovers have a rich football history, and this tutorial uses R to focus on their recent participation in the UEFA Conference League.
Scraping Data
Load libraries for Web Scraping
The first step in this tutorial is to scrape the data from FBRef. The data we are interested in is the set of player statistics for the UEFA Conference League, which includes the Shamrock Rovers F.C. players. The packages that we will use for this are rvest, tidyverse, stringr, readr, and openxlsx. These packages will allow us to scrape the data from FBRef, clean and process the data, and save it to an Excel file. rvest is a web scraping package that allows us to extract data from web pages. tidyverse is a collection of packages for data manipulation and visualisation. stringr is a package for string manipulation. readr is a package for reading and writing data. openxlsx is a package for reading and writing Excel files. stringr and readr are already attached by tidyverse, so loading them separately is optional. The code below also calls janitor::clean_names, so the janitor package needs to be installed, and read_html_live relies on the chromote package and a local Chrome installation.
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
library(openxlsx)
Define the Live HTML URLs for each Stats Web Page from FBRef
Next we need to define the URLs for each of the web pages that we want to scrape. The data we are interested in includes general stats, goalkeeper stats, advanced goalkeeper stats, shooting stats, passing stats, passing types stats, creation stats, defensive stats, possession stats, playing time stats, and miscellaneous stats. We will scrape the data from each of these web pages and create a data frame for each set of statistics that we can save to an Excel file at a later stage.
general_stats_url <- "https://fbref.com/en/comps/882/stats/Conference-League-Stats"
gk_url <- "https://fbref.com/en/comps/882/keepers/Conference-League-Stats"
adv_gk_url <- "https://fbref.com/en/comps/882/keepersadv/Conference-League-Stats"
shooting_url <- "https://fbref.com/en/comps/882/shooting/Conference-League-Stats"
passing_stats <- "https://fbref.com/en/comps/882/passing/Conference-League-Stats"
passing_types_stats <- "https://fbref.com/en/comps/882/passing_types/Conference-League-Stats"
creation_stats <- "https://fbref.com/en/comps/882/gca/Conference-League-Stats"
defensive_stats <- "https://fbref.com/en/comps/882/defense/Conference-League-Stats"
possession_stats <- "https://fbref.com/en/comps/882/possession/Conference-League-Stats"
playing_time_stats <- "https://fbref.com/en/comps/882/playingtime/Conference-League-Stats"
misc_stats <- "https://fbref.com/en/comps/882/misc/Conference-League-Stats"
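Since all eleven URLs share the same shape, they can also be generated from the page slugs. The following is a minimal sketch in base R; build_fbref_url, stat_pages and stat_urls are hypothetical names introduced here for illustration and are not used elsewhere in this tutorial.

```r
# Hypothetical helper: builds an FBRef competition URL from a page slug.
# The competition id (882) and the trailing slug are copied from the URLs above.
build_fbref_url <- function(page, comp_id = 882,
                            comp_slug = "Conference-League-Stats") {
  paste0("https://fbref.com/en/comps/", comp_id, "/", page, "/", comp_slug)
}

# Page slugs taken from the URLs defined above
stat_pages <- c(
  general              = "stats",
  goalkeeping          = "keepers",
  advanced_goalkeeping = "keepersadv",
  shooting             = "shooting",
  passing              = "passing",
  passing_types        = "passing_types",
  creation             = "gca",
  defensive            = "defense",
  possession           = "possession",
  playing_time         = "playingtime",
  misc                 = "misc"
)

# Named character vector: one URL per stats page
stat_urls <- vapply(stat_pages, build_fbref_url, character(1))
stat_urls[["general"]]
```

Keeping the URLs in one named vector makes it easier to loop over the pages later, at the cost of being slightly less explicit than eleven separate variables.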
Scrape General Stats Page
The first step in scraping the data is to read the live HTML from the general stats page. We will use the read_html_live function from the rvest package, which loads the page in a headless browser so that content rendered by JavaScript is available. We will then use the html_table function to extract the tables from the HTML and convert them to data frames. For this tutorial we will focus only on the general stats page, but the same process can be applied to the other pages as well.
Read the live html tables from FBRef
The read_html_live function loads the live web page and returns its HTML. Piping the result into html_table then extracts every table on the page and returns them as a list of data frames.
standard_conference_tables <- read_html_live(general_stats_url) %>%
html_table(fill = T)
Next, we need to identify which table we are interested in. The standard_conference_tables object is a list of data frames, with each data frame representing a table on the web page. We can use the [[ operator to access the data frame we are interested in. In this case, we are interested in the third table on the page, which contains the player statistics. The way the tables are ordered on the actual web page dictates the order in which they are stored in the list. If for example we wanted the first table, we would use .[[1]] in place of .[[3]]. The as.data.frame function is used to convert the table to a data frame.
standard_players <- standard_conference_tables %>%
.[[3]] %>%
as.data.frame()
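To see how the list of tables behaves without hitting the live site, the same idea can be tried on a small hand-written page. This is only a sketch: the HTML and the toy_ names are invented, and rvest's minimal_html is used in place of read_html_live.

```r
library(rvest)

# A made-up page with two tables, standing in for the real stats page
toy_page <- minimal_html('
  <table>
    <tr><th>Squad</th><th>Gls</th></tr>
    <tr><td>Squad A</td><td>3</td></tr>
  </table>
  <table>
    <tr><th>Player</th><th>Min</th></tr>
    <tr><td>Player X</td><td>90</td></tr>
  </table>
')

# html_table() on a whole page returns one data frame per <table>,
# in the order the tables appear in the HTML
toy_tables <- html_table(toy_page)
length(toy_tables)

# [[2]] picks the second table on the page
toy_players <- as.data.frame(toy_tables[[2]])
toy_players
```

Because the position depends on the page layout, it is worth checking length(standard_conference_tables) and the first few rows of the chosen table whenever the site changes.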
Clean up the data frame
Now that we have the data frame, we can clean it up. The first row contains the real column names, so we set the column names from that row and then remove it.
colnames(standard_players) <- standard_players[1, ]
standard_players <- standard_players[-1, ]
Next, we can clean up the column names using the janitor::clean_names function. This function converts the column names to lowercase snake_case, replacing spaces and symbols with underscores and numbering any duplicated names (for example gls and gls_2). We can also remove any rows where the player name is "Player", as these are repeated header rows from the web page rather than actual player statistics.
standard_players <- standard_players %>%
janitor::clean_names() %>%
filter(player != "Player")
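The header clean-up can be rehearsed on a tiny invented table before running it on the scraped data. In this sketch the values and the toy_ names are made up; the steps mirror the ones above.

```r
library(dplyr)

# Toy table mimicking a scraped table whose real header landed in row 1
# and is repeated further down (all values are invented)
toy_raw <- data.frame(
  V1 = c("Player", "Player A", "Player", "Player B"),
  V2 = c("Gls", "2", "Gls", "0"),
  V3 = c("Gls", "0.50", "Gls", "0.00"),
  stringsAsFactors = FALSE
)

# Promote row 1 to column names, then drop that row
colnames(toy_raw) <- as.character(unlist(toy_raw[1, ]))
toy_raw <- toy_raw[-1, ]

toy_clean <- toy_raw %>%
  janitor::clean_names() %>%   # lowercases names and makes duplicates unique
  filter(player != "Player")   # drop the repeated header row

names(toy_clean)
toy_clean
```

The duplicated "Gls" header is the reason clean_names is applied before any dplyr verb: most dplyr functions refuse to work with duplicated column names.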
The "matches" column is not needed for this analysis, so we can remove it from the data frame using the select function from the dplyr package.
standard_players <- standard_players %>%
select(-matches)
The data in the numeric columns is currently stored as character data, so we need to convert these columns to numeric data using the as.numeric function. We can use the mutate and across functions from the dplyr package to apply as.numeric to the first column and to every column from the seventh onwards, which are the columns that hold numbers.
standard_players <- standard_players %>%
mutate(across(c(1, 7:ncol(.)), as.numeric))
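One thing to watch with as.numeric is that it returns NA for values containing a thousands separator. If a scraped column ever holds values such as "1,080", readr's parse_number is one possible alternative. The sketch below uses invented data and toy_ names to show the difference.

```r
library(dplyr)
library(readr)

# Invented character data, as it might look straight after scraping
toy_stats <- data.frame(
  player = c("Player A", "Player B"),
  min    = c("540", "1,080"),
  gls    = c("2", "0"),
  stringsAsFactors = FALSE
)

# as.numeric() cannot read the thousands separator: the second value becomes NA
suppressWarnings(as.numeric(toy_stats$min))

# parse_number() drops the grouping mark before converting
toy_numeric <- toy_stats %>%
  mutate(across(c(min, gls), parse_number))

toy_numeric
```

A quick colSums(is.na(standard_players)) after the conversion is a cheap way to spot columns where values were silently turned into NA.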
Extract the player ids from the player links
Next, we need to extract the player ids from the player links on the page. The player links contain the player ids, which we can use to link the player statistics to the player match logs. We can collect the links using the html_nodes, html_elements and html_attr functions from the rvest package. We can then use the as.data.frame function to convert the links to a data frame, name the column "url_info", keep only the player page links, and derive a "player_id" column from them.
player_id <- read_html_live(general_stats_url) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
# take the path segment that follows /players/ (assumes links look like /en/players/<id>/<name>)
mutate(player_id = str_extract(url_info, "(?<=/players/)[^/]+"))
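The link filtering can be checked offline on a few made-up hrefs. This sketch assumes that player links follow the shape /en/players/<id>/<name>; the ids and names below are invented, and the toy_ variables are for illustration only.

```r
library(stringr)

# Invented hrefs in the assumed shape of the scraped links
toy_links <- c(
  "/en/players/abc12345/Example-Player",
  "/en/players/abc12345/matchlogs/2024-2025/Example-Player-Match-Logs",
  "/en/squads/def67890/Example-Squad"
)

# Keep player pages, drop match log pages (same rule as the pipeline above)
is_player_page <- str_detect(toy_links, "/players/") &
  !str_detect(toy_links, "/matchlogs/")
toy_player_links <- toy_links[is_player_page]

# The id is assumed to be the path segment right after /players/
toy_player_ids <- str_extract(toy_player_links, "(?<=/players/)[^/]+")
toy_player_ids
```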
Join the player statistics data frame with the player ids data frame
Now that we have the player ids, we can add them to the player statistics data frame using the bind_cols function from the dplyr package. bind_cols matches rows by position rather than by a key, so this step relies on both data frames having the same number of rows in the same order.
standard_players <- standard_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com", url_info)) %>% # url_info already starts with "/"
rename(link_to_player_page = url_info)
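Because bind_cols pairs rows purely by position, a row-count check before binding can catch a mismatch early. A minimal sketch with invented data follows; the toy_ names are hypothetical.

```r
library(dplyr)

# Invented stand-ins for the statistics and the scraped links
toy_stats <- data.frame(
  player = c("Player A", "Player B"),
  goals  = c(1, 0),
  stringsAsFactors = FALSE
)
toy_ids <- data.frame(
  url_info = c("/en/players/aaaa1111/Player-A", "/en/players/bbbb2222/Player-B"),
  stringsAsFactors = FALSE
)

# Stop early if the two data frames cannot be lined up row by row
stopifnot(nrow(toy_stats) == nrow(toy_ids))

toy_combined <- bind_cols(toy_stats, toy_ids)
toy_combined
```

The check only confirms the counts match; it cannot confirm the order, so spot-checking a few players against their links is still worthwhile.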
More data cleaning
We also don't need the rank or age columns, so we can remove them using the select function from the dplyr package. Placing a - before the column name will remove that column from the data frame.
standard_players <- standard_players %>%
select(-rk, -age)
Now we can calculate an up-to-date player age. The 'born' column contains the birth year of the player. We can calculate the age of the player by subtracting the birth year from 2024. We can use the str_sub function from the stringr package to extract the last 4 characters of the 'born' column, which represent the birth year. We can then convert the birth year to a numeric data type using the as.numeric function and subtract it from 2024 to get the approximate age of the player.
standard_players <- standard_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
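The hard-coded 2024 can be lifted into a variable so the calculation is easy to update for another season. The sketch below uses invented birth years; season_start_year is a hypothetical name, and measuring age against the season's starting year is an assumption made here.

```r
library(stringr)

# Assumption: ages are measured against the year the season started
season_start_year <- 2024

# Invented birth years stored as text, as they might be after scraping
toy_born <- c("1996", "2003")

# Same calculation as above, with the reference year in one place
toy_age <- season_start_year - as.numeric(str_sub(toy_born, start = -4))
toy_age
```

This gives the age a player reaches during that calendar year, so it can be one year higher than their age on a given match day.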
Now that we have calculated the age of the players, we can remove the 'born' column as it is no longer needed. We can use the select function from the dplyr package to remove the 'born' column. We can also use the select function to position the 'age' column next to the 'player' column.
standard_players <- standard_players %>%
select(-born) %>%
select(player, age, everything())
Now we can rename the columns to more descriptive names using the rename function from the dplyr package. Keeping the names as descriptive as possible makes the data easier to work with and understand.
standard_players <- standard_players %>%
rename(
position = pos,
squad = squad,
matches_played = mp,
minutes_played = min,
played_90 = x90s,
goals = gls,
assists = ast,
combined_goals_and_assists = g_a,
non_penalty_goals = g_pk,
penalties_scored = pk,
penalties_taken = p_katt,
yellow_cards = crd_y,
red_cards = crd_r,
expected_goals = x_g,
non_penalty_expected_goals = npx_g,
expected_assisted_goals = x_ag,
non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag,
progressive_carries = prg_c,
progressive_passes = prg_p,
progressive_passes_received = prg_r,
per90_goals = gls_2,
per90_assists = ast_2,
per90_goals_and_assists = g_a_2,
per90_non_penalty_goals = g_pk_2,
per90_non_penalty_goals_and_assists = g_a_pk,
per90_expected_goals = x_g_2,
per90_expected_assisted_goals = x_ag_2,
per90_expected_goals_and_assisted_goals = x_g_x_ag,
per90_non_penalty_expected_goals = npx_g_2,
per90_non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag_2,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
The last step in cleaning the data is to remove the lowercase prefix from the nation and squad columns. In the scraped table these values start with a lowercase country code followed by a space. We can remove the first three characters from the nation and squad columns using the str_sub function from the stringr package. For a two-letter code this leaves only the uppercase nation code and the squad name, which makes it easier to identify the nation and squad of the player.
standard_players <- standard_players %>%
mutate(nation = str_sub(nation, start = 4), # This removes the first 3 characters from the 'nation' column
squad = str_sub(squad, start = 4)) # This removes the first 3 characters from the 'squad' column
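Removing a fixed three characters assumes a two-letter prefix. If some prefixes are longer (a three-letter code such as "eng" is used here as an assumed example), a regular expression that strips the leading lowercase letters and the following space is one possible alternative. The values below are invented.

```r
library(stringr)

# Invented values: one two-letter prefix, one assumed three-letter prefix
toy_nation <- c("ie IRL", "eng ENG")
toy_squad  <- c("ie Shamrock Rov", "eng Example Utd")

# Fixed-width removal leaves a stray space after a three-letter prefix
str_sub(toy_nation, start = 4)

# Pattern-based removal: leading lowercase letters plus the whitespace after them
toy_nation_clean <- str_remove(toy_nation, "^[a-z]+\\s+")
toy_squad_clean  <- str_remove(toy_squad, "^[a-z]+\\s+")

toy_nation_clean
toy_squad_clean
```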
Finally, we can summarise the cleaned data frame using the gtsummary package. The gtsummary package provides a simple and flexible way to create summary tables of data frames. We can use the tbl_summary function from the gtsummary package to create a summary table of the data frame. With the settings below, this will display the mean and standard deviation for the numeric columns in the data frame. We can specify the summary statistics to display and the data types of the columns using the statistic and type arguments of the tbl_summary function. This is a great way to get an overview of the summary statistics of the data frame and identify any potential issues or outliers.
library(gtsummary)
standard_players %>%
# Select the columns to display in the summary table
select(age, minutes_played, goals, assists, non_penalty_expected_goals_and_assisted_goals, progressive_carries, progressive_passes, yellow_cards, red_cards) %>%
tbl_summary( # Create a summary table of the data frame
statistic = list(all_continuous() ~ "{mean} ± {sd}",
all_categorical() ~ "{n} / {N} ({p}%)"), # Specify the summary statistics to display
type = c( # Specify the data types of the columns
age ~ "continuous",
minutes_played ~ "continuous",
goals ~ "continuous",
assists ~ "continuous",
non_penalty_expected_goals_and_assisted_goals ~ "continuous",
progressive_carries ~ "continuous",
progressive_passes ~ "continuous",
yellow_cards ~ "continuous",
red_cards ~ "continuous"
),
# Show 2 decimal places
digits = list(all_continuous() ~ c(2, 2))
) %>%
modify_header(all_stat_cols() ~ "**{level}**")
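The figures in the summary table can be cross-checked with plain dplyr. This is a minimal sketch on invented numbers, not a replacement for the gtsummary table; the toy_ names are hypothetical.

```r
library(dplyr)

# Invented numbers standing in for a few of the selected columns
toy_players <- data.frame(
  age            = c(21, 28, 33),
  minutes_played = c(90, 540, 270)
)

# One mean and one standard deviation per column,
# named <column>_mean and <column>_sd
toy_summary <- toy_players %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

toy_summary
```

On the real data, mean and sd would need na.rm = TRUE if any of the converted columns contain missing values.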
Visualising Data
Before we can create visualisations, we need to organise the data in a format that is suitable for plotting. We will use the dplyr package to manipulate the data and create a new data frame called shamrock_players that contains only the player statistics for Shamrock Rovers F.C. To do this, we can filter the standard_players data frame to only include players from Shamrock Rovers F.C. using the filter function from the dplyr package.
shamrock_players <- standard_players %>%
filter(str_detect(squad, "Shamrock Rov"))
Load libraries for Data Visualisation
The next step in this tutorial is to visualise the data using the ggplot2 package. The ggplot2 package is a powerful and flexible package for creating visualisations in R. We will use ggplot2 to create a variety of visualisations to explore the data and gain insights into Shamrock Rovers F.C.'s performance in the UEFA Conference League. We will also use the scales package to format the axis labels and the ggtext package to format the plot titles and labels. For added functionality, we will also use the ggrepel package to add labels to the data points in the plots and load the ggiraph and ggiraphExtra packages to create interactive plots.
library(ggplot2)
library(scales)
library(ggtext)
library(ggrepel)
library(ggiraph)
library(ggiraphExtra)
Minutes Played by Shamrock Rovers F.C. Players in the UEFA Conference League
The first visualisation we will create is a simple bar plot of the number of minutes played by Shamrock Rovers F.C. players in the UEFA Conference League. We will use the ggplot function from the ggplot2 package to create the plot. We will use the aes function to specify the x and y variables for the plot, and the geom_bar function to create the bar plot. We will also use the geom_text function to add labels to the bars. We will use the labs function to add a title and axis labels to the plot, and the theme_minimal function to apply a minimal theme to the plot. We will also use the theme function to customise the appearance of the plot, such as the alignment of the player labels.
minutes_plot_basic <- shamrock_players %>%
ggplot(aes(
# specify the x and y variables for the plot
x = reorder(player, minutes_played), # reorder the players by the number of minutes played
y = minutes_played)) +
geom_bar(stat = "identity", fill = "green", color = "black") + # create a bar plot with a green fill and black border
scale_y_continuous(breaks = breaks_width(90)) + # Set the y-axis breaks to 90
# Add data labels inside the bars with a black text colour
geom_text(aes(label = minutes_played), hjust = 1.1, color = "black", size = 3) +
# Add a title and y-axis title
labs(title = "Player Minutes in the UEFA Conference League",
y = "Minutes Played",
caption = "Data Source: FBRef.com", # Add a caption with the data source
subtitle = "Minutes played by Shamrock Rovers F.C. players in the UEFA Conference League" # Add a subtitle to describe the plot
) +
coord_flip() + # Flip the coordinates so the bars are horizontal
theme_minimal() + # Apply a minimal theme to the plot
theme(
axis.title.y = element_blank(), # Remove the title of the player axis (vertical after coord_flip)
axis.text.y = element_text(angle = 0, hjust = 1) # Keep the player labels horizontal and right-aligned
)
minutes_plot_basic # Display the plot
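Two details of the plot are easier to see on a tiny example: what reorder does to the player order, and that geom_col is a shorthand for geom_bar(stat = "identity"). The data below are invented and the toy_ names are hypothetical.

```r
library(ggplot2)

# Invented minutes for three players
toy_minutes <- data.frame(
  player         = c("Player A", "Player B", "Player C"),
  minutes_played = c(90, 540, 270),
  stringsAsFactors = FALSE
)

# reorder() turns player into a factor whose levels are sorted by minutes,
# so bars are drawn from fewest to most minutes
toy_levels <- levels(reorder(toy_minutes$player, toy_minutes$minutes_played))
toy_levels

# geom_col() plots the y values as they are, like geom_bar(stat = "identity")
toy_plot <- ggplot(toy_minutes,
                   aes(x = reorder(player, minutes_played), y = minutes_played)) +
  geom_col(fill = "green", color = "black") +
  coord_flip()

toy_plot
```

After coord_flip, the last factor level ends up at the top, so the player with the most minutes appears first when reading down the chart.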
Now we can add an extra layer to this visualisation by adding the average minutes played by all players in the UEFA Conference League. This will allow us to compare the minutes played by Shamrock Rovers F.C. players to the minutes played by all players in the UEFA Conference League. We will use the geom_hline function to add a line at the average minutes played by all players; because the plot uses coord_flip, this line appears vertical. We will also use the annotate function to add a label to the line that identifies it as the average minutes played.
minutes_plot_vline <- minutes_plot_basic +
# Add a line at the average minutes played by all players in the UEFA Conference League (geom_hline appears vertical after coord_flip)
geom_hline(yintercept = mean(standard_players$minutes_played), linetype = "dashed", color = "red") +
# Annotation for the average minutes played by all players in the Uefa Conference League
annotate("text", x = 0, y = mean(standard_players$minutes_played), label = "Average Minutes Played", color = "red", hjust = -0.1, vjust = -1, # Add a label to the horizontal line
angle = 90, size = 3) # Adjust the angle of the label 90 degrees
minutes_plot_vline # Display the plot
From the plot, we can see that the average minutes played by all players in the UEFA Conference League is around 270 minutes. 11 Shamrock Rovers F.C. players have played more minutes than the average player in the UEFA Conference League. The players who have played the most minutes are Pico, Léon Pöhls, Lee Grace, Dan Cleary, Josh Honohan, Johnny Kenny, Markus Poom, Darragh Burns, Neil Farrugia, Dylan Watts and Gary O'Neill. Adding the average minutes line to the plot allows us to compare the minutes played by Shamrock Rovers F.C. players to the average minutes played by all players in the UEFA Conference League to gauge selection preferences, squad rotation and squad depth.
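The count of players above the average can also be computed directly rather than read off the plot. The sketch below uses invented data and toy_ names; na.rm = TRUE is added on the assumption that some rows may have missing minutes.

```r
library(dplyr)

# Invented minutes for two squads, including one missing value
toy_all <- data.frame(
  squad          = c("Shamrock Rov", "Shamrock Rov", "Other Squad", "Other Squad"),
  player         = c("Player A", "Player B", "Player C", "Player D"),
  minutes_played = c(540, 90, 270, NA),
  stringsAsFactors = FALSE
)

# Average over all players, ignoring missing values
toy_avg_minutes <- mean(toy_all$minutes_played, na.rm = TRUE)
toy_avg_minutes

# Players of one squad who are above that average
toy_above_avg <- toy_all %>%
  filter(squad == "Shamrock Rov", minutes_played > toy_avg_minutes)

nrow(toy_above_avg)
```

Without na.rm = TRUE, a single missing value makes the mean NA, and the reference line would not be drawn.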
Non-penalty expected goals (np-xG) and Non-penalty expected assists (np-xA) by Shamrock Rovers F.C. players in the UEFA Conference League
The next visualisation we will create is a scatter plot of the non-penalty expected goals (np-xG) and non-penalty expected assists (np-xA) by Shamrock Rovers F.C. players in the UEFA Conference League. Again, we will use the ggplot function from the ggplot2 package to create the plot. We will also use the aes function to specify the x and y variables for the plot, and then the geom_point function to create the scatter plot. We will also use the geom_text_repel function from the ggrepel package to add labels to the data points in the plot. We will use the labs function to add a title and axis labels to the plot, and the theme_minimal function to apply a minimal theme to the plot.
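As a rough illustration of the functions just described, here is a minimal sketch of such a scatter plot on invented data. It is not the tutorial's own plotting code: the values and the toy_ and xg_plot_sketch names are made up, and using the expected_assisted_goals column for the assist axis is an assumption.

```r
library(ggplot2)
library(ggrepel)

# Invented values using the column names created during the renaming step
toy_xg <- data.frame(
  player                     = c("Player A", "Player B", "Player C"),
  non_penalty_expected_goals = c(1.2, 0.3, 0.0),
  expected_assisted_goals    = c(0.4, 0.9, 0.1),
  stringsAsFactors = FALSE
)

xg_plot_sketch <- ggplot(toy_xg,
                         aes(x = non_penalty_expected_goals,
                             y = expected_assisted_goals)) +
  geom_point(color = "green", size = 3) +            # one point per player
  geom_text_repel(aes(label = player), size = 3) +   # labels nudged to avoid overlap
  labs(title = "Expected goals and expected assists",
       x = "Non-penalty expected goals",
       y = "Expected assisted goals",
       caption = "Invented data for illustration") +
  theme_minimal()

xg_plot_sketch
```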