Basic Data Scraping and Visualisation Tutorial

A tutorial on data scraping and visualisation, told through my love for Shamrock Rovers F.C. and their journey so far in the 2024-2025 Uefa Conference League.

Introduction

This tutorial provides an overview of the data scraping and visualisation process using R and is for illustrative purposes only. There are people much smarter than me who can do much better work; I just like to learn and share new skills!

Shamrock Rovers F.C. are an Irish football team based in Dublin. They are the most successful team in the Republic of Ireland, having won the League of Ireland title a record 21 times. They have also won the FAI Cup a record 25 times. Shamrock Rovers are the first Irish team to have qualified for the group stages of a major European competition, having reached the group stages of the 2011–12 UEFA Europa League. Shamrock Rovers have a rich football history, and this tutorial will focus on their recent participation in the Uefa Conference League using R.

Scraping Data

Load libraries for Web Scraping

The first step in this tutorial is to scrape the data. Although this is a Uefa competition, the data comes from FBRef rather than from the Uefa website. The data we are interested in is the match results and statistics for Shamrock Rovers F.C. in the Uefa Conference League. The packages that we will use for this are rvest, tidyverse, stringr, readr, and openxlsx. These packages will allow us to scrape the data from FBRef, clean and process the data, and save it to an Excel file. rvest is a web scraping package that allows us to extract data from web pages (the read_html_live function used below needs rvest 1.0.4 or later). tidyverse is a collection of packages for data manipulation and visualisation; it already includes stringr and readr, so loading those two separately is optional. stringr is a package for string manipulation. readr is a package for reading and writing data. openxlsx is a package for reading and writing Excel files.

library(rvest)
library(tidyverse)
library(stringr)
library(readr)
library(openxlsx)

Define the Live HTML URLs for each Stats Web Page from FBRef

Next we need to define the URLs for each of the web pages that we want to scrape. The data we are interested in includes general stats, goalkeeper stats, advanced goalkeeper stats, shooting stats, passing stats, passing types stats, creation stats, defensive stats, possession stats, playing time stats, and miscellaneous stats. We will scrape the data from each of these web pages and create a data frame for each set of statistics that we can save to an Excel file at a later stage.

general_stats_url <- "https://fbref.com/en/comps/882/stats/Conference-League-Stats"
gk_url <- "https://fbref.com/en/comps/882/keepers/Conference-League-Stats"
adv_gk_url <- "https://fbref.com/en/comps/882/keepersadv/Conference-League-Stats"
shooting_url <- "https://fbref.com/en/comps/882/shooting/Conference-League-Stats"
passing_stats <- "https://fbref.com/en/comps/882/passing/Conference-League-Stats"
passing_types_stats <- "https://fbref.com/en/comps/882/passing_types/Conference-League-Stats"
creation_stats <- "https://fbref.com/en/comps/882/gca/Conference-League-Stats"
defensive_stats <- "https://fbref.com/en/comps/882/defense/Conference-League-Stats"
possession_stats <- "https://fbref.com/en/comps/882/possession/Conference-League-Stats"
playing_time_stats <- "https://fbref.com/en/comps/882/playingtime/Conference-League-Stats"
misc_stats <- "https://fbref.com/en/comps/882/misc/Conference-League-Stats"
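
Since the eleven URLs differ only in one path segment, they could also be generated from a vector of page names. The sketch below is one way to do that in base R; the names stat_pages and stats_urls are illustrative and are not used elsewhere in this tutorial.

```r
# Page slugs as they appear in the FBRef URLs defined above
stat_pages <- c("stats", "keepers", "keepersadv", "shooting", "passing",
                "passing_types", "gca", "defense", "possession",
                "playingtime", "misc")

# Build one URL per page; sprintf() is vectorised over stat_pages
stats_urls <- sprintf(
  "https://fbref.com/en/comps/882/%s/Conference-League-Stats",
  stat_pages
)
names(stats_urls) <- stat_pages

stats_urls[["shooting"]]
```

A named vector like this makes it easy to loop over every page later with the same scraping steps.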

Scrape General Stats Page

The first step in scraping the data is to read the live HTML tables from the general stats page. We will use the read_html_live function from the rvest package to read the live HTML tables from the web page. We will then use the html_table function to extract the tables from the HTML and convert them to data frames. For this tutorial we will focus only on the general stats page, but the same process can be applied to the other pages as well.

Read the live html tables from FBRef

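The read_html_live function loads the page in a headless browser session (it relies on the chromote package and a local Chrome installation), so that content generated by JavaScript is available. The html_table function then extracts every table from the HTML and returns them as a list of data frames.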

standard_conference_tables <- read_html_live(general_stats_url) %>% 
  html_table() # the fill argument is deprecated in rvest 1.0.0 and later, so it is omitted

Next, we need to identify which table we are interested in. The standard_conference_tables object is a list of data frames, with each data frame representing a table on the web page. We can use the [[ operator to access the data frame we are interested in. In this case, we are interested in the third table on the page, which contains the player statistics. The way the tables are ordered on the actual web page dictates the order in which they are stored in the list. If for example we wanted the first table, we would use .[[1]] in place of .[[3]]. The as.data.frame function is used to convert the table to a data frame.

standard_players <- standard_conference_tables %>%
  .[[3]] %>%
  as.data.frame()
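
Because the position of a table in the list depends on the page layout, which can change, one alternative is to pick the table by its content. This is a minimal sketch with made-up tables standing in for the scraped list; the helper find_table_with is hypothetical and not part of rvest.

```r
# Toy stand-ins for the list returned by html_table(); contents are invented
tables <- list(
  data.frame(X1 = c("Squad", "Team A"), X2 = c("MP", "6")),
  data.frame(X1 = c("Squad", "Team A"), X2 = c("Gls", "9")),
  data.frame(X1 = c("Rk", "1"), X2 = c("Player", "Some Player"))
)

# Hypothetical helper: index of the first table whose header
# (column names or first data row) contains the given label
find_table_with <- function(tables, label) {
  has_label <- vapply(tables, function(tbl) {
    first_row <- unlist(lapply(tbl[1, ], as.character), use.names = FALSE)
    label %in% c(names(tbl), first_row)
  }, logical(1))
  which(has_label)[1]
}

find_table_with(tables, "Player")
```

With the real scraped list, the returned index could be used in place of the hard-coded 3.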

Clean up the data frame

Now that we have the data frame, we can clean it up. The real column names ended up in the first row of the data, so we first set the column names to the values in that row and then remove the row.

colnames(standard_players) <- standard_players[1, ]
standard_players <- standard_players[-1, ]
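
To see what these two lines do in isolation, here is a small sketch on an invented table whose header sits in the first data row. The data frame raw and its values are made up for illustration.

```r
# Toy table whose real header sits in the first data row (values invented)
raw <- data.frame(
  V1 = c("Player", "Player A", "Player B"),
  V2 = c("Min", "540", "450"),
  stringsAsFactors = FALSE
)

colnames(raw) <- unlist(raw[1, ], use.names = FALSE) # promote the first row to column names
raw <- raw[-1, ]                                     # then drop that row from the data
rownames(raw) <- NULL                                # reset the row numbering (optional)

raw
```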

We can then clean up the column names using the janitor::clean_names function. This function converts the column names to lowercase, replaces spaces and special characters with underscores, and adds a numeric suffix to duplicated names. We can also remove any rows where the player name is "Player", as these are not actual player statistics but header rows that are repeated within the table on the web page.

standard_players <- standard_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

The "matches" column is not needed for this analysis, so we can remove it from the data frame using the select function from the dplyr package.

standard_players <- standard_players %>%
  select(-matches)

The data in the numeric columns is currently stored as character data, so we need to convert these columns to numeric data using the as.numeric function. We can use the mutate and across functions from the dplyr package to apply as.numeric to the first column (the rank) and to every column from the seventh onwards, which leaves the player, nation, position, squad and age columns as text.

standard_players <- standard_players %>%
  mutate(across(c(1, 7:ncol(.)), as.numeric))
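
One thing to watch for: as.numeric returns NA for strings it cannot parse, such as numbers written with a thousands separator. Whether the scraped columns contain such values is an assumption worth checking on your own data; the values below are invented.

```r
minutes_raw <- c("540", "1,080", "90")

# as.numeric() cannot parse the comma, so the second value becomes NA
# (R also raises a coercion warning, silenced here)
suppressWarnings(as.numeric(minutes_raw))

# Stripping the commas first keeps the value
minutes_clean <- as.numeric(gsub(",", "", minutes_raw, fixed = TRUE))
minutes_clean
```

If this applies to your data, the same gsub step could be wrapped around the columns before the conversion above.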

Next, we need to extract the player ids from the player links on the page. The player links contain the player ids, which we can use to link the player statistics to the player match logs. We can collect the links using the html_elements and html_attr functions from the rvest package, keep only the links to player pages, and then pull the id out of each link using the str_extract function from the stringr package. The result is a data frame with the link in "url_info" and the id in "player_id".

player_id <- read_html_live(general_stats_url) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>%
  # at this point the links also include the player match logs, so we filter those out
  filter(grepl("/players/", url_info), !grepl("/matchlogs/", url_info)) %>%
  # the player id is the path segment that follows /players/
  mutate(player_id = str_extract(url_info, "(?<=/players/)[^/]+"))
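
The link filtering is easier to follow on a few sample values. This is a minimal sketch using invented hrefs; the assumption is that player links look like /en/players/<id>/<name> and that match-log links contain /matchlogs/.

```r
library(stringr)

# Invented hrefs in the shape assumed for player, match-log and squad links
hrefs <- c(
  "/en/players/0a1b2c3d/Example-Player",
  "/en/players/0a1b2c3d/matchlogs/2024-2025/Example-Player-Match-Logs",
  "/en/squads/4e5f6a7b/Example-Squad-Stats"
)

# Keep player links that are not match-log links
player_links <- hrefs[str_detect(hrefs, "/players/") &
                        !str_detect(hrefs, "/matchlogs/")]

# The id is assumed to be the path segment right after /players/
player_ids <- str_match(player_links, "/players/([^/]+)/")[, 2]
player_ids
```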

Join the player statistics data frame with the player ids data frame

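Now that we have the player ids, we can join the player statistics data frame with the player ids data frame using the bind_cols function from the dplyr package. This will add the player ids to the player statistics data frame. Note that bind_cols matches rows by position rather than by a key, so it relies on both data frames having the same number of rows in the same order.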

standard_players <- standard_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com", url_info)) %>% # url_info already starts with "/"
  rename(link_to_player_page = url_info)
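
As bind_cols pairs rows purely by position, a quick check of the row counts before binding can catch a mismatch early. A minimal sketch on invented data, with stats and links as hypothetical stand-ins for the two data frames:

```r
library(dplyr)

# Toy stand-ins for the stats table and the scraped links (values invented)
stats <- data.frame(player = c("Player A", "Player B"), min = c(540, 450))
links <- data.frame(url_info = c("/en/players/0a1b2c3d/Player-A",
                                 "/en/players/4e5f6a7b/Player-B"))

# Stop early if the two data frames do not line up row for row
stopifnot(nrow(stats) == nrow(links))

combined <- bind_cols(stats, links)
combined
```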

More data cleaning

We also don't need the rank or age columns, so we can remove them using the select function from the dplyr package. Placing a - before the column name will remove that column from the data frame.

standard_players <- standard_players %>%
  select(-rk, -age)

Now we can calculate an up-to-date player age. The 'born' column contains the birth year of the player, so we can estimate the age by subtracting the birth year from 2024 (this ignores whether the birthday has already passed). We use the str_sub function from the stringr package to extract the last 4 characters of the 'born' column, which represent the birth year, convert them to a numeric data type using the as.numeric function, and subtract the result from 2024.

standard_players <- standard_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
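
If you would rather not hard-code the year, the same calculation can take the reference year as an argument. This is only a sketch; the helper age_from_birth_year is hypothetical and defaults to the current calendar year.

```r
library(stringr)

# Hypothetical helper: age as reference year minus birth year
age_from_birth_year <- function(born,
                                reference_year = as.numeric(format(Sys.Date(), "%Y"))) {
  reference_year - as.numeric(str_sub(born, start = -4))
}

# Passing the year explicitly reproduces the calculation used in this tutorial
age_from_birth_year(c("1995", "2003"), reference_year = 2024)
```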

Now that we have calculated the age of the players, we can remove the 'born' column as it is no longer needed. We can use the select function from the dplyr package to remove the 'born' column. We can also use the select function to position the 'age' column next to the 'player' column.

standard_players <- standard_players %>%
  select(-born) %>%
  select(player, age, everything())

Now we can rename the columns using the rename function from the dplyr package. Descriptive names make the data easier to work with and understand, so it is worth spelling them out in full.

standard_players <- standard_players %>%
  rename(
    position = pos,
    squad = squad,
    matches_played = mp,
    minutes_played = min,
    played_90 = x90s,
    goals = gls,
    assists = ast,
    combined_goals_and_assists = g_a,
    non_penalty_goals = g_pk,
    penalties_scored = pk,
    penalties_taken = p_katt,
    yellow_cards = crd_y,
    red_cards = crd_r,
    expected_goals = x_g,
    non_penalty_expected_goals = npx_g,
    expected_assisted_goals = x_ag,
    non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag,
    progressive_carries = prg_c,
    progressive_passes = prg_p,
    progressive_passes_received = prg_r,
    per90_goals = gls_2,
    per90_assists = ast_2,
    per90_goals_and_assists = g_a_2,
    per90_non_penalty_goals = g_pk_2,
    per90_non_penalty_goals_and_assists = g_a_pk,
    per90_expected_goals = x_g_2,
    per90_expected_assisted_goals = x_ag_2,
    per90_expected_goals_and_assisted_goals = x_g_x_ag,
    per90_non_penalty_expected_goals = npx_g_2,
    per90_non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag_2,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

The last step in cleaning the data is to remove the lowercase prefix from the nation and squad columns. Both columns start with a lowercase country code followed by a space (for example "ie IRL"). The code is usually two letters but some are three (for example "eng"), so rather than dropping a fixed number of characters we remove the leading lowercase letters and the space after them using the str_remove function from the stringr package. This leaves only the uppercase nation code and the squad name, which makes it easier to identify the nation and squad of the player.

standard_players <- standard_players %>%
  mutate(nation = str_remove(nation, "^[a-z]+\\s+"), # This removes the leading lowercase country code from the 'nation' column
         squad = str_remove(squad, "^[a-z]+\\s+")) # This removes the leading lowercase country code from the 'squad' column
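
To illustrate why the length of the prefix matters, here is a small comparison on invented values between dropping a fixed three characters and removing the lowercase prefix by pattern. The sample values are assumptions about the shape of the scraped data.

```r
library(stringr)

# Invented values in the "<lowercase code> <label>" shape
nation <- c("ie IRL", "eng ENG")

fixed_width   <- str_sub(nation, start = 4)          # second value keeps a leading space
pattern_based <- str_remove(nation, "^[a-z]+\\s+")   # handles both prefix lengths

fixed_width
pattern_based
```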

Finally, we can summarise the cleaned data frame using the gtsummary package. The gtsummary package provides a simple and flexible way to create summary tables of data frames. We can use the tbl_summary function to create a summary table, which here displays the mean and standard deviation for each of the selected numeric columns. The type argument tells tbl_summary to treat every selected column as continuous, and the statistic argument sets which summary statistics are displayed. This is a great way to get an overview of the data frame and identify any potential issues or outliers.

library(gtsummary)

standard_players %>%
  # Select the columns to display in the summary table
  select(age, minutes_played, goals, assists, non_penalty_expected_goals_and_assisted_goals, progressive_carries, progressive_passes, yellow_cards, red_cards) %>%
  tbl_summary( # Create a summary table of the data frame
    statistic = list(all_continuous() ~ "{mean} ± {sd}",
                     all_categorical() ~ "{n} / {N} ({p}%)"), # Specify the summary statistics to display
    type = c( # Specify the data types of the columns
      age ~ "continuous",
      minutes_played ~ "continuous",
      goals ~ "continuous",
      assists ~ "continuous",
      non_penalty_expected_goals_and_assisted_goals ~ "continuous",
      progressive_carries ~ "continuous",
      progressive_passes ~ "continuous",
      yellow_cards ~ "continuous",
      red_cards ~ "continuous"
    ),
    # Show 2 decimal places
    digits = list(all_continuous() ~ c(2, 2))
  ) %>% 
  modify_header(all_stat_cols() ~ "**{level}**")

Visualising Data

Before we can create visualisations, we need to organise the data in a format that is suitable for plotting. As the focus here is Shamrock Rovers F.C., we will create a new data frame called shamrock_players by filtering standard_players with the filter function from the dplyr package, keeping only the rows whose squad name contains "Shamrock Rov".

shamrock_players <- standard_players %>%
  filter(str_detect(squad, "Shamrock Rov"))

Load libraries for Data Visualisation

The next step in this tutorial is to visualise the data using the ggplot2 package. The ggplot2 package is a powerful and flexible package for creating visualisations in R. We will use ggplot2 to create a variety of visualisations to explore the data and gain insights into Shamrock Rovers F.C.'s performance in the Uefa Conference League. We will also use the scales package to format the axis labels and the ggtext package to format the plot titles and labels. For added functionality, we will use the ggrepel package to add labels to the data points in the plots and load the ggiraph and ggiraphExtra packages to create interactive plots.

library(ggplot2)
library(scales)
library(ggtext)
library(ggrepel)
library(ggiraph)
library(ggiraphExtra)

Minutes Played by Shamrock Rovers F.C. Players in the Uefa Conference League

The first visualisation we will create is a simple bar plot of the number of minutes played by Shamrock Rovers F.C. players in the Uefa Conference League. We will use the ggplot function from the ggplot2 package to create the plot. We will use the aes function to specify the x and y variables for the plot, and the geom_bar function to create the bar plot. We will also use the geom_text function to add labels to the bars. We will use the labs function to add a title and axis labels to the plot, and the theme_minimal function to apply a minimal theme to the plot. We will also use the theme function to customise the appearance of the plot, such as the angle of the x-axis labels.


minutes_plot_basic <- shamrock_players %>%
  ggplot(aes(
    # specify the x and y variables for the plot
    x = reorder(player, minutes_played), # reorder the players by the number of minutes played
    y = minutes_played)) + 
  geom_bar(stat = "identity", fill = "green", color = "black") + # create a bar plot with a green fill and black border
  scale_y_continuous(breaks = breaks_width(90)) + # Place a y-axis break every 90 minutes
  # Add data labels inside the bars with a black text colour
  geom_text(aes(label = minutes_played), hjust = 1.1, color = "black", size = 3) +
  # Add a title and y-axis title
  labs(title = "Player Minutes in the Uefa Conference League",
       y = "Minutes Played",
       caption = "Data Source: FBRef.com", # Add a caption with the data source
       subtitle = "Minutes played by Shamrock Rovers F.C. players in the Uefa Conference League" # Add a subtitle to describe the plot
       ) +
  coord_flip() + # Flip the coordinates so the bars are horizontal
  theme_minimal() + # Apply a minimal theme to the plot
  theme(
    axis.title.y = element_blank(), # Remove the title of the player axis
    axis.text.y = element_text(angle = 0, hjust = 1) # Keep the player names horizontal and right-aligned
  )

minutes_plot_basic # Display the plot

Now we can add an extra layer to this visualisation: the average minutes played by all players in the Uefa Conference League. This will allow us to compare the minutes played by Shamrock Rovers F.C. players to the minutes played by all players in the competition. Because the plot uses coord_flip, we add the line with the geom_hline function: a horizontal line on the original y-axis (minutes played) is drawn vertically once the coordinates are flipped. We will also use the annotate function to add a label to the line that identifies it as the average minutes played.


minutes_plot_vline <- minutes_plot_basic +
  # Add a line at the average minutes played by all players in the Uefa Conference League
  # (geom_hline is drawn vertically here because of coord_flip)
  geom_hline(yintercept = mean(standard_players$minutes_played, na.rm = TRUE), linetype = "dashed", color = "red") +
  # Annotation for the average minutes played by all players in the Uefa Conference League
  annotate("text", x = 0, y = mean(standard_players$minutes_played, na.rm = TRUE), label = "Average Minutes Played", color = "red", hjust = -0.1, vjust = -1, # Position the label next to the line
           angle = 90, size = 3) # Rotate the label by 90 degrees

minutes_plot_vline # Display the plot

From the plot, we can see that the average minutes played by all players in the Uefa Conference League is around 270 minutes. 11 Shamrock Rovers F.C. players have played more minutes than the average player in the Uefa Conference League. These players are Pico, Léon Pöhls, Lee Grace, Dan Cleary, Josh Honohan, Johnny Kenny, Markus Poom, Darragh Burns, Neil Farrugia, Dylan Watts and Gary O'Neill. Adding the average minutes line to the plot allows us to compare the minutes played by Shamrock Rovers F.C. players to the average across the competition and so gauge selection preferences, squad rotation and squad depth.
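
The same comparison can be made in code rather than read off the plot. A minimal sketch on invented numbers; all_minutes and squad_minutes are hypothetical stand-ins for the minutes columns of the two data frames.

```r
# Toy data: invented minutes for a whole competition and for one squad
all_minutes   <- c(90, 180, 270, 360, 540, 45, 10)
squad_minutes <- c(Player_A = 540, Player_B = 270, Player_C = 45)

competition_mean <- mean(all_minutes, na.rm = TRUE)

# Names of the squad players above the competition-wide average
above_average <- names(squad_minutes)[squad_minutes > competition_mean]
above_average
```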

Non-penalty expected goals (np-xG) and Non-penalty expected assists (np-xA) by Shamrock Rovers F.C. players in the Uefa Conference League

The next visualisation we will create is a scatter plot of the non-penalty expected goals (np-xG) and non-penalty expected assists (np-xA) by Shamrock Rovers F.C. players in the Uefa Conference League. Again, we will use the ggplot function from the ggplot2 package to create the plot. We will also use the aes function to specify the x and y variables for the plot, and then the geom_point function to create the scatter plot. We will also use the geom_text_repel function from the ggrepel package to add labels to the data points in the plot. We will use the labs function to add a title and axis labels to the plot, and the theme_minimal function to apply a minimal theme to the plot.
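
As a rough sketch of the approach described above, here is a scatter plot built on invented values rather than the scraped data. The column names follow the renamed columns from earlier in this tutorial, and the object names toy_players and xg_plot_sketch are illustrative.

```r
library(ggplot2)
library(ggrepel)

# Invented values; column names follow the renamed columns used earlier
toy_players <- data.frame(
  player = c("Player A", "Player B", "Player C"),
  non_penalty_expected_goals = c(1.2, 0.4, 0.1),
  expected_assisted_goals = c(0.3, 0.9, 0.2)
)

xg_plot_sketch <- ggplot(toy_players,
                         aes(x = non_penalty_expected_goals,
                             y = expected_assisted_goals)) +
  geom_point(color = "darkgreen", size = 3) +        # one point per player
  geom_text_repel(aes(label = player), size = 3) +   # labels placed to avoid overlap
  labs(title = "np-xG and expected assisted goals (toy data)",
       x = "Non-penalty expected goals",
       y = "Expected assisted goals") +
  theme_minimal()

xg_plot_sketch
```

Swapping toy_players for shamrock_players should give the equivalent plot for the real data, assuming the columns were renamed as shown earlier.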
