EURO 24 Web Scraping: A tutorial on scraping EURO 24 stats from FBRef by a complete beginner!

Introduction to Web Scraping and Writing Data to CSV and Excel Files

Web scraping is a powerful technique for extracting data from websites, letting you gather information that might not be readily available in structured formats like databases or APIs. That data can then be cleaned and processed for analysis, or stored in formats such as CSV (Comma-Separated Values) or Excel files. Using R and packages such as rvest, tidyverse, stringr, readr, and openxlsx, you can efficiently scrape, clean, and store data in formats suitable for analysis.
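
As a minimal sketch of that workflow with rvest (parsing a tiny inline HTML snippet rather than a live site, just to show the mechanics):

library(rvest)
library(readr)

# parse a small HTML fragment, extract every <table> as a tibble, save the first one
html <- minimal_html("
  <table>
    <tr><th>player</th><th>goals</th></tr>
    <tr><td>Player A</td><td>3</td></tr>
    <tr><td>Player B</td><td>1</td></tr>
  </table>")
tables <- html_table(html)
write_csv(tables[[1]], "example_table.csv")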

Attached is a lovely downloadable PDF file that documents each step!

I've no doubt that this can be done more efficiently than I have done it, but this is what I have learned so far. I hope this helps in some way, and if you are a more experienced R user, I would love to hear your feedback on how I can improve this process - feel free to reach out to me on Twitter or LinkedIn. Below is a step-by-step guide on how to scrape all the stats pages for EURO 24 from the FBRef website and write them to an Excel file and a CSV file. To learn from the experts, please check out TidyX by Patrick Ward and Ellis Hughes on YouTube. Mladen Jovanović also has fantastic courses on his website, Complementary Training.

Load libraries for Web Scraping

library(rvest)       # read_html_live(), html_table(), html_elements()
library(tidyverse)   # dplyr, purrr and friends for wrangling
library(stringr)
library(readr)
library(openxlsx)    # write.xlsx()
# janitor is also used below via janitor::clean_names(), so it needs to be installed too

Define the Live HTML URLs for each Stats Web Page from FBRef

general_stats_url <- "https://fbref.com/en/comps/676/stats/UEFA-Euro-Stats"
gk_url <- "https://fbref.com/en/comps/676/keepers/UEFA-Euro-Stats"
adv_gk_url <- "https://fbref.com/en/comps/676/keepersadv/UEFA-Euro-Stats"
shooting_url <- "https://fbref.com/en/comps/676/shooting/UEFA-Euro-Stats"
passing_stats <- "https://fbref.com/en/comps/676/passing/UEFA-Euro-Stats"
passing_types_stats <- "https://fbref.com/en/comps/676/passing_types/UEFA-Euro-Stats"
creation_stats <- "https://fbref.com/en/comps/676/gca/UEFA-Euro-Stats"
defensive_stats <- "https://fbref.com/en/comps/676/defense/UEFA-Euro-Stats"
possession_stats <- "https://fbref.com/en/comps/676/possession/UEFA-Euro-Stats"
playing_time_stats <- "https://fbref.com/en/comps/676/playingtime/UEFA-Euro-Stats"
misc_stats <- "https://fbref.com/en/comps/676/misc/UEFA-Euro-Stats"

Scrape General Stats Page

Read the live HTML tables from FBRef. read_html_live() renders the page in a headless browser, so html_table() can pick up tables that a plain read_html() call can miss.

standard_euro_tables <- read_html_live(general_stats_url) %>% 
  html_table()
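
read_html_live() relies on the chromote package (a headless Chrome session), so make sure it is available before running the scrapes; a small guard you could add:

# read_html_live() drives a headless Chrome session through chromote
if (!requireNamespace("chromote", quietly = TRUE)) {
  install.packages("chromote")
}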

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

standard_players <- standard_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(standard_players) <- standard_players[1, ]
standard_players <- standard_players[-1, ]
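
As an aside, the janitor package (used just below for clean_names()) can do this header promotion in one step; an equivalent sketch:

standard_players <- standard_euro_tables[[3]] %>%
  as.data.frame() %>%
  janitor::row_to_names(row_number = 1) # first row becomes the header and is dropped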

Clean the column names and remove the rows with recurring column names

standard_players <- standard_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

standard_players <- standard_players %>%
  select(-matches)

Convert the numeric columns to numeric: column 1 ('rk') plus everything from column 5 onward ('age', 'born' and the stat columns). Columns 2 to 4 ('player', 'pos' and 'squad') stay as text.

standard_players <- standard_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(general_stats_url) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))
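
To see what the ID extraction does, here it is applied to a made-up href that follows FBRef's /en/players/<id>/<name> pattern (the ID and player name are invented):

x <- "/en/players/abc12345/Some-Player" # invented example href
sub(".*/players/([^/]+)/.*", "\\1", x)  # returns "abc12345"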

Join the datasets together

standard_players <- standard_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)
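
bind_cols() pairs rows purely by position, so it is worth checking that the stats table and the link table really do have one row per player; a one-line guard you could add:

stopifnot(nrow(standard_players) == nrow(player_id))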

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

standard_players <- standard_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

standard_players <- standard_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
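
FBRef's 'born' column holds the four-digit birth year, so taking the last four characters and subtracting from 2024 gives the age. For example:

2024 - as.numeric(str_sub("1995", start = -4)) # a player born in 1995 is 29 in 2024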

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

standard_players <- standard_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

standard_players <- standard_players %>%
  rename(
    position = pos,
    country = squad,
    matches_played = mp,
    minutes_played = min,
    played_90 = x90s,
    goals = gls,
    assists = ast,
    combined_goals_and_assists = g_a,
    non_penalty_goals = g_pk,
    penalties_scored = pk,
    penalties_taken = p_katt,
    yellow_cards = crd_y,
    red_cards = crd_r,
    expected_goals = x_g,
    non_penalty_expected_goals = npx_g,
    expected_assisted_goals = x_ag,
    non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag,
    progressive_carries = prg_c,
    progressive_passes = prg_p,
    progressive_passes_received = prg_r,
    per90_goals = gls_2,
    per90_assists = ast_2,
    per90_goals_and_assists = g_a_2,
    per90_non_penalty_goals = g_pk_2,
    per90_non_penalty_goals_and_assists = g_a_pk,
    per90_expected_goals = x_g_2,
    per90_expected_assisted_goals = x_ag_2,
    per90_expected_goals_and_assisted_goals = x_g_x_ag,
    per90_non_penalty_expected_goals = npx_g_2,
    per90_non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag_2,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

View the final data

head(standard_players)

Write the data to a csv file

write_csv(standard_players, "standard_players.csv")

Write the data to an Excel file

write.xlsx(standard_players, "standard_players.xlsx")
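
The ten sections that follow repeat exactly these steps before their page-specific renames. One way to cut the repetition (a sketch, assuming every stats page keeps the same layout: player table third on the page and one link per player row) is to wrap the shared work in a function:

scrape_fbref_page <- function(url) {
  page <- read_html_live(url) # one headless-browser session, reused for both reads

  # the third <table> on the page holds the individual player rows
  players <- page %>%
    html_table() %>%
    .[[3]] %>%
    as.data.frame()
  colnames(players) <- players[1, ]
  players <- players[-1, ] %>%
    janitor::clean_names() %>%
    filter(player != "Player") %>%
    select(-matches) %>%
    mutate(across(c(1, 5:ncol(.)), as.numeric))

  # player page hrefs, excluding the match-log links
  player_links <- page %>%
    html_elements("table tbody a") %>%
    html_attr("href") %>%
    str_subset("/players/") %>%
    str_subset("/matchlogs/", negate = TRUE)

  players %>%
    mutate(player_link = paste0("https://fbref.com/", player_links)) %>%
    select(-rk, -age) %>%
    mutate(age = 2024 - as.numeric(str_sub(born, start = -4))) %>%
    select(-born) %>%
    select(player, age, everything())
}

Each page would then only need its own rename() block, e.g. gk_players <- scrape_fbref_page(gk_url) %>% rename(...), with the player link column already named player_link.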

Scrape General GK Stats Page

Read the live HTML tables from FBRef

gk_euro_tables <- read_html_live(gk_url) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

gk_players <- gk_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(gk_players) <- gk_players[1, ]
gk_players <- gk_players[-1, ]

Clean the column names and remove the rows with recurring column names

gk_players <- gk_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

gk_players <- gk_players %>%
  select(-matches)

Convert the numeric columns to numeric

gk_players <- gk_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(gk_url) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

gk_players <- gk_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

gk_players <- gk_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

gk_players <- gk_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

gk_players <- gk_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

gk_players <- gk_players %>%
  rename(
    position = pos,
    country = squad,
    matches_played = mp,
    minutes_played = min,
    played_90 = x90s,
    goals_against = ga,
    goals_against_per_90 = ga90,
    shots_on_target_against = so_ta,
    wins = w,
    draws = d,
    losses = l,
    clean_sheets = cs,
    clean_sheets_percentage = cs_percent,
    penalty_kicks_attempted = p_katt,
    penalty_kicks_allowed = pka,
    penalty_kicks_saved = p_ksv,
    penalty_kicks_missed = p_km,
    penalty_kicks_saved_percentage = save_percent_2,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

View the data

glimpse(gk_players)

Write the data to a csv file

write_csv(gk_players, "gk_players.csv")

Write the data to an Excel file

write.xlsx(gk_players, "gk_players.xlsx")

Scrape Advanced GK Stats Page

Read the live HTML tables from FBRef

adv_gk_euro_tables <- read_html_live(adv_gk_url) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

adv_gk_players <- adv_gk_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(adv_gk_players) <- adv_gk_players[1, ]
adv_gk_players <- adv_gk_players[-1, ]

Clean the column names and remove the rows with recurring column names

adv_gk_players <- adv_gk_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

adv_gk_players <- adv_gk_players %>%
  select(-matches)

Convert the numeric columns to numeric

adv_gk_players <- adv_gk_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(adv_gk_url) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

adv_gk_players <- adv_gk_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

adv_gk_players <- adv_gk_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

adv_gk_players <- adv_gk_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

adv_gk_players <- adv_gk_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

adv_gk_players <- adv_gk_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    goals_against = ga,
    penalty_kicks_allowed = pka,
    free_kick_goals_against = fk,
    corner_kick_goals_against = ck,
    own_goals_against = og,
    post_shot_xg = p_sx_g,
    post_shot_xg_per_shot_on_target = p_sx_g_so_t,
    post_shot_xg_excluding_conceded = p_sx_g_2,
    post_shot_xg_excluding_conceded_per90 = x90,
    passes_completed_over_40yrds = cmp,
    passes_attempted_over_40yrds = att,
    passes_completed_over_40yrds_percent = cmp_percent,
    passes_attempted = att_gk,
    throws_attempted = thr,
    passes_attempted_over_40yrds_percent = launch_percent,
    average_pass_length = avg_len,
    goal_kicks_attempted = att_2,
    goal_kicks_over_40yrds_percent = launch_percent_2,
    average_goal_kick_length = avg_len_2,
    crosses_faced = opp,
    crosses_stopped = stp,
    crosses_stopped_percent = stp_percent,
    number_of_defensive_actions_outside_penalty_area = number_opa,
    number_of_defensive_actions_outside_penalty_area_per90 = number_opa_90,
    distance_from_goal_of_all_defensive_actions_yrds = avg_dist,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

View the data

head(adv_gk_players)

Write the data to a csv file

write_csv(adv_gk_players, "adv_gk_players.csv")

Write the data to an Excel file

write.xlsx(adv_gk_players, "adv_gk_players.xlsx")

Scrape Shooting Stats Page

Read the live HTML tables from FBRef

shooting_euro_tables <- read_html_live(shooting_url) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

shooting_players <- shooting_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(shooting_players) <- shooting_players[1, ]
shooting_players <- shooting_players[-1, ]

Clean the column names and remove the rows with recurring column names

shooting_players <- shooting_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

shooting_players <- shooting_players %>%
  select(-matches)

Convert the numeric columns to numeric

shooting_players <- shooting_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(shooting_url) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

shooting_players <- shooting_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

shooting_players <- shooting_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

shooting_players <- shooting_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

shooting_players <- shooting_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

shooting_players <- shooting_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    goals = gls,
    shots = sh,
    shots_on_target = so_t,
    shots_on_target_percent = so_t_percent,
    shots_per90 = sh_90,
    shots_on_target_per90 = so_t_90,
    goals_per_shot = g_sh,
    goals_per_shot_on_target = g_so_t,
    average_shot_distance = dist,
    free_kick_shots = fk,
    penalty_scored = pk,
    penalty_kicks_attempted = p_katt,
    expected_goals = x_g,
    non_penalty_expected_goals = npx_g,
    non_penalty_expected_goals_per_shot = npx_g_sh,
    goals_minus_expected_goals = g_x_g,
    non_penalty_goals_minus_non_penalty_expected_goals = np_g_x_g,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(shooting_players, "shooting_players.csv")

Write the data to an Excel file

write.xlsx(shooting_players, "shooting_players.xlsx")

Scrape Passing Stats Page

Read the live HTML tables from FBRef

passing_euro_tables <- read_html_live(passing_stats) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

passing_players <- passing_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(passing_players) <- passing_players[1, ]
passing_players <- passing_players[-1, ]

Clean the column names and remove the rows with recurring column names

passing_players <- passing_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

passing_players <- passing_players %>%
  select(-matches)

Convert the numeric columns to numeric

passing_players <- passing_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(passing_stats) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

passing_players <- passing_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

passing_players <- passing_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

passing_players <- passing_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

passing_players <- passing_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

passing_players <- passing_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    passes_completed = cmp,
    passes_attempted = att,
    pass_completion_percent = cmp_percent,
    passing_distance_total = tot_dist,
    passing_distance_progressive = prg_dist,
    short_passes_completed = cmp_2,
    short_passes_attempted = att_2,
    short_passes_completion_percent = cmp_percent_2,
    medium_passes_completed = cmp_3,
    medium_passes_attempted = att_3,
    medium_passes_completion_percent = cmp_percent_3,
    long_passes_completed = cmp_4,
    long_passes_attempted = att_4,
    long_passes_completion_percent = cmp_percent_4,
    assists = ast,
    expected_assisted_goals = x_ag,
    expected_assists = x_a,
    assists_minus_expected_assisted_goals = a_x_ag,
    key_passes = kp,
    passes_into_final_third = x1_3,
    passes_into_penalty_area = ppa,
    crosses_into_penalty_area = crs_pa,
    progressive_passes = prg_p,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(passing_players, "passing_players.csv")

Write the data to an Excel file

write.xlsx(passing_players, "passing_players.xlsx")

Scrape Passing Types Stats Page

Read the live HTML tables from FBRef

types_passing_euro_tables <- read_html_live(passing_types_stats) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

types_passing_players <- types_passing_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(types_passing_players) <- types_passing_players[1, ]
types_passing_players <- types_passing_players[-1, ]

Clean the column names and remove the rows with recurring column names

types_passing_players <- types_passing_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

types_passing_players <- types_passing_players %>%
  select(-matches)

Convert the numeric columns to numeric

types_passing_players <- types_passing_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(passing_types_stats) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

types_passing_players <- types_passing_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

types_passing_players <- types_passing_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

types_passing_players <- types_passing_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

types_passing_players <- types_passing_players %>%
  select(-born) %>%
  select(player, age, everything())

As column 15 contains the inswinging corner kicks, we can rename it to make it more descriptive

colnames(types_passing_players)[15] <- "inswinging_corner_kicks"
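
A note of caution: renaming by position breaks silently if FBRef ever reorders the columns. Assuming clean_names() has left this column literally named in (a reserved word in R, hence the positional workaround), it can also be renamed by name with backticks:

types_passing_players <- types_passing_players %>%
  rename(inswinging_corner_kicks = `in`)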

Rename every other column to make them more descriptive

types_passing_players <- types_passing_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    passes_attempted = att,
    live_ball_passes = live,
    dead_ball_passes = dead,
    free_kick_passes = fk,
    through_balls_completed = tb,
    switches_over_40yrds = sw,
    crosses = crs,
    throw_ins_taken = ti,
    corner_kicks = ck,
    outswinging_corner_kicks = out,
    straight_corner_kicks = str,
    completed_passes = cmp,
    passes_offside = off,
    passes_blocked = blocks,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(types_passing_players, "types_passing_players.csv")

Write the data to an Excel file

write.xlsx(types_passing_players, "types_passing_players.xlsx")

Scrape Creation Stats Page

Read the live HTML tables from FBRef

creation_euro_tables <- read_html_live(creation_stats) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

creation_players <- creation_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(creation_players) <- creation_players[1, ]
creation_players <- creation_players[-1, ]

Clean the column names and remove the rows with recurring column names

creation_players <- creation_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

creation_players <- creation_players %>%
  select(-matches)

Convert the numeric columns to numeric

creation_players <- creation_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(creation_stats) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

creation_players <- creation_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

creation_players <- creation_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

creation_players <- creation_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

creation_players <- creation_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

creation_players <- creation_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    shot_creating_actions = sca,
    shot_creating_actions_per_90 = sca90,
    live_pass_shot_creating_actions = pass_live,
    dead_ball_shot_creating_actions = pass_dead,
    take_on_to_shot_creating_actions = to,
    shot_to_shot_creating_actions = sh,
    fouled_to_shot_creating_actions = fld,
    defensive_action_to_shot_creating_actions = def,
    goal_creating_actions = gca,
    goal_creating_actions_per_90 = gca90,
    live_pass_goal_creating_actions = pass_live_2,
    dead_ball_goal_creating_actions = pass_dead_2,
    take_on_to_goal_creating_actions = to_2,
    shot_to_goal_creating_actions = sh_2,
    fouled_to_goal_creating_actions = fld_2,
    defensive_action_to_goal_creating_actions = def_2,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(creation_players, "creation_players.csv")

Write the data to an Excel file

write.xlsx(creation_players, "creation_players.xlsx")

Scrape Defensive Stats Page

Read the live HTML tables from FBRef

defence_euro_tables <- read_html_live(defensive_stats) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

defence_players <- defence_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(defence_players) <- defence_players[1, ]
defence_players <- defence_players[-1, ]

Clean the column names and remove the rows with recurring column names

defence_players <- defence_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

defence_players <- defence_players %>%
  select(-matches)

Convert the numeric columns to numeric

defence_players <- defence_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(defensive_stats) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

defence_players <- defence_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

defence_players <- defence_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

defence_players <- defence_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

defence_players <- defence_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

defence_players <- defence_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    players_tackled = tkl,
    tackles_won = tkl_w,
    tackles_def_3rd = def_3rd,
    tackles_mid_3rd = mid_3rd,
    tackles_att_3rd = att_3rd,
    dribblers_tackled = tkl_2,
    dribbles_challenged = att,
    dribblers_tackled_success_percent = tkl_percent,
    dribblers_tackled_unsuccessful = lost,
    blocks = blocks,
    shots_blocked = sh,
    passes_blocked = pass,
    interceptions = int,
    tackles_plus_interceptions = tkl_int,
    clearances = clr,
    errors = err,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(defence_players, "defence_players.csv")

Write the data to an Excel file

write.xlsx(defence_players, "defence_players.xlsx")

Scrape Possession Stats Page

Read the live HTML tables from FBRef

possession_euro_tables <- read_html_live(possession_stats) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

possession_players <- possession_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(possession_players) <- possession_players[1, ]
possession_players <- possession_players[-1, ]

Clean the column names and remove the rows with recurring column names

possession_players <- possession_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

possession_players <- possession_players %>%
  select(-matches)

Convert the numeric columns to numeric

possession_players <- possession_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(possession_stats) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

possession_players <- possession_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

possession_players <- possession_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

possession_players <- possession_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

possession_players <- possession_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

possession_players <- possession_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    defensive_penalty_area_touches = def_pen,
    defensive_third_touches = def_3rd,
    middle_third_touches = mid_3rd,
    attacking_third_touches = att_3rd,
    attacking_penalty_area_touches = att_pen,
    live_ball_touches = live,
    take_ons_attempted = att,
    take_ons_successful = succ,
    take_ons_success_percent = succ_percent,
    tackled_during_take_on = tkld,
    tackled_during_take_on_percent = tkld_percent,
    total_dribbling_distance_yrds = tot_dist,
    progressive_dribbling_distance_yrds = prg_dist,
    progressive_carries = prg_c,
    carries_into_final_third = x1_3,
    carries_into_penalty_area = cpa,
    miscontrols = mis,
    dispossessed = dis,
    passes_received = rec,
    progressive_passes_received = prg_r,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(possession_players, "possession_players.csv")

Write the data to an Excel file

write.xlsx(possession_players, "possession_players.xlsx")

Scrape Playing Time Stats Page

Read the live HTML tables from FBRef

playing_time_euro_tables <- read_html_live(playing_time_stats) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

playing_time_players <- playing_time_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(playing_time_players) <- playing_time_players[1, ]
playing_time_players <- playing_time_players[-1, ]

Clean the column names and remove the rows with recurring column names

playing_time_players <- playing_time_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

playing_time_players <- playing_time_players %>%
  select(-matches)

Convert the numeric columns to numeric

playing_time_players <- playing_time_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(playing_time_stats) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

playing_time_players <- playing_time_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

playing_time_players <- playing_time_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

playing_time_players <- playing_time_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

playing_time_players <- playing_time_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

playing_time_players <- playing_time_players %>%
  rename(
    position = pos,
    country = squad,
    matches_played = mp,
    minutes_played = min,
    minutes_per_match_played = mn_mp,
    percent_of_minutes_played = min_percent,
    played_90 = x90s,
    minutes_per_match_started = mn_start,
    completed_matches = compl,
    substitute_appearances = subs,
    minutes_per_substitute_appearance = mn_sub,
    unused_substitute = un_sub,
    points_per_match = ppm,
    goals_scored = on_g,
    goals_conceded = on_ga,
    goal_difference = x,
    total_goal_difference_per_90 = x90,
    goal_difference_per_appearance = on_off,
    xg_while_on_the_pitch = onx_g,
    xg_against_while_on_the_pitch = onx_ga,
    xg_difference = x_g,
    xg_difference_per_90 = x_g_90,
    xg_difference_per_appearance = on_off_2,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(playing_time_players, "playing_time_players.csv")

Write the data to an Excel file

write.xlsx(playing_time_players, "playing_time_players.xlsx")

Scrape Misc Stats Page

Read the live HTML tables from FBRef

misc_euro_tables <- read_html_live(misc_stats) %>% 
  html_table()

Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data

misc_players <- misc_euro_tables %>%
  .[[3]] %>%
  as.data.frame()

Convert the 1st row to column names and remove it from the data

colnames(misc_players) <- misc_players[1, ]
misc_players <- misc_players[-1, ]

Clean the column names and remove the rows with recurring column names

misc_players <- misc_players %>%
  janitor::clean_names() %>%
  filter(player != "Player")

Remove the "matches" column as it is not needed

misc_players <- misc_players %>%
  select(-matches)

Convert the numeric columns to numeric

misc_players <- misc_players %>%
  mutate(across(c(1, 5:ncol(.)), as.numeric))

Scrape the player page links from the same page so that we can attach each player's FBRef URL

player_id <- read_html_live(misc_stats) %>%
  html_elements("table") %>%
  html_elements("tbody") %>%
  html_elements("a") %>% 
  html_attr("href") %>%
  as.data.frame() %>%
  setNames("url_info") %>% 
  # the links also include player match logs, so keep only the player pages
  mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0),
         get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
  filter(get_players == 1 & get_matchlogs == 0) %>%
  select(-get_players, -get_matchlogs) %>%
  # pull the player ID out of hrefs of the form /en/players/<id>/<name>
  mutate(player_id = sub(".*/players/([^/]+)/.*", "\\1", url_info))

Join the datasets together

misc_players <- misc_players %>%
  bind_cols(player_id) %>%
  mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
  rename(link_to_player_page = url_info)

As we don't need the 'rk' and 'age' columns, we can remove them to clean up the data into its final format

misc_players <- misc_players %>%
  select(-rk, -age)

Calculate a new 'age' column from the 'born' column, using 2024 as the reference year

misc_players <- misc_players %>%
  mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))

Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column

misc_players <- misc_players %>%
  select(-born) %>%
  select(player, age, everything())

Rename every column to make them more descriptive

misc_players <- misc_players %>%
  rename(
    position = pos,
    country = squad,
    played_90 = x90s,
    yellow_cards = crd_y,
    red_cards = crd_r,
    second_yellow_cards = x2crd_y,
    fouls_committed = fls,
    fouls_drawn = fld,
    offsides = off,
    crosses = crs,
    interceptions = int,
    tackles_won = tkl_w,
    penalty_kicks_won = p_kwon,
    penalty_kicks_conceded = p_kcon,
    own_goals = og,
    loose_ball_recoveries = recov,
    aerial_duels_won = won,
    aerial_duels_lost = lost,
    aerial_duels_won_percent = won_percent,
    player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
  select(-player_id)

Write the data to a csv file

write_csv(misc_players, "misc_players.csv")

Write the data to an Excel file

write.xlsx(misc_players, "misc_players.xlsx")

Combine all csv files into one

files <- list.files(pattern = "\\.csv$") # regex for names ending in .csv; delete any old all_players.csv first, or it will be read in too
all_players <- map_df(files, read_csv)

all_players <- all_players %>% select(player, player_link, position, country, 
                                      age, everything())

Replace all numeric NA values with 0, then keep a single row per player link

all_players <- all_players %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
all_players <- all_players %>% distinct(player_link, .keep_all = TRUE)
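
A caveat with this approach: map_df() stacks the eleven files on top of each other, so each player appears once per stats page, and distinct() then keeps only the first of those rows, leaving zeros where that file had no such column. If you would rather have one row per player carrying the columns from every page, here is a join-based sketch (note that stat columns shared between files, such as played_90, will come out with .x/.y suffixes that still need reconciling):

all_players_joined <- files %>%
  map(read_csv) %>%
  # merge on the identifying columns instead of stacking rows
  reduce(full_join, by = c("player", "player_link", "position", "country", "age")) %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))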

Write the combined data to a csv file

write_csv(all_players, "all_players.csv")

Write the combined data to an Excel file

write.xlsx(all_players, "all_players.xlsx")

You now have a nice shiny new dataset with all the player data from the EURO 2024 tournament, in both csv and Excel formats. Time to explore - enjoy!

Here is the Quarto document so that you can replicate every step.