EURO 24 Web Scraping: A tutorial of scraping EURO 24 Stats from FBRef by a complete beginner!
I've no doubt that this can be done more efficiently than I have done it but this is what I have learned so far. I hope this helps in some way and if you are a more experienced R user, I would love to hear your feedback on how I can improve this process - feel free to reach out to me.
Introduction to Web Scraping and Writing Data to CSV and Excel Files
Web scraping is a powerful technique used to extract data from websites, allowing you to gather information that might not be readily available in structured formats like databases or APIs. This data can then be cleaned and processed for analysis or storage in various formats, such as CSV (Comma-Separated Values) or Excel files. Using R
and its powerful packages such as rvest
, tidyverse
, stringr
, readr
, and openxlsx
, you can efficiently scrape, clean, and store data in formats suitable for analysis.
Attached is a lovely downloadable PDF file that documents each step!
I've no doubt that this can be done more efficiently than I have done it but this is what I have learned so far. I hope this helps in some way and if you are a more experienced R user, I would love to hear your feedback on how I can improve this process - feel free to reach out to me on Twitter or LinkedIn. Below is a step-by-step guide on how to scrape all stats pages from EURO 24 via the FBRef website and write it to an Excel file and a CSV file. To learn from the experts, please check out TidyX by Patrick Ward and Ellis Hughes on Youtube. Mladen Jovanović also has fantastic courses on his website Complementary Training.
Load libraries for Web Scraping
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
library(openxlsx)
Define the Live HTML URLs for each Stats Web Page from FBRef
general_stats_url <- "https://fbref.com/en/comps/676/stats/UEFA-Euro-Stats"
gk_url <- "https://fbref.com/en/comps/676/keepers/UEFA-Euro-Stats"
adv_gk_url <- "https://fbref.com/en/comps/676/keepersadv/UEFA-Euro-Stats"
shooting_url <- "https://fbref.com/en/comps/676/shooting/UEFA-Euro-Stats"
passing_stats <- "https://fbref.com/en/comps/676/passing/UEFA-Euro-Stats"
passing_types_stats <- "https://fbref.com/en/comps/676/passing_types/UEFA-Euro-Stats"
creation_stats <- "https://fbref.com/en/comps/676/gca/UEFA-Euro-Stats"
defensive_stats <- "https://fbref.com/en/comps/676/defense/UEFA-Euro-Stats"
possession_stats <- "https://fbref.com/en/comps/676/possession/UEFA-Euro-Stats"
playing_time_stats <- "https://fbref.com/en/comps/676/playingtime/UEFA-Euro-Stats"
misc_stats <- "https://fbref.com/en/comps/676/misc/UEFA-Euro-Stats"
Scrape General Stats Page
Read the live html tables from FBRef
standard_euro_tables <- read_html_live(general_stats_url) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
standard_players <- standard_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(standard_players) <- standard_players[1, ]
standard_players <- standard_players[-1, ]
Clean the column names and remove the rows with recurring column names
standard_players <- standard_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
standard_players <- standard_players %>%
select(-matches)
Convert the numeric columns to numeric
standard_players <- standard_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(general_stats_url) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
standard_players <- standard_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
standard_players <- standard_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
standard_players <- standard_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
standard_players <- standard_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
standard_players <- standard_players %>%
rename(
position = pos,
country = squad,
matches_played = mp,
minutes_played = min,
played_90 = x90s,
goals = gls,
assists = ast,
combined_goals_and_assists = g_a,
non_penalty_goals = g_pk,
penalties_scored = pk,
penalties_taken = p_katt,
yellow_cards = crd_y,
red_cards = crd_r,
expected_goals = x_g,
non_penalty_expected_goals = npx_g,
expected_assisted_goals = x_ag,
non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag,
progressive_carries = prg_c,
progressive_passes = prg_p,
progressive_passes_received = prg_r,
per90_goals = gls_2,
per90_assists = ast_2,
per90_goals_and_assists = g_a_2,
per90_non_penalty_goals = g_pk_2,
per90_non_penalty_goals_and_assists = g_a_pk,
per90_expected_goals = x_g_2,
per90_expected_assisted_goals = x_ag_2,
per90_expected_goals_and_assisted_goals = x_g_x_ag,
per90_non_penalty_expected_goals = npx_g_2,
per90_non_penalty_expected_goals_and_assisted_goals = npx_g_x_ag_2,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
View the final data
head(standard_players)
Write the data to a csv file
write_csv(standard_players, "standard_players.csv")
Write the data to an Excel file
write.xlsx(standard_players, "standard_players.xlsx")
Scrape General GK Stats Page
Read the live html tables from FBRef
gk_euro_tables <- read_html_live(gk_url) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
gk_players <- gk_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(gk_players) <- gk_players[1, ]
gk_players <- gk_players[-1, ]
Clean the column names and remove the rows with recurring column names
gk_players <- gk_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
gk_players <- gk_players %>%
select(-matches)
Convert the numeric columns to numeric
gk_players <- gk_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(gk_url) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
gk_players <- gk_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
gk_players <- gk_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
gk_players <- gk_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
gk_players <- gk_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
gk_players <- gk_players %>%
rename(
position = pos,
country = squad,
matches_played = mp,
minutes_played = min,
played_90 = x90s,
goals_against = ga,
goals_against_per_90 = ga90,
shots_on_target_against = so_ta,
wins = w,
draw = d,
losses = l,
clean_sheets = cs,
clean_sheets_percentage = cs_percent,
penalty_kicks_attempted = p_katt,
penalty_kicks_allowed = pka,
penalty_kicks_saved = p_ksv,
penalty_kicks_missed = p_km,
penalty_kicks_saved_percentage = save_percent_2,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
View the data
glimpse(gk_players)
Write the data to a csv file
write_csv(gk_players, "gk_players.csv")
Write the data to an Excel file
write.xlsx(gk_players, "gk_players.xlsx")
Scrape Advanced GK Stats Page
Read the live html tables from FBRef
adv_gk_euro_tables <- read_html_live(adv_gk_url) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
adv_gk_players <- adv_gk_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(adv_gk_players) <- adv_gk_players[1, ]
adv_gk_players <- adv_gk_players[-1, ]
Clean the column names and remove the rows with recurring column names
adv_gk_players <- adv_gk_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
adv_gk_players <- adv_gk_players %>%
select(-matches)
Convert the numeric columns to numeric
adv_gk_players <- adv_gk_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(adv_gk_url) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
adv_gk_players <- adv_gk_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
adv_gk_players <- adv_gk_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
adv_gk_players <- adv_gk_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
adv_gk_players <- adv_gk_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
adv_gk_players <- adv_gk_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
goals_against = ga,
penalty_kicks_allowed = pka,
free_kick_goals_against = fk,
corner_kick_goals_against = ck,
own_goals_against = og,
post_shot_xg = p_sx_g,
post_shot_xg_per_shot_on_target = p_sx_g_so_t,
post_shot_xg_excluding_conceded = p_sx_g_2,
post_shot_xg_excluding_conceded_per90 = x90,
passes_completed_over_40yrds = cmp,
passes_attempted_over_40yrds = att,
passes_completed_over_40yrds_percent = cmp_percent,
passes_attempted = att_gk,
throws_attempted = thr,
passes_attempted_over_40yrds_percent = launch_percent,
average_pass_length = avg_len,
goal_kicks_attempted = att_2,
goal_kicks_over_40yrds_percent= launch_percent_2,
average_goal_kick_length = avg_len_2,
crosses_faced = opp,
crosses_stopped = stp,
crosses_stopped_percent = stp_percent,
number_of_defensive_actions_outside_penalty_area = number_opa,
number_of_defensive_actions_outside_penalty_area_per90 = number_opa_90,
distance_from_goal_of_all_defensive_actions_yrds = avg_dist,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
View the data
head(adv_gk_players)
Write the data to a csv file
write_csv(adv_gk_players, "adv_gk_players.csv")
Write the data to an Excel file
write.xlsx(adv_gk_players, "adv_gk_players.xlsx")
Scrape Shooting Stats Page
Read the live html tables from FBRef
shooting_euro_tables <- read_html_live(shooting_url) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
shooting_players <- shooting_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(shooting_players) <- shooting_players[1, ]
shooting_players <- shooting_players[-1, ]
Clean the column names and remove the rows with recurring column names
shooting_players <- shooting_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
shooting_players <- shooting_players %>%
select(-matches)
Convert the numeric columns to numeric
shooting_players <- shooting_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(shooting_url) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
shooting_players <- shooting_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
shooting_players <- shooting_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
shooting_players <- shooting_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
shooting_players <- shooting_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
shooting_players <- shooting_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
goals = gls,
shots = sh,
shots_on_target = so_t,
shots_on_target_percent = so_t_percent,
shots_per90 = sh_90,
shots_on_target_per90 = so_t_90,
goals_per_shot = g_sh,
goals_per_shot_on_target = g_so_t,
average_shot_distance = dist,
free_kick_shots = fk,
penalty_scored = pk,
penalty_kicks_attempted = p_katt,
expected_goals = x_g,
non_penalty_expected_goals = npx_g,
non_penalty_expected_goals_shots = npx_g_sh,
goals_minus_expected_goals = g_x_g,
non_penalty_expected_goals_minus_expected_goals = np_g_x_g,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(shooting_players, "shooting_players.csv")
Write the data to an Excel file
write.xlsx(shooting_players, "shooting_players.xlsx")
Scrape Passing Stats Page
Read the live html tables from FBRef
passing_euro_tables <- read_html_live(passing_stats) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
passing_players <- passing_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(passing_players) <- passing_players[1, ]
passing_players <- passing_players[-1, ]
Clean the column names and remove the rows with recurring column names
passing_players <- passing_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
passing_players <- passing_players %>%
select(-matches)
Convert the numeric columns to numeric
passing_players <- passing_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(passing_stats) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
passing_players <- passing_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
passing_players <- passing_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
passing_players <- passing_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
passing_players <- passing_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
passing_players <- passing_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
passes_completed = cmp,
passes_attempted = att,
pass_completion_percent = cmp_percent,
passing_distance_total = tot_dist,
passing_distance_progressive = prg_dist,
short_passes_completed = cmp_2,
short_passes_attempted = att_2,
short_passes_completion_percent = cmp_percent_2,
medium_passes_completed = cmp_3,
medium_passes_attempted = att_3,
medium_passes_completion_percent = cmp_percent_3,
long_passes_completed = cmp_4,
long_passes_attempted = att_4,
long_passes_completion_percent = cmp_percent_4,
assists = ast,
expected_assisted_goals = x_ag,
expected_assists = x_a,
assists_minus_expected_assisted_goals = a_x_ag,
key_passes = kp,
passes_into_final_third = x1_3,
passes_into_penalty_area = ppa,
crosses_into_penalty_area = crs_pa,
progressive_passes = prg_p,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(passing_players, "passing_players.csv")
Write the data to an Excel file
write.xlsx(passing_players, "passing_players.xlsx")
Scrape Passing Types Stats Page
Read the live html tables from FBRef
types_passing_euro_tables <- read_html_live(passing_types_stats) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
types_passing_players <- types_passing_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(types_passing_players) <- types_passing_players[1, ]
types_passing_players <- types_passing_players[-1, ]
Clean the column names and remove the rows with recurring column names
types_passing_players <- types_passing_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
types_passing_players <- types_passing_players %>%
select(-matches)
Convert the numeric columns to numeric
types_passing_players <- types_passing_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(passing_types_stats) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
types_passing_players <- types_passing_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
types_passing_players <- types_passing_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
types_passing_players <- types_passing_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
types_passing_players <- types_passing_players %>%
select(-born) %>%
select(player, age, everything())
As column 15 contains the inswinging corner kicks, we can rename it to make it more descriptive
colnames(types_passing_players)[15] <- "inswinging_corner_kicks"
Rename every other column to make them more descriptive
types_passing_players <- types_passing_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
passes_attempted = att,
live_ball_passes = live,
dead_ball_passes = dead,
free_kick_passes = fk,
through_balls_completed = tb,
switches_over_40yrds = sw,
crosses = crs,
throws_ins_taken = ti,
corner_kicks = ck,
outswinging_corner_kicks = out,
straight_corner_kicks = str,
completed_passes = cmp,
passes_offside = off,
passes_blocked = blocks,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(types_passing_players, "types_passing_players.csv")
Write the data to an Excel file
write.xlsx(types_passing_players, "types_passing_players.xlsx")
Scrape Creation Stats Page
Read the live html tables from FBRef
creation_euro_tables <- read_html_live(creation_stats) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
creation_players <- creation_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(creation_players) <- creation_players[1, ]
creation_players <- creation_players[-1, ]
Clean the column names and remove the rows with recurring column names
creation_players <- creation_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
creation_players <- creation_players %>%
select(-matches)
Convert the numeric columns to numeric
creation_players <- creation_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(creation_stats) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
creation_players <- creation_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
creation_players <- creation_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
creation_players <- creation_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
creation_players <- creation_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
creation_players <- creation_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
shot_creating_actions = sca,
shot_creating_actions_per_90 = sca90,
live_pass_shot_creating_actions = pass_live,
dead_ball_shot_creating_actions = pass_dead,
take_on_to_shot = to,
shot_to_shot_creating_actions = sh,
fouled_to_shot_creating_actions = fld,
defensive_action_to_shot_creating_actions = def,
goal_creating_actions = gca,
goal_creating_actions_per_90 = gca90,
live_pass_goal_creating_actions = pass_live_2,
dead_ball_goal_creating_actions = pass_dead_2,
take_on_to_goal_creating_actions = to_2,
shot_to_goal_creating_actions = sh_2,
fouled_to_goal_creating_actions = fld_2,
defensive_action_to_goal_creating_actions = def_2,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(creation_players, "creation_players.csv")
Write the data to an Excel file
write.xlsx(creation_players, "creation_players.xlsx")
Scrape Defensive Stats Page
Read the live html tables from FBRef
defence_euro_tables <- read_html_live(defensive_stats) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
defence_players <- defence_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(defence_players) <- defence_players[1, ]
defence_players <- defence_players[-1, ]
Clean the column names and remove the rows with recurring column names
defence_players <- defence_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
defence_players <- defence_players %>%
select(-matches)
Convert the numeric columns to numeric
defence_players <- defence_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(defensive_stats) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
defence_players <- defence_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
defence_players <- defence_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
defence_players <- defence_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
defence_players <- defence_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
defence_players <- defence_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
players_tackled = tkl,
tackles_won = tkl_w,
tackles_def_3rd = def_3rd,
tackles_mid_3rd = mid_3rd,
tackles_att_3rd = att_3rd,
dribblers_tackled = tkl_2,
dribbles_challenged = att,
dribblers_tackled_success_percent = tkl_percent,
dribblers_tackled_unsuccessful = lost,
blocks = blocks,
shots_blocked = sh,
passes_blocked = pass,
interceptions = int,
tackles_plus_interceptions = tkl_int,
clearances = clr,
errors = err,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(defence_players, "defence_players.csv")
Write the data to an Excel file
write.xlsx(defence_players, "defence_players.xlsx")
Scrape Possession Stats Page
Read the live html tables from FBRef
possession_euro_tables <- read_html_live(possession_stats) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
possession_players <- possession_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(possession_players) <- possession_players[1, ]
possession_players <- possession_players[-1, ]
Clean the column names and remove the rows with recurring column names
possession_players <- possession_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
possession_players <- possession_players %>%
select(-matches)
Convert the numeric columns to numeric
possession_players <- possession_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(possession_stats) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
possession_players <- possession_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
possession_players <- possession_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
possession_players <- possession_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
possession_players <- possession_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
possession_players <- possession_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
defensive_penalty_area_touches = def_pen,
defensive_third_touches = def_3rd,
middle_third_touches = mid_3rd,
attacking_third_touches = att_3rd,
attacking_penalty_area_touches = att_pen,
live_ball_touches = live,
take_ons_attempted = att,
take_ons_successful = succ,
take_ons_success_percent = succ_percent,
tackled_during_take_on = tkld,
tackled_during_take_on_percent = tkld_percent,
total_dribbling_distance_yrds = tot_dist,
progressive_dribbling_distance_yrds = prg_dist,
progressive_carries = prg_c,
carries_into_final_third = x1_3,
carries_into_penalty_area = cpa,
miscontrols = mis,
dispossessed = dis,
passes_recieved = rec,
progressive_passes_recieved = prg_r,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(possession_players, "possession_players.csv")
Write the data to an Excel file
write.xlsx(possession_players, "possession_players.xlsx")
Scrape Playing Time Stats Page
Read the live html tables from FBRef
playing_time_euro_tables <- read_html_live(playing_time_stats) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
playing_time_players <- playing_time_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(playing_time_players) <- playing_time_players[1, ]
playing_time_players <- playing_time_players[-1, ]
Clean the column names and remove the rows with recurring column names
playing_time_players <- playing_time_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
playing_time_players <- playing_time_players %>%
select(-matches)
Convert the numeric columns to numeric
playing_time_players <- playing_time_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(playing_time_stats) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
playing_time_players <- playing_time_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
playing_time_players <- playing_time_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
playing_time_players <- playing_time_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
playing_time_players <- playing_time_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
playing_time_players <- playing_time_players %>%
rename(
position = pos,
country = squad,
matches_played = mp,
minutes_played = min,
minutes_per_match_played = mn_mp,
percent_of_minutes_played = min_percent,
played_90 = x90s,
minutes_per_match_started = mn_start,
completed_matches = compl,
substitute_appearances = subs,
minutes_per_substitute_appearance = mn_sub,
unused_substitute = un_sub,
points_per_match = ppm,
goals_scored = on_g,
goals_conceded = on_ga,
goal_difference = x,
total_goal_difference_per_90 = x90,
goal_difference_per_appearance = on_off,
xg_while_on_the_pitch = onx_g,
xg_against_while_on_the_pitch = onx_ga,
xg_difference = x_g,
xg_difference_per_90 = x_g_90,
xg_difference_per_appearance = on_off_2,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(playing_time_players, "playing_time_players.csv")
Write the data to an Excel file
write.xlsx(playing_time_players, "playing_time_players.xlsx")
Scrape Misc Stats Page
Read the live html tables from FBRef
misc_euro_tables <- read_html_live(misc_stats) %>%
html_table(fill = T)
Scrape the table of interest (3rd in this case) and convert it to a data frame - this has the individual player data
misc_players <- misc_euro_tables %>%
.[[3]] %>%
as.data.frame()
Convert the 1st row to column names and remove it from the data
colnames(misc_players) <- misc_players[1, ]
misc_players <- misc_players[-1, ]
Clean the column names and remove the rows with recurring column names
misc_players <- misc_players %>%
janitor::clean_names() %>%
filter(player != "Player")
Remove the "matches" column as it is not needed
misc_players <- misc_players %>%
select(-matches)
Convert the numeric columns to numeric
misc_players <- misc_players %>%
mutate(across(c(1, 5:ncol(.)), as.numeric))
Scrape the player ids from the player links in the live html table
player_id <- read_html_live(misc_stats) %>%
html_nodes("table") %>%
html_nodes("tbody") %>%
html_elements("a") %>%
html_attr("href") %>%
as.data.frame() %>%
setNames("url_info") %>%
# to this point the url also contains the player matchlogs so we need to filter out those
mutate(get_players = ifelse(grepl(pattern = '/players/', url_info), 1, 0)) %>%
mutate(get_matchlogs = ifelse(grepl(pattern = '/matchlogs/', url_info), 1, 0)) %>%
filter(get_players == 1 & get_matchlogs == 0) %>%
select(-get_players) %>%
select(-get_matchlogs) %>%
mutate(player_id = gsub("\\..*","", url_info),
player_id = gsub(".*/[a-z]/","", player_id))
Join the datasets together
misc_players <- misc_players %>%
bind_cols(player_id) %>%
mutate(url_info = paste0("https://fbref.com/", url_info)) %>%
rename(link_to_player_page = url_info)
As we don't need the 'rk' and 'age' columns, we can remove it to clean up the data into its final format
misc_players <- misc_players %>%
select(-rk, -age)
Calculate an new 'age' column from the 'born' column to 2024
misc_players <- misc_players %>%
mutate(age = 2024 - as.numeric(str_sub(born, start = -4)))
Remove the 'born' column as it is no longer needed and position the 'age' column next to the 'player' column
misc_players <- misc_players %>%
select(-born) %>%
select(player, age, everything())
Rename every column to make them more descriptive
misc_players <- misc_players %>%
rename(
position = pos,
country = squad,
played_90 = x90s,
yellow_cards = crd_y,
red_cards = crd_r,
second_yellow_cards = x2crd_y,
fouls_committed = fld,
fouls_drawn = fls,
offsides = off,
crosses = crs,
interceptions = int,
tackles_won = tkl_w,
penalty_kicks_won = p_kwon,
penalty_kicks_conceded = p_kcon,
own_goals = og,
loose_ball_recoveries = recov,
aerial_duels_won = won,
aerial_duels_lost = lost,
aerial_duels_won_percent = won_percent,
player_link = link_to_player_page) %>% # remove 'player_id' as it is no longer needed
select(-player_id)
Write the data to a csv file
write_csv(misc_players, "misc_players.csv")
Write the data to an Excel file
write.xlsx(misc_players, "misc_players.xlsx")
Combine all csv files into one
files <- list.files(pattern = "*.csv")
all_players <- map_df(files, read_csv)
Move 'player_link' column to after 'player' column and Move 'age' column to after 'country' column
all_players <- all_players %>% select(player, player_link, position, country,
age, everything())
Replace all numeric NA values with 0
all_players <- all_players %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
Remove all rows with dublicated player_link
all_players <- all_players %>% distinct(player_link, .keep_all = TRUE)
Write the combined data to a csv file
write_csv(all_players, "all_players.csv")
Write the combined data to an Excel file
write.xlsx(all_players, "all_players.xlsx")
Now you have a nice shiny new dataset with all the player data from the Euro 2024 tournament in both csv and Excel formats. Now to explore - Enjoy!
Here is the Quarto document so that you can replicate every step.