Athletics rankings

athletics

Using R to gather data from an athletics rankings website.

Published

March 8, 2020

As a keen athlete I spent many hours on athletics rankings website https://www.thepowerof10.info. For athletics nerds, the site is fantastic and provides all the information you need to keep up to date with the latest results. However, I wondered if it might be possible to explore some alternative ways of presenting the results…

Getting the data

Initially, I had to figure out a way to first gather the data from the power of 10 website. As you can see in the screenshot below, the rankings are already in a table format, so thankfully it wasn’t too difficult to figure something out.

knitr::include_graphics("images/powerof10homepage.PNG")

Figure from https://www.thepowerof10.info

This was my first time using the rvest package, so I’m sure this code can be improved somewhat… It does, however, seem to do the trick. To start, I identified a url from one of the ranking pages that would serve as a starting point to access the site.

url <- "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2012"

To access other pages it was just a case of supplying new values to the arguments in the url. For example to get the 200m rankings I could just swap the “event=100” for “event=200” in the string. To do this, I wrote a simple function that would generate a vector of strings with the desired events, years and for male or female athletes.

library(tidyverse)
library(zoo)
library(knitr)
library(lubridate)

urlmaker <- function(years, events, gender) {
  url <- "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2012"
  baseevent <- as.character(str_extract_all(url, "[0-9]+")[[1]][[2]])
  baseyear <- as.character(str_extract_all(url, "[0-9]+")[[1]][[3]])
  newyear <- rep(years, times = length(events))
  list_of_years <- str_replace(url, baseyear, newyear)
  newevent <- rep(events, each = length(years))
  yearandevent <- str_replace(list_of_years, baseevent, newevent)

  sex <- str_c("sex=", gender, sep = "")
  yearandevent <- str_replace(yearandevent, "sex=M", sex)
  return(yearandevent)
}

I then created a string of events and years I wanted to scrape. In this instance I only wanted male athletes, so also set gender to “M”. The result was a vector of urls as character strings that I could use to access the necessary pages.

events <- c("100", "200", "400", "800", "1500", "3000", "5000", "10000")
years <- as.character(seq(from = 2006, to = 2018, by = 1))
gender <- "M"

# create list of urls to use for scraping
urls <- urlmaker(years, events, gender)

print(head(urls, 3))

[1] "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2006"
[2] "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2007"
[3] "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2008"

Next, I had to figure out a way to actually get the data. Inspecting the html code on the page helped to identify the labels for the tables that I needed.

knitr::include_graphics("images/po10_window.png")

Once I knew how the different elements on the page were identified, it was as easy as copying and pasting the table id into html_nodes() and converting to a table with html_table(). After that it was just the usual wrangling you’d expect with a messy data frame.

library(rvest)

readtable <- function(url) {
  main <- read_html(url)
  rankings <- main %>%
    html_nodes(xpath = '//*[@id="cphBody_lblCachedRankingList"]/table') %>%
    html_table() %>%
    data.frame() %>%
    select(1:13) %>%
    set_names(c("rank", 
                "perf", 
                "windy", 
                "windspeed", 
                "PB", 
                "newpb", 
                "name", 
                "agegroup", 
                "month_year", 
                "coach", 
                "club", 
                "venue", 
                "date")) %>%
    mutate(
      year = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[3]]),
      event = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[2]])
    ) %>%
    mutate_at(vars(name), list(~ replace(., . == "", NA))) %>%
    mutate(name = na.locf(name))
}

Now I simply had to pass the url to the function, and we’ll have the table from the webpage stored as a dataframe.

ranks <- readtable(url)

ranks %>%
  slice(4:6) %>%
  select(rank, perf, name, year, club, venue) %>% 
  kable()

rank	perf	name	year	club	venue
1	10.02	Dwain Chambers	2012	Belgrave	Olympic Park
2	10.05	Adam Gemili	2012	Blackheath & Bromley	Barcelona, ESP
3	10.13	James Dasaolu	2012	Croydon	Olympic Park

If I wanted to scrape multiple urls at once I would use map_df. Here I can pass my vector of urls and map_df will perform the scraping function readtable on each element in the vector and append the result together (this can take a while to complete so I’ll just use the first 3 urls from our vector of 104 urls).

urls_short <- urls[1:3]

male_rankings <- urls_short %>%
  map_df(readtable)

Now we can view the top ranked athlete for each year.

male_rankings %>%
  filter(rank == "1") %>%
  select(year, rank, perf, name, club, venue) %>% 
  kable()

year	rank	perf	name	club	venue
2006	1	10.07	Dwain Chambers	Belgrave	Gateshead
2007	1	10.06	Marlon Devonish	Coventry	Lausanne, SUI
2008	1	10.00	Dwain Chambers	Belgrave	Birmingham

Individual athletes

The PO10 also provides detailed performance history for individual athletes. I wanted to be able to have access to this data as well, however I wanted to avoid downloading every individual athletes data to disk as I imagine that may have taken a while…

Instead the solution I came up with was a function get each individual athlete’s unique identifier number. With this number I could get each athletes individual rankings when required using an “on the fly” scrape, as a single athletes page is not very much data at all.

The screenshot below shows how the code that I needed to extract

knitr::include_graphics("images/po10id.png")

The function below collects the athltes name, unique url, year and event.

athleteurl <- function(url) {
  main <- read_html(url)
  athleteinfo <- tibble(
    name = html_text(html_nodes(main, "td:nth-child(7) a"), "href"),
    athleteurl = html_attr(html_nodes(main, "td:nth-child(7) a"), "href"),
    year = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[3]]),
    event = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[2]])
  ) %>%
    filter(name != "")
}

ids <- athleteurl(url)

head(ids)

# A tibble: 6 × 4
  name                   athleteurl                               year event
  <chr>                  <chr>                                   <dbl> <dbl>
1 Dwain Chambers         /athletes/profile.aspx?athleteid=31816   2012   100
2 Adam Gemili            /athletes/profile.aspx?athleteid=208735  2012   100
3 James Dasaolu          /athletes/profile.aspx?athleteid=22721   2012   100
4 Harry Aikines-Aryeetey /athletes/profile.aspx?athleteid=19988   2012   100
5 Mark Lewis-Francis     /athletes/profile.aspx?athleteid=21139   2012   100
6 James Alaka            /athletes/profile.aspx?athleteid=22255   2012   100

In another function I join these two tables together so I had one data frame that had all the results and rankings as well as each athletes individual id that I could use to get their individual data. I also appended the full address to each individual athletes id.

finaljoin <- function(ranks, ids, gender) {
  yeartimes <- ranks %>%
    group_by(year, event) %>%
    filter(str_detect(rank, "[:alpha:]")) %>%
    select(year, rank) %>%
    filter(str_detect(rank, "^UK")) %>%
    separate(rank, c("topn", "timing"), ": ") %>%
    spread(key = topn, value = timing) %>%
    ungroup()
  cleanrankings <- ranks %>%
    filter(!str_detect(rank, "[:alpha:]")) %>%
    inner_join(., ids, by = c("name", "year", "event")) %>%
    mutate(athleteurl = paste("https://www.thepowerof10.info", athleteurl, sep = ""))

  cleanrankings <- left_join(cleanrankings, yeartimes, by = c("event", "year")) %>%
    janitor::clean_names() %>%
    mutate(gender = gender)
}

This left me with a complete dataframe I could work with, with the options of getting individual athletes data as needed,

clean_ranks <- finaljoin(ranks = ranks, ids = ids, gender = "M")

Adding missing grouping variables: `event`

clean_ranks %>%
  head() %>%
  select(rank, perf, name, year, club, venue, athleteurl) %>% 
  kable()

rank	perf	name	year	club	venue	athleteurl
1	10.02	Dwain Chambers	2012	Belgrave	Olympic Park	https://www.thepowerof10.info/athletes/profile.aspx?athleteid=31816
2	10.05	Adam Gemili	2012	Blackheath & Bromley	Barcelona, ESP	https://www.thepowerof10.info/athletes/profile.aspx?athleteid=208735
3	10.13	James Dasaolu	2012	Croydon	Olympic Park	https://www.thepowerof10.info/athletes/profile.aspx?athleteid=22721
4	10.20	Harry Aikines-Aryeetey	2012	Sutton & District	Rovereto, ITA	https://www.thepowerof10.info/athletes/profile.aspx?athleteid=19988
5	10.21	Mark Lewis-Francis	2012	Birchfield H	Mesa AZ, USA	https://www.thepowerof10.info/athletes/profile.aspx?athleteid=21139
6	10.22	James Alaka	2012	Blackheath & Bromley	Eugene OR, USA	https://www.thepowerof10.info/athletes/profile.aspx?athleteid=22255

The final function I used was to scrape and clean each individuals rankings “on the fly”. The only input is an athletes individual url. The result was another dataframe that contains an individual athletes history of performances.

individual <- function(athlete) {
  history <- athlete %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id = "cphBody_pnlPerformances"]/table') %>%
    html_table(fill = TRUE) %>%
    .[[2]] %>%
    select(-c(X4, X5, X8, X9)) %>%
    set_names(c("event", 
                "perf", 
                "indoor", 
                "position", 
                "heat", 
                "venue", 
                "meeting", 
                "date")) %>%
    filter(!str_detect(event, "[:alpha:]")) %>%
    mutate(
      date = dmy(date),
      year = substr(date, 1, 4),
      perf_time = case_when(
        str_detect(perf, ":") == FALSE ~ str_c("00:00:", perf),
        str_length(str_split_fixed(perf, ".", 4)[4]) == 3 ~ str_c(perf, "0"),
        str_length(perf) == 6 | str_length(perf) == 7 ~ str_c("00:0", perf),
        str_length(perf) == 8 | str_length(perf) == 9 ~ str_c("00:", perf),
        TRUE ~ perf
      )
    )
  name <- read_html(athlete)
  name <- html_text(html_nodes(name, css = "h2"), trim = TRUE)
  history <- history %>%
    mutate(name = name)
  return(history)
}

athlete <- "https://www.thepowerof10.info/athletes/profile.aspx?athleteid=31816"

individual(athlete) %>%
  head(5) %>%
  select(name, event, perf, position, venue, meeting, date) %>% 
  kable()

name	event	perf	position	venue	meeting	date
Dwain Chambers	100	10.91	4	Sportcity	Müller British Championships inc. Invitation 10000m	2021-06-25
Dwain Chambers	60	6.73	1	Lee Valley	South of England AA U20 / Senior Championships	2020-02-01
Dwain Chambers	60	6.74	1	Lee Valley	South of England AA U20 / Senior Championships	2020-02-01
Dwain Chambers	60	6.76	2	Lee Valley	London U20 / Senior Games	2020-01-19
Dwain Chambers	60	6.77	5	Lee Valley	Scienhealth Athletics Invitational	2020-01-05

So that’s it! I am currently in the process of creating an interactive dashboard to visualise these results. You can see an early version here:

https://harryfish.shinyapps.io/resultsdashboard/

Lots of work left to do, but any comments or feedback are always welcome. The source code is on my github if you want to try anything out.

Thanks for reading!