What do people mean when they say “The Big Four”? In tennis, there is a single answer: Roger Federer, Rafael Nadal, Novak Djokovic and Andy Murray. These 4 players have dominated the world of men’s tennis throught the beginning of the 21st century. To demonstrate, one of them has one every Grand Slam tournament from 2004 to 2016 (and counting) except 7. That’s 45 wins out of 13 years x 4 Grand Slams a year = 52 tournaments, or 87%!

Wikipedia has a really nice chart to visualize this: Notice how if one of the Big Four won a Grand Slam the cell is green and, well, the entire chart from 2004 to 2016 is green!

However - this chart does not tell the whole story. Winning isn’t everything. Reaching Semi-Finals (“SF”) is also quite impressive, as well as reaching the Final 16 (“4R”). How do we visualize this? First I was thinking of a heatmap - a heatmap is a great way to display “matrix data”. That is 3D data, in which 2 of the dimensions (the row and column) are discrete or qualitative, and the third is continuous or quantitative (the color or “heat”). In our case:

• Row Variable: Grand Slam (4 rows: AU-Open, Roland-Garros, Wimbledon, US-Open)
• Column Variable: Player (4 columns: Federer, Nadal, Djokovic, Murray)
• Heat Variable: Result (9 options: NA, 1R, 2R, 3R, 4R, QF, SF, F, W)1

But what to do with the 4th variable, that is the year?… That’s where Motion Charts come in handy. A Motion Chart is excellent for a situation in which you want to present how a static view of some data changes over time or some other dimension. The classic example, and where I first encountered this type of charts, is Hans Rosling’s visualization of the Gapminder data.

# Serving You Data since 1999

These data have many sources, I’m going to go straight to the “oracle” - the ATP World Tour official site. A quick look shows that the data are pretty well structured. Each tournament has its own id, each Grand Slam in each year has its own domain, see for example Roland-Garros 2016. We see a single table in which the first player under the title “Finals” is the Winner. Then, in the next column the first 127 players are the rest of the Final 128. Should be simple to get all tables x 4 Grand Slams x 18 years (1999-2016), and see the Big Four’s position in each tournament. Using the htmltabs package and some trial and error I got to this function:

library(htmltab)
library(dplyr)

rankings <- c(8, rep(7:1, 2^(0:6)))
rank_labels <- c("1R", "2R", "3R", "4R", "QF", "SF", "F", "W")

get128Rankings <- function(tournament, id, year) {
url <- paste0("http://www.atpworldtour.com/en/scores/archive/", tournament, "/",
id, "/", year, "/results")
htmlTab <- htmltab(url, which = "//th[text() = 'Finals']/ancestor::table",
rm_nodata_cols = F)
winner <- htmlTab[1, 3]
players <- c(winner, htmlTab[1:127, 7])
return(tibble(tournament = tournament, year = year, player = players,
rank = rankings, rank_label = rank_labels[rankings]))
}

See an example with Rolland-Garros 2016:

get128Rankings("roland-garros", 520, 2016)
## # A tibble: 128 x 5
##    tournament     year player                 rank rank_label
##    <chr>         <dbl> <chr>                 <dbl> <chr>
##  1 roland-garros  2016 Novak Djokovic            8 W
##  2 roland-garros  2016 Andy Murray               7 F
##  3 roland-garros  2016 Dominic Thiem             6 SF
##  4 roland-garros  2016 Stan Wawrinka             6 SF
##  5 roland-garros  2016 Tomas Berdych             5 QF
##  6 roland-garros  2016 Richard Gasquet           5 QF
##  7 roland-garros  2016 Albert Ramos-Vinolas      5 QF
##  8 roland-garros  2016 David Goffin              5 QF
##  9 roland-garros  2016 Roberto Bautista Agut     4 4R
## 10 roland-garros  2016 John Isner                4 4R
## # ... with 118 more rows

Notice I’m using rank and rank_label. The rank_label column is the finish rank as we know it. The rank column is the finish rank coded to a number between 1 (“1R” or Final 128) and 8 (“W” or Winner). If one of the Big Four players did not finish in the Last 128 he will later get the NA value as rank and rank_label. Let us get 18 years x 4 Grand Slams x 128 players = 9,216 rows of data, with pmap_df function from the purrr package. This function will quickly map every combination of tournament, id and year values into the get128Rankings function, and return a data.frame:

library(purrr)

tournaments <- c("australian-open", "roland-garros", "wimbledon", "us-open")

tib <- tibble(tournament = rep(tournaments, 18),
id = rep(c(580, 520, 540, 560), 18),
year = rep(1999:2016, each = 4))

grand_slams_full_rankings_1999_2016 <- pmap_df(tib, get128Rankings)

dim(grand_slams_full_rankings_1999_2016)
## [1] 9216    5

Finally, we only need the data for the Big Four, and in case one of them has not finished in the Final 128 of a tournament we would like to complete this row with the value NA as rank and rank_label (this is for later use):

TennisBig4 <- c("Roger Federer", "Rafael Nadal", "Novak Djokovic", "Andy Murray")

library(tidyr)
GS_Big4 <- grand_slams_full_rankings_1999_2016 %>%
filter(player %in% TennisBig4) %>%
select(year, player, tournament, rank, rank_label) %>%
complete(year, player, tournament, fill = list(rank = NA, rank_label = NA))

dim(GS_Big4)
## [1] 288   5

Sanity check: 18 years x 4 Grand Slams x 4 players = 288 rows. We’re cool.

# Heatmap Smash

For the “heatmap in motion” I will use the Motion plugin from the wonderful highcharter package, by Jushua Kunst. highcharter is an R wrapper for the Javascript Highcharts library. And Jushua Kunst himself helped me with achieving this “heatmap in motion” chart. We prepare the data first:

library(highcharter)

# getting the data for the inital heatmap, in 1999
first_year <- GS_Big4 %>%
filter(year == 1999) %>%
group_by(player, tournament)

# highcharter needs for each tournament/player combination a list of 18 years
# holding the player's rank (now: value) and rank_label (now: rank_label)
year_sequences <- GS_Big4 %>%
group_by(player, tournament) %>%
do(sequence = list_parse(select(., value = rank, value_label = rank_label)))

# the first year data joined to the sequences data
final_data <- left_join(first_year, year_sequences)

# knowing the limits of the rank/value, which are by convention 1 to 8
limits <- unlist(final_data$sequence)[names(unlist(final_data$sequence)) == "value"] %>%
as.numeric() %>%
{ c(min(., na.rm = T), max(., na.rm = T))}

# a little JS function to represent the tool tip
fntltp <- JS("function(){
return this.series.xAxis.categories[this.point.x] + ', ' +
this.series.yAxis.categories[this.point.y] + ':<br>' +
'Finished: ' + this.point.value_label;
}")

Now for the hchart function itself. Notice I disable the legend, because it will present numbers from 0 to about 8, which is meaningless. If you want to know the finish rank for a specific player in a specific tournament, you need to hover over the relevant cell and read the tool tip.

hchart(final_data, type = "heatmap", hcaes(x = player, y = tournament, value = rank)) %>%
hc_title(text = "Tennis Big 4 1999-2016 Grand Slam Domination",
style = list(fontSize = '24px')) %>%
hc_motion(enabled = TRUE, series = 0, startIndex = 0,
labels = unique(GS_Big4$year)) %>% hc_legend(enabled = FALSE) %>% hc_colorAxis(min = limits[1], max = limits[2]) %>% hc_tooltip(formatter = fntltp) %>% hc_xAxis(reversed = TRUE, labels = list(style = list(fontSize = '14px'))) %>% hc_yAxis(labels = list(style = list(fontSize = '14px'))) %>% hc_size(height = 450) %>% hc_exporting(enabled = TRUE) # Game, Set and Match Well, this was fun. Notice how when pressing the slider’s play button and watching the animation, you get a sense of how Federer “rose to power”, then Nadal showed up, “power” leaked to his direction, then to the direction of Djokovic and Murray. With Federer’s win at the Australian Open in 2017, amazingly the story isn’t over yet, and I just might have to update this “heatmap in motion”, to add more years! ### UPDATE: 2017-05-24 After seeing this post from David Smith I was like OMFG, I have to make an animation of this motion heatmap with gganimate and tweenr! See comments in the code: library(gganimate) #NOTE! Youlll need this fork for the added functionality of the title_frame paramter: #devtools::install_github('nteetor/gganimate') library(tweenr) library(magrittr) library(ggplot2) # imputing 0 instead of NA because it looks better in animation transitions #GS_Big4$rank[is.na(GS_Big4$rank)] <- 0 TennisBig4Short <- c("Federer", "Nadal", "Djokovic", "Murray") tournamentsShort <- c("AU-Open", "FR-Open", "Wimbledon", "US-Open") GS_Big4 %<>% replace_na(list(rank = 0)) %>% mutate(player = plyr::mapvalues(player, TennisBig4, TennisBig4Short), tournament = plyr::mapvalues(tournament, tournaments, tournamentsShort)) # supplying a list of data.frames by year, for the tween_states function # the second list will be used later, see comment on tween_states issue GS_Big4_List1 <- split(GS_Big4 %>% select(year, rank), f = GS_Big4$year)
GS_Big4_List2 <- split(GS_Big4 %>% select(player, tournament, rank_label), f = GS_Big4\$year)

# the tween_states function for interpolating between ranks
# I tried inputting the entire data.frame including the player and tournament columns
# but for some reason tween_states chooses to see player as a color column and complains...
GS_Big4_tween_Part1 <- tween_states(GS_Big4_List1, 3, 1, 'cubic-in-out', 300)

# function for replicating a data.frame, I need this to get back
# the player and tournament columns to the GS_Big4_tween data.frame
replicateDF <- function(df, n) {df[rep(seq(nrow(df)),n),]}

# knowing how many times to replicate the first n - 1 years
playerTournamentsComb <- length(TennisBig4) * length(tournaments)
repNMinus1 <- GS_Big4_tween_Part1 %>% count(floor(year)) %>%
select(n) %>% slice(1) %>% unlist() / playerTournamentsComb

# the first n - 1 years replicated
GS_Big4_tween_Part2_A <- lapply(GS_Big4_List2[1:(length(GS_Big4_List2) - 1)], replicateDF, repNMinus1)

# knowing how many times to replicate the last year
repN <- GS_Big4_tween_Part1 %>% count(floor(year)) %>%
select(n) %>% slice(length(GS_Big4_List2)) %>% unlist() / playerTournamentsComb

# the last year replicated
GS_Big4_tween_Part2_B <- lapply(GS_Big4_List2[length(GS_Big4_List2)], replicateDF, repN)

# finalizing the second tween part
GS_Big4_tween_Part2 <- rbind(do.call("rbind", GS_Big4_tween_Part2_A), GS_Big4_tween_Part2_B[[1]])

# uniting the tween parts
GS_Big4_tween <- cbind(GS_Big4_tween_Part1, GS_Big4_tween_Part2)

# ggplot by frame
p <- ggplot(GS_Big4_tween, aes(factor(player,
levels = c("Federer", "Nadal", "Djokovic", "Murray")),
tournament, frame = year)) +
geom_tile(aes(fill = rank)) +
scale_fill_continuous(low = "white", high = "blue",
labels = rank_labels, breaks = 1:8, limits = c(0,8)) +
theme(panel.background = element_rect(fill="white"),
panel.grid.major = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(hjust = 0.5, size = 20),
axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 15),
legend.text=element_text(size = 15),
legend.key.heigh = unit(1, "cm")) +
labs(x = "", y = "", fill = "")

# again, note this is the old gg_animate version from a fork, not the new gganimate
gg_animate(p, filename = "tennisBig4.gif", ani.width = 700, interval = 0.1,
title_frame = ~paste("The Tennis Big 4 Grand Slam Domination:", floor(.)))

1. where “1R” is Final 128, “2R” is Final 64, …, “QF” is Last 8, “SF” is Last 4, “F” is Runner-up and “W” is the Winner.