I was looking for a “mellow” project to make, when I came across this post by Michael Kling, playing in Python with the Billboard Hot 100 charts since 1940, scraped from the Ultimate Music Database site. Now that is something I need to do in R. The project, as you will see below, turned out to be not so mellow, as more and more questions piled up for this ~350K rows dataset. From small-sized questions, e.g. “Which is the top performing song on Billboard?”, to medium-sized questions, e.g. “Is there a difference between words used for songs titles in the 1940s and in the 1990s?”, to large-sized questions, e.g. “Which are the top 1 songs which performed differently than most top 1 songs?”. Enjoy!

Get The Party Started

Since this post is not about scraping, I won’t go into details. But let me tell you, the UMD site’s html does not come without its challenges. I will point out this is the first time I’ve used the rvest package and it was great:

library(tidyverse)
library(stringr)
library(rvest)
library(lubridate)

getURLFromDate <- function(date) {
  paste0("http://www.umdmusic.com/default.asp?Lang=English&Chart=D&ChDate=",
         str_replace_all(date, "-", ""))
}

getTableFromDate <- function(date) {
  url <- getURLFromDate(date)
  df <- read_file(url) %>%
    gsub(pattern = "<br>", replacement = "|", .) %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[10]] %>%
    html_table(fill = TRUE) %>%
    slice(3:n()) %>%
    as_tibble()
  
  colnames(df) <- c("ThisWeekPosition", "LastWeekPosition", "TW_PeekPosition", "TW_TotalWeeks",
                    "TitleArtist", "EntryDate", "EntryPosition", "OV_PeekPosition", "OV_TotalWeeks")
  
  df %>%
    separate(TitleArtist, c("Title", "Artist"), sep = "\\|") %>%
    mutate_each(funs(str_trim), c(Title, Artist)) %>%
    mutate_each(funs(str_replace_all(., pattern = "\\s+", replacement = " ")),
                c(Title, Artist)) %>%
    mutate_each(funs(as.numeric), c(ThisWeekPosition, TW_PeekPosition, TW_TotalWeeks,
                                    EntryPosition, OV_PeekPosition, OV_TotalWeeks)) %>%
    mutate(EntryDate = as.Date(EntryDate),
           date = date,
           year = year(date),
           month = month(date),
           day = month(date))
}

saturdays <- seq.Date(as.Date("1940-07-20"), as.Date("2017-04-29"), "7 days")

otherwise <- data.frame(
  ThisWeekPosition = NA,
  LastWeekPosition = NA,
  TW_PeekPosition = NA,
  TW_TotalWeeks = NA,
  Title = NA,
  Artist = NA,
  EntryDate = NA,
  EntryPosition = NA,
  OV_PeekPosition = NA,
  OV_TotalWeeks = NA,
  date = NA, year = NA, month = NA, day = NA)

billboard <- map_df(saturdays, possibly(getTableFromDate, otherwise))

EDIT: only after getting the entire history of the Billboard Hot 100 (BBH100 from now on…) I saw this post from Bob Rudis, and he’s right. The above scraping method isn’t very polite, and I should use Sys.sleep() between calls to the website.

How many rows, how many columns did we get?

dim(billboard)
## [1] 354975     14
library(knitr)
billboard %>%
  select(Artist, Title, date, EntryDate, ThisWeekPosition, OV_TotalWeeks) %>%
  arrange(date) %>%
  filter(row_number() <= 3 | row_number() >= n() - 2) %>%
  kable()
Artist Title date EntryDate ThisWeekPosition OV_TotalWeeks
TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS I’ll Never Smile Again 1940-07-20 1940-07-20 1 15
JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY The Breeze And I 1940-07-20 1940-07-20 2 9
GLENN MILLER & HIS ORCHESTRA / RAY EBERLE Imagination 1940-07-20 1940-07-20 3 3
ALESSIA CARA How Far I’ll Go 2017-04-29 2016-12-31 98 18
DARIUS RUCKER If I Told You 2017-04-29 2017-04-29 99 1
COLE SWINDELL Flatliner 2017-04-29 2017-04-29 100 1

So we have ~355K rows, 14 columns (e.g. a song’s Artist, Title, ThisWeekPosition, OV_TotalWeeks which is the overall no. of consecutive weeks the song has stayed in the charts for a single EntryDate). We can also see I got dates from July 20 1940 to April 29 2017.

Trouble in Paradise

So this looked pretty easy, but there are actually a few problems with this dataset.

First the Artists names are not unified for some reason, as pointed out by Michael Kling with the Beatles example:

billboard %>%
  select(Artist, Title) %>%
  filter(str_detect(Artist, "BEATLES")) %>%
  unique() %>%
  group_by(Artist) %>%
  summarise(nSongs = n_distinct(Title)) %>%
  arrange(Artist)
## # A tibble: 6 × 2
##                           Artist nSongs
##                            <chr>  <int>
## 1                        BEATLES     49
## 2        BEATLES & TONY SHERIDAN      1
## 3                    DIE BEATLES      1
## 4                    THE BEATLES     42
## 5 THE BEATLES with BILLY PRESTON      2
## 6 THE BEATLES with TONY SHERIDAN      2

You can see here that about half of The Beatles songs appear with “BEATLES” and the other half with “THE BEATLES”, not to mention “DIE BEATLES” from their German version of She Loves You. But this is an issue we can solve, a more problematic issue is: how do we get R to understand that the Artist “BEATLES & TONY SHERIDAN” is in fact also The Beatles. And an even more problematic issue is: how do we get R to understand that “BEATLES & TONY SHERIDAN” and “THE BEATLES with TONY SHERIDAN” is the same Artist?! This may be possible with some careful separation to many artists using separators such as “&”, “with” and “feat.” - but for now we’ll just handle the general “THE” problem, and make “THE” go away:

billboard %<>%
  mutate(Artist = str_replace(Artist, "THE ", ""))

I like this example as well:

billboard %>%
  select(Artist, Title, EntryDate) %>%
  filter(str_detect(Artist, "JA RULE feat")) %>%
  arrange(Artist) %>%
  unique()
## # A tibble: 12 × 3
##                                    Artist              Title  EntryDate
##                                     <chr>              <chr>     <date>
## 1                    JA RULE feat ASHANTI          Mesmerize 2002-12-28
## 2   JA RULE feat CHARLI "CHUCK" BALTIMORE     Down A** Chick 2002-04-13
## 3         JA RULE feat R. KELLY & ASHANTI          Wonderful 2004-10-16
## 4  JA RULE feat. CHARLI "CHUCK" BALTIMORE     Down A** Chick 2002-04-13
## 5               JA RULE featuring ASHANTI     Always On Time 2001-11-17
## 6               JA RULE featuring ASHANTI          Mesmerize 2002-12-28
## 7           JA RULE featuring BOBBY BROWN        Thug Lovin' 2002-11-23
## 8                  JA RULE featuring CASE       Livin' It Up 2001-09-01
## 9                  JA RULE featuring CASE       Living It Up 2001-09-01
## 10     JA RULE featuring CHRISTINA MILIAN Between Me And You 2000-09-16
## 11              JA RULE featuring LIL' MO              I Cry 2001-05-05
## 12       JA RULE featuring LIL' MO & VITA       Put It On Me 2000-12-23

This shows you that even songs titles can sometimes appear in multiple formats (“Livin’ It Up” and “Living It Up”) and of course, as with The Beatles the same Artist is sometimes “JA RULE feat ASHANTI” and sometimes “JA RULE featuring ASHANTI”.

And another problem:

billboard %>%
  filter(Title == "-")
## # A tibble: 48 × 14
##    ThisWeekPosition LastWeekPosition TW_PeekPosition TW_TotalWeeks Title
##               <dbl>            <chr>           <dbl>         <dbl> <chr>
## 1                15              NEW              15             1     -
## 2                18              NEW              18             1     -
## 3                20              NEW              20             1     -
## 4                16              NEW              16             1     -
## 5                20              NEW              20             1     -
## 6                13              NEW              13             1     -
## 7                14              NEW              14             1     -
## 8                15              NEW              15             1     -
## 9                17              NEW              17             1     -
## 10               20              NEW              20             1     -
## # ... with 38 more rows, and 9 more variables: Artist <chr>,
## #   EntryDate <date>, EntryPosition <dbl>, OV_PeekPosition <dbl>,
## #   OV_TotalWeeks <dbl>, date <date>, year <dbl>, month <dbl>, day <dbl>

For some reason we have a few songs with “-” for Artist and Title… Let’s just get rid of those:

billboard %<>%
  filter(Title != "-")

And yet another problem, is that the Billboard Hot 100 weren’t really 100 until 1958!

billboard %>%
  count(date) %>%
  ggplot(aes(x = date, y = n)) +
  geom_bar(stat = "identity", fill = "red", alpha = 0.5) +
  ggtitle("Billboard Hot 100: No. of Top Songs per Week over Time")

At the start of this dataset, the 1940s, it’s just 10 top songs per week. At the middle, during the 1960s, it’s over 130 top songs per week. Only in 1984 the BBH100 stabilized on exactly 100 top songs per week. We’ll take this under consideration.

Do I Wanna Know?

Let’s start off with some basic questions we’re all dying to know. How many unique songs?

billboard %>%
  select(Artist, Title) %>%
  unique() %>%
  nrow()
## [1] 35501

How many unique songs per decade?

billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarise(n = n_distinct(Title)) %>%
  ggplot(aes(x = decade, y = n)) +
  geom_bar(stat = "identity", fill = "green", alpha = 0.5) +
  ggtitle("Billboard Hot 100: No. of Unique Songs by Decade")

Ah, but remember it wasn’t really 100 until 1958, and that it went above 100 for certain weeks before 1984. Let’s take songs only from 1960 to 2010, and cut off all songs ranking above 100:

billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  filter(between(decade, 1960, 2000), ThisWeekPosition <= 100) %>%
  group_by(decade) %>%
  summarise(n = n_distinct(Title)) %>%
  ggplot(aes(x = decade, y = n)) +
  geom_bar(stat = "identity", fill = "green", alpha = 0.5) +
  ggtitle("Billboard Hot 100: No. of Unique Songs by Decade")

The effect is somewhat diminished but still: The 1960s hit chart was extremely varied, with over 6K unique songs reaching the top 100, and this seems to be decreasing with time. In the 1990s and 2000s only ~3K songs (just over half) reach the top 100. I’d be careful to reach conclusions regarding taste and variation based on this, though, as the method for ranking songs has changed multiple times over the years.

Is there a “seasonality” effect to this number, e.g. more songs during christmas time or any other month?

billboard %>%
  filter(ThisWeekPosition <= 100,
         between(date, ymd("1941-01-01"), ymd("2016-12-31"))) %>%
  mutate(month = month(month, label = TRUE)) %>%
  group_by(month) %>%
  summarise(n = n_distinct(Title)) %>%
  ggplot(aes(x = month, y = n)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.5) +
  ggtitle("Billboard Hot 100: No. of Unique Songs by Month")

Nope. I’ll save you the Chi-Square Test :)

How many unique artists?

billboard %>%
  select(Artist) %>%
  unique() %>%
  unlist() %>%
  length()
## [1] 11511

So that’s over 11K artists, but rememebr we’re counting “LIL WAYNE featuring NICKI MINAJ” as a single Artist, different from either “LIL WAYNE” or “NICKI MINAJ”, so this number is an upper limit at best.

How many unique artists per decade? Does this pattern replicate the unique no. of songs phenomenon?

billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  filter(between(decade, 1960, 2000), ThisWeekPosition <= 100) %>%
  group_by(decade) %>%
  summarise(n = n_distinct(Artist)) %>%
  ggplot(aes(x = decade, y = n)) +
  geom_bar(stat = "identity", fill = "green", alpha = 0.5) +
  ggtitle("Billboard Hot 100: No. of Unique Artist by Decade")

This is interesting: although the variety of artists reaching the top 100 has decreased like the variety of songs since the 1960s, (a) it hasn’t decreased that much, from 2.2K to 1.8K and (b) there is a “dip” during the 1980s - the decade with the least varied group of artists…

So how many songs on average could a top artist expect to enter the BBH100?

billboard %>%
    mutate(decade = (year %/% 10) * 10) %>%
    filter(between(decade, 1960, 2000), ThisWeekPosition <= 100) %>%
    group_by(decade, Artist) %>%
    summarise(nTitles = n_distinct(Title)) %>%
    group_by(decade) %>%
    summarise(titlesPerArtist = mean(nTitles)) %>%
    ggplot(aes(x = decade, y = titlesPerArtist)) +
    geom_bar(stat = "identity", fill = "purple", alpha = 0.5) +
    ggtitle("Billboard Hot 100: Average No. of Songs per Artist by Decade")

If you worked during the 1960s: 3. Nowadays: 2. But notice this is the mean, the median in all decades is actually 1 song per artist.

More Than Words

You know what comes next… Text Mining! With the lovely tidytext package, of course.

Guess what’s the no. 1 word in songs titles in each decade:

library(tidytext)

billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  select(decade, Title) %>%
  unique() %>%
  unnest_tokens(word, Title) %>%
  anti_join(stop_words) %>%
  group_by(decade) %>%
  count(word) %>%
  top_n(1)
## Source: local data frame [8 x 3]
## Groups: decade [8]
## 
##   decade  word     n
##    <dbl> <chr> <int>
## 1   1940  love    28
## 2   1950  love   222
## 3   1960  love   672
## 4   1970  love   601
## 5   1980  love   469
## 6   1990  love   310
## 7   2000  love   144
## 8   2010  love   128

Shocker. This can also be clearly seen if we draw a wordcloud for each decade:

library(wordcloud)

wordsPerDecade <- billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  select(decade, Title) %>%
  unique() %>%
  unnest_tokens(word, Title) %>%
  anti_join(stop_words) %>%
  group_by(decade, word) %>%
  summarise(n = n())

wcDecade <- function(dec) {
  wordsPerDecade %>%
    filter(decade == dec) %>%
    with(wordcloud(word, n, max.words = 30, scale=c(3, .2),
                   colors= c("indianred1","indianred2","indianred3","indianred")))
}

par(mfcol = c(2,4))
decades <- seq(1940, 2010, 10)
for (dec in decades) wcDecade(dec)

par(mfcol = c(1,1))

Sorry, I just had to do that. But you can’t really get anything off wordclouds can you?

billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  select(decade, Title) %>%
  unique() %>%
  unnest_tokens(word, Title) %>%
  anti_join(stop_words) %>%
  group_by(decade) %>%
  count(word) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = factor(decade))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~decade, scales = "free_y") +
  labs(y = "No. of appearances",
       x = NULL) +
  coord_flip() +
  ggtitle("Billboard Hot 100: Top 5 Words in Titles by Decade")

That’s more like it. First, although there are much less songs in the 1940s, it seems like the gap between the word “Love” and the following words is the smallest. This is in contrast to the 1960s through 1980s, when the word “Love” dominated the most. Nowadays (2000s, 2010s) the word “Love” is still in the lead but not by a lot, like in the 1940s. Also note words which appear in the top 5 only in specific decades: “Blue” in the 1940s, “Rock” in the 1950s, “Woman” in the 1970s (Go Gloria Steinem!), “Night” in the 1980s and 2010s, and “Wanna” in the 2000s.

So it seems there was much more variety with word choice in the 1940s - is that the case? If we want to check this, we need to control the fact that there are simply less songs in the 1940s and 1950s. Let’s see the number of distinct words in songs in a decade, divided by the no. of songs in that decade. Hopefully this would help in eliminating the bias of no. of songs:

titlesPerDecade <- billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarise(nTitles = n_distinct(Title))
  
billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  select(decade, Title) %>%
  unique() %>%
  unnest_tokens(word, Title) %>%
  anti_join(stop_words) %>%
  group_by(decade) %>%
  summarise(nWords = n_distinct(word)) %>%
  inner_join(titlesPerDecade, "decade") %>%
  mutate(wordsPerTitle = nWords / nTitles) %>%
  ggplot(aes(x = decade, y = wordsPerTitle)) +
  geom_bar(stat = "identity", fill = "orange", alpha = 0.5) +
  ggtitle("Billboard Hot 100: No. of Distinct Words / No. of Songs by Decade")

Oh my. So the 1940s is in fact the only decade where for each new song we get on average over 1 new word. This ratio has declined rapidly until the 1960s and keeps going up since then. But there might be a ceiling effect here, i.e. the more songs we encounter the lower the chance of encountering new words. And the 1940s decade only has 500 songs. So what if we sample 500 songs from each decade to control for this possible ceiling effect?

billboard %>%
  mutate(decade = (year %/% 10) * 10) %>%
  select(decade, Title) %>%
  unique() %>%
  group_by(decade) %>%
  sample_n(500) %>%
  unnest_tokens(word, Title) %>%
  anti_join(stop_words) %>%
  group_by(decade) %>%
  summarise(nWords = n_distinct(word)) %>%
  mutate(wordsPerTitle = nWords / 500) %>%
  ggplot(aes(x = decade, y = wordsPerTitle)) +
  geom_bar(stat = "identity", fill = "orange", alpha = 0.5) +
  ggtitle("Billboard Hot 100: No. of Distinct Words / No. of Songs, 500 Songs per Decade")

Nice! We see the 1940s effect diminished. The 1940s and 1950s had a wider vocabulary for songs titles, this decreased until the 1980s/1990s and it is increasing since then.

The Most Beautiful Girl In The World

Let’s talk success.

library(padr)

plotSongCurve <- function(artist, title) {
  billboard %>%
    filter(Artist == artist, Title == title) %>%
    select(date, ThisWeekPosition) %>%
    pad(by = "date") %>%
    fill_by_value(ThisWeekPosition, value = 100) %>%
    ggplot(aes(x = date, y = ThisWeekPosition)) +
    geom_step(color = "navyblue") +
    scale_y_reverse(lim = c(100, 0)) +
    ggtitle(paste("Song Curve for", title, "by", artist))
}

plotSongCurve("BEYONCE", "Single Ladies (Put A Ring On It)")

What makes Beyonce’s Single Ladies “successful”? The length of the period it stayed on charts. The total area under its “Success Curve”. The “velocity” it got to its peak position. Let’s see the top 10 BBH100 songs by each of these metrics, and let’s limit ourselves to the 1960s onwards, to avoid the bias of less songs before that.

Top Songs by Max Period On Chart

options(width = 100)

calculateSongMaxPeriod <- function(dates) {
  max(dates) - min(dates)
}

calculateSongMaxPeriodByEntryDates <- function(dates) {
  data.frame(dates) %>%
    group_by(X1) %>%
    summarise(maxPeriod = map_dbl(list(X2), calculateSongMaxPeriod)) %>%
    select(maxPeriod) %>%
    sum() %>%
    unlist()
}

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist, Title) %>%
  summarise(dates = list(cbind(EntryDate, date))) %>%
  mutate(maxPeriodInDays = map_dbl(dates, calculateSongMaxPeriodByEntryDates)) %>%
  select(Artist, Title, maxPeriodInDays) %>%
  arrange(-maxPeriodInDays)
## Source: local data frame [31,613 x 3]
## Groups: Artist [10,340]
## 
##                                       Artist               Title maxPeriodInDays
##                                        <chr>               <chr>           <dbl>
## 1                            IMAGINE DRAGONS         Radioactive             595
## 2                                 AWOLNATION                Sail             539
## 3                                 JASON MRAZ           I'm Yours             525
## 4                                LeANN RIMES       How Do I Live             476
## 5                                ONEREPUBLIC      Counting Stars             469
## 6  LMFAO featuring LAUREN BENNETT & GOONROCK   Party Rock Anthem             462
## 7                                      ADELE Rolling In The Deep             441
## 8                           CARRIE UNDERWOOD    Before He Cheats             441
## 9                                  LUMINEERS              Ho Hey             427
## 10                                 LIFEHOUSE            You & Me             420
## # ... with 31,603 more rows

I won’t go into the details of this calculation, only note that it’s not simply max(date) - min(date) - that would give biased results towards “comeback” songs, e.g. when an artist’s song has fallen off the charts and is back again due to the artist’s death years later.

Really? Radioactive by Imagine Dragons is the longest running song on BBH100? Yes:

plotSongCurve("IMAGINE DRAGONS", "Radioactive")

Top Songs by Area Under Success Curve, or Integral

options(width = 100)

calculateSongIntegral <- function(positions) {
  sum(100 - positions)
}

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist, Title) %>%
  summarise(positions = list(ThisWeekPosition)) %>%
  mutate(integral = map_dbl(positions, calculateSongIntegral)) %>%
  group_by(Artist, Title) %>%
  tally(integral) %>%
  arrange(-n)
## Source: local data frame [31,613 x 3]
## Groups: Artist [10,340]
## 
##                                       Artist                        Title     n
##                                        <chr>                        <chr> <dbl>
## 1                            IMAGINE DRAGONS                  Radioactive  5845
## 2                                LeANN RIMES                How Do I Live  5546
## 3                                 JASON MRAZ                    I'm Yours  5357
## 4  LMFAO featuring LAUREN BENNETT & GOONROCK            Party Rock Anthem  5283
## 5                                ONEREPUBLIC               Counting Stars  5011
## 6           MARK RONSON featuring BRUNO MARS                 Uptown Funk!  4907
## 7                                      ADELE          Rolling In The Deep  4893
## 8               SANTANA featuring ROB THOMAS                       Smooth  4780
## 9                     GOTYE featuring KIMBRA Somebody That I Used To Know  4617
## 10                           BLACK EYED PEAS              I Gotta Feeling  4577
## # ... with 31,603 more rows

We definitely see a correlation between the above two metrics.

Top Songs by Success “Velocity”

Let’s see the fastest and slowest songs to reach no. 1:

options(width = 100)

calculateTop1SongVelocity <- function(positions) {
  df <- data.frame(positions)
  colnames(df) <- c("date", "entryDate", "position")
  firstDateIndexTop1 <- which(df$position == 1)[1]
  velocity <- df[firstDateIndexTop1, "date"] - df[firstDateIndexTop1, "entryDate"]
  velocity <- floor(velocity / 7)
  if (length(velocity) > 0) {
    return(velocity)
  } else {
    return(NA)
  }
}

top1SongsByVelocity <- billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist, Title) %>%
  summarise(positions = list(cbind(date, EntryDate, ThisWeekPosition))) %>%
  mutate(velocity = map_dbl(positions, calculateTop1SongVelocity)) %>%
  select(Artist, Title, velocity) %>%
  filter(!is.na(velocity)) %>%
  arrange(velocity)

top1SongsByVelocity %>% head(10)
## Source: local data frame [10 x 3]
## Groups: Artist [9]
## 
##              Artist                                                        Title velocity
##               <chr>                                                        <chr>    <dbl>
## 1             ADELE                                                        Hello        0
## 2         AEROSMITH                                 I Don't Want To Miss A Thing        0
## 3            BAAUER                                                 Harlem Shake        0
## 4    BRITNEY SPEARS                                                            3        0
## 5    BRITNEY SPEARS                                           Hold It Against Me        0
## 6  CARRIE UNDERWOOD                                           Inside Your Heaven        0
## 7       CELINE DION                                          My Heart Will Go On        0
## 8        CLAY AIKEN                                            This Is The Night        0
## 9        ED SHEERAN                                                 Shape Of You        0
## 10       ELTON JOHN Candle In The Wind 1997/Something About The Way You Look Ton        0
top1SongsByVelocity %>% tail(10)
## Source: local data frame [10 x 3]
## Groups: Artist [10]
## 
##                       Artist                     Title velocity
##                        <chr>                     <chr>    <dbl>
## 1                 ROSE ROYCE                  Car Wash       19
## 2              WALTER MURPHY      A Fifth Of Beethoven       19
## 3                NICK GILDER     Hot Child In The City       20
## 4  OUTKAST feat SLEEPY BROWN          The Way You Move       20
## 5                ROBERT JOHN                  Sad Eyes       20
## 6                   VANGELIS Chariots Of Fire - Titles       21
## 7    SIA featuring SEAN PAUL             Cheap Thrills       22
## 8           VERTICAL HORIZON       Everything You Want       25
## 9                      CREED       With Arms Wide Open       26
## 10               JOHN LEGEND                 All Of Me       26

We see many songs which reached no. 1 in 0 weeks, obviously Hello by Adele. What’s more interesting is a “sleeper hit” like John Legend’s All Of Me, which took half a year to reach no. 1!

plotSongCurve("JOHN LEGEND", "All Of Me")

Best I Ever Had

Now let’s rank artists. We can define an artist’s success as the sum of her songs successes.

Top Artists by Sum of Max Periods On Chart

options(width = 100)

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist, Title) %>%
  summarise(dates = list(cbind(EntryDate, date))) %>%
  mutate(maxPeriod = map_dbl(dates, calculateSongMaxPeriodByEntryDates)) %>%
  group_by(Artist) %>%
  tally(maxPeriod) %>%
  arrange(-n)
## # A tibble: 10,340 × 2
##             Artist     n
##              <chr> <dbl>
## 1       ELTON JOHN  5922
## 2          MADONNA  5712
## 3    ELVIS PRESLEY  5698
## 4     TAYLOR SWIFT  4921
## 5       TIM McGRAW  4683
## 6    KENNY CHESNEY  4669
## 7      ROD STEWART  4284
## 8    STEVIE WONDER  4270
## 9      KEITH URBAN  4123
## 10 ARETHA FRANKLIN  4004
## # ... with 10,330 more rows

Top Artists by Sum of Integrals

options(width = 100)

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist, Title) %>%
  summarise(positions = list(ThisWeekPosition)) %>%
  mutate(integral = map_dbl(positions, calculateSongIntegral)) %>%
  group_by(Artist) %>%
  tally(integral) %>%
  arrange(-n)
## # A tibble: 10,340 × 2
##             Artist     n
##              <chr> <dbl>
## 1          MADONNA 56434
## 2       ELTON JOHN 54910
## 3    ELVIS PRESLEY 52250
## 4     TAYLOR SWIFT 48896
## 5     MARIAH CAREY 45079
## 6    STEVIE WONDER 41727
## 7          BEATLES 41285
## 8  MICHAEL JACKSON 39089
## 9      ROD STEWART 37171
## 10         RIHANNA 37154
## # ... with 10,330 more rows

It’s nice to see Madonna, Elton John and Elvis Presley kick Taylor Swift’s butt. What about the artist with the most songs on BBH100?

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  select(Artist, Title) %>%
  unique() %>%
  count(Artist) %>%
  arrange(-n)
## # A tibble: 10,340 × 2
##             Artist     n
##              <chr> <int>
## 1        GLEE CAST   167
## 2    ELVIS PRESLEY   109
## 3      RAY CHARLES    78
## 4  ARETHA FRANKLIN    77
## 5      JAMES BROWN    75
## 6          BEATLES    70
## 7     TAYLOR SWIFT    65
## 8       ELTON JOHN    61
## 9            DRAKE    60
## 10      BEACH BOYS    58
## # ... with 10,330 more rows

Glee Cast?! OK, the artist with the most no. 1 songs:

billboard %>%
  filter(EntryDate >= date_decimal(1960), ThisWeekPosition == 1) %>%
  select(Artist, Title) %>%
  unique() %>%
  count(Artist) %>%
  arrange(-n)
## # A tibble: 677 × 2
##             Artist     n
##              <chr> <int>
## 1          BEATLES    19
## 2     MARIAH CAREY    15
## 3          MADONNA    12
## 4  MICHAEL JACKSON    11
## 5  WHITNEY HOUSTON    11
## 6         SUPREMES    10
## 7         BEE GEES     9
## 8    JANET JACKSON     8
## 9   ROLLING STONES     8
## 10         RIHANNA     7
## # ... with 667 more rows

The Beatles. Thank you.

Baby Got Back

Can we spot a Comeback? Let’s define an artist’s Comeback as having 15 years gap between two songs on chart:

isCombackFunc <- function(entryDatesList, combackDiffInYears) {
  any(diff(sort(entryDatesList)) > 365 * combackDiffInYears)
}

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist) %>%
  summarise(entryDatesList = list(unique(EntryDate))) %>%
  mutate(isComeback = map_lgl(entryDatesList, isCombackFunc, 15)) %>%
  filter(isComeback) %>%
  select(Artist) %>%
  unlist() %>%
  unname()
##  [1] "A TRIBE CALLED QUEST"           "AARON NEVILLE"                 
##  [3] "ANDY WILLIAMS"                  "ANGEL CITY"                    
##  [5] "ANIMALS"                        "BANG"                          
##  [7] "BARRY WHITE"                    "BILLY VERA"                    
##  [9] "BLONDIE"                        "BOBBY HELMS"                   
## [11] "BRENDA LEE"                     "BURL IVES"                     
## [13] "CHIMES"                         "CHIPMUNKS"                     
## [15] "CONTOURS"                       "DARLENE LOVE"                  
## [17] "DAVID BOWIE"                    "EAZY-E"                        
## [19] "ELTON JOHN & KIKI DEE"          "ELVIS PRESLEY"                 
## [21] "EVERLY BROTHERS"                "FLOYD CRAMER"                  
## [23] "FRANKIE MILLER"                 "FROST"                         
## [25] "GEORGE MICHAEL"                 "GLEN CAMPBELL"                 
## [27] "JAGGED EDGE"                    "JET"                           
## [29] "JIMMY CLIFF"                    "JOHN LENNON & PLASTIC ONO BAND"
## [31] "JOSE FELICIANO"                 "K.D. LANG"                     
## [33] "KENNY ROGERS"                   "LEE GREENWOOD"                 
## [35] "LITTLE CAESAR"                  "LOUIS ARMSTRONG"               
## [37] "MITCH RYDER"                    "MONKEES"                       
## [39] "NARVEL FELTS"                   "NAT KING COLE"                 
## [41] "NEIGHBOURHOOD"                  "NIGHTCRAWLERS"                 
## [43] "O'JAYS"                         "OZZY OSBOURNE"                 
## [45] "PATTI LaBELLE"                  "PRINCE AND REVOLUTION"         
## [47] "QUEEN & DAVID BOWIE"            "RIGHTEOUS BROTHERS"            
## [49] "SERGIO MENDES"                  "SILK"                          
## [51] "STEVE WARINER"                  "STOMPERS"                      
## [53] "TEENA MARIE"                    "TOKENS"                        
## [55] "TONY ORLANDO"                   "VAN HALEN"                     
## [57] "VARIOUS ARTISTS"                "WANTED"                        
## [59] "WHAM!"                          "WILLIE NELSON"

We get a lot of names, notice George Michael and Wham!, which are considered as Comebacks since after George Michael’s death a few of his songs popped back in the chart.

What about Comeback songs? I mean songs which go away and return to the charts, so they would have multiple EntryDates:

options(width = 100)

billboard %>%
  group_by(Artist, Title) %>%
  summarise(n = n_distinct(EntryDate)) %>%
  filter(n >= 2) %>%
  arrange(-n)
## Source: local data frame [3,275 x 3]
## Groups: Artist [2,180]
## 
##                                      Artist                             Title     n
##                                       <chr>                             <chr> <int>
## 1                               BING CROSBY                   White Christmas    17
## 2                               BOBBY HELMS                  Jingle Bell Rock     7
## 3                                BRENDA LEE Rockin' Around The Christmas Tree     6
## 4                              MARIAH CAREY   All I Want For Christmas Is You     6
## 5                                   RIHANNA                       Sex With Me     6
## 6                                     ADELE              Set Fire To The Rain     5
## 7                                      AKON                             Angel     5
## 8  AMY WINEHOUSE featuring GHOSTFACE KILLAH              You Know I'm No Good     5
## 9                   BEYONCE featuring JAY-Z                         Upgrade U     5
## 10                              BUDDY CLARK                   Peg O' My Heart     5
## # ... with 3,265 more rows

The top 4 here are christmas songs which makes sense, but what about Sex With Me by Rihanna?…

plotSongCurve("RIHANNA", "Sex With Me")

What about Cover songs? A Cover song would repeat a Title with a different Artist:

billboard %>%
  group_by(Title) %>%
  summarise(n = n_distinct(Artist)) %>%
  filter(n >= 2) %>%
  arrange(-n)
## # A tibble: 4,559 × 2
##         Title     n
##         <chr> <int>
## 1     Hold On    19
## 2         You    15
## 3       Angel    14
## 4       Happy    14
## 5  I Love You    14
## 6     Runaway    14
## 7        Stay    14
## 8       Crazy    13
## 9     Forever    13
## 10      Smile    13
## # ... with 4,549 more rows

Now, I wanted to do a lot with Cover songs, e.g. indicating Cover songs which succeeded more than the original but, as you can see, there’s no way of telling whether the 19 (!) artists with the song “Hold On” are singing the same song or not… So, I’ll skip that.

Under The Bridge

Lastly before I do something more challenging: how do we indicate the “underdog” songs - songs that are just there, never quite reaching the top? I’ll look for songs which never reached the top 10, but stayed for a long time:

top10Songs <- billboard %>%
  filter(EntryDate >= date_decimal(1960), ThisWeekPosition <= 10) %>%
  mutate(Artist = str_replace(Artist, "THE ", "")) %>%
  select(Artist, Title) %>%
  unique()

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist, Title) %>%
  summarise(dates = list(cbind(EntryDate, date))) %>%
  mutate(maxPeriod = map_dbl(dates, calculateSongMaxPeriodByEntryDates)) %>%
  select(Artist, Title, maxPeriod) %>%
  anti_join(top10Songs) %>%
  arrange(-maxPeriod)
## Source: local data frame [27,007 x 3]
## Groups: Artist [9,740]
## 
##                                Artist                  Title maxPeriod
##                                 <chr>                  <chr>     <dbl>
## 1                          AWOLNATION                   Sail       539
## 2                          PAULA COLE   I Don't Want To Wait       385
## 3                        DUNCAN SHEIK       Barely Breathing       378
## 4                          BAND PERRY         If I Die Young       357
## 5                           DISHWALLA     Counting Blue Cars       329
## 6                 OF MONSTERS AND MEN           Little Talks       322
## 7                     IMAGINE DRAGONS              It's Time       322
## 8                        TAYLOR SWIFT Teardrops On My Guitar       322
## 9  LEE ANN WOMACK with SONS OF DESERT       I Hope You Dance       322
## 10                       HUNTER HAYES                 Wanted       315
## # ... with 26,997 more rows

Interesting. So apparently Sail by Awolnation stayed in the charts for almost 80 weeks (not consecutively) without ever reaching the top 10!

plotSongCurve("AWOLNATION", "Sail")

Strangers In The Night

This was educating. But let’s try something a bit more clever: I assume there is an “average” success curve to a top 1 song. It debuts, slowly goes up to no. 1, then maybe more rapidly declines and vanishes.

How many weeks do top 1 songs “live”?

top1Songs <- billboard %>%
  filter(EntryDate >= date_decimal(1960), ThisWeekPosition == 1) %>%
  select(Artist, Title, OV_TotalWeeks) %>%
  unique() %>%
  arrange(-OV_TotalWeeks)

top1Songs %>%
  ggplot(aes(x = OV_TotalWeeks)) + geom_histogram(binwidth = 1) +
  ggtitle("Billboard Hot 100: Total Weeks Distribution for Top 1 Songs")

So that’s a nice symmetric distribution around 20 weeks, and some top 1 songs even “live” over 60 weeks. Which are the top and bottom top 1 songs by this metric?

options(width = 100)

rbind(head(top1Songs), tail(top1Songs))
## # A tibble: 12 × 3
##                                       Artist                        Title OV_TotalWeeks
## *                                      <chr>                        <chr>         <dbl>
## 1  LMFAO featuring LAUREN BENNETT & GOONROCK            Party Rock Anthem            67
## 2                                      ADELE          Rolling In The Deep            64
## 3                     GOTYE featuring KIMBRA Somebody That I Used To Know            59
## 4               SANTANA featuring ROB THOMAS                       Smooth            58
## 5               KATY PERRY featuring JUICY J                   Dark Horse            57
## 6                            BLACK EYED PEAS              I Gotta Feeling            56
## 7                                    BEATLES            Eight Days A Week            10
## 8                           HERMAN'S HERMITS      I'm Henry The VIII I Am            10
## 9                                    BEATLES             Paperback Writer            10
## 10                                   BEATLES    The Long And Winding Road            10
## 11                                  FANTASIA                    I Believe            10
## 12                              TAYLOR HICKS          Do I Make You Proud             8

Never did like American Idol.

You can check that most (87%) of the 1K top 1 songs have been alive over 15 weeks, and that most (65%) had their peek before 15 weeks have passed. So I would say 15 weeks is a fair length of period to look at the “average” behavior or profile of a top 1 song:

billboard %>%
  filter(EntryDate >= date_decimal(1960), TW_TotalWeeks <= 15) %>%
  inner_join(top1Songs) %>%
  select(Artist, Title, TW_TotalWeeks, ThisWeekPosition) %>%
  group_by(TW_TotalWeeks) %>%
  summarize(success = sum(100 - ThisWeekPosition)) %>%
  mutate(propSuccess = success / sum(success)) %>%
  ggplot(aes(x = 1:15, y = propSuccess)) +
  geom_line() +
  ggtitle("Billboard Hot 100: Average Success Curve for a Top 1 Song")

Here propSuccess in a given week is the proportion of a song’s entire “Integral” just for that week.

What am I getting at? I want to be able to see irregular top 1 songs, in the sense that they deviate from this average behavior. And this reminds me of an article by Our Lord and Saviour Hadley Wickham in which he did just that: there he detected causes of death with unusual temporal patterns within a day, in Mexico. You should really read that article, I will now blatantly copy it with no explanations (see Hadley’s article if you’d like explanations), to get the top 1 songs which deviate from the average behavior the most:

top1Songs %<>%
  filter(OV_TotalWeeks >= 15)

hod <- billboard %>%
  filter(EntryDate >= date_decimal(1960), TW_TotalWeeks <= 15) %>%
  inner_join(top1Songs) %>%
  select(Artist, Title, TW_TotalWeeks, ThisWeekPosition) %>%
  group_by(Artist, Title) %>%
  mutate(prop = (100 - ThisWeekPosition) / sum(100 - ThisWeekPosition))

overall <- hod %>%
  group_by(TW_TotalWeeks) %>%
  summarize(freq_all = sum(100 - ThisWeekPosition)) %>%
  mutate(prop_all = freq_all / sum(freq_all))

hod %<>%
  inner_join(overall, "TW_TotalWeeks")

devi <- hod %>%
  group_by(Artist, Title) %>%
  summarize(n = sum(100 - ThisWeekPosition), dist = mean((prop - prop_all)^2))

library(MASS)
devi$resid <- resid(rlm(log(dist) ~ log(n), data = devi))
detach("package:MASS", unload = TRUE)

unusual <- devi %>% arrange(-resid) %>% head(9)

hod_unusual <- hod %>%
  inner_join(unusual, c("Artist", "Title")) %>%
  mutate(TitleArtist = paste(Title, "by", Artist))

ggplot(data = hod_unusual, aes(x = TW_TotalWeeks, y = prop)) +
  geom_line(aes(y = prop_all), data = overall, colour = "grey50") +
  geom_line() +
  facet_wrap(~TitleArtist, ncol = 3) +
  theme(strip.text = element_text(size = 8)) +
  ggtitle("Billboard Hot 100: Top 1 Songs Which Deviate from the Norm")

Nice. You can see the “Norm” in gray and the song’s Success Curve in black. A common theme here is songs which start very strong, possibly at no. 1, then slowly declining - as opposed to the “Norm” where a typical top 1 song “climbs” its way to the top.

I’ll tell you what we could add to Hadley’s analysis: we could look for the least deviant songs:

usual <- devi %>% arrange(resid) %>% head(9)

hod_usual <- hod %>%
  inner_join(usual, c("Artist", "Title")) %>%
  mutate(TitleArtist = paste(Title, "by", Artist))

ggplot(data = hod_usual, aes(x = TW_TotalWeeks, y = prop)) +
  geom_line(aes(y = prop_all), data = overall, colour = "grey50") +
  geom_line() +
  facet_wrap(~TitleArtist, ncol = 3) +
  theme(strip.text = element_text(size = 8)) +
  ggtitle("Billboard Hot 100: Top 1 Songs Which Agree with the Norm")

It’s interesting that Against All Odds by Phill Collins is even more “normal” than we thought.

Finish What Ya Started

Well it wasn’t supposed to be a very innovative post, was it? Still, scraped some data, dealt with its messiness, came up with some metrics, plotted a few plots. Oh, it’s such a perfect day, I’m glad I spent it with you (this song never reached the Billboard Hot 100, so next time I’m analyzing the UK Singles Chart!)