My friend Nir and I have this habbit of sending one another songs in the middle of the day. The other day I sent him this one by Mariah Carey:

To which he replied “that’s so gay”, to which I replied “don’t confuse my 12-year-old taste with my gay taste!”. But seriously people, you shouldn’t underestimate Mariah Carey just because of her troubling sense of style, and now I’m going to show you why.

The Voice

This post by “Fun with R”1 taught me to transcribe notes from sound in R, with the great tuneR package. In this function I’m inputting the path to a MP3 sound file, and outputting its frequency in Herz every ~0.1 of a second:

library(tuneR)

processSong <- function(mp3FilePath, widthSample = 4096) {
  stereoMP3File <- readMP3(mp3FilePath)
  wavFile <- extractWave(stereoMP3File, interact = FALSE)
  if (nchannel(wavFile) > 1) {
    wavFile <- mono(wavFile, "both")
  }
  
  perioWav <- periodogram(wavFile, width = widthSample)
  freqWav <- FF(perioWav)
  return(freqWav)
}

The frequency of a sound wave can be mapped to what we know as musical notes. The higher the frequency, the higher the note. See this table for the mapping and this Wikipedia article for more on the typical frequency of the human voice, a.k.a the Vocal Range. The bottomline is I get a bunch of numbers, and I can tell “98” means the sound is a very low G2.

Turn Off That Music!

However in order to transcribe Mariah’s voice I need it in an Acapella version to her songs, that is without the backing music. There aren’t a lot of those out there, but I found a few on Youtube for a number of Pop singers I wanted to check out. I can’t tell you how I got from this point to having 75 MP3 files of Acapella versions for 15 Pop singers on my local computer (that’s 15 singers x 5 songs for each), because that’s illegal. But I have them, in directory “D:/singers_songs”, and I have an Excel file called “acapella.xlsx” detailing the singers and songs, and now I do:

library(tidyverse)
library(stringr)

singersFreqRange <- readxl::read_xlsx("~/acapella.xlsx") %>%
  mutate(mp3FileAddress = str_c("D:/singers_songs/", str_pad(row_number(), 2, pad = "0"), ".mp3"),
         Frequency = map(mp3FileAddress, processSong)) %>%
  unnest(Frequency)

So the singersFreqRange table holds for each singer, for each song, in a very long format, all its sampled frequencies:

singersFreqRange
## # A tibble: 188,512 x 5
##    Singer  Song         Link                   mp3FileAddress    Frequency
##    <chr>   <chr>        <chr>                  <chr>                 <dbl>
##  1 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      16.1
##  2 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      13.3
##  3 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      17.1
##  4 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      57.5
##  5 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      78.4
##  6 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      17.7
##  7 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~    1499  
##  8 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      41.4
##  9 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~      13.9
## 10 Beyonce Pretty Hurts https://www.youtube.c~ D:/singers_songs~    1498  
## # ... with 188,502 more rows

Cool, is it not?

Now, it’s important to note I didn’t do any random sampling here. The singers are the ones that I like (e.g. the absence of Taylor Swift isn’t coincidental), and most of the songs are simply the ones I could get, again 5 per singer. But I think these 15 singers are still the most popular Pop singers since let’s say the 1990s, and the songs are hit songs. This is the songs list:

library(knitr)
library(kableExtra)

singersFreqRange %>%
  select(Singer, Song, Link) %>%
  distinct() %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F) %>%
  scroll_box(width = "500px", height = "200px")
Singer Song Link
Beyonce Pretty Hurts https://www.youtube.com/watch?v=Mq3ON5OpDsg
Beyonce Halo https://www.youtube.com/watch?v=gr6_1Y6mDBE
Beyonce Best Thing I Never Had https://www.youtube.com/watch?v=RD42VgOLnuk
Beyonce Sweet Dreams https://www.youtube.com/watch?v=xM54nrb--jg
Beyonce Naughty https://www.youtube.com/watch?v=pSTCGiZq0Uw
Mariah Carey We Belong Together https://www.youtube.com/watch?v=v-b7LtIbKV0
Mariah Carey Hero https://www.youtube.com/watch?v=9YD1suttVZU
Mariah Carey Butterfly https://www.youtube.com/watch?v=nvZlyzD8LJc
Mariah Carey Through The Rain https://www.youtube.com/watch?v=rvFYpiPQjM0
Mariah Carey Emotions https://www.youtube.com/watch?v=eEmMQW5KdEo
Rihanna Diamonds https://www.youtube.com/watch?v=Uxc6tzXis8Y
Rihanna Russian Roulette https://www.youtube.com/watch?v=-tizc66nvns
Rihanna S&M https://www.youtube.com/watch?v=BgVgAK_ZXNk
Rihanna Where Have You Been https://www.youtube.com/watch?v=lWQq4jSkyMQ
Rihanna Unfaithful https://www.youtube.com/watch?v=tXqvd5FqdBQ
Christina Aguilera Hurt https://www.youtube.com/watch?v=v7am8I_QYoU
Christina Aguilera Beautiful https://www.youtube.com/watch?v=KHRhV7cDUcg
Christina Aguilera Ain’t No Other Man https://www.youtube.com/watch?v=xggfqh9yRJs
Christina Aguilera Genie In A Bottle https://www.youtube.com/watch?v=Q2uzakRUj-k
Christina Aguilera What A Girl Wants https://www.youtube.com/watch?v=s1sXWOcRAzE
Britney Spears Circus https://www.youtube.com/watch?v=wu8cxqvAzeg
Britney Spears Slave https://www.youtube.com/watch?v=j8pki3G_MmI
Britney Spears Hold It Against Me https://www.youtube.com/watch?v=zAROc_jtfLw
Britney Spears Womanizer https://www.youtube.com/watch?v=WUoVq7y7uU0
Britney Spears Babay One More Time https://www.youtube.com/watch?v=82lbiwj0s2g
Madonna Frozen https://www.youtube.com/watch?v=zwdl-ETciqQ
Madonna Beautiful Stranger https://www.youtube.com/watch?v=hEnYDDo-gGo
Madonna Vogue https://www.youtube.com/watch?v=oUuYeF5Vy-w
Madonna Rain https://www.youtube.com/watch?v=PDM6_KhIHmw
Madonna Bitch I’m Madonna https://www.youtube.com/watch?v=wo1Rq9_U3Bw
Ariana Grande Dangerous Woman https://www.youtube.com/watch?v=VVYTzauIb_4
Ariana Grande Into You https://www.youtube.com/watch?v=Q4KZaI_jtKs
Ariana Grande Moonlight https://www.youtube.com/watch?v=BiZih62zjMY
Ariana Grande Bad Decisions https://www.youtube.com/watch?v=0FTtETXlTT4
Ariana Grande Baby I https://www.youtube.com/watch?v=L9zPO4HWMg0
Celine Dion The Power Of Love https://www.youtube.com/watch?v=bUPhxKkTmBI
Celine Dion My Heart Will Go On https://www.youtube.com/watch?v=hRaNmDLBSaU
Celine Dion I Drove All Night https://www.youtube.com/watch?v=2a0YmV7nAMI
Celine Dion Loved Me Back To Life https://www.youtube.com/watch?v=tu_kFXNheDo
Celine Dion Encore Un Soir https://www.youtube.com/watch?v=3mAZEi45Erg
Adele Set Fire To The Rain https://www.youtube.com/watch?v=ezREO8oe5Is
Adele Rolling In The Deep https://www.youtube.com/watch?v=nvjEOzmGMNw
Adele Skyfall https://www.youtube.com/watch?v=wFk0tizQDms
Adele Send My Love (To Your New Lover) https://www.youtube.com/watch?v=eiUsMrFKuqw
Adele Hello https://www.youtube.com/watch?v=o7bUhHEEugM
Sia Elastic Heart https://www.youtube.com/watch?v=r9EgGR6Hj_c
Sia The Greatest https://www.youtube.com/watch?v=NhCaCP5Y1UQ
Sia Chandelier https://www.youtube.com/watch?v=Ans82nxKtEY
Sia Alive https://www.youtube.com/watch?v=hH5lAVirV2s
Sia Cheap Thrills https://www.youtube.com/watch?v=kRw4bsXxKnw
Lady Gaga You and I https://www.youtube.com/watch?v=JgEzZdwQx9g
Lady Gaga Bad Romance https://www.youtube.com/watch?v=iVD-S21ejSw
Lady Gaga Judas https://www.youtube.com/watch?v=mB2hmwenxM8
Lady Gaga Paparazzi https://www.youtube.com/watch?v=Hz_9FNImfMs
Lady Gaga The Edge Of Glory https://www.youtube.com/watch?v=OR7jGPtRRH8
Katy Perry Firework https://www.youtube.com/watch?v=hx8seY-wdEY
Katy Perry Wide Awake https://www.youtube.com/watch?v=Il-UCGN89UM
Katy Perry California Girls https://www.youtube.com/watch?v=lj6kBl06MF0
Katy Perry Dark Horse https://www.youtube.com/watch?v=adCpix_cuq0
Katy Perry Part Of Me https://www.youtube.com/watch?v=TmZHNkwveYk
Whitney Houston How Will I Know https://www.youtube.com/watch?v=JZGVUpXQHV4
Whitney Houston I’m Every Woman https://www.youtube.com/watch?v=IS5sG12thcg
Whitney Houston I Wanna Dance With Somebody https://www.youtube.com/watch?v=b-gpym3EkI4
Whitney Houston So Emotional https://www.youtube.com/watch?v=-QggLkd7Zj4
Whitney Houston I Learned From The Best https://www.youtube.com/watch?v=_3_TgCl_bXE
Alicia Keys Fallin’ https://www.youtube.com/watch?v=zMHo5WvhGA4
Alicia Keys No One https://www.youtube.com/watch?v=TK8NGqgwTYw
Alicia Keys You Don’t Know My Name https://www.youtube.com/watch?v=h8Zik_hwXtU
Alicia Keys Girl On Fire https://www.youtube.com/watch?v=cJQHJ3ZWizQ
Alicia Keys Un-Thinkable https://www.youtube.com/watch?v=QHEjMJAh4P8
Pink Who Knew https://www.youtube.com/watch?v=tHatCuYe7cg
Pink Please Don’t Leave Me https://www.youtube.com/watch?v=_3YOMR6IOA0
Pink Get The Party Started https://www.youtube.com/watch?v=P3tZtr5rTPg
Pink Funhouse https://www.youtube.com/watch?v=aJo1L3JvWfE
Pink Sober https://www.youtube.com/watch?v=RsFFQtwJ2Eg

Super Bass/Sopran

And now for my very first ggridges plot, a.k.a Joy plot after the cover of Joy Division’s Unknown Pleasures Album3:

library(ggridges)

singersFreqRange %>%
  ggplot(aes(x = Frequency, y = Singer)) +
  geom_density_ridges()

Hahaha! What an epic fail.

The most obvious problem is bad data. I say “bad” and I mean it, because apparently some of my singers reach Super-Bass and/or Super-Sopran notes:

singersFreqRange %>%
  select(Singer, Song, Frequency) %>%
  remove_missing(na.rm = TRUE) %>%
  arrange(Frequency) %>%
  slice(c(1:3, (n() - 2):n()))
## # A tibble: 6 x 3
##   Singer             Song              Frequency
##   <chr>              <chr>                 <dbl>
## 1 Christina Aguilera Genie In A Bottle      10.8
## 2 Christina Aguilera Genie In A Bottle      10.8
## 3 Britney Spears     Slave                  10.8
## 4 Christina Aguilera What A Girl Wants    9650  
## 5 Rihanna            S&M                 12439  
## 6 Rihanna            S&M                 15660

If you followed the frequency-to-notes and Wikipedia links you know there is no human voice reaching these low and high frequencies. I’m guessing both types of anomalies come from unwanted sounds in the MP3 files, e.g. the drums.4 In fact it is unlikely that any of these singers sing below 98 Herz (\(G_2\)) and above 3,135 (\(G_7\)). So the way I chose to clean these data is to set a lower limit of 98 Herz, and take for each singer her 99th percentile as the upper limit.

Whitney Crashing My Party

After cleaning the data, adding some color and labels, and ordering the ridges according to the Singer’s median frequency we get:

library(viridis)

.breaks <-  c(49, 98, 196, 392, 783.99, 1567.98, 3135.96)
.labels <- c(expression("G"[1]), expression("G"[2]), expression("G"[3]),
             expression("G"[4]), expression("G"[5]), expression("G"[6]),
             expression("G"[7]))

singersFreqRange %>%
  group_by(Singer) %>%
  mutate(minFreq = 98,
         maxFreq = quantile(Frequency, 0.99, na.rm = TRUE)) %>%
  filter(Frequency > minFreq, Frequency < maxFreq) %>%
  mutate(medianFreq = median(Frequency, na.rm = TRUE),
         maxFreq = max(Frequency, na.rm = TRUE)) %>%
  ggplot(aes(x = Frequency, y = reorder(Singer, -medianFreq), fill = ..x..)) +
  geom_density_ridges_gradient(rel_min_height = 0.0) +
  scale_fill_viridis(name = "Freq.[Hz]", option = "C") +
  theme_ridges(font_size = 13, grid = TRUE) +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size=12),
        text = element_text(family="mono")) +
  labs(title = 'Pop Singers Vocal Range',
    subtitle = 'Frequency [Hz] Distribution (ordered by Median)\nData: 5 Hit Songs per Singer Performed Acapella') +
  scale_x_continuous(breaks = .breaks,
                   labels = .labels, limits = c(0, 800))

A few things come to mind:

  • Excuse me Whitney, you’re getting in the way!
  • Poor Britney.
  • Respect for Ariana.

Butterfly

So, Whitney beat Mariah with the median frequency she’s singing with. Of course these are only 5 top songs for each, the data is not 100% pure, tuneR’s capabalities are not perfect, but I like what I see.

If we not only arrange these singers by median frequency, but also by the range of frequency (the Vocal Range) and maximum frequency in these 5 songs, we can see Mariah is indeed a unicorn. I couldn’t decide which 3-dimensional-data plot I liked best so I’ll give you both:

singersFreqRangeSum <- singersFreqRange %>%
  group_by(Singer) %>%
  mutate(minFreq = 98,
            maxFreq = quantile(Frequency, 0.99, na.rm = TRUE)) %>%
  filter(Frequency > minFreq, Frequency < maxFreq) %>%
  summarise(medianFreq = median(Frequency, na.rm = TRUE),
            minFreq = min(Frequency, na.rm = TRUE),
            maxFreq = max(Frequency, na.rm = TRUE)) %>%
  mutate(rangeFreq = maxFreq - minFreq,
         name = map_chr(Singer, function(s) str_split(s, " ")[[1]][1]))

singersFreqRangeSum$name[9] <- "Gaga"


ggplot(singersFreqRangeSum, aes(x = medianFreq, y = rangeFreq, label = name, fill = maxFreq)) +
  geom_label(hjust = "inward", vjust = "inward", color = "white") +
  theme_classic() +
  theme(text = element_text(family="mono"),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12)) +
  labs(title = 'Pop Singers Vocal Frequency: Range, Median and Maximum',
       subtitle = 'Data: 5 Hit Songs per Singer Performed Acapella',
       x = "Median Frequency [Hz]",
       y = "Frequency Range [Hz]",
       fill = "Max. Freq [Hz]")

ggplot(singersFreqRangeSum, aes(x = medianFreq, y = rangeFreq, label = name)) +
  geom_point(aes(size = maxFreq), shape = 21, fill = "red") +
  geom_text(hjust = 0.5, vjust = -1.1, color = "black", size = 3) +
  theme_classic() +
  theme(text = element_text(family="mono"),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12)) +
  labs(title = 'Pop Singers Vocal Frequency: Range, Median and Maximum',
       subtitle = 'Data: 5 Hit Songs per Singer Performed Acapella',
       x = "Median Frequency [Hz]",
       y = "Frequency Range [Hz]",
       size = "Max. Freq [Hz]")

And again: poor Britney.

Gifme Gifme Gifme

If ever there was a plot worthy of putting a gif on top of it! This I learned from Daniel P. Hadley:

library(magick)

singersFreqRange %>%
  group_by(Singer) %>%
  mutate(minFreq = 98,
         maxFreq = quantile(Frequency, 0.99, na.rm = TRUE)) %>%
  filter(Frequency > minFreq, Frequency < maxFreq) %>%
  mutate(medianFreq = median(Frequency, na.rm = TRUE),
         maxFreq = max(Frequency, na.rm = TRUE)) %>%
  ggplot(aes(x = Frequency, y = reorder(Singer, -medianFreq), fill = ..x..)) +
  geom_density_ridges_gradient(rel_min_height = 0.0) +
  scale_fill_viridis(name = "Freq.[Hz]", option = "C") +
  theme_ridges(font_size = 13, grid = TRUE) +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size=12),
        text = element_text(family="mono"),
        plot.background = element_rect(fill = rgb(198/255, 189/255, 189/255))) +
  labs(title = 'Pop Singers Vocal Range',
    subtitle = 'Frequency [Hz] Distribution (ordered by Median)\nData: 5 Hit Songs per Singer Performed Acapella') +
  scale_x_continuous(breaks = .breaks,
                   labels = .labels, limits = c(0, 800))+
  ggsave(filename = "singers_ridge.png")

background <- image_read("singers_ridge.png")

whitney <- image_read("whitney.gif") %>%
  image_border(color = "white", geometry = "2x2")

frames <- lapply(whitney, function(frame) {
  image_composite(background, frame, offset = "+1730+1220")
})

animation <- image_animate(image_join(frames))


image_write(animation, "ridge_whitney.gif")
ggplot(singersFreqRangeSum, aes(x = medianFreq, y = rangeFreq, label = name)) +
  geom_point(aes(size = maxFreq), shape = 21, fill = "red") +
  geom_text(hjust = 0.5, vjust = -1.1, color = "black", size = 3) +
  theme_classic() +
  theme(text = element_text(family="mono"),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12)) +
  labs(title = 'Pop Singers Vocal Frequency: Range, Median and Maximum',
       subtitle = 'Data: 5 Hit Songs per Singer Performed Acapella',
       x = "Median Frequency [Hz]",
       y = "Frequency Range [Hz]",
       size = "Max. Freq [Hz]") +
  ggsave(filename = "singers_bubble.png")

background <- image_read("singers_bubble.png")

mariah <- image_read("mariah.gif") 

frames <- lapply(mariah, function(frame) {
  image_composite(background, frame, offset = "+1050+200")
})

animation <- image_animate(image_join(frames))

image_write(animation, "bubble_mariah.gif")

What Did We Learn?

Don’t underestimate Mariah, Whitney or the lengths I would go to win an argument.


  1. I have no idea who this person is, does anyone?

  2. Sol.

  3. Am I the only one who whenever people are talking about Ridge Regression thinks of Ridge Forrester from the Bold and the Beautiful?

  4. Rihanna can’t reach 15K Herz people!