My friend Nir and I have this habit of sending one another songs in the middle of the day. The other day I sent him this one by Mariah Carey:

To which he replied “that’s so gay”, to which I replied “don’t confuse my 12-year-old taste with my gay taste!”. But seriously people, you shouldn’t underestimate Mariah Carey just because of her troubling sense of style, and now I’m going to show you why.

# The Voice

This post by “Fun with R”[1] taught me how to transcribe notes from sound in R, using the great tuneR package. The function below takes the path to an MP3 sound file and returns its fundamental frequency in Hertz roughly every 0.1 seconds:

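library(tuneR)

processSong <- function(mp3FilePath, widthSample = 4096) {
  # read the MP3 file into a Wave object
  stereoMP3File <- readMP3(mp3FilePath)
  wavFile <- extractWave(stereoMP3File, interact = FALSE)
  # average the two channels of a stereo recording into one
  if (nchannel(wavFile) > 1) {
    wavFile <- mono(wavFile, "both")
  }

  # spectral density per window of widthSample samples,
  # then the fundamental frequency of each window
  perioWav <- periodogram(wavFile, width = widthSample)
  freqWav <- FF(perioWav)
  return(freqWav)
}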
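
The "roughly every 0.1 seconds" figure follows from the window width. Here is a quick back-of-the-envelope check, assuming the files are sampled at 44,100 Hz (a common rate for music, but an assumption, since the sample rate is not stated here) and that the windows do not overlap:

```r
# Assumed sample rate: 44,100 Hz is common for music files; the real value
# can be read from the Wave object (its samp.rate slot)
sampleRate  <- 44100
widthSample <- 4096   # default window width used by processSong()

# Duration of one analysis window, in seconds (about 0.093)
windowSeconds <- widthSample / sampleRate
round(windowSeconds, 3)

# Average song length implied by 188,512 sampled frequencies over 75 songs,
# in minutes, under the same assumptions
round(188512 / 75 * windowSeconds / 60, 1)
```

Under these assumptions the average song comes out a little under four minutes, which sounds about right for a pop hit.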

The frequency of a sound wave can be mapped to what we know as musical notes. The higher the frequency, the higher the note. See this table for the mapping and this Wikipedia article for more on the typical frequency of the human voice, a.k.a. the Vocal Range. The bottom line is I get a bunch of numbers, and I can tell “98” means the sound is a very low G[2].
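
For the curious, the frequency-to-note mapping can be sketched in a few lines of base R. This is only a minimal illustration, assuming equal temperament with A4 tuned to 440 Hz; `freqToNote` is a name made up for this example (tuneR has its own helpers, `noteFromFF()` and `notenames()`, for the same job):

```r
# Convert a frequency in Hz to the nearest equal-tempered note name,
# assuming the usual A4 = 440 Hz reference (hypothetical helper)
freqToNote <- function(freq, a4 = 440) {
  noteNames <- c("C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B")
  # MIDI note number: 69 is A4, and each semitone is one step
  midi <- round(69 + 12 * log2(freq / a4))
  octave <- midi %/% 12 - 1
  paste0(noteNames[midi %% 12 + 1], octave)
}

freqToNote(c(98, 440, 783.99))
# "G2" "A4" "G5"
```

Because the result is rounded to the nearest semitone, slightly off-pitch frequencies still land on a note name.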

# Turn Off That Music!

However, to transcribe Mariah’s voice I need a cappella versions of her songs, that is, without the backing music. There aren’t a lot of those out there, but I found a few on YouTube for a number of Pop singers I wanted to check out. I can’t tell you how I got from this point to having 75 MP3 files of a cappella versions for 15 Pop singers on my local computer (that’s 15 singers x 5 songs each), because that’s illegal. But I have them, in the directory “D:/singers_songs”, and I have an Excel file called “acapella.xlsx” detailing the singers and songs, and now I do:

library(tidyverse)
library(stringr)

unnest(Frequency)
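
Only the tail of that pipeline appears above, so here is a small sketch of the general pattern it ends with: store one vector of frequencies per song in a list-column, then `unnest()` it into one row per frequency. Everything in it is made up for illustration: `fakeProcessSong` stands in for `processSong()` so that no MP3 files are needed, and the singers, songs, file names and the `FilePath` column name are hypothetical.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical stand-in for processSong(): returns made-up frequencies
# so the sketch runs without any MP3 files
fakeProcessSong <- function(mp3FilePath) {
  c(110, 220, 330)
}

# Hypothetical song list; the real one lives in "acapella.xlsx"
songs <- tibble(
  Singer   = c("Singer A", "Singer B"),
  Song     = c("Song 1", "Song 2"),
  FilePath = c("D:/singers_songs/a1.mp3", "D:/singers_songs/b2.mp3")
)

longFreq <- songs %>%
  # one numeric vector per song, stored in a list-column
  mutate(Frequency = map(FilePath, fakeProcessSong)) %>%
  # one row per sampled frequency
  unnest(Frequency)

nrow(longFreq)
```

Each song's row is repeated once per frequency, which is why the real table gets so long.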

So the singersFreqRange table holds for each singer, for each song, in a very long format, all its sampled frequencies:

singersFreqRange
## # A tibble: 188,512 x 5
##    <chr>   <chr>     <chr>                    <chr>                  <dbl>
##  1 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      16.1
##  2 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      13.3
##  3 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      17.1
##  4 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      57.5
##  5 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      78.4
##  6 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      17.7
##  7 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~    1499.
##  8 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      41.4
##  9 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~      13.9
## 10 Beyonce Pretty H~ https://www.youtube.com~ D:/singers_songs/~    1498.
## # ... with 188,502 more rows

Cool, is it not?

Now, it’s important to note I didn’t do any random sampling here. The singers are the ones that I like (e.g. the absence of Taylor Swift isn’t coincidental), and most of the songs are simply the ones I could get, again 5 per singer. But I think these 15 singers are still among the most popular Pop singers since, let’s say, the 1990s, and the songs are hit songs. Here is the song list:

library(knitr)
library(kableExtra)

singersFreqRange %>%
  # keep only the first three columns (singer, song and link), one row per song
  select(1:3) %>%
  distinct() %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F) %>%
  scroll_box(width = "500px", height = "200px")

| Singer | Song | Link |
|---|---|---|
| Mariah Carey | We Belong Together | https://www.youtube.com/watch?v=v-b7LtIbKV0 |
| Mariah Carey | Through The Rain | https://www.youtube.com/watch?v=rvFYpiPQjM0 |
| Rihanna | Where Have You Been | https://www.youtube.com/watch?v=lWQq4jSkyMQ |
| Christina Aguilera | Ain’t No Other Man | https://www.youtube.com/watch?v=xggfqh9yRJs |
| Christina Aguilera | Genie In A Bottle | https://www.youtube.com/watch?v=Q2uzakRUj-k |
| Christina Aguilera | What A Girl Wants | https://www.youtube.com/watch?v=s1sXWOcRAzE |
| Britney Spears | Hold It Against Me | https://www.youtube.com/watch?v=zAROc_jtfLw |
| Britney Spears | Babay One More Time | https://www.youtube.com/watch?v=82lbiwj0s2g |
| Celine Dion | The Power Of Love | https://www.youtube.com/watch?v=bUPhxKkTmBI |
| Celine Dion | My Heart Will Go On | https://www.youtube.com/watch?v=hRaNmDLBSaU |
| Celine Dion | I Drove All Night | https://www.youtube.com/watch?v=2a0YmV7nAMI |
| Celine Dion | Loved Me Back To Life | https://www.youtube.com/watch?v=tu_kFXNheDo |
| Celine Dion | Encore Un Soir | https://www.youtube.com/watch?v=3mAZEi45Erg |
| Katy Perry | Part Of Me | https://www.youtube.com/watch?v=TmZHNkwveYk |
| Whitney Houston | How Will I Know | https://www.youtube.com/watch?v=JZGVUpXQHV4 |
| Whitney Houston | I’m Every Woman | https://www.youtube.com/watch?v=IS5sG12thcg |
| Whitney Houston | I Wanna Dance With Somebody | https://www.youtube.com/watch?v=b-gpym3EkI4 |
| Whitney Houston | I Learned From The Best | https://www.youtube.com/watch?v=_3_TgCl_bXE |
| Alicia Keys | You Don’t Know My Name | https://www.youtube.com/watch?v=h8Zik_hwXtU |
| Alicia Keys | Girl On Fire | https://www.youtube.com/watch?v=cJQHJ3ZWizQ |
| Pink | Get The Party Started | https://www.youtube.com/watch?v=P3tZtr5rTPg |

# Super Bass/Soprano

And now for my very first ggridges plot, a.k.a. a Joy plot, named after the cover of Joy Division’s Unknown Pleasures album[3]:

library(ggridges)

singersFreqRange %>%
  ggplot(aes(x = Frequency, y = Singer)) +
  geom_density_ridges()

Hahaha! What an epic fail.

The most obvious problem is bad data. I say “bad” and I mean it, because apparently some of my singers reach Super-Bass and/or Super-Soprano notes:

singersFreqRange %>%
  select(Singer, Song, Frequency) %>%
  remove_missing(na.rm = TRUE) %>%
  arrange(Frequency) %>%
  # the three lowest and the three highest frequencies
  slice(c(1:3, (n() - 2):n()))
## # A tibble: 6 x 3
##   Singer             Song              Frequency
##   <chr>              <chr>                 <dbl>
## 1 Christina Aguilera Genie In A Bottle      10.8
## 2 Christina Aguilera Genie In A Bottle      10.8
## 3 Britney Spears     Slave                  10.8
## 4 Christina Aguilera What A Girl Wants    9650.
## 5 Rihanna            S&M                 12439.
## 6 Rihanna            S&M                 15660.

If you followed the frequency-to-notes and Wikipedia links, you know that no human voice reaches frequencies this low or this high. I’m guessing both types of anomalies come from unwanted sounds in the MP3 files, e.g. the drums.[4] In fact, it is unlikely that any of these singers sings below 98 Hertz ($$G_2$$) or above 3,135 Hertz ($$G_7$$). So the way I chose to clean these data is to set a lower limit of 98 Hertz and to take each singer’s 99th percentile as her upper limit.
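
To see what this rule does, here is a toy run in base R on made-up numbers for one hypothetical singer: 200 plausible frequencies plus one low rumble and one high-pitched artifact. It is only a sketch of the rule described above, not the actual cleaning step:

```r
# Made-up frequencies for one hypothetical singer: mostly plausible notes,
# plus a low rumble and a high-pitched artifact
set.seed(1)
freq <- c(runif(200, min = 150, max = 700), 10.8, 15660)

lowerLimit <- 98                                   # G2
upperLimit <- quantile(freq, 0.99, na.rm = TRUE)   # this singer's 99th percentile

# keep only the frequencies strictly between the two limits
cleanFreq <- freq[freq > lowerLimit & freq < upperLimit]

range(freq)
range(cleanFreq)
```

Note the trade-off: the percentile cut always drops the top 1% of a singer's values, so a few genuine high notes go out together with the artifacts.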

# Whitney Crashing My Party

After cleaning the data, adding some color and labels, and ordering the ridges by each singer’s median frequency, we get:

library(viridis)

.breaks <- c(49, 98, 196, 392, 783.99, 1567.98, 3135.96)
.labels <- c(expression("G"[1]), expression("G"[2]), expression("G"[3]),
             expression("G"[4]), expression("G"[5]), expression("G"[6]),
             expression("G"[7]))

singersFreqRange %>%
  group_by(Singer) %>%
  # keep frequencies between G2 (98 Hz) and each singer's 99th percentile
  mutate(minFreq = 98,
         maxFreq = quantile(Frequency, 0.99, na.rm = TRUE)) %>%
  filter(Frequency > minFreq, Frequency < maxFreq) %>%
  mutate(medianFreq = median(Frequency, na.rm = TRUE),
         maxFreq = max(Frequency, na.rm = TRUE)) %>%
  ggplot(aes(x = Frequency, y = reorder(Singer, -medianFreq), fill = after_stat(x))) +
  # no geom appears in this listing; geom_density_ridges_gradient() with
  # default settings is assumed here, since the gradient fill needs it
  geom_density_ridges_gradient() +
  scale_fill_viridis(name = "Freq.[Hz]", option = "C") +
  theme_ridges(font_size = 13, grid = TRUE) +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 12),
        text = element_text(family = "mono")) +
  labs(title = 'Pop Singers Vocal Range',
       subtitle = 'Frequency [Hz] Distribution (ordered by Median)\nData: 5 Hit Songs per Singer Performed Acapella') +
  scale_x_continuous(breaks = .breaks,
                     labels = .labels, limits = c(0, 800))
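
A side note on the axis breaks: they are the G notes from one octave to the next, and going up an octave doubles the frequency. A quick check in base R, assuming equal temperament with A4 at 440 Hz (the `g1` and `gOctaves` names are just for this example):

```r
# Equal-tempered G1, derived from A4 = 440 Hz:
# MIDI note 31 is G1 and MIDI note 69 is A4, with 12 semitones per octave
g1 <- 440 * 2^((31 - 69) / 12)

# Every octave up doubles the frequency: G1, G2, ..., G7
gOctaves <- g1 * 2^(0:6)
round(gOctaves, 2)
```

Rounded to two decimals, these match the values used for the axis breaks.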

A few things come to mind:

• Excuse me Whitney, you’re getting in the way!
• Poor Britney.
• Respect for Ariana.

# Butterfly

So, Whitney beats Mariah on the median frequency she sings at. Of course these are only 5 top songs for each, the data is not 100% pure, and tuneR’s capabilities are not perfect, but I like what I see.

If we arrange these singers not only by median frequency, but also by the range of frequency (the Vocal Range) and the maximum frequency in these 5 songs, we can see Mariah is indeed a unicorn. I couldn’t decide which 3-dimensional-data plot I liked best so I’ll give you both:

singersFreqRangeSum <- singersFreqRange %>%
  group_by(Singer) %>%
  mutate(minFreq = 98,
         maxFreq = quantile(Frequency, 0.99, na.rm = TRUE)) %>%
  filter(Frequency > minFreq, Frequency < maxFreq) %>%
  # one row per singer: median, lowest and highest cleaned frequency
  summarise(medianFreq = median(Frequency, na.rm = TRUE),
            minFreq = min(Frequency, na.rm = TRUE),
            maxFreq = max(Frequency, na.rm = TRUE)) %>%
  # the label is the first word of the singer's name
  mutate(rangeFreq = maxFreq - minFreq,
         name = map_chr(Singer, function(s) str_split(s, " ")[[1]][1]))

singersFreqRangeSum$name[9] <- "Gaga"

ggplot(singersFreqRangeSum, aes(x = medianFreq, y = rangeFreq, label = name, fill = maxFreq)) +
  geom_label(hjust = "inward", vjust = "inward", color = "white") +
  theme_classic() +
  theme(text = element_text(family = "mono"),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12)) +
  labs(title = 'Pop Singers Vocal Frequency: Range, Median and Maximum',
       subtitle = 'Data: 5 Hit Songs per Singer Performed Acapella',
       x = "Median Frequency [Hz]",
       y = "Frequency Range [Hz]",
       fill = "Max. Freq [Hz]")

ggplot(singersFreqRangeSum, aes(x = medianFreq, y = rangeFreq, label = name)) +
  geom_point(aes(size = maxFreq), shape = 21, fill = "red") +
  geom_text(hjust = 0.5, vjust = -1.1, color = "black", size = 3) +
  theme_classic() +
  theme(text = element_text(family = "mono"),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12)) +
  labs(title = 'Pop Singers Vocal Frequency: Range, Median and Maximum',
       subtitle = 'Data: 5 Hit Songs per Singer Performed Acapella',
       x = "Median Frequency [Hz]",
       y = "Frequency Range [Hz]",
       size = "Max. Freq [Hz]")

And again: poor Britney.

# Gifme Gifme Gifme

If ever there was a plot worthy of putting a gif on top of it! This I learned from Daniel P. Hadley:

library(magick)

singersFreqRange %>%
  group_by(Singer) %>%
  mutate(minFreq = 98,
         maxFreq = quantile(Frequency, 0.99, na.rm = TRUE)) %>%
  filter(Frequency > minFreq, Frequency < maxFreq) %>%
  mutate(medianFreq = median(Frequency, na.rm = TRUE),
         maxFreq = max(Frequency, na.rm = TRUE)) %>%
  ggplot(aes(x = Frequency, y = reorder(Singer, -medianFreq), fill = after_stat(x))) +
  # no geom appears in this listing; geom_density_ridges_gradient() with
  # default settings is assumed here, since the gradient fill needs it
  geom_density_ridges_gradient() +
  scale_fill_viridis(name = "Freq.[Hz]", option = "C") +
  theme_ridges(font_size = 13, grid = TRUE) +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 12),
        text = element_text(family = "mono"),
        plot.background = element_rect(fill = rgb(198/255, 189/255, 189/255))) +
  labs(title = 'Pop Singers Vocal Range',
       subtitle = 'Frequency [Hz] Distribution (ordered by Median)\nData: 5 Hit Songs per Singer Performed Acapella') +
  scale_x_continuous(breaks = .breaks,
                     labels = .labels, limits = c(0, 800))

# recent ggplot2 versions reject `+ ggsave()`; called on its own,
# ggsave() saves the last plot that was created
ggsave(filename = "singers_ridge.png")

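# The calls that load the saved plot and the GIF are not part of this listing.
# The three lines below are an assumed reconstruction; "whitney.gif" is a
# hypothetical file name.
background <- image_read("singers_ridge.png")
whitney <- image_read("whitney.gif") %>%
  image_border(color = "white", geometry = "2x2")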

frames <- lapply(whitney, function(frame) {
  image_composite(background, frame, offset = "+1730+1220")
})

animation <- image_animate(image_join(frames))

image_write(animation, "ridge_whitney.gif")

ggplot(singersFreqRangeSum, aes(x = medianFreq, y = rangeFreq, label = name)) +
  geom_point(aes(size = maxFreq), shape = 21, fill = "red") +
  geom_text(hjust = 0.5, vjust = -1.1, color = "black", size = 3) +
  theme_classic() +
  theme(text = element_text(family = "mono"),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12)) +
  labs(title = 'Pop Singers Vocal Frequency: Range, Median and Maximum',
       subtitle = 'Data: 5 Hit Songs per Singer Performed Acapella',
       x = "Median Frequency [Hz]",
       y = "Frequency Range [Hz]",
       size = "Max. Freq [Hz]")

# recent ggplot2 versions reject `+ ggsave()`; called on its own,
# ggsave() saves the last plot that was created
ggsave(filename = "singers_bubble.png")

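# The calls that load the saved plot and the GIF are not part of this listing.
# The two lines below are an assumed reconstruction; "mariah.gif" is a
# hypothetical file name.
background <- image_read("singers_bubble.png")
mariah <- image_read("mariah.gif")

frames <- lapply(mariah, function(frame) {
  image_composite(background, frame, offset = "+1050+200")
})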

animation <- image_animate(image_join(frames))

image_write(animation, "bubble_mariah.gif")

# What Did We Learn?

Don’t underestimate Mariah, Whitney or the lengths I would go to in order to win an argument.

1. I have no idea who this person is, does anyone?

2. Sol.

3. Am I the only one who whenever people are talking about Ridge Regression thinks of Ridge Forrester from the Bold and the Beautiful?

4. Rihanna can’t reach 15K Hertz, people!