During my hiatus from this blog, I saw an announcement on Twitter by Bruna Wundervald regarding the chorrrds package1, which lets you easily retrieve the chords of popular songs from the Brazilian site CifraClub. I speak zero Portuguese but I know some chords, and I live for such data. One thing led to another, and I decided to revisit my Master's thesis, which reviewed methods for Sparse Canonical Correlation Analysis, to find correlations between guitar chords and lyrics for thousands of popular Classic Rock songs.

If it sounds statistically heavy to you, I think the findings will make the effort to understand SCCA worthwhile. Here’s a sneak peek:

# Getting the chords

Why Classic Rock? Because Classic Rock is pretty much based on guitar chords. Because CifraClub seems to be more aimed at Classic Rock. Because I already did Pop and Rap. Because I’m in a Classic Rock mood.

To get a list of top Classic Rock acts, I just scraped ultimateclassicrock.com for the top 100 Rock acts, and this is what I got, if you’re interested:

library(tidyverse)

top_classic_rock_acts <- read_csv(..., # file path truncated in the original
  col_names = "artist")
knitr::kable(top_classic_rock_acts, format = "html") %>%
kableExtra::kable_styling(full_width = FALSE, position = "left", font_size = 12) %>%
kableExtra::scroll_box(width = "40%", height = "200px")
artist
Scorpions
Ronnie James Dio
Billy Idol
Jeff Beck
Vaughan Stevie Ray
Blue Oyster Cult
Traffic
Thin Lizzy
James Gang
Chicago
Peter Frampton
The Moody Blues
Emerson Lake and Palmer
Ted Nugent
John Fogerty
Bachman Turner Overdrive
Ringo Starr
Henley Don
Bon Jovi
Faces
Phil Collins
Hagar Sammy
Motley Crue
Walsh Joe
Electric Light Orchestra
Dire Straits
Buffalo Springfield
Foreigner
Jethro Tull
Lou Reed
Doobie Brothers
Robert Plant
Peter Gabriel
John Mellencamp
Styx
Steve Miller Band
Nicks Stevie
Alice Cooper
Boston
Ozzy Osbourne
George Harrison
Santana
Iron Maiden
Judas Priest
Jefferson Airplane
Def Leppard
Janis Joplin
Jackson Browne
Rod Stewart
Heart
Journey
Bob Seger
Yes
Cheap Trick
Kiss
Genesis
Ramones
The Cars
Paul McCartney
Deep Purple
Steely Dan
Allman Brothers Band
The Eagles
Crosby Stills Nash Young
John Lennon
Lynyrd Skynyrd
Rush
Eric Clapton
Billy Joel
ZZ Top
Guns N Roses
Queen
Cream
Tom Petty
Fleetwood Mac
Band The
Aerosmith
Metallica
The Clash
The Police
AC DC
The Doors
The Kinks
Black Sabbath
Creedence Clearwater Revival
The Who
Neil Young
Bruce Springsteen
Elton John
David Bowie
Van Halen
Pink Floyd
Bob Dylan
Jimi Hendrix
Led Zeppelin
The Rolling Stones
The Beatles

As said, the chorrrds package does most of the work here. I just wrapped its main functions, get_songs() and get_chords(), in my own wrapper functions to make them more consistent, safer and more polite. I also added a progress bar, because this can take some time.

First, get the song URLs:

process_artist <- function(artist) {
str_c(str_split(artist, " ")[[1]], collapse = "-")
}
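
Just to illustrate the slug format CifraClub expects, here is what that function does (the base-R rewrite below is mine, for illustration only):

```r
# base-R equivalent of process_artist(): spaces become dashes
process_artist_base <- function(artist) {
  paste(strsplit(artist, " ")[[1]], collapse = "-")
}

process_artist_base("Led Zeppelin")
## [1] "Led-Zeppelin"
```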

get_songs_wrapper <- function(artist, pb = NULL) {
if (!is.null(pb)) pb$tick()$print()
Sys.sleep(0.5)
urls <- as.character(chorrrds::get_songs(artist)$url)
if (length(urls) == 0) {
return(NA_character_)
} else {
return(urls)
}
}

pb <- progress_estimated(nrow(top_classic_rock_acts))

rock_songs_table <- top_classic_rock_acts %>%
mutate(processed_artist = map_chr(artist, process_artist),
url = map(processed_artist, get_songs_wrapper, pb = pb)) %>%
unnest(url) %>%
drop_na()

rock_songs_table
## # A tibble: 8,984 x 3
##    artist    processed_artist url
##    <chr>     <chr>            <chr>
##  1 Scorpions Scorpions        /scorpions/10-light-years-away/
##  2 Scorpions Scorpions        /scorpions/20th-century-man/
##  3 Scorpions Scorpions        /scorpions/321/
##  4 Scorpions Scorpions        /scorpions/a-moment-in-million-years/
##  5 Scorpions Scorpions        /scorpions/alex-truntiagin/
##  6 Scorpions Scorpions        /scorpions/aleyah/
##  7 Scorpions Scorpions        /scorpions/alien-nation/
##  8 Scorpions Scorpions        /scorpions/all-for-one/
##  9 Scorpions Scorpions        /scorpions/always-be-with-you/
## 10 Scorpions Scorpions        /scorpions/always-somewhere/
## # ... with 8,974 more rows

We got almost 9K Rock songs. Did we get songs for all 100 artists?

top_classic_rock_acts %>%
anti_join(rock_songs_table, by = "artist")
## # A tibble: 1 x 1
##   artist
##   <chr>
## 1 Ronnie James Dio

Almost, we’re missing songs by Ronnie James Dio. But I’m OK with it. Which artists have the most songs?

rock_songs_table %>%
count(artist, sort = TRUE)
## # A tibble: 99 x 2
##    artist               n
##    <chr>            <int>
##  1 The Beatles        305
##  2 Paul McCartney     288
##  3 Kiss               266
##  4 The Rolling Stones 246
##  5 Bon Jovi           239
##  6 Ramones            228
##  7 Iron Maiden        219
##  8 AC DC              212
##  9 Bob Dylan          212
## 10 Black Sabbath      211
## # ... with 89 more rows

There’s nothing surprising here, but having both “The Beatles” and “Paul McCartney” in our list is a bit suspicious.
If we look for “Hey Jude”:

rock_songs_table %>%
filter(str_detect(url, "hey-jude"))
## # A tibble: 3 x 3
##   artist         processed_artist url
##   <chr>          <chr>            <chr>
## 1 Paul McCartney Paul-McCartney   /paul-mccartney/hey-jude/
## 2 John Lennon    John-Lennon      /john-lennon/hey-jude/
## 3 The Beatles    The-Beatles      /the-beatles/hey-jude/

We see that it exists 3 times: once by The Beatles, once by Paul, once by John. So we’re gonna have to consider removing John, Paul and George (sorry Ringo, you didn’t make the list). To give you another idea how messy internet data can be:

rock_songs_table %>%
filter(str_detect(url, "send-me-an-angel"))
## # A tibble: 2 x 3
##   artist    processed_artist url
##   <chr>     <chr>            <chr>
## 1 Scorpions Scorpions        /scorpions/send-me-an-angel/
## 2 Scorpions Scorpions        /scorpions/send-me-an-angel-acustico/

There can even be two versions of the same song by the same artist. OK, now let’s get the chords:

get_chords_wrapper <- function(url, pb = NULL) {
if (!is.null(pb)) pb$tick()$print()
Sys.sleep(0.5)
chords <- tryCatch(chorrrds::get_chords(url),
error = function(e) {
data.frame(chord = character(0), key = character(0), music = character(0))
})
if (nrow(chords) > 0) {
return(chords)
} else {
return(data.frame(chord = character(0), key = character(0), music = character(0)))
}
}

pb <- progress_estimated(nrow(rock_songs_table))

chords <- rock_songs_table %>%
mutate(chords = map(url, get_chords_wrapper, pb = pb)) %>%
unnest(chords) %>%
filter(str_length(chord) < 15) # instead of chorrrds::clean

chords %>% select(artist, music, chord)
## # A tibble: 280,549 x 3
##    artist    music                         chord
##    <chr>     <chr>                         <chr>
##  1 Scorpions scorpions 10 light years away G
##  2 Scorpions scorpions 10 light years away Em
##  3 Scorpions scorpions 10 light years away G
##  4 Scorpions scorpions 10 light years away Em
##  5 Scorpions scorpions 10 light years away G
##  6 Scorpions scorpions 10 light years away Em
##  7 Scorpions scorpions 10 light years away G
##  8 Scorpions scorpions 10 light years away Em
##  9 Scorpions scorpions 10 light years away G
## 10 Scorpions scorpions 10 light years away Em
## # ... with 280,539 more rows

We got 280K chords (a single chord for each row, for each song), hurray. But not for all songs:

chords %>% count(music) %>% nrow()
## [1] 4502

We got chords for only about half the songs, but it’s still an impressive 4.5K songs. Also some other artists did not survive this cut, we are down to 97:

top_classic_rock_acts %>%
anti_join(chords %>% count(artist), by = "artist")
## # A tibble: 3 x 1
##   artist
##   <chr>
## 1 Ronnie James Dio
## 2 James Gang
## 3 Ted Nugent

# Kids Stuff

While we’re at it, we might as well play a bit with this chords data. You can see more here. What are the most common chords?

chords %>% count(chord, sort = TRUE)
## # A tibble: 1,288 x 2
##    chord     n
##    <chr> <int>
##  1 G     33347
##  2 D     28661
##  3 C     27932
##  4 A     23868
##  5 F     17484
##  6 E     16684
##  7 Am    10856
##  8 Em     9437
##  9 B      7816
## 10 Bb     7296
## # ... with 1,278 more rows

Unsurprisingly we have the major chords A, B, C, D, E, F, G and some minor chords. How many unique chords do songs have?

chords %>%
group_by(music) %>%
summarise(n_chords = n_distinct(chord)) %>%
ggplot(aes(n_chords)) +
geom_histogram(fill = "red", alpha = 0.5) +
labs(title = "Distribution of No. of Unique Chords in Classic Rock",
x = "No. of Unique Chords",
y = "No. of Songs") +
theme(text = element_text(family = "mono", size = 8)) +
theme_bw()

So most songs will have between 2-10 chords, with some over 30 or even 60 unique chords. What?!
chords %>%
group_by(music) %>%
summarise(n_chords = n_distinct(chord)) %>%
arrange(-n_chords)
## # A tibble: 4,502 x 2
##    music                                                 n_chords
##    <chr>                                                    <int>
##  1 pink floyd dark side of the moon                            61
##  2 queen bohemian rhapsody                                     51
##  3 chicago hard habit to break                                 45
##  4 genesis the lamia                                           43
##  5 phil collins everything that i am                           33
##  6 queen time                                                  33
##  7 the beatles golden slumbers carry that weight the end       33
##  8 aerosmith janies got gun 191                                32
##  9 jethro tull thick is brick                                  32
## 10 chicago youre the inspiration                               31
## # ... with 4,492 more rows

Yes, people, according to CifraClub, Pink Floyd’s Dark Side of the Moon contains 61 unique chords! Notice the 7th place there, which is in fact three Beatles songs in one. That’s messy data for you. So which songs have only a single chord? This sounds like an error, and they may need to be discarded:

songs_with_less_than_2_chords <- chords %>%
group_by(music) %>%
summarise(n_chords = n_distinct(chord)) %>%
filter(n_chords == 1)

songs_with_less_than_2_chords
## # A tibble: 73 x 2
##    music                                n_chords
##    <chr>                                   <int>
##  1 ac dc girls got rhythm                      1
##  2 ac dc hold me back                          1
##  3 ac dc rock n roll dream                     1
##  4 aerosmith last child                        1
##  5 aerosmith st john                           1
##  6 aerosmith you gotta move                    1
##  7 alice cooper hard hearted alice             1
##  8 alice cooper woman machine                  1
##  9 black sabbath devil daughter                1
## 10 black sabbath dirty women tab versao        1
## # ... with 63 more rows

You can look up some of these songs; it appears that this is indeed a mistake, caused either by chorrrds parsers or by some weird abnormality in CifraClub. But two chords are not uncommon in Classic Rock.
If you’re an aspiring guitar player as I am2, I’d suggest starting with…

chords %>%
group_by(music) %>%
summarise(n_chords = n_distinct(chord)) %>%
filter(n_chords == 2)
## # A tibble: 82 x 2
##    music                                              n_chords
##    <chr>                                                 <int>
##  1 ac dc aint no fun waiting round to be millionairre        2
##  2 ac dc whole latta rosie                                   2
##  3 ac dc whole lotta rosie                                   2
##  4 aerosmith back in the saddle                              2
##  5 aerosmith critical mass                                   2
##  6 aerosmith i aint got you                                  2
##  7 aerosmith shela                                           2
##  8 aerosmith walk this way                                   2
##  9 black sabbath laguna sunrise                              2
## 10 black sabbath loser gets it all                           2
## # ... with 72 more rows

And who can avoid a nice ggridges plot, a.k.a. Joy plot, of the distribution of no. of unique chords by artist, for the top 20 artists on the list:

top_20_artists <- top_classic_rock_acts %>%
slice(81:100) %>%
pull(artist)

chords %>%
filter(artist %in% top_20_artists) %>%
group_by(artist, music) %>%
summarise(n_chords = n_distinct(chord)) %>%
filter(n_chords >= 2, n_chords <= 20) %>%
group_by(artist) %>%
mutate(median_n_chords = median(n_chords, na.rm = TRUE)) %>%
ggplot(aes(x = n_chords, y = reorder(artist, -median_n_chords), fill = ..x..)) +
ggridges::geom_density_ridges_gradient(rel_min_height = 0.0) +
viridis::scale_fill_viridis(name = "") +
ggridges::theme_ridges(font_size = 13, grid = TRUE) +
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_text(size = 12),
text = element_text(family = "mono")) +
labs(title = 'Classic Rock Acts No. of Unique Chords',
subtitle = 'No. of Unique Chords Distribution\nData: Cifra Club via the chorrrds package')

It shouldn’t surprise you that Elton John, who taught himself how to play the piano when he was four, is right at the top. Go Elton.
# Getting the lyrics

To get the lyrics I decided to also use the CifraClub site, and wrote my own rvest function:

library(rvest)

get_lyrics_from_url <- function(url, pb = NULL) {
if (!is.null(pb)) pb$tick()$print()
html <- str_c("https://www.cifraclub.com.br", url, "letra/") %>%
read_html(.)
lyrics <- NULL
lyrics_optionA <- html %>% html_node(".letra-l")
lyrics_optionB <- html %>% html_node(".letra")
if (length(lyrics_optionA) > 0) {
lyrics <- lyrics_optionA %>%
html_nodes(".l_row") %>%
html_text() %>%
str_c(collapse = " ")
} else if (length(lyrics_optionB) > 0) {
lyrics <- lyrics_optionB %>%
as.character() %>%
str_replace_all("<.*?>", " ") %>%
str_replace_all("\n|\\s+", " ")
}
if (length(lyrics) > 0) {
return(lyrics)
} else {
return(NA_character_)
}
}

pb <- progress_estimated(nrow(rock_songs_table))

lyrics <- chords %>%
select(artist, url, music) %>%
distinct() %>%
mutate(lyrics = map_chr(url, possibly(get_lyrics_from_url, otherwise = NA), pb = pb)) %>%
drop_na() %>%
filter(str_length(lyrics) > 100)

lyrics
## # A tibble: 3,876 x 4
##    artist   url              music           lyrics
##    <chr>    <chr>            <chr>           <chr>
##  1 Scorpio~ /scorpions/10-l~ scorpions 10 l~ In a run down bed and breakf~
##  2 Scorpio~ /scorpions/321/  scorpions 321   Time is a wicked master Put ~
##  3 Scorpio~ /scorpions/a-mo~ scorpions a mo~ The lights are slowly fading~
##  4 Scorpio~ /scorpions/all-~ scorpions all ~ The weekend comes around The~
##  5 Scorpio~ /scorpions/alwa~ scorpions alwa~ I do not want to leave And b~
##  6 Scorpio~ /scorpions/alwa~ scorpions alwa~ "Arrive at seven the place f~
##  7 Scorpio~ /scorpions/are-~ scorpions are ~ Another rainy morning Peopl~
##  8 Scorpio~ /scorpions/ariz~ scorpions ariz~ Arizona really was a gas I w~
##  9 Scorpio~ /scorpions/ave-~ scorpions ave ~ Ave Maria del monte preciosi~
## 10 Scorpio~ /scorpions/back~ scorpions back~ Jealous hearts are wild, oh ~
## # ... with 3,866 more rows

We have lyrics (with over 100 characters) to 3.8K of the 4.5K songs.
Each of the lyrics is one gigantic string, and I’m going to use the unnest_tokens() function from the tidytext package to separate them into words, one per row. I will also remove stop words while I’m at it:

library(tidytext)

lyrics <- lyrics %>%
unnest_tokens(word, lyrics) %>%
anti_join(stop_words, by = "word")

lyrics
## # A tibble: 261,182 x 4
##    artist    url                       music                   word
##    <chr>     <chr>                     <chr>                   <chr>
##  1 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ run
##  2 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ bed
##  3 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ breakfast
##  4 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ view
##  5 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ river
##  6 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ loose
##  7 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ claustroph~
##  8 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ utopia
##  9 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ utopia
## 10 Scorpions /scorpions/10-light-year~ scorpions 10 light yea~ denying
## # ... with 261,172 more rows

We now have a 260K-row lyrics table in which every word of every song has its own row. Before we proceed to CCA, let’s clean up a bit more (maybe I should have done this earlier): filter out of chords songs with 10 or fewer chord occurrences and songs with only a single unique chord; filter out of both chords and lyrics songs by The Beatles’ members; and join the two long datasets so we know they share exactly the same songs.

chords <- chords %>%
add_count(music) %>%
filter(n > 10,
!music %in% songs_with_less_than_2_chords$music,
!artist %in% c("Paul McCartney", "George Harrison", "John Lennon")) %>%
inner_join(
lyrics %>% select(artist, music) %>% distinct(),
by = c("artist", "music"))

lyrics <- lyrics %>%
inner_join(chords %>% select(artist, music) %>% distinct(),
by = c("artist", "music"))

And we are left with chords and lyrics for 3297 songs.

# OK, now what?

Well, it depends. Suppose you suspect there’s a correlation between sad words and minor chords. If you’re a Pearson correlation kind of girl, you might try looking at, say, the percentage of minor chords in a song (e.g. Am) vs. the percentage of sad words in its lyrics (e.g. “lonely”):

get_word_chord_pct_vectors <- function(words_, chords_) {
word_vec <- lyrics %>%
mutate(wordBool = ifelse(word %in% words_, 1, 0)) %>%
group_by(music) %>%
summarise(wordPct = sum(wordBool) / n()) %>%
pull(wordPct)
chord_vec <- chords %>%
mutate(chordBool = ifelse(chord %in% chords_, 1, 0)) %>%
group_by(music) %>%
summarise(chordPct = sum(chordBool) / n()) %>%
pull(chordPct)
as_tibble(cbind(word_vec, chord_vec))
}

res <- get_word_chord_pct_vectors(c("sad", "lonely", "blue", "sorrow"),
str_c(LETTERS[1:7], "m"))
ggplot(res, aes(word_vec, chord_vec)) +
geom_point(color = "red", alpha = 0.5) +
labs(title = "Minor Chords % vs. Sad Words % in Classic Rock Songs",
x = "Sad Words %",
y = "Minor Chords %") +
theme_bw() +
theme(text = element_text(family = "mono", size = 9))

The correlation is 0.03 which is nothing to write home about.

If you’re more of a contingency table person, you might look at the relation of these two dichotomous variables: Does a song have these minor chords (yes/no) and does it contain these sad words in its lyrics (yes/no):

res_boolean <- res %>%
transmute(minor_chords = ifelse(chord_vec > 0, "yes", "no"),
sad_words = ifelse(word_vec > 0, "yes", "no"))

table(res_boolean)
##             sad_words
## minor_chords   no  yes
##          no  1162  186
##          yes 1559  390

Among songs with sad words, the chance of also having minor chords is 67%, while among songs without sad words this chance goes down to 57%. That may not sound so impressive to you, but it is statistically significant, as a Chi-square test confirms. A nice mosaic plot to show this:

library(ggmosaic)

ggplot(res_boolean) +
geom_mosaic(aes(x = product(minor_chords, sad_words), fill = minor_chords)) +
labs(x = "Song has Sad Words",
y = "Song has Minor Chords") +
scale_fill_discrete(guide = FALSE) +
theme_bw() +
theme(text = element_text(family = "mono", size = 10))
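
The Chi-square test itself isn’t shown above, so here is a quick sketch, rebuilding the 2x2 table from the counts printed earlier (the variable names are mine):

```r
# counts copied from the table above: rows = minor_chords, cols = sad_words
tab <- matrix(c(1162, 1559, 186, 390), nrow = 2,
              dimnames = list(minor_chords = c("no", "yes"),
                              sad_words = c("no", "yes")))

prop.table(tab, margin = 2)["yes", ]  # P(minor chords | sad words no/yes): ~0.57 vs ~0.68
chisq.test(tab)                       # highly significant
```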

So this is all very nice, but what if the correlation is higher if we add in “heart” and “broken” to our sad words list? What if we should look at some of the minor chords in addition to a few dominant 7th chords (e.g. A7)? The possibilities are endless.

# Canonical Correlation Analysis is not a curse

I’ll try to give intuition as well as aim toward the next part on Sparse CCA. Statisticians, don’t be mad at me.

Remember that Pearson r correlation coefficient? This is the formula for r, for two vectors $$x$$ and $$y$$ of length $$n$$:

$$r = Corr(x, y) = \frac{\widehat{Cov}(x, y)}{\sqrt{\widehat{Var}(x)}\sqrt{\widehat{Var}(y)}} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2}\sqrt{\sum_{i=1}^{n} (y_i - \overline{y})^2}}$$

If we center $$x$$ and $$y$$, making sure their means are 0, we can write:

$$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}}$$

But that just means:

$$r = \frac{x^t y} {\sqrt{x^t x}\sqrt{y^t y}}$$

And if we further assume the variance of $$x$$ and $$y$$ to be $$\frac{1}{n-1}$$3, meaning that $$x^t x = 1$$ and $$y^t y = 1$$ (they have an L2 norm of 1), we finally get:

$$(1)\space r = x^ty \text{ s.t.} \sum{x_i}=0; \sum{y_i}=0; x^tx=1; y^ty=1$$

If you don’t believe me try it yourself. Set random x and random y, see their correlation:

x <- runif(20, -10, 10)
y <- runif(20, -10, 10)

cor(x, y)
## [1] 0.0946337

Now scale them to have mean 0 and variance $$\frac{1}{n-1}$$ and compute manually:

center_vec <- function(x) x - mean(x)
norm_vec <- function(x) sqrt(sum(x^2))
x_scaled <- center_vec(x) / norm_vec(center_vec(x))
y_scaled <- center_vec(y) / norm_vec(center_vec(y))

t(x_scaled) %*% y_scaled
##           [,1]
## [1,] 0.0946337

I hope you’re convinced; remember equation $$(1)$$. It basically tells you that the simple correlation between two vectors is their dot product, under some constraints. And the dot product between two vectors with L2 norm of 1 is actually the cosine of the angle between them. The higher the correlation, the larger the cosine and the smaller the angle between them, which gives another interpretation of correlation.
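
A quick check of the cosine claim (the unit() helper and the seed are mine):

```r
set.seed(42)
x <- runif(20, -10, 10)
y <- runif(20, -10, 10)

# center, then scale to unit L2 norm
unit <- function(v) { v <- v - mean(v); v / sqrt(sum(v^2)) }

# the dot product of the two unit vectors is both cos(angle) and cor(x, y)
cosine <- sum(unit(x) * unit(y))
all.equal(cosine, cor(x, y))
## [1] TRUE
acos(cosine)  # the angle between the two vectors, in radians
```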

In our case, we don’t have a pair of vectors $$x$$ and $$y$$, we have a pair of matrices, $$X$$ and $$Y$$. $$X$$ will be the chords matrix, of size n songs X p chords. $$Y$$ will be the lyrics matrix, of size n songs X q words. We’re taking the bag-of-words approach: An entry $$(i, j)$$ in these matrices means “the no. of times we see chord/word $$j$$ in song $$i$$”. Easy to do this with cast_dtm() from the tidytext package:

X <- chords %>%
count(music, chord) %>%
cast_dtm(music, chord, nn) %>%
as.matrix()

Y <- lyrics %>%
count(music, word) %>%
cast_dtm(music, word, n) %>%
as.matrix()

dim(X)
## [1] 3297 1132
dim(Y)
## [1]  3297 16263

So $$n$$ is 3297 songs, $$p$$ is 1132 chords and $$q$$ is 16263 words.

Now we are going to “project” our $$p$$-dimensional $$X$$ to a vector of length $$n$$, by multiplying it with a vector $$u$$ of length $$p$$: $$Xu_{nx1}$$. Similarly with $$Y$$ and a vector $$v$$ of length $$q$$: $$Yv_{nx1}$$. We will search for projections $$u$$ and $$v$$ which will maximize the correlation between (now simple vectors) $$Xu$$ and $$Yv$$. If we keep in mind the previous interpretation of correlation being the cosine of the angle between vectors, we are seeking a “viewpoint” $$u$$ to look at high-dimensional $$X$$, and a “viewpoint” $$v$$ to look at high-dimensional $$Y$$, which will make the resulting vectors closest (having the smallest angle) in this common space.4

$$(u, v) = \text{argmax}_{u,v}{Corr(Xu, Yv)}$$

But we already know how to write $$Corr(Xu, Yv)$$ from equation $$(1)$$!

$$(2)\space (u, v)= \text{argmax}_{u,v}{Corr(Xu, Yv)} = \text{argmax}_{u,v}u^tX^tYv$$ $$\text{ s.t. }\sum{Xu_i}=0;\sum{Yv_i}=0;u^tX^tXu=1;v^tY^tYv=1$$

It turns out that, for simple problems, equation $$(2)$$ has closed-form solutions. You get multiple vectors $$u$$ and $$v$$ by extracting the first $$k$$ eigenvectors5 of the matrices $$(X^tX)^{-1}X^tY(Y^tY)^{-1}Y^tX$$ and $$(Y^tY)^{-1}Y^tX(X^tX)^{-1}X^tY$$, respectively. And the resulting correlations, the $$Corr(Xu, Yv)$$, are in fact the “Canonical Correlations”, which I hope you can now see are a sort of generalization of the well known Pearson’s r correlation coefficient. It looks awful, so you’d probably do it with R’s cancor() function, but we’re not going to proceed with that.
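
You wouldn’t compute it this way at scale, but on small random matrices you can check the closed form against R’s built-in cancor() (a quick sketch; all names here are mine):

```r
set.seed(42)
n <- 100; p <- 3; q <- 4
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)  # centered columns
Y <- scale(matrix(rnorm(n * q), n, q), scale = FALSE)

# eigenvalues of (X'X)^-1 X'Y (Y'Y)^-1 Y'X are the squared canonical correlations
M <- solve(t(X) %*% X) %*% t(X) %*% Y %*% solve(t(Y) %*% Y) %*% t(Y) %*% X
sqrt(Re(eigen(M)$values))  # matches cancor(X, Y)$cor up to numerical error

# and no single random "viewpoint" (u, v) can beat the first canonical correlation:
u <- rnorm(p); v <- rnorm(q)
abs(cor(X %*% u, Y %*% v)) <= cancor(X, Y)$cor[1]
## [1] TRUE
```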

Why not?!

First of all, our case is not a simple case. CCA was devised in the days when $$p$$ and $$q$$ were small, both in general and in comparison to $$n$$. I think psychologists mainly used CCA to correlate two different questionnaires measuring the same quality on the same $$n$$ subjects: one with $$p$$ questions, typically no more than a few dozen, one with $$q$$ questions. We, however, have a large $$n$$ and an even larger $$q$$ in the case of our $$Y$$ lyrics matrix. This makes all multiplications, inversions and eigendecompositions of matrices (e.g. $$(Y^tY)^{-1}$$) long and unstable.

Second, think of what the $$u$$ and $$v$$ vectors will mean to us. They will be weight vectors on $$X$$ and $$Y$$ respectively, on chords and lyrics, specifying an amount and direction for each and every chord, and for each and every word in our lyrics. As these Regression-like solutions tend to be, $$u$$ and $$v$$ will not be sparse: they will probably have all elements different from zero, making interpretation very difficult for us.

# Enter Sparse CCA

In my thesis I review a few methods to handle this high-dimensional problem and to achieve sparsity in $$u$$ and $$v$$. Here I will focus on a method which extends my explanation above in a natural way, the “Diagonal Penalized CCA” or DP-CCA, by Daniela Witten, Robert Tibshirani6 and Trevor Hastie7.

Witten et al. devise a general approach for decomposing matrices with penalties, “Penalized Matrix Decomposition” (PMD). They show how Sparse CCA can be achieved by essentially performing what they call $$PMD(L_1,L_1)$$ on the covariance matrix $$X^tY$$, with some modifications to $$(2)$$:

1. The assumption on $$X$$ and $$Y$$ being centered ($$\sum{Xu_i}=0;\sum{Yv_i}=0$$) comes built-in, without loss of generality. So it isn’t written.
2. They treat $$X^tX$$ and $$Y^tY$$ as the identity matrix $$I$$, a step that has proven to be beneficial in high-dimensional problems where we expect sparsity.
3. They loosen up the variance constraint, the L2 norm constraint, to be “smaller or equal to 1” rather than “equal to 1”, to make the problem convex.
4. They add the Lasso penalty, controlling the L1 norm of $$u$$ and $$v$$, which tends to bring sparse solutions.

$$(3)\space (u, v)= \text{argmax}_{u,v}{Corr(Xu, Yv)} \approx \text{argmax}_{u,v}u^tX^tYv$$ $$\text{ s.t. }u^tu\leq1;v^tv\leq1; ||u||_1\leq c_1; ||v||_1\leq c_2$$

With $$c_1$$ and $$c_2$$ being penalty parameters which can be chosen via cross-validation.

And from here the problem is solved iteratively: assuming some vector $$v$$, the problem becomes very similar to the Lasso problem:

$$u = \text{argmax}_{u}u^tX^tYv\text{ s.t. }u^tu\leq1;||u||_1\leq c_1$$

We are “regressing” $$X$$ on $$Yv$$ (which is now a given vector!) to get an update for vector $$u$$, with this $$u$$ “regressing” $$Y$$ on $$Xu$$ to get an update for vector $$v$$, and so on until convergence. Using Regression also means high speed!

These are $$u_1$$ and $$v_1$$. To get the next pair of $$u_2$$ and $$v_2$$, they repeat the process on the “remainder” of the $$X^tY$$ covariance matrix, once you subtract the resulting8 $$u_1^tX^tYv_1\cdot u_1\cdot v_1^t$$ matrix from it. And so on.
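
To make the iteration concrete, here is a minimal sketch of one rank-1 PMD step (the function names are mine; the real PMA algorithm binary-searches the threshold levels to hit the exact L1 bounds, while this sketch just fixes them):

```r
soft <- function(a, c) sign(a) * pmax(abs(a) - c, 0)  # Lasso-style soft thresholding
l2 <- function(a) max(sqrt(sum(a^2)), 1e-12)          # safe L2 norm

# one sparse rank-1 "pair" for the covariance matrix Z = t(X) %*% Y
pmd_rank1 <- function(Z, c1 = 0, c2 = 0, niter = 50) {
  v <- svd(Z)$v[, 1]                           # warm start at the leading SVD
  for (i in seq_len(niter)) {
    u <- soft(Z %*% v, c1);    u <- u / l2(u)  # "regress" X on Yv, then threshold
    v <- soft(t(Z) %*% u, c2); v <- v / l2(v)  # "regress" Y on Xu, then threshold
  }
  d <- as.numeric(t(u) %*% Z %*% v)
  list(u = drop(u), v = drop(v), d = d)
}

set.seed(1)
Z <- matrix(rnorm(20 * 8), 20, 8)
res <- pmd_rank1(Z, c1 = 1, c2 = 0.5)  # thresholding zeroes out small weights

# deflate to get the next pair, as described above
Z_next <- Z - res$d * res$u %*% t(res$v)
```

With c1 = c2 = 0 no thresholding happens and the iteration reduces to the power method, recovering the leading singular vectors of Z.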

Sounds complicated? I tried. Anyway, as always in R, “there’s a package for it”. Witten et al. fortunately released the PMA package, in which you get the CCA() function.

# Minor Empty, Major Baby

If you input the above X and Y matrices into PMA::CCA() you’ll get weird results, biased by very long songs. It is crucial here to perform TF-IDF weighting, and I will also remove sparse terms, say words or chords which appear in less than 1% of songs. The tm package is extremely useful here.

X <- chords %>%
count(music, chord) %>%
cast_dtm(music, chord, nn) %>%
tm::removeSparseTerms(., 0.99) %>%
tm::weightTfIdf() %>%
as.matrix()

Y <- lyrics %>%
count(music, word) %>%
cast_dtm(music, word, n) %>%
tm::removeSparseTerms(., 0.99) %>%
tm::weightTfIdf() %>%
as.matrix()

dim(X)
## [1] 3297   79
dim(Y)
## [1] 3297  784

We are left with 3297 songs, $$p$$ is 79 chords and $$q$$ is 784 words. Less impressive but cleaner, I assure you.

We’ll take PMA::CCA() default penalty parameters, and ask for 10 canonical correlations:

library(PMA)

K <- 10
scca <- CCA(X, Y, K = 10, niter = 100)

Let’s look at the output summary:

print(scca)
## Call: CCA(x = X, z = Y, K = 10, niter = 100)
##
##
## Num non-zeros u's:  9 9 14 22 16 18 37 30 32 27
## Num non-zeros v's:  206 144 221 262 177 279 277 199 268 212
## Type of x:  standard
## Type of z:  standard
## Penalty for x: L1 bound is  0.3
## Penalty for z: L1 bound is  0.3
## Cor(Xu,Zv):  0.4481368 0.3475206 0.4145097 0.4594632 0.4311564 0.4451745
## 0.489829 0.4203822 0.463351 0.4276246

Where it says “Num non-zeros u’s”, the output reports how many elements in each $$u$$ are non-zero, meaning important chords; the same goes for “Num non-zeros v’s” and words. Where it says “Cor(Xu,Zv)”, these are the actual canonical correlations, here ranging between 0.3 and 0.5, which is not bad for messy internet data and the bag-of-words approach.

The first pair of $$u$$ and $$v$$ has 9 non-zero chords (out of 79!) correlated with 206 non-zero words (out of 784!), and a correlation of 0.448. The top 5 chords in terms of absolute weights and corresponding weights are:

k <- 1
important_id <- which(scca$u[, k] != 0)
important_id_order <- order(abs(scca$u[important_id, k]), decreasing = TRUE)
important_id <- important_id[important_id_order[1:5]]
important_id <- important_id[!is.na(important_id)]
colnames(X)[important_id]
## [1] "B5"  "A5"  "E5"  "D5"  "F#5"
scca$u[important_id, k]
## [1] 0.5120943 0.4797985 0.4200780 0.3749777 0.2375822

The top 5 words and corresponding weights are:

important_id <- which(scca$v[, k] != 0)
important_id_order <- order(abs(scca$v[important_id, k]), decreasing = TRUE)
important_id <- important_id[important_id_order[1:5]]
important_id <- important_id[!is.na(important_id)]
colnames(Y)[important_id]
## [1] "edge"    "hell"    "tryin"   "thunder" "blood"
scca$v[important_id, k]
## [1] 0.3788166 0.2625134 0.2603274 0.2382183 0.2306617

If you know some chords this pattern makes perfect sense! The A5, B5, D5 etc. chords are also known as Power Chords, usually played with electric guitars, and are used mainly in hard rock songs (think Guns n Roses), where words such as “hell”, “thunder” and… “blood” are used abundantly.

It is a bit exhausting to look at these vectors this way, so I concocted some functions, using some ggplot and patchwork, to present the 5 most positive and 5 most negative chords/words side by side:

library(patchwork)

weights_tibble <- function(m1, m2) {
as_tibble(m1) %>%
mutate(label = colnames(m2)) %>%
mutate_if(is.numeric,
funs(case_when(. == 0 ~ runif(n(), -0.01, 0.01), TRUE ~ .))) %>%
select(label, everything())
}

max_weights <- function(tib, k, top, n, col) {
tn <- ifelse(top, n, -n)
suppressMessages(
tib %>%
select(1, k + 1) %>%
top_n(tn) %>%
slice(1:n) %>%
mutate(color = col) %>%
set_names(c("label", "weight", "color"))
)
}

max_weights_combined <- function(M, k, n, col_top, col_bot) {
combined <- rbind(
max_weights(M, k, TRUE, n, col_top),
max_weights(M, k, FALSE, n, col_bot)
) %>%
mutate(weight = ifelse(abs(weight) <= 0.01, 0, weight)) %>%
arrange(weight)
combined$label_factor <- factor(combined$label,
levels = as.character(combined$label))
combined
}

plot_weights_single_side <- function(combined, title) {
combined %>%
mutate(pos_label = ifelse(weight > 0, label, NA),
neg_label = ifelse(weight < 0, label, NA)) %>%
ggplot(aes(label_factor, weight, fill = color)) +
geom_bar(stat = "identity", alpha = 0.5) +
coord_flip() +
theme_bw() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(family = "mono", hjust = 0.05,
margin = margin(t = 10, b = -15))) +
ylim(c(-1.1, 1.1)) +
labs(title = title, y = NULL, x = NULL) +
scale_fill_identity() +
scale_color_identity() +
geom_text(aes(y = weight, label = pos_label, color = color),
na.rm = TRUE, nudge_y = 0.05, family = "mono", hjust = "left") +
geom_text(aes(y = weight, label = neg_label, color = color),
na.rm = TRUE, nudge_y = -0.05, family = "mono", hjust = "right")
}

plot_weights <- function(res, x, y, k, title = NULL) {
U <- weights_tibble(res$u, x)
V <- weights_tibble(res$v, y)
chords_combined <- max_weights_combined(U, k, 5, "darkgreen", "red")
words_combined <- max_weights_combined(V, k, 5, "blue", "orange")
p1 <- plot_weights_single_side(chords_combined, "Chords")
p2 <- plot_weights_single_side(words_combined, "Lyrics")
p1 + p2 + plot_annotation(title = title)
}

plot_weights(scca, X, Y, 1, "Power chords are correlated with hard rock lyrics")

To reiterate, what does this pair of vectors tell us? It creates a sort of dictionary, two mappings into a common space, between the chords space and the lyrics space: “Take only the chords listed here, multiply their entries (= go along these dimensions) by these coefficients. Take only the words listed here, multiply their entries by these coefficients. You now have the two most correlated, close, similar vectors.” In other words, having Power Chords is similar to having words like “hell” and “blood”.

2nd canonical correlation pair:

plot_weights(scca, X, Y, 2, "Major chords A, E, B are correlated with Soft Rock happy lyrics")

As can be seen, early Beatlesy lyrics with words such as “shake”, “baby”, “ya”, “happy” go with basic major chords A, B, E. Words like “lost” and “desert” are correlated with a mix of basic major and minor chords such as G and Am.

3rd canonical correlation pair:

plot_weights(scca, X, Y, 3, "Flat chords are correlated with weird 70s rock lyrics?")

I’m not sure I get this correlation.

4th canonical correlation pair:

plot_weights(scca, X, Y, 4, "Power Chords, Electric Guitars, Hard Rock Lyrics")

Again, Power chords.

5th canonical correlation pair:

plot_weights(scca, X, Y, 5, "Subdivision within Power chords?")

Notice the same theme may repeat here; I remember that well from my thesis9.

# Rock On!

I hope it was as fun for you as it was for me. Heck, I hope you stayed awake. As always, the road does not end here. The road is long with many a winding turn. There’s even a generalization to Multiple Canonical Correlation Analysis, if you find another module of features $$Z$$ for these songs, say ratings by listeners or performance in charts around the world. This field of research grew strongly up until, say, 2012 and the coming of the Deep Learning age. As always, I claim that knowing the statistical history shouldn’t be a burden, and in many cases will give you equivalent results, faster.

1. Wundervald & Trecenti (2018, Aug. 19). R-Music: Introduction to the chorrrds package. Retrieved from https://r-music.rbind.io/posts/2018-08-19-chords-analysis-with-the-chorrrds-package/

2. That is, aspiring for over 30 years, haha.

3. We can scale these vectors to have any standard deviation we want!

4. I apologize if I make things worse, it is not my intention :(

5. Corresponding to the first ordered $$k$$ eigenvalues.

6. Yes, the same one from my last post, and it is no coincidence, as you’ll see.

7. Witten, D., Tibshirani, R., Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534.

8. This is the matrix which approximates $$X^tY$$ best under these constraints.

9. 8 years ago!