I have been struggling with understanding Recurrent Neural Networks (RNN) and Long Short Term Memory (LSTM) for a while. I find that explaining a topic to other people really helps in nailing down just what it is you don’t understand, and eventually “getting it”. This week I made my laptop rap on its own. I thought I was being very original, but it turns out this is not the case (see here, here and here). The important thing is I think I understand RNNs just a little bit better, and I’m going to share this now. If you find mistakes in my math or logic - please let me know.

Scrapin’

NOTE: If you’re not interested in the Scraping part, just skip this section. Bottom-line, I got rhymes.

Your laptop is a newborn baby. It does not know any English. In order for it to rap, you’re gonna have to teach it “RapEnglish” and show it some rhymes. First, let’s extract a list of Rap/Hip-Hop musicians from Wikipedia:

library(tidyverse)
library(rvest)
library(stringr)
library(magrittr)

url <- "https://en.wikipedia.org/wiki/List_of_hip_hop_musicians"
hipHopMusicians <- read_html(url) %>%
  html_nodes("a") %>%
  html_text() %>%
  .[35:1410] %>%
  discard(. %in% c("edit", "", "[1]", "[2]"))

url <- "https://en.wikipedia.org/wiki/List_of_hip_hop_groups"
hipHopGroups <- read_html(url) %>%
  html_nodes("a") %>%
  html_text() %>%
  .[35:688] %>%
  discard(. == "edit")

artistsTable <- tibble(artist = c(hipHopMusicians, hipHopGroups)) %>%
  distinct()

artistsTable
## # A tibble: 1,812 x 1
##              artist
##               <chr>
##  1 Afrika Bambaataa
##  2         100 Kila
##  3             100s
##  4         12 Gauge
##  5         2 Chainz
##  6        2 Pistols
##  7          2$ Fabo
##  8        21 Savage
##  9             2Mex
## 10              360
## # ... with 1,802 more rows

So we got a list of 1,812 Hip Hop acts. Now, my intention was to scrape AZLyrics.com to get 30 years’ worth of rap lyrics. AZ Lyrics looked like an ideal site to scrape because of its really clear structure: each artist has their own page containing links to all of their songs, and each link leads to that song’s lyrics. The following code is theoretically sound, but don’t run it just yet:

# a function to get the artist's AZ Lyrics html page "stripped" name, e.g. "Lil' Kim" => "lilkim"
stripArtistName <- function(artist) {
  artist %>%
    tolower() %>%
    str_replace_all(., "[^a-zA-Z0-9]", "")
}

# a function to get an artist's full AZ lyrics URL
createAZArtistURL <- function(artistStripped) {
  subLetter <- str_sub(artistStripped, 1, 1)
  if (suppressWarnings(!is.na(as.numeric(subLetter)))) {
    subLetter <- "19"
  }
  paste0("http://www.azlyrics.com/", subLetter, "/", artistStripped, ".html")
}

# a function to get the list of all of an artist's song URLs
getArtistSongsURLList <- function(artistAZURL, pb = NULL) {
  if (!is.null(pb)) pb$tick()$print()
  
  Sys.sleep(sample(seq(0, 1, 0.25), 1))
  
  read_html(artistAZURL) %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    discard(!str_detect(., "\\.\\./lyrics") | is.na(.)) %>%
    str_replace(., "\\.\\.", "http://www.azlyrics.com")
}

# a function to get the lyrics of a song URL, as one long string
getSongLyrics <- function(songURL, pb = NULL) {
  if (!is.null(pb)) pb$tick()$print()
  
  Sys.sleep(sample(seq(0, 1, 0.25), 1))
  
  read_html(songURL) %>%
    html_nodes("div") %>%
    .[[22]] %>%
    as.character() %>%
    str_replace_all(., "\r\n", "\n") %>%
    str_replace_all(., "<br><br>", "\n") %>%
    str_replace_all(., "<.*?>|\\[.*?\\]", " ") %>%
    str_replace_all(., " +\n", "\n")
}

artistsTable <- artistsTable %>%
  mutate(artistStripped = map_chr(artist, stripArtistName))

pb <- progress_estimated(nrow(artistsTable))

songsTable <- artistsTable %>%
  mutate(artistAZURL = map_chr(artistStripped, createAZArtistURL),
         songURL = map(artistAZURL, possibly(getArtistSongsURLList, otherwise = NULL), pb = pb)) %>%
  unnest(songURL)

pb <- progress_estimated(nrow(songsTable))

songsTable %<>%
  mutate(lyrics = map_chr(songURL, possibly(getSongLyrics, otherwise = ""), pb))

Notice I even followed Bob Rudis’ advice about being nice to websites, putting in a Sys.sleep once in a while to give their servers a chance to breathe. So what’s wrong with this code? The thing is, the AZ Lyrics website does not like to be scraped. I’m guessing they have some kind of maximum-requests-per-day policy, above which the IP address you’re surfing from gets blocked for a few hours. Or days. And this is just what happened to me, after I had extracted ~100 of my 1,812 Hip Hop acts. Luckily, there is a thing called archive.org - a website which every now and then archives many other websites, including AZ Lyrics. You might not get the most up-to-date info, but it’s great for my needs. At this stage I also realized that asking for the lyrics of 1.8K artists is ridiculous overkill for what I’m trying to accomplish, so I stuck to ~30 top acts I was familiar with:

customArtists <- c(
  "eminem",
  "jayz",
  "drdre",
  "snoopdogg",
  "west",
  "2pac",
  "ludacris",
  "icecube",
  "lilwayne",
  "notorious",
  "50cent",
  "nas",
  "kendricklamar",
  "busta",
  "future",
  "game",
  "chancetherapper",
  "methodman",
  "2chainz",
  "ghostface",
  "tylerthecreator",
  "birdman",
  "tribecalledquest",
  "wutang",
  "nwa",
  "outkast",
  "gangstarr",
  "publicenemy",
  "rundmc",
  "cypress",
  "roots",
  "d12",
  "fugees"
)

createAZArtistURL <- function(artistStripped) {
  subLetter <- str_sub(artistStripped, 1, 1)
  if (suppressWarnings(!is.na(as.numeric(subLetter)))) {
    subLetter <- "19"
  }
  paste0("http://web.archive.org/web/20170330010509/http://www.azlyrics.com/",
         subLetter, "/", artistStripped, ".html")
}

getArtistSongsURLList <- function(artistAZURL, pb = NULL) {
  if (!is.null(pb)) pb$tick()$print()
  
  Sys.sleep(sample(seq(0, 1, 0.25), 1))
  
  read_html(artistAZURL) %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    discard(!str_detect(., "/lyrics/") | is.na(.)) %>%
    str_replace(., "/", "http://web.archive.org/")
}

getSongLyrics <- function(songURL, pb = NULL) {
  if (!is.null(pb)) pb$tick()$print()
  
  Sys.sleep(sample(seq(0, 1, 0.25), 1))
  
  read_html(songURL) %>%
    html_nodes("div") %>%
    .[[39]] %>%
    as.character() %>%
    str_replace_all(., "\r\n", "\n") %>%
    str_replace_all(., "<br><br>", "\n") %>%
    str_replace_all(., "<.*?>|\\[.*?\\]", " ") %>%
    str_replace_all(., " +\n", "\n")
}

pb <- progress_estimated(length(customArtists))

songsTable <- tibble(artist = customArtists) %>%
  mutate(artistAZURL = map_chr(artist, createAZArtistURL),
         songURL = map(artistAZURL, possibly(getArtistSongsURLList, otherwise = NULL), pb = pb)) %>%
  unnest(songURL)

pb <- progress_estimated(nrow(songsTable))

songsTable %<>%
  mutate(lyrics = map_chr(songURL, possibly(getSongLyrics, otherwise = ""), pb))

write_file(paste(songsTable$lyrics, collapse = "\n"), path = "all_rap_custom_artists.txt")

This got me a text file with 400K (!) lines of rap, containing 14 million (!) characters. For comparison, the Nietzsche example from the keras manual has ~9K lines with 600K characters. The Bible is said to have ~3.5M characters. So 14 million is a lot.
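By the way, if you want to verify these numbers yourself, a quick count over the file we just wrote should do it (this little check is my own addition, nothing below depends on it):

# a quick sanity check of the corpus size
raps <- read_lines("all_rap_custom_artists.txt")
length(raps)      # number of lines, ~400K
sum(nchar(raps))  # number of characters, ~14 million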

Inputtin’

We have the rhymes. 400K lines, 14 million characters. This should be enough to train an RNN which will make the laptop rap. But simply copy-pasting the Nietzsche example from the keras manual won’t work, because there’s not enough RAM in the world to do what we’re about to do with 14 million characters, so let’s start small, with the first 1K lines:

library(keras)
library(stringr)
library(tokenizers)
library(readr)   # for read_lines
library(purrr)   # for map and transpose, used below

maxlen <- 60

path <- "~/all_rap_custom_artists.txt"

text <- read_lines(path, n_max = 1000) %>%
  str_to_lower() %>%
  str_c(collapse = "\n") %>%
  tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

We read the first 1,000 lines, turned all letters to lowercase, and tokenized the entire text into separate characters:

head(text, 100)
##   [1] "\n" "\n" "o"  "h"  " "  "y"  "e"  "a"  "h"  ","  " "  "t"  "h"  "i" 
##  [15] "s"  " "  "i"  "s"  " "  "e"  "m"  "i"  "n"  "e"  "m"  " "  "b"  "a" 
##  [29] "b"  "y"  ","  " "  "b"  "a"  "c"  "k"  " "  "u"  "p"  " "  "i"  "n" 
##  [43] " "  "t"  "h"  "a"  "t"  " "  "m"  "o"  "t"  "h"  "e"  "r"  "f"  "u" 
##  [57] "c"  "k"  "i"  "n"  "g"  " "  "a"  "s"  "s"  "\n" "o"  "n"  "e"  " " 
##  [71] "t"  "i"  "m"  "e"  " "  "f"  "o"  "r"  " "  "y"  "o"  "u"  "r"  " " 
##  [85] "m"  "o"  "t"  "h"  "e"  "r"  " "  "f"  "u"  "c"  "k"  "i"  "n"  "g" 
##  [99] " "  "m"

You see here the beginning of Eminem’s “Infinite” track from his first album, “cut” into characters. Overall we have 41,465 characters. And in them, we have:

chars <- text %>%
  unique() %>%
  sort()

chars
##  [1] "'"  "-"  " "  "\n" "!"  "\"" "("  ")"  "*"  ","  "."  "?"  "1"  "2" 
## [15] "3"  "5"  "6"  "7"  "9"  "a"  "å"  "b"  "c"  "d"  "e"  "f"  "g"  "h" 
## [29] "i"  "j"  "k"  "l"  "m"  "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v" 
## [43] "w"  "x"  "y"  "z"

46 unique characters. Now, the way our RNN will “learn” this lexicon well enough to eventually rap by itself is by getting a bunch of maxlen-long “sentences”, each paired with the character which follows it. It will try to guess that next character (one of 46 options). If it succeeds - great; if it doesn’t, it will have to update some of its “knowledge”. More on that later. In the above example, the RNN will get the 60-character “sentence” made of two newlines followed by “oh yeah, this is eminem baby, back up in that motherfuckin”, and will have to guess that the next letter is “g”.

Let’s get many many sentences:

dataset <- map(
  seq(1, length(text) - maxlen - 1, by = 3),
  ~list(sentence = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
)

dataset <- transpose(dataset)

length(dataset$sentence)
## [1] 13802
head(dataset$sentence)
## [[1]]
##  [1] "\n" "\n" "o"  "h"  " "  "y"  "e"  "a"  "h"  ","  " "  "t"  "h"  "i" 
## [15] "s"  " "  "i"  "s"  " "  "e"  "m"  "i"  "n"  "e"  "m"  " "  "b"  "a" 
## [29] "b"  "y"  ","  " "  "b"  "a"  "c"  "k"  " "  "u"  "p"  " "  "i"  "n" 
## [43] " "  "t"  "h"  "a"  "t"  " "  "m"  "o"  "t"  "h"  "e"  "r"  "f"  "u" 
## [57] "c"  "k"  "i"  "n" 
## 
## [[2]]
##  [1] "h" " " "y" "e" "a" "h" "," " " "t" "h" "i" "s" " " "i" "s" " " "e"
## [18] "m" "i" "n" "e" "m" " " "b" "a" "b" "y" "," " " "b" "a" "c" "k" " "
## [35] "u" "p" " " "i" "n" " " "t" "h" "a" "t" " " "m" "o" "t" "h" "e" "r"
## [52] "f" "u" "c" "k" "i" "n" "g" " " "a"
## 
## [[3]]
##  [1] "e"  "a"  "h"  ","  " "  "t"  "h"  "i"  "s"  " "  "i"  "s"  " "  "e" 
## [15] "m"  "i"  "n"  "e"  "m"  " "  "b"  "a"  "b"  "y"  ","  " "  "b"  "a" 
## [29] "c"  "k"  " "  "u"  "p"  " "  "i"  "n"  " "  "t"  "h"  "a"  "t"  " " 
## [43] "m"  "o"  "t"  "h"  "e"  "r"  "f"  "u"  "c"  "k"  "i"  "n"  "g"  " " 
## [57] "a"  "s"  "s"  "\n"
## 
## [[4]]
##  [1] ","  " "  "t"  "h"  "i"  "s"  " "  "i"  "s"  " "  "e"  "m"  "i"  "n" 
## [15] "e"  "m"  " "  "b"  "a"  "b"  "y"  ","  " "  "b"  "a"  "c"  "k"  " " 
## [29] "u"  "p"  " "  "i"  "n"  " "  "t"  "h"  "a"  "t"  " "  "m"  "o"  "t" 
## [43] "h"  "e"  "r"  "f"  "u"  "c"  "k"  "i"  "n"  "g"  " "  "a"  "s"  "s" 
## [57] "\n" "o"  "n"  "e" 
## 
## [[5]]
##  [1] "h"  "i"  "s"  " "  "i"  "s"  " "  "e"  "m"  "i"  "n"  "e"  "m"  " " 
## [15] "b"  "a"  "b"  "y"  ","  " "  "b"  "a"  "c"  "k"  " "  "u"  "p"  " " 
## [29] "i"  "n"  " "  "t"  "h"  "a"  "t"  " "  "m"  "o"  "t"  "h"  "e"  "r" 
## [43] "f"  "u"  "c"  "k"  "i"  "n"  "g"  " "  "a"  "s"  "s"  "\n" "o"  "n" 
## [57] "e"  " "  "t"  "i" 
## 
## [[6]]
##  [1] " "  "i"  "s"  " "  "e"  "m"  "i"  "n"  "e"  "m"  " "  "b"  "a"  "b" 
## [15] "y"  ","  " "  "b"  "a"  "c"  "k"  " "  "u"  "p"  " "  "i"  "n"  " " 
## [29] "t"  "h"  "a"  "t"  " "  "m"  "o"  "t"  "h"  "e"  "r"  "f"  "u"  "c" 
## [43] "k"  "i"  "n"  "g"  " "  "a"  "s"  "s"  "\n" "o"  "n"  "e"  " "  "t" 
## [57] "i"  "m"  "e"  " "
head(dataset$next_char)
## [[1]]
## [1] "g"
## 
## [[2]]
## [1] "s"
## 
## [[3]]
## [1] "o"
## 
## [[4]]
## [1] " "
## 
## [[5]]
## [1] "m"
## 
## [[6]]
## [1] "f"

So dataset is an object containing two lists of length 13802: the sentence list which holds the “sentences” (see my example in the first one) and the next_char list which holds the “next characters”.

Unfortunately the RNN isn’t good with characters; it only knows numbers. We will now “vectorize” this data into a numeric matrix form which the RNN can handle.

X <- array(0, dim = c(length(dataset$sentence), maxlen, length(chars)))
y <- array(0, dim = c(length(dataset$sentence), length(chars)))

for(i in 1:length(dataset$sentence)){
  X[i, , ] <- sapply(chars, function(x){
    as.integer(x == dataset$sentence[[i]])
  })
  y[i, ] <- as.integer(chars == dataset$next_char[[i]])
}

dim(X)
## [1] 13802    60    46
dim(y)
## [1] 13802    46
X[1, 1:10, 1:10]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0    0    0    1    0    0    0    0    0     0
##  [2,]    0    0    0    1    0    0    0    0    0     0
##  [3,]    0    0    0    0    0    0    0    0    0     0
##  [4,]    0    0    0    0    0    0    0    0    0     0
##  [5,]    0    0    1    0    0    0    0    0    0     0
##  [6,]    0    0    0    0    0    0    0    0    0     0
##  [7,]    0    0    0    0    0    0    0    0    0     0
##  [8,]    0    0    0    0    0    0    0    0    0     0
##  [9,]    0    0    0    0    0    0    0    0    0     0
## [10,]    0    0    0    0    0    0    0    0    0     1
y[1, ]
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0

The 13,802 “sentences” are now 13,802 matrices of size 60 x 46 (or, in general, maxlen x length(chars)). Together they form a 3D array X of size 13802 x 60 x 46, where cell [i, j, k] holds a 1 if in sentence i, at location j, appears the character represented by column k, and 0 otherwise. These are the matrices which will be fed into the RNN (typically in random batches of, say, 128). For each sentence the RNN should output a probability-like vector of length 46 (for 46 characters), holding the probability distribution for the next character, and hopefully that vector will have its maximum value in the column representing the “real” next character.
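Just to convince ourselves the encoding went right, here’s a small check I added (not part of the pipeline): decode the first “sentence” back from X and compare it to the original:

# decode the first "sentence" from its one-hot matrix back into characters
decoded <- apply(X[1, , ], 1, function(row) chars[which(row == 1)])
identical(decoded, dataset$sentence[[1]])  # should return TRUE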

Modelin’

Let’s initialize the keras RNN network, as in the Nietzsche example:

model <- keras_model_sequential()

model %>%
  layer_lstm(128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(length(chars)) %>%
  layer_activation("softmax")

optimizer <- optimizer_rmsprop(lr = 0.01)

model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)

summary(model)
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## lstm_1 (LSTM)                    (None, 128)                   89600       
## ___________________________________________________________________________
## dense_1 (Dense)                  (None, 46)                    5934        
## ___________________________________________________________________________
## activation_1 (Activation)        (None, 46)                    0           
## ===========================================================================
## Total params: 95,534
## Trainable params: 95,534
## Non-trainable params: 0
## ___________________________________________________________________________
## 
## 

We initialized the model with an LSTM layer of size 128, whose input shape is the shape of a single X matrix: maxlen x length(chars), or 60 x 46 in our case. Then we added a Fully Connected (Dense) layer taking the 128 LSTM outputs to 46 output neurons (one per character), and finally a softmax activation to make the 46 outputs sum to 1, so we get a vector which represents the probability distribution for the next character.

Where do these Param # numbers come from?

colah’s blog is in my opinion the best place to understand LSTM. I won’t copy-paste what he wrote, I’ll just explain what I’ve learned:

The LSTM works through a “sentence” one character at a time. At time \(t\) it receives a new input vector \(x_t\) (the one-hot encoding of the \(t\)-th character, size 46 x 1 in our case) as well as its previous output \(h_{t-1}\) (size 128 x 1 in our case). It decides what to “forget” in the “Forget Gate”:

\(f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)\)

Where \(W_f\) is a 128 x 46 weight matrix, \(U_f\) is a 128 x 128 weight matrix, \(b_f\) is a 128 x 1 bias vector and \(\sigma\) is the sigmoid function, giving a “forget” vector of values between 0 and 1.

It decides what to “update” in the “Input Gate”:

\(i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)\)

It produces a “candidate Cell State”:

\(\tilde{C}_t = \tanh(W_C \cdot x_t + U_C \cdot h_{t-1} + b_C)\)

Then the final Cell State at time \(t\) is like a “weighted average”: the previous Cell State multiplied (element-wise) by what we want to forget, plus the candidate Cell State multiplied by what we want to update:

\(C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t\)

And the final \(h_t\) output:

\(o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o)\)

\(h_t = o_t \cdot \tanh(C_t)\)

So we get 4 \(W\) matrices of size 128 x 46, 4 \(U\) matrices of size 128 x 128, 4 bias vectors of size 128 x 1, so the number of parameters for the LSTM layer is: 4 * (128 * 46 + 128 * 128 + 128) = 89600

The next Fully Connected layer should have 128 * 46 + 46 = 5934 parameters (that is a single \(W\) matrix connecting the 128-long \(h\) output to 46 neurons, plus a bias vector).

And this is how we get to overall 89600 + 5934 = 95534 parameters. Yay.
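If you like, you can double-check the arithmetic in R:

# double-checking the Param # column from summary(model)
4 * (128 * 46 + 128 * 128 + 128)                   # LSTM layer: 89600
128 * 46 + 46                                      # Dense layer: 5934
4 * (128 * 46 + 128 * 128 + 128) + 128 * 46 + 46   # total: 95534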

Fittin’

Next we train the RNN for, say, 20 epochs, in batches of, say, 100 - this means we’re letting it see the entire dataset (13,802 sentences and their next characters) 20 times, in steps of 100 random sentences each time. So the RNN will take 100 sentences, compute the final 46-long vectors (distributions) for their next characters (the forward pass), compute a loss function (categorical cross-entropy in our case), then go back (backpropagation) and update the entire set of ~95K parameters through Gradient Descent. Then again and again, hopefully getting better and better at it.

The code to do this would look like this:

model %>% fit(
      X, y,
      batch_size = 100,
      epochs = 20,
      verbose = 0
    )

Rappin’

We now have an RNN which knows something about rap. Our laptop is smarter than a newborn baby. The usual approach to making it rap is to give it a random valid initializing sentence, and to watch what it does with it from that point onwards. Note that a good RNN should know when to start a new line, as the \n character is one of our chars. It should know when to start a new verse, i.e. start two or more new lines in a row, as it has seen the character \n followed again by the character \n many times. It may even know how to rhyme, because maybe it has seen many cases of “cat” at the end of a line and “hat” at the end of the following line.

But there’s still a missing link: the keras predict function will simply give us the final 46-long vector which represents the probability distribution over the 46 chars for the next character. We then have to either choose the character with the maximum value (probability) as our next character, or sample from this (multinomial!) distribution to get our next character. The keras Nietzsche example adds a temperature (or diversity) parameter to “slide” between these two options:

sample_mod <- function(preds, temperature = 1){
  # rescale the predicted probabilities according to the temperature
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  
  # then draw a single character index from the resulting multinomial distribution
  rmultinom(1, 1, preds) %>% 
    as.integer() %>%
    which.max()
}

In this sample_mod function we first manipulate the predicted probabilities preds, and only then sample one of them with the rmultinom function. What is this manipulation? Let’s pretend we have 3 chars and that we got for them the probabilities preds = c(0.5, 0.3, 0.2).

When temperature = 1, you can verify we get the original probabilities:

preds <- c(0.5, 0.3, 0.2)
temperature <- 1
preds <- log(preds)/temperature
exp_preds <- exp(preds)
exp_preds/sum(exp_preds)
## [1] 0.5 0.3 0.2

When temperature = 0.2:

preds <- c(0.5, 0.3, 0.2)
temperature <- 0.2
preds <- log(preds)/temperature
exp_preds <- exp(preds)
exp_preds/sum(exp_preds)
## [1] 0.919117647 0.071470588 0.009411765

The maximum probability 0.5 of our first char has become even larger. And when temperature = 2:

preds <- c(0.5, 0.3, 0.2)
temperature <- 2
preds <- log(preds)/temperature
exp_preds <- exp(preds)
exp_preds/sum(exp_preds)
## [1] 0.4154459 0.3218030 0.2627511

The maximum probability 0.5 has become smaller, and the smaller probabilities have become larger. So the more we increase the temperature parameter, the more uniform the output distribution over chars becomes, allowing for more “diversity” in the output. And the more we decrease the temperature parameter, the more extreme the distribution becomes, and we will almost surely pick the character with the maximum value as our next character.
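To see the two extremes, here’s a small illustration I added, wrapping the manipulation above in a helper function (rescale is my own name, it’s not part of keras):

# rescale a probability vector by a given temperature
rescale <- function(preds, temperature) {
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  exp_preds / sum(exp_preds)
}

rescale(c(0.5, 0.3, 0.2), 0.05)  # almost all the mass goes to the first char
rescale(c(0.5, 0.3, 0.2), 100)   # almost uniform over the three chars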

Great. So overall to make the laptop rap say 1000 characters, we first sample a starting input “sentence” and initialize the output generated string to be empty:

start_index <- sample(1:(length(text) - maxlen), size = 1)
sentence <- text[start_index:(start_index + maxlen - 1)]
generated <- ""

Now for each of the 1000 characters, we take the last “sentence”, vectorize it, predict a probability distribution over the 46 possible next characters, pick one with the sample_mod function and the desired temperature (I chose 1, but you should really try different values and see the results, it’s fun!), then update the “sentence” for the next character:

for (i in 1:1000) {
  # one-hot encode the current "sentence" into a 1 x maxlen x length(chars) array
  x <- sapply(chars, function(char) {
    as.integer(char == sentence)
  })
  dim(x) <- c(1, dim(x))
  
  # predict the distribution over the next character and sample from it
  preds <- predict(model, x)
  next_index <- sample_mod(preds, 1)
  next_char <- chars[next_index]
  
  # append the sampled character and slide the "sentence" window one step forward
  generated <- str_c(generated, next_char, collapse = "")
  sentence <- c(sentence[-1], next_char)
}

cat(generated)
## lifeded with gottambuching out my shit pacted of hur ablell sing lous, "oun you dan and it's clrence like they some man and for tome jeep to dan and cante no the trome and distughing i get itpent to men fare? sle wan and all
## you show, s a ratch and play swarebbla manfe and nought, out m then you see tonall with bucks
## i'm no do thon, walue's your setbulay, cause you can rase a lan dito 
## i get a then and careet itchlut a leaving troking lifth a treet mc now
## whaoou's all sand i wanna know
## avar, but in tirle of mcrumentably, you till the fromlerd you wrack you the mared a then dan a lan
## and leave
## to be a tcatted
## with you gotta back nab foctint on my renall
## 
## i got nob a hisksand spinterched
## to whing to ge and showeome
## i got comcarind fucking murd aits
## an op acreet mc ncerr
## 
## "nell and day i'm trysty jealing you then you can you?
## if all back alrrabfors and slithed, sup ther faller you can rase up as for at prace for no then you dear i'm supla calll
## tacter me?
## mens, you no give helaon't me fro

Well, there are a few real words in there, but overall the result is underwhelming. It is nice to see, though, that the RNN learned that question marks tend to come at the end of a line.

I’m Sorry Miss Jackson

All of this was very nice, but I have 400K lines of rap, not 1K. Like I said, I don’t think there’s enough RAM in the world for the X array in this case, especially since it isn’t even stored as a sparse matrix (I’m exaggerating, there are 173 gigabytes of RAM in the world, just not on a single laptop).
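To get a feel for where a number like that comes from, here’s a rough back-of-the-envelope calculation; the character-set size for the full corpus is my guess (the Outkast subset I use below alone has 62 distinct characters):

# rough memory footprint of the full one-hot array X, in gigabytes
# (R stores the array as 8-byte doubles; n_chars is an assumption)
n_sentences <- 4.78e6   # ~14M characters, one "sentence" every 3 characters
n_chars     <- 75       # assumed character-set size of the full corpus
n_sentences * 60 * n_chars * 8 / 1e9   # ~172 GB with these numbers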

One way of overcoming this is to train on the data in batches - though I will call them “steps”, as the “batch” term is already taken here - via the train_on_batch function1:

maxlen <- 60

path <- "~/all_rap_custom_artists.txt"

text <- read_lines(path) %>%
  str_to_lower() %>%
  str_c(collapse = "\n") %>%
  tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

chars <- text %>%
  unique() %>%
  sort()

dataset <- map(
  seq(1, length(text) - maxlen - 1, by = 3),
  ~list(sentence = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
)

dataset <- transpose(dataset)

model <- keras_model_sequential()

model %>%
  layer_lstm(128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(length(chars)) %>%
  layer_activation("softmax")

optimizer <- optimizer_rmsprop(lr = 0.01)

model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)

batch_size <- 100
num_steps <- trunc(length(dataset$sentence) / batch_size)

for (epoch in 1:20) {
  
  cat(sprintf("epoch: %02d ---------------\n\n", epoch))
  
  # start every epoch with the full pool of sentence indices
  all_samples <- 1:length(dataset$sentence)
  
  for (step in 1:num_steps) {
  
    cat(sprintf("step: %02d ---------------\n\n", step))
    
    # draw batch_size sentence indices, without replacement within the epoch
    batch <- sample(all_samples, batch_size)
    all_samples <- setdiff(all_samples, batch)
    
    sentences <- dataset$sentence[batch]
    next_chars <- dataset$next_char[batch]
    
    X <- array(0, dim = c(batch_size, maxlen, length(chars)))
    y <- array(0, dim = c(batch_size, length(chars)))
    
    for(i in 1:batch_size){
      
      X[i,,] <- sapply(chars, function(x){
        as.integer(x == sentences[[i]])
      })
      
      y[i,] <- as.integer(chars == next_chars[[i]])
      
    }
    
    model %>% train_on_batch(
      X, y
    )
  }
}

Here, within a single epoch, I take a batch of 100 sentences, vectorize them and fit the model on them, then take the next 100 sentences, vectorize them, fit the model, and so on, again and again. In the full dataset, all 400K lines, I have 4.78 million “sentences”, so this is 47,806 steps per epoch…
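For comparison, each step only needs to hold a tiny one-hot array in memory, which is the whole point (here I use the 46-character set from the 1K-line example just for illustration):

# memory needed for a single step's one-hot array, in megabytes
100 * 60 * 46 * 8 / 1e6   # ~2.2 MB per batch of 100 sentences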

So it is feasible. But I’m a busy man, and this is the definition of overkill. So I’m sorry Miss Jackson, I prefer performing many epochs using the lyrics of my favorite group of the bunch - Outkast. I got 118 songs for Outkast, a total of 7K lines, 250K characters, 62 distinct characters and 85K “sentences”, all by simply repeating the above code. Then I ran 150 epochs. Then I generated 2000 characters from a random initializing sentence, and this is what I got:

## thought that it was man, i used ta crew and every i ain't no mesher on the playans ok the zon' lakin' we fuck the starts on that back, i get top of peor hoes with coling up you, whore than this prebring on the smalk the passe (kin oursed
## this over black up but the place
## at y cause you got the don't want on the one hootie got
## to ride and listendread
## lize handle
## so i say wondenh wing you oage around i gunsin inse takent to there
## and then the beed to it's kille, the ave on the buz in and i wonjer, and she had yk with with your
## son't 
## now sil give them jellhigh
## we lave my nife know y'all you happed i dunked
## the time by presson
## fucked up like the stigge etecy' popen
## i was thats from beeny longers
## cat like mar
## gmes wet doper, send coup your placemy
## i conthrs
## you gon' that shit
## 't fing throub plation
## musin sit heaving through that bring the slot
## and then this junisheltion
## long for stali)
## main callid
## seville puelt off in my leviva fat ol' y'all
## fethinousin this shit, enannaw
## perchin on you traps
## if you want a poples give you this oth she ain't put it
## i said etestin boinfincuschan polong furlp miness floss (floath,
## my gking for sitsing or pisk as a mils and homebody's ain't for yo'
## the sun cuz i'll get your fedrerigging your sittin you go for two top raunat cuz they cook and right but meanstold short
## or fince crumblupbold
## but i let me live bout the datch headd is though, camably do you hoo
## so must be about me
## 'cause they kneep hit trapple she's sleal, know looking for move to to the play to haller
## you earnnd my baby marril ass makes and little me
## now had hom to   stroo
## you've gat a decking ours again
## 
## tailin' your jumpsine than i peesin them.fulle tand on a scurst, we can run up and then the dirt cuz eamast s
## dig of dyings, gett
## crumblers
## i got supernun-street sound of solne y's caugh such over
## ik
## knuck stin i see i wanna breach me
## but but i've go ugh's my this stare like nutt
## they speiches we spend all they pait
## timan, horrighip
## i love the do falk
## i like eatt
## 
## bab
## und away a

OK, so again underwhelming. But hear this!

Wrappin’

Do I believe computers will be able to generate art somewhere in the future? Yes. Have I demonstrated this? Haha, no. But I had fun, and I think I understand RNNs and LSTM a bit better. Please tell me how my rap can be improved, because right now it’s embarrassing.


  1. Thank you JJ Allaire and Daniel Falbel for this insight: https://github.com/rstudio/keras/issues/50