Recently RStudio (a.k.a. my dream job) released a wrapper around keras with a TensorFlow backend. Well, I just had to take this baby for a spin. But what should I train my first Deep Learning network in R on? I’m neither a Cats nor a Dogs person. Whatever, I do what I want! South Park vs. Simpsons! ATTENTION: THIS IS NOT A DEEP LEARNING LESSON.

“Looks like Stan’s dad’s been hitting the bottle again.”

But where in the world will I get (at least) tens of thousands of South Park and Simpsons images? I considered scraping Google Images Search. But (a) Google has changed its API and it has become quite difficult to get thousands of images off its search results, and (b) you’re going to get some really messy results, e.g. creepy fan drawings of Bart as a real kid. So I thought I’d simply download all episodes and grab as many images as I want from there. Now, the downloading itself is illegal, so I won’t show you how to do this. The starting point of this post is having all of South Park’s first 14 seasons (208 episodes) and all of The Simpsons’ first 13 seasons (291 episodes) on disk.

To record sequences of images from the shows’ episodes, you’ll first need to install FFmpeg and make sure it’s in your PATH. Installing ImageMagick won’t hurt either; we’ll need it later on. Now, I really wanted to use the imager package’s load.video function, but whatever I did, it didn’t work. So here I’m simply calling the ffmpeg command in the Windows command line, via the system function. Don’t worry, I’m no ffmpeg expert either; I simply used this superuser answer to get what I wanted.

library(tidyverse)
library(stringr)

# extract the season / episode number from a video's file path
# (South Park and Simpsons files follow different naming conventions,
# hence the sp flag and the different separators / indices)
extractSeasonNum <- function(file, sp = TRUE) {
  separator <- ifelse(sp, " |\\.|/|S|E", " |/|Season")
  ind <- ifelse(sp, 10, 6)
  str_split(file, separator)[[1]][ind] %>% as.numeric()
}

extractEpisodeNum <- function(file, sp = TRUE) {
  separator <- ifelse(sp, " |\\.|/|S|E", " |/|Season")
  ind <- ifelse(sp, 13, 7)
  str_split(file, separator)[[1]][ind] %>% as.numeric()
}

# one row per episode file, with its season and episode numbers
getVideosTable <- function(videosAddress, sp = TRUE) {
  tibble(file = list.files(videosAddress, recursive = TRUE, full.names = TRUE)) %>%
    mutate(season = map_dbl(file, extractSeasonNum, sp),
           episode = map_dbl(file, extractEpisodeNum, sp))
}

sp_df <- getVideosTable("D:/South Park")
si_df <- getVideosTable("D:/The Simpsons", FALSE)

# fix the two Simpsons files whose episode numbers failed to parse
si_df[which(is.na(si_df$episode)),]$episode <- c(11, 1)

# call ffmpeg to extract small (80x60) frames from a single episode file,
# writing them to D:/sp_images or D:/si_images as e.g. sp_1_2_0003.jpg
convertVideoToImages <- function(file, season, episode, sp = TRUE) {
  init <- ifelse(sp, "sp_", "si_")
  skipSeconds <- ifelse(sp, "00:01:00.000", "00:01:30.000")
  framesPerSecond <- ifelse(sp, "0.125", "0.1")
  imageName <- paste0(init, season, "_", episode)
  ffCommand <- paste0("ffmpeg -i \"", gsub("/", "\\\\", file),
                      "\" -s 80x60 -ss ", skipSeconds, " -t 1200 -r ",
                      framesPerSecond, " D:\\", init, "images\\",
                      imageName, "_%04d.jpg")
  system(ffCommand)
}

sp_df %>% pwalk(.f = convertVideoToImages)
si_df %>% pwalk(.f = convertVideoToImages, FALSE)

I’ve always wanted to use purrr’s pwalk function! Anyway, what I did in the ffmpeg command is:

  • -s 80x60: resize the video frames to 80 x 60
  • -ss 00:01:00.000: skip the first 1 minute of a South Park video (or 1.5 minutes for Simpsons)
  • -t 1200: keep recording images for 1200 seconds = 20 minutes
  • -r 0.125: record 0.125 frames per second, or a frame every 8 seconds for South Park (10 seconds for Simpsons)

This way I got 208 * (1200/8) = 31.2K South Park images, and 291 * (1200/10) = 34.92K Simpsons images. At this stage I did something I regret a bit - I manually deleted “black screen” images from both groups - you know, the screens you’d get from fade-out/fade-in. It was a good move, I just should have done it automatically, e.g. by removing images with close-to-zero variance. Bottom line I got: 31,150 South Park images and 34,551 Simpsons images, of size 80x60.
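
For reference, here’s a rough sketch of how that automatic cleanup could have looked (I didn’t actually run this; it assumes the extracted frames are already sitting in D:/sp_images, and the same idea applies to the Simpsons folder):

library(imager)
library(purrr)

# flag frames whose pixel variance is (close to) zero, i.e. solid-colour screens
is_blank_frame <- function(imgFile, eps = 1e-4) {
  var(as.vector(load.image(imgFile))) < eps
}

blank_files <- list.files("D:/sp_images", full.names = TRUE) %>%
  keep(is_blank_frame)
file.remove(blank_files)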

These images need to be loaded as R arrays, as that’s what keras knows. I used the imager and abind packages for that, which got me the sp_X and si_X arrays. The 0 to 255 RGB values are normalized to the 0 to 1 range.

library(imager)
library(abind)

image_read_matrix <- function(imgFile) {
  # load.image already normalizes pixel values to the 0 to 1 range!
  imgFile %>% load.image() %>% as.array() %>% adrop(3)
}

# stack all images in a folder into a single [n, 80, 60, 3] array
loadImagesAsXMatrix <- function(imagesAddress) {
  imageMatrices <- list.files(imagesAddress, full.names = TRUE) %>%
    map(image_read_matrix)
  
  array(
    data = do.call(rbind, map(imageMatrices, as.vector)),
    dim = c(length(imageMatrices), dim(imageMatrices[[1]]))
  )
}

sp_X <- loadImagesAsXMatrix("D:/sp_images")
gc()
si_X <- loadImagesAsXMatrix("D:/si_images")
gc()

sp_Y <- rep(0, dim(sp_X)[1])
si_Y <- rep(1, dim(si_X)[1])

Notice the heavy use of gc() here; these objects are a few gigabytes in size. Also notice I created label objects, sp_Y and si_Y, containing all 0s and all 1s, for South Park and Simpsons respectively.

Now we need to split these arrays into training (75%) and testing (25%) sets:

trainFrac <- 0.75

set.seed(42)
sp_sample <- sample(dim(sp_X)[1], floor(trainFrac * dim(sp_X)[1]))
si_sample <- sample(dim(si_X)[1], floor(trainFrac * dim(si_X)[1]))

samples <- list(sp_sample = sp_sample, si_sample = si_sample)

sp_X_train <- sp_X[sp_sample, , , ]
si_X_train <- si_X[si_sample, , , ]
sp_X_test <- sp_X[-sp_sample, , , ]
si_X_test <- si_X[-si_sample, , , ]

sp_Y_train <- sp_Y[sp_sample]
si_Y_train <- si_Y[si_sample]
sp_Y_test <- sp_Y[-sp_sample]
si_Y_test <- si_Y[-si_sample]

rm(sp_X, si_X)
gc()

x_train <- abind(sp_X_train, si_X_train, along = 1)
x_test <- abind(sp_X_test, si_X_test, along = 1)

y_train <- c(sp_Y_train, si_Y_train)
y_test <- c(sp_Y_test, si_Y_test)

rm(sp_X_train, si_X_train, sp_X_test, si_X_test)
gc()

sp_si_data <- list(train = list(x = x_train, y = y_train), test = list(x = x_test, y = y_test))
save(sp_si_data, file = "D:/sp_si_data.RData")
rm(sp_si_data)

We now have the x_train, x_test, y_train and y_test matrices and vectors, and we’re good to go! Notice, though, that I exported these objects in a single RData file called sp_si_data, so in the future I can easily load them like this:

load("D:/sp_si_data.RData")
x_train <- sp_si_data$train$x
y_train <- sp_si_data$train$y
x_test <- sp_si_data$test$x
y_test <- sp_si_data$test$y
rm(sp_si_data)
gc()

“Stan, don’t you know the first law of physics? Anything that’s fun costs at least eight dollars.”

What’s the first rule of Deep Learning? You may NOT need Deep Learning! Think about it. You’ve seen computers drive cars. You’ve seen computers converse with people, recommending what to buy on eBay. Will it really be that hard for computers to classify South Park vs. Simpsons images?

No, I expect a computer to do this quite easily. The Simpsons are extremely yellow; South Park characters’ complexion is more pink. The Simpsons are very cartoonish. South Park characters have larger faces. Etc.

So don’t go grabbing them fancy GPUs just yet. Start from a “weak” learner to see what it can do. SVM, CART, Logistic Regression - any of these would give a good starting benchmark you’d expect Deep Learning to beat.
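
Just to make that concrete, here’s a sketch of about the weakest learner imaginable for this problem (not something I actually ran for the post): summarize each image by its mean red, green and blue values and fit a plain logistic regression with glm. If the “Simpsons are yellow” intuition holds, even this should beat a coin flip by a lot. It assumes the x_train/x_test arrays and y_train/y_test vectors from above:

# mean R, G, B value per image, as a 3-column data frame
channel_means <- function(x) {
  data.frame(r = apply(x[, , , 1], 1, mean),
             g = apply(x[, , , 2], 1, mean),
             b = apply(x[, , , 3], 1, mean))
}

rgb_train <- channel_means(x_train)
rgb_test <- channel_means(x_test)

baseline <- glm(y_train ~ r + g + b, data = rgb_train, family = binomial())
baseline_pred <- predict(baseline, rgb_test, type = "response")
mean(ifelse(baseline_pred > 0.5, 1, 0) == y_test) # crude benchmark accuracy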

In this case I’m going to go with something close to Logistic Regression, which is a single neuron firing a sigmoid probability-like number between 0 and 1 for each sample. This neuron will train on the 80 * 60 * 3 = 14,400 pixels of an image, as 14,400 simple features. From here on, the final 0/1 decision (South Park or Simpsons) will be made with a threshold of 0.5 on the network’s output.

library(keras)

batch_size <- 32
epochs <- 1

# flatten image pixels to become 14400 features
x_train <- array(x_train, dim = c(dim(x_train)[1], dim(x_train)[2] * dim(x_train)[3] * dim(x_train)[4]))
x_test <- array(x_test, dim = c(dim(x_test)[1], dim(x_test)[2] * dim(x_test)[3] * dim(x_test)[4]))

model <- keras_model_sequential()
model %>% 
  layer_dense(units = 1, activation = 'sigmoid', input_shape = c(14400))

summary(model)
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 1)                     14401       
## ===========================================================================
## Total params: 14,401.0
## Trainable params: 14,401
## Non-trainable params: 0.0
## ___________________________________________________________________________
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  verbose = 1,
  validation_data = list(x_test, y_test)
)
## Train on 49275 samples, validate on 16426 samples
## Epoch 1/1
## 49275/49275 [==============================] - 10s - loss: 1.0201 - acc: 0.6532 - val_loss: 0.7269 - val_acc: 0.7294

We got 72% accuracy1 on the testing sample, just like that. So you see, even a single neuron can do a decent job.
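
Out of curiosity (this isn’t part of the original analysis), we can even peek at what that single neuron learned, by reshaping its 14,400 weights back into an 80x60x3 “image” with imager:

w1 <- model$get_weights()[[1]] # the dense layer's kernel, a 14400 x 1 matrix
w_img <- array(w1, dim = c(80, 60, 1, 3)) # back to image shape (x, y, z, colour)
w_img <- (w_img - min(w_img)) / (max(w_img) - min(w_img)) # rescale to [0, 1] for display
plot(as.cimg(w_img))

Roughly speaking, bright areas are pixels pushing the prediction towards “Simpsons” and dark areas towards “South Park”.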

“I’m not fat! I’m festively plump!”

Let’s go deeper. By “deeper” I still mean a simple neural network, but with several hidden layers - five in this case - each with a ReLU activation and followed by 20% Dropout to regularize and avoid overfitting, trained over 10 epochs. This is also called a “Multilayer Perceptron”2:

batch_size <- 32
epochs <- 10

model <- keras_model_sequential()
model %>% 
  layer_dense(units = 512, activation = 'relu', input_shape = c(14400)) %>% 
  layer_dropout(rate = 0.2) %>% 
  layer_dense(units = 256, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

summary(model)
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 512)                   7373312     
## ___________________________________________________________________________
## dropout_1 (Dropout)              (None, 512)                   0           
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 256)                   131328      
## ___________________________________________________________________________
## dropout_2 (Dropout)              (None, 256)                   0           
## ___________________________________________________________________________
## dense_3 (Dense)                  (None, 128)                   32896       
## ___________________________________________________________________________
## dropout_3 (Dropout)              (None, 128)                   0           
## ___________________________________________________________________________
## dense_4 (Dense)                  (None, 64)                    8256        
## ___________________________________________________________________________
## dropout_4 (Dropout)              (None, 64)                    0           
## ___________________________________________________________________________
## dense_5 (Dense)                  (None, 32)                    2080        
## ___________________________________________________________________________
## dropout_5 (Dropout)              (None, 32)                    0           
## ___________________________________________________________________________
## dense_6 (Dense)                  (None, 1)                     33          
## ===========================================================================
## Total params: 7,547,905.0
## Trainable params: 7,547,905.0
## Non-trainable params: 0.0
## ___________________________________________________________________________
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  verbose = 1,
  validation_data = list(x_test, y_test)
)
## Train on 49275 samples, validate on 16426 samples
## Epoch 1/10
## 49275/49275 [==============================] - 137s - loss: 0.6106 - acc: 0.6967 - val_loss: 0.4599 - val_acc: 0.7833
## Epoch 2/10
## 49275/49275 [==============================] - 135s - loss: 0.4846 - acc: 0.7747 - val_loss: 0.4719 - val_acc: 0.7842
## Epoch 3/10
## 49275/49275 [==============================] - 135s - loss: 0.4700 - acc: 0.7890 - val_loss: 0.4542 - val_acc: 0.7970
## Epoch 4/10
## 49275/49275 [==============================] - 135s - loss: 0.4640 - acc: 0.7959 - val_loss: 0.4266 - val_acc: 0.8286
## Epoch 5/10
## 49275/49275 [==============================] - 135s - loss: 0.4683 - acc: 0.8015 - val_loss: 0.4103 - val_acc: 0.8243
## Epoch 6/10
## 49275/49275 [==============================] - 135s - loss: 0.4683 - acc: 0.8010 - val_loss: 0.4300 - val_acc: 0.8311
## Epoch 7/10
## 49275/49275 [==============================] - 134s - loss: 0.4627 - acc: 0.8063 - val_loss: 0.3974 - val_acc: 0.8401
## Epoch 8/10
## 49275/49275 [==============================] - 135s - loss: 0.4474 - acc: 0.8117 - val_loss: 0.4340 - val_acc: 0.8448
## Epoch 9/10
## 49275/49275 [==============================] - 135s - loss: 0.4450 - acc: 0.8151 - val_loss: 0.3929 - val_acc: 0.8524
## Epoch 10/10
## 49275/49275 [==============================] - 135s - loss: 0.4597 - acc: 0.8156 - val_loss: 0.4182 - val_acc: 0.8333

We got to 83% test sample accuracy. Not too shabby for a 20-minute neural network on my laptop.

“Respect my Authoritah!”

For the real deal, we’ll use a Convolutional Neural Network. CNNs are especially good at image recognition tasks, so I’m expecting accuracy to go through the roof. I’m going to use four 2D convolutional layers (the first with a 10x10 kernel and 32 filters, the rest 3x3 with 32 filters), then a dense layer, then a sigmoid activation. See the code for the Dropout rates and Pooling layers.

batch_size <- 32
epochs <- 10

# need to go back to 80x60x3 matrices structure
x_train <- array(x_train, dim = c(dim(x_train)[1], 80, 60, 3))
x_test <- array(x_test, dim = c(dim(x_test)[1], 80, 60, 3))

model <- keras_model_sequential()
model %>%
  layer_conv_2d(
    filter = 32, kernel_size = c(10, 10), padding = "same", 
    input_shape = c(80, 60, 3)
  ) %>%
  layer_activation("relu") %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25) %>%
  
  layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = "same") %>%
  layer_activation("relu") %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25) %>%
  
  layer_flatten() %>%
  layer_dense(512) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

opt <- optimizer_rmsprop(lr = 0.0001, decay = 1e-6)

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = opt,
  metrics = "accuracy"
)

summary(model)
## Model
## ___________________________________________________________________________________________________________________
## Layer (type)                                       Output Shape                                  Param #           
## ===================================================================================================================
## conv2d_1 (Conv2D)                                  (None, 80, 60, 32)                            9632              
## ___________________________________________________________________________________________________________________
## activation_1 (Activation)                          (None, 80, 60, 32)                            0                 
## ___________________________________________________________________________________________________________________
## conv2d_2 (Conv2D)                                  (None, 78, 58, 32)                            9248              
## ___________________________________________________________________________________________________________________
## activation_2 (Activation)                          (None, 78, 58, 32)                            0                 
## ___________________________________________________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D)                     (None, 39, 29, 32)                            0                 
## ___________________________________________________________________________________________________________________
## dropout_1 (Dropout)                                (None, 39, 29, 32)                            0                 
## ___________________________________________________________________________________________________________________
## conv2d_3 (Conv2D)                                  (None, 39, 29, 32)                            9248              
## ___________________________________________________________________________________________________________________
## activation_3 (Activation)                          (None, 39, 29, 32)                            0                 
## ___________________________________________________________________________________________________________________
## conv2d_4 (Conv2D)                                  (None, 37, 27, 32)                            9248              
## ___________________________________________________________________________________________________________________
## activation_4 (Activation)                          (None, 37, 27, 32)                            0                 
## ___________________________________________________________________________________________________________________
## max_pooling2d_2 (MaxPooling2D)                     (None, 18, 13, 32)                            0                 
## ___________________________________________________________________________________________________________________
## dropout_2 (Dropout)                                (None, 18, 13, 32)                            0                 
## ___________________________________________________________________________________________________________________
## flatten_1 (Flatten)                                (None, 7488)                                  0                 
## ___________________________________________________________________________________________________________________
## dense_1 (Dense)                                    (None, 512)                                   3834368           
## ___________________________________________________________________________________________________________________
## activation_5 (Activation)                          (None, 512)                                   0                 
## ___________________________________________________________________________________________________________________
## dropout_3 (Dropout)                                (None, 512)                                   0                 
## ___________________________________________________________________________________________________________________
## dense_2 (Dense)                                    (None, 1)                                     513               
## ===================================================================================================================
## Total params: 3,872,257.0
## Trainable params: 3,872,257.0
## Non-trainable params: 0.0
## ___________________________________________________________________________________________________________________
## 
## 

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_data = list(x_test, y_test),
  shuffle = TRUE,
  verbose = 1
)
## Train on 49275 samples, validate on 16426 samples
## Epoch 1/10
## 49275/49275 [==============================] - 994s - loss: 0.3153 - acc: 0.8668 - val_loss: 0.3123 - val_acc: 0.8602
## Epoch 2/10
## 49275/49275 [==============================] - 992s - loss: 0.1891 - acc: 0.9277 - val_loss: 0.1505 - val_acc: 0.9441
## Epoch 3/10
## 49275/49275 [==============================] - 992s - loss: 0.1603 - acc: 0.9397 - val_loss: 0.1597 - val_acc: 0.9367
## Epoch 4/10
## 49275/49275 [==============================] - 989s - loss: 0.1462 - acc: 0.9462 - val_loss: 0.1543 - val_acc: 0.9420
## Epoch 5/10
## 49275/49275 [==============================] - 987s - loss: 0.1351 - acc: 0.9497 - val_loss: 0.1632 - val_acc: 0.9414
## Epoch 6/10
## 49275/49275 [==============================] - 986s - loss: 0.1287 - acc: 0.9535 - val_loss: 0.1226 - val_acc: 0.9549
## Epoch 7/10
## 49275/49275 [==============================] - 985s - loss: 0.1248 - acc: 0.9546 - val_loss: 0.1163 - val_acc: 0.9579
## Epoch 8/10
## 49275/49275 [==============================] - 986s - loss: 0.1226 - acc: 0.9548 - val_loss: 0.1488 - val_acc: 0.9497
## Epoch 9/10
## 49275/49275 [==============================] - 986s - loss: 0.1210 - acc: 0.9563 - val_loss: 0.1090 - val_acc: 0.9587
## Epoch 10/10
## 49275/49275 [==============================] - 985s - loss: 0.1184 - acc: 0.9570 - val_loss: 0.1117 - val_acc: 0.9600

Well, that’s 96% accuracy! Hooray. Obviously this can be improved; there are many hyperparameters to tune here, e.g. using cross validation. But I’m happy with 96% for now.
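
For example (a sketch, not something I ran; in a real run you’d rebuild and re-compile the model rather than keep training the fitted one), instead of a fixed 10 epochs one could train longer and stop once the validation loss stops improving:

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = 30,
  validation_data = list(x_test, y_test),
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 3)),
  shuffle = TRUE,
  verbose = 1
)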

“How do I reeeach these kids?”

What did we learn? First, let’s look at the histogram of predicted probabilities:

pred <- predict_proba(model, x_test)

library(ggplot2)
ggplot(data.frame(pred = pred), aes(x = pred)) +
  geom_histogram(bins = 10, fill = "red", alpha = 0.5) +
  xlab("Predicted P(Simpsons)") +
  ylab("No. of Images in Test Sample") +
  ggtitle("South Park vs. Simpsons: Histogram of Predicted P(Simpsons) on Test Set")

Now this is a sight to behold. The classifier is either “really sure” an image is South Park, or “really sure” it is Simpsons. Good for us. Let’s look at the confusion matrix:

caret::confusionMatrix(y_test, ifelse(pred > 0.5, 1, 0))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7552  236
##          1  421 8217
##                                           
##                Accuracy : 0.96            
##                  95% CI : (0.9569, 0.9629)
##     No Information Rate : 0.5146          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9199          
##  Mcnemar's Test P-Value : 7.047e-13       
##                                           
##             Sensitivity : 0.9472          
##             Specificity : 0.9721          
##          Pos Pred Value : 0.9697          
##          Neg Pred Value : 0.9513          
##              Prevalence : 0.4854          
##          Detection Rate : 0.4598          
##    Detection Prevalence : 0.4741          
##       Balanced Accuracy : 0.9596          
##                                           
##        'Positive' Class : 0               
## 
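
As a quick aside (this arithmetic isn’t in the original output), the two precision figures in this table are just the diagonal counts divided by the row totals:

7552 / (7552 + 236) # Pos Pred Value, precision for South Park (class 0), ~0.97
8217 / (8217 + 421) # Neg Pred Value, precision for The Simpsons (class 1), ~0.95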

We see our 96% accuracy again. We see that precision (“Pos Pred Value” and “Neg Pred Value”) is over 95% both for South Park and for The Simpsons (i.e. if I say an image is from South Park I’m over 95% correct, and the same goes for The Simpsons). We also see, though, a few hundred misclassified images. Let’s look at the top 5 “most” misclassified images, i.e. South Park images which got the highest predicted P(Simpsons) scores, and vice versa:

gaps <- tibble(ind = 1:length(y_test),
               y_test = y_test,
               pred = c(pred),
               gap = abs(c(pred) - y_test)) %>%
  arrange(-gap)

si_top5_ind <- gaps %>% filter(y_test == 1) %>% head(5) %>% select(ind) %>% unlist()
sp_top5_ind <- gaps %>% filter(y_test == 0) %>% head(5) %>% select(ind) %>% unlist()

sp_test_sample <- (1:(sum(y_train == 0) + sum(y_test == 0)))[-samples$sp_sample]
si_test_sample <- (1:(sum(y_train == 1) + sum(y_test == 1)))[-samples$si_sample]

sp_images_paths <- list.files("D:/sp_images", full.names = TRUE)
sp_top5_paths <- sp_images_paths[sp_test_sample[sp_top5_ind]]

si_images_paths <- list.files("D:/si_images", full.names = TRUE)
si_top5_paths <- si_images_paths[si_test_sample[si_top5_ind - sum(y_test == 0)]]

library(magick)

sp_top5_paths %>% image_read() %>% image_border("black", "3x3") %>% image_append() %>% image_write("D:/sp_top5.jpg")
si_top5_paths %>% image_read() %>% image_border("black", "3x3") %>% image_append() %>% image_write("D:/si_top5.jpg")

Top South Park images misclassified as Simpsons

Top Simpsons images misclassified as South Park

This is nice. For South Park, I see the 3rd, 4th and 5th images having a lot of yellow in them. The 4th image is from the Korn episode (Season 3, Episode 10), where they didn’t make Korn’s lead singer very “South Parky” with large elliptic eyes. The first image - I have no idea why Kenny is here :) And the second - well, this is fantastic! This is an image from South Park that looks exactly like The Simpsons living room!

If you want to know from where these images are taken, here:

sp_top5_paths
## [1] "D:/sp_images/sp_4_9_0006.jpg"  "D:/sp_images/sp_6_7_0137.jpg" 
## [3] "D:/sp_images/sp_8_11_0064.jpg" "D:/sp_images/sp_3_10_0129.jpg"
## [5] "D:/sp_images/sp_8_11_0067.jpg"
si_top5_paths
## [1] "D:/si_images/si_6_18_0085.jpg" "D:/si_images/si_2_8_0001.jpg" 
## [3] "D:/si_images/si_1_3_0097.jpg"  "D:/si_images/si_4_18_0031.jpg"
## [5] "D:/si_images/si_4_22_0031.jpg"

Let’s have some fun: let’s give this network a list of random images, very different from what it has seen, to see how “smart” it really is. I’ve modified Maelle Salmon’s cool function a bit to fit my needs and make this “tile” (this is done using the magick package):

library(magick)

test_image_read_matrix <- function(imgFile) {
  imgFile %>% load.image() %>% resize(80, 60) %>% as.array() %>% array(dim = c(1, 80, 60, 3))
}

extractImageName <- function(imgFile) {
  str_split(imgFile, "/|\\.")[[1]] %>% tail(2) %>% head(1)
}

test_images_df <- tibble(file = list.files("D:/test_images", full.names = TRUE)) %>%
  mutate(name = map_chr(file, extractImageName),
         matrix = map(file, test_image_read_matrix),
         prob = map_dbl(matrix, function(m) predict_proba(model, m))) %>%
  arrange(-prob)

no_rows <- 9
no_cols <- 3

make_row <- function(i, files, no_cols){
  image_read(files[(i*no_cols+1):((i+1)*no_cols)]) %>%
    image_scale("194x194!") %>%
    image_border("black", "3x3!") %>%
    image_append(stack = FALSE) %>%
    image_write(paste0("D:/test_images_arranged/", i, ".jpg"))
}

walk(0:(no_rows-1), make_row, files = test_images_df$file,
     no_cols = no_cols)

image_all <- image_read(dir("D:/test_images_arranged", full.names = TRUE)) %>%
  image_append(stack = TRUE)

for (i in 1:nrow(test_images_df)) {
  xLoc <- (i + no_cols - 1) %% no_cols * 200 + 30
  yLoc <- ((i - 1) %/% no_cols) * 200 + 160
  image_all <- image_annotate(image_all, format(test_images_df$prob[i], digits = 3),
                                                 size = 20, color = "black", boxcolor = "white",
                                                 location = paste0("+", xLoc, "+", yLoc))
}

image_write(image_all,"D:/test_images_sorted.jpg")

Some Hard Test Images with P(Simpsons)

This is extremely interesting, if we think of 0.5 as the threshold above which an image gets prediction “Simpsons” and below “South Park”.

  • The network is very sure that Homer and Bart are Simpsons, and that Cartman and Randy Marsh are South Park.
  • It is also pretty sure that real donuts, yellow and white rectangles, and Putin, are taken from The Simpsons.
  • And that Gal Gadot, Trump, my daughter, my best friend Nir and an Ikea table are taken from South Park.
  • It is somewhat sure that an image of me is from South Park (p = 0.115), but an image of me with a yellow mask is from The Simpsons (p = 0.775).
  • Beyonce is Simpsons.
  • Mr. Hankey is Simpsons (wrong!).
  • It’s not just a “yellow” thing because Bart Simpson as illustrated by South Park is predicted to be from South Park.
  • Amazingly, Matt Groening, the creator of the Simpsons, passes the threshold “to become” Simpsons, while Trey Parker and Matt Stone, the creators of South Park, are predicted to be from South Park with probability 0.999.

“Well, I looked in my mom’s closet and saw what I was getting for Christmas, an UltraVibe Pleasure 2000.”

But what did we really learn? Here are the first 5 kernels from the first convolution layer, magnified x10:

w <- model$get_weights()

getKernelImage <- function(i) {
  as.cimg(w[[1]][, , , i]) %>%
    resize(100, 100) %>%
    save.image(file = paste0("D:/w1_", i, "_cnn.jpg"))
}
walk(1:5, getKernelImage)

image_read(paste0("D:/w1_", 1:5, "_cnn.jpg")) %>% image_border("black", "3x3") %>%
  image_append() %>% image_write("D:/kernels_images_cnn.jpg")

First 5 10x10 Convolution Kernels Magnified x10

OK, there’s nothing to be said here except the network responds to blobs of colors at this stage.

How about our images, how do they respond to the convolutions? What happens to Chef through the first four convolution layers?

chef <- test_image_read_matrix("D:/sp_images/sp_1_1_0011.jpg")

layer_names <- c("conv2d_1", "conv2d_2", "conv2d_3", "conv2d_4")

getChefAfterConvolutions <- function(i) {
  layer_name <- layer_names[i]
  intermediate_layer_model <- keras_model(inputs = model$input,
                                          outputs = get_layer(model, layer_name)$output)
  intermediate_output <- predict(intermediate_layer_model, chef)
  save.image(as.cimg(intermediate_output[1, , , 1]) %>% resize(-200, -200), paste0("D:/chef_", i, ".jpg"))
}

walk(1:4, getChefAfterConvolutions)

image_read(paste0("D:/chef_", 1:4, ".jpg")) %>%
  image_border("black", "3x3") %>%
  image_append() %>%
  image_write("D:/chef_after_convolutions.jpg")

Chef through convolutions

Cool, you can see how the network “sums” Chef up more and more. You can see what the network sees.

Finally, as in the legacy ImageNet article, we can try to use our network as a search engine: for a given search image, calculate its vector representation (“SimpsonsToVec”) from the last dense layer (length 512), correlate this vector with the vectors of all images in the training set, and see which images give us the top correlations and are retrieved this way:

search_images <- c(
  "D:/si_images/si_3_14_0047.jpg",
  "D:/si_images/si_5_20_0106.jpg",
  "D:/si_images/si_7_9_0065.jpg",
  "D:/si_images/si_7_12_0030.jpg",
  "D:/si_images/si_8_5_0002.jpg"
)

layer_name <- "activation_5"
intermediate_layer_model <- keras_model(inputs = model$input,
                                        outputs = get_layer(model, layer_name)$output)
intermediate_output_all <- predict(intermediate_layer_model,
                                   x_train[y_train == 1, , , ])
si_train_sample <- (1:(sum(y_train == 1) + sum(y_test == 1)))[samples$si_sample]
si_images_paths <- list.files("D:/si_images", full.names = TRUE)

image_read(search_images) %>%
  image_border("black", "3x3") %>%
  image_append(stack = TRUE) %>%
  image_write("D:/searchImages.jpg")

getSearchImageResults <- function(i) {
  searchImage <- test_image_read_matrix(search_images[i])
  intermediate_output <- predict(intermediate_layer_model, searchImage)
  correlations <- apply(intermediate_output_all, 1, function(x) cor(c(x), c(intermediate_output)))
  best_images_ind <- order(correlations, decreasing = TRUE) %>% head(5)
  best_images_paths <- si_images_paths[si_train_sample[best_images_ind]]
  
  best_images_paths %>%
    image_read() %>%
    image_border("black", "3x3") %>%
    image_append() %>%
    image_write(paste0("D:/best_images_", i, ".jpg"))
}

walk(1:5, getSearchImageResults)

image_read(paste0("D:/best_images_", 1:5, ".jpg")) %>%
  image_append(stack = TRUE) %>%
  image_write("D:/best_images_all.jpg")

image_read(c("D:/searchImages.jpg", "D:/best_images_all.jpg")) %>%
  image_border("white", "10x10") %>%
  image_append() %>%
  image_write("D:/searchResults.jpg")

Top 5 Images Retrieved ImageNet Style for Some Simpsons Images

Cool.

  • For the first Marge we get what we expected, different “Marges”. This is also true for Kent Brockman (last image).
  • For the second Marge it seems at first the results are not related. But they are! They show a different Simpsons character on a very similar pink and brown background.
  • For Grandpa, results are not good.
  • Lastly for Homer and Mr. Burns standing and talking - we see all kinds of characters, well, standing and talking.

“Scr** you guys, I’m going home.”

That was fun! My first Deep Learning network in R. I think I was able to show how smart this network is, as well as how silly. The network only knows what you feed into it. It especially doesn’t know how to say “I don’t know”, unless you teach it. So if you’re worried about networks eating the world and taking your job (“They Took Our Jobs”) - chillax. There’s still some time before this happens.


  1. i.e. 72% of images in the ~16K testing sample were classified correctly

  2. Sounds like a Transformer