In one of Silicon Valley Season 4’s episodes Dinesh finds himself in need of penis images, in order to build a penis-detection app. I thought about choosing this task for my object detection project. After all, I am no Dinesh; tagging images of penises comes easy to me1. But then I could hear my Dad, once this was posted, saying “it had to be you.” So I opted for faces instead:

You’re looking at about a month’s work, which might not seem impressive to you, knowing what the real Snapchat2 face masks look like. I should also acknowledge early on that most of my work consisted of translating this wonderful Python code from Pierluigi Ferrari, which implements the Single-Shot MultiBox Detector (SSD). Still, it was not easy, and at times I thought about giving up. Eventually, I think, this little ssd-keras package demonstrates well:

  • Using Rstudio’s Keras wrapper for a complex model
  • Writing custom non-trivial loss functions
  • Writing custom non-trivial layers with the help of the R6 classes system
  • Writing a custom data generator in R
  • Using reticulate to work with Python’s NumPy inside R
  • Using the amazing and still underrated magick package the way Python users use OpenCV for image manipulation and augmentation
  • Creating a tagged dataset for object detection from scratch
  • Extracting tremendous value from a small tagged dataset

WARNING: Yes, this is a “package”. But no, it is completely immature; there is a good chance you cannot use it for your own object detection project without understanding what you’re doing.

The Hell is SSD

OK, so this is the original article explaining the Single-Shot MultiBox Detector (SSD) for object detection. SSD is not the only way to detect objects, and in fact object detection has been around since long before Deep Learning, especially for detecting faces as a whole. Deep Learning, however, achieves “state-of-the-art” results on various kinds of objects, and is the key to making us all relax and never have to drive a car again. The field keeps improving; see this amazing review of various other DL object detection methods.

The article is the single source of truth. Alas, as with many DL articles, it is somewhat unclear (to me) in the details. I found better explanations in this video3 and in Pierluigi’s original code.

Here’s my attempt at explaining.

Previous (though still in use) methods for DL object detection usually involved two stages:

  1. Offer the network some “regions of interest”
  2. Train the network on those regions, producing for each region a prediction of its object class (a vector of #classes probabilities) and of its object bounding box (a vector of 4 values: the xmin, xmax, ymin and ymax of the box)

The novelty in SSD is that it works in a single stage, hence “Single Shot Detection”. SSD takes advantage of the fact that we’re already looking at the image by various scales when we’re adding smaller and smaller convolutional layers. Let these be our “regions of interest”:

Let’s stride through each of these smaller feature maps, where each “cell” now stands for a larger and larger patch of the original image, and predict for each such patch the #classes + 4 values.

And while we’re at it, let’s not always look at a specific “square” or “rectangle” around a cell, let’s look at different “anchor boxes” (a square, a 2:1 rectangle, a 1:2 rectangle etc.):

That is how the 8x8 feature map in (b) captures the cat but not the dog, and the 4x4 feature map in (c) catches the dog and (maybe) not the cat. It’s a matter of perspective. In this project we’re looking at 4 different boxes for each cell.
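To make the “4 different boxes” concrete: with aspect ratios of 0.5, 1 and 2 and the smallest scale we’ll use later (0.08 of a 250-pixel image), the usual SSD recipe gives four anchor shapes per cell, roughly like this (a back-of-the-envelope sketch; the package’s SSDBoxEncoder has its own exact variant):

size <- 250                      # the input image size
this_scale <- 0.08               # the scale of this feature map's anchors
next_scale <- 0.16               # the scale of the next feature map's anchors
aspect_ratios <- c(0.5, 1, 2)
w <- this_scale * size * sqrt(aspect_ratios)
h <- this_scale * size / sqrt(aspect_ratios)
# two_boxes_for_ar1 = TRUE adds a 4th box: aspect ratio 1 again, scaled by the
# geometric mean of this scale and the next one
w <- c(w, sqrt(this_scale * next_scale) * size)
h <- c(h, sqrt(this_scale * next_scale) * size)
round(cbind(w, h))               # four (w, h) pairs: 14x28, 20x20, 28x14, 28x28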

The downside is that we’re considering thousands of such boxes (4,144 in this project). It is quite likely our cat will be captured by dozens of boxes at different feature maps or perspectives. So SSD uses a tool called Non-Maximum Suppression, NMS for short, which sounds complicated but is basically an iterative method for picking the best box at prediction time, by removing boxes which overlap considerably with one another.
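If NMS sounds mysterious, here is a bare-bones greedy version in plain R, just to fix ideas (an illustration only; boxes are rows of xmin, xmax, ymin, ymax with a separate confidence vector, and the package’s decode_y/decode_y2 functions do the real work):

iou <- function(a, b) {
  # intersection-over-union of two boxes given as c(xmin, xmax, ymin, ymax)
  ix <- max(0, min(a[2], b[2]) - max(a[1], b[1]))
  iy <- max(0, min(a[4], b[4]) - max(a[3], b[3]))
  inter <- ix * iy
  inter / ((a[2] - a[1]) * (a[4] - a[3]) + (b[2] - b[1]) * (b[4] - b[3]) - inter)
}

nms <- function(boxes, conf, iou_threshold = 0.45) {
  keep <- integer(0)
  idx <- order(conf, decreasing = TRUE)
  while (length(idx) > 0) {
    best <- idx[1]                     # take the most confident remaining box...
    keep <- c(keep, best)
    overlaps <- vapply(idx[-1], function(i) iou(boxes[best, ], boxes[i, ]), numeric(1))
    idx <- idx[-1][overlaps <= iou_threshold]  # ...and drop everything overlapping it
  }
  boxes[keep, , drop = FALSE]
}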

Now, notice the original SSD architecture does not start from “zero” but takes its initial weights from the famous VGG16 network, which was trained for the ImageNet Challenge. In my implementation, which is a copy of Pierluigi’s “SSD7” model, I’m simplifying things by starting from “zero”, adding 3 top convolutional layers, then another 4 convolutional layers decreasing in size, on which predictions are made:

Face, Face, Face

Originally, Pierluigi demonstrates his code on Udacity’s self-driving-car dataset, which means detecting cars, signs, pedestrians etc. I couldn’t do that, and since penises were out of the question, I chose faces. I want to be able to look at an image of one or more faces and locate these three classes:

  1. Eyes
  2. Nose
  3. Mouth

Nothing too complicated. Now, getting images of faces online is easy, but getting faces tagged for eyes, noses and mouths is a bit more difficult. I decided I’d do it myself. I downloaded the Labeled Faces in the Wild dataset, which consists of ~13K 250x250 images of faces, and chose a random batch of 1000 faces to tag manually (really, 1000 is my limit…):

library(stringr)

allFaces <- list.files("D:/lfw", recursive = TRUE, full.names = TRUE)
# sample 1000 faces and copy them to a flat folder with zero-padded names
currentFiles <- allFaces[sample(length(allFaces), 1000)]
destFiles <- str_c("D:/faces/", str_pad(1:1000, 5, pad = "0"), ".jpg")
file.copy(from = currentFiles, to = destFiles)

head(currentFiles)
## [1] "D:/lfw/Norah_Jones/Norah_Jones_0006.jpg"        
## [2] "D:/lfw/John_Negroponte/John_Negroponte_0024.jpg"
## [3] "D:/lfw/Tony_Bennett/Tony_Bennett_0002.jpg"      
## [4] "D:/lfw/George_W_Bush/George_W_Bush_0438.jpg"    
## [5] "D:/lfw/Eminem/Eminem_0001.jpg"                  
## [6] "D:/lfw/Jiang_Zemin/Jiang_Zemin_0013.jpg"

I thought about implementing a tool in shiny which would allow me to quickly draw rectangles around the eyes, noses and mouths4, but then found out about the LabelImg tool, which is open source and free to use. You simply draw your rectangles and the tool shoots out an XML file for each and every face, which you can later parse. This is how I tagged 1K faces, and fast:

And this is what the XML looks like, specifying the xmin, xmax, ymin and ymax for each tagged class:
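If you haven’t met it before, LabelImg writes Pascal VOC style XML, one file per image. A stripped-down example (here using the “eyes” box of the first face, as it will appear in the parsed data below) looks roughly like this:

<annotation>
  <filename>00001.jpg</filename>
  <object>
    <name>eyes</name>
    <bndbox>
      <xmin>89</xmin>
      <ymin>95</ymin>
      <xmax>164</xmax>
      <ymax>126</ymax>
    </bndbox>
  </object>
  <!-- ...one <object> node per tagged nose and mouth... -->
</annotation>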

Then, with the help of the xml2 package, I parse these XMLs, create the dataset and write it into training/validation CSV files:

library(xml2)
library(tidyverse)
library(stringr)

parseXML <- function(xml) {
  # the image file name this XML refers to
  frame <- xml %>%
    xml_find_first("//filename") %>%
    xml_text()
  
  # the tagged classes, encoded as integers: eyes = 1, nose = 2, mouth = 3
  classes <- xml %>%
    xml_find_all("//object") %>%
    xml_find_all("//name") %>%
    xml_text() %>%
    factor(levels = c("eyes", "nose", "mouth")) %>%
    as.integer() %>%
    as_tibble() %>%
    magrittr::set_colnames("class_id")
  
  # the bounding boxes: each <bndbox> holds xmin, ymin, xmax, ymax (in that order)
  bndbx <- xml %>%
    xml_find_all("//bndbox") %>%
    xml_children() %>%
    xml_integer() %>%
    split(rep(1:dim(classes)[1], each = 4)) %>%
    as_tibble() %>%
    t() %>%
    magrittr::set_colnames(c("xmin", "ymin", "xmax", "ymax")) %>%
    as_tibble() %>%
    select(xmin, xmax, ymin, ymax)
  
  # one row per tagged object: file name, box coordinates, class id
  cbind(frame, bndbx, classes) %>%
    as_tibble() %>%
    mutate(frame = as.character(frame))
}

data <- list.files("D:/faces", full.names = TRUE) %>%
  discard(!str_detect(., "xml")) %>%
  map(., read_xml) %>%
  map_dfr(parseXML)

splitN <- 3001
write_csv(data[1:splitN,], "train.csv")
write_csv(data[(splitN + 1):nrow(data),], "val.csv")

This is what the data looks like:

head(data)
## # A tibble: 6 x 6
##       frame  xmin  xmax  ymin  ymax class_id
##       <chr> <int> <int> <int> <int>    <int>
## 1 00001.jpg    89   164    95   126        1
## 2 00001.jpg   104   132   109   144        2
## 3 00001.jpg    96   143   146   165        3
## 4 00002.jpg    97   159    97   123        1
## 5 00002.jpg   115   138   106   145        2
## 6 00002.jpg   108   138   154   172        3

You can download all images and two CSVs from my Github repo.

Face Model

There’s no way I’m going to go through each and every line of code; that’s what Pierluigi’s comments are for, see the source. But it is worth showing here the actual SSD model written with R’s Keras wrapper, and seeing how it translates to the theoretical explanation of SSD above. Remember I’m using the simplified “SSD7” version, which contains “only” 7 convolutional layers. Again, for many more details see the source:

x <- layer_input(shape = c(img_height, img_width, img_channels))
normed <- layer_lambda(x, function(z) z/127.5 - 1., # Convert input feature range to [-1,1]
                      output_shape = c(img_height, img_width, img_channels),
                      name = 'lambda1')

conv1 <- layer_conv_2d(normed, 32, c(5, 5), name ='conv1', strides = c(1, 1), padding= "same")
conv1 <- layer_batch_normalization(conv1, axis=3, momentum =0.99, name = 'bn1')
conv1 <- layer_activation_elu(conv1, name = 'elu1')
pool1 <- layer_max_pooling_2d(conv1, pool_size = c(2, 2), name='pool1')

conv2 <- layer_conv_2d(pool1, 48, c(3, 3), name='conv2', strides=c(1, 1), padding="same")
conv2 <- layer_batch_normalization(conv2, axis=3, momentum=0.99, name='bn2')
conv2 <- layer_activation_elu(conv2, name='elu2')
pool2 <- layer_max_pooling_2d(conv2, pool_size=c(2, 2), name='pool2')

conv3 <- layer_conv_2d(pool2, 64, c(3, 3), name='conv3', strides=c(1, 1), padding="same")
conv3 <- layer_batch_normalization(conv3, axis=3, momentum=0.99, name='bn3')
conv3 <- layer_activation_elu(conv3, name='elu3')
pool3 <- layer_max_pooling_2d(conv3, pool_size = c(2, 2), name='pool3')

conv4 <- layer_conv_2d(pool3, 64, c(3, 3), name='conv4', strides=c(1, 1), padding="same")
conv4 <- layer_batch_normalization(conv4, axis=3, momentum=0.99, name='bn4')
conv4 <- layer_activation_elu(conv4, name='elu4')
pool4 <- layer_max_pooling_2d(conv4, pool_size=c(2, 2), name='pool4')

conv5 <- layer_conv_2d(pool4, 48, c(3, 3), name='conv5', strides=c(1, 1), padding="same")
conv5 <- layer_batch_normalization(conv5, axis=3, momentum=0.99, name='bn5')
conv5 <- layer_activation_elu(conv5, name='elu5')
pool5 <- layer_max_pooling_2d(conv5, pool_size=c(2, 2), name='pool5')

conv6 <- layer_conv_2d(pool5, 48, c(3, 3), name='conv6', strides=c(1, 1), padding="same")
conv6 <- layer_batch_normalization(conv6, axis=3, momentum=0.99, name='bn6')
conv6 <- layer_activation_elu(conv6, name='elu6')
pool6 <- layer_max_pooling_2d(conv6, pool_size=c(2, 2), name='pool6')

conv7 <- layer_conv_2d(pool6, 32, c(3, 3), name='conv7', strides=c(1, 1), padding="same")
conv7 <- layer_batch_normalization(conv7, axis=3, momentum=0.99, name='bn7')
conv7 <- layer_activation_elu(conv7, name='elu7')

Now we’re leaving the first 3 convolutional layers alone, and predicting our classes for each of the last 4 convolutional layers (Classification of 3 classes + 1 class for ‘background’) and the box for each class (Regression of 4 values: xmin, xmax, ymin and ymax):

classes4 <- layer_conv_2d(conv4, n_boxes_conv4 * n_classes, c(3, 3), strides=c(1, 1), padding="valid", name='classes4')
classes5 <- layer_conv_2d(conv5, n_boxes_conv5 * n_classes, c(3, 3), strides=c(1, 1), padding="valid", name='classes5')
classes6 <- layer_conv_2d(conv6, n_boxes_conv6 * n_classes, c(3, 3), strides=c(1, 1), padding="valid", name='classes6')
classes7 <- layer_conv_2d(conv7, n_boxes_conv7 * n_classes, c(3, 3), strides=c(1, 1), padding="valid", name='classes7')

boxes4 <- layer_conv_2d(conv4, n_boxes_conv4 * 4, c(3, 3), strides=c(1, 1), padding="valid", name='boxes4')
boxes5 <- layer_conv_2d(conv5, n_boxes_conv5 * 4, c(3, 3), strides=c(1, 1), padding="valid", name='boxes5')
boxes6 <- layer_conv_2d(conv6, n_boxes_conv6 * 4, c(3, 3), strides=c(1, 1), padding="valid", name='boxes6')
boxes7 <- layer_conv_2d(conv7, n_boxes_conv7 * 4, c(3, 3), strides=c(1, 1), padding="valid", name='boxes7')

We’re reshaping the predicting layers to have the bottom-line shape of batch_size x no. of boxes x no. of classes and batch_size x no. of boxes x 4:

classes4_reshaped <- layer_reshape(classes4, c(-1, n_classes), name='classes4_reshape')
classes5_reshaped <- layer_reshape(classes5, c(-1, n_classes), name='classes5_reshape')
classes6_reshaped <- layer_reshape(classes6, c(-1, n_classes), name='classes6_reshape')
classes7_reshaped <- layer_reshape(classes7, c(-1, n_classes), name='classes7_reshape')

boxes4_reshaped <- layer_reshape(boxes4, c(-1, 4), name='boxes4_reshape')
boxes5_reshaped <- layer_reshape(boxes5, c(-1, 4), name='boxes5_reshape')
boxes6_reshaped <- layer_reshape(boxes6, c(-1, 4), name='boxes6_reshape')
boxes7_reshaped <- layer_reshape(boxes7, c(-1, 4), name='boxes7_reshape')

There are a few more lines I’m excluding here because this post is long enough; I just want to give you a sense of what’s going on. We concatenate the classes and boxes outputs, apply a “softmax” activation to the classes, then concatenate all outputs into the predictions layer. Don’t worry if you don’t know what anchors_concat means; you’ll have to get into the source for that:

classes_concat <- layer_concatenate(list(classes4_reshaped,
                                          classes5_reshaped,
                                          classes6_reshaped,
                                          classes7_reshaped), axis = 1, name = 'classes_concat')

boxes_concat <- layer_concatenate(list(boxes4_reshaped,
                                      boxes5_reshaped,
                                      boxes6_reshaped,
                                      boxes7_reshaped), axis = 1, name = 'boxes_concat')

classes_softmax <- layer_activation(classes_concat, 'softmax', name='classes_softmax')

predictions <- layer_concatenate(list(classes_softmax, boxes_concat, anchors_concat),
                                axis=2, name='predictions')

model <- keras_model(inputs = x, outputs = predictions)

Eyes Wide Shut

Let’s get it on! These are our parameters:

img_height <- 250 # Height of the input images
img_width <- 250 # Width of the input images
img_channels <- 3 # Number of color channels of the input images
n_classes <- 4L # Number of classes including the background class
min_scale <- 0.08 # The scaling factor for the smallest anchor boxes
max_scale <- 0.96 # The scaling factor for the largest anchor boxes
scales <- c(0.08, 0.16, 0.32, 0.64, 0.96) # An explicit list of anchor box scaling factors. If this is passed, it will override `min_scale` and `max_scale`.
aspect_ratios <- c(0.5, 1.0, 2.0) # The list of aspect ratios for the anchor boxes
two_boxes_for_ar1 <- TRUE # Whether or not you want to generate two anchor boxes for aspect ratio 1
limit_boxes <- FALSE # Whether or not you want to limit the anchor boxes to lie entirely within the image boundaries
variances <- c(1.0, 1.0, 1.0, 1.0) # The list of variances by which the encoded target coordinates are scaled
coords <- 'centroids' # Whether the box coordinates to be used should be in the 'centroids' or 'minmax' format, see documentation
normalize_coords <- FALSE # Whether or not the model is supposed to use relative coordinates that are within [0,1]

Notice again there are 4 classes including the background.

The build_model function returns the not-yet compiled model architecture as we’ve seen above:

library(keras)
library(stringr)
library(ssdkeras)

K <- backend()

K$clear_session() # Clear previous models from memory.
# The output `predictor_sizes` is needed below to set up `SSDBoxEncoder`
modelOut = build_model(image_size = c(img_height, img_width, img_channels),
                                     n_classes=n_classes,
                                     min_scale=min_scale,
                                     max_scale=max_scale,
                                     scales=scales,
                                     aspect_ratios_global=aspect_ratios,
                                     aspect_ratios_per_layer=NULL,
                                     two_boxes_for_ar1=two_boxes_for_ar1,
                                     limit_boxes=limit_boxes,
                                     variances=variances,
                                     coords=coords,
                                     normalize_coords=normalize_coords)
model <- modelOut$model
predictor_sizes <- modelOut$predictor_sizes

model holds the model, obviously, and predictor_sizes holds the sizes of the 4 predicting convolutional layers, i.e. the grids the 250x250 images get scaled down into:

predictor_sizes
##      [,1] [,2]
## [1,]   29   29
## [2,]   13   13
## [3,]    5    5
## [4,]    1    1
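As a back-of-the-envelope check on these numbers: each max-pooling halves the feature map (rounding down), the “same” convolutions keep the size, and the 3x3 “valid” predictor convolutions shave off a pixel on each side. With 4 anchor boxes per cell this also recovers the 4,144 boxes mentioned earlier:

pool <- function(x) floor(x / 2)                      # max pooling halves the map
conv4 <- pool(pool(pool(250)))                        # 31, after pool1..pool3
conv5 <- pool(conv4)                                  # 15
conv6 <- pool(conv5)                                  # 7
conv7 <- pool(conv6)                                  # 3
predictor_sizes <- c(conv4, conv5, conv6, conv7) - 2  # 29 13 5 1, after the 'valid' 3x3 convs
sum(predictor_sizes ^ 2 * 4)                          # 4144 anchor boxes overall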

We choose a batch_size, the Adam optimizer and our custom compute_loss function, which we get by instantiating the SSDLoss class. Then we compile:

batch_size <- 32L

adam <- optimizer_adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=5e-05)

ssd_loss <- SSDLoss$new(neg_pos_ratio=3L, n_neg_min=0L, alpha=1.0, n_classes = n_classes)

model$compile(optimizer=adam, loss=ssd_loss$compute_loss)
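For intuition, the loss being compiled here is the standard SSD combination of a softmax confidence loss and a smooth L1 localization loss, weighted by alpha, with hard negative mining controlled by neg_pos_ratio. Stripped of batching, anchor matching and negative mining, the idea for a single matched box is roughly this (a plain R illustration, not the package’s actual compute_loss):

# conf_true: one-hot class vector, conf_pred: softmax output,
# loc_true / loc_pred: the 4 encoded box offsets, alpha: localization weight
smooth_l1 <- function(x) ifelse(abs(x) < 1, 0.5 * x^2, abs(x) - 0.5)
ssd_loss_one_box <- function(conf_true, conf_pred, loc_true, loc_pred, alpha = 1) {
  conf_loss <- -sum(conf_true * log(conf_pred))       # cross-entropy on the class
  loc_loss  <- sum(smooth_l1(loc_pred - loc_true))    # smooth L1 on the box offsets
  conf_loss + alpha * loc_loss
}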

Next we instantiate the box encoder. This is a utility class which helps us later translate between the actual predicted offset values and the xmin, xmax, ymin and ymax pixel values of each box:

ssd_box_encoder <- SSDBoxEncoder$new(img_height=img_height,
                                img_width=img_width,
                                n_classes=n_classes, 
                                predictor_sizes=predictor_sizes,
                                min_scale=min_scale,
                                max_scale=max_scale,
                                scales=scales,
                                aspect_ratios_global=aspect_ratios,
                                aspect_ratios_per_layer= NULL,
                                two_boxes_for_ar1=two_boxes_for_ar1,
                                limit_boxes=limit_boxes,
                                variances=variances,
                                pos_iou_threshold=0.5,
                                neg_iou_threshold=0.2,
                                coords=coords,
                                normalize_coords=normalize_coords)
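For intuition, the “centroids” format is just a reparametrization of the minmax boxes we tagged: a center point plus a width and a height (the encoder then turns these into offsets relative to each anchor box, scaled by the variances). A tiny helper, not part of the package, to show the conversion:

minmax_to_centroids <- function(box) {
  # box comes in as c(xmin, xmax, ymin, ymax)
  c(cx = (box[1] + box[2]) / 2,
    cy = (box[3] + box[4]) / 2,
    w  = box[2] - box[1],
    h  = box[4] - box[3])
}
minmax_to_centroids(c(89, 164, 95, 126))  # the "eyes" box of 00001.jpg: 126.5 110.5 75 31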

Now we create our train_generator and val_generator data generators by instantiating the BatchGenerator class, loading the two CSVs we created before and specifying where to find the images:

train_dataset <- BatchGenerator$new(box_output_format = c('class_id', 'xmin', 'xmax', 'ymin', 'ymax'))

train_dataset$parse_csv(
  images_path = "D:/faces",
  labels_path = "train.csv",
  include_classes = 'all',
  input_format = c('image_name', 'xmin', 'xmax', 'ymin', 'ymax', 'class_id')
)

train_generator <- train_dataset$generate(batch_size = batch_size,
                                         train = TRUE,
                                         ssd_box_encoder = ssd_box_encoder,
                                         equalize = FALSE,
                                         brightness = c(0.5, 2, 0.5),
                                         flip = 0.5,
                                         translate = list(c(5, 50), c(3, 30), 0.5),
                                         scale = c(0.75, 1.3, 0.5),
                                         limit_boxes = TRUE,
                                         include_thresh = 0.4,
                                         diagnostics = FALSE)

n_train_samples <- train_dataset$get_n_samples()

val_dataset <- BatchGenerator$new(box_output_format = c('class_id', 'xmin', 'xmax', 'ymin', 'ymax'))

val_dataset$parse_csv(images_path = "D:/faces",
                      labels_path = "val.csv",
                      include_classes = 'all',
                      input_format = c('image_name', 'xmin', 'xmax', 'ymin', 'ymax', 'class_id')
                      )

val_generator <- val_dataset$generate(batch_size = batch_size,
                                     train = TRUE,
                                     ssd_box_encoder = ssd_box_encoder,
                                     equalize = FALSE,
                                     brightness = FALSE,
                                     flip = FALSE,
                                     translate = FALSE,
                                     scale = FALSE,
                                     limit_boxes = TRUE,
                                     include_thresh = 0.4,
                                     diagnostics = FALSE)

n_val_samples = val_dataset$get_n_samples()

If you’re using my CSVs, you get 946 training images and 54 validation images.

Notice all those mysterious equalize, brightness, flip etc. values. The generator uses these to apply some random manipulations to the training images before they enter training. For example, flip = 0.5 means an image is flipped horizontally with 50% probability. I talk about these some more in the Magic Magick section. They are especially important when your dataset is small, like our ~1K images here, because they let the network see a more varied picture of the world and keep it from overfitting to the small subset of images it sees.

Now train, model, train!

We train with an “early stopping” strategy: if the validation loss doesn’t improve for 5 epochs, training stops. Notice the use of reticulate::py_iterator here to wrap our generators.5 Each generator is a function which returns, for each batch, a list of (batch_X, y_true): the batch_size x 250 x 250 x 3 batch_X array for the current batch, and the batch_size x #boxes x (#classes + 12)6 y_true array.
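Before committing to a long training run you can sanity-check those shapes by pulling a single batch straight from the generator (assuming you built train_generator as above; with 4,144 boxes and 4 classes the last dimension is 4 + 12 = 16):

b <- train_generator()
dim(b[[1]])  # batch_X: 32 250 250 3
dim(b[[2]])  # y_true:  32 4144 16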

epochs <- 100

history <- model$fit_generator(generator = reticulate::py_iterator(train_generator),
                              steps_per_epoch = ceiling(n_train_samples/batch_size),
                              epochs = epochs,
                              validation_data = reticulate::py_iterator(val_generator),
                              validation_steps = ceiling(n_val_samples/batch_size),
                              callbacks = list(
                                callback_model_checkpoint("checkpoints3.h5",
                                                          monitor='val_loss',
                                                          verbose=1,
                                                          save_best_only=TRUE,
                                                          save_weights_only=TRUE,
                                                          mode='auto',
                                                          period=1),
                                callback_early_stopping(monitor='val_loss',
                                                        min_delta=0.001,
                                                        patience=5),
                                callback_reduce_lr_on_plateau(monitor='val_loss',
                                                              factor=0.5,
                                                              patience=0,
                                                              epsilon=0.001,
                                                              cooldown=0)
                              ))
## Epoch 1/100
## 30/30 [==============================] - 309s - loss: 0.2653 - val_loss: 0.2076
## val_loss improved from inf to 0.20762, saving model to checkpoints3.h5
## Epoch 2/100
## 30/30 [==============================] - 284s - loss: 0.1356 - val_loss: 0.1745
## val_loss improved from 0.20762 to 0.17455, saving model to checkpoints3.h5
## Epoch 3/100
## 30/30 [==============================] - 282s - loss: 0.1092 - val_loss: 0.1452
## val_loss improved from 0.17455 to 0.14524, saving model to checkpoints3.h5
## Epoch 4/100
## 30/30 [==============================] - 283s - loss: 0.0973 - val_loss: 0.1254
## val_loss improved from 0.14524 to 0.12539, saving model to checkpoints3.h5
## Epoch 5/100
## 30/30 [==============================] - 280s - loss: 0.0947 - val_loss: 0.1109
## val_loss improved from 0.12539 to 0.11087, saving model to checkpoints3.h5
## Epoch 6/100
## 30/30 [==============================] - 280s - loss: 0.0851 - val_loss: 0.1056
## val_loss improved from 0.11087 to 0.10560, saving model to checkpoints3.h5
## Epoch 7/100
## 30/30 [==============================] - 284s - loss: 0.0779 - val_loss: 0.1025
## val_loss improved from 0.10560 to 0.10249, saving model to checkpoints3.h5
## Epoch 8/100
## 30/30 [==============================] - 281s - loss: 0.0716 - val_loss: 0.0970
## val_loss improved from 0.10249 to 0.09703, saving model to checkpoints3.h5
## Epoch 9/100
## 30/30 [==============================] - 282s - loss: 0.0672 - val_loss: 0.0958
## val_loss improved from 0.09703 to 0.09579, saving model to checkpoints3.h5
## Epoch 10/100
## 30/30 [==============================] - 283s - loss: 0.0605 - val_loss: 0.0847
## val_loss improved from 0.09579 to 0.08474, saving model to checkpoints3.h5
## Epoch 11/100
## 30/30 [==============================] - 284s - loss: 0.0583 - val_loss: 0.0746
## val_loss improved from 0.08474 to 0.07464, saving model to checkpoints3.h5
## Epoch 12/100
## 30/30 [==============================] - 287s - loss: 0.0532 - val_loss: 0.0728
## val_loss improved from 0.07464 to 0.07285, saving model to checkpoints3.h5
## Epoch 13/100
## 30/30 [==============================] - 290s - loss: 0.0508 - val_loss: 0.0630
## val_loss improved from 0.07285 to 0.06295, saving model to checkpoints3.h5
## Epoch 14/100
## 30/30 [==============================] - 290s - loss: 0.0482 - val_loss: 0.0551
## val_loss improved from 0.06295 to 0.05508, saving model to checkpoints3.h5
## Epoch 15/100
## 30/30 [==============================] - 283s - loss: 0.0462 - val_loss: 0.0530
## val_loss improved from 0.05508 to 0.05296, saving model to checkpoints3.h5
## Epoch 16/100
## 30/30 [==============================] - 283s - loss: 0.0436 - val_loss: 0.0464
## val_loss improved from 0.05296 to 0.04636, saving model to checkpoints3.h5
## Epoch 17/100
## 30/30 [==============================] - 281s - loss: 0.0419 - val_loss: 0.0369
## val_loss improved from 0.04636 to 0.03693, saving model to checkpoints3.h5
## Epoch 18/100
## 30/30 [==============================] - 280s - loss: 0.0392 - val_loss: 0.0365
## val_loss improved from 0.03693 to 0.03649, saving model to checkpoints3.h5
## Epoch 19/100
## 30/30 [==============================] - 285s - loss: 0.0374 - val_loss: 0.0349
## val_loss improved from 0.03649 to 0.03487, saving model to checkpoints3.h5
## Epoch 20/100
## 30/30 [==============================] - 283s - loss: 0.0342 - val_loss: 0.0317
## val_loss improved from 0.03487 to 0.03173, saving model to checkpoints3.h5
## Epoch 21/100
## 30/30 [==============================] - 279s - loss: 0.0346 - val_loss: 0.0309
## val_loss improved from 0.03173 to 0.03094, saving model to checkpoints3.h5
## Epoch 22/100
## 30/30 [==============================] - 279s - loss: 0.0332 - val_loss: 0.0293
## val_loss improved from 0.03094 to 0.02928, saving model to checkpoints3.h5
## Epoch 23/100
## 30/30 [==============================] - 282s - loss: 0.0344 - val_loss: 0.0287
## val_loss improved from 0.02928 to 0.02872, saving model to checkpoints3.h5
## Epoch 24/100
## 30/30 [==============================] - 281s - loss: 0.0320 - val_loss: 0.0291
## val_loss did not improve
## Epoch 25/100
## 30/30 [==============================] - 281s - loss: 0.0332 - val_loss: 0.0280
## val_loss improved from 0.02872 to 0.02798, saving model to checkpoints3.h5
## Epoch 26/100
## 30/30 [==============================] - 282s - loss: 0.0328 - val_loss: 0.0284
## val_loss did not improve
## Epoch 27/100
## 30/30 [==============================] - 280s - loss: 0.0318 - val_loss: 0.0272
## val_loss improved from 0.02798 to 0.02725, saving model to checkpoints3.h5
## Epoch 28/100
## 30/30 [==============================] - 280s - loss: 0.0314 - val_loss: 0.0271
## val_loss improved from 0.02725 to 0.02708, saving model to checkpoints3.h5
## Epoch 29/100
## 30/30 [==============================] - 282s - loss: 0.0336 - val_loss: 0.0274
## val_loss did not improve
## Epoch 30/100
## 30/30 [==============================] - 281s - loss: 0.0314 - val_loss: 0.0283
## val_loss did not improve
## Epoch 31/100
## 30/30 [==============================] - 283s - loss: 0.0314 - val_loss: 0.0273
## val_loss did not improve
## Epoch 32/100
## 30/30 [==============================] - 286s - loss: 0.0316 - val_loss: 0.0274
## val_loss did not improve
## Epoch 33/100
## 30/30 [==============================] - 280s - loss: 0.0312 - val_loss: 0.0266
## val_loss improved from 0.02708 to 0.02657, saving model to checkpoints3.h5
## Epoch 34/100
## 30/30 [==============================] - 281s - loss: 0.0318 - val_loss: 0.0268
## val_loss did not improve
## Epoch 35/100
## 30/30 [==============================] - 281s - loss: 0.0314 - val_loss: 0.0281
## val_loss did not improve
## Epoch 36/100
## 30/30 [==============================] - 281s - loss: 0.0323 - val_loss: 0.0275
## val_loss did not improve
## Epoch 37/100
## 30/30 [==============================] - 281s - loss: 0.0316 - val_loss: 0.0272
## val_loss did not improve
## Epoch 38/100
## 30/30 [==============================] - 282s - loss: 0.0312 - val_loss: 0.0282
## val_loss did not improve
## Epoch 39/100
## 30/30 [==============================] - 280s - loss: 0.0313 - val_loss: 0.0275
## val_loss did not improve
## Epoch 40/100
## 30/30 [==============================] - 282s - loss: 0.0320 - val_loss: 0.0267
## val_loss did not improve
## Epoch 41/100
## 30/30 [==============================] - 281s - loss: 0.0313 - val_loss: 0.0254
## val_loss improved from 0.02657 to 0.02545, saving model to checkpoints3.h5
## Epoch 42/100
## 30/30 [==============================] - 281s - loss: 0.0319 - val_loss: 0.0280
## val_loss did not improve
## Epoch 43/100
## 30/30 [==============================] - 281s - loss: 0.0311 - val_loss: 0.0275
## val_loss did not improve
## Epoch 44/100
## 30/30 [==============================] - 281s - loss: 0.0318 - val_loss: 0.0291
## val_loss did not improve
## Epoch 45/100
## 30/30 [==============================] - 282s - loss: 0.0311 - val_loss: 0.0296
## val_loss did not improve
## Epoch 46/100
## 30/30 [==============================] - 280s - loss: 0.0314 - val_loss: 0.0281
## val_loss did not improve
## Epoch 47/100
## 30/30 [==============================] - 282s - loss: 0.0318 - val_loss: 0.0245
## val_loss improved from 0.02545 to 0.02451, saving model to checkpoints3.h5
## Epoch 48/100
## 30/30 [==============================] - 279s - loss: 0.0326 - val_loss: 0.0294
## val_loss did not improve
## Epoch 49/100
## 30/30 [==============================] - 283s - loss: 0.0312 - val_loss: 0.0285
## val_loss did not improve
## Epoch 50/100
## 30/30 [==============================] - 282s - loss: 0.0317 - val_loss: 0.0261
## val_loss did not improve
## Epoch 51/100
## 30/30 [==============================] - 287s - loss: 0.0310 - val_loss: 0.0248
## val_loss did not improve
## Epoch 52/100
## 30/30 [==============================] - 285s - loss: 0.0318 - val_loss: 0.0300
## val_loss did not improve

Let’s load the best weights in terms of minimum validation loss we’ve seen:

model$load_weights("checkpoints3.h5")

Now… how did we do?

We need to decode the y_pred values we get from the model. We do this with the decode_y2 function (as opposed to the decode_y function, see the source for more details), which performs Non-Maximum Suppression on boxes whose predicted class exceeds some confidence threshold. However, I’m doing something a bit different here. Since I expect to see in almost all pictures a single “eyes” box, a single nose and a single mouth, I have added a oneOfEach parameter which, if TRUE, tells decode_y2 to return only the best box for each class (provided it has passed the confidence threshold of course, here 0.8):
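Conceptually (this is not decode_y2’s actual internals), picking one box per class out of a decoded matrix whose columns are (class_id, confidence, xmin, xmax, ymin, ymax) boils down to something like:

pick_one_of_each <- function(decoded, conf_thresh = 0.8) {
  # keep only boxes above the confidence threshold...
  decoded <- decoded[decoded[, 2] >= conf_thresh, , drop = FALSE]
  # ...then keep the single most confident box per class
  do.call(rbind, lapply(split.data.frame(decoded, decoded[, 1]),
                        function(b) b[which.max(b[, 2]), , drop = FALSE]))
}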

predict_generator = val_dataset$generate(batch_size=1L,
                                         train = FALSE,
                                         equalize = FALSE,
                                         brightness=FALSE,
                                         flip=FALSE,
                                         translate=FALSE,
                                         scale=FALSE,
                                         limit_boxes=TRUE,
                                         include_thresh=0.4,
                                         diagnostics=FALSE)

predGen <- predict_generator()

X <- predGen[[1]]
y_true <- predGen[[2]]
filenames <- predGen[[3]]

y_pred = model$predict(X)

y_pred_decoded = decode_y2(y_pred,
                           confidence_thresh = 0.8,
                           iou_threshold = 0.2,
                           top_k = 'all',
                           input_coords = 'centroids',
                           normalize_coords = FALSE,
                           img_height = NULL,
                           img_width = NULL,
                           n_classes = n_classes,
                           oneOfEach = TRUE)

classes = c("eyes", "nose", "mouth")

img <- jpeg::readJPEG(stringr::str_c("D:/faces/", filenames))
plot.new()
rasterImage(img, 0, 0, 1, 1)

if (length(y_true) > 0) {
  imgBoxes <- cbind(y_true[[1]][, 2:3] / img_width,
                    1 -  y_true[[1]][, 4:5] / img_height)
  rect(xleft = imgBoxes[, 1], xright = imgBoxes[, 2],
       ybottom = imgBoxes[, 3], ytop = imgBoxes[, 4], border = "green")
}

if (dim(y_pred_decoded[[1]])[1] > 0) {
  predBoxes <- cbind(y_pred_decoded[[1]][, 3:4, drop = FALSE] / img_width,
                     1 - y_pred_decoded[[1]][, 5:6, drop = FALSE] / img_height)
  rect(xleft = predBoxes[, 1], xright = predBoxes[, 2],
       ybottom = predBoxes[, 3], ytop = predBoxes[, 4], border = "red")
  captions <- str_c(classes[y_pred_decoded[[1]][, 1]], " ",
                    format(round(y_pred_decoded[[1]][, 2], 2), nsmall = 2))
  text(x = predBoxes[, 1], y = predBoxes[, 3], labels = captions,
       adj = c(-0.1, -0.3), col = "red", cex = 0.8)
}

Not too shabby! The green boxes are the “true” tags; the red boxes are the predicted boxes, together with the predicted class and confidence.

Let’s see one more:

predGen <- predict_generator()

X <- predGen[[1]]
y_true <- predGen[[2]]
filenames <- predGen[[3]]

y_pred = model$predict(X)

y_pred_decoded = decode_y2(y_pred,
                           confidence_thresh = 0.8,
                           iou_threshold = 0.2,
                           top_k = 'all',
                           input_coords = 'centroids',
                           normalize_coords = FALSE,
                           img_height = NULL,
                           img_width = NULL,
                           n_classes = n_classes,
                           oneOfEach = TRUE)

classes = c("eyes", "nose", "mouth")

img <- jpeg::readJPEG(stringr::str_c("D:/faces/", filenames))
plot.new()
rasterImage(img, 0, 0, 1, 1)

if (length(y_true) > 0) {
  imgBoxes <- cbind(y_true[[1]][, 2:3] / img_width,
                    1 -  y_true[[1]][, 4:5] / img_height)
  rect(xleft = imgBoxes[, 1], xright = imgBoxes[, 2],
       ybottom = imgBoxes[, 3], ytop = imgBoxes[, 4], border = "green")
}

if (dim(y_pred_decoded[[1]])[1] > 0) {
  predBoxes <- cbind(y_pred_decoded[[1]][, 3:4, drop = FALSE] / img_width,
                     1 - y_pred_decoded[[1]][, 5:6, drop = FALSE] / img_height)
  rect(xleft = predBoxes[, 1], xright = predBoxes[, 2],
       ybottom = predBoxes[, 3], ytop = predBoxes[, 4], border = "red")
  captions <- str_c(classes[y_pred_decoded[[1]][, 1]], " ",
                    format(round(y_pred_decoded[[1]][, 2], 2), nsmall = 2))
  text(x = predBoxes[, 1], y = predBoxes[, 3], labels = captions,
       adj = c(-0.1, -0.3), col = "red", cex = 0.8)
}

Well hello there, Naomi Watts!7 Let’s see the performance on a face not part of the LFW dataset:

img_address <- "ayelet.jpg"
image_matrices <- list(
  image_array_resize(image_load(img_address), img_height, img_width)
)
X = array (
  data = do.call(rbind, map(image_matrices, as.vector)),
  dim = c(length(image_matrices), dim(image_matrices[[1]]))
)
y_pred <- model$predict(X)

y_pred_decoded = decode_y2(y_pred,
                           confidence_thresh = 0.4,
                           iou_threshold = 0.4,
                           top_k = 'all',
                           input_coords = 'centroids',
                           normalize_coords = FALSE,
                           img_height = NULL,
                           img_width = NULL,
                           n_classes = n_classes,
                           oneOfEach = TRUE)

img <- imager::load.image(img_address) %>%
  imager::resize(img_height, img_width)
plot.new()
rasterImage(img, 0, 0, 1, 1)

if (dim(y_pred_decoded[[1]])[1] > 0) {
  predBoxes <- cbind(y_pred_decoded[[1]][, 3:4, drop = FALSE] / img_width,
                     1 - y_pred_decoded[[1]][, 5:6, drop = FALSE] / img_height)
  rect(xleft = predBoxes[, 1], xright = predBoxes[, 2],
       ybottom = predBoxes[, 3], ytop = predBoxes[, 4], border = "red")
  captions <- str_c(classes[y_pred_decoded[[1]][, 1]], " ",
                    format(round(y_pred_decoded[[1]][, 2], 2), nsmall = 2))
  text(x = predBoxes[, 1], y = predBoxes[, 3], labels = captions,
       adj = c(-0.1, -0.3), col = "red", cex = 0.8)
}

Could be better.

Masks On

In order to put a Groucho Glasses mask on, I downloaded the Groucho glasses and nose and polished them a bit:

I’m going to scale the glasses to the eyes box size, and the nose to the nose box size, with the help of the wonderful magick package:

groucho_glasses <- magick::image_read("groucho_glasses.png")
groucho_nose <- magick::image_read("groucho_nose.png")

# magick geometry helpers: a box comes in as c(xmin, xmax, ymin, ymax)
getGeometry <- function(organ) str_c(organ[2] - organ[1], "x",    # "WxH+x+y"
                                     organ[4] - organ[3], "+",
                                     organ[1], "+", organ[3])
getWidthHeight <- function(organ) str_c(organ[2] - organ[1], "x", # "WxH!" forces an exact resize
                                        organ[4] - organ[3], "!")
getOffset <- function(organ) str_c("+", organ[1], "+", organ[3])  # "+x+y" top-left paste offset

eyes <- as.integer(y_pred_decoded[[1]][y_pred_decoded[[1]][, 1] == 1, 3:6])
nose <- as.integer(y_pred_decoded[[1]][y_pred_decoded[[1]][, 1] == 2, 3:6])

img <- magick::image_read(img_address) %>%
  magick::image_scale("250x250!")

if (length(eyes) > 0) {
  geometryEyes <- getGeometry(eyes)
  img <- magick::image_composite(img,
                                 magick::image_scale(groucho_glasses,
                                                     getWidthHeight(eyes)),
                                 offset = getOffset(eyes))
}
if (length(nose) > 0) {
  geometryNose <- getGeometry(nose)
  
  img <- magick::image_composite(
    img, magick::image_scale(groucho_nose,
                             getWidthHeight(nose)),
    offset = getOffset(nose))
}

rasterImage(img, 0, 0, 1, 1)
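To make those geometry strings concrete: for an “eyes” box like the one tagged for 00001.jpg (xmin 89, xmax 164, ymin 95, ymax 126), the helpers produce:

getWidthHeight(c(89, 164, 95, 126))  # "75x31!" - force-resize the glasses to 75x31 pixels
getOffset(c(89, 164, 95, 126))       # "+89+95" - paste them with their top-left corner at (89, 95)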

Ha! Let’s do another one, this time an awesome Black woman:

img_address <- "serena.jpg"
image_matrices <- list(
  image_array_resize(image_load(img_address), img_height, img_width)
)
X = array (
  data = do.call(rbind, map(image_matrices, as.vector)),
  dim = c(length(image_matrices), dim(image_matrices[[1]]))
)
y_pred <- model$predict(X)

y_pred_decoded = decode_y2(y_pred,
                           confidence_thresh = 0.4,
                           iou_threshold = 0.4,
                           top_k = 'all',
                           input_coords = 'centroids',
                           normalize_coords = FALSE,
                           img_height = NULL,
                           img_width = NULL,
                           n_classes = n_classes,
                           oneOfEach = TRUE)

eyes <- as.integer(y_pred_decoded[[1]][y_pred_decoded[[1]][, 1] == 1, 3:6])
nose <- as.integer(y_pred_decoded[[1]][y_pred_decoded[[1]][, 1] == 2, 3:6])

img <- magick::image_read(img_address) %>%
  magick::image_scale("250x250!")

if (length(eyes) > 0) {
  geometryEyes <- getGeometry(eyes)
  img <- magick::image_composite(img,
                                 magick::image_scale(groucho_glasses,
                                                     getWidthHeight(eyes)),
                                 offset = getOffset(eyes))
}
if (length(nose) > 0) {
  geometryNose <- getGeometry(nose)
  
  img <- magick::image_composite(
    img, magick::image_scale(groucho_nose,
                             getWidthHeight(nose)),
    offset = getOffset(nose))
}

rasterImage(img, 0, 0, 1, 1)

Many a Time a Man’s Mouth Broke His Nose

Next I recorded a short video of myself on my cellphone and converted it into individual frames, at a rate of 24 frames per second, with ffmpeg. Notice I’m doing this on my Windows system via the system command:

file <- "me.mp4"
framesPerSecond <- 24
imageName <- "me_"
ffCommand <- paste0("ffmpeg -i \"", gsub("/", "\\\\", file),
                    "\" -s 250x250 -t 1200 -r ",
                    framesPerSecond, " D:\\me_images\\",
                    imageName, "_%04d.jpg")
system(ffCommand)

Had the model predict on each of these images:

read_image_matrix <- function(filename) {
  image_to_array(image_load(filename))
}

vid_images_addresses <- list.files("D:/me_images", full.names = TRUE)

image_matrices <- map(vid_images_addresses, read_image_matrix)

X = array (
  data = do.call(rbind, map(image_matrices, as.vector)),
  dim = c(length(image_matrices), dim(image_matrices[[1]]))
)

y_pred <- model$predict(X)

y_pred_decoded = decode_y2(y_pred,
                           confidence_thresh = 0.6,
                           iou_threshold = 0.2,
                           top_k = 'all',
                           input_coords = 'centroids',
                           normalize_coords = FALSE,
                           img_height = NULL,
                           img_width = NULL,
                           n_classes = n_classes,
                           oneOfEach = TRUE)

I drew the mask wherever I could and converted the images back into a video/gif with the animation package:

animation::ani.options(interval = 1 / framesPerSecond)
animation::saveGIF({
  for (i in seq_len(length(vid_images_addresses))) {
    y_pred_decodedSingle <- y_pred_decoded[[i]]
    
    eyes <- as.integer(y_pred_decodedSingle[y_pred_decodedSingle[, 1] == 1, 3:6])
    nose <- as.integer(y_pred_decodedSingle[y_pred_decodedSingle[, 1] == 2, 3:6])
    
    img <- magick::image_read(vid_images_addresses[i]) %>%
      magick::image_scale("250x250!")
    
    if (length(eyes) > 0) {
      geometryEyes <- getGeometry(eyes)
      img <- magick::image_composite(
        img,
        magick::image_scale(groucho_glasses, getWidthHeight(eyes)),
        offset = getOffset(eyes))
    }
    if (length(nose) > 0) {
      geometryNose <- getGeometry(nose)
      
      img <- magick::image_composite(
        img,
        magick::image_scale(groucho_nose, getWidthHeight(nose)),
        offset = getOffset(nose))
    }
    plot.new()
    rasterImage(img, 0, 0, 1, 1)
  }
}, movie.name = "me_vid_mask.gif")

Word.

Magic Magick

One thing before you go: you should really check out the magick package. I think it and the imager package are the closest things R has to offer to Python’s amazing OpenCV library.

At first I tried to skip the stage of random image manipulations during training, because Pierluigi’s code relies heavily on OpenCV at this stage. But magick makes things really easy. Here’s how we make an image brighter (or darker) by some factor. Assume brightness is c(0.5, 2, 0.5), which means each image in the batch is modulated to 50-200% of its original brightness, with 50% probability. We read batch_X[i, , ,], where i is between 1 and batch_size, and modulate it by brightValue:

if (runif(1) < brightness[3]) {
  brightValue <- as.integer(
    runif(1, min = brightness[1], max = brightness[2]) * 100
    )
  tempX <- image_read(batch_X[i, , ,] / 255) %>%
    image_modulate(brightValue) %>%
    .[[1]] %>%
    as.numeric()
  batch_X[i, , ,] <- tempX[, , 1:3] * 255
}
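The horizontal flip is just as easy with magick’s image_flop(). A sketch of how it could look, assuming the current batch’s boxes sit in a batch_y[[i]] matrix with xmin/xmax columns (the generator’s actual internals differ a bit):

if (runif(1) < flip) {
  tempX <- image_read(batch_X[i, , ,] / 255) %>%
    image_flop() %>%   # mirror the image horizontally
    .[[1]] %>%
    as.numeric()
  batch_X[i, , ,] <- tempX[, , 1:3] * 255
  # mirror the x coordinates of the boxes as well
  old_xmin <- batch_y[[i]][, "xmin"]
  batch_y[[i]][, "xmin"] <- img_width - batch_y[[i]][, "xmax"]
  batch_y[[i]][, "xmax"] <- img_width - old_xmin
}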

I didn’t implement all the manipulations, only the ones you see passed to the generator above. But this was an essential stage, decreasing the loss by a lot.8

Enough!

What can we do to improve the end result?

  • Add layers (remember this is a relaxed version of SSD, the real one starts with loading the VGG16 network weights and has more layers)
  • Add boxes
  • Add manipulations
  • Add images (e.g. get all your eyes, noses and mouths, crop them and paste them on non-face backgrounds)
  • Maybe the LFW faces are not such a good fit for the Snapchat-like face mask task…

Have fun! And remember - the package is heavily under construction.


  1. For God’s sake, the blog is called Sex, Drugs and Data

  2. I don’t, of course, claim this is how they do things at Snapchat, in fact I’m pretty sure it isn’t.

  3. Yes, I know it’s in Russian, look at their slides though

  4. Actually this is a great idea for an upcoming project

  5. I used reticulate heavily throughout this code.

  6. See source to find out where this 12 magic number comes from.

  7. Yes, I know, the dataset is very 2005, a lot of George Bush

  8. Just before releasing this post I see magick 1.6 was released with some geometry helpers - will have to check those out.