This post serves a few purposes:

• Exercising Deep Learning with Regression, as opposed to Classification, which is my comfort zone.
• A tribute to my previous workplace: ebay.
• Trying to combine text AND image data in a “Bi-Modal” model, which I’ve never done before.
• Doing something which involves Fashion. As opposed to Boston Housing Prices, God.

The idea is very simple: I’m going to scrape ebay for as many items from the category “Men’s Wristwatches” as I can. I will download their (seller-given) titles, their images and their prices. And then I will see if I can predict a watch’s price using:

1. Its title
2. Its image
3. Both title and image

I will do it with both Gradient Boosting Trees (via xgboost) and Deep Neural Networks (via RStudio’s keras).

Attention 1: This is not a Deep Learning tutorial, I’m assuming you’re “in the know”. For more detailed posts dealing with Deep Learning consider this and this.

Attention 2: As always if you’re not interested in how I scraped thousands of ebay items, skip the first part.

Let’s go!

# Scraping the Bay

Amazingly, scraping ebay is quite easy. With rvest, that is. The site is well structured and does not, as of yet, apply any anti-scraping policies (try doing this with Amazon and tell me what happens!). The only limitation is that out of the over 220K Men’s Wristwatches items, we can only get ~10K, with ebay’s Search algorithm deciding for us which are the “Best Match”.

library(tidyverse)
library(rvest)
library(stringr)

getURL <- function(i)
  str_c("https://www.ebay.com/sch/Wristwatches/31387/i.html?Gender=Men%2527s&LH_BIN=1&_dcat=31387&LH_ItemCondition=1000&LH_PrefLoc=3&_ipg=200&_pgn=", i)

nPages <- 50

urls <- map_chr(1:nPages, getURL)

urls now holds the URLs of the first 50 result pages of ebay’s Wristwatches category, where we ask for:

• Gender = Men
• Fixed price only, a.k.a. “Buy It Now” (don’t want Auctions)
• New items only (I figure used watches would introduce a lot of noise into this experiment)
• US only (I figure without this we’d get a ton of expensive-looking watches from China, again noisy)
• 200 results per page

We ask for 50 pages because 50 pages X 200 results = 10K items, though as we’ll see later, there are many duplicates.

Now we define an empty tibble called watches which will hold the resulting table, and a few functions to help us1:

watches <- tibble(
  imgurl = character(0),
  ilsPrice = numeric(0),
  title = character(0),
  id = numeric(0),
  price = numeric(0),
  fileName = character(0),
  exclude = numeric(0)
)

getPrice <- function(priceText) {
  prcRange <- str_detect(priceText, "[0-9,]+\\.[0-9]+ to ILS [0-9,]+\\.[0-9]+")
  prices <- str_extract_all(priceText, "[0-9,]+\\.[0-9]+")[[1]]
  prices <- as.numeric(str_replace(prices, ",", ""))
  if (prcRange) {
    price <- mean(prices[1:2])
  } else {
    price <- prices[1]
  }
  return(price)
}
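If you want to sanity-check the parsing logic without running the scraper, here is a base-R equivalent of getPrice on a couple of made-up price strings; the range case averages the two endpoints:

```r
# base-R equivalent of getPrice, for a quick offline sanity check
# (the price strings below are made up):
parsePrice <- function(priceText) {
  prices <- regmatches(priceText, gregexpr("[0-9,]+\\.[0-9]+", priceText))[[1]]
  prices <- as.numeric(gsub(",", "", prices))
  if (grepl(" to ILS ", priceText)) mean(prices[1:2]) else prices[1]
}

parsePrice("ILS 159.00")                    # 159
parsePrice("ILS 1,234.56 to ILS 2,345.67")  # (1234.56 + 2345.67) / 2 = 1790.115
```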

ils2usd <- function(ilsPrice, rate = 0.28) ilsPrice * rate

getFileName <- function(id) str_c("D:/ebay_wristwatches/", id, ".jpg")

downloadImage <- function(imgurl, fileName) {
  tryCatch(
    download.file(imgurl, fileName, mode = "wb", quiet = TRUE),
    error = function(e) return(1)
  )
}

getItems <- function(i, pb = NULL) {

  html <- read_html(urls[i])

  imgurlA <- html %>%
    html_nodes(".lvpic") %>%
    html_nodes("img")

  imgurlB <- imgurlA %>%
    html_attr("src") %>%
    str_replace(., ".+gif$", "NA")

  imgurlC <- imgurlA %>% html_attr("imgurl")

  imgurl <- cbind(imgurlB, imgurlC) %>%
    as_tibble() %>%
    transmute(imgurl = ifelse(is.na(imgurlC), imgurlB, imgurlC)) %>%
    unlist() %>%
    unname()

  title <- imgurlA %>% html_attr("alt")

  ilsPrice <- html %>%
    html_nodes(".lvprice") %>%
    html_text() %>%
    map_dbl(getPrice)

  idToUse <- if (is_empty(watches$id)) {
    1:length(imgurl)
  } else {
    (tail(watches$id, 1) + 1):(length(imgurl) + tail(watches$id, 1))
  }

  watchesDF <- tibble(imgurl = imgurl, ilsPrice = ilsPrice, title = title) %>%
    mutate(id = idToUse,
           price = map_dbl(ilsPrice, ils2usd),
           fileName = map_chr(id, getFileName),
           exclude = map2_dbl(imgurl, fileName, downloadImage))

  watches <<- rbind.data.frame(watches, watchesDF)

  if (!is.null(pb)) pb$tick()$print()
}
• getPrice: a function to parse the price of an item from the html. Notice a price can come as a range for different sizes/colors (though this is less common in watches), in which case we take the mean
• ils2usd: as I’m doing this from Israel I get prices in ILS (Israeli Shekels), and this function converts Shekels to USD at today’s rate
• getFileName: a function to create the address on disk for the item’s image
• downloadImage: a tryCatch wrapper around download.file for downloading an image. If a failure was encountered the function would return 1 and this is how we’ll know to exclude this specific image later
• getItems: the main function which uses rvest to parse each URL’s html, get the title and price data, and download the images to disk

This is how we use getItems:

pb <- progress_estimated(nPages)

walk(1:nPages, getItems, pb)

And that goes pretty quickly. For me it ran without problems until page 50, where it failed, so I expect to see 49 x 200 = 9,800 items in watches:

glimpse(watches)
## Observations: 9,800
## Variables: 7
## $ imgurl   <chr> "https://i.ebayimg.com/thumbs/images/g/LA4AAOSwhOVXdn...
## $ ilsPrice <dbl> 159.000, 547.770, 583.110, 636.120, 671.460, 270.830,...
## $ title    <chr> "Invicta 8932OB Men's Pro Diver Black Dial SS Bracele...
## $ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ price    <dbl> 44.5200, 153.3756, 163.2708, 178.1136, 188.0088, 75.8...
## $ fileName <chr> "D:/ebay_wristwatches/1.jpg", "D:/ebay_wristwatches/2...
## $ exclude  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

(P.S. how cool is glimpse?! How come no one ever tells you about these helpers?)

So indeed we have a 9,800-item tibble, and these are the variables:

• imgurl: the full URL of the item’s image
• ilsPrice: the price in ILS
• title: the item’s title, e.g. “Invicta 8932OB Men’s Pro Diver Black Dial SS Bracelet Dive Watch”
• id: a serial number for the item
• price: the price in USD
• fileName: the item’s image location on disk
• exclude: a 0/1 variable; if 1, the image was not successfully downloaded and we need to get rid of this item

OK, three problems before we go on:

• There are a few items (23, actually) for which we don’t have images
• There are many duplicates, about 2K if we count title and price as the variables we want distinct. This can occur for two reasons: (a) we’re calling “page 2” a few minutes after we called “page 1”, so the ebay Search algorithm might have moved a specific item from page 1 to page 2; (b) there are actual duplicate items on site
• Later on, some helper methods from keras really dislike 1-word titles; there are few of those, but still

We need to exclude all of these:

watches <- watches %>%
  filter(exclude == 0) %>%
  distinct(title, price, .keep_all = TRUE) %>%
  mutate(titleLength = map_dbl(title, function(x) length(str_split(x, " ")[[1]]))) %>%
  filter(titleLength >= 2)

dim(watches)
## [1] 7803    8

We’re left with 7.8K items. Nice!

# Watching Watches

Some exploratory analysis would be nice, to see everything is cool between us and the watches. We have 7.8K watches; let’s see the distribution of price:

watches %>%
  ggplot(aes(price)) +
  geom_histogram(fill = "red", alpha = 0.5) +
  ggtitle("Distribution of ebay's Men's Wristwatches Price (USD)")

This distribution is very extreme:

summary(watches$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##     3.95   215.84   330.87   525.21   537.70 61650.69

Most watches cost around $200-600, but there is a watch which costs over $60K! Which one is it?

watches %>%
top_n(1, price) %>%
select(title)
## # A tibble: 1 x 1
##   title
##   <chr>
## 1 Patek Philippe NEW Nautilus Travel Time Chronograph Steel Watch Box/Pape~

Looks like any $500 watch to me. But I’m getting ahead of myself. Anyway, this distribution is very extreme, very skewed, and I recommend going with the logarithm of the price, which is much more “normal” and probably less affected later on by extreme values:

watches %>%
  ggplot(aes(log(price))) +
  geom_histogram(fill = "red", alpha = 0.5) +
  ggtitle("Distribution of ebay's Men's Wristwatches log-Price (USD)")

Another thing which might be interesting to see at this stage is what’s going on in the watches’ titles, e.g. the most common words. The tidytext package makes this really easy:

library(tidytext)

watches %>%
  select(title) %>%
  unnest_tokens(word, title) %>%
  count(word, sort = TRUE) %>%
  slice(1:10)
## # A tibble: 10 x 2
##    word          n
##    <chr>     <int>
##  1 watch      6312
##  2 men's      5678
##  3 quartz     2414
##  4 steel      2392
##  5 black      2268
##  6 stainless  2199
##  7 mens       1531
##  8 analog     1510
##  9 dial       1492
## 10 display    1162

Makes sense.

The last thing we’ll do in this section is get a reasonable baseline for our predictions. Why do we need one? This is Regression. There is no concept of “accuracy” like in classification, where we know that “100%” is good, and with two balanced classes “60%” is better than chance. In Regression we usually use the Mean Squared Error, or its square root, the RMSE:

$$RMSE=\sqrt{\frac{1}{n}\sum_{t=1}^{n}(\hat{y_t} - y_t)^2}$$

You can see that the MSE takes the mean of squared deviations between the real observations ($$y_t$$) and the predicted ones ($$\hat{y_t}$$), and the RMSE takes the root of that to “go back” to the original scale of $$y_t$$. In our case the original scale is log(USD), so I’m not keen on doing that. The important thing is that we want the RMSE to be as low as possible.

If you don’t like RMSE, you can use the Pearson correlation, or define some accuracy-equivalent metric, e.g. “the percentage of observations for which we managed to predict the USD price within a plus/minus $20 range”, or better yet “the percentage of observations for which we managed to predict the USD price within a plus/minus 30% range”. Because I’m not OK with predicting a 30 dollar watch as 10 (minus 20 dollars), but I am OK with predicting it as 21 (minus 30 percent).

What’s our current best guess? The mean!

rmse <- function(y_true, y_pred) {
  sqrt(mean((y_pred - y_true)^2))
}

pctWithinUSDPercentMargin <- function(y_true, y_pred, pct.margin = 30) {
  sum(abs(exp(y_pred)/exp(y_true) - 1) < pct.margin/100)/length(y_true)
}

y_pred <- rep(mean(log(watches$price)), nrow(watches))
y_true <- log(watches$price)
rmse(y_true, y_pred)
## [1] 0.8067097
pctWithinUSDPercentMargin(y_true, y_pred)
## [1] 0.3212867

This means that by predicting the mean for all observations, the baseline RMSE is 0.806 (whatever that means in log(USD)), and for ~32% of observations the mean is already within the +/- 30% range on the USD scale. Good to know.
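By the way, predicting the mean for every observation makes the RMSE exactly the population standard deviation of the target, so this baseline is a property of the data alone. A quick simulated check (with fake log-prices):

```r
rmse <- function(y_true, y_pred) sqrt(mean((y_pred - y_true)^2))

set.seed(42)
y <- rnorm(1000, mean = 5.8, sd = 0.8)        # fake log-prices
baseline <- rmse(y, rep(mean(y), length(y)))  # predict the mean everywhere
pop_sd <- sqrt(mean((y - mean(y))^2))         # population SD of y

all.equal(baseline, pop_sd)  # TRUE
```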

# Title Only

Let’s first try xgboost, with a simple bag-of-words model, no weighting2. We can again use tidytext’s helpers here to get a Document-Term matrix, stored sparse since most values would be zero:

dtm <- watches %>%
  select(id, title) %>%
  unnest_tokens(word, title) %>%
  group_by(id, word) %>%
  dplyr::summarise(n = n()) %>%
  cast_sparse(id, word, n)

dim(dtm)
## [1] 7803 9188

We get our 7.8K items X 9,188 unique terms.

Next we’ll split the data into training, validation and test sets. This is the split we’ll use from now on, and we’re not going to use the test set until the very end.

trainN <- 5000
validN <- 2000
set.seed(42)
trainSamples <- sort(sample(nrow(dtm), trainN))
validSamples <- sort(sample((1:nrow(dtm))[-trainSamples], validN))
testSamples <- sort((1:nrow(dtm))[-c(trainSamples, validSamples)])

dtmTrain <- dtm[trainSamples, ]
dtmValid <- dtm[validSamples, ]
dtmTest <- dtm[testSamples, ]

dim(dtmTrain)
dim(dtmValid)
dim(dtmTest)
## [1] 5000 9188
## [1] 2000 9188
## [1]  803 9188

Let’s prepare the data for xgboost.

library(xgboost)

xgbTrain <- xgb.DMatrix(dtmTrain, label = log(watches$price[trainSamples]))
xgbValid <- xgb.DMatrix(dtmValid, label = log(watches$price[validSamples]))

xgbParams <- list(objective = "reg:linear",
                  eval_metric = "rmse",
                  max_depth = 5,
                  eta = 1,
                  gamma = 0,
                  subsample = 1,
                  min_child_weight = 1,
                  alpha = 0,
                  lambda = 1.1)

watchlist <- list(train = xgbTrain, test = xgbValid)

We’re telling xgboost that we’re in a Regression framework (objective), and that the metric to minimize is RMSE (eval_metric). We use shallow trees of max_depth 5, and slightly increase the L2 “Ridge” penalty (lambda) to 1.1 (1 is the default). All of these parameters should of course be selected using Cross Validation, not set by hand as I did here.

Let’s tell xgboost to build a maximum of 5,000 trees, stop only if it doesn’t see any improvement in the validation set for 100 trees, to avoid overfitting:

xgbFit <- xgb.train(data = xgbTrain,
                    params = xgbParams,
                    nrounds = 5000,
                    verbose = TRUE,
                    print_every_n = 100,
                    early_stopping_rounds = 100,
                    watchlist = watchlist)
## [1]  train-rmse:0.728552 test-rmse:0.747846
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 100 rounds.
##
## [101]    train-rmse:0.308889 test-rmse:0.583139
## [201]    train-rmse:0.237314 test-rmse:0.582568
## [301]    train-rmse:0.195867 test-rmse:0.581100
## Stopping. Best iteration:
## [286]    train-rmse:0.201348 test-rmse:0.580927

After a single tree xgboost decreased RMSE to 0.74 (we started at 0.80 remember), and decided to stop at 286 trees, when it got to RMSE 0.58.

Let’s see our other metric, of “percent predicted within +/- 30% of real price”:

y_pred <- predict(xgbFit, xgbValid)
y_true <- log(watches$price[validSamples])
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
## [1] 0.5809266
## [1] 0.5315

This means that with xgboost on text-only data, 53% of watches in our validation set get a prediction within +/- 30% of the original USD price.

Let’s see the predicted values against the real observations, to check there is nothing “degenerate” here:

tibble(y_true = y_true, y_pred = y_pred) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Title-Only XGBOOST log-Price Prediction: Predicted vs. Actual")

That’s very good, actually. Correlation:

cor(y_true, y_pred)
## [1] 0.7162165

Can “Deep Learning” beat that? We’ll use a simple LSTM model, not unlike the famous imdb example, only we don’t use our LSTM output as the basis for Sentiment classification, but for Regression.

But first, some preparation. keras doesn’t know words, it knows numbers. We need our titles as lists of numbers, encoded from words. Luckily keras and reticulate have some tools to do that easily:

library(keras)

tnizer <- text_tokenizer()
tnizer$fit_on_texts(watches$title)
watches_dl <- texts_to_sequences(tnizer, watches$title)
watches_dl <- reticulate::py_to_r(watches_dl) # this step was needed in a previous
# version of keras, it may very well be unnecessary here

OK, what’s watches_dl?

length(watches_dl)
## [1] 7803

watches_dl is a representation of watches titles. It is a list of 7,803 lists, each holding a bunch of integer codes replacing the original words. The first element which has 11 numbers corresponds to the first title which has 11 words:

watches$title[1]
## [1] "Invicta 8932OB Men's Pro Diver Black Dial SS Bracelet Dive Watch"

watches_dl[1]
## [[1]]
## [1]   14 7306    2   41   31    4    9  101   34  149    1

Getting our train/validation/test matrices:

maxlen <- 17

x_train_title <- watches_dl[trainSamples]
y_train_title <- log(watches$price[trainSamples])
x_valid_title <- watches_dl[validSamples]
y_valid_title <- log(watches$price[validSamples])
x_test_title <- watches_dl[testSamples]
y_test_title <- log(watches$price[testSamples])

x_train_title <- pad_sequences(x_train_title, maxlen = maxlen)
x_valid_title <- pad_sequences(x_valid_title, maxlen = maxlen)
x_test_title <- pad_sequences(x_test_title, maxlen = maxlen)

dim(x_train_title)
## [1] 5000   17
dim(x_valid_title)
## [1] 2000   17
dim(x_test_title)
## [1] 803  17

The last thing we did there was “pad” the title matrices with zeros wherever a title is shorter than 17 words, maxlen being the length of the longest title.
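If you’re curious what pad_sequences is doing, here is a minimal base-R sketch of its default behavior (“pre” padding with zeros, and “pre” truncation for over-long sequences):

```r
# base-R sketch of keras' pad_sequences defaults:
pad_pre <- function(x, maxlen) {
  x <- tail(x, maxlen)               # "pre" truncation: keep the last maxlen codes
  c(rep(0L, maxlen - length(x)), x)  # "pre" padding: zeros go in front
}

pad_pre(c(14L, 7306L, 2L), 5)  # 0 0 14 7306 2
```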

Now we’re ready for keras:

max_features <- 10000

model_title <- keras_model_sequential()
model_title %>%
  layer_embedding(input_dim = max_features, output_dim = 256) %>%
  layer_lstm(units = 256) %>%
  layer_dense(units = 1, activation = "linear")

model_title %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

Notice we’re asking the LSTM to give us an output of length 256, then we add a dense layer with a “linear” activation function, because this is Regression. Finally we compile the model with MSE loss.

Let’s do 20 epochs with batch size 32:

history <- model_title %>% fit(
  x_train_title, y_train_title,
  batch_size = 32,
  epochs = 20,
  validation_data = list(x_valid_title, y_valid_title),
  verbose = 0
)

plot(history)

Notice we didn’t get any additional decrease in MSE after epoch 5. Let’s see our metrics:

y_pred <- model_title$predict(x_valid_title)
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
## [1] 0.5592608
## [1] 0.522
##           [,1]
## [1,] 0.7269884

It seems keras beat xgboost, reaching an RMSE of 0.56 and a correlation of 0.72. But not according to our other metric: only 52% of watches in our validation set get a prediction within +/- 30% of the original price.

tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Title-Only Keras log-Price Prediction: Predicted vs. Actual")

# Image Only

Before we do Image-Only models - do you think these models will be successful? Test yourself: can you arrange the following 5 random men’s wristwatches from least to most expensive? (Answer below3, don’t peek)

I’d be surprised if you got it right, and even then, would you know how to estimate the actual cost?

Let’s load the images, using the imager package here:

library(imager)
library(abind)

image_read_matrix <- function(imgFile) {
  # already normalizes to the 0-1 range!
  imgFile %>%
    load.image() %>%
    resize(100, 100) %>%
    as.array() %>%
    adrop(3)
}

loadImagesAsXMatrix <- function(imagesAddress) {
  imageMatrices <- tibble(address = list.files(imagesAddress, full.names = TRUE)) %>%
    filter(address %in% watches$fileName) %>%
    mutate(order = map_dbl(address,
                           function(x)
                             as.numeric(
                               str_split(
                                 str_split(x, "/")[[1]][3], "\\."
                               )[[1]][1]))) %>%
    arrange(order) %>%
    select(address) %>%
    unlist() %>%
    map(image_read_matrix)

  array(
    data = do.call(rbind, map(imageMatrices, as.vector)),
    dim = c(length(imageMatrices), dim(imageMatrices[[1]]))
  )
}

X <- loadImagesAsXMatrix("D:/ebay_wristwatches")

dim(X)
## [1] 7803  100  100    3

X is a 4D array: 7,803 images X 100 pixels width X 100 pixels height X 3 RGB channels.

Let’s get the train/validation/test matrices:

x_train_image <- X[trainSamples, , , ]
y_train_image <- y_train_title
x_valid_image <- X[validSamples, , , ]
y_valid_image <- y_valid_title
x_test_image <- X[testSamples, , , ]
y_test_image <- y_test_title

rm(X)

Can xgboost handle these images? Sure it can, each image can be flattened to have 100 X 100 X 3 = 30K features, xgboost is OK with that:

x_train_image_xgb <- array(x_train_image,
                           dim = c(dim(x_train_image)[1],
                                   dim(x_train_image)[2] *
                                     dim(x_train_image)[3] *
                                     dim(x_train_image)[4]))
x_valid_image_xgb <- array(x_valid_image,
                           dim = c(dim(x_valid_image)[1],
                                   dim(x_valid_image)[2] *
                                     dim(x_valid_image)[3] *
                                     dim(x_valid_image)[4]))
x_test_image_xgb <- array(x_test_image,
                          dim = c(dim(x_test_image)[1],
                                  dim(x_test_image)[2] *
                                    dim(x_test_image)[3] *
                                    dim(x_test_image)[4]))
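A quick sanity check, on a tiny made-up array, that this reshape really keeps each image on its own row: since R stores arrays column-major and the observation index comes first, row i of the flattened matrix is exactly as.vector() of image i:

```r
# two fake "images" of 3 x 3 pixels x 2 channels:
X_toy <- array(seq_len(2 * 3 * 3 * 2), dim = c(2, 3, 3, 2))
flat  <- array(X_toy, dim = c(2, 3 * 3 * 2))

identical(flat[1, ], as.vector(X_toy[1, , , ]))  # TRUE
```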

xgbTrain <- xgb.DMatrix(x_train_image_xgb, label = y_train_image)
xgbValid <- xgb.DMatrix(x_valid_image_xgb, label = y_valid_image)

watchlist <- list(train = xgbTrain, test = xgbValid)

xgbFit <- xgb.train(data = xgbTrain,
params = xgbParams,
nrounds = 5000,
verbose = TRUE,
print_every_n = 100,
early_stopping_rounds = 100,
watchlist = watchlist)
## [1]  train-rmse:0.770060 test-rmse:0.811320
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 100 rounds.
##
## [101]    train-rmse:0.065538 test-rmse:0.979093
## Stopping. Best iteration:
## [1]  train-rmse:0.770060 test-rmse:0.811320

What just happened here? xgboost reached a validation RMSE of 0.81 (slightly worse than our baseline 0.806) after a single tree, and could not improve on this! It stayed with a single tree with RMSE no better than the original. Other metrics:

y_pred <- predict(xgbFit, xgbValid)
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
## [1] 0.8113195
## [1] 0.3165
## [1] 0.1374303

Let’s see the predicted values against real observations, to see the pattern:

tibble(y_true = y_true, y_pred = y_pred) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image-Only XGBOOST log-Price Prediction: Predicted vs. Actual")

Unsurprisingly, it seems like the best xgboost could do with image-only data was to predict the global mean for most observations.

Will Deep Learning do better? Let’s use a mean convolutional network:

model_image <- keras_model_sequential()

model_image %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu', padding = 'same',
                input_shape = c(100, 100, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(512) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(units = 1, activation = "linear")

model_image %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

history <- model_image %>% fit(
  x_train_image, y_train_image,
  batch_size = 32,
  epochs = 20,
  validation_data = list(x_valid_image, y_valid_image),
  verbose = 0
)

plot(history)

Again we get no improvement after just a few epochs. This time we get nothing in general :(

y_pred <- model_image$predict(x_valid_image)
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
## [1] 0.8500388
## [1] 0.3615
##           [,1]
## [1,] 0.1289341

We got an RMSE worse than our baseline, and our predicted vs. actual scatterplot shows how it happened:

tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image-Only Keras log-Price Prediction: Predicted vs. Actual")

So is there nothing in an image to help with the prediction of price?

# Title and Image

How would we combine title and image data to predict price?

• xgboost: This is a simple matter of concatenating the features. The title got us 9.1K features (words), the image 30K features (pixels), and we can just paste these together: cbind(as.matrix(dtmTrain), x_train_image_xgb)
• keras: This is more interesting. We’re going to use a “Bi-Modal” architecture, where we train an LSTM for the titles exactly like we did, and a convolutional network for the images exactly like we did. Then what do we get? It’s just numbers! Vectors. We can concatenate these two vectors into a single vector, then add dense layers to it, to get our single output. See this diagram if it makes more sense:

The following uses the keras “functional” API, and is based heavily on this example. The trick is to use layer_concatenate to paste together two tensors:

image_model <- keras_model_sequential()

image_model %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu', padding = 'same',
                input_shape = c(100, 100, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten()

image_input <- layer_input(shape = c(100, 100, 3), name = "image_input")

encoded_image <- image_input %>% image_model

title_input <- layer_input(shape = maxlen, dtype = 'int32', name = "title_input")

encoded_title <- title_input %>%
  layer_embedding(input_dim = max_features, output_dim = 256, input_length = maxlen) %>%
  layer_lstm(units = 256)

output <- layer_concatenate(c(encoded_title, encoded_image)) %>%
  layer_dense(units = 1, activation = 'linear')

image_title_model <- keras_model(inputs = c(image_input, title_input), outputs = output)

image_title_model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

history <- image_title_model %>% fit(
  x = list(image_input = x_train_image, title_input = x_train_title),
  y = y_train_image,
  batch_size = 32,
  epochs = 20,
  validation_data = list(
    list(image_input = x_valid_image, title_input = x_valid_title),
    y_valid_image),
  verbose = 0
)

plot(history)

You can’t see this from the graph, but we actually got the best result yet:

y_pred <- image_title_model$predict(list(image_input = x_valid_image, title_input = x_valid_title))
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
## [1] 0.5435071
## [1] 0.536
##           [,1]
## [1,] 0.7611045

The Bi-Modal Deep Learning model beat all of our previous models: 0.54 RMSE, 53.6% of items get predictions within +/- 30% of their real price, and a Pearson correlation of 0.76.

tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image + Title Keras log-Price Prediction: Predicted vs. Actual")

# It Can’t Be.

Is this counter-intuitive to you? We got nothing from the image data alone, yet when combining image data with title data we got a better result than using the title data alone. I can show you, with a very short simulation of simple linear regression, how this can be. If you’re not interested, feel free to skip.

x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2 * x1 + 0.5 * x1 * x2 + rnorm(100)

So x1 is a random vector of 100 $$N(0,1)$$ elements. x2 is another such vector. And y is a combination of both, plus some random noise.

A linear regression of y over x1 is of course “very”4 significant, resulting in a high adjusted5 R squared (0.70):

lm_x1 <- lm(y ~ x1)
summary(lm_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.1360 -0.7990 -0.1261  0.8247  3.8525
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.09313    0.11977   0.778    0.439
## x1           1.90615    0.12237  15.577   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.191 on 98 degrees of freedom
## Multiple R-squared:  0.7123, Adjusted R-squared:  0.7094
## F-statistic: 242.6 on 1 and 98 DF,  p-value: < 2.2e-16

A linear regression of y over x2 gives you nothing! Adjusted R squared -0.001 and the x2 coefficient is not significant:

lm_x2 <- lm(y ~ x2)
summary(lm_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9159 -1.6234 -0.1591  1.5529  4.8242
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.3074     0.2212   1.389    0.168
## x2            0.2085     0.2236   0.933    0.353
##
## Residual standard error: 2.21 on 98 degrees of freedom
## Multiple R-squared:  0.008799,   Adjusted R-squared:  -0.001315
## F-statistic:  0.87 on 1 and 98 DF,  p-value: 0.3533

Now look at the regression of y over x1, x2 and the interaction between them:

lm_x1x2 <- lm(y ~ x1 * x2)
summary(lm_x1x2)
##
## Call:
## lm(formula = y ~ x1 * x2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.04434 -0.77211  0.05196  0.62709  2.99492
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.03347    0.10388   0.322    0.748
## x1           2.11811    0.11178  18.949  < 2e-16 ***
## x2           0.08450    0.10402   0.812    0.419
## x1:x2        0.64271    0.10775   5.965 4.07e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 96 degrees of freedom
## Multiple R-squared:  0.7909, Adjusted R-squared:  0.7844
## F-statistic:   121 on 3 and 96 DF,  p-value: < 2.2e-16

The x1 and interaction coefficients are significant, and the adjusted R squared is higher: 0.78.

So by adding a term, x2, which on its own seems unrelated to y, we improved the entire model!

anova(lm_x1, lm_x1x2)
## Analysis of Variance Table
##
## Model 1: y ~ x1
## Model 2: y ~ x1 * x2
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     98 138.90
## 2     96 100.94  2    37.952 18.047 2.222e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# What About the Testing Set?

Remember we left aside ~800 items as a “testing” set? Why did we do that?

In two words: parameter tuning. I now know xgboost with title-only data needs 286 trees. I now know that the best result our last DL model can achieve is actually after 18 epochs, not 20. And so on and so forth. But the way I know these parameters, is by looking at the validation set. Who’s to say it isn’t biased in some way? All my parameter tuning decisions were based on wanting to optimize the RMSE of the validation set. To really be able to have a clear unbiased picture of the RMSE I got, I need a fresh set of data, absolutely never been seen before. I will unite the training and validation sets as one “big” training set, train my best model again (using e.g. GBT with 286 trees or DL with only 18 epochs) and predict on my fresh testing set.

x_train_image_all <- abind::abind(x_train_image, x_valid_image, along = 1)
x_train_title_all <- rbind(x_train_title, x_valid_title)

y_train_all <- c(y_train_image, y_valid_image)

image_model <- keras_model_sequential()
image_model %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu', padding = 'same',
                input_shape = c(100, 100, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten()

image_input <- layer_input(shape = c(100, 100, 3), name = "image_input")

encoded_image <- image_input %>% image_model

title_input <- layer_input(shape = maxlen, dtype = 'int32', name = "title_input")

encoded_title <- title_input %>%
  layer_embedding(input_dim = max_features, output_dim = 256, input_length = maxlen) %>%
  layer_lstm(units = 256)

output <- layer_concatenate(c(encoded_title, encoded_image)) %>%
  layer_dense(units = 1, activation = 'linear')

image_title_model <- keras_model(inputs = c(image_input, title_input), outputs = output)

image_title_model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

history <- image_title_model %>% fit(
  x = list(image_input = x_train_image_all, title_input = x_train_title_all),
  y = y_train_all,
  batch_size = 32,
  epochs = 18,
  validation_data = list(
    list(image_input = x_test_image, title_input = x_test_title),
    y_test_image),
  verbose = 0
)

plot(history)

y_pred <- image_title_model\$predict(list(image_input = x_test_image, title_input = x_test_title))
y_true <- y_test_image
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
## [1] 0.4728875
## [1] 0.5031133
##          [,1]
## [1,] 0.821674

We’re seeing a different test set, but we also went from 5K training items to 5K + 2K = 7K, so we would expect better RMSE and other metrics. With 7K items, we reach an RMSE of 0.47, 50% of items get predictions within +/- 30% of their real price, and a Pearson correlation of 0.82.

tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image + Title Keras log-Price Prediction: Predicted vs. Actual")

In case you’re wondering, this is what the predicted vs. actual scatterplot looks like on the original scale (log-Price exponentiated), which is not too bad:

tibble(y_true = exp(y_true), y_pred = exp(c(y_pred))) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 3000), ylim = c(0, 3000), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image + Title Keras log-Price Prediction: Predicted vs. Actual")

# Prediction, What Is It Good For?

So, ebay has the full 220K dataset, including images. What could they use this for? One possibility is recommending an incoming watch’s price. “This is what we think this should cost.” Another possibility could be verifying prices. Ebay could alert a seller if the price she specified is significantly larger or smaller than the expected price. Maybe it’s a typo!

Let’s look at the model’s biggest “mistake” in absolute value:

biggestMistakeIndex <- which.max(abs(exp(y_pred) - exp(y_true)))

exp(y_true[biggestMistakeIndex])
## [1] 7331.069
exp(y_pred[biggestMistakeIndex])
## [1] 2716.547
watches[testSamples[biggestMistakeIndex], c("title", "price")]
## # A tibble: 1 x 2
##   title                                                              price
##   <chr>                                                              <dbl>
## 1 Gevril Men's 'Washington' Swiss Automatic Stainless Steel Casual ~ 7331.

The model predicted this watch should cost almost 5,000 dollars less than it costs. I wouldn’t automatically say this means the watch’s actual cost is too much, but I will say that when you search ebay for this watch you get 7 (!) exact-match results, from the same seller, with prices ranging from ~1,700 to ~7,300 USD (our watch). This screenshot shows the first five with price in Israeli Shekels:

They could be different watches, because of subtleties not mentioned in the title, but this is a bad user experience, and it also explains why our model had such a hard time with this watch.

# Watch It!

I stepped out of my comfort zone and did something with regression. I showed how to create an amazing fashion or financial dataset out of thin air (though you should probably use ebay Search API for bigger projects). And I experimented with a Deep Learning architecture a bit more complicated than the usual examples. All in all this was a great week for Sex, Drugs and Data. Don’t ask me to share the data scraped here, though. A tad illegal.

Thanks, ebay!

1. There’s absolutely NO guarantee this would work with any ebay category, any time, the site is probably very dynamic.

2. I tried Tf-Idf here, didn’t give me an edge.

3. Answer, from least to most expensive: D (~260 USD), A, C, E, B (~1000 USD)

4. Don’t use such idioms, please.

5. Why am I looking at adjusted R squared? Because I’m going to add terms to this regression and by definition the non-adjusted R-squared should grow!