This post serves a few purposes:

- Exercising Deep Learning with Regression, as opposed to Classification, which is my comfort zone.
- A tribute to my previous workplace - ebay.
- Trying to combine text AND image data in a "Bi-Modal" model - never did it before.
- Doing something which involves Fashion. As opposed to Boston Housing Prices, God.

The idea is very simple: I'm going to scrape ebay for as many items from the "Men's Wristwatches" category as I can. I will download their (seller-given) titles, their images and their prices. And then I will see if I can predict a watch's price by using:

- Its title
- Its image
- Both title and image

I will do it with both Gradient Boosting Trees, with `xgboost`, and Deep Neural Networks, with Rstudio's `keras`.

Attention 1: This is not a Deep Learning tutorial, I’m assuming you’re “in the know”. For more detailed posts dealing with Deep Learning consider this and this.

Attention 2: As always, if you're not interested in how I scraped thousands of ebay items, skip the first part.

Let’s go!

# Scraping the Bay

Amazingly, scraping ebay is quite easy - with `rvest`, that is. The site is well structured and does not, as of yet, apply any anti-scraping policies (try to do this with Amazon and tell me what happens!). The only limitation is that out of over 220K Men's Wristwatches items, we can only get ~10K, and with ebay's Search algorithm deciding for us which are the "Best Match".

```
library(tidyverse)
library(rvest)
library(stringr)

getURL <- function(i)
  str_c("https://www.ebay.com/sch/Wristwatches/31387/i.html?Gender=Men%2527s&LH_BIN=1&_dcat=31387&LH_ItemCondition=1000&LH_PrefLoc=3&_ipg=200&_pgn=", i)

nPages <- 50
urls <- map_chr(1:nPages, getURL)
```

Currently in `urls` we have the first 50 URLs of ebay's Wristwatches category, where we ask for:

- Gender = Men
- Fixed price only, a.k.a. “Buy It Now” (don’t want Auctions)
- New items only (I figure used watches will enter a lot of noise for this experiment)
- US only (I figure without this we would get a ton of expensive-looking watches from China, again noisy)
- 200 results per page

We ask for 50 pages because 50 pages X 200 results = 10K items, though as we'll see later, there are many duplicates.

Now we define an empty `tibble` called `watches`, which will hold the resulting table, and a few functions to help us^{1}:

```
watches <- tibble(
  imgurl = character(0),
  ilsPrice = numeric(0),
  title = character(0),
  id = numeric(0),
  price = numeric(0),
  fileName = character(0),
  exclude = numeric(0)
)

getPrice <- function(priceText) {
  prcRange <- str_detect(priceText, "[0-9,]+\\.[0-9]+ to ILS [0-9,]+\\.[0-9]+")
  prices <- str_extract_all(priceText, "[0-9,]+\\.[0-9]+")[[1]]
  prices <- as.numeric(str_replace(prices, ",", ""))
  if (prcRange) {
    price <- mean(prices[1:2])
  } else {
    price <- prices[1]
  }
  return(price)
}

ils2usd <- function(ilsPrice, rate = 0.28) ilsPrice * rate

getFileName <- function(id) str_c("D:/ebay_wristwatches/", id, ".jpg")

downloadImage <- function(imgurl, fileName) {
  tryCatch(
    download.file(imgurl, fileName, mode = "wb", quiet = TRUE),
    error = function(e) return(1)
  )
}

getItems <- function(i, pb = NULL) {
  html <- read_html(urls[i])
  imgurlA <- html %>%
    html_nodes(".lvpic") %>%
    html_nodes("img")
  imgurlB <- imgurlA %>%
    html_attr("src") %>%
    str_replace(., ".+gif$", "NA")
  imgurlC <- imgurlA %>%
    html_attr("imgurl")
  imgurl <- cbind(imgurlB, imgurlC) %>%
    as_tibble() %>%
    transmute(imgurl = ifelse(is.na(imgurlC), imgurlB, imgurlC)) %>%
    unlist() %>%
    unname()
  title <- imgurlA %>%
    html_attr("alt")
  ilsPrice <- html %>%
    html_nodes(".lvprice") %>%
    html_text() %>%
    map_dbl(getPrice)
  idToUse <- if (is_empty(watches$id)) {
    1:length(imgurl)
  } else {
    (tail(watches$id, 1) + 1):(length(imgurl) + tail(watches$id, 1))
  }
  watchesDF <- tibble(imgurl = imgurl, ilsPrice = ilsPrice, title = title) %>%
    mutate(id = idToUse,
           price = map_dbl(ilsPrice, ils2usd),
           fileName = map_chr(id, getFileName),
           exclude = map2_dbl(imgurl, fileName, downloadImage))
  watches <<- rbind.data.frame(watches, watchesDF)
  if (!is.null(pb)) pb$tick()$print()
}
```

- `getPrice`: a function to parse the price of an item from the html. Notice a price can come as a range for different sizes/colors (though this is less common in watches), in which case we take the mean
- `ils2usd`: as I'm doing this from Israel I get prices in ILS (Israeli Shekels), and this function converts Shekels to USD at today's rate
- `getFileName`: a function to create the address on disk for the item's image
- `downloadImage`: a `tryCatch` wrapper around `download.file` for downloading an image. If a failure is encountered the function returns 1, and this is how we'll know to exclude this specific image later
- `getItems`: the main function, which uses `rvest` to parse each URL's html, get the title and price data, and download the images to disk
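To make the price parsing concrete, here is a small base-R re-implementation of the `getPrice` logic (the function name `parse_price` and the sample price strings are made up for illustration; the post's version uses `stringr`):

```r
# a base-R sketch of the getPrice logic, with made-up sample inputs
parse_price <- function(priceText) {
  nums <- regmatches(priceText, gregexpr("[0-9,]+\\.[0-9]+", priceText))[[1]]
  nums <- as.numeric(gsub(",", "", nums))
  # an "X to Y" range lists two prices - take their mean; otherwise take the first
  if (grepl(" to ", priceText, fixed = TRUE)) mean(nums[1:2]) else nums[1]
}

parse_price("ILS 159.00")                    # 159
parse_price("ILS 1,000.00 to ILS 2,000.00")  # 1500
```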

This is how we use `getItems`:

```
pb <- progress_estimated(nPages)
walk(1:nPages, getItems, pb)
```

And that goes pretty quickly - for me without problems until page 50, where it failed - so I expect to see in `watches` 49 x 200 = 9,800 items:

`glimpse(watches)`

```
## Observations: 9,800
## Variables: 7
## $ imgurl <chr> "https://i.ebayimg.com/thumbs/images/g/LA4AAOSwhOVXdn...
## $ ilsPrice <dbl> 159.000, 547.770, 583.110, 636.120, 671.460, 270.830,...
## $ title <chr> "Invicta 8932OB Men's Pro Diver Black Dial SS Bracele...
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ price <dbl> 44.5200, 153.3756, 163.2708, 178.1136, 188.0088, 75.8...
## $ fileName <chr> "D:/ebay_wristwatches/1.jpg", "D:/ebay_wristwatches/2...
## $ exclude <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
```

(P.S. how cool is `glimpse`?! How come no one ever tells you about these helpers?)

So indeed we have a 9,800-item-long `tibble`, and these are the variables:

- `imgurl`: the full URL of the item's image
- `ilsPrice`: the price in ILS
- `title`: the item's title, e.g. "Invicta 8932OB Men's Pro Diver Black Dial SS Bracelet Dive Watch"
- `id`: serial number for the item
- `price`: the price in USD
- `fileName`: the item's image location on disk
- `exclude`: a 0/1 variable; if 1, this means the image was not successfully downloaded and we need to get rid of this item

OK, three problems before we go on:

- There are a few (23 actually) items for which we don't have images
- There are many duplicates - about 2K if we count `title` and `price` as the variables we want distinct. This can occur for two reasons: (a) we're calling "page 2" a few minutes after we called "page 1", so the ebay Search algorithm might have put a specific item from page 1 now in page 2; (b) there are actual duplicate items on site
- Later on, some helper methods from `keras` really dislike 1-word titles; there are few of those, but still

We need to exclude all these:

```
watches <- watches %>%
  filter(exclude == 0) %>%
  distinct(title, price, .keep_all = TRUE) %>%
  mutate(titleLength = map_dbl(title, function(x) length(str_split(x, " ")[[1]]))) %>%
  filter(titleLength >= 2)

dim(watches)
```

`## [1] 7803 8`

We’re left with 7.8K items. Nice!

# Watching Watches

Some exploratory analysis would be nice to see everything is cool between us and the watches.

We have 7.8K watches, let’s see the distribution of price:

```
watches %>%
  ggplot(aes(price)) +
  geom_histogram(fill = "red", alpha = 0.5) +
  ggtitle("Distribution of ebay's Men's Wristwatches Price (USD)")
```

This distribution is very extreme:

`summary(watches$price)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.95 215.84 330.87 525.21 537.70 61650.69
```

Most watches cost around $ 200-600, but there is a watch which costs over $60K! Which one is it?

```
watches %>%
  top_n(1, price) %>%
  select(title)
```

```
## # A tibble: 1 x 1
## title
## <chr>
## 1 Patek Philippe NEW Nautilus Travel Time Chronograph Steel Watch Box/Pape~
```

Looks like any $500 watch to me. But I’m getting ahead of myself.

Anyway, this distribution is very skewed, so I recommend going with the logarithm of the price, which is much more "normal" and probably less affected later on by extreme values:

```
watches %>%
  ggplot(aes(log(price))) +
  geom_histogram(fill = "red", alpha = 0.5) +
  ggtitle("Distribution of ebay's Men's Wristwatches log-Price (USD)")
```

Another thing which might be interesting to see at this stage is what's going on in the watches' titles, e.g. the most common words. The `tidytext` package makes this really easy:

```
library(tidytext)

watches %>%
  select(title) %>%
  unnest_tokens(word, title) %>%
  count(word, sort = TRUE) %>%
  slice(1:10)
```

```
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 watch 6312
## 2 men's 5678
## 3 quartz 2414
## 4 steel 2392
## 5 black 2268
## 6 stainless 2199
## 7 mens 1531
## 8 analog 1510
## 9 dial 1492
## 10 display 1162
```

Makes sense.

The last thing we’ll do in this section is get a reasonable baseline for our predictions. Why do we need one? This is Regrssion. There is no concept of “accuracy” like in classification, where we know that “100%” is good, and if we have two balanced classes “60%” is better than chance. In Regression we usually use the Mean Squared Error or the square root of that, RMSE.

\(RMSE=\sqrt{\frac{1}{n}\sum_{t=1}^{n}(\hat{y_t} - y_t)^2}\)

You can see that the MSE takes the mean of squared deviations between the real observations (\(y_t\)) and the predicted ones (\(\hat{y_t}\)). And the RMSE just takes the root of that to “go back” to the original scale of \(y_t\). In our case the original scale is log(USD) so I’m not keen on doing that. The important thing is that we want RMSE to be as low as possible.
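A tiny worked example, with made-up numbers: if every prediction is off by exactly 1, the squared deviations are all 1, and so is the RMSE:

```r
# made-up numbers: every prediction is off by exactly 1
y_obs <- c(3, 5, 7)
y_hat <- c(4, 6, 8)
sqrt(mean((y_hat - y_obs)^2))  # 1
```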

If you don’t like RMSE, you can use the Pearson correlation or define some accuracy-equivalent metric, e.g. “The percentage of observations for which we managed to predict the USD price within a plus/minus $20 range”, or better yet “The percentage of observations for which we managed to predict the USD price within a plus/minus 30% range”. Because I’m not OK with predicting a 30 dollar watch as 10 (minus 20 dollars), but I am OK with predicting it as 21 (minus 30 percent).

What’s our current best guess? The mean!

```
rmse <- function(y_true, y_pred) {
  sqrt(mean((y_pred - y_true)^2))
}

pctWithinUSDPercentMargin <- function(y_true, y_pred, pct.margin = 30) {
  sum(abs(exp(y_pred)/exp(y_true) - 1) < pct.margin/100)/length(y_true)
}

y_pred <- rep(mean(log(watches$price)), nrow(watches))
y_true <- log(watches$price)
rmse(y_true, y_pred)
```

`## [1] 0.8067097`

`pctWithinUSDPercentMargin(y_true, y_pred)`

`## [1] 0.3212867`

This means that when predicting the mean for all observations, the baseline RMSE is 0.806 (whatever that means in log(USD)), and for ~32% of observations the mean is already within the +/- 30% range on the USD scale. Good to know.

# Title Only

Let’s first try `xgboost`

, with a simple bag-of-words model, no weighting^{2}. We can again use `tidytext`

’s helpers here to get a Document-Term *sparse* matrix, as most values would become zeros:

```
dtm <- watches %>%
  select(id, title) %>%
  unnest_tokens(word, title) %>%
  group_by(id, word) %>%
  dplyr::summarise(n = n()) %>%
  cast_sparse(id, word, n)

dim(dtm)
```

`## [1] 7803 9188`

We get our 7.8K items X 9,188 unique terms.
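For intuition, here's what such a document-term count looks like on a toy pair of made-up titles, using only base R (the real matrix is sparse, built with `tidytext::cast_sparse`, but the idea is the same):

```r
# toy document-term counts for two made-up titles (dense, for illustration only)
titles <- c("black steel watch", "black dial watch")
tokens <- strsplit(titles, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))
dtm_toy <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dtm_toy  # 2 rows (titles) x 4 columns (black, dial, steel, watch)
```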

Next we’ll split the data into training, validation and test sets. This is the split we’ll use from now on, and we’re not going to use the test set until the very end.

```
trainN <- 5000
validN <- 2000
set.seed(42)
trainSamples <- sort(sample(nrow(dtm), trainN))
validSamples <- sort(sample((1:nrow(dtm))[-trainSamples], validN))
testSamples <- sort((1:nrow(dtm))[-c(trainSamples, validSamples)])
dtmTrain <- dtm[trainSamples, ]
dtmValid <- dtm[validSamples, ]
dtmTest <- dtm[testSamples, ]
dim(dtmTrain)
dim(dtmValid)
dim(dtmTest)
```

`## [1] 5000 9188`

`## [1] 2000 9188`

`## [1] 803 9188`

Let’s prepare the data for `xgboost`

.

```
library(xgboost)

xgbTrain <- xgb.DMatrix(dtmTrain, label = log(watches$price[trainSamples]))
xgbValid <- xgb.DMatrix(dtmValid, label = log(watches$price[validSamples]))

xgbParams <- list(objective = "reg:linear",
                  eval_metric = "rmse",
                  max_depth = 5,
                  eta = 1,
                  gamma = 0,
                  subsample = 1,
                  min_child_weight = 1,
                  alpha = 0,
                  lambda = 1.1)

watchlist <- list(train = xgbTrain, test = xgbValid)
```

We’re telling `xgboost`

that we’re in a Regression framework (`objective`

), the metric to minimize is RMSE (`eval_metric`

). We use shallow trees of `max_depth`

5, and increase a bit the L2 “Ridge” penalty (`lambda`

) to 1.1 (1 is the default). All of these parameters should of course be selected using Cross Validation, not as I did here.

Let’s tell `xgboost`

to build a maximum of 5,000 trees, stop only if it doesn’t see any improvement in the *validation* set for 100 trees, to avoid overfitting:

```
xgbFit <- xgb.train(data = xgbTrain,
                    params = xgbParams,
                    nrounds = 5000,
                    verbose = TRUE,
                    print_every_n = 100,
                    early_stopping_rounds = 100,
                    watchlist = watchlist)
```

```
## [1] train-rmse:0.728552 test-rmse:0.747846
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 100 rounds.
##
## [101] train-rmse:0.308889 test-rmse:0.583139
## [201] train-rmse:0.237314 test-rmse:0.582568
## [301] train-rmse:0.195867 test-rmse:0.581100
## Stopping. Best iteration:
## [286] train-rmse:0.201348 test-rmse:0.580927
```

After a single tree `xgboost` decreased the RMSE to 0.74 (we started at 0.80, remember), and it decided to stop at 286 trees, when it got to a RMSE of 0.58.

Let’s see our other metric, of “percent predicted within +/- 30% of real price”:

```
y_pred <- predict(xgbFit, xgbValid)
y_true <- log(watches$price[validSamples])
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
```

`## [1] 0.5809266`

`## [1] 0.5315`

This means that with `xgboost` on text-only data, 53% of watches in our validation set get a prediction which is within +/- 30% of the original USD price.

Let’s see the predicted values against real observations, to see that there is nothing “degenerate” here:

```
tibble(y_true = y_true, y_pred = y_pred) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Title-Only XGBOOST log-Price Prediction: Predicted vs. Actual")
```

That’s very good, actually. Correlation:

`cor(y_true, y_pred)`

`## [1] 0.7162165`

Can “Deep Learning” beat that? We’ll use a simple LSTM model, not unlike the famous imdb example, only we don’t use our LSTM output as the basis for Sentiment classification, but for Regression.

But first, some preparation. `keras` doesn't know words, it knows numbers. We need our titles as lists of numbers, encoded from words. Luckily `keras` and `reticulate` have some tools to do that easily:

```
library(keras)

tnizer <- text_tokenizer()
tnizer$fit_on_texts(watches$title)
watches_dl <- texts_to_sequences(tnizer, watches$title)

# this step was needed in a previous version of keras,
# it may very well be unnecessary here
watches_dl <- reticulate::py_to_r(watches_dl)
```

OK, what’s `watches_dl`

?

`length(watches_dl)`

`## [1] 7803`

`watches_dl` is a representation of the `watches` titles. It is a list of 7,803 lists, each holding a bunch of integer codes replacing the original words. The first element, which has 11 numbers, corresponds to the first title, which has 11 words:

`watches$title[1]`

`## [1] "Invicta 8932OB Men's Pro Diver Black Dial SS Bracelet Dive Watch"`

`watches_dl[1]`

```
## [[1]]
## [1] 14 7306 2 41 31 4 9 101 34 149 1
```
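Conceptually, the tokenizer is just a word-to-integer lookup. A toy base-R sketch (the vocabulary and indices here are made up; the real `text_tokenizer` assigns indices by word frequency):

```r
# toy word-to-integer encoding; vocab and indices are made up for illustration
vocab <- c("watch" = 1, "men's" = 2, "black" = 3, "dial" = 4)
encode <- function(title, vocab) {
  unname(vocab[strsplit(tolower(title), " ", fixed = TRUE)[[1]]])
}

encode("Black Dial watch", vocab)  # 3 4 1
```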

Getting our train/validation/test matrices:

```
maxlen <- 17
x_train_title <- watches_dl[trainSamples]
y_train_title <- log(watches$price[trainSamples])
x_valid_title <- watches_dl[validSamples]
y_valid_title <- log(watches$price[validSamples])
x_test_title <- watches_dl[testSamples]
y_test_title <- log(watches$price[testSamples])
x_train_title <- pad_sequences(x_train_title, maxlen = maxlen)
x_valid_title <- pad_sequences(x_valid_title, maxlen = maxlen)
x_test_title <- pad_sequences(x_test_title, maxlen = maxlen)
dim(x_train_title)
```

`## [1] 5000 17`

`dim(x_valid_title)`

`## [1] 2000 17`

`dim(x_test_title)`

`## [1] 803 17`

The last thing we did there was "pad" the title matrices with zeros for titles shorter than 17 words, the length of the longest title.
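A base-R sketch of what `pad_sequences` does to a single title by default (zeros are prepended, so the actual codes end up right-aligned; `pad_to` is my own name):

```r
# sketch of pre-padding one encoded title with zeros up to maxlen
pad_to <- function(codes, maxlen) c(rep(0, maxlen - length(codes)), codes)

pad_to(c(14, 7306, 2), 5)  # 0 0 14 7306 2
```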

Now we’re ready for `keras`

:

```
max_features <- 10000

model_title <- keras_model_sequential()
model_title %>%
  layer_embedding(input_dim = max_features, output_dim = 256) %>%
  layer_lstm(units = 256) %>%
  layer_dense(units = 1, activation = "linear")

model_title %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)
```

Notice we’re asking LSTM to give us an output of length 256, then we add a dense layer with a “linear” activation function, because this is Regression. Finally we compile the model with MSE loss.

Let’s do 20 epochs with batch size 32:

```
history <- model_title %>% fit(
  x_train_title, y_train_title,
  batch_size = 32,
  epochs = 20,
  validation_data = list(x_valid_title, y_valid_title),
  verbose = 0
)

plot(history)
```

Notice we didn’t get any additional decrease in MSE after epoch 5. Let’s see our metrics:

```
y_pred <- model_title$predict(x_valid_title)
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
```

`## [1] 0.5592608`

`## [1] 0.522`

```
## [,1]
## [1,] 0.7269884
```

It seems `keras` beat `xgboost`, reaching a RMSE of 0.56 and a correlation of 0.72. But not according to our other metric: only 52% of watches in our validation set get a prediction which is within +/- 30% of the original price.

```
tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Title-Only Keras log-Price Prediction: Predicted vs. Actual")
```

# Image Only

Before we do Image-Only models - do you think these models will be successful?

Test yourself: can you arrange the following 5 *random* men’s wristwatches from least to most expensive? (Answer below^{3}, don’t peek)

I’d be surprised if you got it right, and even then would you know how to estimate the actual cost?

Let’s load the images. Using the `imager`

package here:

```
library(imager)
library(abind)

image_read_matrix <- function(imgFile) {
  # already normalizes to the 0-1 range!
  imgFile %>% load.image() %>% resize(100, 100) %>% as.array() %>% adrop(3)
}

loadImagesAsXMatrix <- function(imagesAddress) {
  imageMatrices <- tibble(address =
                            list.files(imagesAddress, full.names = TRUE)) %>%
    filter(address %in% watches$fileName) %>%
    mutate(order = map_dbl(address,
                           function(x)
                             as.numeric(
                               str_split(
                                 str_split(x, "/")[[1]][3], "\\."
                               )[[1]][1]))) %>%
    arrange(order) %>%
    select(address) %>%
    unlist() %>%
    map(image_read_matrix)
  array(
    data = do.call(rbind, map(imageMatrices, as.vector)),
    dim = c(length(imageMatrices), dim(imageMatrices[[1]]))
  )
}

X <- loadImagesAsXMatrix("C:/SDAD_materials/ebay_wristwatches")
dim(X)
```

`## [1] 7803 100 100 3`

`X` is a 4D array: 7,803 images X 100 pixels width X 100 pixels height X 3 RGB channels.

Let’s get the train/validation/test matrices:

```
x_train_image <- X[trainSamples, , , ]
y_train_image <- y_train_title
x_valid_image <- X[validSamples, , , ]
y_valid_image <- y_valid_title
x_test_image <- X[testSamples, , , ]
y_test_image <- y_test_title
rm(X)
```
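A quick aside before we flatten these images for `xgboost`: R's `array()` reshape is column-major, so reshaping an `(n, height, width, channels)` array to `(n, height * width * channels)` keeps one row per image. A toy check, with made-up sizes:

```r
# toy check of the flattening: 2 "images" of 2x2 pixels x 3 channels -> a 2 x 12 matrix
A <- array(1:24, dim = c(2, 2, 2, 3))
A_flat <- array(A, dim = c(dim(A)[1], prod(dim(A)[-1])))

dim(A_flat)                     # 2 12
all(A_flat[1, ] == A[1, , , ])  # row 1 holds exactly image 1's pixels
```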

Can `xgboost` handle these images? Sure it can: each image can be flattened to have 100 X 100 X 3 = 30K features, and `xgboost` is OK with that:

```
x_train_image_xgb <- array(x_train_image,
                           dim = c(dim(x_train_image)[1],
                                   dim(x_train_image)[2] *
                                     dim(x_train_image)[3] *
                                     dim(x_train_image)[4]))
x_valid_image_xgb <- array(x_valid_image,
                           dim = c(dim(x_valid_image)[1],
                                   dim(x_valid_image)[2] *
                                     dim(x_valid_image)[3] *
                                     dim(x_valid_image)[4]))
x_test_image_xgb <- array(x_test_image,
                          dim = c(dim(x_test_image)[1],
                                  dim(x_test_image)[2] *
                                    dim(x_test_image)[3] *
                                    dim(x_test_image)[4]))

xgbTrain <- xgb.DMatrix(x_train_image_xgb, label = y_train_image)
xgbValid <- xgb.DMatrix(x_valid_image_xgb, label = y_valid_image)
watchlist <- list(train = xgbTrain, test = xgbValid)

xgbFit <- xgb.train(data = xgbTrain,
                    params = xgbParams,
                    nrounds = 5000,
                    verbose = TRUE,
                    print_every_n = 100,
                    early_stopping_rounds = 100,
                    watchlist = watchlist)
```

```
## [1] train-rmse:0.770060 test-rmse:0.811320
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 100 rounds.
##
## [101] train-rmse:0.065538 test-rmse:0.979093
## Stopping. Best iteration:
## [1] train-rmse:0.770060 test-rmse:0.811320
```

What just happened here? `xgboost` reached a validation RMSE of 0.81 (slightly worse than our baseline of 0.806) after a single tree, and could not improve on this! It stayed with a single tree, with a RMSE no better than the original. Other metrics:

```
y_pred <- predict(xgbFit, xgbValid)
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
```

`## [1] 0.8113195`

`## [1] 0.3165`

`## [1] 0.1374303`

Pretty bad all around.

Let’s see the predicted values against real observations, to see the pattern:

```
tibble(y_true = y_true, y_pred = y_pred) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image-Only XGBOOST log-Price Prediction: Predicted vs. Actual")
```

Unsurprisingly, it seems like the best `xgboost` could do with image-only data was to predict the global mean for most observations.

Will Deep Learning do better? Let's use a plain convolutional network:

```
model_image <- keras_model_sequential()
model_image %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu', padding = 'same',
                input_shape = c(100, 100, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(512) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(units = 1, activation = "linear")

model_image %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

history <- model_image %>% fit(
  x_train_image, y_train_image,
  batch_size = 32,
  epochs = 20,
  validation_data = list(x_valid_image, y_valid_image),
  verbose = 0
)

plot(history)
```

Again we get no improvement after just a few epochs. This time we get nothing in general :(

```
y_pred <- model_image$predict(x_valid_image)
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
```

`## [1] 0.8500388`

`## [1] 0.3615`

```
## [,1]
## [1,] 0.1289341
```

We got a RMSE worse than our baseline, and our predicted vs. actual scatterplot shows how it happened:

```
tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image-Only Keras log-Price Prediction: Predicted vs. Actual")
```

So is there nothing in an image to help with the prediction of price?

# Title and Image

How would we combine title and image data to predict price?

- `xgboost`: This is a simple matter of concatenating the features. The title got us 9.1K features (words), the image 30K features (pixels); we can just paste these together: `cbind(as.matrix(dtmTrain), x_train_image_xgb)`
- `keras`: This is more interesting. We're going to use a "Bi-Modal" architecture, where we train a LSTM for the titles exactly like we did, and a convolutional network for the images exactly like we did - then what do we get? Just numbers! Vectors. And we can concatenate these two vectors into a single vector, then add dense layers to it, to get our single output. See this diagram if it makes more sense:

The following uses the `keras` "functional" API, and is based heavily on this example. The trick is to use `layer_concatenate` to paste the two tensors together:

```
image_model <- keras_model_sequential()
image_model %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu', padding = 'same',
                input_shape = c(100, 100, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten()

image_input <- layer_input(shape = c(100, 100, 3), name = "image_input")
encoded_image <- image_input %>% image_model

title_input <- layer_input(shape = maxlen, dtype = 'int32', name = "title_input")
encoded_title <- title_input %>%
  layer_embedding(input_dim = max_features, output_dim = 256, input_length = maxlen) %>%
  layer_lstm(units = 256)

output <- layer_concatenate(c(encoded_title, encoded_image)) %>%
  layer_dense(units = 1, activation = 'linear')

image_title_model <- keras_model(inputs = c(image_input, title_input), outputs = output)
image_title_model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

history <- image_title_model %>% fit(
  x = list(image_input = x_train_image, title_input = x_train_title),
  y = y_train_image,
  batch_size = 32,
  epochs = 20,
  validation_data = list(
    list(image_input = x_valid_image, title_input = x_valid_title),
    y_valid_image),
  verbose = 0
)

plot(history)
```

You can’t see this from the graph but we actually got the best result yet:

```
y_pred <- image_title_model$predict(list(image_input = x_valid_image, title_input = x_valid_title))
rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
```

`## [1] 0.5435071`

`## [1] 0.536`

```
## [,1]
## [1,] 0.7611045
```

The Bi-Modal Deep Learning model beat any of our previous models: 0.54 RMSE, 53.6% of items get predictions within +/- 30% of their real price, and a Pearson correlation of 0.76.

```
tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image + Title Keras log-Price Prediction: Predicted vs. Actual")
```

# It Can’t Be.

Is this counter-intuitive to you? We got nothing from the image data alone, yet when combining image data with title data we got a better result than using the title data alone. I can give you a very short simulation example, with simple linear regression, of how this can happen. If you're not interested feel free to skip.

```
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2 * x1 + 0.5 * x1 * x2 + rnorm(100)
```

So `x1` is a random vector of 100 \(N(0,1)\) elements, and `x2` is another such vector. `y` is a combination of both, plus some random noise.

A linear regression of `y` over `x1` is of course "very"^{4} significant, resulting in a high adjusted^{5} R squared (0.70):

```
lm_x1 <- lm(y ~ x1)
summary(lm_x1)
```

```
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1360 -0.7990 -0.1261 0.8247 3.8525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09313 0.11977 0.778 0.439
## x1 1.90615 0.12237 15.577 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.191 on 98 degrees of freedom
## Multiple R-squared: 0.7123, Adjusted R-squared: 0.7094
## F-statistic: 242.6 on 1 and 98 DF, p-value: < 2.2e-16
```

A linear regression of `y` over `x2` gives you nothing! The adjusted R squared is -0.001 and the `x2` coefficient is not significant:

```
lm_x2 <- lm(y ~ x2)
summary(lm_x2)
```

```
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9159 -1.6234 -0.1591 1.5529 4.8242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3074 0.2212 1.389 0.168
## x2 0.2085 0.2236 0.933 0.353
##
## Residual standard error: 2.21 on 98 degrees of freedom
## Multiple R-squared: 0.008799, Adjusted R-squared: -0.001315
## F-statistic: 0.87 on 1 and 98 DF, p-value: 0.3533
```

Now look at the regression of `y` over `x1`, `x2` and the interaction between them:

```
lm_x1x2 <- lm(y ~ x1 * x2)
summary(lm_x1x2)
```

```
##
## Call:
## lm(formula = y ~ x1 * x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04434 -0.77211 0.05196 0.62709 2.99492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03347 0.10388 0.322 0.748
## x1 2.11811 0.11178 18.949 < 2e-16 ***
## x2 0.08450 0.10402 0.812 0.419
## x1:x2 0.64271 0.10775 5.965 4.07e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 96 degrees of freedom
## Multiple R-squared: 0.7909, Adjusted R-squared: 0.7844
## F-statistic: 121 on 3 and 96 DF, p-value: < 2.2e-16
```

The `x1` and interaction coefficients are significant, and the adjusted R squared is higher: 0.78.

So by adding a term, `x2`, which seems *marginally* unrelated to `y`, we improved the entire model!

`anova(lm_x1, lm_x1x2)`

```
## Analysis of Variance Table
##
## Model 1: y ~ x1
## Model 2: y ~ x1 * x2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 98 138.90
## 2 96 100.94 2 37.952 18.047 2.222e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

# What About the Testing Set?

Remember we left aside ~800 items as a “testing” set? Why did we do that?

In two words: parameter tuning. I now know `xgboost` with title-only data needs 286 trees. I now know that the best result our last DL model can achieve actually comes after 18 epochs, not 20. And so on and so forth. But the way I know these parameters is by looking at the validation set. Who's to say it isn't biased in some way? All my parameter tuning decisions were based on wanting to optimize the RMSE of the validation set. To really get a clear, unbiased picture of the RMSE, I need a *fresh* set of data, absolutely never seen before. I will unite the training and validation sets into one "big" training set, train my best model again (using e.g. GBT with 286 trees or DL with only 18 epochs) and predict on my fresh testing set.

```
x_train_image_all <- abind::abind(x_train_image, x_valid_image, along = 1)
x_train_title_all <- rbind(x_train_title, x_valid_title)
y_train_all <- c(y_train_image, y_valid_image)

image_model <- keras_model_sequential()
image_model %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu', padding = 'same',
                input_shape = c(100, 100, 3)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu', padding = 'same') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten()

image_input <- layer_input(shape = c(100, 100, 3), name = "image_input")
encoded_image <- image_input %>% image_model

title_input <- layer_input(shape = maxlen, dtype = 'int32', name = "title_input")
encoded_title <- title_input %>%
  layer_embedding(input_dim = max_features, output_dim = 256, input_length = maxlen) %>%
  layer_lstm(units = 256)

output <- layer_concatenate(c(encoded_title, encoded_image)) %>%
  layer_dense(units = 1, activation = 'linear')

image_title_model <- keras_model(inputs = c(image_input, title_input), outputs = output)
image_title_model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

history <- image_title_model %>% fit(
  x = list(image_input = x_train_image_all, title_input = x_train_title_all),
  y = y_train_all,
  batch_size = 32,
  epochs = 18,
  validation_data = list(
    list(image_input = x_test_image, title_input = x_test_title),
    y_test_image),
  verbose = 0
)

plot(history)
```

```
y_pred <- image_title_model %>%
  predict(list(image_input = x_test_image, title_input = x_test_title))
y_true <- y_test_image

rmse(y_true, y_pred)
pctWithinUSDPercentMargin(y_true, y_pred)
cor(y_true, y_pred)
```

`## [1] 0.4728875`

`## [1] 0.5031133`

```
## [,1]
## [1,] 0.821674
```

We're evaluating on a different, never-before-seen test set, but we also went from 7K training items to 7K + 2K = 9K, so somewhat better metrics are to be expected. With 9K items, we reach an RMSE of 0.47, 50% of items get predictions within +/- 30% of their real price, and a Pearson correlation of 0.82.
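For reference, both metrics are plain functions of the log-price vectors. A minimal sketch (the post's actual `rmse` and `pctWithinUSDPercentMargin` helpers were defined earlier and may differ, e.g. in the exact margin; the 30% here matches the claim above):

```r
# RMSE on the log-price scale
rmse_sketch <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# Share of items whose back-transformed (exp) prediction lands
# within +/- 30% of the real price
pctWithin30Pct <- function(y_true, y_pred) {
  mean(abs(exp(y_pred) - exp(y_true)) / exp(y_true) <= 0.3)
}

# Toy example: three watches, predictions off by 10%, 25% and 100%
y_true <- log(c(100, 200, 400))
y_pred <- log(c(110, 150, 800))
rmse_sketch(y_true, y_pred)
pctWithin30Pct(y_true, y_pred)  # 2 of 3 items within the margin
```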

```
tibble(y_true = y_true, y_pred = c(y_pred)) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 10), ylim = c(0, 10), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image + Title Keras log-Price Prediction: Predicted vs. Actual")
```

In case you're wondering, this is what the predicted vs. actual scatterplot looks like on the original price scale (log-price exponentiated), which is not too bad:

```
tibble(y_true = exp(y_true), y_pred = exp(c(y_pred))) %>%
  ggplot(aes(y_true, y_pred)) +
  coord_fixed(ratio = 1, xlim = c(0, 3000), ylim = c(0, 3000), expand = TRUE) +
  geom_point(color = "red", alpha = 0.5) +
  ggtitle("Image + Title Keras Price Prediction: Predicted vs. Actual")
```

# Prediction, What Is It Good For?

So, ebay has the full 220K dataset, including images. What could they use this for? One possibility is recommending an incoming watch’s price. “This is what we think this should cost.” Another possibility could be verifying prices. Ebay could alert a seller if the price she specified is significantly larger or smaller than the expected price. Maybe it’s a typo!
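The verification idea boils down to a threshold on the ratio between the listed price and the model's prediction. A hypothetical sketch (both the `flagSuspiciousPrice` helper and the 3x threshold are my own assumptions, not ebay's or anything from the models above):

```r
# Flag a listing when its price differs from the model's predicted
# price by more than `factor` in either direction (hypothetical helper)
flagSuspiciousPrice <- function(listed_price, predicted_log_price, factor = 3) {
  predicted_price <- exp(predicted_log_price)
  listed_price > factor * predicted_price |
    listed_price < predicted_price / factor
}

# Listed ~7331 USD vs. predicted ~2717 USD: under 3x apart, not flagged
flagSuspiciousPrice(7331, log(2717))
# Listed 9999 USD vs. predicted 100 USD: flagged
flagSuspiciousPrice(9999, log(100))
```

A ratio-based rule like this fits naturally with a model trained on log-prices, since a constant error on the log scale is a constant multiplicative factor on the price scale.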

Let’s look at the model’s biggest “mistake” in absolute value:

```
biggestMistakeIndex <- which.max(abs(exp(y_pred) - exp(y_true)))
exp(y_true[biggestMistakeIndex])
```

`## [1] 7331.069`

`exp(y_pred[biggestMistakeIndex])`

`## [1] 2716.547`

`watches[testSamples[biggestMistakeIndex], c("title", "price")]`

```
## # A tibble: 1 x 2
## title price
## <chr> <dbl>
## 1 Gevril Men's 'Washington' Swiss Automatic Stainless Steel Casual ~ 7331.
```

The model predicted this watch should cost almost 5,000 dollars less than it costs. I wouldn’t automatically say this means the watch’s actual cost is too much, but I will say that when you search ebay for this watch you get 7 (!) exact-match results, from the same seller, with prices ranging from ~1,700 to ~7,300 USD (our watch). This screenshot shows the first five with price in Israeli Shekels:

They could be different watches, because of subtleties not mentioned in the title, but this is a bad user experience, and it also explains why our model had such a hard time with this watch.

# Watch It!

I stepped out of my comfort zone and did something with regression. I showed how to create an amazing fashion or financial dataset out of thin air (though you should probably use the ebay Search API for bigger projects). And I experimented with a Deep Learning architecture a bit more complicated than the usual examples. All in all this was a great week for Sex, Drugs and Data. Don't ask me to share the data scraped here, though. A tad illegal.

Thanks, ebay!

There's absolutely NO guarantee this would work for any ebay category at any time; the site is probably very dynamic.↩

I tried Tf-Idf here, didn’t give me an edge.↩

Answer, from least to most expensive: D (~260 USD), A, C, E, B (~1000 USD)↩

Don’t use such idioms, please.↩

Why am I looking at adjusted R squared? Because I’m going to add terms to this regression and by definition the non-adjusted R-squared should grow!↩