Recently I stumbled upon this treasure of a dataset, called sexualitics.org. I love exploring bizarre datasets and it doesn’t get more bizarre than (brace yourselves) ~786 thousand (!) tagged porn video titles from xhamster.com - the site’s entire inventory from 2007 to 2013. I am going to see if I can predict a porn video’s category from its title. The main challenge here is that some videos have more than a single category, so this is Multi-label Classification. Be warned: completely NSFW.

Get me some porn! Data!

Download and untar the first zipped file from here, and you will get 2 csv files which we will load:

library(tidyverse)

data1 <- read_csv("~/xhamster.000.csv")
data2 <- read_csv("~/xhamster.001.csv")

data <- rbind(data1, data2)
rm(data1, data2)

dim(data)
## [1] 786121     10
colnames(data)
##  [1] "id"          "upload_date" "title"       "channels"    "description"
##  [6] "nb_views"    "nb_votes"    "nb_comments" "runtime"     "uploader"

As promised, 786K records of porn videos, in each:

  • upload_date
  • title - in lowercase letters
  • channels - these will be our tags or categories and as I said some videos have more than one
  • description - some more free text about the video
  • nb_views - no. of views; it is unclear over which period
  • nb_votes - no. of votes; again, unclear over which period
  • nb_comments - no. of comments; same
  • runtime - duration in seconds
  • uploader - unique ID of the user who uploaded the video

First things first - tidy up the data. For each channel or “tag” I would like a single row (a.k.a. “long” data). But the channels column at the moment looks like this:

data %>% select(channels) %>% head()
## # A tibble: 6 x 1
##                                    channels
##                                       <chr>
## 1 ['BBW', 'Black and Ebony', 'Interracial']
## 2                          ['Masturbation']
## 3             ['Babes', 'Teens', 'Webcams']
## 4                                   ['Men']
## 5                                  ['Gays']
## 6                     ['Amateur', 'Voyeur']

So we need to parse this column to get a list of tags, then unnest this list to get our “long” data1:

library(stringr)

# strip the surrounding brackets/quotes, then remove whitespace and hyphens
cleanToken <- function(s) {
  str_sub(s, 2, nchar(s) - 1) %>%
    str_replace_all("[\\s]+|-", "")
}

# e.g. "['BBW', 'Black and Ebony']" --> c("BBW", "BlackandEbony")
getTagList <- function(channels) {
  channels %>%
    cleanToken %>%
    str_split(., pattern = ",") %>%
    .[[1]] %>%
    map_chr(cleanToken)
}

dataLong <- data %>%
  mutate(id = 1:n(),
         tag = map(channels, getTagList)) %>%
  unnest(tag)

dataLong %>% select(id, title, tag)
## # A tibble: 1,864,952 x 3
##       id                                               title           tag
##    <int>                                               <chr>         <chr>
##  1     1                              girl riding black cock           BBW
##  2     1                              girl riding black cock BlackandEbony
##  3     1                              girl riding black cock   Interracial
##  4     2                                        masturbation  Masturbation
##  5     3                              sexy horny booty dance         Babes
##  6     3                              sexy horny booty dance         Teens
##  7     3                              sexy horny booty dance       Webcams
##  8     4                 group of young bareback sportsmen d           Men
##  9     5 horny latinos double penetrating hot ass in a store          Gays
## 10     6                                   geile katja en ik       Amateur
## # ... with 1,864,942 more rows

You can see here that we have 1.86 million rows for 786K videos - that’s about 2.4 categories per video, on average. Also note that I replaced id with 1:n().
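
The quick arithmetic behind that number:

nrow(dataLong) / nrow(data) # ~2.37 tags per video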

XXXploration

Ready to see some interesting stuff?! This. Is a dream come true. How many categories? Which are they?

dataLong %>%
  select(tag) %>%
  unique() %>%
  unlist() %>%
  unname()
##  [1] "BBW"               "BlackandEbony"     "Interracial"      
##  [4] "Masturbation"      "Babes"             "Teens"            
##  [7] "Webcams"           "Men"               "Gays"             
## [10] "Amateur"           "Voyeur"            "Shemales"         
## [13] "HiddenCams"        "Celebrities"       "MILFs"            
## [16] "Vintage"           "Matures"           "DoublePenetration"
## [19] "Hardcore"          "Threesomes"        "BDSM"             
## [22] "Japanese"          "Squirting"         "Old+Young"        
## [25] "Blowjobs"          "Cumshots"          "Brunettes"        
## [28] "Pornstars"         "Latin"             "British"          
## [31] "Facials"           "Cuckold"           "BigBoobs"         
## [34] "Hentai"            "Blondes"           "Asian"            
## [37] "Lesbians"          "SexToys"           "CreamPie"         
## [40] "Redheads"          "Closeups"          "Femdom"           
## [43] "Strapon"           "Stockings"         "Anal"             
## [46] "PublicNudity"      "Grannies"          "Tits"             
## [49] "Emo"               "Gothic"            "Bisexuals"        
## [52] "Hairy"             "Handjobs"          "Gaping"           
## [55] "French"            "FaceSitting"       "Italian"          
## [58] "German"            "Fingering"         "Upskirts"         
## [61] "Softcore"          "Arab"              "FootFetish"       
## [64] "Bukkake"           "Thai"              "Ladyboys"         
## [67] "Swingers"          "Flashing"          "Indian"           
## [70] "GroupSex"          "Russian"           "Nipples"          
## [73] "Funny"             "Latex"             "POV"              
## [76] "Spanking"          "Turkish"           "Gangbang"         
## [79] "Beach"             "Cartoons"          "BlackGays"        
## [82] "Lingerie"          "Chinese"           "Massage"          
## [85] NA                  "Brazilian"         "Czech"            
## [88] "Danish"            "Showers"           "Midgets"          
## [91] "Babysitters"       "Korean"            "Swedish"          
## [94] ""

So we have 94 distinct values here, but 2 of them are NA or simply "" - leaving 92 actual categories. We’ll get rid of those two soon. Which are the most and least common categories?

dataLong %>%
  count(tag) %>%
  arrange(-n) %>%
  slice(c(1:5, 90:94))
## # A tibble: 10 x 2
##            tag      n
##          <chr>  <int>
##  1     Amateur 223139
##  2         Men 137197
##  3       Teens  76688
##  4    Hardcore  75250
##  5    Blowjobs  65050
##  6     Swedish    683
##  7 Babysitters    672
##  8      Gothic    661
##  9     Midgets    397
## 10                  9

“Amateur” is the category with the most videos, then the gay category “Men”. The least common categories, except the "", are “Midgets” and “Gothic”. Precious.

Let’s see how many videos each user uploaded. In any user-generated data from the internet, there is a big chance that, say, 80% of records come from just a few users, and you need to know that:

data %>%
  count(uploader) %>%
  select(n) %>%
  summary()
##        n           
##  Min.   :    1.00  
##  1st Qu.:    1.00  
##  Median :    2.00  
##  Mean   :    8.87  
##  3rd Qu.:    4.00  
##  Max.   :32921.00

OK, so 50% of users uploaded 1-2 videos, but the distribution is so skewed towards extreme values that the average is almost 9 videos per user, and we can see a single user uploading over 32K videos! Who is this person?!

data %>%
  count(uploader) %>%
  arrange(-n)
## # A tibble: 88,603 x 2
##                                    uploader     n
##                                       <chr> <int>
##  1                                     <NA> 32921
##  2 32b6ecffbe885f8940a8da79384ddb6ad8acb6b8  5408
##  3 0ba9719a8a43ef6304dc5e2ee4eabc8db4b2b8b7  4199
##  4 b1bbcd4a4047deee60667e368373cc8c9233a743  2810
##  5 8705925d7a97b3324f93ecad8a7d2370ea0584a7  1708
##  6 1392903d92ec7d56b297ad947fdb75f380d6608d  1707
##  7 310d4bdb2d17aba8cceaa03aa50a20be0aa287ba  1691
##  8 3c786264009d499f4d67dea3487c0ec37dea5fc0  1676
##  9 047304dd76a44569ec3a51da2a19f1360530a858  1613
## 10 c4a5fb01a6322b4add53eee4ffd3fdfec755bcc5  1609
## # ... with 88,593 more rows

Ah, yes. The NA user. Let this be a lesson to us all: check for missing data first! The following plot is based on this:

library(ggplot2)
library(reshape2)

bar_missing <- function(x){
  x %>%
    is.na %>% # logical matrix: is each entry missing?
    melt %>% # long format: (Var1 = row, Var2 = column, value = TRUE/FALSE)
    ggplot(data = ., aes(x = Var2)) +
    geom_bar(aes(y = (..count..), fill = value), alpha = 0.7) +
    scale_fill_manual(values = c("skyblue", "red"), name = "",
                      labels = c("Available", "Missing")) +
    theme_minimal() +
    ggtitle("Missing Data in Columns") +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5),
          plot.title = element_text(hjust = 0.5)) +
    labs(x = "Variables in Dataset",
         y = "Observations") + coord_flip()
}

bar_missing(data)

OK, so besides uploader we see that many videos are missing nb_comments. About half have no description. We also see some videos have no channels - which is our tag variable, our label. Let’s remove those, since they’re so few:

library(magrittr)

data %<>%
  filter(!is.na(channels) & channels != "[]")

dataLong %<>%
  filter(!is.na(tag) & tag != "")

Let’s see how our tags or categories behave in relation to the metrics we have: nb_views, nb_votes and runtime:

tagMetrics <- dataLong %>%
  group_by(tag) %>%
  summarise_at(vars(nb_views, nb_votes, runtime),
               funs(min, max, median, length))
colnames(tagMetrics)[ncol(tagMetrics)] <- "nVideos"

See the top 25 categories with the largest median no. of views:

tagMetrics %>%
  arrange(-nb_views_median) %>%
  slice(1:25) %>%
  mutate(tag = factor(tag, tag)) %>%
  ggplot(aes(x = tag, y = nb_views_median)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 25 Categories by Median No. of Views") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5))

So interestingly the most watched categories are: “Grannies”, “Korean”, “Old+Young”, “Midgets” and “Matures”! I’d say these are somewhat out-of-the-ordinary categories. What if we took only the top 25 categories with the highest no. of videos?

tagMetrics %>%
  arrange(-nVideos) %>%
  slice(1:25) %>%
  arrange(-nb_views_median) %>%
  mutate(tag = factor(tag, tag)) %>%
  ggplot(aes(x = tag, y = nb_views_median)) +
  geom_bar(stat = "identity") +
  ggtitle("Median No. of Views in Top 25 Categories by No. of Videos") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5))

Note that except for “Matures”, none of the top-viewed categories appear in here. Also see that although “Gays” and “Men” are in the top 25 categories by no. of videos in this dataset - they have a very small median no. of views.

So there seems to be an inverse relation here: a few videos from eccentric categories receive high traffic:

library(scales)
ggplot(tagMetrics, aes(nVideos, nb_views_median, label = tag)) +
  geom_point() +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  geom_smooth() +
  geom_point(data = tagMetrics %>% filter(tag == 'Men'), colour="red", size=5) +
  geom_text(hjust = 0.5, vjust = -0.6) +
  ggtitle("Category Median No. of Views Vs. No. of Videos") +
  theme(plot.title = element_text(hjust = 0.5))

There’s definitely a decreasing trend here: the more videos a category has, the fewer views per video. Note again the category “Men”, with a huge number of videos but very few views. This, I’m guessing, is because the audience for such videos (gay men?) isn’t targeted by xHamster.

Let’s see some more metrics:

library(gridExtra)

barplotTopNTags <- function(metric, topN = 10) {
  tagMetrics %>%
    arrange_(paste0("desc(", metric, ")")) %>%
    slice(1:topN) %>%
    mutate(tag = factor(tag, tag)) %>%
    ggplot(aes_string(x = "tag", y = metric)) +
    geom_bar(stat = "identity") +
    xlab("") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

plotsList <- list(barplotTopNTags("nb_views_median"),
                  barplotTopNTags("nb_votes_median"),
                  barplotTopNTags("runtime_median"))

do.call(grid.arrange,  plotsList)

I see here that the no. of views variable is somewhat correlated with the no. of votes - makes sense. Another finding which makes sense: “Double Penetration”, “Gangbang” and “Threesomes” are the categories with the longest videos. The median duration for these categories is 17-21 minutes.

That was fun. But what about the relations between categories? We can build the co-occurrence matrix of categories to learn which categories “go with” which other categories, in terms of content:

tagMat <- dataLong %>%
  select(id, tag) %>%
  table() %>% # 0/1 incidence matrix: videos x tags
  crossprod() # t(M) %*% M: tags x tags co-occurrence counts

diag(tagMat) <- 0

dim(tagMat)
## [1] 92 92
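
To make this concrete: table() turns the (id, tag) pairs into a 0/1 incidence matrix with one row per video and one column per tag, and crossprod() then counts, for every pair of tags, in how many videos they appear together. A toy sketch:

toy <- data.frame(id = c(1, 1, 2), tag = c("A", "B", "A"))
crossprod(table(toy))
# entry ["A", "B"] is 1 - one video carries both tags; the diagonal
# holds each tag's total count, which is why we zero it out above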

How to plot this matrix? One way is a Chord Diagram, and the chorddiag package has a really nice interactive version:

library(chorddiag)
chorddiag(tagMat, showTicks = FALSE, groupnameFontsize = 10, groupnamePadding = 5)

Alas, we can get very little out of this busy plot. One way to improve it is to keep only the top N categories:

topN <- 25

topNTags <- dataLong %>%
  count(tag) %>%
  arrange(-n) %>%
  slice(1:topN) %>%
  select(tag) %>%
  unlist()

tagMatSmall <- dataLong %>%
  filter(tag %in% topNTags) %>%
  select(id, tag) %>%
  table() %>%
  crossprod()

diag(tagMatSmall) <- 0

chorddiag(tagMatSmall, showTicks = FALSE, groupnameFontsize = 14, groupnamePadding = 5)

Somewhat better, but not quite. We can see the “brave” bond between the categories “Men” and “Gays”. Maybe the reason is simply the site’s configuration: only if you tag a video as “Gays” does the additional “Men” category become available. But note there is no corresponding “Women” category, and that the “Lesbians” category is connected to all other categories. Obviously the “Lesbians” category is for straight men. We can also see strong links between the category “Amateur” and “Teens”, “Blowjobs”, “Webcams” and “Masturbation”.

Let’s face it, I don’t like Chord Diagrams. Let’s use a simple network graph:

library(igraph)
library(visNetwork)

g <- graph.adjacency(tagMat, weighted = TRUE, mode = 'undirected')

edges <- data.frame(get.edgelist(g), width = log(E(g)$weight + 1), width2 = E(g)$weight)
colnames(edges) <- c("from", "to", "width", "width2")
nodes <- data.frame(id = names(V(g)))

totalVideos <- dataLong %>%
  count(tag) %>%
  select(n) %>%
  unlist()

nodes$value = totalVideos
nodes$title <- paste0("Tag: ", nodes$id, "<br>", "#Videos: ", totalVideos)
nodes$label <- nodes$id

nodes$font.size <- 50

visNetwork(nodes, edges, height = "600px") %>%
  visIgraphLayout() %>%
  visInteraction(keyboard = TRUE,
                 dragNodes = T, 
                 dragView = T, 
                 zoomView = T)

This is beautiful. We see a huge blob of categories, then some outliers: “Gays”, “Men”, “BlackGays”, “Ladyboys” and “Shemales”. And notice that one of the categories connecting the “outlier” categories to the huge blob is “Bisexuals”. If you look closely at the peripheral categories surrounding the blob, you’ll get another interesting insight: these are categories like “Czech”, “Korean”, “Turkish” - obviously because these are not “main stream” categories, in the sense that they appeal to a very specific taste, and there is little overlap between them - e.g. very few videos are tagged both “Czech” and “Korean”. Let’s remove “thin” edges - i.e. edges with 100 or fewer common videos - to see a better picture:

edges %<>% filter(width2 > 100)

visNetwork(nodes, edges, height = "600px") %>%
  visIgraphLayout() %>%
  visInteraction(keyboard = TRUE,
                 dragNodes = T, 
                 dragView = T, 
                 zoomView = T)

Gorgeous! The “specific taste” phenomenon is better shown, and you can see some additional interesting mini-clusters, e.g.: “Hentai” and “Cartoons”, “Gothic” and “Emo”, and “Midgets” are on their own.

The Road to Prediction

Well, all this was very nice, but let us now turn to more “serious” stuff: I wonder, is it possible to predict a video’s categories or tags from its title? My first approach will be using the title’s words as features, a.k.a. Bag of Words. My second approach will be Feature Hashing of the titles. After extracting the features with both methods I will run an excellent off-the-shelf classifier, with a great package in R that can handle huge, sparse datasets: XGBoost. I will use this classifier in a Multi-label Classification framework, where each sample can have more than a single label (or tag or category). And I will implement two approaches: Binary Relevance and Classifier Chains. For an excellent review of these and other more modern approaches see here.

I’m going to allow myself one small shortcut: I am only going to perform the following analysis on the top 25 categories above. This is mainly in order to speed things up, as I need my beauty sleep. But do not worry, the top 25 categories still cover over 90% of the data:

dataLong %>%
  filter(tag %in% topNTags) %>%
  select(id) %>%
  unlist() %>%
  unique() %>%
  length() / nrow(data)
## [1] 0.9140897

Feature Engineering - Bag of Dirty Words

What are the most common words in titles? Let’s use the amazing tidytext package:

library(tidytext)

data %>%
  select(title) %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word)) %>%
  anti_join(stop_words, "word") %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 100,991 x 2
##       word     n
##      <chr> <int>
##  1     hot 31606
##  2    cock 26530
##  3    girl 24331
##  4    teen 23875
##  5     ass 22666
##  6    fuck 22370
##  7   pussy 20686
##  8     cum 20641
##  9 amateur 20545
## 10  fucked 20150
## # ... with 100,981 more rows

Lovely. You can also see here that after keeping only words with letters (no digits), and removing Stop Words we are left with a vocabulary of ~100K words.

Really? 100K words in porn titles?! Well, first you should consider Stemming, where the words “fuck” and “fucked” would become one (see the short sketch below). Second, there are many, many singleton words:

data %>%
  select(title) %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word)) %>%
  anti_join(stop_words, "word") %>%
  count(word) %>%
  select(n) %>%
  summary()
##        n           
##  Min.   :    1.00  
##  1st Qu.:    1.00  
##  Median :    1.00  
##  Mean   :   24.37  
##  3rd Qu.:    4.00  
##  Max.   :31606.00

50% of the words appear only once! E.g. “aaa” and “aaaa” - that’s the Internet for you. So I think we can safely remove words which occur below a certain frequency - such words will have little predictive power later on, they may cause overfitting, and they will surely slow us down.
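
As for the stemming point, a minimal sketch with the SnowballC package (which I don’t actually apply in this analysis):

library(SnowballC)

wordStem(c("fuck", "fucked", "fucking"), language = "english")
# all three collapse to the stem "fuck"

Here, though, I’ll settle for dropping the rare words: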

allowedWords <- data %>%
  select(title) %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word)) %>%
  anti_join(stop_words, "word") %>%
  count(word) %>%
  filter(n >= 10) %>%
  select(word) %>%
  unlist()

length(allowedWords)
## [1] 13970

By taking only words with 10 or more appearances, we cut over 85% of the words or features, down to ~14K. Now using only the top 25 categories:

allowedWords <- dataLong %>%
  filter(tag %in% topNTags) %>%
  select(id, title) %>%
  unique() %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word)) %>%
  anti_join(stop_words, "word") %>%
  count(word) %>%
  filter(n >= 10) %>%
  select(word) %>%
  unlist()

length(allowedWords)
## [1] 13038

We’re down to ~13K words or features. If you’re worried, you can always perform the rest of the analysis with 100K words and see if you lost any information.

Now we have ~700K videos with ~13K words or features. XGBoost needs a matrix of video titles or samples (rows) over words or features (columns), in which every entry [i, j] is the no. of appearances of word j in video title i. There are many weighting methodologies for these entries, e.g. Tf-Idf, which I won’t go into (though see the short sketch below). You could also go beyond 1-gram tokens in your Bag of Words and add 2-gram and 3-gram tokens. But a regular “Document Term Matrix” of 700K rows and 13K columns would be huge in memory, so we cast it as a sparse matrix:

dtm <- dataLong %>%
  filter(tag %in% topNTags) %>%
  select(id, title) %>%
  unique() %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(word %in% allowedWords) %>%
  group_by(id, word) %>%
  summarise(n = n()) %>%
  cast_sparse(id, word, n)

dim(dtm)
## [1] 689583  13038

If you’re following along you might have expected to see ~717K rows, not ~690K. This is because after allowing only specific words, many video titles had nothing left and were discarded. This is a general problem with this Feature Engineering approach which we won’t encounter in the Feature Hashing approach - more on that later.
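
As for the Tf-Idf weighting mentioned above - with tidytext it would be a near one-liner. A sketch only; I’m sticking with raw counts here:

dataLong %>%
  filter(tag %in% topNTags) %>%
  select(id, title) %>%
  unique() %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(word %in% allowedWords) %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) %>% # adds tf, idf and tf_idf columns
  cast_sparse(id, word, tf_idf) # a Tf-Idf weighted DTM instead of raw counts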

Let us choose training, validation and test sets. We’ll use the training set (500K) to train the model, and the validation set (100K) as an interim test set to help us tune parameters and see interim realistic results. We won’t touch the test set (89K) until the very end.

trainN <- 500000
validN <- 100000
set.seed(42)
trainSamples <- sort(sample(nrow(dtm), trainN))
validSamples <- sort(sample((1:nrow(dtm))[-trainSamples], validN))
testSamples <- sort((1:nrow(dtm))[-c(trainSamples, validSamples)])

dtmTrain <- dtm[trainSamples, ]
dtmValid <- dtm[validSamples, ]
dtmTest <- dtm[testSamples, ]

dim(dtmTrain)
## [1] 500000  13038
dim(dtmValid)
## [1] 100000  13038
dim(dtmTest)
## [1] 89583 13038

Next, for each category we need a label column holding 1 if the video was tagged with that category and 0 otherwise (that’s how XGBoost likes its labels). That’s easy in the tidyverse:

labelColumns <- dataLong %>%
  select(id, tag) %>%
  mutate(temp = 1) %>%
  spread(tag, temp, fill = 0) %>%
  filter(id %in% as.numeric(rownames(dtm))) %>%
  mutate(setType = ifelse(id %in% as.numeric(rownames(dtmTrain)), "train",
                          ifelse(id %in% as.numeric(rownames(dtmValid)), "valid", "test")))

Note I also added a column setType holding the relevant set for a given video (row). Let’s check that one of these columns holds what we expect:

labelColumns["Men"]
## # A tibble: 689,583 x 1
##      Men
##    <dbl>
##  1     0
##  2     0
##  3     0
##  4     1
##  5     0
##  6     0
##  7     0
##  8     0
##  9     1
## 10     1
## # ... with 689,573 more rows

Now we can finally model using the xgboost package. I won’t go into what Gradient Boosting does - if you don’t know it by now, you really should. Also, xgboost has many, many parameters you should calibrate via cross validation, like in this wonderful SO answer (see the sketch below). I am going to use a few parameters that worked well, and in fact use the validation set only to help me know when to stop growing more and more Gradient Boosted Trees.
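
Such a calibration round would look something like this with xgb.cv - a sketch only, assuming the xgbParams list and xgbTrain DMatrix we build in a minute; in practice you would loop this over a grid of parameter values:

cvRes <- xgb.cv(params = xgbParams, data = xgbTrain,
                nrounds = 1000, nfold = 5,
                early_stopping_rounds = 50, verbose = FALSE)
cvRes$best_iteration # best no. of trees for this parameter setting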

Let’s check that our data interfaces well with xgboost using a single tag or category. I like the “Men” category not only because I like men :) but also because it should be relatively easy to classify on this mostly straight porn website. We will use the AUC metric for binary classification - again out of scope, you just need to know that 0.5 is “bad” and 1 is “hamazing”:

library(xgboost)

getLabelsColumn <- function(tag, type) {
  labelColumns %>%
    filter(setType == type) %>%
    select_(tag) %>%
    unlist() %>%
    unname()
}

labelTrain <- getLabelsColumn("Men", "train")
labelValid <- getLabelsColumn("Men", "valid")

xgbTrain <- xgb.DMatrix(dtmTrain, label = labelTrain)
xgbValid <- xgb.DMatrix(dtmValid, label = labelValid)

xgbParams <- list(objective = "binary:logistic",
                eval_metric = "auc",
                max_depth = 5, # max depth of each boosted tree
                eta = 1,
                gamma = 0,
                subsample = 0.5, # use only 0.5 of samples to speed things up
                min_child_weight = 1)

watchlist <- list(train = xgbTrain, test = xgbValid)

xgbFit <- xgb.train(data = xgbTrain, params = xgbParams,
                    nrounds = 5000, # max number of trees to build
                    verbose = TRUE,
                    print_every_n = 100,
                    early_stopping_rounds = 100, # stop if no improvement in validation within 100 trees
                    watchlist = watchlist)
## [1]  train-auc:0.636599  test-auc:0.636987 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 100 rounds.
## 
## [101]    train-auc:0.898666  test-auc:0.898183 
## [201]    train-auc:0.914095  test-auc:0.911745 
## [301]    train-auc:0.923911  test-auc:0.920761 
## [401]    train-auc:0.928360  test-auc:0.924743 
## [501]    train-auc:0.931149  test-auc:0.927164 
## [601]    train-auc:0.933137  test-auc:0.928428 
## [701]    train-auc:0.934303  test-auc:0.928965 
## [801]    train-auc:0.935542  test-auc:0.929991 
## [901]    train-auc:0.936404  test-auc:0.930556 
## [1001]   train-auc:0.937220  test-auc:0.930821 
## [1101]   train-auc:0.937810  test-auc:0.931635 
## [1201]   train-auc:0.938316  test-auc:0.931795 
## [1301]   train-auc:0.938820  test-auc:0.931920 
## [1401]   train-auc:0.939393  test-auc:0.932279 
## [1501]   train-auc:0.939809  test-auc:0.932360 
## [1601]   train-auc:0.940174  test-auc:0.932789 
## [1701]   train-auc:0.940491  test-auc:0.932930 
## [1801]   train-auc:0.940800  test-auc:0.932921 
## Stopping. Best iteration:
## [1736]   train-auc:0.940696  test-auc:0.933062

As expected we easily got to a training AUC of 0.94 and a validation AUC of 0.93, meaning things are working OK for this Feature Engineering approach! We predict probabilities of videos being tagged as “Men”. If you want to see the performance in terms of Precision/Recall with a certain threshold on these probabilities (I use the training prevalence, mean(labelTrain)):

xgbPred <- predict(xgbFit, xgbValid)

library(caret)

confusionMatrix(ifelse(xgbPred > mean(labelTrain), 1, 0), labelValid, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 72577  3112
##          1  8678 15633
##                                           
##                Accuracy : 0.8821          
##                  95% CI : (0.8801, 0.8841)
##     No Information Rate : 0.8126          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6526          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8340          
##             Specificity : 0.8932          
##          Pos Pred Value : 0.6430          
##          Neg Pred Value : 0.9589          
##              Prevalence : 0.1875          
##          Detection Rate : 0.1563          
##    Detection Prevalence : 0.2431          
##       Balanced Accuracy : 0.8636          
##                                           
##        'Positive' Class : 1               
## 
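
To tie the headline rates back to the matrix (reading off TP = 15633, FN = 3112, FP = 8678):

15633 / (15633 + 3112) # Sensitivity / Recall = TP / (TP + FN) ~ 0.834
15633 / (15633 + 8678) # Pos Pred Value / Precision = TP / (TP + FP) ~ 0.643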

This means that out of all videos tagged as “Men”, we correctly identified 83% (Recall or Sensitivity), and that out of all videos we predicted as “Men”, we are correct 64% of the time (Precision or Positive Predictive Value). Overall Accuracy is 88%. Obviously there’s room for improvement but this is not bad. If you want to see the words or features most “important” in predicting “Men”:

xgb.importance(colnames(dtm), xgbFit)[1:10,]
##     Feature        Gain        Cover    Frequency
##  1: tribute 0.063047420 0.0011631585 0.0025592230
##  2:    mein 0.032269071 0.0012468751 0.0005905899
##  3:     cum 0.022561506 0.0013771666 0.0189644990
##  4: cumming 0.017235297 0.0008517492 0.0030841919
##  5:      cu 0.014442777 0.0024410998 0.0097775445
##  6:     gay 0.014301056 0.0027505376 0.0095806811
##  7: jerking 0.010843805 0.0003835474 0.0013780432
##  8:  cumsho 0.010095323 0.0021006151 0.0041997506
##  9:    cock 0.009654527 0.0013894319 0.0248047772
## 10: cumshot 0.008713220 0.0006948192 0.0031498130

Charming.

Feature Hashing - Hashing for Tags

Now for Feature Hashing. Again, it is out of the scope of this post to fully explain the so-called “Hashing Trick”. Basically, each word in a video’s title is run through a hash function which maps it into one of a huge, fixed number of buckets; each bucket is a single feature, so titles are mapped into a huge feature space automatically and there is no need for feature engineering. We’ll talk about this more after we’ve seen it in action via the FeatureHashing package. Let’s get ourselves the training, validation and test hash matrices (using the same IDs as with Feature Engineering!), and let’s use a huge number of buckets, 2 to the power of 22:

dataTrainValidTest <- dataLong %>%
  select(id, title) %>%
  unique() %>%
  filter(id %in% as.numeric(rownames(dtm))) %>%
  mutate(setType = ifelse(id %in% as.numeric(rownames(dtmTrain)), "train",
                          ifelse(id %in% as.numeric(rownames(dtmValid)), "valid", "test")))

dataTrain <- dataTrainValidTest %>% filter(setType == "train")
dataValid <- dataTrainValidTest %>% filter(setType == "valid")
dataTest <- dataTrainValidTest %>% filter(setType == "test")

library(FeatureHashing)

hashTrain <- hashed.model.matrix(~ split(title, delim = " "),
                                  data = dataTrain, hash.size=2^22, transpose=FALSE)
hashValid <- hashed.model.matrix(~ split(title, delim = " "),
                                  data = dataValid, hash.size=2^22, transpose=FALSE)
hashTest <- hashed.model.matrix(~ split(title, delim = " "),
                                  data = dataTest, hash.size=2^22, transpose=FALSE)
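
To see the mapping itself, FeatureHashing exposes its (Murmur3) hash via hashed.value. A sketch of where a single token lands - my own illustration, not part of the pipeline:

hashed.value("blonde") %% 2^22
# the bucket index for the token "blonde"; different tokens may collide,
# but with 2^22 buckets and ~13K real words collisions are rare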

Now XGBoost:

xgbTrain <- xgb.DMatrix(hashTrain, label = labelTrain)
xgbValid <- xgb.DMatrix(hashValid, label = labelValid)

watchlist <- list(train = xgbTrain, test = xgbValid)

xgbFit <- xgb.train(data = xgbTrain, params = xgbParams,
                    nrounds = 5000,
                    verbose = TRUE,
                    print_every_n = 100,
                    early_stopping_rounds = 100,
                    watchlist = watchlist)
## [1]  train-auc:0.636321  test-auc:0.636827 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 100 rounds.
## 
## [101]    train-auc:0.909077  test-auc:0.907673 
## [201]    train-auc:0.923861  test-auc:0.920969 
## [301]    train-auc:0.930773  test-auc:0.926190 
## [401]    train-auc:0.934731  test-auc:0.929652 
## [501]    train-auc:0.937647  test-auc:0.931796 
## [601]    train-auc:0.939932  test-auc:0.933326 
## [701]    train-auc:0.941266  test-auc:0.933680 
## [801]    train-auc:0.942781  test-auc:0.934642 
## [901]    train-auc:0.943700  test-auc:0.934689 
## [1001]   train-auc:0.944883  test-auc:0.935182 
## [1101]   train-auc:0.945658  test-auc:0.935465 
## [1201]   train-auc:0.946374  test-auc:0.935627 
## [1301]   train-auc:0.947043  test-auc:0.935980 
## [1401]   train-auc:0.947669  test-auc:0.936255 
## [1501]   train-auc:0.948151  test-auc:0.936234 
## [1601]   train-auc:0.948748  test-auc:0.936432 
## Stopping. Best iteration:
## [1572]   train-auc:0.948605  test-auc:0.936662

AUC is slightly higher, and in fewer iterations - but that’s not even what I want you to appreciate here:

  1. xgboost, in R - not using Apache Spark or h2o, not using a cluster of 3K computers, but a single laptop - performed a classification task on a 0.5M x 2^22 dataset in about 1 minute.
  2. This was simple. We didn’t contemplate which words to use as features and which not. We didn’t consider different weightings. We didn’t do pre-processing of any kind! We just hashed. Furthermore, had we started with Feature Hashing in the first place, we would not have discarded so many samples whose titles were emptied, as happened in Feature Engineering.
  3. The most impressive advantage of Feature Hashing, which you may not appreciate if you haven’t worked on a real-life Text Classifier that should actually work somewhere: with Feature Engineering and the Bag of Words approach, we only know the words we see during the training phase. When a new title comes in, containing words we’ve never seen before, those words are meaningless to our boosted trees. Moreover, for various applications we would have to store the Document Term Matrix somewhere in memory so we have something to compare to. With Feature Hashing, on the contrary, we only need to store the hashing function’s logic.

One disadvantage of Feature Hashing, though: I cannot show you any ranking of the “importance” of words, as in “this word was very helpful in separating the Men videos from the other videos” - once hashed, the features no longer carry word identities. This is a disadvantage common to all Black Box methods.

Prediction! Prediction!

You might have fallen asleep, but now comes the most interesting part, for me at least. We’ve verified we can predict a single category. But how do we predict multiple categories? Again, here is an excellent review of approaches to Multi-label Classification. I will implement two approaches:

  1. Binary Relevance - this is inspired by the “One Vs. All” approach in Multi-class Classification. We train each category, one vs. all, and get (in our case) 25 classifiers for 25 categories. Then we simply aggregate the predicted categories for each video: a video predicted (passing the threshold) to be of categories “Amateur”, “Teens” and “Masturbation” will be a [“Amateur”, “Teens”, “Masturbation”] video. What are the most blatant issues? (a) a video could end up having no categories at all; (b) we assume independence between categories, which, as we’ve seen in our xxxploration, is completely not the case here.

  2. Classifier Chains - just a bit more sophisticated. We still train 25 classifiers for 25 categories, but we do it in a chain, where the classifier for category X takes as additional input the observed labels of the categories preceding X. Thus the dependence between categories may be exploited. During prediction, the classifiers predict in the same order, using the predictions of previous classifiers as input. Again, two issues: (a) a video could still end up having no categories; (b) we need some ordering on the categories. This can be done using cross validation on random permutations of categories; in our case I’ll use a simple hierarchical clustering of the categories on our tagMatSmall co-occurrence matrix to give me a reasonable ordering.

A few more notes:

  • The issue where a video could end up being classified to zero categories can be avoided by adding another layer: if this occurs for a video, we assign the single category with the highest probability (or the highest probability relative to that category’s threshold) - see the short sketch after this list
  • Why am I not using the mlr package? Because it can’t handle sparse matrices, as far as I know.
  • We need a metric to compare the 2 {Engineering, Hashing} x 2 {Binary Relevance, Classifier Chains} approaches. We’ll use the very strict (subset) accuracy, i.e. the percentage of samples that have all their labels classified correctly. And we’ll use the less strict Hamming Loss, i.e. the fraction of wrong labels out of the total number of labels (which is n samples x q labels, or 89583 * 25 = 2239575 in our test set’s case)
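
Here is that zero-category fallback as a minimal sketch - with a hypothetical probMatrix of predicted probabilities per category; I don’t actually apply this layer below:

fixEmptyPredictions <- function(predMatrix, probMatrix) {
  # find videos for which no category passed its threshold
  empty <- rowSums(predMatrix) == 0
  # for those, turn on the single category with the highest probability
  best <- max.col(probMatrix[empty, , drop = FALSE])
  predMatrix[cbind(which(empty), best)] <- 1
  predMatrix
}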

Binary Relevance

BRSingleClassifier <- function(tag, type) {
  labelTrain <- getLabelsColumn(tag, "train")
  labelValid <- getLabelsColumn(tag, "valid")
  
  # choose features matrices
  featTrain <- if (type == "FE") dtmTrain else if (type == "FH") hashTrain
  featValid <- if (type == "FE") dtmValid else if (type == "FH") hashValid
  featTest <- if (type == "FE") dtmTest else if (type == "FH") hashTest
  
  # first model to know when to stop
  xgbTrain <- xgb.DMatrix(data = featTrain, label = labelTrain)
  xgbValid <- xgb.DMatrix(data = featValid, label = labelValid)
  
  watchlist <- list(train = xgbTrain, test = xgbValid)
  
  xgbFit1 <- xgb.train(data = xgbTrain, params = xgbParams,
                    nrounds = 5000,
                    verbose = FALSE,
                    early_stopping_rounds = 100,
                    watchlist = watchlist)
  
  # second model on all of available data: training + valid
  featAll <- rbind(featTrain, featValid)
  labelAll <- c(labelTrain, labelValid)
  
  xgbFit2 <- xgboost(data = featAll, params = xgbParams, label = labelAll,
                    nrounds = xgbFit1$best_iteration,
                    verbose = FALSE)
  
  # predict on test set
  xgbPred <- predict(xgbFit2, featTest)
  
  return(ifelse(xgbPred > mean(labelAll), 1, 0))
}

The Feature Engineering approach with BR:

predictMatrix_BR_FE <- matrix(0, nrow(dtmTest), length(topNTags))

for (i in 1:length(topNTags)) {
  predictMatrix_BR_FE[, i] <- BRSingleClassifier(topNTags[i], "FE")
  cat(paste("Finished category", i, "out of", length(topNTags)), "\n")
}

The Feature Hashing approach with BR:

predictMatrix_BR_FH <- matrix(0, nrow(dtmTest), length(topNTags))

for (i in 1:length(topNTags)) {
  predictMatrix_BR_FH[, i] <- BRSingleClassifier(topNTags[i], "FH")
  cat(paste("Finished category", i, "out of", length(topNTags)), "\n")
}

Classifier Chains

For Classifier Chains we need an ordering over the categories. As said, I’ve chosen a simple hierarchical clustering on our tagMatSmall co-occurrence matrix. The matrix currently holds the co-occurrence count for every pair of our top 25 categories. So these are similarities, and we need a transformation to make the entries dissimilarities, e.g. subtracting each entry from the max entry:

# similarity --> dissimilarity: subtract each entry from the max entry
hclustObj <- hclust(as.dist(max(tagMatSmall) - tagMatSmall))

plot(hclustObj)

Makes sense! And the ordering from left to right:

topNTagsOrdered <- colnames(tagMatSmall)[hclustObj$order]

topNTagsOrdered
##  [1] "Gays"          "Men"           "Voyeur"        "Facials"      
##  [5] "Blowjobs"      "Cumshots"      "Blondes"       "Anal"         
##  [9] "Hardcore"      "Amateur"       "Teens"         "Matures"      
## [13] "MILFs"         "BBW"           "BigBoobs"      "BlackandEbony"
## [17] "Interracial"   "Brunettes"     "Pornstars"     "Babes"        
## [21] "Lesbians"      "Asian"         "Webcams"       "Masturbation" 
## [25] "SexToys"

We also need to have quick access to the label columns after re-ordering:

getTopLabelColumnsOrdered <- function(type) {
  labelColumns %>%
    filter(setType == type) %>%
    select_(.dots = topNTagsOrdered)
}

labelColumnsOrdTrain <- getTopLabelColumnsOrdered("train")
labelColumnsOrdValid <- getTopLabelColumnsOrdered("valid")
labelColumnsOrdTest <- getTopLabelColumnsOrdered("test")

Now for the Classifier Chains function:

CCSingleClassifier <- function(tag, type, i, predictMatrix) {
  labelTrain <- getLabelsColumn(tag, "train")
  labelValid <- getLabelsColumn(tag, "valid")
  
  # choose features matrices
  featTrain <- if (type == "FE") dtmTrain else if (type == "FH") hashTrain
  featValid <- if (type == "FE") dtmValid else if (type == "FH") hashValid
  featTest <- if (type == "FE") dtmTest else if (type == "FH") hashTest
  
  # append the labels of the categories preceding category i to the features matrices
  if (i > 1) {
    featTrain <- cbind(featTrain, as.matrix(labelColumnsOrdTrain[, 1:(i - 1)]))
    featValid <- cbind(featValid, as.matrix(labelColumnsOrdValid[, 1:(i - 1)]))
    featTest <- cbind(featTest, predictMatrix[, 1:(i - 1)])
  }
  
  # first model to know when to stop
  xgbTrain <- xgb.DMatrix(data = featTrain, label = labelTrain)
  xgbValid <- xgb.DMatrix(data = featValid, label = labelValid)
  
  watchlist <- list(train = xgbTrain, test = xgbValid)
  
  xgbFit1 <- xgb.train(data = xgbTrain, params = xgbParams,
                    nrounds = 5000,
                    verbose = FALSE,
                    early_stopping_rounds = 100,
                    watchlist = watchlist)
  
  # second model on all of available data: training + valid
  featAll <- rbind(featTrain, featValid)
  labelAll <- c(labelTrain, labelValid)
  
  xgbFit2 <- xgboost(data = featAll, params = xgbParams, label = labelAll,
                    nrounds = xgbFit1$best_iteration,
                    verbose = FALSE)
  
  # predict on test set
  xgbPred <- predict(xgbFit2, featTest)
  
  return(ifelse(xgbPred > mean(labelAll), 1, 0))
}

The Feature Engineering approach with CC:

predictMatrix_CC_FE <- matrix(0, nrow(dtmTest), length(topNTagsOrdered))

for (i in 1:length(topNTagsOrdered)) {
  predictMatrix_CC_FE[, i] <- CCSingleClassifier(topNTagsOrdered[i], "FE", i,
                                                 predictMatrix_CC_FE)
  cat(paste("Finished category", i, "out of", length(topNTagsOrdered)), "\n")
}

The Feature Hashing approach with CC:

predictMatrix_CC_FH <- matrix(0, nrow(dtmTest), length(topNTagsOrdered))

for (i in 1:length(topNTagsOrdered)) {
  predictMatrix_CC_FH[, i] <- CCSingleClassifier(topNTagsOrdered[i], "FH", i,
                                                 predictMatrix_CC_FH)
  cat(paste("Finished category", i, "out of", length(topNTagsOrdered)), "\n")
}

We need accuracy and Hamming loss functions:

getTestAccuracy <- function(predictMatrix, topNTagsOrdered) {
  refMatrix <- labelColumns[labelColumns$setType == "test", topNTagsOrdered]
  return(sum(apply(refMatrix == predictMatrix, 1, all)) / nrow(refMatrix))
}

getTestHammingLoss <- function(predictMatrix, topNTagsOrdered) {
  refMatrix <- labelColumns[labelColumns$setType == "test", topNTagsOrdered]
  return(sum(refMatrix != predictMatrix) / prod(dim(refMatrix)))
}
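
A quick toy check of the two metrics:

ref  <- matrix(c(1, 0, 0, 1), 2, 2) # 2 samples x 2 labels
pred <- matrix(c(1, 0, 1, 1), 2, 2) # sample 1 gets one wrong label
sum(apply(ref == pred, 1, all)) / nrow(ref) # subset accuracy: 0.5
sum(ref != pred) / prod(dim(ref)) # Hamming loss: 0.25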

accuracy_BR_FE <- getTestAccuracy(predictMatrix_BR_FE, topNTags)
hamming_BR_FE <- getTestHammingLoss(predictMatrix_BR_FE, topNTags)

accuracy_BR_FH <- getTestAccuracy(predictMatrix_BR_FH, topNTags)
hamming_BR_FH <- getTestHammingLoss(predictMatrix_BR_FH, topNTags)

accuracy_CC_FE <- getTestAccuracy(predictMatrix_CC_FE, topNTagsOrdered)
hamming_CC_FE <- getTestHammingLoss(predictMatrix_CC_FE, topNTagsOrdered)

accuracy_CC_FH <- getTestAccuracy(predictMatrix_CC_FH, topNTagsOrdered)
hamming_CC_FH <- getTestHammingLoss(predictMatrix_CC_FH, topNTagsOrdered)

metrics_df <- tibble(Approach = c(rep("BR", 2), rep("CC", 2)),
                     Features = rep(c("Engineering", "Hashing"), 2),
                     Accuracy = c(accuracy_BR_FE, accuracy_BR_FH, accuracy_CC_FE, accuracy_CC_FH),
                     HammingLoss = c(hamming_BR_FE, hamming_BR_FH, hamming_CC_FE, hamming_CC_FH))

library(knitr)

kable(metrics_df, digits = 2)
|Approach |Features    | Accuracy| HammingLoss|
|:--------|:-----------|--------:|-----------:|
|BR       |Engineering |     0.06|        0.15|
|BR       |Hashing     |     0.07|        0.15|
|CC       |Engineering |     0.12|        0.12|
|CC       |Hashing     |     0.12|        0.11|

Well now. It depends on how optimistic you are. Look at the last row: according to (subset) accuracy, we succeeded in predicting just 12% of samples exactly right. According to Hamming Loss, we succeeded in predicting 89% of labels :) Obviously there’s room for improvement: a weighting scheme for FE, the number of hash buckets for FH, playing with thresholds for BR, different orderings for CC. We could add a video’s description to the features, use better approaches for Multi-label Classification, etc. If you’d like to see your accuracy go up - really up - you should probably train a Deep Learning network, as categorizing free, short internet text into 92 categories, or even 25, is not an easy task, as you can see. Remember we got an accuracy of 88% just for the “Men” category - once we “add in” 24 other categories we need to get right, the chances of such high accuracy quickly diminish.

Notice the superiority of Classifier Chains over Binary Relevance - it performed almost twice as well on accuracy! And, of course, the magic of Hashing, which under Classifier Chains even improved results slightly.

Coming to a Finish

If you’re still here - this was one of the most fun analyses I have ever done! This is the first time I’ve ever done Multi-label Classification. And yes, it’s because it is about porn. Every simple sanity check to see that I got a title correct, a tag correct, every graph I plotted - simply made me laugh. I’m sorry, I don’t take porn that seriously. And I consider these data as legitimate as any other2. And remember: The Internet is for porn.


  1. Look here for a more elegant way to parse this string if it were pure JSON: http://stackoverflow.com/questions/43960781/extracting-text-between-double-quotation-marks-in-r

  2. You probably shouldn’t choose this one as your class project, though.