As a father of three girls, Facebook decided recently to show me this video, titled “If you have a daughter, you need to see this”:

It shows a mother and daughter gradually removing books from a children’s bookshelf in a typical bookstore. The books they remove are books with only male characters, books in which female characters appear but do not speak, books about princesses, etc. Meanwhile, interesting statistics about gender representation in children’s books are shown. To the viewer’s horror, the mother and daughter are left with a tiny fraction of the books they started with.

I get the message of the video and support it - we need more books about girls. But while I would be the first guy in line to buy his daughter a book about Ada Lovelace or Marie Curie, the video seems to me a bit of an exaggeration. Let me explain why. First, it turns out to be part of a campaign for a new series of children’s books targeted at girls by a brand called “Rebel Girls”. Second, if you read the original paper1 quoted there, which tagged over 5K books for whether females or males appeared in their titles and whether the central character was male or female - you get to see the exact numbers. The exact numbers are not nice, but they’re not “ugly”: the ratio between male and female central characters was found to be 1.6 (3.4K male books vs. 2.1K female books), meaning that for every 2 books about female characters, there are about 3 about male characters.2 Third and last, as a father of three girls who sometimes reads his girls as many as 5 different books a day - the disparity presented in the video is not my experience.

So I’ve decided to scrape Goodreads and use some Machine Learning to see this for myself. As always, if you just want the answer, without the tutorial, skip to the “My Answer” section below.

Scraping like Cinderella

Why Goodreads? First, because I needed many, many books to get a sense of (or replicate) the male-to-female ratio in children’s books. Remember, the original article surveyed thousands of children’s books. Now, if you google “list of children’s books” you either get “Top 100” lists of children’s books, e.g. by Time magazine, or you get a list from Goodreads, titled “Favorite books from my childhood”, which (currently) contains ~3.6K books, sorted by votes. The second reason Goodreads was the choice for me is that there is something democratic about Goodreads lists: any Goodreads reader with an account can vote. While there may be a bias towards male or female preferences if there are simply more male or female voters, it looks a bit fairer to me than a curated list by the Guardian. Lastly, Goodreads has a page for each and every book, which can easily be scraped for the book’s description - the basis for my ML model later on.

So the 3.6K books currently look like this:

Enter the rvest package. With a simple getBooks function and another 2 lines of code, I’m scraping this entire list, spread across (currently) 37 (!) web pages, into a single tidy tibble:

library(tidyverse)
library(stringr)
library(rvest)

startUrl <- "https://www.goodreads.com/list/show/226.Favorite_books_from_my_childhood"

getBookDescription <- function(bookLink) {
  url <- str_c("https://www.goodreads.com", bookLink)
  read_html(url) %>% html_node("#descriptionContainer") %>% html_text() %>% trimws()
}

getBooks <- function(i) {
  cat(i, "\n")
  url <- str_c(startUrl, "?page=", i)
  
  html <- read_html(url)
  
  title <- html %>%
    html_nodes(".bookTitle") %>%
    html_text(trim = TRUE)
  
  author <- html %>%
    html_nodes(".authorName") %>%
    html_text(trim = TRUE) %>%
    discard(str_detect(., "^\\("))
  
  rate <- html %>%
    html_nodes(".minirating") %>%
    html_text(trim = TRUE) %>%
    str_extract_all("[0-9.,]+", simplify = TRUE) %>%
    as_tibble() %>%
    magrittr::set_colnames(c("avg", "nRaters")) %>%
    mutate(nRaters = str_replace_all(nRaters, ",", "")) %>%
    mutate_all(as.numeric)
  
  score <- html %>%
    html_nodes("a") %>%
    html_text() %>%
    discard(!str_detect(., "score: [0-9,]+")) %>%
    str_extract("[0-9,]+") %>%
    str_replace_all(",", "") %>%
    as.numeric()
  
  nVoters <- html %>%
    html_nodes("a") %>%
    html_text() %>%
    discard(!str_detect(., "([0-9,]+ people voted)|(1 person voted)")) %>%
    str_extract("[0-9,]+") %>%
    str_replace_all(",", "") %>%
    as.numeric()
  
  bookLinks <- html %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    discard(!str_detect(., "^/book/show")) %>%
    na.omit() %>%
    unique()
  
  bookDescription <- bookLinks %>%
    map_chr(getBookDescription)
  
  return(tibble(
    title = title,
    author = author,
    rating = rate$avg,
    nRaters = rate$nRaters,
    score = score,
    nVoters = nVoters,
    bookDescription = bookDescription
  ))
}

goodreads <- c(1:37) %>%
  map_dfr(getBooks)

Wait, what just happened here?!

str(goodreads)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3680 obs. of  7 variables:
##  $ title          : chr  "Charlotte's Web" "The Secret Garden" "The Lion, the Witch, and the Wardrobe (Chronicles of Narnia, #1)" "Anne of Green Gables (Anne of Green Gables, #1)" ...
##  $ author         : chr  "E.B. White" "Frances Hodgson Burnett" "C.S. Lewis" "L.M. Montgomery" ...
##  $ rating         : num  4.15 4.12 4.19 4.23 4.04 4.29 4.1 4.22 4.04 4.16 ...
##  $ nRaters        : num  1095909 697044 1599885 537144 1321544 ...
##  $ score          : num  152953 117544 115330 108951 99723 ...
##  $ nVoters        : num  1559 1209 1185 1115 1033 ...
##  $ bookDescription: chr  "This beloved book by E. B. White, author of Stuart Little and The Trumpet of the Swan, is a classic of children"| __truncated__ "When orphaned Mary Lennox comes to live at her uncle's great house on the Yorkshire Moors, she finds it full of"| __truncated__ "They open a door and enter a world NARNIA...the land beyond the wardrobe, the secret country known only to Pete"| __truncated__ "Everyone's favorite redhead, the spunky Anne Shirley, begins her adventures at Green Gables, a farm outside Avo"| __truncated__ ...

The goodreads tibble has 3,680 rows, one for each book (though see later). Each row contains:

  • The book’s title
  • The author name
  • The average rating of the book on Goodreads
  • nRaters: Number of raters
  • The book’s score on this specific list (this is used to rank the books in the context of this specific list)
  • nVoters: Number of people who voted for this book (in the context of this specific list)
  • bookDescription: the book’s long character description as it appears in the book’s page

Is this magic? No, it’s rvest.

Classifying Classics

The authors of the original paper used by Rebel Girls had at their disposal “multiple coders”3 to go over thousands of books and tag each one for whether the main character is female or male. I, on the other hand, cannot (and will not!) tag thousands of books, for I have a life4. So I’m going to use… Machine Learning!

The strategy is: tag only a small subset of the goodreads sample as having a “Male” central character (e.g. Where the Wild Things Are), a “Female” one (e.g. Anne of Green Gables) or “Other” (“Other” meaning “I’m not sure” or “there isn’t a single central character; some are male, some female”, e.g. The Chronicles of Narnia). Then, train an ML model on this small sample to classify between these three classes. Then, if the accuracy on a small unseen test sample is high enough, say 80% - use the model to predict a class for what’s left of the entire 3.6K Goodreads books list. It may not be the most accurate thing to do, but it is fast. Moreover, we’re not talking life-saving medicine here; this is only an estimation, and only for my mediocre blog.

I figured I only had patience for manually tagging 200 books, and I decided to tag the first 200. This is problematic because it is not a random sample, and its validity regarding the other 3.4K books is questionable at best. Still, the name of the game is speed, and I knew the first 200 books would be popular books: some would take me a second because I already knew them, and most if not all would have a Wikipedia article. Furthermore, if I get the accurate label for 200 of the books, it might as well be the most popular 200, because at least then I can say something almost without a doubt regarding a substantial sample of the books.

So these are my taggings:

classFirst200 <- c(
  0, 0, 2, 0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 0, 1, 1, 1, 1, 0, 1, 2, 0, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 2, 1, 1, 0, 1, 0, 1, 1, 2, 0, 2, 2, 1, 1, 0, 0, 2, 0, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 2, 1, 2, 0, 2, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 2, 0, 1, 2, 1, 0, 1, 0, 2, 1, 2, 0, 0, 2, 1, 1, 2, 0, 0, 2, 0, 1, 2, 0, 0, 0, 2, 1, 2, 2, 2, 1, 1, 2, 2, 0, 0, 0, 0, 2, 0, 1, 1, 1, 2, 2
)

goodreads$class <- c(classFirst200, rep(NA, nrow(goodreads) - 200))

Where 0 is “Female”, 1 is “Male” and 2 means “Other”. We can already see how the classes of the first 200 books are distributed, according to my judgement:

goodreads %>%
  count(class) %>%
  mutate(total = sum(n), p = n / total)
## # A tibble: 4 x 4
##   class     n total          p
##   <dbl> <int> <int>      <dbl>
## 1     0    66  3680 0.01793478
## 2     1    89  3680 0.02418478
## 3     2    45  3680 0.01222826
## 4    NA  3480  3680 0.94565217

So for the first 200 books we have 89 male books and 66 female, a ratio of about 1.3 to 1. If I understand the original article correctly, books in my “Other” category count towards both “Male” and “Female”, in which case the ratio is (89 + 45) to (66 + 45), or about 1.2. Either way, these top 200 children’s books keep the trend of “more male than female”, but it is somewhat diminished (the original ratio was 1.6 to 1).
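As a quick sanity check, both ratio definitions can be computed directly from the counts above (a minimal sketch; mfRatios is just a throwaway helper of mine, not something used later in the post):

```
# Male-to-female ratio under the two definitions discussed in the text
mfRatios <- function(male, female, other) {
  c(strict = male / female,                         # ignore "Other"
    withOther = (male + other) / (female + other))  # "Other" counts for both
}
round(mfRatios(male = 89, female = 66, other = 45), 2)  # strict ~1.35, withOther ~1.21
```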

Preparing the Machine

Before getting to the actual words in the book description, I thought of a few useful features:

  • Does the title contain a “male” name? (e.g. “Harry Potter”)
  • Does the author have a “female” name? (e.g. Louisa May Alcott)
  • Does the title start with the word “The”? (e.g. “The Secret Garden”)5
  • How many plural words are there in the book description?6
  • And more

To get a list of 1K male and 1K female English names I scraped a website called babble.com:

boys <- "https://www.babble.com/pregnancy/1000-most-popular-boy-names/"
boysNames <- read_html(boys) %>%
  html_nodes("li") %>%
  html_text() %>%
  .[22:1021]

girls <- "https://www.babble.com/pregnancy/1000-most-popular-girl-names/"
girlsNames <- read_html(girls) %>%
  html_nodes("li") %>%
  html_text() %>%
  .[22:1021]

boysGirlsNames <- data.frame(cbind(boysNames, girlsNames))

Then I “engineered” all my additional features:

library(magrittr)

getNBoysGirlsNames <- function(s, boys, firstToken) {
  namesPattern <- if (boys) {
    str_c(boysGirlsNames$boysNames, collapse = "\\b|\\b")
  } else {
    str_c(boysGirlsNames$girlsNames, collapse = "\\b|\\b")
  }
  s2 <- if (firstToken) {
    str_split(s, " ")[[1]][1]
  } else {
    s
  }
  str_count(s2, namesPattern)
}

getNPluralWords <- function(s) {
  sum(str_detect(str_split(s, " ")[[1]], "[A-Za-z]+s$"))
}

isStringStartsWithThe <- function(s) {
  as.numeric(startsWith(s, "The"))
}

goodreads %<>%
  mutate(i_nBoysInAuthorName = map_dbl(author, getNBoysGirlsNames, TRUE, TRUE),
         i_nGirlsInAuthorName = map_dbl(author, getNBoysGirlsNames, FALSE, TRUE),
         i_nBoysInBookDesc = map_dbl(bookDescription, getNBoysGirlsNames, TRUE, FALSE),
         i_nGirlsInBookDesc = map_dbl(bookDescription, getNBoysGirlsNames, FALSE, FALSE),
         i_nBoysInBookTitle = map_dbl(title, getNBoysGirlsNames, TRUE, FALSE),
         i_nGirlsInBookTitle = map_dbl(title, getNBoysGirlsNames, FALSE, FALSE),
         i_gapBoysGirlsAuthor = i_nBoysInAuthorName - i_nGirlsInAuthorName,
         i_gapBoysGirlsBookDesc = i_nBoysInBookDesc - i_nGirlsInBookDesc,
         i_gapBoysGirlsTitle = i_nBoysInBookTitle - i_nGirlsInBookTitle,
         i_nSharpsInBookTitle = map_dbl(title, function(s) str_count(s, "#")),
         i_nDotsInAuthorName = map_dbl(author, function(s) str_count(s, "\\.")),
         i_lengthTitle = map_dbl(title, nchar),
         i_lengthBookDesc = map_dbl(bookDescription, nchar),
         i_lengthAuthorName = map_dbl(author, nchar),
         i_nPluralWordsInBookTitle = map_dbl(title, getNPluralWords),
         i_titleStartsWithThe = map_dbl(title, isStringStartsWithThe))

Furthermore, I felt that for what’s coming I shouldn’t include books whose description is shorter than, say, 100 characters:

goodreads %<>%
  filter(nchar(bookDescription) > 100)

table(goodreads$class, useNA = "always")
## 
##    0    1    2 <NA> 
##   65   88   42 3083

You can see this excluded from my 200 books 1 male book (~1%), 1 female book (~2%) and 3 “Other” books (~7%) - which means I’m down to 195 tagged books. It also excluded 397 books from the rest of the list (~11%). This exclusion doesn’t seem to affect the male/female classes as much as it affects the “Other” class and the rest of the list. It is something to keep in mind.

Now for the train/test split. Since this is such a small sample for a non-trivial task, I figured I’d use almost every book I could get for training, while still being able to assess what I have here. So 90% of the 195 books (175) would go to training:

nTagged <- sum(table(goodreads$class))
trainFrac <- 0.9
trainSample <- sample(nTagged, floor(nTagged * trainFrac))
goodreads %<>%
  mutate(id = row_number(),
         type = ifelse(row_number() %in% trainSample, "train",
                       ifelse(row_number() %in% 1:nTagged, "test",
                              "other")))

This got me a type column which holds, for each book, “train”, “test” or “other”.

Now for the Grand Finale: the actual text features, coming from bookDescription. I’m using the wonderful tidytext package; I’ll explain this long pipe in a second:

library(tidytext)

dataForModeling <- goodreads %>%
  unnest_tokens(word, bookDescription, drop = FALSE) %>%
  count(id, word, sort = TRUE) %>% 
  #don't anti_join stop_words because 'her' and 'him' are there!
  cast_dtm(id, word, n, weighting = tm::weightTf) %>%
  tm::removeSparseTerms(0.999) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("id") %>%
  mutate(id = as.numeric(id)) %>%
  rename(class1 = class, type1 = type) %>%
  inner_join(goodreads %>%
               select(id, starts_with("i_"), class, type), "id")
## Warning: Mangling the following names: <U+0438> -> <U+0438>, <U+0432> -
## > <U+0432>, sie -> sie, it’s -> it<U+0092>s, who’s -> who<U+0092>s, he’s -
## > he<U+0092>s, she’s -> she<U+0092>s, they’re -> they<U+0092>re, can’t -
## > can<U+0092>t, doesn’t -> doesn<U+0092>t, family’s -> family<U+0092>s,
## you’ll -> you<U+0092>ll, <U+043D><U+0435> -> <U+043D><U+0435>, isn’t -
## > isn<U+0092>t, don’t -> don<U+0092>t, america’s -> america<U+0092>s,
## children’s -> children<U+0092>s, world’s -> world<U+0092>s, father’s ->
## father<U+0092>s, here’s -> here<U+0092>s, won’t -> won<U+0092>t, that’s -
## > that<U+0092>s, didn’t -> didn<U+0092>t, there’s -> there<U+0092>s, she’ll
## -> she<U+0092>ll, <U+043A><U+0430><U+043A> -> <U+043A><U+0430><U+043A>,
## park’s -> park<U+0092>s, he’d -> he<U+0092>d, mother’s -> mother<U+0092>s,
## child’s -> child<U+0092>s, junie’s -> junie<U+0092>s. Use enc2native() to
## avoid the warning.
nonWordColIdx <- c(1, which(str_detect(colnames(dataForModeling), "i_")),
                   ncol(dataForModeling) - 1,
                   ncol(dataForModeling))
bkDescWords <- colnames(dataForModeling)[-nonWordColIdx]
colnames(dataForModeling)[-nonWordColIdx] <- str_c("word", 1:length(bkDescWords))

dim(dataForModeling)
## [1] 3278 6808

The pipe goes like this:

  • Take the goodreads bookDescription long string
  • Unnest the unigram tokens, meaning each token (word) gets its own row, for each book, creating a very “long” dataset
  • Count each token for each book (e.g. for “Matilda” the word “her” appears 3 times in its description)
  • Cast this as a sparse Document-Term Matrix (where each book is a row and each token is a column)
  • Remove really sparse tokens, i.e. tokens missing from over 99.9% of book descriptions. In other words, a token must appear in at least 0.001 × 3,278 ≈ 3.3, i.e. 4 or more book descriptions, to be included
  • Join back to the goodreads dataset to create the final dataset, which has 3,278 rows (books) and 6,808 columns (or more accurately 6,806 features, one class variable and one type variable)
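One sanity check on the “remove sparse tokens” step: with removeSparseTerms(0.999), a token survives only if it appears in more than 0.1% of the documents. A quick back-of-the-envelope line (the corpus size is from the dataset above):

```
# Minimum number of book descriptions a token must appear in to survive
# tm::removeSparseTerms(dtm, sparse = 0.999) on a 3,278-document corpus
nDocs <- 3278
ceiling((1 - 0.999) * nDocs)  # 0.001 * 3278 = 3.278, so 4 descriptions
```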

There is also some data munging there to account for the fact that in R, data frame columns cannot conveniently be named things like “he’s”, which is a valid token here. I left the warning for you to see.

Notice I also purposely did not remove stop words, as these include some terribly important gender-related words such as “she”, “herself”, “she’ll”, etc.
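To see why: the stop_words lexicon shipped with tidytext does contain these pronouns, so a blind anti_join would have thrown away the model’s strongest signal. A quick check (assuming tidytext is installed):

```
library(tidytext)

data(stop_words)
c("she", "her", "herself", "he", "his") %in% stop_words$word
## [1] TRUE TRUE TRUE TRUE TRUE
```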

Taking only my 195 tagged books and splitting them to train and test according to the type column defined earlier:

train <- dataForModeling %>%
  filter(type == "train") %>%
  select(-type)

test <- dataForModeling %>%
  filter(type == "test") %>%
  select(-type)

Learning the Machine

As always, for an off-the-shelf classification algorithm I like to use Gradient Boosted Trees, and I like to do it with xgboost. GBTs are fast, easy to understand, and handle complex interactions well while being relatively robust to overfitting. I’m using the xgboost default parameters here, specifying only that I’d like 50 trees:

library(xgboost)

train_matrix <- xgb.DMatrix(data = as.matrix(train[,-c(1, ncol(train))]), label = train$class)
test_matrix <- xgb.DMatrix(data = as.matrix(test[, -c(1, ncol(test))]), label = test$class)

nClasses <- 3
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = nClasses)
nround    <- 50

xgboostModel <- xgb.train(params = xgb_params, data = train_matrix, nrounds = nround)

It’s interesting to see which features were the most “important” (i.e. most used in tree splits):

names <-  colnames(train[,-c(1, ncol(train))])
importance_matrix <- xgb.importance(feature_names = names, model = xgboostModel)
head(importance_matrix, 10) %>%
  left_join(tibble(Feature = colnames(dataForModeling)[-nonWordColIdx], word = bkDescWords), "Feature") %>%
  select(word, Feature, Gain)
##    word                Feature       Gain
## 1   her                 word16 0.11797388
## 2  <NA> i_gapBoysGirlsBookDesc 0.09682513
## 3   she                 word22 0.07617852
## 4   and                  word2 0.06355696
## 5    he                 word26 0.05515927
## 6   the                  word1 0.04940728
## 7  <NA>   i_nGirlsInAuthorName 0.04616141
## 8  they                 word25 0.04488963
## 9   his                 word28 0.03393887
## 10   on                word131 0.02346124

This makes sense. The most important words are gender-related words such as “her”, “she” and “he”. Some of my “engineered” features also made it to the top: i_gapBoysGirlsBookDesc (no. of boys’ names in the book description minus no. of girls’ names) and i_nGirlsInAuthorName (does the author’s first name appear in the girls’ names list).

You can see that for books where i_nGirlsInAuthorName = 1, i.e. the author is most likely a woman, the frequency of female books is, unsurprisingly, much higher:

table(goodreads$i_nGirlsInAuthorName, goodreads$class)
##    
##      0  1  2
##   0 31 83 35
##   1 34  5  7

How does the model do on the 20 books I left aside for testing? We’ll use a simple confusion matrix:

test_pred <- predict(xgboostModel, newdata = test_matrix)

labelNumToChar <- function(n) {
  ifelse(n == 1, "Female",
         ifelse(n == 2, "Male", "Other"))
}

test_prediction <- matrix(test_pred, nrow = nClasses,
                          ncol=length(test_pred)/nClasses) %>%
  t() %>%
  data.frame() %>%
  mutate(labelN = test$class + 1,
         predN = max.col(., "last"),
         maxProb = pmap_dbl(list(X1, X2, X3), max),
         predN_strict = ifelse(maxProb > 0.75, predN, 3),
         label = map_chr(labelN, labelNumToChar),
         pred = map_chr(predN, labelNumToChar))

caret::confusionMatrix(test_prediction$pred,
                test_prediction$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male Other
##     Female     10    0     0
##     Male        0    6     2
##     Other       0    1     1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.85            
##                  95% CI : (0.6211, 0.9679)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.001288        
##                                           
##                   Kappa : 0.7479          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Female Class: Male Class: Other
## Sensitivity                    1.0      0.8571       0.3333
## Specificity                    1.0      0.8462       0.9412
## Pos Pred Value                 1.0      0.7500       0.5000
## Neg Pred Value                 1.0      0.9167       0.8889
## Prevalence                     0.5      0.3500       0.1500
## Detection Rate                 0.5      0.3000       0.0500
## Detection Prevalence           0.5      0.4000       0.1000
## Balanced Accuracy              1.0      0.8516       0.6373

There are a lot of metrics here. Some important ones are:

  • Accuracy: overall the model is correct for about 85% of the unseen books, which is not bad - I wanted at least 80%
  • Pos Pred Value: this is precision, or \(P(correct|predicted)\). When I predict “Female” I’m 100% correct; when I predict “Male” I’m 75% correct; when I say “Other” I’m only 50% correct.
  • Sensitivity: this is recall, or \(P(correct|reality)\). The model correctly predicted 100% of the female books, ~86% of the male books and only 33% (1 out of 3) of the “Other” books.
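These per-class numbers can be recomputed by hand from the confusion matrix above (rows are predictions, columns are reference labels); this is just a sketch to make the definitions concrete:

```
# The test-set confusion matrix, typed in manually
cm <- matrix(c(10, 0, 0,
                0, 6, 2,
                0, 1, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("Female", "Male", "Other"),
                             Reference  = c("Female", "Male", "Other")))

diag(cm) / rowSums(cm)  # precision (Pos Pred Value): 1.00, 0.75, 0.50
diag(cm) / colSums(cm)  # recall (Sensitivity): 1.00, ~0.86, ~0.33
sum(diag(cm)) / sum(cm) # accuracy: 17/20 = 0.85
```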

This may not seem very impressive, but believe you me, for a real-world problem, with only 175 books to train on, that’s pretty good. Furthermore, if the problem were “is it a female book or not”, it seems the model would be perfect,7 because the confusion concentrates on distinguishing between male and “Other” books.

The bottom-line is I’m OK with using this model for guesstimation for my question.

My Answer

We’ll train the model again on all 195 tagged books (no reason to throw away 20 tagged books!):

train_matrix <- xgb.DMatrix(data = as.matrix(rbind(train, test)[,-c(1, ncol(train))]),
                            label = c(train$class, test$class))
xgboostModelAll <- xgb.train(params = xgb_params,
                       data = train_matrix,
                       nrounds = nround)

Predict on all the 3,083 books in the list which are not in the top 195:

unknown <- dataForModeling %>%
  filter(type == "other") %>%
  select(-type)

unknown_matrix <- xgb.DMatrix(data = as.matrix(unknown[, -c(1, ncol(unknown))]))

unknown_pred <- predict(xgboostModelAll, newdata = unknown_matrix)

unknown_prediction <- matrix(unknown_pred, nrow = nClasses,
                          ncol=length(unknown_pred)/nClasses) %>%
  t() %>%
  data.frame() %>%
  mutate(predN = max.col(., "last"),
         maxProb = pmap_dbl(list(X1, X2, X3), max),
         predN_strict = ifelse(maxProb > 0.75, predN, 3),
         pred = map_chr(predN, labelNumToChar))

So for the top 200 we got a male-to-female ratio of 1.3 or 1.2, depending on how you define it.

For the top 500, treating our predictions as “the true class”:

First195Labels <- goodreads %>%
  filter(type != "other") %>%
  select(class) %>%
  transmute(label = map_chr(class, labelNumToChar)) %>%
  unlist()

table(c(unknown_prediction[1:(500 - length(First195Labels)), ]$pred, First195Labels))
## 
## Female   Male  Other 
##    174    219    107

That’s 219/174 ≈ 1.26, or (219 + 107)/(174 + 107) ≈ 1.16, depending on your definition (see the “Classifying Classics” section above).

For the top 1000:

table(c(unknown_prediction[1:(1000 - length(First195Labels)), ]$pred, First195Labels))
## 
## Female   Male  Other 
##    358    452    190

That’s 452/358 ≈ 1.26, or (452 + 190)/(358 + 190) ≈ 1.17.

For all the books in the list:

table(c(unknown_prediction$pred, First195Labels))
## 
## Female   Male  Other 
##   1185   1303    790

That’s 1303/1185 ≈ 1.1, or (1303 + 790)/(1185 + 790) ≈ 1.06.

So we get an answer of somewhere between 1.1 and 1.3 male books to every female book. That’s 10 to 30 percent more male books, and while the trend found in the original article (60% more male books) is diminished - it is still there!

But why am I bothering with the “top 500”, “top 1000” etc.?

Major Limitations

First of all, the quality of the entire list is questionable, especially when you get to the books at the bottom. This happens due to a thing called “The Internet” - as I said, anyone can vote.

  1. “1984” is part of this list. Really? Is 1984 a children’s book?
  2. I see some books in languages other than English, e.g. Arabic and German.
  3. The last 1.6K books have only 1 vote each! I repeat: only 1 person thought they should be there.

That’s why I’m looking at the male-to-female ratio not just for all books, but also for the top 500 and top 1000. The higher a book is ranked, the more I trust its relevance to the list.

The second limitation of this project is, of course, that it is based on an 85%-accurate model. But there we are.

Nonetheless, the somewhat disappointing bottom line is out there: we need more books about girls, though they’re hardly rare.

The Sum

There are two things I liked about this project. First, I again tried to use Data Science to answer a specific question. I’ve moaned in the past about how these days, since it’s so easy, people either abuse their data until they get “something”, or apply the same trick to different datasets, again and again, without telling us what they accomplished or what new insight they brought. For example, I didn’t show you whether there is a correlation between a book’s gender and its rating8 - because it’s not relevant.

Second, I like using Machine Learning to show how I can get the same results without an army of outsource/crowdsource vendors or taggers. I think the approach, and I’ve talked about it wherever I worked, should be “ML First”: first see if ML gets you what you want, especially if what you want is a ballpark estimate, as in this case. Only then use your army of human judges9.

Lastly, 1.6, 1.3 or even 1.1 - isn’t this outrageous? Write more books about girls, my daughters and I will forever be in your debt!


  1. McCabe, J., Fairchild, E., Grauerholz, L., Pescosolido, B. A., & Tope, D. (2011). Gender in twentieth-century children’s books: patterns of disparity in titles and central characters. Gender & Society, 25, 197–226.

  2. The greatest gap was found for books about animals: 2.6, meaning that for every 2 female animal books there were roughly 5 male animal books!

  3. Were these students? Were they paid? Were they told the purpose of the project? They don’t say.

  4. A boring life but a life nonetheless.

  5. My gut feeling is that this correlates with “Other” books

  6. My gut feeling is that this correlates with “Other” books, too

  7. Ain’t no such thing. Not with 175 books. Be wary of perfect models.

  8. I’ll save you the trouble: there isn’t, at least not here

  9. And even then use them wisely, which may be the topic of one of the coming posts