RuPaul’s Drag Race happens to be my second best reality TV show. Can you guess the first? Oh, I’m sure I’ll get to that eventually. Anyway. For 9 seasons (not including two All Stars seasons) I have been fascinated by the wittiness and artistry of the Drag Queens who compete in the show, in order to become “America’s Next Drag Superstar”. But how can I use Queen Data to show RPDR fans something that has not been shown before? It’s time for… Network Analysis Extravaganza!!!1

Tweet Tweet

What is my goal here? I wish to use the Twitter API to get for each queen a list of other queens she follows, explore this network some with the usual network metrics, and plot it in a way which might prove useful. And I want it all done in R. I’ll use the wonderful rtweet package to send simple queries to Twitter, e.g. How many followers a queen has, when did she join twitter, how many tweets she twitted, etc. But first things first - I need a list of the Twitter usernames of all 113 queens!

queens_usernames <- c("PorkChopLA", "thetammiebrown", "DragAkashia", "SotomayorJade", "ongina", "ShannelOfficial", "BeckyGlasscock", "DJNinaFlowers", "BeBeZaharaBenet", "itsSHANGELA", "nicolepbrooks", "MystiqueSummers", "xoSonique", "morganmcmichael", "SaharaDavenport", "JessicaWild88", "ThePandoraBoxx", "TATIANNANOW", "jujuboston", "RavenHUNTY", "TyraSanchez", "venusdlite", "Phoenix_atl", "MimiImfurst", "Indiaferrah", "MUG4DAYZ", "StacyLMatthews", "THEEEDeltaWork", "Carmen_Carrera", "yarasofiapr", "AlexisMateo79", "manilaluzon", "sutanamrull", "alisasummers", "LashauwnBeyond", "TheDragPrincess", "MadameLaQueer", "dwaynemilan", "JigglyCaliente", "willam", "DiDaRitz", "KenyaMichaels", "LatriceRoyale", "PhiPhiOhara", "ChadMichaels1", "SHARON_NEEDLES", "PennyTration", "SerenaChaCha", "MonicaBHillz", "honeymahogany", "ViviennePinay", "LineyshaSparx", "QueenJadeJolie", "Ivy_Winters", "AlyssaEdwards_1", "cocomontrese", "TheOnlyDetox", "RoxxxyAndrews", "Alaska5000", "JinkxMonsoon", "thekellymantle", "magnoliaisme", "VivaciousNYC", "ApriLcarrion", "GiaGunn" , "bigandmilky", "LaganjaEstranja", "trinitykbonet", "Joslyn_Fox", "bendelacreme", "dariennelake", "courtneyact", "AdoreDelano", "TheBiancaDelRio", "TempestDujour", "SashaBelle3", "JasMasters76", "KashaDavis", "kandy_ho", "MAXcollective", "jaidynnfierce", "MissFameNYC", "trixiemattel", "katya_zamo", "KENNEDYtheDOLL", "pearliaison", "TheGingerMinj", "VioletChachki", "LailaMcQueen", "Daxclamation", "lee_fontaine", "nayshalopez", "acidbettyrocks", "TheRobbieTurner", "ThorgyThor", "DerrickBarry", "ChiChiDeVayne", "KimChi_Chic", "naomismallsduh", "thatonequeen", "JaymesMansfield", "kimorablac", "charliehidestv", "eurekaohara", "ajaqueen", "AlexisLives", "farrahrized", "atlsexyslim", "Peppermint247", "sasha_velour", "SheaCoulee", "TrinityTheTuck", "AllOfValentina")

Uh… how did that happen? I hate to break it to you: yes, I spent two hours on Twitter hunting down each and every one of these names! (What I’d do for data) We could theoretically get a list of the 113 queens from Wikipedia, then use rtweet‘s function lookup_users on the queens’ names. Unfortunately not all queens will use a concatenation of their names as either their Twitter name or username, and you’ll end up with many missing usernames. An even worse issue is that in some cases you’ll end up with the wrong person! Take a look at Bianca Del Rio - Hunty2, that ain’t Bianca!

OK, let’s get these queens some data:

library(rtweet)
rupaul <- lookup_users(queens_usernames)

dim(rupaul)
## [1] 112  36

We’re missing a queen! That’s because Milan from Season 4 may have deleted her account. It’s still pretty amazing that 99% of these queens have a Twitter account. Now you can see that the rupaul data.frame has 36 columns. Let’s look at a few interesting ones:

head(rupaul[, c(1:4, 7:8, 16)])
## # A tibble: 6 x 7
##     user_id            name     screen_name      location followers_count
##       <dbl>           <chr>           <chr>         <chr>           <int>
## 1 523839746 PorkChop Parker      PorkChopLA   Los Angeles           23060
## 2 289215998    Tammie Brown  thetammiebrown          <NA>           41277
## 3 634335727         Akashia     DragAkashia          <NA>             923
## 4 176642704  Jade Sotomayor   SotomayorJade       Chicago           23827
## 5  22226534          Ongina          ongina   LA LA Land!           73162
## 6 794786977         Shannel ShannelOfficial Las Vegas, NV           37364
## # ... with 2 more variables: friends_count <int>, statuses_count <int>

You can see that each queen has a user_id, a name and a screen_name - that’s 3 different identifiers you don’t want to mix. In a moments when we’ll ask for a queen’s friends you’ll see we’ll get them as user_ids. Also you can see every queen lists her own version of location, e.g. “LA LA Land”, so if you’d like to use their origin in some analysis you might need a different source of data. Lastly, followers_count is the number of Twitter followers for this queen, friends_count is the number of Twitter users she follows and statuses_count is the number of Tweets she has twitted to date. Let’s look at the distribution of these measures:

par(mfrow = c(1, 3))
boxplot(rupaul$followers_count, main = "#Followers Distribution")
boxplot(rupaul$friends_count, main = "#Friends Distribution")
boxplot(rupaul$statuses_count, main = "#Tweets Distribution")

We can see all distributions are right-tailed, and that a “median” queen would have ~40K followers, would follow ~600 people and would tweet ~6K(!) tweets.

Just out of curiousity, who are the top and bottom queens?

library(dplyr)

get_top_and_bottom <- function(column_name) {
  rupaul %>%
  select_("screen_name", column_name) %>%
  arrange_(column_name) %>%
  slice(c(1, n()))
}

get_top_and_bottom("followers_count")
## # A tibble: 2 x 2
##       screen_name followers_count
##             <chr>           <int>
## 1     DragAkashia             923
## 2 TheBiancaDelRio          323642
get_top_and_bottom("friends_count")
## # A tibble: 2 x 2
##     screen_name friends_count
##           <chr>         <int>
## 1 KenyaMichaels             2
## 2 AlexisMateo79         14462
get_top_and_bottom("statuses_count")
## # A tibble: 2 x 2
##      screen_name statuses_count
##            <chr>          <int>
## 1  KenyaMichaels             10
## 2 ThePandoraBoxx          53408

So Akashia from Season 1 has the least followers, and Kenya Michaels from Season 4 follows only 2 people and has tweeted only 10 tweets. Similarly Bianca from Season 6 has the most followers, Alexis Mateo from Season 3 follows the most people (Over 14K, Queen? Really?) and Pandora Boxx from Season 2 has tweeted the most (53K tweets over 9 years - that’s 16 tweets a day, every day, and if she sleeps 8 hours a day, that’s 1 tweet every single hour!)

Why am I telling you this? First you always want to explore the data just a bit before starting any analysis, even in ways you’re sure are irrelevant to what you’re doing. Data is Dirty, in various and unexpected ways. Secondly, you might not have felt it but we’ve just stumbled upon a tiny problem. We’d like to get each queen’s list of users she follows (a.k.a “firends”), but the default behavior of the get_friends function in rtweet is to give a maximum of 5K users. And some queens have over 14K! That is why we shall need a wrapper function for get_friends:

queens_user_ids <- rupaul$user_id %>% as.numeric()
queens_user_names <- rupaul$screen_name %>% unlist

get_queen_friends <- function(queen) {
  friends <- get_friends(queen)
  page <- next_cursor(friends)
  first_time_loop <- TRUE
  while(page != "0") {
    first_time_loop <- FALSE
    more_friends <- get_friends(queen, page = page)
    friends <- rbind(friends, more_friends)
    
    page <- if (first_time_loop) {
      next_cursor(friends)
    } else {
      next_cursor(more_friends)
    }
    
    
  }
  friends_user_id <- friends$user_id %>% as.numeric()
  queen_friends <- queens_user_names[queens_user_ids %in% friends_user_id]
  return(queen_friends)
}

I won’t go into the tiny details of this function, cunsult the rtweet documentation if something is unclear. This function basically get’s a queen’s friends (as numeric user_ids). If the queen has over 5K friends it gets the other friends in additional batches of up to 5K until there are no more friends. It then filters only those friends who are RPDR queens and returns a vector of their usernames.

Remember, our goal is to have for each queen a list of queen friends. We’d now want to apply this function for each queen, e.g. in rupaul %>% mutate(queen_friend = map(user_id, get_queen_friends)). Alas, we encounter another limitation: the Twitter API won’t let you fetch so many data at once, without further tweaking. Once you exceed your Rate Limit, you’ll have to wait for 15 minutes until it renews. I was able to call the function on only 10 or so queens at a time. So you have 3 options at this stage:

  1. Further tweaking… (out of scope)
  2. Get the queens friends in steps of ~10, waiting 15 minutes between each batch, as in:
library(purrr)
library(tidyr)

step <- 10
queens_edges <- NULL
num_steps <- floor(nrow(rupaul) / step)
for (i in 1:num_steps) {
  sliceRange <- (step * (i - 1) + 1):(step * i)
  queens_edges_temp <- rupaul %>%
    slice(sliceRange) %>%
    mutate(queen_friend = map(user_id, get_queen_friends)) %>%
    select(screen_name, queen_friend) %>%
    unnest(queen_friend)
  queens_edges <- rbind(queens_edges, queens_edges_temp)
  Sys.sleep(15 * 60)
}
sliceRange <- (step * num_steps + 1):nrow(rupaul)
queens_edges_temp <- rupaul %>%
  slice(sliceRange) %>%
  mutate(queen_friend = map(user_id, get_queen_friends)) %>%
  select(screen_name, queen_friend) %>%
  unnest(queen_friend)
queens_edges <- rbind(queens_edges, queens_edges_temp)

colnames(queens_edges) <- c("from", "to")

It’s not optimal, I know, but it’s a one-time effort of 12 steps x 15 minutes = 3 hours.

  1. Just ask me and I’ll give you the file :)3

RuExploration

However you got here you now have a queens_edges data.frame, in each row a queen (in the from column) and her friend whom she follows (in the to column). These are basically our netwrok’s edges hence the name queen_edges.

head(queens_edges)
## # A tibble: 6 x 2
##         from              to
##        <chr>           <chr>
## 1 PorkChopLA  thetammiebrown
## 2 PorkChopLA     DragAkashia
## 3 PorkChopLA   SotomayorJade
## 4 PorkChopLA ShannelOfficial
## 5 PorkChopLA  BeckyGlasscock
## 6 PorkChopLA BeBeZaharaBenet
dim(queens_edges)
## [1] 8496    2
par(mfrow = c(1, 2))
queens_edges %>% count(from) %>% select(n) %>% boxplot(n, main = "Out Degree Distribution")
queens_edges %>% count(to) %>% select(n) %>% boxplot(n, main = "In Degree Distribution")

Further exploration shows that overall we have 8,496 edges. A “median” queen would follow (and be followed by) ~85 queens. That is quite a lot if the maximum is 111, even now we can expect this network to be very “connected” or “cohesive”. Who are the top and bottom queens?

queens_edges %>% count(from) %>% arrange(n) %>% slice(c(1, n()))
## # A tibble: 2 x 2
##            from     n
##           <chr> <int>
## 1 KenyaMichaels     1
## 2 jaidynnfierce   109
queens_edges %>% count(to) %>% arrange(n) %>% slice(c(1, n()))
## # A tibble: 2 x 2
##              to     n
##           <chr> <int>
## 1 KenyaMichaels     2
## 2  TheOnlyDetox   106

So, again Kenya Michaels is the least active queen, follows only 1 queen and being followed by only 2 queens. Jaidynn Diore Fierce from Season 7 follows almost all queeens, and the most popular queen by this metric is Detox from Season 5, being followed by 106 of 111 possible queens. Notice that by popular vote, Bianca takes the lead. I have to ask, who doesn’t like Detox?!

rupaul %>%
  anti_join(queens_edges %>% filter(to == "TheOnlyDetox"), by = c("screen_name" = "from")) %>%
  select(screen_name)
## # A tibble: 6 x 1
##     screen_name
##           <chr>
## 1   TyraSanchez
## 2   DragAkashia
## 3 MAXcollective
## 4  TheOnlyDetox
## 5 KenyaMichaels
## 6 AlexisMateo79

Shade.

I’d also like to know - are there queens with huge gaps between the number of queens who are following them and the number of queens they’re following?

follows <- queens_edges %>% count(from) %>% rename(queen = from, n_follows = n)
followed <- queens_edges %>% count(to) %>% rename(queen = to, n_followed = n)

follows %>%
  inner_join(followed) %>%
  mutate(gap = n_follows - n_followed) %>%
  arrange(gap) %>%
  slice(c(1,2,n()-1, n()))
## # A tibble: 4 x 4
##            queen n_follows n_followed   gap
##            <chr>     <int>      <int> <int>
## 1 Carmen_Carrera         6         75   -69
## 2  AlexisMateo79        21         88   -67
## 3    eurekaohara       105         53    52
## 4       ajaqueen       103         49    54

So Carmen Carrera from Season 3 follows only 6 queens while being followed by 75 queens - maybe that’s not so surprising, because Carmen has revealed that she was transsexual on the show and has gave up Drag for modelling. And Aja follows 103 queens while being followed by only 49 - awkward, but again maybe not surprising because Aja is from Season 9 which has just started while I’m writing this.

The Network is Open

Let’s see this network already! We’ll use a fancy interactive plot with the visNetwork package, and we’ll even have the queens’ Twitter profile images as nodes! visNetwork needs two data.frames: nodes and edges. Whatever property you’d like for your nodes/edges (e.g. their size), you need to add them as another column to these tables:

library(magrittr)
library(visNetwork)

rupaul %<>% arrange(screen_name)
in_degree <- queens_edges %>% count(to) %>% arrange(to) %>%
  select(n) %>% unlist()
out_degree <- queens_edges %>% count(from) %>% arrange(from) %>%
  select(n) %>% unlist()

queens_nodes <- data.frame(id = rupaul$screen_name, image = rupaul$profile_image_url,
                           shape = "image", size = in_degree)

#the title field will be used as a tooltip when node is hovered
queens_nodes$title <- paste0(rupaul$screen_name, "<br>", "#Followers: ", in_degree,
                             "<br>", "#Follows: ", out_degree)

queens_edges$arrows <- "to"

visNetwork(queens_nodes, queens_edges, width = "100%", main = "RPDR Queens Twitter Network") %>%
  visIgraphLayout() %>%
  visOptions(highlightNearest = TRUE, 
              nodesIdSelection = list(enabled = TRUE, useLabels = TRUE)) %>%
  visNodes(shapeProperties = list(useBorderWithImage = TRUE)) %>%
  visInteraction(keyboard = TRUE,
                dragNodes = T, 
                dragView = T, 
                zoomView = T)

I’m gagging for this plot!4 Go ahead, zoom in, zoom out, hover, double-click and filter a queen from the drop-down menu! At first glance our expectation is confirmed - this is one cohesive network, where every queen is friends with almost every queen. Two queens pop out of this giant component, if you’re following my steps you can already guess who they are: Kenya Michaels and Akashia - who are simply not very active Twitter users. And in the middle of the network we see the most connected queens, unsurprisingly queens who got very far, like Detox, Alaska, Bianca, Jinkx, Latrice Royal, Jujubee atc.

Suppose we drop the images and have simple circles for nodes filled with color by some grouping variable - the queen’s Season, the queen’s Hometwon, the queen’s final ranking on the show, etc. Will we see an interesting pattern? Let’s do this! If you remember Wikipedia actually has a page holding a table with the data we want. We’ll extract it using the htmltabs package:

library(htmltab)

url <- "https://en.wikipedia.org/wiki/List_of_RuPaul%27s_Drag_Race_contestants"
rupaul_wiki <- htmltab(doc = url, which = "//th[text() = 'Drag Name']/ancestor::table")

dim(rupaul_wiki)

head(rupaul_wiki)
## [1] 137   6
##                    Drag Name           Real Name Age
## 2 Victoria "Porkchop" Parker                <NA>  39
## 3               Tammie Brown Keith Glen Schubert  26
## 4                    Akashia                <NA>  24
## 5                       Jade     David Sotomayor  25
## 6                     Ongina                <NA>  26
## 7                    Shannel                <NA>  26
##                  Hometown Season Finish
## 2 Raleigh, North Carolina      1    9th
## 3 Los Angeles, California      1    8th
## 4         Cleveland, Ohio      1    7th
## 5       Chicago, Illinois      1    6th
## 6 Los Angeles, California      1    5th
## 7       Las Vegas, Nevada      1    4th

The rupaul_wiki table has 137 rows, while rupaul has only 112. This is because (a) Wikipedia also lists the two All-Stars seasons, (b) some queens appeared in more than 1 regular season (Shangela, Cynthia Lee Fontaine) and (c) we omitted Milan from Season 4 whose Twitter account cannot be found. So we’ll have to do some work to be able to join between the two tables.

#Notice the queens_usernames vector is already pretty much sorted by Season and Rank.
#Some tweaking and we'll have a vector matching exactly the Wikipedia order of names.
queens_usernames_wiki <- c(queens_usernames[1:29], "itsSHANGELA", queens_usernames[30:46], "MimiImfurst",
"ThePandoraBoxx", "thetammiebrown", "DJNinaFlowers", "LatriceRoyale", "manilaluzon", "AlexisMateo79", "yarasofiapr", "ShannelOfficial", "jujuboston", "RavenHUNTY", "ChadMichaels1", queens_usernames[47:100], "cocomontrese", "AdoreDelano", "TheGingerMinj", "PhiPhiOhara", "TATIANNANOW", "AlyssaEdwards_1", "RoxxxyAndrews", "TheOnlyDetox", "katya_zamo", "Alaska5000", queens_usernames[101:106], "lee_fontaine", queens_usernames[107:113])

rupaul <- rupaul_wiki %>%
  cbind(screen_name = queens_usernames_wiki) %>%
  inner_join(rupaul)

dim(rupaul)
## [1] 136  42

The joined table has 136 rows. That’s 137 minus Milan who isn’t on Twitter. This table is still quite “dirty”. Let’s tidy it up to get a table in which each queen has only one row in which we keep her username, (original) Season, (original) Rank or Finish, Age, State, a variable indicating whether she’s an All-Star, when her Twitter account was created and of course: followers_count, friends_count and statuses_count.

rupaul %<>%
  select(screen_name, Season, Age, Hometown, Finish, created_at, followers_count,
         friends_count, statuses_count)

Rank (Finish) is complicated, because some queens have values like “9th”, some “Runner-up” or “Winner” and some values like “11th/12th”. When it’s that complicated, remember a for-loop isn’t a dirty word:

rankQueen <- function(finish) {
  suppressWarnings(
    finish %>% strsplit("th") %>% .[[1]] %>% as.numeric %>%
      ifelse(length(.) == 1, ., NA)
  )
}

rupaul$Rank <- NULL
for (i in 1:nrow(rupaul)) {
  rupaul$Rank[i] <- rankQueen(rupaul$Finish[i])
  if (is.na(rupaul$Rank[i])) {
    if ((rupaul$Season[i] == "All Stars 1" && rupaul$Season[i - 1] == "4") ||
        (rupaul$Season[i] == "8" && rupaul$Season[i - 1] == "7")){
      rupaul$Rank[i] <- 12
    } else if (rupaul$Season[i] != "9") {
      rupaul$Rank[i] <- rupaul$Rank[i - 1] - 1
    }
  }
}

summary(rupaul$Rank)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   7.000   6.921  10.000  14.000      10

Notice that for Season 9 which is still taking place I gave all queens who are still in the race NA as Rank. Now for a variable indicating whether the queen is an All-Star:

all_stars_queens <- rupaul %>%
  filter(Season %in% c("All Stars 1", "All Stars 2")) %>%
  select(screen_name) %>%
  unlist %>%
  as.character()

rupaul$is_all_star <- ifelse(rupaul$screen_name %in% all_stars_queens, 1, 0)

Now for everything else. Notice the group_by to dedup queens, keeping only the first row for returning queens:

library(purrr)
library(lubridate)

getStateFromHometown <- function(Hometown) {
  State <- strsplit(Hometown, ",")[[1]][2]
  if (substring(State, 1, 1) == " ") {
    State <- substring(State, 2)
  }
  State
}


getDaysSinceCreated <- function(created_at) {
  today() - as.Date(created_at)
}

rupaul %<>%
  filter(!Season %in% c("All Stars 1", "All Stars 2")) %>%
  mutate(Season = as.numeric(Season),
         Age = as.numeric(Age),
         State = map_chr(Hometown, getStateFromHometown),
         days_since_created = map_dbl(created_at, getDaysSinceCreated)) %>%
  group_by(screen_name) %>%
  arrange(Season) %>%
  filter(row_number() == 1) %>%
  ungroup %>%
  arrange(screen_name)

Hope it was worth it, let us define some functions to help us later on:

getNodesColorAndTitle <- function(colorVector, titleName, titleVector = colorVector, ordinal = TRUE) {
  if (ordinal) {
    col1 <- adjustcolor("yellow", alpha=0.6)
    col2 <- adjustcolor("red", alpha=0.6)
    palette <- colorRampPalette(c(col1, col2), alpha = TRUE)
    node_colors <- palette(length(unique(colorVector)))
    color <- t(col2rgb(node_colors[colorVector])/255) %>% as.data.frame %>% pmap_chr(rgb)
  } else {
    palette <- c("green", "orange", "red", "blue", "purple", "black")
    color <- palette[colorVector]
  }
  title <- paste0(rupaul$screen_name, "<br>", "#Followers: ", in_degree,
                             "<br>", "#Follows: ", out_degree, "<br>",
                             titleName, ": ", titleVector)
  return(list(color = color, title = title))
}

plotVisNetwork <- function(nodes, edges, mainTitle) {
  visNetwork(nodes, edges, width = "100%", main = mainTitle) %>%
  visIgraphLayout(randomSeed = 42) %>%
   visOptions(highlightNearest = TRUE, 
              nodesIdSelection = list(enabled = TRUE, useLabels = TRUE)) %>%
  visInteraction(keyboard = TRUE,
                dragNodes = T, 
                dragView = T, 
                zoomView = T)
}

Now let us see the network colored by Season:

queens_nodes$shape <- "circle"

colorAndTitle <- getNodesColorAndTitle(rupaul$Season, "Season")

queens_nodes$color <- colorAndTitle$color
queens_nodes$title <- colorAndTitle$title

plotVisNetwork(queens_nodes, queens_edges, "RPDR Twitter Network by Season")

Very nice, we see an interesting pattern, where early Seasons queens (yellow-orange) appear on one side, and later Seasons queens (orange-red) appear on the other side. Now we will color by Rank:

colorAndTitle <- getNodesColorAndTitle(rupaul$Rank, "Rank")

queens_nodes$color <- colorAndTitle$color
queens_nodes$title <- colorAndTitle$title

plotVisNetwork(queens_nodes, queens_edges, "RPDR Twitter Network by Finish Rank")

That is also a nice pattern - it seems like queens ranked on top are closer to the center. I expect the queens in the center of network are All-Star queens:

colorAndTitle <- getNodesColorAndTitle(rupaul$is_all_star + 1, "All-Star",
                                       ifelse(rupaul$is_all_star == 1, "Yes", "No"), FALSE)

queens_nodes$color <- colorAndTitle$color
queens_nodes$title <- colorAndTitle$title

plotVisNetwork(queens_nodes, queens_edges, "RPDR Twitter Network by Is All-Stars")