I’ve recently stumbled upon this really cool text analysis of Seinfeld scripts, by Michael Groesbeck. It occurred to me though that with recent advancements in text mining and visualization tools (especially the tidytext package and various D3.js R wrappers), more and more R bloggers (including myself!) are just having fun applying the same analyses to different datasets, instead of actually answering preconceived questions using data, refuting well-formulated hypotheses etc. R is fun but we need some focus! So I’ve decided I am going to ask a very specific question and try answering it with “everything” in my arsenal: Who was the lead character in the NBC hit sitcom Friends? It wasn’t until I had a pretty clear-cut answer that I found out that other people have tried to answer this question before, e.g. here, here and here. But these guys used different data, some didn’t share any code, and one even reached a different answer! So it’s even more exciting for me to write this post. See below for more!

# The One with Getting the Scripts

Isn’t it amazing I can scrape all 10 seasons’ scripts using R? I found this site holding manually (!) transcribed scripts for all 236 episodes, and I’m scraping them with the rvest package, then manipulating the dataset into the desired format with purrr and stringr:

library(tidyverse)
library(rvest)
library(stringr)
library(magrittr)

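# Assumed helper (only its tail survives): read the season number from a
# "seasonN" fragment in the script URL and fall back to 10 when there is none.
getSeason <- function(url) {
  season <- str_extract(url, "(?<=season)[0-9]+")
  if (is.na(season)) {
    10
  } else {
    as.numeric(season)
  }
}

extractTitle <- function(season, html) {
  title <- html_nodes(html, "title") %>% html_text() %>% paste(collapse = " ")
  if (season == 10) {
    title <- str_split(title, " - ")[[1]][3]
  }
  # Season 9 titles are filled in separately by getSeason9Titles()
  if (season != 9 && length(title) > 0) {
    title
  } else {
    ""
  }
}

getSeason9Titles <- function() {
  # `season9TitlesUrl` is a placeholder: the address of the page listing the
  # Season 9 episode titles is not given here and must be set by the reader.
  titles <- read_html(season9TitlesUrl) %>%
    html_nodes(".summary") %>%
    html_text()
  map_chr(titles[4:26], function(x) str_split(x, "\"")[[1]][2])
}

url <- "http://livesinabox.com/friends/scripts.shtml"

# The head of this pipeline and the season/html columns are assumed: the index
# page is taken to link to one script per episode, with links resolved
# against `url`.
episodes_df <- read_html(url) %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  tibble(href = .) %>%
  slice(46:275) %>%
  unique() %>%
  transmute(season = map_dbl(href, getSeason),
            html = map(xml2::url_absolute(href, url), read_html),
            episodeTitle = map2_chr(season, html, extractTitle)) %>%
  filter(!startsWith(episodeTitle, "Friends")) %>%
  group_by(season) %>%
  mutate(episodeNum = row_number()) %>%
  ungroup()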

episodes_df$episodeTitle[episodes_df$season == 9] <- getSeason9Titles()

## # A tibble: 228 x 4
##    season html               episodeTitle                                                          episodeNum
##     <dbl> <list>             <chr>                                                                      <int>
##  1      1 <S3: xml_document> The One Where Monica Gets a New Roomate (The Pilot-The Uncut Version)          1
##  2      1 <S3: xml_document> The One With The Sonogram at the End                                           2
##  3      1 <S3: xml_document> The One With The Thumb                                                         3
##  4      1 <S3: xml_document> The One With George Stephanopoulos                                             4
##  5      1 <S3: xml_document> The One With the East German Laundry Detergent                                 5
##  6      1 <S3: xml_document> The One With The Butt                                                          6
##  7      1 <S3: xml_document> The One With the Blackout                                                      7
##  8      1 <S3: xml_document> The One Where Nana Dies Twice                                                  8
##  9      1 <S3: xml_document> The One Where Underdog Gets Away                                               9
## 10      1 <S3: xml_document> The One With The Monkey                                                       10
## # ... with 218 more rows

OK, episodes_df holds all episode titles (in the episodeTitle column) and the raw html scripts (in the html column). Notice I had to take special care with Season 9 episodes - you get many issues with html in the wild. Also notice I only have 228 episodes, not 236 - this is because there is only a single script for “double” episodes.

Now I wish to unnest the scripts to have one row per line. To identify the lines in the html I’m using a simple regex. Then I’m separating each line into the character saying the line and the line itself. This was somewhat harder than I expected, again due to the irregularities you see in the html of various sites: some episodes in Season 2 and Season 9 have a completely different structure.

# Regular episodes: each spoken line sits in its own <p> element.
getPersonLinePairs <- function(html) {
  html %>%
    html_nodes("body") %>%
    html_nodes("p") %>%
    html_text() %>%
    tibble(text = .) %>%
    filter(str_detect(text, "^[A-Z][a-zA-Z. ]+:")) %>%
    unlist() %>%
    unname() %>%
    str_to_lower() %>%
    str_replace_all("\n", " ") %>%
    str_replace(":", "\\|\\|")
}

# Irregular episodes: no <p> per line, so split the body text on newlines.
# The function name is assumed; only the body of this variant is shown.
getPersonLinePairsIrregular <- function(html) {
  html %>%
    html_nodes("body") %>%
    html_text() %>%
    str_split(., "\n") %>%
    unlist %>%
    tibble(text = .) %>%
    filter(str_detect(text, "^[A-Z][a-zA-Z. ]+:")) %>%
    unlist() %>%
    unname() %>%
    str_to_lower() %>%
    str_replace_all("\n", " ") %>%
    str_replace(":", "\\|\\|")
}

personLines_df <- episodes_df %>%
  filter(!(season == 2 & episodeNum %in% c(9, 12:23)) &
           !(season == 9 & episodeNum %in% c(7, 11, 15))) %>%
  mutate(personLine = map(html, getPersonLinePairs))

irregulars <- episodes_df %>%
  filter((season == 2 & episodeNum %in% c(9, 12:23)) |
           (season == 9 & episodeNum %in% c(7, 11, 15))) %>%
  mutate(personLine = map(html, getPersonLinePairsIrregular))

personLines_df %<>%
  rbind(irregulars) %>%
  group_by(season, episodeNum, episodeTitle) %>%
  unnest(personLine) %>%
  ungroup() %>%
  separate(personLine, c("person", "line"), sep = "\\|\\|") %>%
  filter(!str_detect(person, " by"))

personLines_df %>% select(season, episodeNum, person, line)
## # A tibble: 60,817 x 4
##    season episodeNum person   line
##     <dbl>      <int> <chr>    <chr>
##  1      1          1 monica   " there's nothing to tell! he's just some guy i work with!"
##  2      1          1 joey     " c'mon, you're going out with the guy! there's gotta be something wrong with him!"
##  3      1          1 chandler " all right joey, be nice.  so does he have a hump? a hump and a hairpiece?"
##  4      1          1 phoebe   " wait, does he eat chalk?"
##  5      1          1 phoebe   " just, 'cause, i don't want her to go through what i went through with carl- oh!"
##  6      1          1 monica   " okay, everybody relax. this is not even a date. it's just two people going out to dinner and- not having sex."
##  7      1          1 chandler " sounds like a date to me."
##  8      1          1 chandler " alright, so i'm back in high school, i'm standing in the middle of the cafeteria, and i realize i am totally naked."
##  9      1          1 all      " oh, yeah. had that dream."
## 10      1          1 chandler " then i look down, and i realize there's a phone... there."
## # ... with 60,807 more rows

The episodes_df turned into the personLines_df holding over 60K lines, one row per line. Notice I didn’t call the “character” column “character” but person, because character is a big word in R… And that’s basically it, we can start answering our question!
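To make the speaker-matching step concrete, here is a minimal base R sketch of the same idea on three made-up lines. The sample strings and the helper name `splitSpeaker` are assumptions for illustration; only the pattern is taken from the code above. Unlike the pipeline above, it trims the leading space from the spoken line.

```r
# Hypothetical sample lines; only the first and third start with "Name:".
sampleLines <- c("Monica: There's nothing to tell!",
                 "[Scene: Central Perk, everyone is there.]",
                 "Mr. Heckles: You're doing it again.")

# Same pattern as above: a capital letter, then letters, dots or spaces,
# then a colon.
speakerPattern <- "^[A-Z][a-zA-Z. ]+:"

splitSpeaker <- function(text) {
  spoken <- tolower(text[grepl(speakerPattern, text)])
  # Split on the first colon only, so colons inside the line are preserved.
  colonPos <- regexpr(":", spoken, fixed = TRUE)
  data.frame(person = substr(spoken, 1, colonPos - 1),
             line = trimws(substring(spoken, colonPos + 1)),
             stringsAsFactors = FALSE)
}

splitSpeaker(sampleLines)
```

The scene description is dropped because it starts with a bracket rather than a capital letter, which is also how the filter above discards stage directions.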

# The One with the Basic Metrics

Before starting with basic metrics, I’m going to write a plotFunc to help plot a nice ggplot2 barchart, whatever the data are. I will use the new magick package and the grid package to help me show the characters’ images on the bars, just for swag. VERY IMPORTANT: I had very little time to figure out grid’s eccentricities, and I know the way I’m positioning the images isn’t very intelligent, sorry about that.

library(magick)
library(grid)
library(RColorBrewer)

addImages <- function(p, df, offsetY, offsetY2, width = 0.6) {
  person <- df$person
  plotData <- ggplot_build(p)$data[[1]]

  for (i in seq_along(person)) {
    # Assumption: one image file per character, e.g. "images/rachel.png";
    # the actual image location is not shown in the post.
    img <- image_read(file.path("images", paste0(person[i], ".png"))) %>%
      rasterGrob(just = "top")
    x <- plotData$x[i]
    y <- plotData$y[i]
    p <- p + annotation_custom(img, xmin = x - width / 2, xmax = x + width / 2,
                               ymax = y + offsetY - ((i - 1)^1.5) * offsetY2)
  }
  p
}

plotFunc <- function(df, title, ylab, offsetY, offsetY2, width = 0.6) {
  colorsDict <- tibble(person = c("chandler", "ross", "joey", "monica", "rachel", "phoebe"),
                       color = brewer.pal(6, "Set1"))
  p <- df %>%
    inner_join(colorsDict, "person") %>%
    # Fix the factor levels so the bars keep the order of the sorted data
    mutate(person = factor(person, levels = person)) %>%
    ggplot(aes(person, n, fill = color)) +
    geom_bar(stat = "identity", alpha = 0.5) +
    ggtitle(paste0("Friends: ", title)) +
    xlab("") +
    ylab(ylab) +
    theme(axis.text.x = element_text(size = 15),
          plot.title = element_text(size = 18, hjust = 0.5),
          legend.position = "none")

  # Place the character images on the bars and return the finished plot
  addImages(p, df, offsetY, offsetY2, width)
}

So: who is the character with the highest no. of lines?

df <- personLines_df %>%
count(person) %>%
arrange(-n)

plotFunc(df, "No. of Lines per Character", "No. of Lines", 9000, 120)

It’s Rachel! I wonder whether this view will change when we count the no. of episodes in which each character has the most lines:

df <- personLines_df %>%
count(season, episodeNum, person) %>%
group_by(season, episodeNum) %>%
top_n(1, n) %>%
ungroup() %>%
count(person) %>%
rename(n = nn) %>%
arrange(-n)

plotFunc(df, "No. of Episodes in which Character has most Lines", "No. of Episodes", 45, 3)

Nope, the same pattern. Does Rachel have most lines in all seasons? (Gonna need custom plotting here…)

df <- personLines_df %>%
count(season, person) %>%
group_by(season) %>%
top_n(1, n) %>%
ungroup()

p <- df %>%
mutate(season = paste0("s", season)) %>%
mutate(season = factor(season, levels = season)) %>%
ggplot(aes(season, n, fill = factor(person))) +
geom_bar(stat = "identity", alpha = 0.5) +
ggtitle("Friends: Character with Most Lines per Season") +
xlab("") +
ylab("No. of Lines") +
theme(axis.text.x = element_text(size = 15),
plot.title = element_text(size=18, hjust = 0.5),
legend.position="none")

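width <- 0.6
offsetY <- 880
offsetY2 <- 0

# Assumed final step (the plotting call itself is not shown): reuse addImages()
# from above to place the character images on the bars and print the plot.
addImages(p, df, offsetY, offsetY2, width)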

Interesting! Ross seems to have the most lines in the first seasons. Chandler has the most lines in seasons 5 & 6, which makes sense in light of his and Monica’s dominant relationship plotline. And Rachel is only in the lead in 3 out of 10 seasons.
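One caveat on the two “most lines” counts above: `top_n(1, n)` keeps every row tied for the top value, so an episode (or season) in which two characters share the highest count is credited to both. Here is a minimal base R sketch of that behaviour, using made-up counts; the data frame and its numbers are illustrative only.

```r
# Made-up line counts for two episodes; episode 2 has a tie at the top.
counts <- data.frame(
  episodeNum = c(1, 1, 1, 2, 2, 2),
  person = c("rachel", "ross", "joey", "rachel", "ross", "joey"),
  n = c(40, 35, 20, 30, 30, 25),
  stringsAsFactors = FALSE
)

# Keep every row whose count equals the maximum within its episode
# (this keeps ties, like top_n(1, n) does).
isTop <- counts$n == ave(counts$n, counts$episodeNum, FUN = max)
leaders <- counts[isTop, ]

# Episodes "won" per character: the tied episode counts for both.
table(leaders$person)
```

If ties should count only once, one option is to break them explicitly (for example by word count) before tallying.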

Who has the highest no. of words?

countNWords <- function(line) {
str_count(line, " ") + 1
}
df <- personLines_df %>%
mutate(nWords = map_dbl(line, countNWords)) %>%
group_by(person) %>%
tally(nWords) %>%
arrange(-n)

plotFunc(df, "No. of Words per Character", "No. of Words", 130000, 1500)

Again Rachel! But notice something interesting: when it comes to no. of words, except for Rachel, the girls are left behind. Although Joey is only in the 5th place when it comes to no. of lines, he is 3rd place when it comes to no. of words - you wouldn’t expect that, from Joey, would you?
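A note on `countNWords`: counting spaces and adding one is a quick approximation. Because each line in personLines_df starts with a space (see the printed rows earlier), it overcounts by one per line, and doubled spaces inflate it further. A hedged alternative is sketched below in base R; `countBySpaces` and `countWords` are hypothetical names, and the chart above was produced with the original function.

```r
# The approach used above, rewritten in base R: number of spaces plus one.
countBySpaces <- function(line) {
  lengths(regmatches(line, gregexpr(" ", line, fixed = TRUE))) + 1
}

# Alternative: count runs of non-whitespace characters.
countWords <- function(line) {
  lengths(regmatches(line, gregexpr("[^[:space:]]+", line)))
}

line <- " sounds like a date to me."
countBySpaces(line)  # 7, because of the leading space
countWords(line)     # 6
```

Since the overcount is roughly one word per line, it would mostly favour characters with many short lines.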

Let’s see the character who is mentioned the most in all characters lines:

df <- personLines_df %>%
mutate(chandler = map_int(line, str_count, "chandler"),
ross = map_int(line, str_count, "ross"),
joey = map_int(line, str_count, "joey"),
monica = map_int(line, str_count, "monica"),
rachel = map_int(line, str_count, "rachel"),
phoebe = map_int(line, str_count, "phoebe")) %>%
select(chandler, ross, joey, monica, rachel, phoebe) %>%
summarise_all(funs(sum)) %>%
t() %>%
as.data.frame() %>%
rownames_to_column() %>%
set_colnames(c("person", "n")) %>%
arrange(-n)

plotFunc(df, "No. of Mentions in Script per Character", "No. of Mentions", 2500, 100)

OK, that’s Ross. Again, the girls are left behind. But maybe that’s because Rachel is sometimes called “Rache”, Monica “Mon” and Phoebe “Phoebs”? Still, poor Phoebe. Characters speak her name about half as often as they speak Ross’s name!

Who is the character mentioned the most in episode titles? (e.g. “The One Where Rachel Smokes”)
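The nickname hypothesis can be checked by counting aliases alongside full names. The sketch below also anchors matches on word boundaries, since a plain substring count of “ross” also matches words such as “across”. The alias lists, the sample sentence and the helper `countMentions` are assumptions for illustration, not part of the analysis above.

```r
# Count whole-word occurrences of any alias in each (lower-cased) line.
countMentions <- function(lines, aliases) {
  pattern <- paste0("\\b(", paste(aliases, collapse = "|"), ")\\b")
  matches <- gregexpr(pattern, lines, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive starts.
  vapply(matches, function(m) sum(m > 0), integer(1))
}

sampleLine <- "hey rach, rachel went across the street with ross"

countMentions(sampleLine, c("rachel", "rach", "rache"))  # 2
countMentions(sampleLine, "ross")                         # 1, "across" is skipped
```

Short aliases such as “mon” are exactly where the word boundary matters, since they also occur inside ordinary words like “money”.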

df <- episodes_df %>%
mutate(chandler = map_int(episodeTitle, str_count, "Chandler"),
ross = map_int(episodeTitle, str_count, "Ross"),
joey = map_int(episodeTitle, str_count, "Joey"),
monica = map_int(episodeTitle, str_count, "Monica"),
rachel = map_int(episodeTitle, str_count, "Rachel"),
phoebe = map_int(episodeTitle, str_count, "Phoebe")) %>%
select(chandler, ross, joey, monica, rachel, phoebe) %>%
summarise_all(funs(sum)) %>%
t() %>%
as.data.frame() %>%
rownames_to_column() %>%
set_colnames(c("person", "n")) %>%
arrange(-n)

plotFunc(df, "No. of Mentions in Episodes Titles per Character", "No. of Mentions", 16, 1)

Rachel again! Rachel has 27 episodes with her name on them - more than enough for an entire season.

# The One with Google Search and Wikipedia

So far I’d say Rachel is a strong contender for being Friends’ “lead character”. Let’s give the scripts a rest for a while, and see what Google has to say. I’m going to use rvest again and a simple regex to parse the Google Search results page and get the no. of search results for each character.

So who is the character with the most Google Search results, when searching for their full name? (e.g. “Joey Tribbiani”)

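googleSearchTResultsCount <- function(query) {
  search.url <- paste("http://www.google.com/search?q=", gsub(" ", "+", query), sep = "")

  # Read the results page and pull the first number out of the result stats
  read_html(search.url) %>%
    html_nodes("body") %>%
    html_nodes(xpath = "//div[@id='resultStats']") %>%
    html_text() %>%
    str_extract("\\d+(,\\d+)*") %>%
    gsub(",", "", .) %>%
    as.numeric()
}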

df <- tibble(person = c("chandler", "ross", "joey", "monica", "rachel", "phoebe"),
surname = c("bing", "geller", "tribbiani", "geller", "green", "buffay")) %>%
mutate(n = map_dbl(paste(person, surname), googleSearchTResultsCount)) %>%
arrange(-n)

plotFunc(df, "No. of Full Name Google Search Results", "No. of Search Results", 12000000, 0)

Ha! There are ~20M search results for “Rachel Green”, but less than 1M for all other characters. But, to be fair, “Rachel Green” is a name of an actual person, as opposed to “Phoebe Buffay”, so (a) Phoebe coming in second here is very impressive and (b) these results are extremely biased.
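The regex step of the search-count function can be tried offline. This base R sketch assumes the result-stats text looks like “About 20,400,000 results (0.52 seconds)”; that sample string and the helper name `parseResultCount` are made up for illustration, and the live page markup may well differ or change.

```r
parseResultCount <- function(statsText) {
  # First run of digits, optionally grouped by commas.
  hit <- regmatches(statsText, regexpr("[0-9]+(,[0-9]+)*", statsText))
  as.numeric(gsub(",", "", hit, fixed = TRUE))
}

parseResultCount("About 20,400,000 results (0.52 seconds)")  # 20400000
```

Because only the first number is taken, the elapsed time in parentheses is ignored.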

Let’s look for “Friends Character-Name” (e.g. “Friends Monica”):

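df <- tibble(person = c("chandler", "ross", "joey", "monica", "rachel", "phoebe")) %>%
  # Assumed query format, following the "Friends Monica" example above
  mutate(n = map_dbl(paste("friends", person), googleSearchTResultsCount)) %>%
  arrange(-n)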

plotFunc(df, "No. of \"Friends X\" Google Search Results",
"No. of Search Results", 150000000, 4000000)

Rachel again, though the effect is indeed diminished. Monica is surprising.

How about seeing this over time with Google Trends? Google Trends allows you to compare only 5 search terms so I’m excluding Phoebe here, sorry Phoebs, but we can see results from 2004! (See footnote 1.)

Here Rachel ties with Ross and Joey. But in 2004 Joey is much more sought after. Which is interesting because this might mean the fans’ focus has shifted over the years in accordance with the show’s focus as we’ve seen. Nowadays, maybe Rachel is the most sought after character, but back in 2004 Joey was.

One other thing I just got to find out: who is the character with the longest Wikipedia article?

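getLengthOfWikiArticle <- function(personFullName) {
  # Assumption: articles live under the English Wikipedia /wiki/ path
  read_html(paste0("https://en.wikipedia.org/wiki/", personFullName)) %>%
    html_nodes("body") %>%
    html_text() %>%
    nchar()
}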

df <- tibble(person = c("chandler", "ross", "joey", "monica", "rachel", "phoebe"),
personFullName = c("Chandler_Bing", "Ross_Geller", "Joey_Tribbiani",
"Monica_Geller", "Rachel_Green", "Phoebe_Buffay")) %>%
mutate(n = map_dbl(personFullName, getLengthOfWikiArticle)) %>%
arrange(-n)

plotFunc(df, "Wikipedia Article Length per Character", "No. of Letters", 90000, 3000)

Another one for Ms. Green. Again Monica is surprising, I wonder if Courteney Cox wrote Monica’s Wikipedia article…

# The One with the Network Analysis

OK, so by now Rachel has got this in the bag. But all the above are simple statistics. What would be even more impressive is if we managed to prove that Rachel is also “central” to the plot, i.e. in this network of 6 friends, Rachel is somehow the most “important”.

One way to form the network of the characters of Friends is to see how many times each of them says the other characters’ names (assuming he/she is talking to/about the other character). The more character X talks about character Y, the “tighter” they’re linked. Then we’ll build the co-occurrences matrix of these mentions, and create a graph from this matrix:

m <- personLines_df %>%
filter(person %in% c("chandler", "ross", "joey", "monica", "rachel", "phoebe")) %>%
mutate(chandler = map_int(line, str_count, "chandler"),
ross = map_int(line, str_count, "ross"),
joey = map_int(line, str_count, "joey"),
monica = map_int(line, str_count, "monica"),
rachel = map_int(line, str_count, "rachel"),
phoebe = map_int(line, str_count, "phoebe")) %>%
select(person, chandler, joey, monica, phoebe, rachel, ross) %>%
group_by(person) %>%
summarise_all(funs(sum)) %>%
column_to_rownames("person") %>%
as.matrix()

m
##          chandler joey monica phoebe rachel ross
## chandler      153  438    425    138    146  335
## joey          460  240    234    167    243  528
## monica        513  313    140    377    338  311
## phoebe        242  265    322    158    289  299
## rachel        228  473    358    309    138  777
## ross          280  330    233    166    478  209

How to read this matrix? “Chandler said his own name 153 times. Chandler said Joey’s name 438 times.” Etc.

Now for the network graph I am going to use the igraph and ggraph packages, and these packages assume that the number representing an edge between nodes is a “distance”, “energy” or “dissimilarity” rather than “similarity” as in our case. So I will invert the matrix values to make them “dissimilarities”, and also put zeros on the diagonal. Lastly, excuse me again for the crude way I am “sticking” the characters images here, I’m a father of three daughters aged less than 4.5 and I have no time…

library(igraph)
library(ggraph)

m2 <- 1/m
diag(m2) <- 0

g <- graph.adjacency(m2, weighted = TRUE, mode ="directed")

p <- ggraph(g, layout = 'kk') +
geom_edge_link(aes(width = 1/weight, color = 1/weight), show.legend = FALSE) +
geom_node_text(aes(label = name), size = 0) +
theme_graph(text_colour = "black", base_size = 12) +
ggtitle("Friends: Graph by Characters #Co-Mentions in Script")

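plotData <- ggplot_build(p)$data[[2]]
person <- plotData$label %>% as.character()

width <- 0.6
offsetY <- c(chandler = 2.15,  joey = 1.2, monica = 2, phoebe = 1.3, rachel = 0.6, ross = 0.6)
for (i in seq_along(person)) {
  # Assumption: one image file per character, e.g. "images/rachel.png";
  # the actual image location is not shown in the post.
  img <- image_read(file.path("images", paste0(person[i], ".png"))) %>%
    rasterGrob(just = "top")
  x <- plotData$x[i]
  y <- plotData$y[i]
  p <- p + annotation_custom(img, xmin = x - width / 5, xmax = x + width / 5,
                             ymax = y + offsetY[i])
}
p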

It’s pretty hard to see “centrality” here. The one thing which is clear from this network graph is that Phoebe is somewhat on the fringe of the network. Also apparent are the “strong” connections between couples of characters: Ross-Rachel, Monica-Chandler, Chandler-Joey.

To “see” centrality we can just calculate it, e.g. using Closeness centrality:

closeness(g) %>% sort(decreasing = TRUE)
##   rachel   monica   phoebe     joey     ross chandler
## 72.37632 71.57817 56.12902 54.21639 52.60466 46.04727

Nice, we see that according to this metric Rachel is the most central character, but Monica isn’t far behind, being Rachel’s roommate, Chandler’s partner, Joey’s neighbor and Ross’s sister.
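For intuition about what a closeness score measures, here is a toy base R sketch of the textbook definition: invert similarities into distances, find shortest paths, and take the reciprocal of each node’s summed distances. The three-node matrix is made up, and the sketch is not expected to reproduce the numbers above, since library implementations can differ in details such as normalisation.

```r
# Toy similarity matrix (made-up counts): higher means a tighter link.
similarity <- matrix(c(0, 4, 1,
                       4, 0, 2,
                       1, 2, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Invert similarities into distances, as done above with 1/m.
distance <- 1 / similarity
diag(distance) <- 0

# Floyd-Warshall: allow every node k in turn as an intermediate stop.
shortest <- distance
for (k in seq_len(nrow(shortest))) {
  shortest <- pmin(shortest, outer(shortest[, k], shortest[k, ], "+"))
}

# Closeness: reciprocal of the summed shortest-path distances.
closenessScore <- 1 / rowSums(shortest)
sort(closenessScore, decreasing = TRUE)
```

Here “b” comes out on top: it has the strongest direct links, and the shortest route between “a” and “c” runs through it.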

# The One with the Other Analyses

It seems like a slam dunk for Rachel. But here David Schoch claimed Chandler is the most central character, basing his analysis on a completely different dataset, taken from Alex Albright here.

Alex listed for each and every episode the participants for every subplot! E.g. in episode 1 of season 1 we have Monica’s solo plotline, Rachel’s solo plotline, then Rachel and Ross’s, and Chandler, Joey and Ross’s.

getDynamicsString <- function(dynamics) {
dynamics %>%
str_split("") %>%
.[[1]] %>%
map_chr(plyr::mapvalues, as.character(1:6),
c("chandler", "joey", "monica", "phoebe", "rachel", "ross")) %>%
paste(collapse = "+")
}

plotlinesData %<>%
mutate(dynamics = map_chr(dynamics, getDynamicsString))

plotlinesData
## # A tibble: 696 x 4
##    epseason epnum epname                               dynamics
##       <int> <int> <chr>                                <chr>
##  1        1     1 The One Where Monica Gets a Roommate monica
##  2        1     1 The One Where Monica Gets a Roommate rachel
##  3        1     1 The One Where Monica Gets a Roommate rachel+ross
##  4        1     1 The One Where Monica Gets a Roommate chandler+joey+ross
##  5        1     2 The One with the Sonogram at the End monica
##  6        1     2 The One with the Sonogram at the End ross
##  7        1     2 The One with the Sonogram at the End rachel+ross
##  8        1     3 The One with the Thumb               chandler
##  9        1     3 The One with the Thumb               monica
## 10        1     3 The One with the Thumb               phoebe
## # ... with 686 more rows
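For readers without plyr, the decoding done by getDynamicsString can be sketched in base R. This assumes, as the mapping in the code above implies, that the raw dynamics value is a string of digits 1 to 6 indexing the characters alphabetically; `decodeDynamics` is a hypothetical name.

```r
people <- c("chandler", "joey", "monica", "phoebe", "rachel", "ross")

decodeDynamics <- function(dynamics) {
  # Split e.g. "56" into "5" and "6", then use the digits as indices.
  digits <- as.integer(strsplit(dynamics, "")[[1]])
  paste(people[digits], collapse = "+")
}

decodeDynamics("56")   # "rachel+ross"
decodeDynamics("126")  # "chandler+joey+ross"
```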

Let’s look at the top plotlines (Alex does that in her post, but still):

plotlinesData %>% count(dynamics) %>% arrange(-n)
## # A tibble: 41 x 2
##    dynamics            n
##    <chr>           <int>
##  1 rachel+ross        70
##  2 phoebe             65
##  3 chandler+monica    63
##  4 ross               56
##  5 joey               54
##  6 rachel             47
##  7 monica             42
##  8 chandler+joey      36
##  9 chandler           34
## 10 joey+rachel        26
## # ... with 31 more rows

So 70 episodes (about a third) have a Ross/Rachel plotline, 63 have a Chandler/Monica plotline. And the character with the most solo plotlines is Phoebe. Poor Phoebe :(

So how to form a network from these data? Simple: I am going to count for each two characters the no. of plotlines they shared. This will discard for each character the no. of solo plotlines but I’m willing to pay that price.

splitDynamics <- function(dynamics) {
str_split(dynamics, "\\+")[[1]]
}

m <- plotlinesData %>%
mutate(chandler = map_int(dynamics, str_count, "chandler"),
ross = map_int(dynamics, str_count, "ross"),
joey = map_int(dynamics, str_count, "joey"),
monica = map_int(dynamics, str_count, "monica"),
rachel = map_int(dynamics, str_count, "rachel"),
phoebe = map_int(dynamics, str_count, "phoebe"),
person = map(dynamics, splitDynamics)) %>%
unnest(person) %>%
select(person, chandler, joey, monica, phoebe, rachel, ross) %>%
group_by(person) %>%
summarise_all(funs(sum)) %>%
column_to_rownames("person") %>%
as.matrix()

m
##          chandler joey monica phoebe rachel ross
## chandler      204   69     94     16     20   25
## joey           69  197     27     28     40   26
## monica         94   27    193     40     33   12
## phoebe         16   28     40    171     38   20
## rachel         20   40     33     38    216   81
## ross           25   26     12     20     81  194

How to read this matrix? Ignoring the diagonal, “Chandler and Joey shared 69 plotlines. Chandler and Monica shared 94 plotlines.” Etc.
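The same kind of matrix can be obtained from an incidence matrix (plotlines in rows, characters in columns) multiplied by its own transpose. Below is a small base R sketch on the four pilot plotlines listed earlier; the variable names are illustrative, and it only covers that one episode.

```r
people <- c("chandler", "joey", "monica", "phoebe", "rachel", "ross")

# The four plotlines of season 1, episode 1, as printed above.
dynamics <- c("monica", "rachel", "rachel+ross", "chandler+joey+ross")

# Incidence matrix: one row per plotline, 1 if the character takes part.
incidence <- t(vapply(strsplit(dynamics, "+", fixed = TRUE),
                      function(members) as.integer(people %in% members),
                      integer(length(people))))
colnames(incidence) <- people

# t(incidence) %*% incidence: shared plotlines off the diagonal,
# each character's total number of plotlines on the diagonal.
shared <- crossprod(incidence)
shared
```

In this tiny example Rachel and Ross share one plotline, and Ross appears in two plotlines overall.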

Putting this matrix into a graph, again not before inverting these “similarities” to “dissimilarities”:

m2 <- 1/m
diag(m2) <- 0

g <- graph.adjacency(m2, weighted = TRUE, mode ="undirected")

p <- ggraph(g, layout = 'kk') +
geom_edge_link(aes(width = 1/weight, color = 1/weight), show.legend = FALSE) +
geom_node_text(aes(label = name), size = 0) +
theme_graph(text_colour = "black", base_size = 12) +
ggtitle("Friends: Graph by Characters No. of Shared Plotlines")

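plotData <- ggplot_build(p)$data[[2]]
person <- plotData$label %>% as.character()

width <- 0.6
offsetY <- c(chandler = 2.8,  joey = 2.5, monica = 3, phoebe = 1.6, rachel = 1.3, ross = 1)
for (i in seq_along(person)) {
  # Assumption: one image file per character, e.g. "images/rachel.png";
  # the actual image location is not shown in the post.
  img <- image_read(file.path("images", paste0(person[i], ".png"))) %>%
    rasterGrob(just = "top")
  x <- plotData$x[i]
  y <- plotData$y[i]
  p <- p + annotation_custom(img, xmin = x - width / 5, xmax = x + width / 5,
                             ymax = y + offsetY[i])
}
p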

Using the standard Kamada–Kawai algorithm for arranging the network’s nodes, it is clear that if anyone is at the center of this network, it is Rachel. How about that Closeness centrality?

closeness(g) %>% sort(decreasing = TRUE)
##   rachel     joey chandler   monica   phoebe     ross
## 6.946157 6.635453 5.629646 5.367349 5.011776 4.461486

I know! Rachel is the most central character by this dataset and metric as well.

So: how did David Schoch find Chandler to be the most central character using this dataset? I must say I don’t quite know. To my understanding David used an incidence matrix as opposed to an adjacency matrix, “projected to a square matrix in the actor space” and Eigen centrality as opposed to Closeness centrality. Without code it’s a bit hard to reproduce this and discuss the differences. David did however find that Rachel was the character who was most central in most seasons (4 out of 10, while Chandler only 3 out of 10).
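As a rough illustration of the other metric mentioned here, eigenvector centrality can be computed directly from the shared-plotlines matrix printed earlier. This is only a sketch under stated assumptions (diagonal set to zero, raw counts as edge weights, base R’s `eigen()`); it does not attempt to reproduce David’s pipeline, and its output should not be read as his result.

```r
people <- c("chandler", "joey", "monica", "phoebe", "rachel", "ross")

# Shared-plotlines matrix, copied from the output above.
shared <- matrix(c(204,  69,  94,  16,  20,  25,
                    69, 197,  27,  28,  40,  26,
                    94,  27, 193,  40,  33,  12,
                    16,  28,  40, 171,  38,  20,
                    20,  40,  33,  38, 216,  81,
                    25,  26,  12,  20,  81, 194),
                 nrow = 6, byrow = TRUE, dimnames = list(people, people))

# Assumption: drop self-links so only shared plotlines count.
diag(shared) <- 0

# Eigenvector centrality: the eigenvector of the largest eigenvalue,
# rescaled so that the most central character scores 1.
leading <- eigen(shared, symmetric = TRUE)$vectors[, 1]
centrality <- setNames(abs(leading) / max(abs(leading)), people)
sort(centrality, decreasing = TRUE)
```

Unlike closeness, this score rewards being strongly tied to characters who are themselves strongly tied, so the two metrics need not agree on a ranking.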

# The One with the Wrap Up

This was fun. Let me reiterate the goal of this post: in the midst of all these new and engaging exploratory R packages, I wanted to ask a single question, and answer it. The simplicity of the answer was a surprise to me, though. But as far as I’m concerned: congrats, Jennifer Aniston! Girl, you should have asked for more money.

### UPDATE: 2017-06-08

Look at the beautiful infographics made out of this post by my best friend Nir Avigad:

1. I was going to embed the Google Trends results here, but I find the embedding unreliable (sometimes it renders, sometimes not), so I’m putting a screenshot here instead. Sorry about that.