A while ago I read through Social Media Mining with R and was fascinated by the subject of Sentiment Analysis. I decided to apply more or less the same analysis to a text which was dear to my heart: Anne Frank: The Diary of a Young Girl. Anne Frank’s diary chronicles over two years of her life (June 1942 - August 1944, ages 13-15), when she was hiding with her family in a secret annex in Amsterdam during the Nazi occupation of the Netherlands. The Franks were eventually caught; all family members except Anne’s father, Otto Frank, died a few months later. Otto lived until 1980.

Since over 70 years have passed since Anne’s death, many believe that the Diary should have entered the public domain at the end of 2015, according to European copyright laws (horrifyingly, alongside Adolf Hitler’s Mein Kampf…). However, in a questionable move, the Anne Frank Fonds - the organization receiving the Diary’s royalties - announced it would list Otto Frank as a co-author of the Diary, delaying the release of the text to 2050! Nevertheless, a Dutch court has decided the original version of the Diary may be copied for academic research, and - well, here we are :)

# From HTML to something I can work with

The text is out there. Let’s read it by parsing this cumbersome html page with the xml2 package:

```r
library(xml2)
library(readr)
library(tibble)

read_anne_frank_diary <- function() {
  h <- read_html(url)  # `url` holds the address of the HTML page linked above
  l <- as_list(h)
  tibble(text = read_lines(l[3]$body$div[7]$main$div[9]$pre[[1]]))
}

anne <- read_anne_frank_diary()
head(anne, 10)

## # A tibble: 10 x 1
##    text
##    <chr>
##  1 "THE DIARY OF A YOUNG GIRL : THE DEFINITIVE EDITION "
##  2 ""
##  3 ""
##  4 "Anne Frank "
##  5 ""
##  6 "Edited by Otto H. Frank and Mirjam Pressler "
##  7 "Translated by Susan Massotty "
##  8 ""
##  9 ""
## 10 "BOOK FLAP "
```

So currently `anne` is a tibble with 11242 rows, each of which is a single string. My aim is to represent the diary as a table with two columns: a single long string holding one complete post, and the date on which the post was written. Let’s fetch the dates first! Thank God for lubridate (the package with the kinkiest name out there):

```r
library(dplyr)
library(lubridate)
library(magrittr)

anne %<>% mutate(date = mdy(text, tz = "UTC"))
head(anne, 10)

## # A tibble: 10 x 2
##    text                                                date
##    <chr>                                               <dttm>
##  1 "THE DIARY OF A YOUNG GIRL : THE DEFINITIVE EDITIO~ NA
##  2 ""                                                  NA
##  3 ""                                                  NA
##  4 "Anne Frank "                                       NA
##  5 ""                                                  NA
##  6 "Edited by Otto H. Frank and Mirjam Pressler "      NA
##  7 "Translated by Susan Massotty "                     NA
##  8 ""                                                  NA
##  9 ""                                                  NA
## 10 "BOOK FLAP "                                        NA
```

This doesn’t look too good at first, but trust me: the `mdy` function turned each row representing a date (e.g. "SUNDAY, JUNE 14, 1942") into an actual Date object (e.g. 1942-06-14). If it couldn’t find a date, it returned NA. Now, for some reason `mdy` misses the date of the first post, returning NA instead of 1942-06-12, so I’m going to change this manually:

```r
anne[164, ]

## # A tibble: 1 x 2
##   text             date
##   <chr>            <dttm>
## 1 "June 12, 1942 " NA

anne[164, "date"] <- as_date("1942-06-12", tz = "UTC")
```

And with lubridate version 1.7.1, `mdy` has made some weird, weird parsings we need to filter out:

```r
tibble(text = stringr::str_sub(anne$text[!is.na(anne$date)][c(5, 37, 117)], 1, 30),
       date = anne$date[!is.na(anne$date)][c(5, 37, 117)])

## # A tibble: 3 x 2
##   text                             date
##   <chr>                            <dttm>
## 1 "January 1942. No one knows how" 2042-01-19 00:00:00
## 2 "Breakfast: At 9 A.M. daily exc" 2030-09-11 00:00:00
## 3 "the last one, in July 1943. "   2043-07-19 00:00:00

anne$text[!is.na(anne$date)][c(5, 37, 117)] <- NA
anne$date[!is.na(anne$date)][c(5, 37, 117)] <- NA
```

Did Anne write two or more posts in a single day?

```r
anne %>%
  filter(!is.na(date)) %>%
  select(date) %>%
  duplicated() %>%
  sum()

## [1] 6
```

Yep, she did. Let’s dedup these dates by adding a different hour to each one. If we don’t, we’ll soon stumble upon the problem that posts belonging to the same date will be grouped together (in a minute). We’ll use some more packages from the tidyverse suite: purrr and tidyr.

```r
add_hours_to_date <- function(date, index) {
  if (!is.na(date)) {
    date + hours(index %% 24)
  } else {
    as.POSIXct(NA, tz = "UTC")
  }
}

library(purrr)
library(tidyr)

anne %<>%
  mutate(index = 1:n(),
         date = map2(date, index, add_hours_to_date)) %>%
  unnest(date) %>%
  select(text, date)

# Are there duplicate dates now?
anne %>%
  filter(!is.na(date)) %>%
  select(date) %>%
  duplicated() %>%
  sum()

## [1] 0
```

No more duplicates. Now we need to drop the preface and concatenate all lines between every two dates into a single post. This is a bit tricky, but there’s still no need to resort to a for-loop if you harness the power of dplyr. I’m also going to use a function from the zoo package here:

```r
library(zoo)

anne %<>%
  mutate(isItText = is.na(date),
         date = na.locf(date, na.rm = FALSE)) %>%
  filter(isItText & !is.na(date)) %>%
  group_by(date) %>%
  summarise(post = paste(text, collapse = " ")) %>%
  ungroup()

summary(anne)

##       date                          post
##  Min.   :1942-06-12 20:00:00   Length:182
##  1st Qu.:1943-03-05 19:30:00   Class :character
##  Median :1944-01-06 14:30:00   Mode  :character
##  Mean   :1943-09-18 05:25:24
##  3rd Qu.:1944-03-29 04:45:00
##  Max.   :1944-08-01 07:00:00
```

OK, what just happened:

1. Created a boolean indicator `isItText` marking whether a row should be treated as part of a post. I’m going to use it to filter out the rows which contain the dates, which are not part of the posts.
2. Changed `date` to “drag” the last seen date across all NAs with the useful `na.locf` function from the zoo package. This is done so that all rows from the same post get the same identifier (date), letting me group them appropriately.
3. Filtered out the text rows holding dates (e.g. "SUNDAY, JUNE 14, 1942") and the preface rows (which at this point still have NA for their date values).
4. Grouped posts by `date`, which is now the unique identifier of each post.
5. Concatenated each post’s rows into a single string.
6. Ungrouped, because we don’t need a grouped table anymore.

The result: 182 posts, from June 1942 to August 1944.[1]

# Get Sentiment(al)

I will take Social Media Mining with R’s simple approach, defining a sentiment score for post $$i$$:

$$score_i = pos_i - neg_i$$

where $$pos_i$$ is the number of “positive” words in the post, and $$neg_i$$ is the number of “negative” words in the post.

Positive according to whom? That’s where positive/negative lexicons come in handy. Now, a few years back the syuzhet package did not exist. I only heard of it when reading the wonderful blog by Julia Silge, who did amazing analyses of Jane Austen’s works and is the author of the tidytext package. Anyway, the syuzhet package has a convenient `get_sentiment` function which magically gives you a sentiment score for any English text, using various positive/negative lexicons. This is a bit too convenient!
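To make the scoring formula concrete before committing to a lexicon, here is the whole idea on a toy example (the two-word lexicons `toy_pos` and `toy_neg` are made up for illustration):

```r
# score = (number of positive words) - (number of negative words)
toy_pos <- c("dear", "love")    # made-up "positive" lexicon
toy_neg <- c("war", "afraid")   # made-up "negative" lexicon
words <- c("i", "love", "writing", "but", "the", "war", "makes", "me", "afraid")
sum(words %in% toy_pos) - sum(words %in% toy_neg)
## [1] -1
```

One “positive” word minus two “negative” words gives a score of -1.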
Let’s look at the Bing lexicon[2]:

```r
library(syuzhet)

bing <- get_sentiment_dictionary("bing")

bing_pos_words <- bing %>%
  filter(value == 1) %>%
  select(word) %>%
  unlist() %>%
  as.character()

bing_neg_words <- bing %>%
  filter(value == -1) %>%
  select(word) %>%
  unlist() %>%
  as.character()
```

Though this lexicon contains over 6,000 words, it does not contain basic positive/negative words Anne used, like:

```r
any(c("hugging", "prettiest", "kindest", "friends") %in% bing_pos_words)

## [1] FALSE

any(c("hiding", "war", "soldiers", "gun", "shot", "hardheartedness",
      "failings", "darn", "wronged") %in% bing_neg_words)

## [1] FALSE
```

So in general I think you’d want to fit the lexicon you’re choosing to the domain of the text you’re analyzing. Just imagine doing Sentiment Analysis on Facebook users’ chats without having “ROFL” or “Facepalm” in your lexicon! For now I will just add these few words I found in Anne’s diary, but for any serious work you should somehow match the lexicon you’re using to the domain. And of course pay attention to stemming.

```r
bing_pos_words <- c(bing_pos_words,
                    c("hugging", "prettiest", "kindest", "friends"))
bing_neg_words <- c(bing_neg_words,
                    c("hiding", "war", "soldiers", "gun", "shot",
                      "hardheartedness", "failings", "darn", "wronged"))
```

Since we’re now in “Custom Lexicons Land” we need our own `get_sentiment` function. This should be simple: split the posts into words/tokens, ignoring punctuation and numbers, count the number of “positive” words, count the number of “negative” words, and subtract.
Let’s use the magic of the stringr package:

```r
library(stringr)

get_sentiment_score <- function(post) {
  words <- post %>%
    str_replace_all("[[:punct:]]|[[:digit:]]", " ") %>%
    tolower() %>%
    str_split("\\s+") %>%
    unlist()
  sum(words %in% bing_pos_words) - sum(words %in% bing_neg_words)
}
```

Mapping our function over anne’s posts we get:

```r
anne %<>% mutate(score = map_int(post, get_sentiment_score))
head(anne[, c("date", "score")])

## # A tibble: 6 x 2
##   date                score
##   <dttm>              <int>
## 1 1942-06-12 20:00:00     5
## 2 1942-06-14 08:00:00    12
## 3 1942-06-15 15:00:00     1
## 4 1942-06-20 19:00:00    -4
## 5 1942-06-20 22:00:00     0
## 6 1942-06-21 15:00:00    -8
```

# Showing Feelings

The hard part is over. At this stage a simple plot would do to see Anne’s sentiment as a function of time. Let’s make it slightly more interesting and treat these data as a time series, with the xts package:

```r
library(xts)

anne_xts <- xts(anne$score, anne$date)
plot(anne_xts, main = "Anne Frank's Diary: A Sentiment Analysis", cex = 0.5)
```

Nice. You can definitely see a pattern here, but let’s smooth things over, using loess:

```r
lo <- loess(score ~ as.numeric(date), anne)
anne_xts <- cbind(anne_xts, predict(lo, anne$date))
colnames(anne_xts) <- c("score", "loess")

plot(anne_xts$score, main = "Anne Frank's Diary: A Sentiment Analysis", cex = 0.5)
lines(anne_xts$loess, col = "red", lwd = 2)
```

Better. The loess smoothing allows us to see the S-shape pattern of Anne’s emotions during those two years: she starts positive, decreases until a low point around the winter of ’42-’43, then increases as she falls in love with Peter van Daan, then decreases again. It’s also interesting to find out what’s written in the lowest and highest posts in terms of sentiment score:

```r
# top post
anne %>%
  arrange(-score) %>%
  slice(1) %>%
  select(post) %>%
  unlist() %>%
  unname()

# bottom post
anne %>%
  arrange(score) %>%
  slice(1) %>%
  select(post) %>%
  unlist() %>%
  unname()
```

As far as plotting goes, as a rule of thumb ggplot2[3] should always do a better job…

```r
library(ggplot2)
library(broom)

tidy(anne_xts) %>%
  ggplot(aes(x = index, y = value)) +
  geom_line() +
  geom_smooth(method = "loess") +
  ggtitle("Anne Frank's Diary: A Sentiment Analysis")
```

And if you want to get really fancy and go interactive, dygraphs is a great choice:

```r
library(dygraphs)

dygraph(anne_xts, main = "Anne Frank's Diary: A Sentiment Analysis") %>%
  dyRangeSelector()
```

Do not be shy! Play with that RangeSelector!

A final note on plotting: why didn’t I use the fancy low-pass Fourier transform, as in the nice `get_transformed_values` function from the syuzhet package? Because it assumes a fixed interval between “units” of sentiment on the X axis, which is simply not true in the case of Anne’s diary. But here it is nonetheless:

```r
ft_values <- get_transformed_values(
  anne$score,
  low_pass_size = 3,
  x_reverse_len = 100,
  padding_factor = 2,
  scale_vals = TRUE,
  scale_range = FALSE
)

## Warning in get_transformed_values(anne$score, low_pass_size = 3,
## x_reverse_len = 100, : This function is maintained for legacy purposes.
## Consider using get_dct_transform() instead.

plot(ft_values, type = "l",
     main = "Anne Frank's Diary: A Sentiment Analysis - Fourier Transform")
abline(h = 0)
```
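If you do want a fixed-interval series (the assumption the transform above relies on), one option is to first interpolate the scores onto an evenly-spaced grid with base R’s `approx`. A minimal sketch, with synthetic data standing in for `anne$date` and `anne$score`:

```r
# Interpolate an irregularly-spaced series onto 100 evenly-spaced points
set.seed(42)
times  <- sort(sample(1:1000, 50))  # irregular time points (stand-in for dates)
scores <- rnorm(50)                 # stand-in for sentiment scores
regular <- approx(times, scores, n = 100)  # linear interpolation, even grid
length(regular$y)
## [1] 100
```

Whether linearly interpolating sentiment between diary posts is meaningful is of course a judgment call.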

# Wrapping It Up

I like performing trivial analyses on nontrivial data. Here we’ve seen some basic processing of an online text, which does not come without its problems and challenges. We’ve seen some cool tidyverse pipelines achieving pretty complicated results without looping over the data. We’ve seen how to customize even the simplest Sentiment Analysis. We’ve seen some smoothing. And who can forget that RangeSelector!

On a serious note: you should really read Anne Frank’s Diary. It will give you some perspective on what’s important in life.

1. Notice that had we missed the detail about the duplicate dates, we would have ended up with only 176 posts here!

2. Hu, M. & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 168-177.

3. Notice we don’t really need a time series object in ggplot2, though: `ggplot(anne, aes(x = date, y = score)) + geom_line() + geom_smooth(method = "loess")`