A while ago I read through Social Media Mining with R and was fascinated by the subject of Sentiment Analysis. I decided to apply more or less the same analysis to a text which was dear to my heart: Anne Frank: The Diary of a Young Girl. Anne Frank’s diary chronicles over two years of her life (June 1942 - August 1944, ages 13-15), when she was hiding with her family in a secret attic in Amsterdam, during the Nazi occupation of the Netherlands. The Franks were eventually caught, and all family members died a few months later except Otto Frank, Anne’s father, who lived until 1980.

Since over 70 years have passed since Anne’s death, many believe that the Diary should have entered the public domain at the end of 2015, according to European copyright laws (horrifyingly, alongside Adolf Hitler’s Mein Kampf…). However, in a questionable move, the Anne Frank Fonds - the organization receiving the Diary’s royalties - announced it would list Otto Frank as a co-author of the Diary, delaying the release of the text to 2050! Nevertheless, a Dutch court has decided the original version of the Diary may be copied for academic research, and - well, here we are :)

From HTML to something I can work with

The text is out there. Let’s read it by parsing this cumbersome HTML page with the xml2 package:


library(tidyverse)
library(xml2)

# Download the page and pull the diary text out of its <pre> element
read_anne_frank_diary <- function() {
  h <- read_html("https://archive.org/stream/AnneFrankTheDiaryOfAYoungGirl_201606/Anne-Frank-The-Diary-Of-A-Young-Girl_djvu.txt")
  l <- as_list(h)
  tibble(text = read_lines(l[3]$body$div[7]$main$div[9]$pre[[1]]))
}

anne <- read_anne_frank_diary()

head(anne, 10)
## # A tibble: 10 x 1
##    text                                                 
##    <chr>                                                
##  2 ""                                                   
##  3 ""                                                   
##  4 "Anne Frank "                                        
##  5 ""                                                   
##  6 "Edited by Otto H. Frank and Mirjam Pressler "       
##  7 "Translated by Susan Massotty "                      
##  8 ""                                                   
##  9 ""                                                   
## 10 "BOOK FLAP "
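The positional path used in read_anne_frank_diary (l[3]$body$div[7]…) depends on the exact layout of the page and would break if that layout changed. A possibly sturdier sketch, assuming the diary text sits in the page's first <pre> element, selects that element with XPath instead. The inline HTML below is a made-up stand-in for the real page, and using xml_find_first with xml_text is a suggestion of mine, not what this post does.

```r
library(xml2)

# Hypothetical miniature page standing in for the archive.org HTML
html <- "<html><body><div><main><pre>Anne Frank \n\nBOOK FLAP </pre></main></div></body></html>"

h <- read_html(html)

# Select the first <pre> node, wherever it sits in the tree
pre_node <- xml_find_first(h, "//pre")

# Split its text into one element per line
lines <- strsplit(xml_text(pre_node), "\n", fixed = TRUE)[[1]]
print(lines)
```

On the real page the same two calls would replace the long chain of list indices, provided the assumption about the <pre> element holds.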

So currently anne is a tibble with 11242 rows, each holding a single string. My aim is to represent the diary as a table with two columns: a single long string holding one complete post, and the date on which the post was written. Let’s fetch the dates first! Thank God for lubridate (the package with the kinkiest name out there):


library(lubridate)
library(magrittr)  # provides the %<>% assignment pipe

anne %<>%
  mutate(date = mdy(text, tz = "UTC"))

head(anne, 10)
## # A tibble: 10 x 2
##    text                                                date               
##    <chr>                                               <dttm>             
##  2 ""                                                  NA                 
##  3 ""                                                  NA                 
##  4 "Anne Frank "                                       NA                 
##  5 ""                                                  NA                 
##  6 "Edited by Otto H. Frank and Mirjam Pressler "      NA                 
##  7 "Translated by Susan Massotty "                     NA                 
##  8 ""                                                  NA                 
##  9 ""                                                  NA                 
## 10 "BOOK FLAP "                                        NA

This doesn’t look too good at first, but trust me: the mdy function turned each row representing a date (e.g. "SUNDAY, JUNE 14, 1942") into an actual date-time object (e.g. 1942-06-14). If it couldn’t find a date, it returned NA. Now, for some reason the mdy function misses the date of the first post, returning NA instead of 1942-06-12, so I’m going to change this manually:

anne[164, ]
## # A tibble: 1 x 2
##   text             date               
##   <chr>            <dttm>             
## 1 "June 12, 1942 " NA
anne[164, "date"] <- ymd("1942-06-12", tz = "UTC")  # date-time, matching the column type

And, with lubridate version 1.7.1, mdy also produced some weird parses that we need to filter out:

tibble(text = stringr::str_sub(anne$text[!is.na(anne$date)][c(5, 37, 117)], 1, 30),
       date = anne$date[!is.na(anne$date)][c(5, 37, 117)])
## # A tibble: 3 x 2
##   text                           date               
##   <chr>                          <dttm>             
## 1 January 1942. No one knows how 2042-01-19 00:00:00
## 2 Breakfast: At 9 A.M. daily exc 2030-09-11 00:00:00
## 3 "the last one, in July 1943. " 2043-07-19 00:00:00
anne$text[!is.na(anne$date)][c(5, 37, 117)] <- NA
anne$date[!is.na(anne$date)][c(5, 37, 117)] <- NA
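The hard-coded positions c(5, 37, 117) are tied to this particular scrape and lubridate version. A less fragile alternative, sketched here on made-up rows, is to discard any parsed date that falls outside the diary's known span (June 1942 to August 1944). The bounds and the toy data frame are assumptions for illustration, and unlike the code above this sketch only blanks the date, not the text.

```r
# Toy stand-in for `anne`: two genuine entry dates and two mis-parsed ones
toy <- data.frame(
  text = c("SUNDAY, JUNE 14, 1942", "January 1942. No one knows how",
           "Breakfast: At 9 A.M. daily exc", "TUESDAY, AUGUST 1, 1944"),
  date = as.POSIXct(c("1942-06-14", "2042-01-19", "2030-09-11", "1944-08-01"),
                    tz = "UTC"),
  stringsAsFactors = FALSE
)

# Assumed bounds of the diary
first_day <- as.POSIXct("1942-06-12", tz = "UTC")
last_day  <- as.POSIXct("1944-08-02", tz = "UTC")

# A parsed date outside the span cannot be a real entry date
out_of_range <- !is.na(toy$date) & (toy$date < first_day | toy$date >= last_day)
toy$date[out_of_range] <- NA

print(toy)
```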

Did Anne write two or more posts in a single day?

anne %>%
  filter(!is.na(date)) %>%
  select(date) %>%
  duplicated() %>%
  sum()
## [1] 6

Yep, she did. Let’s dedup these dates by adding a different hour to each one. If we don’t, posts written on the same date will be merged into one when we group lines by date in a minute. We’ll use some more packages from the tidyverse suite: purrr and tidyr.

add_hours_to_date <- function(date, index) {
  if (!is.na(date)) {
    date + hours(index %% 24)
  } else {
    as.POSIXct(NA, tz = "UTC")
  }
}


anne %<>%
  mutate(index = 1:n(), date = map2(date, index, add_hours_to_date)) %>%
  unnest(date) %>%
  select(text, date)

# Are there duplicate dates now?
anne %>%
  filter(!is.na(date)) %>%
  select(date) %>%
  duplicated() %>%
  sum()
## [1] 0
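To see what the hour offset does, here is a base-R sketch on invented timestamps (no lubridate needed, since adding 3600 seconds to a POSIXct value is the same as adding one hour). Two posts share 1942-06-20, and the row index taken modulo 24 pushes them apart. One caveat: two same-day posts whose row positions differ by a multiple of 24 would still collide.

```r
# Invented example: rows 2 and 3 are two posts written on the same day
dates <- as.POSIXct(c("1942-06-15", "1942-06-20", "1942-06-20"), tz = "UTC")
sum(duplicated(dates))     # one duplicate before the shift

# Shift each date by (row index mod 24) hours
index <- seq_along(dates)
shifted <- dates + 3600 * (index %% 24)

print(shifted)
sum(duplicated(shifted))   # no duplicates after the shift
```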

No more duplicates. Now we need to drop the preface and concatenate all lines between every two dates into a single post. This is a bit tricky, but there’s no need to resort to a for-loop if you harness the power of dplyr. I’m also going to use a function from the zoo package here:

library(zoo)

anne %<>%
  mutate(isItText = is.na(date), date = na.locf(date, na.rm = FALSE)) %>%
  filter(isItText & !is.na(date)) %>%
  group_by(date) %>%
  summarise(post = paste(text, collapse = " ")) %>%
  ungroup()

summary(anne)
##       date                         post          
##  Min.   :1942-06-12 20:00:00   Length:182        
##  1st Qu.:1943-03-05 19:30:00   Class :character  
##  Median :1944-01-06 14:30:00   Mode  :character  
##  Mean   :1943-09-18 05:25:24                     
##  3rd Qu.:1944-03-29 04:45:00                     
##  Max.   :1944-08-01 07:00:00

OK, what just happened:

  1. Created a boolean indicator isItText indicating whether this row should be treated as part of the post or not. I’m going to use it to filter out rows which contain the dates, not part of the post.
  2. Changed date to “drag” the last seen date across all NAs with the useful na.locf function from the zoo package. This is done so that all rows from the same post would have the same identifier (date) so I will be able to group them appropriately.
  3. Filtered out the text rows holding dates (e.g. "SUNDAY, JUNE 14, 1942") and the preface rows (which at this point still have NA for date values).
  4. Grouped posts by date which is now the unique identifier for each post.
  5. Concatenated each post’s rows into a single string.
  6. Ungrouped, because we don’t need a Grouped table anymore.
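
The steps above can be traced on a toy table. This sketch uses base R only, so carry_forward is a hypothetical stand-in for zoo's na.locf with na.rm = FALSE, and tapply stands in for group_by plus summarise. The rows are invented.

```r
# Invented rows: a preface line, then two dated posts of two lines each
toy <- data.frame(
  text = c("preface line", "June 12, 1942", "post A, line 1", "post A, line 2",
           "SUNDAY, JUNE 14, 1942", "post B, line 1", "post B, line 2"),
  date = as.Date(c(NA, "1942-06-12", NA, NA, "1942-06-14", NA, NA)),
  stringsAsFactors = FALSE
)

# Hypothetical base-R stand-in for zoo::na.locf(x, na.rm = FALSE)
carry_forward <- function(x) {
  idx <- cumsum(!is.na(x))   # how many non-NA values have been seen so far
  idx[idx == 0] <- NA        # rows before the first date stay NA
  x[!is.na(x)][idx]
}

is_text <- is.na(toy$date)            # step 1: rows that are not date headers
toy$date <- carry_forward(toy$date)   # step 2: drag the last seen date down
keep <- is_text & !is.na(toy$date)    # step 3: drop date headers and preface

# steps 4-5: group the remaining lines by date and paste each group together
posts <- tapply(toy$text[keep], toy$date[keep], paste, collapse = " ")
print(posts)
```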

The result: 183 posts [1], from June 1942 to August 1944.

Get Sentiment(al)

I will take Social Media Mining with R’s simple approach, defining a sentiment score for post \(i\):

\(score_i = pos_i - neg_i\)

where \(pos_i\) is the number of “positive” words in the post, and \(neg_i\) the number of “negative” words in the post. Positive according to who? That’s where positive/negative lexicons come in handy.

Now, a few years back the syuzhet package did not exist. I only heard of it when reading the wonderful blog by Julia Silge who did amazing analyses on Jane Austen’s works and is the author of the tidytext package. Anyway, the syuzhet package has a convenient get_sentiment function which magically gives you a sentiment score for any English text, using various positive/negative lexicons. This is a bit too convenient!

Let’s look at the Bing lexicon [2]:


library(syuzhet)

bing <- get_sentiment_dictionary("bing")

# Character vectors of the positive and negative words
bing_pos_words <- bing %>%
  filter(value == 1) %>%
  pull(word)

bing_neg_words <- bing %>%
  filter(value == -1) %>%
  pull(word)

Though this lexicon contains over 6,000 words, it lacks some basic positive/negative words Anne used, like:

any(c("hugging", "prettiest", "kindest", "friends") %in% bing_pos_words)
## [1] FALSE
any(c("hiding", "war", "soldiers", "gun", "shot", "hardheartedness", "failings", "darn", "wronged") %in% bing_neg_words)
## [1] FALSE

So in general I think you’d want to fit the lexicon you’re choosing to the domain of the text you’re analyzing. Just imagine doing Sentiment Analysis on Facebook users’ chats without having “ROFL” or “Facepalm” in your lexicon! So, for now I will just add these few words I found in Anne’s diary, but for any serious work you should somehow match the lexicon you’re using to the domain. And of course pay attention to stemming.

bing_pos_words <- c(bing_pos_words, c("hugging", "prettiest", "kindest", "friends"))
bing_neg_words <- c(bing_neg_words, c("hiding", "war", "soldiers", "gun", "shot", "hardheartedness", "failings", "darn", "wronged"))

Since we’re now in “Custom Lexicons Land” we need our own get_sentiment function. This should be simple: split each post into words/tokens, ignoring punctuation and numbers, count the “positive” words, count the “negative” words, and subtract. Let’s use the magic of the stringr package:


get_sentiment_score <- function(post) {
  # Strip punctuation and digits, lower-case, then split into words
  words <- post %>%
    str_replace_all("[[:punct:]]|[[:digit:]]", " ") %>%
    tolower() %>%
    str_split("\\s+") %>%
    unlist()
  sum(words %in% bing_pos_words) - sum(words %in% bing_neg_words)
}
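
A quick sanity check of the scoring logic, sketched in base R so it runs without stringr or syuzhet. The two tiny lexicons and the sentence are made up, and score_text is a hypothetical name that mirrors get_sentiment_score.

```r
# Made-up miniature lexicons (the real ones come from the Bing dictionary)
pos_words <- c("happy", "love", "friends")
neg_words <- c("war", "hiding", "fear")

score_text <- function(post, pos, neg) {
  # Replace punctuation and digits with spaces, lower-case, split on whitespace
  cleaned <- tolower(gsub("[[:punct:]]|[[:digit:]]", " ", post))
  words <- strsplit(cleaned, "\\s+")[[1]]
  sum(words %in% pos) - sum(words %in% neg)
}

# 2 positive words (happy, friends) minus 3 negative (hiding, war, fear) = -1
score <- score_text("Happy friends, in hiding since 1942: WAR and fear.",
                    pos_words, neg_words)
print(score)
```

Note that matching is on exact word forms, so "friend" would not count unless it were added to the lexicon, which is where stemming comes in.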

Mapping our function to anne’s posts we get:

anne %<>%
  mutate(score = map_int(post, get_sentiment_score))

head(anne[, c("date", "score")])
## # A tibble: 6 x 2
##   date                score
##   <dttm>              <int>
## 1 1942-06-12 20:00:00     5
## 2 1942-06-14 08:00:00    12
## 3 1942-06-15 15:00:00     1
## 4 1942-06-20 19:00:00    -4
## 5 1942-06-20 22:00:00     0
## 6 1942-06-21 15:00:00    -8

Showing Feelings

Hard part is over. At this stage a simple plot would do to see Anne’s Sentiment as a function of time. Let’s make it slightly more interesting, treating these data as a Time Series, with the xts package.


library(xts)

anne_xts <- xts(anne$score, anne$date)
plot(anne_xts, main = "Anne Frank's Diary: A Sentiment Analysis", cex = 0.5)

Nice. You can definitely see a pattern here, but let’s smooth things over, using loess:

lo <- loess(score ~ as.numeric(date), anne)
anne_xts <- cbind(anne_xts, predict(lo, anne$date))
colnames(anne_xts) <- c("score", "loess")
plot(anne_xts$score, main = "Anne Frank's Diary: A Sentiment Analysis", cex = 0.5)

lines(anne_xts$loess, col = "red", lwd = 2)
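
For readers unfamiliar with loess, this self-contained sketch smooths a synthetic, irregularly spaced series in the same spirit: the fit is against the numeric position in time, so uneven gaps between observations are handled naturally. The data are simulated, and the seed, the default span, and all variable names are my own choices rather than anything from the diary analysis.

```r
set.seed(1)

# Simulated irregular observation days over roughly two years
days <- sort(sample(0:780, 150))

# Noisy scores around a slow wave
score <- 5 * sin(days / 120) + rnorm(150, sd = 4)
toy <- data.frame(days = days, score = score)

# Local regression of score on elapsed days (default span)
lo <- loess(score ~ days, data = toy)
toy$smooth <- predict(lo)

# The smoothed series varies far less than the raw one
print(c(raw_sd = sd(toy$score), smooth_sd = sd(toy$smooth)))
```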

Better. The loess smoothing lets us see the S-shaped pattern of Anne’s emotions during those two years: she starts positive, declines to a low point around the winter of ’42-’43, rises as she falls in love with Peter van Daan, then declines again. It is also interesting to see what’s written in the highest- and lowest-scoring posts:

# top post
anne %>%
  arrange(-score) %>%
  slice(1) %>%
  pull(post)

Dearest Kitty, When I think back to my life in 1942, it all seems so unreal. The Anne Frank who enjoyed that heavenly existence was completely different from the one who has grown wise within these walls. Yes, it was heavenly. Five admirers on every street corner, twenty or so friends, the favorite of most of my teachers, spoiled rotten by Father and Mother, bags full of candy and a big allowance. What more could anyone ask for? You’re probably wondering how I could have charmed all those people. Peter says It s ecause I m “attractive,” but that isn’t it entirely. The teachers were amused and entertained by my clever answers, my witty remarks, my smthng face and my critical mind. That’s all I was: a terrible flirt, coquettish and amusing. I had a few plus points, which kept me in everybody’s good graces: I was hardworking, honest and generous. I would never have refused anyone who wanted to peek at my answers, I was magnanimous with my candy, and I wasn’t stuck-up. Would all that admiration eventually have made me overconfident? It’s a good thing that, at the height of my glory, I was suddenly plunged into reality. It took me more than a year to get used to doing without admiration. How did they see me at school? As the class comedian, the eternal ringleader, never in a bad mood, never a crybaby. Was it any wonder that everyone wanted to bicycle to school with me or do me little favors? I look back at that Anne Frank as a pleasant, amusing, but superficial girl, who has nothing to do with me. What did Peter say about me? “Whenever I saw you, you were surrounded by a flock of girls and at least two boys, you were always laughing, and you were always the center of attention!” He was right. What’s remained of that Anne Frank? Oh, I haven’t forgotten how to laugh or toss off a remark, I’m just as good, if not better, at raking people over the coals, and I can still flirt and be amusing, if I want to be . . . But there’s the catch. 
I’d like to live that seemingly carefree and happy life for an evening, a few days, a week. At the end of that week I’d be exhausted, and would be grateful to the first person to talk to me about something meaningful. I want friends, not admirers. Peo- pie who respect me for my character and my deeds, not my flattering smile. The circle around me would be much smaller, but what does that matter, as long as they’re sincere? In spite of everything, I wasn’t altogether happy in 1942; I often felt I’d been deserted, but because I was on the go all day long, I didn’t think about it. I enjoyed myself as much as I could, trying consciously or unconsciously to fill the void with jokes. Looking back, I realize that this period of my life has irrevocably come to a close; my happy-go-lucky, carefree schooldays are gone forever. I don’t even miss them. I’ve outgrown them. I can no longer just kid around, since my serious side is always there. I see my life up to New Year’s 1944 as if I were looking through a powerful magnifying glass. When I was at home, my life was filled with sunshine. Then, in the middle of 1942, everything changed overnight. The quarrels, the accusations — I couldn’t take it all in. I was caught off guard, and the only way I knew to keep my bearings was to talk back. The first half of 1943 brought crying spells, loneliness and the gradual realization of my faults and short- comings, which were numerous and seemed even more so. I filled the day with chatter, tried to draw Pirn closer to me and failed. This left me on my own to face the difficult task of improving myself so I wouldn’t have to hear their reproaches, because they made me so despondent. The second half of the year was slightly better. I became a teenager, and was treated more like a grown-up. I began to think about things and to write stories, finally coming to the conclusion that the others no longer had anything to do with me. 
They had no right to swing me back and forth like a pendulum on a clock. I wanted to change myself in my own way. I realized I could man- age without my mother, completely and totally, and that hurt. But what affected me even more was the realization that I was never going to be able to confide in Father. I didn’t trust anyone but myself. After New Year’s the second big change occurred: my dream, through which I discovered my longing for … a boy; not for a girlfriend, but for a boyfriend. I also discovered an inner happiness underneath my superficial and cheerful exterior. From time to time I was quiet. Now I live only for Peter, since what happens to me in the future depends largely on him! I lie in bed at night, after ending my prayers with the words “Ich Janke air fur all das Cute una Liebe una Schone,”* [* Thank you, God, for all that is good and dear and beautiful.] and I’m filled with joy. I think of going into hiding, my health and my whole being as das Cute; Peter’s love (which is still so new and fragile and which neither of us dares to say aloud), the future, happiness and love as das Liebe; the world, nature and the tremendous beauty of everything, all that splendor, as das Schone. At such moments I don’t think about all the misery, but about the beauty that still remains. This is where Mother and I differ greatly. Her advice in the face of melancholy is: “Think about all the suffering in the world and be thankful you’re not part of it.” My advice is: “Go outside, to the country, enjoy the sun and all nature has to offer. Go outside and try to recapture the happiness within yourself; think of all the beauty in yourself and in everything around you and be happy.” I don’t think Mother’s advice can be right, because what are you supposed to do if you become part of the suffering? You’d be completely lost. On the contrary, beauty remains, even in misfortune. If you just look for it, you discover more and more happiness and regain your balance. 
A person who’s happy will make others happy; a person who has courage and faith will never die in misery! Yours, Anne M. Frank

# bottom post
anne %>%
  arrange(score) %>%
  slice(1) %>%
  pull(post)

My dearest Kitty, At long, long last, I can sit quietly at my table before the crack in the window frame and write you everything, everything I want to say. I feel more miserable than I have in months. Even after the break— in I didn’t feel so utterly broken, inside and out. On the one hand, there’s the news about Mr. van Hoeven, the Jewish question (which is discussed in detail by everyone in the house), the invasion (which is so long in coming), the awful food, the tension, the misera- ble atmosphere, my disappointment in Peter. On the other hand, there’s Bep’s engagement, the Pentecost reception, the flowers, Mr. Kugler’s birthday, cakes and stories about cabarets, movies and concerts. That gap, that enormous gap, is always there. One day we’re laugh- ing at the comical side of life in hiding, and the next day (and there are many such days), we’re frightened, and the fear, tension and despair can be read on our faces. Miep and Mr. Kugler bear the greatest burden for us, and for all those in hiding-Miep in everything she does and Mr. Kugler through his enormous responsibthty for the eight of us, which is sometimes so overwhelming that he can hardly speak from the pent-up tension and strain. Mr. Kleiman and Bep also take very good care of us, but they’re able to put the Annex out of their minds, even if it’s only for a few hours or a few days. They have their own worries, Mr. Kleiman with his health and Bep with her engagement, which isn’t looking very promising lat the moment. But they also have their outings, their visits with friends, their everyday lives as ordinary people, so that the tension is sometimes relieved, if only for a short while, while ours never is, never has been, not once in the two years we’ve been here. How much longer will this increasingly oppressive, unbearable weight press I down on us? The drains are clogged again. 
We can’t run the wa- ter, or if we do, only a trickle; we can’t flush the toilet, so we have to use a toilet brush; and we’ve been putting our dirty water into a big earthenware jar. We can man- age for today, but what will happen if the plumber can’t fix it on his own? The Sanitation Department can’t come until Tuesday. Miep sent us a raisin bread with “Happy Pentecost” written on top. It’s almost as if she were mocking us, since our moods and cares are far from “happy.” We’ve all become more frightened since the van Hoeven business. Once again you hear “shh” from all I sides, and we’re doing everything more quietly. The police forced the door there; they could just as easily do that here too! What will we do if we’re ever. . . no, I mustn’t write that down. But the question won’t let itself be pushed to the back of my mind today; on the contrary, all the fear I’ve ever felt is looming before me in all its horror. I had to go downstairs alone at eight this evening to use the bathroom. There was no one down there, since they were all listening to the radio. I wanted to be brave, but it was hard. I always feel safer upstairs than in that huge, silent house; when I’m alone with those mysterious muffied sounds from upstairs and the honking of horns in the street, I have to hurry and remind myself where I am to keep from getting the shivers. Miep has been acting much nicer toward us since her talk with Father. But I haven’t told you about that yet. Miep came up one afternoon all flushed and asked Father straight out if we thought they too were infected with the current anti-Semitism. Father was stunned and quickly talked her out of the idea, but some of Miep’s suspicion has lingered on. They’re doing more errands for us now and showing more of an interest in our troubles, though we certainly shouldn’t bother them with our woes. Oh, they’re such good, noble people! 
I’ve asked myself again and again whether it wouldn’t have been better if we hadn’t gone into hiding, if we were dead now and didn’t have to go through this misery, especially so that the others could be spared the burden. But we all shrink from this thought. We still love life, we haven’t yet forgotten the voice of nature, and we keep hoping, hoping for. . . everything. Let something happen soon, even an air raid. Nothing can be more crushing than this anxiety. Let the end come, however cruel; at least then we’ll know whether we are to be the victors or the vanquished. Yours, Anne M. Frank

As far as plotting goes, as a rule of thumb ggplot2 [3] should always do a better job…

library(broom)

tidy(anne_xts) %>%
  ggplot(aes(x = index, y = value)) +
  geom_line() +
  geom_smooth(method = "loess") +
  ggtitle("Anne Frank's Diary: A Sentiment Analysis")

And if you want to get really fancy and go interactive, dygraphs is a great choice:

library(dygraphs)
dygraph(anne_xts, main = "Anne Frank's Diary: A Sentiment Analysis") %>%
  dyRangeSelector()

Do not be shy! Play with that RangeSelector!

A final note on plotting: why did I not use the fancy “Low-pass Fourier Transform”, as in the nice get_transformed_values function of the syuzhet package? Because it assumes a fixed interval between “units” of sentiment on the X axis, which is simply not true in the case of Anne’s Diary. But here it is nonetheless:

ft_values <- get_transformed_values(
  anne$score,
  low_pass_size = 3,
  x_reverse_len = 100,
  padding_factor = 2,
  scale_vals = TRUE,
  scale_range = FALSE
)
## Warning in get_transformed_values(anne$score, low_pass_size = 3,
## x_reverse_len = 100, : This function is maintained for legacy purposes.
## Consider using get_dct_transform() instead.
plot(ft_values, type = "l", main = "Anne Frank's Diary: A Sentiment Analysis - Fourier Transform")
abline(h = 0)
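
The unequal spacing is easy to verify. Taking the six post dates printed earlier in the score table, the gaps between consecutive posts already range from zero to five days:

```r
# Dates of the first six posts, as printed in the score table earlier
post_dates <- as.Date(c("1942-06-12", "1942-06-14", "1942-06-15",
                        "1942-06-20", "1942-06-20", "1942-06-21"))

# Days elapsed between consecutive posts
gaps <- as.numeric(diff(post_dates))
print(gaps)   # 2 1 5 0 1
```

A transform that treats the posts as equidistant points ignores these gaps, which is why the loess fit against real time is the safer picture.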

Wrapping It Up

I like performing trivial analyses on nontrivial data. Here we’ve seen some basic processing of an online text, which does not come without its problems and challenges. We’ve seen some cool tidyverse pipelines that achieve pretty complicated results without looping over data. We’ve seen how to customize even the simplest Sentiment Analysis. We’ve seen some smoothing. And who can forget that RangeSelector! On a serious note, you should really read Anne Frank’s Diary. It will give you some perspective on what’s important in life.

  1. Notice that if we had missed the detail about duplicate dates, we would have had 176 posts here!

  2. Hu, M. & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 168-177.

  3. Notice we don’t really need a Time Series object in ggplot2 though: ggplot(anne, aes(x=date, y=score)) + geom_line() + geom_smooth(method='loess')