It’s been 7 months since my last post. A lot can happen in 7 months, is all I can say. Anyway, in this post I thought I would combine three of my passions:

  • The Tidyverse
  • Web Scraping
  • Network Analysis

So I’m going to scrape the Tidyverse (blogosphere) and perform Network Analysis on it. Or rather, Network Visualization. The result is quite neat, so stay with me here. I will also show how I use Thomas Lin Pedersen’s excellent tidygraph and ggraph packages.

Kiki

So what is the Tidyverse? The Tidyverse is a suite of R packages written by Hadley Wickham, RStudio and other open source collaborators, which has revolutionized how many data professionals extract, manipulate and visualize data in R.

What if, I thought, the Tidyverse were actually a universe. Of functions. What if each function were a planet, its size proportional to its “usage”, and the distance between any two functions (planets) were determined by how well they “played” together - whether people often used them one next to the other. That would be a galaxy I would like to explore.

I thought about scraping the Tidyverse code itself, but then I realized that would be silly: the source code of the Tidyverse packages is probably the least likely place to look for the usage of these functions. So I decided to scrape the Tidyverse blogosphere, through posts linked to in RWeekly. RWeekly is a great site which curates inspiring and creative R posts on a weekly basis.

With the rvest package we’re scraping each of RWeekly’s weekly pages since the beginning of 2017 (~90 weeks or ~90 pages), harvesting all links to R posts.1 2

library(rvest)
library(tidyverse)
library(glue)

get_rweekly_links <- function(year, weeknum) {
  Sys.sleep(1)
  url <- glue("https://rweekly.org/{year}-{weeknum}.html")
  links <- read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    discard(!str_detect(., "^http"))
  return(links)
}

rweekly_links <- tibble(year = c(2017, 2018),
                        weeknum = list(1:52, 1:37)) %>%
  unnest(weeknum) %>%
  mutate(link = map2(year, weeknum,  get_rweekly_links))

rweekly_links
## # A tibble: 89 x 3
##     year weeknum link      
##    <dbl>   <int> <list>    
##  1  2017       1 <chr [73]>
##  2  2017       2 <chr [80]>
##  3  2017       3 <chr [86]>
##  4  2017       4 <chr [76]>
##  5  2017       5 <chr [81]>
##  6  2017       6 <chr [84]>
##  7  2017       7 <chr [83]>
##  8  2017       8 <chr [85]>
##  9  2017       9 <chr [78]>
## 10  2017      10 <chr [82]>
## # ... with 79 more rows

So rweekly_links contains 89 weeks of links to R posts. At this stage I couldn’t help myself; I had to see whether there was an interesting trend in RWeekly’s number of weekly posts:3

library(lubridate)

rweekly_links %>%
  mutate(date = make_date(year) + weeks(weeknum - 1),
         n_posts = map_int(link, length)) %>%
  ggplot(aes(date, n_posts)) +
  geom_line() +
  theme_bw() +
  labs(title = "RWeekly no. of Weekly Posts",
       y = "No. of Posts") +
  ylim(0, 150)

It seems like there is a decline towards Christmas 2017, then a peak of posts once 2018 starts, followed by a steady decline. Maybe the 2018 peak is related to R conferences like rstudio::conf in February.

Now, for each of the links we scrape the R code as plain text.

get_post_r_code <- function(post_url, pb = NULL) {
  if (!is.null(pb)) {
    pb$tick()$print()
  }
  r_code <- read_html(post_url) %>%
    html_nodes(".r, .language-r") %>% 
    html_text() %>%
    str_c(., collapse = " ")
  if (length(r_code) > 0) {
    return(r_code)
  } else {
    # return NA_character_ so map_chr() below doesn't choke on a logical NA
    return(NA_character_)
  }
}

rweekly_r_code <- rweekly_links %>%
  unnest(link) %>%
  distinct(link, .keep_all = TRUE)

pb <- progress_estimated(nrow(rweekly_r_code))

rweekly_r_code <- rweekly_r_code %>%
  mutate(r_code =
           map_chr(link,
                   possibly(get_post_r_code, otherwise = NA_character_),
                   pb = pb)) %>%
  drop_na(r_code) %>%
  distinct(r_code, .keep_all = TRUE)

Notice this could take some time, which is why I added a progress bar. I also called distinct() on the links (and on the actual code, to be safe) because some links appear more than once.

We have our code. 1475 posts. Phase I completed.

rweekly_r_code
## # A tibble: 1,475 x 4
##     year weeknum link                         r_code                      
##    <dbl>   <int> <chr>                        <chr>                       
##  1  2017       1 http://feedproxy.google.com~ "output$acronyms <- DT::ren~
##  2  2017       1 https://privefl.github.io/b~ "print(dim(X)) print(dim(X)~
##  3  2017       1 https://walkerke.github.io/~ "library(leaflet)\nlibrary(~
##  4  2017       1 http://ellisp.github.io/blo~ "#-----functionality-------~
##  5  2017       1 http://ryanhafen.com/blog/p~ "# install packages if not ~
##  6  2017       1 http://www.fightprior.com/2~ "suppressPackageStartupMess~
##  7  2017       2 http://varianceexplained.or~ "library(dplyr)\nlibrary(ti~
##  8  2017       2 http://www.win-vector.com/b~ "library(\"rpart\")\nlibrar~
##  9  2017       2 http://ellisp.github.io/blo~ "# Load up R packages inclu~
## 10  2017       2 http://ropensci.org/blog/te~ "foo_bar <- function(x, y) ~
## # ... with 1,465 more rows

KB

Now we need the Tidyverse functions to look for. I have decided to narrow it down to only the 8 packages that are automatically loaded when you load the tidyverse package. And did you know getNamespaceExports() was a built-in function that returns all exported objects of a given package?!

tidyverse_packs <- c('ggplot2', 'purrr', 'tibble', 'dplyr',
                     'tidyr','stringr', 'readr', 'forcats')

tidyverse_funcs <- tibble(package = tidyverse_packs) %>%
  mutate(func = map(package, getNamespaceExports)) %>%
  unnest(func) %>%
  filter(str_detect(func, '[a-z]'),
         !str_detect(func, '<-|:|__|%|\\+|^\\.'),
         str_length(func) > 2)

tidyverse_funcs
## # A tibble: 1,059 x 2
##    package func                   
##    <chr>   <chr>                  
##  1 ggplot2 scale_fill_brewer      
##  2 ggplot2 scale_size             
##  3 ggplot2 GeomLine               
##  4 ggplot2 ScaleContinuousIdentity
##  5 ggplot2 geom_boxplot           
##  6 ggplot2 guide_legend           
##  7 ggplot2 ScaleDiscreteIdentity  
##  8 ggplot2 Layout                 
##  9 ggplot2 scale_colour_ordinal   
## 10 ggplot2 geom_hex               
## # ... with 1,049 more rows

And tidyverse_funcs holds 1059 Tidyverse functions.

What do I consider functions appearing “together”? I’ve narrowed it down to 3 scenarios, with 3 corresponding regular expressions:

The classic tidyverse pipe scenario:

f1 <- "mutate"; f2 <- "count"
re1 <- glue("(?!\\(){f1}\\(.*\\)[\\s+]*%>%[\\s+]*{f2}\\(")

str_detect("mtcars %>% mutate(log_disp = log(disp)) %>% count(cyl)", re1)
## [1] TRUE

The function-within-a-function scenario:

f1 <- "ggplot"; f2 <- "aes"
re2 <- glue("{f1}\\([^\\(\\)]*{f2}\\(")

str_detect("mtcars %>% ggplot(aes(factor(cyl), disp)) + geom_boxplot()", re2)
## [1] TRUE

The ggplot2 plus operator scenario:

f1 <- "ggplot"; f2 <- "geom_boxplot"
re3 <- glue("(?!\\(){f1}\\(.*\\)[\\s+]*\\+[\\s+]*{f2}\\(")

str_detect("mtcars %>% ggplot(aes(factor(cyl), disp)) + geom_boxplot()", re3)
## [1] TRUE

I also made sure that each of the examples gives a FALSE result for the two other expressions; otherwise we might count the same occurrence more than once. Wrapping all three patterns in a function:

count_tidy_funcs_pair_in_text <- function(f1, f2, tex) {
  re1 <- glue('(?!\\(){f1}\\(.*\\)[\\s+]*%>%[\\s+]*{f2}\\(')
  re2 <- glue('{f1}\\([^\\(\\)]*{f2}\\(')
  re3 <- glue('(?!\\(){f1}\\(.*\\)[\\s+]*\\+[\\s+]*{f2}\\(')
  str_count(tex, re1) +
    str_count(tex, re2) +
    str_count(tex, re3)
}
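As a quick sanity check (a sketch using only the strings and patterns above), each example string should fire its own pattern and neither of the other two:

```r
check_patterns <- function(f1, f2, code) {
  c(pipe   = str_detect(code, glue("(?!\\(){f1}\\(.*\\)[\\s+]*%>%[\\s+]*{f2}\\(")),
    nested = str_detect(code, glue("{f1}\\([^\\(\\)]*{f2}\\(")),
    plus   = str_detect(code, glue("(?!\\(){f1}\\(.*\\)[\\s+]*\\+[\\s+]*{f2}\\(")))
}

# the pipe example should only fire the "pipe" pattern
check_patterns("mutate", "count",
               "mtcars %>% mutate(log_disp = log(disp)) %>% count(cyl)")

# the plus example should only fire the "plus" pattern
check_patterns("ggplot", "geom_boxplot",
               "mtcars %>% ggplot(aes(factor(cyl), disp)) + geom_boxplot()")
```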

We have a function to count the occurrences of a pair of functions (in a specific order), and we can apply it to all code snippets. There’s no need to look at all ~1M4 possible pairs, times 1475 posts, though. That would take a long time. Instead, I first check which Tidyverse functions appear in each post, then check all possible pairs from this narrowed-down list.

Getting all possible functions for each post is easy:

find_all_funcs_in_text <- function(tex, funcs) {
  funcs[str_detect(tex, funcs)]
}

post_codes <- rweekly_r_code$r_code
func_lists <- map(post_codes, find_all_funcs_in_text, funcs = tidyverse_funcs$func)

head(func_lists, 5)
## [[1]]
## [1] "first"     "filter"    "collapse"  "select"    "last"      "str_split"
## [7] "coll"     
## 
## [[2]]
## [1] "xlab"
## 
## [[3]]
## [1] "map"       "intersect" "near"      "nest"     
## 
## [[4]]
##  [1] "scale_colour_brewer" "ylab"                "scale_x_continuous" 
##  [4] "labs"                "ggplot"              "Stat"               
##  [7] "scale_y_continuous"  "aes"                 "ggtitle"            
## [10] "facet_wrap"          "geom_line"           "mutate"             
## [13] "arrange"             "summarise"           "group_by"           
## [16] "desc"                "count"               "gather"             
## [19] "fixed"              
## 
## [[5]]
##  [1] "vars"      "stat"      "rel"       "glimpse"   "mutate_at"
##  [6] "matches"   "mutate"    "mutate_"   "vars"      "glimpse"  
## [11] "read_csv"

Another trick I’m going to use is keeping a long counter table of each possible pair, as a global variable I update whenever I encounter a pair of functions appearing “together”. I will have an update_counter_table function which receives a pair of functions and their co-occurrence count, and updates counter_table globally with the <<- operator:

counter_table <- tidyverse_funcs %>%
  expand(func, func) %>%
  rename(func1 = func, func2 = func1) %>%
  mutate(count = 0)

update_counter_table <- function(func1, func2, n) {
  loc <- which(counter_table$func1 == func1 &
                 counter_table$func2 == func2)
  counter_table[loc, "count"] <<-
    counter_table[loc, "count"] + n
}

counter_table
## # A tibble: 1,065,024 x 3
##    func1      func2            count
##    <chr>      <chr>            <dbl>
##  1 accumulate accumulate           0
##  2 accumulate accumulate_right     0
##  3 accumulate add_case             0
##  4 accumulate add_column           0
##  5 accumulate add_count            0
##  6 accumulate add_count_           0
##  7 accumulate add_row              0
##  8 accumulate add_rownames         0
##  9 accumulate add_tally            0
## 10 accumulate add_tally_           0
## # ... with 1,065,014 more rows

So let’s count! Again, this may take a few minutes, so I added a progress bar:

find_funcs_pairs_and_update <- function(funcs, tex, pb = NULL) {
  if (!is.null(pb)) {
    pb$tick()$print()
  }
  if (length(funcs) > 1) {
    this_text_counter_table <- tibble(func1 = funcs) %>%
      expand(func1, func1) %>%
      rename(func2 = func11) %>%
      mutate(n = map2_int(func1, func2, 
                          count_tidy_funcs_pair_in_text,
                          tex = tex)) %>%
      filter(n > 0) %>%
      pwalk(update_counter_table)
  }
}

pb <- progress_estimated(length(post_codes))
walk2(func_lists, post_codes,
      find_funcs_pairs_and_update, pb = pb)

counter_table <- counter_table %>% filter(count > 0)

counter_table
## # A tibble: 3,154 x 3
##    func1      func2         count
##    <chr>      <chr>         <dbl>
##  1 accumulate select            1
##  2 add_column as.tibble         1
##  3 add_column as_tibble         1
##  4 add_column group_indices     1
##  5 add_column mutate            1
##  6 add_column mutate_if         1
##  7 add_column ungroup           2
##  8 add_count  filter            3
##  9 add_count  group_by          1
## 10 add_count  mutate            1
## # ... with 3,144 more rows

Don’t you want to know which pairs are most frequent?

counter_table %>%
  arrange(-count) %>%
  head(10)
## # A tibble: 10 x 3
##    func1      func2         count
##    <chr>      <chr>         <dbl>
##  1 ggplot     aes            1487
##  2 aes        geom_point      427
##  3 theme      element_text    421
##  4 ggplot     geom_point      374
##  5 group_by   summarise       351
##  6 aes        geom_line       303
##  7 ggplot     geom_line       237
##  8 geom_point aes             226
##  9 theme      element_blank   215
## 10 geom_line  aes             205

Not surprisingly, ggplot() and aes() are the most common pair, since you can’t really create a ggplot without defining the aesthetics. From the dplyr world we also have a representative pair in the top 10: group_by() and summarise().

Another question Tidyverse developers could ask is: which function is most used in the vicinity of function X? If a function is almost always followed by another one (as in the ggplot(df, aes(x, y)) case), this could potentially lead to developing a new function that binds both. Another use I can think of is a potential refactoring of the Tidyverse. A developer might detect that a function from package A is always used within the context of package B, and therefore this function might belong in package B. Let’s look at mutate():

counter_table %>%
  filter(func1 == "mutate") %>%
  arrange(-count) %>%
  head(5)
## # A tibble: 5 x 3
##   func1  func2  count
##   <chr>  <chr>  <dbl>
## 1 mutate ggplot   154
## 2 mutate mutate   142
## 3 mutate select   108
## 4 mutate map       92
## 5 mutate filter    86
counter_table %>%
  filter(func2 == "mutate") %>%
  arrange(-count) %>%
  head(5)
## # A tibble: 5 x 3
##   func1    func2  count
##   <chr>    <chr>  <dbl>
## 1 group_by mutate   186
## 2 filter   mutate   151
## 3 mutate   mutate   142
## 4 ungroup  mutate    85
## 5 arrange  mutate    73

So after mutate() the most common function is ggplot(). And before mutate() the most common function is group_by() (but also ungroup()). Interestingly, one of the most common functions after mutate() is mutate() itself! There might be edge cases which justify this pattern, but in general I would guess people aren’t aware you can use a single mutate() call to mutate multiple variables.
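For anyone in that camp, here’s a minimal illustration (using mtcars rather than the scraped data): a single mutate() call can create several variables at once, and later expressions can even refer to columns created earlier in the same call:

```r
library(dplyr)

mtcars %>%
  mutate(log_disp     = log(disp),        # first new variable
         disp_per_cyl = disp / cyl,       # second, in the same call
         big_engine   = log_disp > 5) %>% # can even use the log_disp just created
  select(disp, log_disp, disp_per_cyl, big_engine) %>%
  head(3)
```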

Resha

How do we convert this amazing table into a network, a graph? I want to try tidygraph and ggraph, which extend the “tidy” philosophy into the Network Analysis domain.

tidygraph needs a nodes table, specifying the nodes attributes, and an edges table, specifying the edges attributes.

Let’s do nodes first. I want each node to have a size proportional to its “usage”, so for each function I count the total number of times it occurs (the way I do it here isn’t entirely accurate, since the same occurrence of a function can sometimes count as the first function (func1) and sometimes as the second (func2), but this only affects node size, so I’ll live with that). I’ll also filter out functions appearing 10 times or fewer. Finally, I’ll take the log() of this number for better visibility.

total_function_count <- function(func) {
  counter_table %>%
    filter(func1 == func | func2 == func) %>%
    pull(count) %>%
    sum()
}

nodes <- tidyverse_funcs %>%
  # some functions appear in more than 1 package!
  distinct(func, .keep_all = TRUE) %>%
  rename(label = func) %>%
  mutate(size = map_dbl(label, total_function_count)) %>%
  filter(size > 10, label != "compose") %>%
  mutate(id = 1:n(), size = log(size), color = recode(package,
                 "ggplot2" = "green",
                 "purrr" = "pink",
                 "tibble" = "orange",
                 "dplyr" = "red",
                 "tidyr" = "yellow",
                 "stringr" = "purple",
                 "readr" = "blue",
                 "forcats" = "brown"))

nodes
## # A tibble: 216 x 5
##    package label                  size    id color
##    <chr>   <chr>                 <dbl> <int> <chr>
##  1 ggplot2 scale_fill_brewer      4.53     1 green
##  2 ggplot2 scale_size             3.56     2 green
##  3 ggplot2 geom_boxplot           4.71     3 green
##  4 ggplot2 guide_legend           4.16     4 green
##  5 ggplot2 coord_sf               3.93     5 green
##  6 ggplot2 annotation_custom      3.43     6 green
##  7 ggplot2 scale_color_viridis_c  3.00     7 green
##  8 ggplot2 geom_step              2.77     8 green
##  9 ggplot2 theme_grey             2.83     9 green
## 10 ggplot2 quo_name               2.71    10 green
## # ... with 206 more rows

We are left with 216 functions.

Now edges is a bit more complicated. For instance, we need to decide whether we want a directed or an undirected graph. Do we need a different edge from geom_line() to geom_point() and from geom_point() to geom_line()? As you will see below I didn’t really care, I just wanted my galaxy. So I chose to treat my graph as undirected, unifying edges between each pair of functions. I also gave a weight to each edge, which can loosely be thought of as P(funcA, funcB|funcA) * P(funcA, funcB|funcB), but ended up not using it here.

edges <- counter_table %>%
  mutate(pair = map2_chr(func1, func2,
                         function(s1, s2)
                           str_c(sort(c(s1, s2)),
                                 collapse = "|"))) %>%
  group_by(pair) %>%
  mutate(joint_count = sum(count)) %>%
  inner_join(nodes, by = c("func1" = "label")) %>%
  rename(size_from = size, package_from = package) %>%
  inner_join(nodes, by = c("func2" = "label")) %>%
  rename(size_to = size, package_to = package,
         from = id.x, to = id.y) %>%
  mutate(weight = (joint_count ^ 2) / (size_from * size_to)) %>%
  distinct(pair, .keep_all = TRUE)

edges
## # A tibble: 2,152 x 14
## # Groups:   pair [2,152]
##    func1 func2 count pair  joint_count package_from size_from  from color.x
##    <chr> <chr> <dbl> <chr>       <dbl> <chr>            <dbl> <int> <chr>  
##  1 add_~ as.t~     1 add_~           1 tibble            2.77   131 orange 
##  2 add_~ as_t~     1 add_~           1 tibble            2.77   131 orange 
##  3 add_~ muta~     1 add_~           1 tibble            2.77   131 orange 
##  4 add_~ muta~     1 add_~           1 tibble            2.77   131 orange 
##  5 add_~ ungr~     2 add_~           2 tibble            2.77   131 orange 
##  6 aes   aes       1 aes|~           1 ggplot2           8.50    93 green  
##  7 aes   anno~     5 aes|~           5 ggplot2           8.50    93 green  
##  8 aes   anno~     1 aes|~           1 ggplot2           8.50    93 green  
##  9 aes   coor~     7 aes|~           7 ggplot2           8.50    93 green  
## 10 aes   coor~     7 aes|~           7 ggplot2           8.50    93 green  
## # ... with 2,142 more rows, and 5 more variables: package_to <chr>,
## #   size_to <dbl>, to <int>, color.y <chr>, weight <dbl>

We have 2152 edges, and creating our tidy_graph is very easy now:

library(tidygraph)

tidy_graph <- tbl_graph(nodes, edges, directed = FALSE)

tidy_graph
## # A tbl_graph: 216 nodes and 2152 edges
## #
## # An undirected multigraph with 1 component
## #
## # Node Data: 216 x 5 (active)
##   package label              size    id color
##   <chr>   <chr>             <dbl> <int> <chr>
## 1 ggplot2 scale_fill_brewer  4.53     1 green
## 2 ggplot2 scale_size         3.56     2 green
## 3 ggplot2 geom_boxplot       4.71     3 green
## 4 ggplot2 guide_legend       4.16     4 green
## 5 ggplot2 coord_sf           3.93     5 green
## 6 ggplot2 annotation_custom  3.43     6 green
## # ... with 210 more rows
## #
## # Edge Data: 2,152 x 14
##    from    to func1 func2 count pair  joint_count package_from size_from
##   <int> <int> <chr> <chr> <dbl> <chr>       <dbl> <chr>            <dbl>
## 1   131   135 add_~ as.t~     1 add_~           1 tibble            2.77
## 2   131   132 add_~ as_t~     1 add_~           1 tibble            2.77
## 3   131   145 add_~ muta~     1 add_~           1 tibble            2.77
## # ... with 2,149 more rows, and 5 more variables: color.x <chr>,
## #   package_to <chr>, size_to <dbl>, color.y <chr>, weight <dbl>

We have a tbl_graph object which combines our nodes and edges tibbles. With this, and ggraph, plotting the network is, again, very easy:

library(ggraph)

ggraph(tidy_graph, layout = "fr") +
  geom_edge_link(alpha = 0.8, edge_colour = "white",
                 show.legend = FALSE) +
  geom_node_point(aes(color = color, size = size),
                  show.legend = FALSE) +
  geom_node_text(aes(label = label), color = "white",
                 size = 3, family = "mono",
                 show.legend = FALSE) +
  theme_graph(background = "black") +
  theme(plot.title = element_text(color = "white",
                                  family = "mono")) +
  scale_color_identity() +
  labs(title = "The Actual Tidyverse")

As you can see, there’s really no point in plotting the edges if we want to see the basic structure. I will also avoid specifying repel = TRUE in geom_node_text(), as this further obscures the picture:

ggraph(tidy_graph, layout = "fr") +
  geom_node_point(aes(color = color, size = size),
                  show.legend = FALSE) +
  geom_node_text(aes(label = label), color = "white",
                 size = 3, family = "mono",
                 show.legend = FALSE) +
  theme_graph(background = "black") +
  theme(plot.title = element_text(color = "white",
                                  family = "mono")) +
  scale_color_identity() +
  labs(title = "The Actual Tidyverse")

And the picture is beautiful. If you know your Network Analysis literature, a similar pattern which comes to mind is from Lada Adamic and Natalie Glance’s excellent paper5, which analyzed the political blogosphere of the 2004 US elections and showed how liberal-oriented blogs link to other liberal-oriented blogs, and the same for conservatives:

In our case, it seems ggplot2 functions (green) “go with” other ggplot2 functions, and dplyr, tidyr and purrr functions (red, yellow and pink) go with other functions from these packages, which of course makes a lot of sense. Most of the time bloggers either manipulate data or plot it.

An exercise for the reader: can you characterize a blogger as more visualization-oriented (ggplot) or more manipulation-oriented (dplyr)?
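One possible sketch (not something I ran for this post): count, per post, how many of its code’s Tidyverse function hits come from ggplot2 vs dplyr, and label the post by whichever family dominates. This reuses the tidyverse_funcs and rweekly_r_code tables built above:

```r
library(tidyverse)

gg_funcs <- tidyverse_funcs %>% filter(package == "ggplot2") %>% pull(func)
dp_funcs <- tidyverse_funcs %>% filter(package == "dplyr") %>% pull(func)

post_orientation <- function(code) {
  # count how many distinct functions of each family appear (followed by "(")
  n_gg <- sum(str_detect(code, str_c(gg_funcs, "\\(")))
  n_dp <- sum(str_detect(code, str_c(dp_funcs, "\\(")))
  case_when(n_gg > n_dp ~ "visualization-oriented",
            n_dp > n_gg ~ "manipulation-oriented",
            TRUE        ~ "balanced")
}

rweekly_r_code %>%
  mutate(orientation = map_chr(r_code, post_orientation)) %>%
  count(orientation)
```

This only classifies single posts; characterizing a blogger would mean grouping posts by domain first, which I leave as part of the exercise.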

JT

So this was really nice, but I want me an actual Tidyverse. A universe. A galaxy. As in planets.

I will use the insane rgl package to make a 3D WebGL component which will give a better sense of this Tidyverse. If it works in your browser, feel free to imagine you’re on the Starship Enterprise, zooming in on the galaxy and rotating it. If not, below are a few screenshots I recorded for your viewing pleasure.

library(igraph)
library(rgl)

layout3d <- layout_with_fr(tidy_graph, dim = 3) %>%
  as_tibble() %>%
  rename(x = V1, y = V2, z = V3) %>%
  mutate(label = V(tidy_graph)$label,
         color = V(tidy_graph)$color,
         size = V(tidy_graph)$size)

with(layout3d, plot3d(x, y, z, type="s", col = color,
                      radius = 0.006 * size, box = FALSE,
                      xlab = "", ylab = "", zlab = "",
                      axes = FALSE))
with(layout3d, text3d(x, y, z, text = label,
                      adj = c(1.5, 1.5), cex = 1,
                      color = "yellow", family = "mono"))
bg3d("black")
view3d(zoom = 0.2)

htmlwidgets::saveWidget(
  rglwidget(elementId = "plot3drgl", width = 700, height = 500),
  "C:/SDAD_materials/tidyverse_widget.html", background = "black", selfcontained = TRUE)

Drake

I think I’ll try defying gravity, and you can’t pull me down!


  1. Notice I tried to be polite about this, Sys.sleep()ing for 1 second between each call to RWeekly, so I wouldn’t overload their server.

  2. This is the first time I’m using the glue package, and I think I’m in love.

  3. Well, links in general; you’ll notice I didn’t actually filter for only R posts, but I’m sure most of the links are in fact R posts.

  4. ~1000 × ~1000

  5. Lada A. Adamic and Natalie Glance. 2005. The political blogosphere and the 2004 U.S. election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery (LinkKDD ’05). ACM, New York, NY, USA, 36-43. DOI=http://dx.doi.org/10.1145/1134271.1134277