One of the best talks in my opinion at rstudio::conf 2018 was actually the first keynote by Prof. Di Cook, of Monash University, titled “To the Tidyverse and Beyond: Challenges for the Future in Data Science”1. Di talked about how she views the plot of a dataset as simply another statistic, a function of this dataset. And as such, we can use a plot to perform statistical inference. Then she described “Visual Inference Protocols” - procedures for doing just that, inferring statistical significance from plots. In this post I will focus on one of these protocols - the “Lineup” - and perform some Deep Learning experiments inspired by a question Di was asked at the end of the lecture.

The Lineup

How would you infer statistical significance from a plot? The idea is clever and straightforward: as with any statistic, perform a permutation test in which the data is permuted a number of times, say 19. Each time, draw the type of plot appropriate to your hypothesis, e.g. a scatterplot for a hypothesis of linear correlation. Put all of these plots, including the original, next to one another and ask someone to pick the plot that is most different, or ask “Which one of the following plots shows the strongest relationship between the two variables?”. Hence, a lineup. Under the null hypothesis, e.g. no correlation, there shouldn’t be any real differences between the plots - they would all look like big “blobs” - and a person is unlikely to pick the original plot as the most different. If, however, the original plot easily suggests a correlation to the human eye, most people should pick it as the most different. We could then run a binomial test to check whether, say, 8 out of 10 people choosing the same plot from a panel of 20 plots is “statistically significant”.
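
For example, here is a minimal sketch (my own illustration, not part of the protocol’s formal definition) of such a binomial test in R: under the null each subject picks the original plot with probability 1/20, so 8 “hits” out of 10 subjects would be very unlikely by chance.

# Hypothetical numbers: 8 out of 10 subjects picked the same plot out of a panel of 20.
# Under the null, each subject picks the original plot with probability 1/20.
binom.test(x = 8, n = 10, p = 1/20, alternative = "greater")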

Let’s see an example with mtcars2. Is there a correlation between a car’s engine displacement (disp) and its horsepower? Which one of the following plots shows the strongest relationship between the two variables?

library(tidyverse)
library(nullabor)

lineup_data_mtcars <- lineup(null_permute("hp"), mtcars)

ggplot(lineup_data_mtcars, aes(disp, hp)) +
  geom_point() +
  facet_wrap(~ .sample)

If you chose 13, you chose the original plot (notice I’m not saying “you are correct”). If enough people out of N subjects pick that plot, we conclude there is a correlation between displacement and horsepower - it is “significant”.

To produce this panel of plots I used Di’s nullabor package; I suggest you install the dev version with devtools::install_github("dicook/nullabor").

What about the usual parametric statistical test for Pearson’s r correlation for a linear relationship?

cor.test(mtcars$disp, mtcars$hp)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$disp and mtcars$hp
## t = 7.0801, df = 30, p-value = 7.143e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6106794 0.8932775
## sample estimates:
##       cor 
## 0.7909486

The sample correlation is quite high (0.79) and the two-sided p-value is low enough (e.g. lower than 0.01), so we’d reject the null hypothesis of no correlation and say that there is a correlation between displacement and horsepower - it is “significant”.

Let’s look at Di’s example. Di took Graham Tierney’s dataset describing, for each of the 50 US states, the percent of the population living in the state capital (capital_pop_percentage) and a government ranking (rank) between 1 and 50 (based on fiscal stability, budget transparency, government digitalization, and state integrity). You can get the data by cloning Graham’s repo and running his city_level_analysis.R script.

So, is there a relation between a state’s capital_pop_percentage and a state’s rank? Let’s see the lineup:

state_level <- readRDS("~/g_tierney_states_data.RData")

lineup_data_states <- lineup(null_permute("rank"), state_level)
## decrypt("2LZg AlRl Ix U6nIRI6x qY")
ggplot(lineup_data_states, aes(capital_pop_percentage, rank)) +
  geom_point() +
  facet_wrap(~ .sample)

If you chose 17, you chose the original plot. I doubt that you did, though, because the relationship doesn’t seem to be significant:

cor.test(state_level$capital_pop_percentage, state_level$rank)
## 
##  Pearson's product-moment correlation
## 
## data:  state_level$capital_pop_percentage and state_level$rank
## t = -0.48865, df = 48, p-value = 0.6273
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3420054  0.2121470
## sample estimates:
##         cor 
## -0.07035524

What Else Is New?

At the end of the lecture Di was asked whether she had tried applying machine learning to these plots, to recognize the most different one automatically. I think her answer was, in short, “not yet”. This is where this post comes in. I’m trying some approaches to make the computer decide which is the most different plot in a given lineup of scatterplots, using convolutional neural networks (a.k.a. Deep Learning), with RStudio’s keras wrapper3. We’ll see the strengths and weaknesses of each approach, but stop and think: a computer can now drive a freaking car. Do you think it will have problems telling whether a scatterplot describes a “significant” relationship, or estimating Pearson’s r correlation?

Let’s build a simulated dataset for the computer to learn from. The basic idea is to simulate pairs of x and y of length n = 50 (the same size as the number of states in Graham’s dataset) from a bivariate normal distribution, with a range of \(\rho\)s (the population parameter Pearson’s r estimates). This gives many x and y pairs, some having r = 0.5, some r = -0.3, some statistically significant (p-value < 0.05), some not. I will “plot”4 each pair as a simple scatterplot and feed these scatterplots to the Deep Learning network. I will then make the computer learn either of the following:

  1. Tell me if a plot is “significant” or not (Classification)
  2. Estimate Pearson’s r correlation of the plot (Regression)

Then, once the computer “sees” a lineup of scatterplots, it will choose the most different plot by picking either the plot with the highest classification score (Approach 1) or the plot with the highest estimated Pearson’s r correlation (Approach 2). Again, it is worth stopping to think: what do you make of these approaches, will they work or not, and how could this be done differently?

Sampling

If you’re not into sampling, probability theory or power analysis, please skip this part.

This is my sampling function for producing a single pair of x and y of length n = 50, with a specific \(\rho\):

sampleBVN <- function(rho, n = 50) {
  mu <- c(0, 0) # Mean
  sigma <- matrix(c(1, rho, rho, 1), 2) # Covariance matrix
  bvn <- MASS::mvrnorm(n, mu = mu, Sigma = sigma) # from MASS package
  bvn <- round(bvn, 1) * 10 + 31 # map typical Normal values in (-3, 3) to roughly 1..61
  bvn <- ifelse(bvn < 1, 1, bvn) # clamp to the 1..60 grid
  bvn <- ifelse(bvn > 60, 60, bvn)
  ct <- cor.test(bvn[, 1], bvn[, 2])
  
  r <- unname(ct$estimate)
  sig <- ct$p.value < 0.05
  return(list(
    bvn = bvn,
    r = r,
    sig = sig
  ))
}

The function returns a list with bvn - the pair itself, r - the sample correlation, and sig - a logical indicating whether the two-sided p-value was below 0.05. If you look closer, I make sure x and y are in fact integers between 1 and 60 by mapping the typical Normal range of (-3) to 3 to the 1 to 60 range. This mapping will later let me skip the actual “plotting” of the x and y pair: I will simply create a binary 60 x 60 matrix in which each (x, y) cell has a value of 1 and all other cells are zeros. This is, in effect, a grayscale scatterplot!
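
For instance, applying the same transformation to a few typical Normal values shows where they land on the 1 to 60 grid (a quick illustration of mine, not part of the original code):

z <- c(-3.2, -1.5, 0, 1.5, 3.2)
pmin(pmax(round(z, 1) * 10 + 31, 1), 60) # -> 1, 16, 31, 46, 60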

Now, sampling 1000 pairs for each \(\rho\) for a few \(\rho\)s:

N <- 1000

sig_rhos <- tibble(rho = rep(c(2:9, -2:-9)/10, each = N)) %>%
  mutate(res = map(rho, sampleBVN),
         r = map_dbl(res, ~.$r),
         sig = map_lgl(res, ~.$sig))

sig_rhos
## # A tibble: 16,000 x 4
##      rho res              r sig  
##    <dbl> <list>       <dbl> <lgl>
##  1 0.200 <list [3]>  0.300  T    
##  2 0.200 <list [3]>  0.0912 F    
##  3 0.200 <list [3]> -0.0121 F    
##  4 0.200 <list [3]>  0.108  F    
##  5 0.200 <list [3]>  0.0238 F    
##  6 0.200 <list [3]>  0.188  F    
##  7 0.200 <list [3]>  0.220  F    
##  8 0.200 <list [3]>  0.0875 F    
##  9 0.200 <list [3]>  0.344  T    
## 10 0.200 <list [3]>  0.128  F    
## # ... with 15,990 more rows

Now, sampling 3000 pairs for each \(\rho\) for a few other \(\rho\)s:

N <- 1000

non_sig_rhos <- tibble(rho = rep(c(-1:1)/10, each = N * 3)) %>%
  mutate(res = map(rho, sampleBVN),
         r = map_dbl(res, ~.$r),
         sig = map_lgl(res, ~.$sig))

non_sig_rhos
## # A tibble: 9,000 x 4
##       rho res               r sig  
##     <dbl> <list>        <dbl> <lgl>
##  1 -0.100 <list [3]> -0.243   F    
##  2 -0.100 <list [3]>  0.0506  F    
##  3 -0.100 <list [3]> -0.111   F    
##  4 -0.100 <list [3]>  0.00863 F    
##  5 -0.100 <list [3]> -0.0865  F    
##  6 -0.100 <list [3]> -0.116   F    
##  7 -0.100 <list [3]> -0.0412  F    
##  8 -0.100 <list [3]> -0.0537  F    
##  9 -0.100 <list [3]> -0.203   F    
## 10 -0.100 <list [3]> -0.0718  F    
## # ... with 8,990 more rows

I’m sampling in this slightly weird way because I want about 10K “significant” pairs and 10K “non-significant” pairs for learning. And I know that for n = 50 observations some \(\rho\)s would, with high probability, yield a significant sample correlation, e.g. 0.9, and some would not, e.g. 0.1. I know this thanks to a simple power analysis:

\(\text{Power}(\rho, \alpha, n) = P(\text{reject the null hypothesis} \mid \rho, \alpha, n)\)

Meaning, power is the probability of declaring a sample’s correlation “significant” when it should be declared so, i.e. when the true \(\rho\) is in fact different from 0 and there is correlation in the population from which the sample was drawn. In R we can use the pwr package to calculate the power for different \(\rho\)s:

library(pwr)

power <- function(rho) pwr.r.test(n = 50, r = rho, sig.level = 0.05, alternative = "two.sided")$power

tibble(rho = seq(-0.9, 0.9, 0.1)) %>%
  mutate(power = map_dbl(rho, power))
## # A tibble: 19 x 2
##       rho  power
##     <dbl>  <dbl>
##  1 -0.900 1.000 
##  2 -0.800 1.000 
##  3 -0.700 1.000 
##  4 -0.600 0.998 
##  5 -0.500 0.967 
##  6 -0.400 0.834 
##  7 -0.300 0.572 
##  8 -0.200 0.289 
##  9 -0.100 0.106 
## 10  0     0.0497
## 11  0.100 0.106 
## 12  0.200 0.289 
## 13  0.300 0.572 
## 14  0.400 0.834 
## 15  0.500 0.967 
## 16  0.600 0.998 
## 17  0.700 1.000 
## 18  0.800 1.000 
## 19  0.900 1.000

As expected, for very low \(\rho\)s the power is very small - a sample of size n = 50 is simply not large enough for a sample from such a population to reliably produce a significant correlation. And look at the power for \(\rho = 0\) - do you understand why it is about 5%?

Anyway, you can see that indeed in my simulated data the chance of deciding a pair is “significant” matches the theoretical power:

tibble(r = seq(-0.9, 0.9, 0.1)) %>%
  mutate(power = map_dbl(r, power)) %>%
  bind_cols(
    rbind(sig_rhos, non_sig_rhos) %>%
      group_by(rho) %>%
      summarise(m = mean(sig)) %>%
      rename(r = rho)
  ) %>%
  ggplot() +
  geom_line(aes(r, power, color = "blue")) +
  geom_line(aes(r, m, color = "red")) +
  ggtitle("Statistical Power of Discovering Pearson's Correlation, N = 50, alpha = 5%") +
  scale_color_discrete(name = "type", labels = c("theoretical", "simulated"))

So, let’s see how many significant and insignificant pairs we have:

rbind(sig_rhos, non_sig_rhos) %>%
  count(sig)
## # A tibble: 2 x 2
##   sig       n
##   <lgl> <int>
## 1 F     10960
## 2 T     14040

We will sample 10K of each, and these will be the raw data for the neural networks:

data_raw <- rbind(sig_rhos, non_sig_rhos) %>%
  group_by(sig) %>%
  sample_n(10000) %>%
  ungroup()

Now, since the neural networks will be reading “plots”, we need the data as the plot matrices I described above. Each plot will be a 60 x 60 binary matrix, and all of them together we’ll bind into one big x array:

plotMat <- function(res) {
  m <- array(0, c(60, 60))
  m[res$bvn] <- 1 # matrix indexing: each (x, y) row of bvn marks one cell
  array(m, c(1, 60, 60, 1))
}

x <- do.call(abind::abind, list(map(data_raw$res, plotMat), along = 1))

dim(x)
## [1] 20000    60    60     1

Let’s see the first x and y pair with the usual plot function and the image function for plotting their corresponding matrix:

plot(data_raw$res[[1]]$bvn, xlab = "", ylab = "")

image(x[1, , , 1])

Cool.

Splitting x into training and testing:

trainSamp <- sample(nrow(x), 10000)

x_train <- x[trainSamp, , , , drop = FALSE]
x_test <- x[-trainSamp, , , , drop = FALSE]

And we’re good to go.

Approach 1 - Classification: Tell us if a Plot is Significant

In this approach, our y is binary. The model will try to classify whether an x and y pair has a statistically significant correlation (label “1”) or not (label “0”):

y_classification <- rep(0:1, each = 10000)

y_train_cl <- y_classification[trainSamp]
y_test_cl <- y_classification[-trainSamp]
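
This relies on the rows of data_raw being ordered with all the non-significant pairs first, which is how the grouped sample_n above arranged them here. A slightly more defensive alternative (my suggestion, not the original code) is to derive the labels directly from the sig column:

# Alternative: tie the labels to data_raw's sig column directly,
# so they cannot get out of sync with the row order.
y_classification <- as.integer(data_raw$sig)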

Now the convolutional neural network, following the very basic MNIST example found in the keras documentation. We’ll run it for 10 epochs, stopping early if no improvement is seen in the validation loss, using 20% of the training data as validation to check that we’re not overfitting:

library(keras)

input_shape <- c(60, 60, 1)
batch_size <- 128
epochs <- 10

# Define model
model_cl <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = input_shape) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = 1, activation = 'sigmoid')

# Compile model
model_cl %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
)

# Train model
history <- model_cl %>% fit(
  x_train, y_train_cl,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = 0.2,
  callbacks = list(callback_early_stopping(
    monitor='val_loss', min_delta = 0.01, patience = 2))
)
## Train on 8000 samples, validate on 2000 samples
## Epoch 1/10
## 8000/8000 [==============================] - 84s - loss: 0.5207 - acc: 0.7298 - val_loss: 0.3328 - val_acc: 0.8475
## Epoch 2/10
## 8000/8000 [==============================] - 84s - loss: 0.3069 - acc: 0.8685 - val_loss: 0.2992 - val_acc: 0.8760
## Epoch 3/10
## 8000/8000 [==============================] - 85s - loss: 0.2312 - acc: 0.9062 - val_loss: 0.1776 - val_acc: 0.9340
## Epoch 4/10
## 8000/8000 [==============================] - 85s - loss: 0.1949 - acc: 0.9243 - val_loss: 0.1784 - val_acc: 0.9305
## Epoch 5/10
## 8000/8000 [==============================] - 84s - loss: 0.1754 - acc: 0.9329 - val_loss: 0.1452 - val_acc: 0.9420
## Epoch 6/10
## 8000/8000 [==============================] - 83s - loss: 0.1432 - acc: 0.9425 - val_loss: 0.1522 - val_acc: 0.9335
## Epoch 7/10
## 8000/8000 [==============================] - 84s - loss: 0.1237 - acc: 0.9527 - val_loss: 0.1143 - val_acc: 0.9490
## Epoch 8/10
## 8000/8000 [==============================] - 83s - loss: 0.1252 - acc: 0.9539 - val_loss: 0.1127 - val_acc: 0.9605
## Epoch 9/10
## 8000/8000 [==============================] - 83s - loss: 0.1126 - acc: 0.9583 - val_loss: 0.1117 - val_acc: 0.9580
## Epoch 10/10
## 8000/8000 [==============================] - 83s - loss: 0.1028 - acc: 0.9613 - val_loss: 0.1175 - val_acc: 0.9555

Let’s see the test accuracy:

scores <- model_cl %>% evaluate(
  x_test, y_test_cl, verbose = 0
)

cat('Test loss:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')
## Test loss: 0.1603153
## Test accuracy: 0.9371

The model reached over 93% accuracy on unseen test data, meaning that it is pretty good at looking at a scatterplot of n = 50 observations and deciding whether their correlation is significant or not. Surprising? Not really.
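
Beyond raw accuracy, one could also look at a simple confusion matrix on the test set (an extra check of my own, not part of the original analysis):

# Predicted labels at a 0.5 threshold vs. the true labels
pred_cl <- as.numeric(predict(model_cl, x_test)) > 0.5
table(predicted = pred_cl, actual = y_test_cl == 1)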

We’ll soon see how well this model does in the Lineup protocol.

Approach 2 - Regression: Tell us the Pearson’s r Correlation

In this approach, our y is a number between (-1) and 1. The model will try to estimate an x and y pair’s Pearson’s r correlation:

y_regression <- data_raw$r

y_train_reg <- y_regression[trainSamp]
y_test_reg <- y_regression[-trainSamp]

Now the convolutional neural network, this time performing regression:

# Define model
model_reg <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = input_shape) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = 1)

# Compile model
model_reg %>% compile(
  loss = "mean_squared_error",
  optimizer = optimizer_adadelta(),
  metrics = "mean_squared_error"
)

# Train model
history <- model_reg %>% fit(
  x_train, y_train_reg,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = 0.2,
  callbacks = list(callback_early_stopping(
    monitor='val_loss', min_delta = 0.001, patience = 2))
)
## Train on 8000 samples, validate on 2000 samples
## Epoch 1/10
## 8000/8000 [==============================] - 85s - loss: 0.0419 - mean_squared_error: 0.0419 - val_loss: 0.0156 - val_mean_squared_error: 0.0156
## Epoch 2/10
## 8000/8000 [==============================] - 84s - loss: 0.0166 - mean_squared_error: 0.0166 - val_loss: 0.0189 - val_mean_squared_error: 0.0189
## Epoch 3/10
## 8000/8000 [==============================] - 84s - loss: 0.0148 - mean_squared_error: 0.0148 - val_loss: 0.0072 - val_mean_squared_error: 0.0072
## Epoch 4/10
## 8000/8000 [==============================] - 85s - loss: 0.0130 - mean_squared_error: 0.0130 - val_loss: 0.0050 - val_mean_squared_error: 0.0050
## Epoch 5/10
## 8000/8000 [==============================] - 83s - loss: 0.0120 - mean_squared_error: 0.0120 - val_loss: 0.0066 - val_mean_squared_error: 0.0066
## Epoch 6/10
## 8000/8000 [==============================] - 84s - loss: 0.0113 - mean_squared_error: 0.0113 - val_loss: 0.0131 - val_mean_squared_error: 0.0131
## Epoch 7/10
## 8000/8000 [==============================] - 83s - loss: 0.0106 - mean_squared_error: 0.0106 - val_loss: 0.0053 - val_mean_squared_error: 0.0053

The test MSE:

scores <- model_reg %>% evaluate(
  x_test, y_test_reg, verbose = 0
)

cat('Test MSE:', scores[[1]], '\n')
## Test MSE: 0.00517295

The model reached an MSE of about 0.005. To convince yourself that this is “good” you could, for example, plot the predicted correlations vs. the real ones. Notice I’m trimming the model’s predictions to the (-1, 1) range.

y_pred <- c(model_reg %>% predict(x_test))
y_pred <- ifelse(y_pred < -1, -1, y_pred)
y_pred <- ifelse(y_pred > 1, 1, y_pred)

plot(y_test_reg, y_pred)

Noice. Notice a few outliers there, e.g. where the true correlation is ~0.4 and the model predicts 0 correlation. Can you come up with a scatterplot that would create such an effect? I’ll get back to that soon.
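
Before moving on, a couple of additional quick checks of my own (beyond the plot) are the correlation between the predicted and true values and the mean absolute error:

cor(y_test_reg, y_pred)        # how tightly the predictions track the truth
mean(abs(y_test_reg - y_pred)) # mean absolute error on the test set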

Line’em Up!

Both models look pretty “good”. But how well will they do in the lineup task?

Reminder: model_cl will simply pick the plot with the highest classification score. model_reg will simply pick the plot with the highest estimated Pearson’s r correlation.
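
Since the same split-into-panels, “plot”, predict, which.max pipeline will be repeated for every lineup below, it could be wrapped in a small helper. This is just a sketch of my own (pickPlot is not part of the original code); it assumes the two plotted variables sit in the first two columns of the lineup data, already scaled to the 1 to 60 range:

pickPlot <- function(lineup_data, model) {
  lineup_data %>%
    split(.$.sample) %>%                       # one data frame per panel
    map(function(d) {
      m <- array(0, c(60, 60))
      m[as.matrix(d[, 1:2])] <- 1              # mark each (x, y) cell
      array(m, c(1, 60, 60, 1))
    }) %>%
    map_dbl(~ as.numeric(predict(model, .x))) %>%
    which.max()
}

For example, once the scaled lineup data below is created, pickPlot(lineup_data_mtcars, model_cl) would return the chosen panel number.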

First, the mtcars horsepower vs. displacement task. Notice that here n = 32, which isn’t the sample size the model was trained on. Also notice that we’re using the scale1to60 function to map any vector of data to the 1 to 60 range of integers, then the lineup function to make permuted datasets again:

scale1to60 <- function(x) {
  as.integer(floor(59 * round((x - min(x)) / (max(x) - min(x)), 2)) + 1)
}
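
A quick sanity check (my addition) that the scaling behaves as intended - the minimum of a vector should map to 1 and the maximum to 60:

range(scale1to60(mtcars$disp)) # should be 1 and 60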

mtcars_1to60 <- mtcars %>%
  select(disp, hp) %>%
  transmute_all(scale1to60)

lineup_data_mtcars <- lineup(null_permute("hp"), mtcars_1to60)

ggplot(lineup_data_mtcars, aes(disp, hp)) +
  geom_point() +
  facet_wrap(~ .sample)

The original plot is plot 16. What will model_cl decide?

plotMat <- function(bvn) {
  # redefined to accept a data frame whose first two columns are the plotted variables
  m <- array(0, c(60, 60))
  m[as.matrix(bvn[, 1:2])] <- 1
  array(m, c(1, 60, 60, 1))
}

lineup_data_mtcars %>%
  split(., .$.sample) %>%
  map(plotMat) %>%
  map_dbl(model_cl$predict, verbose = 0) %>%
  which.max()
## [1] 16

Success. What is model_reg’s decision?

lineup_data_mtcars %>%
  split(., .$.sample) %>%
  map(plotMat) %>%
  map_dbl(model_reg$predict, verbose = 0) %>%
  which.max()
## [1] 16

Success.

Now for the states data:

state_level_1to60 <- state_level %>%
  select(capital_pop_percentage, rank) %>%
  transmute_all(scale1to60)

lineup_data_states <- lineup(null_permute("rank"), state_level_1to60)

ggplot(lineup_data_states, aes(capital_pop_percentage, rank)) +
  geom_point() +
  facet_wrap(~ .sample)

The original plot is plot 4. What will model_cl decide?

lineup_data_states %>%
  split(., .$.sample) %>%
  map(plotMat) %>%
  map_dbl(model_cl$predict, verbose = 0) %>%
  which.max()
## [1] 2

Success - in the sense that the model didn’t pick the original plot, meaning that it does not depict a significant relationship between the variables.

Now for model_reg’s choice:

lineup_data_states %>%
  split(., .$.sample) %>%
  map(plotMat) %>%
  map_dbl(model_reg$predict, verbose = 0) %>%
  which.max()
## [1] 9

Success.

Seems almost too good to be true, right? Both models succeeded where they should and failed where they should, even though in the mtcars example n is different from what the models were trained on.

Let’s give these models something else to eat: mtcars, but this time engine displacement vs. miles per gallon (mpg).

mtcars_1to60 <- mtcars %>%
  select(disp, mpg) %>%
  transmute_all(scale1to60)

lineup_data_mtcars <- lineup(null_permute("mpg"), mtcars_1to60)

ggplot(lineup_data_mtcars, aes(disp, mpg)) +
  geom_point() +
  facet_wrap(~ .sample)

The “right” plot is obvious, isn’t it? It’s plot 13, showing a pretty strong relationship (r = -0.84, p < 0.01). The models should do well, shouldn’t they?
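
For reference, here is the usual parametric test again (run it yourself to see the output), which reproduces the r = -0.84 quoted above:

cor.test(mtcars$disp, mtcars$mpg)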

This is model_cl:

lineup_data_mtcars %>%
  split(., .$.sample) %>%
  map(plotMat) %>%
  map_dbl(model_cl$predict, verbose = 0) %>%
  which.max()
## [1] 10

Fail! Now model_reg:

lineup_data_mtcars %>%
  split(., .$.sample) %>%
  map(plotMat) %>%
  map_dbl(model_reg$predict, verbose = 0) %>%
  which.max()
## [1] 10

Fail!

How can that be? The relationship is so strong! Believe me, it has nothing to do with the fact that n = 32 - you can bootstrap your way to more observations and the models will still fail. It is because the relationship is strong but NOT linear, and both models were trained only on linear patterns. Remember, they only see plots (or plot matrices), and this plot matrix does not look like anything they know, so they will always prefer a linear pattern to a non-linear one.
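
One quick way to see the curvature (my own addition, not from the original post) is to overlay a straight-line fit and a flexible smoother on the original data - the smoother bends away from the straight line:

ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +                  # linear fit
  geom_smooth(method = "loess", se = FALSE, color = "red")  # flexible smoother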

This is fascinating to me because:

  1. it is a great example of the pitfalls of any linear model, demonstrated in the most twisted way - if the data isn’t linear, you should not fit a linear model to it (without any transformation)
  2. it is a great example of how superior human vision is to naive machine vision, even when the latter seems almost 100% accurate. The machine is only accurate with regard to what it was trained on, while humans have an amazing capability for generalization.

Never thought mtcars could serve as an adversarial example, did you?!

About those outliers…

Do you remember the outliers when we plotted model_reg’s predictions vs. the actual correlations? We saw some cases in which the real correlation was ~0.4 and the predicted correlation was ~0. How does this happen?

Let’s look at the largest gap between y_test_reg and y_pred:

ind <- which.max((y_test_reg - y_pred)^2)
y_test_reg[ind]
## [1] 0.2188547
y_pred[ind]
## [1] -0.05763819

Interesting! The real correlation is positive while the model thought it was almost zero.

Let’s look at this x and y pair:

image(x_test[ind, , , 1])

Do you see how this happened? The model essentially ignored a few observations on the left side, as if they were outliers, and considered only the big “blob” in the middle, which has no shape. Is the model wrong? Technically yes, of course, but ask yourself: do you agree with the model or with the technical computation?

In fact, there is a whole branch of statistics called Robust Statistics which assumes there will be outliers in the data and strives to give more accurate statistics once these outliers have been down-weighted or corrected. So is the model wrong, or simply “robust”? Could we have just found a nice version of robust correlation? I wonder.
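
For the curious, here is a rough sketch of my own (with the caveat that points falling on the same cell of the plot matrix are collapsed) comparing the plain Pearson estimate with a rank-based Spearman correlation, which is less sensitive to those few points on the left:

# Recover the (x, y) points from the binary plot matrix
pair <- which(x_test[ind, , , 1] == 1, arr.ind = TRUE)
cor(pair[, 1], pair[, 2])                      # Pearson
cor(pair[, 1], pair[, 2], method = "spearman") # Spearman (rank-based)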

Before you go

Well, that was fun. The entire rstudio::conf was fun. Of course there are plenty of other possible approaches for making the computer perform visual statistical inference. For example, you could feed it the entire panel of plots, forcing the computer to compare them all at once, not one by one. And, to handle my “adversarial example”, you could simply make the training data more varied, covering more kinds of relationships, not just linear ones. Whatever you do, know the limits of your model. And thank you Di Cook for a great keynote, and RStudio for a great conference!


  1. Loved the lecture, still don’t understand why it was called this way.

  2. Yes, I know, how boring, but I don’t have the time to pick something sexier

  3. So in a sense I’m combining both rstudio::conf’s keynotes!

  4. I won’t really plot them; I will put them in a convenient matrix which represents the plot, as you will see