My Boss and I were trying to devise some metric related to our work, to reflect the behavior of… call it X. “If X is low”, said my Boss, “I want the metric to be low. Then, if X is around 3, I want the metric to reach its maximum. And then it can decrease from there towards zero around 10, and beyond that I don’t care…”. Being the Statistician that I am, I immediately shouted “You’re talking about the Chi-Square distribution!”, to which he replied “Language!”, to which I replied:

x <- seq(0, 20, 0.1)
plot(x, dchisq(x, 5), type = "l", xlab = "X", ylab = "f(X)",
     main = expression(chi[5]^2* " Distribution"), xaxt="n")
axis(1, at = c(0, 3, 5, 10, 15, 20))
abline(v = 3, lty = 2, lwd = 2, col = 2)

To which he replied “Wow, Giora, you might be earning your paycheck after all!”

This got me thinking: What if there was an app for that? And so I made one:

But as always, this was hardly the end:

What’s the point?

OK, so besides the once-a-year need to know a simple line’s equation, what could possibly be the use of turning an image or a drawing into an equation?

Compression

Think about it. It turns out this simple drawing:

Can be turned into a matrix of just 20 x 6 = 120 numbers: 20 lines, each described by 6 polynomial coefficients (these numbers might become clearer later). Sort of a primitive autoencoder:

##         [,1]    [,2]    [,3]      [,4]      [,5]      [,6]
##  [1,]   3.03  2.4619  0.1981  6.98e-04  0.00e+00  0.00e+00
##  [2,]  11.34 -0.4568 -0.1562 -1.15e-04  0.00e+00  0.00e+00
##  [3,]   1.84  0.0627  0.1409  9.89e-04  1.82e-05  3.26e-07
##  [4,]  12.16  0.0000  0.0000  0.00e+00 -1.74e-04 -1.38e-04
##  [5,]   3.49  3.2643 -0.8391  3.55e-06  1.30e-02  0.00e+00
##  [6,]   5.42  0.0000  0.0506  4.70e-05  0.00e+00 -8.74e-05
##  [7,]   1.07 -0.2401  0.0000  0.00e+00  0.00e+00  4.51e-05
##  [8,]   3.67 -2.7203  0.5628  0.00e+00  0.00e+00 -1.46e-03
##  [9,]   2.23  1.5632  0.0000  0.00e+00  0.00e+00 -5.69e-03
## [10,]   7.83 -1.6753  0.0000  6.99e-07  1.77e-02  1.00e-03
## [11,] -14.04  3.7370  0.0000  0.00e+00 -2.49e-06 -3.97e-04
## [12,]  21.61 -3.2244  0.0000  0.00e+00  1.96e-07  3.08e-04
## [13,] -14.95  5.1116  0.0000  0.00e+00  0.00e+00 -2.30e-03
## [14,]  27.58 -6.7116  0.0000  0.00e+00  0.00e+00  2.99e-03
## [15,]   2.99  0.0000  0.0000  2.90e-03  4.38e-05  0.00e+00
## [16,]   3.41  0.0000  0.0000 -3.83e-04 -1.86e-04  0.00e+00
## [17,]   9.93 -1.3583  0.0000  0.00e+00  0.00e+00  0.00e+00
## [18,]   4.27  0.0000  0.0000 -1.06e-02 -1.24e-04  0.00e+00
## [19,]   3.04  0.0000  0.0000  3.73e-03  2.75e-04  0.00e+00
## [20,]   1.15  0.0000  0.0000  2.87e-02  4.01e-04  0.00e+00

Encryption

Sign your name.

There you go, you have formulas instead of your signature (put it on a T-shirt!):

Fun

Remember that cat?

res_list <- readRDS("C:/SDAD_materials/cat_res_list.RData")

x <- seq(0, 10, 0.1)
max_degree <- 5
par(pty = "s")
plot(NULL, NULL, xlim = c(0, 10), ylim = c(0, 10), xlab = "x", ylab = "y")

# rebuild each of the cat's lines from its stored polynomial coefficients
for (res in res_list) {
  mm <- model.matrix(~ poly(x, degree = max_degree, raw = TRUE))
  lines(x, mm %*% res$coef_)
}

Now you have a game to play with your kids!

Look Mom, no Deep Learning!

If you’re new to the field of Data Science, when asked to translate a simple hand-drawn line into a formula you might be thinking: Deep Learning! It may work. But just before you warm up your fancy GPUs, let me give you a tip: Statisticians solved this problem over 200 years ago, and the approach I’m using here has remained basically the same since then, except for one minor addition introduced in the 1990s (more on that later). It’s called Regression.

Take the 2nd degree polynomial, a.k.a. the parabola: \(y = 1 + 2x + 3x^2\). Suppose we don’t know these coefficients, 1, 2 and 3; we see only the \((x, y)\) dots:

x <- seq(-10, 10, 0.5)
plot(x, 1 + 2 * x + 3 * x ^ 2, ylab = "y")

And we assume these dots were made by this polynomial: \(y = \beta_1 + \beta_2x + \beta_3x^2\). The goal of Regression is to recover these coefficients, by finding the estimates \(\hat{\beta}\) which minimize the “least squares criterion”, a.k.a. “the linear model (LM)”, a.k.a. “Ordinary Least Squares (OLS)”:

\(\sum{(y_i - (\hat{\beta_1} + \hat{\beta_2}x_i + \hat{\beta_3}x_i^2))^2}\)

I.e., making the sum of squared residuals between the estimated curve and the actual dots as small as possible. Now, there are many ways to solve this, but it turns out that for simple cases such as this there is actually a formula:

\(\hat{\beta} = (X^TX)^{-1}X^T\vec{y}\)

Where \(\vec{y}\) is the vector of \(y\)s and \(X\) is the so-called “model matrix”:
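
\(X = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{pmatrix}\)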

(in our case \(n\) is ~40 and \(m\), the degree of the polynomial, is 2)
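
Just to see the formula in action, here is a minimal sketch (not part of the app’s code) computing \(\hat{\beta}\) directly for the parabola dots above:

# build the model matrix by hand: columns for 1, x and x^2
x <- seq(-10, 10, 0.5)
y <- 1 + 2 * x + 3 * x ^ 2
X <- cbind(1, x, x ^ 2)

# solve (X'X) beta = X'y, i.e. beta = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
round(c(beta_hat), 2) # should recover 1, 2, 3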

Before we go on, obviously I’m oversimplifying things. Actually:

  1. Regression is not just for the polynomial case; this is a special case often referred to as “Polynomial Regression”.
  2. Regression was not made for curve fitting, and there are other methods perhaps better suited to your needs. In fact, we model the \((x, y)\) dots as \(y = \beta_1 + \beta_2x + \beta_3x^2 + \epsilon\), where \(\epsilon\) is random noise often assumed to be normally distributed. And this is because the “dots” we see often look more like this:
x <- seq(-10, 10, 0.5)
plot(x, 1 + 2 * x + 3 * x ^ 2 + rnorm(length(x), sd = 20), ylab = "y")

In R, the solution for \(\hat{\beta}\) is very easy to obtain:

x <- seq(-10, 10, 0.5)
y <- 1 + 2 * x + 3 * x ^ 2
lm_1 <- lm(y ~ x + I(x^2)) # the intercept term is added by default
summary(lm_1)
## Warning in summary.lm(lm_1): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.08e-13 -1.25e-14 -1.98e-15  7.91e-15  2.38e-13 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 1.00e+00   1.06e-14 9.47e+13   <2e-16 ***
## x           2.00e+00   1.19e-15 1.68e+15   <2e-16 ***
## I(x^2)      3.00e+00   2.25e-16 1.33e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.51e-14 on 38 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 9.03e+31 on 2 and 38 DF,  p-value: <2e-16

Lo and behold, in the “Estimate” column we see our coefficients, 1, 2 and 3. No Deep Learning, only good old shallow Statistics. However, we also get a warning saying “essentially perfect fit: summary may be unreliable”, and this is because, as I said, Regression was not meant for this perfect-fit, simulated scenario.

If you haven’t seen me LASSO, you haven’t lived

I saw the dots, I guessed it was a parabola, the model turned out to be a perfect fit, good for me. But in the app I was thinking about, this process should be automated. What if I assumed this polynomial to be of the 3rd degree?

lm_2 <- lm(y ~ x + I(x^2) + I(x^3))
summary(lm_2)
## Warning in summary.lm(lm_2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.28e-13 -1.04e-14 -2.78e-15  9.72e-15  2.10e-13 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  1.00e+00   1.03e-14  9.74e+13   <2e-16 ***
## x            2.00e+00   2.90e-15  6.91e+14   <2e-16 ***
## I(x^2)       3.00e+00   2.19e-16  1.37e+16   <2e-16 ***
## I(x^3)      -7.60e-17   4.22e-17 -1.80e+00     0.08 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.38e-14 on 37 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 6.38e+31 on 3 and 37 DF,  p-value: <2e-16

I got something. Some \(\hat{\beta_4}\) close to zero but not quite zero, and it’s even “significant at the 10% level”![1]

Even worse, what if I assumed the polynomial to be of the shape \(y = \beta_1 + \beta_3x^2\), with no linear coefficient for \(x\)?

lm_3 <- lm(y ~ I(x^2))
summary(lm_3)
## 
## Call:
## lm(formula = y ~ I(x^2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##    -20    -10      0     10     20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0000     2.8434    0.35     0.73    
## I(x^2)        3.0000     0.0606   49.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.1 on 39 degrees of freedom
## Multiple R-squared:  0.984,  Adjusted R-squared:  0.984 
## F-statistic: 2.45e+03 on 1 and 39 DF,  p-value: <2e-16

It’s not a perfect fit but it’s damn close to one; everything looks “OK” and “significant”, and I wouldn’t have any way of telling that this model is in fact wrong.

So it seems simple Polynomial Regression might not be our final station. There are all sorts of ways of approaching this issue, or of dealing with Regression’s intrinsic problems such as Model Selection: Generalized Additive Models (GAM), Robust Regression, information criteria such as AIC and BIC, and various methods which penalize our least squares criterion for being over-optimistic about unimportant parameters, forcing these to shrink towards zero.

Here I will use the LASSO[2]. The LASSO, popularized by Robert Tibshirani, adds a small \(\lambda\) penalty on our \(\hat{\beta}\) coefficients’ L1 norm:

\(\sum{(y_i - (\hat{\beta_1} + \hat{\beta_2}x_i + \hat{\beta_3}x_i^2))^2} + \lambda\sum_j{|\hat\beta_j|}\)

Which means the more non-zero coefficients there are, the more “penalty” our criterion suffers, leading eventually to more parsimonious or “sparser” solutions, setting unimportant \(\hat{\beta}\)s to exactly zero. This adds some bias to our estimated \(\hat\beta\), but for high-dimensional problems in which we suspect there are far fewer important parameters than possible ones, and certainly for problems in which deciding whether a \(\beta\) coefficient is zero or not is the goal, this is a price we are willing to pay.

I’ll use the glmnet package (by Friedman, Hastie and Tibshirani) for fitting the LASSO solution. Let’s see what happens if I fit a 3rd degree polynomial to our 2nd degree polynomial:

# as before, just a reminder:
x <- seq(-10, 10, 0.5)
y <- 1 + 2 * x + 3 * x ^ 2

library(glmnet)

# creating a model matrix without the intercept column, which glmnet adds by default:
poly_degree <- 3
mm <- model.matrix(~ poly(x, degree = poly_degree, raw = TRUE))[, -1]
colnames(mm) <- paste0("x", 1:poly_degree)

# fitting the LASSO with cross-validation to choose the lambda
cvfit <- cv.glmnet(mm, y)

lasso_coef <- coef(cvfit, s = "lambda.min")
lasso_coef
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                1
## (Intercept) 3.54
## x1          1.62
## x2          2.93
## x3          .

The estimates, as expected, are somewhat biased, but the LASSO was correct in identifying that the coefficient of the 3rd degree term should be exactly zero.

lasso_beta <- c(as.matrix(lasso_coef))
mm <- model.matrix(~ poly(x, degree = poly_degree, raw = TRUE))
y_pred <- mm %*% lasso_beta
plot(x, y, type = "l")
lines(x, y_pred, col = 2)
legend("topright", legend = c("observed", "fitted"), lty = 1, col = 1:2)

The fit is not too bad either. One last example before things start getting weird: \(y=\sin(x)\).

x <- seq(-10, 10, 0.1)
y <- sin(x)
plot(x, y, ylim = c(-5, 5))

Pause and think what polynomial you expect a good method to give you.

# let's choose degree 10 for kicks
poly_degree <- 10
mm <- model.matrix(~ poly(x, degree = poly_degree, raw = TRUE))[, -1]
colnames(mm) <- paste0("x", 1:poly_degree)

# fitting the LASSO with cross-validation to choose the lambda
cvfit <- cv.glmnet(mm, y)

lasso_coef <- coef(cvfit, s = "lambda.min")
lasso_coef
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept)  9.64e-16
## x1           1.74e-01
## x2           .       
## x3          -2.35e-02
## x4           .       
## x5           6.73e-04
## x6           .       
## x7          -5.72e-06
## x8           .       
## x9           1.04e-08
## x10          .

Do you see the pattern? We got something like:

\(y = \hat\beta_1 + \hat\beta_2x + \hat\beta_3x^3 + \hat\beta_4x^5 + \hat\beta_5x^7 + ...\)

With \(\hat\beta\)s decreasing in size and switching signs. Why, if it isn’t similar to the Taylor Series expansion of \(\sin(x)\) around zero!

\(\sin(x) \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots\)

Well, it’s only similar I’m afraid:

lasso_beta <- c(as.matrix(lasso_coef))
mm <- model.matrix(~ poly(x, degree = poly_degree, raw = TRUE))
y_pred <- mm %*% lasso_beta
taylor <- mm %*% c(0, 1, 0, -1/factorial(3), 0, 1/factorial(5), 0, -1/factorial(7), 0, 1/factorial(9), 0)
plot(x, y, type = "l", ylim = c(-5, 5))
lines(x, y_pred, col = 2)
lines(x, taylor, col = 3)
legend("bottomright", legend = c("observed", "fitted", "Taylor"), lty = 1, col = 1:3)

The Taylor polynomial was designed to approximate \(\sin(x)\) around 0. Our polynomial was designed to fit “everywhere”, and so despite its efforts it is less impressive.

And so, you should never trust hand-picked examples. And you shouldn’t treat the LASSO as some kind of magic which always works, but it will suit our needs here.

Formulan

What’s left? Put it in a function get_poly_coef() which receives x, y and the maximum polynomial degree, and returns the coefficients:

get_poly_coef <- function(x, y, max_degree = 5) {
    if (length(x) < 20) {
        warning('x length less than 20')
    }
    # model matrix without the intercept column, which glmnet adds by default
    mm <- model.matrix(~ poly(x, degree = max_degree, raw = TRUE))[, -1]
    # LASSO with cross-validation to choose lambda; return the coefficients as a plain vector
    cvfit <- cv.glmnet(mm, y)
    c(as.matrix(coef(cvfit, s = "lambda.min")))
}

Add some more functions for better LaTeX printing:

polynomize <- function(x, y, max_degree = 5) {
    coef <- get_poly_coef(x, y, max_degree)
    mm <- model.matrix(~ poly(x, degree = max_degree, raw = TRUE))
    list(xy = cbind(x, mm %*% coef),
         formula_ = form_latex(coef, x),
         coef_ = coef)
}

# print a coefficient's absolute value, using more decimal places the smaller its fractional part is
nice <- function(n) {
    if (abs(n) %% 1 < 0.001) {
        sprintf("%.4f", abs(n))
    } else if (abs(n) %% 1 < 0.01) {
        sprintf("%.3f", abs(n))
    } else if (abs(n) %% 1 < 0.1) {
        sprintf("%.2f", abs(n))
    } else {
        sprintf("%.1f", abs(n))
    }
}

signit <- function(n) ifelse(n > 0, "+", "-")            # sign to print before a term
pass <- function(n, eps = 0.000001) abs(n - 0) >= eps    # is a coefficient effectively non-zero?

format_x_range <- function(x) {
    paste0("x\\in(", nice(min(x)), ",\\space", nice(max(x)), ")")
}

form_latex <- function(coef, x) {
    formula_ <- "$$"
    if (length(coef) > 0 && pass(coef[1])) {
        # nice() drops the sign, so print a minus for a negative intercept
        formula_ <- paste0(formula_, ifelse(coef[1] < 0, "-", ""), nice(coef[1]))
    }
    if (length(coef) > 1 && pass(coef[2])) {
        formula_ <- paste0(formula_, signit(coef[2]), nice(coef[2]), "x")
    }
    if (length(coef) > 2) {
        deg <- 2
        for (co in coef[3:length(coef)]) {
            if (pass(co)) {
                formula_ <-
                  paste0(formula_, signit(co), nice(co), "x^{", deg, "}")
            }
            # advance the degree even for skipped (zero) coefficients,
            # so the remaining terms get the right exponents
            deg <- deg + 1
        }
    }
    paste0(formula_, ";\\space ", format_x_range(x), "$$")
}

Here polynomize() is the main function which wraps get_poly_coef() and returns a list containing xy (the original x and the fitted y), formula_ (the polynomial formula in LaTeX) and coef_ (the vector of coefficients).
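
To make the interface concrete, a typical call might look like this (output omitted, since cv.glmnet picks \(\lambda\) via random cross-validation folds):

# the parabola dots from before
x <- seq(-10, 10, 0.5)
y <- 1 + 2 * x + 3 * x ^ 2

res <- polynomize(x, y, max_degree = 3)
res$coef_     # fitted coefficients, intercept first
res$formula_  # the same polynomial as a LaTeX string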

Next, put it in a Shiny app, and deploy it to shinyapps.io so everyone can play with it. You can either fork the code at this gist[3] and run it locally, or play with the app on shinyapps, or even right here if the app is on and the following embedding works in your browser:

If not, at least enjoy the gif I made again…
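
If you’re curious what such an app might look like under the hood, here is a minimal sketch of the idea (this is not the deployed app’s code; it simply wires polynomize() from above to points clicked on a plot, and the object names are made up for illustration):

library(shiny)

ui <- fluidPage(
  plotOutput("canvas", click = "canvas_click"),
  verbatimTextOutput("formula")
)

server <- function(input, output, session) {
  # collect the points the user clicks on the canvas
  pts <- reactiveVal(data.frame(x = numeric(0), y = numeric(0)))
  observeEvent(input$canvas_click, {
    pts(rbind(pts(), data.frame(x = input$canvas_click$x,
                                y = input$canvas_click$y)))
  })
  # fit the polynomial once there are enough points
  fit <- reactive({
    if (nrow(pts()) >= 20) polynomize(pts()$x, pts()$y) else NULL
  })
  output$canvas <- renderPlot({
    plot(pts(), xlim = c(0, 10), ylim = c(0, 10), xlab = "x", ylab = "y")
    if (!is.null(fit())) lines(fit()$xy, col = 2)
  })
  output$formula <- renderText({
    if (!is.null(fit())) fit()$formula_ else "click at least 20 points..."
  })
}

# shinyApp(ui, server)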

Cream-colored ponies and crisp Apple strudels

Can we do this to every image? Maybe. I’ll show you how far I got.

Start simple, Apple’s logo, 300x300:

library(tidyverse)
library(magick)

img <- image_read("C:/SDAD_materials/apple.png")

img

We need a way to isolate the contour of the apple, a.k.a. edge detection, as if it were a line drawing. Then we need a way to separate the apple’s contour into several lines, a.k.a. clustering. Only then can we apply the polynomize() function to each and every one of these lines, to get Apple’s logo as a bunch of formulas. Sounds like a plan?

We can use R’s equivalent of Python’s OpenCV, the great magick package, to speed things up:

img_edges <- img %>% image_edge(radius = 1) %>% image_negate()

img_edges

This is not a lesson in Computer Vision, but if you want some name dropping, what image_edge() does is probably convolving the image with a Sobel kernel, a standard approach to edge detection.
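
For illustration only (magick takes care of all of this internally), these are the two 3x3 Sobel kernels; the edge strength at a pixel is the magnitude of the two gradients obtained by convolving the image with them:

# the two 3x3 Sobel kernels (illustration only, not used by the code below)
sobel_x <- matrix(c(-1, 0, 1,
                    -2, 0, 2,
                    -1, 0, 1), nrow = 3, byrow = TRUE)
sobel_y <- t(sobel_x)
# edge strength ~ sqrt(gx^2 + gy^2), gx and gy being the two convolution results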

We now need to cluster the logo’s contour into separate lines. Let’s get the edges as a simple grayscale 2D matrix, and cast it into a long table in which each pixel is a row with its x location, y location and grayscale value:

img_edges_mat <- img_edges %>%
  image_rotate(90) %>%
  image_data(channels = "gray") %>%
  as.numeric() %>%
  matrix(., nrow(.), ncol(.))

img_dim <- dim(img_edges_mat)

img_pixels <- data.frame(
  x = rep(1:img_dim[1], each = img_dim[2]),
  y = rep(img_dim[2]:1, img_dim[1])) %>%
  mutate(col = map2_dbl(x, y, function(x, y) img_edges_mat[x, y]))

head(img_pixels)
##   x   y col
## 1 1 302   1
## 2 1 301   1
## 3 1 300   1
## 4 1 299   1
## 5 1 298   1
## 6 1 297   1

Now we can filter only those pixels belonging to the edges, i.e. not white, i.e. those with a grayscale value less than 1.

img_pixels_filtered <- img_pixels %>% filter(col < 1)

dim(img_pixels_filtered)
## [1] 1802    3

So out of 300 x 300 = 90000 pixels, we’re left with 1802 relevant pixels.

Next we cluster the edge pixels into separate lines. There are countless ways to do this; it seems to me that simple hierarchical (agglomerative) clustering is the natural solution to the problem: at each stage the two closest groups of pixels are merged, eventually resulting in groups of neighboring pixels, or lines.

hc <- hclust(dist(img_pixels_filtered[, c("x", "y")]))

That’s about it, in R. We need to choose where to cut the hierarchical tree hclust() gives us, i.e. we need to specify k, the number of clusters, or lines. Let’s say 12. Then we can plot the result using a different color for each line:

k <- 12
img_pixels_filtered$line <- cutree(hc, k = k)

img_pixels <- img_pixels %>%
  left_join(img_pixels_filtered, by = c("x" = "x", "y" = "y")) %>%
  replace_na(list(line = 0))

col_palette <- c("white", "red", "green", "blue", "orange", "purple",
                 "cyan", "brown", "yellow", "khaki", "pink", "grey", "maroon")
par(pty = "s")
plot(img_pixels[, c("x", "y")], col = col_palette[img_pixels$line + 1])

Not bad. And only now can we polynomize():

max_degree <- 5
par(pty = "s")
plot(NULL, NULL, xlim = c(0, img_dim[1]), ylim = c(0, img_dim[2]),
     xlab = "x", ylab = "y")
for (sub in 1:k) {
  img_pixels_sub <- subset(img_pixels[, c("x", "y")], img_pixels$line == sub)
  res <- polynomize(img_pixels_sub$x, img_pixels_sub$y, max_degree)
  lines(res$xy, col = "red", lty = 1, lwd = 2)
}

Hehe. We just turned the Apple logo into 12 polynomials, and we can draw it in any R plot. Sort of.

To reiterate:

Please bring honor to us all

Mulan is quite popular with my daughters these days. Let’s try a simple drawing of Mulan.

img <- image_read("C:/SDAD_materials/mulan_drawing.jpg")

img

As you can see, I chose a very simple drawing of Mulan, so we don’t even need edge detection. Let’s repeat the process up to the clustering step.

img_edges_mat <- img %>%
  image_rotate(90) %>%
  image_data(channels = "gray") %>%
  as.numeric() %>%
  matrix(., nrow(.), ncol(.))

img_dim <- dim(img_edges_mat)

img_pixels <- data.frame(
  x = rep(1:img_dim[1], each = img_dim[2]),
  y = rep(img_dim[2]:1, img_dim[1])) %>%
  mutate(col = map2_dbl(x, y, function(x, y) img_edges_mat[x, y]))

img_pixels_filtered <- img_pixels %>% filter(col < 1)

hc <- hclust(dist(img_pixels_filtered[, c("x", "y")]))

Let’s put our plot in a function that can handle different ks:

plot_polynomials <- function(k, max_degree = 5) {
  img_pixels_filtered$line <- cutree(hc, k = k)
  img_pixels <- img_pixels %>%
    left_join(img_pixels_filtered, by = c("x" = "x", "y" = "y")) %>%
    replace_na(list(line = 0))
  
  par(pty = "s")
  plot(NULL, NULL, xlim = c(0, img_dim[1]), ylim = c(0, img_dim[2]),
       xlab = "x", ylab = "y")
  for (sub in 1:k) {
    img_pixels_sub <- subset(img_pixels[, c("x", "y")], img_pixels$line == sub)
    res <- polynomize(img_pixels_sub$x, img_pixels_sub$y, max_degree)
    lines(res$xy, col = "red", lty = 1, lwd = 2)
  }
}

Let’s try for k = 12 as before:

plot_polynomials(k = 12)

Quite worthless, let’s try k = 50 lines:

plot_polynomials(k = 50)

Well we can see Mulan, though the outcome might be considered a tad racist. What about k = 200?

plot_polynomials(k = 200)

The problem with increasing k, of course, is that while your lines become more similar to the original shape, they become less and less “lines” and more like dots, losing what you set out to achieve in the first place.

The flower that blooms in adversity is the most rare and beautiful of all

It’s not perfect. But it’s a start. If you wish to improve on this, say to turn it into a package, I would focus attention on two directions:

  1. Be capable of fitting circles and vertical lines - with Regression this is a problem, as we can only fit functions, where each x value has one and only one y value (see the sketch after this list for one possible direction).
  2. Better edge detection - which is not an easy task, but then you would be able to polynomize() any image.
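
For the first direction, one possible route is a parametric fit: instead of y as a function of x, fit x and y separately as polynomials of a parameter t, the position along the stroke. The helper below, polynomize_xy(), is hypothetical (not part of the code above) and assumes the pixels are already ordered along the stroke:

# sketch only: parametric polynomial fit, x(t) and y(t), reusing get_poly_coef() from above
polynomize_xy <- function(x, y, max_degree = 5) {
  t <- seq_along(x) / length(x)  # "time" along the stroke, assuming points are ordered
  list(x_coef = get_poly_coef(t, x, max_degree),
       y_coef = get_poly_coef(t, y, max_degree))
}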

Have fun! Learn from Mulan!


  [1] Not going to explain this one, see Regression 101.

  [2] Least absolute shrinkage and selection operator.

  [3] I owe a big thank you to this SO answer: https://stackoverflow.com/a/48442522/4095235