I feel like I should start with an apology: this is a long, barely readable post. But you might earn some satisfaction from it, once you realize I scanned the entire English Wikipedia just to find, or not find, evidence for a 19th-century psychophysics law! Anyway, if you survive this post and wish for a prize of some sort, email me and I will grant you a badge.

Logarithmic Perception

I recently stumbled upon an interesting article published in Science Magazine back in 2008 by Stanislas Dehaene et al., titled “Log or Linear? Distinct Intuitions of the Number Scale in Western and Amazonian Indigene Cultures”. The authors compared the performance of American and Mundurucu participants in a number mapping task:

The participant was presented with a number as dots (or sounds or numerals) and was asked to place this number on a line representing a scale of 1 to 10 in some trials and 1 to 100 in other trials.

Results showed that when the range was 1 to 10, American participants tended to place numbers linearly, evenly spaced, while Mundurucu participants tended to place numbers “logarithmically”, putting more emphasis on the difference between smaller numbers such as 1, 2 and 3 than on the difference between larger numbers such as 8, 9 and 10. However, when the range was 1 to 100 - American participants also tended to place numbers logarithmically!

This finding isn’t new. For example, and you can check this on a friend, most of us would feel pretty good about paying only 20 dollars for a 50-dollar dress, but would feel borderline insulted if offered to pay “only” 500,000 dollars for a 500,030-dollar house. Although in both cases we save 30 dollars, in the first case 30 dollars is a lot off the initial amount of 50 dollars, and in the second case it is nothing off the initial amount of 500,030.

Weber-Fechner Law

The concept of logarithmic perception was formalized by psychophysicist Gustav Fechner in the 19th century. Fechner explained the experimental findings of his partner Ernst Weber with the following formula, a.k.a. the Weber-Fechner Law:

\(p = k\ln{\frac{S}{S_0}}\)

That is to say, the perception \(p\) of a stimulus \(S\) equals a constant \(k\) (which depends on the type of stimulus) times the natural logarithm of the ratio between this stimulus \(S\) and an initial stimulus \(S_0\) causing “zero” perception.

OK, on its own this doesn’t mean much, so let’s take an example. Suppose you are blindfolded and given a weight of 1 kilo in one hand (the initial stimulus \(S_0\)). If you are now given in the other hand a weight of 2 kilos (the stimulus \(S\)), you would surely notice that the weight has increased, though you may not be able to say it has doubled precisely. According to the law you would report an increase of about \(k\ln{2}\). Now suppose you are given a weight of 10 kilos in one hand (\(S_0\)), then a weight of 11 kilos (\(S\)) in the other hand - you might not even notice that the weight has increased! The law puts this intuition into numbers and predicts you would report an increase of \(k\ln{1.1}\), which is close to zero. And in fact the only way to make you feel the same increase you felt going from 1 kilo to 2 kilos, when you start with 10 kilos, is to move to 20 kilos!1
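To put some toy numbers on that, here is a minimal sketch in R; the weber_fechner helper and the choice of \(k = 1\) are mine, purely for illustration:

# Hypothetical helper: perceived change when a stimulus moves from S0 to S,
# per the Weber-Fechner formula, with an arbitrary k = 1
weber_fechner <- function(S, S0, k = 1) k * log(S / S0)

weber_fechner(2, 1)    # 1 kg -> 2 kg: ~0.69, clearly noticeable
weber_fechner(11, 10)  # 10 kg -> 11 kg: ~0.10, barely noticeable
weber_fechner(20, 10)  # 10 kg -> 20 kg: ~0.69, feels like 1 kg -> 2 kg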

Do you see how this relates to logarithmic perception of numbers? We are much better at seeing the difference between 1 and 2 because we are much better at seeing ratios and the number has doubled. We are much worse at seeing the difference between 10 and 11 because we are much worse at seeing additions and the number has gone up by only 10%.

Whorf (not the Klingon)

So Weber and Fechner would claim that in general we’re better at seeing ratios than at seeing additions and this is true for any type of perception. I do not know if they ever thought about why this is so, for example whether there are evolutionary advantages to paying more attention to ratios than to additions.

When it comes to numbers, one hypothesis, called the Sapir-Whorf Hypothesis after linguists Edward Sapir and Benjamin Lee Whorf, does try to provide a reason, and the reason is: Language. Whorfianism isn’t about the logarithmic perception of numbers specifically. It states generally that our experience with language shapes our thoughts and behavior. Think about that for a moment. How many times did you count “one, two, three”? A lot. How many times did you count “sixty-one, sixty-two, sixty-three”? Not that many. So since childhood, one, two and three are a big deal to us. 61, 62 and 63 not so much. Which might also explain the differences found in the Science article between American and Mundurucu participants: the Mundurucu must culturally care much more about the words for one, two and three than they do about those for eight, nine and ten. For American participants the first 10 integers are a big thing.

God damn you, show me some code!

So I’m sitting there thinking: why don’t I scan a big corpus of text, say the entire English Wikipedia, simply count the occurrences of the numbers 1 to 100, and see how their frequencies behave.

But surely Giora, you don’t mean the entire Wikipedia? Girrrl, I do.

Wikipedia actually releases its entire corpus of pages here and explains what each of these compressed files means here. I downloaded the ~15GB enwiki-latest-pages-articles.xml.bz2 file and decompressed it to enwiki-latest-pages-articles.xml, now weighing ~68GB. It’s a regular XML file with over 1B lines; you can see a sample here. And I’m simply gonna read it in chunks of 100K lines at a time, counting the occurrences of the words “one”, “two”, “three”, etc.

library(tidyverse)
library(xfun)

# turn a number into a whole-word regex, e.g. 23 -> "\\btwenty-three\\b"
regex_num <- function(num) str_c("\\b", numbers_to_words(num), "\\b")

n <- 100
search_nums <- map_chr(1:n, regex_num)
# counter[i] will hold the total count of the i-th number-word
counter <- numeric(length(search_nums))

# count occurrences of the i-th number-word in a chunk of lines
count_in_chunk <- function(i, chunk) sum(str_count(chunk, search_nums[i]))
# callback: add this chunk's counts to the global counter
update_counter <- function(x, pos) counter <<- counter + map_int(1:length(search_nums), count_in_chunk, x)

read_lines_chunked("enwiki-latest-pages-articles.xml", SideEffectChunkCallback$new(update_counter), chunk_size = 100000)

Wait, that’s it, 9 lines of code? Almost:

  1. I’m defining a function regex_num which takes a number (e.g. 23) and returns it as a string (“twenty-three”). So in the object search_nums I have the numbers “one”, “two”, “three”, … up to “one hundred”, and I want them searched as whole words so they’re regexified2: c("\\bone\\b", "\\btwo\\b", "\\bthree\\b", ..., "\\bone hundred\\b"). See the short example after this list.

  2. Defining a counter vector of length 100, each of its elements will count the corresponding number-word appearances.

  3. Using the read_lines_chunked function from the readr package to read this massive XML file in chunks of 100K lines at a time. I’m passing it a SideEffectChunkCallback object, feeding it an update_counter function which will update my global counter object with each chunk.
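To make item 1 concrete, here is roughly what regex_num returns for a few inputs (outputs shown as comments; a quick, untested sketch):

regex_num(3)    # "\\bthree\\b"
regex_num(23)   # "\\btwenty-three\\b"
regex_num(100)  # "\\bone hundred\\b"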

A few issues might arise in the astute reader’s mind:

  1. Why are we only counting number-words, what about the numbers themselves?

  2. Why are the 20-something, 30-something etc. numbers hyphenated? (“twenty-three” as opposed to “twenty three”)

  3. You do realize that “twenty-three” also gets counted as “three”, don’t you? Moreover, if the number-word “one hundred twenty-three” appears, it also gets counted as “twenty-three”, etc.

  4. This will take a long time, why not parallelize?

Here are my answers:

  1. This was a compromise. There are plenty of raw numbers in this raw XML which are not part of the article text, and I did not want to count those: every address, every hash code… We can rely more on number-words. Also, actually parsing the XML would take a long time.

  2. I could search for both forms; there’s actually a useful boolean parameter in the numbers_to_words function called hyphen, which defaults to TRUE. This might surprise you, but the more common form is the hyphenated one (see the short example after this list).

  3. Yes, I realize and that’s what I’m dealing with in the next chunk of code.

  4. You are right, I was lazy. But ~10 hours for a one-time effort, for a private blog which no one reads is OK :)
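And regarding item 2, the hyphen parameter mentioned there works like this (a small sketch; outputs shown as comments):

numbers_to_words(23)                 # "twenty-three" (the default, hyphen = TRUE)
numbers_to_words(23, hyphen = FALSE) # "twenty three"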

So we have our counter:

head(counter)
## [1] 7300632 4939875 2581111 1592384 1068104  792180

But we need to subtract all counts of “twenty-one”, “thirty-one”, “forty-one” etc. from the count of “one”. And we need to subtract all counts of “twenty-two”, “twenty-three”, “twenty-four” etc. from the count of “twenty”3:

# a function to get the indices of counter which repeat the count of a digit (1, 2, ...)
get_indices_to_subtract_digit <- function(digit, total_n = n) {
  indices <- which(1:total_n %% 10 == digit)
  indices <- indices[3:length(indices)]
  if (digit == 1) {
    indices <- c(indices, 100)
  }
  return(indices)
}

# a function to get the indices of counter which repeat the count of a full ten (20, 30, ...)
get_indices_to_subtract_ten <- function(ten, total_n = n) {
  indices <- which(1:total_n > ten & 1:total_n < ten + 10)
  return(indices)
}

# total count to subtract for a given number (a digit 1-9 or a round ten 20-90)
how_much_to_subtract <- function(num) {
  if (num > 0 & num < 10) {
    indices_to_subtract <- get_indices_to_subtract_digit(num)
  } else {
    indices_to_subtract <- get_indices_to_subtract_ten(num)
  }
  sum(counter[indices_to_subtract])
}

# subtract the double-counted occurrences from counter[num]
subtract_from_counter <- function(num) {
  if (num %% 1 != 0 || num < 1 || num > 90 || (num > 9 && (num %% 10 != 0 || num == 10))) {
    stop('num should be an integer between 1 and 9 or round tens from 20 to 90')
  }
  counter[num] <<- counter[num] - how_much_to_subtract(num)
}

walk(c(1:9, seq(20, 90, 10)), subtract_from_counter)

And that’s the final counter:

counter
##   [1] 7262356 4920332 2566535 1572877 1039010  778003  468949  438583
##   [9]  285970  406013   89319  125496   58864   47377   65082   47425
##  [17]   28589   36847   17598  117154    9664    8361    6188   10809
##  [25]   14039    5663    4382    4524    3039   62930    2743    4277
##  [33]    3218    2430    4933    3493    2170    2196    1897   40279
##  [41]    1717    2259    1843    2270    3420    1432    1434    2281
##  [49]    1167   50821    1164    1696    1069    1127    1611    1081
##  [57]     950     811     821   24839     730     858     835    1155
##  [65]    1621     871     745     691     694   13251     541    1092
##  [73]     572     626    2142     627     589     543     457    9727
##  [81]     556     512     499     756     787     501     440     641
##  [89]     381    7708     358     488     352     334     541     509
##  [97]     310     310     746   20803

So counter summarizes our observed data in a compact manner. The actual observed data is ~7M “one”s, ~5M “two”s, etc.:

data <- rep(1:n, counter)

head(data)
## [1] 1 1 1 1 1 1
tail(data)
## [1] 100 100 100 100 100 100

Or in a histogram (really a barplot since I’m asking for 100 bins):

tibble(num = data) %>%
  ggplot(aes(x = num)) +
  geom_histogram(bins = n, fill = "red", alpha = 0.5) +
  labs(x = "Number", y = "No. of occurences", title = "No. of Number-Word Occurrences in English Wikipedia") +
  theme_bw() +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = seq(0, n, 10))

Log, Exponential, Face, I Ain’t Bothered!

First, it is already interesting to see that there is indeed a sharp decline in frequency as numbers increase. Second, although all numbers have positive counts, from about 20 onwards we can only see the jumps for 30, 40, 50 etc., up until 100. This makes sense: as I’ve commented, one rarely writes the number 53 as “fifty-three”, but there’s a high chance one will write the round number 50 as “fifty”.

How would you show that this sharp decline, a.k.a. a “decaying function”, is in line with the “logarithmic perception” phenomenon the Science researchers demonstrated? You could simply invert this histogram and see that its outline matches the logarithmic pattern the researchers got. But I’m a Statistician, so what I know is how to fit distributions :) And the proper distribution to fit here is the Exponential Distribution.

If \(X\) is an exponentially distributed random variable, its probability density function (PDF) looks like this:

\(f(X) = \lambda {e}^{-\lambda x};\space X > 0, \space \lambda > 0\)

Which means we only need to find “the best” \(\lambda\), the rate parameter, assuming our distribution is indeed exponential, and if we get some kind of an acceptable “goodness of fit” measure or other diagnostics saying the fit is “good”, we can be happy.

Now the most common way for fitting a distribution to the data and getting this \(\lambda\) is through Maximum Likelihood Estimation or MLE, which isn’t complicated but is out of scope, plus we have R at our disposal so why worry our pretty little heads about it?
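(If you do want to peek under the hood: for the exponential, the MLE has a simple closed form, the reciprocal of the sample mean, so a one-liner works as a sanity check against fitdistr below. This is just a side check, not part of the pipeline.)

# Closed-form MLE for the exponential rate: lambda_hat = 1 / mean(x)
1 / mean(data)  # should match the rate reported by fitdistr (~0.254)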

library(MASS)

fit_exp <- fitdistr(data, "exponential")
fit_exp
##        rate    
##   2.540445e-01 
##  (5.577092e-05)

We got: \(\lambda = 0.25\). Let’s first draw this \(Exp(0.25)\) distribution right next to the (normalized to 1) histogram and see if it fits:

tibble(num = data) %>%
  ggplot(aes(x = num)) +
  geom_histogram(bins = n, fill = "red", alpha = 0.5, aes(y = ..density..)) +
  labs(x = "Number", y = "% of occurrences / f(X)", title = "No. of Number-Word Occurrences in English Wikipedia - Exponential Fit") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, n, 10)) +
  stat_function(fun = dexp, color = "blue", args = list(rate = fit_exp$estimate))

Personally I’m not impressed! The decay here looks more extreme than the exponential decay. We can also draw a QQ-plot, comparing the quantiles of our actual data to the theoretical quantiles of the \(Exp(0.25)\) distribution:

theoretical_exp <- rexp(20000, rate = fit_exp$estimate)
qqplot(data, theoretical_exp, ylim = c(0, n), ylab = "Exp(0.25) Theoretical Quantiles", xlab = "Our Count Data Quantiles")
abline(0, 1, col = "red")

If the theoretical and empirical distributions matched, they would have similar quantiles and we would see the dots arranged close to the identity line, here in red. Alas, after the first ~50 quantiles, the dots completely diverge from the red line. The theoretical distribution has many more low quantiles and barely goes above 40.

We can also compare some distribution metrics for our data versus the \(Exp(0.25)\) distribution:

rbind(c(data = "empirical", round(summary(data), 2)), c(data = "theoretical", round(summary(theoretical_exp), 2))) %>%
  as_tibble() %>%
  knitr::kable()
data          Min.   1st Qu.   Median   Mean   3rd Qu.   Max.
empirical     1      1         2        3.94   4         100
theoretical   0      1.11      2.76     3.94   5.51      37.79

The numbers do not align, especially “at the end” of the distribution.

Finally, if you look at the Exponential Distribution, how will this density function look if you take its log?
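A quick check on paper before plotting: taking the log of the exponential density gives

\(\log{f(x)} = \ln{\lambda} - \lambda x\)

which is linear in \(x\) with slope \(-\lambda\) and intercept \(\ln{\lambda}\) - exactly the red line drawn below.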

hist_obj <- hist(theoretical_exp, breaks = seq(0, max(theoretical_exp) + 1, 1), plot = FALSE)

tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/20000)) %>%
  ggplot(aes(x = num, y = log_freq)) +
  geom_point(stat = "identity", col = "blue") +
  labs(x = "Number", y = "log(% of occurrences)", title = "Theoretical Exponential Data: log(Freq(X)) vs. X") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, max(theoretical_exp), 10)) +
  geom_abline(intercept = log(dexp(0, fit_exp$estimate)), slope = -fit_exp$estimate, col = "red")

That’s right, a straight line! And the slope? \(-\lambda\). Where in our case:

hist_obj <- hist(data, breaks = seq(0, n, 1), plot = FALSE)

tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/length(data))) %>%
  ggplot(aes(x = num, y = log_freq)) +
  geom_point(stat = "identity", col = "blue") +
  labs(x = "Number", y = "log(% of occurrences)", title = "Empirical Data: log(Freq(Number)) vs. Number") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, n, 10)) +
  geom_abline(intercept = log(dexp(0, fit_exp$estimate)), slope = -fit_exp$estimate, col = "red")

Not a very straight line, perhaps only at its beginning. You do remember what the “jumping dots” stand for, don’t you?

So, the distribution of the number-words from 1 to 100 in English Wikipedia does not seem to be exponential, and it doesn’t seem to fully explain the “logarithmic perception” of numbers phenomenon seen by the Science article researchers and by Fechner.

But what is it?

Alright. We’ll call it a Zipf.

The linguist George Zipf was on to a mind-blowing phenomenon. By simply counting words in any big corpus of text (and this was 1935, no computers!), he saw that the frequency of any word is inversely proportional to its rank in the frequency table:

\(n \propto r^{-a}\)

Suppose the most common word (rank \(r = 1\)), say the word “the”, appears 10,000 times (frequency \(n = 10000\)). Suppose the exponent \(a\) is 1. Then Zipf’s Law would predict that the frequency of the next most common word (rank \(r = 2\)) would be \(2^{-1} = \frac{1}{2}\) as large, 5,000 times. And the frequency of the third most common word would be \(\frac{1}{3}\) as large as that of the first, about 3,333 times, etc.
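A tiny sanity check of that arithmetic in R (a throwaway sketch, using the 10,000 count and \(a = 1\) from the example above):

# Expected Zipf frequencies for ranks 1 to 5, assuming the top word
# appears 10,000 times and the exponent a is 1
a <- 1
round(10000 * (1:5)^(-a))  # 10000 5000 3333 2500 2000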

If you take the log of both sides of this relation:

\(\log(n) \sim -a\log(r)\)

You will see that the log frequency of a word should linearly descend by \(a\) units as the log rank increases. So on a log-log plot we expect to see a decreasing straight line, and amazingly this is exactly what we see in many languages4.

Let us plot the log frequencies of our number-words versus their log rank. Will the pattern be a straight line? In other words, will it Zipf?

lm1 <- lm(log(counter) ~ log(rank(-counter)))

tibble(n = 1:n, log_rank = log(rank(-counter)), log_count = log(counter)) %>%
  ggplot(aes(x = log_rank, y = log_count)) +
  geom_point(stat = "identity", col = "blue") +
  labs(x = "log(rank of number-word)", y = "log(frequency)", title = "No. of Number-Word Occurrences in English Wikipedia - Log-Log Plot") +
  theme_bw() +
  geom_abline(intercept = lm1$coef[1], slope = lm1$coef[2], col = "red")

Well that’s more like it! And, as you can see, I also fitted a straight line to these dots, which means I can also say what the exponent \(a\) is here (notice I’m taking the negative of the slope coefficient):

print(str_c('Zipf exponent: ', round(-lm1$coef[2], 2)))
## [1] "Zipf exponent: 2.6"

This is what I call “Zipf fitting”. An alternative would be to look at the ratios between successive counts and see that they are more or less constant:

ratios <- tibble(num = 1:n, m1 = counter, m2 = lead(counter)) %>% mutate(ratio = round(m1/m2, 2))
head(ratios) %>%
  knitr::kable()
num   m1        m2        ratio
1     7262356   4920332   1.48
2     4920332   2566535   1.92
3     2566535   1572877   1.63
4     1572877   1039010   1.51
5     1039010   778003    1.34
6     778003    468949    1.66

And the average of these ratios is ~3, similar to our \(a\) of 2.6.

Pareto is the new Normal

How about that distribution though? It doesn’t seem to be exponential; it seems more extreme. Many modern-age quantities are distributed just like that: extreme, with a few values accounting for the great bulk of the distribution, followed by a long tail of values carrying barely visible bulk. Other examples include the no. of visitors per site (it has been estimated that Google, Amazon, Facebook and a few other social media websites account for 80% or more of internet traffic), wealth distribution (you might have heard that the top 1% of the population holds a hugely disproportionate share of the world’s wealth), and the size of city populations (a few mega-cities holding most of the urban population, and many small cities).

If you’re an Economics major you might be shouting by now “Ooh ooh! Power-Law! Pareto!” and you’d be right. Vilfredo Pareto was an Italian scientist who noticed a very similar (equivalent, some would argue) phenomenon to Zipf’s: at the end of the 19th century, about 80% of the land in Italy was owned by about 20% of its population. This would later be generalized as the Pareto Principle, or the 80/20 rule.

In general we could speak of the Power-Law Distribution. But there’s actually a distribution named after Pareto, the Pareto Distribution, so let’s use that.

If a random variable \(X\) is Pareto distributed, its probability density function would be:

\(f(X) = \frac{\alpha x_{min}^\alpha}{x^{\alpha + 1}};\space X > x_{min} > 0, \space \alpha > 0\)

You can see that the Pareto Distribution has only two parameters: \(x_{min}\), which is the minimum of \(X\) (and is thus estimated by the sample minimum), and the exponent \(\alpha\). Again we’ll use MLE to estimate these. Now there’s no “pareto” distribution in base R, so using fitdistr(data, "pareto") won’t work. You have a number of options here:

  • Use the fitdistrplus library which can accept custom distributions or the “pareto” distributions found in the actuar or VGAM libraries
  • Use the formulation of the Power-Law Distribution instead of Pareto; then you have the fit_power_law function from the igraph library or the estimate_pars function from the poweRlaw library5
  • Go manual

I went manual. You can see how I got the MLE formula here.
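For reference, the estimator that the function below implements is the standard Pareto MLE, with \(x_{min}\) taken as the sample minimum (here \(N\) is the sample size):

\(\hat{x}_{min} = \min_i{x_i}, \space\space \hat{\alpha} = \frac{N}{\sum_{i=1}^{N}\ln{(x_i / \hat{x}_{min})}}\)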

fit_pareto_dist <- function(data) {
  n <- length(data)
  x_min <- min(data)
  alpha <- n/sum(log(data)-log(x_min))
  return(list(x_min = x_min, alpha = alpha))
}

fit_pareto <- fit_pareto_dist(data)
fit_pareto
## $x_min
## [1] 1
## 
## $alpha
## [1] 1.153942

Let’s see the normalized histogram with the Pareto distribution on top of it, as we did with the Exponential Distribution:

tibble(num = data) %>%
  ggplot(aes(x = num)) +
  geom_histogram(bins = n, fill = "red", alpha = 0.5, aes(y = ..density..)) +
  labs(x = "Number", y = "% of occurrences / f(X)", title = "No. of Number-Word Occurrences in English Wikipedia - Pareto Fit") +
  theme_bw() +
  ylim(c(0, 0.4)) +
  scale_x_continuous(breaks = seq(0, n, 10)) +
  stat_function(fun = VGAM::dpareto, color = "blue", args = list(scale = fit_pareto$x_min, shape = fit_pareto$alpha))

Looks much better. And the QQ plot:

set.seed(42)
theoretical_pareto <- VGAM::rpareto(20000, scale = fit_pareto$x_min, shape = fit_pareto$alpha)
qqplot(data, theoretical_pareto, ylab = "Pareto(1, 1.15) Theoretical Quantiles", xlab = "Our Count Data Quantiles")
abline(0, 1, col = "red")

Looks better than the Exponential Distribution QQ plot, but notice the Pareto Distribution reaches much higher values than 100. That’s actually OK because the corpus has much higher numbers than 100; the 100 upper limit is our own restriction.

Comparing some distribution metrics for our data versus the \(Pareto(1, 1.15)\) distribution:

rbind(c(data = "empirical", round(summary(data), 2)), c(data = "theoretical", round(summary(theoretical_pareto), 2))) %>%
  as_tibble() %>%
  knitr::kable()
data          Min.   1st Qu.   Median   Mean   3rd Qu.   Max.
empirical     1      1         2        3.94   4         100
theoretical   1      1.29      1.82     5.95   3.41      9295.57

Here the most prominent differences are the maximum (the theoretical is much larger than 100, that’s Ok) and the mean (obviously because the maximum is much larger). However, the quantiles look much better.

Finally, if you look at the Pareto Distribution, how will this density function look if you take both its log and the log of \(X\)?
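Again, a one-line check on paper first: if \(X\) is Pareto with \(x_{min} = 1\), then \(\log{X}\) is exponentially distributed with rate \(\alpha\), so

\(\log{f_{\log{X}}(y)} = \ln{\alpha} - \alpha y\)

i.e. a straight line in \(\log{X}\) with slope \(-\alpha\). The plot below checks this on the simulated Pareto data: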

hist_obj <- hist(log(theoretical_pareto), breaks = seq(0, max(log(theoretical_pareto)) + 1, 1), plot = FALSE)

tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/length(theoretical_pareto))) %>%
  ggplot(aes(x = num, y = log_freq)) +
  geom_point(stat = "identity", col = "blue") +
  labs(x = "log(X)", y = "log(% of occurrences)", title = "Theoretical Pareto Data: log(Freq(log(X))) vs. log(X)") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, max(log(theoretical_pareto)), 1)) +
  geom_abline(intercept = log(dexp(0, fit_pareto$alpha)), slope = -fit_pareto$alpha, color = "red")

A straight line! And the slope? \(-\alpha\). Where in our case:

hist_obj <- hist(log(data), breaks = seq(0, 5, 0.5), plot = FALSE)
lm2 <- lm(log(hist_obj$counts/length(data)) ~ hist_obj$breaks[2:length(hist_obj$breaks)])

tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/length(data))) %>%
  ggplot(aes(x = num, y = log_freq)) +
  geom_point(stat = "identity", col = "blue") +
  labs(x = "log(Number)", y = "log(% of occurrences)", title = "Empirical Data: log(Freq(log(Number))) vs. log(Number)") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, 5, 0.5)) +
  geom_abline(intercept = lm2$coef[1], slope = lm2$coef[2], color = "red")

A straight line!

And the slope we got is fairly similar to the \(-\alpha\) we estimated (-1.15):

print(str_c('Pareto exponent: ', round(lm2$coef[2], 2)))
## [1] "Pareto exponent: -1.29"

Do you have patience for one more Zipf?

At first glance it seems there is no connection between Zipf and Pareto, as if we found two different power-law formulations to explain our data. One deals with ranks and gave a slope of -2.6, the other deals with the distribution and gave a slope of -1.15 or -1.3…

But Lada Adamic shows in this great tutorial6 that they are, in fact, different formulations of the same fit. You just need to look at it right.

Going back to our counter vector - some number-words occurred over 7M times, some less than 700 times. What if we looked at this distribution, the distribution of occurrences, rather than the distribution of the actual observed data:

tibble(num = counter) %>%
  ggplot(aes(x = num)) +
  geom_histogram(bins = 100, fill = "red", alpha = 0.5) +
  labs(x = "Occurrence", y = "No. of Number-Words", title = "Distribution of No. of Occurrences of Number-Words") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, max(counter), 1000000), labels = scales::comma)

This “inverse” view is also power-law distributed! Let us fit a Pareto Distribution to it:

fit_pareto_counter <- fit_pareto_dist(counter)
fit_pareto_counter
## $x_min
## [1] 310
## 
## $alpha
## [1] 0.3734685

We got \(\alpha = 0.37\), let’s see the fit:

tibble(num = counter) %>%
  ggplot(aes(x = num)) +
  geom_histogram(bins = 100, fill = "red", alpha = 0.5, aes(y = ..density..)) +
  labs(x = "Occurrence", y = "% of Number-Words / f(X)", title = "Distribution of No. of Occurrences of Number-Words - Pareto Fit") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, max(counter), 1000000), labels = scales::comma) +
  stat_function(fun = VGAM::dpareto, color = "blue", args = list(scale = fit_pareto_counter$x_min, shape = fit_pareto_counter$alpha))

Now does this \(\alpha\) of 0.37 sound familiar? Interestingly enough, it is almost exactly the inverse of the \(a\) parameter we got from Zipf fitting:

print(str_c('Zipf exponent: ', round(-lm1$coef[2], 2)))
## [1] "Zipf exponent: 2.6"
print(str_c('Inverse of Zipf exponent: ', round(-1/lm1$coef[2], 2)))
## [1] "Inverse of Zipf exponent: 0.38"

This is not by chance. I’m going to repeat Adamic’s simple explanation in my own formulation; for a full proof see her tutorial:

If \(n \propto r^{-a}\) as in Zipf, where \(n\) is the no. of occurrences of a number-word, \(r\) is the rank of the number-word and \(a\) is the Zipf exponent, then the Pareto exponent you get when fitting the distribution of the no. of occurrences is \(1/a\), so that:

\(r \propto n^{-1/a}\)
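The gist of the inversion is one line: raising both sides of the Zipf relation \(n \propto r^{-a}\) to the power \(-1/a\) gives \(n^{-1/a} \propto r\), which is the relation above; how this maps onto the Pareto density exponent is spelled out in Adamic’s tutorial.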

Good God Girl, Wrap it up!

So, what did we learn?

  • It seems like our perception of numbers is logarithmic rather than linear
  • Language supports it: small number-words appear vastly more often than larger number-words
  • In fact language is even more extreme than exponential: it is power-lawed, or Zipfed, or Pareto-ed - which is saying the same thing

Number.


  1. Bonus: this example is actually similar to Weber’s original experiment!

  2. That’s a verb, right?

  3. As I said, the problem is even worse since the count of “one hundred”, for example, also includes the counts of all number-words between “one hundred” and “one hundred ninety-nine”, but these are somewhat negligible; above 50, chances are we’d see the numeral, not the number-word

  4. Source: By SergioJimenez - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45516736

  5. Notice that the alpha parameter you’d get there once you set \(x_{min} = 1\) should be equal to our Pareto exponent \(\alpha\) + 1

  6. I know, tutorial without code, very 2002.