I feel like I should start with an apology: this is a long, barely readable post. But you might earn some satisfaction from it, once you realize I scanned the entire English Wikipedia just to find, or not, evidence for a 19th century psychophysics law! Anyway, if you survive this post, and wish for a prize of some sort, email mw and I will grant you a badge.

# Logarithmic Perception

I recently stumbled upon an interesting article published in Science Magazine back in 2008 by Stanislas Dehaene et. al., titled “Log or Linear? Distinct Intuitions of the Number Scale in Western and Amazonian Indigene Cultures”. The authors compared performance of American and Mundurucu participants in a numbers mapping task:

The participant was presented with a number in dots (or sounds or numerals) and was asked to place this number on a line representing a scale of 1 to 10 in some trials and 1 to 100 on other trials.

Results showed that when the range was 1 to 10, American participants tended to place numbers linearly, in even spaces, while Mundurucu participants tended to place numbers “logarithmically”, making more emphasis on the difference between smaller numbers such as 1, 2 and 3, than on the difference between larger numbers such as 8, 9 and 10. However, when the range was 1 to 100 - American participants also tended to place numbers logarithmically!

This finding isn’t new. For example, and you can check this on a friend, most of us would feel pretty good about paying only 20 dollars for a 50 dollars dress, but would feel bordeline insulted if offered to pay “only” 500,000 dollars for a 500,030 dollars house. Although in both cases we can save 30 dollars, in the first case 30 dollars is *a lot* off the initial amount of 50 dollars, and in the second case it is *nothing* off the initial amount of 500,030.

# Weber-Fechner Law

The concept of logarithmic perception was formalized by psychophysicist Gustav Fechner in the 19th century. Fechner explained the experimental findings of his partner Ernst Weber with the following formula, a.k.a Weber-Fechner Law:

\(p = k\ln{\frac{S}{S_0}}\)

Which goes to say that perception \(p\) of a stimulus \(S\) equals a constant \(k\) (depending on the stimulus) times the natural logarithm of the ratio between this stimulus \(S\) and an initial stimulus \(S_0\) causing “zero” perception.

OK, this doesn’t mean anything. Suppose you are blindfolded and given a weight of 1 kilo in one hand (the initial stimulus \(S_0\)). If you are now given in the other hand a weight of 2 kilos (the stimulus \(S\)), you would surely notice that the weight has increased, though you may not be able to say it has doubled precisely. According to the law you would report about \(k\ln{2}\) increase. Now suppose you are given a weight of 10 kilos in one hand (\(S_0\)), then a weight of 11 kilos (\(S\)) in the other hand - you might not even notice that the weight has increased! The law puts this intuition to numbers and predicts you would report \(k\ln{1.1}\) increase which is close to zero. And in fact the only way to make you feel that initial increase from 1 kilo to 2 kilos when you start with 10 kilos, is to move to 20 kilos!^{1}

Do you see how this relates to logarithmic perception of numbers? We are much better at seeing the difference between 1 and 2 because we are much better at seeing *ratios* and the number has doubled. We are much worse at seeing the difference between 10 and 11 because we are much worse at seeing *additions* and the number has gone up by only 10%.

# Whorf (not the Klingon)

So Weber and Fechner would claim that in general we’re better at seeing ratios than at seeing additions and this is true for any type of perception. I do not know if they ever thought about *why* this is so, for example whether there are evolutionary advantages to paying more attention to ratios than to additions.

When it comes to numbers, one hypothesis called the Sapir-Whorf Hypothesis after linguists Edward Sapir and Benjamin Lee Whorf *does* try to provide a reason and the reason is: Language. Whorfianism isn’t about the logarithmic perception of numbers specifically. It states generally that our experience with language shapes our thoughts and behavior. Think about that for a moment. How many times did you count “one, two, three”? A lot. How many times did you count “sixty-one, sixty-two, sixty-three”? Not that many. So since childhood, one, two and three are a big deal to us. 61, 62 and 63 not so much. Which might also explain the differences found in the Science article between American participants and Mundurucu participants: the Mundurucu participants must culturally care much more about the one, two and three digits than they do about eight, nine and ten. For American participants the first 10 integers are a big thing.

# God damn you, show me some code!

So I’m sitting there thinking: why don’t I scan a big corpus of text, say the entire English Wikipedia and simply count the occurrence of the numbers 1 to 100, and see how this frequency behaves.

But surely Giora, you don’t mean the entire Wikipedia? Girrrl, I do.

Wikipedia actually releases its entire corpus of pages here and explains what each of these compressed files means here. I downloaded the ~15GB `enwiki-latest-pages-articles.xml.bz2`

file, decompressed it to `enwiki-latest-pages-articles.xml`

now weighing ~68GB. It’s a regular XML with over 1B rows, you can see a sample here. And I’m simply gonna read it in chunks of 100K rows at a time, counting the occurrence of the words “one”, “two”, “three”, etc.

```
library(tidyverse)
library(xfun)
regex_num <- function(num) str_c("\\b", numbers_to_words(num), "\\b")
n <- 100
search_nums <- map_chr(1:n, regex_num)
counter <- numeric(length(search_nums))
count_in_chunk <- function(i, chunk) sum(str_count(chunk, search_nums[i]))
update_counter <- function(x, pos) counter <<- counter + map_int(1:length(search_nums), count_in_chunk, x)
read_lines_chunked("enwiki-latest-pages-articles.xml", SideEffectChunkCallback$new(update_counter), chunk_size = 100000)
```

Wait, that’s it, 9 lines of code? Almost:

I’m defining a function

`regex_num`

which takes a number (e.g. 23) and returns it as string (“twenty-three”). So in the object`search_nums`

I have the numbers “one”, “two”, “three”, … up to “one-hundred” and I want them searched as whole words so they’re regexified^{2}:`c("\\bone\\b", "\\btwo\\b", "\\bthree\\b", ..., "\\bone hundred\\b")`

.Defining a

`counter`

vector of length 100, each of its elements will count the corresponding number-word appearances.Using the

`readr`

package`read_lines_chunked`

function to read this massive XML file in chunks of 100K lines at a time. I’m passing it a`SideEffectChunkCallback`

object, feeding it a`update_counter`

function which will update my global`counter`

object with each chunk.

A few issues might arise in the acute reader’s mind:

Why are we only counting number-

*words*, what about the numbers themselves?Why are the numbers of 20-something, 30-something etc. hyphened? (“twenty-three” as opposed to “twenty three”)

You do realize “twenty-three” counts also “three”, don’t you? Moreover if the number-word “one hundred twenty-three” appears it also counts “twenty-three”, etc.

This will take a long time, why not parallelize?

Here are my answers:

This was a compromise. There are plenty of raw numbers in this raw XML which are not in the text and I did not want to count those numbers. Every address, every hash code… We can rely more on number-words. Also actually parsing the XML would take a long time.

I could search for both forms, there’s actually a useful boolean parameter in the

`numbers_to_words`

function called`hyphen`

which defaults to`TRUE`

. This might surprise you but the more common form is the hyphened form.Yes, I realize and that’s what I’m dealing with in the next chunk of code.

You are right, I was lazy. But ~10 hours for a one-time effort, for a private blog which no one reads is OK :)

So we have our `counter`

:

`head(counter)`

`## [1] 7300632 4939875 2581111 1592384 1068104 792180`

But we need to subtract all counts of “twenty-one”, “thirty-one”, “forty-one” etc. from the count of “one”. And we need to subtract all counts of “twenty-two”, “twenty-three”, “twenty-four” etc. from the count of “twenty”^{3}:

```
# a function to get the indices of counter which repeat the count of a digit (1, 2, ...)
get_indices_to_subtract_digit <- function(digit, total_n = n) {
indices <- which(1:total_n %% 10 == digit)
indices <- indices[3:length(indices)]
if (digit == 1) {
indices <- c(indices, 100)
}
return(indices)
}
# a function to get the indices of counter which repeat the count of a full ten (20, 30, ...)
get_indices_to_subtract_ten <- function(ten, total_n = n) {
indices <- which(1:total_n > ten & 1:total_n < ten + 10)
return(indices)
}
how_much_to_subtract <- function(num) {
if (num > 0 & num < 10) {
indices_to_subtract <- get_indices_to_subtract_digit(num)
} else {
indices_to_subtract <- get_indices_to_subtract_ten(num)
}
sum(counter[indices_to_subtract])
}
subtract_from_counter <- function(num) {
if (num %% 1 != 0 || num < 0 || num > 90 || (num > 9 && (!(num %% 10 == 0) || num == 10))) {
stop('num should be an integer between 1 and 9 or round tens from 20 to 90')
}
counter[num] <<- counter[num] - how_much_to_subtract(num)
}
walk(c(1:9, seq(20, 90, 10)), subtract_from_counter)
```

And that’s the final counter:

`counter`

```
## [1] 7262356 4920332 2566535 1572877 1039010 778003 468949 438583
## [9] 285970 406013 89319 125496 58864 47377 65082 47425
## [17] 28589 36847 17598 117154 9664 8361 6188 10809
## [25] 14039 5663 4382 4524 3039 62930 2743 4277
## [33] 3218 2430 4933 3493 2170 2196 1897 40279
## [41] 1717 2259 1843 2270 3420 1432 1434 2281
## [49] 1167 50821 1164 1696 1069 1127 1611 1081
## [57] 950 811 821 24839 730 858 835 1155
## [65] 1621 871 745 691 694 13251 541 1092
## [73] 572 626 2142 627 589 543 457 9727
## [81] 556 512 499 756 787 501 440 641
## [89] 381 7708 358 488 352 334 541 509
## [97] 310 310 746 20803
```

So counter actually summarizes our actual observed data in a compact manner. Our actual observed data is ~7M “one”s, ~5M “two”s, etc:

```
data <- rep(1:n, counter)
head(data)
```

`## [1] 1 1 1 1 1 1`

`tail(data)`

`## [1] 100 100 100 100 100 100`

Or in a histogram (really a barplot since I’m asking for 100 bins):

```
tibble(num = data) %>%
ggplot(aes(x = num)) +
geom_histogram(bins = n, fill = "red", alpha = 0.5) +
labs(x = "Number", y = "No. of occurences", title = "No. of Number-Word Occurrences in English Wikipedia") +
theme_bw() +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks = seq(0, n, 10))
```

# Log, Exponential, Face, I Ain’t Bothered!

First, it is already interesting to see that indeed there is a sharp decline in frequency as numbers increase. Second, although all numbers have positive counts, from about 20 we can only see the jumps for 30, 40, 50 etc. up until 100. This makes sense, as I’ve commented, one rarely writes the number 53 as “fifty-three” but there’s a high chance one *will* write the round number 50 as “fifty”.

How would you show that this sharp decline, a.k.a “decaying function” is inline with the “logarithmic perception” phenomenon the Science researchers demonstrated? You could simply invert this histogram to see its outline matches the logarithmic pattern the researchers got. I’m a Statistician, so what I know is how to fit distributions :) And the proper distribution to fit here is the Exponential Distribution.

If \(X\) is an exponentially distributed random variable, its probability densitiy function (PDF) looks like this:

\(f(X) = \lambda {e}^{-\lambda x};\space X > 0, \space \lambda > 0\)

Which means we only need to find “the best” \(\lambda\), the rate parameter, assuming our distribution is indeed exponential, and if we get some kind of an acceptable “goodness of fit” measure or other diagnostics saying the fit is “good”, we can be happy.

Now the most common way for fitting a distribution to the data and getting this \(\lambda\) is through Maximum Likelihood Estimation or MLE, which isn’t complicated but *is* out of scope, plus we have R at our disposal so why worry our pretty little heads about it?

```
library(MASS)
fit_exp <- fitdistr(data, "exponential")
fit_exp
```

```
## rate
## 2.540445e-01
## (5.577092e-05)
```

We got: \(\lambda = 0.25\). Let’s first draw this \(Exp(0.25)\) distribution right next to the (normalized to 1) histogram and see if it fits:

```
tibble(num = data) %>%
ggplot(aes(x = num)) +
geom_histogram(bins = n, fill = "red", alpha = 0.5, aes(y = ..density..)) +
labs(x = "Number", y = "% of occurrences / f(X)", title = "No. of Number-Word Occurrences in English Wikipedia - Exponential Fit") +
theme_bw() +
scale_x_continuous(breaks = seq(0, n, 10)) +
stat_function(fun = dexp, color = "blue", args = list(rate = fit_exp$estimate))
```

Personally I’m not impressed! The decay here looks more extreme than the exponential decay. We can also draw a QQ-plot, comparing the quantiles of our actual data to the theoretical quantiles of the \(Exp(0.25)\) distribution:

```
theoretical_exp <- rexp(20000, rate = fit_exp$estimate)
qqplot(data, theoretical_exp, ylim = c(0, n), ylab = "Exp(0.25) Theoretical Quantiles", xlab = "Our Count Data Quantiles")
abline(0, 1, col = "red")
```

If the theoretical and empirical distributions matched, they would have had similar quantiles and we would have seen the dots arrange close to the identity line, here in red. Alas, after the first ~50 quantiles, the dots completely diverge from the red line. It seems like the theoretical distribution has much more lower quantiles, it doesn’t seem to increase over 40.

We can also compare some distribution metrics for our data versus the \(Exp(0.25)\) distribution:

```
rbind(c(data = "empirical", round(summary(data), 2)), c(data = "theoretical", round(summary(theoretical_exp), 2))) %>%
as_tibble() %>%
knitr::kable()
```

data | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|

empirical | 1 | 1 | 2 | 3.94 | 4 | 100 |

theoretical | 0 | 1.11 | 2.76 | 3.94 | 5.51 | 37.79 |

The numbers do not align, especially “at the end” of the distribution.

Finally, if you look at the Exponential Distribution, how will this density function look like if you take its log?

```
hist_obj <- hist(theoretical_exp, breaks = seq(0, max(theoretical_exp) + 1, 1), plot = FALSE)
tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/20000)) %>%
ggplot(aes(x = num, y = log_freq)) +
geom_point(stat = "identity", col = "blue") +
labs(x = "Number", y = "log(% of occurrences)", title = "Theoretical Exponential Data: log(Freq(X)) vs. X") +
theme_bw() +
scale_x_continuous(breaks = seq(0, max(theoretical_exp), 10)) +
geom_abline(intercept = log(dexp(0, fit_exp$estimate)), slope = -fit_exp$estimate, col = "red")
```

That’s right, a straight line! And the slope? \(-\lambda\). Where in our case:

```
hist_obj <- hist(data, breaks = seq(0, n, 1), plot = FALSE)
tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/length(data))) %>%
ggplot(aes(x = num, y = log_freq)) +
geom_point(stat = "identity", col = "blue") +
labs(x = "Number", y = "log(% of occurrences)", title = "Empirical Data: log(Freq(Number)) vs. Number") +
theme_bw() +
scale_x_continuous(breaks = seq(0, n, 10)) +
geom_abline(intercept = log(dexp(0, fit_exp$estimate)), slope = -fit_exp$estimate, col = "red")
```

Not a very straight line, perhaps only at its beginning. You do remember what the “jumping dots” stand for, don’t you?

So, the distribution of the number-words from 1 to 100 in English Wikipedia does not seem to be exponential, and it doesn’t seem to fully explain the “logarithmic perception” of numbers phenomenon seen by the Science article researchers and by Fechner.

But what is it?

# Alright. We’ll call it a Zipf.

The linguist George Zipf was on to a mind-blowing phenomenon. By simply counting words in any big corpus of text (and this was 1935, no computers!), he saw that the frequency of any word is inversely proportional to its rank in the frequency table:

\(n\space \tilde\space\space r^{-a}\)

Suppose the most common word (rank \(r = 1\)), say the word “the”, appears 10,000 times (frequency \(n = 10000\)). Suppose the exponent \(a\) is 1. Then Zipf Law would predict the next most common word (rank \(r = 2\)) frequency would be \(2^{-1} = \frac{1}{2}\) as large, 5,000 times. And the next most common word frequency would be \(\frac{1}{3}\) as large as the first word, 3333 times, etc.

If you take the log from both sides of this relation:

\(log(n)\space \tilde\space\space -alog(r)\)

You will see that the log frequency of a word should linearly descend by \(a\) units as the log rank increases. So on a log-log plot we expect to see a decreasing linear line, and amazingly this is exactly what we see in many languages^{4}:

Let us plot the log frequencies of our number-words versus their log rank. Will the pattern be a straight line? In other words, will it Zipf?

```
lm1 <- lm(log(counter) ~ log(rank(-counter)))
tibble(n = 1:n, log_rank = log(rank(-counter)), log_count = log(counter)) %>%
ggplot(aes(x = log_rank, y = log_count)) +
geom_point(stat = "identity", col = "blue") +
labs(x = "log(rank of number-word)", y = "log(frequency)", title = "No. of Number-Word Occurrences in English Wikipedia - Log-Log Plot") +
theme_bw() +
geom_abline(intercept = lm1$coef[1], slope = lm1$coef[2], col = "red")
```

Well that’s more like it! And, as you can see, I also fitted a straight line to these dots which means I can also say what is the exponent \(a\) here (notice I’m taking the negative slope coefficient):

`print(str_c('Zipf exponent: ', round(-lm1$coef[2], 2)))`

`## [1] "Zipf exponent: 2.6"`

This is what I call “Zipf fitting”. An alternative would be to look at the ratios between successive counts, see that they are more or less constant:

```
ratios <- tibble(num = 1:n, m1 = counter, m2 = lead(counter)) %>% mutate(ratio = round(m1/m2, 2))
head(ratios) %>%
knitr::kable()
```

num | m1 | m2 | ratio |
---|---|---|---|

1 | 7262356 | 4920332 | 1.48 |

2 | 4920332 | 2566535 | 1.92 |

3 | 2566535 | 1572877 | 1.63 |

4 | 1572877 | 1039010 | 1.51 |

5 | 1039010 | 778003 | 1.34 |

6 | 778003 | 468949 | 1.66 |

And the average of these ratios is ~3, similar to our \(a\) of 2.6.

# Pareto is the new Normal

How about that distribution though? It doesn’t seem to be exponential, it seems to be more extreme. Many modern age quantities are distributed just like that, extreme, where few values account for a great bulk of the distribution, and then we’d see a long tail of values with barely visible bulk of the distribution. Other examples include the no. of visitors per site (it has been estimated that Google, Amazon, Facebook and few other Social Media websites account for 80% or more of internet traffic), wealth distribution (you might have heared that 1% of population is holding 99% of the world’s wealth), size of cities population (few mega-cities holding most urban population, many small cities).

If you’re an Economics major you might be shouting by now “Ooh ooh! Power-Law! Pareto!” and you’d be right. Vilfredo Pareto was an Italian scientist who noticed a very similar (equivalent some would argue) phenomenon to Zipf: At the turn of the 19th century, about 80% of the land in Italy was owned by about 20% of its population. This would be generalized as the Pareto Principle or the 80/20 rule.

In general we could speak of the Power-Law Distribution. But there’s actually a formulation of distribution named after Pareto, the Pareto Distribution, let’s use that.

If a random variable \(X\) is Pareto distributed, its probability density function would be:

\(f(X) = \frac{\alpha x_{min}^\alpha}{x^{\alpha + 1}};\space X > x_{min} > 0, \space \alpha > 0\)

You can see that the Pareto Distirbution has only two parameters: \(x_{min}\) which is the minimum \(X\) (and is thus estimated) and the exponent \(\alpha\). Again we’ll use MLE to estimate these. Now there’s no “pareto” distribution in base R, so using `fitdistr(data, "pareto")`

won’t work. You have a number of options here:

- Use the
`fitdistrplus`

library which can accept custom distributions or the “pareto” distributions found in the`actuar`

or`VGAM`

libraries - Use the formulation of the Power-Law Distriution instead of Pareto, then you have the
`fit_power_law`

function from the`igraph`

library or the`estimate_pars`

function from the`poweRlaw`

library^{5} - Go manual

I went manual. You can see how I got the MLE formula here.

```
fit_pareto_dist <- function(data) {
n <- length(data)
x_min <- min(data)
alpha <- n/sum(log(data)-log(x_min))
return(list(x_min = x_min, alpha = alpha))
}
fit_pareto <- fit_pareto_dist(data)
fit_pareto
```

```
## $x_min
## [1] 1
##
## $alpha
## [1] 1.153942
```

See the normalized histogram with the Pareto distribution on top of it as we did with the Exponential Distribution:

```
tibble(num = data) %>%
ggplot(aes(x = num)) +
geom_histogram(bins = n, fill = "red", alpha = 0.5, aes(y = ..density..)) +
labs(x = "Number", y = "% of occurrences / f(X)", title = "No. of Number-Word Occurrences in English Wikipedia - Pareto Fit") +
theme_bw() +
ylim(c(0, 0.4)) +
scale_x_continuous(breaks = seq(0, n, 10)) +
stat_function(fun = VGAM::dpareto, color = "blue", args = list(scale = fit_pareto$x_min, shape = fit_pareto$alpha))
```

Looks much better. And the QQ plot:

```
set.seed(42)
theoretical_pareto <- VGAM::rpareto(20000, scale = fit_pareto$x_min, shape = fit_pareto$alpha)
qqplot(data, theoretical_pareto, ylab = "Pareto(1, 1.15) Theoretical Quantiles", xlab = "Our Count Data Quantiles")
abline(0, 1, col = "red")
```

Looks better than the Exponential Distribution QQ plot, but notice the Pareto Distribution achieves much higher values than 100. That’s actually Ok because we have much higher numbers than 100, the 100 upper limit is our own restriction.

Comparing some distribution metrics for our data versus the \(Pareto(1, 1.15)\) distribution:

```
rbind(c(data = "empirical", round(summary(data), 2)), c(data = "theoretical", round(summary(theoretical_pareto), 2))) %>%
as_tibble() %>%
knitr::kable()
```

data | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|

empirical | 1 | 1 | 2 | 3.94 | 4 | 100 |

theoretical | 1 | 1.29 | 1.82 | 5.95 | 3.41 | 9295.57 |

Here the most prominent differences are the maximum (the theoretical is much larger than 100, that’s Ok) and the mean (obviously because the maximum is much larger). However, the quantiles look much better.

Finally, if you look at the Pareto Distribution, how will this density function look like if you take both its log and the log of \(X\)?

```
hist_obj <- hist(log(theoretical_pareto), breaks = seq(0, max(log(theoretical_pareto)) + 1, 1), plot = FALSE)
tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/10000)) %>%
ggplot(aes(x = num, y = log_freq)) +
geom_point(stat = "identity", col = "blue") +
labs(x = "log(X)", y = "log(% of occurrences)", title = "Theoretical Pareto Data: log(Freq(log(X)) vs. log(X))") +
theme_bw() +
scale_x_continuous(breaks = seq(0, max(log(theoretical_pareto)), 1)) +
geom_abline(intercept = dexp(log(1), fit_pareto$alpha), slope = -fit_pareto$alpha, color = "red")
```

A straight line! And the slope? \(-\alpha\). Where in our case:

```
hist_obj <- hist(log(data), breaks = seq(0, 5, 0.5), plot = FALSE)
lm2 <- lm(log(hist_obj$counts/length(data)) ~ hist_obj$breaks[2:length(hist_obj$breaks)])
tibble(num = hist_obj$breaks[2:length(hist_obj$breaks)], log_freq = log(hist_obj$counts/length(data))) %>%
ggplot(aes(x = num, y = log_freq)) +
geom_point(stat = "identity", col = "blue") +
labs(x = "log(Number)", y = "log(% of occurrences)", title = "Empirical Data: log(Freq(log(Number))) vs. log(Number)") +
theme_bw() +
scale_x_continuous(breaks = seq(0, 5, 0.5)) +
geom_abline(intercept = lm2$coef[1], slope = lm2$coef[2], color = "red")
```

A straight line!

And the slope we got is very similar to the \(-\alpha\) we got (-1.15):

`print(str_c('Pareto exponent: ', round(lm2$coef[2], 2)))`

`## [1] "Pareto exponent: -1.29"`

# Do you have patience for one more Zipf?

At first glance it seems there is no connection between the Zipf and Pareto, as if we found two different power-law formulations to explain our data. One deals with ranks and got a slope of -2.6, one deals with distribution and got a slope of -1.15 or -1.3…

But Lada Adamic shows in this great tutorial^{6} that they are, in fact, different formulations of the same fit. You just need to look at it right.

Going back to our `counter`

vector - some word-numbers occurred over 7M times. Some less than 700 times. What if we looked at this distribution, the distribution of *occurrences*, not the distribution of the actual observed data:

```
tibble(num = counter) %>%
ggplot(aes(x = num)) +
geom_histogram(bins = 100, fill = "red", alpha = 0.5) +
labs(x = "Occurrence", y = "No. of Number-Words", title = "Distribution of No. of Occurrences of Number-Words") +
theme_bw() +
scale_x_continuous(breaks = seq(0, max(counter), 1000000), labels = scales::comma)
```

This “inverse” view is also a Power-Law distribution! Let us fit a Pareto Distribution to this distribution:

```
fit_pareto_counter <- fit_pareto_dist(counter)
fit_pareto_counter
```

```
## $x_min
## [1] 310
##
## $alpha
## [1] 0.3734685
```

We got \(\alpha = 0.37\), let’s see the fit:

```
tibble(num = counter) %>%
ggplot(aes(x = num)) +
geom_histogram(bins = 100, fill = "red", alpha = 0.5, aes(y = ..density..)) +
labs(x = "Occurrence", y = "% of Number-Words / f(X)", title = "Distribution of No. of Occurrences of Number-Words - Pareto Fit") +
theme_bw() +
scale_x_continuous(seq(0, max(counter), 1000000), labels = scales::comma) +
stat_function(fun = VGAM::dpareto, color = "blue", args = list(scale = fit_pareto$x_min, shape = fit_pareto$alpha))
```

Now does this \(\alpha\) of 0.37 sound familiar? Interestingly enough it is almost exactly the inverse of the \(a\) parametr we got from Zipf fitting:

`print(str_c('Zipf exponent: ', round(-lm1$coef[2], 2)))`

`## [1] "Zipf exponent: 2.6"`

`print(str_c('Inverse of Zipf exponent: ', round(-1/lm1$coef[2], 2)))`

`## [1] "Inverse of Zipf exponent: 0.38"`

This is not by chance, I’m goint to repeat Adamic’s simple explanation in my own formulation, for a full proof see her tutorial:

If \(n\space \tilde\space\space r^{-a}\) in Zipf, where \(n\) is the no. of number-words occurrences \(r\) is the rank of the number-word and \(a\) is the Zipf exponent, then the Pareto exponent if you fit the no. of occurrences distribution is \(1/a\) so that:

\(r\space \tilde\space\space n^{-1/a}\)

# Good God Girl, Wrap it up!

So, what did we learn?

- It seems like our perception of numbers is logarithmic rather than linear
- Language supports it, small number-words appear exponentially more than larger number-words
- But language is even more extreme, it is power-lawed, or Zipfed, or Pareto-ed - this is saying the same thing

Bonus: this example is actually similar to Weber’s original experiment!↩

That’s a verb, right?↩

As I said the problem is even worse since the count of “one hundred” for example also includes the counts for all number-words between “one hundred” and “one hundred ninety-nine”, but these are somewhat negligible, above 50 most chances we’d see the number, not the number-word↩

Source: By SergioJimenez - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45516736↩

Notice that the

`alpha`

parameter you’d get there once you*set*\(x_{min} = 1\) should be equal to*our*Pareto expoent \(\alpha\) + 1↩I know, tutorial without code, very 2002.↩