For my very first course in Statistics during my undergrad years, we had to make “a project”, defined as “go out there, gather some data, summarize them and perform proper statistics on them”. My friend and I teamed up, and started asking random people 2 simple questions:

  • When’s your birthday?
  • When’s your best friend’s birthday?

I think we got to a couple of hundred respondents. The idea then was very simple: we would test if statistically there was any evidence to what astrology considers a “good matching” between two people. The assumption was that if for example astrology considers signs of the same “element” (see below) to be a good match, in a sample of best friends one should expect a higher proportion of best friends with signs of the same element, than random matchings. I know there are many problems with this assumption and “project” as a whole, I’m going to talk about them in a minute.

What I’m going to do here is more or less replicate that fun little project x30, without asking anyone a single question. I will scrape (English) Wikipedia for thousands of celebrity couples, record for each person his or her birth date, and see if there’s any statistical evidence in this sample (is it a sample? see later) to common astrological theories or “rules” of which signs make a good couple.

As always, if you wish to skip the “How I did it” part and see some results, go to the “Enough Talking, Let’s See Some Results” section, but I highly encourage you to also read the “What’s Wrong?” section, it may be the only not-wrong section in this post.

Wikipedia to the Rescue

I thought a lot about how to get thousands of couples. Wikipedia is a great site for this because it is very well (though not perfectly) structured. Each person has her own “infobox” with her name, her birth date (most times in its very own CSS class which is great for me), country, occupation and most important to my needs: spouse or partner.

Take Elton John:

I can see his birth date, he’s from the UK, he’s a musician, and he has a spouse with a link, meaning that this spouse (David Furnish) may also have a birth date, occupation etc, which is what I want, in the same Wikipedia structure. Elton John had another spouse, Renate Blauel, but she does not have a (English) Wikipedia article, so that’s that.

But where can I get to Elton John in the first place?

I can go to Wikipedia’s 20th-century English singers, and scrape them one by one. But if I want also say Beyonce in this sample, I’d have to go to 20th-century American singers as well. But what about Meryl Streep?!

Luckily for us all the above people and many more appear in a category called “Living People”. I kid you not, this list holds over 830,0001 people with a Wikipedia article, almost any person with a (English) Wikipedia who is known to be alive should be there. So what, did I check each and every person on this list for a birth date and spouse? I did, Honeys, it took me 5 days.

First I got a list of all 4,159 urls which comprised this list (this alone took over an hour):

library(tidyverse)
library(stringr)
library(rvest)
library(lubridate)
library(knitr)
library(kableExtra)

currentURL <- "/wiki/Category:Living_people"

urls <- currentURL

nextURL <- tryCatch(
  read_html(str_c("https://en.wikipedia.org", currentURL)) %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    discard(is.na(.) |
              !str_detect(., "title=Category:Living_people&pagefrom=")) %>%
    .[[1]],
  error = function(e) return(NULL)
)
i <- 0
while (!is.null(nextURL)) {
  i <- i + 1
  cat(i, "\n")
  urls <- c(urls, nextURL)
  currentURL <- nextURL
  nextURL <- tryCatch(
    read_html(str_c("https://en.wikipedia.org", currentURL)) %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      discard(is.na(.) |
                !str_detect(., "title=Category:Living_people&pagefrom=")) %>%
      .[[1]],
    error = function(e) return(NULL)
  )
}

So now there’s a vector of 4,159 Wikipedia “Living People” URLs in urls. Need to loop over them, and in each page loop over all people’s links on that page.

After some trial and error, we get to this long getPersonWikiData function for parsing out the information of a single person, built with rvest:

getCountry <- function(s) {
  country <- str_extract(s, str_c(c(countrycode::countrycode_data$country.name.en,
                         "U.S.", "US", "USA", "Russia", "United States", state.name,
                         "United Kingdom", "Venezuela", "Tanzania", "Iran",
                         "England", "Scotland", "Wales", "Northern Ireland"), collapse = "|"))
  if (country %in% c("United States of America", "United States", "U.S.", "US",
                     "USA", "Georgia", "Jersey", state.name)) {
    country <- "USA"
  }
  if (country %in% c("United Kingdom", "England", "Scotland", "Wales", "Northern Ireland")) {
    country <- "UK"
  }
  
  return(country)
}

getPersonWikiData <- function(spouseAlink) {
  
  # if link is already on couples list, no need to look for it
  if (spouseAlink %in% couples$spouseA) {
    return(NULL)
  }
  
  html <- tryCatch(read_html(str_c("https://en.wikipedia.org", spouseAlink)),
                   error = function(e) return(NULL))
  
  # if failed to read html - bye
  if (is.null(html)) {
    return(NULL)
  }
  
  bday <- tryCatch(
    html %>%
      html_node(".bday") %>%
      html_text() %>%
      as_date(),
    error = function(e) NA
  )
  
  # if failed to read bday - bye
  if (is.na(bday)) {
    return(NULL)
  }
  
  infos <- html %>%
    html_node(".infobox") %>%
    html_children()
  
  spousesLinks <- NULL
  partnersLinks <- NULL
  
  spouseIdx <- infos %>%
    html_text() %>%
    str_detect(., "^Spouse") %>%
    which()
  
  partnerIdx <- infos %>%
    html_text() %>%
    str_detect(., "^Partner") %>%
    which()
  
  if (!is_empty(spouseIdx)) {
    spousesLinks <- infos[[spouseIdx[1]]] %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      discard(!startsWith(., "/wiki/"))
  }
  
  if (!is_empty(partnerIdx)) {
    partnersLinks <- infos[[partnerIdx[1]]] %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      discard(!startsWith(., "/wiki/"))
  }
  
  spousesLinks <- c(spousesLinks, partnersLinks)
  
  # if failed to read spouse - bye
  if (is_empty(spousesLinks)) {
    return(NULL)
  }
  
  country <- html %>%
    html_node(".birthplace") %>%
    html_text() %>%
    getCountry()
  
  if (is.na(country)) {
    country <- html %>%
      html_node(".infobox") %>%
      html_text() %>%
      getCountry()
  }
  
  occupationIdx <- infos %>%
    html_text() %>%
    str_detect(., "^Occupation") %>%
    which()
  
  occupation <- NULL
  
  if (!is_empty(occupationIdx)) {
    occupationListItems <- infos[[occupationIdx[1]]] %>%
      html_nodes("li")
    occupation <- if (length(occupationListItems) == 0) {
      infos[[occupationIdx[1]]] %>%
        html_text() %>%
        str_split("\n") %>%
        unlist() %>%
        .[[2]] %>%
        str_split(",") %>%
        unlist() %>%
        map_chr(trimws)
    } else {
      occupationListItems %>%
        html_text()
    }
  }
  
  return(list(
    spouseA = spouseAlink,
    spouseB = list(spousesLinks),
    bday = bday,
    country = country,
    occupation = list(occupation)
  ))
}

Notice a few things:

  1. This function eventually parses out:
  • spouseA: the link to person A
  • spouseB: the link to person B (it is a list, as some people have more than one spouse)
  • bday: birth date of person A
  • country: country of person A (or NA)
  • occupation: occupation of person A (it is a list, or NA)

When we reach Elton John he would be spouseA and David Furnish spouseB. When we reach David Furnish he would be spouseA and Elton John spouseB.

  1. The function presumes there is a global table called couples. If the person we’re looking at is already in couples as spouseA there’s no point in adding them. This is for later when we’ve gone through all 830K people, got their spouseBs, and we’d want to get the info regarding those spouseBs - no need to add those we’ve seen again.

  2. If the person either has no birth date or has no spouse with a link (or the function failed to parse a birth date or spouse from her article) - this person is discarded.

Once you have your function, gathering the info is very simple:

couples <- tibble(
  spouseA = character(0),
  spouseB = list(),
  bday = as_date(x = integer(0), origin = "1970-01-01"),
  country = character(0),
  occupation = list()
)

getCouples <- function(i) {
  cat(i, "\n")
  couple <- tryCatch(
    read_html(str_c("https://en.wikipedia.org", urls[i])) %>%
      html_nodes("li") %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      discard(!startsWith(., "/wiki/")) %>%
      map_dfr(getPersonWikiData),
    error = function(e) NULL
  )
  
  couples <<- rbind.data.frame(couples, couple)
}

walk(1:length(urls), getCouples)

You just need to wait 5 days, haha.

Once we’ve gone over all 830K people, some of them have spouseBs which we don’t have yet as spouseAs, so we need to complement couples with some more people (e.g. people who are dead but have a Wikipedia article - we don’t have them yet):

moreSpouses <- couples %>%
  select(spouseB) %>%
  unnest(spouseB) %>%
  unlist() %>%
  map_dfr(getPersonWikiData) %>%
  distinct(spouseA, .keep_all = TRUE)

couples <- rbind.data.frame(couples, moreSpouses)

dim(couples)
## [1] 14554     5

We’re left with 14.5K people in potential, most probably famous. We’ll soon get rid of some of these people because they have a spouse without a proper birth date.

And how does this table look like?

couples
## # A tibble: 14,554 x 5
##                           spouseA   spouseB       bday country occupation
##                             <chr>    <list>     <date>   <chr>     <list>
##  1                    /wiki/A-Lin <chr [1]> 1983-09-20    <NA>  <chr [2]>
##  2           /wiki/Jennifer_Aaker <chr [1]> 1967-01-15     USA  <chr [2]>
##  3                /wiki/Ben_Aaron <chr [1]> 1981-09-10     USA     <NULL>
##  4         /wiki/Aarthi_(actress) <chr [1]> 1983-11-09   India  <chr [1]>
##  5           /wiki/Marianne_Aasen <chr [1]> 1967-02-21  Norway  <chr [1]>
##  6            /wiki/Maryam_Abacha <chr [1]> 1945-03-04   Niger  <chr [1]>
##  7 /wiki/Ang%C3%A9lique_Abachkina <chr [1]> 2000-01-26  Russia     <NULL>
##  8  /wiki/Juan_Manuel_Abal_Medina <chr [1]> 1945-03-01  Mexico  <chr [1]>
##  9            /wiki/Linor_Abargil <chr [1]> 1980-02-17  Israel     <NULL>
## 10             /wiki/Almeda_Abazi <chr [1]> 1992-02-13 Albania     <NULL>
## # ... with 14,544 more rows

Now, Elton John and David Furnish appear as a couple twice:

couples_unnested <- couples %>%
  unnest(spouseB, .drop = FALSE)

couples_unnested %>%
  filter(spouseA == "/wiki/Elton_John" | spouseB == "/wiki/Elton_John") %>%
  mutate(bday = as.character(bday)) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F)
spouseA bday country occupation spouseB
/wiki/David_Furnish 1962-10-25 Canada Film maker, producer, director /wiki/Elton_John
/wiki/Elton_John 1947-03-25 UK Musician, singer-songwriter, composer /wiki/David_Furnish

And if we want to get distinct couples we need to create somehow a coupleID key so we can get only unique keys. I chose to simply sort each couple by name alphabetically, then concatenate them into a single string, e.g. /wiki/David_Furnish|/wiki/Elton_John:

coupleID <- function(spouseA, spouseB) {
  str_c(sort(c(spouseA, spouseB)), collapse = "|")
}

couples_unique <- couples_unnested %>%
    set_names(c("spouseA", "bdayA", "countryA", "occupationA", "spouseB")) %>%
    inner_join(couples_unnested %>%
                 set_names(c("spouseB", "bdayB", "countryB", "occupationB", "spouseA")),
               by = c("spouseA", "spouseB")) %>%
    mutate(coupleID = map2_chr(spouseA, spouseB, coupleID)) %>%
    distinct(coupleID, .keep_all = TRUE)

couples_unique %>%
  filter(spouseA == "/wiki/Elton_John" | spouseB == "/wiki/Elton_John") %>%
  select(-coupleID, -occupationA, -occupationB) %>%
  mutate_at(c("bdayA", "bdayB"), as.character) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F)
spouseA bdayA countryA spouseB bdayB countryB
/wiki/David_Furnish 1962-10-25 Canada /wiki/Elton_John 1947-03-25 UK

So only unique couples now. But we have another subtle point here. Sometimes Wikipedia would redirect spouseB to the original spouseA (see example here). This results in incorrect couples where spouseA and spouseB are the same person, and we need to remove these:

couples_unique <- couples_unique %>% filter(spouseA != spouseB)

dim(couples_unique)
## [1] 5093    9

Overall just over 5K couples.

How many unique people are participating in at least one couple here?

unique_people <- rbind(
  couples_unique %>% select(spouseA, bdayA, countryA, occupationA) %>%
    set_names(c("spouse", "bday", "country", "occupation")),
  couples_unique %>% select(spouseB, bdayB, countryB, occupationB) %>%
    set_names(c("spouse", "bday", "country", "occupation"))
) %>% distinct(spouse, .keep_all = TRUE)

dim(unique_people)
## [1] 9524    4

That’s ~9.5K unique people. What’s wrong with that?

Later I speak a lot of what’s wrong with this experiment, and one of the things is the dependencies between couples. Take Donald Trump:

couples_unique %>%
  filter(spouseA == "/wiki/Donald_Trump" | spouseB == "/wiki/Donald_Trump") %>%
  select(-coupleID, -occupationA, -occupationB) %>%
  mutate_at(c("bdayA", "bdayB"), as.character) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F)
spouseA bdayA countryA spouseB bdayB countryB
/wiki/Marla_Maples 1963-10-27 USA /wiki/Donald_Trump 1946-06-14 USA
/wiki/Donald_Trump 1946-06-14 USA /wiki/Ivana_Trump 1949-02-20 Czechoslovakia
/wiki/Donald_Trump 1946-06-14 USA /wiki/Melania_Trump 1970-04-26 USA

Donald Trump appears in 3 different couples, so if we’d want to avoid that, we could for example choose for each person only her first or last spouse, this way a person belongs only to a single couple.

couples_unique_1to1 <- couples_unnested %>%
  group_by(spouseA) %>%
  top_n(1, spouseB) %>%
  ungroup()
  
couples_unique_1to1 <- couples_unique_1to1 %>%
  set_names(c("spouseA", "bdayA", "countryA", "occupationA", "spouseB")) %>%
  inner_join(couples_unique_1to1 %>%
               set_names(c("spouseB", "bdayB", "countryB", "occupationB", "spouseA")),
             by = c("spouseA", "spouseB")) %>%
  mutate(coupleID = map2_chr(spouseA, spouseB, coupleID)) %>%
  distinct(coupleID, .keep_all = TRUE) %>%
  filter(spouseA != spouseB)

couples_unique_1to1 %>%
  filter(spouseA == "/wiki/Donald_Trump" | spouseB == "/wiki/Donald_Trump") %>%
  select(-coupleID, -occupationA, -occupationB) %>%
  mutate_at(c("bdayA", "bdayB"), as.character) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F)
spouseA bdayA countryA spouseB bdayB countryB
/wiki/Donald_Trump 1946-06-14 USA /wiki/Melania_Trump 1970-04-26 USA
dim(couples_unique_1to1)
## [1] 4185    9

If we only take a single couple for each person, meaning couples are now independent, we get to 4.2K couples.

Some Exploratory Analysis Would Be Nice

So we have 9.5K celebrity individuals in a table which looks like this:

unique_people %>%
  head(10) %>%
  mutate(bday = as.character(bday)) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F)
spouse bday country occupation
/wiki/Jennifer_Aaker 1967-01-15 USA Author, Social psychologist
/wiki/Ben_Aaron 1981-09-10 USA NULL
/wiki/Aarthi_(actress) 1983-11-09 India Actress
/wiki/Maryam_Abacha 1945-03-04 Niger Former Nigeria First Lady
/wiki/Ang%C3%A9lique_Abachkina 2000-01-26 Russia NULL
/wiki/Juan_Manuel_Abal_Medina 1945-03-01 Mexico Lawyer
/wiki/Almeda_Abazi 1992-02-13 Albania NULL
/wiki/Farhat_Abbas 1976-06-22 Indonesia Lawyer
/wiki/Bruce_Abbott 1954-07-28 USA Actor
/wiki/Diahnne_Abbott 1945-01-01 USA Actress, singer

Are they “distributed” uniformly across signs?

Let’s give them a sign, by joining their birth date into a Zodiac table. Notice the use of the wonderful padr package:

zodiac <- tibble(
  sign = c("Capricorn", "Aquarius", "Pisces",
           "Aries", "Taurus", "Gemini",
           "Cancer", "Leo", "Virgo",
           "Libra", "Scorpio", "Sagittarius",
           "Capricorn", "Capricorn"),
  startDate = c("01-01", "01-21", "02-20",
                "03-21", "04-21", "05-22",
                "06-22", "07-23", "08-23",
                "09-24", "10-24", "11-23",
                "12-22", "12-31")) %>%
  mutate(startDate = as_date(str_c("2000-", startDate))) %>%
  padr::pad(interval = "day") %>%
  tidyr::fill(sign)

dateTo2000 <- function(date) {
  year(date) <- 2000
  as.character(date)
}

unique_people <- unique_people %>%
  mutate(bday2000 = as_date(map_chr(bday, dateTo2000))) %>%
  inner_join(zodiac, by = c("bday2000" = "startDate"))

unique_people %>%
  select(-bday2000) %>%
  head(10) %>%
  mutate(bday = as.character(bday)) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", position = "left",
                font_size = 13, full_width = F)
spouse bday country occupation sign
/wiki/Jennifer_Aaker 1967-01-15 USA Author, Social psychologist Capricorn
/wiki/Ben_Aaron 1981-09-10 USA NULL Virgo
/wiki/Aarthi_(actress) 1983-11-09 India Actress Scorpio
/wiki/Maryam_Abacha 1945-03-04 Niger Former Nigeria First Lady Pisces
/wiki/Ang%C3%A9lique_Abachkina 2000-01-26 Russia NULL Aquarius
/wiki/Juan_Manuel_Abal_Medina 1945-03-01 Mexico Lawyer Pisces
/wiki/Almeda_Abazi 1992-02-13 Albania NULL Aquarius
/wiki/Farhat_Abbas 1976-06-22 Indonesia Lawyer Cancer
/wiki/Bruce_Abbott 1954-07-28 USA Actor Leo
/wiki/Diahnne_Abbott 1945-01-01 USA Actress, singer Capricorn

Let’s count signs:

signCounts <- unique_people %>%
  count(sign, sort = TRUE)

signCounts %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sign n
Virgo 837
Gemini 828
Leo 826
Capricorn 821
Scorpio 809
Cancer 805
Libra 780
Pisces 777
Sagittarius 777
Taurus 762
Aries 758
Aquarius 744

The signs look pretty evenly distributed but there are ~90 more Virgo than Aquarius - is that significantly different from uniform?

chisq.test(signCounts$n)
## 
##  Chi-squared test for given probabilities
## 
## data:  signCounts$n
## X-squared = 13.478, df = 11, p-value = 0.2633

A Chi-Squared test says its not.

Which countries are these people from?

unique_people %>%
  count(country, sort = TRUE) %>%
  head(10) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
country n
USA 3747
UK 1195
NA 449
India 415
Canada 311
Japan 236
France 228
Russia 180
China 169
Australia 146

Obviously since this is the English Wikipedia, there’s a strong bias towards USA and UK famous people, next there are ~450 NA people who either did not have a country or the function failed to parse it for them. There’s also an impressive representation for India, Canada, France and Japan.

What do these people do?…

unique_people %>%
  unnest(occupation) %>%
  mutate(occupation = str_to_lower(occupation)) %>%
  count(occupation, sort = TRUE) %>%
  head(10) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
occupation n
actress 2056
actor 1788
singer 573
producer 533
director 411
model 400
writer 379
screenwriter 340
author 276
film director 275

Unsurprisingly we have many people from the showbiz industry: actors, singers, producers etc. Keep in mind a person can be both an “actor” and a “director”.

Lastly, when were they born?

unique_people %>%
  mutate(year = year(bday), decade = year %/% 10 * 10) %>%
  count(decade) %>%
  arrange(-decade) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
decade n
2010 7
2000 33
1990 310
1980 1112
1970 1654
1960 1632
1950 1507
1940 1546
1930 1024
1920 531
1910 131
1900 31
1890 6

Remember there’s a bias towards married couples in which at least one of the spouses is alive - so we would expect most of the sample to have been born during the 1940s to 1980s.

Enough Talking, Let’s See Some Results

To reiterate, if you skipped previous sections, we have 9.5K celebrities, mostly from USA and UK. These celebrities’ signs are more or less uniformly distributed. They form ~5K couples in which we know both spouses birth dates and signs, if we allow one person to have more than one spouse. If we’re stricter and allow one person to only have one spouse we get 4.2K “independent” couples.

To get some astrology theories of what makes a good couple, we’ll look at this article from the “INSIDER” website (though most of these appear in countless other sources).

We’ll look at 3 “rules” I find most common in astrology websites:

  1. “Couples with the exact same sign are often well-matched”

  2. “While same-sign couples can work, having exact opposite signs is another common pairing”

  3. “Signs whose elements match are another super compatible pairing”

Same Sign

If “all is random” we’d expect the probability of having a same sign partner to be 1/12 = 0.0833.

couples_unique <- couples_unique %>%
  mutate(bdaySpouseA2000 = as_date(map_chr(bdayA, dateTo2000)),
         bdaySpouseB2000 = as_date(map_chr(bdayB, dateTo2000))) %>%
  inner_join(zodiac, by = c("bdaySpouseA2000" = "startDate")) %>%
  inner_join(zodiac, by = c("bdaySpouseB2000" = "startDate")) %>%
  rename(signA = sign.x, signB = sign.y) %>%
  mutate(sameSign = signA == signB)

couples_unique %>%
  count(sameSign) %>%
  mutate(total = sum(n),
         p = n / total) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sameSign n total p
FALSE 4634 5093 0.91
TRUE 459 5093 0.09

Amazingly, 9% of couples are of the same sign, almost 0.7% above what we had expected. Is this statistically significant?

prop.test(459, 5093, 1/12)
## 
##  1-sample proportions test with continuity correction
## 
## data:  459 out of 5093, null probability 1/12
## X-squared = 2.9859, df = 1, p-value = 0.08399
## alternative hypothesis: true p is not equal to 0.08333333
## 95 percent confidence interval:
##  0.08247076 0.09840219
## sample estimates:
##         p 
## 0.0901237

It isn’t statistically significant at the acceptable 95% level of confidence, it is if the level of confidence would have been 90% or the hypothesis was one-sided - but let’s not resort to these kinds of shenanigans here.

What about “independent couples”, where no couple shares another couple’s spouses?

couples_unique_1to1 <- couples_unique_1to1 %>%
  mutate(bdaySpouseA2000 = as_date(map_chr(bdayA, dateTo2000)),
         bdaySpouseB2000 = as_date(map_chr(bdayB, dateTo2000))) %>%
  inner_join(zodiac, by = c("bdaySpouseA2000" = "startDate")) %>%
  inner_join(zodiac, by = c("bdaySpouseB2000" = "startDate")) %>%
  rename(signA = sign.x, signB = sign.y) %>%
  mutate(sameSign = signA == signB)

couples_unique_1to1 %>%
  count(sameSign) %>%
  mutate(total = sum(n),
         p = n / total) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sameSign n total p
FALSE 3813 4185 0.911
TRUE 372 4185 0.089

Yes, it’s 8.9% of couples which is more than the expected 8.3%, but no, it’s not statistically significant:

prop.test(372, 4185, 1/12)
## 
##  1-sample proportions test with continuity correction
## 
## data:  372 out of 4185, null probability 1/12
## X-squared = 1.619, df = 1, p-value = 0.2032
## alternative hypothesis: true p is not equal to 0.08333333
## 95 percent confidence interval:
##  0.08052544 0.09801681
## sample estimates:
##          p 
## 0.08888889

Does this result hold for every sign? For instance if I check all Scorpios - do they tend to have more Scorpio partners than other partners? This is somewhat more tricky, because it also depends on the ratio of Scorpios in the sample. If Scorpios are only 0.08 of the sample, instead of 1/12 = 0.083, we need to compare to this 0.08 null probability, not 0.083. Let’s do this check only for the “independent” version of couples:

prop.test.wrap <- function(x, n, p) prop.test(x, n, p)$p.value

couples_unique_1to1 %>%
  select(signA, sameSign) %>%
  rename(sign = signA) %>%
  rbind(
    couples_unique_1to1 %>%
      select(signB, sameSign) %>%
      rename(sign = signB)
    )%>%
  group_by(sign) %>%
  count(sameSign) %>%
  mutate(sameSign = ifelse(sameSign, "sameSign", "notSameSign")) %>%
  spread(sameSign, n) %>%
  mutate(totalSign = sameSign + notSameSign) %>%
  select(sign, sameSign, totalSign) %>%
  ungroup() %>%
  mutate(totalSample = sum(totalSign), nullProb = totalSign / totalSample,
         p = sameSign/totalSign, gap = p - nullProb,
         p.value = pmap_dbl(list(sameSign, totalSign, nullProb), prop.test.wrap)) %>%
  arrange(p.value) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sign sameSign totalSign totalSample nullProb p gap p.value
Pisces 74 683 8370 0.082 0.108 0.027 0.013
Leo 80 724 8370 0.086 0.110 0.024 0.026
Libra 66 687 8370 0.082 0.096 0.014 0.205
Taurus 60 662 8370 0.079 0.091 0.012 0.304
Cancer 52 708 8370 0.085 0.073 -0.011 0.318
Aries 52 679 8370 0.081 0.077 -0.005 0.717
Aquarius 52 643 8370 0.077 0.081 0.004 0.755
Capricorn 64 722 8370 0.086 0.089 0.002 0.872
Gemini 66 735 8370 0.088 0.090 0.002 0.901
Scorpio 58 705 8370 0.084 0.082 -0.002 0.905
Virgo 64 736 8370 0.088 0.087 -0.001 0.977
Sagittarius 56 686 8370 0.082 0.082 0.000 1.000

Well, first of all for 4 out of 12 signs the effect is reversed: some signs tend to couple with their own sign less than random, though it’s not statistically significant. Interestingly, for two signs - Pisces and Leo - the effect seems to be statistically significant, they couple with their own sign over 2% above the expected probabilities, resulting in a p-value < 0.05.

However, something bugs me here, doesn’t it bug you?

We’re seeing here multiple comparisons, each at the level of confidence of 95%. If we’re testing 100 hypotheses, this means we could expect 5 to be wrongly rejected by random error, even if all 100 should not be rejected! The entire level of confidence is much lower than 95% (see the Multiple Comparisons Problem).

One drastic way to protect ourselves against this problem is slicing our alpha of 5% by the number of comparisons, here 12 (a.k.a the Bonferroni Correction). So we would only reject null hypotheses if the p-value is less than 0.05/12 = 0.004 - none here. Another less drastic way is controlling the False Discovery Rate (see FDR). Won’t bore you with the details, but the result is also not rejecting the null hypotheses for Pisces and Leo.

How about seeing this result by country? Only top 10 countries.

country_data <- couples_unique_1to1 %>%
  select(countryA, sameSign) %>%
  rename(country = countryA) %>%
  rbind(
    couples_unique_1to1 %>%
      select(countryB, sameSign) %>%
      rename(country = countryB)
    )

top10Countries <- country_data %>%
  count(country, sort = TRUE) %>%
  filter(row_number() %in% 1:10) %>%
  select(country) %>%
  unlist()

country_data %>%
  filter(country %in% top10Countries) %>%
  group_by(country) %>%
  count(sameSign) %>%
  mutate(sameSign = ifelse(sameSign, "sameSign", "notSameSign")) %>%
  spread(sameSign, n) %>%
  mutate(totalCountry = sameSign + notSameSign) %>%
  select(country, sameSign, totalCountry) %>%
  ungroup() %>%
  mutate(nullProb = 1/12,
         p = sameSign/totalCountry, gap = p - nullProb,
         p.value = pmap_dbl(list(sameSign, totalCountry, nullProb), prop.test.wrap)) %>%
  arrange(p.value) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
country sameSign totalCountry nullProb p gap p.value
Japan 35 225 0.083 0.156 0.072 0.000
Australia 3 132 0.083 0.023 -0.061 0.018
India 42 382 0.083 0.110 0.027 0.074
China 18 162 0.083 0.111 0.028 0.256
UK 93 1009 0.083 0.092 0.009 0.338
NA 30 406 0.083 0.074 -0.009 0.549
Russia 16 174 0.083 0.092 0.009 0.784
USA 270 3206 0.083 0.084 0.001 0.881
France 16 183 0.083 0.087 0.004 0.947
Canada 23 274 0.083 0.084 0.001 1.000

In all countries except for Japan there is no statistically significant effect (in Australia the reverse effect has p-value < 0.05, however with either a Bonferroni or FDR correction, this p-value is not small enough). For Japanese people, amazingly the ratio of same-sign couples is 15.6%, almost twice the expected 8.3% rate!

Let’s see this by occupation:

occupation_data <- couples_unique_1to1 %>%
  unnest(occupationA) %>%
  mutate(occupationA = str_to_lower(occupationA)) %>%
  select(occupationA, sameSign) %>%
  rename(occupation = occupationA) %>%
  rbind(
    couples_unique_1to1 %>%
      unnest(occupationB) %>%
      mutate(occupationB = str_to_lower(occupationB)) %>%
      select(occupationB, sameSign) %>%
      rename(occupation = occupationB)
    )

top10Occupations <- occupation_data %>%
  count(occupation, sort = TRUE) %>%
  filter(row_number() %in% 1:10) %>%
  select(occupation) %>%
  unlist()

occupation_data %>%
  filter(occupation %in% top10Occupations) %>%
  group_by(occupation) %>%
  count(sameSign) %>%
  mutate(sameSign = ifelse(sameSign, "sameSign", "notSameSign")) %>%
  spread(sameSign, n) %>%
  mutate(totalOccupation = sameSign + notSameSign) %>%
  select(occupation, sameSign, totalOccupation) %>%
  ungroup() %>%
  mutate(nullProb = 1/12,
         p = sameSign/totalOccupation, gap = p - nullProb,
         p.value = pmap_dbl(list(sameSign, totalOccupation, nullProb), prop.test.wrap)) %>%
  arrange(p.value) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
occupation sameSign totalOccupation nullProb p gap p.value
author 32 238 0.083 0.134 0.051 0.006
writer 35 341 0.083 0.103 0.019 0.233
actress 154 1701 0.083 0.091 0.007 0.303
actor 134 1514 0.083 0.089 0.005 0.495
director 31 344 0.083 0.090 0.007 0.721
singer 39 453 0.083 0.086 0.003 0.899
producer 37 454 0.083 0.081 -0.002 0.955
model 27 317 0.083 0.085 0.002 0.986
film director 19 222 0.083 0.086 0.002 1.000
screenwriter 22 270 0.083 0.081 -0.002 1.000

We don’t see a statistically significant effect for any occupation. Remember that the p-value for authors we’d want to see would be smaller than 0.05/10 = 0.005, which it is not.

Finally let’s see this by age, or decade of birth:

couples_unique_1to1 %>%
  mutate(year = year(bdayA), decade = year %/% 10 * 10) %>%
  select(decade, sameSign) %>%
  rbind(
    couples_unique_1to1 %>%
      mutate(year = year(bdayB), decade = year %/% 10 * 10) %>%
      select(decade, sameSign)
    ) %>%
  filter(decade %in% seq(1920, 1990, 10)) %>%
  group_by(decade) %>%
  count(sameSign) %>%
  mutate(sameSign = ifelse(sameSign, "sameSign", "notSameSign")) %>%
  spread(sameSign, n) %>%
  mutate(totalDecade = sameSign + notSameSign) %>%
  select(decade, sameSign, totalDecade) %>%
  ungroup() %>%
  mutate(nullProb = 1/12,
         p = sameSign/totalDecade, gap = p - nullProb,
         p.value = pmap_dbl(list(sameSign, totalDecade, nullProb), prop.test.wrap)) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
decade sameSign totalDecade nullProb p gap p.value
1920 32 445 0.083 0.072 -0.011 0.432
1930 75 850 0.083 0.088 0.005 0.649
1940 123 1293 0.083 0.095 0.012 0.138
1950 108 1345 0.083 0.080 -0.003 0.724
1960 132 1437 0.083 0.092 0.009 0.262
1970 146 1493 0.083 0.098 0.014 0.048
1980 88 1038 0.083 0.085 0.001 0.911
1990 22 301 0.083 0.073 -0.010 0.590

No decade shows a statistically significant effect, not even people born in the 1970s, because for 8 multiple comparisons we’d want the smallest p-value, according to FDR or Bonferroni, to be smaller than 0.05/8 = 0.006 and it is not.

Opposite Signs

So far we’ve only discovered statistically significant evidence in favor of astrology when looking at only… Japanese people2. How about looking at opposite signs? I’m a Scorpio, my best friends are Taurus, maybe there’s something there.

oppositeSignPairs <- list(
    c("Aries", "Libra"),
    c("Taurus", "Scorpio"),
    c("Gemini", "Sagittarius"),
    c("Cancer", "Capricorn"),
    c("Leo", "Aquarius"),
    c("Virgo", "Pisces")
  )

IsOppositeSigns <- function(signA, signB) {
  signA != signB & any(map_lgl(oppositeSignPairs, function(pair) all(c(signA, signB) %in% pair)))
  
}

couples_unique <- couples_unique %>%
  mutate(oppositeSign = map2_lgl(signA, signB, IsOppositeSigns))

couples_unique %>%
  count(oppositeSign) %>%
  mutate(total = sum(n),
         p = n / total) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
oppositeSign n total p
FALSE 4692 5093 0.921
TRUE 401 5093 0.079
prop.test(401, 5093, 1/12)
## 
##  1-sample proportions test with continuity correction
## 
## data:  401 out of 5093, null probability 1/12
## X-squared = 1.3499, df = 1, p-value = 0.2453
## alternative hypothesis: true p is not equal to 0.08333333
## 95 percent confidence interval:
##  0.07155840 0.08655607
## sample estimates:
##          p 
## 0.07873552

Here the effect is reverse, there are less couples with opposite signs in this sample, but it’s not statistically significant.

How about “independent couples”?

couples_unique_1to1 <- couples_unique_1to1 %>%
  mutate(oppositeSign = map2_lgl(signA, signB, IsOppositeSigns))

couples_unique_1to1 %>%
  count(oppositeSign) %>%
  mutate(total = sum(n),
         p = n / total) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
oppositeSign n total p
FALSE 3859 4185 0.922
TRUE 326 4185 0.078
prop.test(326, 4185, 1/12)
## 
##  1-sample proportions test with continuity correction
## 
## data:  326 out of 4185, null probability 1/12
## X-squared = 1.5486, df = 1, p-value = 0.2133
## alternative hypothesis: true p is not equal to 0.08333333
## 95 percent confidence interval:
##  0.07004526 0.08653481
## sample estimates:
##          p 
## 0.07789725

Nope, nothing significantly different than random here.

How about by each sign?

couples_unique_1to1 %>%
  select(signA, oppositeSign) %>%
  rename(sign = signA) %>%
  rbind(
    couples_unique_1to1 %>%
      select(signB, oppositeSign) %>%
      rename(sign = signB)
    )%>%
  group_by(sign) %>%
  count(oppositeSign) %>%
  mutate(oppositeSign = ifelse(oppositeSign, "oppositeSign", "notOppositeSign")) %>%
  spread(oppositeSign, n) %>%
  mutate(totalSign = oppositeSign + notOppositeSign) %>%
  select(sign, oppositeSign, totalSign) %>%
  ungroup() %>%
  mutate(totalSample = sum(totalSign), nullProb = totalSign / totalSample,
         p = oppositeSign/totalSign, gap = p - nullProb,
         p.value = pmap_dbl(list(oppositeSign, totalSign, nullProb), prop.test.wrap)) %>%
  arrange(p.value) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sign oppositeSign totalSign totalSample nullProb p gap p.value
Virgo 53 736 8370 0.088 0.072 -0.016 0.144
Gemini 53 735 8370 0.088 0.072 -0.016 0.150
Leo 54 724 8370 0.086 0.075 -0.012 0.283
Capricorn 55 722 8370 0.086 0.076 -0.010 0.369
Taurus 58 662 8370 0.079 0.088 0.009 0.459
Aquarius 54 643 8370 0.077 0.084 0.007 0.543
Cancer 55 708 8370 0.085 0.078 -0.007 0.553
Libra 53 687 8370 0.082 0.077 -0.005 0.688
Sagittarius 53 686 8370 0.082 0.077 -0.005 0.705
Pisces 53 683 8370 0.082 0.078 -0.004 0.755
Aries 53 679 8370 0.081 0.078 -0.003 0.824
Scorpio 58 705 8370 0.084 0.082 -0.002 0.905

Nothing significant here. If anything it’s interesting to note that for 10 out of 12 signs the effect is reverse - less couples with opposite signs than randomly expected, though not significant.

Same Element

This last astrology “rule” I’ve heared a lot: signs from the same element create good couples. If you have no idea what “element” is, apparently there are four of those: Earth, Fire, Water and Air. Each element consists of 3 of the 12 zodiac signs, e.g. Water is the element of Cancer, Scorpio and Pisces. Interestingly I’m a Scorpio and my partner is Cancer - maybe there’s something to this “rule”?

Notice the null probability now is 0.25, as with random matchings we would expect a person to couple with the 3 out of 12 signs which belong to her element, with 3/12 = 0.25 probability.

signToElement <- function(sign) {
  if (sign %in% c("Aries", "Leo", "Sagittarius")) return("Fire")
  if (sign %in% c("Gemini", "Libra", "Aquarius")) return("Air")
  if (sign %in% c("Taurus", "Virgo", "Capricorn")) return("Earth")
  if (sign %in% c("Cancer", "Scorpio", "Pisces")) return("Water")
}

couples_unique <- couples_unique %>%
  mutate(elementA = map_chr(signA, signToElement),
         elementB = map_chr(signB, signToElement),
         sameElement = elementA == elementB)

couples_unique %>%
  count(sameElement) %>%
  mutate(total = sum(n),
         p = n / total) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sameElement n total p
FALSE 3793 5093 0.745
TRUE 1300 5093 0.255
prop.test(1300, 5093, 0.25)
## 
##  1-sample proportions test with continuity correction
## 
## data:  1300 out of 5093, null probability 0.25
## X-squared = 0.72158, df = 1, p-value = 0.3956
## alternative hypothesis: true p is not equal to 0.25
## 95 percent confidence interval:
##  0.2433690 0.2675076
## sample estimates:
##         p 
## 0.2552523

We see a rate of 25.5% same-element couples, but this is not significantly different from 25%.

And with “independent couples”…

couples_unique_1to1 <- couples_unique_1to1 %>%
  mutate(elementA = map_chr(signA, signToElement),
         elementB = map_chr(signB, signToElement),
         sameElement = elementA == elementB)

couples_unique_1to1 %>%
  count(sameElement) %>%
  mutate(total = sum(n),
         p = n / total) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sameElement n total p
FALSE 3117 4185 0.745
TRUE 1068 4185 0.255
prop.test(1068, 4185, 0.25)
## 
##  1-sample proportions test with continuity correction
## 
## data:  1068 out of 4185, null probability 0.25
## X-squared = 0.57547, df = 1, p-value = 0.4481
## alternative hypothesis: true p is not equal to 0.25
## 95 percent confidence interval:
##  0.2420998 0.2687475
## sample estimates:
##         p 
## 0.2551971

We get a very similar result.

Let’s see this by sign:

couples_unique_1to1 %>%
  select(signA, sameElement) %>%
  rename(sign = signA) %>%
  rbind(
    couples_unique_1to1 %>%
      select(signB, sameElement) %>%
      rename(sign = signB)
    )%>%
  group_by(sign) %>%
  count(sameElement) %>%
  mutate(sameElement = ifelse(sameElement, "sameElement", "notSameElement")) %>%
  spread(sameElement, n) %>%
  mutate(totalSign = sameElement + notSameElement) %>%
  select(sign, sameElement, totalSign) %>%
  ungroup() %>%
  mutate(totalSample = sum(totalSign), nullProb = 0.25,
         p = sameElement/totalSign, gap = p - nullProb,
         p.value = pmap_dbl(list(sameElement, totalSign, nullProb), prop.test.wrap)) %>%
  arrange(p.value) %>%
  kable(format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
sign sameElement totalSign totalSample nullProb p gap p.value
Pisces 199 683 8370 0.25 0.291 0.041 0.014
Leo 209 724 8370 0.25 0.289 0.039 0.018
Aries 152 679 8370 0.25 0.224 -0.026 0.126
Gemini 168 735 8370 0.25 0.229 -0.021 0.194
Sagittarius 183 686 8370 0.25 0.267 0.017 0.332
Aquarius 150 643 8370 0.25 0.233 -0.017 0.351
Scorpio 187 705 8370 0.25 0.265 0.015 0.373
Taurus 175 662 8370 0.25 0.264 0.014 0.419
Virgo 191 736 8370 0.25 0.260 0.010 0.580
Libra 166 687 8370 0.25 0.242 -0.008 0.644
Cancer 174 708 8370 0.25 0.246 -0.004 0.828
Capricorn 182 722 8370 0.25 0.252 0.002 0.932

If you’re following what we’re doing it should come as no surprise that Pisces and Loe have p-values < 0.05, because we’ve already seen this phenomenon when looking at “same sign” couples. However, due to multiple comparisons correction, again, these p-values are not small enough, and we can’t reject the hypotheses that Pisces and Leo people choose partners at random3

What’s Wrong?

As I said there are many problems with this approach, let’s vent some of them out in the open:

Not a Sample

I’m taking my couples from English Wikipedia. This means most of the people here would be famous, I’d even say celebrities, e.g. many Hollywood actors and actresses. They’re probably not representative of the entire world population. For example, maybe they have a tendency to believe in astrology and pick their partners according to their signs? Maybe they have the reverse tendency? Maybe they all go to the same astrologist up in Hollywood who tells them things completely counter-astrology-intuitive?

Not Independent

When my friend and I gathered data for our little project, we at least tried to ask random people, so there wouldn’t be connection between different couples of best friends. Here, for example:

You can see there is extreme dependencies between couples. And so if Angelina Jolie is a strong believer in astrology, being a part of a few couples here, she’s heavily affecting results. I’ve tried to deal with that creating an “independent” version of the couples, couples_unique_1to1, see above. However even if every person here belonged to just a single relationship, it is still a pretty homogeneous, small population, they could be watching each other and being affected by that.

Not Interesting?

Suppose someone tells you there are 50% biological female humans and 50% biological male humans living on earth. You sample 1000 people and you get 500 males, 500 females. You sample 10000 people and get 5000 males, 5000 females. You sample 1 million people, and get 499,000 males and 501,000 females, which by a proportions test is “significant” (p-value < 0.05) indicating that the ratio of male to female in the entire population is different from the 50% you thought was the current state. The question is: It is significant. Is it interesting though? I’m not sure at all, it’s just 1,000 people adavantage to females in the sample! The sample proportion is 50.1%, which is not that different from 50%. And with a large enough sample size, any difference can be made significant, the question is, again, is it interesting? Same here, I can take 1 million independent couples sample, and see that yes, there are signifcantly more same-sign couples. But the proportion that my sample yields is p = 0.088, while the null probability is p = 0.083. One could claim that, OK, it’s significant, but it’s not interesting, and you’d expect some true magic about what makes a good couple to produce a much greater effect.

Not Powerful

On the other hand, one could claim the entire experiment isn’t “powerful” enough, in the sense that it does not have enough statistical power to decide a small effect is significant. I’ll give you this intuition: 4-5K couples are definitely enough to say that an observed ratio of same-sign couples of 15% is definitely “significantly” larger than 8.33%. But are they enough if we expect to see a ratio of 8.34%, just 0.01% larger than the theoretic ratio of 8.33%? Probably not. Furthermore, suppose the ratio for Pisces is indeed 10.8% for same-sign couples, and we should reject the null hypothesis for this sign. But we didn’t, because we only saw n = 683 Pisces - is that fair? In fact, if we had 10 times larger sample size, 6830, and we observed this ratio of 10.8% - it would have been significant! Some would say that if your test isn’t powerful enough (which mostly means your sample size n is too small) to detect the effect size you’re expecting - don’t do it! You can reach a wrong conclusion just because you didn’t use enough observations, not because of anything real.

Not Decisive

I think it’s problematic if you know in advance that your experiment is not decisive between two hypotheses or two “worlds”, and its results can be interpreted “in the favor of” any direction. The case here is a good example:

The Sign It’s The End

To summarize what just happened: I scraped almost 1 million Wikipedia profiles, couldn’t find any statistical evidence in favor of any astrological “rule” of what makes a good couple, except for Japanese people4, then tore this entire research apart with a few simple claims. Lovely. But it was extremely fun. The last point is to me the most important - when someone believes in something, really really believes in something, be it religion, astrology or that the earth is flat - the chances to convince them they’re wrong using statistical evidence, are slight to none. Here is what Beyonce has to say about astrology.


  1. True for October 2017

  2. Yes, I know we made dozens of comparisons and even this p-value might have been rejected considering this fact

  3. Or, if you don’t choose your partner, but nature chooses them for you, we can’t reject the hypotheses that nature doesn’t care about signs, for Pisces and Leo and all other signs.

  4. Again, I know we made dozens of comparisons and even this p-value might have been rejected considering this fact