It’s Pride Month! So I thought, maybe I should perform some cool analyses of data concerning the Gay community. From one google result to another I reached the American “Centers for Disease Control and Prevention” (CDC) website, and this crazy dataset concerning a survey of “Youth Risk Behavior” (YRBSS) conducted at the national and state level since 1991 (!) every two years. The latest survey from 2015 also included questions regarding “sexual identity”, which is what I wanted. Right? Wrong! To my amazement, the data was publicly released in the most unconvenient way, as if the year was 2003 and we were all using SAS, SPSS, or… STATA. So I “accessiblized” this dataset, and made it a package!

WARNING: I WRITE A LOT. To skip to the grpahs, scroll down to the “Finally, Some Grpahs” section.

Oh No, She Didn’t

To reiterate, we’re talking data gold mine here. ~170 multiple-choice questions regarding risk behavior of youth (9th to 12th grade), asked in huge surveys conducted at the national and state level, every 2 years, from 1991 to 2015. Not all questions appear in all years and for all states, but it’s still pretty amazing. Questions examples:

  • Are there any foods that you have to avoid because eating the food could cause an allergic reaction, like skin rashes, swelling, itching, vomiting, coughing, or trouble breathing?
  • During the past 30 days, how many times did you use any form of cocaine, including powder, crack, or freebase?
  • During the past 12 months, have you ever been the victim of teasing or name calling because someone thought you were gay, lesbian, or bisexual?

The YRBSS also has a Youth Online Data Analysis Tool which lets you pick any variable and see the proportion of respondents answering any question options, by almost any filter you’d like (e.g. you can get the proportion of black 9th grade respondents who never put a helmet when riding a bicycle, and the proportion of New York female respondents in 1997 who never had sexual intercourse). You can even get nice interactive graphs of the US states map colored by the proportion found for the variable of your choosing in a given year. And barcharts. Important note: you can only get the above for what I call “The Binary Version” of the survey, where each question (e.g. “How many hours of sleep do you get?”) was “binarized” by the CDC researchers to a “Yes”/“No” question (e.g. “Had 8 or more hours of sleep: Yes/No”). From now on I am also going to talk exclusively of this version of the YRBSS results.

Anyway, issue no. 1: the API is too unconvenient and slow for any real research.

And issue no. 2: the barcharts are ugly.

You Better Work

So I’m thinking: let’s load the raw data into R and give these analyses some swag. I go here and realize I can only download the (“Combined”) data as a “ASCII” raw .dat file or as a MS Access (?!) file. The raw “ASCII” data file, however, has no column labels, and to get those you need to run a script you can download either in SAS or SPSS.

MS Access it is.

I follow instructions here, download the MS Access driver from here, download and unzip the “National” Access zip and do:

library(tidyverse)
library(RODBC)

conn <- odbcDriverConnect("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=SADC_2015_National.MDB")

subset(sqlTables(conn), TABLE_TYPE == "TABLE")

yrbss_data_binary <- sqlFetch(conn, "SADCQN") %>% as_tibble()

close(conn)
## [1] 188898    176

So I now have my yrbss_data_binary table, for the entire US, which holds 188,898 rows and 176 columns for 176 binary questions, where 1 means “Yes”, 2 means “No” and of course there are NAs. I’m happy, I want to test this data to see that the proportions I’m getting are the same as in the Youth Online Data Analysis Tool. Now the YRBSS tool gives for the qn63 variable (“Currently sexually active”), for the entire US, for 2015, this proportion:

Which means that of the 13,910 non-NA 9th to 12th grades kids in the US answering the survey in 2015, 30.1% had sex in the last 3 months (prior to answering the survey). The 95% confidence interval is 27.4% to 32.9%. With R code this should be:

yrbss_data_binary %>%
  filter(year == 2015) %>%
  count(qn63)
## # A tibble: 3 x 2
##    qn63     n
##   <dbl> <int>
## 1     1  4304
## 2     2  9606
## 3    NA  1714

And 4304 + 9606 = 13910 so I’m getting the same sample size n, and… 4304 / (4304 + 9606) = 30.9%.

Well, that’s a failure. I wanted 30.1%.

Not Today, Satan

Almost despaired, I remember I do have a Master in Statistics, and one should RTFM. So I do that, and understand the survey CANNOT be simply “summed-up” the way I did. The survey entails a complex weighted sampling design, and I need a specific package for extracting even simple proportions: the survey package by Thomas Lumley. Let’s:

library(survey)

yrbss_data_binary <- svydesign(id = ~PSU, weight = ~weight, strata = ~stratum,
                        data = yrbss_data_binary, nest = TRUE)

svyciprop(~I(qn63 == 1), subset(yrbss_data_binary, year == 2015), na.rm=TRUE)
##                     2.5% 97.5%
## I(qn63 == 1) 0.301 0.273  0.33

Success! BUT AT WHAT COST?!

Then it hits me. No one should suffer what I just went through. I am going to make this… a package!

The yrbss Package

See my package on Github (still under construction). I built it quickly using (by this order) Hillary Parker’s post, Karl Broman’s primer and Hadley Wickham’s book. You should be able to install it with devtools:

devtools::install_github("gsimchoni/yrbss")

In it you will find just a few functions to help you get most of what you want (tell me if more is needed, etc). The main function is getProportionSingleVariable which extracts a list of the sample size, proportion and confidence interval for the variable of your choosing, for a given year, a given location (“US” or a US state abbreviation), and possibly additional filters (e.g. race7 == 3 for African-American respondents only). See ?getProportionSingleVariable for more. The qn63 proportion we got earlier can be retrieved by:

library(yrbss)

getProportionSingleVariable(.variable = "qn63", .location = "US", .year = 2015)
## $prop
## [1] 0.3009435
## 
## $ciLB
## [1] 0.2728081
## 
## $ciUB
## [1] 0.3306611
## 
## $n
## [1] 13910

No MS Access driver, no hassle.

Let’s try something more complex. 12th grade female respondents in 2009 Alabama whose partners used a condom, during the last time they had sex:

getProportionSingleVariable("qn65", "AL", 2009, c("sex == 1", "grade == 4"))
## $prop
## [1] 0.3834514
## 
## $ciLB
## [1] 0.2787636
## 
## $ciUB
## [1] 0.5001886
## 
## $n
## [1] 102

So that’s 38.3% (which means over 60% didn’t use a condom, Lord), and you can verify that in the Youth Online Data Analysis Tool.

Now, you’re probably thinking “where does he know that the qn65 variable is about condom use”. You can view the list of variable plus their labels and summaries like this:

yrbss_questions_binary %>% head(20)
## # A tibble: 20 x 3
##       variable                                        label
##          <chr>                                        <chr>
##  1    sitecode                                     sitecode
##  2    sitename                                     sitename
##  3    sitetype                                     sitetype
##  4 sitetypenum                                  sitetypenum
##  5        year                                         year
##  6    survyear                                     survyear
##  7      weight                                       weight
##  8     stratum                                      stratum
##  9         PSU                                          PSU
## 10      record                                       record
## 11         age                             How old are you?
## 12         sex                            What is your sex?
## 13       grade                       In what grade are you?
## 14       race4                                        race4
## 15       race7                                        race7
## 16    stheight      How tall are you without your shoes on?
## 17    stweight How much do you weigh without your shoes on?
## 18         bmi                                          bmi
## 19      bmipct                                       bmipct
## 20     qnobese                                      qnobese
## # ... with 1 more variables: summary <chr>

You can also open this as a CSV file in MS Excel if you prefer, because each variable’s summary is a very long string (“yrbss_questions_binary.csv”). To see appropriate state abbreviations use the getListOfParticipatingStates function. But to know that the value 4 means “12th grade” for the grade variable, you’ll have to RTFM. You can get this 80 pages (!) long manual as a PDF in the library’s directory (“2015_yrbs_sadc_documentation.pdf”) or at the YRBSS site.

Finally, Some Graphs

Using some purrr and some ggplot2 we can finally get some neat visualizations.

tibble(.variable = rep("qn65", 26),
       .location = rep("US", 26),
       .year = rep(seq(1991, 2015, 2), 2),
       .filters = rep(c("sex == 1", "sex == 2"), each = 13)) %>%
  bind_cols(pmap_df(., getProportionSingleVariable)) %>%
  ggplot(aes(.year, prop, color = .filters)) +
  geom_line() +
  scale_color_manual(labels = c("Female", "Male"), values = c("red", "blue")) +
  labs(title = "YRBSS: Proportion of Male and Female US Youth Reporting Use of Condom",
       x = "Survey Year", y = "Proportion Reporting Use of Condom",
       color = "Gender")

Nice. We see two interesting effects here: (a) girls consistently report ~15% less chance of condom use than boys (either they tend to tell the truth more, or they just don’t know - I’ll let you ponder), and (b) there has been an increase in condom usage rate since 1991 until a peak in 2003-2005, and a slow but alarming decrease since then.

Let’s do the same trend graph with marijuana usage and White vs. African-American students:

tibble(.variable = rep("qn47", 26),
       .location = rep("US", 26),
       .year = rep(seq(1991, 2015, 2), 2),
       .filters = rep(c("race7 == 3", "race7 == 6"), each = 13)) %>%
  bind_cols(pmap_df(., getProportionSingleVariable)) %>%
  ggplot(aes(.year, prop, color = .filters)) +
  geom_line() +
  scale_color_manual(labels = c("White", "Black"), values = c("red", "blue")) +
  labs(title = "YRBSS: Proportion of White and Black US Youth Reporting Use of Marijuana",
       x = "Survey Year", y = "Proportion Reporting Use of Marijuana",
       color = "Race")

White kids seem to be smoking marijuana slightly more than African-American kids. And in general there was a peak during 1997 (over 50% of White Amrican kids reported smoking pot!), a decrease for African-American kids since then, and a decrease then an increase for White kids.

Now, I’d love to do the same trend graphs with “Gays vs. Straights”, unfortunately:

yrbss_data_binary$variables %>% count(year, sexid2)
## # A tibble: 16 x 3
##     year sexid2     n
##    <dbl>  <dbl> <int>
##  1  1991     NA 12272
##  2  1993     NA 16296
##  3  1995     NA 10904
##  4  1997     NA 16262
##  5  1999     NA 15349
##  6  2001     NA 13601
##  7  2003     NA 15214
##  8  2005     NA 13917
##  9  2007     NA 14041
## 10  2009     NA 16410
## 11  2011     NA 15425
## 12  2013     NA 13583
## 13  2015      1 12954
## 14  2015      2  1246
## 15  2015      3   503
## 16  2015     NA   921

Which means the sexid2 variable (1 for Heterosexual, 2 for LGBT, 3 for Unsure) was entered into the YRBSS survey only in 2015. So let’s look at 2015 only, the top 10 categories in which LGBT and Heterosexul US Youth show the largest differences:

allQNVars <- yrbss_questions_binary %>%
  select(variable) %>%
  filter(stringr::str_detect(variable, "^qn")) %>%
  unlist()

gaysStraightTable <- tibble(.variable = rep(allQNVars, 2),
       .location = rep("US", 2 * length(allQNVars)),
       .year = rep(2015, 2 * length(allQNVars)),
       .filters = rep(c("sexid2 == 1", "sexid2 == 2"), each = length(allQNVars))) %>%
  bind_cols(pmap_df(., getProportionSingleVariable))

diffLabels <- yrbss_questions_binary %>%
  filter(variable %in% allQNVars) %>%
  select(label) %>%
  unlist()

gayStraightDiffs <- tibble(label = diffLabels,
                           straight = gaysStraightTable$prop[1:length(allQNVars)],
                           gay = gaysStraightTable$prop[
                             (length(allQNVars) + 1):(2 * length(allQNVars))]) %>%
  na.omit() %>%
  mutate(diff = gay - straight) %>%
  arrange(-diff) %>%
  slice(c(1:10, 121:130)) %>%
  arrange(diff)

gayStraightDiffs$label <- factor(gayStraightDiffs$label,
                                 levels = as.character(gayStraightDiffs$label))

library(grid)
library(gridExtra)

p <- ggplot(gayStraightDiffs, aes(x = label, y = diff)) +
  geom_bar(stat = "identity", fill = "red", alpha = 0.5) +
  coord_flip() +
  labs(title = NULL,
     y = "Difference: LGBT - Heterosexual", x = "")

title.grob <- textGrob(
    label = "YRBSS: Top Gaps Between LGBT and Heterosexual US Youth in 2015",
    x = unit(0.5, "lines"), 
    y = unit(0, "lines"),
    hjust = 0, vjust = 0,
    gp = gpar(fontsize = 14))

p1 <- arrangeGrob(p, top = title.grob)
grid.arrange(p1)

LGBT youth reports feeling sadder, considering and attempting suicide, using cigarettes and marijuana, and being bullied - more than Heterosexual youth. They tend to get less exercise, eat breakfast and drink water less regularly and use a condom less regularly. These results are sad but not surprising.

Next, we can’t wrap up this post without a map, can we? Let’s see a map of LGBT youth who were bullied at school across the US:

library(choroplethr)

states <- getListOfParticipatingStates()

lgbtBulliedAtSchool <- tibble(.variable = rep("qn24", length(states)),
       .location = states,
       .year = rep(2015, length(states)),
       .filters = "sexid2 == 2") %>%
  bind_cols(pmap_df(., getProportionSingleVariable)) %>%
  na.omit()


allStatesData <- tibble(region = state.name, region2 = state.abb) %>%
    left_join(lgbtBulliedAtSchool, c("region2" = ".location")) %>%
    select(region, region2, prop) %>%
    mutate(value = prop, region = tolower(region)) %>%
    na.omit()

state_choropleth(allStatesData,
                 title = "YRBSS: Proportion of LGBT Youth Reporting Being Bullied at School in 2015")

As you can see this map could have been a lot more informative if we had data for all states. Still it’s pretty amazing that the rate for California - 1 in 4 kids! - is considered “low”. In Wyoming it is 1 in 2.

Let’s do the map chart for a question which would give us more states with a sample size large enough. The proportion of respondents reporting to have had sex:

hadSex <- tibble(.variable = rep("qn60", length(states)),
       .location = states,
       .year = rep(2015, length(states))) %>%
  bind_cols(pmap_df(., getProportionSingleVariable)) %>%
  na.omit()


allStatesData <- tibble(region = state.name, region2 = state.abb) %>%
    left_join(hadSex, c("region2" = ".location")) %>%
    select(region, region2, prop) %>%
    mutate(value = prop, region = tolower(region)) %>%
    na.omit()

state_choropleth(allStatesData,
                 title = "YRBSS: Proportion of Youth Who Claim They Had Sex in 2015")

hadSex %>%
  arrange(-prop) %>%
  slice(c(1:5, 26:30))
## # A tibble: 10 x 7
##    .variable .location .year      prop      ciLB      ciUB     n
##        <chr>     <chr> <dbl>     <dbl>     <dbl>     <dbl> <int>
##  1      qn60        MS  2015 0.4800649 0.4370643 0.5233628  1671
##  2      qn60        DE  2015 0.4678359 0.4196092 0.5166710  2521
##  3      qn60        WV  2015 0.4671404 0.4148493 0.5201636  1454
##  4      qn60        AL  2015 0.4634046 0.4075651 0.5201771  1313
##  5      qn60        AR  2015 0.4595870 0.3900593 0.5307242  2282
##  6      qn60        CT  2015 0.3297859 0.2904887 0.3716148  2015
##  7      qn60        NE  2015 0.3248977 0.2812191 0.3718508  1365
##  8      qn60        MD  2015 0.3235981 0.3132136 0.3341593 46251
##  9      qn60        CA  2015 0.3231991 0.2725465 0.3783692  1828
## 10      qn60        NY  2015 0.3042986 0.2627823 0.3492666  8508

48% of kids in Mississippi claim they had sex. Delaware, West Virginia, Alabama and Arizona not far behind with 46%. The least “sexually active” kids are in New York - 30%. Then California, Maryland, Nebraska and Connecticut with 32%.

Lastly, let’s correlate the proportion of kids saying they had sex, and the proportion of kids saying they drink a can, bottle, or glass of soda or pop three or more times per day:

soda <- tibble(.variable = rep("qnsoda3", length(states)),
       .location = states,
       .year = rep(2015, length(states))) %>%
  bind_cols(pmap_df(., getProportionSingleVariable)) %>%
  na.omit()

sexAndSoda <- inner_join(hadSex, soda, ".location")

cor.test(sexAndSoda$prop.x, sexAndSoda$prop.y)
## 
##  Pearson's product-moment correlation
## 
## data:  sexAndSoda$prop.x and sexAndSoda$prop.y
## t = 4.9568, df = 27, p-value = 3.42e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4333785 0.8433925
## sample estimates:
##       cor 
## 0.6902482
ggplot(sexAndSoda, aes(prop.x, prop.y, label = .location)) +
  geom_point() +
  geom_smooth(method = "loess") +
  geom_point(data = sexAndSoda %>% filter(.location == 'MS'), colour="red", size=5) +
  geom_text(hjust = 0.5, vjust = -0.6) +
  labs(title = "YRBSS: Proportion of US Youth Drinking Soda 3+ Times a Day vs. Having Sex",
       x = "P(Having Sex)", y = "P(Soda)")

There is some correlation between drinking too much soda and having sex at an early age. And Go Delaware!

Can I Get an Amen Up in Here?

My first R package! Hope I was able to show what can be done with it. Go ahead, install it, use it and tell me about all the places it breaks! I hope it won’t though. And if any LGBT youth is listening out there - be safe, do well at school, get some exercise. I promise you, it gets better.