It’s Pride Month! So I thought, maybe I should perform some cool analyses of data concerning the Gay community. From one google result to another I reached the American “Centers for Disease Control and Prevention” (CDC) website, and this crazy dataset concerning a survey of “Youth Risk Behavior” (YRBSS) conducted at the national and state level since 1991 (!) every two years. The latest survey from 2015 also included questions regarding “sexual identity”, which is what I wanted. Right? Wrong! To my amazement, the data was publicly released in the most unconvenient way, as if the year was 2003 and we were all using SAS, SPSS, or… STATA. So I “accessiblized” this dataset, and made it a package!

WARNING: I WRITE A LOT. To skip to the grpahs, scroll down to the “Finally, Some Grpahs” section.

# Oh No, She Didn’t

To reiterate, we’re talking data gold mine here. ~170 multiple-choice questions regarding risk behavior of youth (9th to 12th grade), asked in huge surveys conducted at the national and state level, every 2 years, from 1991 to 2015. Not all questions appear in all years and for all states, but it’s still pretty amazing. Questions examples:

• Are there any foods that you have to avoid because eating the food could cause an allergic reaction, like skin rashes, swelling, itching, vomiting, coughing, or trouble breathing?
• During the past 30 days, how many times did you use any form of cocaine, including powder, crack, or freebase?
• During the past 12 months, have you ever been the victim of teasing or name calling because someone thought you were gay, lesbian, or bisexual?

The YRBSS also has a Youth Online Data Analysis Tool which lets you pick any variable and see the proportion of respondents answering any question options, by almost any filter you’d like (e.g. you can get the proportion of black 9th grade respondents who never put a helmet when riding a bicycle, and the proportion of New York female respondents in 1997 who never had sexual intercourse). You can even get nice interactive graphs of the US states map colored by the proportion found for the variable of your choosing in a given year. And barcharts. Important note: you can only get the above for what I call “The Binary Version” of the survey, where each question (e.g. “How many hours of sleep do you get?”) was “binarized” by the CDC researchers to a “Yes”/“No” question (e.g. “Had 8 or more hours of sleep: Yes/No”). From now on I am also going to talk exclusively of this version of the YRBSS results.

Anyway, issue no. 1: the API is too unconvenient and slow for any real research.

And issue no. 2: the barcharts are ugly.

# You Better Work

So I’m thinking: let’s load the raw data into R and give these analyses some swag. I go here and realize I can only download the (“Combined”) data as a “ASCII” raw .dat file or as a MS Access (?!) file. The raw “ASCII” data file, however, has no column labels, and to get those you need to run a script you can download either in SAS or SPSS.

MS Access it is.

library(tidyverse)
library(RODBC)

conn <- odbcDriverConnect("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=SADC_2015_National.MDB")

subset(sqlTables(conn), TABLE_TYPE == "TABLE")

yrbss_data_binary <- sqlFetch(conn, "SADCQN") %>% as_tibble()

close(conn)
## [1] 188898    176

So I now have my yrbss_data_binary table, for the entire US, which holds 188,898 rows and 176 columns for 176 binary questions, where 1 means “Yes”, 2 means “No” and of course there are NAs. I’m happy, I want to test this data to see that the proportions I’m getting are the same as in the Youth Online Data Analysis Tool. Now the YRBSS tool gives for the qn63 variable (“Currently sexually active”), for the entire US, for 2015, this proportion:

Which means that of the 13,910 non-NA 9th to 12th grades kids in the US answering the survey in 2015, 30.1% had sex in the last 3 months (prior to answering the survey). The 95% confidence interval is 27.4% to 32.9%. With R code this should be:

yrbss_data_binary %>%
filter(year == 2015) %>%
count(qn63)
## # A tibble: 3 x 2
##    qn63     n
##   <dbl> <int>
## 1     1  4304
## 2     2  9606
## 3    NA  1714

And 4304 + 9606 = 13910 so I’m getting the same sample size n, and… 4304 / (4304 + 9606) = 30.9%.

Well, that’s a failure. I wanted 30.1%.

# Not Today, Satan

Almost despaired, I remember I do have a Master in Statistics, and one should RTFM. So I do that, and understand the survey CANNOT be simply “summed-up” the way I did. The survey entails a complex weighted sampling design, and I need a specific package for extracting even simple proportions: the survey package by Thomas Lumley. Let’s:

library(survey)

yrbss_data_binary <- svydesign(id = ~PSU, weight = ~weight, strata = ~stratum,
data = yrbss_data_binary, nest = TRUE)

svyciprop(~I(qn63 == 1), subset(yrbss_data_binary, year == 2015), na.rm=TRUE)
##                     2.5% 97.5%
## I(qn63 == 1) 0.301 0.273  0.33

Success! BUT AT WHAT COST?!

Then it hits me. No one should suffer what I just went through. I am going to make this… a package!

# The yrbss Package

See my package on Github (still under construction). I built it quickly using (by this order) Hillary Parker’s post, Karl Broman’s primer and Hadley Wickham’s book. You should be able to install it with devtools:

devtools::install_github("gsimchoni/yrbss")

In it you will find just a few functions to help you get most of what you want (tell me if more is needed, etc). The main function is getProportionSingleVariable which extracts a list of the sample size, proportion and confidence interval for the variable of your choosing, for a given year, a given location (“US” or a US state abbreviation), and possibly additional filters (e.g. race7 == 3 for African-American respondents only). See ?getProportionSingleVariable for more. The qn63 proportion we got earlier can be retrieved by:

library(yrbss)

getProportionSingleVariable(.variable = "qn63", .location = "US", .year = 2015)
## $prop ## [1] 0.3009435 ## ##$ciLB
## [1] 0.2728081
##
## $ciUB ## [1] 0.3306611 ## ##$n
## [1] 13910

No MS Access driver, no hassle.

Let’s try something more complex. 12th grade female respondents in 2009 Alabama whose partners used a condom, during the last time they had sex:

getProportionSingleVariable("qn65", "AL", 2009, c("sex == 1", "grade == 4"))
## $prop ## [1] 0.3834514 ## ##$ciLB
## [1] 0.2787636
##
## $ciUB ## [1] 0.5001886 ## ##$n
## [1] 102

So that’s 38.3% (which means over 60% didn’t use a condom, Lord), and you can verify that in the Youth Online Data Analysis Tool.

Now, you’re probably thinking “where does he know that the qn65 variable is about condom use”. You can view the list of variable plus their labels and summaries like this:

yrbss_questions_binary %>% head(20)
## # A tibble: 20 x 3
##    variable   label                   summary
##    <chr>      <chr>                   <chr>
##  1 sitecode   sitecode                <NA>
##  2 sitename   sitename                <NA>
##  3 sitetype   sitetype                <NA>
##  4 sitetypen~ sitetypenum             <NA>
##  5 year       year                    <NA>
##  6 survyear   survyear                <NA>
##  7 weight     weight                  <NA>
##  8 stratum    stratum                 <NA>
##  9 PSU        PSU                     <NA>
## 10 record     record                  <NA>
## 11 age        How old are you?        <NA>
## 12 sex        What is your sex?       <NA>
## 14 race4      race4                   <NA>
## 15 race7      race7                   <NA>
## 16 stheight   How tall are you witho~ <NA>
## 17 stweight   How much do you weigh ~ <NA>
## 18 bmi        bmi                     <NA>
## 19 bmipct     bmipct                  <NA>
## 20 qnobese    qnobese                 Percentage of students who were obe~

You can also open this as a CSV file in MS Excel if you prefer, because each variable’s summary is a very long string (“yrbss_questions_binary.csv”). To see appropriate state abbreviations use the getListOfParticipatingStates function. But to know that the value 4 means “12th grade” for the grade variable, you’ll have to RTFM. You can get this 80 pages (!) long manual as a PDF in the library’s directory (“2015_yrbs_sadc_documentation.pdf”) or at the YRBSS site.

# Finally, Some Graphs

Using some purrr and some ggplot2 we can finally get some neat visualizations.

tibble(.variable = rep("qn65", 26),
.location = rep("US", 26),
.year = rep(seq(1991, 2015, 2), 2),
.filters = rep(c("sex == 1", "sex == 2"), each = 13)) %>%
bind_cols(pmap_df(., getProportionSingleVariable)) %>%
ggplot(aes(.year, prop, color = .filters)) +
geom_line() +
scale_color_manual(labels = c("Female", "Male"), values = c("red", "blue")) +
labs(title = "YRBSS: Proportion of Male and Female US Youth Reporting Use of Condom",
x = "Survey Year", y = "Proportion Reporting Use of Condom",
color = "Gender")

Nice. We see two interesting effects here: (a) girls consistently report ~15% less chance of condom use than boys (either they tend to tell the truth more, or they just don’t know - I’ll let you ponder), and (b) there has been an increase in condom usage rate since 1991 until a peak in 2003-2005, and a slow but alarming decrease since then.

Let’s do the same trend graph with marijuana usage and White vs. African-American students:

tibble(.variable = rep("qn47", 26),
.location = rep("US", 26),
.year = rep(seq(1991, 2015, 2), 2),
.filters = rep(c("race7 == 3", "race7 == 6"), each = 13)) %>%
bind_cols(pmap_df(., getProportionSingleVariable)) %>%
ggplot(aes(.year, prop, color = .filters)) +
geom_line() +
scale_color_manual(labels = c("White", "Black"), values = c("red", "blue")) +
labs(title = "YRBSS: Proportion of White and Black US Youth Reporting Use of Marijuana",
x = "Survey Year", y = "Proportion Reporting Use of Marijuana",
color = "Race")

White kids seem to be smoking marijuana slightly more than African-American kids. And in general there was a peak during 1997 (over 50% of White Amrican kids reported smoking pot!), a decrease for African-American kids since then, and a decrease then an increase for White kids.

Now, I’d love to do the same trend graphs with “Gays vs. Straights”, unfortunately:

yrbss_data_binary$variables %>% count(year, sexid2) ## # A tibble: 16 x 3 ## year sexid2 n ## <dbl> <dbl> <int> ## 1 1991 NA 12272 ## 2 1993 NA 16296 ## 3 1995 NA 10904 ## 4 1997 NA 16262 ## 5 1999 NA 15349 ## 6 2001 NA 13601 ## 7 2003 NA 15214 ## 8 2005 NA 13917 ## 9 2007 NA 14041 ## 10 2009 NA 16410 ## 11 2011 NA 15425 ## 12 2013 NA 13583 ## 13 2015 1 12954 ## 14 2015 2 1246 ## 15 2015 3 503 ## 16 2015 NA 921 Which means the sexid2 variable (1 for Heterosexual, 2 for LGBT, 3 for Unsure) was entered into the YRBSS survey only in 2015. So let’s look at 2015 only, the top 10 categories in which LGBT and Heterosexul US Youth show the largest differences: allQNVars <- yrbss_questions_binary %>% select(variable) %>% filter(stringr::str_detect(variable, "^qn")) %>% unlist() gaysStraightTable <- tibble(.variable = rep(allQNVars, 2), .location = rep("US", 2 * length(allQNVars)), .year = rep(2015, 2 * length(allQNVars)), .filters = rep(c("sexid2 == 1", "sexid2 == 2"), each = length(allQNVars))) %>% bind_cols(pmap_df(., getProportionSingleVariable)) diffLabels <- yrbss_questions_binary %>% filter(variable %in% allQNVars) %>% select(label) %>% unlist() gayStraightDiffs <- tibble(label = diffLabels, straight = gaysStraightTable$prop[1:length(allQNVars)],
gay = gaysStraightTable$prop[ (length(allQNVars) + 1):(2 * length(allQNVars))]) %>% na.omit() %>% mutate(diff = gay - straight) %>% arrange(-diff) %>% slice(c(1:10, 121:130)) %>% arrange(diff) gayStraightDiffs$label <- factor(gayStraightDiffs$label, levels = as.character(gayStraightDiffs$label))

library(grid)
library(gridExtra)

p <- ggplot(gayStraightDiffs, aes(x = label, y = diff)) +
geom_bar(stat = "identity", fill = "red", alpha = 0.5) +
coord_flip() +
labs(title = NULL,
y = "Difference: LGBT - Heterosexual", x = "")

title.grob <- textGrob(
label = "YRBSS: Top Gaps Between LGBT and Heterosexual US Youth in 2015",
x = unit(0.5, "lines"),
y = unit(0, "lines"),
hjust = 0, vjust = 0,
gp = gpar(fontsize = 14))

p1 <- arrangeGrob(p, top = title.grob)
grid.arrange(p1)

LGBT youth reports feeling sadder, considering and attempting suicide, using cigarettes and marijuana, and being bullied - more than Heterosexual youth. They tend to get less exercise, eat breakfast and drink water less regularly and use a condom less regularly. These results are sad but not surprising.

Next, we can’t wrap up this post without a map, can we? Let’s see a map of LGBT youth who were bullied at school across the US:

states <- getListOfParticipatingStates()

lgbtBulliedAtSchool <- tibble(.variable = rep("qn24", length(states)),
.location = states,
.year = rep(2015, length(states)),
.filters = "sexid2 == 2") %>%
bind_cols(pmap_df(., getProportionSingleVariable)) %>%
na.omit()

allStatesData <- tibble(region = state.name, region2 = state.abb) %>%
left_join(lgbtBulliedAtSchool, c("region2" = ".location")) %>%
select(region, region2, prop) %>%
mutate(percent = prop, region = tolower(region)) %>%
na.omit()

states_map_data = map_data("state")

ggplot() +
geom_map(data = states_map_data, map = states_map_data,
aes(x = long, y = lat, map_id = region),
color = "white", size = 0.1, fill = "black") +
geom_map(data = allStatesData, map = states_map_data,
aes(fill = percent, map_id = region),
color = "white", size = 0.1) +
coord_map() +
labs(x = "", y = "", fill = "",
title = "YRBSS: Proportion of LGBT Youth Reporting Being Bullied at School in 2015") +
theme_classic() +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
title = element_text(size = 10)) +
scale_fill_gradient(labels = scales::percent_format(accuracy = 1))

As you can see this map could have been a lot more informative if we had data for all states. Still it’s pretty amazing that the rate for California - 1 in 4 kids! - is considered “low”. In Wyoming it is 1 in 2.

Let’s do the map chart for a question which would give us more states with a sample size large enough. The proportion of respondents reporting to have had sex:

hadSex <- tibble(.variable = rep("qn60", length(states)),
.location = states,
.year = rep(2015, length(states))) %>%
bind_cols(pmap_df(., getProportionSingleVariable)) %>%
na.omit()

allStatesData <- tibble(region = state.name, region2 = state.abb) %>%
select(region, region2, prop) %>%
mutate(percent = prop, region = tolower(region)) %>%
na.omit()

ggplot() +
geom_map(data = states_map_data, map = states_map_data,
aes(x = long, y = lat, map_id = region),
color = "white", size = 0.1, fill = "black") +
geom_map(data = allStatesData, map = states_map_data,
aes(fill = percent, map_id = region),
color = "white", size = 0.1) +
coord_map() +
labs(x = "", y = "", fill = "",
title = "YRBSS: Proportion of Youth Who Claim They Had Sex in 2015") +
theme_classic() +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
title = element_text(size = 10)) +
scale_fill_gradient(labels = scales::percent_format(accuracy = 1))

hadSex %>%
arrange(-prop) %>%
slice(c(1:5, 26:30))
## # A tibble: 10 x 7
##    .variable .location .year  prop  ciLB  ciUB     n
##    <chr>     <chr>     <dbl> <dbl> <dbl> <dbl> <int>
##  1 qn60      MS         2015 0.480 0.437 0.523  1671
##  2 qn60      DE         2015 0.468 0.420 0.517  2521
##  3 qn60      WV         2015 0.467 0.415 0.520  1454
##  4 qn60      AL         2015 0.463 0.408 0.520  1313
##  5 qn60      AR         2015 0.460 0.390 0.531  2282
##  6 qn60      CT         2015 0.330 0.290 0.372  2015
##  7 qn60      NE         2015 0.325 0.281 0.372  1365
##  8 qn60      MD         2015 0.324 0.313 0.334 46251
##  9 qn60      CA         2015 0.323 0.273 0.378  1828
## 10 qn60      NY         2015 0.304 0.263 0.349  8508

48% of kids in Mississippi claim they had sex. Delaware, West Virginia, Alabama and Arizona not far behind with 46%. The least “sexually active” kids are in New York - 30%. Then California, Maryland, Nebraska and Connecticut with 32%.

Lastly, let’s correlate the proportion of kids saying they had sex, and the proportion of kids saying they drink a can, bottle, or glass of soda or pop three or more times per day:

soda <- tibble(.variable = rep("qnsoda3", length(states)),
.location = states,
.year = rep(2015, length(states))) %>%
bind_cols(pmap_df(., getProportionSingleVariable)) %>%
na.omit()

cor.test(sexAndSoda$prop.x, sexAndSoda$prop.y)
##
##  Pearson's product-moment correlation
##
## data:  sexAndSoda$prop.x and sexAndSoda$prop.y
## t = 4.9568, df = 27, p-value = 3.42e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4333785 0.8433925
## sample estimates:
##       cor
## 0.6902482
ggplot(sexAndSoda, aes(prop.x, prop.y, label = .location)) +
geom_point() +
geom_smooth(method = "loess") +
geom_point(data = sexAndSoda %>% filter(.location == 'MS'), colour="red", size=5) +
geom_text(hjust = 0.5, vjust = -0.6) +
labs(title = "YRBSS: Proportion of US Youth Drinking Soda 3+ Times a Day vs. Having Sex",
x = "P(Having Sex)", y = "P(Soda)")

There is some correlation between drinking too much soda and having sex at an early age. And Go Delaware!

# Can I Get an Amen Up in Here?

My first R package! Hope I was able to show what can be done with it. Go ahead, install it, use it and tell me about all the places it breaks! I hope it won’t though. And if any LGBT youth is listening out there - be safe, do well at school, get some exercise. I promise you, it gets better.