COVID death proportions by race and education

Here I analyze data released by the CDC on how deaths due to COVID-19 differ across racial categories and levels of educational attainment. The data covers deaths from Jan 1, 2020 through Jan 30, 2021 (data as of Feb 1, 2021). It is open and can be downloaded from: https://catalog.data.gov/dataset/ah-provisional-covid-19-deaths-by-race-and-educational-attainment.

I also have a version of the data in the GitHub repo for this project along with the code found here: https://github.com/MiningMyBusiness/covid_race_and_education

First, let’s load in the data and clean up the column names a bit.

# load the tidyverse (readr, dplyr, stringr and ggplot2 are used below)
library(tidyverse)
# read in covid data
covid_data <- read_csv('AH_Provisional_COVID-19_Deaths_by_Race_and_Educational_Attainment.csv')

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `Data as of` = col_character(),
##   `Start Date` = col_character(),
##   `End Date` = col_character(),
##   `Education Level` = col_character(),
##   `Race or Hispanic Origin` = col_character(),
##   `COVID-19 Deaths` = col_double(),
##   `Total Deaths` = col_double()
## )

# change column names so they don't have messy characters
names(covid_data) <- str_replace_all(names(covid_data), c(" " = "." , "-" = "" ))
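
To make the effect of the renaming concrete, here is a small sketch that applies the same two substitutions to the original column names from the column specification above. It uses base R's `gsub()` with `fixed = TRUE` instead of `str_replace_all()` so that it runs without any packages; the variable names `raw_names` and `clean_names` are just for illustration.

```r
# Original column names, copied from the column specification above
raw_names <- c("Data as of", "Start Date", "End Date", "Education Level",
               "Race or Hispanic Origin", "COVID-19 Deaths", "Total Deaths")

# Same two substitutions as the str_replace_all() call:
# spaces become dots, hyphens are dropped (fixed = TRUE means no regex)
clean_names <- gsub(" ", ".", raw_names, fixed = TRUE)
clean_names <- gsub("-", "", clean_names, fixed = TRUE)

print(clean_names)
```

The result should match the column headers shown in the table below, for example `COVID19.Deaths` and `Race.or.Hispanic.Origin`.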

Examine the data

If we take a look at the table, we can see that the data is provided for COVID deaths and total deaths over the period of time for each racial category and educational attainment.

head(covid_data, 8)

Data.as.of	Start.Date	End.Date	Education.Level	Race.or.Hispanic.Origin	COVID19.Deaths	Total.Deaths
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Hispanic	29157	106285
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Non-Hispanic American Indian or Alaska Native	706	3085
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Non-Hispanic Asian	2610	16283
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Non-Hispanic Black	5699	41437
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Non-Hispanic More than one race	103	1676
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Non-Hispanic Native Hawaiian or Other Pacific Islander	87	484
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Non-Hispanic White	18871	157236
02/01/2021	01/01/2020	01/30/2021	8th grade or less	Unknown	39	380

The raw numbers of total deaths and total COVID deaths are not appropriate measures to compare since these racial categories compose dramatically different portions of the population. A more appropriate value to compare might be the proportion of COVID deaths out of the total deaths.
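
As a quick worked example of this measure, the sketch below computes the proportion for three of the rows printed above. The counts are copied from that table; the data frame name `example_rows` is made up for this illustration.

```r
# Three rows copied from the head() output above (8th grade or less)
example_rows <- data.frame(
  race = c("Hispanic", "Non-Hispanic Black", "Non-Hispanic White"),
  covid_deaths = c(29157, 5699, 18871),
  total_deaths = c(106285, 41437, 157236)
)

# Proportion of all deaths in each group that were attributed to COVID-19
example_rows$covid_death_prop <-
  example_rows$covid_deaths / example_rows$total_deaths

print(example_rows)
```

In these rows, roughly 27% of Hispanic deaths were due to COVID, compared with about 12% of Non-Hispanic White deaths, even though the Non-Hispanic White group has more total deaths.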

In a perfectly fair world, all people of different racial categories and educational attainment would be equally likely to contract COVID-19, equally likely to die of COVID-19, and also equally likely to die of any other cause. Therefore, in this perfectly fair world, the proportion of COVID-19 deaths would be the same across all categories.

Let’s start with this perfect-fair-world assumption.
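
One way to make this assumption checkable is to treat it as a null hypothesis of equal proportions. The sketch below is not part of the original analysis: it runs base R's `prop.test()` on the 8th-grade-or-less rows printed earlier, and it assumes that deaths can be treated as independent draws, which is a simplification.

```r
# COVID deaths and total deaths for the "8th grade or less" rows shown earlier
# (the "Unknown" race row is left out of this illustration)
covid_deaths <- c(29157, 706, 2610, 5699, 103, 87, 18871)
total_deaths <- c(106285, 3085, 16283, 41437, 1676, 484, 157236)

# Null hypothesis: the proportion of deaths due to COVID is the same in
# every group, which is what the perfect-fair-world scenario predicts
fair_world_test <- prop.test(x = covid_deaths, n = total_deaths)

print(fair_world_test$p.value)
```

A very small p-value would indicate that these rows are hard to reconcile with equal proportions. With counts this large, even modest differences will register, so the size of the differences matters more than the p-value itself.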

COVID deaths grouped by education

Let’s see how the proportion of COVID deaths differs across education levels when we aggregate across all racial categories.

# group by education level and get total deaths in the education level
by_education <- group_by(covid_data, Education.Level)
edu_sum <- summarize(by_education,
                    count=n(), 
                    tot_total_death=sum(Total.Deaths, na.rm=TRUE), 
                    tot_covid_death=sum(COVID19.Deaths, na.rm=TRUE))

# add columns to summary table
edu_sum <- mutate(edu_sum, 
                 covid_death_prop=tot_covid_death/tot_total_death)
ggplot(data=edu_sum) +
  geom_col(mapping=aes(x=reorder(Education.Level, covid_death_prop),
                       y=covid_death_prop)) +
  geom_hline(data = edu_sum, 
             mapping = aes(yintercept = mean(covid_death_prop)),
             alpha=0.5, linetype='dashed') +
  coord_flip() +
  labs(y="Proportion of deaths due to COVID", x="Education level",
       title="COVID deaths grouped by educational attainment")

Already we start to see deviation from the perfect-fair-world scenario. Note that the proportion of COVID deaths is far higher for the 8th grade or less and Unknown education levels. There still seems to be some effect for educational attainment below a high-school level but these results are mixed. The dashed line is the unweighted mean of the per-level proportions of deaths due to COVID across all educational attainments.
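
One detail worth spelling out is that `mean(covid_death_prop)` is an unweighted mean of the group proportions, which is generally not the same as the pooled proportion (total COVID deaths over total deaths). The sketch below illustrates the difference using the eight "8th grade or less" rows printed earlier, since the full education summary is not reproduced here; the variable names are illustrative.

```r
# Rows from the head() output: one row per racial category, 8th grade or less
covid_deaths <- c(29157, 706, 2610, 5699, 103, 87, 18871, 39)
total_deaths <- c(106285, 3085, 16283, 41437, 1676, 484, 157236, 380)

# Unweighted mean: every group counts equally, whatever its size
unweighted_mean <- mean(covid_deaths / total_deaths)

# Pooled proportion: large groups dominate. Summing counts before dividing
# mirrors how edu_sum pools the rows within one education level.
pooled_prop <- sum(covid_deaths) / sum(total_deaths)

print(c(unweighted = unweighted_mean, pooled = pooled_prop))
```

For these rows the pooled proportion is higher than the unweighted mean, because the largest group (Hispanic) also has the highest proportion.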

COVID deaths grouped by race

Let’s now check the proportion of COVID deaths for different racial categories aggregated across all levels of education.

# group data by race
by_race <- group_by(covid_data, Race.or.Hispanic.Origin)
race_sum <- summarize(by_race,
                    count=n(), 
                    tot_total_death=sum(Total.Deaths, na.rm=TRUE), 
                    tot_covid_death=sum(COVID19.Deaths, na.rm=TRUE))

# add columns to summary table
race_sum <- mutate(race_sum, 
                 covid_death_prop=tot_covid_death/tot_total_death)
ggplot(data=race_sum) + 
  geom_col(mapping=aes(x=reorder(Race.or.Hispanic.Origin, covid_death_prop), 
                       y=covid_death_prop)) + 
  geom_hline(data = race_sum, 
             mapping = aes(yintercept = mean(covid_death_prop)),
             alpha=0.5, linetype='dashed') +
  coord_flip() +
  labs(y="Proportion of deaths due to COVID", x="Race", 
       title="COVID deaths grouped by race")

By race, the disparities in the proportion of COVID-19 deaths are even more pronounced. The proportion of COVID deaths for Hispanic people is more than twice that of Non-Hispanic White people. The mean proportion of COVID deaths across all racial categories is the dashed line. The Non-Hispanic American Indian or Alaska Native and Hispanic categories fall far above this line.

But if we look within each racial category, do we find that higher educational attainment has lower COVID deaths?

COVID deaths by race and education

Here is the proportion of deaths due to COVID for Hispanic people, broken down by educational attainment.

covid_data <- mutate(covid_data, 
                     covid_death_prop=COVID19.Deaths/Total.Deaths)
covid_data$Education.Level <- str_wrap(covid_data$Education.Level, 
                                       width = 15)
covid_data$Race.or.Hispanic.Origin <- str_wrap(covid_data$Race.or.Hispanic.Origin,
                                               width = 15)

ggplot(data=filter(covid_data, Race.or.Hispanic.Origin == 'Hispanic')) +
    geom_col(mapping=aes(
      x=factor(Education.Level,
               levels = c("Unknown",
                          "8th grade or\nless",
                          "9 -12th grade,\nwith no diploma",
                          "High school\ngraduate or GED\ncompleted",
                          "Some college\ncredit, but no\ndegree",
                          "Associate\ndegree",
                          "Bachelor’s\ndegree" ,
                          "Master’s degree",
                          "Doctorate or\nProfessional\nDegree")), 
      y=covid_death_prop)) +
    geom_hline(data = filter(covid_data, 
                             Race.or.Hispanic.Origin == 'Hispanic'), 
             mapping = aes(yintercept = mean(covid_death_prop)),
             alpha=0.5, linetype='dashed') +
    coord_flip() +
    labs(x="Education level", y="Proportion of deaths due to Covid",
         title='COVID deaths grouped by education for Hispanic people')

I purposely ordered the educational attainment by degree to make it easier to see an effect. And we do see an effect of education on the proportion of total deaths due to COVID.
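
The ordering comes from the `levels` argument of `factor()`: bars are drawn in level order rather than alphabetically. As a sketch, the level vector could be defined once and reused across plots; the name `edu_levels` is hypothetical, and the strings must match the wrapped labels exactly.

```r
# Education levels from lowest to highest, with "Unknown" first;
# the line breaks match what str_wrap(width = 15) produced above
edu_levels <- c("Unknown",
                "8th grade or\nless",
                "9 -12th grade,\nwith no diploma",
                "High school\ngraduate or GED\ncompleted",
                "Some college\ncredit, but no\ndegree",
                "Associate\ndegree",
                "Bachelor’s\ndegree",
                "Master’s degree",
                "Doctorate or\nProfessional\nDegree")

# A few labels in arbitrary order, as they might appear in the data
labels <- c("Associate\ndegree", "Master’s degree", "8th grade or\nless")

# Converting to a factor with explicit levels fixes the plotting order
ordered_labels <- factor(labels, levels = edu_levels)
print(sort(ordered_labels))
```

Each plot could then use `x = factor(Education.Level, levels = edu_levels)`. Any label that does not match a level exactly becomes `NA`, so the strings need to be copied carefully.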

However, we see a different trend when we inspect the same data for Non-Hispanic American Indian or Alaska Native people.

ggplot(data=filter(covid_data, 
                   Race.or.Hispanic.Origin == 'Non-Hispanic\nAmerican Indian\nor Alaska\nNative')) +
    geom_col(mapping=aes(
      x=factor(Education.Level,
               levels = c("Unknown",
                          "8th grade or\nless",
                          "9 -12th grade,\nwith no diploma",
                          "High school\ngraduate or GED\ncompleted",
                          "Some college\ncredit, but no\ndegree",
                          "Associate\ndegree",
                          "Bachelor’s\ndegree" ,
                          "Master’s degree",
                          "Doctorate or\nProfessional\nDegree")), 
      y=covid_death_prop)) +
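    # draw the mean line for the same group that is plotted in the bars
    geom_hline(data = filter(covid_data, 
                             Race.or.Hispanic.Origin == 'Non-Hispanic\nAmerican Indian\nor Alaska\nNative'), 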
             mapping = aes(yintercept = mean(covid_death_prop)),
             alpha=0.5, linetype='dashed') +
    coord_flip() +
    labs(x="Education level", y="Proportion of deaths due to Covid",
         title='COVID deaths grouped by education for Non-Hispanic\nAmerican Indian or Alaska Native people')

There seems to be no clear trend or effect of education on the proportion of COVID deaths for this group. However, these proportion estimates might be influenced by the comparatively small number of deaths among Non-Hispanic American Indian or Alaska Native people.

am_edu <- filter(covid_data, Race.or.Hispanic.Origin == 'Non-Hispanic\nAmerican Indian\nor Alaska\nNative')
head(am_edu[c('Education.Level','Race.or.Hispanic.Origin','COVID19.Deaths', 'Total.Deaths')], 8)

## # A tibble: 8 x 4
##   Education.Level         Race.or.Hispanic.Origin    COVID19.Deaths Total.Deaths
##   <chr>                   <chr>                               <dbl>        <dbl>
## 1 "8th grade or\nless"    "Non-Hispanic\nAmerican I…            706         3085
## 2 "9 -12th grade,\nwith … "Non-Hispanic\nAmerican I…            662         3915
## 3 "Associate\ndegree"     "Non-Hispanic\nAmerican I…            447         1949
## 4 "Bachelor’s\ndegree"    "Non-Hispanic\nAmerican I…            282         1276
## 5 "Doctorate or\nProfess… "Non-Hispanic\nAmerican I…             12           97
## 6 "High school\ngraduate… "Non-Hispanic\nAmerican I…           1671         9556
## 7 "Master’s degree"       "Non-Hispanic\nAmerican I…             90          448
## 8 "Some college\ncredit,… "Non-Hispanic\nAmerican I…            765         3889
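
To get a rough sense of how much these small counts could affect the estimates, one option is to attach a confidence interval to each proportion. The sketch below is an addition to the original analysis: it uses base R's `binom.test()` (an exact binomial interval) on the counts from the table above, with shortened education labels, and it assumes each death can be treated as an independent draw, which is a simplification.

```r
# Counts copied from the tibble above, reordered by educational attainment
am_counts <- data.frame(
  education = c("8th grade or less", "9-12th grade, no diploma",
                "High school graduate or GED", "Some college, no degree",
                "Associate degree", "Bachelor's degree",
                "Master's degree", "Doctorate or Professional Degree"),
  covid_deaths = c(706, 662, 1671, 765, 447, 282, 90, 12),
  total_deaths = c(3085, 3915, 9556, 3889, 1949, 1276, 448, 97)
)

# Exact 95% binomial interval for each row (one row of `ci` per group)
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               am_counts$covid_deaths, am_counts$total_deaths))

am_counts$prop  <- am_counts$covid_deaths / am_counts$total_deaths
am_counts$lower <- ci[, 1]
am_counts$upper <- ci[, 2]
am_counts$width <- am_counts$upper - am_counts$lower

print(am_counts[, c("education", "prop", "lower", "upper", "width")])
```

The interval for the doctorate row, which is based on only 97 deaths, is the widest, so an apparent dip or spike there should be read with caution.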

Let’s look across all races and see if educational attainment has some influence on the proportion of COVID deaths.

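# mean COVID death proportion within each racial category
df2 <- covid_data %>%
  group_by(Race.or.Hispanic.Origin) %>%
  summarise(mean_covid_death_prop = mean(covid_death_prop, na.rm = TRUE))

# mean COVID death proportion across all race/education rows
ov_prop <- mean(covid_data$covid_death_prop, na.rm = TRUE)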
ggplot(data=covid_data) + geom_point(mapping=aes(
  x=factor(Education.Level,
           levels = c("Unknown",
                      "8th grade or\nless",
                      "9 -12th grade,\nwith no diploma",
                      "High school\ngraduate or GED\ncompleted",
                      "Some college\ncredit, but no\ndegree",
                      "Associate\ndegree",
                      "Bachelor’s\ndegree" ,
                      "Master’s degree",
                      "Doctorate or\nProfessional\nDegree")), 
  y=covid_death_prop)) + 
  geom_hline(yintercept=ov_prop, color='red', alpha=0.5,
             linetype="dashed") +
  geom_hline(data = df2, 
             mapping = aes(yintercept = mean_covid_death_prop),
             alpha=0.5) + 
  coord_flip() +
  facet_wrap(~ Race.or.Hispanic.Origin, nrow=2) + 
  labs(x="Education level", y="Proportion of deaths due to Covid",
       title="Covid deaths by race and education")

In the above figure, each panel is a single race and each point is the proportion of COVID deaths for a specific educational attainment within that race. The red dashed line is in the same position in all the panels; it is the mean COVID death proportion across all racial and educational categories. The black line differs from panel to panel; it is the mean COVID death proportion for that race across all education categories.

In the perfect-fair-world scenario, the red dashed line and the black line would fall on top of each other in each panel…

The black line also serves another purpose visually. It can inform us whether increasing educational attainment actually reduces the proportion of COVID deaths within a racial category. For Hispanic, Non-Hispanic Asian, and Non-Hispanic White people, higher educational attainment does seem to have lower proportions of COVID death. However, this relationship does not seem to be consistent for all races.
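
A rough way to quantify "higher attainment goes with a lower proportion" within one group is a rank correlation between education level and the COVID death proportion. The sketch below is an addition, not part of the original analysis. It uses base R's `cor()` with `method = "spearman"` on the Non-Hispanic American Indian or Alaska Native counts printed earlier, leaves out the Unknown level, and treats the education levels as an ordered scale from 1 to 8, which is an assumption.

```r
# Counts from the tibble printed earlier, ordered from lowest to highest attainment
covid_deaths <- c(706, 662, 1671, 765, 447, 282, 90, 12)
total_deaths <- c(3085, 3915, 9556, 3889, 1949, 1276, 448, 97)

edu_rank <- seq_along(covid_deaths)  # 1 = 8th grade or less, 8 = doctorate
covid_death_prop <- covid_deaths / total_deaths

# Spearman correlation: near -1 would mean the proportion falls steadily
# as education rises; near 0 means no monotonic trend
rho <- cor(edu_rank, covid_death_prop, method = "spearman")
print(rho)
```

For these rows the correlation comes out only weakly negative, which fits the lack of a clear trend noted for this group. With only eight points, it is a descriptive summary rather than a test.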

How can COVID death proportions be different?

Clearly, COVID death proportions differ across races and across levels of educational attainment. However, higher educational attainment does not necessarily correspond to lower COVID death proportions within each race. Education does not seem to be an equalizer in this case. In some ways, the effect of education on the proportion of COVID deaths seems to be amplified or dampened by race.

The proportion of COVID deaths could be higher for one group than for another when that group

1. is more likely to get COVID.
2. is more likely to die from COVID.
3. is less likely to die of other reasons.

I doubt the feasibility of option 3. However, options 1 and 2 seem sensible. I don’t find it hard to believe that individuals of different races and educational attainment differ in how likely they are to get COVID or to die from COVID.
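
The three options listed above can be tied to the measured quantity with a little arithmetic. In the sketch below, all rates are hypothetical numbers chosen for illustration and `covid_prop()` is a made-up helper: if a group's chance of getting COVID is `p_infect`, the chance of dying once infected is `p_die`, and the chance of dying from anything else is `p_other`, then the proportion of deaths due to COVID is roughly `p_infect * p_die / (p_infect * p_die + p_other)`.

```r
# Hypothetical helper: share of all deaths that are COVID deaths
covid_prop <- function(p_infect, p_die, p_other) {
  covid <- p_infect * p_die
  covid / (covid + p_other)
}

# Baseline group (made-up rates, not estimates from the data)
baseline <- covid_prop(p_infect = 0.10, p_die = 0.02, p_other = 0.010)

# Each scenario changes one rate, matching the three options above
more_infection <- covid_prop(p_infect = 0.20, p_die = 0.02, p_other = 0.010)
more_fatal     <- covid_prop(p_infect = 0.10, p_die = 0.04, p_other = 0.010)
fewer_other    <- covid_prop(p_infect = 0.10, p_die = 0.02, p_other = 0.005)

print(c(baseline = baseline, more_infection = more_infection,
        more_fatal = more_fatal, fewer_other = fewer_other))
```

All three changes raise the proportion above the baseline, which is why the proportion alone cannot tell the mechanisms apart.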

Some scenarios that might make one group more likely to get COVID are

genetic susceptibility to contracting COVID
inability to socially distance
lack of desire to socially distance

Some scenarios that might make one group more likely to die from COVID are

genetic susceptibility to dying from COVID.
lack of appropriate medical attention.

The chain of causation seems fairly clear to me. But I will leave it to the reader to draw their own conclusions about why COVID is not an equal-opportunity pandemic.