Here I will do some analysis on data released by the CDC about how total deaths due to COVID-19 differ between people of different racial categories and educational attainment. The data covers deaths between Jan 1, 2020 and Jan 30, 2021, as reported on Feb 1, 2021. It is open and can be downloaded from: https://catalog.data.gov/dataset/ah-provisional-covid-19-deaths-by-race-and-educational-attainment.
I also have a version of the data in the GitHub repo for this project along with the code found here: https://github.com/MiningMyBusiness/covid_race_and_education
First, let’s load in the data and clean up the column names a bit.
# load the tidyverse (readr, stringr, dplyr, ggplot2)
library(tidyverse)
# read in covid data
covid_data <- read_csv('AH_Provisional_COVID-19_Deaths_by_Race_and_Educational_Attainment.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## `Data as of` = col_character(),
## `Start Date` = col_character(),
## `End Date` = col_character(),
## `Education Level` = col_character(),
## `Race or Hispanic Origin` = col_character(),
## `COVID-19 Deaths` = col_double(),
## `Total Deaths` = col_double()
## )
# change column names so they don't have messy characters
names(covid_data) <- str_replace_all(names(covid_data), c(" " = "." , "-" = "" ))
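As a quick check on the renaming step, here is a minimal sketch that applies the same replacement rules to the raw column names listed in the column specification above. The `raw_names` vector is typed in by hand for illustration and is not part of the original analysis.

```r
library(stringr)

# Raw column names as reported by read_csv (copied by hand for illustration)
raw_names <- c("Data as of", "Start Date", "End Date", "Education Level",
               "Race or Hispanic Origin", "COVID-19 Deaths", "Total Deaths")

# Named vector: each name is a pattern, each value is its replacement.
# Spaces become dots and hyphens are dropped.
clean_names <- str_replace_all(raw_names, c(" " = ".", "-" = ""))
print(clean_names)
```

The cleaned names can then be used in R code without backticks, for example `COVID19.Deaths` instead of `` `COVID-19 Deaths` ``.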
If we take a look at the table, we can see that the data is provided for COVID deaths and total deaths over the period of time for each racial category and educational attainment.
head(covid_data, 8)
Data.as.of | Start.Date | End.Date | Education.Level | Race.or.Hispanic.Origin | COVID19.Deaths | Total.Deaths |
---|---|---|---|---|---|---|
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Hispanic | 29157 | 106285 |
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Non-Hispanic American Indian or Alaska Native | 706 | 3085 |
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Non-Hispanic Asian | 2610 | 16283 |
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Non-Hispanic Black | 5699 | 41437 |
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Non-Hispanic More than one race | 103 | 1676 |
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Non-Hispanic Native Hawaiian or Other Pacific Islander | 87 | 484 |
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Non-Hispanic White | 18871 | 157236 |
02/01/2021 | 01/01/2020 | 01/30/2021 | 8th grade or less | Unknown | 39 | 380 |
The raw numbers of total deaths and total COVID deaths are not appropriate measures to compare since these racial categories compose dramatically different portions of the population. A more appropriate value to compare might be the proportion of COVID deaths out of the total deaths.
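To make the measure concrete, here is a small sketch that computes the proportion for two rows of the table above (both from the 8th grade or less group). The `deaths` data frame is built by hand for illustration only.

```r
# COVID and total deaths for two rows of the "8th grade or less" group,
# copied from the table above
deaths <- data.frame(
  race  = c("Hispanic", "Non-Hispanic White"),
  covid = c(29157, 18871),
  total = c(106285, 157236)
)

# Proportion of all deaths in each row that were attributed to COVID-19
deaths$covid_death_prop <- deaths$covid / deaths$total
print(deaths)
```

For these two rows the proportions come out to roughly 0.27 and 0.12, even though the raw counts alone would not make that gap obvious.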
In a perfectly fair world, all people of different racial categories and educational attainment would be equally likely to contract COVID-19, equally likely to die of COVID-19, and also equally likely to die of any other cause. Therefore, in this perfectly fair world, the proportion of COVID-19 deaths would be the same across all categories.
Let’s start with this perfect-fair-world assumption.
Let’s see how the proportion of COVID deaths differs between education levels when we aggregate across all racial categories.
# group by education level and get total deaths in the education level
by_education <- group_by(covid_data, Education.Level)
edu_sum <- summarize(by_education,
count=n(),
tot_total_death=sum(Total.Deaths, na.rm=TRUE),
tot_covid_death=sum(COVID19.Deaths, na.rm=TRUE))
# add columns to summary table
edu_sum <- mutate(edu_sum,
covid_death_prop=tot_covid_death/tot_total_death)
ggplot(data=edu_sum) +
geom_col(mapping=aes(x=reorder(Education.Level, covid_death_prop),
y=covid_death_prop)) +
geom_hline(data = edu_sum,
mapping = aes(yintercept = mean(covid_death_prop)),
alpha=0.5, linetype='dashed') +
coord_flip() +
labs(y="Proportion of deaths due to COVID", x="Education level",
title="COVID deaths grouped by educational attainment")
Already we start to see deviation from the perfect-fair-world scenario. Note that the proportion of COVID deaths is far higher for the 8th grade or less and Unknown education levels. There still seems to be some effect for educational attainment below a high-school level, but these results are mixed. The dashed line is the mean proportion of deaths due to COVID across all educational attainments.
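One detail about the dashed line is worth spelling out: `mean(covid_death_prop)` is an unweighted mean of the group proportions, which is generally not the same as the pooled proportion (all COVID deaths divided by all deaths). The sketch below illustrates the difference. It uses the eight 8th grade or less rows from the table shown earlier as stand-in groups, simply because those counts are printed in this post. The variable names are my own.

```r
# Counts copied from the eight "8th grade or less" rows of the table above
covid <- c(29157, 706, 2610, 5699, 103, 87, 18871, 39)
total <- c(106285, 3085, 16283, 41437, 1676, 484, 157236, 380)

# Unweighted mean: every group counts equally, regardless of its size
unweighted <- mean(covid / total)

# Pooled proportion: large groups dominate because counts are summed first
pooled <- sum(covid) / sum(total)

print(c(unweighted = unweighted, pooled = pooled))
```

Neither reference line is wrong, but they answer slightly different questions, so it helps to keep in mind which one a plot is showing.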
Let’s now check the proportion of COVID deaths for different racial categories aggregated across all levels of education.
# group data by race
by_race <- group_by(covid_data, Race.or.Hispanic.Origin)
race_sum <- summarize(by_race,
count=n(),
tot_total_death=sum(Total.Deaths, na.rm=TRUE),
tot_covid_death=sum(COVID19.Deaths, na.rm=TRUE))
# add columns to summary table
race_sum <- mutate(race_sum,
covid_death_prop=tot_covid_death/tot_total_death)
ggplot(data=race_sum) +
geom_col(mapping=aes(x=reorder(Race.or.Hispanic.Origin, covid_death_prop),
y=covid_death_prop)) +
geom_hline(data = race_sum,
mapping = aes(yintercept = mean(covid_death_prop)),
alpha=0.5, linetype='dashed') +
coord_flip() +
labs(y="Proportion of deaths due to COVID", x="Race",
title="COVID deaths grouped by race")
By race, the disparities in the proportion of COVID-19 deaths are even more graded. The proportion of COVID deaths for Hispanic people is more than twice that of Non-Hispanic White people. The mean proportion of COVID deaths across all racial categories is the dashed line. The Non-Hispanic American Indian or Alaska Native and Hispanic racial categories fall far above this line.
But if we look within each racial category, do we find that higher educational attainment is associated with a lower proportion of COVID deaths?
Here is the proportion of deaths due to COVID for Hispanic people broken down by educational attainment.
covid_data <- mutate(covid_data,
covid_death_prop=COVID19.Deaths/Total.Deaths)
covid_data$Education.Level <- str_wrap(covid_data$Education.Level,
width = 15)
covid_data$Race.or.Hispanic.Origin <- str_wrap(covid_data$Race.or.Hispanic.Origin,
width = 15)
ggplot(data=filter(covid_data, Race.or.Hispanic.Origin == 'Hispanic')) +
geom_col(mapping=aes(
x=factor(Education.Level,
levels = c("Unknown",
"8th grade or\nless",
"9 -12th grade,\nwith no diploma",
"High school\ngraduate or GED\ncompleted",
"Some college\ncredit, but no\ndegree",
"Associate\ndegree",
"Bachelor’s\ndegree" ,
"Master’s degree",
"Doctorate or\nProfessional\nDegree")),
y=covid_death_prop)) +
geom_hline(data = filter(covid_data,
Race.or.Hispanic.Origin == 'Hispanic'),
mapping = aes(yintercept = mean(covid_death_prop)),
alpha=0.5, linetype='dashed') +
coord_flip() +
labs(x="Education level", y="Proportion of deaths due to Covid",
title='COVID deaths grouped by education for Hispanic people')
I purposely ordered the educational attainment by degree to make it easier to see an effect. And we do see an effect of education on the proportion of total deaths due to COVID.
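Since the same ordering is needed for every plot, one option is to define the level vector once and reuse it. The following is a minimal sketch of that idea. The names `edu_levels` and `as_edu_factor` are hypothetical helpers, not part of the original code, and the labels are copied from the plotting call above (including the line breaks added by `str_wrap`).

```r
# Education levels in order of attainment, exactly as they appear
# after str_wrap(width = 15) in the plotting code above
edu_levels <- c("Unknown",
                "8th grade or\nless",
                "9 -12th grade,\nwith no diploma",
                "High school\ngraduate or GED\ncompleted",
                "Some college\ncredit, but no\ndegree",
                "Associate\ndegree",
                "Bachelor’s\ndegree",
                "Master’s degree",
                "Doctorate or\nProfessional\nDegree")

# Hypothetical helper: turn a character vector of education labels
# into a factor ordered by attainment
as_edu_factor <- function(x) factor(x, levels = edu_levels)

# Example: labels given in arbitrary order are mapped to ordered levels
example <- as_edu_factor(c("Associate\ndegree", "Unknown"))
print(as.integer(example))
```

With such a helper, the aesthetic mapping in each plot could be written as `x = as_edu_factor(Education.Level)`, and any change to the ordering would only need to be made in one place.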
However, we see a different trend when we inspect this same data for Non-Hispanic American Indian or Alaska Native people.
ggplot(data=filter(covid_data,
Race.or.Hispanic.Origin == 'Non-Hispanic\nAmerican Indian\nor Alaska\nNative')) +
geom_col(mapping=aes(
x=factor(Education.Level,
levels = c("Unknown",
"8th grade or\nless",
"9 -12th grade,\nwith no diploma",
"High school\ngraduate or GED\ncompleted",
"Some college\ncredit, but no\ndegree",
"Associate\ndegree",
"Bachelor’s\ndegree" ,
"Master’s degree",
"Doctorate or\nProfessional\nDegree")),
y=covid_death_prop)) +
geom_hline(data = filter(covid_data,
Race.or.Hispanic.Origin == 'Non-Hispanic\nAmerican Indian\nor Alaska\nNative'),
mapping = aes(yintercept = mean(covid_death_prop)),
alpha=0.5, linetype='dashed') +
coord_flip() +
labs(x="Education level", y="Proportion of deaths due to Covid",
title='COVID deaths grouped by education for Non-Hispanic\nAmerican Indian or Alaska Native people')
There seems to be no clear trend or effect of education on the proportion of COVID deaths for this race. However, these estimates of proportion might be influenced by the comparatively small number of deaths of Non-Hispanic American Indian or Alaska Native people.
am_edu <- filter(covid_data, Race.or.Hispanic.Origin == 'Non-Hispanic\nAmerican Indian\nor Alaska\nNative')
head(am_edu[c('Education.Level','Race.or.Hispanic.Origin','COVID19.Deaths', 'Total.Deaths')], 8)
## # A tibble: 8 x 4
## Education.Level Race.or.Hispanic.Origin COVID19.Deaths Total.Deaths
## <chr> <chr> <dbl> <dbl>
## 1 "8th grade or\nless" "Non-Hispanic\nAmerican I… 706 3085
## 2 "9 -12th grade,\nwith … "Non-Hispanic\nAmerican I… 662 3915
## 3 "Associate\ndegree" "Non-Hispanic\nAmerican I… 447 1949
## 4 "Bachelor’s\ndegree" "Non-Hispanic\nAmerican I… 282 1276
## 5 "Doctorate or\nProfess… "Non-Hispanic\nAmerican I… 12 97
## 6 "High school\ngraduate… "Non-Hispanic\nAmerican I… 1671 9556
## 7 "Master’s degree" "Non-Hispanic\nAmerican I… 90 448
## 8 "Some college\ncredit,… "Non-Hispanic\nAmerican I… 765 3889
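One way to gauge how much the small counts matter is to attach an interval to each proportion. The sketch below does this with base R's `binom.test`, which returns an exact (Clopper-Pearson) 95% confidence interval. Treating each death as an independent binomial trial is a simplifying assumption on my part, and the education labels are shortened by hand, so read this as a rough illustration rather than a formal analysis.

```r
# Counts copied from the tibble printed above (labels shortened by hand)
am_counts <- data.frame(
  education = c("8th grade or less", "9-12th grade, no diploma",
                "Associate degree", "Bachelor's degree",
                "Doctorate or professional degree",
                "High school graduate or GED", "Master's degree",
                "Some college credit, no degree"),
  covid = c(706, 662, 447, 282, 12, 1671, 90, 765),
  total = c(3085, 3915, 1949, 1276, 97, 9556, 448, 3889)
)

# Exact 95% interval for each row; mapply returns a 2 x 8 matrix,
# so transpose it to get one row per education level
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               am_counts$covid, am_counts$total))

am_counts$prop  <- am_counts$covid / am_counts$total
am_counts$lower <- ci[, 1]
am_counts$upper <- ci[, 2]
am_counts$width <- am_counts$upper - am_counts$lower

# Show the least certain estimates first
print(am_counts[order(-am_counts$width), ])
```

The widest interval belongs to the doctorate or professional degree group, which has only 97 total deaths, so its position in the plot should be read with the most caution.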
Let’s look across all races and see if educational attainment has some influence on the proportion of COVID deaths.
df2 <- covid_data %>%
group_by(Race.or.Hispanic.Origin) %>%
summarise(mean_covid_death_prop = mean(covid_death_prop))
ov_prop <- mean(covid_data$covid_death_prop)
ggplot(data=covid_data) + geom_point(mapping=aes(
x=factor(Education.Level,
levels = c("Unknown",
"8th grade or\nless",
"9 -12th grade,\nwith no diploma",
"High school\ngraduate or GED\ncompleted",
"Some college\ncredit, but no\ndegree",
"Associate\ndegree",
"Bachelor’s\ndegree" ,
"Master’s degree",
"Doctorate or\nProfessional\nDegree")),
y=covid_death_prop)) +
geom_hline(yintercept=ov_prop, color='red', alpha=0.5,
linetype="dashed") +
geom_hline(data = df2,
mapping = aes(yintercept = mean_covid_death_prop),
alpha=0.5) +
coord_flip() +
facet_wrap(~ Race.or.Hispanic.Origin, nrow=2) +
labs(x="Education level", y="Proportion of deaths due to Covid",
title="Covid deaths by race and education")
In the above figure, each panel is a single race and each point is the proportion of COVID deaths for a specific educational attainment for that race. The red dashed line is in the same position in all the plots and it’s the mean COVID death proportion across all racial and educational categories. The black line in each panel is different and is the mean COVID death proportion for that race across all education categories.
In the perfect-fair-world scenario, the red dashed line and the black line would fall on top of each other in each panel…
The black line also serves another purpose visually. It can inform us whether increasing educational attainment actually reduces the proportion of COVID deaths within a racial category. For Hispanic, Non-Hispanic Asian, and Non-Hispanic White people, higher educational attainment does seem to go with lower proportions of COVID death. However, this relationship does not seem to be consistent for all races.
Clearly, COVID death proportions are different for different races and different educational attainments. However, higher educational attainment does not necessarily correspond to lower COVID death proportions within each race. Education does not seem to be an equalizer in this case. In some ways, the effect of education on the proportion of COVID deaths seems to be further amplified or dampened by race.
The proportion of COVID deaths could be higher for one group when compared to another when one group
I doubt the feasibility of option 3. However, options 1 and 2 seem sensible. I don’t find it hard to believe that individuals of different races and educational attainment differ in how likely they are to get COVID or to die from COVID.
Some scenarios that might make one group more likely to get COVID are
Some scenarios that might make one group more likely to die from COVID are
The chain of causation seems fairly clear to me. But I will leave it to the reader to draw their own conclusions about the reasons why COVID is not an equal-opportunity pandemic.