When rummaging through data, we were perplexed to notice a surprisingly low correlation between masking and COVID cases and deaths per capita over the pandemic up to May 31, 2021. In other words, states that masked more didn't seem to have more or fewer COVID cases and deaths, bucking conventional narratives. Though preliminary, this result gets at the importance of data in public policy, as it allows us to see where the reality on the ground diverges from our preconceived intuitions.
For this project, we analyzed a basket of variables and their relationships to COVID, searching for which variables we might want to emphasize in combating a pandemic. This analysis is important for two reasons: first, because historical understanding is always useful for informing a mental model of the world; and second, because answering these questions could directly contribute to future pandemic preparedness, saving lives. There are lots of narratives floating around COVID; let's see what the data has to say.
Seeing prediction as a pathway to understanding, we decided to collect a basket of variables: as many measures potentially important to COVID as possible. Our research question, then, was: given this basket of variables, can we predict 2020 COVID deaths and cases per capita, and can we use our models to gain insight into COVID? We aimed, in other words, for an exploratory basket-of-variables analysis.
Several research teams have conducted basket-of-variables analyses on COVID. These were helpful; some of our variables were inspired by these studies. Surprisingly, our specific analysis is not covered in the literature; almost all analyses we could find used spring 2020 data and/or studied international or state-level data. (Only one group of researchers performed county-level analysis on COVID, and they focused on spring 2020, while our project tackles all of 2020.) (Velasco et al. 2021, https://pubmed.ncbi.nlm.nih.gov/33466900/; Riley et al. 2022, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0266330#sec006; Ziyadidegan et al. 2022, https://link.springer.com/article/10.1007/s00477-021-02148-0; Karmakar et al. 2021, https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2775732; Aabed et al. 2020, https://www.sciencedirect.com/science/article/pii/S1319562X20306331; Chang et al. 2022, https://www.nature.com/articles/s41598-022-09783-9)
We limited COVID data collection to the period from the beginning of pandemic data collection (which varied by state, so we mass-downloaded data starting from January 1, 2020) through December 31, 2020 (after which the appearance of new COVID strains, the availability of vaccinations, and the non-uniform lifting of vaccine and masking mandates greatly complicate the analysis). We also collected pre-COVID data for baselines, mostly from between 2015 and 2019; our earliest such data, religious demographics, is from 2010.
Our input variables fall into four broad categories. Our "baseline" variables measure a county's pre-COVID socioeconomic, physical, and mental "health"/vulnerabilities. (Think obesity rates, air pollution, inequality, demographics, etc.) This data is coarse-grained; there is one data point per county. Our "politics" variables constitute similarly coarse-grained data about 2020 proper. (Think political affiliation, political control, election results, and masking; no reliable, comprehensive week-by-week or month-by-month masking data exists.) Our "fluctuant" variables are measures that changed week by week during COVID, potentially influencing outcomes. (Think lockdown mobility, hospital capacity statistics, policies, etc.) This data is fine-grained; there are multiple data points per county. Lastly, our "spatial" variable measures the extent of COVID in a county's geographical neighbors, in an attempt to incorporate disease spread into our models.
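To make the "spatial" variable concrete, here is a minimal sketch of one way such a measure could be computed: for each county, average the COVID rate across its adjacent counties. The function name, the adjacency dictionary, and the toy counties are all hypothetical, and a plain unweighted mean is an assumption; the actual construction may weight neighbors differently.

```python
from statistics import mean


def neighbor_average(rates, adjacency):
    """Return, for each county, the mean rate across its adjacent counties.

    rates:     dict mapping county id -> cases per capita for one week
    adjacency: dict mapping county id -> list of neighboring county ids
    Counties whose neighbors have no data get None.
    """
    result = {}
    for county in rates:
        # Keep only neighbors for which a rate was actually reported.
        neighbor_rates = [rates[n] for n in adjacency.get(county, []) if n in rates]
        result[county] = mean(neighbor_rates) if neighbor_rates else None
    return result


# Hypothetical toy data: three counties in a row, A - B - C.
rates = {"A": 0.010, "B": 0.020, "C": 0.040}
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
spatial = neighbor_average(rates, adjacency)
```

Here county B's spatial value is the average of A and C, so a county surrounded by hard-hit neighbors gets a high value even if its own rate is still low.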
We have two output variables, COVID cases per capita and COVID deaths per capita, predicted for the most part on a weekly basis.
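As an illustration of what a weekly per capita output looks like, the sketch below converts cumulative end-of-week counts into new counts per resident. It assumes the raw data arrives as cumulative totals, which is common for COVID reporting but is not stated here; the function name, the `per` scaling parameter, and the clamping of downward revisions to zero are likewise illustrative choices.

```python
def weekly_per_capita(cumulative, population, per=1):
    """Convert cumulative end-of-week counts into new counts per `per` residents.

    cumulative: list of running totals, one per week, in time order
    population: number of residents in the county
    per:        scaling factor (1 = per capita, 100_000 = per 100k residents)
    """
    weekly = []
    previous = 0
    for total in cumulative:
        # Assumption: a drop in the running total is a data revision,
        # so the week's new count is clamped at zero.
        new = max(total - previous, 0)
        weekly.append(new * per / population)
        previous = total
    return weekly


# Hypothetical county of 50,000 residents with four weeks of cumulative cases.
cases_per_100k = weekly_per_capita([10, 25, 25, 40], 50_000, per=100_000)
```

Dividing by population is what makes counties of very different sizes comparable; scaling to per 100,000 residents only changes the units for readability.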