COVID-19

Lately, a lot of COVID-19 data has been made available for analysis. I was interested in doing something other than the time series analyses that everyone is doing.

Correlations are fun, and I was thinking about different hypotheses about how data might be related. For example:

  1. Might more affluent areas have fewer, or more, cases of COVID-19?
  2. Could one’s political affiliation be correlated with cases of COVID-19?

One might argue that Republicans are more likely to be affluent, and according to what I’ve seen in the news, Republicans might take social distancing measures less seriously. Republicans also typically skew older, and since the 65+ crowd is more vulnerable, perhaps all of these factors contribute to a higher correlation with COVID-19 cases?

First I will play around with correlations and make some nice graphs to get a visual of the data.

DataSets Used:

I decided to explore Median income as it relates to COVID-19 deaths. I grabbed the total median income and removed any “null” rows so that I was left with only “valid” data. I then did an inner join on this file with the COVID-19 file, on County and State.
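The cleaning and joining step can be sketched in base R roughly like this. It is only a sketch under assumptions: the two data frames are toy stand-ins with made-up values, the `cases` and `deaths` column names are hypothetical, and I assume the “null” entries were read in as NA.

```r
# Toy stand-ins for the two files; every value here is made up for illustration.
income <- data.frame(
  County = c("Cook", "Kings", "Maricopa", "Unknown"),
  State  = c("IL", "NY", "AZ", "IL"),
  MedIncomeTotal = c(62000, 56000, 61000, NA)
)
covid <- data.frame(
  County = c("Cook", "Kings", "Maricopa", "Harris"),
  State  = c("IL", "NY", "AZ", "TX"),
  cases  = c(1200, 3400, 450, 800),   # hypothetical column name
  deaths = c(40, 150, 10, 20)         # hypothetical column name
)

# Drop rows with a missing median income so only valid rows remain.
income_valid <- income[!is.na(income$MedIncomeTotal), ]

# merge() performs an inner join by default: only County/State pairs
# that appear in both data frames are kept.
joined <- merge(income_valid, covid, by = c("County", "State"))
print(joined)
```

Joining on both County and State matters because county names repeat across states; joining on County alone would silently pair up unrelated counties.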

Next, larger counties will naturally have more COVID cases simply because more people live there, so to mitigate this I worked with percentages of the population rather than raw counts. The number of cases and deaths in each county was divided by the county population N and then multiplied by 100 to get a percentage. Median income is not a count, so it needs no such scaling; the regression output further down uses it in dollars.
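As a small worked example of that scaling, here is a sketch with two made-up counties. The `p.cases` name matches the regression formula further down; the `population` column name and all the numbers are assumptions for illustration.

```r
# Two made-up counties: B has ten times the cases of A but a far larger population.
county <- data.frame(
  County     = c("A", "B"),
  cases      = c(50, 500),
  population = c(10000, 1000000)   # hypothetical column name
)

# Cases as a percentage of the county population.
county$p.cases <- county$cases / county$population * 100
print(county)
# County A comes out at 0.5% and county B at 0.05%.
```

County B has more raw cases, but once population is accounted for, county A is the harder-hit one, which is exactly the distortion the percentage is meant to remove.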

Here is what I found:

A slight, but significant, positive correlation of COVID cases with income.

I did a correlation for each state (if available) to get a good visual to ascertain whether some states showed this same correlation or not (and to speculate as to why as well).
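A rough sketch of how the overall and per-state correlations could be computed in base R follows. The data are simulated purely so the code runs, so the numbers it prints say nothing about the real result; only `MedIncomeTotal` and `p.cases` are names taken from the regression formula further down.

```r
set.seed(1)

# Simulated counties in three states (20 each); not real data.
n <- 60
d <- data.frame(
  State = rep(c("IL", "NY", "FL"), each = n / 3),
  MedIncomeTotal = runif(n, 30000, 90000)
)
d$p.cases <- 0.000002 * d$MedIncomeTotal + rnorm(n, sd = 0.05)

# Pearson correlation across all counties, with its significance test.
overall <- cor.test(d$MedIncomeTotal, d$p.cases)
print(overall$estimate)
print(overall$p.value)

# The same test repeated within each state.
by_state <- do.call(rbind, lapply(split(d, d$State), function(s) {
  ct <- cor.test(s$MedIncomeTotal, s$p.cases)
  data.frame(State = s$State[1], n = nrow(s),
             r = unname(ct$estimate), p.value = ct$p.value)
}))
print(by_state)
```

Keeping `n` in the per-state table is deliberate: a state with only a handful of counties can produce a large r that means very little, and running one test per state also makes a few spurious "significant" results likely.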

Looking at these correlations was interesting. Some states, at least Illinois, California, and New York, which have populous cities and are heavily Democratic, had positive, significant correlations. Other states did not, notably some Republican states such as Florida and Arizona. These casual correlations had me wondering about political affiliation. Given that the current situation has the country very politically divided, perhaps political affiliation (at least at the state level) plays a part in how the number of COVID cases relates to income.

Just to get an idea of whether there was a basic relationship between the number of COVID cases and party, I graphed just that: I divided the COVID cases into two groups, those from Republican states and those from Democratic states.
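The grouping behind that graph might look something like the sketch below. The state-to-party lookup is entirely hypothetical (placeholder state labels, made-up percentages); `IsDemocrat` and `p.cases` are the names used in the regression further down.

```r
# Hypothetical state-level party lookup; labels and codes are placeholders.
party <- data.frame(
  State      = c("S1", "S2", "S3", "S4"),
  IsDemocrat = c(1, 1, 0, 0)
)

# Made-up county-level case percentages.
counties <- data.frame(
  State   = c("S1", "S1", "S2", "S3", "S3", "S4"),
  p.cases = c(0.30, 0.10, 0.20, 0.05, 0.15, 0.10)
)

# Attach each county's state-level party, then average by group.
d <- merge(counties, party, by = "State")
group_means <- aggregate(p.cases ~ IsDemocrat, data = d, FUN = mean)
print(group_means)

# For the picture itself, boxplot(p.cases ~ IsDemocrat, data = d)
# draws one box per group.
```

A boxplot is a reasonable choice here because it shows the spread and the outliers in each group, not just the averages.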

The Democratic states showed more recorded cases, which really surprised me, as I thought for sure the more right-leaning people in the news who are protesting the Shelter In Place orders might have higher levels of infection. But it should be noted that different states have different capacities for testing, and New York and other “more progressive” states have been much more proactive in getting their population tested. This could explain the correlations in these states, as well as the higher number of recorded cases for Democratic states vs. Republican states.

Surely there was an interaction of some sort, or multiple factors were in play. Next I wanted to explore whether or not a county’s Party and Median Income played a role in the number of cases. I decided to run a multiple linear regression.
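For anyone who wants to reproduce this kind of fit, here is a minimal sketch on simulated data. The formula and the name `bigDataSet` mirror the call in the output; the simulated values, the effect sizes baked into them, and the second model `fit_int` are my own assumptions. Note that an additive formula (`+`) gives both parties the same slope, so it cannot detect an interaction by itself; the `*` version adds a separate slope for each party.

```r
set.seed(42)

# Simulated stand-in for the joined county data; not the real data set.
n <- 200
bigDataSet <- data.frame(
  MedIncomeTotal = runif(n, 30000, 90000),
  IsDemocrat     = rbinom(n, 1, 0.5)   # 1 = Democrat, 0 = Republican
)
bigDataSet$p.cases <- 0.000005 * bigDataSet$MedIncomeTotal +
  0.06 * bigDataSet$IsDemocrat + rnorm(n, sd = 0.15)

# Additive model: one shared slope for income, separate intercepts by party.
fit <- lm(p.cases ~ MedIncomeTotal + IsDemocrat, data = bigDataSet)
print(summary(fit))

# Interaction model (hypothetical extension): the income slope may differ by party.
fit_int <- lm(p.cases ~ MedIncomeTotal * IsDemocrat, data = bigDataSet)

# F-test of whether the interaction term improves on the additive model.
print(anova(fit, fit_int))
```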

> summary(fit)

Call:
lm(formula = p.cases ~ MedIncomeTotal + IsDemocrat, data = bigDataSet)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.24329 -0.05881 -0.02446  0.01095  2.49839 

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -1.044e-01  2.969e-02  -3.518  0.00046 ***
MedIncomeTotal           5.051e-06  9.664e-07   5.227 2.21e-07 ***
IsDemocrat               6.302e-02  1.295e-02   4.865 1.38e-06 ***  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1646 on 784 degrees of freedom
Multiple R-squared:  0.08065,	Adjusted R-squared:  0.0783 
F-statistic: 34.39 on 2 and 784 DF,  p-value: 4.839e-15

I found it interesting that every term in this regression was statistically significant, even though the model explains only about 8% of the variance (R-squared of 0.08). Time to plot the two equations, one for Democrats and one for Republicans:

Republicans (coded as 0): y = -0.1044 + 0.0000051x
Democrats (coded as 1): y = -0.0414 + 0.0000051x
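The two lines follow directly from the coefficient table: the Republican intercept is the model intercept, and the Democrat intercept is the model intercept plus the IsDemocrat coefficient. A quick sketch of that arithmetic, using the estimates printed above (the helper `predict_cases` and the $50,000 income are hypothetical, chosen only for illustration):

```r
# Estimates copied from the summary(fit) output.
b0       <- -1.044e-01   # (Intercept)
b_income <-  5.051e-06   # MedIncomeTotal
b_dem    <-  6.302e-02   # IsDemocrat

rep_intercept <- b0           # IsDemocrat = 0
dem_intercept <- b0 + b_dem   # IsDemocrat = 1
print(c(rep_intercept, dem_intercept))   # -0.1044 and -0.04138

# Hypothetical helper: predicted case percentage for a county.
predict_cases <- function(income, is_democrat) {
  b0 + b_income * income + b_dem * is_democrat
}

# Illustrative median income of $50,000.
print(predict_cases(50000, 0))   # about 0.148
print(predict_cases(50000, 1))   # about 0.211
```

Because the slope is shared, the gap between the two lines is the IsDemocrat coefficient, about 0.063 percentage points, at every income level.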

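The data become a bit clearer now. Because the IsDemocrat coefficient is positive, the Democratic line sits about 0.063 percentage points above the Republican line at every income level, so the model does not support my original hypothesis that Republicans suffer more COVID cases. The spread of cases in Democratic counties is also more varied: there are some counties with a LOT of cases (especially New York, which is that outlier up there), and those may be pulling the Democratic line upward.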

Might more affluent areas have fewer, or more, cases of COVID-19?

Overall, the answer to this question is that more affluent areas tend to have more cases, although the correlation is weak. The correlation does not exist in every state, which could be due to differences in testing and death documentation capacities. Not all states had enough data for a reliable analysis, either. State-wise, the correlations were all over the place: some not significant, some positive, and some negative.

Could one’s political affiliation be correlated with cases of COVID-19?

Just directly looking at the cases by county, it appears that Democratic areas record more COVID-19 cases. A multiple linear regression that considers party along with median income by county shows that for both parties, as median income increases, so does the percentage of COVID-19 cases (which corresponds with the original correlation that was run). The positive IsDemocrat coefficient also means that the line is higher for Democrats, not Republicans: for a given median income level, a Democratic county is predicted to have about 0.063 percentage points more COVID-19 cases than a Republican one.

Why is this? My original speculation was that Republicans would fare worse: they are typically of an older age group, we know that the 65+ crowd is more vulnerable to COVID-19, and they may be practicing less social distancing. But that is confounded by the fact that cities, which are more populous, are largely Democratic, so one may speculate that the closer confines, despite social distancing measures, will inevitably infect more people.

Yet, age also correlates with wealth. Perhaps the 65+ crowd in general accounts for the increase in COVID-19 cases with increases in median income.

The most believable conclusion to me is that age plays a major factor: it could explain both the Party and Median Income variables. Older people are generally more wealthy and more Republican, though the model itself does not include age, so this remains speculation until an age variable is added.