Predicting COVID-19 on the U.S. County Level
As a majority of counties have already detected COVID-19 cases, today (4/1/2020) is our last update. Our data science team is switching to other projects related to COVID-19.
With the fight against COVID-19 spreading across the U.S. and the world, DataRobot understands it is essential that federal government entities convey accurate information to citizens, local governments, and healthcare providers. To that end, DataRobot has used its AI Platform to develop models that predict which U.S. counties are likely to have their first confirmed COVID-19 cases in the next five days.
It is our hope that the federal, state, and local governments can use this information to budget resources, take preemptive measures, and help citizens to take preventive measures. This information also would be very useful to healthcare providers to help prepare their staff with the most accurate information.
[UPDATE] We are releasing new U.S. county predictions based on the data available today (4/1/2020):
|IL Boone County
|IL Vermilion County
|AL Dale County
|NC Pender County
|MD Allegany County
|ID Latah County
|NM Los Alamos County
|MO Phelps County
|AR Miller County
|OH Scioto County
|NM Otero County
|VA Caroline County
|AL Coffee County
|CA Tehama County
|OR Coos County
|WI Manitowoc County
|IL Coles County
|VA Falls Church city
|NC Haywood County
|SC Cherokee County
[UPDATE] We are releasing new U.S. county predictions based on the data available today (03/25/2020):
|ND Grand Forks
|MO Cape Girardeau
[UPDATE] We are releasing new U.S. county predictions based on the data available today (03/23/2020):
|MI Van Buren
|MN Otter Tail
[UPDATE] We are releasing new U.S. county predictions based on the data available today (03/20/2020):
|CT New London
|MD St. Mary’s
|CA El Dorado
|FL St. Lucie
[UPDATE] 17 of our top 20 predictions from yesterday have already been confirmed. We are releasing new U.S. county predictions based on the data available today (03/18/2020):
|NC New Hanover
|MO St. Charles
|CA El Dorado
|FL St. Lucie
Based on a model trained on data from March 16, 2020, the 20 highest-risk counties are:
|UT Utah County
|MO St. Charles
This map shows the 449 counties that are currently infected in dark blue and the predicted 50 high risk counties in light blue.
Our models suggest that regions with larger populations, higher median income, and a higher level of education are more susceptible to infections in the early outbreak of the coronavirus. Possible explanations are that these populations have been travelling more and are being tested at a higher rate. Because the movement of the virus changes every day as testing and travel patterns change, we need to update these predictions on a regular basis.
The models used to predict these results appear to be quite accurate. On March 11, 2020, we predicted 50 high-risk counties. By 5 p.m. EST on March 16, 2020, 44 of those 50 had reported confirmed cases.
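The 44-of-50 result can be read as a precision of 88% on the top 50 predictions. The following is a minimal sketch of that arithmetic only. The function name `precision_at_k` and the placeholder county identifiers are assumptions made for illustration, not part of DataRobot's tooling or data.

```python
def precision_at_k(predicted, confirmed, k):
    """Fraction of the top-k predicted counties that later reported a case."""
    top_k = predicted[:k]
    if not top_k:
        raise ValueError("no predictions to evaluate")
    hits = sum(1 for county in top_k if county in confirmed)
    return hits / len(top_k)


# Placeholder identifiers (hypothetical), standing in for the real county lists.
predicted = [f"county_{i}" for i in range(50)]
# 44 of the 50 predicted counties went on to report confirmed cases.
confirmed = set(predicted[:44])

print(precision_at_k(predicted, confirmed, k=50))  # 0.88
```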
What data did DataRobot use?
DataRobot drew from the following resources:
- Johns Hopkins University’s dataset of confirmed cases
- Existing U.S. county-level socioeconomic data
- U.S. county geo-coordinates
- News Break Coronavirus Realtime Updates
- Claritas demographics data
- County data from USAFacts
How does the DataRobot model work?
DataRobot identifies patterns in the demographic and socio-economic data of counties that have reported cases of COVID-19 and uses those patterns to identify similar counties that have not. The models performed well, reaching 88% precision on their top 50 predictions with a five-day forecast window. Precision increases to 96% with a 10-day forecast window.
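To make the idea of "finding unreported counties that resemble reported ones" concrete, here is a deliberately simplified sketch. It is not DataRobot's model, whose algorithm is not described here. It swaps in a plain nearest-centroid similarity ranking over standardized features. The function `rank_by_similarity`, the county names, and all feature values are hypothetical. The three features (population, median income, share of residents with a college degree) echo the factors mentioned earlier.

```python
from statistics import mean, pstdev


def rank_by_similarity(reported, unreported):
    """Rank unreported counties by closeness to the average reported county.

    Both arguments map a county name to a list of numeric features.
    The most similar (assumed highest-risk) county comes first.
    """
    all_rows = list(reported.values()) + list(unreported.values())
    columns = list(zip(*all_rows))
    # Standardize each feature so that no single unit dominates the distance.
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread

    def scale(row):
        return [(x - m) / s for x, m, s in zip(row, means, stds)]

    # Centroid of the counties that have already reported cases.
    scaled_reported = [scale(row) for row in reported.values()]
    centroid = [mean(col) for col in zip(*scaled_reported)]

    def distance(row):
        return sum((a - b) ** 2 for a, b in zip(scale(row), centroid)) ** 0.5

    return sorted(unreported, key=lambda name: distance(unreported[name]))


# Hypothetical counties: [population, median income, share with college degree]
reported = {"A": [900000, 75000, 0.45], "B": [650000, 68000, 0.40]}
unreported = {
    "C": [700000, 70000, 0.42],
    "D": [20000, 38000, 0.15],
    "E": [150000, 52000, 0.25],
}

print(rank_by_similarity(reported, unreported))  # ['C', 'E', 'D']
```

A production model would more likely be a supervised classifier trained on labeled county histories. The similarity ranking is used here only because it fits in a few lines and needs nothing beyond the standard library.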
The county-level Johns Hopkins data that DataRobot used for this model is now being aggregated at the state level, so it is no longer useful for our model. The value of each model decreases day by day without new data. Each day that we miss out on new data represents a missed opportunity to help local officials and healthcare providers with more information.
The following data would be helpful in filling the gap:
- Alternative data sources that are tracking infection rates on the county level.
- More county-level data, such as road density, airports, hospital beds, age distribution, and population density, as well as data on travel between counties and airports.
Collection of this data will allow DataRobot’s data scientists to generate manual geospatial features, allowing them to make predictions on which counties in the U.S. have a higher probability of infection.
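As one illustration of what a manually built geospatial feature might look like, the sketch below computes the great-circle distance from a county to the nearest county with confirmed cases, using county geo-coordinates. This is an assumed example feature, not one DataRobot states it used. The helper names `haversine_km` and `distance_to_nearest_infected` and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to avoid a domain error from floating-point round-off.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_to_nearest_infected(county, infected):
    """Feature: distance from one county centroid to the closest infected county."""
    if not infected:
        raise ValueError("need at least one infected county")
    return min(haversine_km(*county, *coords) for coords in infected)


# Hypothetical centroids as (latitude, longitude) pairs.
county = (40.0, -100.0)
infected = [(41.0, -100.0), (35.0, -90.0)]
print(round(distance_to_nearest_infected(county, infected), 1))
```

Similar features could be derived from airports or road density once that data is available.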
With county-level data alone, DataRobot can model the spread of the disease, but not the severity of outbreaks or the location of the next hot spot. If those leading the response in hot spots such as Washington state, New York, and elsewhere can provide disease, infection, and socio-economic data that is more localized than what is currently available, we can model the severity of outbreaks and the locations of the next hot spots.
If you have questions or would like more information, please email COVID19Responseteam@datarobot.com.