We want to predict the days-to-recovery (or days-to-discharge) of a COVID-19 patient at time of diagnosis, especially if it is a mild case. While COVID-19 is a global pandemic, thousands have recovered from it and are no longer infectious. For many who may not be feeling sick despite being COVID-19 positive, one of the key questions on their mind will be, “How long will it take for me to be discharged and get back home?”
It is important to note that days-to-recovery prediction is especially relevant in countries where the policy is virus containment. These predictions can help public health experts to track COVID-19 positive citizens and their contacts, test the community frequently, and put everyone who tested positive into isolation at a hospital or community facility until they fully recover.
Motivation
We used South Korea as a specific example because, as of April 30, 2020 and to the best of our knowledge, around 85% of the 10,793 confirmed infected patients had recovered [1], and the Data Science for COVID-19 (DS4C) project has the best publicly available COVID-19 patient-level data (hosted on Kaggle), compiled from the Korea Centers for Disease Control and Prevention and local governments [2].
There are 3,388 patients in the DS4C data, of which 1,327 have recovered. Patient data, such as age and location, is readily available, as is patient travel history, the infection cluster or group to which a patient belongs, regions and daily weather in South Korea, and population behavioral data, such as daily search engine trends and Seoul’s floating population numbers. Government policies on COVID-19 have recently been added. Time-series data on daily confirmed infected, recovered, and death cases is also made available.
To illustrate the data, consider Patient 31, who is well known as the first person from the Shincheonji church in Daegu to test positive for COVID-19; most of South Korea’s COVID-19 patients are from this church and province [3]. From the DS4C patient data, we know that she is 61 years old and came into contact with 1,160 people. She was confirmed to be infected on February 18 and is currently isolated and has not yet recovered. From the DS4C travel history data, Patient 31 went to 15 places (hospital, church, lodging) in Daegu between February 6 and 17. From the DS4C infection cluster data, we observe that the Shincheonji church in Daegu has 4,510 confirmed infected, and 13 other Shincheonji churches in other provinces have 682 confirmed infected.
Approaches and Results
Feature ‘D2R’ Histogram
The figure above shows that, after removing outliers (below 10 and above 40 days), it takes an average of about 22 days for someone in South Korea to recover from COVID-19, with a standard deviation of seven and a half days.
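A minimal sketch of how a days-to-recovery ('D2R') value and the summary statistics above could be computed. The field names `confirmed_date` and `released_date`, the helper names, and the sample records are illustrative assumptions rather than DS4C values; only the 10 and 40 day outlier cut-offs come from the text.

```python
from datetime import date
from statistics import mean, stdev

# Hypothetical records; field names and dates are placeholders, not DS4C data.
patients = [
    {"confirmed_date": "2020-02-18", "released_date": "2020-03-12"},
    {"confirmed_date": "2020-02-20", "released_date": "2020-03-09"},
    {"confirmed_date": "2020-03-01", "released_date": "2020-03-05"},  # short-stay outlier
    {"confirmed_date": "2020-03-02", "released_date": None},          # not yet released
    {"confirmed_date": "2020-02-01", "released_date": "2020-03-20"},  # long-stay outlier
    {"confirmed_date": "2020-03-05", "released_date": "2020-04-04"},
]


def days_to_recovery(record):
    """Days between confirmation and release, or None if not yet released."""
    if not record["released_date"]:
        return None
    confirmed = date.fromisoformat(record["confirmed_date"])
    released = date.fromisoformat(record["released_date"])
    return (released - confirmed).days


def filter_outliers(values, low=10, high=40):
    """Drop missing values and anything below `low` or above `high` days."""
    return [v for v in values if v is not None and low <= v <= high]


d2r = filter_outliers([days_to_recovery(p) for p in patients])
d2r_mean = mean(d2r)   # average days-to-recovery after filtering
d2r_std = stdev(d2r)   # sample standard deviation
```

Patients who are still isolated have no release date, so they drop out before the statistics are computed; whether the original analysis used the sample or population standard deviation is not stated.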
We tried two approaches to predict days-to-recovery, by automatically building models (with DataRobot’s default settings) using:
Approaches to Predicting Days-to-Recovery for COVID-19
The second approach, which uses features from several tables, is promising for days-to-recovery prediction. Although accuracy is not high, it can be improved with updated monthly versions of the DS4C data, more diverse models, and model tuning.
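As a rough illustration of what "features from several tables" can mean, the sketch below attaches region-level columns to patient rows through a shared key. The table layouts, the `province` join key, the `nursing_home_count` column, the `left_join` helper, and all values are hypothetical placeholders; this is not DataRobot's feature engineering.

```python
# Hypothetical rows; column names and values are placeholders, not DS4C data.
patients = [
    {"patient_id": "p1", "province": "Daegu", "age": "60s"},
    {"patient_id": "p2", "province": "Seoul", "age": "20s"},
    {"patient_id": "p3", "province": "Jeju-do", "age": "40s"},
]
regions = [
    {"province": "Daegu", "nursing_home_count": 100},
    {"province": "Seoul", "nursing_home_count": 200},
]


def left_join(left_rows, right_rows, key):
    """Attach right-table columns to every left row that shares `key`."""
    lookup = {row[key]: row for row in right_rows}
    extra_columns = sorted({c for row in right_rows for c in row if c != key})
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        # Unmatched patients are kept; their added columns are set to None.
        for column in extra_columns:
            merged[column] = match.get(column)
        joined.append(merged)
    return joined


features = left_join(patients, regions, key="province")
```

A left join keeps every patient even when the secondary table has no matching row, which matters when some provinces are sparsely covered.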
eXtreme Gradient Boosted Trees Regressor with Early Stopping (Poisson Loss) (Effect Size) (Holdout) (province) Feature Effects
eXtreme Gradient Boosted Trees Regressor with Early Stopping (Poisson Loss) (Effect Size) (Holdout) (infection_case) Feature Effects
The figures above show that provinces such as Gyeongsangbuk-do and Chungcheongnam-do have more patients and slightly higher days-to-recovery, and that people infected in places such as nursing homes, churches, and community centers take more time to recover.
Conclusion
For days-to-recovery predictions, South Korean COVID-19 patients take around 22 days to be discharged, and each prediction can have an error of around ±6.7 days. We noticed that within one month, from February 18 to March 18, 2020, average days-to-recovery dropped from 25 to 17 days. We hypothesize that this is the result of many new South Korean COVID-19 patients having no symptoms (asymptomatic) or mild symptoms, and so recovering much faster than the general population. There was an enormous amount of testing after the discovery of Patient 31 on February 18 [3], and perhaps about 10% of patients have no symptoms and 40% have mild symptoms [4]. Recovery for asymptomatic patients is confirmed by two consecutive negative test results within 24 hours [5].
One of the main limitations of the DS4C data is that 83% of all patients are from the three provinces of Daegu (64%), Gyeongsangbuk-do (13%), and Gyeonggi-do (6%), but almost all of the Daegu patients are currently not present in the DS4C data. As the data is updated monthly, it is possible that more Daegu patients will be added.
Several prediction use cases are possible using DS4C. For most use cases, we have built initial models but decided that the data is not sufficient or reliable at the moment, or the predictive models are not accurate yet:
1. Other than days-to-recovery, there are other patient-level prediction use cases, such as:
Length-of-Stay (LOS) – regression use case similar to days-to-recovery, defined as the difference in days between the discharge and confirmed dates, except that LOS will usually include patients who have died
Symptoms-to-confirmation – regression use case defined as the difference in days between the confirmed and symptom start dates (only 300+ patients in the DS4C data have symptom start dates)
Severity – binary classification use case where we know that the patient was in an Intensive Care Unit or had comorbidities (not available in DS4C data), exceeded x days of hospitalization (where x can vary from country to country), or died due to COVID-19-related complications (60+ deaths in DS4C data, but >200 deaths in official statistics)
State – multiclass classification use case where the patient can be isolated, recovered, or deceased (most are isolated in DS4C, but most are recovered in official statistics)
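The targets for three of the use cases above can be sketched from a patient's dates alone. The field names (`symptom_onset_date`, `confirmed_date`, `released_date`, `deceased_date`), the `derive_targets` helper, and the example record are assumptions for illustration. Treating the date of death as the end of a stay is also an assumption, and severity is left out because the inputs it needs are not available in the DS4C data.

```python
from datetime import date


def _days_between(start, end):
    """Whole days from `start` to `end` (ISO strings), or None if either is missing."""
    if not start or not end:
        return None
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def derive_targets(record):
    """Build illustrative LOS, symptoms-to-confirmation, and state targets."""
    # Assumption: a stay ends at release or, for patients who died, at death.
    end_date = record.get("released_date") or record.get("deceased_date")
    if record.get("deceased_date"):
        state = "deceased"
    elif record.get("released_date"):
        state = "recovered"
    else:
        state = "isolated"
    return {
        "length_of_stay": _days_between(record.get("confirmed_date"), end_date),
        "symptoms_to_confirmation": _days_between(
            record.get("symptom_onset_date"), record.get("confirmed_date")
        ),
        "state": state,
    }


# Hypothetical patient record, not taken from DS4C.
example = {
    "symptom_onset_date": "2020-02-15",
    "confirmed_date": "2020-02-18",
    "released_date": "2020-03-12",
    "deceased_date": None,
}
targets = derive_targets(example)
```

Records without a symptom start date or an end date yield `None` for the corresponding target, so they can be excluded from that use case without affecting the others.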
Clifton is a Customer Facing Data Scientist (CFDS) at DataRobot working in Singapore and leads the Asia Pacific (APAC) CFDS team. His vertical domain expertise is in banking, insurance, and government, and his horizontal domain expertise is in cybersecurity, fraud detection, and public safety. Clifton’s PhD and Bachelor’s degrees are from Clayton School of Information Technology, Monash University, Australia. In his free time, Clifton volunteers professional services to events, conferences, and journals. He was also part of teams that won some analytics competitions.