As of April 19, 2020, Taiwan has one of the lowest numbers of confirmed COVID-19 cases in the world at 419 cases [1], of which 189 have recovered. The major infection clusters in March 2020 were imported from two major regions, the United States and the United Kingdom. These imported clusters are unlikely to cause local transmission, since every confirmed case must stay in hospital until multiple test results are negative.
As early as February 2020, Taiwan banned visitors from mainland China from entering the country [2]. Given Taiwan's geographic proximity to mainland China, the infection epicenter, this was an important step by the Taiwanese government to control the spread of COVID-19. As cases increased globally, the Taiwanese government introduced additional controls from March 14, 2020, limiting entry to Taiwanese only and encouraging the use of face masks. In addition, those entering Taiwan were required to complete a 14-day home quarantine [3].
The Problem
With tight Taiwanese government policies on COVID-19 and strong health awareness among the Taiwanese population, we would like to understand their positive impact on the daily confirmed number of COVID-19 cases. This is challenging because:
Only 64 of the 100 days analyzed have confirmed cases. Such limited data makes this a difficult machine learning problem.
Adding to the challenge, the Taiwanese government cannot publish patient details such as age, home address, and recovery date.
The published data includes the patients’ registered residential regions, travel history, and the infection cluster or group a patient belongs to. Therefore, additional public data had to be extracted for predictive modelling and insights generation.
Approaches and Results
The additional public data and extracted features include:
Taiwan COVID-19 government policies
Taiwan Centers for Disease Control (CDC) published data: two numeric and two text features extracted (daily confirmed cases, infection cluster information, related Mandarin news on the policies, and the total number of people tested)
Ministry of the Interior’s National Immigration Agency [4] immigration data: one numeric feature extracted (number of daily inbound passengers)
Health awareness among Taiwanese
Google search trends [5] in Taiwan for COVID-19-related keywords in Traditional Chinese: 45 numeric features extracted, such as 新冠肺炎 口罩 (COVID-19 face mask), 新冠肺炎 居家隔離 (COVID-19 home isolation), and 武漢肺炎 症狀 (Coronavirus symptoms)
Additional feature engineering steps were taken to create potentially useful features. The discovery of a new cluster, or an increase in how often news is published, could be correlated with the number of confirmed cases.
New cluster indicator: A binary flag indicating if a new cluster is found on that day
Number of new infection clusters found: A numeric feature indicating the number of new infection clusters found on that day
Total number of news items from CDC: A numeric feature indicating the total number of news items published by Taiwan CDC
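The three engineered features above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the analysis: the function name, the input shapes (a list of (date, cluster name) pairs and a list of news publication dates), the output column names, and the sample records are all assumptions made for the example. It also assumes the news count is tallied per day, which the text does not state explicitly.

```python
from collections import Counter
from datetime import date, timedelta


def engineer_daily_features(case_clusters, news_dates, start, end):
    """Build one row per day with the three engineered features.

    case_clusters: list of (date, cluster_name) pairs, one per confirmed case
                   (hypothetical input shape).
    news_dates:    list of dates, one per published news item
                   (hypothetical input shape).
    """
    # A cluster counts as "new" on the first day any of its cases appears.
    first_seen = {}
    for day, cluster in sorted(case_clusters):
        first_seen.setdefault(cluster, day)

    new_clusters_per_day = Counter(first_seen.values())
    news_per_day = Counter(news_dates)

    rows = []
    day = start
    while day <= end:
        n_new = new_clusters_per_day.get(day, 0)
        rows.append({
            "date": day,
            "new_cluster_flag": int(n_new > 0),       # binary indicator
            "n_new_clusters": n_new,                   # count of new clusters
            "n_cdc_news": news_per_day.get(day, 0),    # news items that day
        })
        day += timedelta(days=1)
    return rows


# Synthetic records for illustration only; these are not real case data.
cases = [
    (date(2020, 3, 18), "imported-US"),
    (date(2020, 3, 18), "imported-UK"),
    (date(2020, 3, 19), "imported-US"),
    (date(2020, 3, 20), "workplace"),
]
news = [date(2020, 3, 18), date(2020, 3, 18), date(2020, 3, 20)]

rows = engineer_daily_features(cases, news, date(2020, 3, 18), date(2020, 3, 20))
for row in rows:
    print(row)
```

Days with no new cluster and no news still get a row, so the output keeps the one-row-per-day layout used for modeling.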
A total of 54 features and 100 rows were used in the predictive modeling and insights generation. Each row represents a day, along with the corresponding characteristics.
Table 1: Snapshot of the training data
The average number of cases per day is four, and 36 of the 100 days have zero confirmed cases. Notably, Chart 1 below shows two spikes in the number of confirmed cases:
March 18-31, 2020: 245 cases in total (58% of all cases), caused by a surge in the number of residents returning from overseas [6] (imported cases)
April 18, 2020: 22 cases in total (5% of all cases), caused by a navy ship cluster [7]
Chart 1: Number of daily confirmed cases over time
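The summary figures quoted for the daily series can be checked with simple arithmetic. The short sketch below uses only numbers stated in this article; it assumes that the 100 days analyzed account for all 419 confirmed cases, which the text implies but does not state outright. The variable names are illustrative.

```python
# Figures quoted in the article.
total_cases = 419          # confirmed cases as of April 19, 2020
days_analyzed = 100
days_with_cases = 64
march_spike_cases = 245    # March 18-31, 2020
navy_cluster_cases = 22    # April 18, 2020

# Assumes all 419 cases fall within the 100 days analyzed.
mean_cases_per_day = total_cases / days_analyzed
zero_case_days = days_analyzed - days_with_cases
march_share_pct = 100 * march_spike_cases / total_cases
navy_share_pct = 100 * navy_cluster_cases / total_cases

print(f"mean cases per day: {mean_cases_per_day:.2f}")
print(f"days with zero cases: {zero_case_days}")
print(f"March 18-31 share of cases: {march_share_pct:.1f}%")
print(f"April 18 share of cases: {navy_share_pct:.1f}%")
```

Rounded to whole numbers, these match the average of four cases per day, the 36 zero-case days, and the 58% and 5% shares reported above.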
We tried two approaches to predicting the number of new confirmed cases per day: Automated Machine Learning and Automated Time Series. In the first iteration, both approaches were run using AutoPilot capabilities with DataRobot’s default settings. Subsequent iterations used different feature lists in order to find the optimal set of features. For the Automated Machine Learning approach, stratified sampling sets aside 20% of the data as a holdout set, and the remaining 80% is used for five-fold cross-validation. Out-of-time validation is not used in this analysis because there are not enough days to build a model with more than one backtest.
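To make the partitioning concrete, here is a minimal sketch of a stratified 20% holdout followed by five cross-validation folds on the remaining 80%. It is not DataRobot's implementation, which the article does not describe. The function name is hypothetical, and stratifying on whether a day has any confirmed case is an assumption chosen for illustration. The labels are a synthetic stand-in with the same 64/36 split as the data, not the real daily sequence.

```python
import random


def stratified_holdout_and_folds(labels, holdout_frac=0.2, n_folds=5, seed=0):
    """Split row indices into a stratified holdout set and stratified CV folds."""
    rng = random.Random(seed)

    # Group row indices by stratum label.
    by_stratum = {}
    for idx, label in enumerate(labels):
        by_stratum.setdefault(label, []).append(idx)

    holdout = []
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_stratum):
        indices = by_stratum[label]
        rng.shuffle(indices)
        # Take the same fraction of every stratum for the holdout set.
        n_holdout = round(len(indices) * holdout_frac)
        holdout.extend(indices[:n_holdout])
        # Deal the remaining rows of this stratum across the folds in turn.
        for position, idx in enumerate(indices[n_holdout:]):
            folds[position % n_folds].append(idx)

    return sorted(holdout), [sorted(fold) for fold in folds]


# Synthetic labels: 1 = day with confirmed cases, 0 = day with none.
labels = [1] * 64 + [0] * 36
holdout, folds = stratified_holdout_and_folds(labels)

print("holdout size:", len(holdout))
print("fold sizes:", [len(fold) for fold in folds])
```

Because each stratum is split separately, the holdout set and every fold contain both zero-case and non-zero-case days, which matters when only 100 rows are available.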
Chart 2: Feature Impact chart with redundant features highlighted in yellow
Based on our Automated Machine Learning approach, the most impactful feature is the Google search frequency for “home isolation (居家隔離)” one day earlier. All else being equal, a higher search frequency for “home isolation (居家隔離)” corresponds to a higher number of confirmed cases. However, once the search frequency for “home isolation (居家隔離)” passes 5%, the number of confirmed cases remains the same. We also found that clusters tied to places where many people gather frequently and for long periods, such as “workplace” and “school,” have more confirmed cases than “family” and “tour” clusters.
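A "one day earlier" feature of this kind is a lagged copy of a daily column. The sketch below shows one way to build it, assuming the rows are sorted by date with no missing days. The function name, column names, and values are hypothetical and synthetic; they are not taken from the training data.

```python
def add_lag(rows, column, lag_days=1):
    """Return copies of rows with an extra column holding the value
    of `column` from `lag_days` rows earlier (None where unavailable)."""
    lagged_name = f"{column}_lag{lag_days}"
    out = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        # The first `lag_days` rows have no earlier value to look back to.
        new_row[lagged_name] = rows[i - lag_days][column] if i >= lag_days else None
        out.append(new_row)
    return out


# Synthetic daily rows for illustration only.
daily = [
    {"date": "2020-03-18", "search_home_isolation": 2.0, "confirmed_cases": 8},
    {"date": "2020-03-19", "search_home_isolation": 3.5, "confirmed_cases": 12},
    {"date": "2020-03-20", "search_home_isolation": 5.0, "confirmed_cases": 15},
]

lagged = add_lag(daily, "search_home_isolation", lag_days=1)
for row in lagged:
    print(row)
```

Using the previous day's search frequency means the feature is already known when the current day's case count is predicted.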
Conclusion
To improve model accuracy, additional features such as the daily sales volume of face masks and hand sanitizers could be used. Since 87.8% of the confirmed cases are imported, a more granular count of returning residents (e.g., a breakdown by country) could also be included.
Even though Taiwan’s COVID-19 government policies were not among the top features in our models, it is widely reported that the Taiwanese leadership (such as the Vice President [8] and Health Minister [9]) has received international recognition for its effectiveness and is seen as a model for other countries to emulate.
Clifton is a Customer Facing Data Scientist (CFDS) at DataRobot, based in Singapore, and leads the Asia Pacific (APAC) CFDS team. His vertical domain expertise is in banking, insurance, and government; his horizontal domain expertise is in cybersecurity, fraud detection, and public safety. Clifton’s PhD and Bachelor’s degrees are from the Clayton School of Information Technology, Monash University, Australia. In his free time, Clifton volunteers professional services to events, conferences, and journals. He was also part of teams that won analytics competitions.