Current methods for testing E. coli levels take days to return results. This creates a delay between when the water is tested and when the results are shared with the public. And because water quality can rapidly change with weather conditions (for example, if it rains), test results can become outdated before they’re even returned.
Intelligent Solution
By using streaming data from sensors, AI can predict whether E. coli levels in the waterway are above safe levels. Data such as discharge, gauge height, and temperature are good indicators of the waterway's current conditions. These data points are aggregated over periods of time to create a current and historical snapshot of the waterway. Automated machine learning (AutoML) can then build and compare dozens of models, each with its own preprocessing steps, to find the best-fitting one.
Technical Implementation
About the Data
The data consists of United States Geological Survey (USGS) sensor readings and hand-sampled E. coli levels from 28 different sites in the Anacostia River collected from January 2013 to September 2020. The data was constructed so that the sensor readings were pulled on the same date the E. coli sample was collected. All USGS data was collected using the USGS REST API.
Problem Framing
This is framed as a binary classification problem: predict whether the E. coli level at a site is above the safe threshold (410 mpn/100ml).
Sample Feature List
The full set of sensor readings includes:
Discharge_cubic_feet_per_second
Dissolved_oxygen_milligrams_per_liter
Gage_height_feet
Temperature_water_degrees_celsius
pH_water_unfiltered_field_standard
The following table lists a sample of the features used in the problem, showing the aggregations derived from one of the sensor readings, discharge_cubic_feet_per_second.
| Feature Name | Data Type | Description | Data Source | Example |
| --- | --- | --- | --- | --- |
| Site | Categorical | Name of the USGS sensor | USGS sensor | Watts Branch at Washington, DC |
| Date | Date stamp | Date on which the E. coli sample was taken and data was aggregated | USGS sensor | 2/23/14 |
| Bacteria | Binary | Whether the E. coli sample was above 410 mpn/100ml | USGS manual sample | 1 |
| Discharge_cubic_feet_per_second_0_lag | Float | Most recent discharge value at time of E. coli sample | USGS sensor | 45.6 |
| Discharge_cubic_feet_per_second_12h_median | Float | Median discharge value over past 12 hours | USGS sensor | 34.2 |
| Discharge_cubic_feet_per_second_12h_max | Float | Maximum discharge value over past 12 hours | USGS sensor | 122.5 |
| Discharge_cubic_feet_per_second_24h_median | Float | Median discharge value over past 24 hours | USGS sensor | 50.2 |
| Discharge_cubic_feet_per_second_24h_max | Float | Maximum discharge value over past 24 hours | USGS sensor | 130.2 |
Data Preparation
The features were aggregated into 12- and 24-hour median and maximum values, in addition to the current values, to create a historical view of the river’s conditions.
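The original preparation script isn't shown here; the following is a minimal sketch of this kind of aggregation using pandas, assuming a DataFrame of raw readings indexed by timestamp. The column name mirrors the feature table above; everything else is illustrative.

```python
import pandas as pd

def add_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw sensor readings into 12h/24h median and max features.

    Assumes `df` has a DatetimeIndex and one column per sensor reading,
    e.g. 'Discharge_cubic_feet_per_second' (column names are illustrative).
    """
    out = df.copy()
    for col in ["Discharge_cubic_feet_per_second"]:
        # Most recent value at the time of the E. coli sample
        out[f"{col}_0_lag"] = df[col]
        for window in ["12h", "24h"]:
            roll = df[col].rolling(window)
            out[f"{col}_{window}_median"] = roll.median()
            out[f"{col}_{window}_max"] = roll.max()
    return out
```

Each aggregated row would then be joined to the hand-sampled E. coli result collected on the same date, producing one labeled training row per sample.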
Model Training
After the modeling data is uploaded to DataRobot, exploratory data analysis (EDA) runs to produce a summary of the data, including feature types, summary statistics for numeric features, and the distribution of each feature. Data quality is also checked at this step. DataRobot has guardrails in place to help ensure that only appropriate data is used in the modeling process; for example, to guard against overfitting, it automatically sets aside 20% of the data as a holdout set that is never used during training. DataRobot also automates other steps in the modeling process, such as partitioning the dataset.
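This step can also be driven programmatically. Below is a minimal sketch assuming the DataRobot Python client (the `datarobot` package); the endpoint, token, file name, and project name are placeholders, and exact method names can vary by client version.

```python
import datarobot as dr

# Connect to DataRobot; endpoint and token are placeholders
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# Uploading the data creates a project and kicks off EDA automatically
project = dr.Project.create(sourcedata="ecoli_features.csv",
                            project_name="Anacostia E. coli")

# Start Autopilot on the binary target; DataRobot partitions the data
# (including the holdout set) and builds and compares candidate models
project.set_target(target="Bacteria", mode=dr.AUTOPILOT_MODE.QUICK)
project.wait_for_autopilot()
```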
Interpret Results
After a model is built, it is helpful to identify which features are its key drivers. Feature Impact ranks features from most to least important and shows their relative importance. In the following example, Discharge_cubic_feet_per_second_24h_median is the most important feature for this model, followed by Discharge_cubic_feet_per_second_24h_max and Discharge_cubic_feet_per_second_12h_max.
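Feature Impact can also be retrieved through the Python client. A short sketch, continuing from the training example above; the result field names (`featureName`, `impactNormalized`) are the client's at the time of writing and may differ across versions.

```python
# Assumes `project` from the training sketch; get_models() returns models
# ranked by their validation score
best_model = project.get_models()[0]

# Compute (or fetch, if already computed) Feature Impact for the model
impact = best_model.get_or_request_feature_impact()
for row in sorted(impact, key=lambda r: r["impactNormalized"], reverse=True):
    print(f"{row['featureName']}: {row['impactNormalized']:.3f}")
```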
Business Implementation
Decision Environment
The DataRobot team developed a Python script that pulls data from the USGS API and aggregates it. The aggregated data is sent to DataRobot for scoring via the DataRobot API, and the resulting predictions are written to a MySQL database. A Tableau dashboard connected to that database is used to visualize the predictions.
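The team's script is not published; the following is a minimal sketch of the data-pull step against the USGS Instantaneous Values web service at waterservices.usgs.gov. The function name and the choice of parameter codes (00060 = discharge in cubic feet per second, 00065 = gage height in feet) are illustrative.

```python
import requests

# USGS Instantaneous Values service
USGS_IV_URL = "https://waterservices.usgs.gov/nwis/iv/"

def fetch_usgs_readings(site: str, period: str = "P1D") -> list:
    """Pull recent discharge and gage-height readings for one site.

    `site` is a USGS site code; `period` is an ISO-8601 duration
    (P1D = the last 24 hours).
    """
    params = {
        "format": "json",
        "sites": site,
        "period": period,
        "parameterCd": "00060,00065",
    }
    resp = requests.get(USGS_IV_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["value"]["timeSeries"]
```

In this use case, a script like this would run for each of the 28 Anacostia River sites, with the readings aggregated into the 12- and 24-hour features described earlier before scoring.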
Model Deployment
Depending on its readiness for production, the model can be deployed through DataRobot's drag-and-drop functionality or through a REST API (see the scoring sketch after this list). Before the model is integrated into production, it might be useful to create a pilot for:
Testing the model performance using new USGS data.
Monitoring unexpected scenarios so a formal monitoring process can be designed or modified accordingly.
Increasing the end users’ confidence in the use of the model outputs to assist business decision making.
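If the REST API route is chosen, scoring a batch of aggregated sensor rows can look like the following. The host, deployment ID, and keys are placeholders patterned on DataRobot's prediction API; the exact URL and headers come from the deployment's integration snippet in the platform. The example row reuses values from the feature table above.

```python
import requests

# Placeholders: take the real host, deployment ID, and keys from the
# deployment's integration snippet in DataRobot
API_URL = ("https://example.datarobot.com/predApi/v1.0/"
           "deployments/YOUR_DEPLOYMENT_ID/predictions")
HEADERS = {
    "Authorization": "Bearer YOUR_API_TOKEN",
    "DataRobot-Key": "YOUR_DATAROBOT_KEY",
    "Content-Type": "application/json",
}

rows = [{
    "Site": "Watts Branch at Washington, DC",
    "Discharge_cubic_feet_per_second_0_lag": 45.6,
    "Discharge_cubic_feet_per_second_12h_median": 34.2,
    "Discharge_cubic_feet_per_second_12h_max": 122.5,
    "Discharge_cubic_feet_per_second_24h_median": 50.2,
    "Discharge_cubic_feet_per_second_24h_max": 130.2,
}]

resp = requests.post(API_URL, headers=HEADERS, json=rows, timeout=30)
resp.raise_for_status()
print(resp.json())  # per-row class probabilities for the Bacteria target
```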
Model Monitoring
If a REST API is used to deploy the model, metrics such as service health, data drift, and accuracy can be monitored in the DataRobot platform.
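These metrics are also exposed programmatically. A brief sketch, assuming the Python client's deployment monitoring helpers; the method names are taken from the client documentation and may differ by version, and the deployment ID is a placeholder.

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
deployment = dr.Deployment.get(deployment_id="YOUR_DEPLOYMENT_ID")

# Service health: request volume, latency, and error rates
service_stats = deployment.get_service_stats()
print(service_stats.metrics)

# Accuracy over time; requires actual outcomes to have been uploaded
accuracy = deployment.get_accuracy()
print(accuracy.metrics)
```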