Anacostia Riverkeeper: Predicting E. coli Levels from Streaming Sensor Data

Non-Profit | Operations | Improve Performance
Predicting water quality to close the gap between manual sampling and receiving results.

Overview

Business Problem

Current methods for testing E. coli levels take days to return results, creating a delay between when the water is tested and when the results are shared with the public. Because water quality can change rapidly with weather conditions (for example, after rain), test results can become outdated before they are even returned.

Intelligent Solution

By using streaming data from sensors, AI can predict whether E. coli levels in the waterway are above safe levels. Data such as discharge, gauge height, and temperature are good indicators of the waterway's current conditions. These data points are aggregated over time to create both a current and a historical snapshot of the waterway. Automated machine learning (AutoML) can build dozens of different models, each with its own preprocessing techniques, and identify the best-fitting one.

Technical Implementation

About the Data

The data consists of United States Geological Survey (USGS) sensor readings and hand-sampled E. coli levels from 28 sites on the Anacostia River, collected from January 2013 to September 2020. The dataset was constructed so that the sensor readings were pulled on the same date each E. coli sample was collected. All USGS data was retrieved through the USGS REST API.
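
As a rough illustration, the USGS Instantaneous Values REST service can be queried for a site's readings; the endpoint is real, but the site number and parameter codes below are illustrative assumptions rather than the exact configuration used in this project.

```python
# Sketch: pull the past 24 hours of sensor readings for one site from the
# USGS Instantaneous Values REST service (waterservices.usgs.gov).
# The site number and parameter codes are illustrative assumptions.
import requests

USGS_IV_URL = "https://waterservices.usgs.gov/nwis/iv/"

params = {
    "format": "json",
    "sites": "01651800",                 # hypothetical Anacostia-area site number
    "parameterCd": "00060,00065,00010",  # discharge, gage height, water temperature
    "period": "P1D",                     # past 24 hours of readings
}

response = requests.get(USGS_IV_URL, params=params, timeout=30)
response.raise_for_status()

# Each time series corresponds to one parameter at the site.
for series in response.json()["value"]["timeSeries"]:
    name = series["variable"]["variableName"]
    readings = series["values"][0]["value"]
    print(name, len(readings), "readings")
```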

Problem Framing

This is framed as a binary classification problem: predict whether E. coli levels at a site are above the safe threshold.
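
As a minimal sketch, the binary target can be derived from the hand-sampled E. coli concentrations using the 410 MPN/100 mL threshold referenced in the feature table below; the file and column names here are assumptions.

```python
# Sketch: derive the binary Bacteria target from hand-sampled E. coli
# concentrations using the 410 MPN/100 mL threshold. File and column
# names are illustrative assumptions.
import pandas as pd

samples = pd.read_csv("ecoli_samples.csv", parse_dates=["Date"])

SAFE_LIMIT_MPN_PER_100ML = 410
samples["Bacteria"] = (samples["ecoli_mpn_per_100ml"] > SAFE_LIMIT_MPN_PER_100ML).astype(int)

# Quick check of the class balance before modeling.
print(samples["Bacteria"].value_counts(normalize=True))
```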

Sample Feature List

The full set of sensor readings includes:

  • Discharge_cubic_feet_per_second
  • Dissolved_oxygen_milligrams_per_liter
  • Gage_height_feet
  • Temperature_water_degrees_celsius
  • pH_water_unfiltered_field_standard

The following table lists a sample of the features used in the problem, with one of the sensor readings, Discharge_cubic_feet_per_second, shown as an example.

| Feature Name | Data Type | Description | Data Source | Example |
| --- | --- | --- | --- | --- |
| Site | Categorical | Name of the USGS sensor | USGS sensor | Watts Branch at Washington, DC |
| Date | Date stamp | Date on which the E. coli sample was taken and data was aggregated | USGS sensor | 2/23/14 |
| Bacteria | Binary | If the E. coli sample was above 410 MPN/100 mL | USGS manual sample | 1 |
| Discharge_cubic_feet_per_second_0_lag | Float | Most recent discharge value at time of E. coli sample | USGS sensor | 45.6 |
| Discharge_cubic_feet_per_second_12h_median | Float | Median discharge value over past 12 hours | USGS sensor | 34.2 |
| Discharge_cubic_feet_per_second_12h_max | Float | Maximum discharge value over past 12 hours | USGS sensor | 122.5 |
| Discharge_cubic_feet_per_second_24h_median | Float | Median discharge value over past 24 hours | USGS sensor | 50.2 |
| Discharge_cubic_feet_per_second_24h_max | Float | Maximum discharge value over past 24 hours | USGS sensor | 130.2 |

Data Preparation

The features were aggregated into 12- and 24-hour median and maximum values, in addition to the current values, to create a historical view of the river’s conditions.
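
A minimal sketch of this aggregation step is shown below, assuming the raw readings are available as a timestamped table; the file, column, and function names are illustrative, not the project's actual script.

```python
# Sketch: aggregate raw sensor readings into 0-lag, 12-hour, and 24-hour
# median/max features for a single E. coli sample, mirroring the feature
# list above. File and column names are illustrative assumptions.
import pandas as pd

readings = pd.read_csv("usgs_readings.csv", parse_dates=["datetime"])
readings = readings.sort_values("datetime").set_index("datetime")

def build_feature_row(site_readings: pd.DataFrame, sample_time: pd.Timestamp) -> dict:
    """Build one feature row for a single E. coli sample at one site."""
    col = "Discharge_cubic_feet_per_second"
    window_24h = site_readings.loc[sample_time - pd.Timedelta(hours=24): sample_time, col]
    window_12h = site_readings.loc[sample_time - pd.Timedelta(hours=12): sample_time, col]
    return {
        f"{col}_0_lag": window_12h.iloc[-1] if not window_12h.empty else None,
        f"{col}_12h_median": window_12h.median(),
        f"{col}_12h_max": window_12h.max(),
        f"{col}_24h_median": window_24h.median(),
        f"{col}_24h_max": window_24h.max(),
    }

row = build_feature_row(readings, pd.Timestamp("2020-09-01 10:00"))
```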

Model Training

After the modeling data is uploaded to DataRobot, an exploratory data analysis (EDA) is run to produce a summary of the data, including descriptions of feature type, summary statistics for numeric features, and the distribution of each feature. Data quality is checked at this step, too. DataRobot has guardrails in place to help ensure that only appropriate data is used in the modeling process. For example, to avoid overfitting, DataRobot sets aside 20% of the data as a holdout set. DataRobot also automates other steps in the modeling process, such as partitioning the dataset.
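
If the DataRobot Python client is used for this step, uploading the modeling dataset and starting Autopilot might look roughly like the sketch below; the endpoint, token, and file name are placeholders, and method names can vary across client versions.

```python
# Sketch: create a DataRobot project and run Autopilot on the prepared
# modeling dataset. Endpoint, token, and file name are placeholders, and
# method names may differ by version of the `datarobot` package.
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

project = dr.Project.create(
    sourcedata="anacostia_modeling_data.csv",
    project_name="Anacostia E. coli classification",
)

# The target is the binary Bacteria flag; partitioning, holdout creation,
# and preprocessing are handled by the platform as described above.
project.set_target(target="Bacteria", worker_count=-1)
project.wait_for_autopilot()
```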

Interpret Results

After a model is built, it is helpful to identify which features are its key drivers. Feature Impact ranks features from the most important to the least important and shows their relative importance. In the following example, Discharge_cubic_feet_per_second_24h_median is the most important feature for this model, followed by Discharge_cubic_feet_per_second_24h_max and Discharge_cubic_feet_per_second_12h_max.

[Figure: Feature Impact chart]
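
Feature Impact can also be retrieved programmatically; the sketch below assumes the same DataRobot Python client session as in the training step, with a placeholder project ID.

```python
# Sketch: retrieve Feature Impact for the top Leaderboard model via the
# DataRobot Python client. The project ID is a placeholder.
import datarobot as dr

project = dr.Project.get("PROJECT_ID")
best_model = project.get_models()[0]  # Leaderboard models, best-ranked first

impact = best_model.get_or_request_feature_impact()
for row in impact[:5]:
    print(row["featureName"], round(row["impactNormalized"], 3))
```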

Business Implementation

Decision Environment

The DataRobot team developed a Python script that pulls data from the USGS API and aggregates it. The aggregated data is sent to DataRobot for scoring through the DataRobot API, and the resulting predictions are written to a MySQL database. A Tableau dashboard connected to the database is used to visualize the predictions.
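
A rough sketch of that pipeline is shown below: score the aggregated features against a DataRobot deployment's prediction endpoint and write the results to MySQL for the dashboard. The URLs, keys, table name, and the use of SQLAlchemy/PyMySQL are assumptions for illustration.

```python
# Sketch of the decision environment: score aggregated USGS features against
# a DataRobot deployment and write predictions to MySQL for the Tableau
# dashboard. URLs, keys, and table/column names are illustrative placeholders.
import pandas as pd
import requests
from sqlalchemy import create_engine  # assumes SQLAlchemy + PyMySQL are installed

features = pd.read_csv("latest_aggregated_features.csv")  # output of the pull/aggregation step

PREDICTION_URL = (
    "https://example.datarobot.com/predApi/v1.0/deployments/DEPLOYMENT_ID/predictions"
)
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",
    "DataRobot-Key": "YOUR_DATAROBOT_KEY",
    "Content-Type": "text/csv",
}

resp = requests.post(PREDICTION_URL, data=features.to_csv(index=False), headers=headers, timeout=60)
resp.raise_for_status()

# One prediction row comes back per scored feature row.
predictions = features.copy()
predictions["bacteria_prediction"] = [row["prediction"] for row in resp.json()["data"]]

engine = create_engine("mysql+pymysql://user:password@db-host/anacostia")
predictions.to_sql("ecoli_predictions", engine, if_exists="append", index=False)
```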

Model Deployment

Depending on its readiness for production, the model can be deployed through DataRobot’s drag-and-drop functionality or through a REST API.
Before the model is integrated into production, it might be useful to create a pilot for:

  • Testing the model performance using new USGS data.
  • Monitoring unexpected scenarios so a formal monitoring process can be designed or modified accordingly.
  • Increasing end users' confidence in using the model outputs to assist business decision-making.
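
If the REST API route is chosen, a deployment can be created from the selected model with the DataRobot Python client, roughly as sketched below; the model ID, label, and prediction server are placeholders.

```python
# Sketch: deploy the selected model as a REST endpoint using the DataRobot
# Python client. The model ID and label are placeholders, and method names
# may differ by client version.
import datarobot as dr

prediction_server = dr.PredictionServer.list()[0]

deployment = dr.Deployment.create_from_learning_model(
    model_id="MODEL_ID",
    label="Anacostia E. coli - pilot",
    description="Pilot deployment for testing against new USGS data",
    default_prediction_server_id=prediction_server.id,
)
print(deployment.id)
```
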
Model Monitoring

If a REST API is used to deploy the model, metrics such as service health, data drift, and accuracy can be monitored in the DataRobot platform.
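
A hedged sketch of pulling those metrics through the DataRobot Python client follows; the deployment ID is a placeholder, and the available metrics depend on how the deployment is configured (for example, accuracy requires actuals to be uploaded).

```python
# Sketch: check drift tracking, service health, and accuracy for the deployed
# model via the DataRobot Python client. The deployment ID is a placeholder.
import datarobot as dr

deployment = dr.Deployment.get(deployment_id="DEPLOYMENT_ID")

# Enable drift tracking so data drift shows up in the platform UI.
deployment.update_drift_tracking_settings(target_drift_enabled=True, feature_drift_enabled=True)

service_stats = deployment.get_service_stats()
print(service_stats.metrics)  # e.g., total predictions, execution time, error rates

accuracy = deployment.get_accuracy()
print(accuracy.metrics)       # populated only after actuals are uploaded
```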
