Support Students at Risk of Unsuccessful Course Outcomes

Identify students at risk of receiving unsuccessful course outcomes (i.e., Drop, Fail, Withdrawal, Incomplete) for each of their enrolled credit hours.


Business Problem

According to the National Center for Education Statistics, the overall 6-year graduation rate for full-time, degree-seeking undergraduate students in the US is 62%. One of the core drivers of student attrition is an unsuccessful outcome in one or more of their courses; failing a class has significant mental and financial repercussions for students. This problem is particularly pronounced among first-generation students and students of color, who tend to be at higher risk of dropping out.

Universities use proactive strategies to help ensure students have the best chances of success, including office hours, peer tutoring, course selection guidance, etc. However, it can be difficult to identify which students need that additional help. By the time midterm grades or projects reveal which students are struggling, it may be too late for them to make up the content they’ve missed.

Intelligent Solution

With AI, universities can use historical data on student academic outcomes to build models that proactively identify which students enrolled in a course are likely to need additional academic support. As the term progresses, these initial predictions can be updated to reflect the latest information.

Beyond the probability of success, faculty and academic support staff can use Prediction Explanations (i.e., student-level explanations of why a specific student was or was not flagged as at-risk) to better match students with the right support. For instance, one student who lives far from campus might have consistent attendance problems and could be scheduled into a later section, whereas another might be flagged for additional support because they received a poor grade in a prerequisite course.

Value Estimation

How would I measure ROI for my use case? 

Universities want their students to graduate and be successful. For public universities, there are also direct financial incentives: most US states offer performance-based funding to universities that improve against or outperform their peers on metrics including graduation rate. Improving graduation rates can therefore translate directly into substantial additional funding.

Technical Implementation

Problem Framing

The target variable for this use case is whether a student will receive an unsuccessful outcome in a course in which they are enrolled, which makes this a binary classification problem. By making these predictions for all of the courses students are enrolled in, advisors can be proactive about getting students the support they need.
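As a concrete illustration, the binary target can be derived from enrollment records by mapping the unsuccessful outcome codes (Drop, Fail, Withdrawal, Incomplete) to True. This is a minimal sketch; the table, column names, and grade codes are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical enrollment records; table and column names are illustrative.
enrollments = pd.DataFrame({
    "student_id":  ["S1", "S1", "S2", "S3"],
    "course_id":   ["ECON_101", "BIO_2b", "ECON_101", "BIO_2b"],
    "final_grade": ["B", "W", "F", "A"],  # letter grade or an outcome code
})

# Unsuccessful outcomes: Drop ("DR"), Fail, Withdrawal, Incomplete.
UNSUCCESSFUL = {"DR", "F", "W", "I"}
enrollments["unsuccessful_outcome"] = enrollments["final_grade"].isin(UNSUCCESSFUL)
```

One row per student-course pair keeps the target aligned with the unit of decision-making: each enrollment gets its own risk prediction.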


Below are recommended features that can help you train a model for this use case. You can add or remove features based on the nature of the available data and the requirements of the model you are building.

  • Student Data
    • Student demographics information (e.g., scholarship recipient, first-generation college student)
    • International student status (country of origin)
    • On-campus/off-campus status and distance away from campus
    • Assessment from a mentor or academic coach
  • Course Data
    • Level of difficulty of the enrolled course based on historical pass rates
    • Type of course (lecture, lab, seminar)
    • Class schedule (day of week, time of day)
  • Academic Data
    • Student’s total course load (# of credits)
    • Current academic information (student’s GPA, identified field of study)
    • Past grades for required prerequisites or related courses
    • Data on course attendance, or project or exam grades (as this data becomes available throughout the term)
    • Consumption of online learning content (if available)
Sample Feature List

| Feature Name | Data Type | Description | Data Source | Example |
| --- | --- | --- | --- | --- |
| Course outcome | Binary (Target) | Whether or not the student received a passing grade in the course | Academic Data | True |
| Grades in prerequisite courses | Numeric | Grade in ECON 101 (for a model predicting success in ECON 201) | Academic Data | 82% |
| Total course load | Numeric | Number of enrolled credits | Academic Data | 14 |
| Required for major | Binary | Whether the course is required for that student's major field of study | Academic Data | False |
| Course number | Categorical | The unique ID of the enrolled course | Course Data | BIOLOGY_2b |
| Average course pass rate | Numeric | Level of difficulty of the course, based on historical pass rates | Course Data | 0.75 |
| Student GPA | Numeric | Current grade point average | Student Data | 3.78 |
| Student country of origin | Categorical | Country of origin, which also indicates international student status | Student Data | India |
| Mentor assessment | Text | Free-text assessment of a student's perspective and outlook from a coach or career advisor | Student Data | "Diamond had a fantastic experience during her summer internship in the microbiology lab and is ready and excited for the coming term" |
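The Average Course Pass Rate feature can be derived directly from historical outcomes with a simple aggregation. A minimal sketch, assuming a historical outcomes table (all names are illustrative):

```python
import pandas as pd

# Hypothetical historical outcomes; table and column names are illustrative.
history = pd.DataFrame({
    "course_id": ["BIO_2b", "BIO_2b", "BIO_2b", "ECON_101"],
    "passed":    [True, False, True, True],
})

# Historical pass rate per course, a proxy for course difficulty.
pass_rate = (history.groupby("course_id")["passed"].mean()
             .rename("avg_course_pass_rate").reset_index())

# Join the rate onto the current term's student-course rows.
current = pd.DataFrame({"student_id": ["S9"], "course_id": ["BIO_2b"]})
current = current.merge(pass_rate, on="course_id", how="left")
```

Computing the rate only from terms prior to the one being scored keeps this feature free of target leakage.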
Data Preparation 
Each row in the data is an example of a student and a course. A student taking four courses in one term would have four rows in the data for that semester. The student data would be the same for every row, but the course-specific data would differ for each course.
When training the data on multiple semesters of history, be sure to avoid target leakage. Target leakage occurs when information about the target “leaks” into the features used to train a model. Target leakage is a problem because it makes models appear better in training and testing than they will actually perform in real life. For instance, if we have a “current GPA” feature, we should use the student’s GPA from the semester prior to the term on which we are modeling. If, instead, we were to use a student’s final GPA as their GPA for predicting success within a given term, that would be target leakage. 
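To make the GPA example concrete, the join below attaches the GPA snapshot from the term *before* the one being predicted to each student-course row, so the feature never reflects outcomes from the target term. A minimal sketch; table and column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical GPA snapshot at the END of each term (names are illustrative).
term_gpa = pd.DataFrame({
    "student_id": ["S1", "S1"],
    "term":       [1, 2],
    "gpa":        [3.1, 3.6],
})

# One row per student-course being modeled.
rows = pd.DataFrame({
    "student_id": ["S1", "S1"],
    "term":       [2, 3],
    "course_id":  ["ECON_201", "ECON_301"],
})

# Leakage-safe join: shift each GPA snapshot forward one term so that a
# term-t row only ever sees the GPA as of the end of term t-1.
prior = term_gpa.rename(columns={"gpa": "prior_term_gpa"}).copy()
prior["term"] = prior["term"] + 1
rows = rows.merge(prior, on=["student_id", "term"], how="left")
```

The same shift-then-join pattern applies to any cumulative feature (credits earned, courses completed) that would otherwise leak information from the target term.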
Model Training

DataRobot Automated Machine Learning automates many parts of the modeling pipeline. Instead of hand-coding and manually testing dozens of models to find the one that best fits your needs, DataRobot automatically runs dozens of models and finds the most accurate one for you, all in a matter of minutes. In addition to training the models, DataRobot automates other steps in the modeling process such as processing and partitioning the dataset.

While this walkthrough jumps straight to deploying the model, you can explore DataRobot's end-to-end workflow to understand the data science methodologies embedded in its automation.

A few key modeling decisions for this use case:

  • Group Partitioning: Most students will appear multiple times in our training data (e.g., a given student will have one row per course, potentially across multiple terms). By using group partitioning based on Student ID, we can ensure that all records for a given student end up in the same partition. Otherwise, the model might learn patterns about individual students, which means it wouldn't generalize well when applied to new students.
  • Relevant History: We may choose to include multiple semesters of data when training our models so that we have a larger volume of data to train on. However, that long history can have downsides: if the university introduced new practices to support students on scholarship, for instance, then historical trends may not be as predictive as the most recent semester of data. By defining the most recent semester of data as our holdout, we can test whether a wider view of history improves or reduces predictive power.
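DataRobot supports group partitioning natively; outside the platform, the same guarantee can be sketched with scikit-learn's GroupKFold, which keeps every row for a given student in a single fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 student-course rows belonging to 4 students (2 rows each).
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 0, 1, 0, 1, 0])
students = np.array(["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S4"])

# All of a student's rows land in the same fold, so every validation fold
# measures performance only on students the model has never seen.
folds = list(GroupKFold(n_splits=4).split(X, y, groups=students))
for train_idx, test_idx in folds:
    assert set(students[train_idx]).isdisjoint(students[test_idx])
```

Without grouping, a random row-level split would let the model see some of a student's courses in training and others in validation, inflating apparent accuracy.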

Business Implementation

Decision Environment 

After you finalize a model, DataRobot makes it easy to deploy it into your desired decision environment. Decision environments are the methods by which predictions will ultimately be used for decision-making.

Decision Maturity

Automation | Augmentation | Blend

Individual components of this process may be automatable. For example, perhaps students that are unlikely to pass at least two courses are automatically referred to the academic support office so that academic coaches can proactively check in. More commonly, though, this kind of model is a decision-support tool for academic advisors and teaching staff.

Model Deployment

Because the end consumers are student-facing staff, predictions and visualizations should be easy to understand and consume. Model results could be delivered to staff through Power BI, Tableau, or an internal database maintained by the university.

Scores should be updated as frequently as the underlying features. If it is possible to include grades, attendance, staff notes, or other relevant features that are collected throughout the term, then the scoring could be done on a daily or weekly basis to provide the most useful information.
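As a sketch of that cadence, a scheduled job can rescore every active enrollment against the latest attendance features and flag the highest-risk students. The model, feature names, and threshold below are placeholders for illustration; in production the deployed model would be invoked instead:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Placeholder model trained on toy data: [missed_last_week, attendance_rate].
model = LogisticRegression().fit(
    [[0, 0.90], [1, 0.40], [0, 0.95], [1, 0.30]], [0, 1, 0, 1])

# Latest weekly feature snapshot; names and threshold are illustrative.
features = pd.DataFrame({
    "student_id":       ["S1", "S2"],
    "course_id":        ["BIO_2b", "BIO_2b"],
    "missed_last_week": [0, 1],
    "attendance_rate":  [0.95, 0.40],
})

# Weekly refresh: rescore all enrollments with the newest data.
features["risk_score"] = model.predict_proba(
    features[["missed_last_week", "attendance_rate"]].to_numpy())[:, 1]
high_risk = features.loc[features["risk_score"] > 0.5, "student_id"].tolist()
```

Running this on a daily or weekly schedule keeps the risk list aligned with whatever attendance, grade, or notes data has arrived since the last scoring run.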

Decision Stakeholders
  • Decision Executors: Student support staff (coaches, academic advisors, tutors, TAs) are the most direct consumers and will use the predictions on a daily or weekly basis.
  • Decision Managers: The Academic Affairs Office (or equivalent) is ultimately responsible for making sure that students have what they need to be successful.
  • Decision Authors:  Data scientists, analysts, and statistical modeling practitioners are all well-positioned to build the models in DataRobot. IT Support or vendors can be brought in if there are specific deployment needs (e.g., Tableau integration).
Decision Process
  • Identify students that would benefit from tutoring, coaching or other academic support before midterm projects and grades are determined
  • Guide faculty and teaching assistants with insights about the drivers of success for their specific courses (e.g., attendance, previous lab experience, completing a certain prerequisite class)
Model Monitoring

Models should be retrained when data drift tracking shows significant deviations between the scoring and training data. If student enrollment changes dramatically (for instance, a shift towards online learning and more students living at home rather than on-campus), then the models should be reevaluated for accuracy. 
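One common drift measure is the Population Stability Index (PSI), which compares a feature's distribution at scoring time against its training-time distribution. A minimal sketch (the 0.1/0.25 thresholds are a widely used rule of thumb, not a DataRobot-specific setting):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between training and scoring samples."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_gpa = rng.normal(3.0, 0.5, 5000)   # feature distribution at training time
stable    = rng.normal(3.0, 0.5, 5000)   # new semester, same population
shifted   = rng.normal(2.5, 0.5, 5000)   # e.g., enrollment mix has changed

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 retrain.
```

A large PSI on features like on-campus status or attendance rate would be an early signal that the enrollment shift described above has occurred and the model needs reevaluation.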

Trusted AI

Universities should think carefully about what actions can and should be taken in response to these predictions. This use case is positioned around proactively offering students additional support given their academic load. In that context, we might be willing to use demographic features (ethnicity, scholarship recipient status) to provide additional support to historically underserved groups.

Theoretically, the model could also be used as a "What If?" scenario-planning tool to help set students up for success within a given semester. In that case, incorporating information about gender, race, first-generation college status, or scholarship status could systematically steer those students away from challenging classes or ambitious course loads, so those features should be excluded from the model. More broadly, it is essential that student-facing staff using the models understand the limitations and appropriate uses of the models with which they interact.
