Scoring Goals with Data: Champions League Predictions
We at DataRobot have been enjoying building models to predict various sporting events around the world. However, it’s been pointed out to me that we have shown a bias for covering North American events so far. Several colleagues in Europe asked when we were going to predict the world’s most popular game, football. So I enlisted Chloe and Akshay to help me and see if we could predict the Champions League Knockout Stage and the eventual Champions League winner.
To start, we approached Data Sports Group to see if they would be willing to share some of their rich football data. As soon as we had that data, we faced several challenges. First, how do we extend our models beyond what we’ve done in previous sports and create useful football-specific features that can help our predictions? Second, how best do we merge the different datasets together to create the training dataset? Finally, how can we handle the nature of the knockout stage tiebreakers? Taking advantage of all that we’ve learned from prior sports predictions, and DataRobot’s recent purchase of Paxata, we’ve been able to answer these questions.
Using Data Sports Group’s data, we calculated Elo ratings for each team, captured additional team rankings based on their goals, and scored and added a DataRobot proprietary ranking for every player. Based on this information, we predict the champion is most likely to come from Manchester City, Liverpool, Juventus, PSG or Bayern Munich, with Manchester City being the slight favorites.
In the Champions League Knockout Stage, every round except the final is played over two legs, with each team hosting one. The team with the higher aggregate score advances; if the aggregate is level, the team with more away goals advances. To simulate this adequately, we built models to predict each team’s goal total in a home or an away match.
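To make the tiebreaker concrete, here is a minimal sketch of how a two-legged tie could be resolved from four goal counts. The function name and argument layout are our own illustration, not part of any DataRobot code, and anything beyond the away-goals rule (extra time, penalties) is deliberately left unresolved.

```python
from typing import Optional


def two_leg_winner(a_home: int, b_away: int, b_home: int, a_away: int) -> Optional[str]:
    """Resolve a two-legged tie between hypothetical teams "A" and "B".

    a_home / b_away: goals in leg 1, hosted by A.
    b_home / a_away: goals in leg 2, hosted by B.
    Returns "A", "B", or None if still level after the away-goals rule.
    """
    aggregate_a = a_home + a_away
    aggregate_b = b_home + b_away

    # First criterion: higher aggregate score.
    if aggregate_a != aggregate_b:
        return "A" if aggregate_a > aggregate_b else "B"

    # Second criterion: more goals scored away from home.
    if a_away != b_away:
        return "A" if a_away > b_away else "B"

    # Still level: extra time and penalties are not modelled in this sketch.
    return None


# A wins 2-1 at home, B wins 1-0 at home: 2-2 on aggregate, B has the away goal.
print(two_leg_winner(a_home=2, b_away=1, b_home=1, a_away=0))
```

Keeping the tiebreaker in its own function means a simulation only has to supply goal counts for each leg.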
To start, we collected the last four years of data from the Premier League, La Liga, Bundesliga, Ligue 1, Serie A, and the Champions League from Data Sports Group. This data contains information on every match played, the players, their individual stats, and team stats for those games. Using that data, we calculated Elo ratings, rankings for the individual stats (inspired by Pythagorean win percentages), and individual player rankings (see below) for every team after every game.
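For readers unfamiliar with Elo, the sketch below shows the standard rating update applied after a single match. The K-factor, the home-advantage offset, and the function names are illustrative assumptions; the values we actually used are not reproduced here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation for side A against side B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home: float, away: float, home_goals: int, away_goals: int,
               k: float = 20.0, home_adv: float = 0.0) -> tuple:
    """Return the (home, away) ratings after one match.

    k and home_adv are assumed defaults for illustration only.
    """
    exp_home = expected_score(home + home_adv, away)

    # Actual result from the home side's point of view: win 1, draw 0.5, loss 0.
    if home_goals > away_goals:
        actual = 1.0
    elif home_goals == away_goals:
        actual = 0.5
    else:
        actual = 0.0

    # Points gained by one side are lost by the other.
    delta = k * (actual - exp_home)
    return home + delta, away - delta


ratings = {"Team X": 1500.0, "Team Y": 1500.0}  # hypothetical starting ratings
ratings["Team X"], ratings["Team Y"] = update_elo(
    ratings["Team X"], ratings["Team Y"], home_goals=2, away_goals=0
)
print(ratings)
```

Applying this update match by match, in date order, gives each team a rating after every game.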
Next, we built our training dataset from the current Elo, season-to-date performance, and recent form (for both the team rankings and the individual rankings), using Paxata (see below) in DataRobot. Examining one of our models for home team score (Figure 1), we see that it relied most on each team’s Elo score, but that the recent and season-to-date performance of each team’s top eleven players was also important. Beyond Elo and the player performance ratings, our additional rankings also contributed to the model.
Figure 1: Feature importance of the home team goal model
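As an illustration of the season-to-date and recent-form features mentioned above, the sketch below computes both for one team from a chronological list of per-match values. The five-match window and the function name are assumptions; each row uses only earlier matches so that a game’s own result never leaks into its features.

```python
def form_features(values, window=5):
    """For each match, return (season_to_date_avg, recent_avg) from prior matches only.

    values: chronological per-match numbers for one team in one season
            (for example goals scored). window is an assumed look-back length.
    """
    features = []
    for i in range(len(values)):
        prior = values[:i]          # matches played before match i
        recent = prior[-window:]    # the last `window` of those
        season_avg = sum(prior) / len(prior) if prior else None
        recent_avg = sum(recent) / len(recent) if recent else None
        features.append((season_avg, recent_avg))
    return features


goals = [2, 0, 1, 3]  # hypothetical goals scored in four consecutive matches
for match_no, feats in enumerate(form_features(goals, window=2), start=1):
    print(match_no, feats)
```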
Digging into the impact of Elo (in Figure 2), we see that the number of goals the model predicts increases as expected with increasing Elo:
Figure 2: Impact of the home team’s Elo on the model’s prediction of home goals
Finally, we examine the effect of our player rating models in Figure 3 below. The better the team’s players have performed over the course of the season, the more goals the model predicts for the home team:
Figure 3: Impact of the home team’s top 11 players’ rating over the season
Using these models, we predicted the average number of goals scored by both teams on both legs of a round in the Knockout stage. With the averages, we then ran a simulation of the entire Knockout stage. This simulation was repeated 10,000 times and the results were tallied to determine the Champions League favorites, as detailed above. The total percentage of simulations each team won (along with what percentage of the time they reached each round) is shown in Table 1 below:
Table 1: Results of DataRobot’s simulation of the Champions League Knockout Stage
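The simulation step can be sketched for a single two-legged tie as follows. This is an illustration under stated assumptions rather than our production code: goals are drawn from a Poisson distribution around the predicted averages, a tie still level after away goals is settled by a coin flip, and all names and numbers are hypothetical. A full run would repeat this for every tie in the bracket.

```python
import math
import random


def sample_poisson(rng: random.Random, mean: float) -> int:
    """Draw a Poisson count with Knuth's method (fine for small means like goals)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_tie(rng, a_home, b_away, b_home, a_away) -> str:
    """Simulate one tie; arguments are predicted average goals for each team and leg."""
    a1, b1 = sample_poisson(rng, a_home), sample_poisson(rng, b_away)  # leg 1 at A
    b2, a2 = sample_poisson(rng, b_home), sample_poisson(rng, a_away)  # leg 2 at B
    if a1 + a2 != b1 + b2:                      # aggregate score
        return "A" if a1 + a2 > b1 + b2 else "B"
    if a2 != b1:                                # away goals
        return "A" if a2 > b1 else "B"
    return rng.choice(["A", "B"])               # assumption: coin flip beyond that


def advance_probability(means, n_sims=10_000, seed=0) -> float:
    """Share of simulations in which team A advances."""
    rng = random.Random(seed)
    wins = sum(simulate_tie(rng, *means) == "A" for _ in range(n_sims))
    return wins / n_sims


# Hypothetical averages: A scores 1.8 at home and 1.1 away; B scores 1.4 and 0.9.
print(advance_probability((1.8, 0.9, 1.4, 1.1)))
```

Tallying the winner of every tie across all repetitions yields per-round percentages of the kind reported in Table 1.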
There is rich data at a player level about each player’s contribution to a team win. Can we predict whether Chelsea are going to beat Liverpool based on their line-up? And what is the best predictor of match form? Is it their performance in the preceding match, month or aggregated cumulatively over the season as a whole? We built models based on each of these questions to provide a performance score per player and to ascertain the key performance indicators.
In Figure 4 below, we can see the impact of each individual’s game stats on the rating we calculate for that player:
Figure 4: Feature Importance of the player model
Once we have obtained a predicted performance score for each player (i.e. the probability of a win given their inclusion in the squad), we wanted to explore how this can be combined with their teammates’ predicted performances. Perhaps the performance of the best player is the best predictor for a team win. We calculated scores for team average, best player, and top N players on each team, normalizing the scores by position.
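One way to build these team-level aggregates is sketched below. The field names, the choice of a z-score within each position as the normalization, and the value of N are assumptions made for illustration; the exact normalization we used is not spelled out here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_position(players):
    """Add a 'z' field: each score standardized against others in the same position."""
    by_position = defaultdict(list)
    for p in players:
        by_position[p["position"]].append(p["score"])
    stats = {pos: (mean(v), pstdev(v)) for pos, v in by_position.items()}

    normalized = []
    for p in players:
        mu, sd = stats[p["position"]]
        z = (p["score"] - mu) / sd if sd > 0 else 0.0  # guard against zero spread
        normalized.append({**p, "z": z})
    return normalized


def team_features(players, top_n=3):
    """Per team: average, best-player, and top-N average of normalized scores."""
    by_team = defaultdict(list)
    for p in normalize_by_position(players):
        by_team[p["team"]].append(p["z"])

    features = {}
    for team, scores in by_team.items():
        ranked = sorted(scores, reverse=True)
        features[team] = {
            "avg": mean(ranked),
            "best": ranked[0],
            "top_n": mean(ranked[:top_n]),
        }
    return features


players = [  # hypothetical predicted performance scores
    {"team": "X", "position": "FW", "score": 0.8},
    {"team": "Y", "position": "FW", "score": 0.4},
    {"team": "X", "position": "DF", "score": 0.6},
    {"team": "Y", "position": "DF", "score": 0.2},
]
print(team_features(players, top_n=2))
```

Normalizing within position keeps, say, a defender’s score from being compared directly with a forward’s before the team aggregates are taken.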
Paxata Data Prep
When we took a closer look at the datasets (separated by league), we found a lot of extraneous data. Each league had about 2,500 columns, so it was extremely important to identify the most relevant columns before running the data through DataRobot. Using the Rapid Data Profiling feature in Paxata, we were able to get a better understanding of the data. With Paxata’s data prep capabilities, we were then able to remove the unnecessary columns with a single click:
Figure 5: Rapid Data Profiling in Paxata
Figure 6: Column Management in Paxata
Since the data was separated by leagues, it was important for us to join all these datasets together and build features/key metrics given that our goal was to make predictions for the upcoming Champions League games, in which top teams from the different leagues play against each other.
Using Paxata, we were able to consolidate all this data into one single file; once we had finished with this data, we developed a Python script to calculate Elo scores for all teams. Then, we brought this data back into Paxata to build the Home and Away form for all teams currently in the Champions League. Since form is a categorical variable, we used One Hot Encoding in Paxata to shape this data before we ran it through the model.
Figure 7: Appending multiple files using Paxata
Figure 8: Shaping and One Hot Encoding using Paxata
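We did the encoding in Paxata’s interface, but for readers who want to see what one-hot encoding does to a categorical form column, here is a small plain-Python sketch. The column name and the category labels are hypothetical.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


rows = [  # hypothetical form labels
    {"team": "Team X", "home_form": "good"},
    {"team": "Team Y", "home_form": "poor"},
    {"team": "Team Z", "home_form": "good"},
]
for row in one_hot(rows, "home_form"):
    print(row)
```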
After shaping, transforming, and building additional key features for this dataset, we brought it into DataRobot to model and get our predictions.
Predicting game outcomes, especially for the world’s most popular sport, is both fun and challenging. By using DataRobot’s leading enterprise AI platform, we were quickly able to leverage Data Sports Group’s generous data to predict the Champions League Knockout Stage and the eventual Champions League winner. Will Manchester City, Liverpool, or Juventus be victorious, as our models rate most likely? Or will PSG, Bayern Munich, or Barcelona pull off an upset despite their slightly lower probabilities? Perhaps one of the ten other teams will beat its less-than-five-percent odds to take it all. Time will tell. Until then, enjoy the game.