Scoring Goals with Data: Champions League Predictions

February 18, 2020

We at DataRobot have been enjoying building models to predict various sporting events around the world. However, it’s been pointed out to me that we have shown a bias for covering North American events so far. Several colleagues in Europe asked when we were going to predict the world’s most popular game, football. So I enlisted Chloe and Akshay to help me and see if we could predict the Champions League Knockout Stage and the eventual Champions League winner.

To start, we approached Data Sports Group to see if they would be willing to share some of their rich football data. As soon as we had that data, we faced several challenges. First, how do we extend our models beyond what we’ve done in previous sports and create useful football-specific features that can help our predictions? Second, how best do we merge the different datasets together to create the training dataset? Finally, how can we handle the nature of the knockout stage tiebreakers? Taking advantage of all that we’ve learned from prior sports predictions, and DataRobot’s recent purchase of Paxata, we’ve been able to answer these questions.

Using Data Sports Group’s data, we calculated Elo ratings for each team, captured additional team rankings based on their goals, and scored and added a DataRobot proprietary ranking for every player. Based on this information, we predict the champion is most likely to come from Manchester City, Liverpool, Juventus, PSG or Bayern Munich, with Manchester City being the slight favorites.

Modeling Approach

In the Champions League Knockout Stage, every round except the final is played over two legs, with each team hosting one. The winner of each round is the team with the higher aggregate score; if the aggregate is tied, the team with more away goals advances. To simulate this adequately, we decided to build models to predict the goal total for each team in both the home and the away fixture.
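The two-leg rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind our predictions: the function and argument names are made up for the example, and a tie that is still level after away goals returns `None`, because extra time and penalties are not modeled here.

```python
from typing import Optional


def two_leg_winner(a_home_goals: int, b_away_goals: int,
                   b_home_goals: int, a_away_goals: int) -> Optional[str]:
    """Decide a two-legged tie between team A and team B.

    Leg 1 is hosted by A (A scores a_home_goals, B scores b_away_goals).
    Leg 2 is hosted by B (B scores b_home_goals, A scores a_away_goals).
    Returns "A", "B", or None if the tie is still level after away goals.
    """
    aggregate_a = a_home_goals + a_away_goals
    aggregate_b = b_home_goals + b_away_goals

    # First criterion: higher aggregate score over both legs.
    if aggregate_a != aggregate_b:
        return "A" if aggregate_a > aggregate_b else "B"

    # Second criterion: more goals scored away from home.
    if a_away_goals != b_away_goals:
        return "A" if a_away_goals > b_away_goals else "B"

    # Still level: extra time and penalties are outside this sketch.
    return None
```

For example, if A wins 1-0 at home and loses 2-1 away, the aggregate is 2-2 and A advances on away goals.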

To start, we collected the last four years of data from the Premier League, La Liga, Bundesliga, Ligue 1, Serie A, and the Champions League from Data Sports Group. This data contains information on every match played, the players, their individual stats, and team stats for those games. Using that data, we calculated Elo ratings, rankings for the individual stats (inspired by Pythagorean win percentages), and individual player rankings (see below) for every team after every game.
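For readers unfamiliar with Elo, here is a minimal sketch of the standard Elo update for a single match. It is an illustration only and not the script used for these predictions: the K-factor of 20, the optional home-advantage offset, and the function names are all assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home_rating: float, away_rating: float,
               home_goals: int, away_goals: int,
               k: float = 20.0, home_advantage: float = 0.0):
    """Return the (home, away) ratings after one match.

    k and home_advantage are illustrative assumptions, not fitted values.
    """
    expected_home = expected_score(home_rating + home_advantage, away_rating)

    # Actual result from the home team's point of view: win, draw, or loss.
    if home_goals > away_goals:
        actual_home = 1.0
    elif home_goals == away_goals:
        actual_home = 0.5
    else:
        actual_home = 0.0

    # Whatever the home team gains, the away team loses.
    delta = k * (actual_home - expected_home)
    return home_rating + delta, away_rating - delta
```

Applying the update after every match in date order gives each team a rating "after every game", which is the form in which Elo enters the training data.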

Next, we built our training dataset from the current Elo, season-to-date performance, and recent form (for both the team rankings and the individual rankings) using Paxata (see below) in DataRobot. Examining one of our models for the home team score (Figure 1), we see that the model relied most on each team’s Elo score, but that the recent and season-to-date performance of each team’s top eleven players was also important. Beyond Elo and the player performance models, our additional rankings also influenced the model’s predictions.

Figure 1: Feature importance of the home team goal model

Digging into the impact of Elo (in Figure 2), we see that the number of goals the model predicts increases as expected with increasing Elo:

Figure 2: Impact of the home team’s Elo on the model’s prediction of home goals

Finally, we examine the effect of our player rating models (Figure 3). The better the team’s players have performed over the course of the season, the higher the predicted number of goals for the home team:

Figure 3: Impact of the home team’s top 11 players’ rating over the season

Using these models, we predicted the average number of goals scored by both teams in both legs of each round of the Knockout Stage. With these averages, we then simulated the entire Knockout Stage. We repeated the simulation 10,000 times and tallied the results to determine the Champions League favorites, as detailed above. The percentage of simulations in which each team won the title (along with the percentage in which they reached each round) is shown in Table 1 below:

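| Team | Quarterfinals | Semifinals | Finals | Champion |
| --- | --- | --- | --- | --- |
| Manchester City | 65% | 41% | 26% | 16% |
| Liverpool | 72% | 43% | 26% | 15% |
| Juventus | 74% | 44% | 25% | 13% |
| PSG | 73% | 43% | 24% | 12% |
| Bayern Munich | 67% | 39% | 22% | 12% |
| Barcelona | 62% | 36% | 20% | 10% |
| RB Leipzig | 66% | 27% | 11% | 4% |
| Real Madrid | 35% | 17% | 8% | 3% |
| Napoli | 38% | 18% | 8% | 3% |
| Atalanta | 55% | 23% | 8% | 3% |
| Chelsea | 33% | 14% | 5% | 2% |
| Dortmund | 27% | 12% | 5% | 2% |
| Atletico Madrid | 28% | 11% | 5% | 2% |
| Valencia | 45% | 15% | 5% | 1% |
| Olympique Lyon | 26% | 8% | 2% | 1% |
| Tottenham | 34% | 9% | 2% | 1% |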

Table 1: Results of DataRobot’s simulation of the Champions League Knockout Stage
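To make the simulation step more concrete, here is a minimal sketch of how a single two-legged tie could be simulated from predicted average goals. Several things in it are assumptions for illustration rather than a description of our pipeline: goals are drawn from a Poisson distribution around each predicted average, a tie that is still level after away goals is settled by a coin flip (standing in for extra time and penalties), and the dictionary keys and function names are hypothetical.

```python
import math
import random


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson-distributed goal count (Knuth's method)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_tie(mean_goals: dict, n_sims: int = 10_000, seed: int = 0) -> float:
    """Estimate the probability that team A wins a two-legged tie.

    mean_goals holds the predicted average goals for each team and leg:
    "a_home" and "b_away" for leg 1 (hosted by A),
    "b_home" and "a_away" for leg 2 (hosted by B).
    """
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_sims):
        a_home = sample_poisson(mean_goals["a_home"], rng)
        b_away = sample_poisson(mean_goals["b_away"], rng)
        b_home = sample_poisson(mean_goals["b_home"], rng)
        a_away = sample_poisson(mean_goals["a_away"], rng)

        aggregate_a = a_home + a_away
        aggregate_b = b_home + b_away
        if aggregate_a != aggregate_b:
            a_through = aggregate_a > aggregate_b
        elif a_away != b_away:
            a_through = a_away > b_away  # away-goals rule
        else:
            a_through = rng.random() < 0.5  # assumed coin flip
        wins_a += a_through
    return wins_a / n_sims
```

Chaining this through the bracket, with each simulated winner advancing to the next round, and counting how often each team reaches each stage yields percentages of the kind shown in Table 1.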

Player Ratings

There is rich data at a player level about each player’s contribution to a team win. Can we predict whether Chelsea are going to beat Liverpool based on their line-up? And what is the best predictor of match form? Is it their performance in the preceding match, month or aggregated cumulatively over the season as a whole? We built models based on each of these questions to provide a performance score per player and to ascertain the key performance indicators.

In Figure 4 below, we can see the impact of each individual’s game stats on the rating we calculate for that player:

Figure 4: Feature Importance of the player model

Once we have obtained a predicted performance score for each player (i.e. the probability of a win given their inclusion in the squad), we wanted to explore how this can be combined with their teammates’ predicted performances. Perhaps the performance of the best player is the best predictor for a team win. We calculated scores for team average, best player, and top N players on each team, normalizing the scores by position.
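The aggregation described above can be sketched as follows. This is a hedged illustration rather than the method we used: the post does not specify how scores were normalized by position, so dividing each score by the mean score for that position is an assumption, as are the function names and the example data.

```python
from statistics import mean


def normalize_by_position(players):
    """Scale each score by the mean score of its position (assumed scheme).

    players is a list of (name, position, score) tuples.
    Position means must be non-zero for the division to work.
    """
    scores_by_position = {}
    for _, position, score in players:
        scores_by_position.setdefault(position, []).append(score)
    position_mean = {pos: mean(vals) for pos, vals in scores_by_position.items()}
    return {name: score / position_mean[position]
            for name, position, score in players}


def team_features(scores, top_n=11):
    """Summarize a team's player scores as candidate model features."""
    ranked = sorted(scores, reverse=True)
    return {
        "team_avg": mean(ranked),
        "best_player": ranked[0],
        "top_n_avg": mean(ranked[:top_n]),
    }


# Made-up example data, for illustration only.
squad = [("p1", "GK", 0.2), ("p2", "GK", 0.4), ("p3", "FW", 0.6)]
normalized = normalize_by_position(squad)
features = team_features(list(normalized.values()), top_n=2)
```

Each of the three summaries can then be tried as a feature to see which one best predicts a team win.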

Paxata Data Prep

A closer look at the datasets (separated by league) showed a lot of extraneous data. Each league had about 2,500 columns, so it was extremely important to identify the most relevant columns before running the data through DataRobot. Using the Rapid Data Profiling feature in Paxata, we were able to get a better understanding of the data. Then, using Paxata’s powerful data prep capabilities, we were able to remove the unnecessary columns with a single click:

Figure 5: Rapid Data Profiling in Paxata

Figure 6: Column Management in Paxata

Since the data was separated by leagues, it was important for us to join all these datasets together and build features/key metrics given that our goal was to make predictions for the upcoming Champions League games, in which top teams from the different leagues play against each other.

Using Paxata, we were able to consolidate all this data into a single file. Once that was done, we developed a Python script to calculate Elo scores for all teams. We then brought this data back into Paxata to build the home and away form for all teams currently in the Champions League. Since form is a categorical variable, we used one-hot encoding in Paxata to shape this data before running it through the model.

Figure 7: Appending multiple files using Paxata

Figure 8: Shaping and One Hot Encoding using Paxata
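We did the encoding in Paxata, but the idea is easy to show in plain Python. The sketch below is an illustration only: it assumes form is recorded as a category label per match (the "W", "D", "L" labels and the `form_` column prefix are hypothetical) and turns each label into a set of 0/1 indicator columns.

```python
def one_hot(values, categories=None, prefix="form"):
    """One-hot encode a list of category labels.

    Returns one dict per input value, with a 0/1 entry for each category.
    If categories is not given, the sorted distinct values are used.
    """
    if categories is None:
        categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical form labels for three recent matches.
encoded = one_hot(["W", "D", "W"], categories=["W", "D", "L"])
```

Passing the category list explicitly keeps the columns identical across teams, even when a team has no matches in one of the categories.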

After shaping and transforming this dataset and adding further key features to it, we brought it into DataRobot to build models and get our predictions.

Conclusions

Predicting game outcomes, especially in football, the world’s game, is both fun and challenging. Using DataRobot’s leading enterprise AI platform, we were able to quickly turn Data Sports Group’s generously shared data into predictions for the Champions League Knockout Stage and the eventual Champions League winner. Will Manchester City, Liverpool, or Juventus be victorious, as our simulations rate most likely? Will PSG, Bayern Munich, or Barcelona pull off an upset from their slightly lower probabilities? Or perhaps one of the ten other teams will beat odds of less than five percent to take it all. Time will tell. Until then, enjoy the game.

About the author
Andrew Engel

General Manager for Sports and Gaming, DataRobot

Andrew Engel is General Manager for Sports and Gaming at DataRobot. He works with DataRobot customers across sports and casinos, including several Major League Baseball, National Basketball League and National Hockey League teams. He has been working as a data scientist and leading teams of data scientists for over ten years in a wide variety of domains from fraud prediction to marketing analytics. Andrew received his Ph.D. in Systems and Industrial Engineering with a focus on optimization and stochastic modeling. He has worked for Towson University, SAS Institute, the US Navy, Websense (now ForcePoint), Stics, and HP before joining DataRobot in February of 2016.

Chloe Coates

Applied Data Science Associate at DataRobot

Chloe is an Applied Data Science Associate at DataRobot. She holds a PhD in Materials Chemistry from the University of Oxford with experience in the analysis and modelling of synchrotron diffraction data of disordered materials. She has also represented Oxford in the Varsity Women’s Football Match against Cambridge, for which she was awarded an Oxford Blue.

Akshay Viswanathan

Data Platform Architect at DataRobot

Akshay is a Data Platform Architect at DataRobot. He works with DataRobot Paxata customers to identify and implement use cases in the Data Prep domain across various industries including Sports, Finance and Healthcare. Akshay received his Masters in Information Systems and Science with a focus on Data Science and Analytics in 2017. He worked for Paxata before joining DataRobot as part of the Paxata acquisition in 2020.
