5 Data Science Challenges Banks Face (And How to Overcome Them)

Making predictions has been a part of the banking industry since the world was flat. These days, you would be hard-pressed to identify a line of business or function in a bank that doesn’t have multiple needs for predictive analytics.

Banks of all sizes are realizing that they must find new ways to capture, organize, and share data. They must also up their game with new tools and techniques for learning from that data and embedding data-based capabilities into products, services, client interactions, and operations.

Cheap commodity hardware, the rise of open-source technologies, and new machine learning algorithms have eliminated the technological barriers of yesteryear. Modelers can now crank through an enormous amount of data and let the computer do the hard work of finding the best predictors. The machine “learns” how to make predictions based on the data you provide.

In the last decade or so, the number of different machine learning models that can be used to glean insight from your data has exploded. Regression models are no longer the default choice. Now there are neural networks, random forests, support vector machines, and gradient-boosted trees, just to name a few. But this abundance has given rise to a whole new set of challenges.

Why Is Data Science So Hard?

No one can know in advance which of the myriad available modeling algorithms and techniques will give the best result for the data you have and the outcome you are trying to predict. Not even the most experienced data scientists can tell a priori which model will work best, so they have to try a lot of them. Yet given practical time and budget constraints and an enormous backlog of demand, data scientists usually rely on a few models they know well.
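To make "try a lot of them" concrete, here is a minimal sketch of a model competition: every candidate is scored on the same held-out folds and ranked by error. The two candidates (`fit_mean`, `fit_line`), the synthetic data, and the choice of mean absolute error are illustrative assumptions, not anything a particular product does. A real comparison would cover many more algorithms.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Two deliberately simple candidates (hypothetical stand-ins for real algorithms).
def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """One-variable least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def cv_mae(fit, xs, ys, k=5):
    """Mean absolute error, averaged over k held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(statistics.fmean(abs(model(xs[i]) - ys[i]) for i in held_out))
    return statistics.fmean(errors)

def leaderboard(candidates, xs, ys, k=5):
    """Rank candidates from lowest to highest cross-validated error."""
    scores = {name: cv_mae(fit, xs, ys, k) for name, fit in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Synthetic data (an assumption for illustration): y is linear in x plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

board = leaderboard({"mean_baseline": fit_mean, "linear": fit_line}, xs, ys)
for name, mae in board:
    print(f"{name}: cross-validated MAE = {mae:.2f}")
```

Because every candidate sees exactly the same folds, the ranking reflects the models rather than a lucky split. The cost is that each new candidate multiplies the training work, which is why doing this by hand rarely goes beyond a few familiar models.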

It’s not enough to know that a model works well. With skeptical users, never mind regulators, the ability to explain how and why the model works is critical. Which data is important and when (all of the time, some of the time, only rarely)? Can you explain the reasons for a specific prediction (was it a single reason or a combination of factors)? Black box models that work brilliantly but without insight are of little value, even if you could get them past regulators — which you can’t.
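For the simplest case, a linear scoring model, the reasons behind a single prediction can be read off directly: each feature's contribution is its weight times how far the customer's value sits from a baseline. The sketch below assumes hypothetical weights, feature names, and baseline values. More complex models need dedicated explanation techniques that this example does not cover.

```python
def score(weights, intercept, row):
    """Linear score: intercept plus weighted sum of feature values."""
    return intercept + sum(weights[name] * row[name] for name in weights)

def explain_linear(weights, baseline, row):
    """Per-feature contribution relative to a baseline, largest effect first."""
    contributions = {
        name: weights[name] * (row[name] - baseline[name]) for name in weights
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical model and customer, for illustration only.
weights = {"utilization": 2.0, "late_payments": 0.5, "tenure_years": -0.1}
intercept = -1.0
baseline = {"utilization": 0.3, "late_payments": 1.0, "tenure_years": 5.0}  # e.g. portfolio averages
customer = {"utilization": 0.9, "late_payments": 4.0, "tenure_years": 2.0}

ranked = explain_linear(weights, baseline, customer)
for name, contribution in ranked:
    print(f"{name}: {contribution:+.2f}")
```

The contributions add up to the gap between this customer's score and the baseline score, so the explanation accounts for the whole prediction rather than a selected part of it.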

There are almost always trade-offs to weigh. Perhaps a model is very good at sniffing out positive outcomes, missing very few, but at the cost of a high number of false positives, a classic problem in fraud detection and anti-money laundering (AML). Or a model is exceedingly accurate but can’t perform at the speed needed to support actual business operations. Perhaps a model works very well for one location or customer group but, because of behavioral differences, less well for others. Each of these trade-offs must be assessed before a model is deployed into operational use.
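The false-positive trade-off is easy to see with numbers. The sketch below scores ten made-up transactions and measures precision (the share of alerts that are real) and recall (the share of real cases caught) at two alert thresholds. The scores, labels, and thresholds are invented for illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives when flagging scores >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y:
            tp += 1
        elif flagged:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(scores, labels, threshold):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative model scores and true outcomes (1 = confirmed fraud).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for threshold in (0.25, 0.85):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

At the low threshold every fraud case is caught, but three of the seven alerts are false alarms. At the high threshold every alert is genuine, but half of the fraud slips through. Neither setting is correct in itself. The right one depends on what a missed case and a wasted investigation each cost the business.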

Even the best model can’t correct for errors, gaps, or biases in your data. Storing data is cheap and easy nowadays, but that wasn’t always the case, so you may have limited history to work with. Will that be good enough? Maybe you have data for one geography but are considering expanding into another. Will a model trained on the one work for the other? Are all of the data features used to create your model appropriate? Can you defend the use of all the data for decision-making purposes, even if the model finds that age, for example, is a good predictor? What if your history has embedded biases? If human-originated biases are reflected in the training data, they will be reflected in the resulting model.
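One simple first check for embedded bias is to compare historical outcome rates across groups before training on that history. The sketch below is a minimal illustration with made-up records and hypothetical group labels "A" and "B". The ratio it computes is only a screening signal. What counts as acceptable is a policy and legal question, not something the code decides.

```python
from collections import defaultdict

def approval_rates(records):
    """records: iterable of (group, approved) pairs -> {group: approval rate}."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}

def rate_ratio(rates):
    """Lowest group approval rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 0.0

# Made-up history: group A approved 8 of 10 times, group B 5 of 10 times.
history = [("A", True)] * 8 + [("A", False)] * 2 + [("B", True)] * 5 + [("B", False)] * 5

rates = approval_rates(history)
print(rates, f"ratio = {rate_ratio(rates):.3f}")
```

A ratio well below 1.0 does not prove the history is unfair, since the groups may differ in legitimate ways, but it does tell you where to look before a model learns to reproduce the pattern.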

Will your model continue to perform as conditions change, for example as customer behaviors evolve? Many business conditions are a moving target, and your models need to continuously learn and improve.
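One common way to notice that conditions have moved is to compare the distribution of a model input or score today against the distribution it was trained on. The sketch below uses the population stability index (PSI) for that comparison. The bin count, the small floor that avoids taking the logarithm of zero, and the synthetic data are all assumptions made for illustration.

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a baseline sample and a recent sample.

    Bins are equal-width over the baseline's range; values outside that range
    are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Synthetic example: the recent population has shifted by one standard deviation.
rng = random.Random(7)
baseline = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

stable_psi = psi(baseline, baseline)
drift_psi = psi(baseline, shifted)
print(f"no change: {stable_psi:.3f}, shifted: {drift_psi:.3f}")
```

A frequently quoted rule of thumb treats a PSI below roughly 0.1 as stable and above roughly 0.25 as a shift worth investigating, but those cutoffs are conventions rather than guarantees. A high value is a prompt to re-examine the model, not proof that it has stopped working.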

Automated Machine Learning Can Help Your Bank Overcome These Challenges

Automated machine learning (invented by DataRobot) solves many of these challenges and makes the others more manageable. With these advances, it is now possible to get substantially more productivity from your team of overworked data scientists and to bring solutions to market faster.

Automated machine learning:

Finds the best model for your particular situation through competitive elimination from an extensive resource library of common models, cutting hundreds or even thousands of hours off the search.
Ranks the top-performing models by any of several available metrics so you can evaluate and select from among the models best suited to your particular problem.
Provides transparency into each model’s use of data, telling you not just which data is most important, but when it matters.
Explains individual predictions, down to specific data features and their values.
Provides diagnostics for understanding each model’s accuracy using a variety of standard metrics.
Provides tools for understanding and making trade-off decisions (e.g., between speed and accuracy, positive versus negative predictive value, when and where additional models may help).
Automatically creates most of the documentation required for model validation and model risk management, work that data scientists almost universally dislike.
Reduces the cost, difficulty, and risk of deploying models into your production environment by providing minimally invasive deployment options such as code generation, API deployment, and deployment to Hadoop.
Makes it easier to monitor model performance and detect drift or performance degradation over time, alerting modelers to the need for retraining or creation of challenger models.
Makes retraining models on new data and redeploying models into production simple, fast, and low risk.