Regularized Linear Regression with scikit-learn
Earlier we covered Ordinary Least Squares regression. In this post we build on that foundation and introduce an important extension to linear regression, regularization, which makes it applicable to ill-posed problems (e.g. number of predictors >> number of samples) and helps prevent overfitting.
This is part of a series of blog posts showing how to do common statistical learning techniques with Python. We provide only a small amount of background on the concepts and techniques we cover, so if you’d like a more thorough explanation check out Introduction to Statistical Learning or sign up for the free online course run by the book’s authors here.
Regularized Linear Regression
In a previous post we introduced linear regression and polynomial regression. Polynomial regression fits an nth-order polynomial to our data using least squares. There's a question that we didn't answer: which order of the polynomial should we choose? Clearly, the higher the order of the polynomial, the higher the complexity of the model. This is true both computationally and conceptually because in both cases we now have a higher number of adaptable parameters. The higher the complexity of a model, the more variance it can capture. Given that computation is cheap, should we always pick the most complex model? As we will show below, the answer to this question is no: we have to strike a balance between variance and (inductive) bias. Our model needs sufficient complexity to capture the relationship between the predictors and the response, but it must not fit the idiosyncrasies of our training data, which would limit its ability to generalize to new, unseen cases.
This is best illustrated using a simple curve fitting example, which is adapted from C. Bishop's Pattern Recognition and Machine Learning (2007). Let's create a synthetic dataset by adding some random Gaussian noise to a sinusoidal function.
%pylab inline

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

try:
    from sklearn.model_selection import train_test_split
except ImportError:
    # older scikit-learn releases ship it in sklearn.cross_validation
    from sklearn.cross_validation import train_test_split

try:
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline
except ImportError:
    # use backports for sklearn 1.4
    # available from https://s3.amazonaws.com/datarobotblog/notebooks/sklearn_backports.py
    from sklearn_backports import PolynomialFeatures
    from sklearn_backports import make_pipeline

# ignore DeprecationWarnings by sklearn
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

np.random.seed(9)

def f(x):
    return np.sin(2 * np.pi * x)

# generate points used to plot
x_plot = np.linspace(0, 1, 100)

# generate points and keep a subset of them
n_samples = 100
X = np.random.uniform(0, 1, size=n_samples)[:, np.newaxis]
y = f(X) + np.random.normal(scale=0.3, size=n_samples)[:, np.newaxis]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)

# plot the ground truth (green) and the noisy training points
ax = plt.gca()
ax.plot(x_plot, f(x_plot), color='green')
ax.scatter(X_train, y_train, s=10)
ax.set_ylim((-2, 2))
ax.set_xlim((0, 1))
ax.set_ylabel('y')
ax.set_xlabel('x')
Now let’s see how different polynomials can approximate this curve.
def plot_approximation(est, ax, label=None):
    """Plot the approximation of ``est`` on axis ``ax``. """
    ax.plot(x_plot, f(x_plot), color='green')
    ax.scatter(X_train, y_train, s=10)
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
    ax.set_ylim((-2, 2))
    ax.set_xlim((0, 1))
    ax.set_ylabel('y')
    ax.set_xlabel('x')
    ax.legend(loc='upper right')  #, fontsize='small')

fig, axes = plt.subplots(2, 2, figsize=(8, 5))
# fit different polynomials and plot approximations
for ax, degree in zip(axes.ravel(), [0, 1, 3, 9]):
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(X_train, y_train)
    plot_approximation(est, ax, label='degree=%d' % degree)

plt.tight_layout()
In the plot above we see that the polynomial of degree zero is just a constant approximation, the polynomial of degree one fits a straight line, and the polynomial of degree three nicely approximates the ground truth. The polynomial of degree nine, finally, has nearly zero training error but does a poor job of approximating the ground truth because it already fits the variance induced by the random Gaussian noise that we added to our data.
If we plot the training and testing error as a function of the degree of the polynomial we can see what’s happening: the higher the degree of the polynomial (our proxy for model complexity), the lower the training error. The testing error decreases too, but it eventually reaches its minimum at a degree of three and then starts increasing at a degree of seven.
This phenomenon is called overfitting: the model is already so complex that it fits the idiosyncrasies of our training data, idiosyncrasies which limit the model’s ability to generalize (as measured by the testing error).
from sklearn.metrics import mean_squared_error

train_error = np.empty(10)
test_error = np.empty(10)
for degree in range(10):
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(X_train, y_train)
    train_error[degree] = mean_squared_error(y_train, est.predict(X_train))
    test_error[degree] = mean_squared_error(y_test, est.predict(X_test))

plt.plot(np.arange(10), train_error, color='green', label='train')
plt.plot(np.arange(10), test_error, color='red', label='test')
plt.ylim((0.0, 1e0))
# the axis is linear, so label it as the plain mean squared error
plt.ylabel('mean squared error')
plt.xlabel('degree')
plt.legend(loc='lower left')
In the above example, the optimal choice for the degree of the polynomial approximation would be between three and six. However, there is an alternative to manually selecting the degree of the polynomial: we can add a constraint to our linear regression model that limits the magnitude of its coefficients. This constraint is called the regularization term, and the technique is often called shrinkage in the statistical community because it shrinks the coefficients towards zero. In the context of polynomial regression, constraining the magnitude of the regression coefficients is effectively a smoothness assumption: by constraining the L2 norm of the regression coefficients we express our preference for smooth functions rather than wiggly functions.
A popular regularized linear regression model is Ridge Regression. It adds the squared L2 norm of the coefficients to the ordinary least squares objective:
\(J(\boldsymbol\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - \boldsymbol\beta^T \mathbf{x}_i')^2 + \alpha \|\boldsymbol\beta\|_2^2\)
where \(\boldsymbol\beta\) is the vector of coefficients including the intercept term and \(\mathbf{x}_i'\) is the vector of predictors of the ith data point, including a constant predictor for the intercept. The L2 norm term is weighted by a regularization parameter alpha: if alpha=0 you recover the Ordinary Least Squares regression model. The larger the alpha, the stronger the smoothness constraint.
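To make the role of alpha concrete, here is a minimal NumPy sketch of the ridge solution in closed form. It is not taken from the original notebook: the helper name ridge_closed_form and the toy data are illustrative, the intercept is left out for brevity, and the objective follows scikit-learn's convention of omitting the 1/n factor (which only rescales alpha).

```python
import numpy as np
from sklearn.linear_model import Ridge


def ridge_closed_form(X, y, alpha):
    """Minimize ||y - X b||^2 + alpha * ||b||^2 (no intercept, no 1/n factor)."""
    n_features = X.shape[1]
    # normal equations with alpha added to the diagonal
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# hypothetical toy data: three predictors, linear response plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ridge = ridge_closed_form(X, y, alpha=1.0)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# alpha=0 recovers ordinary least squares
print(np.allclose(ridge_closed_form(X, y, alpha=0.0), beta_ols))

# scikit-learn's Ridge without an intercept gives the same coefficients
sk = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, sk.coef_))

# shrinkage: the ridge coefficient vector has a smaller L2 norm than OLS
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```

Adding alpha to the diagonal is also what keeps the linear system solvable when there are more predictors than samples, which is the ill-posed case mentioned in the introduction.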
Below you can see the approximation of a sklearn.linear_model.Ridge estimator fitting a polynomial of degree nine for various values of alpha (left) and the corresponding coefficient loadings (right). The smaller the value of alpha, the higher the magnitude of the coefficients, so the functions we can model become more and more wiggly.
fig, ax_rows = plt.subplots(4, 2, figsize=(8, 10))

def plot_coefficients(est, ax, label=None, yscale='log'):
    # the regression model is the second step of the pipeline
    coef = est.steps[1][1].coef_.ravel()
    if yscale == 'log':
        ax.semilogy(np.abs(coef), marker='o', label=label)
        ax.set_ylim((1e-1, 1e8))
    else:
        ax.plot(np.abs(coef), marker='o', label=label)
    ax.set_ylabel('abs(coefficient)')
    ax.set_xlabel('coefficients')
    ax.set_xlim((1, 9))

degree = 9
alphas = [0.0, 1e-8, 1e-5, 1e-1]
for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Ridge(alpha=%r) coefficients' % alpha)

plt.tight_layout()
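The effect of alpha on the coefficient magnitudes can also be checked numerically. The following self-contained sketch regenerates a comparable noisy-sine sample (the seed, the sample size and the variable names are assumptions, so the numbers will not match the figures exactly) and prints the L2 norm of the fitted degree-nine coefficients for increasing alpha.

```python
import warnings

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# tiny alphas make the problem ill-conditioned; silence the solver warnings
warnings.filterwarnings("ignore")

# assumed stand-in for the training sample used in the post
rng = np.random.RandomState(9)
X = rng.uniform(0, 1, size=20)[:, np.newaxis]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=20)

alphas = [1e-8, 1e-5, 1e-1]
coef_norms = []
for alpha in alphas:
    est = make_pipeline(PolynomialFeatures(9), Ridge(alpha=alpha))
    est.fit(X, y)
    # make_pipeline names each step after its lowercased class name
    coef = est.named_steps["ridge"].coef_
    coef_norms.append(np.linalg.norm(coef))
    print("alpha=%g  ||coef||_2=%.3g" % (alpha, coef_norms[-1]))
```

The printed norms should shrink as alpha grows, mirroring the right-hand column of the plot.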
Regularization techniques
In the above example we used Ridge Regression, a regularized linear regression technique that puts an L2 norm penalty on the regression coefficients. Another popular regularization technique is the LASSO, which puts an L1 norm penalty on them instead. The difference between the two is that the LASSO leads to sparse solutions, driving most coefficients to zero, whereas Ridge Regression leads to dense solutions, in which most coefficients are nonzero. The intuition behind the sparseness property of the L1 norm penalty can be seen in the plot below. The plot shows the value of the penalty in the coefficient space, here a space with two coefficients \(w_0\) and \(w_1\). The L2 penalty appears as a cone in this space whereas the L1 penalty is a diamond. The objective function of a regularized linear model is just the ordinary least squares error plus the (weighted) penalty term, and the point that minimizes it is where those two surfaces meet. In the case of the L1 penalty this is usually at a corner of the diamond, which is a sparse solution because some coefficients are exactly zero. For the L2 penalty, on the other hand, the optimal point generally has nonzero coefficients. Another popular regularization technique is the Elastic Net, a convex combination of the L2 norm and the L1 norm. It too leads to a sparse solution.
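The sparse-versus-dense contrast is easy to see by counting coefficients that are exactly zero. This is a small sketch on made-up data, not part of the original notebook: the 50-predictor design with five informative predictors, the noise level and alpha=0.2 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
n_samples, n_features = 100, 50
X = rng.normal(size=(n_samples, n_features))

# only the first five predictors actually influence the response
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 4.0, -3.0, 2.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=0.2).fit(X, y)

# count coefficients that the solver set to exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("exactly-zero coefficients: Lasso=%d, Ridge=%d" % (n_zero_lasso, n_zero_ridge))
```

The LASSO should zero out most of the uninformative predictors, while Ridge shrinks them but leaves every coefficient nonzero.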
L2 and L1 regularization differ in how they cope with correlated predictors: L2 will divide the coefficient loading equally among them whereas L1 will place all the loading on one of them while shrinking the others towards zero. Elastic Net combines the advantages of both: it tends to either select a group of correlated predictors in which case it puts equal loading on all of them, or it completely shrinks the group.
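The behavior on correlated predictors can be illustrated with the most extreme case, a duplicated column. The sketch below uses assumed toy data and hand-picked alpha values; the tight tol and large max_iter on the Elastic Net are only there so that coordinate descent fully converges on this degenerate problem.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.RandomState(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])  # two perfectly correlated predictors
y = 3.0 * x + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100000).fit(X, y)

# compare how each penalty distributes the loading over the two copies
for name, est in [("Ridge", ridge), ("Lasso", lasso), ("ElasticNet", enet)]:
    print(name, est.coef_)
```

Ridge and the Elastic Net should split the loading evenly between the two copies, while the LASSO should put essentially all of it on one of them.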
Scikit-learn provides separate classes for LASSO and Elastic Net: sklearn.linear_model.Lasso and sklearn.linear_model.ElasticNet. In contrast to Ridge Regression, which has a closed-form solution, the solution for both LASSO and Elastic Net has to be computed numerically. The classes above use an optimization technique called coordinate descent. Alternatively, you can use the class sklearn.linear_model.SGDRegressor, which uses stochastic gradient descent instead and is often more efficient for large-scale, high-dimensional and sparse data.
from sklearn.linear_model import Lasso

fig, ax_rows = plt.subplots(2, 2, figsize=(8, 5))

degree = 9
alphas = [1e-3, 1e-2]
for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Lasso(alpha=%r) coefficients' % alpha, yscale=None)

plt.tight_layout()

/home/pprett/workspace/scikit-learn/sklearn/linear_model/coordinate_descent.py:481: UserWarning: Objective did not converge. You might want to increase the number of iterations
  ' to increase the number of iterations')
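For the stochastic gradient descent alternative mentioned above, a minimal sketch might look as follows. The data, the penalty settings and the scaling step are assumptions made for illustration; standardizing the predictors first is a common precaution because SGD is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical data: 20 predictors, only three of them informative
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.0, 0.5]
y = X @ true_coef + rng.normal(scale=0.1, size=2000)

# Elastic Net penalty fitted with stochastic gradient descent
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="elasticnet", alpha=1e-3, l1_ratio=0.5, random_state=0),
)
sgd.fit(X, y)

r2 = sgd.score(X, y)
print("R^2 on the training data: %.3f" % r2)
```

Setting penalty to "l1" or "l2" instead gives SGD counterparts of the LASSO and Ridge penalties.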
Regularization Path Plots
Another handy diagnostic tool for regularized linear regression is the so-called regularization path plot. These plots show the coefficient loading (y-axis) against the regularization parameter alpha (x-axis). Each (nonzero) coefficient is represented by a line in this space. The example below is taken from the scikit-learn documentation. You can see that the smaller the alpha (i.e. the higher the -log(alpha)), the higher the magnitude of the coefficients and the more predictors are selected. You can also see that the Elastic Net tends to select more predictors, distributing the loading evenly among them, whereas L1 tends to select fewer predictors.
Regularization path plots can be created efficiently using coordinate descent optimization methods, but they are harder to create with (stochastic) gradient descent optimization methods. Scikit-learn provides a number of convenience functions to create those plots for coordinate-descent-based regularized linear regression models: sklearn.linear_model.lasso_path and sklearn.linear_model.enet_path.
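As a rough sketch of how these two functions are called, the example below computes both paths on assumed toy data (ten predictors, three of them informative) with the default alpha grid. It prints how many predictors are selected at both ends of the path instead of plotting.

```python
import numpy as np
from sklearn.linear_model import enet_path, lasso_path

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.0] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=100)

# the path functions do not fit an intercept, so center the data first
X = X - X.mean(axis=0)
y = y - y.mean()

# each call returns the alpha grid (largest first), the coefficients with
# shape (n_features, n_alphas), and the dual gaps
alphas_lasso, coefs_lasso, _ = lasso_path(X, y)
alphas_enet, coefs_enet, _ = enet_path(X, y, l1_ratio=0.5)

for name, alphas, coefs in [("lasso", alphas_lasso, coefs_lasso),
                            ("enet", alphas_enet, coefs_enet)]:
    n_selected = np.count_nonzero(coefs, axis=0)
    print("%s: %d predictors at alpha=%.3g, %d at alpha=%.3g"
          % (name, n_selected[0], alphas[0], n_selected[-1], alphas[-1]))
```

To draw the path itself, plot each row of the coefficient array against -log10 of the alpha grid, so that regularization weakens from left to right.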