Explaining Multivariate Adaptive Regression Splines in DataRobot

January 25, 2020


This post was originally part of the DataRobot Community.

Multivariate adaptive regression splines (MARS) is an algorithm for regression analysis. It is based on linear regression with the following differences:

it is a non-parametric technique
it models non-linearities and interactions between variables automatically

A MARS model estimates the target function as a weighted sum of basis functions:

$$\hat f(x) = \sum_i c_i B_i(x),$$

where:

$c_i$ — coefficients
$B_i$ — basis functions

Each $B_i$ takes one of the following forms:

$B_i = 1$: a constant term, needed for the intercept.
$B_i = \max(0, x - \text{const})$ or $B_i = \max(0, \text{const} - x)$: needed for piecewise-linear segments. This type of function is called a “hinge.”
$B_i$ = the product of two or more hinge functions: needed for non-linearities and interactions.
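To make the three kinds of basis function concrete, here is a minimal sketch in plain Python 3. The helper names (hinge, mirrored_hinge, toy_mars), the knot values, and the coefficients are all made up for illustration; none of them come from pyearth or from the model fitted later in this post.

```python
def hinge(x, knot):
    """max(0, x - knot): zero left of the knot, linear to the right."""
    return max(0.0, x - knot)


def mirrored_hinge(x, knot):
    """max(0, knot - x): linear left of the knot, zero to the right."""
    return max(0.0, knot - x)


def toy_mars(x1, x2):
    """Weighted sum of basis functions with made-up coefficients."""
    return (
        2.0                                          # B = 1 (intercept)
        + 0.5 * hinge(x1, 3.0)                       # single hinge
        - 1.5 * mirrored_hinge(x1, 3.0)              # its mirror image
        + 0.25 * hinge(x1, 3.0) * hinge(x2, 10.0)    # interaction term
    )


print(toy_mars(5.0, 12.0))  # 2.0 + 0.5*2 - 0 + 0.25*2*2 = 4.0
print(toy_mars(1.0, 12.0))  # 2.0 + 0 - 1.5*2 + 0 = -1.0
```

Because each hinge is zero on one side of its knot, every term only contributes on part of the input range, which is how the sum can bend at the knots.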

In this article, we demonstrate how to fit a MARS model in Python using the pyearth package.

To run the code in this demonstration, which is written for Python 2, we need the following packages:

import urllib2  # fetch the web page (Python 2 only)
import sys  # not used in the code below
from bs4 import BeautifulSoup  # parse the HTML table
import pandas as pd  # data preprocessing (not used in the code below)
import numpy as np  # data preprocessing
from pyearth import Earth  # MARS implementation

Note: Not all of these packages are required for MARS. In fact, most are needed for other purposes, such as data transformations (as explained in the import comments).

For data, we’re creating a new dataset from information on the page http://www.databasebaseball.com. (UPDATE: The new URL for this page is https://www.rotowire.com/.)

The new dataset contains information about the seasons played by the Red Sox and their results.

# Download the Red Sox franchise page and parse the HTML.
page = urllib2.urlopen('http://www.databasebaseball.com/teams/teampage.htm?franch=BOS').read()
soup = BeautifulSoup(page)

# Collect the text of every table cell, one list per table row.
A = []
rows = soup.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    row = [ele.text.encode('latin1') for ele in cols]
    if len(row) < 10:
        row.append('')  # pad short rows with an empty trailing cell
    A.append(row)

del A[0:7]  # drop the first seven rows, which hold header material
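For readers on Python 3, where urllib2 is not available, the row-extraction step can be sketched with the standard library alone. This is an assumption-laden illustration rather than a port of the code above: the TableRows class is a hypothetical helper, and the sample markup is made up instead of being fetched from the real page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup; the real page layout is not reproduced here.
SAMPLE = """
<table>
  <tr><td>2001</td><td>Red Sox</td><td>95 - 67</td></tr>
  <tr><td>2002</td><td>Red Sox</td><td>80 - 82</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
rows = parser.rows
print(rows)
```

Each inner list plays the role of one row in A above, so the later preprocessing steps would apply to it in the same way.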

At this point we’re going to transform our data to do some preprocessing. As is often the case, this step takes quite a bit of time and effort; this is important to keep in mind when you create your own machine learning projects.

The first step is to define the target variable. In our case, the target indicates whether the Red Sox won.

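print type(A)

from numpy import asarray
X = asarray(A)  # convert the list of rows to a 2-D array of strings
print type(X)

# Target: 1 where the tenth column (index 9) is empty, 0 otherwise.
resp = (X[:,[9]] == '')
y = resp.astype(int)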

Output:

<type 'list'>
<type 'numpy.ndarray'>

Now, we will convert the string variables to numeric ones. In addition, we’ll split the string column that holds the win-loss record into two numeric columns.

to_parse = X[:,[2]].astype('str')

print "to_parse:", to_parse[6]  # use this row to determine the original data format

win = []
lose = []

for record in to_parse[:, 0]:
    wins, losses = record.split(" - ", 1)  # '95 - 67' -> '95', '67'
    win.append([wins])     # one-element rows so the arrays below
    lose.append([losses])  # come out as (n, 1) columns for np.hstack
    
from numpy  import array
w = array(win, dtype=float)
l = array(lose, dtype=float)
x_0 = array(X[:,[0]], dtype=float)
x_3 = array(X[:,[3]], dtype=float)
x_4 = array(X[:,[4]], dtype=float)
x_5 = array(X[:,[5]], dtype=float)
x_6 = array(X[:,[6]], dtype=float)
x_7 = array(X[:,[7]], dtype=float)

print x_0.dtype
print x_3.dtype
print x_4.dtype
print x_5.dtype
print x_6.dtype
print x_7.dtype
print l.dtype

Output:

to_parse: ['95 - 67']
float64
float64
float64
float64
float64
float64
float64
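The win-loss split is easy to check in isolation. Here is a minimal Python 3 sketch, assuming every record follows the 'wins - losses' format shown in the output above; parse_record is a hypothetical helper, and the second sample record is made up.

```python
def parse_record(record):
    """Split a 'wins - losses' string such as '95 - 67' into two floats."""
    wins, losses = record.split(" - ", 1)
    return float(wins), float(losses)


records = ["95 - 67", "80 - 82"]  # the second value is made up
wins, losses = zip(*(parse_record(r) for r in records))
print(wins)    # (95.0, 80.0)
print(losses)  # (67.0, 82.0)
```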

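Then, we stack all of the variables together into a single array, the format required by the MARS algorithm. pyearth names the columns by position, so x0 and x1 are the wins and losses, and x2 through x7 correspond to x_0, x_3, x_4, x_5, x_6, and x_7:

variables = np.hstack((w, l, x_0, x_3, x_4, x_5, x_6, x_7))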

The model by itself is just the following two lines of code:

model = Earth()
model.fit(variables,y)

Output:

Earth(penalty=None, min_search_points=None, endspan_alpha=None, check_every=None, max_terms=None, max_degree=None, minspan_alpha=None, thresh=None, minspan=None, endspan=None, allow_linear=None)

At this point, we can print the main results of the model: the basis functions, their coefficients, and the main fit metrics:

MSE
$R^2$ (reported as RSQ)

print model.summary()

Output:

Earth Model
-------------------------------------
Basis Function  Pruned  Coefficient  
-------------------------------------
(Intercept)     No      1.73036      
h(x7-3)         No      0.691092     
h(3-x7)         No      -0.839318    
h(x2-1992)      No      -0.0402292   
h(1992-x2)      Yes     None         
h(x7-2)         No      -0.698699    
h(2-x7)         Yes     None         
x0              Yes     None         
x1              Yes     None         
-------------------------------------
MSE: 0.0270, GCV: 0.0378, RSQ: 0.8267, GRSQ: 0.7612

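We can see that five basis functions, including the intercept, were kept in the final model (the rows marked “No” in the Pruned column), and that they involve only two input variables, x2 and x7. The overall performance is quite good: $R^2 = 0.8267$ is a good indicator, together with $MSE = 0.027$.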
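The summary table can be read as a formula: the prediction is the intercept plus each unpruned coefficient times its hinge. The following Python 3 sketch evaluates that formula by hand with the coefficients printed above. The function names and the two input points are hypothetical, chosen only to show how the hinges switch on and off; the sketch does not call pyearth.

```python
def h(value):
    """Hinge: max(0, value)."""
    return max(0.0, value)


def mars_prediction(x2, x7):
    """Evaluate the unpruned terms from the summary table."""
    return (
        1.73036                        # (Intercept)
        + 0.691092 * h(x7 - 3)         # h(x7-3)
        - 0.839318 * h(3 - x7)         # h(3-x7)
        - 0.0402292 * h(x2 - 1992)     # h(x2-1992)
        - 0.698699 * h(x7 - 2)         # h(x7-2)
    )


# Hypothetical inputs: with x2 below 1992 the h(x2-1992) term is zero.
low = mars_prediction(x2=1990, x7=1)   # only h(3-x7) is active, about 0.052
high = mars_prediction(x2=1990, x7=3)  # only h(x7-2) is active, about 1.032
print(low, high)
```

Pruned terms (coefficient None) are left out because they contribute nothing to the fitted function.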

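We can compare this fit with ordinary least squares (OLS). The OLS fit reports $R^2_{OLS} = 0.87$, slightly higher than $R^2_{MARS} = 0.8267$. The two values are not directly comparable, though: the OLS model below is fit without an intercept term, in which case statsmodels reports the uncentered $R^2$.

In addition, the OLS output shows an MSE of about 0.81, much larger than $MSE_{MARS} = 0.027$. Note, however, that mse_total is the total mean square of the response rather than the residual error of the fit (statsmodels exposes the latter as mse_resid), so this figure overstates the gap. Read with those caveats, the comparison suggests, rather than proves, that the MARS model performs better than OLS on this data.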

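import statsmodels.api as sm

# Use separate names so the MARS model stored in model is not overwritten.
# Note: sm.OLS does not add an intercept column on its own.
ols_model = sm.OLS(y.ravel(), variables)
ols_results = ols_model.fit()

print "MSE = ",ols_results.mse_total  # total mean square of y, not the residual MSE
print "R-squared = ",ols_results.rsquared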

Output:

MSE = 0.807339449541
R-squared = 0.870987820528
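When reading these two numbers, it helps to separate a total mean square from a residual one, and a centered $R^2$ from an uncentered one. The Python 3 sketch below computes all four on a tiny made-up dataset. The variable names are hypothetical, the values are not taken from the Red Sox data, and for simplicity both mean squares divide by n (statsmodels, to the best of our knowledge, divides the residual sum of squares by the residual degrees of freedom instead).

```python
y = [1.0, 1.0, 0.0, 1.0]        # toy 0/1 responses (made up)
y_hat = [0.9, 0.8, 0.2, 1.1]    # toy fitted values (made up)
n = len(y)

ss_resid = sum((a - b) ** 2 for a, b in zip(y, y_hat))
ss_total_uncentered = sum(a ** 2 for a in y)
y_mean = sum(y) / n
ss_total_centered = sum((a - y_mean) ** 2 for a in y)

mean_sq_resid = ss_resid / n              # average squared residual
mean_sq_total = ss_total_uncentered / n   # average of y**2; for 0/1 data, the share of ones
r2_uncentered = 1 - ss_resid / ss_total_uncentered
r2_centered = 1 - ss_resid / ss_total_centered

print(mean_sq_resid, mean_sq_total)   # about 0.025 and exactly 0.75
print(r2_uncentered, r2_centered)     # about 0.967 and 0.867
```

On the same fitted values, the total mean square is far larger than the residual one, and the uncentered $R^2$ is higher than the centered one, so mixing the two kinds of figures can make one model look better or worse than it is.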

As another check on the model’s performance, we can determine how accurate it is for a particular threshold. The following example uses the threshold $0.4$.

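resp_hat = model.predict(variables)

# Label a prediction 1 when it exceeds the threshold, then compare with y.
results = ((resp_hat > 0.4).astype(int).reshape(len(resp_hat),1) == y).astype(int)
accuracy = float(results.sum()) / len(results)  # share of matching labels

print "accuracy =", accuracy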

Output:

accuracy = 0.963302752294
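The threshold check is independent of the model that produced the scores, so it can be sketched in plain Python 3. The function name accuracy_at_threshold and the sample scores and labels are made up for illustration.

```python
def accuracy_at_threshold(scores, labels, threshold=0.4):
    """Share of cases where (score > threshold) matches the 0/1 label."""
    predicted = [int(s > threshold) for s in scores]
    hits = sum(int(p == t) for p, t in zip(predicted, labels))
    return hits / len(labels)


scores = [0.95, 0.35, 0.45, 0.10, 0.80]   # made-up model outputs
labels = [1, 0, 0, 0, 1]                  # made-up true classes
print(accuracy_at_threshold(scores, labels))        # 0.8
print(accuracy_at_threshold(scores, labels, 0.5))   # 1.0
```

The result depends on the chosen threshold, so it is worth trying a few values rather than relying on a single one.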

In this post, we showed how a MARS model can be easily applied in Python, and how this model can be highly accurate.

(The example in this article used the following packages: urllib2, sys, bs4, pandas, numpy, pyearth, statsmodels.)

About the author

Linda Haviland

Community Manager
