# Explaining Multivariate Adaptive Regression Splines in DataRobot

January 25, 2020
by

This post was originally part of the DataRobot Community. Visit now to browse discussions and ask questions about DataRobot, AI Platform, data science, and more.

Multivariate adaptive regression splines (MARS) is an algorithm for regression analysis. It is based on linear regression with the following differences:

• it is a non-parametrics technique
• it models non-linearities and interactions between variables automatically

MARS models build the following estimation for the resulting function:

$$\hat f(x) = \Sigma c_i B_i(x),$$

where:

• $c_i$ — coefficients
• $B_i$ — basis functions

$B_i$ can take the following values:

• $B_i$ = 1 — We need this for inteceptions
• $B_i$ = \max(0, x – const)$or$B_i = \max(0, const – x)$— We need this for segments. This type of function is called “hinge.” •$B_i$= The product of two or more hinge (functions). We need this for non-linearities and interactions. In this article, we demonstrate how MARS can be performed in Python using the package, pyearth. To run this code for our demonstration, we need the following packages: import urllib2 #need to read url import sys #need to read url from bs4 import BeautifulSoup #need to read url import pandas as pd #need for data preprocessing import numpy as np #need for data preprocessing from pyearth import Earth #need for MARS Note: Not all of these packages are required for MARS. In fact, most are needed for other purposes, such as data transformations (as explained in the import comments). For data, we’re creating a new dataset from information in the page, http://www.databasebaseball.com. (UPDATE: The new URL for this page is https://www.rotowire.com/.) The new dataset we create contains information about seasons played by Red Sox and their results. page = urllib2.urlopen('http://www.databasebaseball.com/teams/teampage.htm?franch=BOS').read() soup = BeautifulSoup(page) soup.prettify() table = soup.findAll('table') A = [] rows = soup.findAll('tr') for tr in rows: cols = tr.findAll('td') row = [ele.text.encode('latin1') for ele in cols] if len(row) < 10: row.append('') A.append(row) del A[0:7] #remove header At this point we’re going to transform our data to do some preprocessing. As is often the case, this step takes quite a bit of time and effort; this is important to keep in mind when you create your own machine learning projects. The first step is to define the target variable. In our case, the target we select identifies if the Red Sox won. print type(A) from numpy import asarray X = asarray(A) print type(X) resp = (X[:,[9]] == '') y = resp.astype(int)  Output: <type 'list'> <type 'numpy.ndarray'>  Now, we will do some data analysis to transform string variables to numeric. In addition, we’ll parse string variables that contain number of wins and losses into two numeric columns. to_parse = X[:,[2]].astype('str') print "to_parse:", to_parse[6] # use this row to determine the original data format win = [] lose = [] for i in range(0, len(to_parse)): string = to_parse[i].tostring() parse = string.split(" - ",1) win.append(filter(None, parse[0].split('\x00'))) lose.append(filter(None, parse[1].split('\x00'))) from numpy import array w = array(win, dtype=float) l = array(lose, dtype=float) x_0 = array(X[:,[0]], dtype=float) x_3 = array(X[:,[3]], dtype=float) x_4 = array(X[:,[4]], dtype=float) x_5 = array(X[:,[5]], dtype=float) x_6 = array(X[:,[6]], dtype=float) x_7 = array(X[:,[7]], dtype=float) print x_0.dtype print x_3.dtype print x_4.dtype print x_5.dtype print x_6.dtype print x_7.dtype print l.dtype  Output: to_parse: ['95 - 67'] float64 float64 float64 float64 float64 float64 float64 Then, we stack all of the variables together in the format required by MARS algorithm: variables = np.hstack((w, l, x_0, x_3, x_4, x_5, x_6, x_7)) #variables The model by itself is just the following two lines of code: model = Earth() model.fit(variables,y) Output: Earth(penalty=None, min_search_points=None, endspan_alpha=None, check_every=None, max_terms=None, max_degree=None, minspan_alpha=None, thresh=None, minspan=None, endspan=None, allow_linear=None) At this point, we can see the main results of the model, including variables, their coefficients, and main metrics: • MSE •$R^2$print model.summary() Output: Earth Model ------------------------------------- Basis Function Pruned Coefficient ------------------------------------- (Intercept) No 1.73036 h(x7-3) No 0.691092 h(3-x7) No -0.839318 h(x2-1992) No -0.0402292 h(1992-x2) Yes None h(x7-2) No -0.698699 h(2-x7) Yes None x0 Yes None x1 Yes None ------------------------------------- MSE: 0.0270, GCV: 0.0378, RSQ: 0.8267, GRSQ: 0.7612  We can see that five features were included in the final model, and that the overall perfomance in quite good:$R^2 = 0.8267$is a good indicator together with$MSE = 0.027$We can compare this fit with OLS. Here we see that$R^2_{OLS} = 0.87$, which is even better then$R^2_{MARS}$; however, but the values are still fairly comparable. In addition, we can see that$MSE_{OLS} = 0.80$is much larger than$MSE_{MARS}$. Thus, overall the MARS model shows better performance than OLS. import statsmodels.api as sm model = sm.OLS(y.ravel(), variables) results = model.fit() print "MSE = ",results.mse_total print "R-squared = ",results.rsquared Output: MSE = 0.807339449541 R-squared = 0.870987820528 As another check on the model’s performance, we can determine how accurate it is for a particular threshold. The following example uses the threshold$0.4\$.

resp_hat = model.predict(variables)

results = ((resp_hat > 0.4).astype(int).reshape(len(resp_hat),1) == y).astype(int)
accuracy = float(results.sum()) / len(results)

print "accuracy =", accuracy

Output:

accuracy = 0.963302752294

In this post, we showed how a MARS model can be easily applied in Python, and how this model can be highly accurate.

(The example in this article used the following packages: urllib2, sys, bs4, pandas, numpy, pyearth.)

Public documentation
Find All the Information to Succeed with DataRobot and Machine Learning