This post was originally part of the DataRobot Community.
This article introduces Platt Scaling, a popular calibration method. For many problems, it is convenient to obtain a probability P(y=1|x), i.e., a classification that not only gives an answer but also a degree of certainty about that answer. However, some classification models (such as SVMs and decision trees) do not provide such a probability, or they provide poor probability estimates.
Platt Scaling amounts to training a logistic regression model on the classifier's outputs; it provides a way of transforming the outputs of a non-probabilistic classification model into a probability distribution over classes.
We will see an example where we train an SVM and then train the parameters of an additional sigmoid function to map the SVM outputs into probabilities.
\( \mathrm{P}(y=1 \mid x) = \frac{1}{1 + \exp(Af(x) + B)} \)
We would like to obtain A and B, the two scalar parameters learned by the algorithm.
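To make the formula concrete, here is a minimal sketch of the sigmoid mapping in R, using made-up values for A and B (they have not been estimated yet); f stands for the raw SVM decision value:
platt_prob <- function(f, A, B) 1 / (1 + exp(A * f + B))  # Platt's sigmoid
platt_prob(2.0, A = -1.5, B = 0.2)   # a large positive score maps to a probability near 1
platt_prob(-2.0, A = -1.5, B = 0.2)  # a large negative score maps to a probability near 0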
This idea was suggested in Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods published in 1999 by John C. Platt.
Packages Needed
caret
kernlab
Functions Needed
createDataPartition()
train()
predict()
We will use spam, a spam email database that comes with the kernlab package.
Dataset Description: A dataset collected at Hewlett-Packard Labs that classifies 4601 emails as spam or non-spam. In addition to this class label, there are 57 variables indicating the frequency of certain words and characters in the email.
The last column (i.e., variable 58) indicates the type of the mail and is either “nonspam” or “spam” (i.e., unsolicited commercial email).
Load the caret and kernlab packages:
library(caret)
library(kernlab)
Load the spam data:
data(spam)
Create training and test sets:
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE) # creates a 75/25 training/test partition and returns training set indices
training <- spam[inTrain,] # Training Set
testing <- spam[-inTrain,] # Test Set
dim(training)
## [1] 3451 58
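createDataPartition() performs a stratified split on the outcome, so the class proportions in the training set should closely match those of the full dataset; a quick sanity check (output not shown):
prop.table(table(spam$type))      # class proportions in the full dataset
prop.table(table(training$type))  # should be approximately the same in the training set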
Fit predictive models over different tuning parameters:
set.seed(32343) # to allow reproducibility of results
modelFit <- train(type ~.,data=training, method="svmLinear") # Use the 'type' variable as labels; 'training' data to train
modelFit
## Support Vector Machines with Linear Kernel
##
## 3451 samples
## 57 predictors
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9 0.8 0.007 0.01
##
## Tuning parameter 'C' was held constant at a value of 1
##
Final model using the best parameters
In the trainControl() statement, you must specify classProbs = TRUE if the class probabilities are to be returned.
modelFit <- train(type ~.,data=training, method="svmLinear", trControl = trainControl(method = "repeatedcv", repeats = 2,
classProbs = TRUE))
modelFit$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 686
##
## Objective Function Value : -638
## Training error : 0.067806
## Probability model included.
Make test data predictions
Note: With type = "prob", the returned values are the class probabilities themselves. (The underlying kernlab predict method also offers a "votes" type, but that option is not available through caret's predict.)
predictProbs <- predict(modelFit,newdata=testing, type="prob")
head(predictProbs)
## nonspam spam
## 1 5.22e-05 1.000
## 2 3.07e-01 0.693
## 3 4.24e-01 0.576
## 4 8.45e-03 0.992
## 5 1.22e-01 0.878
## 6 7.43e-02 0.926
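For comparison, hard class labels (rather than probabilities) can be obtained by omitting the type argument, and caret's confusionMatrix() can then summarize test-set performance; a short sketch:
predictClasses <- predict(modelFit, newdata=testing)   # predicted class labels
confusionMatrix(predictClasses, testing$type)          # accuracy, kappa, sensitivity, etc.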
Train a Logistic Regression model:
labels <- testing$type                                  # true classes of the test set
labels <- as.numeric(labels)-1                          # convert factor to 0/1 (nonspam = 0, spam = 1)
processed_data <- data.frame(predictProbs[,2],labels)   # pair the SVM 'spam' output with the true label
LOGISTIC_model <- train(labels ~.,data=processed_data, method="glm",family=binomial(logit)) # fit the calibration sigmoid via logistic regression
LOGISTIC_model$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) predictProbs...2.
## -3.78 8.76
##
## Degrees of Freedom: 1149 Total (i.e. Null); 1148 Residual
## Null Deviance: 1540
## Residual Deviance: 449 AIC: 453
Display the Logistic Regression model coefficients:
LOGISTIC_model$finalModel$coefficients
## (Intercept) predictProbs...2.
## -3.78 8.76
A and B are now estimated: up to the sign convention in Platt's formula above, they are the slope and intercept of this logistic regression.
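As a final sketch, the fitted coefficients can be applied directly to map SVM outputs to calibrated probabilities (reusing the objects created above):
b <- LOGISTIC_model$finalModel$coefficients   # intercept and slope of the calibration fit
svm_score <- predictProbs[, 2]                # uncalibrated SVM output for the "spam" class
calibrated <- 1 / (1 + exp(-(b[1] + b[2] * svm_score)))   # logistic transform of the SVM output
head(calibrated)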
References
Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers. MIT Press.