Additional material for the course “Introduction to Data Mining”

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Introduction

Logistic regression is a probabilistic statistical classification model that predicts the probability of a binary outcome given a set of features.

Logistic regression can be thought of as a linear regression with the log odds (logit) of the probability as the dependent variable:

\[logit(p) = ln(\frac{p}{1-p}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]

# the logit (log odds) function maps probabilities in (0, 1) to the real line
logit <- function(p) log(p/(1-p))

x <- seq(0, 1, length.out = 100)
plot(x, logit(x), type = "l")
abline(v = 0.5, lty = 2)
abline(h = 0, lty = 2)

This is equivalent to modeling the probability \(p\) by

\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...)}}\]
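The function that maps the linear combination back to a probability is the logistic (sigmoid) function, the inverse of the logit. As a quick illustration mirroring the logit plot above (a minimal sketch; the names sigmoid and z are our own):

# the logistic (sigmoid) function is the inverse of the logit
sigmoid <- function(z) 1/(1 + exp(-z))

z <- seq(-6, 6, length.out = 100)
plot(z, sigmoid(z), type = "l", ylab = "p")
abline(h = 0.5, lty = 2)  # p = 0.5 corresponds to logit = 0
abline(v = 0, lty = 2)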

Prepare the Data

Load and shuffle the data. We also add a useless random variable to see if variable selection removes it.

data(iris)
# shuffle the rows and add a useless random predictor
x <- iris[sample(1:nrow(iris)), ]
x <- cbind(x, useless = rnorm(nrow(x)))

Make Species into a binary classification problem: we will classify whether a flower is of the species virginica.

x$virginica <- x$Species == "virginica"  # binary target
x$Species <- NULL                        # remove the original class label
plot(x, col = x$virginica + 1)           # virginica flowers in red

Create a Logistic Regression Model

Logistic regression is a generalized linear model (GLM) with logit as the link function and a binomial error model.

model <- glm(virginica ~ .,
  family = binomial(logit), data = x)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

About the warning glm.fit: fitted probabilities numerically 0 or 1 occurred: it means that the data is possibly linearly separable.
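A quick way to check this is to inspect the range of the fitted probabilities (a small sketch added for illustration; it is not part of the original output):

# fitted probabilities numerically indistinguishable from 0 or 1
# indicate quasi-complete separation
range(fitted(model))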

model
## 
## Call:  glm(formula = virginica ~ ., family = binomial(logit), data = x)
## 
## Coefficients:
##  (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
##     -45.9999       -2.3098       -7.0826        9.4651       20.2878  
##      useless  
##      -0.3406  
## 
## Degrees of Freedom: 149 Total (i.e. Null);  144 Residual
## Null Deviance:       191 
## Residual Deviance: 11.76     AIC: 23.76

Check which features are significant.

summary(model)
## 
## Call:
## glm(formula = virginica ~ ., family = binomial(logit), data = x)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.93175  -0.00042   0.00000   0.00029   1.83529  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -45.9999    30.3565  -1.515   0.1297  
## Sepal.Length  -2.3098     2.4637  -0.938   0.3485  
## Sepal.Width   -7.0826     4.6820  -1.513   0.1303  
## Petal.Length   9.4651     4.9890   1.897   0.0578 .
## Petal.Width   20.2878    11.9326   1.700   0.0891 .
## useless       -0.3406     0.9343  -0.365   0.7155  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.954  on 149  degrees of freedom
## Residual deviance:  11.763  on 144  degrees of freedom
## AIC: 23.763
## 
## Number of Fisher Scoring iterations: 12

The AIC (Akaike information criterion) reported above can be used for model selection; smaller values indicate a better trade-off between goodness of fit and model complexity.
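The AIC of a fitted model can also be extracted directly (a one-line sketch):

# extract the AIC of the full model (lower is better)
AIC(model)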

Do Stepwise Variable Selection

model2 <- step(model, data = x)
## Start:  AIC=23.76
## virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + 
##     useless
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                Df Deviance    AIC
## - useless       1   11.899 21.899
## - Sepal.Length  1   12.848 22.848
## <none>              11.762 23.763
## - Sepal.Width   1   15.482 25.482
## - Petal.Width   1   23.376 33.376
## - Petal.Length  1   24.706 34.707
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=21.9
## virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                Df Deviance    AIC
## - Sepal.Length  1   13.266 21.266
## <none>              11.899 21.899
## - Sepal.Width   1   15.492 23.492
## - Petal.Width   1   23.772 31.772
## - Petal.Length  1   25.902 33.902
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=21.27
## virginica ~ Sepal.Width + Petal.Length + Petal.Width
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                Df Deviance    AIC
## <none>              13.266 21.266
## - Sepal.Width   1   20.564 26.564
## - Petal.Length  1   27.399 33.399
## - Petal.Width   1   31.512 37.512
summary(model2)
## 
## Call:
## glm(formula = virginica ~ Sepal.Width + Petal.Length + Petal.Width, 
##     family = binomial(logit), data = x)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.75795  -0.00043   0.00000   0.00026   1.92193  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -50.527     23.995  -2.106   0.0352 *
## Sepal.Width    -8.376      4.761  -1.759   0.0785 .
## Petal.Length    7.875      3.841   2.050   0.0403 *
## Petal.Width    21.430     10.707   2.001   0.0453 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.954  on 149  degrees of freedom
## Residual deviance:  13.266  on 146  degrees of freedom
## AIC: 21.266
## 
## Number of Fisher Scoring iterations: 12

The estimates (\(\beta_0, \beta_1,...\)) are log-odds and can be converted into odds using \(exp(\beta)\). A negative log-odds value means that the odds go down with an increase in the value of the predictor, while a predictor with a positive log-odds value increases the odds. In this case, the odds of seeing a virginica iris go down with Sepal.Width and increase with the other two predictors.
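To convert the coefficients of the selected model into odds, a one-line sketch:

# multiplicative change in the odds for a one-unit increase in each predictor
exp(coef(model2))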

Calculate Response

Note: we do in-sample testing here, predicting on the same data the model was trained on. To get a generalization error estimate, you should use a test set or cross-validation!
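For illustration, a minimal holdout sketch could look like the following (the seed, split size, and names train_id, model_train, and pr_test are arbitrary choices, not part of the original analysis). Below, for simplicity, we continue with in-sample predictions.

# hold out 50 of the 150 flowers as a test set
set.seed(2000)
train_id <- sample(nrow(x), 100)
model_train <- glm(virginica ~ ., family = binomial(logit),
  data = x[train_id, ])

# estimate the generalization error on the unseen test set
pr_test <- predict(model_train, x[-train_id, ], type = "response")
table(actual = x$virginica[-train_id], predicted = pr_test > .5)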

# predicted probabilities (the "response") for the training data
pr <- predict(model2, x, type = "response")
round(pr, 2)
##   25   60  136    1    4  132   87  112  104   92   49   23   46    6   54 
## 0.00 0.00 1.00 0.00 0.00 1.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 
##   76  135  115   35   41  126   68    7  133   13  149   42  142   93    3 
## 0.00 0.86 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00 
##   15   36   65   52  130   82  114  139   21   80   57   56   20   71    5 
## 0.00 0.00 0.00 0.00 0.99 0.00 1.00 0.67 0.00 0.00 0.00 0.00 0.00 0.28 0.00 
##  127  111   14   48  123  103  109  121  118  128   61   26   59  102  100 
## 0.92 1.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 0.82 0.00 0.00 0.00 1.00 0.00 
##   27   72  145   17   37  150   11   22   34   85   47   66   91  146   18 
## 0.00 0.00 1.00 0.00 0.00 0.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 
##  141  106  140   39   31   96   69   67   24  147   84   10   38   73  117 
## 1.00 1.00 1.00 0.00 0.00 0.00 0.20 0.00 0.00 1.00 0.79 0.00 0.00 0.32 1.00 
##   58   29   79   28   95  116  144   78   19    2   98   43    9   77   90 
## 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
##  124  120   74  113  107   97   75   99   45   44   63   53   89  105   16 
## 0.98 0.93 0.00 1.00 0.60 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 
##  125   81   32  143  148  129  119  110  137  108   40   86   51  138   30 
## 1.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 
##   70   64  131   88    8   94   12   83  134   50   55   33  122  101   62 
## 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.16 0.00 0.00 0.00 1.00 1.00 0.00
hist(pr, breaks = 20)
# overlay the probabilities of the actual virginica flowers in red
hist(pr[x$virginica == TRUE], col = "red", breaks = 20, add = TRUE)

Check Classification Performance

table(actual = x$virginica, predicted = pr > .5)
##        predicted
## actual  FALSE TRUE
##   FALSE    98    2
##   TRUE      1   49
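From the diagonal of this table, the training accuracy is (98 + 49)/150 = 0.98. A short sketch to compute it (tbl is our own name):

# proportion of correctly classified flowers
tbl <- table(actual = x$virginica, predicted = pr > .5)
sum(diag(tbl)) / sum(tbl)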