Additional material for the course “Introduction to Data Mining”
This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.
Logistic regression is a probabilistic statistical classification model used to predict a binary outcome (as a probability) given a set of features.
Logistic regression can be thought of as a linear regression with the log odds (logit) as the dependent variable:
\[\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots\]
# the logit (log odds) function
logit <- function(p) log(p / (1 - p))
x <- seq(0, 1, length.out = 100)
plot(x, logit(x), type = "l")
# dashed reference lines: a probability of 0.5 corresponds to a logit of 0
abline(v = 0.5, lty = 2)
abline(h = 0, lty = 2)
This is equivalent to modeling the probability \(p\) by
\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots)}}\]
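The right-hand side is the logistic (sigmoid) function applied to the linear predictor; it maps any real value into the interval \((0, 1)\). As a quick illustration (a minimal sketch; the helper name sigmoid is our own), we can plot it:
sigmoid <- function(z) 1 / (1 + exp(-z))
z <- seq(-10, 10, length.out = 100)
plot(z, sigmoid(z), type = "l")
# a linear predictor of 0 corresponds to a probability of 0.5
abline(h = 0.5, lty = 2)
abline(v = 0, lty = 2)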
Load and shuffle the data. We also add a useless random variable to see if logistic regression removes it.
data(iris)
x <- iris[sample(1:nrow(iris)),]
x <- cbind(x, useless = rnorm(nrow(x)))
Make Species into a binary classification problem: we will classify whether a flower is of the species virginica.
x$virginica <- x$Species == "virginica"
x$Species <- NULL
# scatterplot matrix with virginica flowers shown in red
plot(x, col = x$virginica + 1)
Logistic regression is a generalized linear model (GLM) with logit as the link function and a binomial error model.
model <- glm(virginica ~ ., family = binomial(logit), data = x)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
About the warning “glm.fit: fitted probabilities numerically 0 or 1 occurred”: it means that the data is possibly linearly separable.
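To see why separability is plausible here, we can look at two of the strongest predictors (a quick diagnostic sketch, not part of the original analysis; the choice of predictors is ours):
# virginica flowers (red) are almost perfectly separated by a line
plot(x$Petal.Length, x$Petal.Width, col = x$virginica + 1)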
model
##
## Call: glm(formula = virginica ~ ., family = binomial(logit), data = x)
##
## Coefficients:
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## -45.9999 -2.3098 -7.0826 9.4651 20.2878
## useless
## -0.3406
##
## Degrees of Freedom: 149 Total (i.e. Null); 144 Residual
## Null Deviance: 191
## Residual Deviance: 11.76 AIC: 23.76
Check which features are significant.
summary(model)
##
## Call:
## glm(formula = virginica ~ ., family = binomial(logit), data = x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.93175 -0.00042 0.00000 0.00029 1.83529
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -45.9999 30.3565 -1.515 0.1297
## Sepal.Length -2.3098 2.4637 -0.938 0.3485
## Sepal.Width -7.0826 4.6820 -1.513 0.1303
## Petal.Length 9.4651 4.9890 1.897 0.0578 .
## Petal.Width 20.2878 11.9326 1.700 0.0891 .
## useless -0.3406 0.9343 -0.365 0.7155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 190.954 on 149 degrees of freedom
## Residual deviance: 11.763 on 144 degrees of freedom
## AIC: 23.763
##
## Number of Fisher Scoring iterations: 12
The AIC (Akaike information criterion) can be used for model selection. Here, step() performs backward selection, dropping a variable in each step as long as this decreases the AIC.
model2 <- step(model, data = x)
## Start: AIC=23.76
## virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width +
## useless
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - useless 1 11.899 21.899
## - Sepal.Length 1 12.848 22.848
## <none> 11.762 23.763
## - Sepal.Width 1 15.482 25.482
## - Petal.Width 1 23.376 33.376
## - Petal.Length 1 24.706 34.707
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=21.9
## virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Sepal.Length 1 13.266 21.266
## <none> 11.899 21.899
## - Sepal.Width 1 15.492 23.492
## - Petal.Width 1 23.772 31.772
## - Petal.Length 1 25.902 33.902
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=21.27
## virginica ~ Sepal.Width + Petal.Length + Petal.Width
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 13.266 21.266
## - Sepal.Width 1 20.564 26.564
## - Petal.Length 1 27.399 33.399
## - Petal.Width 1 31.512 37.512
summary(model2)
##
## Call:
## glm(formula = virginica ~ Sepal.Width + Petal.Length + Petal.Width,
## family = binomial(logit), data = x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.75795 -0.00043 0.00000 0.00026 1.92193
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -50.527 23.995 -2.106 0.0352 *
## Sepal.Width -8.376 4.761 -1.759 0.0785 .
## Petal.Length 7.875 3.841 2.050 0.0403 *
## Petal.Width 21.430 10.707 2.001 0.0453 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 190.954 on 149 degrees of freedom
## Residual deviance: 13.266 on 146 degrees of freedom
## AIC: 21.266
##
## Number of Fisher Scoring iterations: 12
The estimates (\(\beta_0, \beta_1, \dots\)) are log-odds and can be converted into odds using \(\exp(\beta)\). A negative log-odds ratio means that the odds go down with an increase in the value of the predictor, while a predictor with a positive log-odds ratio increases the odds. In this case, the odds of seeing a virginica iris go down with Sepal.Width and go up with the other two predictors.
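For example, the odds can be obtained by exponentiating the estimated coefficients (a small illustration; the exact numbers depend on the fit above):
# convert log-odds to odds
exp(coef(model2))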
Note: we perform in-sample evaluation here, i.e., we predict on the same data the model was learned from. To get a generalization error estimate, you should use a test set or cross-validation!
pr <- predict(model2, x, type="response")
round(pr, 2)
## 25 60 136 1 4 132 87 112 104 92 49 23 46 6 54
## 0.00 0.00 1.00 0.00 0.00 1.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
## 76 135 115 35 41 126 68 7 133 13 149 42 142 93 3
## 0.00 0.86 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.00
## 15 36 65 52 130 82 114 139 21 80 57 56 20 71 5
## 0.00 0.00 0.00 0.00 0.99 0.00 1.00 0.67 0.00 0.00 0.00 0.00 0.00 0.28 0.00
## 127 111 14 48 123 103 109 121 118 128 61 26 59 102 100
## 0.92 1.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 0.82 0.00 0.00 0.00 1.00 0.00
## 27 72 145 17 37 150 11 22 34 85 47 66 91 146 18
## 0.00 0.00 1.00 0.00 0.00 0.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00
## 141 106 140 39 31 96 69 67 24 147 84 10 38 73 117
## 1.00 1.00 1.00 0.00 0.00 0.00 0.20 0.00 0.00 1.00 0.79 0.00 0.00 0.32 1.00
## 58 29 79 28 95 116 144 78 19 2 98 43 9 77 90
## 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## 124 120 74 113 107 97 75 99 45 44 63 53 89 105 16
## 0.98 0.93 0.00 1.00 0.60 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00
## 125 81 32 143 148 129 119 110 137 108 40 86 51 138 30
## 1.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00
## 70 64 131 88 8 94 12 83 134 50 55 33 122 101 62
## 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.16 0.00 0.00 0.00 1.00 1.00 0.00
# histogram of predicted probabilities with the true virginica flowers
# overlaid in red
hist(pr, breaks = 20)
hist(pr[x$virginica == TRUE], col = "red", breaks = 20, add = TRUE)
# confusion table using a cutoff of 0.5
table(actual = x$virginica, predicted = pr > .5)
## predicted
## actual FALSE TRUE
## FALSE 98 2
## TRUE 1 49
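As noted above, this accuracy is an in-sample estimate. A minimal sketch of how a holdout estimate could be obtained instead (a single random 2/3-training / 1/3-test split; the split ratio is our choice, and cross-validation would give a more stable estimate):
# split the data into a training and a test set
train <- sample(nrow(x), size = round(2/3 * nrow(x)))
model_train <- glm(virginica ~ Sepal.Width + Petal.Length + Petal.Width,
  family = binomial(logit), data = x[train, ])
# evaluate on the held-out test set
pr_test <- predict(model_train, x[-train, ], type = "response")
table(actual = x$virginica[-train], predicted = pr_test > .5)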