Additional material for the course “Introduction to Data Mining”

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.


Logistic regression is a probabilistic statistical classification model that predicts the probability of a binary outcome given a set of features.

Logistic regression can be thought of as a linear regression with the log odds (logit) as the dependent variable:

\[\mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]

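# logit maps a probability in (0, 1) to the log odds on the real line
logit <- function(p) log(p / (1 - p))
x <- seq(0, 1, length.out = 100)
plot(x, logit(x), type = "l")
# reference lines: a probability of 0.5 corresponds to log odds of 0
abline(v = 0.5, lty = 2)
abline(h = 0, lty = 2)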

This is equivalent to modeling the probability \(p\) by

\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...)}}\]
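As a quick numerical check of this equivalence, the following sketch defines the inverse of the logit (the logistic function) and verifies that it maps log odds back to the original probabilities. The names `inv_logit` and `log_odds` are introduced here for illustration only; base R offers the same transformations as `qlogis()` and `plogis()`.

```r
# logit and its inverse, written out as in the formulas
logit <- function(p) log(p / (1 - p))
inv_logit <- function(z) 1 / (1 + exp(-z))   # illustrative helper name

p <- c(0.1, 0.5, 0.9)
log_odds <- logit(p)

# applying the inverse to the log odds recovers the probabilities
all.equal(inv_logit(log_odds), p)

# base R equivalents: qlogis() is the logit, plogis() its inverse
all.equal(qlogis(p), log_odds)
all.equal(plogis(log_odds), p)
```

Each `all.equal()` call should report `TRUE`, up to floating-point tolerance.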

Prepare the Data

Load and shuffle the data. We also add a useless variable (random noise) to see whether the logistic regression identifies it as irrelevant.

x <- iris[sample(1:nrow(iris)),]
x <- cbind(x, useless = rnorm(nrow(x)))

Turn Species into a binary classification problem: we will classify whether a flower is of the species Virginica.

x$virginica <- x$Species == "virginica"
x$Species <- NULL
plot(x, col=x$virginica+1)
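Because only one of the three iris species counts as the positive class, the new target is not balanced. A minimal sketch to check this is shown below; it rebuilds the binary target under the hypothetical name `d` (without shuffling or the useless variable) so that it runs on its own.

```r
# rebuild the binary target from the built-in iris data
d <- iris
d$virginica <- d$Species == "virginica"
d$Species <- NULL

# counts of non-virginica (FALSE) and virginica (TRUE) flowers
table(d$virginica)

# proportion of the positive class
mean(d$virginica)
```

Since iris contains 50 flowers of each of its three species, one third of the observations belong to the positive class.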