Additional material for the course “Introduction to Data Mining”

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Logistic regression is a probabilistic classification model that predicts the probability of a binary outcome given a set of features.

Logistic regression can be thought of as a linear regression with the log odds (logit) of the outcome as the dependent variable:

\[\mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]

```
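# logit maps a probability in (0, 1) to the log odds
logit <- function(p) log(p/(1-p))
x <- seq(0, 1, length.out = 100)
# the end points 0 and 1 map to -Inf and Inf and are not drawn
plot(x, logit(x), type = "l")
# reference lines: logit(0.5) = 0
abline(v=0.5, lty = 2)
abline(h=0, lty = 2)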
```
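To make the log odds more concrete, the following small sketch converts a few probabilities to odds and then to the logit. The helper `odds` and the example probabilities are illustrative choices and not part of the original material.

```r
# illustrative helper: odds of an event that has probability p
odds <- function(p) p / (1 - p)
logit <- function(p) log(odds(p))

p <- c(0.2, 0.5, 0.8)
odds(p)    # approximately 0.25, 1 and 4
logit(p)   # approximately -log(4), 0 and log(4): symmetric around p = 0.5
```

A probability of 0.5 corresponds to even odds and a logit of 0, which is where the two dashed reference lines in the plot cross.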

This is equivalent to modeling the probability \(p\) by

\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...)}}\]
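The equivalence can be checked numerically. The following is a minimal sketch: the function name `inv_logit`, the coefficient values and the feature values are hypothetical and chosen only for illustration. Base R's `plogis()` computes the same logistic function.

```r
logit <- function(p) log(p / (1 - p))
# inverse of the logit (the logistic function); the name is illustrative
inv_logit <- function(z) 1 / (1 + exp(-z))

# hypothetical coefficients and a single observation with two features
beta <- c(-1, 0.5, 2)       # beta_0, beta_1, beta_2
features <- c(1.2, 0.3)     # x_1, x_2
z <- beta[1] + sum(beta[-1] * features)   # linear predictor
p <- inv_logit(z)

# both forms of the formula agree, and logit() recovers the linear predictor
all.equal(p, exp(z) / (1 + exp(z)))
all.equal(logit(p), z)
all.equal(p, plogis(z))
```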

Load and shuffle data. We also add a useless variable to see if the logistic regression removes it.

```
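data(iris)
# shuffle the rows by sampling a random permutation of the row indices
x <- iris[sample(1:nrow(iris)),]
# add a column of standard normal noise that is unrelated to the species
x <- cbind(x, useless = rnorm(nrow(x)))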
```
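Because `sample()` and `rnorm()` draw random numbers, each run produces a different row order and a different `useless` column. If reproducibility matters, one option is to fix the random seed first. A sketch, where the seed value 1234 and the name `shuffled` are arbitrary choices:

```r
data(iris)
set.seed(1234)  # arbitrary seed, an assumption made for this sketch
shuffled <- iris[sample(nrow(iris)), ]
shuffled <- cbind(shuffled, useless = rnorm(nrow(shuffled)))

# shuffling reorders the rows but keeps all observations
dim(shuffled)
# the noise column is generated independently of all other columns
head(shuffled)
```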

Turn Species into a binary classification problem: we will classify whether a flower belongs to the species virginica.

```
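# logical class label: TRUE for virginica, FALSE for the other two species
x$virginica <- x$Species == "virginica"
# drop the original label so it cannot be used as a predictor
x$Species <- NULL
# scatter plot matrix; FALSE/TRUE + 1 selects colors 1 and 2
plot(x, col=x$virginica+1)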
```
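The expression `x$virginica + 1` works because R coerces logical values to 0 and 1 in arithmetic. The sketch below shows this coercion and the resulting class sizes. It rebuilds the label on a fresh copy of `iris` under the hypothetical name `d` so that it does not modify `x`.

```r
data(iris)
d <- iris   # hypothetical name, used to avoid overwriting x
d$virginica <- d$Species == "virginica"

# FALSE and TRUE become 0 and 1, so adding 1 gives the color indices 1 and 2
color_index <- c(FALSE, TRUE) + 1
color_index
palette()[color_index]   # the two colors used in the plot

# class sizes: 100 flowers are not virginica, 50 are
class_counts <- table(d$virginica)
class_counts
```

The two classes are not balanced: only one third of the flowers belong to the positive class.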