Additional material for the course “Introduction to Data Mining”

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Logistic regression is a probabilistic classification model that predicts the probability p of a binary outcome given a set of features x_1, x_2, ...:

\[\mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]

Logistic regression can be thought of as a linear regression with the log odds (the logit of p) as the dependent variable. Solving for p shows that this is equivalent to

\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...)}}\]
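The equivalence of the two forms can be checked numerically. The following is a minimal sketch using made-up coefficients and feature values (`beta` and `feat` are illustrative assumptions, not estimates from any data). It relies on the base R functions `plogis()` (the logistic function) and `qlogis()` (the logit).

```r
# Hypothetical coefficients and feature values, chosen only for illustration
beta <- c(-1.5, 0.8, 0.3)   # beta_0, beta_1, beta_2
feat <- c(1, 2.0, -1.0)     # the leading 1 multiplies the intercept beta_0

eta <- sum(beta * feat)             # linear predictor, i.e., the log odds
p_manual  <- 1 / (1 + exp(-eta))    # probability from the formula above
p_builtin <- plogis(eta)            # base R's logistic function

# Both forms give the same probability, and the logit maps it back to eta
all.equal(p_manual, p_builtin)
all.equal(log(p_manual / (1 - p_manual)), eta)
all.equal(qlogis(p_builtin), eta)
```

All three comparisons should evaluate to `TRUE`, since the logit and the logistic function are inverses of each other.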

Load and shuffle the data. We also add a useless variable (random noise) to see whether the logistic regression identifies it as irrelevant.

```
data(iris)
# shuffle the rows
x <- iris[sample(nrow(iris)), ]
# add a column of random noise that is unrelated to the species
x <- cbind(x, useless = rnorm(nrow(x)))
```

Turn Species into a binary class variable. We will predict whether or not a flower is of the species virginica.

```
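# TRUE for virginica, FALSE for the other two species
x$virginica <- x$Species == "virginica"
x$Species <- NULL
# scatter plot matrix: virginica uses color 2, all other flowers color 1
plot(x, col = x$virginica + 1)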
```
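Before fitting a model, it can help to look at the class balance of the new binary variable. The sketch below recomputes the indicator directly from `iris` so that it runs on its own; the variable name `virginica` is used here only for illustration. Since iris contains 50 flowers of each of its three species, one third of the observations belong to the positive class.

```r
data(iris)
# logical indicator: TRUE for virginica, FALSE for setosa and versicolor
virginica <- iris$Species == "virginica"

table(virginica)   # number of FALSE and TRUE cases
mean(virginica)    # proportion of positive (virginica) cases
```

A classifier that always predicts "not virginica" would therefore already be correct for two thirds of the flowers, which is a useful baseline to keep in mind when judging accuracy.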