Additional material for the course “Introduction to Data Mining”

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Introduction

Logistic regression is a probabilistic classification model that predicts the probability of a binary outcome given a set of features. It models the logit (the log odds) of that probability as a linear function of the features:

\[\mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]

Logistic regression can be thought of as a linear regression with the log odds as the dependent variable. Solving for p gives the equivalent form:

\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...)}}\]
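As a numerical check of the two formulas above, the following minimal sketch uses base R's `qlogis()` (the logit) and `plogis()` (its inverse, the logistic function). The coefficients `beta0` and `beta1` and the feature value `x1` are made-up numbers chosen for illustration only, not estimates from any data.

```r
# Hypothetical coefficients and feature value (illustration only)
beta0 <- -1
beta1 <- 0.5
x1 <- 4

# linear predictor, i.e., the log odds
eta <- beta0 + beta1 * x1

# both forms of the inverse logit give the same probability
p_exp <- exp(eta) / (1 + exp(eta))
p_inv <- 1 / (1 + exp(-eta))
p_builtin <- plogis(eta)

# applying the logit to the probability recovers the linear predictor
logit_manual <- log(p_exp / (1 - p_exp))
logit_builtin <- qlogis(p_exp)

print(c(p_exp = p_exp, p_inv = p_inv, p_builtin = p_builtin))
print(c(eta = eta, logit_manual = logit_manual, logit_builtin = logit_builtin))
```

With these values the log odds are 1, which corresponds to a probability of about 0.73.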

Load and shuffle the data. We also add a useless variable (random noise) to see whether the logistic regression identifies it as unimportant.

data(iris)
# shuffle the rows
x <- iris[sample(nrow(iris)), ]
# add a random noise column that carries no information about the species
x <- cbind(x, useless = rnorm(nrow(x)))
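Because `sample()` and `rnorm()` draw random numbers, the shuffled order and the `useless` column differ between runs. A minimal sketch of a reproducible variant follows; the call to `set.seed()` and the seed value 1234 are assumptions added here and are not part of the original code.

```r
data(iris)
set.seed(1234)  # assumption: arbitrary seed, not in the original code

x <- iris[sample(nrow(iris)), ]
x <- cbind(x, useless = rnorm(nrow(x)))

# shuffling only reorders rows: still 150 flowers, 50 per species
print(dim(x))
print(table(x$Species))

# first rows of the shuffled data with the added noise column
print(head(x))
```

Fixing the seed is only needed when the same shuffle and noise values should be reproduced exactly in a later run.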

Turn Species into a binary classification problem: we classify whether or not a flower is of the species virginica.

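# TRUE for virginica flowers, FALSE otherwise
x$virginica <- x$Species == "virginica"
x$Species <- NULL
# scatterplot matrix; FALSE/TRUE + 1 selects palette colors 1/2
plot(x, col=x$virginica+1)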
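The expression `x$virginica + 1` in the plot call relies on R coercing logical values to numbers. The sketch below shows this coercion on a small hand-made vector and checks the class balance of the binary target on the unshuffled `iris` data. The object names `is_virginica` and `flags` are introduced here for illustration only.

```r
data(iris)

# logical target: TRUE for virginica, FALSE for setosa and versicolor
is_virginica <- iris$Species == "virginica"

# class balance: 100 FALSE, 50 TRUE
print(table(is_virginica))

# adding 1 coerces FALSE/TRUE to the color indices 1/2
flags <- c(FALSE, TRUE, FALSE)
print(flags + 1)   # 1 2 1
```

The target is imbalanced (one third of the flowers are virginica), which is worth keeping in mind when judging classification accuracy.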