Additional material for the course “Introduction to Data Mining”

# Introduction

Logistic regression is a probabilistic statistical classification model used to predict a binary outcome (as a probability) given a set of features.

Logistic regression can be thought of as linear regression with the log odds (logit) as the dependent variable:

$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...$

```r
# The logit (log odds) maps probabilities in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))

x <- seq(0, 1, length.out = 100)
plot(x, logit(x), type = "l")
abline(v = 0.5, lty = 2)  # p = 0.5 corresponds to logit(p) = 0
abline(h = 0, lty = 2)
```

This is equivalent to modeling the probability $p$ by

$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...)}}$
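The right-hand side is the logistic (sigmoid) function, the inverse of the logit. As a minimal sketch (the function name `sigmoid` and the grid `z` are my own choices, not part of the original), we can plot it to see how it squeezes any real-valued linear predictor into a probability:

```r
# Logistic (sigmoid) function: the inverse of the logit,
# mapping any real number to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- seq(-6, 6, length.out = 100)
plot(z, sigmoid(z), type = "l")
abline(h = 0.5, lty = 2)  # sigmoid(0) = 0.5, the decision boundary
abline(v = 0, lty = 2)
```

Note that the dashed lines mirror the logit plot above: a linear predictor of 0 corresponds to a probability of 0.5.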

# Prepare the Data

Load and shuffle the data. We also add a useless variable to see if logistic regression removes it.

```r
data(iris)

# Shuffle the rows and add a useless random variable
x <- iris[sample(1:nrow(iris)), ]
x <- cbind(x, useless = rnorm(nrow(x)))
```

Make `Species` into a binary classification problem: we will classify whether a flower is of the species Virginica.

```r
x$virginica <- x$Species == "virginica"
x$Species <- NULL

plot(x, col = x$virginica + 1)  # color points by class (FALSE = black, TRUE = red)
```
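As a quick sanity check on the new binary target (an extra step I am adding here, not part of the original preparation), we can tabulate the class distribution:

```r
# Build the binary target as above and count the classes.
# Virginica is 50 of the 150 iris flowers, so the classes are imbalanced 1:2.
data(iris)
x <- iris[sample(1:nrow(iris)), ]
x$virginica <- x$Species == "virginica"
table(x$virginica)
```

An imbalance like this is worth knowing about before fitting and evaluating a classifier, since accuracy alone can be misleading.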