Additional material for the course “Introduction to Data Mining”

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Introduction

Linear regression models the value of a dependent variable \(y\) (also called response) as a linear function of independent variables \(X_1, X_2, ..., X_p\) (also called regressors, predictors, exogenous variables or covariates). Given \(n\) observations the model is: \[y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i \]

where \(\beta_0\) is the intercept, \(\beta\) is a \(p\)-dimensional parameter vector learned from the data, and \(\epsilon_i\) is the error term (its estimates are called residuals).
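As a minimal sketch with made-up data, the parameters can be estimated in R with lm(), which fits the model by least squares:

set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.5)  # true beta_0 = 2, beta_1 = 3, beta_2 = -1

fit <- lm(y ~ x1 + x2)
coef(fit)             # estimated intercept and slopes
head(residuals(fit))  # first few residuals (estimates of the error term)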

Linear regression makes several assumptions:

- Linearity: the relationship between the predictors and the response is linear.
- Independence: the errors are independent of each other.
- Homoscedasticity: the errors have constant variance.
- Normality: the errors are normally distributed.
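These assumptions can be checked visually using R's standard diagnostic plots for a fitted model. A sketch, assuming a fitted lm object like fit from the sketch above:

# residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage plots
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))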

A First Model

set.seed(2000)

Load and shuffle the data (the flowers in the iris dataset are ordered by species).

data(iris)
x <- iris[sample(1:nrow(iris)),]
plot(x, col=x$Species)

Make the data a little messy by adding random noise and add a useless feature that carries no information about the species.

x[,1] <- x[,1] + rnorm(nrow(x))
x[,2] <- x[,2] + rnorm(nrow(x))
x[,3] <- x[,3] + rnorm(nrow(x))
x <- cbind(x[,-5], useless = mean(x[,1]) + rnorm(nrow(x)), Species = x[,5])
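
A quick look at the modified data (sketch) shows the first few rows and the distribution of the new useless column:

head(x)
summary(x$useless)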

plot(x, col=x$Species)