Additional material for the course “Introduction to Data Mining”

Introduction

Linear regression models the value of a dependent variable $$y$$ (also called the response) as a linear function of independent variables $$X_1, X_2, \dots, X_p$$ (also called regressors, predictors, exogenous variables or covariates). Given $$n$$ observations, the model is: $$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i$$

where $$\beta_0$$ is the intercept, $$\beta = (\beta_1, \dots, \beta_p)$$ is a $$p$$-dimensional parameter vector learned from the data, and $$\epsilon_i$$ is the error term (its estimates are called residuals).
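As a quick illustration, here is a minimal sketch of fitting such a model in R with lm(). The choice of the built-in cars dataset is an assumption made for this example; any numeric data would do.

data(cars)
# fit a simple linear model: stopping distance as a function of speed
model <- lm(dist ~ speed, data = cars)
# the estimated intercept (beta_0) and slope (beta_1)
coef(model)
# the residuals are the estimates of the error terms epsilon_i
head(residuals(model))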

Linear regression makes several assumptions:

• Weak exogeneity: The predictor variables are assumed to be measured without error.
• Linearity: There is a linear relationship between the dependent and the independent variables.
• Homoscedasticity: The variance of the error ($$\epsilon$$) is constant and does not change (e.g., increase) with the predicted value.
• Independence of errors: The errors of different observations are uncorrelated.
• No multicollinearity of predictors: Predictors cannot be perfectly correlated, or the parameter vector cannot be identified. Note that highly correlated predictors lead to unstable estimates and should be avoided (a quick check is sketched after this list).
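The following is a minimal sketch of how some of these assumptions can be checked in R, assuming a fitted model object like the one above (the variable names are illustrative):

# residuals vs. fitted values: a visible pattern suggests non-linearity,
# a funnel shape suggests heteroscedasticity
plot(fitted(model), residuals(model))
abline(h = 0, lty = 2)
# pairwise correlations between candidate predictors: values close to
# +/- 1 indicate multicollinearity (shown here for iris as an example)
cor(iris[, 1:4])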

A First Model

set.seed(2000)

Load and shuffle the data (the flowers are sorted by species)

data(iris)
x <- iris[sample(1:nrow(iris)),]
plot(x, col=x$Species)

Make the data a little messy and add a useless feature

x[,1] <- x[,1] + rnorm(nrow(x))
x[,2] <- x[,2] + rnorm(nrow(x))
x[,3] <- x[,3] + rnorm(nrow(x))
x <- cbind(x[,-5], useless = mean(x[,1]) + rnorm(nrow(x)), Species = x[,5])
plot(x, col=x$Species)
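With the data prepared, a first model could be fit as follows. This is a minimal sketch, assuming we predict Sepal.Length from all other variables (including the useless feature, to see how the model handles it):

# fit a linear model predicting Sepal.Length from all other variables
model <- lm(Sepal.Length ~ ., data = x)
# inspect the estimated coefficients and their significance
summary(model)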