Additional material for the course “Introduction to Data Mining”

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Linear regression models the value of a dependent variable \(y\) (also called the response) as a linear function of independent variables \(X_1, X_2, \dots, X_p\) (also called regressors, predictors, exogenous variables, or covariates). Given \(n\) observations, the model is: \[y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i \]

where \(\beta_0\) is the intercept, \(\beta\) is a \(p\)-dimensional parameter vector learned from the data, and \(\epsilon_i\) is the error term (its estimates are called residuals).
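Such a model can be estimated in R with `lm()`. A minimal sketch on simulated data (the coefficient values 2, 3, and -1 below are arbitrary illustration choices, not from the course material):

```
# Simulate n observations from y = 2 + 3*x1 - 1*x2 + error
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.5)

# Fit the linear model and inspect the estimated parameter vector
fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates of beta_0, beta_1, beta_2
```

The estimates returned by `coef()` should be close to the true coefficients used in the simulation, with the remaining deviation caused by the error term.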

Linear regression makes several assumptions:

* *Weak exogeneity:* Predictor variables are assumed to be error free.
* *Linearity:* There is a linear relationship between the dependent and the independent variables.
* *Homoscedasticity:* The variance of the error (\(\epsilon\)) does not change (e.g., increase) with the predicted value.
* *Independence of errors:* Errors between observations are uncorrelated.
* *No multicollinearity of predictors:* Predictors cannot be perfectly correlated, or the parameter vector cannot be identified. *Note* that highly correlated predictors lead to unstable results and should be avoided.
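Several of these assumptions can be checked visually with R's built-in diagnostic plots for a fitted model. A sketch using the built-in `cars` dataset (chosen here only for illustration):

```
# Fit a simple linear model on the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)

# Residuals vs. fitted values: a pattern suggests non-linearity,
# a funnel shape suggests heteroscedasticity
plot(fit, which = 1)

# Normal Q-Q plot of the residuals
plot(fit, which = 2)
```

Strong correlation between predictors (multicollinearity) can be spotted by inspecting the pairwise correlation matrix of the predictors with `cor()`.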

Set the random number generator seed to make the results reproducible.

```
set.seed(2000)
```

Load and shuffle the data (the flowers are in order by species).

```
data(iris)
x <- iris[sample(1:nrow(iris)),]
plot(x, col=x$Species)
```

Make the data a little messy and add a useless feature.

```
x[,1] <- x[,1] + rnorm(nrow(x))
x[,2] <- x[,2] + rnorm(nrow(x))
x[,3] <- x[,3] + rnorm(nrow(x))
x <- cbind(x[,-5], useless = mean(x[,1]) + rnorm(nrow(x)), Species = x[,5])
plot(x, col=x$Species)
```