Intro to R - 5. R for Data Science
What is Data Science?
Data Science
Data Science is still evolving. One definition by Hal Varian (Chief economist at Google and professor at UC Berkeley) is:
"The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that's going to be a hugely important skill in the next decades." – Hal Varian
Predictive Modeling
Predictive modeling includes:

- Data mining
- Machine learning
- Prediction
  - regression (predict a number, e.g., the age of a person)
  - classification (predict a label, e.g., yes/no)
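The two task types can be sketched with base R (a minimal sketch using the built-in mtcars data; the transmission indicator am serves as a stand-in binary label):

```r
data(mtcars)

# Regression: predict a number (miles per gallon from weight)
reg_model <- lm(mpg ~ wt, data = mtcars)
predict(reg_model, data.frame(wt = 3))  # a numeric prediction

# Classification: predict a label (automatic vs. manual transmission)
# via logistic regression with glm()
clf_model <- glm(factor(am) ~ wt + hp, data = mtcars, family = binomial)
predict(clf_model, data.frame(wt = 3, hp = 110),
        type = "response")              # a probability for the label
```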
Predictive Modeling Workflow
Predictive Modeling Workflow in R
Example
data(mtcars) # Load the dataset
knitr::kable(head(mtcars))
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21 | 6 | 160 | 110 | 3.9 | 2.6 | 16 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21 | 6 | 160 | 110 | 3.9 | 2.9 | 17 | 0 | 1 | 4 | 4 |
| Datsun 710 | 23 | 4 | 108 | 93 | 3.9 | 2.3 | 19 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21 | 6 | 258 | 110 | 3.1 | 3.2 | 19 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 19 | 8 | 360 | 175 | 3.1 | 3.4 | 17 | 0 | 0 | 3 | 2 |
| Valiant | 18 | 6 | 225 | 105 | 2.8 | 3.5 | 20 | 1 | 0 | 3 | 1 |
Note: kable in package knitr is used to pretty-print the table because the slides were created with Markdown.
Example: Predict Miles per Gallon
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
Linear Regression
model <- lm(mpg ~ wt, data = mtcars)
model
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.29 -5.34
Formula Interface
R often uses a "model formula" to specify models of the form response ~ predictors. See ? formula for details.
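Some common formula idioms, sketched on mtcars (all column names are from the dataset above):

```r
data(mtcars)

lm(mpg ~ wt, data = mtcars)        # one predictor
lm(mpg ~ wt + cyl, data = mtcars)  # two predictors
lm(mpg ~ wt * cyl, data = mtcars)  # main effects plus their interaction
lm(mpg ~ ., data = mtcars)         # all remaining columns as predictors
lm(mpg ~ log(wt), data = mtcars)   # transformations inside the formula
```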
Linear Regression: Model summary
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.543 -2.365 -0.125 1.410 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.285 1.878 19.86 < 2e-16 ***
## wt -5.344 0.559 -9.56 1.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 30 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.745
## F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
Linear Regression: Plotting the regression line
coef(model)
## (Intercept) wt
## 37.3 -5.3
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
geom_abline(intercept = coef(model)["(Intercept)"], slope = coef(model)["wt"], color = "red")
Note: I used geom_abline() here, but you can get the regression line directly with geom_smooth(method = "lm").
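For comparison, the same plot with the fitted line added by geom_smooth (a sketch; se = FALSE just hides the confidence band):

```r
library(ggplot2)
data(mtcars)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")
```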
Multiple Linear Regression
model <- lm(mpg ~ wt + cyl + hp, data = mtcars)
model
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Coefficients:
## (Intercept) wt cyl hp
## 38.752 -3.167 -0.942 -0.018
summary(model)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.929 -1.560 -0.531 1.185 5.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7518 1.7869 21.69 <2e-16 ***
## wt -3.1670 0.7406 -4.28 0.0002 ***
## cyl -0.9416 0.5509 -1.71 0.0985 .
## hp -0.0180 0.0119 -1.52 0.1400
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.5 on 28 degrees of freedom
## Multiple R-squared: 0.843, Adjusted R-squared: 0.826
## F-statistic: 50.2 on 3 and 28 DF, p-value: 2.18e-11
Where to go from here?
- Interaction effects: see ? lm
- Step-wise model selection to choose a simpler model: see ? step
- Generalized linear models: see ? glm
Prediction
Almost all R models provide a predict function.
predict(model, head(mtcars))
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 23 22 26 21
## Hornet Sportabout Valiant
## 17 20
Note: Prediction is typically done on new or test data. Packages like caret, mlr3, and SuperLearner help with this.
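Predicting for genuinely new data works the same way; the new data frame only needs the predictor columns (a sketch with a hypothetical car design, values made up):

```r
data(mtcars)
model <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# A hypothetical new car (wt is weight in 1000 lbs, as in mtcars)
new_car <- data.frame(wt = 2.5, cyl = 4, hp = 100)
predict(model, new_car)
```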
Package caret
caret is a package that simplifies training and testing predictive models.
Fit a linear regression model (lm stands for linear model).
library("caret")
## Loading required package: lattice
model <- train(mpg ~ wt + cyl + hp,
               data = mtcars,
               method = "lm")
model
## Linear Regression
##
## 32 samples
## 3 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.8 0.84 2.3
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Look at the final model learned on all the data.
summary(model$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.929 -1.560 -0.531 1.185 5.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7518 1.7869 21.69 <2e-16 ***
## wt -3.1670 0.7406 -4.28 0.0002 ***
## cyl -0.9416 0.5509 -1.71 0.0985 .
## hp -0.0180 0.0119 -1.52 0.1400
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.5 on 28 degrees of freedom
## Multiple R-squared: 0.843, Adjusted R-squared: 0.826
## F-statistic: 50.2 on 3 and 28 DF, p-value: 2.18e-11
Train a regression tree
rpart implements CART (here a regression tree). I use all variables (. in the formula) because decision trees perform variable selection.
model <- train(mpg ~ .,
               data = mtcars,
               method = "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
model
## CART
##
## 32 samples
## 10 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.000 4.1 0.54 3.4
## 0.097 4.2 0.53 3.4
## 0.643 5.1 0.48 4.2
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.
Note: CART has a tuning parameter cp and train tries several values and picks the best.
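The cp values that train tries can also be set explicitly (a sketch; tuneGrid is a standard train() argument):

```r
library(caret)
data(mtcars)

# Evaluate a custom grid of cp values instead of caret's default
model <- train(mpg ~ .,
               data = mtcars,
               method = "rpart",
               tuneGrid = data.frame(cp = c(0, 0.01, 0.1, 0.5)))
model$bestTune   # the cp value with the best resampled RMSE
```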
Plotting a regression tree and looking at variable importance
library(rpart.plot)
## Loading required package: rpart
rpart.plot(model$finalModel)
What are the most important variables? Note that some important variables are "hidden": they do not appear as splits in the plotted tree.
varImp(model)
## rpart variable importance
##
## Overall
## hp 100.0
## cyl 99.6
## disp 97.0
## wt 93.3
## vs 38.2
## drat 16.4
## am 0.0
## carb 0.0
## gear 0.0
## qsec 0.0
Where to go from here?
caret offers many models: see Available Models in caret.
Evaluation schemes:
- Training/test split
- Cross-validation
- Hyperparameter tuning
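Cross-validation instead of caret's default bootstrap resampling is requested via trainControl (a sketch; 10-fold CV):

```r
library(caret)
data(mtcars)

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
model <- train(mpg ~ wt + cyl + hp,
               data = mtcars,
               method = "lm",
               trControl = ctrl)
model$results   # cross-validated RMSE, R-squared, MAE
```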
Exercises
Cars Exercises
- Define a new variable called "green" with values yes and no to label cars that are friendly to the environment. Use a suitable threshold on mpg.
- Create a model (e.g., a decision tree) to predict “green” given all other variables.
- What are the most important variables?
- Create a few hypothetical car designs and test if they will be predicted as green.
MLB Exercises
- Load the MLB data set and create a scatter plot of weight by height with a regression line added.
- Create a prediction model that predicts weight using position, height, and age of the player. Compare different models using caret.
- Create a classification model to predict the position. Decide what information is useful for this task.