Intro to R - 5. R for Data Science

What is Data Science?

Data Science

Data Science is still evolving. One definition by Hal Varian (Chief economist at Google and professor at UC Berkeley) is:

The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades. – Hal Varian

Data Science Lifecycle

Source: https://datascience.berkeley.edu/about/what-is-data-science/

Predictive Modeling

Predictive modeling includes

  • Data mining

  • Machine learning

  • Prediction

    • regression (predict a number, e.g., the age of a person)
    • classification (predict a label, e.g., yes/no)

Predictive Modeling Workflow

Workflow of Predictive Modeling

Predictive Modeling Workflow in R

Workflow of Predictive Modeling with R

Example

data(mtcars)    # Load the dataset
knitr::kable(head(mtcars))
|                   | mpg | cyl | disp |  hp | drat |  wt | qsec | vs | am | gear | carb |
|-------------------|----:|----:|-----:|----:|-----:|----:|-----:|---:|---:|-----:|-----:|
| Mazda RX4         |  21 |   6 |  160 | 110 |  3.9 | 2.6 |   16 |  0 |  1 |    4 |    4 |
| Mazda RX4 Wag     |  21 |   6 |  160 | 110 |  3.9 | 2.9 |   17 |  0 |  1 |    4 |    4 |
| Datsun 710        |  23 |   4 |  108 |  93 |  3.9 | 2.3 |   19 |  1 |  1 |    4 |    1 |
| Hornet 4 Drive    |  21 |   6 |  258 | 110 |  3.1 | 3.2 |   19 |  1 |  0 |    3 |    1 |
| Hornet Sportabout |  19 |   8 |  360 | 175 |  3.1 | 3.4 |   17 |  0 |  0 |    3 |    2 |
| Valiant           |  18 |   6 |  225 | 105 |  2.8 | 3.5 |   20 |  1 |  0 |    3 |    1 |

Note: kable in package knitr is used to pretty-print the table because the slides were created with Markdown.

Example: Predict Miles per Gallon

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

Linear Regression

model <- lm(mpg ~ wt, data = mtcars)
model
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##       37.29        -5.34

Formula Interface

R often uses a “model formula” to specify models of the form response ~ predictors. See ?formula for details.
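Formulas are first-class objects in R, so they can be inspected and reused; a short sketch (base R only):

```r
# A formula captures the model structure without evaluating anything
f <- mpg ~ wt + cyl
class(f)      # "formula"
all.vars(f)   # variable names used: "mpg" "wt" "cyl"

# The same formula object can be passed to a model-fitting function
data(mtcars)
lm(f, data = mtcars)
```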

Linear Regression: Model summary

summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.543 -2.365 -0.125  1.410  6.873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.285      1.878   19.86  < 2e-16 ***
## wt            -5.344      0.559   -9.56  1.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 30 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
## F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

Linear Regression: Plotting the regression line

coef(model)
## (Intercept)          wt 
##        37.3        -5.3
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + 
  geom_abline(intercept = coef(model)["(Intercept)"], slope = coef(model)["wt"], color = "red")

Note: I have used geom_abline here, but you can also get the regression line directly with geom_smooth(method = "lm").
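The geom_smooth variant mentioned in the note looks like this (setting se = FALSE to suppress the confidence band is my choice, not part of the original slide):

```r
library(ggplot2)

# geom_smooth fits the linear model internally; no manual coef() needed
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")
p
```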

Multiple Linear Regression

model <- lm(mpg ~ wt + cyl + hp, data = mtcars)
model
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt          cyl           hp  
##      38.752       -3.167       -0.942       -0.018
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.929 -1.560 -0.531  1.185  5.899 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7518     1.7869   21.69   <2e-16 ***
## wt           -3.1670     0.7406   -4.28   0.0002 ***
## cyl          -0.9416     0.5509   -1.71   0.0985 .  
## hp           -0.0180     0.0119   -1.52   0.1400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 28 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.826 
## F-statistic: 50.2 on 3 and 28 DF,  p-value: 2.18e-11

Where to go from here?

  • Interaction effects: see ?lm
  • ?step to choose a simpler model using step-wise model selection.
  • Generalized linear models: ?glm

Prediction

Almost all R models provide a predict function.

predict(model, head(mtcars))
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##                23                22                26                21 
## Hornet Sportabout           Valiant 
##                17                20

Note: Prediction is typically done on new or test data. Packages like caret, mlr3, and SuperLearner help with this.
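A minimal manual training/test split to illustrate the idea (the 75/25 split and the seed are arbitrary assumptions; packages like caret automate this):

```r
data(mtcars)
set.seed(42)                                  # reproducible split
train_idx <- sample(nrow(mtcars), 24)         # 24 of 32 rows for training

fit  <- lm(mpg ~ wt + cyl + hp, data = mtcars[train_idx, ])
pred <- predict(fit, newdata = mtcars[-train_idx, ])

sqrt(mean((pred - mtcars$mpg[-train_idx])^2)) # test-set RMSE
```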

Package Caret

Caret is a package that simplifies training and testing predictive models.

Fit a linear regression model

(lm means linear model)

library("caret")
## Loading required package: lattice
model <- train(mpg ~ wt + cyl + hp,
               data = mtcars,
               method = "lm")
model
## Linear Regression 
## 
## 32 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE
##   2.8   0.84      2.3
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Look at the final model learned on all the data.

summary(model$finalModel)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.929 -1.560 -0.531  1.185  5.899 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7518     1.7869   21.69   <2e-16 ***
## wt           -3.1670     0.7406   -4.28   0.0002 ***
## cyl          -0.9416     0.5509   -1.71   0.0985 .  
## hp           -0.0180     0.0119   -1.52   0.1400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 28 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.826 
## F-statistic: 50.2 on 3 and 28 DF,  p-value: 2.18e-11

Train a regression tree

rpart implements CART (here a regression tree). I use all variables (.) in the formula because decision trees perform automatic variable selection.

model <- train(mpg ~ .,
               data = mtcars,
               method = "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
model
## CART 
## 
## 32 samples
## 10 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE  Rsquared  MAE
##   0.000  4.1   0.54      3.4
##   0.097  4.2   0.53      3.4
##   0.643  5.1   0.48      4.2
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.

Note: CART has a tuning parameter cp and train tries several values and picks the best.
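Instead of letting train choose candidate cp values itself, you can supply your own grid via the tuneGrid argument (the particular cp values below are illustrative assumptions):

```r
library(caret)
data(mtcars)

model <- train(mpg ~ .,
               data = mtcars,
               method = "rpart",
               tuneGrid = data.frame(cp = c(0, 0.01, 0.1)))
model$bestTune   # the cp value that won the resampling comparison
```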

Plotting a regression tree and looking at variable importance

library(rpart.plot)
## Loading required package: rpart
rpart.plot(model$finalModel)

What are the most important variables? Note that some important variables are “hidden” because they do not appear as splits in the visualization of the tree.

varImp(model)
## rpart variable importance
## 
##      Overall
## hp     100.0
## cyl     99.6
## disp    97.0
## wt      93.3
## vs      38.2
## drat    16.4
## am       0.0
## carb     0.0
## gear     0.0
## qsec     0.0

Where to go from here?

Caret offers many models: Available models in Caret

Evaluation schemes:

  • Training/test split
  • Cross-validation

Hyperparameter tuning
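Cross-validation replaces caret's default bootstrap resampling via trainControl; a sketch (10 folds is a common but arbitrary choice):

```r
library(caret)
data(mtcars)

ctrl  <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
model <- train(mpg ~ wt + cyl + hp,
               data = mtcars,
               method = "lm",
               trControl = ctrl)
model$results   # cross-validated estimates of RMSE, R-squared, MAE
```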

Exercises

Cars Exercises

  1. Define a new variable called “green” with values yes and no to label cars that are friendly to the environment. Use a suitable threshold on mpg.
  2. Create a model (e.g., a decision tree) to predict “green” given all other variables.
  3. What are the most important variables?
  4. Create a few hypothetical car designs and test if they will be predicted as green.

MLB Exercises

  1. Load the MLB data set and create a scatter plot of weight by height with a regression line added.
  2. Create a prediction model that predicts weight using position, height, and age of the player. Compare different models using caret.
  3. Create a classification model to predict the position. Decide what information is useful for this task.