Intro to R - 5. R for Data Science

What is Data Science?

Data Science

Data Science is still evolving. One definition by Hal Varian (Chief economist at Google and professor at UC Berkeley) is:

The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades. – Hal Varian

Data Science Lifecycle

Source: https://datascience.berkeley.edu/about/what-is-data-science/

Predictive Modeling

Predictive modeling includes

  • Data mining

  • Machine learning

  • Prediction

    • regression (predict a number, e.g., the age of a person)
    • classification (predict a label, e.g., yes/no)

Predictive Modeling Workflow

Workflow of Predictive Modeling

Predictive Modeling Workflow in R

Workflow of Predictive Modeling with R

Example

data(mtcars)    # Load the dataset
knitr::kable(head(mtcars))
|                   | mpg | cyl | disp |  hp | drat |  wt | qsec | vs | am | gear | carb |
|-------------------|----:|----:|-----:|----:|-----:|----:|-----:|---:|---:|-----:|-----:|
| Mazda RX4         |  21 |   6 |  160 | 110 |  3.9 | 2.6 |   16 |  0 |  1 |    4 |    4 |
| Mazda RX4 Wag     |  21 |   6 |  160 | 110 |  3.9 | 2.9 |   17 |  0 |  1 |    4 |    4 |
| Datsun 710        |  23 |   4 |  108 |  93 |  3.9 | 2.3 |   19 |  1 |  1 |    4 |    1 |
| Hornet 4 Drive    |  21 |   6 |  258 | 110 |  3.1 | 3.2 |   19 |  1 |  0 |    3 |    1 |
| Hornet Sportabout |  19 |   8 |  360 | 175 |  3.1 | 3.4 |   17 |  0 |  0 |    3 |    2 |
| Valiant           |  18 |   6 |  225 | 105 |  2.8 | 3.5 |   20 |  1 |  0 |    3 |    1 |

Note: kable in package knitr is used to pretty-print the table because the slides were created with Markdown.

Example: Predict Miles per Gallon

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

Linear Regression

model <- lm(mpg ~ wt, data = mtcars)
model
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##       37.29        -5.34

Formula Interface

R often uses a “model formula” to specify models of the form response ~ predictors. See ?formula for details.
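Formulas are first-class objects in R, so they can be inspected and reused; a short sketch (base R only):

```r
# A formula captures the model structure without evaluating anything
f <- mpg ~ wt + cyl
class(f)      # "formula"
all.vars(f)   # variable names used: "mpg" "wt" "cyl"

# The same formula object can be passed to a model-fitting function
data(mtcars)
lm(f, data = mtcars)
```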

Linear Regression: Model summary

summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.543 -2.365 -0.125  1.410  6.873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.285      1.878   19.86  < 2e-16 ***
## wt            -5.344      0.559   -9.56  1.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 30 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
## F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

Linear Regression: Plotting the regression line

coef(model)
## (Intercept)          wt 
##        37.3        -5.3
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + 
  geom_abline(intercept = coef(model)["(Intercept)"], slope = coef(model)["wt"], color = "red")

Note: I have used geom_abline here, but you can also get the regression line directly with geom_smooth(method = "lm").
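The geom_smooth variant mentioned in the note looks like this (setting se = FALSE to suppress the confidence band is my choice, not part of the original slide):

```r
library(ggplot2)

# geom_smooth fits the linear model internally; no manual coef() needed
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")
p
```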

Multiple Linear Regression

model <- lm(mpg ~ wt + cyl + hp, data = mtcars)
model
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt          cyl           hp  
##      38.752       -3.167       -0.942       -0.018
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.929 -1.560 -0.531  1.185  5.899 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7518     1.7869   21.69   <2e-16 ***
## wt           -3.1670     0.7406   -4.28   0.0002 ***
## cyl          -0.9416     0.5509   -1.71   0.0985 .  
## hp           -0.0180     0.0119   -1.52   0.1400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 28 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.826 
## F-statistic: 50.2 on 3 and 28 DF,  p-value: 2.18e-11

Where to go from here?

  • Interaction effects: see ?lm
  • ?step to choose a simpler model using step-wise model selection.
  • Generalized linear models: ?glm

Prediction

Almost all R models provide a predict function.

predict(model, head(mtcars))
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##                23                22                26                21 
## Hornet Sportabout           Valiant 
##                17                20

Note: Prediction is typically done on new or test data. Packages like caret, mlr3, and SuperLearner help with this.
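A minimal manual training/test split to illustrate the idea (the 75/25 split and the seed are arbitrary assumptions; packages like caret automate this):

```r
data(mtcars)
set.seed(42)                                  # reproducible split
train_idx <- sample(nrow(mtcars), 24)         # 24 of 32 rows for training

fit  <- lm(mpg ~ wt + cyl + hp, data = mtcars[train_idx, ])
pred <- predict(fit, newdata = mtcars[-train_idx, ])

sqrt(mean((pred - mtcars$mpg[-train_idx])^2)) # test-set RMSE
```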

Package Caret

Caret is a package that simplifies training and testing predictive models.

Fit a linear regression model

(lm means linear model)

library("caret")
## Loading required package: lattice
model <- train(mpg ~ wt + cyl + hp,
               data = mtcars,
               method = "lm")
model
## Linear Regression 
## 
## 32 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ... 
## Resampling results:
## 
##   RMSE  Rsquared  MAE
##   2.8   0.84      2.3
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Look at the final model learned on all the data.

summary(model$finalModel)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.929 -1.560 -0.531  1.185  5.899 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7518     1.7869   21.69   <2e-16 ***
## wt           -3.1670     0.7406   -4.28   0.0002 ***
## cyl          -0.9416     0.5509   -1.71   0.0985 .  
## hp           -0.0180     0.0119   -1.52   0.1400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 28 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.826 
## F-statistic: 50.2 on 3 and 28 DF,  p-value: 2.18e-11

Train a regression tree

rpart implements CART (here a regression tree). I use all variables (.) in the formula because decision trees perform automatic variable selection.

model <- train(mpg ~ .,
               data = mtcars,
               method = "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
model
## CART 
## 
## 32 samples
## 10 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE  Rsquared  MAE
##   0.000  4.1   0.54      3.4
##   0.097  4.2   0.53      3.4
##   0.643  5.1   0.48      4.2
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.

Note: CART has a tuning parameter cp and train tries several values and picks the best.
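Instead of letting train choose candidate cp values itself, you can supply your own grid via the tuneGrid argument (the particular cp values below are illustrative assumptions):

```r
library(caret)
data(mtcars)

model <- train(mpg ~ .,
               data = mtcars,
               method = "rpart",
               tuneGrid = data.frame(cp = c(0, 0.01, 0.1)))
model$bestTune   # the cp value that won the resampling comparison
```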

Plotting a regression tree and looking at variable importance

library(rpart.plot)
## Loading required package: rpart
rpart.plot(model$finalModel)

What are the most important variables? Note that some important variables are “hidden” because they do not appear as splits in the visualization of the tree.

varImp(model)
## rpart variable importance
## 
##      Overall
## hp     100.0
## cyl     99.6
## disp    97.0
## wt      93.3
## vs      38.2
## drat    16.4
## am       0.0
## carb     0.0
## gear     0.0
## qsec     0.0

Where to go from here?

Caret offers many models: Available models in Caret

Evaluation schemes:

  • Training/test split
  • Cross-validation

Hyperparameter tuning
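Cross-validation replaces caret's default bootstrap resampling via trainControl; a sketch (10 folds is a common but arbitrary choice):

```r
library(caret)
data(mtcars)

ctrl  <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
model <- train(mpg ~ wt + cyl + hp,
               data = mtcars,
               method = "lm",
               trControl = ctrl)
model$results   # cross-validated estimates of RMSE, R-squared, MAE
```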

Exercises

Cars Exercises

  1. Define a new variable called “green” with values yes and no to label cars that are friendly to the environment. Use a suitable threshold on mpg.
  2. Create a model (e.g., a decision tree) to predict “green” given all other variables.
  3. What are the most important variables?
  4. Create a few hypothetical car designs and test if they will be predicted as green.

MLB Exercises

  1. Load the MLB data set and create a scatter plot of weight by height with a regression line added.
  2. Create a prediction model that predicts weight using position, height, and age of the player. Compare different models using caret.
  3. Create a classification model to predict the position. Decide what information is useful for this task.