CS 7331 - Data Mining
Instructor: Dr. Michael Hahsler
Syllabus and assignments: Canvas
All course material is provided under
Course Description
Analytics is based on collecting, managing, exploring and acting on large amounts of data and has become a source of competitive advantage for many organizations. This course provides an overview of descriptive analytics and introduces major data-mining techniques (classification, association analysis and cluster analysis) used in predictive analytics. All material covered will be reinforced through hands-on experience using state-of-the-art tools to design and execute data mining processes.
Prerequisites:
- Programming skills (e.g., CS 1342)
- Knowledge of basic probability theory and statistics (e.g., CS 4340/EMIS 3340/STAT 4340, CS 7370/EMIS 7370)
Outline
- Introduction
- Read Syllabus
- Introduction to Data Mining (Chapter 1)
- Slides: Introduction
- Additional reading: KDnuggets 2020 Data Science Tools Popularity Poll (KDnuggets polls)
- Optional reading: How to become a Data Scientist for Free, Data Mining Certifications
- Optional reading: How a fitness app revealed military secrets and the new reality of data collection
- Make sure you do not show the signs of a bad data scientist!
- Tools
- Install R and RStudio (see Tool section)
- Intro to R: Introduction
- Data and Exploration
- Data and Exploration (Chapter 2)
- Slides: Data, Exploration
- R Code: Code for Chapter 2 - Data
- Intro to R:
- Cheat Sheets: Choosing a good chart, Essential Visualizations
- Visualization examples: Find the appropriate visualization for your data (from Data to Viz), R Graph Gallery, interactive charts with Plotly
- Optional reading:
- How To Build Compelling Stories From Your Data Sets
- Data Visualization: A practical introduction by Kieran Healy. A comprehensive introduction with R code using ggplot.
- Clustering
- Cluster Analysis (Chapter 7)
- Slides: Cluster Analysis
- R code: Code for Chapter 7
- Example: k-means visualization
- Additional reading: CRAN Task View: Cluster Analysis & Finite Mixture Models
- Optional reading: A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis
- Optional reading: NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set
- Optional reading: Measures for external evaluation of clustering
- Classification
- Classification, Classification Trees and Evaluation (Chapter 3)
- Slides: Classification
- R Code: Code for Chapter 3
- R package: Introduction to rpart (implements CART) and examples for rpart.plot.
- Additional reading: Glossary of Terms to understand common terms in machine learning, statistics, and data mining (Kohavi & Provost, 1998)
- Optional reading: Learning from Imbalanced Classes (Tom Fawcett, 2016)
- Alternative Classification Methods (Chapter 4)
- Slides: Alternative Classification Methods
- R Code: Code for Chapter 4
- R package: caret (train and evaluate different classifiers). See: Introduction to the caret package, table with all available models.
- Additional reading: CRAN Task View: Machine Learning & Statistical Learning
- R Code: Multiple Linear Regression and Logistic Regression
- Deep Learning
- Optional reading: Main Types of Neural Networks and its Applications - Tutorial
- Example: Tinker With a Neural Network (Deep Learning)
- Association Analysis
- Association Analysis (Chapter 5)
- Slides: Association analysis
- R Code: Code for Chapter 5
- R package: Introduction to arules, Visualizing association rules with arulesViz
- Advanced Topics
- Mining Big Data with MapReduce (download a quick-start VM from Cloudera or use cloud-based MapReduce on Amazon EC2)
- A Probabilistic Approach to Association Rule Mining (R package arules)
- Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering (R package seriation)
- Recommender Systems: From content to latent factor analysis (R package recommenderlab)
- Introduction to Data Stream Mining, Data Stream Clustering (R package stream)
- Optimization in R
Textbooks
- Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 1st or 2nd edition, 2019. (Book web page) [required]
- An R Companion for Introduction to Data Mining by Michael Hahsler, 2021. (free web book) [required]
- R for Data Science by Garrett Grolemund and Hadley Wickham, 2017. (free web book)
- Data Visualization: A practical introduction by Kieran Healy, 2019. (free web book)
- Applied Predictive Modeling (with Examples using R and caret) by Max Kuhn and Kjell Johnson, Springer, 2013. Book Web Page, fulltext via SpringerLink (free download on campus)
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2nd edition, Springer, 2009. Book Web Page, fulltext via SpringerLink (free download on campus)
- Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman, Cambridge University Press, 2011. (free e-book)
Projects
Project assignments are available on Canvas.
Models, Code and Data Used in Class
- Download data sets (iris, Fishery, etc.) for class here.
- R Code accompanying the textbook and additional R Code used in class.
- For a fast and simple way to copy tables from R to Word use
R> source("http://michael.hahsler.net/SMU/EMIS7332/R/copytable.R")
- For tables in LaTeX use package xtable.
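As a minimal sketch of the xtable route mentioned in the list above (not code from the course materials), the following converts a small summary table to LaTeX. It assumes the xtable package is installed; the variable names tab and latex_lines are illustrative only.

```r
# Assumption: the xtable package is installed (install.packages("xtable")).
library(xtable)

# Build a small table: mean of each measurement per species in the built-in iris data.
tab <- aggregate(. ~ Species, data = iris, FUN = mean)

# Convert the data frame to an xtable object and capture the printed LaTeX code.
# include.rownames = FALSE drops the row-number column from the output.
latex_lines <- capture.output(
  print(xtable(tab, digits = 2), include.rownames = FALSE)
)

# Show the LaTeX code so it can be pasted into a .tex document.
cat(latex_lines, sep = "\n")
```

The printed tabular environment can be pasted directly into a LaTeX document; xtable() also accepts caption and label arguments if the table should be referenced in the text.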
Tool
- R
- Download and install R from the R homepage (look for CRAN to the left).
- Download and install RStudio IDE and/or Microsoft R Tools for Visual Studio (we will use RStudio in class).
- Optional: Rattle: A Graphical User Interface for Data Mining using R.
- Optional: Microsoft R Open (formerly known as Revolution Analytics). A commercial high-performance, scalable, enterprise-capable analytics platform using R.
- SQL (optional): sqlite Manager for Firefox, RSQLite, MySQL, PostgreSQL.
Learning resources
- R
- Work through Introduction to R.
- Workshop Series Introduction to R Programming
- Find R packages using CRAN Task Views. Packages and short descriptions are organized by topic.
- Useful cheat sheets: R Reference Card, R Data Wrangling cheatsheet, R Reference Card for Data Mining, more cheat sheets (from RStudio).
- Quick-R: A very good introduction.
- Community and Guides
- KDnuggets - Data Mining Community’s Top Resource
- CRISP-DM User Guide (SPSS, IBM)
- Videos
- StatQuest! Great explanation of statistical concepts, statistical learning and machine learning methods.
- 3Blue1Brown Introduction to mathematical concepts and artificial neural networks.
- Videos for Statistics 202: Statistical Aspects of Data Mining. These videos cover the textbook used in this class.
- CS109 Data Science - Harvard (videos, slides)
- In-depth introduction to machine learning in 15 hours of expert videos - Data School
Data Sets
- UCI Machine Learning Repository
- KDnuggets - Datasets
- StatCrunch (datasets)
- Airline Data (20 years)
- City of Dallas Open Data Portal
- Murder Accountability Project
- World DataBank (World Bank)
- kaggle - Making Science Sport (Data Mining Competitions)
- InnoCentive - Innovative Solutions to Real Problems (Competitions including Data Mining)
- data.world - Data set repository
- HealthData.gov - Making high value health data more accessible to entrepreneurs, researchers, and policy makers
- PhysioNet - large collections of recorded physiologic signals