EMIS, Lyle School of Eng., SMU

CS 8331: Data Mining (crosslisted as EMIS 8331), Spring 2019

All course material is provided under Creative Commons Attribution 4.0 International (CC BY 4.0)
Instructor: Dr. Michael Hahsler
Syllabus and assignments: Canvas

Course Description

This second course in data mining focuses on understanding the research methods used in the field of data mining. The course targets students who want to gain in-depth knowledge about a particular data mining topic (e.g., PhD students who plan to use a data mining component in their research).

Prerequisites: Successful completion of an introductory data mining course like EMIS/CS 7331 or CS 7324. It is assumed that every student is familiar with all the basic data mining topics (clustering, classification, and association rules) and has experience with programming and one or more data mining tools (R, Python). Enrolling in this course requires department/instructor consent.

Assignments

Tutorials

Date | Presenter | Tutorial | Abstract | Additional material (code, software, etc.)
1/23 | Michael Hahsler | Data Stream Mining | Abstract | R package stream, MOA, Apache Spark Streaming, Apache Storm, Apache Samza, Apache SAMOA, IBM Streams, MS Azure Stream Analytics
2/6 | Mohamed Elsaied | Hybrid Recommender Systems | Abstract |
2/13 | Mengli Hua | Large Scale Recommender Systems | Abstract | Example (R Code), R package recommenderlab
2/20 | Bowei Tian | Machine Learning for Time Series Data | Abstract |
2/27 | Liang Ma | Regularization in Data Mining | Abstract | R examples for Regularization
3/6 | Ian Johnson | Ensemble Classification Methods | Abstract | Example code
3/27 | Bolong Zhang | Web Usage Mining | Abstract |
4/3 | Justin Ledford | Adversarial Inputs to Image Classification with Neural Networks | Abstract |
4/3 | Zohreh Raziei | Reinforcement Learning | Abstract | thompson_sampling.R, upper_confidence_bound.R, Ads_CTR_Optimisation.csv
4/10 | Jake Carlson | Community Detection in Python | Abstract |
4/10 | Jiaqi Song | Vehicle routing problems with time windows | Abstract |
4/17 | Xingming Qu | Sentiment Analysis | Abstract |
4/17 | Roger Stanton | Network Analysis | Abstract | R code, network.csv, stations.csv

Paper Reviews

Due Date | Paper | Additional Material (code, papers, etc.)
2/6 | Clustering Using a Similarity Measure Based on Shared Near Neighbors, IEEE Transactions on Computers, 22(11), 1025-1034, 1973. (download from CiteSeer) | Jarvis-Patrick clustering in R, SNN clustering paper
2/13 | ABACUS: Mining Arbitrary Shaped Clusters from Large Datasets based on Backbone Identification, SIAM International Conference on Data Mining, 2011 | Partial ABACUS implementation, clustering images, DBSCAN paper, clustering with dbscan in R, more dbscan examples, a unified view
2/20 | Dissimilarity plots: A visual exploration tool for partitional clustering, Journal of Computational and Graphical Statistics, 2011 | R package seriation, Dissimilarity plot in R
2/27 | Clustering of Time Series Subsequences is Meaningless, IEEE International Conference on Data Mining, 2003 | Other papers: An Alternate Measure for Comparing Time Series Subsequence Clusters, Making Subsequence Time Series Clustering Meaningful. Code: Using R for time series analysis, time series example code, Cylinder-Bell-Funnel example, subsequence clustering example with real data
3/6 | Temporal structure learning for clustering massive data streams in real-time, SIAM International Conference on Data Mining, 2011 | rEMM package, code example from paper
3/27 | Clustering data streams based on shared density between micro-clusters, IEEE Transactions on Knowledge and Data Engineering, 2016 | Implementation in the R package stream
4/3 | Integrating Classification and Association Rule Mining, KDD'98 | LUCS-KDD Implementations, R package arulesCBA, arulesCBA example. Classification Using Association Rules: Weaknesses and Enhancements.
4/10 | Interpretable regularized class association rules algorithm for classification in a categorical data space, Information Sciences, 2019 | R package arulesCBA, about rulefit, some R code examples
4/17 | Approximate Ranking from Pairwise Comparisons, International Conference on Artificial Intelligence and Statistics, 2018 | R code for H-LUBC, R package relations

Reference Text (Recommended but not required)

Tutorials from Previous Semesters

Links

Conferences: Journals: Digital Libraries: Data Mining Competitions and Data Sets

High performance computing

Tools for distance students


Michael Hahsler