CS 8331 - Advanced Data Mining

Instructor: Dr. Michael Hahsler
Syllabus and assignments: Canvas

All course material is provided under

Course Description

This second course in data mining focuses on understanding the research methods used in the field of data mining. The course targets students who want to gain in-depth knowledge about a particular data mining topic (e.g., PhD students who plan to use a data mining component in their research).

Prerequisites: Successful completion of an introductory course in data mining (e.g., EMIS/CS 7331) or in machine learning (e.g., CS 7324). It is assumed that every student is familiar with basic data mining topics (clustering, classification, and association rules) and has experience with programming and one or more data mining tools (R, Python). Enrolling in this course requires instructor consent.

Assignments

Reviewing Papers: We will discuss 1-2 research papers in class almost every Wednesday (see syllabus). Here are the guidelines and the review form.
Tutorial: Each student will prepare and present a tutorial. Tutorials are typically held on Mondays. A list with exact times and topics will be published on this Web site. Here are the guidelines for preparing the tutorial and the grading form. Note: All students are expected to participate by asking questions and discussing the presented topics.
Final Project: For the final project, you need to implement a data mining method or write a review paper. The topic of the final project should be coordinated with the tutorial’s topic. Here are the guidelines for the final project.

Tutorials

Presenter	Tutorial	Abstract	Additional material (code, software, etc.)
Michael Hahsler	Data Stream Mining	Abstract	R package stream with a clustering example. Other tools: MOA, Apache Spark Streaming, Apache Storm, Apache Samza, Apache SAMOA, IBM Streams, MS Azure Stream Analytics
Michael Hahsler	Association Rule Mining		Slides, code examples, R package arules
…	…	…	…

Student tutorials are available on Canvas.

Paper Reviews

Due dates are available on Canvas.

Due Date	Paper	Additional Material (code, papers, etc.)
	Clustering Using a Similarity Measure Based on Shared Near Neighbors, IEEE Transactions on Computers, 22(11), 1025-1034, 1973. (download from CiteSeer)	Jarvis-Patrick clustering in R, SNN clustering paper
	A density-based algorithm for discovering clusters in large spatial databases with noise, KDD’96	clustering with dbscan in R, more dbscan examples, a unified view
	ABACUS: Mining Arbitrary Shaped Clusters from Large Datasets based on Backbone Identification, SIAM International Conference on Data Mining, 2011	Partial ABACUS implementation, clustering images
	Dissimilarity plots: A visual exploration tool for partitional clustering, Journal of Computational and Graphical Statistics, 2011	R package seriation, Dissimilarity plot in R
	Clustering of Time Series Subsequences is Meaningless, IEEE International Conference on Data Mining, 2003	Other papers: Subsequence Clustering Papers (Google Scholar), Making Subsequence Time Series Clustering Meaningful. Code: Using R for time series analysis, time series example code, Cylinder-Bell-Funnel example, subsequence clustering example with real data
	Temporal structure learning for clustering massive data streams in real-time, SIAM International Conference on Data Mining, 2011	rEMM package, code example from paper
	Clustering data streams based on shared density between micro-clusters, IEEE Transactions on Knowledge and Data Engineering, 2016	Implementation in the R package stream
	Integrating Classification and Association Rule Mining, KDD’98	LUCS-KDD Implementations, R package arulesCBA, arulesCBA example. Classification Using Association Rules: Weaknesses and Enhancements.
	Interpretable regularized class association rules algorithm for classification in a categorical data space, Information Sciences 2019	R package arulesCBA, about rulefit, some R code examples.
	Approximate Ranking from Pairwise Comparisons, International Conference on Artificial Intelligence and Statistics 2018	R code for H-LUBC, R package relations

Reference Texts (Recommended but not required)

Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, Prentice Hall, 2003. Book Web Page
Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2005. Book Web Page
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2nd edition, Springer-Verlag, 2009. Book Web Page
Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman, Cambridge University Press, 2011. Book Web Page

Tutorials from Previous Semesters

High performance computing

Apache Hadoop: MapReduce, Pig, Hive (Cloudera)
Storm (reliable processing of unbounded streams of data in real time)
Spark (fast and general engine for large-scale data processing)
HPCC Systems (by LexisNexis)
Deep Learning with H2O and R