EMIS/CSE 8331: Data Mining, Spring 2017

Dr. Michael Hahsler
TTh 2:00-3:20pm, Junkins 101

News

Please contact me via email if you are considering taking this course in Spring 2017.

Course Description

This second course in data mining focuses on understanding the research methods used in the field of data mining. The course targets students who want to gain in-depth knowledge about a particular data mining topic (e.g., PhD students who plan to use a data mining component in their research).

Prerequisites: Successful completion of an introductory data mining course such as EMIS/CSE 7331 (EMIS 7332 until Fall 2016). Students are assumed to be familiar with the basic data mining topics (clustering, classification, and association rules) and to have some experience with programming and with one or more data mining tools (R, RapidMiner, SAS, SPSS, Weka, XLMiner, etc.). Enrolling in this course requires department/instructor consent.

Assignments

Reviewing Papers: We will discuss 1-2 research papers almost every Wednesday (see syllabus) in class. Here are the guidelines and the review form.
Tutorial: Every student will prepare and present a tutorial. Tutorials are typically held on Tuesdays. A list with exact times and topics will be published on this Web site. Here are the guidelines for preparing the tutorial and the grading form. Note: All students are expected to participate with questions and discussion of the presented topics.
Final Project: For the final project you can choose between a data mining project, implementing a data mining method, or writing a review paper. The topic of the final project should be coordinated with the tutorial's topic. Here are the guidelines for the final project and the presentation schedule.

Tutorial Topics

Date	Presenter	Tutorial	Abstract	Additional material (code, software, etc.)
2/9	Michael Hahsler	Data Stream Mining		R package stream, MOA, Apache Spark Streaming, Apache Storm, Apache Samza, Apache SAMOA, IBM Streams, MS Azure Stream Analytics
2/16	Xinxiang Zhang	Deep Learning	Abstract
2/23	Anyu Zhang	Lung Cancer Detection	Abstract
3/2	Farzad Kamalzadeh	Markov Models in Health Care	Abstract	R Code, R Package seqHMM
3/9	Michael Prappas	Natural Language Processing	Abstract	Code
3/23	Scott Eisenhart	Recommender Systems	Abstract	recommenderlab
3/30	Andrew Cranmer	Image Mining - Local Binary Patterns (video)	Abstract
4/6	Tutorial moved to 4/25
4/13	Revant Reddy Katanguri	Big Data Technologies	Abstract
4/20	Ben Brock	Mining Applications for the Internet of Things (video)	Abstract
4/25	Harold Mitchell	Introduction to Apache Spark	Abstract	Anaconda, Pyspark_First_Program.ipynb, Spark-HelloWorld.ipynb

Papers for Review

Due Date	Paper	Additional Material (code, papers, etc.)
2/2	1. Clustering Using a Similarity Measure Based on Shared Near Neighbors, IEEE Transactions on Computers 1973	Jarvis-Patrick clustering in R, SNN clustering paper
2/7	2. A density-based algorithm for discovering clusters in large spatial databases with noise, KDD'96	clustering with dbscan in R, more dbscan examples, a unified view
2/14	3. ABACUS: Mining Arbitrary Shaped Clusters from Large Datasets based on Backbone Identification, SDM'11	Partial ABACUS implementation, clustering images
2/21	4. Clustering of Time Series Subsequences is Meaningless, ICDM'03	Other papers: An Alternate Measure for Comparing Time Series Subsequence Clusters, Making Subsequence Time Series Clustering Meaningful. Code: Using R for time series analysis, time series example code, subsequence clustering examples, subsequence clustering tests by Hala El-Ali
2/28	5. Temporal structure learning for clustering massive data streams in real-time, SDM'11	rEMM package, code example from paper
3/7	6. Integrating Classification and Association Rule Mining, KDD'98	LUCS-KDD Implementations, arulesCBA package, arulesCBA example
3/21	7. Grouping Association Rules Using Lift, DM-DA 2016	R package arulesViz, clustering/dissimilarity calculation using arules
3/28	We will meet at the SMU Research Day (Hughes-Trigg Student Center, 2 - 3:20pm). See Canvas for assignment.
4/4	No paper - individual meetings to discuss your progress report.
4/11	8. Concept Tree Based Ordering for Shaded Similarity Matrix, ICDM'02	ordering in R
4/18	9. Dissimilarity plots: A visual exploration tool for partitional clustering, JCGS, 2011	R package seriation, Dissimilarity plot in R

Reference Text (Recommended but not required)

Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, Prentice Hall, 2003. Book Web Page
Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2005. Book Web Page
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2nd edition, Springer-Verlag, 2009. Book Web Page
Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman, Cambridge University Press, 2011. Book Web Page

Tutorials from Previous Semesters

High performance computing

Apache Hadoop: MapReduce, Pig, Hive (Cloudera)
Storm (reliable processing of unbounded streams of data in real time)
Spark (fast and general engine for large-scale data processing)
HPCC Systems (by LexisNexis)
Deep Learning with H2O and R

Tools for distance students

CamStudio: Free screen recording software for Windows.
