CS 8331 - Advanced Data Mining
Instructor: Dr. Michael Hahsler
Syllabus and assignments: Canvas
All course material is provided under
Course Description
This second course in data mining focuses on understanding the research methods used in the field of data mining. The course targets students who want to gain in-depth knowledge about a particular data mining topic (e.g., PhD students who plan to use a data mining component in their research).
Prerequisites: Successful completion of an introductory course in data mining (e.g., EMIS/CS 7331) or in machine learning (e.g., CS 7324). It is assumed that every student is familiar with basic data mining topics (clustering, classification, and association rules) and has experience with programming and one or more data mining tools (R, Python). Enrolling in this course requires instructor consent.
Assignments
- Reviewing Papers: We will discuss 1-2 research papers in class almost every Wednesday (see syllabus). Here are the guidelines and the review form.
- Tutorial: Each student will prepare and present a tutorial. Tutorials are typically held on Mondays. A list with exact times and topics will be published on this website. Here are the guidelines for preparing the tutorial and the grading form. Note: All students are expected to participate by asking questions and discussing the presented topics.
- Final Project: For the final project, you need to implement a data mining method or write a review paper. The topic of the final project should be coordinated with the tutorial’s topic. Here are the guidelines for the final project.
Tutorials
Presenter | Tutorial | Abstract | Additional material (code, software, etc.) |
---|---|---|---|
Michael Hahsler | Data Stream Mining | Abstract | R package stream with a clustering example. Other tools: MOA, Apache Spark Streaming, Apache Storm, Apache Samza, Apache SAMOA, IBM Streams, MS Azure Stream Analytics |
Michael Hahsler | Association Rule Mining | Slides | Code examples, R package arules |
… | … | … | … |
Student tutorials are available on Canvas.
Paper Reviews
Due dates are available on Canvas.
Reference Text (Recommended but not required)
- Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, Prentice Hall, 2003. Book Web Page
- Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2005. Book Web Page
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2nd edition, Springer-Verlag, 2009. Book Web Page
- Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman, Cambridge University Press, 2011. Book Web Page
Tutorials from Previous Semesters
Links
Conferences:
- ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
- ACM SIGMOD Conference (SIGMOD)
- European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)
- Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
- SIAM Conference on Data Mining (SDM)
- IEEE International Conference on Data Mining (ICDM)
Journals:
- ACM Transactions on Knowledge Discovery from Data (TKDD) - ACM
- Data Mining and Knowledge Discovery (DAMI) - Springer-Verlag
- Data & Knowledge Engineering (DKE) - Elsevier
- Knowledge and Information Systems: An International Journal - Springer-Verlag
- IEEE Transactions on Knowledge and Data Engineering (TKDE) - IEEE
- Intelligent Data Analysis: An International Journal - IOS Press
- Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery - Wiley
Digital Libraries:
- ACM Digital Library
- CS Digital Library (IEEE Comp. Society)
- Google Scholar
- CiteSeerX
- arXiv.org Computer Science
Data Mining Competitions and Data Sets
- KDnuggets: Competitions
- Kaggle Competitions
- KDnuggets: Datasets
- UCI KDD Archive
- Yahoo! Webscope
- Dallas Open Data
- U.S. Government’s open data
High-Performance Computing
- Apache Hadoop: MapReduce, Pig, Hive (Cloudera)
- Storm (reliable processing of unbounded streams of data in real time)
- Spark (fast and general engine for large-scale data processing)
- HPCC Systems (by LexisNexis)
- Deep Learning with H2O and R