DS 1300 - A Practical Introduction to Data Science
- Instructor: Dr. Michael Hahsler
- Syllabus and assignments: Canvas
- All course material is provided under
Course Description
Data have become one of the most critical resources in today’s world. This course provides a first introduction to the exciting field of data science using applications and case studies from various domains (e.g., social media, marketing, sociology, engineering, digital humanities). The course will introduce data-centric thinking, including a discussion of how data are acquired, managed, manipulated, visualized, and used to support problem-solving. The fundamental practical skills necessary will be taught in class, and each step will be illustrated with small examples. Tools presented in this course include Excel and SQL, along with other state-of-the-art tools.
Prerequisites: None
Outline
Note: Complete the readings at home before class. The reading material will be part of the exam.
1. Introduction
- Purpose: Introduction to the course and the data science process including applications and ethics.
- Watch: Big data is better data (TED talk, 2014)
- Read Syllabus
- Reading: 7 Use Cases For Data Science
- Slides: Introduction to DS
- Reading: The Data Science Process
- Template: Data Science Report Template (Word).
- Additional Material:
2. Tables (Spreadsheets)
- Purpose: Tables are the most basic form of organizing data. We will learn about importing data, cleaning, formatting, functions, and pivot tables.
- Tools: Excel (install the latest version from the SMU Office 365 download page using the "Install Office" link)
- Reading:
- If you need to brush up on basic Excel skills, then go through the reference material below.
- Things to know about Excel Tables (optional: Using structured references with Excel tables)
- Things to know about Pivot tables in Excel
- Case Study: Analyzing MLB Player Data
- Instructions: Analyzing MLB Player Data (Tables in Excel)
- Data: MLB Heights and Weights, MLB Player Statistics
- Finished workbook examples: Cleaned MLB data, workbook with slicers, pivot table and visualizations.
- Reference Material:
- Excel Tutorial (Basics, Editing Worksheets, Formatting Cells, and Working with Formulas. You can also watch the video tutorials)
- Excel for Data Analysis Tutorial (Read: Importing data, cleaning data with text function, tables, conditional formatting, sorting, and filtering)
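A pivot table groups rows by a field and summarizes a value for each group. As a rough sketch of that idea outside Excel, the Python snippet below averages weight by position. The player rows and the helper name `pivot_mean` are made up for illustration; they are not taken from the MLB data used in the case study.

```python
from collections import defaultdict

# Hypothetical rows (name, team, position, weight in lb); not the real MLB data.
players = [
    ("A", "BAL", "Catcher", 200),
    ("B", "BAL", "Pitcher", 210),
    ("C", "NYY", "Catcher", 230),
    ("D", "NYY", "Pitcher", 190),
    ("E", "NYY", "Pitcher", 230),
]


def pivot_mean(rows, key_index, value_index):
    """Group rows by one column and return the mean of another column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Equivalent to a pivot table with Position as rows and Average of Weight as values.
mean_weight_by_position = pivot_mean(players, key_index=2, value_index=3)
for position, mean_weight in sorted(mean_weight_by_position.items()):
    print(f"{position}: {mean_weight:.1f}")
```

Changing `key_index` to 1 would summarize by team instead, which mirrors dragging a different field into the Rows area of an Excel pivot table.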
3. Visualization and Charts
- Purpose: Learn basic visualization skills (scatter plots, bar charts, histograms, and line charts with trend lines).
- Tools: Excel
- Watch on your own: The best stats you’ve ever seen (Hans Rosling’s TED talk, 2006)
- Slides: Visualization
- Reading:
- Example: 4 questions about weather-driven power outages, answered
- Case Study: UK Life Expectancy
- Case Study: Using Excel trendlines to predict the current M2 money supply (time series data from the US Federal Reserve)
- Information on M2: What is money supply?
- Can we use data up to the end of the 1980s to predict current money supply levels? How?
- Important takeaways:
- Which type of trendline is appropriate (linear, exponential, or polynomial)?
- We need to be careful with exceptional values or periods in the data because they can change the trendline.
- Case Study: Using Excel to create a map visualization of the GDP by US state.
- Important takeaway: Comparing “things” of different sizes often does not show what we need to see. We may have to normalize by size (e.g., divide each state's GDP by its population) for a better comparison.
- Reference Material:
- Additional Material:
- COVID-19: Data Visualizations (University of Minnesota, 2022)
- How Do You Tell A Story With Data Visualization? (Forbes, 2019)
- Beauty of Visualization (TEDGlobal, 2010)
- A Visual History of Which Countries Have Dominated the Summer Olympics (NY Times, 2016)
- U.S. Gun Deaths in 2013 (Periscopic, 2013)
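The trendline takeaways from the M2 case study in this section can be illustrated with a short sketch. It fits a linear and an exponential trendline to the first ten points of a series and then extrapolates far beyond the fitted range. The series is synthetic (7% growth per year) and is not the Federal Reserve data; the exponential fit uses least squares on the logarithm of the values, which is one common way to fit such a trend.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic series that grows 7% per year; NOT the real M2 data.
years = list(range(10))
values = [100 * 1.07 ** t for t in years]

# Linear trendline: fit the values directly.
lin_intercept, lin_slope = fit_line(years, values)

# Exponential trendline: fit a line to log(values), then transform back.
log_intercept, log_slope = fit_line(years, [math.log(v) for v in values])


def predict_linear(t):
    return lin_intercept + lin_slope * t


def predict_exponential(t):
    return math.exp(log_intercept + log_slope * t)


horizon = 30  # extrapolate far beyond the fitted range of 0..9
actual = 100 * 1.07 ** horizon
print(f"actual:      {actual:.1f}")
print(f"linear:      {predict_linear(horizon):.1f}")
print(f"exponential: {predict_exponential(horizon):.1f}")
```

Both trendlines look reasonable over the fitted range, but the linear one falls well short at the horizon because the underlying growth compounds. Real data are noisier than this example, so the choice of trendline type should still be checked against a plot of the data.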
4. Tables (Relational Databases)
- Purpose: Relational databases are the most widely used way to store, manage, and manipulate data. We will talk about database design, importing data, simple SQL queries, joins, and aggregation.
- Tools: DB Browser for SQLite (install the latest version)
- Slides:
- Examples from class: Product Database (SQLite; save and open with DB Browser), Example Queries, Cross Products and Joins
- Quizzes to check your SQL knowledge: Quiz 1, Quiz 2, Quiz 3
- Reference Material:
- SQLite Tutorial (start with SELECT Query in the menu to the left)
- SQL As Understood By SQLite
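As a small illustration of the joins and aggregation mentioned in this section, the sketch below runs a SQL query against an in-memory SQLite database through Python's built-in `sqlite3` module. The `product` and `sale` tables and their contents are hypothetical and are not the Product Database used in class; the same SQL can be pasted into the Execute SQL tab of DB Browser for SQLite.

```python
import sqlite3

# In-memory database with hypothetical tables; not the class Product Database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE sale (
        id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES product(id),
        quantity INTEGER
    );
    INSERT INTO product VALUES (1, 'Pen', 1.5), (2, 'Notebook', 4.0), (3, 'Stapler', 9.0);
    INSERT INTO sale VALUES (1, 1, 10), (2, 1, 4), (3, 2, 3);
""")

# Join each sale to its product, then aggregate units and revenue per product.
query = """
    SELECT p.name, SUM(s.quantity) AS units, SUM(s.quantity * p.price) AS revenue
    FROM product AS p
    JOIN sale AS s ON s.product_id = p.id
    GROUP BY p.name
    ORDER BY revenue DESC;
"""
rows = conn.execute(query).fetchall()
for name, units, revenue in rows:
    print(name, units, revenue)
conn.close()
```

The stapler does not appear in the result because an inner join keeps only products with at least one matching sale; a `LEFT JOIN` from `product` would keep it with empty totals.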
5. Descriptive Analytics: Data, Distributions and Correlation
- Purpose: Learn about the scales of measurement, histograms, distributions, random samples, mean, variance, and proportions.
- Tools: Excel, RapidMiner Studio (you have to sign up for a free account and may apply for a free educational license)
- Slides:
- Examples from class:
- RapidMiner process: Cleaning and preprocessing data (save the .rmp file and import the process into RapidMiner). You will need to download and add the census dataset.
- RapidMiner process: Basic statistics and visualization (save the file and import the process into RapidMiner). Needed data: census.
- Reference Material:
- Tutorial: Getting started with RapidMiner
- Videos: Get Started with RapidMiner & Machine Learning
- Regular expressions for text manipulation: Intro video to regular expressions in RapidMiner and a general introduction to regular expressions.
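The summary statistics named in this section's purpose (mean, variance, proportions, and histograms) can be sketched in a few lines of Python using only the standard library. The sample values below are invented for illustration and are not drawn from the census dataset.

```python
import statistics
from collections import Counter

# Hypothetical sample of ages and a yes/no attribute; not the census dataset.
ages = [20, 35, 41, 29, 52, 35, 47]
has_degree = [True, False, True, True, False, True, False]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
variance_age = statistics.variance(ages)  # sample variance (divides by n - 1)
std_dev_age = statistics.stdev(ages)      # square root of the sample variance

# A proportion is the share of observations with a given property.
proportion_degree = sum(has_degree) / len(has_degree)

# A histogram counts observations per bin; here each bin spans ten years.
histogram = Counter((age // 10) * 10 for age in ages)

print(f"mean={mean_age}, median={median_age}, variance={variance_age}")
print(f"std dev={std_dev_age:.2f}, proportion with degree={proportion_degree:.2f}")
for bin_start in sorted(histogram):
    print(f"{bin_start}-{bin_start + 9}: {'#' * histogram[bin_start]}")
```

Mean and variance are meaningful for interval or ratio data such as age, while a nominal attribute such as `has_degree` is summarized with counts and proportions.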
6. Predictive Analytics: Data Mining/Machine Learning
- Purpose: Understand the basic supervised and unsupervised learning techniques used in data science.
- Slides: Data Mining - Predictive Analytics (Machine Learning Videos from StatQuest)
- Examples from class:
- RapidMiner processes for Clustering: k-means, hierarchical, example
- Example: k-means visualization
- RapidMiner processes for Classification: classification, example, evaluation, compare classifiers
- RapidMiner processes for Regression: regression
- Reference Material:
- RapidMiner Tutorial: Auto Model Tutorial and Video
Textbooks [not required]
- Data Science and RapidMiner: Vijay Kotu, Data Science: Concepts and Practice, Morgan Kaufmann; 2nd edition (December 21, 2018).
- Visualization: Edward R. Tufte, The Visual Display of Quantitative Information, 2nd Edition, May 2001, ISBN: 1930824130, Graphics Press.
- SQL: Learn SQLite, tutorialspoint.
Software
- Excel (available free of charge for SMU students here)
- DB Browser for SQLite
- RapidMiner Studio (you have to sign up for a free account and may apply for a free educational license)
- Optional: SQLite JDBC Driver (How to use SQLite with RapidMiner)