CMSC320 – Fall 2019

Introduction to Data Science

Data Science!?

Instructor: John P. Dickerson
TAs: Nitin Balachandran, Brian Brubach, Mucong Ding, Samuel Dooley, Saeed Hadadan, Vishal Hundal, Alexander Mendelsohn, Vedant Nanda, Aviva Prins, Candice Schumann, Hanyu Wang, Yukun Zheng
Lectures: Tuesday and Thursday, 5:00–6:15 PM, IRB 0324

Description of Course

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism—just ask Nate Silver!

This course focuses on (i) data management systems, (ii) exploratory and statistical data analysis, (iii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly available final project that will strengthen course participants' data science portfolios.

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

Requirements

Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.

There will be one written, in-class midterm examination. There will not be a final examination; rather, in the interest of building students' public portfolios, and in the spirit of "learning by doing", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here (subject to change!), the tutorial will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.
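To make the shape of such a tutorial concrete, here is a minimal sketch of the scrape-then-explore flow, not a template for the actual project. It makes several assumptions: the HTML snippet, the column names, and the helper names (`TableParser`, `scrape`, `summarize`) are all hypothetical, and it uses only the Python standard library so that it runs anywhere. A real tutorial would fetch a live page and would more likely use the libraries covered in lecture, such as BeautifulSoup and pandas.

```python
from html.parser import HTMLParser
from statistics import mean, median

# Hypothetical stand-in for a scraped page; a real tutorial would fetch live HTML.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>incidents</th></tr>
  <tr><td>A</td><td>12</td></tr>
  <tr><td>B</td><td>30</td></tr>
  <tr><td>C</td><td>18</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def scrape(html):
    """Step 1: turn the HTML table into a list of records keyed by header."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def summarize(records, column):
    """Step 2: a very small exploratory summary of one numeric column."""
    values = [float(r[column]) for r in records]
    return {"n": len(values), "mean": mean(values), "median": median(values)}


records = scrape(SAMPLE_HTML)
summary = summarize(records, "incidents")
print(summary)
```

The final step of a tutorial, turning a summary like this into an insight for a reader, is prose rather than code, so it is left out of the sketch.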

Final grades will be calculated as:

This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: john@cs.umd.edu!

Office Hours & Communication

For course-related questions, please use Piazza to communicate with your fellow students, the TAs, and the course instructors. For private correspondence or special situations (e.g., excused absences, DDS accommodations, etc.), please email John with [CMSC320] in the email subject line.

Office Hours
Human | Time | Location
Nitin Balachandran | Monday 3PM-5PM | AVW 1120
Brian Brubach | Wednesday 10AM-12PM | AVW 1120
Mucong Ding | Friday 10AM-12PM | AVW 1120
John Dickerson | By appointment; please email John with [CMSC320] in the email subject line. | IRB 4128
Samuel Dooley | Tuesday and Thursday 2PM-3PM | AVW 1120
Saeed Hadadan | Monday 12PM-2PM | AVW 1120
Vishal Hundal | Monday 11AM-12PM | AVW 1120
Alexander Mendelsohn | Wednesday 4PM-6PM | AVW 1120
Vedant Nanda | Friday 12PM-2PM | AVW 1120
Aviva Prins | Thursday 8:30AM-10:30AM | AVW 1120
Candice Schumann | Monday 9AM-11AM | AVW 1120
Hanyu Wang | Wednesday 12PM-2PM | AVW 1120
Yukun Zheng | Tuesday and Thursday 3PM-4PM | AVW 1120

University Policies and Resources

Policies relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics that are addressed in these various policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, copyright and intellectual property.

Course evaluations

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.


Schedule

(Schedule subject to change as the semester progresses!)
# Date Topic Reading Slides Lecturer Notes
1 8/27 Introduction What the Fox Knows. pdf, pptx Dickerson Sign up on Piazza!
Part I: Data Collection, Storage, & Management
2 8/29 What is Data & Lightning Python Overview Anaconda's Test Drive. pdf, pptx Dickerson
3 9/3 Scraping Data with Python "What happens when you type google.com into your browser's address bar?" pdf, pptx Dickerson PDF download script from class: link; Extra reading/quick tutorial on using BeautifulSoup: link
4 9/5 NumPy & SciPy, & Best Practices Introduction to pandas. pdf, pptx Dickerson Pandas tutorials: link
5 9/10 Data Wrangling I: Pandas & Tidy Data Hadley Wickham. "Tidy Data." pdf, pptx Dickerson Hould's Tidy Data for Python
6 9/12 Data Wrangling II: Tidy data & SQL Derman & Wilmott's "Financial Modelers' Manifesto." pdf, pptx Dickerson SQLite: link; pandasql library: link
7 9/17 Version Control & Git pdf, pptx Schumann Simons Institute
8 9/19 Missing Data Pandas tutorial on working with missing data. pdf, pptx Dickerson Scikit-learn's imputation functionality: link
9 9/24 Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity Resolution Data Cleaning: Problems and Current Approaches (Note: this is a reference piece; please don't read the whole thing!) pdf, pptx Dickerson Wikipedia article on outliers
10 9/26 Exploratory Data Analysis: Summary Statistics, Transformations, & Visualization John W. Tukey: His Life and Professional Contributions. pdf, pptx Dickerson Seaborn visualization library for Python: link
11 10/1 Data Wrangling Wrap-Up: Summary Statistics, Transformations, & Graphs Introduction to GraphQL: link pdf, pptx Dickerson NetworkX: link; Rosh Hashanah (9/29–10/1)
12 10/3 Project Day! Rubric for final project: link pdf, pptx Simons Institute; No formal lecture held!
13 10/8 Graphs pdf, pptx Dickerson Yom Kippur (10/8–10/9); GraphQL language: link
14 10/10 Natural Language I NLTK Book. pdf, pptx Dickerson Python Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link
15 10/15 Natural Language II Continued from last class ... pdf, pptx Dickerson Continued from last class ...
16 10/17 Midterm Review & Introduction to Machine Learning I Midterm review: pdf, pptx; Lecture slides: pdf, pptx Dickerson This is the last lecture with content that can be included on the midterm.
17 10/22 Automatically Learning how to Avoid Censorship Bock et al. Geneva: Evolving Censorship Evasion Strategies. Levin INFORMS Annual Meeting
18 10/24 Midterm Dickerson Bring a pen!
19 10/29 Introduction to Machine Learning II Hal Daumé III. A Course in Machine Learning. pdf, pptx Dickerson Scikit-learn cheat sheet: link
20 10/31 Decision Trees and Random Forests Russell & Norvig's Chapter 18 lecture slides: pdf, pptx Dickerson Scikit-learn's basic decision tree functionality: link; Bart Selman's CS4700: link
21 11/5 Random Forests, K-NN pdf, pptx Dickerson
22 11/7 Practical Issues I: Overfitting, Cross-validation, Regularization pdf, pptx Dickerson xkcd on overfitting: link
23 11/12 Practical Issues II: Feature Engineering, PCA, Clustering, Association Rules pdf, pptx Dickerson
24 11/14 Scaling It Up I pdf, pptx Dickerson
25 11/19 Scaling It Up II; Hypothesis Testing Recap; Data Science Ethics & Best Practices I Clarifications about p-values. pdf, pptx Dickerson
26 11/21 Data Science Ethics & Best Practices II The Atlantic. "Everything We Know About Facebook's Secret Mood Manipulation Experiment" pdf, pptx Dickerson SIGCOMM paper that passed IRB review but is widely seen as unethical: link
27 11/26 Data Science Ethics & Best Practices III Apple's brief overview of differential privacy: pdf, pptx Dickerson
11/28 Thanksgiving Break
28 12/3 Barocas, Hardt, & Narayanan. Fairness in Machine Learning. Dickerson
29 12/5 Debugging Data Science, & Data Science in Industry pdf, pptx Dickerson Last class :(.
Final 12/16 Final Exam Date Final versions of tutorials must be posted by 4:00PM, the exam time. Instructions & rubric: link

Mini-Projects né Homework

In addition to the tutorial to be posted publicly at the end of the semester, there will be four "mini-projects" assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.

Posting solutions publicly online without the staff's express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.

(Assignments will appear over the course of the semester.)
# Description Date Released Date Due Project Link
0 Setting Things Up August 27 September 4 link
1 Fly Me To The Moon September 10 September 25 link
2 Moneyball September 27 October 16 link
3 Fact Tank October 31 November 25 (extended from November 20) link
4 Baltimore Crime November 21 December 6 link

Additional Administrative Information

Excused Absences

Missing an exam for reasons such as illness, religious observance, participation in required university activities, or family or personal emergency (such as a serious automobile accident or close relative’s funeral) will be excused so long as the absence is requested in writing at least 2 days in advance and the student includes documentation that shows the absence qualifies as excused; a self-signed note is not sufficient, as exams are Major Scheduled Grading Events. For this class, such events are the final project assessment and the midterm, which will be due on the dates listed in the schedule above. The final exam is scheduled according to the University Registrar.

For medical absences, you must furnish documentation from the health care professional who treated you. This documentation must verify dates of treatment and indicate the timeframe that the student was unable to meet academic responsibilities. In addition, it must contain the name and phone number of the medical service provider to be used if verification is needed. No diagnostic information will ever be requested. Note that simply being seen by a health care professional does not constitute an excused absence; it must be clear that you were unable to perform your academic duties.

It is the University’s policy to provide accommodations for students with religious observances conflicting with exams, but it is your responsibility to inform the instructor in advance of intended religious observances. If you have a conflict with a planned exam, you must inform the instructor prior to the end of the first two weeks of the class.

The policies for excused absences do not apply to project assignments. Projects will be assigned with sufficient time to be completed by students who have a reasonable understanding of the necessary material and begin promptly. In cases of extremely serious documented illness of lengthy duration or other protracted, severe emergency situations, the instructor may consider extensions on project assignments, depending upon the specific circumstances.

Besides the policies in this syllabus, the University’s policies apply during the semester. Various policies that may be relevant appear in the Undergraduate Catalog.

If you experience difficulty during the semester keeping up with the academic demands of your courses, you may consider contacting the Learning Assistance Service in 2201 Shoemaker Building at (301) 314-7693. Their educational counselors can help with time management issues, reading, note-taking, and exam preparation skills.

Right to Change Information

Although every effort has been made to be complete and accurate, unforeseen circumstances arising during the semester could require the adjustment of any material given here. Consequently, given due notice to students, the instructors reserve the right to change any information on this syllabus or in other course materials. Such changes will be announced and prominently displayed at the top of the syllabus.

University of Maryland Policies for Undergraduate Students

Please read the university’s guide on Course Related Policies, which provides you with resources and information relevant to your participation in a UMD course.