Introduction to Data Science

CMSC320 – Fall 2021

Instructor: John Dickerson (john@cs.umd.edu)
TAs: Hirunima Jayasekara, Kamala Varma, MG Hirsch, Alexander Gao, Tobias Janssen, Fuxiao Liu, Neel Jain, Sazan Mahbub
Lectures: Tuesday & Thursday 5:00–6:15 PM
Lectures are live in the Iribe Antonov Auditorium & posted via Panopto on ELMS

Description of Course

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism.

This course focuses on (i) data management systems, (ii) exploratory and statistical data analysis, (iii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly-available final project that will strengthen course participants' data science portfolios.

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

There will be a weekly quiz on ELMS relating to the material from that week. These quizzes are Pass/Fail. Of these weekly quizzes (about 12), you must take 10. The lowest 2 will be dropped. Because only 10 are required, there is no need to worry if you miss a quiz! Just make sure that you take at least 10 of them over the whole term. If you take more than 10 of the quizzes, we will only count the 10 best scores.
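
The "best 10" rule above can be sketched in a few lines of Python. This is only an illustration: the syllabus does not say how Pass/Fail results turn into the quiz portion of the grade, so the conversion used here (passes among the 10 best results, divided by 10) is an assumption, and the name `quiz_credit` is hypothetical.

```python
REQUIRED_QUIZZES = 10  # "you must take 10", per the policy above


def quiz_credit(results):
    """Return the fraction of quiz credit earned.

    `results` holds one boolean per quiz actually taken (True = pass).
    Only the 10 best results count; quizzes not taken earn nothing.
    ASSUMPTION: credit = passes among the best 10, divided by 10.
    """
    # Sorting booleans in descending order puts the passes first.
    best = sorted(results, reverse=True)[:REQUIRED_QUIZZES]
    return sum(best) / REQUIRED_QUIZZES


# Took all 12 quizzes and failed one: the failure is dropped.
print(quiz_credit([True] * 11 + [False]))  # 1.0
# Took only 9 quizzes, passing all of them: one required quiz is missing.
print(quiz_credit([True] * 9))  # 0.9
```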

Requirements

Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.

There will be two written midterm examinations. There will not be a final examination; rather, in the interest of building students' public portfolios, and in the spirit of "learning by doing", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here (subject to change!), the tutorial will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.

Final grades will be calculated as:

  • 10% weekly quizzes
  • 15% midterm #1
  • 15% midterm #2
  • 40% mini-project assignments
  • 20% final tutorial to be posted publicly online (instructions, subject to change!)
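
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The weights are copied from the list; the function name `final_grade`, the 0-100 score scale, and the example scores are assumptions made for illustration and are not part of the official grading procedure.

```python
# Category weights copied from the grading breakdown above (they sum to 100%).
WEIGHTS = {
    "quizzes": 0.10,
    "midterm_1": 0.15,
    "midterm_2": 0.15,
    "mini_projects": 0.40,
    "final_tutorial": 0.20,
}


def final_grade(scores):
    """Weighted average of per-category scores, each on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student; these numbers are made up for illustration.
example = {
    "quizzes": 100,
    "midterm_1": 80,
    "midterm_2": 90,
    "mini_projects": 85,
    "final_tutorial": 95,
}
print(round(final_grade(example), 2))  # 88.5
```

Because the mini-projects carry 40% of the grade, a 10-point change there moves the final grade by 4 points, more than the same change on either midterm (1.5 points each).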

This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: john@cs.umd.edu!

Office Hours & Communication

We are going to use a combination of in-person office hours, as long as those are a thing, and the Piazza forum (sign up here: https://piazza.com/umd/fall2021/cmsc320) for Q&A. This means that it's appropriate to use Piazza for asynchronous communication with the course instructors and other students, and also for short high-bandwidth discussions that could usually take place before/after class. These may still occur, but COVID will surely slow things down for anything in person. Note that Piazza is not appropriate for things like asking for accommodation or other such issues/concerns; for those things, please email John (john@cs.umd.edu) with [CMSC320] in the email subject line.

As mentioned above: for private correspondence or special situations (e.g., excused absences, DDS accommodations, etc.), please email John (john@cs.umd.edu) with [CMSC320] in the email subject line.

Office Hours (all times EDT). You can find TA office hours locations for all your courses here. Unless otherwise stated, CMSC320's assigned space is AVW 4122. Do keep an eye on Piazza, though; TAs will sometimes swap hours, shift hours, host hours on Zoom, and so on! Additionally, we have at least one TA explicitly covering Piazza on each weekday; all course staff will float around Piazza in general, too!
Human | Time | Location
Alexander Gao | 6:00–8:00PM, Wednesdays | AVW 4122
John Dickerson | By appointment; please email John (john@cs.umd.edu) with [CMSC320] in the email subject line. | Zoom or Iribe 4128, as scheduled
Fuxiao Liu | 1:00–3:00PM, Fridays | AVW 4122
Hirunima Jayasekara | 2:00–4:00PM, Wednesdays | AVW 4122
Kamala Varma | 1:00–3:00PM, Thursdays Tuesdays | AVW 4122
MG Hirsch | 10:30–12:30PM, Mondays | AVW 4122
Neel Jain | 11:00–1:00PM, Thursdays | AVW 4122
Sazan Mahbub | 10:30–12:30PM, Tuesdays | AVW 4122
Tobias Janssen | 5:00–7:00PM, Mondays | AVW 4122

University Policies and Resources

Policies relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics that are addressed in these various policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, copyright and intellectual property. The following policies (about masks, projects, group chats, and so on) are largely directly copied and pasted from standard campus and CMSC-specific guidance, so they should not come as a surprise to anyone.

Masks: President Pines provided clear expectations to the University about the wearing of masks for students, faculty, and staff. Face coverings over the nose and mouth are required while you are indoors at all times. There are no exceptions when it comes to classrooms, laboratories, and campus offices. Students not wearing a mask will be given a warning and asked to wear one, or will be asked to leave the room immediately. Students who have additional issues with the mask expectation after a first warning will be referred to the Office of Student Conduct for failure to comply with a directive of University officials.

Projects/Labs: On any graded project or lab, you are NOT allowed to exchange code. We compare each student's code with every other student's code to check for similarities. Every semester, we catch an embarrassingly high number of students that engage in cheating and we have to take them to the Honor Council.

GroupMe/Other Group Chats: We encourage students to talk about course material and help each other out in group chats. However, this does NOT include graded assignments. There have been a couple of instances in the past where students have posted pictures/source files of their code, or earlier sections have given away exam questions to later sections. Not only did this lower the curve for the earlier section because the later one did better, but the WHOLE group chat had to pay a visit to the Honor Council. It was an extremely ugly business. Remember that in a group of 200+, someone or other will blow the whistle. If you happen to be an innocent person in an innocent group chat and someone starts cheating out of the blue, leave it immediately (and better yet, say you are leaving and say you will report it).

Course evaluations

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.


Schedule

(Schedule subject to change as the semester progresses!)
# | Date | Topic | Reading | Slides | Lecturer | Notes
1 | 08/31 | Introduction | What the Fox Knows; A Proposal for Identifying and Managing Bias in Artificial Intelligence (NIST Special Publication 1270). | pdf, pptx | Dickerson | Sign up on Piazza!
2 | 09/02 | What is Data & Lightning Python Overview | Anaconda's Test Drive. | pdf, pptx | Dickerson |
3 | 09/07 | Scraping Data (with Python) I | "What happens when you type google.com into your browser's address bar?" | pdf, pptx | Dickerson | PDF download script from class: link; Extra reading/quick tutorial on using BeautifulSoup: link
4 | 09/09 | Scraping Data (with Python) II | | pdf, pptx | Dickerson | Regex helper sites: regexr.com, pythex.org, regex101.com, rubular.com
5 | 09/14 | NumPy & SciPy, & Best Practices | Introduction to pandas. | pdf, pptx | Dickerson | Pandas tutorials: link
6 | 09/16 | Data Wrangling I: Pandas & Tidy Data | Hadley Wickham. "Tidy Data." | pdf, pptx | Dickerson | Hould's Tidy Data for Python
7 | 09/21 | Data Wrangling II: Tidy data & SQL | Derman & Wilmott's "Financial Modelers' Manifesto." | pdf, pptx | Dickerson | SQLite: link; pandasql library: link
8 | 09/23 | Version Control & Git | | pdf, pptx | Dickerson |
9 | 09/28 | Graphs | Introduction to GraphQL: link; Girvan & Newman. "Community structure in social and biological networks," PNAS-02. | pdf, pptx | Dickerson | NetworkX: link
10 | 09/30 | Graphs, & Summary Statistics and Transformations | Backstrom & Kleinberg. "Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook," CSCW-14. arXiv link. | pdf, pptx | Dickerson |
11 | 10/05 | Summary Statistics and Transformations, & Missing Data I | Chapters 1 and 2 of "Methods for Handling Missing Item Values in Regression Models using the National Survey on Drug Use and Health (NSDUH)" | pdf, pptx | Dickerson | Scikit-learn's imputation functionality: link
12 | 10/07 | Missing Data II, & Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity Resolution I | Pandas tutorial on working with missing data; Data Cleaning: Problems and Current Approaches (Note: this is a reference piece; please don't read the whole thing!) | pdf, pptx | Dickerson | Wikipedia article on outliers
13 | 10/12 | Midterm Chat, & Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity Resolution II | | pdf, pptx | Dickerson | Virtual lecture; no in-person lecture!
14 | 10/14 | Midterm #1 | Taken virtually! | | Dickerson | No in-person lecture (take-home midterm during class period!)
15 | 10/19 | Natural Language I: Syntax & Semantics | NLTK Book. | pdf, pptx | Dickerson | Python Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link
16 | 10/21 | Natural Language II: Representation | Continued from last class ... | pdf, pptx | Dickerson | Continued from last class ...
17 | 10/26 | Natural Language III: Embeddings & Similarity; & Introduction to Machine Learning | Hal Daumé III. A Course in Machine Learning. | pdf, pptx | Dickerson |
18 | 10/28 | Introduction to Machine Learning II | Hal Daumé III. A Course in Machine Learning. | pdf, pptx | Dickerson |
19 | 11/02 | Decision Trees and Random Forests I | Russell & Norvig's Chapter 18 lecture slides | pdf, pptx | Dickerson | Scikit-learn's basic decision tree functionality: link; Bart Selman's CS4700: link
20 | 11/04 | Decision Trees and Random Forests II | | pdf, pptx | Dickerson | Recorded lecture; file on ELMS
21 | 11/09 | Nonlinear regression, overfitting, and regularization | | pdf, pptx | Dickerson (live via Zoom) | Virtual lecture, live via Zoom. xkcd on overfitting: link; Polynomial features/Interaction terms in Scikit: link
22 | 11/11 | Unsupervised learning, k-NN, and dimensionality reduction | Nguyen & Holmes. "Ten quick tips for effective dimensionality reduction," PLoS Computational Biology. | pdf, pptx | Dickerson | Wikipedia article on the confusion matrix: link
23 | 11/16 | Midterm review, & PCA, Recommender Systems, & Association Rules | Best Practices for Recommender Systems (from Microsoft). | pdf, pptx | Dickerson |
24 | 11/18 | Midterm #2 | Taken virtually! | | Dickerson | No in-person lecture (take-home midterm during class period!)
25 | 11/23 | PCA, Recommender Systems, & Association Rules | | pdf, pptx | Dickerson (live via Zoom) | Virtual lecture, live via Zoom.
 | 11/25 | Thanksgiving Break | | | |
26 | 11/30 | Scaling It Up I | Dean & Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters," CACM. | pdf, pptx | Dickerson | Wikipedia on SGD: link
27 | 12/02 | Scaling It Up II; Data Science Ethics & Best Practices I | The Atlantic. "Everything We Know About Facebook's Secret Mood Manipulation Experiment" | pdf, pptx | Dickerson | What is GDPR? (link)
28 | 12/07 | Data Science Ethics & Best Practices II | Apple's brief overview of differential privacy; Barocas, Hardt, & Narayanan. Fairness in Machine Learning. | pdf, pptx | Dickerson | SIGCOMM paper that passed IRB review but is widely seen as unethical: link
29 | 12/09 | Debugging Data Science, & Data Science in Industry | | pdf, pptx | Dickerson | Virtual lecture, TBD on live vs pre-recorded. Additional discussion of debugging models (from Cornell): link
Final | 12/20 | Final Project Due Date | Final versions of tutorials must be posted by 4:00PM | | | Instructions & rubric: link

Mini-Projects né Homework

In addition to the tutorial to be posted publicly at the end of the semester, there will be four "mini-projects" assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.

Posting solutions publicly online without the staff's express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.
(Assignments will appear over the course of the semester.)
# | Description | Date Released | Date Due | Project Link
0 | Setting Things Up | August 31 | September 7 | link
1 | Solar Power | September 10 | September 24 September 27 | link
2 | Moneyball | September 29 | October 18 (This date will not move, and I will not accept late assignments. Start your projects early, and do not submit at 11:59PM, please!) | link
3 | Gap Minder | October 26 | November 22 | link
4 | Baltimore Crime | November 23 | December 7 | link


Final Tutorials

In the spirit of "learning by doing," students created a self-contained online tutorial to be posted publicly. Tutorials could be created individually or in a small group. The intention was to create a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data. In late December 2021, we will list (most of) the tutorials created in the Fall 2021 version of CMSC320. Stay tuned!

Most links lead to a public GitHub Page created by a student or small group in the Fall 2021 CMSC320 course; some links lead to students' personal websites or to a Notebook hosted on Google Colab. Project creators: if a link is missing or incorrect, please get in touch with John!
Project Title URL

Additional Administrative Information

Excused Absences

Missing an exam for reasons such as illness, religious observance, participation in required university activities, or family or personal emergency (such as a serious automobile accident or close relative’s funeral) will be excused so long as the absence is requested in writing at least 2 days in advance and the student includes documentation that shows the absence qualifies as excused; a self-signed note is not sufficient as exams are Major Scheduled Grading Events. For this class, such events are the final project assessment and midterms, which will be due on the dates listed in the schedule above. The final exam is scheduled according to the University Registrar.

For medical absences, you must furnish documentation from the health care professional who treated you. This documentation must verify dates of treatment and indicate the timeframe that the student was unable to meet academic responsibilities. In addition, it must contain the name and phone number of the medical service provider to be used if verification is needed. No diagnostic information will ever be requested. Note that simply being seen by a health care professional does not constitute an excused absence; it must be clear that you were unable to perform your academic duties.

It is the University’s policy to provide accommodations for students with religious observances conflicting with exams, but it is your responsibility to inform the instructor in advance of intended religious observances. If you have a conflict with a planned exam, you must inform the instructor prior to the end of the first two weeks of the class.

The policies for excused absences do not apply to project assignments. Projects will be assigned with sufficient time to be completed by students who have a reasonable understanding of the necessary material and begin promptly. In cases of extremely serious documented illness of lengthy duration or other protracted, severe emergency situations, the instructor may consider extensions on project assignments, depending upon the specific circumstances.

Besides the policies in this syllabus, the University’s policies apply during the semester. Various policies that may be relevant appear in the Undergraduate Catalog.

If you experience difficulty during the semester keeping up with the academic demands of your courses, you may consider contacting the Learning Assistance Service in 2201 Shoemaker Building at (301) 314-7693. Their educational counselors can help with time management issues, reading, note-taking, and exam preparation skills.

Right to Change Information

Although every effort has been made to be complete and accurate, unforeseen circumstances arising during the semester could require the adjustment of any material given here. Consequently, given due notice to students, the instructors reserve the right to change any information on this syllabus or in other course materials. Such changes will be announced and prominently displayed at the top of the syllabus.

University of Maryland Policies for Undergraduate Students

Please read the university’s guide on Course Related Policies, which provides you with resources and information relevant to your participation in a UMD course.

Miscellaneous Resources

As we go through the course, I will sometimes mention additional resources or next steps. None of this is required for the course, but students have asked me to keep a record of which texts/websites I mention.

  • Python for Data Analysis covers some of the same topics that we cover in this class, but in textbook form.
  • Data Science From Scratch is meant to give you a grounding in how some of the statistics and math could be implemented. Data scientists mostly use off-the-shelf libraries for their mathematical routines/functionality; this book works through how you would implement some of those libraries yourself. The idea is that this might give you a deeper understanding of what is going on. The implementations in this book are not meant to be high-performance or industrial-strength, but illustrative.
  • Web Scraping 101 is a good overview of how web scraping works.
  • Recent iterations of this course can be found for the Fall 2019 (Dickerson), Fall 2020 (Dickerson), Spring 2021 (Calderón Trilla), and Summer 2021 (Calderón Trilla) semesters.