Introduction to Data Science

CMSC320 – Fall 2023

Instructors: Maksym Morawski (morawski@umd.edu) [0101]
Dr. Fardina Alam (fardina@umd.edu) [0201]
Head TA: Nakyung Lee
TAs: Zoey Ki, Tianyi Xiong, Huy Nghiem, Shengjie Xu, Soumya Suvra Ghosal, Yonghan Lee, Hyemi Song, Yao-Chih Lee, Vatsal Agarwal, Aakriti Agrawal, Vikas Thoti Reddy, Svetlana Semenova, George Witt
Lectures [0101]: ESJ 0224, Tuesday & Thursday 3:30–4:45 PM
Lectures [0201]: ESJ 0224, Monday & Wednesday 3:30–4:45 PM


Course Overview

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism.

This course focuses on (i) data management systems, (ii) exploratory and statistical data analysis, (iii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly available final project that will strengthen course participants' data science portfolios.

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

Prerequisites

Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.

This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: morawski@umd.edu!

Grading

There will be six assignments, one final project, two written midterm exams, and one final.

Final grades will be calculated as:

  • 40% assignments
  • 15% midterm #1
  • 15% midterm #2
  • 15% final
  • 15% final project
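
As a rough illustration of the weighting above, the short Python sketch below computes a weighted course percentage. The function name `final_grade`, the component keys, and the example scores are hypothetical, and it assumes each component has already been reduced to a single percentage from 0 to 100 (the syllabus does not say how the six assignments are combined into the 40% component).

```python
# Weights copied from the grading breakdown above, as fractions of the final grade.
WEIGHTS = {
    "assignments": 0.40,
    "midterm_1": 0.15,
    "midterm_2": 0.15,
    "final": 0.15,
    "final_project": 0.15,
}


def final_grade(scores):
    """Return the weighted course percentage from per-component percentages (0-100).

    Assumption: `scores` holds one percentage per component; how the six
    assignments are averaged into "assignments" is not specified in the syllabus.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "assignments": 90,
    "midterm_1": 80,
    "midterm_2": 85,
    "final": 75,
    "final_project": 95,
}
print(round(final_grade(example), 2))  # prints 86.25
```

This only reflects the stated weights; any curve or rounding the instructors apply is outside the sketch.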

Late Policy

Late work gets no credit.

However, each student has four 12-hour extensions, called tokens, to use on homework. Tokens are applied automatically to late homework; no request is necessary.

See the next section about how to contact us in special circumstances. We aim to help everyone succeed.

Communication

We are going to use a combination of in-person office hours and the Piazza forum (sign up here) for Q&A. This means that it's appropriate to use Piazza for asynchronous communication with the course instructors and other students, and also for short high-bandwidth discussions that could usually take place before/after class.

Note that Piazza is not appropriate for things like asking for accommodations or other such issues/concerns. Please email:

If you have a request that fits into one of these categories and you don't write to one of these two email addresses, your request may not reach the right person or be answered in a timely manner.

If your correspondence does not fit into either of those two categories, please email an instructor (TA or professor) with [CMSC320] in the email subject line. You may also go to an instructor's office hours at the times listed below.

Please note: if you don't include [CMSC320] in your subject line when emailing instructors, your email may not be filtered correctly.

Office Hours

* All hours are EDT

Instructor Office Hours

Instructor | Time | Location
Maksym Morawski | MW 12PM-4PM; Friday by Appointment; Google Calendar | IRB 2232
Dr. Fardina Alam | F 2PM-4PM | IRB 2222

TA Office Hours

You can find TA office hours locations for all your courses here. CMSC320's assigned space is AVW 4122; if a TA is hosting hours virtually, you can find the Zoom link next to their name.

Do keep an eye on Piazza, though; TAs will sometimes swap hours, shift hours, host hours on Zoom, and so on! Additionally, we have at least one TA explicitly covering Piazza on each weekday; all course staff will float around Piazza in general, too!

The table below is mostly in time order.

THERE ARE NO OFFICE HOURS THE FIRST WEEK (AUG 28-SEP 1)

TA | Time | Location
Svetlana Semenova | M 9AM-11AM | AVW 4122
George Witt | M 11AM-1PM | AVW 4122
Huy Nghiem | M 1PM-2PM | AVW 4122
NaKyung Lee | Tu 9AM-11AM | AVW 4122
Vatsal Agarwal | Tu 11AM-12PM | AVW 4122
Soumya Suvra Ghosal | Tu 1PM-3PM | AVW 4122
Vikas Thoti Reddy | Tu 3PM-5PM | AVW 4122
Yonghan Lee | W 9AM-11AM | AVW 4122
Huy Nghiem | W 1PM-2PM | AVW 4122
Zoey Ki | Th 9AM-11AM | AVW 4122
Vatsal Agarwal | Th 11AM-12PM | AVW 4122
Tianyi Xiong | Th 1PM-3PM | AVW 4122
Hyemi Song | F 9AM-11AM | AVW 4122
Aakriti Agrawal | F 11AM-1PM | AVW 4122


Schedule

Schedule subject to change as the semester progresses!

# | Date | Topic | Reading | Slides | Notes
1 | Aug 28, 29 | Introduction | | Fardina Max | Sign up on Piazza!
2 | Aug 30, 31 | Experiment Design | Experimentation | Fardina Max |
4 | Sep 5, 6 | Git, Pandas, SQL | Pandas Demo (Fardina) | Max Fardina | HW1 Out (Sep 6)
3 | Sep 7, 11 | Data Types | | Max Fardina |
5 | Sep 12, 13 | Probability, Distributions, and Summary Stats | 6. Probability (Bayes thm, Law of tot prob) | Max Fardina |
6 | Sep 14, 18 | cont. | Central Limit Thm | | HW2 Out (Sep 18)
7 | Sep 19, 20 | Hypothesis Testing | 7. Hypothesis and Inference Hypo Testing Steps And Examples | Max Fardina | HW3 Out (Sep 20)
8 | Sep 21, 25 | Data Visualization | Colab Used Dataset | Max Fardina |
9 | Sep 26, 27 | Data Exploration I | | |
10 | Sep 28, Oct 2 | Data Exploration II | | | HW4 Out (Oct 2); HW1 Due (Oct 2)
11 | Oct 3, 4 | Data Cleaning | | |
12 | Oct 5, 9 | Feature Engineering | | | HW2 Due (Oct 9)
13 | Oct 10, 11 | Midterm I | | |
14 | Oct 12, 16 | Decision Trees | | | HW3 Due (Oct 16)
15 | Oct 17, 18 | Machine Learning | | | HW5 Out (Oct 18)
16 | Oct 19, 23 | Classification | | | HW4 Due (Oct 23)
17 | Oct 24, 25 | Regression | | |
18 | Oct 26, 30 | Neural Networks | | |
19 | Oct 31, Nov 1 | Image Processing | | | HW5 Due (Nov 1)
20 | Nov 2, 6 | NLP I | | | HW6 Out (Nov 6)
21 | Nov 7, 8 | NLP II | | |
22 | Nov 9, 13 | Review | | | Final Project Released (Nov 9)
23 | Nov 14, 15 | Midterm II | | |
24 | Nov 16, 20 | Graphs | | | HW6 Due (Nov 20)
25 | Nov 21, 27 | Time Series | | |
26 | Nov 28, 29 | Recommender Systems | | |
27 | Nov 30, Dec 4 | Soft Skills | | |
28 | Dec 5, 6 | Ethics I | | |
29 | Dec 7, 11 | Ethics II | | |
 | Dec 15? | Final Exam | | |

Deliverables

Assignments

Instructions will appear over the course of the semester. Most assignments are released on the second day that the relevant lecture material is presented and are due two or three weeks later.

# | Description | Date Released | Date Due | Project Link
Homework 1 | Git, Pandas, and SQL | Sep 6 | Oct 2 | Git Pandas SQL
Homework 2 | Statistics Review Questions | Sep 18 | Oct 9 | Link
Homework 3 | Hypothesis Testing | Sep 20 | Oct 16 | Link
Homework 4 | Data Exploration | Oct 2 | Oct 23 |
Homework 5 | Machine Learning | Oct 18 | Nov 1 |
Homework 6 | NLP | Nov 6 | Nov 20 |

Final Project

We will have one final project, posted as a page on GitHub. The project will be released November 9th. It will be due ??.

Stay tuned!


Additional Administrative Information

University Policies and Resources

Policies relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics that are addressed in these various policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, copyright and intellectual property.

Projects/Labs: On any graded project or lab, you are NOT allowed to exchange code. We compare each student's code with every other student's code to check for similarities. Every semester, we catch an embarrassingly high number of students who engage in cheating, and we have to take them to the Honor Council.

GroupMe/Other Group Chats: We encourage students to talk about course material and help each other out in group chats. However, this does NOT include graded assignments. There have been a couple of instances in the past where students posted pictures or source files of their code, or where earlier sections gave away exam questions to later sections. Not only did this hurt the earlier section on the curve, because the later section did better, but the WHOLE group chat also had to pay a visit to the Honor Council. It was an extremely ugly business. Remember that in a group of 200+, someone will blow the whistle. If you happen to be an innocent person in an innocent group chat and someone starts cheating out of the blue, leave it immediately (and better yet, say you are leaving and say you will report it).

Excused Absences

Missing an exam for reasons such as illness, religious observance, participation in required university activities, or family or personal emergency (such as a serious automobile accident or close relative’s funeral) will be excused so long as the absence is requested in writing at least 2 days in advance and the student includes documentation that shows the absence qualifies as excused; a self-signed note is not sufficient as exams are Major Scheduled Grading Events. For this class, such events are the final project assessment and midterms, which will be due on the dates listed in the schedule above. The final exam is scheduled according to the University Registrar.

For medical absences, you must furnish documentation from the health care professional who treated you. This documentation must verify dates of treatment and indicate the timeframe during which you were unable to meet academic responsibilities. In addition, it must contain the name and phone number of the medical service provider to be used if verification is needed. No diagnostic information will ever be requested. Note that simply being seen by a health care professional does not constitute an excused absence; it must be clear that you were unable to perform your academic duties.

It is the University's policy to provide accommodations for students with religious observances conflicting with exams, but it is your responsibility to inform the instructor in advance of intended religious observances. If you have a conflict with a planned exam, you must inform the instructor prior to the end of the first two weeks of the class.

The policies for excused absences do not apply to project assignments. Projects will be assigned with sufficient time to be completed by students who have a reasonable understanding of the necessary material and begin promptly. In cases of extremely serious documented illness of lengthy duration or other protracted, severe emergency situations, the instructor may consider extensions on project assignments, depending upon the specific circumstances.

Besides the policies in this syllabus, the University's policies apply during the semester. Various policies that may be relevant appear in the Undergraduate Catalog.

If you experience difficulty during the semester keeping up with the academic demands of your courses, you may consider contacting the Learning Assistance Service in 2201 Shoemaker Building at (301) 314-7693. Their educational counselors can help with time management issues, reading, note-taking, and exam preparation skills.

Right to Change Information

Although every effort has been made to be complete and accurate, unforeseen circumstances arising during the semester could require the adjustment of any material given here. Consequently, given due notice to students, the instructors reserve the right to change any information on this syllabus or in other course materials. Such changes will be announced and prominently displayed.

Course Evaluations

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.

Miscellaneous Resources

As we go through the course, I will sometimes mention additional resources or next steps. None of this is required for the course, but students have asked me to keep a record of the texts and websites I mention.

  • Python for Data Analysis covers some of the same topics that we cover in this class, but in textbook form.
  • Data Science From Scratch is meant to give you a grounding in how some of the statistics and math could be implemented. Data scientists mostly use off-the-shelf libraries for their mathematical routines; this book works through how you would implement some of those libraries yourself. The idea is that this might give you a deeper understanding of what is going on. The implementations in this book are not meant to be high-performance or industrial-strength, but illustrative.
  • Web Scraping 101 is a good overview of how web scraping works.