{ "cells": [ { "cell_type": "code", "execution_count": 2, "id": "immediate-christopher", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "from urllib.parse import urlparse" ] }, { "cell_type": "code", "execution_count": 3, "id": "insured-playlist", "metadata": {}, "outputs": [], "source": [ "r = requests.get('http://cmsc320.github.io')" ] }, { "cell_type": "code", "execution_count": 5, "id": "comparative-underwear", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'\\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n Introduction to Data Science\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n
\\n
\\n \\n

CMSC320 – Spring 2021

\\n
\\n \\n
\\n

Introduction to Data Science

\\n \\n \\n \\n \\n \\n \"Data \\n

\\n

\\n Instructor: Jos\\xc3\\xa9 Manuel Calder\\xc3\\xb3n Trilla
\\n TAs:\\n Anubhav,\\n Kishan Kolur,\\n Jerry Peng,\\n Noor Singh,\\n Yow-Ting Shiue,\\n Qingyang Tan,\\n Julian Vanecek,\\n Amulya Velamakanni,\\n Laura Zheng\\n
\\n Lectures: Monday and Wednesday, 5:00–6:15 PM
\\n Lectures are live on Zoom & posted on ELMS
\\n
\\n

\\n
\\n \\n
\\n

Description of Course

\\n

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism—just ask Nate Silver!

\\n \\n

This course focuses on (i) data management systems, (i) exploratory and statistical data analysis, (ii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly-available final project that will strengthen course participants\\' data science portfolios.

\\n \\n

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

\\n\\n

There will be a weekly quiz on ELMS relating to the material from that week. Of these weekly quizzes (about 14), you must take 10. Each quiz is pass/fail where anything about 60% is considered a pass. Because only 10 are required, there is no need to worry if you miss a quiz! Just make sure that you take at least 10 of them over the whole semester. If you take more than 10 of the quizzes, we will only count the 10 best scores.

\\n\\n

Requirements

\\n

\\n Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we\\'ll give some Python-for-data-science primer lectures early on, so don\\'t worry if you haven\\'t used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.\\n

\\n \\n

\\n There will be one written, take-home (obviously, given COVID-19 and all) midterm examination. There will not be a final examination; rather, in the interest of building students\\' public portfolios, and in the spirit of "learning by doing", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here (subject to change!), the tutorial will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.\\n

\\n \\n

\\n Final grades will be calculated as:\\n

    \\n
  • 10% weekly quizzes
  • \\n
  • 25% midterm
  • \\n
  • 40% mini-project assignments
  • \\n
  • 25% final tutorial to be posted publicly online (instructions, subject to change!)
  • \\n \\n
\\n

\\n \\n \\n \\n

This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: jmct@umd.edu!

\\n \\n

Office Hours & Communication

\\n

We are going to use Discord as a replacement for the physical space that we don\\'t have access to during COVID. This means that it\\'s appropriate to use Discord for office hours and or short high-bandwidth discussions that would usually take place before/after class. Note that Discord is not appropriate for things like asking for accommodation or other such issues/concerns, please email Jos\\xc3\\xa9 with [CMSC320] in the email subject line.for those things.

\\n \\n

Discord, while useful, can be very \\'stream of conscious\\' and does not allows for threading when several students have the same concern. Therefore for course-related questions, please use Piazza.\\nAs mentioned above: for private correspondence or special situations (e.g., excused absences, DDS accomodations, etc), please email Jos\\xc3\\xa9 with [CMSC320] in the email subject line.

\\n \\n

\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n
Office Hours (all times EDT)
HumanTimeLocation
Anubhav2PM-4PM Monday, 10AM-12PM Wednesday; Piazza on 12PM-2PM FridayOnline
Jos\\xc3\\xa9 Manuel Calder\\xc3\\xb3n TrillaBy appointment; please email Jos\\xc3\\xa9 with [CMSC320] in the email subject line.Zoom
Kishan Kolur12PM-2PM Wednesday; Piazza on TBDOnline
Jerry Peng11AM-1PM Wednesday, 1PM-3PM Friday; Piazza on Monday 1PM-3PMOnline
Noor Singh3PM-5PM on Wednesday and Thursday; Piazza on Friday 12PM-2PMOnline
Yow-Ting Shiue10AM-12PM on Tuesday and Friday; Piazza on TBDOnline
Qingyang Tan2PM-4PM on Wednesday; Piazza on Monday 10AM-12PMOnline
Julian Vanecek4PM-6PM Friday and Sun; Piazza on 7PM-9PMOnline
Amulya Velamakanni3PM-5PM on Thursday; Piazza on 4PM-6PM SaturdayOnline
Laura Zheng1PM-3PM on Wednesday; Piazza on Monday eveningsOnline
\\n

\\n \\n \\n

University Policies and Resources

\\n \\n

Policies relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics that are addressed in these various policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, copyright and intellectual property.

\\n \\n \\n

Course evaluations

\\n \\n

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.

\\n \\n
\\n \\n \\n
\\n \\n
\\n

Schedule

\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n
(Schedule subject to change as the semester progresses!)
#DateTopicReadingSlidesLecturerNotes
101/25IntroductionWhat the Fox Knows. pdf, pptxCalder\\xc3\\xb3n TrillaSign up on Piazza!
201/27What is Data & Lightning Python OverviewAnaconda\\'s Test Drive. pdf, pptxCalder\\xc3\\xb3n Trilla
302/1Scraping Data (with Python) I"What happens when you type google.com into your browser\\'s address bar?" pdfCalder\\xc3\\xb3n Trilla
402/3Scraping Data (with Python) IIThink about data link pdfCalder\\xc3\\xb3n Trilla
402/8Scraping Data (with Python) IIIThink about data (extended edition!) link pdf, notebookCalder\\xc3\\xb3n TrillaExample APIs: WMATA API, NOAA API, Library Docs: Requests, JSON, Beautiful Soup
502/10NumPy & SciPy, & Best PracticesIntroduction to pandas. pdf, pptxCalder\\xc3\\xb3n TrillaPandas tutorials: link
602/15Data Wrangling I: Pandas & Tidy DataHadley Wickham. "Tidy Data." pdf, pptxCalder\\xc3\\xb3n TrillaHould\\'s Tidy Data for Python
702/17Data Wrangling II: Tidy data & SQLDerman & Wilmott\\'s "Financial Modelers\\' Manifesto." pdf, pptxCalder\\xc3\\xb3n TrillaSQLite: link; pandasql library: link
902/22GraphsIntroduction to GraphQL: link pdf, pptxCalder\\xc3\\xb3n TrillaNetworkX: link
1002/24Graphs, & Summary Statistics and TransformationsBackstrom & Kleinberg. "Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook," CSCW-14. arXiv link. pdf, pptxCalder\\xc3\\xb3n Trilla
1103/01Summary Statistics and Transformations, & Missing Data I pdf, pptxCalder\\xc3\\xb3n Trilla
1203/03Missing Data IIPandas tutorial on working with missing data. pdf, pptxCalder\\xc3\\xb3n TrillaScikit-learn\\'s imputation functionality: link
1303/08Missing Data III, & Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity ResolutionData Cleaning: Problems and Current Approaches (Note: this is a reference piece; please don\\'t read the whole thing!) pdf, pptxCalder\\xc3\\xb3n TrillaWikipdia article on outliers
1403/10Natural Language I: Syntax & SemanticsNLTK Book. pdf, pptxCalder\\xc3\\xb3n TrillaPython Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link
1403/15Spring Break 1 Nobody Enjoy!
1403/17Spring Break 2 Nobody Enjoy!
1503/22Natural Language II: RepresentationContinued from last class ... pdf, pptxCalder\\xc3\\xb3n TrillaContinued from last class ...
1603/24Natural Language III: Embeddings & SimilarityContinued from last class ... pdf, pptxCalder\\xc3\\xb3n TrillaContinued from last class ...
1703/29Midterm Review & TBDMidterm review: pdf, pptx; Lecture slides: pdf, pptxCalder\\xc3\\xb3n TrillaNew material from this lecture will not be included on the midterm.
1803/31MidtermCalder\\xc3\\xb3n Trilla
1904/05TBDTBDPossibly Exam review
2004/07Introduction to Machine LearningHal Daumé III. A Course in Machine Learning. pdf, pptxTBD
2104/12Decision Trees and Random ForestsRussell & Norvig\\'s Chapter 18 lecture slides: pdf, pptxCalder\\xc3\\xb3n TrillaScikit-learn\\'s basic decision tree functionality: link; Bart Selman\\'s CS4700: link
2204/14Random Forests, K-NN pdf, pptxCalder\\xc3\\xb3n Trilla
2304/19Practical Issues I: Overfitting, Cross-validation, Regularization pdf, pptxTBDxkcd on overfitting: link; Polynomial features/Interaction terms in Scikit: link
2404/21Practical Issues II: Feature Engineering, PCA, Clustering, Association RulesNguyen & Holmes. "Ten quick tips for effective dimensionality reduction," PLoS Computational Biology. pdf, pptxTBDWikipedia article on the confusion matrix: link
2504/26Practical Issues III: Recommender Systems and Association RulesBest Practices for Recommender Systems (from Microsoft). pdf, pptxTBD
2604/28Scaling It UpDean & Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters," CACM. pdf, pptxCalder\\xc3\\xb3n TrillaWikipedia on SGD: link
2705/03Data Science Ethics & Best Practices IThe Atlantic. "Everything We Know About Facebook\\'s Secret Mood Manipulation Experiment" pdf, pptxCalder\\xc3\\xb3n TrillaWhat is GDPR? (link)
2805/05Data Science Ethics & Best Practices IIApple\\'s brief overview of differential privacy: ; Barocas, Hardt, & Narayanan. Fairness in Machine Learning. pdf, pptxCalder\\xc3\\xb3n TrillaSIGCOMM paper that passed IRB review but is widely seen as unethical: link
2905/10Debugging Data Science, & Data Science in Industry pdf, pptxCalder\\xc3\\xb3n TrillaAdditional discussion of debugging models (from Cornell): link
Final05/17Final Exam DateFinal versions of tutorials must be posted by 4:00PM, the exam time.Instructions & rubric: link
\\n \\n
\\n \\n
\\n \\n
\\n

Mini-Projects né Homework

\\n

In addition to the tutorial to be posted publicly at the end of the semester, there will be four "mini-projects" assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.

\\n \\n

Posting solutions publicly online without the staff\\'s express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n
(Assignments will appear over the course of the semester.)
#DescriptionDate ReleasedDate DueProject Link
0Setting Things UpJanuary 27Febuary 3link
\\n \\n

\\n \\n \\n
\\n \\n
\\n \\n

Final Tutorials

\\n

In the spirit of "learning by doing," students created a self-contained online tutorial to be posted publicly. Tutorials could be created individually or in a small group. The intention was to create a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data. Below is a list of (most of) the tutorials created in the Fall 2020 version of CMSC320.

\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n
Most links lead to a public GitHub Page created by a student or small group in the Fall 2020 CMSC320 course; some links lead to students\\' personal websites or to a Notebook hosted on Google Colab. Project creators: if a link is missing or incorrect, please get in touch with Jos\\xc3\\xa9!
Project TitleURL
\\n \\n
\\n \\n
\\n \\n \\n
\\n

Additional Administrative Information

\\n \\n \\n \\n

Excused Absences

\\n

Missing an exam for reasons such as illness, religious observance, participation in required university activities, or family or personal emergency (such as a serious automobile accident or close relative\\xe2\\x80\\x99s funeral) will be excused so long as the absence is requested in writing at least 2 days in advance and the student includes documentation that shows the absence qualifies as excused; a self-signed note is not sufficient as exams are Major Scheduled Grading Events. For this class, such events are the final project assessment and midterms, which will be due on the dates listed in the schedule above. The final exam is scheduled according to the University Registrar.

\\n \\n

For medical absences, you must furnish documentation from the health care professional who treated you. This documentation must verify dates of treatment and indicate the timeframe that the student was unable to meet academic responsibilities. In addition, it must contain the name and phone number of the medical service provider to be used if verification is needed. No diagnostic information will ever be requested. Note that simply being seen by a health care professional does not constitute an excused absence; it must be clear that you were unable to perform your academic duties.

\\n \\n

It is the University\\xe2\\x80\\x99s policy to provide accommodations for students with religious observances conflicting with exams, but it is the your responsibility to inform the instructor in advance of intended religious observances. If you have a conflict with a planned exam, you must inform the instructor prior to the end of the first two weeks of the class.

\\n \\n

The policies for excused absences do not apply to project assignments. Projects will be assigned with sufficient time to be completed by students who have a reasonable understanding of the necessary material and begin promptly. In cases of extremely serious documented illness of lengthy duration or other protracted, severe emergency situations, the instructor may consider extensions on project assignments, depending upon the specific circumstances.

\\n \\n

Besides the policies in this syllabus, the University\\xe2\\x80\\x99s policies apply during the semester. Various policies that may be relevant appear in the Undergraduate Catalog.\\n \\n

If you experience difficulty during the semester keeping up with the academic demands of your courses, you may consider contacting the Learning Assistance Service in 2201 Shoemaker Building at (301) 314-7693. Their educational counselors can help with time management issues, reading, note-taking, and exam preparation skills.

\\n \\n \\n

Right to Change Information

\\n

Although every effort has been made to be complete and accurate, unforeseen circumstances arising during the semester could require the adjustment of any material given here. Consequently, given due notice to students, the instructors reserve the right to change any information on this syllabus or in other course materials. Such changes will be announced and prominently displayed at the top of the syllabus.\\n

\\n \\n

University of Maryland Policies for Undergraduate Students

\\n

Please read the university\\xe2\\x80\\x99s guide on Course Related Policies, which provides you with resources and information relevant to your participation in a UMD course.\\n

\\n\\n

Miscellaneous Resources

\\n

As we go through the course sometimes I will mention additional resources or next steps. None of this is required for the course, but students have asked for me to keep a record of which texts/websites I mention.

\\n\\n
    \\n
  • Python for Data Analysis covers some of the same topics that we cover in this class, but in textbook form.\\n
  • Data Science From Scratch this book is meant to give you a grounding in how some of the statistics and math could be implemented. Mostly, data scientists use off-the-shelf libraries for their mathematical routines/functionality, this books works through how you would implement some of those libraries. The idea is that this might give you a deeper understanding of what is going on. The implementations in this book are not meant to be high-performance, or industrial strength, but illustrative.\\n
  • Previous iterations of this course can be found for the Fall 2019 and Fall 2020 semesters.\\n
\\n
\\n \\n
\\n \\n
\\n \"University\\n
\\n \\n \\n \\n
\\n \\n \\n \\n \\n \\n \\n\\n'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r.content" ] }, { "cell_type": "code", "execution_count": 6, "id": "initial-better", "metadata": {}, "outputs": [], "source": [ "root = BeautifulSoup(r.content, 'html.parser')" ] }, { "cell_type": "code", "execution_count": 7, "id": "abstract-lyric", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "Introduction to Data Science\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "

CMSC320 – Spring 2021

\n", "
\n", "
\n", "

Introduction to Data Science

\n", "\n", "\n", "\n", "\n", "\n", "\"Data \n", "

\n", "

\n", "Instructor: José Manuel Calderón Trilla
\n", "TAs:\n", " Anubhav,\n", " Kishan Kolur,\n", " Jerry Peng,\n", " Noor Singh,\n", " Yow-Ting Shiue,\n", " Qingyang Tan,\n", " Julian Vanecek,\n", " Amulya Velamakanni,\n", " Laura Zheng\n", "
\n", "Lectures: Monday and Wednesday, 5:00–6:15 PM
\n", "Lectures are live on Zoom & posted on ELMS
\n", "
\n", "

\n", "
\n", "
\n", "

Description of Course

\n", "

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism—just ask Nate Silver!

\n", "

This course focuses on (i) data management systems, (i) exploratory and statistical data analysis, (ii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly-available final project that will strengthen course participants' data science portfolios.

\n", "

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

\n", "

There will be a weekly quiz on ELMS relating to the material from that week. Of these weekly quizzes (about 14), you must take 10. Each quiz is pass/fail where anything about 60% is considered a pass. Because only 10 are required, there is no need to worry if you miss a quiz! Just make sure that you take at least 10 of them over the whole semester. If you take more than 10 of the quizzes, we will only count the 10 best scores.

\n", "

Requirements

\n", "

\n", " Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.\n", "

\n", "

\n", " There will be one written, take-home (obviously, given COVID-19 and all) midterm examination. There will not be a final examination; rather, in the interest of building students' public portfolios, and in the spirit of \"learning by doing\", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here (subject to change!), the tutorial will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.\n", "

\n", "

\n", " Final grades will be calculated as:\n", "

    \n", "
  • 10% weekly quizzes
  • \n", "
  • 25% midterm
  • \n", "
  • 40% mini-project assignments
  • \n", "
  • 25% final tutorial to be posted publicly online (instructions, subject to change!)
  • \n", "\n", "
\n", "

\n", "\n", "

This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: jmct@umd.edu!

\n", "

Office Hours & Communication

\n", "

We are going to use Discord as a replacement for the physical space that we don't have access to during COVID. This means that it's appropriate to use Discord for office hours and or short high-bandwidth discussions that would usually take place before/after class. Note that Discord is not appropriate for things like asking for accommodation or other such issues/concerns, please email José with [CMSC320] in the email subject line.for those things.

\n", "

Discord, while useful, can be very 'stream of conscious' and does not allows for threading when several students have the same concern. Therefore for course-related questions, please use Piazza.\n", "As mentioned above: for private correspondence or special situations (e.g., excused absences, DDS accomodations, etc), please email José with [CMSC320] in the email subject line.

\n", "

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Office Hours (all times EDT)
HumanTimeLocation
Anubhav2PM-4PM Monday, 10AM-12PM Wednesday; Piazza on 12PM-2PM FridayOnline
José Manuel Calderón TrillaBy appointment; please email José with [CMSC320] in the email subject line.Zoom
Kishan Kolur12PM-2PM Wednesday; Piazza on TBDOnline
Jerry Peng11AM-1PM Wednesday, 1PM-3PM Friday; Piazza on Monday 1PM-3PMOnline
Noor Singh3PM-5PM on Wednesday and Thursday; Piazza on Friday 12PM-2PMOnline
Yow-Ting Shiue10AM-12PM on Tuesday and Friday; Piazza on TBDOnline
Qingyang Tan2PM-4PM on Wednesday; Piazza on Monday 10AM-12PMOnline
Julian Vanecek4PM-6PM Friday and Sun; Piazza on 7PM-9PMOnline
Amulya Velamakanni3PM-5PM on Thursday; Piazza on 4PM-6PM SaturdayOnline
Laura Zheng1PM-3PM on Wednesday; Piazza on Monday eveningsOnline
\n", "

\n", "\n", "

University Policies and Resources

\n", "

Policies relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics that are addressed in these various policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, copyright and intellectual property.

\n", "\n", "

Course evaluations

\n", "

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.

\n", "
\n", "
\n", "
\n", "

Schedule

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
(Schedule subject to change as the semester progresses!)
#DateTopicReadingSlidesLecturerNotes
101/25IntroductionWhat the Fox Knows. pdf, pptxCalderón TrillaSign up on Piazza!
201/27What is Data & Lightning Python OverviewAnaconda's Test Drive. pdf, pptxCalderón Trilla
302/1Scraping Data (with Python) I\"What happens when you type google.com into your browser's address bar?\" pdfCalderón Trilla
402/3Scraping Data (with Python) IIThink about data link pdfCalderón Trilla
402/8Scraping Data (with Python) IIIThink about data (extended edition!) link pdf, notebookCalderón TrillaExample APIs: WMATA API, NOAA API, Library Docs: Requests, JSON, Beautiful Soup
502/10NumPy & SciPy, & Best PracticesIntroduction to pandas. pdf, pptxCalderón TrillaPandas tutorials: link
602/15Data Wrangling I: Pandas & Tidy DataHadley Wickham. \"Tidy Data.\" pdf, pptxCalderón TrillaHould's Tidy Data for Python
702/17Data Wrangling II: Tidy data & SQLDerman & Wilmott's \"Financial Modelers' Manifesto.\" pdf, pptxCalderón TrillaSQLite: link; pandasql library: link
902/22GraphsIntroduction to GraphQL: link pdf, pptxCalderón TrillaNetworkX: link
1002/24Graphs, & Summary Statistics and TransformationsBackstrom & Kleinberg. \"Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook,\" CSCW-14. arXiv link. pdf, pptxCalderón Trilla
1103/01Summary Statistics and Transformations, & Missing Data I pdf, pptxCalderón Trilla
1203/03Missing Data IIPandas tutorial on working with missing data. pdf, pptxCalderón TrillaScikit-learn's imputation functionality: link
1303/08Missing Data III, & Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity ResolutionData Cleaning: Problems and Current Approaches (Note: this is a reference piece; please don't read the whole thing!) pdf, pptxCalderón TrillaWikipdia article on outliers
1403/10Natural Language I: Syntax & SemanticsNLTK Book. pdf, pptxCalderón TrillaPython Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link
1403/15Spring Break 1 Nobody Enjoy!
1403/17Spring Break 2 Nobody Enjoy!
1503/22Natural Language II: RepresentationContinued from last class ... pdf, pptxCalderón TrillaContinued from last class ...
1603/24Natural Language III: Embeddings & SimilarityContinued from last class ... pdf, pptxCalderón TrillaContinued from last class ...
1703/29Midterm Review & TBDMidterm review: pdf, pptx; Lecture slides: pdf, pptxCalderón TrillaNew material from this lecture will not be included on the midterm.
1803/31MidtermCalderón Trilla
1904/05TBDTBDPossibly Exam review
2004/07Introduction to Machine LearningHal Daumé III. A Course in Machine Learning. pdf, pptxTBD
2104/12Decision Trees and Random ForestsRussell & Norvig's Chapter 18 lecture slides: pdf, pptxCalderón TrillaScikit-learn's basic decision tree functionality: link; Bart Selman's CS4700: link
2204/14Random Forests, K-NN pdf, pptxCalderón Trilla
2304/19Practical Issues I: Overfitting, Cross-validation, Regularization pdf, pptxTBDxkcd on overfitting: link; Polynomial features/Interaction terms in Scikit: link
2404/21Practical Issues II: Feature Engineering, PCA, Clustering, Association RulesNguyen & Holmes. \"Ten quick tips for effective dimensionality reduction,\" PLoS Computational Biology. pdf, pptxTBDWikipedia article on the confusion matrix: link
2504/26Practical Issues III: Recommender Systems and Association RulesBest Practices for Recommender Systems (from Microsoft). pdf, pptxTBD
2604/28Scaling It UpDean & Ghemawat. \"MapReduce: Simplified Data Processing on Large Clusters,\" CACM. pdf, pptxCalderón TrillaWikipedia on SGD: link
2705/03Data Science Ethics & Best Practices IThe Atlantic. \"Everything We Know About Facebook's Secret Mood Manipulation Experiment\" pdf, pptxCalderón TrillaWhat is GDPR? (link)
2805/05Data Science Ethics & Best Practices IIApple's brief overview of differential privacy: ; Barocas, Hardt, & Narayanan. Fairness in Machine Learning. pdf, pptxCalderón TrillaSIGCOMM paper that passed IRB review but is widely seen as unethical: link
2905/10Debugging Data Science, & Data Science in Industry pdf, pptxCalderón TrillaAdditional discussion of debugging models (from Cornell): link
Final05/17Final Exam DateFinal versions of tutorials must be posted by 4:00PM, the exam time.Instructions & rubric: link
\n", "
\n", "
\n", "
\n", "

Mini-Projects né Homework

\n", "

In addition to the tutorial to be posted publicly at the end of the semester, there will be four \"mini-projects\" assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.

\n", "\n", "

Posting solutions publicly online without the staff's express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
(Assignments will appear over the course of the semester.)
#DescriptionDate ReleasedDate DueProject Link
0Setting Things UpJanuary 27February 3link
\n", "

\n", "
\n", "
\n", "

Final Tutorials

\n", "

In the spirit of \"learning by doing,\" students created a self-contained online tutorial to be posted publicly. Tutorials could be created individually or in a small group. The intention was to create a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data. Below is a list of (most of) the tutorials created in the Fall 2020 version of CMSC320.

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Most links lead to a public GitHub Page created by a student or small group in the Fall 2020 CMSC320 course; some links lead to students' personal websites or to a Notebook hosted on Google Colab. Project creators: if a link is missing or incorrect, please get in touch with José!
Project TitleURL
\n", "
\n", "
\n", "
\n", "

Additional Administrative Information

\n", "\n", "\n", "

Excused Absences

\n", "

Missing an exam for reasons such as illness, religious observance, participation in required university activities, or family or personal emergency (such as a serious automobile accident or close relative’s funeral) will be excused so long as the absence is requested in writing at least 2 days in advance and the student includes documentation that shows the absence qualifies as excused; a self-signed note is not sufficient as exams are Major Scheduled Grading Events. For this class, such events are the final project assessment and midterms, which will be due on the dates listed in the schedule above. The final exam is scheduled according to the University Registrar.

\n", "

For medical absences, you must furnish documentation from the health care professional who treated you. This documentation must verify dates of treatment and indicate the timeframe that the student was unable to meet academic responsibilities. In addition, it must contain the name and phone number of the medical service provider to be used if verification is needed. No diagnostic information will ever be requested. Note that simply being seen by a health care professional does not constitute an excused absence; it must be clear that you were unable to perform your academic duties.

\n", "

It is the University’s policy to provide accommodations for students with religious observances conflicting with exams, but it is your responsibility to inform the instructor in advance of intended religious observances. If you have a conflict with a planned exam, you must inform the instructor prior to the end of the first two weeks of the class.

\n", "

The policies for excused absences do not apply to project assignments. Projects will be assigned with sufficient time to be completed by students who have a reasonable understanding of the necessary material and begin promptly. In cases of extremely serious documented illness of lengthy duration or other protracted, severe emergency situations, the instructor may consider extensions on project assignments, depending upon the specific circumstances.

\n", "

Besides the policies in this syllabus, the University’s policies apply during the semester. Various policies that may be relevant appear in the Undergraduate Catalog.\n", " \n", "

If you experience difficulty during the semester keeping up with the academic demands of your courses, you may consider contacting the Learning Assistance Service in 2201 Shoemaker Building at (301) 314-7693. Their educational counselors can help with time management issues, reading, note-taking, and exam preparation skills.

\n", "

Right to Change Information

\n", "

Although every effort has been made to be complete and accurate, unforeseen circumstances arising during the semester could require the adjustment of any material given here. Consequently, given due notice to students, the instructors reserve the right to change any information on this syllabus or in other course materials. Such changes will be announced and prominently displayed at the top of the syllabus.\n", "

\n", "

University of Maryland Policies for Undergraduate Students

\n", "

Please read the university’s guide on Course Related Policies, which provides you with resources and information relevant to your participation in a UMD course.\n", "

\n", "

Miscellaneous Resources

\n", "

As we go through the course, I will sometimes mention additional resources or next steps. None of this is required for the course, but students have asked me to keep a record of the texts and websites I mention.

\n", "
    \n", "
  • Python for Data Analysis covers some of the same topics that we cover in this class, but in textbook form.\n", "
  • Data Science From Scratch: this book is meant to give you a grounding in how some of the statistics and math could be implemented. Data scientists mostly use off-the-shelf libraries for their mathematical routines and functionality; this book works through how you would implement some of those libraries. The idea is that this might give you a deeper understanding of what is going on. The implementations in this book are not meant to be high-performance or industrial-strength, but illustrative.\n", "
  • Previous iterations of this course can be found for the Fall 2019 and Fall 2020 semesters.\n", "
\n", "

\n", "
\n", "
\n", "\"University\n", "
\n", "\n", "
\n", "\n", "\n", "\n", "\n", "
" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "root" ] }, { "cell_type": "code", "execution_count": 15, "id": "hairy-machine", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\n", " Human\n", " Time\n", " Location\n", " ,\n", " \n", " Anubhav\n", " 2PM-4PM Monday, 10AM-12PM Wednesday; Piazza on 12PM-2PM Friday\n", " Online\n", " ,\n", " \n", " José Manuel Calderón Trilla\n", " By appointment; please email José with [CMSC320] in the email subject line.\n", " Zoom\n", " ,\n", " \n", " Kishan Kolur\n", " 12PM-2PM Wednesday; Piazza on TBD\n", " Online\n", " ,\n", " \n", " Jerry Peng\n", " 11AM-1PM Wednesday, 1PM-3PM Friday; Piazza on Monday 1PM-3PM\n", " Online\n", " ,\n", " \n", " Noor Singh\n", " 3PM-5PM on Wednesday and Thursday; Piazza on Friday 12PM-2PM\n", " Online\n", " ,\n", " \n", " Yow-Ting Shiue\n", " 10AM-12PM on Tuesday and Friday; Piazza on TBD\n", " Online\n", " ,\n", " \n", " Qingyang Tan\n", " 2PM-4PM on Wednesday; Piazza on Monday 10AM-12PM\n", " Online\n", " ,\n", " \n", " Julian Vanecek\n", " 4PM-6PM Friday and Sun; Piazza on 7PM-9PM\n", " Online\n", " ,\n", " \n", " Amulya Velamakanni\n", " 3PM-5PM on Thursday; Piazza on 4PM-6PM Saturday\n", " Online\n", " ,\n", " \n", " Laura Zheng\n", " 1PM-3PM on Wednesday; Piazza on Monday evenings\n", " Online\n", " ]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "root.find(\"div\", id=\"description\").find(\"table\").findAll(\"tr\")" ] }, { "cell_type": "code", "execution_count": null, "id": "arabic-confirmation", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, 
"nbformat_minor": 5 }