Course syllabus

Data Science Concepts and Analysis

Course listing

PSTAT100

Updated

April 2023

Staff and class meetings

Instructor: Trevor Ruiz

Teaching assistants: Mengye Liu, Harry Yu, Gabrielle Salo

Class meetings: M-W 12:30pm – 1:45pm Buchanan 1920

Section meetings:

  • M 2:00pm – 2:50pm Phelps 1525 (Mengye)
  • M 3:00pm – 3:50pm Phelps 1525 (Mengye)
  • M 4:00pm – 4:50pm Phelps 1513 (Harry)
  • M 5:00pm – 5:50pm Phelps 1525 (Harry)
  • M 6:00pm – 6:50pm Phelps 1525 (Gabrielle)

Office hours:

  • Mengye M 9:00am – 11:00am on Zoom
  • Gabrielle Tu 7:00pm – 8:00pm on Zoom
  • Harry W 2:00pm – 4:00pm Building 434 Room 113
  • Trevor W 2:00pm – 3:00pm ILP 2207

Content and materials

Data Science Concepts and Analysis (PSTAT100) is a hands-on introduction to data science intended for intermediate-level students from any discipline with some exposure to probability and basic computing skills, but few or no upper-division courses in statistics or computer science. The course introduces central concepts in statistics – such as sampling variation, uncertainty, and inference – in an applied setting together with techniques for data exploration and analysis. Applications emphasize end-to-end data analyses. Course activities model standard data science workflow practices by example, and successful students acquire programming skills, project management skills, and subject exposure that will serve them well in upper-division courses as well as in independent research or projects.

Catalog description

Overview of data science key concepts and the use of tools for data retrieval, analysis, visualization, and reproducible research. Topics include an introduction to inference and prediction, principles of measurement, missing data, and notions of causality, statistical “traps”, and concepts in data ethics and privacy. Case studies will illustrate the importance of domain knowledge. Credit units: 4.

Prerequisites

  • Probability and Statistics I (PSTAT 120A)
  • Linear Algebra (MATH 4A)
  • Prior experience with Python or another programming language (CMPSC 9 or CMPSC 16)

Learning outcomes

Successful students will establish foundational data science skills:

  • critical assessment of data quality and sampling design
  • retrieval, inspection, and cleaning of raw data
  • exploratory, descriptive, visual, and inferential techniques
  • interpretation and communication of results in context

These skills will be discussed in depth during course lectures; students will practice them through lab activities, homework assignments, and project work.

Assessments

Attainment of course learning outcomes will be measured by assessment of submitted work. Submitted work falls into four categories:

  • Labs will be assigned in most weeks. These are structured coding assignments with small exercises throughout that introduce the programming skills needed to complete homework assignments.
  • Homeworks will be assigned biweekly. These are fairly involved assignments which apply concepts and techniques from the lectures and programming skills from the labs to real data sets in order to reproduce an analysis and answer substantive questions. Collaboration is encouraged, but students must write up and submit their own work individually.
  • Mini projects will be assigned biweekly in alternation with homeworks. These assignments prompt students to use skills from the course in an unstructured setting to answer high-level questions pertaining to one or more datasets. Mini projects should be completed collaboratively.
  • A course project will be assigned requiring students to carry out an open-ended data analysis. This will be completed in teams.

Overall scores in the course will be calculated for each student as the weighted average of scores on all submitted work; the relative weighting and letter grade assignments will depend entirely on the score distribution of the class as a whole and as such reflect each student’s performance relative to their peers.
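
As a rough illustration of the weighted-average calculation described above, the sketch below combines category scores using weights. The weights and scores shown are hypothetical placeholders chosen only for the example; the syllabus does not state the actual weighting, and the function name is likewise an assumption.

```python
def weighted_average(scores, weights):
    """Weighted average of category scores (each on a 0-100 scale)."""
    total_weight = sum(weights[category] for category in scores)
    weighted_sum = sum(scores[category] * weights[category] for category in scores)
    return weighted_sum / total_weight


# Hypothetical weights (percent of overall score) -- NOT the course's actual weighting.
example_weights = {"labs": 20, "homeworks": 30, "mini_projects": 20, "course_project": 30}

# Hypothetical category scores for one student.
example_scores = {"labs": 90, "homeworks": 80, "mini_projects": 85, "course_project": 70}

overall = weighted_average(example_scores, example_weights)
print(overall)  # 80.0
```

Because letter grades depend on the score distribution of the class, an overall score computed this way does not by itself determine a letter grade.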

Schedule

The tentative topic and assignment schedule is given below. Assignments are indicated by due date: all assignments are due by Monday 11:59pm in the week indicated. Late submissions are allowed, with a possible penalty, for up to 48 hours.

The schedule is subject to change based on the progress of the class.

Week  Topic                          Lab  Homework  Project
1     Data science life cycle        -    -         -
2     Tidy data                      L0   -         -
3     Sampling and bias              L1   -         -
4     Statistical graphics           L2   H1        -
5     Kernel density estimation      L3   -         MP1
6     Principal components           L4   H2        -
7     Simple regression              -    -         -
8     Multiple regression            L5   H3        -
9     Classification and clustering  L6   -         MP2
10    Case study                     L7   H4        -
11    Finals                         -    -         CP

  • L: lab
  • H: homework
  • MP: mini project
  • CP: course project

Materials

The course website will link to all course content and resources. Readings for the course will draw on multiple sources.

Software

Computing will be hosted via the course LSIT server pstat100.lsit.ucsb.edu. Students need only a web browser and stable internet connection to complete all course work. It is strongly recommended that students download backup copies of their assignments from the LSIT server.

Interested students are encouraged to install the software needed to open, edit, and execute notebooks on their own machine.

Managing package installations will require some (straightforward) use of the package installer pip or pip3 in the terminal to retrieve/install packages from the Python Package Index repository. Documentation for specific packages (or a Google search) will indicate the appropriate pip command.
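
For students working on their own machine, the following is a minimal sketch of one way to check which packages still need to be installed before running a notebook. The package names listed are assumptions used for illustration, not the course's official requirements, and the helper name is hypothetical. Note that a package's import name can differ from its name on the Python Package Index, so the package documentation remains the authority on the exact pip command.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical package list -- substitute the packages a given notebook imports.
wanted = ["numpy", "pandas"]

for name in missing_packages(wanted):
    # Print (rather than run) a pip command to try in the terminal.
    print(f"python -m pip install {name}")
```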

Policies

Communication

There are two primary means of communication outside of scheduled class meetings: office hours and a discussion board.

Course staff have limited availability via email. Course staff will make every effort to respond to individual communication within 48 weekday-hours on the following (or similar) matters:

  • accommodations/extensions due to personal circumstances;
  • logistical issues such as access to materials or missing scores;
  • general advising.

Email should not be used to ask content questions or submit assignments (unless specifically requested). Emails related to the following (or similar) matters may not receive replies and should be redirected:

Topic                          Redirect to…
Troubleshooting code           Discussion board
Checking answers               Office hours or discussion board
Clarifying assignment content  Office hours or discussion board
Assignment submission          Gradescope
Re-evaluation request          Gradescope

Expected time commitment

The course is 4 credit units; each credit unit corresponds to an approximate time commitment of 3 hours per week. Students should therefore expect to allocate 12 hours per week to the course on average. Course staff are available to help any student who is spending considerably more time than this on the class to balance the workload.

Scores and grades

Students can monitor their scores on submitted work on Gradescope to help ensure fair assignment of course grades. On any individual assignment, re-evaluation can be requested within one week of receiving a score. Requests made more than one week after publication of scores will be considered only on a discretionary basis.

Determination of letter grade assignments is made entirely at the discretion of the instructor based on the assessments outlined above and consistent with university policy. Students are not permitted to negotiate their grades, and are discouraged from requesting audits, recalculations, or verification of self-calculations after the course has concluded. The instructor is under no obligation to share the details of grade calculations with students or to respond to such requests.

If at the end of the course a student believes their grade was unfairly assigned, either due to discrimination or without basis in coursework, they are entitled to contest it according to the procedure outlined here.

Conduct

Students are expected to uphold the student code of conduct and to maintain integrity. All individually-submitted work must be an honest reflection of individual effort. Evidence of dishonest conduct will be discussed with the student(s) involved and reported to the Office of Student Conduct (OSC). Depending on the nature of the evidence and the violation, penalty in the course may range from a warning to loss of credit to automatic failure. For a definition and examples of dishonesty, a discussion of what constitutes an appropriate response from faculty, and an explanation of the reporting and investigation process, see the OSC page on academic integrity.

Deadlines and late work

There is a one-hour grace period on all submission deadlines. After that, work may be submitted within 48 hours of the original deadline (not the deadline plus grace period) and will be considered late. Every student can submit two late assignments without penalty. Subsequent late submissions will be evaluated for 75% credit.
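
The late policy above can be read as a small decision rule. The sketch below is one possible interpretation, not an official grading script: the function name, the inclusive treatment of the boundaries, and the choice to return None for work submitted after the 48-hour window (a case the policy does not describe) are all assumptions.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(hours=1)   # submissions within this window count as on time
LATE_WINDOW = timedelta(hours=48)   # measured from the original deadline, not deadline + grace
FREE_LATE_SUBMISSIONS = 2           # late assignments allowed without penalty
LATE_CREDIT = 0.75                  # credit for subsequent late submissions


def credit_fraction(deadline, submitted, prior_late_count):
    """Fraction of credit for a submission, or None if outside the late window.

    prior_late_count is the number of late submissions the student has
    already made on earlier assignments.
    """
    if submitted <= deadline + GRACE_PERIOD:
        return 1.0
    if submitted <= deadline + LATE_WINDOW:
        return 1.0 if prior_late_count < FREE_LATE_SUBMISSIONS else LATE_CREDIT
    return None  # assumption: the policy does not say how later work is handled


# Hypothetical deadline used only for illustration.
deadline = datetime(2023, 4, 17, 23, 59)
print(credit_fraction(deadline, deadline + timedelta(minutes=30), 0))  # 1.0 (grace period)
print(credit_fraction(deadline, deadline + timedelta(hours=5), 2))     # 0.75 (third late submission)
```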

Accommodations

Reasonable accommodations will be made for any student with a qualifying disability. Such requests should be made through the Disabled Students Program (DSP). More information, instructions on how to access accommodations, and information on related resources can be found on the DSP website.

Feedback

Toward the end of the term students will be given an opportunity to provide feedback about the course via ESCI. This feedback is valuable for improvement of the course in future terms, and students are strongly encouraged to provide thoughtful course evaluations. The identities of student respondents to ESCI surveys are not disclosed to instructors.