

Final Project

All deadlines are by midnight at the end of the listed day, so "Wed Feb 11th" means by 11:59pm that night.

| Week   | Date        | Milestone                        |
|--------|-------------|----------------------------------|
| 6      | Wed Feb 11  | Initial meeting with instructors |
| 8      | Wed Feb 25  | Project proposal due             |
| Finals | Tues Mar 17 | Final project due                |

The final project is your opportunity to apply the statistical modeling skills you've learned throughout the course to a real dataset. Your deliverable is a publication-quality Abstract, Methods, & Results document (the kind you'd submit to a journal as part of a full manuscript), rendered as a reproducible .qmd document with all supporting materials (e.g. code, data) in a GitHub repository. Your instructors will grade you based on the rubric below, which isn't designed to trick you but to help you focus on these core skills:

  1. Research Question
  • Formulate a clear, answerable research question that requires statistical analysis given an existing dataset
  • What inference(s) are you hoping to make? What assumptions might those require? How will you validate this?
  • We do not want you to collect new data or design a new experiment. Instead you should focus on acquiring, cleaning, and organizing an existing dataset to make it amenable to your research question(s).
  2. Data Analysis (see the sketch below)
  • Exploratory Data Analysis: summary statistics, visualizations, data quality assessment
  • Statistical Modeling: your hypotheses articulated as statistical model(s), with diagnostic evaluation where appropriate
  • Inference/Prediction: approaches relevant to your question, e.g. parameter inference or cross-validated prediction performance
  • Interpretation: what do the results mean? How do they relate to your assumptions? What future directions?
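
To make this concrete, here's a minimal sketch of how these pieces might fit together in code. The file path, column names, and the use of statsmodels for model fitting are all hypothetical placeholders, not requirements; substitute your own dataset and whatever modeling tools suit your question.

```python
import polars as pl
import seaborn as sns
import statsmodels.formula.api as smf

# --- Exploratory Data Analysis ---
# (placeholder path and column names; use your own dataset)
df = pl.read_csv("data/processed/study.csv")
print(df.describe())     # summary statistics for every column
print(df.null_count())   # quick data-quality check for missingness

# Visualize the relationship you plan to model
sns.lmplot(data=df.to_pandas(), x="hours_studied", y="test_score", hue="condition")

# --- Statistical Modeling ---
# Hypothesis expressed as a model; statsmodels' formula API expects pandas
model = smf.ols("test_score ~ hours_studied * condition", data=df.to_pandas()).fit()
print(model.summary())   # parameter estimates, CIs, p-values

# --- Diagnostics ---
# Inspect residuals before trusting any inference from the model
sns.residplot(x=model.fittedvalues, y=model.resid, lowess=True)
```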

1. Pick Solo or Group

You may work solo or in a group. In either case you'll use a project template that we'll provide in a few weeks, configured like our labs and HWs (Python environment, Quarto, Notebooks). Groups submit a single project, but you must:

  • Include a CRediT author statement describing each member’s contributions
  • Demonstrate evidence of collaboration via GitHub — each person must contribute at least 1 commit/PR to the project repository

2. Pick a Dataset

Important

In all cases, you must choose a new question to answer, not something you or someone else has previously analyzed. In other words, no reproducibility/replication-focused projects. Think about extending prior work instead.

Regardless of solo or group, you must choose one of the following options for your dataset:

  1. Existing dataset from your 201A or first-year project
  2. Existing dataset in your lab
  3. New open dataset
  4. New simulated dataset

For option 3 we’ve put together a large list of open datasets that you can browse based on your research interests. By no means should you feel limited to using a dataset on this list! These are just to help you get started.

For option 4 you should discuss with your instructors what you’re thinking and why you want to pursue this option.
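
If you're leaning toward option 4, part of the appeal is that you control the data-generating process, so you can check whether your analysis recovers the parameters you built in. Below is a minimal sketch of what that might look like; every name and parameter value here is made up for illustration.

```python
import numpy as np
import polars as pl

rng = np.random.default_rng(seed=201)  # fixed seed so the simulation is reproducible

# Made-up data-generating process: response time (ms) increases with trial
# difficulty, with subject-level variation in baseline speed
n_subjects, n_trials = 30, 100
difficulty = rng.uniform(0, 1, size=(n_subjects, n_trials))
subject_baseline = rng.normal(500, 50, size=(n_subjects, 1))
rt = subject_baseline + 200 * difficulty + rng.normal(0, 40, size=(n_subjects, n_trials))

# Flatten into a tidy dataframe and save it alongside the generating script
sim = pl.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_trials),
    "difficulty": difficulty.ravel(),
    "rt": rt.ravel(),
})
sim.write_csv("data/simulated_rt.csv")
```

Because the true slope (here, 200 ms per unit difficulty) is known, you can evaluate how well your modeling approach recovers it under different noise levels or sample sizes.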

3. Work & Submit

Both your project proposal and your final submission will be in the form of a GitHub repository, just like our labs and HWs. We'll provide a template soon that you'll use to submit your initial project proposal. Then you'll update the same repository to commit and push your work. When you're done, your final commit should include:

  1. Quarto document
  • A .qmd file and rendered PDF that includes:
  • title, authors, and relevant metadata
  • Abstract
  • Methods
  • Results with relevant figures & tables
  • References
  2. Data files and analysis scripts (see the sketch after this list)
  • You should update your project README.md with informative descriptions of all relevant analysis & data files. Feel free to perform your core work using Marimo notebooks if you find those easier. Just make sure they are documented and your final submission is still a Quarto .qmd file.
  • You should make sure to separate raw data files from any preprocessed/cleaned data by making new files
  • If datasets are too large for GitHub (>5 MB), provide a download link and retrieval instructions in the README so someone can still reproduce your work
  3. Reproducible Python environment
  • While we'll configure the template with all the Python packages we've been using throughout the course, feel free to use additional tools if you want. Just make sure to use uv add so they're tracked automatically in your pyproject.toml file and a collaborator/reviewer gets them when they git clone and run uv sync.
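
To make the raw-vs-cleaned separation in item 2 concrete, here's a minimal preprocessing sketch. The folder layout (data/raw/, data/processed/), file names, and cleaning steps are assumptions for illustration; adapt them to your project and document whatever you actually do in your README.

```python
from pathlib import Path
import polars as pl

# Hypothetical layout: raw files are never edited in place; every cleaning
# step writes a *new* file under data/processed/
raw_path = Path("data/raw/survey_responses.csv")
processed_dir = Path("data/processed")
processed_dir.mkdir(parents=True, exist_ok=True)

cleaned = (
    pl.read_csv(raw_path)
    .drop_nulls(subset=["age", "score"])             # document every exclusion
    .filter(pl.col("age") >= 18)                     # e.g., drop underage participants
    .with_columns(pl.col("score").cast(pl.Float64))  # enforce expected types
)
cleaned.write_csv(processed_dir / "survey_responses_clean.csv")
```

Keeping this as a standalone, documented script means a reviewer can regenerate your cleaned files from the raw data with a single command.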

Collaboration & GenAI Policy

Whether working solo or in a group, you are welcome to discuss ideas, troubleshoot code, and exchange feedback with classmates outside your group, but all submitted writing and analysis must be your own (or your group's). If you use GenAI tools, make sure to review the course GenAI policy (i.e. use it as a tool, not a cheat-code) and include a transcript with your submission.

Rubric

General Formatting & Prose (10%)

  • Excellent (9-10): Publication-quality writing: clear, concise, and error-free. APA-style reporting of statistics (e.g., F(1, 48) = 12.3, p < .001, d = 0.45). Figures and tables are properly labeled, referenced in text, and enhance the narrative. Complete, properly formatted references.
  • Good (7-8): Generally clear writing with minor errors. Statistics mostly reported in APA style with occasional inconsistencies. Figures/tables present with minor labeling issues. References present but may have formatting gaps.
  • Adequate (5-6): Understandable but unpolished writing. Inconsistent statistical reporting format. Figures/tables incomplete or poorly integrated. Several grammatical errors or missing references.
  • Needs Improvement (0-4): Unclear or confusing writing. Statistics poorly reported or missing key values. Missing or unprofessional figures/tables. Pervasive grammatical issues.

Methods Section Clarity & Completeness (30%)

  • Excellent (27-30): Data: complete description of the dataset (source, how it was collected, sample characteristics, sample size), with any exclusions or missing-data handling explained. Variables: all variables clearly operationalized (what was measured, how, units/scales). Analytical Approach: statistical models justified with clear rationale; specifies assumption checks, alpha levels, and software/packages used. Sufficient detail for an independent researcher to reproduce the analysis.
  • Good (21-26): All major components present with some details missing. Variables adequately described. Analytical choices mostly justified. Minor gaps that wouldn't prevent replication.
  • Adequate (15-20): Key information present but lacks depth. Some variables poorly described. Limited justification for analytical choices. Replication would be challenging.
  • Needs Improvement (0-14): Major components missing or inadequate. Insufficient detail for understanding the analysis. No justification for analytical decisions. Replication not possible.

Results Interpretation, Assumptions, & Limitations (30%)

  • Excellent (27-30): Assumptions: explicitly tests all relevant assumptions (normality, homogeneity, independence, linearity); appropriate diagnostic plots; states whether assumptions are met and describes remedial actions if not. Results: all analyses clearly reported with complete statistics (test statistics, df, p-values, effect sizes, CIs); figures and tables enhance understanding. Interpretation: accurate; avoids over-interpreting null results; appropriately cautious with causal language. Limitations: thoughtful discussion of methodological and statistical limitations, including threats to validity.
  • Good (21-26): Most assumptions checked appropriately. Results clearly reported with minor omissions. Interpretation generally sound with appropriate caution. Limitations discussed but may lack depth or miss key issues.
  • Adequate (15-20): Basic assumption checking but incomplete. Results reported but missing key statistics (e.g., no effect sizes or CIs). Some understanding shown but may over-interpret. Limitations mentioned but superficial.
  • Needs Improvement (0-14): Assumptions not checked or inadequately addressed. Results poorly reported or incomplete. Misinterpretation of findings. Limitations absent or lacking understanding.

Code Quality & Reproducibility (15%)

  • Excellent (14-15): Code runs without errors and reproduces all results. Well-organized with clear headers and comments. Follows best practices (relative paths, no hardcoding). Includes package versions and random seed where applicable. Data cleaning transparent and justified. Efficient and readable code.
  • Good (11-13): Code runs with minimal adjustments. Generally well-organized with adequate comments. Reproduces main results. Minor issues with paths or organization.
  • Adequate (8-10): Code runs but requires troubleshooting. Limited organization or comments. Reproduces some but not all results. Hardcoded paths or unclear workflow.
  • Needs Improvement (0-7): Code doesn't run or has major errors. Poorly organized and difficult to follow. Cannot reproduce reported results. Missing key analysis steps.

Innovation & Integration (15%)

  • Excellent (14-15): Goes beyond basic analyses with appropriate techniques (e.g., model comparison via AIC/BIC or cross-validation, bootstrap inference, mixed-effects modeling for nested data). Analytical choices justified by citing course materials or methodological literature. Demonstrates thoughtful engagement with the "why" behind modeling decisions.
  • Good (11-13): Incorporates some techniques beyond the basics. Analytical choices generally justified. Shows familiarity with course best practices.
  • Adequate (8-10): Primarily basic analyses without much justification. Limited connection to course concepts or methodological reasoning.
  • Needs Improvement (0-7): Only rudimentary analyses. No justification for modeling choices. Does not engage with course material.