2025 Big Data Summer Immersion at Yale Projects

During BDSY, students are organized into teams of around 10, each working on a distinct project in biomedical or public health research. Each team is guided by one or more faculty mentors and graduate student assistants who provide support throughout the project. Team placements are made by considering students' skills and project interests. Final project topics and datasets are selected shortly before the program begins.


Genetics Project

Instructor: Hongyu Zhao, PhD
Graduate Student Instructors: Leqi Xu, Jiaqi Hu

This project explores the genetic basis of disease comorbidity through integrative analyses of genome-wide, transcriptome-wide, and proteome-wide association studies. Students will learn to identify shared genetic variants across multiple diseases and quantify their impact on disease pathways. Through hands-on analysis, they will gain skills in genetic epidemiology, bioinformatics, and statistical genetics. Computational tools will be used to interpret complex genetic data and uncover biological mechanisms of disease. Students will work in teams to develop reports and presentations based on their findings. This experience prepares students for future careers in biomedical data science and genetics research.


Causal Inference Project

Instructors: Lee Kennedy-Shaffer, PhD and Fan Li, PhD
Graduate Student Instructors: Xi Fang, PhD and Jiaqi Tong

Using the SUPPORT observational dataset, this project focuses on estimating the causal effect of right heart catheterization on mortality. Students will apply propensity score weighting, outcome regression, and doubly robust approaches, and compare the resulting treatment effect estimates. They will explore individualized treatment effects using causal machine learning methods such as the DR-learner, the R-learner, and Bayesian additive regression trees (BART). Sensitivity analyses will be performed to assess how unmeasured confounding could change the causal conclusions. Students will critically evaluate the assumptions behind each causal inference method. The project offers rigorous training in modern causal inference techniques and their application to real-world health data.
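To give a feel for the three estimator families named above, here is a minimal sketch in Python. It is an illustration only, not the project's actual code or data: it runs on simulated data rather than the SUPPORT dataset, uses a single binary confounder and a continuous outcome instead of mortality, and every function name (`simulate`, `estimate_effects`) is hypothetical. The true average treatment effect in the simulation is set to 1.0 by construction.

```python
import random


def simulate(n, seed=0):
    """Simulate (x, t, y) rows: binary confounder x, binary treatment t, outcome y.

    Assumed data-generating process (for illustration only):
    treatment is more likely when x == 1, and x also raises the outcome,
    so the unadjusted comparison is confounded. The true effect of t is 1.0.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if x == 1 else 0.2
        t = 1 if rng.random() < p_treat else 0
        y = 1.0 * t + 2.0 * x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def estimate_effects(rows):
    """Return naive, IPW, outcome-regression, and doubly robust (AIPW) estimates."""
    # Propensity score e(x): share of treated units within each stratum of x.
    e = {x: mean(t for xi, t, _ in rows if xi == x) for x in (0, 1)}
    # Outcome model m[(t, x)]: mean outcome within each treatment-by-x cell.
    m = {
        (t, x): mean(y for xi, ti, y in rows if xi == x and ti == t)
        for t in (0, 1)
        for x in (0, 1)
    }

    # Unadjusted difference in means (biased here because x confounds t and y).
    naive = mean(y for _, t, y in rows if t == 1) - mean(
        y for _, t, y in rows if t == 0
    )
    # Inverse probability weighting with the estimated propensity score.
    ipw = mean(t * y / e[x] - (1 - t) * y / (1 - e[x]) for x, t, y in rows)
    # Outcome regression: average the predicted treated-minus-control contrast.
    outcome_regression = mean(m[(1, x)] - m[(0, x)] for x, _, _ in rows)
    # Augmented IPW: outcome-model contrast plus weighted residual corrections.
    aipw = mean(
        m[(1, x)]
        - m[(0, x)]
        + t * (y - m[(1, x)]) / e[x]
        - (1 - t) * (y - m[(0, x)]) / (1 - e[x])
        for x, t, y in rows
    )
    return {
        "naive": naive,
        "ipw": ipw,
        "outcome_regression": outcome_regression,
        "aipw": aipw,
    }


results = estimate_effects(simulate(20000))
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Because both working models here are saturated in a single binary confounder, the three adjusted estimates agree up to floating-point error, while the naive difference overstates the effect. With many covariates and parametric or machine learning models, as in a real observational analysis, the estimates generally differ, which is what makes comparing them informative.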


Public Health Modeling Project

Instructors: Stephanie Perniciaro, PhD, MPH and Shelby Golden, MS

This project examines pneumococcal disease dynamics and the phenomenon of serotype replacement following vaccine interventions. Students will analyze global infectious disease surveillance data to characterize changes in pneumococcal serotype distributions. Key statistical methods include time series analysis, hierarchical modeling, and spatial regression, all performed using R. Students will explore how biological, epidemiological, and policy factors influence pneumococcal evolution and vaccine effectiveness. Through data-driven modeling, students will deepen their understanding of public health strategies to prevent infectious diseases. The project bridges biological knowledge with quantitative modeling in the context of global health.