Speaker introduction: Ana Trisovic is a Sloan postdoctoral scholar at the Institute for Quantitative Social Science (IQSS) at Harvard University. Her research focuses on computational reproducibility, data protection, and data science. Working with the Dataverse team, she studied how to promote the reuse of research data and code through automation, metadata, and encapsulation. Previously, she was a CLIR postdoctoral fellow at the University of Chicago, where she worked with the Energy Policy Institute (EPIC) and the library. She completed her PhD in computer science at the University of Cambridge in 2018; her doctoral thesis is entitled "Data preservation and reproducibility of CERN LHCb experiment". At CERN, she worked with LHCb, CERN Open Data, and the CERN Analysis Preservation group. During her doctoral studies, she was a Muir Wood Scholar at Newnham College, was selected for the CERN doctoral program, and was awarded the Google Anita Borg Memorial Scholarship.
Title 1: How to conduct a big data analysis on air pollution and health? The study design.
Abstract: The talk will present the logistics of planning, designing, and executing a big data analysis on air pollution and health. It will give an introduction to epidemiology and basic study design. First, we'll introduce basic terms, concepts, and requirements for performing healthcare data analysis. Then we will talk about the datasets required to undertake the study, in particular exposure data describing air pollution and confounder data such as population, geospatial, weather, and climate data, among others. We will also talk about conducting descriptive and regression analyses and defending your decisions regarding model selection, interpretation, and presentation.
Title 2: How to conduct a big data analysis on air pollution and health? The computational execution.
Abstract: The talk will present the logistics of planning and executing an analysis on an analytic dataset prepared from multiple data sources. It will focus on spatial and temporal data aggregation for statistical analysis on air pollution and health, and on automating these processes in computational workflows on high-performance computing infrastructure. We will talk about interpreting the final model in the context of your original hypothesis. Finally, we'll present best practices for code naming and arrangement, stepwise model selection, odds and prevalence ratios, and relative risk. We'll also talk about result dissemination, which is especially challenging when working with sensitive healthcare data.