April 20, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Sharon Huang, Information Sciences and Technology

Talk Title: Knowledge Discovery from Image Data
Abstract: A picture is worth a thousand words. The question is, which thousand words? This talk will present computational methods developed in my group that enable the translation of raw image pixels to objects, relations, events, decisions, and new knowledge. Based upon these methods, we have created highly accurate machine learning systems for segmentation, classification, and synthesis of biological and medical images.

Michael Andreae, Penn State Health Anesthesiology and Perioperative Medicine

Talk Title: MPOG, a curated clinical perioperative de-identified dataset with 8 million unique cases from 50 academic medical centers
Abstract: The Multicenter Perioperative Outcomes Group (MPOG) is a perioperative electronic medical record registry holding over 8 million cases from over 50 medical centers. The data are de-identified and curated with a defined data dictionary. Penn State Hershey just joined, led by the Department of Anesthesiology, and would invite collaborations with the data science community to leverage the data to test high performance computing algorithms, hierarchical modeling, and other Big Data applications. Examples include combining the data with geospatial modeling to better understand the mechanisms leading to social disparities in health.

April 13, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Jia Li, Professor, Statistics

Talk Title: Mixture models and unsupervised learning
Abstract: Mixture models and, more generally, latent variable models have been widely used for high dimensional, sequential, or imagery data. The model can serve as a density estimator, and the latent states as the cluster labels. I will introduce a few mixture models for high dimensional continuous and discrete data, some recent advances in clustering to overcome the curse of dimensionality and to address distributional data, and approaches for assessing uncertainty in clustering at the levels of overall partitions, individual clusters, and individual points.

Fangcao Xu, Graduate Student, Geography

Talk Title: Multiple Geometry Atmospheric Correction for Image Spectroscopy using Deep Learning
Abstract: The goal of this research is to develop a deep learning solution for atmospheric correction and target detection of multiple hyperspectral scenes acquired by aerial platforms at different viewing angles. A deep learning solution based on convolutional neural networks is used to learn the relationships between the total radiance observed at the sensor and the different solar and atmospheric components such as upwelling, downwelling, and transmission. The proposed approach requires analyzing multiple scenes acquired in rapid sequence. It is assumed that the target and the atmosphere remain invariant within the time scale of the collection, while the angles of collection and ranges vary. This work focuses on emissive properties of targets, and simulations are performed in the longwave infrared between 7.5 and 12 µm. Results show that the proposed method is computationally efficient, and it can characterize the atmosphere and retrieve the target spectral emissivity with errors of one order of magnitude or less.

March 30, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Sy-Miin Chow, Professor, Human Development and Family Studies

Talk Title: My Journey to Dynamical Systems Modeling as a Behavioral Data Scientist
Abstract: Dynamical systems models have historically been workhorses of the physical sciences and applied mathematics, but have begun to gain traction in statistics and, more recently, in the behavioral sciences. The recent influx of intensive longitudinal data from wearable devices, smartphones, Global Positioning System (GPS), and other sensors has introduced a pressing need, and also unique opportunities, for developing novel data science approaches to examining the systems dynamics of individuals, family systems, social networks, and their interplay with environmental factors. In this talk, I will highlight some of my current work and ideas for future collaborations utilizing intensive longitudinal health data from individuals and family systems.

Chaopeng Shen, Associate Professor, Civil and Environmental Engineering

Talk Title: Pathways of combining process-based knowledge with deep learning for hydrologic modeling
Abstract: Here I demonstrate selected pathways, out of many, to combine the power of both machine learning and process-based knowledge in improving our predictive capability of hydrologic variables. Compared to purely data-driven models, process-based models (PBMs) can produce seamless solutions of observed or unobserved hydrologic variables at continental scales. However, a longstanding difficulty has been to effectively and efficiently obtain parameters for PBMs. Here we show the vastly superior efficiency of a deep-learning-based parameter estimation framework that is based on a completely different paradigm of parameter estimation. We can gain five orders of magnitude of computational savings in calibration/training while achieving better calibrated parameters using the new framework. In addition, we comment on other forms of physics-informed neural networks.

December 19, 2019

Watch the recording (Requires ‘standard’ authentication or login)

Rick O. Gilmore, Professor, Dept. of Psychology, College of the Liberal Arts

Talk Title: How to share personal data ethically
Abstract: Data scientists interested in studying human behavior require access to data about people. In this talk, I will describe how Databrary, the project I co-direct, has tackled the problem of sharing personally identifiable data (video and audio recordings) while upholding ethical principles. Databrary’s approach builds on and extends established practices and thus may have implications for other domains.

Jennifer McCormick, Associate Professor, Dept. of Humanities, College of Medicine

Talk Title: Engendering Trust: A Role for Data Governance
Abstract: In recent years, information technology advancements have transformed the capacity for biomedical and public health science researchers to collect and analyze vast amounts of personal health-related data. Not uncommonly, consent from the individuals from whom the information comes is not required because the data have already been collected and stored in de-identified form. I will briefly share what role data governance and transparency might have in engendering trust among the public and policymakers.

October 24, 2019

View this week’s presentation files

Mike Rutter, associate professor of statistics at PS Behrend

Talk Title: On R and E. coli
Abstract: In this presentation, I will discuss two long-term projects I have been working on. The first is the cran2deb4ubuntu project, which involves providing pre-built binaries for 2,000+ R packages for Ubuntu Linux. These packages are used by a number of other data science-related projects, judging by the number of emails I receive when I accidentally break something. The second project I will discuss is a random forest model I use to help predict E. coli events at Presque Isle State Park (Erie, PA). For both of these projects, I will discuss some current challenges and opportunities for future collaboration.

David Hughes, associate professor of entomology and biology at UP

Talk Title: AI for pests and climate change adaptation in smallholder farms in Africa
Abstract: For smallholder farmers in Africa, a fundamental constraint is growing enough food in the face of biotic stresses (pests and diseases) and, increasingly, abiotic stresses due to climate change when farmers do not have access to human experts (extension services). There are too few human experts interfacing with farmers on their farms at a high enough frequency to help them cope with diseases and climate change. The ratio between extension workers and farmers is 1:3,000 in Kenya and as high as 1:10,000 in the DRC. PlantVillage is a Penn State tool that delivers an AI assistant on smartphones that works offline and is equal to human experts in its ability to diagnose pests and diseases. It has been officially adopted by the United Nations Food and Agriculture Organization and operates in 70 countries and 21 languages. Our tool can also connect with a UN FAO tool, WaPOR, which measures crop water stress across every field in Africa with a 10-year data record. We are leveraging this data and ML to predict future water stress for farmers to enable climate change adaptation. We would dearly love help from meteorological colleagues and data scientists at Penn State.

August 8, 2019

View this week’s presentation files
Watch the recording (Requires ‘standard’ authentication or login)

Sarah Rajtmajer, College of IST, Rock Ethics Institute

Talk Title: The role of data science in an interpretable scholarly record
Abstract: The last few years have seen important progress toward increased methodological rigor in a number of fields that regularly engage with empirical data. This progress has been driven by the so-called “reproducibility crisis”: the finding that wide swaths of published science cannot be replicated. In this talk I will give historical context for the reproducibility crisis, discuss the opportunity it has presented to increase the credibility of the scientific literature and accelerate discovery, and extend the notion of scholarship for the computational and data-enabled sciences in a world of radical transparency. In addition, I will outline a nascent research agenda leveraging AI (synthetic prediction markets) to assign confidence scores to claims in the social and behavioral science literatures. This work is proposed in support of DARPA’s Systematizing Confidence in Open Research and Evidence (SCORE) program.

Justine Blanford, Department of Geography

Talk Title: (Geo) Data Science
Abstract: I am a geographer, spatial analyst, and GISer who has been applying spatial analysis methods to better understand the world around us for the better part of 20 years. Much of my work has addressed issues related to geohealth, in particular the ecology of disease/health across space and time. It includes the use of novel technologies and data sources to better understand how places are connected, the impact of human-environment interactions, and what this means for the health and well-being of society. To really understand what is going on, we sometimes need to use big data sets and look at things from many different angles. My goal in participating here is to learn more about different data science techniques and how to better integrate them into the geospatial world, as the two are a perfect match. Both deal with data but in slightly different ways.

Stephen Ross, Penn State Law

Facilitated Discussion Title: Sports Analytics
Abstract: Sporting competitions amass significant data regarding performance, health/biometrics, and business analytics. Although some data is publicly available and other data might be shared on a confidential basis and presented in aggregated form, most sporting organizations view data as highly proprietary. This leads sport to be understudied. Evidence-based decision-making is critical in sport for several reasons:

  • The importance of sports as a social institution demands sound public policy toward sport, with official decisions based on data.
  • Many sports decisions are made by powerful private entities whose decisions are collective (such as the NCAA or professional sports leagues), and parochial decision-makers often are limited in access to data or use data strategically; there is a public interest in having these entities make publicly-accountable decisions based on best evidence.
  • Individual sporting entities attract strong loyalty. If Pizza Hut makes a poor decision, consumers switch to Domino’s. If Penn State Athletics makes a poor decision, no one shifts to Ohio State. Thus, publicly accountable, evidence-based decision-making is important even for internal decisions.

Josephine Wee, Food Science & Gretta Tritch Roman, College of Agricultural Sciences

Facilitated Discussion Title: Digital fluency: Project iOn as a
Abstract: Increasingly, data and information are being used by communities to make customized decisions on management, food safety and quality, sustainability, resilience, and economics. However, our ability to collect data has far outstripped our ability to effectively utilize data. The current challenge is how to successfully cultivate data to reveal actionable information. Digital fluency is the ability to leverage data and information to enhance critical thinking, problem-solving, communication, and innovation. In this data science community meeting, I will facilitate broad conversations surrounding each digital fluency specialization: design thinking, code competency, data visualizations, teaching and learning, data curation, ethics, communications and storytelling, and diversity and inclusion. The goal of this broad discussion is to introduce the concept of digital fluency within the data science community. I will also briefly discuss a university seed grant-funded project on how prototyping the development and use of interactive open educational resource notebooks (ION) could provide digital fluency training for our faculty and students. Finally, I will briefly talk about our one-day Digital Fluency Symposium, “The future of teaching, learning, and research in a digital world,” in October at Penn State.

June 5, 2019

View this week’s presentation files
Watch the recording (recording is truncated and requires ‘standard’ authentication or login)

David Hunter, Department of Statistics

Abstract: Data Science is an evolving field of study that neither includes nor is included within any single traditional discipline. We propose to support a data science community that embraces this interdisciplinarity, emphasizing that our community includes not only those whose work advances data science methods but those whose work requires the application of these methods. According to a preliminary survey, those who self-identify as part of such a community already represent a majority of campuses and units across the entirety of Penn State University. We seek to provide a space, both virtual and physical, to catalyze and publicize data science-related work at Penn State.

Simon Hooper, Department of Learning and Performance Systems

Talk Title: How can data patterns help enhance student literacy?
Abstract: Our research team has designed a progress monitoring system to help teachers monitor Deaf and Hard of Hearing students’ literacy. The suite includes 8 assessments that target different literacy components. Most assessments take no more than one minute to complete, are gamified to enhance student motivation, and include customized scoring tools. Performance charts help teachers to determine whether individual students are making adequate progress. Targeted data, stored in a relational database, allow researchers to address specific questions, but web log data are not currently being analyzed. Our goal is to learn how supervised learning analytics can reveal patterns of teacher and student behavior that affect learner outcomes.

Drew Wham, Teaching and Learning with Technology, Data Empowered Learning Team

Talk Title: Measuring Perceptual Distance of Organismal Color Pattern using the Features of Deep Neural Networks
Abstract: A wide range of biological research relies upon the accurate and repeatable measurement of the degree to which organisms resemble one another. Current practice for quantifying organismal color pattern similarity, however, lags behind many of the recent advancements in deep learning computer vision. Here, I propose a new workflow that adapts several deep learning-based computer vision techniques for the purpose of quantifying organismal color pattern similarity. This workflow is fully unsupervised and requires no model training. Utilizing several classic color pattern datasets, I demonstrate that this technique is able to achieve similar results to state-of-the-art supervised and semi-supervised methods commonly utilized in the field. The unsupervised nature of this approach therefore has the potential to revolutionize color pattern research, offering an unbiased, accurate, scalable, and repeatable method of quantifying perceptual similarity.