Community Events | Data Science

Upcoming Events

We are hard at work planning more events. Check back later in summer 2022 for more information.

Past Events

May 13, Noon — Deep Dive

Practices to Support Open Science & Research Reproducibility

About: Open Science approaches that make research processes and results more transparent and accessible aim to ensure research reproducibility and enable re-use. In this combined seminar and workshop, Dr. Nicole Lazar, professor of statistics at Penn State, and Dr. Kyle Johnsen, professor of electrical and computer engineering at the University of Georgia, will discuss the benefits of Open Science for you as a data scientist; lower the hurdle to adopting an Open Science workflow by sharing an introduction to the founding principles and processes of Open Science, including practical tips for research reproducibility for statistical analyses; and offer a hands-on workshop on version control using Git.

April 21 — Pop-up Talk

Office of the Associate Chief Information Officer for Research

View recording

About: The Office of the Associate CIO, Research (OACIOR) supports the University’s Research Mission by facilitating collaboration among College and Institute research IT units, central Penn State IT, research administration units and central administration units to pursue the University’s strategic goals for research IT. Within OACIOR, the Office of Research Information Systems (ORIS) supports administrative systems for sponsored programs, research protection and core facilities.

Presenter: Jim Taylor, Associate Chief Information Officer for Research

April 15, 2:00-4:00 p.m. — Deep Dive

Baking the Cake: a seminar and workshop on data storytelling

View Recording

About: Telling a story with data is like baking a cake. Spreadsheets are ingredients, like raw flour and eggs, that no one wants to eat. In this combined seminar and workshop, Dr. Alex Serpi will share the recipe you need to bake the cake. A delicious, data-filled cake.

Presenter: Alex Serpi, Assessment and Research Analyst, Office of Planning, Assessment, and Institutional Research

March 22, 11:00 a.m.-Noon — Pop-up Talk

Foundation Relations and Data Scientists: Keys to Successfully Seeking Support

View Recording

About: Learn more about seeking foundation funding for your research and projects. Sophie Penney Leach, Director of the Office of Foundation Relations, and Andrew Kuhn, Assistant Director and Foundation Relations liaison to IST and ICDS, will lead a discussion about finding and targeting foundation funding opportunities, writing effective proposals for foundations, working with Penn State’s Office of Foundation Relations to maximize your opportunities for success, and examining current foundation funding opportunities relevant to data scientists.

Presenters:

Sophie Penney Leach, Director of the Office of Foundation Relations
Andrew Kuhn, Assistant Director and Foundation Relations Liaison to IST and ICDS

November 12, 11:00 a.m.-Noon — Data Science Research Talks

View recording

The Big Data Constitution
Presenter: Margaret Hu, Associate Dean for Non-JD Programs, Professor of Law and International Affairs, and ICDS Co-Hire
About: The Constitution was forged in a small data world to protect rights and liberties from small data governance threats. In a big data world, the threats to our constitutional rights have been transformed. Therefore, the way that we interpret the Constitution must also be reimagined.

Imaging and Art History
Presenters: James Wang, Professor of Information Sciences and Technology; and Elizabeth Mansfield, Professor and Head of the Department of Art History
About: In this talk, we will present some recent results on computer-based analysis of oil paintings to answer art-historical questions.

November 2, 10:00-11:00 a.m. — Data Science Research Talks

View recording

Form Finding for Architectural Knitted Textiles
Presenters: Felecia Davis, Associate Professor of Architecture and Carey Memorial Early Career Professor in the Arts; and Farzaneh Oghazian, Ph.D. student
Abstract: The goal of this research is to develop machine learning models that simplify form-finding of architectural knitted textiles and predict the initial and final shapes of knitted textile structures, removing the tedious process of form-finding. Such models would help architects and architecture students who are not trained as textile designers take advantage of knitted materials. The main question our team will try to answer is: How can we develop machine learning models for the form-finding and reverse form-finding processes of architectural knitted textile structures? (Reverse form-finding is predicting the initial shape that must be knitted to produce a given overall tensioned form.)

Developing data science tools for cryosphere studies
Presenter: Shujie Wang, Assistant Professor of Geography
Abstract: The cryosphere plays an important role in global climate, sea level rise, ocean currents, and water supply. Widespread ice loss has been observed in Greenland and Antarctica over recent decades. Monitoring ice sheet changes using multi-source datasets and predicting future changes of ice sheets are key tasks for better projection of global sea level rise. In this talk, I will introduce some critical questions that require integration of multi-source data, physical modeling, and data-driven tools for improved understanding of cryospheric processes.

October 12, 9:00-10:00 a.m. — Data Science Research Talks

View Recording

Fast, Economical & Scalable AutoML
Presenter: Qingyun Wu, Assistant Professor of Information Sciences and Technology
About: Automated machine learning (AutoML) is the process of automating the time-consuming, iterative tasks of machine learning model development, including data pre-processing, hyperparameter tuning, model selection, etc. It frees data scientists, analysts, and developers from tedious trial-and-error in building machine learning models. In this talk, I will introduce our latest efforts in fast, economical & scalable AutoML, how it can benefit a wide spectrum of end-to-end data science and machine learning tasks, and the new challenges.

Time Series Prediction of Air Pollution From Wildfires Using Transformer – A Multi-Head Attention Mechanism
Presenter: Manzhu Yu, Assistant Professor of Geography; Associate Director of Geoinformatics and Earth Observation Laboratory; Associate of ICDS
Abstract: Wildfire smoke can be more damaging to respiratory health than other sources of air pollution. As fires grow larger and human populations expand, it is crucial to provide a more accurate picture of how communities will be at risk from wildfire. In this research, we investigated the capability of a Transformer architecture for predicting short-term PM2.5 measurements in California during wildfire seasons. The time series prediction uses the past 24 hours of observations to predict PM2.5 measurements for the next 12 hours. Feature contributions and feature temporal contributions were calculated to capture different characteristics in multi-variable time series, distinguish each variable’s contribution to the prediction, and provide guidance on future air quality forecast systems over multi-variable data.
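The 24-hours-in, 12-hours-out setup described in the abstract can be pictured as a sliding window over an hourly series. The following is only a minimal sketch of that windowing step, not the presenter's code: the function name `make_windows` and the sample readings are hypothetical, and the Transformer model itself is not shown.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], n_past: int = 24, n_future: int = 12
) -> List[Tuple[List[float], List[float]]]:
    """Split an hourly series into (past, future) pairs for forecasting."""
    pairs = []
    # Last index at which a full past + future window still fits.
    last_start = len(series) - n_past - n_future
    for start in range(last_start + 1):
        past = list(series[start:start + n_past])
        future = list(series[start + n_past:start + n_past + n_future])
        pairs.append((past, future))
    return pairs


# Made-up hourly values for illustration only; not real PM2.5 measurements.
readings = [float(hour % 30) for hour in range(48)]
windows = make_windows(readings)
print(len(windows))
print(len(windows[0][0]), len(windows[0][1]))
```

Each pair holds 24 input values and the 12 values that follow them, which is the shape a sequence model would be trained on under these assumptions.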

September 22, 3:00-4:00 p.m. — Data Science Research Talks

View Recording

Big Data Squared: Imaging Genomics
Presenter: Nicole Lazar, Professor of Statistics
About: In this talk, I will briefly discuss the challenges and opportunities that arise from the consideration of two Big Data modalities – neuroimaging and genetic information – and what can be gained from their combination. This is an example of “Big Data Squared” as each modality on its own constitutes a Big Data problem.

Making Images with Radio Telescope Interferometers
Presenter: Ian Czekala, Assistant Professor of Astronomy and Astrophysics, and ICDS Co-Hire
About: Radio interferometric arrays, like the Atacama Large Millimeter Array (ALMA) and the Very Large Array (VLA), work by correlating the signals from multiple antennas to mimic a larger diameter telescope. We will discuss how we are meeting some of the challenges of modern radio interferometry using Regularized Maximum Likelihood image reconstruction algorithms and neural network representations of protoplanetary disks to learn how planet formation proceeds in solar systems across our Galaxy.

August 10, 10:00-11:00 a.m. — Pop-up Talk on Corporate Engagement Center

View recording (Penn State authentication required)

The Corporate Engagement Center connects industry partners to strategic opportunities at Penn State for research and development, philanthropy, and talent recruitment. Brought together in 2019, the center serves as a hub for industry/University relationships, supporting companies as they navigate the vast resources of Penn State. The team is here to help build lasting, mutually beneficial relationships. The Corporate Engagement Center is a joint initiative of the Office of the Senior Vice President for Research and the Office of University Development, working in partnership with Career Services.

Presenters: Beth Colledge, Director of Corporate Engagement, Corporate Engagement Center, and Todd Price, Corporate Relations Director for Research, Institute for Computational and Data Sciences

August 4, 9:00-10:00 a.m. — Pop-up Talk on Choosing the Correct Cyberinfrastructure Resource for your Research

View recording (Penn State authentication required)

For researchers who are using big data techniques or need high-performance computing in their investigations, finding and understanding the available cyberinfrastructure resources can be tricky. This talk will explore the various computing and storage options — including those hosted by Penn State as well as external resources — available to Penn State researchers and provide tips on how researchers can identify the right tool to use for a given scientific objective.

Presenter: Carrie Brown, Advanced Cyberinfrastructure Research and Education Facilitator, Institute for Computational and Data Sciences

July 27, 1:00-2:00 p.m. — Pop-up Talk on Learning Analytics with Teaching and Learning with Technology (TLT)

View recording (Penn State authentication required)

TLT’s Data Empowered Learning team explores the usage and impact of learning analytics to support student success.

Presenters: Bart Pursel, Director of Teaching and Learning with Technology Innovation; Drew Wham, Lead Data Scientist (TLT); Ben Hellar, Lead User Experience Developer (TLT)

July 15, 9:00-10:00 a.m. — Pop-up Talk on ResearchPros: Research Professionals Community

View recording (Penn State authentication required)

Now, more than ever, the field of research requires not only scientific expertise, but also knowledge of technology, people management practices, and the understanding of security, privacy, and ethical considerations. There are few at the University with expertise at this complex intersection of knowledge domains. Thus, the work of implementing and managing research projects has emerged as a distinct and specialized profession. ResearchPros aims to cultivate a community for research professionals and lead the professionalization of the practice and process of research at Penn State.

Presenters: Ashley Stauffer, Founder/Director; Gabrielle Provenzano, Co-founder/Assistant Director

July 7, 2:00-3:00 p.m. — Pop-up Talk on collaborating with RISE to overcome data science challenges

View recording (Penn State authentication required)

RISE (Research Innovations with Scientists and Engineers) is a team of computational scientists and software engineers in the Institute for Computational and Data Sciences (ICDS) who hold advanced degrees and have extensive supercomputing experience and a deep understanding of academic research. They are available to help Penn State researchers make the most of using advanced computing for short- and long-term research projects. This talk will explore how the RISE team supports research, including building, coding, and maintaining workflows; optimizing code to save time and money; educating research groups; and more.

Presenter: Chuck Pavloski, Team Lead, ICDS’s RISE Team

June 23, 3:00-4:00 p.m. — Pop-up Talk on Nittany Data Labs

View recording (Penn State authentication required)

Nittany Data Labs (NDL) is Penn State’s student-run data science organization. NDL serves to foster the growth of our members’ knowledge and understanding of data science and business intelligence by providing them with platforms to gain knowledge and projects with real clients and stakeholders where they can contextualize that knowledge. NDL aims to be the place for all Penn Staters to learn about and improve on using data to make them better at what they do.

Presenters: Vince Trost, Co-advisor; Matt Beckman, Co-advisor

June 17, 10:00-11:00 a.m. — Pop-up Talk on Nittany AI Alliance

View recording (Penn State authentication required)

The Nittany AI Alliance, a Penn State Outreach initiative, creates programs that bring together students, faculty, staff, and industry leaders to address real-world problems through experiential learning projects using artificial intelligence–based solutions. We are committed to providing students with unique out-of-classroom learning opportunities, improving the student experience at Penn State, and facilitating innovative collaboration between businesses and top talent at Penn State and across the Commonwealth.

Presenters: Daren Coudriet, Executive Director of Innovation and Nittany AI Alliance; Brad Zdenek, Innovation Strategist, Nittany AI Alliance; Tim Bracken, Nittany AI Advance Program Manager; Katy Colby, Program Associate, Nittany AI Alliance

April 15, 1:00-2:00 p.m. — Pop-up Talk on COVID-19 Database

View recording (Penn State authentication required)

“Getting Access to De-Identified COVID-19 and Other Healthcare Data”
Presented by Avnish Katoch, informatics project manager, Penn State Clinical and Translational Science Institute
Abstract: The National COVID Cohort Collaborative (N3C), funded by the National Center for Advancing Translational Sciences, is an initiative that seeks to put data into the hands of scientists who are skilled at manipulating or deriving insight from data sets. As part of the initiative, which includes institution-level data usage agreements, Penn State faculty can get free access to the platform without seeking Institutional Review Board approval. Attend this “pop-up” Data Science Community talk to learn how Penn State faculty can get access to the database and what data is included. Avnish Katoch, informatics project manager with Penn State Clinical and Translational Science Institute, will provide a live demo of the platform.

April 1, 11:00 a.m.-Noon — Data Science Research Talks

View recording (Penn State authentication required)

“Human Science without Data Collection”
Presented by Timothy Brick, assistant professor of human development and family studies and Institute for Computational and Data Sciences (ICDS) faculty co-hire
Abstract: Scientific study of humans regularly relies on the collection of vast quantities of often personal or private data. These data have tremendous value both scientifically and economically, but participants are rarely reimbursed for the value of their data or for the risk that they undertake if the data becomes public. This leads to a conundrum: responsibilities to proper scientific practice require open data for reproducibility, but responsibilities to participants require closed collection and data sharing. In this talk, I discuss the goals of the MIDDLE project, which aims to build the infrastructure needed to do human science without data collection.

“Expert knowledge capture for 2D materials synthesis”
Presented by Wesley Reinhart, assistant professor of materials science and engineering and ICDS faculty co-hire
Abstract: The relationship between material processing, structure, and properties is challenging to understand and even harder to predict because it is non-linear, high-dimensional, and results from physical phenomena at many scales. While traditional materials design has relied on human intuition to interpret patterns in known materials and infer new ones with similar (hopefully improved) properties, emerging data science tools offer new strategies to expedite materials design. This talk will discuss opportunities and potential impact of expert knowledge capture for the discovery and optimization of new “2D materials” in the 2D Crystal Consortium, an NSF-funded Materials Innovation Platform at Penn State.

March 18, 11:00 a.m.-Noon — Data Science Research Talks

View recording (Penn State authentication required)

“Human movement and infectious diseases”
Presented by Nita Bharti, assistant professor of biology and Lloyd Huck Early Career Professor
Abstract: Human movement is an important factor in the transmission of infectious diseases. Movements often drive the transmission dynamics we observe across a number of endemic infectious diseases in the world today. That link underlies the reason behavioral interventions are so effective against emerging pathogens, like SARS-CoV-2. When we understand the interactions between movement, behavior, and infectious diseases, we improve intervention strategies by making them more effective and efficient.

“Health System Barriers Preventing Use of ‘Big Data Analytics’ to More Effectively Manage The Covid-19 Pandemic”
Presented by Dennis Scanlon, Distinguished Professor of Health Policy and Administration
Abstract: Despite the potential of using ‘Big Data’ and advanced modeling and analytics to more effectively segment populations based on the likely outcomes of COVID infection, and thus inform prevention and interventional responses, this approach was under-utilized during the current pandemic. This talk will discuss the reasons for this shortcoming, which include systemic structural issues in the United States that prevent using data and modeling to their fullest potential.

March 4, 11:00 a.m.-Noon — Data Science Research Talks

View recording (Penn State authentication required)

“Studying Sharing of Political Content on Facebook”
Presented by S. Shyam Sundar, James P. Jimirro Professor of Media Effects
Abstract: User sharing of content is the engine that drives social media, contributing not only to the virality of content but also conferring status by way of metrics such as number of retweets. In the realm of politics, this activity can richly operationalize the ideal of providing a voice to the masses by enabling ordinary citizens to share their political thought at scale. On the other hand, it can also lead to the spread of misinformation, creating echo chambers and political polarization. When social media users see a link or headline that appears to be aligned with their ideology, they are more likely to share it, often without clicking on the link themselves. This means more sharing of extreme, rather than moderate, political content, and more often without clicking. Our project tests this hypothesis by examining users’ sharing of political content on Facebook. As part of the Social Media and Democracy Grant, awarded by the Social Science Research Council (SSRC), we have obtained access to data provided by Facebook, which include the number of clicks, shares and shares without clicks for all web pages (URLs) shared on the platform from 2017 to 2019. We will discuss the nature of the dataset, and challenges faced in extracting and analyzing data in a privacy-protected manner. We will also share emerging patterns found in a subset of our data.

“Multi Modal Sensor Fusion for Data driven Process Monitoring of Additive Manufacturing Processing”
Presented by Jan Petrich, research development engineer, Geospatial Intelligence Department, Applied Research Laboratory
Abstract: Additive manufacturing (AM), the industrial version of 3D printing, has triggered a revolution in making niche items, such as medical implants and plastic rapid-prototypes. While this seemingly science-fictional ability to “turn bits into atoms” for consumers and small entrepreneurs has received a great deal of publicity, it is in anytime-anywhere-manufacturing where the technology could have its most significant impact. However, part quality may not always be guaranteed, and residual anomalies in the print process, which are often stochastic in nature, can never be eliminated completely. Clearly, an excess of anomalies or flaws within the part may render the printed component unusable. Therefore, data-driven techniques that detect and identify such process anomalies during the AM build process or automatically characterize the quality of the part immediately after the build are critically needed. Both approaches, in-situ and ex-situ, may offer on-site part certification in the future. This talk will cover ongoing research and development efforts at PSU/ARL that aim to bring data science and data analytics to the forefront of metal AM. Instrumentation and data acquisition capabilities at PSU/CIMP-3D will be presented. In addition, machine perception techniques for X-ray Computed Tomography (CT) inspection as well as machine learning algorithms for in-situ defect detection will be discussed.

February 25, 1:00-2:00 p.m. — Data Science Research Talks

View Recording (Penn State authentication required)

“Algorithmic fairness for socially optimal decision-making”
Presented by Hadi Hosseini, Assistant Professor of Information Sciences and Technology
Abstract: The advent of distributed platforms has given rise to novel challenges in designing socially desirable algorithms in complex multi-agent systems, which call for subtle, practical, and scalable solutions. Algorithmic fairness has been the focus of study in economics, mathematics, and AI for decades. It encompasses solutions for a wide variety of real-world applications including assigning students to courses, assigning riders to drivers in ride-sharing platforms, dividing inheritance among heirs, matching donors to kidney patients, and distributing charitable food items. In this talk, I will focus on the importance of making decisions according to preference data, discuss algorithmic and theoretical advances in fair resource allocation, and argue how empirical and theoretical findings, together, can provide deep insights into designing socially desirable systems.

“Using machine learning to understand gene regulation”
Presented by Shaun Mahony, Assistant Professor of Biochemistry and Molecular Biology
Abstract: The regulation of genes within each of the cell types in our bodies is orchestrated by the activities of transcription factors and other regulatory proteins. Determining how such regulators recognize their genomic targets would enable a deeper understanding of how cellular processes are controlled and how disease states occur. However, understanding the gene regulatory code has turned out to be challenging. While many regulatory proteins recognize specific DNA patterns, the vast majority of sequences that match the pattern will not in fact be bound by the protein in a given cell type. Furthermore, a given regulatory protein can recognize different instances of its binding pattern in different cell types. Such context-dependent activities appear to be determined by the regulatory environment of the cell: interactions with other proteins, chemical modifications on the genome, and the organization of the genome within the cell all play roles in specifying a given regulatory protein’s targets. My lab applies machine learning techniques to understand how genes are regulated and how particular cellular identities are established. In this presentation, I’ll discuss how we’re using neural networks to interpret how regulatory proteins establish distinct regulatory landscapes on the genome during particular steps in development.

February 12, 2021

To kick off the Spring 2021 semester, we invite you to a Data Science Community Building Session. The purpose of the Community Building Session is to provide a space to:

Engage in open dialogue and discussion
Create and foster informal connections among community members
Identify common themes, issues, questions, and ideas in the area of data science

During this Community Building Session, we will utilize breakout rooms and guiding questions to facilitate discussion, but also welcome members to bring their ideas and thoughts for ways to engage in virtual space and advance the community. We hope to see you there and are looking forward to having a lively discussion and identifying common areas of interest as well as areas for growth for the community!

December 3, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Rebecca Passonneau, Professor of Computer Science and Engineering

Contact: rjp49@psu.edu
Talk Title: Pitting Machines against Humans for Assessments of Student Writing? Not Really.
Abstract: For years, the National Center for Education Statistics has reported that on average, the writing skills of secondary students are below grade level. Research consistently shows that student writing can improve through consistent practice and feedback, yet teachers and instructors at all levels lack the resources to provide additional writing assignments. My Natural Language Processing Lab has been investigating writing rubrics across content areas, their reliability within and outside the context of actual classroom use, and the potential for human-machine collaboration to lead to more timely and consistent feedback on student writing.

Ashkan Negahban, Assistant Professor of Engineering Management

Mohamad Darayi, Assistant Professor of Systems Engineering

Contact: Ashkan Negahban aun85@psu.edu; Mohamad Darayi mud415@psu.edu
Talk Title: Incorporating Mobility into Epidemic Vulnerability Measures
Abstract: The movement and interaction of public transportation users can play an important role in spreading an infectious disease. In this work, by integrating the mobility network with regional demographics and health data, we develop network-based vulnerability indices that can help prioritize resource allocation across communities and the underlying transportation network during an epidemic or pandemic. For example, the indices can provide decision support on where intervention resources – such as testing sites, additional personnel to disinfect frequently touched surfaces in subway stations, and masks and hand sanitizer distribution – could be of greatest benefit. We also evaluate the efficacy of the proposed vulnerability indices using an agent-based simulation model of COVID-19 in New York City. Our network analysis and simulations rely on census data and other public data from the Department of Labor, Centers for Disease Control and Prevention, and Department of Health and Human Services. For future phases of this research, our team is seeking collaborators with expertise in disaster science and community resilience.

November 16, 2020

Utilizing Data Science Tools in the Classroom – Community Discussion

Watch the recording (Requires ‘standard’ authentication or login)

Panelists

Dr. Jan Reimann, Associate Professor of Mathematics
Dr. Drew Wham, Data Scientist, Teaching and Learning with Technology
Dr. Patrick Dudas, Assistant Teaching Professor of Information Sciences and Technology

Discussion topic

Utilizing data science and data science tools in the classroom can advance and improve education because it can provide students with timely and relevant examples, data, and feedback. In the Data Science Community discussion on ‘Utilizing data science tools in the classroom’, we will hear from three individuals, Dr. Jan Reimann, Dr. Drew Wham, and Dr. Patrick Dudas, who have applied data science tools in their classrooms and seen the benefits. Dr. Jan Reimann noticed that math textbooks often suffer from either outdated or obscure examples that students find hard to relate to. He uses Jupyter notebooks both to provide students with interactive content (automatically generating and checking problems using the Python kernel) and to integrate current data into his class. For example, he uses the Pandas library to pull and process real-time COVID-19 data from Johns Hopkins University into a notebook, and then students can study and experiment with logistic functions using this data. Dr. Drew Wham wanted to provide students the opportunity to try out different ideas to solve predictive challenges and receive real-time feedback. Using Kaggle, a data competition website that allows you to post data including training and test sets, students can work on a problem, try out different solutions, compete against benchmarks, and receive instant feedback on how well an approach performs. Dr. Patrick Dudas has found that he can use Google Colab (part of PSU’s G Suite) coupled with GitHub to teach a variety of programming topics (SQL, HTML, Python, R, etc.), create more enriched content for his courses, and create lectures, labs, or assignments that students can work on directly. In this environment, students can see text, images, and code together, freely make changes, and see effects in real time. Please attend this session if you are interested in learning more and discussing how data science tools are being used in the classroom!
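As a rough illustration of the logistic functions mentioned in Dr. Reimann's example, the sketch below evaluates a logistic curve in plain Python. It is not taken from his notebooks: the parameter values are hypothetical and not fitted to any data, and the step of pulling COVID-19 data with Pandas is not reproduced here.

```python
import math


def logistic(t: float, capacity: float, rate: float, midpoint: float) -> float:
    """Logistic curve: slow start, rapid growth, then levelling off at `capacity`."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))


# Hypothetical parameters chosen only for illustration; not fitted to real data.
CAPACITY = 1000.0   # value the curve levels off at
RATE = 0.2          # growth rate per day
MIDPOINT = 30.0     # day on which the curve reaches half of CAPACITY

days = list(range(0, 61, 10))
curve = [logistic(day, CAPACITY, RATE, MIDPOINT) for day in days]
for day, value in zip(days, curve):
    print(f"day {day:2d}: {value:7.1f}")
```

Changing `RATE` or `MIDPOINT` and re-running shows how the steepness and timing of the curve shift, which is the kind of experimentation the notebook setting makes easy.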

November 12, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Melissa Gervais, Assistant Professor of Meteorology and Atmospheric Science, Faculty Fellow of ICDS, Adjunct Associate Research Scientist, Lamont-Doherty Earth Observatory (LDEO)

Contact: mmg62@psu.edu
Talk Title: Evolution Self-Organizing Maps to Classify Spatio-Temporal Variability
Abstract: Within the climate system, our understanding of the physical processes involved in producing variability on different time scales depends on complex spatial patterns that vary in time. Here we are particularly interested in understanding decadal variability in North Atlantic sea surface temperatures that is an important driver of the Northern Hemisphere climate. We develop a novel application of self-organizing maps, an unsupervised machine learning method, that effectively characterizes the entire time evolution of sea surface temperatures in the North Atlantic. Furthermore, we are able to obtain a deeper understanding of the mechanisms occurring within these clusters through composite analysis of additional atmosphere and ocean fields. The results provide a new vantage point on the atmosphere-ocean interactions that occur on decadal time scales.

Bing Pan, Associate Professor, Department of Recreation, Park, and Tourism Management

Contact: bup63@psu.edu
Talk Title: Big Data Analytics in Tourism and Park Research
Abstract: Visitors to a destination or a national park interact with information technologies throughout their journeys and leave online digital traces. I will discuss our recent projects on forecasting and monitoring visitors with big data sources in tourism and park research. Search engine queries, website logs, mobile phone data, reservation data, and social media can help us monitor, predict, and manage visitors to a city or a national park, and understand their demographics and visitation experience.

October 19, 2020

Data Science and Public Health – Community Discussion

Watch the recording (Requires ‘standard’ authentication or login)

Panelists

Matt Ferrari, Associate Professor of Biology

Discussion topic

Data and data science can improve public health efforts through changes in methodologies as well as response approaches. Through the use of big data and data science, public health methodologies can shift from models towards evidence-based interventions and public health approaches can become more proactive in helping to design efficient systems for monitoring, evaluation, and targeting of public health interventions. For example, data science approaches have been utilized to improve disease diagnosis from medical images as well as to examine the relationship between diseases, genetics, and the environment. This discussion session will be kicked off by Dr. Matthew Ferrari. He will describe his research on the evaluation of vaccination programs in countries with nascent surveillance programs. He will describe how he uses dynamical models to estimate disease burdens and target interventions in the absence of high-quality surveillance or well-designed monitoring and evaluation.

October 12, 2020

Machine Learning in Learning Analytics – Community Discussion

Watch the recording (Requires ‘standard’ authentication or login)

Panelists

Priya Sharma, Associate Professor of Education (Learning, Design, and Technology)
Mahir Akgun, Assistant Teaching Professor of Information Sciences and Technology
Qiyuan Li, Data Modeler & Developer, Boston University’s Digital Learning & Innovation

Discussion topic

Within education, data science methods have traditionally been used to evaluate data for prediction and remediation. What is really powerful about integrating data sciences into education is being more precise about the use and application of data within a specific context and theoretical approach to learning. One area that seems underexplored is how data sciences can inform the design of learning and pedagogical interactions. Advancement in this area will require collaboration between pedagogical experts and data scientists as well as an understanding of indicators of learning and performance. In this discussion session, the panelists will describe their current project, in which they are using a supervised machine learning model to classify students’ online discourse to assist the instructor in assessing the quality of learner interactions. They will discuss how they anticipate automating the type of low-level feedback that can be provided to students in online discussions and assisting the instructor in generating detailed yet timely feedback for large numbers of students. They also anticipate creating a learning analytics dashboard based on this ML model to assist the instructor in refining the pedagogical approach to be more attentive to students’ learning and understanding. Please attend this session if you are interested in learning more and discussing how data science is being applied to the teaching and learning analytics space!

September 24, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Fall 2020 Meeting of the Community

Introduce the new roles and commitments of Teaching and Learning with Technology, the Institute for Computational and Data Sciences, and the University Libraries in providing logistical support for the Data Science Community as a collaboration moving forward.
Have short introductory “meet and greet” talks from the new DS Community leaders, Briana Ezray (Research Data Librarian – STEM, University Libraries) and Xiaofeng Liu (Associate Professor, ICDS and Civil and Environmental Engineering).
Discuss future presentations and discussions for the fall semester lineup.
Foster an open discussion around areas of interest, growth, and logistics for the community.

April 20, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Sharon Huang, Information Sciences and Technology

Contact: suh972@psu.edu
Talk Title: Knowledge Discovery from Image Data
Abstract: A picture is worth a thousand words. The question is, which thousand words? This talk will present computational methods developed in my group that enable the translation of raw image pixels to objects, relations, events, decisions, and new knowledge. Based upon these methods we have created highly accurate machine learning systems for segmentation, classification and synthesis of biological and medical images.

Michael Andreae, Penn State Health Anesthesiology and Perioperative Medicine

Contact: mua419@psu.edu
Talk Title: MPOG.org, a curated clinical perioperative de-identified dataset with 8 million unique cases from 50 academic medical centers
Abstract: The Multicenter Perioperative Outcomes Group (MPOG.org) is a perioperative electronic medical record registry holding over 8 million cases from over 50 medical centers. The data are de-identified and curated with a defined data dictionary. Penn State Hershey just joined, led by the Department of Anesthesiology, and invites collaborations with the data science community to leverage the data to test high performance computing algorithms, hierarchical modeling, and other Big Data applications. Examples include the use of geospatial modeling to better understand mechanisms leading to social disparities in health.

April 13, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Jia Li, Professor, Statistics

Contact: jol2@psu.edu
Talk Title: Mixture models and unsupervised learning
Abstract: Mixture models, and more generally latent variable models, have been widely used for high dimensional, sequential, or imagery data. The model can serve as a density estimator, and the latent states as the cluster labels. I will introduce a few mixture models for high dimensional continuous and discrete data, some recent advances in clustering to overcome the curse of dimensionality and to address distributional data, and approaches for assessing uncertainty in clustering at the levels of overall partitions, individual clusters, and individual points.

Fangcao Xu, Graduate Student, Geography

Contact: xfangcao@psu.edu
Talk Title: Multiple Geometry Atmospheric Correction for Image Spectroscopy using Deep Learning
Abstract: The goal of this research is to develop a deep learning solution for atmospheric correction and target detection of multiple hyperspectral scenes, acquired by aerial platforms at different viewing angles. A deep learning solution based on convolutional neural networks is used to learn the relationships between the total radiance observed at the sensor, and the different solar and atmospheric components such as upwelling, downwelling and transmission. The proposed approach requires analyzing multiple scenes acquired in rapid sequence. It is assumed that the target and the atmosphere remain invariant within the time scale of the collection, while the angles of collection and ranges vary. This work focuses on emissive properties of targets, and simulations are performed in the longwave infrared between 7.5 and 12 µm. Results show that the proposed method is computationally efficient, and it can characterize the atmosphere and retrieve the target spectral emissivity within one order of magnitude errors or less.

March 30, 2020

Watch the recording (Requires ‘standard’ authentication or login)

Sy-Miin Chow, Professor, Human Development and Family Studies

Contact: quc16@psu.edu
Talk Title: My Journey to Dynamical Systems Modeling as a Behavioral Data Scientist
Abstract: Dynamical systems models have historically been workhorses of the physical sciences and applied mathematics, but have begun to gain traction in statistics, and more recently, in the behavioral sciences. The recent influx of intensive longitudinal data from wearable devices, smartphones, Global Positioning System (GPS), and other sensors has introduced a pressing need, and also unique opportunities for developing novel data science approaches to examining the systems dynamics of individuals, family systems, social networks, and their interplay with environmental factors. In this talk, I will highlight some of my current work and ideas for future collaborations utilizing intensive longitudinal health data from individuals and family systems.

Chaopeng Shen, Associate Professor, Civil and Environmental Engineering

Contact: shen@engr.psu.edu
Talk Title: Pathways of combining process-based knowledge with deep learning for hydrologic modeling
Abstract: Here I demonstrate selected pathways, out of many, to combine the power of both machine learning and process-based knowledge in improving our predictive capability of hydrologic variables. Compared to purely data-driven models, process-based models (PBMs) can produce seamless solutions of observed or unobserved hydrologic variables at continental scales. However, a longstanding difficulty has been to effectively and efficiently obtain parameters for PBMs. Here we show the vastly superior efficiency of a deep-learning-based parameter estimation framework that is based on a completely different paradigm of parameter estimation. We can gain five orders of magnitude of computational savings in calibration/training while achieving better calibrated parameters using the new framework. In addition, we comment on other forms of physics-informed neural networks.

December 19, 2019

Watch the recording (Requires ‘standard’ authentication or login)

Rick O. Gilmore, Professor, Dept. of Psychology, College of the Liberal Arts

Contact: rog1@psu.edu
Talk Title: How to share personal data ethically
Abstract: Data scientists interested in studying human behavior require access to data about people. In this talk, I will describe how the Databrary.org project I co-direct has tackled the problem of sharing personally identifiable data–video and audio recordings–while upholding ethical principles. Databrary’s approach builds on and extends established practices and thus may have implications for other domains.

Jennifer McCormick, Associate Professor, Dept. of Humanities, College of Medicine

Contact: jmccormick@pennstatehealth.psu.edu
Talk Title: Engendering Trust: A Role for Data Governance
Abstract: In recent years, information technology advancements have transformed the capacity for biomedical and public health science researchers to collect and analyze vast amounts of personal health-related data. Not uncommonly, consent of the individuals from whom the information comes is not required because the data have already been collected and stored in de-identified form. I will briefly share what role data governance and transparency might have in engendering trust among the public and policymakers.

October 24, 2019

View this week’s presentation files

Mike Rutter, associate professor of statistics at PS Behrend

Contact: mar36@psu.edu
Talk Title: On R and E. Coli
Abstract: In this presentation, I will discuss two long-term projects I have been working on. The first is the cran2deb4ubuntu project, which involves providing pre-built binaries for 2,000+ R packages for Ubuntu Linux. These packages are used by a number of other data science-related projects, judging by the number of emails I receive when I accidentally break something. The second project I will discuss is a random forest model I use to help predict E. coli events at Presque Isle State Park (Erie, PA). For both of these projects, I will discuss some current challenges and opportunities for future collaboration.

David Hughes, associate professor of entomology and biology at UP

Contact: dph14@psu.edu
Talk Title: AI for pests and climate change adaptation in smallholder farms in Africa
Abstract: For smallholder farmers in Africa, a fundamental constraint is growing enough food in the face of biotic stresses (pests and diseases) and now increasingly abiotic stresses due to climate change when farmers do not have access to human experts (extension services). There are too few human experts interfacing with farmers on their farms at a high enough frequency to help them cope with diseases and climate change. The ratio between extension workers and farmers is 1:3,000 in Kenya and as high as 1:10,000 in the DRC. PlantVillage is a Penn State tool that delivers an AI assistant on smartphones that works offline and is equal to human experts in its ability to diagnose pests and diseases. It has been officially adopted by the United Nations Food and Agriculture Organization and operates in 70 countries and 21 languages. Our tool can also connect with a UN FAO tool, WaPOR, which measures crop water stress across every field in Africa with a 10-year data record. We are leveraging this data and ML to predict the future water stress for farmers to enable climate change adaptation. We would dearly love help from meteorological colleagues and data scientists at Penn State.

August 8, 2019

View this week’s presentation files
Watch the recording (Requires ‘standard’ authentication or login)

Sarah Rajtmajer, College of IST, Rock Ethics Institute

Contact: smr48@psu.edu
Talk Title: The role of data science in an interpretable scholarly record
Abstract: The last few years have seen important progress toward increased methodological rigor in a number of fields that regularly engage with empirical data. This progress has been driven by the so-called “reproducibility crisis”— the finding that wide swaths of published science cannot be replicated. In this talk I will give historical context for the reproducibility crisis, discuss the opportunity it has presented to increase the credibility of the scientific literature and accelerate discovery, and extend the notion of scholarship for the computational and data-enabled sciences in a world of radical transparency. In addition, I will outline a nascent research agenda leveraging AI (synthetic prediction markets) to assign confidence scores to claims in the social and behavioral science literatures. This work is proposed in support of DARPA’s Systematizing Confidence in Open Research and Evidence (SCORE) program.

Justine Blanford, Department of Geography

Contact: jib18@psu.edu
Talk Title: (Geo) Data Science
Abstract: I am a geographer, spatial analyst, GISer who has been applying spatial analysis methods to better understand the world around us for the better part of 20 years. Much of my work has been about issues related to geohealth, in particular the ecology of disease/health across space and time and includes the use of novel technologies and data sources to better understand how places are connected, the impact of human-environment interactions and what this means in terms of the health and well-being of society. To really understand what is going on we sometimes need to use big data sets and look at things from many different angles. My goal in participating here is to learn more about different data science techniques and how to better integrate these in the geospatial world as these are a perfect match. Both deal with data but in slightly different ways.

Stephen Ross, Penn State Law

Contact: sfr10@psu.edu
Facilitated Discussion Title: Sports Analytics
Abstract: Sporting competitions amass significant data regarding performance, health/biometrics, and business analytics. Although some data is publicly available and other data might be shared on a confidential basis and presented in aggregated form, most sporting organizations view data as highly proprietary. This leads sport to be understudied. Evidence-based decision-making is critical in sport for several reasons:

The importance of sports as a social institution demands sound public policy toward sport, with official decisions based on data.
Many sports decisions are made by powerful private entities whose decisions are collective (such as the NCAA or professional sports leagues), and parochial decision-makers often are limited in access to data or use data strategically; there is a public interest in having these entities make publicly-accountable decisions based on best evidence.
Individual sporting entities attract strong loyalty. If Pizza Hut makes a poor decision, consumers switch to Domino’s. If Penn State Athletics makes a poor decision, no one shifts to Ohio State. Thus, publicly accountable, evidence-based decision-making is important even for internal decisions.

Josephine Wee, Food Science & Gretta Tritch Roman, College of Agricultural Sciences

Contact: jmw970@psu.edu
Facilitated Discussion Title: Digital fluency: Project iOn as a prototype
Abstract: Increasingly, data and information are being used by communities to make customized decisions on management, food safety and quality, sustainability, resilience, and economics. However, our ability to collect data has far outstripped our ability to effectively utilize data. The current challenge is how to successfully cultivate data to reveal actionable information. Digital fluency is the ability to leverage data and information to enhance critical thinking, problem-solving, communication, and innovation. In this data science community meeting, I will facilitate broad conversations surrounding each digital fluency specialization: design thinking, code competency, data visualizations, teaching and learning, data curation, ethics, communications and storytelling, and diversity and inclusion. The goal of this broad discussion is to introduce the concept of digital fluency within the data science community. I will also briefly discuss a project, funded by a university seed grant, on how prototyping the development and use of interactive open educational resource notebooks (ION) could provide digital fluency training for our faculty and students. Finally, I will briefly talk about our one-day Digital Fluency Symposium: The future of teaching, learning, and research in a digital world in October at Penn State.

June 5, 2019

View this week’s presentation files
Watch the recording (recording is truncated and requires ‘standard’ authentication or login)

David Hunter, Department of Statistics

Contact: dhunter@stat.psu.edu
Abstract: Data Science is an evolving field of study that neither includes nor is included within any single traditional discipline. We propose to support a data science community that embraces this interdisciplinarity, emphasizing that our community includes not only those whose work advances data science methods but those whose work requires the application of these methods. According to a preliminary survey, those who self-identify as part of such a community already represent a majority of campuses and units across the entirety of Penn State University. We seek to provide a space, both virtual and physical, to catalyze and publicize data science-related work at Penn State.

Simon Hooper, Department of Learning and Performance Systems

Contact: sxh12@psu.edu
Talk Title: How can data patterns help enhance student literacy?
Abstract: Our research team has designed a progress monitoring system to help teachers monitor Deaf and Hard of Hearing students’ literacy. The suite includes 8 assessments that target different literacy components. Most assessments take no more than one minute to complete, are gamified to enhance student motivation, and include customized scoring tools. Performance charts help teachers to determine whether individual students are making adequate progress. Targeted data, stored in a relational database, allow researchers to address specific questions, but web log data are not currently being analyzed. Our goal is to learn how supervised learning analytics can reveal patterns of teacher and student behavior that affect learner outcomes.

Drew Wham, Teaching and Learning with Technology, Data Empowered Learning Team

Contact: fcw5014@psu.edu
Talk Title: Measuring Perceptual Distance of Organismal Color Pattern using the Features of Deep Neural Networks
Abstract: A wide range of biological research relies upon the accurate and repeatable measurement of the degree to which organisms resemble one another. Current practice for quantifying organismal color pattern similarity, however, lags behind many of the recent advancements in deep learning computer vision. Here, I propose a new workflow that adapts several deep learning-based computer vision techniques for the purpose of quantifying organismal color pattern similarity. This workflow is fully unsupervised and requires no model training. Utilizing several classic color pattern datasets, I demonstrate that this technique is able to achieve similar results to state-of-the-art supervised and semi-supervised methods commonly utilized in the field. The unsupervised nature of this approach therefore has the potential to revolutionize color pattern research, offering an unbiased, accurate, scalable, and repeatable method of quantifying perceptual similarity.