Thoughts of a Recent Graduate: Jason Fries

Jason Fries - 2015 PhD graduate; currently postdoctoral fellow with Stanford University's Mobilize Center

Jason Fries is a 2015 PhD graduate. He is currently a postdoctoral fellow with Stanford University's Mobilize Center, an NIH-funded Big Data to Knowledge (BD2K) Center developing data science methods for understanding and modeling diseases of human mobility. He is co-advised by Chris Ré in Computer Science, Scott Delp from Bioengineering, and Nigam Shah from Bioinformatics, all at Stanford.

What are you working on now?

I'm working on two primary projects: (1) developing tools and methods using a new formalism for distant supervision called "data programming," and (2) modeling outcomes associated with joint replacement surgeries.

Chris Ré's group is developing a well-known system called DeepDive, a new type of data processing system that extracts entities and relations from unstructured "dark data" like text, tables, images, and figures. DeepDive is used in several high-impact applications, including monitoring human trafficking on the "dark" web, automatically constructing databases of paleontological data, and extracting gene-phenotype and other biological relationships from scientific literature.

DeepDive scales nicely as a system, but development cycles can be long as users iterate on the distant supervision rules used to label input data. This has motivated some interesting theoretical work in Chris's lab on programmatically supervising and de-noising training data, a machine learning paradigm called "data programming." I've been collaborating on building their next-generation data processing system, Snorkel, and on developing lightweight NLP systems for extracting biomedical relations from clinical text.
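To give a flavor of the idea: in data programming, users write many noisy heuristic "labeling functions" instead of hand-labeling training examples, and the system combines their conflicting votes into probabilistic labels. The sketch below is a minimal toy illustration of that concept; the heuristics, function names, and the simple majority-vote combiner are assumptions for this example, not Snorkel's actual API (Snorkel learns labeling-function accuracies with a generative model rather than using a plain vote).

```python
# Toy sketch of the labeling-function idea behind data programming.
# All names and heuristics here are illustrative assumptions, not
# Snorkel's real interface.

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_causes(sentence):
    """Heuristic: the word 'causes' suggests a true relation."""
    return POSITIVE if " causes " in sentence else ABSTAIN

def lf_negation(sentence):
    """Heuristic: explicit negation suggests no relation."""
    if " does not " in sentence or " no evidence " in sentence:
        return NEGATIVE
    return ABSTAIN

def lf_associated(sentence):
    """Heuristic: 'associated with' weakly suggests a relation."""
    return POSITIVE if " associated with " in sentence else ABSTAIN

LABELING_FUNCTIONS = [lf_causes, lf_negation, lf_associated]

def weak_label(sentence):
    """Combine the noisy votes by simple majority.

    Data programming actually models each labeling function's
    accuracy and learns a weighted combination; majority vote is
    just a stand-in to show how conflicting heuristics resolve.
    """
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    positives = sum(v == POSITIVE for v in votes)
    return POSITIVE if positives >= len(votes) / 2 else NEGATIVE

if __name__ == "__main__":
    print(weak_label("mutation X causes phenotype Y"))        # 1 (positive)
    print(weak_label("there is no evidence that X affects Y"))  # 0 (negative)
```

The payoff of this style of supervision is iteration speed: improving a labeling function relabels the entire corpus at once, which is exactly the development loop described above.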

My second project focuses on osteoarthritis (OA). A common clinical endpoint of OA is total joint replacement, so we're looking at how to use electronic medical record (EMR) data to better predict post-operative recovery. Unstructured patient notes are quite useful in this capacity, so it's a great use case for data programming. Ultimately, these models help lay the foundation for a more ambitious goal: deploying a near real-time implant surveillance system in which medical devices can be monitored and scored quantitatively using patient EMR data.

My overall goals during my postdoctoral fellowship are to (a) closely integrate research discoveries and tools into clinical settings and (b) develop a strong empirical argument for why structured data systems remain important, even in the age of deep learning and automatic feature engineering. Both of these goals are a continuation of what I started during my PhD in CS at Iowa. My ultimate goal is to transition into an academic position or a research scientist role in the hospital/healthcare field within the next two years.

Can you tell us a bit about your dissertation?

My dissertation looked at information extraction methods for population health surveillance. While some aspects of human health are reasonably well captured by primary care systems (e.g., hospitals, clinics), examining the ways in which behavior and everyday life impact health requires analyzing alternate data streams like social media, wearables, and smartphones. For example, with endemic sexually transmitted infections, it's useful to know the prevalence of behaviors like safe sex practices and illegal drug use in order to tailor public health interventions. This information is hard to collect and is definitely not a standard component of the EMR. The first part of my dissertation looked at sexual behavior surveillance of individuals using Craigslist to find anonymous sexual partners. I presented ways of algorithmically collecting survey-like information on risk behaviors, population demographics, and travel patterns in near real-time and at city-level geographic resolution. The last part of my dissertation looked at a collection of 15 million clinical notes from the UIHC's EMR system, where I worked on representation learning and recurrent neural network methods for named entity recognition.

Do you have any advice for our current graduate students?

Take as many statistics courses as you can, as early as you can. Also, definitely take Calculus III! I skipped the advanced math courses and really regretted it later in grad school, when my time and choices were more constrained. On the CS side, probabilistic graphical models and, more generally, probabilistic approaches to AI (e.g., Bayesian networks) are extremely useful in many different fields. Finally, don't shy away from building toy distributed systems on Amazon EC2 or other compute infrastructure. Being able to quickly analyze large datasets is immensely valuable and is the norm in any "data scientist" type of job.


Advisor: Alberto Segre | CompEpi

Other "Thoughts of a Recent Graduate" profiles are available at this link.