DSI Spring Symposium 2019

Sunday, March 31st

10:30 am - 5 pm

Rion Ballroom @ Reitz Union

RSVP: HERE

Come spend your Sunday at the largest DSI event of the year - our annual Symposium.

Begin the day with coffee, remarks from DSI leadership and the UFII Director, and a keynote.

The symposium continues with speakers from a wide range of research fields at UF in three breakout sessions of four speakers each.

Learn about computer vision, bioinformatics, political forecasting, business analytics, and more.

Our symposium will also include two rounds of workshops with several choices in each round, so you can brush up on your Python, learn about data visualization, or deepen your knowledge of machine learning.

This is a fantastic opportunity to network with students and faculty who are passionate about the impact of data science and the tools they utilize to realize that impact. 

Coffee and Lunch will be served. 

If you plan to attend, please RSVP through THIS FORM 

While we urge you to RSVP for food estimates, we will not turn anyone away, so feel free to bring a friend!

This event is completely free and open to the public.

Click the pictures below to see the schedule and room assignments for the breakout sessions and workshops!

Schedule Details:

11:00 - 11:15 AM Opening Remarks and Introduction to DSI
Vinay Chitepu, DSI President

Vinay Chitepu is a junior at the University of Florida with a major in Computer Science and a minor in Statistics. He has been involved in DSI for 3 years, starting as a workshop coordinator and moving up to Internal VP and soon after President. His talk will introduce you to DSI, its history, and its goals.

11:30 AM - 12:00 PM Keynote Speaker - Dr. David Kaber

Dr. David Kaber is Professor and Chair of the Department of Industrial and Systems Engineering at the University of Florida. His scholarly interest is in the area of human-systems engineering. His current research focuses on modeling and analysis of cognitive workload and situation awareness in unmanned systems operations, driver performance and behavior in automated vehicle use, and design principles for “automation transparency” in human-in-the-loop systems. He has earned several grants for projects, including an NSF Human-Centered Computing Research Award on virtual reality-based simulations for motor rehabilitation and fine-motor skill training. He has also received support from NASA for the study of aviation display clutter and its impact on pilot performance, as well as several North Carolina Department of Transportation contracts for the study of driver behavior and distraction in use of roadway facility designs. His research has a common theme of developing a better understanding of how to design and engineer human systems and, specifically, human-automation interaction. In total, Kaber has received over $10M in sponsored research and educational program funding. Through this funding, he has published over 100 refereed journal articles and over 150 conference publications, along with a similar number of research project technical reports. Kaber has also supervised the thesis research of over 30 master’s students and over 20 PhD students. Dr. Kaber is a Fellow of the Human Factors and Ergonomics Society as well as the Institute of Industrial & Systems Engineers. He received his bachelor’s and master’s degrees in industrial engineering from the University of Central Florida and a doctorate in industrial engineering from Texas Tech University. He is a certified safety professional and certified human factors professional.

Networking Lunch & Poster Sessions 12:00 - 1:20 pm

Setting up a data warehouse with BigQuery and Airflow by SharpSpring

SharpSpring marketing automation helps you drive more leads, convert leads to sales and prove ROI. And at just a fraction of the cost of competitors, SharpSpring’s powerful platform also includes: (1) Month-to-month billing (no long-term contracts), (2) Free, unlimited support, (3) Unlimited Users, (4) Open API that integrates with hundreds of apps.

Topological Data Analysis of Actin Networks by Nikola Milicevic & Parker Edwards

UF Dept. of Mathematics

Networks of filaments assembled from the protein actin contribute significantly to cells' ability to move and change shape. These actin networks exhibit distinct local geometric structure. Some networks contain regions of straight and tightly packed fibers, for instance, as well as loops of varying sizes. We analyze actin networks starting from data that consist of high resolution live-cell microscopy images of cells' actin fibers. Our methodology detects localized features using image segmentation and tools from topological data analysis: relative persistent homology, a novel approach in the field, and persistence landscapes. We are presently experimenting with a number of subsequent machine learning methods using geometric summaries of each image as the feature vectors.

Spatial movement and travel networks in astro-tourism: evidence from social media during the 2017 solar eclipse by Shinan Ma & Andrei P. Kirilenko

UF Dept. of Tourism, Recreation & Sport Management

Astro-tourism has recently grown into a fashionable, emerging niche market, yet it has been largely ignored by the academic community. Solar eclipse observation is arguably the most popular astro-tourism activity, as evidenced by the 2017 solar eclipse, which reportedly attracted millions of travelers to the path of totality. In this study, we use social media as a proxy to monitor astro-tourists’ spatial usage and travel patterns during the event, with the following questions posed and answered: (a) what were the most popular destinations; (b) what were the origins of these tourists; (c) what were the movement patterns and travel networks of these travelers; and (d) is it possible to identify market segments based on the emerging travel patterns.

Using living data to inform restoration and monitoring: An example from Lone Cabbage Reef by Melissa Moreno

School of Natural Resources and the Environment

“Living” data, data that are continuously collected from field programs or automated sensors, can provide critical information for informing ongoing restoration and monitoring programs in aquatic ecosystems. For example, these data can help inform whether water quality sensors are recording usable data and at what frequency field crews should service the sensors to ensure their functionality. This reduces the likelihood of sensor failure and missing data observations. However, continuous data can be difficult to manage for several reasons. Here we demonstrate a case history of a living data lifecycle for the Lone Cabbage Reef oyster restoration project near Cedar Key, Florida, using open-source, widely available, free or low-cost software. We demonstrate how data collected by field teams on oyster populations and recorded on paper data sheets, as well as water quality data from an array of autonomous sensors, are rapidly compiled, QA/QC’ed, and analyzed to provide prompt feedback to ongoing monitoring programs.

Optimizing Freeway Merge Operations under Conventional and Automated Vehicle Traffic by Aschkan Omidvar

UF Transportation Institute

We discuss the deployment and testing of an intelligent real-time isolated intersection traffic control system (IICS), designed to simultaneously optimize signal control and automated vehicle (AV) and connected vehicle (CV) trajectories under low-demand conditions. The work described here is part of an ongoing larger project (funded by the National Science Foundation - NSF - and the Florida Department of Transportation - FDOT) to develop, test, and deploy the IICS. The focus of this paper is on the deployment and testing of the algorithm at the Traffic Engineering and Research Laboratory (TERL), an FDOT closed-course facility. The algorithm (described in more detail elsewhere) optimizes signal control and provides optimal AV and CV trajectories at isolated intersections. The algorithm is designed to handle AVs, CVs, and conventional vehicles in mixed-traffic, low-demand conditions.

Stochastic Approximation and Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms by Adithya Devraj

UF Department of Electrical and Computer Engineering

This poster will provide an overview of stochastic approximation, with focus on optimizing the rate of convergence. With applications to reinforcement learning, it can be shown that typically, the Q-learning algorithm has infinite asymptotic variance. Three new stochastic approximation / Q-learning algorithms are introduced to overcome this issue: (i) The Zap Q-learning algorithm that has provably optimal asymptotic variance, and resembles the Newton-Raphson method in a deterministic setting, (ii) The PolSA algorithm that is based on Polyak's momentum technique, but with a specialized matrix momentum, and (iii) The NeSA algorithm based on Nesterov's acceleration technique.

Real-Time Intersection Optimization (RIO) for Signal Phase and Timing and Automated Vehicle Trajectory Data by Mahmoud Pourmehrab

UF Dept. of Civil and Coastal Engineering

Signalized intersections can perform as bottlenecks in the urban transportation network causing excessive delay and frustration. The emergence of automated and connected vehicles (CAVs) and recent advancement in computation power have created the opportunity to enhance operation level. Many previously developed algorithms which aim to integrate CAV technology with intersections hold assumptions on traffic composition and behavior. Unfortunately, this has limited the scalability and applicability of the suggested frameworks to real-world intersections. This research aims to develop an optimization-based intersection control algorithm to efficiently serve traffic of CAVs and conventional vehicles (CNVs) at an isolated intersection.

Kullback-Leibler-Quadratic Optimal Control for Demand Dispatch by Neil Cammardella

UF Department of Electrical and Computer Engineering

A new stochastic control methodology is introduced for distributed control, motivated by the goal of creating virtual energy storage from flexible electric loads, i.e., Demand Dispatch. In recent work, the authors have introduced Kullback-Leibler-Quadratic (KLQ) optimal control as a stochastic control methodology for Markovian models. The paper develops KLQ theory and shows how it can be applied to demand dispatch applications. In one formulation of the design, the grid balancing authority simply broadcasts the desired tracking signal, and the heterogeneous population of loads ramps power consumption up and down to accurately track the signal. Analysis of the Lagrangian dual of the KLQ optimization problem leads to a menu of solution options, and expressions of the gradient and Hessian suitable for Monte-Carlo-based optimization. Numerical results illustrate these theoretical results.

Novel miRNA Sequence Discovery via Analysis of qCLASH Experimental Data by Daniel B. Stribling

UF Department of Molecular Genetics and Microbiology

Previous bioinformatic methods for (q)CLASH data analysis, such as the “Hyb” pipeline, provide an effective means to identify miRNA-target interactions via alignment to a database of known miRNA transcripts. However, this method cannot unlock the significant potential of the (q)CLASH technique for novel miRNA sequence discovery or characterization of miRNA-like function by sequences not previously annotated as miRNA. Here I present ongoing work on a first-principles analysis method that takes a function-naïve approach to identification of miRNA sequences within (q)CLASH experimental data, including description of the process of identification of candidate miRNA sequences and subsequent evaluation of predicted physical properties, coverage, and other selected criteria.

Extra Proximal-Gradient Inspired Non-local Network by Qingchao Zhang

UF Department of Computer and Information Science and Engineering

Variational methods and deep learning are two powerful mainstream approaches to solving inverse problems. To take advantage of the advanced optimization algorithms of variational methods and the powerful representation ability of deep learning, we propose a novel Extra Proximal-Gradient Inspired Non-local Network. Our network is based on our proposed accelerated extra proximal-gradient algorithm, which combines the extra proximal-gradient method and Nesterov’s accelerated gradient method. Building on this, we provide a learnable data-driven L1-norm prior in residual representation. The sparsity-inducing prior involves two core complementary components: a local convolutional operator and a non-local operator that exploits the non-local self-similarity property. All the parameters in our network are learned by minimizing a loss function. On two example problems, image compressive sensing and image denoising, our network outperforms several state-of-the-art deep networks by large margins without greatly increasing the number of learnable parameters.

Optimal Control Variates for MCMC by Anand Radhakrishnan

UF Department of Electrical and Computer Engineering

Markov chain Monte Carlo techniques (MCMC) are popular for high-dimensional Bayesian inference problems. The expected value of a function of interest with respect to a target density is approximated empirically using samples obtained from a Markov chain whose invariant distribution is the same as the target density. These standard MCMC estimators suffer from high asymptotic variance, which is a measure of their convergence. In our work, we propose a method to minimize the asymptotic variance of the estimates by the introduction of a zero-mean term called the control variates. Optimal control variates are obtained by minimizing the asymptotic variance. The poster presents some numerical examples using Unadjusted Langevin (ULA) and Random walk Metropolis-Hastings (RWM) algorithms.
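For readers new to control variates, here is a minimal sketch of the basic idea. It is not the poster's method: it uses independent uniform samples rather than a Markov chain, compares ordinary sample variance rather than asymptotic variance, and the function name, target function, and sample size are all illustrative assumptions. It subtracts a scaled zero-mean term from each sample and picks the scaling that minimizes the variance.

```python
import math
import random
import statistics


def control_variate_demo(n_samples=20000, seed=0):
    """Estimate E[exp(U)], U ~ Uniform(0, 1), with and without a control variate.

    Hypothetical toy setup: i.i.d. samples stand in for the MCMC samples.
    """
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n_samples)]
    f = [math.exp(x) for x in u]   # function of interest
    g = [x - 0.5 for x in u]       # control variate: its true mean is exactly zero

    mean_f = statistics.fmean(f)
    mean_g = statistics.fmean(g)
    # Variance-minimizing coefficient c* = Cov(f, g) / Var(g), estimated from the samples.
    cov_fg = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g)) / (n_samples - 1)
    c_star = cov_fg / statistics.variance(g)

    # Subtracting c* times g leaves the mean unchanged but shrinks the variance.
    adjusted = [a - c_star * b for a, b in zip(f, g)]
    return {
        "plain_estimate": mean_f,
        "cv_estimate": statistics.fmean(adjusted),
        "plain_variance": statistics.variance(f),
        "cv_variance": statistics.variance(adjusted),
    }


result = control_variate_demo()
```

Both estimates target the same value, e - 1, but the adjusted samples vary far less, so the adjusted average settles down with fewer samples.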

Precision Ketogenic Diet Meal Tracking by Simon Frank

UF Department of Food Science and Human Nutrition

In the Borum Nutrition Lab, they are able to have a high success rate for treating patients with seizures because they create precise ketogenic recipes and diet plans using an extensive food database created and maintained by the lab.  The database tracks all of the macro and micronutrients for any food a patient could possibly intake and is updated to reflect food changes. Therefore, with the use of the food database and the programs created in the lab, they create, track and monitor a patient's intake. With the patient’s intake they can adjust the meals to better their treatment and health.  In addition, having the patient intake allows the lab to do research to try to determine which nutrient is a cause for the change in seizures for patients. In total, the lab uses data science to create a food database to track the patient's intake allowing them to create more precise meals and diet plans to give them the best treatment possible, and it allows the lab to research which nutrient could be the cause for the improved condition of subjects.

Learning with Biologically-Inspired Spike Trains in Reproducing Kernel Hilbert Space for Speech Processing by Kan Li

UF Computational NeuroEngineering Lab - CNEL

We present a novel real-time nonlinear dynamic framework for quantifying time-series structure using biologically-inspired spike trains. Audio signals are first converted into multi-channel spike trains using a gammatone filterbank and leaky integrate-and-fire (LIF) spike generators. These sparse representations are then mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS), using point-process kernels, where a state-space model (kernel adaptive autoregressive-moving-average or KAARMA) learns the dynamics of the multidimensional spike input using gradient descent learning. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. We compared the performance of the proposed framework to that of hidden Markov model (HMM) and spiking neural network (SNN).
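The spike-generation step can be pictured with a minimal sketch of a single leaky integrate-and-fire unit. This is only an illustration under assumed parameters (time step, time constant, threshold, and reset value are all hypothetical), and it omits the gammatone filterbank and the kernel machinery described above.

```python
def lif_spike_train(signal, dt=0.001, tau=0.02, threshold=1.0, v_reset=0.0):
    """Return the sample indices at which a leaky integrate-and-fire unit spikes.

    All parameter values are illustrative assumptions, not the authors' settings.
    """
    v = 0.0
    spikes = []
    for i, current in enumerate(signal):
        # Forward-Euler step of the membrane equation dv/dt = (-v + current) / tau.
        v += dt * (-v + current) / tau
        if v >= threshold:
            spikes.append(i)   # record the spike time as a sample index
            v = v_reset        # reset the membrane potential after firing
    return spikes


strong = lif_spike_train([2.0] * 1000)   # constant input above threshold
weak = lif_spike_train([0.5] * 1000)     # constant input below threshold
```

A constant input above the threshold yields a regular spike train, while an input that never drives the potential to the threshold yields no spikes, which is what makes the representation sparse.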

Breakout Sessions

Session 1: 1:30 - 2:00 PM

Room A (3320)

A Data Lake with BigQuery and Airflow at SharpSpring

Skyler Slade, Director of Site Reliability

Matthew Collins, Data Engineer

Business questions depend on data from many providers and processes that are often stored on many computers in many formats. Constructing a data lake to collect, document, and provide access to diverse types of data allows people like business analysts and data scientists to build reports and analyses faster and easier. SharpSpring, a Gainesville marketing automation company, is using Google BigQuery for data storage and Apache Airflow for data pipeline management to construct our data lake. In this presentation we’ll describe our architecture and show examples of how Airflow can be used to assemble and process data to prepare it for data users.

Room B (3315)

Real-World Data Challenges Faced by Startups

Jose Luna, CTO at Eventplicity

Machine learning, predictive analytics, and "big data" are taking the startup world by storm. We explore the real-world challenges faced by data scientists in a startup environment. Issues include shifting requirements, lack of domain expertise, poor data pipelining, human error, product pivots, and operating with limited resources.  With shared insights gained from real-world experience, you will be better prepared to determine if a specific startup is the right opportunity for you and better prepared to succeed once you're in the startup environment.

Room C (3305)

CTS-IT

Jenny Martinez

CTS-IT is a research support unit specializing in the science of information. We support research at all stages with services such as IT infrastructure design, software development and grant application assistance. CTS-IT provides training and support for researchers using REDCap to store their study data. We have also worked on a variety of projects such as MetSCalc, a web-based calculator for measuring metabolic syndrome severity, and NACCulator, which converts data from REDCap to a format required by a coordinating center.

Session 2: 2:10 - 2:40 PM

Room A (3320)

Conquering Tinder with Machine Learning

Charles Jekel

An individual reviewed over 8,000 online dating profiles on Tinder. For each profile, a feature set was constructed from the images. A method was then proposed to automatically like profiles based on a user’s historical preference. The method takes advantage of a CNN facial recognition model (FaceNet) to extract features from an individual's face. A point of diminishing marginal returns was identified to occur around 80 profiles, at which the model accuracy of 73% would only improve marginally after reviewing a significant number of additional profiles.

Room B (3315)

Program Analysis for Scientific Model Augmentation

Dr. James Fairbanks, Research Engineer at Georgia Tech Research Institute

The vast majority of scientific knowledge is represented as mathematical and computational models of natural and engineered phenomena, and to create scientific AI, we must build computational systems that can understand these models. Modeling frameworks create embedded Domain Specific Languages for describing models which are machine readable, but cannot be applied retroactively to existing scientific codes. We will discuss a novel approach to modeling frameworks, SemanticModels.jl, that allows novel models to be expressed in terms of transformations on existing models. Algorithmic manipulation of these models to add capabilities and combine existing models is easy within the framework. Code implementing the novel models is generated, compiled, and executed in an interactive modeling environment.

We will discuss how knowledge graphs, category theory, abstract algebra, and program analysis can help analyze software implementing scientific models. This analysis leads to practical tools for helping scientists develop novel models. Examples will be provided in interactive Julia notebooks.

Room C (3305)

A Bayesian Approach to Joint Estimation of Multiple Graphical Models

Peyman Jalali

Undirected Graphical Models represent a family of canonical statistical models for reconstructing interactions amongst a set of entities from multi-dimensional data profiles.

They have numerous applications in biology involving Omics and neuroimaging data, in social sciences for voting records and econ/financial data, in text mining, etc. For the case of continuous data, one natural way of inferring a graphical model is through estimating the inverse covariance/partial covariance matrix. Graphical models are particularly of interest when it comes to estimating multiple undirected graphs (networks), with observations belonging to several distinct groups (sub-populations). For example, in studying metabolomics data, there are interactions between the lipids and/or metabolites that are shared between different groups of inflammatory bowel disease patients and there are some interactions that are group specific. In this case, it is crucial to integrate the information across the groups, because estimating separate networks for each group results in ignoring the information about the structures that are shared between the underlying networks. The key is to borrow strength by combining the information that different groups can provide regarding the common structures. In this talk, I will first give a brief intro to Gaussian graphical models, then describe a novel Bayesian high-dimensional approach to joint estimation of multiple graphical models, and finally illustrate the model using synthetic and real data.
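To see why the inverse covariance matrix is the natural object for continuous data, here is a small sketch. It is a single-group toy example, not the Bayesian joint estimator from the talk; the chain x -> y -> z, the noise levels, and the function names are illustrative assumptions. For Gaussian data, a zero partial correlation between two variables given the rest corresponds to a zero entry in the inverse covariance matrix and hence a missing edge in the graph.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def partial_corr(a, b, given):
    """Partial correlation of a and b after conditioning on one other variable."""
    r_ab, r_ag, r_bg = pearson(a, b), pearson(a, given), pearson(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1 - r_ag ** 2) * (1 - r_bg ** 2))


# Hypothetical synthetic chain: x influences y, and y influences z.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]   # y depends on x
z = [yi + rng.gauss(0, 1) for yi in y]   # z depends on y only

marginal_xz = pearson(x, z)         # clearly nonzero: x and z are correlated
partial_xz = partial_corr(x, z, y)  # near zero: no direct x-z edge in the graph
```

Here x and z are strongly correlated overall, yet their partial correlation given y is close to zero, so the estimated graph would connect x to y and y to z but not x to z.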

Session 3: 2:50 - 3:20 PM

Room A (3320)

Data Science in the Construction Industry

Ashwini Jain

Organizations in today’s competitive environment are presented with the challenge of creating a versatile ecosystem that connects their devices, people, content, services, and processes. Digital transformation helps organizations by connecting business operations with applications, objects, data, and services. Simultaneously, this connection also generates vital data, enabling organizations to make information available, accessible, and usable anywhere, anytime, and from any device.

Digital transformation also affects the construction industry. Greater digitalization helps organizations foster closer inter-departmental collaboration, allowing the business to seamlessly move forward and grow at a faster pace. This data is often isolated, unassimilated, and underused. In some cases, the data are employed for narrow analyses and not aggregated to enable a broader understanding of how the organization is performing. Powerful data science and analytics tools are readily available and can help create a visualization platform and analyze patterns in the data. At Superior Construction, we use data science for everything from managing labor hours, equipment hours, and invoices to optimizing productivity. The job of a project manager has not changed, but the value of technology and the available data has changed. Empowering employees at all levels with technology and access to timely information creates both greater efficiency and the ability to improve project performance.

Room B (3315)

Tracking Antibiotic Resistance from Farm to Table to Clinic

Christina Boucher

Antimicrobial resistance (AMR) is a critical public health issue. Infections with drug-resistant pathogens are estimated to cause an additional eight million hospitalization days annually over the hospitalizations that would be seen for infections with susceptible agents. The use of antibiotics (in both clinical and agricultural settings) is being viewed as a precursor for these infections and thus is a major public health concern, particularly as outbreaks become more frequent and severe. However, scientific evidence describing the hazards associated with antibiotic use is lacking due to the inability to quantify the risk of these practices. One promising avenue to elucidate this risk is to use shotgun metagenomics to identify the AMR genes in samples taken through systematic spatiotemporal surveillance. The goal of this proposed work is to develop algorithms that will provide such a means for analysis. The algorithms need to be scalable to very large datasets and thus will require the development and use of succinct data structures. In this talk, we discuss a specific method to study AMR through the creation and use of large compressed colored de Bruijn graphs.
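As a rough picture of what a colored de Bruijn graph stores, here is a minimal, uncompressed sketch. The sample names, reads, and k value are hypothetical, and the succinct (compressed) representation that the talk focuses on is not attempted. Each k-mer becomes an edge between its two overlapping (k-1)-mers, and each edge is "colored" with the set of samples in which it appears.

```python
from collections import defaultdict


def colored_de_bruijn(samples, k=4):
    """Map each (k-1)-mer -> (k-1)-mer edge to the set of sample names containing it.

    Plain dictionary of sets; an illustrative stand-in for a compressed structure.
    """
    edges = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            # Slide a window of length k along the read.
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                # The k-mer links its length-(k-1) prefix to its length-(k-1) suffix.
                edges[(kmer[:-1], kmer[1:])].add(name)
    return dict(edges)


# Hypothetical toy samples, one short read each.
samples = {
    "farm": ["ACGTAC"],
    "clinic": ["CGTACG"],
}
graph = colored_de_bruijn(samples)
```

Edges carrying both colors mark sequence shared between the two samples, which is the kind of signal one would look for when tracing a resistance gene across sampling sites.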

Workshop Session 1: 3:30 - 4:10 pm
(4 workshops)

For all workshops, see instructions below

Salon A (3320) - Python 2
Instructor: Meghana Tatineni, DSI Workshop Coordinator Lead

This workshop is an introduction to machine learning in Python and will cover machine learning techniques such as support vector machine (SVM), random forest classifier, and the scikit-learn machine learning Python library.

Salon B (3315) - Natural Language Processing with Python
Instructor: Allison Kahn, DSI Secretary

This workshop covers introductory techniques and resources for NLP and ends with the implementation of a Named Entity Recognition algorithm. No advanced coding experience required!

Salon C (3305) - Statistics for Data Scientists
Instructor: Delaney Gomen, DSI Internal Vice President

Statistics for Data Scientists is an intro to the most commonly used and useful statistical concepts, implemented in Python. This workshop is perfect for students who want to turn theoretical statistical concepts into a practical toolkit for Data Science problems, and will cover the basics of concepts like the Central Limit Theorem, common distributions, and Causal Inference.

Salon D (3310) - The Tools of Data Science
Instructor: Allison Kahn, DSI Secretary


New to Data Science? This workshop is for you! We cover what tools/software you’ll need to get started with our workshops and data science on your own. Step by step instructions on downloading necessary software and files.

Workshop Session 2: 4:20 - 5:00 pm
(4 workshops)

For all workshops, see instructions below

Salon A (3320) - Data Visualization
Instructor: Delaney Gomen, DSI Internal Vice President

In this workshop, you will learn to use the Python data visualization library Seaborn to create stunning visuals inside of Python!

Salon B (3315) - SQL Workshop
Instructor: Aarti Tolani, Workshop Coordinator

This workshop covers introductory and complex database concepts, focusing on getting data from a database to be used in data analytics and science pipelines.

Salon C (3305) - Neural Networks
Instructor: Meghana Tatineni, DSI Workshop Coordinator Lead

Neural networks are some of the most powerful machine learning algorithms available today, with applications in computer vision, speech recognition, pattern classification, and many more.

Salon D (3310) - Introduction to Python
Instructor: Allison Kahn, DSI Secretary

This workshop is an introduction to Python and will cover all the basics of Python to get you started with using one of the most popular languages for data science. No programming experience needed.

Pre-Workshop Instructions

For all workshops, follow this link, click the green “Clone or Download” button, and select “Download ZIP”.  The folder you download will contain important files that are used for all workshops.


Introduction to Python, Data Visualization in Python using Seaborn, Introduction to Natural Language Processing in Python, and Machine Learning in Python:
Using this link, download and install the Anaconda Python distribution for Python 2.
Using this link, click “Clone or download” in the top right-hand corner and select “Download ZIP”. Open the Anaconda Navigator and select “Jupyter Notebook”; the notebook will launch in a web browser. Through this page, you can navigate through files on your computer. Navigate to the files you downloaded from GitHub and select a notebook; through the Jupyter webpage, you will be able to run the IPython notebook.
 

Amazon AWS Tutorial:
Follow the instructions at http://amzn.to/2GuijZ3 at least 24 hours before the workshop.