DSI Spring Symposium 2018


Saturday, March 17th

10:30 am - 5 pm

Reitz Union Grand Ballroom

RSVP: https://goo.gl/forms/7uE731ZT3QyIn2Px1

Come spend your Saturday at the largest DSI event of the year - our annual Symposium. 

Begin the day with coffee, remarks from DSI leadership and the UFII Director, and a keynote.

The symposium continues with speakers from a wide range of research fields at UF in three breakout sessions of four speakers each. Learn about computer vision, bioinformatics, political forecasting, business analytics, and more.

Our symposium will also include two rounds of workshops with several choices in each round, so you can brush up on your Python, learn about data visualization, or deepen your knowledge of machine learning.

This is a fantastic opportunity to network with students and faculty who are passionate about the impact of data science and the tools they utilize to realize that impact. 

Coffee and lunch will be served.

If you plan to attend, please RSVP through this form: https://goo.gl/forms/7uE731ZT3QyIn2Px1

While we urge you to RSVP for food estimates, we will not turn anyone away, so feel free to bring a friend! 

The schedule is below: 

10:30 - 11:00  Registration & Coffee in Grand Ballroom
11:00 - 11:15  Bobbie Isaly - What is DSI?
11:15 - 11:30  Dr. George Michailidis - What is the UFII?
11:30 - 12:30  Dr. Manuel Bermúdez - Paradigms and the Future of Computing
12:30 - 1:20   Networking Lunch
1:30 - 2:00    Breakout Session 1 (20 minute presentations, 10 minute Q&A)
2:10 - 2:40    Breakout Session 2 (20 minute presentations, 10 minute Q&A)
2:50 - 3:20    Breakout Session 3 (20 minute presentations, 10 minute Q&A)
3:30 - 4:10    Workshop Session 1 (4 workshops)
4:20 - 5:00    Workshop Session 2 (4 workshops)
5:05 - 5:10    Closing Remarks and How to Get Involved


Opening Remarks and Introduction to DSI
Bobbie Isaly, DSI President

Bobbie Isaly is a senior at the University of Florida with a major in Computer Science and a minor in Business Administration. She has been involved in DSI for 2 years, starting as online content manager and moving up to President. Bobbie is graduating this semester and moving to Dallas to start her professional career with Texas Instruments. Her talk will introduce you to DSI, its history, and its goals.


Introduction to UFII: The UF Informatics Institute
Dr. George Michailidis, UFII Director and Professor of Statistics

Dr. Michailidis is a Professor of Statistics and the Director of the UF Informatics Institute. He has made very important contributions to multivariate data analysis as well as modeling, analysis and control of networks. His current research interests include multivariate analysis and machine learning, computational statistics, change-point estimation, stochastic processing networks, bioinformatics, network tomography, visual analytics, and statistical methodology with applications to computer, communications and sensor networks.


Keynote Presentation:

Paradigms and the Future of Computing
Dr. Manuel Bermudez, Computer and Information Science and Engineering Department

The notion of a paradigm, which in the past has been exclusively a topic for philosophers, is recognized today as an important topic in all areas, from science to business. In this talk we will discuss the role that paradigms play in general, and in Computing. We will give several definitions of the term "paradigm", discuss the concept of paradigm blindness, and present a few amusing examples. We will also discuss some future perspectives in Computing, in the context of paradigms.

Dr. Manuel E. Bermudez is an Associate Professor in the Computer and Information Science and Engineering Department at the University of Florida, where he is also Director of Graduate Admissions, and Latin American Outreach Coordinator. He received his B.Sc. and Licenciado degrees in 1979 and 1980 from the University of Costa Rica, and his M.Sc. and Ph.D. in 1982 and 1984, from the University of California at Santa Cruz. His research interests are in programming languages, software engineering, and compilers. His main professional interest is in transfer of technology and academic cooperation with Latin America. He has twice been named Teacher of the Year by the ACM Student Chapter at the University of Florida. He has three times obtained the prestigious Fulbright Scholar award, in 1996-1997 (University of Costa Rica), in 2003-2004 (Universidad de los Andes in Mérida, Venezuela), and in 2014 (University of Costa Rica). He has received grants from the Dell Corporation, Lockheed Martin, General Dynamics, Electronic Arts, Disney, and Microsoft. He has supervised over 20 Master's students, and 4 Ph.D. students. He has published two books, and over 40 papers in journals and conference proceedings. He has given over 100 invited talks in various countries in Central and South America, and is listed in the Who's Who Among Hispanic Americans.


Breakout Session 1:

Salon A:

Smart/Green Manufacturing: Data-Enabled Decision Making and Optimization Applications
Panos M. Pardalos, ISE Department
Center for Applied Optimization, University of Florida

Smart manufacturing (Industry 4.0) is the fourth industrial revolution. With advances in information and telecommunication technologies and data-enabled decision making, smart manufacturing can be an essential component of sustainable development. We are going to discuss some successes and focus on data-enabled decision making and optimization applications. In addition, we will discuss future research directions and new challenges to society.

Salon D:

Behavioral Finance, Data Science, and Sports: Umpires and MLB Totals Market Efficiency
Dr. Brian M. Mills, Department of Tourism, Recreation and Sport Management

Sports betting markets have been used extensively in understanding market efficiency and behavioral biases, and have played a public role in generating interest in data science and analytics. We use this setting to test the propensity for the MLB totals market to integrate information about umpire home plate assignments, which are only known to the public for certain games. We first use generalized additive models to estimate the strike zone surface in MLB using data on individual pitch location for 2.5 million called pitches from 2008 through 2014. From these models, we aggregate error terms at the individual umpire level as a measure of favorability toward offense or defense, and insert this measure into least squares regressions to identify effects of umpire behavior on actual run totals. We then identify whether totals lines adjust upon release of information about umpire assignments to the public for certain games. Our regressions show that while the market adjusts slightly to umpire assignments, it does not adjust fully, and there are opportunities for sharp bettors to take advantage of this information. We exhibit a simple betting strategy using this granular umpire decision data that returns nearly 10% per bet.

Salon E:

Brief Overview of Statistical Designs
Matthew Robinson, Department of Biostatistics

This talk gives a brief overview of common statistical designs. We discuss statistical ways to compare groups and measure associations between variables while avoiding common pitfalls such as confounding. Related topics, including sample size considerations, power analysis, data formatting, and methods of randomization (including restricted randomization), will also be covered.

Salon H:

Network based approaches for measuring connectivity in the financial sector with applications to systemic risk
Dr. George Michailidis, Department of Statistics, UFII Director

In the aftermath of the 2008 financial crisis, there has been a lot of interest in understanding inter-relationships amongst firms in the financial sector. Hence, network analysis approaches have become popular to address this task. In this talk, we provide a brief overview of such methods and discuss how they can be used by policy makers in early assessment of developing risks in the sector. The methods are illustrated on stock log-returns and on interbank lending transactions data.

Breakout Session 2:

Salon A:

Visualizing Student Success Using a Sankey Diagram in Tableau
Tim Young, CLAS
Assistant Director for Data Management and Analysis

Student success is often considered only with a metric like retention or the four- and six-year graduation rates of a cohort. Simple statistics like these often mask subtle changes that happen at different points along a student’s academic career. I will demonstrate the ability to explore cohorts with a Sankey diagram (a.k.a. ribbon diagram) using Tableau software. I will also demonstrate how this visualization tool can be used for other purposes like exploring student success in course sequences.

Salon D:

Reconstructing the Evolution of Biological Function as a ‘Big Data’ Problem
Bryan Kolaczkowski, Department of Microbiology and Cell Science

Evolutionary biology aims to explain the similarities and differences among living organisms by characterizing the historical events and processes generating extant biodiversity. Statistical modeling of the evolutionary process has tremendously advanced our understanding of biodiversity at multiple levels of organization, from molecules to cells, organisms, populations and communities. A particularly powerful set of tools for reliably characterizing historical evolution in detail utilizes phylogeny-based statistical models to infer information about past events and processes. Here we describe current efforts aimed at modeling the evolution of functional traits across phylogenetic trees and posit that - in the case of molecular function - detailed information can be inferred at a massive scale by combining models of protein sequence, structure and function.

Salon E:

Using FaceNet to Automatically Like Tinder profiles based on Individual
Charles Jekel, Department of Mechanical and Aerospace Engineering

A user reviewed 8,545 online dating profiles on Tinder. For each profile, a feature set was constructed from the images. A method was proposed to automatically like profiles based on the user's own historical preference. The method takes advantage of a FaceNet facial recognition model to extract features from an individual's face. These features may be related to facial attractiveness. A simple logistic regression trained on the embeddings from just 20 profiles could obtain a 65% validation accuracy. A point of diminishing marginal returns was identified to occur around 80 profiles, at which the model accuracy of 73% would only improve marginally after reviewing a significant number of additional profiles.

Salon H:

Cleaning Data Efficiently
Dr. Carl Klarner, Klarnerpolitics.com

It is often asserted that the lion’s share of time spent on data analysis goes to getting the data ready. Yet there is little guidance about how to do so most effectively. The efficacy of different data cleaning strategies is evaluated based on evidence from a project involving the collection of precinct-level election results for eight states in the 1966 to 2002 period. These data were produced by “converting” images of returns, encompassing 7,209 pages of material, with optical character recognition (OCR). Reported totals, outlier analysis, and observed patterns in OCR errors were used to “flag” cases for workers to check in an iterative process of flagging and checking. The strategies evaluated were 1) the “check everything” strategy, in which workers check every cell of data in “discrepant columns” (i.e., columns where the reported and computed totals do not agree), 2) the “flag checking” strategy, in which workers check single cells flagged as being more likely to be incorrect, 3) the “column checking” strategy, in which workers start at the top of a discrepant column and stop checking when they reach the first error they find, and 4) the “discrepancy resolution” strategy, in which workers are informed of the discrepancy between a reported and computed total, use this information to expedite their search (for example, by only checking the hundreds place if the discrepancy is a multiple of 100), and stop checking a column when the discrepancy is resolved. Data regarding “time on task” are used to assess the efficacy of the different correction methods. The danger of “offsetting errors” is also evaluated. Last, advice on how to speed up the mundane task of data checking itself is given.
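
As an illustration only (this is not the speaker's code), the idea of a "discrepant column" and the hundreds-place hint from the abstract can be sketched in a few lines of Python. The column names, the numbers, and the helper names `discrepant_columns` and `digit_place_hint` are all hypothetical, and extending the hint to the tens and ones places is our own assumption.

```python
def discrepant_columns(columns, reported_totals):
    """Return {column: reported - computed} for columns whose totals disagree."""
    discrepancies = {}
    for name, cells in columns.items():
        diff = reported_totals[name] - sum(cells)
        if diff != 0:
            discrepancies[name] = diff
    return discrepancies


def digit_place_hint(diff):
    """Suggest which digit place to check first for a nonzero discrepancy.

    A multiple of 100 points to the hundreds place (the example given in the
    abstract); the tens/ones cases are an assumed extension of the same idea.
    """
    diff = abs(diff)
    if diff % 100 == 0:
        return "hundreds"
    if diff % 10 == 0:
        return "tens"
    return "ones"


# Made-up OCR output: one list of precinct values per column.
columns = {
    "candidate_a": [120, 345, 210],  # suppose the page actually says 245
    "candidate_b": [98, 150, 77],
}
# Made-up totals as printed on the scanned page.
reported_totals = {"candidate_a": 575, "candidate_b": 325}

flags = discrepant_columns(columns, reported_totals)
for name, diff in flags.items():
    print(name, diff, digit_place_hint(diff))
```

Only the first column is flagged here, and the size of its discrepancy tells a checker where to look first. Note that two errors of equal size and opposite sign in one column would cancel out and go unflagged, which is the "offsetting errors" danger the abstract mentions.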

Breakout Session 3:

Salon A:

Deep Learning Models for Applications in Health Care
Chengliang Yang, Computer & Information Science & Engineering

Large amounts of data are collected daily in health care, and machine learning approaches may provide advancements in preventive care and efficient delivery. Recently, deep learning has begun to reshape the frontier of machine learning research, providing a new, promising method to approach a wide range of health care problems. However, deep neural networks are usually “black boxes”, lacking the transparency required and expected for health care applications. In this presentation, we will show how to break into the black boxes of two types of deep learning models: a recurrent neural network predicting health care costs from administrative claims time series, and 3D convolutional networks predicting Alzheimer’s disease from MRI scans. We leverage techniques such as attention mechanisms, activation mapping, and segmentation-based sensitivity analysis to make the predictions from deep learning models interpretable.

Salon D:

Concept Drift Detection: the State-of-the-Art
Shujian Yu, Computational NeuroEngineering Laboratory

In a streaming environment, there is often a need for statistical prediction models to detect and adapt to concept drifts (i.e., changes in the joint distribution between predictor and response variables) so as to mitigate deteriorating predictive performance over time. Various concept drift detection approaches have been proposed in the past decades. However, they do not perform well across different data stream distributions and rely heavily on the availability of true labels. This talk presents a novel framework that can detect and also adapt to the various concept drift types, even in the scenario of expensive labels. The framework leverages a hierarchical set of hypothesis tests in an online fashion to detect concept drifts and employs an adaptive training strategy to significantly boost its adaptation capability. A Request-and-Reverify strategy is further incorporated to significantly reduce the requirement of true labels. The performance of the proposed framework is compared to benchmark approaches using both simulated and real-world datasets spanning the breadth of concept drift types. The proposed approach significantly outperforms benchmark solutions in terms of precision, delay of detection, the adaptability across different concepts as well as the number of required true labels.

Salon E:

Learning the Shape of Data
Dr. Peter Bubenik, Department of Mathematics

In many applications, such as medical imaging, the data has a complicated geometric structure whose shape is crucial for understanding the data, but is difficult to quantify using traditional methods. I will give an introduction to Topological Data Analysis, which uses ideas from topology to provide summaries of the shape of data that are stable with respect to perturbations of the data. I will also show how these tools can be used to construct feature vectors that can be combined with machine learning. Furthermore, I will apply this computational pipeline to an example from biology.


Workshop Session 1: 3:30-4:10 pm
(4 workshops)

For all workshops, see instructions below

Salon A - Introduction to Python
Instructor: Vinay Chitepu, DSI Workshop Coordinator

This workshop will walk you through the essentials of programming in Python. It is very beginner friendly, but all skill levels are encouraged to attend. Please bring a laptop.

Salon D - Data Visualization in Python using Seaborn
Instructor: Delaney Gomen, Workshop Contributor

Learn the basics of data cleaning and exploratory data analysis (EDA). Learn what to look for in your data, how to efficiently find leads, and how to clean up your findings for presentation. This workshop will show you how to create beautiful visualizations using different techniques in Python's powerful visualization library, Seaborn.

Salon E - Introduction to R
Instructor: Tyler Richards, DSI Workshop Coordinator Lead

This workshop will cover the statistical programming language R. It is aimed at those who are interested in R and may not have any experience using the language. We will be covering basic functions related to data structures and data types. We will also work through importing a data set and completing some basic manipulations.

Salon H - Introduction to Natural Language Processing in Python
Instructor: Allison Kahn, DSI Workshop Coordinator

Learn the basics of natural language processing in Python and how to make a simple sentiment analysis tool for movie reviews using a Naive Bayes machine learning classifier.
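
For a taste of the idea before the workshop, here is a minimal sketch of a Naive Bayes sentiment classifier. It is not the workshop's notebook: the toy reviews and the function names `train` and `predict` are made up for illustration, and it uses only the Python standard library, not whichever NLP library the workshop uses.

```python
import math
from collections import Counter


def train(reviews):
    """Count words per label and documents per label from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in reviews:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the given text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())

    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior: how common this label is overall.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up training set; real movie-review data would be far larger.
reviews = [
    ("a great and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a boring and awful film", "neg"),
    ("awful plot and boring acting", "neg"),
]
word_counts, label_counts = train(reviews)
print(predict("great story", word_counts, label_counts))
print(predict("boring awful plot", word_counts, label_counts))
```

The classifier treats each review as a bag of words and assumes the words are independent given the label, which is the "naive" part of the name.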

Workshop Session 2: 4:20-5:00 pm
(4 workshops)

For all workshops, see instructions below

Salon A - Introduction to R
Instructor: Vinay Chitepu, DSI Workshop Coordinator

This workshop will cover the statistical programming language R. It is aimed at those who are interested in R and may not have any experience using the language. We will be covering basic functions related to data structures and data types. We will also work through importing a data set and completing some basic manipulations.

Salon D - Introduction to Python
Instructor: Anthony Codella, DSI Internal Vice President

This workshop will walk you through the essentials of programming in Python. It is very beginner friendly, but all skill levels are encouraged to attend. Please bring a laptop.

Salon E - Machine Learning in Python
Instructor: Tyler Richards, DSI Workshop Coordinator Lead

Machine learning is quickly becoming one of the most sought-after skill sets in the technology industry, as it allows computers to recognize patterns and make predictions without being explicitly programmed to do so. This workshop will cover the implementation of ML algorithms like SVMs and Random Forests in Python. 

Salon H - Amazon AWS Tutorial by UF Artificial Intelligence
Instructor: Mason Rawson, UF AI

In this tutorial, we will cover the basics of Deep Learning and Amazon AWS. We will show how to set up a machine learning AMI on AWS. We will then briefly discuss deep learning fundamentals and run a neural net on a data science problem using Keras.

Pre-Workshop Instructions

For all workshops, follow this link (https://github.com/dsiufl/SpringSymposium2018), click the green “Clone or Download” button, and select “Download ZIP”. The folder you download will contain important files that are used for all workshops.

Introduction to Python, Data Visualization in Python using Seaborn, Introduction to Natural Language Processing in Python, and Machine Learning in Python:
Using this link (https://www.continuum.io/downloads), download and install the Anaconda Python distribution for Python 2.
If you have not already done so, go to https://github.com/dsiufl/SpringSymposium2018, click “Clone or Download” in the top right-hand corner, and select “Download ZIP”. Open the Anaconda Navigator and select “Jupyter Notebook”; the notebook will launch in a web browser. Through this page, you can navigate through the files on your computer. Navigate to the files you downloaded from GitHub and select a notebook; through the Jupyter webpage, you will be able to run the IPython notebook.

Introduction to Data Analysis in R:

First download R: https://cran.cnr.berkeley.edu
For Windows: Open up the link for Windows and select ‘install R for the first time’.
For Mac: Visit https://cran.r-project.org/bin/macosx/ and select your operating system.

Once you have downloaded R, download RStudio. Select your operating system and complete the download.

Run RStudio and make sure to download some packages by running the following commands. If prompted, select a CRAN repository in California for consistency.

Amazon AWS Tutorial:
Follow the instructions at http://amzn.to/2GuijZ3 at least 24 hours before the workshop.