Welcome!

Here is the short version of the data science capstone day schedule. Below you will find abstract information for each year.

2024

2023

Schedule

Date: Wednesday, May 3 from noon (12:00pm) - 3:00pm

12:00: Developing a series of Convolutional Neural Networks for the detection of lanes, vehicles, traffic signs, and traffic lights using dash cam driving footage. Taha Amer (DSC master's project; Prof. Adnan El Nasan, Computer & Information Science)
12:13: Command Line Tool for Exploration and Visualization of Data. Sudhanshu Mukherjee (DSC master's project; Advisor: Prof. Alfa Heryudono, Mathematics)
12:26: Vocal biomarkers associated with hospitalization and mortality among congestive heart failure patients. HarshilNileshkumar Patel (DSC master's project; Advisor: Prof. Gary Davis, Mathematics)
12:39: Developing a high-performance framework for differentiable gravitational waveform models. Tousif Islam (DSC master's project; Advisor: Prof. Scott Field, Mathematics)
12:52: Stock Price Prediction using Machine learning and Sentiment analysis. MaheshKumarReddy Pathireddy (DSC master's project; Prof. Yuchou Chang, Computer & Information Science)
1:05: Investigating the Impact of Local Demographic and Personal Factors on Hypertension Diagnosis: A Data Science Study. Benjamin Pfeffer (DSC master's project; Advisor: Prof. Donghui Yan, Mathematics)
1:18: Learning Orbital Dynamics of Binary Black Hole Systems from Gravitational Wave Measurements. Pranav Vinod (DSC master's project; Advisor: Prof. Scott Field, Mathematics)
1:31: Generating a Database to track decline of endangered species. Ethan Ducharme (DSC senior; external partner: Dartmouth Natural Resources Trust)
1:44: US Patent Phrase To Phrase Matching: Detect Novel Semantic Similarity (NSS) To Extract Meaningful Information From US Patent Applications. Anubhav Shankar (DSC master's project; Prof. Ming Shao, Computer & Information Science)
1:57: Using Generative AI to Investigate Transient Noise Artifacts in Gravitational Wave Interferometers. Christopher Johanson (DSC master's project; Advisor: Prof. Sarah Caudill, Physics)
2:10: Class Incremental Learning Acceleration: Harnessing A100 GPUs for Efficient Methods of Experimentation. Andrew Anctil (DSC master's project; Prof. Ming Shao, Computer & Information Science)
2:23: Hover-and-Click BoxPlot. VinilGiriraj Harsh (DSC master's project; Advisor: Prof. Alfa Heryudono, Mathematics)
2:36: Automatic Text Summarizer. VijayKumar Kuchibhotla (DSC master's project; Advisor: Prof. Donghui Yan, Mathematics)
2:49: Drug Recommendation System based on Sentiment Analysis of Drug Reviews using Machine Learning. NaveenYadav Guthi (DSC master's project; Prof. Yuchou Chang, Computer & Information Science)

Students who graduated in the Fall, Winter, or Summer have "On-demand talk" (recordings are available)

Drowsy Driver Detection using Neural Networks. Rahul Nimmagadda (DSC master's project; Prof. Yuchou Chang, Computer & Information Science).
Stock Market Prediction using LSTM. Lavanya Bollineni (DSC master's project; Prof. Donghui Yan, Mathematics).
Text Summarization Using Natural Language Processing. Vyshnavi Mattapalli (DSC master's project; Prof. Donghui Yan, Mathematics).
MonkeyPox Visualization. Tharun Balaji (DSC master's project; Prof. Yuchou Chang, Computer & Information Science).
Sars Cov-2 CT-Scan Dataset Prediction. ShreyaReddy KethiReddy (DSC master's project; Prof. Donghui Yan, Mathematics).
Correlation of Happiness Index and Homicide Rates. Gopi Krishna Vajrala (DSC master's project; Prof. Yuchou Chang, Computer & Information Science).
Object classification using CNN on CIFAR-100 dataset. Mohan Teja (DSC master's project; Prof. Yuchou Chang, Computer & Information Science)
ECG Signal Neural Network Classification using Machine learning with Python. Reshwanth Gunda (DSC master's project; Prof. Yuchou Chang, Computer & Information Science).
A numerical investigation of a variable coefficient elliptic solver. SanjanaReddy Singireddy (DSC master's project; Advisor: Prof. Cheng Wang, Mathematics)
NBA games winner Prediction. Anudeep Goud Rampur (DSC master's project; Advisor: Prof. Yuchou Chang, Computer & Information Science)
Application of Machine Learning in Late Delivery Prediction. Praharsha Prateek More (DSC master's project; Advisor: Prof. Yuchou Chang, Computer & Information Science)
Multimodality-Enhanced Graph Generation and Multimodality-Driven Graph Convolutional Networks. Rohan Gonjari (DSC master's thesis; Prof. Ming Shao, Computer & Information Science).
Explainability of Network Intrusion Detection using Transformers: A Packet-Level Approach. Pahalavan Rajkumar Dheivanayahi (DSC master's thesis; Prof. Gokhan Kul, Computer & Information Science).
User Recommendation System for E-commerce. Vijay Mohan Yeddu (DSC master's project; Advisor: Prof. Donghui Yan, Mathematics)
AI-Boosted Intelligent Fish Discard Chute. Pratishthit Choudhary (DSC master's thesis; Prof. Ming Shao, Computer & Information Science).
Unicorn Chronicles: Visualizing the Growth and Impact of High-Valued Startups. Kirti Bendigeri (DSC master's project; Advisor: Prof. Yuchou Chang, Computer & Information Science)
Heart Health Prediction Analysis: AI in Healthcare. Sriram Jagadeesan (DSC master's project; Advisor: Prof. Yuchou Chang, Computer & Information Science)
Performance analysis of sequential merge sort and parallel merge sort. NeerajSai Alapati (DSC master's project; Advisor: Prof. Yuchou Chang, Computer & Information Science)
A numerical investigation of an iteration solver for a singular-potential gradient flow. Harsh Makwana (DSC master's project; Advisor: Prof. Cheng Wang, Mathematics)
Exploring Simplicial Complex Structures in Text. Andrew Disher (DSC master's project; Advisor: Prof. Gary Davis, Mathematics)
Collaborative Strategies in RoboCup Rescue: A Data-Driven Approach. Abhijot Bedi (DSC master's thesis; Prof. Shelley Zhang, Computer & Information Science).

Abstracts

Speaker: Pratishthit Choudhary

Title: AI-Boosted Intelligent Fish Discard Chute

Abstract: The fish processing industry faces several challenges in ensuring the quality of its products, particularly in accurately monitoring and controlling the species and size of fish on the conveyor belt during processing. To address this problem, we developed an AI-based system that can detect, segment, and count fish species on a conveyor belt live stream while also detecting the fish length using stereo vision. Our model is built on OpenCV and YOLOv8, a state-of-the-art (SOTA) computer vision model, and was trained on a large, internally created dataset of fish conveyor images. This model will then be pushed to an Internet of Things (IoT) device that can be connected to the camera placed right above the conveyor belt on ships for real-time fish processing. Our research strategy involved data collection and annotation, pre-processing images, and training a custom model on top of a pre-trained segmentation model with 45 million parameters. Our results demonstrated that our model achieved high accuracy in detecting and classifying fish species and segmenting the fish bodies while also counting fish by their species and measuring their length. We believe that our research outcomes have the potential to revolutionize the fish processing industry by enabling real-time monitoring of fish species and size, reducing human error, and improving overall quality control.

Speaker: Praharsha Prateek More

Title: Application of Machine Learning in Late Delivery Prediction

Speaker: Anudeep Goud Rampur

Title: NBA games winner Prediction

Abstract: The objective of this data science project is to develop a model for predicting the winners of NBA games. With the ever-increasing availability of data in sports, accurate prediction models can provide valuable insights and assist stakeholders, such as sports analysts and betting enthusiasts, in making informed decisions. In this project, we utilized a dataset containing historical NBA game data, including game attributes such as the location the game is played and game outcomes. The dataset was preprocessed and carefully prepared to ensure data quality and relevance. We employed the XGBoost algorithm, a powerful and widely used gradient boosting technique, for our prediction model. By leveraging XGBoost's ability to capture complex relationships and handle imbalanced data, we aimed to improve the accuracy of our predictions. Throughout the project, we conducted exploratory data analysis to gain insights into the dataset and identify relevant features for prediction. We applied rigorous feature engineering techniques to enhance the model's predictive capabilities and improve its generalization. The model was trained and evaluated using appropriate performance metrics, including accuracy, precision, recall, and F1 score. We compared the results of our XGBoost model with other baseline models to demonstrate its superior performance. Our findings reveal that the developed XGBoost model significantly enhances the accuracy of NBA game winner prediction compared to alternative approaches. The model's robustness to noise and missing values, as well as its feature importance ranking, further contribute to its effectiveness. The implications of accurate NBA game winner prediction extend beyond sports analytics. They can be valuable for sports enthusiasts, sports betting markets, and team management in making strategic decisions.
Overall, this project showcases the potential of data science and machine learning techniques, specifically XGBoost, in improving the accuracy of NBA game winner prediction. The developed model provides a valuable tool for analyzing and understanding the factors influencing game outcomes and offers practical insights for various stakeholders in the sports industry.

Speaker: Tousif Islam

Title: Developing a high-performance framework for differentiable gravitational waveform models

Abstract: Both the detection and characterization of gravitational wave (GW) signals from the merger of compact object binaries, like binary black holes (BBHs), heavily rely on accurate gravitational waveform models. Reduced order surrogate waveform models trained on expensive numerical relativity simulations are fast and accurate and can enable optimal signal detection and faithful inference of the binary source properties. However, traditional data analysis frameworks based on Markov chain Monte Carlo (MCMC) or nested sampling are computationally expensive and often inefficient in exploring highly correlated parameter space. It is therefore timely to develop efficient sampling algorithms such as Hamiltonian Monte Carlo, which employs gradient-based optimization and therefore requires differentiable surrogate waveforms. In this project, we develop a differentiable surrogate model with existing numerical relativity training data using JAX, a commonly used Python-based automatic differentiation framework. This model will be made public through an open-source package named ripple.

Speaker: Sudhanshu Mukherjee

Title: Command Line Tool for Exploration and Visualization of Data.

Abstract: A command line tool for data pre-processing and visualization to streamline data preparation tasks in the terminal. The tool provides several functionalities, including data cleaning, imputation, and outlier detection. The user can input data in different file formats, and the tool will perform a quick summary analysis and plots for each combination of numeric features. The tool will also identify outliers using statistical packages with options for user-defined threshold limits. Other functionality includes imputation capabilities for missing data, with the option to use mean, median, and different imputing methods to fill in the gaps. Overall, this tool can be used to quickly prototype data pre-processing, visualization, and cleaning tasks in standard terminals where GUI functionalities are limited.
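
The imputation and outlier-detection steps described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function names `impute` and `flag_outliers`, the z-score rule, and the default threshold of 3.0 are assumptions made for the example.

```python
import statistics
from typing import List, Optional


def impute(values: List[Optional[float]], method: str = "median") -> List[float]:
    """Fill missing entries (None) with the mean or median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def flag_outliers(values: List[float], threshold: float = 3.0) -> List[bool]:
    """Mark values whose z-score magnitude exceeds a user-defined threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # A constant column has no outliers under this rule.
        return [False] * len(values)
    return [abs(v - mean) / spread > threshold for v in values]
```

A typical flow would impute first and flag afterwards, for example `flag_outliers(impute(column), threshold=2.0)`, so that missing entries do not break the mean and standard deviation.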

Speaker: HarshilNileshkumar Patel

Title: Vocal biomarkers associated with hospitalization and mortality among congestive heart failure patients

Abstract: Vocal biomarkers, or changes in the features of a patient's voice, have been linked in recent studies to an increased risk of hospitalization and mortality in people with congestive heart failure. These vocal biomarkers comprise fluctuations in pitch, tone, and other characteristics that may be recognized and examined using modern algorithms and machine learning methods. Healthcare professionals may be better equipped to monitor patients with congestive heart failure and offer early therapies that might improve outcomes and lower the risk of adverse events if they can recognize these vocal indicators.

Speaker: Taha Amer

Title: Developing a series of Convolutional Neural Networks for detection of lanes, vehicles, traffic signs, and traffic lights using dash cam driving footage.

Abstract: In this capstone project, we present a novel approach to the detection of lanes, vehicles, traffic signs, and traffic lights in driving video footage using convolutional neural networks (CNNs). We train our models on large datasets of labeled target variables and test their accuracy and efficiency on a separate test set. We evaluate our models using real driving footage captured from dash cams and several performance metrics, including precision, recall, and F1 score. Our results demonstrate that our CNN models achieve high accuracy and speed in detecting lanes, vehicles, traffic signs, and traffic lights in driving video footage.
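
For readers unfamiliar with the metrics named above, the following Python sketch shows how precision, recall, and F1 score are computed from binary labels. It is a simplified illustration, not the project's evaluation code: real object-detection evaluation usually first matches predicted and ground-truth boxes, and the function name `precision_recall_f1` is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels (1 = object present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```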

Speaker: Benjamin Pfeffer

Title: Investigating the Impact of Local Demographic and Personal Factors on Hypertension Diagnosis: A Data Science Study

Abstract: Hypertension, or high blood pressure, is a serious health condition that affects millions of people worldwide. Even though hypertension can be managed by making lifestyle changes or going on medication, its high prevalence and association with cardiovascular disease make it a significant public health concern. National studies have been performed on what factors influence hypertension diagnosis, but these large-scale results may cause the biggest local factors to fade and, thus, have little impact on specific communities. This data science study investigates the impact that demographic and personal factors have on hypertension diagnoses through the use of a community hospital’s data about patient visits and blood pressure results over the course of one year. This approach focused on using demographic information such as age, sex, race, ethnicity, and zip code, as well as personal information such as smoking status, to identify relationships with, and impact on, hypertension diagnoses. Using statistical and machine learning methods, the significance of each demographic and personal factor on predicting hypertension diagnosis was analyzed. The findings suggest that there are certain demographic and personal factors that more strongly impact hypertension diagnosis. These results may lead to significant changes in the hospital’s policies and initiatives that aim to prevent and manage hypertension. Through performing this investigation, we may be able to reduce the negative impact of hypertension in the hospital’s community, and improve the health of many patients. This study underscores the importance of considering the demographic and personal information of patients in a community, as they relate to hypertension, and highlights the need for a hospital-specific, patient-focused treatment strategy.

Speaker: Pranav Vinod

Title: Learning Orbital Dynamics of Binary Black Hole Systems from Gravitational Wave Measurements.

Abstract: As two black holes orbit each other, their motion generates gravitational waves that propagate to the far field, where they can be observed by detectors in an international network. The underlying physics of this process is governed by complex partial differential equations, which connect the near-field dynamics of the black holes to the far-field gravitational radiation. Traditionally, computational tools used to model black hole orbital dynamics and gravitational waves have been expensive simulation codes or approximations to general relativity such as Post-Newtonian formalism. However, [1] has shown that it is possible to deduce relativistic two-body orbital models from gravitational wave measurements using an inverse problem formulation that relies on a neural network with weights and biases as control variables. In this project we propose to extend this work by incorporating a range of physically viable parameters, p : semi-latus rectum and e : eccentricity. To accomplish this, we have conducted a series of experiments to investigate how the loss for the neural network varies for different values of p and e. We have also examined different learning techniques to assess the effectiveness of cumulative learning in reducing loss.

Speaker: Anubhav Shankar

Title: US Patent Phrase To Phrase Matching: Detect Novel Semantic Similarity (NSS) To Extract Meaningful Information From US Patent Applications

Abstract: The U.S. Patent and Trademark Office (USPTO) offers one of the world’s largest repositories of scientific, technical, and commercial information through its Open Data Portal (ODP). Patents are a form of intellectual property granted in exchange for the public disclosure of new and valuable inventions. The motivation of this undertaking is to detect/establish “novel semantic similarity (NSS).” NSS between phrases is a critical part of the patent evaluation process to recognize if an invention has been described before. Spurious and duplicate patent applications clog the pipeline and lead to unnecessary time and financial wastage. The scope of this project goes beyond simple phrase-to-phrase matching. Contextual establishment of phrases is also essential. For example, while the terms may have low semantic similarity in everyday language, the likeness of their meaning is much closer if considered in the context of or scenery.

Speaker: Christopher Johanson

Title: Using Generative AI to Investigate Transient Noise Artifacts in Gravitational Wave Interferometers

Abstract: Transient noise artifacts in gravitational wave (GW) interferometers limit the amount of true positive detections that can be added to scientific databases. Investigating these artifacts can, in turn, aid in our ability to confidently make GW detections. This work uses Generative Adversarial Networks (GANs) to simulate said artifacts. In addition, this work builds on previous investigations by analyzing how changes in network architecture impact the networks' outputs.

Speaker: Andrew Anctil

Title: Class Incremental Learning Acceleration: Harnessing A100 GPUs for Efficient Methods of Experimentation

Speaker: VinilGiriraj Harsh

Title: Hover-and-Click BoxPlot

Abstract: Box plots are commonly used to visualize how data are distributed and to identify outliers. However, rudimentary static versions of them lack point-and-click interactivity. This interactivity is particularly useful when tinkering with the plots to show some features. The b_hover function is a small Python package that addresses this limitation by adding interactivity, allowing users to visualize and explore the plots with hover-over functionality. This feature lets users view the specific data points for each box and whisker, providing a more in-depth understanding of the data distribution. It also offers customization options for boxplot parameters such as outlier points, whisker length, and box fill color. The simple code is open-source, easy to use, and can be integrated into any Python project. In this project, we demonstrate the b_hover library's capabilities by visualizing a dataset containing the distribution of salaries for software developers in various countries. Point-and-click interactivity is also useful when finalizing the plots before saving them for a manuscript.

Speaker: VijayKumar Kuchibhotla

Title: Automatic Text Summarizer

Abstract: In modern times, data is growing rapidly in every domain, such as news, social media, banking, and education. Because of this excess of data, there is a need for an automatic summarizer capable of summarizing data, especially textual data, from the original document without losing any critical content. The keywords of a document are a subset of its words or phrases that describe its meaning. Manual assignment of quality keywords is time-consuming and expensive. In this paper, we present our preliminary development, including a sentence similarity index based on cosine similarity to measure connectedness within clusters of keywords. Using such techniques, the approach adopts a hybrid method that summarizes a document together with the keywords identified for that type of document using text mining techniques. Since the text summarization process is highly dependent on keyword extraction, the overall results are promising.

Speaker: NaveenYadav Guthi

Title: Drug Recommendation System based on Sentiment Analysis of Drug Reviews using Machine Learning

Speaker: MaheshKumarReddy Pathireddy

Title: Stock Price Prediction using Machine learning and Sentiment Analysis.

Abstract: The use of machine learning algorithms and sentiment analysis will help in predicting stock prices by analyzing the impact of news and market sentiment. Machine learning is a subset of artificial intelligence that uses statistical techniques to enable machines to learn from data and improve performance on a specific task. Sentiment analysis, on the other hand, is a natural language processing technique that identifies and extracts subjective information from text data, such as opinions and emotions. The integration of machine learning algorithms and sentiment analysis techniques will help in analyzing the impact of market sentiment and news on stock prices. The sentiment analysis will help in identifying the positive or negative news related to a company or a particular stock, which will be used as an input to the predictive model. The predictive model will use this information along with the historical data to predict the future movement of stock prices.

Speaker: Ethan Ducharme

Title: Generating a Database to track decline of endangered species

2022

Schedule

Date: Wednesday, May 4 from 12:30pm - 3:30pm

12:30: Multi-Document Summarization using Maximal Marginal Relevance and Sentence Graph Compression. Satish Uppalapati (DSC master's project; Advisor: Prof. Donghui Yan, Mathematics)
12:45: Generating Cancer Images to Improve Cancer Diagnosis Accuracy. Benjamin Pfeffer (DSC senior)
1:00: Prototyping the UUV Docking Process with Machine Learning. Brianna Johnson (DSC master's project; Advisor: Prof. Alfa Heryudono, Mathematics)
1:15: Fitchburg State Baseball Team Dashboard. Nicholas Collins (DSC senior; external partner: Fitchburg State Baseball Team)
1:30: Human Activity Classification from Accelerometer Data. Sai Surya Nattuva (DSC master's project; Prof. Gokhan Kul, Computer and Information Science).
1:45: A Database User Interface Web Application for Green Energy Consumers Alliance’s Renewable Energy Certificate Allocation Task and Inventory Management Using Python, Plotly Dash, and SQLite. Nathan Rice (DSC senior; external partner: Green Energy Consumers Alliance)
2:00: Visualization and Interaction of the Human Skeleton. Sarabjit Saini (DSC master's project; Prof. Ming Shao, Computer & Information Science)
2:15: Improving Natural Language Classification With Augmented Data From GPT-3. Salvador Balkus (DSC senior)
2:30: Deep Learning for Document Data Extraction and Classification. Andrew Anctil (DSC senior; external partner: ArmorDoc)
2:45: Database and User Interface Development for Study of Atlantic Cod Fish. Lauren Fletcher (DSC master's project; Advisor: Prof. Steven Cadrin, Fisheries Oceanography)
3:00: Sentiment Analysis of Quarterly Earnings Reports to Predict Subsequent Change in Stock Prices. Michael Delisle (DSC master's project; Advisor: Prof. Donghui Yan, Mathematics)
3:15: Numerical simulation of different coarsening processes for variable mobility function and alpha. Emmanuel Oyedeji (DSC master's project; Advisor: Prof. Cheng Wang, Mathematics)

Students who graduated in the Fall, Winter, or Summer have "On-demand talk" (recordings are available)

SR-BigGAN: Lightweight Image Super-Resolution with Priors and Knowledge Distillation. Harshitha Srinivas Rao (DSC master's thesis; Prof. Ming Shao, Computer & Information Science).
A numerical investigation of the Flory-Huggins Cahn-Hilliard model. SunhithReddy Kotapally (DSC master's project; Advisor: Prof. Cheng Wang, Mathematics)
Racial Disparities Among Police Killings. Alexandra Sherman (DSC master's project; Advisor: Prof. Gary Davis, Mathematics)
Study of LSTM Applicability: Weather and Stock Prediction. Arijit Dey (DSC master's project; Advisor: Iren Valova, Computer & Information Science)
Data Analytics Intern: University of Massachusetts President's Office (UMPO). Abhinav Pendem (DSC master's project; Advisor: Yuchou Chang, Computer & Information Science)
Visualization and Interaction of Skeleton Based Human Action Dataset. Vijay Kshetri (DSC master's project; Prof. Ming Shao, Computer & Information Science).
Comparative Study and Performance Analysis of Recent Deep Reinforcement Learning. Mounika Thakkallapally (DSC master's project; Prof. Ming Shao, Computer & Information Science).
Web application on House Price Prediction Using Machine Learning. Nayani Yalamanchili (DSC master's project; Prof. Yuchou Chang, Computer & Information Science).
Project at Amazon Web Services: Automating Visibility and Patching Metric Emails. Nate Rice (DSC master's project; Prof. Bharatendra Rai)

Abstracts

Speaker: Nate Rice

Title: A Database User Interface Web Application for Green Energy Consumers Alliance’s Renewable Energy Certificate Allocation Task and Inventory Management Using Python, Plotly Dash, and SQLite

Abstract: Managing data of a growing business using Microsoft Excel can be cumbersome as the data becomes increasingly large and complex. For Green Energy Consumers Alliance, a non-profit organization in New England that provides access to renewable energy through the purchase of Renewable Energy Certificates (RECs), this had become apparent as the company added new customers and suppliers every year. The goal of this project was to develop a software system to help the company more easily and accurately complete quarterly tasks essential to the business model. The main focus was on the task of REC allocation. In order to facilitate this task, the data was reorganized into a relational database using Python and SQLite. A web application was created as a database user interface and REC allocation tool using Python Dash. This database user interface system will increase the speed of inventory management tasks, correct and minimize errors in the planning model, allow for REC supply and demand estimate records to be kept, provide persistent data storage, and make data management and allocation tasks more user-friendly.

Speaker: Benjamin Pfeffer

Title: Generating Cancer Images to Improve Cancer Diagnosis Accuracy

Abstract: The lack of publicly available medical data has been negatively impacting Artificial Intelligence in its ability to be used in the medical field [1]. Generative Adversarial Networks, or GANs, have been used to create similar, but novel images using real images [2]. Hence, GANs may be used to produce images that can augment a small amount of publicly available medical data in a way that leads to the improvement of Artificial Intelligence’s accuracy. Here, breast cancer tissue images and their corresponding h-scores were scraped from Stanford’s TMA database and converted into GLCMs, which were then read by GANs for each h-score to produce novel images of each score. Then, the new and generated data were used to train an Ordinal-Convolutional Neural Network (O-CNN), whose results were compared to a network that was trained without the generated data. The O-CNN whose training data consisted of the generated images as well as the base images had an increased accuracy and a decreased MSE when compared to the O-CNN whose training data consisted only of the base images. These results provide a method for improving Artificial Intelligence’s ability on smaller datasets and may help provide a slight improvement when dealing with the issue of a lack of publicly available medical data.

Speaker: Nicholas Collins

Title: Fitchburg State Baseball Team Dashboard

Abstract: Analytics has changed the world of sports for the better. Professional teams and organizations have large departments dedicated solely to analytics and statistics. Seeing the success professional organizations have had with sports analytics has led teams like Fitchburg State to try to replicate the same results. However, Fitchburg State doesn't have the budget or manpower to have an analytics team. The problem I am looking to solve is helping the baseball team get some useful outputs from their numerous CSV files. These files are hundreds of rows long and contain over 40 different columns. Using Python and a web application to deploy the results, the Fitchburg State baseball team can isolate areas where players can improve and finally analyze their data without having to look through each CSV file manually.

Speaker: Sarabjit Saini

Title: Visualization and Interaction of the Human Skeleton

Abstract: The skeleton is the internal frame of the human body and has 206 bones connected by many joints. We consider only 25 joints. The dataset has 56,880 action samples with 60 labeled actions. These action classes are divided into three major groups: 40 daily actions such as drinking, eating, and reading; 9 health-related actions such as sneezing and falling down; and 11 mutual actions such as kicking, punching, and hugging. This is a 3D skeleton in which multiple joints can be moved at the same time using offsets in three directions (X, Y, and Z). The frame index can also be changed. The coordinates of the skeleton are saved in NPY format, which can then be used as a dataset for machine learning models. The major challenge was dragging multiple joints at the same time, which I was able to do, and the skeleton also changed considerably when the frame index changed.

Speaker: Brianna Johnson

Title: Prototyping the UUV Docking Process with Machine Learning

Abstract: The docking process of an unmanned underwater vehicle (UUV) with an underwater node/station can be far from trivial due to sea conditions. Yet the process is crucial for recharging the battery and transferring data during a mission. This project focuses on learning the docking system behavior of UUV models based on available raw oscillatory datasets. The dataset is then modeled using parametric ODEs with the help of ready-to-use machine learning software. Research in neural networks, together with testing across Python and Julia packages such as TensorFlow, Keras, and Flux, is used to find the most accurate solutions to these complex ODE problems. The study will hopefully help design a better data-driven stabilizer system for the UUV models during the docking process.

Speaker: Andrew Anctil

Title: Deep Learning for Document Data Extraction and Classification

Abstract: ArmorDoc is a financial technology startup specializing in extracting data and information from financial documents. Currently, ArmorDoc is working with mortgage documents; they are leveraging different types of machine learning such as named-entity recognition, natural language processing, and optical character recognition. In this senior capstone project, I assisted ArmorDoc by building computer vision models to extract valuable information that the company previously could not obtain through its machine learning methods. These fields are marked or recorded using visual indicators such as check boxes, signature fields, and stamps. Other computer vision models improve the quality of ArmorDoc’s data and the results they return to clients by correcting unknown page orientation automatically through image classification.

Salvador Balkus

Title: Improving Natural Language Classification With Augmented Data From GPT-3

Abstract: GPT-3 is a large-scale natural language model developed by OpenAI that can perform many different tasks, including topic classification. Although billed as "few-shot learning" with only a small number of in-context examples required to teach a task, in practice the model requires examples to be either of exceptional quality or of a higher quantity than easily created by hand. To address this issue, this study teaches GPT-3 to classify whether a question is related to data science using a set of in-context examples augmented by its own generative capabilities - that is, we generate additional examples using GPT-3 itself. This study compares two classifiers: the GPT-3 Classification endpoint, and the GPT-3 Completion endpoint with optimal in-context examples chosen via genetic algorithm. We find that, while the optimized Completion endpoint achieves upwards of 80 percent validation accuracy, using the Classification endpoint with an augmented example set yields far improved accuracy on the test set of unseen examples.

Satish Uppalapati

Title: Multi-Document Summarization using Maximal Marginal Relevance and Sentence Graph Compression

Abstract: The goal of Multi-Document Summarization (MDS) is to summarize multiple long documents/articles into short sentences which cover the important aspects of these documents. There are two approaches in general for text summarization, abstractive and extractive summarization. This project demonstrates the development of an end-to-end system to independently summarize multiple documents using two methods. The first method is an extractive summarization, Maximal Marginal Relevance (MMR), which balances summary salience and redundancy. The second method is an abstractive summarization, Sentence Graph Compression which converts documents into a sentence graph and uses spectral clustering to generate multiple clusters of sentences and eventually compress each cluster to generate the final summary. A user interface is developed, where the user can enter URLs to multiple articles and choose a summarizer method. The system extracts text information from the weblinks, processes the text using the chosen summarizing method and generates the summary. In addition, sentiment analysis is also done on each of the articles and is displayed to the user.
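
As an illustration of the relevance/redundancy trade-off behind Maximal Marginal Relevance, here is a minimal sketch. It is not the project's implementation: the bag-of-words cosine similarity, the use of the whole input as the "query", and the names `mmr_select` and `lam` are assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_select(sentences, k=2, lam=0.5):
    """Greedily pick k sentences, trading relevance against redundancy.

    lam=1.0 ranks by relevance only; smaller values penalize sentences
    that resemble ones already selected.
    """
    vectors = [bag_of_words(s) for s in sentences]
    centroid = bag_of_words(" ".join(sentences))  # assumption: stands in for the query
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vectors[i], centroid)
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With `lam=0.5`, a sentence that duplicates one already chosen is passed over in favor of a less relevant but novel one; with `lam=1.0` the duplicate is chosen.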

Sai Surya Nattuva

Title: Human Activity Classification from Accelerometer Data

Abstract: Human activity can be observed and measured by employing different sensors on different parts of the body. An increase in the usage of wearable devices by people has led to increased research in Human Activity monitoring and analysis. This analysis has helped develop many technologies like fall detection, irregular heart rhythms, walking steadiness, posture, etc. This project focuses on classifying the activity performed by a human from time-series data generated from the user's phone's accelerometer. The time-series data is first converted into 2-D images and then fed to a CNN to learn five different activities from 4 features of the data. The primary focus of the project is to leverage the predicting power of CNNs on time-series data with a small number of training samples and input channels.

Speaker: Lauren Fletcher

Title: Database and User Interface Development for Study of Atlantic Cod Fish

Abstract: Alison Frey, a graduate student at the UMass Dartmouth School for Marine Science and Technology, and Steve Cadrin, a professor at the same school, have been involved in a research study on defining the spatiotemporal distribution of spawning, residence times, movement patterns, habitat utilization during spawning, and demographics of Atlantic cod on Cox Ledge. There is interest in researching cod in this area because of potential offshore wind turbine development there. Frey and Cadrin’s research entails analyzing how the development of these wind turbines could affect cod. From 2019 to the present, Frey has tagged 88 fish and deployed 10+ Vemco VR2W and VR2TX receivers on Cox Ledge. The receivers hold data on when a tagged cod is heard within the receivers’ sonar range. Frey and Cadrin are looking to safely store and analyze the receiver data. With HTML tools, I have developed a user interface for Frey to upload the data into a MySQL database. The data in the database is automatically used to update a data visualization, which shows the user a time series bar chart of when the fish were heard by the receivers.

Speaker: Michael Delisle

Title: Sentiment Analysis of Quarterly Earnings Reports to Predict Subsequent Change in Stock Prices

Abstract: Every publicly traded company is required by law to submit a 10-Q to the Securities and Exchange Commission (SEC) at the end of each of its first three fiscal quarters and a 10-K at the end of its fiscal year. These reports are more commonly referred to as Earnings Reports. Most of these companies also hold an Earnings Call immediately after the report is filed, which usually includes a summary of the company’s performance and future goals. The goal of this project was to use sentiment analysis on these Earnings Calls to predict the change in a company’s stock price over the following days. To do this, we used RStudio to build a predictive model. A random selection of companies within the S&P 500 was used, and their earnings calls for Q3 and Q4 2021 were web scraped and organized. Sentiment analysis was run four times for each set, each time using a different lexicon. A model was created for each lexicon dataset and applied to Q1 2022 data to see which model returned the highest accuracy. A multiple linear regression on the NRC lexicon data provided the best results, with the other lexicons falling only a few points short. However, our results show that the sentiment analysis prediction models are not strong indicators of the magnitude of change in a stock price, although they can be used to determine whether a stock price will trend positively or negatively over the following days.

2021

Schedule

Date: May 3rd (Monday) from 10am to 2pm

10:00: Using NLP to Extract Salient User Reviews. Siyi Ge (DSC master's project; Advisor: Prof. Donghui Yan, Mathematics)
10:15: Measuring patient perceived quality of care in New England’s Hospitals using twitter data: Data collection, cleaning and classifying. Alekhya Achanta (DSC master's project; Advisor: Prof. Keivan Sadeghzadeh, Decision & Information Sciences)
10:30: Measuring patient perceived quality of care in New England’s Hospitals using twitter data: Calculating sentiment score. Visesh Kumar Jaiswal (DSC master's project; Advisor: Prof. Keivan Sadeghzadeh, Decision & Information Sciences)
10:45: Find most optimal cost for bridge maintenance. Laukik Upadhye (DSC master's project; Advisor: Prof. Keivan Sadeghzadeh, Decision & Information Sciences)
11:00: Multi Agent Actor Critic Model with Attention. Anushree Chopde (DSC master's project; Advisor: Prof. Ming Shao, Computer & Information Science)
11:15: An Exploration of Long/Short Term Memory Networks. Andrea Haines (DSC master's project; Advisor: Prof. Iren Valova, Computer & Information Science)
11:30: Applications of Statistical Inference Techniques to Neural Networks. Fatemeh Sadjadpour (DSC master's project; Advisor: Prof. Lance Fiondella, Electrical and Computer Engineering, and Prof. Arghavan Louhghalam, Civil Engineering)
11:45: LSTM networks for Named Entity Recognition. Nikita Seleznev (DSC master's project; Advisor: Prof. Iren Valova, Computer & Information Science)
12:00: Numerical investigation of certain physical quantities in terms of surface diffusion coefficients for a liquid thin film coarsening model. Bala Akshay Reddy Yeruva (DSC master's project; Advisor: Prof. Cheng Wang, Mathematics)
12:15: Detecting Medical Errors in the Cancer Screening Process: An In-Depth Review of Lung Cancer Screening and At-Risk Qualifications. Brianna Johnson (DSC senior; external partner: CCB Faculty Prof. Keivan Sadeghzadeh)
12:27: Predicting Disease in Leaves Using Machine Learning. Hunter Canning (DSC senior)
12:39: Cluster Analysis of COVID-19 Database in USA with Data Visualizations. Ziyu Xia (DSC senior)
12:51: Optical Character Recognition and Correction for Images of Digital Displays. Brian Cornet (DSC senior)
01:03: How Much Does Salary Factor into Performance in the MLB? Jimmy Mcrae (DSC senior)
01:15: An Introduction to Handwritten Digit Recognition Algorithms. Marco Sousa (DSC senior)
01:27: DataMatch: A website for matching datasets to machine learning algorithms. Cachelle Johnson-Lewis (DSC senior)
01:39: Shopping Behavior Analysis with Data Analytics Tools. Nicholas Cen (DSC senior)

Students who graduated in the Fall or Winter have "On-demand talk" (recordings are available)

Driver Drowsiness Detection System Using Visual Behavior. Anirudh Reddy Suram (DSC master's project; Advisor: Prof. Donghui Yan, Mathematics). Talk Location: DSC OneDrive (private)
Face Mask Detection and Classification. Shravani Lagishetty (DSC master's project; Advisor: Prof. Ming Shao, Computer and Information Science). Talk Location: DSC OneDrive (private)
True Naming of Principle Components. Dwyer Deighan (DSC master's project; Prof. Gokhan Kul, Computer and Information Science). Talk given on Nov 20, 2020 by zoom. (not recorded)
Implementation of a product formulation and test result database for Nye Lubricants. Richard Raithel (DSC master's thesis; Advisor: Prof. Alfa Heryudono, Mathematics). Talk Location: https://umassd.mediaspace.kaltura.com/media/RichardRaithel_MC2DSC550F2020/1_qkqonjlz
Reproducible Notebooks: A Study in the Replicability and Reproducibility of Computational Notebooks. Colin Brown (DSC master's thesis; Advisor: Prof. Scott Field, Mathematics). Talk given on January 5th, 2021 by zoom. (not recorded)
3D map visualization of the US presidential dataset from 2008 to 2020. Shreyas Nagabhushan. Recorded talk on 8/14/2021. Talk Location: DSC OneDrive (private)

Abstracts

Speaker: Alekhya Achanta

Title: Measuring patient perceived quality of care in New England’s Hospitals using twitter data: Data collection, cleaning and classifying

Abstract: To measure patient-perceived quality of care in real time, instead of using traditional or survey-based approaches, we use the Twitter data of each hospital in the New England area. Tweets directed to these hospitals, collected over a period of time, were classified as relating to patient experience using a machine learning approach. A sentiment score was then calculated for these tweets using natural language processing (NLP). Finally, patient sentiment is compared with two established quality measures: readmission rate and HCAHPS rating. The goal is to find whether Twitter sentiment has any correlation with these two quality measures. In this first part of the talk, I will discuss data collection, cleaning and classifying. I will talk about the number of hospitals considered and the tweets collected for them over a period of time. Data cleaning techniques and final data preparation will be discussed in detail, along with how a machine learning classifier helps to classify tweets based on patient experience.

Speaker: Visesh Jaiswal

Title: Measuring patient perceived quality of care in New England’s Hospitals using twitter data: Calculating sentiment score

Abstract: To measure patient-perceived quality of care in real time, instead of using traditional or survey-based approaches, we use the Twitter data of each hospital in the New England area. Tweets directed to these hospitals, collected over a period of time, were classified as relating to patient experience using a machine learning approach. A sentiment score was then calculated for these tweets using natural language processing (NLP). Finally, patient sentiment is compared with two established quality measures: readmission rate and HCAHPS rating. The goal is to find whether Twitter sentiment has any correlation with these two quality measures. The first part of the talk covers data collection, cleaning and classifying: the number of hospitals considered, the tweets collected for them over a period of time, the data cleaning techniques and final data preparation, and how a machine learning classifier helps to classify tweets based on patient experience. In this second part, I will discuss calculating the sentiment score of each hospital and comparing it with the two established quality measures, readmission rate and HCAHPS rating, to determine the correlation. The Twitter characteristics of each hospital are also compared with these quality measures to check whether there is even a weak association. The ultimate goal is to find whether Twitter sentiment has any correlation with these two quality measures.

Speaker: Laukik Upadhye

Title: Find most optimal cost for bridge maintenance

Abstract: This project started from a problem involving multiple bridges, each with a different maintenance cost depending on the pair of years in which it is maintained. Every bridge has 10 to 15 data points, each a combination of years and a cost. The task is to consider all bridges such that no years overlap (only one bridge is maintained in a single year) and to find the lowest cost of doing so. An important part of this task is understanding the time complexity of checking all combinations. The problem is close to NP-hard unless it is approached with graph theory. I used years as nodes and the data points as edges between nodes. To check all combinations, I used recursion for graph traversal. The solution is developed in MATLAB.

Speaker: Ziyu (Damon) Xia

Title: Cluster Analysis of COVID-19 Database in USA with Data Visualizations

Abstract: Since March 2020, coronavirus disease 2019 (COVID-19) has ravaged the entire United States. Each day, many people are infected and die from illness caused by the coronavirus. In this report, we analyze the transmission trend and severity of the coronavirus in different states of the USA using the all-states-history dataset. Hierarchical agglomerative clustering is used to analyze the COVID-19 situation in different states and divide them into several clusters. Map and line chart visualizations are used to aid understanding of the cluster analysis.

Speaker: Bala Akshay Reddy Yeruva

Title: Numerical investigation of certain physical quantities in terms of surface diffusion coefficients for a liquid thin film coarsening model

Abstract: A droplet liquid film model is taken into consideration. This model can be formulated as a mass-conservative gradient flow associated with a certain free energy functional, with a singular Lennard-Jones energy potential involved. The gradient flow preserves the physical dissipation property, and the steady state solution corresponds to a global minimum of the given energy functional. In particular, the maximum height of the minimum energy structure and the saturation time scale, which measures the time needed to reach a minimum energy structure from randomly perturbed initial data, are of great physical interest. In this project, we collect the coarsening process data, obtained by a recently developed numerical scheme for the physical model, and make a detailed analysis afterward. The numerical data for a sequence of surface diffusion coefficient values are analyzed, and the scaling law between the physical quantities in terms of surface diffusion parameters is numerically obtained.

Speaker: Brian Cornet

Title: Optical Character Recognition and Correction for Images of Digital Displays

Abstract: Data structures such as tables and lists are commonly presented through graphical user interfaces (GUIs). In many cases, accessing the data directly is inconvenient or prohibited by the platform itself (Apple iOS, Nintendo consoles, etc.) or unavailable locally as with video streaming. Notably, the data is not private to the user, but collection must be performed by hand. This project introduces a program for Optical Character Recognition (OCR) using Tesseract 4 and Python to automatically convert data displayed in images or videos into discrete structures. Image transformations such as binarization, scaling, skewing, edge detection, and denoising enable compatibility with elaborate GUIs and photographs of digital displays. Predictive methods using Levenshtein and character-shape distances provide text correction to significantly improve output accuracy. Error analysis is also conducted for individual correction methods using several sample images, dictionaries, and fonts.
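
To illustrate how a Levenshtein distance can drive text correction of OCR output, here is a minimal sketch. It is not the project's code: the function names, the example dictionary, and the `max_distance` cutoff are assumptions made for this example, and the character-shape distance mentioned in the abstract is not shown.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def correct(token, dictionary, max_distance=2):
    """Return the closest dictionary word, or the token itself if none is close.

    Assumes `dictionary` is a non-empty list of expected words.
    """
    best = min(dictionary, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token
```

For example, an OCR reading such as `sc0re` would be mapped to `score` when the dictionary contains that word, while a token far from every dictionary entry is left unchanged.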

Speaker: Hunter Canning

Title: Predicting Disease in Leaves Using Machine Learning

Abstract: This project is about using machine learning to classify many images of leaves according to which diseases they have. The primary reason I wanted to do this specific project was that I wanted to give farmers a tool to deal with pests and diseases killing plants. This project uses a dataset downloaded from Kaggle and the Keras package for image classification.

Speaker: Cachelle Johnson

Title: DataMatch: A website for matching datasets to machine learning algorithms

Abstract: Education is the best resource to have within reach in order to understand technology and discover new methods to help others. Data Science is a field that draws on many disciplines, such as math, computer science, and business. In 2020, the progress of technology changed dramatically. There are now computers, phones, and new technology coming out each year, and the world is creating new insights every day. Students at the University of Massachusetts Dartmouth use the disciplines and concepts taught in the Data Science major. For this Data Science Capstone, I created a website called DataMatch, where users can import an Excel spreadsheet or text file as a dataset and are shown the resulting model, what it is, and how to use it. The motivation of this project was to create educational resources for people to learn more about machine learning and how to use it to build models. I used HTML, CSS, Bootstrap, and JavaScript to build the front end, and Flask as the framework of the website. I also used Python and Python libraries (NumPy, pandas, scikit-learn, etc.) to build the back end. The website will include information on models such as Linear Regression, Logistic Regression, Binary Classification, Multi-Classification, K-means Clustering, Mean-Shift, Random Forest and Density-based spatial clustering of applications with noise (DBSCAN).

Speaker: Jimmy Mcrae

Title: How Much Does Salary Factor into Performance in the MLB?

Abstract: The MLB has long been known as a league dominated by teams with higher payrolls than everyone else, because there is no salary cap. Recently, though, many teams with smaller payrolls have been winning and making it deep into the playoffs. I have been investigating this trend by looking at the salaries of all the players and comparing stats between players with small salaries and players with big salaries. There are many key stat comparisons and findings that I will explain in the talk.

Speaker: Andrea Haines

Title: An Exploration of LSTMs (Long/Short Term Memory Networks)

Abstract: We conducted an exploration of Long Short-Term Memory Networks (LSTMs) using Python. This exploration included learning the background and basics of LSTMs, as well as how to train them. We implemented an LSTM as well as a Gated Recurrent Unit (GRU) on two different Natural Language Processing (NLP) datasets to compare performance, and investigated different model parameters and setups to achieve the best results on both datasets.

Speaker: Siyi Ge

Title: Using NLP to Extract Salient User Reviews

Abstract: Reviews efficiently reflect customers’ preferences and complaints about a business, but because of information overload it becomes crucial to extract useful information from large volumes of reviews. To deal with this situation, I will describe how to use NLP to preprocess a restaurant review dataset, build predictive models that predict the sentiment of each review, and then compare the accuracies of models built with different algorithms using a confusion matrix. Lastly, I will talk about scoring sentences by calculating TF-IDF and extracting the top 10 sentences to form a summary.
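
The TF-IDF sentence scoring step can be sketched as follows. This is an illustration under stated assumptions rather than the author's method: each sentence is treated as its own document, a sentence's score is the sum of the TF-IDF weights of its words, and the names `tfidf_sentence_scores` and `top_sentences` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of TF-IDF weights of its words."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: the number of sentences each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        scores.append(sum((count / total) * math.log(n / df[word])
                          for word, count in tf.items()))
    return scores


def top_sentences(sentences, k=10):
    """Return the k highest-scoring sentences in their original order."""
    scores = tfidf_sentence_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Words that occur in every sentence get a weight of zero, so sentences made up of common words rank below sentences with distinctive vocabulary.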

Speaker: Nikita Seleznev

Title: LSTM networks for Named Entity Recognition

Abstract: A named entity is a real-world object that can be denoted with a proper name, such as a person, place, organization, or product. This concept is commonly used in the fields of data mining and natural language processing. Named Entity Recognition (NER) is the task of recognizing named entities in text. It involves processing a text and identifying certain occurrences of words or expressions as belonging to specific categories of named entities. The NER task can serve as an important preprocessing step for other tasks such as information extraction, information retrieval and other text processing applications. In the present study we examine the feasibility of applying a special type of recurrent neural network, the Long Short-Term Memory (LSTM) network, to NER. We investigate the influence of the network architecture and LSTM cell type on model performance using two NER datasets. We carry out hyperparameter tuning and present results for the optimized NER model.

Speaker: Nicholas Cen

Title: Shopping Behavior Analysis with Data Analytics Tools

Abstract: This project discusses how data analytics tools can be used to determine the shopping behavior of customers. I chose this topic because I am interested in what stores are doing to adapt to online shopping, which of their efforts are effective, and which areas can be improved upon. This project uses a dataset from Kaggle.

Speaker: Brianna Johnson

Title: Detecting Medical Errors in the Cancer Screening Process: An In-Depth Review of Lung Cancer Screening and At-Risk Qualifications

Abstract: Globally, 1.3% of all individuals will receive a cancer diagnosis at some point in their lives. That number jumps to 5.5% in the United States alone, accounting for over 600,000 lives lost each year. Lung cancer specifically is the leading cause of death among cancers, causing more deaths than colon, breast, and prostate cancers combined. Current screening protocols only test smokers and former smokers who have quit within the past 15 years. However, nearly 20% of all lung cancer cases are found in those who have never smoked. This investigation looks at the environmental and medical history factors that could lead to a lung cancer diagnosis when smoking is not the cause. Machine learning classification methods such as Logistic Regression and Classification and Regression Trees will be used to pinpoint which nonsmokers should be screened. These results will then be tested to analyze the predicted lifespan achieved through the resulting screening protocols.

Speaker: Marco Sousa

Title: An Introduction to Handwritten Digit Recognition Algorithms

Abstract: Classification of handwritten digits is a standard problem in pattern recognition. This project provides a comprehensive introduction to several different classification algorithms including K-Means, SVD, Tangent Distance, K-Nearest Neighbors, Neural Networks, and Convolutional Neural Networks, alongside model ensemble methods. This presentation will summarize some key insights in applying such algorithms for classification on the Modified NIST (MNIST) and US Postal Service Zip Code Database. The applications of such techniques can apply directly to handwritten digits, such as automating the reading of US postal zip codes on letters, or can be generalized to handle broader machine learning classification tasks.
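
Of the algorithms listed, K-Nearest Neighbors is the simplest to show in a few lines. The sketch below is an illustration only, not the project's code: it assumes each digit image has been flattened into a tuple of pixel values, and the name `knn_predict` is hypothetical.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbors.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

On real digit data each point would have one entry per pixel (784 for a 28x28 MNIST image), and `k` would typically be chosen by validation.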

Speaker: Fatemeh Sadjadpour

Title: Applications of Statistical Inference Techniques to Neural Networks

Abstract: This research project describes a framework to obtain test statistics for a neural network. We derive the Fisher information matrix of a neural network, which can be used to obtain test statistics that enable statistical hypothesis testing. Experimental results are explored on a binary classification problem. The paper concludes with how both statistical hypothesis testing techniques and model selection techniques can help algorithm designers quantify and respond to weight uncertainty and improve interpretability in neural networks.

Speaker: Colin Brown

Title: Reproducible Notebooks: A study in the Replicability and Reproducibility of Computational Notebooks

Abstract: In the computational space, reproducibility has become a concern for many, since verification and reproducibility of results are foundational parts of the scientific process. Technologies such as Jupyter Notebook have become increasingly common as a way to make it easier to create reproducible results and reproducible code and to share those results rapidly. The notebook framework isn't perfect, however: notebooks can be shared without being entirely reproducible because of various constraints within the current Jupyter Notebook infrastructure, as well as bad practices enabled by the technology. This thesis explores the potential shortfalls and pitfalls of reproducibility in this environment and aims to address the concerns that arise with Jupyter Notebooks. It examines how these issues can be mitigated through better coding practices and through additional tools that capture some of the existing issues within this environment.

2020

Schedule

2:30: Diagnostic Analysis in Classification. Guancheng Zhou (DSC masters; Advisor: Prof. Donghui Yan, Mathematics)
2:45: Building a used car prediction calculator. Dunick Voltaire (DSC senior).
3:00: Automated Modeling Pipeline. Shristi Bhat (DSC masters; Advisor: Prof. Ming Shao, CIS)
3:15: Game Recommendation System for Steam. Yi Ming Huang (DSC masters; Advisor: Prof. Ming Shao, CIS)
3:30: Improving Daily Business Operations with Tableau. Jaeliana Ortega (DSC senior; Client: Chris Lester from Dell)
3:45: Data Driven Tools for Political Science in Data Acquisition and Visualization. Ian Russell (DSC senior; Client: Prof. Shannon Jenkins, Political Science)
4:00: Predicting the Market Value of Soccer Players. Justin Quinlan (DSC senior).
4:15: COVID-19 infection detection on a mobile application using x-ray images. Apoorva Ramesh (DSC masters; Advisor: Prof. Ming Shao, CIS)
4:30: Understanding the Ecology of Georges' Bank: A Visualization of Bayesian Networks. Aaron LeBlanc (DSC senior; Client: Robert Wildermuth, SMAST)
4:45: Optimizing User Behavior with Large Scale Multi-Agent Reinforcement Learning. Jacob Zuliani (DSC senior; Client: Mike Audi from Blustream)
5:00: Classifying neutron star signals from gravitational Waves. Nate Cady (DSC senior; Client: Prof. Scott Field)
5:15: Modeling air quality in Mainland China. Xia Tian (DSC senior)

Abstracts

Speaker: Guancheng Zhou

Title: Diagnostic Analysis in Classification

Abstract: Diagnostic analysis is a form of advanced analysis that examines data or content to answer the question “Why did it happen?” and is characterized by techniques such as drill-down, data discovery, data mining and correlations. Diagnostic analysis takes a deeper look at data to attempt to understand the causes of events and behaviors. In this talk, I will discuss two approaches we used in diagnostic analysis, which has not been applied to classification before. One is changing labels in the training sample; the other is deleting points from the training sample. The output of a study is viewed as the result of the error of misclassification. We believe diagnostic analysis could help us better understand the nature of the problem and potentially suggest directions for improvement. We also used visualizations such as scatter plots and PCA plots to gain a better visual understanding.

Speaker: Dunick Voltaire

Title: Building a used car prediction calculator

Abstract: Buying a car is one of the biggest financial decisions a person can make after buying a house. The goal of this project is to predict the price of used vehicles whose styles fall under SUV, Sedan, Wagon, or Hatchback. This led me to ask two questions: what are the most important variables of a vehicle for predicting its price, and what regression model would produce the best price estimate based on the information given by a user? The Shiny application that you will use during the presentation will display the price of a used vehicle once all the information is filled out.

Speaker: Jacob Zuliani

Title: Optimizing User Behavior with Large Scale Multi-Agent Reinforcement Learning

Abstract: Since the start of Blustream, helping people take care of the things they love has been one of our core values. We now provide much more than that, but this core value has never left and is still an important part of who we are. In many industries, increased care means reduced churn and increased consumer lifetime value. This is what care means for one of our earlier partners, Taylor Guitars. Through our partnership, Blustream enabled the release of Taylor Sense, a smart guitar monitoring system that sends users a phone notification when their instrument is in danger of being damaged due to extreme humidity, temperature, or impact conditions. When someone is notified, it is their responsibility to take action to save their instrument, but users are often busy or preoccupied and don't always respond immediately to a notification. In this project I show that it's possible to model this behavior and use it to help users better protect the things they love. I structure this as a large-scale multi-agent reinforcement learning problem where every instrument has its own reinforcement learning agent. Agents are responsible for constantly monitoring their instrument's environment and determining when to notify the user to maximize the safety of the item they are protecting. I show that agents are able to learn the habits of their instrument's owner and modify their policy as needed to cater to these habits. If deployed, this learning system has significant potential to increase consumer lifetime value, not only for Taylor but for all Blustream partners.

Speaker: Yi Ming Huang

Title: Game Recommendation System for Steam

Abstract: Recommendation systems have the potential to change the way websites communicate with users and to allow companies to maximize their profits based on the information they can gather on each customer's preferences and purchases. This project begins by scraping data through the Steam API and cleaning it, then implements different algorithms, including content-based filtering, item-based collaborative filtering and ALS (alternating least squares), to suggest top-k games to a user, and finally creates a game recommendation web application for users.
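
As an illustration of item-based collaborative filtering for top-k suggestions, here is a minimal sketch. It is not the project's implementation: the `ratings` structure, the cosine similarity between per-game score columns, and the function name `recommend` are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(ratings, user, k=2):
    """Rank games the user has not played by similarity to games they have.

    `ratings` maps user -> {game: score}; a missing game counts as 0.
    """
    users = sorted(ratings)
    games = sorted({g for scores in ratings.values() for g in scores})
    # One column vector per game, with one entry per user.
    column = {g: [ratings[u].get(g, 0.0) for u in users] for g in games}
    played = ratings[user]
    scores = {}
    for g in games:
        if g in played:
            continue
        # Weight each played game's similarity by the user's own score for it.
        scores[g] = sum(cosine(column[g], column[p]) * s for p, s in played.items())
    return sorted(scores, key=lambda g: scores[g], reverse=True)[:k]
```

The scores could be playtime, ownership flags, or explicit ratings; which signal the project used is not stated in the abstract.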

Speaker: Jaeliana Ortega

Title: Improving Daily Business Operations with Tableau

Abstract: In the Services department of a business, customer satisfaction is key. How do businesses get high satisfaction ratings? Efficiency, speed, and availability are all important factors in earning high customer satisfaction ratings. For my project, I generated multiple dashboards in Tableau that can be used to improve daily operations in the services department. These dashboards give the management team tools to manage their employees and workload and to improve their performance, which in turn will increase customer satisfaction.

Author: Ian Russell

Title: Data Driven Tools for Political Science in Data Acquisition and Visualization

Abstract: Data is an essential modern resource. It is ubiquitous across all empirical fields of study. In Political Science there is a rising need for appropriate tools to enhance research capabilities in a data-driven world. This project examines many available tools and possibilities for enhancing already established concepts such as policy diffusion and legislative dynamics. The first element of the project creates web scraping tools using Python to acquire FOIA contact data of political staffers for survey research. To supplement this, examples of exploratory data analysis and visualization are provided. Not all aspects of the survey research reached completion due to COVID-19 and new societal circumstances. This research is set up to be continued at a future date.

Speaker: Justin Quinlan

Title: Predicting the Market Value of Soccer Players

Abstract: On August 3rd, 2017, a new record transfer fee was set in the soccer world: the transfer of Neymar from Barcelona to PSG. The transfer fee was a staggering 222 million Euros ($241,402,800), which is more than double the previous record of 105 million Euros set back in 2016. This type of money can be used to finance and improve many different areas of a team. This made me start to wonder if it would be possible to predict a soccer player's market value based on their season stats. This could be used to help teams with lower budgets grow and have a better chance to compete against those with seemingly unlimited funds.

Speaker: Apoorva Ramesh

Title: COVID-19 infection detection on a mobile application using x-ray images

Abstract: COVID-19 is a respiratory disease caused by infection with a new form of coronavirus, SARS-CoV-2. With the COVID-19 pandemic spreading so fast, it helps to speed up diagnosis and treatment of the disease and to make testing readily available without relying heavily on expensive and time-consuming tag devices. It takes roughly 14-24 days for the symptoms to show up and then around 2-4 days for the test to be completed. This automatic classifier on the mobile application uses X-ray images of patients to detect the presence of coronavirus infection. The dataset used here was collected partly from Johns Hopkins University data on GitHub (100 COVID X-ray images) and partly from a Kaggle competition dataset (100 non-COVID X-ray images). The system uses a Convolutional Neural Network (CNN: MobileNet with the last few layers retrained) to do the classification, so the classification model is based on transfer learning. The deep learning models are built using the Keras and TensorFlow libraries in Python. The intent of the project is to perform COVID-19 image classification using different models, compare their accuracies for model performance analysis, and make the features available on a mobile application. The key features of the mobile application would be the ability to use scanned X-ray images to detect the presence of coronavirus infection, to easily update and deploy data and models, etc.

Speaker: Shristi Bhat

Title: Automated Modeling Pipeline

Abstract: Out of all the core steps, data scientists typically spend 70% of their time and resources on feature engineering. If done correctly, it can increase the predictive power of machine learning algorithms by a considerable amount. This is the motivation behind my Capstone Project, which seeks to dig deep into this realm. The primary aim is to apply Deep Feature Synthesis (DFS) to a multi-table relational dataset and create a pipeline that can also be applied to other data science problems, significantly increasing efficiency. The ability of DFS to create new features by applying feature primitives to the entity sets and stacking them on each other gives it the power to create deep features. Additionally, results will include a comparison with a baseline model, evaluated with a few classical machine learning models. Successful feature engineering benefits model performance and accuracy, and automating this process will not only reduce the resources allocated by the client but also improve the performance of their models.
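
To make the idea of stacking feature primitives concrete, here is a minimal sketch in plain Python. It does not use the project's pipeline or any DFS library; the tables, column names, and the `aggregate` helper are hypothetical, invented for this illustration. It builds a depth-2 feature, the mean over a customer's sessions of the summed transaction amount per session.

```python
from statistics import mean

# Hypothetical toy relational data: customers -> sessions -> transactions.
sessions = [
    {"session_id": 1, "customer_id": "a"},
    {"session_id": 2, "customer_id": "a"},
    {"session_id": 3, "customer_id": "b"},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 5.0},
    {"session_id": 2, "amount": 20.0},
    {"session_id": 3, "amount": 7.0},
]


def aggregate(child_rows, parent_key, value_column, primitive):
    """Group child rows by their parent key and apply an aggregation primitive."""
    groups = {}
    for row in child_rows:
        groups.setdefault(row[parent_key], []).append(row[value_column])
    return {parent: primitive(values) for parent, values in groups.items()}


# Depth 1: SUM(transactions.amount) for each session.
session_total = aggregate(transactions, "session_id", "amount", sum)

# Attach the depth-1 feature to the sessions table as a new column.
enriched_sessions = [
    dict(row, total=session_total.get(row["session_id"], 0.0)) for row in sessions
]

# Depth 2: MEAN(sessions.SUM(transactions.amount)) for each customer.
customer_feature = aggregate(enriched_sessions, "customer_id", "total", mean)

print(customer_feature)
```

The same `aggregate` call is reused at both levels, which is the stacking idea: the output of one primitive becomes an ordinary column that the next primitive can consume.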

Speaker: Nate Cady

Title: Classifying neutron star signals from gravitational waves

Abstract: Gravitational waves detected by the LIGO observatory come in a variety of different forms. When searching through them, it is important for classifiers to be trained on many different types of inputs so that they can detect as many signals as possible, as accurately as possible. Binary neutron stars create much longer gravitational-wave signals than black holes do and are thus much harder to initially detect and classify. In addition, the signals are generally a lot fainter relative to the noise, making them more difficult to pull out from the rest. The goal is to expand on an existing system that classifies black holes from noise and add the ability to classify fainter binary neutron star signals. This will result in a more powerful model that can get more out of its input.

Speaker: Aaron LeBlanc

Title: Understanding the Ecology of Georges Bank: A Visualization of Bayesian Networks

Abstract: Developing a Bayesian network model is a common practice in environmental and resource management. These models can be used to understand the complexities of these environmental systems and aid in decision making. However, building and interpreting these models is no easy task. Utilizing Shiny and Netica, I have built a Shiny application with the intent of providing additional means to help understand and visualize the features used to create the model of the Georges Bank system, as well as streamline the model tweaking process.

Speaker: Xia Tian

Title: Modeling air quality in Mainland China

Abstract: TBA
