data science thesis

Research Topics & Ideas: Data Science

50 Topic Ideas To Kickstart Your Research Project

If you’re just starting out exploring data science-related topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research by providing a hearty list of data science and analytics-related research ideas, including examples from recent studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. The topic ideas provided here are intentionally broad and generic, so you will need to develop them further. Nevertheless, they should inspire some ideas for your project.

To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, and a viable plan to fill that gap. If this sounds foreign to you, check out our free research topic webinar that explores how to find and refine a high-quality research topic, from scratch. Alternatively, consider our 1-on-1 coaching service.

Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.

Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.

Recent Data Science-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic, they are fairly generic and non-specific. So, it helps to look at actual studies in the data science and analytics space to see how this all comes together in practice.

Below, we’ve included a selection of recent studies to help refine your thinking. These are actual studies, so they can provide some useful insight as to what a research topic looks like in practice.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • Machine Learning for Non-Majors: A White Box Approach (Mike & Hazzan, 2022)
  • Components of Data Science and Its Applications (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)

As you can see, these research topics are a lot more focused than the generic topic ideas we presented earlier. To develop a high-quality research topic, you’ll need to get laser-focused on a specific context with specific variables of interest.

Get 1-On-1 Help

If you’re still unsure about how to find a quality research topic, check out our Private Coaching service, the perfect starting point for developing a unique, well-justified research topic.

10 Best Research and Thesis Topic Ideas for Data Science in 2022

These research and thesis topics for data science will ensure more knowledge and skills for both students and scholars

As businesses seek to employ data to boost digital and industrial transformation, companies across the globe are looking for skilled data professionals who can turn insights extracted from data into higher business productivity and help reach company objectives. Data science has recently become a lucrative career option, and universities and institutes now offer various data science and big data courses to prepare students for success in the tech industry. One of the best ways to strengthen a resume is to take on data science projects. In this article, we list 10 research and thesis topic ideas to take up as data science projects in 2022.

  • Handling practical video analytics in a distributed cloud: With increased dependency on the internet, sharing videos has become a mode of data and information exchange. The Internet of Things (IoT), telecom infrastructure, and operators play a huge role in generating insights from video analytics. Several questions remain open here, such as how efficient the existing analytics systems are and what will change once real-time analytics are integrated.
  • Smart healthcare systems using big data analytics: Big data analytics plays a significant role in making healthcare more efficient, accessible, and cost-effective. It enhances the operational efficiency of smart healthcare providers by providing real-time analytics, and it enhances the capabilities of intelligent systems by using short-span data-driven insights. There are still distinct challenges that are yet to be addressed in this field.
  • Identifying fake news using real-time analytics: The circulation of fake news has become a pressing issue in the modern era. The data gathered from social media networks may look legitimate, but sometimes it is not. The sources that provide the data are unauthenticated most of the time, which makes this a crucial issue to address.
  • Secure federated learning with real-world applications: Federated learning is a technique that trains an algorithm across multiple decentralized edge devices and servers. This technique can be adopted to build models locally, but whether it can be deployed at scale, across multiple platforms and with high-level security, is still unclear.
  • Big data analytics and its impact on marketing strategy: The advent of data science and big data analytics has entirely redefined the marketing industry. It has helped enterprises by offering valuable insights into their existing and future customers. However, issues such as surplus data, integrating complex data into customers' journeys, and complete data privacy remain largely unexplored and need immediate attention.
  • Impact of big data on business decision-making: Recent studies indicate that big data has transformed the way managers and business leaders make critical decisions concerning the growth and development of the business. It allows them to access objective data and analyse the market environment, enabling companies to adapt rapidly and make decisions faster. Working on this topic will help students understand the present market and business conditions and help them analyse new solutions.
  • Implementing big data to understand consumer behaviour: In understanding consumer behaviour, big data is used to analyse the data points depicting a consumer's journey after buying a product. Data gives a clearer picture of specific scenarios. This topic will help students understand the problems that businesses face in utilizing these insights and develop new strategies to generate more ROI in the future.
  • Applications of big data to predict future demand and forecasting: Predictive analytics in data science has emerged as an integral part of decision-making and demand forecasting. Working on this topic will enable students to determine the significance of high-quality historical data analysis and the factors that drive higher consumer demand.
  • The importance of data exploration over data analysis: Exploration enables a deeper understanding of the dataset, making it easier to navigate and use the data later. Intelligent analysts must understand the differences between data exploration and analysis and use each according to specific needs to fulfill organizational requirements.
  • Data science and software engineering: Software engineering and development are a major part of data science. Skilled data professionals should learn and explore the various technical and software skills needed for performing critical AI and big data tasks.
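
For readers new to the federated learning topic in the list above, the following is a minimal sketch of one common scheme, federated averaging: each client fits a model on its own private data and only the model weights are sent to a server, which averages them. Everything here is an illustrative assumption rather than anything taken from the article: the toy linear model y = w*x + b, the synthetic client data, and the function names local_update, federated_average and run_fedavg are all hypothetical. The security aspects the topic raises (secure aggregation, encryption, privacy guarantees) are not covered.

```python
import random


def local_update(weights, data, lr=0.5, epochs=5):
    """Run a few gradient-descent epochs for y = w*x + b on one client's private data."""
    w, b = weights
    for _ in range(epochs):
        # Mean-squared-error gradients, both computed before either weight is updated.
        grad_w = sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by each client's sample count."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def run_fedavg(clients, rounds=50):
    """Alternate local training and server averaging; raw data never leaves a client."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights


def make_client(n, seed):
    """Hypothetical synthetic client: noisy samples of y = 2x + 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.01)) for x in xs]


clients = [make_client(20, 1), make_client(50, 2), make_client(30, 3)]
w, b = run_fedavg(clients)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Weighting the average by sample count means a client with more data has proportionally more influence on the global model. Real deployments would also have to handle clients that drop out, clients whose data follow different distributions, and protection of the transmitted weights, which is where the open research questions lie.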

data science Recently Published Documents

Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia

Documentation Matters: Human-Centered AI System to Assist Data Science Code Documentation in Computational Notebooks

Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations. Inspired by human documentation practices learned from 80 highly-voted Kaggle notebooks, we design and implement Themisto, an automated documentation generation system to explore how human-centered AI systems can support human data scientists in the machine learning code documentation scenario. Themisto facilitates the creation of documentation via three approaches: a deep-learning-based approach to generate documentation for source code, a query-based approach to retrieve online API documentation for source code, and a user prompt approach to nudge users to write documentation. We evaluated Themisto in a within-subjects experiment with 24 data science practitioners, and found that automated documentation generation techniques reduced the time for writing documentation, reminded participants to document code they would have ignored, and improved participants’ satisfaction with their computational notebook.

Data science in the business environment: Insight management for an Executive MBA

Adventures in Financial Data Science

GeCoAgent: A Conversational Agent for Empowering Genomic Data Extraction and Analysis

With the availability of reliable and low-cost DNA sequencing, human genomics is relevant to a growing number of end-users, including biologists and clinicians. Typical interactions require applying comparative data analysis to huge repositories of genomic information for building new knowledge, taking advantage of the latest findings in applied genomics for healthcare. Powerful technology for data extraction and analysis is available, but broad use of the technology is hampered by the complexity of accessing such methods and tools. This work presents GeCoAgent, a big-data service for clinicians and biologists. GeCoAgent uses a dialogic interface, animated by a chatbot, for supporting the end-users’ interaction with computational tools accompanied by multi-modal support. While the dialogue progresses, the user is accompanied in extracting the relevant data from repositories and then performing data analysis, which often requires the use of statistical methods or machine learning. Results are returned using simple representations (spreadsheets and graphics), while at the end of a session the dialogue is summarized in textual format. The innovation presented in this article is concerned with not only the delivery of a new tool but also our novel approach to conversational technologies, potentially extensible to other healthcare domains or to general data science.

Differentially Private Medical Texts Generation Using Generative Neural Networks

Technological advancements in data science have offered us affordable storage and efficient algorithms to query a large volume of data. Our health records are a significant part of this data, which is pivotal for healthcare providers and can be utilized in our well-being. The clinical note in electronic health records is one such category that collects a patient’s complete medical information during different timesteps of patient care available in the form of free-texts. Thus, these unstructured textual notes contain events from a patient’s admission to discharge, which can prove to be significant for future medical decisions. However, since these texts also contain sensitive information about the patient and the attending medical professionals, such notes cannot be shared publicly. This privacy issue has thwarted timely discoveries on this plethora of untapped information. Therefore, in this work, we intend to generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and analyze their utility rigorously in different metrics and levels. Experimental results promote the applicability of our generated data as it achieves more than 80% accuracy in different pragmatic classification problems and matches (or outperforms) the original text data.

Impact on Stock Market across Covid-19 Outbreak

Abstract: This paper analyses the impact of the pandemic on the global stock exchange. Stock listing values are determined by a variety of factors, including seasonal changes, catastrophic calamities, pandemics, fiscal year changes and many more. This paper analyses the variation in listing prices over the worldwide outbreak of the novel coronavirus. The key reason to focus on this outbreak was to provide a notion of the underlying regulation of stock exchanges. Daily closing prices of the stock indices from January 2017 to January 2022 have been utilized for the analysis. The predominant feature of the research is to analyse whether a downfall in the global economy impacts the financial stock exchange. Keywords: Stock Exchange, Matplotlib, Streamlit, Data Science, Web scraping.

Information Resilience: the nexus of responsible and agile approaches to information use

Abstract: The appetite for effective use of information assets has been steadily rising in both public and private sector organisations. However, whether the information is used for social good or commercial gain, there is a growing recognition of the complex socio-technical challenges associated with balancing the diverse demands of regulatory compliance and data privacy, social expectations and ethical use, business process agility and value creation, and scarcity of data science talent. In this vision paper, we present a series of case studies that highlight these interconnected challenges, across a range of application areas. We use the insights from the case studies to introduce Information Resilience, as a scaffold within which the competing requirements of responsible and agile approaches to information use can be positioned. The aim of this paper is to develop and present a manifesto for Information Resilience that can serve as a reference for future research and development in relevant areas of responsible data management.

qEEG Analysis in the Diagnosis of Alzheimers Disease; a Comparison of Functional Connectivity and Spectral Analysis

Alzheimer’s disease (AD) is a brain disorder that is mainly characterized by a progressive degeneration of neurons in the brain, causing a decline in cognitive abilities and difficulties in engaging in day-to-day activities. This study compares an FFT-based spectral analysis against a functional connectivity analysis based on phase synchronization, for finding known differences between AD patients and Healthy Control (HC) subjects. Both of these quantitative analysis methods were applied on a dataset comprising bipolar EEG montage values from 20 diagnosed AD patients and 20 age-matched HC subjects. Additionally, an attempt was made to localize the identified AD-induced brain activity effects in AD patients. The obtained results showed the advantage of the functional connectivity analysis method compared to a simple spectral analysis. Specifically, while spectral analysis could not find any significant differences between the AD and HC groups, the functional connectivity analysis showed statistically higher synchronization levels in the AD group in the lower frequency bands (delta and theta), suggesting that the AD patients’ brains are in a phase-locked state. Further comparison of functional connectivity between the homotopic regions confirmed that the traits of AD were localized in the centro-parietal and centro-temporal areas in the theta frequency band (4-8 Hz). The contribution of this study is that it applies a neural metric for Alzheimer’s detection from a data science perspective rather than from a neuroscience one. The study shows that the combination of bipolar derivations with phase synchronization yields similar results to comparable studies employing alternative analysis methods.
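
To make the contrast between the two kinds of analysis in this abstract concrete, here is a small sketch that computes an FFT-based band power for a single signal and a phase synchronization score between two signals. The abstract does not say which synchronization measure the study used, so the phase locking value (PLV) below is an assumed stand-in, and the function names, the 256 Hz sampling rate and the synthetic sine waves are all hypothetical. The sketch also assumes the signals are already restricted to the band of interest; real EEG would need band-pass filtering first.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal via an FFT-based Hilbert transform (negative frequencies zeroed)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)


def phase_locking_value(x, y):
    """PLV in [0, 1]: 1 means a constant phase difference, near 0 means none."""
    phase_diff = np.angle(analytic_signal(x)) - np.angle(analytic_signal(y))
    return float(np.abs(np.mean(np.exp(1j * phase_diff))))


def band_power(x, fs, low, high):
    """Spectral power of x summed over the band [low, high) Hz."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    mask = (freqs >= low) & (freqs < high)
    return float(psd[mask].sum())


# Synthetic stand-ins for two EEG channels (assumed values, not study data).
fs = 256                                        # sampling rate in Hz
t = np.arange(4 * fs) / fs                      # 4 seconds of samples
chan_a = np.sin(2 * np.pi * 6 * t)              # 6 Hz, inside the theta band
chan_b = np.sin(2 * np.pi * 6 * t + 0.8)        # same frequency, fixed phase lag
chan_c = np.sin(2 * np.pi * 7 * t)              # different frequency, drifting phase

plv_locked = phase_locking_value(chan_a, chan_b)
plv_unlocked = phase_locking_value(chan_a, chan_c)
theta_power = band_power(chan_a, fs, 4, 8)
delta_power = band_power(chan_a, fs, 1, 4)
print(plv_locked, plv_unlocked, theta_power, delta_power)
```

Band power describes each channel on its own, whereas the PLV describes the relationship between two channels, which is why two groups can have similar spectra yet differ in connectivity.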

Big Data Analytics for Long-Term Meteorological Observations at Hanford Site

A growing number of physical objects with embedded sensors with typically high volume and frequently updated data sets has accentuated the need to develop methodologies to extract useful information from big data for supporting decision making. This study applies a suite of data analytics and core principles of data science to characterize near real-time meteorological data with a focus on extreme weather events. To highlight the applicability of this work and make it more accessible from a risk management perspective, a foundation for a software platform with an intuitive Graphical User Interface (GUI) was developed to access and analyze data from a decommissioned nuclear production complex operated by the U.S. Department of Energy (DOE, Richland, USA). Exploratory data analysis (EDA), involving classical non-parametric statistics, and machine learning (ML) techniques, were used to develop statistical summaries and learn characteristic features of key weather patterns and signatures. The new approach and GUI provide key insights into using big data and ML to assist site operation related to safety management strategies for extreme weather events. Specifically, this work offers a practical guide to analyzing long-term meteorological data and highlights the integration of ML and classical statistics to applied risk and decision science.

Chapman University Digital Commons

Computational and Data Sciences (PhD) Dissertations

Below is a selection of dissertations from the Doctor of Philosophy in Computational and Data Sciences program in Schmid College that have been included in Chapman University Digital Commons. Additional dissertations from years prior to 2019 are available through the Leatherby Libraries' print collection or in Proquest's Dissertations and Theses database.

Dissertations from 2024

Advancement in In-Silico Drug Discovery from Virtual Screening Molecular Dockings to De-Novo Drug Design Transformer-based Generative AI and Reinforcement Learning, Dony Ang

A Novel Correction for the Multivariate Ljung-Box Test, Minhao Huang

Medical Image Analysis Based on Graph Machine Learning and Variational Methods, Sina Mohammadi

Machine Learning and Geostatistical Approaches for Discovery of Weather and Climate Events Related to El Niño Phenomena, Sachi Perera

Global to Glocal: A Confluence of Data Science and Earth Observations in the Advancement of the SDGs, Rejoice Thomas

Dissertations from 2023

Computational Analysis of Antibody Binding Mechanisms to the Omicron RBD of SARS-CoV-2 Spike Protein: Identification of Epitopes and Hotspots for Developing Effective Therapeutic Strategies, Mohammed Alshahrani

Integration of Computer Algebra Systems and Machine Learning in the Authoring of the SANYMS Intelligent Tutoring System, Sam Ford

Voluntary Action and Conscious Intention, Jake Gavenas

Random Variable Spaces: Mathematical Properties and an Extension to Programming Computable Functions, Mohammed Kurd-Misto

Computational Modeling of Superconductivity from the Set of Time-Dependent Ginzburg-Landau Equations for Advancements in Theory and Applications, Iris Mowgood

Application of Machine Learning Algorithms for Elucidation of Biological Networks from Time Series Gene Expression Data, Krupa Nagori

Stochastic Processes and Multi-Resolution Analysis: A Trigonometric Moment Problem Approach and an Analysis of the Expenditure Trends for Diabetic Patients, Isaac Nwi-Mozu

Applications of Causal Inference Methods for the Estimation of Effects of Bone Marrow Transplant and Prescription Drugs on Survival of Aplastic Anemia Patients, Yesha M. Patel

Causal Inference and Machine Learning Methods in Parkinson's Disease Data Analysis, Albert Pierce

Causal Inference Methods for Estimation of Survival and General Health Status Measures of Alzheimer’s Disease Patients, Ehsan Yaghmaei

Dissertations from 2022

Computational Approaches to Facilitate Automated Interchange between Music and Art, Rao Hamza Ali

Causal Inference in Psychology and Neuroscience: From Association to Causation, Dehua Liang

Advances in NLP Algorithms on Unstructured Medical Notes Data and Approaches to Handling Class Imbalance Issues, Hanna Lu

Novel Techniques for Quantifying Secondhand Smoke Diffusion into Children's Bedroom, Sunil Ramchandani

Probing the Boundaries of Human Agency, Sook Mun Wong

Dissertations from 2021

Predicting Eye Movement and Fixation Patterns on Scenic Images Using Machine Learning for Children with Autism Spectrum Disorder, Raymond Anden

Forecasting the Prices of Cryptocurrencies using a Novel Parameter Optimization of VARIMA Models, Alexander Barrett

Applications of Machine Learning to Facilitate Software Engineering and Scientific Computing, Natalie Best

Exploring Behaviors of Software Developers and Their Code Through Computational and Statistical Methods, Elia Eiroa Lledo

Assessing the Re-Identification Risk in ECG Datasets and an Application of Privacy Preserving Techniques in ECG Analysis, Arin Ghazarian

Multi-Modal Data Fusion, Image Segmentation, and Object Identification using Unsupervised Machine Learning: Conception, Validation, Applications, and a Basis for Multi-Modal Object Detection and Tracking, Nicholas LaHaye

Machine-Learning-Based Approach to Decoding Physiological and Neural Signals, Elnaz Lashgari

Learning-Based Modeling of Weather and Climate Events Related To El Niño Phenomenon via Differentiable Programming and Empirical Decompositions, Justin Le

Quantum State Estimation and Tracking for Superconducting Processors Using Machine Learning, Shiva Lotfallahzadeh Barzili

Novel Applications of Statistical and Machine Learning Methods to Analyze Trial-Level Data from Cognitive Measures, Chelsea Parlett

Optimal Analytical Methods for High Accuracy Cardiac Disease Classification and Treatment Based on ECG Data, Jianwei Zheng

Dissertations from 2020

Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents, Steven Agajanian

Allocation of Public Resources: Bringing Order to Chaos, Lance Clifner

A Novel Correction for the Adjusted Box-Pierce Test — New Risk Factors for Emergency Department Return Visits within 72 hours for Children with Respiratory Conditions — General Pediatric Model for Understanding and Predicting Prolonged Length of Stay , Sidy Danioko

A Computational and Experimental Examination of the FCC Incentive Auction, Logan Gantner

Exploring the Employment Landscape for Individuals with Autism Spectrum Disorders using Supervised and Unsupervised Machine Learning, Kayleigh Hyde

Integrated Machine Learning and Bioinformatics Approaches for Prediction of Cancer-Driving Gene Mutations, Oluyemi Odeyemi

On Quantum Effects of Vector Potentials and Generalizations of Functional Analysis, Ismael L. Paiva

Long Term Ground Based Precipitation Data Analysis: Spatial and Temporal Variability, Luciano Rodriguez

Gaining Computational Insight into Psychological Data: Applications of Machine Learning with Eating Disorders and Autism Spectrum Disorder, Natalia Rosenfield

Connecting the Dots for People with Autism: A Data-driven Approach to Designing and Evaluating a Global Filter, Viseth Sean

Novel Statistical and Machine Learning Methods for the Forecasting and Analysis of Major League Baseball Player Performance, Christopher Watkins

Dissertations from 2019

Contributions to Variable Selection in Complexly Sampled Case-control Models, Epidemiology of 72-hour Emergency Department Readmission, and Out-of-site Migration Rate Estimation Using Pseudo-tagged Longitudinal Data, Kyle Anderson

Bias Reduction in Machine Learning Classifiers for Spatiotemporal Analysis of Coral Reefs using Remote Sensing Images, Justin J. Gapper

Estimating Auction Equilibria using Individual Evolutionary Learning, Kevin James

Employing Earth Observations and Artificial Intelligence to Address Key Global Environmental Challenges in Service of the SDGs, Wenzhao Li

Image Restoration using Automatic Damaged Regions Detection and Machine Learning-Based Inpainting Technique, Chloe Martin-King

Theses from 2017

Optimized Forecasting of Dominant U.S. Stock Market Equities Using Univariate and Multivariate Time Series Analysis Methods, Michael Schwartz

ISSN 2572-1496


Thesis/Capstone for Master's in Data Science | Northwestern SPS - Northwestern School of Professional Studies


Data Science

Capstone and thesis overview.

Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can be highlighted on their resumes. Students should consider the factors below when deciding whether a capstone or thesis may be more appropriate to pursue.

A capstone is a practical or real-world project that can emphasize preparation for professional practice. A capstone is more appropriate if:

  • you don't necessarily need or want the experience of the research process or writing a big publication
  • you want more input on your project, from fellow students and instructors
  • you want more structure to your project, including assignment deadlines and due dates
  • you want to complete the project or graduate in a timely manner

A student can enroll in MSDS 498 Capstone in any term. However, capstone specialization courses can provide a unique student experience and may be offered only twice a year. 

A thesis is an academic-focused research project with broader applicability. A thesis is more appropriate if:

  • you want to get a PhD or other advanced degree and want the experience of the research process and writing for publication
  • you want to work individually with a specific faculty member who serves as your thesis adviser
  • you are more self-directed, are good at managing your own projects with very little supervision, and have a clear direction for your work
  • you have a project that requires more time to pursue

Students can enroll in MSDS 590 Thesis as long as there is an approved thesis project proposal, identified thesis adviser, and all other required documentation at least two weeks before the start of any term.

From Faculty Director, Thomas W. Miller, PhD


“Capstone projects and thesis research give students a chance to study topics of special interest to them. Students can highlight analytical skills developed in the program. Work on capstone and thesis research projects often leads to publications that students can highlight on their resumes.”

A thesis is an individual research project that usually takes two to four terms to complete. Capstone course sections, on the other hand, represent a one-term commitment.

Students need to evaluate their options prior to choosing a capstone course section because capstones vary widely from one instructor to the next. There are both general and specialization-focused capstone sections. Some capstone sections offer individual research projects, others offer team research projects, and a few give students a choice of individual or team projects.

Students should refer to the SPS Graduate Student Handbook for more information regarding registration for either MSDS 590 Thesis or MSDS 498 Capstone.

Capstone Experience

If students wish to engage with an outside organization to work on a project for capstone, they can refer to this checklist and lessons learned for some helpful tips.

Capstone Checklist

  • Start early — set aside a minimum of one to two months prior to the capstone quarter to determine the industry and modeling interests.
  • Networking — pitch your idea to potential organizations for projects and focus on the business benefits you can provide.
  • Permission request — make sure your final project can be shared with others in the course and the information can be made public.
  • Engagement — engage with the capstone professor prior to and immediately after getting the dataset to ensure appropriate scope for the 10 weeks.
  • Teambuilding — recruit team members who have similar interests for the type of project during the first week of the course.

Capstone Lessons Learned

  • Access to company data can take longer than expected; not having this access before or at the start of the term can severely delay progress
  • The project timeline should align with the coursework timeline as closely as possible
  • Designate one point of contact (POC) for business-facing communication to ensure streamlined messages and more effective time management with the organization
  • Manage expectations on both sides: for the business, the work is pro bono; for students, it does not guarantee internship or job opportunities
  • Data security or masking that is not completed in time can put the entire opportunity at risk

Publication of Work

Northwestern University Libraries offers an option for students to publish their master’s thesis or capstone in Arch, Northwestern’s open access research and data repository.

Benefits for publishing your thesis:

  • Your work will be indexed by search engines and discoverable by researchers around the world, extending your work’s impact beyond Northwestern
  • Your work will be assigned a Digital Object Identifier (DOI) to ensure perpetual online access and to facilitate scholarly citation
  • Your work will help accelerate discovery and increase knowledge in your subject domain by adding to the global corpus of public scholarly information

Get started:

  • Visit Arch online
  • Log in with your NetID
  • Describe your thesis: title, author, date, keywords, rights, license, subject, etc.
  • Upload your thesis or capstone PDF and any related supplemental files (data, code, images, presentations, documentation, etc.)
  • Select a visibility: Public, Northwestern-only, Embargo (i.e. delayed release)
  • Save your work to the repository

Your thesis manuscript or capstone report will then be published on the MSDS page. You can view other published work here .

For questions or support in publishing your thesis or capstone, please contact [email protected] .


This collection of MIT Theses in DSpace contains selected theses and dissertations from all MIT departments. Please note that this is NOT a complete collection of MIT theses. To search all MIT theses, use MIT Libraries' catalog .

MIT's DSpace contains more than 58,000 theses completed at MIT dating as far back as the mid-1800s. Theses in this collection have been scanned by the MIT Libraries or submitted in electronic format by thesis authors. Since 2004, all new master's and Ph.D. theses have been scanned and added to this collection after degrees are awarded.

MIT Theses are openly available to all readers. Please share how this access affects or benefits you. Your story matters.

If you have questions about MIT theses in DSpace, contact [email protected]. See also Access & Availability Questions or About MIT Theses in DSpace.

If you are a recent MIT graduate, your thesis will be added to DSpace within 3-6 months after your graduation date. Please email [email protected] with any questions.

Permissions

MIT Theses may be protected by copyright. Please refer to the MIT Libraries Permissions Policy for permission information. Note that the copyright holder for most MIT theses is identified on the title page of the thesis.

Theses by Department

  • Comparative Media Studies
  • Computation for Design and Optimization
  • Computational and Systems Biology
  • Department of Aeronautics and Astronautics
  • Department of Architecture
  • Department of Biological Engineering
  • Department of Biology
  • Department of Brain and Cognitive Sciences
  • Department of Chemical Engineering
  • Department of Chemistry
  • Department of Civil and Environmental Engineering
  • Department of Earth, Atmospheric, and Planetary Sciences
  • Department of Economics
  • Department of Electrical Engineering and Computer Sciences
  • Department of Humanities
  • Department of Linguistics and Philosophy
  • Department of Materials Science and Engineering
  • Department of Mathematics
  • Department of Mechanical Engineering
  • Department of Nuclear Science and Engineering
  • Department of Ocean Engineering
  • Department of Physics
  • Department of Political Science
  • Department of Urban Studies and Planning
  • Engineering Systems Division
  • Harvard-MIT Program of Health Sciences and Technology
  • Institute for Data, Systems, and Society
  • Media Arts & Sciences
  • Operations Research Center
  • Program in Real Estate Development
  • Program in Writing and Humanistic Studies
  • Science, Technology & Society
  • Science Writing
  • Sloan School of Management
  • Supply Chain Management
  • System Design & Management
  • Technology and Policy Program

Collections in this community

  • Doctoral theses
  • Graduate theses
  • Undergraduate theses

Recent submissions

  • Transport Properties of Divertor Edge Plasmas Measured with Multi-Spectral Imaging
  • Entanglement and Chaos in Quantum Field Theory and Gravity
  • Illuminating the Cosmos: dark matter, primordial black holes, and cosmic dawn

LIBRARIES | ARCH

Data science masters theses.

The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis. This collection contains a selection of master's theses and capstone projects by MSDS graduates.

Collection Details

List of items in this collection (by date added)

  • 2022-06-15
  • 2022-06-05
  • 2020-06-16
  • 2020-06-13
  • 2019-11-26
  • 2019-11-21
  • 2019-06-23


Ramapo College of New Jersey Home Page » Admissions & Aid » Graduate » DMC » MS Thesis Archive


MS Thesis Archive

Spring 2024

PREDICTING FIRST YEAR RETENTION FOR UNDERGRADUATE EDUCATIONAL OPPORTUNITY FUND STUDENTS

Kelly O’Neill, M.S. Applied Mathematics

Predicting undergraduate retention with machine learning algorithms has the potential to reduce attrition among students who are identified as being at an elevated risk of dropping out, and thus provides a mechanism to help increase the likelihood of a student graduating from college. Following the approach of previous studies, retention is predicted using primarily freshman data, where retention is defined as a student being enrolled one year after their first semester. This thesis focuses on predicting retention for Educational Opportunity Fund (EOF) students. Based on the EOF department’s most recent report, which comes from Ramapo’s Office of Institutional Research (2023), in 2016 the EOF four-year graduation rate was 46.40% and the six-year graduation rate was 63.10%, whereas for the college the four-year graduation rate is 56.9% and the six-year graduation rate is 69.5%, using the Fall 2018 cohort. Identifying the specific individuals who are unlikely to be retained allows the EOF department to devise an appropriate plan and provide resources to help those students achieve academic success, and thus increase graduation rates.

This thesis considers many factors, provided by the EOF department, from Fall 2013 to Spring 2023, and accounts for the impact of COVID in the analysis. I predict retention using logistic regression, decision tree, random forest, support vector machine, ensemble, and gradient boosting classifiers. Feature selection was used for each algorithm, as was the Synthetic Minority Over-sampling Technique (SMOTE), since the dataset was not balanced. While all of the models performed well, even after 10-fold cross-validation, the random forest model using feature selection and a balanced dataset is recommended. In the future, the EOF department can use this model to determine which incoming students are at elevated risk of dropping out and provide them with the necessary resources to help them succeed.
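To make the balancing step concrete, the following is a minimal sketch of the core idea behind SMOTE: each synthetic minority row is an interpolation between an existing minority row and one of its nearest minority neighbours. It is not the thesis code. The function name `smote`, the feature columns, and the sample values are all hypothetical, and a real analysis would more likely use an established library implementation and apply oversampling only inside the training folds of the cross-validation.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical (cumulative GPA, credits attempted) rows for non-retained students.
not_retained = [(2.1, 12.0), (1.8, 15.0), (2.4, 9.0), (2.0, 14.0)]
new_rows = smote(not_retained, n_new=6)
```

Because every synthetic row lies on a segment between two real minority rows, the new rows stay inside the range of the observed minority data rather than duplicating existing records.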

The second part of this thesis is a comprehensive exploratory data analysis to learn more about the EOF student population. EOF students tend to struggle within the subject areas of math, biology, interdisciplinary studies, psychology, and chemistry. More specifically, they struggle in the courses math 108, interdisciplinary study 101, biology 221, critical reading and writing 102, amer/intl interdisciplinary 201, math 101, and math 110. Regarding retention, the average cumulative GPA was 2.84 for students who were retained and 2.15 for students who were not. Furthermore, the average term GPA was 2.67 for those who were retained but 1.65 for those who were not.

Analyzing the relationship between retention and other variables, such as GPA, subject areas, and courses, gives the EOF department a better idea of possible support mechanisms for students. Coupling this information with the recommended prediction algorithm of ensemble learning can help the EOF department increase their four-year and six-year graduation rates by providing students with resources, guidance, and plans informed by their expertise.

EXAMINING DISEASE THROUGH MICROBIOME DATA ANALYSIS

Brett Van Tassel, M.S. Data Science

The objective of this project is to examine the relationship between gut microbiomes of human subjects having different disease statuses by examining microbial diversity shifts. Read analysis and data cleaning are recorded from beginning to end so that the unfiltered and unfettered data can be reanalyzed and processed. Here we strive to create a tool that works for well-curated data. Data is gathered from the database QIITA, and the read data and metadata are queried via the tool redbiom. The initial exploratory analysis involved an examination of metadata attributes. A heat map of correlating metadata attributes, computed with Cramér’s V, allows visual examination of correlations. Next, we train random forests based on metadata of interest. Due to the large quantity of attributes, many random forests are trained, and their respective significance values and Receiver Operating Characteristic (ROC) curves are generated. ROC curves are used to isolate optimal correlations. This process is built into a pipeline, ultimately allowing the efficient, automated analysis and assignment of disease susceptibility. Alpha and beta diversity metrics are generated and plotted for visual interpretation using QIIME2, a microbial analysis software platform. CLOUD, a tool for finding microbiome outliers, is used to identify markers of dysbiosis and contamination, and to measure rates of successful identification. When examining positive samples and their predicted diagnosis status, CLOUD was found to identify positive diagnoses where random forests did not. SMOTE was found to perform similarly to, or slightly worse than, random sampling as a data balancing technique.
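As an illustration of the statistic behind the metadata heat map, here is a minimal sketch of Cramér's V for two categorical attributes, computed from the chi-squared statistic of their contingency table. It is the plain (uncorrected) form, not the project's pipeline code, and the function name `cramers_v` and the example attribute values are hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V for two equal-length sequences of categorical values."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if min_dim == 0:
        return 0.0  # a constant attribute carries no association
    return math.sqrt(chi2 / (n * min_dim))


# Hypothetical metadata columns: diet type and disease status per sample.
diet = ["vegan", "vegan", "omnivore", "omnivore"]
status = ["healthy", "healthy", "ibd", "ibd"]
association = cramers_v(diet, status)
```

Computing this value for every pair of metadata attributes yields the symmetric matrix that a heat map would display, with values near 0 indicating no association and values near 1 indicating a strong one.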

Summer 2023

EVALUATING HOW NHL PLAYER SHOT SELECTION IMPACTS EVEN-STRENGTH GOAL OUTPUT OVER THE COURSE OF A FULL SEASON

Elliott Barinberg, M.S. Data Science

Within this thesis work, the applications of data collection, machine learning, and data visualization were used on National Hockey League (NHL) shot data collected between the 2014-2015 season and the 2022-2023 season. Modeling sports data to better understand player evaluation has always been a goal of sports analytics. In the modern era of sports analytics the techniques used to quantify impacts on games have multiplied. However, when it comes to ice hockey all the most difficult challenges of sports data analysis present themselves in trying to understand the player impacts of such a continuously changing game-state. The methods developed and presented in this work serve to highlight those challenges and better explain a player’s impact on goal scoring for their team.

Throughout this work there are multiple kinds of modeling techniques used to try to best demonstrate a player’s impact on goal scoring as a factor of all the elements the player is capable of controlling. We try to understand which players have the best offensive process and impact on goal-scoring by caring about the merit of the offensive opportunities they create. It is important to note that these models are not intended to re-create the results seen in reality, although reality and true results are used to evaluate the outputs.

This process used data scraping to collect the data from the NHL public application programming interface (API). Data cleansing techniques were applied to the collected data, yielding custom data sets which were used for the corresponding models. Data transformation techniques were used to calculate additional factors based upon the data provided, thus creating additional data within the training and testing datasets. Techniques including but not limited to linear regression, logistic regression, random forests and extreme gradient boosted regression were all used to attempt to model the true probability of any particular even-strength event being a goal in the NHL. Then, using formulaic approaches, the individual event model was extended to draw larger conclusions. Lastly, some unique data visualization techniques were used to best present the outputs of these models. In all, many experimental models were created, which have yielded a reproducible methodology for evaluating any NHL player’s impact upon goal scoring over the course of a season.
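One way an event-level model can be extended to a season-level conclusion is to sum per-shot goal probabilities for each player. The sketch below illustrates that idea with a logistic model; it is an assumption about the general approach, not the thesis's actual formulas. The coefficients, feature choices, and player names are invented for illustration and are not fitted to NHL data.

```python
import math
from collections import defaultdict

# Hypothetical coefficients, NOT fitted to NHL data: intercept, per-foot
# distance effect, and per-degree shot angle effect.
INTERCEPT, B_DISTANCE, B_ANGLE = -1.0, -0.05, -0.01


def goal_probability(distance_ft, angle_deg):
    """Logistic model: probability that a single shot becomes a goal."""
    z = INTERCEPT + B_DISTANCE * distance_ft + B_ANGLE * abs(angle_deg)
    return 1.0 / (1.0 + math.exp(-z))


def expected_goals(shots):
    """Sum per-shot probabilities into a total for each player."""
    totals = defaultdict(float)
    for player, distance_ft, angle_deg in shots:
        totals[player] += goal_probability(distance_ft, angle_deg)
    return dict(totals)


# Hypothetical even-strength shots: (player, distance in feet, angle in degrees).
shots = [
    ("Player A", 10.0, 5.0),
    ("Player A", 40.0, 30.0),
    ("Player B", 55.0, 45.0),
]
season_totals = expected_goals(shots)
```

Comparing a player's summed expectation against their actual goal count is one way to separate the quality of the chances they create from the results those chances happened to produce.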

Spring 2023

BUILDING A STATISTICAL LEARNING MODEL FOR EVALUATION OF NBA PLAYERS USING PLAYER TRACKING DATA

Matthew Byman, M.S. Data Science

This thesis aims to develop faster and more accurate methods for evaluating NBA player performances by leveraging publicly available player tracking data. The primary research question addresses whether player tracking data can improve existing performance evaluation metrics. The ultimate goal is to enable teams to make better-informed decisions in player acquisitions and evaluations.

To achieve this objective, the study first acquired player tracking data for all available NBA seasons from 2013 to 2021. Regularized Adjusted Plus-Minus (RAPM) was selected as the target variable, as it effectively ranks player value over the long term. Five statistical learning models were employed to estimate RAPM using player tracking data as features. Furthermore, the coefficients of each feature were ranked, and the models were rerun with only the 30 most important features.

Once the models were developed, they were tested on newly acquired player tracking data from the 2022 season to evaluate their effectiveness in estimating RAPM. The key findings revealed that Lasso Regression and Random Forest models performed the best in predicting RAPM values. These models enable the use of player tracking statistics that settle earlier, providing an accurate estimate of future RAPM. This early insight into player performance offers teams a competitive advantage in player evaluations and acquisitions.

In conclusion, this study demonstrates that combining statistical learning models with player tracking data can effectively estimate performance metrics, such as RAPM, earlier in the season. By obtaining accurate RAPM estimates before other teams, organizations can identify and acquire top-performing players, ultimately enhancing their competitive edge in the NBA.

BUILDING AN ML DRIVEN SYSTEM FOR REAL-TIME CODE-PERFORMANCE MONITORING

Mikhail Delyusto, M.S. Data Science

This project is part of a multidirectional attempt to increase the quality of the software and data products produced by the Science and Engineering departments of Aetion Inc. Aetion is transforming the healthcare industry by providing its partners (major healthcare industry players) with a real-world evidence generation platform that helps to drive greater safety, effectiveness, and value of health treatments. Large datasets (up to 100Tb each) of healthcare market data (for example, insurance claims) are ingested into the platform and transformed into Aetion’s proprietary longitudinal format.

This attempt is being led by the Quality Engineering Team and is envisioned to move away from conventional testing techniques by decoupling different moving parts and isolating them in separate, maintainable and reliable tools.

The subject of this thesis is a particular branch of a large quality initiative that will help to continuously monitor a number of metrics involved in the execution of the two most common types of jobs that run on Aetion’s platform: cohorts and analyses. These jobs may take up to a few hours to generate, depending on the size of the dataset and the complexity of the analysis.

Once implemented, this monitoring system would be supplied with a feed of logs that contain certain data points, such as timestamps. Enhanced with a built-in algorithm that sets a threshold on the metrics and notifies its users (stakeholders from Engineering and Science) when that threshold is exceeded, it would be a game-changing capability in Aetion’s quality space. Currently, there is no way to tell whether a given job is taking more or significantly less time than expected, and most defects are identified in upper environments (including production).
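A simple version of such a threshold can be sketched as follows, assuming job durations have already been derived from log timestamps. The abstract does not specify the algorithm, so the rule used here (flag a run that falls outside the historical mean plus or minus k standard deviations) is an illustrative assumption, as are the function names and sample durations.

```python
import statistics


def duration_bounds(history_s, k=3.0):
    """Lower and upper alert thresholds from historical durations (seconds)."""
    mean = statistics.fmean(history_s)
    spread = statistics.stdev(history_s)
    return mean - k * spread, mean + k * spread


def check_job(duration_s, history_s, k=3.0):
    """Classify one job run against thresholds built from earlier runs."""
    low, high = duration_bounds(history_s, k)
    if duration_s > high:
        return "slower than expected"
    if duration_s < low:
        return "faster than expected"
    return "ok"


# Hypothetical durations of earlier runs of the same cohort job, in seconds.
history = [600, 620, 610, 590, 605]
status = check_job(900, history)
```

Checking both bounds matters because a job that finishes far too quickly can signal a defect, such as skipped work, just as much as one that runs long.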

Issues identified in upper environments are the costliest of all and, by various industry estimates, can cost $5,000 to $10,000 each.

As a result of implementing this system, we would expect a steep decrease in the number of issues in upper environments, as well as an increase in release frequency, from which the organization will greatly benefit.

OPTIMIZING PRODUCT RECOMMENDATION DECISIONS USING SPATIAL ANALYSIS

Raul A. Hincapie, M.S. Data Science

At a certain Consumer Packaged Goods (CPG) company, there was a need to coordinate between sales, geographic location, and demographic datasets to make better-informed business decisions. One area that required this type of coordination was the replacement process of a specific product being sold to a store. The need for this type of replacement arises when a product is not authorized to be sold at the store, out of stock, permanently discontinued, or not selling at the intended rate. Previously, the process at this company relied on instinctual decision-making when it came to product replacements, which showed a need for this protocol to be more data-driven.

The premise of this project is to create a data-driven product replacement process. The CPG company would input a store and a product, and the system would output a list of suitable replacement items. The replacement items would be based on stores similar to the input store in terms of sales, geographic location, and demographic portfolio. By identifying these similar stores, the CPG company could also discover product opportunities or niches for a specific store or region. With a system like this, the company will increase its regional product knowledge based on geographical location as well as improve current and future sales. The system could also provide highly valuable information on consumer preferences and behaviors, which could eventually help to understand future customers.
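The store-in, product-in, list-out behaviour described above can be sketched as a two-step lookup: find the stores most similar to the input store, then rank what sells in them. The abstract does not name a similarity measure, so cosine similarity is an assumption here, and the function names, store profiles, and sales figures are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def replacements(store, product, profiles, sales, n_stores=2, n_items=3):
    """Rank products sold in the stores most similar to `store`."""
    peers = sorted(
        (s for s in profiles if s != store),
        key=lambda s: cosine(profiles[store], profiles[s]),
        reverse=True,
    )[:n_stores]
    totals = {}
    for peer in peers:
        for item, units in sales[peer].items():
            if item != product:  # never suggest the product being replaced
                totals[item] = totals.get(item, 0) + units
    return sorted(totals, key=totals.get, reverse=True)[:n_items]


# Hypothetical scaled profiles: (sales index, urban share, median-age index).
profiles = {"A": (1.0, 0.2, 0.5), "B": (0.9, 0.25, 0.55), "C": (0.1, 0.9, 0.2)}
sales = {
    "A": {"cola": 50},
    "B": {"cola": 40, "lemonade": 30, "iced tea": 10},
    "C": {"seltzer": 80},
}
suggested = replacements("A", "cola", profiles, sales, n_stores=1)
```

Profile features should be scaled to comparable ranges before comparison; otherwise a single large-valued attribute such as raw sales volume would dominate the similarity score.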

PREDICTING AND ANALYZING STOCK MARKET BEHAVIOR USING MAGAZINE COVERS

Egor Isakson, M.S. Data Science

Financial magazines have been part of the financial industry right from the start. There has long been a debate about whether a stock being featured in a magazine is a contrarian signal. The reasoning behind this is simple: any informational edge reaches the wide masses last, which means that by the time it does, the bulk of the directional move of the financial instrument has long been completed. This paper puts this idea to the test by examining the behavior of the stock market and the stocks that are featured on the covers of various financial magazines and newspapers. Through several stages of data extraction and processing that use a series of up-to-date data science techniques, ticker symbols are derived from raw color images of covers. The derivation results in a many-to-many relationship, where a single ticker can show up at different points in time and a single cover can feature many tickers at once. From there, several historic price and media-related features are created in preparation for the machine learning models. Several models are used to look at the behavior of the stock and the index at different points in the future. The results are better than random but insufficient as the sole determinant of the direction of an asset.

IDENTIFYING OUTLIER DATA POINTS IN NON-CLINICAL INVESTIGATIONAL NEW DRUG SUBMISSIONS

Cassandra O’Malley, M.S. Data Science

The Food and Drug Administration (FDA) uses a format known as SEND (Standard for Exchange of Nonclinical Data) to evaluate non-clinical (animal) studies for investigational new drug applications. Investigative drug sponsors currently use information from historical and control data to determine if drugs cause toxicity.

The goal of this study is to identify outlying data points that may indicate an investigative new drug could be toxic. Examples include a negative body weight gain over time, enlarged organ weights, or laboratory test abnormalities, especially in relation to a control group within the same study. Flagged records can be analyzed by a veterinarian or pathologist for potential signs of toxicity without looking at each individual data point.

Common domains within the non-clinical pharmaceutical studies were evaluated using changes from baseline measurements, changes from the control group, a percent change from the previous measurement with reference to the ethical guidelines, values outside of the mean ± two standard deviations, and a measure of abnormal findings to unremarkable findings in pathology. A program was designed to analyze five of these domains and return a collection of possible outlying data, which is simpler and faster than individual data point analysis by a study monitor and completes the analysis in a fraction of the time. The resulting file is more easily read by someone unfamiliar with the SEND format.
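Two of the checks listed above, values outside the mean ± two standard deviations and percent change from the previous measurement, can be sketched as follows. This is an illustration rather than the program described in the thesis. The function names and body-weight values are hypothetical, and the 10% loss limit is an assumed placeholder, not a figure taken from any guideline.

```python
import statistics


def outside_two_sd(values, reference):
    """Values lying outside mean ± 2 SD of the reference (control) group."""
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    return [v for v in values if abs(v - mean) > 2 * sd]


def percent_changes(series):
    """Percent change of each measurement relative to the previous one."""
    return [100.0 * (b - a) / a for a, b in zip(series, series[1:])]


def flag_weight_loss(series, limit_pct=-10.0):
    """Indices of measurements that fell by more than the assumed limit."""
    return [
        i + 1
        for i, pct in enumerate(percent_changes(series))
        if pct < limit_pct
    ]


# Hypothetical body weights in grams.
control_group = [250, 255, 245, 252, 248]
treated_group = [251, 240, 262]
flagged = outside_two_sd(treated_group, control_group)
```

Using the concurrent control group as the reference keeps the comparison within the same study, which matches the emphasis on control-relative changes in the abstract.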

With this program, analyzing a study for possible toxic effects during the study can save time, effort, and even animal lives by identifying the signs of toxicity early. Sponsors or CROs can determine if the product is safe enough to proceed with testing or should be stopped in the interest of safety and additional research.

CLIMATE CHANGE IMPACTS ON FOOD PRODUCTION: A BIBLIOMETRIC NETWORK ANALYSIS

Skylar Clawson, M.S. Data Science

Climate change is an environmental issue that is affecting many different sectors of society such as terrestrial, freshwater and marine ecosystems, human health and agriculture. With a growing population, food security is a serious issue exacerbated by climate change. Climate change is not only impacting food production, but food production is also impacting climate change by emitting greenhouse gases during the different stages of the food supply chain. This project seeks to use a bibliometric network analysis to identify the influence that the food supply chain has on climate change. We created four networks, one for each stage in the food supply chain (food processing, food transportation, food retail, food waste), to distinguish how influential the food supply chain is on climate change. The data needed for a bibliometric network comes from a scientific database, and the networks are created based on a co-word analysis. Co-word analysis reveals words that frequently appear together, showing that they have some form of relationship in research publications. The second part of our analysis is more focused on how climate change impacts the early growth and development stages of grains. We collected data on several grains as well as temperature and precipitation to see if these climate stressors had any influence on production rates. This project’s main focus is to identify how climate change and food production could be influencing each other. The main findings of this project indicate that all four stages of the food supply chain influence climate change. This project also indicates that climate change affects grain production through different climate variables such as temperature and precipitation variability.
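The co-word step can be illustrated with a short sketch that counts how often each pair of keywords appears in the same publication record; those counts become the edge weights of the network. This is not the project's code, and the function name `coword_edges` and the keyword records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coword_edges(keyword_lists):
    """Count how often each pair of keywords appears in the same record."""
    edges = Counter()
    for keywords in keyword_lists:
        # Normalise case and drop duplicates so each pair counts once per record.
        unique = sorted({k.lower() for k in keywords})
        edges.update(combinations(unique, 2))
    return edges


# Hypothetical keyword lists, one per publication record.
records = [
    ["climate change", "food waste", "emissions"],
    ["Climate Change", "food waste"],
    ["food retail", "emissions"],
]
edges = coword_edges(records)
strongest = edges.most_common(1)[0]
```

Sorting the keywords before pairing ensures that each pair is stored under one consistent key, so the same two terms are never split across two edges.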

EXPLORING VEHICLE SERVICE CONTRACT CANCELLATIONS

Josip Skunca, M.S. Data Science

The goal of this thesis is to propose the cancellation reserve requirement for ServiceContract.com, a start-up vehicle service contract administrator being formed by its parent company DOWC. DOWC is a vehicle service contract administrator that prides itself on offering customized financial products to large car dealerships. ServiceContract.com (referred to as ServiceContract) was created to offer no-chargeback products as a means of marketing to another portion of the automotive industry. No-chargeback means that if the contract cancels (after 90 days), the Dealership, Finance Manager, and Agent (account manager) of the account are not required to refund their profit from the insurance contract; the administrator refunds the prorated price of the contract. In other words, the administrator alone must refund the entire prorated price of the contract at the time of cancellation.

Therefore, the cancellation reserve is the price that must be collected per contract in order to cover all cancellation costs. This research was a requirement to determine the feasibility of the new company and determine the pricing requirements of its products. The pricing of the new company’s products would determine ServiceContract’s competitiveness in the market, and therefore provide an evaluation of the business model.

To find this reserve requirement, research first started by finding the total amount of money that DOWC has refunded, along with the total number of contracts sold. Adding specific information allowed the calculation of these requirements in the necessary form. Service contract administrators are required to file rate cards with each state that must clearly specify the dimensions of the contract and their corresponding price.
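The arithmetic behind this starting point can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the reserve is taken as a simple average of refunds over contracts sold, and the proration is assumed to be linear in time, which the thesis summary does not confirm.

```python
def cancellation_reserve_per_contract(total_refunded, contracts_sold):
    """Average refund cost to collect on every contract sold (assumed simple average)."""
    if contracts_sold <= 0:
        raise ValueError("contracts_sold must be positive")
    return total_refunded / contracts_sold

def prorated_refund(price, term_days, days_elapsed):
    """Refund for the unused part of the term, assuming linear time proration."""
    if term_days <= 0:
        raise ValueError("term_days must be positive")
    # Clamp so a cancellation after the term ends refunds nothing.
    remaining = max(term_days - days_elapsed, 0)
    return price * remaining / term_days

# Illustrative numbers only, not figures from the thesis.
reserve = cancellation_reserve_per_contract(250_000.0, 1_000)
refund = prorated_refund(1_200.0, 360, 90)
print(reserve, refund)
```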

The key result of the research was the realization that the Cancellation Reserve would be tied to the maximum allowed retail price. If the maximum price dealerships can sell for is lowered, the required Cancellation Reserve follows suit and, as a result, lowers the Coverage cost of the contract. This gives the dealership an opportunity to make its desired profit while enabling ServiceContract to offer competitive pricing.

The most significant impact of these results is that ServiceContract was able to determine that the company had more competitive rates than both competitors and DOWC. This research opened the company’s eyes to the benefit of this kind of research, and will prompt further research in the future.

Spring 2022

A TOOL FOR WHO WILL DROP OUT OF SCHOOL

Colette Joelle Barca, M.S. Data Science

A student’s high school experience often forms the foundation of his or her postsecondary career. As the competition in our nation’s job market continues to increase, many businesses stipulate applicants need a college degree. However, recent studies show approximately one-third of the United States’ college students never obtain a degree. Although colleges have developed methods for identifying and supporting their struggling students, early intervention could be a more effective approach for combating postsecondary dropout rates. This project seeks to use anomaly detection techniques to create a holistic early detection tool that indicates which high school students are most at risk to drop out of college. An individual’s high school experience is not confined to the academic components. As such, an effective model should incorporate both environmental and educational factors, including various descriptive data on the student’s home area, the school’s area, and the school’s overall structure and performance. This project combined this information with data on students throughout their secondary educational careers (i.e., from ninth through twelfth grade) in an attempt to develop a model that could detect during high school which students have a higher probability of dropping out of college. The clustering-based and classification-based anomaly detection algorithms detail the situational and numeric circumstances, respectively, that most frequently result in a student dropping out of college. High school administrators could implement these models at the culmination of each school year to identify which students are most at risk for dropping out in college. Then, administrators could provide additional support to those students during the following school year to decrease that risk. College administrators could also follow this same process to minimize dropout rates.

COMPREHENSIVE ANALYSIS OF THE FUTURE PRICE OF NBA TOP SHOT MOMENTS

Miguel A. Esteban Diaz, M.S. Data Science

NBA Top Shot moments are NFTs built on the FLOW blockchain and created by Dapper Labs in collaboration with the NBA. These NFTs, commonly referred to as “moments”, consist of in-game highlights of an NBA or WNBA player. Using the different variables of a moment, such as the type of play made by the player appearing in the moment (dunk, assist, block, etc.), the number of listings of that moment in the marketplace, whether the player appearing in the moment is a rookie, and the rarity tier of the moment (Common, Fandom, Rare or Legendary), this project aims to provide a statistical analysis that could reveal hidden correlations between the characteristics of a moment and its price, together with a prediction of the price of moments using machine learning regression models, including linear regression, random forest and neural networks. Because NFTs, and especially NBA Top Shot, are a relatively recent area of research, little research has been performed in this area so far. This research intends to expand the current analysis of this topic and to serve as a foundation for future research. It also aims to provide practical information about the valuation of moments, the importance of their characteristics and the impact of those characteristics on pricing, and the possible future application of this information to similar highlight-oriented sport NFTs such as NFL AllDay or UFC Strike, which are designed similarly to NBA Top Shot.

PREVENTING THE LOSS OF SKILLFUL TEACHERS: TEACHER TURNOVER PREDICTION USING MACHINE LEARNING TECHNIQUES

Nirusha Srishan, M.S. Data Science

Teacher turnover rate is an increasing problem in the United States. Each year, teachers leave their current teaching position to either move to a different school or to leave the profession entirely. In an effort to understand why teachers are leaving their current teaching positions and to help identify ways to increase teacher retention rate, I am exploring possible reasons that influence teacher turnover and creating a model to predict if a teacher will leave the teaching profession. The ongoing turnover of teachers has a vast impact on school district employees, the state, the country, and the student population. Therefore, exploring the variables that contribute to teacher turnover can ultimately lead to decreasing the rate of turnover.

This project compares those in the educational field, including general education teachers, special education teachers and other educational staff, who have completed the 1999-2000 Schools and Staffing Survey (SASS) and Teacher Follow-up Survey (TFS) from the National Center for Education Statistics (NCES, n.d.). This data will be used to identify trends in teachers who have left the profession. Predictive modeling will include various machine learning techniques, including Logistic Regression, Support Vector Machines (SVM), Decision Tree and Random Forest, and K-Nearest Neighbors. By finding the reasons for teacher turnover, a school district can identify a way to maximize its teacher retention rate, fostering a supportive learning environment for students and creating a positive work environment for educators.

FORECASTING AVERAGE SPEED OF CALL CENTER RESPONSES

Emmanuel Torres, M.S. Data Science

Organizations operate multifaceted modern call centers but still rely on antiquated forecasting technologies, which leads to erroneous staffing during critical periods of unprecedented volume. Companies experience financial hemorrhaging or provide an inadequate customer experience due to incorrect staffing when sporadic volume emerges. The forecasting models currently employed have known caveats, such as the inability to handle wait time without abandonment and the consideration of only a single call type when making the prediction.

This study aims to create a new forecasting model to predict the Average Speed of Answer (ASA) and thereby obtain a more accurate prediction of the staffing requirements for a call center. The new model draws on historical volume of varying capacities to create the prediction. Both parametric and nonparametric methodologies were used to forecast the ASA. An ARIMA (Autoregressive Integrated Moving Average) parametric model was used to create a baseline for the prediction. Recurrent Neural Networks (RNNs) were also applied because they can process sequential data by using previous outputs as inputs. Specifically, Long Short-Term Memory (LSTM) recurrent neural networks were used to create a forecasting model for the call center ASA.

With the LSTM neural network, both a univariate and a multivariate approach were used to forecast the ASA. The findings confirm that the univariate LSTM neural network produced the most accurate forecast, with the lowest Root Mean Squared Error (RMSE) of the three methods used to predict the call center ASA. The multivariate LSTM model also provided a reasonably accurate prediction but had a higher RMSE than the univariate model. ARIMA had the highest RMSE and forecasted the ASA inaccurately.
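For readers unfamiliar with the metric, the sketch below shows how RMSE can be computed and used to rank competing forecasts. The series and model names are made up for illustration and are not results from the thesis.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Illustrative ASA values in seconds (hypothetical, not thesis data).
actual = [30.0, 45.0, 60.0, 40.0]
forecasts = {
    "model_a": [32.0, 44.0, 58.0, 41.0],
    "model_b": [40.0, 35.0, 70.0, 30.0],
}

scores = {name: rmse(actual, preds) for name, preds in forecasts.items()}
best = min(scores, key=scores.get)  # lower RMSE means a closer forecast
print(scores, best)
```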

A COMPREHENSIVE EVALUATION ON THE APPLICATIONS OF DATA AUGMENTATION, TRANSFER LEARNING AND IMAGE ENHANCEMENT IN DEVELOPING A ROBUST SPEECH EMOTION RECOGNITION SYSTEM

Kyle Philip Calabro, M.S. Data Science

Within this thesis work, the applications of data augmentation, transfer learning, and image enhancement techniques were explored in great depth with respect to speech emotion recognition (SER) via convolutional neural networks and the classification of spectrogram images. Speech emotion recognition is a challenging subset of machine learning with an incredibly active research community. One of the prominent challenges of SER is a lack of quality training data. The methods developed and presented in this work serve to alleviate this issue and improve upon the current state-of-the-art methodology. A novel unimodal approach was taken in which five transfer learning models pre-trained on the ImageNet data set were used with both the feature extraction and fine-tuning methods of transfer learning. These transfer learning models are VGG-16, VGG-19, InceptionV3, Xception and ResNet-50. A modified version of the AlexNet deep neural network model was used as a baseline for non-pre-trained deep neural networks. Two speech corpora were used to develop these methods: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Data augmentation techniques were applied to the raw audio of each speech corpus to increase the amount of training data, yielding custom data sets. Raw audio data augmentation techniques include the addition of Gaussian noise, stretching by two different factors, time shifting and shifting pitch by three separate tones. Image enhancement techniques were implemented with the aim of improving classification accuracy by unveiling more prominent features in the spectrograms. Image enhancement techniques include conversion to grayscale, contrast stretching and the combination of grayscale conversion followed by contrast stretching.
In all, 176 experiments were conducted to provide a comprehensive overview of all techniques that were proposed as well as a definitive methodology. Such methodology yields improved or comparable results to what is currently considered to be state-of-the-art when deployed on the RAVDESS and CREMA-D speech corpora.
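Three of the techniques named above can be sketched in plain Python. This is a minimal illustration, not the thesis implementation: the function names, the circular form of the time shift, and the min-max form of contrast stretching are all assumptions, and signals and images are represented as flat lists for simplicity.

```python
import random

def add_gaussian_noise(signal, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma to each sample."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return [x + rng.gauss(0.0, sigma) for x in signal]

def time_shift(signal, shift):
    """Circularly shift the samples to the right by `shift` positions (assumed form)."""
    if not signal:
        return []
    k = shift % len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

def contrast_stretch(pixels, out_min=0.0, out_max=255.0):
    """Linearly rescale pixel intensities to span [out_min, out_max]."""
    p_min, p_max = min(pixels), max(pixels)
    if p_max == p_min:
        return [out_min] * len(pixels)  # flat image: nothing to stretch
    scale = (out_max - out_min) / (p_max - p_min)
    return [out_min + (p - p_min) * scale for p in pixels]

print(time_shift([1, 2, 3, 4], 1))
print(contrast_stretch([50, 100, 150]))
```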

Copyright ©2024 Ramapo College of New Jersey.

DigitalCommons@Kennesaw State University

Doctor of Data Science and Analytics Dissertations

The Ph.D. in Data Science and Analytics is an advanced degree with a dual focus of application and research - where students will engage in real world business problems, which will inform and guide their research interests.

We launched the first formal PhD program in Data Science in 2015. Our program sits at the intersection of computer science, statistics, mathematics, and business. Our students engage in relevant research with faculty from across our eleven colleges. As one of the institutions on the forefront of the development of data science as an academic discipline, we are committed to developing the next generation of Data Science leaders, researchers, and educators. Culturally, we are committed to the discipline of Data Science, through ethical practices, attention to fairness, to a diverse student body, to academic excellence, and research which makes positive contributions to our local, regional, and global community. -Sherry Ni, Director, Ph.D. in Data Science and Analytics

This degree will train individuals to translate complex structured and unstructured data into information that improves decision making, and to facilitate new, innovative research. The curriculum places heavy emphasis on programming, data mining, statistical modeling, and the mathematical foundations that support these concepts. Importantly, the program also emphasizes communication skills, both oral and written, as well as application and tying results to business and research problems.

Dissertations from 2024

A Holistic and Collaborative Behavioral Health Detection Framework Using Sensitive Police Narratives, Martin Keagan Wynne Brown

MEDICAL IMAGING DATASET MANAGEMENT LEVERAGING DEEP LEARNING FRAMEWORKS IN BREAST CANCER SCREENING, Inchan Hwang

Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap, Srivatsa Mallapragada

Innovative Approaches for Identifying and Reducing Disparity in Machine Learning Model Performance – Bridging the Gap in Binary Classification for Health Informatics, Linglin Zhang

Dissertations from 2023

Quantification of Various Types of Biases in Large Language Models, Sudhashree Sayenju

Dissertations from 2022

Appley: Approximate Shapley Values for Model Explainability in Linear Time, Md Shafiul Alam

Ethical Analytics: A Framework for a Practically-Oriented Sub-Discipline of AI Ethics, Jonathan Boardman

Novel Instance-Level Weighted Loss Function for Imbalanced Learning, Trent Geisler

Debiasing Cyber Incidents – Correcting for Reporting Delays and Under-reporting, Seema Sangari

Dissertations from 2021

Integrated Machine Learning Approaches to Improve Classification performance and Feature Extraction Process for EEG Dataset, Mohammad Masum

A Distance-Based Clustering Framework for Categorical Time Series: A Case Study in Episodes of Care Healthcare Delivery System, Lauren Staples

Dissertations from 2020

A CREDIT ANALYSIS OF THE UNBANKED AND UNDERBANKED: AN ARGUMENT FOR ALTERNATIVE DATA, Edwin Baidoo

Quantitatively Motivated Model Development Framework: Downstream Analysis Effects of Normalization Strategies, Jessica M. Rudd

Data-driven Investment Decisions in P2P Lending: Strategies of Integrating Credit Scoring and Profit Scoring, Yan Wang

A Novel Penalized Log-likelihood Function for Class Imbalance Problem, Lili Zhang

ATTACK AND DEFENSE IN SECURITY ANALYTICS, Yiyun Zhou

Dissertations from 2019

One and Two-Step Estimation of Time Variant Parameters and Nonparametric Quantiles, Bogdan Gadidov

Biologically Interpretable, Integrative Deep Learning for Cancer Survival Analysis, Jie Hao

Deep Embedding Kernel, Linh Le

Ordinal HyperPlane Loss, Bob Vanderheyden

DigitalCommons@Kennesaw State University, ISSN: 2576-6805

10 Compelling Machine Learning Ph.D. Dissertations for 2020

Posted by Daniel Gutierrez, ODSC, August 19, 2020

As a data scientist, an integral part of my work in the field revolves around keeping current with research coming out of academia. I frequently scour arXiv.org for late-breaking papers that show trends and reveal fertile areas of research. Other sources of valuable research developments are in the form of Ph.D. dissertations, the culmination of a doctoral candidate’s work to confer his/her degree. Ph.D. candidates are highly motivated to choose research topics that establish new and creative paths toward discovery in their field of study. Their dissertations are highly focused on a specific problem. If you can find a dissertation that aligns with your areas of interest, consuming the research is an excellent way to do a deep dive into the technology. After reviewing hundreds of recent theses from universities all over the country, I present 10 machine learning dissertations that I found compelling in terms of my own areas of interest.

I hope you’ll find several that match your own fields of inquiry. Each thesis may take a while to consume but will result in hours of satisfying summer reading. Enjoy!

1. Bayesian Modeling and Variable Selection for Complex Data

As we routinely encounter high-throughput data sets in complex biological and environmental research, developing novel models and methods for variable selection has received widespread attention. This dissertation addresses a few key challenges in Bayesian modeling and variable selection for high-dimensional data with complex spatial structures. 

2. Topics in Statistical Learning with a Focus on Large Scale Data

Big data vary in shape and call for different approaches. One type of big data is the tall data, i.e., a very large number of samples but not too many features. This dissertation describes a general communication-efficient algorithm for distributed statistical learning on this type of big data. The algorithm distributes the samples uniformly to multiple machines, and uses a common reference data to improve the performance of local estimates. The algorithm enables potentially much faster analysis, at a small cost to statistical performance.

Another type of big data is the wide data, i.e., too many features but a limited number of samples. It is also called high-dimensional data, to which many classical statistical methods are not applicable. 

This dissertation discusses a method of dimensionality reduction for high-dimensional classification. The method partitions features into independent communities and splits the original classification problem into separate smaller ones. It enables parallel computing and produces more interpretable results.

3. Sets as Measures: Optimization and Machine Learning

The purpose of this machine learning dissertation is to address the following simple question:

How do we design efficient algorithms to solve optimization or machine learning problems where the decision variable (or target label) is a set of unknown cardinality?

Optimization and machine learning have proved remarkably successful in applications requiring the choice of single vectors. Some tasks, in particular many inverse problems, call for the design, or estimation, of sets of objects. When the size of these sets is a priori unknown, directly applying optimization or machine learning techniques designed for single vectors appears difficult. The work in this dissertation shows that a very old idea for transforming sets into elements of a vector space (namely, a space of measures), a common trick in theoretical analysis, generates effective practical algorithms.

4. A Geometric Perspective on Some Topics in Statistical Learning

Modern science and engineering often generate data sets with a large sample size and a comparably large dimension which puts classic asymptotic theory into question in many ways. Therefore, the main focus of this dissertation is to develop a fundamental understanding of statistical procedures for estimation and hypothesis testing from a non-asymptotic point of view, where both the sample size and problem dimension grow hand in hand. A range of different problems are explored in this thesis, including work on the geometry of hypothesis testing, adaptivity to local structure in estimation, effective methods for shape-constrained problems, and early stopping with boosting algorithms. The treatment of these different problems shares the common theme of emphasizing the underlying geometric structure.

5. Essays on Random Forest Ensembles

A random forest is a popular machine learning ensemble method that has proven successful in solving a wide range of classification problems. While other successful classifiers, such as boosting algorithms or neural networks, admit natural interpretations as maximum likelihood, a suitable statistical interpretation is much more elusive for a random forest. The first part of this dissertation demonstrates that a random forest is a fruitful framework in which to study AdaBoost and deep neural networks. The work explores the concept and utility of interpolation, the ability of a classifier to perfectly fit its training data. The second part of this dissertation places a random forest on more sound statistical footing by framing it as kernel regression with the proximity kernel. The work then analyzes the parameters that control the bandwidth of this kernel and discusses useful generalizations.

6. Marginally Interpretable Generalized Linear Mixed Models

A popular approach for relating correlated measurements of a non-Gaussian response variable to a set of predictors is to introduce latent random variables and fit a generalized linear mixed model. The conventional strategy for specifying such a model leads to parameter estimates that must be interpreted conditional on the latent variables. In many cases, interest lies not in these conditional parameters, but rather in marginal parameters that summarize the average effect of the predictors across the entire population. Due to the structure of the generalized linear mixed model, the average effect across all individuals in a population is generally not the same as the effect for an average individual. Further complicating matters, obtaining marginal summaries from a generalized linear mixed model often requires evaluation of an analytically intractable integral or use of an approximation. Another popular approach in this setting is to fit a marginal model using generalized estimating equations. This strategy is effective for estimating marginal parameters, but leaves one without a formal model for the data with which to assess quality of fit or make predictions for future observations. Thus, there exists a need for a better approach.

This dissertation defines a class of marginally interpretable generalized linear mixed models that leads to parameter estimates with a marginal interpretation while maintaining the desirable statistical properties of a conditionally specified model. The distinguishing feature of these models is an additive adjustment that accounts for the curvature of the link function and thereby preserves a specific form for the marginal mean after integrating out the latent random variables. 
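The gap between the effect for an average individual and the average effect across a population can be seen numerically. The sketch below assumes a logistic link with a single Gaussian random intercept, which is one common special case and not necessarily the setting studied in the dissertation; the function names and parameter values are illustrative.

```python
import math
import random

def sigmoid(z):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-z))

def conditional_vs_marginal(eta, sigma_b, n_draws=20_000, seed=0):
    """Compare the mean for an individual with random effect b = 0 against
    the population mean obtained by averaging over b ~ Normal(0, sigma_b^2)."""
    rng = random.Random(seed)
    conditional = sigmoid(eta)  # the "average individual" (b = 0)
    # Monte Carlo approximation of the integral over the random intercept.
    marginal = sum(
        sigmoid(eta + rng.gauss(0.0, sigma_b)) for _ in range(n_draws)
    ) / n_draws
    return conditional, marginal

conditional, marginal = conditional_vs_marginal(eta=2.0, sigma_b=2.0)
print(f"conditional mean: {conditional:.3f}, marginal mean: {marginal:.3f}")
```

Because the link function is curved, the marginal mean is pulled toward 0.5 relative to the conditional mean, and the two coincide only when the random-effect variance is zero.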

7. On the Detection of Hate Speech, Hate Speakers and Polarized Groups in Online Social Media

The objective of this dissertation is to explore the use of machine learning algorithms in understanding and detecting hate speech, hate speakers and polarized groups in online social media. Beginning with a unique typology for detecting abusive language, the work outlines the distinctions and similarities of different abusive language subtasks (offensive language, hate speech, cyberbullying and trolling) and how we might benefit from the progress made in each area. Specifically, the work suggests that each subtask can be categorized based on whether or not the abusive language being studied 1) is directed at a specific individual, or targets a generalized “Other” and 2) the extent to which the language is explicit versus implicit. The work then uses knowledge gained from this typology to tackle the “problem of offensive language” in hate speech detection. 

8. Lasso Guarantees for Dependent Data

Serially correlated high dimensional data are prevalent in the big data era. In order to predict and learn the complex relationship among the multiple time series, high dimensional modeling has gained importance in various fields such as control theory, statistics, economics, finance, genetics and neuroscience. This dissertation studies a number of high dimensional statistical problems involving different classes of mixing processes. 

9. Random forest robustness, variable importance, and tree aggregation

Random forest methodology is a nonparametric, machine learning approach capable of strong performance in regression and classification problems involving complex data sets. In addition to making predictions, random forests can be used to assess the relative importance of feature variables. This dissertation explores three topics related to random forests: tree aggregation, variable importance, and robustness. 

10. Climate Data Computing: Optimal Interpolation, Averaging, Visualization and Delivery

This dissertation solves two important problems in the modern analysis of big climate data. The first is the efficient visualization and fast delivery of big climate data, and the second is a computationally extensive principal component analysis (PCA) using spherical harmonics on the Earth’s surface. The second problem creates a way to supply the data for the technology developed in the first. These two problems are computationally difficult, such as the representation of higher order spherical harmonics Y400, which is critical for upscaling weather data to almost infinitely fine spatial resolution.

I hope you enjoyed learning about these compelling machine learning dissertations.

Editor’s note: Interested in more data science research? Check out the Research Frontiers track at ODSC Europe this September 17-19 or the ODSC West Research Frontiers track this October 27-30.

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.

Instructions for MSc Thesis

Before the thesis

Before you start work on your thesis, it is important to put some thought into the choice of topic and familiarize yourself with the criteria and procedure. To do that, follow these steps, in this order:

Step 0: Read the university instructions.

Read the MSc thesis instructions and grading criteria on the university website. Computer Science Master's program: [link]. Data Science Master's program: [link].

Step 1: Choose a topic.

Choose a topic among the ones listed on the group's webpage [link]. You can also propose your own topic. In this case, you must explain what the main contribution of the thesis will be and identify at least one scientific publication that is related to the topic you propose.

Step 2: Contact us.

Submit the application form [link] to let us know of your interest in doing your thesis in the group. Note: If you contact us, please be ready to start work on the thesis within one month.

Step 3: Agree on the topic.

We have a brief discussion about the topic and devise a high-level plan for thesis work and content. We also agree on a start date for the thesis work. In addition, you should contact a second evaluator for the thesis.

Thesis timeline

Below you find the milestones after you have started work on the thesis. In parentheses, you find an estimate of when each milestone occurs. The thesis work ends when you submit it for approval. The total duration from start to end of the thesis should be about four months.

Milestone #0: Thesis outline (at most 3 weeks from the start).

You create a first outline of the thesis. The outline should contain the titles of the chapters, along with a (tentative) list of sections and contents. An indicative template for the outline is shown below on this page.

Milestone #1: A draft with first results (about 2 months from the start).

All chapters should contain some readable content (not necessarily polished). Most importantly, some results should already be described. Ideally, you should be able to complete and refine the results within one more month.

Milestone #2: A draft with all results (about 1 month before the end).

Most content should now be in the draft. Some polishing remains and some results may still be refined. Notify the second evaluator that you are near the end of the thesis work. Optionally, you may send the thesis draft and receive preliminary comments from the second evaluator.

Milestone #3: Submit the thesis for approval (end of thesis work).

You will receive a grade and comments after the next program board's meeting.

Supervision

What you can expect from the supervisor:

  • Comments for the thesis draft after each milestone (see timeline above) and, if necessary, a meeting.
  • Suggestions for how to proceed in cases when you encounter a major hurdle.

In addition, you are welcome to participate in the group meetings and discuss your thesis work with other group members.

Note however that one of the grading criteria for the thesis is whether you worked independently -- and in the end, the thesis should be your own work.

Template for Thesis Outline

Below you find a suggested template for the outline of the thesis. You may adapt it to your work, of course (e.g., change chapter titles or structure).

Abstract

A summary of the thesis that mentions the broader topic of the thesis and why it is important; the research question or technical problem addressed by the thesis; the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation) and results.

Chapter 1: Introduction

The introduction should motivate the thesis and give a longer summary. It should be written in a way that allows anyone in your program to understand it, even if they are not experts in the topic.

  • What is the broader topic of the thesis?
  • Why is it important?
  • What research question(s) or technical problems does the thesis address?
  • What are the most related works from the literature on the topic? How does the thesis differ from what has already been done?
  • What are the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation)?
  • What are the results?

Chapter 2: Related literature

Organize this chapter in sections, with one section for each research area that is related to your thesis. For each research area, cite all the publications that are related to your topic, and describe at least the most important of them.

Chapter 3: Preliminaries

In this chapter, place the information that is necessary for you to describe the contributions and results of the thesis. It may be different from thesis to thesis, but could include sections about:

  • Setting. Define the terms and notation you will be using. State any assumptions you make across the thesis.
  • Background on methods. Describe existing methods from the literature (e.g., algorithms or ML models) that you use for your work.
  • Data (esp. for a Data Science thesis). If the main contribution is data analysis, then describe the data here, before the analysis.

Chapter 4: Methodological contribution

For a Computer Science thesis, this part typically describes the algorithm(s) developed for the thesis. For a Data Science thesis, this part typically describes the method for the analysis.

Chapter 5: Results

This chapter describes the results obtained when the methods of Chapter 4 are used on data.

For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets. For a Data Science thesis, this part typically describes the findings of the analysis.

The chapter should also describe what insights are obtained from the results.

Chapter 6: Conclusion

  • Summarize the contribution of the thesis.
  • Provide an evaluation: are the results conclusive, are there limitations in the contribution?
  • How would you extend the thesis, what can be done next on the same topic?

Department of Computer Science

Thesis Projects and Research in Data Science

The Master's thesis is a mandatory course of the Master's program in Data Science. The thesis is supervised by a professor from the data science faculty list.

Research in Data Science is a core elective for students in Data Science under the supervision of a data science professor.

Research in Data Science

The project is independent work under the supervision of a member of the data science faculty.

Only students who have passed at least one core course in Data Management and Processing, and one core course in Data Analysis can start with a research project.

Before starting, the project must be registered in mystudies, and a project description must be submitted to the studies administration by e-mail at the start of the project (for the address, see Contact).

Master's Thesis

The Master's Thesis requires 6 months of full-time study/work, and we strongly discourage you from attending any courses in parallel. We recommend that you acquire all course credits before the start of the Master’s thesis. The topic for the Master’s thesis must be chosen within Data Science.

Before starting a Master’s thesis, it is important to agree with your supervisor on the task and the assessment scheme. Both have to be documented thoroughly. You electronically register the Master’s thesis in mystudies.

It is possible to complete the Master’s thesis in industry provided that a professor involved in the Data Science Master’s program supervises the thesis and your tutor approves it.

Further details on the internal regulations of the Master’s thesis can be downloaded from the following website: www.inf.ethz.ch/studies/forms-and-documents.html.

Overview Master's Theses Projects

Chair of Programming Methodology

  • Prof. Dr. Martin Vechev

Institute for Computing Platforms

  • Prof. Dr. Gustavo Alonso
  • Prof. Dr. Torsten Hoefler
  • Prof. Dr. Ana Klimovic
  • Prof. Dr. Timothy Roscoe

Institute for Machine Learning

  • Prof. Dr. Valentina Boeva
  • Prof. Dr. Joachim Buhmann
  • Prof. Dr. Ryan Cotterell
  • Prof. Dr. Menna El-Assady
  • Prof. Dr. Niao He
  • Prof. Dr. Thomas Hofmann
  • Prof. Dr. Andreas Krause
  • Prof. Dr. Gunnar Rätsch
  • Prof. Dr. Mrinmaya Sachan
  • Prof. Dr. Bernhard Schölkopf
  • Prof. Dr. Julia Vogt

Institute for Pervasive Computing

  • Prof. Dr. Otmar Hilliges

Institute of Computer Systems

  • Prof. Dr. Markus Püschel

Institute of Information Security

  • Prof. Dr. David Basin
  • Prof. Dr. Srdjan Capkun
  • Prof. Dr. Florian Tramèr

Institute of Theoretical Computer Science

  • Prof. Dr. Bernd Gärtner

Institute of Visual Computing

  • Prof. Dr. Markus Gross
  • Prof. Dr. Marc Pollefeys
  • Prof. Dr. Olga Sorkine
  • Prof. Dr. Siyu Tang

Disney Research Zurich

  • Prof. Dr. Robert Sumner

Automatic Control Laboratory

  • Prof. Dr. Florian Dörfler
  • Prof. Dr. John Lygeros

Communication Technology Laboratory

  • Prof. Dr. Helmut Bölcskei

Computer Engineering and Networks Laboratory

  • Prof. Dr. Laurent Vanbever
  • Prof. Dr. Roger Wattenhofer

Computer Vision Laboratory

  • Prof. Dr. Ender Konukoglu
  • Prof. Dr. Luc Van Gool
  • Prof. Dr. Fisher Yu

Institute for Biomedical Engineering

  • Prof. Dr. Klaas Enno Stephan

Integrated Systems Laboratory

  • Prof. Dr. Luca Benini
  • Prof. Dr. Christoph Studer

Signal and Information Processing Laboratory (ISI)

  • Prof. Dr. Amos Lapidoth
  • Prof. Dr. Hans-Andrea Loeliger

D-MATH does not publish Master's thesis projects. If you are interested, contact the professor directly.

FIM - Institute for Mathematical Research

  • Prof. Dr. Alessio Figalli

Financial Mathematics

  • Prof. Dr. Josef Teichmann

Institute for Operations Research

  • Prof. Dr. Robert Weismantel
  • Prof. Dr. Rico Zenklusen

RiskLab Switzerland

  • Prof. Dr. Patrick Cheridito
  • Prof. Dr. Mario Valentin Wüthrich

Seminar for Applied Mathematics

  • Prof. Dr. Rima Alaifari
  • Prof. Dr. Siddhartha Mishra

Seminar for Statistics

  • Prof. Dr. Afonso Bandeira
  • Prof. Dr. Peter Bühlmann
  • Prof. Dr. Yuansi Chen
  • Prof. Dr. Nicolai Meinshausen
  • Prof. Dr. Jonas Peters
  • Prof. Dr. Johanna Ziegel

Law, Economics, and Data Science Group

  • Prof. Dr. Elliott Ash (D-GESS)

Institute for Geodesy and Photogrammetry

  • Prof. Dr. Konrad Schindler (D-BSSE)

What Is a Thesis? | Ultimate Guide & Examples

Published on September 14, 2022 by Tegan George. Revised on April 16, 2024.

A thesis is a type of research paper based on your original research. It is usually submitted as the final step of a master’s program or a capstone to a bachelor’s degree.

Writing a thesis can be a daunting experience. Other than a dissertation, it is one of the longest pieces of writing students typically complete. It relies on your ability to conduct research from start to finish: choosing a relevant topic, crafting a proposal, designing your research, collecting data, developing a robust analysis, drawing strong conclusions, and writing concisely.

Thesis template

You can also download our full thesis template in the format of your choice below. Our template includes a ready-made table of contents, as well as guidance for what each chapter should include. It’s easy to make it your own, and can help you get started.

Download Word template Download Google Docs template

Table of contents

  • Thesis vs. thesis statement
  • How to structure a thesis
  • Acknowledgements or preface
  • List of figures and tables
  • List of abbreviations
  • Introduction
  • Literature review
  • Methodology
  • Reference list
  • Proofreading and editing
  • Defending your thesis
  • Other interesting articles
  • Frequently asked questions about theses

Thesis vs. thesis statement

You may have heard the word thesis as a standalone term or as a component of academic writing called a thesis statement. Keep in mind that these are two very different things.

  • A thesis statement is a very common component of an essay, particularly in the humanities. It usually comprises 1 or 2 sentences in the introduction of your essay, and should clearly and concisely summarize the central points of your academic essay.
  • A thesis is a long-form piece of academic writing, often taking more than a full semester to complete. It is generally a degree requirement for Master’s programs, and is also sometimes required to complete a bachelor’s degree in liberal arts colleges.
  • In the US, a dissertation is generally written as a final step toward obtaining a PhD.
  • In other countries (particularly the UK), a dissertation is generally written at the bachelor’s or master’s level.

How to structure a thesis

The final structure of your thesis depends on a variety of components, such as:

  • Your discipline
  • Your theoretical approach

Humanities theses are often structured more like a longer-form essay. Just like in an essay, you build an argument to support a central thesis.

In both hard and social sciences, theses typically include an introduction, literature review, methodology section, results section, discussion section, and conclusion section. These are each presented in their own dedicated section or chapter. In some cases, you might want to add an appendix.

Thesis examples

We’ve compiled a short list of thesis examples to help you get started.

  • Example thesis #1: “Abolition, Africans, and Abstraction: the Influence of the ‘Noble Savage’ on British and French Antislavery Thought, 1787-1807” by Suchait Kahlon.
  • Example thesis #2: “‘A Starving Man Helping Another Starving Man’: UNRRA, India, and the Genesis of Global Relief, 1943-1947” by Julian Saint Reiman.

Title page

The very first page of your thesis contains all necessary identifying information, including:

  • Your full title
  • Your full name
  • Your department
  • Your institution and degree program
  • Your submission date

Sometimes the title page also includes your student ID, the name of your supervisor, or the university’s logo. Check out your university’s guidelines if you’re not sure.

Read more about title pages

Acknowledgements or preface

The acknowledgements section is usually optional. Its main point is to allow you to thank everyone who helped you in your thesis journey, such as supervisors, friends, or family. You can also choose to write a preface, but it’s typically one or the other, not both.

Read more about acknowledgements Read more about prefaces

Abstract

An abstract is a short summary of your thesis. Usually a maximum of 300 words long, it should include brief descriptions of your research objectives, methods, results, and conclusions. Though it may seem short, it introduces your work to your audience, serving as a first impression of your thesis.

Read more about abstracts

Table of contents

A table of contents lists all of your sections, plus their corresponding page numbers and subheadings if you have them. This helps your reader seamlessly navigate your document.

Your table of contents should include all the major parts of your thesis. In particular, don’t forget the appendices. If you used heading styles, it’s easy to generate a table of contents automatically in Microsoft Word.

Read more about tables of contents

List of figures and tables

While not mandatory, if you used a lot of tables and/or figures, it’s nice to include a list of them to help guide your reader. It’s also easy to generate one of these in Word: just use the “Insert Caption” feature.

Read more about lists of figures and tables

List of abbreviations

If you have used a lot of industry- or field-specific abbreviations in your thesis, you should include them in an alphabetized list of abbreviations. This way, your readers can easily look up any meanings they aren’t familiar with.

Read more about lists of abbreviations

Glossary

Relatedly, if you find yourself using a lot of very specialized or field-specific terms that may not be familiar to your reader, consider including a glossary. Alphabetize the terms you want to include, and give each a brief definition.

Read more about glossaries

Introduction

An introduction sets up the topic, purpose, and relevance of your thesis, as well as expectations for your reader. This should:

  • Ground your research topic, sharing any background information your reader may need
  • Define the scope of your work
  • Introduce any existing research on your topic, situating your work within a broader problem or debate
  • State your research question(s)
  • Outline (briefly) how the remainder of your work will proceed

In other words, your introduction should clearly and concisely show your reader the “what, why, and how” of your research.

Read more about introductions

Literature review

A literature review helps you gain a robust understanding of any extant academic work on your topic, encompassing:

  • Selecting relevant sources
  • Determining the credibility of your sources
  • Critically evaluating each of your sources
  • Drawing connections between sources, including any themes, patterns, conflicts, or gaps

A literature review is not merely a summary of existing work. Rather, your literature review should ultimately lead to a clear justification for your own research, perhaps via:

  • Addressing a gap in the literature
  • Building on existing knowledge to draw new conclusions
  • Exploring a new theoretical or methodological approach
  • Introducing a new solution to an unresolved problem
  • Definitively advocating for one side of a theoretical debate

Read more about literature reviews

Theoretical framework

Your literature review can often form the basis for your theoretical framework, but these are not the same thing. A theoretical framework defines and analyzes the concepts and theories that your research hinges on.

Read more about theoretical frameworks

Methodology

Your methodology chapter shows your reader how you conducted your research. It should be written clearly and methodically, easily allowing your reader to critically assess the credibility of your argument. Furthermore, your methods section should convince your reader that your method was the best way to answer your research question.

A methodology section should generally include:

  • Your overall approach (quantitative vs. qualitative)
  • Your research methods (e.g., a longitudinal study)
  • Your data collection methods (e.g., interviews or a controlled experiment)
  • Any tools or materials you used (e.g., computer software)
  • The data analysis methods you chose (e.g., statistical analysis, discourse analysis)
  • A strong, but not defensive, justification of your methods

Read more about methodology sections

Results

Your results section should highlight what your methodology discovered. These two sections work in tandem, but shouldn’t repeat each other. While your results section can include hypotheses or themes, don’t include any speculation or new arguments here.

Your results section should:

  • State each (relevant) result with any (relevant) descriptive statistics (e.g., mean, standard deviation) and inferential statistics (e.g., test statistics, p values)
  • Explain how each result relates to the research question
  • Determine whether the hypothesis was supported
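
As a rough illustration of reporting descriptive and inferential statistics side by side, the sketch below computes a mean, a sample standard deviation, and a test statistic using only the Python standard library. The function names `describe` and `welch_t` and the score values are invented for this example, and Welch's t-test is an assumed choice; the appropriate test depends on your data and research design.

```python
import math
import statistics


def describe(sample):
    """Return the mean and sample standard deviation of a list of numbers."""
    return statistics.mean(sample), statistics.stdev(sample)


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, sd_a = describe(sample_a)
    mean_b, sd_b = describe(sample_b)
    # Squared standard error of each group mean.
    se2_a = sd_a ** 2 / len(sample_a)
    se2_b = sd_b ** 2 / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical accuracy scores, invented for illustration only.
baseline = [0.71, 0.74, 0.69, 0.73, 0.72]
proposed = [0.78, 0.80, 0.77, 0.81, 0.79]

for name, scores in (("baseline", baseline), ("proposed", proposed)):
    mean, sd = describe(scores)
    print(f"{name}: mean={mean:.3f}, sd={sd:.3f}")

t, df = welch_t(baseline, proposed)
print(f"Welch t={t:.2f}, df={df:.1f}")
```

A p value would then be read from the t distribution with `df` degrees of freedom, for instance with a statistics package; that step is left out here to keep the sketch dependency-free.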

Additional data (like raw numbers or interview transcripts) can be included as an appendix. You can include tables and figures, but only if they help the reader better understand your results.

Read more about results sections

Discussion

Your discussion section is where you can interpret your results in detail. Did they meet your expectations? How well do they fit within the framework that you built? You can refer back to any relevant source material to situate your results within your field, but leave most of that analysis in your literature review.

For any unexpected results, offer explanations or alternative interpretations of your data.

Read more about discussion sections

Conclusion

Your thesis conclusion should concisely answer your main research question. It should leave your reader with an ultra-clear understanding of your central argument, and emphasize what your research specifically has contributed to your field.

Why does your research matter? What recommendations for future research do you have? Lastly, wrap up your work with any concluding remarks.

Read more about conclusions

Reference list

In order to avoid plagiarism, don’t forget to include a full reference list at the end of your thesis, citing the sources that you used. Choose one citation style and follow it consistently throughout your thesis, taking note of the formatting requirements of each style.

Which style you choose is often set by your department or your field, but common styles include MLA, Chicago, and APA.

Appendices

In order to stay clear and concise, your thesis should include the most essential information needed to answer your research question. However, chances are you have many contributing documents, like interview transcripts or survey questions. These can be added as appendices, to save space in the main body.

Read more about appendices

Proofreading and editing

Once you’re done writing, the next part of your editing process begins. Leave plenty of time for proofreading and editing prior to submission. Nothing looks worse than grammar mistakes or sloppy spelling errors!

Consider using a professional thesis editing service or grammar checker to make sure your final project is perfect.

Defending your thesis

Once you’ve submitted your final product, it’s common practice to have a thesis defense, an oral component of your finished work. This is scheduled by your advisor or committee, and usually entails a presentation and Q&A session.

After your defense, your committee will meet to determine if you deserve any departmental honors or accolades. However, keep in mind that defenses are usually just a formality. If there are any serious issues with your work, these should be resolved with your advisor way before a defense.

Other interesting articles

If you want to know more about AI for academic writing, AI tools, or research bias, make sure to check out some of our other articles with explanations and examples or go directly to our tools!

Research bias

  • Survivorship bias
  • Self-serving bias
  • Availability heuristic
  • Halo effect
  • Hindsight bias
  • Deep learning
  • Generative AI
  • Machine learning
  • Reinforcement learning
  • Supervised vs. unsupervised learning

(AI) Tools

  • Grammar Checker
  • Paraphrasing Tool
  • Text Summarizer
  • AI Detector
  • Plagiarism Checker
  • Citation Generator

Frequently asked questions about theses

The conclusion of your thesis or dissertation shouldn’t take up more than 5–7% of your overall word count.

If you only used a few abbreviations in your thesis or dissertation, you don’t necessarily need to include a list of abbreviations.

If your abbreviations are numerous, or if you think they won’t be known to your audience, it’s never a bad idea to add one. They can also improve readability, minimizing confusion about abbreviations unfamiliar to your reader.

When you mention different chapters within your text, it’s considered best to use Roman numerals for most citation styles. However, the most important thing here is to remain consistent whenever using numbers in your dissertation.

A thesis or dissertation outline is one of the most critical first steps in your writing process. It helps you to lay out and organize your ideas and can provide you with a roadmap for deciding what kind of research you’d like to undertake.

Generally, an outline contains information on the different sections included in your thesis or dissertation, such as:

  • Your anticipated title
  • Your abstract
  • Your chapters (sometimes subdivided into further topics like literature review, research methods, avenues for future research, etc.)

A thesis is typically written by students finishing up a bachelor’s or Master’s degree. Some educational institutions, particularly in the liberal arts, have mandatory theses, but they are often not mandatory to graduate from bachelor’s degrees. It is more common for a thesis to be a graduation requirement from a Master’s degree.

Even if not mandatory, you may want to consider writing a thesis if you:

  • Plan to attend graduate school soon
  • Have a particular topic you’d like to study more in-depth
  • Are considering a career in research
  • Would like a capstone experience to tie up your academic experience

Cite this Scribbr article

George, T. (2024, April 16). What Is a Thesis? | Ultimate Guide & Examples. Scribbr. Retrieved September 3, 2024, from https://www.scribbr.com/dissertation/thesis/

Eindhoven University of Technology research portal

Data Science

  • Mathematics and Computer Science

Student theses

  • 1 - 50 out of 781 results

3D face reconstruction using deep learning

Student thesis : Master

3D fingerprint detection in ancient museum sculptures from CT data

Achieving long term fairness through curiosity driven reinforcement learning: how intrinsic motivation influences fairness in algorithmic decision making

A coherent temporal visualization of algorithm dynamics over large graphs

Student thesis : Bachelor

A comparative study for process mining approaches in a real-life environment

  • A comparative study on unsupervised deep learning methods for X-ray image denoising with multi-image Self2Self and single frequency denoising
  • A comparison of quantitative evaluation and human perception of quality of generated images of faces
  • A computational biology framework: a data analysis tool to support biomedical engineers in their research
  • Active learning for text classification
  • Active learning in VAE latent space
  • Activity recognition using deep learning in videos under clinical setting
  • A dashboard for emulating LSTM-based predictive process monitoring and its qualitative evaluation
  • A data cleaning assistant
  • A data cleaning assistant for machine learning
  • Adding formal specifications to a legacy code generator
  • A deep learning approach for clustering a multi-class dataset
  • A detailed understanding of actor involvement in business processes
  • Adopting the factorized model of execution in a graph database engine
  • Advances in understanding and initializing einsum networks
  • Adversarial attacks on deep dreams
  • Adversarial datasets through sentence length and conjunctions
  • Adversarial NLP benchmarks: data characteristics complicating automated generation of adversarial examples
  • Adversarial noise benchmarking on image caption
  • Aerial imagery pixel-level segmentation
  • Aethra DB: optimising analytical processing through query-tailored code generation
  • A feasibility study on automated database exercise generation with large language models
  • A forecasting framework for recirculation in baggage handling systems
  • A framework for understanding business process remaining time predictions
  • Age(ing) in software development
  • Aggregated information visualization for process alignments
  • A heuristic approach for the VRPTW using dual information of its LP formulation
  • A hybrid model for pedestrian motion prediction
  • Algorithms for center-based trajectory clustering
  • Aligning incompatible state (and action) representations through disentanglement for multi-task transfer in offline RL
  • Allocation decision-making in service supply chain with deep reinforcement learning
  • A method for identifying undesired medical treatment variants using process and data mining techniques
  • A method to determine actual time worked from event logs
  • An adaptive and scrutable math tutoring system
  • An adversarial analysis of inference capabilities acquired by state-of-the-art NLP models from the RTE dataset
  • Analysis and improvement of process models with respect to key performance indicators: a debt collection case study
  • Analysis of the influence of routines on task execution performance
  • Analyzing application usage logs to understand the users
  • Analyzing causes of outlier cascade behavior in baggage handling systems
  • Analyzing collaborations and routines in event graphs using statistics and pattern mining
  • Analyzing complexity progression and complexity correlation of SQL questions on Stack Overflow
  • Analyzing customer journey with process mining: from discovery to recommendations
  • Analyzing data of operating rooms in hospitals to reduce rework
  • Analyzing policy gradient approaches towards rapid policy transfer
  • Analyzing routines and habits in event graphs using statistics and pattern mining

Harvard University Theses, Dissertations, and Prize Papers

The Harvard University Archives’ collection of theses, dissertations, and prize papers documents the wide range of academic research undertaken by Harvard students over the course of the University’s history.

Beyond their value as pieces of original research, these collections document the history of American higher education, chronicling both the growth of Harvard as a major research institution and the development of numerous academic fields. They are also an important source of biographical information, offering insight into the academic careers of the authors.

Printed list of works awarded the Bowdoin prize in 1889-1890.

Spanning from the ‘theses and quaestiones’ of the 17th and 18th centuries to the current yearly output of student research, they include both the first Harvard Ph.D. dissertation (by William Byerly, Ph.D. 1873) and the dissertation of the first woman to earn a doctorate from Harvard (Lorna Myrtle Hodgkinson, Ed.D. 1922).

Other highlights include:

  • The collection of Mathematical theses, 1782-1839
  • The 1895 Ph.D. dissertation of W.E.B. Du Bois, The suppression of the African slave trade in the United States, 1638-1871
  • Ph.D. dissertations of astronomer Cecilia Payne-Gaposchkin (Ph.D. 1925) and physicist John Hasbrouck Van Vleck (Ph.D. 1922)
  • Undergraduate honors theses of novelist John Updike (A.B. 1954), filmmaker Terrence Malick (A.B. 1966), and U.S. poet laureate Tracy Smith (A.B. 1994)
  • Undergraduate prize papers and dissertations of philosophers Ralph Waldo Emerson (A.B. 1821), George Santayana (Ph.D. 1889), and W.V. Quine (Ph.D. 1932)
  • Undergraduate honors theses of U.S. President John F. Kennedy (A.B. 1940) and Chief Justice John Roberts (A.B. 1976)

What does a prize-winning thesis look like?

If you're a Harvard undergraduate writing your own thesis, it can be helpful to review recent prize-winning theses. The Harvard University Archives has made available for digital lending all of the Thomas Hoopes Prize winners from the 2019-2021 academic years.

Accessing These Materials

How to access materials at the Harvard University Archives

How to find and request dissertations, in person or virtually

How to find and request undergraduate honors theses

How to find and request Thomas Temple Hoopes Prize papers

How to find and request Bowdoin Prize papers

  • Phone: 617-495-2461

Related Collections

  • Harvard faculty personal and professional archives
  • Harvard student life collections: arts, sports, politics and social life
  • Access materials at the Harvard University Archives

Master’s Thesis Presentation • Data Systems • Robust Recursive Query Parallelization in Graph Database Management Systems

Please note: This master’s thesis presentation will take place in DC 2310.

Anurag Chakraborty, Master’s candidate David R. Cheriton School of Computer Science

Supervisor : Professor Semih Salihoğlu

Recursive joins such as shortest path and variable length path queries are a core feature set of modern graph database management systems (GDBMS). Since these queries tend to be computationally expensive and may suffer from high execution time, they require efficient parallel processing using multiple cores to achieve good performance. Existing work on parallel query processing includes the morsel-driven parallelism approach that distributes a unit of work (denoted as “morsel”) to threads for parallel execution.

We revisit this technique in the context of parallelization of recursive joins in GDBMS and discuss how the traditional approach of morsel-driven query execution is inadequate to tackle recursive join queries. We show how this approach can be modified to better accommodate scalable parallelization of recursive joins. We further describe how this modified parallel query execution approach has been integrated into Kuzu, an embedded disk-based columnar GDBMS. Compared to vanilla morsel-driven parallelism, our modified parallel query execution approach can be orders of magnitude faster and scales well on multiple cores.
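
To make the baseline idea concrete, here is a minimal sketch of plain morsel-driven dispatch: worker threads repeatedly claim the next fixed-size slice ("morsel") of the input from a shared cursor until the input is exhausted. It does not reproduce the modified approach from the thesis or anything in Kuzu; the function name `run_morsel_driven`, its parameters, and the toy workload are all assumptions made for illustration.

```python
import threading


def run_morsel_driven(rows, process_morsel, num_threads=4, morsel_size=1000):
    """Apply process_morsel to fixed-size slices of rows, claimed on demand by worker threads."""
    next_start = 0            # shared cursor into rows
    lock = threading.Lock()   # protects the cursor and the result list
    partials = []

    def worker():
        nonlocal next_start
        while True:
            # Claim the next morsel; stop when the input is exhausted.
            with lock:
                start = next_start
                if start >= len(rows):
                    return
                next_start = start + morsel_size
            result = process_morsel(rows[start:start + morsel_size])
            with lock:
                partials.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return partials


def count_even(morsel):
    """Toy per-morsel operator: count the even values in one slice."""
    return sum(1 for value in morsel if value % 2 == 0)


# Hypothetical input: 10,000 integers standing in for a scanned column.
rows = list(range(10_000))
total = sum(run_morsel_driven(rows, count_even))
print(total)
```

Because threads claim work on demand, a thread that finishes early simply takes another morsel, which is what balances the load. In CPython the global interpreter lock limits the speedup for CPU-bound work, so this sketch shows the scheduling pattern only, not the performance.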

COMMENTS

  1. How to write a great data science thesis

    They will stress the importance of structure, substance and style. They will urge you to write down your methodology and results first, then progress to the literature review, introduction and conclusions and to write the summary or abstract last. To write clearly and directly with the reader's expectations always in mind.

  2. Research Topics & Ideas: Data Science

  3. 10 Best Research and Thesis Topic Ideas for Data Science in 2022

    The best way to strengthen a resume is to take part in different data science projects. In this article, we have listed 10 such research and thesis topic ideas to take up as data science projects in 2022. Handling practical video analytics in a distributed cloud: With increased dependency on the internet ...

  4. data science Latest Research Papers

  5. Computational and Data Sciences (PhD) Dissertations

    Computational and Data Sciences (PhD) Dissertations. Below is a selection of dissertations from the Doctor of Philosophy in Computational and Data Sciences program in Schmid College that have been included in Chapman University Digital Commons. Additional dissertations from years prior to 2019 are available through the Leatherby Libraries ...

  6. Thesis/Capstone for Master's in Data Science

    Data Science; Capstone and Thesis Overview; Capstone and Thesis Overview. Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can ...

  7. MIT Theses

    MIT Theses - DSpace@MIT

  8. Five Tips For Writing A Great Data Science Thesis

    Although educational programs, conventions and thesis requirements vary wildly, I hope to offer some common guidelines for any student currently working on a Data Science thesis. The article offers five guidance points, but may effectively be summarized in a single line: "Write for your reader, not for yourself."

  9. Luigi's guide to writing Master's theses (in Data Science)

    Luigi Acerbi, University of Helsinki, Finland. Last edited: 30 Aug 2024 (added section on Thesis Review). These recommendations are aimed primarily at my students from the Master's Programme in Data Science at the University of Helsinki, but many points are likely to apply to related programmes and other institutions. In fact, most of this guide generalizes to scientific academic writing in ...

  10. 17 Compelling Machine Learning Ph.D. Dissertations

    This dissertation revisits and makes progress on some old but challenging problems concerning least squares estimation, the work-horse of supervised machine learning. Two major problems are addressed: (i) least squares estimation with heavy-tailed errors, and (ii) least squares estimation in non-Donsker classes.

  11. Data Science Masters Theses

    Data Science Masters Theses. The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis.

  12. MS Thesis Archive

    Elliott Barinberg, M.S. Data Science. Within this thesis work, the applications of data collection, machine learning, and data visualization were used on National Hockey League (NHL) shot data collected between the 2014-2015 season and the 2022-2023 season. Modeling sports data to better understand player evaluation has always been a goal of ...

  13. Doctor of Data Science and Analytics Dissertations

    The Ph.D. in Data Science and Analytics is an advanced degree with a dual focus of application and research - where students will engage in real world business problems, which will inform and guide their research interests. We launched the first formal PhD program in Data Science in 2015. Our program sits at the intersection of computer science ...

  14. 10 Compelling Machine Learning Ph.D. Dissertations for 2020

    This dissertation explores three topics related to random forests: tree aggregation, variable importance, and robustness. 10. Climate Data Computing: Optimal Interpolation, Averaging, Visualization and Delivery. This dissertation solves two important problems in the modern analysis of big climate data.

  15. PDF Applied Data Science Master Thesis

    Preface. This master's thesis was undertaken as part of the Applied Data Science program at Utrecht University. Two teams of master's students collaborated with Inversable BV and Intergas Verwarming BV, which provided the teams with the dataset collected for the Demoproject Hybride.

  16. PDF Master Thesis: Data Science and Marketing Analytics

    Erasmus School of Economics. Master Thesis: Data Science and Marketing Analytics. Interpretable Machine Learning for Attribution Modeling. A Machine Learning Approach for Conversion Attribution in Digital Marketing Student name: Jordy Martodipoetro Student number: 454072 Supervisor: Dr. Kathrin Gruber Second assessor: Prof. Bas Donkers Date ...

  17. Instructions for MSc Thesis

    For a Data Science thesis, this part typically describes the method for the analysis. Chapter 5: Results. This chapter describes the results obtained when the methods of Chapter 4 are used on data. For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets.

  18. Thesis Projects and Research in DS

    Thesis Projects and Research in DS. The Master's thesis is a mandatory course of the Master's program in Data Science. The thesis is supervised by a professor of the data science faculty list. Research in Data Science is a core elective for students in Data Science under the supervision of a data science professor.

  19. PDF Thesis topics for the master thesis Data Science and Business Analytics

    thesis is an exploration by well-motivated simulation scenarios. (3) Find/collect an appropriate set of data to illustrate the method. The context of the data should be explained, as well as a discussion of the results and an interpretation for the context of the data. Main reference: A. Fisher, C. Rudin, F. Dominici (2019).

  20. What Is a Thesis?

    What Is a Thesis? | Ultimate Guide & Examples

  21. PDF University of Washington

  22. Data Science

    Analyzing data of operating rooms in hospitals to reduce rework van der Schoot, F. (Author), Medeiros de Carvalho, R. (Supervisor 1) & Broeren, J. (External coach), 23 Sept 2021 Student thesis : Master

  23. Harvard University Theses, Dissertations, and Prize Papers

  24. Master's Thesis Presentation • Data Systems • Robust Recursive Query

    Master's Thesis Presentation • Data Systems • Robust Recursive Query Parallelization in Graph Database Management Systems. Wednesday, September 11, 2024 11:00 am - 12:00 pm EDT (GMT -04:00