Date of Award
5-2023
Document Type
Project
Degree Name
Master of Science in Information Systems and Technology
Department
College of Business and Public Administration
First Reader/Committee Chair
Benedict, Bailey
Abstract
Social media is a popular channel for news consumption, but it is often described as a double-edged sword: while it is user-friendly and low-cost, it also allows fake news to spread rapidly, which is detrimental to society, businesses, and consumers. Fake news detection is therefore an emerging field. However, a lack of resources, such as large labeled datasets, has kept researchers from developing a universal machine learning model that is fast, efficient, and reliable enough to stop this proliferation. The goal of this culminating experience project is to explore how varying dataset sizes affect the accuracy of a machine learning model. The research questions are: Q1) How do large volumes of fake news data affect the accuracy of a machine learning model? Q2) As one increases the volume of data fed into a machine learning model from small to large datasets, what is the cutoff accuracy percentage? To answer these questions, datasets of various sizes collected from Kaggle were fed into a Naïve Bayes model. All three datasets were then combined to see whether the model's accuracy improves as more data is fed into it.

The findings for each question are: 1) Larger datasets do increase accuracy because there is more data to train and test on. 2) The cutoff accuracy depends on the number of unique values within the dataset. Since that number is not fixed, large datasets can be expected to reach a cutoff accuracy above 90%, provided the data is cleaned and pre-processed, whereas small-to-medium datasets achieve accuracy scores of roughly 70%-90%. An accuracy score below 70% means the model is highly unreliable and the dataset is too small. For instance, Dataset 1 achieved an accuracy score of 66%, Dataset 2 83%, and Dataset 3 92%. To effectively study and experiment with building an optimized model, one must therefore use a large dataset. Directions for future study that emerged from this project include building a new and improved fact-checking website that quickly and accurately processes large databases, and evaluating how well Naïve Bayes works with other modalities, such as images and videos.
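For illustration, below is a minimal sketch of the kind of experiment the abstract describes: training and evaluating a Naïve Bayes classifier on fake news datasets of increasing size and comparing accuracy. It assumes scikit-learn's MultinomialNB with TF-IDF features; the file names and the "text"/"label" column names are hypothetical stand-ins for the Kaggle datasets, and the project's actual pre-processing steps are not shown.

```python
# Hypothetical sketch: compare Naive Bayes accuracy across dataset sizes.
# File names and the "text"/"label" columns are illustrative placeholders,
# not the project's actual Kaggle files.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def evaluate(csv_path: str) -> float:
    """Train and test a Naive Bayes model on one dataset; return accuracy."""
    df = pd.read_csv(csv_path).dropna(subset=["text", "label"])
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42
    )
    # TF-IDF turns article text into the count-like features MultinomialNB expects.
    vectorizer = TfidfVectorizer(stop_words="english")
    model = MultinomialNB()
    model.fit(vectorizer.fit_transform(X_train), y_train)
    preds = model.predict(vectorizer.transform(X_test))
    return accuracy_score(y_test, preds)

# Small, medium, and large datasets, evaluated in turn.
for path in ["dataset1.csv", "dataset2.csv", "dataset3.csv"]:
    print(f"{path}: accuracy = {evaluate(path):.2%}")
```

Under the abstract's findings, one would expect the printed accuracies to rise with dataset size, roughly from the 60s for the smallest file into the 90s for the largest.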
Recommended Citation
Koeum, Sochaeta, "A STUDY OF VARIOUS DATA SIZES USING MACHINE LEARNING" (2023). Electronic Theses, Projects, and Dissertations. 1694.
https://scholarworks.lib.csusb.edu/etd/1694