Date of Award

8-2023

Document Type

Project

Degree Name

Master of Science in Computer Science

Department

School of Computer Science and Engineering

First Reader/Committee Chair

Zhang, Yan

Abstract

Data collected from the real world is often imbalanced, meaning that the distribution of data across known classes is biased or skewed. When using machine learning classification models on such imbalanced data, predictive performance tends to be lower because these models are designed with the assumption of balanced classes or a relatively equal number of instances for each class. To address this issue, we employ data preprocessing techniques such as SMOTE (Synthetic Minority Oversampling Technique) for oversampling data and random undersampling for undersampling data on unbalanced datasets. Once the dataset is balanced, genetic programming is utilized for feature selection to enhance performance and efficiency. For this experiment, we consider an imbalanced bank marketing dataset from the UCI Machine Learning Repository. To assess the effectiveness of the technique, it is implemented on four different classification algorithms: Decision Tree, Logistic Regression, KNN (K-Nearest Neighbors), and SVM (Support Vector Machines). Various metrics including accuracy, balanced accuracy, recall, F-score, ROC (Receiver Operating Characteristics) curve, and PR (Precision-Recall) curve are compared for unbalanced data, oversampled data, undersampled data, and cleaned data with Tomek-Links for each algorithm. The results indicate that all four algorithms perform better when oversampling the minority class to half of the majority class and undersampling the majority class examples to match the minority class, followed by performing Tomek-Links on the balanced dataset.

Recommended Citation

Thumpati, Asitha, "GENETIC PROGRAMMING TO OPTIMIZE PERFORMANCE OF MACHINE LEARNING ALGORITHMS ON UNBALANCED DATA SET" (2023). Electronic Theses, Projects, and Dissertations. 1777.
https://scholarworks.lib.csusb.edu/etd/1777

Download

Contact Author

Included in

Data Science Commons

COinS

Electronic Theses, Projects, and Dissertations

GENETIC PROGRAMMING TO OPTIMIZE PERFORMANCE OF MACHINE LEARNING ALGORITHMS ON UNBALANCED DATA SET

Date of Award

Document Type

Degree Name

Department

First Reader/Committee Chair

Abstract

Recommended Citation

Included in

Search

Browse

Author Corner

Links

Electronic Theses, Projects, and Dissertations

GENETIC PROGRAMMING TO OPTIMIZE PERFORMANCE OF MACHINE LEARNING ALGORITHMS ON UNBALANCED DATA SET

Author

Date of Award

Document Type

Degree Name

Department

First Reader/Committee Chair

Abstract

Recommended Citation

Included in

Share

Search

Browse

Author Corner

Links