Date of Award


Document Type


Degree Name

Master of Science in Information Systems and Technology


Information and Decision Sciences

First Reader/Committee Chair

Conrad Shayo


This culminating experience research project explores the parameters needed to predict water quality levels under different climatic conditions, pre- and post-monsoon, from 2018 to 2020 in Telangana State, India. An earlier study analyzed water quality in the Telangana region using linear regression with a water quality index. In this study we replicate that analysis using a stacked model built from machine learning algorithms: Light Gradient Boosting Machine (LightGBM), Random Forest, and an Artificial Neural Network (ANN). The research questions are: (Q1) What are the sources of the significant parameters that impact groundwater quality in a location? (Q2) Will the use of the stacked model analysis approach produce different results when applied to the Telangana dataset? (Q3) How do the size and nature of a dataset impact the effectiveness of ensemble techniques, such as stacking, for addressing class imbalance in groundwater quality prediction models? The findings are: (Q1) Sodium and magnesium values were used to calculate the Sodium Adsorption Ratio (SAR) for the groundwater samples. Based on the electrical conductivity (EC) and SAR values, salinity hazard values were calculated and converted into classes: Low C1 (EC < 250), Medium C2 (EC 250-750), High C3 (EC 750-2250), and Very High C4 (EC > 2250). The sodium hazard classes are Low S1 (SAR < 10), Medium S2 (SAR 10-18), High S3 (SAR 18-26), and Very High S4 (SAR > 26). Comparing the 2018, 2019, and 2020 water quality datasets, the parameter ranges increased: Sodium (5.07 to 748), Calcium (1.2 to 640.0), Magnesium (4.86 to 457.02), and Electrical Conductivity (102 to 9499). (Q2)
Stacked models that combine different classifiers achieved the best accuracy: the individual Random Forest and LightGBM models each reached 97%, and passing their two sets of predicted probabilities through the ANN increased accuracy to 98%. The models predict water quality from data collected across different regions and climatic conditions, based on the suitability of water salinity and sodium content. (Q3) To manage imbalanced data and increase prediction accuracy, model performance was evaluated using the classification reports of Random Forest (RF), LightGBM, and ANN; the F1 scores varied by class. For class Marginal, RF scored 0.63 and LightGBM 0.67, and the ANN increased this to 0.76. For class Poor, RF and LightGBM both scored 0.95, and the ANN increased this to 0.96. For class Very Poor, RF and LightGBM both scored 0.77, and the ANN increased this to 0.86. For classes Excellent and Good, all three models had an F1 score of 1, and for class Permissible all three scored 0.99. The conclusions are: (Q1) This research provides helpful information for understanding and handling the potential risks of salinity and sodium in the study region by classifying the salinity and sodium hazard levels into four classes each (C1 to C4 and S1 to S4) based on EC and SAR values. (Q2) Stacked models employing different classifiers proved highly effective in predicting water quality; passing the predicted probability values through the ANN improved accuracy further, to 98%. (Q3) The stacked model technique, which combines Random Forest, LightGBM, and ANN, appears to be an effective means of dealing with imbalanced data and enhancing prediction accuracy. The marked improvement in F1 scores for several classes, especially when using the ANN, demonstrates how effectively this ensemble approach handles challenging classification problems.
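
The per-class F1 scores quoted above matter because overall accuracy can hide weak performance on small classes. As a minimal sketch of how such scores are obtained, the function below computes one-vs-rest F1 for each class from true and predicted labels, using F1 = 2TP / (2TP + FP + FN). The function name and the toy labels are hypothetical and are not taken from the study's data or code.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        # True positives, false positives, false negatives for this class.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Define F1 as 0.0 when the class has no true or predicted members.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels (hypothetical): one "Poor" sample is misclassified as "Marginal".
y_true = ["Poor", "Poor", "Marginal", "Marginal", "Good"]
y_pred = ["Poor", "Marginal", "Marginal", "Marginal", "Good"]
scores = per_class_f1(y_true, y_pred)
```

In this toy case four of five predictions are correct, yet the single error lowers the F1 of both classes it touches, which is why a per-class view is useful for imbalanced data.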

Furthermore, areas for future research that emerged from this study include training and testing our model on a larger dataset and tuning its hyperparameters for further improvement.
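
The salinity and sodium hazard classes named in the abstract can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the study's implementation. The SAR formula is the conventional one, Na / sqrt((Ca + Mg) / 2) with concentrations in meq/L, which also involves calcium. The EC cut-offs are assumed to follow the conventional US Salinity Laboratory scheme in µS/cm. The function names, the mg/L input units, the handling of boundary values, and the sample values are all hypothetical.

```python
import math

# Approximate equivalent weights (mg per meq) for converting mg/L to meq/L.
EQ_WEIGHT = {"Na": 22.99, "Ca": 20.04, "Mg": 12.15}


def sar(na_mg_l, ca_mg_l, mg_mg_l):
    """Sodium Adsorption Ratio from concentrations given in mg/L."""
    na = na_mg_l / EQ_WEIGHT["Na"]
    ca = ca_mg_l / EQ_WEIGHT["Ca"]
    mg = mg_mg_l / EQ_WEIGHT["Mg"]
    return na / math.sqrt((ca + mg) / 2.0)


def sodium_hazard(sar_value):
    """Map SAR to S1..S4; values on a boundary go to the lower class (assumption)."""
    if sar_value < 10:
        return "S1"
    if sar_value <= 18:
        return "S2"
    if sar_value <= 26:
        return "S3"
    return "S4"


def salinity_hazard(ec):
    """Map EC (assumed µS/cm) to C1..C4 using the assumed 250/750/2250 cut-offs."""
    if ec < 250:
        return "C1"
    if ec <= 750:
        return "C2"
    if ec <= 2250:
        return "C3"
    return "C4"


# Hypothetical sample: 229.9 mg/L Na, 40.08 mg/L Ca, 24.30 mg/L Mg, EC 900.
sample_sar = sar(229.9, 40.08, 24.30)
sample_classes = (salinity_hazard(900), sodium_hazard(sample_sar))
```

A combined label such as C3-S1 for this hypothetical sample places it in the high-salinity, low-sodium cell of the classification.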