Communications of the IIMA


Public health issues feature prominently in popular awareness, in political debate, and in the data mining literature. Data mining has the potential to influence public health in myriad ways, from personalized genetic medicine to studies of environmental health and epidemiology, with many applications in between. Authors have asserted the importance of medical data as the basis for any conclusions applied to the public health domain, the promise of naive Bayes classification for prediction in that domain, and the impact of feature selection on classification accuracy. In keeping with this perspective, this study explored the combination of a naive Bayes classifier with greedy feature selection, applied to a robust public health dataset, with the goal of efficiently identifying the one or several attributes that best predict a selected target attribute. This approach consistently identified the most predictive attributes for a given target attribute and produced modest increases in classification accuracy. For each choice of target attribute, the most predictive attributes were those relating to diagnosis or procedure codes, a result that points to several opportunities for future work.
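To make the combination of naive Bayes classification and greedy feature selection concrete, the following is a minimal sketch rather than the study's implementation. It assumes categorical attributes, Laplace smoothing, forward selection scored by accuracy on a held-out split, and a stopping rule that halts when no remaining attribute improves accuracy. The function names (`train_nb`, `predict_nb`, `accuracy`, `greedy_select`) and the toy records are hypothetical and are not drawn from the study's dataset.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, features):
    """Count class and per-attribute value frequencies for the chosen attributes."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute, label) -> value -> count
    values_seen = defaultdict(set)       # attribute -> distinct training values
    for row, label in zip(rows, labels):
        for f in features:
            value_counts[(f, label)][row[f]] += 1
            values_seen[f].add(row[f])
    return class_counts, value_counts, values_seen


def predict_nb(model, row, features):
    """Return the label with the highest log posterior (ties go to the first label in sorted order)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        count = class_counts[label]
        score = math.log(count / total)
        for f in features:
            # Laplace (add-one) smoothing; an assumption of this sketch.
            num = value_counts[(f, label)][row[f]] + 1
            den = count + len(values_seen[f])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(train, test, features):
    """Train on `train` and report the fraction of `test` rows classified correctly."""
    model = train_nb(train[0], train[1], features)
    hits = sum(predict_nb(model, r, features) == y for r, y in zip(*test))
    return hits / len(test[1])


def greedy_select(train, test, n_attributes):
    """Forward selection: repeatedly add the attribute that most improves accuracy."""
    selected = []
    best = accuracy(train, test, selected)  # baseline: class prior only
    while len(selected) < n_attributes:
        round_best, round_attr = best, None
        for f in range(n_attributes):
            if f in selected:
                continue
            score = accuracy(train, test, selected + [f])
            if score > round_best:
                round_best, round_attr = score, f
        if round_attr is None:  # no attribute improves accuracy, so stop
            break
        selected.append(round_attr)
        best = round_best
    return selected, best


# Hypothetical toy records: (diagnosis code, sex, region) -> length of stay.
train = (
    [("D1", "F", "N"), ("D1", "M", "S"), ("D2", "F", "S"),
     ("D2", "M", "N"), ("D1", "F", "S"), ("D2", "M", "S")],
    ["short", "short", "long", "long", "short", "long"],
)
test = (
    [("D1", "M", "N"), ("D2", "F", "N"), ("D1", "F", "N"), ("D2", "M", "S")],
    ["short", "long", "short", "long"],
)

selected, score = greedy_select(train, test, n_attributes=3)
print(selected, score)
```

On this toy data the search keeps only attribute 0, the diagnosis code, because adding either remaining attribute does not raise held-out accuracy further. In practice the score would more likely come from cross-validation than from a single split.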