Date of Award


Document Type


Degree Name

Master of Science in Computer Science


School of Computer Science and Engineering

First Reader/Committee Chair

Yan Zhang


The amount of data generated in medical records, especially in a modern context, is growing significantly. As this data grows, it becomes increasingly useful to classify it into relevant classes for further intervention. Non-automated methods have previously been tried for this task, but they are time-consuming and require substantial manual effort.

Recently, deep learning has been applied to this task, but the complexity of the dataset, in particular inter-class similarities and terminology that takes on different meanings in medical contexts, has prevented a definitive approach to medical notes classification. In this study, recent Transformer-based improvements to NLP are applied to a large medical notes dataset to classify notes into five classes: neoplasms, digestive system diseases, nervous system diseases, cardiovascular diseases, and general pathological conditions.
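The five diagnostic categories above can be framed as a multi-class labelling scheme; a minimal sketch (the integer assignments are illustrative, not taken from the thesis) might look like:

```python
# Hypothetical mapping of the five diagnostic categories to integer labels,
# as required for multi-class classification. Category names follow the
# abstract; the integer order is an assumption for illustration only.
LABELS = {
    "neoplasms": 0,
    "digestive system diseases": 1,
    "nervous system diseases": 2,
    "cardiovascular diseases": 3,
    "general pathological conditions": 4,
}
# Inverse mapping, used to turn model predictions back into category names.
ID2LABEL = {v: k for k, v in LABELS.items()}

print(ID2LABEL[0])  # neoplasms
```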

To achieve this, the best-performing model proposed is BERT, pre-trained on a large corpus and fine-tuned on this medical notes dataset. A robust 5-fold cross-validation is performed so that the entire training set is utilised when searching for optimal hyperparameters, and a held-out test set is used to report the results. The optimised BERT model outperforms both the results in the paper that introduced this dataset and a baseline linear model.
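The evaluation protocol above can be sketched as follows. Fine-tuning BERT itself is too heavyweight for a short example, so a TF-IDF plus logistic-regression pipeline stands in here for the baseline linear model, and a tiny synthetic corpus stands in for the medical notes; only the stratified 5-fold structure reflects the protocol described in the abstract.

```python
# Sketch of stratified 5-fold cross-validation for model selection.
# The pipeline below is a stand-in for the baseline linear model; the
# fine-tuned BERT model would be evaluated with the same fold structure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 50 notes evenly spread over five classes.
texts = [f"note describing condition {i % 5} case {i}" for i in range(50)]
labels = [i % 5 for i in range(50)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in skf.split(texts, labels):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    # Validation accuracy on this fold.
    fold_scores.append(model.score([texts[i] for i in val_idx],
                                   [labels[i] for i in val_idx]))

print(len(fold_scores))  # 5 per-fold accuracies
```

In the thesis protocol, the fold-averaged validation score guides hyperparameter selection, and the final numbers are reported once on the held-out test set.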

In addition to the overall accuracy, class-wise metrics were calculated and reported, showing that general pathological conditions overlap significantly with the other classes and are therefore harder to classify. This was visualised by applying t-SNE to the embeddings of the CLS token in the BERT model.
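The t-SNE step can be sketched as below. Extracting real CLS embeddings requires a forward pass through the fine-tuned BERT model (in Hugging Face Transformers, typically `outputs.last_hidden_state[:, 0, :]`); to keep the sketch self-contained, random 768-dimensional vectors stand in for those embeddings.

```python
# Sketch of projecting CLS-token embeddings to 2D with t-SNE.
# The random vectors below are stand-ins for per-note BERT CLS embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_notes, n_classes, hidden_dim = 100, 5, 768
cls_embeddings = rng.normal(size=(n_notes, hidden_dim))
labels = np.arange(n_notes) % n_classes  # class label per note, for colouring

# Perplexity must be smaller than the number of samples.
proj = TSNE(n_components=2, perplexity=10, init="random",
            random_state=0).fit_transform(cls_embeddings)

print(proj.shape)  # (100, 2): one 2D point per note, ready to scatter-plot by class
```

Plotting the 2D points coloured by class label is what reveals the overlap of general pathological conditions with the other categories.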

In conclusion, these results present a novel and effective method for medical notes classification using a Transformer-based BERT model, which outperforms the current state-of-the-art results published on this dataset. Significant challenges remain, particularly in incorporating domain knowledge into the modelling to improve efficiency, but the proposed model achieves high class-wise accuracy, allowing classification to be automated in a first pass, after which further models can be applied with higher confidence.