Date of Award


Document Type

Thesis (Undergraduate)


Computer Science

First Advisor

Prof. Sarah Preum


Patients' ability to understand health-related text is important for optimal health outcomes. A system that automatically annotates medical entities could help patients better understand such text. It would also accelerate manual data annotation in this low-resource domain and assist downstream medical NLP tasks such as measuring textual similarity, identifying conflicting medical advice, and aspect-based sentiment analysis. In this work, we investigate a state-of-the-art entity set expansion model, BootstrapNet, for the task of medical entity classification on a new dataset of medical advice text. We also propose EP SBERT, a simple model that uses Sentence-BERT embeddings of entities and context patterns to capture the semantics of the entities more effectively. Our experiments show that EP SBERT significantly outperforms a random classifier baseline, outperforms the more complex BootstrapNet by 5.2 F1 points, and achieves a 5-fold cross-validated weighted F1 score of 0.835. Further experiments show that EP SBERT achieves a weighted F1 score of 0.870 when we remove a peripheral class whose inclusion is nonessential to the problem formulation, and a weighted F1 score of 0.949 under top-2 evaluation. These results make us confident that EP SBERT can be useful for building human-in-the-loop data annotation tools. Finally, we perform an extensive error analysis of EP SBERT, identifying two core challenges and directions for future work. Our code will be made available at
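
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of weighted F1 and of one common reading of top-2 evaluation, in which a prediction counts as correct when the gold label appears among the two highest-ranked labels. The abstract does not define either procedure, so this reading, the function names, and the toy labels below are all assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Per-class F1, averaged with weights equal to each class's gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(gold)


def top_k_as_predictions(ranked_labels, gold, k=2):
    """Assumed top-k rule: if the gold label is within the k highest-ranked
    labels, treat it as the prediction; otherwise fall back to the top-1 label."""
    return [g if g in ranked[:k] else ranked[0]
            for ranked, g in zip(ranked_labels, gold)]


# Hypothetical entity classes and model rankings (best guess first).
gold = ["drug", "food", "drug", "symptom"]
ranked = [
    ["drug", "food"],
    ["drug", "food"],
    ["food", "drug"],
    ["symptom", "drug"],
]

top1_pred = [r[0] for r in ranked]
top2_pred = top_k_as_predictions(ranked, gold, k=2)

top1_score = weighted_f1(gold, top1_pred)
top2_score = weighted_f1(gold, top2_pred)
print(f"top-1 weighted F1: {top1_score:.3f}")
print(f"top-2 weighted F1: {top2_score:.3f}")
```

Under this reading, top-2 scores are always at least as high as top-1 scores, which matches the intended use in a human-in-the-loop tool where an annotator picks between two suggested labels.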