Distributional Semantic Models for Clinical Text Applied to Health Record Summarization
As information systems in the health sector are becoming increasingly computerized, large amounts of care-related information are being stored electronically. In hospitals, clinicians continuously document treatment and care given to patients in electronic health record (EHR) systems. Much of the information being documented is in the form of clinical notes, or narratives, containing primarily unstructured free-text information. For each care episode, clinical notes are written on a regular basis, ending with a discharge summary that summarizes the care episode. Although EHR systems are helpful for storing and managing such information, there is an unrealized potential in utilizing this information for smarter care assistance, as well as for secondary purposes such as research and education. Advances in clinical language processing are enabling computers to assist clinicians in their interaction with the free-text information documented in EHR systems. This includes assisting in tasks like query-based search, terminology development, knowledge extraction, translation, and summarization. This thesis explores various computerized approaches and methods aimed at enabling automated semantic textual similarity assessment and information extraction based on the free-text information in EHR systems. The focus is placed on the task of (semi-)automated summarization of the clinical notes written during individual care episodes. The overall theme of the presented work is to utilize resource-light approaches and methods, circumventing the need to manually develop knowledge resources or training data. Thus, to enable computational semantic textual similarity assessment, word distribution statistics are derived from large training corpora of clinical free text and stored as vector-based representations referred to as distributional semantic models.
Resource-light methods are also explored in the task of performing automatic summarization of clinical free-text information, relying on semantic textual similarity assessment. Novel and experimental methods are presented and evaluated that focus on: a) distributional semantic models trained in an unsupervised manner from statistical information derived from large unannotated clinical free-text corpora; b) representing and computing semantic similarities between linguistic items of different granularity, primarily words, sentences and clinical notes; and c) summarizing clinical free-text information from individual care episodes. Results are evaluated against gold standards that reflect human judgements. The results indicate that the use of distributional semantics is promising as a resource-light approach to automated capturing of semantic textual similarity relations from unannotated clinical text corpora. Here it is important that the semantics correlate with the clinical terminology and with the various semantic similarity assessment tasks. Improvements over classical approaches are achieved when the underlying vector-based representations allow for a broader range of semantic features to be captured and represented. These features are either distributed over multiple semantic models trained with different features and training corpora, or captured by models that store multiple sense vectors per word. Further, the use of structured meta-level information accompanying care episodes is explored as training features for distributional semantic models, with the aim of capturing semantic relations suitable for care episode-level information retrieval. Results indicate that such models perform well in clinical information retrieval. It is shown that a method called Random Indexing can be modified to construct distributional semantic models that capture multiple sense vectors for each word in the training corpus.
This is done in a way that retains the original training properties of the Random Indexing method, by being incremental, scalable and distributional. Distributional semantic models trained with a framework called Word2vec, which relies on the use of neural networks, outperform those trained using the classic Random Indexing method in several semantic similarity assessment tasks, when training is done using comparable parameters and the same training corpora. Finally, several statistical features in clinical text are explored in terms of their ability to indicate sentence significance in a text summary generated from the clinical notes. This includes the use of distributional semantics to enable case-based similarity assessment, where cases are other care episodes and their “solutions”, i.e., discharge summaries. A type of manual evaluation is performed, where human experts rate the different aspects of the summaries using an evaluation scheme/tool. In addition, the original clinician-written discharge summaries are explored as a gold standard for the purpose of automated evaluation. Evaluation shows a high correlation between manual and automated evaluation, suggesting that such a gold standard can function as a proxy for human evaluations.
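As a rough illustration of the classic Random Indexing idea referred to in the abstract, the sketch below builds word context vectors incrementally by summing sparse random index vectors of neighbouring words, then compares words by cosine similarity. It is not the implementation used in the thesis: the dimensionality, the number of non-zero entries, the window size, the toy sentences and all function names are assumptions chosen for readability, and the multi-sense extension described above is not covered.

```python
import math
import random
from collections import defaultdict

DIM = 300      # dimensionality of the reduced space (assumed value)
NONZERO = 8    # non-zero entries per index vector (assumed value)
WINDOW = 2     # context words considered on each side (assumed value)


def make_index_vector(rng):
    """Sparse ternary index vector stored as {position: +1 or -1}."""
    positions = rng.sample(range(DIM), NONZERO)
    # Half of the chosen positions get +1, the other half -1.
    return {p: (1 if i % 2 == 0 else -1) for i, p in enumerate(positions)}


def train(sentences, seed=0):
    """Build context vectors in a single incremental pass over the corpus."""
    rng = random.Random(seed)
    index_vectors = {}
    context_vectors = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        # Assign an index vector to each word the first time it is seen.
        for token in tokens:
            if token not in index_vectors:
                index_vectors[token] = make_index_vector(rng)
        # Add the index vectors of the neighbours to each target word.
        for i, target in enumerate(tokens):
            start = max(0, i - WINDOW)
            end = min(len(tokens), i + WINDOW + 1)
            for j in range(start, end):
                if j == i:
                    continue
                for pos, sign in index_vectors[tokens[j]].items():
                    context_vectors[target][pos] += sign
    return dict(context_vectors)


def cosine(u, v):
    """Cosine similarity between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus; real models are trained on large clinical corpora.
TOY_CORPUS = [
    ["patient", "reports", "chest", "pain"],
    ["patient", "reports", "chest", "discomfort"],
    ["nurse", "administered", "oral", "antibiotics"],
]

vectors = train(TOY_CORPUS)
similar = cosine(vectors["pain"], vectors["discomfort"])
dissimilar = cosine(vectors["pain"], vectors["antibiotics"])
```

In this toy setting "pain" and "discomfort" occur in identical contexts, so their context vectors coincide, while "pain" and "antibiotics" share no context words and should score much lower. Because new text only adds to existing vectors, the model can be updated without retraining, which is the incremental property the abstract mentions.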
Consists of
Paper A: Henriksson, Aron; Moen, Hans; Skeppstedt, Maria; Daudaravicius, Vidas; Duneld, Martin. Synonym Extraction and Abbreviation Expansion with Ensembles of Semantic Spaces. Journal of Biomedical Semantics 2014; Volume 5 (6). http://dx.doi.org/10.1186/2041-1480-5-6 © Henriksson et al.; licensee BioMed Central Ltd. 2014. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0)
Paper B: Moen, Hans; Marsi, Erwin; Gambäck, Björn. Towards Dynamic Word Sense Discrimination with Random Indexing. In: Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality. Association for Computational Linguistics 2013, pp. 83-90
Paper C: Moen, Hans; Ginter, Filip; Marsi, Erwin; Murtola, Laura-Maria; Salakoski, Tapio; Salanterä, Sanna. Care Episode Retrieval: Distributional Semantic Models for Information Retrieval in the Clinical Domain. BMC Medical Informatics and Decision Making 2015; Volume 15. http://dx.doi.org/10.1186/1472-6947-15-S2-S2 Attribution 4.0 International (CC BY 4.0)
Paper D: Moen, Hans; Heimonen, Juho; Murtola, Laura-Maria; Airola, Antti; Pahikkala, Tapio; Terävä, Virpi; Danielsson-Ojala, Riitta; Salakoski, Tapio; Salanterä, Sanna. On Evaluation of Automatically Generated Clinical Discharge Summaries. CEUR Workshop Proceedings 2014; Volume 1251, pp. 101-114. © 2014 by the paper's authors. Copying permitted for private and academic purposes.
Paper E: Moen, Hans; Peltonen, Laura-Maria; Heimonen, Juho; Airola, Antti; Pahikkala, Tapio; Salakoski, Tapio; Salanterä, Sanna. Comparison of automatic summarisation methods for clinical free text notes. Artificial Intelligence in Medicine 2016; Volume 67, pp. 25-37. http://dx.doi.org/10.1016/j.artmed.2016.01.003 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license.