Extraction of medical knowledge from clinical reports and chest x-rays using machine learning techniques

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/102193
Información del item - Informació de l'item - Item information
Title: Extraction of medical knowledge from clinical reports and chest x-rays using machine learning techniques
Authors: Bustos, Aurelia
Research Director: Pertusa, Antonio
Center, Department or Service: Universidad de Alicante. Instituto Universitario de Investigación Informática
Keywords: Natural Language Processing | Machine Learning | Artificial Intelligence | Neural Networks | Deep Learning | Computer Vision | Multilabel Text Classifiers | Clinical Research | Radiology | Chest X-Rays | Medical Image Dataset | Clinical Trials on Cancer | Medical Text
Knowledge Area: Lenguajes y Sistemas Informáticos
Date Created: 2019
Issue Date: 2019
Date of defense: 19-Jun-2019
Publisher: Universidad de Alicante
Abstract: This thesis addresses the extraction of medical knowledge from clinical text using deep learning techniques. In particular, the proposed methods focus on cancer clinical trial protocols and chest x-rays reports. The main results are a proof of concept of the capability of machine learning methods to discern which are regarded as inclusion or exclusion criteria in short free-text clinical notes, and a large scale chest x-ray image dataset labeled with radiological findings, diagnoses and anatomic locations. Clinical trials provide the evidence needed to determine the safety and effectiveness of new medical treatments. These trials are the basis employed for clinical practice guidelines and greatly assist clinicians in their daily practice when making decisions regarding treatment. However, the eligibility criteria used in oncology trials are too restrictive. Patients are often excluded on the basis of comorbidity, past or concomitant treatments and the fact they are over a certain age, and those patients that are selected do not, therefore, mimic clinical practice. This signifies that the results obtained in clinical trials cannot be extrapolated to patients if their clinical profiles were excluded from the clinical trial protocols. The efficacy and safety of new treatments for patients with these characteristics are not, therefore, defined. Given the clinical characteristics of particular patients, their type of cancer and the intended treatment, discovering whether or not they are represented in the corpus of available clinical trials requires the manual review of numerous eligibility criteria, which is impracticable for clinicians on a daily basis. In this thesis, a large medical corpora comprising all cancer clinical trials protocols in the last 18 years published by competent authorities was used to extract medical knowledge in order to help automatically learn patient’s eligibility in these trials. For this, a model is built to automatically predict whether short clinical statements were considered inclusion or exclusion criteria. A method based on deep neural networks is trained on a dataset of 6 million short free-texts to classify them between elegible or not elegible. For this, pretrained word embeddings were used as inputs in order to predict whether or not short free-text statements describing clinical information were considered eligible. The semantic reasoning of the word-embedding representations obtained was also analyzed, being able to identify equivalent treatments for a type of tumor in an analogy with the drugs used to treat other tumors. Results show that representation learning using deep neural networks can be successfully leveraged to extract the medical knowledge from clinical trial protocols and potentially assist practitioners when prescribing treatments. The second main task addressed in this thesis is related to knowledge extraction from medical reports associated with radiographs. Conventional radiology remains the most performed technique in radiodiagnosis services, with a percentage close to 75% (Radiología Médica, 2010). In particular, chest x-ray is the most common medical imaging exam with over 35 million taken every year in the US alone (Kamel et al., 2017). They allow for inexpensive screening of several pathologies including masses, pulmonary nodules, effusions, cardiac abnormalities and pneumothorax. For this task, all the chest-x rays that had been interpreted and reported by radiologists at the Hospital Universitario de San Juan (Alicante) from Jan 2009 to Dec 2017 were used to build a novel large-scale dataset in which each high-resolution radiograph is labeled with its corresponding metadata, radiological findings and pathologies. This dataset, named PadChest, includes more than 160,000 images obtained from 67,000 patients, covering six different position views and additional information on image acquisition and patient demography. The free text reports written in Spanish by radiologists were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. For this, a subset of the reports (a 27%) were manually annotated by trained physicians, whereas the remaining set was automatically labeled with deep supervised learning methods using attention mechanisms and fed with the text reports. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is one of the largest public chest x-ray databases suitable for training supervised models concerning radiographs, and also the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded on request from http://bimcv.cipf.es/bimcv-projects/padchest/. PadChest is intended for training image classifiers based on deep learning techniques to extract medical knowledge from chest x-rays. It is essential that automatic radiology reporting methods could be integrated in a clinically validated manner in radiologists’ workflow in order to help specialists to improve their efficiency and enable safer and actionable reporting. Computer vision methods capable of identifying both the large spectrum of thoracic abnormalities (and also the normality) need to be trained on large-scale comprehensively labeled large-scale x-ray datasets such as PadChest. The development of these computer vision tools, once clinically validated, could serve to fulfill a broad range of unmet needs. Beyond implementing and obtaining results for both clinical trials and chest x-rays, this thesis studies the nature of the health data, the novelty of applying deep learning methods to obtain large-scale labeled medical datasets, and the relevance of its applications in medical research, which have contributed to its extramural diffusion and worldwide reach. This thesis describes this journey so that the reader is navigated across multiple disciplines, from engineering to medicine up to ethical considerations in artificial intelligence applied to medicine.
URI: http://hdl.handle.net/10045/102193
Language: eng
Type: info:eu-repo/semantics/doctoralThesis
Rights: Licencia Creative Commons Reconocimiento-NoComercial-SinObraDerivada 4.0
Appears in Collections: Doctoral theses

Files in This Item:
Files in This Item:
File Description SizeFormat 
Thumbnailthesis_aurelia_bustos_moreno.pdf14,12 MBAdobe PDFOpen Preview


This item is licensed under a Creative Commons License Creative Commons