Peer-Reviewed Papers

Deeper Clinical Document Understanding Using Relation Extraction

Authors: Hasham Ul Haq, Veysel Kocaman and David Talby

Accepted to SDU (Scientific Document Understanding) workshop at AAAI 2022.

The surging amount of biomedical literature and digital clinical records presents a growing need for text mining techniques that can not only identify but also semantically relate entities in unstructured data. In this paper, we propose a text mining framework comprising Named Entity Recognition (NER) and Relation Extraction (RE) models, which expands on previous work in three main ways. First, we introduce two new RE model architectures: an accuracy-optimized one based on BioBERT and a speed-optimized one utilizing crafted features over a Fully Connected Neural Network (FCNN).

Second, we evaluate both models on public benchmark datasets and obtain new state-of-the-art F1 scores on the 2012 i2b2 Clinical Temporal Relations challenge (F1 of 73.6, +1.2% over the previous SOTA), the 2010 i2b2 Clinical Relations challenge (F1 of 69.1, +1.2%), the 2019 Phenotype-Gene Relations dataset (F1 of 87.9, +8.5%), the 2012 Adverse Drug Events Drug-Reaction dataset (F1 of 90.0, +6.3%), and the 2018 n2c2 Posology Relations dataset (F1 of 96.7, +0.6%). Third, we show two practical applications of this framework — for building a biomedical knowledge graph and for improving the accuracy of mapping entities to clinical codes. The system is built using the Spark NLP library which provides a production-grade, natively scalable, hardware-optimized, trainable & tunable NLP framework.
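
A minimal sketch of how such a combined NER and RE pipeline can be assembled, assuming the licensed Spark NLP for Healthcare package (sparknlp_jsl) and a valid license key; the pretrained model names used below are illustrative, not prescribed by the paper:

```python
# Sketch only: assumes sparknlp_jsl (Spark NLP for Healthcare) is installed and licensed;
# pretrained model names ("ner_clinical", "re_clinical", etc.) are illustrative.
import sparknlp_jsl
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (SentenceDetector, Tokenizer, WordEmbeddingsModel,
                                PerceptronModel, DependencyParserModel)
from sparknlp_jsl.annotator import (MedicalNerModel, NerConverterInternal,
                                    RelationExtractionModel)
from pyspark.ml import Pipeline

spark = sparknlp_jsl.start("<license_secret>")  # placeholder license secret

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
pos = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]).setOutputCol("pos")
ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
ner_chunk = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
dependency = DependencyParserModel.pretrained("dependency_conllu", "en") \
    .setInputCols(["sentence", "pos", "token"]).setOutputCol("dependency")
# Speed-optimized FCNN relation extractor; the BioBERT-based RelationExtractionDLModel
# is the accuracy-optimized alternative and plugs into the same pipeline.
relations = RelationExtractionModel.pretrained("re_clinical", "en", "clinical/models") \
    .setInputCols(["embeddings", "pos", "ner_chunk", "dependency"]).setOutputCol("relations")

pipeline = Pipeline(stages=[document, sentence, token, embeddings, pos,
                            ner, ner_chunk, dependency, relations])
```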

Mining Adverse Drug Reactions from Unstructured Mediums at Scale

Authors: Hasham Ul Haq, Veysel Kocaman and David Talby

Accepted to the 6th International Workshop on Health Intelligence at AAAI 2022.

Adverse drug reactions / events (ADR/ADE) have a major impact on patient health and health care costs. Detecting ADRs as early as possible and sharing them with regulators, pharma companies, and healthcare providers can prevent morbidity and save many lives. While most ADRs are not reported via formal channels, they are often documented in a variety of unstructured conversations such as social media posts by patients, customer support call transcripts, or CRM notes of meetings between healthcare providers and pharma sales reps. In this paper, we propose a natural language processing (NLP) solution that detects ADRs in such unstructured free-text conversations, which improves on previous work in three ways. First, a new Named Entity Recognition (NER) model obtains new state-of-the-art accuracy for ADR and Drug entity extraction on the ADE, CADEC, and SMM4H benchmark datasets (91.75%, 78.76%, and 83.41% F1 scores respectively).

Second, two new Relation Extraction (RE) models are introduced – one based on BioBERT, the other utilizing crafted features over a Fully Connected Neural Network (FCNN) – and are shown to perform on par with existing state-of-the-art models, and to outperform them when trained with a supplementary clinician-annotated RE dataset. Third, a new text classification model, for deciding if a conversation includes an ADR, obtains new state-of-the-art accuracy on the CADEC dataset (86.69% F1 score). The complete solution is implemented as a unified NLP pipeline in a production-grade library built on top of Apache Spark, making it natively scalable and able to process millions of batch or streaming records on commodity clusters.
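
As an illustration of the text classification component, here is a minimal sketch of training a binary ADR/no-ADR document classifier with the open-source Spark NLP API; the embeddings model name and the training file are assumptions, not taken from the paper:

```python
# Sketch only: trains a binary "contains ADR / no ADR" classifier; the embeddings
# model name and the training file path are assumptions, not from the paper.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()
train_df = spark.read.parquet("adr_conversations.parquet")  # columns: text, label

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]).setOutputCol("class") \
    .setLabelColumn("label").setMaxEpochs(10).setEnableOutputLogs(True)

model = Pipeline(stages=[document, sent_embeddings, classifier]).fit(train_df)
```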

Biomedical Named Entity Recognition at Scale

Authors: Veysel Kocaman and David Talby

Accepted for presentation and inclusion in CADL 2020 (International Workshop on Computational Aspects of Deep Learning), in conjunction with ICPR 2020.

Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT.

This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.
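
A short sketch of running this Bi-LSTM-CNN-Char NER architecture through the open-source Spark NLP API; the pretrained model shown ("ner_dl" with GloVe embeddings) is the general-purpose English model, and a biomedical model would be loaded the same way under a different (assumed) name:

```python
# Sketch only: inference with the pretrained open-source Bi-LSTM-CNN-Char NER model.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (SentenceDetector, Tokenizer, WordEmbeddingsModel,
                                NerDLModel, NerConverter)
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
glove = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl", "en") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
chunks = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, sentence, token, glove, ner, chunks])
df = spark.createDataFrame([["The patient was prescribed metformin for type 2 diabetes."]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("ner_chunk.result").show(truncate=False)
```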

Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach

Authors: Syed Raza Bashir, Shaina Raza, Veysel Kocaman and Urooj Qamar

Abstract: The clinical application of detecting COVID-19 factors is a challenging task. Existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical factors, non-clinical factors such as social determinants of health (SDoH) are also important for studying infectious disease. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH.

The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1–5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.

Improving Clinical Document Understanding on COVID-19 Research with Spark NLP

Authors: Veysel Kocaman and David Talby

Accepted to SDU (Scientific Document Understanding) workshop at AAAI 2021.

Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literature review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types including social determinants of health, anatomy, risk factors, and adverse events in addition to other commonly used clinical and biomedical entities.

Second, the text processing pipeline includes assertion status detection, to distinguish between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than previously available, leveraging an integrated pipeline of state-of-the-art pretrained named entity recognition models, and improving on the previous best performing benchmarks for assertion status detection.

We illustrate extracting trends and insights, e.g. most frequent disorders and symptoms, and most common vital signs and EKG findings, from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library which natively supports scaling to use distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.
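
A minimal sketch of a pipeline that chains clinical NER with assertion status detection, assuming the licensed Spark NLP for Healthcare package (sparknlp_jsl); the pretrained model names are illustrative:

```python
# Sketch only: clinical NER followed by assertion status detection;
# assumes sparknlp_jsl is installed and licensed, model names are illustrative.
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal, AssertionDLModel
from pyspark.ml import Pipeline

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
ner_chunk = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
# Tags each extracted chunk as present, absent, conditional, or about someone else
assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]).setOutputCol("assertion")

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, ner_chunk, assertion])
```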

Accurate Clinical and Biomedical Named Entity Recognition at Scale

Authors: Veysel Kocaman and David Talby

Accepted to Software Impacts, July 2022.

Named entity recognition (NER) is one of the most important building blocks of NLP tasks in the medical domain, extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Due to the growing volume of healthcare data in unstructured format, an increasingly important challenge is providing high-accuracy implementations of state-of-the-art deep learning (DL) algorithms at scale.

While recent advances in NLP like Transformers and BERT have pushed the boundaries of accuracy, these methods are slow and difficult to scale to millions of records.

In this study, we introduce an agile, production-grade clinical and biomedical NER algorithm based on a modified BiLSTM-CNN-Char DL architecture built on top of Apache Spark.

Our NER implementation establishes new state-of-the-art accuracy on 7 of 8 well-known biomedical NER benchmarks and 3 clinical concept extraction challenges: 2010 i2b2/VA clinical concept extraction, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction. Moreover, clinical NER models trained using this implementation outperform the accuracy of commercial entity extraction solutions such as AWS Medical Comprehend and Google Cloud Healthcare API by a large margin (8.9% and 6.7% respectively), without using memory-intensive language models.

The proposed model requires no handcrafted features or task-specific resources, requires minimal hyperparameter tuning for a given dataset from any domain, can be trained with any embeddings including BERT, and can be trained to support more human languages with no code changes. It is available within a production-grade code base as part of the Spark NLP library, the only open-source NLP library that can scale to make use of a Spark cluster for training and inference, has GPU support, and provides libraries for Python, R, Scala and Java.
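
A brief sketch of what training this NER architecture with BERT embeddings looks like in open-source Spark NLP, assuming a CoNLL-formatted training file; the path and hyperparameters below are placeholders:

```python
# Sketch only: trains the BiLSTM-CNN-Char NER model (NerDLApproach) with BERT
# embeddings; the CoNLL file path and hyperparameters below are placeholders.
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import BertEmbeddings, NerDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()
training_data = CoNLL().readDataset(spark, "train.conll")  # placeholder path

bert = BertEmbeddings.pretrained("small_bert_L2_128", "en") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner_trainer = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label").setOutputCol("ner") \
    .setMaxEpochs(20).setLr(0.003).setBatchSize(32) \
    .setValidationSplit(0.1).setEnableOutputLogs(True)

ner_model = Pipeline(stages=[bert, ner_trainer]).fit(training_data)
```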

Preparing for the next pandemic via transfer learning from existing diseases with hierarchical multi-modal BERT: a study on COVID-19 outcome prediction

Authors: Khushbu Agarwal, Sutanay Choudhury, Sindhu Tipirneni, Pritam Mukherjee, Colby Ham, Suzanne Tamang, Matthew Baker, Siyi Tang, Veysel Kocaman, Olivier Gevaert, Robert Rallo & Chandan K Reddy

Developing prediction models for emerging infectious diseases from relatively small numbers of cases is a critical need for improving pandemic preparedness. Using COVID-19 as an exemplar, we propose a transfer learning methodology for developing predictive models from multi-modal electronic healthcare records by leveraging information from more prevalent diseases with shared clinical characteristics. Our novel hierarchical, multi-modal model (TRANSMED) integrates baseline risk factors from the natural language processing of clinical notes at admission, time-series measurements of biomarkers obtained from laboratory tests, and discrete diagnostic, procedure and drug codes.

We demonstrate the alignment of TRANSMED’s predictions with well-established clinical knowledge about COVID-19 through univariate and multivariate risk factor driven sub-cohort analysis. TRANSMED’s superior performance over state-of-the-art methods shows that leveraging patient data across modalities and transferring prior knowledge from similar disorders is critical for accurate prediction of patient outcomes, and this approach may serve as an important tool in the early response to future pandemics.

Spark NLP: A Versatile Solution for Structuring Data from Endoscopy Reports

Authors: Ioanovici, Andrei Constantin; Măruşteri, Ştefan Marius; Feier, Andrei Marian; Trambitas-Miron, Alina Dia.

Artificial intelligence (AI) can be applied in the practice of gastroenterology to acquire and analyze information. Besides speed and duplicability, AI also has the potential of offering insights with results that surpass those of medical specialists. Natural language processing (NLP) is being used to extract information from text and to organize and categorize documents. Processing unstructured data with NLP results in structured data, from which medical codes (ICD-10, medical procedure codes, etc.) can be extracted more easily for reimbursement purposes, among other uses. Recent research is studying the use of AI for automated interpretation of text from endoscopy and medical documents for better quality and patient phenotyping, as well as enhanced detection and description of endoscopic lesions such as colon polyps. In this paper, we present a method of extracting medical data using Spark NLP (John Snow Labs, DE, USA) by annotating endoscopy reports and training a model to automatically extract labels in order to obtain structured medical data. This can be used in combination with other forms of structured data for optimal and novel patient profiling.

Tracking the Evolution of COVID-19 via Temporal Comorbidity Analysis from Multi-Modal Data

Presented at AMIA-2021 Annual Symposium.

Developing prediction models for emerging infectious diseases from relatively small numbers of cases is a critical need for improving pandemic preparedness. Using COVID-19 as an exemplar, we propose a transfer learning methodology for developing predictive models from multi-modal electronic healthcare records by leveraging information from more prevalent diseases with shared clinical characteristics. Our novel hierarchical, multi-modal model (TRANSMED) integrates baseline risk factors from the natural language processing of clinical notes at admission, time-series measurements of biomarkers obtained from laboratory tests, and discrete diagnostic, procedure and drug codes.

We demonstrate the alignment of TRANSMED’s predictions with well-established clinical knowledge about COVID-19 through univariate and multivariate risk factor driven sub-cohort analysis. TRANSMED’s superior performance over state-of-the-art methods shows that leveraging patient data across modalities and transferring prior knowledge from similar disorders is critical for accurate prediction of patient outcomes, and this approach may serve as an important tool in the early response to future pandemics.

Connecting the dots in clinical document understanding with Relation Extraction at scale

Authors: Hasham Ul Haq, Veysel Kocaman and David Talby

Easy-to-use, scalable NLP framework that can leverage Spark. Introduction of BERT-based Relation Extraction models. State-of-the-art performance on Named Entity Recognition and Relation Extraction. Reported SOTA performance on multiple public benchmark datasets. Application of these models to real-world use cases.

We present a text mining framework built on top of the Spark NLP library, comprising Named Entity Recognition (NER) and Relation Extraction (RE) models, which expands on previous work in three main ways. First, we release new RE model architectures that obtain state-of-the-art F1 scores on 5 out of 7 benchmark datasets. Second, we introduce a modular approach to train and stack multiple models in a single NLP pipeline in a production-grade library with little coding. Third, we apply these models in practical applications including knowledge graph generation, prescription parsing, and robust ontology mapping.
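
For low-latency scoring of individual documents, a fitted pipeline can be wrapped in a LightPipeline. The snippet below is a sketch: pipeline_model stands for any already fitted pipeline (such as the NER plus RE pipeline sketched earlier on this page), and the relation metadata keys are assumptions:

```python
# Sketch only: pipeline_model is assumed to be an already fitted PipelineModel that
# produces a "relations" output column; the metadata keys shown are assumptions.
from sparknlp.base import LightPipeline

light = LightPipeline(pipeline_model)
annotations = light.fullAnnotate("The patient was given aspirin 81 mg daily after the MI.")[0]
for rel in annotations.get("relations", []):
    print(rel.result, rel.metadata.get("chunk1"), "->", rel.metadata.get("chunk2"))
```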

Spark NLP: Natural Language Understanding at Scale

Authors: Veysel Kocaman and David Talby

Accepted for publication in Software Impacts (Elsevier).

Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100 pretrained pipelines and models in more than 192 languages. It supports nearly all the NLP tasks and modules, which can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing ninefold growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world's most widely used NLP library in the enterprise.
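
To give a sense of the API surface, a pretrained pipeline can be downloaded and run in a few lines; "explain_document_dl" is one of the publicly listed English pipelines:

```python
# Sketch only: downloads and runs one of the public pretrained English pipelines.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships hundreds of pretrained pipelines and models.")
print(result["entities"])
```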

Understanding COVID-19 News Coverage using Medical NLP

Authors: Ali Emre Varol, Veysel Kocaman, Hasham Ul Haq, David Talby

Proceedings of the Text2Story’22 Workshop, Stavanger (Norway).

Being a global pandemic, the COVID-19 outbreak received worldwide media attention. In this study, we analyze news publications from CNN and The Guardian – two of the world's most influential media organizations. The dataset includes more than 36,000 articles, analyzed using the clinical and biomedical Natural Language Processing (NLP) models from the Spark NLP for Healthcare library, which enables a deeper analysis of medical concepts than previously achieved. The analysis covers key entities and phrases, observed biases, and changes over time in news coverage by correlating mined medical symptoms, procedures, drugs, and guidance with commonly mentioned demographic and occupational groups. Another analysis is of extracted Adverse Drug Events about drug and vaccine manufacturers, which, when reported by major news outlets, have an impact on vaccine hesitancy.

Social Media Mining for Health (#SMM4H) with Spark NLP.

Authors: Veysel Kocaman, Cabir Celik, Damla Gurbaz, Gursev Pirge, Bunyamin Polat, Halil Saglamlar, Meryem Vildan Sarikaya, Gokhan Turer, and David Talby

Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 44–47, Gyeongju, Republic of Korea.

Social media has become a major source of information for healthcare professionals, but due to the growing volume of data in unstructured format, analyzing these resources accurately has become a challenge. In this study, we trained health-related NER and text classification models on different datasets published within the Social Media Mining for Health Applications (#SMM4H 2022) workshop.

We utilized transformer-based BERT for token classification and BERT for sequence classification algorithms, as well as vanilla NER and text classification algorithms from the Spark NLP library, during this study without changing the underlying DL architecture. The trained models are available within a production-grade code base as part of the Spark NLP library; can scale up for training and inference in any Spark cluster; have GPU support; and provide libraries for popular programming languages such as Python, R, Scala and Java.
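
A minimal sketch of loading a transformer-based sequence classifier in open-source Spark NLP, the annotator family used in this study; calling pretrained() without arguments loads the default English model, and the SMM4H task-specific models would be loaded by their (assumed) names instead:

```python
# Sketch only: loads the default pretrained sequence classifier; a task-specific
# SMM4H model would be loaded by name instead (names assumed, not listed in this abstract).
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
classifier = BertForSequenceClassification.pretrained() \
    .setInputCols(["document", "token"]).setOutputCol("class")

pipeline = Pipeline(stages=[document, token, classifier])
```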

Biomedical Named Entity Recognition in Eight Languages with Zero Code Changes

Authors: Veysel Kocaman, Gursev Pirge, Bunyamin Polat, David Talby

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2022), A Coruña, Spain, September 2022.

Named entity recognition (NER) is one of the most important building blocks of NLP tasks in the medical domain, extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Due to the growing volume of healthcare data in unstructured format, an increasingly important challenge is providing high-accuracy implementations of state-of-the-art deep learning (DL) algorithms at scale. On the other hand, when it comes to low-resource languages, collecting high-quality annotated datasets in the biomedical domain is still a big challenge. In this study, we train production-grade biomedical NER models on eight different biomedical datasets published within the LivingNER competition [1].

Transformer-based BERT for token classification and BiLSTM-CNN-Char based NER algorithms from the Spark NLP library are utilized during this study, and we trained 28 different NER models in total with decent accuracies (0.9243 F1 test score in Spanish) without changing the underlying DL architecture. The trained models are available within a production-grade code base as part of the Spark NLP library; can scale up for training and inference in any Spark cluster; have GPU support; and provide libraries for popular programming languages such as Python, R, Scala and Java.