Limitations of Transformers on Clinical Text Classification

Bibliographic Details
Main Authors: Alawad, M. (Author), Coyle, L. (Author), Doherty, J. (Author), Durbin, E.B. (Author), Gao, S. (Author), Gounley, J. (Author), Schaefferkoetter, N. (Author), Stroup, A. (Author), Tourassi, G. (Author), Wu, X.-C. (Author), Yoon, H.J. (Author), Young, M.T. (Author)
Format: Article
Language: English
Published: Institute of Electrical and Electronics Engineers Inc. 2021
Online Access: View Fulltext in Publisher
LEADER 03411nam a2200697Ia 4500
001 10.1109-JBHI.2021.3062322
008 220427s2021 CNT 000 0 eng d
022 |a 2168-2194 (ISSN) 
245 1 0 |a Limitations of Transformers on Clinical Text Classification 
260 0 |b Institute of Electrical and Electronics Engineers Inc.  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1109/JBHI.2021.3062322 
520 3 |a Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures - a word-level convolutional neural network and a hierarchical self-attention network - and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT - pretraining and WordPiece tokenization - may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text. © 2013 IEEE. 
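The abstract above notes BERT's hard input-length ceiling and the four scaling methods the paper introduces; the record does not detail those methods, so what follows is only a minimal sketch of one common workaround (overlapping-window chunking with logit averaging), assuming the HuggingFace transformers library and a hypothetical fine-tuned checkpoint name.

```python
# Minimal sketch, NOT the paper's four methods (this record does not detail
# them): classify a clinical note longer than BERT's input limit by
# splitting it into overlapping 512-token windows and averaging the
# per-window logits.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "my-finetuned-clinical-bert"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def classify_long_document(text: str) -> int:
    # return_overflowing_tokens=True splits the encoding into successive
    # 512-token windows; stride=128 makes adjacent windows overlap so no
    # phrase is cut off at a hard boundary.
    enc = tokenizer(
        text,
        max_length=512,
        stride=128,
        truncation=True,
        return_overflowing_tokens=True,
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).logits
    # Pool the window-level predictions: average logits, then argmax.
    return int(logits.mean(dim=0).argmax())
```

The abstract also singles out WordPiece tokenization as a possible culprit: domain terms absent from a general-purpose vocabulary shatter into many subword pieces, diluting the few key words that drive the label. A quick way to observe this, using the stock bert-base-uncased vocabulary and illustrative clinical terms not drawn from the paper:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for term in ["adenocarcinoma", "hepatosplenomegaly", "discharge"]:
    # Rare clinical terms fragment into several '##'-prefixed pieces
    # (the exact pieces depend on the vocabulary), while a common word
    # like "discharge" survives intact.
    print(term, "->", tokenizer.tokenize(term))
```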
650 0 4 |a Article 
650 0 4 |a artificial neural network 
650 0 4 |a attention network 
650 0 4 |a BERT 
650 0 4 |a bidirectional encoder representations from transformers 
650 0 4 |a Classification (of information) 
650 0 4 |a clinical text 
650 0 4 |a Clinical text classifications 
650 0 4 |a convolutional neural network 
650 0 4 |a Convolutional neural networks 
650 0 4 |a deep learning 
650 0 4 |a Discharge summary 
650 0 4 |a Document Classification 
650 0 4 |a histology 
650 0 4 |a human 
650 0 4 |a Humans 
650 0 4 |a ICD-9 
650 0 4 |a Information retrieval systems 
650 0 4 |a Input sequence 
650 0 4 |a learning algorithm 
650 0 4 |a machine learning 
650 0 4 |a mathematical model 
650 0 4 |a natural language processing 
650 0 4 |a Natural Language Processing 
650 0 4 |a Natural language processing systems 
650 0 4 |a neural networks 
650 0 4 |a Neural Networks, Computer 
650 0 4 |a Pre-training 
650 0 4 |a signal noise ratio 
650 0 4 |a State of the art 
650 0 4 |a text classification 
650 0 4 |a Text processing 
650 0 4 |a Tokenization 
700 1 |a Alawad, M.  |e author 
700 1 |a Coyle, L.  |e author 
700 1 |a Doherty, J.  |e author 
700 1 |a Durbin, E.B.  |e author 
700 1 |a Gao, S.  |e author 
700 1 |a Gounley, J.  |e author 
700 1 |a Schaefferkoetter, N.  |e author 
700 1 |a Stroup, A.  |e author 
700 1 |a Tourassi, G.  |e author 
700 1 |a Wu, X.-C.  |e author 
700 1 |a Yoon, H.J.  |e author 
700 1 |a Young, M.T.  |e author 
773 |t IEEE Journal of Biomedical and Health Informatics