LEADER 03411nam a2200697Ia 4500
001    10.1109-JBHI.2021.3062322
008    220427s2021 CNT 000 0 und d
022    |a 2168-2194 (ISSN)
245 10 |a Limitations of Transformers on Clinical Text Classification
260  0 |b Institute of Electrical and Electronics Engineers Inc. |c 2021
856    |z View Fulltext in Publisher |u https://doi.org/10.1109/JBHI.2021.3062322
520 3  |a Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures - a word-level convolutional neural network and a hierarchical self-attention network - and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT - pretraining and WordPiece tokenization - may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text. © 2013 IEEE.
650 04 |a Article
650 04 |a artificial neural network
650 04 |a attention network
650 04 |a BERT
650 04 |a bidirectional encoder representations from transformer
650 04 |a Classification (of information)
650 04 |a clinical text
650 04 |a Clinical text classifications
650 04 |a convolutional neural network
650 04 |a Convolutional neural networks
650 04 |a deep learning
650 04 |a Discharge summary
650 04 |a Document Classification
650 04 |a histology
650 04 |a human
650 04 |a Humans
650 04 |a ICD-9
650 04 |a Information retrieval systems
650 04 |a Input sequence
650 04 |a learning algorithm
650 04 |a machine learning
650 04 |a mathematical model
650 04 |a natural language processing
650 04 |a Natural Language Processing
650 04 |a Natural language processing systems
650 04 |a neural networks
650 04 |a Neural Networks, Computer
650 04 |a Pre-training
650 04 |a signal noise ratio
650 04 |a State of the art
650 04 |a text classification
650 04 |a Text processing
650 04 |a Tokenization
700 1  |a Alawad, M. |e author
700 1  |a Coyle, L. |e author
700 1  |a Doherty, J. |e author
700 1  |a Durbin, E.B. |e author
700 1  |a Gao, S. |e author
700 1  |a Gounley, J. |e author
700 1  |a Schaefferkoetter, N. |e author
700 1  |a Stroup, A. |e author
700 1  |a Tourassi, G. |e author
700 1  |a Wu, X.-C. |e author
700 1  |a Yoon, H.J. |e author
700 1  |a Young, M.T. |e author
773    |t IEEE Journal of Biomedical and Health Informatics