Unsupervised grammar induction of clinical report sublanguage

<p>Abstract</p> <p>Background</p> <p>Clinical reports are written using a subset of natural language while employing many domain-specific terms; such a language is also known as a sublanguage for a scientific or a technical domain. Different genres of clinical reports u...

Full description

Bibliographic Details
Main Author: Kate Rohit J
Format: Article
Language:English
Published: BMC 2012-10-01
Series:Journal of Biomedical Semantics
Online Access:http://www.jbiomedsem.com/content/3/S3/S4
Description
Summary:<p>Abstract</p> <p>Background</p> <p>Clinical reports are written using a subset of natural language while employing many domain-specific terms; such a language is also known as a sublanguage for a scientific or a technical domain. Different genres of clinical reports use different sublaguages, and in addition, different medical facilities use different medical language conventions. This makes supervised training of a parser for clinical sentences very difficult as it would require expensive annotation effort to adapt to every type of clinical text.</p> <p>Methods</p> <p>In this paper, we present an unsupervised method which automatically induces a grammar and a parser for the sublanguage of a given genre of clinical reports from a corpus with no annotations. In order to capture sentence structures specific to clinical domains, the grammar is induced in terms of semantic classes of clinical terms in addition to part-of-speech tags. Our method induces grammar by minimizing the combined encoding cost of the grammar and the corresponding sentence derivations. The probabilities for the productions of the induced grammar are then learned from the unannotated corpus using an instance of the expectation-maximization algorithm.</p> <p>Results</p> <p>Our experiments show that the induced grammar is able to parse novel sentences. Using a dataset of discharge summary sentences with no annotations, our method obtains 60.5% F-measure for parse-bracketing on sentences of maximum length 10. By varying a parameter, the method can induce a range of grammars, from very specific to very general, and obtains the best performance in between the two extremes.</p>
ISSN:2041-1480