Learning Topic Models With Arbitrary Loss
Format: | Article |
Language: | English |
Published: | FRUCT, 2020-04-01 |
Series: | Proceedings of the XXth Conference of Open Innovations Association FRUCT |
Online Access: | https://www.fruct.org/publications/fruct26/files/Api.pdf |
Summary: | Topic modeling is an area of text analysis that has been actively developing over the past 20 years. A probabilistic topic model (PTM) finds a set of hidden topics in a collection of text documents. It defines each topic as a probability distribution over words and describes each document as a probability mixture of topic distributions. Learning algorithms for topic models are usually based on Bayesian inference or log-likelihood maximization; in both cases, EM-like algorithms are used. In this paper, we propose to replace the logarithm in the log-likelihood with an arbitrary smooth loss function. We prove that such a modification preserves the structure of the E- and M-steps of the algorithm in terms of additive regularization of topic models (ARTM). Moreover, in the case of a linear loss, the E-step becomes much faster because the normalization can be omitted. We study combinations of the fast and usual E-steps and compare them to regularization with different numbers of topics, in both offline and online versions of the EM-algorithm. For an empirical comparison of the algorithms, we estimate perplexity, coherence, and learning time. We use an efficient parallel implementation of the EM-algorithm from the BigARTM open-source library. We show that in all cases the winning approach is a two-stage strategy that uses fast E-steps at the beginning of the iterations and then proceeds with the usual E-steps. |
ISSN: | 2305-7254, 2343-0737 |
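
The summary describes an EM-style learning scheme: the usual E-step computes topic responsibilities p(t|d,w) proportional to phi_wt * theta_td normalized over topics, the linear-loss variant skips that normalization, and the reported winning schedule runs fast E-steps first and usual E-steps afterwards. The NumPy sketch below illustrates one plausible reading of that scheme on a dense count matrix; the function `em_topic_model`, its `fast_iters` switch, and the toy data layout are illustrative assumptions, not BigARTM's API or the paper's exact algorithm.

```python
import numpy as np

def em_topic_model(n_dw, n_topics, n_iters=30, fast_iters=10, seed=0):
    """Toy offline EM for a PLSA-style topic model (illustrative sketch).

    n_dw       : (D, W) array of word counts per document.
    fast_iters : number of initial passes using the "fast" E-step, where the
                 per-token responsibilities are left unnormalized (as a linear
                 loss suggests); the remaining passes normalize over topics.
    Returns phi (W, T) with columns p(w|t) and theta (T, D) with columns p(t|d).
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    phi = rng.dirichlet(np.ones(W), size=n_topics).T      # columns are p(w|t)
    theta = rng.dirichlet(np.ones(n_topics), size=D).T    # columns are p(t|d)

    for it in range(n_iters):
        n_wt = np.zeros((W, n_topics))                    # topic-word counters
        n_td = np.zeros((n_topics, D))                    # doc-topic counters
        for d in range(D):
            for w in np.nonzero(n_dw[d])[0]:
                p = phi[w] * theta[:, d]                  # unnormalized p(t|d,w)
                if it >= fast_iters:                      # usual E-step: normalize over topics
                    p = p / p.sum()
                p = n_dw[d, w] * p
                n_wt[w] += p                              # accumulate for the M-step
                n_td[:, d] += p
        # M-step: renormalize the counters into probability distributions
        phi = n_wt / (n_wt.sum(axis=0, keepdims=True) + 1e-12)
        theta = n_td / (n_td.sum(axis=0, keepdims=True) + 1e-12)
    return phi, theta

# Hypothetical usage on a tiny random corpus:
# counts = np.random.default_rng(1).integers(0, 5, size=(200, 1000))
# phi, theta = em_topic_model(counts, n_topics=10, n_iters=30, fast_iters=10)
```

With `fast_iters=0` this reduces to plain PLSA-style EM, and with `fast_iters=n_iters` every pass skips the normalization over topics. The paper's actual experiments use regularized (ARTM) updates and BigARTM's parallel offline and online implementations, which this sketch does not reproduce.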