Learning Topic Models With Arbitrary Loss

Bibliographic Details
Main Authors: Murat Apishev, Konstantin Vorontsov
Format: Article
Language: English
Published: FRUCT 2020-04-01
Series: Proceedings of the XXth Conference of Open Innovations Association FRUCT
Online Access: https://www.fruct.org/publications/fruct26/files/Api.pdf
Description
Summary: Topic modeling is an area of text analysis that has been actively developing over the past 20 years. A probabilistic topic model (PTM) finds a set of hidden topics in a collection of text documents. It defines each topic as a probability distribution over words and describes each document as a probability mixture of topic distributions. Learning algorithms for topic models are usually based on Bayesian inference or log-likelihood maximization; in both cases, EM-like algorithms are used. In this paper, we propose replacing the logarithm in the log-likelihood with an arbitrary smooth loss function. We prove that such a modification preserves the structure of the E- and M-steps of the algorithm in terms of additive regularization of topic models (ARTM). Moreover, in the case of a linear loss, the E-step becomes much faster because the normalization step is omitted. We study combinations of the fast and usual E-steps and compare them to regularization using different numbers of topics, in both the offline and online versions of the EM-algorithm. For an empirical comparison of the algorithms, we estimate perplexity, coherence, and learning time. We use an efficient parallel implementation of the EM-algorithm from the BigARTM open-source library. We show that in all cases the winner is the two-stage strategy that uses fast E-steps at the beginning of the iterations and then proceeds with the usual E-steps.
ISSN: 2305-7254, 2343-0737
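
To make the idea in the summary concrete, below is a minimal, self-contained NumPy sketch of an EM loop for a basic PLSA-style topic model. It contrasts the usual E-step, which normalizes p(t|d,w) over topics, with a "fast" E-step that simply skips this normalization, scheduled in the two-stage fashion the abstract reports as the winner. The function em_plsa, its parameters, and the exact form of the fast step are assumptions made for illustration; this is not the authors' code and does not use the actual BigARTM API.

import numpy as np

def em_plsa(n_dw, n_topics, n_iters=20, fast_iters=5, seed=0):
    # Illustrative sketch (not the paper's algorithm): EM for a PLSA-style
    # topic model with a two-stage schedule of fast and usual E-steps.
    # n_dw : (D, W) array of word counts per document.
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    # Random stochastic initialization of Phi (topics x words)
    # and Theta (documents x topics).
    phi = rng.random((n_topics, W))
    phi /= phi.sum(axis=1, keepdims=True)
    theta = rng.random((D, n_topics))
    theta /= theta.sum(axis=1, keepdims=True)

    for it in range(n_iters):
        fast = it < fast_iters  # two-stage strategy: fast E-steps first
        n_wt = np.zeros((n_topics, W))
        n_td = np.zeros((D, n_topics))
        for d in range(D):
            # Topic weights for every (t, w): p_tdw proportional to phi_wt * theta_td.
            p = theta[d][:, None] * phi  # shape (T, W)
            if not fast:
                # Usual E-step: normalize over topics so p(t|d,w) sums to 1.
                # The "fast" E-step omits exactly this division.
                p /= np.maximum(p.sum(axis=0, keepdims=True), 1e-12)
            # M-step accumulators: distribute observed counts over topics.
            weighted = p * n_dw[d][None, :]
            n_wt += weighted
            n_td[d] = weighted.sum(axis=1)
        # M-step: renormalize the accumulated counts into distributions.
        phi = n_wt / np.maximum(n_wt.sum(axis=1, keepdims=True), 1e-12)
        theta = n_td / np.maximum(n_td.sum(axis=1, keepdims=True), 1e-12)
    return phi, theta

On a toy corpus one could call phi, theta = em_plsa(n_dw, n_topics=10), where n_dw is a documents-by-words count matrix; the experiments reported in the paper instead use BigARTM's efficient parallel EM implementation, and the paper itself gives the exact ARTM-regularized update formulas that this sketch only approximates.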