Exploring Time Dynamics in Large Corpora
Patrick Jänichen, University of Leipzig
Keywords: time dynamics, large corpora, topic models, topic shift
This contribution reviews a family of algorithms known as topic models. In particular, we describe the most basic model, Latent Dirichlet Allocation (LDA), and an extension of it, the Continuous Time Dynamic Topic Model (CTM).
Further, we propose a generalization of the latter to model and analyze the changes a certain theme undergoes in large diachronic corpora. While our current work relies mainly on newspaper texts of this and the last century, it would be of particular interest to apply these kinds of models to data arising in the Digital Humanities community.
Topic models are a group of algorithms used to explore and analyze large document corpora. The basic assumption is that each document in a text collection is a mixture of topics drawn from a global set specific to the corpus in question, i.e. a document can be described by a probability distribution over the set of topics.
Each topic, in turn, is a distribution over the vocabulary of the corpus, indicating how probable each word is under that topic.
This assumption allows a massive reduction in dimensionality: the representation of a document shrinks from the size of the vocabulary (in a frequency coding scheme) to the number of topics, which is far smaller.
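As a toy numerical sketch of this representation (the vocabulary size, topic count, and Dirichlet parameters below are made-up values, not taken from the model described here):

```python
import numpy as np

rng = np.random.default_rng(0)

V = 1000   # vocabulary size (illustrative)
K = 20     # number of topics (illustrative)

# Each topic is a probability distribution over the vocabulary.
topics = rng.dirichlet(np.full(V, 0.1), size=K)    # shape (K, V)

# Each document is a probability distribution over the topics.
doc_topic = rng.dirichlet(np.full(K, 0.5))         # shape (K,)

# The document's word distribution is the mixture of its topics.
doc_word = doc_topic @ topics                      # shape (V,)

# Dimensionality drops from V (frequency coding) to K.
```

The document is now summarized by the K-dimensional vector `doc_topic` rather than a V-dimensional frequency vector.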
Our primary goal is to track the dynamics such a topic undergoes when analyzing diachronic data. The CTM proposes a framework in which time is, quite naturally, viewed as a continuous variable. In this model, topics as distributions over words follow a Brownian motion, i.e. topic k at time t is drawn from a multivariate Gaussian distribution whose mean is topic k at the previous time stamp and whose variance is proportional to the time elapsed between the two.
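This drift can be sketched as follows, with hypothetical parameter values: the topic evolves in an unconstrained natural-parameter space and is mapped back to a word distribution via a softmax, as is common in dynamic topic models.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V = 50                        # small vocabulary (illustrative)
var = 0.05                    # diffusion variance per unit time (assumed)
times = [0.0, 0.7, 1.5, 3.0]  # irregular, continuous time stamps

beta = rng.normal(size=V)     # natural parameters of one topic at t = 0
trajectory = [softmax(beta)]
for prev, t in zip(times, times[1:]):
    dt = t - prev
    # Draw from a Gaussian centred on the previous state, with
    # variance proportional to the elapsed time (Brownian motion).
    beta = rng.normal(loc=beta, scale=np.sqrt(var * dt))
    trajectory.append(softmax(beta))
```

Note that the irregular spacing of `times` poses no problem, which is exactly the appeal of treating time as continuous.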
We propose to recast this as an underlying Gaussian process parameterized by a specific covariance function that governs the time dynamics of the model. In this framework, other stochastic processes can be adopted simply by defining appropriate covariance functions.
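As a sketch under this view, swapping the stochastic process amounts to swapping the covariance function. The Brownian-motion and stationary Ornstein-Uhlenbeck kernels below are standard results; the observation times and parameter values are illustrative:

```python
import numpy as np

def k_brownian(s, t, var=1.0):
    # Brownian-motion covariance: Cov(B_s, B_t) = var * min(s, t)
    return var * np.minimum.outer(s, t)

def k_ou(s, t, theta=1.0, sigma=1.0):
    # Stationary Ornstein-Uhlenbeck covariance:
    # Cov(X_s, X_t) = sigma^2 / (2*theta) * exp(-theta * |s - t|)
    return sigma**2 / (2 * theta) * np.exp(
        -theta * np.abs(np.subtract.outer(s, t)))

# Irregularly spaced observation times, e.g. newspaper issue dates.
times = np.array([0.5, 1.0, 2.5, 4.0])

# Changing the assumed dynamics only changes this matrix.
K_bm = k_brownian(times, times)
K_ou = k_ou(times, times)

# One coordinate of a topic trajectory is then a single draw
# from the corresponding multivariate Gaussian.
rng = np.random.default_rng(2)
path = rng.multivariate_normal(np.zeros(len(times)), K_bm)
```

All model-specific behavior is thus isolated in the kernel, which is what makes the generalization attractive.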
The choice of Brownian motion as the driving stochastic process is a natural one when virtually nothing is known about the behavior of the topics over time, although the assumption of smoothness appears to be a bit contrived.
We suggest using the Ornstein-Uhlenbeck process instead, which is mean-reverting and carries a term describing the volatility of each individual topic.
The idea behind this is to identify topics that differ strongly from their long-term mean and thus exhibit a high volatility.
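A minimal Euler-Maruyama sketch of Ornstein-Uhlenbeck dynamics (all parameter values below are illustrative, not estimated from data) shows how the volatility term separates stable from volatile topics:

```python
import numpy as np

rng = np.random.default_rng(3)

def ou_path(mu, theta, sigma, x0, dt, steps):
    """Euler-Maruyama discretisation of dX = theta*(mu - X) dt + sigma dW.

    mu    : long-term mean the process reverts to
    theta : speed of mean reversion
    sigma : volatility -- large values mark 'interesting' topics
    """
    x = np.empty(steps + 1)
    x[0] = x0
    for i in range(steps):
        x[i + 1] = (x[i] + theta * (mu - x[i]) * dt
                    + sigma * np.sqrt(dt) * rng.normal())
    return x

# A stable topic (low sigma) vs. a volatile one (high sigma).
calm = ou_path(mu=0.0, theta=0.8, sigma=0.1, x0=1.0, dt=0.01, steps=1000)
volatile = ou_path(mu=0.0, theta=0.8, sigma=1.5, x0=1.0, dt=0.01, steps=1000)
```

Both paths are pulled back toward `mu`, but the volatile one fluctuates far more widely around it, which is the signal we propose to exploit.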
Taking into account the theory that a change in contextual structures co-occurs with a change in meaning, we expect such topics to be particularly "interesting".
Other candidate processes, although not yet tested, include the Poisson, compound Poisson, and affine processes (all of which are Lévy processes), which allow jumps in the topic distributions. Given the nature of texts, this seems a plausible assumption, as new or recurring topics can appear abruptly.
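To illustrate the jump behavior, a compound Poisson path stays flat between Poisson-distributed arrival times and moves by a random amount at each arrival; the Gaussian jump sizes and all parameter values here are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

def compound_poisson_path(rate, jump_scale, dt, steps):
    """Compound Poisson process: jump counts per interval are Poisson,
    jump sizes are Gaussian (an illustrative choice)."""
    x = np.zeros(steps + 1)
    for i in range(steps):
        n_jumps = rng.poisson(rate * dt)
        # Sum of an empty array is 0.0, so most intervals are flat.
        x[i + 1] = x[i] + rng.normal(0.0, jump_scale, size=n_jumps).sum()
    return x

path = compound_poisson_path(rate=0.5, jump_scale=2.0, dt=0.1, steps=200)
```

In contrast to the diffusion models above, such a path models a topic that lies dormant and then shifts abruptly when a theme newly appears or recurs.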