Domain Adaptation in Corpus Linguistics
Christian PĂ¶litz (TU Dortmund)
Keywords: domain adaptation, frequency distributions, corpus similarity
An interesting application of domain adaptation arises in corpus linguistics. When we have large corpora and know that many of the texts are irrelevant to the analysis of a given linguistic hypothesis, such texts should ideally be removed or filtered out. This can be done by training a classifier on texts labeled as relevant or irrelevant. Since the corpora are large, labeling texts to train the classifier is expensive. An idea from transfer learning is to take a classifier trained on one corpus and adapt it so that it can also be applied to the other corpora. In this way, we need to train a classifier only once, on a single corpus, and can then use it on the whole collection of corpora we want to analyze.
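A minimal sketch of such a relevance filter, assuming a bag-of-words representation and a simple logistic-regression classifier trained by gradient descent (the vocabulary, counts, and labels below are purely illustrative):

```python
import numpy as np

# Toy bag-of-words matrix: rows are texts, columns are vocabulary terms
# (hypothetical counts; in practice these come from the corpus).
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 2],
              [0, 3, 3]], dtype=float)
y = np.array([1, 1, 0, 0])  # 1 = relevant, 0 = irrelevant

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Plain batch gradient descent for logistic regression."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(relevant)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_logreg(X, y)
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
keep = scores > 0.5  # texts predicted relevant are kept for analysis
```

The trained classifier would then be applied to unlabeled texts of the same corpus; the question the abstract raises is what happens when it is applied to a differently distributed corpus.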
If the distribution of the phenomenon under the hypothesis we want to test is the same across the corpora, and the distributions of the texts are the same as well, we can expect a classifier trained on only one corpus to perform equally well on all corpora.
It becomes more difficult when these conditions are not met. In such cases we need to take special measures to adapt the classifier.
We investigate the case where we expect the texts to be distributed differently across the corpora, while the conditional probability that a text is irrelevant, and should therefore be removed, given the text itself is the same in all corpora. This assumption is known as covariate shift and can be checked by statistical tests such as the Chi-squared test.
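Such a check can be sketched as a Chi-squared test of homogeneity on relevance counts from the two corpora (the counts below are made up for illustration):

```python
# 2x2 contingency table of (relevant, irrelevant) counts per corpus.
# These counts are hypothetical.
counts = {"A": (40, 60), "B": (35, 65)}

def chi2_statistic(table):
    """Pearson's Chi-squared statistic for a table of observed counts."""
    rows = list(table.values())
    row_sums = [sum(r) for r in rows]
    col_sums = [sum(c) for c in zip(*rows)]
    total = sum(row_sums)
    stat = 0.0
    for rs, row in zip(row_sums, rows):
        for cs, obs in zip(col_sums, row):
            expected = rs * cs / total
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi2_statistic(counts)
# With 1 degree of freedom the 5% critical value is about 3.84; for these
# toy counts the statistic stays below it, so we would not reject the
# hypothesis that the relevance proportions are the same in both corpora.
```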
Concretely, we will investigate how such a classifier trained on a corpus A can be adapted to perform well on a corpus B. This can be done by importance sampling: re-weighting the texts in corpus A so that their mass distribution matches that of the texts in corpus B.
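The re-weighting idea can be sketched with self-normalized importance sampling in one dimension, where both text distributions are stipulated to be known Gaussians (a toy stand-in for the unknown corpus densities):

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu, sigma=1.0):
    """Density of a univariate Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Corpus A texts: feature distributed as N(0, 1); corpus B: N(1, 1).
xs_a = [random.gauss(0.0, 1.0) for _ in range(20000)]

# Importance weights w(x) = p_B(x) / p_A(x), the density ratio.
weights = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0) for x in xs_a]

# Self-normalized importance-sampling estimate of E_B[x] from A samples.
est = sum(w * x for w, x in zip(weights, xs_a)) / sum(weights)
# est approximates 1.0, the mean under corpus B's distribution,
# even though every sample was drawn from corpus A's distribution.
```

In the same way, weighting the training loss of the classifier on corpus A by the density ratio makes its expected error approximate the error on corpus B.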
A critical part of such an approach is that it depends on the ratio of the densities of the text distributions in the two corpora. This means we need to estimate the densities p(text in corpus A) and p(text in corpus B), or their ratio directly.
Unfortunately, texts are usually modeled as elements of a vector space model, a high-dimensional space in which typically only sparse samples are available, so density estimation is difficult. We briefly discuss methods that estimate the ratio directly, as well as possibilities for estimating the densities themselves.
We plan to investigate different methods to estimate the density ratio. For instance, we can model the ratio directly as a log-linear model, find an optimal re-weighting of one density by minimizing the KL divergence to the other, or map the densities into a reproducing kernel Hilbert space (RKHS) and use techniques such as Kernel Mean Matching to find the optimal ratio.
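A sketch of the Kernel Mean Matching idea with a Gaussian kernel, here solving for the weights by projected gradient descent rather than the quadratic program of the original formulation (the data and all parameter choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "text features": corpus A ~ N(0, 1), corpus B ~ N(1, 1).
xa = rng.normal(0.0, 1.0, size=200)
xb = rng.normal(1.0, 1.0, size=200)

def rbf(u, v, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two 1-D sample vectors."""
    d = u[:, None] - v[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

K = rbf(xa, xa)  # kernel among corpus A samples
kappa = (len(xa) / len(xb)) * rbf(xa, xb).sum(axis=1)

# KMM objective: minimize 0.5 w^T K w - kappa^T w over weights w,
# subject to 0 <= w <= B_max, so the weighted kernel mean of A
# matches the kernel mean of B in the RKHS.
w = np.ones(len(xa))
step = 1.0 / len(xa)
for _ in range(500):
    grad = K @ w - kappa
    w = np.clip(w - step * grad, 0.0, 10.0)

# The learned weights shift corpus A's mass toward corpus B:
unweighted_mean = xa.mean()          # near 0, the mean of A
weighted_mean = (w * xa).sum() / w.sum()  # pulled toward 1, the mean of B
```

The resulting weights would then serve as sample weights when retraining the relevance classifier on corpus A, as described above.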