
Natural Language Processing for Historical Texts


Digitization of historical paper documents is motivated by the aim of preserving cultural heritage and making it more accessible to both laypeople and scholars. Because digital images cannot be searched for text, digitization projects increasingly strive to create digital text, which can be searched and processed automatically. Indeed, the emerging field of digital humanities (see Caroline Sporleder's tutorial) relies heavily on the availability of digital text. More and more historical texts are thus becoming available in digital form.

As more historical texts become available in digital form, interest in applying natural language processing (NLP) methods and tools to them is growing. However, the specific linguistic properties of historical texts, in particular the lack of standardized orthography, pose special challenges for NLP.

This tutorial aims to give an introduction to NLP for historical texts and an overview of the state of the art in this field.  We will start with an overview of methods for the acquisition of historical texts (scanning and OCR), discuss text encoding and annotation schemes, and present examples of corpora of historical texts in a variety of languages.  We will then look at approaches to specific challenges, such as the creation of part-of-speech taggers for historical languages or the handling of spelling variation.  We will also talk about the relationship between NLP and the digital humanities.


Michael Piotrowski
