Rapid Resource Transfer for Multilingual Natural Language Processing

Kolak, Okan

Repository landing page

oai:drum.lib.umd.edu:1903/3182

Rapid Resource Transfer for Multilingual Natural Language Processing

Authors: Okan Kolak
Publication date: 2 December 2005
Publisher

Abstract

Until recently the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change to the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually. Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was desgined for a specific language and using that OCR system for a language with a similar script. We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method. Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora. We also show that a reasonable quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data

Similar works

Full text

Open in the Core reader

Download PDF

Digital Repository at the University of Maryland

oai:drum.lib.umd.edu:1903/3182

Last time updated on 12/11/2016

This paper was published in Digital Repository at the University of Maryland.

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.