Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese--Japanese Wikipedia

Chu, Chenhui; Nakazawa, Toshiaki; Kurohashi, Sadao


oai:repository.kulib.kyoto-u.ac.jp:2433/265843


Authors: Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Publication date: 1 February 2016
Publisher: Association for Computing Machinery (ACM)

Abstract

Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more widely available, many studies have been conducted to extract either parallel sentences or fragments from them for SMT. In this article, we propose an integrated system to extract both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences from comparable sentences. We then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a parallel sentence candidate filter and classifier for parallel sentence identification. We improve it by proposing a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to accurately extract parallel fragments from comparable sentences. We propose an accurate parallel fragment extraction method that uses an alignment model to locate the parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. A case study on the Chinese--Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods, and the parallel data extracted by our system significantly improves SMT performance.
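The abstract mentions a parallel sentence candidate filter but gives no details of it. As a rough illustration of what such a filter can look like in general, the sketch below keeps a sentence pair only if enough source words have a dictionary translation in the target sentence. This is not the authors' filtering strategy: the word-overlap criterion, the toy lexicon, the names `overlap_ratio` and `filter_candidates`, and the 0.5 threshold are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

SentencePair = Tuple[List[str], List[str]]

# Hypothetical toy lexicon (Chinese word -> possible Japanese translations).
# A real system would rely on a much larger bilingual dictionary.
LEXICON: Dict[str, Set[str]] = {
    "东京": {"東京"},
    "大学": {"大学"},
    "图书馆": {"図書館"},
}


def overlap_ratio(src: List[str], tgt: List[str],
                  lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with at least one lexicon translation in tgt."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    hits = sum(1 for word in src if lexicon.get(word, set()) & tgt_set)
    return hits / len(src)


def filter_candidates(pairs: List[SentencePair],
                      lexicon: Dict[str, Set[str]],
                      threshold: float = 0.5) -> List[SentencePair]:
    """Keep pairs whose overlap ratio reaches the (assumed) threshold."""
    return [(s, t) for s, t in pairs if overlap_ratio(s, t, lexicon) >= threshold]


# Pre-tokenized example sentences (illustrative only).
pairs: List[SentencePair] = [
    (["东京", "大学", "图书馆"], ["東京", "大学", "の", "図書館"]),
    (["东京", "天气"], ["図書館", "です"]),
]
kept = filter_candidates(pairs, LEXICON)
```

In this toy run only the first pair survives, since none of the source words in the second pair has a lexicon translation on the target side. In a pipeline of the kind the abstract describes, pairs passing such a filter would then go to a classifier rather than being accepted directly.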



Last time updated on 23/12/2021

This paper was published in Kyoto University Research Information Repository.
