Language identification for South African Bantu languages Using Rank Order Statistics

Dube, Meluleki; Suleman, Hussein

oai:pubs.cs.uct.ac.za:1334

Authors: Meluleki Dube, Hussein Suleman
Publication date: 1 January 2019
Publisher: Springer Fachmedien Wiesbaden GmbH

Abstract

Language identification is an important preprocessing step in many data management, information retrieval and transformation systems. However, Bantu languages are known to be difficult to identify because of a lack of data and the similarity between the languages. This paper investigates the performance of n-gram counting using rank orders to discriminate among the different Bantu languages spoken in South Africa, using varying test and training data sizes. The highest average accuracy obtained was 99.3%, with a testing size of 495 characters and a training size of 600,000 characters. The lowest average accuracy obtained was 78.72%, with a testing size of 15 characters and a training size of 200,000 characters.
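The abstract does not spell out the classifier, so the following is only a minimal sketch of the general rank-order n-gram technique (often called the "out-of-place" measure), not the authors' implementation. The function names, the n-gram lengths (1 to 3), the profile size (300) and the toy training strings are all assumptions made for illustration; the training strings are made-up symbol sequences, not real language data.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    normalized = " ".join(text.lower().split())  # lowercase, collapse whitespace
    for n in range(1, n_max + 1):
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language profile cost max_penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def identify(text, lang_profiles, n_max=3, top_k=300):
    """Return the label whose profile is closest to the text's profile."""
    doc = ngram_profile(text, n_max, top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k),
    )


# Hypothetical toy training data: placeholder labels and made-up strings.
training = {
    "lang_a": "abab abba baba aabb",
    "lang_b": "xyxy yxxy xyyx yyxx",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}
```

With these toy profiles, `identify("abba ab", profiles)` selects `lang_a`, because almost every n-gram of the test string is missing from the `lang_b` profile and incurs the maximum penalty. In a setting like the one the abstract describes, the training strings would be much larger per-language corpora and the test strings short excerpts, which is where the training and testing sizes reported above come into play.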

UCT Computer Science Research Document Archive

Last updated on 28/10/2019

This paper was published in the UCT Computer Science Research Document Archive.
