Fast, small and exact: infinite-order language modelling with compressed suffix trees

Abstract

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimizations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
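To illustrate the kind of on-the-fly computation the abstract refers to, the sketch below shows how m-gram probabilities can be derived from substring-count queries, the operation a compressed suffix tree answers efficiently. This is a minimal illustration under stated assumptions, not the paper's implementation: a naive list scan stands in for the suffix-tree count, a crude stupid-backoff-style scheme stands in for the paper's smoothing, and all function and parameter names are hypothetical.

from typing import List, Tuple

def count(corpus: List[str], pattern: Tuple[str, ...]) -> int:
    """Occurrences of `pattern` as a contiguous subsequence of `corpus`.
    A compressed suffix tree would answer this in time proportional to
    the pattern length; here we scan naively for clarity."""
    n, m = len(corpus), len(pattern)
    return sum(1 for i in range(n - m + 1) if tuple(corpus[i:i + m]) == pattern)

def prob(corpus: List[str], word: str, context: Tuple[str, ...],
         alpha: float = 0.4) -> float:
    """P(word | context) computed on-the-fly from counts, backing off to a
    shorter context when the full context is unseen (placeholder smoothing)."""
    c_full = count(corpus, context + (word,))
    c_ctx = count(corpus, context)
    if c_ctx > 0 and c_full > 0:
        return c_full / c_ctx
    if not context:  # unigram fallback
        return max(count(corpus, (word,)), 1) / len(corpus)
    return alpha * prob(corpus, word, context[1:], alpha)

if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ran".split()
    # Trigram-style query: "the cat" occurs twice, "the cat sat" once -> 0.5
    print(prob(corpus, "sat", ("the", "cat")))
    # Unseen context: backs off through shorter contexts to the unigram estimate
    print(prob(corpus, "mat", ("the", "dog")))

No counts need to be stored per m-gram order: every probability is assembled from count queries against the single indexed corpus, which is what lets the approach scale to unbounded ("infinite") Markov orders.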

This paper was published in Monash University Research Portal.
