A Multitude of Linguistically-rich Features for Authorship Attribution

Tanguy, Ludovic; Urieli, Assaf; Calderone, Basilio; Hathout, Nabil; Sajous, Franck

Authors: Ludovic Tanguy, Assaf Urieli, Basilio Calderone, Nabil Hathout, Franck Sajous
Publication date: 19 September 2011
Publisher: HAL CCSD

Abstract

This paper reports on the procedure and learning models we adopted for the 'PAN 2011 Author Identification' challenge, which targeted real-world email messages. The novelty of our approach lies in a design that combines shallow characteristics of the emails (word and trigram frequencies) with a large number of ad hoc linguistically-rich features addressing different language levels. For the author attribution tasks, all these features were used to train a maximum entropy model, which gave very good results. For the single-author verification tasks, a set of features based exclusively on the linguistic description of the email messages was used as input for symbolic learning techniques (rules and decision trees), and gave weak results. This paper presents in detail the features extracted from the corpus, the learning models, and the results obtained.
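As an illustration of the shallow characteristics mentioned in the abstract, the following is a minimal sketch of how word and trigram frequencies might be computed for a single message. It is not the authors' implementation: the function name `shallow_features`, the tokenization rule, the choice of character (rather than word) trigrams, and the `w:`/`t:` feature prefixes are all assumptions made for this example.

```python
from collections import Counter
import re


def shallow_features(text):
    """Return relative frequencies of words and character trigrams.

    Hypothetical helper: the tokenization and the use of character
    trigrams are assumptions, not details taken from the paper.
    """
    lowered = text.lower()
    # Words: runs of ASCII letters and apostrophes (a simplifying assumption).
    words = re.findall(r"[a-z']+", lowered)
    # Character trigrams: every window of three consecutive characters.
    trigrams = [lowered[i:i + 3] for i in range(len(lowered) - 2)]

    features = {}
    for prefix, items in (("w", words), ("t", trigrams)):
        total = len(items)
        # An empty list yields an empty Counter, so no division by zero occurs.
        for item, count in Counter(items).items():
            features[f"{prefix}:{item}"] = count / total
    return features


feats = shallow_features("See you at the meeting, see you then.")
```

A dictionary of this kind could be merged with other feature values before being passed to a classifier such as a maximum entropy model; how the paper actually encodes and combines its features is described in the full text, not in this sketch.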

oai:HAL:hal-00703987v1

Last time updated on 14/04/2021

This paper was published in HAL Descartes.
