
Toward Optimal Feature Selection in Naive Bayes for Text Categorization

Abstract

Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank features by their discriminative capacity for classification. We first revisit two information measures, the Kullback-Leibler divergence and the Jeffreys divergence, for binary hypothesis testing, and analyze their asymptotic properties relating to the type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called the Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH divergence, we develop two efficient feature selection methods, termed the maximum discrimination (MD) and MD-χ2 methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
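
The abstract does not spell out the MD and MD-χ2 scoring rules, but the general idea of divergence-based feature ranking can be sketched. Recall that the Kullback-Leibler divergence is D(P||Q) = Σ_x p(x) log(p(x)/q(x)) and the Jeffreys divergence is its symmetrized form J(P,Q) = D(P||Q) + D(Q||P). The minimal Python sketch below estimates smoothed class-conditional term-occurrence probabilities and ranks terms by the summed pairwise Jeffreys divergence across classes; the Bernoulli term-presence model, the pairwise summation over classes, and the helper names (kl_bernoulli, jeffreys_bernoulli, rank_features) are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from itertools import combinations

def kl_bernoulli(p, q):
    """KL divergence D(P||Q) between Bernoulli distributions with parameters p and q."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def jeffreys_bernoulli(p, q):
    """Jeffreys divergence J(P,Q) = D(P||Q) + D(Q||P) (symmetrized KL)."""
    return kl_bernoulli(p, q) + kl_bernoulli(q, p)

def rank_features(X, y, alpha=1.0):
    """Rank binary term features by summed pairwise Jeffreys divergence
    of their class-conditional occurrence probabilities.

    X     : (n_docs, n_terms) 0/1 term-presence matrix
    y     : (n_docs,) class labels
    alpha : Laplace smoothing constant, keeps probabilities strictly in (0, 1)
    """
    classes = np.unique(y)
    n_terms = X.shape[1]
    # p[c, t] = smoothed probability that term t occurs in a document of class c
    p = np.empty((len(classes), n_terms))
    for i, c in enumerate(classes):
        Xc = X[y == c]
        p[i] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    # Score each term by the sum of Jeffreys divergences over all class pairs
    scores = np.zeros(n_terms)
    for i, j in combinations(range(len(classes)), 2):
        scores += jeffreys_bernoulli(p[i], p[j])
    return np.argsort(scores)[::-1]  # term indices, most discriminative first

# Toy usage: 6 documents, 4 terms, 2 classes
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 0, 0],
              [0, 1, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(rank_features(X, y))  # terms 0 and 3 score highest here
```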


This paper was published in DigitalCommons@URI.
