Extracting Web Information using Representation Patterns

Roldán Salvador, Juan Carlos; Jiménez Aguirre, Patricia; Corchuelo Gil, Rafael

oai:idus.us.es:11441/131931

Authors: Juan Carlos Roldán Salvador
Patricia Jiménez Aguirre
Rafael Corchuelo Gil
Publication date: 1 January 2017
Publisher: Association for Computing Machinery (ACM)

Abstract

Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured, we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and must not be targeted to extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open proposal, and it is not scalable, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.

Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Economía y Competitividad TIN2013-40848-

Last time updated on 19/05/2022

This paper was published in idUS. Depósito de Investigación Universidad de Sevilla.
