Evaluating neural multi-field document representations for patent classification

Pujari, Subhash Chandra; Mantiuk, Fryderyk; Giereth, Mark; Strötgen, Jannik; Friedrich, Annemarie

Publication date: 5 July 2023

Abstract

Patent classification constitutes a long-tailed hierarchical learning problem. Prior work has demonstrated the efficacy of neural representations based on pre-trained transformers; however, due to the limited input size of these models, it has used only the title and abstract of patents as input. Patent documents consist of several textual fields, some of which are quite long. We show that a baseline using simple tf.idf-based methods can easily leverage this additional information. We propose a new architecture that combines the neural transformer-based representations of the various fields into a meta-embedding, which we demonstrate to outperform its tf.idf-based counterparts, especially on less frequent classes. Using a relatively simple architecture, we outperform the previous state of the art on CPC classification by a margin of 1.2 macro-avg. F1 and 2.6 micro-avg. F1. We identify the textual field giving a “brief-summary” of the patent as the most informative with regard to CPC classification, which points to interesting future directions of research on less computation-intensive models, e.g., by summarizing long documents before neural classification.
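The abstract reports gains in both macro-averaged and micro-averaged F1, and the difference matters for a long-tailed label set. The following is a minimal sketch of the two standard averages for multi-label predictions. It is not the authors' evaluation code; the function names and the toy labels `frequent_class` and `rare_class` are assumptions made for illustration only.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """gold, pred: lists of label sets, one set per document (multi-label)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in g & p:   # predicted and correct
            tp[label] += 1
        for label in p - g:   # predicted but wrong
            fp[label] += 1
        for label in g - p:   # missed
            fn[label] += 1
    labels = sorted(set(tp) | set(fp) | set(fn))
    # Micro: pool counts over all labels, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-label F1, so every label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy data (hypothetical): a classifier that always predicts the frequent label.
gold = [{"frequent_class"}] * 4 + [{"rare_class"}]
pred = [{"frequent_class"}] * 5
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the micro average stays high because most documents carry the frequent label, while the macro average drops sharply because the rare label is never predicted. This is why improvements on less frequent classes show up mainly in macro-averaged F1.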

This paper was published in OPUS Augsburg.