Geometry- and Accuracy-Preserving Random Forest Proximities with Applications

Rhodes, Jake S.

Repository landing page

oai:digitalcommons.usu.edu:etd-9739

Geometry- and Accuracy-Preserving Random Forest Proximities with Applications

Authors: Jake S. Rhodes
Publication date: 1 August 2022
Publisher: DigitalCommons@USU
Doi

Abstract

Many machine learning algorithms use calculated distances or similarities between data observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distances or similarity measures do not incorporate known data labels and are thus considered unsupervised. Supervised methods for measuring distance exist which incorporate data labels and thereby exaggerate separation between data points of different classes. This approach tends to distort the natural structure of the data. Instead of following similar approaches, we leverage a popular algorithm used for making data-driven predictions, known as random forests, to naturally incorporate data labels into similarity measures known as random forest proximities. In this dissertation, we explore previously defined random forest proximities and demonstrate their weaknesses in popular proximity-based applications. Additionally, we develop a new proximity definition that can be used to recreate the random forest’s predictions. We call these random forest-geometry-and accuracy-Preserving proximities or RF-GAP. We show by proof and empirical demonstration can be used to perfectly reconstruct the random forest’s predictions and, as a result, we argue that RF-GAP proximities provide a truer representation of the random forest’s learning when used in proximity-based applications. We provide evidence to suggest that RF-GAP proximities improve applications including imputing missing data, detecting outliers, and visualizing the data. We also introduce a new random forest proximity-based technique that can be used to generate 2- or 3-dimensional data representations which can be used as a tool to visually explore the data. We show that this method does well at portraying the relationship between data variables and the data labels. We show quantitatively and qualitatively that this method surpasses other existing methods for this task

Similar works

Full text

Open in the Core reader

Download PDF

DigitalCommons@USU

oai:digitalcommons.usu.edu:etd...

Last time updated on 05/10/2022

This paper was published in DigitalCommons@USU.

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.