Repository landing page

We are not able to resolve this OAI Identifier to the repository landing page. If you are the repository manager for this record, please head to the Dashboard and adjust the settings.

An approach to validity indices for clustering techniques in Big Data

Abstract

Clustering analysis is one of the most used Machine Learning techniques to discover groups among data objects. Some clustering methods require the number of clus ters into which the data is going to be partitioned. There exist several cluster validity indices that help us to approximate the optimal number of clusters of the dataset. However, such indices are not suitable to deal with Big Data due to its size limitation and runtime costs. This paper presents two cluster ing validity indices that handle large amount of data in low computational time. Our indices are based on redefinitions of traditional indices by simplifying the intra-cluster distance calculation. Two types of tests have been carried out over 28 synthetic datasets to analyze the performance of the proposed indices. First, we test the indices with small and medium size datasets to verify that our indices have a similar effectiveness to the traditional ones. Subsequently, tests on datasets of up to 11 million records and 20 features have been executed to check their efficiency. The results show that both indices can handle Big Data in a very low computational time with an effectiveness similar to the traditional indices using Apache Spark framework.Ministerio de Economía y Competitividad TIN2014-55894-C2-1-

Similar works

Full text

thumbnail-image

idUS. Depósito de Investigación Universidad de Sevilla

redirect
Last time updated on 19/05/2022

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.