Data Reduction in the String Space for Efficient kNN Classification through Space Partitioning

Item information
Title: Data Reduction in the String Space for Efficient kNN Classification through Space Partitioning
Authors: Valero Mas, José Javier | Castellanos, Francisco J.
Research Group/s: Reconocimiento de Formas e Inteligencia Artificial
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: String space | Data reduction | k-Nearest neighbor | Prototype generation
Knowledge Area: Lenguajes y Sistemas Informáticos
Issue Date: 12-May-2020
Publisher: MDPI
Citation: Valero-Mas JJ, Castellanos FJ. Data Reduction in the String Space for Efficient kNN Classification through Space Partitioning. Applied Sciences. 2020; 10(10):3356. doi:10.3390/app10103356
Abstract: Within the Pattern Recognition field, two representations are generally considered for encoding data: statistical codifications, which describe elements as feature vectors, and structural representations, which encode elements as high-level symbolic data structures such as strings, trees or graphs. While the vast majority of classifiers can address statistical spaces, only a few particular methods are suitable for structural representations. The kNN classifier is one of the few algorithms capable of tackling both statistical and structural spaces. This method is based on the computation of the dissimilarity between all the samples of the set, which is the main reason for its high versatility but, in turn, also for its low efficiency. Prototype Generation is one of the possibilities for alleviating this issue. These mechanisms generate a reduced version of the initial dataset by performing data transformation and aggregation processes on the initial collection. Nevertheless, these generation processes depend heavily on the data representation considered and are generally not well defined for structural data. In this work we present the adaptation of the generation-based reduction algorithm Reduction through Homogeneous Clusters to the case of string data. This algorithm performs the reduction by partitioning the space into class-homogeneous clusters and then generating a representative prototype as the median value of each group. Thus, the main issue to tackle is the retrieval of the median element of a set of strings. Our comprehensive experimentation comparatively assesses the performance of this algorithm in both the statistical and the string-based spaces. The results show the relevance of our approach, with a competitive compromise between classification rate and data reduction.
Sponsor: This research work was partially funded by “Programa I+D+i de la Generalitat Valenciana” through grant ACIF/2019/042 and the Spanish Ministry through HISPAMUS project TIN2017-86576-R, partially funded by the EU.
ISSN: 2076-3417
DOI: 10.3390/app10103356
Language: eng
Type: info:eu-repo/semantics/article
Rights: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
Peer Review: yes
Publisher version:
Appears in Collections: INV - GRFIA - Artículos de Revistas

Files in This Item:
File | Description | Size | Format
Valero-Mas_Castellanos_2020_ApplSci.pdf | | 332.84 kB | Adobe PDF

This item is licensed under a Creative Commons License.