Measuring the diversity of data and metadata in digital libraries
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10045/152342
Title: | Measuring the diversity of data and metadata in digital libraries |
---|---|
Authors: | Carrasco, Rafael C. | Candela, Gustavo | Marco Such, Manuel |
Research Group/s: | Lucentia | Transducens |
Center, Department or Service: | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos |
Keywords: | Metadata | Digital libraries | Open data | Collections as data |
Issue Date: | 21-Feb-2025 |
Publisher: | Springer Nature |
Citation: | International Journal on Digital Libraries. 2025, 26: 5. https://doi.org/10.1007/s00799-025-00411-1 |
Abstract: | Diversity indices have traditionally been used to capture the biodiversity of ecosystems by measuring the effective number of species or groups of species. In contrast to abundance, which grows with the amount of data available and is sensitive to the appearance of small groups, diversity indices provide a more robust indicator of the variability of individuals. These indices can be employed in the context of digital libraries to analyse their content and metadata. They can be used, for example, to identify trends in the distribution of topics, to compare the lexica employed by different authors or to analyse the coverage of semantic metadata. In this article, lexical diversity is measured with the Shannon index, one of the most common indices employed to evaluate diversity. The experiments show that this index grows slowly with the length of the text used to calculate it. Because this growth has the true diversity value as its ceiling, the curves show that this value is only reached for very large samples. Unfortunately, the available text is often not long enough to achieve convergence. This paper therefore introduces a new model for calculating the asymptotic value of the Shannon diversity of the vocabulary, which outperforms traditional models. As regards metadata in digital libraries, we use the new model to analyse the topical specialization of a digital library and its evolution over time, and we propose a more robust way to measure the variety of tags (classes and properties) employed by digital libraries to describe their holdings in Linked Open Data repositories. |
Sponsor: | Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. |
URI: | http://hdl.handle.net/10045/152342 |
ISSN: | 1432-5012 (Print) | 1432-1300 (Online) |
DOI: | 10.1007/s00799-025-00411-1 |
Language: | eng |
Type: | info:eu-repo/semantics/article |
Rights: | © The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
Peer Review: | Yes |
Publisher version: | https://doi.org/10.1007/s00799-025-00411-1 |
Appears in Collections: | INV - TRANSDUCENS - Artículos de Revistas | INV - LUCENTIA - Artículos de Revistas |
Files in This Item:
Size | Format |
---|---|
2.22 MB | Adobe PDF |
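The abstract describes measuring lexical diversity with the Shannon index and observing how the value grows with the length of the text. The following is a minimal sketch of that plain calculation only, not the asymptotic model the article introduces. It assumes whitespace-separated tokens and the natural logarithm, and it takes exp(H) as the effective number of types; the function names `shannon_index`, `effective_number` and `prefix_curve` are illustrative and do not come from the article.

```python
import math
from collections import Counter


def shannon_index(tokens):
    """Shannon index H = -sum(p_i * ln p_i) over the relative
    frequencies p_i of the distinct tokens (natural log assumed)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def effective_number(tokens):
    """Effective number of types, taken here as exp(H) (an assumption:
    the record does not state which conversion the authors use)."""
    return math.exp(shannon_index(tokens))


def prefix_curve(tokens, sizes):
    """Shannon index of the first n tokens for each n in sizes,
    to inspect how the estimate changes as the sample grows."""
    return [(n, shannon_index(tokens[:n])) for n in sizes if n <= len(tokens)]


if __name__ == "__main__":
    # Hypothetical toy sample; a real study would use a long text.
    sample = "a b a c a b d a".split()
    for n, h in prefix_curve(sample, [2, 4, 8]):
        print(n, round(h, 4), round(math.exp(h), 4))
```

On short samples such as this one the index is still far from any stable value, which is the situation the abstract says motivates estimating the asymptote rather than reading it off the curve.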