Data representations for audio-to-score monophonic music transcription

Román, Miguel A.; Pertusa, Antonio; Calvo-Zaragoza, Jorge

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/108733

Item information
Title:	Data representations for audio-to-score monophonic music transcription
Author(s):	Román, Miguel A. | Pertusa, Antonio | Calvo-Zaragoza, Jorge
Research group(s) or GITE:	Reconocimiento de Formas e Inteligencia Artificial
Center, Department or Service:	Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos | Universidad de Alicante. Instituto Universitario de Investigación Informática
Keywords:	Automatic music transcription | Audio processing | Neural networks | Audio to score | Monophonic music
Knowledge area(s):	Lenguajes y Sistemas Informáticos
Publication date:	30 Dec 2020
Publisher:	Elsevier
Bibliographic citation:	Expert Systems with Applications. 2020, 162: 113769. https://doi.org/10.1016/j.eswa.2020.113769
Abstract:	This work presents an end-to-end method based on deep neural networks for audio-to-score music transcription of monophonic excerpts. Unlike existing music transcription methods, which normally perform pitch estimation, the proposed approach is formulated as an end-to-end task that outputs a notation-level music score. Using an audio file as input, modeled as a sequence of frames, a deep neural network is trained to provide a sequence of music symbols encoding a score, including key and time signatures, barlines, notes (with their pitch spelling and duration) and rests. Our framework is based on a Convolutional Recurrent Neural Network (CRNN) with a Connectionist Temporal Classification (CTC) loss function, trained in an end-to-end fashion without requiring the input frames to be aligned with the output symbols. A total of 246,870 incipits from the Répertoire International des Sources Musicales online catalog were synthesized using different timbres and tempos to build the training data. Alternative input representations (raw audio, Short-Time Fourier Transform (STFT), log-spaced STFT and Constant-Q transform) were evaluated for this task, as well as different output representations (Plaine & Easie Code, Kern, and a purpose-designed output). Results show that it is feasible to directly infer score representations from audio files, and that most errors come from music notation ambiguities and metering (time signatures and barlines).
Sponsor(s):	This work has been supported by the Spanish "Ministerio de Ciencia e Innovación" through Project HISPAMUS (No. TIN2017-86576-R supported by EU FEDER funds).
URI:	http://hdl.handle.net/10045/108733
ISSN:	0957-4174 (Print) | 1873-6793 (Online)
DOI:	10.1016/j.eswa.2020.113769
Language:	eng
Type:	info:eu-repo/semantics/article
Rights:	© 2020 Elsevier Ltd.
Peer reviewed:	yes
Publisher's version:	https://doi.org/10.1016/j.eswa.2020.113769
Appears in collections:	INV - GRFIA - Artículos de Revistas
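
Illustrative note on the abstract: a CTC-trained network emits one label per input frame, including a special blank label, and the symbol sequence is recovered by merging consecutive repeats and then dropping blanks. This is why no frame-to-symbol alignment is needed during training. The following is a minimal sketch of that greedy (best-path) collapse only. It is not the authors' code, and the blank token name and the symbol names are hypothetical placeholders, not the paper's output encodings.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank label


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame labels into a symbol sequence (CTC best path).

    Step 1: merge runs of identical consecutive labels.
    Step 2: drop the blank label.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Hypothetical per-frame predictions (most likely label at each frame).
frames = [
    "<blank>", "clef-G2", "clef-G2", "<blank>",
    "C4_quarter", "C4_quarter", "<blank>", "C4_quarter", "barline",
]
print(ctc_greedy_decode(frames))
```

The blank between the two runs of `C4_quarter` is what lets the decoder keep two separate notes of the same pitch and duration instead of merging them into one.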

Files in this item:
File	Description	Size	Format
Roman_etal_2020_ESWA_final.pdf	Final version (restricted access)	1.2 MB	Adobe PDF
Roman_etal_2020_ESWA_accepted.pdf	Accepted Manuscript (open access)	770.55 kB	Adobe PDF
