Data representations for audio-to-score monophonic music transcription

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/108733
Item information
Title: Data representations for audio-to-score monophonic music transcription
Authors: Román, Miguel A. | Pertusa, Antonio | Calvo-Zaragoza, Jorge
Research group or GITE: Reconocimiento de Formas e Inteligencia Artificial
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos | Universidad de Alicante. Instituto Universitario de Investigación Informática
Keywords: Automatic music transcription | Audio processing | Neural networks | Audio to score | Monophonic music
Knowledge areas: Lenguajes y Sistemas Informáticos
Publication date: 30 December 2020
Publisher: Elsevier
Bibliographic citation: Expert Systems with Applications. 2020, 162: 113769. https://doi.org/10.1016/j.eswa.2020.113769
Abstract: This work presents an end-to-end method based on deep neural networks for audio-to-score music transcription of monophonic excerpts. Unlike existing music transcription methods, which normally perform pitch estimation, the proposed approach is formulated as an end-to-end task that outputs a notation-level music score. Given an audio file as input, modeled as a sequence of frames, a deep neural network is trained to produce a sequence of music symbols encoding a score, including key and time signatures, barlines, notes (with their pitch spelling and duration), and rests. The framework is based on a Convolutional Recurrent Neural Network (CRNN) with a Connectionist Temporal Classification (CTC) loss function, trained end to end without requiring the input frames to be aligned with the output symbols. A total of 246,870 incipits from the Répertoire International des Sources Musicales online catalog were synthesized with different timbres and tempos to build the training data. Alternative input representations (raw audio, Short-Time Fourier Transform (STFT), log-spaced STFT, and Constant-Q Transform) were evaluated for this task, as well as different output representations (Plaine & Easie Code, Kern, and a purpose-designed encoding). Results show that it is feasible to infer score representations directly from audio files, and that most errors stem from music notation ambiguities and metering (time signatures and barlines).
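As an illustration of the pipeline described in the abstract (spectrogram frames in, a stream of score symbols out, with CTC bridging the missing frame-to-symbol alignment), below is a minimal sketch assuming PyTorch. The layer sizes, vocabulary size, and all names are hypothetical placeholders, not the authors' exact architecture.

# Minimal CRNN + CTC sketch; sizes and names are illustrative assumptions,
# not the exact architecture from the paper.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_bins: int, n_symbols: int):
        super().__init__()
        # Convolutional front end over (batch, 1, freq_bins, frames).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool frequency only, preserve time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        # Recurrent layers over the time axis.
        self.rnn = nn.LSTM(64 * (n_bins // 4), 128, num_layers=2,
                           bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol.
        self.fc = nn.Linear(2 * 128, n_symbols + 1)

    def forward(self, x):  # x: (batch, 1, n_bins, n_frames)
        z = self.conv(x)                      # (batch, 64, n_bins//4, n_frames)
        z = z.permute(0, 3, 1, 2).flatten(2)  # (batch, n_frames, features)
        z, _ = self.rnn(z)
        return self.fc(z).log_softmax(-1)     # (batch, n_frames, n_symbols+1)

model = CRNN(n_bins=256, n_symbols=100)       # hypothetical symbol vocabulary
ctc = nn.CTCLoss(blank=100)                   # blank index = n_symbols
x = torch.randn(4, 1, 256, 400)               # e.g. a batch of CQT spectrograms
targets = torch.randint(0, 100, (4, 50))      # encoded score-symbol sequences
log_probs = model(x).permute(1, 0, 2)         # CTCLoss expects (T, batch, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 400, dtype=torch.long),
           target_lengths=torch.full((4,), 50, dtype=torch.long))

The key design point is the CTC loss: it marginalizes over all monotonic alignments between the frame-level predictions and the symbol sequence, which is why training only needs each audio excerpt paired with its unaligned sequence of score symbols.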
Sponsors: This work has been supported by the Spanish "Ministerio de Ciencia e Innovación" through the HISPAMUS project (No. TIN2017-86576-R, supported by EU FEDER funds).
URI: http://hdl.handle.net/10045/108733
ISSN: 0957-4174 (Print) | 1873-6793 (Online)
DOI: 10.1016/j.eswa.2020.113769
Language: eng
Type: info:eu-repo/semantics/article
Rights: © 2020 Elsevier Ltd.
Peer reviewed: yes
Publisher's version: https://doi.org/10.1016/j.eswa.2020.113769
Appears in collection: INV - GRFIA - Artículos de Revistas

Files in this item:
File                               Description                          Size       Format
Roman_etal_2020_ESWA_final.pdf     Final version (restricted access)    1.2 MB     Adobe PDF
Roman_etal_2020_ESWA_accepted.pdf  Accepted manuscript (open access)    770.55 kB  Adobe PDF

