Data representations for audio-to-score monophonic music transcription
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10045/108733
Title: | Data representations for audio-to-score monophonic music transcription |
---|---|
Author(s): | Román, Miguel A. | Pertusa, Antonio | Calvo-Zaragoza, Jorge |
Research group(s) or GITE: | Reconocimiento de Formas e Inteligencia Artificial |
Center, Department or Service: | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos | Universidad de Alicante. Instituto Universitario de Investigación Informática |
Keywords: | Automatic music transcription | Audio processing | Neural networks | Audio to score | Monophonic music |
Knowledge area(s): | Lenguajes y Sistemas Informáticos |
Publication date: | 30-Dec-2020 |
Publisher: | Elsevier |
Bibliographic citation: | Expert Systems with Applications. 2020, 162: 113769. https://doi.org/10.1016/j.eswa.2020.113769 |
Abstract: | This work presents an end-to-end method based on deep neural networks for audio-to-score music transcription of monophonic excerpts. Unlike existing music transcription methods, which normally perform pitch estimation, the proposed approach is formulated as an end-to-end task that outputs a notation-level music score. Using an audio file as input, modeled as a sequence of frames, a deep neural network is trained to provide a sequence of music symbols encoding a score, including key and time signatures, barlines, notes (with their pitch spelling and duration) and rests. Our framework is based on a Convolutional Recurrent Neural Network (CRNN) with a Connectionist Temporal Classification (CTC) loss function, trained in an end-to-end fashion without requiring alignment between the input frames and the output symbols. A total of 246,870 incipits from the Répertoire International des Sources Musicales online catalog were synthesized using different timbres and tempos to build the training data. Alternative input representations (raw audio, Short-Time Fourier Transform (STFT), log-spaced STFT and Constant-Q transform) were evaluated for this task, as well as different output representations (Plaine & Easie Code, Kern, and a purpose-designed output). Results show that it is feasible to directly infer score representations from audio files and that most errors come from music notation ambiguities and metering (time signatures and barlines). |
Sponsor(s): | This work has been supported by the Spanish "Ministerio de Ciencia e Innovación" through Project HISPAMUS (No. TIN2017-86576-R supported by EU FEDER funds). |
URI: | http://hdl.handle.net/10045/108733 |
ISSN: | 0957-4174 (Print) | 1873-6793 (Online) |
DOI: | 10.1016/j.eswa.2020.113769 |
Language: | eng |
Type: | info:eu-repo/semantics/article |
Rights: | © 2020 Elsevier Ltd. |
Peer reviewed: | yes |
Publisher's version: | https://doi.org/10.1016/j.eswa.2020.113769 |
Appears in collections: | INV - GRFIA - Artículos de Revistas |
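
The abstract describes a network trained with a CTC loss, so that per-frame outputs need no alignment with the score symbols. As a minimal sketch of how such frame-level outputs can be turned into a symbol sequence, the following shows best-path (greedy) CTC decoding: take the most probable symbol per frame, merge consecutive repeats, and drop blanks. This is an illustration under stated assumptions, not the authors' implementation; the record does not say which decoding strategy the paper uses, and the vocabulary, the `<blank>` token name, and the probabilities below are all made up.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_greedy_decode(frame_probs, vocabulary):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the most probable vocabulary entry for every frame.
    best_path = [
        vocabulary[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    # Merge runs of identical consecutive symbols into one symbol each.
    collapsed = [symbol for symbol, _ in groupby(best_path)]
    # Remove blanks; a blank between two equal symbols keeps them distinct.
    return [symbol for symbol in collapsed if symbol != BLANK]


# Toy input: six frames over a made-up three-symbol vocabulary.
vocabulary = [BLANK, "C4_quarter", "barline"]
frame_probs = [
    [0.10, 0.80, 0.10],  # C4_quarter
    [0.20, 0.70, 0.10],  # C4_quarter (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates two distinct notes)
    [0.10, 0.80, 0.10],  # C4_quarter
    [0.10, 0.10, 0.80],  # barline
    [0.10, 0.10, 0.80],  # barline (repeat, merged)
]
decoded = ctc_greedy_decode(frame_probs, vocabulary)
print(decoded)  # ['C4_quarter', 'C4_quarter', 'barline']
```

The blank frame is what lets two identical consecutive notes survive the merge step, which is why no frame-to-symbol alignment is needed in the training data.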
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
Roman_etal_2020_ESWA_final.pdf | Final version (restricted access) | 1.2 MB | Adobe PDF | |
Roman_etal_2020_ESWA_accepted.pdf | Accepted Manuscript (open access) | 770.55 kB | Adobe PDF | |
All documents in RUA are protected by copyright. Some rights reserved.