Emotion Detection through Audio and Video

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/115372
Item information
Title: Emotion Detection through Audio and Video
Author(s): Ellarby Sánchez, Nicolás
Research supervisors: Cazorla, Miguel | Escalona, Félix
Centre, Department or Service: Universidad de Alicante. Departamento de Ciencia de la Computación e Inteligencia Artificial
Keywords: Machine Learning | Deep Learning | Emotion Detection | Audio | Video
Knowledge area(s): Computer Science and Artificial Intelligence
Issue date: 1-Jun-2021
Date of defence: Jun-2021
Abstract: This project has been an incredible learning experience, allowing me to expand my knowledge of AI and, more specifically, Machine Learning and Deep Learning. The first weeks were spent researching existing emotion recognition projects. These were studied to understand the techniques used and the approaches people take, since these could become useful at later stages. An intensive investigation and comparison of datasets was also carried out to choose the most suitable ones for the project. Once the direction in which to move forward was clear, a new phase started: developing the code needed to extract audio and video features. As has been explained, the audio files were divided into three-second samples, from which a grey-scale mel-spectrogram was obtained. On the other hand, the frames of the video files were extracted in order to obtain the facial landmarks used to align the actors' faces. These audio and video features were then used to train the ResNet50 and EfficientNet models. These were trained several times with slight modifications, such as increasing the amount of data, mixing the different datasets and splitting them into train and test sets in different ways. After many tests, good results were obtained, and these models were saved and combined into a single model to check whether better accuracy could be achieved. In the end, some very positive results were obtained: 74% accuracy for the audio model, 43% accuracy for the video model and 85% accuracy for the joint model. This shows that a model based on both audio and video performs better than the separate models. This was very satisfying because it meant that the main objective of the project had been reached. These models were then tested with videos I recorded of myself acting out different emotions. The results were somewhat disappointing but understandable, since I was not recording under the same conditions as in the datasets with which the models were trained: equipment quality, lighting and positioning all differed. All the steps mentioned above took months of hard work, redefining the direction to follow, modifying the way things were done and, above all, running multiple tests to make sure they were completely fair and that the results were the best possible.
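The abstract describes the audio features as grey-scale mel-spectrograms computed over three-second segments. A minimal sketch of that step is given below; the sample rate, number of mel bands and image handling are assumptions, since the record does not state the exact parameters used in the thesis.

```python
# Sketch of the audio feature extraction described in the abstract:
# split a recording into 3-second segments and turn each into a
# grey-scale mel-spectrogram image. Parameters are assumptions.
import numpy as np
import librosa
from PIL import Image

def audio_to_mel_images(path, segment_seconds=3, sr=22050, n_mels=128):
    """Yield one grey-scale mel-spectrogram (PIL Image) per 3-second segment."""
    y, sr = librosa.load(path, sr=sr)            # mono waveform
    samples_per_segment = segment_seconds * sr
    n_segments = len(y) // samples_per_segment   # drop any trailing partial segment
    for i in range(n_segments):
        segment = y[i * samples_per_segment:(i + 1) * samples_per_segment]
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=n_mels)
        mel_db = librosa.power_to_db(mel, ref=np.max)
        # Normalise to 0-255 and store as an 8-bit grey-scale image,
        # ready to be fed to an image classifier such as ResNet50.
        norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
        yield Image.fromarray((norm * 255).astype(np.uint8), mode="L")

# Example usage (hypothetical file name):
# for idx, img in enumerate(audio_to_mel_images("actor_01_sample.wav")):
#     img.save(f"mel_{idx:03d}.png")
```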
URI: http://hdl.handle.net/10045/115372
Language: eng
Type: info:eu-repo/semantics/bachelorThesis
Rights: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 licence
Appears in collections: Grado en Ingeniería Robótica - Trabajos Fin de Grado

Files in this item:
File: TFG_Nicolas_Ellarby.pdf (25,85 MB, Adobe PDF)


This item is licensed under a Creative Commons licence.