Emotion Detection through Audio and Video

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/115372
Item information
Title: Emotion Detection through Audio and Video
Authors: Ellarby Sánchez, Nicolás
Research Director: Cazorla, Miguel | Escalona, Félix
Center, Department or Service: Universidad de Alicante. Departamento de Ciencia de la Computación e Inteligencia Artificial
Keywords: Machine Learning | Deep Learning | Emotion Detection | Audio | Video
Knowledge Area: Computer Science and Artificial Intelligence
Issue Date: 1-Jun-2021
Date of defense: Jun-2021
Abstract: This project has been an incredible learning experience that allowed me to expand my knowledge of AI and, more specifically, Machine Learning and Deep Learning. The first weeks were spent researching existing emotion recognition projects. These were studied to understand the techniques and approaches people use, since these could become useful at later stages. An intensive investigation and comparison of datasets was also carried out to choose the most suitable ones for the project. Once the direction in which to move forward was clear, a new phase started: developing the code needed to extract audio and video features. As explained, the audio files were divided into three-second samples, from which a grey-scale mel-spectrogram was obtained. On the other hand, the frames of the video files were extracted in order to obtain the facial landmarks, which were used to align the actors' faces. These audio and video features were then used to train the ResNet50 and EfficientNet models. These were trained several times with slight modifications, such as increasing the amount of data, mixing the different datasets and splitting them into train and test sets in different ways. After many tests, good results were obtained, and these models were saved and combined into a single model to check whether better accuracy could be achieved. In the end, some very positive results were obtained: 74% accuracy for the audio model, 43% accuracy for the video model and 85% accuracy for the joint model. This shows that a model based on both audio and video performs better than the separate models. This was very satisfactory because it meant that the main objective of this project had been reached. Then, these models were tested with videos I recorded of myself acting out different emotions.
These results were somewhat disappointing but understandable, since I was not recording under the same conditions as the datasets with which the models were trained: equipment quality, lighting and positioning all differed. All the steps mentioned above have been hard work and have taken months of redefining the direction to follow, modifying the way things were done and, especially, conducting multiple tests to make sure the comparisons were completely fair and the results were the best possible.
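The landmark-based face alignment mentioned in the abstract typically amounts to rotating each frame so that the line between the two eyes becomes horizontal. A small sketch of that geometry, using hypothetical eye coordinates (the thesis's actual landmark detector and alignment code are not shown here):

```python
import numpy as np

def eye_alignment_transform(left_eye, right_eye):
    """Return a function that rotates 2-D points about the midpoint of the
    eyes so that the eye line ends up horizontal."""
    left_eye = np.asarray(left_eye, float)
    right_eye = np.asarray(right_eye, float)
    dx, dy = right_eye - left_eye
    angle = np.arctan2(dy, dx)              # tilt of the eye line
    c, s = np.cos(-angle), np.sin(-angle)   # rotate by the opposite angle
    center = (left_eye + right_eye) / 2.0
    R = np.array([[c, -s],
                  [s,  c]])

    def apply(points):
        pts = np.asarray(points, float)
        return (pts - center) @ R.T + center

    return apply

# Hypothetical eyes tilted at 45 degrees; after alignment both landmarks
# share the same y coordinate.
align = eye_alignment_transform((10, 10), (20, 20))
aligned = align([(10, 10), (20, 20)])
```

In a real pipeline the same rotation would be applied to the whole frame as an affine warp (e.g. with OpenCV's `cv2.warpAffine`) before cropping the face for the video model.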
URI: http://hdl.handle.net/10045/115372
Language: eng
Type: info:eu-repo/semantics/bachelorThesis
Rights: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License
Appears in Collections: Grado en Ingeniería Robótica - Trabajos Fin de Grado

Files in This Item:
File: TFG_Nicolas_Ellarby.pdf | Size: 25,85 MB | Format: Adobe PDF
