Facial Expressions Recognition in Sign Language Based on a Two-Stream Swin Transformer Model Integrating RGB and Texture Map Images

Lourdes Ramírez Cerna, José Antonio Rodriguez Melquiades, Edwin Jonathan Escobedo Cárdenas, Guillermo Cámara Chávez, Dayse Garcia Miranda

Abstract


The study of facial expressions in sign language has become a significant research area, as these expressions not only convey personal states but also enhance the meaning of signs within specific contexts. The absence of facial expressions during communication can lead to misinterpretations, underscoring the need for datasets that include facial expressions in sign language. To address this, we present the Facial-BSL dataset, which consists of videos capturing eight distinct facial expressions used in Brazilian Sign Language. Additionally, we propose a two-stream model designed to classify facial expressions in a sign language context. The model uses RGB images to capture local facial information and texture map images to record facial movements. We assessed the performance of several deep learning architectures within this two-stream framework, including Convolutional Neural Networks (CNNs) and Vision Transformers, and conducted experiments on the public CK+, KDEF-dyn, and LIBRAS datasets. The two-stream architecture based on the Swin Transformer achieved the best performance on the KDEF-dyn and LIBRAS datasets and ranked second on the CK+ dataset, with an accuracy of 97% and an F1-score of 95%.
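As a rough illustration of the two-stream design summarized above, the following is a minimal sketch of a dual Swin Transformer classifier with late fusion. It assumes PyTorch with torchvision's swin_t backbone, 224x224 inputs for both streams, and simple concatenation of the two feature vectors before a linear head; these are illustrative assumptions, not the exact backbone sizes, preprocessing, or fusion strategy reported in the paper.

import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

class TwoStreamSwin(nn.Module):
    # One Swin-T backbone per modality (RGB frame, texture map image),
    # fused by concatenating the two feature vectors (assumed fusion point).
    def __init__(self, num_classes=8):
        super().__init__()
        self.rgb_stream = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
        self.tex_stream = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
        feat_dim = self.rgb_stream.head.in_features  # 768 for Swin-T
        self.rgb_stream.head = nn.Identity()  # keep backbone features only
        self.tex_stream.head = nn.Identity()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb, tex):
        f_rgb = self.rgb_stream(rgb)  # (B, 768)
        f_tex = self.tex_stream(tex)  # (B, 768)
        return self.classifier(torch.cat([f_rgb, f_tex], dim=1))

model = TwoStreamSwin(num_classes=8)   # eight Facial-BSL expression classes
rgb = torch.randn(2, 3, 224, 224)      # batch of RGB face crops
tex = torch.randn(2, 3, 224, 224)      # matching texture map images
logits = model(rgb, tex)               # shape (2, 8)

Other fusion points, such as averaging the logits of separately trained streams, are equally plausible under this description; the abstract alone does not specify which variant the authors adopt.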

Keywords


Facial expressions in sign language; RGBD data; texture map images; two-stream architecture; Swin Transformer.
