TY - JOUR
T1 - HTNet for micro-expression recognition
AU - Wang, Zhifeng
AU - Zhang, Kaihao
AU - Luo, Wenhan
AU - Sankaranarayana, Ramesh
N1 - Publisher Copyright: © 2024 The Author(s)
PY - 2024/10/14
Y1 - 2024/10/14
AB - Facial expression is related to facial muscle contractions and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which has a negative impact on the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks. Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages the local temporal features and an aggregation layer that extracts local and global semantical facial features. Specifically, HTNet divides the face into four different facial areas: left lip area, left eye area, right eye area and right lip area. The transformer layer is used to focus on representing local minor muscle movement with local self-attention in each area. The aggregation layer is used to learn the interactions between eye areas and lip areas. The experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The codes and models are available at: https://github.com/wangzhifengharrison/HTNet.
KW - Deep learning
KW - Facial muscle movement
KW - Hierarchical transformer
KW - Local self-attention
KW - Micro-expression recognition
UR - https://www.scopus.com/pages/publications/85200253039
U2 - 10.1016/j.neucom.2024.128196
DO - 10.1016/j.neucom.2024.128196
M3 - Article
AN - SCOPUS:85200253039
SN - 0925-2312
VL - 602
JO - Neurocomputing
JF - Neurocomputing
M1 - 128196
ER -