TY - GEN
T1 - Spatial-Temporal-Class Attention Network for Acoustic Scene Classification
AU - Niu, Xinlei
AU - Martin, Charles Patrick
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022/8/26
Y1 - 2022/8/26
N2 - Acoustic scene classification, where a scene is identified from a sound recording, is a difficult problem that is much less studied than similar problems in computer vision. Recent advances in attention-based convolutional neural networks (CNNs) can be applied to audio data by operating on two-dimensional spectrograms, where frequency and time information have been separated, rather than on a raw audio signal. Typical CNNs have difficulty coping with this problem due to the temporal aspects of acoustic data. In this research, we propose a novel and intuitive CNN-based architecture with attention mechanisms called the spatial-temporal-class attention network (STCANet). The STCANet consists of a spatial-temporal attention and a class attention, which extract information along the frequency, temporal, and class dimensions of spectrograms. In our experiments, the STCANet achieved 75.6%, 95.4%, and 97.0% accuracy on the TUT 2018, TAU 2020, and ESC-10 datasets, results that are competitive with previous works. Our contributions include this novel network design and a detailed analysis of how attention allows these results to be achieved.
AB - Acoustic scene classification, where a scene is identified from a sound recording, is a difficult problem that is much less studied than similar problems in computer vision. Recent advances in attention-based convolutional neural networks (CNNs) can be applied to audio data by operating on two-dimensional spectrograms, where frequency and time information have been separated, rather than on a raw audio signal. Typical CNNs have difficulty coping with this problem due to the temporal aspects of acoustic data. In this research, we propose a novel and intuitive CNN-based architecture with attention mechanisms called the spatial-temporal-class attention network (STCANet). The STCANet consists of a spatial-temporal attention and a class attention, which extract information along the frequency, temporal, and class dimensions of spectrograms. In our experiments, the STCANet achieved 75.6%, 95.4%, and 97.0% accuracy on the TUT 2018, TAU 2020, and ESC-10 datasets, results that are competitive with previous works. Our contributions include this novel network design and a detailed analysis of how attention allows these results to be achieved.
KW - Acoustic scene classification
KW - Attention
KW - CNN
KW - Neural network
UR - http://www.scopus.com/inward/record.url?scp=85137726370&partnerID=8YFLogxK
U2 - 10.1109/ICME52920.2022.9859735
DO - 10.1109/ICME52920.2022.9859735
M3 - Conference contribution
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - ICME 2022 - IEEE International Conference on Multimedia and Expo 2022, Proceedings
PB - IEEE Computer Society
T2 - 2022 IEEE International Conference on Multimedia and Expo, ICME 2022
Y2 - 18 July 2022 through 22 July 2022
ER -