JTAV: Jointly learning social media content representation by fusing textual, acoustic, and visual features

Hongru Liang, Haozheng Wang, Jun Wang, Shaodi You, Zhe Sun, Jin Mao Wei, Zhenglu Yang*

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    3 Citations (Scopus)

    Abstract

    Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems. In contrast with previous works that focus mainly on single-modal or bi-modal learning, we propose to learn social media content by jointly fusing textual, acoustic, and visual information (JTAV). Effective strategies, namely attBiGRU and DCRNN, are proposed to extract fine-grained features of each modality. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates that our proposed model outperforms the state-of-the-art approaches by a large margin.

    Original language: English
    Title of host publication: COLING 2018 - 27th International Conference on Computational Linguistics, Proceedings
    Editors: Emily M. Bender, Leon Derczynski, Pierre Isabelle
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 1269-1280
    Number of pages: 12
    ISBN (Electronic): 9781948087506
    Publication status: Published - 2018
    Event: 27th International Conference on Computational Linguistics, COLING 2018 - Santa Fe, United States
    Duration: 20 Aug 2018 to 26 Aug 2018

    Publication series

    Name: COLING 2018 - 27th International Conference on Computational Linguistics, Proceedings

    Conference

    Conference: 27th International Conference on Computational Linguistics, COLING 2018
    Country/Territory: United States
    City: Santa Fe
    Period: 20/08/18 to 26/08/18
