VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation

Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    119 Citations (Scopus)

    Abstract

    Accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application for the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty of adapting the BERT architecture to the partially observable Markov decision process present in VLN, which requires history-dependent attention and decision making. In this paper, we propose a recurrent BERT model that is time-aware for use in VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE, we demonstrate that our model can replace more complex encoder-decoder models to achieve state-of-the-art results. Moreover, our approach can be generalised to other transformer-based architectures, supports pre-training, and is capable of solving navigation and referring expression tasks simultaneously.

    Original language: English
    Title of host publication: Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021
    Publisher: IEEE Computer Society
    Pages: 1643-1653
    Number of pages: 11
    ISBN (Electronic): 9781665445092
    Publication status: Published - 2021
    Event: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021 - Virtual, Online, United States
    Duration: 19 Jun 2021 to 25 Jun 2021

    Publication series

    Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
    ISSN (Print): 1063-6919

    Conference

    Conference: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021
    Country/Territory: United States
    City: Virtual, Online
    Period: 19/06/21 to 25/06/21
