Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    3550 Citations (Scopus)

    Abstract

    Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
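
    As a rough illustration of the two-stage mechanism the abstract describes, the sketch below shows soft top-down attention over a set of bottom-up region features. This is a minimal PyTorch sketch under assumed tensor shapes and layer sizes (36 regions of dimension 2048, a 1024-dimensional context vector, a hypothetical `TopDownAttention` module), not the authors' released implementation; the bottom-up region features would in practice come from a Faster R-CNN detector.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopDownAttention(nn.Module):
        """Hypothetical sketch of soft top-down attention over bottom-up regions.

        Region features (e.g., produced by a Faster R-CNN detector) are scored
        against a task context vector (e.g., a caption-LSTM state or a question
        encoding); the weighted average becomes the attended image feature.
        """

        def __init__(self, feat_dim: int, ctx_dim: int, hidden_dim: int = 512):
            super().__init__()
            self.proj_feat = nn.Linear(feat_dim, hidden_dim)
            self.proj_ctx = nn.Linear(ctx_dim, hidden_dim)
            self.score = nn.Linear(hidden_dim, 1)

        def forward(self, regions: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
            # regions: (batch, k, feat_dim) -- k proposed salient image regions
            # context: (batch, ctx_dim)     -- top-down signal (caption state / question)
            joint = torch.tanh(self.proj_feat(regions) + self.proj_ctx(context).unsqueeze(1))
            weights = F.softmax(self.score(joint), dim=1)   # (batch, k, 1) attention over regions
            return (weights * regions).sum(dim=1)           # (batch, feat_dim) attended feature

    # Toy usage with random stand-in features (assumed shapes)
    regions = torch.randn(2, 36, 2048)   # e.g., 36 region features of dimension 2048
    context = torch.randn(2, 1024)       # e.g., question or caption-LSTM hidden state
    attended = TopDownAttention(2048, 1024)(regions, context)
    print(attended.shape)                # torch.Size([2, 2048])
    ```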

    Original language: English
    Title of host publication: Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
    Publisher: IEEE Computer Society
    Pages: 6077-6086
    Number of pages: 10
    ISBN (Electronic): 9781538664209
    DOIs
    Publication status: Published - 14 Dec 2018
    Event: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 - Salt Lake City, United States
    Duration: 18 Jun 2018 - 22 Jun 2018

    Publication series

    Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
    ISSN (Print): 1063-6919

    Conference

    Conference: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
    Country/Territory: United States
    City: Salt Lake City
    Period: 18/06/18 - 22/06/18
