SPICE: Semantic propositional image caption evaluation

Peter Anderson*, Basura Fernando, Mark Johnson, Stephen Gould

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    1126 Citations (Scopus)

    Abstract

    There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as "which caption-generator best understands colors?" and "can caption-generators count?".
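
    The abstract does not spell out how a score is computed from scene graphs. The full paper describes it as an F-score over semantic proposition tuples (objects, attributes, relations) extracted from the candidate and reference captions. The following is a minimal sketch under that assumption. The function name `tuple_f1` and the hand-written tuples are hypothetical, and the sketch uses exact tuple matching only, with no caption parser and no synonym matching.

```python
from typing import Iterable, Set, Tuple

# A proposition is a tuple of 1 (object), 2 (object, attribute),
# or 3 (subject, relation, object) strings. Illustrative type only.
Proposition = Tuple[str, ...]


def tuple_f1(candidate: Iterable[Proposition],
             reference: Iterable[Proposition]) -> float:
    """F1 over exactly matching proposition tuples (simplified sketch)."""
    cand: Set[Proposition] = set(candidate)
    ref: Set[Proposition] = set(reference)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)  # exact matches only (assumption)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hand-written tuples standing in for parsed scene graphs (hypothetical data).
candidate = {("girl",), ("table",), ("girl", "young"),
             ("girl", "stand-on", "table")}
reference = {("girl",), ("court",), ("girl", "young"),
             ("girl", "stand-on", "court"), ("racket",)}

score = tuple_f1(candidate, reference)
print(f"{score:.3f}")
```

    Restricting both sets to one tuple type, for example color attributes or numerals, gives the kind of per-category score that the abstract's questions about colors and counting refer to.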

    Original language: English
    Title of host publication: Computer Vision - 14th European Conference, ECCV 2016, Proceedings
    Editors: Bastian Leibe, Jiri Matas, Nicu Sebe, Max Welling
    Publisher: Springer Verlag
    Pages: 382-398
    Number of pages: 17
    ISBN (Print): 9783319464534
    DOIs
    Publication status: Published - 2016
    Event: 14th European Conference on Computer Vision, ECCV 2016 - Amsterdam, Netherlands
    Duration: 8 Oct 2016 → 16 Oct 2016

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 9909 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 14th European Conference on Computer Vision, ECCV 2016
    Country/Territory: Netherlands
    City: Amsterdam
    Period: 8/10/16 → 16/10/16
