TY - GEN
T1 - Guided Open Vocabulary Image Captioning with Constrained Beam Search
AU - Anderson, Peter
AU - Fernando, Basura
AU - Johnson, Mark
AU - Gould, Stephen
N1 - Publisher Copyright:
© 2017 Association for Computational Linguistics.
PY - 2017
AB - Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real-world applications dealing with images in the wild. We address this problem using a flexible approach that enables existing deep captioning architectures to take advantage of image taggers at test time, without re-training. Our method uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words. Using this approach, we achieve state-of-the-art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning). Perhaps surprisingly, our results significantly outperform approaches that incorporate the same tag predictions into the learning algorithm. We also show that we can significantly improve the quality of generated ImageNet captions by leveraging ground-truth labels.
UR - http://www.scopus.com/inward/record.url?scp=85048487879&partnerID=8YFLogxK
DO - 10.18653/v1/d17-1098
M3 - Conference contribution
T3 - EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 936
EP - 945
BT - EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings
PB - Association for Computational Linguistics (ACL)
T2 - 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017
Y2 - 9 September 2017 through 11 September 2017
ER -
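
The abstract above describes constrained beam search, which forces selected tag words to appear in the generated caption by keeping one beam per finite-state-machine state (each state being the subset of constraint words emitted so far) and only accepting captions that reach the all-constraints-satisfied state. Below is a minimal, illustrative Python sketch of that idea, not the authors' implementation: step_scores is a hypothetical stand-in for the paper's captioning model, and vocab, beam_size, and max_len are arbitrary toy values.

import math

def step_scores(prefix, vocab):
    # Hypothetical stand-in for the captioning model: near-uniform
    # next-word log-probabilities with a mild length penalty.
    return {w: -math.log(len(vocab)) - 0.01 * len(prefix) for w in vocab}

def constrained_beam_search(vocab, tags, beam_size=3, max_len=8, eos="</s>"):
    goal = frozenset(tags)
    # One beam per finite-state-machine state; a state is the subset of
    # tag words the hypothesis has emitted so far.
    beams = {frozenset(): [(0.0, [])]}
    complete = []  # finished captions that mention every tag word

    for _ in range(max_len):
        candidates = {}
        for state, hyps in beams.items():
            for logp, seq in hyps:
                for word, lp in step_scores(seq, vocab).items():
                    if word == eos:
                        # A caption may only terminate once every
                        # constraint word has been generated.
                        if state == goal:
                            complete.append((logp + lp, seq))
                        continue
                    new_state = state | ({word} & goal)
                    candidates.setdefault(new_state, []).append(
                        (logp + lp, seq + [word]))
        # Prune each state's beam to its top beam_size hypotheses.
        beams = {s: sorted(c, reverse=True)[:beam_size]
                 for s, c in candidates.items()}

    return max(complete) if complete else None

# Toy usage: the returned caption is guaranteed to contain both tags.
vocab = ["a", "dog", "catches", "the", "frisbee", "</s>"]
best = constrained_beam_search(vocab, tags={"dog", "frisbee"})
if best is not None:
    print(" ".join(best[1]))

Keeping a separate beam per constraint state is the essential design choice: with a single shared beam, hypotheses that commit early to low-probability tag words would be pruned before they could satisfy the constraints.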