TY - JOUR
T1 - Deep0Tag
T2 - Deep Multiple Instance Learning for Zero-Shot Image Tagging
AU - Rahman, Shafin
AU - Khan, Salman
AU - Barnes, Nick
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2020/1
Y1 - 2020/1
AB - Zero-shot learning aims to perform visual reasoning about unseen objects. In line with the success of deep learning on object recognition problems, several end-to-end deep models for zero-shot recognition have been proposed in the literature. These models are successful in predicting a single unseen label given an input image but do not scale to cases where multiple unseen objects are present. Here, we focus on the challenging problem of zero-shot image tagging, where multiple labels are assigned to an image and may relate to objects, attributes, actions, events, and scene type. Discovery of these scene concepts requires the ability to process multi-scale information. To encompass global as well as local image details, we propose an automatic approach to locate relevant image patches and model image tagging within the Multiple Instance Learning (MIL) framework. To the best of our knowledge, we propose the first end-to-end trainable deep MIL framework for the multi-label zero-shot tagging problem. We explore several alternatives for instance-level evidence aggregation and perform an extensive ablation study to identify the optimal pooling strategy. Due to its novel design, the proposed framework has several interesting features: 1) Unlike previous deep MIL models, it does not use any off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation. 2) At test time, it can process any number of unseen labels given their semantic embedding vectors. 3) Using only image-level seen labels as weak annotation, it can produce a localized bounding box for each predicted label. We experiment with the large-scale NUS-WIDE and MS-COCO datasets and achieve superior performance across conventional, zero-shot, and generalized zero-shot tagging tasks.
KW - Deep learning
KW - Feature pooling
KW - Multiple instance learning
KW - Object detection
KW - Zero-shot tagging
UR - http://www.scopus.com/inward/record.url?scp=85077792443&partnerID=8YFLogxK
U2 - 10.1109/TMM.2019.2924511
DO - 10.1109/TMM.2019.2924511
M3 - Article
SN - 1520-9210
VL - 22
SP - 242
EP - 255
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 1
M1 - 8744401
ER -