Abstract
A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made remarkable progress: from generating natural language descriptions of images and videos, to answering questions about them, to even holding free-form conversations about visual content. However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments.
| Original language | English |
| --- | --- |
| Pages (from-to) | 1/10/2014 |
| Journal | Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations |
| Publication status | Published - 2018 |
| Event | 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy. Duration: 1 Jan 2019 → … https://www.aclweb.org/anthology/P19-1000 |