Invited Speakers

Chunhua Shen

Title
Visual Question Answering: new datasets and approaches

Abstract
Combining computer vision and natural language processing is an emerging topic which has been receiving much research attention recently. Visual Question Answering (VQA) can be seen as a proxy task for evaluating a vision system's capacity for deeper image understanding. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited. We thus propose a new VQA dataset (FVQA) with additional supporting facts. In response to the observed limitations of RNN-based approaches, we propose a method based on explicit reasoning about the visual concepts detected in images. Second, an intriguing feature of the VQA challenge is the unpredictability of the questions: extracting the information required to answer them demands a variety of image operations, from detection and counting to segmentation and reconstruction. We propose a general and scalable approach which exploits the fact that very good methods for these operations already exist and thus do not need to be trained. Last, the key challenge in Visual Dialogue is to maintain a consistent and natural dialogue while continuing to answer questions correctly. We show how to combine Reinforcement Learning and Generative Adversarial Networks (GANs) to generate more human-like responses to questions.

Bio
Chunhua Shen is a Full Professor at the School of Computer Science, University of Adelaide, where he leads the Statistical Machine Learning Group. He is a Project Leader and Chief Investigator at the Australian Research Council Centre of Excellence for Robotic Vision (ACRV), for which he leads the project on machine learning for robotic vision. Before moving to Adelaide, he was with the computer vision program at NICTA (National ICT Australia), Canberra Research Laboratory, for about six years. He studied at Nanjing University and the Australian National University, and received his PhD degree from the University of Adelaide. From 2012 to 2016, he held an Australian Research Council Future Fellowship.

Marcus Rohrbach

Title
Building Systems Which Understand Vision and Language

Abstract
Vision and Language are the core channels of communication for humans, and when interacting with intelligent agents we also want to rely on these channels. In this talk I will present our recent progress in modelling vision and language systems, not only by improving their performance but also by increasing their grounding or interpretability. I will discuss applications in visual question answering, visual dialog, and video description.

Bio
Marcus Rohrbach is a research scientist at Facebook AI Research. Previously, he was a postdoctoral researcher at UC Berkeley. He received his PhD from the Max Planck Institute for Informatics, in Germany. His interests lie at the intersection of computer vision, computational linguistics, and machine learning. He received the DAGM MVTec Dissertation Award 2015 for the best PhD thesis defended in 2014; he was awarded the Best Long Paper Award at NAACL 2016 for his work on module networks; and he won the Visual Question Answering Challenge in 2016 and in 2018.

Kyung-Min Kim

Title
Visual-Linguistic Representation for Video Question Answering

Abstract
Question answering (QA) on video content is a significant task for achieving human-level intelligence, as it involves both vision and language in real-world settings. However, the video story QA task is challenging because of its multimodal, time-series, and ambiguous properties. The task requires extracting high-level meaning from multimodal content, while the scene frames and captions in a video provide redundant, highly complex, and sometimes ambiguous information. In this talk, I will introduce various visual-linguistic story representations with attention mechanisms. They represent visual-linguistic knowledge in the form of a sentence or a distributed representation, taking the entire story into account. I will also introduce the MovieQA Challenge, a video QA challenge that is attracting attention from researchers.

Bio
Kyung-Min Kim is a researcher at Clova AI Research of Naver & Line. He studied machine learning and artificial intelligence at Seoul National University and received his PhD in 2018. He won the MovieQA Challenge in 2017, placing first twice, and was awarded the Naver Ph.D. Fellowship in 2017. His interests include computer vision, natural language, and generative modeling.

Damien Teney

Title
Vision, Language, and Reasoning - What deep learning brought us, and the missing pieces

Abstract
Modern advances in computer vision and natural language processing have generated considerable interest in tasks at the intersection of these two fields. Image captioning and visual question answering have emerged as opportunities to study automated, high-level reasoning over multiple modalities. While the performance of deep learning models has generated significant excitement, recent studies have pointed to the limited reasoning abilities of such models. In this talk, we will consider the intrinsic limitations of purely supervised, learning-based approaches to these tasks, and present some of our ongoing work to overcome them using modular architectures, mid-level supervision, and a meta-learning formulation.

Bio
Damien Teney is a researcher at the Australian Institute for Machine Learning at the University of Adelaide, South Australia. His research interests are at the intersection of computer vision and machine learning. He holds an M.Sc. and a PhD in Computer Science from the University of Liege, Belgium. He has previously been affiliated with Carnegie Mellon University (USA), the University of Bath (UK), and the University of Innsbruck (Austria).

Xuming He

Title
Describing images with diverse styles

Abstract
Linguistic style is an essential part of written communication, with the power to affect both clarity and attractiveness. With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. One such style is description with emotion, which is commonplace in everyday communication. We develop a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments. We propose a novel switching recurrent neural network with word-level regularization, which is able to produce emotional image captions using only 2000+ training sentences containing sentiments. In the second part of this talk, we will discuss the problem of learning to generate styled image captions from unpaired image and text data. We present a strategy that learns styled image caption generation from a large corpus of styled text without aligned images. The core idea of this model, called SemStyle, is to separate semantics and style. One key component is a novel and concise semantic term representation generated using natural language processing techniques and frame semantics. In addition, we develop a unified language model that decodes sentences with diverse word choices and syntax for different styles. Evaluations, both automatic and manual, show that captions from SemStyle preserve image semantics, are descriptive, and are style-shifted. This work opens up possibilities for learning richer image descriptions from the plethora of linguistic data available on the web.

Bio
Xuming He is currently an Associate Professor in the School of Information Science and Technology at ShanghaiTech University. He received his Ph.D. degree in computer science from the University of Toronto in 2008 and held a postdoctoral position at the University of California, Los Angeles from 2008 to 2010. He then joined National ICT Australia (NICTA), where he was a Senior Researcher from 2013 to 2016, and was also an adjunct Research Fellow at the Australian National University from 2010 to 2016. His research interests include semantic segmentation, multimodal scene understanding, and learning in structured models.

Tatsuya Harada

Title
Acquiring Knowledge by Asking Questions

Abstract
Image recognition performance has dramatically improved thanks to progress in machine learning, large-scale and high-quality datasets, and powerful computing resources. However, most methods can adapt only to limited environments, and many problems remain to be solved before they can handle real-world settings with high uncertainty. One effective way for a recognition system to tackle such difficult problems is to learn by asking questions. To realize this kind of knowledge acquisition, three modules are required: 1) identifying unknown objects, 2) generating appropriate questions about unknown objects, and 3) learning from limited supervised information. In this talk, we will introduce our team's recent work on these modules for acquiring knowledge by asking questions.

Bio
Tatsuya Harada is a Professor in the Department of Information Science and Technology at the University of Tokyo. His research interests center on visual recognition, automatic content generation, and intelligent robots using machine learning. He received his Ph.D. from the University of Tokyo in 2001. He was a visiting scientist at Carnegie Mellon University in 2001 before joining the University of Tokyo the same year. He is also a team leader at RIKEN and a vice director of the Research Center for Medical Bigdata at the National Institute of Informatics, Japan.

Yanwei Fu

Title
Attribute Learning in Big Data

Abstract
For the past decade, computer vision research has achieved increasing success in visual recognition, including object detection and video classification. Nevertheless, these achievements still cannot meet the urgent needs of image and video understanding. The recent rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. One promising solution is to employ attribute learning to transfer information to tasks for which no data have been observed so far. In this talk, we will summarize our recent work on attribute learning for image and video understanding. We show how our approach addresses these challenges and limitations, and thus achieves better performance than previous methods.

Bio
Dr. Yanwei Fu is a tenure-track professor in the School of Data Science, Fudan University. He received his PhD degree from Queen Mary University of London in 2014, and his MEng degree from the Department of Computer Science & Technology, Nanjing University, China, in 2011. He was a postdoctoral researcher at Disney Research, Pittsburgh (2015-2016), which is co-located with Carnegie Mellon University, and he was awarded the "1000 Young Talent Program" by the Chinese government in 2018. He was also appointed Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning. His research interests include image and video understanding, and few-shot and zero-shot learning. He has published more than 30 journal and conference papers, including in IEEE TPAMI, TMM, ECCV, and CVPR, and has filed 15 patents in China and 3 in the US. His research has been widely reported in the press, including Science 2.0, PhyORG, Science Newsline Technology, Communications of the ACM, Business Standard, and EurekAlert!