BEN YOUNES Hedi
Supervision : Matthieu CORD
Co-supervision : Nicolas THOME
Multi-modal representation learning towards visual reasoning
The quantity of images populating the Internet is increasing dramatically, making it critically important to develop technology for a precise and automatic understanding of visual content. As image recognition systems become increasingly capable, researchers in artificial intelligence now seek next-generation vision systems that can perform high-level scene understanding.
In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural-language question about any image. Because of its nature and complexity, VQA is often regarded as a proxy task for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about those images, and the corresponding answers; typical approaches rely on modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of their parameters. These fusion mechanisms are studied within the widely used visual attention framework: the answer to the question is produced by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene-understanding architecture that models objects together with their spatial and semantic relations. All models are thoroughly evaluated on standard datasets, and the results are competitive with the literature.
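To make the fusion strategy concrete, here is a generic sketch of bilinear fusion and its Tucker factorization, written in standard tensor-analysis notation; the dimensions $d_q$, $d_v$, $d_z$, $t_q$, $t_v$, $t_z$ are illustrative, and the exact parametrization used in the thesis may differ in its details. A full bilinear model combines a question embedding $q \in \mathbb{R}^{d_q}$ with an image embedding $v \in \mathbb{R}^{d_v}$ through a third-order tensor:

\[
z = (\mathcal{T} \times_1 q) \times_2 v, \qquad \mathcal{T} \in \mathbb{R}^{d_q \times d_v \times d_z},
\]

which requires $d_q \, d_v \, d_z$ parameters and quickly becomes intractable for realistic embedding sizes. A Tucker decomposition factorizes $\mathcal{T}$ into a small core tensor and three factor matrices,

\[
\mathcal{T} = \mathcal{T}_c \times_1 W_q \times_2 W_v \times_3 W_z, \qquad \mathcal{T}_c \in \mathbb{R}^{t_q \times t_v \times t_z},
\]

cutting the parameter count to $t_q t_v t_z + d_q t_q + d_v t_v + d_z t_z$ while preserving expressive bilinear interactions between the two modalities.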
Defence : 05/20/2019
Jury members :
M. Jakob Verbeek, INRIA Grenoble [rapporteur]
M. Christian Wolf, INSA de Lyon [rapporteur]
M. Vittorio Ferrari, Google AI - University of Edinburgh
M. Yann LeCun, Facebook - NYU
M. Patrick Pérez, Valeo AI
Mme Laure Soulier, Sorbonne Université - LIP6
M. Nicolas Thome, CNAM - CEDRIC
M. Matthieu Cord, Sorbonne Université - LIP6
2017-2019 Publications
2019
- H. Ben Younes : “Apprentissage de représentation multi-modale et raisonnement visuel” (Multi-modal representation learning towards visual reasoning), PhD thesis, defended 05/20/2019, supervised by Matthieu Cord, co-supervised by Nicolas Thome (2019)
- R. Cadene, C. Dancette, H. Ben-younes, M. Cord, D. Parikh : “RUBi: Reducing Unimodal Biases for Visual Question Answering”, Advances in Neural Information Processing Systems, vol. 32, Vancouver, Canada, pp. 841-852, (Curran Associates, Inc.) (2019)
- R. Cadene, H. Ben-younes, M. Cord, N. Thome : “MUREL: Multimodal Relational Reasoning for Visual Question Answering”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, United States (2019)
- H. Ben-younes, R. Cadene, N. Thome, M. Cord : “BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection”, 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), Honolulu, United States (2019)
2017
- H. Ben-younes, R. Cadene, M. Cord, N. Thome : “MUTAN: Multimodal Tucker Fusion for Visual Question Answering”, 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2631-2639, (IEEE) (2017)