BEN YOUNES Hedi
Supervision : Matthieu CORD
Co-supervision : Nicolas THOME
Multi-modal representation learning towards visual reasoning
The quantity of images populating the Internet is increasing dramatically, making it critically important to develop technology for a precise and automatic understanding of visual content. As image recognition systems become increasingly capable, researchers in artificial intelligence now seek next-generation vision systems that can perform high-level scene understanding.
In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural-language question about any image. Because of its nature and complexity, VQA is often regarded as a proxy task for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about those images, and the corresponding answers; typical approaches rely on modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of their parameters. These fusion mechanisms are studied within the widely used visual attention framework: the answer to the question is produced by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene-understanding architecture that models objects together with their spatial and semantic relations. All models are thoroughly evaluated on standard datasets, and the results are competitive with the literature.
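To make the fusion strategy concrete, here is a generic sketch of bilinear fusion and its Tucker factorization, written in standard tensor-analysis notation; the dimensions $d_q$, $d_v$, $d_z$, $t_q$, $t_v$, $t_z$ are illustrative, and the exact parametrization used in the thesis may differ in its details. A full bilinear model combines a question embedding $q \in \mathbb{R}^{d_q}$ with an image embedding $v \in \mathbb{R}^{d_v}$ through a third-order tensor:

\[
z = (\mathcal{T} \times_1 q) \times_2 v, \qquad \mathcal{T} \in \mathbb{R}^{d_q \times d_v \times d_z},
\]

which requires $d_q \, d_v \, d_z$ parameters and quickly becomes intractable for realistic embedding sizes. A Tucker decomposition factorizes $\mathcal{T}$ into a small core tensor and three factor matrices,

\[
\mathcal{T} = \mathcal{T}_c \times_1 W_q \times_2 W_v \times_3 W_z, \qquad \mathcal{T}_c \in \mathbb{R}^{t_q \times t_v \times t_z},
\]

cutting the parameter count to $t_q t_v t_z + d_q t_q + d_v t_v + d_z t_z$ while preserving expressive bilinear interactions between the two modalities.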
Defence : 05/20/2019
Jury members :
M. Jakob Verbeek, INRIA Grenoble [rapporteur]
M. Christian Wolf, INSA de Lyon [rapporteur]
M. Vittorio Ferrari, Google AI - University of Edinburgh
M. Yann LeCun, Facebook - NYU
M. Patrick Pérez, Valeo AI
Mme Laure Soulier, Sorbonne Université - LIP6
M. Nicolas Thome, CNAM - CEDRIC
M. Matthieu Cord, Sorbonne Université - LIP6
2017-2019 Publications
2019
- H. Ben Younes : “Apprentissage de représentation multi-modale et raisonnement visuel” (Multi-modal representation learning towards visual reasoning), PhD thesis, defended 05/20/2019, supervised by Matthieu Cord, co-supervised by Nicolas Thome (2019)
- R. Cadene, C. Dancette, H. Ben-younes, M. Cord, D. Parikh : “RUBi: Reducing Unimodal Biases for Visual Question Answering”, Advances in Neural Information Processing Systems, vol. 32, Vancouver, Canada, pp. 841-852, (Curran Associates, Inc.) (2019)
- R. Cadene, H. Ben-younes, M. Cord, N. Thome : “MUREL: Multimodal Relational Reasoning for Visual Question Answering”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, United States (2019)
- H. Ben-younes, R. Cadene, N. Thome, M. Cord : “BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection”, 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), Honolulu, United States (2019)
2017
- H. Ben-younes, R. Cadene, M. Cord, N. Thome : “MUTAN: Multimodal Tucker Fusion for Visual Question Answering”, 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2631-2639, (IEEE) (2017)