Staff directory

CADENE Rémi

PhD Student at Sorbonne University
Team : MLIA
http://remicadene.com

Supervision : Matthieu CORD
Co-supervision : Nicolas THOME

Deep multimodal learning for vision and language processing

Digital technologies have become instrumental in transforming our society. Recent statistical methods have been successfully deployed to automate the processing of the growing amount of images, videos, and texts we produce daily. In particular, deep neural networks have been adopted by the computer vision and natural language processing communities for their ability to perform accurate image recognition and text understanding once trained on big sets of data. Advances in both communities built the groundwork for new research problems at the intersection of vision and language. Integrating language into visual recognition could have an important impact on human life through the creation of real-world applications such as next-generation search engines or AI assistants.

In the first part of this thesis, we focus on systems for cross-modal text-image retrieval. We propose a learning strategy to efficiently align both modalities while structuring the retrieval space with semantic information. In the second part, we focus on systems able to answer questions about an image. We propose a multimodal architecture that iteratively fuses the visual and textual modalities using a factorized bilinear model while modeling pairwise relationships between each region of the image. In the last part, we address the issues related to biases in the modeling. We propose a learning strategy to reduce the language biases which are commonly present in visual question answering systems.

PhD defence : 07/08/2020

Jury members :

Ms. Gabriela Csurka, Naver LABS Europe [reviewer]
Mr. Ivan Laptev, INRIA Paris [reviewer]
Mr. Patrick Gallinari, Sorbonne Université - LIP6
Mr. Thomas Serre, Brown University
Mr. Eduardo Valle, Campinas University - RECOD
Mr. Nicolas Thome, CNAM - CEDRIC
Mr. Matthieu Cord, Sorbonne Université - LIP6

Departure date : 04/04/2021

2017-2021 Publications


2021
- C. Dancette, R. Cadene, D. Teney, M. Cord : “Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering”, 2021 International Conference on Computer Vision, Montreal, Canada (2021)
- C. Dancette, R. Cadene, X. Chen, M. Cord : “Learning Reasoning Mechanisms for Unbiased Question-based Counting”, VQA Workshop, CVPR 2021, Nashville, United States (2021)
2020
- R. Cadene : “Deep multimodal learning for vision and language processing”, thesis, PhD defence 07/08/2020, supervision : Matthieu Cord, co-supervision : Nicolas Thome (2020)
2019
- R. Cadene, C. Dancette, H. Ben‑younes, M. Cord, D. Parikh : “RUBi: Reducing Unimodal Biases for Visual Question Answering”, Neural Information Processing Systems, vol. 32, Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 841-852, (Curran Associates, Inc.) (2019)
- R. Cadene, H. Ben‑younes, M. Cord, N. Thome : “MUREL: Multimodal Relational Reasoning for Visual Question Answering”, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, United States (2019)
- H. Ben‑younes, R. Cadene, N. Thome, M. Cord : “BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection”, AAAI 2019 - 33rd AAAI Conference on Artificial Intelligence, Honolulu, United States (2019)
2018
- M. Carvalho, R. Cadene, D. Picard, L. Soulier, N. Thome, M. Cord : “Cross-Modal Retrieval in the Cooking Context”, SIGIR proceedings, Ann Arbor, Michigan, United States, pp. 35-44, (ACM Press) (2018)
- M. Carvalho, R. Cadene, D. Picard, L. Soulier, M. Cord : “Images & Recipes: Retrieval in the cooking context”, International Conference on Data Engineering (ICDE), DECOR workshop, Paris, France (2018)
2017
- H. Ben‑younes, R. Cadene, M. Cord, N. Thome : “MUTAN: Multimodal Tucker Fusion for Visual Question Answering”, 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2631-2639, (IEEE) (2017)