PANTIN Jérémie
Supervision : Christophe MARSALA
Detection and semantic characterisation of textual outliers
Outlier detection is a recurring problem in machine learning: it consists of identifying data points that differ significantly from the rest of the dataset. We focus on identifying such outliers in textual data, a task that raises several challenges, including the formal definition of a textual outlier. In particular, syntactic outliers must be distinguished from semantic ones. To address this ambiguity, we propose a new taxonomy for identifying these outliers.
Within this framework, we identify various types of outliers and their associated levels of difficulty, and we introduce a novel method to study them. This method makes it possible to leverage a wide range of datasets and highlights the strengths and weaknesses of anomaly detection and outlier detection approaches. Outlier detection can be performed with ensemble methods, in which multiple text representations are combined with various detection techniques, improving efficiency and robustness against challenging outliers.
We introduce a novel approach that combines robust learning and ensemble learning, and we connect this work with explainable AI (XAI) and data representation studies. Lastly, we present an application of our work to unsupervised abstractive summarisation. In this scenario, outlier analysis helps filter out non-relevant sentences, which improves the quality of the summary.
Defence : 09/11/2023
Jury members :
LAURENT Anne (Université de Montpellier) [Rapporteur]
SMITS Gregory (IMT Atlantique) [Rapporteur]
AMANN Bernd (Sorbonne Université)
MARSALA Christophe (Sorbonne Université)
2022-2024 Publications
2024
- J. Pantin, Ch. Marsala : “Détection d’anomalies textuelles par ensemble d’autoencodeurs robustes”, Revue des Nouvelles Technologies de l'Information, vol. Extraction et Gestion des Connaissances, RNTI-E-40, Dijon, France, pp. 319-326 (2024)
2023
- J. Pantin : “Détection et caractérisation sémantique de données textuelles aberrantes”, PhD thesis, defended 09/11/2023, supervised by Christophe Marsala (2023)
2022
- J. Pantin, Ch. Marsala, M.‑J. Lesot : “Analyse de Données Aberrantes pour le Texte: Taxonomie et Étude Expérimentale”, Actes de l'atelier sur la fouille de textes - TextMine'22, Blois, France, pp. 15-26 (2022)