LIP6 2001/020:
Doctoral thesis, Université Paris 6 / LIP6 research reports
210 pages - July 2001
French document.
Get it: 2459 KB
Team: Apprentissage et Acquisition de Connaissances (Machine Learning and Knowledge Acquisition)
French title: Apprentissage Automatique et Recherche d'Information : application à l'Extraction d'Information de surface et au Résumé de Texte.
English title: Machine Learning and Information Retrieval: Application to Surface Information Extraction and Text Summarization.
Abstract: The purpose of this work is to apply machine learning techniques to Information Retrieval tasks. Our aim was to explore the potential of learning techniques for meeting the textual information access needs created by the growth of very large databases and the Internet. In this context it is becoming important to handle huge quantities of data, to provide solutions to new user needs, and to automate the tools that exploit textual information. To this end, we have explored two directions. The first is the development of systems able to model the sequential nature of documents, so as to take advantage of information that classical information retrieval systems do not handle. For this, we propose statistical models based on Hidden Markov Models and Neural Networks. We show how these systems extend the capabilities of classical probabilistic information retrieval models and, in particular, how they can be used for surface information extraction tasks. The second direction concerns the semi-supervised learning paradigm, in which a small labeled data set is used together with a huge unlabeled data set to train systems for information access tasks. This situation is frequently encountered in information retrieval. We propose and analyze original algorithms based on a discriminant formalism. We have used these techniques for text summarization, where the goal is to extract the most relevant sentences of a document. This study has led to the development of an automatic summarization system (S.A.R.A.).
Keywords: Textual information access, Information retrieval and extraction, Machine learning, Sequence models, Semi-supervised learning, Text summarization