KIM Young-Min

PhD student at Sorbonne University
Team : MALIRE
https://lip6.fr/Young-Min.Kim

Supervision : Patrick GALLINARI

Document Clustering in a Learned Concept Space

Document clustering is one of the fundamental techniques of unsupervised learning from unstructured textual data, and it yields substantial efficiency gains for various information retrieval (IR) tasks. Clustering results are used not only as basic information about the structure of a collection, but also as a preliminary step for other IR applications. Probabilistic models, in turn, provide a useful framework for data analysis in unsupervised learning. They can be used as dimensionality reduction techniques that provide a compact representation of a collection, or as clustering techniques. Among these models, topic models in particular have developed rapidly and become popular tools.

In this thesis, we aim to develop effective clustering techniques that find meaningful reduced spaces in which document clustering can be performed more efficiently than in the initial bag-of-words space. To this end, we develop four clustering approaches for text collections using probabilistic models, and more precisely topic models. In particular, we integrate the dimensionality reduction induced by latent variables, which compose a concept space, and perform clustering in that space. Our experimental results confirm that these approaches are successful in terms of clustering accuracy on several data collections.

This thesis is structured in two parts. The first part presents the state of the art in clustering and probabilistic models, and the second part presents our contributions. We first develop a two-stage clustering method that uses a concept space. Inspired by its success, we then develop three clustering approaches based on probabilistic latent semantic analysis (PLSA). The Ext-PLSA model extends the previous approach by combining the two stages into a single process. The CS-PLSA algorithm allows effective model selection for clustering. Finally, voted-PLSA provides a successful multi-view clustering procedure on a multilingual collection.
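
As a rough illustration of the two-stage idea described above (learn a concept space, then cluster in it), the following Python sketch fits a small PLSA model with EM and then runs k-means on the per-document concept mixtures P(z|d). It is not the implementation used in the thesis: the function names `plsa` and `kmeans`, the toy term-count matrix, the farthest-point initialisation, and the choice of k-means for the second stage are all assumptions made for illustration.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM on a (n_docs, n_words) term-count matrix.
    Returns P(z|d), shape (n_docs, n_topics), and P(w|z), shape (n_topics, n_words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]      # (docs, topics, words)
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate both distributions from expected counts
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation
    (an illustrative choice, not taken from the thesis)."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Hypothetical toy collection: documents 0-2 use words 0-2, documents 3-5 use words 3-5.
counts = np.array([
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [5, 2, 3, 0, 0, 0],
    [0, 0, 0, 4, 3, 2],
    [0, 0, 0, 2, 5, 3],
    [0, 0, 0, 3, 3, 4],
], dtype=float)

concepts, _ = plsa(counts, n_topics=2)   # stage 1: map documents into the concept space
labels = kmeans(concepts, k=2)           # stage 2: cluster in that space
print(labels)
```

In this sketch each document is reduced from six word counts to two concept weights before clustering, which is the kind of dimensionality reduction the abstract refers to. On a real collection the concept space would typically be much smaller than the vocabulary.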


PhD defence : 12/16/2010

Jury members :

Mr. Bernd AMANN (Université Pierre et Marie Curie / Laboratoire LIP6)
Mr. Massih-Reza AMINI (Université Pierre et Marie Curie / Laboratoire LIP6) [Thesis supervisor]
Mr. Patrice BELLOT (Université d’Avignon / Laboratoire LIA-CERI)
Mr. Patrick GALLINARI (Université Pierre et Marie Curie / Laboratoire LIP6) [Thesis supervisor]
Mr. Eric GAUSSIER (Université Joseph Fourier / Laboratoire LIG) [Reviewer]
Mr. Pascal PONCELET (Ecole des Mines d’Alès / Laboratoire LGI2P) [Reviewer]

Departure date : 09/30/2011

2008-2010 Publications