Document clustering is a fundamental unsupervised learning technique for unstructured textual data, and it yields substantial efficiency gains for various information retrieval (IR) tasks. Clustering results are used not only as basic information about the structure of a collection, but also as a preprocessing step for other IR applications. Probabilistic models, for their part, provide a useful framework for data analysis in unsupervised learning: they can serve as dimensionality reduction techniques that give a compact representation of a collection, or directly as clustering techniques. Among these models, topic models in particular have developed rapidly and become popular tools.
In this thesis, we are interested in developing effective clustering techniques that find meaningful reduced spaces in which document clustering can be performed more efficiently than in the initial bag-of-words space. To this end, we develop four clustering approaches for text collections based on probabilistic models, and more precisely on topic models. In particular, we aim to exploit the dimensionality reduction induced by latent variables, which together form a concept space, and to perform clustering in that space. Our experimental results on several data collections confirm that these approaches improve clustering accuracy.
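To illustrate the general idea of clustering in a concept space, the following is a minimal sketch, not the thesis's actual models: a plain-Python PLSA fitted by EM on a toy term-document matrix, after which each document is assigned to its dominant latent topic. The toy corpus, variable names, and parameter choices are all illustrative assumptions.

```python
import random

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit a PLSA model to a term-document count matrix via EM.

    counts[d][w] is the count of word w in document d.
    Returns (p_z_d, p_w_z): the distributions P(z|d) and P(w|z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random initialization of P(z|d) and P(w|z), normalized to sum to 1.
    p_z_d = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]
    p_w_z = [[rng.random() for _ in range(n_words)] for _ in range(n_topics)]
    for row in p_z_d + p_w_z:
        s = sum(row)
        row[:] = [x / s for x in row]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                c = counts[d][w]
                if c == 0:
                    continue
                # E-step: posterior P(z|d,w) proportional to P(z|d) P(w|z).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                s = sum(post) or 1.0
                for z in range(n_topics):
                    r = c * post[z] / s
                    new_z_d[d][z] += r
                    new_w_z[z][w] += r
        # M-step: renormalize the expected counts into distributions.
        for d in range(n_docs):
            s = sum(new_z_d[d]) or 1.0
            p_z_d[d] = [x / s for x in new_z_d[d]]
        for z in range(n_topics):
            s = sum(new_w_z[z]) or 1.0
            p_w_z[z] = [x / s for x in new_w_z[z]]
    return p_z_d, p_w_z

# Toy collection (illustrative): two groups of documents with disjoint vocabularies.
docs = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [5, 2, 3, 0, 0, 0],
    [0, 0, 0, 4, 3, 2],
    [0, 0, 0, 3, 5, 2],
    [0, 0, 0, 2, 4, 3],
]
p_z_d, p_w_z = plsa(docs, n_topics=2)
# Cluster each document by its dominant topic in the 2-dimensional concept space.
clusters = [max(range(2), key=lambda z: row[z]) for row in p_z_d]
```

Here each document is represented by P(z|d), a point in a low-dimensional concept space, and cluster assignment reduces to reading off the dominant coordinate; the thesis's contributions refine this basic scheme in several directions.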
This thesis is structured in two parts. The first part presents the state of the art in clustering and probabilistic models; the second part presents our contributions. We first develop a two-stage clustering method that exploits a concept space. Building on its success, we develop three clustering approaches based on probabilistic latent semantic analysis (PLSA). The Ext-PLSA model extends the two-stage approach by merging its two stages into a single process. The CS-PLSA algorithm enables effective model selection for clustering. Finally, voted-PLSA provides a successful multi-view clustering procedure for multilingual collections.