SANOJA Andres

PhD student at Sorbonne University
Team : BD
https://perso.lip6.fr/Andres.Sanoja

Supervision : Stéphane GANÇARSKI

Web Page Segmentation, Evaluation and Applications

Web pages are becoming more complex than ever, as they are usually not designed manually but generated by Content Management Systems (CMS). Analyzing them, i.e. automatically identifying and classifying elements such as main content, menus, user comments and advertising, therefore becomes difficult. Web page segmentation addresses this issue: it is the process of dividing a Web page into visually and semantically coherent segments called blocks.
The quality of a Web page segmenter is measured by its correctness (or precision) and its genericity, i.e. the variety of Web page types it is able to segment. Our research focuses on enhancing this quality and measuring it in a fair and accurate way, so that state-of-the-art segmenters can be compared.
We first propose a conceptual model for segmentation, as well as Block-o-Matic (BoM), a Web page segmenter that takes both precision and genericity into account. We then propose an evaluation model that takes the content as well as the geometry of blocks into account in order to measure the correctness of a segmentation algorithm against a predefined ground truth. The quality of four state-of-the-art algorithms (including BoM) is experimentally tested on five types of pages (blog, enterprise, forum, picture and wiki). Our evaluation framework can test any segmenter; it allows us to measure segmenter quality and to make observations about correctness. The results show that BoM performs best among the four segmentation algorithms tested, and that the performance of a segmenter depends on the type of page to segment.
We present two applications of BoM. Pagelyzer uses BoM to compare two versions of a Web page and decide whether they are similar. It is the main contribution of our team to the European project Scape (FP7-IP). We also developed a tool for migrating Web pages from HTML4 to HTML5 in the context of Web archives.

Defence : 01/22/2015

Jury members :

MURISASCO Elisabeth (Professor, Université de Toulon) [Reviewer]
RUKOZ Marta (Professor, Université de Paris Ouest Nanterre) [Reviewer]
BOUGANIM Luc (Research Director, Inria Rocquencourt)
SENELLART Pierre (Professor, Télécom ParisTech)
CORD Matthieu (Professor, UPMC)
GANÇARSKI Stéphane (Associate Professor, HDR, UPMC)

Departure date : 01/31/2015

2012-2016 Publications