BEN SAAD Myriam
Supervision : Stéphane GANÇARSKI
Web Archives Quality : modeling and optimization
The Web has become the most important medium for spreading information of great cultural, scientific or economic value. Archiving the Web, or at least part of it, is crucial to preserve useful information for future generations of researchers, writers, historians, etc. However, archivists face a major challenge in maintaining the quality of the collected data, which should reflect the real Web. The work in this thesis aims at improving the quality of archives. We focus on two quality measures, temporal completeness and temporal coherence, which are highly relevant for assessing Web archives. We propose a new Web archiving approach based on the visual aspect of pages, which detects changes in the same way that users perceive them. We then propose a method to evaluate the importance of the detected changes. We model the importance of changes using patterns, through the PPaC model (Pattern of Pages Changes). Unlike existing models based on the average rate of change, PPaC better predicts the periods of time in which important changes are expected to occur on web pages. Based on PPaC, we propose different crawling strategies that aim at improving temporal completeness, temporal coherence, or both. These strategies have been implemented and tested on both simulated and real pages. The results show that the PPaC model, based on the importance of changes, is a useful instrument for significantly improving the quality of archives.
Defence : 11/18/2011
Jury members :
Serge Abiteboul, Research Director at INRIA-Saclay [Reviewer]
Vassilis Christophides, Professor at FORTH-ICS [Reviewer]
Elisabeth Murisasco, Professor at USTV
Bernd Amann, Professor at UPMC
Julien Masanès, Director of Internet Memory Foundation
Jérôme Mainka, Research Director at Antidot
Stéphane Gançarski, Associate Professor (HDR) at UPMC
2010-2012 Publications
2012
- M. Ben Saad, S. Gançarski : “Archiving the Web using Changes Patterns : a Case Study”, International Journal on Digital Libraries, vol. 13 (1), pp. 33-49, (Springer Verlag) (2012)
2011
- M. Ben Saad : “Qualité des archives Web: modélisation et optimisation”, PhD thesis, defence 11/18/2011, supervision Stéphane Gançarski (2011)
- M. Ben Saad, Z. Pehlivan, S. Gançarski : “Coherence-oriented Crawling and Navigation for Web Archives using Patterns”, 27es journées Bases de Données Avancées, BDA'11, Rabat, Morocco (2011)
- M. Ben Saad, Z. Pehlivan, S. Gançarski : “Coherence-oriented Crawling and Navigation for Web Archives using Patterns”, International Conference on Theory and Practice of Digital Libraries, TPDL 2011, vol. 6966, Lecture Notes in Computer Science, Berlin, Germany, pp. 421-433, (Springer) (2011)
- M. Ben Saad, S. Gançarski : “Improving the Quality of Web Archives through the Importance of Changes”, chapter in Database and Expert Systems Applications, vol. 6860, Lecture Notes in Computer Science, pp. 394-409, (Springer Berlin / Heidelberg), (ISBN: 978-3-642-23087-5) (2011)
2010
- Z. Pehlivan, M. Ben Saad, S. Gançarski : “Vi-DIFF: Understanding Web Pages Changes”, DEXA 2010, 21st International Conference on Database and Expert Systems Applications, vol. 6261, Lecture Notes in Computer Science, Bilbao, Spain, pp. 1-15, (Springer) (2010)
- M. Ben Saad, S. Gançarski : “Using visual pages analysis for optimizing web archiving”, In EDBT/ICDT 2010 Ph.D. Workshop, Lausanne, Switzerland, pp. 43, (ACM) (2010)