SPENGLER Alexander
Supervision : Patrick GALLINARI
Co-supervision : SCHÖLKOPF Bernhard, SZUMMER Martin
Probabilistic Web Content Analysis. Representation of Content Semantics in the Bayesian Diagnostic Paradigm.
The automatic identification of meaningful content sections on web pages, such as titles, paragraphs, advertisements, product images or user comments, facilitates a large number of applications, ranging from speech rendering for the visually impaired through contextual advertising to structured web search. Ultimately, such an identification always requires both a partitioning of the content and a classification of the resulting partitions into a number of application-dependent semantic categories. We hence propose to approach the analysis of web content in an interdependent classification framework, integrating semantic coherence, just as in segmentation, via interaction features which describe the semantic configuration of two or more semantically atomic content regions.
One of the major obstacles to gaining meaningful access to web content is its semantically inappropriate organisation and markup. As a consequence, it is generally impossible to characterise an interesting content region with certainty. In this thesis, we propose to treat the uncertainties arising in an analysis of web content in a coherent probabilistic framework, the Bayesian diagnostic paradigm. We attempt to illuminate the conditions under which some probability model might be justified, deriving its form of representation from assumptions about observable quantities such as region features and semantics, and utilising the concepts of exchangeability, conditional independence and sufficiency. In particular, we examine different Markovian dependencies between the semantic content categories within individual web pages and discuss how to take into account the structure that exists between pages and sites. We also present an informal feature analysis which elucidates the manifold information available in the content, structure and style of a web page. Such an analysis is an essential prerequisite to both formal probabilistic modelling and high predictive performance.
Furthermore, we introduce a new, publicly available data set of 604 real-world news web pages from 177 distinct sites with accurate annotations based on 30 distinct semantic categories, termed the NEWS600 corpus. Finally, we conduct a series of experiments on the NEWS600 corpus to empirically compare a number of different approaches for web news content classification. These experiments demonstrate that even relatively simple models in our framework achieve significantly better results than the current state of the art.
Defence : 12/12/2011
Jury members :
Boris Chidlovskii, Researcher at Xerox Research Centre Europe [Reviewer]
Isabelle Tellier, Professor at Université Sorbonne Nouvelle [Reviewer]
Mathieu Cord, Professor at Université Pierre et Marie Curie
Gregory Grefenstette, Scientific Director at Exalead
Patrick Gallinari, Professor at Université Pierre et Marie Curie
2009-2013 Publications
2013
- S. Rubrichi, S. Quaglini, A. Spengler, P. Russo, P. Gallinari : “A system for the extraction and representation of summary of product characteristics content”, Artificial Intelligence in Medicine, vol. 57 (2), pp. 145-154, (Elsevier) (2013)
2011
- A. Spengler : “Analyse probabiliste du contenu de pages Web. Représentation des sémantiques de contenu dans le paradigme Bayésien”, PhD thesis, defended 12/12/2011, supervision: Gallinari, Patrick, co-supervision: Schölkopf, Bernhard, Szummer, Martin (2011)
- S. Rubrichi, S. Quaglini, Alexander A. Spengler, P. Gallinari : “Extracting Information from Summary of Product Characteristics for Improving Drugs Prescription Safety”, 13th Conference on Artificial Intelligence in Medicine (AIME 2011), vol. 6747, Lecture Notes in Computer Science, Bled, Slovenia, pp. 327-337, (Springer) (2011)
- S. Rubrichi, Alexander A. Spengler, P. Gallinari, S. Quaglini : “Preventing Adverse Drug Events by Extracting Information from Drug Fact Sheets”, Proceedings of the Fourth International Symposium for Semantic Mining in Biomedicine, vol. 714, Cambridge, United Kingdom, pp. 6, (CEUR-WS.org) (2011)
2010
- Alexander A. Spengler, P. Gallinari : “Document Structure Meets Page Layout: Loopy Random Fields for Web News Content Extraction”, 10th ACM Symposium on Document Engineering (DocEng 2010), Manchester, United Kingdom, pp. 151-160, (ACM) (2010)
2009
- A. Spengler, P. Gallinari : “Learning to Extract Content from News Webpages”, International Conference on Advanced Information Networking and Applications Workshops, 2009 (WAINA '09), Bradford, United Kingdom, pp. 709-714, (IEEE) (2009)