2005
info:eu-repo/semantics/OpenAccess
N. Georgiev et al., « Supervised Data Extraction », HAL-SHS : linguistique, ID : 10670/1.8o1srj
The process of data extraction from Internet sources has attracted the interest of the scientific community in recent years. However, there are still no well-established standards because of the heterogeneous nature of the information in the Global Network. Nevertheless, there is something in common: all the data is available in HTML format for compatibility reasons. This article presents our methodology and the prototype system we have created to extract data from HTML pages. We use XPath as the data extraction language and have developed a methodology for visual wrapper generation. Our approach takes advantage of the implicit correlation between the data and the surrounding structure. Some evaluation tests are also given in order to justify our methods.
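
As a rough illustration of XPath-based extraction from HTML, the following minimal sketch applies a small "wrapper" (a record path plus per-field relative paths) to a page fragment. It is not the authors' prototype: the page content, the `WRAPPER` structure, and the `extract` function are hypothetical names introduced only for this example, and it assumes well-formed markup so that Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath, can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (illustrative only, not from the paper).
PAGE = """
<html>
  <body>
    <table id="products">
      <tr class="item"><td class="name">Lamp</td><td class="price">12.50</td></tr>
      <tr class="item"><td class="name">Chair</td><td class="price">40.00</td></tr>
    </table>
  </body>
</html>
"""

# Assumed wrapper shape: one path that selects each record, plus paths
# (relative to the record node) that select each field.
WRAPPER = {
    "record": ".//table[@id='products']/tr[@class='item']",
    "fields": {
        "name": "td[@class='name']",
        "price": "td[@class='price']",
    },
}


def extract(page, wrapper):
    """Return one dict per record matched by the wrapper's record path."""
    root = ET.fromstring(page)
    records = []
    for row in root.findall(wrapper["record"]):
        record = {}
        for field, path in wrapper["fields"].items():
            node = row.find(path)
            # A missing field or empty cell becomes None instead of raising.
            has_text = node is not None and node.text
            record[field] = node.text.strip() if has_text else None
        records.append(record)
    return records


rows = extract(PAGE, WRAPPER)
for row in rows:
    print(row)
```

The field paths depend on the structure surrounding the data (table, row, cell class) rather than on the values themselves, which is one way to read the correlation between data and structure that the abstract mentions. Real-world HTML is often not well-formed, so a tolerant HTML parser would be needed in practice.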