The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction

Cite this document

Roland Schäfer et al., "The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction", HAL-SHS: linguistique, ID: 10670/1.xf2zy8

Abstract

In this paper, we examine notions of text quality in the context of web corpus construction. Web documents often contain material which disqualifies them from inclusion in a corpus (tag clouds, lists of names or nouns, etc.). First, we look at the agreement between coders (especially corpus designers) given the task of rating text quality. Then, we evaluate a simple and fully unsupervised method of text quality assessment based on short and very frequent words. Finally, we describe our general approach to the construction of carefully cleansed and non-destructively normalized web corpora. Under this approach, we annotate documents with quality metrics instead of actually removing those documents classified as being of low quality.
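As a rough illustration of the idea sketched in the abstract (scoring text by its share of short, very frequent words, and annotating documents rather than deleting them), the following minimal Python sketch may help. It is not the authors' implementation: the word list `FREQUENT_WORDS`, the function names, and the `threshold` value are all hypothetical choices made for this example.

```python
import re

# Hypothetical list of short, very frequent English words.
# This list is an assumption for illustration, not taken from the paper.
FREQUENT_WORDS = frozenset(
    {"the", "of", "and", "to", "a", "in", "is", "it", "that", "for"}
)


def frequent_word_ratio(text: str) -> float:
    """Return the fraction of tokens that are short, very frequent words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FREQUENT_WORDS)
    return hits / len(tokens)


def annotate(doc_id: str, text: str, threshold: float = 0.15) -> dict:
    """Attach a quality score to a document instead of discarding it.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    score = frequent_word_ratio(text)
    return {
        "id": doc_id,
        "text": text,  # the text is kept regardless of its score
        "quality_score": score,
        "low_quality": score < threshold,
    }


prose = "The cat sat in the garden and it looked for a bird."
tag_cloud = "linux python corpus web crawler download login"

print(annotate("doc-1", prose)["low_quality"])
print(annotate("doc-2", tag_cloud)["low_quality"])
```

Under these assumptions, running prose contains many function words and scores high, while a tag cloud or a list of nouns scores near zero. Both documents stay in the corpus with their scores attached, so users can apply their own cut-off later.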
