Changes between Version 4 and Version 5 of WAC-X

Nov 20, 2015, 3:43:53 PM
Roland Schäfer



  • WAC-X

    v4 v5  
    44 44 The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also among theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus.
    45 46 For almost a decade, the ACL SIGWAC, and especially the highly successful Web as Corpus (WAC) workshops, have served as a platform for researchers interested in compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, Corpus Linguistics). As in previous years, the 10th Web as Corpus workshop (WAC-X) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to
    81 82 As part of the 10th Web as Corpus workshop (WAC-X), a panel discussion will be organized. Web corpus designers are probably those who are most affected by issues and uncertainties of copyright legislation and intellectual property rights, especially in the EU. While in some countries, such as the U.S., a Fair Use doctrine allows the use of data for non-commercial research purposes, the situation in Europe is more problematic. For example, German copyright law ("Urheberrecht") requires that any re-use of a work which reaches a certain threshold of creativity be explicitly approved by the author. This poses numerous problems for any corpus creator, and obtaining such approval is completely infeasible for large web corpora containing texts written by millions of different authors. Thus, corpora are re-distributed in crippled form as sentence shuffles (e.g., COW and the Leipzig Corpora Collection), and it is not even clear whether there really is a reliable legal exemption for single sentences. In the famous Infopaq case, referred by a Danish court, the European Court of Justice decided that even snippets of 11 words might be protected under EU copyright laws.
    82 84 This situation is highly undesirable. Large web corpora have been shown to be indispensable for many tasks in computational linguistics, in the documentation of standard and non-standard language, and in empirically oriented theoretical linguistics.
    83 86 Reports written by legal experts – such as the one recently commissioned by the German Research Council – only provide an interpretation of the given legal situation. Only active lobbying in favor of a reasonable copyright reform will eventually bring about the necessary changes such that researchers can build corpus resources and share them freely for academic purposes. Therefore, the goal of this panel discussion is to bring together corpus creators, active users of web corpora, and open science activists in order to share and discuss views on the copyright problem as a political rather than a legal problem. Ideally, a first draft of a joint declaration might come out of this discussion. With such a declaration, the (web) corpus community could make sure that its voice is heard, especially in the ongoing discussion about reforms of the European copyright legislation.