11th Web as Corpus Workshop (WAC-XI)
at Corpus Linguistics 2017, Birmingham on 24 July 2017
endorsed by the Special Interest Group of the ACL on Web as Corpus (SIGWAC)
Note: WAC-XI has been merged with CMLC + BigNLP. Please refer to the CMLC website for details.
Organisers
- Adrien Barbaresi (BBAW Berlin/ÖAW Vienna)
- Felix Bildhauer (IDS Mannheim)
- Roland Schäfer (Freie Universität Berlin (DFG))
Accepted papers
The accepted papers will appear in the proceedings of CMLC + BigNLP.
- Edyta Jurkiewicz-Rohrbacher, Zrinka Kolaković, Björn Hansen: Web Corpora – the best possible solution for tracking rare phenomena in underresourced languages – clitics in Bosnian, Croatian and Serbian
- Vladimir Benko: Are Web Corpora Inferior? The Case of Czech and Slovak
- Vít Suchomel: Removing Spam from Web Corpora Through Supervised Learning Using FastText
Workshop description, call for papers, and details
For almost a decade, the ACL SIGWAC, and most notably the Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and use of web-derived corpora as well as computer-mediated communication. Past workshops were co-located with major conferences on corpus linguistics and/or computational linguistics (such as ACL, EACL, Corpus Linguistics, LREC, NAACL, WWW). The eleventh Web as Corpus workshop (WAC-XI) emphasises the linguistic aspects of web corpus research more than the technological aspects while keeping in mind that the two are inseparable.
The World Wide Web has become increasingly popular as a source of linguistic evidence, not only within the computational linguistics community, but also among theoretical linguists facing problems such as data sparseness or the lack of variation in traditional corpora of written language. Accordingly, web corpora continue to gain relevance, given their size and their diversity in terms of genres and text types. In lexicography, web data have become a major and well-established resource with dedicated research data and specialised tools such as Sketch Engine. In other areas of linguistics, the adoption of web corpora has been slower but steady. Furthermore, some completely new areas of research dealing exclusively with web (or similar) data have emerged, such as the construction and utilisation of corpora based on short messages. Another example is the (manual or automatic) classification of web texts by genre, register, or, more generally speaking, text type, as well as by topic area. Similarly, corpus evaluation and corpus comparison have advanced greatly through the rise of web corpora, mostly because web corpora (especially larger ones of several billion tokens) are often created by downloading texts from the web unselectively with respect to their text type or content. While the composition (or stratification) of such corpora cannot be determined before their construction, it is desirable to evaluate it at least afterwards. Also, comparing web corpora to corpora that have been compiled in a more traditional way is key in determining the quality of web corpora with respect to a given research question.
Call for papers
The eleventh Web as Corpus workshop (WAC-XI) takes a (corpus) linguistic look at the state of the art in all these areas. More specifically, in linguistic publications presenting case studies based on web data, some authors explicitly discuss and/or defend the validity of web corpus data for a specific type of research question, while others simply take web corpora as a new or complementary source of data without discussing fundamental questions of data quality and the appropriateness of web data for a given research question. We think it is vital to discuss such fundamental questions, and therefore ask researchers to present and discuss:
- case studies in corpus or computational linguistics where web data have been used,
- research specifically related to the validity of web data in corpus, computational, and theoretical linguistics,
- research on the technical aspects of web corpus construction which have a strong influence on theoretical aspects of corpus design.
For example, presentations could address questions such as the following (either as part of a case study or in the form of primary research):
- Are there substantial differences in theoretical inferences when web data are used instead of data from traditionally compiled corpora? If so: Why? Are they expected?
- Do findings from traditionally compiled corpora and web corpora converge when compared with evidence from other sources (such as psycholinguistic experiments)? If not: Which type of data matches the external findings better?
- Is it possible to analyse lectal variation with web corpora, given the frequent lack of relevant metadata?
- How good is the quality of the (automatic) linguistic annotation of web data compared to traditionally compiled corpora? How does this affect empirical linguistic research with web corpora? What could corpus designers do to improve it?
- Are there differences with regard to the dispersion of linguistic entities in web corpora compared to traditionally compiled corpora? If so: Why? Does it matter? How can we deal with it or even profit from it?
- How do very large web corpora compare to smaller, more intentionally stratified web corpora created for a specific task? How can it be decided which type of corpus is better for a given research question?
Programme committee
- Masayuki Asahara, National Inst. for Japanese Language and Linguistics, JP
- Piotr Bański, IDS Mannheim, DE
- Silvia Bernardini, U of Bologna, IT
- Niels Brügger, University of Aarhus, DK
- Sascha Diwersy, Université Montpellier 3, FR
- Stefan Evert, FAU Erlangen, DE
- Susanne Flach, Freie Universität Berlin, DE
- Cédrick Fairon, UC Louvain, BE
- William H. Fletcher, U.S. Naval Academy, US
- Jack Grieve, Aston University, UK
- Aurélie Herbelot, University of Trento, IT
- Matthias Hüning, FU Berlin, DE
- Detmar Meurers, Universität Tübingen, DE
- Miloš Jakubíček, Masaryk University Brno, CZ
- Iztok Kosem, Trojina, Institute for Applied Slovene Studies, SI
- Anne Krause, Universität Leipzig, DE
- Simon Krek, Jožef Stefan Institute, SI
- Lothar Lemnitzer, BBAW, DE
- Nikola Ljubešić, Jožef Stefan Institute, Ljubljana, SI
- Steffen Remus, TU Darmstadt, DE
- Antonio Ruiz Tinoco, Sophia University, JP
- Kevin Scannell, Saint Louis U, US
- Serge Sharoff, University of Leeds, UK
- Barbara Schlücker, Universität Bonn, DE
- Sabine Schulte im Walde, IMS Stuttgart, DE
- Klaus Schulz, LMU München, DE
- Egon Stemle, Eurac Research, IT
- Peter Uhrig, FAU Erlangen, DE
- Marieke van Erp, VU Amsterdam, NL
- Wajdi Zaghouani, CMU Qatar, QA
- Amir Zeldes, Georgetown University, Washington, US
- Arne Zeschel, IDS Mannheim, DE