Changes between Version 11 and Version 12 of WAC-X


Timestamp:
01/24/16 16:07:19
Author:
Roland Schäfer
Comment:

--

  • WAC-X

    v11 v12  
     1[[PageOutline]]
     2
    13= 10th Web as Corpus Workshop (WAC-X) =
    24
     
    810'''[#cfp The Call for Papers is out!]'''
    911
     12
    1013== WAC-X main workshop ==
     14
    1115The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., assessment of corpus composition, sampling strategies and their relation to crawling algorithms, and handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleaning and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/) and especially the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in the compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, and Corpus Linguistics).
    1216
    1317WAC-X will also feature the final workshop of the EmpiriST 2015 shared task "Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media" (see https://sites.google.com/site/empirist2015/ for details) and the panel discussion "Corpora, open science, and copyright reforms" (see https://www.sigwac.org.uk/wiki/WAC-X#paneldisc for details).
    1418
    15 == Organizers ==
     19=== Organizers ===
    1620
    1721* [http://cs.unb.ca/~ccook1/ Paul Cook (University of New Brunswick)]
     
    2731* 12 August 2016: Workshop Date
    2832
    29 == Call for Papers == #cfp
     33=== Call for Papers === #cfp
    3034
    3135As in previous years, the 10th Web as Corpus workshop (WAC-X) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to
     
    98102
    99103
     104== Co-located events ==
    100105
    101 == EmpiriST 2015 shared task == #empirist
     106
     107=== EmpiriST 2015 shared task === #empirist
    102108
    103109The [https://sites.google.com/site/empirist2015/ EmpiriST 2015 shared task] aims to encourage the developers of NLP applications to adapt their tools and resources to the processing of German discourse in genres of computer-mediated communication (CMC), including both dialogical (chat, SMS, social networks, etc.) and monological (web pages, blogs, etc.) texts. Since there has been relatively little work in this area for German so far, the shared task focuses on tokenization and part-of-speech tagging as the core annotation steps required by virtually all NLP applications. While we have a particular interest in robust tools that can be applied to dialogical CMC and web corpora alike, participants are allowed to use different systems for the two subsets or submit results for one subset only.
     
    106112The final workshop of EmpiriST 2015 will be co-located with WAC-X. It will include a detailed presentation of the task and results, a poster session with all participating systems, oral presentations of selected systems, and a plenary discussion about the challenges of CMC in general as well as German CMC genres in particular.
    107113
    108 == Panel discussion "Corpora, open science, and copyright reforms" == #paneldisc
     114=== Panel discussion "Corpora, open science, and copyright reforms" === #paneldisc
    109115
    110116As part of the 10th Web as Corpus workshop (WAC-X), a panel discussion will be organized. Web corpus designers are probably those who are most affected by issues and uncertainties of copyright legislation and intellectual property rights, especially in the EU. While in some countries, such as the U.S., a Fair Use doctrine allows the use of data for non-commercial research purposes, the situation in Europe is more problematic. For example, German copyright law ("Urheberrecht") requires that any re-use of a work which reaches a certain threshold of creativity be explicitly approved by the author. This poses numerous problems for any corpus creator, and obtaining such approval is completely infeasible for large web corpora containing texts written by millions of different authors. Thus, corpora are re-distributed in crippled form as sentence shuffles (e.g. COW and the Leipzig Corpora Collection), and it is not even clear whether there really is a reliable legal exemption for single sentences. In the famous Infopaq case, the European Court of Justice, ruling on a referral from a Danish court, decided that even snippets of 11 words might be protected under EU copyright laws (http://bit.ly/1GYTDjR).