Changes between Version 6 and Version 7 of WAC-X


Timestamp:
01/17/16 19:47:04 (8 years ago)
Author:
Roland Schäfer

  • WAC-X

    v6 v7  
    1 = 10th Web as Corpus Workshop (WAC-X) and EmpiriST Shared Task =
     1= 10th Web as Corpus Workshop (WAC-X) =
    22
    3 We are happy to announce that WAC-X will be co-located with ACL 2016 in Berlin. More information and a call for papers will be published in due time. There will be a tightly packed one-day schedule with the main workshop, the EmpiriST shared task final workshop, and a panel discussion.
     3'''featuring the EmpiriST Shared Task'''[[BR]]
     4August 12, 2016, Berlin / co-located with ACL 2016[[BR]]
     5Endorsed by the Special Interest Group of the ACL on Web as Corpus (SIGWAC)
    46
    5 == Details ==
     7'''[#cfp 1st Call for Papers is out!]'''
    68
    7 === Organizers ===
     9== WAC-X main workshop ==
     10
     11The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus. For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, Corpus Linguistics).
     12
     13== Organizers ==
    814
    915* [http://cs.unb.ca/~ccook1/ Paul Cook (University of New Brunswick)]
     
    1218* [http://iiegn.eu/work Egon Stemle (European Academy of Bozen/Bolzano)]
    1319
     20== 1st Call for Papers == #cfp
    1421
    15 === Program committee (preliminary) ===
     22As in previous years, the 10th Web as Corpus workshop (WAC-X) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to
     23
     24* data collection (both for large web corpora and smaller custom web corpora)
     25* cleaning/handling of noise
     26* duplicate removal/document filtering
     27* linguistic post-processing (including non-standard data)
     28* automatic generation of meta data (including register, genre, etc.)
     29* corpus evaluation (quality of text and annotations, comparison to other corpora, etc.)
     30
     31Furthermore, aspects of usability and availability of web-derived corpora are highly relevant in the context of WAC-X
     32
     33* development of interfaces
     34* visualization techniques
     35* tools for statistical analysis of very large (e.g., web-derived) corpora
     36* long-term archiving
     37* documentation and standardization
     38* legal issues
     39
     40Finally, reports of the use of web corpora in language technology and linguistics are welcome, for example
     41information extraction & opinion mining
     42
     43* language modeling, distributional semantics
     44* machine translation
     45* linguistic studies of web-specific forms of communication
     46* linguistic studies of rare phenomena
     47* web-specific lexicography, grammaticography, and language documentation
     48
     49=== Submission format ===
     50
     51All submissions must be in PDF format and should follow the ACL 2015 style guidelines. We strongly recommend the use of the ACL 2015 LaTeX style files or Microsoft Word Style files. We reserve the right to reject submissions that do not conform to these styles including font and page size restrictions.
     52
     53* [http://acl2015.org/files/acl2015.pdf General instructions (PDF)]
     54* LaTeX: [http://acl2015.org/files/acl.bst BST], [http://acl2015.org/files/acl2015.sty STY], [http://acl2015.org/files/acl2015.tex TEX]
     55* MS Word: [http://acl2015.org/files/acl2015.dot DOT]
     56
     57Full paper submissions may consist of up to eight (8) pages of content plus any number of pages consisting of only references. Short papers may consist of up to four (4) pages of content plus any number of pages consisting of only references. Full papers will be distinguished from short papers in the proceedings.
     58
     59Papers will be presented either orally or as posters at the workshop. There will be no distinction between papers presented orally and those presented as posters in the proceedings.
     60
     61Reviewing of papers will be double-blind. Therefore, the paper must not include the authors' names and affiliations. Furthermore, self-references that reveal the author's identity, e.g., "We previously showed (Smith, 1991) ...", must be avoided. Instead, use citations such as "Smith (1991) previously showed ...". Papers not conforming to these requirements will be rejected without review.
     62
     63=== Important dates ===
     64
     65* 8 May 2016: Workshop Paper Due date (23:59 GMT-12)
     66* 5 June 2016: Notification of Acceptance
     67* 22 June 2016: Camera-ready papers due
     68* 12 August 2016: Workshop Date
     69
     70
     71=== Program committee ===
    1672
    1773The workshop organizers plus:
     
    4096
    4197
    42 == WAC-X main workshop ==
    43 
    44 The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities, but also with theoretical linguists facing problems of data sparseness or data diversity. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. The field is still new, though, and a number of issues in web corpus construction need much additional research, both fundamental and applied. These issues range from questions of corpus design (e.g., corpus composition assessment, sampling strategies and their relation to crawling algorithms, handling of duplicated material) to more technical aspects (e.g., efficient implementation of individual post-processing steps in document cleansing and linguistic annotation, or large-scale parallelization to achieve web-scale corpus construction). Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only recently shifted into focus.
    45 
    46 For almost a decade, the ACL SIGWAC (http://www.sigwac.org.uk/), and especially the highly successful Web as Corpus (WAC) workshops have served as a platform for researchers interested in compilation, processing and application of web-derived corpora. Past workshops were co-located with major conferences on computational linguistics and/or corpus linguistics (such as EACL, NAACL, LREC, WWW, Corpus Linguistics). As in previous years, the 10th Web as Corpus workshop (WAC-X) invites contributions pertaining to all aspects of web corpus creation, including but not restricted to
    47 
    48 * data collection (both for large web corpora and smaller custom web corpora)
    49 * cleaning/handling of noise
    50 * duplicate removal/document filtering
    51 * linguistic post-processing (including non-standard data)
    52 * automatic generation of meta data (including register, genre, etc.)
    53 * corpus evaluation (quality of text and annotations, comparison to other corpora, etc.)
    54 
    55 Furthermore, aspects of usability and availability of web-derived corpora are highly relevant in the context of WAC-X
    56 
    57 * development of interfaces
    58 * visualization techniques
    59 * tools for statistical analysis of very large (e.g., web-derived) corpora
    60 * long-term archiving
    61 * documentation and standardization
    62 * legal issues
    63 
    64 Finally, reports of the use of web corpora in language technology and linguistics are welcome, for example
    65 information extraction & opinion mining
    66 
    67 * language modeling, distributional semantics
    68 * machine translation
    69 * linguistic studies of web-specific forms of communication
    70 * linguistic studies of rare phenomena
    71 * web-specific lexicography, grammaticography, and language documentation
    7298
    7399== EmpiriST 2015 shared task ==
     
    82108As part of the 10th Web as Corpus workshop (WAC-X), a panel discussion will be organized. Web corpus designers are probably those who are most affected by issues and uncertainties of copyright legislation and intellectual property rights, especially in the EU. While in some countries, such as the U.S., a Fair Use doctrine allows the use of data for non-commercial research purposes, the situation in Europe is more problematic. For example, German copyright law ("Urheberrecht") requires that any re-use of a work which reaches a certain threshold of creativity be explicitly approved by the author. This poses numerous problems for any corpus creator, but it is completely infeasible for large web corpora containing texts written by millions of different authors. Thus, corpora are re-distributed in crippled form as sentence shuffles (e.g. COW and the Leipzig Corpora Collection), and it is not even clear whether there really is a reliable legal exemption for single sentences. In the famous Infopaq case, a Danish court decided that even snippets of 11 words might be protected under EU copyright laws (http://bit.ly/1GYTDjR).
    83109
    84 This situation is highly undesirable. Large web corpora have been shown to be indispensable for many tasks in computational linguistics, in the documentation of standard and non-standard language, and in empirically oriented theoretical linguistics.
     110This situation is highly unsatisfactory. Large web corpora have been shown to be indispensable for many tasks in computational linguistics, in the documentation of standard and non-standard language, and in empirically oriented theoretical linguistics.
    85111
    86112Reports written by legal experts – such as the one recently commissioned by the German Research Council (http://bit.ly/1PG4Gq6) – only provide an interpretation of the given legal situation. Only active lobbying in favor of a reasonable copyright reform will eventually bring about the necessary changes such that researchers can build corpus resources and share them freely for academic purposes. Therefore, the goal of this panel discussion is to bring together corpus creators, active users of web corpora, and open science activists in order to share and discuss views on the copyright problem as a political rather than a legal problem. Ideally, a first draft of a joint declaration might come out of this discussion. With such a declaration, the (web) corpus community could make sure that its voice is heard, especially in the ongoing discussion about reforms of the European copyright legislation.
    87