Changes between Version 13 and Version 14 of WAC-XI


Timestamp:
02/13/17 15:45:57 (7 years ago)
Author:
Roland Schäfer
Comment:

--

  • WAC-XI

    v13 v14  
    99Contact: `wacxi2017@gmail.com`
    1010
    11 === Organizers ===
     11== Organizers ==
    1212
    1313* Adrien Barbaresi (BBAW Berlin/ÖAW Wien)
     
    1515* [http://rolandschaefer.net Roland Schäfer (Freie Universität Berlin)]
    1616
    17 == Workshop description ==
     17
     18== Workshop description, call for papers, and details ==
     19
     20=== Workshop description ===
    1821
    1922For almost a decade, the ACL SIGWAC, and most notably the Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and use of web-derived corpora as well as computer-mediated communication. Past workshops were co-located with major conferences on corpus linguistics and/or computational linguistics (such as ACL, EACL, Corpus Linguistics, LREC, NAACL, WWW). The eleventh Web as Corpus workshop (WAC-XI) emphasises the linguistic aspects of web corpus research more than the technological aspects while keeping in mind that the two are inseparable.
    2023
    2124The World Wide Web has become increasingly popular as a source of linguistic evidence, not only within the computational linguistics community, but also with theoretical linguists facing problems such as data sparseness or the lack of variation in traditional corpora of written language. Accordingly, web corpora continue to gain relevance, given their size and diversity in terms of genres and text types. In lexicography, web data have become a major and well-established resource with dedicated research data and an environment such as the !SketchEngine. In other areas of linguistics, the adoption rate of web corpora has been slower but steady. Furthermore, some areas of research dealing exclusively with web (or similar) data have emerged, such as the construction and exploitation of corpora based on short messages. Another example is the (manual or automatic) classification of web texts by genre, register, or – more generally speaking – text type, as well as topic area. Similarly, the areas of corpus evaluation and corpus comparison have been advanced greatly through the rise of web corpora, mostly because web corpora (especially larger ones in the region of several billions of tokens) are often created by downloading texts from the web unselectively with respect to their text type or content. While the composition (or stratification) of such corpora cannot be determined before their construction, it is desirable to evaluate it afterwards, at least. Also, comparing web corpora to corpora that have been compiled in a traditional way is key in determining the quality of web corpora with respect to a given research question.
     25
     26=== Call for papers === #cfp
    2227
    2328The eleventh Web as Corpus workshop (WAC-XI) takes a (corpus) linguistic look at the state of the art in all these areas. More specifically, in linguistic publications presenting case studies based on web data, some authors explicitly discuss and/or defend the validity of web corpus data for a specific type of research question – while others simply take web corpora as a new or complementary source of data without discussing fundamental questions of data quality and appropriateness of web data for specific research questions. We think it is vital to discuss such fundamental questions, and therefore ask researchers to present and discuss
     
    3742
    3843
    39 == !CleanerEval first panel discussion ==
     44=== Submission website ===
     45
     46We will use EasyChair, URL tba.
     47
     48=== Submission format ===
     49
     50We call for extended abstracts of 1,000 to 1,500 words in length (excluding references, tables, and figures).
     51Submissions must be in PDF format. Authors of accepted papers will receive minimal formatting instructions for the publication of the abstracts on the WAC-XI website in due time.
     52There will be no proceedings volume, but a successful workshop might lead to a special issue/edited volume on web (and similar) data in linguistics, for which a separate call for (full) papers would be published after the workshop.
     53
     54
     55=== Important dates ===#dates
     56
     57* 13 February 2017: First Call for Workshop Papers
     58* 13 March 2017: Second Call for Workshop Papers
     59* 16 April 2017: Workshop Paper Due Date
     60* 5 June 2017: Notification of Acceptance
     61* 24 July 2017: Workshop Day
     62
     63
     64
     65=== !CleanerEval first panel discussion ===
    4066
    4167As part of the workshop and consistent with its general theme, we plan to organise a panel discussion as the first meeting of the !CleanerEval shared task on combined paragraph and document quality detection for (web) documents. The !CleanerEval shared task follows the successful !CleanEval shared task organised by SIGWAC in 2006. While !CleanEval focused specifically on boilerplate removal (the removal of automatically inserted and frequently repeated non-corpus material from web pages), !CleanerEval goes beyond this basic task. Participating systems should be able to determine the linguistic quality of paragraphs and whole documents automatically, such that corpus designers and/or users can decide whether or not to include them in their corpus. In the !CleanerEval setting, boilerplate paragraphs are paragraphs with low quality, but there might be other, non-boilerplate paragraphs with low quality as well. !CleanerEval was proposed by the organisers of WAC-XI during the final discussion of WAC-X, where the proposal was met with great interest. The WAC-XI panel discussion is intended to serve as a platform for developing the operationalisation of the notions of paragraph and document quality, the annotation guidelines, and the final schedule for the shared task. There can be no doubt that corpus linguists should define what counts as good corpus material and what does not. It would be misguided to treat this question as a purely technical one. The final meeting of the shared task is planned to be part of WAC-XII in 2018.
     
    6894* Arne Zeschel, Institut für Deutsche Sprache, Mannheim
    6995
    70 == Details ==
    7196
    72 === Important dates ===#dates
    73 
    74 * 13 February 2017: First Call for Workshop Papers
    75 * 13 March 2017: Second Call for Workshop Papers
    76 * 16 April 2017: Workshop Paper Due Date
    77 * 5 June 2017: Notification of Acceptance
    78 * 24 July 2017: Workshop Day
    79 
    80 === Call for papers === #cfp
    81 
    82 tba
    83 
    84 === Submission website ===
    85 
    86 We will use EasyChair, URL tba.
    87 
    88 === Submission format ===
    89 
    90 We call for extended abstracts of 1,000 to 1,500 words in length (excluding references, tables, and figures).
    91 Submissions must be in PDF format. Authors of accepted papers will receive minimal formatting instructions for the publication of the abstracts on the WAC-XI website in due time.
    92 There will be no proceedings volume, but a successful workshop might lead to a special issue/edited volume on web (and similar) data in linguistics, for which a separate call for (full) papers would be published after the workshop.