Changes between Version 2 and Version 3 of WAC-XII


Timestamp:
12/05/19 03:19:00
Author:
Roland Schäfer
Comment:

--

  • WAC-XII

    v2 v3  
    22 22 == Workshop description ==
    23 23
    24 For almost fifteen years, the ACL SIGWAC, and most notably the Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, pro-cessing and use of web-derived corpora as well as computer-mediated communication. Past workshops were co-located with major conferences on corpus linguistics and/or computa-tional linguistics (such as ACL, EACL, Corpus Linguistics, LREC, NAACL, WWW).
       24 For almost fifteen years, the ACL SIGWAC, and most notably the Web as Corpus (WAC) workshops, have served as a platform for researchers interested in the compilation, processing and use of web-derived corpora as well as computer-mediated communication. Past workshops were co-located with major conferences on corpus linguistics and/or computational linguistics (such as ACL, EACL, Corpus Linguistics, LREC, NAACL, WWW).
    25 25
    26 In corpus/theoretical linguistics, the World Wide Web has become increasingly popular as a source of linguistic evidence, especially in the face of data sparseness or the lack of varia-tion in traditional corpora of written language. In lexicography, web data have become a major and well-established resource with dedicated research data and commercially available tools. In other areas of theoretical linguistics, the adoption rate of web corpora has been slower but steady. Furthermore, some completely new areas of linguistic research dealing exclusively with web (or similar) data have emerged, such as the construc-tion and utilisation of corpora based on short messages. Another example is the (manual or automatic) classification of web texts by genre, register, or – more generally speaking – “text type”, as well as topic area. In computational linguistics, web corpora have become an established source of data for the creation of language models, word embeddings, and for all types of machine learning.
       26 In corpus/theoretical linguistics, the World Wide Web has become increasingly popular as a source of linguistic evidence, especially in the face of data sparseness or the lack of variation in traditional corpora of written language. In lexicography, web data have become a major and well-established resource with dedicated research data and commercially available tools. In other areas of theoretical linguistics, the adoption rate of web corpora has been slower but steady. Furthermore, some completely new areas of linguistic research dealing exclusively with web (or similar) data have emerged, such as the construction and utilisation of corpora based on short messages. Another example is the (manual or automatic) classification of web texts by genre, register, or – more generally speaking – “text type”, as well as topic area. In computational linguistics, web corpora have become an established source of data for the creation of language models, word embeddings, and for all types of machine learning.
    27 27
    28 The twelfth Web as Corpus workshop (WAC-XII) looks at the past, present, and future of web corpora given the fact that large web corpora are nowadays provided mostly by a few major initiatives and/or companies, and the diversity of the early years appears to have fad-ed slightly. Also, we acknowledge the fact that alternative sources of data (such as data from Twitter and similar platforms) have emerged, some of them only available to large companies and their affiliates, such as linguistic data from social media and other forms of the deep web. At the same time, gathering interesting and/or relevant web data (web crawling) is becoming an ever more intricate task as the nature of the data offered on the web changes (for example the death of forums in favour of more closed platforms).
       28 The twelfth Web as Corpus workshop (WAC-XII) looks at the past, present, and future of web corpora given the fact that large web corpora are nowadays provided mostly by a few major initiatives and/or companies, and the diversity of the early years appears to have faded slightly. Also, we acknowledge the fact that alternative sources of data (such as data from Twitter and similar platforms) have emerged, some of them only available to large companies and their affiliates, such as linguistic data from social media and other forms of the deep web. At the same time, gathering interesting and/or relevant web data (web crawling) is becoming an ever more intricate task as the nature of the data offered on the web changes (for example the death of forums in favour of more closed platforms).
    29 29
    30 We intend WAC-XII to be a platform for the discussion of some fundamental issues in cur-rent web corpus construction. Some of the key issues that we see for the future of web cor-pora are:
       30 We intend WAC-XII to be a platform for the discussion of some fundamental issues in current web corpus construction. Some of the key issues that we see for the future of web corpora are:
    31 31
    32   * Can the requirements of all of the aforementioned groups of users (theoretical lin-guists, lexicographers, computational linguists, etc.) be met by the same type of web corpora, or should web corpora be tailored to the specific needs of different groups of users?
       32   * Can the requirements of all of the aforementioned groups of users (theoretical linguists, lexicographers, computational linguists, etc.) be met by the same type of web corpora, or should web corpora be tailored to the specific needs of different groups of users?
    33 33   * How has the composition of the web (and subsequently that of web corpora) changed? Are web data still as relevant and interesting as they were fifteen years ago?
    34 34   * What is the impact of changes in web data production (e.g., CMS and microtexts published on more restricted platforms), and how can it be addressed in the data collection process?
    35   * Is there still an interest in fundamental research on the linguistic nature and compo-sition of the web?
       35   * Is there still an interest in fundamental research on the linguistic nature and composition of the web?
    36 36   * What is the level of quality of web data relative to the abovementioned tasks to be performed with web data?
    37 37
     
    40 40 === Description ===
    41 41
    42 The twelfth Web as Corpus workshop (WAC-XII) aims to unite (web) corpus creators and all types of (web) corpus users from corpus/theoretical linguistics, computational linguistics, cognitive science, etc. We invite papers dealing with the fundamental questions mentioned above. In addition, we invite papers dealing with the whole range of applied and fundamen-tal topics from both corpus/theoretical linguistic and computational linguistics which have characterised WAC workshops, including but not limited to:
       42 The twelfth Web as Corpus workshop (WAC-XII) aims to unite (web) corpus creators and all types of (web) corpus users from corpus/theoretical linguistics, computational linguistics, cognitive science, etc. We invite papers dealing with the fundamental questions mentioned above. In addition, we invite papers dealing with the whole range of applied and fundamental topics from both corpus/theoretical linguistic and computational linguistics which have characterised WAC workshops, including but not limited to:
    43 43
    44 44   * data selection and collection (discovery and/or crawling)
     
    59 59 == Identify, Describe and Share your LRs! ==
    60 60
    61 Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiat-ed at LREC 2014 about “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility,  when submitting a paper, to upload LRs in a special LREC repository.  This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.
       61 Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility,  when submitting a paper, to upload LRs in a special LREC repository.  This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.
    62 62
    63 As scientific work requires accurate citations of referenced work so as to allow the commu-nity to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2020 endorses the need to uniquely Identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers  will be offered at submission time.
       63 As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2020 endorses the need to uniquely Identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers  will be offered at submission time.
    64 64
    65 65 == Proposed programme committee (to be confirmed) ==