= ACL SIGWAC home page =

ACL SIGWAC is the Special Interest Group of the [http://www.aclweb.org/ Association for Computational Linguistics (ACL)] on '''Web as Corpus'''.

Join the SIG by [http://devel.sslmit.unibo.it/mailman/listinfo/sigwac signing up to the mailing list!]

The Special Interest Group on '''Web as Corpus''' aims to research the opportunities and limitations of using textual web data for

 1. performing linguistic research
 2. modelling knowledge of language
 3. modelling extralinguistic knowledge

== Objectives ==

 * To build a community around web-as-corpus research
 * To support and promote information exchange and the dissemination of results and best practices
 * To organize workshops, hackathons and shared tasks

Download the [attachment:wiki:WikiStart:constitution.txt?format=raw constitution of ACL SIGWAC].

== Officers ==

 * [https://nljubesi.github.io Nikola Ljubešić] (co-president)
 * [http://alpage.inria.fr/~sagot/ Benoît Sagot] (co-president)
 * [https://www.utu.fi/en/people/veronika-laippala Veronika Laippala] (co-secretary)
 * [https://portizs.eu Pedro Ortiz Suarez] (co-secretary)

== Resources ==

=== Corpora ===

 * [https://commoncrawl.org CommonCrawl]
 * [https://oscar-project.org OSCAR]
 * [https://paracrawl.eu ParaCrawl]
 * [https://macocu.eu MaCoCu]
 * [https://www.clarin.si/info/new-classla-web-corpora-and-tutorial-on-usage-of-the-corpora-via-clarin-si-concordancers/ CLASSLA South Slavic web corpora]
 * [http://sketch.juls.savba.sk/aranea_about/ Aranea web corpora]
 * [https://www.clarin.si/noske/wacs.cgi/ CLARIN.SI web corpora]
 * [http://corpus.leeds.ac.uk/internet.html University of Leeds (CTS) web corpora]
 * [http://www.sketchengine.co.uk/ Web corpora on Sketch Engine (commercial product)]
 * [https://wacky.sslmit.unibo.it/doku.php?id=start WaCKy corpora]

=== Technologies ===

 * [https://corpus.tools A Masaryk University and Lexical Computing list of tools for harvesting and processing web data]
 * [https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier The XGENRE multilingual text genre classifier]
 * [https://github.com/TurkuNLP/multilingual-register-labeling Massively Multilingual Modeling of Web Registers by TurkuNLP]

=== Additional information ===

 * [https://link.springer.com/book/10.1007/978-3-031-02152-7 Schäfer and Bildhauer's web corpus book]
 * [http://webascorpus.sf.net/ Stephanie Evert's WAC website]
 * [https://sigwac.org.uk/cleaneval CLEANEVAL], a competition for cleaning webpages

== Meetings ==

 * [wiki:WAC-XII] at [https://lrec2020.lrec-conf.org LREC 2020], Marseille, France, 16 May 2020… [[span(style=color: #FF0000, CANCELLED due to Covid-19 outbreak)]] but proceedings have been published!
 * [wiki:WAC-XI] at [http://www.birmingham.ac.uk/research/activity/corpus/events/2017/cl2017/index.aspx Corpus Linguistics 2017], Birmingham, UK, 24-27 July 2017
 * [wiki:WAC-X] at [http://acl2016.org/ ACL 2016], Berlin, Germany, 12 August 2016
 * [wiki:wac2015 WAC@eLex2015], at eLex, Herstmonceux Castle, UK, 10 August 2015
 * [wiki:WAC9], at [http://eacl2014.org/ EACL 2014], Gothenburg, Sweden, 26-27 April 2014
 * [wiki:WAC8], at [http://ucrel.lancs.ac.uk/cl2013/ Corpus Linguistics 2013], Lancaster, UK, 22 July 2013
 * [wiki:WAC7], at [http://www2012.wwwconference.org/ WWW12], Lyon, France, 17 April 2012
 * [http://www.limsi.fr/~pz/bucc2011-comparable-corpora/ BUCC, Building and Using Comparable Corpora, Portland, Oregon, 24 June 2011]; in 2011 we met at the BUCC workshop at [http://www.acl2011.org/ ACL2011]
 * [wiki:WAC6], at NAACL-HLT, Los Angeles, USA, 5 June 2010: programme [wiki:WAC6Programme here]
 * [wiki:WAC5], at SPLN, San Sebastian, Basque Country, Spain, 7 September 2009
 * [http://webascorpus.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_40_WAC-4___lb__2008__rb__ WAC4 at LREC, Marrakech, Morocco, 1 June 2008]
 * [http://cental.fltr.ucl.ac.be/wac3 WAC3, Louvain-la-Neuve, Belgium, 15-16 September 2007]
 * [http://sslmit.unibo.it/~baroni/web_as_corpus_eacl06.html WAC2, at EACL, Trento, Italy, April 2006]
 * [http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html WAC1, at Corpus Linguistics conference, Birmingham, UK, July 2005]

== ACL SIGWAC annual reports ==

 * [https://www.aclweb.org/adminwiki/index.php?title=2023Q3_Reports:_SIGWAC ACL SIGWAC 2023 Q3 report]
 * [https://www.aclweb.org/adminwiki/index.php?title=2021Q3_Reports:_SIGWAC ACL SIGWAC 2021 Q3 report]
 * [https://www.aclweb.org/adminwiki/index.php?title=2020Q3_Reports:_SIGWAC ACL SIGWAC 2020 Q3 report]
 * [https://www.aclweb.org/adminwiki/index.php?title=2019Q3_Reports:_SIGWAC ACL SIGWAC 2019 Q3 report]
 * [https://www.aclweb.org/adminwiki/index.php?title=2018Q3_Reports:_SIGWAC ACL SIGWAC 2018 Q3 report]
 * [http://aclweb.org/adminwiki/index.php?title=2016Q3_Reports:_SIGWAC ACL SIGWAC 2016 Q3 report]
 * [http://aclweb.org/adminwiki/index.php?title=2015Q3_Reports:_SIGWAC ACL SIGWAC 2015 Q3 report]
 * [http://aclweb.org/adminwiki/index.php?title=2014Q3_Reports:_SIGWAC ACL SIGWAC 2014 Q3 report]
 * [http://aclweb.org/adminwiki/index.php?title=2013Q3_Reports:_SIGWAC ACL SIGWAC 2013 Q3 report]
 * [http://aclweb.org/adminwiki/index.php?title=2012Q3_Reports:_SIGWAC ACL SIGWAC 2012 Q3 report]
 * [http://aclweb.org/adminwiki/index.php?title=Reports Older reports...]