= ACL SIGWAC home page =

The Special Interest Group of the [http://www.aclweb.org/ Association for Computational Linguistics (ACL)] on '''Web as Corpus'''.

== Objectives ==

 * to promote interest in the use of the web as a source of linguistic data, and as an object of study in its own right;
 * to provide members of the ACL with a special interest in the web-as-corpus with a means of exchanging news of recent research developments and other matters of interest;
 * to sponsor meetings and workshops on the web as corpus that appear to be timely and worthwhile.

== Meetings ==

 * [http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html WAC1, at Corpus Linguistics conference, Birmingham, UK, July 2005]
 * [http://sslmit.unibo.it/~baroni/web_as_corpus_eacl06.html WAC2, at EACL, Trento, Italy, April 2006]
 * [http://cental.fltr.ucl.ac.be/wac3 WAC3, Louvain-la-Neuve, Belgium, 15-16 September 2007]
 * [http://webascorpus.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_40_WAC-4___lb__2008__rb__ WAC4 at LREC, Marrakech, Morocco, 1 June 2008]
 * WAC5 is scheduled for 8 September 2009, San Sebastian, Spain

We invite papers on various topics concerning the use of Web resources for corpus research and NLP applications, including (but not limited to) the following:

 * linguistic Web crawler technology and Web corpus collection projects
 * applications of Web-derived corpora and other kinds of Web data
 * how far does the “easy way” get you? (using search engines, or Google's n-gram lists; we are particularly interested in a critical discussion of the usefulness and limitations of such approaches)
 * methods and tools for “cleaning” Web pages to turn them into a corpus (contributors to this topic will be encouraged to participate in the second CLEANEVAL competition to be held in 2009)
 * automatic linguistic annotation of Web data: tokenisation, POS tagging, lemmatisation, semantic tagging, etc. (established tools often perform very poorly on Web data)
 * search engine architectures for linguists: bringing linguistics to commercial search engines, or high-performance search technology to linguistics?
 * search engine-related topics such as result ranking (e.g. how to identify “typical” uses rather than returning 50 very similar matches on the first page)
 * duplicate detection, interactive query refinement, etc.
 * reviews and clever uses of search engine APIs (Google, Yahoo, Altavista, and in particular Microsoft's current generous LiveSearch API)

== Activities ==

 * [http://cleaneval.sigwac.org.uk/ CLEANEVAL], a competition for cleaning webpages
 * Mailing list:
   * sign up [http://devel.sslmit.unibo.it/mailman/listinfo/sigwac here]
   * address to send mail to: [mailto:sigwac@sslmit.unibo.it sigwac@sslmit.unibo.it]

== Officers ==

 * Chair: [http://www.comp.leeds.ac.uk/ssharoff/ Serge Sharoff]
 * Secretary: [http://clic.cimec.unitn.it/marco/ Marco Baroni]

Constitution [attachment:wiki:WikiStart:constitution.txt?format=raw here].

== Useful resources ==

 * [http://webascorpus.sf.net/ Stefan Evert's WAC website]
 * [http://webascorpus.org/ Bill Fletcher's WAC website]
 * [http://www.sketchengine.co.uk/ Web corpora on Sketchengine]
 * [http://corpus.leeds.ac.uk/internet.html Web corpora on CTS website]
 * [http://wacky.sslmit.unibo.it/ WACKY in Forli]
 * [http://purl.org/net/webgenres A wiki on webgenres]