= ACL SIGWAC home page =
SIGWAC is the Special Interest Group of the [http://www.aclweb.org/ Association for Computational Linguistics (ACL)] on
'''Web as Corpus'''.

== Objectives ==
  * to promote interest in the use of the web as a source of linguistic data, and as an object of study in its own right;
  * to give ACL members who have a special interest in the web as corpus a means of exchanging news about recent research developments and other matters of interest;
  * to sponsor meetings and workshops on the web as corpus that appear to be timely and worthwhile. 

== Meetings ==
  * [http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html WAC1, at Corpus Linguistics conference, Birmingham, UK, July 2005]
  * [http://sslmit.unibo.it/~baroni/web_as_corpus_eacl06.html WAC2, at EACL, Trento, Italy, April 2006]
  * [http://cental.fltr.ucl.ac.be/wac3 WAC3, Louvain-la-Neuve, Belgium, 15-16 September 2007]
  * [http://webascorpus.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_40_WAC-4___lb__2008__rb__ WAC4, at LREC, Marrakech, Morocco, 1 June 2008]
  * WAC5 is scheduled for 8 September 2009, San Sebastian, Spain

We invite papers on various topics concerning the use of Web resources for corpus research and NLP applications, including (but not limited to) the following:

    * linguistic Web crawler technology and Web corpus collection projects
    * applications of Web-derived corpora and other kinds of Web data
    * how far does the “easy way” get you? (using search engines, or Google's n-gram lists; we are particularly interested in a critical discussion of the usefulness and limitations of such approaches)
    * methods and tools for “cleaning” Web pages to turn them into a corpus (contributors to this topic will be encouraged to participate in the second CLEANEVAL competition to be held in 2009)
    * automatic linguistic annotation of Web data: tokenisation, POS tagging, lemmatisation, semantic tagging, etc. (established tools often perform very poorly on Web data)
    * search engine architectures for linguists: bringing linguistics to commercial search engines, or high-performance search technology to linguistics?
    * search engine-related topics such as result ranking (e.g. how to identify “typical” uses rather than returning 50 very similar matches on the first page), duplicate detection, interactive query refinement, etc.
    * reviews and clever uses of search engine APIs (Google, Yahoo, AltaVista, and in particular Microsoft's currently generous Live Search API)

== Activities ==
  * [http://cleaneval.sigwac.org.uk/ CLEANEVAL], a competition for cleaning webpages
  * Mailing list:
    * sign up [http://devel.sslmit.unibo.it/mailman/listinfo/sigwac here]
    * send mail to the list at: [mailto:sigwac@sslmit.unibo.it sigwac@sslmit.unibo.it]

== Officers ==
  * Chair: [http://www.comp.leeds.ac.uk/ssharoff/ Serge Sharoff]
  * Secretary: [http://clic.cimec.unitn.it/marco/ Marco Baroni]

The SIGWAC constitution is available [attachment:wiki:WikiStart:constitution.txt?format=raw here].

== Useful resources ==
  * [http://webascorpus.sf.net/ Stefan Evert's WAC website]
  * [http://webascorpus.org/ Bill Fletcher's WAC website]
  * [http://www.sketchengine.co.uk/ Web corpora on Sketchengine]
  * [http://corpus.leeds.ac.uk/internet.html Web corpora on CTS website]
  * [http://wacky.sslmit.unibo.it/ WaCky in Forlì]
  * [http://purl.org/net/webgenres A wiki on webgenres]