Andrew Brindle
Thug breaks man's jaw: A Corpus Analysis of Responses to Interpersonal Street Violence
Abstract: A great deal of what is bad in the world, from genocide to interpersonal violence, is the product of men and their masculinities (DeKeseredy and Schwartz, 2005). Work by criminologists such as Anderson (1990) has argued that instances of interpersonal violence originate from strongly held values in the construction and defence of personal street status, and that violence is a tool for both the formation and the protection of self-image. Furthermore, Messerschmidt (2004) writes that among certain men violence is a core component of masculinity and a means of proving one’s manhood. However, Winlow (2001) considers that street and pub fights function as a means for working-class men to actualize a masculine identity due to the loss of traditional industrial job opportunities in a post-modern society. Clearly, violence is one means by which certain men live up to the ideals of hegemonic masculinity; such practices may be learned through interactions with particular peer groups, or virtual peer groups.
This paper examines a corpus constructed of online responses to an article in an online edition of the British tabloid newspaper The Sun describing an act of interpersonal street violence between two men. The report describes how a man, in an unprovoked attack, left another man unconscious in a street after breaking his jaw. The article produced 190 responses from readers, the majority of whom either through avatars or online names indicated that they were male. The responses were collected and compiled into a corpus containing 6,606 tokens. This was then analysed using the WordSmith Tools software package. Taking a corpus-based approach, the data was analysed by undertaking concordance analyses of keywords and collocates of those words.
The findings of the study of keyword collocates and concordance lines indicate that regardless of the negative depiction of the aggressor in the online article, the assailant and his actions were defended, and at times admired and praised, while the victim was criticized for his lack of fighting skills, and not considered as innocent. However, the findings also provide data revealing that other respondents reject such actions, clearly demonstrating that multiple constructs of masculine identity exist among the tabloid readership who responded to the article.
The paper concludes by discussing the hypothesis that masculine identity and specifically hegemonic masculinity is constructed of multiple identities, and rejecting the notion that violence is a response to the destabilizing effects of post-modernism, while arguing that interpersonal violence is a means by which certain men express and validate masculinity. Furthermore, the importance of investigating and analysing online peer groups is emphasised as an invaluable source in comprehending aspects of social behaviour within contemporary society.
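The concordance and collocate analysis described in this abstract was carried out with WordSmith Tools; the following is only a minimal Python sketch of the same general idea, not the author's procedure. The sample text, the function names and the window size are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, span=3):
    """Return keyword-in-context lines as (left, node, right) tuples."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, span=3):
    """Count the words occurring within `span` tokens of each node hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


# Invented toy text, NOT taken from the corpus described above.
SAMPLE = ("The victim should learn to fight. The attacker was a coward. "
          "The victim did nothing wrong.")
TOKENS = tokenize(SAMPLE)

for left, node, right in concordance(TOKENS, "victim", span=2):
    print(f"{left:>12} [{node}] {right}")
print(collocates(TOKENS, "victim", span=2).most_common(3))
```

A real study would normally rank collocates with an association measure rather than raw counts, and would compute keywords against a reference corpus.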

Jesse Egbert and Douglas Biber
Developing a User-based Method of Web Register Classification
Abstract: This paper introduces a new grant-funded initiative to develop a comprehensive linguistic taxonomy of English web registers. We begin the talk with an overview of the goals, methods, and current status of the project. The main focus, however, is a detailed discussion of the methods used to develop a user-based register classification rubric, and a presentation of the results obtained to date in coding a large corpus of web documents for their register categories.

Adam Kilgarriff and Vít Suchomel
Web Spam
Abstract: Web spamming 'refers to actions intended to mislead search engines into ranking some pages higher than they deserve'. Web spam is a problem for web corpus builders because it is quite like the material we want to gather, but we do not want it. It is on the increase: when we compare two corpora gathered using the same methods in 2008 and 2012, enTenTen08 and enTenTen12, the web spam in the later one is a striking difference. In this paper we first review some relevant literature, and then identify some characteristics of web spam that we have noted, and suggest corresponding strategies for distinguishing it from good text.

Sarah Schulz, Verena Lyding and Lionel Nicolas
STirWaC - Compiling a diverse corpus based on texts from the web for South Tyrolean German
Abstract: In this paper we report on the creation of a web corpus for the variety of German spoken in South Tyrol.
We discuss how we tackled the particular challenge of finding a balance between data quantity and quality when the internet also provides texts of the neighboring varieties. Our aim was thus twofold: to achieve a high degree of representativeness of the texts with regard to the South Tyrolean variety, as well as high diversity concerning the included genres.
We present our procedure for selecting relevant texts and discuss an approach for detecting and filling gaps in the compiled web corpus.

Silke Scheible and Sabine Schulte Im Walde
A Compact but Linguistically Detailed Database for German Verb Subcategorisation relying on Dependency Parses from a Web Corpus
Abstract: Within the area of automatic lexical acquisition, the definition of lexical verb information has been a major focus, because verbs play a central role for the structure and meaning of sentences and discourse. We describe a novel resource of verb subcategorisation data obtained from a parsed German web corpus. While relying on a dependency parser, our extraction was based on a set of detailed guidelines to maximise the linguistic value of the subcategorisation information but nevertheless represent the data in a compact, flexible format. This abstract outlines our subcategorisation extractor and describes the format of the subcategorisation database, as well as actual and potential uses in computational linguistics.

Alexander Piperski, Vladimir Belikov, Nikolay Kopylov, Vladimir Selegey and Serge Sharoff
Big and diverse is beautiful: A large corpus of Russian to study linguistic variation
Abstract: The General Internet Corpus of Russian (GICR) is aimed at studying linguistic variation in present-day Russian available on the Web. In addition to traditional morphosyntactic annotation, the corpus will be richly annotated with metadata aimed at sociolinguistic research on language variation, including regional, gender, age, and genre variation. The sources of metadata include explicit information available about the author in his/her profile, information coming from IP or URL, as well as machine learning from textual features.

Adriano Ferraresi and Silvia Bernardini
The academic Web-as-Corpus
Abstract: This paper presents "acWaC" (an acronym for "academic Web-as-Corpus"), a 10-million word corpus of web pages in English crawled from the websites of European universities and annotated with rich contextual metadata. It introduces the pipeline that was followed to build the corpus, which can be easily replicated for monitoring purposes or to extend the corpus for inclusion of texts from academic institutions worldwide. It then presents a case study in which modal verbs, taken as a signal of universities' stance and engagement strategies, are compared across texts produced by universities based in countries where English is a native vs. non-native language. The paper concludes by presenting on-going work focusing on the construction of "genre"-restricted sub-corpora by means of simple Reg-Ex based strategies focusing on URL syntax.
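The abstract does not specify the regular expressions used to build the "genre"-restricted sub-corpora. As a rough sketch of what a URL-syntax strategy could look like, the Python below assigns a page to a sub-corpus by matching its URL path; the genre labels, the patterns and the example URLs are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns: each matches a whole path segment of the URL.
GENRE_PATTERNS = {
    "news": re.compile(r"/(news|press|media)(/|$)", re.IGNORECASE),
    "courses": re.compile(r"/(courses?|programmes?|modules?)(/|$)", re.IGNORECASE),
    "people": re.compile(r"/(staff|people)(/|$)", re.IGNORECASE),
}


def guess_genre(url):
    """Return the first genre whose pattern matches the URL path, else None."""
    path = urlparse(url).path
    for genre, pattern in GENRE_PATTERNS.items():
        if pattern.search(path):
            return genre
    return None


for url in ("http://www.example.ac.uk/news/2013/item.html",
            "http://www.example.edu/study/courses/linguistics",
            "http://www.example.edu/about"):
    print(url, "->", guess_genre(url))
```

Matching on the path only, rather than on the full URL, keeps host names such as a "news." subdomain from triggering a pattern by accident; whether that is desirable depends on the sites being crawled.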

Akshay Minocha, Siva Reddy and Adam Kilgarriff
Feed Corpus: An Ever Growing Up-to-date Corpus
Abstract: Languages constantly evolve, and corpus analysis tools and techniques need to keep pace with the changes. In this paper we propose a novel method for collecting dynamic corpora that are ever growing and up-to-date with the language (Jeremy, 1987). We make use of social media to discover sources of the latest content. We keep track of the latest content from dynamic content sources such as blogs and news websites. Most of these websites provide a short summary of their content changes on a separate page known as a feed.
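To illustrate how a feed exposes newly published content, here is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the sample RSS document, the function name and the in-memory set of seen links are assumptions, and a real collector would fetch feeds over the network and persist its state.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 feed standing in for a real blog or news feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example blog</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def new_links(feed_xml, seen):
    """Return item links not seen before and record them in `seen`."""
    root = ET.fromstring(feed_xml)
    links = [item.findtext("link") for item in root.iter("item")]
    fresh = [url for url in links if url and url not in seen]
    seen.update(fresh)
    return fresh


seen_links = set()
print(new_links(SAMPLE_FEED, seen_links))  # both links on the first poll
print(new_links(SAMPLE_FEED, seen_links))  # nothing new on the second poll
```

Polling the same feed repeatedly and downloading only the fresh links is what lets such a corpus keep growing without re-collecting old pages.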

Stephen Wattam, Paul Rayson and Damon Berridge
LWAC: Longitudinal Web-as-Corpus Sampling
Abstract: As the web develops, issues surrounding network and content stability increasingly affect sampling of web data. Those aiming to investigate the impact that network-based effects such as link rot have upon language content are currently poorly served by linguistic search engines such as WebCorp, which attempt to produce language samples more comparable to offline corpora.
We present here an open-source tool, LWAC, for formal longitudinal sampling of URI lists, designed to download portions of the web in a fast, parallel manner that imitates end users. LWAC is designed to run on commodity hardware and provide a high-performance method of corpus construction for investigating both language change online (in a conventional manner) and epistemic issues in the web-as-corpus field.

Roland Schäfer, Adrien Barbaresi and Felix Bildhauer
The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction
Abstract: Crawled raw data for Web corpus construction contains a lot of documents which are in the target language, but which fail as a text. Documents just containing tag clouds, lists of names or products, etc. need to be removed. There needs to be a criterion by which we classify documents into those containing linguistically relevant text and those which do not. However, many documents as found on the Web contain a mix of text and non-text material. In this paper, we ask how well humans can decide which Web documents are good (as a text) and which are not. We present a data set of 1,000 documents from a large English Web crawl, rated independently by three human raters, two of them corpus designers. Based on the low inter-rater agreement, we suggest an unsupervised approach to text detection which does not involve difficult and ultimately arbitrary design decisions.

David Lutz, Parry Cadwallader and Mats Rooth
A web application for filtering and annotating web speech data
Abstract: A vast amount of recorded speech is freely available on the web, in the form of podcasts, radio broadcasts, and posts on media-sharing sites. However, finding specific words or phrases in online speech data remains a challenge for researchers, not least because transcripts of this data are often automatically generated and imperfect. We have developed a web application that addresses this challenge by allowing non-expert and potentially remote annotators to filter and annotate speech data collected from the web and produce large, high-quality data sets suitable for speech research. We have used this application to filter and annotate thousands of speech tokens, and active development continues.

Colleen Crangle
A web-based model of semantic relatedness and the analysis of electroencephalographic (EEG) data
Abstract: Recent studies of language and the brain have shown that models of semantics extracted from web-based corpora can predict brain activity as measured by functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), or electroencephalography (EEG). In Mitchell et al. (2008) the semantics of a word was represented by its distributional properties in the data set contributed by Google Inc. This data set consists of English word n-grams and their frequencies in an approximately 1-trillion-word set of web pages (Brants and Franz, 2006). For nouns referring to physical objects, co-occurrence patterns with 25 manually-selected sensory-motor verbs provided the semantic model. Taking 60 such nouns and their fMRI images, statistically significant predictions were made as to the semantic category (mammal or tool, for example) of the words in this set.
Since Mitchell, other web-based corpora and other ways of selecting semantic features have been investigated to see if they offered improved methods of predicting from brain data the word someone is seeing or hearing or otherwise attending to.
Murphy et al. (2012), for example, used a 16-billion-word set of English-language web-page documents as their corpus and point-wise mutual information (Turney, 2001) combined with co-occurrence frequencies to provide a semantic model. Pereira et al. (2010) used a large text corpus consisting of pertinent articles from Wikipedia and Latent Dirichlet allocation (LDA, Blei et al., 2003) to provide the semantic model. In Jelodar et al. (2010) we find WordNet (Fellbaum, 1998) used as a supplementary source of information to construct a semantic model. Several WordNet similarity measures were used to compute the similarity of each of the 60 nouns with each of the 25 sensory-motor verbs of Mitchell et al.
In this paper, I take a model of semantic relatedness extracted from the Web and examine the extent to which it corresponds to predictions made from EEG data about the relations between sets of words participants are attending to. Unlike previous work that looked at isolated word predictions, this work examines sets of words and the relations between them. The brain data are drawn from experiments in which statements about commonly known geographic facts of Europe were presented auditorily to participants who were asked to determine the truth or falsity of each statement while EEG recordings were made (Suppes et al., 1999; Suppes et al., 2009). The corpus is the Google Inc. data set and semantic relatedness is obtained from a point-wise mutual information measure.
Corpus-based models of semantics face the unavoidable evaluation question, namely how well distributional information extracted from a corpus matches the semantic knowledge of language users. Corpus-based studies of semantics and the brain potentially offer a new way to answer this question.
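For readers unfamiliar with point-wise mutual information, the standard definition is PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated from corpus counts. The Python sketch below computes it from raw frequencies; the counts are invented for illustration and are not drawn from the Google data set, and the exact variant of the measure used in the paper is not stated in the abstract.

```python
import math


def pmi(joint_count, count_x, count_y, total):
    """Point-wise mutual information (base 2) from raw corpus counts.

    joint_count: how often x and y co-occur
    count_x, count_y: how often x and y occur individually
    total: number of observations the counts are drawn from
    """
    p_xy = joint_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Invented counts: the pair co-occurs 8 times more often than chance predicts.
print(pmi(joint_count=8, count_x=100, count_y=100, total=10000))
# Invented counts: co-occurrence exactly at chance level, so PMI is zero.
print(pmi(joint_count=1, count_x=100, count_y=100, total=10000))
```

Positive values indicate that two words co-occur more often than independence would predict, which is what makes the measure usable as a proxy for semantic relatedness.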
Last modified on May 23, 2013, 12:45:26 PM