Developing and completing language resources: The Big Brother show as a modern speech corpus


The project is designed to last for five months, August-December 2007. Its goals are to digitize the videos, time-code the existing transcriptions, and transcribe the remaining untranscribed parts of the unfinished Big Brother corpus, and then to turn the transcribed, time-coded and annotated material into a complete corpus of the first Norwegian Big Brother TV show. Finally, the corpus will be presented in a user-friendly web interface.
The first series of the Norwegian Big Brother TV show will be a very important supplement to other speech corpora that are being planned and developed. It contains speech situations that are difficult or even impossible to capture in ordinary recording situations. The BB corpus will contain speech by people who interact in genuine ways, with a lot of accompanying emotional language. This contrasts sharply with other corpora, which usually contain speech by people in neutral conversations.
The project will take as its starting point a largely unfinished version of a corpus of the BB show, which needs further transcription, a task that can be performed within a few autumn months. The existing, unfinished corpus is further described in [...]. Part of the Big Brother Corpus is already available on the Internet in a web-based search interface, for research and development. However, a previous lack of funding means that only half of it is transcribed and digitized, and part of the transcribed material is not time-coded. With the present project, we plan to finish the transcription, digitization and time-coding, and to present the corpus in an advanced search interface for linguistic and sociolinguistic studies as well as for language-technological purposes. In a future project, we will mark sections of the corpus as examples of central emotions (described in 2.1). One trip to Copenhagen will be made to discuss possible issues related to this at CST.
This application will enable us to make existing material (the untranscribed material) available with the rest of the corpus. Transcribing it means building up new resources for researchers. A main project application to the Norwegian Research Council, to be sent next year, will have a better basis after this project. The project will be useful for the description of the Norwegian and Scandinavian languages, and for language technology, artificial intelligence and psychology. The result will be made available for a future Norsk Språkbank (Norwegian Language Bank).

Project leader: Janne M. Bondi Johannessen

Started: 2007

Ends: 2008

Category: Universities

Sector: Higher education sector (UoH-sektor)

Budget: 470000

Institution: Institutt for lingvistiske og nordiske studier

Address: Oslo