Spoken Portuguese - Geographical and Social Varieties
A new version of this corpus is freely available at ELRA's Catalog. It consists of audio files in WAV format, aligned transcriptions in XML EXMARaLDA format and transcriptions in plain text and HTML formats. The plain text files also have automatically assigned PoS-tag information.
This resource has been assigned the International Standard Language Resource Number (ISLRN) 969-074-010-182-2.
More information can be found at www.islrn.org.
(1995-1997 - European Commission DGXXII, Programme LINGUA/SOCRATES)
The project has concluded and the materials have been published on CD-ROM, with the exclusive publishing support of Instituto Camões, under the title Português Falado - Documentos Autênticos: Gravações áudio com transcrição alinhada. Distribution outside Portugal is handled by Instituto Camões and within Portugal by CLUL. From the original project, a corpus of samples of the Portuguese varieties spoken in Portugal, Brazil, the African countries with Portuguese as their official language, and Macao was derived. The published materials also include samples of the Portuguese spoken in Goa and in East-Timor, collected later. These speech samples, recorded in various places, situations and periods, are accompanied by the corresponding aligned orthographic transcriptions.
The four published CD-ROMs contain a spoken Portuguese corpus, with sound aligned to orthographic transcription, collected from sociolinguistically diverse speakers who have Portuguese as their mother tongue or as a second language. The corpus consists of informal conversations between acquaintances, friends or relatives, as well as formal speech events such as radio programs or conferences. In a total of 86 recordings, the texts exemplify the Portuguese spoken in Portugal (30), in Brazil (20), in the African countries with Portuguese as their official language, namely Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe (5 each), in Macao (5), in Goa (3) and in East-Timor (3), corresponding to 8h44m of recording and 91,966 tokens. The recordings span the period from 1970 to 2001, and approximately 70% of them date from the last decade of that period.
These samples of Portuguese varieties are distributed across the four CD-ROMs as follows:
- Portugal (recordings from the nineties);
- Portugal (recordings from the seventies and the eighties), Macao, Sao Tome and Principe;
- Angola, Cape Verde, Guinea-Bissau and Mozambique;
- Brazil and Goa.
Finally, 94 speakers appear in the recordings. Their characteristics (origin, sex, age, professional status, level of education) are given in the header of each transcription, which also records the place, date and situation in which the recording was made, as well as other relevant information.
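The figures quoted in this description can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative, and the derived averages are rough indicators rather than published statistics.

```python
# Recordings per variety, as stated in the corpus description.
recordings = {
    "Portugal": 30,
    "Brazil": 20,
    "Angola": 5,
    "Cape Verde": 5,
    "Guinea-Bissau": 5,
    "Mozambique": 5,
    "Sao Tome and Principe": 5,
    "Macao": 5,
    "Goa": 3,
    "East-Timor": 3,
}

total_recordings = sum(recordings.values())

# Total duration (8h44m) and token count (91,966), also from the description.
duration_min = 8 * 60 + 44
tokens = 91_966

# Derived averages (illustrative only, not figures given in the source).
avg_minutes_per_recording = duration_min / total_recordings
avg_tokens_per_minute = tokens / duration_min

print(f"recordings: {total_recordings}")
print(f"duration: {duration_min} min")
print(f"average length: {avg_minutes_per_recording:.1f} min per recording")
print(f"average rate: {avg_tokens_per_minute:.1f} tokens per minute")
```

The per-variety counts add up to the stated total of 86 recordings.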
Main Goals of the Publication
Authentic spoken texts are not commonly used in the teaching of Portuguese as a foreign language; instead, written texts are often used to imitate spontaneous speech, without success. Such artificial representations do not improve knowledge of the spoken language. To fill this gap, authentic texts, collected in real communicative situations with various types of speakers, have now been published on CD-ROM; all of these texts are samples of existing varieties and uses of spoken Portuguese.
The user can read the orthographic transcription while listening to the recording: each stretch of text is highlighted as it is spoken. The user can listen to the whole document or select particular passages, and can repeat or skip parts of the text at will.
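The synchronized highlighting described here relies on each stretch of transcription being aligned to a time interval in the audio. As a minimal sketch, and not a description of how the CD-ROM software is implemented, the lookup from a playback time to the segment to highlight could work as follows. The segment times and texts are invented for illustration, and `segment_at` is a hypothetical helper.

```python
from bisect import bisect_right

# Hypothetical aligned segments: (start_seconds, end_seconds, text).
# Times and words are invented; they are not taken from the corpus.
segments = [
    (0.0, 2.5, "bom dia"),
    (2.5, 5.0, "como está"),
    (6.0, 9.0, "muito bem obrigado"),
]

# Segment start times, sorted, so a binary search can locate a segment.
starts = [start for start, _end, _text in segments]


def segment_at(t):
    """Return the index of the segment spoken at time t, or None in a gap."""
    i = bisect_right(starts, t) - 1  # last segment starting at or before t
    if i >= 0 and t < segments[i][1]:
        return i
    return None


# Example: find which segment to highlight at a few playback positions.
for t in (1.0, 2.5, 5.5, 7.0):
    i = segment_at(t)
    label = segments[i][2] if i is not None else "(silence)"
    print(f"{t:>4.1f}s -> {label}")
```

Selecting, repeating or skipping a passage then amounts to moving the playback position to the start time of the chosen segment.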
The orthographic transcription, besides improving the understanding of spoken texts, provides a consistent basis for the study of morphophonological, lexical, syntactic and discursive aspects of contemporary spoken Portuguese.
The main goal of producing these CD-ROMs was to help improve the comprehension (and production) skills of advanced and higher-level students of Portuguese as a second language. The materials are presented in a way that favors self-directed learning.
It is also worth mentioning that, since this corpus was not collected with a specific user profile in mind, it will be useful not only for students and teachers but also for researchers, translators and interpreters, among others, who will be able to select and analyze the materials according to their own particular aims.