Linguistic Resources for the Study of the African Varieties of Portuguese


Project description:

Given the extreme disparity, in the area of Language Resources (LR), between published studies of the European and Brazilian varieties of Portuguese and published studies of the African varieties, the project “Linguistic Resources for the Study of the African Varieties of Portuguese” aims to fill this gap by providing LR that will allow an objective description of the five African varieties of Portuguese.

This project aims to compile, process, analyse and make available for on-line query a corpus of the African varieties of Portuguese. The corpus contains 3 million words of written and spoken texts and consists of five comparable subcorpora of 600,000 words each, corresponding to the varieties of Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe.

Once the materials extracted from this corpus are available, authentic data will be accessible to researchers, teachers, students and authors of different materials (grammars, dictionaries, manuals). The data will be organised so that, for the first time, empirical descriptive studies of each of the Portuguese varieties mentioned above become possible.

This material will allow intra- and inter-corpus comparative studies of these Portuguese varieties. Such studies will make visible, on the one hand, variation that results from the discursive and pragmatic differences of each corpus and, on the other hand, aspects of linguistic unity or diversity that characterise the Portuguese spoken in the five African countries whose official language is Portuguese. The five corpora are comparable in size (600,000 words each), in chronology, and in types and genres (24,000 spoken words and c. 580,000 written words, the latter drawn from newspapers, literature and varia).

Some materials from the Reference Corpus of Contemporary Portuguese will be used, including part of the oral texts published jointly by Instituto Camões and Centro de Linguística da Universidade de Lisboa (Bacelar do Nascimento (coord.), Português Falado, Documentos Autênticos, Gravações audio com transcrições alinhadas, em CD-ROM).

The remaining materials will be collected with attention to the internal balance of each corpus and to the comparability among the corpora.

The following materials are available on-line:

  1. Concordances in KWIC format of all the words in the corpus, organized by subcorpus and type of discourse.
  2. Contrastive word indexes (lemmas and forms, in alphabetical ranges A-D, E-I, J-P and Q-Z) with frequency data, divided by subcorpus and by genre of discourse.
  3. Contrastive word indexes (lemmas and forms) that occur in each subcorpus (Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe) with frequency data, divided by genre of discourse.
  4. Comparative description of the vocabulary of the subcorpora as a result of quantitative and statistical analyses.
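
As an illustration of the first three items in the list above, the following is a minimal Python sketch of how KWIC (keyword in context) lines and a contrastive frequency index can be derived from tokenised text. It is not the project's own tooling: the two sample sentences, the names `SUBCORPORA`, `tokenize`, `kwic` and `contrastive_index`, and the context width are all hypothetical, and the sketch counts word forms only (a lemma index would additionally require a lemmatiser, which is not shown).

```python
import re
from collections import Counter

# Hypothetical toy data standing in for two of the five subcorpora.
SUBCORPORA = {
    "Angola": "o menino foi à escola e o professor falou com o menino",
    "Mozambique": "a menina foi ao mercado e o menino ficou em casa",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) triples, one per occurrence of keyword.

    Each side holds up to `width` tokens of context, joined by spaces.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def contrastive_index(subcorpora):
    """Map each word form to its frequency in every subcorpus (0 if absent)."""
    counts = {name: Counter(tokenize(text)) for name, text in subcorpora.items()}
    forms = sorted(set().union(*counts.values()))
    return {form: {name: counts[name][form] for name in subcorpora} for form in forms}


if __name__ == "__main__":
    # Print the concordance of one form, grouped by subcorpus.
    for name, text in SUBCORPORA.items():
        for left, key, right in kwic(tokenize(text), "menino"):
            print(f"{name:12} {left:>20} [{key}] {right}")

    # Print the contrastive frequency index, one form per line.
    for form, freqs in contrastive_index(SUBCORPORA).items():
        print(form, freqs)
```

Keeping the zero counts in the index is a deliberate choice in this sketch: a form that is absent from one subcorpus is itself a contrastive fact, so it is recorded rather than omitted.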

Other publications:

Bacelar do Nascimento, M. F., Pereira, L. A. S., Estrela, A., Bettencourt Gonçalves, J., Oliveira, S. M. and Santos, R. (2006). The African Varieties of Portuguese: Compiling Comparable Corpora and Analysing Data-derived Lexicon. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC2006), pages 1791-1794, Genoa, Italy. ELRA. 

CLUL - Centro de Linguística da Universidade de Lisboa
CFTC - Centro de Física Teórica e Computacional da Universidade de Lisboa