Language Resources for Portuguese: a corpus and tools for query and analysis

Concluded
Date
Reference
Programa Lusitânia PLUS/1999/LIN/15152
Funding institution
Fundação Calouste Gulbenkian
Instituto Camões

Project Description:
This project made a balanced corpus of written and spoken European Portuguese, totalling 9 million words, available for on-line queries at CLUL's webpage. The project also included the morphosyntactic annotation of a 500,000-word subcorpus, carried out with funding from Fundação Calouste Gulbenkian.
The project aimed to address the shortage of Portuguese language resources in this area and the rapidly growing demand for data on real Portuguese usage, both for theoretical and practical work and for applications in computational linguistics, language teaching and lexicography, among others.

The written corpus is composed of 15 million running words selected from CLUL’s Reference Corpus of Contemporary Portuguese. The texts are drawn from books, newspapers and magazines, as well as from a miscellany of leaflets, brochures, official documents, etc. They cover literary, informative, scientific, technical and didactic genres across a wide range of domains.

Corpus composition:

The final corpus consists of 9,171,480 words, with the following distribution:

Spoken corpus (transcribed informal talks):

ORAL_RL                  (spoken)               105964

Written corpus:

jornal_RL                (newspaper)           4097868
livrolit_RL              (fiction books)       1792590
livrotec_RL              (technical books)     1440625
revista_RL               (magazines)            420792
varia_RL                 (varia)                812599
jornal_anotado_RL        (tagged newspaper)     336151
livro_anotado_RL         (tagged books)         125434
revista_anotado_RL       (tagged magazines)      25908
varia_anotado_RL         (tagged varia)          13549

subcorpus_anotado_RL     (tagged subcorpus)     501042
ESCRITO_RL               (written)             9065516

TOTAL_RL                                       9171480
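The figures in the distribution above can be cross-checked with a few lines of arithmetic. The sketch below assumes that ESCRITO_RL is the sum of the five untagged written parts plus the tagged subcorpus, and that TOTAL_RL adds the spoken part. The variable names and the `share` helper are illustrative only and are not part of the corpus tools.

```python
# Word counts copied from the corpus distribution listed above.
untagged_written = {
    "jornal_RL": 4097868,
    "livrolit_RL": 1792590,
    "livrotec_RL": 1440625,
    "revista_RL": 420792,
    "varia_RL": 812599,
}
tagged_written = {
    "jornal_anotado_RL": 336151,
    "livro_anotado_RL": 125434,
    "revista_anotado_RL": 25908,
    "varia_anotado_RL": 13549,
}
spoken = {"ORAL_RL": 105964}

# Assumed reading of the subtotals: tagged parts form subcorpus_anotado_RL,
# untagged + tagged parts form ESCRITO_RL, written + spoken form TOTAL_RL.
tagged_total = sum(tagged_written.values())
written_total = sum(untagged_written.values()) + tagged_total
corpus_total = written_total + sum(spoken.values())


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print("tagged subcorpus:", tagged_total)
print("written corpus:  ", written_total)
print("whole corpus:    ", corpus_total)
print("spoken share (%):", share(sum(spoken.values()), corpus_total))
```

Under this reading the computed subtotals match the stated values for subcorpus_anotado_RL, ESCRITO_RL and TOTAL_RL.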

 

Corpus sources:

The corpus samples come from several different sources:

- Spoken corpus:

Informal conversations collected for the Português Fundamental project, transcribed and published in:

Bacelar do Nascimento, M. F. et al. Português Fundamental, vol. II - Métodos e Documentos, tomo 1 - Inquérito de Frequência, Lisboa, INIC, CLUL, 1987;

- Written corpus:

Fiction books - 70 titles by 53 authors of Portuguese literature (19th and 20th centuries);

Technical books - 39 titles by 38 authors, published at the end of the 20th century and in the 21st century;

Newspapers - several editions from the year 2000 of the following newspapers: "A BOLA", "Diário de Notícias", "Expresso", "Jornal de Notícias" and "PÚBLICO";

Magazines - issues 83 to 95 of the magazine "Revista do Instituto do Consumidor" (1999 and 2000);

Varia - several articles from the "Enciclopédia Verbo", proceedings of scientific meetings, webpages, interviews published in the newspaper "O Primeiro de Janeiro", manuals for college students, final reports from undergraduate internships, etc.

Copyright:

In association with the Portuguese Authors Society (SPA), negotiations were undertaken with the Portuguese fiction authors represented in the corpus in order to obtain the necessary permission for text queries.

POS tagging:

A subset of 500,000 running words was annotated and checked at the morphological level. Texts were first tagged automatically using an adapted version of Eric Brill's tagger, and a subset of the corpus was then manually checked to resolve remaining ambiguities and correct errors. The manually checked subset was used as a training corpus to tag the whole corpus.

The tagged corpus will also be available for on-line queries at CLUL's webpage.
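The general idea behind this kind of tagger can be illustrated with a toy example: each word first receives its most frequent tag in the manually checked data, and contextual rules then correct some of those initial tags. The sketch below is only an illustration under stated assumptions. The sentences, the tags (`DET`, `N`, `V`, etc.), the single hand-written rule and the function names are all hypothetical; they are not the project's tagset, rules or adapted tagger, and the step in which rules are learned from the training corpus is omitted.

```python
from collections import Counter, defaultdict

# Toy "manually checked" training data with hypothetical tags.
checked = [
    [("o", "DET"), ("canto", "N"), ("é", "V"), ("belo", "ADJ")],
    [("o", "DET"), ("canto", "N"), ("da", "PREP"), ("sala", "N")],
    [("eu", "PRON"), ("canto", "V"), ("bem", "ADV")],
]


def train_baseline(sentences):
    """Map each word to its most frequent tag in the checked data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# Hand-written contextual rule (an assumption for illustration):
# change from_tag to to_tag when the previous word carries prev_tag.
RULES = [("N", "V", "PRON")]


def tag(words, lexicon, rules, default="N"):
    """Assign initial tags from the lexicon, then apply the rules in order."""
    tags = [lexicon.get(word, default) for word in words]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


lexicon = train_baseline(checked)
print(tag(["o", "canto"], lexicon, RULES))
print(tag(["eu", "canto"], lexicon, RULES))
```

In this toy data "canto" is most often a noun, so it starts as `N` in both test phrases; the rule retags it as `V` only after the pronoun "eu".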

Mendes, A., Amaro, R., & Bacelar do Nascimento, M. F. (2003). Reusing resources for the morphosyntactic annotation of a spoken Portuguese corpus. In A. Branco, A. Mendes, & R. Ribeiro (Eds.), Tagging and Shallow Processing of Portuguese: workshop notes of TASHA 2003. Lisboa: Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa.
Partnerships
CLUL - Centro de Linguística da Universidade de Lisboa
SPA - Sociedade Portuguesa de Autores