Language Resources for Portuguese: a corpus and tools for query and analysis

Concluded

Date

01 January 2000

Reference

Programa Lusitânia PLUS/1999/LIN/15152

Funding institution

Fundação Calouste Gulbenkian

Instituto Camões

Group

Grammar & Resources

Project Description:
This project made a balanced corpus of written and spoken European Portuguese, of about 9 million words, available for online queries on CLUL's webpage. The project also included the morphosyntactic annotation of a 500,000-word subcorpus, carried out with funding from Fundação Calouste Gulbenkian.
The project aimed to address the shortage of Portuguese language resources in this area and the rapidly growing demand for Portuguese data reflecting real language use, both for theoretical and practical work and for applications in computational linguistics, language teaching and lexicography, among others.

The written corpus is composed of 15 million running words selected from CLUL’s Reference Corpus of Contemporary Portuguese. The texts are extracted from books, newspapers and magazines, as well as from miscellaneous leaflets, brochures, official documents, etc. They cover literary, informative, scientific, technical and didactic genres across a wide range of domains.

Corpus constitution:

The final corpus comprises 9,171,480 words, distributed as follows:

Spoken corpus (transcribed informal talks):
ORAL_RL (spoken)	105964

Written corpus:
jornal_RL (newspaper)	4097868
livrolit_RL (fiction books)	1792590
livrotec_RL (technical books)	1440625
revista_RL (magazines)	420792
varia_RL (varia)	812599

Tagged subcorpus (included in the written corpus):
jornal_anotado_RL (tagged newspaper)	336151
livro_anotado_RL (tagged books)	125434
revista_anotado_RL (tagged magazines)	25908
varia_anotado_RL (tagged varia)	13549
subcorpus_anotado_RL (tagged subcorpus, subtotal)	501042

Totals:
ESCRITO_RL (written)	9065516
TOTAL_RL	9171480
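
The subtotals in the listing above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the listing, while the variable names and the `share` helper are assumptions introduced here and are not part of the project's tools.

```python
# Word counts copied from the distribution listing above.
spoken = {"ORAL_RL": 105964}
written_untagged = {
    "jornal_RL": 4097868,
    "livrolit_RL": 1792590,
    "livrotec_RL": 1440625,
    "revista_RL": 420792,
    "varia_RL": 812599,
}
written_tagged = {
    "jornal_anotado_RL": 336151,
    "livro_anotado_RL": 125434,
    "revista_anotado_RL": 25908,
    "varia_anotado_RL": 13549,
}

# Subtotals recomputed from the individual rows.
tagged_total = sum(written_tagged.values())
written_total = sum(written_untagged.values()) + tagged_total
corpus_total = written_total + sum(spoken.values())


def share(count, total):
    """Percentage of `total` represented by `count` (hypothetical helper)."""
    return 100 * count / total


print("subcorpus_anotado_RL:", tagged_total)
print("ESCRITO_RL:", written_total)
print("TOTAL_RL:", corpus_total)
print(f"spoken share: {share(sum(spoken.values()), corpus_total):.2f}%")
print(f"tagged share: {share(tagged_total, corpus_total):.2f}%")
```

The recomputed values match the published figures for subcorpus_anotado_RL, ESCRITO_RL and TOTAL_RL, which shows that the tagged subcorpus is counted within the written corpus.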

Corpus sources:

The corpus samples come from several different sources:

- Spoken corpus:

Informal talks collected for the project Português Fundamental, transcribed and published in:

Bacelar do Nascimento, M. F. et al. Português Fundamental, vol. II - Métodos e Documentos, tomo 1 - Inquérito de Frequência, Lisboa, INIC, CLUL, 1987;

- Written corpus:

Fiction books - 70 titles by 53 authors of Portuguese literature (19th and 20th centuries);

Technical books - 39 titles by 38 authors, published at the end of the 20th century and in the 21st century;

Newspapers - several editions from the year 2000 of the following newspapers: "A BOLA", "Diário de Notícias", "Expresso", "Jornal de Notícias" and "PÚBLICO";

Magazines - issues 83 to 95 of the magazine "Revista do Instituto do Consumidor" (1999 and 2000);

Varia - several articles from the "Enciclopédia Verbo", proceedings of scientific meetings, webpages, interviews published in the newspaper "O Primeiro de Janeiro", manuals for college students, final reports from bachelor's degree training placements, etc.

Copyrights:

Negotiations were undertaken with the Portuguese fiction authors represented in the corpus, in association with the Portuguese Authors Society (SPA), to obtain the permissions needed for text queries.

POS tagging:

A subset of 500,000 running words was annotated and checked at the morphological level. Texts were automatically tagged using an adapted version of Eric Brill's tagger, and a subset of the corpus was manually checked to resolve remaining ambiguities and to correct errors. The manually checked subset was then used as a training corpus to tag the whole corpus.

The tagged corpus will also be available for on-line queries at CLUL's webpage.
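
The general idea behind this kind of tagging (a baseline tag per word, corrected by contextual rules learned from a hand-checked training set) can be sketched in a few lines of Python. This is a deliberately simplified toy and not Brill's tagger or the adapted version used in the project: the training sentences, the tag labels (DET, N, V, ...) and all function names are assumptions made for illustration, and the only rule template considered is "change tag A to tag B when the previous tag is C".

```python
from collections import Counter, defaultdict

# Hypothetical hand-checked training sentences (word, tag); not project data.
TRAIN = [
    [("o", "DET"), ("canto", "N"), ("é", "V"), ("bonito", "ADJ")],
    [("eu", "PRON"), ("canto", "V"), ("bem", "ADV")],
    [("o", "DET"), ("canto", "N"), ("da", "PREP"), ("sala", "N")],
    [("eu", "PRON"), ("como", "V"), ("pão", "N")],
]


def train_baseline(sentences):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def baseline_tag(words, lexicon, default="N"):
    """Assign the baseline tag to each word; unknown words get `default`."""
    return [lexicon.get(word, default) for word in words]


def apply_rule(tags, rule):
    """Apply one rule (old, new, prev): old -> new when the previous tag is prev."""
    old, new, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == old and tags[i - 1] == prev:
            out[i] = new
    return out


def count_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))


def learn_rules(sentences, lexicon, max_rules=5):
    """Greedily pick the rule that fixes the most errors, until none helps."""
    words = [[w for w, _ in s] for s in sentences]
    gold = [[t for _, t in s] for s in sentences]
    current = [baseline_tag(ws, lexicon) for ws in words]
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed from the errors still present.
        candidates = {
            (cur[i], ref[i], cur[i - 1])
            for cur, ref in zip(current, gold)
            for i in range(1, len(cur))
            if cur[i] != ref[i]
        }
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            gain = sum(
                count_correct(apply_rule(cur, rule), ref) - count_correct(cur, ref)
                for cur, ref in zip(current, gold)
            )
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break
        rules.append(best_rule)
        current = [apply_rule(cur, best_rule) for cur in current]
    return rules


def tag(words, lexicon, rules):
    """Tag a sentence: baseline tags first, then the learned rules in order."""
    tags = baseline_tag(words, lexicon)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return list(zip(words, tags))


lexicon = train_baseline(TRAIN)
rules = learn_rules(TRAIN, lexicon)
print(rules)
print(tag(["eu", "canto"], lexicon, rules))
print(tag(["o", "canto"], lexicon, rules))
```

In this toy data the ambiguous form "canto" receives the noun tag by default, and the single learned rule retags it as a verb after a pronoun. A manually checked subset plays the same role in the workflow described above: it supplies the reference tags from which corrections are learned before the whole corpus is tagged.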

Mendes, A., Amaro, R., & Bacelar do Nascimento, M. F. (2003). Reusing resources for the morphosyntactic annotation of a spoken Portuguese corpus. In A. Branco, A. Mendes, & R. Ribeiro (Eds.), Tagging and Shallow Processing of Portuguese: workshop notes of TASHA 2003. Lisboa: Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa.

Members

Amália Mendes

Florbela Barreto

João Miguel Casteleiro

Maria Lúcia Garcia Marques

Partnerships

CLUL - Centro de Linguística da Universidade de Lisboa

SPA - Sociedade Portuguesa de Autores