Multifunctional Computational Lexicon of Contemporary Portuguese

Lexicon

Multifunctional Computational Lexicon of Contemporary Portuguese

ISLRN

489-956-642-755-8

Complementary website

ELRA Catalogue

Group

Grammar & Resources

Financing institution

FCT – Fundação para a Ciência e a Tecnologia

Description
Team

Partnership:

Centro de Linguística da Universidade de Lisboa (Prime Contractor)

INESC - Instituto de Engenharia de Sistemas e Computadores (Partner)

Editorial Verbo (Partner)

Istituto di Linguistica Computazionale del CNR – ILC – Pisa (Consultant)

Project Status:

Concluded

Multifunctional Computational Lexicon of Contemporary Portuguese

European Portuguese now has a Frequency Lexicon of 26.443 lemmas and 140.315 tokens, with a minimum lemma frequency of 6, extracted from a relevant corpus of contemporary Portuguese (16.210.438 words). Each lemma is followed by morphosyntactic and quantitative information. The same information is given for each token of the lemma (inflected forms and some compounds). The lexicon indexes are listed in alphabetical order or in decreasing order of frequency.

This resource is freely available at ELRA's Catalog and it has been assigned the International Standard Language Resource Number (ISLRN) 489-956-642-755-8.
More information can be found at http://www.islrn.org/.

PROJECT DESCRIPTION

Corpus

To carry out this project, CLUL designed CORLEX, a corpus of 16.210.438 words extracted from the CRPC, containing a spoken subcorpus (856.195 words) and a written subcorpus (15.354.243 words). CORLEX contains written and spoken texts of several types, and genre diversity is one of its characteristics. In order to represent the common language and a wide range of topics, CORLEX consists mainly of journalistic texts (56% of the written corpus and 53% of the whole corpus).
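The figures above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers reported in this section (the dots in the source are thousands separators); the variable names are illustrative.

```python
# Word counts as reported for CORLEX.
SPOKEN_WORDS = 856_195
WRITTEN_WORDS = 15_354_243
TOTAL_WORDS = 16_210_438

# The two subcorpora should add up to the reported corpus size.
subcorpora_sum = SPOKEN_WORDS + WRITTEN_WORDS

# Journalistic texts are reported as 56% of the written subcorpus;
# express that amount as a share of the whole corpus.
journalistic_words = 0.56 * WRITTEN_WORDS
journalistic_share_of_total = journalistic_words / TOTAL_WORDS

print(subcorpora_sum == TOTAL_WORDS)
print(round(journalistic_share_of_total * 100))
```

Both checks agree with the text: the subcorpora sum to the corpus total, and 56% of the written subcorpus rounds to 53% of the whole corpus.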

Constitution of the written corpus

Part of this corpus consists of texts given to CLUL by Editorial VERBO, a partner in this project. The written corpus draws on several different sources and is made up of samples from selected titles.

Press

Newspapers
Number of Titles	Dates	N. of copies	N. of articles
3	1997 and 1998	105	13.085
Periodicals
Number of Titles	Dates	N. of copies	N. of articles
3	1992 and 1997	105	13.085

Literary

(Romances, Novels, Short Stories, Poetry, Memoirs and Theatre by Portuguese authors)

Period	N. of Authors	N. of Titles
19th century (2nd half)	11	14
20th century	124	172
Total	135	186

Techno-Scientific and Didactic

Type of book	N. of Authors	N. of Titles	Dates
Techno-Scientific	68	68	1980 - 1993
Didactic	23	25	1980 - 1993
Total	91	93	1980 - 1993

Varia

Type of Document	N. of texts/articles	Dates
Specialized newspapers and journals	347	1900 - 1997
Other documents	30	1900 - 1997

Constitution of the spoken corpus (856.195 words)

The spoken corpus contains orthographic transcriptions of informal conversations and of more formal productions such as conferences and radio and TV interviews, among others.

Type of Speech	N. of words	N. of texts	Dates
informal	752.394	1409	1970s and 1990s
formal	103.801	150	1980s

The Lexicon

Extraction of the lexicon

To extract the lexicon from CORLEX, all the different lexical forms occurring in the corpus were indexed. A total of 16.210.438 occurrences were found, corresponding to 283.530 different word forms. The most frequent word form was "de", and 128.383 word forms occurred only once.
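The indexing step described above amounts to counting every distinct word form in the corpus. The following is a minimal sketch of that idea, not the procedure actually used by the project: the tokenisation rule, the function name `index_word_forms` and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def index_word_forms(text: str) -> Counter:
    """Count occurrences of each distinct word form (lower-cased)."""
    # Simplified tokenisation (an assumption): runs of letters,
    # optionally joined by hyphens.
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    return Counter(tokens)

# Hypothetical one-sentence "corpus".
sample = "A casa de pedra fica perto de Lisboa e de Sintra"
counts = index_word_forms(sample)

occurrences = sum(counts.values())           # all running words
different_forms = len(counts)                # distinct word forms
top_form, top_freq = counts.most_common(1)[0]
hapaxes = sorted(form for form, n in counts.items() if n == 1)

print(occurrences, different_forms, top_form, top_freq, len(hapaxes))
```

On this toy sample, as in CORLEX, "de" is the most frequent form and most of the other forms occur only once.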

All word forms were automatically tagged (morphosyntactic tagging) and lemmatised by PALAVROSO (an automatic analyser belonging to INESC and made available to CLUL under an agreement signed by the two institutions on 6 January 1992). The tags were assigned theoretically to each word form that occurred in the corpus.

Taking into account every possible tag that a word form could have, the automatic lemmatisation generated 39.966 lemmas.
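A theoretical lemmatisation of this kind can be pictured as collecting, for every word form, all the readings it could have regardless of context. The sketch below illustrates the idea only; the data, the tag names and the helper functions are hypothetical and do not reproduce PALAVROSO's output format.

```python
# Hypothetical analyser output: every (lemma, tag) reading a form could have.
CANDIDATE_READINGS = {
    "casa": [("casa", "noun"), ("casar", "verb")],
    "como": [("como", "conjunction"), ("comer", "verb")],
    "pedra": [("pedra", "noun")],
}

def theoretical_lemmas(readings_by_form):
    """Collect every (lemma, tag) pair that any word form could realise."""
    return {
        reading
        for readings in readings_by_form.values()
        for reading in readings
    }

def ambiguous_forms(readings_by_form):
    """Word forms with more than one possible reading (homographs)."""
    return sorted(
        form for form, readings in readings_by_form.items()
        if len(readings) > 1
    )

lemmas = theoretical_lemmas(CANDIDATE_READINGS)
homographs = ambiguous_forms(CANDIDATE_READINGS)
print(len(lemmas), homographs)
```

Because every reading is kept, the lemma count obtained this way is an upper bound that later verification and disambiguation can only reduce.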

The next task consisted of a manual verification of all the tags attributed to each word form and lemma (with a frequency ≥ 6). This task was carried out by CLUL.

The criteria followed in this verification were the same as those used in the Português Fundamental project (cf. Português Fundamental, Métodos e Documentos, Vol. I, Inquérito de Frequência, INIC-CLUL, Lisboa, 1987, pp. 358-391).

After the manual verification, another theoretical lemmatisation was carried out, with the following results:

Number of lemmas	Number of different tokens	Number of homograph tokens
26.474	131.433	44.773

Disambiguation

The disambiguation of homographic word forms was carried out according to different criteria. INESC designed specific software for this task: DESAMBIG and ENCONTRA&ESTATIC.

Probabilistic calculations and automatic extraction of rules were made using the PAROLE annotated subcorpus (annotated by INESC with the morphological analyser PALAVROSO and disambiguated by CLUL/INESC with an auxiliary tool of manual analysis of word forms in context, DESAMBIG).

INESC ran Eric Brill's tagger over CORLEX for the automatic disambiguation. In parallel, CLUL manually disambiguated the ambiguous word forms that did not occur in the PAROLE tagged corpus (335.637 contexts analysed) and carried out a large number of manual validations and analyses of word forms whose frequency and/or grammatical category seemed odd (over 2.000.000 forms in context).

Once all these data had been gathered, from the automatic disambiguation (INESC) and from the manual disambiguation and validation (CLUL), the final indexing of the Lexicon was carried out.

Quantitative Information

Probabilistic calculations, based on the data from the manually revised PAROLE subcorpus, were made by INESC in order to determine the frequencies in CORLEX.

The quantitative data for the lemmas included in the Lexicon, i.e. lemmas whose frequency is equal to or higher than the established minimum (6), result from these calculations and from the manual disambiguation process performed by CLUL.
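One simple way to picture such a calculation is to distribute the corpus count of each ambiguous word form over its possible lemmas, in proportion to how often each reading was observed in the manually revised subcorpus, and then to keep only the lemmas that reach the minimum frequency. The sketch below is an assumption about the general approach, not INESC's actual method; all names and figures in it are hypothetical, except the threshold of 6.

```python
MIN_LEMMA_FREQUENCY = 6  # minimum lemma frequency stated in the text

def estimate_lemma_frequencies(form_counts, reading_shares):
    """Distribute each form's corpus count over its lemmas by observed share."""
    totals = {}
    for form, count in form_counts.items():
        for lemma, share in reading_shares[form].items():
            totals[lemma] = totals.get(lemma, 0.0) + count * share
    return totals

# Hypothetical counts of word forms in the full corpus.
form_counts = {"como": 40, "comer": 10, "pedra": 4}

# Hypothetical share of each reading in a manually disambiguated subcorpus.
reading_shares = {
    "como": {"como": 0.75, "comer": 0.25},
    "comer": {"comer": 1.0},
    "pedra": {"pedra": 1.0},
}

estimated = estimate_lemma_frequencies(form_counts, reading_shares)

# Keep only lemmas at or above the minimum frequency.
retained = {
    lemma: freq for lemma, freq in estimated.items()
    if freq >= MIN_LEMMA_FREQUENCY
}
print(estimated, sorted(retained))
```

In this toy example the lemma "pedra" falls below the threshold and would be left out of the Lexicon.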

Thus, a number of occurrences is given with each lemma entry and with each form of the lemma entry. Since the range of occurrence values is wide, for entries as well as for forms, a base-10 logarithmic scale (log₁₀/2) was used to obtain a more homogeneous distribution of the quantitative data. These data are represented by sequences of graphic characters that indicate the following value intervals:

Frequency level (log₁₀/2):
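The table of value intervals is not reproduced in this text, so the exact band limits are unknown. As an illustration only, the sketch below assumes that "log₁₀/2" means bands half a decade wide, i.e. a level of floor(2 · log₁₀(frequency)); this reading, the function names and the "*" marker are all assumptions.

```python
import math

def frequency_level(frequency: int) -> int:
    """Half-decade band of a frequency: floor(2 * log10(frequency))."""
    if frequency < 1:
        raise ValueError("frequency must be a positive integer")
    return math.floor(2 * math.log10(frequency))

def level_marker(frequency: int, symbol: str = "*") -> str:
    """Render the band as a run of characters (the symbol is hypothetical)."""
    return symbol * frequency_level(frequency)

for freq in (6, 10, 50, 1000):
    print(freq, frequency_level(freq), level_marker(freq))
```

Under this assumption, frequencies spanning several orders of magnitude are compressed into a small number of levels, which is the effect the logarithmic scale is meant to achieve.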

Lemmas

Indexation by alphabetical order:

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z

Indexation by decreasing frequency order:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12

Indexation, with numerical frequency, by alphabetical order:

lmcpc_alf.txt

Indexation, with numerical frequency, by decreasing frequency order:

lmcpc_dec.txt
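The two kinds of index differ only in their sort key. The sketch below shows both orderings on a few hypothetical (lemma, frequency) pairs; it does not read lmcpc_alf.txt or lmcpc_dec.txt, whose layout is not described here, and the accent-insensitive alphabetical order is an assumption about collation.

```python
import unicodedata

def alphabetical_key(lemma: str):
    """Sort key that ignores accents and case (an assumed collation)."""
    decomposed = unicodedata.normalize("NFD", lemma)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (stripped.lower(), lemma)

# Hypothetical entries: (lemma, frequency).
entries = [("casa", 120), ("de", 900), ("árvore", 35), ("comer", 120)]

# Index in alphabetical order.
by_alphabet = sorted(entries, key=lambda e: alphabetical_key(e[0]))

# Index in decreasing order of frequency, ties broken alphabetically.
by_frequency = sorted(entries, key=lambda e: (-e[1], alphabetical_key(e[0])))

print([lemma for lemma, _ in by_alphabet])
print([lemma for lemma, _ in by_frequency])
```

Stripping accents before comparing keeps "árvore" among the entries under A instead of after Z, which is where a plain code-point sort would place it.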

Partnerships

CLUL - Centro de Linguística da Universidade de Lisboa

INESC - Instituto de Engenharia de Sistemas e Computadores

Editorial Verbo

Istituto di Linguistica Computazionale del CNR – ILC