Multifunctional Computational Lexicon of Contemporary Portuguese

Concluded
Reference
JNICT/FCT - Programa PRAXIS XXI (Contrato 2/2.1/CSH/759/95)
Funding institution
FCT – Fundação para a Ciência e a Tecnologia
Project PI
João Malaca Casteleiro


European Portuguese now has a Frequency Lexicon of 26.443 lemmas and 140.315 tokens, with a minimum lemma frequency of 6, extracted from a relevant corpus of contemporary Portuguese (16.210.438 words). Each lemma is followed by morphosyntactic and quantitative information, and the same information is given for each token of the lemma (inflected forms and some compounds). The lexicon indexes are listed in alphabetical order or in decreasing order of frequency.

This resource is freely available at ELRA's Catalog and it has been assigned the International Standard Language Resource Number (ISLRN) 489-956-642-755-8.
More information can be found at http://www.islrn.org/.

PROJECT DESCRIPTION

Corpus

To carry out this project, CLUL designed CORLEX, a corpus of 16.210.438 words extracted from CRPC, comprising a spoken subcorpus (856.195 words) and a written subcorpus (15.354.243 words). CORLEX contains written and spoken texts of several types, and genre diversity is one of its defining characteristics. To represent the common language and a wide range of themes, CORLEX is composed mainly of journalistic texts (56% of the written corpus and 53% of the whole corpus).
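The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the word counts and the 56% share quoted above; the variable names are illustrative.

```python
# Word counts quoted in the corpus description.
SPOKEN_WORDS = 856_195
WRITTEN_WORDS = 15_354_243
TOTAL_WORDS = 16_210_438

# The two subcorpora should add up to the whole corpus.
subcorpora_sum = SPOKEN_WORDS + WRITTEN_WORDS

# Journalistic texts are stated to be 56% of the written corpus;
# their share of the whole corpus follows from that figure.
journalistic_words = 0.56 * WRITTEN_WORDS
share_of_whole = round(100 * journalistic_words / TOTAL_WORDS)

print(subcorpora_sum == TOTAL_WORDS)  # True
print(share_of_whole)                 # 53
```

The result agrees with the 53% given for the whole corpus.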

 

Composition of the written corpus

Part of this corpus consists of texts given to CLUL by Editorial VERBO, a partner in this project. The written corpus draws on several different sources and is made up of samples from selected titles.

Press

Newspapers

Number of Titles | Dates | N. of copies | N. of articles
3 | 1997 and 1998 | 105 | 13.085

Periodicals

Number of Titles | Dates | N. of copies | N. of articles
3 | 1992 and 1997 | 105 | 13.085

 

Literary

(Romances, novels, short stories, poetry, memoirs and theatre by Portuguese authors)

N. of Authors | N. of Titles | Dates
135 | 186 | 19th century (2nd half): 11 authors, 14 titles; 20th century: 124 authors, 172 titles

 

Techno-Scientific and Didactic

N. of Authors | N. of Titles | Dates
91 (techno-scientific books: 68; didactic books: 23) | 93 (techno-scientific books: 68; didactic books: 25) | 1980 - 1993

 

Varia

Type of Document | N. of texts/articles | Dates
Specialized newspapers and journals | 347 | 1900 - 1997
Other documents | 30 |

 

 

Composition of the spoken corpus (856.195 words)

The spoken corpus contains orthographic transcriptions of informal conversations and of more formal productions such as conferences, radio and TV interviews, etc.

 

Type of Speech | N. of words | N. of texts | Dates
informal | 752.394 | 1409 | 1970s and 1990s
formal | 103.801 | 150 | 1980s

 

 

The Lexicon

Extraction of the lexicon

To extract the lexicon from CORLEX, all the different lexical forms occurring in the corpus were indexed. A total of 16.210.438 occurrences were found, of which 283.530 were different word forms. The most frequent word form was "de", and 128.383 word forms occurred only once.
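The indexing step can be illustrated with a minimal sketch. It is not the project's actual tooling: the tokenisation rule (lower-cased runs of letters, optionally hyphenated) and the sample sentence are assumptions made only for illustration.

```python
import re
from collections import Counter

def index_word_forms(text):
    """Count how often each word form occurs in a text.

    Assumption: a word form is a lower-cased run of letters,
    optionally joined by hyphens. The project's own rules may differ.
    """
    forms = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    return Counter(forms)

# Invented sample sentence, used only to exercise the function.
sample = "o gato de casa e o cão de caça de Lisboa"
index = index_word_forms(sample)

occurrences = sum(index.values())          # all running word forms
different_forms = len(index)               # distinct word forms
most_frequent = index.most_common(1)[0]    # (form, count) with highest count
hapax = [form for form, n in index.items() if n == 1]  # frequency of 1

print(occurrences, different_forms, most_frequent, len(hapax))
```

On this sample the sketch reports 11 occurrences, 8 different word forms, "de" as the most frequent form and 6 forms with a frequency of 1, the same four quantities reported above for CORLEX.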

All word forms were automatically tagged (morphosyntactic tagging) and lemmatised by PALAVROSO, an automatic analyser belonging to INESC and given to CLUL under an agreement made between the two institutions on 6 January 1992. The tags were attributed theoretically to each word form that occurred in the corpus.

Taking into account every possible tag that a word form could have, the automatic lemmatisation generated 39.966 lemmas.
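A small sketch may help to show why counting every possible tag inflates the number of lemmas. The forms, lemmas and tag names below are illustrative assumptions and do not reproduce PALAVROSO's output format.

```python
# Toy analyses: each word form maps to every (lemma, tag) reading it could have.
# Forms, lemmas and tag names are hypothetical examples.
ANALYSES = {
    "canto": [("canto", "N"), ("cantar", "V")],
    "casa": [("casa", "N"), ("casar", "V")],
    "de": [("de", "PREP")],
}

def candidate_lemmas(analyses):
    """Collect every lemma reachable through any possible reading of any form."""
    return {lemma for readings in analyses.values() for lemma, _tag in readings}

def homograph_forms(analyses):
    """Return the forms that have more than one possible reading."""
    return sorted(form for form, readings in analyses.items() if len(readings) > 1)

lemmas = candidate_lemmas(ANALYSES)
homographs = homograph_forms(ANALYSES)

print(len(lemmas), homographs)
```

Three forms yield five candidate lemmas here, because the two homographs each contribute a noun reading and a verb reading until they are checked or disambiguated in context.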

The next task consisted of a manual verification of all the tags attributed to each word form and lemma (with a frequency ≥ 6). This task was carried out by CLUL.

The criteria followed in this verification were the same as those used in the Português Fundamental project (cf. Português Fundamental, Métodos e Documentos, Vol. I, Inquérito de Frequência, INIC-CLUL, Lisboa, 1987, pp. 358-391).

After the manual verification, another theoretical lemmatisation was carried out, with the following results:

 

Number of lemmas | Number of different tokens | Number of homograph tokens
26.474 | 131.433 | 44.773

 

Disambiguation

The homographic word forms were disambiguated following different criteria. INESC designed specific software for this task: DESAMBIG and ENCONTRA&ESTATIC.

Probabilistic calculations and automatic extraction of rules were carried out using the PAROLE annotated subcorpus (annotated by INESC with the morphological analyser PALAVROSO and disambiguated by CLUL/INESC with DESAMBIG, an auxiliary tool for the manual analysis of word forms in context).

INESC ran Eric Brill's Tagger over CORLEX for the automatic disambiguation. In parallel, CLUL manually disambiguated the ambiguous word forms that did not occur in the PAROLE tagged corpus (335.637 analysed contexts) and carried out a large number of manual validations and analyses of word forms whose frequency and/or grammatical category seemed odd (over 2.000.000 forms in context).

Once all the data resulting from the automatic disambiguation (INESC) and from the manual disambiguation and validation (CLUL) had been gathered, the final indexation of the Lexicon was made.

Quantitative Information

Probabilistic calculations, based on the data from the manually revised PAROLE subcorpus, were made by INESC in order to determine the frequencies in CORLEX.

The quantitative data for the lemmas included in the Lexicon, i.e., lemmas whose frequency is equal to or higher than the established minimum (6), resulted from these calculations and from the manual disambiguation process performed by CLUL.
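The text does not detail the probabilistic calculations made by INESC. One simple way to picture such an estimate, offered here purely as an assumption, is to split the corpus count of an ambiguous form across its readings in the proportions observed in a manually revised subcorpus. The function name and all numbers below are invented for illustration.

```python
def estimate_reading_frequencies(corpus_count, annotated_counts):
    """Split a form's corpus count across its readings.

    corpus_count: occurrences of the form in the full corpus.
    annotated_counts: occurrences of each reading in an annotated subcorpus.
    Assumption: the proportions in the subcorpus hold for the full corpus.
    """
    total = sum(annotated_counts.values())
    if total == 0:
        # No evidence in the annotated data: such forms need manual analysis.
        raise ValueError("form not observed in the annotated subcorpus")
    return {
        reading: corpus_count * count / total
        for reading, count in annotated_counts.items()
    }

# Invented toy numbers: a form seen 1000 times in the full corpus and
# 100 times in the annotated subcorpus (30 as a noun, 70 as a verb).
estimates = estimate_reading_frequencies(
    1000, {("canto", "N"): 30, ("cantar", "V"): 70}
)
print(estimates)
```

Forms with no occurrences in the annotated data cannot be estimated this way, which is why the sketch raises an error for them instead of guessing.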

Thus, a number of occurrences is given for each lemma entry and for each form of that entry. Since the range of occurrence counts is wide, for entries as well as for forms, a base-10 logarithmic scale (log10/2) was used to obtain a more homogeneous distribution of the quantitative data. These data are represented by sequences of graphic characters that indicate the following value intervals:

 

Frequency level (log10/2):

 

Lemma:
6 - 10
11 - 31
32 - 100
101 - 316
317 - 1.000
1.001 - 3.162
3.163 - 10.000
10.001 - 31.622
31.623 - 100.000
100.001 - 316.227
316.228 - 1.000.000
1.000.001 - 3.162.277

Tokens:
0 - 5
6 - 10
11 - 31
32 - 100
101 - 316
317 - 1.000
1.001 - 3.162
3.163 - 10.000
10.001 - 31.622
31.623 - 100.000
100.001 - 316.227
316.228 - 1.000.000
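The interval boundaries listed above fall at successive half powers of ten (10, 31, 100, 316, ...), so each level appears to span half a unit of log10. Under that reading, the sketch below recovers the interval for a given frequency using integer arithmetic only. The function name is illustrative, and the treatment of the lowest bands (clipping at the minimum frequency of 6 and the separate "0 - 5" band for tokens) is inferred from the lists.

```python
from math import isqrt

MIN_FREQ = 6  # minimum frequency stated for the Lexicon

def frequency_band(freq):
    """Return the (low, high) interval that contains freq.

    The upper bound of band k is the integer part of 10**(k/2),
    computed as isqrt(10**k) to avoid floating-point rounding.
    """
    if freq < MIN_FREQ:
        return (0, MIN_FREQ - 1)  # the "0 - 5" band listed for tokens
    k = 2
    # freq <= 10**(k/2) is equivalent to freq**2 <= 10**k for positive freq.
    while freq * freq > 10 ** k:
        k += 1
    low = max(isqrt(10 ** (k - 1)) + 1, MIN_FREQ)
    high = isqrt(10 ** k)
    return (low, high)

for f in (5, 6, 31, 32, 317, 3_162_277):
    print(f, frequency_band(f))
```

For instance, a frequency of 32 falls in the interval 32 - 100 and a frequency of 317 in the interval 317 - 1.000, matching the lists.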

 

Indexation by alphabetical order:
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z

 

Indexation by decreasing frequency order:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12

 

Indexation, with numerical frequency, by alphabetical order:
lmcpc_alf.zip

 

Indexation, with numerical frequency, by decreasing frequency order:
lmcpc_dec.zip

Partnerships
CLUL - Centro de Linguística da Universidade de Lisboa
INESC - Instituto de Engenharia de Sistemas e Computadores
Editorial Verbo
Istituto di Linguistica Computazionale del CNR – ILC