Multifunctional Computational Lexicon of Contemporary Portuguese
European Portuguese now has a Frequency Lexicon of 26.443 lemmas and 140.315 tokens, with a minimum lemma frequency of 6, extracted from a relevant contemporary Portuguese corpus (16.210.438 words). Each lemma is followed by morphosyntactic and quantitative information. The same information is given for each token of the lemma (inflected forms and some compounds). The lexicon indexes are listed in alphabetical order or in decreasing frequency order.
This resource is freely available at ELRA's Catalog and it has been assigned the International Standard Language Resource Number (ISLRN) 489-956-642-755-8.
More information can be found at http://www.islrn.org/.
PROJECT DESCRIPTION
Corpus
To carry out this project, CLUL designed and extracted from CRPC a corpus of 16.210.438 words, CORLEX, containing a spoken subcorpus (856.195 words) and a written subcorpus (15.354.243 words). CORLEX contains written and spoken texts of several types; genre diversity is a characteristic of this corpus. To represent the common language and a wide range of themes, CORLEX is composed mainly of journalistic texts (56% of the written corpus and 53% of the whole corpus).
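The figures above use a dot as the thousands separator. As a quick consistency check, the sketch below parses them and verifies that the two subcorpora add up to the stated total, and that 56% of the written corpus comes to roughly 53% of the whole. The helper name `parse_count` is hypothetical and used only for illustration.

```python
def parse_count(text: str) -> int:
    """Parse a count written with dots as thousands separators, e.g. '856.195'."""
    return int(text.replace(".", ""))


total = parse_count("16.210.438")
spoken = parse_count("856.195")
written = parse_count("15.354.243")

# The two subcorpora should add up to the whole of CORLEX.
subcorpora_match = (spoken + written == total)

# Journalistic texts are 56% of the written corpus; express that
# as a rounded percentage of the whole corpus.
journalistic_share = round(100 * 0.56 * written / total)

print(subcorpora_match, journalistic_share)
```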
Constitution of the written corpus
Part of this corpus consists of texts given to CLUL by Editorial VERBO, a partner in this project. The written corpus draws on several different sources and consists of samples of selected titles.
Press
Newspapers
Number of Titles | Dates | N. of copies | N. of articles
3 | 1997 and 1998 | 105 | 13.085
Periodicals
Number of Titles | Dates | N. of copies | N. of articles
3 | 1992 and 1997 | 105 | 13.085
Literary
(Romances, Novels, Short Stories, Poetry, Memoirs and Theatre by Portuguese authors)
N. of Authors | N. of Titles | Dates
135 | 186 | XIXth century (2nd half): 11 authors, 14 titles; XXth century: 124 authors, 172 titles
Techno-Scientific and Didactic
N. of Authors | N. of Titles | Dates
91 | 93 | 1980 - 1993
Varia
Type of Document | N. of texts/articles | Dates
Specialized newspapers and journals | 347 | 1900 - 1997
Other documents | 30 |
Constitution of the spoken corpus (856.195 words)
The spoken corpus contains orthographic transcriptions of informal conversations and of more formal productions such as conferences, radio and TV interviews, etc.
Type of Speech | N. of words | N. of texts | Dates
informal | 752.394 | 1409 | 1970 and 1990 decades
formal | 103.801 | 150 | 1980 decade
The Lexicon
Extraction of the lexicon
To extract the lexicon from CORLEX, all the different lexical forms occurring in the corpus were indexed. A total of 16.210.438 occurrences were found, of which 283.530 were different word forms. The most frequent word form was "de", and 128.383 word forms had a frequency of only 1.
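The indexing step described above amounts to counting occurrences per word form. A minimal sketch on a toy sentence follows; splitting on whitespace is a simplifying assumption and is not the tokenisation actually used for CORLEX, and the function name `index_word_forms` is hypothetical.

```python
from collections import Counter


def index_word_forms(tokens):
    """Count how often each distinct word form occurs."""
    return Counter(tokens)


# Toy sample; whitespace splitting is an assumption made for illustration.
sample = "a casa de pedra e a casa de madeira de lei".split()
index = index_word_forms(sample)

occurrences = sum(index.values())               # running words in the sample
distinct_forms = len(index)                     # different word forms
top_form, top_count = index.most_common(1)[0]   # most frequent word form
hapaxes = sorted(form for form, n in index.items() if n == 1)

print(occurrences, distinct_forms, top_form, top_count, hapaxes)
```

The same four quantities (occurrences, distinct forms, most frequent form, forms with frequency 1) are the ones reported for the full corpus.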
All word forms were automatically tagged (morphosyntactic tagging) and lemmatised by PALAVROSO (an automatic analyser belonging to INESC and made available to CLUL under an agreement between the two institutions signed on 6 January 1992). The tags were attributed theoretically to each word form that occurred in the corpus.
Taking into account every tag that a word form could possibly have, the automatic lemmatisation generated 39.966 lemmas.
The next task consisted of a manual verification of all the tags attributed to each word form and lemma (with a frequency ≥ 6). This task was carried out by CLUL.
The criteria followed in this verification were the same as those used in the Português Fundamental project (cf. Português Fundamental, Métodos e Documentos, Vol. I, Inquérito de Frequência, INIC-CLUL, Lisboa, 1987, pp. 358-391).
After the manual verification, another theoretical lemmatisation was carried out, with the following results:
Number of lemmas | 26.474
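The frequency threshold of 6 can be illustrated with a small sketch. It assumes that a lemma's frequency is the sum of the frequencies of its word forms; that assumption, the toy counts, and the names `lemma_frequencies` and `select_lemmas` are illustrative only and are not taken from the project's tools.

```python
from collections import defaultdict

MIN_LEMMA_FREQ = 6  # minimum lemma frequency stated for the Lexicon


def lemma_frequencies(form_counts):
    """Sum word-form counts per lemma; keys are (lemma, form) pairs."""
    totals = defaultdict(int)
    for (lemma, _form), count in form_counts.items():
        totals[lemma] += count
    return dict(totals)


def select_lemmas(form_counts, minimum=MIN_LEMMA_FREQ):
    """Keep only lemmas whose summed frequency reaches the minimum."""
    totals = lemma_frequencies(form_counts)
    return {lemma: n for lemma, n in totals.items() if n >= minimum}


# Toy counts (hypothetical data).
toy_counts = {
    ("casa", "casa"): 4,
    ("casa", "casas"): 3,
    ("lei", "lei"): 5,
}
selected = select_lemmas(toy_counts)
print(selected)
```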
Disambiguation
The disambiguation of homographic word forms was carried out according to different criteria. INESC designed specific software for this task, DESAMBIG and ENCONTRA&ESTATIC.
Probabilistic calculations and automatic extraction of rules were carried out using the PAROLE annotated subcorpus (annotated by INESC with the morphological analyser PALAVROSO and disambiguated by CLUL/INESC with DESAMBIG, an auxiliary tool for the manual analysis of word forms in context).
INESC ran Eric Brill's tagger over CORLEX for the automatic disambiguation. In parallel, CLUL manually disambiguated the ambiguous word forms that did not occur in the PAROLE tagged corpus (335.637 analysed contexts) and carried out a large number of manual validations and analyses of word forms whose frequency and/or grammatical category seemed odd (over 2.000.000 forms in context).
Once all the data resulting from the automatic disambiguation (INESC) and from the manual disambiguation and validation (CLUL) had been gathered, the final indexing of the Lexicon was carried out.
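One simple way to picture how an annotated subcorpus can inform the frequencies of homographic forms is proportional apportionment: estimate, for each form, the relative frequency of each tag in the annotated data, then split the form's corpus frequency accordingly. The sketch below is only an illustration under that assumption; it is not the method implemented by INESC, and the tag labels, data, and function names are hypothetical.

```python
from collections import Counter


def tag_probabilities(annotated_pairs):
    """Estimate P(tag | form) from (form, tag) pairs of an annotated subcorpus."""
    per_form = {}
    for form, tag in annotated_pairs:
        per_form.setdefault(form, Counter())[tag] += 1
    return {
        form: {tag: n / sum(counts.values()) for tag, n in counts.items()}
        for form, counts in per_form.items()
    }


def apportion(form, corpus_freq, probabilities):
    """Split a form's corpus frequency across its tags in proportion to P(tag | form)."""
    return {tag: corpus_freq * p for tag, p in probabilities[form].items()}


# Hypothetical annotated data: "canto" seen 3 times as a noun, once as a verb.
annotated = [("canto", "N")] * 3 + [("canto", "V")]
probs = tag_probabilities(annotated)
estimate = apportion("canto", 200, probs)
print(estimate)
```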
Quantitative Information
INESC carried out probabilistic calculations, based on data from the manually revised PAROLE subcorpus, to determine the frequencies in CORLEX.
The quantitative data for the lemmas included in the Lexicon, i.e., lemmas whose frequency is equal to or higher than the established minimum (6), result from these calculations and from the manual disambiguation process performed by CLUL.
Thus, a number of occurrences is given for each lemma entry and for each form of the lemma entry. Since the range of occurrence counts is wide, for entries as well as for forms, a base-10 logarithmic scale (log10/2) was used to obtain a more homogeneous distribution of the quantitative data. These data are represented by sequences of graphic characters that indicate the following value intervals:
Frequency level (log10/2):
Lemma: 6 - 10 | 11 - 31 | 32 - 100 | 101 - 316 | 317 - 1.000 | 1.001 - 3.162 | 3.163 - 10.000 | 10.001 - 31.622 | 31.623 - 100.000 | 100.001 - 316.227 | 316.228 - 1.000.000 | 1.000.001 - 3.162.277
Tokens: 0 - 5 | 6 - 10 | 11 - 31 | 32 - 100 | 101 - 316 | 317 - 1.000 | 1.001 - 3.162 | 3.163 - 10.000 | 10.001 - 31.622 | 31.623 - 100.000 | 100.001 - 316.227 | 316.228 - 1.000.000
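The upper bounds of the intervals listed above coincide with the integer parts of 10^(k/2) for k = 2, ..., 13, which is one reading of the "log10/2" scale. The sketch below rebuilds those bounds with exact integer arithmetic and looks up the interval for a given frequency. The zero-based band numbers and the function names are assumptions made for illustration; the correspondence between bands and graphic characters is not given here.

```python
from bisect import bisect_left
from math import isqrt

# Integer part of 10**(k/2), computed exactly as isqrt(10**k), for k = 2..13.
LEMMA_UPPER_BOUNDS = [isqrt(10 ** k) for k in range(2, 14)]

# Token intervals start with 0 - 5 and stop at 1.000.000.
TOKEN_UPPER_BOUNDS = [5] + LEMMA_UPPER_BOUNDS[:-1]


def lemma_band(freq, minimum=6):
    """Zero-based index of the lemma interval containing freq (assumed numbering)."""
    if freq < minimum or freq > LEMMA_UPPER_BOUNDS[-1]:
        raise ValueError(f"frequency {freq} is outside the listed lemma intervals")
    return bisect_left(LEMMA_UPPER_BOUNDS, freq)


def token_band(freq):
    """Zero-based index of the token interval containing freq (assumed numbering)."""
    if freq < 0 or freq > TOKEN_UPPER_BOUNDS[-1]:
        raise ValueError(f"frequency {freq} is outside the listed token intervals")
    return bisect_left(TOKEN_UPPER_BOUNDS, freq)


print(LEMMA_UPPER_BOUNDS)
print(lemma_band(6), lemma_band(31), lemma_band(32), token_band(5), token_band(6))
```

Using `isqrt` keeps the bounds exact and avoids floating-point rounding at values such as 100 or 1.000, which sit exactly on an interval edge.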
Index in alphabetical order:
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
Index in decreasing frequency order:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Index, with numerical frequency, in alphabetical order:
lmcpc_alf.zip
Index, with numerical frequency, in decreasing frequency order:
lmcpc_dec.zip