CRPC - Reference Corpus of Contemporary Portuguese

CRPC

CRPC - Reference Corpus of Contemporary Portuguese

Ongoing

Date

01 January 1988

Funding institution

Fundação Calouste Gulbenkian

Junta Nacional de Investigação Científica e Tecnológica

FCT – Fundação para a Ciência e a Tecnologia

Instituto Camões

Comissão Europeia

União Latina

Caixa Geral de Depósitos

Project PI

Amália Mendes

Grupo

Grammar & Resources

Description
Team

version 3.0 2012

Online queries over the written subcorpus: platform CQPweb

Online queries over the spoken subcorpus: platform TEITOK

ISLRN: 151-982-545-991-0

The CRPC is a large electronic corpus of European Portuguese and other varieties (Brazil, Angola, Cape Verde, Guinea-Bissau, Mozambique, S. Tome and Principe, Goa, Macao and East-Timor). It contains 311,4 million words and covers several types of written texts (literary, newspaper, technical, etc.) and spoken texts (formal and informal).
The written subpart of the CRPC (309 M) can be searched online, and subparts of the corpus are available for download or for purchase on the ELDA catalogue.

Corpus citation

"data retrieved from the Reference Corpus of Contemporary Portuguese (CRPC) of the Centre of Linguistics of the University of Lisbon - CLUL (version 3.0 2012, using CQPWeb in the period [month/year])"

Corpus design

The CRPC contains 309,8M words of written texts and 1,6M words of spoken recordings and transcriptions.

It is called a ‘reference’ corpus in the sense that written texts are sampled before inclusion in the corpus.

Text types

The CRPC covers several types of written texts: literary, newspaper, technical, scientific, didactic, leaflets, decisions of the supreme court of justice, parliament sessions, etc.
The CRPC includes a spoken subpart of both formal and informal speech. It covers different types of spoken interaction: monologues, dialogues, conversations, phone conversations, lectures, homilies, etc.

Time

The CRPC contains texts from the second half of the 19th century up until 2006, but most of the texts have been produced after 1970.

Portuguese varieties

The texts included in the CRPC are mainly of European Portuguese, but also from other national varieties of Portuguese in the world: Brazilian, African (Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe) and Asiatic Portuguese (Macao, Goa and East-Timor).

The map below signals all the varieties covered by the CRPC and gives information on the number of tokens for each one.
For information on the distribution of spoken and written texts per variety, see this table.

mapa_crpc_FINAL

Annotation

Cleaning the corpus

The CRPC was cleaned using Ncleaner (Evert, 2008). With this approach, the texts are first cleared of all boilerplate material using simple rules. A language model is then used to free the text from irrelevant segments (adverts, spams). Our language model was built using 200 annotated documents picked randomly from the CRPC.

Annotation

The texts are automatically tokenized using the LX tokenizer (Branco and Silva, 2004) which removes punctuation marks from words and detects sentence boundaries.

For part-of-speech tagging we trained a memory-based tagger (Daelemans et al, 1996) on a slightly adapted version of the written part of the CINTIL corpus. We use a tag set of 80 different tags.

For automatic lemmatization, we created a Portuguese version of the MBLEM lemmatizer (van den Bosch and Daelemans, 1999). MBLEM combines a dictionary lookup with machine learning to assign root forms to words. As dictionary we used the lemma list produced in the project DEP - Electronic Dictionary of Portuguese, coordinated at CLUL by Maria Elisa Macedo (the computational application was developed by João Miguel Casteleiro).

Text-to-sound alignment of the spoken subcorpus

The spoken subcorpus has been text-to-sound aligned with the software EXMARaLDA (Schmidt, 2004). A previous subset (project C-ORAL-ROM) used the alignment software WinPitch.

Online queries

The written subpart of CRPC, composed of 309M tokens, is available to be searched online through CQPWeb.

To register, please send an email to: crpc.cqpweb at gmail.com

The spoken subpart of CRPC, text-to-sound aligned, is available to be searched online through TEITOK.

Availability

Available subcorpora or derived resources
Project Name and Description	Authorship	Available at
Português Fundamental - published sample. A spoken CRPC subcorpus of 106 488 words. Aligned version (EXMARaLDA) freely available for research at ELRA.	CLUL	here aligned version: ELRA
Português Falado - Variedades Geográficas e Sociais 86 authentic speech texts with sound/text alignment Aligned version (EXMARaLDA) freely available for research at ELRA.	CLUL, Univ. de Toulouse-le-Mirail and Univ. de Provence Aix-Marseille	4 Cd-Rom edited by CLUL/Instituto Camões Also available aligned version: ELRA
PAROLE Subcorpus A 3 million word subcorpus extracted from the PAROLE corpus, containing 250 000 morphosyntactic tagged words.	CLUL and INESC	ELRA
LT Corpus A Literary Corpus that contains approximately 1,781,083 running words of 70 copyright-free classics of European and Brazilian literature (61 Portugal and 9 from Brazil) published before 1940. Freely available for research at ELRA.	CLUL	ELRA
PTPARL Corpus The PTPARL Corpus contains 1,076 transcriptions of the Portuguese Parliament sessions. Freely available for research at ELRA.	CLUL	ELRA
LMCPC A 26 980 lemma and 140 976 form lexicon with morphological and quantitative information.	CLUL, INESC and Editorial Verbo	aqui
PAROLE LEXICON A 20 000 morphosyntactic tagged unit lexicon with syntactic description.	CLUL, INESC and Editorial Verbo	ELRA
SIMPLE A 300 unit PAROLE sub-lexicon with semantic description.	CLUL, Portuguese and European partnership

History

The CRPC project started in 1988, at CLUL. The team had previously been involved in the compilation of the spoken corpus Português Fundamental and the new project was a natural follow-up to cover both spoken and written texts.
The corpus was initially planned to be balanced but later evolved into a monitor corpus. A subcorpus of 11 million words was made available for the search of concordances and frequencies and has been replaced by the CQPWeb platform, which enables a large set of search options.

Acknowledgements

Funding

Fundação Calouste Gulbenkian
Junta Nacional de Investigação Científica e Tecnológica (JNICT) - Programme Estímulo em Ciências sociais e Humanas
Fundação para a Ciência e Tecnologia (FCT) - Fundos Programáticos
Instituto Camões
União Latina
Caixa Geral de Depósitos
European Commission: LE-PAROLE and C-ORAL-ROM projects.

Data suppliers

Academia das Ciências de Lisboa
Agência Lusa
Assembleia da República
Caixa Geral de Depósitos
Centro de Informática do Ministério da Justiça
Coimbra Editora
DECO
Editora Colibri
Editora Nova Fronteira - Brasil
Editorial Verbo
Estação de Rádio TSF
Fundação Calouste Gulbenkian - Serviço de Bibliotecas e Apoio à Leitura
Instituto do Consumidor
Sociedade Bíblica Portugal
Portuguese newspapers: A Bola, A Capital, Diário de Notícias, Diário Económico, Expresso, Jornal de Notícias, Diário do Minho, O Independente, Público
Cape Verdean newspapers: A Semana, Correio Quinze, Novo Jornal
Procuradoria-Geral da República
Corpus do Português Contemporâneo (Universidade Estadual Paulista - UNESP
NURC-BR project (São Paulo e Rio de Janeiro)
PEUL project (Rio de Janeiro)
Periodical: Grande Reportagem, Ingenium, Máxima, ProTeste, Visão; Selecções do Reader's Digest

The CQPweb interface was designed and developed by Andrew Hardie. We would like to thank Andrew for helping us adapting the interface for Portuguese. We thank the NLX Group (Natural Language and Speech) at the Faculty of Sciences of the University of Lisbon, coordinated by António Branco, for granting us access to their LX-Tokenizer.

We would also like to thank Thomas Schmidt for adapting the EXMARaLDA software to previous formats used at CLUL.

Contact

If you have any question or suggestion concerning CRPC, please contact us at amaliamendes@letras.ulisboa.pt.

Research

The CRPC has been used in many MA and Ph.D studies carried out in Portugal and abroad, and in research projects such as the Dicionário da Língua Portuguesa Contemporânea of the Academia das Ciências de Lisboa, which used the CRPC as a source of quotations. It was also used in the following projects at CLUL:

References
van den Bosch, Antal and Walter Daelemans (1999) Memory-based morphological analysis. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 285-292.

Branco, António e João Silva (2004) Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese. In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (orgs.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004), Paris, ELRA, ISBN 2-9517408-1-6, pp.507-510.

Evert, Stefan (2008) A lightweight and efficient tool for cleaning web pages. In 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.

MBT: A Memory-Based Part of Speech Tagger-Generator. Walter Daelemans, Jakub Zavrel, Peter Berck and Steven Gillis. in: E. Ejerhed and I. Dagan (eds.) Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, 14-27, 1996.

Schmidt, Thomas (2004) Transcribing and annotating spoken language with EXMARaLDA. In: Proceedings of the LREC-Workshop on XML based richly annotated corpora, Lisbon 2004, Paris: ELRA.

Project PI

Members

Past members

Maria Fernanda Bacelar do Nascimento (coordenadora)
João Malaca Casteleiro
Maria Lúcia Garcia Marques
José Bettencourt Gonçalves
Raquel Amaro
Florbela Barreto
João Miguel Casteleiro
Tiago Sá
Sandra Antunes
Rita Veloso
Iris Hendrickx