The CRPC-DB is a Discourse Bank for Portuguese annotated according to the Penn Discourse Treebank scheme (Prasad et al., 2008). The corpus is labeled for discourse relations (also referred to as rhetorical relations or coherence relations), such as cause and condition, that hold between two spans of text and contribute to the overall cohesion and coherence of the text. The scheme follows the principles of the PDTB 2.0 annotation proposal and the PDTB 3.0 sense hierarchy (Webber et al., 2016). The annotation covers 319 files of the PAROLE corpus, a subset of the Reference Corpus of Contemporary Portuguese (CRPC) (Généreux et al., 2012). The files are newspaper, fiction, and didactic/scientific texts.
A relation is considered Explicit when an overt connective denotes the meaning of the relation. Connectives include single- or multi-word conjunctions, prepositions and adverbs. A discourse relation may also be lexically expressed by elements that do not fall into the category of connectives. In these cases, we follow the PDTB and treat them as alternative lexicalizations (AltLex), such as "a razão é que" ("the reason is that") or "um exemplo disso é" ("an example of this is"). When no connective or alternative lexicalization is found, the relation is considered Implicit and the annotator has to supply a connective that could occur in that context. EntRel is used when an entity relation holds between sentences (the second argument gives more information about an entity referred to in the first argument), and NoRel when there is no discernible relation. For each Explicit, Implicit and AltLex relation, a sense is provided from the three-level sense hierarchy of the PDTB 3.0. The hierarchy has four top-level senses: Temporal, Contingency, Comparison and Expansion.
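The relation types and sense constraints described above can be sketched as a small data model. This is only an illustration: the class and field names (RelationType, Relation, validate) are hypothetical and do not reflect the distribution format of the CRPC-DB, and the argument texts are invented placeholders. The rule that EntRel and NoRel carry no sense is an assumption read off the description, which lists senses only for Explicit, Implicit and AltLex relations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RelationType(Enum):
    EXPLICIT = "Explicit"   # overt connective
    IMPLICIT = "Implicit"   # connective supplied by the annotator
    ALTLEX = "AltLex"       # alternative lexicalization
    ENTREL = "EntRel"       # entity-based relation
    NOREL = "NoRel"         # no discernible relation


# The four top-level senses named in the description.
TOP_LEVEL_SENSES = {"Temporal", "Contingency", "Comparison", "Expansion"}

# Relation types for which a sense is provided.
SENSE_BEARING = {RelationType.EXPLICIT, RelationType.IMPLICIT, RelationType.ALTLEX}


@dataclass
class Relation:
    # Hypothetical record layout, not the corpus file format.
    rel_type: RelationType
    arg1: str
    arg2: str
    connective: Optional[str] = None  # overt, supplied, or AltLex expression
    sense: Optional[str] = None       # dot-separated label, at most 3 levels


def validate(rel: Relation) -> List[str]:
    """Return a list of problems; an empty list means the record is consistent."""
    problems = []
    if rel.rel_type in SENSE_BEARING:
        if not rel.connective:
            problems.append("missing connective or lexicalization")
        if not rel.sense:
            problems.append("missing sense")
        else:
            levels = rel.sense.split(".")
            if levels[0] not in TOP_LEVEL_SENSES:
                problems.append("unknown top-level sense: " + levels[0])
            if len(levels) > 3:
                problems.append("sense has more than 3 levels")
    elif rel.sense:
        # Assumption: EntRel and NoRel are annotated without a sense.
        problems.append("EntRel/NoRel relations carry no sense")
    return problems


# Invented example: an AltLex relation with a three-level sense label.
example = Relation(
    rel_type=RelationType.ALTLEX,
    arg1="(first argument span)",
    arg2="(second argument span)",
    connective="a razão é que",
    sense="Contingency.Cause.Reason",
)
print(validate(example))
```

Keeping the sense as a dot-separated string makes it easy to compare annotations at any depth of the hierarchy, for example by taking only the first component when counting top-level senses.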
References
Généreux, M., Hendrickx, I., and Mendes, A. (2012). Introducing the Reference Corpus of Contemporary Portuguese on-line. In Nicoletta Calzolari, et al., editors, LREC’2012 – Eighth International Conference on Language Resources and Evaluation, pages 2237–2244, Istanbul, Turkey, May. European Language Resources Association (ELRA).
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A. K., and Webber, B. L. (2008). The Penn Discourse Treebank 2.0. In LREC2008.
Webber, B., Prasad, R., Lee, A., and Joshi, A. (2016). A discourse-annotated corpus of conjoined VPs. In Proceedings of the 10th Linguistics Annotation Workshop, pages 22–31.