This tool is a named-entity recognition (NER) classifier based on memory-based learning. It was trained on the CINTIL dataset, a corpus that contains part of the Corpus de Referência do Português Contemporâneo (CRPC, Reference Corpus of Contemporary Portuguese).
Availability
The tool is freely available on the PORTULAN CLARIN infrastructure.
Annotation
The annotation distinguishes six named-entity categories:
EVT - Event
LOC - Location
ORG - Organization
PER - Person
WRK - Work
MSC - Miscellaneous
The tool applies tags to each token:
/O indicates that the token is not (part of) a named entity
/B indicates that the token is the first unit of a named entity; the category follows the B, as in /B-PER
/I indicates that the token is a middle or last unit of a named entity; the category follows the I, as in /I-PER
The output has one sentence per line, with each token followed by a slash and its tag:
De_/O a/O parte/O de_/O a/O tarde/O ,*/O Maria/B-PER Cristina/B-PER Portugal/I-PER ,*/O advogada/O ,*/O moderou/O o/O painel/O \*"/O Restrições/B-WRK a_/I-WRK o/I-WRK Conteúdo/I-WRK de_/I-WRK a/I-WRK Publicidade/I-WRK "/O ,*/O em/O que/O se/O abordaram/O duas/O temáticas/O <utt>
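The token/tag format described above can be read back into entity spans with a few lines of code. The following is a minimal sketch, not part of the tool itself: the function names parse_line and extract_entities are hypothetical, and the sketch assumes that the tag is whatever follows the last slash of each whitespace-separated item. The sample string is an excerpt of the example output, written with single slashes.

```python
def parse_line(line):
    """Split one tagged sentence into (token, tag) pairs.

    Assumption: the tag is the text after the LAST slash of each item,
    so tokens that themselves contain a slash are still handled.
    """
    pairs = []
    for item in line.split():
        if "/" not in item:
            continue  # skip untagged markers such as <utt>
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs):
    """Group B-/I- tagged tokens into (category, [tokens]) entities."""
    entities = []
    current = None  # entity being built, or None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current is not None:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag of the same category continues the open entity.
            current[1].append(token)
        else:
            # An O tag (or a stray I- tag) closes any open entity.
            if current is not None:
                entities.append(current)
            current = None
    if current is not None:
        entities.append(current)
    return entities


# Excerpt of the example output (single slashes).
sample = ('Restrições/B-WRK a_/I-WRK o/I-WRK Conteúdo/I-WRK de_/I-WRK '
          'a/I-WRK Publicidade/I-WRK "/O ,*/O em/O que/O <utt>')

for category, tokens in extract_entities(parse_line(sample)):
    print(category, " ".join(tokens))
```

Splitting on the last slash rather than the first is a deliberate choice here, since the tokenizer marks such as `,*` and `a_` are kept as part of the token and only the final field is treated as the tag.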
Evaluation
The NER tool was evaluated by splitting the CINTIL corpus into 50k for training and for testing.
This gave the following accuracy, precision, recall and F1 scores on the held-out test set:
processed 211479 tokens with 10631 phrases; found: 10628 phrases; correct: 10409.
accuracy: 99.72%; precision: 97.94%; recall: 97.91%; FB1: 97.93
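The precision, recall and FB1 figures can be rederived from the phrase counts reported above. The sketch below assumes the usual phrase-level definitions (precision = correct / found, recall = correct / gold phrases, FB1 = harmonic mean of the two); the function name chunk_scores is hypothetical. Accuracy is a token-level figure and cannot be recomputed from the phrase counts alone.

```python
def chunk_scores(gold, found, correct):
    """Phrase-level precision, recall and F1 from raw counts.

    gold    -- number of phrases in the reference annotation
    found   -- number of phrases proposed by the tagger
    correct -- number of proposed phrases that match the reference
    """
    precision = correct / found
    recall = correct / gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the evaluation output above.
precision, recall, f1 = chunk_scores(gold=10631, found=10628, correct=10409)
print(f"precision: {precision:.2%}; recall: {recall:.2%}; FB1: {100 * f1:.2f}")
# prints: precision: 97.94%; recall: 97.91%; FB1: 97.93
```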