Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid
The research of Susana López has been supported by a grant from New York University.
- Linguists: (guidelines, data selection, tagging and debugging)
- Antonio Moreno
- Susana López
- Manuel Alcántara
- Computational linguists: (tools for annotation and debugging)
- Fernando Sánchez
- Ralph Grishman
- To develop our own guidelines, incorporating the relevant features of Spanish into the mainstream of corpus annotation.
- Guidelines based on experience in annotating real texts.
- To develop some tools for helping human annotators with the tagging and debugging tasks.
- 1,600 syntactically annotated sentences.
- Guidelines. An 86-page manual (Specifications. Version 5, 30 April 1999).
- Tools
The 86-page annotation guidelines include a typed inventory of categories and features, the annotation scheme, and specific directions for a great variety of Spanish phenomena.
The trees are encoded in a nested, parenthesized structure. The elements at each level are the category (part of speech or phrasal), the features (syntactic and semantic), and the constituent nodes. The structure closely reflects the surface syntax.
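As an illustration of the nested, parenthesized encoding, the following minimal sketch reads one bracketed tree into a nested Python structure. The concrete notation is an assumption made for this example only (category first, then bare feature names, then either a quoted word or nested constituents); the actual format is the one defined in the guidelines. The names `TOKEN` and `parse` and the feature labels are hypothetical.

```python
import re

# Tokens: parentheses, quoted words, or bare atoms (categories and features).
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse(text):
    """Parse one parenthesized tree into nested dicts (assumed notation)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        if pos >= len(tokens):
            raise ValueError("unbalanced parentheses")
        return tokens[pos]

    def node():
        nonlocal pos
        if peek() != "(":
            raise ValueError("expected '('")
        pos += 1
        category = peek()          # first element: POS or phrasal category
        pos += 1
        features, children, word = [], [], None
        while peek() != ")":
            tok = peek()
            if tok == "(":         # nested constituent
                children.append(node())
            elif tok.startswith('"'):  # quoted word form at a leaf
                word = tok.strip('"')
                pos += 1
            else:                  # bare atom: treated as a feature
                features.append(tok)
                pos += 1
        pos += 1                   # consume ")"
        return {"cat": category, "features": features,
                "word": word, "children": children}

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


tree = parse('(S (NP SG (N SG "perro")) (VP (V PRES "ladra")))')
print(tree["cat"], [child["cat"] for child in tree["children"]])
```

Each node keeps the three kinds of element named above (category, features, constituents), so the surface order of the sentence can be read directly from the order of the children.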
- A statistical POS tagger, which provides the most
frequent category and inflectional features for each word.
We use the tagger described in Sánchez, Ramírez & Declerck
(1999).
- A "chunker" that recognizes NPs, VPs, PPs and ADJPs (developed
by F. Sánchez)
- A sentence selector that randomly picks sentences out of the source.
Some variables like text type or sentence length can be set (developed
by F. Sánchez)
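The behaviour of a sentence selector of this kind can be sketched as follows. This is not the tool developed for the project, only a minimal illustration under stated assumptions: the corpus is a list of (text type, sentence) pairs, sentence length is counted in whitespace-separated tokens, and the function name `select_sentences` and its parameters are hypothetical.

```python
import random


def select_sentences(corpus, n, text_type=None, min_len=1, max_len=None, seed=None):
    """Randomly pick up to n sentences that match the given settings.

    corpus: list of (text_type, sentence) pairs (assumed layout).
    Length is measured in whitespace-separated tokens.
    """
    rng = random.Random(seed)  # a seed makes the selection reproducible
    pool = []
    for ttype, sentence in corpus:
        length = len(sentence.split())
        if text_type is not None and ttype != text_type:
            continue
        if length < min_len or (max_len is not None and length > max_len):
            continue
        pool.append(sentence)
    # Sample without replacement; return everything if the pool is small.
    return rng.sample(pool, min(n, len(pool)))


corpus = [
    ("news", "El gobierno aprobó la ley ayer ."),
    ("news", "Llueve ."),
    ("essay", "La lengua cambia con el tiempo ."),
]
print(select_sentences(corpus, 2, text_type="news", min_len=3, seed=0))
```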
- A graphical tree-drawer for the annotated sentences. We use a publicly released program called the Computational Linguistics Interactive Grapher (CLIG, http://www.ags.uni-sb.de/~konrad/clig.html), developed by Karsten Konrad at Saarbrücken.
- A feature checker that verifies that each category has been assigned the proper features (developed by R. Grishman).
- A phrase structure rules generator, which is used to detect possible incorrect annotations (developed by R. Grishman).
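The idea behind a feature checker can be sketched in a few lines: every node of a tree is compared against a typed inventory that lists the features allowed for its category. The code below is an illustration only, not the tool written by R. Grishman. The inventory `ALLOWED`, the feature names, and the (category, features, children) tuple layout are all assumptions made for this example.

```python
# Hypothetical inventory: category -> set of features it may carry.
ALLOWED = {
    "S": set(),
    "NP": {"SG", "PL", "MASC", "FEM"},
    "N": {"SG", "PL", "MASC", "FEM"},
    "VP": {"PRES", "PAST"},
    "V": {"PRES", "PAST", "SG", "PL"},
}


def check_features(node, path=()):
    """Return a list of (path, message) pairs for every feature problem.

    node: (category, features, children) tuple (assumed layout).
    """
    category, features, children = node
    here = path + (category,)
    errors = []
    if category not in ALLOWED:
        errors.append(("/".join(here), "unknown category"))
    else:
        for feature in features:
            if feature not in ALLOWED[category]:
                errors.append(("/".join(here), f"feature {feature} not allowed"))
    for child in children:
        errors.extend(check_features(child, here))
    return errors


tree = ("S", [], [("NP", ["SG", "PRES"], []), ("VP", ["PRES"], [])])
for location, message in check_features(tree):
    print(location, message)
```

A check of this kind only finds features that are not licensed for a category; detecting a feature that is missing would additionally need a list of obligatory features per category.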
Table 1: Errors in feature assignment (rows: total number of cases, total number of errors, percentage of errors)
Table 2: Types of errors
These data show that the most common errors are missing features and substituted features.
We also learned that NPs and ADVPs are the phrases most prone to error under our feature annotation scheme. We estimate that the current error rate in feature assignment is below 5%.
The phrase structure rules generator, on the other
hand, detects "strange" combinations of constituents. This tool has been
useful for detecting some inconsistencies.
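One way such a generator can surface "strange" combinations is to read a phrase structure rule off every node of every tree, count the rules, and list those that occur very rarely. The sketch below follows that idea; whether the project's tool uses frequency in this way is an assumption, and the tuple layout (category, children) and the names `rules` and `rare_rules` are hypothetical.

```python
from collections import Counter


def rules(node):
    """Yield one (parent, child categories) rule per non-leaf node.

    node: (category, children) tuple; leaves have an empty child list.
    """
    category, children = node
    if children:
        yield (category, tuple(child[0] for child in children))
        for child in children:
            yield from rules(child)


def rare_rules(trees, max_count=1):
    """Return the rules seen at most max_count times in the treebank."""
    counts = Counter(rule for tree in trees for rule in rules(tree))
    return sorted(rule for rule, count in counts.items() if count <= max_count)


t1 = ("S", [("NP", []), ("VP", [])])
t2 = ("S", [("NP", []), ("VP", [])])
t3 = ("S", [("VP", []), ("NP", [])])
for parent, children in rare_rules([t1, t2, t3]):
    print(parent, "->", " ".join(children))
```

A rare rule is not necessarily an error, so the output is best treated as a list of candidates for a human annotator to review.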
The coherence and quality of the analysis are checked manually by the human annotators.
In recent years, we have been conducting two different experiments:
- To enlarge the treebank with a simpler annotation to test whether the richer information is relevant for rule induction in Spanish.
- To implement a syntax to semantics transformation using the SESCO tag set.
- Alcántara, M., 2005. Anotación y recuperación de información semántica eventiva en corpus. PhD Thesis.
- Alcántara, M. and A. Moreno, 2004. Syntax to Semantics Transformation: Application to Treebanking. In Proc. Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004. Boston, 2-7 May 2004.
- Moreno, A. and S. López, 1999. Developing a Spanish Tree Bank. In Proc. Journées ATALA, Corpus annotés pour la syntaxe. Paris, 18-19 June 1999.
- Moreno, A., R. Grishman, S. López, F. Sánchez and S. Sekine, 2000. A Treebank of Spanish and its Application to Parsing. Available as morenoetal.ps (6496863 bytes) and morenoetal.ps.gz (84118 bytes).
e-mail: antonio.msandoval@uam.es
Web: www.lllf.uam.es