Laboratorio de Lingüística Informática
The new version of UAM Spanish Treebank includes the annotation of negation and its scope in the 1,501 sentences which compose the corpus. For each sentence, negation cues, negative concordance and their scope have been marked. Also, some preliminary statistical results regarding frequency and function of negative elements annotated in the corpus were extracted. This new version of the corpus is also freely available.
This work was carried out by Drs. Marta Garrote and Antonio Moreno.
The project started in December 1997, and by September
1999 the corpus consists of 1,500 syntactically annotated sentences extracted
from newspapers (El País Digital and Compra Maestra).
In this period we have developed the annotation guidelines and the tools for
annotating and debugging. In the current phase, we continue the manual
annotation with the help of more human annotators and improved tools. The
goal for this phase is to reach 5,000 annotated sentences. We have also started
some experiments on the corpus. Future work is oriented toward semi-automatic
corpus construction, based on a grammar inferred from the treebank.
During the first phase of the project (from December 1997 until May 2000) very few people have been involved:
The 86-page annotation guidelines include a typed inventory of categories and features, the annotation scheme, and specific directions for a great variety of Spanish phenomena.
The trees are encoded in a nested, parenthesized structure,
with the elements at each level including the part of speech or phrasal
category, the syntactic and semantic features, and the constituent nodes.
The structure closely reflects the surface syntax.
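As an illustration of how such a nested, parenthesized encoding can be read, here is a minimal sketch in Python. The concrete bracket syntax, the sample sentence, and the names `tokenize` and `parse` are assumptions made for this example; the exact file format of the treebank is defined in the annotation guidelines and may differ. Each node is taken to be `(LABEL atom... child...)`, where bare atoms stand for features or word forms.

```python
import re

def tokenize(text):
    # Split the input into parentheses and bare symbols (assumed syntax).
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens, pos=0):
    """Parse one node starting at tokens[pos]; return (node, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    atoms, children = [], []
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":
            # A nested constituent: recurse and continue after it.
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            # A bare symbol: a feature or a word form in this sketch.
            atoms.append(tokens[pos])
            pos += 1
    if pos >= len(tokens):
        raise ValueError("unbalanced parentheses")
    return {"label": label, "atoms": atoms, "children": children}, pos + 1

# Hypothetical sample; not taken from the corpus.
sample = "(S (NP SG MASC (ART el) (N perro)) (VP (V ladra)))"
tree, _ = parse(tokenize(sample))
print(tree["label"], [child["label"] for child in tree["children"]])
```

Keeping features and constituents in separate lists on each node mirrors the description above, where every level carries a category, its features, and its constituent nodes.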
Table 1: Error in feature assignment
Rows: Total number of cases; Total number of errors; Percentage of errors.
Table 2: Types of errors
These data show that the most common errors are missing
features and substituted features.
We also learned that NPs and ADVPs are the phrases most prone to error under our feature annotation scheme. We estimate that the current error rate in feature assignment is below 5%.
The phrase structure rules generator, on the other
hand, detects "strange" combinations of constituents. This tool has been
useful for detecting some inconsistencies.
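One way to picture this kind of check is to read the phrase structure rules off the annotated trees and list the ones that occur very rarely. The sketch below is only an assumption about how such a detector could work: the source does not describe the tool's actual criteria, and the tuple-based tree representation, the toy trees, and the names `extract_rules` and `rare_rules` are hypothetical.

```python
from collections import Counter

def extract_rules(tree):
    """Yield (parent, child labels) for every phrasal node of a tree.

    A tree is a tuple (label, children); children are subtrees or words.
    """
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        yield (label, tuple(c[0] for c in subtrees))
        for c in subtrees:
            yield from extract_rules(c)

def rare_rules(trees, max_count=1):
    # Rules seen at most max_count times are candidates for manual review.
    counts = Counter(r for t in trees for r in extract_rules(t))
    return sorted(r for r, n in counts.items() if n <= max_count)

# Toy trees, invented for the example; the third has an odd NP.
treebank = [
    ("S", [("NP", [("ART", ["el"]), ("N", ["perro"])]), ("VP", [("V", ["ladra"])])]),
    ("S", [("NP", [("ART", ["la"]), ("N", ["niña"])]), ("VP", [("V", ["canta"])])]),
    ("S", [("NP", [("N", ["gato"]), ("ART", ["el"])]), ("VP", [("V", ["duerme"])])]),
]
print(rare_rules(treebank))
```

A rare rule is not necessarily an error, so the output is best treated as a list of places for an annotator to look at, not as automatic corrections.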
The coherence and quality of the analysis are checked manually by the human annotators.
We have used the treebank to train a statistical parser, the Apple Pie Parser (Sekine, 1995). The APP works with a probabilistic context-free grammar and probabilistic information about the parts of speech. With it, we have obtained an efficient system for finding the most likely analysis.
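To make the idea of a probabilistic context-free grammar concrete, the sketch below estimates rule probabilities from a handful of trees by relative frequency and scores a tree as the product of its rule probabilities. This is the textbook maximum-likelihood estimate, shown only as an illustration; it is not a description of how the Apple Pie Parser is trained, and the toy trees and the names `phrasal_rules`, `estimate_pcfg`, and `tree_probability` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def phrasal_rules(tree):
    """Yield (parent, child labels) for every phrasal node of a tree."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        yield (label, tuple(c[0] for c in subtrees))
        for c in subtrees:
            yield from phrasal_rules(c)

def estimate_pcfg(trees):
    # P(A -> beta) = count(A -> beta) / count(A), by relative frequency.
    rule_counts = Counter(r for t in trees for r in phrasal_rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {r: Fraction(n, lhs_counts[r[0]]) for r, n in rule_counts.items()}

def tree_probability(tree, probs):
    # Product of the probabilities of the rules used in the tree;
    # a rule never seen in training gets probability 0 in this sketch.
    p = Fraction(1)
    for r in phrasal_rules(tree):
        p *= probs.get(r, Fraction(0))
    return p

# Toy trees, invented for the example.
treebank = [
    ("S", [("NP", [("ART", ["el"]), ("N", ["perro"])]), ("VP", [("V", ["ladra"])])]),
    ("S", [("NP", [("ART", ["la"]), ("N", ["niña"])]), ("VP", [("V", ["canta"])])]),
    ("S", [("NP", [("N", ["gato"]), ("ART", ["el"])]), ("VP", [("V", ["duerme"])])]),
]
probs = estimate_pcfg(treebank)
print(probs[("NP", ("ART", "N"))], tree_probability(treebank[2], probs))
```

Among competing analyses of the same sentence, a parser of this kind prefers the one with the highest probability. Lexical (part-of-speech) probabilities are left out here to keep the sketch short.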
The UAM Spanish Treebank is available for free. However, we ask you to acknowledge the UAM Spanish Treebank license agreement for non-commercial use and to send it to us.
The Treebank is not available for commercial purposes.
1. Download the license agreement (ONLY for research purposes).
e-mail: antonio.msandoval@uam.es
Web: www.lllf.uam.es