Laboratorio de Lingüística Informática

UAM Spanish Treebank

New version

The new version of UAM Spanish Treebank includes the annotation of negation and its scope in the 1,501 sentences which compose the corpus. For each sentence, negation cues, negative concordance and their scope have been marked. Also, some preliminary statistical results regarding frequency and function of negative elements annotated in the corpus were extracted. This new version of the corpus is also freely available.

This work was carried out by Drs. Marta Garrote and Antonio Moreno.

Description

The project started in December 1997, and by September 1999 the corpus consisted of 1,500 syntactically annotated sentences extracted from newspapers (El País Digital and Compra Maestra). During this period we developed the annotation guidelines and the tools for annotating and debugging. In the current phase, we are continuing the manual annotation with the help of more human annotators and improved tools. The goal for this phase is to reach 5,000 annotated sentences. We have also started some experiments on the corpus. Future work is oriented towards semi-automatic corpus construction, based on a grammar inferred from the treebank.

Developing a Spanish Treebank

Members

During the first phase of the project (from December 1997 until May 2000) very few people were involved:

(the research of Susana López has been supported by a grant from New York University).

Results

Annotation guidelines

A manual for human annotators is available: Spanish Tree Bank: Specifications, Version 5 (30 April 1999).

The 86-page annotation guidelines include a typed inventory of categories and features, the annotation scheme, and specific directions for a great variety of Spanish phenomena.

The trees are encoded in a nested, parenthesized structure, with the elements at each level including the part of speech or phrasal category, the syntactic and semantic features, and the constituent nodes. The structure closely reflects the surface syntax.
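As a rough illustration of this kind of encoding, the sketch below reads one nested, parenthesized tree into Python tuples. The bracket syntax, the category labels and the example sentence are assumptions chosen for illustration; the actual notation, including how syntactic and semantic features are attached to each node, is the one defined in the annotation guidelines, and features are ignored here.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (word forms) are kept as plain strings.
    """
    # Tokens are "(", ")" or any run of characters without spaces or brackets.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def peek():
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        return tokens[pos]

    def node():
        nonlocal pos
        if peek() != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = peek()  # category of this constituent
        pos += 1
        children = []
        while peek() != ")":
            if peek() == "(":
                children.append(node())  # nested constituent
            else:
                children.append(tokens[pos])  # word form
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(tree):
    """Return the word forms of a tree in surface order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


# Hypothetical example sentence and labels, not taken from the corpus.
example = "(S (NP (ART El) (N perro)) (VP (V ladra)))"
tree = parse_tree(example)
```

Because the structure follows the surface syntax, reading the leaves from left to right recovers the original word order of the sentence.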

Tools

Annotation tools
Debugging tools

Debugging

We have conducted an evaluation of the feature assignment for the first 500 sentences:

Table 1: Errors in feature assignment

                        All categories    ADJP    ADVP      NP      PP      VP
Total number of cases             6364     592     262    2933    1503    1074
Total number of errors             672      51      70     457      35      59
Percentage of errors            10.5 %   8.6 %  26.7 %  15.6 %   2.3 %   5.4 %

Table 2: Types of errors

Types of errors         All categories    ADJP    ADVP      NP      PP      VP
Total                              672      51      70     457      35      59
Missing features                   422      22      29     333      25      13
Incorrect features                 226      29      40     105      10      42
Unnecessary features                24       0       1      19       0       4

These data show that the most common errors are missing features and incorrect features.

We also learned that NPs and ADVPs are the phrases most prone to error under our feature annotation scheme. We estimate that the current error rate in feature assignment is below 5 %.
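The per-category error rates can be recomputed from the counts in Table 1. The following is a minimal sketch; the variable and function names are ours, and the last digit of a computed rate can differ slightly from the published figure depending on how it is rounded.

```python
# Counts copied from Table 1 (first 500 sentences).
cases = {"ADJP": 592, "ADVP": 262, "NP": 2933, "PP": 1503, "VP": 1074}
errors = {"ADJP": 51, "ADVP": 70, "NP": 457, "PP": 35, "VP": 59}


def error_rate(n_errors, n_cases):
    """Percentage of cases whose features were assigned wrongly."""
    return 100.0 * n_errors / n_cases


rates = {cat: error_rate(errors[cat], cases[cat]) for cat in cases}
overall = error_rate(sum(errors.values()), sum(cases.values()))

# Categories ordered from most to least error-prone.
ranked = sorted(rates, key=rates.get, reverse=True)

for cat in ranked:
    print(f"{cat:5s} {rates[cat]:5.1f} %")
print(f"All   {overall:5.1f} %")
```

Ranking the categories this way puts ADVP and NP first, which matches the observation above.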

The phrase structure rules generator, on the other hand, detects "strange" combinations of constituents. This tool has been useful for detecting some inconsistencies.
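The internals of the generator are not described here, so the following is only one plausible way to obtain the same effect: read off a phrase structure rule for every internal node of every tree, count the rules, and flag those that occur very rarely as candidates for manual inspection. The tuple representation of trees, the toy sentences and the `max_count` threshold are all assumptions made for this sketch.

```python
from collections import Counter


def phrase_rules(tree):
    """Yield (parent, child_labels) for every node that has constituent children.

    A tree is a (label, children) tuple; word forms are plain strings.
    """
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        yield (label, tuple(c[0] for c in subtrees))
        for child in subtrees:
            yield from phrase_rules(child)


def rare_rules(trees, max_count=1):
    """Return the rules seen at most max_count times in the whole treebank."""
    counts = Counter(rule for t in trees for rule in phrase_rules(t))
    return sorted(rule for rule, n in counts.items() if n <= max_count)


# Toy treebank (hypothetical); the third tree has the article after the noun.
toy_treebank = [
    ("S", [("NP", [("ART", ["El"]), ("N", ["perro"])]), ("VP", [("V", ["ladra"])])]),
    ("S", [("NP", [("ART", ["La"]), ("N", ["niña"])]), ("VP", [("V", ["canta"])])]),
    ("S", [("NP", [("N", ["gato"]), ("ART", ["el"])]), ("VP", [("V", ["duerme"])])]),
]

suspicious = rare_rules(toy_treebank)
```

In this toy example only the rule NP -> N ART is flagged. A rare rule is not necessarily wrong, so the output is a list of places for an annotator to look rather than a list of errors.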

The coherence and quality of the analyses are checked manually by the human annotators.

Some experiments

We have used the treebank to train a statistical parser, the Apple Pie Parser (Sekine, 1995). The APP works with a probabilistic context-free grammar and probabilistic information about parts of speech. With it we have obtained an efficient system for finding the most likely analysis.
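To make the idea of a probabilistic context-free grammar concrete, the sketch below estimates rule probabilities from a treebank by relative frequency and scores a tree as the product of the probabilities of its rules. This is the textbook estimation method, not the Apple Pie Parser's own training procedure; the tree representation and the toy sentences are assumptions made for illustration, and lexical (part-of-speech) probabilities are left out.

```python
import math
from collections import Counter


def phrase_rules(tree):
    """Yield (parent, child_labels) for every node that has constituent children."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        yield (label, tuple(c[0] for c in subtrees))
        for child in subtrees:
            yield from phrase_rules(child)


def estimate_pcfg(trees):
    """P(A -> beta) = count(A -> beta) / count(A), counted over all trees."""
    rule_counts = Counter(rule for t in trees for rule in phrase_rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


def tree_log_prob(tree, probs):
    """Log-probability of a tree: sum of the log-probabilities of its rules."""
    return sum(math.log(probs[rule]) for rule in phrase_rules(tree))


# Toy treebank (hypothetical sentences and labels).
t1 = ("S", [("NP", [("ART", ["El"]), ("N", ["perro"])]), ("VP", [("V", ["ladra"])])])
t2 = ("S", [("NP", [("ART", ["La"]), ("N", ["niña"])]),
            ("VP", [("V", ["come"]), ("NP", [("N", ["pan"])])])])
t3 = ("S", [("NP", [("N", ["Juan"])]), ("VP", [("V", ["duerme"])])])

probs = estimate_pcfg([t1, t2, t3])
score_t1 = tree_log_prob(t1, probs)
```

A parser built on such a grammar compares the competing analyses of a sentence by scores of this kind and returns the one with the highest probability.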

Current work

In recent years, we have been conducting two different experiments:

Papers

Examples

Research License

The UAM Spanish Treebank is available for free. However, we ask you to acknowledge the UAM Spanish Treebank license agreement for non-commercial use and to send it to us.

The Treebank is not available for commercial purposes.

1. Download the license agreement (ONLY for research purposes).
2. Send it to the contact address (see below) or via fax (+34 914974498).

Contact address:

Laboratorio de Lingüística Informática
Dept. de Lingüística
Universidad Autónoma de Madrid
E-28049 Madrid, Spain

e-mail: antonio.msandoval@uam.es
Web: www.lllf.uam.es