UAM Spanish Treebank
Laboratorio de Lingüística Informática
Universidad Autónoma de Madrid

 
 
 






Description

    The project started in December 1997, and by September 1999 the corpus consisted of 1,500 syntactically annotated sentences extracted from newspapers (El País Digital and Compra Maestra). During this period we developed the annotation guidelines and the tools for annotation and debugging. In the current phase, we are continuing the manual annotation with the help of more human annotators and improved tools. The goal for this phase is to reach 5,000 annotated sentences. We have also started some experiments on the corpus. Future work is oriented towards semi-automatic corpus construction, based on a grammar inferred from the treebank.
 
 
 
 
 
 
 


Developing a Spanish Treebank

 

 Members

    During the first phase of the project (from December 1997 until May 2000) very few people were involved:
 
(The research of Susana López was supported by a grant from New York University.)
 
 

Preliminary goals

 

Results




Annotation guidelines

    A manual for human annotators is available: Spanish Tree Bank: Specifications, Version 5 (30 April 1999).

    The 86-page annotation guidelines include a typed inventory of categories and features, the annotation scheme, and specific directions for a great variety of Spanish phenomena.

    The trees are encoded in a nested, parenthesized structure. The elements at each level include the category (part of speech or phrasal), the features (syntactic and semantic), and the constituent nodes. The structure closely reflects the surface syntax.
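
    As an illustration of what a nested, parenthesized encoding looks like to a program, the following Python sketch reads one bracketed tree into nested (label, children) tuples. The concrete syntax assumed here, `(LABEL child child ...)` with features left out, is a simplification made up for this example and is not the treebank's actual file format. The function name `parse_tree` and the sample sentence are likewise hypothetical.

```python
import re

# Hypothetical bracket syntax: (LABEL child child ...), where a child is
# either another bracketed node or a bare word. Not the real treebank format.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]          # category of this node
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1                     # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree

# Invented example sentence, for illustration only.
sample = "(S (NP (ART El) (N corpus)) (VP (V crece)))"
tree = parse_tree(sample)
print(tree)
```

    A real reader for the treebank would also have to handle the feature lists attached to each node, which this sketch ignores.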
 
 

 



Tools

Annotation tools

 

Debugging tools

 

Debugging

    We have conducted an evaluation of the feature assignment for the first 500 sentences:

Table 1: Errors in feature assignment


                        All categories    ADJP    ADVP      NP      PP      VP
Total number of cases             6364     592     262    2933    1503    1074
Total number of errors             672      51      70     457      35      59
Percentage of errors            10.5 %   8.6 %  26.7 %  15.6 %   2.3 %   5.4 %
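
    The percentages in Table 1 are the number of errors divided by the number of cases in each category. The short Python sketch below recomputes them from the counts in the table. The variable and function names are chosen for this example only.

```python
# Counts copied from Table 1 (first 500 sentences).
cases = {"ADJP": 592, "ADVP": 262, "NP": 2933, "PP": 1503, "VP": 1074}
errors = {"ADJP": 51, "ADVP": 70, "NP": 457, "PP": 35, "VP": 59}

def error_rate(n_errors, n_cases):
    """Return the error percentage for one category."""
    return 100.0 * n_errors / n_cases

rates = {cat: error_rate(errors[cat], cases[cat]) for cat in cases}
overall = error_rate(sum(errors.values()), sum(cases.values()))

# Print categories from most to least error-prone.
for cat, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat:5s} {rate:5.2f} %")
print(f"All   {overall:5.2f} %")
```

    Printed to two decimals, the overall rate comes out at about 10.56 % and the VP rate at about 5.49 %, so the table's 10.5 % and 5.4 % appear to be truncated rather than rounded.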

Table 2: Types of errors

 
Types of errors         All categories    ADJP    ADVP      NP      PP      VP
Total                              672      51      70     457      35      59
Missing features                   422      22      29     333      25      13
Incorrect features                 226      29      40     105      10      42
Unnecessary features                24       0       1      19       0       4

    These data show that the most common errors are missing features and incorrect (replaced) features.
 

    We also learned that NPs and ADVPs are the phrases most prone to error under our feature annotation scheme. We estimate that the current error rate in feature assignment is below 5 %.
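
    The per-category counts in Table 2 can also be read as shares of each error type within a category. The Python sketch below computes these shares and the most frequent error type per category. It uses only the five per-category columns of the table, and all names in it are chosen for this example.

```python
# Per-category counts copied from Table 2.
error_types = {
    "Missing":     {"ADJP": 22, "ADVP": 29, "NP": 333, "PP": 25, "VP": 13},
    "Incorrect":   {"ADJP": 29, "ADVP": 40, "NP": 105, "PP": 10, "VP": 42},
    "Unnecessary": {"ADJP": 0,  "ADVP": 1,  "NP": 19,  "PP": 0,  "VP": 4},
}

categories = ["ADJP", "ADVP", "NP", "PP", "VP"]

def shares(category):
    """Return each error type's share (in %) of the errors in one category."""
    total = sum(counts[category] for counts in error_types.values())
    return {kind: 100.0 * counts[category] / total
            for kind, counts in error_types.items()}

# Most frequent error type per category.
dominant = {cat: max(error_types, key=lambda kind: error_types[kind][cat])
            for cat in categories}

for cat in categories:
    row = ", ".join(f"{kind} {pct:.1f} %" for kind, pct in shares(cat).items())
    print(f"{cat:5s} {row}  (most frequent: {dominant[cat]})")
```

    With these counts, missing features are the most frequent error type for NP and PP, while incorrect features are the most frequent for ADJP, ADVP and VP.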

    The phrase structure rules generator, on the other hand, detects "strange" combinations of constituents. This tool has been useful for detecting some inconsistencies.
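
    The page does not describe how the phrase structure rules generator works internally. As a rough sketch of the general idea, the Python code below reads off one rewrite rule per phrasal node and reports rules that occur very rarely, on the assumption that a "strange" combination is simply an infrequent one. The tree representation as (label, children) tuples, the function names, the `max_count` threshold, and the toy trees are all assumptions made for this example.

```python
from collections import Counter

def rules(tree):
    """Yield (parent, (child labels...)) for every phrasal node of a tree.

    Trees are nested (label, children) tuples; a bare string is a word.
    """
    label, children = tree
    kids = [c for c in children if not isinstance(c, str)]
    if kids:
        yield (label, tuple(k[0] for k in kids))
        for k in kids:
            yield from rules(k)

def rare_rules(trees, max_count=1):
    """Return the rules seen at most max_count times, as candidates for review."""
    counts = Counter(r for t in trees for r in rules(t))
    return sorted(r for r, n in counts.items() if n <= max_count)

# Toy trees invented for illustration; not taken from the treebank.
t1 = ("S", [("NP", [("ART", ["El"]), ("N", ["corpus"])]), ("VP", [("V", ["crece"])])])
t2 = ("S", [("NP", [("ART", ["La"]), ("N", ["frase"])]), ("VP", [("V", ["termina"])])])
t3 = ("S", [("VP", [("V", ["crece"])]), ("NP", [("ART", ["el"]), ("N", ["corpus"])])])

for parent, kids in rare_rules([t1, t2, t3]):
    print(parent, "->", " ".join(kids))
```

    A rare rule is not necessarily an error, so a list like this is best treated as a queue for manual inspection by the annotators.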
 

    The coherence and quality of the analysis are checked manually by the human annotators.

 



Some experiments

    We have used the treebank to train a statistical parser, the Apple Pie Parser (Sekine, 1995). The APP works with a probabilistic context-free grammar and probabilistic information about parts of speech. With it we have obtained an efficient system for finding the most likely analysis, and we have been able to assess the present corpus.
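
    A probabilistic context-free grammar attaches a probability to each rewrite rule. A common way to obtain these probabilities from a treebank is relative-frequency estimation: the probability of a rule is its count divided by the total count of rules with the same left-hand side. The Python sketch below illustrates that estimate only. It is not the Apple Pie Parser's training procedure, and the function name and the rule counts are invented for the example.

```python
from collections import Counter, defaultdict

def estimate_pcfg(rule_counts):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

# Invented counts for illustration only; not taken from the treebank.
toy_counts = Counter({
    ("NP", ("ART", "N")): 6,
    ("NP", ("N",)): 3,
    ("NP", ("ART", "N", "PP")): 1,
    ("VP", ("V", "NP")): 3,
    ("VP", ("V",)): 1,
})

probs = estimate_pcfg(toy_counts)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

    By construction, the probabilities of all rules sharing a left-hand side sum to 1, which is what lets a parser compare competing analyses of the same sentence.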
 
 


Current work

 
    In recent years, we have been conducting two different experiments:

Papers

·Alcántara, M., 2005. Anotación y recuperación de información semántica eventiva en corpus. PhD Thesis.

·Alcántara, M. and A. Moreno, 2004. Syntax to Semantics Transformation: Application to Treebanking. In Proc. Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004. Boston, 2-7 May 2004.

·Moreno, A. and S. López, 1999. Developing a Spanish Tree Bank. In Proc. Journées ATALA, Corpus annotés pour la syntaxe. Paris, 18-19 June 1999.

·Moreno, A., R. Grishman, S. López, F. Sánchez and S. Sekine, 2000. A Treebank of Spanish and its Application to Parsing. Available as morenoetal.ps (6496863 bytes) and morenoetal.ps.gz (84118 bytes).
 
 
 
 


Examples

 



 



 

Research License:

The UAM Spanish Treebank is available for free. However, we ask you to acknowledge the UAM Spanish Treebank license agreement for non-commercial use and to send it to us.
The Treebank is not available for commercial purposes.

1. Download the license agreement (ONLY for research purposes).
2. Send it to the contact address (see below) or via fax (+34 914974498).
 
 
 
 


 

Contact address:

Laboratorio de Lingüística Informática
Dept. de Lingüística
Universidad Autónoma de Madrid
E-28049 Madrid, Spain

e-mail: antonio.msandoval@uam.es
Web:     www.lllf.uam.es
 
 
 
 
 
 
