Laboratorio de Lingüística Informática

UAM Spanish Treebank

New version

The new version of UAM Spanish Treebank includes the annotation of negation and its scope in the 1,501 sentences which compose the corpus. For each sentence, negation cues, negative concordance and their scope have been marked. Also, some preliminary statistical results regarding frequency and function of negative elements annotated in the corpus were extracted. This new version of the corpus is also freely available.

This work was carried out by Drs. Marta Garrote and Antonio Moreno.

Description

The project started in December 1997, and by September 1999 the corpus consisted of 1,500 syntactically annotated sentences extracted from newspapers (El País Digital and Compra Maestra). During this period we developed the annotation guidelines and the tools for annotating and debugging. In the current phase, we are continuing the manual annotation with the help of more human annotators and improved tools. The goal for this phase is to reach 5,000 annotated sentences. We have also started some experiments on the corpus. Future work is oriented towards semi-automatic corpus construction, based on a grammar inferred from the treebank.

Developing a Spanish Treebank

Members

During the first phase of the project (from December 1997 until May 2000) very few people were involved:

Linguists (guidelines, data selection, tagging and debugging):

Antonio Moreno
Susana López
Manuel Alcántara

Computational linguists (tools for annotation and debugging):

Fernando Sánchez
Ralph Grishman

(The research of Susana López was supported by a grant from New York University.)

Results

1,600 syntactically annotated sentences.

Guidelines. An 86-page manual (Specifications. Version 5, 30 April 1999).

Tools

Annotation guidelines

A manual for human annotators is available: Spanish Tree Bank: Specifications, Version 5 (30 April 1999).

The 86-page annotation guidelines include a typed inventory of categories and features, the annotation scheme, and specific directions for a great variety of Spanish phenomena.

The trees are encoded in a nested, parenthesized structure, with the elements at each level including the part of speech or phrasal category, the syntactic and semantic features, and the constituent nodes. The structure closely reflects the surface syntax.
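
As an illustration of this kind of encoding, the following minimal Python sketch reads a nested, parenthesized tree into a small node structure. The bracket syntax used here (category first, then bare feature symbols, then quoted words or nested constituents), the feature names SG and MASC, and the example string are all assumptions made for illustration; the actual format is the one defined in the Specifications manual.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    category: str                                  # part of speech or phrasal category
    features: list = field(default_factory=list)   # syntactic/semantic features
    children: list = field(default_factory=list)   # Node objects or word strings

# Tokens: parentheses, quoted words, or bare symbols (hypothetical syntax).
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def _parse_node(tokens, pos):
    if pos >= len(tokens) or tokens[pos] != "(":
        raise ValueError("expected '('")
    pos += 1
    if pos >= len(tokens) or tokens[pos] in ("(", ")"):
        raise ValueError("expected a category label")
    node = Node(category=tokens[pos])
    pos += 1
    while pos < len(tokens) and tokens[pos] != ")":
        tok = tokens[pos]
        if tok == "(":                     # nested constituent
            child, pos = _parse_node(tokens, pos)
            node.children.append(child)
        elif tok.startswith('"'):          # a word of the sentence
            node.children.append(tok.strip('"'))
            pos += 1
        else:                              # a bare symbol is read as a feature
            node.features.append(tok)
            pos += 1
    if pos >= len(tokens):
        raise ValueError("missing ')'")
    return node, pos + 1

def parse(text):
    tokens = TOKEN.findall(text)
    node, pos = _parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return node

def words(node):
    """Return the words of the sentence in surface order."""
    out = []
    for child in node.children:
        out.extend(words(child) if isinstance(child, Node) else [child])
    return out

example = '(NP SG MASC (ART "el") (N SG MASC "perro"))'
tree = parse(example)
print(tree.category, tree.features, words(tree))
```

Because the structure closely follows the surface syntax, reading the leaves from left to right recovers the original word order.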

Tools

Annotation tools

A statistical POS tagger, which provides the most frequent category and inflectional features for each word. We use the tagger described in Sánchez, Ramírez & Declerck (1999).
A "chunker" that recognizes NPs, VPs, PPs and ADJPs (developed by F. Sánchez)
A sentence selector that randomly picks sentences out of the source. Some variables, such as text type or sentence length, can be set (developed by F. Sánchez).
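
The behaviour of a sentence selector of this kind can be sketched as follows. This is not the tool developed for the project: the function name, the parameter names, the dictionary layout of a sentence and the toy corpus are hypothetical, and sentence length is simply counted in whitespace-separated tokens.

```python
import random

def select_sentences(sentences, n, text_type=None, min_len=None, max_len=None, seed=None):
    """Randomly pick up to n sentences that satisfy the given constraints."""
    def matches(s):
        length = len(s["text"].split())
        if text_type is not None and s["text_type"] != text_type:
            return False
        if min_len is not None and length < min_len:
            return False
        if max_len is not None and length > max_len:
            return False
        return True

    pool = [s for s in sentences if matches(s)]
    rng = random.Random(seed)          # a fixed seed makes the selection reproducible
    return rng.sample(pool, min(n, len(pool)))

# Toy corpus (invented sentences, hypothetical text-type labels).
corpus = [
    {"text": "El gobierno aprobó la ley .", "text_type": "news"},
    {"text": "Llueve .", "text_type": "news"},
    {"text": "Este aparato consume poca energía .", "text_type": "consumer"},
    {"text": "Los precios subieron un dos por ciento en marzo .", "text_type": "news"},
]

picked = select_sentences(corpus, 2, text_type="news", min_len=3, seed=0)
for s in picked:
    print(s["text"])
```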

Debugging tools

A graphical tree-drawer for the annotated sentences. We use a publicly released program, CLIG (Computational Linguistics Interactive Grapher, http://www.ags.uni-sb.de/~konrad/clig.html), developed by Karsten Konrad at Saarbrücken.

A feature checker that controls the assignment of proper features for each category (developed by R. Grishman).

A phrase structure rules generator, which is used to detect possible incorrect annotations (developed by R. Grishman).
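
To make the idea of a feature checker concrete, here is a minimal sketch. It is not R. Grishman's tool: the scheme below (which attributes each category takes and which values they allow) is a hypothetical fragment, and the three labels it reports (missing, incorrect, unnecessary) are only one plausible way of classifying feature errors.

```python
# Hypothetical fragment of a typed inventory: category -> attribute -> allowed values.
SCHEME = {
    "NP": {"number": {"SG", "PL"}, "gender": {"MASC", "FEM"}},
    "VP": {"tense": {"PRES", "PAST"}, "number": {"SG", "PL"}},
}

def check_features(category, features):
    """Compare a node's features (attribute -> value) against the scheme.

    Returns a list of (error_type, attribute) pairs.
    """
    expected = SCHEME.get(category, {})
    errors = []
    for attr in sorted(expected):
        if attr not in features:
            errors.append(("missing", attr))        # required attribute absent
        elif features[attr] not in expected[attr]:
            errors.append(("incorrect", attr))      # value not allowed for this attribute
    for attr in sorted(features):
        if attr not in expected:
            errors.append(("unnecessary", attr))    # attribute not defined for the category
    return errors

print(check_features("NP", {"number": "SG", "gender": "MASC"}))
print(check_features("NP", {"number": "PRES", "tense": "PAST"}))
```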

Debugging

We have conducted an evaluation of the feature assignment for the first 500 sentences:

Table 1: Errors in feature assignment

	All categories	ADJP	ADVP	NP	PP	VP
Total number of cases	6364	592	262	2933	1503	1074
Total number of errors	672	51	70	457	35	59
Percentage of errors	10.5 %	8.6 %	26.7 %	15.6 %	2.3 %	5.4 %
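
The percentages in Table 1 can be recomputed from the counts with a few lines of Python. The figures are copied from the table; the variable names are ours. The recomputed values agree with the published ones to within 0.1 percentage point, and the small differences in the last digit presumably come from rounding versus truncation.

```python
cases = {"All categories": 6364, "ADJP": 592, "ADVP": 262, "NP": 2933, "PP": 1503, "VP": 1074}
errors = {"All categories": 672, "ADJP": 51, "ADVP": 70, "NP": 457, "PP": 35, "VP": 59}
published = {"All categories": 10.5, "ADJP": 8.6, "ADVP": 26.7, "NP": 15.6, "PP": 2.3, "VP": 5.4}

def error_rate(n_errors, n_cases):
    """Error rate as a percentage of the annotated cases."""
    return 100.0 * n_errors / n_cases

rates = {name: error_rate(errors[name], cases[name]) for name in cases}

for name in cases:
    print(f"{name:15s} {rates[name]:6.2f} %   (published: {published[name]} %)")
```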

Table 2: Types of errors


Types of errors	All categories	ADJP	ADVP	NP	PP	VP
Total	672	51	70	457	35	59
Missing features	422	22	29	333	25	13
Incorrect features	226	29	40	105	10	42
Unnecessary features	24	0	1	19	0	4

These data show that the most common errors are missing features, followed by incorrect features; unnecessary features are rare.

We also learned that NPs and ADVPs are the phrases most prone to error under our feature annotation scheme: NPs account for the largest number of errors, and ADVPs have the highest error rate. We estimate that the current error rate in feature assignment is below 5 %.

The phrase structure rules generator, on the other hand, detects "strange" combinations of constituents. This tool has been useful for detecting some inconsistencies.
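
The general idea can be sketched as follows: read off the phrase structure rule at every node, count the rules over the whole treebank, and report those that occur very rarely as candidates for manual inspection. This is only an illustration, not R. Grishman's program; the tuple representation of trees, the threshold and the toy sentences are assumptions.

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) for every node whose children include constituents.

    A tree is a tuple (category, child, ...); a child is a tuple or a word string.
    """
    label, children = tree[0], tree[1:]
    rhs = tuple(c[0] for c in children if isinstance(c, tuple))
    if rhs:
        yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from rules(c)

def rare_rules(treebank, max_count=1):
    """Return the rules seen at most max_count times, as candidates for review."""
    counts = Counter(r for t in treebank for r in rules(t))
    return sorted(r for r, n in counts.items() if n <= max_count)

# Toy treebank (invented); the third tree contains an unusual NP expansion.
treebank = [
    ("S", ("NP", ("ART", "el"), ("N", "perro")), ("VP", ("V", "ladra"))),
    ("S", ("NP", ("ART", "la"), ("N", "gata")), ("VP", ("V", "duerme"))),
    ("S", ("NP", ("N", "gato"), ("ART", "el")), ("VP", ("V", "come"))),
]

for lhs, rhs in rare_rules(treebank):
    print(lhs, "->", " ".join(rhs))
```

A rare rule is not necessarily an error, so the output of such a tool is a list of places to look at, not a list of corrections.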

The coherence and quality of the analysis are checked manually by the human annotators.

Some experiments

We have used the treebank to train a statistical parser, the Apple Pie Parser (Sekine, 1995). The APP works with a probabilistic context-free grammar and probabilistic information about the parts-of-speech. We have obtained with it an efficient system for finding the most likely analysis.
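
For reference, the textbook formulation of a probabilistic context-free grammar estimated from a treebank is given below. It is included only as background: whether the Apple Pie Parser estimates and combines its probabilities in exactly this way is not stated here, and the notation (count, T, s) is ours.

```latex
% Relative-frequency estimate of a rule probability from treebank counts
\hat{P}(A \rightarrow \beta) =
  \frac{\mathrm{count}(A \rightarrow \beta)}{\sum_{\gamma} \mathrm{count}(A \rightarrow \gamma)}

% Probability of a parse tree T: product over the rule applications in T
P(T) = \prod_{(A \rightarrow \beta) \in T} \hat{P}(A \rightarrow \beta)

% Most likely analysis of a sentence s among its candidate trees \mathcal{T}(s)
\hat{T}(s) = \operatorname*{arg\,max}_{T \in \mathcal{T}(s)} P(T)
```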

Current work

In recent years, we have been conducting two different experiments:

To enlarge the treebank with a simpler annotation, in order to test whether the richer information is relevant for rule induction in Spanish.

To implement a syntax-to-semantics transformation using the SESCO tag set.

Papers

Moreno, A. and S. López, 1999. Developing a Spanish Tree Bank. In Proc. Journées ATALA, Corpus annotés pour la syntaxe. Paris, 18-19 June 1999.
Moreno, A., R. Grishman, S. López, F. Sánchez and S. Sekine, 2000. A Treebank of Spanish and its Application to Parsing. Available as morenoetal.ps (6496863 bytes) and morenoetal.ps.gz (84118 bytes).

Moreno, A., López, S., Sánchez, F. & Grishman, R. (2003) Developing a syntactic annotation scheme and tools for a Spanish treebank. In A. Abeillé (Ed.) Treebanks. Building and Using Parsed Corpora. Kluwer Academic Publishers: Dordrecht, The Netherlands.

Herrero Zorita, C. & Moreno Sandoval, A. (2016). Sentence length and NP complexity of general and medical written academic and media text. An analysis using a trained syntactic parser. In A. Moreno Ortiz and C. Perez-Hernández (eds.), CILC2016 EPiC Series in Language and Linguistics, vol. 1, pp. 181-190.

Examples

Sentence 1: lisp & clig

Sentence 2: lisp & clig

Sentence 3: lisp & clig

Research License

The UAM Spanish Treebank is available for free. However, we ask you to acknowledge the UAM Spanish Treebank license agreement for non-commercial use and to send it to us.

The Treebank is not available for commercial purposes.

1. Download the license agreement (ONLY for research purposes).
2. Send it to the contact address (see below) or via fax (+34 914974498).

Contact address:

Laboratorio de Lingüística Informática
Dept. de Lingüística
Universidad Autónoma de Madrid
E-28049 Madrid, Spain

e-mail: antonio.msandoval@uam.es
Web: www.lllf.uam.es