Laboratorio de Lingüística Informática

Corpus Resources And Terminology ExtRaction

(MLAP-93/20)

PROJECT SUMMARY

This project proposes to create a set of tools and resources for multilingual corpus linguistics.

TOOLS

A database model for storing annotated and aligned parallel multilingual corpora, based on the model developed under the ET10-63 project.

A language-independent statistical package for sentence alignment.

A software package for text retrieval and corpus browsing.

A part-of-speech tagger for Spanish.

RESOURCES

A 1-million-word parallel trilingual (English, French, Spanish) subcorpus of the ITU corpus, part-of-speech annotated and sentence-aligned, with the POS annotation manually corrected.

Mono- and multilingual lexical resources (lexicons, term banks).

PARTNERSHIP

Full partners

Lancaster University

Computers, Communications and Visions

Universidad Autónoma de Madrid

Subcontractors

IBM-France

ETSI Telecomunicación, UPM, Spain