Laboratorio de Lingüística Informática

Multimodal and Multilingual Advanced Answers Search: Linguistic Resources

Funded by CICYT
Project TIN2007-67407-C03-02
October 2007 to September 2010

The project aims to create a multimodal (text and voice) and multilingual answer search platform that integrates the modules developed by the participating groups. The starting hypothesis is that the answer search task of current systems can be improved by working on the modules that make up the architecture of such a system: in particular, the multilingual IR modules, the enhancement of indexing, faster information access, improved extraction and arrangement of answers, and question analysis. We deal with web information, encyclopaedic resources and news. Linguists' work is therefore essential to develop and/or adapt appropriate resources, as well as to integrate lexical and software resources.

We also aim to apply these techniques and this methodology to other areas, such as ontologies and information retrieval, named entities and voice interaction, investigating ways of adapting these tasks to new domains and languages.

Project goals

The main tasks of the LLI-UAM in BRAVO are:

Creation of new multilingual resources in Arabic, Spanish and Japanese.
Design and annotation of a Spanish speech corpus of questions.
Definition of a model for question classification.
Addition of linguistic resources to improve the handling of spontaneous speech, in order to adapt a speech recognizer to question formulation.

Results

Researchers

Principal investigator: Antonio Moreno Sandoval
Computer technician: José María Guirao Miras
Other professors:
- Théophile Ambadiang
- Mohamed El-Madkouri
- Chieko Kimura
- Paula Gozalo Gómez
Other researchers:
- Manuel Alcántara
- Doaa Samy
- Ana González Ledesma
- Marta Garrote Salazar

Papers

2011

MORENO-SCHNEIDER, J., GARROTE-SALAZAR, M., MARTÍNEZ, P. and MARTÍNEZ FERNÁNDEZ, J.L. "Some experiments in evaluating ASR systems applied to multimedia retrieval", in Detyniecki, M., García-Serrano, A. and Nürnberger, A. (Eds.), Adaptive Multimedia Retrieval. Understanding Media and Adapting to the User. 7th International Workshop, AMR 2009, Madrid, Spain, September 24-25, 2009, Revised Selected Papers, Springer-Verlag, Lecture Notes in Computer Science, 6535, ISBN: 978-3-642-184, pp. 12-23.

2010

CAMPILLOS LLANOS, L., GOZALO GÓMEZ, P., GUIRAO MIRAS, J. Mª and MORENO SANDOVAL, A. Español oral en contexto. Vol. 1. Textos de español oral. Material de ELE basado en corpus. Comprensión auditiva. Madrid: Servicio de publicaciones de la Universidad Autónoma de Madrid. 2010. ISBN 978-84-8344-181-7.
GARROTE, M. and MORENO SANDOVAL, A. "Chiede. A spontaneous child language corpus of Spanish". In Moneglia and Panunzi (eds.): Bootstrapping Information from Corpora in a Cross-Linguistic Perspective. Firenze University Press, pp. 121-140. ISBN 978-88-8453-518-4.
GARROTE, M. Los corpus de habla infantil. Metodología y análisis. Servicio de publicaciones de la Universidad Autónoma de Madrid. ISBN 978-84-8344-187-9.
VICENTE-DÍEZ, M., DE PABLO, C., MARTÍNEZ, P., MORENO-SCHNEIDER, J. and GARROTE-SALAZAR, M. "Are Passages Enough? The MIRACLE Team Participation in QA@CLEF2009", in Peters, C., Di Nunzio, G.M., Kurimo, M., Mandl, Th., Mostefa, D., Peñas, A. and Roda, G. (Eds.), Multilingual Information Access Evaluation I - Text Retrieval Experiments. Springer-Verlag, ISBN: 978-3-642-157, vol. 6241, pp. 281-288.

2009

ALCÁNTARA PLÁ, M. and DECLERCK, T. Proceedings of the EACL 2009 Workshop on Semantic Representation of Spoken Language. Athens: ACL, 2009.
CAMPILLOS, L. and ALCÁNTARA, M. "Speech Disfluencies in Formal Context. Analysis Based on Spontaneous Speech Corpora", in Corpus Linguistics Conference, Liverpool, 2009.
GONZÁLEZ LEDESMA, A. Los marcadores del discurso en el corpus C-ORAL-ROM: anotación pragmática, estrategias computacionales de etiquetado y aplicaciones a otros campos. 2009. Universidad Autónoma de Madrid.
MORENO SANDOVAL, A. and GUIRAO MIRAS, J.M. "Frecuencia y distintividad en el uso lingüístico: casos tomados de la lematización verbal de corpus de distintos registros", in Actas del I Congreso Internacional de Lingüística de Corpus (CILC-09), Universidad de Murcia, 2009.

2008

ALCÁNTARA PLÁ, M. "El análisis lingüístico en la transcripción automática de la lengua hablada, el Proyecto COAST", in Actas del VIII Congreso de Lingüística General: El valor de la diversidad [meta]lingüística, Madrid, 2008.
CAMPILLOS, L. "Las expresiones causales en el corpus de habla espontánea C-ORAL-ROM". In Actas del 8º Congreso de Lingüística General, Universidad Autónoma de Madrid, June 25-28, 2008.
DE PABLO SÁNCHEZ, C., MARTÍNEZ FERNÁNDEZ, J.L., GONZÁLEZ LEDESMA, A., SAMY, D., MARTÍNEZ, P., MORENO, A. and ALJUMAILY, H. "Combining Wikipedia and newswire text for Question Answering in Spanish", in Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras, Diana Santos (Eds.): Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers. Lecture Notes in Computer Science 5152, Springer, 2008, ISBN 978-3-540-85759-4, pp. 352-355.
GARROTE, M., GUIRAO, J.M. and MORENO, A. "Extracción de unidades distintivas en adultos y niños de un corpus de lengua oral espontánea". In Actas del 8º Congreso de Lingüística General, Universidad Autónoma de Madrid, June 25-28, 2008.
GONZÁLEZ LEDESMA, A. and SAMY, D. "Marcadores discursivos en árabe y español: un estudio computacional basado en corpus paralelos con anotación pragmática". In Actas del 8º Congreso de Lingüística General, Universidad Autónoma de Madrid, June 25-28, 2008.
GOZALO, P. "Reflexiones sobre el futuro. Los datos del español no nativo". In Actas del 8º Congreso de Lingüística General, Universidad Autónoma de Madrid, June 25-28, 2008.
MORENO SANDOVAL, A., T. TOLEDANO, D., DE LA TORRE, R., GARROTE, M. and GUIRAO, J.M. "Developing a Phonemic and Syllabic Frequency Inventory for Spontaneous Spoken Castilian Spanish and their Comparison to Text-Based Inventories". In Proceedings of LREC 2008, Marrakech, May 28-30, 2008.
SAMY, D. and GONZÁLEZ LEDESMA, A. "Pragmatic Annotation of Discourse Markers in a Multilingual Parallel Corpus (Arabic-Spanish-English)". In Proceedings of LREC 2008, Marrakech, May 28-30, 2008.
SEGURA BEDMAR, I., MARTÍNEZ, P. and SAMY, D. "Detección de fármacos genéricos en textos biomédicos", March 2008, Revista Española para el procesamiento del lenguaje natural (SEPLN), ISSN: 1135-5948, pp. 27-34.
SEGURA BEDMAR, I., MARTÍNEZ, P. and SAMY, D. "A preliminary approach to recognize generic drug names by combining UMLS resources and USAN naming conventions", Ohio, USA, June 2008, Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP), Association for Computational Linguistics, ISBN: 978-1-932432-, pp. 100-101.
SEGURA BEDMAR, I., SAMY, D., MARTÍNEZ FERNÁNDEZ, J.L. and MARTÍNEZ, P. "Detecting Semantic Relations between Nominals using Support Vector Machines and Linguistic-Based Rules", Portugal, November 2007, On the Move to Meaningful Internet Systems 2007: OTM 2007 Workshops, Springer Berlin / Heidelberg, ISBN: 978-3-540-768, ISSN: 0302-9743, pp. 1267-1273.
VICENTE DÍEZ, M., SAMY, D. and MARTÍNEZ, P. "An empirical approach to a preliminary successful identification and resolution of temporal expressions in Spanish news corpora", Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC'08), Marrakech, Morocco, May 2008, European Language Resources Association (ELRA), ISBN: 2-9517408-4-0, pp. 2153-2158.