FINT-ESP: FINANCIAL TEXT ANALYTICS IN SPANISH: TOOLS AND LANGUAGE RESOURCES

Financed by MINECO 2017

R&D&I projects of the Spanish State Programme for Research, Development and Innovation Oriented to the Challenges of Society

01 January 2018 to 31 June 2021

The process of automatically analysing textual content is called Text Analytics (Moreno & Redondo 2016). This process can be applied to many fields, from the analysis of social network comments to information extraction from legal, medical or financial texts. The main challenge of Text Analytics is to understand the content of linguistic utterances and to show the relevant information.

To achieve these goals, different techniques are used, including statistical methods (data mining) and rule-based procedures. Our approach follows the traditional method of Computational Linguistics: we annotate the relevant information appearing in unstructured texts using domain-specific rules and lexicons. We then analyse that information quantitatively and qualitatively with corpus linguistics tools (Lyneal and Wmatrix).

Our proposal brings together the experience of two internationally recognized research teams, the Computational Linguistics Laboratory at Universidad Autónoma de Madrid (LLI-UAM) and the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University.

For more than two decades, these teams have independently developed language processing tools and corpora. The main goal of this proposal is to integrate Spanish into the tools developed by UCREL, in order to use them to analyse financial texts, more specifically companies' annual reports. For this purpose, a Spanish corpus of financial texts will be collected and annotated with a new version of the UCREL Semantic Tagger.

The topics and results of the project fall fully within the framework of Challenge 7, "Digital Economy and Society", because they help to process and understand financial documents in digital format. Language technologies are included in a strategic plan of the Digital Agenda for Spain. The industrial transfer of the results could be carried out through software and services developed and offered by research institutions such as the Instituto de Ingeniería del Conocimiento, a private non-profit body dedicated to research. It is located on the UAM campus, and some of the research team members collaborate with it.

Financial Narrative Processing (FNP) is the branch of NLP applied to the economic and financial domain. It covers all systems that process large amounts of textual and numerical financial data in order to extract, summarise or analyse them using automatic and computer-aided approaches.

This project aims to bring together an international multidisciplinary team (linguists, computer scientists, economists) with two main objectives:

• Develop new methods and computational tools to analyse financial narratives.

• Use these new methods to analyse the properties of financial texts with a view to developing practical business applications.

This project brings together the experience of two internationally recognized research teams, the Computational Linguistics Laboratory at Universidad Autónoma de Madrid (LLI-UAM) and the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University.

TEAM

Research

Antonio Moreno Sandoval (PI), Linguistics, UAM

José María Guirao Miras, Computer science, Universidad de Granada

Ana Gisbert, Accounting and Finance, UAM

Chelo Vargas Sierra, Translation, Universidad de Alicante

Research team

Paul Rayson, Computer Science, UCREL, Lancaster University

Mahmoud El-Haj, UCREL, Lancaster University

Scott Piao, UCREL, Lancaster University

Pablo Haya, IIC

José Antonio Jiménez Millán, Computer Science, Universidad de Cádiz

Jordi Porta, Computer Science, UAM

Helena Montoro, UAM-IIC Chair

Blanca Carbajo, UAM-IIC Chair

Ana García Toro, UAM-IIC Chair

PROJECT PHASES
Phase 1: Creation of the FinT-esp corpus

Firstly, the compilation process was carried out:

• Manual download of more than 500 reports from the websites of IBEX companies.

• Adaptation of the Spanish reports to the CFIE tool (developed by El-Haj et al.) using keyword indexes to detect the structure of the documents. Due to their variability, only 388 reports could be converted to txt.

• Cleaning of the converted texts: Python scripts were written to clean, normalise and extract the sections, because the CFIE scripts did not work.

• Hand transcription of the "Letters to investors" sections of the unconverted texts to obtain a complete and representative sample.
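The cleaning and section-extraction step described above can be illustrated with a minimal sketch. It is not the project's actual script: the keyword index `SECTION_KEYWORDS`, the section labels and the function names are hypothetical, and it assumes that section headings appear alone on a line in the converted txt.

```python
import re
import unicodedata

# Hypothetical keyword index (section label -> heading pattern).
# The project's real keyword lists are not given in this document.
SECTION_KEYWORDS = {
    "letter": r"carta del presidente|carta a los accionistas",
    "governance": r"gobierno corporativo",
    "financials": r"cuentas anuales|estados financieros",
}


def clean_text(raw: str) -> str:
    """Normalise Unicode, re-join hyphenated words and collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "accio-\nnistas" -> "accionistas"
    text = re.sub(r"[ \t]+", " ", text)           # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one empty line in a row
    return text.strip()


def split_sections(text: str) -> dict[str, str]:
    """Assign each line to the section opened by the last heading seen."""
    sections: dict[str, list[str]] = {"front_matter": []}
    current = "front_matter"
    for line in text.split("\n"):
        label = next(
            (name for name, pattern in SECTION_KEYWORDS.items()
             if re.fullmatch(pattern, line.strip(), flags=re.IGNORECASE)),
            None,
        )
        if label is not None:
            current = label
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```

Matching headings only when they fill a whole line is a deliberately strict choice for the sketch; real reports vary much more, which is consistent with the conversion difficulties mentioned above.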




At the end of the process, two corpora were obtained:

• Annual reports: composed of 388 documents, 23 million words and 2 million sentences.
Because the texts are recent (2014 to 2017) and written by and for specialists, the corpus contains the jargon of financial experts and is suitable for terminological and conceptual extraction.

• Letters from CEOs to shareholders: 397 documents, 500,000 words and 16,800 sentences.
This corpus is suitable for processing with an NLP pipeline (taggers, parsers, NER). It has also been used for linguistic analysis and for training automatic models of financial discourse.

Phase 2: Developed tools

The first purely IT task consisted of integrating or adapting the tools previously developed by the UAM and Lancaster teams:

• Adaptation of the CFIE to financial reports in Spanish, with partial results due to the difference between the UK and Spanish narrative structures.

• Integration of Grampal (LLI-UAM) into the Wmatrix tool (Lancaster). In addition, a new version of Grampal is being adapted to the Stanza environment (Stanford).

• Integration of Grampal into USAS (Lancaster), the semantic tagger of the UCREL team. A specific semantic lexicon has been created for the financial domain in Spanish.




Secondly, the project's computer scientists, Guirao and Jiménez, have developed two tools:

• A system for querying the two FinT-esp corpora. It is inspired by the MultiMedica version and supports not only independent querying of the two corpora but also searches for selected financial terms. It also includes a prototype of an automatic financial term extractor.

• WikiCorporaComposer, a tool inspired by Marco Baroni's BootCaT program. It is used to create an ad hoc corpus for a specific theme or purpose (translation, lexicography, terminology) from articles in the Spanish and English Wikipedia.

Phase 3: Linguistic studies

The team includes experts in finance (Ana Gisbert) and specialized translation in the financial domain (Chelo Vargas). Therefore, one of the objectives was to leverage the corpus and tools for wide-ranging studies using Corpus Linguistics methodology.

The work focuses on four topics:

• Possibility and necessity in financial narrative, through the use of adverbs ending in -mente (-ly in English)

• The use of metaphors in financial reports

• Extraction of neologisms and new financial terms from the corpus

• The use of discourse markers in the argumentation of Shareholder Letters
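As an illustration of the first topic, candidate adverbs ending in -mente can be pulled from raw text with a regular expression and a frequency count. This is only a sketch of the general idea, not the project's method: the stop list `NOT_ADVERBS` is a small hypothetical sample of verb forms and adjectives that happen to end in -mente, and a real study would rely on a part-of-speech tagger instead.

```python
import re
from collections import Counter

# Words of at least one character followed by "mente".
MENTE_PATTERN = re.compile(r"\b\w+mente\b", flags=re.IGNORECASE)

# Hypothetical, incomplete stop list of non-adverbs ending in -mente.
NOT_ADVERBS = {
    "aumente", "comente", "lamente", "fomente", "implemente",
    "complemente", "vehemente", "clemente", "demente",
}


def mente_adverbs(text: str) -> Counter:
    """Count lower-cased -mente candidates, skipping known non-adverbs."""
    candidates = (match.lower() for match in MENTE_PATTERN.findall(text))
    return Counter(word for word in candidates if word not in NOT_ADVERBS)
```

Calling `mente_adverbs(letter_text).most_common(20)` on a letter would give a first frequency list of candidates such as "probablemente" or "necesariamente" to review by hand.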

Phase 4: Experiments with machine learning techniques

Different NLP techniques have been applied to FinT-esp:

• Sentiment analysis in Shareholder Letters

• Semantic annotation with USAS in the financial domain in Spanish

• Automatic classification of companies with profits and losses based on their reports

• Automatic financial term extractor

• Automatic recognizer of discourse markers

For the first two tasks, a classical strategy based on lexicon and rules has been adopted. For the other three tasks, machine learning techniques have been applied based on manual annotation.
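A lexicon-and-rules strategy of the kind mentioned for sentiment analysis can be sketched as follows. The word lists, the negation window of two tokens and the function name `tone_score` are illustrative assumptions; the project's actual lexicon and rules are not described in this document.

```python
import re

# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"crecimiento", "beneficio", "mejora", "éxito"}
NEGATIVE = {"pérdida", "pérdidas", "deterioro", "caída", "riesgo"}
NEGATORS = {"no", "sin", "nunca"}


def tone_score(text: str) -> float:
    """Return (positive - negative) / (positive + negative), in [-1, 1]."""
    tokens = re.findall(r"\w+", text.lower())
    positive = negative = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            polarity = 1
        elif token in NEGATIVE:
            polarity = -1
        else:
            continue
        # Rule: a negator within the two preceding tokens flips the polarity.
        if any(prev in NEGATORS for prev in tokens[max(0, i - 2):i]):
            polarity = -polarity
        if polarity > 0:
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return (positive - negative) / total if total else 0.0
```

The score is 0.0 when no lexicon word is found, so texts outside the lexicon's coverage are treated as neutral rather than raising an error.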

RESULTS
Semantically processing Presidents' Letters to Shareholders is a complex and challenging task, both for human experts and for Artificial Intelligence systems, for the following reasons:

• There are far fewer examples of companies with losses than of companies with profits (a 15/85 distribution).

• The discourse of companies with losses is very similar to that of companies with profits, since executives know how to mask bad news to avoid damaging the credibility and solvency of the companies they lead.
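The effect of the 15/85 distribution can be made concrete with a small worked example. The sketch below uses made-up labels in that proportion, not project data, and the function name `evaluate` is hypothetical. It shows that a trivial classifier that always predicts "profit" already reaches 85% accuracy while detecting no loss-making company, which is why plain accuracy is a weak measure for this task.

```python
def evaluate(y_true, y_pred, minority="loss"):
    """Return accuracy, minority-class recall and balanced accuracy (binary case)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    minority_total = sum(t == minority for t, _ in pairs)
    majority_total = len(pairs) - minority_total
    minority_hits = sum(t == minority and p == minority for t, p in pairs)
    majority_hits = correct - minority_hits
    minority_recall = minority_hits / minority_total if minority_total else 0.0
    majority_recall = majority_hits / majority_total if majority_total else 0.0
    return {
        "accuracy": correct / len(pairs),
        "minority_recall": minority_recall,
        "balanced_accuracy": (minority_recall + majority_recall) / 2,
    }


# Synthetic labels in the 15/85 proportion, with an "always profit" baseline.
y_true = ["loss"] * 15 + ["profit"] * 85
y_pred = ["profit"] * 100
print(evaluate(y_true, y_pred))
```

For this baseline the accuracy is 0.85, the recall on the loss class is 0.0 and the balanced accuracy is 0.5, so any classifier for this task needs to be judged against per-class measures rather than accuracy alone.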

Automatic term extractors and discourse marker recognizers perform at a very high level of accuracy, using supervised machine learning techniques.

After linguistic experts manually annotated these units, we trained deep neural networks based on Transformers, which yielded preliminary results with over 90% accuracy. This has given us a recognition prototype (similar to NER) for the financial domain, which we will explore further in the next project.

PUBLICATIONS

Moreno-Sandoval, A., Gisbert, A., Haya, P.A., Guerrero, M. and Montoro, H.: "Tone Analysis in Spanish Financial Reporting Narratives." In Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019). NoDaLiDa, Turku, Finland, 30 September 2019, pp. 42-50.

Moreno-Sandoval, A., Gisbert, A. and Montoro, H.: "FinT-esp: a corpus of financial reports in Spanish." Presented at CILC-2019, Valencia. It will appear in Comares.

Vargas-Sierra, C. and A. Moreno-Sandoval (2021): "War and Health Metaphors in Financial Discourse: The case of 'Letter to Shareholders' in Annual Reports". In Mateo-Martínez, J. and Francisco Yus (eds.): On the Use of Metaphors in Specialized Discourse, Bern, Peter Lang, pp. 41-71.

Financial Narrative Processing in Spanish (in press). Tirant lo Blanch. To be published from September 2021.
Chapters:
1. "Financial narratives" (A. Gisbert)
2. "State of the Art in FNP" (M. El-Haj et al.)
3. "Anglicisms in a Financial Corpus: exploiting resources for terminological retrieval and Analysis" (C. Vargas and B. Carbajo)
4. "Discourse Markers in Financial Narrative: The Case of the Annual reports and Letters to Shareholders" (A. García-Toro and A. Moreno-Sandoval).
5. "Machine Learning models for classifying Spanish Beaters and Non-Beaters Financial Reports" (El-Haj, Moreno-Sandoval and Jiménez-Millán).
6. "Tools for processing FinT-esp resources" (Moreno Sandoval, Guirao and Jiménez Millán).

CONFERENCES

Moreno-Sandoval, A.: "Possibility and necessity in financial narrative: a study of modal adverbs in Spanish." Presented at the XI International Conference on Corpus Linguistics (CILC-2019), Valencia. Published in Proceedings.

Moreno-Sandoval, A.: "Some discursive aspects of financial narrative in Spanish: modality, lexical distinctiveness and sentiment analysis." Plenary session at the 3rd International Conference on Corpus Analysis in Academic Discourse 2019 (CAAD'19).

Moreno-Sandoval, A., Gisbert, A. and Montoro, H. (2019): "Compiling a corpus of financial reports in Spanish". In XI International Conference on Corpus Linguistics (CILC 2019) Proceedings, Valencia.

Carbajo-Coronado, B., Vargas-Sierra C., and Moreno-Sandoval, A. (2021): "Reconocimiento de términos financieros nuevos en un corpus de informes corporativos". In AESLA 2021 Proceedings, Coruña.

García-Toro, A. (2021): "Marcadores discursivos en la argumentación de los informes de empresas con pérdidas y ganancias". In EntreTextos 2021, Alicante.

Chair on Computational Linguistics UAM-IIC Instituto de Ingeniería del Conocimiento (IIC) Institutional logos: Spanish Ministry of Science and Innovation / Funded by the European Union / Plan for Recovery, Transformation and Resilience / State Research Agency (AEI)