XML in cataloguing ancient funds in Hispanic Libraries


Francisco A. Marcos Marín

Universidad Autónoma de Madrid


Four CLiPs ago, in Edinburgh in 1998, the audience was taken to Buenos Aires, unfortunately only in a virtual manner. In 1996 our team had rediscovered the whereabouts of a large part of a library auctioned in Paris in 1936 and almost lost for 60 years since. The late owner of the collection, Raymond Foulché-Delbosc, was one of the greatest French Hispanists of all time. He was active mainly at the end of the 19th century and during the first third of the 20th. He founded the well-known journal Revue Hispanique, where he published, under different names, his editions of old texts, as well as substantial contributions to Spanish studies. Many of the books that thus came to light in those pages were part of his personal library. After his death, that excellent collection went under the hammer in 1936. The catalogue of the auction was published, so the content of the library was known to everybody[1]. From that precise moment on, however, most scholars lost all trace of a substantial number of those books, which were marked as “lost” in the bibliographies, reference catalogues and databases devoted to Medieval and Classical Spanish. Nobody seemed to know who had bought more than one thousand two hundred of those books, or where they could actually be.


After the rediscovery of that large part of the books in the Treasure Room of the National Library of the Argentine Republic, a group of scholars under my direction started cataloguing and studying the Foulché-Delbosc collection in the National Library. The project was supported by the Bilateral Agreement between Spain and Argentina, and funded by a number of institutions[2] in both countries.




Our starting point now becomes more concrete. When the collection was rediscovered in 1996, and its full importance was duly acknowledged, there existed only two sources of information about it: the old card files of 1937, whose information was very limited, and the more detailed, albeit also restricted, information included in the catalogue of the auction. In the first source, the card files, only the first of the several books often bound together in the same volume was mentioned; when a volume contains several books, the card files give no information about its whole content. In the second source, the auction catalogue, there was no indication of who had bought the books and, more specifically, of which books had actually been bought by the Argentine government in 1936. There was, therefore, no way to identify and locate the books on the Library shelves. Georgina Olivetto undertook the challenging task of preparing the lists of concordances between the card files and the auction catalogue, with the valuable help of Librarian Hugo Acevedo, who has a thorough knowledge of the stacks of the Biblioteca Nacional.


With those tables in our hands, we could start preparing a way to describe the contents of the collection in detail to the scientific community. As I have already said, many of the books were considered lost. The auction had taken place in 1936, and times were not peaceful after that date, either in Spain or in the rest of the world. Most people thought that the books could have been destroyed. Even those who knew that there was a Foulché-Delbosc collection in Buenos Aires had no means of knowing how many books were in the collection and what condition they were in, unless they visited the Library and investigated the card files[3].


Although times are changing very quickly, publishing the catalogue in book format is still necessary. It is, however, insufficient. Once again we have to establish a link between the old and the new Philology, by preparing a format that achieves three goals: being published in printed form, being incorporated into the electronic catalogues of the Biblioteca Nacional, and being ready for access over the Internet. Another requirement was a system that includes both text and pictures. For those who know our previous work, it will be clear that this implies an evolution of the ADMYTE format.


The kind of information that a printed catalogue can hold is not enough. It has to be complemented in several ways. Not all of them can be detailed now, but their main areas can be outlined. For the sake of this presentation we shall discuss the hypertext version, regardless of whether its concrete form is HTML, XML or SGML. The HTML version was ready by August 1999 and is available on the web site




The catalogue of printed books and that of manuscripts are separate, and searching is limited to the search facilities of the browser.


A second step was therefore required. In the first place, a line had to be opened for querying the database, so that we can build our own indexes according to parameters chosen by interrelating the different fields of the template. Technically, this can be done with a series of small programs or CGIs implementing different search options. A second line can take us to the text of the book itself, its transcription, preferably in paleographic format. The third line will take us to the facsimile reproduction of the book, usually in either compressed TIFF or JPEG format.


Once the compilation of the raw data is finished, the most important objective for any philologist will be their analysis. In our case, achieving that goal was made possible by integrating Foulché-Delbosc into a larger project, ACORDEON, whose goal is to develop an architecture for integrating applications that, working together, provide information search and retrieval services for large text collections. The project is oriented to users in the publishing industry, where this type of system can manage huge amounts of text, reducing the impact of the information flood and improving overall productivity, as well as to libraries and the documentation field, where efficient information search is a requirement.


To improve the retrieval of relevant information from texts, NLP techniques will be integrated into the search engines. The use of such techniques requires the construction of linguistic resources and tools. In particular, morphological analysis and lemmatization of words will be used to obtain more fine-grained results. In a richly inflected language such as Spanish, this technique is expected to improve the retrieval process, provided that a disambiguation module is included. Additional knowledge can improve retrieval further, for instance by taking into account semantic relations between words, such as synonymy, hyperonymy and meronymy. Such knowledge allows the user to reach accurate interpretations of several types of data, for instance proper names. Knowing that Leon does not mean León in Spain, but Lyon in France, is absolutely necessary. Where such knowledge is available, it should be integrated into the lexical resources repository.
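The place-name case can be illustrated with a minimal sketch. It assumes a hypothetical lookup table (PLACE_VARIANTS) and hypothetical helper names (fold, candidate_places); none of them comes from the ACORDEON prototype, and the table entries are illustrative only. The sketch folds accents and case so that spelling variants share one key, and returns every candidate place, leaving the final choice to a disambiguation step.

```python
import unicodedata

# Hypothetical authority table: folded imprint spelling -> candidate places.
# The entries are illustrative, not taken from the project's lexical resources.
PLACE_VARIANTS = {
    "leon": ["Lyon (France)", "León (Spain)"],
    "lyon": ["Lyon (France)"],
}


def fold(text: str) -> str:
    """Lower-case a string and strip diacritics so variants share one key."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()


def candidate_places(imprint_place: str) -> list:
    """Return all places an imprint spelling may refer to (possibly none)."""
    return PLACE_VARIANTS.get(fold(imprint_place), [])
```

An ambiguous spelling such as "Leon" yields more than one candidate, which is exactly where a disambiguation module, or the cataloguer, has to decide.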


A prototype has been developed to consult and retrieve information from library catalogues (where poorly structured text is usual). In particular, it has been applied both to the Foulché-Delbosc collection and to an oral corpus in Spanish. This prototype is provisionally accessible from the site




Laboratorio de Lingüística Informática


Universidad Autónoma de Madrid
A set of JavaScript instructions specifies the different arguments of the available function validar, the default being “ALL / Todos”. Complementary functions browse and check the lists, select the matching data, test them against the template of orthographic variants, and clear previously set parameters, such as the range (radio) check.


  function validar(x) expects the user to select:


a concrete place (lugar)

a concrete language (idioma)

a unique date or an interval, between years 1000 and 1999
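The checks listed above can be restated as a short sketch. The original validar is written in JavaScript and is not reproduced here; what follows is a Python approximation under stated assumptions: the names validate_years and validate_query, the parameter names, and the dictionary returned are all hypothetical, and only the year bounds and the "ALL" default are taken from the description.

```python
from typing import Optional, Tuple

ALL = "ALL"  # stands in for the "ALL / Todos" default described in the text
MIN_YEAR, MAX_YEAR = 1000, 1999  # bounds stated in the text


def validate_years(start: int, end: Optional[int] = None) -> Tuple[int, int]:
    """Return a (start, end) interval; a single year becomes a one-year interval."""
    if end is None:
        end = start
    if start > end:
        raise ValueError("start year must not be later than end year")
    if start < MIN_YEAR or end > MAX_YEAR:
        raise ValueError(f"years must lie between {MIN_YEAR} and {MAX_YEAR}")
    return start, end


def validate_query(lugar: str = ALL, idioma: str = ALL,
                   start: Optional[int] = None,
                   end: Optional[int] = None) -> dict:
    """Build a query from place, language and an optional date or interval."""
    # Hypothetical behaviour: an empty field falls back to the "ALL" default.
    query = {"lugar": lugar.strip() or ALL, "idioma": idioma.strip() or ALL}
    if start is not None:
        query["years"] = validate_years(start, end)
    return query
```

For example, validate_query("Sevilla", "es", 1502) would describe a search for Spanish-language books printed in Sevilla in 1502, while a year outside the stated bounds raises an error instead of being silently accepted.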


XML and metadata


Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO8879]. By construction, XML documents are conforming SGML documents.

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form the character data in the document, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. XML, therefore, provides a mechanism to impose constraints on the storage layout and logical structure.

A software module called an XML processor is used to read XML documents and provide access to their content and structure. An XML processor reads XML data and provides information to another module, called the application.
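The division of labour between processor and application can be shown with a small sketch. It uses the parser in Python's standard library (xml.etree.ElementTree), which is a non-validating processor, so no DTD is checked here. The record structure and the element names (record, title, place, date) are hypothetical and are not taken from the project's DTDs; the identifier is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue record; element names and id are illustrative only.
SAMPLE = """<?xml version="1.0"?>
<record id="fd-0001">
  <title>La Celestina</title>
  <place>Sevilla</place>
  <date>1502</date>
</record>"""


def read_record(xml_text: str) -> dict:
    """Application side: turn the parsed tree into a flat dictionary of fields."""
    # Processor side: fromstring() separates markup from character data
    # and hands back the logical structure as an element tree.
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    fields["id"] = root.get("id")
    return fields
```

The application never touches angle brackets: it only sees element names, attributes and text, which is the access to content and structure that the processor is meant to provide.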

A TEI-Based Tag Set for Manuscript Transcription is defined in DS2.DTD, a set of XML tags for the transcription of medieval manuscripts, for use either with excerpts (e.g. transcripts of individual pages) or with full transcripts of an entire manuscript. The tag set defined in DS2.DTD conforms to the Text Encoding Initiative (TEI) Guidelines. The presentation and selection of elements were also influenced by discussions with representatives of the Nordic Network, and by David Mackenzie's Manual of Manuscript Transcription for the Dictionary of the Old Spanish Language (5th ed. rev. and exp. Ray Harris-Northall). The DTD is an XML version of the SGML "beta release" ds.dtd, created in 1998 by Michael Sperberg-McQueen for the Digital Scriptorium Project.


Metadata is routinely defined in accordance with its literal interpretation: “data about data”, which means “information about information”. It would be more precise to add: “structured information about information”. Day (1998) explains that “metadata is commonly understood as an amplification of traditional bibliographic cataloguing practices in an electronic environment.” Used for resource description, metadata allows a resource to be understood by both humans and machines in ways that promote interoperability. Interoperability is defined by Hodge as “the ability of multiple systems, with different hardware and software platforms, data structures, and interfaces, to exchange data with minimal loss of content and functionality.”


Because digital cataloguing and preservation is such a broad field, it is necessary to specify from the outset several areas of focus, in order to facilitate collaboration among teams in different countries composed of librarians, teachers and cultural organizers with different levels of training. The first area, which is partially dealt with in this paper, involves identifying the attributes of a digital archive for research repositories. Another area, equally important for librarians, is the use of metadata to support the digital preservation process.


In the context of digital information objects, metadata can be assigned to one of three functional categories (Wendler: 1999):


· Descriptive: facilitating resource discovery and identification. Elements such as title, abstract, author, and keywords are descriptive.

· Administrative: supporting resource management within a collection. Creation, file type, technical information, access permissions and rights management are administrative metadata.

· Structural: binding together the components of complex information objects: pages form chapters, chapters form a book, lines form stanzas, stanzas form a poem.


Of these three categories, descriptive metadata for electronic resources has received the most attention, notably through the Dublin Core metadata initiative[4], and its 15 optional and repeatable core elements, most of them descriptive: title, subject, description, source, language, relation, coverage, creator, publisher, contributor, rights, date, type, format, and identifier.


The goal of the Foulché-Delbosc project with regard to metadata is to establish three subsets within the previous HTML cataloguing system. The basic one will be the MARC-compliant subset of bibliographical data for librarians. This subset will establish a markup in accordance with the specifications of Ibermarc, the MARC format adopted by Iberian and Latin American libraries. A second subset will be related to transcription and philological information, according to the TEI and Digital Scriptorium proposals. Transcripts will be dealt with by a distinct DTD. Formal information will be converted into metadata. A third subset will be devoted to metadata about the electronic process itself, including information about the persons involved in the cataloguing and digitizing processes. A specific set of metadata will be devoted to preservation.




A presentation aimed at increasing awareness of the challenges posed by digital preservation, the long-term retention of digital objects, has to underscore metadata needs for digital objects beyond resource discovery.


Effective management of all but the crudest forms of digital preservation is likely to be facilitated by the creation, maintenance, and evolution of detailed metadata in support of the preservation process. For example, metadata could document the technical processes associated with preservation, specify rights management information, and establish the authenticity of digital content. It can record the chain of custody for a digital object, and uniquely identify it both internally and externally in relation to the archive in which it resides. In short, the creation and deployment of preservation metadata is likely to be a key component of most digital preservation strategies.
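One common way to support claims of authenticity is to record a checksum together with custody events. The sketch below illustrates that idea only; it is not the project's preservation schema, and the function names and field names (identifier, sha256, size_bytes, custody) are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def preservation_record(identifier: str, content: bytes, agent: str) -> dict:
    """Describe a digital object: identity, fixity value and first custody event."""
    return {
        "identifier": identifier,
        "sha256": hashlib.sha256(content).hexdigest(),  # fixity information
        "size_bytes": len(content),
        "custody": [{
            "agent": agent,
            "event": "ingest",
            "when": datetime.now(timezone.utc).isoformat(),
        }],
    }


def is_unaltered(record: dict, content: bytes) -> bool:
    """Check that the content still matches the checksum stored at ingest."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]
```

Re-running is_unaltered after every migration or copy gives a simple, repeatable test that the object in hand is still the one that was originally archived.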


Dublin Core Elements for this paper


Title: XML in cataloguing ancient funds in Hispanic Libraries

Creator: Marcos Marín, Francisco A.

Subject: XML, library,  metadata

Description: Describes the evolution of the Foulché-Delbosc project in the National Library of Argentina and the state of the art concerning its transition from html to metadata standards.

Publisher: Laboratorio de Lingüística Informática. Universidad Autónoma de Madrid.

Date: 20011126

Type: Text.Report

Format: text/html

Identifier: http://www.lllf.uam.es/~fmarcos/coloquio/CLiP2001.htm

Language: en
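The record above could be serialized in XML roughly as follows. This is a sketch under assumptions: the wrapper element name (metadata) and the function name to_dublin_core are hypothetical, and splitting the Subject line into three repeated elements is a choice made here to show that Dublin Core elements are repeatable. The namespace URI is the one published for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Values copied from the record above; Subject is split into repeated elements.
RECORD = [
    ("title", "XML in cataloguing ancient funds in Hispanic Libraries"),
    ("creator", "Marcos Marín, Francisco A."),
    ("subject", "XML"),
    ("subject", "library"),
    ("subject", "metadata"),
    ("publisher", "Laboratorio de Lingüística Informática. "
                  "Universidad Autónoma de Madrid."),
    ("date", "20011126"),
    ("type", "Text.Report"),
    ("format", "text/html"),
    ("identifier", "http://www.lllf.uam.es/~fmarcos/coloquio/CLiP2001.htm"),
    ("language", "en"),
]


def to_dublin_core(pairs) -> str:
    """Serialize (element, value) pairs as namespaced Dublin Core XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in pairs:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")
```

Because every element is optional and repeatable, the same function serves for a fuller record (for instance one that adds the Description) without any change to the code.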




Digital information is exposed to corruption and alteration, intentional or not. It is fragile. Hardware and software technologies change, and so do storage media. Those changes can make important data unusable. Archivists, librarians, philologists and computer scientists are compelled to plan for migration from one format to a newer one, and to design strategies accordingly. Future platforms must be capable of emulating current hardware and software behavior. Metadata is key to ensuring that resources will survive and continue to be accessible into the future.




[1] The Catalogue was actually based on a previous publication, Catalogue de la Bibliothèque Hispanique de M. R. Foulché-Delbosc. Abbeville: Imprimerie F. Paillart, 1920; it was issued in Mayenne: Imprimerie Floch, 1936.

[2] The Agencia Española de Cooperación Internacional, AECI, the Secretaría de Cultura of the Presidency of the Argentine Republic, the Secretaría de Estado de Universidades, Ministerio de Educación y Cultura of Spain (PR1997-0019 0023659550), the Universidad Autónoma de Madrid, and the Biblioteca Nacional de la República Argentina. It is only fair to acknowledge the debt to the personal involvement of Esperanza Aguirre, Minister of Education and Culture of Spain, and Beatriz K. de Gutiérrez-Walker, Secretary of Culture of Argentina. Both gave courageous support to the project from the very beginning. Fernando R. Lafuente, first as Director General del Libro, Archivos y Bibliotecas and later as Director of the Instituto Cervantes, helped on many concrete issues, making the project feasible.

[3] It is only fair to add immediately that it is not totally true that nobody knew that the books were in Buenos Aires. At least four notes had been published since the acquisition. They did not seem to reach their audience. The acquisition was duly reviewed in the Memoria of the Biblioteca Nacional corresponding to 1936, and in 1937, in the first issue of the Revista de la Biblioteca Nacional, “La Biblioteca Nacional durante el quinquenio 1932-1936.” Revista de la Biblioteca Nacional I.1 (enero-marzo 1937): 206, published again under the direction of Gustavo Martínez Zuviría, after a long period of silence following the publication of the last number of the Anales de la Biblioteca Nacional, which had been directed by Paul Groussac. Martínez Zuviría was known in literary circles as Hugo Wast, his pen name. He was appointed Director of the Biblioteca Nacional by the de facto government of General José Félix Uriburu on October 30, 1931, and remained Director until 1955, with the exception of two very short periods in 1941 and 1943. During the first of them he served as Federal Interventor in the Catamarca province. During the second he was the Minister of Justice and Public Instruction in another de facto government, that of General Pedro Pablo Ramírez. Cf. Horacio Salas, Biblioteca Nacional Argentina, Buenos Aires: Manrique Zago ediciones, 1997, pp. 80-82. The person who actually purchased the books at the auction, Jorge Max Rohde, was also well known in the intellectual circles of Buenos Aires. In 1937 the copies entered the sala de reservados, later Sala del Tesoro, as Colección F-D. At that moment they were taken care of, catalogued and included in the card files of the Reserved Room. At least one notice appeared in an international journal, included by Milton A. Buchanan in his “Bibliographical Notes,” Hispanic Review, 9, 1941, 228-230.
When Verónica Zumárraga and I visited the Library in 1990, trying to prepare an ADMYTE CD-ROM with manuscripts and old printed books of the Biblioteca Nacional, at that time under the direction of Mr. Castiñeira de Dios and still in the Calle México, nobody could give us any information or assistance. Even those who had a vague idea of the existence of a collection of ancient books had never actually seen them, with the exception of those who wrote the notes referred to in this presentation. Nevertheless, when a substantial part of the old holdings of the Library was microfilmed, most of the F-D books were also preserved in this way. Even if the quality of the filming is not excellent, particularly in the case of manuscripts, and microfilms do not improve with the passing of time, that basic step was taken. In 1992 a new reference to the collection was made, although it was included in a publication with limited circulation: Ofelia N. Salgado, “Buenos Aires, Bibliothèque Nationale (Mexico 564, 1097 Buenos Aires, Argentine). Fonds Raymond Foulché-Delbosc.” Nouvelles du Livre Ancien 71 (1992), 5-6. After the move to the new location of the Library, Hugo Acevedo wrote the chapter “Biblioteca Nacional de Argentina” in the book Historia de las Bibliotecas Nacionales de Iberoamérica: pasado y presente. Asociación de Bibliotecas Nacionales de Iberoamérica (ABINIA), edited by José G. Moreno de Alba and Elsa M. Ramírez Leyva. Mexico: UNAM, 1995 (2nd ed.), pp. 3-24, esp. 15-16. There he referred to the acquisition in 1936 and emphasized, among other bibliographical treasures, “varias ediciones de La Celestina”, with special mention of the volume of “Sevilla 1502”, one of the three extant copies of that printing. See Georgina Olivetto, “Ejemplares de Celestina de la colección Foulché-Delbosc en la Biblioteca Nacional de la República Argentina”, Celestinesca, 1998, 22.1, 67-74.
Jack Weiner has included a note about the collection in his paper “Sebastián de Horozco (1510-1579) y su Prouerbios y Consejos que Qualquier Padre Deue Dar a su Hijo (Salamanca, 1607): Estudio y Edición,”  Annali Sezione Romanza XXXVIII, 1996, n. 2, 431-450. Isabel Jones, the widow of Foulché-Delbosc, had a copy of the Catalogue where she had written the names of the buyers at the auction. That copy of the Catalogue was donated by her to the Library of the University of Toronto, where it is preserved.

[4] The Dublin Core Metadata Element Set arose from discussions at a 1995 workshop sponsored by OCLC and the National Center for Supercomputing Applications (NCSA). As the workshop was held in Dublin,  Ohio, the element set was named the Dublin Core. The continuing development of the Dublin Core and related specifications is managed by the Dublin Core Metadata Initiative (DCMI) (http://dublincore.org/).