BibTeX records in LDAP - Building a non-standard white-pages service.

Pierangelo Masarati had the idea of storing bibliographical references in a directory. His LDAP2BibTeX package provides a schema and two utility programs: one converts an existing collection of bibliographical references in BibTeX bib format to LDIF, which can be loaded into a directory server, and the other retrieves the information from the directory again.


Schema and DIT

The schema designed by Pierangelo Masarati uses one object class (``bibtexEntry'') to store a reference. Entries of this type must have a ``bibtexEntryTag'' attribute that holds the string used in the LaTeX \cite{} command and a ``bibtexEntryType'' attribute that describes the kind of publication, e.g. book, manual or article. All other attributes are optional.

The approach taken in this thesis, however, was to build a two-level object class hierarchy. The abstract ``bibtexEntry'' class is a collection of attributes common to all BibTeX resource types. It includes the mandatory ``cn'' (short for ``common name'') attribute, whose value is used as the parameter to the LaTeX \cite{} command. ``cn'' is also the naming attribute for ``bibtexEntry''. A structural object class exists for every BibTeX resource type. These classes are derived from ``bibtexEntry''. Attributes that are required for a resource type, e.g. the publisher of a book, are declared as mandatory in the corresponding object class.

With the modified schema, applications can take advantage of the information contained in the schema without having to be adapted specifically to BibTeX. For example, a dialog for creating a new reference could first enumerate the available resource types and then display fields only for those attributes that make sense for the chosen type.
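
As a rough illustration of such a schema-driven dialog, the following Python sketch models the two-level hierarchy as a plain dictionary and derives the list of resource types and the fields to display. Only ``cn'' on ``bibtexEntry'' and the publisher on ``bibtexBook'' are taken from the text; the placement of the other attributes as optional and the helper names resource_types and fields_for are assumptions made for this example. A real application would read the same information from the schema published by the directory server.

```python
# Hypothetical in-memory copy of the schema; the optional attributes listed
# here are assumptions, not the schema actually defined in the thesis.
SCHEMA = {
    "bibtexEntry": {
        "sup": "top", "kind": "abstract",
        "must": ["cn"],
        "may": ["bibtexAuthor", "bibtexTitle", "bibtexYear"],
    },
    "bibtexBook": {
        "sup": "bibtexEntry", "kind": "structural",
        "must": ["bibtexPublisher"], "may": [],
    },
    "bibtexTechReport": {
        "sup": "bibtexEntry", "kind": "structural",
        "must": [], "may": [],
    },
}


def resource_types(schema):
    """Return the structural classes derived directly from bibtexEntry."""
    return sorted(
        name for name, oc in schema.items()
        if oc["kind"] == "structural" and oc["sup"] == "bibtexEntry"
    )


def fields_for(schema, name):
    """Collect mandatory and optional attributes along the class hierarchy."""
    must, may = [], []
    while name in schema:            # stops at "top", which is not modelled
        oc = schema[name]
        must = oc["must"] + must     # inherited attributes come first
        may = oc["may"] + may
        name = oc["sup"]
    return must, may


if __name__ == "__main__":
    for resource_type in resource_types(SCHEMA):
        print(resource_type, fields_for(SCHEMA, resource_type))
```

Because the dialog logic only walks the class hierarchy, adding a new resource type would require a schema change but no change to the application.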

Another guideline in designing the schema was to store information in human-readable and standards-compliant form while still keeping the semantics of TeX. This makes the information accessible with standard tools while avoiding any information loss. TeX special characters can generally be converted to the corresponding Unicode characters. However, no Unicode representation exists for mathematical formulas, which sometimes appear in titles. The ``author'' attribute is also an area of concern. First, BibTeX uses curly brackets to get name prefixes right. Curly brackets are also used to mark words in titles whose case must be preserved. Second, all authors of a document appear on one line, separated by the word ``and''. When stored in the directory, the author field should be a multi-valued attribute, which allows for better search capabilities. However, LDAP does not guarantee the order of values in a multi-valued attribute. To cope with these problems, all values that contain special TeX code are additionally stored in an attribute subtype that is identified by the ;lang-x-tex tag. The ;lang-x-tex form should be used by BibTeX and for editing purposes. When information is displayed by applications that are not BibTeX-aware, the base form is used instead.

For example, the entry for [30] would look like this in LDIF notation:

   dn: cn=Howes:1999:UDL, cn=Bibliography
   cn: Howes:1999:UDL
   objectclass: top
   objectclass: bibtexEntry
   objectclass: bibtexBook
   bibtexAuthor: T. Howes
   bibtexAuthor: M. Smith
   bibtexAuthor: G. Good
   bibtexAuthor;lang-x-tex: T. Howes and M. Smith and
     G. Good
   bibtexTitle: Understanding and Deploying LDAP 
     Directory Services
   bibtexTitle;lang-x-tex: Understanding and Deploying 
     {LDAP} Directory Services
   bibtexYear: 1999
   bibtexPublisher: Macmillan Technical Publishing
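
The split between the base form and the ;lang-x-tex form of the author field in the entry above can be sketched in Python as follows. This is a minimal illustration, not the code used by the conversion tools: the function names are made up for this example, and strip_braces only removes curly brackets, whereas a full conversion would also map TeX special characters to Unicode.

```python
def split_authors(tex_authors):
    """Split a BibTeX author field on the word 'and' outside curly brackets."""
    names, current, depth = [], [], 0
    for token in tex_authors.split():
        if token == "and" and depth == 0:
            names.append(" ".join(current))
            current = []
        else:
            current.append(token)
            # Track nesting so that "{Barnes and Noble}" stays one name.
            depth += token.count("{") - token.count("}")
    if current:
        names.append(" ".join(current))
    return names


def strip_braces(value):
    """Produce the base form by dropping the TeX grouping brackets."""
    return value.replace("{", "").replace("}", "")


def author_attributes(tex_authors):
    """Return (attribute, value) pairs for one BibTeX author field."""
    tex_authors = " ".join(tex_authors.split())   # normalise whitespace
    pairs = [("bibtexAuthor", strip_braces(name))
             for name in split_authors(tex_authors)]
    # The complete TeX form keeps the author order, which the multi-valued
    # base attribute cannot guarantee.
    pairs.append(("bibtexAuthor;lang-x-tex", tex_authors))
    return pairs


if __name__ == "__main__":
    for attr, value in author_attributes("T. Howes and M. Smith and G. Good"):
        print(f"{attr}: {value}")
```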


Converting bibliographies to LDIF

The original package includes bibtex2ldap, a tool that converts bib files to LDIF. It is written as a lex and yacc parser. For this thesis the program was extended to be more tolerant of variations in the bib syntax. It also migrates only those attributes that are defined in the schema.
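
The filtering step can be illustrated with a short Python sketch. It is not a port of bibtex2ldap: parsing of the bib file is left out, the two mapping tables cover only the names that appear in this chapter, and the function name entry_to_ldif is an assumption made for this example. A complete converter would also have to base64-encode values that are not safe in LDIF and escape special characters in the DN.

```python
# BibTeX field name -> attribute type; deliberately only a small subset.
SCHEMA_ATTRS = {
    "title": "bibtexTitle",
    "year": "bibtexYear",
    "publisher": "bibtexPublisher",
}

# BibTeX entry type -> structural object class.
OBJECT_CLASSES = {
    "book": "bibtexBook",
    "techreport": "bibtexTechReport",
}


def entry_to_ldif(entry_type, key, fields, base_dn="cn=Bibliography"):
    """Render one already parsed bib entry as an LDIF record."""
    lines = [
        f"dn: cn={key}, {base_dn}",
        f"cn: {key}",
        "objectclass: top",
        "objectclass: bibtexEntry",
        f"objectclass: {OBJECT_CLASSES[entry_type.lower()]}",
    ]
    for name, value in fields.items():
        attr = SCHEMA_ATTRS.get(name.lower())
        if attr is None:
            continue                 # field is not defined in the schema
        lines.append(f"{attr}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(entry_to_ldif(
        "book", "Howes:1999:UDL",
        {"year": "1999", "publisher": "Macmillan Technical Publishing",
         "isbn": "not migrated"},
    ))
```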

Separate tools have been developed for two special bibliographical collections:

rfc-parse2ldif.pl processes the index file for IETF Requests for Comments. RFCs are stored as ``bibtexTechReport'' entries. Where applicable, these entries are augmented by the ``rfcStatus'' auxiliary object class. This allows information about the status of an RFC (e.g. current category, earlier or later revisions) to be stored within its entry.
id-parse2ldif.pl does the same for Internet Drafts. In addition to bibliographic information, the source contains abstracts, which are also migrated.


Retrieving BibTeX records from LDAP

When compiling a tex source file, LaTeX writes meta-information, such as section names for the table of contents or referenced citations, into an aux file. The provided l2b.pl utility scans this file, optionally recursing into included files, and builds a hash of all references. A connection is then opened to an LDAP server and a search is performed for each entry in the hash. The results of these queries are converted to bib format and printed to stdout.
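
The scanning step can be sketched in Python as follows. This is an illustration of the idea rather than a translation of l2b.pl: the function names and the exact search filter are assumptions, and the LDAP search itself is omitted. The sketch relies on LaTeX writing one \citation{...} line per \cite command and one \@input{...} line per included file into the aux file.

```python
import re

CITATION_RE = re.compile(r"\\citation\{([^}]*)\}")
INPUT_RE = re.compile(r"\\@input\{([^}]*)\}")


def cited_keys(aux_text):
    """Collect the distinct citation keys found in the text of an aux file."""
    keys = set()
    for match in CITATION_RE.finditer(aux_text):
        # \cite{a,b} is recorded as \citation{a,b}
        for key in match.group(1).split(","):
            key = key.strip()
            if key:
                keys.add(key)
    return keys


def included_aux_files(aux_text):
    """Return the aux files of included documents, for optional recursion."""
    return INPUT_RE.findall(aux_text)


def escape_filter_value(value):
    """Escape the characters that are special in LDAP search filters."""
    for char, code in (("\\", r"\5c"), ("*", r"\2a"),
                       ("(", r"\28"), (")", r"\29"), ("\0", r"\00")):
        value = value.replace(char, code)
    return value


def search_filter(key):
    """Build a filter matching the entry for one citation key (assumed form)."""
    return f"(&(objectclass=bibtexEntry)(cn={escape_filter_value(key)}))"


if __name__ == "__main__":
    aux = "\\relax\n\\citation{Howes:1999:UDL}\n\\@input{chapter1.aux}\n"
    for key in sorted(cited_keys(aux)):
        print(search_filter(key))
```

Collecting the keys in a set before searching means that a reference cited several times is looked up only once, which is the purpose of the hash in the original utility.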


A web front-end with Java Servlets

A Java servlet was developed with the help of the Netscape Directory SDK for Java to allow efficient management of bibliographical references in the directory. It makes active use of schema information and stores its resource strings in the directory. To this end, an ``ldapServletSchemaItem'' auxiliary object class with two structural subclasses (``ldapServletObjectClass'' and ``ldapServletAttributeType'') was introduced. Both subclasses include a ``displayName'' attribute, which stores a friendly name for a schema item, for example ``Book'' for ``bibtexBook''. Support for additional languages can thus be added simply by adding an attribute value with the appropriate lang subtype. To use this information in the servlet, two new classes (BibtexAttributeSchema and BibtexObjectClassSchema) were derived from the respective classes for schema items in the SDK. These classes read the additional information from an entry in the directory and provide methods to return it to the servlet.

A pool of persistent connections to the directory server is used to maximise performance. To serve an HTTP request, the servlet requests a connection from the pool. This avoids the overhead of establishing a new LDAP connection for each request. By using the integrated session management of the servlet container, authenticated LDAP connections are used for tasks that involve modifying entries in the directory.
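
The pooling idea is independent of the servlet API and can be sketched in a few lines of Python. The servlet itself is written in Java, so this is only an illustration of the technique: the ConnectionPool class and the factory argument are hypothetical, and the factory stands in for whatever call opens a connection to the directory server.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool of connections that are opened once and then reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())    # open all connections up front

    @contextmanager
    def connection(self, timeout=None):
        """Borrow a connection for one request and return it afterwards."""
        conn = self._idle.get(timeout=timeout)   # blocks while the pool is empty
        try:
            yield conn
        finally:
            self._idle.put(conn)                 # hand it back even on errors


if __name__ == "__main__":
    # object() is a placeholder for a real LDAP connection.
    pool = ConnectionPool(factory=object, size=2)
    with pool.connection() as conn:
        print("serving request with", conn)
```

The number of connections opened is bounded by the pool size regardless of how many requests are served, which is where the saving over one connection per request comes from.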


Conclusions and future work

This example shows how a directory service tailored to a specific application can be designed. A more general approach to storing bibliographic information might use the standards proposed by the Dublin Core meta-data initiative instead of a BibTeX-specific schema. In addition, implementing support for the format proposed in [55] is worth considering.

Norbert Klasen 2001-10-22