This document is written for a TERENA activity. The aim is to list existing automatic classification systems and to state the availability of these software products. It is an outcome of an extensive but most probably not comprehensive survey of the web and of other related documents. In addition some of the project participants were contacted to get additional information. As new projects are on their way and other older projects may turn out to be interesting, this work should be seen as work in progress, although no concrete plans exist right now to continue this survey in near future.
Long before the Web hit the world, problems of finding the right data, hidden knowledge or undiscovered correlations in a big data collection had been addressed. For different types of data, different approaches where used: Knowledge Discovery in Databases (KDD) techniques were used if the data were stored in a database as a complete set of entries, Information Retrieval (IR) techniques were used for heterogeneous unordered set of data, like the WWW. In contrast to a deterministic approach of KDD (exact match), IR rather states probabilities. KDD uses a monothetic classification, IR rather a polythetic, which means attributes of an object can make the object a member of more than one class. Thus you need more than one fitting attribute to decide to which class the object should be put. In KDD the retrieval language is artificial, i.e. a combination of Boolean operators whereas the retrieval language of IR is naturally and more vague, which makes it more fault tolerant. The answer will always be a probability and not an exact match.
Data mining is the process of finding patterns in the data and is relevant to both mentioned techniques. There are different functions of data mining:
Artificial neural networks are a means of representing findings of data mining. For Internet data Information Retrieval techniques fit well, especially if the data mining consists of probabilistic search methods, classification and clustering.
Before data mining techniques can be applied the documents have to be pre-processed to create a document index with frequency and weightings of the location of the document terms (title, abstract, keywords or body of the text). Different methods of term indexing exist:
Instead of using stop word lists, the frequency of a word in a document (term frequency) can be used for selecting the words. It had been shown that a medium frequency points to the highest significance (high frequency are stop words and low frequency means low significance).
Another method is to use thesauri for structuring words according to their meaning. A thesaurus is a collection of relevant terms ordered in a hierarchy of superordinate and subordinate concepts and homonyms. Thesauri are very sumptuary to maintain and need special knowledge. There are also different methods for the actual clustering after the documents had been pre-processed, i.e. the different statistical methods to determine similarity of documents. It has shown that specific techniques are appropriate for specific subjects. Artificial neural networks are a non linear extension of such classical statistical methods and are especially useful for classification and clustering. They can be self organizing and self expandable via different techniques, of which to only name two here: Self Organizing Map (SOM) [3] for visualizing relations in a multidimensional space and Competitive Learning, which is used to minimize errors and maximize entropy to enhance the amount of possible matchings. Another neural network is the Hopfield Net. For a good introduction to neural networks, see [4].
After trying to summarize the computational and statistical methods coming from information theory, the rest of this document will list different projects in which such methods have been used for automatic classification of WWW documents. This list is by no way complete. There has been such a high number of projects dealing with automatic classification or similar problems (e.g. auto summarization) that the task to describe all of them would definitely be outside a preliminary study than this one. Here it is more aimed to list some relevant projects that show the variety of techniques. Only projects that already have achieved results were included in this list. Some new EU-funded projects like ARION (Advanced Lightweight Architecture for accessing Scientific collections, http://dlforum.external.forth.gr:8080), ARTISTE (An integrated Art Analysis and Navigation Environment, see http://www.cultivate-int.org/issue1/artiste/) or BINDEX (Bilingual Automatic Parallel Indexing and Classification, http://www.iai.uni-sb.de/bindex/home.html) were thus not treated here. A number of specific questions were the background for the project list:
Not all questions could be answered for all projects. Not all projects could be contacted.
GERman Harvest Automated Retrieval and Directory is a research project of Bibliotheks- und Informationssystem (BIS, http://www.bis.uni-oldenburg.de/) of the University of Oldenburg, Institut für Semantische Informationsverarbeitung of the University of Osnabrück (ISIV, http://www.isiv.uni-osnabrueck.de/) and Oldenburger Forschungs- und Entwicklungsinstitut für Informatik-Werkzeuge und -Systeme of the University of Oldenburg (OFFIS, http://www.offis.uni-oldenburg.de/) funded by the Deutsche Forschungsgemeinschaft (DFG). The project web-site is: http://www.gerhard.de.
GERHARD created a HARVEST based robot-generated index of Web resources in Germany using indexing and automatic classification and provides a search and a hierarchical browsing facility based on the Universal Decimal Classification (UDC) which is maintained electronically at the Library of the ETH Zürich. The UDC is three-lingual (English, German, French) and includes about 60.000 entries with 13 different relations. The automatic classification works in three steps:
The software is implemented in C and Perl. The classification module is a client/server implementation in C. The UDC data is stored in an relational database. With the exception of a program library for stemming, which belongs to a company Lingsoft from Helsinki, the code is available for researchers. Currently GERHARD provides access to 1,284,819 documents with 6,182,891 relations between them. For more detailed information on GERHARD see [5].
There had been co-operation with the EU project DESIRE with the aim to replace the HARVEST component with the gatherer developed at NetLab in Lund.
The first phase of GERHARD ended in 1998 and a second phase was projected but is not visible on the net yet. From personal communication with Bernd Diekmann from BIS, I know that the second phase was granted very recently and that it will start in October 2001. It is planned to improve the statistics software without changing the overall concept, as well as to enrich the UDC data with bibliographic resources that were manually classified by using UDC. Other new features will be automatic recognition of language and document types.
Scorpion is a research project of the OCLC Office of Research attempting to combine indexing and cataloguing based on the observation that these are complementary activities. Scorpion specifically focuses on building tools for automatic subject recognition by combining library science and information retrieval techniques. For instance, to assign subject codes to a document, the document can be treated as a query against a Dewey Decimal System database using ranked retrieval. The results of the search can then be treated as the subjects of the document. Subject assignment in this manner provides clear differentiation from the traditional computer indexing behind the currently available free search services. Scorpion has predetermined conceptual clusters (the DDC) that it tries to place documents into. The clustering methods are described in [6]. The project site can be found at http://orc.rsch.oclc.org:6109/.
Jean Godby, OCLC wrote the
following to my question about software availability: "Right now,
the Scorpion software is available only through a research license. However, we
intend to release the software, minus the Dewey Decimal Classification
database, as an Open Source project in the near future. This would enable
researchers to substitute their own classification database."
.
Development of a European Service for Information on Research and Education was a EU-funded Project in two phases (1996-1998 and 1998-2000). One part of the work, done by NetLab of Lund University Library (http://www.lub.lu.se/), especially by Traugott Koch, was to evaluate automatic classification and compare and combine these techniques with manual classification. The subject chosen was engineering and as basis for both, the manual and the automatic classification, the Ei thesaurus was used ( http://www.ei.org). Data sources were HTML files and files with record-syntax format used in the Combine Harverster (http://www.lub.lu.se/combine). The pre-processing included weightings according to the location and deletion of stop words, etc. and optionally stemming with the Porters stemming algorithm. Then a simpler matching algorithm was used to compare the words of the texts with the words of the thesaurus. The final weighting of the results is done by term complexity/classification type, match location (metadata, headings, other text) and matching frequency.
More detailed information about this work can be found in [7], a demonstration page presenting all practical results is available at http://www.lub.lu.se/desire/demonstration.html. A good summary of universal classification systems and their usability for automatic classification can be found in [8].
Two software packages had been developed as Perl modules in DESIRE II on this subject: LoadTermList for loading thesaurus and stop lists, and Matcher for getting and extracting text and matching thesaurus against text-mark-up groups. This software should be freely available.
Experiments were made in the frame of DESIRE to auto-classify the collection on engineering used in DESIRE with the software and thesaurus (UDC) of GERHARD. The results showed that GERHARD "probably works well when treating a heterogeneous database of documents from all subject areas. ... To use it for a large collection of subject specific documents, however, a lot of further improvements and adaptions would be needed." [7] The results of these tests will feed into the new phase of GERHARD. Similar experiments with the Scorpion software and DDC had been planned. I could not find any indication that this has happened yet.
Cora is a special-purpose search engine covering computer science research papers. It allows keyword searches over the partial text of Postscript-formatted papers it has found by spidering the Web. Cora provides access to over 50,000 research papers on all computer science subjects. The construction of Cora has been greatly automated by the use of artificial intelligence and machine learning techniques. Efficient topic-directed spidering is performed using reinforcement learning; papers (in Postscript format) are automatically categorized into the topic hierarchy by probabilistic techniques; papers' titles, authors, references, etc, are automatically extracted using hidden Markov models. Basically the software provides a means for automated creation and maintaining portals by the use of machine learning techniques. A detailed overview can be found in [9]
Cora is the result of a research project at Just Research (http://www.justresearch.com/) led by Andrew McCallum with participation of Carnegie Mellon University, Pitzburg. It seems that the technology is now maintained by whizbang! Labs (http://www.whizbanglabs.com/), where it was used for to build one of the Internet’s largest commercial online recruiting site, FlipDog.com, where job seekers can search through more than 500,000 jobs gathered from nearly 50,000 employers’ own Web sites. The current Cora website is at http://cora.whizbang.com/.
WEBSOM was a project at the Neural Network Research Centre of the Helsinki University of Technology headed by T. Kohonen. WEBSOM is a method for organizing miscellaneous text documents onto meaningful maps for exploration and search. WEBSOM is based on the SOM (Self-Organizing Map) algorithm that automatically organizes the documents onto a two- dimensional grid so that related documents appear close to each other. This reduction to a two dimensional SOM for visualization purposes leads to significant information loss. The method was used to classify over a million documents from over 80 Usenet newsgroups. The map is displayed as a hierarchy of sensitive web-images that can be browsed. The project uses the public domain software package SOM_PAK available at: http://www.cis.hut.fi/research/som-research/nnrc-programs.shtml. The project site can be found at: http://websom.hut.fi/websom/.
A similar approach was used in the project "Concept based Categorization and Search on the Internet: A Machine Learning, Parallel Computing Approach" of the Artificial Intelligence Lab of the University of Arizona in Tuscon headed by H. Chen and funded mainly by an NSF/CISE "Intelligent Internet Categorization and Search" project (1995-1998) and the NSF/ARPA/NASA Illinois Digital Library Initiative project (1994-1998). More than 10,000 entertainment related Web pages were analysed using SOM technology. Again the result can be browsed in hierarchical ordered concept maps. With a Concept Search feature the user can alternatively search via a thesaurus. Both facilities can be found at: http://ai.bpa.arizona.edu/ent/entertainment. For a more detailed description, see [10].
Based on SOM technology and on the research carried out within the Digital Library Initiative (DLI) project at the University of Illinois, partially funded by ARPA ITO, the Interspace project aims to bring a new level in network information management based on correlation of knowledge and "intended to be universal to all repositories of all types". Its semantics are based upon statistical clustering techniques from information retrieval and image processing. Concept spaces, collections of abstract concepts generated from concrete objects are seen as independent of the physical objects they represent. The statistical analysis is not based on single terms but on up to 5-word phrases. It relies on and enhances the results of the TIPSTER project (see http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/). Different statistical algorithms had been evaluated and performance-wise evolved: concept spaces (co-occurrence matrices), category spaces (Kohonen maps), metadata generation (Hopfield nets) and meta-map generation.
A lot of software had been designed and implemented in the project using Java, Smalltalk and CORBA technology. The Interspace system is a layered system with a kernel layer providing the Interspace Analysis Environment and an underlying service layer with different modules organized by a domain management and fed by the Multimedia Concept Extraction unit. The modules are:
It is unclear to me whether an implementation of this architecture is available and if so under which licence. The software was used to generate concept space e.g. from 2.6 Million abstracts on engineering and from 1.5 Million abstracts of articles on cancer. The results can be browsed with the Interspace's Remote Access Client at http://www.canis.uiuc.edu/archive/Ira/index.html.
The project site can be found at http://www.canis.uiuc.edu/projects/interspace/ where you can find the very detailed and comprehensive project proposal with a lot of information on similar research as well as a documentation of the architecture. Apparently the project started in 1998 and was completed in 2000. The key personnel consists of Bruce Schatz and Duncan Lawrie, both University of Illinois, Charles Herring, U.S. Army Construction Engineering Research Laboratories (CERL), Hsinchun Chen, University of Arizona and others.
Open Architecture Server for Information Search and Delivery was a EU-funded project, the Consortium of which consisted University College Dublin, Ireland, St.-Petersburg State University, Russia, University of Tübingen, Germany, JSC Peterlink, St.- Petersburg, Russia, Valtek Ltd., Kiev, Ukraine and DSI Ltd., Irkutsk, Russia.
The OASIS software is intended to perform intelligent information search and delivery service based on artificial neural network techniques and methods. The cluster analysis method is called Hierarchical Radius based Competitive Learning (HRCL) and used a multidimensional neural network that is self learning and able to find clusters on different hierarchical levels. It thus is able to create an Internet catalogue automatically. The clustering is not dependent on the sequence of the analysed documents. No classification system or thesaurus is needed, since it will be created automatically. Since more than two dimensions are used it is not possible to visualize the relations of the cluster in forms of a map. OASIS system uses relevance feedback from a user not only to refine the user's query, but to refine its internal algorithms also. It can accumulate query and relevance feedback statistics, analyse it and use results of this analysis to increase the precision of search. LDAP is used for the storage of metadata information. The OASIS search engine is distributed. Several OASIS server are connected via CORBA and store information about different document collections. The server perform a chaining of search requests transparently for the client. This intelligent forwarding is done via LDAP. For the merging of results from different servers again neural network technology is used. More information about the system is available in [1] and [11].
The software is public domain (GPL) and can be downloaded at: http://www.oasis-europe.org/download/. The project site can be found at http://www.oasis-europe.org/. There is a new spin-off company from the University of Tübingen, InSuMa (http://www.insuma.de), set up by the developers of OASIS, that are interested in co-operations with research activities.
It shows that considerable work has been done in the field of automatic classification of text documents. Although the overall structure of the procedures are similar (preparation of the data, statistical analysis, visualization), the different project used different technologies in this structure (different word reduction techniques, different Thesauri if used at all, different statistical algorithms, etc.). It seems that some technologies seem to fit better to some subjects than others, and systems specialized on a single subject perform better. Neural network based analysis seems to be a promising approach, as well as the use of hierarchical thesauri. It would be interesting to see how a hierarchical thesaurus can be stored with directory technology. Although the new projects mentioned in the beginning of this text also do new research on text document classification, the focus of research has shifted to automatic classification of multi media objects, a different and more complex task.
Ideally you would give a toolbox of different solutions to the digital librarians to enable them to experiment and find out which techniques suit their specific needs in their specific subject. It would be helpful to have standardized interfaces to the different modules to make such a toolbox expandable. There is a lot of software freely available for the research community for all steps and most techniques of automatic classification. In a TERENA activity with the focus on computing science such a toolbox could be started. Since the statistical and linguistic problems involved are quite complex I would strongly advise to look for at least one partner already involved in such work to take part in the TERENA activity.
[1] Udo Heuser, Automatische Internet-Katalogisierung mit Hilfe des Hierarchischen Radius-basierten Competitive Learnings, (thesis, University of Tübingen), Tübingen 2000.
[2] M.F. Porter, An algorithm for suffix stripping, Program 14,3, July 1980, pp. 130-137.
[3] T. Kohonen, Self Organizing Maps, Springer-Verlag, 1995. Second extended ed. 1997.
[4] Simon Catterall, Mind and Machine Module, http://www.physics.syr.edu/courses/modules/MM/.
[5] H.-J. Wätjen, B. Diekmann, G. Möller, K.-U. Carstensen, Bericht zum DFG-Projekt: GERHARD, 16.6.1998
[6] Srividhya Subramanian, Keith Shafer, Clustering, http://orc.rsch.oclc.org:6109/clustering.html.
[7] Traugott Koch, Anders Ardö, Automatic classification of full-text HTML documents from one specific subject area, DESIRE II D3.6a Working Paper 2, http://www.lub.lu.se/desire/DESIRE36a-WP2.html.
[8] Traugott Koch, Nutzung von Klassifikationssystemen zur verbesserten Beschreibung, Organisation und Suche von Internet Ressourcen, Buch und Bibliothek 50:5 (1998), pp. 326-335. (Available at: http://www.lub.lu.se/tk/publ/bubmanus.html).
[9] A.K. McCallum, K. Nigam, J. Rennie, K. Seymore, Automating the Construction of internet Portals with Machine Learning, Kluwer Academic Publishing, 2000, http://www.cs.cmu.edu/~Knigam/papers/cora-jul.ps.gz.
[10] H. Chen, C. Schuffels, R. Orwig, Internet Categorization and Search: A Machine Learning Approach, Journal of Visual Communication and Image Representation, Special Issue on Digital Libraries, Vol 7, Nr. 1, pp. 88-102, 1996. (Available at http://ai.bpa.arizona.edu/papers/som95/som95.html).
[11] Udo Heuser, Wolfgang Rosenstiel, Automatic Construction of local Internet directories using hierarchical Radius-based competitive learning, Proc. of the 4th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2000) July 23-26, 2000, Orlando/Florida, Volume IV (Comunications Systems and Networks), pp. 436-441 (Available at http://www-ti.informatik.uni-tuebingen.de/~heuser/index_de.html).