Greenstone: A Comprehensive Open-Source Digital Library Software System

Ian H. Witten,* Rodger J. McNab,† Stefan J. Boddie,* David Bainbridge*

* Dept of Computer Science, University of Waikato, New Zealand
  E-mail: {ihw, sjboddie, davidb}@cs.waikato.ac.nz
† Digilib Systems, Hamilton, New Zealand
  E-mail: [email protected]
ABSTRACT
This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections. Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible: software “plugins” accommodate different document and metadata types.

INTRODUCTION
Notwithstanding intense research activity in the digital library field during the second half of the 1990s, comprehensive software systems for creating digital libraries are not widely available. In fact, the usual solution when creating a digital library is also the most obvious: just put it on the Web. But consider how much effort is involved in constructing a Web site for a digital library. To be effective it needs to be visually attractive and ergonomically easy to use, incorporate convenient and powerful searching capabilities, and offer rich and natural browsing facilities. Above all it must be easy to maintain and augment, which presents a significant challenge if any manual organization is involved.

The alternative is to automate these activities through software tools. But the broad scope of digital library requirements makes this a daunting prospect. Ideally the software should incorporate facilities ranging from multilingual information retrieval to distributed computing protocols, from interoperability to search engine technology, from metadata standards to multiformat document parsing, from multimedia to multiple operating systems, from Web browsers to plug-and-play DVDs.

The Greenstone Digital Library Software from the New Zealand Digital Library (NZDL) project tackles this issue by providing a new way of organizing information and making it available over the Internet. A collection of information comprises several (typically several thousand, or several million) documents, and a uniform interface is provided to all documents in a collection. A library may include many different collections, each organized differently, though there is a strong family resemblance in how collections are presented.

Making information available using this system is far more than “just putting it on the Web.” The collection becomes maintainable, searchable, and browsable. Each collection, prior to presentation, undergoes a “building” process that, once established, is completely automatic. This process creates all the structures that are used at run-time for accessing the collection. Searching is based on various indexes, while browsing is based on various metadata; support structures for both are created during the building operation. When new material appears it can be fully incorporated into the collection by rebuilding.

To address the exceptionally broad demands of digital libraries, the system is public and extensible. It is issued under the GNU General Public License and, in the spirit of open-source software, users are invited to contribute modifications and enhancements. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs. Currently the Greenstone software is used at sites in Canada, Germany, New Zealand, Romania, UK, and the US, and collections range from newspaper articles to technical documents, from educational journals to oral history, from visual art to folksongs. The software has been used for collections in many different languages, and for CD-ROMs that have been published by the United Nations and other humanitarian agencies in Belgium, France, Japan, and the US for distribution in developing countries (Humanity Libraries, 1998; PAHO, 1999; UNESCO, 1999; UNU, 1998). Further details can be obtained from www.nzdl.org.
This paper sets the scene with a brief discussion of what a digital library is. We then give an overview of the facilities offered by Greenstone and show how end users find information in collections. Next we describe the files and directories involved in a collection, and then discuss the processes of updating existing collections and creating new ones, including extending the software to provide new facilities. We conclude with an overview of related work.

WHAT IS A DIGITAL LIBRARY?
Ten definitions of the term “digital library” have been culled from the literature by Fox (1998), and their spirit is captured in the following brief characterization:

    A collection of digital objects, including text, video, and audio, along with methods for access and retrieval, and for selection, organization and maintenance of the collection

(Akscyn and Witten, 1998). Lesk (1998) views digital libraries as “organized collections of digital information,” and wisely recommends that they articulate the principles governing what is included and how the collection is organized.

Digital libraries are generally distinguished from the World-Wide Web, the essential difference being in selection and organization. But they are not generally distinguished from a web site: indeed, virtually all extant digital libraries manifest themselves as a web site. Hence the obvious question: to make a digital library, why not just put the information on the Web?

But we make a distinction between a digital library and a web site that lies at the heart of our software design: one should easily be able to add new material to a library without having to integrate it manually or edit its content in any way. Once added, new material should immediately become a first-class component of the library. And what permits it to be integrated into existing searching and browsing structures without any manual intervention is metadata. This provides sufficient focus to the concept of “digital library” to support the development of a construction kit.

OVERVIEW OF GREENSTONE
Information collections built by Greenstone combine extensive full-text search facilities with browsing indexes based on different metadata types. There are several ways for users to find information, although they differ between collections depending on the metadata available and the collection design. Typically you can search for particular words that appear in the text, or within a section of a document, or within a title or section heading. You can browse documents by title: just click on the displayed book icon to read it. You can browse documents by subject. Subjects are represented by bookshelves: just click on a shelf to see the books. Where appropriate, documents come complete with a table of contents (constructed automatically): you can click on a chapter or subsection to open it, expand the full table of contents, or expand the full document.

An example of searching is shown in Figure 1 where documents in the Global Help Project’s Humanity Development Library (HDL) are being searched for chapters matching the word butterfly. In Figure 2 the same collection is being browsed by subject: by clicking on the bookshelf icons the user has discovered an item under Section 16, Animal Husbandry. Pursuing an interest in butterfly farming, the user selects a book by clicking on its book icon. In Figure 3 the front cover of the book is displayed as a graphic on the left, and the automatically constructed table of contents appears at the start of the document. The current focus, Introduction and Summary, is shown in bold in the table of contents with its text starting further down the page.

Figure 1: Searching the HDL collection

In accordance with Lesk’s advice, a statement of purpose and coverage accompanies each collection, along with an explanation of how it is organized (Figure 1 shows the start of this). A distinction is made between searching and browsing. Searching is full-text and, depending on the collection’s design, the user can choose between indexes built from different parts of the documents, or from different metadata. Some collections have an index of full documents, an index of sections, an index of paragraphs, an index of titles, and an index of section headings, each of which can be searched for particular words or phrases. Browsing involves data structures created from metadata that the user can examine: lists of authors, lists of titles, lists of dates, hierarchical classification structures, and so on. Data structures for both browsing and searching are built according to instructions in a configuration file, which controls both building and serving the collection. Sample configuration files are discussed below.
Figure 2: Browsing the HDL collection by subject

Rich browsing facilities can be provided by manually linking parts of documents together and building explicit indexes and tables of contents. However, manually-created linking becomes difficult to maintain, and often falls into disrepair when a collection expands. The Greenstone software takes a different tack: it facilitates maintainability by creating all searching and browsing structures automatically from the documents themselves. No links are inserted by hand. This means that when new documents in the same format become available, they can be added automatically. Indeed, for some collections this is done by processes that wake up regularly, scout for new material, and rebuild the indexes, all without manual intervention.

Collections comprise many documents: thousands, tens of thousands, or even millions. Each document may be hierarchically organized into sections (subsections, sub-subsections, and so on). Each section comprises one or more paragraphs. Metadata such as author, title, date, keywords, and so on, may be associated with documents, or with individual sections of documents. This is the raw material for indexes. It must either be provided explicitly for each document and section (for example, in an accompanying spreadsheet) or be derivable automatically from the source documents. Metadata is converted to Dublin Core and stored with the document for internal use.

In order to accommodate different kinds of source documents, the software is organized so that “plugins” can be written for new document types. Plugins exist for plain text documents, HTML documents, email documents, and bibliographic formats. Word documents are handled by saving them as HTML; PostScript ones by applying a preprocessor (Nevill-Manning et al., 1998). Specially written plugins also exist for proprietary formats such as that used by the BBC archives department. A collection may have source documents in different forms: it is just a matter of specifying all the necessary plugins. In order to build browsing indexes from metadata, an analogous scheme of “classifiers” is used: classifiers create indexes of various kinds based on metadata. Source documents are brought into the Greenstone system through a process called importing, which uses the plugins and classifiers specified in the collection configuration file.

The international Unicode character set is used throughout, so documents (and interfaces) can be written in any language. Collections have so far been produced in English, French, Spanish, German, Maori, Chinese, and Arabic. The NZDL Web site provides numerous examples. Collections can contain text, pictures, and even audio and video clips; a text-only version of the interface is also provided to accommodate visually impaired users. Compression technology is used to ensure best use of storage (Witten et al., 1999). Most non-textual material is either linked to textual documents or accompanied by textual descriptions (such as photo captions) to allow full-text searching and browsing. However, the architecture permits the implementation of plugins and classifiers even for non-textual data.

The system includes an “administrative” function whereby specified users can examine the composition of all collections, protect documents so that they can only be accessed by registered users on presentation of a password, and so on. Logs of user activity are kept that record all queries made to every Greenstone collection (though this facility can be disabled).

Although primarily designed for Internet access over the World-Wide Web, collections can be made available, in precisely the same form, on CD-ROM. In either case they are accessed through any Web browser. Greenstone CD-ROMs operate on a standalone PC under Windows 3.X, 95, 98, and NT, and the interaction is identical to accessing the collection on the Web, except that response is faster and more predictable. The requirement to operate on early Windows systems is one that plagues the software design, but is crucial for many users, particularly those in underdeveloped countries seeking access to humanitarian aid collections. If the PC is connected to a network (intranet or Internet), a custom-built Web server provided on each CD makes exactly the same information available to others through their standard Web browser. The use of compression ensures that the greatest possible volume of information can be packed on to a CD-ROM.

The collection-serving software operates under Unix and Windows NT, and works with standard Web servers. A flexible process structure allows different collections to be served by different computers, yet be presented to the user in the same way, on the same Web page, as part of the same digital library, even as part of the same collection (McNab and Witten, 1998). Existing collections can be updated and new ones brought on-line at any time, without bringing the system down; the process responsible for the user interface will notice (through periodic polling) when new collections appear and add them to the list presented to the user.
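
The periodic polling mentioned above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: that each collection occupies its own directory under a common root, and that a collection counts as available once it has an `index` subdirectory. The function names and the reporting via `print` are hypothetical and are not taken from the Greenstone code.

```python
import os
import time


def discover_collections(root):
    """Return the names of subdirectories of `root` that contain an `index` directory.

    Assumption: the presence of `index` marks a collection as ready to serve.
    """
    found = set()
    for name in os.listdir(root):
        if os.path.isdir(os.path.join(root, name, "index")):
            found.add(name)
    return found


def poll(root, known, interval_seconds=60.0, rounds=1):
    """Check `root` periodically and report collections not seen before.

    `known` is updated in place and returned, so it can be passed to the next call.
    """
    for i in range(rounds):
        current = discover_collections(root)
        for name in sorted(current - known):
            print("new collection available:", name)
        known |= current
        if i + 1 < rounds:
            time.sleep(interval_seconds)  # wait before the next scan
    return known
```

Because the scan only reads the directory tree, a collection can be added while the server is running, and it appears in the list on the next polling round.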
FINDING INFORMATION
Greenstone digital library systems generally include several separate collections. A home page allows you to select a collection; in addition, each collection has its own “about” page that gives you information about how the collection is organized and the principles governing what is included.

All icons in the screenshots of Figures 1–4 are clickable. Those icons at the top of the page return to the home page, provide help text, and allow you to set user interface and searching preferences. The navigation bar underneath gives access to the searching and browsing facilities, which differ from one collection to another.

Each of the five buttons provides a different way to find information. You can search for particular words that appear in the text from the “search” page (or from the “about” page of Figure 1). This collection contains indexes of chapters, section titles, and entire books. The default search interface is a simple one, suitable for casual users; advanced searching (which allows full Boolean expressions, phrase searching, and case and stemming control) can be enabled from the Preferences page.

This collection has four browsable metadata indexes. You can access publications by subject by clicking the subjects button, which brings up a list of subjects, represented by bookshelves (Figure 2). You can access publications by title by clicking titles a-z (Figure 4), which brings up a list of books in alphabetic order. You can access publications by organization (i.e. Dublin Core “publisher”), bringing up a list of organizations. You can access publications by “how to” listing, yielding a list of hints defined by the collection’s editors. We use the Dublin Core as a base and extend it in an ad hoc manner to accommodate the individual requirements of collection designers.

Figure 3: Reading a book in the HDL

FILES IN A COLLECTION
When a new collection is created or material is added to an existing one, the original source documents are first brought into the system through a process known as “importing.” This involves converting documents into a simple HTML-like format known as GML (for “Greenstone Markup Language”), which includes any metadata associated with the document. Documents are assumed to be in the Unicode UTF-8 code (of which the ASCII characters form a subset).

Files and directories
There is a separate directory for each collection, which contains five subdirectories: the original raw material (import), the GML files created from this (archives), the final collection as it is served to users (index), a directory for use during the building process (building), and one for any supporting files (etc), including the configuration file that controls the collection creation procedure. Additional files might be required: for example, building a hierarchy of classifications requires a data file of sub-classifications.

The imported documents
In order to identify documents internally, a unique object identifier or OID is assigned to each original source document when it is imported and stored as metadata within that document. The OID is formed by hashing the contents of the original source document, which overcomes file duplication effects caused by mirroring. It is important that OIDs persist throughout the index-building process, so that a user’s search history is unaffected by rebuilding the collection.

Once imported, each document is stored in its own subdirectory of archives, along with any associated files (for example, images). To ensure compatibility with Windows 3.0, only eight characters are used in directory and file names, which causes annoying but essentially trivial complications.

Inside the documents
The GML format imposes a limited amount of structure on documents. Documents are divided into paragraphs. They can be split hierarchically into sections and subsections. OIDs are extended to identify these components by appending numbers, separated by periods, to a document’s OID. When a book is read, its section hierarchy is visible as the table of contents (Figure 3). Chapters, sections, subsections, and pages are all implemented simply as “sections” within the document. In some collections documents do not have a hierarchical subsection structure, but are split into pages to permit browsing within a retrieved document.
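
As a rough illustration of the identifier scheme used for imported documents, the following Python sketch derives an identifier from a document's content and extends it with period-separated section numbers. The choice of SHA-1, the `HASH` prefix, the 24-character truncation, and the function names are assumptions made for this example; the paper specifies only that OIDs are formed by hashing the content and are extended by appending numbers separated by periods.

```python
import hashlib
from typing import Sequence


def document_oid(content: bytes) -> str:
    """Return an identifier that depends only on the document's content.

    Assumption: SHA-1, the "HASH" prefix and the truncation are illustrative.
    """
    digest = hashlib.sha1(content).hexdigest()
    return "HASH" + digest[:24]


def section_oid(doc_oid: str, path: Sequence[int]) -> str:
    """Extend a document OID with period-separated section numbers."""
    return ".".join([doc_oid] + [str(n) for n in path])


if __name__ == "__main__":
    original = b"<html>Butterfly farming</html>"
    mirror_copy = b"<html>Butterfly farming</html>"
    oid = document_oid(original)
    # Identical content gives the same identifier, so mirrored copies collapse
    # and the identifier survives a rebuild of the collection.
    print(oid == document_oid(mirror_copy))
    # Second subsection of the third section of the document.
    print(section_oid(oid, [3, 2]))
```

Because the identifier is a function of the content alone, re-importing an unchanged file yields the same OID, which is what allows a search history to stay valid across rebuilds.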
The document structure is used for searchable indexes. There are three levels of index: documents, sections, and paragraphs, corresponding to the distinctions that GML makes; the hierarchical structure is flattened for the purposes of creating these indexes. Indexes can be of text, or metadata, or any combination. Thus you can create a searchable index of section titles, and/or authors, and/or document descriptions, as well as the document text.

Figure 4: Browsing titles in the HDL

UPDATING EXISTING COLLECTIONS
Updating an existing collection with new files in the same format is easy. For example, the raw material for the HDL is supplied in the form of HTML files marked up with <<TOC>> tags to split books into sections and subsections, and <<I>> tags to indicate where an image is to be inserted. For each book in the library there is a directory that contains a single HTML file representing the book, and separate files containing the associated images. An accompanying spreadsheet file contains the classification hierarchy; this is converted to a simple file format (using Excel’s Save As command).

Since the collection exists, its directory is already set up with subdirectories import, archives, building, index, and etc, and the etc directory will contain a suitable collection configuration file.

The updating procedure
To update a collection, the new raw material is placed in the import directory, in whatever form it is available. Then the import process is invoked, which converts the files into GML using the specified plugins. Old material for which GML files have previously been created is not re-imported. Then the build process is invoked to build the requisite indexes for the collection. Finally, the contents of the building directory are moved into the index directory, and the new version of the collection automatically becomes live.

This procedure may seem cumbersome. But all the steps are necessary for efficient operation with large collections. The import process could be performed on the fly during the building operation, but because building indexes is a multipass operation, the often lengthy importing would be repeated several times. The build process can take considerable time (a day or two for very large collections). Consequently, the results are placed in the building directory so that, if the collection already exists, it will continue to be served to users in its old form throughout the building operation.

Active users of the collection will not be disturbed when the new version becomes live; they will probably not even notice. The persistent OIDs ensure that interactions remain coherent: users who are examining the results of a query or browse operation will still retrieve the expected documents. If a search is actually in progress when the change takes place, the program detects the resulting file-structure inconsistency and automatically and transparently re-executes the query, this time on the new version of the collection.

How it works
The original material in the import directory may be in any format, and plugins are required to process each format type. The plugins that a collection uses must be specified in the collection configuration file. The import program reads the list of plugins and passes each document to each plugin in order until it finds one that can process it. When updating an existing collection, all plugins necessary to process new material should already have been specified in the configuration file.

The building step creates the indexes for both searching and browsing. The MG software is generally used to do the searching (Witten et al., 1999), and the mgbuild module is automatically invoked to create each of the indexes that is required. For example, the Humanity Development Library has three indexes, one for entire books, one for chapters, and one for section titles. Subdirectories of the index directory are created for each of these indexes.
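
The plugin selection step described under “How it works” (each document is offered to each plugin in order until one can process it) can be sketched in Python as follows. This is a minimal illustration, not the Greenstone implementation: the `Plugin` class, the use of filename patterns to decide whether a plugin applies, the file extensions, and the name `HTMLPlug` are all assumptions made for the example. `GMLPlug` and `TEXTPlug` are borrowed from the plugin names that appear in Figure 5.

```python
import re
from typing import Callable, List, Optional


class Plugin:
    """A hypothetical plugin: a filename pattern plus a parsing function."""

    def __init__(self, name: str, pattern: str, parse: Callable[[str], int]):
        self.name = name
        self.pattern = re.compile(pattern)
        self.parse = parse

    def process(self, filename: str) -> Optional[int]:
        """Return the number of documents processed, or None if not handled."""
        if not self.pattern.search(filename):
            return None
        return self.parse(filename)


def import_file(filename: str, plugins: List[Plugin]) -> Optional[str]:
    """Offer the file to each plugin in order; return the first handler's name."""
    for plugin in plugins:
        if plugin.process(filename) is not None:
            return plugin.name
    return None  # no plugin in the list recognised the file


# The order mirrors the order of the plugin list in a configuration file.
# The patterns and the one-document parsers are placeholders for this sketch.
PLUGINS = [
    Plugin("GMLPlug", r"\.gml$", lambda f: 1),
    Plugin("HTMLPlug", r"\.(htm|html|HTM|HTML)$", lambda f: 1),
    Plugin("TEXTPlug", r"\.txt$", lambda f: 1),
]
```

For example, `import_file("book.html", PLUGINS)` selects the HTML plugin, while a file that no pattern matches falls through the whole list and is left unprocessed.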
(a)
     1  creator         [email protected]
     2  maintainer      [email protected]
     3  public          True
     4
     5  indexes         document:text
     6  defaultindex    document:text
     7  plugins         GMLPlug TEXTPlug ArcPlug RecPlug
     8
     9  classify        AZList metadata=Title
    10
    11  collectionmeta  collectionname "generic text collection"
    12  collectionmeta  .document:text "documents"

(b)
     1  creator         [email protected]
     2  maintainer      [email protected]
     3  public          True
     4
     5  indexes         document:text document:From
     6  defaultindex    document:text
     7  plugins         GMLPlug EMAILPlug ArcPlug RecPlug
     8
     9  classify        AZList metadata=Title
    10  classify        DateList
    11
    12  collectionmeta  collectionname "Email messages"
    13  collectionmeta  .document:text "documents"
    14  collectionmeta  .document:From "email senders"
    15
    16  format          QueryResults \\\\
    17  <td>[link][icon][/link]</td><td>[Title]</td><td>[Author]</td>

Figure 5: Collection configuration files (a) generic, (b) for an email collection
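
To make the structure of these files concrete, here is a small Python sketch that reads directives of the kind shown in Figure 5 into a dictionary. It is an illustration only and rests on several assumptions: that the line numbers in the figure are annotations rather than part of the file, that each directive occupies one line, and that quoted strings follow shell-like quoting. The multi-line `format` statement and the address lines are left out of the sample, and `parse_config` is a hypothetical helper, not part of Greenstone.

```python
import shlex
from collections import defaultdict


def parse_config(text):
    """Parse 'directive arg arg ...' lines into {directive: [[args], ...]}.

    Repeated directives (for example several `classify` lines) accumulate in
    file order. Blank lines are skipped.
    """
    config = defaultdict(list)
    for line in text.splitlines():
        tokens = shlex.split(line)  # keeps "Email messages" as one token
        if not tokens:
            continue
        config[tokens[0]].append(tokens[1:])
    return dict(config)


# A subset of Figure 5(b), without the figure's line numbers.
SAMPLE = '''
public          True
indexes         document:text document:From
defaultindex    document:text
plugins         GMLPlug EMAILPlug ArcPlug RecPlug
classify        AZList metadata=Title
classify        DateList
collectionmeta  collectionname "Email messages"
collectionmeta  .document:text "documents"
'''

if __name__ == "__main__":
    cfg = parse_config(SAMPLE)
    print(cfg["plugins"][0])   # plugin names in the order they will be tried
    print(cfg["classify"])     # one entry per classify line
```

Keeping every occurrence of a directive as a separate entry preserves both the order of the plugin list and the fact that a collection may declare several classifiers.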
MG also compresses the text of the collection, and the image files are linked into the index subdirectory. Now none of the material in the import and archives directories is needed to run the collection, and it can be removed from the file system (though it would be needed if the collection were rebuilt).

Associated with each collection is a database stored in GDBM (GNU database manager) format. This contains an entry for each document, giving its OID, its internal MG document number, and metadata such as title. Information for each of the browsing indexes, which appear as buttons on the Greenstone search/browse bar, is also extracted during the building process and stored in the database. A “classifier” program is required for each browsing index to extract the appropriate information from GML documents. Like plugins, classifiers are written on an ad hoc basis for the particular information required, and where possible reused from one collection to another.

The building program creates the indexes based on whatever appears in the archives directory. The first plugin specified by all collections is one that processes GML files, and so if archives contains imported files they will be processed correctly. If it contains material in the original format, that will be converted using the appropriate plugin. Thus the import process is optional.

GML is designed to be fast and easy to parse, an important requirement when millions of documents are to be processed. Something as simple as requiring tags to be lower-case, for example, yields a substantial speed-up. In certain circumstances, however, it might be preferable to use a standardized format such as XML. This is straightforward to implement (just write an XML plugin), although we have not done so ourselves. Given the transitory nature of the imported data, to date, we have found GML a satisfactory and beneficial format.

CREATING NEW COLLECTIONS
Building new collections from scratch is only slightly different from updating an existing collection. The key new requirement is creating a collection configuration file, and a software utility is provided to help. Two pieces of information are required for this: the name of the directory that the collection will use (into which the source data and other files will eventually be placed), and a contact e-mail address for use if any problems are encountered by the software once the collection is up and running. The utility creates files and directories within the newly-named directory to support a generic collection of plain text documents. With suitable data placed in the import directory, building the collection at this point will yield a document-level searchable index of all the text and a browsable list of “titles” (defined in this case to be the document filenames).

To enhance the functionality and presentation, something anything but the most trivial collection will require, the configuration file must be edited. For a collection sourced from documents in an already supported data format, presented in a similar fashion to an existing collection, the
amount of editing is minimal. Importing new data formats and browsing metadata in ways not currently supported are more complex activities that require programming skills.

Modifying the configuration file
Figure 5b shows simple alterations to the generic configuration file in Figure 5a that was generated by the new-collection utility. TEXTPlug is replaced with EMAILPlug (line 7), which reads email files and extracts metadata (From, To, Date, Subject) from them. A classifier for dates is added (line 10) to make the collection browsable chronologically. The default presentation of search results is overridden (line 17) to display both the title of the message (i.e. Dublin Core Title) and its sender (i.e. Dublin Core Author). Elements in square brackets, such as [Title], are replaced by the metadata associated with a particular document. The built-in term [icon] produces a suitable image that represents the document (such as a book icon or page icon), and the [link]…[/link] construct forms a hyperlink to the complete document. Anything else in the format statement, which in this case is solely table-cell tags in HTML, is passed through to the page being displayed.

As this example shows, creating a new collection that stays within the bounds of the library’s established capabilities falls within the capability of many computer users, for instance computer-trained librarians. Extending Greenstone to handle new document formats and browse metadata in new ways is more challenging.

Writing new plugins and classifiers
Extensibility is obtained through plugins and classifiers. These are modules of code that can be slotted into the system to enhance its capabilities. Plugins parse documents, extracting the text and metadata to be indexed. Classifiers control how metadata is brought together to form browsable data structures. Both are specified in an object-oriented framework using inheritance to minimize the amount of code written.

A plugin must specify three things: what file formats it can handle, how they should be parsed, and whether the plugin is recursive. File formats are normally determined using regular expression matching on the filename. For example, the HTML plugin accepts all files that end in .htm, .html, .HTM, or .HTML. (It is quite possible, however, to write plugins that “look inside” the file as well.) For other files, the plugin returns undefined and the file is passed to the next plugin in the collection’s configuration file (e.g. Figure 5 line 7). If it can, the plugin parses the file and returns the number of documents processed. This involves extracting text and metadata and adding it to the library’s content through calls to add text and add metadata.

Some plugins (“recursive” ones) add extra files into the stream of data processed during the building phase by artificially reactivating the list of plugins. This is how directory hierarchies are traversed.

Plugins are small modules of code that are easy to write. We monitored the time it took to develop a new one that was different from any we had produced so far. We chose to make as an example a collection of HTML bookmark files, the motivation being to produce a convenient way of searching and browsing one’s bookmarked Web pages. Figure 6 shows a user searching for bookmarked pages about music. The new plugin took under an hour to write, and was 160 lines long (ignoring blank lines and comments), about the average length of existing plugins.

Figure 6: Searching bookmarked Web pages

Classifiers are more general than plugins because they work on GML-format data. For example, any plugin that generates date metadata in accordance with the Dublin Core can request the collection to be browsable chronologically by specifying the DateList classifier in the collection’s configuration file (Figure 7). Classifiers are more elaborate than most plugins, but new ones are seldom required. The average length of existing classifiers is 230 lines.

Classifiers must specify three things: an initialization routine, how individual documents are classified, and the final browsable data structure. Initialization takes care of any options specified in the configuration file (such as metadata=Title on line 9 of Figure 5b). Classifying individual documents is an iterative process: for each one, a call to document-classify is made. On presentation of the document’s OID, the necessary metadata is located and used to control where the document is added to the browsable data structure being constructed.

Once all documents have been added, a request is made for the completed data structure. Some classifiers return the data structure directly; others transform the data structure before it is returned. For example, the AZList classifier
divides the alphabetically sorted list of metadata into separate pages of about the same size and returns the alphabetic ranges for each one (Figure 4).

Figure 7: Browsing a newspaper collection by date

OVERVIEW OF RELATED WORK
Two projects that provide substantial open source digital library software are Dienst (Lagoze and Fielding, 1998) and Harvest (Bowman et al., 1994). The origins of Dienst (www.cs.cornell.edu/cdlrg) stretch back to 1992. The term has come to represent three entities: a conceptual architecture for distributed digital libraries; an open protocol for service communication; and a software system that implements the protocol. To date, five sample digital libraries have been built using this technology. They manifest themselves in two forms: technical reports and primary source documents.

Best known is NCSTRL, the Networked Computer

a page number, next and previous page buttons, and displaying a particular page at different resolutions. A text version of the page is also available upon which a searching option is also provided.

Started in 1994, Harvest is also a long-running research project. It provides an efficient means of gathering source data from the Internet and distributing indexing information over the Internet. This is accomplished through five components: gatherer, broker, indexer, replicator and cache. The first three are central to creating, updating and searching a collection; the last two help to improve performance over the Internet through transparent mirroring and caching techniques.

The system is configurable and customizable. While searching is most commonly implemented using Glimpse (glimpse.cs.arizona.edu), in principle any search engine that supports incremental updates and Boolean combinations of attribute-based queries can be used. It is possible to control what type of documents are gathered during creation and updating, and how the query interface looks and is laid out.

Sample collections cited by the developers include 21,000 computer science technical reports and 7,000 home pages. Other examples include a sizable collection of agriculture-related electronic journals and magazines called “tomato-juice” (accessed through hegel.lib.ncsu.edu) and a full-text index of library-related electronic serials (sunsite.berkeley.edu/IndexMorganagus). Harvest is also often used to index Web sites (for example www.middlebury.edu).

Comparing Greenstone with Dienst and Harvest, there are both similarities and differences. All provide substantial digital library systems, hence common themes recur, but they are driven by projects with different aims. Harvest, for instance, was not conceived as a digital library project at all, but by virtue of its selective document gathering process it can be classed (and is used) as one. While it provides sophisticated search options, it lacks the complementary service of browsing. Furthermore it adds
Science Technical Reference Library project
no structure or order to the documents collected, relying
(
www.ncstrl.org). This collection facilitates searching by
on whatever structures are present in the site that they
title, author and abstract, and browsing by year and author,
were gathered from. A proven strength of the design is its
across a distributed network of document repositories.
flexibility through configuration and customizationan
Documents can (where supported) be delivered in various
element also present in Greenstone.
formats such as PostScript, a thumbnail overview of the
Dienstbest exemplified through the NCSTRL
pages, and a GIF image of a particular page.
worksupports searching and browsing, like Greenstone.
The
Making of America resource is an example of a
Both use open protocols. Differences include a high
collection based around primary sourcesin this case
reliance in Dienst on user-supplied information when a
American social history, 1830−1900. It has a different
document is added, and a smaller range of document types
“look and feel” to NCSTRL, being strongly oriented
supported—although Dienst does include a document
toward browsing rather than searching. A user navigates
model that should, over time, allow this to expand with
their way through a hierarchical structure of hyperlinks to
relative ease.
reach a book of interest. The book itself is a series of
There are also commercial systems that provide similar
scanned images: delivery options include going directly to
digital library services to those described. However, since
corporate culture instills proprietary attitudes there is little
opportunity for advancement through a shared collaborative effort. Consequently they are not reviewed here.

CONCLUSIONS

Greenstone is a comprehensive software system for creating digital library collections. It builds data structures for searching and browsing from the material provided, rather than relying on any hand-crafting. The process is controlled by a configuration file, and once a collection exists new material can be added completely automatically. Browsing is based on Dublin Core metadata.

New collections can be developed easily, particularly if they resemble existing ones. Extensibility is achieved through software “plugins” that can be written to accommodate documents, and metadata, in different formats. Standard plugins exist for many document types; new ones are easily written. Browsing is controlled by “classifiers” that process metadata into browsing structures (by date, alphabetical, hierarchical, etc).

However, the most powerful support for extensibility is achieved not by technical means but by making the source code freely available under the GNU General Public License. Only through an international cooperative effort will digital library software become sufficiently comprehensive to meet the world’s needs with the richness and flexibility that users deserve.

ACKNOWLEDGMENTS

We gratefully acknowledge all those who have worked on the Greenstone software, and all members of the New Zealand Digital Library project for their enthusiasm and ideas.

REFERENCES

1. Akscyn, R.M. and Witten, I.H. (1998) “Report on First Summit on International Cooperation on Digital Libraries.” ks.com/idla-wp-oct98.
2. Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. (1994) “Scalable Internet resource discovery: Research problems and approaches.” Communications of the ACM, Vol. 37, No. 8, pp. 98–107.
3. Fox, E. (1998) “Digital library definitions.” ei.cs.vt.edu/~fox/dlib/def.html.
4. Humanity Libraries (1998) Humanity Development Library. CD-ROM produced by the Global Help Project, Antwerp, Belgium.
5. Lagoze, C. and Fielding, D. (1998) “Defining Collections in Distributed Digital Libraries.” D-Lib Magazine, Nov.
6. PAHO (1999) Virtual Disaster Library. CD-ROM produced by the Pan-American Health Organization, Washington DC, USA.
7. McNab, R.J., Witten, I.H. and Boddie, S.J. (1998) “A distributed digital library architecture incorporating different index styles.” Proc IEEE Advances in Digital Libraries, Santa Barbara, CA, pp. 36–45.
8. Nevill-Manning, C.G., Reed, T., and Witten, I.H. (1998) “Extracting text from PostScript.” Software: Practice and Experience, Vol. 28, No. 5, pp. 481–491; April.
9. UNESCO (1999) SAHEL point DOC: Anthologie du développement au Sahel. CD-ROM produced by UNESCO, Paris, France.
10. UNU (1998) Collection on critical global issues. CD-ROM produced by the United Nations University Press, Tokyo, Japan.
11. Witten, I.H., Moffat, A. and Bell, T. (1999) Managing Gigabytes: compressing and indexing documents and images, Morgan Kaufmann, second edition.