Building Greenstone collections


What is the "Greenstone Librarian Interface"?

The Greenstone Librarian Interface (GLI) is a graphical tool for building new collections, altering or deleting existing collections, and exporting existing collections to stand-alone CD-ROMs. It allows you to import or assign metadata, and has an interactive collection design module. Launch the GLI under Windows by selecting Greenstone Digital Library from the Programs section of the Start menu and choosing Librarian Interface. Under Linux, run gli.sh from the gsdl/gli directory. For details on using the Librarian Interface see the Greenstone User's Guide.

What is "the Collector"?

The Collector is a web interface for building, altering and exporting collections. It predates the Librarian Interface, and for most practical purposes the Librarian Interface should be used instead. To begin using the Collector, click the "The Collector" button on your Greenstone home page. For further details on using the Collector, see the Greenstone User's Guide.

How do I build a collection from the command line or DOS prompt?

It's occasionally preferable to build your Greenstone collections from the command line rather than from the Librarian Interface, because this gives you greater control over how your new collection turns out. This page has an overview of the collection building process, and the Greenstone Developer's Guide has detailed step-by-step instructions on building collections from the command line.

I built a new Greenstone collection on my Windows machine. Everything appeared to work fine while building; however, when I tried to view the collection, some of the documents contained no text. Sometimes Greenstone appeared to crash completely. What have I done wrong?

Are you running Norton Anti-Virus? There are some incompatibilities between Norton and the Greenstone collection building process that cause unpredictable things to happen if you build your collection while Norton is running. Try disabling Norton and rebuilding the collection.

If you do not have Norton, or disabling Norton does not solve the problem, please contact us for further help.

Why won't the Collector's "export to CD-ROM" function work?

If you downloaded Greenstone from the web you will not have all the components required to make the "export to CD-ROM" function work. These extra components have been made available in a separate download which you can get from the download page.

I'm trying to use the Collector on Windows 2000 but it's running extremely slowly. Is this normal?

Are you using a Netscape web browser with the local library? If so, try using Internet Explorer instead. There are some socket connection problems that show up on Windows 2000 when using Netscape.

What is "the Organizer"?

The Organizer (also called the "Collection Organizer") is a Windows utility used for automatically generating some of the configuration files (metadata.xml, sub.txt etc.) used by complex Greenstone collections.

Where do I get the Organizer?

From the download page.

I'm attempting to build a collection with the Collector but it keeps failing with an error. What am I doing wrong?

There are several reasons why the Collector might fail to build a collection, and the error messages it produces are not always very helpful.

If you changed the default configuration during the configure collection stage, make sure the changes are valid. For example, if you added a new classify or plugin line, check that the classifier and/or plugin names and arguments are all correct. If they're not, the Collector will fail. A good test is to build your collection without changing the configuration. If it builds OK with the default configuration but fails after you change the configuration, look closely at the changes you're making.

Another good thing to do if you are having problems with the Collector is to build your collection from the command line instead. You'll get much more feedback to help debug problems when building in this way. For details on how to build a collection from the command line, see the Greenstone Developer's Guide.

What options are available for the collect.cfg file?

See here for a list of all configuration file options.

Where can I find some example collect.cfg configuration files?

The collect.cfg files for many of the collections at www.nzdl.org have been made available here.

How can I build my collection using MGPP?

The MGPP user manual gives some instructions.

How do I fix XML::Parser errors?

Our Mac OS X Greenstone distributions are built on machines using Perl 5.6, and these distributions contain a few binary Perl modules. These cause problems if you are using a more recent version of Perl such as 5.8 or 5.8.1 (type "perl -v" at the command line to see your version).

On the Mac, our distribution contains modules for both perl 5.6 and 5.8 and the correct one should (hopefully) be installed.

A typical error message during import.pl would be:

Uncaught exception from user code: Can't load
'/home/httpd/gsdl/perllib/cpan/auto/XML/Parser/Expat/Expat.so' for module XML::Parser::Expat:/home/httpd/gsdl/perllib/cpan/auto/XML/Parser/Expat/Expat.so:
undefined symbol:PL_sv_undef at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/DynaLoader.pm line 229. at /home/httpd/gsdl/perllib/cpan/XML/Parser.pm line 14

To remedy this, remove the "gsdl/perllib/cpan/perl-5.8/XML" and "gsdl/perllib/cpan/perl-5.8/auto" directories. (For versions earlier than 2.52, remove "gsdl/perllib/cpan/XML" and "gsdl/perllib/cpan/auto".) Then install the Perl XML::Parser module natively for your system.

On Red Hat or Mandrake, install the .rpm named "perl-XML-Parser". On Debian, install the "libxml-parser-perl" package. For other Linux distributions, use your distribution's package, or get the module from http://search.cpan.org/~msergeant/XML-Parser-2.34/.

You may also need to get Expat, available from http://sourceforge.net/projects/expat/.
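Once the bundled copies are removed and the module is installed natively, one way to check that your system Perl can find it is to ask Perl to load it. The following is a minimal sketch, assuming a Unix-like shell with perl on the PATH; the helper name perl_module_loads is hypothetical and is not part of Greenstone.

```shell
# Hypothetical helper (not part of Greenstone): succeeds if the named
# Perl module can be loaded by whichever perl comes first on the PATH.
perl_module_loads() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Show the Perl version in short form ("perl -v" gives the long form).
perl -e 'printf "perl %vd\n", $^V'

# Report whether the natively installed XML::Parser is usable.
if perl_module_loads XML::Parser; then
  echo "XML::Parser loads"
else
  echo "XML::Parser does not load; install it natively for your system"
fi
```

This only tests the system Perl's own module path. It says nothing about the copies under gsdl/perllib/cpan, which is why those directories need to be removed first.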

Are there any limits to the size of collections?

The largest collections we have built have been 7 Gb of text, and 11 million short documents (about 3 Gb text). These built with no problems. We haven't tried larger amounts of text because we don't have larger amounts of text lying around. It's no good using 7 Gb twice over to make 14 Gb because the vocabulary hasn't grown accordingly, as it would with a real collection.

There are three main limitations:

  1. There is a file size limit of 2 Gb on Linux (soon to be increased to infinity, the Linux people say). I don't know about corresponding figures for Windows; we use Linux for development. There are systems that go higher, but we don't have access to them.

    The compressed text will hit the limit first. MG stores the compressed text in a single file. 7 Gb will compress to just under 2 Gb, so you can't go much higher without splitting the compressed-text file (hacky, but probably easy).

  2. Technical. There is a Huffman coding limitation which we would expect to run into at collections of around 16 Gb. However, the solution is very easy; we just haven't bothered to implement it because we haven't yet encountered the problem.
  3. Build time. For building a single index on an already-imported collection, extrapolations indicate that on a modern machine with 1 Gb of main memory, you should be able to build a 60 Gb collection in about 3 days. However, there are often large gaps between theory and practice in this area! The more indexes you have, the longer things take to build.
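To get a rough feel for how these limits interact, the sketch below scales the two figures quoted above (7 Gb of text compressing to just under 2 Gb, and a 60 Gb collection building in about 3 days) linearly to other collection sizes. Linear scaling is an assumption made purely for illustration, as are the function names; real compression ratios and build times depend on the text, the number of indexes and the machine. It needs only a POSIX shell and awk.

```shell
# Illustrative arithmetic only: both functions assume the quoted
# figures scale linearly, which real collections may not do.

# Estimated size of the compressed text file, in Gb, for a given
# amount of raw text in Gb (7 Gb -> just under 2 Gb, taken as 2/7).
estimate_compressed_gb() {
  awk -v raw="$1" 'BEGIN { printf "%.2f\n", raw * 2 / 7 }'
}

# Estimated time, in days, to build a single index for a given
# amount of raw text in Gb (60 Gb -> about 3 days).
estimate_build_days() {
  awk -v raw="$1" 'BEGIN { printf "%.1f\n", raw * 3 / 60 }'
}
```

For example, "estimate_compressed_gb 14" prints 4.00, which is twice the 2 Gb file size limit, and "estimate_build_days 20" prints 1.0.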

In practice, the solution for very large amounts of data is not to treat the collection as one huge monolith, but to partition it into subcollections and arrange for the search engine to search them all together behind the scenes. However, while you can amalgamate the results of searching subcollections fairly easily, it's much harder with browsing. Of course, A-Z lists and datelists and the like aren't really much use with very large collections. This is where new techniques of hierarchical phrase browsing come into their own. And the really good news is that you can partition a collection into subcollections, each with individual phrase browsers, and arrange to view them all together in a single hierarchical browsing structure, as one coordinated whole. We haven't actually demonstrated this yet, but it seems quite feasible.

A test collection was built by "Archivo Digital", an office that depends on the "Archivo Nacional de la Memoria" (National Memory Archive in English), in Argentina. It contained sequences of page images with associated OCR text.

Setup details

  • Greenstone version: 2.52
  • Server: Pentium IV 1.8 GHz, 512 Mb RAM, Windows XP Prof.
  • Number of indexed documents: 17,655
  • Number of images (tiff format): 980,000
  • Total size of text files: 3.2 Gb
  • Built indexes: section:text document:Title
  • Used Plugin: PagedImgPlug
  • 5 classifiers

Statistics

  • Time to import the collection: Almost a week was spent collecting documents and importing them. No image conversion was done.
  • Time to build the collection (excluding import): almost 24 hours. The archives and the indexes were on separate hard disks, to reduce the overhead that reading and writing from the same disk would cause.
  • Time to open a hierarchy node that contains 908 objects: 23 seconds
  • Average Time to search only one word in text index: 2 to 5 seconds
  • Average Time to search 3 words in text index: 2 to 5 seconds
  • Average Time to search exact phrases (includes 4, 5 and 6 words): 30 seconds

How do I enter non-English metadata in GLI?

Metadata in the GLI should be entered in UTF-8. If your system doesn't allow typing directly in UTF-8 (your metadata looks like ??? in GLI), then type your metadata in another application such as Notepad, save it as UTF-8, then open it again and cut and paste into GLI. If the metadata has been properly entered in UTF-8, then it should appear fine in a browser once the collection is built.

If your metadata appears as square boxes in GLI, then you will need to use a different font to display it. You can change the font in GLI by going to File->Preferences. The font that you will need to use depends on what language you are using and what fonts are installed on your computer. A good one to try is Arial Unicode MS, PLAIN, 12.
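If your metadata already exists in a text file in some other encoding, it can also be converted to UTF-8 from the command line before pasting it into GLI. The following is a minimal sketch, assuming a Unix-like system with the iconv utility installed; the function names and the file names in the usage note are hypothetical.

```shell
# Convert a text file from a legacy encoding to UTF-8.
# Usage: to_utf8 SOURCE_ENCODING INPUT_FILE OUTPUT_FILE
to_utf8() {
  iconv -f "$1" -t UTF-8 "$2" > "$3"
}

# Succeed only if the file is already valid UTF-8. iconv exits with
# an error when it meets a byte sequence that is illegal in UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example, "to_utf8 ISO-8859-1 titles.txt titles-utf8.txt" converts a Latin-1 file, and "is_utf8 titles-utf8.txt" can then be used to confirm the result. You need to know the source encoding yourself; iconv does not detect it.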

How do I change the search results order?

The order of search results depends on the kind of query you are running. For simple (MG) collections, search results are either ranked (for a 'some' or ranked search) or in build order (for an 'all' or boolean search). MG cannot do ranking and boolean searching at the same time. For advanced (MGPP) collections, search results can be in ranked or build order as above, but this doesn't depend on the kind of search you are doing. Boolean searches may be ranked.

Build order is the seemingly random order in which documents are processed during import. This can be changed by using the sortmeta option to import.pl. If a metadata element is specified here, then documents will be sorted during import by that metadata.

This option can be specified as an option to import.pl (in GLI Expert mode), or specified in the collect.cfg file. Note that it needs to be added manually to collect.cfg, for example:

sortmeta dc.Date

It cannot be added to the config file using GLI at this stage.
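Since the line has to be added by hand, a small script can do it repeatably. The following is a minimal sketch for a Unix-like shell; the function name set_sortmeta is hypothetical, and the path to collect.cfg is whatever applies to your collection.

```shell
# Append a "sortmeta" line to a collect.cfg file unless one is
# already present.
# Usage: set_sortmeta CONFIG_FILE METADATA_ELEMENT
set_sortmeta() {
  cfg=$1
  element=$2
  # Refuse to add a second sortmeta line.
  if grep -q '^sortmeta[[:space:]]' "$cfg"; then
    echo "sortmeta is already set in $cfg; edit it by hand" >&2
    return 1
  fi
  # Make sure the file ends with a newline before appending.
  if [ -n "$(tail -c 1 "$cfg")" ]; then
    echo >> "$cfg"
  fi
  printf 'sortmeta %s\n' "$element" >> "$cfg"
}
```

For example, "set_sortmeta collect.cfg dc.Date" adds the line shown above. The collection then needs to be imported and built again for the new order to take effect.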

For MGPP collections, in advanced searching mode, the query form has a drop-down box specifying "display search results in ranked/natural order". If you have sorted the documents by metadata, you may like to change the text for this box, e.g. to have it display "ranked/date order". To achieve this, add the following line to the collection's collect.cfg file:

collectionmacro query:textnatural "date"

What's the difference between MG, MGPP, Lucene?

Greenstone gives you a choice of three tools to index your collection. MG is the default indexer; MGPP and Lucene can be used by turning on "Enable Advanced Searching" in the "Search Types" section of the "Design" panel in the Librarian Interface.

MG
This is the original indexer used by Greenstone, developed mainly by Alistair Moffat and described in the classic book Managing Gigabytes. It does section level indexing, and searches can be boolean or ranked (not both at once). For each index specified in the collection, a separate physical index is created. For phrase searching, Greenstone does an "AND" search on all the terms, then scans the resulting hits to see if the phrase is present. It has been extensively tested on very large collections (many GB of text). MG in Greenstone.
MGPP
This new version of MG (MG plus plus) was developed by the New Zealand Digital Library Project. It does word level indexing, which allows fielded, phrase and proximity searching to be handled by the indexer. Boolean searches can be ranked. Only a single index is created for a Greenstone collection: document/section levels and text/metadata fields are all handled by the one index. For collections with many indexes, this results in a smaller collection size than using MG. For large collections, searching may be a bit slower due to the index being word level rather than section level. MGPP user guide
Lucene
Lucene was developed by the Apache Software Foundation. It handles field and proximity searching, but only at a single level (e.g. complete documents or individual sections, but not both). Therefore document and section indexes for a collection require two separate indexes. It provides a similar range of search functionality to MGPP with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide. Lucene home page


How do I build my collection incrementally?

At the moment, it's best to create and configure your collection using the Librarian Interface, but then do the building phase using the command line. For a brief introduction to command line building, see this page.

You need to use Lucene as your indexer (set this on Design->Search Indexes).

We only support incremental addition, so if you want to change the documents or metadata, then you will need to do a full import and build.

If you change the design of the collection (plugin options, search indexes, classifiers) then you will need to do a full rebuild.

Once you have set up the collection using GLI, it will live in greenstone/collect/collname, where collname is the short collection name shown in brackets in the GLI title bar.

The source documents and metadata live in the import subfolder.

To build the collection the first time, do:

import.pl collname
buildcol.pl collname
rename building to index

Next time you need to update the collection, you can add documents and metadata for those documents directly into the import directory, or using the Librarian Interface.

To add these into the collection, do:

import.pl -incremental collname
buildcol.pl -incremental -builddir <path-to-index-dir> collname

The -incremental option to both import.pl and buildcol.pl tells them not to delete the current archives/building directory, and to import/index only those documents that are new. The -builddir option to buildcol.pl is necessary if the building directory has been deleted or renamed. By default, buildcol.pl puts its output into building. If you have renamed the initial building directory to index, then you need to tell buildcol.pl to use that directory instead.
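The two sequences above can be wrapped in a small script. The following is a minimal sketch for a Unix-like shell, assuming the Greenstone environment is already set up so that import.pl and buildcol.pl can be run by name. The function names and the DRY_RUN switch are hypothetical, and the second argument is the collection's own directory (for example greenstone/collect/collname). On Windows the rename step would use the DOS rename command instead of mv.

```shell
# Run a command, or just print it when DRY_RUN=1 is set.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# First full build: import, build, then rename building to index.
# Usage: first_build COLLNAME COLLECTION_DIR
first_build() {
  collname=$1
  collect_dir=$2
  # Guard: mv would nest building inside an existing index directory.
  if [ -e "$collect_dir/index" ]; then
    echo "$collect_dir/index already exists" >&2
    return 1
  fi
  run import.pl "$collname"
  run buildcol.pl "$collname"
  run mv "$collect_dir/building" "$collect_dir/index"
}

# Later updates: add only new documents, building into index.
# Usage: incremental_update COLLNAME COLLECTION_DIR
incremental_update() {
  collname=$1
  collect_dir=$2
  run import.pl -incremental "$collname"
  run buildcol.pl -incremental -builddir "$collect_dir/index" "$collname"
}
```

Setting DRY_RUN=1 before calling either function prints the commands without running them, which is a cheap way to check the paths before touching a real collection.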


Can I build collections using the Librarian Interface on a remote server?

The Greenstone installation running on the server will need to be set up for remote collection building. Full instructions on how to set this up and use it can be found on the remote building page.