
Converting very large CDS/ISIS databases to Greenstone Collections

by John Rose, Honorary Research Associate, University of Waikato

Users with very large CDS/ISIS databases may experience difficulties in attempting to convert them to Greenstone collections using GLI. In such cases GLI may hang up or work for an inordinate amount of time without a result. This guide is intended to advise on steps which may be taken to overcome such problems.

1. Explode function

GLI may fail at the explode step because it was not designed to handle very large amounts of metadata (problems generally appear as a database approaches 15,000 records, though the threshold may be lower or higher depending on the size of the CDS/ISIS records).

If the problem is slowness rather than metadata overload, adjusting the records_per_folder parameter in the Explode Metadata Database window may solve it. This parameter spreads the records produced by exploding a metadata database across multiple subdirectories, so that GLI uses less memory and edits the metadata more quickly. The default value is 100; try a lower value, such as 10.
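As a rough illustration of what this parameter changes, the following Python sketch estimates how many subdirectories an explode would produce. It assumes that each subdirectory is filled up to records_per_folder records; the function name is hypothetical and not part of Greenstone.

```python
def estimated_folders(total_records: int, records_per_folder: int = 100) -> int:
    """Estimate the number of subdirectories created by the explode step.

    Assumption (not confirmed by the Greenstone documentation): every
    subdirectory holds at most `records_per_folder` records and all but
    the last one are full.
    """
    if total_records < 0 or records_per_folder < 1:
        raise ValueError("total_records must be >= 0 and records_per_folder >= 1")
    # Ceiling division using integers only.
    return -(-total_records // records_per_folder)
```

Under that assumption, lowering the value from 100 to 10 for a 15,000-record database would mean many more, but much smaller, subdirectories.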

If the explode function of GLI fails, there are three choices:

i) You may break the CDS/ISIS database into several sub-databases (exporting different MFN ranges to separate ISO files in CDS/ISIS, and reimporting them to CDS/ISIS databases with different names). You can then build separate Greenstone collections to be searched with the cross-collection search facility (to be set in the GLI Format panel). This has the disadvantage that browsing across more than one of the sub-databases at a time will not be possible.

ii) You can convert your CDS/ISIS database "as is" rather than exploding it; see section 1 of Creating Digital Libraries Based on CDS/ISIS Databases (http://greenstone.sourceforge.net/wiki/gsdoc/others/CDS-ISIS_to_DL.doc) to set up the "as is" collection, and section 2 of the present guide if there is trouble building it.

iii) You can switch to Greenstone command line mode, explained in detail in section 3. Note that if the command line is necessary to perform the explode step, it will also be required to build the collection (GLI cannot be expected to create a collection with more metadata than it could handle at the explode step).
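For option i) above, it can help to work out the MFN ranges before exporting. The following Python sketch splits a database into contiguous, nearly equal ranges. It assumes that MFNs run from 1 to the last MFN in the database; the function name is hypothetical, and the export itself still has to be carried out in CDS/ISIS.

```python
def mfn_ranges(last_mfn: int, parts: int) -> list[tuple[int, int]]:
    """Split MFNs 1..last_mfn into `parts` contiguous (first, last) ranges."""
    if last_mfn < 1 or parts < 1:
        raise ValueError("last_mfn and parts must both be at least 1")
    parts = min(parts, last_mfn)          # never create an empty range
    base, extra = divmod(last_mfn, parts)
    ranges = []
    start = 1
    for i in range(parts):
        # The first `extra` ranges take one additional record each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

For example, mfn_ranges(30000, 3) gives the ranges 1 to 10000, 10001 to 20000 and 20001 to 30000, each of which would be exported to its own ISO file.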

2. Create panel

GLI may also hang up or work for an inordinate amount of time without a result in the Build Collection process within the Create panel. This may happen in an "as is" conversion or when building a collection set up using the explode function.

The first remedy to try is changing the groupsize parameter. To do this, set GLI to Library Systems Specialist or Expert mode in the File/Preferences menu item, and set groupsize, which is 1 by default, to a larger number such as 100 or 1000 before rebuilding. groupsize controls how many documents go into one doc.xml file in the archives directory. Increasing groupsize is unlikely to allow a build to complete correctly if it does not work with a smaller groupsize, but should decrease the time required for a successful build.

3. Command mode

If the explode or build function cannot be performed in GLI, you should build your collection from the command line as explained in Chapter 1 of the Greenstone Developer's Guide (http://prdownloads.sourceforge.net/greenstone/Develop-en.pdf). The first step is to save and close your collection in GLI.

Under Windows, the next step is to get at the "command prompt", the place where you type commands. Try looking in the Start menu, or under the Programs submenu, for an entry like MS-DOS Prompt, DOS Prompt, or Command Prompt. If you can't find it, invoke the Run entry and try typing "command" (or "cmd") in the dialog box. If all else fails, seek help from one who knows, such as your system administrator.

Change into the directory where Greenstone has been installed. Assuming Greenstone was installed in its default location, you can move there by typing

cd "C:\Program Files\Greenstone"

(You need the quotation marks because of the space in Program Files.) Next, at the prompt type

setup.bat

This batch file (which you can read if you like) tells the system where to look for Greenstone programs. If, later on in your interactive session at the DOS prompt, you wish to return to the top-level Greenstone directory you can accomplish this by typing cd "%GSDLHOME%" (again, the quotation marks are here because of spaces in the filename). If you close your DOS window and start another one, you will need to invoke setup.bat again.

Now you are in a position to make, build and rebuild collections. The Greenstone Developer's Guide speaks first about the Perl program "mkcol.pl", whose name stands for "make a collection". You don't have to do this since you have already created the collection. Since you have already dragged the CDS/ISIS database files into the collection through the GLI Gather panel, you don't have to copy the document files for the collection into the import directory, either. Similarly, you don't have to worry about editing the "collect.cfg" file, since all of the information about metadata sets, indexes, browsing classifiers and formats will already have been saved in this file by GLI.

If GLI failed at the explode step, then this step can be implemented from the command line by typing

perl -S explode_metadata_database.pl -plugin ISISPlug -metadata_set exp <path to CDS/ISIS MST file>

Now type

perl -S import.pl -removeold your_collection_name

at the command prompt. "your_collection_name" is the short collection name of your collection (the first data that you entered into GLI for this collection). Don't worry about all the text that scrolls past—it's just reporting the progress of the import. Note that you do not have to be in either the collect or your_collection_name directories when this command is entered; because GSDLHOME is already set, the Greenstone software can work out where the necessary files are.

Next type

perl -S buildcol.pl your_collection_name

at the command prompt. Don't worry about the "progress report" text that scrolls past.
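If you expect to repeat these steps, the commands above can be chained so that a failure in one step stops the sequence. The following Python sketch is one possible way to do this; it uses only the commands and options quoted in this guide, its function names are hypothetical, and it assumes that it is started from a command prompt in which setup.bat has already been run.

```python
import subprocess


def greenstone_commands(collection, mst_path=None):
    """Return the command lines from this guide as argument lists.

    If `mst_path` is given, the explode step is included first.
    """
    commands = []
    if mst_path is not None:
        commands.append(["perl", "-S", "explode_metadata_database.pl",
                         "-plugin", "ISISPlug", "-metadata_set", "exp",
                         mst_path])
    commands.append(["perl", "-S", "import.pl", "-removeold", collection])
    commands.append(["perl", "-S", "buildcol.pl", collection])
    return commands


def run_all(commands):
    """Run each command in turn; stop with an exception if one fails."""
    for argv in commands:
        print("Running:", " ".join(argv))
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)
```

Calling run_all(greenstone_commands("your_collection_name")) would run the import and build steps one after the other.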

Make the collection "live" as follows: select the contents of the collection's building directory (in principle, greenstone\collect\your_collection_name\building) and drag them into the index directory (in principle, greenstone\collect\your_collection_name\index). Alternatively, you can remove the index directory (and all its contents) by typing the command

rd /s index (under Windows NT/2000/XP) or

deltree /Y index (under Windows 98)

and then change the name of the building directory to index with

ren building index

Finally, type

mkdir building

in preparation for any future rebuilds. It is important that these commands are issued from the correct directory (unlike the Greenstone commands mkcol.pl, import.pl and buildcol.pl). If the current working directory is not "your_collection_name", type

cd "%GSDLHOME%\collect\your_collection_name"

before going through the rd, ren and mkdir sequence above.
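The same rd, ren and mkdir sequence can be expressed in a form that does not depend on the Windows version. The following Python sketch is an illustration only, not a Greenstone tool; the function name is hypothetical, and it takes the collection directory as an argument, so the current working directory does not matter.

```python
import os
import shutil


def make_collection_live(collection_dir):
    """Replace the index directory with the freshly built building directory."""
    index_dir = os.path.join(collection_dir, "index")
    building_dir = os.path.join(collection_dir, "building")
    if not os.path.isdir(building_dir):
        raise FileNotFoundError("no building directory in " + collection_dir)
    # Equivalent of "rd /s index": remove the old index and its contents.
    if os.path.isdir(index_dir):
        shutil.rmtree(index_dir)
    # Equivalent of "ren building index".
    os.rename(building_dir, index_dir)
    # Equivalent of "mkdir building", ready for any future rebuild.
    os.mkdir(building_dir)
```

The check for the building directory comes first so that the existing index is not deleted when there is nothing to replace it with.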

You should now be able to access the newly built collection from your Greenstone homepage. You will have to reload the page if you already had it open in your browser, or perhaps even close the browser and restart it (to prevent caching problems). Alternatively, if you are using the "local library" version of Greenstone you will have to restart the library program. To view the new collection, click on the image or collection name that you had originally set in GLI.

If GLI fails at the explode step, it is likely that the metadata.xml files are too large for GLI to handle. In that case it would be appreciated if you set GLI to Expert mode in the File/Preferences menu item, rerun the explode process, and report the error message (for example, 'out of memory can not parse metadata.xml'), together with the total size of the database and the number and size of the CDS/ISIS records, to one of the Greenstone discussion lists.