The Sample XML texts collection.

This README provides directions on how to rebuild the collection if necessary, 
and how to create a new collection using this method.

Building the gberg collection (linux):

Go to top level Greenstone and run:
source gs3-setup.sh
Starting from the gberg directory (where this README file is located):
cd java
javac *.java
cd ..
java -classpath $CLASSPATH:./java BuildXMLColl $GSDL3HOME/sites/localsite gberg
mv building index
cp etc/buildConfig.xml index
cp import/*.dtd $GSDL3SRCHOME/resources/dtd/ (this step is already done)

The building process:

There are two stages, import and build.
Importing goes through the xml documents in the import directory and adds 
gs3:id attributes to indexable nodes. Building makes another pass though the 
documents, indexing any indexable nodes, and assigning them node ids. Node 
ids are made up of several parts: document id, scope, tag name, gs3:id. 
Document id refers to the work, and is generated from the original document 
filename. For example, origin.xml becomes origin. The scope part is optional. 
Tag name is the name of the node that is to be indexed, and gs3:id is its id.

The class XMLTagInfo provides two methods, isIndexable and isScopable, and is 
used to determine which tags provide scope, and which tags should be indexed. 
This is the only part of the build code that is specific to the documents in 
the collection.

Indexable nodes are those that should be indexed individually. When a search 
is done, these are the units that will be returned. For instance, suppose a 
collection contains books with chapters and sections of chapters. To have 
searching and retrieval at chapter level, the chapter nodes would be made 
indexable. Alternatively, making the section nodes indexable would provide 
searching at the smaller section level instead.

Scopable nodes are not necessarily indexed, but are recorded as part of the 
id as a scope. This is to speed up searching for the appropriate tag during 
retrieval. Scopeable tags should only occur once per document.

Testing the index:

You can run the command line Search program, and query the index. From the gberg folder:
java -classpath $CLASSPATH:./java Search ./index/idx

Building other collections:

To use the java building stuff for other XML document collections, you can 
just modify the XMLTagInfo.java file to include the appropriate tag names for 
your XML documents. Then put import documents into the import directory, and 
run the BuildXMLColl program as above. Configuration files will need to be 
created for the collection (etc/collectionConfig.xml and 
index/buildConfig.xml).

The Greenstone runtime software has problems locating DTDs, so any DTDs for 
your collection should be placed in the collection's resources directory.
If DTD files are shared between collections, they can go into the 
WEB-INF/classes directory (in gsdl3/web or tomcat/webapps/gsdl3, depending on 
your setup).

NOTE: Sept 2018. Tomcat cannot find gutbook DTD when trying to transform the file to produce table of contents :-(