Archive for the ‘Greenstone3’ Category

Sam’s Greenstone Blog 19/8/2011

admin. Monday, August 22nd, 2011.

Since the last time I wrote I have mostly been continuing work on the Document Maker. The back-end (the part that does all the hard work) is around 80% complete and is ready enough for me to start working more on the front-end (the part that makes the back-end easier to use).

I have also been tidying up some of the Format Manager work that the other Sam did before he left. He had modified the jQuery UI source code so that it allowed multiple nested lists of items (basically lists inside lists, which the original code did not allow for). The only problem with this approach (directly modifying the source code) is that it does not allow us to easily upgrade our version of jQuery UI in the future. To remedy this problem we downloaded the original jQuery UI source code and worked out what parts of the modified code we needed to keep. We then took these parts and put them into a separate JavaScript file and used the prototype functionality of JavaScript to make sure that the modified code would override the original code.
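The prototype-override idea can be sketched roughly as follows. This is only an illustration: `ListWidget` and its `render` method are hypothetical stand-ins invented for the example, not the jQuery UI code or the actual Greenstone changes. The point is that the library file stays untouched and a separate file, loaded afterwards, replaces one method on the prototype.

```javascript
// --- Stand-in for the unmodified library file (hypothetical widget) ---
function ListWidget(items) {
  this.items = items;
}
// Original behaviour: only understands flat lists.
ListWidget.prototype.render = function () {
  return this.items.map(String).join(", ");
};

// --- Separate override file, loaded after the library ---
(function (proto) {
  var originalRender = proto.render; // keep a handle on the original method

  // Replace the method on the prototype so every instance picks up the change.
  proto.render = function () {
    var hasNested = this.items.some(Array.isArray);
    if (!hasNested) {
      // Flat lists still go through the untouched library code.
      return originalRender.call(this);
    }
    return this.items.map(function (item) {
      return Array.isArray(item)
        ? "[" + new ListWidget(item).render() + "]"
        : String(item);
    }).join(", ");
  };
})(ListWidget.prototype);

var rendered = new ListWidget(["a", ["b", "c"], "d"]).render();
console.log(rendered); // a, [b, c], d
```

Because the override lives in its own file, upgrading the library only means swapping the library file and checking that the overridden method still exists.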

We originally tried contacting the jQuery UI developers to see if we could get Sam’s changes included in the official source code, but they responded saying that this new list-inside-list functionality was outside the scope of what the original lists were intended for.

The Format Manager is still very much a work in progress and, although it will be included in the next release of Greenstone 3, we will still be recommending that people make their format statement modifications by editing the collectionConfig.xml file at this point.

Sam’s Greenstone Blog 8/8/2011

admin. Monday, August 8th, 2011.

Since my last post I have been working hard on the new Document Maker functionality that is planned for a future version of Greenstone. So far I have implemented the ability to create new documents, create new document sections, delete documents, delete document sections, copy documents, turn a document into a section of another document, turn a section of a document into a document and the ability to copy a section from one document into another document. Also planned is the ability to move documents and sections (basically the same as the copying operations except the original document or section is deleted afterwards); the ability to merge sections together or to split them apart; various document manipulations such as the ability to get and set metadata and the ability to get and set the document content.
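As a rough illustration of the planned move operation, which is described above as a copy followed by deleting the original, here is a minimal sketch. The in-memory `store` object and the functions `copySection`, `deleteSection` and `moveSection` are hypothetical names made up for this example; they are not the Document Maker API.

```javascript
// Hypothetical in-memory store: document id -> array of section texts.
// None of these names come from Greenstone; they only illustrate the idea.
function copySection(store, fromDoc, index, toDoc) {
  var section = store[fromDoc][index];
  if (section === undefined) {
    throw new Error("No section " + index + " in " + fromDoc);
  }
  store[toDoc].push(section); // append a copy to the target document
}

function deleteSection(store, doc, index) {
  store[doc].splice(index, 1);
}

// Move = copy, then delete the original.
function moveSection(store, fromDoc, index, toDoc) {
  copySection(store, fromDoc, index, toDoc);
  deleteSection(store, fromDoc, index);
}

var store = { letterA: ["intro", "body"], letterB: ["summary"] };
moveSection(store, "letterA", 1, "letterB");
console.log(JSON.stringify(store));
```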

The plan is that this Document Maker functionality will be presented to the users via a web interface, allowing users to modify their documents on the fly. We imagine that this functionality will be very useful to people who want to be able to create organised collections out of large, unorganised sets of text and images. One such example is the Pei Jones collection, which is made up of many individual letters, photos and articles that have been OCRed.

Anu’s entry for 25-29 July

ak19. Monday, August 1st, 2011.

Last week started off with fixing a bug introduced during recent GS3 code changes: suddenly metadata and titles were no longer being retrieved for normal search and browse operations. Then Sam’s recent improvement to GS3’s GLI, which starts the Tomcat server upon GLI startup, was expanded to also stop the Tomcat server on GLI’s exit.

Then it was time to move back to GS3 XSLT files once more. Recently, changes were made to GS3’s old standard skin (gs3library) XSLT files, so that the features exhibited in the DSpace Tutorial would work for GS3 as well. These changes still needed to be ported over to the new standard skin for GS3, currently called “oran” (its servlet is called “dev”). However, in trying to make sense of how to do this, it was discovered that the default dev servlet was not set to use Sam’s excellent default GS3 interface for dev. Because GS3’s format features need to be customisable, having any format statements in a collection’s configuration file would bypass Sam’s interface and show a default one instead. However, this default one was not working at this stage. It was therefore fixed up to get back some rudimentary behaviour not unlike what GS2’s interface offers for hierarchical browsing and search results. To use Sam’s interface, all users need to do is use GLI to delete any format statements in a collection’s config file.

In looking into this matter, a further minor bug was discovered in classifier.xsl that was also fixed.

Porting the GS3 changes made for the DSpace Tutorial into the new default skin had to be continued later, since some incomplete work was awaiting completion: the week ended with continuing work on embedded metadata (such as metadata of the form ex.dc.*).

Sam’s Greenstone Blog 27/7/2011

admin. Wednesday, July 27th, 2011.

It’s been a while since I last wrote, so I’ll fill you all in on what has been happening.

We have been working fairly solidly on some improvements for 2.85. One thing we have been aiming to do is improve the use of PDF files with complex embedded metadata. We have added several options to the EmbeddedMetadataPlugin that allow more advanced manipulation of metadata arrays (metadata values that have multiple entries, like ex.PDF.Keywords).

We have also fixed several issues that arise when two similar documents are put into Greenstone (for example, two identical PDF documents that have different embedded metadata).

In other news, we are currently taking another look at the way we encode PDF files. As some of you may know, we introduced the PDFBox extension along with 2.84 as a way of converting the latest PDF formats to HTML (pdftohtml only allows conversion of the earlier PDF formats). PDFBox works well except that it does not also get the images out of the PDF like pdftohtml does. It is also fairly large, which is why we need to keep it as an extension rather than bundle it with Greenstone. Unfortunately for us, the pdftohtml utility has not been in active development for quite a while now, so it has not been upgraded to deal with the more recent PDF versions. However, the Xpdf library that pdftohtml uses is still in active development, so we have been exploring the viability of upgrading pdftohtml ourselves.

Alongside this I am continuing to work on the Document Maker for Greenstone 3. I have a skeleton of the program in place and have started filling it out.

Anu’s entry for weeks 11-22 July 2011

ak19. Friday, July 22nd, 2011.
  • Week starting 11 July: Closed ticket 770 to do with multiple pieces of metadata for the same metadata name in GS3. GS3 was previously not consulting the mdoffset field in the index database to work out which of multiple assigned metadata values to display for a particular metadata field. When browsing on that metadata field, it used to display only the first each time, but now displays all values in turn.
  • For the rest of that week and the start of the week thereafter, worked on some items discovered by John Rose and Luigi. They found a bug in the GS2 OAI server that manifested when a GS2 client tried to download docs from it over OAI. The bug had to do with an incorrect URL being generated for the dc.Resource Identifier field. They also requested a minor improvement to the button layout in GLI’s OAI download panel and needed some clarifications on the GS2 OAI server’s behaviour.
  • Continuing on in the week of 18 July: On GLI startup, an information dialog box will show up if the user does not have the PDFBox extension installed (telling them how to get it if they want newer PDF versions processed). A dialog will also appear on startup if the user’s collect home was set to be somewhere outside its default location inside the GS2 installation.
  • In implementing the last, a bug was discovered that had been introduced when implementing the reset-gsdlhome target of the gsicontrol script. The bug interfered with the proper behaviour of setting and loading a custom collecthome when using GLI. It’s now been fixed in such a manner that there’s the added advantage that the intensive operations of the reset-gsdlhome task will not be carried out anymore each time the GS2-server is launched. Instead, the relocation-specific operations are only performed when GSDLHOME has in fact changed since the previous time the GS2-server was launched.
  • The pdfbox-app.jar executable file was changed again: it was returned to being the plain, official 1.5.0 release, without the Greenstone-specific changes regarding the line-separator that had thereafter been committed. Instead, the line.separator is now set as a command-line property when launching the pdfbox-app.jar, as suggested by Dr. Bainbridge, since it was no more than a Java System property that needed to be adjusted for GS’ customisation of PDFBox anyway.
  • Changes have been made to modelcol’s config.cfg (and related changes in runtime-src) to deal with embedded metadata, so that it will now handle the “ex.” prefix of metadata already qualified by a set name, such as ex.dc.something. Further changes were made to runtime-src’s code to not always remove the ex. prefix, since this should be retained for embedded metadata. The handling of embedded metadata by the DSpacePlugin was also slightly modified so that DC metadata in the dublin_core.xml files of DSpace documents get prefixed with “ex.”. This allows these metadata fields to be visible in GLI, while yet being unmodifiable, as they are still extracted (ex) metadata.
  • Tried to reproduce some issues noticed by members of the mailing list.
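The reset-gsdlhome fix mentioned in the list above boils down to "only redo the relocation work when GSDLHOME differs from the value seen at the previous launch". A minimal sketch of that check follows. The function names, the `state` object and the example paths are all assumptions made for illustration (the real logic lives in the gsicontrol script), and treating the very first launch as a relocation is also an assumption.

```javascript
// Hypothetical helper: has GSDLHOME changed since the previous launch?
// `recordedHome` is the value saved last time (null if nothing was saved yet).
function needsRelocation(recordedHome, currentHome) {
  return recordedHome !== currentHome;
}

// `relocate` stands in for the expensive, relocation-specific operations.
function launchServer(state, currentHome, relocate) {
  if (needsRelocation(state.recordedHome, currentHome)) {
    relocate(currentHome);
    state.recordedHome = currentHome; // remember for the next launch
  }
  // ...normal server startup would continue here.
}

var calls = [];
var state = { recordedHome: null };
function recordRelocate(home) { calls.push(home); }

launchServer(state, "/opt/gs2", recordRelocate); // first launch: relocates
launchServer(state, "/opt/gs2", recordRelocate); // unchanged: skipped
launchServer(state, "/srv/gs2", recordRelocate); // moved: relocates again
console.log(calls.length); // 2
```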

Sam’s Greenstone Blog 11/7/2011

admin. Tuesday, July 12th, 2011.

Looks like I have some catching up to do.

My time is still mostly being spent on Greenstone 3, tidying up loose ends and making sure we haven’t forgotten anything.  One thing I fixed up was what Greenstone 3 does when GLI is started.  As most Greenstone 2 users will know, when you start up GLI the Greenstone 2 server window also starts.  Previously in Greenstone 3 nothing happened when GLI started (if the server wasn’t running then it would stay not running) but I have modified it so that on Windows the Tomcat window will launch as GLI is launched and on Linux it runs silently in the background.

I have also been spending some time working on the API for the new Document Maker facility that will no doubt make it into the public release of Greenstone at some stage (not 3.05 but maybe 3.06? It’s probably too early to say). Dr. David Bainbridge and I have been discussing the API in detail and I think we are close to finalising what needs to be included to support all of the operations we are planning. The next stage is figuring out how to do the things we want and then implementing them.

Blog entry for 19 June – 1st of July

ak19. Tuesday, July 5th, 2011.

Forgot to write entries for the last two weeks

– A lot of time was devoted to ticket 449: after Dr Bainbridge’s initial solution to the problem in JavaScript, Sam and Veronica spent a lot of their time on it just so that we could get it to do the same in XSLT, and so at last (yesterday, 4 July) this was finished.

– Sam and Dr Bainbridge noticed the GS2 server’s port number would keep incrementing at times if the chosen port was unavailable at that moment. Their ticket specified a way to request preserving the chosen port. So that was implemented some time last week.

– Investigated PDF to text on Windows. Ghostscript seems to support ASCII conversion, but Greenstone would need Unicode to be preserved. There were Perl solutions as well as open source programs to do this on Windows. For now, PDFBox has been tweaked to use its inbuilt ability to convert PDF to text when this is specified. Also looked into the latest version of AbiWord which Max pointed out as a free and small-sized alternative to MS Office and Open Office for converting docX files.

– the latest updates to acku and areu collections were uploaded
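The port-number behaviour described in the list above (drifting to the next port when the chosen one is busy, versus preserving the chosen port on request) can be sketched like this. Everything here is hypothetical: `choosePort`, the `keepPort` flag, the `isFree` callback and the port numbers are invented for the example and do not reflect the GS2 server's actual code or settings.

```javascript
// Hypothetical sketch. `isFree` is a caller-supplied function that reports
// whether a given port can currently be used.
function choosePort(preferred, isFree, keepPort, maxTries) {
  if (keepPort) {
    // Preserve the chosen port: use it or report failure, never drift.
    return isFree(preferred) ? preferred : null;
  }
  // Otherwise try the preferred port, then the following ones in turn.
  for (var offset = 0; offset < maxTries; offset++) {
    if (isFree(preferred + offset)) {
      return preferred + offset;
    }
  }
  return null;
}

// Pretend two ports are taken (arbitrary example numbers).
var busy = { 8282: true, 8283: true };
function isFree(port) { return !busy[port]; }

var drifting = choosePort(8282, isFree, false, 10); // 8284
var pinned = choosePort(8282, isFree, true, 10);    // null: port kept, but busy
console.log(drifting, pinned);
```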

Sam’s Greenstone Blog 27/6/2011

admin. Monday, June 27th, 2011.

Last week I spent a fair amount of time helping out other people in Greenstone lab. One of our recent additions is a Masters student from India named Papitha. Her project involves using Greenstone 3 to create a framework for scholars to work with large sets of images and OCRed text and to organise these into cohesive collections. Some examples of potential functionality include dynamically creating new documents and merging and/or splitting existing documents. So I have been showing her the ins and outs of Greenstone 3, as well as helping to set up a platform for her to start working from.

There is also another Sam in the lab who had been working for us alongside working on his PhD. He took some time off to finish his PhD, and now that it is finished he is working for us part time again. He has been working on a way to more easily customise the format of Greenstone 3. As I’ve mentioned before, with Greenstone 3 we use XSL stylesheets to control the formatting of the various pages, instead of the format statements and macros that Greenstone 2 used. XSL stylesheets are good in that they give us a large amount of flexibility, but this can also make them difficult to understand (especially for people who don’t have any experience in XML), so Sam has been working on a way to hide a lot of the underlying complexity by presenting a simple interface on the pages themselves. I have been helping him with some of the XSL coding as well as some of the run-time code.

Anu’s entry for week of 6 June 2011

ak19. Monday, June 13th, 2011.

Mainly small odds and ends: making sure that the GS2 OAI server validated against a new online validator (at which point the resumptionToken functionality was retested), very minor bug fixes such as making sure images in PagedImage collections built with XML item files won’t get reprocessed by ImagePlugin, and some other questions on the mailing list. Spent time investigating how to implement use_sections with the PDFBox to PDFPlugin (can try updating the PDFBox code to split on a page at a time) and on Friday was (still unsuccessfully) trying to figure out how to circumvent hard-coded GS2 {If} format statements in metadata so that things still work with GS3, as in ticket http://trac.greenstone.org/ticket/449.

Sam’s Greenstone Blog 27/5/2011

admin. Thursday, June 2nd, 2011.

At the end of last week we discovered that – unlike the current skin – the new skin did not correctly allow for custom user templates for things like browsing classifiers and search results. We were originally unsure why it was not working but we tracked it down to XSL template priorities. For those of you who are unaware, Greenstone 3 makes heavy use of XML to produce the information necessary to serve pages to the end-user. This XML is then transformed into an actual web page by performing several XSL transformations. XSL (Extensible Stylesheet Language) is basically made up of a set of rules (templates) that say what to do when you encounter a given piece of XML. For example there might be a template that means: when you see a documentNode in the XML, replace it with a book icon (HTML tag) followed by a link (HTML tag) to the page for that document. These rules are stored in various files in the web/interfaces/{interface name} folder. To support format statements like those Greenstone 2 has, we also allow users to write format statements in the collectionConfig.xml file and these are added to the other templates used to transform the page. So what happens when a user wants to override a pre-existing template (like the one I mentioned before)? Well, we deliberately give the user’s template a higher priority than the default one, so it will do that transformation instead. We found that the default templates in the new skin had deliberately been designed to be more important than the user’s templates, meaning that the user’s templates weren’t showing up at all. So this has now been changed to be more user-template friendly.
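To make the priority idea concrete, here is a small sketch of how a template gets picked when more than one matches the same node. It is a simplified model, not how an XSLT processor is implemented: the `pickTemplate` function, the numeric priorities and the `source` labels are assumptions for illustration, and real XSLT matching involves patterns and import precedence as well.

```javascript
// Simplified model: each template has a match name, a numeric priority
// and a label saying where it came from.
function pickTemplate(templates, nodeName) {
  var best = null;
  templates.forEach(function (t) {
    if (t.match !== nodeName) return;
    // Higher priority wins; on a tie the later template wins.
    if (best === null || t.priority >= best.priority) {
      best = t;
    }
  });
  return best;
}

var userTemplates = [
  { match: "documentNode", priority: 2, source: "collectionConfig.xml" }
];

// Intended setup: the default template has the lower priority.
var defaults = [{ match: "documentNode", priority: 1, source: "default skin" }];
var chosen = pickTemplate(defaults.concat(userTemplates), "documentNode");

// The problem case: a default template with a higher priority than the user's.
var tooImportant = [{ match: "documentNode", priority: 5, source: "default skin" }];
var chosenBroken = pickTemplate(tooImportant.concat(userTemplates), "documentNode");

console.log(chosen.source);       // collectionConfig.xml
console.log(chosenBroken.source); // default skin
```

In the second case the user's template is still present in the stylesheet, it simply never wins, which matches the symptom of custom templates not showing up.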

Next week I will be adding the finishing touches to the new skin (mostly organising it so that it is easy to modify) and, assuming that we can clear up the remaining tickets, I predict I will be spending time testing Greenstone 3 to make sure it is ready for release.