Scalable browsing for large collections: a case study
Gordon W. Paynter,1 Ian H. Witten,1 Sally Jo Cunningham,1 George Buchanan2
1 Dept of Computer Science, University of Waikato, New Zealand, {gwp, ihw, sallyjo}@cs.waikato.ac.nz
2 Dept of Computer Science, Middlesex University, London, [email protected]
ABSTRACT
Phrase browsing techniques use phrases extracted automatically from a large information collection as a basis for browsing and accessing it. This paper describes a case study that uses an automatically constructed phrase hierarchy to facilitate browsing of an ordinary large Web site. Phrases are extracted from the full text using a novel combination of rudimentary syntactic processing and sequential grammar induction techniques. The interface is simple, robust and easy to use.

To convey a feeling for the quality of the phrases that are generated automatically, a thesaurus used by the organization responsible for the Web site is studied and its degree of overlap with the phrases in the hierarchy is analyzed. Our ultimate goal is to amalgamate hierarchical phrase browsing and hierarchical thesaurus browsing: the latter provides an authoritative domain vocabulary and the former augments coverage in areas the thesaurus does not reach.

INTRODUCTION
Suppose you are browsing a large collection of information such as a digital library, or a large Web site. Searching is easy if you know what you are looking for and can express it as a query at the lexical level. But current search mechanisms are not much use if you are not looking for a specific piece of information, but are generally exploring the collection. Studies of browsing have shown that it is a rich and fundamental human information behavior, a multifaceted and multidimensional human activity [3]. But it is not well supported for large digital collections.

Web sites link together information in a way that is designed to help the browser. But as the scale of collections increases, links become very difficult to create and maintain. Inserting links manually is labor-intensive, and this kind of information rapidly goes stale as the collection grows. For large collections, the complexity of manually organizing the information is daunting.

Metadata provides information that can be used for browsing: given the relevant metadata, it is possible to provide the human browser with indexes of authors and titles, classification hierarchies, and so on [17]. But as the scale of the information increases, the value of such lists decays, because they become too large to be of much use. With large indexes one is reduced to searching rather than browsing.

We have been experimenting with different ways of automatically abstracting hierarchical structures of phrases from large collections of information and using them to facilitate browsing [10, 12]. This paper reports an application of these techniques to a large Web site.

Our case study is based on the site of the United Nations Food and Agriculture Organization (FAO, www.fao.org), an international organization founded in 1945 whose mandate is to raise levels of nutrition and standards of living, to improve agricultural productivity, and to better the condition of rural populations. Web presence is seen as an important part of the FAO’s information dissemination activities, and the site is organized and maintained by the World Agricultural Information Center (WAICENT), a subunit of the FAO. The version that we use in this study is dated 1998 and contains 21,700 Web pages, as well as around 13,700 associated files (image files, PDFs, etc). This corresponds to a medium-sized collection of approximately 140 million words of text. Figures 1 and 2 show typical pages from the site.

This site exhibits many problems common to large, public Web sites. It has existed for some time, is large and continues to grow rapidly. Despite strenuous efforts to organize it, it is becoming increasingly hard to find information. A search mechanism is in place, but while this allows some specific questions to be answered it does not really address the needs of the user who wishes to browse in a less directed manner.

Figure 1: Example Web page (English)
Figure 2: Example Web page (French)

We support browsing of the FAO site with an interactive interface to the phrases present in the documents; this interface is discussed in the next section, and the succeeding section describes the techniques used to create the underlying index of phrases. We then examine the quality and potential usefulness of the phrases by comparing them with terms and phrases contained in AGROVOC [4], a manually constructed thesaurus for the field of agriculture.

PHRASE-BASED SUBJECT INDEX INTERFACE
The phrase-based browser that we have developed is an interactive interface to a phrase hierarchy that has been extracted automatically from the full text of the Web site. It is designed to resemble a paper-based subject index or thesaurus. Figure 3 shows the interface in use. The user enters an initial word in the search box at the top. On pressing the Search button the upper panel appears. This shows the phrases at the top level in the hierarchy that contain the search word (in this case, the word forest). The list is sorted by phrase frequency; on the right is the number of times the phrase appears, and to the left of that is the number of documents in which the phrase appears.

Only the first ten phrases are shown, because it is impractical with a Web interface to download a large number of phrases, and many of these phrase lists are very large. At the end of the list is an item that reads Get more phrases (displayed in a distinctive color); clicking this will download another ten phrases, and so on. A scroll bar appears to the right for use when more than ten phrases are displayed. The number of phrases appears above the list: in this case there are 493 top-level phrases that contain the term forest.

So far we have only described the upper of the two panels in Figure 3. The lower one appears as soon as the user clicks one of the phrases in the upper list. In this case the user has clicked forest products (that is why that line is highlighted in the upper panel) and the lower panel, which shows phrases containing the text forest products, has appeared.

If one continues to descend through the phrase hierarchy, eventually the leaves will be reached. A leaf corresponds to a phrase that occurs in only one document of the collection (though the phrase may appear several times in that document). In this case, the text above the lower panel shows that the phrase forest products appears in 72 phrases (the first ten are shown), and, in addition, appears in a unique context in 382 documents. The first ten of these are available too, though the list must be scrolled down to make them appear in the visible part of the panel. Figure 4 shows this. In effect, the panel shows a phrase list followed by a document list. Either of these lists may be null (in fact the document list is null in the upper panel, because every context in which the word forest appears occurs more than once). The document list displays the titles of the documents.

It is possible, in both panels of Figures 3 and 4, to click Get more phrases to increase the number of phrases that are
shown in the list of phrases. It is also possible, in the lower panels, to Get more documents (again it is displayed at the end of the list in a distinctive color, but to see that entry it is necessary to scroll the panel down a little more) to increase the number of documents that are shown in the list of documents.

Figure 3: Browsing for information about forest
Figure 4: Expanding on forest products

Clicking on a phrase will expand that phrase. The page holds only two panels, and if a phrase in the lower panel is clicked the contents of that panel will move up into the top one to make space for the phrase’s expansion.

Alternatively, clicking on a document will open that document in a new window. In fact, the user in Figure 4 has clicked on IV FORESTS AND TRADE AND THE ENVIRONMENT, and this brings up the page shown in Figure 1. As Figure 4 indicates, that document contains 15 occurrences of the phrase forest products.

Figures 5 and 6 show some more examples of the interface in use. In Figure 5 the user has entered the word dairy and expanded on New Zealand dairy (note that this collection is from the FAO in Rome, Italy; it is impressive to be able to home in on information about the local dairy industry in New Zealand so rapidly). Figure 6 shows a French user typing the word poisson. The FAO site contains documents in French, but our phrase extraction system is tailored for English as described below. The French phrases displayed are of much lower quality than the English ones in Figures 3, 4 and 5; the list of ten phrases in the upper panel of Figure 6 contains only four useful ones. Phrases like du poisson (meaning of fish) are not meaningful, and can even obscure more interesting material. However, the system is still usable. Here, the user has expanded commercialisation du poisson and, in the lower panel, has clicked INFOPECHE, which brings up the page in Figure 2.

DERIVING THE PHRASES
We have experimented with several different ways of creating a phrase hierarchy from a document collection. Nevill-Manning et al. [10] describe an algorithm called SEQUITUR that builds a hierarchical structure containing every single phrase that occurs more than once in the document collection. We have also worked on a scheme called KEA which extracts keyphrases from scientific papers. This produces a far smaller, controllable number of phrases per document [5]. The scheme that we use for the interface described in this paper is an amalgam of the two techniques.

Constructing phrase hierarchies using SEQUITUR
The basic insight of SEQUITUR is that any phrase that appears more than once can be replaced by a grammatical rule that generates the phrase, and this process can be continued recursively. The result is a hierarchical representation of the original sequence. It is not a grammar, for the rules are not generalized and are capable of generating only one string.

There exists a remarkably efficient algorithm to derive these phrases from an input sequence, and the time it takes is linear in the length of the input [11]. This has allowed us
to investigate hierarchies formed from sequences of words containing up to 60 million tokens.

Figure 5: Browsing for information on dairy
Figure 6: Browsing for information on poisson

Nevill-Manning et al. [10] reported character-based hierarchies, formed by using characters as tokens, and word-based hierarchies, formed using words. Interesting effects occur in both cases, although word mode is most suitable for interactive browsing of large information collections.

In order to display the phrase hierarchy interactively, a number of additional facilities are incorporated into the browser. Words like a and the cause problems because they are often used to form rules, but as far as the user is concerned they add little meaning to the phrase. Nobody really wants to know that the most common use of the word index is in the phrase the index. Hence we label as common words the one hundred most frequently occurring words in the collection, and weed out phrase expansions that differ from the original phrase only by the addition of common words. At the other extreme, phrases that occur rarely increase the number of potential phrases but contribute little to our understanding of the collection. This effect is mitigated by the SEQUITUR algorithm, which ignores singleton phrases; by according more weight to frequent phrases; and by discarding phrases whose frequency falls below a low-frequency threshold. These measures greatly increase the usability of the resulting interface [12].

Extracting keyphrases using KEA
In a separate project, we investigated algorithms for extracting keyphrases from technical documents [5]. Keyphrases provide a kind of semantic metadata that is useful for a wide variety of purposes. It turns out that keyphrases can be extracted automatically from the full text of documents with surprising accuracy. To do this, candidate keyphrases are identified, features are computed for each candidate, and machine learning is used to generate a classifier that determines which candidates should be assigned as keyphrases. One feature, TF×IDF, requires a corpus of text from which document frequencies can be calculated; the machine learning phase requires a set of training documents with keyphrases assigned. The success of various stages of the procedure was evaluated on a large test corpus, in terms of how many author-assigned keyphrases are correctly identified (a measure that is subject to some caveats).

In the final procedure that we developed for keyphrase extraction, stop words were used to determine whether or not a phrase is a candidate phrase. Our experiments on keyphrase extraction also used a syntactic method for identifying candidate phrases: we tried to identify noun phrases. The two approaches are equally accurate on the keyphrase extraction task, but we used stop words in the final system because it is significantly faster.

The syntactic analysis first tags the input by assigning syntactic classes to each word. We use the Brill tagger
[1,2]. Then we experimented with two heuristics for noun phrase identification. The first was suggested by Turney (in press) as matching almost all of the keyphrases in the corpuses he used. It specifies zero or more nouns or adjectives, followed by one final noun or gerund:

    (noun | adjective)* (noun | verb-gerund)

where a “noun” is either a singular or plural noun or proper noun. (“*” means repetition, appearing zero or more times.)

Although this structure resembles a noun phrase, it turns out that the notion of “noun phrase” is only loosely defined in the first place. Also, in our work we have encountered many author-defined keyphrases that are not noun phrases according to this regular expression.

Consequently, we experimented with a different regular expression to locate candidate phrases, which we describe as “augmented” noun phrases:

    [(noun | adjective | verb)+ (conjunction | prep)]* (noun | verb-gerund)

where conjunctions and prepositions are members of a predefined list (and “+” means one or more repetitions). This allows sequences of nouns, adjectives, and verbs to be interspersed with connectives, before the terminating noun or gerund, and permits phrases such as programming by demonstration.

Several browsing interfaces are based on keyphrases. Jones and Paynter [8] automatically insert hyperlinks into digital library collections using keyphrases as link anchors and document clusters as destinations. Martin and Turney [15] use keyphrases to construct searchable subject indexes. Gutwin et al. [6] search for clusters of documents that share keyphrases. Phrases in the result list can be reused as search terms, allowing the user to search increasingly specific variations on a phrase. All three interfaces treat phrases as indivisible units; they do not exploit their hierarchical nature for browsing.

Table 1: Length of phrases (words)

Length            AGROVOC               Extracted phrases
in words          number   percentage   number   percentage
1                  12342     44.9%       58954     21.2%
2                  13046     47.5%      126950     45.7%
3                   1692      6.2%       57844     20.8%
4                    327      1.2%       19356      7.0%
5                     51      0.2%        7194      2.6%
6                      7      0.0%        3271      1.2%
7                      1      0.0%        1724      0.6%
8                                         1050      0.4%
9                                          639      0.2%
10 - 42                                   1109      0.4%
average length     1.64 words            2.37 words

Table 2: Length of phrases (characters)

Length            AGROVOC               Extracted phrases
in characters     number   percentage   number   percentage
1 – 5               1207      4.4%       11513      4.4%
6 – 10              8089     29.5%       47665     18.1%
11 – 15             8737     31.8%       68471     26.0%
16 – 20             6146     22.4%       60861     23.1%
21 – 25             2477      9.0%       35598     13.5%
26 – 30              599      2.2%       17608      6.7%
31 – 35              211      0.8%        8950      3.4%
36 – 40                                   4752      1.8%
41 – 45                                   2690      1.0%
46 – 50                                   1544      0.6%
51 – 55                                   1037      0.4%
> 55                                      2674      1.0%
average length     13.58 characters      17.62 characters

Constructing hierarchies of noun phrases
For the interface described in the present paper, we have employed a combination of the two approaches. As noted above, SEQUITUR produces all phrases that occur more than once. However, users who are browsing are generally far more interested in noun phrases than in other types of phrase. SEQUITUR, when applied to the full input text, tends to produce many other phrases that are not so useful for browsing information collections (though they are useful for other purposes).

If SEQUITUR produces too many phrases, then keyphrase extraction produces too few. A typical document contains thousands of candidate phrases, which the extraction algorithm pares down to fewer than a dozen. Inevitably, hundreds of valuable phrases are discarded. Further, by compressing every occurrence of a phrase to a single summary occurrence, the phrase’s context and frequency are sacrificed. Without context and frequency (the de facto measure of relative importance) we are unable to construct a browsable hierarchy.

As a compromise, we extract just the noun phrases that appear in the full text of the documents, and base a SEQUITUR hierarchy on those. To do this we convert the Web pages to plain ASCII text, using the Lynx browser to strip out all HTML tags, then process the resulting sequence with the Brill tagger. We extract every sequence of words whose tags have the syntactic structure given above for augmented noun phrases, and insert a special delimiter symbol between noun phrases and at clause
breaks like commas and the ends of sentences. The result is a long sequence of delimited noun phrases.

There are many problems with this procedure, and the result is only an approximation to the actual noun phrases that occur in the input. First, the Brill tagger is not perfect; for example, unrecognized words are assumed to be nouns. Second, it is not easy to define a regular expression on the resulting tags that captures each and every noun phrase. But most importantly, some of these documents (e.g. Figure 2) are in other languages, mostly French and Spanish, and this naturally plays havoc with the tagger. Non-English words are assumed to be nouns and used to build nonsense phrases.

Another issue is whether or not to apply stemming before building the noun phrase list. Without stemming, we will get different versions of the same basic noun phrase. In our work on keyphrase extraction, we stemmed words and conflated different versions in order to remove duplicate phrases and count phrase frequencies, but kept a record of the most frequent unstemmed version of each phrase in order to reexpand the stemmed version for display to the user. This is also an option for the present system, although the illustrations in this paper do not use any stemming.

The final phase is to build a hierarchy from the noun phrases by running SEQUITUR over the sequence of noun phrases, specifying the delimiter symbol as a delimiter for SEQUITUR. In fact, the SEQUITUR algorithm is really designed for long undelimited sequences; the problem of generating a hierarchy from a set of short phrases in reasonable time is much easier than treating a single long sequence. And SEQUITUR makes some sacrifices in accuracy to operate in reasonable time. Thus this step also adds a degree of approximation to the phrase hierarchy that results, which could be avoided by using a more suitable method.

COMPARING THE PHRASES TO A THESAURUS
The phrases extracted represent the topics present in the FAO site, as described in the terminology of the document authors. But how well does this set of phrases match the standard terminology of the discipline? We investigate this by comparing the extracted phrases with phrases used by the AGROVOC agricultural thesaurus. The degree of overlap between the two sets of phrases provides a rough indication of the quality of the extracted phrases as subject descriptors; conversely, the applicability of the AGROVOC thesaurus to the FAO site can be assessed by measuring the extent to which the AGROVOC phrases appear in the natural text of the documents.

Table 3: Phrases beginning with the word forest

       AGROVOC thesaurus           Extracted phrases
  1    forest canopy               forest Academy
  2    forest decline              forest access
  3    forest dieback              forest Act
  4    forest ecology              forest activities
  5    forest establishment        forest administration
  6    forest fires                forest agencies
  7    forest floor vegetation     forest agenda
  8    forest grazing              forest animals
  9    forest health               forest area
 10    forest industry             forest assessment
 11    forest inventories          forest authorities
 12    forest land                 forest authority
 13    forest litter               forest base
 14    forest management           forest benefits
 15    forest measurement          forest biodiversity
 16    forest mensuration          forest biology
 17    forest meteorology          forest biomass
 18    forest nurseries            forest Botany
 19    forest pathology            forest boundaries
 20    forest pests                forest canopy
 21    forest plantations          forest capital
 22    forest policies             forest certification
 23    forest products             forest characteristics
 24    forest product industry*    forest charges
 25    forest protection           forest clearance
 26    forest range                forest co management regime
 27    forest regulations**        forest codes
 28    forest rehabilitation       forest college
 29    forest replanting           forest commons
 30    forest reserves             forest communities
 31    forest resources            forest companies
 32    forest returns              forest composition
 33    forest roads                forest concession
 34    forest soils                forest condition
 35    forest stands               forest conflicts
 36    forest steppe               forest conservation
 37    forest surveys**            forest control
 38    forest thinning             forest conversion
 39    forest tree nurseries       forest cover
 40    forest trees                forest crisis
 41    forest workers              forest crops
  …                                …
235                                forest zones
236                                forest zoology

The AGROVOC thesaurus
AGROVOC is a multilingual thesaurus for agricultural information systems, developed by the FAO to support subject control for the AGRIS agricultural bibliographic database and the CARIS database of agricultural research projects [4]. The thesaurus supports the three working languages of the FAO (English, French, and Spanish), and versions in Arabic, German, Italian, and Portuguese are under construction. AGROVOC is actively supported by the FAO and its international community of users, and is periodically updated to reflect changing terminology or
shifts in the boundaries of the research field. A searchable version is accessible at www.fao.org/AGROVOC.

The thesaurus is of a significant size: each language version includes more than 15,700 descriptors, and approximately 10,000 non-descriptors (also colorfully referred to as “forbidden terms”; non-descriptors are synonyms that are linked to a descriptor by a “use” reference). Thesaurus terms are nouns or noun phrases, and all of them, including non-descriptors, were selected for inclusion on the basis of their common usage in the agricultural research literature. The AGROVOC vocabulary forms a rich semantic network describing the agricultural domain, with links between terms describing hierarchical relationships (broader term, narrower term), associative relations (related terms), and synonym links between descriptors and non-descriptors (use, use for).

Tables 1 and 2 summarize the structural characteristics of the AGROVOC phrases and the extracted phrases. The AGROVOC phrases are taken from the English version only, and include both descriptors and non-descriptors. The non-descriptors appear in this analysis because, despite their title, they are useful in thesaurus searching, since they are simply synonyms of their associated descriptors.

The algorithm extracts phrases of two or more words. The phrases in the hierarchy are drawn from a vocabulary of single word terms, and this vocabulary is the source of the single-word phrases in Tables 1–6.

The extracted phrases tend to be longer than the AGROVOC ones, measured both by the number of words and the number of characters per phrase (Tables 1–2). This difference was expected, since AGROVOC phrases were deliberately designed to be brief (three or fewer words) and compact (maximum of 35 characters). These limitations were imposed by the original thesaurus software [4]. The strict upper limit on characters has proven problematic, in that lengthy terms (such as the names of organizations, enzymes, chemical compounds, etc.) have had to be abbreviated, sometimes in arbitrary or non-standard ways. This practice can make querying more difficult for users, who have to guess when and how a phrase has been abbreviated. The potential overlap between the extracted and AGROVOC phrases is also reduced, though only slightly.

Overlap with AGROVOC phrases
We begin with an example to illustrate the degree and type of overlap found between the two sets of phrases. Table 3 shows phrases beginning with the word forest in AGROVOC and at the top level of the phrase hierarchy. Italics indicates that the AGROVOC phrase occurs amongst the extracted phrases (and vice versa). All italicized phrases occur at the top level except the ones marked with a single asterisk (in Table 3, just forest products industry), which appear at a lower level of the hierarchy. This distinction is visible in Figure 3, where forest products industry appears as an expansion of the top-level phrase forest products (as do the three asterisked phrases in Table 4). The doubly-asterisked phrases, forest regulations and forest surveys, appear in the plural and only coincide with extracted phrases if they are stemmed, to forest regulation and forest survey respectively.

Table 4: Phrases containing the word forest

       AGROVOC thesaurus             Extracted phrases
  1    coppice forest                actual forest
  2    duff (forest litter)          aggregate forest
  3    high forest                   Amazon forest
  4    minor forest products*        amenity forest
  5    mixed forest stands           American forest
  6    monsoon forest                artificial forest
  7    nontimber forest products     available forest
  8    nonwood forest products*      Bangladesh forest
  9    secondary forest products*    bavarian forest
 10    semliki forest virus          Black forest
 11    slash (forest litter)         boreal forest
 12    thorn forest                  Chimanes forest
  …                                  …
204                                  world forest
205                                  Wright forest Mgt
206                                  young forest

The overlap between the AGROVOC thesaurus and the phrases extracted from the FAO site is quantified in Tables 5–6. For comparison’s sake, we also include statistics for the raw text and the keyphrases extracted from it by KEA. The former represents an upper bound for matches, and was generated by extracting every sequence of one to four words present in the FAO site. The latter emphasizes precision rather than recall in a match, since there are fewer keyphrases associated with each document (a maximum of six). The keyphrases are also more likely to be true indicators of the focus of the document, and so are closer to the intent of AGROVOC thesaurus entries.

As illustrated in the forest example, stemming can affect the degree of match. We examine this effect by comparing the overlap between unstemmed phrases and phrases stemmed using the Lovins and Iterated Lovins algorithms [9]. The Lovins algorithm stems words to their root form; for example, dictionary is reduced to diction. The iterated algorithm repeatedly applies the Lovins stemmer until the stem no longer changes; dictionary is thus stemmed to dict. When phrases are stemmed more severely, the number of unique entries decreases because similar phrases are stemmed to equivalent root terms, as can be seen in the top row of Table 6.
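
The effect of stemming on the overlap figures can be illustrated with a small sketch. It is not the code used in this study: toy_stem is a hypothetical suffix stripper standing in for the Lovins stemmer, and the function names (iterate_stem, word_ngrams, coverage) and the sample phrase lists are assumptions made for the example. Under those assumptions it shows three steps described above: applying a stemmer until the stem no longer changes, generating every sequence of one to four words from a text, and computing the proportion of thesaurus phrases covered by a candidate set.

```python
from typing import Callable, Iterable, Set

# Hypothetical suffix list; the real Lovins stemmer uses a far larger table.
SUFFIXES = ("ation", "ing", "es", "s")


def toy_stem(word: str) -> str:
    """Strip at most one suffix (a stand-in for a real stemmer, NOT Lovins)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def iterate_stem(word: str, stem: Callable[[str], str]) -> str:
    """Apply `stem` repeatedly until the result stops changing."""
    while True:
        stemmed = stem(word)
        if stemmed == word:
            return word
        word = stemmed


def word_ngrams(text: str, max_len: int = 4) -> Set[str]:
    """Return every sequence of 1..max_len consecutive words in `text`."""
    words = text.lower().split()
    return {
        " ".join(words[i : i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def normalize(phrase: str, stem: Callable[[str], str]) -> str:
    """Lower-case a phrase and stem each of its words."""
    return " ".join(stem(w) for w in phrase.lower().split())


def coverage(
    thesaurus: Iterable[str],
    candidates: Iterable[str],
    stem: Callable[[str], str] = lambda w: w,
) -> float:
    """Proportion of unique (stemmed) thesaurus phrases found in candidates."""
    thes = {normalize(p, stem) for p in thesaurus}
    cand = {normalize(p, stem) for p in candidates}
    return len(thes & cand) / len(thes) if thes else 0.0


# Illustrative data only, loosely modelled on the forest example.
THESAURUS = ["forest regulations", "forest surveys", "forest canopy", "forest fires"]
EXTRACTED = ["forest regulation", "forest survey", "forest canopy", "forest Act"]

if __name__ == "__main__":
    print(coverage(THESAURUS, EXTRACTED))                                     # 0.25
    print(coverage(THESAURUS, EXTRACTED, toy_stem))                           # 0.5
    print(coverage(THESAURUS, EXTRACTED, lambda w: iterate_stem(w, toy_stem)))  # 0.75
```

With this toy data the coverage rises from 0.25 unstemmed to 0.5 with one stemming pass and 0.75 with iterated stemming, because forest regulations only meets forest regulation once both have been reduced to the same stem. Dividing by the number of unique stemmed thesaurus phrases is a choice made here to match the way the proportions in the overlap tables relate to their counts.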

Unstemmed
Lovins
Iterated
bring in new search or browsing terms for the user to
stemmer
Lovins
consider.
Number of unique terms
Stemming increases the number of AGROVOC words and
Agrovoc
20574
17293
15670
full phrases that can be matched to the FAO site, the
FAO Web pages
169209
123975
107870
extracted hierarchy, and the keyphrases, but only
Extracted phrases
44226
30441
25013
marginally. Iterated Lovins provides a higher degree of
Keyphrases
7886
5913
5284
matching than Lovins, but again, the advantage is small.
Number of Agrovoc terms covered by words in...
DISCUSSION
FAO Web pages
9945
8685
8210
extracted phrases
6186
5599
5384
A free-text index is the most common access method for
keyphrases
2483
2356
2294
Web collections, mainly because the index can be
constructed automatically. Searchers typically experience
Proportion of Agrovoc terms covered by words in...
difficulty in constructing effective queries, since they must
FAO Web pages
48.3%
50.2%
52.4%
match their personal vocabulary to that of the collection.
extracted phrases
30.1%
32.4%
34.4%
The interface presented in this paper provides a tool for
keyphrases
12.1%
13.6%
14.6%
spanning the gap between the two vocabularies. The
phrases extracted from the document collection are noun
Table 5: Term overlap between AGROVOC, extracted phrases, and
keyphrases
phrases, and noun phrases are by far the most common
queries submitted to retrieval systems. Users, then, can
explore the collection’s terms and term relationships
Note that slightly over half of the words appearing in the
through a display that mirrors the query construction
AGROVOC thesaurus phrases are also present in the FAO
naturally favored by users.
documents (Table 5). This overlap is a strong indication
that AGROVOC is a suitable thesaurus to use with those
A controlled vocabulary such as a subject thesaurus is
pages. The proportion of AGROVOC words contained in
useful as a complement to free-text indexing: it can provide
phrases in the extracted hierarchy is smaller, but still
a framework for understanding the domain and learning its
represents a respectable one-third of the AGROVOC terms.
technical terminology [14]; as a primary interface for
Including vocabulary terms from the extracted hierarchy
searching/browsing a document collection [13]; and as a
increases the coverage of the AGROVOC terms. As
supporting tool for query construction (typically in
expected, the Kea keyphrases cover a smaller proportion of
AGROVOC terms.

The proportion of full AGROVOC phrases that are
included in the FAO site and the extracted hierarchy is
high: 40% and 26% respectively (Table 6). This is
particularly encouraging, as it indicates that a significant
number of links exist between AGROVOC terms,
documents, and the extracted hierarchy. These
interrelations could form the basis for a rich tool to support
collection browsing. For example, the user interaction
depicted in Figures 3 and 4 begins as the search term forest
is entered into the phrase-based browser. The phrase
hierarchy is scanned and the phrase forest products is
selected. But this term is also represented in the
AGROVOC thesaurus; access to the thesaurus would also
have brought to the user’s attention 44 specific types of
forest product (for example, Christmas trees, charcoal, and
particle boards), and 10 related topics (such as logging
wastes, cellulose products, and tanning agents). These
AGROVOC terms could then be browsed in the interactive
interface. Interestingly, in the AGROVOC entry for forest
product, three of the 54 narrower/related phrase links
contain the word forest, one contains forestry, and six
contain products. The majority of the AGROVOC links

automated or semi-automated query expansion; for
example, see [7]). Usually the information resource
explored through a thesaurus is a bibliographic database, or
(less commonly) a highly structured database such as the
CARIS descriptions of agricultural research projects. In
principle, users of an unstructured but focused document
collection such as the FAO site should also benefit from the
availability of a subject-specific thesaurus. However, the
potential benefits are difficult to realize; the problems
remain of matching the natural terminology of the searcher
to the vocabulary of the FAO site and the thesaurus, and
matching the terminology of the thesaurus to the site.

One approach to addressing the latter problem is to require
the creator of a document at the FAO site to supply
cataloging information that includes a set of applicable
AGROVOC terms; in fact, this procedure is currently in
use. But relatively few authors provide suitable
AGROVOC keywords; perhaps the authors are themselves
unfamiliar with AGROVOC and, like many searchers, find
it difficult to select quality AGROVOC descriptors.

Our next step will be to amalgamate the phrase and
thesaurus hierarchies, both for searching and for
AGROVOC term assignment during cataloging. Our
analysis of the overlap between the AGROVOC and Web
site vocabularies indicates that the two are similar enough
that a tool linking the two hierarchies is likely to be useful.
We envisage an interface that will allow users to gracefully
navigate between their personal vocabulary, terms extracted
from the FAO site, and AGROVOC terms/phrases. We can
exploit the overlap between the extracted and AGROVOC
phrases to support cataloging by running the extraction
process over a submitted Web page and using the resulting
phrases to link to potentially relevant portions of the
AGROVOC hierarchy.

                                        Unstemmed   Lovins stemmer   Iterated Lovins
Number of phrases
  AGROVOC phrases                           27466            26701             25901
  FAO Web site phrases                   19071445         18098815          17764015
  Extracted phrases                        278091           245374            233095
  Keyphrases                                13855            12183             11655
Number of AGROVOC phrases covered...
  by FAO Web site                            9835            10750             10855
  by extracted phrases                       6166             6913              7014
  by keyphrases                              1447             1793              1874
Proportion of AGROVOC phrases covered...
  by FAO Web site                           35.8%            40.3%             41.9%
  by extracted phrases                      22.4%            25.9%             27.1%
  by keyphrases                              5.3%             6.7%              7.2%
Table 6: Phrase overlap between AGROVOC, extracted phrases,
and keyphrases

ACKNOWLEDGMENTS
We gratefully acknowledge Craig Nevill-Manning, Carl
Gutwin, Eibe Frank and Steve Jones, who have worked
with us on phrase extraction and phrase interfaces, and all
members of the New Zealand Digital Library project for
their enthusiasm and ideas.

REFERENCES
1. Brill, E. (1992) “A simple rule-based part of speech
tagger.” Proc ACL Conference on Applied Natural
Language Processing, pp. 152-155, Trento, Italy.
2. Brill, E. (1994) “Some advances in rule-based part of
speech tagging.” Proc AAAI-94, pp. 722-727, Seattle.
3. Chang, S.J. and Rice, R.E. (1993) “Browsing: a
multidimensional framework.” Annual Review of
Information Science and Technology, Vol. 28, pp.
231-276.
4. FAO (Food and Agriculture Organization of the
United Nations) (1995) AGROVOC: multilingual
agricultural thesaurus. FAO, Rome.
5. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C. and
Nevill-Manning, C.G. (1999) “Domain-specific
keyphrase extraction.” Proc Int Joint Conf on
Artificial Intelligence, pp. 668-673, Stockholm,
Sweden.
6. Gutwin, C., Paynter, G., Witten, I.H., Nevill-Manning,
C., and Frank, E. (in press) “Improving Browsing in
Digital Libraries with Keyphrase Indexes.” J.
Decision Support Systems.
7. Jones, S., Gatford, M., Robertson, S., Hancock-Beaulieu,
M., Secker, J., and Walker, S. (1995)
“Interactive Thesaurus navigation: intelligence rules
OK?” JASIS, Vol. 46, No. 1, pp. 52-59.
8. Jones, S. and Paynter, G.W. (1999) “Topic-based
browsing within a digital library using keyphrases.”
Proc ACM Digital Libraries 99, pp. 114-121.
9. Lovins, J.B. (1968) “Development of a Stemming
Algorithm.” Mechanical Translation and
Computational Linguistics, Vol. 11, pp. 22-31.
10. Nevill-Manning, C.G., Witten, I.H. and Paynter,
G.W. (1997) “Browsing in digital libraries.” Proc
ACM Digital Libraries 97, pp. 230-236, July.
11. Nevill-Manning, C.G. and Witten, I.H. (1997)
“Identifying hierarchical structure in sequences.” J
Artificial Intelligence Research, Vol. 7, pp. 67-82.
12. Nevill-Manning, C.G., Witten, I.H. and Paynter,
G.W. (1999) “Lexically-generated subject hierarchies
for browsing large collections.” Int J on Digital
Libraries, Vol. 2, No. 2/3, pp. 111-123, September.
13. Smith, M.P., Pollitt, A.S., and Li, C.S. (1992)
“Evaluation of concept translation through menu
navigation in the MenUSE intermediary system.”
Proc BCS IRSG Research Colloquium on Information
Retrieval, pp. 38-54, University of Lancaster, UK.
14. Soergel, D. (1985) Organizing Information:
principles of data base and retrieval systems. Orlando:
Academic Press.
15. Turney, P.D. (1999) “Learning to Extract Keyphrases
from Text.” NRC Technical Report ERB-1057,
National Research Council, Canada.
16. Turney, P.D. (in press) “Learning algorithms for
keyphrase extraction.” Information Retrieval.
17. Witten, I.H., McNab, R.J., Boddie, S. and Bainbridge,
D. (1999) “Greenstone: a comprehensive open-source
digital library software system.” Research Report,
Dept. of Computer Science, University of Waikato.
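
Supplementary note on Table 6. The coverage figures in Table 6 are set-overlap proportions: the number of distinct thesaurus phrases that also occur in a phrase collection, divided by the number of distinct thesaurus phrases, after both sides have been normalized in the same way. The Python sketch below illustrates that calculation under stated assumptions. The function names (`normalize`, `coverage`, `toy_stem`) and the sample phrases are hypothetical, and `toy_stem` is a deliberately crude stand-in that only strips a trailing "s"; it is not the Lovins stemmer [9] used in the study. The sketch is not the code that produced the table.

```python
from typing import Callable, Iterable, Tuple


def normalize(phrase: str, stem: Callable[[str], str]) -> str:
    """Case-fold a phrase and apply the stemmer to each word."""
    return " ".join(stem(word) for word in phrase.lower().split())


def coverage(thesaurus: Iterable[str],
             collection: Iterable[str],
             stem: Callable[[str], str] = lambda word: word) -> Tuple[int, int, float]:
    """Return (covered, total, percent) for thesaurus phrases found in collection.

    Phrases are compared as sets of normalized strings, so phrases that
    become identical after stemming are counted once.
    """
    thesaurus_keys = {normalize(p, stem) for p in thesaurus}
    collection_keys = {normalize(p, stem) for p in collection}
    covered = len(thesaurus_keys & collection_keys)
    total = len(thesaurus_keys)
    percent = round(100 * covered / total, 1) if total else 0.0
    return covered, total, percent


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strip one trailing 's' (NOT Lovins)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Illustrative phrases only; these are not the study's data sets.
thesaurus_terms = ["forest products", "charcoal", "Christmas trees", "tanning agents"]
site_phrases = ["forest product", "charcoal", "logging"]

unstemmed_result = coverage(thesaurus_terms, site_phrases)
stemmed_result = coverage(thesaurus_terms, site_phrases, toy_stem)

# The same division applied to counts copied from Table 6 (Lovins column,
# "by FAO Web site" row): 10750 covered out of 26701 AGROVOC phrases.
table6_lovins_site = round(100 * 10750 / 26701, 1)
```

On the toy data, stemming lets "forest products" match "forest product", which mirrors the way the stemmed columns of Table 6 show higher coverage than the unstemmed column. Because phrases are held in sets, stemming can also reduce the total number of distinct thesaurus phrases, as the first row of the table shows.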