Using language models for generic entity extraction

Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science
University of Waikato
Hamilton, New Zealand
[email protected]

Abstract
This paper describes the use of statistical language modeling techniques, such as are commonly used for text compression, to extract meaningful, low-level information about the location of semantic tokens, or "entities," in text. We begin by marking up several different token types in training documents: for example, people's names, dates and time periods, phone numbers, and sums of money. We form a language model for each token type and examine how accurately it identifies new tokens. We then apply a search algorithm to insert token boundaries in a way that maximizes compression of the entire test document. The technique can be applied to hierarchically-defined tokens, leading to a kind of "soft parsing" that will, we believe, be able to identify structured items such as references and tables in HTML or plain text, based on nothing more than a few marked-up examples in training documents.

1. INTRODUCTION
Text mining is about looking for patterns in text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless, in modern Western culture, text is the most common vehicle for the formal exchange of information. The motivation for trying to extract information from it is compelling, even if success is only partial.

Text mining is possible because you do not have to understand text in order to extract useful information from it. Here are four examples. First, if only names could be identified, links could be inserted automatically to other places that mention the same name: links that are "dynamically evaluated" by calling upon a search engine to bind them at click time.
Second, actions can be associated with different types of data, using either explicit programming or programming-by-demonstration techniques. A day/time specification appearing anywhere within one's email could be associated with diary actions such as updating a personal organizer or creating an automatic reminder, and each mention of a day/time in the text could raise a popup menu of calendar-based actions. Third, text could be mined for data in tabular format, allowing databases to be created from formatted tables such as stock-market information on Web pages. Fourth, an agent could monitor incoming newswire stories for company names and collect documents that mention them, acting as an automated press clipping service.

In all these examples, the key problem is to recognize different types of target fragments, which we will call tokens or "entities". This is really a kind of language recognition problem: we have a text made up of different sublanguages (for personal names, company names, dates, table entries, and so on) and seek to determine which parts are expressed in which language.

The information extraction research community (of which we were, until recently, unaware) has studied these tasks and reported results at annual Message Understanding Conferences (MUC). For example, "named entities" are defined as proper names and quantities of interest, including personal, organization, and location names, as well as dates, times, percentages, and monetary amounts (Chinchor, 1999).

The standard approach to this problem is manual: tokenizers and grammars are hand-designed for the particular data being extracted. Looking at current commercial state-of-the-art text mining software, for example, IBM's Intelligent Miner for Text (Tkach, 1997) uses specific recognition modules carefully programmed for the different data types, while Apple's data detectors (Nardi et al., 1998) use language grammars. The Text Tokenization Tool of Grover et al.
(1999) is another example, and a demonstration version is available on the Web. The challenge for machine learning is to use