About the Named Entity Demo
Named entity recognition finds mentions of entities, such as people, locations and organizations, in text. LingPipe's interface returns the mentions as a chunking, which identifies each mention by its character offsets in the text.
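As a rough illustration of what a character-offset representation means, the Python sketch below models a mention as a start offset, an end offset and a type. It is not LingPipe's API: the `Chunk` class, the exclusive end offset, the type labels and the sample text are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One entity mention, identified by character offsets into the text."""
    start: int  # offset of the mention's first character
    end: int    # offset one past the last character (assumed exclusive)
    type: str   # entity type label, e.g. "PERSON" (hypothetical label set)


# Hypothetical input text and hand-written chunks, for illustration only.
text = "John Smith works at Acme Corp. in Boston."
chunks = [
    Chunk(0, 10, "PERSON"),
    Chunk(20, 30, "ORGANIZATION"),
    Chunk(34, 40, "LOCATION"),
]

# Slicing the original text with the offsets recovers each mention.
mentions = [(text[c.start:c.end], c.type) for c in chunks]
for surface, kind in mentions:
    print(f"{kind}: {surface}")
```

Because the chunks only hold offsets, the underlying text is never copied or altered; the same text can carry several chunkings.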
Genre-Specific Models
Named entity recognizers in LingPipe are trained from a corpus of data. The examples below extract mentions of people, locations or organizations in English news texts, and mentions of genes and other biological entities of interest in biomedical research literature.
Language-Specific Models
Although we're only providing English data here, there is training data available (usually for research purposes only) in a number of languages, including Arabic, Chinese, Dutch, German, Greek, Hindi, Japanese, Korean, Portuguese and Spanish. Many of these training sets may be purchased for commercial applications. There are additional biology-based corpora, most of which are available with unrestricted licensing.
LingPipe's Recognizers
LingPipe provides three statistical named-entity recognizers:
com.aliasi.chunk. | Size | 1st-best speed | 1st-best accuracy | n-best speed | n-best accuracy | confidence speed | confidence accuracy |
---|---|---|---|---|---|---|---|
TokenShapeChunker | small | fast | medium | n/a | n/a | n/a | n/a |
CharLmHmmChunker | medium | fast | low | medium | medium | slow | high |
CharLmRescoringChunker | very large | slow | high | slower | high | slowest | low |
Sentence Annotation Included
The demos use the appropriate sentence models. See the Sentence Demo for more information.
Named Entity XML Markup
First-best output
Entities are marked as in MUC, with an ENAMEX element whose TYPE attribute indicates the kind of entity.
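A downstream program can pull the entities back out of this markup with any XML parser. The sketch below uses Python's standard library on a hand-written sample. Only the ENAMEX element and its TYPE attribute come from the description above; the `output` and `s` wrapper elements, the type values and the sentence itself are assumptions, so check them against real demo output.

```python
import xml.etree.ElementTree as ET

# Hypothetical first-best output. The <output> and <s> wrappers are assumed.
sample = (
    '<output><s>Mr. <ENAMEX TYPE="PERSON">John Smith</ENAMEX> flew to '
    '<ENAMEX TYPE="LOCATION">Boston</ENAMEX>.</s></output>'
)


def extract_entities(xml_text):
    """Return (type, text) pairs for every ENAMEX element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (elem.get("TYPE"), "".join(elem.itertext()))
        for elem in root.iter("ENAMEX")
    ]


entities = extract_entities(sample)
for kind, surface in entities:
    print(f"{kind}: {surface}")
```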
N-best output
Each analysis is marked with an analysis element, with attribute jointLog2Prob providing the joint log (base 2) probability of the analysis, and attribute rank providing the rank on the n-best list (numbering from zero), e.g. <analysis jointLog2Prob="-39.9" rank="5">.
Within each analysis, tokens are tagged as for first-best output.
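Since jointLog2Prob is a base-2 logarithm, the joint probability of an analysis is 2 raised to that value. The following sketch reads a hand-written n-best sample, orders the analyses by rank and converts the scores. The `nBest` wrapper element and the sample content are assumptions; only the analysis element and its jointLog2Prob and rank attributes are taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical n-best output; the <nBest> wrapper element is assumed.
sample = (
    '<nBest>'
    '<analysis jointLog2Prob="-41.5" rank="1">'
    '<ENAMEX TYPE="PERSON">John</ENAMEX> Smith lives in '
    '<ENAMEX TYPE="LOCATION">Boston</ENAMEX>.</analysis>'
    '<analysis jointLog2Prob="-39.9" rank="0">'
    '<ENAMEX TYPE="PERSON">John Smith</ENAMEX> lives in '
    '<ENAMEX TYPE="LOCATION">Boston</ENAMEX>.</analysis>'
    '</nBest>'
)


def read_analyses(xml_text):
    """Return one dict per analysis element, sorted by rank (0 is best)."""
    root = ET.fromstring(xml_text)
    analyses = []
    for node in root.iter("analysis"):
        log2_prob = float(node.get("jointLog2Prob"))
        analyses.append({
            "rank": int(node.get("rank")),
            "log2_prob": log2_prob,
            "prob": 2.0 ** log2_prob,  # undo the base-2 logarithm
            "entities": [
                (e.get("TYPE"), "".join(e.itertext()))
                for e in node.iter("ENAMEX")
            ],
        })
    return sorted(analyses, key=lambda a: a["rank"])


analyses = read_analyses(sample)
best = analyses[0]
print(best["rank"], best["log2_prob"], best["entities"])
```

Comparing analyses is easiest in the log domain: in this made-up sample the scores differ by 1.6, so the rank 0 analysis is about three times as probable as the rank 1 analysis.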
Per tag confidence output
Each token and its analyses are wrapped with an nBestEntities element. The text content is followed by a sequence of ENAMEX elements, each giving the entity's type, the conditional probability of the entity given the text, its start and end positions, its confidence rank and the text of the entity.
Named Entity Demo on the Web
The demos are hosted on the web at the following URLs:
English News: MUC6 Corpus (CharLmRescoringChunker)
http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_news_muc6/textInput.html
English Biomedical Text: GeneTag Corpus (CharLmHmmChunker)
http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_bio_genetag/textInput.html
English Biomedical Text: GENIA Corpus (TokenShapeChunker)
http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_bio_genia/textInput.html
For detailed information about using web demos, including web form, file upload and web service instructions, see the web demo instructions.
Named Entity Demo via GUI
To launch the demo in a GUI, first change directories to the command directory and then invoke the demo batch script. Note: Parameters are set in the GUI, not as arguments to the launch script.
Windows Operating System
English News: MUC6 Corpus (CharLmRescoringChunker)
> cd %LINGPIPE_HOME%\demos\generic\bin
> gui_ne_en_news_muc6.bat
English Biomedical: GeneTag Corpus (CharLmHmmChunker)
> cd %LINGPIPE_HOME%\demos\generic\bin
> gui_ne_en_bio_genetag.bat
English Biomedical: GENIA Corpus (TokenShapeChunker)
> cd %LINGPIPE_HOME%\demos\generic\bin
> gui_ne_en_bio_genia.bat
Unix-like Operating Systems
English News: MUC6 Corpus (CharLmRescoringChunker)
> cd $LINGPIPE_HOME/demos/generic/bin
> sh gui_ne_en_news_muc6.sh
English Biomedical: GeneTag Corpus (CharLmHmmChunker)
> cd $LINGPIPE_HOME/demos/generic/bin
> sh gui_ne_en_bio_genetag.sh
English Biomedical: GENIA Corpus (TokenShapeChunker)
> cd $LINGPIPE_HOME/demos/generic/bin
> sh gui_ne_en_bio_genia.sh
For detailed information about running demos in a GUI, see the GUI demo instructions.
Named Entity Demo via Shell Command
Shell commands may be run over single files, all of the files in a directory, or using standard input/output.
Running over a Directory
English News: MUC6 Corpus (CharLmRescoringChunker)
> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_news_muc6.bat -inDir=../../data/testdir -outDir=/testout
English Biomedical: GeneTag Corpus (CharLmHmmChunker)
> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_bio_genetag.bat -inDir=../../data/testdir -outDir=/testout
English Biomedical: GENIA Corpus (TokenShapeChunker)
> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_bio_genia.bat -inDir=../../data/testdir -outDir=/testout
Running a Single File
English News: MUC6 Corpus (CharLmRescoringChunker)
> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_news_muc6.bat -inFile=../../data/testdir/foo.txt -outFile=foo.out.xml
The other genres are handled the same way, with different suffixes in place of news_muc6.
Running through a Pipe (Standard input/output)
English News: MUC6 Corpus (CharLmRescoringChunker)
> cd demos/generic/bin
> echo See Spot. See Spot run. | cmd_ne_en_news_muc6.bat
The other genres are handled the same way, with different suffixes in place of news_muc6.
Running in Unix-like Operating Systems
For Unix-like operating systems such as Unix, Solaris, Linux, or Macintosh OS X:
- replace backslashes (\) in paths with forward slashes (/), and
- substitute .sh for the .bat suffix in the command.
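The two substitutions above are mechanical, so they can be sketched as a small Python helper. The function name `to_unix_command` and the sample command line are hypothetical; the sketch applies only the two rules listed and knows nothing else about the demo scripts.

```python
import re


def to_unix_command(windows_command):
    """Apply the two substitutions listed above to a Windows demo command."""
    # Rule 1: backslashes in paths become forward slashes.
    converted = windows_command.replace("\\", "/")
    # Rule 2: the .bat script suffix becomes .sh.
    return re.sub(r"\.bat\b", ".sh", converted)


# Hypothetical Windows-style invocation, for illustration only.
windows = r"cmd_ne_en_news_muc6.bat -inDir=..\..\data\testdir -outDir=..\..\data\testout"
print(to_unix_command(windows))
```

The blanket replacement assumes backslashes appear only as path separators; a command that passes a literal backslash in some other argument would need more careful handling.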
For detailed information about running demos from the command line, see the command line demo instructions.
Named Entity Demo Scripts
The following scripts are available in $LINGPIPE/demos/generic/bin for running the demo. Note that each script comes in four flavors, distinguishing command line from GUI, and the Windows DOS shell from the Unix shell.
Language | Genre | Corpus | Mode | Windows DOS | Unix/Linux/Mac sh |
---|---|---|---|---|---|
English | General | MUC 6 | Command | cmd_ne_en_news_muc6.bat | cmd_ne_en_news_muc6.sh |
English | General | MUC 6 | GUI | gui_ne_en_news_muc6.bat | gui_ne_en_news_muc6.sh |
English | Biomedical | GeneTag | Command | cmd_ne_en_bio_genetag.bat | cmd_ne_en_bio_genetag.sh |
English | Biomedical | GeneTag | GUI | gui_ne_en_bio_genetag.bat | gui_ne_en_bio_genetag.sh |
English | Biomedical | GENIA | Command | cmd_ne_en_bio_genia.bat | cmd_ne_en_bio_genia.sh |
English | Biomedical | GENIA | GUI | gui_ne_en_bio_genia.bat | gui_ne_en_bio_genia.sh |
Named Entity Demo Parameters
The following is a complete list of parameters for the demo.
Demo-Specific Parameters
The following parameter is specific to the named entity demo (though also found in the part-of-speech demo).
Parameter | Description | Usage Constraints |
---|---|---|
resultType | Form of results | The value selects among the first-best, n-best and per-tag confidence outputs described under Named Entity XML Markup. |
General Demo Parameters
These parameters apply to every version (web/GUI/command) of every demo.
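Parameter | Description | Usage Constraints |
---|---|---|
inCharset | Input character set | Optional. Defaults to platform default. |
outCharset | Output character set | Optional. Defaults to platform default. |
contentType | Input content type | May be one of: text/plain, text/html, text/xml. |
removeElts | Element tags to remove | Optional. May only be used with contentType=text/html or contentType=text/xml. Each value may be a comma-separated list. If neither removeElts nor includeElts is specified, all text content is processed. |
includeElts | Elements to annotate | Optional. May only be used with contentType=text/html or contentType=text/xml. Each value may be a comma-separated list. If neither removeElts nor includeElts is specified, all text content is processed. |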
Command-Line Only Parameters
These parameters apply to every command-line demo, but are not relevant for the GUI or web versions of the demos.
Parameter | Description | Usage Constraints |
---|---|---|
inFile | Readable input file | May not be used with inDir. If not specified, defaults to standard input. |
outFile | Writable output file | May not be used with inDir. If not specified, defaults to standard output. |
inDir | Readable input directory | May not be used with inFile or outFile. If used, inDir and outDir must both be specified. |
outDir | Writable output directory | May not be used with inFile or outFile. If used, inDir and outDir must both be specified. |
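The constraints in this table can be checked before launching a script. The Python sketch below encodes them as stated here; it is not the demos' own validation code, and the function names `check_io_params` and `describe_io` and the `<stdin>`/`<stdout>` placeholders are hypothetical.

```python
def check_io_params(params):
    """Return a list of constraint violations for command-line I/O parameters.

    `params` maps parameter names (without the leading dash) to values.
    """
    errors = []
    file_params = [name for name in ("inFile", "outFile") if name in params]
    dir_params = [name for name in ("inDir", "outDir") if name in params]

    # File parameters and directory parameters are mutually exclusive.
    if file_params and dir_params:
        errors.append(
            f"{', '.join(dir_params)} may not be used with {', '.join(file_params)}"
        )
    # Directory mode needs both the input and the output directory.
    if len(dir_params) == 1:
        missing = "outDir" if dir_params[0] == "inDir" else "inDir"
        errors.append(f"{dir_params[0]} requires {missing}")
    return errors


def describe_io(params):
    """Return (source, destination), falling back to standard input/output."""
    if "inDir" in params:
        return params["inDir"], params.get("outDir")
    return params.get("inFile", "<stdin>"), params.get("outFile", "<stdout>")


print(check_io_params({"inDir": "../../data/testdir"}))
print(describe_io({"inFile": "foo.txt"}))
```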