About the API Tutorials
The application programming interface (API) tutorials are intended to help developers get started with the LingPipe API. Each tutorial is designed to stand alone.
Included Data and Precompiled Scripts
Most of the tutorials come with sample data, precompiled jars and an example that works out of the box. Tutorials which require third-party data or software with restricted distribution (e.g. MySQL) are noted below.
The Tutorials
This section provides a list of available tutorials:
Topic Classification
Categorization of news articles by genre using character language models.
Named Entity Recognition
How to run named-entity recognizers in first-best, n-best and per-entity confidence modes. How to train and evaluate named-entity recognizers. Examples with newswire in Spanish and genomics in English.
Clustering
An illustration of the single-link and complete-link hierarchical clusterers, including a variety of cluster evaluation techniques. Includes an example of clustering for cross-document coreference, applied to resolving different John Smiths in the news. There is also an extensive tutorial on latent Dirichlet allocation (LDA).
Part-of-Speech Tagging
How to train part-of-speech (POS) taggers from corpora using tag parsers and handlers, how to compile models to disk and read them in, and how to run and evaluate first-best, n-best and confidence-scored taggers. Examples include the Brown, Genia, and GenTag part-of-speech corpora.
Sentence Detection
How to run sentence detection using the chunking interface, how to evaluate the performance of a sentence model against a corpus using sentence chunk parsers and handlers, and how to tune a model for a particular corpus. Examples from the Genia corpus.
Spelling Correction
"Did you mean"-style search engine spell checking. How to train and tune a model.
String Comparison
How to use distance and proximity measures over strings, including weighted edit distance, TF/IDF distance, Jaccard distance, and Jaro-Winkler distance.
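To make two of the measures named above concrete, here is a minimal sketch in plain Java. It does not use the LingPipe API: the class name `StringDistanceSketch` and its methods are hypothetical, tokens are assumed to be whitespace-separated, and the edit distance shown is the unweighted (Levenshtein) variant rather than the weighted one the tutorial covers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of LingPipe.
class StringDistanceSketch {

    // Jaccard distance: 1 - |A intersect B| / |A union B| over token sets.
    // Assumption: tokens are split on whitespace.
    static double jaccardDistance(String a, String b) {
        Set<String> tokensA = tokenSet(a);
        Set<String> tokensB = tokenSet(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) {
            return 0.0; // two empty strings are treated as identical
        }
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        return 1.0 - (double) intersection.size() / union.size();
    }

    // Splits a string into its set of distinct whitespace-separated tokens.
    static Set<String> tokenSet(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Unweighted edit distance: every insertion, deletion and
    // substitution costs 1. Uses two rolling rows of the DP table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting a's first i characters
            for (int j = 1; j <= b.length(); j++) {
                int substCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int substitute = prev[j - 1] + substCost;
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(jaccardDistance("a b c", "b c d"));
        System.out.println(editDistance("kitten", "sitting"));
    }
}
```

A weighted edit distance would replace the constant costs of 1 with per-operation (or per-character) weights; the dynamic program itself keeps the same shape.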
Interesting Phrase Detection
Extraction of statistically significant multi-word phrases in one corpus and of relatively significant ("hot") terms in one corpus relative to another.
Character Language Modeling
Training and tuning character language models, extending com.aliasi.util.AbstractCommand and using the com.aliasi.corpus.TextHandler and com.aliasi.corpus.Parser interfaces.
MEDLINE Parsing
Use of the MEDLINE parser interface to extract MEDLINE citations as structured Java objects. Also contains a pointer to our sandbox project to keep an up-to-date Lucene index of MEDLINE.
Database Text Mining
Part one populates a MySQL database with MEDLINE citations using JDBC. Part two runs over a database of documents to create tables of sentences and entities. Part three shows how to do text data mining through database queries.
[Requires GNU-licensed MySQL.]
Chinese Word Segmentation
Shows how to segment a stream of Chinese characters into distinct
words. The demo uses the standard LingPipe spelling corrector with an
edit distance tuned for word segmentation. Shows how to train and
evaluate using publicly available training corpora from the First and
Second International Chinese Word Segmentation Bakeoffs.
[Requires SigHan data download.]
Hyphenation and Syllabification
Shows how to train a hyphenator or syllabifier from dictionary training data. Examples in Dutch, English and German.
Sentiment Analysis
Uses language model classifiers to do sentiment analysis over movie
reviews. Whole movie reviews are classified by polarity (thumbs up or
thumbs down), and single sentences are classified with respect to
subjectivity (subjective/opinion or objective/fact). Walks through
compiling models and reading the extensive output produced by the
classifier evaluators. Also explains hierarchical classification
which stacks the polarity classifier on top of the subjectivity
classifier for improved performance. Discusses binomial confidence
intervals and the danger of a posteriori parameter setting.
[Requires sentiment data download.]
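As a rough illustration of the binomial confidence intervals mentioned in the sentiment entry, the sketch below computes a normal-approximation interval for a classifier's accuracy. It is plain Java with no LingPipe dependency; the class and method names are hypothetical, and the tutorial itself may use a different interval construction.

```java
// Hypothetical illustration class; not part of LingPipe.
class BinomialIntervalSketch {

    // Normal-approximation interval for a binomial proportion:
    //   p +/- z * sqrt(p * (1 - p) / n)
    // where p = correct / total and z is the normal quantile
    // (1.96 for roughly 95% coverage). The result is clipped to [0, 1].
    static double[] normalApproxInterval(int correct, int total, double z) {
        if (total <= 0 || correct < 0 || correct > total) {
            throw new IllegalArgumentException("need 0 <= correct <= total and total > 0");
        }
        double p = (double) correct / total;
        double halfWidth = z * Math.sqrt(p * (1.0 - p) / total);
        return new double[] {
            Math.max(0.0, p - halfWidth),
            Math.min(1.0, p + halfWidth)
        };
    }

    public static void main(String[] args) {
        // Example: 80 of 100 test reviews classified correctly.
        double[] ci = normalApproxInterval(80, 100, 1.96);
        System.out.printf("accuracy 0.80, interval [%.4f, %.4f]%n", ci[0], ci[1]);
    }
}
```

With small test sets the interval is wide, which is one reason a parameter setting chosen after looking at test results can appear better than it really is.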
Language Identification
Language identification as a classification problem. How to train and evaluate language identifiers, with examples from the Leipzig corpus of 15 languages.
Singular Value Decomposition
Use singular value decomposition to factor matrices. Explains how to deal with unknown value imputation, regularization and setting tuning parameters.
Logistic Regression
How to estimate regularized multinomial logistic regression models for discriminative classification.
Expectation Maximization
How to use expectation maximization for semi-supervised learning for a variety of tasks.
Word Sense Disambiguation
Word sense disambiguation is the process of determining which of a word's possible meanings is intended by a particular instance of the word. Word sense disambiguation has applications for classification, search, clustering, etc.
Eclipse
Basic instructions on how to compile and test LingPipe using the Eclipse integrated development environment (IDE).