Playing in the Sandbox

The sandbox is filled with experimental LingPipe projects with working code. We'll be using it to share things we're working on in development and things our users would like to share with others. Because we are giving you live access to our updates through the Subversion version control system, the content is more fluid than our standard releases.

If you're familiar with anonymous SVN, jump to the Sandbox Project List below.

Otherwise, you might want to start with this free book:

Checking Out Projects

The sandbox is set up to allow anonymous Subversion access. After installing Subversion, create and move to a working directory for your checkouts, then use the following command to check out (co) a particular sandbox project:

svn co

where ProjectName is the name of the project. It will create a subdirectory of your current directory called ProjectName.

If you leave off the project name, it will check out every project in the sandbox, creating a new subdirectory called sandbox into which all of the sandbox projects will be checked out as subdirectories.

Updating Projects

To update an existing project to the latest version, use the following commands:

cd ProjectName
svn update

That is, you move to the directory containing the project and issue the update command to Subversion. This will merge all of the changes made to the project in the repository into your local copy.

You don't need to specify the Subversion URL root again because it's stored in a "hidden" .svn directory in each checked-out project.


Just send us email about your project. Ideally, you can describe it and send us a tarball with working code.

Sandbox Project List

HierAnno: Hierarchical Models of Data Annotation

The hierarchical annotation sandbox project contains R, BUGS and Java code for estimating a gold standard, annotator accuracy, item difficulty, and hierarchical parameters for annotation difficulty.

LingMed: Biomedical Databases and Gene Linkage

LingMed is what we use for our back-end updating, storage, and indexing of biomedical resources such as MEDLINE, Entrez-Gene, OMIM and GO. It contains extensive documentation and build files, but has lots of moving parts ranging from MySQL to RMI to Log4J.

The project includes a robust downloading process to keep MEDLINE up to date and indexed with Lucene. The Lucene index contains many fields for the parts of MEDLINE, with all text being indexed three ways: with no normalization, with standard normalization and stoplisting, and with character n-grams for approximate search. The index may be used by client programs either locally or remotely through Lucene's RMI integration. There's a generic abstraction layer that supports object-relational mapping and querying through MySQL and object-document mapping and search through Lucene.
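To give a feel for the character n-gram idea, here is a minimal sketch in plain Java. It is not LingMed's indexing code and does not use Lucene; the class name CharNGrams, the method extract, and the 2-to-3 character range are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of LingMed or LingPipe.
public class CharNGrams {

    // Returns every character n-gram of text whose length is in [minN, maxN],
    // ordered by start position and then by length.
    public static List<String> extract(String text, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minN; n <= maxN && start + n <= text.length(); n++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [ge, gen, en, ene, ne]
        System.out.println(extract("gene", 2, 3));
    }
}
```

Because a misspelled or variant term still shares most of its n-grams with the intended term, indexing these substrings lets a query match approximately rather than only exactly.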

The LingMed sandbox project also includes a basic version of our gene linkage application, which links mentions of genes and proteins to Entrez-Gene using name matching and context matching.

Named Entity Corpus Annotation Tool

This is a complete application for creating your own named-entity training data. It's language-independent and only requires a tokenizer. It learns interactively as you tag, just like the original Alembic Workbench.

We've been able to use it to tag on the order of 5-10K tokens/hour, depending on the complexity of the task.

There's generic information, and also a whole section on extracting bibliographies from PDF docs, starting from text conversion, then bibliography extraction, then citation extraction and then finally field extraction.

The name of the repository is derived from its original use for bibliographies:

SIGHan 2006: Chinese Words and Entities

This project contains the complete code used to submit Alias-i's 2006 SIGHan bakeoff results. It includes training and output for the two bakeoff tasks: word segmentation and named entity extraction. Code for both tasks is under one page.

BioCreative 2006: Gene Entity Recognition

This project contains the complete code used to submit Alias-i's 2006 BioCreative bakeoff results. It includes training and output for the bakeoff task: named entity extraction. The output includes first-best results along with precision- and recall-oriented runs with confidence.


This project contains sample code for getting started using LingPipe in IBM's UIMA framework. We're actively soliciting submissions for this sandbox entry if you have anything you'd be willing to share. As is, this is for the old IBM UIMA, not the new Apache UIMA.

If you have a LingPipe wrapper for UIMA, we'd be happy to distribute it.

Sentence, Entity and Coreference Demo

This demo illustrates the 2.0 sentence chunking model, the 2.2 HMM-based entity extraction, and the 1.0 version of coreference in a simple one-pager.

XHTML: Document Object Model

This is our answer to Jakarta's Element Construction Set, which seems to be moribund. Our approach builds a full XHTML document-object model based on the XHTML DTD. The classes reflect the macros in the DTD through interfaces, allowing for simple constraints on occurrence.
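The following is a rough sketch of how interfaces can mirror DTD macros, not the sandbox project's actual API. It assumes two marker interfaces, Inline and Block, standing in for the XHTML DTD's %Inline; and %Block; parameter entities; every class and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the sandbox project.
public class XhtmlSketch {

    // Anything that can serialize itself as XHTML.
    public interface Node { String toXhtml(); }

    // Marker interfaces standing in for the DTD's %Inline; and %Block; macros.
    public interface Inline extends Node {}
    public interface Block extends Node {}

    // Character data counts as inline content; markup characters are escaped.
    public static final class Text implements Inline {
        private final String text;
        public Text(String text) { this.text = text; }
        public String toXhtml() {
            return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }

    // <em> is inline and may contain only inline content.
    public static final class Em implements Inline {
        private final List<Inline> children = new ArrayList<>();
        public Em add(Inline child) { children.add(child); return this; }
        public String toXhtml() { return wrap("em", children); }
    }

    // <p> is block-level but may contain only inline content.
    public static final class P implements Block {
        private final List<Inline> children = new ArrayList<>();
        public P add(Inline child) { children.add(child); return this; }
        public String toXhtml() { return wrap("p", children); }
    }

    static String wrap(String tag, List<? extends Node> children) {
        StringBuilder sb = new StringBuilder("<" + tag + ">");
        for (Node child : children) sb.append(child.toXhtml());
        return sb.append("</" + tag + ">").toString();
    }

    public static void main(String[] args) {
        P p = new P().add(new Text("Tagged ")).add(new Em().add(new Text("p53 & more")));
        // prints <p>Tagged <em>p53 &amp; more</em></p>
        System.out.println(p.toXhtml());
        // new P().add(new P()); // would not compile: P is a Block, not an Inline
    }
}
```

With this design the occurrence constraint is checked by the compiler: nesting one paragraph inside another is a type error rather than an invalid document discovered later.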

Feedback on this project would be most welcome.