Playing in the Sandbox
The sandbox is filled with experimental LingPipe projects with working code. We'll be using it to share things we're working on in development and things our users would like to share with others. Because it's based on CVS, it will be more fluid than our standard releases.
If you're familiar with anonymous CVS, jump to the Sandbox Project List.
Checking Out Projects
The sandbox is set up to allow anonymous CVS access. After installing CVS, use the following command:
cvs -d :pserver:anonymous@threattracker.com:/usr/local/sandbox checkout ProjectName
where ProjectName is the name of the project.
This will check out the project into a directory of the same name.
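As a minimal sketch, the checkout command above can be wrapped in a small shell function so the CVS root only has to be typed once. The names SANDBOX_ROOT and sandbox_checkout are made up for this illustration, and the example assumes the short names shown under each entry in the project list (such as lingmed) are the CVS project names.

```shell
# Anonymous CVS root for the sandbox, copied from the command above.
SANDBOX_ROOT=":pserver:anonymous@threattracker.com:/usr/local/sandbox"

# Hypothetical helper: check out one sandbox project by name.
# The project lands in a directory of the same name under the current directory.
sandbox_checkout() {
    cvs -d "$SANDBOX_ROOT" checkout "$1"
}

# Example (assumes lingmed is a valid project name):
#   sandbox_checkout lingmed
```

Some pserver setups ask for a one-time cvs login against the same root before the first checkout; whether this server needs it is not stated here.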
Updating Projects
To update an existing project to the latest version, use the following commands:
cd ProjectName
cvs update
That is, you move to the directory containing the project and issue the update command to CVS. This will merge all of the changes in the CVS archive into your local copy.
The reason you don't need to specify the CVS root again (:pserver:anonymous...) is that it's stored under ProjectName/CVS in a file called Root.
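To see which root a working copy will use, you can simply print that file. The helper below is a small sketch; the function name show_cvs_root is invented for this example.

```shell
# Hypothetical helper: print the CVS root recorded in a checked-out project.
# Takes the project directory as its argument (defaults to the current directory).
show_cvs_root() {
    cat "${1:-.}/CVS/Root"
}

# Example:
#   show_cvs_root ProjectName
```

For a project checked out with the anonymous command above, this should print the same :pserver: root that was passed to -d.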
Contributing
Just send us email about your project. Ideally, you can describe it and send us a tarball with working code. We can either just check it in for you or give you an account on our CVS server so you can use the :ext mode for CVS instead of :pserver.
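For contributors who do get an account, a checkout in :ext mode might look like the sketch below. The account name yourname is a placeholder, and reusing the anonymous host and repository path for :ext access is an assumption; use whatever details come with your account. CVS_RSH is the standard CVS environment variable that selects the remote shell program for :ext.

```shell
# Placeholder account name; replace with the account you are given.
SANDBOX_USER="yourname"

# Assumption: same host and repository path as the anonymous root.
SANDBOX_EXT_ROOT=":ext:${SANDBOX_USER}@threattracker.com:/usr/local/sandbox"

# Tell CVS to connect over ssh when using :ext.
export CVS_RSH=ssh

# Hypothetical helper: check out a project with your own account.
sandbox_checkout_ext() {
    cvs -d "$SANDBOX_EXT_ROOT" checkout "$1"
}

# Example:
#   sandbox_checkout_ext ProjectName
```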
Sandbox Project List
LingMed: Biomedical Databases and Gene Linkage
LingMed is what we use for our back-end updating, storage, and indexing of biomedical resources such as MEDLINE, Entrez-Gene, OMIM and GO. It contains extensive documentation and build files, but has lots of moving parts ranging from MySQL to RMI to Log4J.
The project includes a robust downloading process to keep MEDLINE up to date and indexed with Lucene. The Lucene index contains many fields for the parts of MEDLINE, with all text being indexed three ways: with no normalization, with standard normalization and stoplisting, and with character n-grams for approximate search. The index may be used by client programs either locally or remotely through Lucene's RMI integration. There's a generic abstraction layer that supports object-relational mapping and querying through MySQL and object-document mapping and search through Lucene.
The LingMed sandbox project also includes a basic version of our gene linkage application, which links mentions of genes and proteins to Entrez-Gene using name matching and context matching.
lingmed
Named Entity Corpus Annotation Tool
This is a complete application for creating your own named-entity training data. It's language independent and only requires a tokenizer. It learns interactively as you tag, just like the original Alembic Workbench.
We've been able to use it to tag on the order of 5-10K tokens/hour, depending on the complexity of the task.
There's generic information, and also a whole section on extracting bibliographies from PDF documents: first text conversion, then bibliography extraction, then citation extraction, and finally field extraction.
The name of the repository is derived from its original use for bibliographies:
citationEntities
SIGHan 2006: Chinese Words and Entities
This project contains the complete code used to submit Alias-i's 2006 SIGHan bakeoff results. It includes training and output for the two bakeoff tasks: word segmentation and named entity extraction. The code for both tasks comes to under one page.
sighan2006
BioCreative 2006: Gene Entity Recognition
This project contains the complete code used to submit Alias-i's 2006 BioCreative bakeoff results. It includes training and output for the bakeoff task: named entity extraction. This includes first best along with precision- and recall-oriented runs with confidence.
biocreative2006
IBM's UIMA
This project contains sample code for getting started using LingPipe in IBM's UIMA framework. We're actively soliciting submissions for this sandbox entry if you have anything you'd be willing to share. As is, this is for the old IBM UIMA, not the new Apache UIMA.
uima
Sentence, Entity and Coreference Demo
This demo illustrates the 2.0 sentence chunking model, the 2.2 HMM-based entity extraction, and the 1.0 version of coreference in a simple one-pager.
simpleCoref
XHTML: Document Object Model
This is our answer to Jakarta's Element Construction Set, which seems to be moribund. Our approach builds a full XHTML document-object model based on the XHTML DTD. The classes reflect the macros in the DTD through interfaces, allowing for simple constraints on occurrence.
Feedback on this project would be most welcome.
xhtml