Directory of Questions

The LingPipe frequently asked questions are arranged below by topic.

General

Technical

Natural Language Processing

Licensing

Miscellaneous

General

What is LingPipe?

LingPipe is a state-of-the-art suite of natural language processing tools written in Java. It performs tokenization, sentence detection, named entity detection, coreference resolution, classification, clustering, part-of-speech tagging, general chunking, and fuzzy dictionary matching. These general tools support a range of applications.

For more information see the feature list on the LingPipe home page.

Where is the LingPipe home page?

http://alias-i.com/lingpipe

Who makes LingPipe?

Alias-i, Incorporated makes LingPipe. For more information see the Alias-i home page.

How did Alias-i get started?

It began in 1995 as a collaboration among a group of graduate students at the University of Pennsylvania to compete in the DARPA MUC-6 (Defense Advanced Research Projects Agency Message Understanding Conference) evaluation. That sparked a series of events which culminated in DARPA awarding a research contract under the TIDES (Trans-Lingual Information Detection Extraction and Summarization) program in 2000 to found Baldwin Language Technologies, which later changed its name to Alias-i.

After using LingPipe in our FactTracker and ThreatTracker products, we decided in September of 2003 that it was ready to be released to the world under our Royalty Free license (that included access to the source). In addition to working on MUC-6 and TIDES, we have participated in many DARPA projects ranging from the ACE (Automatic Content Extraction) and DUC (Document Understanding Conference) evaluations to the TIA (Terrorist Information Awareness) program.

In 2004, we picked up research funding from the United States National Library of Medicine (NLM), one of the US National Institutes of Health (NIH). Our grant is for text data mining over biomedical research literature.

Where can I download LingPipe?

See the LingPipe download page.

Why would I use LingPipe?

Developers building natural language processing applications, including summarization, information extraction, full text search and machine translation, will save development time by using LingPipe rather than building and refining their own tokenization, sentence detection, named entity detection and coreference resolution tools.

What is the hard-nosed business case for LingPipe?

Your development team doesn't have the time to reinvent the wheel or to create robust, fast natural language processing tools as accurate as those in LingPipe. Using LingPipe, you can finish your application sooner, with fewer developers, or both. This translates into better bottom-line results.

What is the warm, fuzzy raison d'être for LingPipe?

LingPipe is available under a Royalty Free license designed to let researchers, application builders and tool builders incorporate LingPipe into their own projects. We release our source as well, since it is an important part of understanding what the algorithms do. Under this license you can also use LingPipe to annotate your document collection, as long as the contents of the document collection are made freely available to the public. You'll sleep better knowing you're using state-of-the-art tools from a company that cares both about staying in business and about building great tools that other people use.

Technical

How do I report a LingPipe bug?

See the LingPipe bug report page.

What sorts of named entities does LingPipe recognize?

This depends on which model is being used. The "standard" model recognizes people, places, organizations and GSPs (GeoSpatial Political Entities). This can be changed by retraining the named entity detector.

What kind of input does LingPipe take?

LingPipe is neutral with respect to input; it processes text as Unicode and supports a range of character sets and input formats, including HTML, XML and plain text.
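
As an illustration of what character-set neutrality means in Java terms, the sketch below decodes the same word from two different byte encodings into identical Unicode strings. It uses only the standard JDK; the class and method names (DecodeSketch, decode) are made up for this example and are not part of the LingPipe API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of LingPipe.
class DecodeSketch {

    // Decode raw bytes into a Unicode string using the given character set.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The word "cafe" with an acute accent on the e, in two encodings.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        byte[] utf8 = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};

        String a = decode(latin1, StandardCharsets.ISO_8859_1);
        String b = decode(utf8, StandardCharsets.UTF_8);

        // Once decoded, both inputs are the same sequence of Unicode characters.
        System.out.println(a.equals(b));
    }
}
```

Once text has been decoded this way, downstream processing no longer needs to know which character set the original bytes used.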

What platforms does LingPipe run on?

LingPipe will run on any platform that supports a Java Virtual Machine (JVM), version 1.4 or higher. It's been extensively tested on Windows XP, Win64, Linux, 64-bit Linux, and Macintosh OS X.

See the LingPipe installation page for more details about the required components.

How fast is LingPipe?

Very fast. For our statistical models, the CPU is typically waiting on memory, not on Java computations. For heuristic models, we're typically waiting on disk access. LingPipe's tight loops are written very much like they'd be written in C.

The new (1.4.2 and on) Hotspot compilers do an impressive amount of partial evaluation (aka unfolding or inlining) based on the runtime evaluation of the running code. That's eliminated the most time-consuming profile/unfold step in optimization.

An exact answer in terms of megabytes/minute depends on which module of LingPipe you're evaluating, which version of Java you're running and on what platform/load, how fast your memory is, whether models are dynamic or compiled, and the particular shape of the run time task such as number of characters, number of tags and sentence length.

We provide benchmark data as part of some of our tutorials.

What are the memory requirements for LingPipe?

LingPipe is designed to run on servers with shared memory. A tighter answer depends on several factors, ranging from the task being computed to the amount of caching performed. All of the demos run in 512MB of memory.

How many documents can LingPipe handle?

LingPipe doesn't store documents, so there is no limit to the number of documents LingPipe can handle. It runs multi-threaded, so it can handle many documents simultaneously, but performance is limited by CPU bandwidth.

What languages/character sets does LingPipe support?

LingPipe supports Unicode and can process documents in any language with Unicode support. This includes almost all of the major human languages, and even some non-human languages such as Klingon.

What do I need to run LingPipe?

LingPipe was written in Java and requires the JVM to run. It is distributed with everything you'll need to run it. To build and unit test the system, it helps to install Apache Ant. See the LingPipe installation page for more information.

How do I run LingPipe?

LingPipe runs through a Java API, which is detailed in the LingPipe Javadoc. Some functionality is also available from the command line, through a GUI demo, and through web demos. See the LingPipe demos page for information on the web demos, the GUI demo read-me for running the GUI demos, and the command line tutorial for command-line use.

How can I process the output of LingPipe?

LingPipe outputs annotated documents in an XML format. You can process the XML however you like, but the most common ways are to use XSLT or an XML DOM.
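
As a rough sketch of the DOM approach, the example below parses a small annotated document with the standard JDK XML parser and lists the entities it contains. The element name ENAMEX, the TYPE attribute, and the sample document are assumptions made for illustration; check the actual LingPipe output for the element and attribute names it uses. The class name EntityXmlSketch is also hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; element and attribute names are assumed.
class EntityXmlSketch {

    // Return one "TYPE: text" string per entity element found in the XML.
    static List<String> entities(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("ENAMEX"); // assumed tag name
            List<String> out = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                out.add(e.getAttribute("TYPE") + ": " + e.getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input, not actual LingPipe output.
        String xml = "<doc><s><ENAMEX TYPE=\"PERSON\">Breck Baldwin</ENAMEX> works at "
            + "<ENAMEX TYPE=\"ORGANIZATION\">Alias-i</ENAMEX>.</s></doc>";
        for (String entity : entities(xml)) {
            System.out.println(entity);
        }
    }
}
```

A DOM is convenient for small documents because the whole tree is held in memory; for very large outputs, a streaming parser or an XSLT pipeline may be a better fit.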

What programming languages are the LingPipe APIs designed for?

The LingPipe APIs are designed for use in Java.

Why does LingPipe output XML?

XML is easily processed by a wide variety of tools and is also human readable.

Natural Language Processing

What is natural language processing?

Natural language processing (NLP) is a branch of computer science devoted to using computers to manipulate documents written by people.

What is sentence detection?

Sentence detection is an early processing step common to many natural language processing systems in which paragraphs of text are divided into sentences based on punctuation and other information contained in each paragraph.
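
To make the idea concrete, here is a minimal sketch that splits a paragraph into sentences with the JDK's java.text.BreakIterator. This is the standard Java rule-based splitter, not LingPipe's sentence detector, and the class and method names (SentenceSplitSketch, sentences) are invented for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; uses the JDK splitter, not LingPipe's.
class SentenceSplitSketch {

    // Split a paragraph into trimmed sentences using JDK boundary rules.
    static List<String> sentences(String paragraph) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(paragraph);
        List<String> out = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = paragraph.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("I went to the store. It was closed.")) {
            System.out.println(s);
        }
    }
}
```

Punctuation-based rules like these can mis-split around abbreviations and other periods that do not end a sentence, which is why sentence detectors also draw on other information in the paragraph.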

What is tokenization?

Tokenization is an early processing step common to many natural language processing systems in which text is divided into tokens, most of which are words but some of which are punctuation, numbers or other symbols, in order to make additional processing easier. For example, a tokenizer might break the sentence "Yesterday, I went to the store." into the following eight tokens:

     Yesterday
     ,
     I
     went
     to
     the
     store
     .
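
The split shown above can be reproduced with a few lines of plain Java. The sketch below is a simple regular-expression tokenizer written for illustration; it is not LingPipe's tokenizer, and the class and method names (SimpleTokenizerSketch, tokenize) are hypothetical. It treats each run of letters or digits as one token and each remaining non-space character as a token of its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; a regex sketch, not LingPipe's tokenizer.
class SimpleTokenizerSketch {

    // A token is a run of letters/digits, or a single other non-space character.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the eight tokens listed above, one per line.
        for (String token : tokenize("Yesterday, I went to the store.")) {
            System.out.println(token);
        }
    }
}
```

A rule this simple also splits hyphenated words and decimal numbers into several tokens, so real tokenizers usually need more careful rules.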

What is a named entity?

A named entity is a string that refers to a particular kind of object in the world. The string "Breck Baldwin" refers to a particular person and is therefore a named entity of type person. The string "Alias-i" refers to a company and is also a named entity, but of type company.

What is coreference resolution?

Coreference resolution is the natural language processing task of determining whether various named entities refer to the same object in the world. The strings "B. Baldwin", "Dr. Baldwin" and "Breck" are likely to refer to the same person, whereas the strings "Breck Baldwin", "Mary Smith" and "Alias-i" almost certainly refer to different objects in the world. Coreference resolution is most often performed on strings within one document.

What is cross-document coreference resolution?

Cross-document coreference resolution is the natural language processing task of determining whether various named entities from different documents refer to the same object in the world. It's considerably harder than within-document coreference, in part because multiple people, companies and places share the same name. The existence of organizations like the Jim Smith Society shows the difficulty of this problem.

What is summarization?

Summarization is the natural language processing task of producing an abstract from a document by machine.

What is data mining?

Data mining allows people to search for unexpected relations in large data collections and can be applied to structured collections, such as databases, as well as unstructured collections such as documents written in natural language.

What is full text search?

Full text search allows people to use keywords to find documents in document collections, such as the world wide web or the documents stored in a library.

What is machine translation?

Machine translation systems convert documents written in one human language into another without human intervention. For example, a machine translation system might convert a document from Japanese into English.

What is information extraction?

Information extraction is the process of pulling certain kinds of information out of documents automatically. For example, newspaper articles can be mined for new product announcements and those announcements could then be entered into a database.

What additional resources are there about natural language processing?

There are many courses, books, journals and web sites about natural language processing. Your favorite search engine is a good place to start.

Licensing

How is the Royalty Free license different from other standard Open Source licenses?

The Alias-i royalty free license version 1 goes one step further than most open source licenses we've seen in that it requires the content you produce using LingPipe to also be freely distributed. We want people to build and use free software but we also want people to freely distribute the information.

Where is the Alias-i royalty free license version 1 for LingPipe?

See the LingPipe licensing page.

Can I use LingPipe to annotate my documents?

Yes, but under the Alias-i royalty free license version 1 you need to make the documents annotated by LingPipe available to the public for free.

If you have needs that can't be met under that license, contact Alias-i to discuss other licensing arrangements. See the LingPipe contact page for contact information.

Can I extend LingPipe?

Yes. You can incorporate LingPipe into your product, fix bugs in LingPipe or add to LingPipe, but only under the terms of the Alias-i royalty free license version 1. See the LingPipe licensing page for details.

How do I get a commercial license for LingPipe?

Contact Alias-i for further information. See the LingPipe contact page for details.

Miscellaneous

How do I recommend a question for this FAQ?

Send us e-mail. See the LingPipe contact page for details.