LingPipe's Competition

On this page, we break our competition down into academic toolkits and industrial toolkits. We only consider software that is available for linguistic processing, not companies that rely on linguistic processing in an application but do not sell that technology.

How does LingPipe compare to the offerings below? A few key points to keep in mind as you browse the offerings:

Academic and Open Source Competition

The following is a list of ongoing large-scale, multi-function natural language toolkits that are built and distributed by academics.


ABNER is a statistical named entity recognizer "using linear-chain conditional random fields (CRFs) with a variety of orthographic and contextual features. Version 1.5 includes two models trained on the NLPBA and BioCreative corpora, for which performance is roughly state of the art. The new version also includes a Java API allowing users to incorporate ABNER into their systems, as well as train and use models for other data." Written by Burr Settles at the University of Wisconsin-Madison. Released with source under the Common Public License.


The Baseline Information Extraction (BALIE) system is a Java natural language toolkit developed at the University of Ottawa and released under the GNU General Public License. BALIE provides language ID, sentence detection, tokenization and named-entity recognition. Here's the BALIE javadoc.


"BANNER is a named entity recognition system, primarily intended for biomedical text. It is a machine-learning system based on conditional random fields and contains a wide survey of the best features in recent literature on biomedical named entity recognition (NER). BANNER is portable and is designed to maximize domain independence by not employing semantic features or rule-based processing steps. It is therefore useful to developers as an extensible NER implementation, to researchers as a standard for comparing innovative techniques, and to biologists requiring the ability to find novel entities in large amounts of text. BANNER is released under the Common Public License."


FreeLing is a set of C++ tools developed at the Universitat Politècnica de Catalunya and released under the GNU Lesser General Public License. FreeLing provides sentence detection, morphological analysis, named entities, POS tagging, shallow parsing, dependency parsing and word sense disambiguation.

The Dragon Toolkit

"The Dragon Toolkit is a Java-based development package for academic use in information retrieval (IR) and text mining (TM, including text classification, text clustering, text summarization, and topic modeling). It is tailored for researchers who work on large-scale IR and TM and prefer Java programming."


"Ellogon is different from other similar software. First of all, it respects the user's time by offering a simple and user friendly graphical interface. But beneath this simple appearance a powerful engine is hidden, that has been proved to be able to support a wide range of uses, from simple research prototypes to commercial applications. Ellogon is licensed under the GNU LGPL license, is easy to install and administer and is reliable. Running under all major operating systems, Ellogon offers a comfortable environment for computational linguists, language engineers or plain users."


GATE is a Java text mining toolkit developed at the University of Sheffield and released under the GNU Lesser General Public License. GATE provides a general offset-oriented development/deployment environment/framework and some rule-based tools to run within that framework. Many other GATE plugins have been contributed by Sheffield and third parties. Here is a link to their user guide and javadoc.


"The JULIE Lab here offers a comprehensive NLP tool suite for the application purposes of semantic search, information extraction and text mining. Most of our continuously expanding tool suite is based on machine learning methods and thus is domain- and language independent.

One main feature is that we offer our tools both as stand-alone programs and wrapped within the UIMA framework. UIMA is an open-source, industrial-strength, scaleable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components."

Apache Lucene Mahout

"Mahout's goal is to build scalable, Apache licensed machine learning libraries. Initially, we are interested in building out the ten machine learning libraries detailed in [Chu et al.'s 2006 NIPS paper Map-Reduce for Machine Learning on Multicore] using [Apache] Hadoop."


MALLET is a Java natural language toolkit developed at the University of Massachusetts and released under the Common Public License. MALLET is most widely used for classification and sequence modeling. It also includes clustering. It provides maximum entropy training, including conditional random fields, general undirected graphical models, finite-state transducers, and some general numerical optimization classes. Here's a link to their javadoc.


"MaltParser is a system for data-driven dependency parsing, which can be used to induce a parsing model from treebank data and to parse new data using an induced model. MaltParser is developed by Johan Hall, Jens Nilsson and Joakim Nivre at Vaxjo University and Uppsala University, Sweden."


The Maryland Information and Network Dynamics Lab Semantic Web Agents Project offers a generic Semantic Web annotation toolkit, SMORE, "designed to enable users to markup HTML documents in OWL using Web Ontologies."

Minor Third

Minor Third is a Java natural language toolkit developed at Carnegie Mellon University and released under the BSD License. Minor Third provides an extensive suite of general and sequence classifiers, including KNN, active learning, SVMs, decision trees, CRFs, CMMs, boosting, perceptrons, etc.


"MontyLingua is a free for research use, commonsense-enriched, end-to-end natural language understander for English. Feed raw English text into MontyLingua, and the output will be a semantic interpretation of that text. Perfect for information retrieval and extraction, request processing, and question answering. From English sentences, it extracts subject/verb/object tuples, extracts adjectives, noun phrases and verb phrases, and extracts people's names, places, events, dates and times, and other semantic information. MontyLingua makes traditionally difficult language processing tasks trivial!"


MorphAdorner is an extensive package developed at Northwestern for spelling standardization, lemmatization, sentence detection, part-of-speech tagging, named entity detection, and phrase chunking. It rolls many other projects into a coherent and well-documented package. Mainly aimed at historical forms of English, but retrainable. Very open license.


The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester with close collaboration with the University of Tokyo.

On our website, you can find pointers to sources of information about text mining such as links to


The Natural Language Toolkit (NLTK) is a general Python toolkit for natural language processing, developed at the University of Melbourne and released under the GNU General Public License. NLTK contains modules for heuristic and statistical tagging (including the Brill tagger) and chunking, full parsing (CFG), and clustering (including K-means and EM). The documentation page contains pointers to tutorials and API documentation. It's also distributed with a range of interesting data.


OpenNLP is a heterogeneous collection of projects distributed under a variety of open source licenses. The main projects being developed for OpenNLP itself include a general Java maximum entropy package released under the GNU Lesser General Public License. Here's the maxent javadoc. There's a tools API to go with it; here's the tools javadoc. The tools include statistical tokenizers, sentence detection, name finders, part-of-speech taggers and full syntactic (PCFG) parsing. This is one of the few packages to do coreference resolution.


Actually, it's hard to tell if Proxem is academic or commercial. They offer a suite of .NET tools for natural language processing called Antelope. It looks like much of the work is based on wrappers to other packages.


RASP is an NLP framework for "robust accurate statistical parsing". It is trained using the British English corpora Susanne, LOB and BNC. RASP includes a tokenizer, part-of-speech tagger, hand-built FST-based morphological analyzer for English, grammar-based parser and parse reranking model.

There is a RASP white paper describing the system, including dependency parser accuracy evaluation, and a longer technical report about the grammar formalism used.

Stanford NLP Software

This is a collection of software packages with Java source, licensed under the GNU General Public License. Tools include a part-of-speech tagger, text classifier, and PCFG parser. They no longer provide public access to their full JavaNLP toolkit.

SRI International

"SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995. The toolkit has also greatly benefitted from its use and enhancements during the Johns Hopkins University/CLSP summer workshops in 1995, 1996, 1997, and 2002."


TreeTagger is Helmut Schmid's multilingual part-of-speech tagger. It has a research-only license.


WEKA is the University of Waikato's data mining software released under the GNU General Public License. It's not a natural language processing toolkit, but a very extensive general machine learning toolkit. It provides a nice graphical interface for evaluation and supports just about every machine learning algorithm known to research. It's based on Ian Witten and Eibe Frank's book Data Mining. Here's a pointer to the WEKA documentation, which includes tutorials and pointers to the javadoc.

Industrial Competition

The following is a list of competitors with quotes from their own web pages. We've listed technology components where we could find them.

Accenture Technology Labs

Accenture is offering a sentiment monitoring service: "Sentiment Monitoring Services searches preferred sites or newsgroups on the Internet for opinions. Using advanced language technologies, it interprets the sentiment of the text towards a specified product or service and then provides the user with an analysis of the results. Sentiment Monitoring Services combines a search agent and a perception engine to present users with an instant gauge of market perception of any feature, product, brand or organization. The natural language processor of the perception engine achieves an accuracy of approximately 90 percent compared to opinion ratings ranked manually."

Adaptive Semantics

"ASI specializes in sentiment analysis, powered by cutting edge machine learning methods. From major blog networks, to social networks, to community reviews sites, JuLiA offers automated comment moderation, user-profiling, and a series of reporting tools to enhance and streamline the entire community moderation process."

"The Huffington Post has acquired Adaptive Semantics, its first purchase of another company. HuffPo wants to use Adaptive Semantics software, which provides learning and sentiment analysis technology to continue to scale community and work in tandem with the site's team of human moderators."

A-Life Medical

"A-Life Medical’s patented Natural Language Processing (NLP) technology utilizes proprietary knowledge-bases of more than ten million facts to automate the coding process. Our technology combined with our software solutions and services are dramatically changing the way healthcare codes, submits claims, collects reimbursement, as well as improving patient care."

A-Life produces Alacer, "the first end-to-end practice management system that integrates document management, real-time NLP coding, billing, collections, denials management, and auditing into one streamlined Windows-based platform."

Alitora Systems

"Alitora System provides comprehensive software solutions for biotech research, management, compliance, intellectual property management and competitive intelligence. Our software enables users to search, annotate and collaborate, seamlessly, allowing the annotation of information as simple as a click, and collaboration as simple as a drag-and-drop."

Alitora offers kHarmony, which "allows users to identify concepts that are of interest, and then search for information relating specifically to those concepts...". Alitora describes kHarmony as "using proprietary graph-theoretic and information retrieval techniques to provide structure to unstructured data, perform data clustering, and enable visual data exploration."


"Appen develops and markets sophisticated computer-based speech and language technology products and services for major international information and communication companies and government organizations."

Appen supplies a range of corpora, as well as tools for morphology, sentiment, and authorship.

Ariadne Genomics

"Ariadne develops software tools for biologists in the areas of pathway analysis and automated scientific text processing. Ariadne products incorporate proprietary Natural Language Processing (NLP) and statistical algorithms designed to functionally interpret novel genetic information."

Although they don't really sell NLP software per se, they are making applications like their Medscan Reader for text data mining over biomedical research articles. It uses entity extraction (e.g. genes and diseases), specific relation extraction (e.g. binding or regulation) and sentence-level search and summarization.


"Attensity's breakthrough Text Analytics solutions enable computers to understand and process free-form text, offering organizations the opportunity to leverage the vast amounts of information contained in non-structured formats. The technology allows users to extract and analyze facts like who, what, where, why, under what conditions and to whom, as well as opinions and events found in unstructured data."

"Attensity offers a complete suite of products for Text Analytics. The suite includes both targeted and Exhaustive Extraction Engines that pull the information out of text and put it into a usable format, analysis and discovery applications that allow you to explore and make sense of the data, knowledge libraries and knowledge engineering tools that provide the ability to define what to extract and categories to put data in, and an integration toolkit."


"Autonomy is the acknowledged leader in the rapidly growing area of Meaning Based Computing (MBC)."

"Meaning Based Computing not only uncovers, but also makes sense of, the 85% of enterprise information that is hidden to all other technologies including keyword search engines and relational databases. ... Meaning Based Computing enables organizations to automatically form a contextual understanding of people's interests, behavior and ongoing interaction with any type of information. ... Meaning Based Computing enables organizations to extract meaningful evidence from terabytes of email, documents, spreadsheets and other unstructured information."

Basis Technology

"Basis Technology provides software solutions for extracting meaningful intelligence from unstructured text in Asian, European and Middle Eastern languages. We help technology companies and government organizations improve the accuracy of information retrieval, text mining and other applications through advanced linguistics."

Basis provides entity extraction in ten languages, language identification, and Chinese and Japanese character-level support.


"Baynote delivers the industry-leading recommendation engine for products and content as well as UseRank social search for websites, intranets, and portals. Our Collective Intelligence Platform allows businesses to better understand their visitors' intent and context and automatically display the best content and products based on that insight."


"BBN is the leader in the development of the new "Semantic Web," which will enable powerful searches and automated agents. BBN has been coordinating the work of 23 US and international research teams in conjunction with the World Wide Web Consortium and European Union collaborators to drive the transition to a semantic web."


"Providing natural language understanding to any application. We have already succeeded with search engines, databases, virtual assistants..."

Bitext (for "bits and text") provides NaturalFinder, "the essential complement for any search engine for Internet and intranets which allows users to query in natural language (Spanish or English) without using booleans or wildcards."


"At Biz360, we are committed to removing the inefficiencies of traditional market research and measurement through technology and product innovation. Biz360 uses proprietary technology, analytics and natural language processing (NLP) to aggregate and analyze vast amounts of media and market information to yield insights which help marketing companies better understand, reach and motivate their key target audiences."

Their offerings are centered around sentiment analysis and subject tracking through dashboard-style reporting.


"Brainware helps the world's leading companies automatically extract, process, and retrieve data from any source. Our Data Capture solutions virtually eliminate manual data entry while our Enterprise Search solutions allow you to stop searching and start finding..."

The technology behind Brainware's Intelligent Data Capture Distiller involves "patented neural network-based classification" as well as "pattern recognition technologies and "fuzzy logic" to accurately sort documents and extract key data fields even when fields shift positions from document to document."

Butler Hill Group

"The Butler Hill Group is a streamlined network of linguists, computer scientists, language experts and research librarians with expertise in the natural language issues of computer technology. We maintain solid relationships with highly skilled consultants ... our past projects include machine translation, web search, lexicon evaluation, product usability studies, speech and product localization ..."

Butler Hill primarily provides services for corpus creation, evaluation and internationalization.

Cambridge Semantics

"Cambridge Semantics empowers even non-technical users to leverage their familiarity with Excel (using Anzo for Excel) and the simplicity of our Web visualization and forms tool (Anzo on the Web) to rapidly develop solutions that connect critical enterprise data to an information fabric, making that data available to be combined, manipulated and shared at the moment it is needed."

Carrot Search

Carrot Search offers "professional installation, customization, clustering and text mining consulting services based on Open Source and proprietary software." They also offer Lingo3G, a "Document Clustering Engine that can organize collections of text documents into clearly labeled thematic groups. Accurately and on-the-fly."

They also offer the open source framework Carrot2, which provides federated search with clustering over popular search engines and APIs, including Lucene.


"Clarabridge's text mining software transforms text into actionable insight to improve market research, customer care, product development, quality assurance and risk management. Clarabridge's award-winning software links the worlds of text mining, search and business intelligence (BI) to enable enterprises to more quickly and intuitively leverage all of their data to make better business decisions."

Their offerings include data deduplication/cleansing, data linkage/merging, document segmentation and categorization, entity extraction, event, relationship and fact extraction, table parsing and image processing, as well as search and visualization on top of these.


"ClearForest's text-driven business intelligence solutions help organizations make more informed business decisions by doing what search technologies do not--extract free text for use within analytics applications and BI systems. We provide the analytical bridge between two previously disconnected worlds of information--unstructured text and enterprise data. In allowing both to be analyzed simultaneously, ClearForest makes unified business intelligence a reality."

"ClearForest Tags' open and flexible platform supports statistical, structural and semantic tagging as well as custom taggers, industry and custom taxonomies, and information agents."


"Trust, context and confidence anchor CodeRyte's natural language processing (NLP) technology."

"The technology ‘reads’ medical reports and identifies accurate CPT and ICD codes from the text of a physician’s documentation." They very helpfully point to the American Health Information Management Association's page on Delving into Computer-assisted Coding.


"CognitionSearch, the Company's patented meaning-based linguistic Search architecture, is able to deliver significantly higher levels of relevant search results than is possible with currently used Search technologies."

"The technology employs a unique mix of linguistics and mathematical algorithms which has, in effect, 'taught' the computer the meanings (or associated concepts) of nearly all the words and frequently used phrases within the common English language. It also has knowledge of the relations between words and phrases, especially paraphrase and taxonomy."

"CognitionSearch is the only commercially available technology that combines natural language queries with linguistic meaning-based Search and semantics. It incorporates statistical algorithms with linguistically mapped coverage of the English language,..."


"Connexor provides linguistic technologies and expertise to software houses and solution providers who tackle the challenge of how to derive useful information from unstructured digital text for different kinds of consumers and analysts."

"Connexor's Machinese software discovers the grammar and semantics of natural language. Machinese enriches text with linguistic markup: a uniform programmer's interface that enables use of text content in software applications and solutions."

Connexor makes some excellent heuristic/rule-based part-of-speech taggers, named entity extractors and dependency parsers.


"Go Beyond Search... Access, Share and Deliver Intelligence and Awareness, not just Links". "At Connotate we believe in working smarter, not harder."

"Utilizing patented machine learning algorithms, Agents are easily trained in minutes using a simple point-and-click process that requires NO programming. Agents can be deployed to do anything a human can do to mine, monitor, survey, collect, aggregate and normalize dynamic financial content deep within the Web or in the Enterprise into actionable business intelligence."

Content Analyst

"Stop searching, start doing." Content Analyst supplies a range of text analytics applications, including classification, named entity coreference, and relationship discovery.

Their technology is based on latent semantic indexing, a dimensionality reduction technique based on singular value decomposition of a matrix of co-occurrences. The technology was acquired when Content Analyst spun out from SAIC.
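To make the idea concrete, here is a minimal sketch of latent semantic indexing in plain Python. It is not Content Analyst's implementation; the function names, the toy documents, and the use of power iteration on the document-by-document matrix (in place of a library SVD routine) are all our own assumptions for illustration. It builds a term-document count matrix, keeps the top k singular directions, and compares documents in the reduced space.

```python
import math


def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    row = {w: i for i, w in enumerate(vocab)}
    a = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in d.split():
            a[row[w]][j] += 1.0
    return a


def top_eigenpairs(sym, k, iterations=500):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    work = [r[:] for r in sym]
    pairs = []
    for c in range(k):
        # Deterministic, non-uniform start vector (an arbitrary choice).
        v = [1.0 + 0.1 * ((i + c) % n) for i in range(n)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        wv = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * wv[i] for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def lsi_document_vectors(docs, k):
    """Map each document to k latent coordinates (sigma_i * v_i[j])."""
    a = term_document_matrix(docs)
    n = len(docs)
    # A^T A: its eigenvectors are the right singular vectors of A,
    # and its eigenvalues are the squared singular values.
    gram = [[sum(r[i] * r[j] for r in a) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs]
            for j in range(n)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


DOCS = [
    "car engine",
    "engine automobile",
    "automobile driver",
    "flower petal",
    "petal garden",
]

if __name__ == "__main__":
    vecs = lsi_document_vectors(DOCS, 2)
    print(cosine(vecs[0], vecs[2]))  # documents 0 and 2 share no words
    print(cosine(vecs[0], vecs[3]))  # unrelated topics
```

In this toy collection, "car engine" and "automobile driver" share no terms, so their raw cosine similarity is zero, but both co-occur with "engine automobile" and therefore end up close together in the two-dimensional latent space, while the flower documents stay apart. A production system would use a sparse SVD routine and term weighting rather than raw counts.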

Crawdad Technologies

"Crawdad Technologies, LLC provides software and services to analysts and research professionals who need to transform unstructured text into insight."

Crawdad supplies Crawdad Desktop, which does scraping and some natural language processing involving classification and terminology extraction.

Crawdad also builds Listening Posts, which "listens to blogs, discussion boards, chat rooms, social networking sites, and online media for news or opinion about products, brands, celebrities, and issues. Users view a daily dashboard which uses patented natural language processing technology to analyze the buzz on the Web and make sense of it."

Digital Reasoning

"Digital Reasoning Systems develops software solutions that rapidly process and organize unstructured data into meaningful, relevant and quantifiable knowledge automatically. The core technology, embodied in Synthesys, can be used to provide for the automatic categorization, linking, retrieval and profiling of unstructured data."

Dolores Labs

"We make crowdsourcing easy." Dolores Labs is involved in classification including topic and sentiment, document de-duplication, and other natural language tasks.

Dolores Labs are not so much competing as providing a complementary service: data annotation via crowdsourcing, for which they've used Amazon's Mechanical Turk.


"Engenium is a pioneer in conceptual search technology that increases the effectiveness of electronic information retrieval. Unlike keyword searching that is limited to precisely matching the language of a given query, Engenium's Semetric concept search engines and integrated Autometric clustering engine analyze documents by meaning, concept and context. This yields better, faster search results --- uncovers information that otherwise would remain buried --- and enables organizations to work smarter."

They seem to be applying latent semantic analysis (a kind of principal component analysis) to search and clustering.


"Using semantic understanding of content, Evri is building the data graph of the web. We'll use this to create interesting and meaningful connections without having to search."


"Exalead CloudView is a one-of-a-kind search engine that collects unstructured and structured data from any source, in any format and in any volume, and automatically transforms it into a single structured information resource. This resource, which continually evolves and adapts as your data evolves, can be directly searched or used to develop innovative search-based applications (SBAs)."

Expert System

Expert System is a "leading provider of Semantic Intelligence software to discover, classify, and understand information contained in unstructured text. Expert System technology, COGITO enables natural language processing. It leverages full semantic analysis to automatically understand the content from any textual document, including the retrieval of meanings and the comprehension of natural language.

Semantic Intelligence enables you to read, understand, and extract the most relevant concepts present in the huge amount of documents, websites, presentations, emails and blogs that are accessible to us everyday."


"Extractiv is a new content provisioning service that helps consumers "make sense" of large amounts of unstructured text. We use natural language processing in conjunction with one of the world best distributed computing platform in order to turn text into structured data that can be used in a variety of apps, such as sentiment tracking or semantic search."


"We do not simply search, we find. We filter out all the irrelevant, peripheral data and provide the exact information end users are looking for."

"We have solutions that monitor competitive intelligence, provide brand and litigation protection, support regulatory and policy compliance, and investigate criminal and terrorist activity. They don't just return results, they return confidence and protection."


"Fetch Technologies provides innovative solutions for integrating and accessing heterogeneous data sources."

Fetch isn't so much a direct competitor, but more of a complementary technology aimed at scraping web pages and record linkage (also known as database deduplication).

General Dynamics Advanced Information Systems

"General Dynamics Advanced Information Systems designs, develops, manufactures, and integrates information solutions for defense, intelligence, space and homeland security communities."

"General Dynamics Advanced Information Systems uses data mining technologies to help customers find new correlations, patterns and trends. We use advanced technology to sift through large amounts of data (structured data, text, audio, video, etc.) stored in repositories and use pattern recognition and statistical and mathematical techniques." In particular, their system "successfully performs entity extraction, a natural language processing technique, to derive facts such as names, places, organizations, locations and time from text."


"GrammarSoft ApS is a small company specializing in Language Technology." Product offerings (for multiple European languages) include morphological analyzers, part-of-speech taggers, syntactic/dependency parsing, named-entity recognition, translation, and tools for teaching language and spell checking. They are a spinoff of the Visual Interactive Syntax Learning (VISL) project.


H5 builds classifiers for enterprise document management, especially in the legal domain.


"Different than a familiar R&D agenda in a search engine company, we undertook highly specific research tasks solely dedicated to the advancement of the core-competency in Web search. The main challenge is to make science work in a constrained deployment environment where speed, coverage, accuracy, and ease-of-use are high priority considerations."

hakia-Lab provides several technologies, including OntoSem, "a formal and comprehensive linguistic theory of meaning in natural language", QDEX "Query Detection and Extraction system was invented to bypass the limitations of the inverted index approach when dealing with semantically rich data", SemanticRank, "a collection of methods to score and rank paragraphs", and Dialogue, "the conversational (dialogue) systems where the search engine communicates with the user in an elevated level of confidence".

Hot Neuron

"Clustify groups related documents together into clusters and labels each cluster with keywords to tell you what it is about. It does both conceptual clustering and near-duplicate detection. This gives you a quick overview of the document set, and makes categorization of the documents more efficient and consistent. Clustify can process millions of documents on a desktop computer, bringing organization to large projects."

"Hot Neuron Similarity uses a proprietary algorithm to quantify the similarity of a pair of documents. This software is demonstrated on where articles that are determined to be sufficiently similar are marked in a database. The user can click on the icon next to an article he/she likes to retrieve the list of similar articles."


"The Infobright self-tuning analytic database delivers high performance without the work and cost of other solutions."


"Infogistics are one of the leading companies providing text-analysis, content extraction and document retrieval solutions across multiple areas of industry including HR, law enforcement, knowledge management and CRM."

"Using advanced Natural Language processing technology developed at Edinburgh University, Infogistics solutions enable information and data contained in structured or unstructured text documents to be retrieved, categorised, extracted and delivered to the right people at the right time." Their NLP offerings include sentence detection, part of speech tagging, and light syntactic chunking. They also offer higher-level products for specialized search, relationship extraction and document parsing.


Inform does "precise topic-based search for related content". Their technology involves text classification and entity extraction for more-like-this applications. You can also check out the Inform News Demo.


"InQuira helps companies deliver more effective customer service through their Web sites and contact centers." Their product features "integrated capabilities for natural language search, knowledge base management, and analytics".

Their product line includes InQuira Intelligent Search, "a unified system that combines advanced linguistic techniques and contextual understanding to provide unparalleled capabilities for understanding, and responding to, the true intent behind a user’s inquiry and browsing behavior."


"IntelliResponse delivers on the promise of web self-service by providing one right answer to visitor questions."

IntelliResponse mainly works in question answering and classification in the context of customer relationship management (CRM). About this they say their "Patented 'one right answer' solution understands precisely what the visitor wants, regardless of the hundreds of ways a specific question can be asked." They hedge this a bit later by saying, "While it's not possible (or a good idea) to create an IntelliResponse knowledge base that would answer every possible question that anyone would ask, it is very possible to create one that will answer upwards of 90% of incoming questions."

You can even try it by entering a question on their home page.


"Irion Technologies has successfully picked up the challenge to make computer programmes that make sense out of text, and really understand human language."

"Irion's software improves any web communication involving human language, and applies to any organization in the world dealing with textual information. This includes conceptual search, knowledge management, E-commerce, customer support, and many other applications." Their technology seems to be organized around classification.


"Janya provides products and services to support information discovery from unstructured and semi-structured data. With more than a decade of experience developing and integrating this technology, Janya works with customers and system integrators to incorporate information discovery capability in both unclassified and classified environments. By leveraging existing search technology and tools for structured data analysis, Janya's solutions enable users to increase their effective bandwidth to analyze text data streams."

Janya builds Semantex, "an enterprise-class information extraction system that supports the automatic or semi-automatic analysis of large volumes of electronic information in order to detect entities, attributes, relationships and events. Semantex represents a hybrid model for information extraction, combining machine-learning and grammatical approaches to achieve better results than any of the techniques could individually." They also build case restoration and "text zoning".

Note that Janya was a spinoff of Cymfony, and Cymfony is now part of TNS Media Intelligence.

LXA Lexalytics

"Designed to help our customers address the basic problem of making their loosely structured information more valuable. We have created a set of products that attack the problem of discovering, understanding and acting on information that affects their business."

Lexalytics's products include entity extraction, relation extraction, document summarization, sentiment analysis, and classification.


Limbix has "designed a suite of enterprise and social solutions built on our unique ability to precisely determine the tone in text-based communication. Whether it's deeper social media analytics or a solution for improved enterprise communication you're looking for, you can bet we have a product offering available (or in development) to match. Our team is ready to work with you."


"M*Modal's state-of-the-art Speech Understanding technology translates physician dictation in real-time into a searchable, structured document."


"Utilizing and implementing up to date research results in the fields of computer science, language technology, and information theory we at Matrixware enable our customers to skilfully navigate the endless sea of patent literature."

Meaningful Machines

"Meaningful Machines develops, patents, and commercializes language technologies based on a unique suite of methods that automate machine understanding of natural language. The company is developing technologies for use in machine translation (MT), text mining, machine learning, and other applications that benefit from machine understanding."

Although they cite very general problems, their technologies page only addresses machine translation.


"TextAnalyst will help you quickly summarize, efficiently navigate, and cluster documents in your textbase."

Megaputer's TextAnalyst product extracts semantic networks, summarizes text using a "balanced combination of linguistic and neural network investigation methods". Their notion of semantic network is "the most important concepts from the text and the relations between these concepts weighted by their relative importance". They also use these semantic networks for clustering and exploration/search.


"GDMs [Geographic Data Modules] make up the core of MetaCarta products and hosted solutions. A GDM is a knowledge base used to identify and disambiguate geographic references, assign latitude/longitude coordinates, and confidence scores and relevance ranking. Each MetaCarta GDM contains linguistic statistics, gazetteer data, and natural language processing (NLP) logic."

"MetaCarta GeoSiteSearch is a Web portal drop-in that enables users to conduct a geographic search on any content, with search results ranked on location and text relevance."

MetaCarta specializes in location mention recognition and resolution in text. They use probabilistic models with confidence ranking.

Mnemonic Technology

"At Mnemonic, we can help you realize the full value of your information, whether structured or unstructured."

"Our relevance models help you get the right information to the right people at the right time. They automatically learn to prioritize, categorize, monitor and summarize large volumes of unstructured textual information according to the unique requirements of individual users."

From what we can tell from their web site, their "relevance models" learn scored text classifiers by example. The only application mentioned for text analytics is search query refinement.


Morphologic provides a range of products, mostly arranged around morphologically sensitive bilingual translation dictionaries, including thesauri. Applications include copy editing such as spelling checkers, hyphenators, grammar and style checkers; text search tools including stemmers; and translation of full documents. Tools include morphological analyzers including stemmers based on unification grammars, syntactic analyzers, spell checkers and language identifiers.


Extractor: "Accurately perform entity extraction from unstructured texts using advanced computational linguistics and natural language processing."

Summarizer: "Reliably generate abstracts and summaries of long and complex documents."

TextMiner: "Empower users to find, organize, analyze, and mine a large volume of unstructured information using the most advanced text analysis technology available."

They also run in several languages. They have some kind of cross-document coreference. They spun out of SRA International. They claim to have "best-of-breed entity extraction" and "unique link and event extraction", but don't explain what the breed is and don't list any unique features of their link and event extractors. They even claim "NetOwl posted the highest score ever achieved for name extraction from unformatted text, a score which has never been equaled by another system.", but don't provide any details.

Next IT

"Next IT's Human Emulation Software, ActiveAgent, creates Virtual Experts for organizations that are redefining customer service through technology. As a single solution, ActiveAgent accurately understands and interprets users' natural language questions and delivers exact results. ActiveAgent leverages an organization's entire asset and resource portfolio through multiple service channels such as the web, contact center, intranet, mobile devices, and more with accuracy, scalability and operational efficiency."

Northern Light

"What if your search engine could read all the market intelligence reports and articles your company creates or licenses and tell you what is in them, suggest to you what the business issues are that they report on, and direct you to the documents that are the most interesting to you, not from a search relevance perspective, but from a meaning perspective?"

Northern Light offers Market Intelligence Analyst, which contains entity extraction, sentiment analysis, relationship identification, meaning extraction and trend analysis.


"We’re developing software that understands English, and can converse with people about a body of facts." It looks like they're still in the development phase, but they plan to use parsing to logical representations, ontologies and natural language generation.


"Ntelligent Enterprise Search by Nstein is a powerful search solution built to increase the efficiency and productivity of your employees on intranets and portals. For public websites, it will guide your customers in the most advanced discovery process. It quickly delivers highly accurate search results in all circumstances."

"One suspicious letter. One dangerous passenger boarding a routine flight. One viral infection in a small village, in a distant country. These events have led to tragedies we all wish had never happened. Critical information preceded these events. Critical information that could have been flagged by Nstein Technologies."


"Nuxeo Enterprise Platform (Nuxeo EP) is a Java-based content infrastructure designed to be used as a development environment for content- and case-based applications. Nuxeo EP is an extensible and configurable set of ECM services and modular plug-ins that allows an organization to build out specific horizontal or vertical applications."


"SemanticMiner utilizes the strengths of ontologies. These are knowledge models representing the relevant expertise within your department or company. They facilitate, moderated searches, optimization of the search results, unified view onto diverse sources. With SemanticMiner, users find the relevant information required much faster and easily. The SemanticMiner is available in two flavors: with a pre-configured web-interface or as a set of web services. These can be integrated into any SOA-compatible application or user interface."


"Ontotext is a leading developer of core semantic technology, which delivers applications in domains like Web Mining, EAI, KM, BI, and Media Research."

"Ontotext is a laboratory of Sirma, active in several research areas, including: Ontology Management; Information Extraction and Retrieval (IE, IR); Semantic Web Services." Their products include the KIM Platform for semantic annotation driven by a "semantic repository". The semantic annotation includes named-entity extraction from a specified ontology. They have contributed extensively to GATE (for more info on GATE, see above).

Optimal Solutions and Technologies

"OST develops tools and integrates resources that are a precise fit to our client's needs, size, and budget. Unlike many other businesses, we are not married to specific manufacturers or solutions providers that will limit the scope of our offerings. Our Services Include Computational Linguistics"

Oracle Data Mining

"Oracle Data Mining (ODM) -- an option to Oracle Database 11g Enterprise Edition -- enables you to easily build and deploy next-generation applications that deliver predictive analytics and new insights. Application developers can rapidly build next-generation applications using ODM's SQL and Java APIs that automatically mine Oracle data and deploy results in real-time-throughout the enterprise."

Although not specifically aimed at natural language, there is quite a bit of NLP-relevant technology in it, including naive Bayes, decision trees, logistic regression, and support-vector machines.


"Panscient is a content supplier for vertical search engines." They have the interesting business model of supplying lists of people and businesses scraped from the entire .com domain of corporate web sites and updated monthly. They also develop vertical search applications.

Parity Computing

"Parity Computing's unstructured data management and knowledge discovery solutions transform disparate data and content into a knowledge network of actionable profiles and linked relationships."

Parity offers the Profiler System, which "assembles and analyzes detailed profiles of key entities such as people, institutions, and products, from disparate unstructured documents and semi-structured data sources. ... The key entities are extracted and assembled into distinct profiles using advanced machine learning heuristics. This includes normalization of spelling variations together with disambiguation of similarly-named entities (e.g. two people with the same name)." Additional functionality cited includes home page finding, extracting patent references from web pages, etc.

Parity also offers a lower-level tool, the Reference Processor, a "fully automated software engine for high-accuracy reference processing and linking of publication databases and bibliographies in arbitrary formats." Technology includes extraction, deduplication, clustering and correction.


"We're creating natural language technology that grows and improves as it collects simple human judgments about language. Using this technology, we're building tools to search blog posts, feeds, and other texts for key concepts, not just keywords and phrases. Think of it as tagging meets natural language processing." As of February 2007, they only offered a mailing list.

Popup Chinese

Popup Chinese is a natural language processing engine for Chinese text. It uses a combination of a dictionary and statistical methods to intelligently and contextually segment text, identify parts of speech and manipulate text. The software provides hanzi-to-pinyin conversion, text segmentation and machine translation for a variety of applications including search, content extraction and more. It supports POS tagging and word sense disambiguation. The software is coded in object-oriented C++ and released under an open source license permitting commercial use for hanzi-to-pinyin conversion, text segmentation and machine translation purposes.


"PureDiscovery, a software company based in Dallas Texas, is reinventing the art of semantic search by harnessing collective intelligence. PureDiscovery is the creator of KnowledgeGraph, an intelligent software platform that transforms an organization's documents into a working collective intelligence. PureDiscovery KnowledgeGraph semantically connects people and knowledge in ways and on a scale that simply was not possible before. We serve a variety of markets including: Legal, Human Capital Management, HomeLand Security / Defense, and Intellectual Property "


"Natural Language Search".

Q-go offers the Q-go Natural Language Search Product Suite. "Q-go's Natural Language Search gives organizations insight into visitors' expectations and wishes and helps them adapt their online information accordingly." "The answers provided by Q-go are comparable to those of call centers in terms of consistency, completeness and quality, which is not only cheaper but also faster and easier for organizations."


"QL2 Software's tools and solutions deliver business critical data seamlessly and in real-time. QL2's technology integrates data from virtually any source, inside and outside the firewall, with existing applications and solutions. The result is better analytics and smarter, more profitable decisions."

It wasn't clear from the web site whether any natural language processing was involved in their products.


"RapidMiner (formerly YALE) is the world-leading open-source system for knowledge discovery and data mining. It is available in different flavours: a free open-source version licensed under the GPL, a free version with an improved user interface, and under a developer license (OEM) which allows the integration of RapidMiner as a powerful library even into proprietary products. Enhance your products with adaptability and innovative analytical features. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge."


"Sophisticated Search Review and Analysis Made Simple"

"MindServer Categorization automatically maps structured and unstructured information into an information structure - taxonomy, ontology, or subject heading classification."

"The core technology powering Recommind's MindServer platform is based on patented, proprietary machine learning techniques including the Probabilistic Latent Semantic Analysis (PLSA) algorithms."

Recorded Future

"Temporal Analytics Engine: A predictive analysis tool that allows you to visualize the future, past or present. How Recorded Future works: 1. Scour the web: We continually scan thousands of high-quality news publications, blogs, public niche sources, trade publications, government web sites, financial databases and more. 2. Extract, analyze and rank: We extract information from text including entities, events, and the time that these events occur. 3. Make it useful: You can explore the past, present and predicted future of almost anything. Powerful visualization tools allow you to quickly see temporal patterns, or link networks of related information."

Reel Two

"Reel Two is tackling the tough problems in search and data analysis. Our software products and custom solutions provide scientists, analysts and managers with quick, intuitive access to the information that is most relevant to their work."

Reel Two spun out of the WEKA group at Waikato, and is primarily focused on text classification and entity extraction as well as additional biomedical applications aimed at chemical name resolution.


"Our groundbreaking data analytics technology is designed to deliver intelligent monitoring and analysis of unstructured data, all within the context of the analysis environment. With RiverGlass tools, organizations can make unstructured data sources like the Internet into true strategic information resources to help drive their success."

From their technology capabilities page, they appear to be doing search with ontology integration for topic monitoring, entity extraction and link analysis.

SAP BusinessObjects

"SAP BusinessObjects offers a broad portfolio of tools and applications designed to help you optimize business performance by connecting people, information, and businesses across business networks."

Among its operations is Intelligent Search and Text Analysis software to "extract, categorize, and summarize key information from unstructured text and convert it to a structured format so that it can be an effective data source for data integration or business intelligence. Process in more than 30 languages."


"SAS Text Miner incorporates advanced linguistic capabilities within the core data mining solution of SAS Enterprise Miner. Consolidating structured (quantitative) data analysis with unstructured (free-form text) provides complete views and meaningful insights within an integrated predictive modeling environment. Automating manual comprehension of the textual data sources, incorporating interactive drill-down reporting, and delivering algorithms for rigorous advanced analyses make it possible to grasp future trends and act on new opportunities more efficiently and with less risk."

"The Teragram division licenses its proprietary software and Linguistic data to other companies who embed our technologies into their corporate, Internet or software applications. Teragram has established itself as the leading technology and service provider for technologies such as Linguistics, pattern matching, Linguistic search and retrieval, international dictionaries for search and e-commerce, document management, and high demand Internet applications and services. By leveraging Teragram technologies, our customers are able to create new products quickly; improve the performance of their own products; expand their business to European, Arabic, and Asian markets; manage information more efficiently; and provide new functionalities to their own customers."

Scout Labs

"A powerful, web-based application that tracks social media and finds signals in the noise to help your team build better products and stronger customer relationships."

Their Scout Labs Product includes sentiment analysis, summarization and search over "consumer-generated media".

SDL Language Weaver

"As the pioneer of statistical machine translation, Language Weaver provides trusted automated translation solutions to improve human communications for government and commercial organizations. Delivering a trusted level of translation quality. Language Weaver ensures that organizations can communicate with global audiences in over 75 language combinations and is the only provider of automated translation with the capability to provide a TrustScore against each translation. TrustScore provides a prediction of the translation quality by the translation engine. This enables customers to build business rules around TrustScore and automate translation and publishing processes."


"Semantia is the expert in the automatic processing of natural written language for optimizing customer interactions."


"Semantra extends traditional BI and enterprise search applications, by empowering users to quickly and easily access precise, critical information from enterprise databases through a familiar search box and natural language."

Social Nuggets (formerly Serendio)

SocialNuggets technology developed over 20 man-years unearths critical customer sentiments and feedback embedded in millions of online conversations occurring over the social landscape. SocialNuggets technology converts free form textual content into structured contextual information.


"Silobreaker is an automated search service for news and current affairs that aims to provide more relevant results to the user than what traditional search and aggregation engines have been offering so far. Instead of returning just lists of articles matching a search query, Silobreaker finds people, companies, organizations, topics, places and keywords; understands how they relate to each other in the news flow, and puts them in context through graphical results in its intuitive user interface. The search result pages look similar to an online newspaper but are generated without human editing.

The site aggregates content on global issues, science, technology, energy, business, sports and entertainment from tens of thousands of news sources, blogs, multimedia, and other forms of news media from around the world. With the engine's focus on finding and connecting related data in the information flow, Silobreaker user tools and visualizations are ideal for bringing meaning to content from either today's Web or the evolving Semantic Web, or both."


"Sinequa is an innovative leading global provider of Enterprise Search Solutions."

"Sinequa CS has been developed by Sinequa as the ultimate, multi-lingual knowledge access platform. Featuring cutting-edge semantic and linguistic technologies, Sinequa CS is one of the most advanced Enterprise Search solutions available today."


"Soliloquy is the world's first company to offer 'intelligent,' fully automated solutions that enable end users to find the information, services and products they desire through targeted online dialogs."

Soliloquy is in the business of dialog mining, a kind of text data mining over dialogues. They claim to be "the world's first company to offer turnkey solutions that enable end users to find the information and products they desire through intelligent, targeted dialogs."


"PASW Modeler makes it easy to discover insights in your data. Its simple graphical interface puts the power of data mining in the hands of business users and high-performance increases analyst productivity."

STIL Language Technology

"Consultancy and software. Bringing academic research to business." "STIL can offer solutions to businesses and non-profit organizations that have a need to explore what language technology could offer them, or with a need to integrate language technology components in their information systems."

STIL offers software based on the TiMBL memory-based learning system, including sequence taggers, shallow parsers, word-sense disambiguation and morphological analysis.


"Amazing natural language processing, semantic search, and question-answering technology. We have built the largest index of questions and answers ever created: we know the answer to more than 10 billion questions."


"Data management encompasses all measures implemented to support the use of data as a resource. The purpose of data management is to manage and supply accurate and timely data to business processes. Major disciplines in Data Management include data integration, data quality, Master Data Management, etc...All Talend products are built on a unified Eclipse-based development environment, which provides users with consistent ergonomics, fast learning curve and a high-level of reusability."


"The Talis Platform makes it easy for developers to build powerful applications that use Semantic Web technologies and standards. Delivered as Software as a Service (SaaS), the Platform dramatically reduces the complexity and cost of storing, indexing, searching and augmenting large quantities of data. "


"TEMIS develops and markets corporate Text Mining solutions. Our software unlocks knowledge from unstructured data."

Temis's core products include an "information extraction server dedicated to the analysis of text documents", a hierarchical clusterer that "proposes the most relevant classification for a given document collection", and a classifier that "classifies unstructured documents into pre-defined categories, combining statistical and linguistic analysis rules". This is all based on "XeLDA", their "multilingual linguistic engine". They have an impressive list of clients.

Text Analysis International

"VisualText is the premier integrated development environment for building information extraction systems, natural language processing systems, and text analyzers."

TAI offers VisualText, "an Integrated Development Environment for deep text analysis applications. Think of it as Visual C++ for Natural Language Processing applications." They also provide TAIParse, which includes part-of-speech tagging and noun-phrase chunking. The basic technology appears to be a multi-pass rule-based approach.


Textkernel offers resume parsing, info extraction, sentiment, and general data mining.


"TextMap is a search engine for entities: the important (and not so important) people, places, and things in the news."

"TextMap analyzes both the temporal and geographical distribution of news entities."

"TextMap uses natural language processing techniques to track entity references in news sources, and a variety of statistical techniques to analyze the relationships between them."


"We provide business-to-business (B2B) analytical software and services to accurately examine and extract information from large volumes of unstructured text."

"TextOre has the ability to perform searches that are highly detailed, using multiple queries and in multiple languages, while providing easily understood results. The results are provided through an advanced visualization profile tool that identifies and visually depicts the intensity of relationships in unstructured data sources (letters, documents, e-mail and web pages), including real-time news and information feeds. Our technology not only identifies anomalies missed by competitive technologies, but also identifies specific sentences, paragraphs and relationships, taking into account the precise terms applied by a user."


"TextWise energizes your existing advertising portfolio by offering high-resolution targeting, sophisticated media placement, and hassle-free automation of both ad creation and placement."

"Semantic Signatures are TextWise's patented contextual targeting technology. They innovate beyond simple keyword-based or category-based models currently used in so-called 'contextual advertising' and deliver a new level of context-driven advertisement matching." They also "capture meaning through concepts, not keywords -- including multiple meanings and topics within a single document."

TNS Media Intelligence/Cymfony

"Cymfony, a division of TNS Media Intelligence, is a market influence analytics company that sifts and interprets the millions of voices at the intersection of traditional and social media such as blogs and social networks to gain consumer insight and develop stronger bonds with influencers."

"Cymfony's core is an advanced information extraction engine that combines information retrieval and Natural Language Processing (NLP) technologies to identify important people, places, companies, concepts, relationships and events in documents." From the web site, this looks like the latest version of Cymfony's "InfoXtract Engine".

Cymfony spun off the government systems business to form Janya.


"In a fully automated process BullDoc(tm) server will crawl your organization resources (shared directories, submitted emails, specific web sites), feed them to the information extraction engine that will save the extracted data into the database....The system comes with plug-ins for many applications (MSWord, outlook, numerous web browsers) that enable the user to view the documents/emails/web pages in the way he use to, only gives her the ability to browse and navigate within a document to the relevant information. "

Vantage Linguistics

"As a world leader in the development of linguistic software solutions, Vantage Linguistics continues to set the benchmark for innovation and excellence in language-based research and artificial intelligence."

Vantage offers a range of products, including language identifiers, spell checkers, grammar checkers, and linguistically informed search.

Viewpoints Network

"Viewpoints Network is a social technology and media company focused on helping consumers make smarter decisions. We specialize in building communities and motivating "social influencers" to share their experiences by writing reviews, blog posts, how to guides, participating in discussion boards and contributing and voting on ideas. We then help organize and present those contributions to help other consumers make well informed purchase decisions."

Visible Technologies

Doing sentiment and phrase mining over news feeds, truPULSE "helps you keep pace with the incredible speed and vast volume of social media conversations via an easy-to-use, RSS feed-based Web monitoring application. With truPULSE, organizations of all sizes can quickly and easily begin listening to and assessing online conversations about their brand."


"Wordtracker's leading-edge research tool gives you the keywords you need to rise above your competitors in search engine rankings. Even better, we also show you how keyword research can help you discover untapped market niches, get inspiration for new products, and create compelling content that distinguishes your site from the pack."

Xerox European Research Centre

"With the multiplication of on-line document repositories and the phenomenal growth of the Web, a fantastic amount of information is available at our fingertips. The central problem becomes that of quickly accessing, within that mass, the arbitrary pieces of information that are needed at any given time. As a large proportion of the data is made up of natural language texts, any comprehensive solution will rely heavily on natural language processing (NLP). Our research agenda concerns theories, methods, tools and systems that make it possible to uncover the content of natural language texts."

XRCE provides demos and licenses for software for finite-state automata, machine learning for categorization and clustering, robust parsing and semantics. You can find some online demos and links to research software from the above link.


"YooName is Named Entity Recognition software based on semi-supervised learning. It identifies nine named entity categories that are split into more than 100 sub-categories."

"The YooName database and rule system are built using semi-supervised learning techniques."

ZoomInfo "is the premier business information search engine, with profiles on more than 35 million people and 3.8 million companies. ZoomInfo delivers a single site for quick and easy access to in-depth information on industries, companies, people, products, services and jobs." "ZoomInfo, a semantic search engine, uses its patented Natural Language Processing algorithms to understand and organize the business web."

ZoomInfo focuses on search for people, companies or jobs on


ZyLAB is in e-discovery, and offers general text data mining, including entity extraction and summarization. They also do visualization and MT.

Lists of Tools and Corpora

Lots of other groups have put together lists like this. Almost all of them are more comprehensive than ours on one-off packages (like Adwait Ratnaparkhi's tagger, Michael Collins's parser, Eric Brill's tagger, the YamCha SVM tagger, the Cambridge-CMU language toolkit, etc.), and many contain links to them.