What's Database Text Mining?
This tutorial shows how to use a relational database management system
(RDBMS) to store documents and LingPipe analyses. It uses MEDLINE data as
the example data, and MySQL as the example RDBMS. As in the MEDLINE Parsing and Indexing Demo,
the LingPipe MedlineParser
is used to parse the data from
an XML file. Scripts are provided to create the database and database
tables. This tutorial is aimed at the novice database programmer, and
therefore the database design and the way that the program interacts
with the database have been kept as simple as possible.
For expository purposes, we break this task into three parts:
- Loading MEDLINE data into the database, using the LingMed MEDLINE parser and the JDBC API to access a RDBMS.
- Using the LingPipe API to annotate text data in the database, and to store the annotations back into the database.
- SQL database queries over the annotated data.
Completing part 1 results in a simple database with a single table containing the titles and abstracts of MEDLINE citations. In part 2 we add two more tables to the database, one for sentence information and one for mentions of named entities found in each sentence. We then present a program which retrieves text data from the MEDLINE citation table, runs it through the LingPipe sentence and named-entity analyzers, and stores the results in the sentence and mention tables. Once the database has been created and populated, we present a few example queries which can be run against it. Finally, we discuss ways in which this task can be optimized.
MySQL
MySQL runs on most operating systems, including Linux, Unix, Windows, and MacOS, and is available under both commercial and GPL licenses. This demo uses MySQL version 5.0, which is available from the MySQL website.
You will only need the "essentials" version for Windows x86 or AMD64. This demo runs with what the installer calls a "typical" install and the default (non-custom) configuration.
JDBC
The Java Database Connectivity (JDBC) API allows Java programs to access data stored in relational databases which support the standard SQL query language. For information on the JDBC API, see Sun's JDBC Database Access tutorial.
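For readers new to JDBC, here is a minimal sketch of the basic pattern: load a driver, open a connection, run a query, and iterate over the results. The connection URL, user name, and password are placeholders, and the query assumes the medline database built later in this tutorial.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcSketch {
        public static void main(String[] args) throws Exception {
            // load the MySQL driver class, which registers itself with the DriverManager
            Class.forName("com.mysql.jdbc.Driver");
            // placeholder URL, user name and password; adjust for your installation
            Connection con
                = DriverManager.getConnection("jdbc:mysql://localhost:3306/medline",
                                              "root", "password");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM citation");
            while (rs.next())
                System.out.println("citations: " + rs.getInt(1));
            rs.close();
            stmt.close();
            con.close();
        }
    }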
The official JDBC driver for MySQL, Connector/J, is available from the MySQL website. We ran this demo with mysql-connector-java-3.1.13-bin.jar.
The jar file containing the MySQL Connector/J driver must be on the
runtime classpath. The Ant build.xml file is
configured to find the 3.1 Connector/J jar if you put it in the lingpipe/demos/lib directory.
The connector is licensed under the GNU General Public License.
Part 1: Loading MEDLINE data into a database
Creating the database
We define a very simple database which contains one table of MEDLINE citation data.
The name of the database is medline and the name of the table is citation.
Each citation
record contains the MEDLINE citation PubMed identifier,
title, and abstract (if any).
Table: citation

| Column | Description |
|---|---|
| citation_id | database id, automatically generated via the MySQL AUTO_INCREMENT attribute |
| pubmed_id | PubMed id |
| title | article title, character encoding is Unicode (UTF-8) |
| abstract | abstract, if any, character encoding is Unicode (UTF-8) |
The script sql/medline_citations.sql contains the SQL commands to create the database, and can be executed via the MySQL command-line client:
$ mysql -u root -p -v < sql/medline_citations.sql
The system prompts for the password interactively, then executes the commands in medline_citations.sql, as seen in the following transcript:
    Enter password: ****
    --------------
    CREATE DATABASE IF NOT EXISTS medline
    --------------

    --------------
    CREATE TABLE IF NOT EXISTS citation (
        citation_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
        pubmed_id VARCHAR(10) UNIQUE NOT NULL,
        title VARCHAR(500) CHARSET utf8 NOT NULL,
        abstract VARCHAR(10000) CHARSET utf8 NULL
    )
    --------------
Loading MEDLINE citations
Once the database and table have been created, we are ready to run the program LoadMedlineDb.java, which parses MEDLINE citation instances out of an XML file and loads them into the database.
LoadMedlineDb
requires an input file which contains
the database server name, port, username, and password.
This information is needed to connect to the database using JDBC.
This demo uses the file db.properties, a Java properties file which contains installation-specific database settings stored as a set of key-value pairs.
This file is the first command-line argument to the program.
Remaining arguments are treated as names of XML files containing MEDLINE data.
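As a rough sketch of how such a properties file might be read (the key names url, username, and password here are illustrative, not necessarily the exact keys used by the demo):

    import java.io.FileInputStream;
    import java.util.Properties;

    // args[0] names the properties file, e.g. db.properties
    Properties props = new Properties();
    FileInputStream in = new FileInputStream(args[0]);
    props.load(in);
    in.close();
    // illustrative key names; the demo's db.properties may use different ones
    String dbUrl = props.getProperty("url");       // e.g. jdbc:mysql://localhost:3306/medline
    String username = props.getProperty("username");
    String password = props.getProperty("password");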
As in the MEDLINE parsing demo, we define a static class, MedlineDbLoader, that implements com.aliasi.medline.MedlineHandler. This class maintains a connection to the database via the instance variable:

    Connection mCon;
The method openDb
is used to set this variable to an
actual connection to the database, as follows:
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    mCon = DriverManager.getConnection(dbUrl,username,password);
The first line loads the MySQL driver class, which registers itself with the JDBC DriverManager. (See the Sun DriverManager Overview for more information on loading and registering drivers.) The second line uses the information from the db.properties file to open a connection to the database.
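Putting these pieces together, openDb and a matching closeDb might look roughly like the following sketch; the demo's actual method signatures and error handling may differ.

    // Sketch only; the demo code may differ in signatures and error handling.
    void openDb(String dbUrl, String username, String password)
            throws ClassNotFoundException, InstantiationException,
                   IllegalAccessException, SQLException {
        Class.forName("com.mysql.jdbc.Driver").newInstance();  // registers the driver
        mCon = DriverManager.getConnection(dbUrl, username, password);
    }

    void closeDb() throws SQLException {
        if (mCon != null)
            mCon.close();  // releases the database connection
    }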
The handle method takes a MedlineCitation object and stores the PubMed id, article title, and abstract (if any) as a record in the citation table. The SQL statement which inserts a record into the citation table looks like this:
INSERT INTO citation (pubmed_id, title, abstract) VALUES ('123456789','some title text','some abstract text')
In order to execute such a query efficiently, the JDBC API provides the PreparedStatement object, which precompiles the database query so that it can be run multiple times. Precompilation is performed on a schematic query, using question marks (?) to indicate parameters. In this case, each time we execute the query we will be passing in values for pubmed_id, title, and abstract, so we write the query as follows:
static final String INSERT_CITATION_SQL = "INSERT INTO citation (pubmed_id, title, abstract) VALUES (?,?,?)";
It is not necessary to specify a value for the
citation_id
column, since it will be automatically
populated by MySQL.
The Connection
prepareStatement
method takes
the query string and returns a PreparedStatement
object:
pstmt = mCon.prepareStatement(INSERT_CITATION_SQL);
Each question mark (?) in the query string designates a parameter which must be set using the correct setter method. Failing to set a parameter, or setting a parameter to the wrong type, will throw a SQLException. Here is the code which sets these parameters and executes the query:
    pstmt.setString(1,citation.pmid());
    pstmt.setString(2,citation.article().articleTitleText());
    Abstract abstr = citation.article().abstrct();
    if (abstr != null)
        pstmt.setString(3,abstr.text());
    else
        pstmt.setNull(3,Types.VARCHAR);
    pstmt.executeUpdate();
The above code is embedded in a try/catch/finally
construction in order to catch exceptions thrown by the database.
In the finally
block we call the close
method on the prepared statement and swallow any SQL exception thrown:
try { pstmt.close(); } catch (SQLException se) {}
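For reference, here is a sketch of how the pieces above fit together inside the handle method; the demo's actual code may differ in details such as how exceptions are reported.

    // Sketch of handle(); details such as logging may differ from the demo code.
    public void handle(MedlineCitation citation) {
        PreparedStatement pstmt = null;
        try {
            pstmt = mCon.prepareStatement(INSERT_CITATION_SQL);
            pstmt.setString(1, citation.pmid());
            pstmt.setString(2, citation.article().articleTitleText());
            Abstract abstr = citation.article().abstrct();
            if (abstr != null)
                pstmt.setString(3, abstr.text());
            else
                pstmt.setNull(3, Types.VARCHAR);
            pstmt.executeUpdate();
        } catch (SQLException se) {
            System.out.println("EXCEPTION HANDLING CITATION=" + citation.pmid()
                               + " " + se);
        } finally {
            if (pstmt != null) {
                try { pstmt.close(); } catch (SQLException se) { /* ignore */ }
            }
        }
    }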
When all of the citations have been processed, the main method calls the MedlineDbLoader close method, which closes the connection to the database.
To run part 1 of this demo:
- Install MySQL (if you don't already have an installation available).
- Run the script medline_citations.sql to create the database and table.
- Edit the file db.properties, setting the database configuration appropriately.
- Execute the Ant target loadmedline.
The results for the loadmedline target are:
    $ ant loadmedline
    Buildfile: build.xml

    compile:

    loadmedline:
         [java] Indexing file=../../data/medsamp2006.xml
         [java] Handling PMID=10540283
         [java] Handling PMID=10854512
         [java] Handling PMID=10972993
         [java] Handling PMID=11056631
         [java] Handling PMID=11034741
         [java] Handling PMID=11406024
         ...
         [java] Handling PMID=15611661
         [java] Handling PMID=15611667
         [java] Handling PMID=15968009

    BUILD SUCCESSFUL
    Total time: 3 seconds
Clearing the Database
Once the Ant target loadmedline
has run successfully,
attempts to run it again will result in a series of errors,
because every record that the program is trying to insert into
the database already exists:
    $ ant loadmedline
    Buildfile: build.xml

    compile:

    loadmedline:
         [java] Indexing file=../../data/medsamp2006.xml
         [java] Handling PMID=10540283
         [java] EXCEPTION HANDLING CITATION=10540283 java.sql.SQLException: Duplicate entry '10540283' for key 2
         [java] Handling PMID=10854512
         [java] EXCEPTION HANDLING CITATION=10854512 java.sql.SQLException: Duplicate entry '10854512' for key 2
         ...
         [java] Handling PMID=15968009
         [java] EXCEPTION HANDLING CITATION=15968009 java.sql.SQLException: Duplicate entry '15968009' for key 2

    BUILD SUCCESSFUL
    Total time: 3 seconds
If you wish to run this part of the demo again, then the existing set of
citations must be deleted from the database.
We have provided a script
sql/medline_clear_citations.sql
which deletes all records from the citation
table, and can be executed via the
MySQL command-line client:
$ mysql -u root -p -v < sql/medline_clear_citations.sql
The system prompts for the password interactively, then executes the commands in medline_clear_citations.sql, as seen in the following transcript:
    C:\mycvs\lingpipe\demos\tutorial\db>mysql -u root -p -v < sql/medline_clear_citations.sql
    Enter password: *****
    --------------
    SELECT COUNT(*) FROM citation
    --------------

    COUNT(*)
    88
    --------------
    DELETE FROM citation
    --------------

    --------------
    SELECT COUNT(*) FROM citation
    --------------

    COUNT(*)
    0
Now the Ant target loadmedline
can be run again.
Part 2: Using LingPipe to annotate data in the database
Part 2 of this demo builds on the database created and populated in part 1. Now that we have a database of text data, we show how to use the LingPipe API to annotate this data, and store the results back into the database.
Annotation consists of identifying sentences and mentions of named entities in each sentence. To do this we add two tables:
Table: sentence

| Column | Description |
|---|---|
| sentence_id | database id, automatically generated via the MySQL AUTO_INCREMENT attribute |
| citation_id | database id of the citation containing this sentence |
| offset | sentence start point (offset from start of text) |
| length | sentence length |
| type | either "title" or "abstract" |
Table: mention

| Column | Description |
|---|---|
| mention_id | database id, automatically generated via the MySQL AUTO_INCREMENT attribute |
| sentence_id | database id of the sentence containing this mention |
| offset | mention start point (offset from start of sentence) |
| length | mention length |
| type | type assigned to this mention by the language model |
| text | the mention text itself |
Because the title and abstract fields in the citation record already contain the citation text, the sentence record only stores information pointing back into the citation entry. Since we want to be able to query the database for particular mentions, we need to store the mention text in its own column. Given that we store the start index of the mention and the mention text itself, storing the length of the mention is not strictly necessary. This is a design decision, as is the choice of storing the start and length of the sentence and mention, as opposed to the start and end indices. Database design is a dark art, and as such, is outside the scope of this tutorial.
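To make the offset/length convention concrete, here is a small sketch of how a sentence and one of its mentions could be recovered from the stored columns; all values are illustrative.

    // Illustrative values only.
    String title = "A short example title.";   // citation.title
    String sentenceType = "title";             // sentence.type: "title" or "abstract"
    int sentOffset = 0;                        // sentence.offset within that field
    int sentLength = 22;                       // sentence.length in characters

    String sentenceText = title.substring(sentOffset, sentOffset + sentLength);
    // sentenceText is "A short example title."

    int mentionOffset = 16;                    // mention.offset is relative to the sentence
    int mentionLength = 5;                     // mention.length
    String mentionText
        = sentenceText.substring(mentionOffset, mentionOffset + mentionLength);
    // mentionText is "title"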
The script medline_annotations.sql
contains the SQL commands to add these two tables to the medline
database, and can be executed via the command-line client:
$ mysql -u root -p -v < sql/medline_annotations.sql
This command prompts for the password and then produces the following output:
    Enter password: *****
    --------------
    CREATE TABLE IF NOT EXISTS sentence (
        sentence_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
        citation_id INT UNSIGNED NOT NULL REFERENCES citation,
        offset INT UNSIGNED NOT NULL,
        length INT UNSIGNED NOT NULL,
        type VARCHAR(10) NOT NULL
    )
    --------------

    --------------
    CREATE TABLE IF NOT EXISTS mention (
        mention_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
        sentence_id INT UNSIGNED NOT NULL REFERENCES sentence,
        offset INT UNSIGNED NOT NULL,
        length INT UNSIGNED NOT NULL,
        type VARCHAR(40) NOT NULL,
        text VARCHAR(500) CHARSET utf8 NOT NULL
    )
    --------------
Once we have added these tables to the database, we are ready to run
the program AnnotateMedlineDb.java
.
AnnotateMedlineDb.java
connects to the database in the
same way that LoadMedlineDb.java
does, and therefore we
reuse the openDb and closeDb methods, along
with the strategy of storing database information in a properties file
which is passed in as a command-line argument to the program.
AnnotateMedlineDb.java
creates an instance of com.aliasi.sentences.SentenceChunker
to identify sentences in the citation text fields. It reconstitutes
a serialized instance of com.aliasi.chunk.Chunker
to extract named entities from the sentences.
These are instantiated in the AnnotateMedlineDb
constructor:
    tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
    sentenceModel = new IndoEuropeanSentenceModel();
    sentenceChunker = new SentenceChunker(tokenizerFactory, sentenceModel);
    genomicsModelfile = new File("../../models/ne-en-bio-genia.TokenShapeChunker");
    neChunker = (Chunker) AbstractExternalizable.readObject(genomicsModelfile);
The SentenceChunker constructor requires a tokenizer factory and a sentence model; in this example we use the standard Indo-European tokenizer factory and sentence model that are part of the LingPipe distribution. The Chunker is deserialized from the specified file, the ne-en-bio-genia.TokenShapeChunker model distributed with the LingPipe demos, using a helper method in com.aliasi.util.AbstractExternalizable.
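As a small usage sketch (the demo's annotateSentences method may be organized differently), the sentence chunker can be applied to a piece of citation text like this:

    // Sketch: finding sentence spans in a piece of citation text.
    String text = "NirX is soluble. NirI is membrane-bound.";
    Chunking sentChunking
        = sentenceChunker.chunk(text.toCharArray(), 0, text.length());
    for (Iterator it = sentChunking.chunkSet().iterator(); it.hasNext(); ) {
        Chunk sentence = (Chunk) it.next();
        int start = sentence.start();   // offset from the start of the text
        int end = sentence.end();
        // store (offset, length, type) in the sentence table, then
        // annotate mentions within text.substring(start, end)
        System.out.println("sentence: " + text.substring(start, end));
    }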
To annotate a citation, we first retrieve the title and abstract columns from the database. As in part 1 of this demo, we use a prepared statement object.
    static final String GET_CITATION_TEXT_SQL
        = "SELECT title, abstract FROM citation where citation_id = ?";
    ...
    protected void annotateCitation(int citationId) {
        ...
        pstmt = mCon.prepareStatement(GET_CITATION_TEXT_SQL);
        pstmt.setInt(1,citationId);
        rs = pstmt.executeQuery();
        if (rs.next()) {
            title = rs.getString("title");
            abstr = rs.getString("abstract");
        }
        rs.close();
        pstmt.close();
Then we pass the title and abstract text into
annotateSentences
. The
com.aliasi.sentences.SentenceChunker.chunk
method is used
to break the text into a set of sentences. As each sentence is
stored, the database returns a unique identifier for it. We then annotate
the text of the sentence with the following method:
    protected void annotateMentions(int sentenceId, String text) throws SQLException {
        Chunking chunking = neChunker.chunk(text.toCharArray(), 0, text.length());
        Set mentions = chunking.chunkSet();
        for (Iterator it = mentions.iterator(); it.hasNext(); ) {
            Chunk mention = (Chunk)it.next();
            int start = mention.start();
            int end = mention.end();
            storeMention(sentenceId,start,mention.type(),
                         text.substring(start,end));
        }
    }
The chunk
method of Chunker
is used
to find all named entities in the sentence. It returns a
com.aliasi.chunk.Chunking
,
which is a set of chunks over a shared underlying character sequence
(in this example, the text
).
The com.aliasi.chunk.Chunking.chunkSet
method returns the set of
com.aliasi.chunk.Chunk
objects.
We store the mention start, type, and the text spanned by this chunk in the database.
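The storeSentence and storeMention methods are not shown above. A sketch of how they might work is given below; it uses JDBC's generated-key facility to recover the AUTO_INCREMENT sentence_id, and the SQL strings and signatures are assumptions for illustration, not the demo's actual code.

    // Sketches only; the demo's actual storeSentence/storeMention may differ.
    static final String INSERT_SENTENCE_SQL
        = "INSERT INTO sentence (citation_id, offset, length, type) VALUES (?,?,?,?)";
    static final String INSERT_MENTION_SQL
        = "INSERT INTO mention (sentence_id, offset, length, type, text) VALUES (?,?,?,?,?)";

    int storeSentence(int citationId, int offset, int length, String type)
            throws SQLException {
        PreparedStatement pstmt
            = mCon.prepareStatement(INSERT_SENTENCE_SQL,
                                    Statement.RETURN_GENERATED_KEYS);
        pstmt.setInt(1, citationId);
        pstmt.setInt(2, offset);
        pstmt.setInt(3, length);
        pstmt.setString(4, type);
        pstmt.executeUpdate();
        ResultSet keys = pstmt.getGeneratedKeys();   // the AUTO_INCREMENT sentence_id
        keys.next();
        int sentenceId = keys.getInt(1);
        keys.close();
        pstmt.close();
        return sentenceId;
    }

    void storeMention(int sentenceId, int offset, String type, String text)
            throws SQLException {
        PreparedStatement pstmt = mCon.prepareStatement(INSERT_MENTION_SQL);
        pstmt.setInt(1, sentenceId);
        pstmt.setInt(2, offset);
        pstmt.setInt(3, text.length());   // mention length, per the table definition
        pstmt.setString(4, type);
        pstmt.setString(5, text);
        pstmt.executeUpdate();
        pstmt.close();
    }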
To run part 2 of this demo:
- (run part 1 of this demo)
- Run the script medline_annotations.sql to create the sentence and mention tables.
- Run the Ant target annotatemedline.
Running the annotatemedline target produces the following output:
    $ ant annotatemedline

    annotatemedline:
         [java] Annotating citation_id=89
         [java] Annotating citation_id=90
         [java] Annotating citation_id=91
         ...
         [java] Annotating citation_id=174
         [java] Annotating citation_id=175
         [java] Annotating citation_id=176
Note that the identifiers may vary depending on the state of the database.
Clearing the Database
Executing the Ant target annotatemedline a second time will result in duplicate annotations for each citation (since for demo purposes we have created an extremely simple database). If you wish to run this part of the demo again, the existing annotations should be deleted from the database first.
Run the script medline_clear_annotations.sql to
delete the existing records from the sentence and mention tables.
Now you can run the Ant target annotatemedline
again.
Part 3: Querying the Database
With the medline
database in place, we can begin running
queries. In the following examples we show the transcript of a
session using the MySQL command-line client, which is started without
passing in a file argument:
$ mysql -u root -p -v
To check that the data loading process was successful, we run simple queries to get row counts on tables and do spot checks on the data.
First we check the number of records in the
citation
, sentence
, and mention
tables,
using the query: "select count(*) from table
."
    $ mysql -u root -p -v
    Enter password: *****
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 18 to server version: 5.0.22-community-nt

    mysql> use medline
    Database changed
    mysql> select count(*) from citation;
    --------------
    select count(*) from citation
    --------------

    +----------+
    | count(*) |
    +----------+
    |       88 |
    +----------+
    1 row in set (0.02 sec)

    mysql> select count(*) from sentence;
    --------------
    select count(*) from sentence
    --------------

    +----------+
    | count(*) |
    +----------+
    |      555 |
    +----------+
    1 row in set (0.00 sec)

    mysql> select count(*) from mention;
    --------------
    select count(*) from mention
    --------------

    +----------+
    | count(*) |
    +----------+
    |     1570 |
    +----------+
    1 row in set (0.00 sec)
To check the data we select a few rows from each table, via the query:
"select * from table limit N
."
To see the data in each column on a separate line, we terminate the query with "\G" instead of ";".
    mysql> select * from mention limit 5 \G
    --------------
    select * from mention limit 5
    --------------

    *************************** 1. row ***************************
     mention_id: 1571
    sentence_id: 556
         offset: 0
         length: 24
           type: other_name
           text: Transcription regulation
    *************************** 2. row ***************************
     mention_id: 1572
    sentence_id: 556
         offset: 32
         length: 16
           type: DNA_domain_or_region
           text: nir gene cluster
    *************************** 3. row ***************************
     mention_id: 1573
    sentence_id: 556
         offset: 58
         length: 17
           type: protein_family_or_group
           text: nitrite reductase
    *************************** 4. row ***************************
     mention_id: 1574
    sentence_id: 556
         offset: 79
         length: 24
           type: mono_cell
           text: Paracoccus denitrificans
    *************************** 5. row ***************************
     mention_id: 1575
    sentence_id: 556
         offset: 113
         length: 3
           type: other_name
           text: NNR
    5 rows in set (0.00 sec)
Citations are listed in the same way.
    mysql> select * from citation limit 1 \G
    --------------
    select * from citation limit 1
    --------------

    *************************** 1. row ***************************
    citation_id: 89
      pubmed_id: 10540283
          title: Transcription regulation of the nir gene cluster encoding nitrite reductase of Paracoccus denitrificans involves NNR and NirI, a novel type of membrane protein.
       abstract: The nirIX gene cluster of Paracoccus denitrificans is located between the nir and nor gene clusters encoding nitrite and nitric oxide reductases respectively. The NirI sequence corresponds to that of a membrane-bound protein with six transmembrane helices, a large periplasmic domain and cysteine-rich cytoplasmic domains that resemble the binding sites of [4Fe-4S] clusters in many ferredoxin-like proteins. NirX is soluble and apparently located in the periplasm, as judged by the predicted signal sequence. NirI and NirX are homologues of NosR and NosX, proteins involved in regulation of the expression of the nos gene cluster encoding nitrous oxide reductase in Pseudomonas stutzeri and Sinorhizobium meliloti. Analysis of a NirI-deficient mutant strain revealed that NirI is involved in transcription activation of the nir gene cluster in response to oxygen limitation and the presence of N-oxides. The NirX-deficient mutant transiently accumulated nitrite in the growth medium, but it had a final growth yield similar to that of the wild type. Transcription of the nirIX gene cluster itself was controlled by NNR, a member of the family of FNR-like transcriptional activators. An NNR binding sequence is located in the middle of the intergenic region between the nirI and nirS genes with its centre located at position -41.5 relative to the transcription start sites of both genes. Attempts to complement the NirI mutation via cloning of the nirIX gene cluster on a broad-host-range vector were unsuccessful, the ability to express nitrite reductase being restored only when the nirIX gene cluster was reintegrated into the chromosome of the NirI-deficient mutant via homologous recombination in such a way that the wild-type nirI gene was present directly upstream of the nir operon.
    1 row in set (0.00 sec)
The NEChunker assigned types to mentions using a model trained on the
GENIA corpus. GENIA labels dozens of categories in the genomics and
proteomics domain. For more information on this corpus, see the GENIA Project Home
Page. To see the distribution of mention types in the medline
database, we run the following query:
    mysql> select distinct type, count(*)
        -> from mention group by type;
    --------------
    select distinct type, count(*) from mention group by type
    --------------

    +--------------------------+----------+
    | type                     | count(*) |
    +--------------------------+----------+
    | amino_acid_monomer       |        9 |
    | atom                     |       10 |
    | body_part                |       26 |
    | carbohydrate             |        3 |
    | cell_component           |       15 |
    | cell_line                |       31 |
    | cell_type                |       42 |
    | DNA_domain_or_region     |       72 |
    | DNA_family_or_group      |       16 |
    | DNA_molecule             |        5 |
    | DNA_N                    |        1 |
    | DNA_substructure         |        2 |
    | inorganic                |       15 |
    | lipid                    |       14 |
    | mono_cell                |       14 |
    | multi_cell               |      148 |
    | nucleotide               |        8 |
    | other_artificial_source  |        4 |
    | other_name               |      773 |
    | other_organic_compound   |      123 |
    | peptide                  |        8 |
    | polynucleotide           |        6 |
    | protein_domain_or_region |       13 |
    | protein_family_or_group  |       42 |
    | protein_molecule         |      112 |
    | protein_N                |        8 |
    | protein_substructure     |        5 |
    | RNA_domain_or_region     |        3 |
    | RNA_family_or_group      |        4 |
    | RNA_molecule             |        4 |
    | RNA_N                    |        1 |
    | RNA_substructure         |        1 |
    | tissue                   |       27 |
    | virus                    |        5 |
    +--------------------------+----------+
    34 rows in set (0.00 sec)
To find the 10 most frequently mentioned proteins (regardless of subtype) we use the query:
    mysql> select text, count(*) from mention
        -> where type like 'protein%'
        -> group by 1 order by 2
        -> desc limit 10;
    --------------
    select text, count(*) from mention where type like 'protein%' group by 1 order by 2 desc limit 10
    --------------

    +-------------------+----------+
    | text              | count(*) |
    +-------------------+----------+
    | 201Tl             |        5 |
    | PASG              |        4 |
    | prostate          |        3 |
    | Raf-1             |        3 |
    | MAPK              |        3 |
    | PKC-alpha         |        3 |
    | PKC-delta         |        3 |
    | nitrite reductase |        2 |
    | NirI              |        2 |
    | surgical          |        2 |
    +-------------------+----------+
    10 rows in set (0.00 sec)
To find sentences which mention a particular named entity, we use the query:
    mysql> select sentence_id from mention
        -> where text = 'MAPK';
    --------------
    select sentence_id from mention where text = 'MAPK'
    --------------

    +-------------+
    | sentence_id |
    +-------------+
    |         942 |
    |         947 |
    |         951 |
    +-------------+
    3 rows in set (0.01 sec)
Finally, we present a slightly more complex query, which finds all pairs of protein names that co-occur in a sentence. To do this we join the mention table to itself and restrict the selection to proteins which co-occur in the same sentence. Because this query returns a table with two name columns, each pair of names will occur twice in the result, once as the pair {A,B} and once as the pair {B,A}. To remove this duplication, we require the names to occur in alphabetical order; this keeps {A,B} and discards {B,A}. The full query is:
    mysql> select m1.text, m2.text
        -> from mention m1, mention m2
        -> where m1.sentence_id = m2.sentence_id
        -> and m1.type like 'protein%'
        -> and m2.type like 'protein%'
        -> and m1.text < m2.text
        -> limit 10;
    --------------
    select m1.text, m2.text from mention m1, mention m2 where m1.sentence_id = m2.sentence_id and m1.type like 'protein%' and m2.type like 'protein%' and m1.text < m2.text limit 10
    --------------

    +-----------------------------------+------------------------+
    | text                              | text                   |
    +-----------------------------------+------------------------+
    | NirI                              | nitrite reductase      |
    | membrane protein                  | nitrite reductase      |
    | membrane protein                  | NirI                   |
    | membrane-bound protein            | NirI sequence          |
    | cysteine-rich cytoplasmic domains | NirI sequence          |
    | ferredoxin-like proteins          | NirI sequence          |
    | cysteine-rich cytoplasmic domains | membrane-bound protein |
    | ferredoxin-like proteins          | membrane-bound protein |
    | NirI sequence                     | transmembrane helices  |
    | membrane-bound protein            | transmembrane helices  |
    +-----------------------------------+------------------------+
    10 rows in set (0.00 sec)
Finally, the client may be exited as follows:
    mysql> quit
    Bye
Discussion
Because we chose to present the JDBC API first and the LingPipe annotation API second, we have two separate programs where one program would be more efficient. It is left as an exercise to the reader to write a program which uses LingPipe to parse and annotate MEDLINE data in one pass.
The medline database developed in this example, and the code that accesses it, have been kept as simple as possible, without any attempt to optimize database performance. The data loading process is extremely inefficient: as each new piece of information is identified, it is inserted into the database. For very large data sets, moving to a bulk-loading strategy would be advised. The database tables haven't been tuned with respect to the expected set of queries that will be run against them. For example, if the mention table contains 5 million rows, a query like:
select * from mention where text = 'foobar'
could be quite slow, and if so, adding an index on the
text
column would be in order.
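Returning to the bulk-loading point above, one common improvement is JDBC batching, which queues many inserts and sends them to the server together. The following sketch is not part of the demo code, and CitationRecord is a hypothetical holder for a parsed citation's fields.

    // Sketch: batching citation inserts instead of executing them one at a time.
    PreparedStatement pstmt = mCon.prepareStatement(INSERT_CITATION_SQL);
    int count = 0;
    for (Iterator it = citations.iterator(); it.hasNext(); ) {
        CitationRecord c = (CitationRecord) it.next();   // hypothetical type
        pstmt.setString(1, c.pubmedId());
        pstmt.setString(2, c.title());
        if (c.abstractText() != null)
            pstmt.setString(3, c.abstractText());
        else
            pstmt.setNull(3, Types.VARCHAR);
        pstmt.addBatch();                  // queue the insert
        if (++count % 500 == 0)
            pstmt.executeBatch();          // send 500 queued inserts to the server
    }
    pstmt.executeBatch();                  // send any remaining inserts
    pstmt.close();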
Database optimization depends on the performance goals and/or constraints of the application(s) that the database supports. For a good introduction to this subject, we recommend the book SQL Performance Tuning.
References
- MySQL Reference Manual (searchable, with user comments)
- MySQL Administrator GUI client
- MySQL Query Browser
- Sun's JDBC Tutorial
- SQL Performance Tuning Book