What is MEDLINE?

MEDLINE is a collection of more than 13 million citations to the biomedical literature maintained by the United States National Library of Medicine (NLM). It contains richly structured data including authors, affiliations, titles, abstracts, grants, medical subject headings (MeSH), etc. LingPipe provides tools to parse the data from its native XML format into structured Java objects.

What's in this Tutorial?

This tutorial has two parts. The first covers how to process MEDLINE using a sample file distributed with LingPipe. The second shows how to use a LingPipe command-line tool to automatically download MEDLINE, verify its checksums, and keep it updated on a rolling basis.

The first part of the tutorial shows how to parse the data so that it may be accessed and/or stored. As an application, this tutorial stores the citations in the index of a text search engine (Apache's Lucene) and reconstitutes the structured objects as search results. The result is a kind of object-oriented search engine combining aspects of object-relational mapping in databases (ORM) and parsing and document object modeling in XML/HTML (DOM).

Running the Demo

To run the demo, change directories to demos/tutorial/medline and then run:

ant index-samples

This will print the PubMed IDs of the citations as they are indexed:

Index directory=C:\carp\mycvs\lingpipe\demos\tutorial\medline\lucene-idx

Indexing file=../../data/medsamp2008.xml
     Adding PMID=10540283
     Adding PMID=10854512
     ...
     Adding PMID=15968009
     Adding PMID=16949478

Luke Index Browser

Once the index is created, it may be viewed and searched using Luke, the Lucene index toolbox, using the command:

ant inspect-index

Luke is fairly self-explanatory. There are tabs below the menu, the most useful for inspection being Overview, Documents and Search, which provide high level statistics for the collection, a way to inspect the documents and their processing, and a way to carry out search.

Like Lucene, Luke requires the analyzers used at run time to match those used at indexing time. In our case, the right analyzer is TokenLengthAnalyzer (which we implement in src/TokenLengthAnalyzer.java). It also requires the right default field; in our case, this is ABS for abstracts.

Luke uses Lucene's query parser (which is notoriously non-robust). More information is available from the Lucene query parser documentation.

More documentation for Luke itself is available near the bottom of the:

MEDLINE XML Sample Files

NLM distributes one small sample file in plain XML, which we have included in the LingPipe distribution:

Each XML file contains a set of citations under a single element MedlineCitationSet, with individual citations being the content of elements MedlineCitation. The gzipped files unpack into a single XML file adhering to the same DTD as the small sample.

All of the demos can be run directly from the gzipped MEDLINE files. You can save the gzipped files wherever you want and include a path to them as an argument to the Ant targets such as index-samples.

Licensing MEDLINE

NLM's web site also contains information on how to:

It's free for research and most commercial purposes.

Parsing MEDLINE

The complete source of this tutorial is available as src/IndexBaseline.java. As with our other tutorials, parameters are hardwired as statics whenever possible for simplicity.

Parsing MEDLINE requires an instance of com.aliasi.medline.MedlineParser, which is created as a static instance with the line:

static MedlineParser PARSER
    = new MedlineParser(true);

The argument true tells it to save the raw XML as part of a citation instance. We'll see how that's used later.

The parser works through callbacks to an implementation of the com.aliasi.medline.MedlineHandler interface. We have defined a static class in the demo called CitationIndexer that implements MedlineHandler. This is created with no arguments because it hard-codes output locations for this demo:

CitationIndexer indexer
    = new CitationIndexer();

The lines that actually call the parser are in the indexXML method:

String url = Files.fileToURLName(file);
InputSource inSource = new InputSource(url);
PARSER.parse(inSource,indexer);

The first line just converts the file to a URL representation. The second line creates an org.xml.sax.InputSource from that URL. The third line calls the MEDLINE parser on the specified input source with the specified indexer. The parser then parses the input and, for each citation in the input, calls the indexer's MedlineHandler.handle(MedlineCitation) method.

The indexer first prints the ID of the citation it's handling before proceeding with the Lucene indexing.

public void handle(MedlineCitation citation) {
    System.out.println("Handling PMID="
                       + citation.pmid());
    ...

All of the structured content of a citation is available through the citation parameter. Here, only the PubMed identifier (the unique identifier for MEDLINE citations) is used by calling citation.pmid(). The handler can perform arbitrary operations on the citation, accumulating results in a database, calculating occurrence statistics for particular MeSH terms, etc.
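The callback style described above can be sketched with nothing but the JDK's SAX parser. This is a stand-in rather than LingPipe's implementation: the PmidHandler interface and the CallbackSketch class are hypothetical names, and only the first PMID element inside each MedlineCitation is passed to the callback.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the MedlineParser/MedlineHandler pair; JDK only.
class CallbackSketch {

    // Hypothetical callback interface playing the role of MedlineHandler.
    interface PmidHandler {
        void handle(String pmid);
    }

    // Parses the input and calls handler.handle() once per citation,
    // passing the text of the first PMID element in each MedlineCitation.
    static void parse(InputSource in, final PmidHandler handler) throws Exception {
        DefaultHandler saxHandler = new DefaultHandler() {
            private boolean inCitation = false;
            private boolean seenPmid = false;
            private StringBuilder pmidText = null; // non-null while inside PMID

            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (qName.equals("MedlineCitation")) {
                    inCitation = true;
                    seenPmid = false;
                } else if (inCitation && !seenPmid && qName.equals("PMID")) {
                    pmidText = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (pmidText != null) pmidText.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("PMID") && pmidText != null) {
                    handler.handle(pmidText.toString().trim());
                    pmidText = null;
                    seenPmid = true;
                } else if (qName.equals("MedlineCitation")) {
                    inCitation = false;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, saxHandler);
    }

    // Convenience wrapper: collects the PMIDs found in an XML string.
    static List<String> pmids(String xml) {
        final List<String> result = new ArrayList<String>();
        try {
            parse(new InputSource(new StringReader(xml)),
                  pmid -> result.add(pmid));
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return result;
    }
}
```

Any accumulation logic (a database write, a counter per MeSH term) would go where the lambda adds to the list.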

Indexing with Lucene

As an example, our MedlineHandler for the demo indexes various components of the citation using the Apache Lucene text search engine. The site for Lucene is:

A recent Lucene release is bundled with the LingPipe demo distribution.

Lucene creates a document index through an instance of org.apache.lucene.index.IndexWriter. This is created and stored as a member variable in the static nested class IndexBaseline.CitationIndexer:

static class CitationIndexer
    implements MedlineHandler {

    IndexWriter mLuceneIndexer;

    CitationIndexer() throws IOException {
        File indexDir
            = new File("lucene-index");
        Analyzer analyzer = new SimpleAnalyzer();
        boolean freshIndex = true;
        mLuceneIndexer
            = new IndexWriter(indexDir,analyzer,
                              freshIndex);
    }
    ...

Note that we have hard-coded the directory name "lucene-index" in the class. Lucene indexers need an implementation of the abstract class org.apache.lucene.analysis.Analyzer in order to perform tokenization, token normalization, and stoplisting. We have hardcoded an instance of org.apache.lucene.analysis.SimpleAnalyzer. The last parameter is a boolean and setting it to true causes Lucene to create a fresh index in the specified directory.

The handle method of CitationIndexer creates an empty Lucene document in its first line and then adds fields to it:

  public void handle(MedlineCitation citation) {
      ...
      Field idField = new Field("ID",citation.pmid(),
                                Field.Store.YES,
                                Field.Index.NO_NORMS);
      doc.add(idField);
      ...

Here we see the new instance of org.apache.lucene.document.Document being created and the first field being added. The field is a keyword with the key "ID" and a value equal to the PubMed ID of the citation; keyword fields are stored and indexed, but not tokenized.

At the end of this method, the document is actually added to the index and any I/O errors are caught and logged:

try {
    mLuceneIndexer.addDocument(doc);
} catch (IOException e) {
    System.err.println("Handling Exception="
                       + e);
}

The bulk of the method involves extracting information from the MEDLINE citation object and adding it to the Lucene document object. The one entry that's particularly relevant to the MEDLINE parser is the next one:

doc.add(Field.UnIndexed("RAW",
                        citation.xmlString()));

This says to add an unindexed field (stored, not indexed, not tokenized) with key "RAW". Its content is the result of calling the method citation.xmlString(), which is the raw XML representation of the entry. We store this so we can parse it again on information retrieval to recover the full document.
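The idea of storing raw XML and re-parsing it at retrieval time can be illustrated without Lucene. In the sketch below, a plain map stands in for the stored RAW field and the JDK's DOM parser stands in for the MEDLINE parser; the class name RawXmlStoreSketch and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration: a map plays the role of the stored "RAW" field.
class RawXmlStoreSketch {

    private final Map<String,String> rawById = new HashMap<String,String>();

    // "Indexing": keep the raw XML under the citation's PubMed ID.
    void store(String pmid, String rawXml) {
        rawById.put(pmid, rawXml);
    }

    // "Retrieval": re-parse the stored XML and return the text of the
    // first ArticleTitle element, or null if the ID or title is missing.
    String title(String pmid) {
        String raw = rawById.get(pmid);
        if (raw == null) return null;
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(raw)));
            NodeList titles = doc.getElementsByTagName("ArticleTitle");
            return titles.getLength() == 0
                ? null
                : titles.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not re-parse stored XML", e);
        }
    }
}
```

The trade-off is index size against flexibility: the stored XML duplicates what is in the indexed fields, but any part of the citation can be recovered later without re-reading the distribution files.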

The next two lines are straightforward:

Article article = citation.article();
doc.add(Field.Text("TITLE",
                   article.articleTitleText()));

The first line extracts the com.aliasi.medline.Article from the citation. All articles have titles, so the next line just adds this as a text field with key TITLE. Text fields are stored (really redundant given that we're storing the full XML), tokenized, and indexed.

Continuing on, we next extract the date, which is stored in two locations in MEDLINE citations, depending on whether the article was in a journal or a book. Here's the logic to add the date field:

JournalIssue issue
    = article.journal().journalIssue();
PubDate date
    = issue != null
    ? issue.pubDate()
    : article.book().pubDate();
doc.add(Field.Text("DATE",
                   date.toString()));

Dates are in whacky formats in MEDLINE, so we have not attempted to normalize them into Java Date objects, but have simply left them as strings.

Abstracts are optional, so we test for null before adding them:

Abstract abstr = article.abstrct();
if (abstr != null)
  doc.add(Field.Text("ABS",
                     abstr.textWithoutTruncationMarker()));

MeSH headings are more complexly structured in MEDLINE. There may be zero or more headings, each consisting of one or more topics, with each topic being either a major or minor topic of the article. In order to add all of the major topics, we have:

MeshHeading[] headings = citation.meshHeadings();
for (int i = 0; i < headings.length; ++i) {
    Topic[] topics = headings[i].topics();
    for (int j = 0; j < topics.length; ++j)
        if (topics[j].isMajor())
            doc.add(Field.Text("MESH",
                               topics[j].topic()));
}

Parsing from GZipped Input

The baseline distribution of MEDLINE for 2008 contains 563 GZipped files containing over 16 million citations. Full information on the 2008 baseline is available from:

The compressed (gzip) files occupy more than seven gigabytes (about five times that size uncompressed, because it's mostly repetitive XML). It is easier to store them in GZipped format and access them directly rather than unzipping them. To this end, we provide a method that does this directly (ignore the try/catch logic for closing the streams):

FileInputStream fileIn
    = new FileInputStream(file);
GZIPInputStream gzipIn
    = new GZIPInputStream(fileIn);
InputStreamReader inReader
    = new InputStreamReader(gzipIn,Strings.UTF8);
BufferedReader bufReader
    = new BufferedReader(inReader);
InputSource inSource
    = new InputSource(bufReader);
inSource.setSystemId(Files.fileToURLName(file));
PARSER.parse(inSource,indexer);

This approach uses Java's built-in java.util.zip.GZIPInputStream class, but is otherwise very much like parsing out of plain text. The one thing to note is the penultimate line, which sets the system ID for the input source to be the URL of the file; this allows DTDs and other relative references to be read off, and is good practice, but not strictly necessary for this example.
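The same stream chain can be tried out with only the JDK. The sketch below gzips a small XML string in memory, then decompresses and parses it on the fly with a SAX parser that counts MedlineCitation elements; the class GzipParseSketch and its methods are hypothetical and stand in for the LingPipe parser and handler.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical JDK-only illustration of parsing XML straight from gzip.
class GzipParseSketch {

    // Gzips a string in memory so the example needs no files on disk.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts MedlineCitation elements, decompressing while parsing.
    static int countCitations(InputStream gzippedXml) {
        final int[] count = { 0 };
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new GZIPInputStream(gzippedXml),
                                       StandardCharsets.UTF_8))) {
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    if (qName.equals("MedlineCitation")) ++count[0];
                }
            };
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(reader), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }
}
```

For a file on disk, the argument to countCitations would be a FileInputStream; the try-with-resources block then closes the whole chain.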

Downloading and Updating MEDLINE

In this section of the tutorial, we show how to download the MEDLINE baseline distribution (released once per year) automatically. The command not only handles the details of the ftp protocol and finding the data, but also checks the checksums and reschedules downloads if they do not match. The command may be interrupted and restarted at any time; it will always do checks against the checksums before finishing.

Licensing MEDLINE

MEDLINE data is free for most purposes. It is licensed by the United States National Library of Medicine (NLM). See the following link for more information:

NLM is very responsive and will help you out if you have problems. Note that you will need to register your IP address for downloading with the NLM.

Downloading MEDLINE

You can view the contents of the various files, including the dates or citations, in:

Before downloading, you need to register your IP address with NLM. So you'll need a fixed IP address to keep up with MEDLINE.

We have written a script that will let you do both the downloads and the checksums. It only downloads the files it needs, and therefore may be run after wget to check the checksums. The download command may be invoked from Ant with the following target (the e-mail is for NLM's convenience; some value must be specified, but it's not checked against your actual registered e-mail, which is why plain wget works):

> ant download-baseline
   -Dmedline.pwd=YourEmail
   -Dmedline.dir=TargetDir

where YourEmail should be your email address and TargetDir is the directory into which you want the MEDLINE data written. The full command, if you don't want to use Ant, is the following (all on one line, as usual, with semicolons converted to colons under Unix/Linux):

java
-cp "build/classes;
    ../../../lingpipe-3.5.1.jar;
    ../../lib/commons-net-1.4.1.jar"
DownloadMedline
-domain=ftp.nlm.nih.gov
-user=anonymous
-password=YourEmail
-medlineDir=TargetDir
-baselinepath=/nlmdata/.medleasebaseline/gz

The output looks like this:

> ant download-baseline -Dmedline.pwd=carp@alias-i.com -Dmedline.dir=e:/tmp
Downloading MEDLINE
  Start time=Mon Dec 18 17:56:35 EST 2006
  Domain=ftp.nlm.nih.gov
  Baseline Path on Domain=/nlmdata/.medleasebaseline/gz
  Update Path on Domain=null
  MEDLINE Target Directory=f:\data\medline\dist\2008
  User name=anonymous
  Password=carp@aliasi.com
  Max tries=8
Establishing FTP connection.
  Connecting to NLM
  Connected.
  Logging in.
Logged in to FTP Server.

Checking/Downloading Baseline.
Reading from server path=/nlmdata/.medleasebaseline/gz
Writing to target directory=f:\data\medline\dist\2008\baseli
Number of existing files=551
Reading list of file names from server.
Found server file names. Number of files=1077
Server files=[medline07n0001.xml.gz, medline07n0002.xml.gz, ...



Downloading MEDLINE
  Start time=Mon Jan 30 15:37:52 EST 2006
  Domain=ftp.nlm.nih.gov
  Baseline Path on Domain=/nlmdata/.medleasebaseline/gz
  Update Path on Domain=null
  MEDLINE Target Directory=e:/tmp
  User name=anonymous
  Password=carp@alias-i.com
  Max tries=8
Establishing FTP connection.
  Connecting to NLM
  Connected.
  Logging in.
Logged in to FTP Server.

Checking/Downloading Baseline.
Reading from server path=/nlmdata/.medleasebaseline/gz
Writing to target directory=e:\tmp\baseline
Number of existing files=0
Found server file names. Number of files=1032

Downloading checksums.
Done downloading checksums.
Elapsed time=2:11
Number of citation files, based on checksums=516

Checking existing files.
Finished existing file check. Progress: 0/516 Elapsed time=2:11

Download new files.
medline06n0509.xml.gz OK Progress: 1/516 Elapsed time=3:14
medline06n0001.xml.gz OK Progress: 2/516 Elapsed time=3:25

...

MEDLINE DOWNLOAD COMPLETE.

The program first prints an overview of when it's downloading and the full FTP information, and then provides ongoing output to indicate what it's doing. It checks the number of files on the server and then downloads the checksums. It checks the checksum files to make sure they are in the right format, re-downloading any that fail the check. Then any existing data files are checked against the checksum files; the ones that pass are not downloaded again. Finally, the remaining files are downloaded one by one, with OK being printed if they pass the checksum after download.
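The retry behavior suggested by the "Max tries" line in the output can be sketched as a bounded loop that accepts a download only when its MD5 matches. This is an assumption about the general shape, not the code in DownloadMedline: the class RetrySketch, the Supplier standing in for an FTP fetch, and the choice to return null on failure are all hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical sketch of "download until the checksum matches".
class RetrySketch {

    // MD5 digest of a byte array using the JDK's MessageDigest.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Calls fetch up to maxTries times and returns the first result whose
    // MD5 equals expectedMd5, or null if every attempt fails the check.
    static byte[] fetchVerified(Supplier<byte[]> fetch,
                                byte[] expectedMd5,
                                int maxTries) {
        for (int attempt = 1; attempt <= maxTries; ++attempt) {
            byte[] data = fetch.get(); // stand-in for one FTP download
            if (data != null && Arrays.equals(md5(data), expectedMd5))
                return data;
        }
        return null;
    }
}
```

Because each attempt is verified independently, the loop is safe to interrupt and restart, which matches the restartable behavior described for the command.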

This job may take several hours -- the 2008 baseline is roughly 7 gigabytes. It may also need to be restarted if there are continued network errors.

The baseline distribution is released once per year, so each year's baseline should only need to be downloaded once.

Downloading Updates

MEDLINE provides updated entries in its updates directory. These are released almost daily. The following command can be run once to catch up to the current day's updates and may then be run as a scheduled job in Windows or a cron job in Linux/Unix in order to keep the updates directory current.

The update command is invoked from Ant in exactly the same way as the baseline command, only with a different target:

> ant download-updates
   -Dmedline.pwd=YourEmail
   -Dmedline.dir=TargetDir

As a command, the only difference is that it provides a value for -updatePath instead of for baselinePath; these are paths to the data on the FTP server. Thus updates are possible by invoking Java directly (all on one line):

java
-cp "build/classes;
    ../../../lingpipe-3.5.1.jar;
    ../../lib/commons-net-1.4.1.jar"
DownloadMedline
-domain=ftp.nlm.nih.gov
-user=anonymous
-password=YourEmail
-medlineDir=TargetDir
-updatePath=/nlmdata/.medlease/gz

The output is very similar; only the numbers are different (there are fewer files and they are much smaller):

Downloading MEDLINE
  Start time=Mon Jan 30 15:04:17 EST 2006
  Domain=ftp.nlm.nih.gov
  Baseline Path on Domain=null
  Update Path on Domain=/nlmdata/.medlease/gz
  MEDLINE Target Directory=e:\data\medline\2006
  User name=anonymous
  Password=carp@aliasi.com
  Max tries=8
Establishing FTP connection.
  Connecting to NLM
  Connected.
  Logging in.
Logged in to FTP Server.

Checking/Downloading Updates.
Reading from server path=/nlmdata/.medlease/gz
Writing to target directory=e:\data\medline\2006\updates
Number of existing files=0
Found server file names. Number of files=255

Downloading checksums.
Done downloading checksums.
Elapsed time=:19
Number of citation files, based on checksums=83

Checking existing files.
Finished existing file check. Progress: 0/83 Elapsed time=:19

Download new files.
medline06n0517.xml.gz OK Progress: 1/83 Elapsed time=:45
medline06n0518.xml.gz OK Progress: 2/83 Elapsed time=1:08
medline06n0519.xml.gz OK Progress: 3/83 Elapsed time=1:28
...
medline06n0599.xml.gz OK Progress: 83/83 Elapsed time=17:29

Total Time (HH:MM:SS)=17:29
MEDLINE DOWNLOAD COMPLETE.

The Code

The code for the example is in src/DownloadMedline.java. Most of the code is there for error-handling in the case of failed network or checksum situations. In the rest of this tutorial, we focus on the actual FTP components and computing the checksums.

Initializing the FTP Client

Other than handling parameters, the code that connects to the FTP server and logs in is quite simple:

import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPReply;
...
private FTPClient mFTPClient;
...

mFTPClient = new FTPClient();
mFTPClient.setDataTimeout(SIXTY_SECONDS_IN_MS);
mFTPClient.connect(mDomainName);
if (!FTPReply.isPositiveCompletion(mFTPClient.getReplyCode())) {
    System.out.println("FTP Server refused connection.");
    exitIncomplete();
}
mFTPClient.login(mUserName,mPassword);

Listing the Files

We next change the working directory to the one specified by the command:

mFTPClient.changeWorkingDirectory(path);

and set the file type to binary (after importing the constants class from Apache Commons Net):

import org.apache.commons.net.ftp.FTP;
...
mFTPClient.setFileType(FTP.BINARY_FILE_TYPE);

Listing the files is then a one-liner:

private String[] mServerFileNames;
...
mServerFileNames = mFTPClient.listNames();

Verifying Checksums

We iterate through the file names that end in .gz and check to see if the file exists. If it exists, we verify its checksum and consider it finished if it passes. If it fails, we delete it and schedule it for downloading.
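The check-then-schedule logic in the paragraph above can be sketched against a local directory with the JDK alone. The class DownloadPlanSketch and the map of expected checksums are hypothetical; the real program reads the expected values from .md5 files and streams the data rather than loading whole files into memory as this short version does.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: decide which .gz files still need downloading.
class DownloadPlanSketch {

    // Lowercase MD5 hex of a file; reads the whole file for brevity.
    static String md5Hex(File file) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(Files.readAllBytes(file.toPath()));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest)
                sb.append(String.format("%02x", b & 0xFF));
            return sb.toString();
        } catch (IOException | NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the names that must be downloaded. Local files that pass
    // their checksum are skipped; local files that fail are deleted.
    static List<String> filesToDownload(File dir,
                                        Map<String,String> expectedMd5ByName) {
        List<String> pending = new ArrayList<String>();
        for (Map.Entry<String,String> entry : expectedMd5ByName.entrySet()) {
            String name = entry.getKey();
            if (!name.endsWith(".gz")) continue;
            File local = new File(dir, name);
            if (local.isFile()) {
                if (md5Hex(local).equalsIgnoreCase(entry.getValue()))
                    continue;       // verified, nothing to do
                local.delete();     // corrupt or partial, fetch again
            }
            pending.add(name);
        }
        return pending;
    }
}
```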

Given a file containing a checksum (suffix .md5) and a matching file containing gzipped MEDLINE data (suffix .gz), the code to verify the checksum in method verifyChecksum(int,File) is:

String expectedChecksum = getExpectedMD5String(checksumFile);
String foundChecksum = getMD5HexString(testFile);
return expectedChecksum.equals(foundChecksum);

The first method, which computes the expected checksum, is a little tricky because NLM supplies checksums as hex strings in a formatted file whose format varies between the baseline and updates. Otherwise, it's just a matter of grabbing the relevant substring. The second method, which computes the found checksum, first gets the checksum as bytes, and then converts those to a hex string using a utility method in com.aliasi.util.Strings.

String getMD5HexString(File file) throws IOException {
    byte[] md5Bytes = getMD5Bytes(file);
    return Strings.bytesToHex(md5Bytes);
}
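One format-tolerant way to grab the expected checksum is to search the checksum file's text for a run of exactly 32 hex digits. The sketch below is not the tutorial's getExpectedMD5String; the class name is hypothetical, and the two layouts it is tried against (an "MD5(name)= hex" line and a "hex  name" line) are assumed common conventions rather than formats confirmed by this tutorial.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull a 32-digit MD5 hex string out of checksum text.
class ExpectedChecksumSketch {

    // Exactly 32 hex digits, not embedded in a longer hex run.
    private static final Pattern MD5_HEX = Pattern.compile(
        "(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])");

    // Returns the first MD5-looking substring, lowercased.
    static String expectedMd5(String checksumFileContent) {
        Matcher m = MD5_HEX.matcher(checksumFileContent);
        if (!m.find())
            throw new IllegalArgumentException("No MD5 hex string found");
        return m.group().toLowerCase(Locale.ROOT);
    }
}
```

Normalizing to lowercase keeps the later equals comparison from failing on case alone.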

The checksums themselves are computed using Java built-ins:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
...
byte[] getMD5Bytes(File file) throws IOException {
    MessageDigest digest = null;
    try {
        digest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
        throw new IOException("Couldn't find MD5 algorithm. Exception=" + e);
    }
    InputStream fileIn = null;
    byte[] buffer = new byte[1024];
    try {
        fileIn = new FileInputStream(file);
        int numRead;
        do {
            numRead = fileIn.read(buffer);
            if (numRead > 0) {
                digest.update(buffer, 0, numRead);
            }
        } while (numRead != -1);
        return digest.digest();
    } finally {
        Streams.closeInputStream(fileIn);
    }
}

This code first sets up a digest to compute the checksum, then feeds it buffered byte array slices read from file until there are no more bytes. It then returns the array of bytes making up the checksum by calling the digest() method.

The final conversion from bytes to hex values is easy with the following static method in com.aliasi.util.Strings:

public static String bytesToHex(byte[] bytes) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < bytes.length; ++i)
        sb.append(byteToHex(bytes[i]));
    return sb.toString();
}

public static String byteToHex(byte b) {
    String result = Integer.toHexString(Math.byteAsUnsigned(b));
    switch (result.length()) {
       case 0: return "00";
       case 1: return "0" + result;
       case 2: return result;
      default: throw new IllegalArgumentException("byteToHex(" + b + ")=" + result);
    }
}

Note that the helper function calls the standard java.lang.Integer.toHexString(int) method, but first uses LingPipe's com.aliasi.util.Math.byteAsUnsigned(byte) to treat the byte as unsigned (java.lang.Math has no such method; the unqualified Math resolves to the LingPipe class because Strings is in the same package). The rest of the method just performs the requisite padding to make sure each byte is converted to a two-character string.
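For readers without LingPipe on the classpath, the same conversion can be sketched with the JDK alone. Here the mask b & 0xFF is used as a substitute for the unsigned conversion, and the class name Md5HexSketch is hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical JDK-only version of the MD5-to-hex conversion.
class Md5HexSketch {

    // Converts one byte to exactly two lowercase hex digits.
    static String byteToHex(byte b) {
        int unsigned = b & 0xFF; // treat the byte as a value in 0..255
        String hex = Integer.toHexString(unsigned);
        return hex.length() == 1 ? "0" + hex : hex;
    }

    // MD5 digest of the data as a 32-character hex string.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder(2 * digest.length);
            for (byte b : digest)
                sb.append(byteToHex(b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Without the mask, a negative byte such as -1 would be sign-extended to an int and printed as eight hex digits instead of two.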

References