public class MedlineSentenceModel extends HeuristicSentenceModel implements Serializable
MedlineSentenceModel is a heuristic sentence model designed for operating over biomedical research abstracts as found in MEDLINE.
The MEDLINE model assumes that parentheses are balanced as
defined in the class documentation for
HeuristicSentenceModel. It also assumes the final token is a
sentence boundary, overriding any other possible checks. This is
set because there are many truncated MEDLINE abstracts, and this
ensures that every token falls within a sentence in the result.
The sets required by the superclass constructor
determine which tokens are possible sentence stops, which are
disallowed before stops, and which are disallowed as starts. These
three sets are:
Impossible Penultimates
  some scientific and publishing terms
  personal/professional titles/suffixes
  months, times
  corporate designators
  common abbreviations
  back quotes, commas
Impossible Sentence Starts
  possible stops (see above)
  close parens, brackets, braces
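As a rough illustration of how three such sets can drive boundary detection, here is a minimal sketch. It is not the HeuristicSentenceModel implementation: the class name, the method name, and the tiny example sets are all hypothetical, and the real model applies further checks (whitespace, parenthesis balancing, and the start-token test). The sketch only shows the set lookups plus the forced final stop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of the library described here.
public final class HeuristicBoundarySketch {

    // Assumed example sets, far smaller than a real model would use.
    static final Set<String> POSSIBLE_STOPS =
        new HashSet<>(Arrays.asList(".", "!", "?"));
    static final Set<String> IMPOSSIBLE_PENULTIMATES =
        new HashSet<>(Arrays.asList("Dr", "Fig", "Jan", "Inc", ","));
    static final Set<String> IMPOSSIBLE_STARTS =
        new HashSet<>(Arrays.asList(".", "!", "?", ")", "]", "}"));

    private HeuristicBoundarySketch() { }

    /** Returns the indices of tokens treated as sentence-final. */
    public static List<Integer> boundaries(String[] tokens) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == tokens.length - 1) {
                // Forced final stop: the last token always ends a sentence,
                // so truncated text still falls inside some sentence.
                result.add(i);
                continue;
            }
            if (!POSSIBLE_STOPS.contains(tokens[i])) {
                continue; // not a candidate stop token
            }
            if (i > 0 && IMPOSSIBLE_PENULTIMATES.contains(tokens[i - 1])) {
                continue; // e.g. the period after an abbreviation
            }
            if (IMPOSSIBLE_STARTS.contains(tokens[i + 1])) {
                continue; // the next token cannot open a sentence
            }
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"See", "Fig", ".", "2", ".", "It", "was", "truncated"};
        System.out.println(boundaries(tokens));
    }
}
```

In this example the period after "Fig" is rejected because "Fig" is in the assumed penultimate set, while the second period and the final token "truncated" are both reported as boundaries.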
This class overrides the default implementation of the possible
start token method to allow a sentence start to be any sequence of
tokens uninterrupted by spaces that contains a non-lowercase letter
character. This behavior is described with examples in the
documentation of the implementing method.
INSTANCE may be used anywhere a MEDLINE sentence model is needed.
|Modifier and Type||Field and Description|
A single instance which may be used anywhere a MEDLINE sentence model is needed.
|Constructor and Description|
Construct a MEDLINE sentence model.
|Modifier and Type||Method and Description|
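Methods inherited from class HeuristicSentenceModel: balanceParens, boundaryIndices, forceFinalStop
Methods inherited from class AbstractSentenceModel: boundaryIndices, boundaryIndices, verifyBounds, verifyTokensWhitespaces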
public static final MedlineSentenceModel INSTANCE
Returns true if the specified start index can be a sentence start in the specified array of tokens and whitespaces running up to the end token.
For MEDLINE, this implementation returns true
if the sequence of contiguous tokens starting with the
specified token contains an uppercase or digit character. Each
token is considered, beginning with the specified start token
and continuing through all tokens that are not separated by
non-empty whitespace, up to the token with the end index minus
one. If any of the tokens contains an uppercase or digit
character, then the result is true; otherwise,
the result is false.
For example, if the first token is "Therefore", then it can be a sentence start because it contains the non-lowercase letter "T". Similarly, the token "pH" can be a sentence start, as can "p53", because they contain the non-lowercase characters "H" and "5" respectively. If the underlying sequence is " correlation. p-53 was ...", then it is split into parallel arrays of whitespaces and tokens, indexed from 0.
Here, "p" is a valid sentence start token even though it is only a single lowercase character, because it is followed by a hyphen ("-") with no intervening whitespace, so the contiguous run continues through "53", which contains digits. By way of contrast, the first token "and" in the sequence "and Foo" can't start a sentence because it is separated from the following token by a non-empty whitespace, and "and" itself contains no uppercase or digit character. Recall that the whitespace with the same index as a token precedes the token.
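The rule above can be sketched as a small standalone method. This is an illustrative reimplementation under stated assumptions, not the library's source: the class name PossibleStartSketch is hypothetical, the parameter order is assumed, and the tokenization used in main (splitting "p-53" into "p", "-", "53") is an assumption about how the tokenizer behaves.

```java
// Hypothetical sketch of the start-token test described above.
public final class PossibleStartSketch {

    private PossibleStartSketch() { }

    /**
     * Returns true if the run of tokens beginning at start, continuing
     * through tokens that are not preceded by non-empty whitespace and
     * stopping before end, contains an uppercase letter or a digit.
     * whitespaces[i] is the whitespace that precedes tokens[i].
     */
    public static boolean possibleStart(String[] tokens, String[] whitespaces,
                                        int start, int end) {
        for (int i = start; i < end; i++) {
            // Non-empty whitespace before a later token ends the run.
            if (i > start && !whitespaces[i].isEmpty()) {
                break;
            }
            if (containsUpperOrDigit(tokens[i])) {
                return true;
            }
        }
        return false;
    }

    private static boolean containsUpperOrDigit(String token) {
        for (int j = 0; j < token.length(); j++) {
            char c = token.charAt(j);
            if (Character.isUpperCase(c) || Character.isDigit(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed tokenization of " correlation. p-53 was".
        String[] tokens = {"correlation", ".", "p", "-", "53", "was"};
        String[] whitespaces = {" ", "", " ", "", "", " ", ""};
        // "p" at index 2 starts the contiguous run p, -, 53.
        System.out.println(possibleStart(tokens, whitespaces, 2, tokens.length));

        String[] tokens2 = {"and", "Foo"};
        String[] whitespaces2 = {"", " ", ""};
        // "and" is cut off from "Foo" by the space before "Foo".
        System.out.println(possibleStart(tokens2, whitespaces2, 0, tokens2.length));
    }
}
```

Under these assumptions the first call reports that "p" can start a sentence, because the run reaches "53", and the second reports that "and" cannot.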