java.lang.Object
  com.aliasi.tokenizer.RegExTokenizerFactory
public class RegExTokenizerFactory
A RegExTokenizerFactory creates a tokenizer factory
out of a regular expression. The regular expression is presented
as an instance of Pattern and matching is carried out with
the java.util.regex package. The pattern provided when the
factory is constructed is used to create instances of Matcher for use in tokenizers. The method Matcher.find(int) is called to find the next token in an input
sequence.
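The matching loop described above can be sketched with java.util.regex alone. This is a minimal illustration, not the library's implementation: the class name FindLoopSketch, the method tokens and the sample pattern \w+ are assumptions made for the example. It shows how repeated calls to Matcher.find(int) walk through a character slice, each call starting where the previous match ended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of com.aliasi.tokenizer.
final class FindLoopSketch {

    // Collects every match of the pattern in the slice cs[start, start + length).
    static List<String> tokens(Pattern pattern, char[] cs, int start, int length) {
        CharSequence input = new String(cs, start, length);
        Matcher matcher = pattern.matcher(input);
        List<String> result = new ArrayList<>();
        int pos = 0;
        // find(int) resets the matcher and searches from index pos onward.
        while (matcher.find(pos)) {
            if (matcher.end() == matcher.start()) {
                // An empty match would never advance pos, so reject it.
                throw new IllegalArgumentException("pattern matched the empty string");
            }
            result.add(matcher.group());
            pos = matcher.end();
        }
        return result;
    }

    public static void main(String[] args) {
        char[] cs = "to be or not".toCharArray();
        System.out.println(tokens(Pattern.compile("\\w+"), cs, 0, cs.length));
    }
}
```

Anything the pattern skips over between two matches is, in effect, the whitespace between the two tokens.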
For instance, consider a regular expression which takes a token to be a sequence of alphabetic characters, a sequence of numeric characters, or a single non-alphanumeric character:
[a-zA-Z]+|[0-9]+|\S
This can be used to construct a tokenizer factory:
String regex = "[a-zA-Z]+|[0-9]+|\\S";
TokenizerFactory tf = new RegExTokenizerFactory(regex);
char[] cs = "abc de 123. ".toCharArray();
Tokenizer tokenizer = tf.tokenizer(cs,0,cs.length);
Note that the backslash character (\) in the Java string regex is
itself escaped with a backslash, resulting in \\. The regular
expression has no spaces within any of its disjuncts because
matched tokens should not contain whitespace. Finally, note
the use of Kleene plus (+) rather than Kleene star
(*) to ensure that tokens are at least one
character long. In fact, the constructor will throw an exception
if the pattern matches the empty string.
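To see why Kleene star is a problem, one can test whether a candidate expression matches the empty string before handing it to the factory. The sketch below uses only java.util.regex; the class EmptyMatchSketch and the method matchesEmpty are hypothetical names, and this is not necessarily how the constructor performs its own check.

```java
import java.util.regex.Pattern;

// Hypothetical pre-check, not part of com.aliasi.tokenizer.
final class EmptyMatchSketch {

    // True if the regular expression matches the empty string as a whole.
    static boolean matchesEmpty(String regex) {
        return Pattern.compile(regex).matcher("").matches();
    }

    public static void main(String[] args) {
        // Kleene plus: every disjunct consumes at least one character.
        System.out.println(matchesEmpty("[a-zA-Z]+|[0-9]+|\\S"));
        // Kleene star: the first disjunct can match zero characters.
        System.out.println(matchesEmpty("[a-zA-Z]*|[0-9]+|\\S"));
    }
}
```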
The tokenizer above will return the following tokens, whitespaces and character offsets:
whitespaces: "", " ", " ", "", " "
tokens: "abc", "de", "123", "."
token starts: 0, 4, 7, 10
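The three lists above can be reproduced with plain java.util.regex, which may help when checking a new expression without the library on the classpath. This is a sketch under stated assumptions: the class OffsetSketch and its fields are hypothetical, and whitespace is taken to be whatever text lies between consecutive matches (plus the text before the first match and after the last one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of com.aliasi.tokenizer.
final class OffsetSketch {
    final List<String> tokens = new ArrayList<>();
    final List<String> whitespaces = new ArrayList<>();
    final List<Integer> starts = new ArrayList<>();

    OffsetSketch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int prevEnd = 0;
        while (m.find()) {
            // Text skipped since the previous token counts as whitespace.
            whitespaces.add(text.substring(prevEnd, m.start()));
            tokens.add(m.group());
            starts.add(m.start());
            prevEnd = m.end();
        }
        // Trailing text after the last token is the final whitespace.
        whitespaces.add(text.substring(prevEnd));
    }

    public static void main(String[] args) {
        OffsetSketch s = new OffsetSketch("[a-zA-Z]+|[0-9]+|\\S", "abc de 123. ");
        System.out.println("whitespaces: " + s.whitespaces);
        System.out.println("tokens: " + s.tokens);
        System.out.println("token starts: " + s.starts);
    }
}
```

There is always one more whitespace than there are tokens, which is why the first list above has five entries and the second has four.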
Serialization uses standard Java serialization; a serialized factory deserializes to an instance of this class.
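A standard serialization round trip looks roughly like the sketch below. To stay self-contained it serializes a java.util.regex.Pattern (which is itself Serializable) as a stand-in for the factory; the class SerializationSketch and the method roundTrip are hypothetical names, and the same helper is assumed to work for any Serializable object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.regex.Pattern;

// Hypothetical helper, not part of com.aliasi.tokenizer.
final class SerializationSketch {

    // Writes the object to an in-memory buffer and reads it back.
    static Object roundTrip(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Pattern original = Pattern.compile("[a-zA-Z]+|[0-9]+|\\S");
        Pattern copy = (Pattern) roundTrip(original);
        System.out.println(copy.pattern().equals(original.pattern()));
    }
}
```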
| Constructor Summary | |
|---|---|
| RegExTokenizerFactory(Pattern pattern) | Construct a regular expression tokenizer factory with the specified pattern for matching. |
| RegExTokenizerFactory(String regex) | Construct a regular expression tokenizer factory using the specified regular expression for matching. |
| RegExTokenizerFactory(String regex, int flags) | Construct a regular expression tokenizer factory using the specified regular expression for matching according to the specified flags. |
| Method Summary | |
|---|---|
| void | compileTo(ObjectOutput objOut): Compile this object to the specified object output. |
| Pattern | pattern(): Returns the regular expression pattern backing this tokenizer factory. |
| Tokenizer | tokenizer(char[] cs, int start, int length): Returns a tokenizer for the specified subsequence of characters. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
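public RegExTokenizerFactory(String regex)
Construct a regular expression tokenizer factory using the specified regular expression for matching.
Parameters:
regex - The regular expression.
Throws:
PatternSyntaxException - If the expression's syntax is invalid.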
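public RegExTokenizerFactory(String regex,
int flags)
Construct a regular expression tokenizer factory using the specified regular expression for matching according to the specified flags. The flags are the bitwise or ("|") of the
following flags: Pattern.CASE_INSENSITIVE, Pattern.MULTILINE, Pattern.DOTALL, Pattern.UNICODE_CASE and Pattern.CANON_EQ.
See Pattern.compile(String,int) for more information.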
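As a rough illustration of how match flags combine and how a syntax error surfaces, the sketch below calls Pattern.compile(String,int) directly. The class FlagsSketch and its methods are hypothetical names introduced for the example; it is assumed, but not confirmed here, that the factory constructor behaves the same way as the underlying Pattern.compile call.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical illustration, not part of com.aliasi.tokenizer.
final class FlagsSketch {

    // True if the whole text matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    // True if the regex compiles; false if its syntax is invalid.
    static boolean hasValidSyntax(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Flags are combined with bitwise or.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matches("[a-z]+", flags, "ABC"));
        System.out.println(matches("[a-z]+", 0, "ABC"));
        // An unclosed character class is a syntax error.
        System.out.println(hasValidSyntax("[a-z"));
    }
}
```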
Parameters:
regex - The regular expression.
flags - The match flags.
Throws:
PatternSyntaxException - If the expression's syntax is invalid.
IllegalArgumentException - If bit values other than those corresponding to defined match flags are set in the flags.
public RegExTokenizerFactory(Pattern pattern)
Construct a regular expression tokenizer factory with the specified pattern for matching.
Parameters:
pattern - Pattern to use for matching.
| Method Detail |
|---|
public Pattern pattern()
Returns the regular expression pattern backing this tokenizer factory.
public Tokenizer tokenizer(char[] cs,
int start,
int length)
Returns a tokenizer for the specified subsequence of characters.
Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.
public void compileTo(ObjectOutput objOut)
throws IOException
Compile this object to the specified object output.
Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this object is compiled.
Throws:
IOException - If there is an I/O error compiling the object.