Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.

This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described in RegExp
- "maxDeterminizedStates" (optional; default 10000) is the limit on the total state count for the determinized automaton computed from the regexp
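For instance, a tokenizer that keeps only digit runs and sets the state limit explicitly could be declared as follows (a sketch; the pattern and limit value here are illustrative, not recommendations):

```xml
<tokenizer class="solr.SimplePatternTokenizerFactory"
           pattern="[0-9]+"
           maxDeterminizedStates="10000"/>
```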
The pattern matches the characters to include in a token (not the split characters), and the
matching is greedy such that the longest token matching at a given point is created. Empty
tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
  </analyzer>
</fieldType>
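As a rough illustration of the greedy, longest-match behavior, Python's `re.findall` with the same character class yields the equivalent token stream. This is an analogy only: Solr applies Lucene's automaton-based matching, not Python's `re` engine, and the input string below is made up for the example.

```python
import re

# Same pattern as the fieldType above: one or more characters that are
# not space, tab, carriage return, or newline. findall returns every
# maximal (greedy) run of such characters, and never an empty token.
pattern = r"[^ \t\r\n]+"

text = "one two\tthree\nfour"
tokens = re.findall(pattern, text)
print(tokens)  # ['one', 'two', 'three', 'four']
```

Note that because the pattern describes the characters to *keep*, consecutive delimiters produce no empty tokens, matching the tokenizer's documented behavior.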