Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.

This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described in RegExp
- "maxDeterminizedStates" (optional; default 10000) is the limit on the total state count for the determinized automaton computed from the regexp
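For instance, a tokenizer that keeps only digit runs and sets the state limit explicitly could be declared as follows (a sketch; the pattern and limit value here are illustrative, not recommendations):

```xml
<tokenizer class="solr.SimplePatternTokenizerFactory"
           pattern="[0-9]+"
           maxDeterminizedStates="10000"/>
```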
The pattern matches the characters to include in a token (not the split characters), and the
matching is greedy such that the longest token matching at a given point is created. Empty
tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
  </analyzer>
</fieldType>
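As a rough illustration of the greedy, longest-match behavior, Python's `re.findall` with the same character class yields the equivalent token stream. This is an analogy only: Solr applies Lucene's automaton-based matching, not Python's `re` engine, and the input string below is made up for the example.

```python
import re

# Same pattern as the fieldType above: one or more characters that are
# not space, tab, carriage return, or newline. findall returns every
# maximal (greedy) run of such characters, and never an empty token.
pattern = r"[^ \t\r\n]+"

text = "one two\tthree\nfour"
tokens = re.findall(pattern, text)
print(tokens)  # ['one', 'two', 'three', 'four']
```

Note that because the pattern describes the characters to *keep*, consecutive delimiters produce no empty tokens, matching the tokenizer's documented behavior.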