Class SimplePatternTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.pattern.SimplePatternTokenizer
- All Implemented Interfaces:
Closeable,AutoCloseable
This tokenizer uses a Lucene RegExp or (expert usage) a pre-built determinized Automaton to locate tokens.
The regexp syntax is more limited than that of PatternTokenizer, but tokenization is considerably faster.
The provided regex should match valid token characters, not token separator characters as a
String.split pattern would. Matching is greedy: the longest match at a given start point becomes
the next token. Empty string tokens are never produced.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
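The difference between matching token characters and matching separators can be sketched with plain `java.util.regex`. This is only an illustration of the idea: it does not use Lucene's RegExp syntax or this tokenizer, and the class and method names (`TokenCharsVsSeparators`, `matchTokens`, `splitOnSeparators`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenCharsVsSeparators {

    // Collects every non-empty match of a pattern that describes TOKEN characters.
    // This is the style of pattern the tokenizer expects.
    static List<String> matchTokens(String tokenRegex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            if (m.end() > m.start()) { // skip empty matches
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    // Splits on a pattern that describes SEPARATOR characters, as String.split does.
    static List<String> splitOnSeparators(String separatorRegex, String text) {
        return Arrays.asList(text.split(separatorRegex));
    }

    public static void main(String[] args) {
        String text = ", fast, simple tokens";
        // Token-character pattern: one or more lowercase letters.
        System.out.println(matchTokens("[a-z]+", text));
        // Separator pattern: one or more non-letters; note the leading empty string.
        System.out.println(splitOnSeparators("[^a-z]+", text));
    }
}
```

With a token-character pattern, leading separators simply produce no token, whereas `String.split` returns a leading empty string for the same input.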
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructors
Constructor: Description
SimplePatternTokenizer(String regexp): See RegExp for the accepted syntax.
SimplePatternTokenizer(AttributeFactory factory, String regexp, int determinizeWorkLimit): See RegExp for the accepted syntax.
SimplePatternTokenizer(AttributeFactory factory, Automaton dfa): Runs a pre-built automaton.
Method Summary
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Constructor Details
Method Details
incrementToken
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
end
- Overrides:
end in class TokenStream
- Throws:
IOException
reset
- Overrides:
reset in class Tokenizer
- Throws:
IOException
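The methods above are normally called in the standard TokenStream order: setReader, reset, incrementToken until it returns false, end, then close. The following is a minimal sketch of that sequence, assuming the Lucene core and analysis-common artifacts are on the classpath; the class name `SimplePatternTokenizerExample` and the helper `tokenize` are hypothetical, and the pattern `[a-z]+` is only an illustrative choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SimplePatternTokenizerExample {

    // Hypothetical helper: returns the tokens that the given regexp matches in text.
    static List<String> tokenize(String regexp, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        // Tokenizer implements Closeable, so try-with-resources calls close().
        try (SimplePatternTokenizer tokenizer = new SimplePatternTokenizer(regexp)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                  // must be called before incrementToken()
            while (tokenizer.incrementToken()) {
                tokens.add(term.toString());    // copy the term; the attribute is reused
            }
            tokenizer.end();                    // finalizes end-of-stream state
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // The regexp describes token characters, here runs of lowercase letters.
        System.out.println(tokenize("[a-z]+", "fast, simple tokens"));
    }
}
```

Copying the term text on each iteration matters because the same attribute instance is overwritten by the next call to incrementToken.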