Package org.apache.lucene.tests.analysis
Class MockTokenizer
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
        org.apache.lucene.tests.analysis.MockTokenizer

All Implemented Interfaces:
  Closeable, AutoCloseable
public class MockTokenizer extends Tokenizer
Tokenizer for testing. This tokenizer is a replacement for the WHITESPACE, SIMPLE, and KEYWORD tokenizers. If you are writing a component such as a TokenFilter, it's a great idea to test it wrapping this tokenizer instead, for extra checks. This tokenizer has the following behavior:
- An internal state machine is used for checking consumer consistency. These checks can be disabled with setEnableChecks(boolean).
- For convenience, optionally lowercases terms that it outputs.
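As a sketch of the wrapping pattern described above, the example below runs a TokenFilter (LowerCaseFilter is used here purely as a stand-in for the filter under test) on top of a MockTokenizer. It assumes lucene-core and lucene-test-framework are on the classpath; the class and method names are illustrative, not part of the API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class MockTokenizerFilterExample {
  /** Runs text through MockTokenizer + the filter under test, collecting terms. */
  static List<String> analyze(String text) throws IOException {
    // MockTokenizer stands in for WhitespaceTokenizer and adds
    // consumer-consistency checks around the wrapped filter.
    MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
    tokenizer.setReader(new StringReader(text));

    TokenStream stream = new LowerCaseFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    List<String> terms = new ArrayList<>();

    stream.reset();                    // consumers must call reset() first
    while (stream.incrementToken()) {  // one term per call
      terms.add(term.toString());
    }
    stream.end();                      // records end-of-stream state
    stream.close();
    return terms;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(analyze("Hello MOCK Tokenizer"));
  }
}
```

If the filter (or the test consuming it) violates the TokenStream contract, MockTokenizer's internal checks will flag it, which is the extra safety the class description refers to.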
-
-
Field Summary
Fields
- static int DEFAULT_MAX_TOKEN_LENGTH: Limit the default token length to a size that doesn't cause random analyzer failures on unpredictable data like the enwiki data set.
- static CharacterRunAutomaton KEYWORD: Acts similar to KeywordTokenizer.
- static CharacterRunAutomaton SIMPLE: Acts like LetterTokenizer.
- static CharacterRunAutomaton WHITESPACE: Acts similar to WhitespaceTokenizer.
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
- MockTokenizer()
- MockTokenizer(AttributeFactory factory)
- MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase)
- MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
- MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase)
- MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
-
Method Summary
All Methods / Instance Methods / Concrete Methods
- void close()
- void end()
- boolean incrementToken()
- protected boolean isTokenChar(int c)
- protected int normalize(int c)
- protected int readChar()
- protected int readCodePoint()
- void reset()
- void setEnableChecks(boolean enableChecks): Toggle consumer workflow checking: if your test consumes tokenstreams normally you should leave this enabled.
- protected void setReaderTestPoint()
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
WHITESPACE
public static final CharacterRunAutomaton WHITESPACE
Acts similar to WhitespaceTokenizer.
-
KEYWORD
public static final CharacterRunAutomaton KEYWORD
Acts similar to KeywordTokenizer. TODO: Keyword returns an "empty" token for an empty reader...
-
SIMPLE
public static final CharacterRunAutomaton SIMPLE
Acts like LetterTokenizer.
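The three predefined automata above select quite different tokenizations of the same input. The sketch below compares them; it assumes lucene-core and lucene-test-framework on the classpath, and the helper class and method are illustrative names, not part of the API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;

public class AutomatonChoiceExample {
  /** Tokenizes text with a MockTokenizer driven by the given automaton. */
  static List<String> tokenize(CharacterRunAutomaton automaton, String text)
      throws IOException {
    MockTokenizer tok = new MockTokenizer(automaton, false);
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    List<String> out = new ArrayList<>();
    tok.reset();
    while (tok.incrementToken()) {
      out.add(term.toString());
    }
    tok.end();
    tok.close();
    return out;
  }

  public static void main(String[] args) throws IOException {
    String text = "foo-bar baz42";
    System.out.println(tokenize(MockTokenizer.WHITESPACE, text)); // whitespace-separated chunks
    System.out.println(tokenize(MockTokenizer.SIMPLE, text));     // letter runs only
    System.out.println(tokenize(MockTokenizer.KEYWORD, text));    // whole input as one token
  }
}
```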
-
DEFAULT_MAX_TOKEN_LENGTH
public static final int DEFAULT_MAX_TOKEN_LENGTH
Limit the default token length to a size that doesn't cause random analyzer failures on unpredictable data like the enwiki data set. This value defaults to CharTokenizer.DEFAULT_MAX_WORD_LEN (255).
- See Also:
  "https://issues.apache.org/jira/browse/LUCENE-10541", Constant Field Values
-
-
Constructor Detail
-
MockTokenizer
public MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
-
MockTokenizer
public MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
-
MockTokenizer
public MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase)
-
MockTokenizer
public MockTokenizer()
-
MockTokenizer
public MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase)
-
MockTokenizer
public MockTokenizer(AttributeFactory factory)
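The overloads above differ mainly in whether you supply an AttributeFactory and whether you cap the token length. A minimal sketch of the two most common forms follows, assuming lucene-core and lucene-test-framework on the classpath; the helper class and method names are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class ConstructorOverloadExample {
  /** Drains a tokenizer over the given text and collects its terms. */
  static List<String> collect(MockTokenizer tok, String text) throws IOException {
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    List<String> out = new ArrayList<>();
    tok.reset();
    while (tok.incrementToken()) {
      out.add(term.toString());
    }
    tok.end();
    tok.close();
    return out;
  }

  public static void main(String[] args) throws IOException {
    // Common two-arg form: pick an automaton and whether to lowercase output.
    MockTokenizer basic = new MockTokenizer(MockTokenizer.WHITESPACE, true);
    System.out.println(collect(basic, "Quick BROWN Fox"));

    // The maxTokenLength overload caps the length of each emitted term
    // (DEFAULT_MAX_TOKEN_LENGTH applies when this overload is not used).
    MockTokenizer capped = new MockTokenizer(MockTokenizer.WHITESPACE, false, 4);
    System.out.println(collect(capped, "tokenization"));
  }
}
```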
-
-
Method Detail
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by:
  incrementToken in class TokenStream
- Throws:
  IOException
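incrementToken() is where the internal state machine enforces the TokenStream consumer contract. The sketch below shows the expected lifecycle, including reuse of one tokenizer instance across documents; it assumes lucene-core and lucene-test-framework on the classpath, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class ConsumerWorkflowExample {
  /** Runs several documents through one reused tokenizer instance. */
  static List<List<String>> analyzeAll(String... docs) throws IOException {
    MockTokenizer tok = new MockTokenizer(MockTokenizer.WHITESPACE, true);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    List<List<String>> all = new ArrayList<>();
    for (String doc : docs) {
      // Expected order per document:
      // setReader -> reset -> incrementToken* -> end -> close.
      List<String> terms = new ArrayList<>();
      tok.setReader(new StringReader(doc));
      tok.reset();
      while (tok.incrementToken()) {
        terms.add(term.toString());
      }
      tok.end();
      tok.close(); // after close(), setReader() may start the next cycle
      all.add(terms);
    }
    return all;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(analyzeAll("First Doc", "Second Doc"));
  }
}
```

Deviating from this order (e.g. calling incrementToken() before reset()) is exactly the kind of misuse the consistency checks are designed to surface.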
-
readCodePoint
protected int readCodePoint() throws IOException
- Throws:
  IOException
-
readChar
protected int readChar() throws IOException
- Throws:
  IOException
-
isTokenChar
protected boolean isTokenChar(int c)
-
normalize
protected int normalize(int c)
-
reset
public void reset() throws IOException
- Overrides:
  reset in class Tokenizer
- Throws:
  IOException
-
close
public void close() throws IOException
- Specified by:
  close in interface AutoCloseable
- Specified by:
  close in interface Closeable
- Overrides:
  close in class Tokenizer
- Throws:
  IOException
-
setReaderTestPoint
protected void setReaderTestPoint()
- Overrides:
  setReaderTestPoint in class Tokenizer
-
end
public void end() throws IOException
- Overrides:
  end in class TokenStream
- Throws:
  IOException
-
setEnableChecks
public void setEnableChecks(boolean enableChecks)
Toggle consumer workflow checking: if your test consumes tokenstreams normally you should leave this enabled.
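A short sketch of when disabling the checks is appropriate, assuming lucene-core and lucene-test-framework on the classpath; the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.tests.analysis.MockTokenizer;

public class DisableChecksExample {
  /** Consumes a stream with workflow checking turned off, returning the token count. */
  static int countTokens(String text) throws IOException {
    MockTokenizer tok = new MockTokenizer(MockTokenizer.WHITESPACE, true);
    // Disable the workflow state machine only when the test itself needs to
    // misuse the stream (e.g. to exercise a consumer's error handling);
    // otherwise leave checks enabled so contract violations fail loudly.
    tok.setEnableChecks(false);
    tok.setReader(new StringReader(text));
    int count = 0;
    tok.reset();
    while (tok.incrementToken()) {
      count++;
    }
    tok.end();
    tok.close();
    return count;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countTokens("some text"));
  }
}
```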
-
-