Class RegExp
Regular Expression extension to Automaton.
Regular expressions are built from the following abstract syntax:
regexp        ::=  unionexp
unionexp      ::=  interexp | unionexp                          (union)
               |   interexp
interexp      ::=  concatexp & interexp                         (intersection)  [OPTIONAL]
               |   concatexp
concatexp     ::=  repeatexp concatexp                          (concatenation)
               |   repeatexp
repeatexp     ::=  repeatexp ?                                  (zero or one occurrence)
               |   repeatexp *                                  (zero or more occurrences)
               |   repeatexp +                                  (one or more occurrences)
               |   repeatexp {n}                                (n occurrences)
               |   repeatexp {n,}                               (n or more occurrences)
               |   repeatexp {n,m}                              (n to m occurrences, including both)
               |   complexp
charclassexp  ::=  [ charclasses ]                              (character class)
               |   [^ charclasses ]                             (negated character class)
               |   simpleexp
charclasses   ::=  charclass charclasses
               |   charclass
charclass     ::=  charexp - charexp                            (character range, including end-points)
               |   charexp
simpleexp     ::=  charexp
               |   .                                            (any single character)
               |   #                                            (the empty language)  [OPTIONAL]
               |   @                                            (any string)  [OPTIONAL]
               |   " <Unicode string without double-quotes> "   (a string)
               |   ( )                                          (the empty string)
               |   ( unionexp )                                 (precedence override)
               |   < <identifier> >                             (named automaton)  [OPTIONAL]
               |   <n-m>                                        (numerical interval)  [OPTIONAL]
charexp       ::=  <Unicode character>                          (a single non-reserved character)
               |   \d                                           (a digit [0-9])
               |   \D                                           (a non-digit [^0-9])
               |   \s                                           (whitespace [ \t\n\r])
               |   \S                                           (non whitespace [^\s])
               |   \w                                           (a word character [a-zA-Z_0-9])
               |   \W                                           (a non word character [^\w])
               |   \ <Unicode character>                        (a single character)
The productions marked [OPTIONAL] are only allowed if specified by the syntax
flags passed to the RegExp constructor. The reserved characters used in the
(enabled) syntax must be escaped with backslash (\) or double-quotes ("...").
(In contrast to other regexp syntaxes, this is required also in character
classes.) Be aware that dash (-) has a special meaning in charclass
expressions. An identifier is a string not containing right angle bracket (>)
or dash (-). Numerical intervals are specified by non-negative decimal integers
and include both end points, and if n and m have the same number of digits,
then the conforming strings must have that length (i.e. prefixed by 0's).
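The zero-padding rule for numerical intervals is easy to misread, so the following is a minimal stand-alone sketch of it in plain Java. It does not call RegExp: `IntervalSketch` and `conforms` are hypothetical names, the bounds are assumed to satisfy n <= m, and the treatment of bounds with different digit counts (any decimal spelling of an in-range value is accepted) is an assumption, since the text above only specifies the equal-width case.

```java
import java.math.BigInteger;

/** Stand-alone model of the <n-m> rule described above; not part of the RegExp API. */
final class IntervalSketch {

  /**
   * Returns true if candidate conforms to the interval written as <nText-mText>.
   * Assumes nText and mText are non-negative decimal integers with n <= m.
   */
  static boolean conforms(String nText, String mText, String candidate) {
    // Only non-empty strings of ASCII digits can denote a number in the interval.
    if (candidate.isEmpty() || !candidate.chars().allMatch(ch -> ch >= '0' && ch <= '9')) {
      return false;
    }
    // Equal-width bounds fix the length: shorter values must be padded with 0's.
    if (nText.length() == mText.length() && candidate.length() != nText.length()) {
      return false;
    }
    // Assumption: with bounds of different widths, only the numeric value matters.
    BigInteger value = new BigInteger(candidate);
    return value.compareTo(new BigInteger(nText)) >= 0
        && value.compareTo(new BigInteger(mText)) <= 0;
  }

  public static void main(String[] args) {
    System.out.println(conforms("01", "31", "07")); // true: two digits, value 7
    System.out.println(conforms("01", "31", "7"));  // false: wrong length
  }
}
```

Under this model, an interval written as `<01-31>` accepts `07` but rejects `7`, because both bounds are two digits wide.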
WARNING: This API is experimental and might change in incompatible ways in the next release.
Nested Class Summary
Nested Classes
| Modifier and Type | Class | Description |
| --- | --- | --- |
| static enum | RegExp.Kind | The type of expression represented by a RegExp node. |
Field Summary
Fields
| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | ALL | Syntax flag, enables all optional regexp syntax. |
| static final int | ANYSTRING | Syntax flag, enables anystring (@). |
| static final int | ASCII_CASE_INSENSITIVE | Deprecated. |
| static final int | AUTOMATON | Syntax flag, enables named automata (<identifier>). |
| final int | c | Character expression |
| static final int | CASE_INSENSITIVE | Allows case-insensitive matching of most Unicode characters. |
| static final int | CASE_INSENSITIVE_RANGE | Similar to CASE_INSENSITIVE but for character class ranges. |
| static final int | DEPRECATED_COMPLEMENT | Deprecated. This method will be removed in Lucene 11 |
| final int | digits | Limits for repeatable type expressions |
| static final int | EMPTY | Syntax flag, enables empty language (#). |
| final RegExp | exp1 | Child expressions held by a container type expression |
| final RegExp | exp2 | Child expressions held by a container type expression |
| final int[] | from | Extents for range type expressions |
| static final int | INTERSECTION | Syntax flag, enables intersection (&). |
| static final int | INTERVAL | Syntax flag, enables numerical intervals (<n-m>). |
| final RegExp.Kind | kind | The type of expression |
| final int | max | Limits for repeatable type expressions |
| final int | min | Limits for repeatable type expressions |
| static final int | NONE | Syntax flag, enables no optional regexp syntax. |
| final String | s | String expression |
| final int[] | to | Extents for range type expressions |
Constructor Summary
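Constructors
| Constructor | Description |
| --- | --- |
| RegExp(String s) | Constructs new RegExp from a string. |
| RegExp(String s, int syntax_flags) | Constructs new RegExp from a string. |
| RegExp(String s, int syntax_flags, int match_flags) | Constructs new RegExp from a string. |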
Method Summary
| Method | Description |
| --- | --- |
| getIdentifiers() | Returns set of automaton identifiers that occur in this regular expression. |
| getOriginalString() | The string that was used to construct the regex. |
| toAutomaton() | Constructs new Automaton from this RegExp. |
| toAutomaton(Map<String, Automaton> automata) | Constructs new Automaton from this RegExp. |
| toAutomaton(AutomatonProvider automaton_provider) | Constructs new Automaton from this RegExp. |
| toString() | Constructs string from parsed regular expression. |
| toStringTree() | Like toString, but more verbose (shows the hierarchy more clearly). |
Field Details
INTERSECTION
public static final int INTERSECTION
Syntax flag, enables intersection (&).

EMPTY
public static final int EMPTY
Syntax flag, enables empty language (#).

ANYSTRING
public static final int ANYSTRING
Syntax flag, enables anystring (@).

AUTOMATON
public static final int AUTOMATON
Syntax flag, enables named automata (<identifier>).

INTERVAL
public static final int INTERVAL
Syntax flag, enables numerical intervals (<n-m>).

ALL
public static final int ALL
Syntax flag, enables all optional regexp syntax.

NONE
public static final int NONE
Syntax flag, enables no optional regexp syntax.

ASCII_CASE_INSENSITIVE
Deprecated.
Allows case-insensitive matching of ASCII characters.
This flag has been deprecated in favor of CASE_INSENSITIVE, which supports the full range of Unicode characters. Using this flag now has the same behavior as CASE_INSENSITIVE.

CASE_INSENSITIVE
public static final int CASE_INSENSITIVE
Allows case-insensitive matching of most Unicode characters.

In general the aim is to reach parity with the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags of Pattern when doing a case-insensitive match. We support common case folding in addition to simple case folding, as defined by the common (C) and simple (S) mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. This is in line with Pattern and means that the characters representing the Greek letter sigma (Σ, σ, ς) all match one another, even though σ and ς are both lowercase characters, as detailed here: https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt.

The casing of some Unicode characters is difficult to decode correctly. In some cases Java's String class handles these correctly but Java's Pattern class does not. We make only a best effort to maintain consistency with Pattern, and there may be differences.

There are three known special classes of these characters:
1. the set of characters whose casing matches across multiple characters, such as the Greek sigma character mentioned above (Σ, σ, ς); we support these; notably, some of these characters fall into the ASCII range and so will behave differently when this flag is enabled
2. the set of characters that are in neither an upper nor a lower case stable state and can be both uppercased and lowercased from their current code point, such as Dž, which when uppercased produces DŽ and when lowercased produces dž; we support these
3. the set of characters that when uppercased produce more than 1 character. For performance reasons we ignore these characters for now, which is consistent with Pattern

Sometimes these classes of character will overlap; if a character is in both class 3 and any other class listed above, it is ignored; this is consistent with Pattern and the C, S, T mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. Support for class 3 is only available with full (F) mappings, which are not supported. For instance, the character ῼ will match its lowercase form ῳ but not its uppercase form ΩΙ.

Class 3 characters produce multiple characters when uppercased, such as ﬗ (0xFB17), which uppercases to ՄԽ (code points: 0x0544 0x053D), and are therefore ignored; however, lowercase matching on these values is supported: 0x00DF, 0x0130, 0x0149, 0x01F0, 0x0390, 0x03B0, 0x0587, 0x1E96-0x1E9A, 0x1F50, 0x1F52, 0x1F54, 0x1F56, 0x1F80-0x1FAF, 0x1FB2-0x1FB4, 0x1FB6, 0x1FB7, 0x1FBC, 0x1FC2-0x1FC4, 0x1FC6, 0x1FC7, 0x1FCC, 0x1FD2, 0x1FD3, 0x1FD6, 0x1FD7, 0x1FE2-0x1FE4, 0x1FE6, 0x1FE7, 0x1FF2-0x1FF4, 0x1FF6, 0x1FF7, 0x1FFC, 0xFB00-0xFB06, 0xFB13-0xFB17
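Because the description above measures itself against java.util.regex.Pattern, a small sketch of the Pattern side of that comparison may help. It uses only Pattern, not RegExp, so it shows the reference behavior for the three sigma characters rather than any result produced by this class; `SigmaFoldingSketch` and `matchesFolded` are hypothetical names.

```java
import java.util.regex.Pattern;

/** Shows the java.util.regex.Pattern behavior that CASE_INSENSITIVE aims to mirror. */
final class SigmaFoldingSketch {

  /**
   * Matches input against regex with Unicode-aware case-insensitive matching.
   * The regex is expected to be a plain character with no regex metacharacters.
   */
  static boolean matchesFolded(String regex, String input) {
    int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
    return Pattern.compile(regex, flags).matcher(input).matches();
  }

  public static void main(String[] args) {
    String capitalSigma = "\u03A3"; // GREEK CAPITAL LETTER SIGMA
    String smallSigma = "\u03C3";   // GREEK SMALL LETTER SIGMA
    String finalSigma = "\u03C2";   // GREEK SMALL LETTER FINAL SIGMA

    // Both lowercase forms match each other and the capital form.
    System.out.println(matchesFolded(smallSigma, finalSigma));   // true
    System.out.println(matchesFolded(finalSigma, capitalSigma)); // true
    // Without the flags, the two lowercase forms are distinct characters.
    System.out.println(Pattern.matches(smallSigma, finalSigma)); // false
  }
}
```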

CASE_INSENSITIVE_RANGE
public static final int CASE_INSENSITIVE_RANGE
Similar to CASE_INSENSITIVE but for character class ranges.
This flag allows ranges such as [a-z] to match A, but may result in performance costs during parsing.

DEPRECATED_COMPLEMENT
Deprecated. This method will be removed in Lucene 11
Allows regexp parsing of the complement (~).
Note that processing the complement can require exponential time, but will be bounded by an internal limit. Regexes exceeding the limit will fail with TooComplexToDeterminizeException.

kind
The type of expression

exp1
Child expressions held by a container type expression

exp2
Child expressions held by a container type expression

s
String expression

c
public final int c
Character expression

min
public final int min
Limits for repeatable type expressions

max
public final int max
Limits for repeatable type expressions

digits
public final int digits
Limits for repeatable type expressions

from
public final int[] from
Extents for range type expressions

to
public final int[] to
Extents for range type expressions

Constructor Details
RegExp
Constructs new RegExp from a string. Same as RegExp(s, ALL).
Parameters:
s - regexp string
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression

RegExp
Constructs new RegExp from a string.
Parameters:
s - regexp string
syntax_flags - bitwise OR of optional syntax constructs to be enabled
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression

RegExp
Constructs new RegExp from a string.
Parameters:
s - regexp string
syntax_flags - bitwise OR of optional syntax constructs to be enabled
match_flags - bitwise OR of match behavior options such as case insensitivity
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression

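The syntax_flags and match_flags parameters are int bit masks, so several options are combined with the `|` operator, along the lines of `new RegExp(s, RegExp.INTERSECTION | RegExp.INTERVAL)`. The stand-alone sketch below illustrates the mechanism only. `SyntaxFlagsSketch`, `isEnabled`, and the `SKETCH_*` constants are hypothetical, and their bit values are made up for illustration; the real values should be read from the RegExp constants.

```java
/** Stand-alone bit-flag model; the constant values here are illustrative, not RegExp's. */
final class SyntaxFlagsSketch {
  // Hypothetical bit assignments: one distinct bit per optional construct.
  static final int SKETCH_INTERSECTION = 1 << 0;
  static final int SKETCH_EMPTY = 1 << 1;
  static final int SKETCH_INTERVAL = 1 << 2;
  static final int SKETCH_NONE = 0;

  /** Returns true if the given construct bit is set in the combined flags. */
  static boolean isEnabled(int flags, int construct) {
    return (flags & construct) != 0;
  }

  public static void main(String[] args) {
    // Enable two optional constructs at once by OR-ing their bits together.
    int flags = SKETCH_INTERSECTION | SKETCH_INTERVAL;
    System.out.println(isEnabled(flags, SKETCH_INTERSECTION)); // true
    System.out.println(isEnabled(flags, SKETCH_EMPTY));        // false
    System.out.println(isEnabled(SKETCH_NONE, SKETCH_EMPTY));  // false
  }
}
```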
Method Details
toAutomaton
Constructs new Automaton from this RegExp. Same as toAutomaton(null) (empty automaton map).

toAutomaton
public Automaton toAutomaton(AutomatonProvider automaton_provider) throws IllegalArgumentException, TooComplexToDeterminizeException
Constructs new Automaton from this RegExp.
Parameters:
automaton_provider - provider of automata for named identifiers
Throws:
IllegalArgumentException - if this regular expression uses a named identifier that is not available from the automaton provider
TooComplexToDeterminizeException

toAutomaton
public Automaton toAutomaton(Map<String, Automaton> automata) throws IllegalArgumentException, TooComplexToDeterminizeException
Constructs new Automaton from this RegExp.
Parameters:
automata - a map from automaton identifiers to automata (of type Automaton).
Throws:
IllegalArgumentException - if this regular expression uses a named identifier that does not occur in the automaton map
TooComplexToDeterminizeException

getOriginalString
The string that was used to construct the regex. Compare to toString.

toString
Constructs string from parsed regular expression.

toStringTree
Like toString, but more verbose (shows the hierarchy more clearly).

getIdentifiers
Returns set of automaton identifiers that occur in this regular expression.