Class TextPattern
- All Implemented Interfaces:
Serializable, CharSequence
The regular expression facilities of the Java API are a powerful tool; however, when searching for a constant pattern, many algorithms can increase the speed of a search by orders of magnitude.
This class provides constant-pattern text search facilities by implementing the last-character heuristics of the Boyer–Moore search algorithm using compact approximators, a randomized data structure that can accommodate in a small space (but in an approximated way) the bad-character shift table of a large alphabet such as Unicode.
Since a large subset of US-ASCII is used in all languages (e.g., whitespace, punctuation, etc.), this class caches separately the shifts for the first 128 Unicode characters, resulting in very good performance even on text in pure US-ASCII.
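The idea described above can be illustrated with a minimal, self-contained sketch. It is not the implementation of this class: the class name LastCharSearchSketch is hypothetical, and the compact approximator is replaced by a much cruder stand-in (64 hashed buckets, each keeping the smallest shift of the pattern characters that fall into it), which assumes only that an approximated table may underestimate a shift but must never overestimate it. Shifts for the first 128 characters are kept exactly in a separate array.

```java
import java.util.Arrays;

/** Illustrative sketch only; not the TextPattern implementation. */
final class LastCharSearchSketch {
    private final char[] pattern;
    // Exact shifts for the first 128 characters, cached separately.
    private final int[] asciiShift = new int[128];
    // Hypothetical stand-in for the approximator: each bucket holds the
    // smallest shift of any pattern character hashed into it, so a
    // collision can only make a shift shorter (slower), never wrong.
    private final int[] hashedShift = new int[64];

    LastCharSearchSketch(CharSequence p) {
        pattern = p.toString().toCharArray();
        final int m = pattern.length;
        Arrays.fill(asciiShift, m);
        Arrays.fill(hashedShift, m);
        // The last pattern character is excluded from the table.
        for (int i = 0; i < m - 1; i++) {
            final char c = pattern[i];
            final int s = m - 1 - i;
            if (c < 128) asciiShift[c] = s;
            else hashedShift[c & 63] = Math.min(hashedShift[c & 63], s);
        }
    }

    private int shift(char c) {
        return c < 128 ? asciiShift[c] : hashedShift[c & 63];
    }

    /** Returns the index of the first occurrence, or -1 if there is none. */
    int search(CharSequence text) {
        final int m = pattern.length, n = text.length();
        if (m == 0) return 0;
        int i = 0;
        while (i <= n - m) {
            // Compare right to left within the current window.
            int j = m - 1;
            while (j >= 0 && text.charAt(i + j) == pattern[j]) j--;
            if (j < 0) return i;
            // Shift according to the last character of the window.
            i += shift(text.charAt(i + m - 1));
        }
        return -1;
    }
}
```

The sketch keeps setup cost proportional to the pattern length plus the two fixed-size tables, which is the trade-off the description refers to: a small table for a large alphabet, at the price of occasionally shorter shifts.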
Note that the indexOf methods of MutableString use an even more simplified variant of the Boyer–Moore algorithm, which is less efficient but has a smaller setup time and does not generate any object. In general, for short case-sensitive patterns the overhead of this class will make it slower than those methods.
The search facilities provided by this class are targeted at searches with long patterns and at case-insensitive searches.
Instances of this class are immutable and thread-safe.
- Since:
- 0.6
- Author:
- Sebastiano Vigna, Paolo Boldi
-
Field Summary
Fields
Modifier and Type | Field | Description
static final int | CASE_INSENSITIVE | Enables case-insensitive matching.
protected char[] | pattern | The pattern backing array.
static final int | UNICODE_CASE | Enables Unicode-aware case folding.
-
Constructor Summary
Constructors
Constructor | Description
TextPattern(CharSequence pattern) | Creates a new case-sensitive TextPattern object that can be used to search for the given pattern.
TextPattern(CharSequence pattern, int flags) | Creates a new TextPattern object that can be used to search for the given pattern.
-
Method Summary
Modifier and Type | Method | Description
boolean | caseInsensitive() | Returns whether this pattern is case insensitive.
char | charAt(int i) |
final boolean | equals(Object o) | Compares this text pattern to another object.
final int | hashCode() | Returns a hash code for this text pattern.
int | length() |
int | search(byte[] a) | Returns the index of the first occurrence of this pattern in the given byte array.
int | search(byte[] a, int from) | Returns the index of the first occurrence of this pattern in the given byte array starting from a given index.
int | search(byte[] a, int from, int to) | Returns the index of the first occurrence of this pattern in the given byte array between given indices.
int | search(char[] array) | Returns the index of the first occurrence of this pattern in the given character array.
int | search(char[] array, int from) | Returns the index of the first occurrence of this pattern in the given character array starting from a given index.
int | search(char[] a, int from, int to) | Returns the index of the first occurrence of this pattern in the given character array between given indices.
int | search(it.unimi.dsi.fastutil.chars.CharList list) | Returns the index of the first occurrence of this pattern in the given character list.
int | search(it.unimi.dsi.fastutil.chars.CharList list, int from) | Returns the index of the first occurrence of this pattern in the given character list starting from a given index.
int | search(it.unimi.dsi.fastutil.chars.CharList list, int from, int to) | Returns the index of the first occurrence of this pattern in the given character list between given indices.
int | search(CharSequence s) | Returns the index of the first occurrence of this pattern in the given character sequence.
int | search(CharSequence s, int from) | Returns the index of the first occurrence of this pattern in the given character sequence starting from a given index.
int | search(CharSequence s, int from, int to) | Returns the index of the first occurrence of this pattern in the given character sequence between given indices.
CharSequence | subSequence(int from, int to) |
final String | toString() |
boolean | unicodeCase() | Returns whether this pattern uses Unicode case folding.

Methods inherited from interface CharSequence
chars, codePoints, getChars, isEmpty
-
Field Details
-
CASE_INSENSITIVE
public static final int CASE_INSENSITIVE
Enables case-insensitive matching.
By default, case-insensitive matching assumes that only characters in the ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.
Case-insensitivity involves a performance drop.
-
UNICODE_CASE
public static final int UNICODE_CASE
Enables Unicode-aware case folding.
When this flag is specified, case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the ASCII charset are being matched.
Unicode-aware case folding is very expensive (two method calls per examined non-ASCII character).
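The difference between the two folding modes can be sketched as follows. This is an illustration under stated assumptions, not the code of this class: the names CaseFoldSketch, asciiFold and unicodeFold are hypothetical, and the choice of Character.toUpperCase followed by Character.toLowerCase as the "two method calls" is an assumption, since the documentation does not say which calls are made.

```java
/** Illustrative sketch of ASCII-only versus Unicode-aware case folding. */
final class CaseFoldSketch {
    /** Folds only the letters A-Z; every other character is left untouched. */
    static char asciiFold(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    /**
     * Folds ASCII cheaply and pays two method calls for any non-ASCII
     * character (assumed here to be toUpperCase, then toLowerCase).
     */
    static char unicodeFold(char c) {
        return c < 128 ? asciiFold(c) : Character.toLowerCase(Character.toUpperCase(c));
    }

    /** Compares two characters under the selected folding mode. */
    static boolean sameChar(char a, char b, boolean unicodeCase) {
        return unicodeCase ? unicodeFold(a) == unicodeFold(b) : asciiFold(a) == asciiFold(b);
    }
}
```

Under this sketch, 'A' and 'a' compare equal in both modes, while U+00C9 and U+00E9 compare equal only when Unicode folding is requested.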
-
pattern
protected char[] pattern
The pattern backing array.
-
-
Constructor Details
-
TextPattern
public TextPattern(CharSequence pattern)
Creates a new case-sensitive TextPattern object that can be used to search for the given pattern.
- Parameters:
pattern - the constant pattern to search for.
-
TextPattern
public TextPattern(CharSequence pattern, int flags)
Creates a new TextPattern object that can be used to search for the given pattern.
- Parameters:
pattern - the constant pattern to search for.
flags - a bit mask that may include CASE_INSENSITIVE and UNICODE_CASE.
-
-
Method Details
-
caseInsensitive
public boolean caseInsensitive()
Returns whether this pattern is case insensitive.
-
unicodeCase
public boolean unicodeCase()
Returns whether this pattern uses Unicode case folding.
-
length
public int length()
- Specified by:
length in interface CharSequence
-
charAt
public char charAt(int i)
- Specified by:
charAt in interface CharSequence
-
subSequence
public CharSequence subSequence(int from, int to)
- Specified by:
subSequence in interface CharSequence
-
search
public int search(char[] array)
Returns the index of the first occurrence of this pattern in the given character array.
- Parameters:
array - the character array to look in.
- Returns:
- the index of the first occurrence of this pattern contained in the given array, or -1, if the pattern cannot be found.
-
search
public int search(char[] array, int from)
Returns the index of the first occurrence of this pattern in the given character array starting from a given index.
- Parameters:
array - the character array to look in.
from - the index from which the search must start.
- Returns:
- the index of the first occurrence of this pattern contained in the subarray starting from from (inclusive), or -1, if the pattern cannot be found.
-
search
public int search(char[] a, int from, int to)
Returns the index of the first occurrence of this pattern in the given character array between given indices.
- Parameters:
a - the character array to look in.
from - the index from which the search must start.
to - the index at which the search must end.
- Returns:
- the index of the first occurrence of this pattern contained in the subarray from index from (inclusive) up to index to (exclusive), or -1, if the pattern cannot be found.
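The range contract of the three-argument search methods can be made concrete with a naive reference implementation, which may also serve as a test oracle. This is a sketch, not part of this class: NaiveRangeSearch and its methods are hypothetical names, and it assumes that an occurrence must lie entirely within the half-open range [from, to), which the documentation above suggests but does not state explicitly. Bounds are assumed to be valid for the array.

```java
/** Naive reference search over a[from..to); illustrative only. */
final class NaiveRangeSearch {
    /**
     * Returns the first index i with from <= i and i + p.length <= to such
     * that a[i..i + p.length) equals p, or -1 if there is no such index.
     */
    static int search(char[] a, char[] p, int from, int to) {
        for (int i = from; i + p.length <= to; i++) {
            int j = 0;
            while (j < p.length && a[i + j] == p[j]) j++;
            if (j == p.length) return i;
        }
        return -1;
    }

    /** Counts all (possibly overlapping) occurrences by restarting after each hit. */
    static int count(char[] a, char[] p) {
        int n = 0;
        for (int i = search(a, p, 0, a.length); i != -1; i = search(a, p, i + 1, a.length)) n++;
        return n;
    }
}
```

Restarting at the previous result plus one, as count does, is the usual way to enumerate every occurrence with an API that only returns the first one.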
-
search
public int search(CharSequence s)
Returns the index of the first occurrence of this pattern in the given character sequence.
- Parameters:
s - the character sequence to look in.
- Returns:
- the index of the first occurrence of this pattern contained in the given character sequence, or -1, if the pattern cannot be found.
-
search
public int search(CharSequence s, int from)
Returns the index of the first occurrence of this pattern in the given character sequence starting from a given index.
- Parameters:
s - the character sequence to look in.
from - the index from which the search must start.
- Returns:
- the index of the first occurrence of this pattern contained in the subsequence starting from from (inclusive), or -1, if the pattern cannot be found.
-
search
public int search(CharSequence s, int from, int to)
Returns the index of the first occurrence of this pattern in the given character sequence between given indices.
- Parameters:
s - the character sequence to look in.
from - the index from which the search must start.
to - the index at which the search must end.
- Returns:
- the index of the first occurrence of this pattern contained in the subsequence from index from (inclusive) up to index to (exclusive), or -1, if the pattern cannot be found.
-
search
public int search(byte[] a)
Returns the index of the first occurrence of this pattern in the given byte array.
- Parameters:
a - the byte array to look in.
- Returns:
- the index of the first occurrence of this pattern contained in the given byte array, or -1, if the pattern cannot be found.
-
search
public int search(byte[] a, int from)
Returns the index of the first occurrence of this pattern in the given byte array starting from a given index.
- Parameters:
a - the byte array to look in.
from - the index from which the search must start.
- Returns:
- the index of the first occurrence of this pattern contained in the array fragment starting from from (inclusive), or -1, if the pattern cannot be found.
-
search
public int search(byte[] a, int from, int to)
Returns the index of the first occurrence of this pattern in the given byte array between given indices.
- Parameters:
a - the byte array to look in.
from - the index from which the search must start.
to - the index at which the search must end.
- Returns:
- the index of the first occurrence of this pattern contained in the array fragment from index from (inclusive) up to index to (exclusive), or -1, if the pattern cannot be found.
-
search
public int search(it.unimi.dsi.fastutil.chars.CharList list)
Returns the index of the first occurrence of this pattern in the given character list.
- Parameters:
list - the character list to look in.
- Returns:
- the index of the first occurrence of this pattern contained in the given list, or -1, if the pattern cannot be found.
-
search
public int search(it.unimi.dsi.fastutil.chars.CharList list, int from)
Returns the index of the first occurrence of this pattern in the given character list starting from a given index.
- Parameters:
list - the character list to look in.
from - the index from which the search must start.
- Returns:
- the index of the first occurrence of this pattern contained in the sublist starting from from (inclusive), or -1, if the pattern cannot be found.
-
search
public int search(it.unimi.dsi.fastutil.chars.CharList list, int from, int to)
Returns the index of the first occurrence of this pattern in the given character list between given indices.
- Parameters:
list - the character list to look in.
from - the index from which the search must start.
to - the index at which the search must end.
- Returns:
- the index of the first occurrence of this pattern contained in the sublist from index from (inclusive) up to index to (exclusive), or -1, if the pattern cannot be found.
-
equals
public final boolean equals(Object o)
Compares this text pattern to another object.
This method will return true iff its argument is a TextPattern containing the same constant pattern with the same flags set.
-
hashCode
public final int hashCode()
Returns a hash code for this text pattern.
The hash code of a text pattern is the same as that of a String with the same content (suitably lower cased, if the pattern is case insensitive).
-
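The hashCode contract above can be sketched with the standard String hash recurrence. This is an illustration, not the code of this class: PatternHashSketch is a hypothetical name, and reading "suitably lower cased" as a per-character Character.toLowerCase is an assumption that agrees with String.toLowerCase for ASCII content but is not confirmed by the documentation for other characters.

```java
/** Illustrative sketch of a String-compatible hash over a pattern array. */
final class PatternHashSketch {
    /**
     * Computes h = 31 * h + c over the pattern, the same recurrence that
     * String.hashCode() is specified to use, lower casing each character
     * first when the pattern is case insensitive.
     */
    static int hash(char[] pattern, boolean caseInsensitive) {
        int h = 0;
        for (char c : pattern) {
            h = 31 * h + (caseInsensitive ? Character.toLowerCase(c) : c);
        }
        return h;
    }
}
```

With this sketch, a case-insensitive pattern "Needle" hashes like the string "needle", so patterns that differ only in case land in the same hash bucket.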
toString
public final String toString()
- Specified by:
toString in interface CharSequence
- Overrides:
toString in class Object
-