Class ZenoString
java.lang.Object
net.sf.saxon.str.UnicodeString
net.sf.saxon.str.ZenoString
- All Implemented Interfaces:
Comparable<UnicodeString>, AtomicMatchKey
A ZenoString is an implementation of UnicodeString that comprises a list
of segments representing substrings of the total string. By convention the
segments are not themselves ZenoStrings, so the structure is a shallow tree.
An index holds pointers to the segments and their offsets within the string
as a whole; this is used to locate the codepoint at any particular location
in the string.
The segments will always be non-empty. An empty string contains no segments.
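The index lookup described above can be illustrated with a small stand-alone sketch. This is not Saxon's implementation: the class `SegmentIndexSketch`, its fields, and the use of `int[]` segments with a binary search over start offsets are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a segment list plus offset index; not Saxon code. */
final class SegmentIndexSketch {
    private final List<int[]> segments = new ArrayList<>(); // each segment is non-empty
    private final List<Long> offsets = new ArrayList<>();   // start offset of each segment
    private long length = 0;

    /** Append a segment; empty input is ignored so that no empty segment is ever stored. */
    void append(String s) {
        int[] codePoints = s.codePoints().toArray();
        if (codePoints.length == 0) {
            return;
        }
        segments.add(codePoints);
        offsets.add(length);
        length += codePoints.length;
    }

    /** Locate the segment containing the position by binary search, then index into it. */
    int codePointAt(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int lo = 0;
        int hi = segments.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;      // upper midpoint, so the loop always shrinks
            if (offsets.get(mid) <= index) {
                lo = mid;                       // segment mid starts at or before index
            } else {
                hi = mid - 1;
            }
        }
        return segments.get(lo)[(int) (index - offsets.get(lo))];
    }

    long length() {
        return length;
    }

    int segmentCount() {
        return segments.size();
    }
}
```

In this sketch an empty string is simply an instance with no segments, matching the rule that segments are never empty.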
The key to the performance of the data structure (and the origin of its name) is the algorithm for consolidating segments when strings are concatenated. It keeps the number of segments growing logarithmically with the string size, and it keeps short segments at the extremities so that further concatenation at either end remains efficient.
For further details see the paper by Michael Kay at Balisage 2021.
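To give a feel for how consolidation can keep the segment count logarithmic, here is a deliberately simplified, append-only sketch. It is not the algorithm from the paper and not Saxon's code: the class `ConsolidationSketch` and its merge rule (merge the last two segments while the earlier one is less than twice as long as the later one) are assumptions chosen for illustration, and lengths are counted in Java `char` units rather than code points.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified consolidation rule for appends at the right-hand end only. */
final class ConsolidationSketch {
    private final List<String> segments = new ArrayList<>();

    void append(String s) {
        if (s.isEmpty()) {
            return; // never store an empty segment
        }
        segments.add(s);
        // Restore the invariant that each segment is at least twice as long
        // as its right-hand neighbour, merging from the right end inwards.
        while (segments.size() >= 2) {
            int n = segments.size();
            String left = segments.get(n - 2);
            String right = segments.get(n - 1);
            if ((long) left.length() >= 2L * right.length()) {
                break;
            }
            segments.set(n - 2, left + right);
            segments.remove(n - 1);
        }
    }

    int segmentCount() {
        return segments.size();
    }

    String value() {
        return String.join("", segments);
    }
}
```

Because every segment is at least twice the length of the one to its right, the number of segments in this sketch cannot exceed log2(total length) + 1, and the shortest segment always sits at the end where the next append happens. The real structure also has to handle concatenation at the left-hand end, which this sketch ignores.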
-
Field Summary
Fields -
Method Summary
Modifier and Type, Method, Description
int codePointAt(long index) - Get the code point at a given position in the string
codePoints() - Get an iterator over the code points present in the string.
concat(UnicodeString other) - Concatenate another string
static UnicodeString concatSegments(UnicodeString left, UnicodeString right)
(package private) void copy16bit(char[] target, int offset) - Copy this string, as a sequence of 16-bit characters, to a specified array
(package private) void copy24bit(byte[] target, int offset) - Copy this string, as a sequence of 24-bit characters, to a specified array
(package private) void copy8bit(byte[] target, int offset) - Copy this string, as a sequence of 8-bit characters, to a specified array
debugSegmentLengths() - This method is for diagnostics and unit testing only: it exposes the lengths of the internal segments.
economize() - Get an equivalent UnicodeString that uses the most economical representation available
int getWidth() - Get the number of bits needed to hold all the characters in this string
long indexOf(int codePoint, long from) - Get the position of the first occurrence of the specified codepoint, starting the search at a given position in the string
long indexWhere(IntPredicate predicate, long from) - Get the position of the first occurrence of a codepoint that matches a supplied predicate, starting the search at a given position in the string
boolean isEmpty() - Ask whether the string is empty
long length() - Get the length of the string
static ZenoString of(UnicodeString content) - Construct a ZenoString from a supplied UnicodeString
substring(long start, long end) - Get a substring of this codepoint sequence, with a given start and end position
toString()
void writeSegments(UnicodeWriter writer) - Write each of the segments in turn to a UnicodeWriter
Methods inherited from class UnicodeString
asAtomic, checkSubstringBounds, compareTo, equals, estimatedLength, hashCode, hasSubstring, indexOf, indexOf, length32, prefix, requireInt, substring, tidy, verifyCharacters
-
Field Details
-
EMPTY
An empty ZenoString
-
-
Method Details
-
of
Construct a ZenoString from a supplied UnicodeString
- Parameters:
content - the supplied UnicodeString
- Returns:
the resulting ZenoString
-
codePoints
Get an iterator over the code points present in the string.
- Specified by:
codePoints in class UnicodeString
- Returns:
an iterator that delivers the individual code points
-
length
public long length()
Get the length of the string
- Specified by:
length in class UnicodeString
- Returns:
the number of code points in the string
-
isEmpty
public boolean isEmpty()
Ask whether the string is empty
- Overrides:
isEmpty in class UnicodeString
- Returns:
true if the length of the string is zero
-
getWidth
public int getWidth()
Get the number of bits needed to hold all the characters in this string
- Specified by:
getWidth in class UnicodeString
- Returns:
7 for ASCII characters, 8 for Latin-1, 16 for BMP, 24 for general Unicode.
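The four width values can be illustrated by classifying the largest code point in a string. The sketch below is an assumption-laden illustration, not Saxon's implementation: the class `WidthSketch` and the method `widthOf` are hypothetical, it works on `java.lang.String`, and returning 7 for an empty string is a choice made here rather than documented behaviour.

```java
/** Hypothetical width classification based on the largest code point present. */
final class WidthSketch {
    static int widthOf(String s) {
        // Assumption: an empty string is treated as the narrowest case.
        int max = s.codePoints().max().orElse(0);
        if (max <= 0x7F) {
            return 7;   // ASCII
        } else if (max <= 0xFF) {
            return 8;   // Latin-1
        } else if (max <= 0xFFFF) {
            return 16;  // Basic Multilingual Plane
        } else {
            return 24;  // general Unicode (supplementary planes)
        }
    }

    public static void main(String[] args) {
        System.out.println(widthOf("abc"));
        System.out.println(widthOf(new String(Character.toChars(0x1F600))));
    }
}
```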
-
indexOf
public long indexOf(int codePoint, long from)
Get the position of the first occurrence of the specified codepoint, starting the search at a given position in the string
- Specified by:
indexOf in class UnicodeString
- Parameters:
codePoint - the sought codePoint
from - the position from which the search should start (0-based), in the range 0 to length()-1
- Returns:
the position (0-based) of the first occurrence found, or -1 if not found
- Throws:
IndexOutOfBoundsException - if the from value is out of range
-
indexWhere
Description copied from class: UnicodeString
Get the position of the first occurrence of a codepoint that matches a supplied predicate, starting the search at a given position in the string
- Overrides:
indexWhere in class UnicodeString
- Parameters:
predicate - condition that the codepoint must satisfy
from - the position from which the search should start (0-based)
- Returns:
the position (0-based) of the first codepoint to match the predicate, or -1 if not found
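The contract of a predicate-based search (0-based positions, -1 when nothing matches) can be shown with a stand-alone sketch over an array of code points. The class `IndexWhereSketch` is hypothetical and does not use Saxon. Treating a start position beyond the end as "not found" is an assumption, since the description above does not say how out-of-range values are handled.

```java
import java.util.function.IntPredicate;

/** Hypothetical linear search for the first code point matching a predicate. */
final class IndexWhereSketch {
    static long indexWhere(int[] codePoints, IntPredicate predicate, long from) {
        // Assumption: a negative start is clamped to 0, and a start past the end finds nothing.
        for (long i = Math.max(from, 0); i < codePoints.length; i++) {
            if (predicate.test(codePoints[(int) i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] codePoints = "ab1c2".codePoints().toArray();
        System.out.println(indexWhere(codePoints, Character::isDigit, 0));
    }
}
```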
-
codePointAt
public int codePointAt(long index)
Get the code point at a given position in the string
- Specified by:
codePointAt in class UnicodeString
- Parameters:
index - the given position (0-based)
- Returns:
the code point at the given position
- Throws:
IndexOutOfBoundsException - if the index is out of range
-
substring
Get a substring of this codepoint sequence, with a given start and end position
- Specified by:
substring in class UnicodeString
- Parameters:
start - the start position (0-based): that is, the position of the first code point to be included
end - the end position (0-based): specifically, the position of the first code point not to be included
- Returns:
the requested substring
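The half-open range (start included, end excluded) is counted in code points, not in UTF-16 `char` units. The following sketch shows the same convention applied to a `java.lang.String`. The class `CodePointSubstringSketch` is hypothetical, it uses `int` positions rather than `long`, and the bounds check is an assumption about reasonable behaviour rather than a statement of what Saxon does.

```java
/** Hypothetical code-point-based substring over java.lang.String. */
final class CodePointSubstringSketch {
    static String substring(String s, int start, int end) {
        int count = s.codePointCount(0, s.length());
        if (start < 0 || end < start || end > count) {
            throw new IndexOutOfBoundsException("range " + start + ".." + end);
        }
        // Translate code point positions into UTF-16 offsets.
        int from = s.offsetByCodePoints(0, start);
        int to = s.offsetByCodePoints(from, end - start);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(substring("a" + emoji + "bc", 1, 3));
    }
}
```

With start equal to end the result is the empty string, and with start 0 and end equal to the code point count the result is the whole input.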
-
concat
Concatenate another string
- Overrides:
concat in class UnicodeString
- Parameters:
other - the string to be appended to this one
- Returns:
the result of the concatenation (neither input string is altered)
-
copy8bit
void copy8bit(byte[] target, int offset)
Description copied from class: UnicodeString
Copy this string, as a sequence of 8-bit characters, to a specified array
- Overrides:
copy8bit in class UnicodeString
- Parameters:
target - the target array: the caller must ensure there is sufficient capacity
offset - the position in the target array
-
copy16bit
void copy16bit(char[] target, int offset)
Description copied from class: UnicodeString
Copy this string, as a sequence of 16-bit characters, to a specified array
- Overrides:
copy16bit in class UnicodeString
- Parameters:
target - the target array: the caller must ensure there is sufficient capacity
offset - the position in the target array
-
copy24bit
void copy24bit(byte[] target, int offset)
Description copied from class: UnicodeString
Copy this string, as a sequence of 24-bit characters, to a specified array
- Overrides:
copy24bit in class UnicodeString
- Parameters:
target - the target array: the caller must ensure there is sufficient capacity
offset - the position in the target array as a byte offset (that is, the character offset times 3)
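The byte-offset convention (character offset times 3) can be made concrete with a sketch that packs each code point into three bytes. This is only an illustration: the class `Copy24Sketch` is hypothetical, and the big-endian byte order used here is an assumption, since the description above does not state the order in which the three bytes are stored.

```java
/** Hypothetical 24-bit packing of code points, three bytes each. */
final class Copy24Sketch {
    /** Write each code point as three bytes starting at the given byte offset. */
    static void copy24bit(int[] codePoints, byte[] target, int offset) {
        int p = offset;
        for (int cp : codePoints) {
            target[p++] = (byte) (cp >> 16); // assumed order: high byte first
            target[p++] = (byte) (cp >> 8);
            target[p++] = (byte) cp;
        }
    }

    /** Read back the code point stored at a character index (byte offset = index * 3). */
    static int read(byte[] source, int charIndex) {
        int p = charIndex * 3;
        return ((source[p] & 0xFF) << 16)
                | ((source[p + 1] & 0xFF) << 8)
                | (source[p + 2] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] target = new byte[9];                          // room for three characters
        copy24bit(new int[] {0x1F600, 0x41}, target, 1 * 3);  // start at character offset 1
        System.out.println(Integer.toHexString(read(target, 1)));
    }
}
```

As in the documented method, the caller is responsible for making the target array large enough: three bytes per code point beyond the starting offset.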
-
writeSegments
Write each of the segments in turn to a UnicodeWriter
- Parameters:
writer - the writer to which the string is to be written
- Throws:
IOException
-
concatSegments
-
economize
Get an equivalent UnicodeString that uses the most economical representation available
- Overrides:
economize in class UnicodeString
- Returns:
an equivalent UnicodeString
-
toString
-
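debugSegmentLengths
This method is for diagnostics and unit testing only: it exposes the lengths of the internal segments.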
-