Class Utf8Scanner
java.lang.Object
com.fasterxml.aalto.in.XmlScanner
com.fasterxml.aalto.in.ByteBasedScanner
com.fasterxml.aalto.in.StreamScanner
com.fasterxml.aalto.in.Utf8Scanner
- All Implemented Interfaces:
XmlConsts, NamespaceContext, XMLStreamConstants
Scanner for tokenizing XML content from a byte stream encoded using
UTF-8, or something suitably close to it for decoding purposes
(including ISO-Latin1 and US-ASCII).
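Utf8Scanner is an internal class, so application code would normally reach this kind of tokenizing through the standard StAX API rather than by constructing a scanner directly. The following is a minimal sketch using only `javax.xml.stream`; the class name `Utf8StaxSketch` and the method `collectText` are hypothetical, and which StAX implementation `XMLInputFactory.newFactory()` resolves to depends on the classpath (it is an assumption, not a guarantee, that Aalto would be selected).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class Utf8StaxSketch {
    /** Concatenates all CHARACTERS events found in a UTF-8 encoded XML document. */
    static String collectText(byte[] utf8Xml) {
        StringBuilder text = new StringBuilder();
        try {
            // Resolves to whichever StAX implementation is available at runtime.
            XMLInputFactory factory = XMLInputFactory.newFactory();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(utf8Xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        text.append(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Failed to parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] doc = "<root><item id=\"1\">h\u00e9llo</item></root>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(collectText(doc));
    }
}
```

The input is handed over as raw bytes, so decoding of the two-byte sequence for U+00E9 happens inside the parser, which is the job a byte-based scanner performs.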
-
Field Summary
Fields inherited from class StreamScanner
_charTypes, _in, _inputBuffer, _quadBuffer, _symbols
Fields inherited from class ByteBasedScanner
_inputEnd, _inputPtr, _tmpChar, BYTE_a, BYTE_A, BYTE_AMP, BYTE_APOS, BYTE_C, BYTE_CR, BYTE_D, BYTE_EQ, BYTE_EXCL, BYTE_g, BYTE_GT, BYTE_HASH, BYTE_HYPHEN, BYTE_l, BYTE_LBRACKET, BYTE_LF, BYTE_LT, BYTE_m, BYTE_NULL, BYTE_o, BYTE_p, BYTE_P, BYTE_q, BYTE_QMARK, BYTE_QUOT, BYTE_RBRACKET, BYTE_s, BYTE_S, BYTE_SEMICOLON, BYTE_SLASH, BYTE_SPACE, BYTE_t, BYTE_T, BYTE_TAB, BYTE_u, BYTE_x
Fields inherited from class XmlScanner
_attrCollector, _attrCount, _cfgCoalescing, _cfgLazyParsing, _config, _currElem, _currNsCount, _currRow, _currToken, _defaultNs, _depth, _entityPending, _isEmptyTag, _lastNsContext, _lastNsDecl, _nameBuffer, _nsBindingCache, _nsBindingCount, _nsBindings, _nsBindMisses, _pastBytesOrChars, _publicId, _rowStartOffset, _startColumn, _startRawOffset, _startRow, _systemId, _textBuilder, _tokenIncomplete, _tokenName, _xml11, CDATA_STR, INT_0, INT_9, INT_a, INT_A, INT_AMP, INT_APOS, INT_COLON, INT_CR, INT_EQ, INT_EXCL, INT_f, INT_F, INT_GT, INT_HYPHEN, INT_LBRACKET, INT_LF, INT_LT, INT_NULL, INT_QMARK, INT_QUOTE, INT_RBRACKET, INT_SLASH, INT_SPACE, INT_TAB, INT_z, MAX_UNICODE_CHAR, TOKEN_EOI
Fields inherited from interface XmlConsts
CHAR_CR, CHAR_LF, CHAR_NULL, CHAR_SPACE, STAX_DEFAULT_OUTPUT_ENCODING, STAX_DEFAULT_OUTPUT_VERSION, XML_DECL_KW_ENCODING, XML_DECL_KW_STANDALONE, XML_DECL_KW_VERSION, XML_SA_NO, XML_SA_YES, XML_V_10, XML_V_10_STR, XML_V_11, XML_V_11_STR, XML_V_UNKNOWN
Fields inherited from interface XMLStreamConstants
ATTRIBUTE, CDATA, CHARACTERS, COMMENT, DTD, END_DOCUMENT, END_ELEMENT, ENTITY_DECLARATION, ENTITY_REFERENCE, NAMESPACE, NOTATION_DECLARATION, PROCESSING_INSTRUCTION, SPACE, START_DOCUMENT, START_ELEMENT
-
Constructor Summary
Constructors
Utf8Scanner(ReaderConfig cfg, InputStream in, byte[] buffer, int ptr, int last)
-
Method Summary
Modifier and Type, Method, Description
private final int collectValue(int attrPtr, byte quoteByte, PName attrName)
  This method implements the tight loop for parsing attribute values.
int decodeCharForError(byte b)
  Method called to decode a full UTF-8 character, given its first byte.
private final int decodeMultiByteChar(int c, int ptr)
private final int decodeUtf8_2(int c)
private final int decodeUtf8_3(int c1)
private final int decodeUtf8_3fast(int c1)
private final int decodeUtf8_4(int c)
protected final void finishCData
protected final void finishCharacters
protected final void finishCoalescedCData
protected final void finishCoalescedCharacters
protected final void finishCoalescedText
  Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been read into the text buffer.
protected final void finishComment
protected final void finishDTD(boolean copyContents)
  When this method gets called we know that we have an internal subset, and that the opening '[' has already been read.
protected final void finishPI()
protected final void finishSpace
  Note: this method is only called in cases where it is known that only space chars are legal.
protected final void finishToken
  This method is called to ensure that the current token/event has been completely parsed, such that we have all the data needed to return it (textual content, PI data, comment text, etc.).
protected final int handleEntityInText(boolean inAttr)
  Method called when an ampersand is encountered in a text segment.
private void handleNsDeclaration(PName name, byte quoteByte)
  Method called from the main START_ELEMENT handling loop, to parse namespace URI values.
protected int handleStartElement(byte b)
  Parsing of start element requires parsing of the element name (and attribute names), and is thus encoding-specific.
protected String parsePublicId(byte quoteChar)
  Parsing of public ids is a bit more complicated than that of system ids, since white space is to be coalesced.
protected String parseSystemId(byte quoteChar)
protected void reportInvalidOther(int mask, int ptr)
protected final void skipCData
protected final boolean skipCharacters
protected final boolean skipCoalescedText
  Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been skipped.
protected final void skipComment
protected final void skipPI()
protected final void skipSpace
private final void skipUtf8_2(int c)
private final void skipUtf8_3(int c)
private final void skipUtf8_4(int c)
private final void skipUtf8_4Slow(int c)
Methods inherited from class StreamScanner
_closeSource, _nextEntity, _releaseBuffers, addPName, checkInTreeIndentation, checkPrologIndentation, handleCharEntity, handleEndElement, loadAndRetain, loadMore, loadOne, loadOne, nextByte, nextByte, nextFromProlog, nextFromTree, parsePName, parsePNameLong, parsePNameMedium, parsePNameSlow, skipInternalWs
Methods inherited from class ByteBasedScanner
addUTFPName, getCurrentColumnNr, getCurrentLocation, getEndingByteOffset, getEndingCharOffset, getStartingByteOffset, getStartingCharOffset, markLF, markLF, reportInvalidInitial, reportInvalidOther, setStartLocation
Methods inherited from class XmlScanner
bindName, bindNs, checkImmutableBinding, close, decodeAttrBinaryValue, decodeAttrValue, decodeAttrValues, decodeElements, findAttrIndex, findOrCreateBinding, fireSaxCharacterEvents, fireSaxCommentEvent, fireSaxEndElement, fireSaxPIEvent, fireSaxSpaceEvents, fireSaxStartElement, getAttrCollector, getAttrCount, getAttrLocalName, getAttrNsURI, getAttrPrefix, getAttrPrefixedName, getAttrQName, getAttrType, getAttrValue, getAttrValue, getConfig, getCurrentLineNr, getDepth, getDTDPublicId, getDTDSystemId, getEndLocation, getInputPublicId, getInputSystemId, getName, getNamespacePrefix, getNamespaceURI, getNamespaceURI, getNamespaceURI, getNonTransientNamespaceContext, getNsCount, getPrefix, getPrefixes, getQName, getStartLocation, getText, getText, getTextCharacters, getTextCharacters, getTextLength, handleInvalidXmlChar, hasEmptyStack, isAttrSpecified, isEmptyTag, isTextWhitespace, loadMoreGuaranteed, loadMoreGuaranteed, reportDoubleHyphenInComments, reportDuplicateNsDecl, reportEntityOverflow, reportEofInName, reportIllegalCDataEnd, reportIllegalNsDecl, reportIllegalNsDecl, reportInputProblem, reportInvalidNameChar, reportInvalidNsIndex, reportInvalidXmlChar, reportMissingPISpace, reportMultipleColonsInName, reportPrologProblem, reportPrologUnexpChar, reportPrologUnexpElement, reportTreeUnexpChar, reportUnboundPrefix, reportUnexpandedEntityInAttr, reportUnexpectedEndTag, resetForDecoding, skipToken, throwInvalidSpace, throwNullChar, throwUnexpectedChar, verifyXmlChar
-
Constructor Details
-
Utf8Scanner
-
-
Method Details
-
finishToken
Description copied from class: XmlScanner
This method is called to ensure that the current token/event has been completely parsed, such that we have all the data needed to return it (textual content, PI data, comment text, etc.).
- Specified by:
finishToken in class XmlScanner
- Throws:
XMLStreamException
-
handleStartElement
Description copied from class: StreamScanner
Parsing of start element requires parsing of the element name (and attribute names), and is thus encoding-specific.
- Specified by:
handleStartElement in class StreamScanner
- Throws:
XMLStreamException
-
collectValue
private final int collectValue(int attrPtr, byte quoteByte, PName attrName) throws XMLStreamException
This method implements the tight loop for parsing attribute values. It is split out of the main start element method to keep that method simple, which makes the code more maintainable and possibly easier for JIT/HotSpot to optimize.
- Throws:
XMLStreamException
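To illustrate what a tight attribute-value loop can look like, here is a simplified sketch. It is not Aalto's implementation: the class `AttrValueSketch`, the `end` and `out` parameters, and the ASCII-only restriction are all assumptions made for brevity, and entity references are not handled. It applies two rules from the XML specification: literal tab, CR, and LF inside an attribute value are normalized to a space, and a literal '<' is not allowed.

```java
final class AttrValueSketch {
    /**
     * Scans buf[ptr..end) up to the closing quote, appending the value to out.
     * Returns the index just past the closing quote.
     */
    static int collectValue(byte[] buf, int ptr, int end, byte quoteByte, StringBuilder out) {
        final int quote = quoteByte & 0xFF;
        while (ptr < end) {
            int b = buf[ptr++] & 0xFF;
            if (b == quote) {
                return ptr; // closing quote found
            }
            if (b == '<') {
                throw new IllegalArgumentException("'<' is not allowed in an attribute value");
            }
            if (b >= 0x80) {
                // Assumption: this sketch skips multi-byte decoding entirely.
                throw new IllegalArgumentException("sketch handles ASCII only");
            }
            if (b == '\t' || b == '\n' || b == '\r') {
                b = ' '; // attribute-value whitespace normalization
            }
            out.append((char) b);
        }
        throw new IllegalArgumentException("unexpected end of input in attribute value");
    }
}
```

Keeping the loop in its own small method, as the description above suggests, leaves the common path (plain bytes followed by a closing quote) short and free of unrelated branches.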
-
handleNsDeclaration
Method called from the main START_ELEMENT handling loop, to parse namespace URI values.
- Throws:
XMLStreamException
-
handleEntityInText
Method called when an ampersand is encountered in a text segment. The method needs to determine whether it is a pre-defined or character entity (in which case it will be expanded into a single char or surrogate pair), or a general entity (in which case it will most likely be returned as an ENTITY_REFERENCE event).
- Specified by:
handleEntityInText in class StreamScanner
- Parameters:
inAttr - True if the reference is from an attribute value; false if from normal text content
- Returns:
- 0 if a general parsed entity was encountered; integer value of a (valid) XML content character otherwise
- Throws:
XMLStreamException
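The return convention described above (0 for a general entity, a character value otherwise) can be sketched as follows. This is an illustration only: the class `EntitySketch` and the method `resolveEntity` are hypothetical, the input is assumed to be the text between '&' and ';', and no check is made that a numeric reference names a valid XML character.

```java
final class EntitySketch {
    /**
     * Returns the code point for a pre-defined or character entity,
     * or 0 if the name refers to a general entity.
     */
    static int resolveEntity(String name) {
        if (name.startsWith("#x")) {
            return Integer.parseInt(name.substring(2), 16); // hexadecimal character reference
        }
        if (name.startsWith("#")) {
            return Integer.parseInt(name.substring(1), 10); // decimal character reference
        }
        switch (name) {
            case "amp":  return '&';
            case "lt":   return '<';
            case "gt":   return '>';
            case "apos": return '\'';
            case "quot": return '"';
            default:     return 0; // general entity: left for the caller to report
        }
    }

    public static void main(String[] args) {
        int cp = resolveEntity("#x1F600");
        // Code points above U+FFFF expand to a surrogate pair (two chars).
        System.out.println(Character.toChars(cp).length);
    }
}
```

Using 0 as the "general entity" marker works because U+0000 is never a legal XML content character, so it cannot collide with a real result.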
-
parsePublicId
Parsing of public ids is a bit more complicated than that of system ids, since white space is to be coalesced.
- Specified by:
parsePublicId in class StreamScanner
- Throws:
XMLStreamException
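The white space coalescing mentioned above can be sketched as below. The rule applied is the one from the XML 1.0 specification for public identifiers: runs of white space become a single space, and leading and trailing white space is removed. The class `PublicIdSketch` and the method `coalesce` are hypothetical, and the sketch works on an already-decoded string rather than on the byte buffer the scanner would use.

```java
final class PublicIdSketch {
    /** Collapses white space runs to one space and trims both ends. */
    static String coalesce(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        boolean pendingSpace = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            // Space, CR, and LF are the white space characters legal in a public id.
            if (c == ' ' || c == '\r' || c == '\n') {
                // Remember the gap, but only once some content has been emitted,
                // so that leading white space is dropped.
                pendingSpace = sb.length() > 0;
            } else {
                if (pendingSpace) {
                    sb.append(' ');
                    pendingSpace = false;
                }
                sb.append(c);
            }
        }
        // A pending space at the end is never appended, which trims the tail.
        return sb.toString();
    }
}
```

System ids need no such pass, since their content is taken as-is up to the closing quote.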
-
parseSystemId
- Specified by:
parseSystemId in class StreamScanner
- Throws:
XMLStreamException
-
skipCharacters
- Specified by:
skipCharacters in class XmlScanner
- Returns:
- True, if an unexpanded entity was encountered (and is now pending)
- Throws:
XMLStreamException
-
skipComment
- Specified by:
skipComment in class XmlScanner
- Throws:
XMLStreamException
-
skipCData
- Specified by:
skipCData in class XmlScanner
- Throws:
XMLStreamException
-
skipPI
- Specified by:
skipPI in class XmlScanner
- Throws:
XMLStreamException
-
skipSpace
- Specified by:
skipSpace in class XmlScanner
- Throws:
XMLStreamException
-
skipUtf8_2
- Throws:
XMLStreamException
-
skipUtf8_3
- Throws:
XMLStreamException
-
skipUtf8_4
- Throws:
XMLStreamException
-
skipUtf8_4Slow
- Throws:
XMLStreamException
-
finishCData
- Specified by:
finishCData in class XmlScanner
- Throws:
XMLStreamException
-
finishCharacters
- Specified by:
finishCharacters in class XmlScanner
- Throws:
XMLStreamException
-
finishComment
- Specified by:
finishComment in class XmlScanner
- Throws:
XMLStreamException
-
finishDTD
When this method gets called we know that we have an internal subset, and that the opening '[' has already been read.
- Specified by:
finishDTD in class XmlScanner
- Throws:
XMLStreamException
-
finishPI
- Specified by:
finishPI in class XmlScanner
- Throws:
XMLStreamException
-
finishSpace
Note: this method is only called in cases where it is known that only space chars are legal. Thus, encountering a non-space is an error (WFC or VC). However, an end-of-input is ok.
- Specified by:
finishSpace in class XmlScanner
- Throws:
XMLStreamException
-
finishCoalescedText
Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been read into the text buffer. The method has to see if the following event would be textual as well, and if so, read it (and any other following textual segments).
- Throws:
XMLStreamException
-
finishCoalescedCharacters
- Throws:
XMLStreamException
-
finishCoalescedCData
- Throws:
XMLStreamException
-
skipCoalescedText
Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been skipped. The method has to see if the following event would be textual as well, and if so, skip it (and any other following textual segments).
- Specified by:
skipCoalescedText in class XmlScanner
- Returns:
- True if we encountered an unexpandable entity
- Throws:
XMLStreamException
-
decodeMultiByteChar
- Returns:
- Either decoded character (if positive int); or negated value of a high-order char (one that needs surrogate pair)
- Throws:
XMLStreamException
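The sign convention in the return value can be sketched as follows. This is an assumption-laden illustration: the class `MultiByteSketch` and its `decode(byte[], int)` signature are hypothetical (the real method takes `int c, int ptr` and reads from the scanner's own buffer), continuation bytes are not validated, and it is assumed that the negated value is the full code point.

```java
final class MultiByteSketch {
    /**
     * Decodes the multi-byte UTF-8 sequence starting at buf[ptr].
     * Returns the code point for 2- and 3-byte sequences, and the
     * negated code point for 4-byte sequences (which need a surrogate pair).
     */
    static int decode(byte[] buf, int ptr) {
        int c = buf[ptr] & 0xFF;
        int needed; // number of continuation bytes that follow the lead byte
        if ((c & 0xE0) == 0xC0) {          // 110xxxxx: 2-byte sequence
            c &= 0x1F;
            needed = 1;
        } else if ((c & 0xF0) == 0xE0) {   // 1110xxxx: 3-byte sequence
            c &= 0x0F;
            needed = 2;
        } else if ((c & 0xF8) == 0xF0) {   // 11110xxx: 4-byte sequence
            c &= 0x07;
            needed = 3;
        } else {
            throw new IllegalArgumentException(
                    "not a multi-byte lead byte: 0x" + Integer.toHexString(c));
        }
        for (int i = 1; i <= needed; i++) {
            // Each continuation byte contributes its low 6 bits.
            c = (c << 6) | (buf[ptr + i] & 0x3F);
        }
        return (needed == 3) ? -c : c;
    }
}
```

A caller can then test the sign once: a positive result fits in a single `char`, while a negative result signals that two `char` slots are needed in the output buffer.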
-
decodeUtf8_2
- Throws:
XMLStreamException
-
decodeUtf8_3
- Throws:
XMLStreamException
-
decodeUtf8_3fast
- Throws:
XMLStreamException
-
decodeUtf8_4
- Returns:
- Character value minus 0x10000, so that the caller can readily expand it to actual surrogates
- Throws:
XMLStreamException
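Returning the character value minus 0x10000 is convenient because that offset is exactly the 20-bit quantity that UTF-16 splits across a surrogate pair. The sketch below shows the arithmetic; the class `Utf8FourByteSketch` and both method signatures are hypothetical (the real `decodeUtf8_4` takes only the lead byte and reads the rest from the scanner's buffer), and no validation of the input bytes is performed.

```java
final class Utf8FourByteSketch {
    /** Decodes a 4-byte UTF-8 sequence at buf[ptr] and returns codePoint - 0x10000. */
    static int decode4(byte[] buf, int ptr) {
        int c = ((buf[ptr] & 0x07) << 18)
                | ((buf[ptr + 1] & 0x3F) << 12)
                | ((buf[ptr + 2] & 0x3F) << 6)
                | (buf[ptr + 3] & 0x3F);
        return c - 0x10000;
    }

    /** Expands the 20-bit offset into a UTF-16 high/low surrogate pair. */
    static char[] toSurrogates(int offset) {
        char high = (char) (0xD800 | (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 | (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 }; // U+1F600
        char[] pair = toSurrogates(decode4(utf8, 0));
        System.out.println(Integer.toHexString(pair[0]) + " " + Integer.toHexString(pair[1]));
    }
}
```

With the subtraction already done by the decoder, the caller needs only one shift and one mask to produce the two `char` values.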
-
decodeCharForError
Method called to decode a full UTF-8 character, given its first byte. Note: does not do any validity checks, since this is only to be used for informational purposes (often when an error has already been encountered).
- Specified by:
decodeCharForError in class ByteBasedScanner
- Throws:
XMLStreamException
-
reportInvalidOther
- Throws:
XMLStreamException
-