Package com.fasterxml.aalto.in
Class Utf8Scanner
- java.lang.Object
-
- com.fasterxml.aalto.in.XmlScanner
-
- com.fasterxml.aalto.in.ByteBasedScanner
-
- com.fasterxml.aalto.in.StreamScanner
-
- com.fasterxml.aalto.in.Utf8Scanner
-
- All Implemented Interfaces:
XmlConsts, javax.xml.namespace.NamespaceContext, javax.xml.stream.XMLStreamConstants
public final class Utf8Scanner extends StreamScanner
Scanner for tokenizing XML content from a byte stream encoded using UTF-8, or something suitably close to it for decoding purposes (including ISO-Latin1 and US-ASCII).
-
-
Field Summary
-
Fields inherited from class com.fasterxml.aalto.in.StreamScanner
_charTypes, _in, _inputBuffer, _quadBuffer, _symbols
-
Fields inherited from class com.fasterxml.aalto.in.ByteBasedScanner
_inputEnd, _inputPtr, _tmpChar, BYTE_a, BYTE_A, BYTE_AMP, BYTE_APOS, BYTE_C, BYTE_CR, BYTE_D, BYTE_EQ, BYTE_EXCL, BYTE_g, BYTE_GT, BYTE_HASH, BYTE_HYPHEN, BYTE_l, BYTE_LBRACKET, BYTE_LF, BYTE_LT, BYTE_m, BYTE_NULL, BYTE_o, BYTE_p, BYTE_P, BYTE_q, BYTE_QMARK, BYTE_QUOT, BYTE_RBRACKET, BYTE_s, BYTE_S, BYTE_SEMICOLON, BYTE_SLASH, BYTE_SPACE, BYTE_t, BYTE_T, BYTE_TAB, BYTE_u, BYTE_x
-
Fields inherited from class com.fasterxml.aalto.in.XmlScanner
_attrCollector, _attrCount, _cfgCoalescing, _cfgLazyParsing, _config, _currElem, _currNsCount, _currRow, _currToken, _defaultNs, _depth, _entityPending, _isEmptyTag, _lastNsContext, _lastNsDecl, _nameBuffer, _nsBindingCache, _nsBindingCount, _nsBindings, _nsBindMisses, _pastBytesOrChars, _publicId, _rowStartOffset, _startColumn, _startRawOffset, _startRow, _systemId, _textBuilder, _tokenIncomplete, _tokenName, _xml11, CDATA_STR, INT_0, INT_9, INT_a, INT_A, INT_AMP, INT_APOS, INT_COLON, INT_CR, INT_EQ, INT_EXCL, INT_f, INT_F, INT_GT, INT_HYPHEN, INT_LBRACKET, INT_LF, INT_LT, INT_NULL, INT_QMARK, INT_QUOTE, INT_RBRACKET, INT_SLASH, INT_SPACE, INT_TAB, INT_z, MAX_UNICODE_CHAR, TOKEN_EOI
-
Fields inherited from interface com.fasterxml.aalto.util.XmlConsts
CHAR_CR, CHAR_LF, CHAR_NULL, CHAR_SPACE, STAX_DEFAULT_OUTPUT_ENCODING, STAX_DEFAULT_OUTPUT_VERSION, XML_DECL_KW_ENCODING, XML_DECL_KW_STANDALONE, XML_DECL_KW_VERSION, XML_SA_NO, XML_SA_YES, XML_V_10, XML_V_10_STR, XML_V_11, XML_V_11_STR, XML_V_UNKNOWN
-
-
Constructor Summary
Constructors
Utf8Scanner(ReaderConfig cfg, java.io.InputStream in, byte[] buffer, int ptr, int last)
-
Method Summary
Modifier and Type | Method | Description
private int | collectValue(int attrPtr, byte quoteByte, PName attrName) | This method implements the tight loop for parsing attribute values.
int | decodeCharForError(byte b) | Method called to decode a full UTF-8 character, given its first byte.
private int | decodeMultiByteChar(int c, int ptr) |
private int | decodeUtf8_2(int c) |
private int | decodeUtf8_3(int c1) |
private int | decodeUtf8_3fast(int c1) |
private int | decodeUtf8_4(int c) |
protected void | finishCData() |
protected void | finishCharacters() |
protected void | finishCoalescedCData() |
protected void | finishCoalescedCharacters() |
protected void | finishCoalescedText() | Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been read into the text buffer.
protected void | finishComment() |
protected void | finishDTD(boolean copyContents) | When this method gets called we know that we have an internal subset, and that the opening '[' has already been read.
protected void | finishPI() |
protected void | finishSpace() | Note: this method is only called in cases where it is known that only space chars are legal.
protected void | finishToken() | This method is called to ensure that the current token/event has been completely parsed, such that we have all the data needed to return it (textual content, PI data, comment text, etc.).
protected int | handleEntityInText(boolean inAttr) | Method called when an ampersand is encountered in a text segment.
private void | handleNsDeclaration(PName name, byte quoteByte) | Method called from the main START_ELEMENT handling loop, to parse namespace URI values.
protected int | handleStartElement(byte b) | Parsing of start element requires parsing of the element name (and attribute names), and is thus encoding-specific.
protected java.lang.String | parsePublicId(byte quoteChar) | Parsing of public ids is a bit more complicated than that of system ids, since white space is to be coalesced.
protected java.lang.String | parseSystemId(byte quoteChar) |
protected void | reportInvalidOther(int mask, int ptr) |
protected void | skipCData() |
protected boolean | skipCharacters() |
protected boolean | skipCoalescedText() | Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been skipped.
protected void | skipComment() |
protected void | skipPI() |
protected void | skipSpace() |
private void | skipUtf8_2(int c) |
private void | skipUtf8_3(int c) |
private void | skipUtf8_4(int c) |
private void | skipUtf8_4Slow(int c) |
-
Methods inherited from class com.fasterxml.aalto.in.StreamScanner
_closeSource, _nextEntity, _releaseBuffers, addPName, checkInTreeIndentation, checkPrologIndentation, handleCharEntity, handleEndElement, loadAndRetain, loadMore, loadOne, loadOne, nextByte, nextByte, nextFromProlog, nextFromTree, parsePName, parsePNameLong, parsePNameMedium, parsePNameSlow, skipInternalWs
-
Methods inherited from class com.fasterxml.aalto.in.ByteBasedScanner
addUTFPName, getCurrentColumnNr, getCurrentLocation, getEndingByteOffset, getEndingCharOffset, getStartingByteOffset, getStartingCharOffset, markLF, markLF, reportInvalidInitial, reportInvalidOther, setStartLocation
-
Methods inherited from class com.fasterxml.aalto.in.XmlScanner
bindName, bindNs, checkImmutableBinding, close, decodeAttrBinaryValue, decodeAttrValue, decodeAttrValues, decodeElements, findAttrIndex, findOrCreateBinding, fireSaxCharacterEvents, fireSaxCommentEvent, fireSaxEndElement, fireSaxPIEvent, fireSaxSpaceEvents, fireSaxStartElement, getAttrCollector, getAttrCount, getAttrLocalName, getAttrNsURI, getAttrPrefix, getAttrPrefixedName, getAttrQName, getAttrType, getAttrValue, getAttrValue, getConfig, getCurrentLineNr, getDepth, getDTDPublicId, getDTDSystemId, getEndLocation, getInputPublicId, getInputSystemId, getName, getNamespacePrefix, getNamespaceURI, getNamespaceURI, getNamespaceURI, getNonTransientNamespaceContext, getNsCount, getPrefix, getPrefixes, getQName, getStartLocation, getText, getText, getTextCharacters, getTextCharacters, getTextLength, handleInvalidXmlChar, hasEmptyStack, isAttrSpecified, isEmptyTag, isTextWhitespace, loadMoreGuaranteed, loadMoreGuaranteed, reportDoubleHyphenInComments, reportDuplicateNsDecl, reportEntityOverflow, reportEofInName, reportIllegalCDataEnd, reportIllegalNsDecl, reportIllegalNsDecl, reportInputProblem, reportInvalidNameChar, reportInvalidNsIndex, reportInvalidXmlChar, reportMissingPISpace, reportMultipleColonsInName, reportPrologProblem, reportPrologUnexpChar, reportPrologUnexpElement, reportTreeUnexpChar, reportUnboundPrefix, reportUnexpandedEntityInAttr, reportUnexpectedEndTag, resetForDecoding, skipToken, throwInvalidSpace, throwNullChar, throwUnexpectedChar, verifyXmlChar
-
-
-
-
Constructor Detail
-
Utf8Scanner
public Utf8Scanner(ReaderConfig cfg, java.io.InputStream in, byte[] buffer, int ptr, int last)
-
-
Method Detail
-
finishToken
protected final void finishToken() throws javax.xml.stream.XMLStreamException
Description copied from class: XmlScanner
This method is called to ensure that the current token/event has been completely parsed, such that we have all the data needed to return it (textual content, PI data, comment text, etc.).
- Specified by:
finishToken in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
handleStartElement
protected int handleStartElement(byte b) throws javax.xml.stream.XMLStreamException
Description copied from class: StreamScanner
Parsing of start element requires parsing of the element name (and attribute names), and is thus encoding-specific.
- Specified by:
handleStartElement in class StreamScanner
- Throws:
javax.xml.stream.XMLStreamException
-
collectValue
private final int collectValue(int attrPtr, byte quoteByte, PName attrName) throws javax.xml.stream.XMLStreamException
This method implements the tight loop for parsing attribute values. It is split out of the main start element method to keep that method simple, which makes the code more maintainable and possibly easier for JIT/HotSpot to optimize.
- Throws:
javax.xml.stream.XMLStreamException
-
handleNsDeclaration
private void handleNsDeclaration(PName name, byte quoteByte) throws javax.xml.stream.XMLStreamException
Method called from the main START_ELEMENT handling loop, to parse namespace URI values.
- Throws:
javax.xml.stream.XMLStreamException
-
handleEntityInText
protected final int handleEntityInText(boolean inAttr) throws javax.xml.stream.XMLStreamException
Method called when an ampersand is encountered in a text segment. The method needs to determine whether it is a pre-defined or character entity (in which case it will be expanded into a single char or surrogate pair), or a general entity (in which case it will most likely be returned as an ENTITY_REFERENCE event).
- Specified by:
handleEntityInText in class StreamScanner
- Parameters:
inAttr - True if the reference is from an attribute value; false if from normal text content
- Returns:
- 0 if a general parsed entity encountered; integer value of a (valid) XML content character otherwise
- Throws:
javax.xml.stream.XMLStreamException
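
The return convention described above can be illustrated with a minimal sketch. It is not Aalto's implementation: `EntitySketch`, `resolve` and `expand` are hypothetical names, the input is assumed to be the entity name already extracted from between '&' and ';', and no validity checks are performed on the resulting character.

```java
// Hypothetical illustration only; Aalto's real method works on the byte buffer.
final class EntitySketch {
    /**
     * Returns the character value for a pre-defined or character entity,
     * or 0 when the name denotes a general entity.
     */
    static int resolve(String name) {
        switch (name) {
            case "amp":  return '&';
            case "lt":   return '<';
            case "gt":   return '>';
            case "apos": return '\'';
            case "quot": return '"';
            default:     break;
        }
        if (name.startsWith("#x")) {
            // Hexadecimal character reference, e.g. "#x41"
            return Integer.parseInt(name.substring(2), 16);
        }
        if (name.startsWith("#")) {
            // Decimal character reference, e.g. "#65"
            return Integer.parseInt(name.substring(1), 10);
        }
        // General entity: a caller would typically surface ENTITY_REFERENCE
        return 0;
    }

    /** Expands a code point into one char, or a surrogate pair above U+FFFF. */
    static String expand(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

For instance, `EntitySketch.resolve("#x41")` yields 65, while an unknown name such as `"foo"` yields 0.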
-
parsePublicId
protected java.lang.String parsePublicId(byte quoteChar) throws javax.xml.stream.XMLStreamException
Parsing of public ids is a bit more complicated than that of system ids, since white space is to be coalesced.
- Specified by:
parsePublicId in class StreamScanner
- Throws:
javax.xml.stream.XMLStreamException
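
As a rough illustration of the white space coalescing mentioned above, the following sketch normalizes a public identifier the way the XML specification requires (runs of white space become a single space, leading and trailing white space is dropped). `PublicIdSketch` and `normalize` are hypothetical names, and the sketch works on a String rather than on the scanner's byte buffer.

```java
// Hypothetical illustration only; not the code used by Utf8Scanner.
final class PublicIdSketch {
    static String normalize(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        boolean pendingSpace = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == ' ' || c == '\r' || c == '\n') {
                // Remember the run; leading white space is never emitted
                pendingSpace = sb.length() > 0;
            } else {
                if (pendingSpace) {
                    sb.append(' '); // one space stands in for the whole run
                    pendingSpace = false;
                }
                sb.append(c);
            }
        }
        // A trailing run stays pending and is therefore dropped
        return sb.toString();
    }
}
```

System identifiers need no such step, which is why the two are parsed separately.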
-
parseSystemId
protected java.lang.String parseSystemId(byte quoteChar) throws javax.xml.stream.XMLStreamException
- Specified by:
parseSystemId in class StreamScanner
- Throws:
javax.xml.stream.XMLStreamException
-
skipCharacters
protected final boolean skipCharacters() throws javax.xml.stream.XMLStreamException
- Specified by:
skipCharacters in class XmlScanner
- Returns:
- True, if an unexpanded entity was encountered (and is now pending)
- Throws:
javax.xml.stream.XMLStreamException
-
skipComment
protected final void skipComment() throws javax.xml.stream.XMLStreamException
- Specified by:
skipComment in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
skipCData
protected final void skipCData() throws javax.xml.stream.XMLStreamException
- Specified by:
skipCData in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
skipPI
protected final void skipPI() throws javax.xml.stream.XMLStreamException
- Specified by:
skipPI in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
skipSpace
protected final void skipSpace() throws javax.xml.stream.XMLStreamException
- Specified by:
skipSpace in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
skipUtf8_2
private final void skipUtf8_2(int c) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
skipUtf8_3
private final void skipUtf8_3(int c) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
skipUtf8_4
private final void skipUtf8_4(int c) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
skipUtf8_4Slow
private final void skipUtf8_4Slow(int c) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
finishCData
protected final void finishCData() throws javax.xml.stream.XMLStreamException
- Specified by:
finishCData in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
finishCharacters
protected final void finishCharacters() throws javax.xml.stream.XMLStreamException
- Specified by:
finishCharacters in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
finishComment
protected final void finishComment() throws javax.xml.stream.XMLStreamException
- Specified by:
finishComment in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
finishDTD
protected final void finishDTD(boolean copyContents) throws javax.xml.stream.XMLStreamException
When this method gets called we know that we have an internal subset, and that the opening '[' has already been read.
- Specified by:
finishDTD in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
finishPI
protected final void finishPI() throws javax.xml.stream.XMLStreamException
- Specified by:
finishPI in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
-
finishSpace
protected final void finishSpace() throws javax.xml.stream.XMLStreamException
Note: this method is only called in cases where it is known that only space chars are legal. Thus, encountering a non-space is an error (a violation of a well-formedness constraint, WFC, or a validity constraint, VC). However, an end-of-input is OK.
- Specified by:
finishSpace in class XmlScanner
- Throws:
javax.xml.stream.XMLStreamException
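
The contract in the note above (space chars are consumed, a non-space is an error, running out of input is fine) might be sketched as follows. `SpaceSketch` and `consumeSpace` are hypothetical names, and an unchecked exception stands in for the `XMLStreamException` that the real scanner reports.

```java
// Hypothetical illustration only; it does not load more input or track rows/columns.
final class SpaceSketch {
    /** Returns the number of white space bytes consumed from buf[start..end). */
    static int consumeSpace(byte[] buf, int start, int end) {
        int ptr = start;
        while (ptr < end) {
            byte b = buf[ptr];
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                // Only space chars are legal here, so anything else is an error
                throw new IllegalStateException("Unexpected non-space byte 0x"
                        + Integer.toHexString(b & 0xFF) + " at offset " + ptr);
            }
            ptr++;
        }
        // Reaching 'end' (end of input in this sketch) is not an error
        return ptr - start;
    }
}
```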
-
finishCoalescedText
protected final void finishCoalescedText() throws javax.xml.stream.XMLStreamException
Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been read into the text buffer. The method has to see if the following event would be textual as well, and if so, read it (and any other following textual segments).
- Throws:
javax.xml.stream.XMLStreamException
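
The idea of coalescing can be shown on an already tokenized list of segments. This is only a sketch under stated assumptions: `CoalesceSketch`, `Kind` and `Segment` are hypothetical types, the merged segment is assumed to be reported as CHARACTERS, and the real scanner works incrementally on its input buffer rather than on a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not how Utf8Scanner stores its events.
final class CoalesceSketch {
    enum Kind { CHARACTERS, CDATA, COMMENT }

    static final class Segment {
        final Kind kind;
        final String text;
        Segment(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    /** Merges each run of adjacent CHARACTERS/CDATA segments into one segment. */
    static List<Segment> coalesce(List<Segment> in) {
        List<Segment> out = new ArrayList<>();
        StringBuilder pending = null; // text of the run currently being merged
        for (Segment s : in) {
            boolean textual = s.kind == Kind.CHARACTERS || s.kind == Kind.CDATA;
            if (textual) {
                if (pending == null) {
                    pending = new StringBuilder();
                }
                pending.append(s.text);
            } else {
                if (pending != null) {
                    out.add(new Segment(Kind.CHARACTERS, pending.toString()));
                    pending = null;
                }
                out.add(s); // non-textual segments end the run and pass through
            }
        }
        if (pending != null) {
            out.add(new Segment(Kind.CHARACTERS, pending.toString()));
        }
        return out;
    }
}
```

Given CHARACTERS "a", CDATA "b", a comment, then CHARACTERS "d", the sketch produces three segments, the first carrying the text "ab".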
-
finishCoalescedCharacters
protected final void finishCoalescedCharacters() throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
finishCoalescedCData
protected final void finishCoalescedCData() throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
skipCoalescedText
protected final boolean skipCoalescedText() throws javax.xml.stream.XMLStreamException
Method that gets called after a primary text segment (of type CHARACTERS or CDATA, not applicable to SPACE) has been skipped. The method has to see if the following event would be textual as well, and if so, skip it (and any other following textual segments).
- Specified by:
skipCoalescedText in class XmlScanner
- Returns:
- True if we encountered an unexpandable entity
- Throws:
javax.xml.stream.XMLStreamException
-
decodeMultiByteChar
private final int decodeMultiByteChar(int c, int ptr) throws javax.xml.stream.XMLStreamException
- Returns:
- Either the decoded character (if a positive int), or the negated value of a high-order char (one that needs a surrogate pair)
- Throws:
javax.xml.stream.XMLStreamException
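
A sign-based return convention like the one described above could look roughly like this. The sketch is not Aalto's code: `Utf8DecodeSketch` and `decode` are hypothetical names, it assumes the negated value is the full code point, and it performs no validation of continuation bytes.

```java
// Hypothetical illustration only; continuation bytes are not validated.
final class Utf8DecodeSketch {
    /**
     * Decodes the multi-byte UTF-8 sequence starting at buf[ptr].
     * Returns a positive value for characters that fit in one Java char,
     * and the negated code point for characters that need a surrogate pair.
     */
    static int decode(byte[] buf, int ptr) {
        int c = buf[ptr] & 0xFF;
        if ((c & 0xE0) == 0xC0) { // 110xxxxx 10xxxxxx
            return ((c & 0x1F) << 6) | (buf[ptr + 1] & 0x3F);
        }
        if ((c & 0xF0) == 0xE0) { // 1110xxxx 10xxxxxx 10xxxxxx
            return ((c & 0x0F) << 12)
                    | ((buf[ptr + 1] & 0x3F) << 6)
                    | (buf[ptr + 2] & 0x3F);
        }
        if ((c & 0xF8) == 0xF0) { // 11110xxx followed by three continuation bytes
            int codePoint = ((c & 0x07) << 18)
                    | ((buf[ptr + 1] & 0x3F) << 12)
                    | ((buf[ptr + 2] & 0x3F) << 6)
                    | (buf[ptr + 3] & 0x3F);
            return -codePoint; // the sign tells the caller a surrogate pair is needed
        }
        throw new IllegalArgumentException(
                "Not a multi-byte lead byte: 0x" + Integer.toHexString(c));
    }
}
```

Encoding the "needs two chars" case in the sign lets the caller branch on a single int comparison instead of a second return value.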
-
decodeUtf8_2
private final int decodeUtf8_2(int c) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
decodeUtf8_3
private final int decodeUtf8_3(int c1) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
decodeUtf8_3fast
private final int decodeUtf8_3fast(int c1) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
decodeUtf8_4
private final int decodeUtf8_4(int c) throws javax.xml.stream.XMLStreamException
- Returns:
- Character value minus 0x10000, so that the caller can readily expand it into the actual surrogate pair
- Throws:
javax.xml.stream.XMLStreamException
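
Returning the value with 0x10000 already subtracted makes the surrogate expansion a pair of shift and mask operations. A minimal sketch of that caller-side step, with the hypothetical names `SurrogateSketch` and `toSurrogates`:

```java
// Hypothetical illustration of the caller-side expansion; not taken from Aalto.
final class SurrogateSketch {
    /** Expects (code point - 0x10000), i.e. a 20-bit value. */
    static char[] toSurrogates(int offsetValue) {
        char high = (char) (0xD800 | (offsetValue >> 10));   // top 10 bits
        char low  = (char) (0xDC00 | (offsetValue & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }
}
```

For U+1F600 the offset value is 0xF600, which expands to the pair 0xD83D, 0xDE00, matching `Character.toChars(0x1F600)`.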
-
decodeCharForError
public int decodeCharForError(byte b) throws javax.xml.stream.XMLStreamException
Method called to decode a full UTF-8 character, given its first byte. Note: does not do any validity checks, since this is only to be used for informational purposes (often when an error has already been encountered).
- Specified by:
decodeCharForError in class ByteBasedScanner
- Throws:
javax.xml.stream.XMLStreamException
-
reportInvalidOther
protected void reportInvalidOther(int mask, int ptr) throws javax.xml.stream.XMLStreamException
- Throws:
javax.xml.stream.XMLStreamException
-
-