Package org.gjt.xpp.impl.tokenizer
Class Tokenizer
- java.lang.Object
-
- org.gjt.xpp.impl.tokenizer.Tokenizer
-
public class Tokenizer extends java.lang.Object
Simple XML Tokenizer (SXT) performs input stream tokenizing. Advantages:
- utility class to simplify creation of XML parsers; especially suited to the pull event model, but can also support push (SAX2)
- small footprint: whole tokenizer is in one file
- minimal memory utilization: does not use memory except for input and content buffer (that can grow in size)
- fast: all parsing done in one function (simple automata)
- supports most of XML 1.0 (except validation and external entities)
- low level: supports on-demand parsing of Characters, CDSect, Comments, PIs, etc.
- parsed content: supports providing on-demand parsed content to the application (standard entities expanded, all CDATA sections inserted, Comments and PIs removed) not for attribute values and element content
- mixed content: allows mixed content to be disabled dynamically
- small - total compiled size around 15K
Limitations:
- it is just a tokenizer: it does not enforce grammar
- readName() uses Java identifier rules, not XML name rules
- does not parse the DOCTYPE declaration (skips everything in [...])
- Author:
- Aleksander Slominski
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
- static byte ATTR_CHARACTERS
- static byte ATTR_CONTENT
- static byte ATTR_NAME
- char[] buf
- static byte CDSECT
- static byte CHAR_REF
- static byte CHARACTERS
- static byte COMMENT
- static byte CONTENT
- static byte DOCTYPE
- static byte EMPTY_ELEMENT
- static byte END_DOCUMENT
- static byte ENTITY_REF
- static byte ETAG_NAME
- protected static int LOOKUP_MAX
- protected static char LOOKUP_MAX_CHAR
- protected static boolean[] lookupNameChar
- protected static boolean[] lookupNameStartChar
- int nsColonCount
- boolean paramNotifyAttValue
- boolean paramNotifyCDSect
- boolean paramNotifyCharacters
- boolean paramNotifyCharRef
- boolean paramNotifyComment
- boolean paramNotifyDoctype
- boolean paramNotifyEntityRef
- boolean paramNotifyPI
- boolean parsedContent: This flag decides which buffer will be used to retrieve the content of the current token.
- char[] pc: This is the buffer for parsed content, such as the actual value of an entity ('&lt;' in buf, but '<' in pc).
- int pcEnd
- int pcStart: Range [pcStart, pcEnd) defines the part of pc that is the content of the current token iff parsedContent == true.
- static byte PI
- int pos: Position of the next char that will be read from the buffer.
- int posEnd
- int posNsColon
- int posStart: Range [posStart, posEnd) defines the part of buf that is the content of the current token iff parsedContent == false.
- boolean seenContent
- static byte STAG_END
- static byte STAG_NAME
-
Constructor Summary
Constructors (Constructor, Description)
- Tokenizer()
-
Method Summary
Methods (Modifier and Type, Method, Description)
- int getBufferShrinkOffset()
- int getColumnNumber()
- int getHardLimit()
- int getLineNumber()
- java.lang.String getPosDesc(): Return a string describing the parser's current position, as text of the form 'at line %d (row) and column %d (colum) [seen %s...]'.
- int getSoftLimit()
- boolean isAllowedMixedContent()
- boolean isBufferShrinkable()
- protected boolean isNameChar(char ch)
- protected boolean isNameStartChar(char ch)
- protected boolean isS(char ch): Determine if ch is whitespace ([3] S).
- byte next(): Return the next recognized token, or END_DOCUMENT if there is no more input.
- void reset()
- void setAllowedMixedContent(boolean enable): Set support for mixed content.
- void setBufferShrinkable(boolean shrinkable)
- void setHardLimit(int value): Set hard limit on internal buffer size.
- void setInput(char[] data): Reset tokenizer state and set new input source.
- void setInput(char[] data, int off, int len)
- void setInput(java.io.Reader r): Reset tokenizer state and set new input source.
- void setNotifyAll(boolean enable): Set notification of all XML content tokens: Characters, Comment, CDSect, Doctype, PI, EntityRef, CharRef and AttValue (tokens for STag, ETag and Attribute are always sent).
- void setParseContent(boolean enable): Allow reporting parsed content for element content and attribute content (no need to deal with low level tokens such as in setNotifyAll).
- void setSoftLimit(int value): Set soft limit on internal buffer size.
-
-
-
Field Detail
-
END_DOCUMENT
public static final byte END_DOCUMENT
- See Also:
- Constant Field Values
-
CONTENT
public static final byte CONTENT
- See Also:
- Constant Field Values
-
CHARACTERS
public static final byte CHARACTERS
- See Also:
- Constant Field Values
-
CDSECT
public static final byte CDSECT
- See Also:
- Constant Field Values
-
COMMENT
public static final byte COMMENT
- See Also:
- Constant Field Values
-
DOCTYPE
public static final byte DOCTYPE
- See Also:
- Constant Field Values
-
PI
public static final byte PI
- See Also:
- Constant Field Values
-
ENTITY_REF
public static final byte ENTITY_REF
- See Also:
- Constant Field Values
-
CHAR_REF
public static final byte CHAR_REF
- See Also:
- Constant Field Values
-
ETAG_NAME
public static final byte ETAG_NAME
- See Also:
- Constant Field Values
-
EMPTY_ELEMENT
public static final byte EMPTY_ELEMENT
- See Also:
- Constant Field Values
-
STAG_END
public static final byte STAG_END
- See Also:
- Constant Field Values
-
STAG_NAME
public static final byte STAG_NAME
- See Also:
- Constant Field Values
-
ATTR_NAME
public static final byte ATTR_NAME
- See Also:
- Constant Field Values
-
ATTR_CHARACTERS
public static final byte ATTR_CHARACTERS
- See Also:
- Constant Field Values
-
ATTR_CONTENT
public static final byte ATTR_CONTENT
- See Also:
- Constant Field Values
-
paramNotifyCharacters
public boolean paramNotifyCharacters
-
paramNotifyComment
public boolean paramNotifyComment
-
paramNotifyCDSect
public boolean paramNotifyCDSect
-
paramNotifyDoctype
public boolean paramNotifyDoctype
-
paramNotifyPI
public boolean paramNotifyPI
-
paramNotifyCharRef
public boolean paramNotifyCharRef
-
paramNotifyEntityRef
public boolean paramNotifyEntityRef
-
paramNotifyAttValue
public boolean paramNotifyAttValue
-
buf
public char[] buf
-
pos
public int pos
Position of the next char that will be read from the buffer.
-
posStart
public int posStart
Range [posStart, posEnd) defines part of buf that is content of current token iff parsedContent == false
-
posEnd
public int posEnd
-
posNsColon
public int posNsColon
-
nsColonCount
public int nsColonCount
-
seenContent
public boolean seenContent
-
parsedContent
public boolean parsedContent
This flag decides which buffer will be used to retrieve the content of the current token. If true, use pc and [pcStart, pcEnd); if false, use buf and [posStart, posEnd).
-
pc
public char[] pc
This is the buffer for parsed content, such as the actual value of an entity ('&lt;' in buf, but '<' in pc).
-
pcStart
public int pcStart
Range [pcStart, pcEnd) defines the part of pc that is the content of the current token iff parsedContent == true.
-
pcEnd
public int pcEnd
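The interplay of buf, pc and parsedContent described above can be sketched with a small stand-in class. TokenBuffersSketch, currentText() and sample() are hypothetical names introduced only for illustration; the sketch mirrors the documented public fields and is not the library's own code.

```java
/** Hypothetical stand-in that mirrors only the public fields documented above. */
final class TokenBuffersSketch {
    char[] buf;             // raw input characters
    int posStart, posEnd;   // raw token range [posStart, posEnd)
    char[] pc;              // parsed content (entities already expanded)
    int pcStart, pcEnd;     // parsed token range [pcStart, pcEnd)
    boolean parsedContent;  // true: read from pc, false: read from buf

    /** Returns the current token's text from whichever buffer is active. */
    String currentText() {
        return parsedContent
            ? new String(pc, pcStart, pcEnd - pcStart)
            : new String(buf, posStart, posEnd - posStart);
    }

    /** Builds sample state for the element content of <a>1 &lt; 2</a>. */
    static TokenBuffersSketch sample() {
        TokenBuffersSketch t = new TokenBuffersSketch();
        t.buf = "<a>1 &lt; 2</a>".toCharArray();
        t.posStart = 3;   // first char after "<a>"
        t.posEnd = 11;    // index of the '<' that starts "</a>"
        t.pc = "1 < 2".toCharArray();
        t.pcStart = 0;
        t.pcEnd = 5;
        return t;
    }
}
```

Both ranges are half-open, so the length of the token is always end minus start, and an empty token has start equal to end.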
-
LOOKUP_MAX
protected static final int LOOKUP_MAX
- See Also:
- Constant Field Values
-
LOOKUP_MAX_CHAR
protected static final char LOOKUP_MAX_CHAR
- See Also:
- Constant Field Values
-
lookupNameStartChar
protected static boolean[] lookupNameStartChar
-
lookupNameChar
protected static boolean[] lookupNameChar
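A minimal sketch of how lookup tables such as lookupNameStartChar and lookupNameChar could back isNameStartChar and isNameChar. This is an assumption about the technique, not the library's actual code: the value of LOOKUP_MAX is not given on this page (0x400 below is arbitrary), and the use of Java identifier rules for both checks is inferred from the class description's note about readName().

```java
/** Hypothetical sketch of table-driven name checks; not the library's actual code. */
final class NameLookupSketch {
    // Assumed bound: the real LOOKUP_MAX value is not stated on this page.
    static final int LOOKUP_MAX = 0x400;
    static final boolean[] lookupNameStartChar = new boolean[LOOKUP_MAX];
    static final boolean[] lookupNameChar = new boolean[LOOKUP_MAX];

    static {
        // Precompute answers for low code points so the hot path is one array read.
        for (char ch = 0; ch < LOOKUP_MAX; ch++) {
            lookupNameStartChar[ch] = Character.isJavaIdentifierStart(ch);
            lookupNameChar[ch] = Character.isJavaIdentifierPart(ch);
        }
    }

    /** Table lookup for low characters, library call for everything else. */
    static boolean isNameStartChar(char ch) {
        return ch < LOOKUP_MAX ? lookupNameStartChar[ch] : Character.isJavaIdentifierStart(ch);
    }

    /** Same scheme for characters allowed after the first one. */
    static boolean isNameChar(char ch) {
        return ch < LOOKUP_MAX ? lookupNameChar[ch] : Character.isJavaIdentifierPart(ch);
    }
}
```

Because Java identifier rules differ from XML name rules, a sketch like this rejects characters such as '-' and '.' that XML allows inside names.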
-
-
Method Detail
-
reset
public void reset()
-
setInput
public void setInput(java.io.Reader r)
Reset tokenizer state and set new input source
-
setInput
public void setInput(char[] data)
Reset tokenizer state and set new input source
-
setInput
public void setInput(char[] data, int off, int len)
-
setNotifyAll
public void setNotifyAll(boolean enable)
Set notification of all XML content tokens: Characters, Comment, CDSect, Doctype, PI, EntityRef, CharRef and AttValue (tokens for STag, ETag and Attribute are always sent).
-
setParseContent
public void setParseContent(boolean enable)
Allow reporting parsed content for element content and attribute content (no need to deal with low level tokens such as in setNotifyAll).
-
isAllowedMixedContent
public boolean isAllowedMixedContent()
-
setAllowedMixedContent
public void setAllowedMixedContent(boolean enable)
Set support for mixed content. If mixed content is disabled, the tokenizer will do its best to ensure that no element has a mixed content model; ignorable whitespace will also not be reported as element content.
-
getSoftLimit
public int getSoftLimit()
-
setSoftLimit
public void setSoftLimit(int value) throws TokenizerException
Set soft limit on internal buffer size, that is, the suggested size that the tokenizer will try to keep.
- Throws:
- TokenizerException
-
getHardLimit
public int getHardLimit()
-
setHardLimit
public void setHardLimit(int value) throws TokenizerException
Set hard limit on internal buffer size. That means that if input (such as element content) is bigger than the hard limit size, the tokenizer will throw XmlTokenizerBufferOverflowException.
- Throws:
- TokenizerException
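The difference between a soft and a hard limit can be illustrated with a small growable buffer. This is a sketch under stated assumptions: LimitedBufferSketch, ensureCapacity and shrinkIfPossible are hypothetical names, the doubling policy is invented for the example, and IllegalStateException stands in for the library's overflow exception.

```java
import java.util.Arrays;

/** Hypothetical sketch of soft/hard buffer limits; names and policy are assumptions. */
final class LimitedBufferSketch {
    private char[] buf;
    private final int softLimit; // preferred size the buffer tries to return to
    private final int hardLimit; // absolute maximum size

    LimitedBufferSketch(int softLimit, int hardLimit) {
        if (softLimit < 1 || hardLimit < softLimit) {
            throw new IllegalArgumentException("need 1 <= softLimit <= hardLimit");
        }
        this.softLimit = softLimit;
        this.hardLimit = hardLimit;
        this.buf = new char[softLimit];
    }

    /** Grows the buffer to hold `needed` chars, or fails if that exceeds the hard limit. */
    void ensureCapacity(int needed) {
        if (needed > hardLimit) {
            // Stand-in for the overflow exception mentioned in the text above.
            throw new IllegalStateException("buffer would exceed hard limit " + hardLimit);
        }
        if (needed > buf.length) {
            int newSize = Math.min(hardLimit, Math.max(needed, buf.length * 2));
            buf = Arrays.copyOf(buf, newSize);
        }
    }

    /** Shrinks back to the soft limit once the live data fits in it again. */
    void shrinkIfPossible(int used) {
        if (buf.length > softLimit && used <= softLimit) {
            buf = Arrays.copyOf(buf, softLimit);
        }
    }

    int capacity() {
        return buf.length;
    }
}
```

In this sketch the soft limit never causes an error; only a request beyond the hard limit does.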
-
getBufferShrinkOffset
public int getBufferShrinkOffset()
-
setBufferShrinkable
public void setBufferShrinkable(boolean shrinkable) throws TokenizerException
- Throws:
- TokenizerException
-
isBufferShrinkable
public boolean isBufferShrinkable()
-
getPosDesc
public java.lang.String getPosDesc()
Return a string describing the parser's current position, as text of the form 'at line %d (row) and column %d (colum) [seen %s...]'.
-
getLineNumber
public int getLineNumber()
-
getColumnNumber
public int getColumnNumber()
-
isNameStartChar
protected boolean isNameStartChar(char ch)
-
isNameChar
protected boolean isNameChar(char ch)
-
isS
protected boolean isS(char ch)
Determine if ch is whitespace ([3] S)
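Production [3] of XML 1.0 defines S as one or more of four characters: space, tab, carriage return and line feed. A minimal sketch of such a check follows; XmlWhitespaceSketch is a hypothetical name, and the library's own implementation may differ.

```java
/** Hypothetical sketch of an XML whitespace test based on production [3] S. */
final class XmlWhitespaceSketch {
    /** True only for #x20, #x9, #xD and #xA, the characters production [3] allows. */
    static boolean isS(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n';
    }
}
```

This set is narrower than Character.isWhitespace, which also accepts characters such as form feed.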
-
next
public byte next() throws TokenizerException, java.io.IOException
Return the next recognized token, or END_DOCUMENT if there is no more input.
This is a simple automaton (in pseudo-code):
    byte next() {
        while(state != END_DOCUMENT) {
            ch = more();             // read character from input
            state = func(ch, state); // do transition
            if(state is accepting)
                return state;        // return token to caller
        }
    }
For speed (and simplicity?) it uses a few procedures such as readName() or isS().
- Throws:
- TokenizerException
- java.io.IOException
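The automaton loop can be made concrete with a much-reduced, runnable sketch. MiniTokenizerSketch is hypothetical and is not the library's code: it recognizes only start tags without attributes, end tags and plain text, reads from a fixed char array instead of a Reader, reports no errors, and uses arbitrary token constant values.

```java
/** Hypothetical, much-reduced tokenizer illustrating the next() loop. */
final class MiniTokenizerSketch {
    // Token types; the values are arbitrary and not the library's constants.
    static final byte END_DOCUMENT = 0, CONTENT = 1, STAG_NAME = 2, ETAG_NAME = 3;
    // Internal automaton states.
    private static final int START = 0, IN_TEXT = 1, SEEN_LT = 2, IN_STAG = 3, IN_ETAG = 4;

    private final char[] buf;
    private int pos;       // position of the next char to read
    int posStart, posEnd;  // current token is buf[posStart, posEnd)

    MiniTokenizerSketch(String input) {
        buf = input.toCharArray();
    }

    /** Runs the automaton until it reaches an accepting state or the input ends. */
    byte next() {
        int state = START;
        posStart = posEnd = pos;
        while (pos < buf.length) {
            char ch = buf[pos];
            switch (state) {
                case START:
                    state = (ch == '<') ? SEEN_LT : IN_TEXT;
                    pos++;
                    break;
                case IN_TEXT:
                    if (ch == '<') { posEnd = pos; return CONTENT; } // leave '<' unread
                    pos++;
                    break;
                case SEEN_LT:
                    if (ch == '/') { state = IN_ETAG; pos++; } else { state = IN_STAG; }
                    posStart = pos; // the name starts here
                    break;
                default: // IN_STAG or IN_ETAG: scan the name up to '>'
                    if (ch == '>') {
                        posEnd = pos;
                        pos++; // consume '>'
                        return state == IN_STAG ? STAG_NAME : ETAG_NAME;
                    }
                    pos++;
                    break;
            }
        }
        if (state == IN_TEXT) { posEnd = pos; return CONTENT; } // trailing text
        return END_DOCUMENT; // also returned for a truncated tag in this sketch
    }

    /** Text of the current token, taken from the half-open range. */
    String text() {
        return new String(buf, posStart, posEnd - posStart);
    }
}
```

A caller pulls tokens in a loop, for example `while ((tok = t.next()) != MiniTokenizerSketch.END_DOCUMENT) { ... }`, and reads `t.text()` for each one. This pull style is the event model the class description says the tokenizer is suited to.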
-
-