Package org.htmlcleaner
Class HtmlTokenizer
- java.lang.Object
-
- org.htmlcleaner.HtmlTokenizer
-
public class HtmlTokenizer extends java.lang.Object
Main HTML tokenizer. Its task is to parse HTML and produce a list of valid tokens: open tag tokens, end tag tokens, contents (text) and comments. As soon as a new item is added to the token list, the cleaner is invoked to clean the current list at the end.
Created by: Vladimir Nikic.
Date: November, 2006
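As a rough illustration of the four token kinds named in the class description, the following self-contained sketch splits a string into open tags, end tags, text and comments. MiniTokenizer and its string-encoded tokens are hypothetical and much simpler than HtmlTokenizer: attributes, doctype and CDATA handling are ignored, and a '<' is treated as markup only when a letter (or '/' plus a letter) follows, which is an assumed stand-in for the isTagStartOrEnd() check rather than the library's actual rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified tokenizer; not the HtmlCleaner implementation. */
final class MiniTokenizer {

    /** True if a comment, open tag or end tag appears to start at index i. */
    private static boolean startsMarkup(String s, int i) {
        if (s.startsWith("<!--", i)) {
            return true;
        }
        if (s.charAt(i) != '<') {
            return false;
        }
        int nameStart = s.startsWith("</", i) ? i + 2 : i + 1;
        // Assumption: a tag name must begin with a letter.
        return nameStart < s.length() && Character.isLetter(s.charAt(nameStart));
    }

    /** Returns tokens encoded as "KIND:value" strings, in document order. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int len = html.length();
        int pos = 0;
        while (pos < len) {
            if (html.startsWith("<!--", pos)) {
                // Comment: everything up to "-->" (or to the end if unterminated).
                int end = html.indexOf("-->", pos + 4);
                tokens.add("COMMENT:" + html.substring(pos + 4, end < 0 ? len : end));
                pos = end < 0 ? len : end + 3;
            } else if (startsMarkup(html, pos)) {
                // Open or end tag: keep only the name, skip to the closing '>'.
                boolean closing = html.charAt(pos + 1) == '/';
                int nameStart = pos + (closing ? 2 : 1);
                int nameEnd = nameStart;
                while (nameEnd < len && Character.isLetterOrDigit(html.charAt(nameEnd))) {
                    nameEnd++;
                }
                tokens.add((closing ? "END_TAG:" : "OPEN_TAG:")
                        + html.substring(nameStart, nameEnd));
                int gt = html.indexOf('>', nameEnd);
                pos = gt < 0 ? len : gt + 1;
            } else {
                // Text: runs until the next position where markup starts.
                int end = pos + 1;
                while (end < len && !startsMarkup(html, end)) {
                    end++;
                }
                tokens.add("TEXT:" + html.substring(pos, end));
                pos = end;
            }
        }
        return tokens;
    }
}
```

For the input `<p>1 < 2</p><!--x-->` this sketch yields the tokens OPEN_TAG:p, TEXT:1 < 2, END_TAG:p and COMMENT:x; the '<' inside the text is kept as content because no letter follows it.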
-
-
Field Summary
Fields
Modifier and Type, Field
private boolean _asExpected
private int _col
private TagToken _currentTagToken
private DoctypeToken _docType
private boolean _isLateForDoctype
private boolean _isSpecialContext
private java.lang.String _isSpecialContextName
private int _len
private java.util.Set<java.lang.String> _namespacePrefixes
private int _pos
private java.io.BufferedReader _reader
private int _row
private java.lang.StringBuffer _saved
private java.util.List<BaseToken> _tokenList
private char[] _working
private HtmlCleaner cleaner
private CleanTimeValues cleanTimeValues
private CleanerProperties props
private CleanerTransformations transformations
private static int WORKING_BUFFER_SIZE
-
Constructor Summary
Constructors
HtmlTokenizer(HtmlCleaner cleaner, java.io.Reader reader, CleanTimeValues cleanTimeValues)
    Constructor - creates instance of the parser with specified content.
-
Method Summary
All Methods (Instance Methods, Concrete Methods)
Modifier and Type, Method, Description
private boolean addSavedAsContent()
private void addToken(BaseToken token)
private java.lang.String attributeValue()
    Parses a single tag attribute - it is expected to be in one of the forms: name=value, name="value", name='value', name
private void cdata()
private void comment()
private boolean containsEndCData()
private boolean content()
private void doctype()
DoctypeToken getDocType()
(package private) java.util.Set<java.lang.String> getNamespacePrefixes()
(package private) java.util.List<BaseToken> getTokenList()
private void go()
private void go(int step)
private void handleInterruption()
    Called whenever the thread is interrupted.
private java.lang.String identifier(boolean attribute)
    Parses an identifier from the current position.
private void ignoreUntil(char ch)
private boolean isAllRead()
    Checks if end of the content is reached.
private boolean isChar(char ch)
    Checks if character at current runtime position is equal to specified char.
private boolean isChar(int position, char ch)
    Checks if character at specified position is equal to specified char.
private boolean isElementIdentifierStartChar(int position)
    Checks if character at specified position can be identifier start.
private boolean isHtmlAttributeIdentifierChar()
private boolean isHtmlAttributeIdentifierChar(int position)
    Checks whether the character at the specified position in the stream is a valid character for part of an attribute identifier in HTML.
private boolean isHtmlAttributeIdentifierStartChar()
    Checks if character at current runtime position can be identifier start.
private boolean isHtmlElementIdentifier()
private boolean isHtmlElementIdentifier(int position)
private boolean isReservedTag(java.lang.String tagName)
    Checks if specified tag name is one of the reserved tags: HTML, HEAD or BODY.
private boolean isTagStartOrEnd()
    Not all '<' (lt) symbols mean tag start or end.
private boolean isWhitespace()
    Checks if character at current runtime position is whitespace.
private boolean isWhitespace(int position)
    Checks if character at specified position is whitespace.
private void readIfNeeded(int neededChars)
private void save(char ch)
    Saves specified character to the temporary buffer.
private void saveCurrent()
    Saves character at current runtime position to the temporary buffer.
private void saveCurrent(int size)
    Saves specified number of characters at current runtime position to the temporary buffer.
private void skipWhitespaces()
    Skips whitespaces at current position and moves forward until non-whitespace character is found or the end of content is reached.
(package private) void start()
    Starts parsing HTML.
private boolean startsWith(java.lang.String value)
    Checks if content starts with specified value at the current position.
private void tagAttributes()
    Parses list of tag attributes from the current position.
private void tagEnd()
    Parses end of the tag.
private void tagStart()
    Parses start of the tag.
private void updateCoordinates(char ch)
    Looks at the char passed and updates current position coordinates.
-
-
-
Field Detail
-
WORKING_BUFFER_SIZE
private static final int WORKING_BUFFER_SIZE
- See Also:
- Constant Field Values
-
_reader
private java.io.BufferedReader _reader
-
_working
private char[] _working
-
_pos
private transient int _pos
-
_len
private transient int _len
-
_row
private transient int _row
-
_col
private transient int _col
-
_saved
private transient java.lang.StringBuffer _saved
-
_isLateForDoctype
private transient boolean _isLateForDoctype
-
_docType
private transient DoctypeToken _docType
-
_currentTagToken
private transient TagToken _currentTagToken
-
_tokenList
private transient java.util.List<BaseToken> _tokenList
-
_namespacePrefixes
private transient java.util.Set<java.lang.String> _namespacePrefixes
-
_asExpected
private boolean _asExpected
-
_isSpecialContext
private boolean _isSpecialContext
-
_isSpecialContextName
private java.lang.String _isSpecialContextName
-
cleaner
private HtmlCleaner cleaner
-
props
private CleanerProperties props
-
transformations
private CleanerTransformations transformations
-
cleanTimeValues
private CleanTimeValues cleanTimeValues
-
-
Constructor Detail
-
HtmlTokenizer
public HtmlTokenizer(HtmlCleaner cleaner, java.io.Reader reader, CleanTimeValues cleanTimeValues)
Constructor - creates instance of the parser with specified content.
- Parameters:
cleaner -
reader -
-
-
Method Detail
-
addToken
private void addToken(BaseToken token)
-
readIfNeeded
private void readIfNeeded(int neededChars) throws java.io.IOException
- Throws:
java.io.IOException
-
getTokenList
java.util.List<BaseToken> getTokenList()
-
getNamespacePrefixes
java.util.Set<java.lang.String> getNamespacePrefixes()
-
go
private void go() throws java.io.IOException
- Throws:
java.io.IOException
-
go
private void go(int step) throws java.io.IOException
- Throws:
java.io.IOException
-
startsWith
private boolean startsWith(java.lang.String value) throws java.io.IOException
Checks if content starts with specified value at the current position.
- Parameters:
value -
- Returns:
- true if starts with specified value, false otherwise.
- Throws:
java.io.IOException
-
isWhitespace
private boolean isWhitespace(int position)
Checks if character at specified position is whitespace.
- Parameters:
position -
- Returns:
- true if whitespace, false otherwise.
-
isWhitespace
private boolean isWhitespace()
Checks if character at current runtime position is whitespace.
- Returns:
- true if whitespace, false otherwise.
-
isChar
private boolean isChar(int position, char ch)
Checks if character at specified position is equal to specified char.
- Parameters:
position -
ch -
- Returns:
- true if equal, false otherwise.
-
isChar
private boolean isChar(char ch)
Checks if character at current runtime position is equal to specified char.
- Parameters:
ch -
- Returns:
- true if equal, false otherwise.
-
isElementIdentifierStartChar
private boolean isElementIdentifierStartChar(int position)
Checks if character at specified position can be identifier start.
- Parameters:
position -
- Returns:
- true if the character may be an identifier start, false otherwise.
-
isHtmlAttributeIdentifierStartChar
private boolean isHtmlAttributeIdentifierStartChar()
Checks if character at current runtime position can be identifier start.
- Returns:
- true if the character may be an identifier start, false otherwise.
-
isHtmlAttributeIdentifierChar
private boolean isHtmlAttributeIdentifierChar()
-
isHtmlElementIdentifier
private boolean isHtmlElementIdentifier()
-
isHtmlElementIdentifier
private boolean isHtmlElementIdentifier(int position)
-
isHtmlAttributeIdentifierChar
private boolean isHtmlAttributeIdentifierChar(int position)
Checks whether the character at the specified position in the stream is a valid character for part of an attribute identifier in HTML.
- Parameters:
position -
- Returns:
-
isAllRead
private boolean isAllRead()
Checks if end of the content is reached.
-
save
private void save(char ch)
Saves specified character to the temporary buffer.
- Parameters:
ch -
-
updateCoordinates
private void updateCoordinates(char ch)
Looks at the char passed and updates the current position coordinates. If the char is a line break, the row coordinate is incremented; otherwise the col coordinate is incremented.
- Parameters:
ch - char to analyze.
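The row/column bookkeeping described for updateCoordinates can be sketched as follows. PositionTracker is a hypothetical stand-in for the _row and _col fields; the 1-based start values, the use of '\n' as the only line break, and the column reset on a new line are assumptions, since the description only states which coordinate is incremented.

```java
/** Hypothetical row/column tracker; a sketch, not the library's code. */
final class PositionTracker {
    private int row = 1; // assumption: coordinates are 1-based
    private int col = 1;

    /** Advances the coordinates past one consumed character. */
    void update(char ch) {
        if (ch == '\n') {
            row++;
            col = 1; // assumption: the column restarts on each new line
        } else {
            col++;
        }
    }

    int row() {
        return row;
    }

    int col() {
        return col;
    }
}
```

Feeding the characters of "ab\nc" one at a time leaves such a tracker at row 2, column 2.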
-
saveCurrent
private void saveCurrent()
Saves character at current runtime position to the temporary buffer.
-
saveCurrent
private void saveCurrent(int size) throws java.io.IOException
Saves specified number of characters at current runtime position to the temporary buffer.
- Throws:
java.io.IOException
-
skipWhitespaces
private void skipWhitespaces() throws java.io.IOException
Skips whitespaces at current position and moves forward until non-whitespace character is found or the end of content is reached.
- Throws:
java.io.IOException
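The skipping behaviour described for skipWhitespaces can be illustrated on a plain string. The helper below is hypothetical: it takes and returns an index instead of mutating tokenizer state, and it uses Character.isWhitespace as the whitespace test, which may differ from what isWhitespace() accepts in the library.

```java
/** Hypothetical helper; a sketch of skipping whitespace, not the library's code. */
final class WhitespaceSkipper {

    /**
     * Returns the index of the first non-whitespace character at or after
     * pos, or text.length() if only whitespace remains.
     */
    static int skipWhitespace(String text, int pos) {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }
}
```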
-
addSavedAsContent
private boolean addSavedAsContent()
-
start
void start() throws java.io.IOException
Starts parsing HTML.
- Throws:
java.io.IOException
-
isReservedTag
private boolean isReservedTag(java.lang.String tagName)
Checks if specified tag name is one of the reserved tags: HTML, HEAD or BODY.
- Parameters:
tagName -
- Returns:
-
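A check like the one isReservedTag describes might look like the sketch below. ReservedTags is a hypothetical class; comparing case-insensitively and returning false for null are assumptions, as the description only lists the three tag names.

```java
/** Hypothetical helper; a sketch of a reserved-tag check, not the library's code. */
final class ReservedTags {

    /** True if tagName is html, head or body, ignoring case (an assumption). */
    static boolean isReserved(String tagName) {
        if (tagName == null) {
            return false;
        }
        return tagName.equalsIgnoreCase("html")
                || tagName.equalsIgnoreCase("head")
                || tagName.equalsIgnoreCase("body");
    }
}
```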
tagStart
private void tagStart() throws java.io.IOException
Parses start of the tag. It expects that current position is at the "<" after which the tag's name follows.
- Throws:
java.io.IOException
-
tagEnd
private void tagEnd() throws java.io.IOException
Parses end of the tag. It expects that current position is at the "<" after which "/" and the tag's name follow.
- Throws:
java.io.IOException
-
identifier
private java.lang.String identifier(boolean attribute) throws java.io.IOException
Parses an identifier from the current position.
- Throws:
java.io.IOException
-
tagAttributes
private void tagAttributes() throws java.io.IOException
Parses list of tag attributes from the current position.
- Throws:
java.io.IOException
-
attributeValue
private java.lang.String attributeValue() throws java.io.IOException
Parses a single tag attribute - it is expected to be in one of the forms:
name=value
name="value"
name='value'
name
- Throws:
java.io.IOException
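To make the four attribute forms concrete, the sketch below parses one attribute given as a standalone string. AttributeSketch is hypothetical and does not mirror the stream-based attributeValue(); in particular, mapping a bare name to an empty value is an assumption, and escapes or '=' characters inside names are not handled.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical attribute parser; a sketch, not the library's code. */
final class AttributeSketch {

    /** Parses name=value, name="value", name='value' or a bare name. */
    static Map.Entry<String, String> parse(String attr) {
        String s = attr.trim();
        int eq = s.indexOf('=');
        if (eq < 0) {
            // Bare name: assume an empty value.
            return new SimpleEntry<>(s, "");
        }
        String name = s.substring(0, eq).trim();
        String value = s.substring(eq + 1).trim();
        if (value.length() >= 2) {
            char first = value.charAt(0);
            char last = value.charAt(value.length() - 1);
            // Strip one pair of matching single or double quotes.
            if ((first == '"' || first == '\'') && first == last) {
                value = value.substring(1, value.length() - 1);
            }
        }
        return new SimpleEntry<>(name, value);
    }
}
```

With this sketch, `href=index.html`, `class="a b"` and `id='x'` give the values `index.html`, `a b` and `x`, while `disabled` gives an empty value.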
-
content
private boolean content() throws java.io.IOException
- Throws:
java.io.IOException
-
isTagStartOrEnd
private boolean isTagStartOrEnd() throws java.io.IOException
Not all '<' (lt) symbols mean tag start or end. For example, '<' can be part of a mathematical expression. To avoid breaking content incorrectly, use this method to determine where the content ends.
- Returns:
- true if current position is tag start or end.
- Throws:
java.io.IOException
-
ignoreUntil
private void ignoreUntil(char ch) throws java.io.IOException
- Throws:
java.io.IOException
-
comment
private void comment() throws java.io.IOException
- Throws:
java.io.IOException
-
cdata
private void cdata() throws java.io.IOException
- Throws:
java.io.IOException
-
doctype
private void doctype() throws java.io.IOException
- Throws:
java.io.IOException
-
getDocType
public DoctypeToken getDocType()
-
handleInterruption
private void handleInterruption()
Called whenever the thread is interrupted. Currently this is a placeholder, but it could hold cleanup methods and user interaction.
-
containsEndCData
private boolean containsEndCData() throws java.io.IOException
- Throws:
java.io.IOException
-
-