Class HtmlParser
- All Implemented Interfaces:
XMLReader
- Direct Known Subclasses:
InfosetCoercingHtmlParser
By default, when using the constructor without arguments, this parser
coerces XML 1.0-incompatible infosets into XML 1.0-compatible
infosets. This corresponds to ALTER_INFOSET as the general
XML violation policy. To make the parser support non-conforming HTML fully
per the HTML 5 spec while on the other hand potentially violating the SAX2
API contract, set the general XML violation policy to ALLOW.
It is possible to treat XML 1.0 infoset violations as fatal by setting
the general XML violation policy to FATAL.
By default, this parser doesn't do true streaming but buffers everything
first. The parser can be made truly streaming by calling
setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This
has the consequence that errors that require non-streamable recovery are
treated as fatal.
By default, in order to make the parse events emulate the parse events
for a DTDless XML document, the parser does not report the doctype through
LexicalHandler. Doctype reporting through
LexicalHandler can be turned on by calling
setReportingDoctype(true).
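The handler side of doctype reporting can be sketched with the standard SAX extension classes alone. The class name DoctypeRecorder and its accessor methods are illustrative assumptions and not part of this API; only the startDTD callback (inherited from LexicalHandler through DefaultHandler2) comes from SAX itself.

```java
import org.xml.sax.ext.DefaultHandler2;

/**
 * Hypothetical helper (not part of HtmlParser): remembers the doctype
 * that a SAX parser reports through LexicalHandler.startDTD.
 */
class DoctypeRecorder extends DefaultHandler2 {
    private boolean seen;
    private String name;
    private String publicId;
    private String systemId;

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        // Invoked only if the parser reports the doctype at all.
        this.seen = true;
        this.name = name;
        this.publicId = publicId;
        this.systemId = systemId;
    }

    boolean sawDoctype() { return seen; }
    String getName() { return name; }
    String getPublicId() { return publicId; }
    String getSystemId() { return systemId; }
}
```

Assuming a parser instance named parser, the recorder would be attached with parser.setLexicalHandler(recorder) after calling parser.setReportingDoctype(true). Per the description above, without that call the recorder would be expected never to see a doctype.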
- Version:
- $Id$
-
Field Summary
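Fields
private final List<CharacterHandler> characterHandlers
private boolean checkingNormalization
private XmlViolationPolicy commentPolicy
private ContentHandler contentHandler
private XmlViolationPolicy contentNonXmlCharPolicy
private XmlViolationPolicy contentSpacePolicy
private DoctypeExpectation doctypeExpectation
private DocumentModeHandler documentModeHandler
private Driver driver
private DTDHandler dtdHandler
private EntityResolver entityResolver
private ErrorHandler errorHandler
private Heuristics heuristics
private boolean html4ModeCompatibleWithXhtml1Schemata
private LexicalHandler lexicalHandler
private boolean mappingLangToXmlLang
private XmlViolationPolicy namePolicy
private boolean reportingDoctype
private SAXStreamer saxStreamer
private SAXTreeBuilder saxTreeBuilder
private boolean scriptingEnabled
private XmlViolationPolicy streamabilityViolationPolicy
private TransitionHandler transitionHandler
private TreeBuilder<?> treeBuilder
private ErrorHandler treeBuilderErrorHandler
private XmlViolationPolicy xmlnsPolicy
-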
Constructor Summary
Constructors
HtmlParser() Instantiates the parser with a fatal XML violation policy.
HtmlParser(XmlViolationPolicy xmlPolicy) Instantiates the parser with a specific XML violation policy.
-
Method Summary
Modifier and Type, Method, Description
void addCharacterHandler(CharacterHandler characterHandler)
XmlViolationPolicy getBogusXmlnsPolicy() Deprecated.
XmlViolationPolicy getCommentPolicy() Returns the commentPolicy.
ContentHandler getContentHandler()
XmlViolationPolicy getContentNonXmlCharPolicy() Returns the contentNonXmlCharPolicy.
XmlViolationPolicy getContentSpacePolicy() Returns the contentSpacePolicy.
DoctypeExpectation getDoctypeExpectation() Returns the doctype expectation.
Locator getDocumentLocator() Returns the Locator during parse.
DocumentModeHandler getDocumentModeHandler() Returns the document mode handler.
DTDHandler getDTDHandler()
EntityResolver getEntityResolver()
ErrorHandler getErrorHandler()
boolean getFeature(String name) Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.
Heuristics getHeuristics()
LexicalHandler getLexicalHandler() Returns the lexicalHandler.
XmlViolationPolicy getNamePolicy() The policy for non-NCName element and attribute names.
Object getProperty(String name) Allows XMLReader-level access to non-boolean valued getters.
XmlViolationPolicy getStreamabilityViolationPolicy() Returns the streamabilityViolationPolicy.
XmlViolationPolicy getXmlnsPolicy() Returns the xmlnsPolicy.
boolean isCheckingNormalization() Indicates whether NFC normalization of source is being checked.
boolean isHtml4ModeCompatibleWithXhtml1Schemata() Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
boolean isMappingLangToXmlLang() Whether lang is mapped to xml:lang.
boolean isReportingDoctype() Returns the reportingDoctype.
boolean isScriptingEnabled() Whether the parser considers scripting to be enabled for noscript treatment.
private void lazyInit() This class wraps different tree builders depending on configuration.
private Tokenizer newTokenizer(TokenHandler handler, boolean newAttributesEachTime)
void parse(String systemId)
void parse(InputSource input)
void parseFragment(InputSource input, String context) Parses a fragment.
void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy) Deprecated.
void setCheckingNormalization(boolean enable) Toggles the checking of the NFC normalization of source.
void setCommentPolicy(XmlViolationPolicy commentPolicy) Sets the policy for consecutive hyphens in comments.
void setContentHandler(ContentHandler handler)
void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy) Sets the policy for non-XML characters except white space.
void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy) Sets the policy for non-XML white space.
void setDoctypeExpectation(DoctypeExpectation doctypeExpectation) Sets the doctype expectation.
void setDocumentModeHandler(DocumentModeHandler documentModeHandler) Sets the document mode handler.
void setDTDHandler(DTDHandler handler)
void setEntityResolver(EntityResolver resolver)
void setErrorHandler(ErrorHandler handler)
void setErrorProfile(HashMap<String, String> errorProfileMap)
void setFeature(String name, boolean value) Sets a boolean feature without having to use non-XMLReader setters directly.
void setHeuristics(Heuristics heuristics) Sets the encoding sniffing heuristics.
void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata) Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
void setLexicalHandler(LexicalHandler handler) Sets the lexical handler.
void setMappingLangToXmlLang(boolean mappingLangToXmlLang) Whether lang is mapped to xml:lang.
void setNamePolicy(XmlViolationPolicy namePolicy) The policy for non-NCName element and attribute names.
void setProperty(String name, Object value) Sets a non-boolean property without having to use non-XMLReader setters directly.
void setReportingDoctype(boolean reportingDoctype)
void setScriptingEnabled(boolean scriptingEnabled) Sets whether the parser considers scripting to be enabled for noscript treatment.
void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy) Sets the streamabilityViolationPolicy.
void setTransitionHandler(TransitionHandler handler)
void setTreeBuilderErrorHandlerOverride Deprecated. For Validator.nu internal use
void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy) Whether the xmlns attribute on the root element is passed through.
void setXmlPolicy(XmlViolationPolicy xmlPolicy) This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
private void tokenize(InputSource is)
-
Field Details
-
driver
-
treeBuilder
-
saxStreamer
-
saxTreeBuilder
-
contentHandler
-
lexicalHandler
-
dtdHandler
-
entityResolver
-
errorHandler
-
documentModeHandler
-
doctypeExpectation
-
checkingNormalization
private boolean checkingNormalization -
scriptingEnabled
private boolean scriptingEnabled -
characterHandlers
-
contentSpacePolicy
-
contentNonXmlCharPolicy
-
commentPolicy
-
namePolicy
-
streamabilityViolationPolicy
-
html4ModeCompatibleWithXhtml1Schemata
private boolean html4ModeCompatibleWithXhtml1Schemata -
mappingLangToXmlLang
private boolean mappingLangToXmlLang -
xmlnsPolicy
-
reportingDoctype
private boolean reportingDoctype -
treeBuilderErrorHandler
-
heuristics
-
errorProfileMap
-
transitionHandler
-
-
Constructor Details
-
HtmlParser
public HtmlParser()
Instantiates the parser with a fatal XML violation policy.
-
HtmlParser
Instantiates the parser with a specific XML violation policy.
- Parameters:
xmlPolicy - the policy
-
-
Method Details
-
newTokenizer
-
lazyInit
private void lazyInit()
This class wraps different tree builders depending on configuration. This method does the work of hiding this from the user of the class.
-
getContentHandler
- Specified by:
getContentHandler in interface XMLReader
-
getDTDHandler
- Specified by:
getDTDHandler in interface XMLReader
-
getEntityResolver
- Specified by:
getEntityResolver in interface XMLReader
-
getErrorHandler
- Specified by:
getErrorHandler in interface XMLReader
-
getFeature
Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.
http://xml.org/sax/features/external-general-entities - false
http://xml.org/sax/features/external-parameter-entities - false
http://xml.org/sax/features/is-standalone - true
http://xml.org/sax/features/lexical-handler/parameter-entities - false
http://xml.org/sax/features/namespaces - true
http://xml.org/sax/features/namespace-prefixes - false
http://xml.org/sax/features/resolve-dtd-uris - true
http://xml.org/sax/features/string-interning - false
http://xml.org/sax/features/unicode-normalization-checking - isCheckingNormalization
http://xml.org/sax/features/use-attributes2 - false
http://xml.org/sax/features/use-locator2 - false
http://xml.org/sax/features/use-entity-resolver2 - false
http://xml.org/sax/features/validation - false
http://xml.org/sax/features/xmlns-uris - false
http://xml.org/sax/features/xml-1.1 - false
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata - isHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang - isMappingLangToXmlLang
http://validator.nu/features/scripting-enabled - isScriptingEnabled
- Specified by:
getFeature in interface XMLReader
- Parameters:
name - feature URI string
- Returns:
a value per the list above
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
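Because the feature list above is exposed through the standard XMLReader interface, it can be inspected without referring to the parser's own getters. The following is a minimal sketch under that assumption; the class FeatureDump and its method names are hypothetical helpers, and the code works against any XMLReader rather than HtmlParser specifically.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical helper for inspecting feature flags on any XMLReader. */
class FeatureDump {
    /** Returns the feature value, or null if the reader rejects the URI. */
    static Boolean featureOrNull(XMLReader reader, String uri) {
        try {
            return reader.getFeature(uri);
        } catch (SAXException e) {
            // Covers SAXNotRecognizedException and SAXNotSupportedException.
            return null;
        }
    }

    /** Queries each URI in order; rejected features map to null. */
    static Map<String, Boolean> dump(XMLReader reader, String... uris) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String uri : uris) {
            result.put(uri, featureOrNull(reader, uri));
        }
        return result;
    }
}
```

Passing the URIs from the list above to dump would be expected to return the listed values for an HtmlParser instance; other XMLReader implementations will typically reject the validator.nu URIs, which this sketch reports as null.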
-
getProperty
Allows XMLReader-level access to non-boolean valued getters.
The properties are mapped as follows:
http://xml.org/sax/properties/document-xml-version - "1.0"
http://xml.org/sax/properties/lexical-handler - getLexicalHandler
http://validator.nu/properties/content-space-policy - getContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy - getContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy - getCommentPolicy
http://validator.nu/properties/xmlns-policy - getXmlnsPolicy
http://validator.nu/properties/name-policy - getNamePolicy
http://validator.nu/properties/streamability-violation-policy - getStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler - getDocumentModeHandler
http://validator.nu/properties/doctype-expectation - getDoctypeExpectation
http://xml.org/sax/features/unicode-normalization-checking
- Specified by:
getProperty in interface XMLReader
- Parameters:
name - property URI string
- Returns:
a value per the list above
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
-
parse
- Specified by:
parse in interface XMLReader
- Throws:
IOException
SAXException
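Since parse is specified by XMLReader, the usual SAX pattern of registering a ContentHandler and then parsing an InputSource should apply. The sketch below illustrates that pattern against any XMLReader; the class ElementCounter and the method countElements are hypothetical names introduced for the example, and nothing here is specific to how HtmlParser builds its events.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical ContentHandler that counts startElement events. */
class ElementCounter extends DefaultHandler {
    private int count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        count++;
    }

    int getCount() { return count; }

    /** Parses in-memory markup with the given reader and returns the element count. */
    static int countElements(XMLReader reader, String markup)
            throws IOException, SAXException {
        ElementCounter counter = new ElementCounter();
        reader.setContentHandler(counter);
        // Wrap the string in a character stream so no encoding sniffing is needed.
        reader.parse(new InputSource(new StringReader(markup)));
        return counter.getCount();
    }
}
```

With an HtmlParser instance as the reader, the count would also include any elements the HTML tree-building rules imply, so it may differ from the count an XML parser reports for the same input.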
-
parseFragment
Parses a fragment.
- Parameters:
input - the input to parse
context - the name of the context element
- Throws:
IOException
SAXException
-
tokenize
- Parameters:
is -
- Throws:
SAXException
IOException
MalformedURLException
-
parse
- Specified by:
parse in interface XMLReader
- Throws:
IOException
SAXException
-
setContentHandler
- Specified by:
setContentHandler in interface XMLReader
-
setLexicalHandler
Sets the lexical handler.
- Parameters:
handler - the handler
-
setDTDHandler
- Specified by:
setDTDHandler in interface XMLReader
-
setEntityResolver
- Specified by:
setEntityResolver in interface XMLReader
-
setErrorHandler
- Specified by:
setErrorHandler in interface XMLReader
-
setTransitionHandler
-
setTreeBuilderErrorHandlerOverride
Deprecated. For Validator.nu internal use
-
setFeature
public void setFeature(String name, boolean value) throws SAXNotRecognizedException, SAXNotSupportedException
Sets a boolean feature without having to use non-XMLReader setters directly.
The supported features are:
http://xml.org/sax/features/unicode-normalization-checking - setCheckingNormalization
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata - setHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang - setMappingLangToXmlLang
http://validator.nu/features/scripting-enabled - setScriptingEnabled
- Specified by:
setFeature in interface XMLReader
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
-
setProperty
public void setProperty(String name, Object value) throws SAXNotRecognizedException, SAXNotSupportedException
Sets a non-boolean property without having to use non-XMLReader setters directly.
http://xml.org/sax/properties/lexical-handler - setLexicalHandler
http://validator.nu/properties/content-space-policy - setContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy - setContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy - setCommentPolicy
http://validator.nu/properties/xmlns-policy - setXmlnsPolicy
http://validator.nu/properties/name-policy - setNamePolicy
http://validator.nu/properties/streamability-violation-policy - setStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler - setDocumentModeHandler
http://validator.nu/properties/doctype-expectation - setDoctypeExpectation
http://validator.nu/properties/xml-policy - setXmlPolicy
- Specified by:
setProperty in interface XMLReader
- Throws:
SAXNotRecognizedException
SAXNotSupportedException
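Code that only holds an XMLReader reference may not know whether the underlying parser recognizes a given property URI. One way to handle that, sketched here with a hypothetical helper named PropertySetter, is to attempt the call and report whether it was accepted. The helper is an illustration built on the standard setProperty contract, not something this class provides.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper for setting properties on an arbitrary XMLReader. */
class PropertySetter {
    /**
     * Tries to set the property and returns true on success, or false if
     * the reader does not recognize the URI or rejects the value.
     */
    static boolean trySetProperty(XMLReader reader, String uri, Object value) {
        try {
            reader.setProperty(uri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }
}
```

For the validator.nu policy properties listed above, the value would presumably be an XmlViolationPolicy constant such as ALTER_INFOSET, ALLOW or FATAL, matching the parameter type of the corresponding setter.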
-
isCheckingNormalization
public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.
- Returns:
true if NFC normalization of source is being checked.
-
setCheckingNormalization
public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.
- Parameters:
enable - true to check normalization
-
setCommentPolicy
Sets the policy for consecutive hyphens in comments.
- Parameters:
commentPolicy - the policy
-
setContentNonXmlCharPolicy
Sets the policy for non-XML characters except white space.
- Parameters:
contentNonXmlCharPolicy - the policy
-
setContentSpacePolicy
Sets the policy for non-XML white space.
- Parameters:
contentSpacePolicy - the policy
-
isScriptingEnabled
public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.
- Returns:
true if enabled
-
setScriptingEnabled
public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.
- Parameters:
scriptingEnabled - true to enable
-
getDoctypeExpectation
Returns the doctype expectation.
- Returns:
the doctypeExpectation
-
setDoctypeExpectation
Sets the doctype expectation.
- Parameters:
doctypeExpectation - the doctypeExpectation to set
-
getDocumentModeHandler
Returns the document mode handler.
- Returns:
the documentModeHandler
-
setDocumentModeHandler
Sets the document mode handler.
- Parameters:
documentModeHandler - the documentModeHandler to set
-
getStreamabilityViolationPolicy
Returns the streamabilityViolationPolicy.
- Returns:
the streamabilityViolationPolicy
-
setStreamabilityViolationPolicy
Sets the streamabilityViolationPolicy.
- Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set
-
setHtml4ModeCompatibleWithXhtml1Schemata
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Parameters:
html4ModeCompatibleWithXhtml1Schemata -
-
getDocumentLocator
Returns the Locator during parse.
- Returns:
the Locator
-
isHtml4ModeCompatibleWithXhtml1Schemata
public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Returns:
the html4ModeCompatibleWithXhtml1Schemata
-
setMappingLangToXmlLang
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.
- Parameters:
mappingLangToXmlLang -
-
isMappingLangToXmlLang
public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
- Returns:
the mappingLangToXmlLang
-
setXmlnsPolicy
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)
- Parameters:
xmlnsPolicy -
-
getXmlnsPolicy
Returns the xmlnsPolicy.
- Returns:
the xmlnsPolicy
-
getLexicalHandler
Returns the lexicalHandler.
- Returns:
the lexicalHandler
-
getCommentPolicy
Returns the commentPolicy.
- Returns:
the commentPolicy
-
getContentNonXmlCharPolicy
Returns the contentNonXmlCharPolicy.
- Returns:
the contentNonXmlCharPolicy
-
getContentSpacePolicy
Returns the contentSpacePolicy.
- Returns:
the contentSpacePolicy
-
setReportingDoctype
public void setReportingDoctype(boolean reportingDoctype)
- Parameters:
reportingDoctype -
-
isReportingDoctype
public boolean isReportingDoctype()
Returns the reportingDoctype.
- Returns:
the reportingDoctype
-
setErrorProfile
- Parameters:
errorProfile -
-
setNamePolicy
The policy for non-NCName element and attribute names.
- Parameters:
namePolicy -
-
setHeuristics
Sets the encoding sniffing heuristics.
- Parameters:
heuristics - the heuristics to set
-
getHeuristics
-
setXmlPolicy
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.
- Parameters:
xmlPolicy -
-
getNamePolicy
The policy for non-NCName element and attribute names.
- Returns:
the namePolicy
-
setBogusXmlnsPolicy
Deprecated. Does nothing.
-
getBogusXmlnsPolicy
Deprecated. Returns XmlViolationPolicy.ALTER_INFOSET.
- Returns:
XmlViolationPolicy.ALTER_INFOSET
-
addCharacterHandler
-