Class HtmlBuilder
- java.lang.Object
-
- nu.xom.Builder
-
- nu.validator.htmlparser.xom.HtmlBuilder
-
public class HtmlBuilder extends nu.xom.Builder
This class implements an HTML5 parser that exposes data through the XOM interface.
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
The doctype is not represented in the tree.
The document mode is represented via the Mode interface on the Document node if the node implements that interface (depends on the node factory used).
The form pointer is stored if the node factory supports storing it.
This package has its own node factory class because the official XOM node factory may return multiple nodes instead of one, which would confuse the assumptions of the DOM-oriented HTML5 parsing algorithm.
-
-
Field Summary
Fields (Modifier and Type, Field):
private java.util.List<CharacterHandler> characterHandlers
private boolean checkingNormalization
private XmlViolationPolicy commentPolicy
private XmlViolationPolicy contentNonXmlCharPolicy
private XmlViolationPolicy contentSpacePolicy
private DoctypeExpectation doctypeExpectation
private DocumentModeHandler documentModeHandler
private Driver driver
private org.xml.sax.EntityResolver entityResolver
private org.xml.sax.ErrorHandler errorHandler
private Heuristics heuristics
private boolean html4ModeCompatibleWithXhtml1Schemata
private boolean mappingLangToXmlLang
private XmlViolationPolicy namePolicy
private boolean reportingDoctype
private boolean scriptingEnabled
private SimpleNodeFactory simpleNodeFactory
private XmlViolationPolicy streamabilityViolationPolicy
private TransitionHandler transitionHandler
private XOMTreeBuilder treeBuilder
private org.xml.sax.ErrorHandler treeBuilderErrorHandler
private XmlViolationPolicy xmlnsPolicy
-
Constructor Summary
Constructors (Constructor: Description):
HtmlBuilder(): Constructor with default node factory and fatal XML violation policy.
HtmlBuilder(XmlViolationPolicy xmlPolicy): Constructor with default node factory and given XML violation policy.
HtmlBuilder(SimpleNodeFactory nodeFactory): Constructor with given node factory and fatal XML violation policy.
HtmlBuilder(SimpleNodeFactory nodeFactory, XmlViolationPolicy xmlPolicy): Constructor with given node factory and given XML violation policy.
-
Method Summary
Methods (Modifier and Type, Method: Description):
void addCharacterHandler(CharacterHandler characterHandler)
nu.xom.Document build(java.io.File file): Parse from File.
nu.xom.Document build(java.io.InputStream stream): Parse from InputStream.
nu.xom.Document build(java.io.InputStream stream, java.lang.String uri): Parse from InputStream.
nu.xom.Document build(java.io.Reader stream): Parse from Reader.
nu.xom.Document build(java.io.Reader stream, java.lang.String uri): Parse from Reader.
nu.xom.Document build(java.lang.String uri): Parse from URI.
nu.xom.Document build(java.lang.String content, java.lang.String uri): Parse from String.
nu.xom.Document build(org.xml.sax.InputSource is): Parse from SAX InputSource.
nu.xom.Nodes buildFragment(org.xml.sax.InputSource is, java.lang.String context): Parse a fragment from SAX InputSource.
XmlViolationPolicy getBogusXmlnsPolicy(): Deprecated.
XmlViolationPolicy getCommentPolicy(): Returns the commentPolicy.
XmlViolationPolicy getContentNonXmlCharPolicy(): Returns the contentNonXmlCharPolicy.
XmlViolationPolicy getContentSpacePolicy(): Returns the contentSpacePolicy.
DoctypeExpectation getDoctypeExpectation(): Returns the doctype expectation.
org.xml.sax.Locator getDocumentLocator(): Returns the Locator during parse.
DocumentModeHandler getDocumentModeHandler(): Returns the document mode handler.
Heuristics getHeuristics()
XmlViolationPolicy getNamePolicy(): The policy for non-NCName element and attribute names.
SimpleNodeFactory getSimpleNodeFactory(): Gets the node factory.
XmlViolationPolicy getStreamabilityViolationPolicy(): Returns the streamabilityViolationPolicy.
XmlViolationPolicy getXmlnsPolicy(): Returns the xmlnsPolicy.
boolean isCheckingNormalization(): Indicates whether NFC normalization of source is being checked.
boolean isHtml4ModeCompatibleWithXhtml1Schemata(): Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
boolean isMappingLangToXmlLang(): Whether lang is mapped to xml:lang.
boolean isReportingDoctype(): Returns the reportingDoctype.
boolean isScriptingEnabled(): Whether the parser considers scripting to be enabled for noscript treatment.
private void lazyInit(): This class wraps different tree builders depending on configuration.
private Tokenizer newTokenizer(TokenHandler handler, boolean newAttributesEachTime)
void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy): Deprecated.
void setCheckingNormalization(boolean enable): Toggles the checking of the NFC normalization of source.
void setCommentPolicy(XmlViolationPolicy commentPolicy): Sets the policy for consecutive hyphens in comments.
void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy): Sets the policy for non-XML characters except white space.
void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy): Sets the policy for non-XML white space.
void setDoctypeExpectation(DoctypeExpectation doctypeExpectation): Sets the doctype expectation.
void setDocumentModeHandler(DocumentModeHandler documentModeHandler): Sets the document mode handler.
void setEntityResolver(org.xml.sax.EntityResolver resolver)
void setErrorHandler(org.xml.sax.ErrorHandler handler)
void setHeuristics(Heuristics heuristics): Sets the encoding sniffing heuristics.
void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata): Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
void setIgnoringComments(boolean ignoreComments): Sets whether comment nodes appear in the tree.
void setMappingLangToXmlLang(boolean mappingLangToXmlLang): Whether lang is mapped to xml:lang.
void setNamePolicy(XmlViolationPolicy namePolicy): The policy for non-NCName element and attribute names.
void setReportingDoctype(boolean reportingDoctype)
void setScriptingEnabled(boolean scriptingEnabled): Sets whether the parser considers scripting to be enabled for noscript treatment.
void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy): Sets the streamabilityViolationPolicy.
void setTransitionHander(TransitionHandler handler)
void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy): Whether the xmlns attribute on the root element is passed through.
void setXmlPolicy(XmlViolationPolicy xmlPolicy): This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
private void tokenize(org.xml.sax.InputSource is)
-
-
-
Field Detail
-
driver
private Driver driver
-
treeBuilder
private final XOMTreeBuilder treeBuilder
-
simpleNodeFactory
private final SimpleNodeFactory simpleNodeFactory
-
entityResolver
private org.xml.sax.EntityResolver entityResolver
-
errorHandler
private org.xml.sax.ErrorHandler errorHandler
-
documentModeHandler
private DocumentModeHandler documentModeHandler
-
doctypeExpectation
private DoctypeExpectation doctypeExpectation
-
checkingNormalization
private boolean checkingNormalization
-
scriptingEnabled
private boolean scriptingEnabled
-
characterHandlers
private final java.util.List<CharacterHandler> characterHandlers
-
contentSpacePolicy
private XmlViolationPolicy contentSpacePolicy
-
contentNonXmlCharPolicy
private XmlViolationPolicy contentNonXmlCharPolicy
-
commentPolicy
private XmlViolationPolicy commentPolicy
-
namePolicy
private XmlViolationPolicy namePolicy
-
streamabilityViolationPolicy
private XmlViolationPolicy streamabilityViolationPolicy
-
html4ModeCompatibleWithXhtml1Schemata
private boolean html4ModeCompatibleWithXhtml1Schemata
-
mappingLangToXmlLang
private boolean mappingLangToXmlLang
-
xmlnsPolicy
private XmlViolationPolicy xmlnsPolicy
-
reportingDoctype
private boolean reportingDoctype
-
treeBuilderErrorHandler
private org.xml.sax.ErrorHandler treeBuilderErrorHandler
-
heuristics
private Heuristics heuristics
-
transitionHandler
private TransitionHandler transitionHandler
-
-
Constructor Detail
-
HtmlBuilder
public HtmlBuilder()
Constructor with default node factory and fatal XML violation policy.
-
HtmlBuilder
public HtmlBuilder(SimpleNodeFactory nodeFactory)
Constructor with given node factory and fatal XML violation policy.
- Parameters:
nodeFactory - the factory
-
HtmlBuilder
public HtmlBuilder(XmlViolationPolicy xmlPolicy)
Constructor with default node factory and given XML violation policy.
- Parameters:
xmlPolicy - the policy
-
HtmlBuilder
public HtmlBuilder(SimpleNodeFactory nodeFactory, XmlViolationPolicy xmlPolicy)
Constructor with given node factory and given XML violation policy.
- Parameters:
nodeFactory - the factory
xmlPolicy - the policy
-
-
Method Detail
-
newTokenizer
private Tokenizer newTokenizer(TokenHandler handler, boolean newAttributesEachTime)
-
lazyInit
private void lazyInit()
This class wraps different tree builders depending on configuration. This method does the work of hiding this from the user of the class.
-
tokenize
private void tokenize(org.xml.sax.InputSource is) throws nu.xom.ParsingException, java.io.IOException, java.net.MalformedURLException
- Throws:
nu.xom.ParsingException
java.io.IOException
java.net.MalformedURLException
-
build
public nu.xom.Document build(org.xml.sax.InputSource is) throws nu.xom.ParsingException, java.io.IOException
Parse from SAX InputSource.
- Parameters:
is - the InputSource
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
-
buildFragment
public nu.xom.Nodes buildFragment(org.xml.sax.InputSource is, java.lang.String context) throws java.io.IOException, nu.xom.ParsingException
Parse a fragment from SAX InputSource.
- Parameters:
is - the InputSource
context - the name of the context element
- Returns:
- the fragment
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
-
build
public nu.xom.Document build(java.io.File file) throws nu.xom.ParsingException, nu.xom.ValidityException, java.io.IOException
Parse from File.
- Overrides:
build in class nu.xom.Builder
- Parameters:
file - the file
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
- See Also:
Builder.build(java.io.File)
-
build
public nu.xom.Document build(java.io.InputStream stream, java.lang.String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, java.io.IOException
Parse from InputStream.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the stream
uri - the base URI
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
- See Also:
Builder.build(java.io.InputStream, java.lang.String)
-
build
public nu.xom.Document build(java.io.InputStream stream) throws nu.xom.ParsingException, nu.xom.ValidityException, java.io.IOException
Parse from InputStream.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the stream
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
- See Also:
Builder.build(java.io.InputStream)
-
build
public nu.xom.Document build(java.io.Reader stream, java.lang.String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, java.io.IOException
Parse from Reader.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the reader
uri - the base URI
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
- See Also:
Builder.build(java.io.Reader, java.lang.String)
-
build
public nu.xom.Document build(java.io.Reader stream) throws nu.xom.ParsingException, nu.xom.ValidityException, java.io.IOException
Parse from Reader.
- Overrides:
build in class nu.xom.Builder
- Parameters:
stream - the reader
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
- See Also:
Builder.build(java.io.Reader)
-
build
public nu.xom.Document build(java.lang.String content, java.lang.String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, java.io.IOException
Parse from String.
- Overrides:
build in class nu.xom.Builder
- Parameters:
content - the HTML source as string
uri - the base URI
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
- See Also:
Builder.build(java.lang.String, java.lang.String)
-
build
public nu.xom.Document build(java.lang.String uri) throws nu.xom.ParsingException, nu.xom.ValidityException, java.io.IOException
Parse from URI.
- Overrides:
build in class nu.xom.Builder
- Parameters:
uri - the URI of the document
- Returns:
- the document
- Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
- See Also:
Builder.build(java.lang.String)
-
getSimpleNodeFactory
public SimpleNodeFactory getSimpleNodeFactory()
Gets the node factory
-
setEntityResolver
public void setEntityResolver(org.xml.sax.EntityResolver resolver)
- See Also:
XMLReader.setEntityResolver(org.xml.sax.EntityResolver)
-
setErrorHandler
public void setErrorHandler(org.xml.sax.ErrorHandler handler)
- See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
-
setTransitionHander
public void setTransitionHander(TransitionHandler handler)
-
isCheckingNormalization
public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.
- Returns:
true if NFC normalization of source is being checked.
- See Also:
nu.validator.htmlparser.impl.Tokenizer#isCheckingNormalization()
-
setCheckingNormalization
public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.
- Parameters:
enable - true to check normalization
- See Also:
nu.validator.htmlparser.impl.Tokenizer#setCheckingNormalization(boolean)
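For readers unfamiliar with the term, the sketch below shows what "being in NFC" means for a piece of source text, using the JDK's java.text.Normalizer. It is only an illustration of the property being checked; the class name NfcCheckSketch is hypothetical, and nothing here is claimed to be how the tokenizer performs the check.

```java
import java.text.Normalizer;

public class NfcCheckSketch {
    /** Returns true if the text is already in Unicode Normalization Form C. */
    static boolean isNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";     // precomposed e-acute: already NFC
        String decomposed = "cafe\u0301";  // 'e' followed by a combining acute accent: not NFC
        System.out.println(isNfc(composed));    // prints "true"
        System.out.println(isNfc(decomposed));  // prints "false"
    }
}
```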
-
setCommentPolicy
public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.
- Parameters:
commentPolicy - the policy
- See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
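XML 1.0 does not allow "--" inside a comment, which is why a policy is needed here. The following is a minimal sketch of how the three kinds of policy could behave on comment text. The Policy enum is a hypothetical stand-in for XmlViolationPolicy (ALLOW is an assumed third value alongside FATAL and ALTER_INFOSET), and rewriting "--" as "- -" is an illustrative choice, not a documented behavior of this class.

```java
public class CommentPolicySketch {
    /** Hypothetical stand-in for XmlViolationPolicy; not the library's enum. */
    enum Policy { ALLOW, FATAL, ALTER_INFOSET }

    /**
     * Applies a policy to comment text that may contain consecutive hyphens.
     * The "- -" rewrite used for ALTER_INFOSET is an assumption for illustration.
     */
    static String applyCommentPolicy(String comment, Policy policy) {
        if (!comment.contains("--")) {
            return comment; // nothing to do: no consecutive hyphens
        }
        switch (policy) {
            case ALLOW:
                return comment; // pass the violation through unchanged
            case FATAL:
                throw new IllegalArgumentException("Consecutive hyphens in comment");
            default: // ALTER_INFOSET
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < comment.length(); i++) {
                    char c = comment.charAt(i);
                    // Insert a space whenever a hyphen would directly follow a hyphen.
                    if (c == '-' && sb.length() > 0 && sb.charAt(sb.length() - 1) == '-') {
                        sb.append(' ');
                    }
                    sb.append(c);
                }
                return sb.toString();
        }
    }
}
```

With this sketch, "a--b" becomes "a- -b" under ALTER_INFOSET, stays "a--b" under ALLOW, and raises an exception under FATAL.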
-
setContentNonXmlCharPolicy
public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.
- Parameters:
contentNonXmlCharPolicy - the policy
- See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
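As background, "non-XML characters" can be read against the Char production of XML 1.0. The sketch below tests a code point against that production and shows one possible alteration, replacing offending characters with U+FFFD. The class and method names are hypothetical, the replacement character is an assumption, and the sketch does not separate out white space the way the two content policies on this page do.

```java
public class XmlCharSketch {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every non-XML character with U+FFFD (an illustrative choice). */
    static String alterInfoset(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
                sb.appendCodePoint(isXml10Char(cp) ? cp : 0xFFFD));
        return sb.toString();
    }
}
```

Note that form feed (U+000C) fails this test even though HTML treats it as white space.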
-
setContentSpacePolicy
public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.
- Parameters:
contentSpacePolicy - the policy
- See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
isScriptingEnabled
public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.
- Returns:
true if enabled
- See Also:
TreeBuilder.isScriptingEnabled()
-
setScriptingEnabled
public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.
- Parameters:
scriptingEnabled - true to enable
- See Also:
TreeBuilder.setScriptingEnabled(boolean)
-
getDoctypeExpectation
public DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.
- Returns:
- the doctypeExpectation
-
setDoctypeExpectation
public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.
- Parameters:
doctypeExpectation - the doctypeExpectation to set
- See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)
-
getDocumentModeHandler
public DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.
- Returns:
- the documentModeHandler
-
setDocumentModeHandler
public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.
- Parameters:
documentModeHandler - the documentModeHandler to set
- See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)
-
getStreamabilityViolationPolicy
public XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.
- Returns:
- the streamabilityViolationPolicy
-
setStreamabilityViolationPolicy
public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.
- Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set
-
setHtml4ModeCompatibleWithXhtml1Schemata
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Parameters:
html4ModeCompatibleWithXhtml1Schemata -
-
getDocumentLocator
public org.xml.sax.Locator getDocumentLocator()
Returns the Locator during parse.
- Returns:
- the Locator
-
isHtml4ModeCompatibleWithXhtml1Schemata
public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
- Returns:
- the html4ModeCompatibleWithXhtml1Schemata
-
setMappingLangToXmlLang
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.
- Parameters:
mappingLangToXmlLang -
- See Also:
Tokenizer.setMappingLangToXmlLang(boolean)
-
isMappingLangToXmlLang
public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
- Returns:
- the mappingLangToXmlLang
-
setXmlnsPolicy
public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)
- Parameters:
xmlnsPolicy -
- See Also:
Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
getXmlnsPolicy
public XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.
- Returns:
- the xmlnsPolicy
-
getCommentPolicy
public XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.
- Returns:
- the commentPolicy
-
getContentNonXmlCharPolicy
public XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.
- Returns:
- the contentNonXmlCharPolicy
-
getContentSpacePolicy
public XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.
- Returns:
- the contentSpacePolicy
-
setReportingDoctype
public void setReportingDoctype(boolean reportingDoctype)
- Parameters:
reportingDoctype -
- See Also:
TreeBuilder.setReportingDoctype(boolean)
-
isReportingDoctype
public boolean isReportingDoctype()
Returns the reportingDoctype.
- Returns:
- the reportingDoctype
-
setNamePolicy
public void setNamePolicy(XmlViolationPolicy namePolicy)
The policy for non-NCName element and attribute names.
- Parameters:
namePolicy -
- See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
-
setHeuristics
public void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.
- Parameters:
heuristics - the heuristics to set
- See Also:
nu.validator.htmlparser.impl.Tokenizer#setHeuristics(nu.validator.htmlparser.common.Heuristics)
-
getHeuristics
public Heuristics getHeuristics()
-
setXmlPolicy
public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.
- Parameters:
xmlPolicy -
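The effect of the catch-all can be pictured as five individual assignments with the streamability policy left alone. The sketch below models this with hypothetical stand-in types (Policy and Settings are not library classes, and the initial field values are arbitrary). Since setXmlnsPolicy does not allow FATAL, the sketch assumes FATAL falls back to ALTER_INFOSET for xmlns; this page does not state how that case is actually handled.

```java
public class XmlPolicySketch {
    /** Hypothetical stand-in for XmlViolationPolicy. */
    enum Policy { ALLOW, FATAL, ALTER_INFOSET }

    /** Hypothetical holder for the policy settings named on this page. */
    static class Settings {
        // Initial values are arbitrary placeholders, not the library's defaults.
        Policy namePolicy = Policy.ALTER_INFOSET;
        Policy xmlnsPolicy = Policy.ALTER_INFOSET;
        Policy contentSpacePolicy = Policy.ALTER_INFOSET;
        Policy contentNonXmlCharPolicy = Policy.ALTER_INFOSET;
        Policy commentPolicy = Policy.ALTER_INFOSET;
        Policy streamabilityViolationPolicy = Policy.ALLOW;

        /** Sets the five general XML policies in one call. */
        void setXmlPolicy(Policy p) {
            namePolicy = p;
            // Assumption: FATAL is not allowed for xmlns, so use ALTER_INFOSET instead.
            xmlnsPolicy = (p == Policy.FATAL) ? Policy.ALTER_INFOSET : p;
            contentSpacePolicy = p;
            contentNonXmlCharPolicy = p;
            commentPolicy = p;
            // streamabilityViolationPolicy is intentionally not modified here.
        }
    }
}
```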
-
getNamePolicy
public XmlViolationPolicy getNamePolicy()
The policy for non-NCName element and attribute names.
- Returns:
- the namePolicy
-
setBogusXmlnsPolicy
public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated.
Does nothing.
-
getBogusXmlnsPolicy
public XmlViolationPolicy getBogusXmlnsPolicy()
Deprecated.
Returns XmlViolationPolicy.ALTER_INFOSET.
- Returns:
XmlViolationPolicy.ALTER_INFOSET
-
addCharacterHandler
public void addCharacterHandler(CharacterHandler characterHandler)
-
setIgnoringComments
public void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.
- Parameters:
ignoreComments - true to ignore comments
- See Also:
TreeBuilder.setIgnoringComments(boolean)
-
-