Class HTMLTagBalancer
java.lang.Object
org.htmlunit.cyberneko.HTMLTagBalancer
- All Implemented Interfaces:
HTMLComponent, XMLComponent, XMLDocumentFilter, XMLDocumentSource, XMLDocumentHandler
Balances tags in an HTML document. This component receives document events
and tries to correct many common mistakes that human (and computer) HTML
document authors make. This tag balancer can:
- add missing parent elements;
- automatically close elements with optional end tags; and
- handle mis-matched inline element tags.
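The second and third points above can be illustrated with a small standalone sketch. This is not the HTMLTagBalancer algorithm: the class name `TagBalancerSketch`, the regular-expression tokenizer, and the two-element set of optional-end-tag elements are all assumptions made for the example, and adding missing parent elements is not attempted.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy tag balancer for illustration only; not the HTMLTagBalancer implementation. */
final class TagBalancerSketch {
    // Assumption: a tiny subset of the elements whose end tags are optional.
    private static final Set<String> OPTIONAL_END = Set.of("p", "li");
    // Matches attribute-free start/end tags, or a run of character data.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // currently open elements
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {              // character data passes through
                out.append(m.group(3));
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {            // start tag
                // A new <p> or <li> implicitly closes an open one of the same kind.
                if (OPTIONAL_END.contains(name) && name.equals(open.peek())) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {      // end tag with a matching start
                // Close everything that was opened after the matching start tag.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // An end tag without a matching start tag is dropped.
        }
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

With this sketch, `TagBalancerSketch.balance("<ul><li>a<li>b</ul>")` returns `<ul><li>a</li><li>b</li></ul>`. The real component works on XNI document events rather than strings.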
This component recognizes the following features:
- http://cyberneko.org/html/features/augmentations
- http://cyberneko.org/html/features/report-errors
- http://cyberneko.org/html/features/balance-tags/document-fragment
- http://cyberneko.org/html/features/balance-tags/ignore-outside-content
This component recognizes the following properties:
- http://cyberneko.org/html/properties/names/elems
- http://cyberneko.org/html/properties/names/attrs
- http://cyberneko.org/html/properties/error-reporter
- http://cyberneko.org/html/properties/balance-tags/current-stack
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
- (package private) static class: Structure to hold information about an element placed in buffer to be consumed later
- static class: Element info for each start element.
- static class: Unsynchronized stack of element information.
-
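As a rough illustration of what an "unsynchronized stack" means here, the following sketch shows an array-backed stack with no locking, in contrast to `java.util.Stack`, whose methods are synchronized. The class name `UnsyncStackSketch` and its methods are hypothetical and are not taken from the nested class documented above.

```java
import java.util.Arrays;

/** Hypothetical array-backed stack without synchronization; single-threaded use only. */
final class UnsyncStackSketch<T> {
    private Object[] data = new Object[8];
    private int top;                               // number of stored elements

    void push(T item) {
        if (top == data.length) {
            data = Arrays.copyOf(data, top * 2);   // grow on demand
        }
        data[top++] = item;
    }

    /** Returns the most recently pushed element, or null if the stack is empty. */
    @SuppressWarnings("unchecked")
    T peek() {
        return top == 0 ? null : (T) data[top - 1];
    }

    /** Removes and returns the most recently pushed element, or null if empty. */
    @SuppressWarnings("unchecked")
    T pop() {
        if (top == 0) {
            return null;
        }
        T item = (T) data[--top];
        data[top] = null;                          // release the reference
        return item;
    }

    int size() {
        return top;
    }
}
```

Skipping synchronization is reasonable when a single parser thread owns the stack, which is the usual situation for a document filter.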
Field Summary
Fields
Modifier and Type, Field, Description
- protected static final String: Include infoset augmentations.
- protected static final String: Document fragment balancing only.
- private XMLDocumentHandler: The document handler.
- private XMLDocumentSource
- private final List<HTMLTagBalancer.ElementEntry>
- protected static final String: Error reporter.
- protected boolean: Allows self closing iframe tags.
- protected boolean: Allows self closing tags.
- protected boolean: Include infoset augmentations.
- protected boolean: Document fragment balancing only.
- protected final HTMLTagBalancer.InfoStack: The element stack.
- protected HTMLErrorReporter: Error reporter.
- protected boolean: Ignore outside content.
- protected final HTMLTagBalancer.InfoStack: The inline stack.
- protected short: Modify HTML element names.
- protected boolean: Namespaces.
- protected boolean: True if a form is in the stack (allow to discard opening of nested forms)
- protected boolean: True if a select is in the stack
- protected boolean: True if a svg is in the stack (no parent checking takes place)
- private boolean
- private boolean
- private final QName: A qualified name.
- static final String: EXPERIMENTAL: may change in next release. Name of the property holding the stack of elements in which context a document fragment should be parsed.
- private QName[]: Stack of elements determining the context in which a document fragment should be parsed
- private int
- protected boolean: Report errors.
- protected boolean: True if seen anything.
- protected boolean: True if seen body element.
- private boolean
- private boolean: True if seen non whitespace characters.
- protected boolean: True if root element has been seen.
- private boolean: True if seen frameset element.
- protected boolean: True if seen head element.
- protected boolean: True if root element has been seen.
- protected boolean: True if seen the end of the document element.
- protected boolean: Template document fragment balancing only.
- private final HTMLConfiguration
- protected static final String: Ignore outside content.
- private final LostText
- protected static final String: Modify HTML attribute names: { "upper", "lower", "default" }.
- protected static final String: Modify HTML element names: { "upper", "lower", "default" }.
- private static final short: Lowercase HTML names.
- private static final short: Don't modify HTML names.
- private static final short: Uppercase HTML names.
- protected static final String: Namespaces.
- private static final String[]: Recognized features.
- private static final Boolean[]: Recognized features defaults.
- private static final String[]: Recognized properties.
- private static final Object[]: Recognized properties defaults.
- protected static final String: Report errors.
- private static final HTMLEventInfo: Synthesized event info item.
- protected HTMLTagBalancingListener
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
- private void addBodyIfNeeded(short element)
- protected final void callEndElement(QName element, Augmentations augs)
- protected final void callStartElement(QName element, XMLAttributes attrs, Augmentations augs)
- void characters(XMLString text, Augmentations augs): Characters.
- void comment(XMLString text, Augmentations augs): Comment.
- private void: Consume elements that have been buffered, like that are first consumed at the end of document
- private void
- private QName createQName(String tagName)
- void doctypeDecl(String rootElementName, String publicId, String systemId, Augmentations augs): Doctype declaration.
- void emptyElement(QName element, XMLAttributes attrs, Augmentations augs): Empty element.
- void endCDATA(Augmentations augs): End CDATA section.
- void endDocument(Augmentations augs): End document.
- void endElement(QName element, Augmentations augs): End element.
- private void: Generates a missing (which creates missing when needed)
- private boolean forceStartElement(QName elem, XMLAttributes attrs, Augmentations augs): Forces an element start, taking care to set the information that allows startElement to "see" that the element has been forced.
- Returns the document handler.
- protected HTMLElements.Element getElement(QName elementName)
- protected final int getElementDepth(HTMLElements.Element element)
- getFeatureDefault(String featureId): Returns the default state for a feature.
- protected static short getNamesValue(String value)
- protected int getParentDepth(HTMLElements.Element[] parents, short bounds)
- getPropertyDefault(String propertyId): Returns the default state for a property.
- String[]: Returns recognized features.
- String[]: Returns recognized properties.
- protected static String modifyName(String name, short mode)
- private void notifyDiscardedEndElement(QName element, Augmentations augs): Notifies the tagBalancingListener (if any) of an ignored end element
- private void notifyDiscardedStartElement(QName elem, XMLAttributes attrs, Augmentations augs): Notifies the tagBalancingListener (if any) of an ignored start element
- void processingInstruction(String target, XMLString data, Augmentations augs): Processing instruction.
- void reset(XMLComponentManager manager): Resets the component.
- void setDocumentHandler(XMLDocumentHandler handler): Sets the document handler.
- void setDocumentSource(XMLDocumentSource source): Sets the document source.
- void setFeature(String featureId, boolean state): Sets a feature.
- void setProperty(String propertyId, Object value): Sets a property.
- (package private) void setTagBalancingListener(HTMLTagBalancingListener tagBalancingListener)
- void startCDATA(Augmentations augs): Start CDATA section.
- void startDocument(XMLLocator locator, String encoding, NamespaceContext nscontext, Augmentations augs): Start document.
- void startElement(QName elem, XMLAttributes attrs, Augmentations augs): Start element.
- protected final Augmentations
- void xmlDecl(String version, String encoding, String standalone, Augmentations augs): XML declaration.
-
Field Details
-
NAMESPACES
-
AUGMENTATIONS
-
REPORT_ERRORS
-
DOCUMENT_FRAGMENT
-
IGNORE_OUTSIDE_CONTENT
-
RECOGNIZED_FEATURES
Recognized features. -
RECOGNIZED_FEATURES_DEFAULTS
Recognized features defaults. -
NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.
-
NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.
-
ERROR_REPORTER
-
FRAGMENT_CONTEXT_STACK
EXPERIMENTAL: may change in next release. Name of the property holding the stack of elements in which context a document fragment should be parsed.
-
RECOGNIZED_PROPERTIES
Recognized properties. -
RECOGNIZED_PROPERTIES_DEFAULTS
Recognized properties defaults. -
NAMES_NO_CHANGE
private static final short NAMES_NO_CHANGE
Don't modify HTML names.
-
NAMES_UPPERCASE
private static final short NAMES_UPPERCASE
Uppercase HTML names.
-
NAMES_LOWERCASE
private static final short NAMES_LOWERCASE
Lowercase HTML names.
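To show how the three mode constants and the "upper" / "lower" / "default" property values could fit together, here is a minimal sketch. The method names mirror `getNamesValue(String)` and `modifyName(String, short)` from the method summary, but the bodies, the numeric constant values, and the class name `NamesModeSketch` are assumptions; the source does not show the actual implementation.

```java
import java.util.Locale;

/** Illustrative only: one plausible mapping from property values to name modes. */
final class NamesModeSketch {
    // Assumption: the numeric values are placeholders; the page does not list them.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Maps "upper" and "lower" to their modes; anything else means "no change". */
    static short getNamesValue(String value) {
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        return NAMES_NO_CHANGE;
    }

    /** Applies the given mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }
}
```

For example, `NamesModeSketch.modifyName("Div", NamesModeSketch.getNamesValue("upper"))` yields `DIV`, while the "default" value leaves the name untouched.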
-
SYNTHESIZED_ITEM
Synthesized event info item. -
fNamespaces
protected boolean fNamespaces
Namespaces.
-
fAugmentations
protected boolean fAugmentations
Include infoset augmentations.
-
fReportErrors
protected boolean fReportErrors
Report errors.
-
fDocumentFragment
protected boolean fDocumentFragment
Document fragment balancing only.
-
fTemplateFragment
protected boolean fTemplateFragment
Template document fragment balancing only.
-
fIgnoreOutsideContent
protected boolean fIgnoreOutsideContent
Ignore outside content.
-
fAllowSelfclosingIframe
protected boolean fAllowSelfclosingIframe
Allows self closing iframe tags.
-
fAllowSelfclosingTags
protected boolean fAllowSelfclosingTags
Allows self closing tags.
-
fNamesElems
protected short fNamesElems
Modify HTML element names.
-
fErrorReporter
Error reporter. -
documentSource_
-
documentHandler_
The document handler. -
fElementStack
The element stack. -
fInlineStack
The inline stack. -
fSeenAnything
protected boolean fSeenAnything
True if seen anything. Important for xml declaration.
-
fSeenDoctype
protected boolean fSeenDoctype
True if the doctype declaration has been seen.
-
fSeenRootElement
protected boolean fSeenRootElement
True if root element has been seen.
-
fSeenRootElementEnd
protected boolean fSeenRootElementEnd
True if seen the end of the document element. In other words, this variable is set to false until the end </HTML> tag is seen (or synthesized). This is used to ensure that extraneous events after the end of the document element do not make the document stream ill-formed.
-
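The role of a flag like fSeenRootElementEnd can be sketched as a guard on outgoing events. The class below is hypothetical: it simply drops events that arrive after the root end tag, which is one way to keep the stream well-formed. Whether the real component drops, relocates, or otherwise handles such events is not stated in the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical guard that stops forwarding events once the root element has ended. */
final class RootEndGuardSketch {
    private boolean seenRootElementEnd;            // false until </html> is seen
    private final List<String> forwarded = new ArrayList<>();

    void startElement(String name) {
        if (seenRootElementEnd) {
            return;                                // extraneous event: not forwarded
        }
        forwarded.add("<" + name + ">");
    }

    void endElement(String name) {
        if (seenRootElementEnd) {
            return;
        }
        forwarded.add("</" + name + ">");
        if ("html".equalsIgnoreCase(name)) {
            seenRootElementEnd = true;             // everything after this is ignored
        }
    }

    /** Events that were passed on, in order. */
    List<String> forwarded() {
        return forwarded;
    }
}
```

Feeding it `<html>`, `</html>`, `<p>`, `</p>` forwards only the first two events, so a downstream consumer never sees content outside the document element.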
fSeenHeadElement
protected boolean fSeenHeadElement
True if seen head element.
-
fSeenBodyElement
protected boolean fSeenBodyElement
True if seen body element.
-
fSeenBodyElementEnd
private boolean fSeenBodyElementEnd
-
fSeenFramesetElement
private boolean fSeenFramesetElement
True if seen frameset element.
-
fSeenCharacters
private boolean fSeenCharacters
True if seen non whitespace characters.
-
fOpenedForm
protected boolean fOpenedForm
True if a form is in the stack (allow to discard opening of nested forms)
-
fOpenedSvg
protected boolean fOpenedSvg
True if a svg is in the stack (no parent checking takes place)
-
fOpenedSelect
protected boolean fOpenedSelect
True if a select is in the stack
-
fQName
A qualified name. -
tagBalancingListener
-
lostText_
-
forcedStartElement_
private boolean forcedStartElement_ -
forcedEndElement_
private boolean forcedEndElement_ -
fragmentContextStack_
Stack of elements determining the context in which a document fragment should be parsed -
fragmentContextStackSize_
private int fragmentContextStackSize_ -
endElementsBuffer_
-
discardedStartElements
-
htmlConfiguration_
-
-
Constructor Details
-
HTMLTagBalancer
HTMLTagBalancer(HTMLConfiguration htmlConfiguration)
-
-
Method Details
-
getFeatureDefault
Returns the default state for a feature.
- Specified by: getFeatureDefault in interface HTMLComponent
- Specified by: getFeatureDefault in interface XMLComponent
- Parameters:
  featureId - The feature identifier.
- Returns: the default state for a feature, or null if this component does not want to report a default value for this feature.
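The "default or null" contract can be pictured as a lookup over two parallel arrays, in the spirit of the RECOGNIZED_FEATURES (String[]) and RECOGNIZED_FEATURES_DEFAULTS (Boolean[]) fields listed in the field summary. This is only a sketch: the class name `FeatureDefaultsSketch`, the lookup loop, and the default values shown are assumptions, not the library's actual data.

```java
/** Illustrative lookup of a feature default from two parallel arrays. */
final class FeatureDefaultsSketch {
    private static final String[] RECOGNIZED_FEATURES = {
        "http://cyberneko.org/html/features/augmentations",
        "http://cyberneko.org/html/features/report-errors",
    };

    // Assumption: placeholder values; the real defaults are not given on this page.
    private static final Boolean[] RECOGNIZED_FEATURES_DEFAULTS = {
        Boolean.FALSE,
        Boolean.FALSE,
    };

    /** Returns the default for a recognized feature, or null for an unknown one. */
    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < RECOGNIZED_FEATURES.length; i++) {
            if (RECOGNIZED_FEATURES[i].equals(featureId)) {
                return RECOGNIZED_FEATURES_DEFAULTS[i];
            }
        }
        return null;                               // no default reported
    }
}
```

Returning the boxed `Boolean` type rather than `boolean` is what makes the null ("no opinion") answer possible.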
-
getPropertyDefault
Returns the default state for a property.
- Specified by: getPropertyDefault in interface HTMLComponent
- Specified by: getPropertyDefault in interface XMLComponent
- Parameters:
  propertyId - The property identifier.
- Returns: the default state for a property, or null if this component does not want to report a default value for this property
-
getRecognizedFeatures
Returns recognized features.
- Specified by: getRecognizedFeatures in interface XMLComponent
- Returns: an array of feature identifiers that are recognized by this component. This method may return null if no features are recognized by this component.
-
getRecognizedProperties
Returns recognized properties.
- Specified by: getRecognizedProperties in interface XMLComponent
- Returns: an array of property identifiers that are recognized by this component. This method may return null if no properties are recognized by this component.
-
reset
Resets the component.
- Specified by: reset in interface XMLComponent
- Parameters:
  manager - The component manager.
- Throws: XMLConfigurationException
-
setFeature
Sets a feature.
- Specified by: setFeature in interface XMLComponent
- Parameters:
  featureId - The feature identifier.
  state - The state of the feature.
- Throws: XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
-
setProperty
Sets a property.
- Specified by: setProperty in interface XMLComponent
- Parameters:
  propertyId - The property identifier.
  value - The value of the property.
- Throws: XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.
-
setDocumentHandler
Sets the document handler.
- Specified by: setDocumentHandler in interface XMLDocumentSource
- Parameters:
  handler - the new handler
-
getDocumentHandler
Returns the document handler.
- Specified by: getDocumentHandler in interface XMLDocumentSource
- Returns: the document handler
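The handler accessors above are what let a filter sit in the middle of a pipeline: it receives events as a handler and passes them to the next handler as a source. The sketch below shows that shape with a deliberately tiny, hypothetical interface (`EventHandlerSketch`) instead of the real XMLDocumentHandler, and with an arbitrary transformation (uppercasing text) standing in for tag balancing.

```java
import java.util.Locale;

/** Hypothetical one-method stand-in for a document handler. */
interface EventHandlerSketch {
    void characters(String text);
}

/** Hypothetical filter: receives events, transforms them, forwards them downstream. */
final class UppercaseFilterSketch implements EventHandlerSketch {
    private EventHandlerSketch documentHandler;    // next stage, may be unset

    void setDocumentHandler(EventHandlerSketch handler) {
        documentHandler = handler;
    }

    EventHandlerSketch getDocumentHandler() {
        return documentHandler;
    }

    @Override
    public void characters(String text) {
        if (documentHandler != null) {             // nothing to forward to otherwise
            documentHandler.characters(text.toUpperCase(Locale.ROOT));
        }
    }
}
```

A caller would wire a downstream consumer with `setDocumentHandler(...)` and then push events into the filter; events sent before a handler is set are simply not forwarded in this sketch.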
-
setDocumentSource
Sets the document source.
- Specified by: setDocumentSource in interface XMLDocumentHandler
- Parameters:
  source - the new source
-
getDocumentSource
- Specified by: getDocumentSource in interface XMLDocumentHandler
- Returns: the document source.
-
startDocument
public void startDocument(XMLLocator locator, String encoding, NamespaceContext nscontext, Augmentations augs) throws XNIException
Start document.
- Specified by: startDocument in interface XMLDocumentHandler
- Parameters:
  locator - The document locator, or null if the document location cannot be reported during the parsing of this document. However, it is strongly recommended that a locator be supplied that can at least report the system identifier of the document.
  encoding - The auto-detected IANA encoding name of the entity stream. This value will be null in those situations where the entity encoding is not auto-detected (e.g. internal entities or a document entity that is parsed from a java.io.Reader).
  nscontext - The namespace context in effect at the start of this document. This object represents the current context. Implementors of this class are responsible for copying the namespace bindings from the current context (and its parent contexts) if that information is important.
  augs - Additional information that may include infoset augmentations
- Throws: XNIException - Thrown by handler to signal an error.
-
xmlDecl
public void xmlDecl(String version, String encoding, String standalone, Augmentations augs) throws XNIException
XML declaration.
- Specified by: xmlDecl in interface XMLDocumentHandler
- Parameters:
  version - The XML version.
  encoding - The IANA encoding name of the document, or null if not specified.
  standalone - The standalone value, or null if not specified.
  augs - Additional information that may include infoset augmentations
- Throws: XNIException - Thrown by handler to signal an error.
-
doctypeDecl
public void doctypeDecl(String rootElementName, String publicId, String systemId, Augmentations augs) throws XNIException
Doctype declaration.
- Specified by: doctypeDecl in interface XMLDocumentHandler
- Parameters:
  rootElementName - The name of the root element.
  publicId - The public identifier if an external DTD or null if the external DTD is specified using SYSTEM.
  systemId - The system identifier if an external DTD, null otherwise.
  augs - Additional information that may include infoset augmentations
- Throws: XNIException - Thrown by handler to signal an error.
-
endDocument
End document.
- Specified by: endDocument in interface XMLDocumentHandler
- Parameters:
  augs - Additional information that may include infoset augmentations
- Throws: XNIException - Thrown by handler to signal an error.
-
consumeBufferedEndElements
private void consumeBufferedEndElements()
Consume elements that have been buffered, like
-