Class HtmlCleaner
java.lang.Object
org.htmlcleaner.HtmlCleaner
Main HtmlCleaner class.
It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the list of produced tokens and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file or any output stream.
Typical usage is the following:
// create an instance of HtmlCleaner
HtmlCleaner cleaner = new HtmlCleaner();
// take default cleaner properties
CleanerProperties props = cleaner.getProperties();
// customize cleaner's behavior with property setters
props.setXXX(...);
// Clean HTML taken from simple string, file, URL, input stream,
// input source or reader. Result is root node of created
// tree-like structure. Single cleaner instance may be safely used
// multiple times.
TagNode node = cleaner.clean(...);
// optionally find parts of the DOM or modify some nodes
TagNode[] myNodes = node.getElementsByXXX(...);
// and/or
Object[] myResults = node.evaluateXPath(xPathExpression);
// and/or
aNode.removeFromTree();
// and/or
aNode.addAttribute(attName, attValue);
// and/or
aNode.removeAttribute(attName);
// and/or
cleaner.setInnerHtml(aNode, htmlContent);
// and/or do some other tree manipulation/traversal
// serialize a node to a file, output stream, DOM, JDom...
new XXXSerializer(props).writeXmlXXX(aNode, ...);
myJDom = new JDomSerializer(props, true).createJDom(aNode);
myDom = new DomSerializer(props, true).createDOM(aNode);
-
Field Summary
Fields (Modifier and Type, Field, Description):
static int HTML_4
static int HTML_5
private static final String MARKER_ATTRIBUTE
    Marker attribute added to aid with part of the cleaning process.
private CleanerProperties properties
private CleanerTransformations transformations
-
Constructor Summary
Constructors (Constructor, Description):
HtmlCleaner()
    Constructor - creates cleaner instance with default tag info provider, default version and default properties.
HtmlCleaner(CleanerProperties properties)
    Constructor - creates the instance with default tag info provider and specified properties
HtmlCleaner(ITagInfoProvider tagInfoProvider)
    Constructor - creates the instance with specified tag info provider and default properties
HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
    Constructor - creates the instance with specified tag info provider and specified properties
-
Method Summary
Methods (Modifier and Type, Method, Description):
private void addAttributesToTag(TagNode tag, Map<String, String> attributes)
    Add attributes from specified map to the specified tag.
private boolean addIfNeededToPruneSet(TagNode tagNode, CleanTimeValues cleanTimeValues)
private void addPossibleHeadCandidate(TagInfo tagInfo, TagNode tagNode, CleanTimeValues cleanTimeValues)
    Checks if specified tag with specified info is candidate for moving to head section.
protected void addPruneNode(TagNode node, CleanTimeValues cleanTimeValues)
private static boolean areCopiedTokensEqual(TagNode token1, TagNode token2)
    Determines if two copied tokens are equal.
private void calculateRootNode(CleanTimeValues cleanTimeValues, Set<String> namespacePrefixes)
    Assigns root node to internal variable and adds necessary xmlns attributes if cleaner is namespace-aware.
clean(InputStream in)
clean(InputStream in, String charset)
protected TagNode clean(Reader reader, CleanTimeValues cleanTimeValues)
    Basic version of the cleaning call.
Deprecated.
Deprecated.
private void closeAll(List nodeList, CleanTimeValues cleanTimeValues)
    Close all unclosed tags if there are any.
closeSnippet(List nodeList, TagPos tagPos, Object toNode, CleanTimeValues cleanTimeValues)
    Forced closing
private void createDocumentNodes(List listNodes, CleanTimeValues cleanTimeValues)
private TagNode createTagNode(TagNode startTagToken)
flattenNestedList(List list)
    Flattens a list of tag nodes
protected Set<ITagNodeCondition> getAllowTagSet(CleanTimeValues cleanTimeValues)
getAllTags(CleanTimeValues cleanTimeValues)
private ChildBreaks getChildBreaks(CleanTimeValues cleanTimeValues)
getInnerHtml(TagNode node)
    For the specified node, returns its content as string.
private OpenTags getOpenTags(CleanTimeValues cleanTimeValues)
protected Set<ITagNodeCondition> getPruneTagSet(CleanTimeValues cleanTimeValues)
getTagInfo(String tagName, CleanTimeValues cleanTimeValues)
    Returns a TagInfo object for the specified tag name.
private void handleEndTagToken(BaseToken token, ListIterator<BaseToken> nodeIterator, List nodeList, CleanTimeValues cleanTimeValues)
    Process rules for a new end tag token in the HTML tree.
protected void handleInterruption()
    Called whenever the thread is interrupted.
private void handleStartTagToken(BaseToken token, ListIterator<BaseToken> nodeIterator, List nodeList, CleanTimeValues cleanTimeValues)
    Processes all the rules associated with a new opening tag in the HTML tree
void initCleanerTransformations(Map transInfos)
private boolean isAllowedAsForeignMarkup(String tagname, CleanTimeValues cleanTimeValues)
    Checks whether we can allow a tag as "foreign markup".
private boolean isAllowedInLastOpenTag(BaseToken token, CleanTimeValues cleanTimeValues)
private static boolean isCopiedTokenEqualToNextThreeCopiedTokens(TagNode copiedStartToken, ListIterator<BaseToken> nodeIterator)
    Determines if a copied token is equal to the next 3 tokens in the iterator.
private boolean isFatalTagSatisfied(TagInfo tag, CleanTimeValues cleanTimeValues)
    Checks if open fatal tag is missing if there is a fatal tag for the specified tag.
protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
private boolean isStartToken
(package private) void makeTree(List nodeList, ListIterator<BaseToken> nodeIterator, CleanTimeValues cleanTimeValues)
    This method generally mutates flattened list of tokens into tree structure.
private boolean markNodesToPrune(List nodeList, CleanTimeValues cleanTimeValues, int depth)
private boolean mustAddRequiredParent(TagInfo tag, CleanTimeValues cleanTimeValues)
    Check if specified tag requires parent tag, but that parent tag is missing in the appropriate context.
private TagNode newTagNode(String tagName)
private NestingState popNesting(CleanTimeValues cleanTimeValues)
private NestingState pushNesting(CleanTimeValues cleanTimeValues)
private void reopenBrokenNode(ListIterator<BaseToken> nodeIterator, TagNode toReopen, CleanTimeValues cleanTimeValues)
private void saveToLastOpenTag(List nodeList, Object tokenToAdd, CleanTimeValues cleanTimeValues)
void setInnerHtml(TagNode node, String content)
    For the specified tag node, defines its html content.
-
Field Details
-
MARKER_ATTRIBUTE
Marker attribute added to aid with part of the cleaning process. TODO: a non-intrusive way of doing this that does not involve modifying the source html
-
HTML_4
public static int HTML_4
-
HTML_5
public static int HTML_5
-
properties
-
transformations
-
-
Constructor Details
-
HtmlCleaner
public HtmlCleaner()
Constructor - creates cleaner instance with default tag info provider, default version and default properties.
-
HtmlCleaner
Constructor - creates the instance with specified tag info provider and default properties
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
-
HtmlCleaner
Constructor - creates the instance with default tag info provider and specified properties
- Parameters:
properties - Properties used during parsing and serializing
-
HtmlCleaner
Constructor - creates the instance with specified tag info provider and specified properties
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
properties - Properties used during parsing and serializing
-
-
Method Details
-
clean
-
clean
- Throws:
IOException
-
clean
- Throws:
IOException
-
clean
Deprecated. Deprecated because unmanaged network IO does not handle proxies, slow servers or broken connections well. The HtmlCleaner caller should manage the connections itself and just provide the HtmlCleaner library with a stream.
- Parameters:
url -
charset -
- Throws:
IOException
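The advice above (manage the connection yourself and hand the library a stream) could look roughly like the following. This is a minimal sketch using only JDK classes; the class name `ManagedFetch`, the method `open` and the timeout values are hypothetical and not part of HtmlCleaner.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of HtmlCleaner: the caller owns the connection.
final class ManagedFetch {
    /**
     * Creates a connection object with explicit timeouts. For http URLs no
     * network traffic happens here; it starts when the stream is requested.
     */
    static URLConnection open(URI uri, int connectTimeoutMs, int readTimeoutMs) {
        try {
            URLConnection conn = uri.toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs); // fail fast on unreachable hosts
            conn.setReadTimeout(readTimeoutMs);       // fail fast on slow servers
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    // Intended use (requires the HtmlCleaner library on the classpath):
    //   try (InputStream in = ManagedFetch.open(uri, 2000, 5000).getInputStream()) {
    //       TagNode root = cleaner.clean(in, "UTF-8");
    //   }
}
```

The stream is then passed to the `clean(InputStream in, String charset)` overload listed in the method summary, so proxy handling, retries and error reporting stay under the caller's control.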
-
clean
Deprecated. Creates instance from the content downloaded from specified URL. HTML encoding is resolved by trying the following in sequence: 1. reading the Content-Type response header, 2. analyzing META tags at the beginning of the html, 3. using the platform's default charset.
- Parameters:
url - the url to retrieve content from
- Returns:
- the cleaned content
- Throws:
IOException
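The three-step encoding lookup described above can be sketched as a small fallback chain. This is an illustration under assumptions, not HtmlCleaner's actual code: the class `CharsetGuess`, its regular expressions and the idea of passing in only a prefix of the document are all hypothetical.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the documented lookup order; not the library's implementation.
final class CharsetGuess {
    // Finds "charset=..." in a Content-Type header value or inside a META tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String resolve(String contentTypeHeader, String htmlPrefix) {
        // 1. Content-Type response header
        String fromHeader = findCharset(contentTypeHeader);
        if (fromHeader != null) return fromHeader;
        // 2. META tags near the beginning of the document
        if (htmlPrefix != null) {
            Matcher meta = META_TAG.matcher(htmlPrefix);
            while (meta.find()) {
                String fromMeta = findCharset(meta.group());
                if (fromMeta != null) return fromMeta;
            }
        }
        // 3. Platform default
        return Charset.defaultCharset().name();
    }

    private static String findCharset(String text) {
        if (text == null) return null;
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Because the last step depends on the platform default, the same page can decode differently on different machines, which is one more reason to pass an explicit charset with a caller-managed stream.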
-
clean
- Throws:
IOException
-
clean
- Throws:
IOException
-
clean
- Throws:
IOException
-
clean
Basic version of the cleaning call.
- Parameters:
reader - (not closed)
- Returns:
- An instance of TagNode object which is the root of the XML tree.
- Throws:
IOException
-
markNodesToPrune
-
calculateRootNode
Assigns the root node to an internal variable and adds the necessary xmlns attributes if the cleaner is namespace-aware. The root node of the result depends on the parameter "omitHtmlEnvelope": if it is set, the first child of the body becomes the root node; otherwise html is the root node.
- Parameters:
namespacePrefixes -
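The effect of "omitHtmlEnvelope" on the chosen root can be sketched with a toy node type. Everything here (`RootChoice`, `Node`, `pickRoot`) is hypothetical, and the fallback for an empty body is an assumption that the description above does not cover.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the root selection rule; not HtmlCleaner's TagNode.
final class RootChoice {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node add(Node child) { children.add(child); return this; }
        Node child(String childName) {
            for (Node c : children) {
                if (c.name.equals(childName)) return c;
            }
            return null;
        }
    }

    static Node pickRoot(Node html, boolean omitHtmlEnvelope) {
        if (!omitHtmlEnvelope) return html;          // keep the html envelope
        Node body = html.child("body");
        if (body == null || body.children.isEmpty()) {
            return html;                              // assumption: nothing to unwrap
        }
        return body.children.get(0);                  // first child of body becomes root
    }
}
```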
-
addAttributesToTag
-
isFatalTagSatisfied
Checks if open fatal tag is missing if there is a fatal tag for the specified tag.
- Parameters:
tag -
-
mustAddRequiredParent
Check if specified tag requires parent tag, but that parent tag is missing in the appropriate context.
- Parameters:
tag -
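A required-parent check of this kind can be pictured as a lookup against the currently open tags. The sketch below is only an illustration: the rule table, the class `RequiredParentCheck` and the use of plain tag names instead of TagInfo are assumptions, and the real rules come from the tag info provider.

```java
import java.util.Deque;
import java.util.Map;

// Hypothetical sketch; the real check works on TagInfo and CleanTimeValues.
final class RequiredParentCheck {
    // Illustrative rule table only, not HtmlCleaner's actual tag definitions.
    static final Map<String, String> REQUIRED_PARENT = Map.of("tr", "table", "td", "tr");

    /** True when the tag declares a required parent that is not currently open. */
    static boolean mustAddRequiredParent(String tag, Deque<String> openTags) {
        String parent = REQUIRED_PARENT.get(tag);
        return parent != null && !openTags.contains(parent);
    }
}
```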
-
newTagNode
-
createTagNode
-
isAllowedInLastOpenTag
-
saveToLastOpenTag
-
isStartToken
-
isAllowedAsForeignMarkup
Checks whether we can allow a tag as "foreign markup". This requires namespace awareness to be set to true, and either a current xmlns declaration within scope that isn't for HTML, or a namespace prefix on the tag.
- Parameters:
cleanTimeValues -
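Written as a boolean expression, the rule above might look like this. The class `ForeignMarkupRule`, the explicit `xmlnsInScope` parameter and the use of the XHTML namespace URI to mean "for HTML" are assumptions made for the sketch.

```java
// Hypothetical restatement of the documented rule; not the library's code.
final class ForeignMarkupRule {
    // Assumption: "for HTML" is taken to mean the XHTML namespace URI.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static boolean isAllowed(String tagName, boolean namespacesAware, String xmlnsInScope) {
        if (!namespacesAware) return false;                  // precondition from the description
        boolean foreignNsInScope = xmlnsInScope != null && !XHTML_NS.equals(xmlnsInScope);
        boolean hasPrefix = tagName.indexOf(':') > 0;        // e.g. "svg:title"
        return foreignNsInScope || hasPrefix;
    }
}
```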
-
handleEndTagToken
private void handleEndTagToken(BaseToken token, ListIterator<BaseToken> nodeIterator, List nodeList, CleanTimeValues cleanTimeValues)
Process rules for a new end tag token in the HTML tree.
- Parameters:
token -
nodeIterator -
nodeList -
cleanTimeValues -
-
handleStartTagToken
private void handleStartTagToken(BaseToken token, ListIterator<BaseToken> nodeIterator, List nodeList, CleanTimeValues cleanTimeValues)
Processes all the rules associated with a new opening tag in the HTML tree
- Parameters:
token -
nodeIterator -
nodeList -
cleanTimeValues -
-
makeTree
This method generally mutates flattened list of tokens into tree structure.
- Parameters:
nodeList -
nodeIterator -
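To make the idea of turning a flat token list into a tree concrete, here is a much simplified stack-based sketch. It is not HtmlCleaner's algorithm (which also applies tag info rules, pruning and head/body handling); the class `TreeSketch`, the string token format and the policy of ignoring stray end tags are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified illustration of flat tokens -> tree.
final class TreeSketch {
    static final class Node {
        final String name;
        final List<Object> children = new ArrayList<>(); // Node or String (text)
        Node(String name) { this.name = name; }
    }

    /** Tokens are "<name>", "</name>" or plain text. */
    static Node makeTree(List<String> tokens) {
        Node root = new Node("#root");
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                boolean isOpen = false;
                for (Node n : open) {
                    if (n != root && n.name.equals(name)) { isOpen = true; break; }
                }
                if (isOpen) {
                    // Close any unclosed descendants first, then the tag itself.
                    while (!open.peek().name.equals(name)) open.pop();
                    open.pop();
                } // a stray end tag is ignored
            } else if (t.startsWith("<")) {
                Node n = new Node(t.substring(1, t.length() - 1));
                open.peek().children.add(n);
                open.push(n);
            } else {
                open.peek().children.add(t); // text goes under the innermost open tag
            }
        }
        return root; // tags still on the stack are treated as closed at end of input
    }
}
```

For the input `<div>`, `<p>`, `hi`, `</div>`, the unclosed `p` is closed implicitly when `div` ends, which is the kind of repair the cleaning step exists to perform.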
-
isCopiedTokenEqualToNextThreeCopiedTokens
private static boolean isCopiedTokenEqualToNextThreeCopiedTokens(TagNode copiedStartToken, ListIterator<BaseToken> nodeIterator)
Determines if a copied token is equal to the next 3 tokens in the iterator.
-
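For isCopiedTokenEqualToNextThreeCopiedTokens, one way to compare a value against the next three elements of a ListIterator without consuming them is to advance and then rewind. This is a generic sketch; the class `PeekAhead`, the use of `equals` for comparison, and the rewinding itself are assumptions, since the description does not say how the library compares tokens or whether it restores the iterator position.

```java
import java.util.ListIterator;

// Hypothetical look-ahead helper; not the library's implementation.
final class PeekAhead {
    /** True if the next three elements all equal the candidate; the iterator position is restored. */
    static <T> boolean nextThreeEqual(T candidate, ListIterator<T> it) {
        int advanced = 0;
        boolean allEqual = true;
        while (advanced < 3 && it.hasNext()) {
            T next = it.next();
            advanced++;
            if (!candidate.equals(next)) {
                allEqual = false;
                break;
            }
        }
        boolean result = allEqual && advanced == 3; // fewer than three elements left means false
        for (int i = 0; i < advanced; i++) {
            it.previous();                           // rewind to where we started
        }
        return result;
    }
}
```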
flattenNestedList
-
areCopiedTokensEqual
-
reopenBrokenNode
private void reopenBrokenNode(ListIterator<BaseToken> nodeIterator, TagNode toReopen, CleanTimeValues cleanTimeValues) -
isRemovingNodeReasonablySafe
- Parameters:
startTagToken -
- Returns:
- true if no id attribute or class attribute
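Read literally, the return condition for isRemovingNodeReasonablySafe says a node is a safe removal candidate when it carries neither an id nor a class attribute, presumably because such nodes are less likely to be referenced by stylesheets or scripts. A minimal sketch of that reading, using a plain attribute map instead of TagNode (the class `SafeRemoval` is hypothetical, and the library may check more than these two attributes):

```java
import java.util.Map;

// Hypothetical restatement of the documented return condition.
final class SafeRemoval {
    static boolean isReasonablySafe(Map<String, String> attributes) {
        // Assumption: "no id attribute or class attribute" means neither is present.
        return !attributes.containsKey("id") && !attributes.containsKey("class");
    }
}
```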
-
createDocumentNodes
-
closeSnippet
-
closeAll
Close all unclosed tags if there are any. -
addPossibleHeadCandidate
private void addPossibleHeadCandidate(TagInfo tagInfo, TagNode tagNode, CleanTimeValues cleanTimeValues)
Checks if specified tag with specified info is candidate for moving to head section.
- Parameters:
tagInfo -
tagNode -
-
getProperties
-
getPruneTagSet
-
getAllowTagSet
-
addPruneNode
-
getTagInfo
Returns a TagInfo object for the specified tag name. If the tag is foreign markup, we leave it as null. This is because we may get name clashes, e.g. svg:title. However, we do handle the tag if it's embedded content within the correct NS (e.g. SVG, MathML).
- Parameters:
tagName -
cleanTimeValues -
- Returns:
- a TagInfo object, or null if no matching TagInfo is found
-
addIfNeededToPruneSet
-
getAllTags
-
getTagInfoProvider
- Returns:
- ITagInfoProvider instance for this HtmlCleaner
-
getTransformations
- Returns:
- Transformations defined for this instance of cleaner
-
getInnerHtml
-
setInnerHtml
-
initCleanerTransformations
- Parameters:
transInfos-
-
getOpenTags
-
getChildBreaks
-
pushNesting
-
popNesting
-
handleInterruption
protected void handleInterruption()
Called whenever the thread is interrupted. Currently this is a placeholder, but could hold cleanup methods and user interaction.
-