Class HTMLHighlighter
java.lang.Object
com.kohlschutter.boilerpipe.sax.HTMLHighlighter
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
-
Nested Class Summary
Nested Classes
Modifier and Type: private final class
Modifier and Type: private static class
-
Field Summary
Fields
private String extraStyleSheet
private boolean outputHighlightOnly
private static final Pattern PAT_SUPER_TAG
private static final Pattern PAT_TAG_NO_TEXT
private String postHighlight
private String preHighlight
private static final HTMLHighlighter.TagAction TA_HEAD
private static final HTMLHighlighter.TagAction TA_IGNORABLE_ELEMENT
private static Map<String, HTMLHighlighter.TagAction> TAG_ACTIONS
-
Constructor Summary
Constructors -
Method Summary
Methods
getExtraStyleSheet(): Returns the extra stylesheet definition that will be inserted in the HEAD element.
getPostHighlight(): Returns the string that will be inserted after any highlighted HTML block.
getPreHighlight(): Returns the string that will be inserted before any highlighted HTML block.
boolean isOutputHighlightOnly(): If true, only HTML enclosed within highlighted content will be returned.
static HTMLHighlighter newExtractingInstance(): Creates a new HTMLHighlighter, which is set up to return only the extracted HTML text, including enclosed markup.
static HTMLHighlighter newHighlightingInstance(): Creates a new HTMLHighlighter, which is set up to return the full HTML text, with the extracted text portion highlighted.
process(TextDocument doc, String origHTML): Processes the given TextDocument and the original HTML text (as a String).
process(TextDocument doc, InputSource is): Processes the given TextDocument and the original HTML text (as an InputSource).
process(URL url, BoilerpipeExtractor extractor): Fetches the given URL using HTMLFetcher and processes the retrieved HTML using the specified BoilerpipeExtractor.
void setExtraStyleSheet(String extraStyleSheet): Sets the extra stylesheet definition that will be inserted in the HEAD element.
void setOutputHighlightOnly(boolean outputHighlightOnly): Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
void setPostHighlight(String postHighlight): Sets the string that will be inserted after any highlighted HTML block.
void setPreHighlight(String preHighlight): Sets the string that will be inserted prior to any highlighted HTML block.
void setTagWhitelist(Map<String, Set<String>> tagWhitelist)
private static String xmlEncode
-
Field Details
-
tagWhitelist
-
PAT_TAG_NO_TEXT
-
PAT_SUPER_TAG
-
outputHighlightOnly
private boolean outputHighlightOnly
-
extraStyleSheet
-
preHighlight
-
postHighlight
-
TA_IGNORABLE_ELEMENT
-
TA_HEAD
-
TAG_ACTIONS
-
-
Constructor Details
-
HTMLHighlighter
private HTMLHighlighter(boolean extractHTML)
-
-
Method Details
-
newHighlightingInstance
Creates a new HTMLHighlighter, which is set up to return the full HTML text, with the extracted text portion highlighted.
-
newExtractingInstance
Creates a new HTMLHighlighter, which is set up to return only the extracted HTML text, including enclosed markup.
-
process
Processes the given TextDocument and the original HTML text (as a String).
Parameters:
doc - The processed TextDocument.
origHTML - The original HTML document.
Returns:
The highlighted HTML.
Throws:
BoilerpipeProcessingException
-
process
Processes the given TextDocument and the original HTML text (as an InputSource).
Parameters:
doc - The processed TextDocument.
is - The original HTML document.
Returns:
The highlighted HTML.
Throws:
BoilerpipeProcessingException
-
process
public String process(URL url, BoilerpipeExtractor extractor) throws IOException, BoilerpipeProcessingException, SAXException
Fetches the given URL using HTMLFetcher and processes the retrieved HTML using the specified BoilerpipeExtractor.
Parameters:
url - The URL of the HTML document to fetch.
extractor - The BoilerpipeExtractor used to process the retrieved HTML.
Returns:
The highlighted HTML.
Throws:
BoilerpipeProcessingException
IOException
SAXException
-
isOutputHighlightOnly
public boolean isOutputHighlightOnly()
If true, only HTML enclosed within highlighted content will be returned.
-
setOutputHighlightOnly
public void setOutputHighlightOnly(boolean outputHighlightOnly)
Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
-
getExtraStyleSheet
Returns the extra stylesheet definition that will be inserted in the HEAD element. By default, this corresponds to a simple definition that marks text in class "x-boilerpipe-mark1" as inline text with yellow background.
-
setExtraStyleSheet
Sets the extra stylesheet definition that will be inserted in the HEAD element. To disable, set it to the empty string: ""
Parameters:
extraStyleSheet - Plain HTML
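To make the role of the extra stylesheet concrete, here is a minimal stand-alone sketch. It does not call HTMLHighlighter and is not the library's implementation: the class StyleSheetSketch, the method insertIntoHead, and the exact CSS text are assumptions made for illustration. It only models the documented idea that a plain-HTML style definition is placed in the HEAD element and that the empty string disables it.

```java
final class StyleSheetSketch {
    // One plausible definition matching the documented behaviour (inline text,
    // yellow background). Assumption: not necessarily the library's exact default.
    static final String EXAMPLE_STYLE =
        "<style type=\"text/css\">"
        + ".x-boilerpipe-mark1 { display: inline; background-color: yellow; }"
        + "</style>";

    /**
     * Hypothetical helper: inserts the stylesheet right after the first
     * literal "<head>" tag. Simplification: only a lowercase, attribute-free
     * tag is recognised; a real HTML parser would be more tolerant.
     */
    static String insertIntoHead(String html, String extraStyleSheet) {
        if (extraStyleSheet.isEmpty()) {
            return html; // the empty string disables the extra stylesheet
        }
        int start = html.indexOf("<head>");
        if (start < 0) {
            return html; // no HEAD element found, leave the document unchanged
        }
        int end = start + "<head>".length();
        return html.substring(0, end) + extraStyleSheet + html.substring(end);
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>t</title></head><body></body></html>";
        System.out.println(insertIntoHead(doc, EXAMPLE_STYLE));
        System.out.println(insertIntoHead(doc, ""));
    }
}
```

Because the stylesheet is passed as plain HTML, a caller could supply any style element that targets the class used by the highlight markers.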
-
getPreHighlight
Returns the string that will be inserted before any highlighted HTML block. By default, this corresponds to <span class="x-boilerpipe-mark1">
-
setPreHighlight
Sets the string that will be inserted prior to any highlighted HTML block. To disable, set it to the empty string: ""
-
getPostHighlight
Returns the string that will be inserted after any highlighted HTML block. By default, this corresponds to </span>
-
setPostHighlight
Sets the string that will be inserted after any highlighted HTML block. To disable, set it to the empty string: ""
-
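The interaction between the output mode and the pre/post highlight strings can be pictured with a toy model. The sketch below is stand-alone and does not use HTMLHighlighter: HighlightModeSketch, Block, and render are hypothetical names, and real documents are processed as HTML rather than as a flat list of text fragments. It only illustrates the documented behaviour: content blocks are surrounded by the pre/post strings, and non-content blocks are dropped when only highlighted output is requested.

```java
import java.util.Arrays;
import java.util.List;

final class HighlightModeSketch {
    /** Hypothetical stand-in for one text fragment and its "content" flag. */
    static final class Block {
        final String text;
        final boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }
    }

    /**
     * Toy model: content blocks are wrapped in pre/post; non-content blocks
     * are kept only when outputHighlightOnly is false.
     */
    static String render(List<Block> blocks, boolean outputHighlightOnly,
                         String pre, String post) {
        StringBuilder out = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent) {
                out.append(pre).append(b.text).append(post);
            } else if (!outputHighlightOnly) {
                out.append(b.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Block> page = Arrays.asList(
            new Block("Menu ", false),
            new Block("Article text", true));
        // Whole document, content wrapped in the documented default markers.
        System.out.println(render(page, false,
            "<span class=\"x-boilerpipe-mark1\">", "</span>"));
        // Highlighted content only, with both markers disabled via "".
        System.out.println(render(page, true, "", ""));
    }
}
```

Passing empty strings for both markers, as in the second call, corresponds to the documented way of disabling them.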
xmlEncode
-
getTagWhitelist
-
setTagWhitelist
-