Package com.kohlschutter.boilerpipe.sax
Class HTMLHighlighter
- java.lang.Object
-
- com.kohlschutter.boilerpipe.sax.HTMLHighlighter
-
public final class HTMLHighlighter extends java.lang.Object
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class):
private class HTMLHighlighter.Implementation
private static class HTMLHighlighter.TagAction
-
Field Summary
Fields (Modifier and Type, Field):
private java.lang.String extraStyleSheet
private boolean outputHighlightOnly
private static java.util.regex.Pattern PAT_SUPER_TAG
private static java.util.regex.Pattern PAT_TAG_NO_TEXT
private java.lang.String postHighlight
private java.lang.String preHighlight
private static HTMLHighlighter.TagAction TA_HEAD
private static HTMLHighlighter.TagAction TA_IGNORABLE_ELEMENT
private static java.util.Map<java.lang.String,HTMLHighlighter.TagAction> TAG_ACTIONS
private java.util.Map<java.lang.String,java.util.Set<java.lang.String>> tagWhitelist
-
Constructor Summary
Constructors (Modifier, Constructor):
private HTMLHighlighter(boolean extractHTML)
-
Method Summary
Methods (Modifier and Type, Method: Description):
java.lang.String getExtraStyleSheet(): Returns the extra stylesheet definition that will be inserted in the HEAD element.
java.lang.String getPostHighlight(): Returns the string that will be inserted after any highlighted HTML block.
java.lang.String getPreHighlight(): Returns the string that will be inserted before any highlighted HTML block.
java.util.Map<java.lang.String,java.util.Set<java.lang.String>> getTagWhitelist()
boolean isOutputHighlightOnly(): If true, only HTML enclosed within highlighted content will be returned.
static HTMLHighlighter newExtractingInstance(): Creates a new HTMLHighlighter, which is set up to return only the extracted HTML text, including enclosed markup.
static HTMLHighlighter newHighlightingInstance(): Creates a new HTMLHighlighter, which is set up to return the full HTML text, with the extracted text portion highlighted.
java.lang.String process(TextDocument doc, java.lang.String origHTML): Processes the given TextDocument and the original HTML text (as a String).
java.lang.String process(TextDocument doc, org.xml.sax.InputSource is): Processes the given TextDocument and the original HTML text (as an InputSource).
java.lang.String process(java.net.URL url, BoilerpipeExtractor extractor): Fetches the given URL using HTMLFetcher and processes the retrieved HTML using the specified BoilerpipeExtractor.
void setExtraStyleSheet(java.lang.String extraStyleSheet): Sets the extra stylesheet definition that will be inserted in the HEAD element.
void setOutputHighlightOnly(boolean outputHighlightOnly): Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
void setPostHighlight(java.lang.String postHighlight): Sets the string that will be inserted after any highlighted HTML block.
void setPreHighlight(java.lang.String preHighlight): Sets the string that will be inserted prior to any highlighted HTML block.
void setTagWhitelist(java.util.Map<java.lang.String,java.util.Set<java.lang.String>> tagWhitelist)
private static java.lang.String xmlEncode(java.lang.String in)
-
-
-
Field Detail
-
tagWhitelist
private java.util.Map<java.lang.String,java.util.Set<java.lang.String>> tagWhitelist
-
PAT_TAG_NO_TEXT
private static final java.util.regex.Pattern PAT_TAG_NO_TEXT
-
PAT_SUPER_TAG
private static final java.util.regex.Pattern PAT_SUPER_TAG
-
outputHighlightOnly
private boolean outputHighlightOnly
-
extraStyleSheet
private java.lang.String extraStyleSheet
-
preHighlight
private java.lang.String preHighlight
-
postHighlight
private java.lang.String postHighlight
-
TA_IGNORABLE_ELEMENT
private static final HTMLHighlighter.TagAction TA_IGNORABLE_ELEMENT
-
TA_HEAD
private static final HTMLHighlighter.TagAction TA_HEAD
-
TAG_ACTIONS
private static java.util.Map<java.lang.String,HTMLHighlighter.TagAction> TAG_ACTIONS
-
-
Method Detail
-
newHighlightingInstance
public static HTMLHighlighter newHighlightingInstance()
Creates a new HTMLHighlighter, which is set up to return the full HTML text, with the extracted text portion highlighted.
-
newExtractingInstance
public static HTMLHighlighter newExtractingInstance()
Creates a new HTMLHighlighter, which is set up to return only the extracted HTML text, including enclosed markup.
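The difference between the two factory methods can be pictured with a small stand-in. This is only a sketch and not the library's implementation: the `HighlightModesSketch` class, its `Block` type, and the `render` method are hypothetical, and a real document is processed as a SAX event stream rather than a flat list. It is also an assumption here that the extracting mode emits no wrapper strings around content.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, NOT boilerpipe code: a document is modelled as a flat
// list of HTML fragments, each flagged as content or boilerplate.
class HighlightModesSketch {

    static final class Block {
        final String html;
        final boolean isContent;

        Block(String html, boolean isContent) {
            this.html = html;
            this.isContent = isContent;
        }
    }

    // Wraps content blocks in pre/post; boilerplate blocks are kept only when
    // outputHighlightOnly is false.
    static String render(List<Block> blocks, boolean outputHighlightOnly,
                         String pre, String post) {
        StringBuilder out = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent) {
                out.append(pre).append(b.html).append(post);
            } else if (!outputHighlightOnly) {
                out.append(b.html);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Block> page = Arrays.asList(
                new Block("<div>Menu</div>", false),
                new Block("<p>Article text</p>", true),
                new Block("<div>Footer</div>", false));

        // Highlighting-style output: whole page, content wrapped in a marker.
        System.out.println(render(page, false,
                "<span class=\"x-boilerpipe-mark1\">", "</span>"));

        // Extracting-style output: content only, no wrapper strings (assumed).
        System.out.println(render(page, true, "", ""));
    }
}
```

In this model the first call keeps the menu and footer and wraps only the article paragraph, while the second call returns the article paragraph alone.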
-
process
public java.lang.String process(TextDocument doc, java.lang.String origHTML) throws BoilerpipeProcessingException
Processes the given TextDocument and the original HTML text (as a String).
- Parameters:
doc - The processed TextDocument.
origHTML - The original HTML document.
- Returns:
- The highlighted HTML.
- Throws:
BoilerpipeProcessingException
-
process
public java.lang.String process(TextDocument doc, org.xml.sax.InputSource is) throws BoilerpipeProcessingException
Processes the given TextDocument and the original HTML text (as an InputSource).
- Parameters:
doc - The processed TextDocument.
is - The original HTML document.
- Returns:
- The highlighted HTML.
- Throws:
BoilerpipeProcessingException
-
process
public java.lang.String process(java.net.URL url, BoilerpipeExtractor extractor) throws java.io.IOException, BoilerpipeProcessingException, org.xml.sax.SAXException
Fetches the given URL using HTMLFetcher and processes the retrieved HTML using the specified BoilerpipeExtractor.
- Parameters:
url - The URL of the HTML document to fetch.
extractor - The BoilerpipeExtractor used to process the retrieved HTML.
- Returns:
- The highlighted HTML.
- Throws:
BoilerpipeProcessingException
java.io.IOException
org.xml.sax.SAXException
-
isOutputHighlightOnly
public boolean isOutputHighlightOnly()
If true, only HTML enclosed within highlighted content will be returned.
-
setOutputHighlightOnly
public void setOutputHighlightOnly(boolean outputHighlightOnly)
Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
-
getExtraStyleSheet
public java.lang.String getExtraStyleSheet()
Returns the extra stylesheet definition that will be inserted in the HEAD element. By default, this corresponds to a simple definition that marks text in class "x-boilerpipe-mark1" as inline text with yellow background.
-
setExtraStyleSheet
public void setExtraStyleSheet(java.lang.String extraStyleSheet)
Sets the extra stylesheet definition that will be inserted in the HEAD element. To disable, set it to the empty string: ""
- Parameters:
extraStyleSheet - Plain HTML
-
getPreHighlight
public java.lang.String getPreHighlight()
Returns the string that will be inserted before any highlighted HTML block. By default, this corresponds to <span class="x-boilerpipe-mark1">
-
setPreHighlight
public void setPreHighlight(java.lang.String preHighlight)
Sets the string that will be inserted prior to any highlighted HTML block. To disable, set it to the empty string: ""
-
getPostHighlight
public java.lang.String getPostHighlight()
Returns the string that will be inserted after any highlighted HTML block. By default, this corresponds to </span>
-
setPostHighlight
public void setPostHighlight(java.lang.String postHighlight)
Sets the string that will be inserted after any highlighted HTML block. To disable, set it to the empty string: ""
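Because the stylesheet, the pre-highlight string, and the post-highlight string are set independently, they are easy to get out of sync. The following sketch builds a matching triple from one CSS class name. The `HighlightMarkupSketch` class, its methods, and the `my-highlight` class name are hypothetical and not part of boilerpipe. The library calls appear only in comments so that the block compiles without the library on the classpath.

```java
// Hypothetical helper, NOT part of boilerpipe: builds a consistent set of
// strings for setExtraStyleSheet, setPreHighlight and setPostHighlight.
class HighlightMarkupSketch {

    // Opening tag intended for setPreHighlight.
    static String preHighlight(String cssClass) {
        return "<span class=\"" + cssClass + "\">";
    }

    // Closing tag intended for setPostHighlight; it must close whatever
    // preHighlight opens.
    static String postHighlight() {
        return "</span>";
    }

    // Plain-HTML <style> element intended for setExtraStyleSheet.
    static String extraStyleSheet(String cssClass, String background) {
        return "<style type=\"text/css\">\n"
                + "." + cssClass + " { display: inline; background-color: "
                + background + "; }\n"
                + "</style>\n";
    }

    public static void main(String[] args) {
        String cssClass = "my-highlight"; // assumed class name, for illustration

        System.out.println(extraStyleSheet(cssClass, "lightgreen"));
        System.out.println(preHighlight(cssClass) + "content" + postHighlight());

        // With the library available, the strings would be applied like this:
        //   HTMLHighlighter hh = HTMLHighlighter.newHighlightingInstance();
        //   hh.setExtraStyleSheet(extraStyleSheet(cssClass, "lightgreen"));
        //   hh.setPreHighlight(preHighlight(cssClass));
        //   hh.setPostHighlight(postHighlight());
    }
}
```

Deriving all three strings from the same class name keeps the inserted style rule and the wrapper markup in agreement.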
-
xmlEncode
private static java.lang.String xmlEncode(java.lang.String in)
-
getTagWhitelist
public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> getTagWhitelist()
-
setTagWhitelist
public void setTagWhitelist(java.util.Map<java.lang.String,java.util.Set<java.lang.String>> tagWhitelist)
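The Javadoc gives no description for the tag whitelist, so its meaning is not confirmed here. The type suggests a map from tag name to a set of strings, and the sketch below assumes those strings are permitted attribute names for that tag. The `TagWhitelistSketch` class, its methods, the upper-case tag names, and the lookup rule are all assumptions made for illustration. Only the map type comes from the signatures above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, NOT boilerpipe code.
class TagWhitelistSketch {

    // Builds a map with the same type that setTagWhitelist accepts.
    static Map<String, Set<String>> exampleWhitelist() {
        Map<String, Set<String>> whitelist = new HashMap<>();
        // ASSUMPTION: key = tag name, value = attribute names to keep.
        whitelist.put("A", new HashSet<>(Arrays.asList("href")));
        whitelist.put("IMG", new HashSet<>(Arrays.asList("src", "alt")));
        whitelist.put("P", new HashSet<String>()); // tag kept, no attributes
        return whitelist;
    }

    // ASSUMED lookup rule: an attribute passes only if its tag is listed and
    // the attribute name is in that tag's set.
    static boolean isAttributeAllowed(Map<String, Set<String>> whitelist,
                                      String tag, String attribute) {
        Set<String> allowed = whitelist.get(tag);
        return allowed != null && allowed.contains(attribute);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> whitelist = exampleWhitelist();
        System.out.println(isAttributeAllowed(whitelist, "A", "href"));
        System.out.println(isAttributeAllowed(whitelist, "A", "onclick"));
        System.out.println(isAttributeAllowed(whitelist, "SCRIPT", "src"));
        // With the library available, the map would be handed over with:
        //   hh.setTagWhitelist(whitelist);
    }
}
```

Whether the library matches tag names case-sensitively, and what it does with tags missing from the map, should be checked against the source before relying on this reading.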
-
-