Class HTMLHighlighter

java.lang.Object
com.kohlschutter.boilerpipe.sax.HTMLHighlighter

public final class HTMLHighlighter extends Object
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
  • Constructor Details

    • HTMLHighlighter

      private HTMLHighlighter(boolean extractHTML)
  • Method Details

    • newHighlightingInstance

      public static HTMLHighlighter newHighlightingInstance()
      Creates a new HTMLHighlighter that is set up to return the full HTML text, with the extracted text portion highlighted.
    • newExtractingInstance

      public static HTMLHighlighter newExtractingInstance()
      Creates a new HTMLHighlighter that is set up to return only the extracted HTML text, including enclosed markup.
    • process

      public String process(TextDocument doc, String origHTML) throws BoilerpipeProcessingException
      Processes the given TextDocument and the original HTML text (as a String).
      Parameters:
      doc - The processed TextDocument.
      origHTML - The original HTML document.
      Returns:
      The highlighted HTML.
      Throws:
      BoilerpipeProcessingException
    • process

      public String process(TextDocument doc, InputSource is) throws BoilerpipeProcessingException
      Processes the given TextDocument and the original HTML text (as an InputSource).
      Parameters:
      doc - The processed TextDocument.
      is - The original HTML document.
      Returns:
      The highlighted HTML.
      Throws:
      BoilerpipeProcessingException
    • process

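      public String process(URL url, BoilerpipeExtractor extractor) throws IOException, BoilerpipeProcessingException, SAXException
      Fetches the given URL using HTMLFetcher and processes the retrieved HTML using the specified BoilerpipeExtractor.
      Parameters:
      url - The URL of the HTML document to fetch.
      extractor - The BoilerpipeExtractor used to process the retrieved HTML.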
      Returns:
      The highlighted HTML.
      Throws:
      BoilerpipeProcessingException
      IOException
      SAXException
    • isOutputHighlightOnly

      public boolean isOutputHighlightOnly()
      If true, only the HTML enclosed within highlighted content will be returned.
    • setOutputHighlightOnly

      public void setOutputHighlightOnly(boolean outputHighlightOnly)
      Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
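      The effect of the outputHighlightOnly flag can be pictured with a small, JDK-only model. This is a conceptual sketch, not the boilerpipe implementation: the class OutputModeSketch, its Segment type, and the render method are hypothetical names introduced here, and the real highlighter works on SAX events rather than a list of pre-cut segments.

```java
import java.util.Arrays;
import java.util.List;

/** Conceptual model only; none of these names exist in boilerpipe. */
final class OutputModeSketch {

    /** A hypothetical slice of the source HTML and whether it was marked as content. */
    static final class Segment {
        final String html;
        final boolean isContent;

        Segment(String html, boolean isContent) {
            this.html = html;
            this.isContent = isContent;
        }
    }

    /** Joins the segments, dropping non-content ones when outputHighlightOnly is true. */
    static String render(List<Segment> segments, boolean outputHighlightOnly) {
        StringBuilder out = new StringBuilder();
        for (Segment s : segments) {
            if (s.isContent || !outputHighlightOnly) {
                out.append(s.html);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Segment> page = Arrays.asList(
            new Segment("<div>Menu</div>", false),
            new Segment("<p>Article <b>text</b></p>", true),
            new Segment("<div>Footer</div>", false));

        System.out.println(render(page, true));   // content segments only
        System.out.println(render(page, false));  // the whole document
    }
}
```

      In this model, true roughly corresponds to the behaviour described for newExtractingInstance() and false to that of newHighlightingInstance().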
    • getExtraStyleSheet

      public String getExtraStyleSheet()
      Returns the extra stylesheet definition that will be inserted in the HEAD element. By default, this corresponds to a simple definition that marks text in class "x-boilerpipe-mark1" as inline text with yellow background.
    • setExtraStyleSheet

      public void setExtraStyleSheet(String extraStyleSheet)
      Sets the extra stylesheet definition that will be inserted in the HEAD element. To disable, set it to the empty string: ""
      Parameters:
      extraStyleSheet - Plain HTML
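      Because the parameter is described as plain HTML, a caller would presumably pass a complete style element rather than bare CSS rules. The following JDK-only sketch builds such a string; the helper styleBlock and the class ExtraStyleSheetSketch are hypothetical, and the CSS shown is an illustration rather than the library's actual default stylesheet.

```java
/** Builds an HTML style element as a String; a sketch, not part of boilerpipe. */
final class ExtraStyleSheetSketch {

    /**
     * Returns a style element that renders the given CSS class inline
     * with the given background color. Both arguments are inserted verbatim.
     */
    static String styleBlock(String cssClass, String backgroundColor) {
        return "<style type=\"text/css\">\n"
             + "." + cssClass + " { display: inline; background-color: "
             + backgroundColor + "; }\n"
             + "</style>\n";
    }

    public static void main(String[] args) {
        // The class name matches the one mentioned for the default stylesheet.
        System.out.print(styleBlock("x-boilerpipe-mark1", "yellow"));
    }
}
```

      The resulting string could then be handed to setExtraStyleSheet(String); passing "" instead disables the extra stylesheet, as noted above.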
    • getPreHighlight

      public String getPreHighlight()
      Returns the string that will be inserted before any highlighted HTML block. By default, this corresponds to <span class="x-boilerpipe-mark1">
    • setPreHighlight

      public void setPreHighlight(String preHighlight)
      Sets the string that will be inserted prior to any highlighted HTML block. To disable, set it to the empty string: ""
    • getPostHighlight

      public String getPostHighlight()
      Returns the string that will be inserted after any highlighted HTML block. By default, this corresponds to </span>
    • setPostHighlight

      public void setPostHighlight(String postHighlight)
      Sets the string that will be inserted after any highlighted HTML block. To disable, set it to the empty string: ""
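      The pre-highlight and post-highlight strings act as a pair that encloses each highlighted block. A minimal JDK-only sketch of that wrapping, using the default values stated above, might look as follows; HighlightWrapSketch and wrap are hypothetical names, and the real class applies the strings while streaming the document rather than to a ready-made block.

```java
/** Illustrates how a pre/post pair encloses a block; not boilerpipe code. */
final class HighlightWrapSketch {

    // Default values as documented for getPreHighlight() and getPostHighlight().
    static final String DEFAULT_PRE = "<span class=\"x-boilerpipe-mark1\">";
    static final String DEFAULT_POST = "</span>";

    /** Places pre before and post after the given block of HTML. */
    static String wrap(String pre, String blockHtml, String post) {
        return pre + blockHtml + post;
    }

    public static void main(String[] args) {
        // With the defaults, the block ends up inside the marker span.
        System.out.println(wrap(DEFAULT_PRE, "Main article text", DEFAULT_POST));

        // With both strings empty, the block is emitted unchanged.
        System.out.println(wrap("", "Main article text", ""));
    }
}
```

      When replacing the defaults, the two strings should stay balanced (every tag opened by the pre-highlight string closed by the post-highlight string) so that the output remains well-formed.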
    • xmlEncode

      private static String xmlEncode(String in)
    • getTagWhitelist

      public Map<String, Set<String>> getTagWhitelist()
    • setTagWhitelist

      public void setTagWhitelist(Map<String, Set<String>> tagWhitelist)
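      The whitelist accessors carry no description here, so the following is only a sketch of how a value of type Map<String, Set<String>> could be assembled. It assumes, without confirmation from this page, that keys are tag names and values are the attribute names permitted on that tag; the key casing and the class TagWhitelistSketch are likewise assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Builds a Map<String, Set<String>>; the tag/attribute meaning is assumed. */
final class TagWhitelistSketch {

    /** Returns a small example whitelist (assumed: tag name -> allowed attributes). */
    static Map<String, Set<String>> build() {
        Map<String, Set<String>> whitelist = new HashMap<>();
        whitelist.put("a", new HashSet<>(Arrays.asList("href")));
        whitelist.put("img", new HashSet<>(Arrays.asList("src", "alt")));
        whitelist.put("p", new HashSet<>());  // tag listed, no attributes
        return whitelist;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> whitelist = build();
        System.out.println(whitelist.get("a").contains("href"));
        System.out.println(whitelist.get("p").isEmpty());
    }
}
```

      A map built this way matches the parameter type of setTagWhitelist; whether its contents have the intended effect depends on the assumed semantics above.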